It’s hard to believe it’s December… in some respects the year has felt incredibly slow as we have watched the pandemic run its inexorable course; and in other ways it feels like a blur of Zoom calls and box sets that has gone by in a flash. Let’s hope 2021 proves better… for the time being, how about hunkering down with some ‘home improvement’ via a few curated data science reading materials!
Following is the December edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity …
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
Industrial Strength Data Science December 2020 Newsletter
RSS Data Science Section
As the virus continues to spread, how the holiday period will affect infection rates, and how quickly vaccines can be distributed, are some of the hot topics. As always, numbers, statistics and models are front and centre in all sorts of ways.
- The dramatic rise in positive cases and hospitalisations led to a national lockdown in the UK in November, which is set to be lifted at the beginning of December. Although positive case rates do seem to have slowed somewhat, there are still over 16,000 people hospitalised with Covid, not far off the peak of 20,000 in April, so the risk of increased fatalities is still high, particularly with the loosening of restrictions set to take place over Christmas.
- For those planning a get together over the holidays, this tool can give you an understanding of the risks involved.
- Again, the communication of numbers, statistics and trends is very much front and centre: David Spiegelhalter argues how important trust is in this process.
- Minimising unnecessary fatalities seems increasingly achievable, as the first positive news in a while came to light.
- First, on November 9th, Pfizer and BioNTech announced that the phase 3 trial of their vaccine candidate had been more than 90% effective at preventing Covid infections. The vaccine now has preliminary approval from the MHRA.
- Then, on November 16th, Moderna announced that their vaccine candidate had been 94.5% effective in its phase 3 trial.
- Finally (for the time being with many more vaccines potentially on the way) on November 23rd, the Oxford University/AstraZeneca partnership announced that their vaccine candidate showed an average efficacy of 70%, with a specific dosage regime exhibiting efficacy of 90%.
- The first two vaccines (Pfizer and Moderna) have been developed using a relatively new approach – mRNA – and the speed of the development and their success is a true scientific breakthrough, as Adam Finn discusses.
- For those of us used to the realms of ‘Big Data’, where machine learning data sets of millions of records are common, the numbers involved in the clinical trials are striking. Clearly it is impossible to conduct these trials at the scale of web data, but with the relatively low incidence of Covid, the 30 to 40 thousand participants in the phase 3 trials resulted in fewer than 100 infections combined across the test and control groups. For instance, with Moderna:
"This first interim analysis was based on 95 cases, of which 90 cases of COVID-19 were observed in the placebo group versus 5 cases observed in the mRNA-1273 group, resulting in a point estimate of vaccine efficacy of 94.5% (p <0.0001)"
- While these figures are very unlikely to have been generated by chance, they highlight the intractability of understanding effects in sub-groups (age bands etc.), and also the critical importance of true randomisation in test and control group selection.
- The Oxford/AstraZeneca vaccine has a number of advantages compared to the other two, notably cost (it is likely to be far more affordable and so practical globally) and distribution (it does not need to be kept at the very cold temperatures the other two require). However, questions are now being raised about the trial results (see Wired, and New Scientist), which may mean that further trials are required before regulatory approval is gained.
- Testing is of course still key: MIT researchers have produced a prototype AI model to detect Covid from recordings of coughs … perhaps soon Alexa will be able to diagnose us!
- A different but important Covid-related topic is understanding the economic impact of the pandemic. Raj Chetty, Professor of Public Economics at Harvard University, has been using a variety of publicly available data in novel ways to attempt to understand how Covid has affected different socio-economic groups from a financial perspective. He has produced elegant visualisations highlighting the disparity in outcomes.
Declines in high-income spending led to significant employment losses among low-income individuals working in the most affluent ZIP codes in the country.
- For those looking for a good listen, Chetty was interviewed by Kara Swisher for the NYTimes Sway podcast. The study using anonymised tax records to build a longitudinal understanding of the changes in inequality, and of the impact of small changes in location, was particularly insightful.
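The headline efficacy figures above follow from surprisingly small case counts. As a back-of-envelope sketch (assuming equal-sized, equally followed-up vaccine and placebo arms, and ignoring the person-time adjustment the published analyses use), the point estimate reduces to one minus the ratio of cases:

```python
# Back-of-envelope vaccine efficacy point estimate from raw case counts.
# Assumes equal-sized vaccine and placebo arms with equal follow-up, so
# efficacy simplifies to 1 - (vaccine cases / placebo cases). The published
# figures additionally account for person-time at risk, which this ignores.

def efficacy(vaccine_cases: int, placebo_cases: int) -> float:
    """Crude point estimate of vaccine efficacy, assuming equal arm sizes."""
    return 1.0 - vaccine_cases / placebo_cases

# Moderna interim analysis: 5 cases in the vaccine arm vs 90 in placebo.
ve = efficacy(5, 90)
print(f"estimated efficacy: {ve:.1%}")  # roughly 94.4%, close to the reported 94.5%
```

The closeness of this crude estimate to the reported 94.5% also illustrates the point about sub-groups: with only 95 cases in total, slicing by age band leaves very few events per stratum.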
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid off (networking and introductions, help and advice on development etc.) don’t hesitate to drop us a line.
As previewed in our last newsletter, and our recent release, we are excited to be launching a new initiative: AI Ethics Happy Hours. We are now working on organising the first event based on suggestions we have received.
The joint RSS/British Computing Society/Operations Research Society discussions on data science accreditation are picking up again and we are actively involved in these. We also hope to be posting our own version of a basic data science curriculum soon- will keep you posted.
Anjali was also one of the four co-chairs organising the very successful ‘AI and Data Science in the age of COVID-19‘ conference at the Alan Turing Institute. There were representatives from 35 countries, 58 government departments, 62 institutes and 158 universities engaged in the audience! The talks will be published on YouTube shortly.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and has been active with virtual events. The most recent event was on December 2nd, where Sasha Rush from Hugging Face discussed deep probabilistic structure in NLP. Videos are posted on the meetup YouTube channel – and future events will be posted here.
Elsewhere in Data Science
Lots of non-Covid data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- Following the theme of BMW publishing their AI code of ethics, which we highlighted in the last newsletter, AstraZeneca have done the same.
- Is it possible to conduct facial recognition research ethically? Interesting discussion in Nature.
- What do we really mean by ‘model explainability’? A useful breakdown of different terms and approaches.
- In a slightly different vein (perhaps data science for transparency), Sophie Hill has created a graph-based visualisation (engagingly titled ‘My Little Crony‘) that uses public data to highlight the links between politicians and companies awarded contracts during the pandemic.
- Another piece pointing the finger at recommendation algorithms for exacerbating ‘filter bubbles’ – this time at Facebook. Interestingly, it contrasts the way the Facebook algorithm works compared to that at Reddit and highlights some key differences:
- At Facebook, you are recommended things based on people who have agreed to be friends with you, so you are unlikely ever to see content from different viewpoints.
- Reddit prioritises content based on what users vote to be the most interesting or informative, but Facebook gives priority to what has garnered the most engagement.
- Another example of flawed governmental use of algorithms and machine learning models- this time in housing.
The recurring theme here is an assumption that simply by using an algorithm you can find a completely objective solution to any issue. That all these algorithms have struggled as they come into contact with the real world suggests otherwise.
Real world data science applications …
All sorts of great applications of data science and machine learning, regularly coming to light.
- DeepMind are pushing the boundaries again – this time they have used their AlphaFold system to solve a 50-year-old grand challenge in biology. This excellent Nature article puts some context around the achievement.
We have been stuck on this one problem – how do proteins fold up – for nearly 50 years. To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment.
- The international conservation charity ZSL (Zoological Society of London) is using machine learning to better track endangered species around the world. In this case, they were able to identify gunshots in hours of audio recordings made by concealed devices in the wild.
- A new tool and technique pioneered at Los Alamos is able to automatically detect gas leaks in oilfield pipelines.
Developments in Data Science…
As always, lots of new developments…
- Great discussion with Geoff Hinton on why he thinks Deep Learning is “going to be able to do everything”.
"Neural nets are surprisingly good at dealing with a rather small amount of data, with a huge number of parameters, but people are even better"
- Elegant technique to uncover which training examples are more important to model generalisation using reinforcement learning.
- Interesting tutorial and approach to pre-processing training data for deep learning models by Hadrien Jean.
- Marco Ribeiro discusses CheckList, an open source project for testing NLP frameworks (also a topic of a recent London ML meetup).
- Intriguing recent post on the Google research blog about automatically creating video from a web page.
Some practical tips and tricks to try..
And as always, lots of fantastic tutorials out there…
- Give DeepNote and FastAI a trial run with this interesting tutorial from Anthony Agnone using Kaggle’s San Francisco crime classification data set.
- Isolation forests look like an excellent approach to outlier detection – good explanation from Andrew Young.
- An excellent read for those interested in understanding the complexities of MLOps – ‘A Brief History of TensorFlow Extended’. Covering a similar topic there is also this good summary of challenges in deploying machine learning.
- Useful approach to the initial setup for ML projects, abstracting away dependencies and library set up.
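For anyone wanting to try the isolation forest idea mentioned above before reading the full explanation, here is a minimal sketch using scikit-learn’s `IsolationForest` on synthetic 2-D data (the data set, cluster parameters and `contamination` value are all illustrative assumptions, not from the linked article):

```python
# Minimal sketch of outlier detection with an isolation forest, using
# scikit-learn's IsolationForest on synthetic data for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Inliers clustered tightly around the origin, plus a few scattered outliers.
inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
outliers = rng.uniform(low=-6.0, high=6.0, size=(5, 2))
X = np.vstack([inliers, outliers])

# contamination sets the expected fraction of outliers in the data.
model = IsolationForest(n_estimators=100, contamination=0.025, random_state=42)
labels = model.fit_predict(X)  # +1 for inliers, -1 for flagged outliers

print("points flagged as outliers:", int((labels == -1).sum()))
```

The appeal of the method is that it needs no distance metric or density estimate: points that can be isolated with few random splits are scored as anomalous.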
As always here are a few potential practical projects to while away the socially distanced hours:
- ‘It’s the screams of the damned’ – the eerie AI world of deep fake music
- How to get to the front page of hacker news with GPT3
- Generate Weird Al Yankovic lyrics for any occasion
- Understanding coffee grades and flavours with machine learning …
- For any new parents out there… create your own baby monitor with a Raspberry Pi and TensorFlow.
Updates from Members and Contributors
- Kevin O’Brien mentioned what looks to be an excellent webinar series: X-Europe Webinars, an organisation for joint online events of Vienna Data Science Group, Frankfurt Data Science, Budapest Data Science Meetup, BCN Analytics, Budapest.AI, Barcelona Data Science and Machine Learning Meetup, Budapest Deep Learning Reading Seminar and Warsaw R Users Group.
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
The views expressed are our own and do not necessarily represent those of the RSS