Well.. January seemed to fly by. 2021 has certainly started with a bang (Brexit!, Impeachment!, New President!, Vaccinations!) and the holidays seem an age ago. I hope you are surviving lockdown 3.0 as best as you can… maybe there is room in the long dark evenings for a few curated data science reading materials?
Following is the February edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity …
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners.
Industrial Strength Data Science February 2021 Newsletter
RSS Data Science Section
I keep thinking we might be able to drop the ‘Covid Corner’ section from the newsletter, but sadly the pandemic is still very much alive. The vaccination roll-out in the UK does seem to be going well, however, with over 9m first dose vaccinations made (as of Feb 1st) which is great news.
- The rest of the covid numbers make for grim reading – we still have over 20,000 new positive cases a day, there are over 30,000 people in hospital, and we now have over 100,000 deaths from COVID-19.
- There is no doubting the hospital figures, which are putting an incredible strain on NHS staff, and are dramatically higher than the previous peak of 20,000 in April. Much was made, however, of the daily reported Covid death figure of 1820 on January 20th.
- It’s important to understand how this metric is defined: first of all, it is based on the date the death was reported rather than the date it occurred, but more importantly, there must have been a positive Covid test within 28 days of the death for it to be counted.
- Since we are testing much more widely now than we were in March and April, the reported counts back then were likely under-representing the true extent of mortality.
- This is shown in the ONS figures (based on death certificates) where we see deaths from the first wave still higher than at present. The numbers are still shocking but this difference does at least highlight the great improvements in treatment we have made since the first peak. (As a side note the ONS has done an amazing job of not only gathering representative data, but of summarising and communicating the analysis well).
- On the vaccine front there was more good news with the recent announcement of Novavax’s successful phase 3 trial in the UK. There is much discussion of what vaccine ‘efficacy’ actually means, and David Spiegelhalter does a good job of explaining how it is measured.
- The speed and scale of the vaccine roll-out is impressive purely in terms of the numbers (although the RSS are keen to see more details…). But what does it actually take to manufacture the vaccine and get it to where it is needed? This exhaustive investigation shows just how many different capabilities are needed.
- How far would you go, if you felt the organisation you worked for was not telling the truth? Thankfully, this is perhaps not something many data scientists have to deal with, but the case of Rebekah Jones in Florida is sobering reading.
- The UK Government’s proposed use of INNOVA Lateral Flow tests for Covid has created a fair amount of controversy. The RSS has called for a re-evaluation of these plans, claiming “that negative INNOVA results are too inaccurate to rule out Covid entirely”. However, this view is far from unchallenged, as this excellent piece from Tom Chivers explains.
"one side claims that the tests are more than 90% effective at what they do; the other side says they could be as low as 3%, depending on what you mean by “effective”."
- Finally, this feels like a very exciting development. The recent breakthroughs in natural language processing (NLP) and language models (like GPT-2/3) are at heart based on understanding the likelihood of different sequences of letters and words, codified into word embeddings (vector representations). Applying this approach to other fields (remember chess?) feels very elegant, and the MIT researchers in this case have used the underlying gene sequences (‘letters’) of viruses to train their model. From this they are able to predict likely virus mutations using sequence data alone:
"The model achieved 0.85 AUC in predicting SARS-CoV-2 variants that were highly infectious and capable of evading antibodies."
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.
There is still time to register for our upcoming fireside chat with none other than Andrew Ng on February 10th. We are very excited for what is going to be a fantastic event: don’t miss out, sign up here.
As we previously announced we are looking forward to our first AI Ethics Happy Hour event – details to follow.
The joint RSS/British Computer Society/Operational Research Society discussions on data science accreditation are picking up again and we are actively involved in these. We also hope to be posting our own version of a basic data science curriculum soon- we will keep you posted.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and continues to be very active with virtual events. The next event is on 11th February where Manzil Zaheer, a research scientist at Google, will talk about Big Bird: Transformers for Longer Sequences. Videos are posted on the meetup YouTube channel – and future events will be posted here.
Finally, we are really pleased to include a call for contributions to the RSS 2021 Conference, 6-9 September in Manchester. The organisers are seeking submissions for contributed talks which can be on any topic related to statistics and data science (deadline April 6th).
Elsewhere in Data Science
Lots of non-Covid data science going on, as always!
Big Government and AI
Governments around the world mapping out grand AI plans…
- The UK Government recently released their ‘AI Roadmap’, making the case for investment in AI technology, and outlining the government’s plans for ‘staying at the forefront of the development of AI’. A new Central Digital and Data Office has also been created. We have commented publicly on the lack of practical detail in this released plan, and the lack of insight into what it will take to drive adoption of AI in the UK forward.
- It is interesting to contrast the approach taken in China, where an aggressive investment plan was unveiled in 2017, and the Beijing Academy of Artificial Intelligence was opened a year later.
- Meanwhile, in the US, the government recently announced a central hub for co-ordinating AI research and policy making.
- What is not all that clear is how much of an impact any of these endeavours is having, compared to the research advances generated in the private sector. Alexander Dante Camuto from the Turing Institute highlights the discrepancy in this thoughtful piece.
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- The Montreal AI Ethics Institute have issued a ‘State of AI Ethics Report’ which provides a useful summary of recent developments in this area.
- More examples of questionable applications of AI keep surfacing. First of all, researchers have apparently been able to predict mental illness from Facebook data… And then Michal Kosinski, the researcher behind identifying psychological traits from Facebook likes, has attempted to predict political affiliation using facial recognition technology.
- Perhaps we are all in need of third party advice- firms are appearing focused specifically on advising around AI Ethics.
AI in Healthcare
Increasing utilisation of AI and machine learning in healthcare…
- Exciting announcement from the Korea Institute of Science and Technology who have developed a prostate cancer urine screening test using machine learning.
- Interesting comment published in Nature discussing how recent applications of AI to ageing research are leading to the emergence of the field of longevity medicine.
- We have seen a number of studies in recent times highlighting the power of deep learning techniques in medical imaging and the automatic assessment of resulting scans- this review article in Nature assesses the overall gains over the last decade.
- As the previous article alludes to, going from prototype to real world production in a healthcare setting is far from simple, and this article from Rachel Thomas of fast.ai highlights some of the underlying issues.
- Interestingly, the FDA in the US has released an action plan focused on methods for approving AI and machine learning based applications in health care.
Developments in Data Science…
As always, lots of new developments…
- Fresh on the heels of GPT-3, OpenAI have released an amazing application, called DALL-E (Salvador Dali crossed with Pixar’s WALL-E…), a 12 billion parameter version of GPT-3 trained to generate images from text descriptions. You have to try this… Good summary here from MIT Technology Review.
“In the long run, you’re going to have models which understand both text and images. AI will be able to understand language better because it can see what words and sentences mean.”
- Not to be outdone on the ‘my model has more parameters than your model’ stakes, Google recently announced their Switch Transformer Language Model with 1.6 trillion parameters.
- Great summary, from Jeff Dean, head of Google AI, of Google’s research output in 2020 (over 800 publications) and what lies ahead for 2021. This is long, but well worth a read as it highlights the amazing breadth and depth of the output from the Google researchers.
"I’m particularly enthusiastic about the possibilities of building more general-purpose machine learning models that can handle a variety of modalities and that can automatically learn to accomplish new tasks with very few training examples"
- Whatever your opinion on Facebook as a product, their AI team continue to generate impressive work. They have just open-sourced a new approach to teaching robots to learn from examples. (Of course we can’t mention robots without including everyone’s favourites from Boston Dynamics…)
How does that work?
A new section on understanding different approaches and techniques
- Transformers first surfaced in 2017 and have had a huge impact in NLP since then, allowing the concept of ‘memory’ to be deployed in Deep Learning models (rather than learning from direct sequences). This is a good intuitive guide to how they work from Nikolas Adaloglou. Interestingly (as with the virus sequencing above), Transformers are finding applications outside of NLP, for instance in earthquake detection.
- Excellent guide to how different image recognition techniques work under the covers.
- A good summary of boosting and bagging – how our favourite tree-based models (xgboost, random forest, gbm…) work.
- Good tutorial on real world applications of Markov Decision Processes from Somnath Banerjee
Teams, people and production…
Still one of the biggest obstacles…
- Interesting commentary from Gergely Orosz on the approach to motivating and empowering software engineers in Silicon Valley, very relevant also for Data Scientists and ML engineers.
- What skills do you really need in your data team? Is it all about the models, or do you need more breadth, on both the business side and the engineering side?
- How do you scale a team at different stages of development? Useful advice here from Peter Gao.
- If you want to put in place proper monitoring of your ML systems but aren’t quite ready for a full blown MLOps solution, how about giving this a try, from Jeremy Jordan?
- A pretty bland ‘top x trends in data’ title, but some useful pointers on best practices in building out a modern data stack.
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to while away the socially distanced hours:
- Simulating guitar amps with ML … complete with obligatory shredding video…
- Creating an “unbiased” news source
- Topic Modelling in python
- ‘Ten computer codes that transformed science’ – the title says it all
Updates from Members and Contributors
- Adriano Soares Koshiyama highlights what looks like an excellent upcoming UCL webinar on AI in the Judicial System on Feb 25 at 1pm: “In this webinar we welcome Dr Pamela Ugwudike (University of Southampton, Alan Turing Institute) and Charles Kerrigan (CMS partner and global head of Fintech) to present their perspectives from academia and industry”. Register here.
- Rafael Garcia-Navarro has been doing some impressive work in conversational AI, implementing on top of Metaflow (Netflix’s MLOps framework) – definitely worth a read.
- Kevin O’Brien draws our attention to a great write-up on the Climate Modeling Alliance (CliMA) project and how they use Julia (“Meet the team shaking up climate models”). Also, don’t forget JuliaCon 2021 Wednesday 28th July to Friday 30th July 2021.
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
The views expressed are our own and do not necessarily represent those of the RSS