Another month flies by… still cold, but I’ve definitely seen the sun once or twice… I hope the on-again off-again dreams of a proper summer holiday aren’t proving too painful … perhaps a few curated data science reading materials might ease the burden over the Easter weekend?
Following is the April edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity …
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners.
Industrial Strength Data Science April 2021 Newsletter
RSS Data Science Section
It definitely feels like progress, at least in the UK, on the Covid front, with over 30m people now having received their first vaccine dose. Supply issues notwithstanding, it is clear that the vaccine roll-out is progressing very well.
- It is now over a year since the UK first went into lockdown to attempt to restrict the spread of the virus. It’s interesting to reflect on how much data and statistics have become part of general public discussion: we still have daily updates of a number of different metrics on the news and published in papers. ‘More or Less’ has a nice summary of the UK’s efforts to collate and disseminate the figures and how the centralised healthcare setup contrasts favourably with the US, which required volunteers to generate national figures in the Covid Tracking Project.
- Despite (or perhaps because of) the proliferation of data, the statistics have been made to argue many sides of the same case as highlighted in this research from MIT, stressing the importance of good visualisations.
- It has been quite a month for AstraZeneca …
- First, a number of European countries suspended use of its vaccine due to blood clotting concerns.
- As David Spiegelhalter pointed out, at face value the blood clot incidence seemed no worse than the base rate and nothing to be worried about.
- An in-depth study published in the BMJ, focusing on some of the rarer blood clotting forms that have appeared amongst vaccinated patients, found:
“The Oxford-AstraZeneca covid-19 vaccine is not linked to an increased risk of blood clots and is both safe and effective.” However, the saga continues, as the BBC summarised here.
- Then, the initial announcement of FDA phase 3 trial results for the vaccine in the US drew an unprecedented challenge from the US National Institute of Allergy and Infectious Diseases (NIAID), which was overseeing the trial. However, as described in this summary in Nature, the discrepancies do seem to have now been put to rest.
“Overall it’s a win for the world”
- The reductions we have seen in Covid case and death rates in the UK are dramatic, but likely driven by the lockdown more than vaccinations up to this point. Sadly the reductions are not being seen elsewhere in the world, with talk of a ‘third wave’ in Italy, Spain and France.
- This sobering description of life in Brazil under Covid highlights the diverging trajectories happening in different countries.
- Finally, it is good to see there are still plenty of excellent use cases for Machine Learning and AI in combatting the virus spread and improving outcomes- here, for example, diagnosing which patients are at risk of deterioration based on chest x-rays. However, as this excellent paper published in Nature points out, there is a big difference between test data performance and the actual realisation of improved patient outcomes.
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid off (networking and introductions, advice on development etc.) don’t hesitate to drop us a line.
Our first ‘Ethics Happy Hour’ on March 17th was very well received – see the write-up here. The video recording will shortly be posted on YouTube and we will publish links to it when it is available. Please let us know if you have any comments or would like to suggest topics for future events via email to firstname.lastname@example.org
Fresh on the heels of our incredibly successful event with Andrew Ng, we are excited to announce the next instalment in the series. The RSS Data Science section invites you to a fireside conversation with Anthony Goldbloom – founder and CEO of Kaggle (now a Google company), the world’s largest data science and machine learning community with over 6 million members. Forbes has twice named Anthony one of the 30 under 30 in technology, the MIT Technology Review has named him as one of the 35 Innovators Under 35 and the University of Melbourne has given Anthony an Alumni of Distinction Award. Hear Anthony share his thoughts and experiences from the past 10 years at the forefront of competitive Machine Learning. Watch this space for more details!
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and continues to be very active with virtual events. The next event is on 7th April where Mike Lewis, research scientist at Facebook AI Research in Seattle, will give a talk titled ‘Beyond BERT: Representation Learning for Natural Language at Scale’. Videos are posted on the meetup YouTube channel – and future events will be posted here.
Elsewhere in Data Science
Lots of non-Covid data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- Google attempted to draw a line under the storm over the firing of Timnit Gebru and Margaret Mitchell from their AI Ethics unit (discussed in previous newsletters), by announcing the restructuring of the unit under Marian Croak. This has not necessarily been well received, at least not by Gebru:
"I will have a lot more to say about this later. But announcing a new org by a Black woman as if we’re all interchangeable while harassing, terrorizing and gaslighting my team and doing absolutely ZERO to acknowledge & redress the harm that’s been done is beyond gaslighting."
- Meanwhile, recent research from the Turing Institute highlights the huge gender gap in AI.
- It is hard to redress some of these imbalances if you don’t have scalable and robust measures of diversity. Some useful progress from Google (somewhat ironically) in enforcing diversity in their search results.
- Fraud in academia with fictitious co-authors found on a number of AI papers …
- Perhaps this idea will help flush out these types of issues – PapersWithoutCode, naming and shaming published research where it is not possible to replicate the published results with the information given.
- The increasing spread and influence of automated AI-based image recognition is discussed in this extended article in Vice, which digs into ‘Talon’, a system that links AI-enabled surveillance cameras across the US with disconcerting effects on privacy.
- It’s of course one thing if the AI-based approaches are robust and generalise well- but we still see numerous examples where the system is fooled by simple adversarial approaches, obvious to the human eye.
- But none of this is stopping the spread of facial recognition technology. The NYTimes highlights the commercial opportunities available with the military, while Facebook is considering facial recognition for its upcoming ‘Smart Glasses’.
- Given Facebook’s track record with ethical issues, this seems far from ideal. The Technology Review has an in-depth interview with Joaquin Quiñonero Candela, Director of Responsible AI at Facebook, where he candidly discusses the conflicting pressures of commercial and ethical outcomes.
"Everything the company does and chooses not to do flows from a single motivation: Zuckerberg’s relentless desire for growth."
Developments in Data Science…
As always, lots of new developments…
- Of course, it is still absolutely true that both Google and Facebook continue to produce outstanding AI research on a regular basis. A few snippets from this month:
- Understanding Deep Learning Generalisation from Google research
- Accelerating Neural Networks with Sparse Inference (particularly useful for edge based inference) also from Google
- ‘AI names colours much as humans do’ from Facebook
- It’s always worth keeping track of what Geoff Hinton is up to – GLOM networks look intriguing
- A good summary from Sergei Ivanov of the top 10 AI research papers – ‘the most cited AI works that influence our life today’ – a must read including adversarial examples, semi-supervised learning, AlphaGo and of course ‘Attention is all you need’.
- In a similar vein, Elvis Saravia gives an excellent list of 10 must read blog posts with leading practitioners explaining a wide variety of complex machine learning concepts.
- Rounding off the ‘lists’ we have a great summary of the current best approaches in generative models from Aran Komatsuzaki
- Useful extended discussion from Derek Lowe in sciencemag of how AI can practically help in the drug discovery process
- ‘From a worm to a fly’ – “The most comprehensive wiring map to date of the fruit fly brain has transformed the field of neuroscience”
- Live video transcription in the browser!
- Audio generation from samples (‘Deep Fake Audio’) is now increasingly accessible as discussed in Wired. Couldn’t resist a few examples…
The Practical side … getting stuff to work in production
- A good summary of the main causes of failure for ML systems, with some strong tips including ‘Learn in the cloud’ and ‘Invest in observability and monitoring’.
- Andrew Ng recently discussed these challenges in a talk on ‘MLOps’ including some very wise words:
"When a system isn’t performing well, many teams instinctually try to improve the Code. But for many practical applications, it’s more effective instead to focus on improving the Data." "It’s a common joke that 80 percent of machine learning is actually data cleaning, as though that were a lesser task. My view is that if 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team."
- Interesting that the researchers at Google back this up
- Of course there is more to Data Science than straight machine learning, and it is easy to forget the impressive array of tools now available to streamline the time and effort required to interact with data – this is a good summary of the components of a ‘modern data stack’
- An even less glamorous part of the Data Science ecosystem is data discovery – how do you find and understand what data sets are actually available? Some interesting commentary on why this is hard and some approaches to help, both culturally and with a new breed of tools.
How does that work?
A new section on understanding different approaches and techniques
- Outlier detection is a useful capability – here’s how to do it with Mahalanobis distance from Sergen Cansiz
- You may have heard of Word2Vec but what is Node2Vec and why is it useful?
- We know ‘attention’ is really useful, but how does it actually work?
- A simple but useful breakdown of different approaches to identifying proper names in text.
- Finally, some useful pointers on A/B testing.
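For anyone curious about the first item in this list, here is a minimal sketch of Mahalanobis-distance outlier detection. It is not taken from Sergen Cansiz's article: the function names, the synthetic data and the cut-off of 3.0 are all assumptions made purely for illustration. The idea is to measure how far each point sits from the mean in units that account for the spread and correlation of the features, which plain Euclidean distance ignores.

```python
import numpy as np


def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the sample mean of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)      # feature-by-feature covariance
    cov_inv = np.linalg.pinv(cov)      # pseudo-inverse tolerates a singular matrix
    centred = X - mean
    # d_i^2 = (x_i - mean)^T cov_inv (x_i - mean), evaluated for every row at once
    d_squared = np.einsum("ij,jk,ik->i", centred, cov_inv, centred)
    return np.sqrt(np.maximum(d_squared, 0.0))


def flag_outliers(X, threshold=3.0):
    """Boolean mask of rows whose distance exceeds threshold (an assumed cut-off)."""
    return mahalanobis_distances(X) > threshold


# Hypothetical data: two positively correlated features, plus one appended
# point that runs against the correlation.
rng = np.random.default_rng(0)
base = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
data = np.vstack([base, [[3.0, -3.0]]])

distances = mahalanobis_distances(data)
flags = flag_outliers(data)
print("appended point flagged:", bool(flags[-1]))
print("points flagged in total:", int(flags.sum()))
```

The appended point is unusual mainly because it goes against the correlation between the two features, which is exactly the kind of case this distance is designed to pick up. In practice the threshold is often chosen from a chi-squared quantile rather than fixed by hand, and the mean and covariance are themselves sensitive to outliers, so robust estimates may be preferable on messy data.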
Thinking about intelligence and bigger picture stuff …
Stepping back from the code for a bit…
- Thought provoking article proposing that “Computers will never write good novels” – definitely worth thinking through how much of this you agree with
"The best that computers can do is spit out word soups. They leave our neurons unmoved."
- Crumple Theory anyone? (you know you’re intrigued!)
- The case for ‘technical competence’ in leaders from the Harvard Business Review
"Employees are far happier when they are led by people with deep expertise in the core activity of the business."
- More informational than thought provoking, but there were various ‘State of AI’ reviews published in the last month, including the 2021 AI Index report from Stanford, and the US Government’s take.
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to while away the socially distanced hours:
- Wine and maths – what’s not to like?
- Excellent end-to-end Python tutorial for classifying satellite imagery with Deep Learning…
- I hadn’t heard of GeoGuessr – looks like fun – especially if you can build a tool to help!
- How about playing around with Truchet tiles?
- We all love a good Raspberry Pi project (especially now there is more ML built in…) – how about intruder detection?
Updates from Members and Contributors
- Marco Gorelli is running an excellent workshop on 10th April about contributing to Pandas. The workshop is being run in collaboration with PyLadies and is specifically targeting people from underrepresented genders in tech. Sign up for the morning session or the afternoon session.
- Emre Kasim is running the brilliant Algo Conference which this year is taking place online on April 29th with a number of very relevant streams, including ‘Foundational AI’, ‘AI and Innovation’ and ‘Implications of AI and other Disruptive Technologies’- well worth signing up for here.
- Alex Spanos highlights the upcoming Data Science Festival which in April is focused on Fintech- check out his talk on Data Science/Machine Learning and Open Banking APIs on April 15th.
- Vijay Kumar Mishra, Research Scientist at Public Health for India, is running a 5-day online international workshop on “Designing and Conducting Clinical Trials” from the 3rd to the 7th of May. The workshop will be jointly conducted by Public Health Foundation of India, Sitaram Bhartia Institute of Science and Research, Paropakar Maternity and Women Hospital and University College London and will be aimed at providing a theoretical understanding of designing and conducting clinical trials. Contact Vijay (email@example.com) for more details.
- Harin Sellahewa draws our attention to the 35 of 70 master’s students entering their final assessment for the University of Buckingham MSc in Applied Data Science- best of luck to everyone!
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
The views expressed are our own and do not necessarily represent those of the RSS