Another month flies by… At least it’s getting a bit lighter in the mornings and I’ve even seen the sun once or twice. I hope you are staying as sane as possible despite home-schooling, home-working, home-everything else (delete as appropriate…) … perhaps a few curated data science reading materials might lighten the mood?
Following is the March edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity …
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners.
Industrial Strength Data Science March 2021 Newsletter
RSS Data Science Section
The vaccination roll-out in the UK continues to progress well with now over 20m first doses delivered, and we even have a road-map out of lockdown… perhaps some light at the end of the tunnel.
- In addition to the impressive and increasing vaccination numbers, we are now beginning to see ‘real-world’ efficacy studies. Despite news outlets in Europe questioning the AstraZeneca vaccine’s efficacy for older people (without evidence), a recent study from Scotland showed very encouraging results, although based on relatively small sample sizes. After 28 days, a single dose of the AstraZeneca vaccine was shown to reduce the risk of Covid-19 hospital admissions by roughly 94 percent, with a comparable figure for the Pfizer vaccine of roughly 85 percent.
"Both of these are working spectacularly well"
- In further positive vaccine news, a recent FDA review showed that the new Johnson and Johnson ‘one-shot’ vaccine appeared safe and effective in trials; and we also saw the first shipment of the AstraZeneca vaccine as part of the COVAX program, delivered to Ghana.
- As we all know, the pandemic has thrown up a wide variety of new terms, metrics and statistics that can be easily misinterpreted or misunderstood – the RSS has published an excellent FAQ on Covid-19 measures and statistics which is well worth circulating.
- The UK government has charted a cautious route out of lockdown. In sobering reading, this cautiousness was apparently linked to research commissioned from the teams at Imperial and Warwick University by the modelling group (SPI-M) in SAGE.
- These models have proved surprisingly accurate, at least in terms of predicting the surge in cases over the winter.
- This time both teams were asked to independently model the effect of different lockdown exit strategies and both reached similar conclusions- that lifting all restrictions by April 26th would likely drive another wave comparable in size to January 2021, resulting in a further 62,000 to 107,000 deaths in England.
- The NHS Test and Trace App did not have the most auspicious beginnings, but recent research from the Alan Turing Institute indicates that it has indeed had a positive effect in reducing the impact of Covid.
- The virus does seem to be in retreat in a number of countries around the world. The recent decrease in positive cases in the US is puzzling researchers somewhat (also covered by More or Less): decreased testing? improved behaviour? vaccination roll-out? seasonality? herd immunity? … the upshot seems to be a little bit of everything, and we don’t really know.
- Although the retreat is great news, the results in the US and elsewhere have been devastating and disproportionately felt. This recent study published in PNAS shows how life expectancy in the US has fallen by 1.13 years due to Covid, with “estimated reductions for the Black and Latino populations 3 to 4 times that for Whites”.
- Finally a thoughtful piece from the Ada Lovelace Institute about vaccination passports and what role they could or should play in society.
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid off (networking and introductions, advice on development etc.) don’t hesitate to drop us a line.
Our Fireside Chat with Andrew Ng on February 10th was a roaring success. We had over 500 people attend what proved to be an entertaining and thought-provoking discussion on technical leadership in AI, artfully hosted by our chairman Martin Goodson, and introduced by the RSS President Sylvia Richardson. For those who missed it, here are the 5-minute edited highlights (and below if you are viewing on the blog) – check out the full video here
We are excited to host our first ‘Ethics Happy Hour’, which will take place on March 17th from 5 to 6pm. As previously announced, events in this new series provide an opportunity to discuss and meet other people interested in questions of AI ethics and data science ethics more broadly. The first event will take place virtually and focus on COVID-19. We are delighted that the following three experts have agreed to share their thoughts on the ethics of data science in addressing the public health crisis:
- Dr Zachary Lipton (Carnegie Mellon University)
- Dr Anjali Mazumder (RSS Data Science Section Committee / The Alan Turing Institute)
- Dr Nicola Stingelin (RSS Data Ethics and Governance Section Committee)
The event will begin with each expert offering an initial take on the topic, drawing on their different areas of experience. This will be followed by an open discussion with the opportunity for all participants to share questions, comments, and contributions. To sign up for the event, please register here.
The joint RSS/British Computer Society/Operational Research Society discussions on data science accreditation are picking up again and we are actively involved in these. We also hope to be posting our own version of a basic data science curriculum soon- we will keep you posted.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and continues to be very active with virtual events. The next event is on 8th March, when Mingxing Tan, a research scientist at Google Brain, will talk about AutoML for Efficient Vision Learning. Videos are posted on the meetup YouTube channel – and future events will be posted here.
Elsewhere in Data Science
Lots of non-Covid data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- The storm around ethical AI at Google continues, with the recent announcement of the firing of Margaret Mitchell, the founder and co-head of its artificial intelligence ethics unit. This follows on from the firing of Timnit Gebru in December as we covered previously. All this clearly highlights the tensions inherent in companies both attempting to profit from AI while at the same time investigating what limits should be placed on the technology.
- An enlightening history from MIT Technology Review of facial recognition data sets: where they originated, how they have changed over time, and how our quest for scale has come at the expense of quality, bias and privacy.
- And of course with these huge data sets now easily accessible to train machine learning models on, facial recognition capabilities can be embedded far and wide
- This really interesting study shows how AI can be used to counter inherent biases if the right data is used to train the models- in this case focusing on what patients actually say rather than what diagnosis is recorded.
- Azeem Azhar‘s podcast this week is an excellent one, a conversation with Professor Sinan Aral, director of MIT’s Initiative on the Digital Economy.
- They discuss how false information is more likely to spread further and faster than the truth (proven out in a research paper from 2018).
"We found that falsehood diffuses significantly farther, faster, deeper, and more broadly than the truth, in all categories of information, and in many cases by an order of magnitude... False news is more novel, and people are more likely to share novel information"
- In addition they talk about what could be done to limit the power and reach of incumbent social networks
- Portability and interoperability – as happened with mobile phone numbers, and instant messenger apps – is much more likely to succeed than splitting up the leading players, since the network effects naturally lead to another dominant player taking over.
- Clearly, flagging or removing false information and inflammatory posts would be beneficial all around, but automating and scaling this process is very difficult, as highlighted by this article about how ads for clothing for people with disabilities have been repeatedly banned.
Developments in Data Science…
As always, lots of new developments…
- While OpenAI’s GPT-3 has garnered a lot of press, there have been concerns about the proprietary nature of the underlying model. EleutherAI is attempting to create a fully open-sourced replication of a GPT-3 sized model, called GPT-Neo.
- With the inexorable rise in complexity of language models, and the corresponding rise in cost of training these models, there are increasing calls to identify more efficient ways of learning. This is why the recent research on switch transformers is generating interest. These use ‘mixture of experts’ models, in which each input is routed to only a small subset of the network; this is somewhat comparable to how random forests combine randomly selected combinations of features.
- In a similar vein, DeepMind has released their research into ‘Normalizer-Free ResNets (NFNets)’ which allow image recognition models to be trained without batch-normalisation, significantly reducing training times (and so costs).
- Model generalisation continues to be a challenge where models with excellent test results perform poorly over time in real world situations.
- One approach to improving this is increased transparency, which is what the ML Reproducibility Challenge is all about
- Separately, research into why this occurs helps us all improve, as demonstrated in this paper from researchers at Cornell, linking the problem to underspecification.
- As covered in a recent London Machine Learning meetup, research from Marco Ribeiro proposes a new model-agnostic testing approach for NLP models called CheckList, inspired by behavioural testing in software engineering.
- Another area of intense research is how to create models that don’t require vast quantities of data to learn. Active learning looks like a promising avenue in this regard.
“By having the human iteratively teach the model, it's possible to make a better model, in less time, with much less labelled data.”
- Why should we bother building machine learning models for problems we know how to solve? Interesting discussion on Kolmogorov complexity
- Detailed post from Google Research on how they have implemented “Cinematic” photos
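For anyone curious what the ‘mixture of experts’ routing behind switch transformers (mentioned above) looks like in practice, here is a minimal sketch in plain Python. It is not the authors’ implementation: the toy experts, the router weights and the function names (`make_expert`, `route_top1`) are all made up for illustration. The one idea it tries to capture is that a small router scores every expert, only the single best-scoring expert is actually run, and its output is scaled by the router probability.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical toy expert: multiplies every input value by `scale`."""
    def expert(x):
        return [scale * v for v in x]
    return expert


def route_top1(x, router_weights, experts):
    """Send input x to the single expert with the highest router probability."""
    # One score per expert: dot product of x with that expert's router row.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(experts)), key=lambda i: probs[i])
    # Only the chosen expert is evaluated; its output is scaled by the gate value.
    output = [probs[best] * v for v in experts[best](x)]
    return best, probs[best], output


# Illustrative values only (assumed, not taken from any paper).
experts = [make_expert(1.0), make_expert(2.0), make_expert(3.0)]
router_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
chosen, gate, y = route_top1([1.0, 0.0], router_weights, experts)
print(chosen, round(gate, 3), y)
```

Because only one expert runs per input, the total parameter count can grow with the number of experts while the compute per input stays roughly constant, which is where the efficiency argument comes from.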
The Art of Visualisation…
- Stamen (creator of my favourite leaflet map tile- water colour maps, what is not to like…), talks through the process of creating Facebook’s new maps.
- Some useful tips and tricks for using the seaborn Python plotting library, from Michael Waskom
How does that work?
A new section on understanding different approaches and techniques
- Continuing the Transformers theme we mentioned last time, another useful tutorial on how they actually work from Elvis Saravia
- What is reinforcement learning and how does it work? – useful resources from Jason Gauci
- A guide to cohort analysis from Sylvain Giuliani
- How to build data quality monitoring with some simple SQL, from Barr Moses
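To give a flavour of the ‘simple SQL’ style of data quality monitoring mentioned in the last item above, here is a small sketch using Python’s built-in `sqlite3` module. It is not taken from the linked post: the `orders` table, its columns and the particular checks (volume, null rate, freshness, duplicates) are assumptions chosen purely for illustration.

```python
import sqlite3


def run_checks(conn, table="orders"):
    """Run a few basic data quality queries and return the results by name.

    The table name is interpolated directly into the SQL, so it must be a
    trusted value; `orders` and its columns are hypothetical.
    """
    checks = {
        # Volume: did we load roughly as many rows as expected?
        "row_count": f"SELECT COUNT(*) FROM {table}",
        # Completeness: share of rows where `amount` is missing.
        "null_rate_amount": (
            f"SELECT AVG(CASE WHEN amount IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}"
        ),
        # Freshness: the most recent load date present in the table.
        "latest_load_date": f"SELECT MAX(loaded_at) FROM {table}",
        # Uniqueness: number of surplus rows that share an id.
        "duplicate_ids": f"SELECT COUNT(*) - COUNT(DISTINCT id) FROM {table}",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}


# Tiny in-memory example with one missing amount and one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2021-03-01"),
        (2, None, "2021-03-01"),
        (3, 7.5, "2021-03-02"),
        (3, 7.5, "2021-03-02"),
    ],
)
results = run_checks(conn)
print(results)
```

In a real setting the same queries would typically run on a schedule against the warehouse, with each result compared to a threshold or to its own recent history before raising an alert.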
Thinking about intelligence…
How does the brain really work, how should we think about AI morality…
- Back-propagation, although first proposed in the mid-1980s, is still how neural network models are trained. The ‘Neural Network’ as a concept is loosely based on the brain’s structure of neurons and synapses – but is back-propagation really how the human brain learns? Most researchers don’t think so, but if not, then how does it learn? There are lots of alternative theories – but don’t write off back-prop just yet!
- If you feel like thinking big-picture, this long but thought-provoking read is well worth the time – “raising good robots”
"Imagine it’s 2026. An autonomous public robocar is driving your two children to school, unsupervised by a human. Suddenly, three unfamiliar kids appear on the street ahead – and the pavement is too slick to stop in time. The only way to avoid killing the three kids on the street is to swerve into a flooded ditch, where your two children will almost certainly drown."
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to while away the socially distanced hours:
- Agent based models in action… starling murmuration.
- “Localize your cat at home with BLE beacon, ESP32s, and Machine Learning” – nothing more to say…
- Cool visualisation on ‘desirable streets‘ from MIT
- Great compilation of data science related podcasts … dspods
Updates from Members and Contributors
- The CMA has recently published a paper ‘Algorithms: how they can reduce competition and harm consumers‘ laying out the landscape of potential harms to consumers and competition from the misuse of algorithms (see also a summary here from Helena Quinn). The CMA are keen to make sure they have correctly represented the potential harms and would welcome contributions via their call for information.
- The RSS runs the Statisticians for Society initiative which links UK charities and other third sector organisations with RSS fellows who are willing to help collect, analyse and present data at no cost: (http://www.rss.org.uk/statisticians-for-society) – anyone interested in volunteering, sign up here
- Marco Gorelli has released an elegant tool (and pre-commit hook) to automatically convert relative import paths to absolute ones
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
The views expressed are our own and do not necessarily represent those of the RSS