
The UK AI Strategy: are we listening to the experts?

The emerging UK National AI Strategy is out of step with the needs of the nation’s technical community and, as it stands, is unlikely to result in a well-functioning AI industry. The Data Science & Artificial Intelligence Section (Royal Statistical Society) asks whether the government has actively sought the views of expert practitioners.

The UK government has released plans for a new AI Strategy, with the stated goal of making ‘the UK a global centre for the development, commercialisation and adoption of responsible AI’. We asked our members—UK-based technical practitioners of artificial intelligence—their opinion of the plans. Our results point to a fundamental disconnect between the roadmap for the Strategy and the views of those actually building AI-based products and services in the UK.

The basis of the AI Strategy is the AI Council’s ‘AI Roadmap’, which was developed with input mainly from public sector leaders and university researchers. The AI Council does not appear to have engaged with engineers and scientists from the commercial technology sector.

Tech companies commercialise AI, not universities. Yet among the 52 individuals who contributed to the Roadmap, only four software companies are represented. There are 19 CBEs and OBEs but not one startup CTO.

Hoping to fill this gap, we surveyed our community of practising data scientists and AI specialists, asking for their thoughts on the Roadmap. We received 284 detailed responses; clearly the technical community cares deeply about this subject.

Only by direct engagement with technical specialists can we hope to uncover the key ingredients of a successful AI industry. For example, while the AI Roadmap focusses on moonshots and flagship institutes, the community seems to care more about practical issues such as open-source software, startup funding and knowledge-sharing.

The economic opportunity of AI represents at least 5% of GDP (compare to fisheries, at about 0.05% of GDP). If the National AI Strategy does not correctly identify the challenges that lie ahead, this opportunity will be squandered.

We will publish our findings in four parts, covering the different sections of the AI Roadmap. This first part covers AI research and development.

Comparison with the AI Roadmap for R&D

Three areas are central to the Roadmap’s plans for R&D: the Alan Turing Institute, moonshots (such as ‘digital twin’ technology) and ‘AI to transform research, development and innovation’. These topics were scarcely mentioned by our respondents, despite being listed as potential subjects for discussion.

For example, the Alan Turing Institute was mentioned only 4 times by respondents. Two of those mentions were negative.

There were 7 responses on the topic of moonshots, 3 of them negative, and ‘digital twins’ were not mentioned at all. One respondent wrote:

“moonshotting” […] without a solid foundation and shared values would destroy the field in perpetuity.

The central concerns of the Roadmap may sound plausible on paper but they don’t resonate strongly with the technical community.

Better collaboration between academia and industry

By far the most frequently mentioned topic was better collaboration between academia and industry, which was addressed by 52 respondents. To summarise: knowledge transfer between academia and companies is not currently working. The UK’s strength in academic research will be wasted if industry and academia cannot easily learn from each other.

The Roadmap barely addresses this topic, other than one mention of the pre-existing Knowledge Transfer Partnerships (KTP) scheme. Yet our practitioner community think that clearing this obstacle should be at the core of the strategy. A typical request was:

Better sharing of knowledge and experience between universities and industry, specifically industry use case examples.

There were many voices suggesting the knowledge transfer should also operate in the opposite direction:

The knowledge transfer deficit is in the opposite direction: industry making investment and research headway while universities cannot compete.

Encourage adoption of good software engineering practices amongst researchers.

Another key concern is the brain drain from academia to industry:

UK universities were leading in the AI space until the industry (Google, Msft, Amz, FB) started poaching all the top professors […]

There needs to be strong support for this area in academia to stop ‘brain drain’ to big tech companies and allow UK to make research advances that will allow competitive advantages for startups.

Open source

40 respondents recommend that the Strategy focus on open-source. This makes it the second most mentioned issue in the entire survey. Strikingly, the AI Roadmap doesn’t contain a single mention of the term ‘open-source’.

Many respondents agreed that funding positions for contributors to key open-source projects would bring many benefits. This is well-founded: when Columbia University hired core developers on the scikit-learn open-source project, they facilitated knowledge transfer and training on cutting-edge techniques.

Open source should be embraced by the Government, it sends a positive message about intent and helps to draw in the right talent to the field (most people learning practical machine learning will start their experience in open source).

Support for startups

40 responses agreed on a need to support startups through direct funding, incubators, tax breaks and other approaches such as access to compute infrastructure.

More funding and assistance for AI startups, and assisting their collaboration with UK-based research and universities.

Funding for AI and Deep Tech startups.

Funding/grants for startups for the use of cloud computing infrastructure.

Ethics

26 responses want to see consideration of ethics at the heart of future AI innovation. For example:

Finally, I think governance of how AI and DS are used by the private sector is very important, and something that, in my opinion, should be a priority for any government AI roadmap.

If you fail to identify and analyze the obstacles, you don’t have a strategy

We draw attention to the work of UCLA strategy researcher Richard Rumelt. He makes a specific warning: ‘If you fail to identify and analyze the obstacles, you don’t have a strategy’. Has the AI Roadmap made this mistake? Its 37 pages do not apparently contain a clear analysis of the obstacles in the way of a strong AI industry.

Identification and analysis of these obstacles requires close and sustained collaboration with AI practitioners; our survey is just a starting point. We urge the Office for AI to engage directly with the technical community before moving forward to finalising their AI Strategy.

Sign up to the Data Science & AI Section if you are interested in this topic


Data Science and AI Section (Royal Statistical Society) Committee

Chair: Dr Martin Goodson (CEO & Chief Scientist, Evolution AI)

Vice Chair: Dr Jim Weatherall (VP, Data Science & AI, AstraZeneca)

Trevor Duguid Farrant (Senior Principal Statistician, Mondelēz International)

Rich Pugh (Chief Data Scientist, Mango Solutions (an Ascent Company))

Dr Janet Bastiman (Head of Analytics, Napier AI. AI Venture Partner)

Dr Adam Davison (Head of Insight & Data Science, The Economist)

Dr Anjali Mazumder (AI and Justice & Human Rights Theme Lead, Alan Turing Institute)

Giles Pavey (Global Director – Data Science, Unilever)

Piers Stobbs (Chief Data Officer, Cazoo)

Magda Woods (Data Director, New Statesman Media Group)

Dr Danielle Belgrave (Senior Staff Research Scientist, DeepMind)

Appendix: Analysis

Our survey was designed to bring out the voice of the technical community. We asked leading questions, prompting the respondents with topics from the AI Roadmap as well as other topics we thought might be of interest to the community. We collected free-text responses.

Our analysis is subjective and we will make our full dataset available for independent analysis. We do not make any quantitative claims, because our sample is biased (for example, geographically).

We included a single quantitative question: ‘To what extent do you agree that these are the top priorities for the UK in AI Research, Development & Innovation? (5 means ‘Strongly agree’)’. Responses could range from 0 to 5. The average response was 3.4 (neither agree nor disagree).

We received 284 responses in total. We selected qualified respondents by requiring:

  • They declared they were either “a practising data scientist” or “used to be a practising data scientist”
  • They declared they were “an individual data science contributor”, “a line manager of data scientists” or “a senior leader involved in data science”

After applying these requirements, 245 qualified responses remained. 118 (47%) of respondents identified as either ‘Managers’ or ‘Senior leaders’.

In order to interpret our results we made a crude manual classification of every comment and focused on those topics which at least 20 respondents mentioned.
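
The tallying step described above can be sketched in a few lines. This is only an illustration, not the code we used: the function name `frequent_topics`, the topic labels and the toy data are all hypothetical, and the sketch assumes each respondent's comment has already been hand-labelled with a list of topics.

```python
from collections import Counter


def frequent_topics(labelled_comments, min_respondents=20):
    """Count respondents per topic and keep topics that meet the threshold.

    labelled_comments: one collection of hand-assigned topic labels per
    respondent (hypothetical input format). A topic is counted at most
    once per respondent, however often it appears in their comment.
    """
    counts = Counter()
    for topics in labelled_comments:
        counts.update(set(topics))  # set() so a repeated label counts once
    return {
        topic: n
        for topic, n in counts.most_common()
        if n >= min_respondents
    }


# Toy, made-up labels purely for illustration (not survey data).
toy_labels = [
    ["collaboration", "open source"],
    ["collaboration"],
    ["startups", "collaboration", "collaboration"],
    ["ethics"],
]

# A threshold of 2 is used here only because the toy sample is tiny.
print(frequent_topics(toy_labels, min_respondents=2))
```

Counting respondents rather than raw mentions keeps one very long comment from inflating a topic, which matches the 'at least 20 respondents' cut-off described above.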

The declared demographic of our qualified responses was primarily male (77%) and white (75%). We note that only 60% answered questions on demographics.

The Data Science and AI section is grateful for the support of our partner communities PyLadies London, PyData London, PyDataUK, London Machine Learning and the Apache Spark+AI Meetup, representing a combined (overlapping) membership of 27K data scientists and technologists.

October Newsletter

Hi everyone-

Well, September certainly seemed to disappear pretty rapidly (along with the sunshine sadly). And dramatic events keep accumulating, from the sad death of the Queen, together with epic coverage of ‘the queue’, to dramatic counter offensives in Ukraine, to unprecedented IMF criticism of the UK government’s tax-cutting plans. Perhaps time for a breather, with a wrap-up of data science developments in the last month.

Following is the October edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at https://datasciencesection.org/)

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science October 2022 Newsletter

RSS Data Science Section

Committee Activities

The RSS 2022 Conference, held on 12-15 September in Aberdeen, was a great success. The Data Science and AI Section’s session ‘The secret sauce of open source’ was undoubtedly a highlight (we are clearly biased!) but all in all there were lots of relevant, enlightening and entertaining talks for a practising data scientist. See David Hoyle’s commentary here (also highlighted in the Members section below).

Following hot on the heels of our July meetup, ‘From Paper to Pitch’, we were very pleased with our latest event, “IP Freely, making algorithms pay – Intellectual property in Data Science and AI”, which was held on Wednesday 21 September 2022. A lively and engaging discussion was held including leading figures such as Dr David Barber (Director of the UCL Centre for Artificial Intelligence) and Professor Noam Shemtov (Intellectual Property and Technology Law at Queen Mary University of London).

The AI Standards Hub, an initiative that we reported on earlier this year, led by committee member Florian Ostmann, will see its official launch on 12 October. Part of the National AI Strategy, the Hub’s new online platform and activities will be dedicated to knowledge sharing, community building, strategic research, and international engagement around standardisation for AI technologies. The launch event will be livestreamed online and feature presentations and interactive discussions with senior government representatives, the Hub’s partner organisations, and key stakeholders. To join the livestream, please register before 10 October using this link (https://tinyurl.com/AIStandardsHub). 

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on October 12th when Aditya Ramesh, Researcher at OpenAI, will discuss (the very topical) “Manipulating Images with DALL-E 2”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…

Bias, ethics and diversity continue to be hot topics in data science…

"Interactive deepfakes have the capability to impersonate people with realistic interactive behaviors, taking advantage of advances in multimodal interaction. Compositional deepfakes leverage synthetic content in larger disinformation plans that integrate sets of deepfakes over time with observed, expected, and engineered world events to create persuasive synthetic histories"
"We argue that the upcoming regulation might be particularly important in offering the first and most influential operationalisation of what it means to develop and deploy trustworthy or human-centred AI. If the EU regime is likely to see significant diffusion, ensuring it is well-designed becomes a matter of global importance."
"Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms. So, the basic approach is:
1. make sure your pipeline is solid end to end
2. start with a reasonable objective
3. add common-sense features in a simple way
4. make sure that your pipeline stays solid.
This approach will make lots of money and/or make lots of people happy for a long period of time. Diverge from this approach only when there are no more simple tricks to get you any farther. Adding complexity slows future releases."
  • Finally, we can also try to build ‘fairness’ into the underlying algorithms and machine learning approaches. For instance, this looks to be an excellent idea – FairGBM
"FairGBM is an easy-to-use and lightweight fairness-aware ML algorithm with state-of-the-art performance on tabular datasets.

FairGBM builds upon the popular LightGBM algorithm and adds customizable constraints for group-wise fairness (e.g., equal opportunity, predictive equality) and other global goals (e.g., specific Recall or FPR prediction targets)."

Developments in Data Science Research…

As always, lots of new developments on the research front and plenty of arXiv papers to read…

"Even the largest neural networks make errors, and once-correct predictions can become invalid as the world changes. Model editors make local updates to the behavior of base (pre-trained) models to inject updated knowledge or correct undesirable behaviors"
"We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models"
  • DeepMind have released Menagerie – “a collection of high-quality models for the MuJoCo physics engine”: looks very useful for anyone working with physics simulators
  • Finally, another great stride for the open source community, this time from LAION – a large scale open source version of CLIP (a key component of image generation models that computes representations of images and texts to measure similarity)
We replicated the results from openai CLIP in models of different sizes, then trained bigger models. The full evaluation suite on 39 datasets (vtab+) are available in this results notebook and show consistent improvements over all datasets.

Stable-Dal-Gen oh my…

Lots of discussion about the new breed of text-to-image models (type in a text prompt/description and an -often amazing- image is generated) with three main models available right now: DALLE2 from OpenAI, Imagen from Google and the open source Stable-Diffusion from stability.ai.

"minGPT tries to be small, clean, interpretable and educational, as most of the currently available GPT model implementations can a bit sprawling. GPT is not a complicated model and this implementation is appropriately about 300 lines of code (see mingpt/model.py). All that's going on is that a sequence of indices feeds into a Transformer, and a probability distribution over the next index in the sequence comes out. The majority of the complexity is just being clever with batching (both across examples and over sequence length) for efficiency."

Real world applications of Data Science

Lots of practical examples making a difference in the real world this month!

"By using our latest AI model, Multitask Unified Model (MUM), our systems can now understand the notion of consensus, which is when multiple high-quality sources on the web all agree on the same fact. Our systems can check snippet callouts (the word or words called out above the featured snippet in a larger font) against other high-quality sources on the web, to see if there’s a general consensus for that callout, even if sources use different words or concepts to describe the same thing. We've found that this consensus-based technique has meaningfully improved the quality and helpfulness of featured snippet callouts."
"One of the motivations of this work was our desire to study systems that learn models of datasets that is represented in a way that humans can understand. Instead of learning weights, can the model learn expressions or rules? And we wanted to see if we could build this system so it would learn on a whole battery of interrelated datasets, to make the system learn a little bit about how to better model each one"

How does that work?

Tutorials and deep dives on different approaches and techniques

"Deep learning is sometimes referred to as “representation learning” because its strength is the ability to learn the feature extraction pipeline. Most tabular datasets already represent (typically manually) extracted features, so there shouldn’t be a significant advantage using deep learning on these."

Practical tips

How to drive analytics and ML into production

Bigger picture ideas

Longer thought provoking reads – lean back and pour a drink! …

“We’re not trying to re-create the brain,” said David Ha, a computer scientist at Google Brain who also works on transformer models. “But can we create a mechanism that can do what the brain does?”
"A common finding is that with the right representation, the problem becomes much easier. However, how to train the neural network to learn useful representations is still poorly understood. Here, causality can help. In causal representation learning, the problem of representation learning is framed as finding the causal variables, as well as the causal relations between them."
"As we’ve seen, the nature of algorithms requires new types of tradeoff, both at the micro-decision level, and also at the algorithm level. A critical role for leaders is to navigate these tradeoffs, both when the algorithm is designed, but also on an ongoing basis. Improving algorithms is increasingly a matter of changing rules or parameters in software, more like tuning the knobs on a graphic equalizer than rearchitecting a physical plant or deploying a new IT system"
"Lucas concludes his essay by stating that the characteristic attribute of human minds is the ability to step outside the system. Minds, he argues, are not constrained to operate within a single formal system, but rather they can switch between systems, reason about a system, reason about the fact that they reason about a system, etc. Machines, on the other hand, are constrained to operate within a single formal system that they could not escape. Thus, he argues, it is this ability that makes human minds inherently different from machines."

Fun Practical Projects and Learning Opportunities

A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

Updates from Members and Contributors

  • David Hoyle has published an excellent review of the recent RSS conference, highlighting the increasing relevance to practicing Data Scientists- well worth a read
  • The ONS are keen to highlight the last of this year’s ONS – UNECE Machine Learning Groups Coffee and Coding session on 2 November 2022 at 1400 – 1530 (CEST) / 0900 – 1030 (EST) when Tabitha Williams and Brittny Vongdara from Statistics Canada will provide an interactive lesson on using GitHub, and an introduction to Git. For more information and to register, please visit the Eventbrite page (Coffee and Coding Session 2 November). Any questions, get in touch at ML2022@ons.gov.uk

Jobs!

The Job market is a bit quiet over the summer- let us know if you have any openings you’d like to advertise

  • Evolution AI are looking to hire someone for applied deep learning research. Must like a challenge. Any background but needs to know how to do research properly. Remote. Apply here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

September Newsletter

Hi everyone-

I hope you have all been enjoying a great summer. Certainly lots to engage with from heat waves, sewage spills, leadership elections, spiralling energy costs… and of course on a much more positive note the Lionesses winning the Euros for the first time (it’s come home…)! Apologies for skipping a month but it does mean we have plenty to talk about so prepare for a somewhat longer than normal read…

Following is the September edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at https://datasciencesection.org/)

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science September 2022 Newsletter

RSS Data Science Section

Committee Activities

Committee members continue to be actively involved in the Alliance for Data Science Professionals, a joint initiative between the RSS and various other relevant organisations in defining standards for data scientist accreditation. The first tranche of data scientists to complete the newly defined standard of professionalism received their awards at a special ceremony at the Royal Society in July. The UK’s National Statistician welcomed the initiative.

Our recent event “From paper to pitch, success in academic/industry collaboration”, which took place on Wednesday 20th July, was very successful, with strong attendance and a thought provoking and interactive discussion- many thanks to Will Browne for organising. We will write up a summary and publish shortly.

We are also excited to announce our next event, catchily titled “IP Freely, making algorithms pay – Intellectual property in Data Science and AI”, which will be held on Wednesday 21 September 2022, 7.00PM – 8.00PM. Sign up here to hear leading figures such as Dr David Barber (Director of the UCL Centre for Artificial Intelligence) and Professor Noam Shemtov (Intellectual Property and Technology Law at Queen Mary University of London) in what should be an excellent discussion.

The RSS 2022 Conference is rapidly approaching (12-15 September in Aberdeen). The Data Science and AI Section is running what will undoubtedly be the best session(!) … ‘The secret sauce of open source’, which will discuss using open source to bridge the gap between academia and industry.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on September 14th when Gwanghyun Kim, Ph.D. student at Seoul National University (SNU), will discuss “Text-Guided Diffusion Models for Robust Image Manipulation”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

  • We have to acknowledge that many of the new AI tools are astonishing both in their performance and their sophistication and that it is incredibly hard if not impossible to eliminate all mistakes. However, applying best practice and using high quality data sets should be at the core of all work in this area.
"“They were claiming near-perfect accuracy, but we found that in each of these cases, there was an error in the machine-learning pipeline,” says Kapoor."
"The new Vehicle General Safety Regulation starts applying today. It introduces a range of mandatory advanced driver assistant systems to improve road safety and establishes the legal framework for the approval of automated and fully driverless vehicles in the EU"
"Facebook’s stated mission is “to give people the power to build community and bring the world closer together.” But a deeper look at their business model suggests that it is far more profitable to drive us apart. By creating “filter bubbles”—social media algorithms designed to increase engagement and, consequently, create echo chambers where the most inflammatory content achieves the greatest visibility—Facebook profits from the proliferation of extremism, bullying, hate speech, disinformation, conspiracy theory, and rhetorical violence"
“We remain committed to protecting our users against improper government demands for data, and we will continue to oppose demands that are overly broad or otherwise legally objectionable,” Ms. Fitzpatrick wrote.
"He posted again on Twitter later in the day, saying: "apparently, this exploit happened because the gov developer wrote a tech blog on CSDN and accidentally included the credentials", referring to the China Software Developer Network."
"This paper first discusses what humanoid robots are, why and how humans tend to anthropomorphise them, and what the literature says about robots crowding out human relations. It then explains the ideal of becoming “fully human”, which pertains to being particularly moral in character."

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

  • Solving proteins…
    • We have previously discussed the groundbreaking work of DeepMind in solving the protein folding problem with AlphaFold, which can generate the estimated 3D structure for any protein. They have now gone a step further and publicly released the structures of over 200m proteins
    • Lots of background and commentary on this groundbreaking step here and here
"Prof Dame Janet Thornton, the group leader and senior scientist at the European Molecular Biology Laboratory’s European Bioinformatics Institute, said: “AlphaFold protein structure predictions are already being used in a myriad of ways. I expect that this latest update will trigger an avalanche of new and exciting discoveries in the months and years ahead, and this is all thanks to the fact that the data are available openly for all to use.”"
Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions
  • Finally, another phenomenon I find pretty extraordinary… “Grokking”, where model performance improves after a period of seemingly over-fitting. Researchers at Apple give the full story
"The grokking phenomenon as reported by Power et al. ( arXiv:2201.02177 ) refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

Many real-world machine learning problems can be framed as graph problems. On online platforms, users often share assets (e.g. photos) and interact with each other (e.g. messages, bookings, reviews). These connections between users naturally form edges that can be used to create a graph.

However, in many cases, machine learning practitioners do not leverage these connections when building machine learning models, and instead treat nodes (in this case, users) as completely independent entities. While this does simplify things, leaving out information around a node’s connections may reduce model performance by ignoring where this node is in the context of the overall graph.

How does that work?
Tutorials and deep dives on different approaches and techniques

"To conclude: we have shown that for in the presence of (many) irrelevant variables, RF performance suffers and something needs to be done. This can be either tuning the RF, most importantly increasing the mtry parameter, or identifying and removing the irrelevant features using the RFE procedure rfe() part of the caret package in R. Selecting only relevant features has the added advantage of providing insight into which features contain the signal."
"Text Embeddings give you the ability to turn unstructured text data into a structured form. With embeddings, you can compare two or more pieces of text, be it single words, sentences, paragraphs, or even longer documents. And since these are sets of numbers, the ways you can process and extract insights from them are limited only by your imagination."
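
To make the idea of comparing pieces of text via their embeddings a little more concrete, here is a minimal sketch of cosine similarity, one common way of scoring how close two embedding vectors are. It is not taken from the quoted article: the three-number vectors are made up for illustration, and in practice the embeddings would come from a model of your choice and be much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns a value between -1 and 1; values near 1 mean the vectors
    point in nearly the same direction.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical, hand-written "embeddings" purely for illustration.
sentence_a = [0.9, 0.1, 0.0]
sentence_b = [0.8, 0.2, 0.1]
sentence_c = [0.0, 0.1, 0.9]

print(cosine_similarity(sentence_a, sentence_b))  # closer to 1: similar
print(cosine_similarity(sentence_a, sentence_c))  # closer to 0: dissimilar
```

Cosine similarity ignores vector length and looks only at direction, which is why it is a popular default for embeddings; other distance measures are equally possible.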

Practical tips
How to drive analytics and ML into production

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink! …

"As these LLMs become more common and powerful, there seems to be less and less agreement over how we should understand them. These systems have bested many “common sense” linguistic reasoning benchmarks over the years, many which promised to be conquerable only by a machine that “is thinking in the full-bodied sense we usually reserve for people.” Yet these systems rarely seem to have the common sense promised when they defeat the test and are usually still prone to blatant nonsense, non sequiturs and dangerous advice. This leads to a troubling question: how can these systems be so smart, yet also seem so limited?"
"To be sure, there are indeed some ways in which AI truly is making progress—synthetic images look more and more realistic, and speech recognition can often work in noisy environments—but we are still light-years away from general purpose, human-level AI that can understand the true meanings of articles and videos, or deal with unexpected obstacles and interruptions. We are still stuck on precisely the same challenges that academic scientists (including myself) having been pointing out for years: getting AI to be reliable and getting it to cope with unusual circumstances."
"Ongoing debates about whether large pre-trained models understand text and images are complicated by the fact that scientists and philosophers themselves disagree about the nature of linguistic and visual understanding in creatures like us. Many researchers have emphasized the importance of “grounding” for understanding, but this term can encompass a number of different ideas. These might include having appropriate connections between linguistic and perceptual representations, anchoring these in the real world through causal interaction, and modeling communicative intentions. Some also have the intuition that true understanding requires consciousness, while others prefer to think of these as two distinct issues. No surprise there is a looming risk of researchers talking past each other."
Liberating the world’s scientific knowledge from the twin barriers of accessibility and understandability will help drive the transition from a web focused on clicks, views, likes, and attention to one focused on evidence, data, and veracity. Pharma is clearly incentivized to bring this to fruition, hence the growing number of startups identifying potential drug targets using AI — but I believe the public, governments, and anyone using Google might be willing to forgo free searches in an effort for trust and time-saving. The world desperately needs such a system, and it needs it fast.
"To put this in context: until this paper, it was conventional to train all large LMs on roughly 300B tokens of data.  (GPT-3 did it, and everyone else followed.)

Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got.

People put immense effort into training models that big, and were working on even bigger ones, and yet none of this, in principle, could ever get as far as Chinchilla did."
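
The quote does not reproduce "our equation", so the sketch below is an assumption: it uses the parametric loss fit from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, with the fitted constants as I understand them to be reported there (E = 1.69, A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28). The Chinchilla size of roughly 70B parameters and 1.4T tokens is likewise taken from that paper rather than from this newsletter. Under those assumptions, it compares the predicted loss of an infinitely large model trained on 300B tokens with the predicted loss for Chinchilla.

```python
import math

# Assumed fitted constants for L(N, D) = E + A / N**ALPHA + B / D**BETA.
# These are not given in the newsletter; treat them as assumptions.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params, n_tokens):
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    model_term = A / n_params ** ALPHA   # shrinks to 0 as the model grows
    data_term = B / n_tokens ** BETA     # fixed once the token budget is fixed
    return E + model_term + data_term


# Best case for the "300B tokens" convention: let the model size go to infinity.
floor_at_300b_tokens = predicted_loss(math.inf, 300e9)

# Chinchilla: assumed to be about 70B parameters trained on about 1.4T tokens.
chinchilla = predicted_loss(70e9, 1.4e12)

if __name__ == "__main__":
    print(f"floor with 300B tokens: {floor_at_300b_tokens:.3f}")
    print(f"Chinchilla:             {chinchilla:.3f}")
    print("Chinchilla beats the floor:", chinchilla < floor_at_300b_tokens)
```

The point of the comparison is that the data term does not depend on model size, so with the token count held at 300B there is a floor that no amount of extra parameters can get under.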

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

Updates from Members and Contributors

  • Kevin O’Brien highlights the PyData Global 2022 Conference, taking place online between Thurs 1st and Sat 3rd December. Calls for proposals are still open until September 12th, 2022. Submit here.
  • Ole Schulz-Trieglaff also mentions the PyData Cambridge meetup which is running a talk on Sept 14th by Gian Marco Iodice (Tech Lead ML SW Performance Optimizations at ARM)
  • Ronald Richman and colleagues have published a paper on their innovative work using deep neural nets for discrimination free pricing in insurance, when discriminatory characteristics are not known. Well worth a read.
  • Many congratulations to Prithwis De who has published a book on a very relevant topic: “Towards Net-Zero Targets: Usage of Data Science for Long-Term Sustainability Pathways”.
  • Mark Marfé and Cerys Wyn Davies recently published an article about data and IP issues in the context of AI deployed on ESG projects which looks interesting and relevant.
  • Finally, more news from The Data Science Campus who are helping organise this year’s UN Big Data Hackathon, November 8-11.
    • The UN Big Data Hackathon is an exciting global competition for data professionals and young people from all around the world to work together on important global challenges.
    • It’s part of this year’s UN Big Data conference in Indonesia. There are two tracks, one for data science professionals and the other for young people and students (under 32 years of age).
    • Registrations should preferably be done as a team of 3 to 5 people, but individual applications can also be accepted. The registration deadline is Sept 15th.

Jobs!

The job market is a bit quiet over the summer – let us know if you have any openings you’d like to advertise.

  • Evolution AI are looking to hire someone for applied deep learning research. Must like a challenge. Any background, but needs to know how to do research properly. Remote. Apply here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

Don’t miss out – ‘From Paper to Pitch’ meetup on Wednesday July 20th

From paper to pitch: success stories of academic and industry collaboration.

Next Wednesday (Wednesday 20 July 2022, 7.00PM – 9.00PM) the RSS Data Science and AI section are hosting an event to bring together practitioners and researchers to improve collaboration.

We have two excellent speakers in Rebecca Pope, Ph.D. (she/her) and Andre Vauvelle. It will be a great opportunity to discuss how we can bring industry and academia together and it would be lovely to see people in person again.

The event is free, but there are limited places, so please sign up here. Looking forward to seeing you all!

July Newsletter

Hi everyone-

Welcome to July! Inflation, union strikes, sunshine … lots of commentary drawing parallels to the mid-70s. One thing that is very different from that period is the world of data science (which didn’t even exist as a discipline) – crazy to think that the Apple II launched in ’77 with 4 KB RAM, 4 million times less memory than the laptop I’m writing this on…

Following is the July edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. We’ll take a break in August, so fingers crossed this sees you through to the beginning of September…

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science July 2022 Newsletter

RSS Data Science Section

Committee Activities

Committee members continue to be actively involved in a joint initiative between the RSS and various other bodies (The Chartered Institute for IT (BCS), the Operational Research Society (ORS), the Royal Academy of Engineering (RAEng), the National Physical Laboratory (NPL), the Royal Society and the IMA (The Institute of Mathematics and its Applications)) in defining standards for data scientist accreditation, with plans underway to launch the Advanced Certificate shortly.

We are very excited to announce our next meetup, “From paper to pitch, success in academic/industry collaboration” which will take place on Wednesday 20th July from 7pm-9pm. We believe that there is huge potential in greater collaboration between industry and academia and have invited two excellent speakers to provide examples of how this can work in practice. This should be a thought provoking, and very relevant (and free) event – sign up here.

The full programme is now available for the September RSS 2022 Conference. The Data Science and AI Section is running what will undoubtedly be the best session(!) … ‘The secret sauce of open source’, which will discuss using open source to bridge the gap between academia and industry.

As mentioned last time, Janet Bastiman (Chief Data Scientist at Napier AI) recently spoke at the FinTech FinCrime Exchange Conference (FFECON) in a panel session entitled “With great AI power comes great FinCrime responsibility”: cool summary from the discussion…

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event will be on July 13th when Stéphane d’Ascoli, Ph.D. candidate at Facebook AI, discusses “Solving Symbolic Regression with Transformers”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"From California to Colorado and Pennsylvania, as child welfare agencies use or consider implementing algorithms, an AP review identified concerns about transparency, reliability and racial disparities in the use of the technology, including their potential to harden bias in the child welfare system."
"In summary, GPT-4chan resulted in a large amount of public discussion and media coverage, with AI researchers generally being critical of Kilcher’s actions and many others disagreeing with these criticisms. This sequence of events was generally predictable, so much so that I was able to prompt GPT-3 – which has no knowledge whatsoever about current events – to summarize the controversy somewhat accurately"
"Cohere, OpenAI, and AI21 Labs have developed a preliminary set of best practices applicable to any organization developing or deploying large language models. Computers that can read and write are here, and they have the potential to fundamentally impact daily life.

The future of human-machine interaction is full of possibility and promise, but any powerful technology needs careful deployment. The joint statement below represents a step towards building a community to address the global challenges presented by AI progress, and we encourage other organizations who would like to participate to get in touch."
  • Of course the sad truth is that, in simplistic terms, this type of model is basically regurgitating the same biases present in the material it was trained on. Some thought provoking analysis from textio highlighting the inherent biases present in performance feedback.
  • A Google researcher (since placed on administrative leave…) caused controversy by claiming that one of these Large Language Models (in this case Google’s LaMDA) was sentient – good summary in Wired here. The Guardian followed up on this with some thoughtful pieces on how the model works, and why we are prone to be fooled by mimicry.
"It’s strategic transparency. They get to come out and say they're helping researchers and they're fighting misinformation on their platforms, but they're not really showing the whole picture.”
"While AI can calculate, retrieve, and employ programming that performs limited rational analyses, it lacks the calculus to properly dissect more emotional or unconscious components of human intelligence that are described by psychologists as system 1 thinking."
"China’s ambition to collect a staggering amount of personal data from everyday citizens is more expansive than previously known, a Times investigation has found. Phone-tracking devices are now everywhere. The police are creating some of the largest DNA databases in the world. And the authorities are building upon facial recognition technology to collect voice prints from the general public."
"Police can not only obtain search histories from a pregnant person’s device, but can also obtain records directly from search engines, and sometimes they don’t even need a warrant."

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

"We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models"
"To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant."
"In experiments on medium-sized tabular data with about 10,000 samples, Hopular outperforms XGBoost, CatBoost, LightGBM and a state-of-the art Deep Learning method designed for tabular data"
"Parti treats text-to-image generation as a sequence-to-sequence modeling problem, analogous to machine translation – this allows it to benefit from advances in large language models, especially capabilities that are unlocked by scaling data and model sizes"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

"Using a neural network trained on widely available weather forecasts and historical turbine data, we configured the DeepMind system to predict wind power output 36 hours ahead of actual generation. Based on these predictions, our model recommends how to make optimal hourly delivery commitments to the power grid a full day in advance"
"This was AlphaFold 2, which was published in July 2021. It had a level of atomic accuracy of less than one angstrom. I work with a lot of colleagues in structural biology. They've spent years to determine the structure of a protein and many times they never solve it. But not only do you produce confidence measures, you also — anyone — can put in their favorite protein and see how it works in seconds. And you also get feedback from the user. You also linked up with the European Bioinformatics Institute (EMBL-EBI). It's open-source and it's free."

More DALL-E fun..
DALL-E is still making headlines so we’ll keep serving up a few fun posts!

"We discover that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts. For example, it seems that \texttt{Apoploe vesrreaitais} means birds and \texttt{Contarra ccetnxniams luryca tanniounons} (sometimes) means bugs or pests"

How does that work?
Tutorials and deep dives on different approaches and techniques

"An important point: if you train the first level on the whole dataset first and then the second level, you will get a leakage in the data. At the second level, the content score of matrix factorization will take into account the targeting information"
  • You’ve been wanting to explore GPT-3 but haven’t known where to start? Here you go!
"I think a big reason people have been put off trying out GPT-3 is that OpenAI market it as the OpenAI API. This sounds like something that’s going to require quite a bit of work to get started with.

But access to the API includes access to the GPT-3 playground, which is an interface that is incredibly easy to use. You get a text box, you type things in it, you press the “Execute” button. That’s all you need to know."
  • I’m a regular user of Jupyter Lab (and notebooks) … but I’ve never used it to build a web app! Lots of useful tips here
  • And … it’s live! Andrew Ng’s new foundational course in Machine Learning is open for enrolment – if you do one course, do this one
"Newly rebuilt and expanded into 3 courses, the updated Specialization teaches foundational AI concepts through an intuitive visual approach, before introducing the code needed to implement the algorithms and the underlying math."

Practical tips
How to drive analytics and ML into production

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink! A few extra this month to get you through the long summer…

"At the heart of this debate are two different visions of the role of symbols in intelligence, both biological and mechanical: one holds that symbolic reasoning must be hard-coded from the outset and the other holds it can be learned through experience, by machines and humans alike. As such, the stakes are not just about the most practical way forward, but also how we should understand human intelligence — and, thus, how we should pursue human-level artificial intelligence."
"Now it is true that GPT-3 is genuinely better than GPT-2, and maybe true that InstructGPT is genuinely better than GPT-3. I do think that for any given example, the probability of a correct answer has gone up...

...But I see no reason whatsoever to think that the underlying problem — a lack of cognitive models of the world — has been remedied. The improvements, such as they are, come, primarily because the newer models have larger and larger sets of data about how human beings use word sequences, and bigger word sequences are certainly helpful for pattern matching machines. But they still don’t convey genuine comprehension, and so they are still very easy for Ernie and me (or anyone else who cares to try) to break."
"There’s an important point about expertise hidden in here: we expect our AGIs to be “experts” (to beat top-level Chess and Go players), but as a human, I’m only fair at chess and poor at Go. Does human intelligence require expertise? (Hint: re-read Turing’s original paper about the Imitation Game, and check the computer’s answers.) And if so, what kind of expertise? Humans are capable of broad but limited expertise in many areas, combined with deep expertise in a small number of areas. So this argument is really about terminology: could Gato be a step towards human-level intelligence (limited expertise for a large number of tasks), but not general intelligence?"
For those not well-versed in chess, here’s a summary of what happened. The first three or four moves were a fairly standard opening from both sides. Then, the AI began making massive blunders, even throwing away its queen. Finally, as the vice began to close around its king, the AI eventually made an illegal move, losing the game.

All in all, a pretty solid showing: it understood the format, (mostly) knew what moves were legal, and even played a decent opening. But this AI is not good at chess. Certainly, nothing close to 5000 ELO.

Is this just a “flub”, which will be fixed by scale? Will a future, even-larger GPT be the world chess champion? I don’t believe so.
"In January, 2021, Microsoft filed a patent to reincarnate people digitally through distinct voice fonts appended to lingual identities garnered from their social media accounts. I don’t see any reason why it can’t work. I believe that, if my grandchildren want to ask me a question after I’m dead, they will have access to a machine that will give them an answer and in my voice. That’s not a “new soul.” It is a mechanical tongue, an artificial person, a virtual being. The application of machine learning to natural language processing achieves the imitation of consciousness, not consciousness itself, and it is not science fiction. It is now."
  • Don’t worry – if we eventually get to an artificial general intelligence that everyone agrees on, we have a thoughtful taxonomy of all the ways it could kill us (AGI ruin: a list of lethalities)!

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

  • The latest results from the ONS tracking study estimate 1 in 30 people in England (1 in 18 in Scotland) have Covid. Sadly this has risen (from 1 in 60 last month) due to infections compatible with Omicron variants BA.4 and BA.5, but is at least down on its peak when it reached 1 in 14… Still a far cry from the 1 in 1000 we had last summer.
  • Promising research on the use of fitness tracker data to detect Covid early

Updates from Members and Contributors

  • Arthur Turrell has some excellent updates from the ONS Data Science Campus:
    • The ONS Data Science Campus was involved in this widely covered ONS piece on the cost of living inspired by Jack Monroe and other food campaigners.
    • ‘Making text count: Economic forecasting using newspaper text’, which was a collaboration across multiple institutions and for which I am a co-author, was published in the Journal of Applied Econometrics and shows how machine learning + text from newspapers can improve macroeconomic forecasts.
    • We released a package from the Campus for evaluating how well synthetic data matches real data. Repository here, blog post here.

Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

June Newsletter

Hi everyone-

It’s June already – time flies – and in the UK an extra bank holiday! Perhaps the data science reading materials below might help fill the void now the Jubilee celebrations have finished …

Following is the June edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science June 2022 Newsletter

RSS Data Science Section

Committee Activities

Committee members continue to be actively involved in a joint initiative between the RSS and various other bodies (The Chartered Institute for IT (BCS), the Operational Research Society (ORS), the Royal Academy of Engineering (RAEng), the National Physical Laboratory (NPL), the Royal Society and the IMA (The Institute of Mathematics and its Applications)) in defining standards for data scientist accreditation, with a plan to launch the Advanced Certificate in the summer.

We will also shortly be announcing details of our next meetup – watch this space!

Janet Bastiman (Chief Data Scientist at Napier AI) recently spoke at the FinTech FinCrime Exchange Conference (FFECON) in a panel session entitled “With great AI power comes great FinCrime responsibility”, discussing how AI implementations can go wrong and what we need to do about it.

The RSS is running an in-person Discussion Meeting on Thursday June 16th at the Errol Street headquarters: “Statistical Aspects of the Covid-19 Pandemic”. Register here for free attendance.

The full programme is now available for the September RSS 2022 Conference. The Data Science and AI Section is running what will undoubtedly be the best session(!) … ‘The secret sauce of open source’, which will discuss using open source to bridge the gap between academia and industry. An early booking registration discount is available until 6 June for in-person attendance at the conference and 20 June for viewing content via the online conference platform.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on June 15th when Ting Chen from Google Brain will discuss Pix2Seq, “A new language interface for object detection”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"After three separate experiments, the researchers found the AI-created synthetic faces were on average rated 7.7% more trustworthy than the average rating for real faces... The three faces rated most trustworthy were fake, while the four faces rated most untrustworthy were real, according to the magazine New Scientist."
"The settlement, filed Monday in a federal court in Illinois, bars the company from selling its biometric data to most businesses and private firms across the U.S. The company also agreed to stop offering free trial accounts to individual police officers without their employers' knowing or approving, which had allowed them to run searches outside of police departments' purview"
"Even when you filter medical images past where the images are recognizable as medical images at all, deep models maintain a very high performance. That is concerning because superhuman capacities are generally much more difficult to control, regulate, and prevent from harming people."
"This brief focuses on three sub-areas within “AI safety,” a term that has come to refer primarily to technical research (i.e., not legal, political, social, etc. research) that aims to identify and avoid unintended AI behavior. AI safety research primarily seeks to make progress on technical aspects of the many socio-technical challenges that have come along with progress in machine learning over the past decade."
"The AI industry does not seek to capture land as the conquistadors of the Caribbean and Latin America did, but the same desire for profit drives it to expand its reach. The more users a company can acquire for its products, the more subjects it can have for its algorithms, and the more resources—data—it can harvest from their activities, their movements, and even their bodies."
"The answers are complex and depend to some extent on your exact threat models, but if you want a summary of the advice I usually give it boils down to:
 - Treat your training data like you do your traditional source code.
 - Treat your model files like compiled executables."

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

"Another class of specification gaming examples comes from the agent exploiting simulator bugs. For example, a simulated robot that was supposed to learn to walk figured out how to hook its legs together and slide along the ground."
"A lot of the existing video models have poor quality (especially on long videos), require enormous amounts of GPUs/TPUs, and can only solve one specific task at a time (only prediction, only generation, or only interpolation). We aimed to improve on all these problems. We do so through a Masked Conditional Video Diffusion (MCVD) approach."
A much broader segment of the AI community needs access to these models in order to conduct reproducible research and collectively drive the field forward. With the release of OPT-175B and smaller-scale baselines, we hope to increase the diversity of voices defining the ethical considerations of such technologies.
  • DeepMind has been at its groundbreaking best again …
    • Firstly with Flamingo which elegantly combines visual and text user feedback to refine responses
    • And perhaps most impressively with Gato, a single generalist agent
The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
  • Real world applications of reinforcement learning can still be hard to come by despite the progress at DeepMind. One promising approach is Offline RL (which utilises historic data) – looks like BAIR (Berkeley Artificial Intelligence Research) has made good progress
"Let’s begin with an overview of the algorithm we study. While lots of prior work (Kumar et al., 2019; Ghosh et al., 2021; and Chen et al., 2021) share the same core algorithm, it lacks a common name. To fill this gap, we propose the term RL via Supervised Learning (RvS). We are not proposing any new algorithm but rather showing how prior work can be viewed from a unifying framework"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

Advocates like Mr. Ward look to beneficial, low-cost, intermediate technologies that are available now. A prime example is intelligent speed assistance, or I.S.A., which uses A.I. to manage a car’s speed via in-vehicle cameras and maps. The technology will be mandatory in all new vehicles in the European Union beginning in July, but has yet to take hold in the United States.
At Google, we’re always dreaming up new ways to help you uncover the information you’re looking for — no matter how tricky it might be to express what you need. That’s why today, we’re introducing an entirely new way to search: using text and images at the same time. With multisearch in Lens, you can go beyond the search box and ask questions about what you see.

More DALL-E fun..
A one off section on everyone’s favourite image generation tool, DALL-E

  • Last month we highlighted the amazing examples of images generated from text prompts using OpenAI’s DALL-E 2. There’s been lots more commentary so we’ve pulled it together in one place…
  • First of all, an update from OpenAI – apparently early users have generated over 3m images to date.
  • How does it actually work? Good breakdown of the underlying methods here.
  • A different take on DALL-E and what it means for design and a potential ‘vibe-shift’ – well worth a read.
  • Another great take- this time exploring how DALL-E seems to combine objects in ways that make sense but that can’t be known from the words themselves.
  • Finally, watch out DALL-E, here comes IMAGEN from the Google Brain team
"A marble statue of a Koala in front of a marble statue of a turntable. The Koala has large marble headphones"

How does that work?
Tutorials and deep dives on different approaches and techniques

"Graphs are a convenient way to abstract complex systems of relations and interactions. The increasing prominence of graph-structured data from social networks to high-energy physics to chemistry, and a series of high-impact successes have made deep learning on graphs one of the hottest topics in machine learning research"
"Recommender systems work well when we have a lot of data on user-item preferences. With a lot of data, we have high certainty about what users like. Conversely, with very little data, we have low certainty. Despite the low certainty, recommenders tend to greedily promote items that received higher engagement in the past. And because they influence how much exposure an item gets, potentially relevant items that aren’t recommended continue getting no to low engagement, perpetuating the feedback loop."
"The goal of structural optimization is to place material in a design space so that it rests on some fixed points or “normals” and resists a set of applied forces or loads as efficiently as possible."
"This article outlines different methods for creating confidence intervals for machine learning models. Note that these methods also apply to deep learning. This article is purposefully short to focus on the technical execution without getting bogged down in details; there are many links to all the relevant conceptual explanations throughout this article."
"My team spent many hours debating the most important concepts to teach. We developed extensive syllabi for various topics and prototyped course units in them. Sometimes this process helped us realize that a different topic was more important, so we cut material we had developed to focus on something else. The result, I hope, is an accessible set of courses that will help anyone master the most important algorithms and concepts in machine learning today — including deep learning but also a lot of other things — and to build effective learning systems." 

Practical tips
How to drive analytics and ML into production

"For example, when you’re in a BI tool like Looker, you inevitably think, “Do I trust this dashboard?” or “What does this metric mean?” And the last thing anyone wants to do is open up another tool (aka the traditional data catalog), search for the dashboard, and browse through metadata to answer that question.." 
"I actually don’t care that much about the bundling argument that I will make in this post. Truthfully, I just want to argue that feature stores, metrics layers, and machine learning monitoring tools are all abstraction layers on the same underlying concepts, and 90% of companies should just implement these “applications” in SQL on top of streaming databases."
"At its core, data storytelling is about taking the step beyond the simple relaying of data points. It’s about trying to make sense of the world and leveraging storytelling to present insights to stakeholders in a way they can understand and act on. As data scientists, we can inform and influence through data storytelling by creating personal touch points between our audience and our analysis."

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink!

"But this morning I woke to a new reification, a Twitter thread that expresses, out loud, the Alt Intelligence creed, from Nando de Freitas, a brilliant high-level executive at DeepMind, Alphabet’s rightly-venerated AI wing, in a declaration that AI is “all about scale now.” Indeed, in his mind (perhaps deliberately expressed with vigor to be provocative), the harder challenges in AI are already solved. “The Game is Over!”, he declares"
"It is a tale told by an idiot, full of sound and fury, signifying nothing". —Macbeth

"AI-generated artwork is the same as a gallery of rock faces. It is pareidolia, an illusion of art, and if culture falls for that illusion we will lose something irreplaceable. We will lose art as an act of communication, and with it, the special place of consciousness in the production of the beautiful."
"AIs will make increasingly complex and important decisions, but they may make these decisions based on different criteria that could potentially go against our values. Therefore, we need a language to talk to AI for better alignment. "
"But the algorithmic summaries could make errors, include outdated information or remove nuance and uncertainty, without users appreciating this. If anyone can use LLMs to make complex research comprehensible, but they risk getting a simplified, idealized view of science that’s at odds with the messy reality, that could threaten professionalism and authority. It might also exacerbate problems of public trust in science."

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

  • The latest results from the ONS tracking study estimate 1 in 60 people in England have Covid. This is at least moving in the right direction compared to a couple of weeks ago, when it reached 1 in 14… Still a far cry from the 1 in 1000 we had last summer.

Updates from Members and Contributors

Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

  • EvolutionAI are looking to hire someone for applied deep learning research. Must like a challenge. Any background but needs to know how to do research properly. Remote. Apply here
  • AstraZeneca are looking for a Data Science and AI Engagement lead – more details here
  • Cazoo is looking for a number of senior data engineers – great modern stack and really interesting projects!

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

May Newsletter

Hi everyone-

Another month flies by, and although we had that rarest of occasions in the UK – a sunny Easter weekend – the news in general continues to be depressing: law breakers at the highest ranks of government, covid infections high, and of course the devastating war in Ukraine. Hopefully the data science reading materials below might distract a little…

Following is the May edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. Check out our new ‘Jobs!’ section – an extra incentive to read to the end!

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science May 2022 Newsletter

RSS Data Science Section

Committee Activities

We have all been shocked and saddened by events in Ukraine and our thoughts and best wishes go out to everyone affected.

Committee members continue to be actively involved in a joint initiative between the RSS and various other bodies (The Chartered Institute for IT (BCS), the Operational Research Society (ORS), the Royal Academy of Engineering (RAEng), the National Physical Laboratory (NPL), the Royal Society and the IMA (The Institute of Mathematics and its Applications)) in defining standards for data scientist accreditation, with a plan to launch the Advanced Certificate in the summer.

Florian Ostmann (Head of AI Governance and Regulatory Innovation at The Alan Turing Institute) continues to work on setting up the AI Standards Hub pilot. As set out in the previously shared announcement, this new initiative aims to promote awareness and understanding of the role of technical standards as an AI governance and innovation mechanism, and to grow UK stakeholder involvement in international AI standardisation efforts. The AI Standards Hub team have set up an online form to sign up for updates about the initiative (including a notification when the newly developed AI Standards Hub website goes live) and an opportunity to provide feedback to inform the Hub’s strategy. The form can be accessed at www.aistandardshub.org. If you are interested to learn more about the initiative, you can also watch a recording of the recent AI UK session about the AI Standards Hub here

The RSS has a number of annual awards – nominations for next year are open. It would be fantastic to have more data scientist nominations, particularly for the David Cox Research prize, or maybe an Honorary Fellowship. Suggestions most welcome – post here!

The RSS DMC (Discussion Meetings Committee) is holding its next Discussion Meeting online on 11th May, 3-5pm BST (with the DeMO at 2pm), discussing the paper ‘Vintage Factor Analysis with Varimax Performs Statistical Inference’ – all welcome

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on May 11th, when Drew Jaegle (Research Scientist at DeepMind in London) will discuss his research on “Perceivers: Towards General-Purpose Neural Network Architectures”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

  • We’ve discussed previously the increasing ease with which realistic ‘fakes’ (profiles, images, videos…) can be generated. It’s quite hard to estimate the scale of the problem though, as we are likely increasingly unaware of most instances we come across. Two Stanford University researchers have attempted to shed some light, uncovering over 1,000 AI-generated LinkedIn faces across 70 different businesses:
"It's not a story of mis- or disinfomation, but rather the intersection of a fairly mundane business use case w/AI technology, and resulting questions of ethics & expectations. What are our assumptions when we encounter others on social networks? What actions cross the line to manipulation"
"The image recognition app botched its task, Mr. Monteith said, because it didn’t have proper training data. Ms. Edmo explained that tagging results are often “outlandish” and “offensive,” recalling how one app identified a Native American person wearing regalia as a bird. And yet similar image recognition apps have identified with ease a St. Patrick’s Day celebration, Ms. Ardalan noted as an example, because of the abundance of data on the topic."
  • Under-representation of minority groups is a key challenge for the US Census, which is a critical problem as many government decisions (from voting districts to funding) are based on the census figures. Excellent Wired article digging into these challenges and whether ML approaches to understanding satellite imagery can help.
  • Research from Facebook/Meta attempting to counter the imbalance in Wikipedia coverage by automatically generating basic Wikipedia entries for those who are under-represented…
"While women are more likely to write biographies about other women, Wikimedia’s Community Insights 2021 Report, which covers the previous year, found that only 15 percent of Wikipedia editors identified as women. This leaves women overlooked and underrepresented, despite the enormous impact they’ve had throughout history in science, entrepreneurship, politics, and every other part of society."
"These modes of research require organizations that can gather a lot of data, data that is often collected via ethically or legally questionable technologies, like surveilling people in nonconsensual ways. If we want to build technology that has meaningful community input, then we need to really think about what’s best. Maybe AI is not the answer for what some particular community needs."
"We argue that for the NAIRR to meet its goal of supporting non-commercial AI research, its design must take into account what we predict will be another closely related trend in AI R&D: an increasing reliance on large pre-trained models, accessed through application programming interfaces (APIs)."
  • Finally some commentary and a more in depth review (from the Ada Lovelace Institute) of the European Commission proposal for the Artificial Intelligence Act (‘the AI Act’)
"An analysis of the Act for the U.K.-based Ada Lovelace Institute by a leading internet law academic, Lilian Edwards, who holds a chair in law, innovation and society at Newcastle University, highlights some of the limitations of the framework — which she says derive from it being locked to existing EU internal market law; and, specifically, from the decision to model it along the lines of existing EU product regulations."

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

"We evaluated PaLM on 29 widely-used English natural language processing (NLP) tasks. PaLM 540B surpassed few-shot performance of prior large models, such as GLaM, GPT-3, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA, on 28 of 29 of tasks that span question-answering tasks (open-domain closed-book variant), cloze and sentence-completion tasks, Winograd-style tasks, in-context reading comprehension tasks, common-sense reasoning tasks, SuperGLUE tasks, and natural language inference tasks."
"Those results demonstrate that our search for ever increasing generalization performance -averaged over all classes and samples- has left us with models and regularizers that silently sacrifice performances on some classes. This scenario can become dangerous when deploying a model on downstream tasks"
  • I’m always paranoid about overfitting models, and intrigued by phenomena such as double descent, where, on very large data sets, you can find test set performance improve long after you think you have trained too far. OpenAI have been exploring similar phenomena on smaller data sets (‘Grokking’ – paper here) which could be very powerful.
  • This is pretty amazing – DiffusionCLIP from Korea Advanced Institute of Science and Technology: “zero shot image manipulation guided by text prompts”! And you can play around with it in PyTorch – repo here
  • I still struggle getting Deep Learning techniques to perform well (or better than tree based approaches) on traditional tabular data – useful survey on this topic here, and again good to see the repo here
  • Does AI make human decision making better? Yes, it looks like it. Interesting analysis: using AlphaGo to evaluate Go player moves before and after the release of the system from DeepMind.
"Our analysis of 750,990 moves in 25,033 games by 1,242 professional players reveals that APGs significantly improved the quality of the players’ moves as measured by the changes in winning probability with each move. We also show that the key mechanisms are reductions in the number of human errors and in the magnitude of the most critical mistake during the game. Interestingly, the improvement is most prominent in the early stage of a game when uncertainty is higher"
"In this paper, we train Transformers to infer the function or recurrence relation underlying sequences of integers or floats, a typical task in human IQ tests which has hardly been tackled in the machine learning literature. We evaluate our integer model on a subset of OEIS sequences, and show that it outperforms built-in Mathematica functions for recurrence prediction"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

  • Not to be outdone by Google, OpenAI released DALL-E 2 providing some jaw-dropping examples of generating realistic images from text prompts
    • Some excellent commentary from TheVerge here including:
      – details of some of the new features like ‘in-painting’ (allowing editing of pictures);
      – how it works (building on and improving CLIP – similar to the DiffusionCLIP paper discussed above);
      – and also some of the safeguards built in to try to prevent misuse (“As a preemptive anti-abuse feature, the model also can’t generate any recognisable faces based on a name”)
    • A dig into how it performs from LessWrong here: some amazing examples although best to steer clear of text and smaller objects
"Overall this is more powerful, flexible, and accurate than the previous best systems. It still is easy to find holes in it, but with some patience and willingness to iterate, you can make some amazing images."
  • In terms of real world impact, DeepMind’s AlphaFold’s ability to predict protein structures has already proven hugely beneficial, and it seems we are only just scratching the surface of its potential – good paper in Nature
“AlphaFold changes the game,” says Beck. “This is like an earthquake. You can see it everywhere,” says Ora Schueler-Furman, a computational structural biologist at the Hebrew University of Jerusalem in Israel, who is using AlphaFold to model protein interactions. “There is before July and after.”
"When applied to the data sets taken from the Long Beach area, the algorithms detected substantially more earthquakes and made it easier to work out how and where they started. And when applied to data from a 2014 earthquake in La Habra, also in California, the team observed four times more seismic detections in the “denoised” data compared with the officially recorded number."
  • Fully autonomous vehicles risk becoming the next Nuclear Fusion… always on the horizon but never quite realised. Waymo (a subsidiary of Google’s parent, Alphabet) announced a significant step though with their testing in San Francisco
"This morning in San Francisco, a fully autonomous all-electric Jaguar I-PACE, with no human driver behind the wheel, picked up a Waymo engineer to get their morning coffee and go to work. Since sharing that we were ready to take the next step and begin testing fully autonomous operations in the city, we’ve begun fully autonomous rides with our San Francisco employees. They now join the thousands of Waymo One riders we’ve been serving in Arizona, making fully autonomous driving technology part of their daily lives."
  • And it’s been a month or two since our last ‘slightly scary robot dog’ video… so here we go. This time learning to run very fast using a completely new (and ungainly) running technique
"Yeah, OK, what you’re looking at in the video above isn’t the most graceful locomotion. But MIT scientists announced last week that they got this research platform, a four-legged machine known as Mini Cheetah, to hit its fastest speed ever—nearly 13 feet per second, or 9 miles per hour—not by meticulously hand-coding its movements line by line, but by encouraging digital versions of the machine to experiment with running in a simulated world"

How does that work?
A new section on understanding different approaches and techniques

  • Contrastive learning is pretty cool- it trains models on the basis of the relationships between examples rather than the examples themselves and underpins some of the recent advances in learning visual representations (e.g. DALL-E, CLIP). But how does it work?- good tutorial here
  • How do Graph Neural Networks actually work? Excellent detailed tutorial here complete with fun hand-drawn diagrams…
  • This is very elegant – an in-browser visualisation of neural net activations, definitely worth playing around with
"While teaching myself the basics of neural networks, I was finding it hard to bridge the gap between the foundational theory and a practical "feeling" of how neural networks function at a fundamental level. I learned how pieces like gradient descent and different activation functions worked, and I played with building and training some networks in a Google Colab notebook.

Despite the richness of the ecosystem and the incredible power of the available tools, I felt like I was missing a core piece of the puzzle in my understanding."
"The paper, which was inspired by a short comment in McElreath's book (first edition), shows that theta does not necessarily change much even if you get a significant result. The probability theta can change dramatically under certain conditions, but those conditions are either so stringent or so trivial that it renders many of the significance-based conclusions in psychology and psycholinguistics questionable at the very least."

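To make the contrastive learning idea above a bit more concrete: one common formulation is the InfoNCE loss, where each example is scored against its matching “positive” and against every other item in the batch as a negative. This is a minimal pure-Python sketch with made-up 2-d embeddings; the function names and the temperature value are assumptions for illustration, not the linked tutorial’s code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] matches anchors[i]; other rows act as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the "correct class".
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): aligned pairs point the same way, swapped pairs do not.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is low when each anchor is closer to its own positive than to anyone else’s, which is the “relationships between examples” signal the model learns from.
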
Practical tips
How to drive analytics and ML into production

"It's no secret that good analyses are often the result of very scattershot and serendipitous explorations. Tentative experiments and rapidly testing approaches that might not work out are all part of the process for getting to the good stuff, and there is no magic bullet to turn data exploration into a simple, linear progression.

That being said, once started it is not a process that lends itself to thinking carefully about the structure of your code or project layout, so it's best to start with a clean, logical structure and stick to it throughout. We think it's a pretty big win all around to use a fairly standardized setup like this one." 
"Despite their conceptual simplicity, A/B tests are complex to implement, and flawed setups can lead to incorrect conclusions. One problem that can arise in misconfigured experiments is imbalance, where the groups being compared consist of such dissimilar user populations that any attempt to credit the feature under test with a change in success metrics becomes questionable."
"My colleagues Ian Johnson, Mike Freeman, and I recently collaborated on a series of data-driven stories about electricity usage in Texas and California to illustrate best practices of Analyzing Time Series Data. We found ourselves repeatedly changing how we visualized the data to reveal the underlying signals, rather than treating those signals as noise by following the standard practice of aggregating the hourly data to days, weeks, or months. Behind many of the best practices we recommended for time series analysis was a deeper theme: actually embracing the complexity of the data." 

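On the A/B testing imbalance point above: one simple diagnostic is the standardized mean difference (SMD) of a pre-experiment covariate between the two groups. A rough sketch follows, with hypothetical data; the covariate, the group values and the function name are all assumptions for illustration, and this is not the method used in the quoted article.

```python
import statistics

def standardized_mean_difference(control, treatment):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_var = (statistics.variance(control) + statistics.variance(treatment)) / 2
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_var ** 0.5

# Hypothetical pre-experiment covariate: sessions per user in the week before the test.
control = [3, 5, 4, 6, 5, 4, 5, 3]
balanced = [4, 5, 3, 6, 5, 4, 4, 4]   # similar population to control
skewed = [8, 9, 7, 10, 9, 8, 9, 7]    # much heavier users than control

smd_balanced = standardized_mean_difference(control, balanced)
smd_skewed = standardized_mean_difference(control, skewed)
print(f"balanced SMD = {smd_balanced:.2f}, skewed SMD = {smd_skewed:.2f}")
```

An absolute SMD above roughly 0.1 is often treated as a rule-of-thumb warning sign; a large value on a covariate measured before the experiment suggests the groups differed before the feature was ever switched on.
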
Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink!

"Why is this viewpoint useful? Because it gives us some hints on why ML works or doesn’t work:
1) ML models don’t just minimize a singular loss function. Instead, they evolve dynamically. We need to consider the dynamical evolution when thinking about ML.
2) We cannot really understand ML models using just a handful of metrics. They capture the macroscopic but not the microscopic behaviors of the model. We should think of metrics as tiny windows into a complex dynamical system, with each metric highlighting just one aspect of our models."
"The benefits of human-like artificial intelligence (HLAI) include soaring productivity, increased leisure, and perhaps most profoundly a better understanding of our own minds. But not all types of AI are human-like–in fact, many of the most powerful systems are very different from humans–and an excessive focus on developing and deploying HLAI can lead us into a trap."
"Suppose you’re a robot visiting a carnival, and you confront a fun-house mirror; bereft of common sense, you might wonder if your body has suddenly changed. On the way home, you see that a fire hydrant has erupted, showering the road; you can’t determine if it’s safe to drive through the spray. You park outside a drugstore, and a man on the sidewalk screams for help, bleeding profusely. Are you allowed to grab bandages from the store without waiting in line to pay? "
"Very rarely does one actually know the data generating function, or even a reasonable proxy - real world data is disorganized, inconsistent, and unpredictable. As a result, the term “distribution” is vague enough to not address the additional specificity necessary to direct actions and interventions"
"Machine learning and traditional algorithms are two substantially different ways of computing, and algorithms with predictions is a way to bridge the two."
"The "adjacent possible" is an idea that comes from Stuart Kauffman and describes how evolutionary systems grow - at any given point you’ve got the set of things that already exist, and the adjacent possible is the set of things that could exist as the next generation from the current possibilities."

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

  • However, no-one told the virus. The latest results from the ONS tracking study estimate 1 in 25 people in England have Covid. This is at least moving in the right direction compared to a couple of weeks ago, when it reached 1 in 14… Still a far cry from the 1 in 1000 we had last summer.
  • Simple but elegant diagram showing how a new variant may appear milder even with no change in the underlying virulence due to re-infection
"For example, she estimated that the average vaccinated and boosted person who was at least 65 years old had a risk of dying after a Covid infection slightly higher than the risk of dying during a year of military service in Afghanistan in 2011"

Updates from Members and Contributors

  • Ole Schulz-Trieglaff highlights the excellent upcoming PyData London conference (June 17th-19th, 2022). For those who aren’t aware:
    • PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organisation in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other.
    • NumFOCUS supports many of the most popular data science tools such as pandas and scikit-learn: https://numfocus.org/sponsored-projects
  • James Lupino is pleased to announce that RISC AI and IntelliProp have entered into an agreement to cooperate in the development of programs, projects and activities related to system integration of processors for Artificial Intelligence (AI) computing in high-speed network fabrics that connect memory, storage and compute resources. More information here. RISC AI do not use gradient descent but a novel method using modal interval arithmetic to guarantee the optimal solution is found in a single run.

Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

  • Holisticai, a startup focused on providing insight, assessment and mitigation of AI risk, has a number of relevant AI related job openings- see here for more details
  • EvolutionAI are looking for a machine learning research engineer to develop their award winning AI-powered data extraction platform, putting state of the art deep learning technology into production use. Strong background in machine learning and statistics required
  • AstraZeneca are looking for a Data Science and AI Engagement lead – more details here
  • Lloyd’s Register are looking for a data analyst to work across the Foundation with a broad range of safety data to inform the future direction of challenge areas and provide society with evidence-based information.
  • Cazoo is looking for a number of senior data engineers – great modern stack and really interesting projects!

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

April Newsletter

Hi everyone-

The news from Ukraine is truly devastating and brings a huge dose of perspective to our day to day lives in the UK. I know I for one care rather less about fixing my python package dependencies when I see the shocking scenes from Mariupol… However, those of us more distant from the war do at least have the option to think about other things, and hopefully the data science reading materials below might distract a little…

Following is the April edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. Check out our new ‘Jobs!’ section – an extra incentive to read to the end!

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science April 2022 Newsletter

RSS Data Science Section

Committee Activities

We have all been shocked and saddened by events in Ukraine and our thoughts and best wishes go out to everyone affected.

The committee is busy planning out our activities for the year with lots of exciting events and even hopefully some in-person socialising… Watch this space for upcoming announcements.

Louisa Nolan (Chief Data Scientist, Data Science Campus, ONS) is helping drive the Government Data Science Festival 2022, a virtual event running from 27 April to 11 May 2022. This exciting event is a space for the government and UK public sector data science community, and colleagues in the academic sector, to come together to learn, discover, share and connect. This year’s theme is: The Future of Data Science for Public Good. Register here!

Anyone interested in presenting their latest developments and research at the Royal Statistical Society Conference? The organisers of this year’s event – which will take place in Aberdeen from 12-15 September – are calling for submissions for 20-minute and rapid-fire 5-minute talks to include on the programme.  Submissions are welcome on any topic related to data science and statistics.  Full details can be found here. The deadline for submissions is 5 April.

Janet Bastiman (Chief Data Scientist at NapierAI) recorded a podcast with Moodys on “AI and transparent boxes”, looking at the use of AI in detecting financial crime and explainability- will post the link once it is published.

Giles Pavey (Global Director Data Science at Unilever) was interviewed for the Data Storytellers podcast about his career in data science – check it out here.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on April 13th, when Martha White (Associate Professor of Computing Science at the University of Alberta) discusses her research on “Advances in Value Estimation in Reinforcement Learning”. Videos are posted on the meetup youtube channel – and future events will be posted here.

As we highlight in the Members and Contributors section, Martin was interviewed by the American Statistical Association (ASA) about Practical Data Science & The UK’s AI Roadmap

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"The notion of a killer robot—where you have artificial intelligence fused with weapons—that technology is here, and it's being used,” says Zachary Kallenborn, a research affiliate with the National Consortium for the Study of Terrorism and Responses to Terrorism (START).
That short-lived saga could be the first weaponized use of deepfakes during an armed conflict, although it is unclear who created and distributed the video and with what motive. The way the fakery unraveled so quickly shows how malicious deepfakes can be defeated—at least when conditions are right.

Not all people targeted by deepfakes will be able to react as nimbly as Zelensky—or find their repudiation so widely trusted. “Ukraine was well positioned to do this,” Gregory says. “This is very different from other cases, where even a poorly made deepfake can create uncertainty about authenticity.”
While debates are heating up on AI campaigning, the National Election Commission (NEC) is yet to determine whether it is legitimate or not. "It is difficult to make a finding on whether it is against the laws governing campaigning or not because it is uncertain how the technologies will be used in the campaign," an NEC official said.
Just as clickable icons have replaced obscure programming commands on home computers, new no-code platforms replace programming languages with simple and familiar web interfaces. And a wave of start-ups is bringing the power of A.I. to nontechnical people in visual, textual and audio domains. 

… there are also obvious downsides, with the increased risk of misapplication a key one…

“If you’re using low-code, no-code, you don’t really have a good sense of the quality of the ingredients coming in, and you don’t have a sense of the quality of the output either,” he said. While low- and no-code software have value for use in training or experimentation, “I just wouldn’t apply it in subject areas where the accuracy is paramount”.
Surprisingly, we find that anger travels easily along weaker ties than joy, meaning that it can infiltrate different communities and break free of local traps because strangers share such content more often
When AI gets attention for recovering lost works of art, it makes the technology sound a lot less scary than when it garners headlines for creating deep fakes that falsify politicians’ speech or for using facial recognition for authoritarian surveillance.

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

"I have rarely been as enthusiastic about a new research direction. We call them GFlowNets, for Generative Flow Networks. They live somewhere at the intersection of reinforcement learning, deep generative models and energy-based probabilistic modelling"
"µP provides an impressive step toward removing some of the black magic from scaling up neural networks. It also provides a theoretically backed explanation of some tricks used by past work, like the T5 model. I believe both practitioners and researchers alike will find this work valuable."
  • AI explainability continues to be a hot research topic. Most widely used approaches attempt to ‘explain’ a given AI output by approximating the local decision criteria. ‘CX-TOM’ looks to be an interesting new approach in which it “generates sequence of explanations in a dialog by mediating the differences between the minds of machine and human user”
  • Speaking of ‘minds’ … useful summary of recent Neuroscience/ML research
Reading and being aware of the evolution and new insights in neuroscience not only will allow you to be a better “Artificial Intelligence” guy 😎, but also a finer neural network architectures creator 👩‍💻!
  • Comprehending images and videos is something we all take for granted as humans. However, it is an incredibly complex task for AI systems, and although they have got a lot better in recent years, even the best systems can still be easily led astray. So research continues, particularly in understanding actions and processes:
  • Even with the breakthroughs of GPT-3 and other large language models, comprehension and trust (almost “common sense”) are still huge challenges in natural language processing as well. Researchers at DeepMind have released GopherCite which adds a bit more “sense” to the responses given to factual questions (great quote below… emphasis mine!)
“Recent large language models often answer factual questions correctly. But users can't trust any given claim a model makes without fact-checking, because language models can hallucinate convincing nonsense. In this work we use reinforcement learning from human preferences (RLHP) to train "open-book" QA models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness"
The standard model for sequential decision-making under uncertainty is the Markov decision process (MDP). It assumes that actions are under control of the agent, whereas outcomes produced by the environment are random ... This, famously, leads to deterministic policies which are brittle — they “put all eggs in one basket”. If we use such a policy in a situation where the transition dynamics or the rewards are different from the training environment, it will often generalise poorly.

We want to train a policy that works well, even in the worst-case given our uncertainty. To achieve this, we model the environment to not be simply random, but being (partly) controlled by an adversary that tries to anticipate our agent’s behaviour and pick the worst-case outcomes accordingly.
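The contrast between optimising for the average case and optimising for the worst case can be illustrated with a one-step toy decision. This is only a sketch, not the method from the quoted article: the action names, the reward table and the nominal probabilities are all made up for illustration, and a real robust MDP would apply the same idea across many time steps.

```python
# Hypothetical reward table: REWARDS[action][scenario].
# The three scenarios stand in for environments the adversary may pick.
REWARDS = {
    "risky":    [10.0, 9.0, -8.0],
    "balanced": [4.0, 3.0, 2.0],
    "timid":    [1.0, 1.0, 1.0],
}

def expected_value_choice(rewards, probs):
    """Pick the action with the highest expected reward under assumed probabilities."""
    def expected(action):
        return sum(p * r for p, r in zip(probs, rewards[action]))
    return max(rewards, key=expected)

def worst_case_choice(rewards):
    """Pick the action whose worst outcome is best (the adversary chooses the scenario)."""
    return max(rewards, key=lambda action: min(rewards[action]))

# Nominal probabilities are an assumption; the third scenario is believed to be rare.
nominal = expected_value_choice(REWARDS, [0.45, 0.45, 0.10])
robust = worst_case_choice(REWARDS)
print("expected-value choice:", nominal)
print("worst-case choice:", robust)
```

The expected-value rule bets everything on the rare scenario not occurring, while the worst-case rule gives up some average reward in exchange for a guaranteed floor.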
Stochastic gradient descent (SGD) is perhaps the most popular optimization algorithm for deep neural networks. Due to the non-convex nature of the deep neural network’s optimization landscape, different runs of SGD will find different solutions. As a result, if the solutions are not perfect, they will disagree with each other on some of the unseen data. This disagreement can be harnessed to estimate generalization error without labels:

1) Given a model, run SGD with the same hyperparameters but different random seeds on the training data to get two different solutions.
2) Measure how often the networks’ predictions disagree on a new unlabeled test dataset.
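Step 2 above reduces to a simple count. The sketch below assumes the two trained networks have already produced class predictions for the same unlabelled examples; the prediction lists are invented for illustration, and the function name is hypothetical rather than taken from the paper.

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of examples on which two models predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both runs must predict on the same examples")
    if not preds_a:
        raise ValueError("need at least one prediction")
    mismatches = sum(a != b for a, b in zip(preds_a, preds_b))
    return mismatches / len(preds_a)

# Made-up predictions from two runs that differ only in random seed.
run_seed_0 = [0, 1, 1, 0, 2, 2, 1, 0, 1, 2]
run_seed_1 = [0, 1, 2, 0, 2, 1, 1, 0, 1, 2]

# Per the quoted passage, this rate is used as a label-free estimate of test error.
estimated_error = disagreement_rate(run_seed_0, run_seed_1)
print(f"disagreement rate: {estimated_error:.2f}")
```

No labels appear anywhere in the calculation, which is the point: how well the rate tracks the true error depends on the conditions discussed in the linked research.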

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

This analysis showed that different parts of the brain work together in surprising ways that differ from current neuroscientific wisdom. In particular, the study calls into question our current understanding of how brains process emotion
“Some operators have more robust responsible gambling programs than others,” says Lia Nower, director of the Center for Gambling Studies at Rutgers University. “But in the end there is a profit motive and I have yet to see an operator in the U.S. put the same amount of money and effort into developing a system for identifying and assisting at-risk players as they do developing A.I. technologies for marketing or extending credit to encourage players to return.”
  • Nowcasting is a useful concept in the modern world – how can we make the most of whatever information is currently available to understand the state of the world now or in the near future? Good progress in near-time precipitation forecasting. (“Alexa, should I bring an umbrella?” … “I don’t know, let me check my DGMR PySteps model”…)
Instead of relying on expert opinion, the computer scientists used a mathematical approach known as stylometry. Practitioners say they have replaced the art of the older studies with a new form of science, yielding results that are measurable, consistent and replicable.
"I too was pretty skeptical of Copilot when I started using it last summer.
However it is shockingly good for filling out Python snippets - ie smarter autocomplete when teaching.

Popular libraries like Pandas, Beautiful Soup, Flask are perfect for this.

About 80% time it will fill out the code exactly they way I would want. About 10% time it will be something you want to correct or nudge.

Then about 10% of time it will be a howler or anti-pattern."

How does that work?
A new section on understanding different approaches and techniques

I still struggle with the basic 4 dimensions of our physical world. When I first heard about 768-dimension embeddings, I feared my brain would escape from my ear. If you can relate, if you want to truly master the tricky subject of NLP encoding, this article is for you.
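High-dimensional embeddings are less intimidating once you see that they are just lists of numbers compared with a similarity score. The sketch below uses made-up 3-dimensional vectors in place of real 768-dimensional ones and a brute-force cosine-similarity ranking; the words, the values and the function names are all hypothetical.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query, index, k=1):
    """Return the k keys whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda name: cosine_similarity(query, index[name]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings": invented values, 3 dimensions instead of 768.
INDEX = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}

top = nearest([0.85, 0.2, 0.05], INDEX, k=2)
print(top)
```

The same code works unchanged for 768 dimensions; only the length of the lists changes.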
Surprisingly few software engineers and scientists seem to know about it, and that makes me sad because it is such a general and powerful tool for combining information in the presence of uncertainty. At times its ability to extract accurate information seems almost magical— and if it sounds like I’m talking this up too much, then take a look at this previously posted video where I demonstrate a Kalman filter figuring out the orientation of a free-floating body by looking at its velocity. Totally neat!
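To make "combining information in the presence of uncertainty" concrete, here is a minimal one-dimensional Kalman filter. It is a sketch under simplifying assumptions not taken from the linked article: the quantity being tracked is a single constant value, the measurement noise variance is known, and the readings are invented.

```python
def kalman_1d(measurements, meas_var, process_var=0.0,
              init_mean=0.0, init_var=1e6):
    """Track one scalar state; returns a list of (mean, variance) after each reading."""
    mean, var = init_mean, init_var
    history = []
    for z in measurements:
        # Predict: the state is assumed constant, so only the uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement, weighted by their uncertainties.
        gain = var / (var + meas_var)
        mean += gain * (z - mean)
        var *= (1.0 - gain)
        history.append((mean, var))
    return history

# Made-up noisy readings of a value near 10, with assumed noise variance 0.25.
readings = [10.2, 9.7, 10.1, 9.9, 10.3]
final_mean, final_var = kalman_1d(readings, meas_var=0.25)[-1]
print(f"estimate {final_mean:.3f}, variance {final_var:.3f}")
```

With a very vague prior and no process noise, the estimate settles close to the average of the readings, and the variance shrinks with every measurement. The full filter generalises this to vectors and matrices.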
  • One thing we all do on a regular basis is load up some data and then try and get a feel for it- how big, how many dimensions, what are the characteristics of and relationships between the dimensions etc etc. I normally just plug away in pandas, but there are now various elegant ‘profiling’ packages that do a lot of the work for you, well worth exploring:
  • Airflow is a great open source tool for scheduling and orchestration, well worth getting to know – an introduction here
  • Useful lower level background on Deep Learning – understanding where to focus and what to focus on- from Horace He
  • If you are investigating Deep Learning, it is increasingly likely you will be using PyTorch. This looks like a very useful add on for recommenders (TorchRec), and this ‘NN template‘ could be useful in setting up your PyTorch projects.
  • This is very elegant – a visual introduction to machine learning
  • Finally, an excellent review of ML Competitions over the last year across Kaggle and other platforms from newsletter subscribers Harald Carlens and Eniola Olaleye (shorter version here) – lots of great insight into the libraries and approaches used.

Practical tips
How to drive analytics and ML into production

“In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn.”
  • How should you structure your data team? One role that is often overlooked is the data product manager – good discussion on why this role is so important
  • Ok… so you have your team set up, how should you run it? What principles should you adhere to? Great suggestions here (“0/1/Done Strategy”) from newsletter subscriber Marios Perrakis
  • When you have models, pipelines and decision tools in production, being used across the organisation, you need to know they are working… or at least know when something has gone wrong. That is where ‘observability’ comes in – incredibly useful if you can get it right.
  • Part of observability is understanding why something has changed. This is well worth a read- are there ways you can automatically explain changes in aggregations through ‘data-diff algorithms‘?
  • How Netflix built their ‘trillions scale’ real time data platform
  • We talk about MLOps on a reasonably regular basis – how best to implement, manage and monitor your machine learning models in production. Still struggling to figure out the right approach? You are definitely not the only one – “MLOps is a mess”
MLOps is in a wild state today with the tooling landscape offering more rare breeds than an Amazonian rainforest.

To give an example, most practitioners would agree that monitoring your machine learning models in production is a crucial part of maintaining a robust, performant architecture.

However when you get around to picking a provider I can name 6 different options without even trying: Fiddler, Arize, Evidently, Whylabs, Gantry, Arthur, etc. And we haven’t even mentioned the pure data monitoring tools.

Bigger picture ideas
Longer thought provoking reads – musings from some of the ‘OGs’ this month! – lean back and pour a drink!

"Comprehension is a poorly-defined term, like many terms that frequently show up in discussions of artificial intelligence: intelligence, consciousness, personhood. Engineers and scientists tend to be uncomfortable with poorly-defined, ambiguous terms. Humanists are not.  My first suggestion is that  these terms are important precisely because they’re poorly defined, and that precise definitions (like the operational definition with which I started) neuters them, makes them useless. And that’s perhaps where we should start a better definition of comprehension: as the ability to respond to a text or utterance."
"To think that we can simply abandon symbol-manipulation is to suspend disbelief. "
But the most important trend I want to comment on is that the whole setting of training a neural network from scratch on some target task (like digit recognition) is quickly becoming outdated due to finetuning, especially with the emergence of foundation models like GPT. These foundation models are trained by only a few institutions with substantial computing resources, and most applications are achieved via lightweight finetuning of part of the network, prompt engineering, or an optional step of data or model distillation into smaller, special-purpose inference networks
To summarise: suppose you have an unfair coin that lands on heads 3 times out of 4. If you toss this coin 16 times, you would expect to see 12 heads (H) and 4 tails (T) on average. Of course you wouldn’t expect to see exactly 12 heads and 4 tails every time: there’s a pretty good chance you’d see 13 heads and 3 tails, or 11 heads and 5 tails. Seeing 16 heads and no tails would be quite surprising, but it’s not implausible: in fact, it will happen about 1% of the time. Seeing all tails seems like it would be a miracle. Nevertheless, each coin toss is independent, so even this has a non-zero probability of being observed.

If we do not ignore the order, and ask which sequence is the most likely, the answer is ‘all heads’. That may seem surprising at first, because seeing only heads is a relatively rare occurrence. But note that we’re asking a different question here, about the ordered sequences themselves, rather than about their statistics
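The numbers in the coin example are easy to check. The sketch below uses the binomial formula for head counts and a plain product for ordered sequences; the helper names are our own, not from the quoted article.

```python
from math import comb

P_HEADS = 0.75   # the unfair coin from the example
N_TOSSES = 16

def prob_exact_heads(k, n=N_TOSSES, p=P_HEADS):
    """Probability of exactly k heads in n tosses, ignoring order."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_sequence(seq, p=P_HEADS):
    """Probability of one specific ordered sequence such as 'HHTH'."""
    prob = 1.0
    for toss in seq:
        prob *= p if toss == "H" else (1 - p)
    return prob

p_all_heads = prob_exact_heads(16)
most_likely_count = max(range(N_TOSSES + 1), key=prob_exact_heads)
one_typical_sequence = prob_sequence("H" * 12 + "T" * 4)

print(f"P(16 heads)               = {p_all_heads:.4f}")
print(f"most likely head count    = {most_likely_count}")
print(f"P(one 12H/4T sequence)    = {one_typical_sequence:.6f}")
print(f"P(the all-heads sequence) = {prob_sequence('H' * 16):.6f}")
```

The apparent paradox disappears once you count: there are `comb(16, 12) = 1820` different orderings with 12 heads, each individually less likely than the single all-heads sequence, but together far more likely.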

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

"3. Treat research hypotheses like impressionist paintings

The big picture looks coherent but the details wash out when scrutinized. Use vague sciency sounding concepts that can mean anything. 

Don't show it to the statistician until the end of the study. its best as a surprise"

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

  • However, no-one told the virus. The latest results from the ONS tracking study estimate 1 in 16 people (over 6%) in England have Covid. It’s worse in Scotland where the figure is 1 in 11. This is as bad as it has ever been in the whole 2+ years of the pandemic and a far cry from the 1 in 1000 we had last summer. Bear in mind in the chart below that the levels we had in February 2021 were enough to drive a national lockdown …

Updates from Members and Contributors

Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

  • Holisticai, a startup focused on providing insight, assessment and mitigation of AI risk, has a number of relevant AI related job openings- see here for more details
  • EvolutionAI are looking for a machine learning research engineer to develop their award winning AI-powered data extraction platform, putting state of the art deep learning technology into production use. Strong background in machine learning and statistics required
  • AstraZeneca are looking for a Data Science and AI Engagement lead – more details here
  • Lloyds Register are looking for a data analyst to work across the Foundation with a broad range of safety data to inform the future direction of challenge areas and provide society with evidence-based information.
  • Cazoo is looking for a number of senior data engineers – great modern stack and really interesting projects!

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

March Newsletter

Hi everyone-

Another month flies by – at least it finally seems to be getting a bit lighter in the mornings although I fear sunny spring days are still a way off… I imagine you are suffering withdrawal from a lack of dramatic Olympics Curling action so perhaps some thought provoking data science reading materials to fill the void…

Following is the March edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. Check out our new ‘Jobs!’ section, an extra incentive to read to the end!

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science March 2022 Newsletter

RSS Data Science Section

Committee Activities

We have all been shocked and saddened by events in Ukraine and our thoughts and best wishes go out to everyone affected.

The committee is busy planning out our activities for the year with lots of exciting events and even hopefully some in-person socialising… Watch this space for upcoming announcements.

We are very pleased to announce that Jennifer Hall, Senior AI Lab Data Scientist at NHSX and Will Browne, Associate Partner – Data Science & Analytics at CF Healthcare are both joining the Data Science and AI Section committee. They bring a wealth of talent and experience in all aspects of data science and we are very much looking forward to their contributions across our various activities.

Florian Ostmann has been involved with recent developments of the AI Standards Hub pilot (led by the Alan Turing Institute, in partnership with BSI and NPL)

Anyone interested in presenting their latest developments and research at the Royal Statistical Society Conference? The organisers of this year’s event – which will take place in Aberdeen from 12-15 September – are calling for submissions for 20-minute and rapid-fire 5-minute talks to include on the programme.  Submissions are welcome on any topic related to data science and statistics.  Full details can be found here. The deadline for submissions is 5 April.

Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The next one is on March 9th when Lucas Beyer, a Researcher at Google Brain Zurich, will discuss his research on “Learning General Visual Representations“. Videos are posted on the meetup youtube channel – and future events will be posted here.

Help RSS to support the data science community

The Royal Statistical Society (RSS) is developing resources to support everyone working in data science to meet their learning and development goals and career objectives. If you have an interest in data science, we invite you to take part in this survey, whether or not you are a member of RSS. 

The survey should take around 15 minutes to complete. Your responses will be invaluable in helping us to understand and meet the wants and needs of the data science community, and to support your work in this exciting, fast-developing field.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"The UK Statistics Authority have written to Downing St to advise them that the Prime Minister's claim that there are more people in work now than at the start of the pandemic is wrong. He has now made this claim 7 times but knows it is wrong! When will he correct the record?!."
"Those surveyed were asked: suppose there was a diagnostic test for a virus. The false-positive rate (the proportion of people without the virus who get a positive result) is one in 1,000. You have taken the test and tested positive. What is the probability that you have the virus? Of the politicians surveyed, 16 per cent gave the correct answer that there was not enough information to know."
The truth is AI failures are not a matter of if but when. AI is a human endeavor that combines information about people and the physical world into mathematical constructs. Such technologies typically rely on statistical methods, with the possibility for errors throughout an AI system’s lifespan. As AI systems become more widely used across domains, especially in high-stakes scenarios where people’s safety and wellbeing can be affected, a critical question must be addressed: how trustworthy are AI systems, and how much and when should people trust AI? 
We found two key through lines: Lawmakers and the public lack fundamental access to information about what algorithms their agencies are using, how they’re designed, and how significantly they influence decisions.
Tesla Chief Executive Officer Elon Musk said on Twitter "there were no safety issues" with the function. "The car simply slowed to ~2 mph & continued forward if clear view with no cars or pedestrians," Musk wrote.
To train InstructGPT, OpenAI hired 40 people to rate GPT-3’s responses to a range of prewritten prompts, such as, “Write a story about a wise frog called Julius” or “Write a creative ad for the following product to run on Facebook.” Responses that they judged to be more in line with the apparent intention of the prompt-writer were scored higher. Responses that contained sexual or violent language, denigrated a specific group of people, expressed an opinion, and so on, were marked down. This feedback was then used as the reward in a reinforcement learning algorithm that trained InstructGPT to match responses to prompts in ways that the judges preferred.
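Returning to the diagnostic-test question quoted earlier in this section: the reason "not enough information" is the correct answer is that the false-positive rate alone does not fix the result. The sketch below applies Bayes' rule with two assumed prevalence values and an assumed perfect sensitivity; none of those figures come from the survey.

```python
def posterior_infected(prevalence, false_positive_rate, sensitivity):
    """P(has virus | positive test) via Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

FALSE_POSITIVE_RATE = 1 / 1000   # the only figure given in the question

# Assumed values for illustration only: prevalence and sensitivity are not given.
rare = posterior_infected(prevalence=0.0001,
                          false_positive_rate=FALSE_POSITIVE_RATE,
                          sensitivity=1.0)
common = posterior_infected(prevalence=0.05,
                            false_positive_rate=FALSE_POSITIVE_RATE,
                            sensitivity=1.0)

print(f"virus in 1 in 10,000 people: P(infected | positive) = {rare:.3f}")
print(f"virus in 1 in 20 people:     P(infected | positive) = {common:.3f}")
```

The same test gives very different answers depending on how common the virus is, which is exactly the information the question leaves out.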

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

"Synthetically generated faces are not just highly photorealistic, they are nearly indistinguishable from real faces and are judged more trustworthy"
  • The researchers at Facebook/Meta have been busy:
    • They have built their own super-computer, dubbed the AI Research Super Cluster...
    • They have developed a Natural Language Processing (NLP) approach that does not use text or labels at all – it is able to learn directly from raw audio signals- pretty astonishing!
"GSLM leverages recent breakthroughs in representation learning, allowing it to work directly from only raw audio signals, without any labels or text. It opens the door to a new era of textless NLP applications for potentially every language spoken on Earth—even those without significant text data sets."
“People can flexibly maneuver objects in their physical surroundings to accomplish various goals. One of the grand challenges in robotics is to successfully train robots to do the same, i.e., to develop a general-purpose robot capable of performing a multitude of tasks based on arbitrary user commands"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

“It’s an incredibly powerful method,” says Jonathan Citrin at the Dutch Institute for Fundamental Energy Research, who was not involved in the work. “It’s an important first step in a very exciting direction.”
“Outracing human drivers so skillfully in a head-to-head competition represents a landmark achievement for AI,” said Chris Gerdes, a professor at Stanford who studies autonomous driving, in an article published on Wednesday alongside the Sony research in the journal Nature.
"To help clinicians avoid remedies that may potentially contribute to a patient’s death, researchers at MIT and elsewhere have developed a machine-learning model that could be used to identify treatments that pose a higher risk than other options"

How does that work?
A new section on understanding different approaches and techniques

"A* is a modification of Dijkstra’s Algorithm that is optimized for a single destination. Dijkstra’s Algorithm can find paths to all locations; A* finds paths to one location, or the closest of several locations. It prioritizes paths that seem to be leading closer to a goal."
"Vector databases are purpose-built to store, index, and query across embedding vectors generated by passing unstructured data through machine learning models."

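The A* description quoted above can be sketched on a small grid. This is a minimal illustration rather than the linked article's code: the grid, the 4-way movement, the unit step cost and the Manhattan-distance heuristic are all assumptions chosen to keep the example short.

```python
import heapq

def a_star(grid, start, goal):
    """Shortest number of steps from start to goal on a grid (0 = free, 1 = wall).

    Returns None when the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates on a 4-connected unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]   # (priority, cost so far, cell)
    best_cost = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale queue entry, a cheaper route was found already
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    priority = new_cost + heuristic((nr, nc))
                    heapq.heappush(frontier, (priority, new_cost, (nr, nc)))
    return None

GRID = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
steps = a_star(GRID, (0, 0), (3, 3))
print("steps to goal:", steps)
```

Setting the heuristic to zero turns this into Dijkstra's Algorithm; the heuristic is what makes the search prioritise cells that seem closer to the goal.
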
Practical tips
How to drive analytics and ML into production

"Professor: “Yes, outstanding. However, you failed to ask me what metrics I used to grade your model. Your opinion of model quality doesn’t matter. It’s your users’ needs that do.”

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink!

"A lot has happened in the past half century! The eight ideas reviewed below represent a categorization based on our experiences and reading of the literature and are not listed in a chronological order or in order of importance. They are separate concepts capturing different useful and general developments in statistics."
  • There are lots of “here are all the problems with statistical significance” type articles out there, but the visual examples in this one make it more compelling than many
"You can have a miniscule effect size and still have a significant effect. Do we always prefer the (c) to the (a)? Is a meager, but mostly positive benefit necessarily better than a treatment potentially of large benefit to some but harmful to others necessarily? Wouldn’t it be in our interest to understand this spread of outcomes so we could isolate the group of individuals who benefit from the treatment?'”
"So consider this Deep Blue’s final gift, 25 years after its famous match. In his defeat, Kasparov spied the real endgame for AI and humans. “We will increasingly become managers of algorithms,” he told me, “and use them to boost our creative output—our adventuresome souls.”
But in the future, he says, systems will be needed that can handle all other scenarios as well: “It’s not just about the trajectory of a missile or the movement of a robotic arm, which can be modeled through careful mathematics. It’s about everything else, everything we observe in the world: About human behavior, about physical systems that involve collective phenomena like water or branches in a tree, about complex things for which humans can easily develop abstract representations and models,” LeCun said

Bringing data to life – the art and science of visualisation
Leland Wilkinson, author of Grammar of Graphics, sadly passed away at the end of last year. Hadley Wickham created ggplot2 as a way to implement the ideas contained in this formative work (gg = grammar of graphics) and I know I for one have been heavily influenced by it in how I think about visualisation. In memory of Leland I thought it would be fitting to call out some recent articles of interest in the field.

"The problem with guidelines based on precision is that visualization is not really about precision. Sure, there are cases where precision matters because it allows readers to detect important differences that would otherwise be missed. But visualization is less about precision, and much  more about what the visual representation expresses."

Covid Corner

Well, apparently Covid is now all over according to the UK government, or at least there is no need for any more restrictions…

  • Given the government is removing requirements and incentives to test for Covid, the ONS Coronavirus infection survey is now one of the only ways we can tell the prevalence of the virus in our society.
  • The latest results estimate 1 in 25 people (4%) in England have Covid. While this is down from its peak of 1 in 15 in January it is still a long way from the 1 in 1000 we had last summer. Bear in mind in the chart below that the levels we had in February 2021 were enough to drive a national lockdown …

Updates from Members and Contributors

  • Jona Shehu and her colleagues at Helix Data Innovation are hosting what looks to be a high quality and relevant online roundtable on model explainability with leaders across the AI, finance, consumer rights and data governance sectors. The event is on March 15th (11-12.30) and is free to attend. Register here
  • Kevin OBrien highlights the inaugural SciMLCon (of the Scientific Machine Learning Open Source Software Community) taking place online on Wednesday 23rd March 2022. Core topics include: Physics-Informed Model Discovery and Learning, Compiler-Assisted Model Analysis and Sparsity Acceleration, ML-Assisted Tooling for Model Acceleration and many more. SciMLCon is focused on the development and applications of the Julia-based SciML tooling, with expansion into R and Python planned in the near future.
  • Maria Rosario Mestre is CEO of DataQA which offers tools to search, label and organise unstructured documents: sounds very useful! They are currently enrolling beta customers for the first release of the platform which includes a free trial so could be well worth checking out.

Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

  • Holisticai, a startup focused on providing insight, assessment and mitigation of AI risk, has a number of relevant AI related job openings- see here for more details
  • EvolutionAI are looking for a machine learning research engineer to develop their award winning AI-powered data extraction platform, putting state of the art deep learning technology into production use. Strong background in machine learning and statistics required
  • AstraZeneca are looking for a Data Science Training Developer – more details here
  • Lloyds Register are looking for a data analyst to work across the Foundation with a broad range of safety data to inform the future direction of challenge areas and provide society with evidence-based information.
  • Cazoo is looking for a number of senior data engineers – great modern stack and really interesting projects!

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

February Newsletter

Hi everyone-

Well, January seemed to flash by in the blink of an eye- certainly the holiday period seems a long time ago already. All is not lost- the Winter Olympics seems to have crept up on us and is just about to start which will no doubt provide some entertainment and distraction…. as I hope will some thought provoking data science reading materials.

Following is the February edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. Check out our new ‘Jobs!’ section, an extra incentive to read to the end!

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science February 2022 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

The committee is busy planning out our activities for the year with lots of exciting events and even hopefully some in-person socialising… Watch this space for upcoming announcements.

We do in fact have a couple of spaces opening up on our committee (RSS Data Science and AI Section) – if you are interested in learning more please contact James Weatherall

Anyone interested in presenting their latest developments and research at the Royal Statistical Society Conference? The organisers of this year’s event – which will take place in Aberdeen from 12-15 September – are calling for submissions for 20-minute and rapid-fire 5-minute talks to include on the programme.  Submissions are welcome on any topic related to data science and statistics.  Full details can be found here. The deadline for submissions is 5 April.

Our very own Giles Pavey took part in a panel debate, exploring the role of AI in creating trustworthy digital commerce – see recording here

Meanwhile, Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The next talk will be tomorrow (February 2nd) where Sebastian Flennerhag, research scientist at DeepMind, will give a talk entitled “Towards machines that teach themselves“. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

  • With the anniversary of the January 6th attack on the US Capitol, there is commentary in the mainstream press about misinformation and how algorithms can both exacerbate and help curb the problem – see here in the Washington Post for example.
"The provocative idea behind unrest prediction is that by designing an AI model that can quantify variables — a country’s democratic history, democratic “backsliding,” economic swings, “social-trust” levels, transportation disruptions, weather volatility and others — the art of predicting political violence can be more scientific than ever."
  • We’ve posted previously about bias in recruiting and hiring algorithms – so it’s welcome to see the Data and Trust Alliance‘s publication of their Algorithmic Bias Safeguards for Workforce: criteria and education for HR teams to evaluate vendors on their ability to detect, mitigate, and monitor algorithmic bias in workforce decisions
  • There was an interesting recent recommendation from the UK Law Commission that users of self-driving cars should have immunity from a wide range of motoring offences. This is increasingly relevant, as the various self-driving car providers move towards commercial propositions- Waymo (Google/Alphabet’s self-driving unit), for instance, recently announced its first commercial autonomous trucking customer (interesting background on how Waymo does what it does here)
"While a vehicle is driving itself, we do not think that a human should be required to respond to events in the absence of a transition demand (a requirement for the driver to take control). It is unrealistic to expect someone who is not paying attention to the road to deal with (for example) a tyre blow-out or a closed road sign. Even hearing ambulance sirens will be difficult for those with a hearing impairment or listening to loud music.”
"People were more likely to roll with a positive suggestion than a negative one— participants also often found themselves in a situation where they wanted to disagree, but were only offered expressions of agreement. The effect is to make a conversation go faster and more smoothly" ... 
... "This technology (combined with our own suggestibility) could discourage us from challenging someone, or disagreeing at all. In making our communication more efficient, AI could also drum our true feelings out of it, reducing exchanges to bouncing “love it!” and “sounds good!” back at each other"

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

  • The research theme around making models more ‘efficient’ (whether that’s in terms of power consumption, model size, data usage etc) continues:
    • Focusing on reducing computational cost for low power network-edge usage, ‘Mobile-Former‘ breaks all sorts of records
    • Interesting research into reducing/simplifying inputs to neural net models looks promising … and they said feature engineering was dead;-)
    • More progress on ‘few-shot learning’ (making accurate predictions with limited examples) – this time with ‘HyperTransformers’
    • Active Learning is an elegant approach to improving sample efficiency by focusing efforts in the most productive areas of the data space – however, watch out for outliers
  • Then some more random research directions…
“However, Automated Reinforcement Learning (AutoRL) involves not only standard applications of AutoML but also includes additional challenges unique to RL, that naturally produce a different set of methods. As such, AutoRL has been emerging as an important area of research in RL, providing promise in a variety of applications from RNA design to playing games such as Go.”
"Over the last several decades, I've witnessed a lot of change in the fields of machine learning (ML) and computer science. Early approaches, which often fell short, eventually gave rise to modern approaches that have been very successful. Following that long-arc pattern of progress, I think we'll see a number of exciting advances over the next several years, advances that will ultimately benefit the lives of billions of people with greater impact than ever before"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

  • What seems like our now-monthly update from ETH Zürich’s Robotic Systems Lab: this time, ‘robots learning to hike’ (cue robot-dog interaction videos…).
  • In order for robots to take action, they have to understand the world around them, a far from trivial task: a couple of useful developments in this space using large language models to understand the relationship between objects and relevant actions, from MIT and also from Carnegie-Mellon/ Google Brain
“In an effort to solve this problem, MIT researchers have developed a model that understands the underlying relationships between objects in a scene. Their model represents individual relationships one at a time, then combines these representations to describe the overall scene. This enables the model to generate more accurate images from text descriptions, even when the scene includes several objects that are arranged in different relationships with one another.”
“It's a monumental shift,” says Jahmy Hindman, Deere’s chief technology officer, of the new machine, revealed at the 2022 Consumer Electronics Show in Las Vegas. “I think it's every bit as big as the transition from horse to tractor.”
"'Is it safe to walk downstairs backwards if I close my eyes?'

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

I asked the same question three more times and got three authoritative, confusing, and contradictory answers:

GPT-3: That depends. Do you have a TV?
GPT-3: No, it is not safe.
GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes."
“You’re playing a pot that’s effectively worth half a million dollars in real money,” he said afterward. “It’s just so much goddamned stress.”

How does that work?
A new section on understanding different approaches and techniques

  • For those with a programming background, vectorisation may come naturally, but it can be hard to think through if you are new to it … it does speed things up though, so worth digging into: good python tutorial here.
  • We are a section of the Royal Statistical Society, so it’s good to see a bit of stats once in a while: ‘Six Statistical Critiques That Don’t Quite Work’
  • If you’ve not come across Streamlit, you should definitely check it out – very quick and easy way to create apps in python.
  • JAX is a relatively new but very scalable framework for numerical methods (Bayesian sampling etc) developed at Google and used heavily at DeepMind – definitely worth exploring
  • It’s always good to understand at a low level how different modelling approaches work. If you’re unclear on the fundamentals of neural networks, this is an excellent introductory guide from Simon Hørup Eskildsen (love that it’s called ‘Napkin Math’!)
"In this edition of Napkin Math, we'll invoke the spirit of the Napkin Math series to establish a mental model for how a neural network works by building one from scratch"
  • I know, we’ve had a fair few ‘this is how Transformers work’ posts over the last few months… but they are so central to many of the image processing and NLP improvements over the last few years that checking out another good one couldn’t hurt..
"It was in the year 2017, the NLP made the key breakthrough. Google released a research paper “Attention is All you need” which introduced a concept called Attention. Attention helps us to focus only on the required features instead of focusing on all features. Attention mechanism led to the development of the Transformer and Transformer-based models.."
  • Finally, variational autoencoders... unsupervised learning is an area of data science that can sometimes feel neglected, and variational autoencoders are a fantastic tool in the unsupervised learning arsenal, leveraging the power of Deep Learning.
  • For anyone interested in learning more about how DeepMind does what it does, I definitely recommend Hannah Fry‘s podcast- the last episode, ‘A breakthrough unfolds‘ tells the story well of how they went from winning at Go to predicting protein structures…
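
To make the vectorisation point above a little more concrete, here is a minimal sketch (our own illustration, not taken from the linked tutorial) that computes the same sum of squared differences twice: once with an explicit Python loop and once as whole-array NumPy operations. The function names and sample data are made up for the example.

```python
import numpy as np


def squared_distance_loop(a, b):
    """Sum of squared differences, computed one element at a time."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
    return total


def squared_distance_vectorised(a, b):
    """Same calculation expressed as operations on whole arrays."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    # The dot product of a vector with itself is the sum of its squares.
    return float(np.dot(diff, diff))


# Hypothetical sample data, chosen only so the result is easy to check by hand.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print(squared_distance_loop(a, b))        # 14.0
print(squared_distance_vectorised(a, b))  # 14.0
```

On three elements the difference is invisible, but on arrays with millions of entries the vectorised version is typically much faster, because the loop runs in compiled code rather than in the Python interpreter.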

Practical tips
How to drive analytics and ML into production

"I’m not a management expert, but I did try really hard during my first year managing and I’ve since spent time digesting the experience. My hope is that others will find a few of the things I learned useful when they’re at the start of their own management journey.”

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink!

"Isaac Newton apocryphally discovered his second law – the one about gravity – after an apple fell on his head. Much experimentation and data analysis later, he realised there was a fundamental relationship between force, mass and acceleration. He formulated a theory to describe that relationship – one that could be expressed as an equation, F=ma – and used it to predict the behaviour of objects other than apples. His predictions turned out to be right (if not always precise enough for those who came later).

Contrast how science is increasingly done today."
"These schemas were the subject of a competition held in 2016 in which the winning program was correct on only 58% of the sentences — hardly a better result than if it had guessed. Oren Etzioni, a leading AI researcher, quipped, 'When AI can’t determine what ‘it’ refers to in a sentence, it’s hard to believe that it will take over the world.'”
"Repeatedly tap on a box of marbles or sand and the pieces will pack themselves more tightly with each tap. However, the contents will only approach its maximum density after a long time and if you use a carefully crafted tapping sequence. But in new experiments with a cylinder full of dice vigorously twisted back and forth, the pieces achieved their maximum density quickly. The experiments could point to new methods to produce dense and technologically useful granular systems, even in the zero gravity environments of space missions."

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

Covid Corner

Although there are still some Covid restrictions in place, the UK Government has eased a number of rules: to be fair, it’s quite hard to keep track. Omicron is far from gone though…

Updates from Members and Contributors

  • Kevin OBrien highlights a couple of excellent events:
    • The inaugural SciMLCon (of the Scientific Machine Learning Open Source Software Community) will take place online on Wednesday 23rd March 2022. SciMLCon is focused on the development and applications of the Julia-based SciML tooling, with expansion into R and Python planned in the near future.
    • JuliaCon will be free and virtual, with the main conference taking place from Wednesday 27th July to Friday 29th July 2022. (Julia is a high performance, high-level dynamic language designed to address the requirements of high-level numerical and scientific computing, and is becoming increasingly popular in Machine Learning, IOT, Robotics, Energy Trading and Data Science)
  • Harald Carlens launched a very useful Discord server to help facilitate easier matchmaking for teams in the competitive ML community spanning across Kaggle and other platforms (AIcrowd/Zindi/DrivenData/etc), to go along with the mlcontests.com website. There are over 250 people on the server already and the audience is growing daily. More info here
  • Prithwis De contributed as chair at the 6th International Conference on Data Management, Analytics & Innovation, held during January 14-16, 2022.
  • Sarah Parker calls out the work of Professor Simon Maskell (Professor of Autonomous Systems, and Director of the EPSRC Centre for Doctoral Training in Distributed Algorithms at the University of Liverpool), who has developed a Bayesian model used by the UK Government to estimate the UK’s R number – the reproduction number – of COVID-19. More info here.

Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

  • Holisticai, a startup focused on providing insight, assessment and mitigation of AI risk, has a number of relevant AI related job openings- see here for more details
  • EvolutionAI are looking for a machine learning research engineer to develop their award-winning AI-powered data extraction platform, putting state of the art deep learning technology into production use. Strong background in machine learning and statistics required
  • AstraZeneca are looking for a Data Science Training Developer – more details here
  • Cazoo is looking for an experienced Principal Data Scientist to lead technical development of a wide range of ML projects – more details here (I’m biased… but this is an amazing job for the right person 😉 )

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

January Newsletter

Hi everyone-

Happy New Year! I hope you all had as relaxing a holiday period as possible and enjoyed the fireworks from around the world… London trumps them all as far as I’m concerned although I’m clearly biased. As we all gear up for 2022, perhaps time for some thought provoking data science reading materials to help guide plans for the year ahead.

Following is the January edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science January 2022 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

2021 has been a busy and productive year for the RSS Data Science and AI section, focusing on our goals of:

  • Supporting the career development of data scientists and AI specialists
  • Fostering good practice for professional data scientists
  • Providing the voice of the practitioner to policy-makers

A few edited highlights:

  • We kicked off our “Fireside Chat” series back in February with an amazing discussion with Andrew Ng attended by over 500 people, followed up with a similarly thought-provoking conversation with Anthony Goldbloom, founder of Kaggle, in May.
  • In March we hosted our inaugural Data Science Ethics Happy Hour, discussing a wide range of topics focused on ethical challenges with an experienced panel. We also hosted “Confessions of a Data Scientist” at the annual RSS conference based on contributions from you, our experienced data science practitioner readership.
  • Throughout the year we have engaged with various initiatives focused on the accreditation of data science. More recently we have been actively engaged in the UK Government’s AI Roadmap and strategy, first conducting a survey and publishing our findings and critiques (which were publicly acknowledged). We then hosted a well-attended event focused on the implications of the strategy and will be collaborating with the UK Government’s Office for AI to host a roundtable event on AI Governance and Regulation, one of the 3 main pillars of the UK AI Strategy.
  • … And we’ve managed to produce 12 monthly newsletters, expanding our readership

Our very own Jim Weatherall has co-authored a paper, “Really Doing Great at Estimating CATE?” which has been accepted to NeurIPS- many congrats Jim!

Meanwhile, Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The next talk will be on January 12th, when Alexey Bochkovskiy, research engineer at Intel, will discuss “YOLOv4 and Dense Prediction Transformers“. Videos are posted on the meetup YouTube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

  • It’s not exactly breaking news that the ImageNet data set is very influential in driving image recognition AI research, but new research from the University of California and Google Research highlights the overall importance of these ‘benchmark’ datasets, largely from influential western institutions, and frequently from government organisations.
"[We] find that there is increasing inequality in dataset usage globally, and that more than 50% of all dataset usages in our sample of 43,140 corresponded to datasets introduced by twelve elite, primarily Western, institutions."
  • TikTok is considered by many to have one of the best recommendation systems, driving phenomenal usage figures amongst its users. The NYTimes obtained an internal company document that offers a new level of detail about how the algorithm works. It’s clear that the algorithm optimises for retention and time-spent much like many other similar systems.
"The company’s edge comes from combining machine learning with fantastic volumes of data, highly engaged users, and a setting where users are amenable to consuming algorithmically recommended content (think how few other settings have all of these characteristics!). Not some algorithmic magic.”
  • So TikTok is not doing anything inherently different to Facebook, Twitter and any other site that recommends content. And in this excellent in-depth article, MIT Technology Review walks through how ‘clickbait farms‘ use these sites to spread misinformation.
"On an average day, a financially motivated clickbait site might be populated with celebrity news, cute animals, or highly emotional stories—all reliable drivers of traffic. Then, when political turmoil strikes, they drift toward hyperpartisan news, misinformation, and outrage bait because it gets more engagement”
"It’s not the most “interesting” stories that make their way to the top of your News Feed (the word “interesting” implying “valuable”), but the most emotional. The most divisive. The ones with the most Likes, Comments, and Shares, and most likely to spark debate, conflict, anger. Either that, or the content a brand was willing to spend the most money sponsoring—all of which reveals a disconcerting conclusion: as a user of these platforms, being forced to see what the algorithm and brands want you to see, you have no rights"
"Instead of fighting from the inside, I want to show a model for an independent institution with a different set of incentive structures.”

Developments in Data Science…
As always, lots of new developments…

“It feels like Galileo picking up a telescope and being able to gaze deep into the universe of data and see things never detected before.”
  • In addition, DeepMind released Gopher, a new 280 billion parameter model, together with insight into the areas where parameter scaling helps, and where it is less important
"Our research investigated the strengths and weaknesses of those different-sized models, highlighting areas where increasing the scale of a model continues to boost performance – for example, in areas like reading comprehension, fact-checking, and the identification of toxic language. We also surface results where model scale does not significantly improve results — for instance, in logical reasoning and common-sense task"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

“The results are compelling. It's certainly opening a new class of antimicrobial peptides, and finding them in an unexpected place.”

How does that work?
A new section on understanding different approaches and techniques

"The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. There are several paradigms in machine learning to deal with the scenario when the labels are scarce. Semi-supervised learning is one candidate, utilizing a large amount of unlabeled data conjunction with a small amount of labeled data"
  • Given the increasing prevalence of PyTorch this looks very useful – miniTorch
MiniTorch is a diy teaching library for machine learning engineers who wish to learn about the internal concepts underlying deep learning systems. It is a pure Python re-implementation of the Torch API designed to be simple, easy-to-read, tested, and incremental. The final library can run Torch code. The project was developed for the course 'Machine Learning Engineering' at Cornell Tech.

Practical tips
How to drive analytics and ML into production

"We think about three large primitives: the ingest primitive in this chat interface, the transform interface, and the publisher interface. All of these apply to “data sets” – which could be tables, they could be models, they could be reports, dashboards, and all the other things that you mentioned. When you think of ingest, transform, publish, these are all operating on instead of storage.  We are building the lakehouse architecture: our storage is GCS, Iceberg table format, plus Parquet. … Trino is our query engine.”

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink!

"Well, computers haven’t changed much in 40 or 50 years. They’re smaller and faster, but they’re still boxes with processors that run instructions from humans. AI changes that on at least three fronts: how computers are made, how they’re programmed, and how they’re used. Ultimately, it will change what they are for. 
The core of computing is changing from number-crunching to decision-­making."
"This post argues that we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research."
"In this series, I focus on the third trend [novel computing infrastructure capable of processing large amounts of data at massive scales and/or with fast turnaround times], and specifically, I will give a high-level overview of accelerators for artificial intelligence applications — what they are, and how they became so popular."

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

Covid Corner

As we head into a new year, there are some depressing similarities with last year. The new Omicron variant has caused cases to skyrocket worldwide, with the UK being at the forefront… Thank goodness for vaccinations.

  • The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be an astonishing 1 in 25 people, by far the largest prevalence we have seen (over 2m people currently with coronavirus)… Back in May the prevalence was less than 1 in 1000..
  • As yet the hospitalisation figures have not shown similar dramatic increases, although there are some worrying very recent trends.

Updates from Members and Contributors

  • Mani Sarkar has conducted a two-part interview with Ian Ozsvald (PyData London founder) on Kaggling (see Twitter posts here and here, as well as a summary in Ian’s newsletter here)
  • David Higgins has been very productive on topics in medical AI, digital health and data driven business, posting an article a week from September through Christmas – lots of excellent material here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

December Newsletter

Hi everyone-

Properly dark and cold now in the UK, and even some initial sightings of Christmas trees so it must be getting to the end of year… perhaps time for some satisfying data science reading materials while pondering what present to buy for your long lost auntie!

Following is the December edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science December 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

On Tuesday 23rd November we hosted our latest event “The National AI Strategy – boom or bust to your career in data science?” and it was another great success with a strong turnout.

  • First of all, Seb Krier, Senior Technology Policy Researcher at the Stanford University Cyber Policy Centre, gave an excellent overview of the published National AI strategy, using his extensive experience to provide insight into the strengths and weaknesses of the different focus areas and how it compares to different approaches around the world.
  • Next, Adam Davison and Martin Goodson talked through the results of our recent data science practitioner survey on the government strategy proposals, highlighting areas of discrepancy and omission.
  • We then finished with a lively round-table discussion, additionally including Stian Westlake, Chief Executive of the RSS and Janet Bastiman, Chief Data Scientist at Napier AI.

We will publish a more detailed review and video in the coming weeks for those who missed out.

If anyone is interested in getting more involved in this discussion, we are collaborating with the UK Government’s Office for AI to host a roundtable event on AI Governance and Regulation which is one of the 3 main pillars of the UK AI Strategy. We are seeking Data Science and AI experts and practitioners to participate – please express any interest by emailing weatheralljames@hotmail.com.

Many congratulations to DSS section committee’s Rich Pugh who has been elected to the RSS Council – joining the DSS’s Anjali Mazumder and Jim Weatherall… all part of our cunning plan for global domination!

Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The last talk was on October 27th, when Anees Kazi, senior research scientist at the chair of Computer Aided Medical Procedure and Augmented Reality (CAMPAR) at Technical University of Munich, discussed “Graph Convolutional Networks for Disease Prediction“. Videos are posted on the meetup YouTube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"This change will represent one of the largest shifts in facial recognition usage in the technology’s history. More than a third of Facebook’s daily active users have opted in to our Face Recognition setting and are able to be recognized, and its removal will result in the deletion of more than a billion people’s individual facial recognition templates."
For example, asking AI to cure cancer as quickly as possible could be dangerous. “It would probably find ways of inducing tumours in the whole human population, so that it could run millions of experiments in parallel, using all of us as guinea pigs,” said Russell. “And that’s because that’s the solution to the objective we gave it; we just forgot to specify that you can’t use humans as guinea pigs and you can’t use up the whole GDP of the world to run your experiments and you can’t do this and you can’t do that.”

Developments in Data Science…
As always, lots of new developments…

“The brain is able to use information coming from the skin as if it were coming from the eyes. We don’t see with the eyes or hear with the ears, these are just the receptors, seeing and hearing in fact goes on in the brain.”
"This trend of massive investments of dozens of millions of dollars going into training ever more massive AI models appears to be here to stay, at least for now. Given these models are incredibly powerful this is very exciting, but the fact that primarily corporations with large monetary resources can create these models is worrying"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

“Biology is likely far too complex and messy to ever be encapsulated as a simple set of neat mathematical equations. But just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI.”
“There was no problem with the algorithm as long as they stay within the boundaries of the business model and buy cookie-cutter homes that are easier to sell. There are a lot of things that affect the valuation of homes that even very sophisticated algorithms cannot catch"

How does that work?
A new section on understanding different approaches and techniques

"Before we start, just a heads-up. We're going to be talking a lot about matrix multiplications and touching on backpropagation (the algorithm for training the model), but you don't need to know any of it beforehand. We'll add the concepts we need one at a time, with explanation.."
For example, speech recognition systems need to disambiguate between phonetically similar phrases like “recognize speech” and “wreck a nice beach”, and a language model can help pick the one that sounds the most natural in a given context. For instance, a speech recognition system transcribing a lecture on audio systems should likely prefer "recognize speech", whereas a news flash about an extraterrestrial invasion of Miami should likely prefer "wreck a nice beach".
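
As a toy illustration of the idea in the quote above (a sketch under heavy simplifying assumptions, not how any production speech system works), the snippet below scores each candidate transcription with an add-one smoothed unigram model built from a short piece of context text, and picks the candidate with the higher average log-probability per word. The function names and the two context snippets are invented for the example.

```python
import math
from collections import Counter


def avg_log_prob(phrase, counts, vocab_size):
    """Average per-word log-probability under an add-one smoothed unigram model."""
    total = sum(counts.values())
    words = phrase.lower().split()
    log_probs = [math.log((counts[w] + 1) / (total + vocab_size)) for w in words]
    return sum(log_probs) / len(words)


def pick_transcription(candidates, context_text):
    """Return the candidate whose words are most typical of the context."""
    counts = Counter(context_text.lower().split())
    # The vocabulary covers the context words plus every candidate word,
    # so unseen words still receive a small, non-zero probability.
    vocab = set(counts) | {w for c in candidates for w in c.lower().split()}
    return max(candidates, key=lambda c: avg_log_prob(c, counts, len(vocab)))


CANDIDATES = ["recognize speech", "wreck a nice beach"]

# Hypothetical context snippets standing in for "what has been said so far".
LECTURE = ("the microphone captures speech and the audio system "
           "must recognize speech from the signal")
NEWSFLASH = ("aliens landed in miami and began to wreck a beach "
             "near the nice hotels on the beach")

print(pick_transcription(CANDIDATES, LECTURE))    # recognize speech
print(pick_transcription(CANDIDATES, NEWSFLASH))  # wreck a nice beach
```

Real language models condition on word order and far more context than simple word counts, but the principle is the same: the context shifts which of the similar-sounding phrases looks more probable.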
"But I am going to define this stuff three times. Once for mum, once for dad, and once for the country."

Practical tips
How to drive analytics and ML into production

  • We’ve previously highlighted the importance of MLOps and the standardisation of processes for updating and monitoring ML models in production. Another good podcast on ‘The Data Exchange’, this time about MLOps Anti-Patterns (the underlying research paper is here)
  • Speaking of MLOps – excellent summary of the platforms used across the big players, highlighting how much is still ‘home grown’ (labeled ‘IH’ below)
"Machine learning systems are extremely complex, and have a frustrating ability to erode abstractions between software components. This presents a wide array of challenges to the kind of iterative development that is essential for ML success.”

Bigger picture ideas
Longer thought provoking reads – a few more than normal, lean back and pour a drink!

"Abundant evidence and decades of sustained research suggest that the brain cannot simply be assembling sensory information, as though it were putting together a jigsaw puzzle, to perceive its surroundings. This is borne out by the fact that the brain can construct a scene based on the light entering our eyes, even when the incoming information is noisy and ambiguous."
"I would love to incorporate deep learning into the design, manufacturing, and operations of our aircraft. But I need some guarantees."

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

Covid Corner

As we head into winter, we continue to experience the conflicting emotions of relaxing regulations and behaviour alongside increasing Covid prevalence and hospitals at breaking point. And now there is news of a new variant…

"Whatever the reason, by half-term, only around 16 per cent of vaccinations in the cohort had been achieved. Meanwhile, school-age kids had caught Covid by the truckload. Over 7 per cent of the entire Year 7 to Year 11 cohort was infected on any day in the last week of October alone. Maybe that was the unspoken plan. Certainly the JCVI’s minutes – released at the end of October after lengthy delays – make grim reading in this respect. The idea, already noted, that “natural infection” might be better than vaccination for young people was under discussion even here. Somehow, catching Covid was proffered as a better way of not getting ill with Covid than preventing its worst effects with a proven vaccine."

Updates from Members and Contributors

  • Professor Harin Sellahewa reports that nearly 50 of the University of Buckingham’s first ever master’s level data science apprentices have graduated. The Integrated Master’s level Degree Apprenticeship course was set up two years ago to help address an urgent shortage of people with advanced digital skills and to produce expert data scientists by giving them the technological and business skills to transform their workplace. The graduates receive the MSc in Applied Data Science from Buckingham as well as the Level 7 Digital and Technology Solutions Specialist degree apprenticeship certificate from ESFA. The apprenticeship is provided in partnership with AVADO who work with businesses to train staff to develop the skills needed to compete in a digital world. Industry partners such as IBM, Tableau, TigerGraph and Zizo conducted practical workshops for the learners.

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

November Newsletter

Hi everyone-

The clocks have changed – officially the end of ‘daylight savings’ in the UK – does that mean we no longer try and save daylight? Certainly feels that way … definitely time for some satisfying data science reading materials while drying out from the rain!

Following is the November edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science November 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

We are pleased to announce our next virtual DSS meetup event, on Tuesday 23rd November at 5pm: “The National AI Strategy – boom or bust to your career in data science?”. Following on from our commentary on the UK Government’s AI Strategy (based on the excellent feedback from our community), and the pick-up we have received, we are going to run a focused event discussing this topic. You will hear key information about the strategy and have the opportunity to ask questions, provide input, and hear a panel of experts discuss the implications of the strategy for practitioners of AI in the UK. Save the date- all welcome!

Of course, the RSS never sleeps… so preparation for next year’s conference, which will take place in Aberdeen, Scotland from 12-15 September 2022, is already underway. The RSS is inviting proposals for invited topic sessions. These are put together by an individual, group of individuals or an organisation with a set of speakers who they invite to speak on a particular topic. The conference provides one of the best opportunities in the UK for anyone interested in statistics and data science to come together to share knowledge and network. Deadline for proposals is November 18th.

Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The last talk was on October 27th, when Anees Kazi, senior research scientist at the chair of Computer Aided Medical Procedure and Augmented Reality (CAMPAR) at Technical University of Munich, discussed “Graph Convolutional Networks for Disease Prediction”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

“As far as we can tell, the algorithm is using problematic and biased criteria, like nationality, to choose which “stream” you get in. People from rich white countries get “Speedy Boarding”; poorer people of colour get pushed to the back of the queue.”
"Facebook has been unwilling to accept even little slivers of profit being sacrificed for safety"
"In a competitive marketplace, it may seem easier to cut corners. But it’s unacceptable to create AI systems that will harm many people, just as it’s unacceptable to create pharmaceuticals and other products—whether cars, children’s toys, or medical devices—that will harm many people."

Developments in Data Science…
As always, lots of new developments…

  • Before delving into the research, it’s sometimes useful to step back and observe the lie of the land. Interesting perspective here on how the major players have ended up focusing on slightly different areas of deep learning research
"It is important to not only look at average task accuracy -- which may be biased by easy or redundant tasks -- but also worst-case accuracy (i.e. the performance on the task with the lowest accuracy)."
"Classification, extractive question answering, and multiple choice tasks benefit so much from additional examples that collecting a few hundred examples is often "worth" billions of parameters"
  • The extent to which you can use synthetic data in machine learning always generates discussion. Microsoft Research highlights you can go far with facial analysis, with the potential benefits of improving diversity in data sets.
  • The annual ‘State of AI’ report is always a weighty tome – this year’s comes in at 188 slides… Worth a skim to see what people are working on, but perhaps be wary of the predictions…
  • This is very relevant – ‘editing’ models. We have talked about how some of the large data sets used to train the leading image and language models have questionable data quality. Is there a way of removing the influence of particular erroneous data points from the final model when they are identified? Researchers at Stanford University think so
"MEND can be trained on a single GPU in less than a day even for 10 billion+ parameter models; once trained MEND enables rapid application of new edits to the pre-trained model. Our experiments with T5, GPT, BERT, and BART models show that MEND is the only approach to model editing that produces effective edits for models with tens of millions to over 10 billion parameters"
"By leveraging advances in graph neural networks, we propose a hypernetwork that can predict performant parameters in a single forward pass taking a fraction of a second, even on a CPU. The proposed model achieves surprisingly good performance on unseen and diverse networks"

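The point quoted above about average versus worst-case task accuracy can be illustrated with a minimal sketch. The task names and accuracy figures below are made up purely for illustration and do not come from the paper being quoted.

```python
from statistics import mean

# Hypothetical per-task accuracies for a multi-task benchmark (made-up numbers).
task_accuracy = {
    "sentiment": 0.94,
    "paraphrase": 0.91,
    "entailment": 0.88,
    "coreference": 0.52,
}


def summarise(scores):
    """Return (average accuracy, name of the worst task, worst-case accuracy)."""
    worst_task = min(scores, key=scores.get)
    return mean(scores.values()), worst_task, scores[worst_task]


average, worst_task, worst = summarise(task_accuracy)
print(f"average accuracy:    {average:.4f}")
print(f"worst-case accuracy: {worst:.2f} ({worst_task})")
```

In this toy example the average looks healthy because three easy tasks dominate it, while the worst-case figure exposes the one task the model struggles with.
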
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

  • Google has announced it plans to include multi-modal models in its search algorithms- learning from the linkages between text and images- good commentary here
“It holds out the promise that we can ask very complex queries and break them down into a set of simpler components, where you can get results for the different, simpler queries and then stitch them together to understand what you really want.”
"To compute the embedding of the tabular context, it first uses a BERT-based architecture to encode several rows above and below the target cell (together with the header row). The content in each cell includes its data type (such as numeric, string, etc.) and its value, and the cell contents present in the same row are concatenated together into a token sequence to be embedded using the BERT encoder”

How does that work?
A new section on understanding different approaches and techniques

  • Why do neural networks generalise so well? Good question… let the BAIR help you out (well worth a read – note you may need to reload the page as it doesn’t seem to take in-bound links)
"Perhaps the greatest of these mysteries has been the question of generalization: why do the functions learned by neural networks generalize so well to unseen data? From the perspective of classical ML, neural nets’ high performance is a surprise given that they are so overparameterized that they could easily represent countless poorly-generalizing functions."

Practical tips
How to drive analytics and ML into production

"Nobody cared that I speak 5 languages, that I know a bunch about how microcontrollers work in the tiniest of details, how an analog high-frequency circuit is built from bare metal, and how computers actually work. All of that is abstracted away. You only need…algorithms & data structures pretty much.”

Bigger picture ideas
Longer thought provoking reads – a few more than normal, lean back and pour a drink!

"A number of researchers are showing that idealized versions of these powerful networks are mathematically equivalent to older, simpler machine learning models called kernel machines. If this equivalence can be extended beyond idealized neural networks, it may explain how practical ANNs achieve their astonishing results."
"I wrote earlier this year about Morioka Shoten, a bookshop in Tokyo that only sells one book, and you could see this as an extreme reaction to a problem of infinite choice. Of course, like all these solutions it really only relocates the problem, because now you have to know about the shop instead of having to know about the book"

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

"Transcribing Japanese cursive writing found in historical literary works like this one is usually an arduous task even for experienced researchers. So we tested a machine learning model called KuroNet to transcribe these historical scripts."
"A competition focused on helping advance development of next-generation virtual assistants that will assist humans in completing real-world tasks by harnessing generalizable AI methodologies such as continuous learning, teachable AI, multimodal understanding, and reasoning"

Covid Corner

Although life seems to be returning to normal for many people in the UK, there is still lots of uncertainty on the Covid front… booster vaccinations are now rolling out in the UK, which is good news, but we still have exceedingly high community covid case levels due to the Delta variant and rising hospitalisations…

"From the viewpoint of some JCVI members, children aren’t independent agents with a right to be protected from a potentially dangerous virus. Rather, because they can serve as human shields for more vulnerable adults, it’s downright good when children get sick. They explicitly stated that “natural infection in children could have substantial long-term benefits for COVID-19 in the UK.”  Not only is this scientific nonsense, as the high number of infections in the UK clearly shows, it’s a moral abomination"

Updates from Members and Contributors

  • Sorry we didn’t do more publicity around PyData Global 2021 … it just happened last week. Many congrats to Kevin O’Brien, one of the main organisers, and to Marco Gorelli for his talk on Bayesian Ordered Logistic Regression!
  • Ronald Richman has just published a new paper on explainable deep learning which looks very interesting.
  • Sarah Phelps invites everyone to what looks to be an excellent webinar hosted by the UK ONS Data Science Campus:
    • “The UK Office for National Statistics Data Science Campus and UNECE HLG-MOS invite you to join them for the ONS-UNECE Machine Learning Group 2021 Webinar on 19 November.”
    • “The webinar will provide an opportunity to learn about the progress that the Group has made this year in its different work areas, from coding and classification and satellite imagery to operationalisation and data ethics. Bringing together colleagues from across the global official statistics community, it will include contributions from senior figures in the data science divisions of various NSOs as well as discussion on the priorities for advancing the use of machine learning in official statistics in 2022.”

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

October Newsletter

Hi everyone-

I guess summer is over, what there was of it- I was hoping we might get a bit of autumn sunshine but it feels like it’s big coat weather already… definitely time for some tasty data science reading materials in front of a warm fire!

Following is the October edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science October 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

First of all, we have a new name… Data Science and AI Section! To be honest, we’ve always talked about machine learning and artificial intelligence, and have some very experienced practitioners both on the committee and in our network, so it doesn’t really change our focus. It is nice to have it officially recognised by the RSS though.

Thank you all for taking the time to fill in our survey responding to the UK Government’s proposed AI Strategy. As you may have seen, Martin Goodson, our chair, summarised some of the findings in a recent post, highlighting the significant gaps in the government’s proposed approach based on comments from you. Some of these gaps, particularly on open-source, have now been publicly acknowledged, multiple times. In addition, Martin and Jim Weatherall met with Sana Khareghani (director of the Office for AI) and Tabitha Goldstaub (chair of the AI council) to further advocate for our community’s needs. Sana agreed that the Office for AI will run workshops together with the RSS focused on the technical practitioner community, in order to gain their perspective and identify their needs.

“Confessions of a Data Scientist” seemed to go down very well at the recent RSS conference- massive thanks to Louisa Nolan for making it so successful, and to you all for your contributions.

Of course, the RSS never sleeps… so preparation for next year’s conference, which will take place in Aberdeen, Scotland from 12-15 September 2022, is already underway. The RSS is inviting proposals for invited topic sessions. These are put together by an individual, group of individuals or an organisation with a set of speakers who they invite to speak on a particular topic. The conference provides one of the best opportunities in the UK for anyone interested in statistics and data science to come together to share knowledge and network. Deadline for proposals is November 18th.

Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The last talk was on September 7th, when Thomas Kipf, Research Scientist at Google Research in the Brain Team in Amsterdam, discussed “Relational Structure Discovery”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

“Artificial intelligence can be a force for good, helping societies overcome some of the great challenges of our times. But AI technologies can have negative, even catastrophic, effects if they are used without sufficient regard to how they affect people’s human rights”
"Depoliticizing people’s feeds makes sense for a company that is perpetually in hot water for its alleged impact on politics"
"We don’t want viewers regretting the videos they spend time watching and realized we needed to do even more to measure how much value you get from your time on YouTube."

Developments in Data Science…
As always, lots of new developments… thought we’d have a more extended look at some of the new research this month

  • Plenty of great arXiv papers out there this month- I know these can be a bit dry, so will try and give a bit of context…
    • One theme of research we have been following is “fewer-shot” training of models. Fundamentally, humans don’t need millions of examples of an orange before being able to identify one, so learning from limited examples should be possible. Large language models like GPT-3 have shown great promise in this area, where, given a few “prompts” (question and answer examples), they seem to be able to provide remarkable results for this type of problem. Sadly, this paper, “true few-shot learning”, suggests we need a more standardised approach to example selection as previous results may have been artificially inflated by biased approaches.
    • More positively, “Can you learn an algorithm” talks through recent research showing that simple recurrent neural networks can learn approaches that can be successfully applied to larger scale problems, just as humans can learn from toy examples. Similarly, a new sequence to sequence learning approach from MIT CSAIL includes a component that learns “grammar” across examples.
    • Another popular research theme is simplifying architecture and reducing processing. A team at Google Brain have shown (“Pay Attention to MLPs“) that you can almost replicate the performance of transformers (a more complex deep learning architecture) with a simpler approach based on basic building blocks (multi-layer perceptrons)
    • GANs (generative adversarial networks) are pretty cool – they generate new similar looking examples from input data (see here for an intro). A recent paper (GAN’s N’ Roses) takes this to a new level, generating stable video from an input and a theme. (“GAN’s N’ Roses” is clearly a popular meme – this tutorial predates the paper by 4 years!)
  • Of course the big industrial research powerhouses (Google/DeepMind, Facebook etc.) keep churning out fantastic work:
“We would like our agents to leverage knowledge acquired in previous tasks to learn a new task more quickly, in the same way that a cook will have an easier time learning a new recipe than someone who has never prepared a dish before"
  • Finally, one paper I encourage everyone to read- “A Farewell to the Bias-Variance Tradeoff?”, on one of the conundrums I still struggle to fully understand … why is it that over-parameterised models (those which seem to have far too many parameters given the data set they are trained on) are able to generalise so well?

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

  • Great article in Wired on the development of large language models outside of the US, and the English language
"What's surprising about these large language models is how much they know about how the world works simply from reading all the stuff that they can find"
"It is a pioneering program that’s mixing responsible AI and science with indigenous led knowledge and solving complex environmental management problems at spots in Northern Australia"
  • We don’t hear much from Amazon about their use of AI, although clearly they have very advanced applications across their business. This was an interesting post digging into the practical problem of how you help delivery workers find the actual entrance to a given residence, from noisy data.
  • “In this project, we’ve trained physically simulated humanoids to play a simplified version of 2v2 football” …. and there’s video!
  • And the Boston Dynamics robots continue to fascinate/scare in equal measure… they can now do Parkour!
"On the Atlas project, we use parkour as an experimental theme to study problems related to rapid behavior creation, dynamic locomotion, and connections between perception and control that allow the robot to adapt – quite literally – on the fly."
"Everyone was floored, there was a lot of press, and then it was radio silence, basically. You’re in this weird situation where there’s been this major advance in your field, but you can’t build on it.”

How does that work?
A new section on understanding different approaches and techniques

  • Hyper-parameter optimisation can often require more art than science if you don’t have a systematic approach- some useful tips here using Argo
  • There are lots of different activation functions (defining the output from given inputs) you can use in neural networks, but which one should you use for a given task? Useful paper here.
  • Interesting comparison: using meme search to explore the performance of different image encoders, in particular CLIP from OpenAI vs Google’s Big Transfer
  • I’m not a massive fan of media-mix modelling (building models that optimise marketing expenditure based on historic performance) because it always feels like there is so much fundamentally missing in the underlying data sets. However, they can certainly be useful, and using a Bayesian approach would seem to be a good way to go (more detail here)
"The Bayesian approach allows prior knowledge to be elegantly incorporated into the model and quantified with the appropriate mathematical distributions."

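To make the quoted point about priors concrete, here is a minimal sketch of the simplest possible Bayesian update: a Beta prior combined with binomial data. It is a toy response-rate example and not a media-mix model. The prior, the data and the function names are all assumptions chosen for illustration.

```python
def beta_binomial_update(prior_alpha, prior_beta, successes, trials):
    """Conjugate update: Beta prior plus binomial data gives a Beta posterior."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return prior_alpha + successes, prior_beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior knowledge: the response rate is around 2%, held with the
# weight of roughly 100 earlier observations, i.e. Beta(2, 98).
prior = (2.0, 98.0)

# Hypothetical new campaign data: 30 responses from 400 impressions.
successes, trials = 30, 400
posterior = beta_binomial_update(*prior, successes=successes, trials=trials)

print("prior mean:    ", beta_mean(*prior))
print("observed rate: ", successes / trials)
print("posterior mean:", beta_mean(*posterior))
```

The posterior mean lands between the prior belief and the observed rate, and a stronger prior (larger alpha plus beta) would pull it further towards the prior.
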
Practical tips
How to drive analytics and ML into production

"Companies that are starting with the problem first, improving on a defined metric and reach ML as a solution naturally are the ones that will treat their models as a continuously developing product”

Bigger picture ideas
Longer thought provoking reads

"the modern data stack isn't enough. We have to create a modern data experience."
"We call for the replacement of the deep network technology to make it closer to how the brain works by replacing each simple unit in the deep network today with a unit that represents a neuron, which is already—on its own—deep"

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

What’s interesting with that system, contrary to classical game development, is that you don’t need to hard-code every interaction. Instead, you use a language model that selects what’s robot possible action is the most appropriate given user input.
Our goal is to create a formal call for blog posts at ICLR to incentivize and reward researchers to review past work and summarize the outcomes, develop new intuitions, or highlight some shortcomings.

Covid Corner

Although life seems to be returning to normal for many people in the UK, there is still lots of uncertainty on the Covid front… vaccinations keep progressing in the UK, which is good news, but we still have high community covid case levels due to the Delta variant…

"By comparing Eva’s performance against modelled counterfactual scenarios, we show that Eva identified 1.85 times as many asymptomatic, infected travellers as random surveillance testing, with up to 2-4 times as many during peak travel, and 1.25-1.45 times as many asymptomatic, infected travellers as testing policies that only utilize epidemiological metrics."

Updates from Members and Contributors

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

September Newsletter

Hi everyone-

I don’t know about you, but that didn’t feel particularly August-like…. I miss the sun! Perhaps September will save the summer, together with some inspiration from the Paralympics … How about a few curated data science materials for perusing during the lull in the wheelchair rugby final?

Following is the September edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity … We are continuing with our move of Covid Corner to the end to change the focus a little.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science September 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

Thank you all for taking the time to fill in our survey responding to the UK Government’s proposed AI Strategy. We are working on a series of posts digging into the results which we hope will be thought provoking.

This year’s RSS Conference is almost here (Manchester from 6-9 September, register here), with some great keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers. There is online access to over 40 hours of content at the conference covering a wide variety of topics. The full list of the online content can be found here. We really hope to see you all there, particularly at “Confessions of a Data Scientist” (11:40-13:00 Tuesday, 7 September), chaired by Data Science Section committee member Louisa Nolan.  

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with events. The next talk is on September 7th, when Thomas Kipf, Research Scientist at Google Research in the Brain Team in Amsterdam, will discuss “Relational Structure Discovery”. Videos are posted on the meetup youtube channel – and future events will be posted here.

Many congratulations to Martin and the team at evolution.ai for winning the Leading Innovators in Data Extraction Award during the FinTech Awards 2021!

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"The fact that diagnostic models recognize race in medical scans is startling. The mystery of how they do it only adds fuel to worries that AI could magnify existing racial disparities in health care"
  • The Stanford Institute for Human-Centered Artificial Intelligence released a comprehensive review of the opportunities and risks of what it calls “Foundation Models” – these are models (such as BERT, DALL-E, and GPT-3) that are trained on “broad data at scale and are adaptable to a wide range of downstream tasks”
    • The research paper is a weighty tome (available here) but definitely worth a look
    • A good review can be found here
"They create a single point of failure, so any defects, any biases which these models have, any security vulnerabilities . . . are just blindly inherited by all the downstream tasks"
  • Of course the models and algorithms could be perfect, but still cause harm if they are not solving the right problem, or the outputs are not used in the right way
    • Motherboard reports that police are apparently attempting to have evidence generated from a gunshot-detecting AI system altered
    • And a short but well reasoned piece in defence of algorithms:
"These algorithms aren’t “mutant” in any meaningful sense – their outcomes are the inevitable consequence of decisions made during their design"

Developments in Data Science…
As always, lots of new developments…

  • All sorts of activity in the reinforcement learning/robotics space this month:
“As far as I know, this is an entirely unprecedented level of generality for a reinforcement-learning agent"
  • As always, lots of research is going on in the deep learning architecture space:
  • Similarly investigation into methods that learn from smaller data sets continues
    • Researchers at Facebook, PSL Research and NYU have developed an elegant unsupervised pre-training method called VICReg that attempts to minimise issues of variance (identical representations for different inputs), invariance (dissimilar representations for inputs that humans find similar) and covariance (redundant parts of a representation)- this shows great promise for aiding more efficient use of pre-training and data augmentation
    • This paper also gives a good survey of data augmentation methods for Deep Learning
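
The three VICReg terms mentioned above can be sketched in plain Python as follows. This is a simplified illustration rather than the authors' implementation: the function names are ours, the weights and the `gamma` and `eps` values are illustrative assumptions, and a real version would use a tensor library and operate on the outputs of an encoder.

```python
import math


def column_means(z):
    """Per-dimension mean of a batch of embeddings (list of equal-length rows)."""
    n = len(z)
    return [sum(row[j] for row in z) / n for j in range(len(z[0]))]


def invariance_term(za, zb):
    """Mean squared distance between the embeddings of two views of each input."""
    n = len(za)
    return sum(
        sum((a - b) ** 2 for a, b in zip(ra, rb)) for ra, rb in zip(za, zb)
    ) / n


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge penalty when a dimension's std across the batch falls below gamma.

    Needs a batch of at least two embeddings.
    """
    n, d = len(z), len(z[0])
    mu = column_means(z)
    total = 0.0
    for j in range(d):
        var = sum((row[j] - mu[j]) ** 2 for row in z) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, d = len(z), len(z[0])
    mu = column_means(z)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(
                    (row[i] - mu[i]) * (row[j] - mu[j]) for row in z
                ) / (n - 1)
                total += cov ** 2
    return total / d


def vicreg_style_loss(za, zb, w_inv=25.0, w_var=25.0, w_cov=1.0):
    """Weighted sum of the three terms (weights here are illustrative)."""
    return (
        w_inv * invariance_term(za, zb)
        + w_var * (variance_term(za) + variance_term(zb))
        + w_cov * (covariance_term(za) + covariance_term(zb))
    )


# Every input mapped to the same point: the variance term penalises this.
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
# Embeddings that vary across the batch with uncorrelated dimensions.
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

print("collapsed batch loss:", vicreg_style_loss(collapsed, collapsed))
print("spread batch loss:   ", vicreg_style_loss(spread, spread))
```

The collapsed batch is penalised purely by the variance term, which is the mechanism that stops the encoder from producing identical representations for different inputs.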

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

"If we intervene early, the treatments can kick in early and slow down the progression of the disease and at the same time avoid more damage"
"Another method that we found to be effective was the use of unsupervised self-training. We prepared a set of 100 million satellite images from across Africa, and filtered these to a subset of 8.7 million images that mostly contained buildings. This dataset was used for self-training using the Noisy Student method, in which the output of the best building detection model from the previous stage is used as a ‘teacher’ to then train a ‘student’ model that makes similar predictions from augmented images."

How does that work?
A new section on understanding different approaches and techniques

"ML is notoriously bad at this inverse causality type of problems. They require us to answer “what if” questions, what Economists call counterfactuals. What would happen if instead of this price I’m currently asking for my merchandise, I use another price?"

Practical tips
How to drive analytics and ML into production

"Analytics isn’t primarily technical. While technical skills are useful, they’re not what separate average analysts from great ones."

Bigger picture ideas
Longer thought provoking reads

If you tell me a story and I say, ‘Oh, the same thing happened to me,’ literally the same thing did not happen to me that happened to you, but I can make a mapping that makes it seem very analogous. It’s something that we humans do all the time without even realizing we’re doing it. We’re swimming in this sea of analogies constantly.
"There’s a slightly humorous stereotype about computational complexity that says what we often end up doing is taking a problem that is solved a lot of the time in practice and proving that it’s actually very difficult"

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

All of the images in this post were synthesized by a combination of several machine learning models, directed by text that I provided, VQGAN for generation, and CLIP for directing the image to match the text.

Covid Corner

Still lots of uncertainty on the Covid front… vaccinations keep progressing in the UK, which is good news, but we still have very high community covid case levels due to the Delta variant…

“In the end, many hundreds of predictive tools were developed. None of them made a real difference, and some were potentially harmful.”

Updates from Members and Contributors

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

August Newsletter

Hi everyone-

That was quick, August already, but at least we have had the occasional day when it properly feels like summer- and now we have some Olympics to watch which is always entertaining! … How about a few curated data science materials for reading while watching the marathon?

Following is the August edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity … We are continuing with our move of Covid Corner to the end to change the focus a little.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science August 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

We are still working on releasing the video and a summary of the latest in our ‘Fireside chat’ series- an engaging and enlightening conversation with Anthony Goldbloom, founder and CEO of Kaggle. Sorry for the delay- we will post a link when it is available.

Thank you all for taking the time to fill in our survey responding to the UK Government’s proposed AI Strategy (If you haven’t already, you can still contribute here). We are passionate about making sure the government focuses on the right things in this area, and are now analysing the results which we will publish shortly.

The full programme for this year’s RSS Conference, which takes place in Manchester from 6-9 September, has been confirmed. The programme includes keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers. Registration is open.

Speaking of the RSS Conference, we are running a session there, and we need your help! We would like to hear stories about your worst mistakes in data science. From these, we will select common themes and topics, and create a crowd-sourced compilation of the deadliest sins of data science. These will be presented – anonymously – to our panel, for a live, interactive discussion in front of an audience, at our session on Tuesday 7 September, 11:40 – 13:00. We hope this will both entertain and inform. Maybe your pain can help save someone else’s (data science) soul… CONFESS YOUR SINS HERE – the survey is anonymous, we won’t embarrass anyone!

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. The most recent event was on July 14th when Xavier Bresson, Associate Professor in the Department of Computer Science at the National University of Singapore, discussed “The Transformer Network for the Traveling Salesman Problem”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"One gave our candidate a high score for English proficiency when she spoke only in German."
  • We talk about bias a fair amount, and it’s always good to define terms – this summary from the ACM (Association for Computing Machinery) gives a good overview. They split biases in AI systems into four sensible high-level areas (as well as splitting out more specific types in each area):
    • Data-creation bias
    • Biases related to problem formulation
    • Biases related to the algorithm/data analysis
    • Biases related to evaluation/validation
  • It’s easy to overlook the first area highlighted above – data-creation bias. Often we train supervised learning models based on hand-labeled examples which we assume to be ‘correct’ but may not be. This article from O’Reilly talks through this issue and discusses different approaches (such as semi-supervised learning and weak supervision), while this article (from Sandeep Uttamchandani) gives some practical tips on data set selection for ML model building.
There is no such thing as gold labels: even the most well-known hand labeled datasets have label error rates of at least 5% (ImageNet has a label error rate of 5.8%!).
  • More positively, Apple has released information about their approach for face detection in photos, highlighting positive aspects such as on-device scoring, and fairness.
  • And this analysis charting the ‘data-for-good’ landscape shows it’s not all doom and gloom…

Developments in Data Science…
As always, lots of new developments…

  • When the ‘founding fathers’ of Deep Learning (Bengio, Hinton and LeCun) get together it’s always worth reading… here they discuss the future of Deep Learning and key research directions. They highlight key issues with existing approaches (large volumes of data for supervised learning or large numbers of iterations for reinforcement learning) but are not convinced by hybrid approaches including symbolic learning, believing research into more efficient learning from fewer examples will bear fruit.
“Humans and animals seem to be able to learn massive amounts of background knowledge about the world, largely by observation, in a task-independent manner. This knowledge underpins common sense and allows humans to learn complex tasks, such as driving, with just a few hours of practice.”
Interestingly, the ways that languages categorize color vary widely. Nonindustrialized cultures typically have far fewer words for colors than industrialized cultures. So while English has 11 words that everyone knows, the Papua-New Guinean language Berinmo has only five, and the Bolivian Amazonian language Tsimane’ has only three words that everyone knows, corresponding to black, white and red

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

"we’ve found that other approaches, such as reinforcement learning with human feedback, lead to faster progress in our reinforcement learning research"
"GitHub Copilot has been described as ‘magical’, ‘god send’, ‘seriously incredible work’, et cetera. I agree, it’s a pretty impressive tool, something I see myself using daily ... In my experience, Copilot excels at writing repetitive, tedious, boilerplate-y code. With minimal context, it can whip up a function that slices and dices a dataset, trains and evaluates several ml models, and, if you ask it nicely, also makes a nice batch of french fries"
  • Ok, so maybe not quite so practical, but still great fun – AI-driven art out of Berkeley (‘Alien Dreams’)
"this CLIP method is more like a beautifully hacked together trick for using language to steer existing unconditional image generating models"
  • A useful rundown from DoorDash on how they use ML models to balance supply and demand, including some interesting discussion on optimisation approaches, which are often the way of turning an ML model into something that is used in decision making.

How does that work?
A new section on understanding different approaches and techniques

Diffusion models are a new type of generative models that are flexible enough to learn any arbitrarily complex data distribution while tractable to analytically evaluate the distribution

Getting it live
How to drive ML into production

  • Andrew Ng brings to life the challenges of building an AI product…
"Unsurprisingly, things did not go exactly as planned. Thus, this post is about what worked and what didn’t. I have focused on the most challenging aspects of trying to get data scientists to get review from their peers. I hope this helps others who wish to formalize peer review processes in data science"

Correlation or Causation?
A deep dive into causal analysis in machine learning

  • You have a machine learning model and it seems to perform great, not only on the training set, but even on hold-out test sets- sorted, right? It’s worth considering how you are going to use the model- if you are making predictions and using the output as is, then maybe you are ok; but if you are going to use the model for scenario planning and counterfactual assessment (‘what-ifs?’), it would be worth thinking about causal analysis. Here’s a good starting point, from Jane Huang.
  • Here’s a useful example – estimating price elasticity
  • The technique often relies on something called ‘Double Machine Learning’
    • Overview here, with different implementations here and here and a worked example here
As any great technology, Double Machine Learning for causal inference has the potential to become pretty ubiquitous. But let’s calm the enthusiasm of this writer down and go back to our task
  • Finally, an intriguing approach for time series and econometrics… causal forests
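
To make the ‘Double Machine Learning’ idea in the bullets above a little more concrete, here is a minimal sketch of its partialling-out step on a made-up price elasticity problem. Everything in it is an assumption for illustration: the simulated data, the single confounder, the true elasticity of -2.0, and the use of simple straight-line fits as the two ‘nuisance’ models (real implementations typically plug in flexible ML models here). It is not taken from any of the linked articles.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate(n, elasticity, seed=0):
    """Hypothetical data: a confounder drives both log price and log quantity."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        season = rng.gauss(0, 1)                  # confounder
        log_price = 0.8 * season + rng.gauss(0, 1)
        log_qty = elasticity * log_price + 1.5 * season + rng.gauss(0, 1)
        rows.append((season, log_price, log_qty))
    return rows


def double_ml(rows, n_folds=2):
    """Cross-fitted partialling-out estimate of the effect of price on quantity."""
    res_price, res_qty = [], []
    for k in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        held_out = [r for i, r in enumerate(rows) if i % n_folds == k]
        xs = [r[0] for r in train]
        # Nuisance models: predict price and quantity from the confounder only.
        a_p, b_p = fit_line(xs, [r[1] for r in train])
        a_q, b_q = fit_line(xs, [r[2] for r in train])
        # Residuals are computed on data the nuisance models never saw.
        for season, log_price, log_qty in held_out:
            res_price.append(log_price - (a_p + b_p * season))
            res_qty.append(log_qty - (a_q + b_q * season))
    # Final stage: regress quantity residuals on price residuals (no intercept).
    return sum(p * q for p, q in zip(res_price, res_qty)) / sum(p * p for p in res_price)


rows = simulate(2000, elasticity=-2.0)
naive_slope = fit_line([r[1] for r in rows], [r[2] for r in rows])[1]
dml_estimate = double_ml(rows)
print("naive slope:", round(naive_slope, 2), "DML estimate:", round(dml_estimate, 2))
```

In this toy setup the naive regression of quantity on price is pulled towards zero because the confounder pushes price and quantity in the same direction, while the residual-on-residual estimate should land close to the simulated elasticity. With a purely linear confounder a plain multiple regression would do the same job; the appeal of the approach is that the two nuisance fits can be swapped for any ML model.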

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

How to get involved in the IRCAI AI Award 2021?

The International Research Centre in Artificial Intelligence under the auspices of UNESCO is launching an AI Award for individuals who have dedicated their work to solving problems related to the United Nations Sustainable Development Goals (SDGs) by means of the application of Artificial Intelligence.

Covid Corner

Not sure what to say here… vaccinations keep progressing in the UK, which is good news, but we now have what appear to be the highest covid case levels we have seen over the whole of the pandemic due to the Delta variant…

  • The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 65 people, up from 1 in 75 the week before and an almost unbelievable increase from only June, when the estimate was 1 in 1100.
  • More or Less gives an excellent review of the Delta variant and how it has come to dominate other strains of coronavirus the world over
  • One of the core findings about Delta, as discussed by More or Less, is its apparent ability to transmit through vaccinated individuals (or those with antibodies from prior infections) – in other words vaccinations, while still protecting against the worst outcomes, are not as effective at reducing transmission.
  • This definitely raises the stakes of the recent UK governmental re-opening and relaxation of restrictions on July 13th (symbolically welcomed by the prime minister in self-isolation…) which has been roundly condemned by the scientific community
  • In addition, in a recent article in the Guardian, SAGE committee member Professor Robert West states that the government’s express intention is to allow infections to rip through the younger population, a very worrying statement.
“What we are seeing is a decision by the government to get as many people infected as possible, as quickly as possible, while using rhetoric about caution as a way of putting the blame on the public for the consequences”

Updates from Members and Contributors

  • Marco Gorelli announces the first official release (1.0.0) of his highly acclaimed nbQA repo, full of very useful code formatting features and pre-commit hooks for jupyter notebooks
  • Alex Spanos will be presenting TrueLayer’s data science work at the RSS conference in Manchester (“An end-to-end Data Science workflow for building scalable and performant data enrichment APIs in Open Banking“) – another great reason to attend in September!
  • Mark Baillie highlights an upcoming special issue of the Biometrical Journal
    “Data scientists are frequently faced with an array of methods to choose from; often this makes selection difficult especially beyond one’s own particular interests and expertise. Neutral comparison studies are an essential cornerstone towards the improvement of this situation, providing evidence to help guide practitioners. For the special issue of Biometrical Journal we are interested in submissions that define, develop, discuss or illustrate concepts related to practical issues and improvement of neutral method comparison studies, as well as articles reporting well-designed neutral comparison studies of methods”

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

In memoriam

With great sadness I announce the untimely death of Rebecca Nettleship, a valued colleague and talented data scientist, on 22nd July 2021. She will be sorely missed. Our deepest condolences go out to her family and friends.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

July Newsletter

Hi everyone-

Not sure what happened to June – seemed to fly by – I know there were some lovely sunny days but then it got cold again… fingers crossed summer is not over already! … How about a few curated data science materials for reading in the garden, rain or shine?

Following is the July edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity … We are continuing with our move of Covid Corner to the end to change the focus a little.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science July 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

We are working on releasing the video and a summary of the latest in our ‘Fireside chat’ series- an engaging and enlightening conversation with Anthony Goldbloom, founder and CEO of Kaggle. We will post a link when it is available.

We have released a survey to our readers and members focused on the UK Government’s proposed AI Strategy. We are passionate about making sure the government focuses on the right things in this area, and feel that, as the organisation representing technical Data Science and AI practitioners, we need to make sure our voice is heard. If you haven’t already, please give us your thoughts by participating here.

The full programme for this year’s RSS Conference, which takes place in Manchester from 6-9 September, has been confirmed.  The programme includes keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers.  Registration is open with early-bird discounts available until Friday 4 June. 

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. On June 30th, the meetup hosted Frank Willett (Research Scientist at Stanford University) for a talk titled “High-performance brain-to-text communication via handwriting”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

Imagine a world where a state government, or other actor, can realistically manipulate images to show either nothing there or a different layout
"68% chose the option declaring that ethical principles focused primarily on the public good will not be employed in most AI systems by 2030"
Our method will facilitate deepfake detection and tracing in real-world settings, where the deepfake image itself is often the only information detectors have to work with.

Developments in Data Science…
As always, lots of new developments…

In remote sensing images, we can use temporal information to obtain pairs of images from the same location at different points in time, which we call seasonal positive pairs. Seasonal changes provide more semantically meaningful content than artificial transformations, and remote sensing images provide this natural augmentation for free.
  • Facebook have released ‘TextStyleBrush’ allowing you to emulate a text style in an image using just a single word
  • Generating realistic synthetic video is computationally intensive – new work out of UC Berkeley, called VideoGPT, uses novel approaches to make the whole process more efficient, allowing anyone to generate video on a standalone computer.
  • A Chinese lab is challenging the supremacy of Google and OpenAI in the language model space with a model containing 1.7 trillion parameters. Interestingly, the original article seems to have been removed – although copies are still available online, with more technical details:
The Chinese lab claims that Wudao's sub-models achieved better performance than previous models, beating OpenAI’s CLIP and Google’s ALIGN on English image and text indexing in the Microsoft COCO dataset
"Will better engineering produce CNNs [Convolutional Neural Networks] that understand sameness and difference in the generalizable way that children do? Or are CNNs’ abstract-reasoning powers fundamentally limited, no matter how cleverly they’re built and trained?"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

  • I’m not familiar with the underlying challenge, but I understand that this is a big breakthrough (Nature paper here): a team at Google has automated the design of the physical layout of computer chips using deep reinforcement learning.
  • This is pretty compelling- well worth a read: Facebook AI have released details of their advanced object recognition system which allows consumers to shop items from images. It uses an elegant compound approach, modelling the objects and attributes separately as well as multi-modal signals. Also good to see they are attempting to avoid bias by building and monitoring the models appropriately:
"As part of our ongoing efforts to improve the algorithmic fairness of models we build, we trained and evaluated our AI models across subgroups, including 15 countries and four age buckets."
“Welcome to Hardcore High School” bellowed the script kiddo. We had just gotten to the kindergarten level when the music and lights began to blink. I frowned. “What is that?”
“Beats me” said the A.I. As he walked down the halls, mimicking the sounds of the various musical instruments, he fiddled with the script kiddo a bit. “Welcome to Hardcore High School” He said again, a bit more softly this time.

How does that work?
A new section on understanding different approaches and techniques

Getting it live
How to drive ML into production

"On a daily average, there are over 4,000 models at Facebook running on PyTorch"
  • The importance of data preparation and curation in the ML lifecycle is highlighted in this piece on Data Cascades from Google Research.
"One of the most common causes of data cascades is when models that are trained on noise-free datasets are deployed in the often-noisy real world. For example, a common type of data cascade originates from model drifts, which occur when target and independent variables deviate, resulting in less accurate models"

From Prediction to Decision
The art and science of decision making

  • Lovely extended essay from Hannah Fry on the history of graphs and how they help us understand data and make decisions
  • An excellent article published in HBR from Michael Ross on why company investments in AI often don’t generate the gains they expect (the asymmetric cost function is particularly interesting)
(1) They don’t ask the right question, and end up directing AI to solve the wrong problem. 
(2) They don’t recognize the differences between the value of being right and the costs of being wrong, and assume all prediction mistakes are equivalent. 
(3) They don’t leverage AI’s ability to make far more frequent and granular decisions, and keep following their old practices
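
Point (2) above, the asymmetric cost function, can be made concrete with a small sketch. The numbers and function names below are hypothetical and are not taken from the HBR article; the sketch also assumes the model outputs well-calibrated probabilities. The idea is simply that the probability at which it becomes worth acting depends on the relative cost of a false positive and a false negative, rather than defaulting to 0.5.

```python
def break_even_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting has lower expected cost than not acting."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def expected_cost(probability, act, cost_false_positive, cost_false_negative):
    """Expected cost of a decision given the predicted probability of the event."""
    if act:
        # Acting is only wasted when the event does not happen.
        return (1 - probability) * cost_false_positive
    # Not acting is only costly when the event does happen.
    return probability * cost_false_negative


def should_act(probability, cost_false_positive, cost_false_negative):
    return probability > break_even_threshold(cost_false_positive, cost_false_negative)


# Hypothetical example: a missed event costs nine times as much as a false alarm,
# so it is worth acting on predictions far below 50%.
print(break_even_threshold(1, 9))
print(should_act(0.2, 1, 9), should_act(0.2, 1, 1))
```

With symmetric costs the break-even point is the familiar 0.5; once a miss is much more expensive than a false alarm, a 20% prediction is already enough to justify acting.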

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

Covid Corner

Again, more positive progress in the UK on the Covid front with 45m people now having received their first vaccine dose and over 30m fully vaccinated. However, the new Delta variant originating in India is cause for concern and case rates and hospitalisations are now rising again.

Updates from Members and Contributors

Everyone must be out enjoying themselves as there are no specific updates from members and contributors this month- let me know if you’d like to include anything here next month.

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

The UK AI Roadmap: your expert views needed

This year will see the first version of an AI Strategy from the UK government. Led by the Office for AI, this strategy will build on the AI Roadmap (which was published in January 2021).

If you work in data science or AI, the AI strategy will affect your career.

The Data Science Section of the Royal Statistical Society will ensure the voice of technical practitioners is heard – and that decisions are made with your interests in mind.

However, we cannot do this without your help. Please fill in the UK Artificial Intelligence Strategy Survey and give us your expert views. It will take less than 5 minutes to complete.

This is a great opportunity. The government is attempting to embrace data science and AI. Help us make sure the strategy focuses on the areas that will really make a difference.

Many thanks!

RSS plans for engaging with the development of the government’s AI strategy

The government’s AI Roadmap, published at the start of 2021, sets the direction for the development of a national AI strategy.
As the roadmap is developed into a strategy, there is a vital role for the RSS to play in setting out the role that statistics and data science have to play in the wider national strategy. There are two senses in which this perspective is important:

  • The RSS as a membership organisation has access to the experience and expertise of over a thousand professional data scientists who, as practitioners, focus on the types of issues which are central to the roadmap – there is an opportunity to strengthen the strategy by ensuring that these experiences are represented in the development of the strategy.
  • As they stand, the government’s plans do not show an appreciation of the role that statistics will have to play in the strategy. The disciplines of statistics and data science are closely related, and the role of statistics – as well as data science – should be reflected in the AI strategy.

The RSS – led by our Data Science Section – is kicking off a programme of work to shape the AI strategy and ensure that both the discipline of statistics and the experience of working data scientists are reflected in the strategy.

To start this work, we are planning to highlight a number of questions concerning the practice of data science in order to further inform the roadmap. We will be launching a survey soon to help gather intelligence from the community to support this work.

We are also planning a series of events and roundtables to discuss these issues. These events will help us share knowledge and refine our thinking, as well as engage directly with government stakeholders.

This is an important point in the development of AI as countries seek to position themselves as leaders in the field. The UK is well-positioned to lead on many areas of AI – but the strategy must be right and we hope to be able to help shape the strategy in the coming months.

RSS Chief Executive Stian Westlake said:

“The RSS welcomes the development of a national AI strategy, but it is important that the views of practitioners are represented in the process. With our strong data science section, the RSS is uniquely placed to access the perspective of practitioners and there is a vital role for us to play in ensuring that this is represented as the strategy develops.”

June Newsletter

Hi everyone-

It’s a bank holiday weekend – again – so that means it’s June and hopefully some warmer weather as May has definitely not delivered on that front … perhaps a few curated data science reading materials might prove useful for sunshine in the garden?

Following is the June edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity … We are continuing with our move of Covid Corner to the end to change the focus a little.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science June 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

We are now ‘two for two’ on our ‘Fireside chat’ series! Following on from our fantastic discussion with Andrew Ng, Giles Pavey hosted an engaging and enlightening conversation with Anthony Goldbloom on May 20th. Anthony is founder and CEO of Kaggle (now a Google company), the world’s largest data science and machine learning community. There was a great deal of insight into the evolution of data science over the 10 years Kaggle has been running as well as lots of audience questions. We will distill the session down and publish a summary shortly.

We will soon be releasing a survey to our readers and members focused on the UK Government’s proposed AI Strategy. We are passionate about making sure the government focuses on the right things in this area, and feel like true Data Science and AI practitioners need to feed into this process. So when you see the survey, do please take the time to fill it out if you can!

The full programme for this year’s RSS Conference, which takes place in Manchester from 6-9 September, has been confirmed.  The programme includes keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers.  Registration is open with early-bird discounts available until Friday 4 June. 
In addition, the RSS now has a new accreditation – Data Analyst.

Data Analyst is a registered form of professional membership status that provides formal recognition of a member’s statistical training and work-based experience at entry level

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. The last event was on 24th May where Christian Szegedy, machine learning and AI researcher at Google Research, gave a talk titled ‘The Inverse Mindset of Machine Learning’. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

The real danger wasn’t “Deep Fakes.” The real danger is cheap fakes, fakes that can be produced quickly, easily, in bulk, and at virtually no cost
  • Regulators are rightly becoming increasingly active in an attempt to combat these issues. This HBR article helps map out what organisations need to know to be prepared.
  • We all know how complex ML models are becoming and the scale at which some of them now operate, and so we have to be open to the fact that mistakes will happen. The critical question becomes: what do you do about it when the issue surfaces? Twitter has taken a positive and transparent approach to dealing with some of their previous bias related issues in automated cropping, releasing a detailed and technical analysis about why it was happening and the steps they are taking to remove the bias:
We want to thank you for sharing your open feedback and criticism of this algorithm with us. As we discussed in our recent blog post about our Responsible ML initiatives, Twitter is committed to providing more transparency around the ways we’re investigating and investing in understanding the potential harms that result from the use of algorithmic decision systems like ML.
  • Really interesting discussion on Kara Swisher’s Sway podcast with Daniel Kahneman (renowned behavioural economist – “Thinking, Fast and Slow”) delving into why we require much higher accuracy from computers and technology than from humans before we are willing to trust them.
  • And in a similar vein, this is thought-provoking – does more data necessarily mean better decision making?
  • Less specifically focused on bias and ethics, but really interesting commentary from Benedict Evans on Amazon and how much it really knows about what it sells, touching on how much of a responsibility a platform has for moderation of its own recommendation content.
Of Amazon’s top 50 best-sellers in “Children's Vaccination & Immunisation”, close to 20 are by anti-vaccine polemicists, and 5 are novels about fictional pandemics

Developments in Data Science…
As always, lots of new developments…

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

How does that work?
A new section on understanding different approaches and techniques

Getting it live
How to drive ML into production

"For me, teaching this course was an unusual experience. MLOps standards and tools are still evolving, so it was exciting to survey the field and try to convey to you the cutting edge. I hope you will find it equally exciting to learn about this frontier of ML development, and that the skills you gain from this will help you build and deploy valuable ML systems." Andrew Ng

The Art of Visualisation
Making data science look right..

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

Covid Corner

Again, more positive progress in the UK on the Covid front with over 40m people now having received their first vaccine dose and over 25m fully vaccinated. However, the new variant originating in India is cause for concern.

 Experts gave a median estimate of 30,000 Covid deaths by the end of the year, whereas the non-experts said 20,000. The truth was around 75,000

Updates from Members and Contributors

  • Harald Carlens has put together a very useful comparison of cloud GPU services and pricing – definitely check it out if you are using deep learning in the cloud.
  • Lucie Burgess would like to announce an interesting set of discussions around the provenance and legality of automated decisions taking place on June 15th and June 22nd. Helix Data Innovation are running the sessions on behalf of the PLEAD project (King’s College London, University of Southampton, with partners Experian, Roke and Southampton Connect) – sign up here for what should be a good discussion on a very relevant topic
  • Kevin O’Brien highlights the upcoming UseR! 2021 conference on 5-9th of July – a must see for those R users out there

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS