July Newsletter

Hi everyone-

Welcome to July! Inflation, union strikes, sunshine … lots of commentary drawing parallels to the mid-70s. One thing that is very different from that period is the world of data science (which didn’t even exist as a discipline) – crazy to think that the Apple II launched in ’77 with 4 KB RAM, 4 million times less memory than the laptop I’m writing this on…

Following is the July edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. We’ll take a break in August, so fingers crossed this sees you through to the beginning of September…

As always, any and all feedback is most welcome! If you like these, please do send them on to your friends – we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically, you can do so here.

Industrial Strength Data Science July 2022 Newsletter

RSS Data Science Section

Committee Activities

Committee members continue to be actively involved in a joint initiative between the RSS and various other bodies (the Chartered Institute for IT (BCS), the Operational Research Society (ORS), the Royal Academy of Engineering (RAEng), the National Physical Laboratory (NPL), the Royal Society and the Institute of Mathematics and its Applications (IMA)) to define standards for data scientist accreditation, with plans underway to launch the Advanced Certificate shortly.

We are very excited to announce our next meetup, “From paper to pitch, success in academic/industry collaboration”, which will take place on Wednesday 20th July from 7pm-9pm. We believe that there is huge potential in greater collaboration between industry and academia and have invited two excellent speakers to provide examples of how this can work in practice. This should be a thought-provoking, very relevant (and free) event – sign up here.

The full programme is now available for the September RSS 2022 Conference. The Data Science and AI Section is running what will undoubtedly be the best session(!) … ‘The secret sauce of open source’, which will discuss using open source to bridge the gap between academia and industry.

As mentioned last time, Janet Bastiman (Chief Data Scientist at Napier AI) recently spoke at the FinTech FinCrime Exchange Conference (FFECON) in a panel session entitled “With great AI power comes great FinCrime responsibility”: cool summary from the discussion…

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active in organising events. The next event will be on July 13th, when Stéphane d’Ascoli, PhD candidate at Facebook AI, discusses “Solving Symbolic Regression with Transformers”. Videos are posted on the meetup YouTube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"From California to Colorado and Pennsylvania, as child welfare agencies use or consider implementing algorithms, an AP review identified concerns about transparency, reliability and racial disparities in the use of the technology, including their potential to harden bias in the child welfare system."
"In summary, GPT-4chan resulted in a large amount of public discussion and media coverage, with AI researchers generally being critical of Kilcher’s actions and many others disagreeing with these criticisms. This sequence of events was generally predictable, so much so that I was able to prompt GPT-3 – which has no knowledge whatsoever about current events – to summarize the controversy somewhat accurately"
"Cohere, OpenAI, and AI21 Labs have developed a preliminary set of best practices applicable to any organization developing or deploying large language models. Computers that can read and write are here, and they have the potential to fundamentally impact daily life.

The future of human-machine interaction is full of possibility and promise, but any powerful technology needs careful deployment. The joint statement below represents a step towards building a community to address the global challenges presented by AI progress, and we encourage other organizations who would like to participate to get in touch."
  • Of course the sad truth is that, in simplistic terms, this type of model is basically regurgitating the same biases present in the material it was trained on. Some thought-provoking analysis from Textio highlighting the inherent biases present in performance feedback.
  • A Google researcher (since placed on administrative leave…) caused controversy by claiming that one of these Large Language Models (in this case Google’s LaMDA) was sentient – good summary in Wired here. The Guardian followed up on this with some thoughtful pieces on how the model works, and why we are prone to be fooled by mimicry.
"It’s strategic transparency. They get to come out and say they’re helping researchers and they’re fighting misinformation on their platforms, but they’re not really showing the whole picture."
"While AI can calculate, retrieve, and employ programming that performs limited rational analyses, it lacks the calculus to properly dissect more emotional or unconscious components of human intelligence that are described by psychologists as system 1 thinking."
"China’s ambition to collect a staggering amount of personal data from everyday citizens is more expansive than previously known, a Times investigation has found. Phone-tracking devices are now everywhere. The police are creating some of the largest DNA databases in the world. And the authorities are building upon facial recognition technology to collect voice prints from the general public."
"Police can not only obtain search histories from a pregnant person’s device, but can also obtain records directly from search engines, and sometimes they don’t even need a warrant."

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

"We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models"
"To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant."
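The selection rule the RHO-LOSS abstract describes can be sketched in a few lines. This is a toy illustration under my own assumptions (function and variable names are made up; the real method trains a separate small "irreducible loss" model on a holdout set to estimate the per-example losses used here):

```python
import numpy as np

def rho_loss_select(train_loss, irreducible_loss, batch_size):
    """Select the points whose *reducible* loss is largest.

    train_loss:       current model's per-example loss on a candidate batch
    irreducible_loss: per-example loss from a model trained only on a holdout
                      set (approximates noise / unlearnable signal)
    """
    reducible = train_loss - irreducible_loss
    # Highest reducible loss first: points that are hard *and* learnable.
    return np.argsort(reducible)[::-1][:batch_size]

# Toy example: point 1 has high loss but is mostly noise (high irreducible
# loss), so it ranks below point 2, which the model can still learn from.
train_loss  = np.array([0.2, 3.0, 2.5, 0.1])
irreducible = np.array([0.1, 2.9, 0.3, 0.1])
selected = rho_loss_select(train_loss, irreducible, batch_size=2)
print(selected)  # → [2 1]
```

The key contrast with "hard example" selection is the subtraction: a noisy point has high loss under both models, so its reducible loss is small and it is skipped.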
"In experiments on medium-sized tabular data with about 10,000 samples, Hopular outperforms XGBoost, CatBoost, LightGBM and a state-of-the-art Deep Learning method designed for tabular data"
"Parti treats text-to-image generation as a sequence-to-sequence modeling problem, analogous to machine translation – this allows it to benefit from advances in large language models, especially capabilities that are unlocked by scaling data and model sizes"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

"Using a neural network trained on widely available weather forecasts and historical turbine data, we configured the DeepMind system to predict wind power output 36 hours ahead of actual generation. Based on these predictions, our model recommends how to make optimal hourly delivery commitments to the power grid a full day in advance"
"This was AlphaFold 2, which was published in July 2021. It had a level of atomic accuracy of less than one angstrom. I work with a lot of colleagues in structural biology. They've spent years to determine the structure of a protein and many times they never solve it. But not only do you produce confidence measures, you also — anyone — can put in their favorite protein and see how it works in seconds. And you also get feedback from the user. You also linked up with the European Bioinformatics Institute (EMBL-EBI). It's open-source and it's free."

More DALL-E fun…
DALL-E is still making headlines so we’ll keep serving up a few fun posts!

"We discover that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts. For example, it seems that ‘Apoploe vesrreaitais’ means birds and ‘Contarra ccetnxniams luryca tanniounons’ (sometimes) means bugs or pests"

How does that work?
Tutorials and deep dives on different approaches and techniques

"An important point: if you train the first level on the whole dataset first and then the second level, you will get a leakage in the data. At the second level, the content score of matrix factorization will take into account the targeting information"
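The leakage point above is worth making concrete. A minimal sketch, under my own assumptions (the `fit_*`/`score_*` names are placeholders, not the article's code): rather than fitting the level-1 scorer on the whole dataset, split the rows so the level-2 model only ever consumes level-1 scores for rows that level 1 was not trained on.

```python
import numpy as np

# Split interactions into two disjoint folds: level 1 (e.g. matrix
# factorisation) trains on one, and produces scores for the other, which
# the level-2 ranker then uses as a feature. Because the folds share no
# rows, the level-1 scores cannot encode the level-2 targets.
n = 1000
rng = np.random.default_rng(1)
idx = rng.permutation(n)
level1_idx, level2_idx = idx[: n // 2], idx[n // 2:]

# fit_level1(data[level1_idx])                       # train the scorer
# scores = score_level1(data[level2_idx])            # score unseen rows
# fit_level2(features=scores, target=y[level2_idx])  # train the ranker

print(len(set(level1_idx) & set(level2_idx)))  # → 0, no shared rows
```

The leaky version is the same pipeline with `level1_idx` set to all rows: level 1 has then memorised the very targets level 2 is trying to predict.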
  • Have you been wanting to explore GPT-3 but didn’t know where to start? Here you go!
"I think a big reason people have been put off trying out GPT-3 is that OpenAI market it as the OpenAI API. This sounds like something that’s going to require quite a bit of work to get started with.

But access to the API includes access to the GPT-3 playground, which is an interface that is incredibly easy to use. You get a text box, you type things in it, you press the “Execute” button. That’s all you need to know."
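Once you've outgrown the playground, the same completions endpoint is reachable from code. A minimal sketch (model name, prompt and key handling are all illustrative; the `openai` Python package wraps this same HTTP endpoint):

```python
import json
import urllib.request  # used only by the commented-out real call below

def completion_request(prompt, model="text-davinci-002", max_tokens=64):
    """Build the JSON payload for a GPT-3 completions call."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

payload = completion_request("Write a haiku about data science.")

# To actually send it you need your own API key from the OpenAI dashboard:
# req = urllib.request.Request(
#     "https://api.openai.com/v1/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": "Bearer YOUR_API_KEY",
#              "Content-Type": "application/json"},
# )
# response = json.load(urllib.request.urlopen(req))

print(payload["model"])
```

Exactly as the quote says: the hard part is realising there is no hard part.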
  • I’m a regular user of Jupyter Lab (and notebooks) … but I’ve never used it to build a web app! Lots of useful tips here
  • And … it’s live! Andrew Ng’s new foundational course in Machine Learning is open for enrolment – if you do one course, do this one
"Newly rebuilt and expanded into 3 courses, the updated Specialization teaches foundational AI concepts through an intuitive visual approach, before introducing the code needed to implement the algorithms and the underlying math."

Practical tips
How to drive analytics and ML into production

Bigger picture ideas
Longer, thought-provoking reads – lean back and pour a drink! A few extras this month to get you through the long summer…

"At the heart of this debate are two different visions of the role of symbols in intelligence, both biological and mechanical: one holds that symbolic reasoning must be hard-coded from the outset and the other holds it can be learned through experience, by machines and humans alike. As such, the stakes are not just about the most practical way forward, but also how we should understand human intelligence — and, thus, how we should pursue human-level artificial intelligence."
"Now it is true that GPT-3 is genuinely better than GPT-2, and maybe true that InstructGPT is genuinely better than GPT-3. I do think that for any given example, the probability of a correct answer has gone up...

...But I see no reason whatsoever to think that the underlying problem — a lack of cognitive models of the world — has been remedied. The improvements, such as they are, come primarily because the newer models have larger and larger sets of data about how human beings use word sequences, and bigger word sequences are certainly helpful for pattern-matching machines. But they still don’t convey genuine comprehension, and so they are still very easy for Ernie and me (or anyone else who cares to try) to break."
"There’s an important point about expertise hidden in here: we expect our AGIs to be “experts” (to beat top-level Chess and Go players), but as a human, I’m only fair at chess and poor at Go. Does human intelligence require expertise? (Hint: re-read Turing’s original paper about the Imitation Game, and check the computer’s answers.) And if so, what kind of expertise? Humans are capable of broad but limited expertise in many areas, combined with deep expertise in a small number of areas. So this argument is really about terminology: could Gato be a step towards human-level intelligence (limited expertise for a large number of tasks), but not general intelligence?"
"For those not well-versed in chess, here’s a summary of what happened. The first three or four moves were a fairly standard opening from both sides. Then, the AI began making massive blunders, even throwing away its queen. Finally, as the vice began to close around its king, the AI eventually made an illegal move, losing the game.

All in all, a pretty solid showing: it understood the format, (mostly) knew what moves were legal, and even played a decent opening. But this AI is not good at chess. Certainly, nothing close to 5000 ELO.

Is this just a “flub”, which will be fixed by scale? Will a future, even-larger GPT be the world chess champion? I don’t believe so."
"In January, 2021, Microsoft filed a patent to reincarnate people digitally through distinct voice fonts appended to lingual identities garnered from their social media accounts. I don’t see any reason why it can’t work. I believe that, if my grandchildren want to ask me a question after I’m dead, they will have access to a machine that will give them an answer and in my voice. That’s not a “new soul.” It is a mechanical tongue, an artificial person, a virtual being. The application of machine learning to natural language processing achieves the imitation of consciousness, not consciousness itself, and it is not science fiction. It is now."
  • Don’t worry – if we eventually get to an artificial general intelligence that everyone agrees on, we have a thoughtful taxonomy of all the ways it could kill us (AGI ruin: a list of lethalities)!

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

  • The latest results from the ONS tracking study estimate that 1 in 30 people in England (1 in 18 in Scotland) have Covid. Sadly this has risen (from 1 in 60 last month) due to infections compatible with the Omicron variants BA.4 and BA.5, but it is at least below its peak of 1 in 14… Still a far cry from the 1 in 1000 we had last summer.
  • Promising research on the use of fitness tracker data to detect Covid early
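"1 in N" figures get hard to compare when N keeps changing, so here they are converted to percentages (a trivial calculation, included only to make the month-on-month movement easier to see):

```python
def one_in_n_to_pct(n):
    """Convert a '1 in N' prevalence figure to a percentage."""
    return 100.0 / n

# The ONS figures quoted above.
for label, n in [("this month (England)", 30), ("last month", 60),
                 ("peak", 14), ("last summer", 1000)]:
    print(f"{label}: 1 in {n} = {one_in_n_to_pct(n):.2f}%")
```

So current prevalence (~3.3%) is roughly half its peak (~7.1%), but over thirty times last summer's (~0.1%).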

Updates from Members and Contributors

  • Arthur Turrell has some excellent updates from the ONS Data Science Campus:
    • The ONS Data Science Campus was involved in this widely covered ONS piece on the cost of living inspired by Jack Monroe and other food campaigners.
    • ‘Making text count: Economic forecasting using newspaper text’, a collaboration across multiple institutions for which Arthur is a co-author, was published in the Journal of Applied Econometrics and shows how machine learning plus newspaper text can improve macroeconomic forecasts.
    • We released a package from the Campus for evaluating how well synthetic data matches real data. Repository here, blog post here.


Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

Again, hope you found this useful. Please do send it on to your friends – we are looking to build a strong community of data science practitioners – and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS
