Time flies- even with the extra day, February felt pretty short…
Anyway, here’s round 2 of the Royal Statistical Society Data Science Section monthly newsletter- any and all feedback most welcome!
If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
Industrial Strength Data Science March 2020 Newsletter
RSS Data Science Section
Section and Member Activities
Jim Weatherall is hosting our next RSS DSS event, which is in Manchester on the 18th March. It will be an expert panel discussion focused on skills and ethics for modern data science- sign up for free tickets here
Danielle Belgrave has a busy few weeks coming up! She is co-organising the Advances in Data Science event – more info here – in Manchester (June 22-23) where Anjali Mazumder is a keynote speaker. In addition she is tutorial chair for NeurIPS – any tutorial proposals from the community would be very welcome. Finally, she is giving an upcoming talk (March 12th) at an Imperial College diversity event with other women in AI including 2 other panelists and speakers from DeepMind (Marta Garnelo and Laura Weidinger). More info here.
As we collectively plough on with leaving the EU, it was interesting to see the EU’s take on AI: “Prepare for socio-economic changes brought about by AI”…
On the practical applications of machine learning front, there were a couple of compelling results in the health/pharma area.
First of all, “Powerful antibiotics discovered using AI” highlights the use of Machine Learning techniques to identify successful (and unforeseen) molecular combinations with antibiotic properties from a sizeable pool of potential candidates.
In addition Uber has released “manifold” which helps debug machine learning models using various easy-to-use interactive visualisations. This is similar to Google’s What-If tool and highlights the increasing number of out-of-the-box ‘explainability’ options for ML models (in addition to SHAP, LIME etc.).
For those into Causality (and everything involved…) this was a good read – “In this post we explain a Bayesian approach to inferring the impact of interventions or actions” – although you may need a quiet spot and a bit of time!
We thought it could be useful (and fun) to pick the collective brains of our Data Science Section committee members (as well as those of our impressive array of subscribers and followers) and put together a monthly newsletter. This will undoubtedly be biased but will hopefully surface materials that we collectively feel are interesting and relevant to the data science community at large.
So, without further ado, here goes our first attempt, creatively titled…
Industrial Strength Data Science Feb 2020 Newsletter
RSS Data Science Section
To give this some vague attempt at structure, we thought we would roughly break the newsletter down into three sections: Section and Member Activities; Posts We Like; Upcoming Events
It is easy to assume there is always a right way and a wrong way to do data science, and certainly in many instances some approaches are objectively better than others. However, we all know that often it is far more nuanced than non-practitioners might assume- here’s an opinionated guide to Machine Learning we found interesting
Regardless of your views on Facebook as a product, they employ some pretty impressive data scientists and produce some pretty impressive work (e.g. Prophet is great if you’ve not come across it). Reproducibility in machine learning is an increasingly important topic, and is surprisingly (or not so to those who do it…) difficult. While it is key in academia in order to build on the foundations of others, it is also crucial in an industrial setting to make sure you have complete audit trails and can reproduce decisions made in the past. This piece from the Facebook AI group provided some interesting commentary
Finally, understanding why a machine learning model produces a given output is also an increasingly hot topic. Even though fundamentally the multi-dimensional nature of the underlying models makes it very complex and hard to “boil down” to a simple explanation, the field of ‘model explainability’ is looking to do so, and we found this a useful primer on the topic
Last week, the Prime Minister’s chief strategic adviser – Dominic Cummings – wrote a blog which attracted a huge amount of media attention. He called for a radical new approach to civil service recruitment – suggesting that data scientists (among others) should play increasingly important roles.
But while data scientists were top of Cummings’ list, it was his call, later on, for more ‘weirdos’ in Whitehall which really caught the media’s imagination. Here, we outline some do’s and don’ts when building a data science team.
For anyone kicking off the year with a new data science initiative, we applaud you! Embedding data and technology into decision making processes can be a wonderful thing. To help you along your way, here are a few do’s and don’ts that have been borne out of experience.
Don’t… Assume R&D is easy Do… Appoint a technical leader
If you’ve been tasked with managing this initiative, but you’re not an experienced data scientist, then you need someone who is. You need a team leader who lives and breathes selection bias, measurement bias, and knows when a result is meaningless. Without this experience in your team you will at best waste time and resources, and at worst create dangerously unsound technology.
Don’t… Just hire weirdos and misfits Do… Carefully craft your team
The notion that data scientists are geniuses who can solve all your problems, armed only with a computer and some data, is flattering – but ridiculous. Data scientists come in many flavours, with different interests and experience, and the problems worth solving require a team effort – with the best ideas coming from diverse teams who can communicate well.
Don’t… Trust textbook knowledge alone Do… Hire for experience too
There is data science knowledge you can glean from a textbook, and then there is the hard-earned stuff you learn from years of building models and algorithms with real data, implemented in the real world. Nothing makes you understand overfitting and the limits of theoretical models like living through that cycle a few (hundred) times.
Don’t… Ignore ethical issues Do… Take an ethics-first approach
Get ahead of any ethical and legal issues with your work, or the data you are using. Don’t assume it’s OK to do something just because you heard a Silicon Valley start-up does it like that.
Don’t… Obsess on the latest academic papers Do… Identify questions
Normal rules of business apply to data science; you want a return for your investment. Start by identifying the intersection of high-value business problems and the information contained in the data. You could ‘dart about’, trying out ideas from cool papers you’ve read, to see if anything useful comes out. But such unstructured work is akin to randomly digging for treasure on a beach. Get yourself a metal detector—identify business problems first.
Don’t… Show off Do… Keep it simple, stupid
Unless you have been specifically asked to build something superficially clever and incomprehensible (and this is a genuine objective for some), then you should use interpretable models first. Often this will be good enough. Only introduce complexity if you need to, and use a simple model as a baseline against which you can measure improvements.
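To make the baseline idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the synthetic data, the helper names (`fit_line`, `mean_absolute_error`) and the choice of mean absolute error as the yardstick. It compares a "predict the average" baseline with a single-feature straight-line fit on held-out data.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Synthetic data (hypothetical, for illustration): y = 3x + 2 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 points so both approaches are scored on unseen data.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

# Baseline: always predict the mean of the training targets.
baseline_prediction = mean(train_y)
baseline_mae = mean_absolute_error(test_y, [baseline_prediction] * len(test_y))

# Interpretable model: a straight line whose two coefficients can be read directly.
slope, intercept = fit_line(train_x, train_y)
line_mae = mean_absolute_error(test_y, [slope * x + intercept for x in test_x])

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"linear MAE:   {line_mae:.2f}")
```

Any more complex model would then have to beat the straight line's error on the same held-out data, by a margin that justifies the loss of interpretability.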
Don’t… Propagate hype Do… Manage expectations
So, you’ve been thrown some resources to set up a data science team and you’re embedded in an organisation that doesn’t necessarily understand what data science is. With such power comes responsibility! Avoid hype. Manage expectations. Help your peers and leaders understand what you are doing, and make sure they have input to it. This is a joint effort and they bring important domain knowledge. Agree on goals, and be transparent about progress.
Don’t… Command and control Do… Create a scientific culture
Do your team feel they can challenge the scientific views of the leadership – or are they scared of being ‘binned’ if they step out of line? Your team is on a mission to solve a problem, and it is unlikely the path will be an easy one. Your data scientists will spend most of their time stuck, navigating a sea of unknowns, while in pursuit of answers. Scientists need to be able to talk freely about what they do and don’t know, and to share ideas with each other without any sense of one-upmanship.
Inaugural Industrial Strength Data Science event report
On Thursday May 16th, The Royal Statistical Society’s Data Science Section hosted our inaugural Industrial Strength Data Science event of the year at the RSS headquarters in central London. The event was titled “We are not unicorns” and consisted of a panel discussion on a range of topics centered around the current state of data science in industry today, and how external expectations are affecting the success or failure of data science projects and teams.
We assembled an experienced panel of data science practitioners:
Adam Davison, Head of Insight and Data Science at The Economist (AD)
Kate Land, Chief Data Scientist at Havelock London (KL)
Simon Raper, Founder at Coppelia Machine Learning and Analytics (SR)
Magnus Rattray, Director of the Data Science Institute at the University of Manchester (MR)
And the event was very ably hosted by Magda Piatkowska (Head of Data Science Solutions, BBC) and opened by Martin Goodson (CEO Evolution AI, and chair of the RSS Data Science Section)
We had a lively debate, together with some excellent audience interaction and participation which continued over drinks later in the evening. Some key takeaways include:
Data science hype is driving unrealistic expectations both from data scientists (about what they will be working on), and from businesses (about what they will be able to achieve).
To mitigate this, data science leaders need to work closely with business stakeholders and sponsors to clearly define the problems to be addressed and the actions to be taken on delivery of data science projects.
In addition, they need to recruit for more general skills including stats and coding as well as key attributes such as curiosity and pragmatism, and be clear with candidates on the type and variety of work that will be undertaken on a day-to-day basis.
Data science leaders need to drive buy-in for efficient data and analytics platforms and drive self-sufficiency within the data teams by leveraging engineering best practice and serverless cloud based services.
Below is a more detailed summary of the key discussion points – the full video of the event can be viewed here and below.
“Effects of the hype”
After introductions and quick biographies, we started with some comments around the evolution of data science as a capability, highlighting the positive benefits of bringing together quantitative practitioners from different functional areas of a business to share experiences and approaches. In academia, MR explained how historically, the techniques currently found in data science were predominantly explored in maths and computer science departments, but that there has been a move to where the data is generated- more physics and biology based research. This has led to more isolated researchers, and so the rise of the cross functional data science department has similarly reduced this isolation.
We then moved on to questions around the effect of all the data science hype. Firstly we discussed the effects on practitioners- with all the hyperbole in the press, and the breakthroughs released by Google on a regular basis, it is not surprising that many data science practitioners can feel they are “not the authentic data scientist” (KL) unless they are uncovering new deep learning architectures or working on petabyte scale problems. Of course this is one of the key purposes of these types of discussions, to demystify what actually goes on and highlight the fact that data science can drive incredibly positive impact in a business setting without needing to push the boundaries of research or reinvent the wheel. A key component of the recruitment process has to be explaining the type and variety of work expected from candidates and making sure this is aligned to expectations.
We moved on to discuss the hype effect on business, the fact that CEOs and business leaders are feeling pressured to invest in “AI” without really knowing what it is and how it can help. This can be a “recipe for disaster” (PS), as teams of data scientists are hired without a clear remit and without the right infrastructure in place. “You can’t do AI without machine learning, you can’t do machine learning without analytics, and you can’t do analytics without data infrastructure” (PS quoting Hilary Mason)- businesses often jump to the top of the tree without building the foundations (pulling the data together in one place, data engineering). “A lot of companies think they are ready for data science but are probably not” (MR).
Are these fundamental misunderstandings based on the hype contributing to a perceived lack of success? Likely so. One key component is having senior business leaders (chief data scientists or chief data officers) who understand more than the hype and can help educate decision makers to direct the efforts on tractable problems. What is the “signal to noise of the problem” (KL): it should be possible to differentiate between cat and dog images, but the signal needed to predict the direction of a stock’s movement might not be in the data.
One final discussion point around hype was the benefits of embracing it. Although there was general consensus that true general intelligence (AGI) was still some way off, there were tangible benefits from a marketing and funding perspective to embracing the term. The Turing Institute successfully headed off other “AI” focused entities by incorporating the term (MR), and it might well be worth data science teams embracing the term despite any misgivings if only to avoid “AI teams” springing up in the same organisation.
“What does good look like”
An additional consequence of the hype is a recruiting process focused on buzzwords and methods because the recruiting manager doesn’t know what they need- “we want someone who is an expert on Restricted Boltzmann Machines” (SR). There was general agreement that from a recruiting perspective, you want people who are more interested in problem solving than algorithm development although a solid background in probability with strong quantitative fundamentals is important so you can understand how different techniques work, what assumptions are made and where the gotchas lie.
Another theme that came out was around the makeup of a good team, whether specifically in data science or more broadly across data in general. The team needs a variety of skills ranging from business and process understanding to strong statistical methods, to strong production standard coding (the classic Venn diagram) but although individuals should be encouraged to gain skills in all areas, it is the team that becomes the unicorn, rather than the individual. The classic “T-shape” profile works well- with general capabilities across a broad range of areas combined with deeper knowledge in one or two.
Another area of discussion was self-sufficiency- data science/data teams need to be self-sufficient with dependencies on tech resources minimised. It is critical to gain agreement from the technology function about who is able to do what, and to instil the requisite skills and processes within the team, so that a model doesn’t need to be re-written to go into production. The increasing prevalence of serverless services in AWS and GCP makes this self-sufficiency much more realistic and data science teams in general much more productive.
This led into a lively conversation about how to set up good data science projects. A key theme was to focus on the problem and be crystal clear with stakeholders on what the outcome of the project would produce and how it would be used, not about what methods would be utilised- SR characterised it elegantly as “solving problems in business with maths”. “Find something you can practically deliver to someone who can take advantage of it” (AD). Stakeholder management and delivering early and often, with feedback on iterations, was a recurring theme. The comparison to the software development process was made- business stakeholders are now used to the concept of software being delivered in an agile and iterative way and we will hopefully see this approach becoming more acceptable and adopted for data science.
We ended with the provocative question – “should all CEOs become chief data scientists?” – which was met with a resounding “No” from the panel: “I’m not very good at golf” (SR).
We concluded with an excellent interactive session with the audience, including many relevant questions:
“To what extent should data science be responsible for production” – general feeling that data science teams should be able to own and manage productionised processes.
“What about role proliferation: research data scientist, product data scientist, machine learning engineer etc…”?- general feeling to be wary of overly specialised job titles, although a realisation that there may come to be some specialisation between automated decision making vs operations research/helping people make better decisions
“What is the best mix of skills for data science teams; what about management skills?”- general agreement that it depends on the scale of the organisation and the team: larger teams in larger more bureaucratic organisations could well benefit from data product/program managers to help manage stakeholders and change. In general though you want people who “can write production code, who are driven to build stuff- not coding up algorithms” (MG).
“What about standards- what is a data scientist, should there be a qualification?”- tricky one- there are definitely core required skills but because the field and roles are still evolving it might be premature. However, the RSS DSS is keen to shape the discussion and our next event in July will be focused on this topic. From an education perspective, “we do need some kind of guidelines over what the masters courses need to deliver” (MR)
“Where should ethics sit- should data scientists own ethics or should it be a separate role?” There was consensus that the potential for doing bad things with data is high, and that data scientists should strive to have high ethical and moral standards. Depending on the organisation, though, there may be specialist roles in compliance or risk departments that should be leveraged/included in the discussion.
“What should be the interaction between data science and behavioural science?” Agreement on a huge overlap between the two, particularly in finance (KL); bring back research teams (SR)!
So, all in all it felt like a very successful and enjoyable evening- do check out the full video below and do let us know in the comments your thoughts on any of these topics, and also any questions you would like to see discussed in the future.