The term ‘data-driven’ is today used almost exclusively in a positive tone, perhaps because it implies logic, or decisions that are somehow based on impartial facts (even though data is neither facts nor impartial). But there’s a substantially more sombre reality to all of it – one of data-driven bad.
You may not think about it, but you are a victim of data-driven bad fairly constantly. That completely irrelevant ‘targeted’ ad on Facebook or wherever? Data-driven bad. Amazon or your store loyalty programme either sending naïve suggestions or pushing products you’d never dream of buying? Data-driven bad.
But it gets much worse.
From policy decisions to business strategies, being ‘data-driven’ can go wrong in many ways.
The most obvious situation is when someone – the government comes to mind pretty often – claims some decision is data-driven when in fact it is not. The data is either non-existent or made up. Why do they bother? Because it works; as noted by Joe Romm in his great book Language Intelligence, “many studies find that repeated exposure to a statement increases its acceptance as true.”
In cases when some actual data is present, one of the most common pitfalls is selection bias, i.e. you pay attention to just the data or the results that best fit your preconceptions. As I wrote earlier, this tendency to ignore undesirable data can result in entire organisations acting irrationally.
Good data, bad data – but can you even tell the difference?
Data in and of itself is often not particularly useful. It needs to be analysed one way or another to make use of it effectively, or to uncover insights. The results of your data-driven endeavour depend on many things, but let’s look at the two obvious ones: quality of the data and quality of the analysis.
Seems straightforward enough, right? Just do good analysis on good data and it’s all good, right?
Well, kind of.
The problem is that most data is not of particularly high quality. And even when you think it is, it may not be.
This is well illustrated by a recent discovery of a 15-year-old bug in software used to analyse functional MRI (fMRI) images; the cool brain activity scans in those “how the brain works” articles? That’s fMRI.
And the bug? It caused false positive rates as high as 70%.
How bad can it be, you may ask? Surely it’s just a matter of minor re-calibration of some results.
Unfortunately no. As The Register put it, “that’s not a gentle nudge that some results might be overstated: it’s more like making a bonfire of thousands of scientific papers.” The validity of some 40,000 fMRI studies spanning more than a decade is now in question.
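The actual bug concerned cluster-level inference inside the fMRI analysis packages, so the arithmetic below is not a model of it – just a hedged back-of-envelope illustration of the underlying danger: when you run many statistical tests without correction, the chance of at least one false positive climbs towards certainty.

```python
# Probability of at least one false positive across k independent tests,
# each run at significance level alpha, with no multiple-comparisons
# correction applied. Purely illustrative; not a model of the fMRI bug.
def familywise_error_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 10, 100):
    print(k, round(familywise_error_rate(k), 3))
# 1   0.05
# 10  0.401
# 100 0.994
```

With a hundred uncorrected tests you are all but guaranteed to “discover” something, which is why fields running thousands of comparisons per study lean so heavily on the correctness of their correction software.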
When much of a field that prided itself on being data-driven, and on using state-of-the-art equipment to acquire said data, now sits under the shadow of data-driven bad, just how confident are you in your organisation’s capability to ensure the quality of its data?
Actually, before you answer that, you should keep in mind that everything is broken – bugs, mistakes and errors leading to unexpected behaviour are everywhere.
Good analysis, bad analysis
Even when you do have good data, it’s not enough. Just like the standard financial results disclaimer of “past performance is no guarantee of future results” falls on deaf ears, so does the #1 principle of statistics, “correlation does not imply causation”.
Both so obviously true, and yet usually ignored – because the alternative is hard. When there is a compelling story to be extracted out of good data and a clear correlation, why bother with the analysis bit?
Not to worry, Big Data is here to make it worse… wait, what?
Close to a decade ago, Wired’s editor-in-chief Chris Anderson embraced this line of thinking, stating “with enough data, the numbers speak for themselves”.
He neglected to mention that if we let the numbers speak for themselves, even good data can lie through its non-existent teeth.
One major problem is that Big Data makes the discovery of spurious correlations so much easier. As noted by one paper on the topic, “too much information tends to behave like very little information”.
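That claim is easy to demonstrate with a small simulation (a sketch of the general phenomenon, not from any particular paper): generate a pile of series that are pure random noise, then go hunting for the strongest pairwise correlation. The more series you have to compare, the more impressive the best “discovery” looks – despite there being nothing to discover.

```python
import random

random.seed(42)

def max_abs_correlation(n_series, n_points):
    """Generate n_series independent random walks and return the largest
    absolute pairwise Pearson correlation found among them."""
    series = []
    for _ in range(n_series):
        x, walk = 0.0, []
        for _ in range(n_points):
            x += random.gauss(0, 1)
            walk.append(x)
        series.append(walk)

    def corr(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
        va = sum((ai - ma) ** 2 for ai in a)
        vb = sum((bi - mb) ** 2 for bi in b)
        return cov / (va * vb) ** 0.5

    best = 0.0
    for i in range(n_series):
        for j in range(i + 1, n_series):
            best = max(best, abs(corr(series[i], series[j])))
    return best

# Every series is noise, yet the "best" correlation strengthens
# as the number of series we compare grows.
print(max_abs_correlation(5, 100))
print(max_abs_correlation(50, 100))
```

Scale the second call up to thousands of series – the Big Data scenario – and near-perfect spurious correlations become routine.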
Hand on your hearts now, data scientists – how often do you really dig into the data to verify that your correlation is, in fact, causation?
I’ll venture a guess: exceedingly rarely.
Demand more of ourselves
I’m not against data, quite the opposite. I’m not even against Big Data. But I am vehemently against using it just because, or to somehow replace insight or theory.
Luckily there is, if not a solution, at least a way to improve the situation.
As Nate Silver puts it in “The Signal and the Noise”:
Data-driven predictions can succeed – and they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves.
In reality, the world often does the opposite – we use data to make life easier for us, to demand less of ourselves. We let the data speak for itself, while doing anything from fabricating the data in the first place to neglecting to check whether what it says makes any sense whatsoever.
But when we lose understanding and forget the theory, it becomes very easy to mistake those correlations for causation and trust the ‘data’ – no matter how misguided it may be.
It’s time to demand more of ourselves; only then can we demand more of the data.