On the value of data – routinely vs purposefully

I listen to a bunch of podcasts, and the podcast “The Pitch” is one of them. In that podcast, Entrepreneurs of start-up companies pitch their ideas to investors. Not only is it amusing to hear some of these crazy business ideas, but the podcast also help me to understand about professional life works outside of science. One thing i learned is that it is ok if not expected, to oversell by about a factor 142.

Another thing that I learned is the apparent value of data. The value of data seems to be undisputed in these pitches. In fact, the product or service the company is selling or providing is often only a byproduct: collecting data about their users which subsequently can be leveraged for targeted advertisement seems to be the big play in many start-up companies.

I think this type of “value of data” is what it is: whatever the investors want to pay for that type of data is what it is worth. But it got me thinking about the value of data that we actually collect in medical. Let us first take a look at routinely data, which can be very cheap to collect. But what is the value of the data? The problem is that routinely collected data is often incomplete, rife with error and can lead to enormous biases – both information bias as well as selection bias. Still, some research questions can be answered with routinely collected data – as long as you make some real efforts to think about your design and analyses. So, there is value in routinely collected data as it can provide a first glance into the matter at hand.

And what is the case for purposefully collected data? The idea behind this is that the data is much more reliable: trained staff collects data in a standardised way resulting in datasets without many errors or holes. The downside is the “purpose” which often limits the scope and thereby the amount collected data per included individual. this is the most obvious in randomised clinical trials in which often millions of euro’s are spent to answer one single question. Trials often do no have the precision to provide answers to other questions. So it seems that the data can lose it value after answering that single question.

Luckily, many efforts were made to let purposefully collected keep some if its value even after they have served their purpose. Standardisation efforts between trials make it now possible to pool the data and thus obtain a higher precision. A good example from the field of stroke research is the VISTA collaboration, i.e the Virtual International Stroke Trials Archive”. Here, many trials – and later some observational studies – are combined to answer research questions with enough precision that otherwise would never be possible. This way we can answer questions with high quality of purposefully collected data with numbers otherwise unthinkable.

This brings me to a recent paper we published with data from the VISTA collaboration: “Early in-hospital exposure to statins and outcome after intracerebral haemorrhage”. The underlying question whether and when statins should be initiated / continued after ICH is clinically relevant but also limited in scope and impact, so is it justified to start a trial? We took the the easier and cheaper solution and analysed the data from VISTA. We conclude that

… early in-hospital exposure to statins after acute ICH was associated with better functional outcome compared with no statin exposure early after the event. Our data suggest that this association is particularly driven by continuation of pre-existing statin use within the first two days after the event. Thus, our findings provide clinical evidence to support current expert recommendations that prevalent statin use should be continued during the early in-hospital phase.1921

link

And this shows the limitations of even well collected data from RCT: as long as the exposure of interest is potentially provided to a certain subgroup (i.e. Confounding by indication), you can never really be certain about the treatment effects. To solve this, we would really need to break the bond between exposure and any other clinical characteristic, i.e. randomize. That remains the golden standard for intended effects of treatments. Still, our paper provided a piece of the puzzle and gave more insight, form data that retained some of its value due to standardisation and pooling. But there is no dollar value that we can put on the value of medical research data – routinely or purposefully collected alike- as it all depends on the question you are trying to answer.

Our paper, with JD in the lead, was published last year in the European Stroke Journal, and can be found here as well as on my Publons profile and Mendeley profile.