While we are hearing a lot these days about so-called “fake news,” we are also seeing instances of something that is arguably more pernicious: fake data. By “fake data,” I mean datasets that have been sloppily constructed or maintained, or datasets that appear to contain false data, either in whole or in part. What is particularly scary about fake data is that it is often used in hugely important applications, where it can lead to bad and even dangerous outcomes.
My first encounter with fake data was in 2006. A reputable data company, First DataBank, had been producing an obscure database of wholesale drug prices. The company would survey drug companies and publish average prices for many widely prescribed drugs. Over time, and seemingly more through sloth than nefarious intent, First DataBank let the number of drug companies submitting data dwindle down to one, so the product was no longer producing industry average pricing. That’s embarrassing enough on its own, but because both Medicaid and Medicare were relying on the data to reimburse prescription claims, it resulted in vast reimbursement overpayments.
In 2009, we had the example of Ingenix, a medical data publisher. It published two datasets that reported what are called “usual, customary and reasonable” physician fees for various procedures. These datasets were used by hundreds of health insurance companies to make out-of-network payments to physicians, and they were much hated by physicians because of the low payments that resulted. Nothing about this is untoward, except for the fact that Ingenix was owned by UnitedHealth Group, one of the nation’s largest health insurers, and the products relied heavily on data from UnitedHealth Group. The lower the prices that Ingenix reported, the more money its parent company made.
More recently, you may recall the global LIBOR scandal. LIBOR, a dataset maintained by the British Bankers’ Association (BBA), reflected the average interest rates banks would charge to lend money to each other. It was a tiny dataset with one huge impact: LIBOR is used to determine interest rates for an estimated $350 trillion in loans, mortgages and other financial transactions. Traders at the banks supplying data to the BBA eventually realized that by colluding with each other they could manipulate the LIBOR rates simply by reporting false data to the BBA, and then profit off this advance knowledge.
The most recent example relates to the COVID pandemic. A company called Surgisphere burst on the scene with anonymized personal health data for nearly 100,000 COVID patients worldwide. The dataset was pristine: completely normalized, with every data element fully populated, and all information timely and current. Excited by this treasure trove of quality data, unavailable elsewhere, several reputable physicians went to work on assessing the data. Their most notable conclusion was that the drug hydroxychloroquine was not effective in treating COVID patients. The results, published in a reputable medical journal, caused the World Health Organization and several countries to suspend randomized controlled trials that had been set up to test the drug.
With more than a few medical researchers suspicious of this dataset that had emerged from nowhere, Surgisphere suddenly found itself under scrutiny. One newspaper discovered that the company, which had originally been founded to market textbooks, had a total of six employees, one of whom was a science fiction writer, and another of whom had previously worked as an actor in the adult film industry. Surgisphere claimed that non-disclosure agreements prohibited it from disclosing even the names of the 600 hospitals that had allegedly provided it with its data. The Surgisphere website has recently gone dark.
The simple lesson here is that if you rely on data products for anything important, it’s necessary to trust but verify. The less the data provider wants to tell you, the more questions you need to ask. Claiming non-disclosure prohibitions is an easy way to hide a host of sins. If you are relying on a data source for industry averages of any kind, at a minimum confirm the sample size. You should also assess whether the data producer has any conflicts of interest that could influence what data it collects or how it presents the results. The good news, of course, is that this is also an opportunity for reputable data producers to showcase their data quality.
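For readers who work with such data directly, the sample-size check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes the provider discloses which sources contributed to each figure, and the record layout, the names `contributor_counts` and `thin_averages`, the threshold, and the sample records are all hypothetical, not taken from any real data product.

```python
from collections import defaultdict

# Hypothetical threshold: the minimum number of distinct contributors
# required before a figure is treated as an "industry average".
MIN_CONTRIBUTORS = 5


def contributor_counts(rows):
    """Count the distinct sources that reported on each item."""
    sources = defaultdict(set)
    for item, source, _price in rows:
        sources[item].add(source)
    return {item: len(s) for item, s in sources.items()}


def thin_averages(rows, min_contributors=MIN_CONTRIBUTORS):
    """Return the items whose 'average' rests on too few contributors."""
    counts = contributor_counts(rows)
    return sorted(item for item, n in counts.items() if n < min_contributors)


# Made-up illustrative records: (item, contributing source, reported price)
sample_rows = [
    ("drug_a", "company_1", 10.0),
    ("drug_a", "company_2", 11.0),
    ("drug_a", "company_3", 9.5),
    ("drug_b", "company_1", 42.0),
    ("drug_b", "company_1", 43.0),
]

print(thin_averages(sample_rows, min_contributors=3))  # → ['drug_b']
```

Here `drug_b` is flagged because both of its prices come from a single source, which is the situation the wholesale drug price database drifted into. If a provider will not disclose enough to run even a check like this, that is itself a reason to ask more questions.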