Fake Data, Real Consequences

While we are hearing a lot these days about so-called “fake news,” we are also seeing instances of something that is arguably more pernicious: fake data. By “fake data,” I mean datasets that have been sloppily constructed or maintained, or datasets that appear to contain false data, either in whole or in part. What is particularly scary about fake data is that it is often used in hugely important applications that result in bad and even dangerous outcomes.

My first encounter with fake data was in 2006. A reputable data company, First DataBank, had been producing an obscure database of wholesale drug prices. The company would survey drug companies and publish average prices for many widely prescribed drugs. Over time, and seemingly more through sloth than nefarious intent, First DataBank let the number of drug companies submitting data dwindle to one, so the product was no longer reporting industry average pricing. That’s embarrassing enough on its own, but because both Medicaid and Medicare were relying on the data to reimburse prescription claims, the result was vast overpayments.

In 2009, we had the example of Ingenix, a medical data publisher. It published two datasets that reported what are called “usual, customary and reasonable” physician fees for various procedures. These datasets were used by hundreds of health insurance companies to make out-of-network payments to physicians, and were much hated by physicians because of the low reimbursements that resulted. Nothing about this is untoward, except for the fact that Ingenix was owned by UnitedHealth Group, one of the nation’s largest health insurers, and the products relied heavily on data from UnitedHealth Group. The lower the prices that Ingenix reported, the more money its parent company made.

More recently, you may recall the global scandal surrounding LIBOR, a dataset maintained by the British Bankers’ Association. It reflected the average interest rates banks would charge to loan money to each other, a tiny dataset with one huge impact: LIBOR is used to determine interest rates for an estimated $350 trillion in loans, mortgages and other financial transactions. Traders at the banks supplying data to the BBA eventually realized that by colluding with each other they could manipulate the LIBOR rates and profit off this advance knowledge simply by reporting false data to the BBA.

The most recent example relates to the COVID pandemic. A company called Surgisphere burst onto the scene with anonymized personal health data for nearly 100,000 COVID patients worldwide. The dataset appeared pristine: completely normalized, with every data element fully populated, and all information timely and current. Excited by this treasure trove of quality data, unavailable elsewhere, several reputable physicians went to work analyzing it; their most notable conclusion was that the drug hydroxychloroquine was not effective in treating COVID patients. The results, published in a reputable medical journal, caused the World Health Organization and several countries to suspend randomized controlled trials that had been set up to test the drug.

With more than a few medical researchers suspicious of this dataset that had emerged from nowhere, Surgisphere suddenly found itself under scrutiny. One newspaper discovered that the company, which had originally been founded to market textbooks, had a total of six employees, one of whom was a science fiction writer, and another of whom had previously worked as an actor in the adult film industry. Surgisphere claimed that non-disclosure agreements prohibited it from disclosing even the names of the 600 hospitals that had allegedly provided it with its data. The Surgisphere website has recently gone dark. 

The simple lesson here is that if you rely on data products for anything important, it’s necessary to trust but verify. The less the data provider wants to tell you, the more questions you need to ask. Claiming non-disclosure prohibitions is an easy way to hide a host of sins. If you are relying on a data source for industry averages of any kind, at a minimum confirm the sample size. You should also assess whether the data producer has any conflicts of interest that could influence what data it collects or how it presents the results. The good news, of course, is that this is also an opportunity for reputable data producers to showcase their data quality.

When Bad Data Is Good Business

The New York Times recently ran a story that describes the inner workings of the online tenant screening business, where companies create background reports for landlords on prospective apartment renters. These are companies that pull from multiple public databases to aggregate data on a specific individual, which the landlord can use to decide whether to rent to that person. The article is a scary take-down of a segment of the data business that decided to compete largely on price, and in the process threw quality out the window.

This is not a small segment of the data industry. Indeed, it is estimated that over 2,000 companies are involved in producing employment background and tenant screening reports, generating over $3.2B annually. Companies in this segment range from a handful of giants to tiny mom-and-pop operators.

 As the Times article notes, the tenant screening segment of the business is largely unregulated. In the tight market for rental apartments, landlords can afford to be picky and apartments rent quickly, so prospective renters typically will lose an apartment before they can get an erroneous report corrected. And with no central data source and lots of small data vendors, it’s impossible to get erroneous data corrected permanently. 

The Times article pins the problem in large part on the widespread use of wildcard and Soundex name searches designed to search public databases exhaustively. And with lots of players and severe price pressure, most of the reports that are generated are fully automated. In most cases, landlords simply get a pile of whatever material results from these broad searches. In some cases, the data company provides a score or simply a yes/no recommendation to the landlord. Not surprisingly, landlords prefer these summaries to wading through and trying to assess lengthy source documents.
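
To make the false-positive problem concrete, here is a minimal Python sketch of a Soundex-style name match. The names, the court record and the matching logic below are all invented for illustration; real screening products layer wildcard searches and many more data sources on top of this, but the underlying weakness is the same: phonetically similar names collapse into the same code.

    # Classic American Soundex: first letter plus three digits.
    SOUNDEX_CODES = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            SOUNDEX_CODES[ch] = digit

    def soundex(name: str) -> str:
        name = name.lower()
        digits = []
        prev = SOUNDEX_CODES.get(name[0], "")
        for ch in name[1:]:
            code = SOUNDEX_CODES.get(ch, "")
            if code and code != prev:
                digits.append(code)
            if ch not in "hw":  # h and w do not reset the previous code
                prev = code
        return (name[0].upper() + "".join(digits) + "000")[:4]

    # A made-up court record and a pool of unrelated rental applicants.
    record = ("Jon", "Smith")
    applicants = [("John", "Smyth"), ("Jon", "Smit"), ("Mary", "Smith")]

    for first, last in applicants:
        if soundex(first) == soundex(record[0]) and soundex(last) == soundex(record[1]):
            print(f"HIT: {first} {last} matched against the record for {record[0]} {record[1]}")
    # Two of the three applicants come back as "hits," even though nothing in the
    # match says either of them is actually the person named in the record.

In a fully automated report, those hits go straight to the landlord; without a human checking dates of birth, addresses or other identifiers, the false positives are simply passed along.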

The core problem is that in this corner of the industry, we have the rare occurrence of unsophisticated data producers selling to unsophisticated data users. Initially, these data producers differentiated themselves by trying to tap the greatest number of data sources (terrorist databases, criminal databases, sex offender databases). This strategy tapped out pretty quickly, which is why these companies shifted to selling on price. To do this, they had to automate, meaning they began to sell reports based on broad searches with no human review. There are also a lot of data wholesalers in this business, meaning it is fast and relatively inexpensive to set yourself up as a background screening company.

There is also a more subtle aspect to this business that should interest all data producers. The use of broad wildcard searches is ostensibly done because “it’s better to produce a false positive than a false negative.” This sounds like the right approach on the surface, but hiding underneath is an understanding that the key dynamic of this business is a need to deliver “hits,” otherwise known as negative information. This is where the unsophisticated data user comes into play. Landlords evaluate and value their background screening providers based on how frequently they find negative information on an applicant. If landlords don’t see negative information regularly, they begin to question the value of the screening company, and become receptive to overtures from competitors who claim they do more rigorous screening. In other words, the more rigorous your data product, the more you are exposed competitively.

There’s a lesson here: if you create a data product whose purpose is to help users identify problems, you need to deliver problems frequently in order to succeed. This sets up a warped incentive where precision is the enemy of profit. Place this warped incentive in a market with strong downward price pressure, and the result is messy indeed. 

Facebook Stores Come Up Empty

Facebook recently rolled out, to great fanfare, a new offering called Facebook Stores. In brief, it allows businesses with Facebook pages to add e-commerce functionality to those pages. It’s a free service to businesses, because Facebook hopes to profit mightily off transaction fees and additional advertising on its sites. Reportedly, over one million businesses have already signed on to use this new feature.

It’s a bit difficult to assess the significance of this new offering. This is inarguably a smart, if not particularly inspired, move by Facebook to cut itself (as all marketplaces and platforms dream of doing) into the revenue of the businesses on its site. But the value Facebook adds through Facebook Stores isn’t all that large. That’s because, despite its huge base of users, Facebook isn’t doing anything to drive those users to Facebook Stores. Anyone who makes a purchase through Facebook Stores is an existing customer or prospect of the store owner, or has been driven there by paid advertising on Facebook. Nor does Facebook offer any classification structure or taxonomy to help its users discover businesses on Facebook. You really need to already know they are there. Indeed, businesses can’t even use Facebook Stores in place of an e-commerce website, because Facebook business pages provide only limited access to non-users of Facebook. In many ways, I feel about Facebook Stores the way I feel about Apple’s App Store: yes, it will make a lot of money, but imagine how much money it could have made had it been done right.

 For those who are writing breathlessly about Facebook Stores as the dawn of “social commerce,” there is a theme. First, they say, forget about Facebook and focus on its sister company Instagram, where brands can promote new products and users can order them seamlessly. It sounds interesting, but when you pick it apart the same issues arise: you’re spending money with Instagram to build your audience and drive traffic, and then you give a percentage of your revenue to Instagram in exchange for capabilities you already have on your website.

In short, Facebook Stores is a smart move for Facebook. Is it a smart move for small businesses? I remain unconvinced. As the saying goes, “If your business depends on a platform, you don’t have a business.” 

No Profit, No Problem

I’ve been asked a number of times if not-for-profit data producers have an inherent advantage over for-profit data producers who may be selling similar or identical data. In fact, there are a lot of non-profits founded, at least in part, to build databases and disseminate them. 

As just a few examples, you may already be familiar with GuideStar (now called Candid, and a 2004 Infocommerce Model of Excellence award winner), which provides financial data on non-profits. It has other non-profit competitors such as Charity Navigator, ProPublica and even the Better Business Bureau. But this space also includes for-profit players such as Metasoft Systems. In the educational world, non-profit GreatSchools (a 2007 Infocommerce Model of Excellence award winner) competes with for-profit players such as U.S. News and Niche.com. The further you dig into the world of data, the more non-profit players you find, often competing directly with for-profit data providers.

 So in head-to-head competitive match-ups, do non-profits have an advantage? In my experience, non-profits do have a number of advantages. Their primary one is in perception. Particularly when it comes to data collection, non-profits seem less threatening, they are viewed as neutral and independent, their need for data is rarely questioned, and many will supply data to non-profits because they feel they are helping out or supporting a cause. This warm and fuzzy perception extends to marketing and sales as well. Having seen it first-hand, I have no doubt that non-profits prefer doing business with other non-profits. There is a sense of shared purpose, and a belief that one non-profit won’t take advantage of another non-profit. Commercial data buyers won’t buy data from a non-profit simply because it is a non-profit, but non-profits will get full and equal consideration along with for-profit data vendors.

But that doesn’t mean it’s easy going for non-profit data providers. As mission-driven organizations, many give away their data, limiting their revenue opportunity to such things as API access and custom datasets. Also, while non-profits like doing business with other non-profits, that’s in part because they expect whatever they get from another non-profit will be heavily discounted, if not free. Selling against for-profit competitors, the non-profit is often at a disadvantage because it can’t as easily invest in the newest technologies and third-party datasets, whether because of resource constraints or because staying competitive begins to conflict with its own mission objectives.

On balance, I think that non-profit data producers do have a number of marketplace advantages, but these advantages are largely offset by marketplace expectations that non-profits must offer low-cost or free data, and competitive realities that make it hard to sell against a determined, for-profit competitor.

Fishing In the Data Lake

You have likely bumped into the hot new IT buzzword “data lake.” A data lake is simply a collection of data files, structured and unstructured, located in one place. This is, in the eyes of some, an advance over the “data warehouse,” where datasets are curated and highly organized. Fun fact: a bad data lake (one that holds too much useless data) is called a “data swamp.”

What’s the purpose of a data lake? Primarily, it’s to provide raw input to artificial intelligence and machine learning software. This new class of software is both powerful and complex, with the result that it has taken on near-mystical qualities. As one senior executive of a successful manufacturing company told me, his company was aggressively adopting machine learning because “you just feed it the data and it gives you answers.” Yes, we now have software so powerful that it not only provides answers, but apparently formulates the questions as well.

The reality is much more mundane. This will not surprise any data publisher, but the more structure you provide to machine learning and artificial intelligence software, the better the results. That’s because while you can “feed” a bunch of disparate datasets into machine learning software, if there are no ready linkages between the datasets, your results will be, shall we say, suboptimal. And if the constituent data elements aren’t clean and normalized, you’ll get to see the axiom “garbage in, garbage out” playing out in real life.
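
As a toy illustration, here is a short Python example, with entirely invented records and identifiers, of why linkage and normalization matter before any learning happens: the same three rows are useless when joined on raw facility names, but become usable once the keys are cleaned and mapped to a common identifier.

    # Two hypothetical files headed for the same data lake: admissions keyed by
    # free-text facility names, and a reference table keyed by clean identifiers.
    admissions = [
        {"facility": "St. Mary's Hosp.", "patients": 120},
        {"facility": "SAINT MARYS HOSPITAL", "patients": 95},
        {"facility": "General Hospital ", "patients": 80},
    ]
    facility_ids = {"saint marys hospital": "F001", "general hospital": "F002"}

    ABBREVIATIONS = {"st": "saint", "hosp": "hospital"}

    def normalize(name: str) -> str:
        # Lowercase, strip punctuation, expand common abbreviations.
        cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
        return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cleaned.split())

    def link(rows, key_fn):
        totals, unmatched = {}, []
        for row in rows:
            fid = facility_ids.get(key_fn(row["facility"]))
            if fid is None:
                unmatched.append(row["facility"])
            else:
                totals[fid] = totals.get(fid, 0) + row["patients"]
        return totals, unmatched

    print(link(admissions, key_fn=lambda n: n))  # ({}, [...]) -- nothing links on raw names
    print(link(admissions, key_fn=normalize))    # ({'F001': 215, 'F002': 80}, [])

The normalization here is deliberately trivial; the point is that someone has to do it, and the more of it that happens upstream, the less janitorial work is left for everyone downstream.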

It’s a sad reality that highly trained and highly paid data scientists still spend the majority of their time acting as what they call “data wranglers” and “data janitors,” trying to smooth out raw data enough that machine learning will deliver useful and dependable insights. In a timely response to this, software vendor C3-AI has just launched a COVID-19 data lake. Its claimed value is that, rather than being just a collection of datasets in one place, C3-AI has taken the time to organize, unify and link the datasets.

The lesson here is that as data producers, we should never underestimate the value we create when we organize, normalize and clean data. Indeed, clean and organized data will be the foundation for the next wave of advances in both computing and human knowledge. Better data: better results.