Thoughts and Predictions

Fake Data, Real Consequences

While we are hearing a lot these days about so-called “fake news,” we are also seeing instances of something that is arguably more pernicious: fake data. By “fake data,” I mean datasets that have been sloppily constructed or maintained, or datasets that appear to contain false data, either in whole or in part. What is particularly scary about fake data is that it is often used in hugely important applications, where it results in bad and even dangerous outcomes.

My first encounter with fake data was in 2006. A reputable data company, First DataBank, had been producing an obscure database of wholesale drug prices. The company would survey drug companies and publish average prices for many widely prescribed drugs. Over time, and seemingly more through sloth than nefarious intent, First DataBank let the number of drug companies submitting data dwindle to one, so the product no longer reflected industry-average pricing. That’s embarrassing enough on its own, but because both Medicaid and Medicare were relying on the data to reimburse prescription claims, it resulted in vast reimbursement overpayments.

In 2009, we had the example of Ingenix, a medical data publisher. It published two datasets that reported what are called “usual, customary and reasonable” physician fees for various procedures. Hundreds of health insurance companies used these datasets to calculate out-of-network payments to physicians, who hated them because of the low reimbursement amounts that resulted. Nothing about this is untoward, except for the fact that Ingenix was owned by UnitedHealth Group, one of the nation’s largest health insurers, and the products relied heavily on data from UnitedHealth Group. The lower the prices that Ingenix reported, the more money its parent company made.

More recently, you may recall the global LIBOR scandal. LIBOR, a benchmark maintained by the British Bankers’ Association, reflected the average interest rate at which banks would lend money to each other. It was a tiny dataset with one huge impact: LIBOR is used to determine interest rates for an estimated $350 trillion in loans, mortgages and other financial transactions. Traders at the banks supplying data to the BBA eventually realized that by colluding with each other they could manipulate the LIBOR rates and profit off this advance knowledge simply by reporting false data to the BBA.
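
To appreciate how little effort this kind of manipulation takes, consider a toy illustration. The numbers below are invented, and for simplicity the sketch assumes the published rate is a plain average of bank submissions (the actual LIBOR methodology trimmed the highest and lowest submissions before averaging), but the principle is the same: a couple of dishonest submitters can nudge a benchmark that trillions of dollars of contracts depend on.

```python
# Toy illustration with invented numbers: how a few false submissions
# shift an averaged benchmark. Assumes a simple average of submissions.

honest = [2.41, 2.43, 2.44, 2.45, 2.46, 2.47, 2.48, 2.49]  # hypothetical submissions (%)
benchmark = sum(honest) / len(honest)

# Two colluding banks shade their submissions down by 10 basis points each.
rigged = honest.copy()
rigged[0] -= 0.10
rigged[1] -= 0.10
rigged_benchmark = sum(rigged) / len(rigged)

shift_bps = (benchmark - rigged_benchmark) * 100
print(f"Honest: {benchmark:.3f}%  Rigged: {rigged_benchmark:.3f}%  Shift: {shift_bps:.1f} bps")
# Even a ~2.5 basis point move matters when hundreds of trillions of dollars
# in contracts reference the published rate.
```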

The most recent example relates to the COVID pandemic. A company called Surgisphere burst on the scene with anonymized personal health data for nearly 100,000 COVID patients worldwide. The dataset was pristine: completely normalized, with every data element fully populated, and all information timely and current. Excited by this treasure trove of quality data, unavailable elsewhere, several reputable physicians went to work on assessing it, most notably concluding that the drug hydroxychloroquine was not effective in treating COVID patients. The results, published in a reputable medical journal, caused the World Health Organization and several countries to suspend randomized controlled trials that had been set up to test the drug.

With more than a few medical researchers suspicious of this dataset that had emerged from nowhere, Surgisphere suddenly found itself under scrutiny. One newspaper discovered that the company, which had originally been founded to market textbooks, had a total of six employees, one of whom was a science fiction writer, and another of whom had previously worked as an actor in the adult film industry. Surgisphere claimed that non-disclosure agreements prohibited it from disclosing even the names of the 600 hospitals that had allegedly provided it with its data. The Surgisphere website has recently gone dark. 

The simple lesson here is that if you rely on data products for anything important, it’s necessary to trust but verify. The less the data provider wants to tell you, the more questions you need to ask. Claiming non-disclosure prohibitions is an easy way to hide a host of sins. If you are relying on a data source for industry averages of any kind, at a minimum confirm the sample size. You should also assess whether the data producer has any conflicts of interest that could influence what data it collects or how it presents the results. The good news, of course, is that this is also an opportunity for reputable data producers to showcase their data quality.
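
To make “trust but verify” concrete, here is a minimal, hypothetical sketch of the kind of automated sanity checks a data buyer could run before relying on a purchased dataset. The field names, thresholds and record schema are assumptions for illustration, not any standard: the checks simply flag an “industry average” built from too few sources, and data that looks implausibly complete or uniformly fresh, as Surgisphere’s did.

```python
from datetime import datetime, timedelta

def sanity_check(records, source_count, required_fields, freshness_days=90):
    """Illustrative due-diligence checks on a purchased dataset (hypothetical schema)."""
    warnings = []

    # An "industry average" built from a handful of contributors isn't much of an average.
    if source_count < 5:
        warnings.append(f"Only {source_count} contributing source(s); confirm the sample size.")

    # Real-world data is rarely 100% complete; perfect completeness deserves scrutiny.
    complete = sum(all(r.get(f) not in (None, "") for f in required_fields) for r in records)
    if records and complete == len(records):
        warnings.append("Every record is fully populated; ask how the data was collected.")

    # Uniformly fresh timestamps can be as suspicious as stale ones.
    # Assumes each record carries an 'updated_at' datetime (hypothetical field).
    stale = sum(1 for r in records
                if datetime.now() - r["updated_at"] > timedelta(days=freshness_days))
    if records and stale == 0:
        warnings.append("No record is older than the freshness window; confirm the update process.")

    return warnings
```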

A Good Business in Bad Times

As I write this, the federal government has announced 3.3 million new unemployment claims in just one week, and that figure likely represents just the tip of the iceberg. The human toll of the coronavirus is difficult to comprehend, and the toll on businesses is not far behind.

It has long been understood that in any sudden downturn, some businesses get hit harder and faster than others. The rule of the game is to preserve cash in anticipation of reduced revenue. Consequently, expense reduction becomes the focus. Rightly or wrongly, most companies view advertising and marketing as something that can be suspended for some period of time with little consequence. Other business activities, such as company meetings and events, quickly get postponed. Every company reacts slightly differently, and often in uniquely arbitrary and sometimes ill-advised ways. But the goal is always the same: slow spending as much as possible to conserve cash.

With everyone trying to cut costs and slow payables at the same time, an adverse ripple effect is created that amplifies the pain. That’s why in widespread business downturns, few businesses are left truly unscathed by the resulting fallout.

Can one ever find safety from events of this magnitude? Probably not, but while few if any businesses will be totally untouched, some business models are clearly stronger than others. 

The information business is inherently one of the stronger industries to be in right now. That’s not because information products are uniformly essential to their customers. We learned during the Great Recession that many information products believed to be “must-have” became “nice-to-have” almost overnight. But the B2B subscription model employed by most information and data publishers adds an important additional level of resiliency.

B2B subscriptions tend to be, in effect, annual or even multi-year contracts. Many are prepaid. Many are difficult to cancel during the contract term. This buys information and data publishers the most important protection of all: time. Time to ride out the storm, for conditions to improve or at least for calmer heads to prevail. Sure, new subscriptions will decline and renewal rates will drop during a downturn, but the bulk of the business will remain relatively safe.

In addition to being contractual and often prepaid, subscriptions to information products typically are not high visibility or so expensive that they capture the early attention of cost-cutters. And for data products in particular, they don’t sit idle during downturns like our current one, because they are just as useful to employees working from home.

Some data products have even a further level of protection because they are embedded into the workflow and systems of their customers. Simply put, it’s too slow, complicated and sometimes even risky to turn them off.

As I said earlier, there are no winners in a global pandemic. But the importance and value of data products, coupled with the strength of the dominant industry business model, will help this industry spring back quickly.

This pandemic is bigger than all of us. But if we all act responsibly, we can minimize severity and duration and get back to business sooner. Stay safe … and stay healthy. We’ll get through this if we all work together!

 

When Data Is Smarter Than Its Users

In my review of the decade past and my predictions for our new decade, the common thread is that the quality of commercial data products has advanced immeasurably, as has their insight and predictive capability. As an industry, we’ve accomplished some truly remarkable things in the past ten years by making data more powerful, more useful and more current.

This said, data buyers remain far less sophisticated than the datasets they are buying. While buyers of data used for research and planning purposes seem to both appreciate and use powerful new data capabilities, marketers – generally speaking – do not. Even worse, this problem is age-old.

Earlier in my career, I spent several years in the direct marketing business. Even back in the 1980s we were doing file overlays, assessing purchase propensity and building out detailed prospect profiles based on hundreds of individual data elements. It was slower and sloppier and harder back then, but we were doing it. We even had artificial intelligence software, though I recall one project in particular, involving a million customer records, that required us to rent exclusive use of a mainframe computer for two weeks! And not only did we have the capability, we had the buy-in of the marketing world. There was a fever pitch of interest in the incredible potential of super-targeted marketing.

But what we quickly learned as mailing list providers was that while sales and marketing types talked quality, what they bought was quantity. If you went to any organization of any size and said, “we have identified the 5,000 absolute best prospects in the country for you, all ready, willing and able to buy,” you would get interest but few if any takers. At best, you’d have marketers say that they’d throw these prospects in the pot with all the others – as long as they weren’t too expensive. 

From this experience came my epiphany: marketers had no experience with high quality prospects. They were so used to crappy data that they had built processes and organizations optimized to churn through vast quantities of poor-quality prospects. As to our 5,000 perfect prospects, we heard things like, “we’d chew through them in a week.” Note the operative word “chew.”

We have new and better buzzwords now, but the broad problem is the same. Nowadays, when it comes to sales leads, companies are literally feeding the beast in the form of their marketing automation platforms. And everything has to flow through the platform because otherwise reports would be inaccurate and KPIs would be wrong.

Companies today will pay handsomely for qualified sales leads – sometimes up to several hundred dollars per lead. But these top quality leads won’t get treated any better than the mediocre ones. How do I know? Because the marketers spending all these big bucks will insist the leads be formatted for easy loading into their marketing platforms, and I’ve also been told, “we’re not interested unless you can guarantee at least 100 leads per week.” And that’s how far we have progressed in 30 years: marketers have solved the tension between quality and quantity by simply insisting on both. And the pressure to deliver both will necessarily come at the expense of quality. This essential disconnect won’t be solved easily, but when it is, a new golden age of data will arrive.

Looking Ahead: The Application Decade

As I noted in my previous post, the data business was the right place to be in the last decade. Commercial data producers were already well-positioned in 2010. The value of data products was already well understood. The quaint subscription model that data producers had been stubbornly clinging to for years suddenly became all the rage. The birth of Big Data and the growth of data science as a profession put a spotlight on the need for high quality datasets.

From 2010 to 2019, things only got better: Big Data tools proliferated, the cloud offered cheap, efficient storage, and computer processing power continued to increase, and we were finally able to build and make effective use of truly massive, multi-sourced databases, many updated in real time.

The advances we have made as an industry in the last ten years have been truly breathtaking. But if the last decade was characterized by a wondrous growth in the accumulation of data, the decade in front of us will be about the smart application of that data.

A picture of what’s in store for us is already emerging. Artificial intelligence will take the data industry to the next evolutionary plane by enabling us to predict buyers and sellers and other transactional activity with confidence and in advance. That’s no small statement when you consider that the vast majority of commercial data products exist to bring buyers and sellers together or otherwise enable business transactions.

Our new decade will also be notable for its embrace of data governance. There simply won’t be any place for poorly managed and sloppily maintained datasets. Those who properly see data governance as an opportunity and not a burden will prosper mightily. And yes, the commercial data business will enjoy a first-mover advantage, because we understood the power of data governance even before it had a name.

Boil it all down, and my prediction is that we will be entering the decade of data-driven predictions. By 2030, commercial data producers will literally be able to predict the future, at least from a sales and marketing enablement perspective. The new tools required already exist, and they will continue to improve. All that’s needed is the creativity to apply them to the oldest, most basic objective of business: buying and selling. And our industry is nothing if not creative!

Looking Back: The Data Decade

In so many respects, the last ten years can be fairly called the Data Decade. In large part, that’s because the data business came into the last decade on a strong footing. While the ad-based media world was decimated by the likes of Google and Facebook, data companies held firm to their subscription-based revenue models and thrived as a result. And while legacy print publishers struggled to make online work for them, data publishers moved online without issues or complications, in large part because their products were inherently more useful when accessed online. As importantly, data entered the last decade with a lot of buzz, because the value and power of data products had become broadly understood.

At the highest level, data got both bigger and better in the last decade. The much-used and much-abused term “Big Data” came into popular usage. While Big Data was misunderstood by many, the impact for data publishers is that for the first time we became able to both aggregate and productively access and use truly massive amounts of data, creating endless new opportunities for both new and enhanced data products.

While life without the cloud is unimaginable today, at the beginning of the last decade it was just getting started and its importance was vastly underappreciated. But the cloud profoundly altered for the better both the cost and convenience of maintaining and manipulating large amounts of data.
 
I’d argue too that APIs came into their own in the last decade to become a necessary component of almost every online data business. As a result, data became more portable and easier to aggregate, mix, match and integrate in ways that generated lots of new revenue for data owners, while also building powerful lock-in with data licensees, who increasingly became reliant on these data feeds. That’s one of the reasons the data business didn’t feel the impact of the Great Recession as severely as many others.
 
Through a combination of Big Data, the cloud and APIs, the last decade saw incredible growth in the collection and use of behavioral signals to infer such critical things as purchase interest and intent, opening both new markets and new revenue opportunities. This, of course, allowed many data publishers to tap into the household-name marketing automation platforms. Hopefully, companies will someday develop marketing campaigns as sophisticated as the data powering them, as the holy grail of fewer but more effective email messages still seems badly out of reach.
 
Another fascinating development of the last decade is the growing understanding of the power and value of data. The cutesy term “data exhaust” came into common usage in the last few years, referring to data created as a by-product of some other activity. And just as start-ups once rushed to add social media elements to their products, however inappropriately, venture capitalists now rarely see a business plan without a reference to a start-up’s data opportunity. There will be backlash here as both entrepreneurs and venture capitalists learn the expensive lesson that “not all data is good data,” but in the meantime, the gold rush continues unabated.
 
Somewhat related to this trend, we’ve seen much interest and activity around the concept of “data governance,” which is an acknowledgement that while poor quality data is close to useless, top quality data is enormously powerful, in large part because it can be trusted implicitly. Indeed, listen in at any gathering of data scientists and you will hear them grouse that they are really “data janitors,” reflecting the fact that they spend far more of their time cleaning and structuring data than actually analyzing it.
 
I can’t close out this decade without also mentioning the trend towards open data, which in large part refers to the increasing availability of public sector databases that often can be used to enhance commercial data products.
 
In all, it was a very good decade for the data business, a happy outcome that resulted primarily from the increased technical ability to aggregate and process huge amounts of data, growing willingness to share data on a computer-to-computer basis, and much greater attention to improving the overall quality of data. 

And the decade now in front of us? Next week, I'll take a look ahead.