Best Practices

When Bad Data Is Good Business

The New York Times recently ran a story describing the inner workings of the online tenant screening business, in which companies create background reports on prospective apartment renters for landlords. These companies access multiple public databases to aggregate data on a specific individual, which the landlord then uses to decide whether to rent to that person. The article is a scary take-down of a segment of the data business that decided to compete largely on price, and in the process threw quality out the window.

This is not a small segment of the data industry. Indeed, an estimated 2,000-plus companies are involved in producing employment background and tenant screening reports, generating over $3.2B annually. Companies in this segment range from a handful of giants to tiny mom-and-pop operators.

As the Times article notes, the tenant screening segment of the business is largely unregulated. In the tight market for rental apartments, landlords can afford to be picky and apartments rent quickly, so prospective renters will typically lose an apartment before they can get an erroneous report corrected. And with no central data source and lots of small data vendors, it’s impossible to get erroneous data corrected permanently.

The Times article pins the problem in large part on the widespread use of wildcard and Soundex name searches designed to search public databases exhaustively. And with lots of players and severe price pressure, most of the reports that are generated are fully automated. In most cases, landlords simply get a pile of whatever material results from these broad searches. In some cases, the data company provides a score or simply a yes/no recommendation to the landlord. Not surprisingly, landlords prefer these summaries to wading through and trying to assess lengthy source documents.
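To make concrete why these broad searches generate so much questionable material, here is a minimal Python sketch of a simplified Soundex code (the function and the sample names are purely illustrative; actual screening vendors’ matching rules are not public). Because the algorithm reduces a name to one letter plus three digits describing its consonant sounds, people with merely similar-sounding names fall into the same bucket, and a wildcard-plus-Soundex search dutifully returns them all.

```python
# A simplified American Soundex sketch, for illustration only.
def soundex(name: str) -> str:
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        codes.update(dict.fromkeys(group, digit))

    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:   # skip vowels and repeated codes
            result += digit
        if ch not in "hw":            # 'h' and 'w' do not break a repeat
            prev = digit
        if len(result) == 4:
            break
    return result.ljust(4, "0")

# Different people, identical codes -- exactly how broad searches create "hits."
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```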

The core problem is that in this corner of the industry, we have the rare occurrence of unsophisticated data producers selling to unsophisticated data users. Initially, these data producers differentiated themselves by trying to tap the greatest number of data sources (terrorist databases, criminal databases, sex offender databases). This strategy tapped out pretty quickly, which is why these companies shifted to selling on price. To do this, they had to automate, meaning they began to sell reports based on broad searches with no human review. There are also a lot of data wholesalers in this business, meaning it is fast and relatively inexpensive to set yourself up as a background screening company.

There is also a more subtle aspect to this business that should interest all data producers. The use of broad wildcard searches is ostensibly done because “it’s better to produce a false positive than a false negative.” This sounds like the right approach on the surface, but hiding underneath is an understanding that the key dynamic of this business is a need to deliver “hits,” otherwise known as negative information. This is where the unsophisticated data user comes into play. Landlords evaluate and value their background screening providers based on how frequently they find negative information on an applicant. If landlords don’t see negative information regularly, they begin to question the value of the screening company, and become receptive to overtures from competitors who claim they do more rigorous screening. In other words, the more rigorous your data product, the more you are exposed competitively.

There’s a lesson here: if you create a data product whose purpose is to help users identify problems, you need to deliver problems frequently in order to succeed. This sets up a warped incentive where precision is the enemy of profit. Place this warped incentive in a market with strong downward price pressure, and the result is messy indeed. 

Fishing In the Data Lake

You have likely bumped into the hot new IT buzzword “data lake.” A data lake is simply a collection of data files, structured and unstructured, located in one place. Some see this as an advance over the “data warehouse,” where datasets are curated and highly organized. Fun fact: a bad data lake (one that holds too much useless data) is called a “data swamp.”

What’s the purpose of a data lake? Primarily, it’s to provide raw input to artificial intelligence and machine learning software. This new class of software is both powerful and complex, with the result that it has been bestowed with near-mystical qualities. As one senior executive of a successful manufacturing company told me, his company was aggressively adopting machine learning because “you just feed it the data and it gives you answers.” Yes, we now have software so powerful that it not only provides answers, but apparently formulates the questions as well.

The reality is much more mundane. This will not surprise any data publisher, but the more structure you provide to machine learning and artificial intelligence software, the better the results. That’s because while you can “feed” a bunch of disparate datasets into machine learning software, if there are no ready linkages between the datasets, your results will be, shall we say, suboptimal. And if the constituent data elements aren’t clean and normalized, you’ll get to see the axiom “garbage in, garbage out” playing out in real life.
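As a small, purely hypothetical illustration (the files, column names and cleanup rules below are invented), consider two datasets dropped into the same lake that describe the same counties with differently formatted keys. A naive join links almost nothing; a little normalization links every record, which is the unglamorous work that makes the downstream analytics possible.

```python
# Hypothetical example: joining case counts to hospital capacity fails until
# the county keys are normalized. Requires pandas.
import pandas as pd

cases = pd.DataFrame({
    "county": ["Cook County, IL", "cook", "Harris Cty TX"],
    "new_cases": [120, 95, 210],
})
hospitals = pd.DataFrame({
    "county": ["COOK", "Cook", "Harris"],
    "icu_beds": [350, 350, 410],
})

# Naive merge on the raw key: nothing matches exactly.
print(len(cases.merge(hospitals, on="county", how="inner")))  # 0

def normalize(name: str) -> str:
    """Crude key cleanup: lowercase, drop punctuation and filler tokens."""
    tokens = name.lower().replace(",", " ").split()
    return " ".join(t for t in tokens if t not in {"county", "cty", "il", "tx"})

cases["key"] = cases["county"].map(normalize)
hospitals["key"] = hospitals["county"].map(normalize)
linked = cases.merge(hospitals.drop_duplicates("key"), on="key", how="inner")
print(len(linked))  # 3 -- every case record now links to a hospital record
```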

It’s a sad reality that highly trained and highly paid data scientists still spend the majority of their time acting as what they call “data wranglers” and “data janitors,” trying to smooth out raw data enough that machine learning will deliver useful and dependable insights. In a timely response to this, software vendor C3.ai has just launched a COVID-19 data lake. Its claimed value is that, rather than offering just a collection of datasets in one place, C3.ai has taken the time to organize, unify and link the datasets.

The lesson here is that as data producers, we should never underestimate the value we create when we organize, normalize and clean data. Indeed, clean and organized data will be the foundation for the next wave of advances in both computing and human knowledge. Better data: better results.

Variable Pricing, Data-Style

Variable pricing is a well-known pricing strategy that changes the price for the same product or service based on factors such as time, date, sale location and level of demand. Implemented properly, variable pricing is a powerful tool to optimize revenue.

The downside to variable pricing is that it has a bad reputation. For example, when prices go up at times of peak demand (which often translates into times of peak need), that’s variable pricing. Generally speaking, when you notice variable pricing, it’s because you’re on the wrong end of the variance.

Variable pricing lends itself nicely to data products. But rather than thinking about a traditional variable pricing strategy, consider pricing based on intensity of usage.

Intensity of usage means tying the price of your data product to how intensely a customer uses it – the greater the use, the greater the price. Intensity pricing is not an attempt to support multiple prices for the same product, but rather an attempt to tie pricing to the value derived from use of the product, with intensity of usage serving as a proxy for that value.

For data producers, intensity-based pricing can take many forms. Here are just a few examples to fuel your thinking (a brief pricing sketch follows the list):

1. Multi-user pricing. Yes, licensing multiple users and seats to large organizations is hardly a new idea. But it’s still a complex, mysterious thing to many data producers, who shy away from it, leaving money on the table and probably encouraging widespread password sharing at the same time. The key to multi-user pricing is not to try to extract more from larger organizations simply because “they can afford it” (a contentious and unsustainable approach), but to tie pricing to actual levels of usage as much as possible.

2. Modularize data product functionality. Not every user makes use of all your features and functionality. Think about identifying those usage patterns and then re-casting your data product into modules: the more modules you use, the more you pay. We all know the selling power of those grayed-out, extra-cost items on the main dashboard!

3. Limit or meter exports. Many sales-oriented data products command high prices in part because of the contact information they offer, such as email addresses. Unfortunately, many subscribers still view data products like these as glorified mailing lists to be used for giant email blasts. This is a high-intensity use that should be priced at a premium. A growing number of data producers limit the number of records that can be downloaded in list format, charging a premium for additional records to reflect this high-intensity type of usage. It’s similarly possible to limit and then up-charge certain types of high-value reports and other results that provide value beyond the raw data itself.

4. Modularize the dataset. Just as few users will use all the features available to them in a data product, many will not use all the data made available to them. For example, it’s not uncommon for data producers to charge more for access to historical data because not everyone will use it, and those who do use it value it highly. Consider whether you have a similar opportunity to segment your dataset.
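Here is the pricing sketch promised above: a toy Python calculation (all tiers, allowances and rates are invented for illustration) that combines ideas #1 and #3, so the invoice grows with the number of seats in use and the number of records exported.

```python
# Toy intensity-based pricing: a base fee plus overage for extra seats and
# exported records. All numbers are hypothetical.
INCLUDED_SEATS = 5
INCLUDED_EXPORTS = 1_000  # records per billing period

def monthly_price(base: float, seats: int, records_exported: int,
                  per_extra_seat: float = 40.0,
                  per_extra_record: float = 0.05) -> float:
    """Price rises with usage intensity beyond the plan's included allowances."""
    extra_seats = max(0, seats - INCLUDED_SEATS)
    extra_records = max(0, records_exported - INCLUDED_EXPORTS)
    return base + extra_seats * per_extra_seat + extra_records * per_extra_record

# A light user pays the entry-level price; a heavy exporter pays a premium.
print(monthly_price(500.0, seats=3, records_exported=400))      # 500.0
print(monthly_price(500.0, seats=12, records_exported=10_000))  # 1230.0
```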

While your first consideration should be revenue enhancement, also keep in mind that an intensity-based pricing approach helps protect your data from abuse, permits lower entry-level price points, creates up-sell opportunities, and properly positions your data as valuable and important.

There are competitive considerations as well. When you are selling an over-stuffed data product in order to justify a high price, the easiest strategy for a competitor is to build a slimmed-down version of your product at a much lower price – Disruption 101. You simply don’t want to be selling a prix fixe product in an increasingly a la carte world (look at the cable companies and their inability to sustain bundled pricing even with near-monopoly positions).

This Score Doesn't Compute

This week the College Board, operator of the SAT college admissions test, made a very big announcement: in addition to its traditional verbal and mathematical skills scores, it will be adding a new score, which it is calling an “adversity score.”

In a nutshell, the purpose of the adversity score is to help college admissions officers “contextualize” the other two scores. Based primarily on area demographic data (crime rates, poverty rates, etc.) and school-specific data (number of AP courses offered, etc.), this new assessment will generate a score from 1 to 100, with 100 indicating that the student has experienced the highest level of adversity.
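The College Board has not published how these inputs are combined, so the following is a purely invented sketch, not its method: one plausible way to build a 1-to-100 composite is to average an applicant’s percentile rank across a few publicly available adversity indicators.

```python
# Hypothetical 1-100 composite: average percentile rank across invented
# adversity factors (higher input values = more adversity). Not the College
# Board's actual methodology, which is secret.
from statistics import mean

FACTORS = ["area_poverty_rate", "area_crime_rate", "pct_free_lunch"]

def adversity_like_score(applicant: dict, population: list[dict]) -> int:
    percentiles = []
    for f in FACTORS:
        values = [p[f] for p in population]
        rank = sum(v <= applicant[f] for v in values) / len(values)
        percentiles.append(100 * rank)
    return max(1, round(mean(percentiles)))

# Invented population of school areas; scores scale with relative adversity.
population = [
    {"area_poverty_rate": 0.08, "area_crime_rate": 0.02, "pct_free_lunch": 0.15},
    {"area_poverty_rate": 0.22, "area_crime_rate": 0.06, "pct_free_lunch": 0.55},
    {"area_poverty_rate": 0.35, "area_crime_rate": 0.09, "pct_free_lunch": 0.80},
]
print(adversity_like_score(population[0], population))  # ~33
print(adversity_like_score(population[2], population))  # 100
```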

Public reaction so far has been mixed. Some see it as an honest effort to help combat college admission disparities. Others see it as a desperate business move by the College Board, which is facing an accelerating trend towards colleges adopting test-optional admission policies (over 1,000 colleges nationwide are currently test-optional).

I’m willing to stipulate that the College Board had its heart in the right place in developing this new score, but I am underwhelmed by its design and execution.

My first concern is that the College Board is keeping the design methodology of the score secret. I find that odd, since the new score seems to rely on benign and objective Census and school data. However, at least a few published articles suggest that the College Board has included “proprietary data” as well. Let the conspiracy theories begin!

Secondly, the score is being kept secret from students, for no good reason that I can see. All this policy does is add to adolescent and parental angst and uncertainty, while creating lots of new opportunities for high-priced advisors to suggest ways to game the score to their advantage. And the recent college admissions scandal shows just how far some parents are willing to go to improve the scores of their children.

My third concern is that this new score is assigned to each individual student, when it is in reality a score of the school and its surrounding area. If the College Board had created a school scoring data product (one that could be easily linked to any student’s application) and sold it as a freestanding product, there would likely be no controversy around it. 

Perhaps most fundamentally though, the new score doesn’t work to strengthen or improve the original two scores. That’s because what it measures and how it measures it are completely at odds with the original two scores. The new score is potentially useful, but it’s a bolt-on. Moreover, the way this score was positioned and launched opens it up to all the scrutiny and criticism the original scores have attracted, and that can’t be what the College Board wants. Already, Twitter is ablaze with people citing specific circumstances where the score would be inaccurate or yield unintended outcomes.

Scores and ratings can be extremely powerful. But the more powerful they become, the more carefully you need to tread in updating, modifying or extending them. The College Board hasn’t just created a new Adversity Score for students. It’s also likely to have caused a lot of new adversity for itself.

Choose Your Customer

From the standpoint of “lessons learned,” one of the most interesting data companies out there is TrueCar.

Founded in 2005 as Zag.com, TrueCar provides consumers with data on what other consumers actually paid for specific vehicles in their local area. You can imagine the value to consumers if they could walk into dealerships with printouts of the lowest price recently paid for any given vehicle. 

The original TrueCar business model is awe-inspiring. It convinced thousands of car dealers to give it detailed sales data, including the final price paid for every car they sold. TrueCar aggregated the data and gave it to consumers for free. In exchange, the dealers got sales leads, for which they paid a fee on every sale.

Did it work? Indeed it did. TrueCar was an industry disruptor well before the term became a buzzword. As a matter of fact, TrueCar worked so well that dealers started an organized revolt in 2012 that cost TrueCar over one-third of its dealer customers.

The problem was with the TrueCar model. TrueCar collected sales data from dealers and then essentially weaponized it, allowing consumers to purchase cars with little or no dealer profit. Moreover, after TrueCar allowed consumers to purchase cars on the cheap, it then charged dealers a fee for every sale! Eventually, dealers realized they were paying a third party to destroy their margins, and decided not to play anymore.

TrueCar was left with a stark choice: close up shop or find a new business model. TrueCar elected the latter, pivoting to a more dealer-friendly model that provided price data in ways that allowed dealers to better preserve their margins. It worked. TrueCar re-built its business, and successfully went public in 2014.

A happy ending? Not entirely. TrueCar, which had spent tens of millions to build its brand and site traffic by offering data on the cheapest prices for cars, quietly shifted to offering what it calls “fair prices” for cars without telling the consumers who visited its website. Lawsuits followed.

There are four important lessons here. First, you can succeed in disrupting an industry and still fail if you are dependent on that industry to support what you are doing. Second, when it comes to B2C data businesses, you really need to pick a side. Third, if you change your revenue model in a way that impacts any of your customers, it’s best to be clear and up-front about it. In fact, if you feel compelled to be sneaky about it, that’s a clue that your new business model is flawed. Fourth, and I’ve said it before, market disruption is a strategy, not a business requirement.