

A Healthy New Year

We’re in the midst of a transformational shift in the healthcare industry. Likely you have experienced it yourself, and it’s probably already hit you in the pocketbook. It’s the shift to what is called consumer-directed healthcare.

While on the surface consumer-directed healthcare may seem like nothing more than an attempt by employers to shift some of their spiraling healthcare costs onto their employees, there is much more going on behind the scenes. There is a lot of public policy driving this shift. The general idea is that healthcare costs are out of control because those buying healthcare services traditionally haven’t been the ones paying for them. By shifting healthcare costs to the consumer, the reasoning goes, consumers will demand better value for their money by becoming smart healthcare shoppers, and healthcare costs will begin to decline.

It all makes sense on paper, but there is one huge stumbling block in making this approach work: it’s hard to be a smart shopper when none of the things you are buying have price tags on them.

Data entrepreneurs have already seen this opportunity. Companies like Healthcare Blue Book and ClearCost Health have made real strides, but it’s a big and enormously complicated problem to solve. In part, that’s because hospitals don’t like to disclose their prices and insurers are often contractually prohibited from sharing what they pay specific hospitals for specific procedures.   

Recognizing the issue, the federal government has mandated that, as of January 1 of this year, hospitals must post their pricing for common procedures on their websites in an easily downloadable format.

There’s a quick opportunity here to put your web scraping tools to work to gather all this pricing data in one place and normalize it. Certainly, there is an analytical product in there somewhere. But it’s less of an opportunity than it seems, because what hospitals are generally posting are their list prices – and virtually nobody pays these prices.

The challenge in hospital pricing is to find out what a specific insurance plan pays a specific hospital for, say, a hip replacement. This could be an ideal opportunity to turn to the crowd.

One approach might be to aggregate all the pricing data that hospitals are now required to publish and use it as a data backbone – essentially a starting point. Then you could turn to consumers and ask them to anonymously submit their hospital bills and insurance statements. Take those images, use optical character recognition to get them into raw data format, then develop software to extract the valuable pricing data. When specific price data isn’t available, you could fall back on list price data, which would at least show whether a hospital is relatively more or less expensive.
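The fallback logic described above can be sketched in a few lines. This is a hypothetical illustration only: the field names (`hospital_id`, `procedure`) and the two price dictionaries are invented for the example, and in practice the crowd-sourced prices would come out of the OCR pipeline.

```python
# Hypothetical sketch of the price-lookup fallback described above.
# Keys and data shapes are illustrative assumptions, not a real schema.

def best_price(hospital_id, procedure, crowd_prices, list_prices):
    """Prefer a crowd-sourced negotiated price; fall back to the
    hospital's published list price when none has been submitted."""
    key = (hospital_id, procedure)
    if key in crowd_prices:
        return crowd_prices[key], "negotiated"
    if key in list_prices:
        return list_prices[key], "list"
    return None, "unknown"
```

Even where only the "list" tier is available, the result still supports relative comparisons between hospitals, which is the point made above.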

Obviously it will take a long time to build a comprehensive database consisting of millions of price points, but there are a lot of consumer groups and other constituencies that would be very interested in your success and would work with you to increase the number of bills submitted. Hospitals won’t like this one bit, but as is so often the case, if one group doesn’t want the data out there, you have immediate confirmation that the data are valuable to some other group. Ironically, hospitals submit their price quotes for medical devices to a fascinating data company called MDBuyline to make sure they aren’t overpaying for their purchases.

Sure, there is lots of complexity hiding under this simple framework, and building a comprehensive database will take time. But the bromide “don’t let the perfect be the enemy of the good” nicely describes a key to success in the data business. As long as your database is the best available, it doesn’t have to be either complete or perfect. In almost every case, data is so important to decision-making that buyers will take what they can get, warts and all. This is not an invitation to be lazy or sloppy. Rather, it is recognition that you’ll have a marketable product long before you have a complete and perfect product. Just one more reason data is such a great business. Should hospital price data be on your New Year’s resolution list?


Form Follows Function

Numerous online marketing trade associations have announced their latest initiative to bring structure and transparency to an industry that can only be called the Wild, Wild West of the data world: online audience data. Their approach offers some useful lessons to data publishers.

At their brand-new one-page website, this industry coalition is introducing its “Data Transparency Label.” In an attempt to be hip and clever, the coalition has modeled its data record on the familiar nutrition labels found on most food packaging today. It’s undeniably cute, but it’s a classic case of form not following function. Having decided on this approach, the designers of this label immediately boxed themselves in as to what kind and how much data they could present to buyers. I see this all the time with new data products: so much emphasis is placed on how the data looks, its visual presentation, that important data elements often end up getting minimized, hidden or even discarded. Pleasing visual presentation is desirable, but it shouldn’t come at the expense of the data itself.

The other constraint you immediately see is that this label format works great if an audience is derived from a single source by a single data company. But the real world is far messier than that. What if the audience is aggregated from multiple sources? What if its value derives from complex signal data that may be sourced from multiple third parties? What about resellers? Life is complicated. This label pretends it is simple. Having spent many years involved with data cards for mailing lists, during which time I became deeply frustrated by the lost opportunities caused by a simple approach used to describe increasingly sophisticated products, I see history about to repeat itself.

My biggest objection to this new label is that its focus seems to be 100% on transparency, with little attention being paid to equally valuable uses such as sourcing and comparison. The designers of this label allude to a taxonomy that will be used for classification purposes, but it’s only mentioned in passing and doesn’t feel like a priority focus at all. Perhaps most importantly, there’s no hint of whether these labels will be offered as a searchable database. There’s a potentially powerful audience sourcing tool here, and if anyone is considering that, they aren’t talking about it.

Takeaways to consider:

·     When designing a new data product, don’t allow yourself to get boxed in by design choices made for visual appeal

·     The real world is messy, with lots of exceptions. If you don’t provide for these exceptions, you’ll have a product that will never reach its full potential

·     Always remember that a good data product is much more than a filing cabinet that is used to look up specific facts. A thoughtful, well-organized dataset can deliver a lot more value to users and often to multiple groups of users. Don’t limit yourself to a single use case for your product – you’ll just be limiting your opportunity.

Regulating by the Numbers

While so many large financial institutions were teetering during the Great Recession, regulators trying to bring stability to the global financial system quickly learned a startling fact: there was really no way to net out how much money one financial institution owed to another.

The reason for this is that the complex financial trades that banks were engaged in weren’t straightforward bank-to-bank deals. JP Morgan didn’t just do trades with Citibank, for example. Rather, they were done through a web of subsidiaries, many of them set up specifically to be opaque and obscure. And that’s just the banks. Add in hedge funds and other investors, and their offshore companies and subsidiaries that also were designed to be opaque, and you quickly get to mind-numbing complexity. 

With an eye to better regulation and better information during a future financial crisis, an idea was proposed during a 2011 meeting of the G-20 countries to create a numbering system called the Legal Entity Identifier (LEI). The simple idea was that if every legal entity engaged in financial transactions had a unique number, and the record of that legal entity also contained the number for its parent company, it would be easy to roll up these records to see the total financial exposure of any institution.
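The roll-up idea is simple enough to sketch. This is an illustration of the concept only, not real LEI data or the official record format: the identifiers, the parent map and the exposure amounts are all invented, and real LEI records carry far more fields.

```python
# Illustrative sketch of the roll-up behind the LEI idea.
# All identifiers and amounts below are invented for the example.

def ultimate_parent(lei, parents):
    """Follow parent pointers until reaching an entity with no
    recorded parent (with a guard against accidental cycles)."""
    seen = set()
    while parents.get(lei) and lei not in seen:
        seen.add(lei)
        lei = parents[lei]
    return lei

def total_exposure(exposures, parents):
    """Sum each entity's exposure up to its ultimate parent."""
    totals = {}
    for lei, amount in exposures.items():
        top = ultimate_parent(lei, parents)
        totals[top] = totals.get(top, 0) + amount
    return totals
```

With records like these, a regulator could see at a glance that exposures booked through opaque subsidiaries all net up to the same parent institution.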

While you may never have heard of it, the LEI system actually exists, and most financial institutions now have LEI numbers. There is a push in some countries (in the United States, the Treasury Department is leading the charge) to require all companies to obtain an LEI number, but it’s been slow going so far.

If this discussion has you wondering about the DUNS number from D&B, not to worry: it’s alive and well. It’s also far more evolved and comprehensive than the LEI system. However, as a privately maintained identifier system, D&B not unreasonably wants to be paid for its use. This rankles some government agencies that are paying substantial sums to D&B for access to the DUNS system, and more than a few are pushing for broad expansion of the LEI system as a replacement for the DUNS system. Suffice to say there is a lot going on behind the scenes.

There are a number of free lookup services for LEI records, and the information is in the public domain. Some data publishers may find immediate uses for LEI data, but its fundamental weakness at this point is that it’s hit and miss as to which companies have registered. Still, it’s a database to know about and watch, particularly if you have an interest in company relationships. Over time, it’s likely its coverage and importance will grow.

Fresh Data Sold Here! 

While many successful data publishers obsess about continually adding new features and functionality to their data products, there are lots of good reasons to be regularly evaluating your data as well.

Don’t get me wrong: new features and functionality are critically important, particularly if you have a data product that offers a workflow solution.

But adding new, well-selected data elements can add significant value and appeal as well. Here are a few examples:

Morningstar just enhanced its suite of investment analysis tools by introducing a single new data element: a Carbon Risk Score. This score assesses how vulnerable a company is financially to the transition away from a fossil-fuel-based economy to a lower-carbon economy. Not only does the score hold significant value in its own right, but as an individual and consistently presented data element, it can be used for discovery and filtering by investment analysts. Moreover, as a proprietary piece of information, it gives Morningstar additional differentiation and strengthens its competitive edge.

Data-driven real estate listings sites such as Zillow and Trulia have moved away from tussling over who has the most complete listings to trying to outdo each other with deeper datasets. Various combinations of these sites now give detailed information and ratings on local schools, crime data, traffic data, neighborhood data, walkability data … even data on whether or not a particular home is likely to be a good candidate for solar panels! And in a move I particularly admire, they have gotten major cable and internet companies to pay to indicate if a particular house is eligible for their services. In the hotly competitive world of real estate data sites, it’s a relentless battle at the data element level, all with the goal of providing the most attractive one-stop shop for prospective homebuyers.

Consider too the intensely competitive market of hotel booking databases. Think of services such as Expedia, TripAdvisor and Oyster. Having exhausted themselves by all claiming to offer the lowest rates, they’re now seeking to differentiate themselves at the data element level. Using filters, site visitors can draw on specific data elements to locate hotels with free wi-fi, that accept pets, that have handicapped access, that are green or sustainable, that are LGBT-welcoming and even hotels that have a party atmosphere.

Features and functionality matter, but a single new and well-chosen data element can add tremendous value, while simultaneously providing competitive advantage and product differentiation. Keep your data fresh of course, but always be on the lookout for fresh new data elements as well.

Data Flipping

One of the best things about government databases is that even when the government agency makes the database available on its website for free, it isn’t very useful. That’s because government agencies put these databases online for regulatory or compliance reasons. They’re designed to search for known entities, because the expectation is that you are checking the license status of a company, or perhaps its compliance history.

Occasionally, a government agency will get ambitious and permit geographic searches, but in these cases, there are real limitations. That’s because the underlying data were collected for regulatory, not marketing purposes. So, for example, a manufacturer with 30 plants around the country may only appear in one ZIP code because the government agency wants filings only from headquarters locations.

Taking a regulatory database and changing it into, say, a marketing database, is something I call “flipping the file,” because while the underlying data remains the same, the way the database is accessed is different. Sometimes this is as simple as offering more search options; sometimes it involves normalizing or re-structuring the data to make it more useful and accessible. As just one example, a company called Labworks built a product called the RIA Database. It started with an investment advisor database that the SEC maintains for regulatory purposes, and then flipped the file to make the same database useful to companies that wanted to market to investment advisors. There are hundreds of data publishers doing this in different markets, and as you might expect, it’s a very attractive model since the underlying data can be obtained for free.
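In its simplest form, flipping a file can be little more than building the indexes the agency never bothered with. A minimal sketch, assuming a made-up record layout (`reg_no`, `state`, `zip`): the regulatory file supports lookup by registration number only, and the flip adds geographic access paths for marketing-style searches.

```python
from collections import defaultdict

# Hypothetical sketch of "flipping the file": the agency's file is keyed
# by registration number, so we add secondary indexes that support the
# marketing-style searches the agency never offered.
# The record fields (reg_no, state, zip) are invented for the example.

def flip(records):
    """Index a list of record dicts by state and by ZIP code."""
    by_state, by_zip = defaultdict(list), defaultdict(list)
    for rec in records:
        by_state[rec["state"]].append(rec["reg_no"])
        by_zip[rec["zip"]].append(rec["reg_no"])
    return by_state, by_zip
```

The underlying data is untouched; only the access paths change, which is exactly the point of the flip.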

In addition to simply flipping a file, you can also enhance a database. The shortcoming of many government databases is that they focus on companies, not people, so while there may be a wealth of information on the company, data buyers typically want to know the names of contacts at those companies. Companies such as D&B and ZoomInfo do a brisk business licensing their contact information to be appended onto government databases of company information.
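The append step is essentially a join on a shared company identifier. A hedged sketch, using an invented `duns` field as the join key and made-up contact data; a real append against licensed D&B or ZoomInfo data would of course follow their schemas and terms.

```python
# Illustrative append step: licensed contact data keyed by a shared
# company identifier (a made-up "duns" field here) is joined onto
# the government file's company records.

def append_contacts(companies, contacts):
    """Return company records enriched with any matching contact names."""
    return [
        {**company, "contacts": contacts.get(company["duns"], [])}
        for company in companies
    ]
```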

This is one of the truly magical aspects of the data business. Databases built for one reason can often be re-purposed for an entirely different use. And re-purposing can involve something as little as a new user interface. This magic isn’t limited to government data of course. Another great place to look for flipping opportunities is so-called “data exhaust,” data created in the course of some other activity, and thus not considered valuable by the entity creating it. You can even license data from other data providers and re-purpose it. There are a number of mapping products, for example, that take licensed company data and essentially create a new user interface by displaying data in a map context.

Increasingly, identifying the data need is as important as identifying the data source. With data, it’s all in how you look at it.