Use Your Computer Vision

Those familiar with the powerhouse real estate listing site Zillow will likely recall that it burst on the scene in 2006 with an irresistible new offering: a free online estimate of the value of every house in the United States. Zillow calls them Zestimates. The site crashed continuously from too much traffic when it first launched, and Zillow now gets a stunning 195 million unique visitors monthly, all with virtually no advertising. Credit the Zestimates for this.

As you would expect, Zestimates are derived algorithmically, using a combination of public domain and recent sales data. The algorithm selects recent sales of comparable nearby houses to compute an estimated value.
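To make the comparable-sales idea concrete, here is a minimal sketch in Python. It is not Zillow's algorithm, which is not public. It assumes a deliberately simple rule: keep recent sales within a set distance and of similar size, then apply their median price per square foot to the subject house. The `Sale` fields, the thresholds and the sample figures are all invented for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Sale:
    price: float       # sale price in dollars
    sqft: float        # living area in square feet
    miles_away: float  # distance from the house being valued


def estimate_value(subject_sqft, sales, max_miles=1.0, sqft_tolerance=0.2):
    """Estimate value from the median price per square foot of comparable sales.

    The distance and size thresholds are illustrative assumptions.
    """
    comps = [
        s for s in sales
        if s.miles_away <= max_miles
        and abs(s.sqft - subject_sqft) <= sqft_tolerance * subject_sqft
    ]
    if not comps:
        return None  # no comparable sales, so no estimate
    return median(s.price / s.sqft for s in comps) * subject_sqft


# Made-up sample data
recent_sales = [
    Sale(price=400_000, sqft=2_000, miles_away=0.3),
    Sale(price=450_000, sqft=2_200, miles_away=0.6),
    Sale(price=900_000, sqft=4_500, miles_away=0.4),  # too large to be comparable
    Sale(price=380_000, sqft=2_000, miles_away=5.0),  # too far away
]
estimate = estimate_value(2_100, recent_sales)
```

Only the first two sales pass both filters here, so the estimate rests on two comparables. A production model would weigh many more attributes, which is exactly where the photo-derived data discussed below comes in.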

As you would also expect, professional appraisers hate Zestimates. They believe their valuations are more accurate because they hand-select the comparable nearby homes. However, with the goal of consistent appraisals, the hand selection process that appraisers use is so prescribed and formulaic that it operates much like an algorithm does. At this level, you could argue that appraisers have little advantage over the computed Zestimate.

However, one area in which appraisers have a distinct advantage is that they are able to assess the condition and interiors of the properties they are appraising. They visually inspect the home and can use interior photos of comparable homes that have recently sold to refine their estimates.

Not to be outdone, Zillow is employing artificial intelligence to create what it calls “computer vision.” Using interior and exterior photos of millions of recently sold homes, Zillow now assesses such things as curb appeal, construction quality and even landscaping; quantifies what it finds; and factors that information into its valuation algorithm. When it has interior photos of a house, it scans for such things as granite countertops, upgraded bathrooms and even how much natural light the house enjoys, and incorporates this information into its algorithm as well.

With this advance, the appraisers’ remaining competitive advantage looks very much like owning “the last mile,” because they are the feet on the street that actually visit the house being appraised. But you can see where things are heading: as companies like Zillow refine their technology, the day may well come when an appraisal is performed by the homeowner uploading interior pictures of her house, and perhaps confirming public record data, such as the number of rooms in the house.

There are many market verticals where automated inspection and interpretation of visual data can be used. While the technology is in its infancy, its power is undeniable, so it’s not too early to think about possible ways it might enhance your data products.

A Healthy New Year

We’re in the midst of a transformational shift in the healthcare industry. Likely you have experienced it yourself, and it’s probably already hit you in the pocketbook. It’s the shift to what is called consumer-directed healthcare.

While on the surface consumer-directed healthcare may seem like nothing more than an attempt by employers to shift some of their spiraling healthcare costs onto their employees, there is much more going on behind the scenes. There is a lot of public policy driving this shift. The general idea is that healthcare costs are out of control because those buying healthcare services traditionally haven’t been the ones paying for them. By shifting healthcare costs to the consumer, the reasoning goes, consumers will demand better value for their money by becoming smart healthcare shoppers, and healthcare costs will begin to decline.

It all makes sense on paper, but there is one huge stumbling block in making this approach work: it’s hard to be a smart shopper when none of the things you are buying have price tags on them.

Data entrepreneurs have already seen this opportunity. Companies like Healthcare Blue Book and ClearCost Health have made real strides, but it’s a big and enormously complicated problem to solve. In part, that’s because hospitals don’t like to disclose their prices and insurers are often contractually prohibited from sharing what they pay specific hospitals for specific procedures.   

Recognizing the issue, the federal government has mandated that as of January 1 of this year, hospitals must post their pricing for common procedures on their websites in an easily downloadable format.

 There’s a quick opportunity here to put your website scraping tools to work to gather all this pricing data in one place and normalize it. Certainly, there is an analytical product in there somewhere. But it’s less of an opportunity than it seems because what hospitals are generally posting are their list prices – and virtually nobody pays these prices. 

The challenge in hospital pricing is to find out what a specific insurance plan pays a specific hospital for, say, a hip replacement. This could be an ideal opportunity to turn to the crowd.

 One approach might be to aggregate all the pricing data that hospitals are now required to publish and use it as a data backbone – essentially a starting point. Then you could turn to consumers and ask them to anonymously submit their hospital bills and insurance statements. Take those images, use optical character recognition to get them into raw data format, then develop software to extract the valuable pricing data. When specific price data isn’t available, you could back off to list price data that would at least show if a hospital is relatively more or less expensive.
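The approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hospital name, the prices, the wording of the bill text and the `plan paid` pattern are all invented, and real bills and insurance statements vary far too much for a single regular expression. The OCR step itself is assumed to have already produced plain text.

```python
import re

# Hypothetical list prices gathered from hospital websites:
# (hospital, procedure) -> dollars. This is the "data backbone".
list_prices = {("General Hospital", "hip replacement"): 60_000.0}

# Plan-specific amounts extracted from consumer-submitted bills:
# (hospital, procedure, plan) -> list of observed dollar amounts.
paid_prices = {}

# Assumed wording on the statement; real documents will differ.
AMOUNT = re.compile(r"plan paid[:\s]*\$?([\d,]+(?:\.\d{2})?)", re.IGNORECASE)


def extract_plan_paid(ocr_text):
    """Pull the amount the plan paid out of OCR text, or None if absent."""
    match = AMOUNT.search(ocr_text)
    return float(match.group(1).replace(",", "")) if match else None


def record_bill(hospital, procedure, plan, ocr_text):
    """Store one crowdsourced price point if an amount can be extracted."""
    amount = extract_plan_paid(ocr_text)
    if amount is not None:
        paid_prices.setdefault((hospital, procedure, plan), []).append(amount)


def lookup_price(hospital, procedure, plan):
    """Prefer crowdsourced plan prices; back off to the list price."""
    observed = paid_prices.get((hospital, procedure, plan))
    if observed:
        return sum(observed) / len(observed), "crowdsourced"
    if (hospital, procedure) in list_prices:
        return list_prices[(hospital, procedure)], "list price"
    return None, "unknown"


key = ("General Hospital", "hip replacement", "Plan X")
before = lookup_price(*key)  # no bills yet, so this falls back to list price
record_bill(*key, ocr_text="Total billed: $60,000.00  Plan paid: $21,450.00")
after = lookup_price(*key)   # now answered from the submitted bill
```

Returning the source alongside the price lets the product show buyers which figures are observed and which are only list-price stand-ins.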

 Obviously it will take a long time to build a comprehensive database consisting of millions of price points, but there are a lot of consumer groups and other constituencies that would be very interested in your success and would work with you to increase the number of bills submitted. Hospitals won’t like this a bit, but as is so often the case, if one group doesn’t want the data out there, you have immediate confirmation that the data are valuable to some other group. Ironically, hospitals submit their price quotes for medical devices to a fascinating data company called MDBuyline to make sure they aren’t over-paying for their purchases.

 Sure, there is lots of complexity hiding under this simple framework. Also, it’s obvious that it will take a long time to build a comprehensive database. But the bromide “don’t let the perfect be the enemy of the good” nicely describes a key to success in the data business. As long as your database is the best available, it doesn’t have to be either complete or perfect. In almost every case, data is so important to decision-making that buyers will take what they can get, warts and all. This is not an invitation to be lazy or sloppy. Rather, it is recognition that you’ll have a marketable product long before you have a complete and perfect product. Just one more reason data is such a great business. Should hospital price data be on your New Year’s resolution list?

Form Follows Function

Numerous online marketing trade associations have announced their latest initiative to bring structure and transparency to an industry that can only be called the Wild, Wild West of the data world: online audience data. Their approach offers some useful lessons to data publishers.

At its brand-new one-page website, this industry coalition is introducing its “Data Transparency Label.” In an attempt to be hip and clever, the coalition has modeled its data record on the familiar nutrition labels found on most food packaging today. It’s undeniably cute, but it’s a classic case of form not following function. Having decided on this approach, the designers of this label immediately boxed themselves in as to what kind and how much data they could present to buyers. I see this all the time with new data products: so much emphasis is placed on how the data looks, its visual presentation, that important data elements often end up getting minimized, hidden or even discarded. Pleasing visual presentation is desirable, but it shouldn’t come at the expense of our data.

The other constraint you immediately see is that this label format works great if an audience is derived from a single source by a single data company. But the real world is far messier than that. What if the audience is aggregated from multiple sources? What if its value derives from complex signal data that may be sourced from multiple third parties? What about resellers? Life is complicated. This label pretends it is simple. Having spent many years involved with data cards for mailing lists, during which time I became deeply frustrated by the lost opportunities caused by a simple approach used to describe increasingly sophisticated products, I see history about to repeat itself.

My biggest objection to this new label is that its focus seems to be 100% on transparency, with little attention being paid to equally valuable uses such as sourcing and comparison. The designers of this label allude to a taxonomy that will be used for classification purposes, but it’s only mentioned in passing and doesn’t feel like a priority focus at all. Perhaps most importantly, there’s no hint of whether these labels will be offered as a searchable database. There’s a potentially powerful audience sourcing tool here, and if anyone is considering that, they aren’t talking about it.

 Take-aways to consider:

·     When designing a new data product, don’t allow yourself to get boxed in by design

·     The real world is messy, with lots of exceptions. If you don’t provide for these exceptions, you’ll have a product that will never reach its full potential

·     Always remember that a good data product is much more than a filing cabinet that is used to look up specific facts. A thoughtful, well-organized dataset can deliver a lot more value to users and often to multiple groups of users. Don’t limit yourself to a single use case for your product – you’ll just be limiting your opportunity.

Regulating by the Numbers

While so many large financial institutions were teetering during the Great Recession, regulators trying to bring stability to the global financial system quickly learned a startling, shocking fact: there was really no way to net out how much money one financial institution owed to another.

The reason for this is that the complex financial trades that banks were engaged in weren’t straightforward bank-to-bank deals. JP Morgan didn’t just do trades with Citibank, for example. Rather, they were done through a web of subsidiaries, many of them set up specifically to be opaque and obscure. And that’s just the banks. Add in hedge funds and other investors, and their offshore companies and subsidiaries that also were designed to be opaque, and you quickly get to mind-numbing complexity. 

With an eye to better regulation and better information during a future financial crisis, an idea was proposed during a 2011 meeting of the G-20 countries to create a numbering system called the Legal Entity Identifier (LEI). The simple idea was that if every legal entity engaged in financial transactions had a unique number, and the record of that legal entity also contained the number for its parent company, it would be easy to roll up these records to see the total financial exposure of any institution.
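The roll-up idea is simple enough to sketch in Python. The identifiers and amounts below are invented placeholders (real LEIs are 20-character alphanumeric codes), and the record layout is an assumption for illustration rather than the actual LEI data format.

```python
from collections import defaultdict

# Hypothetical records: entity identifier -> parent identifier
# (None marks a top-level parent).
parent_of = {
    "BANK-A": None,
    "BANK-A-SUB1": "BANK-A",
    "BANK-A-SUB2": "BANK-A-SUB1",
    "FUND-B": None,
}

# Made-up exposures booked at individual legal entities.
exposures = [
    ("BANK-A-SUB2", 50.0),
    ("BANK-A-SUB1", 25.0),
    ("FUND-B", 10.0),
    ("BANK-A", 5.0),
]


def ultimate_parent(entity):
    """Follow parent links upward until reaching an entity with no parent."""
    seen = set()
    while parent_of.get(entity) is not None:
        if entity in seen:
            raise ValueError(f"circular ownership involving {entity}")
        seen.add(entity)
        entity = parent_of[entity]
    return entity


def roll_up(exposures):
    """Total the exposure of every entity under its ultimate parent."""
    totals = defaultdict(float)
    for entity, amount in exposures:
        totals[ultimate_parent(entity)] += amount
    return dict(totals)


totals = roll_up(exposures)
```

Everything depends on the parent link in each record being present and accurate, which is why coverage matters so much for a system like this.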

While you may never have heard of it, the LEI system actually exists, and most financial institutions now have LEI numbers. There is a push in some countries (in the United States, the Treasury Department is leading the charge) to require all companies to obtain an LEI number, but it’s been slow going so far.

If this discussion has you wondering about the DUNS number from D&B, not to worry: it’s alive and well. It’s also far more evolved and comprehensive than the LEI system. However, because DUNS is a privately maintained identifier system, D&B not unreasonably wants to be paid for its use. This rankles some government agencies that are paying substantial sums to D&B for access to the DUNS system, and more than a few are pushing for broad expansion of the LEI system as a replacement for the DUNS system. Suffice it to say there is a lot going on behind the scenes.

There are a number of free lookup services for LEI records, and the information is in the public domain. Some data publishers may find immediate uses for LEI data, but its fundamental weakness at this point is that it’s hit and miss as to which companies have registered. Still, it’s a database to know about and watch, particularly if you have an interest in company relationships. Over time, it’s likely its coverage and importance will grow.

Fresh Data Sold Here! 

While many successful data publishers obsess about continually adding new features and functionality to their data products, there are lots of good reasons to be regularly evaluating your data as well.

Don’t get me wrong: new features and functionality are critically important, particularly if you have a data product that offers a workflow solution.

But adding new, well-selected data elements can add significant value and appeal as well. Here are a few examples:

Morningstar just enhanced its suite of investment analysis tools by introducing a single new data element: a Carbon Risk Score. This score assesses how vulnerable a company is financially to the transition away from a fossil-fuel-based economy to a lower-carbon economy. Not only does the score hold significant value in its own right, but as an individual and consistently presented data element, it can be used for discovery and filtering by investment analysts. Moreover, as a proprietary piece of information, it gives Morningstar additional differentiation and strengthens its competitive edge.

Data-driven real estate listings sites such as Zillow and Trulia have moved away from tussling over who has the most complete listings to trying to outdo each other with deeper datasets. Various combinations of these sites now give detailed information and ratings on local schools, crime data, traffic data, neighborhood data, walkability data … even data on whether or not a particular home is likely to be a good candidate for solar panels! And in a move I particularly admire, they have gotten major cable and companies to pay to indicate if a particular house is eligible for their services. In the hotly competitive world of real estate data sites, it’s a relentless battle at the data element level, all with the goal of providing the most attractive one-stop shop for prospective homebuyers.

Consider too the intensely competitive market of hotel booking databases. Think of services such as Expedia, TripAdvisor and Oyster. Having exhausted themselves by all claiming to offer the lowest rates, they’re now seeking to differentiate themselves at the data element level. Using filters, site visitors can draw on specific data elements to locate hotels with free wi-fi, that accept pets, that have handicapped access, that are green or sustainable, that are LGBT-welcoming and even hotels that have a party atmosphere.

Features and functionality matter, but a single new and well-chosen data element can add tremendous value, while simultaneously providing competitive advantage and product differentiation. Keep your data fresh of course, but always be on the lookout for fresh new data elements as well.