
Where the Value Is in Visual Data

The New York Times recently reported on the results of a fascinating project conducted at Stanford University. Using over 50 million images drawn from Google Street View, along with ZIP code data, the researchers were able to associate automobile ownership preferences with voting patterns. For example, the researchers found that the type of vehicle most strongly associated with Republican voting districts is the extended-cab pickup truck.

While this particular finding may not surprise you, the underlying work represents a programmatic tour de force, because artificial intelligence software was used to identify and classify the vehicles found in these 50 million images. The researchers used automotive experts to identify specific makes and models of cars from the images, giving the software a basis for training itself to find and identify vehicles on its own, regardless of the angle of the photo, shadows, and a host of other factors that make this anything but an easy task.

This project is believed to represent the first time that images have been used on a large scale to develop data. And while this image identification is a technically impressive example of both artificial intelligence and Big Data, most of the really useful insights come from associating the findings with other datasets, what I like to refer to as Little Data.

Think about it. The artificial intelligence software is given as input an image, and the ZIP code associated with that image. The software identifies an automobile make and model from the image, and creates an output record with two elements: the ZIP code and a normalized make and model description of the automobile. With this, you can explore auto ownership patterns by geography. But with just a few more steps, you can go a lot further.

You can use “little data” government and private datasets to link ZIP code to voting districts and thus voting patterns. With this information, you can determine that people living in Republican districts prefer extended-cab pickup trucks.

You can also use the ZIP code in the record to link to “little data” Census demographic data summarized at ZIP level. With this, you can correlate car ownership patterns to such things as income, race, education and ethnicity. Indeed, the study found it could predict demographics and voting patterns based on auto ownership.

And you can go further. You can link your normalized automobile make and model data to “little data” datasets of automobile technical specifications, which is how the study determined, for example, that based on miles per gallon, Burlington, Vermont is the greenest city in the United States.
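The linking described in the steps above can be sketched in a few lines of Python. The lookup tables here are invented stand-ins for the real voting-district, Census, and technical-specification datasets, and the field names are illustrative only:

```python
# Output records from the image-recognition step: (ZIP code, make/model).
vehicle_records = [
    ("05401", "Toyota Prius"),
    ("79936", "Ford F-150 SuperCab"),
]

# "Little data" lookups keyed by ZIP code or make/model -- all invented.
zip_to_district = {"05401": "Democratic", "79936": "Republican"}
model_to_mpg = {"Toyota Prius": 52, "Ford F-150 SuperCab": 20}

# Linking the Big Data output to the Little Data lookups turns a
# two-field record into something you can analyze by geography,
# politics, or fuel economy.
enriched_records = []
for zip_code, model in vehicle_records:
    enriched_records.append({
        "zip": zip_code,
        "model": model,
        "district": zip_to_district.get(zip_code, "unknown"),
        "mpg": model_to_mpg.get(model),
    })

print(enriched_records[0])
```

The heavy lifting is in the AI step that produces `vehicle_records`; the joins themselves are trivial, which is exactly the point: once the Big Data has been normalized, the Little Data does the rest.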

Using artificial intelligence on a Big Data image database to build a normalized text database is impressive. But all the real insights in this study could only be developed by linking Big Data to Little Data to allow for granular analysis.

While Big Data and artificial intelligence are getting all the breathless coverage, we should never forget that Little Data is what’s providing the real value behind the scenes.  

The Roomba Ruckus

Roomba, the robot vacuum cleaner that took advanced technology and applied it to the consumer market by trying to eliminate the lowly task of vacuuming, has been in the news recently. Apparently, its devices suck up more than the dirt in your home: they are sucking up data about your home as well. And Roomba is starting to think about selling this trove of data.

There are several aspects to this development that merit discussion. First, of course, there’s the privacy issue. Roomba was forward-thinking to the extent that it buried appropriate language in its privacy agreement that allows it to do pretty much anything to the data it collects. However, that language wasn’t prominent and was written in legalese. In short, while it may be legal for Roomba to sell customer data, it wasn’t up-front and transparent with its customers.

Right now, most pundits are saying that convenience trumps privacy every time. That may be true currently, but I expect consumer attitudes will begin to shift as the nature and extent of furtive data collection fully penetrate the collective consciousness.

Exactly what data does Roomba collect and how valuable is it? I have said many times that not all data are valuable, and while Roomba certainly has a trove of data, I am not convinced it is a treasure trove of data. Many articles on the subject talk breathlessly about this goldmine of “room geometry” data. Specific potential uses (of which very few are mentioned – a big clue right there) are such things as designing speaker systems. Sounds legit, but can Roomba tell you the ceiling height of the room? Can it tell you in which rooms music is played now? There are lots of clues that these data may not in reality be all that useful.

And who would buy these data? The articles are equally breathless on this subject, suggesting that of course Amazon would want it. Others suggest Apple will snap it up, and perhaps Home Depot as well. If you step back, all you see is a list of big companies with products for the home.

The increasingly common view that every company, including manufacturers, should have a data strategy is trendy, silly and will ultimately collapse. Not all data are valuable, and having huge quantities of not-valuable data doesn’t change that fact. And when you consider that to gather these data you risk a privacy backlash and reputational damage, companies (and those who fund them) will ultimately start to realize that not all data are created equal. Only a fortunate few can casually generate high-value datasets, and even then, it’s not cost- or risk-free. My prediction: Roomba won’t be cleaning up with data anytime soon.


eBay Revamps by Adding Structure

eBay, the giant online marketplace/flea market, is reacting to lackluster growth in an interesting way: with a new focus on structured data. The goal, simply put, is to make it easier for users to find merchandise on its site.

Currently, eBay merchants upload free-text descriptions of the products they are offering for sale. This works reasonably well, but as we all know, searching on unstructured text is ultimately a hit-or-miss proposition. And with over one million merchants on eBay doing their own data entry with very few rules and little data validation, you can imagine the number of errors that result, ranging from typos to inconsistent terminology to missing data elements. The consequence of this is that buyers can’t efficiently and confidently discover all items available for sale, and sellers can’t sell their products because they are not being seen.

It may seem odd that after several decades in business, eBay is just getting around to this. But in fact it hasn’t been standing still. Rather, it’s been investing its resources in perfecting its search software, trying to use algorithms to overcome weaknesses in the descriptive product data. And while eBay has made great strides, this shift to structured data is really an admission that there are limits to free text searching.

Granular, precise search results can’t be better or more accurate than the underlying data. If you want to be able to distinguish between copper and aluminum fasteners in your search results, you need your merchants to specify copper or aluminum, spell the words correctly and consistently, and have agreement on how to handle exceptions such as copper-plated aluminum. Ideally, you also want your merchants to tag the metal used in the fastener so that you don’t have to hunt for the information in a block of text, with the associated chance of an erroneous result.
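To make the fastener example concrete, here is a small Python sketch contrasting free-text matching with a structured attribute filter. The listings and the `metal` field are invented for illustration:

```python
# Hypothetical marketplace listings: a free-text title plus a
# structured "metal" attribute tagged by the merchant.
listings = [
    {"title": "Copper fastener, 10-pack", "metal": "copper"},
    {"title": "Cooper fastner 10pk", "metal": "copper"},        # typos defeat text search
    {"title": "Aluminum fastener (copper-plated)", "metal": "aluminum"},
]

# Free-text search: misses the misspelled listing and wrongly
# matches the copper-plated aluminum item.
text_hits = [l for l in listings if "copper" in l["title"].lower()]

# Structured search: the tagged attribute finds exactly the
# copper fasteners, typos and plating notwithstanding.
tag_hits = [l for l in listings if l["metal"] == "copper"]

print(len(text_hits), len(tag_hits))
```

Both searches return two results here, but the text search returns one wrong item and misses one right one, while the attribute filter returns exactly the two copper fasteners. No amount of search-algorithm cleverness fixes that as reliably as the tag does.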

While we’ve come to believe there are no limits to full-text search wizardry, remember the best software in the world breaks down when the data is wrong or doesn’t exist. Google spent many years and millions of dollars trying to build online company directories, before finally admitting that even it couldn’t overcome missing and incorrect data.

Databases and data products are all about structure. Cleaning up and organizing data is slow, expensive and not a lot of fun, but it is a huge value-add. Indeed, one of the biggest complaints of those working in the Big Data arena is that the data they want to analyze is simply too inconsistent and undependable to use.

These days, anyone can aggregate giant pots of data. But increasingly, value is being created by making these pots of data more accessible by adding more structure. This is the essence of data publishing, and something successful data publishers fully appreciate and never forget.  

Another Kind of Data Harvesting

I have written before about the data-driven revolution taking place in agriculture today, one that will allow farms to radically increase their productivity and crop yields. Data collected from farm equipment and soil sensors allow farms to plant exactly the right seeds at exactly the right depth to maximize yields, all handled automatically by high-tech, GPS-guided farm equipment that can run autonomously. It’s an exciting future.

One of the key points of my earlier article is that a farmer’s data, by itself, isn’t that valuable. Knowledge comes from building a large enough sample of planting data from other similar farms in similar geographies in order to find benchmarks and best practices. Thus if you want data from your own farm to benefit your own farm, you need to pool your data.

But what if a farmer doesn’t want to make the needed investment to benefit from data-driven agriculture? Are there other markets for the data?

Well, it turns out that there are. As an article in the Wall Street Journal makes clear, field-level data doesn’t just benefit the farmer; there are others who will happily pay for it. For example, seed companies can get extremely detailed insights into what’s being planted and what’s growing best and where. They can use such data to inform both their R&D and their marketing and forecasting activities. There’s a Wall Street angle as well, with commodities traders looking for an edge by trying to get an early insight into what the forthcoming growing season will bring.

But even here, there’s a need for aggregation. The experience of one farm doesn’t help seed companies or traders very much. But the more farm data you can aggregate, the more valuable your dataset. The race is already on, with companies such as Grower Information Services Cooperative, Farmobile and Granular Inc. duking it out to sign up the most farmers as quickly as possible.

The simple lesson here is that even though the same farm data can be monetized in multiple ways, there is a valid, indeed critical, role for an aggregator. We see also that first-mover advantage is critical in data plays like this. And as always, market neutrality is an important advantage: you’ll have a much harder time collecting this kind of data if you are a seed company as opposed to an independent information company.

Everyone into the (data) Pool

There’s a quiet revolution going on in agriculture, much of it riding under the label of “precision agriculture.” What this means is that farms are finding they can use data to increase both their productivity and their crop yields.

To provide just one vivid example, unmanned tractors now routinely plow fields, guided by GPS and information on how deep to dig in which sections of the field for optimal results. Seeds are being planted variably as well. Instead of just dumping seeds in the earth and hoping for the best, precision machinery, guided by soil data, now determines what seeds are planted and where, almost on an inch-by-inch basis.

It’s a big opportunity, with big dollars attached to it, and everyone is jockeying to collect and own this data. The seed companies want to own it. The farm equipment companies want to own it. Even farm supply stores – the folks who sell farmers their fertilizer and other supplies – want to own it. In fact, everyone is clamoring to own the data, except perhaps the farmer.

Why not? Because a farmer’s own soil data is effectively a sample size of one. Not too valuable. Value is added when it is aggregated with data from other farmers to find patterns and establish benchmarks. It’s a natural opportunity for someone to enable farmers to share their data to mutual benefit. This is a content model we call the “closed data pool,” where a carefully selected group agrees to contribute its data, and pay to receive back the insights gleaned from the aggregated dataset.
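The pooling idea can be illustrated with a toy Python sketch. The farms, field names, and yield figures below are entirely invented; the point is simply that the benchmark only exists once the data is aggregated:

```python
from statistics import mean

# Hypothetical contributions to a closed data pool: each farm's
# yield is a sample size of one until it is combined with the rest.
pool = [
    {"farm": "A", "region": "midwest", "yield_bu_acre": 175},
    {"farm": "B", "region": "midwest", "yield_bu_acre": 190},
    {"farm": "C", "region": "midwest", "yield_bu_acre": 160},
]

# The aggregated benchmark is the insight no single farm could
# compute alone.
benchmark = mean(r["yield_bu_acre"] for r in pool)

# Each contributor gets back its position relative to the pool.
for r in pool:
    delta = r["yield_bu_acre"] - benchmark
    print(f"Farm {r['farm']}: {delta:+.1f} bu/acre vs. pool benchmark")
```

A real pool would segment by crop, soil type, and geography before benchmarking, but the structure is the same: contribute raw data, get back comparative insight.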

One great example of this model is Farmers Business Network. Farmers pool their data and pay $500 per year to access the benchmarks and insights it generates. Farmers Business Network is staffed with data scientists to make sense of the data. Very importantly, Farmers Business Network is a neutral player: it doesn’t sell seeds or tractors. Its business model is transparent, and farmers can get data insights without being tied to a particular vendor. Farmers Business Network makes its case brilliantly in its promotional video, which is well worth watching.

Market neutrality and a high level of trust are essential to building content using the closed data pool model. But it’s a powerful, sticky model that benefits every player involved. Many data publishers and other media companies are well positioned to create products using this model because they already have the neutral market position and market trust. Closed data pools are worth a closer look. Google certainly agrees: it just invested $15 million into Farmers Business Network.