AI in Action

Two well-known and highly successful data producers, Morningstar and Spiceworks, have both just announced new capabilities built on artificial intelligence (AI) technology. 

Artificial Intelligence is a much-abused umbrella term for a number of distinct technologies. Speaking very generally, the power of AI initially came from sheer computer processing power. Consider how early AI was applied to the game of chess. The “AI advantage” came from the ability to quickly assess an enormous number of possible moves and likely responses, as well as having access to a library of the best moves of the world’s best chess players. It was a brute force approach, and it worked.

Machine learning is a more nuanced approach to AI where the system is fed both large amounts of raw data and examples of desirable outcomes. The software actually learns from these examples and is able to generate successful outcomes of its own using the raw data it is supplied. 

There’s more, much more, to AI, but the power and potential are clear.

So how are data producers using AI? Morningstar has partnered with a company called Mercer to create a huge pool of quantitative and qualitative data to help investment advisors make smarter decisions for their clients. The application of AI here is to create what is essentially a next-generation search engine that moves far beyond keyword searching, making powerful connections between disparate collections of data not only to identify the most relevant results, but to pull meaning out of those search results as well.

At Spiceworks (a 2010 Model of Excellence), AI is powering two uses. The first is also a supercharged search function, designed to make it easier for IT buyers to more quickly access relevant buying information, something that is particularly important in an industry with so much volatility and change.

Spiceworks is also using AI to power a sell-side application that ingests the billions of data signals created on the Spiceworks platform each day to help marketers better target in-market buyers of specific products and services.

As the data business has evolved from offering fast access to the most data to fast access to the most relevant data, AI looks to play an increasingly important and central role. These two industry innovators, both past Models of Excellence, are blazing the trail for the rest of us, and they are well worth watching to see how their integration of AI into their businesses evolves over time.

For reference:

Spiceworks Model of Excellence profile
Morningstar Model of Excellence Profile



Where the Value Is in Visual Data

The New York Times recently reported on the results of a fascinating project conducted at Stanford University. Using over 50 million images drawn from Google Street View, along with ZIP code data, the researchers were able to associate automobile ownership preferences with voting patterns. For example, the researchers found that the type of vehicles most strongly associated with Republican voting districts are extended-cab pickup trucks.

While this particular finding may not surprise you, the underlying work represents a programmatic tour de force, because artificial intelligence software was used to identify and classify the vehicles found in these 50 million images. The researchers used automotive experts to identify specific makes and models of cars from the images, giving the software a basis for training itself to find and identify vehicles on its own, regardless of the angle of the photo, shadows and a host of other factors that make this anything but an easy task.

This project is believed to represent the first time that images have been used on a large scale to develop data. And while this image identification is a technically impressive example of both artificial intelligence and Big Data, most of the really useful insights come from associating the findings with other datasets, what I like to refer to as Little Data.

Think about it. The artificial intelligence software is given as input an image, and the ZIP code associated with that image. The software identifies an automobile make and model from the image, and creates an output record with two elements: the ZIP code and a normalized make and model description of the automobile. With this, you can explore auto ownership patterns by geography. But with just a few more steps, you can go a lot further.

You can use “little data” government and private datasets to link ZIP code to voting districts and thus voting patterns. With this information, you can determine that people living in Republican districts prefer extended-cab pickup trucks.

You can also use the ZIP code in the record to link to “little data” Census demographic data summarized at ZIP level. With this, you can correlate car ownership patterns to such things as income, race, education and ethnicity. Indeed, the study found it could predict demographics and voting patterns based on auto ownership.

And you can go further. You can link your normalized automobile make and model data to “little data” datasets of automobile technical specifications, which is how the study determined, for example, that based on miles per gallon, Burlington, Vermont is the greenest city in the United States.
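To make the linking steps concrete, here is a minimal Python sketch of the idea, not the researchers' actual method. Everything in it is made up for illustration: the ZIP codes, the vehicle names, the district labels, the miles-per-gallon figures, and the function names `top_vehicle_by_lean` and `average_mpg_by_zip`. It assumes the image classifier has already produced one record per vehicle, each holding a ZIP code and a normalized make and model.

```python
from collections import Counter, defaultdict

# Hypothetical classifier output: one record per vehicle seen in an image.
vehicle_records = [
    {"zip": "00001", "make_model": "Truck A extended cab"},
    {"zip": "00001", "make_model": "Truck A extended cab"},
    {"zip": "00001", "make_model": "Sedan B"},
    {"zip": "00002", "make_model": "Hybrid C"},
    {"zip": "00002", "make_model": "Hybrid C"},
    {"zip": "00002", "make_model": "Sedan B"},
]

# Hypothetical "little data" lookup tables (invented values).
zip_to_lean = {"00001": "R", "00002": "D"}  # ZIP -> district voting lean
model_to_mpg = {                            # make/model -> miles per gallon
    "Truck A extended cab": 18.0,
    "Sedan B": 30.0,
    "Hybrid C": 50.0,
}


def top_vehicle_by_lean(records, zip_lookup):
    """Link each record to a voting lean via ZIP, then find the most common vehicle."""
    counts = defaultdict(Counter)
    for rec in records:
        lean = zip_lookup.get(rec["zip"])
        if lean is not None:  # skip ZIPs missing from the lookup table
            counts[lean][rec["make_model"]] += 1
    return {lean: c.most_common(1)[0][0] for lean, c in counts.items()}


def average_mpg_by_zip(records, mpg_lookup):
    """Link each record to a spec table via make/model, then average MPG per ZIP."""
    totals = defaultdict(lambda: [0.0, 0])  # ZIP -> [sum of mpg, vehicle count]
    for rec in records:
        mpg = mpg_lookup.get(rec["make_model"])
        if mpg is not None:  # skip models missing from the spec table
            totals[rec["zip"]][0] += mpg
            totals[rec["zip"]][1] += 1
    return {z: total / n for z, (total, n) in totals.items()}


print(top_vehicle_by_lean(vehicle_records, zip_to_lean))
print(average_mpg_by_zip(vehicle_records, model_to_mpg))
```

Note that the Big Data step contributes only the two-field records at the top. Every insight below that point comes from a small lookup table keyed on ZIP code or on make and model, which is why the normalization of those two fields matters so much.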

Using artificial intelligence on a Big Data image database to build a normalized text database is impressive. But all the real insights in this study could only be developed by linking Big Data to Little Data to allow for granular analysis.

While Big Data and artificial intelligence are getting all the breathless coverage, we should never forget that Little Data is what’s providing the real value behind the scenes.  

The Roomba Ruckus

Roomba, the robot vacuum cleaner that took advanced technology and applied it to the consumer market by trying to eliminate the lowly task of vacuuming, has been in the news recently. Apparently, its devices suck up more than the dirt in your home: they are sucking up data about your home as well. And Roomba is starting to think about selling this trove of data.

There are several aspects to this development that merit discussion. First, of course, there’s the privacy issue. Roomba was forward-thinking to the extent that it buried appropriate language in its privacy agreement that allows it to do pretty much anything to the data it collects. However, that language wasn’t prominent and was written in legalese. In short, while it may be legal for Roomba to sell customer data, it wasn’t up-front and transparent with its customers.

Right now, most pundits are saying that convenience trumps privacy every time. That may be true currently, but I expect consumer attitudes will begin to shift as the nature and extent of furtive data collection fully penetrate the collective consciousness.

Exactly what data does Roomba collect and how valuable is it? I have said many times that not all data are valuable, and while Roomba certainly has a trove of data, I am not convinced it is a treasure trove of data. Many articles on the subject talk breathlessly about this goldmine of “room geometry” data. Specific potential uses (of which very few are mentioned – a big clue right there) are such things as designing speaker systems. Sounds legit, but can Roomba tell you the ceiling height of the room? Can it tell you what rooms play music now? There are lots of clues that these data may not in reality be all that useful.

And who would buy these data? The articles are equally breathless on this subject, suggesting that of course Amazon would want it. Others suggest Apple will snap it up, and perhaps Home Depot as well. If you step back, all you see is a list of big companies with products for the home.

The increasingly common view that every company, including manufacturers, is expected to have a data strategy is trendy, silly and will ultimately collapse. Not all data are valuable, and having huge quantities of not-valuable data doesn’t change that fact. And once they consider that gathering these data risks a privacy backlash and reputational damage, companies (and those who fund them) will ultimately start to realize that not all data are created equal. Only a fortunate few can casually generate high-value datasets, and even then, it’s not cost or risk free. My prediction: Roomba won’t be cleaning up with data anytime soon.


eBay Revamps by Adding Structure

eBay, the giant online marketplace/flea market, is reacting to lackluster growth in an interesting way: with a new focus on structured data. The goal, simply put, is to make it easier for users to find merchandise on its site.

Currently, eBay merchants upload free-text descriptions of the products they are offering for sale. This works reasonably well, but as we all know, searching on unstructured text is ultimately a hit-or-miss proposition. And with over one million merchants on eBay doing their own data entry with very few rules and little data validation, you can imagine the number of errors that result, ranging from typos to inconsistent terminology to missing data elements. The consequence of this is that buyers can’t efficiently and confidently discover all items available for sale, and sellers can’t sell their products because they are not being seen.

It may seem odd that after several decades in business, eBay is just getting around to this. But in fact it hasn’t been standing still. Rather, it’s been investing its resources in perfecting its search software, trying to use algorithms to overcome weaknesses in the descriptive product data. And while eBay has made great strides, this shift to structured data is really an admission that there are limits to free text searching.

Granular, precise search results can’t be better or more accurate than the underlying data. If you want to be able to distinguish between copper and aluminum fasteners in your search results, you need your merchants to specify copper or aluminum, spell the words correctly and consistently, and have agreement on how to handle exceptions such as copper-plated aluminum. Ideally, you also want your merchants to tag the metal used in the fastener so that you don’t have to hunt for the information in a block of text, with the associated chance of an erroneous result.
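The fastener example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of how eBay's search works: the listings, the `material` field, and both function names are hypothetical, and the sketch assumes merchants have agreed to tag a copper-plated aluminum fastener by its base metal.

```python
# Hypothetical listings: a free-text title plus a merchant-supplied material tag.
listings = [
    {"id": 1, "title": "Copper fasteners, box of 50", "material": "copper"},
    {"id": 2, "title": "Coper fastener 1/4 in", "material": "copper"},  # typo in the title
    {"id": 3, "title": "Aluminum fasteners for copper pipe", "material": "aluminum"},
    {"id": 4, "title": "Copper-plated aluminum fasteners", "material": "aluminum"},
]


def free_text_search(items, term):
    """Return ids of listings whose title contains the term (case-insensitive)."""
    term = term.lower()
    return [item["id"] for item in items if term in item["title"].lower()]


def structured_search(items, material):
    """Return ids of listings whose material tag equals the requested value."""
    return [item["id"] for item in items if item["material"] == material]


print(free_text_search(listings, "copper"))   # [1, 3, 4]
print(structured_search(listings, "copper"))  # [1, 2]
```

The free-text search misses listing 2 because of the typo and wrongly returns listings 3 and 4, which merely mention copper. The tagged search returns exactly the two copper fasteners, but only because every merchant filled in the tag consistently.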

While we’ve come to believe there are no limits to full-text search wizardry, remember the best software in the world breaks down when the data is wrong or doesn’t exist. Google spent many years and millions of dollars trying to build online company directories, before finally admitting that even it couldn’t overcome missing and incorrect data.

Databases and data products are all about structure. Cleaning up and organizing data is slow, expensive and not a lot of fun, but it is a huge value-add. Indeed, one of the biggest complaints of those working in the Big Data arena is that the data they want to analyze is simply too inconsistent and undependable to use.

These days, anyone can aggregate giant pots of data. But increasingly, value is being created by making these pots of data more accessible by adding more structure. This is the essence of data publishing, and something successful data publishers fully appreciate and never forget.  

Another Kind of Data Harvesting

I have written before about the data-driven revolution that’s taking place in agriculture today that will allow farms to radically increase their productivity and crop yields. Data collected from farm equipment and soil sensors allow farms to plant exactly the right seeds at exactly the right depth to maximize yields, all handled automatically by high tech farm equipment guided by GPS that can run itself autonomously. It’s an exciting future.

One of the key points of my earlier article is that a farmer’s data, by itself, isn’t that valuable. Knowledge comes from building a large enough sample of planting data from other similar farms in similar geographies in order to find benchmarks and best practices. Thus if you want data from your own farm to benefit your own farm, you need to pool your data.
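A small Python sketch shows why pooling matters. The farms, planting depths, yields, and function names here are all invented for illustration. The point is only that a benchmark by planting depth cannot be computed from one farm's single data point, but falls out easily once several similar farms contribute.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pooled records from similar farms (all values invented).
pooled = [
    {"farm": "A", "depth_in": 1.5, "yield_bu_acre": 170},
    {"farm": "B", "depth_in": 2.0, "yield_bu_acre": 185},
    {"farm": "C", "depth_in": 2.0, "yield_bu_acre": 181},
    {"farm": "D", "depth_in": 2.5, "yield_bu_acre": 176},
    {"farm": "E", "depth_in": 1.5, "yield_bu_acre": 166},
]


def benchmark_by_depth(records):
    """Average yield for each planting depth across all contributing farms."""
    by_depth = defaultdict(list)
    for rec in records:
        by_depth[rec["depth_in"]].append(rec["yield_bu_acre"])
    return {depth: mean(yields) for depth, yields in by_depth.items()}


def best_depth(records):
    """Planting depth with the highest average yield in the pooled sample."""
    bench = benchmark_by_depth(records)
    return max(bench, key=bench.get)


print(benchmark_by_depth(pooled))
print(best_depth(pooled))
```

Farm A on its own knows only that planting at 1.5 inches produced 170 bushels per acre. It takes the pooled sample to suggest that a different depth might have done better.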

But what if a farmer doesn’t want to make the needed investment to benefit from data-driven agriculture? Are there other markets for the data?

Well, it turns out that there are. As an article in the Wall Street Journal makes clear, field-level data doesn’t just benefit the farmer; there are others who will happily pay for it. For example, seed companies can get extremely detailed insights into what’s being planted and what’s growing best and where. They can use such data to inform both their R&D and their marketing and forecasting activities. There’s a Wall Street angle as well, with commodities traders looking for an edge by trying to get an early insight into what the forthcoming growing season will bring.

But even here, there’s a need for aggregation. The experience of one farm doesn’t help seed companies or traders very much. But the more farm data you can aggregate, the more valuable your dataset. The race is already on, with companies such as Grower Information Services Cooperative, Farmobile and Granular Inc. duking it out to sign up the most farmers as quickly as possible.

The simple lesson here is that even though the same farm data can be monetized in multiple ways, there is a valid, indeed critical, role for an aggregator. We see also that first-mover advantage is critical in data plays like this. And as always, market neutrality is an important advantage: you’ll have a much harder time collecting this kind of data if you are a seed company as opposed to an independent information company.