AI in Action

Two well-known and highly successful data producers, Morningstar and Spiceworks, have both just announced new capabilities built on artificial intelligence (AI) technology. 

Artificial Intelligence is a much-abused umbrella term for a number of distinctive technologies. Speaking very generally, the power of AI initially came from sheer computer processing power. Consider how early AI was applied to the game of chess. The “AI advantage” came from the ability to quickly assess every possible combination of moves and likely responses, as well as having access to a library of all the best moves of the world’s best chess players. It was a brute force approach, and it worked.

Machine learning is a more nuanced approach to AI where the system is fed both large amounts of raw data and examples of desirable outcomes. The software actually learns from these examples and is able to generate successful outcomes of its own using the raw data it is supplied. 

There’s more, much more, to AI, but the power and potential is clear.

So how are data producers using AI? In the case of Morningstar, it has partnered with a company called Mercer to create a huge pool of quantitative and qualitative data to help investment advisors make smarter decisions for their clients. The application of AI here is to create what is essentially a next-generation search engine that moves far beyond keyword searching to make powerful connections between disparate collections of data, not only to identify the most relevant results but also to pull meaning out of them.

At Spiceworks (a 2010 Model of Excellence), AI is powering two uses. The first is also a supercharged search function, designed to make it easier for IT buyers to quickly access relevant buying information, something that is particularly important in an industry with so much volatility and change.

Spiceworks is also using AI to power a sell-side application that ingests the billions of data signals created on the Spiceworks platform each day to help marketers better target in-market buyers of specific products and services.

As the data business has evolved from offering fast access to the most data to fast access to the most relevant data, AI looks to play an increasingly important and central role. These two industry innovators, both past Models of Excellence, are blazing the trail for the rest of us, and they are well worth watching to see how their integration of AI into their businesses evolves over time.

For reference:

Spiceworks Model of Excellence profile
Morningstar Model of Excellence profile



Where the Value Is in Visual Data

The New York Times recently reported on the results of a fascinating project conducted at Stanford University. Using over 50 million images drawn from Google Street View, along with ZIP code data, the researchers were able to associate automobile ownership preferences with voting patterns. For example, the researchers found that the type of vehicles most strongly associated with Republican voting districts are extended-cab pickup trucks.

While this particular finding may not surprise you, the underlying work represents a programmatic tour de force, because artificial intelligence software was used to identify and classify the vehicles found in these 50 million images. The researchers used automotive experts to identify specific makes and models of cars from the images, giving the software a basis for training itself to find and identify vehicles all by itself, regardless of the angle of the photo, shadows and a host of other factors that make this anything but an easy task.

This project is believed to represent the first time that images have been used on a large scale to develop data. And while this image identification is a technically impressive example of both artificial intelligence and Big Data, most of the really useful insights come from associating the findings with other datasets, what I like to refer to as Little Data.

Think about it. The artificial intelligence software is given as input an image, and the ZIP code associated with that image. The software identifies an automobile make and model from the image, and creates an output record with two elements: the ZIP code and a normalized make and model description of the automobile. With this, you can explore auto ownership patterns by geography. But with just a few more steps, you can go a lot further.

You can use “little data” government and private datasets to link ZIP code to voting districts and thus voting patterns. With this information, you can determine that people living in Republican districts prefer extended-cab pickup trucks.

You can also use the ZIP code in the record to link to “little data” Census demographic data summarized at ZIP level. With this, you can correlate car ownership patterns to such things as income, race, education and ethnicity. Indeed, the study found it could predict demographics and voting patterns based on auto ownership.

And you can go further. You can link your normalized automobile make and model data to “little data” datasets of automobile technical specifications, which is how the study determined, for example, that based on miles per gallon, Burlington, Vermont is the greenest city in the United States.
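Mechanically, all of these linking steps are simple joins on shared keys: ZIP code on one side, normalized make and model on the other. Here is a minimal sketch in Python; every dataset value below is an invented illustration, not the study's actual data:

```python
# Each AI-generated "Big Data" record: (zip_code, normalized make/model).
big_data = [
    ("05401", "Toyota Prius"),
    ("79901", "Ford F-150 SuperCab"),
]

# "Little data" lookup tables keyed on the same fields (hypothetical values).
zip_to_district = {"05401": "Democratic", "79901": "Republican"}
model_to_mpg = {"Toyota Prius": 52, "Ford F-150 SuperCab": 20}

# Joining on ZIP code and on make/model enriches each record,
# enabling the voting-pattern and fuel-economy analyses described above.
enriched = [
    {
        "zip": z,
        "vehicle": m,
        "district": zip_to_district.get(z),
        "mpg": model_to_mpg.get(m),
    }
    for z, m in big_data
]

for row in enriched:
    print(row)
```

The point of the sketch is how little work the linking itself takes once the hard part, normalizing the make and model from an image, is done.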

Using artificial intelligence on a Big Data image database to build a normalized text database is impressive. But all the real insights in this study could only be developed by linking Big Data to Little Data to allow for granular analysis.

While Big Data and artificial intelligence are getting all the breathless coverage, we should never forget that Little Data is what’s providing the real value behind the scenes.  

Do You Rate?

An article in the New York Times today discusses the growing proliferation of college rankings as focus shifts to trying to evaluate colleges based on their economic value.

Traditionally, rankings of colleges have tended to focus on their selectivity/exclusivity, but now the focus has shifted to what are politely called “outcomes,” in particular, how many graduates of a particular college get jobs in their chosen fields, and how well they are paid. Interestingly, many of the existing college rankings, such as the well-known one produced by U.S. News, have been slow to adapt to this new area of interest, creating opportunities for new entrants. For example, PayScale (an InfoCommerce Model of Excellence winner) has produced earnings-driven college rankings since 2008. Much more recently, both the Economist and the Wall Street Journal have entered the fray with outcomes-driven college rankings. And let’s not forget still another college ranking system, this one from the U.S. Department of Education.

At first blush, the tendency is to say, “enough is enough.” Indeed, one professor quoted in the Times article somewhat humorously noted that there are so many college rankings that, “We’ll soon be ranking the rankings.”

However, there is almost always room for another useful ranking. The key is utility. Every ranking system is inherently an alchemic blend of input data and weightings. What data are used and how they are evaluated depend on what the ratings service thinks is important. For some, it is exclusivity. For others it is value. There are even the well-known (though somewhat tongue in cheek) rankings of top college party schools.
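That “alchemic blend” is, mechanically, just a weighted sum over normalized inputs, and changing the weights changes the ranking. A toy sketch makes the point; the schools, metrics, and weights below are all invented for illustration:

```python
# Hypothetical normalized metrics (0-1 scale) for three colleges.
colleges = {
    "College A": {"selectivity": 0.9, "earnings": 0.6},
    "College B": {"selectivity": 0.5, "earnings": 0.9},
    "College C": {"selectivity": 0.7, "earnings": 0.7},
}

def rank(weights):
    """Order colleges by the weighted sum of their metrics."""
    score = lambda m: sum(weights[k] * m[k] for k in weights)
    return sorted(colleges, key=lambda c: score(colleges[c]), reverse=True)

# The same three schools, ranked two different ways: a selectivity-heavy
# blend versus an outcomes-heavy blend.
print(rank({"selectivity": 0.8, "earnings": 0.2}))
print(rank({"selectivity": 0.2, "earnings": 0.8}))
```

Identical data, different weights, different winner, which is exactly why multiple rating systems can coexist in one market.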

And since concepts like “quality” and “value” are in the eye of the beholder with results often a function of available data, two rating systems can produce wildly varying results. That’s why when multiple rating systems exist, most experts suggest considering several of them to get the most rounded picture and most informative result.

It’s this lack of a single right way to create a perfect ranking that means that in almost every market, multiple competing rating systems can exist and thrive. Having a strong brand that can credential your results always helps, but in many cases, you can be competitive just with a strong and transparent methodology. It helps too when your rankings aren’t too far out of whack with general expectations. Totally unintuitive ranking results are great for a few days of publicity and buzz, but longer term they struggle with credibility issues.

A take-away for publishers is that just because you weren’t first to market with the rankings for your industry, there may still be a solid opportunity for you, if you have better data, a better methodology and solid credibility as a neutral information provider. 

Credit Scores: Not Just for Credit Anymore

A credit score, like it or not, is something that exists for all of us. Pioneered by a company called Fair Isaac (now just known as FICO), the credit score provided powerful advantages to credit granters in two key ways. First, using massive samples of consumer payment data, FICO analysts were able to tease out what characteristics were predictive of an individual’s willingness to re-pay their debts. With this knowledge, the company built sophisticated algorithms to automatically assess and score consumers. This approach is obviously more efficient than manual credit reviews by humans, but it offers consistency and dependability as well. Second, FICO reduces your credit history to a single number in a fixed range. The higher the number, the better your credit. This innovation made it possible for banks and others to write software to offer instant credit decisions, online credit approvals and more. Moreover, a consistent national scoring system made it easy for banks to both manage and benchmark their credit portfolios, as well as watch for early signs of credit erosion.
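FICO's actual model is proprietary, but the core idea, reducing many weighted characteristics to a single number in a fixed range, can be sketched in a few lines. The factors, weights, and range mapping below are all illustrative assumptions, not FICO's method:

```python
def credit_score(factors, weights, lo=300, hi=850):
    """Map weighted factor values (each 0-1) onto a fixed score range."""
    raw = sum(weights[k] * factors[k] for k in weights)
    total = sum(weights.values())
    return round(lo + (hi - lo) * raw / total)

# Hypothetical applicant: strong payment history, moderate utilization.
factors = {"payment_history": 0.95, "utilization": 0.40, "history_length": 0.70}
weights = {"payment_history": 0.5, "utilization": 0.3, "history_length": 0.2}

print(credit_score(factors, weights))
```

The fixed range is what makes the score machine-friendly: any downstream system can compare, threshold, or benchmark the single number without knowing anything about the inputs behind it.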

There’s little doubt that credit scoring was a brilliant innovation, but is it so specialized it can’t be replicated elsewhere? Well, it appears that creative data types are seeing scoring opportunities everywhere these days.

Consider just one example: computer network security scores. There are several companies (and FICO just acquired one of them) that use a variety of publicly available inputs to score the computer networks of companies to assess their vulnerability to hackers. Is this even possible to do? A lot of smart people in the field say it is, and pretty much everyone agrees the need is so great that even if these scores aren’t perfect, they’re better than nothing.

You may also be asking whether or not there is a business opportunity here and indeed there is. Companies buy their own scores to assess how they are doing and to benchmark themselves against their peers. Insurance companies writing policies to cover data hacks and other cybercrimes are desperate for these objective assessments. And increasingly, companies are asking potential vendors to provide them with their scores to make sure all their vendors are taking cybersecurity seriously.

While scoring started with credit, it certainly doesn’t end there. Are there scoring opportunities in your own market? Put on your thinking cap and get creative!

The 50% Solution

A saying attributed to the famous Philadelphia retailer John Wanamaker is that, “Half the money I spend on advertising is wasted; the trouble is I don't know which half.” Apparently, that saying can be updated for the Internet age to read, “Half the traffic to my website is non-human; the trouble is I don't know which half.”

In fact, the percentage is worse than that. According to a study by online researcher Imperva, a whopping 61.5% of traffic on the web is non-human. What do we mean by non-human? Well, it’s a category that includes search engines, software that’s scraping your website, hackers, spammers and others who are up to no good.

And yes, it gets worse. The lower the traffic to your website, the greater the percentage that is likely to be non-human. Indeed, if your site gets 1,000 or fewer visits per day, the study suggests that as much as 80% of your traffic may be non-human.

Sure, a lot of this non-human traffic is search engines (and you’d be amazed how many there still are out there), and that’s probably a good thing. After all, we want exposure. But the rest of this traffic is more dubious. About 5% of your overall site traffic is likely to be scrapers: people using software to grab all the content on your site, for purposes benign or evil. Sure, they can’t get to your password protected content, but if you publish any amount of free data on your site in structured form, chances are that others now have that data in their databases.

Obviously, if you sell online advertising, these statistics represent an inconvenient truth. The only saving grace is that your competitors are in the same boat. But if you are a subscription site, does any of this even matter?

I think it does. Because all this non-human activity distorts all of our web analytics in addition to our overall visitor counts. Half the numbers we see are not real. These non-human visitors could lead you to believe certain pages are more popular on your site than they really are; this could cause you to use bad insights to fashion your marketing strategy. And if you are using paid search to generate traffic, you could be getting similarly bad marketing data, and paying for the privilege as well.

Most importantly, this non-human traffic distorts reality. If you’re beating yourself up because of low response, lead generation or order rates, especially given the number of uniques and page views you appear to be getting, start by dividing by two. Do your numbers suddenly look a lot better? Bots and scrapers and search engines don’t request demos, don’t download white papers and certainly don’t buy merchandise. Keep that in mind next time you’re looking at your site analytics reports or puzzling why some pages on your site get so much more attention than others. Remember, not all data are good data.
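The back-of-the-envelope adjustment above is easy to make explicit. A quick sketch that discounts reported traffic by an assumed non-human share; the percentages come from the Imperva figures cited earlier, but your site's real bot share will vary:

```python
def human_traffic(reported_visits, bot_share):
    """Estimate human visits by discounting an assumed bot share."""
    return round(reported_visits * (1 - bot_share))

# Imperva's overall figure: 61.5% of web traffic is non-human.
print(human_traffic(10_000, 0.615))  # 3850 likely human visits

# Small sites (1,000 or fewer visits/day): as much as 80% non-human.
print(human_traffic(1_000, 0.80))  # 200 likely human visits

# A conversion rate looks very different measured against the human base.
orders = 40
print(f"{orders / human_traffic(10_000, 0.615):.1%}")
```

Measured against 10,000 reported visits, 40 orders is a 0.4% conversion rate; measured against the estimated human base, it is closer to 1%, which may change how hard you beat yourself up.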