Good Data + Good Analytics = Good Business

Mere weeks ago, I made my predictions of what this decade would bring for the data industry. I said that while the decade we just left behind was largely about collecting and organizing data, the decade in front of us would be about putting these massive datasets to use. Machine learning and artificial intelligence are poised to make data even more powerful and thus valuable … provided of course the underlying data are properly structured and standardized.

Leave it to 2013 Infocommerce Model of Excellence winner Segmint to immediately show us what these predictions mean in practice through its recent acquisition of the Product and Service Taxonomy division of WAND Inc. WAND, by the way, is a 2004 Infocommerce Model of Excellence winner, making us especially proud to report this combination of capabilities.

Segmint is tackling a huge opportunity by helping banks better understand their customers for marketing and other purposes. Banks capture tremendous amounts of transactional activity, much of it in real time. The banking industry has also invested billions of dollars in building data warehouses to store this information. So far, so good. But if you want to derive insights from all these data, you have to be able to confidently roll them up to get summary data. And that’s where banks came up short. You can’t assess customer spending on home furnishings unless you can identify credit card merchants who sell home furnishings. That’s where Segmint and WAND come in. How many ways can people abbreviate and misspell the name “Home Depot”? Multiply that by billions of transactions and millions of companies, and you start to get the idea of both the problem and the opportunity.
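To make the problem concrete, here is a minimal sketch of the kind of merchant-name normalization involved. The raw transaction strings, the tiny taxonomy and the matching threshold below are all hypothetical; a production system like WAND’s is far more sophisticated.

```python
import difflib
import re

# Hypothetical raw merchant descriptions as they might appear on card transactions
raw_merchants = ["HOME DEPOT #1234", "HOMEDEPOT.COM", "THE HOME DEPOT 0987", "HOME DEPOT ATLANTA GA"]

# A tiny stand-in for a standardized merchant taxonomy (name -> spending category)
taxonomy = {
    "home depot": "Home Furnishings & Improvement",
    "wayfair": "Home Furnishings & Improvement",
    "kroger": "Groceries",
}

def normalize(raw):
    """Lowercase and strip store numbers, web suffixes and trailing state codes."""
    cleaned = re.sub(r"[#\d]+|\.com|\b[a-z]{2}$", " ", raw.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

def classify(raw):
    """Fuzzy-match a raw transaction string to the closest taxonomy entry, if any."""
    match = difflib.get_close_matches(normalize(raw), list(taxonomy), n=1, cutoff=0.6)
    return (match[0], taxonomy[match[0]]) if match else None

for raw in raw_merchants:
    print(raw, "->", classify(raw))
```

Every string above resolves to the same standardized merchant and category, which is exactly the roll-up capability the banks were missing.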

When WAND is done cleaning and standardizing the data, Segmint goes to work with its proprietary segmentation and predictive analytics tools. Segmint helps bank marketers understand the lifestyle characteristics of their customers and target them with appropriate messages, both to aid retention and to sell new products. These segments are continuously updated via real-time feeds from its bank customers (all fully anonymized). With that level of high-quality, real-time and granular data, Segmint can readily move from profiling customers to predicting their needs and interests.

Simply put: this is the future of the data business. It starts with the clean-up work nobody else wants to do (and it’s why data scientists spend more time cleaning data than analyzing it) and then uses advanced software to find actionable, profitable insights from the patterns in that data. This is the magic of the data business that will be realized in this new decade. And we couldn’t be prouder that two Infocommerce Model of Excellence winners are leading the way … together. Congrats to both! 

Email: Valuable However You Look at It

The Internet has evolved dramatically in the last 25 years. But one aspect of online interaction has remained largely untouched. I am talking about the humble email address.

Despite the growth of web-based phone and video and the dominance of social media, the importance of the email address has actually increased. Indeed, the most common way to log into these burgeoning communications channels is to use your email address as your username. After all these years, it’s easy to take it for granted. But from a data perspective, it’s worth taking a few minutes to explore some of its hidden value.

First, an email address is a unique identifier. Yes, many people have both a personal and business email address, but those email addresses are role-based, so they generally remain unique within a given role. Moreover, many data aggregators have been busy building databases of individuals and all their known email addresses, making it easy to resolve multiple email addresses back to a single individual. 

Unique identifiers of all kinds are extraordinarily important because they provide disambiguation. That means that you can confidently match datasets based on email address because no two people can have the same email at the same time.
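As a simple illustration, matching two datasets on a case-normalized email address takes only a few lines. The record layouts below are hypothetical; the point is that the address itself is the join key.

```python
# Hypothetical records from two different systems
crm_records = [
    {"email": "jsmith@example.com", "account_tier": "enterprise"},
    {"email": "arty1745@gmail.com", "account_tier": "free"},
]
event_log = [
    {"email": "JSmith@Example.com", "webinar": "Data Quality 101"},
    {"email": "nobody@nowhere.net", "webinar": "Data Quality 101"},
]

# Normalize case before matching; email addresses are case-insensitive in practice.
crm_by_email = {rec["email"].lower(): rec for rec in crm_records}

for event in event_log:
    match = crm_by_email.get(event["email"].lower())
    if match:
        print(event["webinar"], "attendee is an existing", match["account_tier"], "account")
```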

But email addresses aren’t just unique identifiers; they are persistent unique identifiers. That means that people don’t change them often or on a whim. Further, unlike telephone numbers, email addresses tend not to be re-issued. Businesses work hard to avoid re-issuing email addresses, and personal email addresses are typically very cheap to keep and a big hassle to change, resulting in a lot of stability.

Let’s go a step further: email addresses are persistent, intelligent unique identifiers. At least for business use, an email address is not only tied to a particular company, but the company name is embedded in the address itself. And again, data aggregators have been hard at work mapping domain names to detailed information on the companies behind them. That’s why an increasing number of B2B companies actually prohibit people from signing up for such things as a newsletter using a personal email address. A personal email address (e.g. arty1745@gmail.com) tells them little; a business email address (e.g. jsmith@pfizer.com) readily yields a wealth of company demographics with which to both target promotions and build audience profiles. Indeed, even the structure of email addresses has been monetized. There are, for example, companies that will append inferred email addresses based on the naming convention used by a specific company (e.g. first initial and last name). It’s also interesting that the domain suffix can tell you the nature of the organization (e.g. “.org” or “.edu”) and the country where it operates (e.g. “.co.uk”).
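Here is a rough sketch of how much can be read off a single business email address. The firmographic lookup table, the domain hints and the naming convention are invented for illustration; real aggregators maintain these mappings at enormous scale.

```python
# Hypothetical lookup tables standing in for an aggregator's firmographic data
FIRMOGRAPHICS = {
    "pfizer.com": {"company": "Pfizer", "industry": "Pharmaceuticals", "employees": "10,000+"},
}
DOMAIN_HINTS = {".edu": "educational institution", ".org": "nonprofit/organization", ".co.uk": "United Kingdom"}

def enrich(email):
    """Pull the domain out of an email address and look up what we know about it."""
    domain = email.lower().split("@", 1)[-1]
    profile = dict(FIRMOGRAPHICS.get(domain, {}))
    for suffix, hint in DOMAIN_HINTS.items():
        if domain.endswith(suffix):
            profile["domain_hint"] = hint
    return profile

def infer_address(first, last, domain):
    """Assume a 'first initial + last name' convention observed for this company."""
    return f"{first[0].lower()}{last.lower()}@{domain}"

print(enrich("jsmith@pfizer.com"))
print(infer_address("Jane", "Smith", "pfizer.com"))
```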

The unique format of the email address also adds to its value. While the length of an email address will vary, the overall pattern with the distinctive @ sign makes it easy to harvest and extract. It also makes it possible to link text documents (perhaps academic papers that include the email address of the author) to data records.
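A quick sketch of that harvesting step: a simple regular expression pulls addresses out of free text so the surrounding document can be tied back to a person or organization record. The pattern below is deliberately simplified and the sample text is invented.

```python
import re

# Hypothetical free text, perhaps the front matter of an academic paper
text = """Corresponding author: J. Smith (jsmith@pfizer.com),
Department of Chemistry. Press inquiries: media@example.org."""

# A deliberately simple pattern; production-grade address parsing is messier.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

for address in EMAIL_PATTERN.findall(text):
    print(address)
```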

Sure, email addresses have real value because they can put marketing messages under the noses of prospects, but to a data publisher, email addresses are worth a whole lot more.

Use Your Computer Vision

Those familiar with the powerhouse real estate listing site Zillow will likely recall that it burst on the scene in 2006 with an irresistible new offering: a free online estimate of the value of every house in the United States. Zillow calls them Zestimates. The site crashed continuously from too much traffic when it first launched, and Zillow now gets a stunning 195 million unique visitors monthly, all with virtually no advertising. Credit the Zestimates for this.

As you would expect, Zestimates are derived algorithmically, using a combination of public records and recent sales data. The algorithm selects recent sales of comparable nearby houses to compute an estimated value.
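To see the general idea (and only the general idea: Zillow has not published its model, so this is a toy illustration), a comparable-sales estimate can be as simple as scoring recent sales by similarity to the subject house and averaging the best matches’ prices per square foot.

```python
# Hypothetical recent sales near the subject property
recent_sales = [
    {"sqft": 1800, "beds": 3, "miles_away": 0.4, "price": 415_000},
    {"sqft": 2100, "beds": 4, "miles_away": 0.9, "price": 470_000},
    {"sqft": 1750, "beds": 3, "miles_away": 1.6, "price": 398_000},
    {"sqft": 2600, "beds": 5, "miles_away": 0.7, "price": 585_000},
]

def estimate_value(subject, comps, k=3):
    """Score comps by rough similarity to the subject house, then average the
    top-k prices per square foot and apply that rate to the subject's size."""
    def dissimilarity(c):
        return (abs(c["sqft"] - subject["sqft"]) / subject["sqft"]
                + abs(c["beds"] - subject["beds"])
                + c["miles_away"])
    best = sorted(comps, key=dissimilarity)[:k]
    avg_price_per_sqft = sum(c["price"] / c["sqft"] for c in best) / len(best)
    return avg_price_per_sqft * subject["sqft"]

print(round(estimate_value({"sqft": 1900, "beds": 3}, recent_sales)))
```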

As you would also expect, professional appraisers hate Zestimates. They believe they produce more accurate valuations because they hand-select the comparable nearby homes. However, with the goal of consistent appraisals, the hand-selection process that appraisers use is so prescribed and formulaic that it operates much like an algorithm. At this level, you could argue that appraisers have little advantage over the computed Zestimate.

However, one area in which appraisers have a distinct advantage is that they are able to assess the condition and interiors of the properties they are appraising. They visually inspect the home and can use interior photos of comparable homes that have recently sold to refine their estimates.

Not to be outdone, Zillow is employing artificial intelligence to create what it calls “computer vision.” Using interior and exterior photos of millions of recently sold homes, Zillow now assesses such things as curb appeal, construction quality and even landscaping; quantifies what it finds; and factors that information into its valuation algorithm. When it has interior photos of a house, it scans for such things as granite countertops, upgraded bathrooms and even how much natural light the house enjoys, and incorporates this information into its algorithm as well.
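Once photos have been scored, folding the results into a valuation can be as simple as applying adjustments to a baseline estimate. The features and weights below are made up purely to illustrate the idea; Zillow’s actual model is not public.

```python
# Hypothetical adjustments for features detected in listing photos
FEATURE_ADJUSTMENTS = {
    "granite_countertops": 0.02,    # +2% over the baseline comparable-sales estimate
    "updated_bathrooms": 0.03,
    "strong_natural_light": 0.015,
    "poor_curb_appeal": -0.04,
}

def adjust_for_photos(baseline_estimate, detected_features):
    """Nudge a baseline valuation up or down based on photo-derived features."""
    multiplier = 1.0 + sum(FEATURE_ADJUSTMENTS.get(f, 0.0) for f in detected_features)
    return baseline_estimate * multiplier

print(round(adjust_for_photos(430_000, ["granite_countertops", "strong_natural_light"])))
```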

With this advance, it looks very much like appraisers’ remaining competitive advantage is owning “the last mile,” because they are the feet on the street that actually visit the house being appraised. But you can see where things are heading: as companies like Zillow refine their technology, the day may well come when an appraisal is performed by the homeowner uploading interior pictures of her house, and perhaps confirming public record data, such as the number of rooms in the house.

There are many market verticals where automated inspection and interpretation of visual data can be used. While the technology is in its infancy, its power is undeniable, so it’s not too early to think about possible ways it might enhance your data products.

Where the Value Is in Visual Data

The New York Times recently reported on the results of a fascinating project conducted at Stanford University. Using over 50 million images drawn from Google Street View, along with ZIP code data, the researchers were able to associate automobile ownership preferences with voting patterns. For example, the researchers found that the type of vehicles most strongly associated with Republican voting districts are extended-cab pickup trucks.

While this particular finding may not surprise you, the underlying work represents a programmatic tour de force, because artificial intelligence software was used to identify and classify the vehicles found in these 50 million images. The researchers used automotive experts to identify specific makes and models of cars from the images, giving the software a basis for training itself to find and identify vehicles on its own, regardless of the angle of the photo, shadows and a host of other factors that make this anything but an easy task.

This project is believed to represent the first time that images have been used on a large scale to develop data. And while this image identification is a technically impressive example of both artificial intelligence and Big Data, most of the really useful insights come from associating the findings with other datasets, what I like to refer to as Little Data.

Think about it. The artificial intelligence software is given as input an image, and the ZIP code associated with that image. The software identifies an automobile make and model from the image, and creates an output record with two elements: the ZIP code and a normalized make and model description of the automobile. With this, you can explore auto ownership patterns by geography. But with just a few more steps, you can go a lot further.

You can use “little data” government and private datasets to link ZIP code to voting districts and thus voting patterns. With this information, you can determine that people living in Republican districts prefer extended-cab pickup trucks.

You can also use the ZIP code in the record to link to “little data” Census demographic data summarized at ZIP level. With this, you can correlate car ownership patterns to such things as income, race, education and ethnicity. Indeed, the study found it could predict demographics and voting patterns based on auto ownership.

And you can go further. You can link your normalized automobile make and model data to “little data” datasets of automobile technical specifications, which is how the study determined, for example, that based on miles per gallon, Burlington, Vermont is the greenest city in the United States.
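Sketched in code, the whole Big Data to Little Data chain is just a series of key-based joins. Every table below is a tiny hypothetical stand-in for the real datasets the researchers used, with made-up ZIP codes and values.

```python
# Output of the image-recognition step: one record per identified vehicle
vehicle_observations = [
    {"zip": "00001", "make_model": "Ford F-150 SuperCab"},
    {"zip": "00002", "make_model": "Toyota Prius"},
]

# "Little data" lookup tables keyed on ZIP code or normalized make/model
voting_by_zip = {"00001": "Republican", "00002": "Democratic"}
census_by_zip = {"00001": {"median_income": 48_000}, "00002": {"median_income": 62_000}}
specs_by_model = {"Ford F-150 SuperCab": {"mpg": 20}, "Toyota Prius": {"mpg": 52}}

for obs in vehicle_observations:
    enriched = {
        **obs,
        "district_lean": voting_by_zip.get(obs["zip"]),
        **census_by_zip.get(obs["zip"], {}),
        **specs_by_model.get(obs["make_model"], {}),
    }
    print(enriched)
```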

Using artificial intelligence on a Big Data image database to build a normalized text database is impressive. But all the real insights in this study could only be developed by linking Big Data to Little Data to allow for granular analysis.

While Big Data and artificial intelligence are getting all the breathless coverage, we should never forget that Little Data is what’s providing the real value behind the scenes.  

The 50% Solution

A saying attributed to the famous Philadelphia retailer John Wanamaker is that, “Half the money I spend on advertising is wasted; the trouble is I don't know which half.” Apparently, that saying can be updated for the Internet age to read, “Half the traffic to my website is non-human; the trouble is I don't know which half.”

In fact, the percentage is worse than that. According to a study by security firm Imperva, a whopping 61.5% of traffic on the web is non-human. What do we mean by non-human? Well, it’s a category that includes search engines, software that’s scraping your website, hackers, spammers and others who are up to no good.

And yes, it gets worse. The lower the traffic to your website, the greater the percentage that is likely to be non-human. Indeed, if your site gets 1,000 or fewer visits per day, the study suggests that as much as 80% of your traffic may be non-human.

Sure, a lot of this non-human traffic is search engines (and you’d be amazed how many there still are out there), and that’s probably a good thing. After all, we want exposure. But the rest of this traffic is more dubious. About 5% of your overall site traffic is likely to be scrapers: people using software to grab all the content on your site, for purposes benign or evil. Sure, they can’t get to your password-protected content, but if you publish any amount of free data on your site in structured form, chances are that others now have that data in their databases.

Obviously, if you sell online advertising, these statistics represent an inconvenient truth. The only saving grace is that your competitors are in the same boat. But if you are a subscription site, does any of this even matter?

I think it does, because all this non-human activity distorts our web analytics in addition to our overall visitor counts. Half the numbers we see are not real. These non-human visitors could lead you to believe certain pages on your site are more popular than they really are; this could cause you to build your marketing strategy on bad insights. And if you are using paid search to generate traffic, you could be getting similarly bad marketing data, and paying for the privilege as well.
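If you want a feel for how much the distortion matters, a back-of-the-envelope correction is easy. The traffic and conversion numbers below are hypothetical; the 61.5% share comes from the study cited above.

```python
# Hypothetical figures from an analytics report
reported_visits = 40_000      # what the analytics dashboard shows
demo_requests = 120           # actions only humans take

NON_HUMAN_SHARE = 0.615       # the study's overall estimate of non-human traffic
estimated_human_visits = reported_visits * (1 - NON_HUMAN_SHARE)

print(f"Apparent conversion rate: {demo_requests / reported_visits:.2%}")
print(f"Human-only conversion rate: {demo_requests / estimated_human_visits:.2%}")
```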

Most importantly, this non-human traffic distorts reality. If you’re beating yourself up because of low response, lead generation or order rates, especially given the number of uniques and page views you appear to be getting, start by dividing by two. Do your numbers suddenly look a lot better? Bots and scrapers and search engines don’t request demos, don’t download white papers and certainly don’t buy merchandise. Keep that in mind next time you’re looking at your site analytics reports or puzzling over why some pages on your site get so much more attention than others. Remember, not all data are good data.