
Fishing in the Data Lake

You have likely bumped into the hot new IT buzzword “data lake.” A data lake is simply a collection of data files, structured and unstructured, located in one place. This is, in the eyes of some, an advance over the “data warehouse,” where datasets are curated and highly organized. Fun fact: a bad data lake (one that holds too much useless data) is called a “data swamp.”

What’s the purpose of a data lake? Primarily, it’s to provide raw input to artificial intelligence and machine learning software. This new class of software is both powerful and complex, with the result that it has been endowed with near-mystical qualities. As one senior executive of a successful manufacturing company told me, his company was aggressively adopting machine learning because “you just feed it the data and it gives you answers.” Yes, we now have software so powerful that it not only provides answers, but apparently formulates the questions as well.

The reality is much more mundane. This will not surprise any data publisher, but the more structure you provide to machine learning and artificial intelligence software, the better the results. That’s because while you can “feed” a bunch of disparate datasets into machine learning software, if there are no ready linkages between the datasets, your results will be, shall we say, suboptimal. And if the constituent data elements aren’t clean and normalized, you’ll get to see the axiom “garbage in, garbage out” playing out in real life.
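
To make this concrete, here’s a minimal sketch (in Python, with made-up company records) of how two perfectly good datasets fail to link until someone does the unglamorous normalization work:

```python
import pandas as pd

# Two "data lake" files describing the same companies, with no shared key discipline
sales = pd.DataFrame({"company": ["Home Depot Inc.", "ACME Corp"],
                      "revenue": [151000, 4200]})
support = pd.DataFrame({"company": ["HOME DEPOT", "Acme Corporation"],
                        "tickets": [37, 5]})

# Naive join on the raw names: zero matches
print(sales.merge(support, on="company"))  # empty DataFrame

# Normalize first: lowercase, strip punctuation and legal suffixes
def normalize(name: str) -> str:
    name = name.lower().replace(".", "").replace(",", "")
    for suffix in (" inc", " corp", " corporation"):
        name = name.removesuffix(suffix)
    return name.strip()

for df in (sales, support):
    df["key"] = df["company"].map(normalize)

# Now the datasets link cleanly
print(sales.merge(support, on="key"))
```

Feed the first version to a machine learning pipeline and it sees four unrelated companies; feed it the second and it sees two.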

It’s a sad reality that highly trained and highly paid data scientists still spend the majority of their time acting as what they call “data wranglers” and “data janitors,” trying to smooth out raw data enough that machine learning will deliver useful and dependable insights. In a timely response to this, software vendor C3.ai has just launched a COVID-19 data lake. Its claimed value is that rather than offering just a collection of datasets in one place, C3.ai has taken the time to organize, unify and link the datasets.

The lesson here is that as data producers, we should never underestimate the value we create when we organize, normalize and clean data. Indeed, clean and organized data will be the foundation for the next wave of advances in both computing and human knowledge. Better data: better results.

Good Data + Good Analytics = Good Business

Mere weeks ago, I made my predictions of what this decade would bring for the data industry. I said that while the decade we just left behind was largely about collecting and organizing data, the decade in front of us would be about putting these massive datasets to use. Machine learning and artificial intelligence are poised to make data even more powerful and thus valuable … provided of course the underlying data are properly structured and standardized.

Leave it to 2013 Infocommerce Model of Excellence winner Segmint to immediately show us what these predictions mean in practice through its recent acquisition of the Product and Service Taxonomy division of WAND Inc. WAND, by the way, is a 2004 Infocommerce Model of Excellence winner, making us especially proud to report this combination of capabilities.

Segmint is tackling a huge opportunity by helping banks better understand their customers for marketing and other purposes. Banks capture tremendous amounts of transactional activity, much of it in real-time. The banking industry has also invested billions of dollars in building data warehouses to store this information. So far, so good. But if you want to derive insights from all these data, you have to be able to confidently roll them up into summary data. And that’s where banks came up short. You can’t assess customer spending on home furnishings unless you can identify credit card merchants who sell home furnishings. That’s where Segmint and WAND come in. How many ways can people abbreviate and misspell the name “Home Depot”? Multiply that by billions of transactions and millions of companies, and you start to get the idea of both the problem and the opportunity.
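
As a toy illustration of the matching problem (WAND’s taxonomy work is of course far more sophisticated than this), here’s a sketch using Python’s standard library to map messy transaction descriptors to canonical merchants:

```python
from difflib import get_close_matches

# A tiny, hypothetical merchant taxonomy: canonical name -> category
MERCHANTS = {
    "home depot": "Home Furnishings & Improvement",
    "bed bath and beyond": "Home Furnishings",
    "exxon mobil": "Fuel",
}

def classify(descriptor: str) -> str:
    """Map a raw card-transaction descriptor to a merchant category."""
    cleaned = descriptor.lower().replace("#", " ").replace("*", " ").strip()
    match = get_close_matches(cleaned, MERCHANTS, n=1, cutoff=0.6)
    return MERCHANTS[match[0]] if match else "Unclassified"

for raw in ["HOME DEPOT #1234", "HOMEDEPOT.COM", "Hme Depot Atlanta"]:
    print(raw, "->", classify(raw))  # all resolve to the same category
```

Crude fuzzy matching like this breaks down fast at billions of transactions; the point is only to show why a curated taxonomy and standardization layer is worth paying for.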

When WAND is done cleaning and standardizing the data, Segmint goes to work with its proprietary segmentation and predictive analytics tools. Segmint helps bank marketers understand the lifestyle characteristics of their customers and target them with appropriate messages, both to aid retention and to sell new products. These segments are continuously updated via real-time feeds from its bank customers (all fully anonymized). With that level of high-quality, real-time and granular data, Segmint can readily move from profiling customers to predicting their needs and interests.

Simply put: this is the future of the data business. It starts with the clean-up work nobody else wants to do (and it’s why data scientists spend more time cleaning data than analyzing it) and then uses advanced software to find actionable, profitable insights from the patterns in that data. This is the magic of the data business that will be realized in this new decade. And we couldn’t be prouder that two Infocommerce Model of Excellence winners are leading the way … together. Congrats to both! 

What Facebook Knows and Doesn’t Know

Privacy concerns have been in the forefront of the news lately, and no article discussing privacy is complete without mentioning Facebook. That’s because Facebook is considered to be the all-knowing machine that’s tirelessly collecting data about us and turning it into insights that can be used to better market things to us with extreme precision. Certainly Facebook isn’t the only online juggernaut with this strategy and sophisticated data collection capabilities, but in many ways it’s the poster child for our collective concerns and anxieties.

I joined Facebook in 2007. At the time, it was becoming the next big thing, and I wanted to see what it was all about. After some initial excitement, I noticed my usage dropping as the years went by. It dropped massively in 2019, when I somehow changed my default language settings to German and didn’t feel any real urgency to figure out how to undo it. All this is to say that I am certainly not a typical Facebook user.

While not a high-intensity Facebook user, I am a high-intensity data nerd, so when I read an article that explained how to peek under the hood to see in detail what Facebook knows about you, and what it has learned about you from third parties, I of course could not resist. If your interest is equally high, start your journey here: https://www.facebook.com/off_facebook_activity/

I clicked all the options so that I could see everything Facebook knew about me. While not a heavy user, I was a long-term user, and I imagined Facebook had likely learned a lot about me in 13 years. In due course, Facebook presented me with a downloadable Zip file that contained a number of folders.

The folder “Ads and Businesses” turned out to be the money folder. This is where I learned my personal interests as divined by Facebook – all individual categories that can be selected by marketers. Here are some highlights of my interests:

  • Cast iron (who doesn’t love cast iron?)

  • Scratching (what can I say?)

  • Tesla (Facebook helpfully clarified that my interest was not in the car, but rather the band … the band?)

  • Oysters (I don’t eat them)

  • Skiing (I don’t ski)

  • Star Trek (absolutely true – when I was about 14 years old)

There were about 50 interest categories in all; not all wrong, but overall far from an accurate picture. What I infer by looking at these interest categories is that they are keywords crudely extracted from various ads I had clicked on over the years. I say “crudely” because these interest tags don’t represent an organized taxonomy; there is no hierarchy, and there is only a lackluster attempt to disambiguate. For example, one of my interests is “online.” Without any context, this is useless information. And if Facebook assesses the recency of my interests, or the intensity of my interest (how many times, for example, did I look at things relating to cast iron?), it is not sharing these data with its users.

If Facebook underwhelmed me with its insights into my interests, the listing of “Advertisers who uploaded a contact list with my information” totally confused me. I was presented with a list of literally hundreds of businesses that ostensibly had my contact information and had tried to match it to my Facebook data. What I saw on this list was probably close to a hundred local car dealerships from all over the country, followed by almost as many local real estate agencies. I feel certain, for example, that I have never visited the website of, much less interacted with, International Honda of Sheboygan, WI. But this car dealership – reportedly – has my contact information and is matching it to Facebook.

There are a few possible explanations for this. The one I find most likely is that in the case of automobiles, some unscrupulous middlemen are selling the same file of “leads” to unsuspecting car dealers nationwide. It could also be the work of inexperienced or simply bad marketers or marketing agencies. Some free advice to Toledo Edison, Maybelline, The Property Girls of Michigan, Bank Midwest and Choctaw Casinos and Resorts – take a look at your list sources and maybe even your marketing strategies, because something seems broken.

Looking at your own Facebook data gives you a rare opportunity to see and evaluate what’s going on behind the curtain. To me, Facebook’s secret sauce really doesn’t appear to be its technology. Grabbing keywords from ads I have clicked is utterly banal. Offering marketers hundreds of thousands of interest tags does in fact allow for extreme microtargeting, but in the sloppiest, laziest possible way. Capturing all my ad clicks is useful and valuable, but hardly cutting edge. What makes Facebook so valuable seems not to be the data it has collected, but the fact that it has collected data on a hitherto unknown scale. Knowing that I have an interest in flax (yes, this is really one of my reputed interests!), even if true, is pretty useless until you get enough scale to identify thousands of people interested in flax, at which point this obscure data point suddenly acquires monetary value.

What my Facebook data suggest is that while they may not be good enough to deliver the precision and accuracy many marketers have bought into, Facebook has created “good enough” data at extreme scale. And that is proving to be even better than good enough.

Email: Valuable However You Look at It

The Internet has evolved dramatically in the last 25 years. But one aspect of online interaction has remained largely untouched. I am talking about the humble email address.

Despite the growth of web-based phone and video and the dominance of social media, the importance of the email address has actually increased. Indeed, the most common way to log into these other burgeoning communications channels is to use your email address as your username. After all these years, it’s easy to take it for granted. But from a data perspective, it’s worth taking a few minutes to explore some of its hidden value.

First, an email address is a unique identifier. Yes, many people have both a personal and business email address, but those email addresses are role-based, so they generally remain unique within a given role. Moreover, many data aggregators have been busy building databases of individuals and all their known email addresses, making it easy to resolve multiple email addresses back to a single individual. 

Unique identifiers of all kinds are extraordinarily important because they provide disambiguation. That means you can confidently match datasets based on email address, because no two people can have the same email address at the same time.
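
A minimal sketch of what that confidence buys you (the records here are hypothetical): match two datasets on a lightly normalized email address and the join is unambiguous.

```python
# Hypothetical CRM records and event registrations, keyed by email address
crm = {
    "JSmith@Pfizer.com": {"name": "J. Smith", "title": "Director"},
}
registrations = [
    {"email": "jsmith@pfizer.com", "event": "Data Summit"},
]

def normalize_email(addr: str) -> str:
    # Local parts are case-sensitive in theory; in practice, trimming
    # whitespace and lowercasing is the standard match key.
    return addr.strip().lower()

crm_by_email = {normalize_email(k): v for k, v in crm.items()}

for reg in registrations:
    person = crm_by_email.get(normalize_email(reg["email"]))
    if person:  # one address, one person: no fuzzy matching required
        print(person["name"], "registered for", reg["event"])
```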

But email addresses aren’t just unique identifiers, they are persistent unique identifiers. That means people don’t change them often or on a whim. Further, unlike telephone numbers, email addresses tend not to be re-issued: businesses work hard to avoid re-issuing them, and personal email addresses are typically very cheap to keep and a big hassle to change, resulting in a lot of stability.

Let’s go a step further: email addresses are persistent, intelligent unique identifiers. At least for business use, email addresses are not only tied to a particular company, the company name is embedded in the email address. And again, data aggregators have been hard at work mapping domain names to detailed information on the companies behind them. That’s why an increasing number of B2B companies actually prohibit people from signing up for such things as a newsletter using a personal email address. A personal email address (e.g. arty1745@gmail.com) tells them little; a business email address (e.g. jsmith@pfizer.com) readily yields a wealth of company demographics with which to both target promotions and build audience profiles. Indeed, even the structure of email addresses has been monetized. There are, for example, companies that will append inferred email addresses based on the naming convention used by a specific company (e.g. first initial and last name). It’s also interesting that the top-level domain can tell you the nature of the organization (e.g. “.org” or “.edu”) and the country where it is operating (e.g. “.co.uk”).
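
Here’s a small sketch of both ideas; the domain-to-company table and the naming-convention table are made-up stand-ins for the reference files commercial aggregators actually sell:

```python
# Hypothetical aggregator reference data
COMPANY_BY_DOMAIN = {
    "pfizer.com": {"name": "Pfizer", "industry": "Pharmaceuticals"},
}
PATTERN_BY_DOMAIN = {
    "pfizer.com": "{first_initial}{last}",  # e.g. jsmith
}

def enrich(email: str) -> dict:
    """Look up company demographics from a business email's domain."""
    domain = email.split("@", 1)[1].lower()
    return COMPANY_BY_DOMAIN.get(domain, {})

def infer_email(first: str, last: str, domain: str) -> str | None:
    """Guess an address from a company's known naming convention."""
    pattern = PATTERN_BY_DOMAIN.get(domain)
    if pattern is None:
        return None
    local = pattern.format(first_initial=first[0], last=last).lower()
    return f"{local}@{domain}"

print(enrich("jsmith@pfizer.com"))               # {'name': 'Pfizer', ...}
print(infer_email("Jane", "Doe", "pfizer.com"))  # jdoe@pfizer.com
```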

The unique format of the email address also adds to its value. While the length of an email address will vary, the overall pattern with the distinctive @ sign makes it easy to harvest and extract. It also makes it possible to link text documents (perhaps academic papers that include the email address of the author) to data records.
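
For example, a single regular expression (admittedly imperfect; fully RFC-compliant address parsing is notoriously hard) makes a serviceable harvester:

```python
import re

# Pragmatic pattern: local part, @ sign, domain with at least one dot
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

paper = """Correspondence: jsmith@pfizer.com.
Second author: a.jones@stats.ox.ac.uk (Dept. of Statistics)."""

print(EMAIL_RE.findall(paper))
# ['jsmith@pfizer.com', 'a.jones@stats.ox.ac.uk']
```

Run over a corpus of academic papers, a pattern like this is all it takes to link author contact details back to structured data records.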

Sure, email addresses have real value because they can put marketing messages under the noses of prospects, but to a data publisher, email addresses are worth a whole lot more.

Use Your Computer Vision

Those familiar with the powerhouse real estate listing site Zillow will likely recall that it burst on the scene in 2006 with an irresistible new offering: a free online estimate of the value of every house in the United States. Zillow calls them Zestimates. The site crashed repeatedly from too much traffic when it first launched, and Zillow now gets a stunning 195 million unique visitors monthly, all with virtually no advertising. Credit the Zestimates for this.

As you would expect, Zestimates are derived algorithmically, using a combination of public domain and recent sales data. The algorithm selects recent sales of comparable nearby houses to compute an estimated value.
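
A drastically simplified sketch of the comps idea (Zillow’s actual model is proprietary and far more elaborate): rank recent sales by similarity, then scale their average price per square foot to the subject house.

```python
# Toy comps-based valuation; sale records are invented for illustration
recent_sales = [  # (square_feet, bedrooms, sale_price)
    (1800, 3, 410_000),
    (2100, 4, 475_000),
    (1750, 3, 398_000),
    (3200, 5, 720_000),
]

def estimate(sqft: int, beds: int, k: int = 3) -> float:
    # Smaller distance = more comparable
    def distance(comp):
        return abs(comp[0] - sqft) / 1000 + abs(comp[1] - beds)
    nearest = sorted(recent_sales, key=distance)[:k]
    # Average the comps' price per square foot, then scale to this house
    ppsf = sum(price / size for size, _, price in nearest) / k
    return ppsf * sqft

print(f"${estimate(1900, 3):,.0f}")  # about $430k on this toy data
```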

As you would also expect, professional appraisers hate Zestimates. They believe they produce better valuation estimates because they hand-select the comparable nearby homes. However, in the interest of consistent appraisals, the selection process appraisers use is so prescribed and formulaic that it operates much like an algorithm. At this level, you could argue that appraisers have little advantage over the computed Zestimate.

However, one area in which appraisers have a distinct advantage is that they are able to assess the condition and interiors of the properties they are appraising. They visually inspect the home and can use interior photos of comparable homes that have recently sold to refine their estimates.

Not to be outdone, Zillow is employing artificial intelligence to create what it calls “computer vision.” Using interior and exterior photos of millions of recently sold homes, Zillow now assesses such things as curb appeal, construction quality and even landscaping; quantifies what it finds; and factors that information into its valuation algorithm. When it has interior photos of a house, it scans for such things as granite countertops, upgraded bathrooms and even how much natural light the house enjoys, and incorporates this information into its algorithm as well.
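
Conceptually (this is a guess at the shape of the pipeline, not Zillow’s disclosed implementation), the vision models just produce more structured data for the valuation step to consume:

```python
# Hypothetical features a vision model might extract from listing photos
photo_features = {
    "granite_countertops": True,
    "upgraded_bathrooms": False,
    "natural_light_score": 0.8,  # 0..1, from window/brightness analysis
}

# Made-up adjustment weights; a real system would learn these from sales data
ADJUSTMENTS = {
    "granite_countertops": 0.02,   # +2% when present
    "upgraded_bathrooms": 0.03,
    "natural_light_score": 0.04,   # scaled by the score itself
}

def adjust(base_estimate: float, features: dict) -> float:
    """Nudge an algorithmic estimate using photo-derived features."""
    multiplier = 1.0
    for name, weight in ADJUSTMENTS.items():
        multiplier += weight * float(features.get(name, 0))  # bools become 0/1
    return base_estimate * multiplier

print(f"${adjust(430_000, photo_features):,.0f}")
```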

With this advance, it looks very much like appraisers’ competitive advantage is owning “the last mile,” because they are the feet on the street that actually visit the house being appraised. But you can see where things are heading: as companies like Zillow refine their technology, the day may well come when an appraisal is performed by the homeowner uploading interior pictures of her house, and perhaps confirming public record data, such as the number of rooms in the house.

There are many market verticals where automated inspection and interpretation of visual data can be used. While the technology is in its infancy, its power is undeniable, so it’s not too early to think about possible ways it might enhance your data products.