The Billion Prices Project

Last week, I discussed how the Internet of Things creates all sorts of potential opportunities to create highly valuable, highly granular data. The Billion Prices Project, which is based at MIT, provides another route to the same result. Summarized very simply, two MIT professors, Alberto Cavallo and Roberto Rigobon, collect data from hundreds of online retailers all over the world to build a massive database of product-level pricing data, updated daily. It’s an analytical goldmine that can be applied to solve a broad range of problems.

One obvious example is the measurement of inflation. Currently, the U.S. government develops its Consumer Price Index data the old-fashioned way: mail, phone and field surveys. Inherently, that process is slow. Contrast it with the Billion Prices Project, which can measure inflation on a daily basis, and do so for a large number of countries.
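
To make the contrast concrete, here is a minimal sketch (not the project's actual methodology, and with invented prices) of how daily product-level data can be turned into a simple inflation estimate: take the geometric mean of day-over-day price relatives across matched products.

```python
# A minimal sketch (not the Billion Prices Project's actual methodology) of how
# daily, product-level prices can yield a daily inflation estimate: take the
# geometric mean of day-over-day price relatives across matched products.
# The prices below are invented for illustration.
from math import exp, log

prices_by_day = {
    "2013-03-01": {"sku1": 10.00, "sku2": 4.50, "sku3": 99.00},
    "2013-03-02": {"sku1": 10.20, "sku2": 4.50, "sku3": 97.00},
}

def daily_price_change(prev, curr):
    """Geometric mean of price relatives for products observed on both days."""
    matched = [pid for pid in curr if pid in prev and prev[pid] > 0]
    if not matched:
        return 0.0
    log_relatives = [log(curr[pid] / prev[pid]) for pid in matched]
    return exp(sum(log_relatives) / len(matched)) - 1.0  # fractional change

change = daily_price_change(prices_by_day["2013-03-01"], prices_by_day["2013-03-02"])
print(f"Estimated daily price change: {change:+.3%}")
```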

But measuring inflation is just the beginning. The Billion Prices Project is exploring a range of intriguing questions, such as the premiums that are charged for organic foods and the impact of exchange rates on pricing. You’re really only limited by your specific business information needs – and your imagination.

The Billion Prices Project also offers some useful insights for data publishers. First, the underlying data is scraped from websites. The Billion Prices Project didn’t ask for it or pay for it. That means you can build huge datasets quickly and economically. Second, the dataset is significantly incomplete. For example, it entirely ignores the huge service sector of the economy. But it’s better than the existing dataset in many ways, and that’s what really matters.

If you’re considering building a database, new web extraction technology gives you the ability to assemble massive, useful and high-quality datasets quickly and economically. And as we have seen time after time, the old aphorism “don’t let the perfect be the enemy of the good” still holds true. If you can do better than what’s currently available, you generally have an opportunity. Don’t focus on what you can’t get. Instead, focus on whether what you can get meaningfully advances the ball.
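
As a rough illustration of what that web extraction looks like in practice, here is a minimal Python sketch using the requests and BeautifulSoup libraries. The URL and CSS selectors are hypothetical placeholders; every retailer's markup is different, and any real crawler should respect robots.txt and the site's terms of service.

```python
# A minimal sketch of web price extraction with the requests and BeautifulSoup
# libraries. The URL and CSS selectors below are hypothetical placeholders;
# every retailer's markup differs, and a real crawler should honor robots.txt
# and the site's terms of service.
import requests
from bs4 import BeautifulSoup

def scrape_prices(url, item_sel=".product", name_sel=".name", price_sel=".price"):
    """Return a list of {product, price} records from one category page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select(item_sel):
        name = item.select_one(name_sel)
        price = item.select_one(price_sel)
        if name and price:
            # Strip currency symbols and thousands separators before converting
            cleaned = price.get_text(strip=True).lstrip("$").replace(",", "")
            records.append({"product": name.get_text(strip=True), "price": float(cleaned)})
    return records

# Hypothetical usage, run daily to accumulate a longitudinal price dataset:
# rows = scrape_prices("https://retailer.example.com/category/groceries")
```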

Enigma: Disrupting Public Data

Can you actually disrupt public data, which by definition is public, and by extension is typically free or close to free? Well, in a way, you can. A new start-up called Enigma, which can be thought of as “the Google of public data,” has assembled over 100,000 public data sources – some of them not even fully or easily accessible online. Think all kinds of public records: land ownership, public company data, customs filings, private plane registrations and more, all in one place.

But there’s more. Enigma doesn’t just aggregate, it integrates. That means it has expended tremendous effort to both normalize and link these disparate datasets, making information easier to find, and data easier to analyze.
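
To give a flavor of what "normalize and link" means in practice, here is a minimal sketch of one common step: standardizing entity names so records from two disparate public datasets can be joined. The datasets, suffix list and matching rule are invented for illustration; real-world record linkage is considerably more involved.

```python
# A minimal sketch of one "normalize and link" step: standardize entity names
# so records from two disparate public datasets can be joined on a common key.
# The datasets, suffix list and matching rule are invented for illustration.
import re

SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "co", "company", "ltd", "limited"}

def normalize_name(name):
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

# Toy records standing in for two very different public datasets
land_records = [{"owner": "Acme Corp.", "parcel": "12-345"}]
customs_filings = [{"importer": "ACME CORPORATION", "shipments": 42}]

# Index one dataset by its normalized key, then link the other against it
by_owner = {normalize_name(r["owner"]): r for r in land_records}
linked = []
for filing in customs_filings:
    key = normalize_name(filing["importer"])
    if key in by_owner:
        linked.append({**by_owner[key], **filing})

print(linked)  # -> one combined record linking the parcel to the customs activity
```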

The potentially disruptive aspect of a database containing so much public data is that quite a few data publishers have very successful businesses built, in whole or in part, on public datasets.

But beyond the potential for disruption, there’s other big potential here (I’ve requested a trial, so at this point I am working with limited information). First, Enigma isn’t (at least for now) trying to create a specific product, e.g. a company profile database. Rather, it’s providing raw data. That will make it less interesting to many buyers of existing data products who want a fast answer with minimum effort. But it also means that Enigma could be a leveraged way for many data publishers to access public data to integrate into their own products, especially since Enigma touts a powerful API.
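
For data publishers weighing that integration path, the pattern might look something like the sketch below. The base URL, endpoint, parameters and response shape are hypothetical placeholders, not Enigma's documented API; they simply illustrate pulling raw public records for enrichment of an in-house product.

```python
# A hedged sketch of the integration pattern described above. The base URL,
# endpoint, parameters and response shape are hypothetical placeholders, not
# Enigma's documented API; consult the provider's actual documentation.
import requests

def fetch_public_records(api_base, api_key, dataset, query, limit=100):
    """Pull raw rows from a (hypothetical) public-data API for downstream enrichment."""
    resp = requests.get(
        f"{api_base}/datasets/{dataset}/rows",
        params={"q": query, "limit": limit},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("rows", [])

# Hypothetical usage: enrich an in-house company profile with customs filings.
# rows = fetch_public_records("https://api.example.com/v1", "MY_KEY",
#                             "us-customs-filings", "Acme Corp")
```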

The other consideration with a product like this is that even with 100,000 datasets, it is inherently broad-based and scatter-shot in its coverage. That makes it far less threatening to vertical market data publishers.

Finally, Enigma has adopted a paid subscription model, so it’s not going to accelerate the commoditization of data by offering itself free to everyone and adopting an ad-supported model.

So from a number of angles, this is a company to watch. I’m eagerly waiting for my trial subscription; I urge you to dig in deep on Enigma as well.

Does Correlation Trump Causation?

A new book called Big Data: A Revolution That Will Transform How We Live, Work and Think, written by Viktor Mayer-Schonberger of Oxford and Kenneth Cukier of The Economist, raises some intriguing and provocative issues for data publishers. Among them is this one:

 “…society will need to shed some of its obsession for causality in exchange for simple correlation: not knowing why but only what.”

The underlying thinking as I understand it is that Big Data, because it can analyze and yield insight from millions or even billions of data points, is both incredibly powerful and uncannily accurate, in large part because of the massive sample sizes involved.

But are all Big Data insights created equal?

Without a doubt, some insights from Big Data analytics yield useful and low-risk results. If Big Data, for example, were to determine that from a price perspective, the best time to purchase an airline ticket is 11 days prior to departure, I have both useful information and not a care in the world about causation. Ironically, in this example, Big Data would be used to outsmart airline Big Data analytics that are trying to optimize revenues through variable pricing.
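
For what it's worth, that kind of purely correlational finding is computationally trivial. A minimal sketch, using invented fare observations, might look like this:

```python
# A minimal sketch of a purely correlational finding: given historical fare
# observations (invented here), find the number of days before departure with
# the lowest average price. No causal model required.
from collections import defaultdict

# (days_before_departure, fare) pairs harvested from past fare quotes
observations = [(30, 420.0), (30, 410.0), (11, 305.0), (11, 298.0), (3, 540.0), (3, 515.0)]

totals = defaultdict(lambda: [0.0, 0])
for days_out, fare in observations:
    totals[days_out][0] += fare
    totals[days_out][1] += 1

averages = {days: total / count for days, (total, count) in totals.items()}
best = min(averages, key=averages.get)
print(f"Lowest average fare observed {best} days before departure: ${averages[best]:.2f}")
```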

But riding solely on correlation often creates situations where heavy-handed or even ridiculous steps would be necessary to act on Big Data insights. Consider a vexing issue such as alcoholism. What if we learned through Big Data analytics that left-handed males who played tennis and drove red cars had an unusually high propensity to become alcoholics? Correlation identifies the problem, but it doesn’t provide much of a solution. Do we ban alcohol for this entire group? Do we tell left-handed males that they can either play tennis or drive a red car, but not both? Does breaking the correlative pattern actually work to prevent the correlated result? Things can get strange and confusing very quickly when you rely entirely on correlation.

Am I calling into question the value of Big Data analytics? Not at all. The ability to powerfully analyze massive data sets will be beneficial to all of us, in many different ways. But to suggest that Big Data correlations can largely supplant causation research plays into the Big Data hype by suggesting it is a pat, “plug and play” solution to all problems. Big Data can very usefully shape and define causal research, but there are numerous situations where it can’t simply replace it.

The lesson here is that while you should embrace Big Data and its big potential, remain objective and ask tough questions to separate Big Data from Big Hype because lately, the two have been tightly correlated.

Walking Around Money

A young company called Placed is deep into Big Data analytics, but with a twist: it marries customer data with its own proprietary data to yield insights into customer behavior. Essentially, Placed wants to provide context around how customers use its clients’ mobile applications – for example, when do they use the app, and where do they use it?

The “where” part of the analysis is what’s interesting. Placed could simply spit back to its clients that their customers are in certain ZIP codes or other dry demographics – interesting, as so many analytics reports are, but not particularly useful.

Instead, Placed marries customer location with its own proprietary database of places – named stores, major buildings, points of interest. By connecting the two, Placed can tell its clients where mobile use of their apps is occurring. For example, if a client’s customers use its mobile app in a competitor’s store, that might suggest competitive price comparisons. Knowing that customers frequent Starbucks and nightclubs might influence a client’s marketing strategy or advertising campaign design. Knowing that the app is used most often when someone is walking (yes, Placed can tell you that) can be important for user interface design – you get the idea.
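
A minimal sketch of that "connecting the two" step (not Placed's actual system, and with invented places and coordinates) might look like this: match a raw location ping against a small places database using great-circle distance.

```python
# A minimal sketch (not Placed's actual system) of connecting a raw location
# ping to a named place: find the nearest entry in a small places database
# using great-circle distance. The places and coordinates are invented.
from math import radians, sin, cos, asin, sqrt

PLACES = [
    {"name": "Starbucks, 5th Ave", "lat": 40.7532, "lon": -73.9822},
    {"name": "Competitor Store #12", "lat": 40.7510, "lon": -73.9870},
]

def haversine_meters(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def nearest_place(lat, lon, max_meters=75):
    """Return the closest known place within max_meters of the ping, or None."""
    best = min(PLACES, key=lambda p: haversine_meters(lat, lon, p["lat"], p["lon"]))
    if haversine_meters(lat, lon, best["lat"], best["lon"]) <= max_meters:
        return best["name"]
    return None

# A mobile app event carrying a location ping:
print(nearest_place(40.7531, -73.9820))  # -> "Starbucks, 5th Ave"
```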

And therein lies an important insight. There is an endless number of companies offering Big Data analytics capabilities. But almost all of them expect their customers to bring both the problem and the data. That’s a sure recipe for commoditization, and as analytics software evolves, it’s also certain that the companies with the biggest analytics needs will decide to do the work themselves.

Solution? Big Data analytics players should bring proprietary data to the party. Placed is a perfect case study. It differentiates itself by providing answers others can’t. It adds value to its analytics by integrating proprietary and licensed data with customer data and its own optimized analytical tools. As I discussed in my presentation at DataContent 2012, there are lots of ways publishers can profit from the Big Data revolution -- even if they don't have big data themselves.

In a market where companies like Placed can make money by tracking people walking around, it behooves data publishers to walk around to some of these Big Data analytics players and suggest data partnerships that will help them stand out from the crowd.

Sentimental Journey

Sentiment analysis represents a real opportunity for many data publishers to add new, high-value, proprietary and even real-time insight to their data products. But sentiment analysis has inherent strengths and weaknesses you need to appreciate when considering if there is a sentiment analysis opportunity for your data products.

A wonderful example of sentiment analysis at work is represented in a new service called the Twitter Political Index. Working with two respected political polling organizations, Twitter analyzes over 400 million tweets per day to determine how people are feeling about the two presidential candidates. The key word here is "feeling," because that's what sentiment analysis is all about -- guessing how people feel about a topic. It's not easy, and it is particularly complex for tweets, which are short and often lack context. Moreover, most sentiment analysis tools go well beyond simple binary like/dislike assessments and try to gauge the degree of like or dislike. It's tricky stuff, but the potential applications are numerous and exciting.

This is often the point where many people start questioning such issues as whether or not Twitter offers a representative sample, and what level of precision and confidence sentiment analysis can offer. These are valid questions, but be careful not to fall into the trap described by research industry veteran Ray Poynter, who notes that all new research methodologies are invariably measured against a standard of perfection. This implies that all current research methodologies are perfect, which is far from the case. When using tools such as sentiment analysis, you need to consider the application, then pick the methodology, seeking the best fit possible.

That's why data publishers should be thinking about sentiment analysis. You don't need to analyze every tweet; indeed you don't necessarily need Twitter at all. Sentiment analysis can be applied to research reports, blog posts and press releases. And if you can help your customers better understand how the world currently views a company or product, for example, you can deliver a useful new layer of insight that differentiates you from the competition, makes your products more valuable, and can be acquired and implemented fairly quickly and economically.
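
As a starting point, here is a minimal sketch of document-level sentiment scoring using the freely available VADER analyzer that ships with Python's NLTK. The sample texts are invented, and production use would call for tuning and validation against your own content.

```python
# A minimal sketch of document-level sentiment scoring using the freely
# available VADER analyzer bundled with NLTK. The sample texts are invented;
# production use would call for validation against your own content.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # fetch the scoring lexicon once
analyzer = SentimentIntensityAnalyzer()

documents = {
    "press_release": "Acme Corp. reports record revenue and an expanded product line.",
    "blog_post": "Frankly, the new release feels rushed and the pricing is disappointing.",
}

for name, text in documents.items():
    scores = analyzer.polarity_scores(text)
    # 'compound' runs from -1 (most negative) to +1 (most positive), so it
    # captures degree of sentiment, not just a binary like/dislike.
    print(f"{name}: compound sentiment = {scores['compound']:+.2f}")
```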

And particularly with new research methodologies, I think it's useful to remember the saying, "don't let the perfect become the enemy of the good." Building powerful new data analysis tools is a long journey, one that both publisher and customer are taking together.
