You have likely bumped into the hot new IT buzzword “data lake.” A data lake is simply a collection of data files, structured and unstructured, located in one place. Some see this as an advance over the “data warehouse,” where datasets are curated and highly organized. Fun fact: a bad data lake (one that holds too much useless data) is called a “data swamp.”
What’s the purpose of a data lake? Primarily, it’s to provide raw input to artificial intelligence and machine learning software. This new class of software is both powerful and complex, with the result that it has been credited with near-mystical qualities. As one senior executive of a successful manufacturing company told me, his company was aggressively adopting machine learning because “you just feed it the data and it gives you answers.” Yes, we now have software so powerful that it not only provides answers, but apparently formulates the questions as well.
The reality is much more mundane. This will not surprise any data publisher, but the more structure you provide to machine learning and artificial intelligence software, the better the results. That’s because while you can “feed” a bunch of disparate datasets into machine learning software, if there are no ready linkages between the datasets, your results will be, shall we say, suboptimal. And if the constituent data elements aren’t clean and normalized, you’ll get to see the axiom “garbage in, garbage out” playing out in real life.
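To make the point about linkages concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the two tiny datasets, the company names, and the `normalize_name` and `link` helpers are invented for illustration, and the cleanup rules are deliberately simplistic. It tries to join two datasets on company name, first using the raw names and then using normalized ones.

```python
def normalize_name(name: str) -> str:
    """Hypothetical cleanup rule: lowercase, strip punctuation,
    collapse whitespace, and drop a trailing legal suffix."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    words = cleaned.split()
    if words and words[-1] in {"inc", "llc", "corp"}:
        words = words[:-1]
    return " ".join(words)


def link(left: dict, right: dict, key_fn) -> dict:
    """Join two name-keyed datasets on key_fn(name).
    Returns {left_name: (left_value, right_value)} for matched records."""
    index = {key_fn(name): value for name, value in right.items()}
    return {
        name: (value, index[key_fn(name)])
        for name, value in left.items()
        if key_fn(name) in index
    }


# Two invented datasets describing the same three companies,
# each with its own spelling conventions.
revenue = {"Acme Corp.": 120, "Globex, Inc.": 75, "Initech": 40}
headcount = {"ACME corp": 300, "globex inc": 150, " Initech ": 90}

raw_links = link(revenue, headcount, key_fn=lambda s: s)    # join on raw names
clean_links = link(revenue, headcount, key_fn=normalize_name)  # join on cleaned names

print(len(raw_links), "records linked on raw names")
print(len(clean_links), "records linked on normalized names")
```

With these made-up records, the raw join links nothing, because no two spellings match exactly, while the normalized join links all three companies. Real datasets need far more careful matching rules, but the effect is the same: without the cleanup, whatever sits downstream never sees the connections.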
It’s a sad reality that highly trained and highly paid data scientists still spend the majority of their time acting as what they call “data wranglers” and “data janitors,” trying to smooth out raw data enough that machine learning will deliver useful and dependable insights. In a timely response to this, software vendor C3-AI has just launched a Covid-19 data lake. Its claimed value is that, rather than just gathering a collection of datasets in one place, C3-AI has taken the time to organize, unify and link the datasets.
The lesson here is that as data producers, we should never underestimate the value we create when we organize, normalize and clean data. Indeed, clean and organized data will be the foundation for the next wave of advances in both computing and human knowledge. Better data: better results.