Caching In


There's been a flurry of activity lately around the obscure but important practice of Web page caching -- taking and preserving copies of someone else's Web pages.

One of the more remarkable Web sites I have ever run across is www.archive.org, also known as "The Wayback Machine," a non-profit venture co-founded by Brewster Kahle to essentially take snapshots of the Web at different points in time. Using The Wayback Machine, you can easily take a look at how any individual Web site has evolved over time. Perhaps not surprisingly, not everyone wants their history to be so readily accessible.

In a recent court case, a law firm used The Wayback Machine to uncover some evidence to support its case. The other party in the lawsuit turned around and sued the Internet Archive, operators of The Wayback Machine, for inappropriately making and holding copies of its Web pages. The case is complicated, and actually revolves more around something called a "robots.txt" file than the cached pages themselves, but an adverse decision could have a chilling effect on archiving and making available historical content gathered on the Web.

At the same time, Canadian legislators are considering an amendment to the Canadian Copyright Act that would prohibit anyone from making and holding a cached copy of someone else's Web site without permission. While this could be a speed bump for search engines that cache content, it's not likely to disrupt them too much, as they don't need to cache content to index Web sites, and indexing itself would not be prohibited.

But this is all part of a larger trend towards history disappearing from the Web, and therein may reside a real opportunity for data publishers.

I regularly see examples of companies removing all traces of unsuccessful products, ousted executives and failed ventures from their Web sites, leaving no clue they ever existed. A large percentage of companies, mostly for benign reasons, "age off" old press releases and announcements from their Web sites, leaving only a narrow window of corporate history. Most companies seem to feel that the primary value of their Web sites is to provide current if not real-time information, with only a small nod to what has happened in the past. That means that those who capture and retain this type of business information will ultimately end up with a vast repository of business intelligence, much of it unavailable elsewhere.

Even in the pre-Internet days, historical data had real value. I published one directory whose small index of corporate name changes was one of the most popular and heavily used sections of the publication. I know of a healthcare directory that didn't simply delete companies that went out of business, merged or were acquired. Instead, it listed them in an index called "Mutations," which proved incredibly popular. I know one financial publisher that actually retains all the previous positions held by the executives in its database, valuable information that could be the basis for a number of specialized, high-value products. In many industries, there are successful databases that cross-reference old and new part numbers, or suggest equivalent parts to replace discontinued ones. And knowing what products a company used to make, what ventures it has exited, and what executives it used to employ will become increasingly valuable as the information becomes harder to access. When it comes to data, the past can be a prelude to lucrative opportunities.
