Just a few weeks ago, Google, Microsoft and Yahoo announced that their search engines would jointly support a common semantic markup vocabulary, expressed in the HTML5 microdata format, the details of which can be found at schema.org.
Semantic coding? Microdata? Okay, I promise not to go too deep into the details, but this is important stuff. In a nutshell, this new schema -- which will be recognized by all major search engines -- allows websites to present their information to the search engines in a more structured fashion. The schema defines a standardized set of attributes that are added to ordinary HTML tags. When data is tagged at a very granular level, search engines not only get smarter about the data they are indexing, they can also manipulate and process it just as if it were information in a conventional database.
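For the curious, here is a rough sketch of what that tagging looks like and how easily it turns into database-like fields. The page fragment and the business in it are invented for illustration, and the `FlatMicrodataParser` class and `extract_microdata` function are hypothetical names of my own. The parser relies only on Python's standard library and deliberately ignores nesting, so treat it as a toy rather than a full microdata processor.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the business and its details are made up.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/SelfStorage">
  <span itemprop="name">Example Self Storage</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Fairfield</span>,
    <span itemprop="addressRegion">CT</span>
    <span itemprop="postalCode">06824</span>
  </div>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collects item types and text-valued itemprop fields into flat structures.

    Simplifications: nested items are flattened into one dictionary, and
    void elements (such as <br>) inside a property are not handled.
    """

    def __init__(self):
        super().__init__()
        self.item_types = []   # every itemtype URL seen, in document order
        self.properties = {}   # itemprop name -> text content
        self._prop = None      # property currently being captured, if any
        self._depth = 0        # open-tag depth inside the captured element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._prop is not None:
            self._depth += 1
            return
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        # An itemprop that is also an itemscope holds a nested item, not text.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._prop = attrs["itemprop"]
            self._depth = 1
            self._chunks = []

    def handle_data(self, data):
        if self._prop is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if self._prop is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.properties[self._prop] = "".join(self._chunks).strip()
            self._prop = None


def extract_microdata(html):
    """Return (item_types, properties) for a fragment of tagged HTML."""
    parser = FlatMicrodataParser()
    parser.feed(html)
    return parser.item_types, parser.properties


if __name__ == "__main__":
    types, fields = extract_microdata(SAMPLE_HTML)
    print(types)
    print(fields)
```

The point is not the parser itself but how little guesswork is left: the business type, city and postal code each arrive as a labeled field rather than as free text to be interpreted.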
Yes, by encouraging website owners to add these standardized tags to their content, the search engines want to marry the power and precision of structured data with the richness and depth of textual and graphical information.
As a data publisher, should you be concerned? Well, let's look at just a few of the specific business types already defined by the schema: travel agency, child care, financial services, real estate agent, shopping center. If enough of these types of companies ultimately add these tags to their websites, the search engines will be able to pull them out with complete precision.
Not precise enough? Doesn't impact your industry? Well, consider too that the schema allows for tagging of postal addresses and geo-coordinates. That means that, over time, the search engines will be able to list (and even sort) all businesses at any location or ZIP code with precision. Precise mapping will also be a breeze.
And let's not forget products. The schema provides tags to identify specific products, and even to indicate whether a specific item is being offered for sale. There is a way to tag associated product reviews, and even to identify a product ID code -- either a proprietary code or one from an industry-standard code system.
Most significant of all, this schema will allow search engines to do easily what they've never been able to do before: parametric searches. Want all self-storage centers in Fairfield, Connecticut? Done. All self-storage centers within 50 miles of Wichita, Kansas that have units for rent? Done. Yes, it starts to hit a little close to home, doesn't it?
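To make "parametric search" concrete, here is a minimal sketch of the Wichita query run against records that have already been pulled out of tagged pages. Everything in it is an assumption for illustration: the `Listing` fields, the sample businesses, and the approximate coordinates are invented, and the distance is a plain great-circle (haversine) calculation rather than anything a search engine is known to use.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Listing:
    """Hypothetical record built from a page's tagged fields."""
    name: str
    business_type: str
    latitude: float
    longitude: float
    has_units_for_rent: bool


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def parametric_search(listings, business_type, center, radius_miles,
                      available_only=True):
    """Filter by type, distance from center, and availability; nearest first."""
    lat, lon = center
    hits = []
    for item in listings:
        if item.business_type != business_type:
            continue
        if available_only and not item.has_units_for_rent:
            continue
        distance = miles_between(lat, lon, item.latitude, item.longitude)
        if distance <= radius_miles:
            hits.append((distance, item))
    hits.sort(key=lambda pair: pair[0])
    return [item for _, item in hits]


# Invented sample data; coordinates are approximate city locations.
WICHITA = (37.6872, -97.3301)
LISTINGS = [
    Listing("Downtown Storage", "SelfStorage", 37.70, -97.35, True),
    Listing("Hutchinson Storage", "SelfStorage", 38.0608, -97.9298, False),
    Listing("Topeka Storage", "SelfStorage", 39.0473, -95.6752, True),
    Listing("Wichita Child Care", "ChildCare", 37.69, -97.33, True),
]

if __name__ == "__main__":
    for match in parametric_search(LISTINGS, "SelfStorage", WICHITA, 50):
        print(match.name)
```

Once the type, coordinates and availability are explicit fields, the whole query is a handful of comparisons, which is exactly why structured tagging matters so much here.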
Will this schema take off? It depends on wide-scale adoption, of course, but it's free to use and there's the huge carrot of potentially improved search engine results rankings. The schema is not sufficiently robust as of now to scare too many data publishers, and it does feel heavily weighted to local, retail businesses, but by improving the precision of search engine results, it reinforces the thinking that the big search engines are "good enough."
Can you use these tags to improve the results of your own web scraping efforts? Absolutely. Like almost everything with the search engines, this one is a two-edged sword.