
The Future of Data Information Retrieval April 29, 2010

Posted by Andre Vellino in Data Mining, Information retrieval, Knowledge Representation.

It seems like the “Open Data” movement is at last getting some traction and the floodgates are opening. [Thanks to Daniel Lemire, Richard Akerman, Peter Turney and Paul Gilbert for all the pointers and helpful suggestions!]

For instance, just the other day the World Bank opened up its data vaults, adding to the already voluminous quantities of “social science” data. It probably won’t take long for Google to add some of those data sets to its growing collection of public data, which it can display with the Google Motion Charts API. This is in parallel with “Google Fusion Tables” (read: “Google Docs for data”), for which visualizations are also available.

I am not sure about Google’s commercial motives – they probably don’t know either – “build it and the money will come” seems to work for Google, somehow. But Amazon’s motive for providing 30 or so significant data sets (between 20 and 300 GB each) is more transparent: to sell cloud-computing services to those who want to data-mine this information. What a great honey-pot for data-miners who need to chew up CPU cycles!  My favourites are

There are stacks of other datasets as well, lovingly cared for by people for whom this data really matters.

The list is much too long for a comprehensive catalogue, although CISTI has begun developing such a list of Canadian scientific datasets at the Gateway to Scientific Data.

In sum, we have lots of scientists contributing lots of data to databases they care about.  Now what?

Here’s the analogy with the web that Tim Berners-Lee makes in this TED video. In about 1990 we already had lots and lots of electronic documents on PC hard disks, not to mention on mainframes and file servers. Then came his wonderful, awesome idea: do for files on the internet what Macintosh HyperCard did for “cards”. Brilliant! But hypertext alone wasn’t enough. First requirement: a “universal locator scheme” for linking documents to one another. Second requirement: a harvesting / indexing method to make the content accessible via a search engine.

The corresponding ideas are now migrating to the data world. For instance, Berners-Lee is spearheading the Linked Data movement. The idea of data URIs (addressable via HTTP) is an essential first step, but the corresponding second step – a data-harvesting / indexing method – hasn’t (yet) been taken. Not by conventional internet search engines, anyway.
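The first step – a dataset URI that is also an HTTP address – can be sketched with Python’s standard library. The URI below is hypothetical, but the mechanics (asking for machine-readable RDF via the Accept header, rather than an HTML landing page) are exactly what Linked Data relies on:

```python
import urllib.request

# A (hypothetical) URI identifying a dataset record. In Linked Data,
# the same URI is both an identifier and an HTTP address.
dataset_uri = "http://example.org/dataset/gdp-2009"

# Content negotiation: ask the server for RDF in Turtle notation
# instead of the human-readable HTML page.
req = urllib.request.Request(dataset_uri, headers={"Accept": "text/turtle"})

print(req.full_url)              # the identifier doubles as the address
print(req.get_header("Accept"))  # text/turtle
```

A Linked Data-aware server would answer such a request with triples describing the dataset, including links to related URIs.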

We need to be able to search for and find data but, unlike text, data itself can’t be indexed directly. What can we do about that? One suggestion – for scientific datasets, anyway – is to make the link between scientific documents (i.e. published articles) and the datasets they depend on, and to use the text of the scientific articles as “metadata” for the datasets.
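A toy sketch of that suggestion, with invented DOIs and article snippets: build an inverted index from the words of articles to the datasets they cite, so that a full-text query can surface data that is not itself text-indexable:

```python
import re
from collections import defaultdict

# Hypothetical article -> dataset links (e.g. via data DOIs cited in each paper).
articles = {
    "doi:10.9999/article.1": {
        "text": "We measure ocean temperature trends using buoy data.",
        "datasets": ["doi:10.9999/data.buoys"],
    },
    "doi:10.9999/article.2": {
        "text": "Ocean salinity and temperature from ship transects.",
        "datasets": ["doi:10.9999/data.ships"],
    },
}

# Invert: every word of an article becomes an index term
# for the datasets that article cites.
index = defaultdict(set)
for art in articles.values():
    for word in re.findall(r"[a-z]+", art["text"].lower()):
        for ds in art["datasets"]:
            index[word].add(ds)

print(sorted(index["temperature"]))
# → ['doi:10.9999/data.buoys', 'doi:10.9999/data.ships']
```

A query for “temperature” now retrieves both datasets, even though neither contains a word of prose.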

Fortunately, thanks to DataCite, scientists can now obtain Digital Object Identifiers (DOIs) for data and cite datasets in their publications. They may even get credit for developing and publishing datasets in academic peer review. In the not-too-distant future, scientists in Canada will be going to CISTI to obtain those DOIs.
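As a rough illustration (the DOI and metadata fields below are invented), a DOI makes citing a dataset as mechanical as citing an article – a citation is just a rendering of the dataset’s metadata record:

```python
# Hypothetical dataset metadata record; DOI and fields invented for illustration.
record = {
    "creator": "Smith, J.",
    "year": 2010,
    "title": "Arctic Buoy Temperature Series",
    "publisher": "CISTI",
    "doi": "10.9999/example.data.1",
}

def cite(rec):
    """Render a DataCite-style dataset citation string from a metadata record."""
    return "{creator} ({year}): {title}. {publisher}. doi:{doi}".format(**rec)

print(cite(record))
# → Smith, J. (2010): Arctic Buoy Temperature Series. CISTI. doi:10.9999/example.data.1
```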

What might a future in which data is properly indexed and discoverable look like?  A little like WolframAlpha, I expect.  This “computational knowledge engine” – note the absence of “search” in this descriptor – already relies on lots of “databases” to return “search results” and compute [possibly relevant] related facts.  It also already has a pretty cool iPad app for navigating and visualizing results.  This is where the puck is going and we ought to be skating there.

Currently the state of the art for cataloguing datasets is to describe them with some bibliographic metadata standard like Dublin Core. And to “crawl” that metadata, we could use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) as a starting point.
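A minimal sketch of such a harvester, assuming a hypothetical OAI-PMH endpoint and a hand-written Dublin Core record (the real protocol wraps records in an OAI envelope, elided here):

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Step 1: an OAI-PMH ListRecords request asking for Dublin Core records.
# The endpoint URL is hypothetical.
base = "http://example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
request_url = base + "?" + urllib.parse.urlencode(params)
print(request_url)
# → http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc

# Step 2: parse one (hand-written) Dublin Core record as a harvester would.
sample = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Arctic Buoy Temperature Series</dc:title>
  <dc:identifier>doi:10.9999/example.data.1</dc:identifier>
</oai_dc:dc>"""

root = ET.fromstring(sample)
ns = {"dc": "http://purl.org/dc/elements/1.1/"}
print(root.find("dc:title", ns).text)       # the harvested title
print(root.find("dc:identifier", ns).text)  # the harvested dataset DOI
```

The harvested title/identifier pairs are exactly what a dataset search index would be built from.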

But we are going to need something more than just “metadata” (which is, in fact, “meta-metadata”, since the “metadata” for a dataset should really be the data schema or XML Schema that it conforms to). Remember David Weinberger’s advice in Everything is Miscellaneous: “The solution to the overabundance of information is more information”.

So where can we find more information about the data, above and beyond how it is linked to other data and how it is referenced in the published literature? How about putting data schemas and XML Schemas to work and mining them for meaning? The idea is still a bit vague in my mind, but I imagine something like an automated way of extracting or reverse-engineering the entities and relationships from the data schema and using them to index the data elements. This could enable even more links between data elements in one dataset and elements in another.
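Here is one way the idea might look in miniature: given a toy XML Schema (hand-written for illustration; real schemas are far richer), pull out the declared element names as candidate entities and index terms:

```python
import xml.etree.ElementTree as ET

# A toy XML Schema describing one "station" entity with two typed fields.
xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="station">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="latitude" type="xs:float"/>
        <xs:element name="temperature" type="xs:float"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

XS = "{http://www.w3.org/2001/XMLSchema}"
root = ET.fromstring(xsd)

# Every declared element name is a recoverable entity or attribute –
# a candidate index term for the data that conforms to this schema.
terms = [el.get("name") for el in root.iter(XS + "element")]
print(terms)
# → ['station', 'latitude', 'temperature']
```

Nesting in the schema already encodes relationships (a station *has* a latitude and a temperature), which is the raw material for linking elements across datasets.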

This is perhaps where the social sciences have a leg-up on the physical sciences – there is quite a rich metadata standard for the social sciences: the Data Documentation Initiative (DDI) (which is a misnomer, really – it should be called the “Social Sciences Data Documentation Initiative”). The value of this standard is that it has provided a framework for developing tools for navigating datasets and statistical tools for analysis.

There are some such toolsets in some scientific fields: NetCDF (network Common Data Form), for instance, is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
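For a flavour of what NetCDF’s machine-independent formats describe, here is a small fragment in CDL, the textual notation that NetCDF’s ncgen tool compiles into a binary file; the dimensions, variable, and attributes below are invented for illustration:

```
netcdf example {
dimensions:
    time = UNLIMITED ;
    station = 3 ;
variables:
    float temperature(time, station) ;
        temperature:units = "kelvin" ;
        temperature:long_name = "air temperature" ;
}
```

Note that the dimensions, units, and names are self-describing metadata carried inside the data file itself – precisely the kind of schema information that could be mined for indexing.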

Perhaps if each scientific discipline can establish a similar set of standards, the future of scientific dataset discovery will look as compelling as the WWW does today.


1. gawp - April 30, 2010

After watching a few presentations by Peter Norvig (head of research at Google), I realized that Google is a machine-learning company. They accumulate all this data because they can train ML systems to do smart things with it. There is a great presentation by Norvig (I can’t find it right now) where he says that a lot of ML systems are moderately good when trained with, say, 10^5 examples, but even the crappy ones do pretty well if you can train them with 10^9 examples.

I think that Google Books, for example, is not (primarily) some attempt to get control of books, but rather to get them into a format where they can be used for training: translation systems, spell checkers, etc.

Andre Vellino - April 30, 2010

@Gareth – You might be thinking about this Google Research blog post:


2. Toward data-driven science - May 3, 2010

[…] improving access to data is fast becoming a critical issue. In a thought-provoking post, Andre Vellino sketches the future of data Information Retrieval. Some key […]

3. Kevembuangga - May 3, 2010

we are going to need something more than just “metadata”

Yeah! We are going to need semantics (the real ones, not the “Semantic Web”).
But this has been tried before to no avail; the data miners are just about to rediscover ontologies and how unruly they are.
It’s an old story. Good luck…

5. Data Archiving « Synthèse - May 7, 2010

[…] a previous post, I was suggesting that the text from publications and the data DOIs that are referenced in them […]

6. Scientific Research Data « Synthèse - August 23, 2010

[…] I indicated in a previous post, one development that will help redress the problems endured by small, orphaned and inaccessible […]

7. La difficile accessibilité des données scientifiques « meridianes - July 26, 2011

[…] je l’ai indiqué dans un billet précédent, l’émergence de méthodes de référencement unique pour les jeux de données comme celles […]

8. Les scientifiques découvrent que seuls 20% des données climatiques sont accessibles « Les moutons enragés - September 26, 2011

[…] je l’ai indiqué dans un billet précédent, l’émergence de méthodes de référencement unique pour les jeux de données comme celles de […]
