The Future of Data
April 29, 2010. Posted by Andre Vellino in Data Mining, Information retrieval, Knowledge Representation.
It seems like the “Open Data” movement is at last getting some traction and the floodgates are opening. [Thanks to Daniel Lemire, Richard Akerman, Peter Turney and Paul Gilbert for all the pointers and helpful suggestions!]
For instance, just the other day the World Bank opened up its data vaults, adding to the already voluminous quantities of “social science” data. It probably won’t take long for Google to add some of those data sets to its growing collection of public data that it can display with the Google Motion Charts API. This is in parallel with “Google Fusion Tables” (read: “Google Docs for Data”), for which visualizations are also available.
I am not sure about Google’s commercial motives – they probably don’t know either – “build it and the money will come” seems to work for Google, somehow. But Amazon’s motive for providing 30 or so significant data sets (between 20 and 300 GB each) is more transparent: to sell its cloud-computing services to those who want to data-mine this information. What a great honey-pot for data-miners who need to chew up CPU cycles! My favourites are:
- U.S. Census Data (1980, 1990, 2000)
- Daily Weather (1929-2009), curated from National Climatic Data Center Data Sets
- Sloan Digital Sky Survey (DR6)
- GenBank (the NIH Gene DataBase)
- PubChem (another database from NIH)
- Ensembl (Human and other animal Genomes)
There are stacks of other datasets as well, lovingly cared for by people for whom this data really matters.
- Astronomy Data Set
- BioMedical Informatics Network
- NRCAN Geo Science Data Repository
- NIST Data Gateway
The list is much too large to provide a comprehensive catalogue, although CISTI has begun developing such a list of Canadian scientific datasets at the Gateway to Scientific Data.
In sum, we have lots of scientists contributing lots of data to databases they care about. Now what?
Here’s the analogy with the web that Tim Berners-Lee makes in this TED video. In ~1990 we already had lots and lots of electronic documents on PC hard disks, not to mention mainframes and even file servers. Then came his wonderful, awesome idea – do to files on the internet what Macintosh HyperCard did to “cards”. Brilliant! But hypertext alone wasn’t enough. First requirement: a “universal locator scheme” for linking documents to one another. Second requirement: a harvesting / indexing method to make the content accessible via a search engine.
The corresponding ideas are now migrating to the data world. For instance, Berners-Lee is spearheading the Linked Data movement. The idea of Data URIs (addressable via HTTP) is an essential first step, but the corresponding second step – a data-harvesting / indexing method – hasn’t (yet) been taken. Not by conventional internet search engines, anyway.
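To make that “first step” concrete, here is a minimal sketch (my own illustration, not Berners-Lee’s) of what an HTTP-addressable data URI buys you: dereferencing it with content negotiation returns machine-readable RDF, whose links a data crawler could then follow. The DBpedia URI below is just one example of a Linked Data resource.

```python
# Dereference a Linked Data URI, asking for RDF rather than an HTML page.
# The URI below (a DBpedia resource) is only an illustrative example.
from urllib.request import Request, urlopen

uri = "http://dbpedia.org/resource/World_Bank"
req = Request(uri, headers={"Accept": "application/rdf+xml"})

with urlopen(req) as response:
    rdf = response.read().decode("utf-8")

# The response is RDF: triples whose objects are themselves URIs,
# which is exactly what a data-harvesting crawler would follow.
print(rdf[:500])
```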
We need to be able to search for and find data, but unlike text, data itself can’t be indexed! What can we do about that? One suggestion – for scientific datasets, anyway – is to make the link between scientific documents (i.e., published articles) and the datasets they depend on, and to use the text of the scientific articles as “metadata” for the datasets.
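To illustrate the suggestion, here is a toy sketch (identifiers and text invented) of indexing datasets by the words in the articles that cite them, so that a full-text query returns datasets rather than documents:

```python
# Toy example: treat the text of citing articles as a dataset's searchable "metadata".
# The dataset DOIs (reserved example prefix 10.5555) and article snippets are invented.
from collections import defaultdict

article_text_by_dataset = {
    "doi:10.5555/census-1990": "decennial census population housing demographics",
    "doi:10.5555/sdss-dr6":    "sloan digital sky survey photometric redshift quasars",
}

# Inverted index: term -> datasets whose citing articles mention that term.
index = defaultdict(set)
for dataset_id, text in article_text_by_dataset.items():
    for term in text.lower().split():
        index[term].add(dataset_id)

def find_datasets(query):
    """Return datasets whose citing literature matches every query term."""
    hits = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(find_datasets("census population"))   # -> {'doi:10.5555/census-1990'}
```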
Fortunately, thanks to DataCite, scientists can now obtain Digital Object Identifiers (DOIs) for data and cite datasets in their publications. They may even get credit for developing and publishing datasets in academic peer review. In the not-too-distant future, scientists in Canada will be going to CISTI to obtain those DOIs.
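As a small, made-up illustration of what that enables, the snippet below formats a dataset citation along the general lines of the DataCite recommendation (creator, year, title, publisher, identifier); every field value here is invented:

```python
# Format a dataset citation from its metadata. All values are invented;
# 10.5555 is the reserved example DOI prefix, not a real registered DOI.
dataset = {
    "creator":    "Smith, J.; Tremblay, M.",
    "year":       2010,
    "title":      "Daily Temperature Observations, 1929-2009",
    "publisher":  "Example Data Centre",
    "identifier": "doi:10.5555/example-dataset",
}

citation = "{creator} ({year}): {title}. {publisher}. {identifier}".format(**dataset)
print(citation)
```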
What might a future in which data is properly indexed and discoverable look like? A little like WolframAlpha, I expect. This “computational knowledge engine” – note the absence of “search” in this descriptor – already relies on lots of “databases” to return “search results” and compute [possibly relevant] related facts. It’s also already got a pretty cool iPad app for navigating and visualizing results. This is where the puck is going and we ought to be skating there.
Currently, the state of the art for cataloging datasets is to describe them with metadata in some bibliographic standard like Dublin Core. And to “crawl” that metadata, we could use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) as a starting point.
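For the curious, here is a rough sketch of what such harvesting looks like in practice: an OAI-PMH ListRecords request returning Dublin Core records, parsed for titles and identifiers. The repository URL below is a placeholder, not a real endpoint.

```python
# Harvest Dublin Core metadata over OAI-PMH (the endpoint URL is hypothetical).
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://repository.example.org/oai"            # placeholder endpoint
url = BASE_URL + "?verb=ListRecords&metadataPrefix=oai_dc"

ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc":  "http://purl.org/dc/elements/1.1/",
}

with urlopen(url) as response:
    tree = ET.parse(response)

# Print the Dublin Core title and identifier of each harvested record.
for record in tree.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    title = record.find(".//dc:title", ns)
    ident = record.find(".//dc:identifier", ns)
    if title is not None:
        print(title.text, "->", ident.text if ident is not None else "?")
```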
But we are going to need something more than just “metadata” (which is, in fact, “meta-metadata”, since the “metadata” for a dataset should really be the data schema or XML Schema that it conforms to). Remember David Weinberger’s advice in Everything is Miscellaneous: “The solution to the overabundance of information is more information”.
So where can we find more information about the data, above and beyond how it is linked to other data and how it is referenced in the published literature? How about putting data schemas and XML Schemas to work and mining them for meaning? The idea is still a bit vague in my mind, but I imagine something like an automated way of extracting or reverse-engineering the entities and relationships from the data schema and using them to index the data elements. This could enable even more links between data elements in one dataset and elements in another.
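As a very rough sketch of what “mining a schema for meaning” might look like, the snippet below walks an XML Schema (the file name is a placeholder) and pulls out complexType names as candidate entities and their child elements as candidate attributes, which could then serve as index terms for the data elements themselves:

```python
# Extract candidate entities and attributes from an XML Schema (file name hypothetical).
import xml.etree.ElementTree as ET

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"

tree = ET.parse("dataset-schema.xsd")

# Each named complexType is a candidate "entity"; its child elements are
# candidate attributes or relationships, usable as index terms for the data.
for ctype in tree.iter(XSD_NS + "complexType"):
    entity = ctype.get("name", "(anonymous)")
    fields = [(e.get("name"), e.get("type")) for e in ctype.iter(XSD_NS + "element")]
    print(entity, "->", fields)
```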
This is perhaps where the social sciences have a leg up on the physical sciences – there is quite a rich metadata standard for the social sciences: the Data Documentation Initiative (DDI) (which is a misnomer, really – it should be called the “Social Sciences Data Documentation Initiative”). The value of this standard has been to provide a framework for developing tools for navigating datasets and statistical tools for analysis.
Some scientific fields already have such toolsets: NetCDF (network Common Data Form), for instance, is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
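To give a flavour of what “array-oriented, self-describing” means, here is a small sketch using the netCDF4 Python bindings (assumed to be installed); the variable names, units and values are invented for illustration:

```python
# Write and read back a tiny self-describing NetCDF file (all values invented).
import numpy as np
from netCDF4 import Dataset

ds = Dataset("weather_sample.nc", "w")               # hypothetical output file
ds.createDimension("time", 3)
ds.createDimension("station", 2)

temp = ds.createVariable("temperature", "f4", ("time", "station"))
temp.units = "degrees_Celsius"                       # metadata travels with the data
temp[:] = np.array([[11.2, 9.8], [12.1, 10.4], [13.0, 11.1]])
ds.close()

# Reading it back: the file describes its own dimensions, variables and units.
with Dataset("weather_sample.nc") as ds:
    t = ds.variables["temperature"]
    print(t.units, t[:].shape)
```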
Perhaps if each scientific discipline can establish a similar set of standards, the future of scientific dataset discovery will look as compelling as the WWW does today.