Scientific Research Data August 23, 2010

Posted by Andre Vellino in Data, Information, Information retrieval, Open Access.

Scientific research data is, without a doubt, a central component in the lifecycle of knowledge production. For one thing, scientific data is critical to the corroboration (or falsification) of theories. Equally important to the process of scientific inquiry is making this data openly available to others – as is vividly demonstrated by the so-called “ClimateGate” controversy and the more recent cloud over Marc Hauser’s research data on primate cognition. The public accessibility of data enables open peer review and encourages the reproducibility of results.

Hence the importance of data management practices in 21st century science libraries: the curation and preservation of scientific research data sets, and access to them, will be critical to the future of scientific discourse.

It is true that “Big Science” has been in the business of curating “reference data” for years. Institutional data centers in many disciplines have been gathering large amounts of data in databases that contain the fruit of years of research. GenBank, for instance, is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (containing over 100 million sequence records).

However, other kinds of data gathered by scientists are either transient or highly context-dependent and are not being preserved for the long-term benefit of future research, either by individuals or by institutions. This might not be so serious for those data elements that are reproducible – either by experiment or simulation – but much of this data, such as measurements of oil content and dissipation rates in the Gulf of Mexico water column in 2010, is uniquely valuable and irreproducible.

As I indicated in a previous post, one development that will help redress the problems endured by small, orphaned and inaccessible datasets is the emergence of methods for uniquely referencing datasets, such as the data DOIs that are being implemented by DataCite partners. With the combination of data-deposit policies from science research funding agencies (such as NSF in the US and NSERC in Canada) and peer recognition from university faculty, the status of contributions to data repositories, data publication and data referencing will soon grow to match the present status of scholarly publications.
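To make the idea of a citable dataset concrete, here is a minimal sketch in Python that assembles a citation string from a few metadata fields and a DOI. The layout (creator, year, title, publisher, identifier) only loosely follows the kind of citation format DataCite recommends; the class, the function name, the example record and the DOI `10.1234/example-dataset` are all hypothetical placeholders, not real identifiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Minimal descriptive fields for a citable dataset (illustrative only)."""
    creators: List[str]
    year: int
    title: str
    publisher: str
    doi: str  # hypothetical DOI, e.g. "10.1234/example-dataset"


def format_citation(record: DatasetRecord) -> str:
    """Build a human-readable citation that ends in a resolvable DOI link."""
    authors = "; ".join(record.creators)
    return (
        f"{authors} ({record.year}): {record.title}. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Made-up record, used only to show the shape of the output.
example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2010,
    title="Hypothetical water column samples",
    publisher="Example Data Centre",
    doi="10.1234/example-dataset",
)
print(format_citation(example))
```

The point of the sketch is simply that once a dataset has a persistent identifier, citing it becomes as mechanical as citing a journal article.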

In parallel, the growing “open access for Data” movement and other initiatives to increase the availability of data generated by government and government-funded institutions (including NASA, the NIH and the World Bank) are now well underway in a manner consistent with the OECD’s principles, which, incidentally, offer a long and convincing list of economic and social benefits to be obtained from making scientific research data accessible.

In particular, the United States, the UK and Australia are spearheading the effort to make public and scientific research data more accessible. For instance, in the U.S., a recent report to President Obama from the National Science and Technology Council (NSTC) details a comprehensive strategy to promote the preservation of and access to digital scientific data.

These reports and initiatives show that the momentum is building globally to realize visions that have been articulated in principle by several bodies concerned with the curation and archiving of data in the first decade of the 21st century (see To Stand the Test of Time and Long Lived Scientific Data Collections).

In Canada, several similar reports such as the Consultation on Access to Scientific Research Data and the Canadian Digital Information Strategy also point to the need for the national stewardship of digital information, not least scientific data sets. Despite much discussion, systematic efforts in the stewardship of Canadian digital scientific data sets are still only at the preliminary stages. While there are well managed and curated reference data in domains such as earth science (Geogratis) and astronomy (Canadian Astronomy Data Centre), which serve communities of specialist scientific users whose needs are generally well met, the data-management needs of individual scientists in small, less well funded research groups go unmet: their data is either impossible to find or lost.

One impediment to the effective bibliographic curation of data sets is the absence of common standards. There are currently “no rules about how to publish, present, cite or otherwise catalogue datasets.” [Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing]

CISTI’s Gateway to Scientific Data sets and other such national sites (e.g. the British National Archives of Datasets) that aggregate information about data sets use bibliographic standards (e.g. Dublin Core) for representing metadata. The advantage is that these standards are not domain-dependent yet are sufficiently rich to express the core elements of the content needed for archiving, storage and retrieval. However, these metadata standards, developed for traditional bibliographic purposes, are not (yet) sufficiently rich to fully capture the wealth of scientific data from all disciplines, as I argued in a previous post.
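As an illustration of what such a bibliographic description looks like, here is a minimal sketch in Python that serializes a dataset description using elements from the Dublin Core element set. The `<metadata>` wrapper, the function name and every field value are assumptions made up for this example; they do not reflect how CISTI’s gateway or any other site actually stores its records.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements defined by simple Dublin Core.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields: dict) -> ET.Element:
    """Wrap Dublin Core elements in a generic <metadata> container (hypothetical)."""
    root = ET.Element("metadata")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return root


# All values below are invented placeholders.
record = dublin_core_record({
    "title": "Hypothetical water column samples",
    "creator": "Doe, J.",
    "date": "2010",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "doi:10.1234/example-dataset",
})
print(ET.tostring(record, encoding="unicode"))
```

Note that nothing in this vocabulary says anything about, for instance, instruments, units or calibration, which is the kind of discipline-specific detail that a generic bibliographic record leaves out.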

One of the major concerns when deciding on the feasibility of creating a data repository is the cost associated with the deposit, curation and long-term preservation of research data. Typically, costs depend on a variety of factors, including how each of the typical phases (planning, acquisition, disposal, ingest, archive, storage, preservation and access services) is deployed (see the JISC reports “Keeping Research Data Safe” Part 1 and Part 2). The costs associated with different data collections are also likely to vary considerably according to how precious (rare/valuable) the stored information is and what the requirements are for access over time.

One point to note from the “Keeping Research Data Safe” reports commissioned for JISC is that

“the costs of archiving activities (archival storage and preservation planning and actions) are consistently a very small proportion of the overall costs and significantly lower than the costs of acquisition/ingest or access.”

In short: librarianship for datasets is critical to the future of science, and technology costs are the least of our concerns.


1. Daniel Lemire - August 23, 2010

I agree that the cost of curating the data is small compared to overall costs.

A concern though: there is a huge incentive to pay for producing the data, but where’s the incentive to pay for the curation?

If the climate hockey stick debacle showed us anything, it is that many researchers have a strong incentive to see the data quietly disappear (or so it seems).

I think that the money would need to come from… the funding agencies directly. But what is their own incentive? Archiving and curating the data seems boring… nothing “new” is being produced… how do you sell it?

2. Andre Vellino - August 24, 2010

That’s right Daniel – the incentive to pay for data curation is an issue. However, I think that when (or maybe I should say “if”) data-creation becomes valuable to researchers as peer-reviewed output, which would count for tenure and promotions AND it is citable (via DOIs or some other mechanism) then researchers themselves will have an incentive to make their data discoverable (so that it can be cited, so that they can be promoted, etc.) – hence the need for better curation.

3. gawp - August 25, 2010

If citation of data itself was academically acknowledged or some other academically relevant credit was given for making data available in a usable fashion, that would be a good carrot. Right now you’ve got to get users to cite some sort of publication associated with the data to get credit for the data use.

Grant requirements and review to ensure researchers are properly sharing their data (consequences if they’re not) would be a good stick.

Making a well structured and annotated data set available in association with a paper is great “citation bait”; the easier it is to use, the more likely it is to be used. There are some gene expression microarray data sets of a cell differentiation time series we’ve made available that have gotten us quite a few more citations for the associated publication.

4. Andre Vellino - September 1, 2010

This post is now translated into French on a blog aggregation site from France:


5. Bibliotheek en het online leven in Augustus 2010 « Dee'tjes - September 3, 2010

[...] Scientific Research Data [...]

6. geistlogistic » Blog Archive » Introduction to the 1st data set Marketplace at the RecSysTEL 2010, Barcelona, ECTEL 2010 - November 27, 2010

[...] Scientific Research Data ” Synthèse (synthese.wordpress.com) [...]

7. Life in optical » Blog Archive » Science in the open - July 22, 2011

[...] also this interesting post. [...]

8. La difficile accessibilité des données scientifiques « meridianes - July 26, 2011

[...] article is a translation of a post published on Synthèse by Andre [...]

9. Les scientifiques découvrent que seuls 20% des données climatiques sont accessibles « Les moutons enragés - September 27, 2011

[...] article is a translation of a post published on Synthèse by Andre [...]
