Scientific Research Data
August 23, 2010
Posted by Andre Vellino in Data, Information, Information retrieval, Open Access.
Scientific research data is, without a doubt, a central component in the lifecycle of knowledge production. For one thing, scientific data is critical to the corroboration (or falsification) of theories. Equally important to the process of scientific inquiry is making this data openly available to others – as is vividly demonstrated by the so-called “ClimateGate” controversy and the more recent cloud over Marc Hauser’s research data on primate cognition. The public accessibility of data enables open peer review and encourages the reproducibility of results.
Hence the importance of data management practices in 21st century science libraries: the curation of, access to and preservation of scientific research data sets will be critical to the future of scientific discourse.
It is true that “Big Science” has been in the business of curating “reference data” for years. Institutional data centers in many disciplines have been gathering large amounts of data in databases that contain the fruit of years of research. GenBank, for instance, is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (containing over 150,000 sequence records).
However, other kinds of data gathered by scientists are either transient or highly context-dependent and are not being preserved, either by individuals or by institutions, for the long-term benefit of future research. This might not be so serious for data that are reproducible – either by experiment or simulation – but much of this data, such as measurements of oil content and dissipation rates in the Gulf of Mexico water column in 2010, is uniquely valuable and irreproducible.
As I indicated in a previous post, one development that will help redress the problems endured by small, orphaned and inaccessible datasets is the emergence of methods for uniquely referencing datasets, such as the data DOIs that are being implemented by DataCite partners. With the combination of data-deposit policies from science research funding agencies (such as NSF in the US and NSERC in Canada) and peer recognition from university faculty for contributions to data repositories, the status of data publication and referencing will soon grow to match the present status of scholarly publications.
In parallel, the growing “open access for Data” movement and other initiatives to increase the availability of data generated by government and government-funded institutions (including NASA, the NIH and the World Bank) are now well underway in a manner consistent with the OECD’s principles, which, incidentally, offer a long and convincing list of economic and social benefits to be obtained from making scientific research data accessible.
In particular, the United States, the UK and Australia are spearheading the effort to make public and scientific research data more accessible. For instance, in the U.S., a recent report to President Obama from the National Science and Technology Council (NSTC) details a comprehensive strategy to promote the preservation of and access to digital scientific data.
These reports and initiatives show that the momentum is building globally to realize visions that have been articulated in principle by several bodies concerned with the curation and archiving of data in the first decade of the 21st century (see To Stand the Test of Time and Long Lived Scientific Data Collections).
In Canada, several similar reports such as the Consultation on Access to Scientific Research Data and the Canadian Digital Information Strategy also point to the need for the national stewardship of digital information, not least scientific data sets. Despite much discussion, systematic efforts in the stewardship of Canadian digital scientific data sets are still only at the preliminary stages. There are well-managed and curated reference data in domains such as earth science (Geogratis) and astronomy (Canadian Astronomy Data Centre), which have a community of specialist scientific users whose needs are generally well met. By contrast, the data-management needs of individual scientists in small, less well funded research groups are not being met, and their data is either impossible to find or lost.
One impediment to the effective bibliographic curation of data sets is the absence of common standards. There are currently “no rules about how to publish, present, cite or otherwise catalogue datasets.” [Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing]
CISTI’s Gateway to Scientific Data sets and other such national sites (e.g. the British National Archives of Datasets) that aggregate information about data sets use bibliographic standards (e.g. Dublin Core) for representing metadata. The advantage is that these standards are not domain-dependent yet are sufficiently rich to express the core elements of the content needed for archiving, storage and retrieval. However, these metadata standards, developed for traditional bibliographic purposes, are not (yet) sufficiently rich to fully capture the wealth of scientific data from all disciplines, as I argued in a previous post.
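To make the idea of “core elements” concrete, here is a minimal sketch in Python of what a Dublin Core description of a data set might look like. It uses the standard Dublin Core element set namespace; everything else (the `record` wrapper element, the `dublin_core_record` helper, and all of the field values, including the DOI) is a made-up placeholder for illustration and does not describe any real data set or the format used by any particular repository.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a flat Dublin Core record from (element, value) pairs.

    The "record" wrapper element is an arbitrary choice for this sketch,
    not something prescribed by Dublin Core.
    """
    record = ET.Element("record")
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


# All values below are invented placeholders, not a real data set.
example_fields = [
    ("title", "Example water-column measurements"),
    ("creator", "Example Research Group"),
    ("date", "2010-08-23"),
    ("type", "Dataset"),
    ("identifier", "doi:10.0000/example"),
    ("subject", "oceanography"),
]

record = dublin_core_record(example_fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

None of the elements used here has a dedicated place for units, instruments or sampling methods, which is the kind of discipline-specific detail that a bibliographic core set tends to leave out.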
One of the major concerns when deciding on the feasibility of creating a data repository is the cost associated with the deposit, curation and long-term preservation of research data. Typically, costs depend on a variety of factors, including how each of the typical phases (planning, acquisition, disposal, ingest, archive, storage, preservation and access services) is deployed (see the JISC reports “Keeping Research Data Safe” Part 1 and Part 2). The costs associated with different data collections are also likely to vary considerably according to how precious (rare/valuable) the stored information is and what the requirements are for access over time.
One point to note from the “Keeping Research Data Safe” reports commissioned for JISC is that
“the costs of archiving activities (archival storage and preservation planning and actions) are consistently a very small proportion of the overall costs and significantly lower than the costs of acquisition/ingest or access.”
In short – librarianship for datasets is critical to the future of science, and technology costs are the least of our concerns.