Data Archiving
May 7, 2010
Posted by Andre Vellino in Data, Digital library, Information retrieval.
In a previous post, I suggested that the text of publications, together with the data DOIs referenced in them, could be used as metadata for the datasets. Others have thought about this issue too. For instance, the Edinburgh Research Archive’s StORe project has as a goal:
…. to address the area of interactions between output repositories of research publications and source repositories of primary research data.
However, linking conventional research output (publications) with the data they depend on may be more of a challenge than it ought to be. The authors of this paper on Constructing Data Curation Profiles discovered that very few scientists are motivated to deposit their datasets, let alone make them discoverable. What ought to be a technological no-brainer may be a cultural challenge in some quarters.
A few other interesting things emerge from looking at a variety of subject-dependent “Data Curation Profiles”:
- Scientists highly value the ability to view usage statistics (of their data).
- Data and the tools to create / view / analyse them cannot be separated.
- There are often many links between the dataset itself and informative text (e.g. documentation) other than the publication that it leads to.
- Data, much more so than publications, are heterogeneous – in format, structure and location as well as in the populations that generate them.
- How the data was obtained is key to understanding its significance.
This last point has a philosophical generalization: data is never “theory-free”. Why the data means anything has to be related to the theory or hypothesis that it was meant to test or support. The corollary from an information retrieval point of view is that the theoretical underpinnings of datasets must be tied (via text) to the datasets themselves.
Hence we need the lab notes, the prior literature and the context for which data is collected (all sources of indexable metadata, I might add). Simply having zip files with numbers in them stored in a trusted digital repository isn’t going to be of much use – even if there is a lot of bibliographic metadata attached to them.
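To make the retrieval point concrete, here is a minimal sketch in Python of what indexing that context might look like. Everything in it is an assumption for illustration: the DOIs use a made-up prefix, the lab notes and excerpts are invented, and a real system would use a proper search engine rather than a hand-rolled inverted index.

```python
import re
from collections import defaultdict

# Hypothetical records: each (made-up) dataset DOI is paired with the
# contextual text that mentions it: lab notes, documentation, excerpts.
CONTEXT = {
    "10.9999/example.ds1": [
        "Lab notes: soil cores sampled at 10 cm depth to test the nitrogen leaching hypothesis.",
        "Publication excerpt: results support the nitrogen leaching model.",
    ],
    "10.9999/example.ds2": [
        "Documentation: spectrometer calibration readings for the protein folding study.",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(context):
    """Build an inverted index mapping each term to the DOIs whose context contains it."""
    index = defaultdict(set)
    for doi, texts in context.items():
        for text in texts:
            for term in tokenize(text):
                index[term].add(doi)
    return index


def search(index, query):
    """Return the DOIs whose context text contains every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = build_index(CONTEXT)
print(sorted(search(index, "nitrogen hypothesis")))
```

A query on the hypothesis finds the dataset only because the lab notes were indexed with it; the numbers in the zip file would never match.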
I hope I’m not belabouring the point.
P.S. I really like the Digital Curation Centre’s motto: “because good research needs good data”. Brilliant slogan! One day when data is getting all the funding, I trust that “because good data needs good research” will find its way into a university’s motto.