
Data Archiving May 7, 2010

Posted by Andre Vellino in Data, Digital library, Information retrieval.

In a previous post, I suggested that the text from publications and the data DOIs referenced in them could be used as metadata for the datasets. Others have thought about addressing this issue too. For instance, the Edinburgh Research Archive's StORe project has as a goal:

…. to address the area of interactions between output repositories of research publications and source repositories of primary research data.
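
To make that idea a little more concrete, here is a minimal sketch (in Python) of how the pairing might work: scan a publication's full text for DOIs, keep only those that resolve to datasets, and treat the passage around each citation as candidate metadata for the dataset. The is_dataset_doi lookup is a hypothetical stand-in for whatever registry query (against DataCite, say) would distinguish dataset DOIs from article DOIs.

    import re

    # Rough DOI pattern ("10.<registrant>/<suffix>"); it will over- and
    # under-match some edge cases, but that is fine for a sketch.
    DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')

    def dataset_metadata_from_publication(full_text, is_dataset_doi, window=300):
        """Collect the text surrounding each dataset DOI cited in a publication."""
        records = {}
        for match in DOI_PATTERN.finditer(full_text):
            doi = match.group().rstrip('.,;)')
            if not is_dataset_doi(doi):   # hypothetical registry lookup
                continue
            start = max(0, match.start() - window)
            end = min(len(full_text), match.end() + window)
            records.setdefault(doi, []).append(full_text[start:end])
        return records

The passages collected this way could then be indexed alongside whatever metadata the repository already holds for each dataset.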

However, linking conventional research output (publications) with the data it depends on may be more of a challenge than it ought to be. The authors of this paper on Constructing Data Curation Profiles found that very few scientists are motivated to deposit their datasets, let alone make them discoverable. What ought to be a technological no-brainer may be a cultural challenge in some quarters.

A few other interesting things emerge from looking at a variety of subject-dependent "Data Curation Profiles":

  • Scientists highly value the ability to view usage statistics (of their data).
  • Data and the tools to create / view / analyse them cannot be separated.
  • There are often many links between the dataset itself and informative text (e.g. documentation) beyond just the publication it leads to.
  • Data, much more so than publications, are heterogeneous – in format, structure and location as well as in the populations that generate them.
  • How the data was obtained is key to understanding its significance.

This last point has a philosophical generalization: data is never “theory-free”. What the data means has to be related to the theory or hypothesis it was meant to test or support. The corollary, from an information-retrieval point of view, is that the theoretical underpinnings of datasets must be tied (via text) to the datasets themselves.

Hence we need the lab notes, the prior literature and the context in which the data was collected (all sources of indexable metadata, I might add). Simply having zip files full of numbers stored in a trusted digital repository isn't going to be of much use – even if there is a lot of bibliographic metadata attached to them.
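
To put the same point in structural terms, an indexable dataset record needs fields like the ones sketched below alongside the bibliographic ones. (The field names are my own invention, not any standard; this is only an illustration of what should be searchable.)

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DatasetRecord:
        # Conventional bibliographic metadata
        doi: str
        title: str
        creators: List[str]
        # Context that gives the numbers their meaning; all of it plain text,
        # and therefore all of it indexable for retrieval.
        hypothesis: str = ""             # the theory or question the data was meant to test
        collection_method: str = ""      # how the data was obtained
        lab_notes: List[str] = field(default_factory=list)
        related_literature: List[str] = field(default_factory=list)   # DOIs or citations
        documentation_urls: List[str] = field(default_factory=list)

        def indexable_text(self) -> str:
            """Flatten everything textual into one string for a full-text indexer."""
            parts = [self.title, self.hypothesis, self.collection_method]
            parts += self.lab_notes + self.related_literature
            return "\n".join(p for p in parts if p)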

    I hope I’m not belabouring the point.

P.S. I really like the Digital Curation Centre's motto: “because good research needs good data”. Brilliant slogan! One day, when data is getting all the funding, I trust that “because good data needs good research” will find its way into a university's motto.

    Comments»

    1. gawp - May 7, 2010

    Good examples of this, though very domain specific, are Gene Expression Omnibus (GEO) and ArrayExpress. If you use microarray or expression data in a publication, most journals require the data and metadata to be deposited in one of these archives. A significant effort went into the creation of the MIAME microarray metadata standard which the sites conform to.

    I would expect that every data type would have to have such a standard created for metadata and linking.

If data could be directly cited and such citations were worth something, this would encourage data archiving by data generators. Right now the linkage is indirect.

We spent a lot of effort putting StemBase data (www.stembase.ca) into GEO, but the occasional direct data citations (to GEO accessions) are worth nothing academically. Usually the associated papers are credited. This indirect citation obscures which of the data sets are being used, however. If data citations were possible, it would be very easy to find a highly cited (and likely usable) data set of a specific type.

PMID:19447786, “A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide”, is a good example of the sort of analysis that archived data and metadata make possible.

    I wonder how data citations would work for meta-analysis like this…
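
As a crude illustration of how indirect things are, about the best proxy available today is to count mentions of GEO series accessions in the full text of papers (assuming you can get the papers as plain text and that accessions follow the usual GSEnnnn form):

    import re
    from collections import Counter

    # GEO series accessions look like "GSE" followed by digits (e.g. GSE10846).
    GEO_SERIES = re.compile(r'\bGSE\d+\b')

    def count_geo_mentions(paper_texts):
        """Tally how many papers mention each GEO series accession: a rough
        proxy for direct data citation while the real infrastructure is missing."""
        counts = Counter()
        for text in paper_texts:
            counts.update(set(GEO_SERIES.findall(text)))  # one count per paper per accession
        return counts.most_common()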

    2. Daniel Lemire - May 7, 2010

One way to solve this problem is to have bidirectional linking between the research papers (or similar byproducts) and the data. Presumably, someone somewhere has to explain how the data was collected.

    Obviously, this linking has to be done somewhat automatically (though the authors must collaborate). Maybe it is a silly thought, but wouldn’t the PLoS people be interested in collaborating with such initiatives?

    As for the researchers…

Some colleagues of mine recently got a large grant from a Canadian funding agency which requires Open Access publications. When asked about it, the researchers had no idea. Nobody ever told them of this requirement. As far as I can tell, they have no intention of abiding by it. And the funding does not seem to be conditional on proving that the papers were Open Access.

    (Of course, Open Data is different from Open Access, but you see my point.)

    3. Scientific Research Data « Synthèse - August 23, 2010

[…] CISTI's Gateway to Scientific Data and other such national sites (e.g. the British National Archives of Datasets) that aggregate information about data sets use bibliographic standards (e.g. Dublin Core) for representing metadata. The advantage is that these standards are not domain-dependent, yet sufficiently rich to express the core elements of the content needed for archiving, storage and retrieval. However, these metadata standards, developed for traditional bibliographic purposes, are not (yet) sufficiently rich to fully capture the wealth of scientific data from all disciplines, as I argued in a previous post. […]
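
For a sense of what such a record carries, a bare-bones Dublin Core description of a dataset might look like the sketch below (element names are from the Dublin Core element set; every value is a placeholder). Notice that nothing in it speaks to hypotheses, instruments or collection methods.

    # A minimal Dublin Core description of a dataset; the fifteen-element set is
    # domain-neutral, which is both its strength and its limitation.
    dataset_record = {
        "dc:title":       "Example gene expression dataset",   # placeholder values throughout
        "dc:creator":     "Example Research Group",
        "dc:subject":     "gene expression; microarray",
        "dc:description": "Expression measurements collected for study X.",
        "dc:date":        "2010-05-07",
        "dc:type":        "Dataset",
        "dc:format":      "text/csv",
        "dc:identifier":  "doi:10.0000/example-dataset",
        "dc:relation":    "doi:10.0000/associated-article",    # link back to the publication
        "dc:rights":      "to be specified by the depositor",
    }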

    4. La difficile accessibilité des données scientifiques « meridianes - July 26, 2011

[…] CISTI's Gateway to Scientific Data (the Institut canadien de l'information scientifique et technique, ICIST) and other national sites (such as the British National Archives of Datasets) that gather information about datasets use bibliographic standards (such as Dublin Core) to represent metadata. The advantage is that these standards are not domain-dependent and are already rich enough to express the key elements needed to archive and retrieve data. However, these metadata standards, developed for traditional librarianship, are not (yet) rich enough to fully capture the complexity of scientific data from all disciplines, as I argued in a previous post. […]

    5. Les scientifiques découvrent que seuls 20% des données climatiques sont accessibles « Les moutons enragés - September 26, 2011

[…] CISTI's Gateway to Scientific Data (the Institut canadien de l'information scientifique et technique, ICIST) and other national sites (such as the British National Archives of Datasets) that gather information about datasets use bibliographic standards (such as Dublin Core) to represent metadata. The advantage is that these standards are not domain-dependent and are already rich enough to express the key elements needed to archive and retrieve data. However, these metadata standards, developed for traditional librarianship, are not (yet) rich enough to fully capture the complexity of scientific data from all disciplines, as I argued in a previous post. […]

