The Cost (vs. Value) of Data Curation

October 2, 2010

Posted by Andre Vellino in Data, Open Access.

There is a tension between the cost (to the curator) of data-curation and the potential value (to others) of making data (e.g. data from scientific experiments) available. For the purposes of selection, it would be nice to know ahead of time whether the data you wish to make available (now) is ever going to have value (in the future).

Unfortunately, you can’t predict that ahead of time, because you don’t know (i) who your data-users might turn out to be or (ii) how circumstances might change to turn what was previously an irrelevant-seeming piece of data into a planet-saving one.

Indeed, it’s impossible to know how any element of data might be used, for what purpose, and by whom. For instance, consider the Nuclear Magnetic Resonance spectra that you have collected for the purpose of analyzing the structure and composition of a pathogen. Might they not be fruitfully reused in the future to understand the bias of an improperly calibrated instrument, or indeed (for technology historians of the future) to understand how NMR spectroscopy technology (which may by then seem “primitive”) was used in the 20th and early 21st centuries?

So, how are we to interpret Weinberger’s advice in “Everything is Miscellaneous”?

  • “The solution to overabundance of information is more information”
  • “Filter on the way out, not on the way in”
  • “Put each leaf on as many branches as possible”
  • “Everything is metadata and can be a label”
  • “Give up control”
  • “A ‘topic’ is anything someone somewhere is interested in.”

The cash value of this advice for data: publish as much data as you can; give users as many ways as you can to get at it (e.g. APIs but also user interfaces); give users as many ways as you can to add more data (tags, metadata, text, links to other data – viz. “linked data”).

Which is fine advice if you assume that publishing data, like putting text and images on the internet, is (almost) free. But publishing data isn’t (yet) close to free. Why? Because it (still) needs to be curated by someone who understands how to annotate it in at least the obvious ways in which it may be useful – e.g. to other contemporary scientists.

Prediction: either scientists will have to be trained to become data-curators, or the process of creating data will have to generate the metadata, or e-librarians will have to be trained in the sciences (inclusive sense of “or”).


