
International Digital Curation Conference 2015 February 17, 2015

Posted by Andre Vellino in Data, Data Curation.

I had never intended to leave this blog void of entries in 2014, let alone leave it with a “top 10” list as the last entry. So it’s time to re-boot Synthese with a short report on the 2015 International Digital Curation Conference.

The opening keynote by Tony Hey was both a master-class in how to give a compelling lecture and an impressive demonstration of how much one person can know about his field.  When the video of this talk comes out, watch it!

It was also great to see such a wide variety of topics in the poster sessions: a poster on Data Citation was the award winner (I still can’t believe that the graduate student behind it had to pay for her own subscription to Web of Science to do the research!). The runner-up award for best paper went to work on adding authorship attribution metadata to climate datasets.

Climate data figured quite prominently, with at least three talks: one on implementing the ISO standard MOLES3 (Metadata Objects Linking Environmental Sciences) at the Centre for Environmental Data Archival, a second on Twenty years of data management in the British Atmospheric Data Centre, and my own on Harmonizing metadata among diverse climate change datasets.

There were three parallel sessions on the second day – one just has to be resigned to giving up on two thirds of the interesting talks. I did go to this one: A system for distributed minting and management of persistent identifiers, which I found especially intriguing. In a sentence, it proposes to do for digital identifiers (e.g. DOIs) what Bitcoin does for money. In other words, it’s a Bitcoin-like, distributed and secure method of generating unique identifiers. I hope it succeeds.
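
To make the Bitcoin analogy concrete, here is a toy sketch of my own (not the scheme from the paper) of how hash-chained blocks can mint identifiers without a central registry: each new identifier is derived from a hash that commits to the entire history of the chain, so uniqueness needs no single authority and tampering is detectable.

```python
import hashlib
import json
import time

def mint_identifier(chain, payload):
    """Mint a unique identifier by appending a block to a hash chain.

    A toy illustration of blockchain-style minting, not the protocol
    from the IDCC paper; 'payload' stands in for whatever metadata
    describes the object being identified.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"payload": payload, "prev": prev_hash, "time": time.time()}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()
    ).hexdigest()
    chain.append(block)
    # The identifier commits to the whole chain via the hash of this block.
    return "pid:" + block["hash"][:16]

chain = []
print(mint_identifier(chain, {"title": "A climate dataset"}))
print(mint_identifier(chain, {"title": "Another dataset"}))
```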

This talk by Ph.D. student Tiffany Chao, Mapping methods metadata for research data, struck me as a perfect application for text mining. She proposes extracting the Methods and Instrumentation sections from the National Environmental Methods Index to generate metadata descriptors for the corresponding datafiles. Right now the work is being done by hand to demonstrate its feasibility, but a machine could do it too.
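
As a sketch of what the automated version might look like: rank the terms of each Methods section by TF-IDF and keep the top few as descriptors. The snippets below are invented stand-ins for NEMI entries, and this is my illustration rather than Chao's actual method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for Methods/Instrumentation sections, one per dataset.
methods_sections = [
    "Samples were analyzed by gas chromatography with flame ionization detection.",
    "Water temperature was recorded hourly with a submersible thermistor probe.",
    "Nitrate concentration was determined by ion chromatography after filtration.",
]

# Rank terms by TF-IDF within each section; the top-scoring terms become
# candidate metadata descriptors for the corresponding datafile.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(methods_sections)
terms = vectorizer.get_feature_names_out()

for i in range(len(methods_sections)):
    top = tfidf[i].toarray().ravel().argsort()[::-1][:3]
    print([terms[j] for j in top])
```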

I registered for a DataCarpentry workshop to “access life science data available on the web”.  I learned a little R programming, discovered the ROpenSci repository and got my feet wet with the AntWeb and Gender packages. I look forward to graduating to rWBclimate, an R interface to the World Bank climate data in the climate knowledge portal.

One treasure trove led to another. I gate-crashed a small visualization hackathon workshop at which I discovered the British Library’s digital collection and the 1001 things that could be done with it if you had a small army of Digital Humanities graduate students at your disposal. Hopefully, that’s exactly what’s going to happen when the Universities of Cambridge, Oxford, Edinburgh, Warwick and University College London start collaborating at the Alan Turing Institute (to be located in the British Library).

The Data Spring Workshop was exciting in a different way – a lot of presenters gave lightning talks on their practical problems and solutions with managing data. There was so much, I can hardly remember any of it! One item stood out for me, though, because it addresses my pain: a method for re-creating and preserving the environments for computational experiments. It took me about 1.2 minutes to become an instant convert to the Recomputation.org mission.

This only skims the surface, but it will have to do for now.

The End of Files December 8, 2012

Posted by Andre Vellino in Data, Digital library.

A few weeks ago, I boldly predicted in my class on copyright that the computer file was as doomed in the annals of history as the piano roll (the last of which was printed in 2008 – see this documentary video on YouTube about how they are made and copied!)

This is a slightly different prediction than the one made by the Economist in 2005: Death to Folders. Their argument was that the folder as a method of organizing files was obsolete and that search, tagging and “smart folders” were going to change everything. My assertion is that the very notion of a file – these things that are copied, edited, and executed by computers – will eventually disappear (to the end-user, anyway).

The path to the “end of files” is more than just a question of masking the underlying data representation from the user. It is true that Apps (as designed for mobile devices) have begun to do that as a convenient way of hiding the details of a file from the user – be it an application file or a document file. The reason that Apps (generally) contain within them the (references to) data items (i.e. files) that they need, particularly if the information is stored in the cloud, is to provide a Digital Rights Management scheme. Which is no doubt why this App model is slowly creeping its way from mobile devices to mainstream laptops and desktops (viz. Mac OS Mountain Lion and Windows 8).

But this is just the beginning. There’s going to be a paradigm shift (a perfectly fine phrase, when it’s used correctly!) in our mental representations of computing objects, and it is going to be more profound than merely masking the existence of the underlying representation. I think the new paradigm that will replace “file” is going to be: “the set of information items and interfaces that are needed to perform some action in the current use-context”.

Consider, as an example of this trend towards the new paradigm, Wolfram’s Computable Document Format. In this model, documents are created by dynamically assembling components from different places and performing computations on them. There are distributed, raw information components – data mostly – that are assembled in the application and don’t correspond to a “file” at all. Or consider information mashups like Google Maps, where restaurant reviews and recommendations are generated as a function of search history, location, and user identity. These “content-bundles”, for want of a better phrase, are definitely not files or documents but, from the end-user’s point of view, they are also indistinguishable from them.

Even MS Word DocX “files” are instances of this new model. The Office Open XML file format is a standardized data structure: XML components bound together in a zip file. Imagine de-regimenting this convention a little and what constitutes a “document” could change quite significantly.
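
You can verify this yourself in a couple of lines – a .docx is just a zip archive of XML parts (the filename below is a placeholder for any Word document you have on hand):

```python
import zipfile

# A DocX "file" is really a zip archive of XML components.
with zipfile.ZipFile("example.docx") as docx:
    for name in docx.namelist():
        print(name)  # e.g. [Content_Types].xml, word/document.xml, word/styles.xml
```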

Conventional, static files will continue to exist for some time, and version control systems will continue to provide change management services to what we now know as “files”. But I predict that my grandchildren won’t know what a file is – and won’t need to. The procedural instructions required for assembling information packages out of components, including the digital rights constraints that govern them, will eventually dominate the world of consumable digital content to the point where the idea of a file will be obsolete.

Building a Better Citation Index March 20, 2012

Posted by Andre Vellino in Citation, Data, Open Source.

Scholars in a variety of disciplines (not just bibliometrics!) have been building better measures of scholarly output. First came the H-index in 2005, followed by the G-index in 2006, and both are now among the standard measures.
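
For readers who haven't computed these measures themselves, here is a minimal sketch of both, given a list of per-paper citation counts (the counts below are made up):

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    counts = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(counts) if c >= i + 1)

def g_index(citations):
    """Largest g such that the top g papers have >= g**2 citations in total."""
    counts = sorted(citations, reverse=True)
    total, g = 0, 0
    for i, c in enumerate(counts):
        total += c
        if total >= (i + 1) ** 2:
            g = i + 1
    return g

papers = [25, 8, 5, 3, 3, 1, 0]  # hypothetical citation counts
print(h_index(papers))  # 3: three papers have at least 3 citations each
print(g_index(papers))  # 6: the top six papers total 45 >= 36 citations
```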

However, as Daniel Lemire points out in his latest blog post, the raw data of mere citations is pretty crude. In any given article, it’s often hard to tell which of the (typically) dozens of references are “en passant” (to fend off the critics who might think you haven’t read the literature) or incidental to the substance of the article. What’s interesting for the authors of the articles being cited is the question: “how critical is this citation to the author who cited me?”

One way to find out (and hence, perhaps, to build a better citation measure) is to train a machine learning algorithm to extract “key citations” – by analogy with extracting “key phrases” from a text (see Peter Turney’s 2000 article Machine Learning Algorithms for Keyphrase Extraction). As a starting point, we’d like to compile data from researchers by asking: “What are the key references of your papers?”
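
A minimal sketch of what such a classifier could look like, assuming per-citation features (e.g. how often the reference is mentioned in the body, whether it appears in the methods section) have already been extracted – the features and labels here are invented for illustration, and this is not Turney's algorithm:

```python
from sklearn.linear_model import LogisticRegression

# Invented features per citation: [mentions in body, cited in methods (0/1),
# cited in introduction (0/1)].  Labels: 1 = essential, 0 = en passant,
# as a survey like the one below could provide.
X = [
    [5, 1, 1],
    [1, 0, 0],
    [3, 1, 0],
    [1, 0, 1],
    [6, 1, 1],
    [2, 0, 0],
]
y = [1, 0, 1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)

# Score a new citation: mentioned 4 times, cited in the methods section.
print(clf.predict_proba([[4, 1, 0]])[0][1])  # probability it is "key"
```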

It will take 10 minutes: please fill in this Google Docs questionnaire. In it we ask you, as the author of an article, to tell us which 1, 2, 3 or 4 references are essential to that article. By an essential reference, we mean a reference that was highly influential or inspirational for the core ideas in your paper; that is, a reference that inspired or strongly influenced your new algorithm, your experimental design, or your choice of a research problem.

When this survey is completed, we will be releasing the resulting data set under the ODC Public Domain Dedication and Licence so that you can use this data in other ways, if you wish.

What is ‘Data’? June 14, 2011

Posted by Andre Vellino in Data, Data Mining, Information retrieval.

“What does ‘data’ mean to you?” I innocently asked various participants at JCDL 2011 today. I had just come out of a very interesting panel discussion entitled “Big Data, Big Deal?” at which most of the discussion was about large amounts of proprietary text at http://www.hathitrust.org/ (some of the discussion was also about large amounts of music in the SALAMI project at McGill).

Now I am very interested in text and text retrieval (and in music IR too), and I found the panel discussion most rewarding. But it wasn’t about what I had been expecting it to be about (from the title), and I was perplexed by this use of the term “data” in this context. After all, the subtitle of the JCDL 2011 conference is “Bringing Together Scholars, Scholarship and Research Data”. So the context for “data” was (for me) “research data”, in a sense of the term that is pretty much the same as in the first three sentences of the Wikipedia entry for Data:

The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of “datum”) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived.

So I was somewhat taken aback by the argument that ensued. Everyone, it seems (except me), was quite happy to speak of “Big data” and “large amounts of text” as synonymous.  As though the streams of bytes that are common to readings from an NMR spectrometer, digital music and electronic journal articles were in all significant respects indistinguishable.

Of course, large volumes of byte-sequences share some kinds of problems like storage, preservation and search. But “text data” is a different kind of beast, isn’t it? For one thing, text typically has meaning – cognitive content that is different from, say, music or images or spreadsheets of temperature variations in Glasgow over the past 500 years. It has more structure too, as evidenced by how efficiently it compresses and how (relatively) easy it is to search.

I’m happy to speak of data about text that is inferred by the act of mining text.  Word frequencies, ngrams, term clusters, sentiment categories etc. fit the definition of “data” above. Even the textual “meta-data” about text is data of a certain kind. But the text itself just doesn’t seem to be that kind of thing (qualitative or quantitative attributes of a variable).
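
To make the distinction concrete, here is a toy sketch of deriving such data from a text – word frequencies and bigram counts are quantitative attributes of the text, even though the text itself isn't:

```python
from collections import Counter

text = "data are often viewed as the lowest level of abstraction"
words = text.split()

# Word frequencies and bigram counts are "data about text": quantitative
# attributes derived from the text by mining it.
print(Counter(words).most_common(3))
print(Counter(zip(words, words[1:])).most_common(2))
```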

Mendeley Data vs. Netflix Data November 2, 2010

Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.

Mendeley, the on-line reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender Systems Conference in Barcelona.

The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.

I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T 2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of three orders of magnitude!
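
The arithmetic, as a quick sketch (using the dataset sizes quoted in this post, with 17,700 as the Netflix title count):

```python
# Density = observed edges / possible edges in the bipartite user-item graph.
mendeley = 4.8e6 / (50_000 * 3.6e6)   # ~2.67e-05
netflix = 100e6 / (480_000 * 17_700)  # ~1.18e-02
print(f"{mendeley:.2e} vs {netflix:.2e}")  # roughly three orders of magnitude apart
```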

But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e.  the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).

In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.6%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Less than 6% of the articles were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]

By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17K titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)

I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in the Mendeley data. Some additional information, such as article citation data or a content attribute like the categories to which the articles belong, is going to be needed to get any kind of reasonable accuracy from a recommender system.

Or it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems”, published in the Proceedings of the National Academy of Sciences (PNAS), could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.
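
A toy numpy sketch of the heat-spreading idea as I understand it from the paper: each candidate item's score is an average (rather than a sum) of the "heat" received through the users who hold it, which is what makes the method gentle on low-degree items like Mendeley's once-referenced articles. The matrix below is invented, and this omits the paper's hybrid with probability-spreading.

```python
import numpy as np

# Toy user-item matrix: rows = users, columns = items (1 = collected).
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

user_deg = A.sum(axis=1)  # items per user
item_deg = A.sum(axis=0)  # users per item

# Heat-spreading transition:
# W[a, b] = (1 / item_deg[a]) * sum_i A[i, a] * A[i, b] / user_deg[i]
W = (A.T / item_deg[:, None]) @ (A / user_deg[:, None])

target_user = A[0]                 # recommend for the first user
scores = W @ target_user
scores[target_user > 0] = -np.inf  # mask items the user already has
print(scores)                      # rank the remaining items by score
```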

The Cost (vs. Value) of Data Curation October 2, 2010

Posted by Andre Vellino in Data, Open Access.

There is a tension between the cost (to the curator) of data-curation and the potential value (to others) of making data (e.g. data from scientific experiments) available. For the purposes of selection, it would be nice to know ahead of time whether the data you wish to make available (now) is ever going to have value (in the future).

Unfortunately, you can’t predict that ahead of time because (i) you don’t know who your data users might turn out to be and (ii) you don’t know how circumstances might change in ways that turn what previously seemed an irrelevant piece of data into a planet-saving one.

Indeed, it’s impossible to know how any element of data might be used, for what purpose, and by whom. For instance, consider whether the Nuclear Magnetic Resonance spectra that you collected for the purpose of analyzing the structure and composition of a pathogen might not be fruitfully reused in the future for the purpose of understanding the bias of an improperly calibrated instrument, or indeed (for the technology historians of the future) for documenting how what may by then seem a “primitive” NMR spectroscopy technology was used in the 20th and early 21st centuries.

So, how are we to interpret Weinberger’s advice in “Everything is Miscellaneous”?:

  • “The solution to overabundance of information is more information”
  • “Filter on the way out, not on the way in”
  • “Put each leaf on as many branches as possible”
  • “Everything is metadata and can be a label”
  • “Give up control”
  • “A ‘topic’ is anything someone somewhere is interested in.”

The cash value of this advice for data: publish as much data as you can; give users as many ways as you can to get at it (e.g. APIs but also user interfaces); give users as many ways as you can to add more data (tags, metadata, text, links to other data – viz. “linked data”).

Which is fine advice if you assume that publishing data, like putting text and images on the internet, is (almost) free. But publishing data isn’t (yet) close to free. Why? Because it (still) needs to be curated by someone who understands how to annotate it in at least the obvious ways in which it may be useful – e.g. to other contemporary scientists.

Prediction: either scientists will have to be trained to become data curators, or the process of creating data will have to generate the metadata, or e-librarians will have to be trained in the sciences (inclusive sense of “or”).