The End of Files (December 8, 2012). Posted by Andre Vellino in Data, Digital library.
A few weeks ago, I boldly predicted in my class on copyright that the computer file was as doomed in the annals of history as the piano roll (the last of which was printed in 2008 – see this documentary video on YouTube on how they are made and copied!)
This is a slightly different prediction than the one made by the Economist in 2005: Death to Folders. Their argument was that the folder as a method of organizing files was obsolete and that search, tagging and “smart folders” were going to change everything. My assertion is that the very notion of a file – these things that are copied, edited and executed by computers – will eventually disappear (to the end-user, anyway).
The path to the “end of files” is more than just a question of masking the underlying data-representation from the user. It is true that Apps (as designed for mobile devices) have begun to do that as a convenient way of hiding the details of a file from the user – be it an application file or a document file. The reason that Apps (generally) contain within them the (references to) data-items (i.e. files) that they need, particularly if the information is stored in the cloud, is to provide a Digital Rights Management scheme. Which is no doubt why this App model is slowly creeping its way from mobile devices to mainstream laptops and desktops (viz. OS X Mountain Lion and Windows 8).
But this is just the beginning. There’s going to be a paradigm shift (a perfectly fine phrase, when it’s used correctly!) in our mental representations of computing objects, and it is going to be more profound than merely masking the existence of the underlying representation. I think the new paradigm that will replace “file” is going to be: “the set of information items and interfaces that are needed to perform some action in the current use-context”.
Consider as an example of this trend towards the new paradigm Wolfram’s Computable Document Format. In this model, documents are created by dynamically assembling components from different places and performing computations on them. There are distributed, raw information components – data mostly – that are assembled in the application and don’t correspond to a “file” at all. Or consider information mashups like Google Maps, in which restaurant reviews and recommendations are generated as a function of search-history, location, and user-identity. These “content-bundles”, for want of a better phrase, are definitely not files or documents but, from the end-user’s point of view, they are indistinguishable from them.
Even MS Word DocX “files” are instances of this new model. The Office Open XML file format is a standardized data-structure: XML components bound together in a zip file. Imagine de-regimenting this convention a little and what constitutes a “document” could change quite significantly.
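As a concrete illustration, here is a sketch (Python standard library only) that builds and inspects a minimal DocX-style container. The part names mimic the real .docx layout, but the placeholder content is mine and would not open in Word:

```python
import io
import zipfile

# Build a minimal DocX-like container in memory: just a zip of XML parts.
# Part names mirror the Office Open XML layout; the content is a stand-in.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<w:document>Hello</w:document>")
    z.writestr("word/styles.xml", "<w:styles/>")

# From this angle, a "document" is just the list of XML components
# bound together in the zip.
with zipfile.ZipFile(buf) as z:
    parts = z.namelist()

print(parts)
```

Renaming a real .docx to .zip and unzipping it shows the same structure.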
Conventional, static files will continue to exist for some time, and version control systems will continue to provide change management services for what we now know as “files”. But I predict that my grandchildren won’t know what a file is – and won’t need to. The procedural instructions required for assembling information-packages out of components, including the digital rights constraints that govern them, will eventually dominate the world of consumable digital content to the point where the idea of a file will be obsolete.
Building a Better Citation Index (March 20, 2012). Posted by Andre Vellino in Citation, Data, Open Source.
Scholars in a variety of disciplines (not just bibliometrics!) have been building better measures of scholarly output. First came the H-index in 2005, followed by the G-index in 2006; both are now part of the standard measures of scholarly output.
However, as Daniel Lemire points out in his latest blog post, the raw data of mere citations is pretty crude. In any given article, it’s often hard to tell which of the (typically) dozens of references are “en passant” (to fend off the critics who might think you haven’t read the literature) or incidental to the substance of the article. What’s interesting for the authors of the articles being cited is the question: “how critical is this citation to the author who cited me?”
One way to find out (and hence, perhaps, to build a better citation measure) is to train a Machine Learning algorithm to extract “key citations” – by analogy with extracting “key phrases” from a text (see Peter Turney’s 2000 article Machine Learning Algorithms for Keyphrase Extraction). As a starting point, we’d like to compile data from researchers by asking the question: “What are the key references in your papers?”
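In the meantime, one crude feature such an algorithm might use is how often each reference is mentioned in the body of an article; references cited repeatedly are more likely to be essential than those cited once, en passant. The bracketed-citation format, the example text and the threshold below are illustrative assumptions, not part of the survey:

```python
import re
from collections import Counter

# Count how often each numbered reference (e.g. "[3]") appears in the body.
# This is only one candidate feature for a key-citation classifier.
body = (
    "We build on the diffusion model of [3]. As [3] showed, sparse graphs "
    "need smoothing; see also [1], [2]. Our evaluation follows [3]."
)

mentions = Counter(re.findall(r"\[(\d+)\]", body))
key_candidates = [ref for ref, n in mentions.items() if n >= 2]
print(mentions)        # Counter({'3': 3, '1': 1, '2': 1})
print(key_candidates)  # ['3']
```

A trained model would combine several such features (mention count, section of first mention, proximity to the stated contribution) with the survey labels as ground truth.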
It will take 10 minutes: please fill in this Google Docs questionnaire. In it we ask you, as the author of an article, to tell us which 1, 2, 3 or 4 references are essential to that article. By an essential reference, we mean one that was highly influential or inspirational for the core ideas in your paper; that is, a reference that inspired or strongly influenced your new algorithm, your experimental design, or your choice of a research problem.
When this survey is completed, we will be releasing the resulting data set under the ODC Public Domain Dedication and Licence so that you can use this data in other ways, if you wish.
What is ‘Data’? (June 14, 2011). Posted by Andre Vellino in Data, Data Mining, Information retrieval.
“What does ‘data’ mean to you?” I innocently asked various participants at JCDL 2011 today. I had just come out of a very interesting panel discussion entitled “Big Data, Big Deal?” at which most of the discussion was about large amounts of proprietary text at http://www.hathitrust.org/ (some of the discussion was also about large amounts of music in the SALAMI project at McGill).
Now, I am very interested in text and text retrieval (and music IR too), and I found the panel discussion most rewarding. But it wasn’t about what I had been expecting it to be about (from the title), and I was perplexed by this use of the term “data” in this context. After all, the subtitle of the JCDL 2011 conference is “Bringing Together Scholars, Scholarship and Research Data”. So the context for “data” was (for me) “research data”, in pretty much the same sense as the first three sentences of the Wikipedia entry for Data:
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of “datum”) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived.
So I was somewhat taken aback by the argument that ensued. Everyone, it seems (except me), was quite happy to speak of “Big data” and “large amounts of text” as synonymous. As though the streams of bytes that are common to readings from an NMR spectrometer, digital music and electronic journal articles were in all significant respects indistinguishable.
Of course, large volumes of byte-sequences share some kinds of problems like storage, preservation and search. But “text data” is a different kind of beast, isn’t it? For one thing, text typically has meaning – cognitive content that is different from, say, music or images or spreadsheets of temperature variations in Glasgow over the past 500 years. It has more structure too, as evidenced by how efficiently it compresses and how (relatively) easy it is to search.
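The compression point is easy to demonstrate. A small sketch, assuming only Python’s standard library, comparing repetitive English text with random bytes of the same length:

```python
import os
import zlib

# English text compresses far better than random bytes of the same length,
# one piece of evidence that text has exploitable structure.
text = b"the quick brown fox jumps over the lazy dog " * 100
noise = os.urandom(len(text))

text_ratio = len(zlib.compress(text)) / len(text)
noise_ratio = len(zlib.compress(noise)) / len(noise)

print(f"text:  {text_ratio:.2f}")   # well under 1.0
print(f"noise: {noise_ratio:.2f}")  # about 1.0 (random data doesn't compress)
```

The toy text here is deliberately repetitive, but ordinary prose also compresses to a fraction of its raw size, while instrument noise barely compresses at all.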
I’m happy to speak of data about text that is inferred by the act of mining text. Word frequencies, ngrams, term clusters, sentiment categories etc. fit the definition of “data” above. Even the textual “meta-data” about text is data of a certain kind. But the text itself just doesn’t seem to be that kind of thing (qualitative or quantitative attributes of a variable).
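For instance, a minimal sketch of deriving such data about text, using only Python’s standard library (the sentence is an arbitrary example of mine):

```python
from collections import Counter

# Data *about* text: word frequencies and bigrams, the simplest of the
# derived quantities mentioned above. These counts fit the Wikipedia
# definition of data; the sentence they are mined from does not.
sentence = "data about text is inferred by the act of mining text"
words = sentence.split()

word_freq = Counter(words)
bigrams = Counter(zip(words, words[1:]))

print(word_freq.most_common(2))  # [('text', 2), ('data', 1)]
print(len(bigrams))              # 10 distinct bigrams
```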
Mendeley Data vs. Netflix Data (November 2, 2010). Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.
Mendeley, the on-line reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender System Conference in Barcelona.
The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.
I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of 3 orders of magnitude!
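The arithmetic, with the approximate counts quoted in this post:

```python
# Sparsity = observed edges in the bipartite user-item graph divided by
# all possible edges (users x items). Counts are the approximate figures
# quoted here, not exact dataset statistics.
def sparsity(num_ratings, num_users, num_items):
    return num_ratings / (num_users * num_items)

mendeley = sparsity(4.8e6, 50_000, 3.6e6)   # ~4.8M references, 50K users, 3.6M unique articles
netflix = sparsity(100e6, 480_000, 17_700)  # ~100M ratings, ~480K users, ~17.7K titles

print(f"Mendeley: {mendeley:.2e}")  # ~2.67e-05
print(f"Netflix:  {netflix:.2e}")   # ~1.18e-02
```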
But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e. the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).
In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.6%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Less than 6% of the articles were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]
By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17K titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)
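Checking the long-tail arithmetic from the Mendeley counts above:

```python
# The long tail of the Mendeley data, from the counts quoted above.
unique_articles = 3_652_286
one_reader = 3_055_546
two_readers = 378_114

three_or_more = unique_articles - one_reader - two_readers
print(f"{one_reader / unique_articles:.2%}")     # 83.66%
print(f"{three_or_more / unique_articles:.2%}")  # 5.99%, i.e. less than 6%
```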
I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in Mendeley data. Some additional information such as article citation data or some content attribute such as the categories to which the articles belong is going to be needed to get any kind of reasonable accuracy from a recommender system.
Or, it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems”, published in the Proceedings of the National Academy of Sciences (PNAS), could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.
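For the curious, here is a minimal pure-Python sketch in the spirit of that paper’s hybrid of mass diffusion (ProbS) and heat conduction (HeatS) on a boolean user-item graph; the function, the lambda parameterization and the toy data are my own simplified rendering, not the authors’ code:

```python
# Diffusion-style scoring on a boolean user-item graph, in the spirit of
# the ProbS/HeatS hybrid. lam = 1 normalizes by the source item's degree
# (mass diffusion, favoring accuracy); lam = 0 normalizes by the
# candidate item's degree (heat conduction, favoring diversity).
def hybrid_scores(ratings, target_user, lam=0.5):
    """ratings: dict user -> set of item ids; returns item -> score."""
    users = list(ratings)
    items = sorted({i for s in ratings.values() for i in s})
    k_item = {i: sum(1 for u in users if i in ratings[u]) for i in items}
    k_user = {u: len(ratings[u]) for u in users}

    scores = {}
    for cand in items:
        if cand in ratings[target_user]:
            continue  # only score items the target has not yet collected
        s = 0.0
        for owned in ratings[target_user]:
            # resource flows owned-item -> shared users -> candidate,
            # attenuated by each intermediate user's degree
            overlap = sum(1 / k_user[u] for u in users
                          if owned in ratings[u] and cand in ratings[u])
            s += overlap / (k_item[owned] ** lam * k_item[cand] ** (1 - lam))
        scores[cand] = s
    return scores

ratings = {"u1": {"a", "b"}, "u2": {"b", "c"}, "u3": {"a", "c", "d"}}
print(hybrid_scores(ratings, "u1"))
```

Note that the method needs only the boolean graph, no ratings, which is exactly the shape of the Mendeley data.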
The Cost (vs. Value) of Data Curation (October 2, 2010). Posted by Andre Vellino in Data, Open Access.
There is a tension between the cost (to the curator) of data-curation and the potential value (to others) of making data (e.g. data from scientific experiments) available. For the purposes of selection, it would be nice to know ahead of time whether the data you wish to make available (now) is ever going to have value (in the future).
Unfortunately, you can’t predict that ahead of time because (i) you don’t know who your data-users might turn out to be and (ii) you don’t know how circumstances might change in ways that turn what previously seemed an irrelevant piece of data into a planet-saving one.
Indeed, it’s impossible to know how any element of data might be used, for any given purpose and by whom. For instance, consider whether the Nuclear Magnetic Resonance spectra that you collected for the purpose of analyzing the structure and composition of a pathogen might not be fruitfully reused in the future: for understanding the bias of an improperly calibrated instrument, say, or indeed (for technology historians of the future) for seeing how what may then be “primitive” NMR spectroscopy technology was used in the 20th and early 21st centuries.
So, how are we to interpret Weinberger’s advice in “Everything is Miscellaneous”?
- “The solution to overabundance of information is more information”
- “Filter on the way out, not on the way in”
- “Put each leaf on as many branches as possible”
- “Everything is metadata and can be a label”
- “Give up control”
- “A ‘topic’ is anything someone somewhere is interested in.”
The cash value of this advice for data: publish as much data as you can; give users as many ways as you can to get at it (e.g. APIs but also user-interfaces); give users as many ways as you can to add more data (tags, metadata, text, links to other data – viz. “linked data”).
Which is fine advice if you assume that publishing data, like putting text and images on the internet, is (almost) free. But publishing data isn’t (yet) close to free. Why? Because it (still) needs to be curated by someone who understands how to annotate it in at least the obvious ways in which it may be useful – e.g. to other contemporary scientists.
Prediction: either scientists will have to be trained to become data-curators, or the process of creating data will have to generate the metadata, or e-librarians will have to be trained in the sciences (inclusive sense of “or”).
Scientific Data is Interpreted (September 26, 2010). Posted by Andre Vellino in Data, Data Mining, Epistemology.
It must be a truism by now that there is no such thing as theory-free observation. Scientific data are necessarily tied up with the theories that are required to interpret them and that led to their discovery.
By analogy I would argue that scientific data sets are useless unless they are interpreted. There is no such thing as a useful “raw” data set.
Consider for instance the data on leap seconds from the National Research Council. It’s a simple enough table: there are three columns (Date, UTC Leap Seconds, and MJD) and only a few dozen rows. Here are two such rows:
DATE                      UTC Leap Seconds   MJD
2006-01-01 - 2009-01-01   33                 53 736 - 54 832
1999-01-01 - 2006-01-01   32                 51 179 - 53 736
The first question for the uninitiated in the measurement of time: what is a “UTC Leap Second”? It’s easy enough to look up and learn that UTC
is a time standard based on International Atomic Time (TAI) with leap seconds added at irregular intervals to compensate for the Earth’s slowing rotation.
Ah, so this was news to me: the earth’s rotation is slowing down! (“the solar day becomes 1.7 ms longer every century due mainly to tidal friction (2.3 ms/cy, reduced by 0.6 ms/cy due to glacial rebound)”)
The (implicit) frame of reference for (exact) time with respect to which the earth is slowing down is the atomic (cesium) clock, which requires an understanding of the highly theoretical processes of quantum mechanics to interpret correctly.
So now we have an inkling of what the data mean. They give us the difference between two time measurements – those from atomic clocks and those from the earth’s rotation. A first attempt at interpreting the first row in the table: it took the 3 years between 2006-01-01 and 2009-01-01 to add one leap-second to the calendar date.
A little “Binging” (I’ve all but abandoned “Googling” since Google became “instant” – not because it can’t be turned off, but to make a statement to Google) yields “Modified Julian Day” for MJD. So the third column is primarily a conversion of the first column into a standard, though not without its own theoretical reasons for being the preferred measure.
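A quick sanity check of the MJD column, assuming only that MJD counts days from the standard epoch of 1858-11-17:

```python
from datetime import date

# Modified Julian Day is the number of days elapsed since the MJD epoch,
# 1858-11-17 (i.e. MJD = Julian Day - 2400000.5).
def to_mjd(d):
    return (d - date(1858, 11, 17)).days

print(to_mjd(date(2006, 1, 1)))  # 53736, matching the table
print(to_mjd(date(2009, 1, 1)))  # 54832
print(to_mjd(date(1999, 1, 1)))  # 51179
```

Even this little conversion, of course, only makes sense given the theory-laden conventions of the Gregorian calendar and the MJD epoch.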
All this to say: repositories of datasets without (substantial) amounts of textual metadata – not to mention software and tools designed for their interpretation and navigation – are going to be (at best) not very useful.