
Building a Better Citation Index March 20, 2012

Posted by Andre Vellino in Citation, Data, Open Source.

Scholars in a variety of disciplines (not just bibliometrics!) have been building better measures of scholarly output.  First came the H-index in 2005 followed by the G-index in 2006, and these are now part of the standard measures for scholarly output.
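
For readers who have not met these two measures, here is a minimal sketch of how they are usually defined: the H-index is the largest h such that h papers have at least h citations each, and the G-index is the largest g such that the top g papers together have at least g² citations. The function names and the sample citation counts are made up for illustration, and this variant caps g at the number of papers.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations.

    In this sketch g never exceeds the number of papers.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for one author's five papers.
example = [10, 8, 5, 4, 3]
print(h_index(example), g_index(example))
```

Both measures see only citation counts, which is exactly why they cannot tell an essential reference from an incidental one.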

However, as Daniel Lemire points out in his latest blog post, the raw data of mere citations is pretty crude.  In any given article, it’s often hard to tell which of the (typically) dozens of references are “en passant” (to fend off the critics who might think you haven’t read the literature) or incidental to the substance of the article. What’s interesting for the authors of the articles being cited is the question “how critical is this citation to the author who cited me?”

One way to find out (and hence, perhaps, to build a better citation measure) is to train a Machine Learning algorithm to extract “key citations” – by analogy with extracting “key phrases” from a text (see Peter Turney’s 2000 article Machine Learning Algorithms for Keyphrase Extraction). As a starting point, we’d like to compile data from researchers which asks the question: “What are the key references of your papers?”

It will take 10 minutes: please fill out this Google Docs questionnaire. In it we ask you, as the author of an article, to tell us which 1, 2, 3 or 4 references are essential to that article. By an essential reference, we mean a reference that was highly influential or inspirational for the core ideas in your paper; that is, a reference that inspired or strongly influenced your new algorithm, your experimental design, or your choice of a research problem.

When this survey is completed, we will be releasing the resulting data set under the ODC Public Domain Dedication and Licence so that you can use this data in other ways, if you wish.

Mendeley Data vs. Netflix Data November 2, 2010

Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.

Mendeley, the on-line reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender Systems Conference in Barcelona.

The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.

I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T 2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05.  Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of nearly 3 orders of magnitude!
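
As a rough check on those figures, the measure can be computed from the rounded counts quoted in these posts. This is only a sketch: the function name is mine, and because the inputs are rounded (about 4.8M references, 50,000 users and 3.6M unique articles for Mendeley; about 100M ratings, 480K users and 17K titles for Netflix) the results differ slightly from the values above.

```python
import math


def density(num_edges, num_users, num_items):
    """Observed edges divided by all possible user-item edges."""
    return num_edges / (num_users * num_items)


# Rounded counts taken from the text; the exact dataset sizes differ a little.
mendeley = density(4.8e6, 50_000, 3.6e6)
netflix = density(100e6, 480_000, 17_000)

print(f"Mendeley: {mendeley:.2e}")
print(f"Netflix:  {netflix:.2e}")
print(f"Orders of magnitude apart: {math.log10(netflix / mendeley):.2f}")
```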

But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e.  the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).

In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.6%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Less than 6% of the articles referenced were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]
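
A small sketch of how such a breakdown could be computed from (user, article) pairs, followed by a check of the shares using the counts quoted above. The function name and the toy pairs are hypothetical; only the three article counts come from the text.

```python
from collections import Counter


def readers_per_article_histogram(pairs):
    """Map 'number of distinct readers' to 'number of articles with that many readers'."""
    readers = Counter(article for _, article in set(pairs))
    return Counter(readers.values())


# Hypothetical toy data: (user, article) pairs, with one duplicate entry.
toy_pairs = [("u1", "a1"), ("u2", "a1"), ("u1", "a2"), ("u3", "a3"), ("u3", "a3")]
hist = readers_per_article_histogram(toy_pairs)
print(dict(hist))

# Counts quoted in the text.
total_articles = 3_652_286
one_reader = 3_055_546
two_readers = 378_114
three_plus = total_articles - one_reader - two_readers

for label, count in [("1 reader", one_reader),
                     ("2 readers", two_readers),
                     ("3+ readers", three_plus)]:
    print(f"{label}: {count / total_articles:.2%}")
```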

By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17K titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)

I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in Mendeley data. Some additional information such as article citation data or some content attribute such as the categories to which the articles belong is going to be needed to get any kind of reasonable accuracy from a recommender system.

Or, it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems” published in the Proceedings of the National Academy of Sciences (PNAS) could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.

MS Libra Academic Search Engine May 19, 2009

Posted by Andre Vellino in CISTI, Citation, Data Mining, Search.

Microsoft Research Asia appears to have taken over where Microsoft Live Academic left off with Libra Academic Search.  Although Libra’s collection is limited to computer science (with 1.8M articles to data-mine, it nonetheless appears to be quite comprehensive), it proves that one can do better than Google Scholar by implementing simple facets (Papers / Authors / Conferences / Journals / Communities) that cluster or order results according to different criteria.

I don’t think Libra is new (it appears to have started in April 2007) and it may be that no one is working on it actively any more – perhaps because CiteSeerX (also supported by Microsoft!) dominates the (limited) market.  But I hope its core features are not forgotten.

End of Universities April 28, 2009

Posted by Andre Vellino in Citation.

Yesterday’s NYTimes Op. Ed. “The End of the University as We Know It” by professor Mark C. Taylor is quickly making the rounds in academic circles.

There are two elements in this article that concerned me.  One is the dismissive criticism of the thesis topic of a colleague’s student as trivial:

Each academic becomes the trustee not of a branch of the sciences, but of limited knowledge that all too often is irrelevant for genuinely important problems. A colleague recently boasted to me that his best student was doing his dissertation on how the medieval theologian Duns Scotus used citations.

Presumably professor Taylor is not bothered that a Ph.D. student in a religion department is studying Duns Scotus – one of the most important philosophers of the Middle Ages.  It must be, then, that he thinks it is not important to be studying how Scotus is using citations.

I know enough about citation analysis to be confident that professor Taylor is being dismissive too hastily. I wish this graduate student well in his or her use of a 21st century tool to discover new things about Scotus that were heretofore unknown about his thinking. 

The other remark which I thought was ill-conceived is the argument that Universities should:

Abolish permanent departments, even for undergraduate education, and create problem-focused programs. These constantly evolving programs would have sunset clauses, and every seven years each one should be evaluated and either abolished, continued or significantly changed. It is possible to imagine a broad range of topics around which such zones of inquiry could be organized: Mind, Body, Law, Information, Networks, Language, Space, Time, Media, Money, Life and Water.

Professor Taylor appears not to have heeded the advice about categories in David Weinberger’s book Everything is Miscellaneous. It is indeed possible to imagine not just a “broad range” of topics, but a virtually infinite range of “zones of inquiry”, each equally worthy of consideration.

Furthermore, getting academics to agree on even one set of such topics within which to fit their work would be an interminable exercise. Take the citation analysis of Duns Scotus’ work, for instance.  It arguably belongs equally to “Information” or “Networks” or “Language” or to categories not yet mentioned, such as the more traditional “Mathematics” or “Philosophy” or “Library Science”.

Besides, who would decide which categories are the relevant ones?  The government of the day? In that case would professor Taylor have wanted to apply for grant funding under the topic “Anti-Terrorism” in the past 8 years?

Intelligent Librarian Agent April 2, 2009

Posted by Andre Vellino in CISTI, Citation, Data Mining, Digital library.

Daniel Lemire has been pining for a research tool that will notify him of anything that may be relevant to his research needs: someone citing his work, other “similar-to-his articles” that have recently been published and anything else that might be relevant to his research.

The idea of personalized software agents has been around for at least 10 years (e.g. at the MIT Media Lab and Carnegie Mellon University) but perhaps its time has come.  The editors of Technology Review believe so, anyway, and list an Intelligent Software Assistant as one of the top 10 technologies for 2009.

This got me thinking about other things, besides Daniel’s suggestions, that a Personalized Intelligent Librarian Agent might do for you:

  1. A collaborative filtering recommender such as the Synthese Recommender on CISTI Lab could be at the core of an alerting service that informs you about new articles that other people who are “like you” (based on a profile of your “bookmarked articles”) are reading / downloading / bookmarking.
  2. A service that informs you when an article in your field has become a “sleeping beauty”. A “sleeping beauty” is a publication that has gone unreferenced for a long time and then suddenly attracts a lot of attention.
  3. A patent alert service that informs you about patents related to your recent research.  This is trickier than it seems because patent descriptions often deliberately obscure the nature of their inventions.

It’s good to know that some of the work we are doing at CISTI is going to have a role to play in satisfying scientific researchers’ information needs.

Recommending BioMedical Articles March 2, 2008

Posted by Andre Vellino in CISTI Visualization, Citation, Collaborative filtering, Digital library, Recommender.

I have just finished an initial prototype of a recommender for a digital library. This web application was built using Sean Owen’s open source collaborative filtering toolkit Taste (with a lot of adaptations by Dave Zeber). It uses data from 1.6 million articles in a collection of about 1500 bio-medical journals.

This demo isn’t ready to be made publicly available in part because of some licensing uncertainties about the meta-data. Later this quarter I may be able to put a more polished version on CISTI Lab (currently undergoing a makeover, so please forgive the “under construction” skin) using the NRC Press collection, although I’m worried that the citation graph for that collection may be too sparse to yield reliable recommendations.

The Synthese Recommender uses many of the ideas from TechLens. For example, to seed the recommender with ratings we use the citation graph for the collection. Out of the 1.6M articles, 370K of them qualify as “useful” for recommending – i.e. only those articles with 3 or more citations. The total number of citations is ~1.5M, making the average number of citations per article roughly 4.

In contrast, the number of citations per article in the 100K article Citeseer collection (which, incidentally, is now in its next generation with CiteseerX, whose design can be read about here) used in TechLens is roughly 12. It strikes me as a little odd that our bio-medical collection should have almost 3 times fewer citations per article. I will have to look at the citation data more carefully! [P.S. I did look into this and “4” is the average number of references per article for which we have entries in the bibliographic database. I am told that biomedical articles do in fact have a much higher number of overall references than computer science articles.]

Compared with data from consumer-product recommenders, citation-based “ratings” in a digital library are much (three orders of magnitude) more sparse. For instance, the Netflix prize data contains 100 million ratings from 480 thousand customers over 17,000 movie titles. That’s roughly 1.2% of non-zeros. With 1.5 million citations (“ratings”) on 370K users and 370K items we get roughly 0.00116% of non-zeros.

What do you think the odds are that applying PageRank to give a numeric value to citation-based ratings is going to affect the quality of recommendations? Stay tuned for the answer (in a couple of months, probably.)
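
For readers unfamiliar with PageRank, here is a minimal power-iteration sketch on a toy citation graph. It is a generic illustration, not the method used in the Synthese Recommender: the function name, the damping factor of 0.85, the iteration count and the four-article graph are all assumptions, and how the resulting scores would be turned into ratings is left open.

```python
def pagerank(cites, damping=0.85, iterations=50):
    """Plain power-iteration PageRank.

    `cites` maps each article to the list of articles it cites.
    """
    nodes = set(cites)
    for targets in cites.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by articles that cite nothing is spread evenly over all articles.
        dangling = sum(rank[node] for node in nodes if not cites.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each article passes an equal share of its rank to every article it cites.
        for source, targets in cites.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B both cite C, and C cites D.
toy_citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(toy_citations)
for article, score in sorted(ranks.items()):
    print(article, round(score, 3))
```

In this toy graph C, which is cited twice, ends up with a higher score than A or B, which are never cited; a binary “cited / not cited” rating would not capture that difference.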