February 3, 2007

Andre Vellino

A lot of “social information” can be gleaned from journal articles in a scientific digital library. The most obvious source of social information is citations. Citation indexes measure the number of times an article or monograph is referenced by other documents, giving a measure of the cumulative impact and relevance of an individual’s scientific research output. This simple measure has been improved upon by the Hirsch index (or h-index), which measures citation impact as a function of the distribution of citations received by a given researcher’s publications: a researcher has index h if h of their publications have each been cited at least h times.
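To make the difference from a raw citation count concrete, here is a minimal sketch of the h-index as it is commonly defined: the largest h such that h publications have at least h citations each. The function name and the sample counts are mine, made up for illustration.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # counts only decrease from here, so h cannot grow
    return h


# Two made-up researchers with the same number of papers.
steady = [10, 8, 5, 4, 3]
one_hit = [25, 8, 5, 3, 3]
print(h_index(steady), h_index(one_hit))
```

The second researcher has more citations in total, but most of them come from a single paper, so the index is lower.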

Members of the CISTI Research team are looking at the question of how to use networks of citations to rank search results by relevance, in the same way that web search results are sorted by PageRank. I am not working on citations myself, but I have been wondering whether it would be possible to improve that ranking measure by extracting more detailed information about citations. For example, one could (i) count the co-occurrence of different citations across a collection, (ii) count the number of occurrences of each citation inside an article and (iii) weight these citation occurrences according to their location in the article.

Counting the co-occurrences of different citations in a collection is one way to cluster documents, and this clustering information could be used to help rank search results. Typically, authors will publish “highly similar” or “highly related” articles within the span of a few years, and often they will cite the same references. It might be that later work is the continuation, improvement or refutation of earlier work. Given a collection of articles that share a lot of citations, you might want to rank highest the “most general” one (the one that contains the most citations) or the article that “subsumes” the others. Articles from a given author that have the highest number of co-occurring citations of other works should be given extra weight.
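As a rough sketch of what this counting could look like, assume each article has already been reduced to a list of reference identifiers (the identifiers, article names and function names below are hypothetical). The first function counts how often two references are cited together in the same article; the second counts how many references each pair of articles has in common, which is the overlap one would cluster on.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(references_by_article):
    """Count how often each pair of references is cited together in one article."""
    counts = Counter()
    for refs in references_by_article.values():
        # Use a sorted set so each pair is counted once per article,
        # always in the same order.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


def shared_references(references_by_article):
    """Count the references that each pair of articles has in common."""
    shared = {}
    for a, b in combinations(sorted(references_by_article), 2):
        overlap = set(references_by_article[a]) & set(references_by_article[b])
        if overlap:
            shared[(a, b)] = len(overlap)
    return shared


# Made-up collection: article id -> identifiers of the works it cites.
collection = {
    "A": ["r1", "r2", "r3"],
    "B": ["r2", "r3"],
    "C": ["r4"],
}
print(cocitation_counts(collection).most_common(1))
print(shared_references(collection))
```

In this toy collection, article A cites everything B does and more, so it would be the candidate for the “most general” article of the pair.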

If a given article cites another article several times and the number of such citations is disproportionately high compared to its other citations, this might be a measure of how “derivative” the work is. Truly original works such as Einstein’s article “On the Electrodynamics of Moving Bodies” might contain no citations at all, let alone repeated citations. The number of occurrences of a citation within an article could be combined with its locations in the article to provide a measure of its importance. For example, if the citation is in a footnote it should be considered a minor citation, but if it appears in the introduction and in the body of the article it is more likely to be significant. If it only occurs in a customary “Related Research” section, it is likely to be relevant but not central to the article.
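One way to combine occurrence counts with location is a simple weighted sum. The sketch below assumes the article has already been parsed into (reference, location) pairs; the location labels and the weights are invented for illustration and would need to be tuned against real data.

```python
from collections import defaultdict

# Hypothetical weights: introduction and body count fully, a "Related
# Research" mention counts half, a footnote counts a quarter.
LOCATION_WEIGHTS = {
    "introduction": 1.0,
    "body": 1.0,
    "related_research": 0.5,
    "footnote": 0.25,
}


def citation_importance(occurrences, weights=LOCATION_WEIGHTS, default_weight=0.5):
    """Sum a location weight for every occurrence of each reference."""
    scores = defaultdict(float)
    for reference, location in occurrences:
        # Unknown locations fall back to an assumed middle weight.
        scores[reference] += weights.get(location, default_weight)
    return dict(scores)


# Made-up occurrences extracted from a single article.
occurrences = [
    ("smith2001", "introduction"),
    ("smith2001", "body"),
    ("smith2001", "body"),
    ("jones1999", "footnote"),
    ("lee2003", "related_research"),
]
print(citation_importance(occurrences))
```

A reference whose score dwarfs all the others in the same article would be the signal of a “derivative” work described above.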


