CISTI Sciverse Gadget App December 13, 2011 Posted by Andre Vellino in CISTI, Digital library, General, Information retrieval, Open Access.
Betwixt the jigs and the reels, and with the help of several people at CISTI and Elsevier, I developed a (beta) Sciverse gadget that gives searchers and researchers a window on CISTI’s electronic collection. It takes the search term entered in Elsevier Hub and returns CISTI’s search results from a database of over 20 million journal articles.
I want to commend all and sundry at Sciverse Applications for this initiative. Opening up bibliographic data and providing developers with a developer platform (a customized version of Google’s OpenSocial platform) is exactly the right kind of thing to do, both to benefit third parties (they get access to otherwise closed and proprietary data) and to enhance their own search and discovery environment.
There are, already, several advanced and interesting applications on Sciverse. My favourites are: Altmetric (winner of the Science Challenge prize – see YouTube demo video below), NextBio’s Prolific Authors, and Elsevier’s Table Download.
And there will be more to come. An open marketplace like this where the principles of variation and natural selection can operate will, I predict, make for a richer diversity of useful search and discovery tools than any single organization can develop on its own.
Mendeley Data vs. Netflix Data November 2, 2010 Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.
Mendeley, the on-line reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender Systems Conference in Barcelona.
The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.
I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of a factor of roughly 440, or nearly 3 orders of magnitude!
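As a rough check on the sparsity measure just described, here is a minimal sketch that divides the number of observed edges by the number of possible user-item edges. The function name is my own, and the inputs are the rounded figures quoted in this post rather than exact dataset counts, so the results land close to, but not exactly on, the values quoted above.

```python
import math

def graph_density(n_edges, n_users, n_items):
    """Observed edges divided by all possible user-item edges."""
    return n_edges / (n_users * n_items)

# Rounded figures from this post (assumed, not exact dataset counts).
mendeley = graph_density(n_edges=4.8e6, n_users=50_000, n_items=3.6e6)
netflix = graph_density(n_edges=100e6, n_users=480_000, n_items=17_000)

# Gap between the two datasets, expressed in orders of magnitude.
gap_in_orders = math.log10(netflix / mendeley)

print(f"Mendeley: {mendeley:.2e}")
print(f"Netflix:  {netflix:.2e}")
print(f"Gap: {gap_in_orders:.2f} orders of magnitude")
```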
But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e. the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).
In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.6%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Fewer than 6% of the articles referenced were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]
By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17K titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)
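The "3 or more users" share follows directly from the counts given above. This small sketch, with variable names of my own choosing, simply redoes that arithmetic.

```python
# Article counts quoted in this post.
total_articles = 3_652_286
one_reader = 3_055_546
two_readers = 378_114

# Everything not referenced by exactly 1 or 2 users has 3 or more.
three_or_more = total_articles - one_reader - two_readers

share_one = one_reader / total_articles
share_two = two_readers / total_articles
share_three_or_more = three_or_more / total_articles

print(f"1 reader:   {share_one:.1%}")
print(f"2 readers:  {share_two:.1%}")
print(f"3+ readers: {share_three_or_more:.1%}")
```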
I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in Mendeley data. Some additional information such as article citation data or some content attribute such as the categories to which the articles belong is going to be needed to get any kind of reasonable accuracy from a recommender system.
Or, it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems” published in the Proceedings of the National Academy of Sciences (PNAS) could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.
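To give a feel for what such a method might look like on unrated data, here is a minimal sketch of a heat-spreading style score as I understand the general idea: a two-step averaging over the bipartite graph, first from items to users and then from users back to items. This is my own simplified reading and not the authors' implementation; the function name and the toy data are hypothetical.

```python
from collections import defaultdict

def heat_spreading_scores(user_items, target_user):
    """Rank items the target user lacks by two-step averaging.

    user_items maps each user to the set of items in their library.
    """
    # Invert the user -> items map to get item -> users.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    target_items = user_items[target_user]

    # Step 1: a user's "temperature" is the fraction of their
    # items that the target user also holds.
    user_temp = {
        user: len(items & target_items) / len(items)
        for user, items in user_items.items()
        if items
    }

    # Step 2: an item's score is the average temperature of the
    # users holding it. Averaging over few holders is what lets
    # rarely referenced items score well.
    scores = {
        item: sum(user_temp[u] for u in users) / len(users)
        for item, users in item_users.items()
        if item not in target_items
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy libraries, for illustration only.
toy_libraries = {
    "u1": {"a", "b"},
    "u2": {"b", "c"},
    "u3": {"c", "d"},
}
ranked = heat_spreading_scores(toy_libraries, "u1")
print(ranked)
```

Because each step divides by the degree of the receiving node, an article with a single reader is not automatically swamped by heavily referenced ones, which is presumably why this family of methods is worth trying on data shaped like Mendeley's.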
Ex Libris ‘bX’ Recommender Promo Video October 5, 2010 Posted by Andre Vellino in Collaborative filtering, Recommender.
I stumbled across this Ex Libris promo video for its ‘bX’ recommender yesterday. Having done quite a few of these use-case demo scenarios to “show the value”, I appreciate how hard it is to pitch a relatively complex idea in straightforward terms. I think it does a pretty good job too, notwithstanding the slightly over-the-top-happiness tenor of the whole thing.
At the risk of repeating myself, though, there’s one thing that the video glosses over. SFX logs are, effectively, click-logs and clicks have two sources: search engine results and ‘bX’ recommendations themselves. Hence ‘bX’ recommendations are more likely to be “semantically homogeneous” (although less so than pure search results) because the data they derive from is biased by search-engine ranking. The proportion of SFX traffic that is generated by the recommender itself further narrows the semantic diversity of recommendations.
Are User-Based Recommenders Biased by Search Engine Ranking? September 28, 2010 Posted by Andre Vellino in Collaborative filtering, Recommender, Recommender service, Search, Semantics.
I have a hypothesis (first emitted here) that I would like to test with data from query logs: user-based recommenders – such as the ‘bX’ recommender for journal articles – are biased by search-engine language models and ranking algorithms.
Let’s say you are looking for “multiple sclerosis” and you enter those terms as a search query. Some of the articles presented to you in the search results will likely be relevant, and you download a few of them during your session. This may be followed by another, semantically germane query that yields more article downloads. As a consequence, the usage-log (e.g. the SFX log used by ‘bX’) is going to register these articles as having been “co-downloaded”. Which is natural enough.
But if this happens a lot, then a collaborative filtering recommender is going to generate recommendations that are biased by the ranking algorithm and language model that produced the search-result ranking: even by PageRank, if you’re using Google.
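To make the mechanism concrete, here is a minimal sketch of how co-download counts could accumulate from session logs. It is an illustration under my own assumptions, not a description of how ‘bX’ actually processes SFX logs; the function name, the session format, and the article identifiers are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_download_counts(sessions):
    """Count how often each pair of articles shares a session."""
    counts = Counter()
    for downloads in sessions:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(downloads)), 2):
            counts[pair] += 1
    return counts

# Hypothetical sessions: each list holds the articles downloaded
# after one or more related queries.
sessions = [
    ["ms-review", "ms-trial", "ms-imaging"],
    ["ms-review", "ms-trial"],
    ["ms-imaging", "myelin-repair"],
]
pair_counts = co_download_counts(sessions)

for pair, n in pair_counts.most_common():
    print(pair, n)
```

If the articles in each session are mostly the ones the search engine ranked near the top, the pairs with the highest counts will tend to reflect that ranking rather than any deeper relation between the articles.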
In contrast, a citation-based (i.e. author-centric) recommender (such as Sarkanto) will likely yield more semantically diverse recommendations because co-citations will have (we hope!) originated from deeper semantic relations (i.e. non-obvious but meaningful connections between the items cited in the bibliography).
Sarkanto Scientific Search September 13, 2010 Posted by Andre Vellino in Collaborative filtering, Digital library, Information retrieval, Recommender, Recommender service, Search.
A few weeks ago I finished deploying a version of a collaborative recommender system that uses only article citations as a basis for recommending journal articles. This tool allows you to search ~7 million STM (Scientific, Technical and Medical) articles up to Dec. 2009 and to compare citation-based recommendations (using the Synthese recommender) with recommendations generated by ‘bX’ (a user-based collaborative recommender from Ex Libris). You can try the Sarkanto demo and read more about how ‘bX’ and Sarkanto compare.
Note that I’m also using this implementation to experiment with the Google Translate API and the Microsoft Translator, both to expand queries into the other Canadian Official Language and to translate various bibliographic fields in the search results returned.
Google Books on Charlie Rose March 8, 2010 Posted by Andre Vellino in CISTI, Digital library, General, Open Access, Search.
I found this conversation about the “Google Books” library very interesting. It was between Robert Darnton (professor of American cultural history at Harvard and Director of the Harvard University Library), David Drummond (Chief Legal Officer at Google), bestselling author James Gleick and Charlie Rose (from PBS) last night.
I was especially pleased to see Prof. Darnton insist on the need to guarantee “the public interest”. Only he seemed to have the long view, though.