Evaluating Article Recommenders
July 23, 2009
Posted by Andre Vellino in Collaborative filtering, Recommender
In his March article for CACM, Greg Linden opines that RMSE (Root Mean Square Error) and similar measures of recommender accuracy are not necessarily the best ways to assess their value to users. He suggests that Top-N measures may be preferable if the problem is to predict what someone will really like.
“A recommender that does a good job predicting across all movies might not do the best job predicting the TopN movies. RMSE equally penalizes errors on movies you do not care about seeing as it does errors on great movies, but perhaps what we really care about is minimizing the error when predicting great movies.”
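To make the distinction concrete, here is a minimal sketch in Python comparing RMSE with a simple Top-N measure (precision at N). The ratings, the two recommenders and the "liked" threshold of 4.0 are all invented for illustration; they do not come from Linden's article or from any real system.

```python
import math

def rmse(predicted, actual):
    """Root mean square error over every item that has a true rating."""
    squared_errors = [(predicted[item] - actual[item]) ** 2 for item in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def precision_at_n(predicted, actual, n, like_threshold=4.0):
    """Fraction of the n highest-predicted items that the user really liked."""
    top_n = sorted(predicted, key=predicted.get, reverse=True)[:n]
    hits = sum(1 for item in top_n if actual[item] >= like_threshold)
    return hits / n

# Hypothetical true ratings on a 1-5 scale.
actual = {"a": 5.0, "b": 5.0, "c": 3.0, "d": 3.0, "e": 1.0, "f": 1.0}

# Recommender A: small errors overall, but it ranks mediocre items first.
rec_a = {"a": 3.9, "b": 3.9, "c": 4.0, "d": 4.0, "e": 1.0, "f": 1.0}

# Recommender B: large errors on items the user dislikes,
# but it puts the great items at the top of the list.
rec_b = {"a": 5.0, "b": 5.0, "c": 3.0, "d": 3.0, "e": 3.5, "f": 3.5}

for name, rec in (("A", rec_a), ("B", rec_b)):
    print(name, round(rmse(rec, actual), 3), precision_at_n(rec, actual, 2))
```

In this toy data recommender A has the lower RMSE, yet none of its top two suggestions are items the user liked, while recommender B has the higher RMSE and a perfect top two.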
This problem is compounded when it isn’t even possible to measure errors of any kind. Suppose you have an item-based recommender for journal articles in a digital library and recommendations are restricted to items in the collection owned by the library. These recommendations are then restricted to a certain set which may be incommensurable with recommendations generated from a different collection. So any quality measure would depend on the size of the collection.
How then would one go about evaluating recommendations in this circumstance? One way is for an expert to inspect the results and judge them for relevance or quality. Another is to measure some meta-properties of the recommendations, such as their semantic distance from one another or from the item they are being recommended from. At least you would be able to say that one recommender offers greater novelty or diversity than another.
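One such meta-property can be sketched as follows: the diversity of a recommendation list taken as the average pairwise distance between its items. This sketch assumes each article is represented by a set of keywords and uses Jaccard distance as a stand-in for "semantic distance"; both choices, and the sample keyword sets, are assumptions made for illustration only.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 minus the overlap of two keyword sets; 0.0 means identical sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def intra_list_diversity(recommendations):
    """Average pairwise distance between the items in one recommendation list."""
    pairs = list(combinations(recommendations, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical keyword sets for the articles returned by two recommenders.
narrow = [{"svd", "ratings"}, {"svd", "ratings"}, {"svd", "ratings", "netflix"}]
broad = [{"svd"}, {"citations"}, {"music"}]

print(intra_list_diversity(narrow))  # low: the articles share most keywords
print(intra_list_diversity(broad))   # high: the articles share no keywords
```

A number like this says nothing about relevance, but it does allow two recommenders to be compared on diversity without needing any error measure.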
This is the kind of approach taken by Òscar Celma and Perfecto Herrera in a paper delivered at Recommender Systems 2008. They concluded that content-based recommendations for music that are less biased by popularity (i.e. more biased toward content-similarity) produced less novelty in recommendations and also less user-satisfaction.
While music listeners may appreciate novelty and diversity, my expectation is that users of recommenders for scholarly articles actually want something closer to “more like this” (content similarity) than “other users who looked at this also looked at that” (collaborative filtering).
At least that’s the conclusion (not yet scientifically corroborated) that I came to when I compared a usage-only recommender (‘bX’ from Ex Libris) to a citation-only recommender for scholarly articles (Synthese). At first blush ‘bX’ produces more “interesting” recommendations (greater diversity) whereas Synthese (in citation-only mode anyway) generates more “similar” recommendations.
Perhaps what the user needs is both kinds of recommenders, depending on their information retrieval needs.
I think that the whole Netflix/RMSE thing is not where the next step is in recommender systems. We need to broaden the applications.
Have you tried something like this: papers that have cited papers X, Y, Z have also cited paper W? I’d love to get an analysis of a paper I am about to submit to see whether I have omitted any references… It would be cool to determine, mathematically, whether a set of references is “complete” in some sense.
(There has been a lot of work done on mining frequent “minimal” item sets. I am guessing it could be applied to this problem.)
Minimally, just this feature: papers which have cited this paper have also cited this other paper… that’d be great.
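For illustration, the "papers which have cited these have also cited that" feature could be sketched roughly as below. This is not how Synthese works; it assumes the citation data fits in a plain dictionary mapping each citing paper to the set of papers it cites, and the function name and sample data are hypothetical.

```python
from collections import Counter

def also_cited(citations, seeds, top_n=5):
    """Count what else is cited by the papers that cite every seed paper."""
    seeds = set(seeds)
    counts = Counter()
    for references in citations.values():
        if seeds <= references:                 # this paper cites all the seeds
            counts.update(references - seeds)   # tally its other references
    return counts.most_common(top_n)

# Hypothetical citation data: citing paper -> set of cited papers.
citations = {
    "P1": {"X", "Y", "Z", "W"},
    "P2": {"X", "Y", "Z", "W", "V"},
    "P3": {"X", "Y"},
    "P4": {"X", "Y", "Z"},
}

print(also_cited(citations, ["X", "Y", "Z"]))  # W is co-cited twice, V once
```

Passing a single seed paper gives the minimal version of the feature; passing a whole reference list gives a crude check for references that may be missing.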
The next version of Synthese will be exactly what you suggest – the “minimal” feature, that is – (which is, of course, simpler than what I was experimenting with). Adding the controls you want with your second suggestion is pretty easy too, except for figuring out how the UI should look.
The data I have is *very* sparse. On about 8M science articles, I’m only able (currently) to produce item-based recommendations (based on citations only) for about 1.8M of them. I’m now working on reducing the sparsity of my data….
There’s also going to be an OpenURL API, very much like ‘bX’ – submit an OpenURL to Synthese and you get back metadata about the recommended articles in XML. Should be done in September sometime.
Then, you could go further:
Suppose that I say “No, I don’t want to cite paper W even though I plan to cite papers X, Y, Z”… then what does it say? Can you then focus on papers (if any) citing papers X, Y, Z but not paper W?
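A rough sketch of that refinement, under the same assumption as before that the citation data is a dictionary mapping each citing paper to the set of papers it cites (the function name and sample data are hypothetical):

```python
from collections import Counter

def also_cited_excluding(citations, wanted, unwanted, top_n=5):
    """Like a co-citation count, but skip papers citing anything unwanted."""
    wanted, unwanted = set(wanted), set(unwanted)
    counts = Counter()
    for references in citations.values():
        # Keep only papers that cite all wanted papers and no unwanted ones.
        if wanted <= references and not (unwanted & references):
            counts.update(references - wanted)
    return counts.most_common(top_n)

# Hypothetical citation data: citing paper -> set of cited papers.
citations = {
    "P1": {"X", "Y", "Z", "W"},
    "P2": {"X", "Y", "Z", "U"},
    "P3": {"X", "Y", "Z", "U", "T"},
    "P4": {"X", "Y", "Z", "W", "T"},
}

print(also_cited_excluding(citations, ["X", "Y", "Z"], ["W"]))
```

Here only P2 and P3 qualify, so the suggestions come from papers that cite X, Y and Z without citing W.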
The fun thing is that these are small data sets (very sparse) so you can do the computations live with little effort.
Interesting suggestion. Thanks Daniel.