Evaluating Article Recommenders
July 23, 2009. Posted by Andre Vellino in Collaborative filtering, Recommender.
In his March article for CACM, Greg Linden opines that RMSE (Root Mean Square Error) and similar measures of recommender accuracy are not necessarily the best ways to assess their value to users. He suggests that Top-N measures may be preferable if the goal is to predict what someone will really like.
“A recommender that does a good job predicting across all movies might not do the best job predicting the TopN movies. RMSE equally penalizes errors on movies you do not care about seeing as it does errors on great movies, but perhaps what we really care about is minimizing the error when predicting great movies.”
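To make Linden's point concrete, here is a small toy sketch in Python (my own illustration, not anything from his article): two hypothetical recommenders are scored with RMSE and with a simple precision-at-N measure, and the one with the better average error is worse at surfacing the single article the user would love.

import numpy as np

def rmse(predicted, actual):
    # Root mean square error over all rated items.
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def precision_at_n(predicted, actual, n=1, like_threshold=4.0):
    # Fraction of the top-N predicted items that the user actually liked.
    top = np.argsort(predicted)[::-1][:n]
    return float(np.mean(np.asarray(actual)[top] >= like_threshold))

# Toy ratings: one article the user loves (5.0) among nine middling ones.
actual = [5.0, 2.0, 2.5, 1.0, 3.0, 2.0, 1.5, 2.5, 3.0, 2.0]
rec_a  = [4.1, 2.9, 3.4, 1.9, 3.9, 2.9, 2.4, 3.4, 3.9, 2.9]  # off by 0.9 everywhere, but ranks the loved article first
rec_b  = [2.5, 2.0, 2.5, 1.0, 3.0, 2.0, 1.5, 2.5, 3.0, 2.0]  # exact on the middling items, misses the one that matters

for name, preds in [("A", rec_a), ("B", rec_b)]:
    print(name, "RMSE:", round(rmse(preds, actual), 2),
          "P@1:", precision_at_n(preds, actual, n=1))

Recommender B wins on RMSE (about 0.79 versus 0.9) yet never puts the one great article at the top, which is exactly the gap between average prediction error and Top-N quality.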
This problem is compounded when it isn’t even possible to measure errors of any kind. Suppose you have an item-based recommender for journal articles in a digital library, and recommendations are restricted to items in the collection that the library owns. Recommendations generated from one collection may then be incommensurable with those generated from a different collection, so any quality measure would depend on the size of the collection.
How, then, would one go about evaluating recommendations in these circumstances? One way is for an expert to inspect the results and judge them for relevance or quality. Another is to measure some meta-properties of the recommendations, such as their semantic distance from one another or from the item they are being recommended from. At least you would be able to say that one recommender offers greater novelty or diversity than another.
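Here is a rough sketch of what measuring those meta-properties might look like, assuming each article is represented by a content vector such as TF-IDF over its title and abstract (the representation is my assumption, not part of any particular system): novelty as the average distance of the recommendations from the seed article, and diversity as the average pairwise distance within the recommended list.

import numpy as np
from itertools import combinations

def cosine_distance(u, v):
    # 1 - cosine similarity between two item vectors (e.g. TF-IDF of abstracts).
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def novelty(seed_vector, rec_vectors):
    # Average semantic distance of the recommendations from the seed article.
    return float(np.mean([cosine_distance(seed_vector, v) for v in rec_vectors]))

def intra_list_diversity(rec_vectors):
    # Average pairwise distance among the recommended items themselves.
    pairs = combinations(rec_vectors, 2)
    return float(np.mean([cosine_distance(u, v) for u, v in pairs]))

# Hypothetical term vectors for a seed article and five recommendations.
rng = np.random.default_rng(0)
seed = rng.random(50)
recs = [rng.random(50) for _ in range(5)]
print(round(novelty(seed, recs), 3), round(intra_list_diversity(recs), 3))

Two recommenders can then be compared on these scores even when their candidate pools differ, which side-steps the error-measurement problem above.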
This is the kind of approach taken by Òscar Celma and Perfecto Herrera in a paper delivered at Recommender Systems 2008. They concluded that content-based recommendations for music, being less biased by popularity (i.e. more driven by content similarity), offered greater novelty but also produced less user satisfaction.
While music listeners may appreciate novelty and diversity, my expectation is that users of recommenders for scholarly articles actually want something closer to “more like this” (content similarity) than “other users who looked at this also looked at that” (collaborative filtering).
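The two flavours can be sketched side by side (a toy illustration under my own assumptions: articles represented by content term vectors on one hand, and a binary user-by-article download matrix on the other; this is not the actual machinery of bX or Synthese):

import numpy as np

def normalize_rows(m):
    # L2-normalize each row so that dot products become cosine similarities.
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1, norms)

def content_similarity(item_features):
    # "More like this": cosine similarity between article content vectors.
    x = normalize_rows(np.asarray(item_features, dtype=float))
    return x @ x.T

def usage_similarity(user_item):
    # "Users who looked at this also looked at that": cosine similarity
    # between the columns of a binary user-by-article usage matrix.
    x = normalize_rows(np.asarray(user_item, dtype=float).T)  # rows become articles
    return x @ x.T

def recommend(sim, seed, k=3):
    # Top-k most similar articles to a seed article, excluding the seed itself.
    order = np.argsort(sim[seed])[::-1]
    return [j for j in order if j != seed][:k]

# Three users, four articles: a 1 means the user downloaded the article.
usage = np.array([[1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1]])
print(recommend(usage_similarity(usage), seed=0, k=2))  # usage-based neighbours of article 0

The same recommend() function works over either similarity matrix; what differs is whether “similar” means similar text or similar readers.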
At least that’s the conclusion (not yet scientifically corroborated) that I came to when I compared a usage-only recommender (‘bX’ from Ex Libris) to a citation-only recommender for scholarly articles (Synthese). At first blush ‘bX’ produces more “interesting” recommendations (greater diversity) whereas Synthese (in citation-only mode anyway) generates more “similar” recommendations.
Perhaps what the user needs is both kinds of recommenders – depending on their information retrieval needs.