The Identity of Objects March 14, 2008
Posted by Andre Vellino in Digital Identity, Epistemology, Semantics. 3 comments
I was listening to my colleague Richard Ackerman give a preview of his upcoming keynote address at the National Information Standards Organization (NISO) forum when Brian Cantwell Smith’s book On The Origin of Objects popped into mind. (I wrote a short review of that book many moons ago and I’m a big fan of it.) Brian is now Dean of the Faculty of Information Studies at the University of Toronto, and those of us who have enjoyed The Origin have been patiently waiting for the publication of “The Age of Significance”, a 7-volume series that fleshes out some details.
Brian’s book came to mind because of the point Richard makes in his presentation that computers love unique identifiers for objects – books, articles, authors – and that we don’t really have good standards for identifying things. Even if you take into account efforts like Digital Object Identifiers (DOI), the task of providing unique references to persistent digital objects presents significant hurdles, such as dealing with versions.
Netflix Prize article in Wired March 6, 2008
Posted by Andre Vellino in Collaborative filtering, Recommender. 2 comments
This item in Wired magazine on a new approach to winning the Netflix Prize made for interesting reading. The original Slashdot post from a couple of days ago, “Psychologist Beating Math Nerds in Race to Netflix Prize”, made it sound like big news rather than merely an interesting development.
Visualizing Movie Revenues March 4, 2008
Posted by Andre Vellino in CISTI Visualization, User Interface, Visualization. 2 comments
This New York Times Flash visualization of how movies have fared at the box office over time has received a lot of attention in the blogosphere, but it deserves the attention it’s getting. Simply put – it’s beautiful.
The icing on the cake for the authors of this piece must be Ben Shneiderman’s 6-line comment on the Portfolio.com blog post about how this thing came to be. Nothing like praise from the father of it all.
Recommending BioMedical Articles March 2, 2008
Posted by Andre Vellino in CISTI Visualization, Citation, Collaborative filtering, Digital library, Recommender. 1 comment so far
I have just finished an initial prototype of a recommender for a digital library. This web application was built using Sean Owen’s open source collaborative filtering toolkit Taste (with a lot of adaptations by Dave Zeber). It uses data from 1.6 million articles in a collection of about 1500 bio-medical journals.
This demo isn’t ready to be made publicly available, in part because of some licensing uncertainties about the meta-data. Later this quarter I may be able to put a more polished version on CISTI Lab (currently undergoing a makeover, so please forgive the “under construction” skin) using the NRC Press collection, although I’m worried that the citation graph for that collection may be too sparse to yield reliable recommendations.
The Synthese Recommender uses many of the ideas from TechLens. For example, to seed the recommender with ratings we use the citation graph for the collection. Out of the 1.6M articles, 370K qualify as “useful” for recommending – i.e. only those articles with 3 or more citations. The total number of citations is ~1.5M, making the average number of citations per article roughly 4.
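As a rough illustration of the seeding idea (not the actual Synthese or TechLens code), here is a minimal Python sketch. It assumes the citation graph is available as (citing, cited) pairs, treats each citing article as a "user" and each cited article as an "item", and reads "3 or more citations" as three or more outgoing references. The function name `seed_ratings`, the `min_refs` parameter and the toy article ids are all hypothetical.

```python
from collections import Counter

def seed_ratings(citations, min_refs=3):
    """Turn (citing, cited) pairs into (user, item, rating) triples.

    Assumption: an article is "useful" if it cites at least `min_refs`
    other articles in the collection. Every citation becomes a rating of 1.0.
    """
    refs_per_article = Counter(citing for citing, _ in citations)
    useful = {article for article, n in refs_per_article.items() if n >= min_refs}
    return [(citing, cited, 1.0) for citing, cited in citations if citing in useful]

# Toy citation graph with made-up article ids.
edges = [
    ("a", "x"), ("a", "y"), ("a", "z"),              # "a" cites 3 articles: kept
    ("b", "x"),                                      # "b" cites 1 article: dropped
    ("c", "x"), ("c", "y"), ("c", "z"), ("c", "a"),  # "c" cites 4 articles: kept
]
ratings = seed_ratings(edges)
users = {user for user, _, _ in ratings}
print(len(ratings), "ratings from", len(users), "articles")
```

Dividing the number of ratings by the number of kept articles gives the same kind of "citations per article" average quoted above.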
In contrast, the number of citations per article in the 100K article Citeseer collection (which, incidentally, is now in its next generation with CiteseerX, whose design can be read about here) used in TechLens is roughly 12. It strikes me as a little odd that our bio-medical collection should have only about a third as many citations per article. I will have to look at the citation data more carefully! [P.S. I did look into this and "4" is the average number of references per article for which we have entries in the bibliographic database. I am told that biomedical articles do in fact have a much higher number of overall references than computer science articles.]
Compared with data from consumer-product recommenders, citation-based “ratings” in a digital library are much (three orders of magnitude) more sparse. For instance, the Netflix prize data contains 100 million ratings from 480 thousand customers over 17,000 movie titles. That’s roughly 1.2% of non-zeros. With 1.5 million citations (“ratings”) on 370K users and 370K items we get roughly 0.00116% of non-zeros.
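The sparsity comparison is just the number of ratings divided by the size of the user-by-item matrix. A small Python sketch of that arithmetic, using the rounded figures quoted in this post (the function name `density_percent` is my own):

```python
def density_percent(num_ratings, num_users, num_items):
    """Percentage of non-zero cells in a users x items rating matrix."""
    return 100.0 * num_ratings / (num_users * num_items)

# Rounded figures from the post.
netflix = density_percent(100_000_000, 480_000, 17_000)
library = density_percent(1_500_000, 370_000, 370_000)

print(f"Netflix density: {netflix:.2f}%")
print(f"Library density: {library:.5f}%")
print(f"Netflix is about {netflix / library:.0f} times denser")
```

With these rounded inputs the library density comes out near 0.0011%, a shade below the 0.00116% quoted above, presumably because that figure was computed from unrounded counts. Either way the ratio is a bit over 1,000, i.e. three orders of magnitude.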
What do you think the odds are that applying PageRank to give a numeric value to citation-based ratings is going to affect the quality of recommendations? Stay tuned for the answer (in a couple of months, probably.)
