
Visualizing Movie Revenues March 4, 2008

Posted by Andre Vellino in CISTI Visualization, User Interface, Visualization.
2 comments

This New York Times Flash visualization of how movies have fared at the box office over time has received a lot of attention in the blogosphere, but it deserves the attention it’s getting. Simply put – it’s beautiful.

N.Y. Times visualization of box office revenues

The icing on the cake for the authors of this piece must be Ben Shneiderman’s 6-line comment on the Portfolio.com blog post about how this thing came to be. Nothing like praise from the father of it all.

Recommending BioMedical Articles March 2, 2008

Posted by Andre Vellino in CISTI Visualization, Citation, Collaborative filtering, Digital library, Recommender.
1 comment so far

I have just finished an initial prototype of a recommender for a digital library. This web application was built using Sean Owen’s open source collaborative filtering toolkit Taste (with a lot of adaptations by Dave Zeber). It uses data from 1.6 million articles in a collection of about 1500 bio-medical journals.

This demo isn’t ready to be made publicly available, in part because of some licensing uncertainties about the meta-data. Later this quarter I may be able to put a more polished version on CISTI Lab (currently undergoing a makeover, so please forgive the “under construction” skin) using the NRC Press collection, although I’m worried that the citation graph for that collection may be too sparse to yield reliable recommendations.

The Synthese Recommender uses many of the ideas from TechLens. For example, to seed the recommender with ratings we use the citation graph for the collection. Out of the 1.6M articles, 370K qualify as “useful” for recommending – i.e. only those articles with 3 or more citations. The total number of citations is ~1.5M, making the average number of citations per article roughly 4.
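The seeding step can be sketched roughly as follows. This is a minimal illustration, not the Synthese or Taste code: the function name `seed_ratings` and the input format (a list of `(citing, cited)` identifier pairs) are hypothetical, and it assumes that “3 or more citations” refers to the references an article makes to other articles in the collection.

```python
from collections import defaultdict

MIN_CITATIONS = 3  # threshold quoted in the post


def seed_ratings(citations, min_citations=MIN_CITATIONS):
    """Turn (citing, cited) pairs into implicit ratings.

    Each citing article plays the role of a "user" and each article it
    cites plays the role of an "item" with a rating of 1.0. Articles with
    fewer than `min_citations` distinct citations are dropped.
    """
    cited_by_article = defaultdict(set)
    for citing, cited in citations:
        cited_by_article[citing].add(cited)  # a set ignores duplicate pairs

    return {
        citing: {cited: 1.0 for cited in cited_set}
        for citing, cited_set in cited_by_article.items()
        if len(cited_set) >= min_citations
    }
```

With this representation, any user-based or item-based collaborative filter can treat an article’s reference list as if it were a user’s set of rated items.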

In contrast, the number of citations per article in the 100K-article Citeseer collection used in TechLens (which, incidentally, is now in its next generation with CiteseerX, whose design can be read about here) is roughly 12. It strikes me as a little odd that our bio-medical collection should have almost 3 times fewer citations per article. I will have to look at the citation data more carefully! [P.S. I did look into this and "4" is the average number of references per article for which we have entries in the bibliographic database. I am told that biomedical articles do in fact have a much higher number of overall references than computer science articles.]

Compared with data from consumer-product recommenders, citation-based “ratings” in a digital library are much (three orders of magnitude) more sparse. For instance, the Netflix prize data contains 100 million ratings from 480 thousand customers over 17,000 movie titles. That’s roughly 1.2% of non-zeros in the user-item matrix. With 1.5 million citations (“ratings”) on 370K users and 370K items we get roughly 0.00116% of non-zeros.
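The density figures are just the number of ratings divided by the number of cells in the user-item matrix. A quick check, using only the rounded counts quoted above (the helper name `density_percent` is made up for this sketch):

```python
def density_percent(num_ratings, num_users, num_items):
    """Percentage of non-zero cells in a users x items rating matrix."""
    return 100.0 * num_ratings / (num_users * num_items)


# Rounded counts as quoted in the post.
netflix = density_percent(100_000_000, 480_000, 17_000)
library = density_percent(1_500_000, 370_000, 370_000)

print(f"Netflix:         {netflix:.2f}%")
print(f"Digital library: {library:.4f}%")
print(f"Ratio:           {netflix / library:.0f}x")
```

With these rounded inputs the library density comes out at about 0.0011%, close to the figure quoted above, and the ratio is a little over 1,000, which is where the “three orders of magnitude” comes from.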

What do you think the odds are that applying PageRank to give a numeric value to citation-based ratings is going to affect the quality of recommendations? Stay tuned for the answer (in a couple of months, probably.)
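For readers who want to picture the experiment, here is a rough sketch of one way it could be set up. It is an assumption-laden illustration, not the method used in the prototype: the function names, the damping factor of 0.85, the fixed iteration count and the choice to scale ratings by the top PageRank score are all hypothetical.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Power-iteration PageRank over (citing, cited) pairs."""
    nodes = sorted({node for pair in citations for node in pair})
    out_links = {node: set() for node in nodes}
    for citing, cited in citations:
        if citing != cited:  # ignore self-citations
            out_links[citing].add(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by articles that cite nothing is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def weighted_ratings(citations, rank):
    """Rate each cited article by its PageRank relative to the top score."""
    if not rank:
        return {}
    top = max(rank.values())
    ratings = {}
    for citing, cited in citations:
        ratings.setdefault(citing, {})[cited] = rank[cited] / top
    return ratings
```

The only change from binary citation ratings is that a citation to a highly ranked article counts for more than a citation to an obscure one; whether that helps recommendation quality is exactly the open question.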

BioMedExperts January 13, 2008

Posted by Andre Vellino in CISTI Visualization, Citation, Information retrieval, Recommender, Search, Social networks.
1 comment so far

I’m not sure exactly what “social networking site for scientists” means, but whatever it is, it comes in many flavours. There’s the “Facebook” / “LinkedIn” kind of site, like Nature’s, with forums, blogs, people with whom to make connections etc. There’s the “Del.icio.us” / “Connotea” bookmark-centric kind, like Elsevier’s 2collab. And there’s the “Google Scholar” type of search engine, like GoPubMed, that has been enhanced with subject-specific capabilities such as MeSH and GeneOntology lexicons to improve relevance and classification. GoPubMed also features the ability to search for authors (e.g. by frequency of publication) and journals (e.g. by impact factor).

One of the business analysts at CISTI (Naomi Krym) pointed me to the recent launch of BioMedExperts – a new social networking site for bio-med scientists. It was developed by Collexis (and Dell, which supplied the hardware) and combines large subsets of the functionalities in the above services. You can define your own publishing profile as an author, invite authors to your network, define your academic profile and so forth. Collexis also offers “context sensitive search”, whose search results, like GoPubMed’s, are driven by biomed ontologies.

What I like most about BioMedExperts is the UI that Collexis has devised to help the user navigate the huge network of authors from a citation network. Here’s what the applet looks like:

BioMedExpert-Network

The goal of Recommender Systems is sometimes framed as “give me what I want” vs. “give me the tools to explore the space so that I can find what I want”. The Collexis applet does an interesting job of the latter for authors and citations.

However, using this applet for even a few minutes demonstrates the need for automated tools that also “recommend” (in some generic sense) or at least remove or hide irrelevant information. Without some kind of recommendation capability there’s just too much data to display in such a small area: it needs to be condensed somehow. Given the appropriate controls (e.g. the slider bars at the top of the Collexis applet) a recommender system could show you a range between the Top N recommendations and the “long tail” in the space of possible recommendations.

So what are the “appropriate controls” for a recommender? Well, it depends on the space of objects being recommended and the recommender algorithm(s). For authors, for example, one of the slider bars could be the weighting given to the text-content similarity of other authors’ articles. Another slider bar could control the display by the similarity of authors’ citation patterns.
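As a rough sketch of what two such sliders might do behind the scenes, each slider can be read as a weight on one similarity score, with the display showing the top N authors under the blended score. Everything here is hypothetical: the function names, the idea of normalising by the sum of the weights, and the example scores are made up for illustration and do not describe the Collexis applet.

```python
def blend_scores(text_sim, citation_sim, text_weight, citation_weight):
    """Combine two per-author similarity scores using two slider weights."""
    total = text_weight + citation_weight
    if text_weight < 0 or citation_weight < 0:
        raise ValueError("slider weights must be non-negative")
    authors = set(text_sim) | set(citation_sim)
    if total == 0:
        return {author: 0.0 for author in authors}
    return {
        author: (
            text_weight * text_sim.get(author, 0.0)
            + citation_weight * citation_sim.get(author, 0.0)
        ) / total
        for author in authors
    }


def top_n(scores, n):
    """Return the n best-scoring authors, ties broken alphabetically."""
    return sorted(scores, key=lambda author: (-scores[author], author))[:n]


# Made-up similarity scores for three authors.
text_sim = {"Author A": 0.9, "Author B": 0.2}
citation_sim = {"Author A": 0.1, "Author B": 0.8, "Author C": 0.5}

by_text = top_n(blend_scores(text_sim, citation_sim, 1.0, 0.0), 2)
by_citations = top_n(blend_scores(text_sim, citation_sim, 0.0, 1.0), 2)
```

Moving the text slider to its maximum surfaces Author A first, while moving the citation slider to its maximum surfaces Author B and Author C; intermediate positions trade the two off.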

For recommending articles with collaborative filtering – e.g. from implicit ratings from users’ viewing patterns – a slider control could weight the articles that are most similar by usage (in different time windows) or by users’ explicit ratings (e.g. “innovation” / “information” / “authority”.)

We’re still not quite there yet, but I think that something like Collexis’ applet is a promising interface for navigating recommendations.

PageRank for Ranking Journals January 10, 2008

Posted by Andre Vellino in CISTI Visualization, Collaborative filtering, Recommender.
5 comments

The latest entry in BioMed Central’s blog points us to an alternative database of journal citation metrics from Spain: SCImago.

It uses 13,000 journals, many from Scopus (one wonders – how did they get the IP rights to use the citation data?!)

Like EigenFactor, SCImago performs Journal Ranking using a PageRank-like algorithm.

SCImago also has a nice graphing tool that allows you to look at co-citation maps by subject:

CoCitationsInCanada2006

and citation frequency bubble-charts:

CitationFrequencyInCanada2006-Bubble

by topic and by country for a given year.

It wouldn’t take much to animate sequences of these bubble-maps and show how citation numbers are changing over time, a bit the way Gapminder does it.

In the rankings by country over the past 10 years, Canada’s article citation ranking is consistently 7th by absolute numbers. On a per capita basis Canada is 6th in cited publications, ahead of the U.S., Germany and France; #1 and #2 per capita are Switzerland and Sweden.
