The HIP-index: A Better Measure of Research Impact
November 16, 2013. Posted by Andre Vellino in Bibliometrics, Citation Analysis, Statistical Semantics.
Tags: Citation analysis
Eighteen months ago, Xiaodan Zhu, Peter Turney, Daniel Lemire and I embarked on an experiment to see whether we could identify the features of an article that distinguish its critical references from its incidental ones. We thought that being able to identify the crucial references would help us devise a better researcher productivity index – one better than the h-index.
I am happy to report that we were successful! In September I gave an overview presentation to the U. Ottawa School of Information Studies that describes the problem we were trying to solve, our methods and results. Since then our paper has been accepted for publication in JASIST, most likely in a 2014 issue.
To automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper, we examined the effectiveness of a variety of candidate features – positional features, semantic features, context features and citation-frequency features – that might be predictors of the academic influence of a citation. We asked the authors of 100 papers to identify the key references in their own work and created a dataset in which citations were labeled according to their academic influence (note that this dataset is made available under the Open Data Commons Public Domain Dedication and License). We then used supervised machine learning to perform feature selection and found a model that predicts academic influence effectively using only four features.
The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, which counts every citation equally, the hip-index weights each citation by the number of times the reference is mentioned in the citing paper. We show that the hip-index has better precision than the conventional h-index at predicting ACL Fellows on a collection of 20,000 articles from the ACL Digital Archive of Research Papers.
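For the curious, here is a toy sketch in Python (with invented example numbers, and a deliberately simplified weighting – not the exact scheme from our paper) of how in-text mention counts can be folded into an h-index-style computation:

```python
def h_index(citation_counts):
    """Conventional h-index: the largest h such that h papers
    each have at least h citations."""
    h = 0
    for rank, count in enumerate(sorted(citation_counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def hip_index(mentions_per_citing_paper):
    """Influence-primed sketch: each citing paper contributes its
    in-text mention count instead of a flat 1, so references that
    are discussed repeatedly count for more than passing mentions."""
    weighted_counts = [sum(mentions) for mentions in mentions_per_citing_paper]
    return h_index(weighted_counts)

# One author with three papers; each inner list holds the number of
# times each citing paper mentions that paper in its running text.
papers = [[1, 3, 1], [2], [1, 1]]
```

In this toy example the first paper, whose citers mention it five times in total, carries more weight than the other two, even though all three have comparable raw citation counts.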
P.S. (Nov. 18) Daniel Lemire in his related blog post gives the following credit, which I entirely share: Most of the credit for this work goes to my co-authors. Much of the heavy lifting was done by Xiaodan Zhu.
Protecting Yourself from Spies
September 7, 2013. Posted by Andre Vellino in Ethics, Human Rights, Information.
I once worked for a company that makes the kind of software that the NSA and CSIS appear to be using to monitor email and internet metadata (see the Guardian for a quick survey of the metadata that exists in different digital media).
I might add that I think there is nothing morally wrong with the surveillance technology itself – indeed it can be used to protect privacy and prevent harm. It is more a question of whether our privacy rights are violated when the technology is used and whether those rights should be relinquished to the state for the greater good.
The recent revelation that the presumption of privacy even when engaging in encrypted transactions is erroneous adds fuel to my concern that people don’t make informed decisions about what information they disclose, and don’t try to protect their information even when it is quite easy to do. This post highlights some software solutions you can use to reduce the likelihood that your private information is monitored.
Let’s start with web browsing. The amount of information that a web server can glean from your web browser’s attempt to connect with it is quite voluminous. To see what a server can find out about your browser and computer, try this link:
Furthermore, the combination of these browser characteristics, while it may not reveal your personal identity directly, can still identify you uniquely. Try this test from the Electronic Frontier Foundation:
When I try it, they assert that my collection of browser characteristics, i.e. my browser “fingerprint”, is unique among the 3 million or so they have tested.
There is not much you can do to limit the uniqueness of your browser’s fingerprint other than having a generic computer and a generic browser configuration. Using the TOR browser / network (see below) helps to reduce the uniqueness of your browser-fingerprint, but there are tradeoffs (response speed for one thing).
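To get an intuition for why a handful of mundane attributes suffice to single you out, here is a minimal sketch in Python (the attribute names and values are made up for illustration) of how a server could derive a fingerprint from the characteristics a browser discloses:

```python
import hashlib

def browser_fingerprint(attributes):
    """Hash a dict of server-visible browser characteristics into a
    single fingerprint string. Each attribute (user agent, fonts,
    plugins, screen size, timezone) adds entropy; in combination
    they are very often globally unique."""
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

fp = browser_fingerprint({
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/24.0",
    "screen": "1920x1080x24",
    "timezone": "UTC-5",
    "fonts": "Arial,Courier,Helvetica,Times",
    "plugins": "Flash 11.9,Java 7u45",
})
```

Change any one attribute – install a font, update a plugin – and the fingerprint changes completely, yet each configuration remains stably trackable for as long as it persists.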
There was a time when I thought that HTTP-Secure (“https”) was a reliable way of ensuring that information between your browser and the end-point server (e.g. a bank) could not be intercepted or tampered with. The revelation that the NSA is able to decrypt such communications reduces my confidence that this method is “secure” in any meaningful way, but at least it offers some assurance that not just anybody can read or tamper with such transactions.
If that level of confidence is sufficient for you, then you might consider adding the HTTPS Everywhere plugin (brought to you by the Electronic Frontier Foundation) to your browser.
The TOR browser / encrypted network system describes itself as
…free software and an open network that helps you defend against a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security
In principle, the Onion Routing technology behind it offers the end-user a high degree of anonymity and untraceability. However, if anyone can break SSL, the next step is to break TOR.
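The layering idea behind Onion Routing can be sketched in a few lines of Python. The cipher below is a deliberately toy XOR construction standing in for real cryptography – this is an illustration of the layering, not of secure encryption:

```python
import hashlib

def _xor_stream(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR the data with a SHA-256-derived
    keystream. Illustration only -- not cryptographically secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def onion_wrap(message: bytes, relay_keys):
    """Encrypt in layers: the exit relay's layer goes on first,
    the entry relay's layer last."""
    for key in reversed(relay_keys):
        message = _xor_stream(message, key)
    return message

def relay_peel(message: bytes, key: bytes):
    """Each relay removes exactly one layer; it sees neither the
    plaintext nor the full route -- only the next hop."""
    return _xor_stream(message, key)

keys = [b"entry", b"middle", b"exit"]
cell = onion_wrap(b"GET / HTTP/1.1", keys)
for key in keys:  # each relay in turn peels its own layer
    cell = relay_peel(cell, key)
# cell is now the original plaintext
```

The point of the layering is that no single relay can connect sender to destination: the entry relay knows who you are but not what you asked for, and the exit relay knows the request but not who made it.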
File and file system encryption
If you want to protect computer files, or indeed a whole file system (e.g. in case your laptop is stolen or your USB key is lost) you should try TrueCrypt. It offers operating-system-level, on-the-fly encryption, file-level encryption and partition encryption. Best of all, TrueCrypt is open source (so you can check for yourself, if you have the patience and know-how, that there are no backdoors for the NSA or CSIS).
Securing email is a bit trickier. There is no meaningful way to encrypt e-mail metadata: the very nature of e-mail addressing and of store-and-forward protocols like SMTP requires that metadata. Which, of course, is a fundamental design flaw with email.
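A short Python sketch (with made-up addresses) shows the problem: even when the body is encrypted end-to-end, the headers that every relay needs in order to route the message remain readable at every hop:

```python
from email.message import EmailMessage

# Even with a PGP-encrypted body, the headers that SMTP servers need
# for store-and-forward routing stay in the clear at every relay.
msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "bob@example.net"
msg["Subject"] = "Quarterly report"  # subjects are metadata too
msg.set_content(
    "-----BEGIN PGP MESSAGE----- ...ciphertext... -----END PGP MESSAGE-----"
)

# Everything a passive observer on any relay can log, body unread:
metadata = {name: msg[name] for name in ("From", "To", "Subject")}
```

Who talked to whom, when, how often, and about what subject line: that is precisely the traffic-analysis data the surveillance programs are after, and no amount of body encryption hides it.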
It appears that most people think that their privacy is worth sacrificing in exchange for safety and protection by government. This is short-sighted. A benevolent government in whose integrity you trust might do the right thing at any point in time, but the issue is a matter of principle. You should not relinquish your right to privacy to the state.
As Bruce Schneier wrote in The Guardian:
By subverting the internet at every level to make it a vast, multi-layered and robust surveillance platform, the NSA has undermined a fundamental social contract…
We have a moral duty to [dismantle the surveillance state], and we have no time to lose.
In the meantime we can at least do better to protect ourselves.
Some Problems with MOOCs
August 17, 2013. Posted by Andre Vellino in Education, Ethics.
Michael Sandel‘s acclaimed undergraduate lectures at Harvard on Justice are now offered as a MOOC at EdX, and watching them for a second time gave me insight into a few significant shortcomings of recorded lectures.
First, they have a limited shelf-life. However perennial the issues (e.g. “What is Justice?”), what makes a lecture a learning experience for students is the process of investigation and enquiry. While Sandel’s recordings of his lectures are a master class in how to engage students, how to foster critical thinking and how to make issues pertinent and alive, their very nature as recordings ultimately limits them to being historical documents.
For instance, since 2005 – the year in which these lectures were recorded – the richest person in the world (taken as an example of [potential] financial injustice) is no longer Bill Gates (it’s Carlos Slim Helu), significant examples of greed and inequality are better illustrated with the 2007-2008 financial crisis and there have been many changes in U.S. politics since the election of President Obama.
At least as importantly, watching these lectures leaves the viewer wanting interaction with the lecturer. Listening to young minds grappling with the issues is pedagogically interesting, but as a student what you really want is to be in the audience asking questions, taking positions and arguing with the lecturer and fellow students.
As a taste of how a student might benefit from a Harvard education, having a course such as this on-line is wonderful. And it is clearly of value to anyone who would be unable to attend or afford such an education. But it is no substitute for the real experience.
So, for these two reasons alone, I think that MOOCs will, at best, be a complement to a university education, not an alternative to it.
Freedom Abhors a Chill
March 24, 2013. Posted by Andre Vellino in Ethics.
Jian Ghomeshi’s opening monologue on CBC’s radio program Q is the latest salvo against Library and Archives Canada’s new Code of Conduct. In it he uses the phrase “Freedom Abhors a Chill”. And a chill it is:
Members of Parliament for the Official Opposition Andrew Cash and Pierre Nantel gave the Heritage Minister a piece of their mind about it in the Canadian House of Commons:
Jim Turk, Executive Director of the Canadian Association of University Teachers (CAUT) gave a clear explanation of what’s at stake in an interview on Radio Canada International.
My question is – we’ve expressed our collective outrage at this Orwellian nightmare – and now what? Do we decide that Federal archival and library institutions are doomed and take on their role on the remaining islands of democracy or “…take arms against a sea of troubles, and by opposing end them”?
Is Clippy the Future?
February 8, 2013. Posted by Andre Vellino in Artificial Intelligence, Collaborative filtering, Data Mining.
The student-led Information without Borders conference that I attended at Dalhousie yesterday was truly excellent – as much for its organization (all by students!) as for its diverse topics: the future of libraries, cloud computing, recommender systems, SciVerse apps and the foundations for innovation.
At the panel discussion in which I participated, I suggested that to predict the future one need only look at the past. To predict the iPad one needed only look at the Apple Newton (which died in 1998). What was the analog, I wondered, for an information retrieval tool, now dead and buried, that might still evolve into something we all want in the field of information management?
I proposed that the future of information retrieval might be something like an evolved Office Assistant (affectionately known as “Clippy”) – the infamous, now deceased Microsoft paperclip that assisted you in understanding and navigating Microsoft products.
My vision for a next generation Clippy was clearly not well articulated since it prompted the following tweet from Stephen Abram:
I think that Siri (about which I posted a few years ago) belongs to the old Clippy style of annoying and in-the-way-of-what-I-want-to-do applications. I am surprised it has survived so long and was promoted by Apple so strongly. I predict it will join Clippy, Google Wave and Google Glass on the growing heap of unwanted technologies that were not ready for prime-time.
Watson (who is now going to medical school, and about which I also posted a couple of years ago) is, however, just the sort of Natural Language Understanding component technology that I have in mind for an interactive, personal information assistant. When a computer that now costs three million dollars and has 15 terabytes of RAM can fit in your pocket and cost $500, a Watson-like system that understands natural language queries will be an important component of Clippy++.
What neither Watson nor Siri have – and this, I foresee, is the most significant attribute of “Clippy++” – is personalization and autonomy. What will make true personalization possible with “Clippy++” is our collective willingness to accept the intrusion of a mechanical supervisor that learns from our behaviour what we want, need and expect.
This culture-shift is happening right now – we gladly and willingly disclose our information consumption habits to supervisory software and data-analytics engines in exchange for entertainment and social networking. It won’t be long before we’re willing to do that for serious, personalized information management purposes as well.
The key, though, is going to be the interaction – the dialogue that we have with Clippy++ – and it will have to explain its actions and recommendations. That is going to be the hallmark of its evolution to Machina Sapiens.
The End of Files
December 8, 2012. Posted by Andre Vellino in Data, Digital library.
A few weeks ago, I boldly predicted in my class on copyright that the computer file is as doomed to the annals of history as the piano roll (the last of which was produced in 2008 – see this documentary video on YouTube on how they are made and copied!)
This is a slightly different prediction than the one made by the Economist in 2005: Death to Folders. Their argument was that folders as a method of organizing files were obsolete and that search, tagging and “smart folders” were going to change everything. My assertion is that the very notion of a file – these things that are copied, edited and executed by computers – will eventually disappear (to the end-user, anyway).
The path to the “end of files” is more than just a question of masking the underlying data-representation from the user. It is true that Apps (as designed for mobile devices) have begun to do that as a convenient way of hiding the details of a file from the user – be it an application file or a document file. The reason that Apps (generally) contain within them the (references to) data-items (i.e. files) that they need, particularly if the information is stored in the cloud, is to provide a Digital Rights Management scheme. Which is no doubt why this App model is slowly creeping its way from mobile devices to mainstream laptops and desktops (viz. Mac OS Mountain Lion and Windows 8).
But this is just the beginning. There’s going to be a paradigm shift (a perfectly fine phrase, when it’s used correctly!) in our mental representations of computing objects and it is going to be more profound than merely masking the existence of the underlying representation. I think the new paradigm that will replace “file” is going to be: “the set of information items and interfaces that are needed to perform some action in the current use-context”.
Consider, as an example of this trend towards the new paradigm, Wolfram’s Computable Document Format. In this model, documents are created by dynamically assembling components from different places and performing computations on them: distributed, raw information components – data, mostly – are assembled in the application and don’t correspond to a “file” at all. Or consider information mashups like Google Maps, in which restaurant reviews and recommendations are generated as a function of search-history, location and user-identity. These “content-bundles”, for want of a better phrase, are definitely not files or documents but, from the end-user’s point of view, they are indistinguishable from them.
Even MS Word DocX “files” are instances of this new model. The Office Open XML file format is a standardized data-structure: XML components bound together in a zip file. Imagine de-regimenting this convention a little and what constitutes a “document” could change quite significantly.
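You can see this for yourself in a few lines of Python: build a minimal zip archive of XML parts (a made-up stand-in for a real .docx, with only token content) and list the components a word processor would assemble into what the user perceives as one document:

```python
import io
import zipfile

# A .docx "file" is really a zip archive of XML parts that the word
# processor assembles on the fly. Build a minimal stand-in in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document>Hello</w:document>")
    archive.writestr("word/styles.xml", "<w:styles/>")

# Re-open the archive and enumerate the component parts.
with zipfile.ZipFile(buf) as archive:
    parts = archive.namelist()
# parts: ['[Content_Types].xml', 'word/document.xml', 'word/styles.xml']
```

The “document” the user sees is already an assembly of separate components; loosen the packaging convention so the components can live anywhere, and the file boundary dissolves.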
Conventional, static files will continue to exist for some time and version control systems will continue to provide change management services to what we now know as “files”. But I predict that my grand children won’t know what a file is – and won’t need to. The procedural instructions required for assembling information-packages out of components, including the digital rights constraints that govern them, will eventually dominate the world of consumable digital content to the point where the idea of a file will be obsolete.