International Digital Curation Conference 2015 February 17, 2015

Posted by Andre Vellino in Data, Data Curation.

I had never intended to leave this blog void of entries in 2014, let alone leave it with a “top 10” list as the last entry.  So it’s time to re-boot Synthese with a short report on the 2015 International Digital Curation Conference.

The opening keynote by Tony Hey was both a master-class in how to give a compelling lecture and an impressive demonstration of how much one person can know about his field.  When the video of this talk comes out, watch it!

It was also great to see such a wide variety of topics in the poster sessions. A poster on Data Citation was the award winner (I still can’t believe that the graduate student who did this research had to pay for her own subscription to Web of Science to do it!). The runner-up award for best paper went to a paper about authorship attribution metadata for climate datasets.

Climate data figured quite prominently, including at least three talks: one on implementing an ISO standard, MOLES3 (Metadata Objects Linking Environmental Sciences), at the Centre for Environmental Data Archival; a second on Twenty years of data management in the British Atmospheric Data Centre; and my own on Harmonizing metadata among diverse climate change datasets.

There were 3 parallel sessions on the second day – one just has to be resigned to giving up on two thirds of the interesting talks. I did go to this one: A system for distributed minting and management of persistent identifiers, which I found especially intriguing. In a sentence, it proposes to do for digital identifiers (e.g. DOI) what Bitcoin does for money. In other words, it’s a Bitcoin-like, distributed and secure method of generating unique identifiers.  I hope it succeeds.

This talk by the Ph.D. student Tiffany Chao, Mapping methods metadata for research data, struck me as a perfect application for text mining.  She proposes extracting the Methods and Instrumentation sections from the National Environmental Methods Index to generate metadata descriptors for the corresponding data files.  Right now her work is being done by hand to demonstrate its feasibility, but a machine could do it too.

I registered for a Data Carpentry workshop to “access life science data available on the web”.  I learned a little R programming, discovered the rOpenSci repository and got my feet wet with the AntWeb and gender packages. I look forward to graduating to rWBclimate, an R interface to the World Bank climate data in the climate knowledge portal.

One treasure trove led to another. I gate-crashed a small visualization hackathon workshop at which I discovered the British Library’s digital collection and the 1001 things that could be done to it if you had a small army of graduate students in the Digital Humanities at your disposal. Hopefully, that’s exactly what’s going to happen when the Universities of Cambridge, Oxford, Edinburgh, Warwick and University College London start collaborating at the Alan Turing Institute (to be located in the British Library).

The Data Spring Workshop was exciting in a different way – a lot of presenters gave lightning talks on their practical problems and solutions with managing data.  There was so much, I can hardly remember any of it!  One item stood out for me, though, because it addresses my pain: a method for re-creating and preserving the environments for computational experiments.  It took me about 1.2 minutes to become an instant convert to the Recomputation.org mission.

This only skims the surface, but it will have to do for now.

Top 10 Mac Software for 2013 December 26, 2013

Posted by Andre Vellino in Software Review.

This is a top 10 list of Mac software for 2013. Most of them are not new, but many are new to me for this year.

(1) 1Password : https://agilebits.com/onepassword

This is easily the most useful and valuable piece of software I own.  It’s a password vault that securely generates and stores passwords for all your logins. Free and Open Source equivalents include Password Safe and KeePass, but 1Password has them all beat on user interface, and that’s important when you use something every day. It’s true that Open Source alternatives have the security advantage that anyone can inspect the code for back-doors and security mistakes, but I am willing to trust Agile Bits.  Maybe it’s because they’re Canadian.

(2) BoxCryptor : https://www.boxcryptor.com/

Worry about storing your files in the cloud no more. Boxcryptor provides file-encryption  for cloud storage services, including Dropbox, Google Drive and SkyDrive.  For file encryption or even disk-level encryption I would have recommended TrueCrypt except that it hasn’t been updated in more than a year. For Windows systems, I would suggest Axcrypt.

(3) GPG Mail : https://gpgtools.org/

GNU Privacy Guard (GPG) is a tool for encrypting, decrypting, signing and verifying files or messages. Despite adding my GPG signature on all my e-mails for the past 5 months, no one has yet sent me an encrypted e-mail, but once everyone uses it, I predict it will be the spam-killer app.

(4) Things 2 : http://culturedcode.com/things/

If you’re not the most organized person in the world you’ll be grateful for this tool: it helps remind you of what you need to do, and when you need to do it.


(5) TeXStudio : http://texstudio.sourceforge.net/

TeX is 35 years old and still going strong.  TeXStudio is a pretty good text editor and a pretty interface for this rather complicated typesetting system.  Essential for writing camera-ready copy, particularly if it involves mathematical equations and symbols.


(6) Pixelmator : http://www.pixelmator.com/

If you don’t have the patience to learn Photoshop or even Gimp, Pixelmator likely does most of what you’ll want if you are a casual photo editor.

(7) WhatSize : http://www.whatsizemac.com/

Even a rarely used piece of software can be quite valuable.  Sometimes, when disk space seems to disappear, you really need to see how it is allocated. WhatSize does only one thing but it does it well.

(8) GoBan : http://www.sente.ch/

I don’t play computer games much, but when I do it’s the game of Go – still by far the most beautiful board game ever invented.  This app has a very nice UI for playing others online or against Go software like GNU Go or Pachi Go.

(9) Stellarium : http://www.stellarium.org/

Starry Night used to be the king of the hill of sky simulators for astronomy – and perhaps it still is – but Stellarium is quite a fine Open Source alternative that is quite a bit less complicated.

(10) Audacity : http://audacity.sourceforge.net/

Audio editing software is probably frustrating no matter how good the user interface. And Audacity’s user interface is frustrating!  But I keep coming back to it because it’s so available and does so much that’s useful (noise reduction, normalization, export to various formats, etc.)

Needless to say, I have no commercial or other interest in any product mentioned above, and I have paid for my personal licenses for all the commercial software listed: 1Password, Things 2, WhatSize, Pixelmator and GoBan.

The HIP-index: A Better Measure of Research Impact November 16, 2013

Posted by Andre Vellino in Bibliometrics, Citation Analysis, Statistical Semantics.


Eighteen months ago, Xiaodan Zhu, Peter Turney, Daniel Lemire and I embarked on an experiment to see if we could identify the features in an article that would enable us to identify the critical (vs. incidental) references.  We thought that being able to identify references that are crucial would help us devise a better researcher productivity index – one that was better than the h-index.

I am happy to report that we were successful!  In September I gave an overview presentation to the U. Ottawa School of Information Studies that describes the problem we were trying to solve, our methods and results. Since then our paper has been accepted for publication in JASIST, most likely in a 2014 issue.

To automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper, we examined the effectiveness of a variety of candidate features – positional features, semantic features, context features and citation-frequency features – that might be predictors of the academic influence of a citation. We asked the authors of 100 papers to identify the key references in their own work and created a dataset in which citations were labeled according to their academic influence (note that this dataset is made available under the Open Data Commons Public Domain Dedication and License). We then used supervised machine learning to perform feature selection and found a model that predicts academic influence effectively using only four features.

The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, the hip-index weights citations simply by how many times a reference is mentioned. We show that the hip-index has better precision than the conventional h-index at predicting ACL Fellows on a collection of 20,000 articles from the ACL Digital Archive of Research Papers.
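
For readers who prefer code to prose, here is a minimal Python sketch of the idea described above. It is only an illustration under stated assumptions, not the implementation from our paper: it assumes that each paper's score is the sum, over all citing papers, of the number of times the citing paper mentions it, and that the usual h-index rule is then applied to those scores. The function names and the input layout (one list of mention counts per paper) are hypothetical.

```python
def h_index(scores):
    """Largest h such that at least h papers have a score of at least h."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            h = rank
        else:
            break
    return h


def plain_h_index(mentions_per_paper):
    """Conventional h-index: every citing paper counts once."""
    return h_index([len(mentions) for mentions in mentions_per_paper])


def hip_index(mentions_per_paper):
    """Sketch of an influence-primed h-index (assumed weighting).

    mentions_per_paper[i] holds one number per citing paper: how many
    times that citing paper mentions paper i in its text.
    """
    return h_index([sum(mentions) for mentions in mentions_per_paper])


# Hypothetical author with four papers.
papers = [[3, 1], [3], [2, 2, 1], [1]]
print(plain_h_index(papers), hip_index(papers))
```

With this made-up data the citation counts are 2, 1, 3 and 1, while the mention-weighted scores are 4, 3, 5 and 1, so the second paper, cited once but mentioned three times by the paper citing it, counts for more under the weighted variant.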

P.S. (Nov. 18) Daniel Lemire in his related blog post gives the following credit, which I entirely share: Most of the credit for this work goes to my co-authors. Much of the heavy lifting was done by Xiaodan Zhu.

Protecting Yourself from Spies September 7, 2013

Posted by Andre Vellino in Ethics, Human Rights, Information.


I once worked for a company that makes the kind of software that the NSA and CSIS appear to be using to monitor email and internet metadata (see the Guardian for a quick survey of the metadata that exists in different digital media).

I might add that I think there is nothing morally wrong with the surveillance technology itself – indeed it can be used to protect privacy and prevent harm. It is more a question of whether our privacy rights are violated when the technology is used and whether those rights should be relinquished to the state for the greater good.

The recent revelation that the presumption of privacy is erroneous even for encrypted transactions adds fuel to my concern that people don’t make informed decisions about what information they disclose, and that they don’t try to protect their information even when it is quite easy to do so. This post highlights some software solutions you can use to reduce the likelihood that your private information is monitored.

Web Browsing

Let’s start with web browsing. The amount of information that a web server can glean from your web browser’s attempt to connect with it is considerable. To see what a server can find out about your browser and computer, try this link:


Furthermore, the combination of these browser characteristics, while it may not reveal your personal identity, can still identify you uniquely.  Try this test from the Electronic Frontier Foundation:


When I try it, the test reports that the information collected from my browser, i.e. my browser “fingerprint”, is unique among the 3M or so browsers they have tested.
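
To make the idea concrete, here is a small Python sketch of how a combination of individually harmless attributes can act as an identifier. It is an illustration only and not the EFF's actual method: the attribute names and values are hypothetical, and hashing a sorted list of attributes is simply one assumed way to turn them into a single fingerprint. The second function shows how much information "unique among N browsers" represents.

```python
import hashlib
import math


def fingerprint(attributes: dict) -> str:
    """Combine browser attributes into one stable identifier.

    Sorting the keys makes the result independent of the order in
    which the attributes were collected.
    """
    canonical = "\n".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def identifying_bits(population: int) -> float:
    """Bits of information needed to single out one browser among `population`."""
    return math.log2(population)


# Hypothetical attribute values, for illustration only.
browser = {
    "user_agent": "ExampleBrowser/1.0 (ExampleOS 10.9)",
    "screen": "1440x900x24",
    "timezone_offset": "300",
    "fonts": "Arial,Helvetica,Times",
}
print(fingerprint(browser)[:16], round(identifying_bits(3_000_000), 1))
```

Being unique among roughly 3 million browsers corresponds to about 21.5 bits of identifying information, which a handful of attributes such as installed fonts, screen size and plugin versions can easily supply between them.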

There is not much you can do to limit the uniqueness of your browser’s fingerprint other than having a generic computer and a generic browser configuration.  Using the TOR browser / network (see below) helps to reduce the uniqueness of your browser-fingerprint, but there are tradeoffs (response speed for one thing).


There was a time when I thought that HTTP-Secure (“https”) was a reliable way of ensuring that information between your browser and the end-point server (e.g. a bank) could not be intercepted or tampered with. The revelation that the NSA is able to decrypt such communications reduces my confidence that this method is “secure” in any meaningful way, but at least it offers some degree of assurance that not just anybody can either read or tamper with such transactions.

If that level of confidence is sufficient for you, then you might consider adding the HTTPS Everywhere plugin (brought to you by the Electronic Frontier Foundation) to your browser.


The TOR browser / encrypted network system describes itself as

…free software and an open network that helps you defend against a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security

In principle, the Onion Routing technology behind it offers the end-user a high degree of anonymity and untraceability. However, if anyone can break SSL, the next step is to break TOR.

File and file system encryption

If you want to protect computer files, or indeed a whole file system (e.g. in case your laptop is stolen or your USB key is lost), you should try TrueCrypt. It offers operating-system level, on-the-fly encryption, file-level encryption and partition encryption.  Best of all, TrueCrypt is open source (so you can check for yourself, if you have the patience and know-how, that there are no backdoors for the NSA or CSIS).

Also, for Windows PCs (or Wine-enabled Macs), AxCrypt is a pretty good and easy-to-use tool for encrypting files.


Securing email is a bit trickier. There is no meaningful way to encrypt e-mail metadata. The very nature of e-mail addressing and store-and-forward protocols like SMTP requires that metadata, which, of course, is a fundamental design flaw with email.

However, if you want to protect the content of what you say from prying eyes, you can try GNU Privacy Guard (GPG). Its precursor was PGP (Pretty Good Privacy) and Edward Snowden thinks it works.


It appears that most people think that their privacy is worth sacrificing in exchange for safety and protection by government.  This is short-sighted. A benevolent government in whose integrity you trust might do the right thing at any point in time, but the issue is a matter of principle. You should not relinquish your right to privacy to the state.

As Bruce Schneier wrote in The Guardian:

By subverting the internet at every level to make it a vast, multi-layered and robust surveillance platform, the NSA has undermined a fundamental social contract…..

We have a moral duty to [dismantle the surveillance state], and we have no time to lose.

In the meantime we can at least do better to protect ourselves.

Some Problems with MOOCs August 17, 2013

Posted by Andre Vellino in Education, Ethics.

Michael Sandel’s acclaimed undergraduate lectures at Harvard on Justice are now offered as a MOOC at edX, and watching them for a second time gave me an insight into a few of the significant shortcomings of recorded lectures.

First, they have a limited shelf-life. However perennial the issues are (e.g. “What is Justice?”), what makes it a learning experience for the students is the process of investigation and enquiry.  While Sandel’s recordings of his lectures are a master class on how to engage students, how to foster critical thinking and make issues pertinent and alive,  their very nature as recordings ultimately limits them to being historical documents.

For instance, since 2005 – the year in which these lectures were recorded – the richest person in the world (taken as an example of [potential] financial injustice) is no longer Bill Gates (it’s Carlos Slim Helu), significant examples of greed and inequality are better illustrated with the 2007-2008 financial crisis and there have been many changes in U.S. politics since the election of President Obama.

At least as importantly, watching these lectures leaves the viewer wanting interaction with the lecturer. Listening to young minds grappling with the issues is pedagogically interesting, but as a student what you really want is to be in the audience asking questions, taking positions and arguing with the lecturer and fellow students.

As a taste of how a student might benefit from a Harvard education, having a course such as this on-line is wonderful. And it is clearly of value to anyone who would be unable to attend or afford such an education.  But it is no substitute for the real experience.

So, for these two reasons alone, I think that MOOCs will, at best, be a complement to a university education, not an alternative to it.

Freedom Abhors a Chill March 24, 2013

Posted by Andre Vellino in Ethics.

Jian Ghomeshi’s opening monologue on CBC’s radio program Q is the latest salvo against the Library and Archives of Canada’s new Code of Conduct. In it he uses the phrase “Freedom Abhors a Chill”.  And a chill it is:

View this document on Scribd

The BC Library Association has condemned it in writing. BC Archivist Myron Groover was polite but firm on “As It Happens”.

Members of Parliament for the Official Opposition Andrew Cash and Pierre Nantel gave the Heritage Minister a piece of their mind about it in the Canadian House of Commons:

Jim Turk, Executive Director of the Canadian Association of University Teachers (CAUT) gave a clear explanation of what’s at stake in an interview on Radio Canada International.

My question is – we’ve expressed our collective outrage at this Orwellian nightmare – and now what? Do we decide that Federal archival and library institutions are doomed and take on their role on the remaining islands of democracy or “…take arms against a sea of troubles, and by opposing end them”?