
MetaData in Social Science February 13, 2007

Posted by Andre Vellino in Data Mining.

I was aware that there are repositories of and search engines for many databases in various “hard science” disciplines like Chemistry and Astronomy, but until a few weeks ago, it hadn’t occurred to me that social scientists also have large and valuable collections of digital data and that these too are “published”.

Social science data consist of demographic studies, polls, censuses, epidemiological studies, etc., and are typically the result of surveys. Hence the raw data must be protected, e.g. by tight access controls and private networks. In Canada, data gathered by Research Data Centers are protected by the Statistics Act, which includes provisions for gaol sentences for people who misuse protected information.

But there are also public sources of social science data where aggregate information is anonymized so that personal identity information cannot be inferred. And, as with the open source movement in software and the open access movement in scholarly publishing, social science data in Canada has its very own liberation movement.

Social scientists’ ability to extract any meaning from this information depends critically on metadata. For example, the column in a spreadsheet or data file whose contents are “M” or “F” might refer to the sex of the respondent to the questionnaire, but it is not, in general, possible to determine that this is the meaning of the data without also having the corresponding questionnaire at hand and a human being to make the connection.
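To make that dependence concrete, here is a minimal sketch (file contents, variable names and the codebook are all hypothetical) of how an opaque column only acquires meaning once a machine-readable codebook stands in for “the questionnaire plus a human being”:

```python
import csv, io

# A raw data file: without documentation, "M"/"F" in column 2 could mean
# anything (sex? marital status? a medication flag?).
raw = "1001,M,42\n1002,F,37\n"

# A codebook: column index -> (variable name, value labels).
# These names and labels are illustrative, not from any real study.
codebook = {
    0: ("respondent_id", None),
    1: ("sex", {"M": "Male", "F": "Female"}),
    2: ("age", None),
}

def decode(row):
    """Turn an opaque row into labelled, human-readable values."""
    out = {}
    for i, cell in enumerate(row):
        name, labels = codebook[i]
        out[name] = labels[cell] if labels else cell
    return out

rows = [decode(r) for r in csv.reader(io.StringIO(raw))]
print(rows[0])  # {'respondent_id': '1001', 'sex': 'Male', 'age': '42'}
```

A DDI instance plays essentially this role, except that the codebook itself is expressed in standardized XML rather than ad hoc program structures.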

Yet, compared to bibliographic metadata, social-science metadata is a poor second cousin. In the worst case, a data file is a meaningless sequence of numbers whose significance is entirely opaque to anyone but the social scientist who created it. In the best case, the raw data is annotated in a markup language like XML using a standardized DTD for social-science metadata.

Unfortunately, there aren’t many software tools for social scientists to create, annotate and analyse their data in a commonly accepted standard. The Council of European Social Science Data Archives and the Inter-university Consortium for Political and Social Research have developed a metadata standard called DDI (the Data Documentation Initiative), and there are some software systems (such as Nesstar) for storing, analyzing and meta-tagging data.

The latest version of the DDI standard (version 3), due for ratification in July 2007, introduces meta-data tags that take into account the full life-cycle of the creation and publication of social science data, including tags for conventional bibliographic meta-data (author / subject / publication date, etc.). So the future for electronic social science looks promising. But it’s a pity that there wasn’t more forethought in the creation of meta-data standards for social science in the 1970s or ’80s. We might know more about ourselves as a society than it appears we currently do.


1. Daniel Lemire - February 20, 2007

Great post. I’m glad to have learned about this Data Documentation Initiative.

In data warehousing, we have a very simple way to handle the semantics issue you point out, and that’s traceability. You should keep track of where this data comes from, how it was used, when it was used, when it was submitted, by whom, in what context. There is no need for all-encompassing standards.
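The traceability idea sketched above — track where each figure comes from, by whom and when it was submitted, and how it has been used — could look something like this. The `Traced` class and its field names are my own illustration, not any particular warehouse standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Traced:
    """A value that carries its own provenance (illustrative sketch)."""
    value: object
    source: str                  # where the data comes from
    submitted_by: str            # by whom it was submitted
    context: str                 # in what context it was collected
    submitted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    history: list = field(default_factory=list)  # how and when it was used

    def use(self, note):
        # Every use of the value is logged, so we can always drill back.
        self.history.append((datetime.now(timezone.utc).isoformat(), note))
        return self.value

income = Traced(52000, source="survey-2006.csv",
                submitted_by="field-team-3", context="Q4 income question")
total = income.use("aggregated into regional mean")
print(income.history)  # one (timestamp, note) entry
```

The point of the sketch is that provenance lives next to the data itself, with no agreement on global semantics required.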

I am not convinced that meta-data standards are all that useful. We have had librarians work hard on such standards, but the Web has basically allowed people to bypass all of that without any side effect. I use google scholar for my research and despite the fact that it is a somewhat informal interface, it is actually way more useful than the interfaces librarians have come up with.

I do not know any serious researcher who, given a choice, would opt to drop Google in favor of a formal database. I am old enough to recall the old school libraries with their formal databases where days could go by before I would find something. These systems had plenty of standardized meta-data and they were still extremely hard to use (which is why we needed the librarians…). Have we learned nothing from the Web and how it changed all that, forever?

(Yes, I know where you work and what this organization does.)

Dublin Core is largely a failure in practice as a formal meta-data system, for example. The data is heavily polluted, twisted and so on.

Informal non-standard tagging, on the other hand, works. My mail is tagged, my bookmarks are tagged, my files on my mac are tagged, and it works ok. There are many reasons why that is, but the future is not filled with standard meta-data.

I have tried Nesstar and their demo. This is so boring. So rigid.

2. Andre Vellino - February 20, 2007

Thanks Daniel. I am somewhat sympathetic to your view about meta-data. I also use Google Scholar, for example, and don’t care much for bibliographic search engines that allow, let alone require, tag-specific (Subject / Author, etc.) searches.

At the same time, I do think there’s a role for “database” style information. I’m sure you use the Bibtex feature in CiteULike and / or Google Scholar. Why? Because it saves you a lot of work in (properly) formatting bibliographic references in LaTeX, right?

Well, with Social Science data, I expect the same kind of thing is true. If you have, for example, different studies in different countries at different times using different methods and you want to integrate them somehow, then it would be good to know that in study X, the set of possible income ranges is divided up in 5K increments (in local currency) and in study Y the income ranges are divided in 30K increments (in a different currency), so that you can normalize and integrate them.
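As a sketch of what that integration step looks like in practice: once the metadata records each study's bin width and currency, re-binning into common categories is mechanical. The exchange rate and bin edges below are made-up illustration values, not from any real studies:

```python
# Study X reports income in 5K bins of its local currency; study Y in 30K
# bins of another currency. Knowing this (from the metadata), we can convert
# and map both into shared brackets.

RATE_Y_TO_X = 0.5                        # hypothetical: 1 unit of Y = 0.5 of X
COMMON_BINS = [0, 30000, 60000, 90000]   # shared bracket edges, X's currency

def to_common_bin(lower_edge, currency_rate):
    """Map a study-specific bin's lower edge to a shared bracket index."""
    x = lower_edge * currency_rate       # normalize to X's currency
    idx = 0
    for i, edge in enumerate(COMMON_BINS):
        if x >= edge:
            idx = i
    return idx

# Study X respondent in the 40K-45K bracket (5K increments, rate 1.0):
print(to_common_bin(40000, 1.0))          # bracket 1 (30K-60K)
# Study Y respondent in the 90K-120K bracket (30K increments, rate 0.5):
print(to_common_bin(90000, RATE_Y_TO_X))  # 45K in X's currency -> bracket 1
```

None of this is possible unless the bin widths and currencies were recorded as metadata in the first place — which is the whole argument.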

I haven’t been around libraries long enough to have a strong opinion about Folksonomies, but I have to say I’m not at all sure about their collective value for information retrieval. It may work for you on your mac, but your way is not likely to be useful for me unless we think very much alike (which we might, in this case, but perhaps not in general). So I do think there’s *some* value in, if not *standardized* taxonomies, at least a conventionally agreed-upon taxonomy, especially in well-established scientific fields (Physics -> {Acoustics, Atomic and Molecular, Nuclear, Optics, Solid State, Elementary Particles and Fields, etc.}).

3. Daniel Lemire - February 20, 2007

> At the same time, I do think there’s a role for “database” style information. I’m sure you use the
> Bibtex feature in CiteULike and / or Google Scholar. Why? Because it saves you a lot of work in
> (properly) formatting bibliographic references in LaTeX, right?

Yes. And I use the h1, h2 and title elements in HTML/XHTML. I generate a lot of my PDF documents with the full documents. I also carefully keep track of all my documents. I also craft carefully (most of the time) the software I write.

This works fine because it is intra-document semantics, that is, very localized semantics. It is not expensive to be consistent throughout a document or a small set of documents. The more documents you throw in, the more expensive inter-document semantics become and eventually, the cost is very high.

It depends what you are after. A small data warehouse designed by a group of experts who all know each other and meet every Saturday? Dedicated librarians whose jobs depend on having consistent standards? A few gigabytes of data? Fine. You can get some agreement going. But I say that in the Web era, you are just a dot.

You want to get data from a large pool of researchers worldwide? You need to aggregate gigabytes of data. Well, you better have very thin semantics and very low expectations. Do not waste your time attending too many meetings where people will define semantics carefully. You will almost certainly waste your time. Too many people have tried to reinvent the all-encompassing ontology in the last 30 years.

Bibtex is *not* consistent throughout. You can’t mix and match bibtex documents from different users. I can’t copy and paste the bibtex entries from the ACM digital library and mix them with the bibtex entries from IEEE. And let us not mix DBLP’s bibtex into the lot. There is some consistency, but it is far from perfect. And the consistency is relatively thin. You can describe very common, agreed upon, semantically simple objects in a semantically useful way so that user intervention is minimized. Good.

But to describe the meaning of the data collected in thousands of spreadsheets by distributed users who have never met? Nah.

> If you have, for example, different studies in different countries at different times using
> different methods and you want to integrate them somehow

This is the exact same problem data warehouse specialists have to deal with every day. The marketing department defines “revenue” one way, and the finance department defines it another way. Oh! And wait! The next accountant changed the rules somewhat last May.

In practice, we have learned a lot from this. What we have learned is the importance of traceability. You need to know where your bits come from. You must always go back and drill down to the source.

Metadata standards? The Common Warehouse Metamodel has been around for a long time and was designed to solve these problems. See…


These things work in highly centralized, slow-changing, top-down enterprises with a powerful CIO. Wait! There aren’t many of those left around, are there?

When projects come and go every year, when definitions are challenged month to month, you have to deal with semantics in a more pragmatic way.

I am *not* saying people should not carefully describe their data. But to hope that there can be an all-powerful standard way of doing it, is… well, I’ll stay polite.

I document my data. I know where it comes from and most of the time, I could regenerate it (I do not work in social science, so it is somewhat easier in my case). But the way I describe my data varies from project to project. To hope that thousands of people would agree on a standard way for all of their projects, and that they would never lie, never misread the specifications, never misunderstand the specifications, never be lazy about the specifications… well… it is utopian.

We have not gotten people to use Dublin Core reliably and correctly. Let us be realistic.

And I do not say this because I think human beings are silly, stupid or evil. I think that semantics is a dynamic thing. People do not think in terms of static, complete, and decidable definitions. That is what you find in the specifications, but it is not how people think. I know what a planet is, but I can’t necessarily formally define one.

If you want a good model of something that might work, check this out:

