What is ‘Data’? June 14, 2011

Posted by Andre Vellino in Data, Data Mining, Information retrieval.

“What does ‘data’ mean to you?” I innocently asked various participants at JCDL 2011 today.  I had just come out of a very interesting panel discussion entitled “Big Data, Big Deal?” at which most of the discussion was about large amounts of proprietary text at http://www.hathitrust.org/ (some of the discussion was also about large amounts of music in the SALAMI project at McGill).

Now I am very interested in text, text retrieval (and music IR too) and I found the panel discussion most rewarding.  But it wasn’t about what I had been expecting it to be about (from the title) and I was perplexed by this use of the term “data” in this context. After all, the subtitle of the JCDL 2011 conference is “Bringing Together Scholars, Scholarship and Research Data”.  So the context for “data” was (for me) “research data” in the sense of the term that is pretty much the same as in the first 3 sentences of the Wikipedia entry for Data:

The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of “datum”) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived.

So I was somewhat taken aback by the argument that ensued. Everyone, it seems (except me), was quite happy to speak of “Big data” and “large amounts of text” as synonymous.  As though the streams of bytes that are common to readings from an NMR spectrometer, digital music and electronic journal articles were in all significant respects indistinguishable.

Of course, large volumes of byte-sequences share some kinds of problems like storage, preservation and search. But “text data” is a different kind of beast, isn’t it? For one thing, text typically has meaning – cognitive content that is different from, say, music or images or spreadsheets of temperature variations in Glasgow over the past 500 years. It has more structure too, as evidenced by how efficiently it compresses and how (relatively) easy it is to search.

I’m happy to speak of data about text that is inferred by the act of mining text.  Word frequencies, ngrams, term clusters, sentiment categories etc. fit the definition of “data” above. Even the textual “meta-data” about text is data of a certain kind. But the text itself just doesn’t seem to be that kind of thing (qualitative or quantitative attributes of a variable).
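To make the distinction concrete, here is a minimal sketch of deriving “data about text” from a piece of text. The sample sentence and the helper names (`tokenize`, `ngrams`) are made up for illustration, and the tokenizer is deliberately crude; real text mining would use something more careful.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into crude word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous runs of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text: this is the "text itself".
text = "the cat sat on the mat and the cat slept"
tokens = tokenize(text)

# These counts are attributes of variables (words, word pairs),
# i.e. "data" in the Wikipedia sense quoted above.
word_freq = Counter(tokens)
bigram_freq = Counter(ngrams(tokens, 2))

print(word_freq.most_common(3))
print(bigram_freq.most_common(3))
```

The counters are measurements taken over the text; the sentence they were computed from is a different kind of thing.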


1. Crystal Bruce - June 15, 2011

I took a course called “Arts for Social Science Students” and while most of the course was a joke… (“This is a mouse!”)… I did learn the difference between data and information. My definition of data is like yours and Wikipedia’s, whereas information is data with meaning.

That being said, “text data” where it is being spoken about as phrases and sentences, is “information” to me. I like the distinction that you make about text and text data (data about text).

If the panel discussion was on data and not text data/information, I wonder if it would have been as interesting!

2. gawp - July 29, 2011

Does an ontology of data sets and databases exist?
Text is clearly a subset of data, but there is linguistic (word) and non-linguistic (e.g. DNA sequences) text.

3. pranav - August 13, 2011

Qualitative or quantitative attributes are more important. As time goes on, data will increase and we will need to decide what we want to use or not use. I completely agree with you on this.
