Sometimes I think we don’t deserve good data

I wrote the following post yesterday afternoon as a way of venting, but I didn’t get around to posting it till this morning.

Consider this post a continuation of my post on opportunistic analysis. This is me decompressing after many hours of frustrating and unfruitful attempts to get some data that is supposedly freely downloadable.

I’ve written recently about sentiment analysis, and created a few tools to estimate the positive or negative sentiment expressed in a text by counting the number of positive and negative words that appear in that text. Positive and negative words are identified by lists – people use different approaches to decide if a word carries a particular sentiment. This approach has many drawbacks. For example, Greg Tucker-Kellogg wrote recently on his blog how words such as “please” can often get defined as a positive word, which causes problems when a word list with “please” defined as positive gets used to analyze the comment “please slow down.” The comment is identifying a problem, but it would be rated as positive because of that one word.
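
To make the “please slow down” problem concrete, here is a minimal sketch of the word-counting approach. The two word lists are tiny, made-up stand-ins (not any published lexicon), and `count_sentiment` is just an illustrative name.

```python
import re

# Hypothetical, tiny word lists for illustration only; real lists are far larger.
POSITIVE = {"please", "good", "great"}
NEGATIVE = {"bad", "awful", "terrible"}


def count_sentiment(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg


# "please" is on the positive list and nothing is on the negative list,
# so the complaint comes out with a positive score.
print(count_sentiment("please slow down"))
```

With these toy lists the comment scores above zero, even though a human reader would take it as a complaint.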

One way to get around this problem, instead of trying to determine which words “really” are positive or negative, is to weight the words. If “please” gets defined as only a little positive, its effects on an overall sentiment estimate can more easily be minimized or reversed by the presence of other words that might more clearly convey positive or negative emotions. Several sentiment word lists already do this. For example, the AFINN list scores each word on a scale from -5 (extremely negative sentiment) to +5 (extremely positive sentiment).
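
Here is a rough sketch of what weighting buys you. The weights below are invented for illustration on an AFINN-style -5 to +5 scale; they are not the actual AFINN scores, and `weighted_sentiment` is a hypothetical helper.

```python
import re

# Made-up weights on a -5..+5 scale, in the style of AFINN.
# These are NOT the real AFINN values; they only illustrate the idea.
WEIGHTS = {"please": 1, "slow": -2, "good": 2, "spectacular": 4}


def weighted_sentiment(text):
    """Sum the weight of every known word; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(WEIGHTS.get(w, 0) for w in words)


# The weakly positive "please" is outweighed by the more negative "slow".
print(weighted_sentiment("please slow down"))
```

Under these assumed weights the small positive contribution of “please” is reversed by “slow”, which is the behavior described above.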

Of the different scored lists that are out there, and there are several, it seems that most of the scores are created by having a whole bunch of people rate the words. A list is considered adequate when scoring a text with it produces a sentiment estimate that matches estimates made by actual human beings.

This got me thinking about what kinds of words might be scored more positively or negatively by human raters. For example, I think most people would rate “spectacular” as being a stronger positive word than “good”, but you don’t need to have people rate words for you to know that:

The above graph, taken from Google’s ngram viewer, shows the relative usage of “good” and “spectacular” in books from 1950 through 2000. “Good” is consistently used much more often than “spectacular”, which makes sense if we assume that the majority of most people’s day-to-day experiences are pretty average on the scale from positive to negative. That’s the point of emotions – to help us remember particularly good and particularly bad experiences. So sentiment-laden words that are used less frequently might represent emotional experiences that are experienced less often, and therefore word frequency in a really large corpus might act as a proxy for the strength of the sentiment, with higher usage indicating lower strength.

If I’m right, then all I’d have to do is take the frequency of different sentiment words and scale it – maybe log it and then scale by twice the standard deviation of the logged values or something – to get a reasonable strength scale without having to have people rate the words.
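
One possible reading of that scaling, sketched in Python: log the frequencies, center them, divide by two standard deviations of the logged values, and flip the sign so that rarer words come out stronger. The frequencies are invented placeholders rather than real ngram counts, and the centering and sign flip are my assumptions about how the idea would be carried out.

```python
import math
import statistics

# Hypothetical relative frequencies, made up for illustration (not ngram data).
FREQS = {"good": 1e-3, "great": 1e-4, "spectacular": 1e-5}


def strength_from_frequency(freqs):
    """Map word frequency to a strength score: rarer word, higher score."""
    logs = {word: math.log(freq) for word, freq in freqs.items()}
    values = list(logs.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation of the logs
    # Center, divide by two standard deviations, and negate so that
    # lower frequency maps to higher strength.
    return {word: -(lf - mean) / (2 * sd) for word, lf in logs.items()}


for word, strength in strength_from_frequency(FREQS).items():
    print(word, round(strength, 3))
```

Whether the resulting scores line up with human ratings is exactly the open question; the sketch only shows that the arithmetic is cheap once the frequencies are in hand.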

Here’s where my frustrations started. As I mentioned, the above graph was from Google’s ngram viewer. Google actually makes those ngram data sets public, so I figured all I would have to do is download the data and calculate frequencies. And thus commenced a whole day of hitting my head against the wall. Google has two versions of its English data sets. Version 1 was created in 2009 and version 2 was created in July of this year. Version 1 is split into 9 equally-sized zip archives, while version 2 can be split equally or by initial letter, and is gzipped.

I can’t download version 2. Every time I click on the file, I get this:

This happens in Chrome. If I hold my nose and try it in Internet Explorer, Microsoft gives me a message that I can’t download it and says that’s probably because I didn’t enter a password.

But, hey, I was just interested in a rough, proof-of-concept sort of activity, so I can just use version 1, right? Wrong. I can download it just fine. I can unzip it just fine. This is what I get when I open it.

$0.00 1903 2 2 2
$0.00 1906 2 2 2
$0.00 1908 4 4 4
$0.00 1909 18 13 13
$0.00 1910 2 2 2
$0.00 1911 6 6 5
$0.00 1912 5 3 3
$0.00 1913 9 9 9
$0.00 1914 9 8 8
$0.00 1915 2 2 2

The second column is the year in which the word occurred. The third column is the number of times the word occurred. The fourth column is the number of pages on which the word occurred. The fifth column is the number of books in which the word occurred. The first column is the word.

Oh, wait. No, it’s not. It’s a dollar sign followed by some digits. As I go down the list, the digits change, but in every case the word column just contains dollar signs followed by digits. And that happens for any of the nine partitions of the full 1-gram data set. I checked them all.
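
For reference, the column layout described above can be read with a few lines of Python. This is a sketch that assumes the fields are tab-separated; the three sample rows are copied from the listing above, and `total_counts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Three rows copied from the listing above, assuming tab-separated fields.
SAMPLE = (
    "$0.00\t1903\t2\t2\t2\n"
    "$0.00\t1906\t2\t2\t2\n"
    "$0.00\t1908\t4\t4\t4\n"
)


def total_counts(lines):
    """Sum the occurrence count (third column) per first-column token."""
    totals = Counter()
    # QUOTE_NONE keeps stray quote characters in tokens from confusing the parser.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for ngram, year, match_count, page_count, volume_count in reader:
        totals[ngram] += int(match_count)
    return totals


print(total_counts(io.StringIO(SAMPLE)))
```

Passing an open file handle instead of the `io.StringIO` sample would process a whole partition one row at a time, without loading it all into memory.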

But you know what, that’s not what frustrates me. Ok, that’s a lie. This really frustrates me. But as much as I’m irked by my inability to get the data that’s supposedly just sitting there for the taking, I’m more irked at what I found when I searched for solutions to my problems.

I was able to find one question on Stack Overflow that, I think, is describing the problem I had with the version 1 data, but the one reply didn’t realize that the original poster was talking about the raw data and just directed him to the download site. I was able to find one other question on, I think, a listserv (I can’t for the life of me find it now) that mentioned that all of the words were appearing as number/symbol combinations. I couldn’t find a single person talking about not being able to download the Version 2 files. So that means one of three things:

  1. I am just one incredibly unlucky person to have all of Google’s ngram data sets fail for me but for no one else.
  2. I’m really dumb to not see that I’m obviously doing something wrong here.
  3. People aren’t using the data sets all that much.

I think 1 is unlikely to the point of being implausible. I think 2 is more likely than 1 but I’d like to think I’m smart enough to download a .gz file or open a tab-delimited file in R. That leaves 3, and thinking back over the many, many dead-ends Google gave me today as I searched for answers to my problem, it really does seem that people are doing remarkably little with the data. For example, if you look at this site, there are lots of people who are going to the ngram viewer, entering two or three words, and then speculating wildly about the picture that results. See this Stack Exchange discussion of all the silly ways people are thinking very little about the conclusions they derive from the graphs.

A solution to this would be, of course, to get the raw data and do some more rigorous and systematic research on the data sets. In fact, maybe I’ll do that! I can make some time to download the files, then I…oh.

I’m frustrated that I can’t get the data, but Google isn’t exactly known for its customer service, so I can’t fault them too much for not caring enough to search out and fix whatever problem it is that’s preventing me from getting what I want. I am frustrated that there are so many people using the ngram tool to spin narratives rather than do actual research. I suspect that if more people were trying to use the data, more people would be encountering the kinds of problems I’ve encountered, and so more people would be writing about those problems, which means those problems would get noticed and fixed.

What we need are fewer people who are content to base their arguments on some graph that Google (or SPSS, or – less frequently – R) spits out at them. We get better data by using bad data and then letting people see how frustrating it is to use bad data. But if we never try to use even the bad data beyond a few automated outputs, we’ll never even see that it’s bad data, and we certainly won’t get anything better in the long run.

[UPDATE, 10/5/2012: Whoa. I just went back to the Google ngram data site on a whim (actually, I’ve been obsessing over it a little since I wrote this post). All of the version 2 data sets are gone, including all mention of them in the data descriptions. However, if you look at the snippet that shows up below the link on the Google search, you can clearly see that it mentions version 2. I’m downloading the version 1 set now to see if gobbledygook still shows up in place of the ngrams.

…Nope. I’m still just getting number/symbol combinations in the word column no matter which file I pull.]


5 thoughts on “Sometimes I think we don’t deserve good data”

  1. I figured that something odd must be going on, so I downloaded the first file, unzipped it, and opened it in emacs.

    It appeared to actually be tab-separated, not comma separated.

    The first column appeared to be #’s.

    So then I pulled out all the entries from the first column
    awk '{ print $1 }' googlebooks-eng-all-1gram-20090715-0.csv

    After a lot of random numbers, real words started to show up.

    Hypothesis: the list is generated automatically, and what google defines as a “word” is probably something like “anything between two whitespace characters”. Result — lots of things that we don’t recognize as words, which, when sorted alphabetically (using ascii) appear before all the “real words”(TM).

    Opening it up in emacs again, the first letter A appears on line 203,625 (out of 47,323,011), but with an ampersand in front of it.

    TL;DR – the words are in there, but so is some “junk” that needs to be filtered out.

  2. Robin,

    Thanks for checking into that. Yes, the file is indeed tab-separated, even though its file extension says it should be comma-separated. Not sure why they did that. Anyway, I got tripped up by the description on the ngram website:

    “Inside each file the ngrams are sorted alphabetically and then chronologically. Note that the files themselves aren’t ordered with respect to one another.”

    I read the first sentence and assumed that if there was junk it would be in one file. That’s why I checked out the other files too. But apparently, each file has that junk at the top of the list.

    I still think it’s interesting that they took down the Version 2 files.

  3. Schaun, your scenario #1 was the implausible reality. You tried to download the dataset during a brief period where we were in transition between Ngram Viewer 1.0 and Ngram Viewer 2.0. Sorry.

  4. Jon,

    Thanks for letting me know, and no need to be sorry – those sorts of things happen. My frustration was more with the apparent lack of people using the ngram data to do rigorous research than with the ngram data itself.
