Research at Disney

If the talk about a shortage of faculty positions is dispiriting, articles like this are energizing. Data science has emerged as a hopeful and interesting alternative to academic social science. But one of the biggest drawbacks has to be that many data science positions are shaped so exclusively by computer science, engineering, or some other area of science that isn’t primarily social. Those areas of work are great, integral and critical, but the result of the skew is that descriptions of “data science” can lose sight of the real human behavior and social phenomena behind the data being analyzed.

Big Data of all sizes: how to turn a regular organization into a data-driven organization

Everyone’s talking about Big Data lately. It’s being touted as a “revolution” for organizational decision making. I generally think more reliance on data is a very good thing, and I’m glad that people who traditionally haven’t thought much about data are now thinking about it more. That being said, I’ve been struck by the differences between the ways the actual term Big Data seems to be used by practitioners and the ways it is used by the executives and managers who supposedly want Big Data to work for them.

Sometimes I think we don’t deserve good data

I wrote the following post yesterday afternoon as a way of venting, but I didn’t get around to posting it till this morning.

Consider this post a continuation of my post on opportunistic analysis. This is me decompressing after many hours of frustrating and unfruitful attempts to get some data that is supposedly freely downloadable.

I’ve written recently about sentiment analysis, and created a few tools to estimate the positive or negative sentiment expressed in a text by counting the number of positive and negative words that appear in that text. Positive and negative words are identified by lists – people use different approaches to decide if a word carries a particular sentiment. This approach has many drawbacks. For example, Greg Tucker-Kellogg wrote recently on his blog how words such as “please” can often get defined as positive, which causes problems when a word list with “please” defined as positive gets used to analyze the comment “please slow down.” The comment is identifying a problem, but it would be rated as positive because of that one word.
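As a rough illustration of the word-counting approach and of the “please” problem, here is a minimal Python sketch. It is not one of the tools mentioned above: the function name `sentiment_score` and both tiny word lists are made up for the example, and real sentiment lexicons are far larger.

```python
import re

# Hypothetical, deliberately tiny word lists, for illustration only.
POSITIVE_WORDS = {"good", "great", "love", "please", "thanks"}
NEGATIVE_WORDS = {"bad", "hate", "problem", "broken", "terrible"}


def sentiment_score(text, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Return (positive count, negative count, net score) for a text."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for word in words if word in positive)
    neg = sum(1 for word in words if word in negative)
    return pos, neg, pos - neg


# The complaint is scored as positive purely because it contains "please".
print(sentiment_score("Please slow down."))
```

With these lists, the complaint gets a net score above zero even though it is pointing out a problem, because nothing in a plain word count looks at how “please” is being used.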

Why do Jihadi Clerics become Jihadi?

I don’t spend a lot of time thinking about Jihadi terrorism these days. I do still pay attention to the conflict in Afghanistan, and off and on I’ve been able to help with some projects being undertaken by other researchers. But I don’t have much time to think about terrorism outside of a conflict zone. However, yesterday I saw a flyer in the elevator for a talk on “Jihadi Clerics” and my interest was piqued enough that I attended.

Automatic cleaning of messy text data

[UPDATE: I just wrapped AspellCheck() in an llply/laply wrapper from Hadley Wickham’s `plyr` package, so it can now be run on a vector of texts as well as a single character string, and it now has a default progress bar (set `progress = "none"` to turn it off). Note that `plyr` must be loaded to use AspellCheck() now.]

I deal with a lot of text data, and in R, the basic, general-purpose suite of tools for analyzing text data is the `tm` (text mining) package. I like the tm package a lot – it provides some convenient methods for pulling data from lots of different formats into a single corpus of texts, and it uses the sparseMatrix() function from the `Matrix` package to allow comparisons of very large numbers of terms across very large numbers of documents without eating up very large amounts of resources.
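To see why sparse storage saves resources, here is a minimal Python sketch of a document-term matrix that keeps only the non-zero counts. It is an illustration of the general idea, not how `tm` or `Matrix` work internally; the function name `sparse_document_term_matrix` and the sample documents are made up for the example.

```python
from collections import Counter


def sparse_document_term_matrix(documents):
    """Build a document-term matrix that stores only non-zero counts."""
    vocabulary = {}  # term -> column index
    entries = {}     # (row, column) -> count, present only when count > 0
    for row, text in enumerate(documents):
        # Naive whitespace tokenization, good enough for a sketch.
        for term, count in Counter(text.lower().split()).items():
            column = vocabulary.setdefault(term, len(vocabulary))
            entries[(row, column)] = count
    return vocabulary, entries


docs = ["the cat sat", "the dog sat down", "data data data"]
vocabulary, entries = sparse_document_term_matrix(docs)
dense_cells = len(docs) * len(vocabulary)
print(f"{len(entries)} stored entries instead of {dense_cells} dense cells")
```

Most terms never appear in most documents, so the gap between stored entries and dense cells grows quickly as the corpus gets larger.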