Give data collection the respect it deserves

I attended a presentation on “a framework of corruption” the other day. Perhaps this is true for other areas of research as well, but researchers and analysts who look at corruption love to talk about frameworks and maps and indices and typologies. In a sense you can’t blame them. Corruption is about as vague a term in social research as is possible. To make talking and thinking about it useful you have to first break it into pieces. What kinds of corruption are there? Unfortunately, the typologies usually just involve other terms and ideas that are nearly as vague as the original word.

This bothers me. As Schaun mentioned in his last post, corruption is a label we place on observable behaviors. You can point to an official engaged in bribery or in fraudulent record-keeping; you can’t point to an official engaged in corruption (until you’ve identified an observable behavior as corruption). This was my typical response to Army commanders in Afghanistan when they asked me about corruption. “Don’t tell me you have a problem with corruption. Tell me you have a problem with such-and-such actions on the part of such-and-such people. Then we can start to deal with that problem.” Now I mostly look at corruption in the US, and think more about “institutional corruption” than the kind of behaviors we encountered in Afghanistan. If anything, the problem of specificity here is worse.

Since switching focus I’ve developed a slightly different way of looking at the problem. Researchers interested in corruption don’t tend to get very specific because they don’t actually know what to be specific about. We “know” corruption contributed to the financial crisis. We argue about contributing factors like “a lack of transparency” or “financial misrepresentation”. Ask which people did what and how many of them did it, and you probably won’t get many replies. But everything that’s happened in the last two to five years has happened in actual locations and has been done by actual people. In big banks and big organizations, lots of analysts and managers and personnel of varying sorts have engaged in a relatively wide variety of tasks and behaviors (not all of which or even most of which, needless to say, are corrupt). The combined consequences of all those individual tasks and behaviors make up the problems we face today. Corruption is all real stuff. It’s just that we know very little about, and have very little access to, that kind of population-level reality.

I don’t claim this is a new insight. Lots of researchers have pointed out the general lack of empirical data in social science and social research. In fact, I’m basically saying something very similar to what Schaun expressed here. It’s probably safe to claim that social science has become much more empirical over the last few decades. It seems to me that most well-known work tends to have some empirical element. However, I think we can do a lot more in response to this insight than we have, and I think we must do a lot more about the problem than most researchers have considered necessary. The rest of this post mentions three limitations of current data collection, and then makes the case for massively increased data collection with some examples.

Isolated factors and overly targeted data collection: Data collection is expensive in terms of time, energy, and money, so researchers tend to get very specific in the data they collect. You think income level affects smoking? Collect data on income levels and smoking patterns. Think social networks (who is friends with whom) affect smoking? Collect social network data. In the long view, this narrowness is a mistake. Few researchers actually think the “x causes y” model accurately represents reality. They “exclude” and “control for” and “hold everything else constant” because that’s all they can do given that they can’t get data for all the other factors. The problem is that all those other factors and behaviors are constantly occurring before, during, and after the behavior the researcher is concerned with. We can hypothesize why they don’t matter or why they matter less than the things for which we did collect data, but as argued previously, we don’t actually know enough to make that case very strongly yet. Before we can know something like that, we have to have much fuller and more robust descriptions of people’s behaviors. Ultimately, we have to collect data on almost every aspect and stage of human life you can imagine. That’s what the advance of social science will require.

Giving credit to data collectors. Everyone admires a researcher with an awesome data set. I don’t think many really care that much about the actual data collector. That’s a mistake. Getting out into the world and collecting data should be a big and important part of the work of research. I think some of the great psychologists knew that. In his memoirs, Stanley Milgram describes tons of research projects that required active engagement with everyday life – on subway cars, in the streets, etc. Anthropologists have always placed importance on the active work of participant observation (although sadly most bring back and publish their interpretations and reflections more than actual data). I’ve always enjoyed reading the work of scientists who study ants (like Edward O. Wilson and Deborah Gordon), and it’s amazing how much time they spend just watching ants and recording their movements. The physicist Ernest Rutherford is alleged to have said “all science is either physics or stamp collecting.” Social scientists and researchers need to do a whole lot more stamp collecting, and we should love it, and we should admire it.

Data ownership. I was in a Geospatial Information Systems workshop recently and overheard some senior researchers discussing the growing interest in the idea of making data public. They were speaking specifically in reference to federally funded research and seemed to be in support of the notion that when research is funded by the National Science Foundation, the researchers’ data should eventually be made accessible to the public (which I later learned is actually something of a hot topic at the moment). A few weeks later I mentioned this idea in a meeting and was kind of surprised at the skeptical reception. No one wants to release data until they’ve “got their papers out of it”. By that time they’re working on new things and no one cares (“because the data has already been used”). I think that’s a really strange way of thinking about science. Data is thought of like tissue paper: use it once and throw it away; definitely don’t use someone else’s. That’s a horrible mistake (Google “The Republic of Science” by Michael Polanyi; he makes a great case for why).

But all of the above are problems that might go away quickly if data-collection became perceived as an exciting endeavor. An analogy is appropriate here. Accounting is an extremely important part of contemporary human society. Modern accounting is based on double-entry bookkeeping. Double-entry bookkeeping was invented sometime between the 13th and 15th centuries. It’s basically a great way of tracking transactions over time. It doesn’t seem at first like an amazing idea, but the poet Goethe called double-entry bookkeeping one of “the most beautiful discoveries of the human spirit” (or “finest inventions of the human mind”, depending on your translation). The development and advance of social data-collection is going to be just as important, and based on a lot of similarly mundane-seeming record-keeping.

To get the level of data that will be really useful will require increased prioritization of record keeping. Focus on record keeping and data collection, and analysis of the data will follow fairly naturally. Focus on analysis, on the other hand, and data collection will remain something only a minority of researchers does much of and only to the extent that it allows them to squeeze a paper or two out of it (after which they promptly drop the collection). When I worked for the Army and with the intelligence community, most of the work revolved around dueling assertions. You make one claim and then someone else makes another. There wasn’t a lot of systematic data or description. That frustrated me and I often said so. In a sense I was just making another claim, which would typically be met by another counter-claim (“This isn’t about evidence! You just have to make the call!”), thus slipping us back into the patterned grind of “intelligence-work”. Then Schaun and I and a couple of others started putting together systematic data-sets and using them to conduct our analyses and guide our claims. When we did that we met one of two responses from other analysts: they either thought our dataset was crap so they tried to put together a better one, or they thought our analysis was crap so they tried to reanalyze our dataset. Doing without data was just no longer an option. At first I was irritated. I quickly became thrilled. Both responses were an improvement over the old world of claim/counter-claim.

With everyone constantly doing things, people (including researchers) usually don’t have the time to record for everyone else. Keeping records requires technology. Given all the other things people have to do, recording all their doings themselves is going to fall fairly low on the list of priorities. We need to invent technology that makes record-keeping as costless as possible. That kind of technology is becoming more and more available. A lot of popular activities have a fundamental technological component which almost inherently involves record-keeping: Twitter, Facebook, Foursquare, Pinterest, texting, phone calls, online chatting, product and customer-experience reviews (e.g. Yelp), credit cards. All of this automatically involves the recording of date and time information, as well as some information about the experience/event. Most could easily record location information. But even older and basic technologies that used to have no “social” component could be updated to aid record-keeping: cars, microwaves, plumbing, heating/cooling, etc. How long does your microwave operate every day? How often? At what times of day? Cars already record mileage. If auto-makers linked odometers to the car’s computers, then that information could be made even more accessible and usable.
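
To make the microwave questions concrete, here is a rough sketch in Python of how little is needed once a device logs its own use. Everything in it is an assumption for illustration: the `UsageEvent` record layout, the function names, and the sample log are made up, not taken from any real appliance or product. The only point is that a timestamp and a duration per use are enough to answer "how long, how often, and at what times of day".

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class UsageEvent:
    """One automatically logged use of a device (hypothetical record layout)."""
    device: str            # illustrative label, e.g. "microwave"
    start: datetime        # when the device was switched on
    duration_seconds: int  # how long it ran


def daily_usage(events, device):
    """Map each day to (number of uses, total seconds of operation)."""
    totals = defaultdict(lambda: [0, 0])
    for event in events:
        if event.device == device:
            day = totals[event.start.date()]
            day[0] += 1                       # how often
            day[1] += event.duration_seconds  # how long
    return {d: (count, secs) for d, (count, secs) in totals.items()}


def uses_by_hour(events, device):
    """Count uses by hour of day (0-23): at what times of day."""
    return Counter(e.start.hour for e in events if e.device == device)


# Made-up sample log; a real device would append these records itself.
log = [
    UsageEvent("microwave", datetime(2012, 5, 1, 7, 30), 90),
    UsageEvent("microwave", datetime(2012, 5, 1, 12, 5), 120),
    UsageEvent("microwave", datetime(2012, 5, 2, 7, 45), 60),
    UsageEvent("thermostat", datetime(2012, 5, 1, 6, 0), 3600),
]

print(daily_usage(log, "microwave"))
print(uses_by_hour(log, "microwave"))
```

Note that the record itself carries no hypothesis about why anyone uses a microwave. It just keeps the description, and the questions can be asked later.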

We need as full an account of daily life as possible. We can figure out what matters, when, and why, once we have the data. Assuming something doesn’t matter before then is unwise.

It’s an exciting prospect. I think it adds a great deal of importance and appeal to research. In a sense it also expands the world of research. The emergence of modern science involved a whole lot of amateur scientists and hobbyists who made important contributions by working away in their own corners of the world. The theory-heavy social science that has become typical and conventional today presents a fairly massive barrier to entry. In order to participate you have to spend a whole lot of time reading what a whole lot of other people have said about the world. You have to know the terms they used and you have to use them yourself. Only then do you get to start doing your own work. Most non-academic people don’t have the time or interest to engage in that primarily theoretical work. The work of systematic description, on the other hand, is an activity with a fairly low entry-cost and it’s very accessible to amateurs working away developing their own tools and technologies and data sets. If we can let go, just for a brief period, of the pie-in-the-sky goal of “building a science of behavior” and focus instead on the more practical goal of just getting systematic descriptions of observables, we might quickly discover we are able to achieve both goals at once.


7 thoughts on “Give data collection the respect it deserves”

  1. It is interesting to think about what “better observation” entails and what exactly a “behavior” is or consists of. As physical behavior is increasingly augmented by information and mechanical technology, observation of behavior becomes more complicated. What all do we include in a single observation of a single behavior? If a financial services analyst is monitoring the graphical output of millions of instant calculations on his laptop screen and then deciding to call a client and recommend a buy or a sell, then actually calling the client, how do we parse what in here is behavior and what is not? Are the financial calculations on his laptop part of his cognition and therefore part of a particular behavior? How do you observe and record this?

  2. Jeremy,

    I was thinking along those same lines, but now I wonder if I’ve been over-thinking it. I think Paul’s point, and I tend to agree, is that we ought to be recording the calculations on the laptop and the calls…and the amount of time spent on the laptop and the calls, and the times and days when they occur, and anything else observable. The problem of deciding what counts as behavior and what doesn’t is much more of a problem when you’re identifying something abstract as a behavior and then trying to operationalize it. If you define behavior just as a single person directing an action towards an object or other person, most (probably not all) of the definitional problems take care of themselves.

  3. I think perhaps you’re getting at a distinction between behavior and environment. If environments are more complex then discerning which parts of the environment are influencing behavior also potentially becomes more difficult.

    However, part of my argument is that we don’t need to figure that stuff out yet. When I say we should collect data and keep record, I mean not just of behavior but of the physical world as well. As much of the financial analyst’s office and business and area of the city you can get. Since any particular person won’t be able to get everything, I see no problem with them deciding which part to get based on hypotheses (about the parts of the world) that interest them. Although I think it’s probably a better and more efficient approach to just take advantage of opportunity. If you see an opportunity to collect something, go for it.

    It’s actually kind of interesting that for pretty much any aspect of the environment you might be interested in, there’s almost certainly going to be some psychological research that has explored and seemed to demonstrate how that aspect influences behavior (controlling for everything else as much as possible): a warm coffee mug, the ten commandments hanging on the wall, a single word buried in a scenario, the race of another person’s face, the incline of a hill, the cup of coffee you’re drinking, etc. etc. This is especially the case with the automaticity and unconscious and implicit cognition literature. Everything potentially matters. So collect and record it. Each researcher, obviously, collecting and recording what they see fit.

  4. An interesting presentation by Chris Blattman, “Does Poverty Lead to Violence?,” in which (@ around 19:00) he describes a joint “active” research initiative that he’s undertaking w/ the Liberian government. He emphasizes that despite all the cross-national (i.e., macro) studies that have been done, our understanding of poverty-violence dynamics is too poor to effectively inform policy interventions. He argues the need for experimental designs in field-based social research, and defends that of his own project after a pretty aggressive question from an audience member (response is around @ 31:00). Expensive, but absolutely necessary.

    Third video down:

  5. Thanks for posting the video, I hadn’t seen it. A very good friend just came back from working on that project in Liberia. It’s a pretty intense project and is definitely the kind of thing that needs to be done. And in general I think it is starting to be done more. I don’t usually like the magazine but Foreign Policy recently had an interesting article about the role of data in conflict/genocide research:

    The consequence of not demanding and working to collect systematic data is the proliferation of poor understanding and commentary on conflicts like the one(s) that occurred in Rwanda. I’m not very familiar with that case, but I attended a talk given by Allan Stam once and he made a pretty persuasive case that the conventional description of that conflict is very wrong:

    I’m really interested in the potential of emerging technologies to make this kind of systematic data collection far more feasible. It can’t just be about doing it despite its cost. It’s got to also be about wanting it so badly that we do things on our own and as entrepreneurs to make it possible.

  6. Pingback: On the virtues of deliberate inaction « House of Stones

  7. Pingback: Why do Jihadi Clerics become Jihadi? | House of Stones
