“Opportunistic analysis” sounds easier than it really is

First some background. As I explained in my first post, I work in a marketing department now but I used to work for the Department of the Army. I worked hard to make my data as publicly releasable as possible because I worked on long-term, strategic types of issues. Before I left, it was standard practice to turn some of the analyses into manuscripts for open publication, and I’m still working on those manuscripts now that I’ve left. I was working on one last night – a project on attacks in Afghanistan – and realized it might be useful to have a monthly record of overall troop levels to augment one part of the analysis.

Holy cow, it’s really hard to get that information.

You might be thinking that I’m just a poor planner. I worked for the Army, so it seems I should have thought of this need earlier and figured out how to get those numbers a long time ago. I actually did think of this need earlier – a couple of years ago. I looked for the information then and couldn’t find it. The frustration of that search seems to have dimmed in my memory since then, but it all came back in a wave last night. For example, take a look at this:

Troop levels in Afghanistan and Iraq - NPR

It’s a graph of troop levels in Iraq and Afghanistan that was presented in an NPR story. The original is here. Notice the Afghanistan line is missing in parts. On the original web page, underneath the graph, it says two things: “Note: Data not available for some months” and “Source: Department of Defense.” In other words, NPR got those data from the people who are actually conducting the war, and those people themselves didn’t know how many troops were on the ground for some months.

This isn’t just a case of NPR not getting access to the data. The Congressional Research Service wasn’t able to get it either. In this report, they have to spend a few pages explaining how there was no straight answer to the question of how many troops were in Afghanistan at any given time. Part of that is a legitimate definitional problem – large-scale troop records often talk about the operation to which the troops were assigned, not the location. So a lot of troops assigned to the Afghanistan conflict were actually residing in other countries, doing support work. However, part of the problem was that the CRS actually got conflicting reports. They said the “Boots on the Ground” reports, prepared by the Department of Defense, probably contained the best estimates, but those could be biased if troops were “not present on the day of the head count.” That’s right. When DoD wanted to know how many troops they had in Afghanistan, they had to go count them. Footnote 89 is also interesting: “DOD did not send Congress Boots on the Ground Reports for October 1 and November 1, 2008.”

So we can’t assume that the data are actually stored some place and that the public just doesn’t have access to them. It seems it’s actually realistic to believe that perhaps no one has really been keeping those sorts of records.

So I started looking for other sources. That’s when I came across this:

Troop levels in Afghanistan - Brookings Institution

What I’ve pasted here is actually a still of this interactive graph produced as part of the Brookings Institution’s Afghanistan Index. If you roll your mouse over any particular part of the graph, it will tell you how many Afghan Security Forces, U.S. troops, and other foreign troops were in the country at any given time. As the graph says, they had to estimate levels for many months, but at least they have the estimates – and they’re up front about how they arrived at them.

Here’s the thing, though: the only way to get those actual numbers – to use anything but the graph itself in an analysis – is to run your mouse over each section of the graph and record each number individually. I think they must have all that information in a spreadsheet somewhere – the graph has to come from someplace – but it’s not available on the site. Last night, I emailed the people who put the Afghanistan Index together and asked them for the data itself. I haven’t heard back from them yet. I hope I can get the data from them eventually.

My point is that it shouldn’t be this hard. There are potentially useful things people could do with a monthly record (even an estimated one) of troop levels in Afghanistan. Similarly, iCasualties.org has impressively kept a record of every casualty in Iraq and Afghanistan for which the site’s author could find a press release or news story. The entire list of fatalities in Afghanistan is laid out here. [http://icasualties.org/OEF/Fatalities.aspx] The problem is, the list only displays 50 records per page. There are 58 pages. If you want the whole data set, you have to copy and paste each page individually. Granted, that’s not as frustrating as reading monthly data for three different types of troops off of mouse hovers on the Brookings Institution’s page, but it still makes it difficult for an analyst to use data in potentially useful and informative ways.
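
To make the copy-and-paste chore concrete: at 50 records per page and 58 pages, that’s up to 2,900 records to reassemble by hand. Here is a minimal Python sketch of the stitching step, not a description of how anyone actually did it. It assumes each page has already been saved as comma-separated text with the same header row. The function name combine_pages, the column names, and the two tiny sample pages are all made up for illustration; they are not the real iCasualties format or data.

```python
import csv
import io


def combine_pages(pages):
    """Merge page-sized CSV chunks (each with its own header row) into one list of records."""
    combined = []
    header = None
    for page_number, text in enumerate(pages, start=1):
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            # Remember the columns of the first page so later pages can be checked.
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError(f"page {page_number} has different columns")
        for row in reader:
            # Keep track of where each record came from, in case a page was pasted twice.
            row["source_page"] = page_number
            combined.append(row)
    return combined


# Two placeholder pages standing in for the 58 real ones (hypothetical columns and values).
page_1 = "record_id,label\n1,example-a\n2,example-b\n"
page_2 = "record_id,label\n3,example-c\n"

records = combine_pages([page_1, page_2])
print(len(records), "records from", len({r["source_page"] for r in records}), "pages")
```

The code itself is trivial, which is sort of the point: the hard part is the 58 rounds of copying and pasting that have to happen before a script like this has anything to read.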

I’m a little frustrated at the DoD for not making their data more available, but I think I’ve come to terms with it. The military is used to being overly cautious with its information, and most people I met in the military felt that they had more pressing concerns than maintaining databases. I’d argue that maintaining better databases could help them with some of those other pressing concerns, but I can understand if they don’t give my argument much credence.

What I understand less, and what frustrates me more, is the non-government sites that recognize the value of giving people information – that’s actually a large part of how those sites justify their existence – but then present that information in graphs and PDF-ed tables. They basically present the information in the least-accessible format for statistical analysis. They can’t have created their graphs or the PDFs without first having a database or a spreadsheet or something. It would be so easy to include a link. But it’s really rare to find that kind of link. And it’s not just the U.S. government and it’s not just Afghanistan. I’ve run into the same problem with data from NGOs in India, the Mexican government, and the U.N.

I’m a big advocate of opportunistic analysis. I think some really insightful stuff has come from people who took data that was just lying around in different places, put all of it together in one place, and then employed some rigorous analytic tools to find patterns. I think that kind of work is necessary in research environments where a researcher’s employer is more concerned with day-to-day operational and business outcomes than with achieving a consistently deeper understanding of an issue. I also think it’s exciting. I get excited when I find governments, NGOs, and other groups dedicated to making information available, who recognize the potential benefits of letting people take advantage of open information. I wish these organizations realized that available information isn’t the same as usable information. I have more available information than I can handle. I have precious little usable information. That’s a problem for organizations that want their data to be used. I only have so much time and energy. More often than not, if the data are available but difficult to access, I’m going to choose a different project or a different way to address my questions. There are a lot of opportunities out there. The opportunities that are the easiest to access will be the ones that get the most use, garner the most attention, and do the most good.


9 thoughts on ““Opportunistic analysis” sounds easier than it really is”

  1. Schaun,

    Great post – as usual. As you say and know, this isn’t an isolated incident. In fact, it’s pretty systemic in the DoD. Just today, a colleague was trying to find all reports from a specific source and we are now resorting to an Access database to catalog reports as they come in. You wouldn’t realize that it’s 2012 with how we have to operate sometimes.

    Hope all is well,


    PS – We have yet to start that project I pitched to you at NDU in August. But it’s on the list!

  2. That makes me sad, but on one level I can kind of understand it coming from the DoD. If you asked people to list the reasons why a Department of Defense exists, I think “to provide information retrieval services” would be pretty far down the list, if it featured at all. The DoD seems to be like a lot of other organizations – businesses, NGOs, etc. – that have been slow to realize the extent to which the ability to use information has changed. I just don’t think people are used to concatenating individual bits of information into large data sets for rigorous analysis. They aren’t even used to thinking about that.

    But that’s what confuses me about places like the Brookings Institution. They’ve already recognized the importance of information concatenation (that’s kind of an awkward phrase, isn’t it? Not sure what else to call it). They have a great graph that shows that they’ve been putting the stuff together. All they have to do is make the data available.

  3. Have they already recognized that importance? I’m not so sure they have.

    I think they might have recognized some of the importance of data aggregation (aggregation is a better term, I think, for what they’re doing). They know that people like to see charts and graphs and that the word ‘data’ has some resonance in the world. You point out that they already put together a great graph, but that in its current form it is really hard to use the data behind that graph for inferential analysis.

    I have to point this out: has anyone at Brookings used that data for inferential analysis?? From what I’ve seen from their website and their reports, they really really haven’t. Statements like the following seem to be the norm in their use of data:

    “But what has gone relatively unnoticed is that the number of IED attacks outside these two nations has doubled over the past three years. The first nine months of 2011 saw an average of 608 attacks per month in 99 countries, according to the Defense Department.”


    Based on this descriptive statement the author proceeds to talk about a range of implications.

    They’re not even using the data they’ve collected for their own analyses! I think that might be part of the reason they don’t make their data accessible for anyone else – simply because they’re just not really aware that anyone would want data in any other form than an attractive and dynamic graph. They don’t use it for analysis themselves, why would anyone else want it for analysis?

  4. Of course, there’s always the possibility that they want to do more with the dataset, themselves–i.e., fear of getting scooped if they release a set that they’ve worked hard to produce.

    Add to the list sub-national data for Africa. Most national stats websites are outdated or else post data in PDF format. Kenya has broken out recently, though, with their Open Kenya site. Holes galore, but exportable in EIGHT formats.

    Like good social science bloggers, you should have an open-source data page.

  5. I thought about the term data aggregation and it didn’t seem appropriate. They do lots of that too – giving the total number of attacks rather than the records of individual attacks, total troops in country rather than total troops in province or district. That frustrates me too. But what frustrates me more is that, in many (not all) cases, in order to get those aggregate numbers they first had to take a bunch of records that were stored in different places and bring them into one place. That’s the point at which I wish they would share.

    But you’re right. I’m still coming to terms with just how many people define “analysis” as a table of sums and averages. That makes me sad, because there’s really so much people could do with the information they already have at their fingertips. That’s one of the reasons I find opportunistic analysis so exciting. People seem to think that if they don’t know the answer to something it’s because they don’t have the right information. I think they often have the exact information they need – they just don’t know what to do with it to get their answers.

  6. I’ve thought about that – not wanting to share for fear someone else will get the credit. I think that happens, but it’s obviously not the case with something like iCasualties. That website exists for almost no other reason than to make that data public. And the Brookings Institution example I gave actually had the data available – it was just stupidly hard to access in a usable format. I’d think if they didn’t want people to get it, they wouldn’t make an interactive graph that technically allows people to get it. They’d stick with just their plain-old graphs that they print in their PDF reports.

    Yes, I’ve liked the openness of some of the Africa data I’ve seen. And I’ve been impressed that the World Bank has recently made a lot of its data public. The exceptions make the rule all the harder to swallow.

    (Largely) open-source data page: http://www.infochimps.com/

  7. Here’s an interesting example of constraining opportunistic analysis.

    I never looked into the Iraq wikileaks data because when it was released I was working for the Army, and although official instructions tended to be confusing, I got the sense that we weren’t supposed to look at it.

    However, someone recently mentioned to me what the Guardian had done with the data and since I no longer work for the Army, I looked it up. It’s kind of amazing (but I admit, perhaps my expectations are low when it comes to conflict data). You can see it here:

    I’ll just point out the irony that in a classic case of constraining opportunistic analysis and making it almost impossible, US government analysts can’t really use this data even though everyone else in the world can. And from my own personal experience, it’s a lot more accessible and bigger than what I could typically find in the bowels of the DoD.

