Trying to figure out why I don’t want to call myself a data scientist

I want to change my LinkedIn profile a little. [EDIT: Already changed it based on stuff I wrote here…I guess I should have archived the old one for reference.] The last time I really made any major profile updates was when I was applying for my current job.  Now, I sort of like my summary – it’s a story about a non-traditional briefing I gave to a British colonel, a fuller version of which I used in my cover letter. That story was apparently interesting enough that my current employers told me after I was hired that it had caught their attention, but it doesn’t really reflect the full range of what I currently do and is too focused on my government work experience. The same is true of the ensuing description of my abilities, and the description of my current job responsibilities needs to be updated.

I’ve also decided I’m tired of my current “professional headline.” It’s currently “Social and Behavioral Scientist,” which sounds terribly generic, as it should – social and behavioral science means so many different things that it means practically nothing. It’s like someone asking “what do you like to read?” and just answering with “books”.

I see a lot of people describe themselves in terms of their topical focus. That’s a problem for me, because I don’t have a topical focus and I like it that way. I’ve done research on conflict in Afghanistan and Pakistan, and am continuing to do so. I’ve done a good bit of market research – branding, segmentation, customer preferences, and things like that. Paul and I have recently turned to researching research itself, as we’re in the middle (well, not that far…first quarter?) of doing a meta-analysis of cited support for claims of ethnic violence in the Bosnia conflict. I’m also starting to look at getting back into some education research (my M.A. thesis was on student consumption of history curricula). And the only reason I’ve limited myself to the topics I just listed is that opportunities to study other things haven’t presented themselves yet. For me, behavior is the interesting thing. The context of that behavior doesn’t really make it more or less interesting to me.

I’ve also seen a lot of people describe themselves in terms of the methods they use. Maybe I dislike that approach just because social/behavioral research methods tend to get pigeon-holed into the useless quantitative/qualitative framework, which means I either need to (1) allow people to mistakenly assume that I have less methodological breadth than I actually do, (2) allow people to be confused when I mix methodological descriptions that they normally wouldn’t put together – e.g. “statistical ethnography”, (3) spend an inordinate amount of time boring and possibly mildly offending people with an explanation of why I reject their methodological categories, or (4) describe myself as doing “mixed methods”, a term that I think is really kind of silly.

Over the last several months, I’ve seen a lot of myself in people’s descriptions of “data scientists.” I think that term is funny – is it used in juxtaposition to scientists who don’t use data? I don’t think I’d even heard of “data science” until the beginning of this year, but Cathy O’Neil posted an interesting summary a while ago that caught my attention (and check out Kaiser Fung’s summary of the debate that ensued between O’Neil and Cosma Shalizi about whether data scientists are really just statisticians). Like many other characterizations I’ve read, O’Neil’s description of what an employer would want from a data scientist seems to fit me:

  • “Data grappling skills” – If in R, then yes. If in SQL, then yes for many but not all scenarios. If in Hadoop, then no.
  • “Data viz experience” – Yes.
  • “Knowledge of stats, errorbars, confidence intervals” – “I know stats” is just about as uselessly broad a description as “I’m a social scientist”, but I think what she’s getting at here is the ability to convey both estimates and the uncertainty surrounding those estimates. I can do that.
  • “Experience with forecasting and prediction, both general and specific” – I feel like this one goes along with the previous point. There are a lot of tools for doing prediction, and I know and have used many of those tools.
  • “Great communication skills” – I’d like to think so.

So far, so good. But in other parts of her post, she describes data scientists in ways that only partially describe me:

  • “Super quantitative” – No. Yes? I don’t really know what this means.
  • “Can work independently” – Sure.
  • “Knows machine learning or time series analysis” – I know of, but have relatively little experience with, common time series techniques such as vector autoregression, and while I’ve recently started digging into random forests and stuff like that, I wouldn’t say I’m competent in anything that could be considered machine learning. That being said, I think there are a lot of tools that create the same sorts of outcomes as machine learning techniques do, and I know a lot of those tools.
  • “Knows how to program” – depends on the language, but I certainly don’t consider myself a programmer. I’m a researcher. I can program enough to access, manipulate, and analyze the kinds of data I’ve come across in my career, assuming I can use the particular tools and languages I’ve already learned, or can be given enough time and incentives to learn a new one.
  • “Loves data” – Yes. Yes. Yes. Yes. Yes!

So in most cases the description seems to fit, but reading through other people’s posts on the subject I wonder how many people take such a “whole-of-researcher” approach to deciding whether they need a data scientist, or deciding what to ask a data scientist to do if they have one. For example, a recent discussion in LinkedIn’s Data Scientists group attracted a lot of comments about how people ask for data scientists when in fact that’s not what they want. A few quotes from that discussion:

“What companies want when they say ‘data scientist’ is someone who can program Hadoop and C/C++.”

“What I have observed is that all the data scientist positions in the market are focusing on Hadoop, MapReduce, Pig, Hive, NoSql, Unix etc as skill sets which I feel is more towards being proficient to run a program optimally for big data resources.”

“Data Scientists must have a programming background. In my opinion, if I were looking to hire one, they would need to be proficient in a scripting language like Python or Perl, need to be proficient in some type of [open-source] statistical language (R, Octave/Matlab, whatever), have some ability to program with Java (more preferable than C++) and have a strong data mining/machine learning, computer science, or statistical background (maybe 2 of the 3). Of course, a working knowledge of SQL is also necessary. Of course, throw in some Hadoop as well.”

O’Neil seemed to acknowledge the tendency for people to say “data scientist” but mean “computer programmer”:

“Don’t confuse a data scientist with a software engineer! Just as software engineers focus on their craft and aren’t expected to be experts at the craft of modeling, data scientists know how to program in the sense that they typically know how to use a scripting language like python to manipulate the data into a form where they can do analytics on it. They sometimes even know a bit of java or C, but they aren’t software engineers, and asking them to be is missing the point of their value to your business.”

That is part of what concerns me about calling myself a data scientist. I’m not a software engineer, and I have no desire to be. I’m interested in working with software engineers, and people who know C and Java and Hadoop and all the other tools I don’t know, because I’m interested in working on a team that can tackle just about any problem that comes its way. But I’m not interested in tackling those problems by myself. That’s not because I’m lazy or unable to learn the necessary tools. It’s that (1) I’m more interested in working the intersections between research design/planning, data collection, data analysis, and communication of findings than I am in inhabiting any one of those positions in and of itself and (2) I firmly believe that nearly every topic of social/behavioral research that could realistically be tackled by a single person has already been tackled – multiple times, in most cases.

Whether you’re a CEO, a military commander, a project manager, an administrator, or anyone else who needs to make decisions, information about your company, area of operations, project, or customers translates into better decision making only if you (or people you hire) establish what sorts of information are relevant to the pending decision; get that information or some reasonable proxy measures; analyze that information in a way that mitigates bias, considers alternative explanations, and estimates uncertainty; and communicate those analytic findings in a way that decision makers understand. All of those things are important all by themselves, but because all of them have to work together for data-driven decision making to happen, it doesn’t make sense to have them all happen independently of one another. I can work any one or more of those positions, but I can also work the spaces in between. I’ve been trained to do that. I enjoy doing that. I’m good at that. That kind of engagement across all the positions of the research process doesn’t seem to exist in most discussions of data scientists, except in those cases where data scientists are portrayed as analytic Renaissance Men or Women who do everything.

That brings me back to my second point: a jack-of-all-trades makes the best researcher only in cases where you have only one researcher, and the number of things you can’t do when you have only one researcher is huge, and seems to be growing every day. I’m convinced that most good research requires good teams. I think it’s a good idea for team members to overlap in their skills, just as I think the different areas of research should not operate independently of one another, but it seems like a waste of those skills (and, if you’re hiring a team, a waste of money) to hire people whose skills nearly completely overlap. You need people who do enough of what everyone else does to be able to understand and appreciate the other team members’ contributions, but who do enough of what everyone else doesn’t to be able to clearly and quickly take the lead on certain parts of the project.

I think team-based research is about more than just getting all the right skills together. There is something very isolating about being the only researcher in one’s place of work, and there’s something incredibly useful about being able to bounce ideas, findings, and interpretations off of someone else who has the ability to bounce them back. Just the existence of research teams isn’t enough to ensure good research, but teams can catch mistakes, energize their members, and improve both the ideas and the products that they eventually produce. Those are the situations where the really exciting – and useful – research takes place. That’s the kind of research I want to be involved in.

Now, I’m not overly concerned with how a potential employer might view me if I called myself a data scientist and then showed up to the interview not knowing Ruby or PHP. I know that what employers ask for is rarely what they really want, and that the only way to figure out if they really want something you have to offer is to actually go through the painful processes of applying and interviewing. But all that aside, I already have a job and don’t anticipate looking for another one anytime soon. I’m more concerned about being able to effectively communicate what I do and what I have to offer. Ever since I decided in grad school that I didn’t want to call myself an anthropologist anymore (for all the reasons Paul outlined in his most recent post), I’ve been searching for a professional description for myself.

When I talk to current colleagues about my conflict projects or meta-analyses, I get blank stares. When conflict researchers find out I currently work in a marketing department, I get confused silence and then a change of subject. When people find out my degrees are in anthropology and then I start talking about statistics and database management, they don’t seem to know what to say.

I think there must be a decent, succinct description for someone who can live in any of the provinces of social and behavioral research, but who is happiest and most effective traveling between those provinces. I just don’t know what that description is. “Data scientist” comes close in many ways, but especially when I look at all the ways that term is used in actual work settings, I’m not convinced the term is any better than something generic, like “researcher.” At least with “researcher,” people will know that they don’t know what I do. With any of the disciplinary titles, be it “data scientist” or “anthropologist”, I find I have to spend far too much time explaining why I don’t do half the things they think I should, and why I do do half the things they think I shouldn’t. I don’t need a short, clear title for myself. It would just be nice to have.

