Cog-Sci of Data Wrangling

April 16, 2017


This week I gave a guest lecture in the Data Science class, a slightly terrifying experience given that it’s a ~400 person class and that it was my first big talk on more hands-on programming topics - but I think it went pretty well. The topic was ‘Data Wrangling’ - the largely ‘administrative’ process of getting all your data ‘wrangled’ together into a usable format so that you can do the actual analysis for a project. It’s a really messy process, and takes forever. The talk itself was largely a tutorial, meant as a practical guide to wrangling data by way of a crash course on file types, databases and APIs. If you’re at all interested, the Jupyter notebook I used for the talk is publicly available here, and the talk was recorded for the class and is also publicly available here (which I find very weird…).

Since I’m really just a cognitive scientist temporarily masquerading as someone qualified to TA for Data Science, preparing for this talk got me thinking more broadly about the context of Data Wrangling. More specifically, about how we got to this state - what factors led us to the current state of how difficult data is to wrangle. This post is basically some quick, assorted thoughts on that (the ‘cog-sci’ of data wrangling, perhaps). I’m sure they are not original, and they are perhaps more broad and hand-wavy than practical (this is explicitly not about practical engineering issues such as standards, formats, tools, etc.), but here they are:

So, what to do? I mean, I don’t know. Data Science is often ad hoc, pulling together data and tools that were not designed to go together, and we don’t have any universal solutions for collecting, storing and analyzing data. Ultimately, then, many of these problems are probably here to stay, for now at least.

Despite this, some general thoughts:

Anyways, that’s that. There are surely tons of other relevant points to be made, and very possibly I’m wildly missing the mark, but these were some broader thoughts I wanted to jot down, since they didn’t make it into the talk in a detailed form. Let me know if you have any thoughts / comments / suggestions / retorts / questions / criticisms / magical skills on offer to wrangle all my data for me.
