At some point near the beginning of grad school I was amassing code, data files, and plots, and was feeling a little uneasy. It was the first research project I had full ownership of. Which raw data were contributing to which results? I found myself reluctant to make changes to what I had done for fear of introducing inconsistencies. I mentioned to my advisor that I was trying to sort things out and she suggested a tool that handled that very task. Wanting to avoid extra work, I made a “mental note” and never tried it.

Flash forward to my postdoc, years later, when I had gained some intuition for how to organize a data analysis project but still found myself more preoccupied than I should have been with managing processing and analysis steps across multiple exploratory branch points. Finally I “rediscovered” Make, organized my project with a Makefile, and have loved it ever since.

Briefly, Make is used to produce one or more files by running code on other files. It treats the flow of information as a directed acyclic graph (DAG), that is, as a network of dependencies. While Make is most commonly used to build software from source files, it also works great for data processing and analysis, where the tables, figures, and other outputs are the culmination of steps that extract information from raw data and transform it. With Make you can change any data or code, then tell it to update anything from a single result file to the entire project. It figures out what is affected by the change and runs only the necessary code.

[Figure: Make as a directed acyclic graph.]
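To make that concrete, here is a minimal sketch of such a Makefile. The file and script names (clean.R, plot.R) are hypothetical stand-ins for whatever your pipeline actually contains:

```make
# A sketch of a two-step pipeline: raw data -> cleaned table -> figure.
# (Recipe lines must be indented with a tab.)

all: figures/scatter.pdf

# The cleaned table depends on the raw data and on the script that
# produces it; touching either marks this target as out of date.
data/clean.csv: scripts/clean.R data/raw.csv
	Rscript scripts/clean.R data/raw.csv data/clean.csv

# The figure depends on the cleaned table, not on the raw data directly.
figures/scatter.pdf: scripts/plot.R data/clean.csv
	Rscript scripts/plot.R data/clean.csv figures/scatter.pdf

.PHONY: all
```

After editing scripts/clean.R, running `make` rebuilds both the cleaned table and the figure; after editing scripts/plot.R, only the figure is redrawn.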

For a good introduction and basic demo, I recommend Mike Bostock’s post. It’s referenced by another good page describing a sensible default structure for a data science project that works well with Make.

Using Make has incentivized me to organize my work in a way that enhances clarity. I’ve broken long R scripts into more modular parts that write and read intermediate data files. Instead of assembling data, manipulating it in many ways, and saving plots all in one interactive R session, my scripts are now focused on more specific tasks. Rerunning them all manually whenever something upstream changed would have been a pain, but with Make, making or trying out a change requires no work beyond the change itself.
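One of those focused scripts might look like the sketch below. It assumes the Makefile passes the input and output paths as command-line arguments; the cleaning step itself is made up for illustration:

```r
# One focused pipeline step: read an input file, transform it, and
# write an intermediate file for downstream steps to pick up.
args <- commandArgs(trailingOnly = TRUE)
input  <- args[1]  # e.g. data/raw.csv
output <- args[2]  # e.g. data/clean.csv

raw <- read.csv(input)

# Hypothetical cleaning step: keep only complete rows.
clean <- raw[complete.cases(raw), ]

write.csv(clean, output, row.names = FALSE)
```

Because the script runs start to finish without interaction, Make can rerun it unattended whenever its inputs change.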

As with KonMari, tidying your project with Make is probably easier if done in one go. When I had a bunch of in-memory data objects that referred to each other, packed into one long script for “convenience”, I needed to sort out the information flow and split it into steps. Some of the scripts weren’t even designed to run start to finish without intervention, so those needed fixing before they could be referenced in the Makefile.

[Figure: Tangled arrows converted to an organized network of nodes.]

This organization reduces the stress caused by the increasing complexity of an expanding project. As a psychological bonus, I’ve always had an especially strong drive to build things. Explicitly connecting all the data and code in a scientific study helps me feel that I am indeed creating something. Now I’m interested to see what other applications Make would be useful for. Give it a try!