The better my code gets, the better my research gets

I’m having a happy programming weekend. I need to correct some bits of analysis that I did about a year ago, and the ‘happy’ part comes from the fact that my past self, in this particular case, wrote good code. That makes it possible for me to do this and, even better, to do it without a major struggle. Which got me thinking more generally about the relationship between good code and good research.

What makes programming such a powerful tool is the ability to abstract repetitive tasks and automate them. This not only saves time, but also ensures consistency and makes the procedure reproducible. It should also make it easy to go back and dissect the methods used for a particular piece of analysis after the fact. Notice that all of these align with the gold standards we expect of scientific research – powerful, consistent, reproducible. Following good coding practices leads to good research.

For your pleasure and perusal, here’s a scary, scary article (Baggerly and Coombes, 2009) about what pretty much inevitably happens when best practices are not followed. Things like accidentally flipping 0’s and 1’s, thus reversing the outcome of your experiment. Or adding a row that misaligns all the numbers from their labels. I’d love to think that this is an isolated bad example, but I don’t believe that it is. I’ve done similar things before. The only reason I know that I have is that, eventually, my coding and analysis got good enough to let me dissect bits of analysis I did in the past and find glaring errors in them. Also, learning more about the subject I’m studying and the data sets I’m using, and talking to people much better and more experienced at this game than I am, led me to find errors in my methods, and better ways of doing analysis. Had I not carried on learning about it, *I might never have known*. Loads of papers get published every year with flawed methods and overinflated p-values, and mine would have just been another one of them. It’s something that we should really think about and address as a field.
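
To make those two failure modes a bit more concrete, here is a minimal Python sketch of the kind of guard that can catch them. It is not taken from the Baggerly and Coombes article; the function names, the sample IDs and the `OUTCOME_CODES` encoding are all hypothetical, made up for illustration. The idea is simply to pair values with labels by sample ID rather than by row position, and to write the 0/1 encoding down in exactly one place.

```python
# Hypothetical encoding, defined once so it cannot be silently flipped
# in one script and not in another.
OUTCOME_CODES = {"resistant": 0, "sensitive": 1}


def encode_outcome(name):
    """Turn an outcome name into its numeric code, refusing unknown names."""
    if name not in OUTCOME_CODES:
        raise ValueError(f"unknown outcome label: {name!r}")
    return OUTCOME_CODES[name]


def align_by_id(labels, values):
    """Pair each sample's label with its value by sample ID, not by row order.

    Both arguments are dicts keyed by sample ID. An extra or missing row
    raises an error instead of shifting every value against its label.
    """
    mismatched = set(labels) ^ set(values)  # IDs present in only one of the two
    if mismatched:
        raise ValueError(f"sample IDs do not match: {sorted(mismatched)}")
    return {sample_id: (labels[sample_id], values[sample_id]) for sample_id in labels}


if __name__ == "__main__":
    labels = {"s1": "sensitive", "s2": "resistant"}
    values = {"s1": 0.9, "s2": 0.1}
    for sample_id, (label, value) in align_by_id(labels, values).items():
        print(sample_id, encode_outcome(label), value)
```

A check like this won’t catch everything, but it turns a silent misalignment into a loud error, which is most of the battle.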

I can see how it happens. Biology undergrad degrees for the most part still have a relatively traditional curriculum, which doesn’t include much maths or programming. However, practicing biologists frequently end up with gigantic datasets that can only really be analysed computationally. And while some places might have bioinformatics core teams or collaborators, that’s often not an available option. So lots of people go into DIY data analysis.

It’s not my intention to discourage this. As a biologist turned bioinformatician, I found this data analysis adventure to be fascinating and wonderful, so I would encourage everyone to go play, and learn about coding and stats. As more and more high-throughput datasets are released, and high-throughput experiments become the norm, the ability to work with this data becomes empowering and invaluable. However, I do feel the need to post some caveats to go with that:

  • Every piece of analysis I did during my first 6 months of coding was *completely* wrong. That was before I learnt what a ‘genome release’ was, let alone that I should check whether it’s the same between datasets. Until you’ve been analysing data in a particular field for a while, the likelihood is that, with all the best intentions to be rigorous, you won’t even have the basic knowledge of the range of things that can go wrong, and so won’t be able to pick up on them. It gradually got better from there, but 3 years in, I still feel like I’m constantly learning.
  • Talking to / learning from experts is invaluable. But so are discussions with other people around you. I’ve even picked up on my own analysis errors in the process of teaching other people how to code. So – do talk to people! This can help you learn faster, and pick up errors quicker. 
  • To the best of your ability, make sure your code is clean and readable, and your analysis is reproducible. It’ll help you pick up on stuff going wrong, and you’ll be massively grateful to your past self if you ever have to go back and fix things. 
  • Keep digging, and keep learning – bioinformatics is a journey that lasts for years, not something you can pick up over a weekend.
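
As an illustration of the ‘genome release’ caveat in the list above, here is a minimal Python sketch of a check that refuses to combine datasets annotated against different releases. It assumes each dataset carries a small metadata dict with a `genome_release` entry; that field name, the `check_same_release` function and the file names in the usage example are all hypothetical, not part of any particular tool.

```python
def check_same_release(datasets):
    """Return the shared genome release, or raise if the datasets disagree.

    `datasets` maps a dataset name to a metadata dict that is assumed to
    contain a 'genome_release' entry (a made-up field name for this sketch).
    """
    if not datasets:
        raise ValueError("no datasets given")

    releases = {name: meta.get("genome_release") for name, meta in datasets.items()}

    # A dataset with no recorded release is as dangerous as a mismatched one.
    unknown = sorted(name for name, release in releases.items() if release is None)
    if unknown:
        raise ValueError(f"no genome release recorded for: {unknown}")

    if len(set(releases.values())) > 1:
        raise ValueError(f"datasets use different genome releases: {releases}")

    return next(iter(releases.values()))


if __name__ == "__main__":
    datasets = {
        "peaks.bed": {"genome_release": "GRCh37"},
        "expression.tsv": {"genome_release": "GRCh38"},
    }
    try:
        check_same_release(datasets)
    except ValueError as err:
        print(f"Refusing to combine datasets: {err}")
```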

I do worry that as a field, we’re unprepared for the flood of data that we’re increasingly generating. I’m also cautiously optimistic, and think that we can work to improve both researcher skills and the tools available for dealing with the data. So, in conclusion, I would like to link you to a wonderful post by Vince Buffalo, which goes into a more in-depth discussion of reproducibility, good coding practices and the beauty of Bioconductor.
