The better my code gets, the better my research gets

I’m having a happy programming weekend. I need to make some corrections to bits of analysis that I did something like a year ago, and the ‘happy’ part comes from the fact that my past self, in this particular case, wrote good code that makes it possible for me to do this, and, even better, to do it without major struggle. Which got me thinking more generally about the relationship between good code and good research.

The unique property of programming, which makes it such a powerful tool, is the ability to abstract repetitive tasks and automate them. This not only saves time, but also ensures consistency and makes the procedure reproducible. It should also make it easy, for example, to dissect post hoc the methods used for a particular piece of analysis. Notice that all of these align with the gold standards we expect of scientific research – powerful, consistent, reproducible. Following good coding practices leads to good research.

For your pleasure and perusal, here’s a scary scary article (Baggerly and Coombes, 2009) about what pretty much inevitably happens when best practices are not followed. Things like accidentally flipping 0’s and 1’s, thus reversing the outcome of your experiment. Adding a row that shifts all the numbers out of alignment with their labels. I’d love to think that this is an isolated bad example, but I don’t believe that it is. I’ve done similar things before. The only reason I know that I have is that eventually my coding and analysis got good enough that I could dissect bits of analysis I did in the past and find glaring errors in it. Also, learning more about the subject I’m studying and the data sets I’m using, and talking to people much better/more experienced at this game than I am, led me to find errors in my methods, and better ways of doing analysis. Had I not carried on learning about it, *I might never have known*. Loads of papers get published every year with flawed methods and overinflated p-values; mine would have just been another one of them. It’s something that we should really think about and address as a field.
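The row-misalignment error is exactly the kind of thing a few lines of defensive code can catch. Here’s a toy sketch (the data and column names are mine, not from the Baggerly and Coombes paper): joining labels to measurements on an explicit key, rather than relying on row order, turns a silent one-row offset into a loud failure.

```python
import pandas as pd

# Invented example data: two 'treated' samples and two controls.
samples = pd.DataFrame({
    "sample_id": ["c1", "c2", "t1", "t2"],
    "group": ["control", "control", "treated", "treated"],
})
measurements = pd.DataFrame({
    "sample_id": ["c1", "c2", "t1", "t2"],
    "expression": [1.0, 1.1, 5.0, 5.2],
})

# Merging on an explicit key, instead of pasting columns side by side,
# makes a silent one-row offset between labels and numbers impossible.
merged = samples.merge(measurements, on="sample_id", validate="one_to_one")

# Cheap sanity checks before any statistics run:
assert len(merged) == len(samples), "lost or duplicated samples in merge"
assert merged["expression"].notna().all(), "missing measurements"
```

Row-order joins fail silently when a stray header row sneaks in; key-based joins with a validation step fail loudly, which is the behaviour you want.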

I can see how it happens. Biology undergrad degrees for the most part still have a relatively traditional curriculum, which doesn’t include much maths or programming. However, practicing biologists in the field frequently end up with gigantic datasets that can only really be analysed computationally. And while some places might have bioinformatics core teams or collaborators, that’s often not an available option. So lots of people go into DIY data analysis.

It’s not my intention to discourage this. As a biologist turned bioinformatician, I found this data analysis journey fascinating and wonderful, so I would encourage everyone to go play, and learn about coding and stats. As more and more high-throughput datasets are released, and high-throughput experiments become the norm, the ability to work with this data becomes empowering and invaluable. However, I do feel the need to post some caveats to go with that:

  • Every piece of analysis I did during my first 6 months of coding was *completely* wrong. It was before I learnt what a ‘genome release’ was, let alone that I should check whether it’s the same between datasets. Until you’ve been analysing data in a particular field for a while, the likelihood is that, with all the best intentions to be rigorous, you won’t even have a basic knowledge of the range of what can go wrong, and so won’t be able to pick up on it. It gradually got better from there, but 3 years in, I still feel like I’m constantly learning.
  • Talking to / learning from experts is invaluable. But so are discussions with other people around you. I’ve even picked up on my own analysis errors in the process of teaching other people how to code. So – do talk to people! This can help you learn faster, and pick up errors quicker. 
  • To the best of your ability, make sure your code is clean and readable, and your analysis is reproducible. It’ll help you pick up on stuff going wrong, and you’ll be massively grateful to your past self if you ever have to go back and fix things. 
  • Keep digging, and keep learning – bioinformatics is a journey that lasts for years, not something you can pick up over a weekend.
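The genome-release pitfall in the first caveat is a good example of an error that code can simply refuse to let you make. A minimal sketch, assuming each dataset carries a small metadata record (the field names and release labels here are invented for illustration):

```python
# Invented metadata records for two datasets about to be combined.
dataset_a = {"name": "expression_2010", "genome_release": "dm2"}
dataset_b = {"name": "binding_2011", "genome_release": "dm3"}

def check_same_release(*datasets):
    """Refuse to combine datasets mapped to different genome releases."""
    releases = {d["genome_release"] for d in datasets}
    if len(releases) > 1:
        raise ValueError("mixed genome releases: %s" % sorted(releases))
    return releases.pop()

# check_same_release(dataset_a, dataset_b) raises ValueError here,
# flagging the dm2/dm3 mismatch before any coordinates are compared.
```

One assertion at the top of an analysis script costs nothing, and it turns “I didn’t know to check that” into an error message you can’t miss.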

I do worry that, as a field, we’re unprepared for the flood of data that we’re increasingly generating. I’m also cautiously optimistic, and think that we can work to improve both researcher skills and the tools available for dealing with the data. So, in conclusion, I would like to link you to a wonderful post by Vince Buffalo, which goes into a more in-depth discussion of reproducibility, good coding practices and the beauty of Bioconductor.

Enhancer architecture

There were a few really interesting talks on enhancer architecture in the gene expression session this morning. Here’s a quick summary of the work, and why I think it’s really cool.

Enhancers, otherwise known as cis-regulatory modules (CRMs), are sequences outside of core promoter regions, which affect gene expression. They can be extremely complex, with a large number of interacting modules working together to facilitate dynamic changes in gene expression. A lot of work has been done on characterising individual enhancers (for a collection of experimentally validated enhancers known for Drosophila, check out REDfly), but we are only just beginning to understand how they interact and work together. Uncovering the design constraints and the action of interacting enhancers is a cornerstone of our efforts to understand genome regulation, which makes this a really interesting topic of research.

The first talk, by Jelena Erceg (working in the Furlong lab at EMBL, Heidelberg), used the enhancers for pMad and Tinman, known from in vivo experiments. From these, she constructed a series of synthetic enhancers attached to a reporter gene, with the aim of finding out how the distance and orientation of enhancer elements affect gene expression. It is perhaps unsurprising that both of these do have an impact on the regulatory effects of the enhancer. However, what was interesting is that the effects varied from tissue to tissue – in the visceral mesoderm, the enhancer appeared to be very robust, and only changed effect in response to large changes in spacing. In the cardiac mesoderm, on the other hand, small changes in the layout of the enhancer sites had a large effect on reporter gene expression, showing that the enhancer is much less robust in this tissue. Getting to the root of these differences sounds like a really interesting problem.

The second talk, by Tara Martin (from the DePace lab at Harvard Medical School), used the same methods to address a slightly different question. She started off with two different models of enhancer action. One is an ‘enhanceosome’ – an enhancer whose entirety acts as one entity, with the positioning of the elements being important; the other is a ‘billboard’ model – an enhancer composed of a large number of independent modules, whose effects on gene expression are additive. She then used synthetic promoter regions attached to a reporter to test which of these models is plausible – adding extra modules should have no effect on an enhanceosome, while it would have an effect on a billboard enhancer. Her conclusions were very interesting – there were small, independent modules present, but those were composed of a number of enhancer site units. It appeared that the presence or absence of enhancer sites conveyed tissue specificity, while the number of enhancer sites in a module conveyed the strength of the regulatory effect.
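The two models make predictions that are easy to state as a toy computation. This is my own illustration, not the speaker’s code, and the module names and effect sizes are invented – it just captures the logic that adding a module changes billboard output but leaves an intact enhanceosome alone:

```python
# Hypothetical per-module contributions to reporter expression.
effects = {"A": 2.0, "B": 3.0, "C": 1.0, "D": 1.5}

def billboard(modules):
    """Billboard model: independent modules, additive effects."""
    return sum(effects[m] for m in modules)

def enhanceosome(modules, core=frozenset("ABC"), active_level=10.0):
    """Enhanceosome model: acts as one unit; full output whenever the
    complete core set is present, regardless of extras."""
    return active_level if core <= set(modules) else 0.0

base = ["A", "B", "C"]
plus = ["A", "B", "C", "D"]  # same enhancer with one extra module added

assert billboard(plus) > billboard(base)         # billboard: output shifts
assert enhanceosome(plus) == enhanceosome(base)  # enhanceosome: no change
```

That contrast is the logic behind the experiment: add modules, watch whether reporter expression shifts, and the two models come apart.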

The reason I love these studies is that they draw on previous in vivo research, and then use a structured synthetic biology approach to try and unveil regulatory design principles. As well as giving really interesting results, the carefully constructed experiments that systematically test interactions and effects are, well, pretty. They have that lovely traditional appeal of “carving nature at its joints” – just what science should be like!

5 ways to access modENCODE data

The modENCODE project is an international initiative aiming to characterise functional elements in the fly and worm model organisms. To date, they have collected data from over 1800 experiments, including everything from gene expression profiling and copy number variation, to a range of histone modification and transcription factor binding sites.

This is a massively useful community resource – whatever favourite gene or problem you’re working on, it’s likely that due to the wide range of experimental data available, there are modENCODE datasets that could complement your research. This is a quick guide aimed at helping you find out what data exists, and how to get a hold of it.

5 ways of accessing modENCODE data:

1. The faceted data browser

This is my favourite interface for browsing the available datasets. There is a range of filters for selecting your datasets of interest, with links for viewing the data in GBrowse or modMine, and for downloading it.

2. Browsing by category

This feature exists both on the modENCODE webpage and in modMine. The datasets are sorted into broad categories – for example ‘Chromatin structure’, and each category has a number of studies associated with it – for example ‘Nucleosome mapping’ and ‘Genome-wide chromatin profiling’. Clicking on a study takes you to a list of all the data submissions associated with it.

3. Keyword search

This feature is available on the front page of modMine, as well as in the top right corner of every modMine page. You can search for your favourite gene, experiment type, PI, etc.

4. FTP download

For easy bulk download of modENCODE data, you can use the FTP interface.

5. On the Amazon cloud

To save you downloading data to a local machine, all of it is available on the Amazon cloud. You can upload your own data too, and do your analysis there. To get started, check out this modENCODE help page.

Fighting malaria, one banana-sniffing fruit fly at a time

I’m currently at the 2012 Drosophila Meeting in Chicago, which just got off to a wonderful start. My favourite part of the evening was Stephanie Turner Chen’s presentation on her fantastic PhD thesis work, for which she received the Larry Sandler award.

She started off studying Drosophila olfactory neurons – specifically, how flies smell carbon dioxide (they hate it and run away from it). Some genetics identified the specific receptor responsible in those neurons. She then asked how it is that fruit flies hate carbon dioxide, but love fermented fruit, which gives off plenty of it. It turned out that there are other compounds which can counteract the carbon dioxide neuron response.

The link to malaria is that mosquitoes, unlike fruit flies, love carbon dioxide, and use the smell to locate humans. But, like fruit flies, they use the same receptor to detect the carbon dioxide, and they react to the same compounds that disrupt the detection mechanism. So, by disrupting the mosquitoes’ ability to smell carbon dioxide, we can make their human targets invisible, offering a potential new, cost-effective mosquito repellent which could help in the fight against malaria.

I love this project for a number of reasons. It’s a whirlwind of awesome science, from genetics, to electrophysiology for looking at activation of neurons, to insect behavioural studies and even field testing of the potential new mosquito repellents. It’s wonderfully question driven – the diverse array of techniques is applied to answer a logical sequence of questions about the observed phenomena. And finally, it’s a wonderful showcase of how abstract basic science can have a real world impact.

Well done Stephanie, the award is well deserved!