Work tips: reference management

At heart, I’m a LaTeX geek. This served me well during a number of years of generating pretty reports, and eventually a PhD thesis. I also used it for the occasional beautifully formatted paper submission. However, biology is primarily a collaborative sport, and I do quite a bit of writing for other people, and sadly I don’t find that the world at large always shares my LaTeX feelings. So most of the time, I succumb to the gravitational pull of Microsoft Office. However, one of the more painful parts of this transition has been finding a reference manager that I find as easy to use as BibDesk.

EndNote seems to be a common choice, but unfortunately, the interface just doesn’t make sense to me at all. I managed to import references from specially downloaded citation files, but when it came to anything other than that, I was just lost. I missed the nice, straightforward integration with Google Scholar, the interface seemed slow and clunky, and I even seemed to have trouble finding the right search buttons. At the point where I accidentally flooded my library with 2056 articles (all articles ever written by an author named Guttman), I decided there must be another way.

Since I heard people enthusing about it, I thought I’d try Mendeley. And I have to say – within minutes, I was hooked. They already had me on the first screen – “drop your paper PDFs here, and we’ll extract the publication information from them”. And things carried on similarly smoothly from there. Within 5 minutes, I had imported references and attached PDFs, knew how to leave notes on individual papers, and setting up integration with Word took all of 2 clicks. A bookmark in my web browser also allows me to import papers as soon as I first find them, and my Mendeley web account means that my references will be synced across different computers. And it’s BibTeX compatible too, which means I don’t have to give up my LaTeX addiction. It really has all the best features of modern software design – fast, clean, intuitive. And don’t worry, they’re not paying me – the software is free anyway.

I have to say, I breathed a sigh of relief that I wouldn’t have to spend hours and days obtaining the arcane knowledge required to make EndNote do my bidding. Now looking forward to a bright Mendeley future.


Rosalind: online bioinformatics teaching tool

Rosalind is a beautiful online tool that lies at the intersection of “people who like solving bioinformatics problems”, “people who like structured learning curves” and “people who like receiving experience points and levelling up for their efforts”.

Basically, you make a username and start with the simplest problems – the first one is to count the occurrence of different nucleotides in a string. As you solve the simpler problems, the more difficult ones become unlocked, and the progression teaches you programming concepts in the process. In case you’re worried these will be too simple, the more advanced problems do include things like multiple alignments and phylogeny comparisons – there’s probably some interesting stuff here for most people.
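To give a flavour of the starting level, that first counting problem fits in a few lines of Python. This is my own sketch of a solution (with a made-up toy string), not anything official from Rosalind:

```python
# Count how many times each nucleotide appears in a DNA string --
# the gist of Rosalind's first problem.
from collections import Counter

def count_nucleotides(dna):
    """Return the counts of A, C, G and T in a DNA string."""
    counts = Counter(dna)
    return counts["A"], counts["C"], counts["G"], counts["T"]

# A toy example string
print(count_nucleotides("AGCTTTTCATTCTGACTGCA"))  # (4, 5, 3, 8)
```

From there, the problems ramp up fast enough that the later ones need real algorithmic thinking rather than a standard library one-liner.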

Most importantly – it’s fun, and it’s addictive. Check it out!

Work tips: Manufacturing structure

Academia is flexible. Both in terms of the prescribed working hours (of which at times there are many, but they can happen whenever you like), and the things you do within those hours. In many ways, this is both a blessing and a curse. As someone who loves organising their own time and hates being told what to do, I mostly consider it a selling point. And the ability to sip my coffee slowly in the morning while my partner rushes off to their industry job is, well, a perk. However, it also means that, since pretty much everything runs off self-motivation and self-enforcement, if you’re in the middle of a motivational dip, even getting to work in the morning, and everything from there, requires disproportionate feats of willpower.

As such, for me, a big part of getting work done relies on manufacturing structure. Sometimes it’s focused around real deadlines, but often it’s arbitrary – I set myself checkpoints and deadlines, and fabricate emergencies, in order to get a reliable work output. The basics of this are pretty universal – breaking things down into smaller chunks, making timelines, assigning clear time limits for tasks I hate (“I’ll do exactly 15 mins of admin, then allow myself to do fun thing x” is better than procrastinating about doing admin all afternoon). The systems themselves have varied enormously though. Since repetition bores me (another perk of a science job: it’s a job in which you’re constantly learning new things, making it habitable for people like me), it means that I will invent a system, use it for maybe a month or two, get bored of it, then invent another system. So, January might be “I handwrite my tasks in the morning on a piece of A4 paper, then tick them off as I go along” month, while March turns into “3 work tasks on a colourful A5 piece of paper, and a smiley sticker for every 20 mins of work I do”. In case anyone’s interested, this month happens to be “Big Word document with all my tasks ever, and I pick out a ‘major’ + a few ‘minor’ things to focus on each day”. However, this month I have some extra help, which is exciting enough to make me dedicate a whole blog post to structure.

I recently came across an iPhone app called 30/30. I declare no conflict of interest, and either way, the app is free (full functionality, no adverts – they make money from selling people cute optional extra icons). The concept is really simple – you assign a set of tasks, with time for each one (e.g. sort email 10 mins, write manuscript 30 mins, work break 8 mins), and it cycles through them continuously, sounding an alarm when your task time runs out, at which point you’re supposed to move on to the next one. I didn’t expect it to make much of a difference to me, since I thought this was pretty much what I did already – assign time to tasks, then go and do them. However, I’m finding that it’s making a massive difference to my productivity. I think the reason for this is that when I’m ‘self-policing’, a part of my focus is actually dedicated to planning ahead, looking at the clock, trying to make sure I don’t lose track of time, etc. Or, you know, it isn’t, and then my schedule falls apart, and the 10 mins that were supposed to be spent sorting emails expand to an hour. Using the app, there is an external signal that I know will go off, so I can actually dedicate the entirety of my brain to the task at hand. This is super useful, since some of my favourite science tasks (reading literature, writing, analysis) are really immersive. And also, since there’s an actual allocated time for things like email, I don’t find myself sneaking a peek at it every few mins during my research time. It’s not meant to be super restrictive – if there’s a task I’m really into, I just pause the timer and go for it. However, for the great majority of the time, structure helps me channel my energy in a productive way, and it’s definitely a safety net for times when things aren’t going all that smoothly. For people without an iPhone or similar, while the interface will be less shiny, the same system can probably be replicated with a piece of paper and a timer.

Since I’m on the subject of productivity in science – here’s an awesome blog about it that I came across recently:

How about you guys? How do you organise your time? What tips do you find helpful?

Drosophila neuroscience teaching in Uganda

I recently found out about this super cool charity that was originally founded by a few Cambridge biology PhD students:

The idea is to introduce Drosophila to African neuroscience labs (who currently mainly work on rats) as a cheap, convenient and awesome model organism. The way they are doing this is by running some pretty intense summer courses with a whole bunch of lecturers from different universities. The courses cover theory, but also focus a lot on practical skills – everything from running and maintaining a fly lab, to making and using equipment, to how to find open access lab protocols and obtain reagents. The charity also set up a fully functioning fly neuroscience lab at the University of Uganda, using sponsorship money and donated lab equipment.

There’s a Facebook page, as well as a blog from the 2011 course. Check out the pictures on the blog – I was really impressed with some of the improvised equipment they were using!

I don’t believe in genomics

Here’s the deal. I love genomics. I love playing with numbers. And I still think that unveiling the design of life itself is the most interesting problem in the world. However. I think we have some issues.

The first problem we have is the Epic Multiple Testing Problem. Ok, so I’m sure everyone’s aware of multiple testing corrections within individual experiments – those are fairly straightforward, and we correct for them. What we don’t correct for, however, is that in particularly hot fields, lots of labs are working on the same problem. And by chance, some of them get strong, striking results. The more extreme your results are, the more dubious you should really be about them – but in practice, we trust them more, they get published first, etc. Then when people go to replicate the results, they get weaker or no results, but there’s a bias against publishing negative results, so these get disregarded. And it’s only once the results are so entrenched that they become dogma that a space is created for people to go in and specifically disprove them.
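This mechanism is easy to see in a toy simulation (all the numbers here – lab count, sample sizes, the crude z-test – are made up purely for illustration): hundreds of labs all chase an effect that doesn’t exist, and only the ‘significant’ results get published.

```python
# Toy simulation of the Epic Multiple Testing Problem: many labs test
# the same *nonexistent* effect; only "significant" findings get published.
import random
import statistics

random.seed(0)

def run_lab(n=20, true_effect=0.0):
    """One lab's experiment: mean difference between two groups of size n."""
    treated = [random.gauss(true_effect, 1.0) for _ in range(n)]
    control = [random.gauss(0.0, 1.0) for _ in range(n)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = (statistics.variance(treated) / n + statistics.variance(control) / n) ** 0.5
    return diff, abs(diff / se) > 1.96  # crude z-test, roughly p < 0.05

results = [run_lab() for _ in range(500)]
published = [diff for diff, significant in results if significant]

print(f"{len(published)} of 500 labs found a 'significant' effect")
if published:
    print(f"smallest published |effect|: {min(abs(d) for d in published):.2f}")
```

Despite the true effect being exactly zero, a handful of labs get striking, publishable results – and since only the extreme results clear the significance bar, the published effect sizes are systematically inflated.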

So, we have a cycle of: great result – mostly unreported weakening – accepted dogma – disproved conclusions. A cycle that possibly takes decades. This is described by John Ioannidis in his excellent article, Why Most Published Research Findings Are False. And do check out this New Yorker article on the same topic – this is really something that’s happening widely, and across disciplines. Drugs are magically losing efficacy, and a large number of textbook scientific claims are gradually being disproven.

I don’t see this as a tragic demise of science, more as run-of-the-mill science. I believe in Karl Popper’s model of science as relying on falsifiability, rather than on being continually right. We make claims, we test them, only the best survive – that’s ok. It does mean a few things though. First of all, because of the high pressure on scientists to publish positive results, there’s a systematic pressure to create and embrace false positives. Because of the Epic Multiple Testing Problem, even with correct stats, the most up-to-date methods, and no bad intentions from anyone – lots of research findings, particularly in hot fields, will be false positives. I think it gets worse in fields where people think they already know the conclusions (hello obesity research), or where there are social beliefs tangled up in the science (e.g. anything ever about the male and female brain, Simon Baron-Cohen’s thing about autism being the ‘extreme male brain’, whatever. Did you know that girls with autism are significantly less likely to be diagnosed, even when displaying the same severity of symptoms as the boys who are diagnosed? Not helpful. Anyway). Not that everything that emerges from those fields will be false, but I would guess that it significantly increases their rate of false positives.

So, this is a problem across the entirety of science as a field. What makes genomics potentially worse than average? Well, as I briefly discussed in my last post, there’s a flood of data, and as a field, we’re really quite unprepared and incompetent with it. I don’t mean that individual people are necessarily bad, but biology isn’t traditionally a particularly numerical field (despite some wonderful exceptions to this generalisation), and I think most undergrad curricula still reflect this. Mine included no programming, and next to no maths, except what I set out to learn on my own initiative. However, lots of people later end up in positions where they need to make use of high-throughput data – something you can’t meaningfully do without at least a basic understanding of stats. But people need to publish to survive, so what you end up with are widely misapplied statistics and hugely variable quality of bioinformatics methods. And, you know, the times when people accidentally flipped 1s and 0s and reversed the outcome of their experiment, shifted everything by just one row, accidentally used the same data twice, or the wrong genome release, or any number of other things that can go wrong. I expect this to change as training catches up with the new requirements biologists are facing, but for now, even from this alone, I’d expect we’re one of the fields with the highest rate of false-positive findings in science today.

However, even ignoring the incompetence issues, I think there might be another problem. Even if you know what you’re doing, I think the journey of relating the data back to the biology is a precarious one. It’s possible that the more complicated models are actually overfitting the data – while the numbers are fun to play with, what we’re actually dealing with tends to be very noisy data, so any conclusions beyond the most broad and basic ones are in danger of being just patterns that we see in noise. Even with the best tools available to us.
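A quick way to convince yourself of that last point: fit increasingly flexible models to data that is pure noise, and watch the fit ‘improve’ anyway. A minimal sketch (the degrees and sample size are arbitrary choices for illustration):

```python
# Overfitting in miniature: polynomials of increasing degree fitted to
# pure noise. The fit keeps "improving", but every pattern found is
# imaginary -- there is no signal in the data at all.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 20)
y = rng.normal(0, 1, size=20)  # pure noise: nothing to find

fit_quality = {}
for degree in (1, 5, 10):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    fit_quality[degree] = residuals.var()
    print(f"degree {degree:2d}: residual variance {fit_quality[degree]:.3f}")
```

The residual variance shrinks steadily as the model gets more complex, even though by construction there is nothing real being captured – which is exactly the trap a sufficiently complicated model can fall into with noisy genomics data.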

So I don’t believe in genomics. I believe that most of what we publish today will be proven false down the line. I believe that the pressure to publish exacerbates the problem, and that because of the nature of our field, we’re particularly vulnerable. It’s terrifying, given how much of an impact genomics is already having and will continue to have on medicine. It’s also a strong reason to set out to do it better.

To make an obvious wishful thinking point – the focus of any performance assessments should be on the quality of research, rather than the speed and number of publications. Further to that, an increased ability to publish negative results would reduce some of the biases present. But working within the existing system, how can we do better science?

Well, for a start, I salute everybody doing systematic meta-studies. Given the problems outlined, I’m not particularly sure any one study really proves anything, but after a number of years and a number of studies, you can start getting an impression of what the conclusions are. Meta-studies are an excellent way of systematically incorporating the existing knowledge – particularly if they also assess the quality, not just the conclusions, of the individual studies included – and a really powerful tool. Sometimes just really powerful at telling us that our evidence is contradictory and we don’t actually know much of anything, but, you know what? That might be the real answer.

As for individual studies… I think it’s important to base your conclusions on multiple strands of evidence, and connect them with simple methods and good solid logic. I think high-throughput data is quite seductive, in that you can generate a lot of it quickly, and of course you can see patterns in it – you’ll see patterns in any large enough set of numbers. So, build up evidence from multiple sources, make sure that your sources of evidence follow best practice recommendations (controls, sufficient replicates, etc.), integrate the data (ooh, buzzword), be aware that noise and false positives are a problem at every stage of the process, and see what you can get out of it. It’s frustrating, and takes years, and you probably won’t get very flashy results out of it. Unfortunately, it’s how solid science is done. As a bioinformatician, I don’t aspire to use the most complicated models, or build the most sophisticated tools. What I aspire to do is to find simple, solid strategies to filter noise from meaning.

In conclusion – our field is messy, noisy, and filled with false positives. But it’s exciting and I love it anyway, and I think it’s crucial that we keep trying to make it better. I don’t believe in genomics, but I do believe in its future.