Trimming Illumina sequencing adapters

I’ve been trying to get my head around how Illumina sequencing adapters work, so that I can trim them from my sequencing data accurately. Being quite into educational videos these days, I first watched this video:

There was also a really helpful article about them from Tufts University, which you can check out here:

One of the things I was trying to find out was whether you cut the adapters from the 5′ end or the 3′ end. From these, I got the impression that, provided your library is of decent quality and the size selection was done right (the example in the video used fragment sizes of 200-300 bp), you shouldn’t need to trim the 3′ end: with a 100 bp read, the sequence should never reach the adapter on the other end. And indeed, this is exactly what I observe in our lab’s RNA-seq data. There’s an occasional adapter trace on the 5′ end, but after that it’s good-quality sequence, with no duplication or GC bias that would suggest adapter contamination on the other side. Adapter trimming for data like this is pretty much optional, I think, since a lot of mappers can deal with it and just discard the non-matching sequence.

However, I’ve also been dealing with a type of data that has very different properties: data from an RNA bisulfite experiment. For anyone unfamiliar with it, this is an experiment used to determine cytosine-5 methylation patterns in DNA or RNA. Bisulfite treatment converts unmethylated cytosine to uracil while leaving methylated cytosine intact, so the methylated positions can be read off from the sequence. Getting data on RNA methylation is exciting, because this is an area that hasn’t been explored all that much, but the experiment itself comes with some major difficulties. RNA is relatively unstable at the best of times, and bisulfite treatment is harsh and introduces several types of chemical degradation.

As a result, by the time it got as far as sequencing, the library didn’t need to be size-selected: it was already in the 50-140 bp range, because the RNA had broken apart in the course of the procedure. With fragments this short, the 3′ adapter gets sequenced in a substantial fraction of reads, so adapter trimming becomes a really essential step. And indeed, in the FastQC step I observed very high levels of sequence duplication for the indexed primers, confirming this. However, I still didn’t feel I had completely got my head around how adapters work, or what was happening with my data.
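To make the arithmetic concrete, here is a minimal sketch of how many bases of a read run past the insert and into the 3′ adapter. The function name is made up for illustration, and the 100 bp read length is an assumption carried over from the RNA-seq example above; it also ignores what happens once the read runs off the end of the adapter itself.

```python
def adapter_bases_in_read(insert_length: int, read_length: int) -> int:
    """Return how many read bases fall beyond the insert, i.e. in the 3' adapter.

    Hypothetical helper: assumes the read starts at the first base of the
    insert and that the adapter follows the insert directly.
    """
    return max(0, read_length - insert_length)


READ_LENGTH = 100  # assumed read length, as in the RNA-seq example

# Insert sizes spanning the degraded bisulfite library (50-140 bp)
# and the size-selected library from the video (200-300 bp).
for insert_length in (50, 80, 100, 140, 200, 300):
    overhang = adapter_bases_in_read(insert_length, READ_LENGTH)
    print(f"insert {insert_length:>3} bp -> {overhang:>2} adapter bases in the read")
```

Any insert shorter than the read gives a non-zero value, which is why the short bisulfite fragments pick up adapter sequence while the 200-300 bp fragments do not.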

Next, I looked at this tutorial from ARK-Genomics, which I thought was really helpful. It talks about how FastQC guesses what the contaminants are, how it might get it wrong, and how you sometimes need to do a little bit of detective work to figure out what’s actually happening to your sample. A closer look at my duplicated sequences told me that there was an RNA RT primer in there, and that something funny was going on around it that I couldn’t quite understand.

When I say ‘something funny’, the problem with adapter contamination in general is as follows:

You have your sequence of interest, for example a string like “This is my data of interest.” However, around the edges of it, you have some other sequences you have to trim away. After a careful internet search, imagine that you conclude your sequence looks like this:

TreeCatThis is my data of interest.CatTree

So you conclude that based on the methods used, you should cut off instances of Cat to return your sequence of interest. However, then as a sanity check you write a script that parses your raw sequence looking for matches to all adapters potentially used in the experiment, and you come across:

FruitBatThis is my data of interest.FruitBat

You spend an embarrassing amount of time cursing Fruit Bat and searching more online forums, and then go and talk to the experimentalist who provided you with the list of adapters in the first place. And to cut a long story short, it turns out that the adapters used for small RNAs are different from general sequencing adapters (for one thing, the universal adapter isn’t used). And that’s why you should always talk to experimentalists straight away. And sanity check your data. Preferably in that order.
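For what it’s worth, the kind of sanity-check script described above can be very short. The following is only a sketch, not the script actually used here: the function names are made up, it only finds exact matches (so reads with sequencing errors in the adapter are missed), and the two adapter prefixes are the commonly quoted ones for standard Illumina and Illumina small RNA libraries, which you should check against the documentation for your own kit.

```python
from collections import Counter
from itertools import islice

# Assumed adapter prefixes; verify against your library prep kit.
CANDIDATE_ADAPTERS = {
    "standard Illumina adapter": "AGATCGGAAGAGC",
    "small RNA 3' adapter": "TGGAATTCTCGG",
}


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a plain 4-line FASTQ file."""
    with open(path) as handle:
        for line in islice(handle, 1, None, 4):
            yield line.strip()


def count_adapter_hits(reads, adapters):
    """Count how many reads contain an exact copy of each adapter."""
    hits = Counter()
    for read in reads:
        for name, adapter in adapters.items():
            if adapter in read:
                hits[name] += 1
    return hits


# Toy reads standing in for a real file; two of them run into the
# small RNA adapter, one has no adapter at all.
toy_reads = [
    "ACGTTGCAAGGCTAGCTTGGAATTCTCGGGTGCCAAGG",
    "GGCATTACGATCGATTGGAATTCTCGGGTGCC",
    "ACGATCGATCGGATTACAGCTAGCATCGA",
]
for name, count in count_adapter_hits(toy_reads, CANDIDATE_ADAPTERS).items():
    print(f"{name}: {count} of {len(toy_reads)} reads")
```

On real data you would pass `read_fastq_sequences("your_reads.fastq")` instead of the toy list. If the adapter you expected scores close to zero and a different one turns up in a large share of reads, that’s your Fruit Bat.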

Anyway, much as I appreciate the ability to regale biologists with witty adapter tales in pubs, I also thought that this would be useful to write about, in case anyone is encountering similar issues they can’t make sense of. In the end I made a diagram of how small RNAs are processed for sequencing – here it is in all its glory:

Illumina sequencing adapters for small RNAs

Also, my adapters are now successfully trimmed away (using cutadapt, if anyone’s wondering). Victory.
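For anyone wanting a starting point, here is a rough sketch of how a cutadapt run for a 3′ adapter could be scripted from Python. The adapter sequence, file names and minimum length below are placeholders rather than the settings used for this data, and the helper function is made up for illustration. The cutadapt options used are `-a` (3′ adapter), `-m` (discard reads shorter than this after trimming) and `-o` (output file).

```python
import subprocess


def build_cutadapt_command(adapter, input_fastq, output_fastq, min_length):
    """Build the argument list for a single-end 3' adapter trimming run."""
    return [
        "cutadapt",
        "-a", adapter,          # adapter expected at the 3' end of the read
        "-m", str(min_length),  # drop reads that end up shorter than this
        "-o", output_fastq,
        input_fastq,
    ]


def run_cutadapt(adapter, input_fastq, output_fastq, min_length=15):
    """Run cutadapt (which must be installed) and raise if it fails."""
    command = build_cutadapt_command(adapter, input_fastq, output_fastq, min_length)
    subprocess.run(command, check=True)


# Example with placeholder values; nothing is executed until run_cutadapt is called.
print(" ".join(build_cutadapt_command("TGGAATTCTCGG", "reads.fastq", "trimmed.fastq", 15)))
```

A minimum length matters more than usual for data like this, because reads from very short fragments can be almost entirely adapter and leave next to nothing once trimmed.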