DNA fragments that are shorter than the read length cause reads to extend into adapter sequence. Such reads can be a non-trivial portion of the total, especially with the longer reads from the GA II.
If a read extends into the adapter by only a few bases it may still align, but with mismatches and indels in the adapter region, which can then contribute to false positive alignments and incorrect SNP calls. As fragments get shorter and the amount of adapter increases, it becomes more likely that the read will fail to align.
Novoalign V2.05 and later include an option in licensed versions to detect short fragments and to strip the adapter sequence from them.
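As a rough illustration of the read-through effect described above, the number of adapter bases at the 3′ end of a read is the read length minus the fragment length, whenever that difference is positive. The helper name `adapter_bases` is hypothetical and is not part of Novoalign; this is only a sketch of the arithmetic.

```python
def adapter_bases(fragment_len: int, read_len: int) -> int:
    """Bases at the 3' end of a read that come from adapter rather than insert."""
    return max(0, read_len - fragment_len)


# 45bp reads, as in the examples on this page, against a few fragment lengths.
for fragment_len in (137, 45, 42, 30):
    print(fragment_len, adapter_bases(fragment_len, read_len=45))
```

A 42bp fragment sequenced with 45bp reads therefore carries 3 adapter bases at the end of each read, which may still align, while a 30bp fragment carries 15 and is far less likely to.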
Command Line Options
-a | Turns on adapter stripping using the default adapter sequences, “AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG” and “AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA”. These default adapter sequences have been seen on many libraries; however, we have also seen “AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG” on the end of read 1 of some libraries. You should always check that the correct adapter sequence has been specified. The -a option is also used for single-end adapter stripping; please refer to 3′ Adapter Stripping – Single End Reads. |
-a sequence1 sequence2 | Turns on paired-end adapter stripping using the specified adapter sequences. The first adapter sequence is the adapter on the 3′ end of fragments, which will be read as part of read 1 on short fragments. The second adapter sequence is the 5′ adapter, which will be read as part of read 2. The second adapter sequence is optional and is only required if the two adapters differ. It is not necessary to specify the full sequence; we recommend at least 12bp to ensure good specificity. A list of Illumina/Solexa adapter sequences can be found at Solexa Library Primer Sequences and also at Seqanswers.com. |
Example specifying adapters from the SeqAnswers document
novoalign ... -a "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG" "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA" ...
If you are in doubt about the adapters used, or want to check whether any reads contain adapter, you can use the following Linux shell command to print possible adapter sequences:
grep -E 'AGATCGGAAGAGC[ACGT]{10}' readfile_1.fastq | sed "s/.*\(AGATCGGAAGAGC.*\)/\1/" | less
Do this for both the read 1 and read 2 files and check whether the adapters match the defaults.
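When many reads contain adapter, a tally of the candidate sequences can be easier to read than a raw listing. The sketch below is one possible way to do that in Python. It assumes a FASTQ file with exactly four lines per record and only reports the 13-base common prefix plus the next 10 bases; the function name `tally_adapter_candidates` is hypothetical and is not part of Novoalign.

```python
import re
from collections import Counter

# Common start of the adapter sequences quoted on this page, plus 10 more bases.
PATTERN = re.compile("AGATCGGAAGAGC[ACGT]{10}")


def tally_adapter_candidates(lines, top=5):
    """Count candidate adapter sequences on the sequence lines of a FASTQ stream.

    Assumes four lines per record, so the sequence is line 2 of every record.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 1:
            continue
        match = PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts.most_common(top)


# Tiny in-memory demonstration with made-up reads; for real data, pass an open
# file instead, e.g. open("readfile_1.fastq").
demo = [
    "@r1", "ACGTACGTAGATCGGAAGAGCTCGTATGCCGTCTT", "+", "I" * 35,
    "@r2", "ACGTACGTACGTACGTACGTACGTACGTACGTACG", "+", "I" * 35,
]
print(tally_adapter_candidates(demo))
```

The most frequent entries can then be compared against the default adapter sequences before choosing what to pass to -a.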
Example 1
Without Adapter Stripping
# Paired Reads: 2500
# Pairs Aligned: 2033
# Read Sequences: 5000
# Aligned: 4318
# Unique Alignment: 3273
# Gapped Alignment: 118
# Quality Filter: 0
# Homopolymer Filter: 0
# Elapsed Time: 23,860s
# Fragment Length Distribution
# From To Count
# 42 47 20
# 48 53 37
# 54 59 40
# 60 65 44
...
With Adapter Stripping
# Paired Reads: 2500
# Pairs Aligned: 2112
# Read Sequences: 5000
# Aligned: 4458
# Unique Alignment: 3376
# Gapped Alignment: 112
# Quality Filter: 0
# Homopolymer Filter: 0
# Elapsed Time: 25,391s
# Fragment Length Distribution
# From To Count
# 24 29 1
# 30 35 14
# 36 41 27
# 42 47 36
# 48 53 37
# 54 59 40
# 60 65 44
...
Data sample from 45bp paired end Human cDNA, mean fragment length 137 with standard deviation of 45
Example 2
Without Adapter Stripping
# Paired Reads: 5000
# Pairs Aligned: 3788
# Read Sequences: 10000
# Aligned: 8194
# Unique Alignment: 6995
# Gapped Alignment: 42
# Quality Filter: 26
# Homopolymer Filter: 37
# Elapsed Time: 494,338s
# Fragment Length Distribution
# From To Count
# 42 48 75
# 49 55 154
# 56 62 187
...
With Adapter Stripping
# Paired Reads: 5000
# Pairs Aligned: 3906
# Read Sequences: 10000
# Aligned: 8398
# Unique Alignment: 7135
# Gapped Alignment: 34
# Quality Filter: 59
# Homopolymer Filter: 43
# Elapsed Time: 438,171s
# Fragment Length Distribution
# From To Count
# 28 34 14
# 35 41 48
# 42 48 94
# 49 55 153
# 56 62 186
# 63 69 243
...
Data sample from 45bp Bisulphite treated Human DNA, mean fragment length of 94bp and standard deviation of 30
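To put the two reports in perspective, the gain in aligned reads can be worked out directly from the "Read Sequences" and "Aligned" figures quoted above. The snippet below is only a worked calculation on those published numbers; the dictionary layout and the function name `aligned_gain` are illustrative choices.

```python
# Figures copied from the Example 1 and Example 2 reports above.
reports = {
    "Example 1": {"reads": 5000, "aligned_without": 4318, "aligned_with": 4458},
    "Example 2": {"reads": 10000, "aligned_without": 8194, "aligned_with": 8398},
}


def aligned_gain(report):
    """Percentage-point gain in aligned reads when adapter stripping is enabled."""
    extra = report["aligned_with"] - report["aligned_without"]
    return 100.0 * extra / report["reads"]


for name, report in reports.items():
    extra = report["aligned_with"] - report["aligned_without"]
    print(f"{name}: {extra} more reads aligned "
          f"({aligned_gain(report):.2f} percentage points)")
```

In both samples the additional alignments come with new entries in the shortest fragment length bins, which is where adapter read-through occurs.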
Without Adapter Stripping
@HWI-EAS261_4:1:1:968:1074/1 L GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGAAGA IIIIIIII$IIIIIIIIIIIIIIIIIII1III+9I%IIIIF>;;< U 80 71 >CD5-chr11_S_11_60625543_60652895 3369 F . . . 33A>C 43+A
@HWI-EAS261_4:1:1:968:1074/2 R TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAACAGA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII3IIIIIIIIII? NM
@HWI-EAS261_4:1:1:1037:1480/1 L GAAAAAGACCCTGGAAGCAGTTAGCAGAATAGTGTGATAATGAGA IIIIIIII&IIIIIIIIIIIIIIIIIIIIIIIEIII8I4II>C7> NM
@HWI-EAS261_4:1:1:1037:1480/2 R CATTATCACACTATTCTGCTAACTGCTTCCATGGTCTTTTTCCGA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@IIIIIIII':- U 58 99 >LAP3-chr4_I_2_4_17190620_17192520 1740 R . . . 1A>T 2G>C 3T>G
With Adapter Stripping
@HWI-EAS261_4:1:1:968:1074/1 L GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGA IIIIIIII$IIIIIIIIIIIIIIIIIII1III+9I%IIIIF> U 95 150 >CD5-chr11_S_11_60625543_60652895 3369 F . 3369 R 33A>C
@HWI-EAS261_4:1:1:968:1074/2 R TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII3IIIIIIII U 0 150 >CD5-chr11_S_11_60625543_60652895 3369 R . 3369 F
@HWI-EAS261_4:1:1:1037:1480/1 L GAAAAAGACCCTGGAAGCAGTTAGCAGAATAGTGTGATAATG IIIIIIII&IIIIIIIIIIIIIIIIIIIIIIIEIII8I4II> U 43 150 >LAP3-chr4_I_2_4_17190620_17192520 1743 F . 1743 R 11A>C
@HWI-EAS261_4:1:1:1037:1480/2 R CATTATCACACTATTCTGCTAACTGCTTCCATGGTCTTTTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@IIIIIIII U 0 150 >LAP3-chr4_I_2_4_17190620_17192520 1743 R . 1743 F
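Comparing the two listings, each read is 45 bases long without stripping and 42 bases long with it. The short check below, a sketch rather than anything Novoalign provides, takes the first read pair and confirms that the three removed bases are the bases with which the default adapter sequences begin. The helper name `removed_suffix` is hypothetical.

```python
# First default adapter from the -a option; both defaults start with the same bases.
ADAPTER = "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"

# (untrimmed, trimmed) sequences of read pair 968:1074 from the listings above.
reads = [
    ("GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGAAGA",
     "GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGA"),
    ("TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAACAGA",
     "TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAAC"),
]


def removed_suffix(untrimmed, trimmed):
    """Return the bases that were stripped from the 3' end of a read."""
    if not untrimmed.startswith(trimmed):
        raise ValueError("trimmed read is not a prefix of the untrimmed read")
    return untrimmed[len(trimmed):]


for untrimmed, trimmed in reads:
    suffix = removed_suffix(untrimmed, trimmed)
    # Prints "AGA True" for both reads of the pair.
    print(suffix, ADAPTER.startswith(suffix))
```

With those three bases removed, both reads of the pair align to the same location, whereas without stripping read 2 was reported as NM.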
Adapter Stripping Process
The reads are prefixed with adapter sequence and then aligned against each other using Needleman-Wunsch global alignment. The alignment allows mismatches and indels and uses quality-based scoring similar to that used when aligning reads against the reference, except that base qualities are now available for both sequences in the alignment rather than for just one. The highest scoring alignment is then taken and, if its identity exceeds 90%, it is used to determine the amount of adapter to be trimmed.
In order for adapter to be trimmed, both reads of a pair must align to the same amount of adapter sequence and to the reverse complement of each other. Because both conditions must hold, false positives are highly improbable.
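The idea can be illustrated with a much simpler stand-in. The sketch below is not Novoalign's implementation: it swaps the quality-scored Needleman-Wunsch alignment for an ungapped, position-by-position comparison, ignores base qualities, and cannot handle indels. For every candidate fragment length it checks that the start of read 1 matches the reverse complement of the start of read 2 and that the remainder of each read matches its adapter, then keeps the best candidate if its identity exceeds 90%. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_matches(a, b):
    """Matching positions and positions compared, over the shorter sequence."""
    return sum(x == y for x, y in zip(a, b)), min(len(a), len(b))


def infer_fragment_length(read1, read2, adapter1, adapter2, min_identity=0.9):
    """Ungapped sketch: fragment length if the pair reads into adapter, else None."""
    best_len, best_identity = None, 0.0
    # Only lengths that leave at least one adapter base on each read are tried.
    for frag_len in range(1, min(len(read1), len(read2))):
        parts = [
            (read1[:frag_len], revcomp(read2[:frag_len])),  # inserts must agree
            (read1[frag_len:], adapter1),                   # rest of read 1 vs adapter
            (read2[frag_len:], adapter2),                   # rest of read 2 vs adapter
        ]
        matches = compared = 0
        for a, b in parts:
            m, n = count_matches(a, b)
            matches += m
            compared += n
        identity = matches / compared
        if identity > best_identity:
            best_len, best_identity = frag_len, identity
    return best_len if best_identity > min_identity else None


# Untrimmed read pair 968:1074 from the example above, with the default adapters.
read1 = "GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGAAGA"
read2 = "TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAACAGA"
adapter1 = "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"
adapter2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA"
print(infer_fragment_length(read1, read2, adapter1, adapter2))
```

For this pair the sketch reports a fragment length of 42, matching the 42-base reads shown in the "With Adapter Stripping" output. Requiring the two inserts to agree as reverse complements, in addition to matching the adapter, is what keeps a chance match of a few adapter-like bases from triggering a trim.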