
Paired End Short Fragment Detection and Adapter Stripping

DNA fragments that are shorter than the read length cause reads to extend into adapter sequence. Such reads can make up a non-trivial proportion of a run, especially with the longer reads from the GA II.

 

If a read extends into the adapter by only a few bases it may still align, but with mismatches and indels in the adapter region, which can then contribute to false positive alignments and incorrect SNP calls. As fragments get shorter and the amount of adapter increases, it becomes more likely that the read will fail to align.

 

Novoalign V2.05 and later include an option in licensed versions to detect short fragments and to strip the adapter sequence from them.

 

Command Line Options

 

-a Turns on adapter stripping using the default adapter sequences, “AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG” and “AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA”. These default adapter sequences have been seen on many libraries; however, we have also seen “AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG” on the end of read 1 in some libraries. You should always check that the correct adapter sequence has been specified.
The -a option is also used for single-end adapter stripping; please refer to 3′ Adapter Stripping – Single End Reads.
-a sequence1 sequence2 Turns on paired-end adapter stripping using the specified adapter sequences. The first adapter sequence is the adapter on the 3′ end of fragments, which is read as part of read 1 on short fragments. The second adapter sequence is the 5′ adapter, which is read as part of read 2. The second adapter sequence is optional and is only required if the two adapters differ. The full sequence need not be specified; we recommend at least 12bp to ensure good specificity. A list of Illumina/Solexa adapter sequences can be found at Solexa Library Primer Sequences and also at Seqanswers.com.

 

Example specifying adapters from the SeqAnswers document

novoalign ... -a "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG" "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA" ...

If you are in doubt about the adapters used, or want to check whether any reads contain adapter, you can use the following Linux shell command to print possible adapter sequences:

grep -E 'AGATCGGAAGAGC[ACGT]{10}' readfile_1.fastq | sed "s/.*\(AGATCGGAAGAGC.*\)/\1/" | less

Do this for the read 1 and read 2 files and check whether the adapters match the defaults.
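
When many reads contain adapter, a tally of the candidates is easier to scan than the raw grep output. The following is a minimal Python sketch using the same pattern as the grep command above. The function name is illustrative, and it assumes standard 4-line FASTQ records.

```python
import re
from collections import Counter

# Same pattern as the grep command: the shared 13-base adapter start
# followed by the next 10 bases.
ADAPTER_PATTERN = re.compile(r"AGATCGGAAGAGC[ACGT]{10}")


def tally_adapter_candidates(fastq_lines):
    """Count each 23-base adapter candidate found in FASTQ sequence lines."""
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        # Assumes 4-line FASTQ records: the sequence is the second line.
        if index % 4 != 1:
            continue
        match = ADAPTER_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts
```

Calling tally_adapter_candidates(open("readfile_1.fastq")).most_common(10) lists the ten most frequent candidates, which can then be compared with the default adapter sequences.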

Example 1

Without Adapter Stripping
#       Paired Reads:     2500
#      Pairs Aligned:     2033
#     Read Sequences:     5000
#            Aligned:     4318
#   Unique Alignment:     3273
#   Gapped Alignment:      118
#     Quality Filter:        0
# Homopolymer Filter:        0
#       Elapsed Time: 23,860s
# Fragment Length Distribution
#	From	To	Count
#	42	47	20
#	48	53	37
#	54	59	40
#	60	65	44
...

With Adapter Stripping
#       Paired Reads:     2500
#      Pairs Aligned:     2112
#     Read Sequences:     5000
#            Aligned:     4458
#   Unique Alignment:     3376
#   Gapped Alignment:      112
#     Quality Filter:        0
# Homopolymer Filter:        0
#       Elapsed Time: 25,391s
# Fragment Length Distribution
#	From	To	Count
#	24	29	1
#	30	35	14
#	36	41	27
#	42	47	36
#	48	53	37
#	54	59	40
#	60	65	44
...
Data sample from 45bp paired end Human cDNA, mean fragment length 137 with standard deviation of 45

Example 2

Without Adapter Stripping
#       Paired Reads:     5000
#      Pairs Aligned:     3788
#     Read Sequences:    10000
#            Aligned:     8194
#   Unique Alignment:     6995
#   Gapped Alignment:       42
#     Quality Filter:       26
# Homopolymer Filter:       37
#       Elapsed Time: 494,338s
# Fragment Length Distribution
#	From	To	Count
#	42	48	75
#	49	55	154
#	56	62	187
...

With Adapter Stripping
#       Paired Reads:     5000
#      Pairs Aligned:     3906
#     Read Sequences:    10000
#            Aligned:     8398
#   Unique Alignment:     7135
#   Gapped Alignment:       34
#     Quality Filter:       59
# Homopolymer Filter:       43
#       Elapsed Time: 438,171s
# Fragment Length Distribution
#	From	To	Count
#	28	34	14
#	35	41	48
#	42	48	94
#	49	55	153
#	56	62	186
#	63	69	243
...
Data sample from 45bp Bisulphite treated Human DNA, mean fragment length of 94bp and standard deviation of 30
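
As a rough cross-check on how many pairs are affected, the fraction of fragments shorter than the read length can be estimated from the mean and standard deviation quoted for each data sample. This sketch assumes fragment lengths are normally distributed, which is our simplifying assumption rather than something measured here; the function name is illustrative.

```python
import math


def fraction_shorter_than(read_length, mean, sd):
    """Normal-CDF estimate of the fraction of fragments below read_length."""
    z = (read_length - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Figures quoted for the two data samples (45bp reads).
example1 = fraction_shorter_than(45, mean=137, sd=45)
example2 = fraction_shorter_than(45, mean=94, sd=30)
print(f"Example 1: {example1:.1%}  Example 2: {example2:.1%}")
```

Under that assumption about 2% of fragments in Example 1 and about 5% in Example 2 are shorter than the 45bp read length.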

 

Without Adapter Stripping

@HWI-EAS261_4:1:1:968:1074/1	L	GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGAAGA	IIIIIIII$IIIIIIIIIIIIIIIIIII1III+9I%IIIIF>;;<	U	80	71	>CD5-chr11_S_11_60625543_60652895	3369	F	.	.	.	33A>C 43+A
@HWI-EAS261_4:1:1:968:1074/2	R	TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAACAGA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII3IIIIIIIIII?	NM
@HWI-EAS261_4:1:1:1037:1480/1	L	GAAAAAGACCCTGGAAGCAGTTAGCAGAATAGTGTGATAATGAGA	IIIIIIII&IIIIIIIIIIIIIIIIIIIIIIIEIII8I4II>C7>	NM
@HWI-EAS261_4:1:1:1037:1480/2	R	CATTATCACACTATTCTGCTAACTGCTTCCATGGTCTTTTTCCGA	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@IIIIIIII':-	U	58	99	>LAP3-chr4_I_2_4_17190620_17192520	1740	R	.	.	.	1A>T 2G>C 3T>G

With Adapter Stripping

@HWI-EAS261_4:1:1:968:1074/1	L	GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGA	IIIIIIII$IIIIIIIIIIIIIIIIIII1III+9I%IIIIF>	U	95	150	>CD5-chr11_S_11_60625543_60652895	3369	F	.	3369	R	33A>C
@HWI-EAS261_4:1:1:968:1074/2	R	TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAAC	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII3IIIIIIII	U	0	150	>CD5-chr11_S_11_60625543_60652895	3369	R	.	3369	F
@HWI-EAS261_4:1:1:1037:1480/1	L	GAAAAAGACCCTGGAAGCAGTTAGCAGAATAGTGTGATAATG	IIIIIIII&IIIIIIIIIIIIIIIIIIIIIIIEIII8I4II>	U	43	150	>LAP3-chr4_I_2_4_17190620_17192520	1743	F	.	1743	R	11A>C
@HWI-EAS261_4:1:1:1037:1480/2	R	CATTATCACACTATTCTGCTAACTGCTTCCATGGTCTTTTTC	IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@IIIIIIII	U	0	150	>LAP3-chr4_I_2_4_17190620_17192520	1743	R	.	1743	F

Adapter Stripping Process

The reads are prefixed with adapter sequence and then aligned against each other using Needleman-Wunsch global alignment. The alignment allows mismatches and indels and uses quality-based scoring similar to that used when aligning reads against the reference; the difference is that base qualities are now available for both sequences in the alignment rather than just one. The highest scoring alignment is then taken, and if its identity exceeds 90% it is used to determine the amount of adapter to be trimmed.


For adapter to be trimmed, the two reads of a pair must both align to the same amount of adapter sequence and to the reverse complement of each other. Because both conditions must hold, false positives are highly improbable.
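
The two conditions can be illustrated with a much simplified Python sketch. It is not Novoalign's implementation: the quality-aware Needleman-Wunsch alignment is replaced by an ungapped, position-by-position comparison that ignores base qualities and indels. The function names and the min_fragment parameter are hypothetical, and the adapters are the default sequences listed under Command Line Options.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_matches(a, b):
    """Positions at which two sequences agree, up to the shorter length."""
    return sum(x == y for x, y in zip(a, b))


def strip_adapters(read1, read2, adapter1, adapter2,
                   min_fragment=10, min_identity=0.9):
    """Return (trimmed_read1, trimmed_read2) for a short fragment, else None.

    Simplified stand-in: ungapped comparison, no base qualities.
    """
    n = min(len(read1), len(read2))
    best = None  # (identity, fragment_length)
    for frag_len in range(min_fragment, n):
        tail1, tail2 = read1[frag_len:], read2[frag_len:]
        # Fragment part: read 1 should be the reverse complement of read 2.
        matches = count_matches(read1[:frag_len],
                                reverse_complement(read2[:frag_len]))
        # Adapter part: both tails should match the start of their adapter,
        # so both reads carry the same amount of adapter.
        matches += count_matches(tail1, adapter1)
        matches += count_matches(tail2, adapter2)
        compared = (frag_len + min(len(tail1), len(adapter1))
                    + min(len(tail2), len(adapter2)))
        identity = matches / compared
        if best is None or identity > best[0]:
            best = (identity, frag_len)
    # Trim only if the best candidate exceeds the identity threshold.
    if best is None or best[0] <= min_identity:
        return None
    return read1[:best[1]], read2[:best[1]]


ADAPTER1 = "AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"
ADAPTER2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA"
read1 = "GTTTCAGTGCATCACAGTTCATCTTCTAACCCCAGAGTCAGAAGA"
read2 = "TCTGACTCTTGGGTTAGAAGATGAACTGTGATGGACTGAAACAGA"
print(strip_adapters(read1, read2, ADAPTER1, ADAPTER2))
```

For the first read pair in the listings above, the best candidate is a 42bp fragment: the two reads agree at 40 of 42 overlapping positions and each ends in the three adapter bases AGA, so both reads are trimmed to the 42 bases shown in the With Adapter Stripping listing.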
