NovoalignCS is our aligner for ABI SOLiD colour space reads. Its operation is similar to standard Novoalign, and most Novoalign command line options work in a similar fashion. The major difference is that the current version of NovoalignCS does not support adapter trimming, miRNA mode, or bisulphite mode.
Novoindex
You need to build a colour space index for colour space reads. This index uses a hash table with colour space seeds rather than nucleotide seeds.
To construct a colour space index, add the -c option to the novoindex command, as in:
novoindex -c genome.ncx *.fa
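To illustrate what a colour space seed is, the sketch below converts a nucleotide sequence to colours using the standard SOLiD di-base code, in which each colour encodes the transition between two adjacent bases. It is only an illustration of the encoding, not the novoindex implementation; the function names and the seed length `k` are hypothetical.

```python
# Standard SOLiD di-base code: with A=0, C=1, G=2, T=3 the colour of a
# pair of adjacent bases is the XOR of their two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def to_colour_space(seq):
    """Return the colour string for the transitions between adjacent bases."""
    codes = [BASE_CODE[b] for b in seq.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def from_colour_space(first_base, colours):
    """Decode colours back to bases, starting from a known first base."""
    code = BASE_CODE[first_base.upper()]
    bases = [CODE_BASE[code]]
    for c in colours:
        code ^= int(c)
        bases.append(CODE_BASE[code])
    return "".join(bases)


def colour_seeds(seq, k):
    """Yield every overlapping colour space k-mer of a nucleotide sequence.

    k is a hypothetical seed length chosen by the caller.
    """
    colours = to_colour_space(seq)
    for i in range(len(colours) - k + 1):
        yield colours[i:i + k]
```

Because every colour depends on two neighbouring bases, a single colour error changes all decoded bases downstream of it. Seeding and aligning directly in colour space avoids that problem.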
NovoalignCS
NovoalignCS command line options are generally the same as for Novoalign. Commonly used options are:
Option | Description |
-d dbname | Full pathname of the indexed reference sequence built with novoindex -c. |
-f seqfile1 [seqfile2] | NovoalignCS accepts ABI SOLiD *.csfasta files with _QV.qual quality files, or .csfastq files. |
-t 99 | Sets the threshold, i.e. the highest alignment score acceptable for the best alignment. By default a threshold is calculated from read length and genome size such that an alignment to a non-repeat should have a quality higher than 30. |
-s 1 | If a read is unaligned, shorten it by 1 base and try again. This is useful for aligning short RNA reads. Suggested parameters for short RNA against Human are: novoalignCS -d …. -s 1 -l 14 -t 40 -f ….. |
-p 99,99 99,99 | Sets thresholds for the polyclonal filter. This filter is designed to remove reads that may come from polyclonal clusters or beads. Please refer to paper: The first pair of values (n,t) sets the number of bases and the quality threshold for the first 20 base pairs of each read. If there are n or more bases with phred quality below t, the read is flagged as polyclonal and will not be aligned. The alignment status is ‘QC’. The second pair applies to the entire read rather than just the first 20bp and is specified as the fraction of bases below a base quality. Setting -p -1 disables the filter. Default is -p 7,10 0.3,10, i.e. 7 of the first 20bp below Q10 or 30% of all bases below Q10. |
-o format [readgroup] | Specifies the report format: Native, SAM or Pairwise. Default is Native. E.g. novoalign -o SAM |
-i 99,99 | Sets approximate fragment length and standard deviation. Defaults to mate pair mode with a mean fragment length of 2500bp and a standard deviation of 500. |
-i PE 99,99 | Sets paired end mode, mean fragment length and standard deviation. |
-k | Enables quality calibration. This is worth trying! |
-K file | Colour error counts are written to the named file after all reads are processed. This file is useful for charting colour errors by base position in the read. |
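The polyclonal filter rule described for the -p option can be sketched as follows. This is a minimal illustration of the documented rule, not the NovoalignCS implementation. The function name is hypothetical, qualities are assumed to be already decoded to integer phred values, and the whole-read test is assumed to trigger at "fraction or more".

```python
def is_polyclonal(quals, first=(7, 10), whole=(0.3, 10), window=20):
    """Return True if a read would be flagged by the polyclonal filter.

    quals  : list of integer phred qualities for one read
    first  : (n, t)  flag if n or more of the first `window` qualities are < t
    whole  : (f, t)  flag if a fraction f or more of all qualities are < t
             (the "or more" for the fraction is an assumption)
    Defaults mirror the documented default -p 7,10 0.3,10.
    """
    n, t_first = first
    frac, t_whole = whole

    # Test 1: count low qualities in the first 20 positions.
    low_start = sum(1 for q in quals[:window] if q < t_first)
    if low_start >= n:
        return True

    # Test 2: fraction of low qualities over the entire read.
    if not quals:
        return False
    low_all = sum(1 for q in quals if q < t_whole)
    return low_all / len(quals) >= frac
```

A read with seven Q5 values among its first 20 positions is flagged by the first test, even if the remaining qualities are high.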
File Formats
CSFASTA
NovoalignCS supports ABI SOLiD csfasta and qual input files with no user preprocessing required.
- Polyclonal filter (-p option) used to detect and stop alignment of reads with excessive low quality bases.
- In paired end mode Csfasta header is used to identify pairs and match reads, allowing mixed single and paired reads.
- If a csfasta file is specified as input NovoalignCS will look in the same folder for a quality file by replacing the .csfasta file extension with _QV.qual. If a quality file is not found a quality of 20 (1% colour error rate) is assumed for all bases.
>2_14_26_F3 |
*_QV.qual >2_14_26_F3 |
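The quality file lookup described above (replace the .csfasta extension with _QV.qual in the same folder, otherwise assume Q20) can be sketched like this. The function names are hypothetical and the code only mirrors the documented naming rule. It also shows why Q20 corresponds to a 1% error rate.

```python
from pathlib import Path

DEFAULT_QUALITY = 20  # assumed for every base when no quality file is found


def qual_path_for(csfasta_path):
    """Swap the .csfasta extension for _QV.qual, staying in the same folder."""
    p = Path(csfasta_path)
    if p.suffix != ".csfasta":
        raise ValueError(f"not a .csfasta file: {p}")
    return p.with_name(p.stem + "_QV.qual")


def find_quality_file(csfasta_path):
    """Return the matching quality file, or None if it does not exist."""
    candidate = qual_path_for(csfasta_path)
    return candidate if candidate.is_file() else None


def phred_to_error(q):
    """Convert a phred quality to an error probability: 10^(-q/10)."""
    return 10 ** (-q / 10)
```

For example, `reads_F3.csfasta` maps to `reads_F3_QV.qual`, and `phred_to_error(DEFAULT_QUALITY)` evaluates to 0.01.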
Colour Space FASTQ
There are two variations of colour space fastq files being used by other aligners.
- BWA uses a csfastq format that includes a quality value for the primer base. This is typically coded as a ‘!’ and is not used in alignment scoring.
- BFAST has a fastq format that is similar to the BWA format except that it does not have a quality for the primer base, so the quality line is one character shorter than the read line. NovoalignCS does not support paired end reads in a single BFAST fastq file; it requires two files for paired end.
Novoalign supports both formats and can automatically distinguish the two types based on the length of the first quality string. You can also specify the format on the command line using -F BFASTQ or -F CSFASTQ.
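The length-based distinction between the two formats can be sketched as below. This is an illustration of the rule as described, not Novoalign's own code. The function name and the returned labels are hypothetical.

```python
def detect_csfastq_flavour(read_line, qual_line):
    """Classify a colour space FASTQ record as BWA or BFAST style.

    read_line : the sequence line, primer base followed by colours
    qual_line : the quality line of the same record
    """
    read = read_line.strip()
    qual = qual_line.rstrip("\r\n")

    if len(qual) == len(read):
        return "BWA"    # quality line includes a value for the primer base
    if len(qual) == len(read) - 1:
        return "BFAST"  # no quality for the primer base
    raise ValueError(
        f"quality length {len(qual)} does not fit read length {len(read)}"
    )
```

Only the first record needs to be inspected, since a file is expected to use one style throughout.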
Colour space qualities are encoded as phred quality plus ASCII ‘!’, i.e. the character code is the phred quality plus 33.
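As a small worked example of this encoding, the following sketch converts between quality characters and integer phred values. The function names are hypothetical.

```python
PHRED_OFFSET = ord("!")  # 33


def decode_qualities(qual_string):
    """Convert a quality string to a list of integer phred values."""
    return [ord(c) - PHRED_OFFSET for c in qual_string]


def encode_qualities(quals):
    """Convert integer phred values back to a quality string."""
    return "".join(chr(q + PHRED_OFFSET) for q in quals)
```

With this encoding the character ‘!’ decodes to quality 0, which is why BWA style files use it as the placeholder quality for the primer base.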
BWA Type CSFASTQ
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
T32322133300002330031001022230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
!21(()+#'+#40*.##**)$#$*$###$###############(+####'
@SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
T01212120333223322020022322232232232222022232033230
+SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
!,*+*()+*(#'+)###$#+$##'####################'+#####
BFAST Type CSFASTQ
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
T32322133300002330031001022230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
21(()+#'+#40*.##**)$#$*$###$###############(+####'
@SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
T01212120333223322020022322232232232222022232033230
+SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
,*+*()+*(#'+)###$#+$##'####################'+#####
Two files can be specified for paired end mode. In this case Novoalign parses the header records looking for a header in standard ABI format (e.g. >2_14_26_F3). If found, headers from the two files are assumed to be in order and are matched to identify paired reads. Reads that exist in only one file are aligned in single end mode.
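The header matching described above can be sketched as a merge over two ordered header lists. This is a minimal illustration under stated assumptions, not Novoalign's implementation. It assumes that the three numeric fields of the ABI header identify the bead and that "in order" means ascending numeric order of those fields. The function names are hypothetical.

```python
import re

# ABI style header such as ">2_14_26_F3": three numeric fields plus a tag.
ABI_HEADER = re.compile(r"^>?(\d+)_(\d+)_(\d+)_(\S+)$")


def bead_key(header):
    """Return the numeric fields of an ABI header as a sortable tuple."""
    m = ABI_HEADER.match(header.strip())
    if m is None:
        raise ValueError(f"not a standard ABI header: {header!r}")
    return tuple(int(x) for x in m.group(1, 2, 3))


def match_pairs(headers1, headers2):
    """Yield (h1, h2) tuples; a missing mate is reported as None.

    Assumes both lists are sorted by bead_key (an assumption made here).
    """
    i = j = 0
    while i < len(headers1) and j < len(headers2):
        k1, k2 = bead_key(headers1[i]), bead_key(headers2[j])
        if k1 == k2:
            yield headers1[i], headers2[j]   # proper pair
            i += 1
            j += 1
        elif k1 < k2:
            yield headers1[i], None          # read only in file 1
            i += 1
        else:
            yield None, headers2[j]          # read only in file 2
            j += 1
    for h in headers1[i:]:
        yield h, None
    for h in headers2[j:]:
        yield None, h
```

Tuples with a `None` member correspond to reads that would be aligned in single end mode.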
Report Formats
SAM Format
SAM format follows SAM specifications including colour space specific tags.
CS:Z: | Original Colour space read |
CQ:Z: | Colour qualities |
CM:i: | Number of colour mismatches |
Example:
@HD VN:1.0 SO:unsorted
@PG ID:novoalignCS VN:V1.00.11 CL:novoalignCS -d ../ecoli.ncx -f /export/home/zayed/service_projects/solid_ecoli_test//Rosalind_20080729_2_Chris5_F3.csfasta /export/home/zayed/service_projects/solid_ecoli_test/Rosalind_20080729_2_Chris5_R3.csfasta -r R -oSAM
@SQ SN:NC_004431 AS:ecoli.nix LN:5231428
469_29_17_F3 16 NC_004431 3712099 150 1S49M * 0 0 TTCGGTACCAGCAATAGACAGCGTTGCACGATCGGCGTAGTTAACGGCGG #27;JJ$">L9""HP6)9STI=8JYIHG6=MTRWQROHWULTSTLJ[]UI PG:Z:novoalignCS AS:i:21 UQ:i:21 NM:i:0 MD:Z:49 CS:Z:T20330310301231330323231131013321122333132121310320 CQ:Z:!46@?=.?6>76@81?4>:9<2-,=->+46?5&&2?+.'3:%0565'1# CM:i:2
469_29_25_F3 16 NC_004431 404723 150 50M * 0 0 TGGATGCTCTTAGCCGTTTGTTGATGCTTAACGCTCCACAAGGAACGATC :NABN#!%>LWT@G2"9OU""BTPTSQZMJYYTUVW[\KMZ`VWPLR PG:Z:novoalignCS AS:i:41 UQ:i:41 NM:i:0 MD:Z:50 CS:Z:T12323102020111022331030231321111000303230200313201 CQ:Z:!7<1@8?B<?/=?>>>:=9<>?<6>7:;(<>88#+@)9<<1.'587-69 CM:i:4
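The colour space tags can be pulled out of an alignment record with a few lines of code. The sketch below assumes a tab-delimited SAM record with optional fields starting at column 12, as in the SAM specification. The function name is hypothetical.

```python
def colour_tags(sam_line):
    """Return the CS, CQ and CM tags of one SAM alignment record.

    Missing tags are reported as None. CM is converted to int.
    """
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:            # optional fields follow column 11
        # Split at most twice: quality values may themselves contain ':'.
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return {name: tags.get(name) for name in ("CS", "CQ", "CM")}
```

Header lines starting with @ are not alignment records and should be skipped before calling this function.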
Native Format
The colour space read and quality values are inserted in the report line before the nucleotide sequence and qualities. All other fields are as per the Novoalign documentation.
Example:
@7_418_678 L T22010302130021000210112203201031000 78&:6<47=1>71>5=&%<)776)&::&,5(15/* CCCCAAAGTCGCTCACCATCCCAGGGCAGGCCAAG .AM%!!!!G!!5GVWH!!"J\XQ^XWTY[YGEX U 59 150 >Chr6 159937789 R . 159937674 F
@7_418_678 R T30322221320200131003000212300010020201000200000020 967,356)63<442?:41.<;<$+>9)72;45-'086$6(.5&-)540&. AATCTCTGCTTCCCATGGGCCCTCAGCCCCAAACCTTGGGGAGTGGGTGG XVKGQTGGRXYQOZbWNHS`!!4Q`JHRVXR*!?QW@@F>!!!!F"!!;3 U 79 150 >Chr6 159937674 F . 159937789 R
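A sketch of how the leading fields of a native format line could be separated is shown below. It assumes the report is tab-delimited and names only the first six fields, in the order seen in the example above. The remaining fields are left untouched because their meaning is defined in the Novoalign documentation. The type and function names are hypothetical.

```python
from collections import namedtuple

NativeCS = namedtuple(
    "NativeCS",
    "header read_type cs_read cs_quals nt_read nt_quals rest",
)


def parse_native_cs(line):
    """Split one native format report line of a colour space read.

    Assumes tab-delimited fields; `rest` holds everything after the
    nucleotide qualities as an uninterpreted list of strings.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-delimited fields")
    return NativeCS(*fields[:6], rest=fields[6:])
```

Keeping the trailing fields as an uninterpreted list avoids guessing at columns that are not described here.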