Novoindex – Indexing the Reference Genome
Novoalign aligns reads against a reference genome using an indexed copy of the reference genome. Novoindex is the program that creates the index structure.
In its simplest form, Novoindex takes two parameters: the name to be used for the index file and the name of the reference sequence to be indexed:
novoindex ssuis.nix ssuis.fa
This will create an index structure for the reference sequence ssuis.fa which can then be used by Novoalign.
The basic index structure created using the default options is sufficient in most cases; however, there are times when you may wish to change the defaults.
Command line format
novoindex [options] indexfile reference_file_list
Command Line Options
Option | Description |
-b | Turns on Bisulphite indexing mode. See discussion on Bisulphite treated reads. |
-c | Builds a colour space index for use with NovoalignCS. |
-k | Sets the k-mer size to be used for the index. The default value is usually suitable unless you need to reduce the memory used by the index. k can range from 11 to 15 for normal indexes and from 11 to 19 for Bisulphite indexes. |
-s | Sets the indexing step size. If s=2 only every second k-mer is indexed, with s=3 only every 3rd k-mer is indexed and so on. Higher values of s reduce the memory required for the index. Acceptable values for s are in range 1 to 5. |
The default values for k and s are chosen automatically based on the size of the reference genome and the amount of RAM available on the server that Novoindex is running on. The default settings will use around 17 Gbyte of RAM to index the Human Genome if that much is available on the server. This minimises run time; however, there are times when you might prefer to use less memory for the index.
Effect of K & S on Memory & Run Time
An index comprises three tables:
- A k-mer hash table of size 4*4^k bytes
- A sequence offset table of size 4N/s bytes where N is the length of the sequences being indexed and s is the step size.
- A compressed sequence file of size N/2 bytes. Note that Novoindex uses 4 bits per base for the sequence in order to retain IUB ambiguity codes in the reference sequence.
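The three table sizes above can be added up to get a rough idea of how much memory a given choice of k and s will need. The following Python sketch is an illustration only: the function name estimate_index_bytes is hypothetical and not part of Novoindex, it applies the formulas exactly as stated above for a normal (non-Bisulphite) index, and it ignores any file headers or other overhead the real index may carry.

```python
def estimate_index_bytes(n_bases: int, k: int, s: int) -> dict:
    """Rough size of each index table, using the formulas in the list above.

    n_bases -- N, the total length of the sequences being indexed
    k       -- k-mer size
    s       -- indexing step size
    """
    hash_table = 4 * 4 ** k        # k-mer hash table: 4*4^k bytes
    offsets = 4 * n_bases // s     # sequence offset table: 4N/s bytes
    sequence = n_bases // 2        # 4 bits per base: N/2 bytes
    return {
        "hash_table": hash_table,
        "offsets": offsets,
        "sequence": sequence,
        "total": hash_table + offsets + sequence,
    }


if __name__ == "__main__":
    # Example: a 150 Mbp reference indexed with k=12 and s=2.
    sizes = estimate_index_bytes(150_000_000, k=12, s=2)
    for table, size in sizes.items():
        print(f"{table:>10}: {size / 1e6:8.1f} Mbyte")
```

For a large genome the offset table dominates, which is why raising s is the more effective way to shrink the index, while lowering k mainly helps for small references.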
The following chart plots run time and memory used for aligning 72bp single end reads against the full human genome at various k and s settings. The 13.2 index (k=13, s=2), at just over 7 Gbyte, gives the best performance. The default settings would have built a 14.2 index.
(Test system 8-core Intel system with 48G RAM)
This chart is similar to the one above; however, only the 150Mbp chrX was indexed. In this case a 12.2 index gives the best performance. The default settings would have chosen a 12.1 index.
(Test system Dual Core AMD with 8Gbyte RAM)
While the k and s settings affect run time and memory, they do not affect the specificity or sensitivity of Novoalign.
Memory Mapping
During alignment with Novoalign the index is loaded into a shared memory segment. This allows the index to be shared between multiple copies of Novoalign. So if you have 8 lanes of reads to align and run 8 Novoalign jobs at the same time, only one copy of the index is loaded into memory.