NovoalignMPI is an implementation of novoalign that aligns reads across a cluster of computers connected on the same network. Its advantage over running novoalign on a single server is that computational resources from additional servers can be added to the alignment job.
A working installation of MPICH2 version 1.3.1 or higher is required on all servers. Passwordless SSH authentication is also required for the user on the cluster. Search for “ssh without password” for more information on how to set up SSH authentication keys between servers.
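Once keys are in place (typically created with `ssh-keygen` and installed on each server with `ssh-copy-id`), a non-interactive login attempt is a quick way to confirm that no password prompt remains. The following is a minimal sketch, not part of MPICH2 or novoalignMPI: the function name `check_ssh` and the `SSH_CMD` override variable are illustrative assumptions, and the host names are the example servers used later in this guide.

```shell
#!/bin/bash
# Sketch: verify passwordless SSH to each host given as an argument.
# SSH_CMD is a hypothetical override hook so the loop can be exercised
# without a real cluster; it defaults to the real ssh client.

check_ssh() {
    local host status=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n keeps ssh from reading standard input.
        if "${SSH_CMD:-ssh}" -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host: ok"
        else
            echo "$host: passwordless login FAILED"
            status=1
        fi
    done
    return "$status"
}

# Usage on the example cluster (not run automatically):
#   check_ssh hpc1 hpc2 nfs1
```

The function returns a non-zero status if any host fails, so it can be used as a guard before launching a long alignment run.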
3. Disk Setup
All machines in the cluster need access to a shared filesystem, for example an NFS mount or another disk that can be accessed from every machine.
4. Cluster Setup
Once MPICH2 is installed on your cluster, the first thing to test is that you can run a basic `mpiexec` command. In this example we have three machines on our network on which we would like to run novoalignMPI. These servers are named hpc1, hpc2 and nfs1 on the network. We create a hosts.txt file with the names of these servers, as below:
#hosts.txt
hpc1
hpc2
nfs1
With MPICH2 installed we should be able to run:
mpiexec -n 3 -f hosts.txt uptime
And get the following result:
$ mpiexec -n 3 -f hosts.txt uptime
 13:05:20 up 13 days, 2:30, 2 users, load average: 1.94, 2.53, 2.12
 13:05:12 up 6 days, 15:29, 0 users, load average: 0.41, 0.98, 0.79
 13:00:36 up 13 days, 2:24, 1 user, load average: 0.65, 1.14, 1.02
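The `uptime` output does not name the machine that produced each line, so running `hostname` through `mpiexec` in the same way can help confirm that every server in hosts.txt actually took part. The helper below is a small sketch under that assumption; `count_per_host` is an illustrative name, not an MPICH2 tool.

```shell
#!/bin/bash
# Sketch: summarise which machines answered an "mpiexec ... hostname" run.

# Read host names on standard input (one per line) and print
# "<number of processes> <host>" for each distinct machine.
count_per_host() {
    sort | uniq -c | awk '{print $1, $2}'
}

# Usage on a working cluster (not run automatically):
#   mpiexec -n 3 -f hosts.txt hostname | count_per_host
```

With the three example servers and `-n 3`, one would expect each host to appear once; a missing host suggests a problem with hosts.txt or with SSH access to that server.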
5. Input Data
We are going to map small RNA reads to the Arabidopsis TAIR9 genome build.
Our Arabidopsis novoindex is located at /share/databases/tair9.nix and was built from the TAIR9 whole-genome FASTA sequences. We pass this index to novoalignMPI with the “-d” command-line argument. See the Basic short read mapping page for more information on how to build a novoindex from a FASTA-formatted genome.
5.2 Short Reads
The short reads were downloaded from the NCBI Short Read Archive. We chose 4 publicly available samples from Groszmann et al., 2011.
These files are:
SRR070689.fastq.gz
SRR070690.fastq.gz
SRR070691.fastq.gz
SRR070692.fastq.gz
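Before starting a cluster run it can be worth confirming that each downloaded archive is an intact gzip file, since a truncated download would otherwise only surface partway through alignment. This is a minimal sketch using the standard `gzip -t` integrity test; the function name `verify_gzip` is an illustrative assumption.

```shell
#!/bin/bash
# Sketch: confirm each downloaded FASTQ archive is an intact gzip file.

# Print "ok" or "CORRUPT" for every file given; return non-zero if any is bad.
verify_gzip() {
    local f status=0
    for f in "$@"; do
        # gzip -t decompresses to nowhere and only reports integrity.
        if gzip -t "$f" 2>/dev/null; then
            echo "$f: ok"
        else
            echo "$f: CORRUPT"
            status=1
        fi
    done
    return "$status"
}

# Usage (not run automatically):
#   verify_gzip SRR070689.fastq.gz SRR070690.fastq.gz SRR070691.fastq.gz SRR070692.fastq.gz
```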
6. Running novoalignMPI on the cluster
We run novoalignMPI as follows:
mpiexec -np N+1 -f hosts.txt $PATH/novoalignMPI -d /share/databases/tair9.nix -f $file -o SAM 2> stderr.txt > $file.sam < /dev/null
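Here N is the number of hosts listed in “hosts.txt”, so N+1 must be written out as an actual number, and $PATH stands for the directory that contains the novoalignMPI binary.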
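The value for “-np” can also be derived from hosts.txt rather than counted by hand. The sketch below assumes hosts.txt holds one host per line and that lines starting with “#” are labels or comments, as in the example file above; `np_for_hosts` is an illustrative helper name.

```shell
#!/bin/bash
# Sketch: derive the -np value (N+1) from the number of hosts in a hosts file.

# Count the lines that are neither blank nor '#' comments, then add one.
np_for_hosts() {
    local n
    n=$(grep -c -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1")
    echo $((n + 1))
}

# Usage (not run automatically):
#   mpiexec -np "$(np_for_hosts hosts.txt)" -f hosts.txt ...
```

For the three-server example file this yields 4, which matches the “-np 4” used in the commands in this section.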
Our final BASH script to process all the short reads into SAM format is therefore:
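#!/bin/bash
# Align every gzipped FASTQ file in the current directory, one cluster run per file.
for file in *.fastq.gz; do
    # Strip the .fastq.gz suffix to name the output files.
    base=$(basename "$file" .fastq.gz)
    mpiexec -np 4 -f hosts.txt $PATH/novoalignMPI -d /share/databases/tair9.nix -f "$file" -r All -a -o SAM 2> "$base.error" < /dev/null > "$base.sam"
done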
This creates our output alignments in *.sam files.
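A quick sanity check on the results is to count the records in each SAM file. In the SAM format, header lines start with “@” and every other line is one read record (which can include unmapped reads). The following is a sketch of such a check; `count_sam_records` is an illustrative name and not a novoalign utility.

```shell
#!/bin/bash
# Sketch: report the number of read records in each SAM file.

# SAM header lines start with '@'; every other line is one read record.
# Prints "<file><TAB><record count>" for each file given.
count_sam_records() {
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c -v '^@' "$f")"
    done
}

# Usage (not run automatically):
#   count_sam_records *.sam
```

A count of zero for a sample is a prompt to inspect the matching .error file written by the script.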
You can set the “-np” option based on the number of compute nodes you would like to use. novoalignMPI uses all the CPU cores on each node for alignment. To limit the number of CPU cores, add “-c $CORES” immediately after novoalignMPI on the command line, where $CORES is the maximum number of CPU cores to use, e.g.:
#Limit novoalignMPI to using only 2 CPUs on each node
mpiexec -np 4 -f hosts.txt $PATH/novoalignMPI -c 2 -d /share/databases/tair9.nix -f $file -o SAM 2> stderr.txt > $file.sam < /dev/null