Mapping Reads with Cluster-Aware NovoAlign MPI

1. Introduction

NovoalignMPI is an MPI-based version of novoalign that aligns reads across a cluster of computers connected on the same network. Its advantage over running novoalign on a single server is that you can add computational resources from other servers.

2. Software Prerequisites

A working installation of MPICH2 version 1.3.1 or higher is required on all servers. Passwordless SSH authentication is also required for the user running jobs on the cluster. Search for “ssh without password” for more information on how to set up SSH authentication keys between servers.
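One way to confirm that passwordless SSH works before involving MPI is to attempt a non-interactive login to each server. The following is a minimal sketch, assuming a Bash shell and an OpenSSH client; `check_ssh` is a hypothetical helper name and the host names in the example are placeholders for your own servers.

```shell
#!/bin/bash
# check_ssh HOST...: try a non-interactive login to each host.
# Prints "ok HOST" or "FAILED HOST"; returns non-zero if any host failed.
# (check_ssh is an illustrative helper, not part of MPICH2 or novoalign.)
check_ssh() {
    local host status=0
    for host in "$@"; do
        # BatchMode=yes disables password prompts, so a missing key fails
        # immediately instead of waiting for input.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok $host"
        else
            echo "FAILED $host"
            status=1
        fi
    done
    return "$status"
}

# Example (placeholder host names):
# check_ssh hpc1 hpc2 nfs1
```

Any host reported as FAILED still needs its SSH keys set up before `mpiexec` can start processes on it.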

3. Disk Setup

All machines in the cluster need access to a shared filesystem, for example an NFS mount or some other disk that can be accessed from all machines.

4. Cluster Setup

Once MPICH2 is installed on your cluster, the first thing to test is that a basic `mpiexec` command works. In this example we have three machines on our network on which we would like to run novoalignMPI. These servers are named hpc1, hpc2 and nfs1. We create a hosts.txt file with the names of these servers, as below:

#hosts.txt
hpc1
hpc2
nfs1

With MPICH2 installed we should be able to run:

mpiexec -n 3 -f hosts.txt uptime

and get a result similar to the following, one line per server:

$ mpiexec -n 3 -f hosts.txt uptime
 13:05:20 up 13 days,  2:30,  2 users,  load average: 1.94, 2.53, 2.12
 13:05:12 up 6 days, 15:29,  0 users,  load average: 0.41, 0.98, 0.79
 13:00:36 up 13 days,  2:24,  1 user,  load average: 0.65, 1.14, 1.02
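The `uptime` output shows that three processes ran, but not which server each one ran on. A small sketch that tallies the host names reported by the processes is given below. It assumes the standard `hostname`, `sort`, `uniq` and `awk` utilities are available on every node; `tally_hosts` is a hypothetical helper name.

```shell
#!/bin/bash
# tally_hosts: read one host name per line on stdin and print
# "HOST COUNT" for each distinct host, sorted by host name.
# (tally_hosts is an illustrative helper, not an MPICH2 command.)
tally_hosts() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example: each server in hosts.txt should appear once.
# mpiexec -n 3 -f hosts.txt hostname | tally_hosts
```

If a server from hosts.txt is missing from the tally, or one server appears several times, the host file or the MPICH2 installation on that server is worth checking before running novoalignMPI.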

5. Input Data

We are going to map small RNA reads to the Arabidopsis TAIR9 genome build.

5.1 Databases

Our Arabidopsis novoindex is located at /share/databases/tair9.nix and was built from the TAIR9 whole-genome FASTA sequences. We pass it to novoalignMPI with the “-d” command line argument. See the Basic short read mapping page for more information on how to build a novoindex from a FASTA-formatted genome.
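Because the index lives on the shared filesystem, it is worth confirming that every node sees the same file before starting a long run. The sketch below lists the file once per MPI process. `check_on_all_nodes` is a hypothetical helper name, and the sketch assumes that `mpiexec` returns a non-zero exit status when any of the processes fails.

```shell
#!/bin/bash
# check_on_all_nodes HOSTS_FILE NPROCS FILE
# Runs "ls -l FILE" once per MPI process so each node reports what it sees.
# (check_on_all_nodes is an illustrative helper, not an MPICH2 command.)
check_on_all_nodes() {
    local hosts_file=$1 nprocs=$2 file=$3
    mpiexec -n "$nprocs" -f "$hosts_file" ls -l "$file"
}

# Example: every line printed should show the same size and timestamp.
# check_on_all_nodes hosts.txt 3 /share/databases/tair9.nix
```

A "No such file or directory" message from any node suggests the shared filesystem is not mounted there.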

5.2 Short Reads

The short reads were downloaded from the NCBI Sequence Read Archive (SRA). We chose 4 samples from Groszmann et al., 2011 that are publicly available.

These files are:

SRR070689.fastq.gz
SRR070690.fastq.gz
SRR070691.fastq.gz
SRR070692.fastq.gz
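Before aligning, a quick sanity check on the downloads can save a failed cluster run. The sketch below counts the reads in a gzipped FASTQ file. It assumes the usual four-line FASTQ record layout (no wrapped sequence lines); `count_reads` is a hypothetical helper name.

```shell
#!/bin/bash
# count_reads FILE.fastq.gz: print the number of reads in a gzipped FASTQ file.
# Assumes each record is exactly four lines, so reads = lines / 4.
# (count_reads is an illustrative helper, not part of novoalign.)
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Example: test each archive for corruption, then report its read count.
# for f in SRR0706*.fastq.gz; do
#     gzip -t "$f" && echo "$f $(count_reads "$f")"
# done
```

A non-integer count indicates a truncated or non-standard file.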

6. Running novoalignMPI on the cluster

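We run novoalignMPI as follows:

mpiexec -np N+1 -f hosts.txt $NOVOALIGN_DIR/novoalignMPI -d /share/databases/tair9.nix -f $file -o SAM 2> stderr.txt > $file.sam < /dev/null

Where N is the number of hosts listed in “hosts.txt”, so with our three hosts we use “-np 4”. $NOVOALIGN_DIR is a placeholder for the directory in which novoalignMPI is installed; the shell's own $PATH variable cannot be used for this because it holds a colon-separated list of directories.
The complete BASH script that processes all the short reads into SAM format is therefore: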

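#!/bin/bash
# NOVOALIGN_DIR is a placeholder: set it to the directory containing novoalignMPI.
: "${NOVOALIGN_DIR:?set NOVOALIGN_DIR to the novoalignMPI install directory}"
for file in *.fastq.gz
do
    # Strip the extension so outputs are named e.g. SRR070689.sam
    base=$(basename "$file" .fastq.gz)
    mpiexec -np 4 -f hosts.txt "$NOVOALIGN_DIR/novoalignMPI" -d /share/databases/tair9.nix -f "$file" -r All -a -o SAM 2> "$base.error" < /dev/null > "$base.sam"
done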

This writes the alignments for each sample to a *.sam file and novoalignMPI's diagnostic output to a matching *.error file.
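The script hard-codes “-np 4” for our three hosts. If the host file changes often, the process count can be derived from it instead. The sketch below follows the N+1 rule from this section and assumes hosts.txt holds one host per line, with blank lines and lines starting with “#” ignored; `count_hosts` and `np_for_hosts` are hypothetical helper names.

```shell
#!/bin/bash
# count_hosts HOSTS_FILE: count lines that are neither blank nor comments.
# (Illustrative helper; assumes one host per line in the host file.)
count_hosts() {
    grep -Ecv '^[[:space:]]*(#|$)' "$1"
}

# np_for_hosts HOSTS_FILE: print N+1, the value to pass to "-np".
np_for_hosts() {
    echo $(( $(count_hosts "$1") + 1 ))
}

# Example:
# np=$(np_for_hosts hosts.txt)
# mpiexec -np "$np" -f hosts.txt ...
```

With the hosts.txt shown in section 4, `np_for_hosts hosts.txt` prints 4.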

7. Discussion

You can set the “-np” option based on the number of compute nodes you would like to use. By default novoalignMPI uses all the CPU cores on each node for alignment. To limit the number of CPU cores, add “-c $CORES” after the novoalignMPI command name, where $CORES is the maximum number of CPU cores to use, e.g.

#Limit novoalignMPI to using only 2 CPUs on each node
#($NOVOALIGN_DIR is a placeholder for the novoalignMPI install directory)
mpiexec -np 4 -f hosts.txt $NOVOALIGN_DIR/novoalignMPI -c 2 -d /share/databases/tair9.nix -f $file -o SAM 2> stderr.txt > $file.sam < /dev/null
