general questions

Hi,

We recently sequenced a substrain of C57BL/6J, the inbred mouse strain currently used as the mouse reference genome. We are currently using NovoalignCSMPI to align the reads.

I am a newbie in NGS analysis. I have experience working with MAQ and BWA, where you can specify the maximum gaps and mismatches allowed at the seed or whole-read level. I am not sure how this works for NovoalignCS. I would highly appreciate it if you could answer the following questions:

Some introduction: We have 21 lanes of data (21 CSfasta and quality files, worth 60x coverage). This data was generated from the same mate-pair library from a single male mouse liver. We have a cluster system with 12 nodes, excluding the head node, and each node has 16 processors. Each node has 64 GB of RAM, or 4 GB of RAM per processor. Each node has around 500 GB of scratch space. We have installed MPICH2 on every node.

My first question is related to how to use NovoalignCS on our cluster and minimize the running time:

1) Normally I prefer using 1 node for 1 lane of data and splitting the data across the 16 processors in each node. This way I can align 12 lanes (12 nodes) of data at a time.

Usage: mpiexec -f hosts.txt -np novoalignMPI -d -f >report.novo 2>run.log



Hi Newbie,

With regard to question 1): if you want to run 1 lane on 1 node, then don't use MPI; just use multi-threaded novoalign. If you are submitting your job using SGE, Torque, PBS or similar, you'll need to tell it how many cores you are using. By default NovoalignCS will use all cores on the node.
MPI is good when you want to use multiple nodes to align one lane.
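
As a minimal sketch of the one-lane-per-node approach, the function below only prints the command it would run (a dry run). It assumes that -c sets the thread count, as the "-c 1" in the MPI script elsewhere in this thread suggests, and the index and read file names are hypothetical placeholders, not files from this thread.

```shell
#!/bin/sh
# Dry-run sketch: print a multi-threaded novoalignCS command for one lane.
# Assumption: -c is the thread count; -d and -f are used as in the MPI
# script in this thread. All file names here are hypothetical.
lane_cmd() {
    index=$1      # index file for the reference (hypothetical name)
    reads=$2      # csfasta file for the lane
    threads=$3    # number of cores to use on the node
    base=${reads%.csfasta}
    echo "novoalignCS -c $threads -d $index -f $reads > $base.novo 2> $base.log"
}

lane_cmd mouse_cs.ncx lane01.csfasta 16
```

Once the printed command looks right, it can be placed in a job script that requests the same number of cores from the scheduler.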

I'm not sure why, but I can't see your question 2. Could you post it as a new topic?

Thanks, Colin


Hi Colin,

I have a few more questions. I wrote a long paragraph, but when I pasted it into the message box the message was truncated and only a small portion of it was posted. Is there any way I can send you an email directly, or post a paragraph of about 20 lines?

Thanks.


Hi NGS_newbie,

Do you have a scheduler system such as SGE or PBS installed?

You can write a submission shell script and submit each lane to the cluster this way. You will need parallel environments set up so that you can specify how many cores novoalignCSMPI should use on the cluster.

To use 64 cores across your cluster for 1 alignment job, the script would look something like this:

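#!/bin/sh

# SGE options: run in the current directory, export the environment,
# request 64 slots in the "parenv" parallel environment, reserve resources
#$ -cwd -V -pe parenv 64 -R y

# Usage: qsub $0 1.fastq 2.fastq output.sam

cores=64

# $index is not set in this script; it must hold the path to your index,
# e.g. exported in your environment before qsub (passed through by -V)
arg1=$1
arg2=$2
outsam=$3

# Add any other novoalign options before -f
mpiexec -n "$cores" \
          novoalignCSMPI -c 1 -d "$index" \
          -f "$arg1" "$arg2" > "$outsam"

exit 0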


Then submit the alignment job by passing this script to qsub. If you do this for each lane, the scheduler will queue the jobs and you will end up with a SAM file for each lane. For this to work, your scheduler must support parallel environments. We use SGE, but it is also possible with PBS using a "nodes=4:ppn=16" request.
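
Submitting every lane can be sketched as a small loop. This is only a dry run that prints the qsub commands, and it rests on assumptions that are not from this thread: the job script is saved as novo_mpi.sh, and each lane has a pair of read files named <lane>_F3.csfasta and <lane>_R3.csfasta.

```shell
#!/bin/sh
# Dry run: print one qsub command per lane instead of submitting it.
# Assumptions: the job script is saved as novo_mpi.sh (hypothetical name)
# and lane files follow the <lane>_F3.csfasta / <lane>_R3.csfasta pattern.
submit_cmds() {
    for f3 in "$@"; do
        lane=${f3%_F3.csfasta}    # strip the suffix to get the lane name
        echo "qsub novo_mpi.sh $f3 ${lane}_R3.csfasta ${lane}.sam"
    done
}

submit_cmds lane01_F3.csfasta lane02_F3.csfasta
```

After checking the printed lines, the same output can be piped to sh to submit the jobs for real; the scheduler then queues one job per lane.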


Hi,

Could you post your questions here, using a new topic for each question? Pasting should work, but sometimes special characters are interpreted as markup and the formatting gets messed up. If you have any special characters, it's best to paste them into a code block:
{CODE()} any text with special characters {CODE}


Best, Colin

