novoalignMPI with SGE

Hi all, has anyone out there successfully integrated novoalignMPI with a scheduler, specifically SGE, who could give me some direction in troubleshooting my setup?

My setup appears to be correct but I must have missed something somewhere because any alignment job sent through the scheduler hangs. The novoalignMPI PIDs are running on my compute nodes but no output is returned to my working directory and the job remains in run state.

I am running novoalign 2.07.13 w/ MPI, SGE 6.2, CentOS 6.2, MPICH2 1.4.1p1.

The alignment runs if I run mpiexec on our compute nodes without the scheduler and simple mpich2 jobs run through the scheduler fine as well. Only when SGE and novoalignMPI are brought together do I have this problem.

Attached is my SGE job script.

Thanks!


Hi Lebowski

Could you try the attached file?

You should make a parallel environment using qconf and assign the number of cores you want to use, e.g.

#$ -V -cwd -pe parenv 44


mpiexec -n 44 /PATH/novoalignMPI -c 1 ...

I have attached an updated script for you to try
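
A minimal sketch of a job script along these lines, assuming a parallel environment named parenv as in the example above. It is only a dry run that prints the command it would start. The remaining novoalignMPI options are left out, and build_cmd is a hypothetical helper name used for illustration. SGE sets NSLOTS inside a parallel job, so the slot count does not have to be repeated by hand.

```shell
#!/bin/sh
#$ -V -cwd -pe parenv 44

# Dry run: build the mpiexec command line and print it instead of running it.
# NSLOTS is set by SGE inside a parallel-environment job; fall back to 44
# (the value requested above) when the script is run outside SGE.
# build_cmd is a hypothetical helper, /PATH is the placeholder from the post.
build_cmd() {
    echo "mpiexec -n ${NSLOTS:-44} /PATH/novoalignMPI -c 1"
}

build_cmd
```

Once the printed command looks right, the echo can be replaced by the real mpiexec call with the full novoalignMPI argument list.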


Thanks for the response, Zayed. I get the same result - the job hangs and I get no alignment data. One curious piece of output is in the SGE error log - I would expect to see this novoalign output in the output log, not the error log:

# novoalignMPI (V2.07.13 - Build Jul 18 2011 @ 09:03:02 - A short read aligner with qualities.
# (C) 2008,2009,2010,2011 NovoCraft Technologies Sdn Bhd.
# License file: /foo/alignment/novoalign/2.07.13/novoalign.lic
# Licensed to FooCom;ny
# novoalignMPI --hdrhd off -c 1 -v 120 -i PE 425,80 -x 5 -r Random -F ILMFQ -d /foo/human/ncbi/37.1/indexed/allchr.nix -f /foo/mpiTest/sample.fastq -o SAM
# Starting at Thu May 17 13:21:10 2012
# Interpreting input files as Illumina FASTQ, Casava Pipeline 1.3 to 1.7.



My PE environment info is below:

pe_name mpich2_141_hydra
slots 300
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE


Hi Lebowski,

A couple of things.
On the novoalign command you didn't redirect stderr (2>), so the novoalign log goes to the qsub -e file.
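
To illustrate the stderr point, here is a small sketch in which a stand-in function writes its results to stdout and its log to stderr, the same split described above. fake_aligner and the file names are hypothetical and not part of novoalign. Adding 2> sends the log to a file of your choosing rather than leaving it to end up in the qsub -e file.

```shell
#!/bin/sh
# Stand-in for an aligner: results go to stdout, the run log goes to stderr.
# fake_aligner is a hypothetical placeholder, not a novoalign command.
fake_aligner() {
    echo "# Starting (log line)" >&2
    printf 'read1\t0\tchr1\t100\n'
}

# Scratch directory so the example does not clutter the working directory.
out_dir=$(mktemp -d)

# '>' captures the results, '2>' captures the log in a separate file.
fake_aligner > "$out_dir/alignments.sam" 2> "$out_dir/align.log"

echo "results: $out_dir/alignments.sam"
echo "log:     $out_dir/align.log"
```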


With regard to the stopping: how long did you wait, and did you check the processes with top? Each server has to load the index into memory, and this can take a few minutes depending on your I/O subsystem performance. It does appear to have stopped before loading the index. Could you check each server with top or similar to see whether the processes have started and how much memory they have used?

Colin


Hi Colin,

Thanks for pointing that out.

The processes sit indefinitely in sleep state on the slave nodes (or until a time threshold is hit by SGE) - could be hours. With the sample size I've attempted as a test this should complete within a few minutes at most. From what I know - the expected processes are launched across the compute queue, but they seem to be waiting on something...

Below are some memory stats from the MPI PID on a slave node:

Name: novoalignMPI
State: S (sleeping)
...
VmPeak: 22552 kB
VmSize: 12336 kB
...
Threads: 1
...
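
For anyone repeating this check: the fields shown above come from /proc/<pid>/status on Linux. A minimal sketch for pulling out just those fields, where show_proc_status is a hypothetical helper name and the shell's own PID stands in for a novoalignMPI PID:

```shell
#!/bin/sh
# Print selected fields from /proc/PID/status (Linux only).
# show_proc_status is a hypothetical helper name used for illustration.
show_proc_status() {
    grep -E '^(Name|State|VmPeak|VmSize|Threads):' "/proc/$1/status"
}

# Demonstrate on the current shell; on a compute node you would pass the
# PID of the novoalignMPI process instead.
show_proc_status $$
```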


It definitely appears to stop when it memory maps the index file. The read files are opened just before this, and it has reported the file format, so they were opened correctly. After mapping the index you should see a report of the index k & s settings; we don't see this, so it looks like the memory mapping is hanging. I don't know why this would happen. Perhaps SGE has imposed a memory limit, but I'd expect SGE to kill the job if the memory resource limit was exceeded.

You could try without memory mapping by adding the option -mmapoff, but be careful: each slave will try to load its own copy of the index and you may quickly run out of RAM. Try this first using just 2 or 3 slaves to see if it works, and let me know the results.

Colin


Thanks for the reply Colin. I added --mmapOff and still have the same results.

Hi Lebowski

I have a few questions/suggestions:

1. How many slots are you using to submit the job? Typically this is specified with mpiexec -n.
2. Have you tried your parallel environment with a $fill_up rule instead of $round_robin? This may help restrict a job to a few test machines so all processes share the same index.
3. How many reads are you using to test your configuration? A few thousand may be too few. About 200K is usually what I use.
4. Could you try building a smaller index for testing, e.g. chr22 + chr21 + chr20? This novoindex should load into memory faster.

You can specify an SGE resource request using something like

qsub -l "h_vmem=14g"  -R y  ...other qsub opts...

Hi,

Could you do a qstat -qc and post the output? I'm thinking SGE is not recognising that the processes share memory and is then making the processes wait on vmem availability.

Could you also try with just two slots, one master & one slave?

Colin


Thanks for the responses. I'm using 2 slots for the job and tried the $fill_up rule as well, with no luck. I had been using a small sample of 10k reads but also tried 200k reads and got the same result. I also tried the -l resource setting, with no luck either.

Colin, I'm not aware of a qstat -qc option. Is there another that could show this info?


My mistake, it should be qconf -sc

When you ask for 2 slots, are they on the same server? How much RAM does each server have?

I figure this is an SGE resource problem. Usually we see SGE killing jobs if they exceed their resource limit, but in this case it looks like the jobs are just being suspended until the resources become available.

Colin


No, the 2 slots are on different hosts - the available hosts have 16GB of memory at a minimum. Here's my output from qconf:

#-------------------------------

arch a RESTRING == YES NO NONE 0
calendar c RESTRING == YES NO NONE 0
cpu cpu DOUBLE >= YES NO 0 0
display_win_gui dwg BOOL == YES NO 0 0
h_core h_core MEMORY


My post attempt was truncated - attached are the qconf resource settings.

Hi,

From qconf the hard and soft vmem limits are set to 2G and also set to consumable.

How about trying changes as per Zayed's suggestion and requesting an increased memory limit in the qsub? I think you may need to do this for both the hard and soft limits, e.g. "h_vmem=14g,s_vmem=14g". You should set the limit 2-3G higher than the index file size.
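
The "2-3G higher than the index file size" rule can be turned into a small calculation. This is a rough sketch only: vmem_request_gb is a hypothetical helper, the 8 GiB index size is an invented example value, and whole-gigabyte rounding is an assumption rather than anything SGE or novoalign requires.

```shell
#!/bin/sh
# Print a memory request in whole gigabytes: the index size rounded up to
# the next GiB, plus some headroom (default 3).
# Usage: vmem_request_gb INDEX_BYTES [HEADROOM_GB]
# vmem_request_gb is a hypothetical helper name used for illustration.
vmem_request_gb() {
    bytes=$1
    headroom=${2:-3}
    gib=1073741824
    echo $(( (bytes + gib - 1) / gib + headroom ))
}

# Example with an assumed 8 GiB index and 3G of headroom.
req=$(vmem_request_gb 8589934592 3)
echo "-l h_vmem=${req}g,s_vmem=${req}g"   # prints: -l h_vmem=11g,s_vmem=11g
```

For a real index the byte count can be taken from the file itself, for example with wc -c < allchr.nix.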

Could you try this with a 2 node MPI job? If it works for a 2 node job but fails when you request more nodes, it is likely that SGE doesn't understand about shared memory. You might be able to correct for this by changing vmem to non-consumable in qconf.

There's also the possibility of running multi-threaded slaves so you use -c8 or similar on NovoalignMPI and then start just one 8 threaded job on each server. This requires the PE to be set up correctly. I haven't set up a PE like this but there are quite a few examples of this on the internet.


Colin


I set both memory limits but still have the same result. The log statements from verbose output following the standard stuff from novoalign (same as at top of thread) are interesting. I've read that these are harmless, but I thought I would note them anyway. These are always the last message(s) I get back:

proxy:0:0@node1 got pmi command (from 0): get
kvsname=kvs_13065_0 key=P1-businesscard
mpiexec@node1 pgid: 0 got PMI command: cmd=get kvsname=kvs_13065_0 key=P1-businesscard
mpiexec@node1 PMI response to fd 6 pid 0: cmd=get_result rc=0 msg=success value=port#60913$description#node2$
proxy:0:0@node1 forwarding command (cmd=get kvsname=kvs_13065_0 key=P1-businesscard) upstream
proxy:0:0@node1 we don't understand the response get_result; forwarding downstream


Hi Colin and Zayed

We were able to get this to run by using a queue with larger memory nodes along with "h_vmem=50G -l h_stack=20M" args. Your ideas about tweaking the memory footprint through SGE were definitely correct. Using a smaller index would likely have proved this as well. I'll continue to pare this down to see what the optimal settings will be for various sets. Thank you for all your help.


Hi,

That's excellent. Our SGE system doesn't use consumable memory so we don't have this problem.

I think a good solution if you want to use consumable memory is to multi-thread the MPI slaves. If you have servers with 16 cores, just run novoalignMPI -c 16 and start fewer slaves (mpiexec -np option). For this to work you need to set up the PE correctly. One issue is that the NovoalignMPI master process is only single threaded, so you might be wasting some cores. We have tested multi-threaded slaves with SGE: when we wanted 4 * 16-thread slaves, we requested 64 cores from SGE and used -np 5 so that mpiexec started 1 master process and 4 slaves. This slightly overcommitted one node, as it would have the single-threaded master process and a slave with 16 threads.
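
The slot arithmetic described above can be sketched as a small helper. plan_mpi is a hypothetical name used for illustration: it takes the number of slaves and the threads per slave, and prints the cores to request from SGE together with the -np value (slaves plus one master).

```shell
#!/bin/sh
# Usage: plan_mpi SLAVES THREADS_PER_SLAVE
# Prints the SGE slot request and the mpiexec -np value.
# plan_mpi is a hypothetical helper name used for illustration.
plan_mpi() {
    slaves=$1
    threads=$2
    slots=$(( slaves * threads ))   # cores requested from SGE
    np=$(( slaves + 1 ))            # one single-threaded master plus the slaves
    echo "slots=${slots} np=${np}"
}

plan_mpi 4 16   # prints: slots=64 np=5
```

As noted above, the master is not counted in the slot request, so one node ends up running 17 threads on 16 cores.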

Colin


Hi,

I wanted to follow up with an additional note on this thread in case it's of interest to others. As mentioned above, I was able to get this to work by targeting larger memory nodes, but this was actually through a different SGE queue. That queue was targeting nodes running CentOS 5.x. While the memory and stack settings were important to get this working, they weren't causing the processes to hang - it seems to be related to our nodes running CentOS 6.2.

I installed NovoalignMPI 2.08.01 and validated that this version works on our CentOS 6.2 queue, but was not able to get v2.07.13 to work on CentOS 6.2 queues regardless of the arguments passed.


Hi Lebowski,

That's interesting and good to know, though I've no idea why it would happen. We haven't changed MPI versions in a long while, so V2.07.13 & 2.08.01 should have the same MPI libraries linked and should behave the same. It's strange.

Thanks for persevering,
Colin

