Hey guys,

We are new to Slurm, hoping to use some of its advanced parallel 
features over what is offered in older versions of SGE.

We have written various sbatch scripts to test out methods of submitting 
jobs, but we have not found a way to make them perform as intended.

We have spent many hours looking over the man pages and resubmitting 
jobs, but haven't found a combination that works just yet, so I'm hoping 
another user can help us out.

Here is a simple example of what we are attempting to do:

We have an sbatch script that in turn calls 10 consecutive 
srun commands.

We have it spread across 2 nodes of our cluster with -N 2, and what we 
would like is for sruns 1 and 2 to run at the same time, then 3 and 4 
once the first two are finished, and so on until all 10 jobs are done.

What we are finding is that the first srun runs in parallel on both 
nodes, then it proceeds to the next one sequentially, until it finishes 
all 10. Obviously this is not ideal.

We have looked into the -n and -c options, but neither does what we 
were expecting; they just spread the running of each individual srun 
across multiple cores/machines.

One workaround we have found is to just submit all 10 jobs as separate 
srun commands. This works until we try to scale up to, say, 200 jobs: 
we run out of available slots, and with srun, jobs terminate when there 
are no available slots to receive them, which is why we really want to 
get this running as intended inside an sbatch script.

Any help with how to correctly modify the sbatch script would be 
most appreciated.

Thanks in advance.

Alan Cowles
