Hey guys,

We are new to Slurm, hoping to use some of its advanced parallel features over what is offered in older versions of SGE.
We have written various sbatch scripts to test out methods of submitting jobs, and we are not finding a way to have them perform as intended. We have spent many hours looking over the man pages and resubmitting jobs, but haven't found an approach that works just yet, so I'm hoping another user can help us out.

Here is a simple example of what we are attempting to do: we have an sbatch script that in turn should call out 10 consecutive srun commands. We have it spread across 2 nodes of our cluster with -N 2, and what we would like is for srun 1 and srun 2 to run at the same time, then 3 and 4 once the first two are finished, and so on until all 10 jobs are finished. What we are finding instead is that the first srun runs in parallel on both nodes, then the script proceeds to the next srun sequentially, until it finishes all 10. Obviously this is not ideal. We have looked into the -n and -c options, but neither does what we were expecting; they just spread the run of each individual srun across multiple cores/machines.

One workaround we have found is to submit all 10 jobs as separate srun commands. This works until we try to scale up to, say, 200 jobs: we run out of available slots, and with srun, jobs terminate when there are no slots to receive them, which is why we really want to get this running as intended inside a single sbatch script.

Any help on how to correctly modify the sbatch script would be most appreciated.

Thanks in advance,
Alan Cowles
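To make this concrete, here is a simplified sketch of the kind of script we have been experimenting with (./work.sh is just a placeholder for our real per-job command, and we are not certain the per-step --exclusive flag is the right way to keep each step on its own resources):

```shell
#!/bin/bash
# Ask for 2 nodes and 2 concurrent tasks for the whole allocation.
#SBATCH -N 2
#SBATCH --ntasks=2

# Launch the 10 steps in pairs: start two one-task job steps in the
# background, then wait for both before starting the next pair.
for i in $(seq 1 2 10); do
    # -N1 -n1 confines each step to a single task on a single node;
    # --exclusive (at the step level) should keep the two concurrent
    # steps from landing on the same CPUs.
    srun -N1 -n1 --exclusive ./work.sh "$i" &
    srun -N1 -n1 --exclusive ./work.sh "$((i + 1))" &
    wait    # block until both steps in this pair have finished
done
```

We have also wondered whether a job array with a throttle (something like #SBATCH --array=1-10%2, which limits how many array tasks run at once) would be a cleaner fit, but we would rather understand the single-script approach first.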
