Hi everyone,

I'm using Slurm 2.1.0 on a 6-node cluster. I have a couple of questions about 
sbatch.

We're trying to schedule around 100k jobs on the cluster, and I'm hitting the 
MaxJobCount limit. Is there a way to schedule beyond ~65k jobs?
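
For reference, I tried raising the limit in slurm.conf (assuming MaxJobCount 
is the right knob to turn; the value below is just an example), but I'm not 
sure the controller honors values this large:

  # slurm.conf (excerpt)
  # MaxJobCount: max number of jobs slurmctld keeps in its records at once
  MaxJobCount=100000

I restarted slurmctld after the change.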

I would also like to group jobs logically, so that cancelling a single job 
kills all of its related steps. To do that, I tried scheduling job steps by 
running sbatch --jobid inside another sbatch invocation:

# outer script: submit one step per input line, all under this job's ID
cat file | while read x; do
  sbatch --jobid "$SLURM_JOB_ID" realjob.sh "$x"
done

and then running this script itself with sbatch:

sbatch outerscript.sh

This seems to schedule all the jobs correctly as steps under the main job, 
which is nice. But as soon as outerscript.sh finishes, all steps are killed.

This leads me to several more questions:

- by running job steps this way, can I schedule 100k steps?
- how can I prevent the steps from being killed when the main script finishes?

Also, I'm curious: can I "wait" on a job, a step, or a set of steps from 
within a script (the way a "barrier" would)? This would be very helpful for 
several scripts I'm writing (dependencies are not what I'm looking for). Can 
"sattach" be used for that purpose?
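
To make the barrier idea concrete, here is roughly the shape I'm after. This 
is pseudocode: stepA.sh/stepB.sh/stepC.sh are placeholder scripts, and 
wait_for_steps is a made-up command standing in for whatever real mechanism 
(if any) can block on step completion:

  # submit related work as steps of the current job
  sbatch --jobid "$SLURM_JOB_ID" stepA.sh
  sbatch --jobid "$SLURM_JOB_ID" stepB.sh

  # barrier: block here until stepA and stepB have both finished
  wait_for_steps "$SLURM_JOB_ID"    # hypothetical command

  # only then start the next phase
  sbatch --jobid "$SLURM_JOB_ID" stepC.sh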

Thanks.
