Quoting [email protected]:
-----Original Message-----
From: Moe Jette [mailto:[email protected]]
Sent: Tuesday, 3 March 2015 9:54 AM

-snip-

The options for srun, sbatch, and salloc are almost identical with
respect to specification of a job's allocation requirements.

Yes. Part of my problem comes down to what it means to nest them, since they are common options. I was surprised to find that this (submitted with sbatch) runs one task for each core, rather than one for each node:
#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
srun --ntasks-per-node=1 uname -a

I picked apart what openmpi/mpirun does with --pernode and found I can do the following:
#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
srun --ntasks=$SLURM_NNODES uname -a

That is a fine workaround (as is mpirun --pernode), though I suspect it could break with other layout options that pack the tasks into the first node. The intent of --ntasks-per-node=1 seems clearer, so it is unfortunate that it does not work as I want (or that my understanding of what should be wanted is poor).
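For what it's worth, a variant I suspect would be less sensitive to packing (just a sketch, not something I have exercised against every layout) pins both the node count and the task count for the step:

#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
# Ask for as many tasks as nodes, spread over all allocated nodes,
# so one task should land on each node
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES uname -a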

Yes, some of the srun options apply to a job allocation, some to a job step, and some to both. I'll agree this needs more documentation and will put that on our to-do list.

> 2) We currently have a few unrelated usage patterns where jobs request
> multiple nodes but only some of the cores (perhaps to match jobs that
> they used on our previous cluster configuration). How would you deal
> with that case where --exclusive is not necessarily appropriate? A big
> stick might be an option (and advice to use whole nodes) though the
> users are in different cities so it might have to be a virtual stick.


Perhaps the salloc/sbatch/srun options: --cpus-per-task and/or --ntasks-per-node

The problem only really arises when mixing job steps that need to use the resources with different patterns. This should be unusual, except perhaps for the per-node pre/post-processing case. Ideally I'd prefer all the layout information to be in the sbatch request and the 'main' step to run mpirun or srun with no particular options, so the remaining question is how to handle the special per-node pre/post-processing case.
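For concreteness, the script shape I have in mind is roughly this (a sketch only; the pre/post/main program names are placeholders):

#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
# Per-node pre-processing: one task on each allocated node
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES ./pre_process.sh
# Main step: picks up the full layout from the sbatch options above
srun ./main_app
# Per-node post-processing, same shape as the pre-step
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES ./post_process.sh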

The sbatch options controlling job layout (task count, node count, CPUs per task, etc.) are used to construct environment variables for the batch script (the same applies to salloc), so the srun command gets those options by default. A typical use case is for all resource specification options to appear on the sbatch submit line (or in the script), with the srun commands within the script identifying only the application to be run.
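For example (an illustrative sketch with arbitrary counts and a placeholder program name), the layout can live entirely in the batch directives while the srun line names only the application:

#!/bin/bash
#SBATCH --nodes=4 --ntasks-per-node=8 --cpus-per-task=2
# srun picks up the layout from the environment variables sbatch sets
# (SLURM_NNODES, SLURM_NTASKS, SLURM_CPUS_PER_TASK, ...), so no options
# are needed here beyond the program to run
srun ./my_mpi_app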

There is another mode of operation in which a single job executes many srun commands within an allocation, using different size and layout options. These various job steps (each srun invocation) can run serially or in parallel using overlapping or separate resources (see srun's --exclusive option).
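A minimal sketch of that pattern (step sizes and program names are placeholders) would be something like:

#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
# Two steps run in parallel within the one allocation; --exclusive asks
# for dedicated CPUs for each step so they use separate resources
srun --exclusive --ntasks=16 ./step_a &
srun --exclusive --ntasks=16 ./step_b &
wait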
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
