> -----Original Message-----
> From: Moe Jette [mailto:[email protected]]
> Sent: Tuesday, 3 March 2015 9:54 AM

-snip-

> The options for srun, sbatch, and salloc are almost identical with
> respect to specification of a job's allocation requirements.

Yes. Part of my problem comes down to what it means to nest them, given that they 
are common options.  I was surprised to find that this script (submitted with 
sbatch) runs one task for each core rather than one for each node:
#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
srun --ntasks-per-node=1 uname -a
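
As an aside, a quick way to see where the tasks actually land is to count 
hostnames instead, e.g.:
srun --ntasks-per-node=1 hostname | sort | uniq -c
which prints, for each node, the number of tasks that ran on it.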

I picked apart what openmpi/mpirun does with --pernode and found I can do the 
following:
#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
srun --ntasks=$SLURM_NNODES uname -a
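
One variant I have not tested, but which should keep the one-per-node intent 
explicit while preventing packing, is to pin the node count on the step as well:
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES uname -a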

That is a fine workaround (as is mpirun --pernode), though I suspect it might 
break with other layout options that pack the tasks into the first node.  The 
intent of --ntasks-per-node=1 seems clearer, so it is unfortunate that it does 
not work as I want (or that my understanding of what should be wanted is poor).

> > 2) We currently have a few unrelated usage patterns where jobs request
> > multiple nodes but only some of the cores (perhaps to match jobs that
> > they used on our previous cluster configuration).  How would you deal
> > with that case where --exclusive is not necessarily appropriate? A big
> > stick might be an option (and advice to use whole nodes) though the
> > users are in different cities so it might have to be a virtual stick.
> 
> Perhaps the salloc/sbatch/srun options: --cpus-per-task and/or
> --ntasks-per-node

The problem only really arises when mixing different job steps that need to use 
the resources with different patterns.  This should be unusual, except perhaps for 
the per-node pre/post-processing case. Ideally I'd prefer all the layout info to 
be in the sbatch request and the 'main' step to run mpirun or srun with no 
particular options, so the remaining question is how to handle the special 
per-node pre/post-processing case.
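
For concreteness, the shape I am after is something like this (a sketch only; 
my.pre, my.app and my.post are placeholders for our actual programs):
#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=16
# per-node pre-processing: one task on each allocated node
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES ./my.pre
# main MPI step: inherit the full layout from the sbatch request
srun ./my.app
# per-node post-processing, same pattern as the pre step
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES ./my.post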

> Why do they need only a few cores, but multiple nodes?

I think it is mostly resistance to change, and maybe partly preserving a 
particular decomposition pattern.  There might also be some 'getting a job 
started sooner' in the presence of other users' serial jobs. Finally, there are 
a few users who are tuning and finding that they run faster by not using all 
cores (but they should probably be using nodes exclusively anyway, especially 
if their code is bandwidth-limited and they want to stop the remaining cores 
being used and causing contention...).
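
(If we do push them towards whole nodes, I imagine the advice would be along 
the lines of, say, sbatch --exclusive --nodes=2 --ntasks-per-node=8 my.bash, 
with purely illustrative numbers, so the node is theirs even though only half 
the cores run tasks.)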

> If done to get all of the memory on a node, perhaps your system should
> be configured to allocate and manage memory. Some relevant slurm.conf
> parameters are: SelectParameters=CR_CORE_MEM, MaxMemPerCPU=# and
> DefMemPerCPU=#. See the slurm.conf man page for more information:
> http://slurm.schedmd.com/slurm.conf.html

We do schedule memory.
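
For reference, the relevant parts of our slurm.conf look roughly like the 
following (values illustrative rather than our exact settings):
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
DefMemPerCPU=2048
MaxMemPerCPU=4096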

> 
> > Gareth
> >
> > BTW. --ntasks-per-node=1 was not needed in your advice as it was the
> > default.  However, in that case to use srun and use all the cores,
> > extra options were needed.
> 
> I know that, but wanted to provide you with a more general solution.

Fair enough :-) Thanks.

Gareth

> 
> 
> >> -----Original Message-----
> >> From: Moe Jette [mailto:[email protected]]
> >> Sent: Tuesday, 3 March 2015 3:42 AM
> >> To: slurm-dev
> >> Subject: [slurm-dev] Re: mixing mpi and per node tasks
> >>
> >>
> >> Use the "--exclusive" option to always get whole node allocations:
> >>
> >> $ sbatch --exclusive -N 3 my.bash
> >>
> >> I would use the "--ntasks-per-node=1" option to control the task
> >> count per node:
> >>
> >> srun --ntasks-per-node=1 my.app
> >>
> >> I would also recommend this document:
> >> http://slurm.schedmd.com/mc_support.html
> >>
> >> Quoting [email protected]:
> >>
> >> > We have a cluster with dual socket nodes with 10-core cpus (ht off)
> >> > and we share nodes with SelectType=select/cons_res.  Before (or
> >> > after) running an MPI task, I'd like to run some pre (and post)
> >> > processing tasks, one per node, but am having trouble finding
> >> > documentation for how to do this.  I was expecting to submit a job
> >> > with sbatch with --nodes=N --tasks-per-node=20 where N is an
> >> > integer to get multiple whole nodes, then run srun
> >> > --tasks-per-node=1 for the per node tasks, but this does not work
> >> > (I get one task for each core).
> >> >
> >> > I'd also like any solution to work with hybrid mpi/openmp with one
> >> > openmp task per node or per socket.
> >> >
> >> > Thanks,
> >> >
> >> > Gareth
> >>
> >>
> >> --
> >> Morris "Moe" Jette
> >> CTO, SchedMD LLC
> >> Commercial Slurm Development and Support
> 
> 
> --
> Morris "Moe" Jette
> CTO, SchedMD LLC
> Commercial Slurm Development and Support
