[slurm-dev] Re: Processes sharing cores

Jason Bacon Tue, 07 Jun 2016 07:21:34 -0700


That's very good to know, thanks.

Currently running 14.11.6, but planning to upgrade to the 16.x series soon.

However, I just ran a 6-core job on our test cluster (two 4-core nodes) and it looks like SLURM doesn't provide enough information for jobs that use more than one node. It seems if the job uses all the cores on one node or if it uses more than one node, there's just one mask that's all 1's and hence doesn't tell us anything useful:

Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 425: grep SBATCH slurm-587.out

SBATCH_CPU_BIND_LIST=0xF
SBATCH_CPU_BIND_VERBOSE=verbose
SBATCH_CPU_BIND_TYPE=mask_cpu:
SBATCH_CPU_BIND=verbose,mask_cpu:0xF

There's also this:

SLURM_TASKS_PER_NODE=4,2
SLURM_JOB_CPUS_PER_NODE=4,2
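
For reference, the per-node counts above can be expanded into one entry per node with a few lines of Python. This is only a sketch: the function name expand_per_node is made up for illustration, and it assumes the usual Slurm convention of compressing repeated counts as "2(x3)". It recovers how many tasks or CPUs land on each node, but not which cores, which is exactly the missing piece.

```python
import re

def expand_per_node(value):
    """Expand a per-node count string such as '4,2' or '2(x3),1'.

    Hypothetical helper for illustration; assumes repeated counts
    are written as COUNT(xREPEAT).
    """
    counts = []
    for item in value.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if match is None:
            raise ValueError(f"unrecognized item: {item!r}")
        count = int(match.group(1))
        repeat = int(match.group(2) or 1)  # no "(xN)" suffix means one node
        counts.extend([count] * repeat)
    return counts

# Value taken from the job output above.
print(expand_per_node("4,2"))
```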

But I'm not seeing enough information in the environment to determine the core assignments on a partially-allocated node. I suppose I could be missing something, though.

I think what we need is a variable or set of variables that specify a separate mask for each allocated node. For the job I just ran, it would look something like this:


compute-001:0xF,compute-002:0x3
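
To show how little code a consumer such as an MPI launcher would need, here is a rough Python sketch that parses a string in the format proposed above. Note that this format does not exist in SLURM; it is only the suggestion from this message, and the names mask_to_cores and parse_node_masks are made up. It also assumes bit n of a mask corresponds to core n.

```python
def mask_to_cores(mask):
    """Return the core indices whose bits are set in a hex mask like '0xF'.

    Assumes bit n of the mask corresponds to core n.
    """
    value = int(mask, 16)  # int() accepts the '0x' prefix when base is 16
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

def parse_node_masks(spec):
    """Parse the hypothetical 'node:mask,node:mask' string into a dict."""
    result = {}
    for entry in spec.split(","):
        node, sep, mask = entry.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in entry {entry!r}")
        result[node] = mask_to_cores(mask)
    return result

# Example string from the proposal above.
print(parse_node_masks("compute-001:0xF,compute-002:0x3"))
```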

I know that cgroups is an alternative, but that adds a lot of complexity and will likely never work on platforms other than Linux, so it would be good to get this working. Seems like it wouldn't be too difficult.


Thanks,

    Jason


On 06/07/16 08:31, Ralph Castain wrote:

No, we don’t pick that up - suppose we could try. Those envars have a history of changing, though, and it gets difficult to match the version with the var.
I can put this on my “nice to do someday” list and see if/when we can get to it. Just so I don’t have to parse around more - what version of slurm are you using?
On Jun 7, 2016, at 6:15 AM, Jason Bacon wrote:
Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_* when SLURM integration is compiled in?
printenv in the sbatch script produces the following:
Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: grep SBATCH slurm-5*
slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
All OpenMPI jobs are using cores 0 and 2, although SLURM has assigned 0 and 1 to job 579 and 2 and 3 to 580.
Regards,

   Jason

On 06/06/16 21:11, Ralph Castain wrote:
Running two jobs across the same nodes is indeed an issue. Regardless of which MPI you use, the second mpiexec has no idea that the first one exists. Thus, the bindings applied to the second job will be computed as if the first job doesn’t exist - and thus, the procs will overload on top of each other.
The way you solve this with OpenMPI is by using the -slot-list <foo> option. This tells each mpiexec which cores are allocated to it, and it will constrain its binding calculation within that envelope. Thus, if you start the first job with -slot-list 0-2, and the second with -slot-list 3-5, the two jobs will be isolated from each other.
You can use any specification for the slot-list - it takes a comma-separated list of cores.
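
As a rough illustration of the idea, a wrapper script could turn a SLURM-style hex CPU mask into the comma-separated core list that -slot-list takes. The sketch below is Python; the function name mask_to_slot_list is made up, and it assumes bit n of the mask corresponds to core n and that one mask describes the whole job, which only holds when the job sits on a single node.

```python
def mask_to_slot_list(mask):
    """Convert a hex CPU mask such as '0xC' into a comma-separated core list.

    Hypothetical helper; assumes bit n of the mask corresponds to core n.
    """
    value = int(mask, 16)  # int() accepts the '0x' prefix when base is 16
    cores = [str(bit) for bit in range(value.bit_length()) if (value >> bit) & 1]
    return ",".join(cores)

# Masks of the kind SLURM hands to two jobs sharing a 4-core node.
for mask in ("0x3", "0xC"):
    print(mask, "->", "-slot-list", mask_to_slot_list(mask))
```

The resulting string would then be passed on the mpirun command line by whatever script launches the job.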
HTH
Ralph
On Jun 6, 2016, at 6:08 PM, Jason Bacon wrote:
Actually, --bind-to core is the default for most OpenMPI jobs now, so adding this flag has no effect. It refers to the processes within the job.
I'm thinking this is an MPI-SLURM integration issue. Embarrassingly parallel SLURM jobs are binding properly, but MPI jobs are ignoring the SLURM environment and choosing their own cores.
OpenMPI was built with --with-slurm and it appears from config.log that it located everything it needed.
I can work around the problem with "mpirun --bind-to none", which I'm guessing will impact performance slightly for memory-intensive apps.
We're still digging on this one and may be for a while...

  Jason

On 06/03/16 15:48, Benjamin Redling wrote:
On 2016-06-03 21:25, Jason Bacon wrote:
It might be worth mentioning that the calcpi-parallel jobs are run with
--array (no srun).

Disabling the task/affinity plugin and using "mpirun --bind-to core"
works around the issue.  The MPI processes bind to specific cores and
the embarrassingly parallel jobs kindly move over and stay out of the way.
Are the mpirun --bind-to core child processes the same as a slurm task?
I have no experience at all with MPI jobs -- just trying to understand
task/affinity and params.
As far as I understand, when you let mpirun do the binding, it handles the
binding differently: https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php

If I grok the
% mpirun ... --map-by core --bind-to core
example in the "Mapping, Ranking, and Binding: Oh My!" section right.
On 06/03/16 10:18, Jason Bacon wrote:
We're having an issue with CPU binding when two jobs land on the same
node.
Some cores are shared by the 2 jobs while others are left idle. Below
[...]
TaskPluginParam=cores,verbose
don't you bind each _job_ to a single core, because you override
automatic binding and thus prevent each child process from being bound
to a different core?


Regards,
Benjamin
--
All wars are civil wars, because all men are brothers ... Each one owes
infinitely more to the human race than to the particular country in
which he was born.
              -- Francois Fenelon
