Hi Ralph,

Your interpretation is correct, with one addition: with task/affinity enabled, the conflict does not require that both sbatch submissions use mpirun. We first discovered the issue when an OMPI application was conflicting with an embarrassingly parallel job (#SBATCH --array). The --array job was obeying the SLURM core assignments and the MPI job was not.

I disabled task/affinity for now so that processes in a --array job can float to any core, making it irrelevant which cores the MPI job binds to. The only risks in this configuration are two OMPI jobs sharing a node, and a probably minor performance hit for most embarrassingly parallel jobs, since they no longer use CPU affinity.
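
For concreteness, here is a rough sketch of the two kinds of submission that were colliding on a node. Script names, array size, and application names are just placeholders:

~~~
#!/bin/sh
# array_job.sbatch (hypothetical): embarrassingly parallel job,
# one core per array element, bound by task/affinity
#SBATCH --array=1-8
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
./serial_app "$SLURM_ARRAY_TASK_ID"
~~~

~~~
#!/bin/sh
# mpi_job.sbatch (hypothetical): OMPI job that can land on the same node
#SBATCH --ntasks=2
mpirun ./mpi_app
~~~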

Here's everything that might be relevant from slurm.conf:

MpiDefault=none
PropagateResourceLimitsExcept=ALL
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/affinity
TaskPluginParam=sched,verbose
#TaskPluginParam=cores,verbose
FastSchedule=1

It makes no difference which TaskPluginParam we use.

Here's Moe's comment again on how SLURM communicates the bindings, in case it helps:

~~~
The CPU assignments for the entire job are not shown in environment variables for a batch job, but are shown in the SLURM_CPU_BIND environment variable for the launched application (likely different for each node/task) or using the "scontrol -dd show job" command:

JobId=9635 JobName=bash
  .....
     Nodes=tux1 CPU_IDs=0-3 Mem=200
     Nodes=tux2 CPU_IDs=0-1 Mem=100
~~~
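
In case it's useful for reproducing this, the assignment and the actual mask can be compared roughly like this (the job ID is a placeholder; taskset is Linux-only):

~~~
# What SLURM assigned to the job on each node:
scontrol -dd show job <jobid> | grep CPU_IDs

# What the launched step is told (SLURM_CPU_BIND) vs. the mask the
# process actually runs under:
srun sh -c 'echo $SLURM_CPU_BIND; taskset -cp $$'
~~~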

Thanks much for looking into this,

    Jason


On 06/09/16 09:52, Ralph Castain wrote:
Hi Jason

It sounds like the srun executed inside each mpirun is not getting bound to a specific set of cores, or else we are not correctly picking that up and staying within it. So let me see if I fully understand the scenario, and please forgive this old fossil brain if you’ve explained all this before:

You are executing multiple parallel sbatch commands on the same nodes, with each sbatch requesting and being allocated only a subset of cores on those nodes. Within each sbatch, you are executing a single mpirun that launches an application.

Is that accurate? If so, I can try to replicate and test this here if you tell me how you built and configured SLURM (as I haven't used their task/affinity plugin before).

Ralph

On Jun 9, 2016, at 7:35 AM, Jason Bacon <[email protected]> wrote:



Thanks for all the suggestions, everyone.

A little more info:

I had to do a new OMPI build using --with-pmi. Binding works correctly using srun with this build, but mpirun still ignores the SLURM core assignments.
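
Roughly speaking, a build of that shape looks like this; paths are placeholders, and the flags that matter here are --with-slurm and --with-pmi:

~~~
# Placeholder prefix and PMI location
./configure --prefix=/usr/local/openmpi \
    --with-slurm \
    --with-pmi=/usr/local
make && make install
~~~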

I also patched the task/affinity plugin for FreeBSD for the sake of comparison (minor differences in the cpuset API). It's not 100% yet, but it appears that mpirun is ignoring the SLURM core assignments there as well.

Next question:

Is anyone out there seeing mpirun obey the core assignments from SLURM's task/affinity plugin? If so, I'd love to see your configure arguments for both SLURM and OMPI.

I have growing doubts that this interface is working, though. I can imagine this issue going unnoticed most of the time, because it only causes a problem when an OMPI job shares a node with another job using core binding, which is infrequent on our clusters. Even when that happens, it may still go unnoticed unless someone is monitoring performance carefully, since the only likely impact is a few processes running at 50% of their normal speed while they share a core.

I think this is worth fixing and I'd be happy to help with the coding and testing. We can't police how every user starts their MPI jobs, so it would be good if binding worked properly no matter which launcher they use.

Thanks again,

   Jason

On 06/07/16 20:17, Ralph Castain wrote:
Yes, it should - provided the job step executing each mpirun has been given a unique binding. I suspect this is the problem you are encountering, but can’t know for certain. You could run an app that prints out its binding and then see if two parallel executions of srun yield different values.
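
Something like this would do it, assuming Linux (it just asks the kernel which CPUs the step is allowed to use, no MPI involved):

~~~
# Run in each of the two parallel allocations; if the task/affinity
# plugin is working, the reported core lists should not overlap.
srun sh -c 'grep Cpus_allowed_list /proc/self/status'
~~~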


On Jun 7, 2016, at 5:26 PM, Jason Bacon <[email protected]> wrote:


So this *should* work even for two separate MPI jobs sharing a node?

Thanks much,

   Jason

On 06/07/2016 09:09, Ralph Castain wrote:
Yes, it should. What’s odd is that mpirun launches its daemons using srun under the covers, and the daemon should therefore be bound. We detect that and use it, but I’m not sure why this isn’t working here.


On Jun 7, 2016, at 6:52 AM, Bruce Roberts <[email protected]> wrote:

What happens if you use srun instead of mpirun? I would expect that to work correctly.

On June 7, 2016 6:31:27 AM MST, Ralph Castain <[email protected]> wrote:

   No, we don’t pick that up - suppose we could try. Those envars
   have a history of changing, though, and it gets difficult to
   match the version with the var.

   I can put this on my “nice to do someday” list and see if/when
   we can get to it. Just so I don’t have to parse around more -
   what version of slurm are you using?


On Jun 7, 2016, at 6:15 AM, Jason Bacon <[email protected]> wrote:



   Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_*
   when SLURM integration is compiled in?

   printenv in the sbatch script produces the following:

   Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: grep SBATCH slurm-5*
   slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
   slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
   slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
   slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
   slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
   slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
   slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
   slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC

   All OpenMPI jobs are using cores 0 and 2, although SLURM has
   assigned 0 and 1 to job 579 and 2 and 3 to 580.
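
   (Decoding the masks: bit n corresponds to core n, so 0x3 is binary
   0011, i.e. cores 0 and 1 for job 579, and 0xC is binary 1100, i.e.
   cores 2 and 3 for job 580.)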

   Regards,

      Jason

   On 06/06/16 21:11, Ralph Castain wrote:
   Running two jobs across the same nodes is indeed an issue.
   Regardless of which MPI you use, the second mpiexec has no
   idea that the first one exists. Thus, the bindings applied to
   the second job will be computed as if the first job doesn’t
   exist - and thus, the procs will overload on top of each other.

   The way you solve this with OpenMPI is by using the
   -slot-list <foo> option. This tells each mpiexec which cores
   are allocated to it, and it will constrain its binding
   calculation within that envelope. Thus, if you start the
   first job with -slot-list 0-2, and the second with -slot-list
   3-5, the two jobs will be isolated from each other.

   You can use any specification for the slot-list - it takes a
   comma-separated list of cores.
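
   For example (the application name here is hypothetical; the core
   ranges are just the ones above):

      mpirun -slot-list 0-2 ./mpi_app    # first job
      mpirun -slot-list 3-5 ./mpi_app    # second job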

   HTH
   Ralph

On Jun 6, 2016, at 6:08 PM, Jason Bacon <[email protected]> wrote:



   Actually, --bind-to core is the default for most OpenMPI
   jobs now, so adding this flag has no effect.  It refers to
   the processes within the job.

   I'm thinking this is an MPI-SLURM integration issue.
   Embarrassingly parallel SLURM jobs are binding properly, but
   MPI jobs are ignoring the SLURM environment and choosing
   their own cores.

   OpenMPI was built with --with-slurm and it appears from
   config.log that it located everything it needed.

   I can work around the problem with "mpirun --bind-to none",
   which I'm guessing will impact performance slightly for
   memory-intensive apps.
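
   In other words, the workaround amounts to launching with (the
   application name is a placeholder):

      mpirun --bind-to none ./mpi_app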

   We're still digging on this one and may be for a while...

     Jason

   On 06/03/16 15:48, Benjamin Redling wrote:
   On 2016-06-03 21:25, Jason Bacon wrote:
   It might be worth mentioning that the calcpi-parallel jobs
   are run with
   --array (no srun).

   Disabling the task/affinity plugin and using "mpirun
   --bind-to core"
   works around the issue.  The MPI processes bind to
   specific cores and
   the embarrassingly parallel jobs kindly move over and stay
   out of the way.
   Are the mpirun --bind-to core child processes the same as a
   slurm task?
   I have no experience at all with MPI jobs -- just trying to
   understand
   task/affinity and params.

   As far as I understand, when you let mpirun do the binding it
   handles the binding differently:
   https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php

   If I grok the
   % mpirun ... --map-by core --bind-to core
   example in the "Mapping, Ranking, and Binding: Oh My!"
   section right.
   On 06/03/16 10:18, Jason Bacon wrote:
   We're having an issue with CPU binding when two jobs land
   on the same
   node.

   Some cores are shared by the 2 jobs while others are left
   idle. Below
   [...]
   TaskPluginParam=cores,verbose
Don't you bind each _job_ to a single core because you override
   automatic binding and thus prevent binding each child process to a
   different core?


   Regards,
   Benjamin







--
All wars are civil wars, because all men are brothers ... Each one owes
infinitely more to the human race than to the particular country in
which he was born.
               -- Francois Fenelon
