So this *should* work even for two separate MPI jobs sharing a node?

Thanks much,

    Jason

On 06/07/2016 09:09, Ralph Castain wrote:
Yes, it should. What’s odd is that mpirun launches its daemons using srun under the covers, and the daemon should therefore be bound. We detect that and use it, but I’m not sure why this isn’t working here.


On Jun 7, 2016, at 6:52 AM, Bruce Roberts <[email protected]> wrote:

What happens if you use srun instead of mpirun? I would expect that to work correctly.
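That is, launching the ranks directly inside the batch script (a sketch; ./a.out is a placeholder, and this assumes the OpenMPI build supports direct launch under srun):

    # in place of "mpirun ./a.out" in the sbatch script
    srun ./a.out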

On June 7, 2016 6:31:27 AM MST, Ralph Castain <[email protected]> wrote:

    No, we don’t pick that up - suppose we could try. Those envars
    have a history of changing, though, and it gets difficult to
    match the version with the var.

    I can put this on my “nice to do someday” list and see if/when we
    can get to it. Just so I don’t have to parse around more - what
    version of slurm are you using?


    On Jun 7, 2016, at 6:15 AM, Jason Bacon <[email protected]> wrote:



    Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_*
    when SLURM integration is compiled in?

    printenv in the sbatch script produces the following (grepping the job output files for the SBATCH variables):

    Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: grep SBATCH slurm-5*
    slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
    slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
    slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
    slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
    slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
    slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
    slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
    slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC

    Both OpenMPI jobs are using cores 0 and 2, although SLURM has
    assigned cores 0 and 1 to job 579 and cores 2 and 3 to job 580.

    Regards,

       Jason
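    Since those variables apparently aren't picked up, one untested
    sketch (bash assumed; the mask parsing and the ./a.out binary are
    illustrative, not from this thread) would be to convert the
    SBATCH_CPU_BIND_LIST mask into the comma-separated core list that
    the --slot-list option discussed below expects:

        # inside the sbatch script: 0x3 -> "0,1", 0xC -> "2,3"
        mask=$(( SBATCH_CPU_BIND_LIST ))
        cores=""
        core=0
        while [ "$mask" -ne 0 ]; do
            if [ $(( mask & 1 )) -eq 1 ]; then
                cores="${cores:+$cores,}$core"
            fi
            mask=$(( mask >> 1 ))
            core=$(( core + 1 ))
        done
        mpirun --slot-list "$cores" ./a.out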

    On 06/06/16 21:11, Ralph Castain wrote:
    Running two jobs across the same nodes is indeed an issue.
    Regardless of which MPI you use, the second mpiexec has no idea
    that the first one exists. The bindings applied to the second job
    are therefore computed as if the first job doesn't exist, and the
    procs from the two jobs land on top of each other.

    The way you solve this with OpenMPI is by using the -slot-list
    <foo> option. This tells each mpiexec which cores are allocated
    to it, and it will constrain its binding calculation within
    that envelope. Thus, if you start the first job with -slot-list
    0-2, and the second with -slot-list 3-5, the two jobs will be
    isolated from each other.

    You can use any specification for the slot-list - it takes a
    comma-separated list of cores.

    HTH
    Ralph
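    For instance (a sketch; ./a.out stands in for the real MPI binary),
    the two jobs from the example above would be launched along these
    lines:

        # first job, constrained to cores 0-2
        mpirun --slot-list 0,1,2 ./a.out

        # second job, constrained to cores 3-5
        mpirun --slot-list 3,4,5 ./a.out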

    On Jun 6, 2016, at 6:08 PM, Jason Bacon <[email protected]> wrote:



    Actually, --bind-to core is the default for most OpenMPI jobs
    now, so adding this flag has no effect.  It refers to the
    processes within the job.

    I'm thinking this is an MPI-SLURM integration issue.
    Embarrassingly parallel SLURM jobs are binding properly, but
    MPI jobs are ignoring the SLURM environment and choosing their
    own cores.

    OpenMPI was built with --with-slurm and it appears from
    config.log that it located everything it needed.

    I can work around the problem with "mpirun --bind-to none",
    which I'm guessing will impact performance slightly for
    memory-intensive apps.

    We're still digging on this one and may be for a while...

      Jason
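    For reference, the workaround inside the batch script would look
    something like this (a sketch; the #SBATCH options and ./a.out are
    placeholders):

        #!/bin/bash
        #SBATCH --ntasks=2
        # let SLURM's task/affinity plugin own the placement;
        # tell OpenMPI not to bind on its own
        mpirun --bind-to none ./a.out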

    On 06/03/16 15:48, Benjamin Redling wrote:
    On 2016-06-03 21:25, Jason Bacon wrote:
    It might be worth mentioning that the calcpi-parallel jobs are run
    with --array (no srun).

    Disabling the task/affinity plugin and using "mpirun --bind-to core"
    works around the issue.  The MPI processes bind to specific cores
    and the embarrassingly parallel jobs kindly move over and stay out
    of the way.

    Are the mpirun --bind-to core child processes the same as a slurm
    task? I have no experience at all with MPI jobs -- just trying to
    understand task/affinity and params.

    As far as I understand, when you let mpirun do the binding, it
    handles the binding differently:
    https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php

    If I grok the
    % mpirun ... --map-by core --bind-to core
    example in the "Mapping, Ranking, and Binding: Oh My!" section right.
    On 06/03/16 10:18, Jason Bacon wrote:
    We're having an issue with CPU binding when two jobs land on the
    same node.

    Some cores are shared by the 2 jobs while others are left idle.
    Below
    [...]
    TaskPluginParam=cores,verbose
    don't you bind each _job_ to a single core, because you override
    automatic binding and thus prevent binding each child process to a
    different core?
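
    For context, the slurm.conf fragment under discussion presumably
    looks roughly like this (only the TaskPluginParam line appears
    above; the TaskPlugin line is an assumption):

        # slurm.conf excerpt (illustrative)
        TaskPlugin=task/affinity
        TaskPluginParam=cores,verbose   # core-granularity binding, verbose logging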


    Regards,
    Benjamin

    --
    All wars are civil wars, because all men are brothers ... Each one owes
    infinitely more to the human race than to the particular country in
    which he was born.
                   -- Francois Fenelon


