Hi Jason

It sounds like the srun executed inside each mpirun is not getting bound to a 
specific set of cores, or else we are not correctly picking that up and staying 
within it. So let me see if I fully understand the scenario, and please forgive 
this old fossil brain if you’ve explained all this before:

You are executing multiple parallel sbatch commands on the same nodes, with 
each sbatch requesting and being allocated only a subset of cores on those 
nodes. Within each sbatch, you are executing a single mpirun that launches an 
application.

Is that accurate? If so, I can try to replicate and test this here if you tell 
me how you built and configured SLURM (as I haven’t used their task/affinity 
plugin before).
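
In the meantime, here is a rough sketch of what I plan to run to try to 
reproduce it (the node name, core counts, and test app are placeholders on my 
end, not anything you posted):

    #!/bin/sh
    # repro.sbatch - submit it twice so both jobs land on the same node
    #SBATCH --nodes=1
    #SBATCH --ntasks=2
    #SBATCH --nodelist=node01
    mpirun --report-bindings ./hello_mpi

    $ sbatch repro.sbatch
    $ sbatch repro.sbatch

If the second job's ranks report the same cores as the first job's, then I'm 
seeing what you're seeing.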

Ralph

> On Jun 9, 2016, at 7:35 AM, Jason Bacon <[email protected]> wrote:
> 
> 
> 
> Thanks for all the suggestions, everyone.
> 
> A little more info:
> 
> I had to do a new OMPI build using --with-pmi.  Binding works correctly using 
> srun with this build, but mpirun still ignores the SLURM core assignments.
> 
> I also patched the task/affinity plugin for FreeBSD for the sake of 
> comparison (minor differences in the cpuset API).  It's not 100% yet, but it 
> appears that mpirun is ignoring the SLURM core assignments there as well.
> 
> Next question:
> 
> Is anyone out there seeing mpirun obey the core assignments from SLURM's 
> task/affinity plugin?  If so, I'd love to see your configure arguments for 
> both SLURM and OMPI.
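> 
> For comparison, the relevant bits of ours boil down to roughly this (prefix, 
> compiler, and other site-specific options trimmed):
> 
>    OMPI:   ./configure --with-slurm --with-pmi
>    SLURM:  TaskPlugin=task/affinity
>            TaskPluginParam=cores,verbose   (in slurm.conf)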
> 
> I have growing doubts that this interface is working, though.  I can imagine 
> this issue going unnoticed most of the time, because it will only cause a 
> problem when an OMPI job shares a node with another job using core binding, 
> which is infrequent on our clusters.  Even when that happens, it may still go 
> unnoticed unless someone is monitoring performance carefully, because the 
> only likely impact is a few processes running at 50% of their normal speed 
> because they're sharing a core.
> 
> I think this is worth fixing and I'd be happy to help with the coding and 
> testing.  We can't police how every user starts their MPI jobs, so it would 
> be good if binding worked properly no matter what launch method they use.
> 
> Thanks again,
> 
>    Jason
> 
> On 06/07/16 20:17, Ralph Castain wrote:
>> Yes, it should - provided the job step executing each mpirun has been given 
>> a unique binding. I suspect this is the problem you are encountering, but 
>> can’t know for certain. You could run an app that prints out its binding and 
>> then see if two parallel executions of srun yield different values.
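>> 
>> For example, something like this (a sketch, assuming a Linux node), run from 
>> inside each sbatch script:
>> 
>>     srun grep Cpus_allowed_list /proc/self/status
>> 
>> If the two jobs print different core lists, the job steps are being bound 
>> and the problem is on our end.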
>> 
>> 
>>> On Jun 7, 2016, at 5:26 PM, Jason Bacon <[email protected]> wrote:
>>> 
>>> 
>>> So this *should* work even for two separate MPI jobs sharing a node?
>>> 
>>> Thanks much,
>>> 
>>>    Jason
>>> 
>>> On 06/07/2016 09:09, Ralph Castain wrote:
>>>> Yes, it should. What’s odd is that mpirun launches its daemons using srun 
>>>> under the covers, and the daemon should therefore be bound. We detect that 
>>>> and use it, but I’m not sure why this isn’t working here.
>>>> 
>>>> 
>>>>> On Jun 7, 2016, at 6:52 AM, Bruce Roberts <[email protected]> wrote:
>>>>> 
>>>>> What happens if you use srun instead of mpirun? I would expect that to 
>>>>> work correctly.
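>>>>> (i.e., launching the ranks directly from the sbatch script with something 
>>>>> like "srun -n $SLURM_NTASKS ./myapp" rather than going through mpirun)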
>>>>> 
>>>>> On June 7, 2016 6:31:27 AM MST, Ralph Castain <[email protected]> wrote:
>>>>> 
>>>>>    No, we don’t pick that up - I suppose we could try. Those envars
>>>>>    have a history of changing, though, and it gets difficult to
>>>>>    match the version with the var.
>>>>> 
>>>>>    I can put this on my “nice to do someday” list and see if/when
>>>>>    we can get to it. Just so I don’t have to parse around more -
>>>>>    what version of slurm are you using?
>>>>> 
>>>>> 
>>>>>>    On Jun 7, 2016, at 6:15 AM, Jason Bacon <[email protected]> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>    Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_*
>>>>>>    when SLURM integration is compiled in?
>>>>>> 
>>>>>>    printenv in the sbatch script produces the following:
>>>>>> 
>>>>>>    Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: grep SBATCH slurm-5*
>>>>>>    slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
>>>>>>    slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>>>>>    slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>>>>>    slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
>>>>>>    slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
>>>>>>    slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>>>>>    slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>>>>>    slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
>>>>>> 
>>>>>>    All OpenMPI jobs are using cores 0 and 2, although SLURM has
>>>>>>    assigned cores 0 and 1 (mask 0x3) to job 579 and cores 2 and 3
>>>>>>    (mask 0xC) to job 580.
>>>>>> 
>>>>>>    Regards,
>>>>>> 
>>>>>>       Jason
>>>>>> 
>>>>>>    On 06/06/16 21:11, Ralph Castain wrote:
>>>>>>>    Running two jobs across the same nodes is indeed an issue.
>>>>>>>    Regardless of which MPI you use, the second mpiexec has no
>>>>>>>    idea that the first one exists. Thus, the bindings applied to
>>>>>>>    the second job will be computed as if the first job doesn’t
>>>>>>>    exist - and thus, the procs will overload on top of each other.
>>>>>>> 
>>>>>>>    The way you solve this with OpenMPI is by using the
>>>>>>>    -slot-list <foo> option. This tells each mpiexec which cores
>>>>>>>    are allocated to it, and it will constrain its binding
>>>>>>>    calculation within that envelope. Thus, if you start the
>>>>>>>    first job with -slot-list 0-2, and the second with -slot-list
>>>>>>>    3-5, the two jobs will be isolated from each other.
>>>>>>> 
>>>>>>>    You can use any specification for the slot-list - it takes a
>>>>>>>    comma-separated list of cores.
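>>>>>>>
>>>>>>>    Concretely, something along these lines (just a sketch; the process
>>>>>>>    counts and application name are placeholders):
>>>>>>>
>>>>>>>        # first job's script
>>>>>>>        mpiexec -np 3 -slot-list 0,1,2 ./myapp
>>>>>>>
>>>>>>>        # second job's script
>>>>>>>        mpiexec -np 3 -slot-list 3,4,5 ./myapp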
>>>>>>> 
>>>>>>>    HTH
>>>>>>>    Ralph
>>>>>>> 
>>>>>>>>    On Jun 6, 2016, at 6:08 PM, Jason Bacon <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>    Actually, --bind-to core is the default for most OpenMPI
>>>>>>>>    jobs now, so adding this flag has no effect.  It refers to
>>>>>>>>    the processes within the job.
>>>>>>>> 
>>>>>>>>    I'm thinking this is an MPI-SLURM integration issue.
>>>>>>>>    Embarrassingly parallel SLURM jobs are binding properly, but
>>>>>>>>    MPI jobs are ignoring the SLURM environment and choosing
>>>>>>>>    their own cores.
>>>>>>>> 
>>>>>>>>    OpenMPI was built with --with-slurm and it appears from
>>>>>>>>    config.log that it located everything it needed.
>>>>>>>> 
>>>>>>>>    I can work around the problem with "mpirun --bind-to none",
>>>>>>>>    which I'm guessing will impact performance slightly for
>>>>>>>>    memory-intensive apps.
>>>>>>>> 
>>>>>>>>    We're still digging on this one and may be for a while...
>>>>>>>> 
>>>>>>>>      Jason
>>>>>>>> 
>>>>>>>>    On 06/03/16 15:48, Benjamin Redling wrote:
>>>>>>>>>    On 2016-06-03 21:25, Jason Bacon wrote:
>>>>>>>>>>    It might be worth mentioning that the calcpi-parallel jobs
>>>>>>>>>>    are run with
>>>>>>>>>>    --array (no srun).
>>>>>>>>>> 
>>>>>>>>>>    Disabling the task/affinity plugin and using "mpirun
>>>>>>>>>>    --bind-to core"
>>>>>>>>>>    works around the issue.  The MPI processes bind to
>>>>>>>>>>    specific cores and
>>>>>>>>>>    the embarrassingly parallel jobs kindly move over and stay
>>>>>>>>>>    out of the way.
>>>>>>>>>    Are the mpirun --bind-to core child processes the same as a
>>>>>>>>>    slurm task?
>>>>>>>>>    I have no experience at all with MPI jobs -- just trying to
>>>>>>>>>    understand
>>>>>>>>>    task/affinity and params.
>>>>>>>>> 
>>>>>>>>>    As far as I understand, when you let mpirun do the binding it
>>>>>>>>>    handles the binding differently:
>>>>>>>>>    https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php
>>>>>>>>> 
>>>>>>>>>    If I grok the
>>>>>>>>>    % mpirun ... --map-by core --bind-to core
>>>>>>>>>    example in the "Mapping, Ranking, and Binding: Oh My!"
>>>>>>>>>    section right.
>>>>>>>>>>    On 06/03/16 10:18, Jason Bacon wrote:
>>>>>>>>>>>    We're having an issue with CPU binding when two jobs land
>>>>>>>>>>>    on the same
>>>>>>>>>>>    node.
>>>>>>>>>>> 
>>>>>>>>>>>    Some cores are shared by the 2 jobs while others are left
>>>>>>>>>>>    idle. Below
>>>>>>>>>    [...]
>>>>>>>>>>>    TaskPluginParam=cores,verbose
>>>>>>>>>    don't you bind each _job_ to a single core, because you override
>>>>>>>>>    automatic binding and thus prevent binding each child process to a
>>>>>>>>>    different core?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>    Regards,
>>>>>>>>>    Benjamin
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> -- 
> All wars are civil wars, because all men are brothers ... Each one owes
> infinitely more to the human race than to the particular country in
> which he was born.
>                -- Francois Fenelon
