OMPI doesn’t use cgroups because we run at the user level, so we can’t set them
on our child processes.
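
A rough sketch of what kernel-enforced confinement would look like on the Slurm
side, shown only for contrast with what mpirun can do as an unprivileged user
(admin-level config, not a recommendation for any particular site):

   # slurm.conf
   TaskPlugin=task/cgroup,task/affinity

   # cgroup.conf
   ConstrainCores=yes

With that in place slurmd (running as root) confines each job step to its
allocated cores, and mpirun can only bind within whatever cpuset it inherits.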


> On Jun 7, 2016, at 7:16 AM, Bruce Roberts <[email protected]> wrote:
> 
> Not using cgroups? 
> 
> On June 7, 2016 7:10:19 AM PDT, Ralph Castain <[email protected]> wrote:
> Yes, it should. What’s odd is that mpirun launches its daemons using srun 
> under the covers, and the daemon should therefore be bound. We detect that 
> and use it, but I’m not sure why this isn’t working here.
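> 
> One quick way to see what an srun-launched step actually inherits (just a
> generic diagnostic, nothing Open MPI-specific) is to run, inside the same
> allocation:
> 
>     srun grep Cpus_allowed_list /proc/self/status
> 
> If that already shows the right cores, the daemons are being bound correctly
> and the problem is in how the MPI procs are placed afterwards.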
> 
> 
>> On Jun 7, 2016, at 6:52 AM, Bruce Roberts <[email protected]> wrote:
>> 
>> What happens if you use srun instead of mpirun? I would expect that to work 
>> correctly. 
>> 
>> On June 7, 2016 6:31:27 AM MST, Ralph Castain <[email protected]> wrote:
>> No, we don’t pick that up - I suppose we could try. Those envars have a
>> history of changing, though, and it gets difficult to match the version with
>> the var.
>> 
>> I can put this on my “nice to do someday” list and see if/when we can get to 
>> it. Just so I don’t have to parse around more - what version of slurm are 
>> you using?
>> 
>> 
>>> On Jun 7, 2016, at 6:15 AM, Jason Bacon <[email protected]> wrote:
>>> 
>>> 
>>> 
>>> Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_* when SLURM 
>>> integration is compiled in?
>>> 
>>> printenv in the sbatch script produces the following:
>>> 
>>> Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: 
>>> grep SBATCH slurm-5*
>>> slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
>>> slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>> slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>> slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
>>> slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
>>> slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>> slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>> slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
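>>> 
>>> For what it's worth, here is a rough way to turn that mask into an explicit
>>> core list by hand (a sketch only; assumes bash, an mpirun that accepts
>>> --cpu-set, and ./hello_mpi as a placeholder binary):
>>> 
>>> mask=${SBATCH_CPU_BIND##*mask_cpu:}      # e.g. 0x3
>>> cores=""; i=0; bits=$(( mask ))
>>> while [ "$bits" -gt 0 ]; do
>>>     [ $(( bits & 1 )) -eq 1 ] && cores="${cores:+$cores,}$i"
>>>     bits=$(( bits >> 1 )); i=$(( i + 1 ))
>>> done
>>> mpirun --cpu-set "$cores" --bind-to core ./hello_mpi    # cores=0,1 for 0x3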
>>> 
>>> All OpenMPI jobs are using cores 0 and 2, even though SLURM assigned cores 0
>>> and 1 to job 579 and cores 2 and 3 to job 580.
>>> 
>>> Regards,
>>> 
>>>    Jason
>>> 
>>> On 06/06/16 21:11, Ralph Castain wrote:
>>>> Running two jobs across the same nodes is indeed an issue. Regardless of 
>>>> which MPI you use, the second mpiexec has no idea that the first one 
>>>> exists. Thus, the bindings applied to the second job will be computed as 
>>>> if the first job doesn’t exist - and thus, the procs will overload on top 
>>>> of each other.
>>>> 
>>>> The way you solve this with OpenMPI is by using the -slot-list <foo> 
>>>> option. This tells each mpiexec which cores are allocated to it, and it 
>>>> will constrain its binding calculation within that envelope. Thus, if you 
>>>> start the first job with -slot-list 0-2, and the second with -slot-list 
>>>> 3-5, the two jobs will be isolated from each other.
>>>> 
>>>> You can use any specification for the slot-list - it takes a 
>>>> comma-separated list of cores.
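>>>> 
>>>> For example (a sketch with made-up binary names, first job on cores 0-2 and
>>>> second on cores 3-5):
>>>> 
>>>> % mpirun -np 3 -slot-list 0,1,2 ./job_a
>>>> % mpirun -np 3 -slot-list 3,4,5 ./job_b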
>>>> 
>>>> HTH
>>>> Ralph
>>>> 
>>>>> On Jun 6, 2016, at 6:08 PM, Jason Bacon <[email protected]> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> Actually, --bind-to core is the default for most OpenMPI jobs now, so 
>>>>> adding this flag has no effect.  It refers to the processes within the 
>>>>> job.
>>>>> 
>>>>> I'm thinking this is an MPI-SLURM integration issue. Embarrassingly 
>>>>> parallel SLURM jobs are binding properly, but MPI jobs are ignoring the 
>>>>> SLURM environment and choosing their own cores.
>>>>> 
>>>>> OpenMPI was built with --with-slurm and it appears from config.log that 
>>>>> it located everything it needed.
>>>>> 
>>>>> I can work around the problem with "mpirun --bind-to none", which I'm 
>>>>> guessing will impact performance slightly for memory-intensive apps.
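>>>>> 
>>>>> For anyone following along, the workaround looks roughly like this in the
>>>>> batch script (contents and binary name are made up):
>>>>> 
>>>>> #!/bin/sh
>>>>> #SBATCH --ntasks=2
>>>>> mpirun --bind-to none ./calcpi-mpi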
>>>>> 
>>>>> We're still digging on this one and may be for a while...
>>>>> 
>>>>>   Jason
>>>>> 
>>>>> On 06/03/16 15:48, Benjamin Redling wrote:
>>>>>> On 2016-06-03 21:25, Jason Bacon wrote:
>>>>>>> It might be worth mentioning that the calcpi-parallel jobs are run with
>>>>>>> --array (no srun). 
>>>>>>> 
>>>>>>> Disabling the task/affinity plugin and using "mpirun --bind-to core"
>>>>>>> works around the issue.  The MPI processes bind to specific cores and
>>>>>>> the embarrassingly parallel jobs kindly move over and stay out of the 
>>>>>>> way.
>>>>>> Are the mpirun --bind-to core child processes the same as a slurm task?
>>>>>> I have no experience at all with MPI jobs -- just trying to understand
>>>>>> task/affinity and params.
>>>>>> 
>>>>>> As far as I understand, when you let mpirun do the binding it handles the
>>>>>> binding differently: https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php
>>>>>> 
>>>>>> That is, if I grok the
>>>>>> % mpirun ... --map-by core --bind-to core
>>>>>> example in the "Mapping, Ranking, and Binding: Oh My!" section right.
>>>>>> 
>>>>>>> On 06/03/16 10:18, Jason Bacon wrote:
>>>>>>>> We're having an issue with CPU binding when two jobs land on the same
>>>>>>>> node.
>>>>>>>> 
>>>>>>>> Some cores are shared by the 2 jobs while others are left idle. Below
>>>>>> [...]
>>>>>>>> TaskPluginParam=cores,verbose
>>>>>> don't you bind each _job_ to a single core, because you override
>>>>>> automatic binding and thus prevent binding each child process to a
>>>>>> different core?
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Benjamin
>>>>> 
>>>>> 
>>>>> --
>>>>> All wars are civil wars, because all men are brothers ... Each one owes
>>>>> infinitely more to the human race than to the particular country in
>>>>> which he was born.
>>>>>               -- Francois Fenelon
>>>> 
>>> 
>>> 
>>> -- 
>>> All wars are civil wars, because all men are brothers ... Each one owes
>>> infinitely more to the human race than to the particular country in
>>> which he was born.
>>>                -- Francois Fenelon
>> 
> 
