OMPI doesn’t use cgroups because we run at the user level, so we can’t set them on our child processes.
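Core confinement with cgroups therefore has to be set up on the resource-manager side. As a rough sketch only (the plugin combination and the cgroup.conf line are illustrative, not taken from this cluster's actual configuration), pairing SLURM's task/cgroup plugin with task/affinity would look something like:

   # slurm.conf
   TaskPlugin=task/cgroup,task/affinity
   TaskPluginParam=cores,verbose

   # cgroup.conf
   ConstrainCores=yes

With that, slurmd creates a cpuset cgroup for each job step as root, so everything launched inside the step - including whatever mpirun forks - inherits the confinement to the cores SLURM allocated.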
> On Jun 7, 2016, at 7:16 AM, Bruce Roberts <[email protected]> wrote:
>
> Not using cgroups?
>
> On June 7, 2016 7:10:19 AM PDT, Ralph Castain <[email protected]> wrote:
> Yes, it should. What’s odd is that mpirun launches its daemons using srun
> under the covers, and the daemon should therefore be bound. We detect that
> and use it, but I’m not sure why this isn’t working here.
>
>> On Jun 7, 2016, at 6:52 AM, Bruce Roberts <[email protected]> wrote:
>>
>> What happens if you use srun instead of mpirun? I would expect that to
>> work correctly.
>>
>> On June 7, 2016 6:31:27 AM MST, Ralph Castain <[email protected]> wrote:
>> No, we don’t pick that up - I suppose we could try. Those envars have a
>> history of changing, though, and it gets difficult to match the version
>> with the var.
>>
>> I can put this on my “nice to do someday” list and see if/when we can get
>> to it. Just so I don’t have to parse around more - what version of slurm
>> are you using?
>>
>>> On Jun 7, 2016, at 6:15 AM, Jason Bacon <[email protected]> wrote:
>>>
>>> Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_* when SLURM
>>> integration is compiled in?
>>>
>>> printenv in the sbatch script produces the following:
>>>
>>> Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379:
>>> grep SBATCH slurm-5*
>>> slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
>>> slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>> slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>> slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
>>> slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
>>> slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>> slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>> slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
>>>
>>> All OpenMPI jobs are using cores 0 and 2, although SLURM has assigned 0
>>> and 1 to job 579 and 2 and 3 to job 580.
>>>
>>> Regards,
>>>
>>> Jason
>>>
>>> On 06/06/16 21:11, Ralph Castain wrote:
>>>> Running two jobs across the same nodes is indeed an issue. Regardless of
>>>> which MPI you use, the second mpiexec has no idea that the first one
>>>> exists. Thus, the bindings applied to the second job will be computed as
>>>> if the first job doesn’t exist - and thus, the procs will overload on
>>>> top of each other.
>>>>
>>>> The way you solve this with OpenMPI is by using the -slot-list <foo>
>>>> option. This tells each mpiexec which cores are allocated to it, and it
>>>> will constrain its binding calculation within that envelope. Thus, if
>>>> you start the first job with -slot-list 0-2, and the second with
>>>> -slot-list 3-5, the two jobs will be isolated from each other.
>>>>
>>>> You can use any specification for the slot-list - it takes a
>>>> comma-separated list of cores.
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>>> On Jun 6, 2016, at 6:08 PM, Jason Bacon <[email protected]> wrote:
>>>>>
>>>>> Actually, --bind-to core is the default for most OpenMPI jobs now, so
>>>>> adding this flag has no effect. It refers to the processes within the
>>>>> job.
>>>>>
>>>>> I'm thinking this is an MPI-SLURM integration issue. Embarrassingly
>>>>> parallel SLURM jobs are binding properly, but MPI jobs are ignoring the
>>>>> SLURM environment and choosing their own cores.
>>>>>
>>>>> OpenMPI was built with --with-slurm, and it appears from config.log
>>>>> that it located everything it needed.
>>>>>
>>>>> I can work around the problem with "mpirun --bind-to none", which I'm
>>>>> guessing will impact performance slightly for memory-intensive apps.
>>>>>
>>>>> We're still digging on this one and may be for a while...
>>>>>
>>>>> Jason
>>>>>
>>>>> On 06/03/16 15:48, Benjamin Redling wrote:
>>>>>> On 2016-06-03 21:25, Jason Bacon wrote:
>>>>>>> It might be worth mentioning that the calcpi-parallel jobs are run
>>>>>>> with --array (no srun).
>>>>>>>
>>>>>>> Disabling the task/affinity plugin and using "mpirun --bind-to core"
>>>>>>> works around the issue. The MPI processes bind to specific cores and
>>>>>>> the embarrassingly parallel jobs kindly move over and stay out of
>>>>>>> the way.
>>>>>> Are the mpirun --bind-to core child processes the same as a slurm
>>>>>> task? I have no experience at all with MPI jobs -- just trying to
>>>>>> understand task/affinity and its params.
>>>>>>
>>>>>> As far as I understand, when you let mpirun do the binding, it handles
>>>>>> the binding differently:
>>>>>> https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php
>>>>>>
>>>>>> If I grok the
>>>>>> % mpirun ... --map-by core --bind-to core
>>>>>> example in the "Mapping, Ranking, and Binding: Oh My!" section right.
>>>>>>
>>>>>>> On 06/03/16 10:18, Jason Bacon wrote:
>>>>>>>> We're having an issue with CPU binding when two jobs land on the
>>>>>>>> same node.
>>>>>>>>
>>>>>>>> Some cores are shared by the 2 jobs while others are left idle.
>>>>>>>> Below
>>>>>> [...]
>>>>>>>> TaskPluginParam=cores,verbose
>>>>>> Don't you bind each _job_ to a single core because you override
>>>>>> automatic binding and thus prevent binding each child process to a
>>>>>> different core?
>>>>>>
>>>>>> Regards,
>>>>>> Benjamin
>>>>>
>>>>> --
>>>>> All wars are civil wars, because all men are brothers ... Each one owes
>>>>> infinitely more to the human race than to the particular country in
>>>>> which he was born.
>>>>> -- Francois Fenelon
>>>
>>> --
>>> All wars are civil wars, because all men are brothers ... Each one owes
>>> infinitely more to the human race than to the particular country in
>>> which he was born.
>>> -- Francois Fenelon
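To make the -slot-list suggestion quoted above concrete, a minimal sketch for two jobs sharing a node might look like the following (the program name calcpi-parallel, the -np counts, and --report-bindings are illustrative additions, not commands actually run in this thread):

   # job 1: constrain Open MPI's binding calculation to cores 0-2
   mpirun -np 3 --report-bindings -slot-list 0,1,2 ./calcpi-parallel

   # job 2, sharing the node: constrain it to cores 3-5
   mpirun -np 3 --report-bindings -slot-list 3,4,5 ./calcpi-parallel

--report-bindings prints each rank's binding at launch, which makes it easy to confirm that the two jobs no longer overlap.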

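For the "mpirun --bind-to none" workaround, a minimal sbatch script along these lines (the resource request and program name are illustrative, not the actual scripts used in this thread) shows SLURM's assigned mask next to the cores the job step is really allowed to run on:

   #!/bin/sh
   #SBATCH --ntasks=2

   # Mask SLURM assigned to this job (same variable shown in the slurm-*.out files above)
   echo "SBATCH_CPU_BIND=$SBATCH_CPU_BIND"

   # Cores the batch step is actually confined to by task/affinity
   grep Cpus_allowed_list /proc/self/status

   # Workaround from the thread: skip Open MPI's own binding so the ranks
   # simply inherit the affinity mask SLURM already applied
   mpirun --bind-to none ./calcpi-parallel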