Hi Jason,

It sounds like the srun executed inside each mpirun is not being bound to a specific set of cores, or else we are not correctly picking that binding up and staying within it. So let me see if I fully understand the scenario, and please forgive this old fossil brain if you've explained all this before:

You are executing multiple parallel sbatch commands on the same nodes, with each sbatch requesting and being allocated only a subset of the cores on those nodes. Within each sbatch script, you execute a single mpirun that launches an application. Is that accurate?

If so, I can try to replicate and test this here if you tell me how you built and configured SLURM, as I haven't used their task/affinity plugin before. Something along the lines of the sketch below is what I have in mind for reproducing it.
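This is purely an assumption on my part - the directives, script name, and the dummy ./app MPI binary are placeholders, and the /proc/self/status check is Linux-only, so your FreeBSD nodes would need a cpuset-based equivalent:

#!/bin/sh
# bind-test.sbatch: submitted twice back to back, so that both allocations
# land on the same node and each is handed two of its cores.
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1

# Show what SLURM actually bound this allocation to (Linux-specific check)
grep Cpus_allowed_list /proc/self/status
printenv | grep SBATCH_CPU_BIND

# One mpirun per sbatch, launching a small 2-rank MPI app (placeholder)
mpirun -np 2 ./app

I'd submit that twice in a row ("sbatch bind-test.sbatch" two times) and compare the Cpus_allowed_list and SBATCH_CPU_BIND values against the cores the MPI ranks actually land on.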
Ralph


> On Jun 9, 2016, at 7:35 AM, Jason Bacon <[email protected]> wrote:
>
> Thanks for all the suggestions, everyone.
>
> A little more info:
>
> I had to do a new OMPI build using --with-pmi. Binding works correctly using
> srun with this build, but mpirun still ignores the SLURM core assignments.
>
> I also patched the task/affinity plugin for FreeBSD for the sake of
> comparison (minor differences in the cpuset API). It's not 100% yet, but it
> appears that mpirun is ignoring the SLURM core assignments there as well.
>
> Next question:
>
> Is anyone out there seeing mpirun obey the core assignments from SLURM's
> task/affinity plugin? If so, I'd love to see your configure arguments for
> both SLURM and OMPI.
>
> I have growing doubts that this interface is working, though. I can imagine
> this issue going unnoticed most of the time, because it will only cause a
> problem when an OMPI job shares a node with another job using core binding,
> which is infrequent on our clusters. Even when that happens, it may still go
> unnoticed unless someone is monitoring performance carefully, because the
> only likely impact is a few processes running at 50% of their normal speed
> because they're sharing a core.
>
> I think this is worth fixing and I'd be happy to help with the coding and
> testing. We can't police how every user starts their MPI jobs, so it would
> be good if it worked properly no matter what they use.
>
> Thanks again,
>
> Jason
>
> On 06/07/16 20:17, Ralph Castain wrote:
>> Yes, it should - provided the job step executing each mpirun has been given
>> a unique binding. I suspect this is the problem you are encountering, but
>> can't know for certain. You could run an app that prints out its binding and
>> then see if two parallel executions of srun yield different values.
>>
>>> On Jun 7, 2016, at 5:26 PM, Jason Bacon <[email protected]> wrote:
>>>
>>> So this *should* work even for two separate MPI jobs sharing a node?
>>>
>>> Thanks much,
>>>
>>> Jason
>>>
>>> On 06/07/2016 09:09, Ralph Castain wrote:
>>>> Yes, it should. What's odd is that mpirun launches its daemons using srun
>>>> under the covers, and the daemon should therefore be bound. We detect that
>>>> and use it, but I'm not sure why this isn't working here.
>>>>
>>>>> On Jun 7, 2016, at 6:52 AM, Bruce Roberts <[email protected]> wrote:
>>>>>
>>>>> What happens if you use srun instead of mpirun? I would expect that to
>>>>> work correctly.
>>>>>
>>>>> On June 7, 2016 6:31:27 AM MST, Ralph Castain <[email protected]> wrote:
>>>>>
>>>>> No, we don't pick that up - I suppose we could try. Those envars
>>>>> have a history of changing, though, and it gets difficult to
>>>>> match the version with the var.
>>>>>
>>>>> I can put this on my "nice to do someday" list and see if/when
>>>>> we can get to it.
>>>>> Just so I don't have to parse around more -
>>>>> what version of slurm are you using?
>>>>>
>>>>>> On Jun 7, 2016, at 6:15 AM, Jason Bacon <[email protected]> wrote:
>>>>>>
>>>>>> Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_*
>>>>>> when SLURM integration is compiled in?
>>>>>>
>>>>>> printenv in the sbatch script produces the following:
>>>>>>
>>>>>> Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: grep SBATCH slurm-5*
>>>>>> slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
>>>>>> slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>>>>> slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>>>>> slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
>>>>>> slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
>>>>>> slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
>>>>>> slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
>>>>>> slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
>>>>>>
>>>>>> All OpenMPI jobs are using cores 0 and 2, although SLURM has
>>>>>> assigned 0 and 1 to job 579 and 2 and 3 to 580.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Jason
>>>>>>
>>>>>> On 06/06/16 21:11, Ralph Castain wrote:
>>>>>>> Running two jobs across the same nodes is indeed an issue.
>>>>>>> Regardless of which MPI you use, the second mpiexec has no
>>>>>>> idea that the first one exists. Thus, the bindings applied to
>>>>>>> the second job will be computed as if the first job doesn't
>>>>>>> exist - and thus, the procs will overload on top of each other.
>>>>>>>
>>>>>>> The way you solve this with OpenMPI is by using the
>>>>>>> -slot-list <foo> option. This tells each mpiexec which cores
>>>>>>> are allocated to it, and it will constrain its binding
>>>>>>> calculation within that envelope. Thus, if you start the
>>>>>>> first job with -slot-list 0-2, and the second with -slot-list
>>>>>>> 3-5, the two jobs will be isolated from each other.
>>>>>>>
>>>>>>> You can use any specification for the slot-list - it takes a
>>>>>>> comma-separated list of cores.
>>>>>>>
>>>>>>> HTH
>>>>>>> Ralph
>>>>>>>
>>>>>>>> On Jun 6, 2016, at 6:08 PM, Jason Bacon <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Actually, --bind-to core is the default for most OpenMPI
>>>>>>>> jobs now, so adding this flag has no effect. It refers to
>>>>>>>> the processes within the job.
>>>>>>>>
>>>>>>>> I'm thinking this is an MPI-SLURM integration issue.
>>>>>>>> Embarrassingly parallel SLURM jobs are binding properly, but
>>>>>>>> MPI jobs are ignoring the SLURM environment and choosing
>>>>>>>> their own cores.
>>>>>>>>
>>>>>>>> OpenMPI was built with --with-slurm and it appears from
>>>>>>>> config.log that it located everything it needed.
>>>>>>>>
>>>>>>>> I can work around the problem with "mpirun --bind-to none",
>>>>>>>> which I'm guessing will impact performance slightly for
>>>>>>>> memory-intensive apps.
>>>>>>>>
>>>>>>>> We're still digging on this one and may be for a while...
>>>>>>>>
>>>>>>>> Jason
>>>>>>>>
>>>>>>>> On 06/03/16 15:48, Benjamin Redling wrote:
>>>>>>>>> On 2016-06-03 21:25, Jason Bacon wrote:
>>>>>>>>>> It might be worth mentioning that the calcpi-parallel jobs
>>>>>>>>>> are run with --array (no srun).
>>>>>>>>>>
>>>>>>>>>> Disabling the task/affinity plugin and using "mpirun
>>>>>>>>>> --bind-to core" works around the issue.
>>>>>>>>>> The MPI processes bind to specific cores and
>>>>>>>>>> the embarrassingly parallel jobs kindly move over and stay
>>>>>>>>>> out of the way.
>>>>>>>>>
>>>>>>>>> Are the mpirun --bind-to core child processes the same as a
>>>>>>>>> slurm task? I have no experience at all with MPI jobs -- just
>>>>>>>>> trying to understand task/affinity and params.
>>>>>>>>>
>>>>>>>>> As far as I understand, when you let mpirun do the binding
>>>>>>>>> it handles the binding differently:
>>>>>>>>> https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php
>>>>>>>>> if I grok the
>>>>>>>>> % mpirun ... --map-by core --bind-to core
>>>>>>>>> example in the "Mapping, Ranking, and Binding: Oh My!"
>>>>>>>>> section right.
>>>>>>>>>
>>>>>>>>>> On 06/03/16 10:18, Jason Bacon wrote:
>>>>>>>>>>> We're having an issue with CPU binding when two jobs land
>>>>>>>>>>> on the same node.
>>>>>>>>>>>
>>>>>>>>>>> Some cores are shared by the 2 jobs while others are left
>>>>>>>>>>> idle. Below
>>>>>>>>> [...]
>>>>>>>>>>> TaskPluginParam=cores,verbose
>>>>>>>>>
>>>>>>>>> Don't you bind each _job_ to a single core, because you override
>>>>>>>>> automatic binding and thus prevent binding each child
>>>>>>>>> process to a different core?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Benjamin
>
> --
> All wars are civil wars, because all men are brothers ... Each one owes
> infinitely more to the human race than to the particular country in
> which he was born.
> -- Francois Fenelon
