Thanks for all the suggestions, everyone.
A little more info:
I had to do a new OMPI build using --with-pmi. Binding works correctly
using srun with this build, but mpirun still ignores the SLURM core
assignments.
I also patched the task/affinity plugin for FreeBSD for the sake of
comparison (minor differences in the cpuset API). It's not 100% yet,
but it appears that mpirun is ignoring the SLURM core assignments there
as well.
Next question:
Is anyone out there seeing mpirun obey the core assignments from SLURM's
task/affinity plugin? If so, I'd love to see your configure arguments
for both SLURM and OMPI.
I have growing doubts that this interface is working, though. I can
imagine this issue going unnoticed most of the time, because it will
only cause a problem when an OMPI job shares a node with another job
using core binding, which is infrequent on our clusters. Even when that
happens, it may still go unnoticed unless someone is monitoring
performance carefully, because the only likely impact is a few processes
running at 50% of their normal speed because they're sharing a core.
I think this is worth fixing and I'd be happy to help with the coding
and testing. We can't police how every user starts their MPI jobs, so
it would be good if it works properly no matter what they use.
Thanks again,
Jason
On 06/07/16 20:17, Ralph Castain wrote:
Yes, it should - provided the job step executing each mpirun has been
given a unique binding. I suspect this is the problem you are
encountering, but can’t know for certain. You could run an app that
prints out its binding and then see if two parallel executions of srun
yield different values.
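
A minimal sketch of such a binding-printing app, assuming Python 3 on
Linux. The function names are hypothetical, and os.sched_getaffinity is
Linux-only, so on other systems (FreeBSD, for instance) the fallback
simply lists every CPU and tells you nothing about the real binding.

```python
import os
import socket


def format_cpu_set(cpus):
    """Render a collection of CPU ids as a sorted, comma-separated string."""
    return ",".join(str(c) for c in sorted(cpus))


def current_binding():
    """Return the set of CPU ids this process is allowed to run on.

    Assumption: where os.sched_getaffinity is missing (non-Linux), we
    report all CPUs, which is only a placeholder and not a real binding.
    """
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))


if __name__ == "__main__":
    # SLURM sets SLURM_JOB_ID and SLURM_PROCID for tasks it launches, and
    # Open MPI sets OMPI_COMM_WORLD_RANK for its ranks. Outside those
    # launchers the values are reported as "?".
    job = os.environ.get("SLURM_JOB_ID", "?")
    rank = os.environ.get(
        "OMPI_COMM_WORLD_RANK", os.environ.get("SLURM_PROCID", "?")
    )
    print(
        f"host={socket.gethostname()} job={job} rank={rank} "
        f"cpus={format_cpu_set(current_binding())}"
    )
```

Saved under a name of your choosing, the script could be started once
with srun and once with mpirun inside two jobs that share a node. If
both jobs report the same CPU list, they have not been given distinct
bindings.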
On Jun 7, 2016, at 5:26 PM, Jason Bacon wrote:
So this *should* work even for two separate MPI jobs sharing a node?
Thanks much,
Jason
On 06/07/2016 09:09, Ralph Castain wrote:
Yes, it should. What’s odd is that mpirun launches its daemons using
srun under the covers, and the daemon should therefore be bound. We
detect that and use it, but I’m not sure why this isn’t working here.
On Jun 7, 2016, at 6:52 AM, Bruce Roberts wrote:
What happens if you use srun instead of mpirun? I would expect that
to work correctly.
On June 7, 2016 6:31:27 AM MST, Ralph Castain wrote:
No, we don’t pick that up - suppose we could try. Those envars
have a history of changing, though, and it gets difficult to
match the version with the var.
I can put this on my “nice to do someday” list and see if/when
we can get to it. Just so I don’t have to parse around more -
what version of slurm are you using?
On Jun 7, 2016, at 6:15 AM, Jason Bacon wrote:
Thanks for the tip, but does OpenMPI not use SBATCH_CPU_BIND_*
when SLURM integration is compiled in?
printenv in the sbatch script produces the following:
Linux login.finch bacon ~/Data/Testing/Facil/Software/Src/Bench/MPI 379: grep SBATCH slurm-5*
slurm-579.out:SBATCH_CPU_BIND_LIST=0x3
slurm-579.out:SBATCH_CPU_BIND_VERBOSE=verbose
slurm-579.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
slurm-579.out:SBATCH_CPU_BIND=verbose,mask_cpu:0x3
slurm-580.out:SBATCH_CPU_BIND_LIST=0xC
slurm-580.out:SBATCH_CPU_BIND_VERBOSE=verbose
slurm-580.out:SBATCH_CPU_BIND_TYPE=mask_cpu:
slurm-580.out:SBATCH_CPU_BIND=verbose,mask_cpu:0xC
All OpenMPI jobs are using cores 0 and 2, although SLURM has
assigned 0 and 1 to job 579 and 2 and 3 to 580.
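
For reference, the masks above can be decoded mechanically. This is a
small sketch, assuming each mask is a hexadecimal bitmap in which bit N
stands for CPU N; the function names are hypothetical, and splitting the
list on commas (one mask per task) is an assumption about the format of
the variable.

```python
def mask_to_cores(mask):
    """Decode a hex CPU mask such as '0x3' into a sorted list of CPU ids."""
    value = int(mask.strip(), 16)  # base 16 accepts an optional 0x prefix
    cores = []
    bit = 0
    while value:
        if value & 1:
            cores.append(bit)
        value >>= 1
        bit += 1
    return cores


def parse_bind_list(value):
    """Decode a comma-separated list of masks into one core list per mask."""
    return [mask_to_cores(m) for m in value.split(",") if m.strip()]


if __name__ == "__main__":
    # The two masks reported for jobs 579 and 580.
    for mask in ("0x3", "0xC"):
        print(mask, "->", mask_to_cores(mask))
    # 0x3 -> [0, 1]
    # 0xC -> [2, 3]
```

This matches the assignment described: 0x3 covers CPUs 0 and 1, and 0xC
covers CPUs 2 and 3.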
Regards,
Jason
On 06/06/16 21:11, Ralph Castain wrote:
Running two jobs across the same nodes is indeed an issue.
Regardless of which MPI you use, the second mpiexec has no
idea that the first one exists. Thus, the bindings applied to
the second job will be computed as if the first job doesn’t
exist - and thus, the procs will overload on top of each other.
The way you solve this with OpenMPI is by using the
-slot-list <foo> option. This tells each mpiexec which cores
are allocated to it, and it will constrain its binding
calculation within that envelope. Thus, if you start the
first job with -slot-list 0-2, and the second with -slot-list
3-5, the two jobs will be isolated from each other.
You can use any specification for the slot-list - it takes a
comma-separated list of cores.
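
As an illustration only, the two isolated launches could be assembled
like this. The sketch just builds and prints the command lines and does
not run them; the helper name and "./my_mpi_app" are hypothetical, and
whether the core numbers passed to -slot-list line up with the numbering
SLURM uses is not established here, so check the mpirun man page for
your Open MPI version.

```python
import shlex


def slot_list_command(cores, program, extra_args=()):
    """Build an mpirun argument vector restricted to the given cores.

    The -slot-list option is the one named in this message; the cores
    are emitted as a comma-separated list.
    """
    unique = sorted(set(cores))
    if not unique:
        raise ValueError("at least one core is required")
    slot_list = ",".join(str(c) for c in unique)
    return ["mpirun", "-slot-list", slot_list, program, *extra_args]


if __name__ == "__main__":
    # First job on cores 0-2, second job on cores 3-5, as in the example.
    first = slot_list_command(range(0, 3), "./my_mpi_app")
    second = slot_list_command(range(3, 6), "./my_mpi_app")
    print(shlex.join(first))
    print(shlex.join(second))
```

Building an argument vector rather than a single string keeps quoting
safe if the result is later handed to subprocess.run.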
HTH
Ralph
On Jun 6, 2016, at 6:08 PM, Jason Bacon wrote:
Actually, --bind-to core is the default for most OpenMPI
jobs now, so adding this flag has no effect. It refers to
the processes within the job.
I'm thinking this is an MPI-SLURM integration issue.
Embarrassingly parallel SLURM jobs are binding properly, but
MPI jobs are ignoring the SLURM environment and choosing
their own cores.
OpenMPI was built with --with-slurm and it appears from
config.log that it located everything it needed.
I can work around the problem with "mpirun --bind-to none",
which I'm guessing will impact performance slightly for
memory-intensive apps.
We're still digging on this one and may be for a while...
Jason
On 06/03/16 15:48, Benjamin Redling wrote:
On 2016-06-03 21:25, Jason Bacon wrote:
It might be worth mentioning that the calcpi-parallel jobs are run
with --array (no srun).
Disabling the task/affinity plugin and using "mpirun --bind-to core"
works around the issue. The MPI processes bind to specific cores and
the embarrassingly parallel jobs kindly move over and stay out of
the way.
Are the mpirun --bind-to core child processes the same as a slurm
task? I have no experience at all with MPI jobs -- just trying to
understand task/affinity and params.
As far as I understand, when you let mpirun do the binding it handles
the binding differently:
https://www.open-mpi.org/doc/v1.8/man1/mpirun.1.php
That is, if I grok the
% mpirun ... --map-by core --bind-to core
example in the "Mapping, Ranking, and Binding: Oh My!" section right.
On 06/03/16 10:18, Jason Bacon wrote:
We're having an issue with CPU binding when two jobs land on the
same node.
Some cores are shared by the 2 jobs while others are left idle. Below
[...]
TaskPluginParam=cores,verbose
Don't you bind each _job_ to a single core, because you override
automatic binding and thus prevent binding each child process to a
different core?
Regards,
Benjamin
--
All wars are civil wars, because all men are brothers ... Each one owes
infinitely more to the human race than to the particular country in
which he was born.
-- Francois Fenelon