Hi,

> On 11.12.2014, at 10:20, Kamel Mazouzi <[email protected]> wrote:
> 
> 
> Hi,
> 
> FYI : according to this thread 
> http://www.open-mpi.org/community/lists/users/2011/07/16931.php OMPI uses its 
> own binding scheme

There are some more details. It's true that Open MPI has its own scheme to 
bind to cores, which defaults to "start at socket/core = 0/0" for each and 
every `mpiexec` started. And taking:


> and core binding is enabled by default in version 1.8.

this default binding into account: each and every `mpiexec` starts on the same 
cores, even if they are started outside of any queuing system - e.g. just 
having three windows open on your workstation and issuing a plain `mpiexec` 
three times. This might be annoying for the end user (who might not even 
notice it, as in essence all computations still give the expected results), 
but it's just the way it's implemented right now.
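You can observe this directly on a workstation by running the same command in 
two terminals and comparing the reported bindings (the process count and 
program name below are just placeholders):

$ mpiexec --report-bindings -np 4 ./my_mpi_program

With the default binding, both instances will report the same set of cores.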

What can help:

$ mpiexec --bind-to none ...

This will also work when running under SGE: in this case Open MPI doesn't 
change the binding which was already imposed by SGE when using:

$ qsub -binding linear:1 ...

Small drawback: SGE grants a set of cores to `mpiexec` and the `orted`s, and 
within this set the Linux scheduler can still move the processes around - but 
that's maybe better than no binding at all.
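As an illustration, a job script for this approach might look like the 
following (the PE name "orte" and the slot count are only placeholders for 
your setup):

#!/bin/sh
#$ -pe orte 12
#$ -binding linear:1
mpiexec --bind-to none -np $NSLOTS ./my_mpi_program

and can be submitted with a plain `qsub job.sh`; the -binding request may of 
course also stay on the qsub command line instead.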

===

If you know for sure that, due to the defined queues, your job will run on one 
machine only: it should be possible to reformat the granted binding 
information - i.e. the $SGE_BINDING environment variable resp. the fourth 
column of the $PE_HOSTFILE, which are provided by:

$ qsub -binding env linear:1 ...
$ qsub -binding pe linear:1 ...

into a rankfile for Open MPI.
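A minimal sketch of such a conversion for the single-host case, assuming the 
fourth column of the $PE_HOSTFILE lists the granted cores as <socket>,<core> 
pairs separated by ":" (please check the sge_pe and mpirun man pages of your 
versions - the exact format is not guaranteed here):

#!/bin/sh
# build an Open MPI rankfile from the binding column of the $PE_HOSTFILE
HOST=$(awk '{print $1; exit}' "$PE_HOSTFILE")
CORES=$(awk '{print $4; exit}' "$PE_HOSTFILE")
RANKFILE="$TMPDIR/rankfile"
RANK=0
: > "$RANKFILE"
for PAIR in $(echo "$CORES" | tr ':' ' '); do
    SOCKET=${PAIR%,*}
    CORE=${PAIR#*,}
    echo "rank $RANK=$HOST slot=$SOCKET:$CORE" >> "$RANKFILE"
    RANK=$((RANK + 1))
done
mpiexec -rf "$RANKFILE" -np $RANK ./my_mpi_program

One rankfile entry per granted core is written, i.e. the number of ranks 
equals the number of cores listed in the fourth column.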

===

The above won't work for jobs spanning more than one machine: only at the time 
the `qrsh -inherit ...` is issued to reach another node is the binding 
selected and applied by SGE. Hence it's a) too late to prepare a rankfile for 
Open MPI, and b) as each `qrsh -inherit ...` might end up on a different core, 
it's useless to perform a `qrsh -inherit ...` in advance to get the 
information about the granted core. Hence the fourth column in the 
$PE_HOSTFILE also seems to be valid only for the master node when using 
`qrsh -binding pe linear:1 ...`.

Somehow I remember reading a post that Univa GE handles this differently: 
there the allocation of cores on all involved machines is already laid out 
when the job starts.

-- Reuti


> Regards
> 
> On Thu, Dec 11, 2014 at 10:08 AM, Michael Würsch <[email protected]> 
> wrote:
> Hello, 
> 
> With OGS/GE 2011.11 and OpenMPI 1.8.3 we have a problem with core/memory 
> binding when multiple OpenMPI jobs run on the same machine.
> 
> "qsub binding linear:1 job" works fine, if the job is the only one running on 
> the machine. As hwloc-ps and numastat show, each MPI thread is bound to one 
> core and allocates memory that belongs to the socket containing the core.
> 
> However, when two or more jobs run on the same machine, "-binding linear:1" 
> causes them to be bound to the same cores. For instance, when two jobs with 6 
> MPI threads each are started on a 12 core (2 x Xeon L5640, hyperthreading 
> switched off) machine, each of the two jobs is bound to these cores:
> 
> [lx012:16840] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././.][./././././.]
> [lx012:16840] MCW rank 1 bound to socket 1[core 6[hwt 0]]: 
> [./././././.][B/././././.]
> [lx012:16840] MCW rank 2 bound to socket 0[core 1[hwt 0]]: 
> [./B/./././.][./././././.]
> [lx012:16840] MCW rank 3 bound to socket 1[core 7[hwt 0]]: 
> [./././././.][./B/./././.]
> [lx012:16840] MCW rank 4 bound to socket 0[core 2[hwt 0]]: 
> [././B/././.][./././././.]
> [lx012:16840] MCW rank 5 bound to socket 1[core 8[hwt 0]]: 
> [./././././.][././B/././.]
> ("mpirun -report-binding" output)
> 
> Thus each MPI thread gets only 50% of a core and the remaining 6 cores are 
> not used.
> 
> This is clearly not what we want. Is there a communication problem between 
> grid engine and OpenMPI? We do not fully understand how the communication is 
> supposed to work. The machines file created by grid engine contains only 
> machine names, but no information about which cores to use on these machines.
> 
> One could fix the binding by specifying explicitly (as parameters or in a 
> machine file) which cores should be used by mpirun. However, grid engine 
> seems to provide only the information on which core the first MPI thread 
> should run. When "qsub -binding env linear:1" is used, grid engine sets 
> SGE_BINDING to 0 for the first job, 6 for the second job, 1 for the third 
> job, 7 for the fourth job and so on. However, to construct a machine file for 
> OpenMPI one needs to know all cores that are supposed to be used by the job.
> 
> How can we force grid engine and OpenMPI to manage core binding in a 
> reasonable way?
> 
> Maybe we are missing some OpenMPI setting we are not aware of (I thought 
> binding should be enabled by default). 
> If you need to know anything about our queue settings I could tell you that.
> 
> Thank you
> Michael
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
