Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

Daniel Gruber Wed, 17 Nov 2010 04:33:37 -0500

Hi, 

I'm interested in what is expected from OGE/SGE in order to support 
most of your scenarios. First of all, the "-binding pe" request is 
not flexible and only makes sense when every host has the 
same architecture, each involved host is 
used exclusively for the job (SGE exclusive job feature), 
and the same number of slots is allocated on each 
host (fixed allocation rule). SGE just writes out the 
socket,core tuples (determined on the master task host) in 
the pe_hostfile (the same for each host!). SGE does no 
binding itself. Therefore I think we should take a closer 
look at the more flexible "-binding [set] <strategy>".
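
To illustrate what "-binding pe" leaves to the application, here is a
minimal Python sketch that reads the socket,core tuples back out of a
pe_hostfile. The line layout used here (host, slots, queue, then
colon-separated socket,core pairs in the fourth column) is an assumption
for illustration and should be checked against a real pe_hostfile. The
function name and the sample lines are hypothetical.

```python
def parse_pe_hostfile(text):
    """Return {host: (slots, [(socket, core), ...])} from pe_hostfile text.

    Assumed layout per line: <host> <slots> <queue> <socket,core:socket,core...>
    """
    hosts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        host, slots, _queue, binding = fields[:4]
        pairs = []
        # Without a binding request the last column holds no tuple list.
        if "," in binding:
            for pair in binding.split(":"):
                socket, core = pair.split(",")
                pairs.append((int(socket), int(core)))
        hosts[host] = (int(slots), pairs)
    return hosts


# Hypothetical sample: the same tuples are written for every host.
sample = "node01 2 all.q@node01 0,0:0,1\nnode02 2 all.q@node02 0,0:0,1\n"
print(parse_pe_hostfile(sample))
```

Since SGE does no binding in this mode, the application would still have
to apply the parsed tuples itself on each host.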


1. One qrsh (-inherit) per slot

If a (legacy) parallel application does a qrsh for *each* granted 
slot (regardless of whether it targets the local host or a remote host), 
this should work out of the box with OGE/SGE with the 
"-binding linear:1" request in OGE tight integration. 
What might be confusing here is that "qstat -cb -j <jobno>" 
shows just one core as allocated (which is a bug). 
But at the host level (qstat -F m_topology_inuse) 
the allocated cores can be seen. This should work with 
different allocation rules.

2. One qrsh per host (OpenMPI case)

This should work under the following constraints:
- OGE tight integration (control_slaves true)
- fixed allocation schema (allocation_rule N)
Then all that is needed is to call qsub with 
"-binding linear:N". The master script on 
the master host and all orteds on the remote 
hosts are then bound (if there are free cores) to 
N successive cores. Here orted detects 
this and binds its threads, each to one of the 
detected cores (when the mpi command line parameter 
is present) - right? 
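
The detection step can be pictured with a minimal Python sketch. This is
not Open MPI's code: it only shows how a daemon started under an external
binding could read the set of cores it is allowed to use and assign one
core per child. The function names are hypothetical, and
os.sched_getaffinity is only available on Linux, so a fallback is
included.

```python
import os


def allowed_cores():
    """Return the sorted core ids this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Reflects any binding applied from outside (e.g. by the launcher).
        return sorted(os.sched_getaffinity(0))
    # Fallback where affinity cannot be queried: assume all cores.
    return list(range(os.cpu_count() or 1))


def plan_child_binding(cores, nprocs):
    """Assign one core to each of nprocs children, cycling if there are
    fewer cores than children."""
    return [cores[i % len(cores)] for i in range(nprocs)]


print(plan_child_binding(allowed_cores(), 2))
```

With N detected cores and N children every child gets its own core. If
the daemon is bound to a single core, all children end up on that core.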

What does not work is an OGE/SGE allocation_rule of
round robin or fill up, since the number of slots 
per host is unknown at submission time and differs 
from host to host. Am I right that this is currently the 
only drawback when using SGE and OpenMPI?
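
The constraint can be written down as a small check. This is a sketch
under the assumption that the PE's allocation_rule is available as a
string (an integer for a fixed rule, or $round_robin / $fill_up); the
helper name is hypothetical and it only derives the argument for
"-binding", it does not call qsub.

```python
def binding_request(allocation_rule):
    """Return the "-binding" argument matching a fixed allocation rule,
    or None when the per-host slot count is not known at submission."""
    rule = allocation_rule.strip()
    if rule.isdigit() and int(rule) > 0:
        # Fixed rule N: every host gets N slots, so request N cores.
        return "linear:" + str(int(rule))
    # $round_robin, $fill_up, etc.: slots per host differ between hosts.
    return None


for rule in ("4", "$round_robin", "$fill_up"):
    print(rule, "->", binding_request(rule))
```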

The next thing in the discussion was the alignment of 
cores and slots. Because the term "slots" is 
very flexible in SGE/OGE and does not in all cases 
reflect the number of cores (in the case of SMT, for example), 
a compiled-in mapping does not exist at the moment.
What people could do is enforce such a mapping 
via JSV scripts, which do the necessary reformulation 
of the request (modify #slots or #cores if necessary).
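
As an illustration of the kind of reformulation such a script might
perform, here is a minimal Python sketch of the arithmetic only. It does
not use the JSV interface; the function name and the assumption that one
slot corresponds to one hardware thread are hypothetical and would be a
site policy decision.

```python
def cores_for_slots(slots, threads_per_core):
    """Return the number of cores to request for a given slot count,
    assuming one slot per hardware thread (rounded up to whole cores)."""
    if slots < 1 or threads_per_core < 1:
        raise ValueError("slots and threads_per_core must be positive")
    # Integer ceiling division.
    return (slots + threads_per_core - 1) // threads_per_core


# Example: 8 slots on a host with 2 hardware threads per core.
print(cores_for_slots(8, 2))
```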

Did I miss any important points from the SGE/OGE point of 
view? 


Cheers

Daniel


On Tuesday, 16.11.2010, at 18:24 -0700, Ralph Castain wrote:
> 
> 
> On Tue, Nov 16, 2010 at 12:23 PM, Terry Dontje
> <terry.don...@oracle.com> wrote:
>         On 11/16/2010 01:31 PM, Reuti wrote: 
>         > Hi Ralph,
>         > 
>         > On 16.11.2010 at 15:40, Ralph Castain wrote:
>         > 
>         > > > 2. have SGE bind procs it launches to -all- of those cores. I 
> believe SGE does this automatically to constrain the procs to running on only 
> those cores.
>         > > This is another "bug/feature" in SGE: it's a matter of 
> discussion, whether the shepherd should get exactly one core (in case you use 
> more than one `qrsh` per node) for each call, or *all* cores assigned (which 
> we need right now, as the processes in Open MPI will be forks of the orte 
> daemon). About such a situation I filed an issue a long time ago and 
> "limit_to_one_qrsh_per_host yes/no" in the PE definition would do (this 
> setting should then also change the core allocation of the master process):
>         > > 
>         > > http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
>         > > 
>         > > I believe this is indeed the crux of the issue
>         > fantastic to share the same view.
>         > 
>         FWIW, I think I agree too.
>         
>         > > > 3. tell OMPI to --bind-to-core.
>         > > > 
>         > > > In other words, tell SGE to allocate a certain number of cores 
> on each node, but to bind each proc to all of them (i.e., don't bind a proc 
> to a specific core). I'm pretty sure that is a standard SGE option today (at 
> least, I know it used to be). I don't believe any patch or devel work is 
> required (to either SGE or OMPI).
>         > > When you use a fixed allocation_rule and a matching -binding 
> request it will work today. But any other case won't be distributed in the 
> correct way.
>         > > 
>         > > Is it possible to not include the -binding request? If SGE is 
> told to use a fixed allocation_rule, and to allocate (for example) 2 
> cores/node, then won't the orted see 
>         > > itself bound to two specific cores on each node?
>         > When you leave out the -binding, all jobs are allowed to run on any 
> core.
>         > 
>         > 
>         > > We would then be okay as the spawned children of orted would 
> inherit its binding. Just don't tell mpirun to bind the processes and the 
> threads of those MPI procs will be able to operate across the provided cores.
>         > > 
>         > > Or does SGE only allocate 2 cores/node in that case (i.e., 
> allocate, but no -binding given), but doesn't bind the orted to any two 
> specific cores? If so, then that would be a problem as the orted would think 
> itself unconstrained. If I understand the thread correctly, you're saying 
> that this is what happens today - true?
>         > Exactly. It won't apply any binding at all and orted would think of 
> being unlimited. I.e. limited only by the number of slots it should use 
> thereon.
>         > 
>         So I guess the question I have for Ralph.  I thought, and this
>         might be mixing some of the ideas Jeff and I've been talking
>         about, that when a RM executes the orted with a bound set of
>         resources (ie cores) that orted would bind the individual
>         processes on a subset of the bounded resources.  Is this not
>         really the case for 1.4.X branch?  I believe it is the case
>         for the trunk based on Jeff's refactoring.
> 
> 
> You are absolutely correct, Terry, and the 1.4 release series does
> include the proper code. The point here, though, is that SGE binds the
> orted to a single core, even though other cores are also allocated. So
> the orted detects an external binding of one core, and binds all its
> children to that same core.
> 
> 
> What I had suggested to Reuti was to not include the -binding flag to
> SGE in the hopes that SGE would then bind the orted to all the
> allocated cores. However, as I feared, SGE in that case doesn't bind
> the orted at all - and so we assume the entire node is available for
> our use.
> 
> 
> This is an SGE issue. We need them to bind the orted to -all- the
> allocated cores (and only those cores) in order for us to operate
> correctly.
> 
> 
>  
>         
>         
>         -- 
>         Oracle
>         Terry D. Dontje | Principal Software Engineer
>         Developer Tools Engineering | +1.781.442.2631
>         Oracle - Performance Technologies
>         95 Network Drive, Burlington, MA 01803
>         Email terry.don...@oracle.com
>         
>         
>         
>         
>         
>         
>         _______________________________________________
>         users mailing list
>         us...@open-mpi.org
>         http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Daniel Gruber | Software Engineer
Phone: +49 (0)941 3075-128  (x60128)
ORACLE Grid Engine Engineering
ORACLE Deutschland B.V. & Co. KG | Dr.-Leo-Ritter-Str. 7 | D-93049 Regensburg
