Re: [OMPI users] Open MPI & Grid Engine/Grid Scheduler thread binding

2011-07-15 Thread Terry Dontje
Here's, hopefully, some more useful info. Note that a closer reading of
the job2core.pdf presentation mentioned earlier will also clarify a
couple of points (I've put those points inline below).


On 7/15/2011 12:01 AM, Ralph Castain wrote:

On Jul 14, 2011, at 5:46 PM, Jeff Squyres wrote:


Looping in the users mailing list so that Ralph and Oracle can comment...

Not entirely sure what I can contribute here, but I'll try - see below for some 
clarifications. I think the discussion here is based on some misunderstanding 
of how OMPI works.



On Jul 14, 2011, at 2:34 PM, Rayson Ho wrote:


(CC'ing Jeff from the Open-MPI project...)

On Thu, Jul 14, 2011 at 1:35 PM, Tad Kollar wrote:

As I thought more about it, I was afraid that might be the case, but hoped
sge_shepherd would do some magic for tightly-integrated jobs.

To SGE, if the individual tasks are not started by sge_shepherd, then the
only option is to set the binding mask to the whole allocation, which in
your original case was the entire system (48 CPUs).
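(A quick way to verify what mask a task actually inherited - a minimal
sketch, assuming Linux, where pid 0 means "the calling process":

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        /* Query this process's affinity mask. */
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            return 1;
        }

        printf("allowed cores:");
        for (int c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &mask))
                printf(" %d", c);
        printf("\n");
        return 0;
    }

Run as a task under the allocation described above, it should print all
48 cores; the same information also shows up in /proc/<pid>/status as
Cpus_allowed_list.)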



We're running OpenMPI 1.5.3 if that makes a difference. Do you know of
anyone using an MVAPICH2 1.6 pe that can handle binding?

OMPI uses its own binding scheme - we stick within the overall binding envelope 
given to us, but we don't use external bindings of individual procs. Reason is 
simple: SGE has no visibility into the MPI procs we spawn. All SGE sees is 
mpirun and the daemons (called orteds) we launch on each node, and so it can't 
provide a binding scheme for the MPI procs (it actually has no idea how many 
procs are on each node as OMPI's mapper can support a multitude of algorithms, 
all invisible to SGE).
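To illustrate what "envelope" means here (a conceptual sketch only, not
OMPI's actual code; "./mpi_app" is a made-up binary): the daemon reads
the mask it inherited and binds each child to one core chosen from within
that mask, so no child can land outside what the scheduler granted.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        cpu_set_t envelope;
        int cores[CPU_SETSIZE], ncores = 0;

        /* The envelope is whatever mask the daemon itself inherited. */
        if (sched_getaffinity(0, sizeof(envelope), &envelope) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        for (int c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &envelope))
                cores[ncores++] = c;

        /* Launch two procs, each bound to one core *inside* the envelope. */
        for (int i = 0; i < 2 && i < ncores; i++) {
            if (fork() == 0) {
                cpu_set_t one;
                CPU_ZERO(&one);
                CPU_SET(cores[i], &one);
                sched_setaffinity(0, sizeof(one), &one);
                execlp("./mpi_app", "mpi_app", (char *)NULL);
                _exit(127);   /* exec failed */
            }
        }
        while (wait(NULL) > 0)
            ;
        return 0;
    }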


However, page 14 of the job2core.pdf presentation describes how SGE will 
pass a rankfile to Open MPI, which is how SGE drives the binding it wants 
for an Open MPI job.
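For reference, such a rankfile looks roughly like this (hostnames and
slot numbers are invented; the syntax follows the Open MPI 1.5 mpirun
man page, where slot=0:1 means socket 0, core 1):

    # rank <N>=<host> slot=<socket>:<core>
    rank 0=node01 slot=0:0
    rank 1=node01 slot=0:1
    rank 2=node02 slot=0:0
    rank 3=node02 slot=0:1

It gets handed to mpirun with the -rf/--rankfile option, e.g.
"mpirun -np 4 -rf myrankfile ./mpi_app".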

I just downloaded Open MPI 1.5.4a and grepped the source; it looks like
it is not looking at the SGE_BINDING env variable that is set by SGE.

No, we don't. However, the orteds do check to see if they have been bound, and 
if so, to what processors. Those bindings are then used as an envelope limiting 
the processors we use to bind the procs we spawn.

I believe SGE_BINDING is an env-var through which SGE conveys the binding 
to use for the job; SGE will then, as mentioned above, generate a 
rankfile to be used by Open MPI.
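If I have the SGE side of that right (the -binding syntax below is per
the SGE 6.2u5 docs, and the PE name "orte" is just an example), the two
flavors would be requested at submission time like so:

    # "env" strategy: SGE only exports the chosen cores in SGE_BINDING,
    # leaving the actual binding to the PE/application:
    qsub -pe orte 8 -binding env linear:1 myjob.sh

    # "pe" strategy: SGE writes the chosen cores into the pe_hostfile,
    # from which a rankfile for Open MPI can be generated:
    qsub -pe orte 8 -binding pe linear:1 myjob.sh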



The serial case worked (its affinity list was '0' instead of '0-47'), so at
least we know that's in good shape :-)

Please also submit a few more jobs and see if the new hwloc code is
able to handle multiple jobs running on your AMD MC server.



My ultimate goal is for affinity support to be enabled and scheduled
automatically for all MPI users, i.e. without them having to do any more
than they would for a no-affinity job (otherwise I have a feeling most of
them would just ignore it). What do you think it will take to get to that
point?

We tried to do this once - I set a default param to auto-bind processes. Major 
error. I was lynched by the MPI user community until we removed that param.

Reason is simple: suppose you have MPI processes that launch threads. Remember, 
there is no thread-level binding out there - all the OS will let you do is bind 
at the process level. So now you bind someone's MPI process to some core(s), 
which forces all the threads from that process to stay within that 
binding, thereby potentially creating a horrendous thread-contention problem.
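That inheritance is easy to demonstrate (a minimal sketch, assuming
Linux/glibc; compile with gcc -pthread): every thread reports the same
mask the process was bound to.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Each thread prints how many cores its inherited mask allows. */
    static void *report(void *arg)
    {
        cpu_set_t mask;
        sched_getaffinity(0, sizeof(mask), &mask);
        printf("thread %d may run on %d core(s)\n",
               *(int *)arg, CPU_COUNT(&mask));
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        int id[4];

        for (int i = 0; i < 4; i++) {
            id[i] = i;
            pthread_create(&tid[i], NULL, report, &id[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Run it under "taskset -c 0 ./a.out" and all four threads report exactly
one core: they will time-slice that single core no matter how many cores
the node has.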

It doesn't take threading to cause problems - some applications just don't work 
as well when bound. It's true that the benchmarks generally do, but they aren't 
representative of real applications.

Bottom line: defaulting to binding processes is something the MPI community 
appears to have rejected, with reason. One might argue about whether or not 
they are correct, but that appears to be the consensus, and it is the position 
OMPI has adopted. User ignorance of when to bind and when not to bind is not a 
valid reason to impact everyone.



That's been my goal since 2008...

I started a mail thread, "processor affinity -- OpenMPI / batchsystem
integration", on the Open MPI list in 2008. In 2009 the conclusion was
that, according to Sun, the binding info is set in the environment and
Open MPI performs the binding itself (so I assumed that was done):

It is done - we just use OMPI's binding schemes and not the ones provided 
natively by SGE. Like I said above, SGE doesn't see the MPI procs and can't 
provide a binding pattern for them - so looking at the SUNW_MP_BIND envar is 
pointless.

Note that SUNW_MP_BIND has *nothing* to do with Open MPI; it is a way for 
SGE to feed binding information to OpenMP (note: no "I") applications. So 
Ralph is right that this env-var is pointless from an Open MPI perspective.



http://www.open-mpi.org/community/lists/users/2009/10/10938.php

Revisiting the presentation (see: job2core.pdf link at the above URL),
Sun's variable name is $SUNW_MP_BIND, so it is most likely Sun Cluster
Toolkit implementation specific rather than a feature in Open MPI --
and looking at the Open MPI code I don't see SUNW_MP_BIND referenced
anywhere.

Re: [OMPI users] Open MPI & Grid Engine/Grid Scheduler thread binding (was: New loadcheck)

2011-07-15 Thread Ralph Castain

> On Jul 14, 2011, at 2:34 PM, Rayson Ho wrote:
> 
>> I believe it is a matter of integrating the thread binding support
>> between the 2 -- both SGE & Open MPI support thread binding.


I don't believe this is accurate - certainly OMPI doesn't support thread-level 
binding, and I haven't seen an OS that does yet. Might happen someday... but I 
suspect you mean "process" and not "thread".


HTH
Ralph


>> The
>> harder part is to handle cross node binding as SGE binds threads
>> locally only (not directly controlled by qmaster) -- may be