Re: [OMPI devel] ob1 and req->req_state

2008-06-23 Thread Shipman, Galen M.


On Jun 23, 2008, at 5:51 PM, Jeff Squyres wrote:

Ah -- I see -- we have 2 different fields with the same name (just  
different places within the struct hierarchy) with different  
meanings.  That was a good idea.  ;-)



exactly


Thanks; that actually helps understand things quite a bit.


On Jun 23, 2008, at 5:45 PM, Shipman, Galen M. wrote:

Oh, I see, you are confusing the req_state on the OMPI request  
with the req_state on the PML request.


The OMPI request state is for persistent requests; the PML request
state is not, and it does not use that enum.
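A minimal sketch of the layering being described here (simplified,
hypothetical type names -- not the actual Open MPI declarations):

    #include <stdint.h>

    /* Base request: this req_state is the persistent-request enum. */
    typedef enum { REQ_INVALID, REQ_INACTIVE, REQ_ACTIVE } base_req_state_t;

    struct base_request_t {
        volatile base_req_state_t req_state;   /* OMPI-level request state */
    };

    /* The PML (ob1) request "derives" from the base by embedding it; its own
     * req_state is an independent completion counter, not the enum above. */
    struct pml_request_t {
        struct base_request_t super;   /* carries super.req_state */
        volatile int32_t req_state;    /* PML-level counter */
    };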


- Galen


On Jun 23, 2008, at 5:18 PM, Jeff Squyres wrote:


On Jun 23, 2008, at 4:43 PM, Shipman, Galen M. wrote:

We currently use req_state to track that we have received both the RNDV
completion and the RNDV ack prior to freeing the request.


Does that mean you're not using the enum values, but rather just using
the field to indicate that the value is >= 0?
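For illustration, a sketch of the counting pattern being described
(simplified from whatever ob1 actually does; the free helper is hypothetical):

    req->req_state = 2;   /* two events still outstanding: RNDV completion + ack */

    /* ... when the RNDV completion arrives ... */
    if (0 == OPAL_THREAD_ADD32(&req->req_state, -1)) {
        send_request_free(req);   /* hypothetical cleanup helper */
    }

    /* ... when the RNDV ack arrives (possibly on another thread) ... */
    if (0 == OPAL_THREAD_ADD32(&req->req_state, -1)) {
        send_request_free(req);
    }

Whichever event is handled last sees the counter reach zero and frees the
request, which is why the decrement has to be atomic.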


--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems





Re: [OMPI devel] ob1 and req->req_state

2008-06-23 Thread Jeff Squyres
Ah -- I see -- we have 2 different fields with the same name (just  
different places within the struct hierarchy) with different  
meanings.  That was a good idea.  ;-)


Thanks; that actually helps understand things quite a bit.


On Jun 23, 2008, at 5:45 PM, Shipman, Galen M. wrote:

Oh, I see, you are confusing the req_state on the OMPI request with  
the req_state on the PML request.


The OMPI request state is for persistent requests; the PML request
state is not, and it does not use that enum.


- Galen


On Jun 23, 2008, at 5:18 PM, Jeff Squyres wrote:


On Jun 23, 2008, at 4:43 PM, Shipman, Galen M. wrote:

We currently use req_state to track that we have received both the RNDV
completion and the RNDV ack prior to freeing the request.


Does that mean you're not using the enum values, but rather just using
the field to indicate that the value is >= 0?


--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] ob1 and req->req_state

2008-06-23 Thread Shipman, Galen M.
Oh, I see, you are confusing the req_state on the OMPI request with  
the req_state on the PML request.


The OMPI request state is for persistent requests; the PML request
state is not, and it does not use that enum.


- Galen


On Jun 23, 2008, at 5:18 PM, Jeff Squyres wrote:


On Jun 23, 2008, at 4:43 PM, Shipman, Galen M. wrote:

We currently use req_state to track that we have received both the RNDV
completion and the RNDV ack prior to freeing the request.


Does that mean you're not using the enum values, but rather just using
the field to indicate that the value is >= 0?


--
Jeff Squyres
Cisco Systems





Re: [OMPI devel] PML selection logic

2008-06-23 Thread Ralph H Castain
Okay, so let's explore an alternative that preserves the support you are
seeking for the "ignorant user", but doesn't penalize everyone else. What we
could do is simply set things up so that:

1. if -mca pml xyz is provided, then no modex data is added

2. if it is not provided, then only rank=0 inserts the data. All other procs
simply check their own selection against the one given by rank=0

Now, if a knowledgeable user or sys admin specifies what to use for their
system, we won't penalize their startup time. A user who doesn't know what
to do gets to run, albeit less scalably on startup.

Looking forward from there, we can look to a day where failing to initialize
something that exists on the system could be detected in some other fashion,
letting the local proc abort since it would know that other procs that
detected similar capabilities may well have selected that PML. For now,
though, this would solve the problem.

Make sense?
Ralph
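A rough sketch of the rank=0-only check described above; modex_put() and
modex_get() are hypothetical stand-ins for the real modex calls, and the
error handling is deliberately simplistic:

    #include <string.h>

    static int check_pml_selection(int my_rank, const char *my_pml)
    {
        if (0 == my_rank) {
            modex_put("pml", my_pml);            /* rank 0 publishes once */
            return OMPI_SUCCESS;
        }
        /* everyone else compares against rank 0's choice after the exchange */
        const char *rank0_pml = modex_get(0, "pml");
        if (NULL == rank0_pml || 0 != strcmp(rank0_pml, my_pml)) {
            return OMPI_ERROR;                   /* mismatch: abort cleanly */
        }
        return OMPI_SUCCESS;
    }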



On 6/23/08 1:31 PM, "Brian W. Barrett"  wrote:

> The problem is that we default to OB1, but that's not the right choice for
> some platforms (like Pathscale / PSM), where there's a huge performance
> hit for using OB1.  So we run into a situation where a user installs Open
> MPI, starts running, gets horrible performance, bad-mouths Open MPI, and
> now we're in that game again.  Yeah, the sys admin should know what to do,
> but it doesn't always work that way.
> 
> Brian
> 
> 
> On Mon, 23 Jun 2008, Ralph H Castain wrote:
> 
>> My fault - I should be more precise in my language. ;-/
>> 
>> #1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
>> to me that a simpler solution to what you describe is for the user to
>> specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
>> with the failed-to-initialize problem cleanly by having the proc directly
>> abort.
>> 
>> Again, sometimes I think we attempt to automate too many things. This seems
>> like a pretty clear case where you know what you want - the sys admin, if
>> nobody else, can certainly set that mca param in the default param file!
>> 
>> Otherwise, it seems to me that you are relying on the modex to detect that
>> your proc failed to init the correct subsystem. I hate to force a modex just
>> for that - if so, then perhaps this could again be a settable option to
>> avoid requiring non-scalable behavior for those of us who want scalability?
>> 
>> 
>> On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:
>> 
>>> The selection code was added because frequently high speed interconnects
>>> fail to initialize properly due to random stuff happening (yes, that's a
>>> horrible statement, but true).  We ran into a situation with some really
> flaky machines where most of the processes would choose CM, but a couple
> would fail to initialize the MTL and therefore choose OB1.  This led to a
>>> hang situation, which is the worst of the worst.
>>> 
>>> I think #1 is adequate, although it doesn't handle spawn particularly
>>> well.  And spawn is generally used in environments where such network
>>> mismatches are most likely to occur.
>>> 
>>> Brian
>>> 
>>> 
>>> On Mon, 23 Jun 2008, Ralph H Castain wrote:
>>> 
 Since my goal is to eliminate the modex completely for managed
 installations, could you give me a brief understanding of this eventual PML
 selection logic? It would help to hear an example of how and why different
 procs could get different answers - and why we would want to allow them to
 do so.
 
 Thanks
 Ralph
 
 
 
 On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:
 
> The first approach sounds fair enough to me. We should avoid 2 and 3
> as the pml selection mechanism used to be
> more complex before we reduced it to accommodate a major design bug in
> the BTL selection process. When using the complete PML selection, BTL
> would be initialized several times, leading to a variety of bugs.
> Eventually the PML selection should return to its old self, when the
> BTL bug gets fixed.
> 
> Aurelien
> 
> Le 23 juin 08 à 12:36, Ralph H Castain a écrit :
> 
>> Yo all
>> 
>> I've been doing further research into the modex and came across
>> something I
>> don't fully understand. It seems we have each process insert into
>> the modex
>> the name of the PML module that it selected. Once the modex has
>> exchanged
>> that info, it then loops across all procs in the job to check their
>> selection, and aborts if any proc picked a different PML module.
>> 
>> All well and good...assuming that procs actually -can- choose
>> different PML
>> modules and hence create an "abort" scenario. However, if I look
>> inside the
>> PML's at their selection logic, I find that a proc can ONLY pick a
>> module
>> other than ob1 if:
>> 
>> 1. the user 

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Brian W. Barrett
The problem is that we default to OB1, but that's not the right choice for 
some platforms (like Pathscale / PSM), where there's a huge performance 
hit for using OB1.  So we run into a situation where a user installs Open
MPI, starts running, gets horrible performance, bad-mouths Open MPI, and
now we're in that game again.  Yeah, the sys admin should know what to do, 
but it doesn't always work that way.


Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?


On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:


The selection code was added because frequently high speed interconnects
fail to initialize properly due to random stuff happening (yes, that's a
horrible statement, but true).  We ran into a situation with some really
flaky machines where most of the processes would choose CM, but a couple
would fail to initialize the MTL and therefore choose OB1.  This led to a
hang situation, which is the worst of the worst.

I think #1 is adequate, although it doesn't handle spawn particularly
well.  And spawn is generally used in environments where such network
mismatches are most likely to occur.

Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:


The first approach sounds fair enough to me. We should avoid 2 and 3
as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major design bug in
the BTL selection process. When using the complete PML selection, BTL
would be initialized several times, leading to a variety of bugs.
Eventually the PML selection should return to its old self, when the
BTL bug gets fixed.

Aurelien

Le 23 juin 08 à 12:36, Ralph H Castain a écrit :


Yo all

I've been doing further research into the modex and came across
something I
don't fully understand. It seems we have each process insert into
the modex
the name of the PML module that it selected. Once the modex has
exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose
different PML
modules and hence create an "abort" scenario. However, if I look
inside the
PML's at their selection logic, I find that a proc can ONLY pick a
module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case,
since the
mca param is propagated, ALL procs have no choice but to pick that
same
module, so that can't cause us to abort (we will have already
returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and
that it is
other than "psm". In this case, the CM module will be selected
because its
default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me
that you
either have the required capability or you don't. I can see that in
some
environments (e.g., rsh across unmanaged collections of machines),
it might
be possible for someone to launch across a set of machines where
some do and
some don't have the required support. However, in all other cases,
this will
be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should
feel free
to confirm or correct it), it seems to me that this could be
streamlined via
one or more means:

1. at the most, we could have rank=0 add the PML module name to the
modex,
and other procs simply check it against their own and return an
error if
they differ. This accomplishes the identical functionality to what
we have
today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by

Re: [OMPI devel] PML selection logic

2008-06-23 Thread Ralph H Castain
My fault - I should be more precise in my language. ;-/

#1 is not adequate, IMHO, as it forces us to -always- do a modex. It seems
to me that a simpler solution to what you describe is for the user to
specify -mca pml ob1, or -mca pml cm. If the latter, then you could deal
with the failed-to-initialize problem cleanly by having the proc directly
abort.

Again, sometimes I think we attempt to automate too many things. This seems
like a pretty clear case where you know what you want - the sys admin, if
nobody else, can certainly set that mca param in the default param file!

Otherwise, it seems to me that you are relying on the modex to detect that
your proc failed to init the correct subsystem. I hate to force a modex just
for that - if so, then perhaps this could again be a settable option to
avoid requiring non-scalable behavior for those of us who want scalability?


On 6/23/08 1:21 PM, "Brian W. Barrett"  wrote:

> The selection code was added because frequently high speed interconnects
> fail to initialize properly due to random stuff happening (yes, that's a
> horrible statement, but true).  We ran into a situation with some really
> flaky machines where most of the processes would choose CM, but a couple
> would fail to initialize the MTL and therefore choose OB1.  This led to a
> hang situation, which is the worst of the worst.
> 
> I think #1 is adequate, although it doesn't handle spawn particularly
> well.  And spawn is generally used in environments where such network
> mismatches are most likely to occur.
> 
> Brian
> 
> 
> On Mon, 23 Jun 2008, Ralph H Castain wrote:
> 
>> Since my goal is to eliminate the modex completely for managed
>> installations, could you give me a brief understanding of this eventual PML
>> selection logic? It would help to hear an example of how and why different
>> procs could get different answers - and why we would want to allow them to
>> do so.
>> 
>> Thanks
>> Ralph
>> 
>> 
>> 
>> On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:
>> 
>>> The first approach sounds fair enough to me. We should avoid 2 and 3
>>> as the pml selection mechanism used to be
>>> more complex before we reduced it to accommodate a major design bug in
>>> the BTL selection process. When using the complete PML selection, BTL
>>> would be initialized several times, leading to a variety of bugs.
>>> Eventually the PML selection should return to its old self, when the
>>> BTL bug gets fixed.
>>> 
>>> Aurelien
>>> 
>>> Le 23 juin 08 à 12:36, Ralph H Castain a écrit :
>>> 
 Yo all
 
 I've been doing further research into the modex and came across
 something I
 don't fully understand. It seems we have each process insert into
 the modex
 the name of the PML module that it selected. Once the modex has
 exchanged
 that info, it then loops across all procs in the job to check their
 selection, and aborts if any proc picked a different PML module.
 
 All well and good...assuming that procs actually -can- choose
 different PML
 modules and hence create an "abort" scenario. However, if I look
 inside the
 PML's at their selection logic, I find that a proc can ONLY pick a
 module
 other than ob1 if:
 
 1. the user specifies the module to use via -mca pml xyz or by using a
 module specific mca param to adjust its priority. In this case,
 since the
 mca param is propagated, ALL procs have no choice but to pick that
 same
 module, so that can't cause us to abort (we will have already
 returned an
 error and aborted if the specified module can't run).
 
 2. the pml/cm module detects that an MTL module was selected, and
 that it is
 other than "psm". In this case, the CM module will be selected
 because its
 default priority is higher than that of OB1.
 
 In looking deeper into the MTL selection logic, it appears to me
 that you
 either have the required capability or you don't. I can see that in
 some
 environments (e.g., rsh across unmanaged collections of machines),
 it might
 be possible for someone to launch across a set of machines where
 some do and
 some don't have the required support. However, in all other cases,
 this will
 be homogeneous across the system.
 
 Given this analysis (and someone more familiar with the PML should
 feel free
 to confirm or correct it), it seems to me that this could be
 streamlined via
 one or more means:
 
 1. at the most, we could have rank=0 add the PML module name to the
 modex,
 and other procs simply check it against their own and return an
 error if
 they differ. This accomplishes the identical functionality to what
 we have
 today, but with much less info in the modex.
 
 2. we could eliminate this info from the modex altogether by
 requiring the
 user to specify the 

Re: [OMPI devel] ob1 and req->req_state

2008-06-23 Thread Brian W. Barrett

On Mon, 23 Jun 2008, Jeff Squyres wrote:


On Jun 23, 2008, at 3:17 PM, Brian W. Barrett wrote:

Just because it's volatile doesn't mean that adds are atomic.  There's at 
least one place in the PML (or used to be) where two threads could 
decrement that counter at the same time.


With atomics, then both subtracts should occur, right?  So a request could go 
from ACTIVE -> INACTIVE -> INVALID.  Is that what is desired?  (I honestly 
don't know enough about ob1 to say)


Or should we just be assigning a specific state, rather than relying on 
subtracting?  That was my real question.


I honestly don't know.  I just remember that there were some cases where 
we were doing crazy counting.


Brian


Re: [OMPI devel] PML selection logic

2008-06-23 Thread Brian W. Barrett
The selection code was added because frequently high speed interconnects 
fail to initialize properly due to random stuff happening (yes, that's a 
horrible statement, but true).  We ran into a situation with some really 
flaky machines where most of the processes would choose CM, but a couple
would fail to initialize the MTL and therefore choose OB1.  This led to a
hang situation, which is the worst of the worst.


I think #1 is adequate, although it doesn't handle spawn particularly 
well.  And spawn is generally used in environments where such network 
mismatches are most likely to occur.


Brian


On Mon, 23 Jun 2008, Ralph H Castain wrote:


Since my goal is to eliminate the modex completely for managed
installations, could you give me a brief understanding of this eventual PML
selection logic? It would help to hear an example of how and why different
procs could get different answers - and why we would want to allow them to
do so.

Thanks
Ralph



On 6/23/08 11:59 AM, "Aurélien Bouteiller"  wrote:


The first approach sounds fair enough to me. We should avoid 2 and 3
as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major design bug in
the BTL selection process. When using the complete PML selection, BTL
would be initialized several times, leading to a variety of bugs.
Eventually the PML selection should return to its old self, when the
BTL bug gets fixed.

Aurelien

Le 23 juin 08 à 12:36, Ralph H Castain a écrit :


Yo all

I've been doing further research into the modex and came across
something I
don't fully understand. It seems we have each process insert into
the modex
the name of the PML module that it selected. Once the modex has
exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose
different PML
modules and hence create an "abort" scenario. However, if I look
inside the
PML's at their selection logic, I find that a proc can ONLY pick a
module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case,
since the
mca param is propagated, ALL procs have no choice but to pick that
same
module, so that can't cause us to abort (we will have already
returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and
that it is
other than "psm". In this case, the CM module will be selected
because its
default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me
that you
either have the required capability or you don't. I can see that in
some
environments (e.g., rsh across unmanaged collections of machines),
it might
be possible for someone to launch across a set of machines where
some do and
some don't have the required support. However, in all other cases,
this will
be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should
feel free
to confirm or correct it), it seems to me that this could be
streamlined via
one or more means:

1. at the most, we could have rank=0 add the PML module name to the
modex,
and other procs simply check it against their own and return an
error if
they differ. This accomplishes the identical functionality to what
we have
today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by
requiring the
user to specify the PML module if they want something other than the
default
OB1. In this case, there can be no confusion over what each proc is
to use.
The CM module will attempt to init the MTL - if it cannot do so,
then the
job will return the correct error and tell the user that CM/MTL
support is
unavailable.

3. we could again eliminate the info by not inserting it into the
modex if
(a) the default PML module is selected, or (b) the user specified
the PML
module to be used. In the first case, each proc can simply check to
see if
they picked the default - if not, then we can insert the info to
indicate
the difference. Thus, in the "standard" case, no info will be
inserted.

In the second case, we will already get an error if the specified
PML module
could not be used. Hence, the modex check provides no additional
info or
value.

I understand the motivation to support automation. However, in this
case,
the automation actually doesn't seem to buy us very much, and it isn't
coming "free". So perhaps some change in how this is done would be
in order?

Ralph








Re: [OMPI devel] ob1 and req->req_state

2008-06-23 Thread Jeff Squyres

On Jun 23, 2008, at 3:17 PM, Brian W. Barrett wrote:

Just because it's volatile doesn't mean that adds are atomic.   
There's at least one place in the PML (or used to be) where two  
threads could decrement that counter at the same time.


With atomics, then both subtracts should occur, right?  So a request  
could go from ACTIVE -> INACTIVE -> INVALID.  Is that what is  
desired?  (I honestly don't know enough about ob1 to say)


Or should we just be assigning a specific state, rather than relying  
on subtracting?  That was my real question.




On Mon, 23 Jun 2008, Jeff Squyres wrote:


I see in a few places in ob1 we do things like this:

  OPAL_THREAD_ADD32(&req->req_state, -1);

Why do we do this?  req_state is technically an enum value, so we  
shouldn't be adding/subtracting to it (granted, it looks like the  
enum values were carefully chosen to allow this).  Additionally,  
req_state is volatile; the atomics shouldn't be necessary.


Is there some other non-obvious reason?

Also, I see this in a few places:

 req->req_state = 2;

which really should be

 req->req_state = OMPI_REQUEST_ACTIVE;






--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] ob1 and req->req_state

2008-06-23 Thread Brian W. Barrett
Just because it's volatile doesn't mean that adds are atomic.  There's at 
least one place in the PML (or used to be) where two threads could 
decrement that counter at the same time.
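To spell out the point, a small sketch of the race (generic C, nothing
ob1-specific):

    volatile int32_t req_state = 2;

    /* "volatile" only forces loads/stores to memory; the decrement is still
     * a separate read-modify-write, so two threads can interleave:
     *
     *   Thread A                Thread B
     *   tmp = req_state;  (2)
     *                           tmp = req_state;  (also 2)
     *   req_state = 1;
     *                           req_state = 1;    <-- one decrement lost
     */

    /* The atomic version performs the whole read-modify-write as one step: */
    OPAL_THREAD_ADD32(&req_state, -1);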


Brian

On Mon, 23 Jun 2008, Jeff Squyres wrote:


I see in a few places in ob1 we do things like this:

   OPAL_THREAD_ADD32(&req->req_state, -1);

Why do we do this?  req_state is technically an enum value, so we shouldn't 
be adding/subtracting to it (granted, it looks like the enum values were 
carefully chosen to allow this).  Additionally, req_state is volatile; the 
atomics shouldn't be necessary.


Is there some other non-obvious reason?

Also, I see this in a few places:

  req->req_state = 2;

which really should be

  req->req_state = OMPI_REQUEST_ACTIVE;




[OMPI devel] ob1 and req->req_state

2008-06-23 Thread Jeff Squyres

I see in a few places in ob1 we do things like this:

OPAL_THREAD_ADD32(&req->req_state, -1);

Why do we do this?  req_state is technically an enum value, so we  
shouldn't be adding/subtracting to it (granted, it looks like the enum  
values were carefully chosen to allow this).  Additionally, req_state  
is volatile; the atomics shouldn't be necessary.


Is there some other non-obvious reason?

Also, I see this in a few places:

req->req_state = 2;

which really should be

req->req_state = OMPI_REQUEST_ACTIVE;
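For reference, the enum being discussed looks roughly like this (the
consecutive values are what make the +/-1 arithmetic line up at all):

    typedef enum {
        OMPI_REQUEST_INVALID   = 0,
        OMPI_REQUEST_INACTIVE  = 1,
        OMPI_REQUEST_ACTIVE    = 2,
        OMPI_REQUEST_CANCELLED = 3
    } ompi_request_state_t;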

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] PML selection logic

2008-06-23 Thread Aurélien Bouteiller
The first approach sounds fair enough to me. We should avoid 2 and 3  
as the pml selection mechanism used to be
more complex before we reduced it to accommodate a major design bug in  
the BTL selection process. When using the complete PML selection, BTL  
would be initialized several times, leading to a variety of bugs.  
Eventually the PML selection should return to its old self, when the  
BTL bug gets fixed.


Aurelien

Le 23 juin 08 à 12:36, Ralph H Castain a écrit :


Yo all

I've been doing further research into the modex and came across  
something I
don't fully understand. It seems we have each process insert into  
the modex
the name of the PML module that it selected. Once the modex has  
exchanged

that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose  
different PML
modules and hence create an "abort" scenario. However, if I look  
inside the
PML's at their selection logic, I find that a proc can ONLY pick a  
module

other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case,  
since the
mca param is propagated, ALL procs have no choice but to pick that  
same
module, so that can't cause us to abort (we will have already  
returned an

error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and  
that it is
other than "psm". In this case, the CM module will be selected  
because its

default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me  
that you
either have the required capability or you don't. I can see that in  
some
environments (e.g., rsh across unmanaged collections of machines),  
it might
be possible for someone to launch across a set of machines where  
some do and
some don't have the required support. However, in all other cases,  
this will

be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should  
feel free
to confirm or correct it), it seems to me that this could be  
streamlined via

one or more means:

1. at the most, we could have rank=0 add the PML module name to the  
modex,
and other procs simply check it against their own and return an  
error if
they differ. This accomplishes the identical functionality to what  
we have

today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by  
requiring the
user to specify the PML module if they want something other than the  
default
OB1. In this case, there can be no confusion over what each proc is  
to use.
The CM module will attempt to init the MTL - if it cannot do so,  
then the
job will return the correct error and tell the user that CM/MTL  
support is

unavailable.

3. we could again eliminate the info by not inserting it into the  
modex if
(a) the default PML module is selected, or (b) the user specified  
the PML
module to be used. In the first case, each proc can simply check to  
see if
they picked the default - if not, then we can insert the info to  
indicate
the difference. Thus, in the "standard" case, no info will be  
inserted.


In the second case, we will already get an error if the specified  
PML module
could not be used. Hence, the modex check provides no additional  
info or

value.

I understand the motivation to support automation. However, in this  
case,

the automation actually doesn't seem to buy us very much, and it isn't
coming "free". So perhaps some change in how this is done would be  
in order?


Ralph








Re: [OMPI devel] ompi_ignore dr pml?

2008-06-23 Thread Tim Mattox
Until someone can work on it, sure, ompi_ignore DR sounds right.
Unfortunately, IU may *need* to work on it this fall... hopefully we (I)
will have a new student to help do the work.  As for inclusion in 1.3,
I don't think we care.

On Mon, Jun 23, 2008 at 11:01 AM, Jeff Squyres  wrote:
> Should we .ompi_ignore dr?
>
> It's not complete and no one wants to support it.  I'm thinking that we
> shouldn't even include it in v1.3.
>
> Thoughts?
>
> --
> Jeff Squyres
> Cisco Systems
>

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
 I'm a bright... http://www.the-brights.net/


[OMPI devel] PML selection logic

2008-06-23 Thread Ralph H Castain
Yo all

I've been doing further research into the modex and came across something I
don't fully understand. It seems we have each process insert into the modex
the name of the PML module that it selected. Once the modex has exchanged
that info, it then loops across all procs in the job to check their
selection, and aborts if any proc picked a different PML module.

All well and good...assuming that procs actually -can- choose different PML
modules and hence create an "abort" scenario. However, if I look inside the
PML's at their selection logic, I find that a proc can ONLY pick a module
other than ob1 if:

1. the user specifies the module to use via -mca pml xyz or by using a
module specific mca param to adjust its priority. In this case, since the
mca param is propagated, ALL procs have no choice but to pick that same
module, so that can't cause us to abort (we will have already returned an
error and aborted if the specified module can't run).

2. the pml/cm module detects that an MTL module was selected, and that it is
other than "psm". In this case, the CM module will be selected because its
default priority is higher than that of OB1.

In looking deeper into the MTL selection logic, it appears to me that you
either have the required capability or you don't. I can see that in some
environments (e.g., rsh across unmanaged collections of machines), it might
be possible for someone to launch across a set of machines where some do and
some don't have the required support. However, in all other cases, this will
be homogeneous across the system.

Given this analysis (and someone more familiar with the PML should feel free
to confirm or correct it), it seems to me that this could be streamlined via
one or more means:

1. at the most, we could have rank=0 add the PML module name to the modex,
and other procs simply check it against their own and return an error if
they differ. This accomplishes the identical functionality to what we have
today, but with much less info in the modex.

2. we could eliminate this info from the modex altogether by requiring the
user to specify the PML module if they want something other than the default
OB1. In this case, there can be no confusion over what each proc is to use.
The CM module will attempt to init the MTL - if it cannot do so, then the
job will return the correct error and tell the user that CM/MTL support is
unavailable.

3. we could again eliminate the info by not inserting it into the modex if
(a) the default PML module is selected, or (b) the user specified the PML
module to be used. In the first case, each proc can simply check to see if
they picked the default - if not, then we can insert the info to indicate
the difference. Thus, in the "standard" case, no info will be inserted.

In the second case, we will already get an error if the specified PML module
could not be used. Hence, the modex check provides no additional info or
value.

I understand the motivation to support automation. However, in this case,
the automation actually doesn't seem to buy us very much, and it isn't
coming "free". So perhaps some change in how this is done would be in order?

Ralph





Re: [OMPI devel] multiple GigE interfaces...

2008-06-23 Thread Adrian Knoth
On Wed, Jun 18, 2008 at 05:13:28PM -0700, Muhammad Atif wrote:

>  Hi again... I was on a break from Xensocket stuff. This time, some
>  general questions...

Hi.

> question. What if I have multiple Ethernet cards (say 5) on two of my
> quad core machines.  The IP addresses (and the subnets of course) are 
> Machine A   Machine B
> eth0 is y.y.1.a   y.y.1.z
> eth1 is y.y.4.b   y.y.4.y
> eth2 is y.y.4.c   ...
> eth3 is y.y.4.d   ...
> 
>  ...

This sounds pretty weird. And I guess your netmasks don't allow you to
separate the NICs, do they?

> From the FAQs and some emails in the user lists, it is clear that if I want
> to run a job on multiple ethernets, I can use --mca btl_tcp_if_include
> eth0,eth1. This

You can, but you don't have to. If you don't specify something, OMPI
will choose "something right".

> will run the job on two of the subnets, utilizing both Ethernet
> cards. Is it doing some sort of load balancing, or some round-robin
> mechanism? What part of the code is responsible for this work?

As far as I know, it's handled by OB1 (PML), which does striping across
several BTL instances.

So in other words, as long as both segments are equally fast, the load
balancing should do fine. If they differ in performance, OB1 doesn't
find an optimal solution. If you're hitting this case, ask htor; he has
an auto-tuning replacement, but that's not going to be part of OMPI.
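Conceptually, the striping looks something like the sketch below -- split a
large message across the available BTLs in proportion to their advertised
bandwidth (names such as btl_send() and btl[i].bandwidth are illustrative,
not the real ob1 API):

    size_t offset = 0;
    for (int i = 0; i < num_btls; i++) {
        size_t chunk = (msg_size * btl[i].bandwidth) / total_bandwidth;
        btl_send(btl[i], buf + offset, chunk);   /* hand one slice to each BTL */
        offset += chunk;
    }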

> eth1,eth2,eth3,eth4. Notice that all of these ethNs are on the same subnet.
> Even in the FAQs (which mostly answer our lame questions), it's not
> entirely clear how communication will be done.  Each process will have
> tcp_num_btls equal to the number of interfaces, but then what? Is it some
> sort of load balancing or similar stuff which is not clear in tcpdump?

I feel you could end up with communication stalls, the typical hang
situation. One problem that might occur: the TCP component looks for
remote addresses on the "same" network, so the component might be unable
to decide whether your IP is on the same physical network or uses
the wrong link. Then, you won't gain anything.

Another problem: at least the Linux kernel (without tweaking) decides
which interface and address to use for outgoing communication. If you
have multiple subnets, then the kernel would go for the closest match
between local and remote addresses, but in your case, it might be some
kind of lottery.


> A related question: what if I want to run an 8-process job (on a 2x4
> cluster) and want to pin a process to a network interface? OpenMPI, to
> my understanding, does not give any control over allocating an IP to a
> process (like MPICH)

You could just say btl_tcp_if_include=ethX, thus giving you the right
network interface. Obviously, this requires separate networks.
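For example (process count and binary name made up), something along these
lines restricts the TCP BTL to eth1:

    mpirun -np 8 --mca btl_tcp_if_include eth1 ./my_app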


> or is there some magical --mca thingie. I think the only way to go is
> adding routing tables... am I thinking in the right direction? If yes, then
> the performance of my boxes decreases when I try to force the routing

Routing should be fast, since it's done at kernel level. I cannot speak
for Xen-based virtual interfaces.


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] BW benchmark hangs after r 18551

2008-06-23 Thread Lenny Verkhovsky
Hi,
Seqf bug fixed in r18706.

Best Regards
Lenny.
On Thu, Jun 19, 2008 at 5:37 PM, Lenny Verkhovsky <
lenny.verkhov...@gmail.com> wrote:

> Sorry,
> I checked it without sm.
>
> pls ignore this mail.
>
>
>
> On Thu, Jun 19, 2008 at 4:32 PM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> Hi,
>> I found what caused the problem in both cases.
>>
>> --- ompi/mca/btl/sm/btl_sm.c   (revision 18675)
>> +++ ompi/mca/btl/sm/btl_sm.c   (working copy)
>> @@ -812,7 +812,7 @@
>>   */
>>  MCA_BTL_SM_FIFO_WRITE(endpoint, endpoint->my_smp_rank,
>>endpoint->peer_smp_rank, frag->hdr, false, rc);
>> -return (rc < 0 ? rc : 1);
>> +   return OMPI_SUCCESS;
>>  }
>> I am just not sure if it's OK.
>>
>> Lenny.
>>   On Wed, Jun 18, 2008 at 3:21 PM, Lenny Verkhovsky <
>> lenny.verkhov...@gmail.com> wrote:
>>
>>> Hi,
>>> I am not sure if it's related,
>>> but I applied your patch ( r18667 )  to r 18656 ( one before NUMA )
>>> together with disabling sendi,
>>> The result still the same ( hanging ).
>>>
>>>
>>>
>>>
>>>  On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca 
>>> wrote:
>>>
 Lenny,

 I guess you're running the latest version. If not, please update; Galen
 and I corrected some bugs last week. If you're using the latest (and
 greatest) then ... well I imagine there is at least one bug left.

 There is a quick test you can do. In the btl_sm.c in the module
 structure at the beginning of the file, please replace the sendi function 
 by
 NULL. If this fixes the problem, then at least we know that it's an sm send
 immediate problem.

  Thanks,
george.


 On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:

 Hi, George,
>
> I have a problem running the BW benchmark on a 100-rank cluster after r18551.
> The BW is mpi_p that runs mpi_bandwidth with 100K between all pairs.
>
>
> #mpirun -np 100 -hostfile hostfile_w  ./mpi_p_18549 -t bw -s 10
> BW (100) (size min max avg)  10  576.734030  2001.882416  1062.698408
> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 10
> mpirun: killing job...
> ( it hangs even after 10 hours ).
>
>
> It doesn't happen if I run --bynode or btl openib,self only.
>
>
> Lenny.
>


>>>
>>
>