Re: [OMPI devel] [devel-core] Open MPI v1.2.5rc1 has been posted

2007-12-06 Thread Tim Mattox
Argh, sorry about that...  the website changes were checked into svn... but the
main website wasn't svn up'ed...   Open MPI v1.2.5rc1 is now there.  Enjoy.

On Dec 6, 2007 5:49 PM, Jeff Squyres  wrote:
> Tim --
>
> I don't see 1.2.5rc1 posted there...?
>
>
>
> On Dec 6, 2007, at 4:43 PM, Tim Mattox wrote:
>
> > Hi All,
> > The first release candidate of Open MPI v1.2.5 is now up:
> >
> > http://www.open-mpi.org/software/ompi/v1.2/
> >
> > Please run it through its paces as best you can.
> > --
> > Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> > tmat...@gmail.com || timat...@open-mpi.org
> >I'm a bright... http://www.the-brights.net/
>
>
> --
> Jeff Squyres
> Cisco Systems
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] === CREATE FAILURE ===

2007-12-06 Thread George Bosilca

Fixed by r16884.

  george.

On Dec 7, 2007, at 12:46 PM, MPI Team wrote:



ERROR: Command returned a non-zero exit status
  make -j 4 distcheck

Start time: Thu Dec  6 21:00:25 EST 2007
End time:   Thu Dec  6 21:16:34 EST 2007

==============================================================================
[... previous lines snipped ...]
/bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT proc/proc.lo -MD -MP -MF $depbase.Tpo -c -o proc/proc.lo ../../ompi/proc/proc.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT op/op_predefined.lo -MD -MP -MF op/.deps/op_predefined.Tpo -c ../../ompi/op/op_predefined.c -fPIC -DPIC -o op/.libs/op_predefined.o
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT proc/proc.lo -MD -MP -MF proc/.deps/proc.Tpo -c ../../ompi/proc/proc.c -fPIC -DPIC -o proc/.libs/proc.o
depbase=`echo request/grequest.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT request/grequest.lo -MD -MP -MF $depbase.Tpo -c -o request/grequest.lo ../../ompi/request/grequest.c &&\
mv -f $depbase.Tpo $depbase.Plo
depbase=`echo request/request.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT request/request.lo -MD -MP -MF $depbase.Tpo -c -o request/request.lo ../../ompi/request/request.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT request/grequest.lo -MD -MP -MF request/.deps/grequest.Tpo -c ../../ompi/request/grequest.c -fPIC -DPIC -o request/.libs/grequest.o
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT request/request.lo -MD -MP -MF request/.deps/request.Tpo -c ../../ompi/request/request.c -fPIC -DPIC -o request/.libs/request.o
depbase=`echo request/req_test.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../opal/include -I../orte/include -I../ompi/include -I../opal/mca/paffinity/linux/plpa/src/libplpa -I../../ompi -I../.. -I.. -I../../opal/include -I../../orte/include -I../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -MT request/req_test.lo -MD -MP -MF $depbase.Tpo -c -o request/req_test.lo ../../ompi/request/req_test.c &&\
mv -f $depbase.Tpo $depbase.Plo
../../ompi/request/request.c:24:42: ompi/request/request_default.h: No such file or directory
../../ompi/request/request.c:36: error: `ompi_request_default_test' undeclared here (not in a function)
../../ompi/request/request.c:36: error: initializer element is not constant
../../ompi/request/request.c:36: error: (near initialization for `ompi_request_functions.req_test')
../../ompi/request/request.c:37: error: `ompi_request_default_test_any' undeclared here (not in a function)
../../ompi/request/request.c:37: error: initializer element is not constant
../../ompi/request/request.c:37: error: (near initialization for `ompi_request_functions.req_test_any')
../../ompi/request/request.c:38: error: `ompi_request_default_test_all' undeclared here (not in a function)

Re: [OMPI devel] opal_condition_wait

2007-12-06 Thread Gleb Natapov
On Thu, Dec 06, 2007 at 09:46:45AM -0500, Tim Prins wrote:
> Also, when we are using threads, there is a case where we do not 
> decrement the signaled count, in condition.h:84. Gleb put this in in 
> r9451, however the change does not make sense to me. I think that the 
> signal count should always be decremented.
> 
> Can anyone shine any light on these issues?
> 
I made this change a long time ago (I wonder why I even tested the threaded
build back then), but from what I recall of the code and the log message,
there was a deadlock when a signal broadcast didn't wake up all of the
threads waiting on a condition variable. Suppose two threads wait on a
condition C and a third thread does a broadcast. This sets C->c_signaled
to 2. Now one thread wakes up and decrements C->c_signaled by one, and
before the other thread gets a chance to run, it calls condition_wait on C
again. Because c_signaled is 1 it doesn't sleep, and it decrements
c_signaled a second time. Now c_signaled is zero, so when the second thread
wakes up it sees this and goes back to sleep. The solution was to check in
condition_wait whether the condition is already signaled before going to
sleep, and if so to return immediately.
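
For illustration, a minimal C sketch of that early-exit check; the type,
field, and function names below are stand-ins, not the actual
opal/threads/condition.h definitions:

/* Hedged sketch of the behavior described above -- NOT the real
 * opal_condition_t implementation; all names are illustrative. */
#include <pthread.h>
#include <sched.h>

typedef struct {
    volatile int c_signaled;   /* wakeups posted by signal/broadcast    */
    volatile int c_waiting;    /* threads currently blocked in the wait */
} sketch_condition_t;

static inline int sketch_condition_wait(sketch_condition_t *c,
                                        pthread_mutex_t *m)
{
    /* Early exit: if the condition was already signaled, return without
     * decrementing c_signaled, leaving the pending wakeup for a waiter
     * that has not yet had a chance to run (the case Tim points at in
     * condition.h:84). */
    if (c->c_signaled > 0) {
        return 0;
    }

    c->c_waiting++;
    while (0 == c->c_signaled) {
        pthread_mutex_unlock(m);
        sched_yield();             /* stand-in for opal_progress() */
        pthread_mutex_lock(m);
    }
    c->c_signaled--;
    c->c_waiting--;
    return 0;
}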

--
Gleb.


Re: [OMPI devel] GROUP_EMPTY fixes break intel tests :-(

2007-12-06 Thread Jeff Squyres

Done: r16872.

On Dec 6, 2007, at 1:34 PM, Terry Dontje wrote:



Jeff Squyres wrote:

I should also note the following:

- LAM/MPI does the same thing (increments refcount when GROUP_EMPTY is
returned to the user, and allows GROUP_EMPTY in GROUP_FREE)

- MPICH2 has the following comment in GROUP_FREE (and code to match):

    /* Cannot free the predefined groups, but allow GROUP_EMPTY
       because otherwise many tests fail */

So I'm thinking that we should allow GROUP_EMPTY in GROUP_FREE -- back
out Edgar's changes and put in some big comments about exactly why.  :-)

Comments?


Note, CT6 (Sun's previous implementation) also passed these tests.  Sun
would like this test passing to be maintained until some concrete
message is made by the forum.  That being said, I would agree with
Jeff's proposal of backing out the change and putting in comments why.

--td


On Dec 6, 2007, at 11:01 AM, Jeff Squyres wrote:


So the changes that we debated and had Edgar put in *do* break some
intel tests.  Doh!  :-(

  MPI_Group_compare_f
  MPI_Group_intersection2_c
  MPI_Group_intersection2_f

It looks like these tests are specifically calling MPI_GROUP_FREE on
MPI_GROUP_EMPTY.

I note that there is code in the ompi/group/group_*.c code that
specifically calls OBJ_RETAIN on ompi_group_empty when we return
MPI_GROUP_EMPTY.  I wonder if this RETAIN was added (and the MPI param
check removed) in reaction to the intel tests...?

Can someone cite again where we thought the spec said that we should
not free GROUP_EMPTY?  Was it just on the argument that it's a
predefined handle and therefore should never be freed?

I cannot find any specific text in 1.2 or the errata stating that it's
bad to free GROUP_EMPTY.  I agree that this is somewhat counter to the
rest of the MPI philosophy of not freeing predefined handles, though.


--
Jeff Squyres
Cisco Systems










--
Jeff Squyres
Cisco Systems


Re: [OMPI devel] GROUP_EMPTY fixes break intel tests :-(

2007-12-06 Thread Terry Dontje


Jeff Squyres wrote:

I should also note the following:

- LAM/MPI does the same thing (increments refcount when GROUP_EMPTY is  
returned to the user, and allows GROUP_EMPTY in GROUP_FREE)


- MPICH2 has the following comment in GROUP_FREE (and code to match):

/* Cannot free the predefined groups, but allow GROUP_EMPTY
because otherwise many tests fail */

So I'm thinking that we should allow GROUP_EMPTY in GROUP_FREE -- back  
out Edgar's changes and put in some big comments about exactly why.  :-)


Comments?

  
Note, CT6 (Sun's previous implementation) also passed these tests.  Sun
would like this test passing to be maintained until some concrete
message is made by the forum.  That being said, I would agree with
Jeff's proposal of backing out the change and putting in comments why.

--td


On Dec 6, 2007, at 11:01 AM, Jeff Squyres wrote:

  

So the changes that we debated and had Edgar put in *do* break some
intel tests.  Doh!  :-(

   MPI_Group_compare_f
   MPI_Group_intersection2_c
   MPI_Group_intersection2_f

It looks like these tests are specifically calling MPI_GROUP_FREE on
MPI_GROUP_EMPTY.

I note that there is code in the ompi/group/group_*.c code that
specifically calls OBJ_RETAIN on ompi_group_empty when we return
MPI_GROUP_EMPTY.  I wonder if this RETAIN was added (and the MPI param
check removed) in reaction to the intel tests...?

Can someone cite again where we thought the spec said that we should
not free GROUP_EMPTY?  Was it just on the argument that it's a
predefined handle and therefore should never be freed?

I cannot find any specific text in 1.2 or the errata stating that it's
bad to free GROUP_EMPTY.  I agree that this is somewhat counter to the
rest of the MPI philosophy of not freeing predefined handles, though.

--
Jeff Squyres
Cisco Systems




  




Re: [OMPI devel] opal_condition_wait

2007-12-06 Thread Brian W. Barrett

On Thu, 6 Dec 2007, Tim Prins wrote:


Tim Prins wrote:

First, in opal_condition_wait (condition.h:97) we do not release the
passed mutex if opal_using_threads() is not set. Is there a reason for
this? I ask since this violates the way condition variables are supposed
to work, and it seems like there are situations where this could cause
deadlock.

So in (partial) answer to my own email, this is because throughout the
code we do:
OPAL_THREAD_LOCK(m)
opal_condition_wait(cond, m);
OPAL_THREAD_UNLOCK(m)

So this relies on opal_condition_wait not touching the lock. This
explains it, but it still seems very wrong.


Yes, this is correct.  The assumption is that you are using the 
conditional macro lock/unlock with the condition variables.  I personally 
don't like this (I think we should have had macro conditional condition 
variables), but that obviously isn't how it works today.


The problem with always holding the lock when you enter the condition 
variable is that even when threading is disabled, calling a lock is at 
least as expensive as an add, possibly including a cache miss.  So from a 
performance standpoint, this would be a no-go.
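
A hedged sketch of that conditional-locking pattern (stand-in names; the
real OPAL_THREAD_LOCK / OPAL_THREAD_UNLOCK macros in opal/threads/mutex.h
may differ in detail):

#include <pthread.h>

static int sketch_using_threads = 0;   /* stand-in for opal_using_threads() */

/* Only pay for the mutex when the library was told threads are in use;
 * in the single-threaded case the macro collapses to a cheap branch. */
#define SKETCH_THREAD_LOCK(m)              \
    do {                                   \
        if (sketch_using_threads) {        \
            pthread_mutex_lock(m);         \
        }                                  \
    } while (0)

#define SKETCH_THREAD_UNLOCK(m)            \
    do {                                   \
        if (sketch_using_threads) {        \
            pthread_mutex_unlock(m);       \
        }                                  \
    } while (0)

/* Typical call site, matching the pattern Tim quoted above:
 *     SKETCH_THREAD_LOCK(&m);
 *     sketch_condition_wait(&cond, &m);
 *     SKETCH_THREAD_UNLOCK(&m);
 * which only works because the wait does not touch the lock when
 * threading is disabled. */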



Also, when we are using threads, there is a case where we do not
decrement the signaled count, in condition.h:84. Gleb put this in in
r9451, however the change does not make sense to me. I think that the
signal count should always be decremented.

Can anyone shine any light on these issues?


Unfortunately, I can't add much on this front.

Brian


Re: [OMPI devel] GROUP_EMPTY fixes break intel tests :-(

2007-12-06 Thread Edgar Gabriel

well, the best I could find is the following in section 5.2.1

"MPI_GROUP_EMPTY, which is a valid handle to an empty group, should not 
be confused with MPI_GROUP_NULL, which in turn is an invalid handle. The 
former may be used as an argument to group operations; the latter, which 
is returned when a group is freed, is not a valid argument. ( End of 
advice to users.) "
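
Read that way, MPI_GROUP_EMPTY is a legal argument to group operations
while MPI_GROUP_NULL is not. As a hedged illustration only (this is not
the actual ompi/mpi/c/group_free.c code, which also has to deal with the
OBJ_RETAIN refcounting discussed below), a parameter check consistent
with that reading could look like:

#include <mpi.h>

/* Illustrative parameter check, not the real Open MPI source. */
static int sketch_group_free_check(MPI_Group *group)
{
    if (NULL == group || MPI_GROUP_NULL == *group) {
        return MPI_ERR_GROUP;    /* invalid handle, per the text above */
    }
    if (MPI_GROUP_EMPTY == *group) {
        /* Predefined but valid; letting it through keeps test suites
         * (e.g. the intel tests) passing, as the MPICH2 comment quoted
         * elsewhere in this thread notes. */
        return MPI_SUCCESS;
    }
    /* ... normal reference-count release for user-created groups ... */
    return MPI_SUCCESS;
}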



Jeff Squyres wrote:
So the changes that we debated and had Edgar put in *do* break some  
intel tests.  Doh!  :-(


MPI_Group_compare_f
MPI_Group_intersection2_c
MPI_Group_intersection2_f

It looks like these tests are specifically calling MPI_GROUP_FREE on  
MPI_GROUP_EMPTY.


I note that there is code in the ompi/group/group_*.c code that  
specifically calls OBJ_RETAIN on ompi_group_empty when we return  
MPI_GROUP_EMPTY.  I wonder if this RETAIN was added (and the MPI param  
check removed) in reaction to the intel tests...?


Can someone cite again where we thought the spec said that we should  
not free GROUP_EMPTY?  Was it just on the argument that it's a
predefined handle and therefore should never be freed?


I cannot find any specific text in 1.2 or the errata stating that it's  
bad to free GROUP_EMPTY.  I agree that this is somewhat counter to the  
rest of the MPI philosophy of not freeing predefined handles, though.




--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI devel] GROUP_EMPTY fixes break intel tests :-(

2007-12-06 Thread Jeff Squyres

I should also note the following:

- LAM/MPI does the same thing (increments refcount when GROUP_EMPTY is  
returned to the user, and allows GROUP_EMPTY in GROUP_FREE)


- MPICH2 has the following comment in GROUP_FREE (and code to match):

   /* Cannot free the predefined groups, but allow GROUP_EMPTY
   because otherwise many tests fail */

So I'm thinking that we should allow GROUP_EMPTY in GROUP_FREE -- back  
out Edgar's changes and put in some big comments about exactly why.  :-)


Comments?


On Dec 6, 2007, at 11:01 AM, Jeff Squyres wrote:


So the changes that we debated and had Edgar put in *do* break some
intel tests.  Doh!  :-(

   MPI_Group_compare_f
   MPI_Group_intersection2_c
   MPI_Group_intersection2_f

It looks like these tests are specifically calling MPI_GROUP_FREE on
MPI_GROUP_EMPTY.

I note that there is code in the ompi/group/group_*.c code that
specifically calls OBJ_RETAIN on ompi_group_empty when we return
MPI_GROUP_EMPTY.  I wonder if this RETAIN was added (and the MPI param
check removed) in reaction to the intel tests...?

Can someone cite again where we thought the spec said that we should
not free GROUP_EMPTY?  Was it just on the argument that it's a
predefined handle and therefore should never be freed?

I cannot find any specific text in 1.2 or the errata stating that it's
bad to free GROUP_EMPTY.  I agree that this is somewhat counter to the
rest of the MPI philosophy of not freeing predefined handles, though.

--
Jeff Squyres
Cisco Systems



--
Jeff Squyres
Cisco Systems


Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Ralph H Castain



On 12/6/07 8:06 AM, "Shipman, Galen M."  wrote:

>>> 
>>> Do we really need a complete node map? As far as I can tell, it looks
>>> like the MPI layer only needs a list of local processes. So maybe it
>>> would be better to forget about the node ids at the mpi layer and just
>>> return the local procs.
>> 
>> I agree, though I don't think we want a parallel list of procs. We just need
>> to set the "local" flag in the existing ompi_proc_t structures.
>> 
> 
> Having a parallel list of procs makes perfect sense. That way ORTE can store
> ORTE information in the orte_proc_t and OMPI can store OMPI information in
> the ompi_proc_t. The ompi_proc_t could either "inherit" the orte_proc_t or
> have a pointer to it so that we have no duplication of data.
> 

Hmmm... well, I personally don't have an opinion either way regarding how
the info flows up to the MPI layer, so I'll leave that up to you MPI folks.
I can certainly create such a list if that's the way you want to go.
However, since I have no way of knowing that a job is an MPI-job or not, I
will point out that non-MPI jobs will see an increased memory footprint as a
result. Not sure we care a whole lot - depends upon what other uses people
are trying to make of ORTE (e.g., STCI).

I will point out that getting the complete node map to every process will
incur some penalty in terms of launch time, and still begs the question of
how it gets there. I can provide it when launching via the ORTE daemons, of
course (would have to send it via a daemon-to-local-proc RML message,
though, as it would be too large to put in enviro variables) - I assume
there must be a parallel mechanism for something like the Cray to provide
it?

I'm also not sure how something like SLURM or TM would provide this if/when
we tightly integrate (i.e., go through their daemons instead of ORTE's) -
could be a future issue.


> Having a global map makes sense, particularly for numerous communication
> scenarios: if I know all the processes are on the same node, I may send a
> message to the lowest "vpid" on that node and it could then forward to
> everyone else.

Are you talking about RML communications? If so, we already have that in
place via the daemons.

Or are you talking about routing BTL/MPI messages?? I thought that was a
"no-no"?

> 
> 
>> One option is for the RTE to just pass in an enviro variable with a
>> comma-separated list of your local ranks, but that creates a problem down
>> the road when trying to integrate tighter with systems like SLURM where the
>> procs would get mass-launched (so the environment has to be the same for all
>> of them).
>> 
> Having a enviro variable with at comma-seperated list of local ranks doesn't
> seems like a bit of a hack to me.
> 
>>> 
>>> So my vote would be to leave the modex alone, but remove the node id,
>>> and add a function to get the list of local procs. It doesn't matter to
>>> me how the RTE implements that.
>> 
>> I think we would need to be careful here that we don't create a need for
>> more communication. We have two functions currently in the modex:
>> 
>> 1. how to exchange the info required to populate the ompi_proc_t structures;
>> and
>> 
>> 2. how to identify which of those procs are "local"
>> 
>> The problem with leaving the modex as it currently sits is that some
>> environments require a different mechanism for exchanging the ompi_proc_t
>> info. While most can use the RML, some can't. The same division of
>> capabilities applies to getting the "local" info, so it makes sense to me to
>> put the modex in a framework.
>> 
>> Otherwise, we wind up with a bunch of #if's in the code to support
>> environments like the Cray. I believe the mca system was put in place
>> precisely to avoid those kind of practices, so it makes sense to me to take
>> advantage of it.
>> 
>> 
>>> 
>>> Alternatively, if we did a process attribute system we could just use
>>> predefined attributes, and the runtime can get each process's node id
>>> however it wants.
>> 
>> Same problem as above, isn't it? Probably ignorance on my part, but it seems
>> to me that we simply exchange a modex framework for an attribute framework
>> (since each environment would have to get the attribute values in a
>> different manner) - don't we?
>> 
>> I have no problem with using attributes instead of the modex, but the issue
>> appears to be the same either way - you still need a framework to handle the
>> different methods.
>> 
>> 
>> Ralph
>> 
>>> 
>>> Tim
>>> 
>>> Ralph H Castain wrote:
 IV. RTE/MPI relative modex responsibilities
 The modex operation conducted during MPI_Init currently involves the
 exchange of two critical pieces of information:
 
 1. the location (i.e., node) of each process in my job so I can determine
 who shares a node with me. This is subsequently used by the shared memory
 subsystem for initialization and message routing; and
 
 2. BTL contact info for each process in my job.
 

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Ralph H Castain



On 12/6/07 8:09 AM, "Shipman, Galen M."  wrote:

> Sorry, to be clear that should have been:
> 
>> One option is for the RTE to just pass in an enviro variable with a
>> comma-separated list of your local ranks, but that creates a problem down
>> the road when trying to integrate tighter with systems like SLURM where the
>> procs would get mass-launched (so the environment has to be the same for all
>> of them).
>> 
> Having an enviro variable with a comma-seperated list of local ranks seems
> like a bit of a hack to me.

No argument - just trying to offer options for consideration. Not advocating
any of them yet. I'm still hoping for the "perfect solution" to show itself,
but I personally expect an acceptable compromise is the most likely
scenario.


> 
>>> 
>>> So my vote would be to leave the modex alone, but remove the node id,
>>> and add a function to get the list of local procs. It doesn't matter to
>>> me how the RTE implements that.
>> 
>> I think we would need to be careful here that we don't create a need for
>> more communication. We have two functions currently in the modex:
>> 
>> 1. how to exchange the info required to populate the ompi_proc_t structures;
>> and
>> 
>> 2. how to identify which of those procs are "local"
>> 
>> The problem with leaving the modex as it currently sits is that some
>> environments require a different mechanism for exchanging the ompi_proc_t
>> info. While most can use the RML, some can't. The same division of
>> capabilities applies to getting the "local" info, so it makes sense to me to
>> put the modex in a framework.
>> 
>> Otherwise, we wind up with a bunch of #if's in the code to support
>> environments like the Cray. I believe the mca system was put in place
>> precisely to avoid those kind of practices, so it makes sense to me to take
>> advantage of it.
>> 
>> 
>>> 
>>> Alternatively, if we did a process attribute system we could just use
>>> predefined attributes, and the runtime can get each process's node id
>>> however it wants.
>> 
>> Same problem as above, isn't it? Probably ignorance on my part, but it seems
>> to me that we simply exchange a modex framework for an attribute framework
>> (since each environment would have to get the attribute values in a
>> different manner) - don't we?
>> 
>> I have no problem with using attributes instead of the modex, but the issue
>> appears to be the same either way - you still need a framework to handle the
>> different methods.
>> 
>> 
>> Ralph
>> 
>>> 
>>> Tim
>>> 
>>> Ralph H Castain wrote:
 IV. RTE/MPI relative modex responsibilities
 The modex operation conducted during MPI_Init currently involves the
 exchange of two critical pieces of information:
 
 1. the location (i.e., node) of each process in my job so I can determine
 who shares a node with me. This is subsequently used by the shared memory
 subsystem for initialization and message routing; and
 
 2. BTL contact info for each process in my job.
 
 During our recent efforts to further abstract the RTE from the MPI layer,
 we
 pushed responsibility for both pieces of information into the MPI layer.
 This wasn't done capriciously - the modex has always included the exchange
 of both pieces of information, and we chose not to disturb that situation.
 
 However, the mixing of these two functional requirements does cause
 problems
 when dealing with an environment such as the Cray where BTL information is
 "exchanged" via an entirely different mechanism. In addition, it has been
 noted that the RTE (and not the MPI layer) actually "knows" the node
 location for each process.
 
 Hence, questions have been raised as to whether:
 
 (a) the modex should be built into a framework to allow multiple BTL
 exchange mechanisms to be supported, or some alternative mechanism be used
 -
 one suggestion made was to implement an MPICH-like attribute exchange; and
 
 (b) the RTE should absorb responsibility for providing a "node map" of the
 processes in a job (note: the modex may -use- that info, but would no
 longer
 be required to exchange it). This has a number of implications that need to
 be carefully considered - e.g., the memory required to store the node map
 in
 every process is non-zero. On the other hand:
 
 (i) every proc already -does- store the node for every proc - it is simply
 stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
 would want to avoid duplicating that storage, but there would be no change
 in memory footprint if done carefully.
 
 (ii) every daemon already knows the node map for the job, so communicating
 that info to its local procs may not prove a major burden. However, the
 very
 environments where this subject may be an issue (e.g., the Cray) do not use
 our daemons, so some alternative mechanism 

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Shipman, Galen M.
Sorry, to be clear that should have been:

> One option is for the RTE to just pass in an enviro variable with a
> comma-separated list of your local ranks, but that creates a problem down
> the road when trying to integrate tighter with systems like SLURM where the
> procs would get mass-launched (so the environment has to be the same for all
> of them).
> 
Having an enviro variable with a comma-seperated list of local ranks seems
like a bit of a hack to me.

>> 
>> So my vote would be to leave the modex alone, but remove the node id,
>> and add a function to get the list of local procs. It doesn't matter to
>> me how the RTE implements that.
> 
> I think we would need to be careful here that we don't create a need for
> more communication. We have two functions currently in the modex:
> 
> 1. how to exchange the info required to populate the ompi_proc_t structures;
> and
> 
> 2. how to identify which of those procs are "local"
> 
> The problem with leaving the modex as it currently sits is that some
> environments require a different mechanism for exchanging the ompi_proc_t
> info. While most can use the RML, some can't. The same division of
> capabilities applies to getting the "local" info, so it makes sense to me to
> put the modex in a framework.
> 
> Otherwise, we wind up with a bunch of #if's in the code to support
> environments like the Cray. I believe the mca system was put in place
> precisely to avoid those kind of practices, so it makes sense to me to take
> advantage of it.
> 
> 
>> 
>> Alternatively, if we did a process attribute system we could just use
>> predefined attributes, and the runtime can get each process's node id
>> however it wants.
> 
> Same problem as above, isn't it? Probably ignorance on my part, but it seems
> to me that we simply exchange a modex framework for an attribute framework
> (since each environment would have to get the attribute values in a
> different manner) - don't we?
> 
> I have no problem with using attributes instead of the modex, but the issue
> appears to be the same either way - you still need a framework to handle the
> different methods.
> 
> 
> Ralph
> 
>> 
>> Tim
>> 
>> Ralph H Castain wrote:
>>> IV. RTE/MPI relative modex responsibilities
>>> The modex operation conducted during MPI_Init currently involves the
>>> exchange of two critical pieces of information:
>>> 
>>> 1. the location (i.e., node) of each process in my job so I can determine
>>> who shares a node with me. This is subsequently used by the shared memory
>>> subsystem for initialization and message routing; and
>>> 
>>> 2. BTL contact info for each process in my job.
>>> 
>>> During our recent efforts to further abstract the RTE from the MPI layer, we
>>> pushed responsibility for both pieces of information into the MPI layer.
>>> This wasn't done capriciously - the modex has always included the exchange
>>> of both pieces of information, and we chose not to disturb that situation.
>>> 
>>> However, the mixing of these two functional requirements does cause problems
>>> when dealing with an environment such as the Cray where BTL information is
>>> "exchanged" via an entirely different mechanism. In addition, it has been
>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>> location for each process.
>>> 
>>> Hence, questions have been raised as to whether:
>>> 
>>> (a) the modex should be built into a framework to allow multiple BTL
>>> exchange mechanisms to be supported, or some alternative mechanism be used -
>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>> 
>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>> processes in a job (note: the modex may -use- that info, but would no longer
>>> be required to exchange it). This has a number of implications that need to
>>> be carefully considered - e.g., the memory required to store the node map in
>>> every process is non-zero. On the other hand:
>>> 
>>> (i) every proc already -does- store the node for every proc - it is simply
>>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>>> would want to avoid duplicating that storage, but there would be no change
>>> in memory footprint if done carefully.
>>> 
>>> (ii) every daemon already knows the node map for the job, so communicating
>>> that info to its local procs may not prove a major burden. However, the very
>>> environments where this subject may be an issue (e.g., the Cray) do not use
>>> our daemons, so some alternative mechanism for obtaining the info would be
>>> required.
>>> 
>>> 
>>> So the questions to be considered here are:
>>> 
>>> (a) do we leave the current modex "as-is", to include exchange of the node
>>> map, perhaps including "#if" statements to support different exchange
>>> mechanisms?
>>> 
>>> (b) do we separate the two functions currently in the modex and push the
>>> requirement to obtain a node map into the RTE? If so, how do we want the MPI

Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities

2007-12-06 Thread Shipman, Galen M.
>> 
>> Do we really need a complete node map? As far as I can tell, it looks
>> like the MPI layer only needs a list of local processes. So maybe it
>> would be better to forget about the node ids at the mpi layer and just
>> return the local procs.
> 
> I agree, though I don't think we want a parallel list of procs. We just need
> to set the "local" flag in the existing ompi_proc_t structures.
> 

Having a parallel list of procs makes perfect sense. That way ORTE can store
ORTE information in the orte_proc_t and OMPI can store OMPI information in
the ompi_proc_t. The ompi_proc_t could either "inherit" the orte_proc_t or
have a pointer to it so that we have no duplication of data.

Having a global map makes sense, particularly for numerous communication
scenarios: if I know all the processes are on the same node, I may send a
message to the lowest "vpid" on that node and it could then forward to
everyone else.
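
A hedged sketch of both ideas -- an MPI-level proc that points at (rather
than copies) the RTE's proc data, and picking the lowest vpid on a node as
the forwarding leader. All names here are illustrative, not the real
orte/ompi structures:

/* Illustrative only -- NOT the real orte_proc_t / ompi_proc_t layouts. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t vpid;      /* rank of the process within the job */
    uint32_t nodeid;    /* node the process is running on     */
} sketch_rte_proc_t;

typedef struct {
    sketch_rte_proc_t *rte_proc;  /* pointer into the RTE's list, so the
                                     data is not duplicated               */
    bool proc_is_local;           /* the "local" flag the MPI layer wants */
    /* ... BTL endpoint info, convertor, etc. would live here ... */
} sketch_mpi_proc_t;

/* Pick the lowest vpid on a given node as the "leader" a message could be
 * forwarded through. */
static uint32_t sketch_local_leader(const sketch_rte_proc_t *procs,
                                    size_t nprocs, uint32_t nodeid)
{
    uint32_t leader = UINT32_MAX;
    for (size_t i = 0; i < nprocs; ++i) {
        if (procs[i].nodeid == nodeid && procs[i].vpid < leader) {
            leader = procs[i].vpid;
        }
    }
    return leader;   /* UINT32_MAX if no process lives on that node */
}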


> One option is for the RTE to just pass in an enviro variable with a
> comma-separated list of your local ranks, but that creates a problem down
> the road when trying to integrate tighter with systems like SLURM where the
> procs would get mass-launched (so the environment has to be the same for all
> of them).
> 
Having a enviro variable with at comma-seperated list of local ranks doesn't
seems like a bit of a hack to me.

>> 
>> So my vote would be to leave the modex alone, but remove the node id,
>> and add a function to get the list of local procs. It doesn't matter to
>> me how the RTE implements that.
> 
> I think we would need to be careful here that we don't create a need for
> more communication. We have two functions currently in the modex:
> 
> 1. how to exchange the info required to populate the ompi_proc_t structures;
> and
> 
> 2. how to identify which of those procs are "local"
> 
> The problem with leaving the modex as it currently sits is that some
> environments require a different mechanism for exchanging the ompi_proc_t
> info. While most can use the RML, some can't. The same division of
> capabilities applies to getting the "local" info, so it makes sense to me to
> put the modex in a framework.
> 
> Otherwise, we wind up with a bunch of #if's in the code to support
> environments like the Cray. I believe the mca system was put in place
> precisely to avoid those kind of practices, so it makes sense to me to take
> advantage of it.
> 
> 
>> 
>> Alternatively, if we did a process attribute system we could just use
>> predefined attributes, and the runtime can get each process's node id
>> however it wants.
> 
> Same problem as above, isn't it? Probably ignorance on my part, but it seems
> to me that we simply exchange a modex framework for an attribute framework
> (since each environment would have to get the attribute values in a
> different manner) - don't we?
> 
> I have no problem with using attributes instead of the modex, but the issue
> appears to be the same either way - you still need a framework to handle the
> different methods.
> 
> 
> Ralph
> 
>> 
>> Tim
>> 
>> Ralph H Castain wrote:
>>> IV. RTE/MPI relative modex responsibilities
>>> The modex operation conducted during MPI_Init currently involves the
>>> exchange of two critical pieces of information:
>>> 
>>> 1. the location (i.e., node) of each process in my job so I can determine
>>> who shares a node with me. This is subsequently used by the shared memory
>>> subsystem for initialization and message routing; and
>>> 
>>> 2. BTL contact info for each process in my job.
>>> 
>>> During our recent efforts to further abstract the RTE from the MPI layer, we
>>> pushed responsibility for both pieces of information into the MPI layer.
>>> This wasn't done capriciously - the modex has always included the exchange
>>> of both pieces of information, and we chose not to disturb that situation.
>>> 
>>> However, the mixing of these two functional requirements does cause problems
>>> when dealing with an environment such as the Cray where BTL information is
>>> "exchanged" via an entirely different mechanism. In addition, it has been
>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>> location for each process.
>>> 
>>> Hence, questions have been raised as to whether:
>>> 
>>> (a) the modex should be built into a framework to allow multiple BTL
>>> exchange mechanisms to be supported, or some alternative mechanism be used -
>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>> 
>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>> processes in a job (note: the modex may -use- that info, but would no longer
>>> be required to exchange it). This has a number of implications that need to
>>> be carefully considered - e.g., the memory required to store the node map in
>>> every process is non-zero. On the other hand:
>>> 
>>> (i) every proc already -does- store the node for every proc - it is simply
>>> stored in the ompi_proc_t structures as 

[OMPI devel] opal_condition_wait

2007-12-06 Thread Tim Prins

Hi,

A couple of questions.

First, in opal_condition_wait (condition.h:97) we do not release the 
passed mutex if opal_using_threads() is not set. Is there a reason for 
this? I ask since this violates the way condition variables are supposed 
to work, and it seems like there are situations where this could cause 
deadlock.


Also, when we are using threads, there is a case where we do not 
decrement the signaled count, in condition.h:84. Gleb put this in in 
r9451, however the change does not make sense to me. I think that the 
signal count should always be decremented.


Can anyone shine any light on these issues?

Thanks,

Tim


Re: [OMPI devel] [PATCH] openib btl: remove excess ompi_btl_openib_connect_base_open call

2007-12-06 Thread Pavel Shamis (Pasha)

:-)
Nice catch. Please commit the fix.

Pasha.

Jeff Squyres wrote:

Hah!  Sweet; good catch -- feel free to delete that extra call.


On Dec 5, 2007, at 6:42 PM, Jon Mason wrote:

  

There is a double call to ompi_btl_openib_connect_base_open in
mca_btl_openib_mca_setup_qps().  It looks like someone just forgot to
clean-up the previous call when they added the check for the return
code.

I ran a quick IMB test over IB to verify everything is still working.

Thanks,
Jon


Index: ompi/mca/btl/openib/btl_openib_mca.c
===
--- ompi/mca/btl/openib/btl_openib_mca.c(revision 16855)
+++ ompi/mca/btl/openib/btl_openib_mca.c(working copy)
@@ -672,10 +672,7 @@
mca_btl_openib_component.credits_qp = smallest_pp_qp;

/* Register any MCA params for the connect pseudo-components */
-
-ompi_btl_openib_connect_base_open();
-
-if ( OMPI_SUCCESS != ompi_btl_openib_connect_base_open())
+if (OMPI_SUCCESS != ompi_btl_openib_connect_base_open())
goto error;

ret = OMPI_SUCCESS;




  




Re: [OMPI devel] IB pow wow notes

2007-12-06 Thread Jeff Squyres

On Dec 2, 2007, at 5:11 PM, Richard Graham wrote:

One question – there is a mention of a new PML that is essentially
CM+matching.

Why is this not just another instance of CM?


I'm not sure I understand your question -- the new proposed PML would  
be different than CM: it would have matching and support more than one  
underlying device (e.g., more than one MTL).


Could this just be CM with some run-time parameter enabled?   
Possibly.  Is it worth it?  I'm not sure -- CM is nice in that it's so  
small / simple.  Do we really want to make it more complex?


All of this is speculation / vaporware at the moment anyway -- just  
tossing around some ideas...


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [PATCH] openib btl: remove excess ompi_btl_openib_connect_base_open call

2007-12-06 Thread Jeff Squyres

Hah!  Sweet; good catch -- feel free to delete that extra call.


On Dec 5, 2007, at 6:42 PM, Jon Mason wrote:


There is a double call to ompi_btl_openib_connect_base_open in
mca_btl_openib_mca_setup_qps().  It looks like someone just forgot to
clean-up the previous call when they added the check for the return
code.

I ran a quick IMB test over IB to verify everything is still working.

Thanks,
Jon


Index: ompi/mca/btl/openib/btl_openib_mca.c
===
--- ompi/mca/btl/openib/btl_openib_mca.c(revision 16855)
+++ ompi/mca/btl/openib/btl_openib_mca.c(working copy)
@@ -672,10 +672,7 @@
mca_btl_openib_component.credits_qp = smallest_pp_qp;

/* Register any MCA params for the connect pseudo-components */
-
-ompi_btl_openib_connect_base_open();
-
-if ( OMPI_SUCCESS != ompi_btl_openib_connect_base_open())
+if (OMPI_SUCCESS != ompi_btl_openib_connect_base_open())
goto error;

ret = OMPI_SUCCESS;



--
Jeff Squyres
Cisco Systems


Re: [OMPI devel] Using MTT to test the newly added SCTP BTL

2007-12-06 Thread Jeff Squyres

On Dec 5, 2007, at 1:42 PM, Karol Mroz wrote:

Removal of .ompi_ignore should not create build problems for anyone who
is running without some form of SCTP support. To test this claim, we
built Open MPI with .ompi_ignore removed and no SCTP support on both an
Ubuntu Linux and an OS X machine. Both builds succeeded without any
problem.


In light of the above, are there any objections to us removing the
.ompi_ignore file from the SCTP BTL code?


Thanks for your persistence on this.  :-)

I think that since no one has objected, you should feel free to do so.

I tried to work around this problem by using a pre-installed version of
Open MPI to run MTT tests on (ibm tests initially) but all I get is a
short summary from MTT that things succeeded, instead of a detailed list
of specific test successes/failures as is shown when using a nightly
tarball.


MTT has several different reporters; the default "file" reporter  
simply outputs a summary to stdout upon completion.  The intention is  
that the file reporter would be used by developers for quick/interactive
tests to verify that you hadn't broken anything; more
details are available in the meta data files in the scratch tree if  
you know where to look.


We intended that MTT's database reporter would usually be used for  
common testing, etc.  The web interface is [by far] the easiest way to  
drill down in the results to see the details of what you need to know  
about individual failures, etc.



The 'tests' also complete much faster which sparks some concern
as to whether they were actually run.


If you just manually add the sctp btl directory to an existing  
tarball, I'm pretty sure that it won't build.  OMPI's build system is  
highly dependent upon its "autogen" procedure, which creates a
hard-coded list of components to build.  For a tarball, that procedure has
already completed, and even if you add in more component directories  
after you expand the tarball, the hard-coded lists won't be updated,  
and therefore OMPI's configure/build system will skip them.



Furthermore, MTT puts the source
into a new 'random' directory prior to building (way around this?),


No.  The internal directory structure of the scratch tree, as you  
noted, uses random directory names.  This is for two reasons:


1. because MTT can't know ahead of time what you are going to tell it  
to do
2. one obvious way to have non-random directory names is to use the  
names of the INI file sections as various directory levels.  However,  
this creates Very, Very Long directory names in the scratch tree and  
some compilers have a problem with this (even though the total  
filenames are within the filesystem limit).  Hence, we came up with  
the scheme of using short, random directory names that will guarantee  
that the total filename length is short.


Note that for human convenience, MTT *also* puts in sym links to the  
short random directory names that correspond to the INI section  
names.  So if a human needs to go into the scratch tree to investigate  
some failures, it should be pretty easy to navigate using the sym  
links (vs. the short/random names).



so I
can't add the SCTP directory by hand, and then run the
build/installation phase. Adding the code on the fly during the
installation phase also does not work.

Any advice in this matter?

Thanks again everyone.


--
Karol Mroz
km...@cs.ubc.ca





--
Jeff Squyres
Cisco Systems