Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941

2008-03-27 Thread Ralph H Castain
Hmmm...puzzling. It is working fine for me on TM machines and on my Mac.
However, Galen reports it borked on alps as well.

I'll have to dig a little to check this out and see if there is something
missing on those PLMs. Will get back shortly.

Sorry for the problem.


On 3/27/08 10:28 AM, "Tim Prins" wrote:

> Unfortunately, now with r17988 I cannot run any MPI programs; they seem
> to hang in the modex.
> 
> Tim
> 
> Ralph H Castain wrote:
>> Thanks Tim - I found the problem and will commit a fix shortly.
>> 
>> Appreciate your testing and reporting!
>> 
>> 
>> On 3/27/08 8:24 AM, "Tim Prins" wrote:
>> 
>>> This commit breaks things for me. Running on 3 nodes of odin:
>>> 
>>> mpirun -mca btl tcp,sm,self  examples/ring_c
>>> 
>>> causes a hang. All of the processes are stuck in
>>> orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
>>> and the ring program does not hang all the time, but fairly often.
>>> 
>>> Tim
>>> 
>>> r...@osl.iu.edu wrote:
>>>> Author: rhc
>>>> Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
>>>> New Revision: 17941
>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/17941
>>>>
>>>> Log:
>>>> Fix the allgather and allgather_list functions to avoid deadlocks at large
>>>> node/proc counts. Violated the RML rules here - we received the allgather
>>>> buffer and then did an xcast, which causes a send to go out, and is then
>>>> subsequently received by the sender. This fix breaks that pattern by forcing
>>>> the recv to complete outside of the function itself - thus, the allgather and
>>>> allgather_list always complete their recvs before returning or sending.
>>>>
>>>> Reorganize the grpcomm code a little to provide support for soon-to-come new
>>>> grpcomm components. The revised organization puts what will be common code
>>>> elements in the base to avoid duplication, while allowing components that
>>>> don't need those functions to ignore them.
>>>>
>>>> Added:
>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
>>>> Text files modified:
>>>>    trunk/orte/mca/grpcomm/base/Makefile.am                |   5
>>>>    trunk/orte/mca/grpcomm/base/base.h                     |  23 +
>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_close.c       |   4
>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_open.c        |   1
>>>>    trunk/orte/mca/grpcomm/base/grpcomm_base_select.c      | 121 ++---
>>>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic.h           |  16
>>>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c |  30 -
>>>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c    | 845 ++-
>>>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h             |   8
>>>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c   |   8
>>>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c      |  21
>>>>    trunk/orte/mca/grpcomm/grpcomm.h                       |  45 +
>>>>    trunk/orte/mca/rml/rml_types.h                         |  31
>>>>    trunk/orte/orted/orted_comm.c                          |  27 +
>>>>    14 files changed, 226 insertions(+), 959 deletions(-)
>>>>
>>>>
>>>> Diff not shown due to size (92619 bytes).
>>>> To see the diff, run the following command:
>>>>
>>>> svn diff -r 17940:17941 --no-diff-deleted
>>>>
>>>> ___
>>>> svn mailing list
>>>> s...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941

2008-03-27 Thread Tim Prins
Unfortunately, now with r17988 I cannot run any MPI programs; they seem
to hang in the modex.


Tim

Ralph H Castain wrote:

Thanks Tim - I found the problem and will commit a fix shortly.

Appreciate your testing and reporting!


On 3/27/08 8:24 AM, "Tim Prins" wrote:


This commit breaks things for me. Running on 3 nodes of odin:

mpirun -mca btl tcp,sm,self  examples/ring_c

causes a hang. All of the processes are stuck in
orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
and the ring program does not hang all the time, but fairly often.

Tim

r...@osl.iu.edu wrote:

Author: rhc
Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
New Revision: 17941
URL: https://svn.open-mpi.org/trac/ompi/changeset/17941

Log:
Fix the allgather and allgather_list functions to avoid deadlocks at large
node/proc counts. Violated the RML rules here - we received the allgather
buffer and then did an xcast, which causes a send to go out, and is then
subsequently received by the sender. This fix breaks that pattern by forcing
the recv to complete outside of the function itself - thus, the allgather and
allgather_list always complete their recvs before returning or sending.

Reorganize the grpcomm code a little to provide support for soon-to-come new
grpcomm components. The revised organization puts what will be common code
elements in the base to avoid duplication, while allowing components that
don't need those functions to ignore them.

Added:
   trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
   trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
   trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
Text files modified:
   trunk/orte/mca/grpcomm/base/Makefile.am                |   5
   trunk/orte/mca/grpcomm/base/base.h                     |  23 +
   trunk/orte/mca/grpcomm/base/grpcomm_base_close.c       |   4
   trunk/orte/mca/grpcomm/base/grpcomm_base_open.c        |   1
   trunk/orte/mca/grpcomm/base/grpcomm_base_select.c      | 121 ++---
   trunk/orte/mca/grpcomm/basic/grpcomm_basic.h           |  16
   trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c |  30 -
   trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c    | 845 ++-
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h             |   8
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c   |   8
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c      |  21
   trunk/orte/mca/grpcomm/grpcomm.h                       |  45 +
   trunk/orte/mca/rml/rml_types.h                         |  31
   trunk/orte/orted/orted_comm.c                          |  27 +
   14 files changed, 226 insertions(+), 959 deletions(-)


Diff not shown due to size (92619 bytes).
To see the diff, run the following command:

svn diff -r 17940:17941 --no-diff-deleted





Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941

2008-03-27 Thread Ralph H Castain
Thanks Tim - I found the problem and will commit a fix shortly.

Appreciate your testing and reporting!


On 3/27/08 8:24 AM, "Tim Prins" wrote:

> This commit breaks things for me. Running on 3 nodes of odin:
> 
> mpirun -mca btl tcp,sm,self  examples/ring_c
> 
> causes a hang. All of the processes are stuck in
> orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
> and the ring program does not hang all the time, but fairly often.
> 
> Tim
> 
> r...@osl.iu.edu wrote:
>> Author: rhc
>> Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
>> New Revision: 17941
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/17941
>> 
>> Log:
>> Fix the allgather and allgather_list functions to avoid deadlocks at large
>> node/proc counts. Violated the RML rules here - we received the allgather
>> buffer and then did an xcast, which causes a send to go out, and is then
>> subsequently received by the sender. This fix breaks that pattern by forcing
>> the recv to complete outside of the function itself - thus, the allgather and
>> allgather_list always complete their recvs before returning or sending.
>> 
>> Reorganize the grpcomm code a little to provide support for soon-to-come new
>> grpcomm components. The revised organization puts what will be common code
>> elements in the base to avoid duplication, while allowing components that
>> don't need those functions to ignore them.
>> 
>> Added:
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
>> Text files modified:
>>    trunk/orte/mca/grpcomm/base/Makefile.am                |   5
>>    trunk/orte/mca/grpcomm/base/base.h                     |  23 +
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_close.c       |   4
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_open.c        |   1
>>    trunk/orte/mca/grpcomm/base/grpcomm_base_select.c      | 121 ++---
>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic.h           |  16
>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c |  30 -
>>    trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c    | 845 ++-
>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h             |   8
>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c   |   8
>>    trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c      |  21
>>    trunk/orte/mca/grpcomm/grpcomm.h                       |  45 +
>>    trunk/orte/mca/rml/rml_types.h                         |  31
>>    trunk/orte/orted/orted_comm.c                          |  27 +
>>    14 files changed, 226 insertions(+), 959 deletions(-)
>> 
>> 
>> Diff not shown due to size (92619 bytes).
>> To see the diff, run the following command:
>> 
>> svn diff -r 17940:17941 --no-diff-deleted




Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941

2008-03-27 Thread Tim Prins

This commit breaks things for me. Running on 3 nodes of odin:

mpirun -mca btl tcp,sm,self  examples/ring_c

causes a hang. All of the processes are stuck in 
orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang, 
and the ring program does not hang all the time, but fairly often.


Tim

r...@osl.iu.edu wrote:

Author: rhc
Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
New Revision: 17941
URL: https://svn.open-mpi.org/trac/ompi/changeset/17941

Log:
Fix the allgather and allgather_list functions to avoid deadlocks at large 
node/proc counts. Violated the RML rules here - we received the allgather 
buffer and then did an xcast, which causes a send to go out, and is then 
subsequently received by the sender. This fix breaks that pattern by forcing 
the recv to complete outside of the function itself - thus, the allgather and 
allgather_list always complete their recvs before returning or sending.

Reorganize the grpcomm code a little to provide support for soon-to-come new
grpcomm components. The revised organization puts what will be common code 
elements in the base to avoid duplication, while allowing components that don't 
need those functions to ignore them.

Added:
   trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
   trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
   trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
Text files modified: 
   trunk/orte/mca/grpcomm/base/Makefile.am                |   5
   trunk/orte/mca/grpcomm/base/base.h                     |  23 +
   trunk/orte/mca/grpcomm/base/grpcomm_base_close.c       |   4
   trunk/orte/mca/grpcomm/base/grpcomm_base_open.c        |   1
   trunk/orte/mca/grpcomm/base/grpcomm_base_select.c      | 121 ++---
   trunk/orte/mca/grpcomm/basic/grpcomm_basic.h           |  16
   trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c |  30 -
   trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c    | 845 ++-
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h             |   8
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c   |   8
   trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c      |  21
   trunk/orte/mca/grpcomm/grpcomm.h                       |  45 +
   trunk/orte/mca/rml/rml_types.h                         |  31
   trunk/orte/orted/orted_comm.c                          |  27 +
   14 files changed, 226 insertions(+), 959 deletions(-)



Diff not shown due to size (92619 bytes).
To see the diff, run the following command:

svn diff -r 17940:17941 --no-diff-deleted
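
For readers without the diff at hand, the pattern described in the r17941 log
(finish the recv outside the callback path, and only then send) can be
sketched with a toy, single-process model. This is a sketch only and an
assumption about the shape of the fix, not the code in the commit: every name
below (toy_send, toy_recv_nb, toy_progress, toy_allgather) is hypothetical and
none of it is the ORTE RML or grpcomm API. The recv callback just stores the
buffer and sets a flag; the caller spins the progress loop until the flag is
set, and the xcast-like send happens afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical toy message layer for illustration; not the ORTE RML API. */
#define QUEUE_CAP 8
#define MSG_LEN 32

typedef void (*toy_recv_cb_t)(const char *msg, void *cbdata);

static char queue[QUEUE_CAP][MSG_LEN];
static int q_head = 0, q_tail = 0;      /* pending messages = q_tail - q_head */
static toy_recv_cb_t posted_cb = NULL;
static void *posted_cbdata = NULL;
static bool in_callback = false;
static int sends_from_callback = 0;     /* counts sends issued inside a recv callback */

/* Queue a message; note whether it was issued from inside a recv callback. */
static void toy_send(const char *msg) {
    assert(q_tail - q_head < QUEUE_CAP);
    if (in_callback) sends_from_callback++;
    snprintf(queue[q_tail % QUEUE_CAP], MSG_LEN, "%s", msg);
    q_tail++;
}

/* Post a single non-blocking receive. */
static void toy_recv_nb(toy_recv_cb_t cb, void *cbdata) {
    posted_cb = cb;
    posted_cbdata = cbdata;
}

/* Progress engine: deliver at most one queued message to the posted receive. */
static void toy_progress(void) {
    if (q_head == q_tail || posted_cb == NULL) return;
    toy_recv_cb_t cb = posted_cb;
    posted_cb = NULL;
    in_callback = true;
    cb(queue[q_head % QUEUE_CAP], posted_cbdata);
    in_callback = false;
    q_head++;
}

struct gather_state {
    bool done;
    char buf[MSG_LEN];
};

/* The callback only stores the buffer and raises a flag; it never sends. */
static void gather_recv_cb(const char *msg, void *cbdata) {
    struct gather_state *st = cbdata;
    snprintf(st->buf, MSG_LEN, "%s", msg);
    st->done = true;
}

/* Toy allgather: finish the receive first, then do the xcast-like send. */
static void toy_allgather(const char *contribution, char *out) {
    struct gather_state st = { .done = false, .buf = "" };
    toy_recv_nb(gather_recv_cb, &st);
    toy_send(contribution);            /* stands in for data arriving from peers */
    while (!st.done) toy_progress();   /* recv completes before any further send */
    toy_send(st.buf);                  /* send only after the recv has completed */
    snprintf(out, MSG_LEN, "%s", st.buf);
}
```

Calling toy_allgather("contribution", out) with a char out[MSG_LEN] buffer
from a small main() leaves sends_from_callback at 0, which is the property the
log message describes. Moving the second toy_send into gather_recv_cb would
increment that counter, which corresponds to the send-from-within-a-receive
pattern the log says violated the RML rules.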
