Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Jeff Squyres

On Jul 29, 2008, at 9:47 AM, Jeff Squyres wrote:

Ok.  FWIW, Pasha and I think that openib has supported "send-to-self"
for a while (we don't know exactly when, but Pasha thinks the code in
add_procs that doesn't check for self is very old).  But it only
broke recently.



More in the FWIW category -- we just checked, and OMPI v1.2 supported  
"--mca btl openib" (note the lack of ",self").  So the openib BTL has,  
indeed, supported send-to-self for quite a while.


This should help narrow where to start looking for the problem:  
changes within the last few weeks.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Jeff Squyres
Ok.  FWIW, Pasha and I think that openib has supported "send-to-self"
for a while (we don't know exactly when, but Pasha thinks the code in
add_procs that doesn't check for self is very old).  But it only
broke recently.



On Jul 29, 2008, at 9:31 AM, George Bosilca wrote:

I ran a few tests and the only combination leading to a deadlock is
openib and self. As openib is the only BTL supporting self
communications (except self of course), I guess it interferes with
self in some more or less strange way. I haven't had the time to dig
deeper yet to see what exactly happens there; I'll schedule this for
later today.


 george.

On Jul 29, 2008, at 8:52 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago  
(Pasha: do you remember?) because Mellanox HCAs are capable of  
send-to-self (process) and there were no code changes necessary to  
enable it.  So it allowed a slightly simpler command line.  This  
was quite a while ago, IIRC.

Yep, Correct.

FYI. In my MTT testing I also see a lot of killed tests.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread George Bosilca
I ran a few tests and the only combination leading to a deadlock is
openib and self. As openib is the only BTL supporting self
communications (except self of course), I guess it interferes with self
in some more or less strange way. I haven't had the time to dig deeper
yet to see what exactly happens there; I'll schedule this for later today.


  george.

On Jul 29, 2008, at 8:52 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago (Pasha:  
do you remember?) because Mellanox HCAs are capable of send-to-self  
(process) and there were no code changes necessary to enable it.   
So it allowed a slightly simpler command line.  This was quite a  
while ago, IIRC.

Yep, Correct.

FYI. In my MTT testing I also see a lot of killed tests.


Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago (Pasha: do 
you remember?) because Mellanox HCAs are capable of send-to-self 
(process) and there were no code changes necessary to enable it.  So 
it allowed a slightly simpler command line.  This was quite a while 
ago, IIRC.

Yep, Correct.

FYI. In my MTT testing I also see a lot of killed tests.


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
On Mon, Jul 28, 2008 at 12:08 PM, Terry Dontje  wrote:

> Jeff Squyres wrote:
>
>> On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
>>
>>> Interesting. The self is only used for local communications. I don't
>>> expect that any benchmark executes such communications, but apparently I was
>>> wrong. Please let me know the failing test; I will take a look this evening.
>>
>> FWIW, my manual tests of a simplistic "ring" program work for all
>> combinations (openib, openib+self, openib+self+sm).  Shrug.
>>
>> But for OSU latency, I found that openib, openib+sm work, but
>> openib+sm+self hangs (same results whether the 2 procs are on the same node
>> or different nodes).  There is no self communication in osu_latency, so
>> something else must be going on.
>
> Is it something to do with the MPI_Barrier call?  osu_latency uses
> MPI_Barrier and from rhc's email it sounds like his code does too.


I don't think it's an issue with MPI_Barrier().  I'm running into this
problem with srtest.c (one of the example programs from the mpich
distribution).  It's a ring-type test with no barriers until the end, yet it
hangs on the very first Send/Recv pair from rank0 to rank1.

In my case, openib and openib+sm work, but openib+self & openib+sm+self
hang.

--brad


>
> --td


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Terry Dontje

Jeff Squyres wrote:

On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:

Interesting. The self is only used for local communications. I don't
expect that any benchmark executes such communications, but apparently
I was wrong. Please let me know the failing test; I will take a look
this evening.


FWIW, my manual tests of a simplistic "ring" program work for all 
combinations (openib, openib+self, openib+self+sm).  Shrug.


But for OSU latency, I found that openib, openib+sm work, but 
openib+sm+self hangs (same results whether the 2 procs are on the same 
node or different nodes).  There is no self communication in 
osu_latency, so something else must be going on.


Is it something to do with the MPI_Barrier call?  osu_latency uses 
MPI_Barrier and from rhc's email it sounds like his code does too.


--td


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 11:05 AM, Ralph Castain wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to
use self btl.




Don't know - could be true. But if that is true, then we should  
check to see if that condition is met and error out - with an  
appropriate message - if so. Otherwise, how is a user supposed to  
know this condition?


This used to be true, but I think we changed it a while ago (Pasha: do  
you remember?) because Mellanox HCAs are capable of send-to-self  
(process) and there were no code changes necessary to enable it.  So  
it allowed a slightly simpler command line.  This was quite a while  
ago, IIRC.


All current iWARP adapters do not allow loopback communication at all  
(i.e., communication to either the same proc or other procs on the  
same host), so we added the following test in openib's add_procs:


if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
    0 != (ompi_proc->proc_flags && OMPI_PROC_FLAG_LOCAL)) {
    continue;
}

(meaning: skip this proc if it's on the same host; let btl self handle  
it, etc.)
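
For illustration, the intent of that test can be sketched in standalone C. This is a minimal sketch, not the openib BTL's code: the `EXAMPLE_*` constants and `example_skip_peer` are hypothetical stand-ins for `IBV_TRANSPORT_IWARP`, `OMPI_PROC_FLAG_LOCAL`, and the loop body in add_procs, and their values are made up. One thing worth noting: a single flag bit is normally tested with bitwise `&`; a logical `&&` between the flags word and the constant is true whenever both are nonzero, so the sketch uses `&`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real Open MPI / verbs definitions;
 * the names and values here are illustrative only. */
enum { EXAMPLE_TRANSPORT_IB = 0, EXAMPLE_TRANSPORT_IWARP = 1 };
#define EXAMPLE_PROC_FLAG_LOCAL 0x01u

/* Returns true when the peer should be skipped: the device is iWARP
 * and the peer has the "local" bit set in its flags. */
static bool example_skip_peer(int transport_type, uint32_t proc_flags)
{
    assert(transport_type == EXAMPLE_TRANSPORT_IB ||
           transport_type == EXAMPLE_TRANSPORT_IWARP);
    /* Bitwise '&' isolates the flag bit. A logical '&&' would instead be
     * true whenever proc_flags is nonzero, whichever bits are set. */
    return transport_type == EXAMPLE_TRANSPORT_IWARP &&
           (proc_flags & EXAMPLE_PROC_FLAG_LOCAL) != 0;
}
```

With these stand-in values, an iWARP device with a local peer is skipped, while an InfiniBand device with the same peer is not.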


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
My test wasn't a benchmark - I was just testing with a little program  
that calls mpi_init, mpi_barrier, and mpi_finalize.


A test with just mpi_init/finalize works fine, so it looks like we  
simply hang when trying to communicate. This also only happens on  
multi-node operations.


On Jul 28, 2008, at 10:16 AM, Jeff Squyres wrote:


On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:

Interesting. The self is only used for local communications. I
don't expect that any benchmark executes such communications, but
apparently I was wrong. Please let me know the failing test; I will
take a look this evening.


FWIW, my manual tests of a simplistic "ring" program work for all  
combinations (openib, openib+self, openib+self+sm).  Shrug.


But for OSU latency, I found that openib, openib+sm work, but
openib+sm+self hangs (same results whether the 2 procs are on the same
node or different nodes).  There is no self communication in
osu_latency, so something else must be going on.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:

Interesting. The self is only used for local communications. I don't
expect that any benchmark executes such communications, but
apparently I was wrong. Please let me know the failing test; I will
take a look this evening.


FWIW, my manual tests of a simplistic "ring" program work for all  
combinations (openib, openib+self, openib+self+sm).  Shrug.


But for OSU latency, I found that openib, openib+sm work, but
openib+sm+self hangs (same results whether the 2 procs are on the same
node or different nodes).  There is no self communication in
osu_latency, so something else must be going on.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread George Bosilca
Interesting. The self is only used for local communications. I don't
expect that any benchmark executes such communications, but apparently
I was wrong. Please let me know the failing test; I will take a look
this evening.


  Thanks,
george.

On Jul 28, 2008, at 5:56 PM, Ralph Castain wrote:


I just re-tested to confirm, and that is correct.

-mca btl openib       works
-mca btl openib,self  hangs
-mca btl openib,sm    works


On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:

I'm a little bit lost here. You're stating that openib,self doesn't  
work while openib does? In other words that adding self to the BTL  
leads to deadlocks?


george.

PS: Btw, it is not supposed to work at all, except in the case
where openib handles internal messages (where the source and
destination are the same process).


On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:



On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to
use self btl.




Don't know - could be true. But if that is true, then we should  
check to see if that condition is met and error out - with an  
appropriate message - if so. Otherwise, how is a user supposed to  
know this condition?





On 7/28/08, Jeff Squyres  wrote:
FWIW, all my MTT runs are hanging as well.




On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via  
self,openib





On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what  
I had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB  
was locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it  
further - planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain

I just re-tested to confirm, and that is correct.

-mca btl openib       works
-mca btl openib,self  hangs
-mca btl openib,sm    works


On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:

I'm a little bit lost here. You're stating that openib,self doesn't  
work while openib does? In other words that adding self to the BTL  
leads to deadlocks?


 george.

PS: Btw, it is not supposed to work at all, except in the case where
openib handles internal messages (where the source and destination are
the same process).


On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:



On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to
use self btl.




Don't know - could be true. But if that is true, then we should  
check to see if that condition is met and error out - with an  
appropriate message - if so. Otherwise, how is a user supposed to  
know this condition?





On 7/28/08, Jeff Squyres  wrote:
FWIW, all my MTT runs are hanging as well.




On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via  
self,openib





On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what  
I had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB  
was locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it  
further - planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread George Bosilca
I'm a little bit lost here. You're stating that openib,self doesn't  
work while openib does? In other words that adding self to the BTL  
leads to deadlocks?


  george.

PS: Btw, it is not supposed to work at all, except in the case where
openib handles internal messages (where the source and destination are
the same process).


On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:



On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to
use self btl.




Don't know - could be true. But if that is true, then we should  
check to see if that condition is met and error out - with an  
appropriate message - if so. Otherwise, how is a user supposed to  
know this condition?





On 7/28/08, Jeff Squyres  wrote:
FWIW, all my MTT runs are hanging as well.




On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via  
self,openib





On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it further  
- planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain


On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:


only openib works for me too,

but Glebs said to me once that it's illegal and I always need to use
self btl.




Don't know - could be true. But if that is true, then we should check  
to see if that condition is met and error out - with an appropriate  
message - if so. Otherwise, how is a user supposed to know this  
condition?
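
As a rough illustration of the kind of check suggested here, the sketch below tests whether a comma-separated BTL list contains a given component name, so that a caller could print a clear error when `self` is missing. `example_btl_list_contains` is a hypothetical helper written for this note, not an Open MPI function, and whether requiring `self` is the right policy is exactly what this thread leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if `name` appears as a whole entry in
 * the comma-separated list `list` (for example "openib,self"). */
static bool example_btl_list_contains(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);
    size_t name_len = strlen(name);
    const char *p = list;
    while (*p != '\0') {
        /* Length of the current entry, up to the next comma or the end. */
        size_t entry_len = strcspn(p, ",");
        if (entry_len == name_len && strncmp(p, name, name_len) == 0) {
            return true;
        }
        p += entry_len;
        if (*p == ',') {
            p++;  /* step over the separator */
        }
    }
    return false;
}
```

A caller could run this on the value given to `--mca btl` and report a readable message instead of hanging; entries are compared whole, so `selfish` would not be mistaken for `self`.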





On 7/28/08, Jeff Squyres  wrote:
FWIW, all my MTT runs are hanging as well.



On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via self,openib




On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications. Hadn't  
seen that before, and at least I haven't explored it further -  
planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:

I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
only openib works for me too,

but Glebs said to me once that it's illegal and I always need to use self
btl.

On 7/28/08, Jeff Squyres  wrote:
>
> FWIW, all my MTT runs are hanging as well.
>
>
> On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
>
>> My experience is the same as Lenny's.  I've tested on x86_64 and ppc64
>> systems and tests using --mca btl openib,self hang in all cases.
>>
>> --brad
>>
>>
>> 2008/7/28 Lenny Verkhovsky 
>> I failed to run on different nodes or on the same node via self,openib
>>
>>
>>
>>
>> On 7/28/08, Ralph Castain  wrote:
>> I checked this out some more and I believe it is ticket #1378 related. We
>> lock up if SM is included in the BTL's, which is what I had done on my test.
>> If I ^sm, I can run fine.
>>
>>
>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>
>>  It could also be something new. Brad and I noted on Fri that IB was
>>> locking up as soon as we tried any cross-node communications. Hadn't seen
>>> that before, and at least I haven't explored it further - planned to do so
>>> today.
>>>
>>>
>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>
>>>  I believe it is.

 On 7/28/08, Jeff Squyres  wrote:
 On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

 Is this related to r1378?

 Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



 On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

 Hi,

 I experience hanging of tests ( latency ) since r19010


 Best Regards

 Lenny.



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

FWIW, all my MTT runs are hanging as well.


On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via self,openib




On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it further  
- planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:


I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
Interesting - you are quite correct and I should have been more  
precise. I ran with -mca btl openib and it worked. So having just  
openib seems to be okay.




On Jul 28, 2008, at 8:37 AM, Brad Benton wrote:

My experience is the same as Lenny's.  I've tested on x86_64 and
ppc64 systems and tests using --mca btl openib,self hang in all
cases.


--brad


2008/7/28 Lenny Verkhovsky 
I failed to run on different nodes or on the same node via self,openib




On 7/28/08, Ralph Castain  wrote:
I checked this out some more and I believe it is ticket #1378  
related. We lock up if SM is included in the BTL's, which is what I  
had done on my test. If I ^sm, I can run fine.



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications.  
Hadn't seen that before, and at least I haven't explored it further  
- planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:


I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.





Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
My experience is the same as Lenny's.  I've tested on x86_64 and ppc64
systems and tests using --mca btl openib,self hang in all cases.

--brad


2008/7/28 Lenny Verkhovsky 

> I failed to run on different nodes or on the same node via self,openib
>
>
>
> On 7/28/08, Ralph Castain  wrote:
>>
>> I checked this out some more and I believe it is ticket #1378 related. We
>> lock up if SM is included in the BTL's, which is what I had done on my test.
>> If I ^sm, I can run fine.
>>
>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>
>> It could also be something new. Brad and I noted on Fri that IB was
>> locking up as soon as we tried any cross-node communications. Hadn't seen
>> that before, and at least I haven't explored it further - planned to do so
>> today.
>>
>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>
>> I believe it is.
>>
>> On 7/28/08, Jeff Squyres  wrote:
>>>
>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>
>>>  Is this related to r1378?

>>>
>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>
>>>
>>>  On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

  Hi,
>
> I experience hanging of tests ( latency ) since r19010
>
>
> Best Regards
>
> Lenny.
>


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
I failed to run on different nodes or on the same node via self,openib



On 7/28/08, Ralph Castain  wrote:
>
> I checked this out some more and I believe it is ticket #1378 related. We
> lock up if SM is included in the BTL's, which is what I had done on my test.
> If I ^sm, I can run fine.
>
> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>
> It could also be something new. Brad and I noted on Fri that IB was locking
> up as soon as we tried any cross-node communications. Hadn't seen that
> before, and at least I haven't explored it further - planned to do so today.
>
> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>
> I believe it is.
>
> On 7/28/08, Jeff Squyres  wrote:
>>
>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>
>>  Is this related to r1378?
>>>
>>
>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>
>>
>>  On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>
>>>  Hi,

>>> I experience hanging of tests ( latency ) since r19010
>>>
>>> Best Regards
>>>
>>> Lenny.
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
I checked this out some more and I believe it is related to ticket #1378.
We lock up if SM is included in the BTLs, which is what I had done in
my test. If I exclude it with ^sm, I can run fine.
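
For anyone trying to reproduce this, the BTL selections discussed in this thread can be written roughly as below. This is only a sketch: the benchmark path ./latency and the process count are assumptions, not the exact test that was run, and the script only prints the commands instead of launching them.

```shell
#!/bin/sh
# Hypothetical benchmark binary; substitute the actual latency test.
BENCH=./latency

# Exclude the shared-memory BTL (the "^sm" workaround mentioned above).
CMD_NO_SM="mpirun -np 2 --mca btl ^sm $BENCH"

# Select only the self and openib BTLs (the combination reported to fail).
CMD_SELF_OPENIB="mpirun -np 2 --mca btl self,openib $BENCH"

# Print the commands rather than running them, so this is safe
# on a machine without an MPI installation.
echo "$CMD_NO_SM"
echo "$CMD_SELF_OPENIB"
```

When typing the first command at an interactive prompt, quoting the value ('^sm') avoids any special handling of "^" by the shell.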



On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:

It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications. Hadn't  
seen that before, and at least I haven't explored it further -  
planned to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:


I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.


--
Jeff Squyres
Cisco Systems



--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Ralph Castain
It could also be something new. Brad and I noted on Fri that IB was  
locking up as soon as we tried any cross-node communications. Hadn't  
seen that before, and at least I haven't explored it further - planned  
to do so today.



On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:


I believe it is.

On 7/28/08, Jeff Squyres  wrote:
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:

Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.


--
Jeff Squyres
Cisco Systems



--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
I believe it is.

On 7/28/08, Jeff Squyres  wrote:
>
> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>
>  Is this related to r1378?
>>
>
> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>
>
>  On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>
>>  Hi,
>>>
>>> I experience hanging of tests ( latency ) since r19010
>>>
>>>
>>> Best Regards
>>>
>>> Lenny.
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>
> --
> Jeff Squyres
> Cisco Systems
>


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:


Is this related to r1378?


Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.



On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:


Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Jeff Squyres

Is this related to r1378?


On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:


Hi,

I experience hanging of tests ( latency ) since r19010


Best Regards

Lenny.



--
Jeff Squyres
Cisco Systems



[OMPI devel] trunk hangs since r19010

2008-07-28 Thread Lenny Verkhovsky
Hi,

I have been seeing tests (latency) hang since r19010.

Best Regards

Lenny.