Re: [OMPI devel] trunk hangs since r19010
On Jul 29, 2008, at 9:47 AM, Jeff Squyres wrote:
> Ok. FWIW, Pasha and I think that openib has supported "send-to-self" for a while (we don't know exactly when; but Pasha thinks it is very old code that we don't check for self in add_procs). But it only broke recently.

More in the FWIW category -- we just checked, and OMPI v1.2 supported "--mca btl openib" (note the lack of ",self"). So the openib BTL has, indeed, supported send-to-self for quite a while.

This should help narrow where to start looking for the problem: changes within the last few weeks.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Ok. FWIW, Pasha and I think that openib has supported "send-to-self" for a while (we don't know exactly when; but Pasha thinks it is very old code that we don't check for self in add_procs). But it only broke recently.

On Jul 29, 2008, at 9:31 AM, George Bosilca wrote:
> I ran a few tests and the only combination leading to a deadlock is openib and self. As openib is the only BTL supporting self communications (except self of course), I guess it interferes with self in some more or less strange ways. I didn't have the time to dig deeper yet to see what exactly happens there; I'll schedule this later today.
> george.
> On Jul 29, 2008, at 8:52 AM, Pavel Shamis (Pasha) wrote:
>> Jeff Squyres wrote:
>>> This used to be true, but I think we changed it a while ago (Pasha: do you remember?) because Mellanox HCAs are capable of send-to-self (process) and there were no code changes necessary to enable it. So it allowed a slightly simpler command line. This was quite a while ago, IIRC.
>> Yep, Correct.
>> FYI. In my MTT testing I also see a lot of killed tests.
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
I ran a few tests and the only combination leading to a deadlock is openib and self. As openib is the only BTL supporting self communications (except self of course), I guess it interferes with self in some more or less strange ways. I didn't have the time to dig deeper yet to see what exactly happens there; I'll schedule this later today.

george.

On Jul 29, 2008, at 8:52 AM, Pavel Shamis (Pasha) wrote:
> Jeff Squyres wrote:
>> This used to be true, but I think we changed it a while ago (Pasha: do you remember?) because Mellanox HCAs are capable of send-to-self (process) and there were no code changes necessary to enable it. So it allowed a slightly simpler command line. This was quite a while ago, IIRC.
> Yep, Correct.
> FYI. In my MTT testing I also see a lot of killed tests.
Re: [OMPI devel] trunk hangs since r19010
Jeff Squyres wrote:
> This used to be true, but I think we changed it a while ago (Pasha: do you remember?) because Mellanox HCAs are capable of send-to-self (process) and there were no code changes necessary to enable it. So it allowed a slightly simpler command line. This was quite a while ago, IIRC.

Yep, Correct.

FYI. In my MTT testing I also see a lot of killed tests.
Re: [OMPI devel] trunk hangs since r19010
On Mon, Jul 28, 2008 at 12:08 PM, Terry Dontje wrote:
> Jeff Squyres wrote:
>> On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
>>> Interesting. The self is only used for local communications. I don't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test, I will take a look this evening.
>> FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.
>> But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.
> Is it something to do with the MPI_Barrier call? osu_latency uses MPI_Barrier and from rhc's email it sounds like his code does too.

I don't think it's an issue with MPI_Barrier(). I'm running into this problem with srtest.c (one of the example programs from the mpich distribution). It's a ring-type test with no barriers until the end, yet it hangs on the very first Send/Recv pair from rank0 to rank1. In my case, openib and openib+sm work, but openib+self and openib+sm+self hang.

--brad

> --td
Re: [OMPI devel] trunk hangs since r19010
Jeff Squyres wrote:
> On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
>> Interesting. The self is only used for local communications. I don't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test, I will take a look this evening.
> FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.
> But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.

Is it something to do with the MPI_Barrier call? osu_latency uses MPI_Barrier and from rhc's email it sounds like his code does too.

--td
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 11:05 AM, Ralph Castain wrote:
>> only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.
> Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?

This used to be true, but I think we changed it a while ago (Pasha: do you remember?) because Mellanox HCAs are capable of send-to-self (process) and there were no code changes necessary to enable it. So it allowed a slightly simpler command line. This was quite a while ago, IIRC.

Current iWARP adapters do not allow loopback communication at all (i.e., communication to either the same proc or other procs on the same host), so we added the following test in openib's add_procs:

    if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
        0 != (ompi_proc->proc_flags & OMPI_PROC_FLAG_LOCAL)) {
        continue;
    }

(meaning: skip this proc if it's on the same host; let btl self handle it, etc.)

--
Jeff Squyres
Cisco Systems
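The locality test in add_procs depends on masking a single bit out of a flags word. A minimal sketch of why such a test needs bitwise & rather than logical && follows; the flag values and function names are made up for illustration and are not taken from the Open MPI headers.

```c
#include <stdint.h>

/* Hypothetical flag values, for illustration only; the real
 * OMPI_PROC_FLAG_LOCAL is defined in the Open MPI headers. */
#define EXAMPLE_PROC_FLAG_OTHER 0x01u
#define EXAMPLE_PROC_FLAG_LOCAL 0x02u

/* Bitwise test: non-zero only when the LOCAL bit is actually set. */
static int is_local_bitwise(uint32_t proc_flags)
{
    return 0 != (proc_flags & EXAMPLE_PROC_FLAG_LOCAL);
}

/* Logical test: non-zero whenever proc_flags is non-zero at all,
 * because EXAMPLE_PROC_FLAG_LOCAL is itself a non-zero constant. */
static int is_local_logical(uint32_t proc_flags)
{
    return 0 != (proc_flags && EXAMPLE_PROC_FLAG_LOCAL);
}
```

With the logical form, any proc whose flags word is non-zero would be treated as local and skipped, whichever bits are actually set.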
Re: [OMPI devel] trunk hangs since r19010
My test wasn't a benchmark - I was just testing with a little program that calls mpi_init, mpi_barrier, and mpi_finalize. A test with just mpi_init/finalize works fine, so it looks like we simply hang when trying to communicate. This also only happens on multi-node operations.

On Jul 28, 2008, at 10:16 AM, Jeff Squyres wrote:
> On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
>> Interesting. The self is only used for local communications. I don't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test, I will take a look this evening.
> FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.
> But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
> Interesting. The self is only used for local communications. I don't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test, I will take a look this evening.

FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.

But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Interesting. The self is only used for local communications. I don't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test, I will take a look this evening.

Thanks,
george.

On Jul 28, 2008, at 5:56 PM, Ralph Castain wrote:
> I just re-tested to confirm, and that is correct.
> -mca btl openib        works
> -mca btl openib,self   hangs
> -mca btl openib,sm     works
> On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:
>> I'm a little bit lost here. You're stating that openib,self doesn't work while openib does? In other words that adding self to the BTL leads to deadlocks?
>> george.
>> PS: Btw, it is not supposed to work at all, except in the case where openib handles internal messages (where the source and destination is the same process).
>> On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:
>>> On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:
>>>> only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.
>>> Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?
>>>> On 7/28/08, Jeff Squyres wrote:
>>>>> FWIW, all my MTT runs are hanging as well.
>>>>> On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
>>>>>> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
>>>>>> --brad
>>>>>> 2008/7/28 Lenny Verkhovsky:
>>>>>>> I failed to run on different nodes or on the same node via self,openib
>>>>>>> On 7/28/08, Ralph Castain wrote:
>>>>>>>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>>>>>>>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>>>>>>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>>>>>>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>>>>>>>> I believe it is.
>>>>>>>>>> On 7/28/08, Jeff Squyres wrote:
>>>>>>>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>>>>>>>> Is this related to r1378?
>>>>>>>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>>>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
I just re-tested to confirm, and that is correct.

-mca btl openib        works
-mca btl openib,self   hangs
-mca btl openib,sm     works

On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:
> I'm a little bit lost here. You're stating that openib,self doesn't work while openib does? In other words that adding self to the BTL leads to deadlocks?
> george.
> PS: Btw, it is not supposed to work at all, except in the case where openib handles internal messages (where the source and destination is the same process).
> On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:
>> On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:
>>> only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.
>> Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?
>>> On 7/28/08, Jeff Squyres wrote:
>>>> FWIW, all my MTT runs are hanging as well.
>>>> On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
>>>>> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
>>>>> --brad
>>>>> 2008/7/28 Lenny Verkhovsky:
>>>>>> I failed to run on different nodes or on the same node via self,openib
>>>>>> On 7/28/08, Ralph Castain wrote:
>>>>>>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>>>>>>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>>>>>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>>>>>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>>>>>>> I believe it is.
>>>>>>>>> On 7/28/08, Jeff Squyres wrote:
>>>>>>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>>>>>>> Is this related to r1378?
>>>>>>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
I'm a little bit lost here. You're stating that openib,self doesn't work while openib does? In other words that adding self to the BTL leads to deadlocks?

george.

PS: Btw, it is not supposed to work at all, except in the case where openib handles internal messages (where the source and destination is the same process).

On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:
> On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:
>> only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.
> Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?
>> On 7/28/08, Jeff Squyres wrote:
>>> FWIW, all my MTT runs are hanging as well.
>>> On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
>>>> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
>>>> --brad
>>>> 2008/7/28 Lenny Verkhovsky:
>>>>> I failed to run on different nodes or on the same node via self,openib
>>>>> On 7/28/08, Ralph Castain wrote:
>>>>>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>>>>>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>>>>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>>>>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>>>>>> I believe it is.
>>>>>>>> On 7/28/08, Jeff Squyres wrote:
>>>>>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>>>>>> Is this related to r1378?
>>>>>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:
> only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.

Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?

> On 7/28/08, Jeff Squyres wrote:
>> FWIW, all my MTT runs are hanging as well.
>> On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
>>> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
>>> --brad
>>> 2008/7/28 Lenny Verkhovsky:
>>>> I failed to run on different nodes or on the same node via self,openib
>>>> On 7/28/08, Ralph Castain wrote:
>>>>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>>>>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>>>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>>>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>>>>> I believe it is.
>>>>>>> On 7/28/08, Jeff Squyres wrote:
>>>>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>>>>> Is this related to r1378?
>>>>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
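One way the check suggested above could look, as a rough sketch only. It assumes the policy that self must always be selected, which this thread does not settle, and the helper names list_has_entry, btl_selected and check_self_required are hypothetical and not part of Open MPI. It treats a leading ^ as an exclusion list, as in the ^sm usage quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `name` appears as a whole entry in the comma-separated
 * list `list` (e.g. "openib,self"). Hypothetical helper, not an Open MPI API. */
static bool list_has_entry(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        const char *comma = strchr(p, ',');
        size_t len = (comma != NULL) ? (size_t)(comma - p) : strlen(p);

        /* Compare whole entries only, so "selfish" does not match "self". */
        if (len == name_len && strncmp(p, name, len) == 0) {
            return true;
        }
        if (comma == NULL) {
            break;
        }
        p = comma + 1;
    }
    return false;
}

/* Decide whether component `name` is selected by a "btl" value.
 * A leading '^' means "everything except the listed components". */
static bool btl_selected(const char *spec, const char *name)
{
    if (spec[0] == '^') {
        return !list_has_entry(spec + 1, name);
    }
    return list_has_entry(spec, name);
}

/* Return an error message if "self" is not selected, or NULL if the
 * selection is acceptable under the assumed "self is mandatory" policy. */
static const char *check_self_required(const char *spec)
{
    if (!btl_selected(spec, "self")) {
        return "the \"self\" BTL is not selected; add it to the btl list";
    }
    return NULL;
}
```

Under these assumptions, check_self_required("openib,sm") returns a message while check_self_required("openib,sm,self") and check_self_required("^sm") return NULL. A caller would print the message and abort before any communication is attempted, rather than hanging later.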
Re: [OMPI devel] trunk hangs since r19010
only openib works for me too, but Glebs said to me once that it's illegal and I always need to use self btl.

On 7/28/08, Jeff Squyres wrote:
> FWIW, all my MTT runs are hanging as well.
> On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
>> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
>> --brad
>> 2008/7/28 Lenny Verkhovsky:
>>> I failed to run on different nodes or on the same node via self,openib
>>> On 7/28/08, Ralph Castain wrote:
>>>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>>>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>>>> I believe it is.
>>>>>> On 7/28/08, Jeff Squyres wrote:
>>>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>>>> Is this related to r1378?
>>>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
FWIW, all my MTT runs are hanging as well.

On Jul 28, 2008, at 10:37 AM, Brad Benton wrote:
> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
> --brad
> 2008/7/28 Lenny Verkhovsky:
>> I failed to run on different nodes or on the same node via self,openib
>> On 7/28/08, Ralph Castain wrote:
>>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>>> I believe it is.
>>>>> On 7/28/08, Jeff Squyres wrote:
>>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>>> Is this related to r1378?
>>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Interesting - you are quite correct and I should have been more precise. I ran with -mca btl openib and it worked. So having just openib seems to be okay.

On Jul 28, 2008, at 8:37 AM, Brad Benton wrote:
> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.
> --brad
> 2008/7/28 Lenny Verkhovsky:
>> I failed to run on different nodes or on the same node via self,openib
>> On 7/28/08, Ralph Castain wrote:
>>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>>> I believe it is.
>>>>> On 7/28/08, Jeff Squyres wrote:
>>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>>> Is this related to r1378?
>>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems and tests using --mca btl openib,self hang in all cases.

--brad

2008/7/28 Lenny Verkhovsky:
> I failed to run on different nodes or on the same node via self,openib
> On 7/28/08, Ralph Castain wrote:
>> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>>> I believe it is.
>>>> On 7/28/08, Jeff Squyres wrote:
>>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>>> Is this related to r1378?
>>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
I failed to run on different nodes or on the same node via self,openib

On 7/28/08, Ralph Castain wrote:
> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>> I believe it is.
>>> On 7/28/08, Jeff Squyres wrote:
>>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>>> Is this related to r1378?
>>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.

On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.
> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>> I believe it is.
>> On 7/28/08, Jeff Squyres wrote:
>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>> Is this related to r1378?
>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.

On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
> I believe it is.
> On 7/28/08, Jeff Squyres wrote:
>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>> Is this related to r1378?
>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
I believe it is.

On 7/28/08, Jeff Squyres wrote:
> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>> Is this related to r1378?
> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
> Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.

> On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
>> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Is this related to r1378?

On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
> Hi, I experience hanging of tests (latency) since r19010. Best Regards, Lenny.

--
Jeff Squyres
Cisco Systems
[OMPI devel] trunk hangs since r19010
Hi,

I experience hanging of tests (latency) since r19010.

Best Regards,
Lenny.