Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Eugene Loh
Jonathan Dursi wrote: Continuing the conversation with myself: Sorry to interrupt... :^) Okay, I managed to reproduce the hang. I'll try to look at this. Google pointed me to Trac ticket #1944, which spoke of deadlocks in looped collective operations; there is no collective operation

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Jonathan Dursi
Hi, Eugene: If it continues to be a problem for people to reproduce this, I'll see what can be done about having an account made here for someone to poke around. Alternatively, any suggestions for tests that I can do to help diagnose/verify the problem, or figure out what's different about

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Eugene Loh
Jonathan Dursi wrote: Continuing the conversation with myself: Google pointed me to Trac ticket #1944, which spoke of deadlocks in looped collective operations; there is no collective operation anywhere in this sample code, but trying one of the suggested workarounds/clues: that is,

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-22 Thread Jonathan Dursi
Hi, Jeff: I wish I had your problems reproducing this. This problem apparently rears its head when OpenMPI is compiled with the Intel compilers as well, but only ~1% of the time. Unfortunately, we have users who launch ~1400 single-node jobs at a go. So they see on the order of a dozen or

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-21 Thread Jonathan Dursi
Continuing the conversation with myself: Google pointed me to Trac ticket #1944, which spoke of deadlocks in looped collective operations; there is no collective operation anywhere in this sample code, but trying one of the suggested workarounds/clues: that is, setting btl_sm_num_fifos to at

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-21 Thread Jonathan Dursi
I hate to repost, but I'm still stuck with the problem that, on a completely standard install with a standard gcc compiler, we're getting random hangs with a trivial test program when using the sm btl, and we still have no clues as to how to track down the problem. Using a completely standard