Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-15 Thread Eugene Loh
Matthew MacManes wrote: I would be happy to help troubleshoot, but I am not enough of a programmer to know how. The hang is reproducible, and -mca btl ^sm is about 15% faster. If you want to shoot me some instructions off-list, I can give it a go. The application that I am working with,

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-15 Thread Matthew MacManes
I would be happy to help troubleshoot, but I am not enough of a programmer to know how. The hang is reproducible, and -mca btl ^sm is about 15% faster. If you want to shoot me some instructions off-list, I can give it a go. The application that I am working with, primarily, is ABySS:

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-15 Thread Eugene Loh
Matthew MacManes wrote: On my system, mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and appeared to hang after several thousand iterations) than -mca btl ^sm. If the hang is reproducible, we should perhaps have a look. Also, the fact that it's much slower is interesting. Can you

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-11 Thread Matthew MacManes
On my system, mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and appeared to hang after several thousand iterations) than -mca btl ^sm. Is there another, better way I should be modifying fifos to get better performance? Matt On Dec 11, 2009, at 4:04 AM, Terry Dontje wrote:
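
The two invocations compared in this message can be written out side by side. This is only a dry-run sketch: it prints the command lines rather than launching anything. The application name ./my_mpi_app is a placeholder and does not come from the thread; the process count and the two -mca settings are the ones quoted in the message.

```shell
#!/bin/sh
# Dry-run sketch: prints the two mpirun invocations compared in this thread.
# APP is a hypothetical placeholder, not a name taken from the thread.
APP="./my_mpi_app"
NP=8

# Option 1: exclude the shared-memory ("sm") BTL entirely.
NO_SM_CMD="mpirun -np $NP -mca btl ^sm $APP"

# Option 2: keep sm but set the number of sm FIFOs to 7,
# the value reported in the message above.
FIFO_CMD="mpirun -np $NP -mca btl_sm_num_fifos 7 $APP"

printf '%s\n' "$NO_SM_CMD" "$FIFO_CMD"
```

Timing each printed command on the same input is one way to reproduce the comparison; the value 7 is simply what was tried here, not a recommended setting.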

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-11 Thread Terry Dontje
Date: Thu, 10 Dec 2009 17:57:27 -0500 From: Jeff Squyres On Dec 10, 2009, at 5:53 PM, Gus Correa wrote: > How does the efficiency of loopback > (let's say, over TCP and over IB) compare with "sm"? Definitely not as good; that's why we have sm. :-) I don't

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Mark Bolstad
Some additional data: Without threads it still hangs, similar behavior as before. All of the tests were run on a system running FC11 with X5550 processors. I just reran on a node of a RHEL 5.3 cluster with E5530 processors (dual Nehalem): - openmpi 1.3.4 and gcc 4.1.2 - No issues:

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Eugene Loh
Gus Correa wrote: Why wouldn't shared memory work right on Nehalem? We don't know exactly what is driving this problem, but the issue appears to be related to memory fences. Messages have to be posted to a receiver's queue. By default, each process (since OMPI 1.3.2) has only one queue.

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jeff Squyres
On Dec 10, 2009, at 5:53 PM, Gus Correa wrote: > How does the efficiency of loopback > (let's say, over TCP and over IB) compare with "sm"? Definitely not as good; that's why we have sm. :-) I don't have any quantification of that assertion, though (i.e., no numbers to back that up). > FYI,

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Gus Correa
Hi Jeff Thanks for jumping in! :) And for your clarifications too, of course. How does the efficiency of loopback (let's say, over TCP and over IB) compare with "sm"? FYI, I do NOT see the problem reported by Matthew et al. on our AMD Opteron Shanghai dual-socket quad-core. They run a quite

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jonathan Dursi
Jeff Squyres wrote: Why wouldn't shared memory work right on Nehalem? (That is probably distressing for Mark, Matthew, and other Nehalem owners.) To be clear, we don't know that this is a Nehalem-specific problem. I have definitely had this problem on Harpertown cores. - Jonathan

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jeff Squyres
On Dec 10, 2009, at 5:01 PM, Gus Correa wrote: > > Just a quick interjection, I also have a dual-quad Nehalem system, HT > > on, 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads > > --enable-mpi-f77=no --with-openib=no > > > > With v1.3.4 I see roughly the same behavior, hello,

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa) RESOLVED FOR NOW

2009-12-10 Thread Gus Correa
Hi Matthew, Mark, Mattijs Great news that a solution was found, actually two, which seem to have been around for a while. Thanks to Mark and Mattijs for posting the solutions. Much better that all can be solved by software, with a single mca parameter. A pity that it took a while for the actual

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Jeff Squyres
On Dec 10, 2009, at 5:01 PM, Gus Correa wrote: > A couple of questions to the OpenMPI pros: > If shared memory ("sm") is turned off on a standalone computer, > which mechanism is used for MPI communication? > TCP via loopback port? Other? Whatever device supports node-local loopback. TCP is

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Matthew MacManes
Hi All, I agree that the issue is troublesome. It apparently has been reported, and there is an active bug report, with some technical discussion of the underlying problems, found here: https://svn.open-mpi.org/trac/ompi/ticket/2043 For now, it is OK, but it is an issue that hopefully will

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Gus Correa
Hi Mark, Matthew, list Oh well, Mark's direct experience on a Nehalem is a game changer, and his recommendation to turn off the shared memory feature may be the way to go for Matthew, at least to have things working. Thank you Mark, your interjection sheds new light on the awkward situation

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa) RESOLVED FOR NOW

2009-12-10 Thread Matthew MacManes
Mark, Exciting.. SOLVED.. There is an open ticket #2043 regarding Nehalem/OpenMPI/Hang problem (https://svn.open-mpi.org/trac/ompi/ticket/2043).. Seems like the problem might be specific to gcc4.4x and OMPI <1.3.2.. It seems like there is a group of us with dual socket Nehalems trying to use

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Mattijs Janssens
On Thursday 10 December 2009 15:42:49 Mark Bolstad wrote: > Just a quick interjection, I also have a dual-quad Nehalem system, HT on, > 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads > --enable-mpi-f77=no --with-openib=no > > With v1.3.4 I see roughly the same behavior, hello,

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-10 Thread Mark Bolstad
Just a quick interjection, I also have a dual-quad Nehalem system, HT on, 24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads --enable-mpi-f77=no --with-openib=no With v1.3.4 I see roughly the same behavior, hello, ring work, connectivity fails randomly with np >= 8. Turning on -v

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Gus Correa
Hi Matthew Save any misinterpretation I may have made of the code: Hello_c has no real communication, except for a final Barrier synchronization. Each process prints "hello world" and that's it. Ring probes a little more, with processes Send(ing) and Recv(eiving) messages. Ring just passes a

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Matthew MacManes
Hi Gus and List, 1st of all Gus, I want to say thanks.. you have been a huge help, and when I get this fixed, I owe you big time! However, the problems continue... I formatted the HD, reinstalled OS to make sure that I was working from scratch. I did your step A, which seemed to go fine:

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-09 Thread Matthew MacManes
Hi Gus, Interestingly, the connectivity_c test works fine with -np <8. For -np >8 it works some of the time, other times it HANGS. I have got to believe that this is a big clue!! Also, when it hangs, sometimes I get the message "mpirun was unable to cleanly terminate the

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-08 Thread Gus Correa
Hi Matthew Please see comments/answers inline below. Matthew MacManes wrote: Hi Gus, Thanks for your ideas.. I have a few questions, and will try to answer yours in hopes of solving this!! A simple way to test OpenMPI on your system is to run the test programs that come with the OpenMPI

Re: [OMPI users] mpirun only works when -np <4 (Gus Correa)

2009-12-08 Thread Matthew MacManes
Hi Gus, Thanks for your ideas.. I have a few questions, and will try to answer yours in hopes of solving this!! Should I worry about setting things like --num-cores --bind-to-cores? This, I think, gets at your questions about processor affinity.. Am I right? I could not exactly figure out