Re: [OMPI users] switch and NIC performance (was: very bad parallel scaling of vasp using openmpi)

2009-09-23 Thread Jeff Squyres
Wow; I should point out an amazing coincidence here. Doug Eadline used [almost] exactly the same analogy that I did (truck vs. F1) in a column that was published today in Linux Magazine: http://www.linux-mag.com/id/7534 I swear I didn't read his column before I posted my answer

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Eugene Loh
Jonathan Dursi wrote: Continuing the conversation with myself: Sorry to interrupt... :^) Okay, I managed to reproduce the hang. I'll try to look at this. Google pointed me to Trac ticket #1944, which spoke of deadlocks in looped collective operations; there is no collective operation

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Peter Kjellstrom
On Wednesday 23 September 2009, Rahul Nabar wrote: > On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote: > > Most of that bandwidth is in marketing... Sorry, but it's not a high > > performance switch. > > Well, how does one figure out what exactly is a "high

Re: [OMPI users] switch and NIC performance (was: very bad parallel scaling of vasp using openmpi)

2009-09-23 Thread Jeff Squyres
On Sep 23, 2009, at 10:15 AM, Dave Love wrote: So, how does one go about selecting a good switch? "The most expensive the better" is a somewhat unsatisfying option! Also it's apparently not always right +1 on Dave's and Joe's comments. For example, not all of Cisco's switches are

[OMPI users] switch and NIC performance (was: very bad parallel scaling of vasp using openmpi)

2009-09-23 Thread Dave Love
Rahul Nabar writes: > So, how does one go about selecting a good switch? "The most expensive > the better" is a somewhat unsatisfying option! Also it's apparently not always right, if I recall correctly, according to the figures on MPI switch performance in the reports

Re: [OMPI users] error in ompi-checkpoint

2009-09-23 Thread Josh Hursey
How did you configure Open MPI? Is your application using SIGUSR1? This error message indicates that Open MPI's daemons could not communicate with the application processes. The daemons send SIGUSR1 to the process to initiate the handshake (you can change this signal with -mca

Re: [OMPI users] fault tolerance in open mpi

2009-09-23 Thread Josh Hursey
Unfortunately I cannot provide a precise time frame for availability at this point, but we are targeting the v1.5 release series. There is a handful of core developers working on this issue at the moment. Pieces of this work have already made it into the Open MPI development trunk. If you

Re: [OMPI users] Changing location where checkpoints are saved

2009-09-23 Thread Josh Hursey
This is described in the C/R User's Guide attached to the webpage below: https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR Additionally this has been addressed on the users mailing list in the past, so searching around will likely turn up some examples. -- Josh On Sep 18, 2009, at

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Jonathan Dursi
Hi, Eugene: If it continues to be a problem for people to reproduce this, I'll see what can be done about having an account made here for someone to poke around. Alternately, any suggestions for tests that I can do to help diagnose/verify the problem, or figure out what's different about

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Joe Landman
Rahul Nabar wrote: On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote: Most of that bandwidth is in marketing... Sorry, but it's not a high performance switch. Well, how does one figure out what exactly is a "high performance switch"? I've found this an exceedingly

Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-23 Thread Eugene Loh
Jonathan Dursi wrote: Continuing the conversation with myself: Google pointed me to Trac ticket #1944, which spoke of deadlocks in looped collective operations; there is no collective operation anywhere in this sample code, but trying one of the suggested workarounds/clues: that is,

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Rahul Nabar
On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager wrote: > Most of that bandwidth is in marketing... Sorry, but it's not a high > performance switch. Well, how does one figure out what exactly is a "high performance switch"? I've found this an exceedingly hard task. Like the