[OMPI users] Memchecker and Wait

2009-08-11 Thread Allen Barnett
Hi: I'm trying to use the memchecker/valgrind capability of OpenMPI 1.3.3 to help debug my MPI application. I noticed a rather odd thing: After Waiting on a Recv Request, valgrind declares my receive buffer as invalid memory. Is this just a fluke of valgrind, or is OMPI doing something internally?
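The pattern being asked about looks roughly like this — a minimal sketch, not the poster's actual code; the count, tag, and element values are made up for illustration:

```c
/* Hypothetical reproduction of the question: post a nonblocking
 * receive, wait on it, then read the buffer. After MPI_Wait returns
 * the buffer is fully written by the library, yet under an
 * --enable-memchecker build valgrind reportedly still flags the
 * access as invalid. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double buf[100];
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            MPI_Irecv(buf, 100, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &status);
            /* The receive has completed; this read should be legal. */
            printf("first element: %g\n", buf[0]);
        } else if (rank == 1) {
            double src[100] = {0.0};
            MPI_Send(src, 100, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Run under valgrind as usual with `mpirun -np 2 valgrind ./a.out` to see whether the report appears at the marked read.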

Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Gus Correa
Hi Jody Jody Klymak wrote: On Aug 11, 2009, at 17:35 PM, Gus Correa wrote: You can check this, say, by logging in to each node and doing /usr/local/openmpi/bin/ompi_info and comparing the output. Yep, they are all the same 1.3.3, SVN r21666, July 14th 2009. Did you wipe off the old

Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Jody Klymak
On Aug 11, 2009, at 17:35 PM, Gus Correa wrote: You can check this, say, by logging in to each node and doing /usr/local/openmpi/bin/ompi_info and comparing the output. Yep, they are all the same 1.3.3, SVN r21666, July 14th 2009. What about passwords? ssh from server to node is


Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Gus Correa
Hi Jody Are you sure you have the same OpenMPI version installed on /usr/local/openmpi on *all* nodes? The fact that the programs run on the xserver0, but hang when you try xserver0 and xserver1 together suggest some inconsistency in the runtime environment, which may come from different
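That per-node check can be scripted; a rough sketch, assuming passwordless ssh and the install prefix from the thread (the hostnames are placeholders for your own node list):

```shell
# Print the Open MPI version line from each node's install so any
# mismatch stands out at a glance. Hostnames are examples.
for h in xserver0 xserver1 xserve02 xserve03; do
    printf '%s: ' "$h"
    ssh "$h" /usr/local/openmpi/bin/ompi_info | grep 'Open MPI:'
done
```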

Re: [OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Ralph Castain
I can't speak to the tcp problem, but the following: [xserve02.local:43625] [[28627,0],2] orte:daemon:send_relay - recipient list is empty! is not an error message. It is perfectly normal operation. Ralph On Aug 11, 2009, at 1:54 PM, Jody Klymak wrote: Hello, On Aug 11, 2009, at 8:15

Re: [OMPI users] Tuned collectives: How to choose them dynamically? (-mca coll_tuned_dynamic_rules_filename dyn_rules)"

2009-08-11 Thread Gus Correa
Hi Pavel, Lenny, Igor, list Igor: Thanks for the pointer to your notes/paper! Lenny: Thanks for resurrecting this thread! Pavel: Thanks for the article! It clarified a number of things about tuned collectives (e.g. fixed vs. dynamic selection), and the example rule file is very helpful too.
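For reference, the dynamic-selection mechanism from the subject line is driven by two MCA parameters; a sketch of the invocation, where `dyn_rules` and the application name are placeholders:

```shell
# Enable dynamic rule selection for the tuned collectives component
# and point it at a rules file (dyn_rules is an example path).
mpirun -np 16 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_dynamic_rules_filename dyn_rules \
    ./my_app
```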

Re: [OMPI users] Automated tuning tool

2009-08-11 Thread Gus Correa
Thank you, John Casu and Edgar Gabriel for the pointers to the parameter space sweep script and the OTPO code. For simplicity, I was thinking of testing each tuned collective separately, instead of the applications, to have an idea of which algorithms and parameters are best for our small

Re: [OMPI users] need help with a code segment

2009-08-11 Thread Jeff Squyres
comm_world(int key); ^ compilation aborted for drt_dv_app.c (code 2) make[1]: *** [drt_dv_app.o] Error 2 Hope someone can help Bernie Borenstein The Boeing Company __ Information from ESET NOD32 Antivirus, version of virus signature database 4326 (20090811) __ T

[OMPI users] need help with a code segment

2009-08-11 Thread Borenstein, Bernard S

[OMPI users] tcp connectivity OS X and 1.3.3

2009-08-11 Thread Jody Klymak
Hello, On Aug 11, 2009, at 8:15 AM, Ralph Castain wrote: You can turn off those mca params I gave you as you are now past that point. I know there are others that can help debug that TCP btl error, but they can help you there. Just to eliminate the mitgcm from the debugging I compiled

Re: [OMPI users] problem configuring with torque

2009-08-11 Thread Gus Correa
Hi Craig, list On my Rocks 4.3 cluster Torque is installed on /opt/torque, not on /share/apps/torque. That directory path may have changed on more recent versions of Rocks, or you may have installed another copy of Torque on /share/apps/torque. However, have you checked where Torque is

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Well, it now is launching just fine, so that's one thing! :-) Afraid I'll have to let the TCP btl guys take over from here. It looks like everything is up and running, but something strange is going on in the MPI comm layer. You can turn off those mca params I gave you as you are now past that

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Yeah, it's the lib confusion that's the problem - this is the problem: [saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2475. Have you tried configuring with

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody
On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote: This means that OMPI is finding an mca_iof_proxy.la file at run time from a prior version of Open MPI. You might want to use "find" or "locate" to search your nodes and find it. I suspect that you somehow have an OMPI 1.3.x install that
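A quick way to hunt for the stale component Jeff describes; the search roots are examples, so add whatever prefixes exist on your nodes:

```shell
# Look for leftover mca_iof_proxy components from an old Open MPI
# install; any hit outside the current prefix is a removal candidate.
find /usr/local /opt "$HOME" -name 'mca_iof_proxy*' 2>/dev/null
```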

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody
On 11-Aug-09, at 7:03 AM, Ralph Castain wrote: Sigh - too early in the morning for this old brain, I fear... You are right - the ranks are fine, and local rank doesn't matter. It sounds like a problem where the TCP messaging is getting a message ack'd from someone other than the process

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote: -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5 I'm afraid the output will be a tad verbose, but I would appreciate seeing it. Might also tell us something about the lib issue. Command line was:

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Sigh - too early in the morning for this old brain, I fear... You are right - the ranks are fine, and local rank doesn't matter. It sounds like a problem where the TCP messaging is getting a message ack'd from someone other than the process that was supposed to recv the message. This should cause

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Jeff Squyres
On Aug 11, 2009, at 9:43 AM, Klymak Jody wrote: [xserve03.local][[61029,1],4][btl_tcp_endpoint.c: 486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3] This would well be caused by a version mismatch between your nodes. E.g., if one node is

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote: The reason your job is hanging is sitting in the orte-ps output. You have multiple processes declaring themselves to be the same MPI rank. That definitely won't work. It's the "local rank" if that makes any difference... Any thoughts on this

[OMPI users] Error: system limit exceeded on number of pipes that can be open

2009-08-11 Thread Mike Dubman
Hello guys, When executing following command with mtt and ompi 1.3.3: mpirun --host witch15,witch15,witch15,witch15,witch16,witch16,witch16,witch16,witch17,witch17,witch17,witch17,witch18,witch18,witch18,witch18,witch19,witch19,witch19,witch19 -np 20 --mca btl_openib_use_srq 1 --mca btl
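Since mpirun creates pipes for each local child it launches, the per-process descriptor limit on each node is worth checking when a 20-rank run hits this error; a quick sketch:

```shell
# Print the current open-file-descriptor limit for this shell; a low
# value (e.g. 256) can be exhausted when many ranks share one node.
ulimit -n
```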

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Oops - I should have looked at your output more closely. The component_find warnings are clearly indicating some old libs lying around, but that isn't why your job is hanging. The reason your job is hanging is sitting in the orte-ps output. You have multiple processes declaring themselves to be

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Sorry, but Jeff is correct - that error message clearly indicates a version mismatch. Somewhere, one or more of your nodes is still picking up an old version. On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres wrote: > On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote: > > I have

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Jeff Squyres
On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote: I have removed all the OS-X -supplied libraries, recompiled and installed openmpi 1.3.3, and I am *still* getting this warning when running ompi_info: [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote: Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X: I am still finding this very mysterious I have removed all the OS-X -supplied libraries, recompiled and

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
On Aug 11, 2009, at 5:17 AM, Ashley Pittman wrote: On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote: If it isn't already there, try putting a print statement right at program start, another just prior to MPI_Init, and another just after MPI_Init. It could be that something is hanging

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ashley Pittman
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote: > If it isn't already there, try putting a print statement right at > program start, another just prior to MPI_Init, and another just after > MPI_Init. It could be that something is hanging somewhere during > program startup since it sounds

[OMPI users] How to make a job abort when one host dies?

2009-08-11 Thread Oskar Enoksson
I searched the FAQ and google but couldn't come up with a solution to this problem. My problem is that when one MPI execution host dies or the network connection goes down the job is not aborted. Instead the remaining processes continue to eat 100% CPU indefinitely. How can I make jobs abort

Re: [OMPI users] problem configuring with torque

2009-08-11 Thread Ralph Castain
On Aug 10, 2009, at 10:36 PM, Craig Plaisance wrote: I am building openmpi on a cluster running rocks. When I build using ./configure --with-tm=/share/apps/torque --prefix=/share/apps/ openmpi/intel I receive the warning configure: WARNING: Unrecognized options: --with-tm, --enable-ltdl-
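For reference, Torque (tm) support is requested at configure time by pointing --with-tm at the prefix that contains the Torque headers and libraries; the paths below are the examples from this thread, not a recommendation:

```shell
# Configure Open MPI against a Torque install (example paths; the
# --with-tm prefix must contain include/tm.h).
./configure --with-tm=/share/apps/torque \
            --prefix=/share/apps/openmpi/intel
make all install
```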

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Klymak Jody
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote: Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X: Note that I always configure with --prefix=somewhere-in-my-own-dir, never to a system directory. Avoids this kind

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Thread Ralph Castain
Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X: rhc$ echo $PATH /Library/Frameworks/Python.framework/Versions/Current/bin:/Users/rhc/openmpi/bin:/Users/rhc/bin:/opt/local/bin:/usr/X11R6/bin:/usr/local/
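A sketch of that private-prefix convention, with example directory names; on OS X the dynamic-linker variable is DYLD_LIBRARY_PATH rather than LD_LIBRARY_PATH:

```shell
# Install into a private prefix and put it first on PATH so a stale
# system copy of Open MPI is never picked up by accident.
./configure --prefix="$HOME/openmpi"
make all install
export PATH="$HOME/openmpi/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/openmpi/lib:$LD_LIBRARY_PATH"
```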