Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ralph Castain
Well, it now is launching just fine, so that's one thing! :-) Afraid I'll have to let the TCP btl guys take over from here. It looks like everything is up and running, but something strange is going on in the MPI comm layer. You can turn off those mca params I gave you as you are now past that

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ralph Castain
Yeah, it's the lib confusion - this is the problem: [saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2475. Have you tried configuring with

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Klymak Jody
On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote: This means that OMPI is finding an mca_iof_proxy.la file at run time from a prior version of Open MPI. You might want to use "find" or "locate" to search your nodes and find it. I suspect that you somehow have an OMPI 1.3.x install that
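A sketch of the search Jeff suggests, assuming the xserveNN host names seen in this thread and typical install prefixes (adjust both for your cluster):

    # Look on each node for stale component files left by an older install
    for node in xserve01 xserve02 xserve03 xserve04; do
        echo "== $node =="
        ssh $node 'find /usr/local /opt -name "mca_iof_proxy*" 2>/dev/null'
    done
    # or, if the locate database on a node is current:
    ssh xserve01 'locate mca_iof_proxy'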

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Klymak Jody
On 11-Aug-09, at 7:03 AM, Ralph Castain wrote: Sigh - too early in the morning for this old brain, I fear... You are right - the ranks are fine, and local rank doesn't matter. It sounds like a problem where the TCP messaging is getting a message ack'd from someone other than the process

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Klymak Jody
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote: -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5 I'm afraid the output will be a tad verbose, but I would appreciate seeing it. Might also tell us something about the lib issue. Command line was:
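For reference, a command of the shape Ralph asked for, with MyProg standing in for the actual executable:

    mpirun -mca plm_base_verbose 5 --debug-daemons \
           -mca odls_base_verbose 5 ./MyProg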

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ralph Castain
Sigh - too early in the morning for this old brain, I fear... You are right - the ranks are fine, and local rank doesn't matter. It sounds like a problem where the TCP messaging is getting a message ack'd from someone other than the process that was supposed to recv the message. This should cause

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Jeff Squyres
On Aug 11, 2009, at 9:43 AM, Klymak Jody wrote: [xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3] This could well be caused by a version mismatch between your nodes. E.g., if one node is
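A quick way to check for the mismatch Jeff describes is to compare what each node actually resolves (host names again follow the pattern in this thread):

    for node in xserve01 xserve02 xserve03 xserve04; do
        echo "== $node =="
        ssh $node 'which mpirun; ompi_info | grep "Open MPI:"'
    done

Every node should report the same path and the same 1.3.3 version line.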

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Klymak Jody
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote: The reason your job is hanging is sitting in the orte-ps output. You have multiple processes declaring themselves to be the same MPI rank. That definitely won't work. It's the "local rank", if that makes any difference... Any thoughts on this

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ralph Castain
Oops - I should have looked at your output more closely. The component_find warnings clearly indicate some old libs lying around, but that isn't why your job is hanging. The reason your job is hanging is sitting in the orte-ps output. You have multiple processes declaring themselves to be
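The orte-ps tool ships with Open MPI 1.3; a minimal check while the job is still hanging (run it as the same user, on the node where mpirun itself is running, which is an assumption about a typical setup):

    orte-ps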

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ralph Castain
Sorry, but Jeff is correct - that error message clearly indicates a version mismatch. Somewhere, one or more of your nodes is still picking up an old version. On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres wrote: On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote: I have

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Jeff Squyres
On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote: I have removed all the OS-X-supplied libraries, recompiled and installed openmpi 1.3.3, and I am *still* getting this warning when running ompi_info: [saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy" uses an MCA interface

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Klymak Jody
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote: Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X: I am still finding this very mysterious. I have removed all the OS-X-supplied libraries, recompiled and

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ralph Castain
On Aug 11, 2009, at 5:17 AM, Ashley Pittman wrote: On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote: If it isn't already there, try putting a print statement right at program start, another just prior to MPI_Init, and another just after MPI_Init. It could be that something is hanging

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ashley Pittman
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote: If it isn't already there, try putting a print statement right at program start, another just prior to MPI_Init, and another just after MPI_Init. It could be that something is hanging somewhere during program startup since it sounds
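If rebuilding the application just to add prints is awkward, a small wrapper script shows at least whether every process reaches program start (a sketch; MyProg is a placeholder, and the prints just before and after MPI_Init still have to go into the source itself):

    #!/bin/sh
    # wrapper.sh -- launch as: mpirun ./wrapper.sh
    echo "starting on $(hostname) at $(date)"
    exec ./MyProg "$@"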

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Klymak Jody
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote: Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X: Note that I always configure with --prefix=somewhere-in-my-own-dir, never to a system directory. Avoids this kind
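A minimal sketch of that style of build, with an example prefix:

    ./configure --prefix=$HOME/openmpi
    make all install
    # make sure this install is found ahead of anything system-wide:
    export PATH=$HOME/openmpi/bin:$PATH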

Re: [OMPI users] torque pbs behaviour...

2009-08-11 Ralph Castain
Interesting! Well, I always make sure I have my personal OMPI build before any system stuff, and I work exclusively on Mac OS-X: rhc$ echo $PATH /Library/Frameworks/Python.framework/Versions/Current/bin:/Users/rhc/openmpi/bin:/Users/rhc/bin:/opt/local/bin:/usr/X11R6/bin:/usr/local/

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Klymak Jody
On 10-Aug-09, at 6:44 PM, Ralph Castain wrote: Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in your path that is interfering with operation (i.e., it comes before your 1.3.3 installation). Hmm, the OS X FAQ says not to do this: "Note that there is no need to add

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Klymak Jody
So, mpirun --display-allocation -pernode --display-map hostname gives me the output below. Simple jobs seem to run, but the MITgcm does not, either under ssh or torque. It hangs at some early point in execution, before anything is written, so it's hard for me to tell what the error is.

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Ralph Castain
No problem - actually, that default works with any environment, not just Torque. On Aug 10, 2009, at 4:37 PM, Gus Correa wrote: Thank you for the correction, Ralph. I didn't know there was a (wise) default for the number of processes when using Torque-enabled OpenMPI. Gus Correa Ralph

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Gus Correa
Thank you for the correction, Ralph. I didn't know there was a (wise) default for the number of processes when using Torque-enabled OpenMPI. Gus Correa. Ralph Castain wrote: Just to correct something said here. You need to tell mpirun how many processes to launch, regardless of whether you
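Concretely (assuming the default Ralph describes is one process per allocated slot), under the nodes=2:ppn=8 allocation from this thread both of the following launch 16 processes; MyProg is a placeholder:

    mpirun ./MyProg          # default: one process per Torque-allocated slot
    mpirun -np 16 ./MyProg   # the same, with the count made explicit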

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Ralph Castain
No problem - yes indeed, 1.1.x would be a bad choice :-) On Aug 10, 2009, at 3:58 PM, Jody Klymak wrote: On Aug 10, 2009, at 14:39, Ralph Castain wrote: mpirun --display-allocation -pernode --display-map hostname Ummm, hmm, this is embarrassing, none of those command line arguments

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Jody Klymak
On Aug 10, 2009, at 14:39, Ralph Castain wrote: mpirun --display-allocation -pernode --display-map hostname Ummm, hmm, this is embarrassing, none of those command line arguments worked, making me suspicious... It looks like somehow I decided to build and run openMPI 1.1.5, or at

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Gus Correa
Hi Jody, list. See comments inline. Jody Klymak wrote: On Aug 10, 2009, at 13:01, Gus Correa wrote: Hi Jody, we don't have Mac OS-X but Linux; not sure if this applies to you. Did you configure your OpenMPI with Torque support, and point it to the same library that provides the Torque
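A sketch of a Torque-aware build, assuming Torque is installed under /usr/local/torque (point --with-tm at your own install):

    ./configure --prefix=$HOME/openmpi --with-tm=/usr/local/torque
    make all install
    # afterwards the tm support should be listed:
    ompi_info | grep tm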

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Ralph Castain
On Aug 10, 2009, at 3:25 PM, Jody Klymak wrote: Hi Ralph, On Aug 10, 2009, at 13:04, Ralph Castain wrote: Umm...are you saying that your $PBS_NODEFILE contains the following: No, if I put cat $PBS_NODEFILE in the pbs script I get xserve02.local ... xserve02.local xserve01.local ...

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Jody Klymak
Hi Ralph, On Aug 10, 2009, at 13:04, Ralph Castain wrote: Umm...are you saying that your $PBS_NODEFILE contains the following: No, if I put cat $PBS_NODEFILE in the pbs script I get xserve02.local ... xserve02.local xserve01.local ... xserve01.local each repeated 8 times. So that seems

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Jody Klymak
On Aug 10, 2009, at 13:01, Gus Correa wrote: Hi Jody, we don't have Mac OS-X but Linux; not sure if this applies to you. Did you configure your OpenMPI with Torque support, and point it to the same library that provides the Torque you are using

Re: [OMPI users] torque pbs behaviour...

2009-08-10 Ralph Castain
Umm...are you saying that your $PBS_NODEFILE contains the following: xserve01.local np=8 xserve02.local np=8 If so, that could be part of the problem - it isn't the standard notation we are expecting to see in that file. What Torque normally provides is one line for each slot, so we would
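So for a nodes=2:ppn=8 job, the expected file has sixteen lines, one per slot, along these lines (node order may differ):

    xserve01.local
    xserve01.local
    ...  (8 lines in all)
    xserve02.local
    xserve02.local
    ...  (8 lines in all)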

[OMPI users] torque pbs behaviour...

2009-08-10 Jody Klymak
Hi All, I've been trying to get torque pbs to work on my OS X 10.5.7 cluster with openMPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps via operator error!) If I submit openMPI with: #PBS -l nodes=2:ppn=8 mpirun MyProg pbs
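For reference, a minimal sketch of the kind of submission script being described (the job name and program are placeholders):

    #!/bin/sh
    #PBS -l nodes=2:ppn=8
    #PBS -N mpi_test
    cd $PBS_O_WORKDIR
    cat $PBS_NODEFILE   # sanity check: one line per allocated slot
    mpirun ./MyProg     # a Torque-aware Open MPI reads the allocation itself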