Well, it now is launching just fine, so that's one thing! :-)
Afraid I'll have to let the TCP btl guys take over from here. It looks like
everything is up and running, but something strange is going on in the MPI
comm layer.
You can turn off those mca params I gave you as you are now past that
Yeah, it's the lib confusion that's the problem - here's the error:
[saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described
vs non-described) mismatch - operation not allowed in file
base/odls_base_default_fns.c at line 2475
Have you tried configuring with
On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:
This means that OMPI is finding an mca_iof_proxy.la file at run time
from a prior version of Open MPI. You might want to use "find" or
"locate" to search your nodes and find it. I suspect that you
somehow have an OMPI 1.3.x install that
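A search along these lines can turn up the stale components; the search roots below are only guesses, so adjust them for your nodes:

```shell
# Look for leftover MCA components from an older Open MPI install.
# The directories searched here are assumptions -- adjust for your systems.
for dir in /usr/lib /usr/local/lib /opt "$HOME"; do
    find "$dir" -name 'mca_iof_proxy*' 2>/dev/null
done
```

If the locate database on your nodes is current, "locate mca_iof_proxy" is a faster alternative.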
On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:
Sigh - too early in the morning for this old brain, I fear...
You are right - the ranks are fine, and local rank doesn't matter.
It sounds like a problem where the TCP messaging is getting a
message ack'd from someone other than the process
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
I'm afraid the output will be a tad verbose, but I would appreciate
seeing it. Might also tell us something about the lib issue.
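Assembled into a single command line (the executable name here is just a placeholder), that would look something like:

```shell
# Launch with verbose launch- and daemon-level debugging output.
# ./MyProg stands in for whatever program is actually being run.
mpirun -mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5 ./MyProg
```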
Command line was:
On Aug 11, 2009, at 9:43 AM, Klymak Jody wrote:
[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]
This could well be caused by a version mismatch between your nodes.
E.g., if one node is
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
The reason your job is hanging is sitting in the orte-ps output. You
have multiple processes declaring themselves to be the same MPI
rank. That definitely won't work.
It's the "local rank" if that makes any difference...
Any thoughts on this
Oops - I should have looked at your output more closely. The component_find
warnings are clearly indicating some old libs laying around, but that isn't
why your job is hanging.
Sorry, but Jeff is correct - that error message clearly indicates a version
mismatch. Somewhere, one or more of your nodes is still picking up an old
version.
On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres wrote:
> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>
> I have
On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
I have removed all the OS-X-supplied libraries, recompiled and
installed openmpi 1.3.3, and I am *still* getting this warning when
running ompi_info:
[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
uses an MCA interface
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:
Interesting! Well, I always make sure I have my personal OMPI build
before any system stuff, and I work exclusively on Mac OS-X:
I am still finding this very mysterious
I have removed all the OS-X-supplied libraries, recompiled and
On Aug 11, 2009, at 5:17 AM, Ashley Pittman wrote:
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:
If it isn't already there, try putting a print statement right at
program start, another just prior to MPI_Init, and another just after
MPI_Init. It could be that something is hanging
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:
Note that I always configure with --prefix=somewhere-in-my-own-dir,
never to a system directory. Avoids this kind
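A build along those lines might look like the following; the prefix path is just an example:

```shell
# Configure Open MPI into a per-user prefix rather than a system directory.
./configure --prefix=$HOME/openmpi
make -j4 all
make install
# Then make sure that prefix comes first when building and running jobs.
# (On OS X the dynamic linker consults DYLD_LIBRARY_PATH rather than
# LD_LIBRARY_PATH.)
export PATH=$HOME/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH
```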
rhc$ echo $PATH
/Library/Frameworks/Python.framework/Versions/Current/bin:/Users/rhc/openmpi/bin:/Users/rhc/bin:/opt/local/bin:/usr/X11R6/bin:/usr/local/
On 10-Aug-09, at 6:44 PM, Ralph Castain wrote:
Check your LD_LIBRARY_PATH - there is an earlier version of OMPI in
your path that is interfering with operation (i.e., it comes before
your 1.3.3 installation).
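A quick way to check which installation wins the search, as a sketch (nothing here is specific to any one setup):

```shell
# Which mpirun is found first on PATH?
command -v mpirun || echo "mpirun not on PATH"
# Print the library search path one entry per line, so an older Open MPI
# directory sitting ahead of the 1.3.3 one is easy to spot.
echo "${LD_LIBRARY_PATH:-}" | tr ':' '\n'
```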
Hmm, the OS X FAQ says not to do this:
"Note that there is no need to add
So,
mpirun --display-allocation -pernode --display-map hostname
gives me the output below. Simple jobs seem to run, but the MITgcm
does not, either under ssh or torque. It hangs at some early point in
execution before anything is written, so it's hard for me to tell what
the error is.
No problem - actually, that default works with any environment, not
just Torque
On Aug 10, 2009, at 4:37 PM, Gus Correa wrote:
Thank you for the correction, Ralph.
I didn't know there was a (wise) default for the
number of processes when using Torque-enabled OpenMPI.
Gus Correa
Ralph
Ralph Castain wrote:
Just to correct something said here.
You need to tell mpirun how many processes to launch,
regardless of whether you
No problem - yes indeed, 1.1.x would be a bad choice :-)
On Aug 10, 2009, at 3:58 PM, Jody Klymak wrote:
On Aug 10, 2009, at 14:39 PM, Ralph Castain wrote:
mpirun --display-allocation -pernode --display-map hostname
Ummm, hmm, this is embarrassing, none of those command line arguments
worked, making me suspicious...
It looks like somehow I decided to build and run openMPI 1.1.5, or at
Hi Jody, list
See comments inline.
Jody Klymak wrote:
On Aug 10, 2009, at 13:01 PM, Gus Correa wrote:
Hi Jody
We don't have Mac OS-X, but Linux, not sure if this applies to you.
Did you configure your OpenMPI with Torque support,
and pointed to the same library that provides the
Torque
On Aug 10, 2009, at 3:25 PM, Jody Klymak wrote:
Hi Ralph,
On Aug 10, 2009, at 13:04 PM, Ralph Castain wrote:
Umm...are you saying that your $PBS_NODEFILE contains the following:
No, if I put cat $PBS_NODEFILE in the pbs script I get
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local
each repeated 8 times. So that seems
Umm...are you saying that your $PBS_NODEFILE contains the following:
xserve01.local np=8
xserve02.local np=8
If so, that could be part of the problem - it isn't the standard
notation we are expecting to see in that file. What Torque normally
provides is one line for each slot, so we would
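So for nodes=2:ppn=8 the file would normally hold sixteen lines, eight per host. A quick sanity check of that layout (hostnames below are illustrative; at run time you would read "$PBS_NODEFILE" itself):

```shell
# Build a sample nodefile in the one-line-per-slot layout Torque uses.
cat > nodefile.example <<'EOF'
xserve01.local
xserve01.local
xserve02.local
xserve02.local
EOF
# Count slots per node; each hostname should appear once per slot.
sort nodefile.example | uniq -c
rm -f nodefile.example
```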
Hi All,
I've been trying to get torque pbs to work on my OS X 10.5.7 cluster
with openMPI (after finding that Xgrid was pretty flaky about
connections). I *think* this is an MPI problem (perhaps via operator
error!)
If I submit openMPI with:
#PBS -l nodes=2:ppn=8
mpirun MyProg
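For reference, a fuller submission script along those lines might look like this (the job name and program are placeholders):

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=8
#PBS -N MyJob
cd "$PBS_O_WORKDIR"
# With Torque support built in, mpirun takes the process count and host
# list from the Torque allocation, so no -np or hostfile is needed.
mpirun ./MyProg
```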