Hi:
I'm trying to use the memchecker/valgrind capability of OpenMPI 1.3.3 to
help debug my MPI application. I noticed a rather odd thing: after
waiting on a Recv request, valgrind declares my receive buffer as
invalid memory. Is this just a fluke of valgrind, or is OMPI doing
something internally?
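A minimal sketch of the pattern in question (buffer size, tag, and ranks are illustrative, not from the original program): post a nonblocking receive, wait on it, then ask valgrind directly whether the buffer is considered defined. Run as, e.g., "mpirun -np 2 valgrind ./a.out".

    /* sketch of the memchecker behavior being asked about; assumes
       the valgrind development headers are installed */
    #include <mpi.h>
    #include <stdio.h>
    #include <valgrind/memcheck.h>

    int main(int argc, char **argv)
    {
        int buf[4] = {0};
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Request req;
            MPI_Irecv(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            /* asks valgrind whether the buffer is still flagged as
               undefined now that the request has completed */
            VALGRIND_CHECK_MEM_IS_DEFINED(buf, sizeof(buf));
            printf("got %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
        } else if (rank == 1) {
            int src[4] = {1, 2, 3, 4};
            MPI_Send(src, 4, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }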
Hi Jody
Jody Klymak wrote:
On Aug 11, 2009, at 17:35, Gus Correa wrote:
You can check this, say, by logging in to each node and doing
/usr/local/openmpi/bin/ompi_info and comparing the output.
Yep, they are all the same 1.3.3, SVN r21666, July 14th 2009.
Did you wipe off the old
What about passwords? ssh from server to node is
Hi Jody
Are you sure you have the same OpenMPI version installed on
/usr/local/openmpi on *all* nodes?
The fact that the programs run on xserver0, but hang
when you try xserver0 and xserver1 together, suggests
some inconsistency in the runtime environment,
which may come from different
I can't speak to the tcp problem, but the following:
[xserve02.local:43625] [[28627,0],2] orte:daemon:send_relay -
recipient list is empty!
is not an error message. It is perfectly normal operation.
Ralph
On Aug 11, 2009, at 1:54 PM, Jody Klymak wrote:
Hello,
On Aug 11, 2009, at 8:15
Hi Pavel, Lenny, Igor, list
Igor: Thanks for the pointer to your notes/paper!
Lenny: Thanks for resurrecting this thread!
Pavel: Thanks for the article!
It clarified a number of things about tuned collectives
(e.g. fixed vs. dynamic selection),
and the example rule file is very helpful too.
Thank you, John Casu and Edgar Gabriel for the pointers
to the parameter space sweep script and the OTPO code.
For simplicity,
I was thinking of testing each tuned collective separately,
instead of the applications, to have an idea
of which algorithms and parameters are best for our small
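A minimal sketch of that kind of isolated test, timing a single collective (MPI_Allreduce here as an example; the message size and iteration count are illustrative assumptions):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        enum { COUNT = 1024, ITERS = 1000 };
        double in[COUNT], out[COUNT];
        double t0, t1;
        int i, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < COUNT; i++)
            in[i] = (double)i;

        MPI_Barrier(MPI_COMM_WORLD);       /* sync before timing */
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++)
            MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg MPI_Allreduce: %g s\n", (t1 - t0) / ITERS);

        MPI_Finalize();
        return 0;
    }

Running the same binary under different tuned-collective settings would then let the algorithms be compared directly.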
comm_world(int key);
^
compilation aborted for drt_dv_app.c (code 2)
make[1]: *** [drt_dv_app.o] Error 2
Hope someone can help
Bernie Borenstein
The Boeing Company
Hello,
On Aug 11, 2009, at 8:15 AM, Ralph Castain wrote:
You can turn off those mca params I gave you as you are now past that point. I know there are others that can help debug that TCP btl error, and they can help you from there.
Just to eliminate the MITgcm from the debugging I compiled
Hi Craig, list
On my Rocks 4.3 cluster Torque is installed on /opt/torque,
not on /share/apps/torque.
That directory path may have changed on more recent versions of Rocks,
or you may have installed another copy
of Torque on /share/apps/torque.
However, have you checked where Torque is
Well, it now is launching just fine, so that's one thing! :-)
Afraid I'll have to let the TCP btl guys take over from here. It looks like
everything is up and running, but something strange is going on in the MPI
comm layer.
You can turn off those mca params I gave you as you are now past that
Yeah, it's the lib confusion that's the problem:
[saturna.cluster:07360] [[14551,0],0] ORTE_ERROR_LOG: Buffer type (described
vs non-described) mismatch - operation not allowed in file
base/odls_base_default_fns.c at line 2475
Have you tried configuring with
On 11-Aug-09, at 6:16 AM, Jeff Squyres wrote:
This means that OMPI is finding an mca_iof_proxy.la file at run time
from a prior version of Open MPI. You might want to use "find" or
"locate" to search your nodes and find it. I suspect that you
somehow have an OMPI 1.3.x install that
On 11-Aug-09, at 7:03 AM, Ralph Castain wrote:
Sigh - too early in the morning for this old brain, I fear...
You are right - the ranks are fine, and local rank doesn't matter.
It sounds like a problem where the TCP messaging is getting a
message ack'd from someone other than the process
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
-mca plm_base_verbose 5 --debug-daemons -mca odls_base_verbose 5
I'm afraid the output will be a tad verbose, but I would appreciate
seeing it. Might also tell us something about the lib issue.
Command line was:
Sigh - too early in the morning for this old brain, I fear...
You are right - the ranks are fine, and local rank doesn't matter. It sounds
like a problem where the TCP messaging is getting a message ack'd from
someone other than the process that was supposed to recv the message. This
should cause
On Aug 11, 2009, at 9:43 AM, Klymak Jody wrote:
[xserve03.local][[61029,1],4][btl_tcp_endpoint.c:486:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[61029,1],3]
This could well be caused by a version mismatch between your nodes.
E.g., if one node is
On 11-Aug-09, at 6:28 AM, Ralph Castain wrote:
The reason your job is hanging is sitting in the orte-ps output. You
have multiple processes declaring themselves to be the same MPI
rank. That definitely won't work.
Its the "local rank" if that makes any difference...
Any thoughts on this
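For reference, a small sketch that prints both ranks side by side (it assumes the 1.3-series launcher exports OMPI_COMM_WORLD_LOCAL_RANK to each process):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        const char *local;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* node-local rank as reported by the Open MPI launcher;
           this env var is an assumption about the 1.3 series */
        local = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
        printf("global rank %d, local rank %s\n",
               rank, local ? local : "(unset)");
        MPI_Finalize();
        return 0;
    }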
Hello guys,
When executing the following command with mtt and ompi 1.3.3:
mpirun --host
witch15,witch15,witch15,witch15,witch16,witch16,witch16,witch16,witch17,witch17,witch17,witch17,witch18,witch18,witch18,witch18,witch19,witch19,witch19,witch19
-np 20 --mca btl_openib_use_srq 1 --mca btl
Oops - I should have looked at your output more closely. The component_find
warnings are clearly indicating some old libs lying around, but that isn't
why your job is hanging.
The reason your job is hanging is sitting in the orte-ps output. You have
multiple processes declaring themselves to be
Sorry, but Jeff is correct - that error message clearly indicates a version
mismatch. Somewhere, one or more of your nodes is still picking up an old
version.
On Tue, Aug 11, 2009 at 7:16 AM, Jeff Squyres wrote:
> On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
>
> I have
On Aug 11, 2009, at 9:11 AM, Klymak Jody wrote:
I have removed all the OS X-supplied libraries, recompiled and
installed openmpi 1.3.3, and I am *still* getting this warning when
running ompi_info:
[saturna.cluster:50307] mca: base: component_find: iof "mca_iof_proxy"
uses an MCA interface
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:
Interesting! Well, I always make sure I have my personal OMPI build
before any system stuff, and I work exclusively on Mac OS-X:
I am still finding this very mysterious
I have removed all the OS X-supplied libraries, recompiled and
On Aug 11, 2009, at 5:17 AM, Ashley Pittman wrote:
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:
If it isn't already there, try putting a print statement right at
program start, another just prior to MPI_Init, and another just after
MPI_Init. It could be that something is hanging
On Tue, 2009-08-11 at 03:03 -0600, Ralph Castain wrote:
> If it isn't already there, try putting a print statement right at
> program start, another just prior to MPI_Init, and another just after
> MPI_Init. It could be that something is hanging somewhere during
> program startup since it sounds
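A minimal sketch of that suggestion, using stderr so the prints are unbuffered and show up even if the program hangs:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        fprintf(stderr, "program start\n");   /* before any MPI call */
        fprintf(stderr, "calling MPI_Init\n");
        MPI_Init(&argc, &argv);
        fprintf(stderr, "MPI_Init returned\n");
        /* ... rest of the application ... */
        MPI_Finalize();
        return 0;
    }

If "calling MPI_Init" appears but "MPI_Init returned" never does, the hang is inside startup rather than in the application itself.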
I searched the FAQ and google but couldn't come up with a solution to
this problem.
My problem is that when one MPI execution host dies or the network
connection goes down, the job is not aborted. Instead the remaining
processes continue to eat 100% CPU indefinitely. How can I make jobs
abort
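One workaround sketch (not from this thread, and only a debugging aid): arm a per-rank watchdog that tears the job down with MPI_Abort if a blocking call stalls past a timeout. The timeout and the choice of collective are illustrative, and calling MPI functions from a signal handler is not strictly safe:

    #include <mpi.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void watchdog(int sig)
    {
        (void)sig;
        fprintf(stderr, "watchdog: no progress, aborting job\n");
        /* MPI_Abort from a handler is a last-resort debugging hack */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int main(int argc, char **argv)
    {
        int token = 0;

        MPI_Init(&argc, &argv);
        signal(SIGALRM, watchdog);

        alarm(300);                  /* fire if stuck for 5 minutes */
        MPI_Allreduce(MPI_IN_PLACE, &token, 1, MPI_INT, MPI_SUM,
                      MPI_COMM_WORLD);
        alarm(0);                    /* progress made: disarm */

        MPI_Finalize();
        return 0;
    }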
On Aug 10, 2009, at 10:36 PM, Craig Plaisance wrote:
I am building openmpi on a cluster running rocks. When I build
using ./configure --with-tm=/share/apps/torque --prefix=/share/apps/openmpi/intel
I receive the warning
configure: WARNING: Unrecognized options: --with-tm, --enable-ltdl-
On 10-Aug-09, at 8:03 PM, Ralph Castain wrote:
Interesting! Well, I always make sure I have my personal OMPI build
before any system stuff, and I work exclusively on Mac OS-X:
Note that I always configure with --prefix=somewhere-in-my-own-dir,
never to a system directory. Avoids this kind
Interesting! Well, I always make sure I have my personal OMPI build
before any system stuff, and I work exclusively on Mac OS-X:
rhc$ echo $PATH
/Library/Frameworks/Python.framework/Versions/Current/bin:/Users/rhc/openmpi/bin:/Users/rhc/bin:/opt/local/bin:/usr/X11R6/bin:/usr/local/