Sorry for the delay in replying.
I'd check two things:
- Disable all firewall support between these two machines (see the
example commands after this list). OMPI uses random TCP ports to
communicate between processes; if those ports are blocked, Bad Things
will happen.
- It is easiest to install OMPI in the same location on all your
machines (e.g., /opt/openmpi). If you do that, you might want to try
configuring OMPI with --enable-mpirun-prefix-by-default. In rsh/ssh
environments, this flag will have mpirun set your PATH and
LD_LIBRARY_PATH properly on remote nodes.
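For the firewall check, the quickest test on the Fedora box is usually
something like the following (this assumes iptables is the firewall in
use; adjust for your setup). As root:

  service iptables stop      (temporarily turn the firewall off)
  iptables -L -n             (or just list the rules to see what is blocked)

If mpirun then works, turn the firewall back on with "service iptables
start" and open the required ports instead of leaving it disabled.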
Let us know how that works out.
On Jun 10, 2008, at 8:58 AM, jody wrote:
Interestingly, I can start mpirun from any of the remote machines,
running processes on the other remote machines and on the local machine.
But from the local machine I cannot start any process on a remote
machine -
it just shows the behavior detailed in the previous mail.
remote1 -> remote1 ok
remote1 -> remote2 ok
remote1 -> local ok
remote2 -> remote1 ok
remote2 -> remote2 ok
remote2 -> local ok
local -> local ok
local -> remote1 fails
local -> remote2 fails
My remote machines are freshly updated Gentoo machines (AMD);
my local machine is a freshly installed Fedora 8 (Intel Quadro).
All use a freshly installed Open MPI 1.2.5.
Before my Fedora machine crashed it had Fedora 6,
and everything worked great (with 1.2.2 on all machines).
Does anybody have a suggestion where I should look?
Thanks
Jody
On Tue, Jun 10, 2008 at 12:59 PM, jody <jody....@gmail.com> wrote:
Hi
After a crash I reinstalled Open MPI 1.2.5 on my machines, using
./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default
and set PATH and LD_LIBRARY_PATH in .bashrc:
PATH=/opt/openmpi/bin:$PATH
export PATH
LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
First problem: the output of
ssh nano_00 printenv
does not contain the correct paths (and no LD_LIBRARY_PATH at all),
but with a normal ssh login both are set correctly.
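Could it be that .bashrc returns early for non-interactive shells, so
that the exports are never reached when ssh runs a single command? On
some systems the stock ~/.bashrc starts with a guard like this (this is
just a guess as to what is happening here):

  [ -z "$PS1" ] && return     (anything below is skipped for "ssh host cmd")

If so, moving the PATH/LD_LIBRARY_PATH exports above that line and
re-checking with

  ssh nano_00 'echo $PATH; echo $LD_LIBRARY_PATH'

should show whether that is the cause.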
When I run a test application on one computer, it works.
As soon as an additional computer is involved, there is no more output
and everything just hangs.
Adding the prefix doesn't change anything, even though Open MPI is
installed in the same directory (/opt/openmpi) on every computer.
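By "adding the prefix" I mean invoking mpirun with its --prefix option,
roughly like this (I hope that is the intended usage):

  mpirun --prefix /opt/openmpi -np 4 --hostfile testhosts MPITest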
Running with --debug-daemons doesn't help very much:
$ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch
(and nothing happens anymore)
On the remote host, I see the following three processes coming up
after I run mpirun on the local machine:
30603 ? S 0:00 sshd: jody@notty
30604 ? Ss 0:00 bash -c PATH=/opt/openmpi/bin:$PATH ;
export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons
--bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
30605 ? S 0:00 /opt/openmpi/bin/orted --debug-daemons
--bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename
nano_00 --universe j...@aim-plankton.uzh.ch:default-universe-14934
--nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl
So it looks as if the correct paths are set (probably thanks to
--enable-mpirun-prefix-by-default).
If I interrupt on the local machine (Ctrl-C):
[aim-plankton:14983] [0,0,1] orted_recv_pls: received message from
[0,0,0]
[aim-plankton:14983] [0,0,1] orted_recv_pls: received
kill_local_procs
[aim-plankton:14983] [0,0,1] orted_recv_pls: received message from
[0,0,0]
[aim-plankton:14983] [0,0,1] orted_recv_pls: received
kill_local_procs
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1166
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
errmgr_hnp.c at line 90
[aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start
as expected.
[aim-plankton:14982] ERROR: There may be more information available
from
[aim-plankton:14982] ERROR: the remote shell (see above).
[aim-plankton:14982] ERROR: The daemon exited unexpectedly with
status 255.
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1166
--------------------------------------------------------------------------
WARNING: mpirun has exited before it received notification that all
started processes had terminated. You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------
[aim-plankton:14983] OOB: Connection to HNP lost
On the remote machine, the "sshd: jody@notty" process is gone, but
the
other two stay.
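One more check I can think of, in case the daemon is dying because it
cannot resolve its shared libraries under a non-interactive ssh (this
is only a guess on my part):

  ssh nano_00 ldd /opt/openmpi/bin/orted

If any library showed up as "not found" there, that might explain the
daemon exiting with status 255.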
I would be grateful for any suggestions!
Jody
--
Jeff Squyres
Cisco Systems