Hi,

After a crash I reinstalled Open MPI 1.2.5 on my machines. I configured it with

    ./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default

and set PATH and LD_LIBRARY_PATH in .bashrc:

    PATH=/opt/openmpi/bin:$PATH
    export PATH
    LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH

First problem: the output of "ssh nano_00 printenv" does not contain the correct PATH (and no LD_LIBRARY_PATH at all), but with a normal ssh login both are set correctly.

Second problem: when I run a test application on one computer, it works. As soon as an additional computer is involved, there is no more output and everything just hangs. Specifying the prefix doesn't change anything, even though Open MPI is installed in the same directory (/opt/openmpi) on every computer.

The --debug-daemons option doesn't help very much:

    $ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
    Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch

(and after that nothing happens anymore)

On the remote host, I see the following three processes come up after I start mpirun on the local machine:

    30603 ?  S   0:00 sshd: jody@notty
    30604 ?  Ss  0:00 bash -c PATH=/opt/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
    30605 ?  S   0:00 /opt/openmpi/bin/orted --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename nano_00 --universe j...@aim-plankton.uzh.ch:default-universe-14934 --nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl

So it looks as if the correct paths are set for the daemon (probably the doing of --enable-mpirun-prefix-by-default).

If I interrupt mpirun on the local machine (Ctrl-C), I get:

    [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
    [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
    [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
    [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
    [aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start as expected.
    [aim-plankton:14982] ERROR: There may be more information available from
    [aim-plankton:14982] ERROR: the remote shell (see above).
    [aim-plankton:14982] ERROR: The daemon exited unexpectedly with status 255.
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
    --------------------------------------------------------------------------
    WARNING: mpirun has exited before it received notification that all
    started processes had terminated. You should double check and ensure
    that there are no runaway processes still executing.
    --------------------------------------------------------------------------
    [aim-plankton:14983] OOB: Connection to HNP lost

On the remote machine, the "sshd: jody@notty" process is then gone, but the other two (bash and orted) stay.

I would be grateful for any suggestions!

Jody
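
For anyone who wants to reproduce the first problem, here is a minimal sketch of how the two environments could be compared side by side. The helper name show_paths is made up for illustration, and using "bash -l -c printenv" to approximate a normal login is an assumption: depending on the startup files on the remote host, it may not read exactly what an interactive ssh login reads.

```shell
#!/bin/sh
# show_paths (hypothetical helper): read printenv-style output on stdin and
# print only PATH and LD_LIBRARY_PATH, marking a variable that is missing.
show_paths() {
    awk -F= '
        $1 == "PATH"            { path = $0; have_path = 1 }
        $1 == "LD_LIBRARY_PATH" { lib = $0;  have_lib = 1 }
        END {
            if (have_path) print path; else print "PATH (unset)"
            if (have_lib)  print lib;  else print "LD_LIBRARY_PATH (unset)"
        }'
}

# Usage (nano_00 is the remote host from this post):
#   ssh nano_00 printenv | show_paths                # non-interactive command
#   ssh nano_00 'bash -l -c printenv' | show_paths   # login shell (assumption, see above)
```

If the first command reports the variables as unset or without /opt/openmpi while the second one shows them, the settings in .bashrc are presumably not being applied to non-interactive ssh commands.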