It sounds to me like TCP communication isn't getting through for some
reason. Try the following:
mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
You should see output from the receipt of a daemon callback for each
daemon, then the sending of the launch command. My guess is that you
won't see all the daemons call back, which is why mpirun hangs.
This should tell you which node isn't getting a message back to
wherever mpirun is executing. You might then check that no firewall
sits between mpirun and that node, that there is a TCP path back from
it, etc.
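
One rough way to test the "TCP path back" is to try opening a plain TCP connection from the remote node to the machine running mpirun. The sketch below is not an Open MPI tool: the function name, the port number, and the timeout are made up for illustration. It assumes bash (for the /dev/tcp redirection) and the `timeout` utility from GNU coreutils are available. Since the daemons report "not using static ports", there is no fixed Open MPI port to probe, so this only shows whether some arbitrary TCP connection gets through.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a TCP connection to host:port can be opened.
# tcp_reachable is a hypothetical helper, not part of Open MPI.

tcp_reachable() {
    local host=$1 port=$2 wait_secs=${3:-3}
    # Open the connection in a child bash so a failure cannot disturb
    # the calling shell. timeout bounds the wait in case a firewall
    # silently drops the packets instead of refusing the connection.
    timeout "$wait_secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' \
        "$host" "$port" 2>/dev/null
}

# Usage sketch (port 5000 is an arbitrary assumption): start a temporary
# listener on port 5000 on ccn3, then run this on ccn4:
#   tcp_reachable ccn3 5000 && echo "path to ccn3 open" || echo "blocked"
```

If the connection fails in one direction only, a host firewall on the receiving side is a likely suspect.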
I can help with additional diagnostics once we get that far.
Ralph
On Feb 7, 2009, at 2:40 PM, Kersey Black wrote:
Hi,
Disclaimer up front -- a newbie to openmpi working to get Gromacs
and other modeling code running.
I have it running fine on the local machine, but I am unable to get
openmpi to work when trying to include a remote machine.
Any help or pointers would be greatly appreciated.
System: openSUSE 10.3.
Openmpi: first I installed 1.2.2 as an rpm from yast, and, when that
did not seem to work, I switched to the current release of 1.3,
compiled with default configuration options, except that I did use
--prefix to set the installation directory.
openmpi-mca-params.conf: (with 1.3) I have only added
btl = self,tcp
mpi_show_mca_params = enviro
ssh: host-based authentication
With both installs, I can run on multiple slots on the local
machine, but when I try to include a remote machine, it hangs.
Using this hostfile:
ccn3 slots=2 max_slots=2
ccn4 slots=2 max_slots=2
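
To run a per-node check (ssh login, PATH, TCP reachability) against every machine in a hostfile like this one, the host names can be pulled out of the first column. This is a minimal sketch; `hostfile_hosts` is a hypothetical helper name, and it assumes the simple "name key=value ..." layout shown above, with `#` starting a comment line.

```shell
# Sketch: print the host name (first field) of each hostfile line,
# skipping blank lines and lines that start with '#'.
hostfile_hosts() {
    awk '!/^[[:space:]]*#/ && NF { print $1 }' "$1"
}

# Usage sketch: confirm that non-interactive ssh to each node works.
#   for h in $(hostfile_hosts myh3); do ssh "$h" hostname; done
```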
Typical output (this is from 1.3) when I try to run two slots
locally (ccn3) and 2 on the remote machine (ccn4):
-----
black@ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 4 hostname
Daemon was launched on ccn3 - beginning to initialize
Daemon [[63883,0],1] checking in as pid 20554 on host ccn3
Daemon [[63883,0],1] not using static ports
[ccn3:20554] [[63883,0],1] orted: up and running - waiting for commands!
Daemon was launched on ccn4 - beginning to initialize
Daemon [[63883,0],2] checking in as pid 7485 on host ccn4
Daemon [[63883,0],2] not using static ports
----
And here it hangs
When I kill the job with ^C, I get:
ccn3
ccn4 - daemon did not report back when launched
Everything I read in the FAQ (in particular in part 2 of the
"Running MPI" portion) suggests that this has to do with SSH
problems, or with PATH problems.
SSH is configured and working for host-based authentication. It
seems to be fine.
I set the LD_LIBRARY_PATH to include openmpi/lib and include the
openmpi/bin directory in PATH as part of a script that runs for all
users (called by /bin/bashrc.local), and when things did not work, I
included the same code in ~/.bashrc and ~/.profile. All of this
results in it being set 3 times (from `env`) in an interactive shell,
but it has not solved the problem.
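
What matters for mpirun is the environment of a non-interactive ssh command, which can differ from an interactive login (for example, ~/.profile is normally not read in that case). The sketch below checks whether a directory appears in the PATH that such a command reports. `in_path_list` is a hypothetical helper, and /opt/openmpi stands in for whatever was passed to --prefix. In the 4-process output above, the daemon on ccn4 did check in, so orted was evidently found there; this is only a sanity check.

```shell
# Sketch: succeed if directory $1 appears in the colon-separated list $2.
in_path_list() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage sketch (/opt/openmpi is an assumed prefix, substitute your own):
#   remote_path=$(ssh ccn4 'echo "$PATH"')
#   in_path_list /opt/openmpi/bin "$remote_path" \
#       || echo "openmpi bin dir missing for non-interactive ssh"
```

The same test can be applied to LD_LIBRARY_PATH with the lib directory.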
For comparison, when I run it locally on just two slots on the local
machine, I get:
black@ccn3:~/Documents/mp> mpirun --debug-daemons --hostfile myh3 -np 2 hostname
Daemon was launched on ccn3 - beginning to initialize
Daemon [[63924,0],1] checking in as pid 20608 on host ccn3
Daemon [[63924,0],1] not using static ports
[ccn3:20603] [[63924,0],0] orted_cmd: received add_local_procs
[ccn3:20603] [[63924,0],0] node[0].name ccn3 daemon 0 arch ffc91200
[ccn3:20603] [[63924,0],0] node[1].name ccn3 daemon 1 arch ffc91200
[ccn3:20603] [[63924,0],0] node[2].name ccn4 daemon INVALID arch ffc91200
[ccn3:20608] [[63924,0],1] orted: up and running - waiting for commands!
[ccn3:20608] [[63924,0],1] orted_cmd: received add_local_procs
[ccn3:20608] [[63924,0],1] node[0].name ccn3 daemon 0 arch ffc91200
[ccn3:20608] [[63924,0],1] node[1].name ccn3 daemon 1 arch ffc91200
[ccn3:20608] [[63924,0],1] node[2].name ccn4 daemon INVALID arch ffc91200
ccn3
[ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
ccn3
[ccn3:20608] [[63924,0],1] orted_cmd: received waitpid_fired cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received iof_complete cmd
[ccn3:20608] [[63924,0],1] orted_cmd: received exit
[ccn3:20608] [[63924,0],1] orted: finalizing
I can also run it locally on the remote machine with the command:
ssh ccn4 mpirun --debug-daemons -np 2 hostname
Many thanks for any ideas.
Kersey