On Sep 11, 2008, at 6:29 PM, Prasanna Ranganathan wrote:
> I have tried the following to no avail.
>
> On 499 machines running Open MPI 1.2.7:
>
> mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...
>
> with different combinations of the following parameters:
>
> -mca btl_base_verbose 1 -mca btl_base_debug 2 -mca oob_base_verbose 1
> -mca oob_tcp_debug 1 -mca oob_tcp_listen_mode listen_thread
> -mca btl_tcp_endpoint_cache 65536 -mca oob_tcp_peer_retries 120
>
> I still get the "No route to host" error messages.
This is quite odd -- with the oob_tcp_listen_mode option, we have run
jobs in the thousands of processes in the v1.2 series. The startup is
still a bit slow (it's vastly improved in the upcoming v1.3 series),
but we didn't run into problems like this.
Can you absolutely verify that you are running 1.2.7 on all of your
nodes and you have specified "-mca oob_tcp_listen_mode listen_thread"
on the mpirun command line? The important part here is that when you
invoke OMPI v1.2.7's mpirun on the head node, you are also using
v1.2.7 on all the back-end nodes as well.
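If it helps, a quick spot-check is something like the following (a sketch only; it assumes your nodelist file contains one hostname per line, so adjust if you use slot syntax):

for h in $(sort -u nodelist); do
    echo -n "$h: "
    ssh $h 'ompi_info | grep "Open MPI:"'
done

Every node should report "Open MPI: 1.2.7". Also make sure the back-end nodes resolve the same installation in their non-interactive PATH; an ssh'd command can pick up a different prefix than your login shell does.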
> Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and
> did not get any additional useful debug output other than the error
> messages.
>
> I did notice one strange thing, though. The following is always
> successful (at least in all my attempts):
>
> mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
>
> but
>
> mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld --debug-daemons
>
> prints these error messages at the end from each of the nodes:
> [idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [idx2:04064] [0,0,1] orted_recv_pls: received exit
> [idx2:04064] *** Process received signal ***
> [idx2:04064] Signal: Segmentation fault (11)
> [idx2:04064] Signal code: (128)
> [idx2:04064] Failing at address: (nil)
> [idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
> [idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2b92cc0202a2]
> [idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2b92cc00b5ac]
> [idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2b92cc00875c]
> [idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
> [idx2:04064] *** End of error message ***
> I am not sure if this points to the actual cause of these issues. Is it
> to do with Open MPI 1.2.7 having POSIX threads enabled in the current
> configuration on these nodes?
Having POSIX thread support enabled should not cause these issues. What you want to see in the ompi_info output is the following:
[6:46] svbu-mpi:~/hg/openib-fd-progress % ompi_info | grep thread
Thread support: posix (mpi: no, progress: no)
The two "no"s are the important part here.
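If either of those shows "yes", the build was configured with threading turned on and you'd want to rebuild without it. A minimal sketch, assuming the v1.2-era configure switches (verify against ./configure --help in your source tree, and substitute your real installation prefix):

./configure --disable-mpi-threads --disable-progress-threads \
    --prefix=<your existing install prefix>
make all install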
--
Jeff Squyres
Cisco Systems