On Sep 11, 2008, at 6:29 PM, Prasanna Ranganathan wrote:

I have tried the following to no avail.

On 499 machines running Open MPI 1.2.7:

mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...

With different combinations of the following parameters

-mca btl_base_verbose 1 -mca btl_base_debug 2 -mca oob_base_verbose 1 -mca
oob_tcp_debug 1 -mca oob_tcp_listen_mode listen_thread -mca
btl_tcp_endpoint_cache 65536 -mca oob_tcp_peer_retries 120

I still get the "No route to host" error messages.

This is quite odd -- with the oob_tcp_listen_mode option, we have run jobs with thousands of processes in the v1.2 series. The startup is still a bit slow (it's vastly improved in the upcoming v1.3 series), but we didn't run into problems like this.

Can you absolutely verify that you are running 1.2.7 on all of your nodes and you have specified "-mca oob_tcp_listen_mode listen_thread" on the mpirun command line? The important part here is that when you invoke OMPI v1.2.7's mpirun on the head node, you are also using v1.2.7 on all the back-end nodes as well.
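One quick way to check this across the cluster is a small sketch like the following (my own helper, not part of Open MPI; it assumes passwordless ssh, a one-hostname-per-line nodelist file, and that ompi_info prints an "Open MPI: x.y.z" line):

```shell
# Collect the Open MPI version each back-end node reports (assumes
# passwordless ssh and a one-hostname-per-line "nodelist" file):
#
#   while read -r host; do
#     v=$(ssh "$host" ompi_info | awk -F': *' '/Open MPI:/ { print $2; exit }')
#     printf '%s %s\n' "$host" "$v"
#   done < nodelist > versions.txt

# Hypothetical helper: print every node whose recorded version does not
# match the expected one. Input lines look like "hostname 1.2.7".
check_versions() {
  expected="$1"; shift
  awk -v want="$expected" '$2 != want { print "MISMATCH on " $1 ": " $2 }' "$@"
}
```

Any line printed by `check_versions 1.2.7 versions.txt` means that node is not running the same release as the head node.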

Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and did not get any additional useful debug output other than the error messages.

I did notice one strange thing, though. The following is always successful
(at least in all my attempts):

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld

but

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
--debug-daemons

prints these error messages at the end from each of the nodes:

[idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx2:04064] [0,0,1] orted_recv_pls: received exit
[idx2:04064] *** Process received signal ***
[idx2:04064] Signal: Segmentation fault (11)
[idx2:04064] Signal code:  (128)
[idx2:04064] Failing at address: (nil)
[idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
[idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18) [0x2b92cc0202a2]
[idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70) [0x2b92cc00b5ac]
[idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20) [0x2b92cc00875c]
[idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
[idx2:04064] *** End of error message ***


I am not sure if this points to the actual cause of these issues. Is it to do with Open MPI 1.2.7 having POSIX threads enabled in the current configuration
on these nodes?


POSIX threads enabled should not cause these issues. What you want to see in ompi_info output is the following:

[6:46] svbu-mpi:~/hg/openib-fd-progress % ompi_info | grep thread
          Thread support: posix (mpi: no, progress: no)

The two "no"s are what matter here.
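If you want to check that line programmatically on each node, a minimal sketch (the `threads_ok` helper is my own, hypothetical name) could look like:

```shell
# Hypothetical helper: succeed only if the ompi_info "Thread support"
# line shows both MPI threads and progress threads disabled.
threads_ok() {
  case "$1" in
    *"mpi: no, progress: no"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (on each node):
#   threads_ok "$(ompi_info | grep 'Thread support')" || echo "threads enabled!"
```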

--
Jeff Squyres
Cisco Systems
