Hi,
We recently had to rebuild all our cluster nodes and went from RHEL 6 to 7.2.
In theory nothing changed drastically package-wise (and X10 was working fine
before), but I'm having a lot of trouble getting X10 running again on more
than one node, and I'm running out of ideas. Can anyone suggest something?
I've tried X10 2.5.3, 2.5.4, and the trunk built from source, as well as the
2.5.4 Linux/x86_64 prebuilt. They all complain about the same sort of
connectivity issue when I try to run the HelloWholeWorld.x10 example
(devtty21 and devtty22 are two arbitrary nodes) -
$ x10c++ HelloWholeWorld.x10 -o hello
$ X10_HOSTLIST="devtty21,devtty22" X10_NPLACES=2 ./hello world
TCP::connect timeout
Launcher 1: failed to connect to parent
No route to host
Launcher 0: tearing down remaining runtimes after waiting 3 seconds
Launcher 0: cleanup complete, exit code=9. Goodbye!
Launcher -1: cleanup complete, exit code=9. Goodbye!
Looking at the packet capture, the initial ssh communication and port
notification work fine, but the subsequent child->parent connection on that
port gets filtered by the OS (I think because the port is never opened
properly on the parent?). I can set up a simple Python client/server on the
same port (sketch after the debug output below), so it doesn't seem to be a
firewall issue. I tried recompiling the socket runtime with debug defined,
but nothing in the output looked suspicious, e.g.
...
Launcher 0: opened listen socket at port 38786
...
Launcher 1: connecting to parent via: 10.212.55.40:38786
TCP::connect timeout
Launcher 1: failed to connect to parent
No route to host
...
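For reference, the Python check was roughly the following (a minimal sketch;
the port and parent address are the ones from the debug output above):

# server.py - run on devtty21, standing in for the parent's listen socket
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("", 38786))   # same port the launcher reported
srv.listen(1)
conn, addr = srv.accept()
print("accepted connection from %s:%s" % addr)
conn.close()

# client.py - run on devtty22, standing in for the child launcher
import socket

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("10.212.55.40", 38786))   # parent's ethernet address
print("connected ok")
cli.close()

That pair connects without any trouble, which is why I don't think the
firewall is blocking the port itself.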
I've also tried with openmpi-1.10.2 and gotten similar results (10.176.5.36
and 10.212.55.40 are the IPs of the InfiniBand and ethernet interfaces of
devtty21, respectively) -
$ x10c++ -x10rt mpi HelloWholeWorld.x10 -o hello
$ mpirun --host devtty22,devtty21 -np 2 hello world
[devtty22][[22596,1],0][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.176.5.36 failed: No route to host (113)
[devtty22][[22596,1],0][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.212.55.40 failed: No route to host (113)
(it hangs indefinitely here)
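(If it would help narrow things down, one thing I haven't yet tried is
pinning the TCP BTL to a single interface, e.g.

$ mpirun --mca btl_tcp_if_include em1 --host devtty22,devtty21 -np 2 hello world

where em1 stands in for whatever our ethernet device is actually named -
though since both addresses fail with "No route to host" above, I'm not sure
interface selection is the whole story.)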
The standard MPI hello world on the same nodes works fine (though I realize
it doesn't send any inter-rank messages, so it may not exercise the TCP BTL
the way the X10 runtime does) -
$ mpic++ mpi_hello_world.c -o mpi_hello
$ mpirun --host devtty21,devtty22 -np 2 mpi_hello
Hello world from processor devtty22.cs.georgetown.edu, rank 1 out of 2 processors
Hello world from processor devtty21.cs.georgetown.edu, rank 0 out of 2 processors
Any ideas? I'm happy to provide any additional configuration information if
it would be helpful.
Thanks,
Brendan Sheridan