This looks like an issue with DNS lookup.  You probably either need to use
the full hostname instead of the shortname, such as "devtty21.mydomain.com"
in place of "devtty21", or edit your /etc/hosts file to ensure that
"devtty21" maps to an IP address correctly.  Under the covers we use SSH to
establish links across machines, so one useful test is to type "ssh
devtty21" at the command line.  It needs to be successful, giving you a
command prompt on the devtty21 machine.  If it prompts you for a password
or comes back with any other question or issue, that needs to be fixed
before multi-place will work correctly.
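
Before trying the ssh test, it can help to confirm what the short and full names actually resolve to on each node. The following is a minimal sketch using only the Python standard library; the function name resolve_ipv4 is made up for illustration, and "devtty21.mydomain.com" is the placeholder name used above, not a real host.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to.

    An empty list means the name did not resolve at all (checked through the
    system resolver, so /etc/hosts entries are taken into account).
    """
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


# Example use, run on each node in turn:
#   print(resolve_ipv4("devtty21"))
#   print(resolve_ipv4("devtty21.mydomain.com"))
```

If the short name returns an empty list while the full name resolves, or the two return different addresses, that points at the hostname mapping rather than at X10 itself.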

Another possibility is that there is a mismatch between the Ethernet and
InfiniBand networks, e.g. the hostname is mapped to the InfiniBand
interface, and you're attempting to reach it via the Ethernet interface.
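
One rough way to check for this kind of mismatch is to see which network a resolved address falls into. The sketch below is only an illustration: the /16 prefixes are assumptions I picked so that the two addresses quoted in the original message below land in different networks, and the names NETWORKS and classify are hypothetical. The real prefixes have to come from your own network configuration.

```python
import ipaddress

# ASSUMED subnets, for illustration only. Replace them with the actual
# Ethernet and InfiniBand prefixes configured on the cluster.
NETWORKS = {
    "ethernet": ipaddress.ip_network("10.212.0.0/16"),
    "infiniband": ipaddress.ip_network("10.176.0.0/16"),
}


def classify(address, networks=NETWORKS):
    """Return the name of the network containing address, or None."""
    ip = ipaddress.ip_address(address)
    for name, net in networks.items():
        if ip in net:
            return name
    return None


# Example use with the addresses from the original message:
#   classify("10.212.55.40")  -> "ethernet" under the assumed prefixes
#   classify("10.176.5.36")   -> "infiniband" under the assumed prefixes
```

If the address a hostname resolves to is classified as one network while the launcher is connecting over the other, that is worth looking at first.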

In either case, if you have the network configuration files from the
previous builds still around, I would compare them to the new ones to see
what might have changed.


 - Ben



From:   Brendan Sheridan <bs...@georgetown.edu>
To:     x10-users@lists.sourceforge.net
Date:   02/27/2016 07:37 AM
Subject:        [X10-users] trouble with communication runtimes



Hi,

We recently had to rebuild all our cluster nodes and went from RHEL 6 to 7.2.
Theoretically, nothing package-wise changed drastically (and X10 was
working fine before), but I'm having a lot of trouble getting X10 running
again on more than one node and am running out of ideas. Can anyone suggest
something?

I've tried X10 2.5.3, 2.5.4, and the trunk built from source, as well as the
2.5.4 Linux/x86_64 prebuilt. They all complain about some sort of
connectivity issue when I try to run the HelloWholeWorld.x10 example
(devtty21 and devtty22 are two random nodes):

$ x10c++  HelloWholeWorld.x10 -o hello
$ X10_HOSTLIST="devtty21,devtty22" X10_NPLACES=2 ./hello world
TCP::connect timeout
Launcher 1: failed to connect to parent
No route to host
Launcher 0: tearing down remaining runtimes after waiting 3 seconds
Launcher 0: cleanup complete, exit code=9.  Goodbye!
Launcher -1: cleanup complete, exit code=9.  Goodbye!

Looking at the packet capture, the initial ssh communication and port
notification work fine, but the subsequent child->parent connection on that
port gets filtered by the OS (I think because the port is never opened
properly on the parent?). I can set up a simple Python client/server on the
same port, so it doesn't seem to be a firewall issue. I tried recompiling
the socket runtime with debug defined, but nothing in the output looked
suspicious, e.g.

...
Launcher 0: opened listen socket at port 38786
...
Launcher 1: connecting to parent via: 10.212.55.40:38786
TCP::connect timeout
Launcher 1: failed to connect to parent
No route to host
...
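
For reference, the kind of client/server check mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact script I used; the helper names open_listener and try_connect are hypothetical, and the 3-second timeout is an arbitrary choice.

```python
import socket


def open_listener(bind_addr="0.0.0.0", port=0):
    """Open a listening TCP socket; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv


def try_connect(host, port, timeout=3.0):
    """Attempt a TCP connection; return (True, None) or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers timeouts, "Connection refused", "No route to host", etc.
        return False, str(exc)


# Example use:
#   on the parent node:  srv = open_listener(port=38786)
#   on the child node:   print(try_connect("10.212.55.40", 38786))
```

Running the listener on one node and the connect attempt on the other, with the port and address taken from the launcher output, shows whether a plain TCP connection on that port gets through independently of X10.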

I've also tried with openmpi-1.10.2 and gotten similar results (10.176.5.36
and 10.212.55.40 are the IP addresses of the InfiniBand and Ethernet
interfaces of devtty21, respectively):

$ x10c++  -x10rt mpi HelloWholeWorld.x10 -o hello
$ mpirun --host devtty22,devtty21 -np 2 hello world

[devtty22][[22596,1],0][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.176.5.36 failed: No route to host (113)
[devtty22][[22596,1],0][btl_tcp_endpoint.c:818:mca_btl_tcp_endpoint_complete_connect] connect() to 10.212.55.40 failed: No route to host (113)
(it hangs indefinitely here)

The standard mpi hello world on the same nodes works fine -

$ mpic++ mpi_hello_world.c -o mpi_hello
$ mpirun --host devtty21,devtty22 -np 2 hello world
Hello world from processor devtty22.cs.georgetown.edu, rank 1 out of 2 processors
Hello world from processor devtty21.cs.georgetown.edu, rank 0 out of 2 processors

Any ideas? I'm happy to provide any additional configuration information if
it would be helpful.

Thanks,
Brendan Sheridan
------------------------------------------------------------------------------

Site24x7 APM Insight: Get Deep Visibility into Application Performance
APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
Monitor end-to-end web transactions and take corrective actions now
Troubleshoot faster and improve end-user experience. Signup Now!
http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
_______________________________________________
X10-users mailing list
X10-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/x10-users

