Also check your iptables config on both nodes; there might be a firewall rule getting in the way. It sounds like you can talk back to the head node on an established socket, but you can't establish a new socket to the head node from the compute node.
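
One way to test that theory from the compute node is to try opening a TCP connection to the controller port directly. The sketch below is only an illustration: `probe_tcp` is a made-up helper name, "head0" is a placeholder hostname, and 6817 is Slurm's default SlurmctldPort (use whatever your slurm.conf sets). An iptables REJECT rule using icmp-host-prohibited shows up on the client as "No route to host" even when ping and ssh work, which would match the log message.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection and return a short diagnosis string.

    `probe_tcp` is a hypothetical helper, not part of Slurm.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all: packets are probably being dropped silently.
        return "timed out"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # Often a firewall REJECT rule rather than a real routing problem.
            return "no route to host"
        if exc.errno == errno.ECONNREFUSED:
            # The host answered, but nothing is listening on that port.
            return "connection refused"
        return f"failed: {exc}"


# Example call from the compute node (placeholder host, default slurmctld port):
# print(probe_tcp("head0", 6817))
```

If this reports "no route to host" while ssh to the same address works, the firewall rules on the head node are the first place to look.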

Eric

On 6/25/15 4:38 PM, Peter Kjellström wrote:
On Thu, 25 Jun 2015 14:15:17 -0700
Trevor Gale <[email protected]> wrote:
 > Hello all,
 >
 > I am experiencing an odd issue where my head node can see the compute
 > node but the compute node cannot see the head node. If I run “sinfo”
 > on the head node I see both nodes in the idle state, but I can’t run
 > sinfo on the compute node. If I look at the head node’s logs I see no
 > issues, and I see things like “node_did_resp compute0”. But if I look
 > at the compute node’s log I see “slurm connect failed: no route to
 > host”. I am using the IP addresses that I assigned the nodes in my
 > IPoIB config, and I know these IPs work normally (I can ssh, scp, and
 > ping with them), but for some reason the compute node does not see
 > the head node.
 >
 > Does anyone have any idea what the issue might be?

Two ideas:

1) You have different slurm.conf files with different node definitions
across your cluster, causing connectivity problems.

2) You have actual IPoIB connectivity problems, maybe the fairly recent
RHEL 6/CentOS 6 bug that caused islands of connectivity under certain
circumstances (fixed in -504.16.2)?
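
For the first idea, a quick check is to fingerprint slurm.conf on each node and compare the results. This is a rough sketch, not a Slurm tool: `config_fingerprint` is a hypothetical helper, it assumes everything after a "#" is a comment, and the path in the example comment is an assumption (adjust it to wherever your slurm.conf lives).

```python
import hashlib


def config_fingerprint(text):
    """Hash the meaningful lines of a config file.

    Comments (assumed to start with '#'), blank lines, and runs of
    whitespace are ignored so that cosmetic differences do not matter.
    """
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            # Collapse internal whitespace to a single space.
            lines.append(" ".join(line.split()))
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Example, run on each node and compare the printed values by eye
# (the path below is an assumption):
# with open("/etc/slurm/slurm.conf") as fh:
#     print(config_fingerprint(fh.read()))
```

If the fingerprints differ between the head node and the compute node, the node definitions are worth diffing line by line.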

/Peter

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.