On Thu, 25 Jun 2015 14:15:17 -0700
Trevor Gale <[email protected]> wrote:
> Hello all,
> 
> I am experiencing an odd issue where my head node can see the compute
> node but the compute node cannot see the head node. If I run “sinfo”
> on the head node I see both nodes in the state idle, but I can’t run
> sinfo on the compute node. If I look at the head node's logs I see no
> issues, and I see things like “node_did_resp compute0”. But if I look
> at the compute node's log I see “slurm connect failed: no route to
> host”. I am using the IP addresses that I assigned the nodes in my
> IPoIB config, and I know these IPs work normally (I can ssh, scp, and
> ping with them), but for some reason the compute node does not see
> the head node.
> 
> Does anyone have any idea what the issue might be?

Two ideas:

 1) You have different slurm.conf files with different node definitions
 across your cluster, causing connectivity problems.

 2) You have actual IPoIB connectivity problems, maybe the fairly recent
 rhel6/centos-6 bug that caused islands of connectivity under certain
 circumstances (fixed in -504.16.2)?
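
A rough way to check both ideas is sketched below in Python. It is only an illustration, not part of Slurm: the helper names conf_digest and can_connect are made up for this example. Run conf_digest on each node and compare the output to see whether the slurm.conf files differ, and run can_connect from the compute node to see whether a plain TCP connection to the controller works, which ping alone does not prove.

```python
import hashlib
import socket


def conf_digest(path):
    """Return the SHA-256 hex digest of a file such as slurm.conf.

    Identical files on the head node and compute node give identical digests.
    """
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def can_connect(host, port, timeout=3.0):
    """Try a TCP connection to host:port.

    Returns (True, "") on success, or (False, reason) where reason is the
    OS error text, for example "No route to host".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        return False, str(exc)
```

For example, conf_digest("/etc/slurm/slurm.conf") on both nodes and can_connect("10.0.0.1", 6817) on the compute node. The path and address here are assumptions to replace with your own values; 6817 is the default SlurmctldPort, so use whatever your slurm.conf sets if it differs.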

/Peter 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
