On Thu, 25 Jun 2015 14:15:17 -0700
Trevor Gale <[email protected]> wrote:

> Hello all,
>
> I am experiencing an odd issue where my head node can see the compute
> node but the compute node cannot see the headnode. If I run “sinfo”
> on the head node I see both nodes in the state idle, but I can’t run
> sinfo on the compute node. If i look at the head nodes logs I see no
> issues, and I see things like “node_did_resp compute0”. but if I look
> at the compute nodes log I see “slurm connect failed: no route to
> host”. I am using the IP addresses that I assigned the nodes in my
> IPoIB config, and I know these IPs work normally (I can ssh, scp, and
> ping with them), but for some reason the compute node does not see
> the head node.
>
> Does anyone have any idea what the issue might be?

Two ideas:

1) You have different slurm.conf files with different node definitions
   across your cluster, causing connectivity problems.

2) You have actual IPoIB connectivity problems, possibly the fairly
   recent RHEL 6/CentOS 6 bug that caused islands of connectivity under
   certain circumstances (fixed in -504.16.2).

/Peter
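
Both ideas can be checked with a few lines of Python. This is only a rough sketch, not anything shipped with Slurm: the helper names `can_connect` and `config_digest` are made up for illustration, and it assumes slurmctld listens on the default SlurmctldPort of 6817 (change it if your slurm.conf sets another port).

```python
import hashlib
import socket

# Assumption: default SlurmctldPort; override if slurm.conf sets a different one.
SLURMCTLD_PORT = 6817


def can_connect(host, port=SLURMCTLD_PORT, timeout=3.0):
    """Try a plain TCP connection; return (True, "") or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # "No route to host", "Connection refused", timeouts, etc. end up here.
        return False, str(exc)


def config_digest(text):
    """Hash the body of a slurm.conf, ignoring comments and blank lines."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop "#" comments and padding
        if line:
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```

Run `can_connect()` on the compute node with the head node's IPoIB address: ssh and ping succeeding does not prove that the slurmctld port is reachable. Run `config_digest(open(path).read())` on each node, with `path` pointing at that node's slurm.conf; differing digests mean the files have diverged.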
