No, once the node "thenodename" goes down, there is no more host
resolution; it can't be pinged or ssh'ed to.
mike

On Wed, Aug 10, 2011 at 9:21 AM, Mark A. Grondona <[email protected]> wrote:
> On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter <[email protected]> wrote:
>> I'm sure that slurm is built to handle failover like this! That's
>> why I'm so troubled by the behavior I'm seeing.
>>
>> The node configuration is relatively straightforward:
>>
>> NodeName=thenodename Procs=4 State=UNKNOWN
>> #.. other nodes defined the same way
>>
>> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100
>> Shared=NO Default=YES
>
> Can you resolve all these hostnames on the controller node?
>
>> Could it be the State=UNKNOWN that is screwing things up? Are
>> there any other configuration options that could produce this
>> behavior? What seemed to happen was that a reconfigure command
>> was sent to the controller, right before the error messages I sent
>> previously:
>>
>> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
>>
>> On Wed, Aug 10, 2011 at 9:03 AM, <[email protected]> wrote:
>> > There are many clusters running slurm where nodes go down daily and you are
>> > the first person to report a problem. My best guess is that your slurm.conf
>> > file is bad. What do your node configuration line(s) look like in
>> > slurm.conf?
>> >
>> > Quoting Mike Schachter <[email protected]>:
>> >
>> >> So this morning a node went down in the middle of a bunch of
>> >> jobs running, the slurm controller tried to reconfigure, and this
>> >> was the only error message we got in the log file:
>> >>
>> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
>> >> slurmctld: fatal: slurm_set_addr failure on thenodename
>> >>
>> >> The controller won't even restart if I set State=DOWN for the node
>> >> in /etc/slurm.conf. I have to manually remove the node from the
>> >> configuration file in order for the controller to restart.
>> >>
>> >> This is a huge problem for us! We expected that nodes could go
>> >> down with graceful failover.
>> >> Any idea what's going on?
>> >>
>> >> mike
>> >>
>> >> On Tue, Aug 2, 2011 at 4:53 PM, <[email protected]> wrote:
>> >>>
>> >>> Your log should say what is happening. If not, try logging on as root and
>> >>> starting the daemon by hand with lots of debugging (-v's):
>> >>> "slurmctld -Dvvvvv"
>> >>>
>> >>> Quoting Mike Schachter <[email protected]>:
>> >>>
>> >>>> Hi there,
>> >>>>
>> >>>> If we have a node down, and then restart the slurm controller,
>> >>>> for some reason slurm won't start up! Is there some way to
>> >>>> ameliorate this issue?
>> >>>>
>> >>>> mike
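[Editor's note: since the controller cannot resolve a node that is down, one
common way to decouple slurmctld from live DNS is to pin the name-to-address
mapping statically. A sketch, assuming invented 10.0.0.x addresses:]

```
# /etc/hosts on the controller -- addresses here are examples only
10.0.0.1  nodename1
10.0.0.2  nodename2
10.0.0.3  thenodename
```

Alternatively, slurm.conf's NodeAddr parameter can carry the address
directly (e.g. "NodeName=thenodename NodeAddr=10.0.0.3 Procs=4
State=UNKNOWN"), so no lookup is needed. Either way the mapping survives
the node going down.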
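[Editor's note: Mark's question, "Can you resolve all these hostnames on the
controller node?", can be answered directly with getent, which consults the
same NSS sources (/etc/hosts, DNS) that slurmctld uses at startup. A minimal
check script, using the placeholder node names from the thread:]

```shell
#!/bin/sh
# Run on the controller: check that every node name listed in slurm.conf
# still resolves. Node names below are the placeholders from this thread.
for node in nodename1 nodename2 thenodename; do
    if getent hosts "$node" > /dev/null 2>&1; then
        echo "$node resolves"
    else
        echo "$node: Unknown host -- slurm_set_addr would fail for this node"
    fi
done
```

If a name stops resolving the moment its host powers off, the record is
coming from a dynamic source (e.g. DHCP-driven DNS updates) rather than
static configuration, which matches the behavior described at the top of
the thread.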
