On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter <[email protected]> wrote:

> I'm sure that slurm is built to handle failover like this! That's
> why I'm so troubled by the behavior I'm seeing.
>
> The node configuration is relatively straightforward:
>
> NodeName=thenodename Procs=4 State=UNKNOWN
> #.. other nodes defined the same way
>
> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100
> Shared=NO Default=YES

Can you resolve all these hostnames on the controller node?

> Could it be the State=UNKNOWN that is screwing things up? Are
> there any other configuration options that could produce this
> behavior? What seemed to happen was that a reconfigure command
> was sent to the controller, right before the error messages I sent
> previously:
>
> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
>
>
> On Wed, Aug 10, 2011 at 9:03 AM, <[email protected]> wrote:
> > There are many clusters running slurm where nodes go down daily and you are
> > the first person to report a problem. My best guess is that your slurm.conf
> > file is bad. What do your node configuration line(s) look like in
> > slurm.conf?
> >
> > Quoting Mike Schachter <[email protected]>:
> >
> >> So this morning a node went down in the middle of a bunch of
> >> jobs running, the slurm controller tried to reconfigure, and this
> >> was the only error message we got in the log file:
> >>
> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
> >> slurmctld: fatal: slurm_set_addr failure on thenodename
> >>
> >> The controller won't even restart if I set State=DOWN for the node
> >> in /etc/slurm.conf. I have to manually remove the node from the
> >> configuration file in order for the controller to restart.
> >>
> >> This is a huge problem for us! We expected that nodes could go
> >> down with graceful failover. Any idea what's going on?
> >>
> >> mike
> >>
> >>
> >> On Tue, Aug 2, 2011 at 4:53 PM, <[email protected]> wrote:
> >>>
> >>> Your log should say what is happening. If not, try logging on as root and
> >>> starting the daemon by hand with lots of debugging (-v's):
> >>> "slurmctld -Dvvvvv"
> >>>
> >>> Quoting Mike Schachter <[email protected]>:
> >>>
> >>>> Hi there,
> >>>>
> >>>> If we have a node down, and then restart the slurm controller,
> >>>> for some reason slurm won't start up! Is there some way to
> >>>> ameliorate this issue?
> >>>>
> >>>> mike
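[Editor's note: the "Can you resolve all these hostnames" check above can be reproduced outside slurmctld. The sketch below is illustrative only, not Slurm code; the node names are the placeholders from the thread, and `resolvable` is a hypothetical helper mimicking the address lookup that fails in the `slurm_set_addr` error message.]

```python
import socket

def resolvable(hostname):
    """Return True if this host can resolve hostname to an address,
    roughly the lookup that slurmctld needs for every NodeName."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    # Run this on the controller node for every NodeName in slurm.conf
    # before (re)starting slurmctld; any failure here will also make
    # slurmctld fail with "Unable to resolve ...: Unknown host".
    for node in ["nodename1", "nodename2", "thenodename"]:
        if not resolvable(node):
            print(f"cannot resolve {node}: fix DNS or /etc/hosts, "
                  f"or remove the node from slurm.conf")
```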
