On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter <[email protected]> 
wrote:
> I'm sure that slurm is built to handle failover like this! That's
> why I'm so troubled by the behavior I'm seeing.
> 
> The node configuration is relatively straightforward:
> 
> NodeName=thenodename Procs=4 State=UNKNOWN
> #.. other nodes defined the same way
> 
> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100
> Shared=NO Default=YES
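One thing worth noting while we're looking at this: if name resolution on the controller is unreliable, you can pin each node's address in slurm.conf so slurmctld does not depend on a lookup at all. A sketch (the IP below is hypothetical; use each node's real address):

```
NodeName=thenodename NodeAddr=10.0.0.3 Procs=4 State=UNKNOWN
```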

Can you resolve all these hostnames on the controller node?
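A quick sanity check you could run on the controller (a sketch; the node names are the ones from your config above, so substitute your real list):

```shell
# Check that every NodeName listed in slurm.conf resolves on the controller.
# getent uses the same NSS lookup path (/etc/hosts, then DNS) that
# slurmctld's address lookup goes through, so a name that fails here
# will likely trigger the same "Unable to resolve" fatal at startup.
for n in nodename1 nodename2 thenodename; do
    getent hosts "$n" > /dev/null || echo "cannot resolve: $n"
done
```

If any name prints here, fixing /etc/hosts or DNS on the controller should clear the fatal error.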


> Could it be the State=UNKNOWN that is screwing things up? Are
> there any other configuration options that could produce this
> behavior? What seemed to happen was that a reconfigure command
> was sent to the controller, right before the error messages I sent
> previously:
> 
> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
> 
> 
> 
> On Wed, Aug 10, 2011 at 9:03 AM,  <[email protected]> wrote:
> > There are many clusters running slurm where nodes go down daily and you are
> > the first person to report a problem. My best guess is that your slurm.conf
> > file is bad. What does your node configuration line(s) look like in
> > slurm.conf?
> >
> > Quoting Mike Schachter <[email protected]>:
> >
> >> So this morning a node went down in the middle of a bunch of
> >> jobs running, the slurm controller tried to reconfigure, and this
> >> was the only error message we got in the log file:
> >>
> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
> >> slurmctld: fatal: slurm_set_addr failure on thenodename
> >>
> >> The controller won't even restart if I set the State=DOWN for the node
> >> in /etc/slurm.conf. I have to manually remove the node from the
> >> configuration file in order for the controller to restart.
> >>
> >> This is a huge problem for us! We expected that nodes could go
> >> down with graceful failover. Any idea what's going on?
> >>
> >>  mike
> >>
> >>
> >> On Tue, Aug 2, 2011 at 4:53 PM,  <[email protected]> wrote:
> >>>
> >>> Your log should say what is happening. If not, try logging on as root and
> >>> starting the daemon by hand with lots of debugging (-v's):
> >>> "slurmctld -Dvvvvv"
> >>>
> >>> Quoting Mike Schachter <[email protected]>:
> >>>
> >>>> Hi there,
> >>>>
> >>>> If we have a node down, and then restart the slurm controller,
> >>>> for some reason slurm won't start up! Is there some way to
> >>>> ameliorate this issue?
> >>>>
> >>>>  mike
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
> >
> 
