Ah, success. It was GRES-related. I had verified the slurm.conf files are the
same, but I never verified the gres.conf. It looks like our production
gres.conf had been copied to the backup controller; it had the same
GRES names, but different hosts associated with them. Fixing that and
restarting resolved it.
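For reference, gres.conf can scope each GRES to specific hosts with NodeName
lines, which is exactly where a copied file bites you. A minimal sketch, with
made-up node names and device paths:

  # gres.conf - hypothetical entries; Name must match the Gres= in slurm.conf,
  # and NodeName/File must match the hosts the file is actually deployed to
  NodeName=prod-gpu[01-04] Name=gpu File=/dev/nvidia[0-1]

An entry like that copied verbatim onto the backup controller points at hosts
and devices that don't exist there.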
Yes, the head node & backup head sync to the same NTP server. Verifying by
hand, they seem to be within 1 second of each other. Here's the node info
it finds as it starts up, from slurmd.log:
[2017-01-31T15:31:59.711] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2
Memory=48388 TmpDisk=508671 Uptime=1147426
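For cross-checking those detected values against slurm.conf, slurmd's config
probe is handy (output format varies a little by version):

  # print the hardware slurmd detects on this host, in slurm.conf-style syntax
  slurmd -C

On this node it should report something close to the CPUs/Sockets/Cores/Memory
figures in the log line above.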
Similar to Lachlan's suggestions: check that the slurm.conf is the same on
all nodes, and in particular that the numbers of CPUs and cores are correct.
Have you tried removing the Gres parameters? Perhaps it's looking for devices
it can't find.
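Roughly what that looks like in slurm.conf, with illustrative counts taken
from the slurmd.log line above; dropping the Gres= piece is a quick way to
test the "devices it can't find" theory:

  # slurm.conf - hypothetical node definition
  NodeName=hpcc-1 CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=48388 Gres=gpu:2
  # to test: remove "Gres=gpu:2" (and GresTypes=gpu), then restart slurmctld/slurmd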
Paddy
On Tue, Jan 31, 2017 at 02:08:51PM -0800,
trivial questions: does the node have the correct time w.r.t. the head node? And is
the node correctly configured in slurm.conf? (# of CPUs, amount of memory, etc.)
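A quick manual way to check both, assuming ntpd and the node name seen in the
logs:

  # clock offset of this host relative to its NTP peers
  ntpq -p
  # what slurmctld currently believes about the node's CPUs/memory/state
  scontrol show node hpcc-1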
cheers
L.
--
The most dangerous phrase in the language is, "We've always done it this
way."
- Grace Hopper
On 1 February 2017 at 08:03, E V
Enabling debug5 doesn't show anything more useful. I don't see
anything relevant in slurmd.log, just job starts and stops.
slurmctld.log has the takeover output, with the backup head node
immediately draining itself the same as before, but with more of the
context before the DRAIN:
No epilog scripts are defined, and access to the save state is fine, as an
scontrol takeover works, but it does have the side effect of the backup
draining itself. I set SlurmctldDebug to debug3 and didn't get much
more info:
[2017-01-31T09:45:22.329] debug2: node_did_resp hpcc-1
[2017-01-31T09:45:22.329]
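As an aside, the controller's log level can also be raised on the fly,
without editing slurm.conf or restarting:

  # raise slurmctld verbosity temporarily (debug, debug2 ... debug5)
  scontrol setdebug debug3
  # and drop it back down afterwards
  scontrol setdebug info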
Hi E V,
You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf to get
it to be more verbose.
Do you have any epilog scripts defined?
If it's related to the node being the backup controller, as a wild guess,
perhaps your backup controller doesn't have access to the save state
location (StateSaveLocation)?
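A sketch of the relevant bits, with illustrative values, plus a quick sanity
check of the save-state path as seen from the backup controller:

  # slurm.conf - more verbose logging (illustrative levels)
  SlurmctldDebug=debug3
  SlurmdDebug=debug3
  # push the change out without a full restart
  scontrol reconfigure

  # on the backup controller: is the state directory reachable and writable?
  scontrol show config | grep -i StateSaveLocation
  ls -ld /var/spool/slurmctld   # path is hypothetical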