[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-02-01 Thread E V
Ah, success. It was gres-related. I verified that the slurm.conf files are the same, but I never verified the gres.conf. It looks like our production gres.conf had been copied to the backup controller, which had the same gres names but different hosts associated with them. Fixing that and restarting
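
A quick way to catch a divergence like this is to compare the gres.conf on both controllers and restart the daemon afterwards; the hostnames, paths, and the systemd unit below are only illustrative, not taken from this thread:

    # compare gres.conf between the primary and backup controllers
    ssh primary-ctl md5sum /etc/slurm/gres.conf
    ssh backup-ctl  md5sum /etc/slurm/gres.conf

    # typical gres.conf entry: the same gres name tied to the wrong
    # NodeName list is exactly the kind of mismatch described above
    NodeName=node[01-04] Name=gpu File=/dev/nvidia[0-1]

    # after fixing the file, restart the controller daemon
    systemctl restart slurmctld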

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-02-01 Thread E V
Yes, the head node & backup head sync to the same ntp server. Verifying by hand, they seem to be within 1 second of each other. Here's the node info it finds as it starts up in slurmd.log: [2017-01-31T15:31:59.711] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=48388 TmpDisk=508671 Uptime=1147426
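
One way to cross-check those detected values against the configuration (the node name is taken from later in the thread; the config path is an assumption):

    # what the node actually detects (run on the compute node)
    slurmd -C
    # prints something like:
    # NodeName=hpcc-1 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=48388

    # what slurm.conf claims for the same node
    grep -i '^NodeName=hpcc-1' /etc/slurm/slurm.conf

    # a slurm.conf value larger than what slurmd -C reports is a common
    # cause of the controller setting the node to DRAIN ("Low ...")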

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-02-01 Thread Paddy Doyle
Similar to Lachlan's suggestions: check that the slurm.conf is the same on all nodes, and in particular that the numbers of CPUs and cores are correct. Have you tried removing the Gres parameters? Perhaps it's looking for devices it can't find. Paddy On Tue, Jan 31, 2017 at 02:08:51PM -0800,
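
For what it's worth, the reason the controller recorded when it drained the node is usually the quickest pointer; the node name below is only an example:

    # show why the controller drained the node
    scontrol show node hpcc-1 | grep -iE 'State|Reason'

    # to test without gres, comment out the Gres= portion of the node's
    # NodeName line in slurm.conf (and any matching gres.conf entries),
    # push the file to all hosts, then
    scontrol reconfigure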

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-01-31 Thread Lachlan Musicman
Trivial questions: does the node have the correct time with respect to the head node? And is the node correctly configured in slurm.conf? (# of CPUs, amount of memory, etc.) cheers L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 1 February 2017 at 08:03, E V
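
A minimal version of those two checks might look like the following (hostnames are placeholders):

    # clock skew between the node and the head node
    ssh hpcc-1 date; ssh head date
    # or, if ntpd is running on both:
    ntpq -p

    # node definition the controller is actually using
    scontrol show node hpcc-1 | grep -iE 'CPUTot|RealMemory'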

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-01-31 Thread E V
Enabling debug5 doesn't show anything more useful. I don't see anything relevant in slurmd.log, just job starts and stops. slurmctld.log has the takeover output with the backup head node immediately draining itself, same as before, but with more of the context before the DRAIN:
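
The verbosity can also be raised on the running controller without editing slurm.conf, e.g. (log path assumed):

    # raise controller verbosity temporarily, reproduce the takeover,
    # then drop it back down
    scontrol setdebug debug5
    scontrol takeover                # run this on the backup controller
    tail -f /var/log/slurm/slurmctld.log
    scontrol setdebug info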

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-01-31 Thread E V
No epilog scripts defined, and access to the saved state is fine, as an scontrol takeover works, but it does have the side effect of the backup draining itself. I set SlurmctldDebug to debug3 and didn't get much more info: [2017-01-31T09:45:22.329] debug2: node_did_resp hpcc-1 [2017-01-31T09:45:22.329]
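
When the backup drains itself after a takeover, the recorded reason and a manual resume look roughly like this (the node name is illustrative; substitute the backup head's name):

    # see what reason was recorded when the node drained
    scontrol show node hpcc-1 | grep -i Reason

    # put it back into service once the cause is understood
    scontrol update NodeName=hpcc-1 State=RESUME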

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-01-30 Thread Paddy Doyle
Hi E V, You could turn up the SlurmctldDebug and SlurmdDebug values in slurm.conf to make them more verbose. Do you have any epilog scripts defined? If it's related to the node being the backup controller, as a wild guess, perhaps your backup controller doesn't have access to the
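
For reference, the relevant slurm.conf knobs and a quick access check from the backup controller might look like this (paths are assumptions, not from the thread):

    # in slurm.conf (propagate to all hosts, then run scontrol reconfigure)
    SlurmctldDebug=debug3
    SlurmdDebug=debug3
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdLogFile=/var/log/slurm/slurmd.log

    # from the backup controller, confirm it can reach the state directory
    scontrol show config | grep -i StateSaveLocation
    ls -ld /var/spool/slurm/ctld && touch /var/spool/slurm/ctld/.rwtest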