[slurm-dev] Re: KNL node down after reboot

2017-05-16 Thread Costin Caramarcu
Hi,
A few suggestions:
1) Try increasing the timeouts:
   SlurmctldTimeout=600
   SlurmdTimeout=600
   ResumeTimeout=600
2) Make sure that by the time Slurm starts, the node has finished mounting its file systems and the whole boot procedure is done.
Regards,
Costin
On Tue, May 16, 2017 at
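For reference, suggestion 1 might look roughly like this in slurm.conf. This is only a sketch: the three parameter names and the value 600 come from the message above, all three are given in seconds, and the right value depends on how long a KNL reboot actually takes at your site (an assumption you would need to measure).

```conf
# slurm.conf fragment (sketch; 600 is the value suggested in this thread, not a tuned default)
SlurmctldTimeout=600   # seconds the backup controller waits for the primary before taking over
SlurmdTimeout=600      # seconds the controller waits for slurmd to respond before marking the node DOWN
ResumeTimeout=600      # seconds allowed between a node resume request and the node becoming available
```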

[slurm-dev] problem starting jobs after reboot with knl generic

2017-03-08 Thread Costin Caramarcu
Hi,
I'm having trouble understanding why my jobs won't start after an MCDRAM change and reboot. Everything seems to work as expected:
- the reboot happens
- the MCDRAM change is done after reboot
- slurmd on the node comes up correctly without any errors
- the controller seems to