Hi,
A few suggestions:
1) Try increasing the timeouts:
SlurmctldTimeout=600
SlurmdTimeout=600
ResumeTimeout=600
2) Make sure that by the time Slurm starts on the node, the node has
finished mounting its file systems and the whole boot procedure is done.
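
For point 2, one possible approach is a systemd drop-in that delays slurmd until remote file systems and the network are up. This is only a sketch under assumptions not stated in this thread: that the node uses systemd, that the service is named slurmd.service, and that the file systems in question are covered by remote-fs.target. The drop-in file name is hypothetical.

```
# /etc/systemd/system/slurmd.service.d/wait-for-mounts.conf
# (hypothetical file name; assumes a systemd-managed slurmd.service)
[Unit]
# Start slurmd only after remote mounts and the network are ready.
After=remote-fs.target network-online.target
Wants=network-online.target
```

After adding the file, run "systemctl daemon-reload" so systemd picks up the change. The timeouts from point 1 would still go in slurm.conf.
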
Regards,
Costin
On Tue, May 16, 2017 at
Hi,
I'm having trouble understanding why my jobs won't start after an MCDRAM
change and reboot.
Everything seems to work as expected:
- the reboot happens
- the MCDRAM change is done after reboot
- slurmd on the node comes up correctly without any errors
- the controller seems to