[slurm-users] Compute node OS and firmware updates

Ole Holm Nielsen Thu, 06 Aug 2020 15:43:27 -0700

Regarding the question of methods for Slurm compute node OS and firmware updates, we have for a long time used rolling updates while the cluster is in full production, so that we do not waste any resources. When entire partitions are upgraded in this way, there is no risk of starting new jobs on nodes with differing states of OS and firmware, while running jobs continue on the not-yet-updated nodes.

The basic idea (which was provided by Niels Carl Hansen, ncwh -at- cscaa.dk) is to run a crontab script "update.sh" whenever a node is rebooted. Use scontrol to reboot the nodes as they become idle, thereby performing the updates that you want. Remove the crontab job as part of the update.sh script.
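
As a rough illustration of the idea above (this is not the actual update.sh from the repository linked below), the three steps might be sketched like this. The cron file path, the script path, the use of yum, and the helper function names are all assumptions made for this example. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical sketch of the rolling-update idea; NOT the real update.sh.
# All paths and function names below are assumptions for illustration.

DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print commands (default)
CRON_FILE="${CRON_FILE:-/etc/cron.d/node-update}"   # assumed location

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Step 1 (on each node): the cron.d entry that runs the script at boot.
# Takes the path of the update script as its argument.
cron_line() {
    echo "@reboot root /bin/bash $1"
}

# Step 2 (on the Slurm controller): ask Slurm to reboot the given nodes
# as soon as they become idle, and return them to service afterwards.
schedule_reboot() {
    run scontrol reboot ASAP nextstate=RESUME "$1"
}

# Step 3 (inside the update script, on the node after the reboot):
# apply the updates, then remove the cron entry so it runs only once.
node_update() {
    run yum -y update                   # assumption: a yum-based distribution
    run rm -f "$CRON_FILE"
}
```

For example, `schedule_reboot "node[001-010]"` with the default DRY_RUN=1 prints the scontrol command instead of executing it, which makes it easy to check the node list before committing to a reboot. The node list here is made up.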


The update.sh script and instructions for usage are in:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/nodes

Comments are welcome.

/Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
