Dear community,

While rolling out the latest fix/patch for munged, we restarted the updated 
munged locally on the compute nodes with "systemctl restart munged", which 
resulted in the sudden death of slurmd on a lot of compute nodes.


Checking the jobs on the affected nodes, we saw that a lot of user 
processes/jobs were still running, which was good - yet "systemctl restart 
slurmd" cancelled all of them, e.g.:

[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though they had survived the death of their 
parent slurmd) were killed and re-queued.
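
In case it helps anyone reproduce or compare: to see where the step 
processes actually live, something along these lines should work (a sketch 
only - the slurmstepd.scope path is what we would expect from the cgroup/v2 
plugin and may differ by Slurm version):

# Which cgroup does each slurmstepd sit in?
ps -e -o pid,ppid,cgroup,cmd | grep slurmstepd

# Walk the hierarchy under the system slice
systemd-cgls /system.slice

# Step PIDs should show up in a scope separate from slurmd.service,
# e.g. (path is an assumption for cgroup/v2 setups):
cat /sys/fs/cgroup/system.slice/slurmstepd.scope/cgroup.procs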


We run cgroup v2 only (no hybrid mode), with "Delegate=yes" in the slurmd 
unit and "ProctrackType=proctrack/cgroup" configured.


Other sites do not see the same behavior (their user jobs survive a slurmd 
restart without issues), so we are now at a loss figuring out why on earth 
this happens in our setup.


Has anyone run into similar problems and managed to solve them?


Thanks in advance -

--
___________________________
Christian Griebel/HPC


-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]