I think the service name is munge, not munged, although the binary is munged.
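A quick way to check which unit name your distro actually ships (a minimal sketch; packaging differs between distros, so both spellings below are just candidates to test, not givens):

  # List any munge-related units systemd knows about; the name shown
  # here is the one to pass to "systemctl restart".
  systemctl list-unit-files 'munge*'

  # Or query both candidate names directly; systemd resolves aliases,
  # and a "Unit ... could not be found" error rules a name out.
  systemctl status munge.service munged.service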
Or was your 'systemctl restart munged' a typo?

William

On Thu, 12 Feb 2026, 19:58 Griebel, Christian via slurm-users, <[email protected]> wrote:

> Dear community,
>
> Trying to implement the latest fix/patch for munged, we restarted the
> updated munged locally on the compute nodes with "systemctl restart
> munged", resulting in the sudden death of many compute nodes' slurmd
> daemons.
>
> Checking the jobs on the affected nodes, we saw a lot of user
> processes/jobs still running, which was good - yet "systemctl restart
> slurmd" cancelled all of them, e.g.
>
> [2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
> [2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
> [2026-02-12T17:08:00.325] slurmd version 25.05.5 started
>
> and all affected user jobs (even though they had survived the death of
> their parent slurmd) were killed and re-queued...
>
> We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit
> and "ProctrackType=proctrack/cgroup" configured.
>
> Other sites do not see the same behavior (their user jobs survive a
> slurmd restart without issues), so now we are at a loss figuring out
> why the h.... this happens within our setup.
>
> Anyone experienced similar problems and got them solved...?
>
> Thanks in advance -
>
> --
> ___________________________
> Christian Griebel/HPC
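For anyone comparing setups, the configuration Christian describes would look roughly like this (a sketch reconstructed from the options he names plus the systemd and Slurm documentation; file locations and the exact cgroup.conf contents vary by site and version):

  # /etc/systemd/system/slurmd.service.d/override.conf
  # Delegate=yes tells systemd to hand the slurmd cgroup subtree over
  # to slurmd, so systemd does not clean up job-step cgroups behind
  # Slurm's back.
  [Service]
  Delegate=yes

  # slurm.conf (relevant line only)
  ProctrackType=proctrack/cgroup

  # cgroup.conf -- pin the v2 plugin explicitly rather than relying on
  # autodetection (assumption: a Slurm version where this option exists)
  CgroupPlugin=cgroup/v2

With that in place, a slurmd restart should not touch running steps: each step belongs to its slurmstepd process, and a restarted slurmd is expected to reattach to those steps rather than clean them up as "stray" - which is exactly what the log above shows going wrong.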
