I think the service name is munge, not munged, although the binary is munged.
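A quick way to check which unit name your distro actually ships (a minimal sketch; packaging differs between distros, so both spellings below are just candidates to test, not givens):

  # List any munge-related units systemd knows about; the name shown
  # here is the one to pass to "systemctl restart".
  systemctl list-unit-files 'munge*'

  # Or query both candidate names directly; systemd resolves aliases,
  # and a "Unit ... could not be found" error rules a name out.
  systemctl status munge.service munged.service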
Or was your 'systemctl restart munged' a typo?

William

On Thu, 12 Feb 2026, 19:58 Griebel, Christian via slurm-users, <[email protected]> wrote:

> Dear community,
>
> Trying to implement the latest fix/patch for munged, we restarted the
> updated munged locally on the compute nodes with "systemctl restart
> munged", resulting in the sudden death of many compute nodes' slurmd
> daemons.
>
> Checking the jobs on the affected nodes, we saw a lot of user
> processes/jobs still running, which was good - yet "systemctl restart
> slurmd" cancelled all of them, e.g.
>
> [2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
> [2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
> [2026-02-12T17:08:00.325] slurmd version 25.05.5 started
>
> and all affected user jobs (even though they had survived the death of
> their parent slurmd) were killed and re-queued...
>
> We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit
> and "ProctrackType=proctrack/cgroup" configured.
>
> Other sites do not see the same behavior (their user jobs survive a
> slurmd restart without issues), so now we are at a loss figuring out
> why the h.... this happens within our setup.
>
> Anyone experienced similar problems and got them solved...?
>
> Thanks in advance -
>
> --
> ___________________________
> Christian Griebel/HPC
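For anyone comparing setups, the configuration Christian describes would look roughly like this (a sketch reconstructed from the options he names plus the systemd and Slurm documentation; file locations and the exact cgroup.conf contents vary by site and version):

  # /etc/systemd/system/slurmd.service.d/override.conf
  # Delegate=yes tells systemd to hand the slurmd cgroup subtree over
  # to slurmd, so systemd does not clean up job-step cgroups behind
  # Slurm's back.
  [Service]
  Delegate=yes

  # slurm.conf (relevant line only)
  ProctrackType=proctrack/cgroup

  # cgroup.conf -- pin the v2 plugin explicitly rather than relying on
  # autodetection (assumption: a Slurm version where this option exists)
  CgroupPlugin=cgroup/v2

With that in place, a slurmd restart should not touch running steps: each step belongs to its slurmstepd process, and a restarted slurmd is expected to reattach to those steps rather than clean them up as "stray" - which is exactly what the log above shows going wrong.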
