[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-11 Thread Christopher Samuel via slurm-users
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
> In our case, that node has been removed from the cluster and cannot be added back right now (it is being used for some other work). What can we do in such a case?
Mark the node as "DOWN" in Slurm; this is what we do when we get
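A minimal sketch of that approach, assuming the removed host is called "node01" (a placeholder, not a name from the thread):

    # Mark the absent node DOWN so slurmctld stops waiting for its epilog to finish
    scontrol update NodeName=node01 State=DOWN Reason="node removed from cluster"
    # Afterwards, check whether the stuck jobs have cleared
    squeue --states=COMPLETING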

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread archisman.pathak--- via slurm-users
In our case, that node has been removed from the cluster and cannot be added back right now (it is being used for some other work). What can we do in such a case?

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread archisman.pathak--- via slurm-users
Could you give more details on this and on how you debugged it?

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Cutts, Tim via slurm-users
We have Weka filesystems on one of our clusters and saw this. We discovered we had slightly misconfigured the Weka client, so Weka's and Slurm's cgroups were fighting with each other, and that appeared to cause the stuck jobs. Fixing the Weka cgroups config improved the problem, for
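One rough way to look for this kind of cgroup conflict is with generic Linux tooling (nothing here is Weka- or Slurm-specific, and 12345 is just a placeholder PID for a process belonging to a stuck job step):

    # See which cgroup a lingering job-step process has ended up in
    cat /proc/12345/cgroup
    # Browse the whole cgroup tree to spot processes that have been moved out of Slurm's cgroups
    systemd-cgls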

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Paul Edmon via slurm-users
Usually, to clear jobs like this, you have to reboot the node they are on. That will then force the scheduler to clear them. -Paul Edmon-
On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote:
> We are running a slurm cluster with version `slurm 22.05.8`. One of our users has reported
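A sketch of that sequence, again using "node01" as a stand-in for the affected node:

    # Find which nodes the stuck COMPLETING jobs are sitting on
    squeue --states=COMPLETING --format="%i %N"
    # Ask Slurm to reboot the node (this path requires RebootProgram to be configured
    # in slurm.conf), or reboot it out of band instead
    scontrol reboot node01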