Re: [slurm-users] strigger on CG, completing state

2019-05-31 Thread Chris Samuel
On Tuesday, 28 May 2019 9:03:16 AM PDT Matthew BETTINGER wrote:

> We use triggers for the obvious alerts, but is there a way to make a trigger
> for nodes stuck in CG (completing) state?  Some user jobs, mostly Julia
> notebooks, can get hung in completing state if the user kills the running job
> or cancels it with cntrl.  When this happens we can have many, many nodes
> stuck in CG.  Slurm 17.02.6.  Thanks!

Are you using cgroups to control/constrain jobs?

17.02 is very old; now that 19.05 is out, only it and 18.08 are getting updates.

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] strigger on CG, completing state

2019-05-29 Thread Matthew BETTINGER
OK, thanks, we will look into that!  We thought we were the only ones who had
the problem, and yes, it's like Windows 98SE: you can try all you want, but
eventually we end up rebooting the nodes.  Interns are starting to show up, and
you know they can bend a cluster in ways you've never seen before.  We will
investigate this, as it looks like a more proactive approach than walking in
in the morning and seeing hundreds of nodes stuck in CG because an intern
didn't tear down their Jupyter sessions in a sane way.








Re: [slurm-users] strigger on CG, completing state

2019-05-29 Thread Yair Yarom
Hi,

Check the UnkillableStepProgram and UnkillableStepTimeout options in
slurm.conf.
We use them to drain the stuck nodes and mail us, since here, too, stuck
processes usually require a reboot. Because the drained strigger never fires
in this situation, we also set a finished trigger on the next RUNNING job.
That trigger either sends us mail, if only stuck processes remain, or sets
strigger --fini on the next RUNNING job.
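
A minimal sketch of what a program set as UnkillableStepProgram could look
like, in Python. It is an illustration only, not the script described above:
the reason text, the "root" mail recipient, the use of a local `mail` command,
and the assumption that NodeName equals the short hostname are all hypothetical.
The only Slurm command it relies on is `scontrol update NodeName=... State=DRAIN
Reason=...`.

```python
#!/usr/bin/env python3
"""Hypothetical UnkillableStepProgram: drain this node and mail the admins."""
import shutil
import socket
import subprocess
import sys

ADMIN_ADDRESS = "root"  # assumption: a local mail alias that reaches the admins


def build_drain_command(node, reason):
    """Return the scontrol argv that drains `node` with the given reason."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def build_mail_command(node, recipient=ADMIN_ADDRESS):
    """Return the argv for a local `mail` command (assumed to be installed)."""
    return ["mail", "-s", f"Unkillable step on {node}", recipient]


def run_if_available(argv, **kwargs):
    """Run argv only if its executable exists, so the sketch fails softly."""
    if shutil.which(argv[0]) is None:
        print(f"skipping: {argv[0]} not found", file=sys.stderr)
        return
    subprocess.run(argv, check=False, **kwargs)


def main():
    # Assumption: the Slurm NodeName matches the short hostname.
    node = socket.gethostname().split(".")[0]
    reason = "unkillable step, reboot needed"
    run_if_available(build_drain_command(node, reason))
    run_if_available(build_mail_command(node),
                     input=f"{node} was drained: {reason}\n", text=True)


if __name__ == "__main__":
    main()
```

The script would be referenced from slurm.conf through UnkillableStepProgram,
with UnkillableStepTimeout setting how many seconds slurmd waits for the
processes to die before running it.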

Yair.



-- 

  /|   |
  \/   | Yair Yarom | Senior DevOps Architect
  []   | The Rachel and Selim Benin School
  [] /\| of Computer Science and Engineering
  []//\\/  | The Hebrew University of Jerusalem
  [//  \\  | T +972-2-5494522 | F +972-2-5494522
  //\  | ir...@cs.huji.ac.il
 //|


Re: [slurm-users] strigger on CG, completing state

2019-05-28 Thread mercan

Hi;

If you are not already using an epilog script, you can set one up to
clean up all leftovers from finished jobs:


https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
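
As a rough illustration of the idea (not the script from the page above), an
epilog could kill a user's remaining processes once that user has no other
running job on the node. The Python sketch below assumes the epilog environment
provides SLURM_JOB_USER, SLURM_JOB_ID and SLURMD_NODENAME; the helper names are
hypothetical, and a hard `pkill -KILL` is only one possible cleanup policy.

```python
#!/usr/bin/env python3
"""Hypothetical Epilog helper: kill a user's leftover processes on this node."""
import os
import shutil
import subprocess


def other_jobs_command(user, node):
    """squeue argv listing the IDs of the user's running jobs on this node."""
    return ["squeue", "-h", "-w", node, "-u", user, "-t", "R", "-o", "%A"]


def leftover_job_ids(squeue_output, finished_job_id):
    """Job IDs from squeue output, excluding the job that just finished."""
    return [job for job in squeue_output.split() if job != finished_job_id]


def main():
    user = os.environ.get("SLURM_JOB_USER")
    job_id = os.environ.get("SLURM_JOB_ID", "")
    node = os.environ.get("SLURMD_NODENAME")
    # Do nothing outside an epilog context, and never touch root's processes.
    if not user or not node or user == "root":
        return
    if shutil.which("squeue") is None:
        return
    result = subprocess.run(other_jobs_command(user, node),
                            capture_output=True, text=True, check=False)
    # Only clean up when the user has no other running job on this node.
    if not leftover_job_ids(result.stdout, job_id):
        subprocess.run(["pkill", "-KILL", "-u", user], check=False)


if __name__ == "__main__":
    main()
```

Checking for other running jobs first matters on nodes shared between jobs,
where killing by user name would otherwise take down unrelated work.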

Ahmet M.


On 28.05.2019 at 19:03, Matthew BETTINGER wrote:

> We use triggers for the obvious alerts, but is there a way to make a trigger
> for nodes stuck in CG (completing) state?  Some user jobs, mostly Julia
> notebooks, can get hung in completing state if the user kills the running job
> or cancels it with cntrl.  When this happens we can have many, many nodes
> stuck in CG.  Slurm 17.02.6.  Thanks!