[slurm-dev] Re: Requeue and resubmit after networking issue

Ryan Cox Mon, 19 May 2014 07:23:33 -0700


Teshome,


I think you want a high (or zero) SlurmdTimeout in slurm.conf:

       SlurmdTimeout

              The interval, in seconds, that the SLURM controller waits
              for slurmd to respond before configuring that node's state
              to DOWN. A value of zero indicates the node will not be
              tested by slurmctld to confirm the state of slurmd, the
              node will not be automatically set to a DOWN state
              indicating a non-responsive slurmd, and some other tool
              will take responsibility for monitoring the state of each
              compute node and its slurmd daemon. SLURM's hierarchical
              communication mechanism is used to ping the slurmd daemons
              in order to minimize system noise and overhead. The
              default value is 300 seconds. The value may not exceed
              65533 seconds.
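
As a rough sketch of what that could look like in slurm.conf (the 3600
below is only an illustrative value, not a recommendation; pick whatever
fits the length of the outages you expect, up to the 65533 limit):

```
# slurm.conf (fragment) -- illustrative values only

# Give slurmd up to an hour to respond before the controller marks
# the node DOWN (the default is 300 seconds).
SlurmdTimeout=3600

# Alternative: 0 turns the check off entirely, so nodes are never set
# DOWN for a non-responsive slurmd and some other tool has to monitor
# them instead.
#SlurmdTimeout=0
```

The new value only takes effect once the controller has reread its
configuration.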

Ryan

On 05/19/2014 06:49 AM, Teshome Dagne Mulugeta wrote:

Thank you Chris.

That means every user has to put --no-kill in their sbatch command. It would be
nice if there were an option in the Slurm configuration to enforce that.

Warm regards,
Teshome

________________________________________
From: Chris Samuel <[email protected]>
Sent: Monday, May 19, 2014 2:10 PM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue and resubmit after networking issue

On Mon, 19 May 2014 04:37:03 AM Teshome Dagne Mulugeta wrote:

Is there a way to keep running jobs going after a networking issue
between the slurm daemon and the nodes?

I suspect the answer is the --no-kill option for sbatch.

Best of luck!
Chris
--
  Christopher Samuel        Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: [email protected] Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/      http://twitter.com/vlsci


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
