Teshome,

I think you want a high (or zero) SlurmdTimeout. From the slurm.conf man page:

       SlurmdTimeout
              The interval, in seconds, that the SLURM controller waits
              for slurmd to respond before configuring that node’s state
              to DOWN. A value of zero indicates the node will not be
              tested by slurmctld to confirm the state of slurmd, the
              node will not be automatically set to a DOWN state
              indicating a non-responsive slurmd, and some other tool
              will take responsibility for monitoring the state of each
              compute node and its slurmd daemon. SLURM’s hierarchical
              communication mechanism is used to ping the slurmd daemons
              in order to minimize system noise and overhead. The default
              value is 300 seconds. The value may not exceed 65533
              seconds.
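
As a rough sketch of what that could look like in slurm.conf (the 3600-second value is only an illustrative assumption, not a recommendation; pick something that outlasts your typical network outage):

```conf
# slurm.conf (excerpt)

# Option A: wait up to an hour (illustrative value) before marking an
# unresponsive node DOWN. The default is 300; the maximum is 65533.
SlurmdTimeout=3600

# Option B: never mark a node DOWN because slurmd stopped responding.
# Some other tool then has to monitor the nodes and their slurmd daemons.
#SlurmdTimeout=0
```

Once slurmctld has picked up the change, "scontrol show config | grep SlurmdTimeout" should show the value it is actually using.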

Ryan

On 05/19/2014 06:49 AM, Teshome Dagne Mulugeta wrote:
Thank you Chris.

That means every user has to put --no-kill in their sbatch command. It would be
nice if there were an option in the Slurm configuration to enforce that.

Warm regards,
Teshome

________________________________________
From: Chris Samuel <[email protected]>
Sent: Monday, May 19, 2014 2:10 PM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue and resubmit after networking issue

On Mon, 19 May 2014 04:37:03 AM Teshome Dagne Mulugeta wrote:

Is there a way to keep running jobs going after a networking issue
between the Slurm daemon and the nodes?
I suspect the answer is the --no-kill option for sbatch.

Best of luck!
Chris
--
  Christopher Samuel        Senior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: [email protected] Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/      http://twitter.com/vlsci

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
