Teshome,
I think you want a high (or zero) SlurmdTimeout. From the slurm.conf man page:
SlurmdTimeout
    The interval, in seconds, that the SLURM controller waits for
    slurmd to respond before configuring that node’s state to DOWN.
    A value of zero indicates the node will not be tested by
    slurmctld to confirm the state of slurmd, the node will not be
    automatically set to a DOWN state indicating a non-responsive
    slurmd, and some other tool will take responsibility for
    monitoring the state of each compute node and its slurmd daemon.
    SLURM’s hierarchical communication mechanism is used to ping the
    slurmd daemons in order to minimize system noise and overhead.
    The default value is 300 seconds. The value may not exceed
    65533 seconds.
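To make the three cases concrete (default, high, zero), here is a rough Python sketch of the behaviour described in that excerpt. It is only a toy model, not Slurm code: the names would_mark_down and silent_seconds are made up for illustration, and the exact comparison the controller uses is an assumption on my part.

```python
# Illustrative model only; this is not Slurm source code.
DEFAULT_SLURMD_TIMEOUT = 300   # default quoted in the man page text, in seconds
MAX_SLURMD_TIMEOUT = 65533     # upper limit quoted in the man page text, in seconds


def would_mark_down(silent_seconds, slurmd_timeout=DEFAULT_SLURMD_TIMEOUT):
    """Return True if, in this simplified model, the node would be set DOWN."""
    if not 0 <= slurmd_timeout <= MAX_SLURMD_TIMEOUT:
        raise ValueError("SlurmdTimeout must be between 0 and 65533 seconds")
    if slurmd_timeout == 0:
        # Zero disables the check: the controller never marks the node
        # DOWN because of a non-responsive slurmd.
        return False
    # Assumption: the node goes DOWN once the silence exceeds the timeout.
    return silent_seconds > slurmd_timeout


# A 10-minute network outage under three settings.
outage = 600
print(would_mark_down(outage))         # default of 300 s -> True
print(would_mark_down(outage, 3600))   # high timeout -> False
print(would_mark_down(outage, 0))      # check disabled -> False
```

In slurm.conf terms, the last two cases would correspond to lines such as SlurmdTimeout=3600 (an example value, pick one that covers your longest expected outage) and SlurmdTimeout=0.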
Ryan
On 05/19/2014 06:49 AM, Teshome Dagne Mulugeta wrote:
Thank you, Chris.
That means every user has to put --no-kill in their sbatch command. It would be
nice if there were an option in the Slurm configuration to implement that.
Warm regards,
Teshome
________________________________________
From: Chris Samuel <[email protected]>
Sent: Monday, May 19, 2014 2:10 PM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue and resubmit after networking issue
On Mon, 19 May 2014 04:37:03 AM Teshome Dagne Mulugeta wrote:
Is there a way to keep running jobs going after a networking issue
between the Slurm daemon and the nodes?
I suspect the answer is the --no-kill option for sbatch.
Best of luck!
Chris
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected]   Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/  http://twitter.com/vlsci
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University