Hi Ryan,

That is the one. Thank you so much :-)

Warm regards,
Teshome

________________________________________
From: Ryan Cox <[email protected]>
Sent: Monday, May 19, 2014 4:23 PM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue and resubmit after networking issue

Teshome,

I think you want a high (or zero) SlurmdTimeout:

SlurmdTimeout
    The interval, in seconds, that the SLURM controller waits for slurmd
    to respond before configuring that node’s state to DOWN. A value of
    zero indicates the node will not be tested by slurmctld to confirm
    the state of slurmd, the node will not be automatically set to a
    DOWN state indicating a non-responsive slurmd, and some other tool
    will take responsibility for monitoring the state of each compute
    node and its slurmd daemon. SLURM’s hierarchical communication
    mechanism is used to ping the slurmd daemons in order to minimize
    system noise and overhead. The default value is 300 seconds. The
    value may not exceed 65533 seconds.

Ryan

On 05/19/2014 06:49 AM, Teshome Dagne Mulugeta wrote:
> Thank you Chris.
>
> That means every user has to put --no-kill in their sbatch command. It would
> be nice if there are options in slurm configuration to implement that.
>
> Warm regards,
> Teshome
>
> ________________________________________
> From: Chris Samuel <[email protected]>
> Sent: Monday, May 19, 2014 2:10 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: Requeue and resubmit after networking issue
>
> On Mon, 19 May 2014 04:37:03 AM Teshome Dagne Mulugeta wrote:
>
>> Is there a way to keep the running jobs continue after a networking issue
>> between slurm daemon and nodes?
>
> I suspect the answer is the --no-kill option for sbatch.
>
> Best of luck!
> Chris
> --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: [email protected] Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
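
For reference, the SlurmdTimeout suggestion above could be expressed in slurm.conf roughly as follows. This is only a sketch: the 3600-second value is an illustrative assumption, not something recommended in the thread, and the right number depends on how long a network outage the site wants to tolerate. Per the man page text quoted above, the value must stay at or below 65533, and 0 turns the check off entirely.

```
# slurm.conf (fragment). The value below is an assumed example,
# not a recommendation from this thread.

# Wait up to one hour for slurmd to respond before the controller
# sets the node's state to DOWN (the default is 300 seconds).
SlurmdTimeout=3600

# Alternative: 0 means slurmctld never marks a node DOWN for a
# non-responsive slurmd, so some other tool has to monitor the nodes.
#SlurmdTimeout=0
```

Unlike --no-kill, which each user passes to sbatch per job, this is a cluster-wide setting, which is what was asked for earlier in the thread. The controller has to reread its configuration before the new value applies.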
