[slurm-dev] Re: Requeue and resubmit after networking issue

Teshome Dagne Mulugeta Tue, 20 May 2014 01:25:33 -0700

Hi Ryan,

That is the one. Thank you so much :-)


Warm regards,
Teshome

________________________________________
From: Ryan Cox <[email protected]>
Sent: Monday, May 19, 2014 4:23 PM
To: slurm-dev
Subject: [slurm-dev] Re: Requeue and resubmit after networking issue

Teshome,

I think you want a high (or zero) SlurmdTimeout in slurm.conf. From the man page:

        SlurmdTimeout
               The interval, in seconds, that the SLURM controller waits for
               slurmd to respond before configuring that node’s state to DOWN.
               A value of zero indicates the node will not be tested by
               slurmctld to confirm the state of slurmd, the node will not be
               automatically set to a DOWN state indicating a non-responsive
               slurmd, and some other tool will take responsibility for
               monitoring the state of each compute node and its slurmd daemon.
               SLURM’s hierarchical communication mechanism is used to ping the
               slurmd daemons in order to minimize system noise and overhead.
               The default value is 300 seconds. The value may not exceed
               65533 seconds.
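
A minimal sketch of how that might look in slurm.conf, assuming the rest of the configuration stays as it is. The 3600 below is only an illustrative value, not a recommendation; pick something longer than the network outages you expect, or use 0 only if another tool monitors the nodes.

```
# slurm.conf (fragment)
# Illustrative value (assumption): wait up to one hour for slurmd to
# respond before the controller marks the node DOWN. Default is 300.
SlurmdTimeout=3600

# Alternative: never mark nodes DOWN because of an unresponsive slurmd.
# Only sensible if some other tool watches the nodes.
#SlurmdTimeout=0
```

The daemons have to pick up the change afterwards, typically with "scontrol reconfigure"; check the slurm.conf man page for your version.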

Ryan

On 05/19/2014 06:49 AM, Teshome Dagne Mulugeta wrote:
> Thank you Chris.
>
> That means every user has to put --no-kill in their sbatch command. It would
> be nice if there were an option in the slurm configuration to implement that.
>
> Warm regards,
> Teshome
>
> ________________________________________
> From: Chris Samuel <[email protected]>
> Sent: Monday, May 19, 2014 2:10 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: Requeue and resubmit after networking issue
>
> On Mon, 19 May 2014 04:37:03 AM Teshome Dagne Mulugeta wrote:
>
>> Is there a way to keep running jobs going after a networking issue
>> between the slurm daemon and the nodes?
> I suspect the answer is the --no-kill option for sbatch.
>
> Best of luck!
> Chris
> --
>   Christopher Samuel        Senior Systems Administrator
>   VLSCI - Victorian Life Sciences Computation Initiative
>   Email: [email protected] Phone: +61 (0)3 903 55545
>   http://www.vlsci.org.au/      http://twitter.com/vlsci

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
