Yeah, we ran into this ourselves.  SlurmdTimeout is the relevant timer: it
controls how long slurmctld waits for an unresponsive slurmd before marking
the node DOWN.  I'm planning to set it to 0 or <bignum> whenever we do
maintenance.
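
For what it's worth, here is a rough sketch of how that change could be scripted before a maintenance window. It is only an illustration: the helper name set_slurmd_timeout is made up for this example, and it assumes slurm.conf holds the setting as a plain "SlurmdTimeout=<seconds>" line (parameter names are matched case-insensitively, commented-out lines are left alone). As I read the slurm.conf man page, a value of 0 means slurmctld will not mark a node DOWN for a non-responsive slurmd.

```python
import re

# Hypothetical helper (not part of Slurm): rewrite the SlurmdTimeout line
# in the text of a slurm.conf file, or append one if it is missing.
_TIMEOUT_RE = re.compile(
    r"^[ \t]*SlurmdTimeout[ \t]*=.*$",
    re.IGNORECASE | re.MULTILINE,
)


def set_slurmd_timeout(conf_text: str, seconds: int) -> str:
    """Return conf_text with SlurmdTimeout set to `seconds`.

    Lines starting with '#' are not touched. Note that a trailing
    comment on an existing SlurmdTimeout line is dropped.
    """
    line = f"SlurmdTimeout={seconds}"
    if _TIMEOUT_RE.search(conf_text):
        return _TIMEOUT_RE.sub(line, conf_text)
    # No active setting found: append one, keeping a final newline.
    sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return conf_text + sep + line + "\n"


if __name__ == "__main__":
    sample = "ClusterName=test\nSlurmdTimeout=300\nNodeName=n1\n"
    print(set_slurmd_timeout(sample, 0), end="")
```

The function only transforms text; you would still need to write the result back to slurm.conf on the controller and have slurmctld re-read it (I'd expect `scontrol reconfigure` to do that, but check on your version), then restore the old value after maintenance.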

Here's the thread:

http://comments.gmane.org/gmane.comp.distributed.slurm.devel/2353

Hope this helps.... Cheerio

Michael


On Thu, Jan 24, 2013 at 9:00 AM, Marcin Stolarek
<[email protected]> wrote:

>  Hi all,
>
> Today we experienced some unpleasant behavior from our Slurm installation.
> We were trying to configure gres and unfortunately put the same
> gres.conf file on all nodes, which (in our heterogeneous cluster) caused
> problems when reconfiguring slurmd on the nodes. After `scontrol reconfigure`
> we had plenty of down* nodes, and the jobs on these nodes were killed.
>
> I've looked through man slurm.conf but haven't found a way to change
> this behaviour. I think it would be very nice if Slurm could recognize that
> these tasks are still running and keep their state as R, maybe for a
> configurable period of time.
> Is this possible now? Do you have any suggestions for how we can prevent
> such problems in the future?
>
> cheers,
> marcin
>



-- 
Hey! Somebody punched the foley guy!
   - Crow, MST3K ep. 508
