Yeah, ran into this ourselves. SlurmdTimeout is the relevant timer. I'm planning to set that to 0 or <bignum> whenever we do maintenance.
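
Roughly what I have in mind, as a sketch of the relevant slurm.conf line. The values are illustrative assumptions, not something we have tested here, so check `man slurm.conf` for the limits on your version:

```
# slurm.conf (fragment)
# SlurmdTimeout: seconds slurmctld waits for a slurmd to respond
# before setting that node DOWN. The default is 300.

# Option 1: 0 means slurmctld never sets a node DOWN because of an
# unresponsive slurmd.
#SlurmdTimeout=0

# Option 2: a large value covering the maintenance window
# (36000 is an arbitrary example, pick your own).
SlurmdTimeout=36000
```

Either way, the normal value should go back in once maintenance is over, since with 0 nothing in Slurm will flag a slurmd that has really died.
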
Here's the thread: http://comments.gmane.org/gmane.comp.distributed.slurm.devel/2353

Hope this helps.

Cheerio
Michael

On Thu, Jan 24, 2013 at 9:00 AM, Marcin Stolarek <[email protected]> wrote:
> Hi all,
>
> Today we experienced some unpleasant behavior from our slurm installation.
> We were trying to configure gres and unfortunately we put the same
> gres.conf file on all nodes, which (in our heterogeneous cluster) caused
> problems with reconfiguring slurmd on the nodes. After `scontrol reconfigure`
> we had plenty of down* nodes, and the jobs on these nodes were killed.
>
> I've looked into man slurm.conf but I haven't found a way to change
> this behaviour. I think it would be very nice if slurm could recognize that
> these tasks are still running and keep their state as R, maybe for a
> configured period of time.
> Is this possible now? Do you have any suggestions for how we can prevent
> such problems in the future?
>
> cheers,
> marcin

--
Hey! Somebody punched the foley guy! - Crow, MST3K ep. 508
