[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-14 Thread Bjørn-Helge Mevik
Thanks. Nice to know! -- Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Robbert Eggermont
On 13-10-15 11:26, Bjørn-Helge Mevik wrote: Restarting the slurmd daemons and/or the slurmctld daemon should in general not kill jobs. But if you change things in slurm.conf such that the format of the slurm state files changes, then restarting slurmctld might result in all jobs being killed.

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Ryan Cox
That particular problem is now fixed: http://bugs.schedmd.com/show_bug.cgi?id=587 Ryan On 10/13/2015 03:26 AM, Bjørn-Helge Mevik wrote: Restarting the slurmd daemons and/or the slurmctld daemon should in general not kill jobs. But if you change things in slurm.conf such that the format of t

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Bjørn-Helge Mevik
Restarting the slurmd daemons and/or the slurmctld daemon should in general not kill jobs. But if you change things in slurm.conf such that the format of the slurm state files changes, then restarting slurmctld might result in all jobs being killed. We did this once a couple of years ago when we

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-12 Thread Antony Cleave
I've only ever had this happen once, but it's Murphy's law that it happened not on the test system but on the production system, and I was just a minute or so too slow finding the error. Antony On 12 Oct 2015 18:25, "Paul Edmon" wrote: > > I've had this happen several times, but have never los

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-12 Thread Paul Edmon
I've had this happen several times, but have never lost jobs due to it. Still, one should always watch the logs on the master when restarting so you can catch typos immediately. We run a sanity check on our confs before we push them (we use Puppet for configuration control). Our post commi
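
A pre-push sanity check of the kind mentioned above could look roughly like the following. This is only a minimal sketch, not the check actually used at any of the sites in this thread. It assumes the file consists of whitespace-separated Key=Value tokens with "#" starting a comment, and it does not handle quoted values, escaped "#" characters, or line continuations. The names check_conf_text and check_conf_file are hypothetical.

```python
import re

# One whitespace-delimited token of the form Key=Value (assumed syntax).
TOKEN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=\S+$")


def check_conf_text(text):
    """Return a list of (line_number, message), at most one per line."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        # Drop everything after "#" (assumption: no escaped "#" in values).
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and "Include <file>" directives.
        if not line or line.lower().startswith("include "):
            continue
        for token in line.split():
            if not TOKEN.match(token):
                problems.append((number, f"not a Key=Value pair: {token!r}"))
                break
            value = token.split("=", 1)[1]
            # A leading, trailing, or doubled comma leaves an empty list item.
            if value.startswith(",") or value.endswith(",") or ",," in value:
                problems.append((number, f"stray comma in {token!r}"))
                break
    return problems


def check_conf_file(path):
    """Run the same check on a file on disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return check_conf_text(handle.read())
```

Calling check_conf_file on slurm.conf and topology.conf from a commit hook, and refusing the push when the returned list is non-empty, would catch the simplest typos before a daemon ever reads the file. It is a syntax check only and knows nothing about which keys or values Slurm actually accepts.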

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-12 Thread Antony Cleave
While this is true, be very, very careful when restarting the slurmd on the controller node. It's quite easy to miss a typo in one of the config files, e.g. an unexpected comma in topology.conf, which can cause Slurm to segfault or otherwise shut down uncleanly. If this happens then the state of

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-12 Thread Paul Edmon
You should be able to do this without losing any jobs (at least I've never lost any on any version of Slurm I have run). I do it all the time in our environment (about once a day), as our slurm.conf is in flux quite a bit. It should always preserve the running and pending state. The only i