Thanks. Nice to know!
--
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
On 13-10-15 11:26, Bjørn-Helge Mevik wrote:
Restarting the slurmd daemons and/or the slurmctld daemon should in
general not kill jobs.
But if you change things in slurm.conf such that the format of the slurm
state files changes, then restarting slurmctld might result in all jobs
being killed.
That particular problem is now fixed:
http://bugs.schedmd.com/show_bug.cgi?id=587
Ryan
On 10/13/2015 03:26 AM, Bjørn-Helge Mevik wrote:
Restarting the slurmd daemons and/or the slurmctld daemon should in
general not kill jobs.
But if you change things in slurm.conf such that the format of t
Restarting the slurmd daemons and/or the slurmctld daemon should in
general not kill jobs.
But if you change things in slurm.conf such that the format of the slurm
state files changes, then restarting slurmctld might result in all jobs
being killed. We did this once a couple of years ago when we
I've only ever had this happen once, but it's Murphy's law that it didn't
happen on the test system but on the production system, and I was just a
minute or so too slow in finding the error.
Antony
On 12 Oct 2015 18:25, "Paul Edmon" wrote:
>
> I've had this happen several times, but have never los
I've had this happen several times, but have never lost jobs due to it.
Still one should always watch the logs on the master when restarting so
you can catch typos immediately.
We run a sanity check on our config files before we push them (we use Puppet
for configuration control). Our post commi
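
[Editor's sketch] The actual sanity check referred to above is not shown in the thread. As a rough illustration only, a pre-push check for the kind of typo discussed here (a stray comma in topology.conf) might look like the following. The function name check_lines is hypothetical, and the sketch assumes that each non-comment line is a series of whitespace-separated Key=Value tokens; it is not a full validator for Slurm's configuration syntax.

```python
import re

# Assumption: every non-comment line consists of whitespace-separated
# Key=Value tokens, e.g. "SwitchName=s0 Nodes=tux[0-3]".
TOKEN = re.compile(r"^[A-Za-z]+=\S+$")


def check_lines(lines):
    """Return a list of (line_number, message) for suspicious lines."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        # Drop anything after '#' (comment) and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        for token in line.split():
            if not TOKEN.match(token):
                problems.append((number, f"not a Key=Value token: {token!r}"))
            elif token.endswith(",") or ",," in token or "=," in token:
                # Commas inside a value list are fine; leading, trailing,
                # or doubled commas are the typos this check looks for.
                problems.append((number, f"stray comma in {token!r}"))
    return problems
```

A commit hook could read the file, pass its lines to check_lines, and refuse the push if the returned list is non-empty. It would not replace watching the slurmctld log during the restart.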
While this is true, be very, very careful when restarting the slurmd on
the controller node.
It's quite easy to miss a typo in one of the config files, e.g. an
unexpected comma in topology.conf, which can cause Slurm to segfault or
otherwise shut down uncleanly. If this happens then the state of
You should be able to do this without losing any jobs (at least I've
never lost any on any version of Slurm I have run). I do it all the
time in our environment (about once a day) as our slurm.conf is in flux
quite a bit. It should always preserve the running and pending state.
The only i
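
[Editor's sketch] One way to confirm that running and pending jobs survived a restart is to compare the job list before and after. The following is a minimal illustration, not something posted in the thread. The helper names are hypothetical, and it assumes squeue is on the PATH and is called with -h (no header) and -o "%i %T" (job ID and state).

```python
import subprocess


def parse_squeue(text):
    """Map job ID -> state from lines of the form '<jobid> <STATE>'."""
    jobs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            jobs[parts[0]] = parts[1]
    return jobs


def snapshot():
    """Take a snapshot of the queue (assumes squeue is installed)."""
    result = subprocess.run(
        ["squeue", "-h", "-o", "%i %T"],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue(result.stdout)


def missing_jobs(before, after):
    """Job IDs present before the restart but absent afterwards."""
    return sorted(set(before) - set(after))
```

Take one snapshot before restarting the daemon and one afterwards, then pass both to missing_jobs. Jobs that simply finished in between will also show up as missing, so each reported ID still needs a look in the accounting records or logs.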