I don't know what I did, but I am now able to restart slurmctld without breaking things.
From the utils server, I push the new conf to all nodes. On the head node I run:

    systemctl restart slurmctld; scontrol reconfigure

And then I'm done and I still have my queue. Thanks all.

------
The most dangerous phrase in the language is, "We've always done it this way."

- Grace Hopper


On 17 June 2016 at 03:28, Nicholas McCollum <nmccol...@asc.edu> wrote:

> I don't know if this is your exact circumstance, but I have found in the
> past that if you have a primary and a backup controller and your
> StateSaveLocation isn't readable and writable by both, the backup
> controller will take over whenever the primary fails, and it will appear
> that you have a completely blank cluster. No jobs, no reservations,
> nothing. If you submit a job, it starts back at FirstJobId.
>
> If you have a primary/backup and this is the case... sometimes when
> updating slurm.conf and doing an 'scontrol reconfigure', if everything
> isn't 100% correct the slurmctld will crash. If there is an error, an
> easy way to find it is to run 'slurmctld -Dvvvvv'; it will fail and tell
> you what the issue is.
>
> Hopefully this helps.
>
> -------------------
> Nicholas McCollum
> HPC Systems Administrator
> Alabama Supercomputer Authority
>
>
> On Wed, 15 Jun 2016, Lachlan Musicman wrote:
>
>> Hi,
>>
>> I would like some clarification on updating slurm.conf.
>>
>> As we discover things that need to be added or changed, we update a
>> central slurm.conf and distribute it to all nodes, AllocNodes and head
>> nodes via ansible. This works a treat.
>>
>> Next, we would like our new slurm.conf applied without losing any jobs
>> in the queue.
>>
>> I have googled.
>>
>> I have read this thread:
>> https://groups.google.com/d/topic/slurm-devel/xLzTBkcCiuc/discussion
>>
>> I have read the upgrade instructions in Quickstart:
>> http://slurm.schedmd.com/quickstart_admin.html#upgrade
>>
>> Despite this, when executing those instructions, we lose our whole queue.
>>
>> I have a /var/spool/slurmd which is listed in the conf, writable by the
>> slurm user, and filled with files that look like they belong there.
>>
>> So, I have a number of queries.
>>
>> While I understand that upgrading Slurm is very similar to updating
>> slurm.conf, they aren't identical: one requires that slurmd be stopped
>> on all nodes, the other only that scontrol reconfigure be run.
>>
>> slurmctld always needs to be restarted, is that correct?
>>
>> Is the order listed in quickstart#upgrade the same for a simple
>> reconfigure?
>>
>> (Note that this is the order I've been following, and it clears the
>> queue, so I'm expecting the answer to be "no". Email unit testing.)
>>
>> (Note 2: SlurmctldTimeout=120, SlurmdTimeout=300, and everything
>> happens within those timeframes.)
>>
>> What am I doing wrong?
>>
>> Also: instead of pointing me to one of the already-read docs, can
>> someone please explicitly step through the steps they take when they
>> update their slurm.conf without clearing the queue?
>>
>> Cheers
>> L.
>>
>> ------
>> The most dangerous phrase in the language is, "We've always done it this
>> way."
>>
>> - Grace Hopper
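
To put the whole sequence in one place, here is a rough sketch of what works
for me. The ansible ad-hoc command, the inventory group "all", and the paths
are placeholders only, adjust them for your own site:

    # From the utils server: push the updated slurm.conf to every node
    # (an ad-hoc stand-in for whatever play or playbook you normally use)
    ansible all -m copy -a "src=/srv/config/slurm.conf dest=/etc/slurm/slurm.conf"

    # On the head node: restart the controller, then tell all daemons to
    # re-read slurm.conf. The queue survives the restart because slurmctld
    # reloads pending and running job state from StateSaveLocation.
    systemctl restart slurmctld
    scontrol reconfigure

    # If slurmctld fails to come back up, run it in the foreground with
    # verbose logging (as Nicholas suggests) to see the error, then Ctrl-C:
    #   slurmctld -Dvvvvv

And for the primary/backup situation Nicholas describes, a hypothetical
slurm.conf excerpt (hostnames and path are made up) showing the constraint
that matters:

    # Both controllers must be able to read AND write StateSaveLocation,
    # i.e. it should live on shared storage; otherwise the backup takes
    # over with an empty state and the cluster looks blank.
    ControlMachine=head1
    BackupController=head2
    StateSaveLocation=/shared/slurm/state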