I don't know what I did, but I am now able to restart slurmctld without
breaking things.

From the utils server, I push the new conf to all nodes.
On the head node I run: systemctl restart slurmctld; scontrol reconfigure

And then I'm done and I still have my queue.
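
For anyone who finds this thread later, the whole sequence looks roughly
like this (the ansible host pattern and file paths are placeholders for
our setup, not canonical names):

  # From the utils server: push the updated slurm.conf to every node
  ansible slurm_cluster -b -m copy \
    -a "src=slurm.conf dest=/etc/slurm/slurm.conf"

  # On the head node: restart the controller, then have every slurmd
  # re-read the new config
  systemctl restart slurmctld
  scontrol reconfigure

The queue survives because slurmctld reloads its saved job state from
StateSaveLocation when it starts back up.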

Thanks all.


------
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper

On 17 June 2016 at 03:28, Nicholas McCollum <nmccol...@asc.edu> wrote:

> I don't know if this is your exact circumstance, but I have found in the
> past that if you have a primary and backup controller and your
> StateSaveLocation isn't readable/writable by both, the secondary controller
> will take over whenever the primary fails, and it will appear that you have
> a completely blank cluster.
> No jobs, no reservations, nothing.  If you
> submit a job, its ID starts back at FirstJobId.
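>
> A quick way to sanity-check this (the path is just an example; use
> whatever StateSaveLocation is actually set to in your slurm.conf):
>
>   # What does the controller think StateSaveLocation is?
>   scontrol show config | grep StateSaveLocation
>
>   # On BOTH controllers, confirm the slurm user can write there
>   sudo -u slurm touch /var/spool/slurmctld/.writetest && echo writable
>   sudo -u slurm rm -f /var/spool/slurmctld/.writetest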
>
> If you have a primary/backup and this is the case, then sometimes when you
> update slurm.conf and do an 'scontrol reconfigure', slurmctld will crash if
> everything isn't 100% correct.  If there is an error, an easy way to find it
> is to run 'slurmctld -Dvvvvv'; it will fail and tell you what the issue is.
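>
> For example, on the controller (a rough sketch; -D keeps slurmctld in
> the foreground, and each extra v bumps the verbosity):
>
>   systemctl stop slurmctld
>   sudo -u slurm slurmctld -Dvvvvv
>   # ...read the error it exits with, fix slurm.conf, then:
>   systemctl start slurmctld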
>
> Hopefully this helps.
>
> -------------------
> Nicholas McCollum
> HPC Systems Administrator
> Alabama Supercomputer Authority
>
>
> On Wed, 15 Jun 2016, Lachlan Musicman wrote:
>
>> Hi,
>>
>> I would like some clarification on upgrading slurm.conf.
>>
>> As we discover things needing to be added or changed, we update a central
>> slurm.conf and distribute it to all nodes, AllocNodes and head nodes via
>> ansible. This works a treat.
>>
>> Next, we would like to have our new slurm.conf applied without losing any
>> jobs in the queue.
>>
>> I have googled.
>>
>> I have read this thread:
>> https://groups.google.com/d/topic/slurm-devel/xLzTBkcCiuc/discussion
>>
>> I have read the upgrade instructions in Quickstart:
>> http://slurm.schedmd.com/quickstart_admin.html#upgrade
>>
>> Despite this, when executing those instructions, we lose our entire queue.
>>
>> I have a /var/spool/slurmd which is in the conf, writable by the slurm
>> user, and filled with files that look like they belong there.
>>
>> So, I have a number of queries.
>>
>> While I understand that upgrading slurm is very similar to updating the
>> slurm.conf, they aren't identical.
>>
>> One requires that slurmd be stopped on all nodes; the other only that
>> scontrol reconfigure be run.
>>
>> slurmctld always needs to be restarted, is that correct?
>>
>> Is the order as listed in the quickstart#upgrade the same for a simple
>> reconfigure?
>>
>> (Note that this is the order I've been following, and it clears the queue.
>> So I'm expecting the answer to be "no". Email Unit testing).
>>
>> (Note 2: SlurmctldTimeout=120, SlurmdTimeout=300; everything is happening
>> within these timeframes.)
>>
>> What am I doing wrong?
>>
>> Also: instead of pointing me to one of the already read docs, can someone
>> please explicitly step through the steps they take when they update their
>> slurm.conf without clearing the queue?
>>
>> Cheers
>> L.
>>
>>
>> ------
>> The most dangerous phrase in the language is, "We've always done it this
>> way."
>>
>> - Grace Hopper
>>
