I upgraded SLURM from 15.08 to 16.05 without draining the nodes and without losing any jobs. This was my procedure:
I increased the timeouts in slurm.conf:

  SlurmctldTimeout=3600
  SlurmdTimeout=3600

I did a mysqldump of the slurm database and copied the slurm state
directory (just in case), and I increased innodb_buffer_pool_size in
my.cnf to 128M. Then I followed the instructions on the Slurm upgrade
page:

1. Shut down the slurmdbd daemon
2. Upgrade the slurmdbd daemon
3. Restart the slurmdbd daemon
4. Shut down the slurmctld daemon(s)
5. Shut down the slurmd daemons on the compute nodes
6. Upgrade the slurmctld and slurmd daemons
7. Restart the slurmd daemons on the compute nodes
8. Restart the slurmctld daemon(s)

For me it worked.

Cheers,
Barbara

On 08/17/2016 03:38 PM, Ole Holm Nielsen wrote:
>
>
> On 08/03/2016 03:04 AM, Christopher Samuel wrote:
>> So you always go in the order of upgrading:
>>
>> * slurmdbd
>> * slurmctld
>>   [recompile all plugins, MPI stacks, etc. that link against Slurm]
>> * slurmd
>>
>> We use a health check script that defines the version of Slurm that
>> is considered production, so we can just bump that number first, wait
>> for all the compute nodes to be marked as drained, and then, as nodes
>> become idle, we can start restarting slurmd, knowing that we will
>> never get a job that spans both old and new slurmd's.
>
> Obviously, upgrading slurmd's which are running jobs is quite tricky!
> I have some questions:
>
> 1. Can't you replace the health check with a global scontrol like this?
>    scontrol update NodeName=<nodelist> State=drain Reason="Upgrading slurmd"
>
> 2. Do you really have to wait for *all* nodes to become drained before
>    starting to upgrade? This could take weeks!
>
> 3. Is it OK to upgrade subsets of nodes after they become drained?
>
> 4. I assume that upgraded nodes can be returned to the IDLE state by:
>    scontrol update NodeName=<nodelist> State=resume
>
> Could you possibly elaborate on the steps which you described?
>
> FYI, I'm trying to capture this advice in my Wiki:
> https://wiki.fysik.dtu.dk/niflheim/SLURM#upgrading-on-centos-7
>
> Thanks,
> Ole
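For the archive, the procedure at the top of this thread could be sketched as a shell script. This is an editor's sketch, not Barbara's exact commands: the systemd unit names, the yum-based upgrade method, the pdsh fan-out, the database name slurm_acct_db, and the backup/state paths are all assumptions that vary by site. It defaults to a dry run that only prints each command.

```shell
#!/bin/sh
# Dry run by default: every command is prefixed with "echo" so it is
# printed rather than executed. Set RUN= (empty) to really run the steps.
RUN=${RUN:-echo}

upgrade_slurm() {
    # Raise the timeouts first (edit slurm.conf by hand before starting):
    #   SlurmctldTimeout=3600
    #   SlurmdTimeout=3600

    # Back up the accounting DB and the slurmctld state directory
    # (database name and paths are assumptions)
    $RUN sh -c 'mysqldump slurm_acct_db > /root/slurm_acct_db.sql'
    $RUN cp -a /var/spool/slurmctld /root/slurmctld-state.bak

    # Steps 1-3: slurmdbd first -- stop, upgrade, restart
    $RUN systemctl stop slurmdbd
    $RUN yum -y update slurm-slurmdbd      # or your site's upgrade method
    $RUN systemctl start slurmdbd

    # Steps 4-5: stop the controller, then slurmd on all compute nodes
    $RUN systemctl stop slurmctld
    $RUN pdsh -a systemctl stop slurmd     # pdsh assumed for node fan-out

    # Step 6: upgrade the slurmctld/slurmd packages everywhere
    $RUN yum -y update 'slurm*'
    $RUN pdsh -a yum -y update 'slurm*'

    # Steps 7-8: restart slurmd on the nodes, then the controller
    $RUN pdsh -a systemctl start slurmd
    $RUN systemctl start slurmctld
}

upgrade_slurm
```

Running it as-is prints the command sequence so you can review it against your own site's layout before executing anything.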