On 2014-06-11T11:56:36 EEST, Barbara Krasovec wrote:
>
> On 06/10/2014 04:16 PM, [email protected] wrote:
>>
>> Pending and running jobs should be preserved across major releases too.
> When we upgraded slurm from 2.5 to 2.6, we tested it beforehand on a
> working test cluster and all jobs were killed.
> So, if I do an upgrade of slurm from 2.6.5 to 14.03, it should work
> on a working cluster and it is not necessary to drain it? I just stop
> new jobs, and those that are already in the queue (running or pending)
> should be preserved?

It ought to work, yes, but if something goes wrong... Some issues we 
have seen in the past few years:

1) Jobs killed on upgrade, with complaints in the logs about protocol 
incompatibility between the new slurmctld and the older slurmds. IIRC 
this might have been the 2.5 -> 2.6.0 upgrade; a fix was included in 
2.6.1(?).
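
For what it's worth, it's easy to check which versions the daemons are 
actually speaking before and after an upgrade. A rough sketch only; the 
pdsh node list is just an example:

    scontrol show config | grep SLURM_VERSION   # what slurmctld is running
    pdsh -w node[001-100] 'slurmd -V'           # slurmd versions on the nodes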

2) Jobs killed due to slurmd timeout. This was due to an upgrade 
procedure where (for some reason?) the slurmds were first stopped, then 
the new rpm packages were installed, and then the slurmds restarted. 
With enough nodes, upgrading the packages everywhere took long enough 
that slurmctld decided all the nodes were down and killed the jobs, 
even though the jobs themselves were running fine. (This is of course 
trivial to avoid with a saner upgrade procedure and/or a larger 
SlurmdTimeout parameter, as sketched below. Would have been nice to 
think of it before the "OH F***" moment.. ;) )

3) slurmdbd hanging for 45 minutes during "service slurmdbd restart", 
due to updating the MySQL tables. Our job IDs are at ~11M, and 
/var/lib/mysql is ~10GB, so I guess it's just a lot of work to do.
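
With a database that size it's probably worth dumping it first and 
running the new slurmdbd in the foreground for the conversion, so an 
init-script timeout can't kill it halfway through. A sketch; the 
database name below is just the default, use whatever StorageLoc in 
slurmdbd.conf points at:

    mysqldump slurm_acct_db > slurm_acct_db.sql   # backup before the upgrade

    slurmdbd -D -vvv   # foreground + verbose; watch the table conversion
    # ...and go back to "service slurmdbd start" once it has finished.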

4) The libslurm .so version is bumped with every release, so things 
like MPI libraries with Slurm integration ought to be recompiled. 
Sometimes it works to just symlink the old .so name(s) to the new one, 
but that is of course a giant kludge with no guarantee of working. Some 
kind of ABI stability with symbol versioning etc. would be nice..
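
The kludge looks roughly like this; the .so version numbers below are 
made up for illustration, check what you actually have and what the MPI 
library was linked against:

    ldd /path/to/libmpi.so | grep libslurm   # which soname MPI expects

    # Point the old soname at the new library -- no guarantee the ABI
    # actually matches (version numbers here are illustrative only):
    ln -s /usr/lib64/libslurm.so.27 /usr/lib64/libslurm.so.26
    ldconfig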

Issues (2) and (3) are unfortunately the kind you tend to run into when 
upgrading your production system rather than some test cluster.. :( But 
generally, on-the-fly upgrades have worked fine for us, roughly in the 
order sketched below. Still, we try to do major upgrades at the same 
time as other maintenance if possible.
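
The usual order (and IIRC what the Slurm upgrade notes recommend) is 
slurmdbd first, then slurmctld, then the slurmds on the compute nodes. 
A sketch only; package and service names are from the stock RPM spec, 
adjust to your setup:

    service slurmdbd stop
    rpm -Uvh slurm-slurmdbd-*.rpm
    service slurmdbd start        # or run it in the foreground, see (3) above

    service slurm stop            # slurmctld on the controller node
    rpm -Uvh slurm-*.rpm
    service slurm start

    # ...and finally slurmd on the compute nodes, in batches, as in (2) above.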

>
> Thanks,
> Barbara
>>
>> Quoting Barbara Krasovec <[email protected]>:
>>
>>> On 06/10/2014 08:24 AM, José Manuel Molero wrote:
>>>> Dear Slurm user,
>>>>
>>>> Maybe these are dummy questions, but I can't find the answer in
>>>> the manual.
>>>>
>>>> We have recently installed slurm version 14.03 on a cluster, in a
>>>> Red Hat / Scientific Linux environment.
>>>> In order to tune the configuration, we want to test different
>>>> parameters in slurm.conf.
>>>> But there are several users running important jobs for several days.
>>>>
>>>> How can I change the configuration of slurm and restart
>>>> slurmctld without affecting the users and their jobs?
>>>> Is it also necessary to restart the slurm daemons?
>>>> Is it also possible to upgrade or change the slurm version while
>>>> there are jobs running?
>>>>
>>>> Thanks in advance.
>>>>
>>> Hello!
>>>
>>> We apply new configuration parameters with "scontrol reconfigure"
>>> (after first putting the new slurm.conf in place on all nodes).
>>>
>>> Upgrading slurm: in my experience, when upgrading to a minor release
>>> (e.g. from 2.6.4 to 2.6.X), it is not a problem to do it on a
>>> running cluster; jobs are preserved. But when upgrading to a major
>>> release (e.g. from 2.5 to 2.6), the cluster has to be drained first,
>>> otherwise jobs are killed.
>>>
>>> Cheers,
>>> Barbara
>>



--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || [email protected]

