On 2014-06-11T11:56:36 EEST, Barbara Krasovec wrote:
>
> On 06/10/2014 04:16 PM, [email protected] wrote:
>>
>> Pending and running jobs should be preserved across major releases too.
>
> When we upgraded slurm from 2.5 to 2.6, it was tested before on a
> working test cluster and all jobs were killed.
> So, if I do an upgrade of slurm from 2.6.5 to 14.03, it should work on a
> working cluster and it is not necessary to drain it? I just stop new
> jobs; those that are already in queue (running or pending) should be
> preserved?
It ought to work, yes, but if something goes wrong... Some issues we have
seen over the past few years:

1) Jobs killed on upgrade, with complaints in the logs about protocol
   incompatibility between the new slurmctld and the older slurmd's. IIRC
   this might have been the 2.5 -> 2.6.0 upgrade; a fix was included in
   2.6.1(?).

2) Jobs killed due to slurmd timeout. This was due to an upgrade procedure
   where (for some reason?) the slurmd's were first stopped, then the new
   rpm packages installed, then the slurmd's restarted. With enough nodes,
   upgrading the packages everywhere took long enough that slurmctld
   decided all the nodes were down and killed the jobs, even though the
   jobs themselves were running fine. (This is of course trivial to avoid
   with a saner upgrade procedure and/or a larger SlurmdTimeout parameter;
   a rough sketch of such a procedure is appended below the quoted thread.
   Would have been nice to think of it before the "OH F***" moment.. ;) )

3) slurmdbd hanging for 45 minutes during "service slurmdbd restart", due
   to updating the MySQL tables. Our job IDs are at ~11M and /var/lib/mysql
   is ~10 GB, so I guess it's just a lot of work to do.

4) The libslurm .so version is bumped every release, so things like MPI
   libraries with Slurm integration ought to be recompiled. Sometimes it
   works to just symlink the old .so name(s) to the new one, but that is
   of course a giant kludge with no guarantee of working. Some kind of ABI
   stability with symbol versioning etc. would be nice..

Issues (2) and (3) are unfortunately the kind you tend to run into when
upgrading your production system rather than some test cluster.. :(

But generally, on-the-fly upgrades have worked fine for us. Still, we try
to do major upgrades at the same time as other maintenance if possible.

>
> Thanks,
> Barbara
>>
>> Quoting Barbara Krasovec <[email protected]>:
>>
>>> On 06/10/2014 08:24 AM, José Manuel Molero wrote:
>>>> Dear Slurm user,
>>>>
>>>> Maybe these are dummy questions, but I can't find the answer in the
>>>> manual.
>>>>
>>>> Recently we have installed slurm 14.03 on a cluster, in a Red Hat /
>>>> Scientific Linux environment. In order to tune the configuration, we
>>>> want to test different parameters in slurm.conf, but there are
>>>> several users running important jobs for several days.
>>>>
>>>> How can I change the configuration of slurm and restart slurmctld
>>>> without affecting the users and their jobs? Is it also necessary to
>>>> restart the slurm daemons? Is it also possible to upgrade or change
>>>> the slurm version while there are jobs running?
>>>>
>>>> Thanks in advance.
>>>>
>>> Hello!
>>>
>>> We apply new configuration parameters with "scontrol reconfigure"
>>> (first I put the new slurm.conf in place on all nodes).
>>>
>>> Upgrading slurm: in my experience, when upgrading to a minor release
>>> (e.g. from 2.6.4 to 2.6.X), it is not a problem to do it on a running
>>> cluster; jobs are preserved. But when upgrading to a major release
>>> (e.g. from 2.5 to 2.6), the cluster has to be drained first,
>>> otherwise jobs are killed.
>>>
>>> Cheers,
>>> Barbara
>>

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || [email protected]
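For what it's worth, the "saner procedure" hinted at in (2) could look
roughly like the sketch below. This is only a sketch of what we would do on
an RPM-based setup: the 3600 s timeout, the package names, the
"slurm"/"slurmdbd" init script names and the use of pdsh are assumptions
from our own environment, not anything prescribed by this thread or by the
Slurm documentation; adjust for your site.

    # Assumptions: RPM packages named slurm / slurm-slurmdbd, init scripts
    # named slurm / slurmdbd, pdsh available across the nodes; 3600 s is
    # just an arbitrary "long enough" value.

    # 1) Give slurmctld plenty of slack before touching the compute nodes,
    #    so slow package installs don't get nodes marked DOWN and their
    #    jobs killed. In slurm.conf (on all hosts): SlurmdTimeout=3600
    scontrol reconfigure
    scontrol show config | grep SlurmdTimeout   # verify the new value took effect

    # 2) Upgrade the daemons in order: slurmdbd, then slurmctld, then the
    #    slurmd's, restarting each only after its packages are in place.
    yum update slurm-slurmdbd && service slurmdbd restart  # may sit in the MySQL conversion for a while
    yum update slurm && service slurm restart              # on the slurmctld host
    pdsh -a 'yum update slurm && service slurm restart'    # then the compute nodes

    # 3) Put SlurmdTimeout back to its normal value in slurm.conf and run
    #    "scontrol reconfigure" once more.

The slurmdbd restart is the step that sat in the MySQL table conversion for
~45 minutes in our case, so don't be alarmed if it takes a while.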
