Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

Miguel Gila Fri, 23 Feb 2018 01:08:38 -0800

We recently went through a similar exercise: when upgrading from 17.02.7 to 
17.11.03-2, we had to stop the upgrade on our production DB (shared with other 
databases) nearly half a day into it. It had reached the job table for a system 
with 6 million jobs and still had to go through another one with >7 million 
jobs and several smaller ones...


Interestingly enough, a modest VMware VM (2 CPUs, 3 GB RAM) with MariaDB 5.5.56 
outperformed our central MySQL 5.5.59 (128 GB, 14 cores, SAN) by a factor of at 
least 3 on every table conversion. As a result, since both DBs were upgraded in 
parallel with the same source production data, instead of rolling back, we 
decided to move all our Slurm DB queries to the VM. After giving it more RAM 
and more CPUs, the VM is performing quite well and we are evaluating how to 
move forward.

Needless to say, we were all shocked by this performance difference and are 
still trying to figure out why. CPU on the MySQL DB was 80-90% busy nearly all 
the time, with no iowait. The "left outer join" query was the place where 
everything got stuck. In my eyes, all this doesn't make a lot of sense :-S

Cheers,
Miguel

-- 
Miguel Gila
CSCS Swiss National Supercomputing Centre
HPC Operations



On 22.02.18, 21:47, "slurm-users on behalf of Christopher Benjamin Coffey" 
<slurm-users-boun...@lists.schedmd.com on behalf of chris.cof...@nau.edu> wrote:

    Thanks Paul. I didn't realize we were tracking energy. Looks like the 
    best way to stop tracking energy is to specify what you want to track with 
    AccountingStorageTRES? I'll give that a try.
    
    Best,
    Chris
    
    —
    Christopher Coffey
    High-Performance Computing
    Northern Arizona University
    928-523-1167
     
    
    On 2/22/18, 8:18 AM, "slurm-users on behalf of Paul Edmon" 
<slurm-users-boun...@lists.schedmd.com on behalf of ped...@cfa.harvard.edu> 
wrote:
    
        Typically the long db upgrades are only for major version upgrades.  
        Most of the time minor versions don't take nearly as long.
        
        At least with our upgrade from 17.02.9 to 17.11.3 the upgrade only took 
        1.5 hours with 6 months' worth of jobs (about 10 million jobs).  We 
        don't track energy usage though, so perhaps we avoided that particular 
        query due to that.
        
        From past experience these major upgrades can take quite a bit of time 
        as they typically change a lot about the DB structure in between major 
        versions.
        
        -Paul Edmon-
        
        On 02/22/2018 06:17 AM, Malte Thoma wrote:
        > FYI:
        > * We aborted our upgrade from 17.02.1-2 to 17.11.2 after about 18 h.
        > * Emptied the job table ("truncate xyz_job_table;")
        > * Ran the seemingly never-ending SQL command by hand on a backup database
        > * Meanwhile we did the Slurm upgrade (fast & easy)
        > * Reset the First-Job-ID to a high number
        > * Inserted the converted data table into the real database again.
        >
        > It took two experts for this task and we would very much appreciate a 
        > better upgrade concept!
        > In fact, we hesitate to upgrade from 17.11.2 to 17.11.3, because we 
        > are afraid of similar problems. Does anyone have experience with this?
        >
        > It would be good to know whether there is ANY chance that future 
        > upgrades will cause the same problems, or whether this will get better.
        >
        > Regards,
        > Malte
        >
        > Am 22.02.2018 um 01:30 schrieb Christopher Benjamin Coffey:
        >> This is great to know, Kurt. We can't be the only folks running into 
        >> this. I wonder if the mysql update code gets into a deadlock or 
        >> something. I'm hoping a slurm dev will chime in ...
        >>
        >> Kurt, out of band if need be, I'd be interested in the details of 
        >> what you ended up doing.
        >>
        >> Best,
        >> Chris
        >>
        >> —
        >> Christopher Coffey
        >> High-Performance Computing
        >> Northern Arizona University
        >> 928-523-1167
        >>
        >> On 2/21/18, 5:08 PM, "slurm-users on behalf of Kurt H Maier" 
        >> <slurm-users-boun...@lists.schedmd.com on behalf of k...@sciops.net> 
        >> wrote:
        >>
        >>      On Wed, Feb 21, 2018 at 11:56:38PM +0000, Christopher Benjamin Coffey wrote:
        >>      > Hello,
        >>      >
        >>      > We have been trying to upgrade slurm on our cluster from
        >>      > 16.05.6 to 17.11.3. I'm thinking this should be doable? Past
        >>      > upgrades have been a breeze, and I believe during the last one,
        >>      > the db upgrade took like 25 minutes. Well now, the db upgrade
        >>      > process is taking far too long. We previously attempted the
        >>      > upgrade during a maintenance window and the upgrade process did
        >>      > not complete after 24 hrs. I gave up on the upgrade and reverted
        >>      > the slurm version back by restoring a backup db.
        >>
        >>      We hit this on our try as well: upgrading from 17.02.9 to
        >>      17.11.3.  We truncated our job history for the upgrade, and then
        >>      did the rest of the conversion out-of-band and re-imported it
        >>      after the fact. It took us almost sixteen hours to convert a
        >>      1.5 million-job store.
        >>
        >>      We got hung up on precisely the same query you did, on a
        >>      similarly hefty machine.  It caused us to roll back an upgrade
        >>      and try again during our subsequent maintenance window with the
        >>      above approach.
        >>
        >>      khm
        >>
        >
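
For a rough sense of scale, the two completed conversions reported in this thread can be turned into approximate throughput figures. This is only a back-of-the-envelope sketch: the job counts and durations are the posters' own round numbers ("about 10 million jobs" in 1.5 hours, 1.5 million jobs in "almost sixteen hours"), the sites ran different hardware and database setups, and the variable and function names are made up for illustration.

```python
# Approximate conversion throughput implied by the figures quoted in the
# thread. Values are (jobs converted, hours taken) as reported by each poster.
reports = {
    "Paul Edmon": (10_000_000, 1.5),    # 17.02.9 -> 17.11.3, no energy tracking
    "Kurt H Maier": (1_500_000, 16.0),  # 17.02.9 -> 17.11.3, "almost sixteen hours"
}


def jobs_per_hour(jobs: int, hours: float) -> float:
    """Average number of job records converted per hour."""
    return jobs / hours


rates = {who: jobs_per_hour(jobs, hours) for who, (jobs, hours) in reports.items()}

for who, rate in rates.items():
    print(f"{who}: roughly {rate:,.0f} jobs/hour")

# Ratio between the fastest and the slowest reported conversion.
ratio = max(rates.values()) / min(rates.values())
print(f"fastest vs. slowest: about {ratio:.0f}x")
```

The gap between the two sites is far larger than the job counts alone would suggest, which fits the suspicion voiced above that one particular query (possibly tied to energy tracking) dominates the conversion time.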
