TL;DR: If you get a lock wait timeout from the Slurm database, and a longer
timeout in InnoDB doesn't help, you might want to consider loosening the
auto-increment lock mode.
The long story!
So, we’ve just upgraded our main cluster to 17.11.3 and moved our database
to MariaDB. There have been some glitches, and this one falls into the
category where it’s not an actual bug, but our experience might still be
interesting to someone who runs sacctmgr delete and finds slurmdbd
crashing. After I changed the MariaDB configuration, it worked again, and I
didn't try to reproduce the issue or test it further. But here's what I
saw while fixing the problem for us.
Slurmdbd repeatedly died with the error message “fatal: mysql gave
ER_LOCK_WAIT_TIMEOUT as an error.” Setting innodb_lock_wait_timeout in
my.cnf to a higher value didn’t solve the problem.
A single command from a script seemed to be all that was needed to create
this lock situation:
sacctmgr -i delete account where account=$accountname
A delete by sacctmgr is followed up with an alter table, in the same
transaction. This seems to be problematic with a pretty standard
configuration for MariaDB on CentOS 7. The query seems to create a lock
conflict with itself.
“Waiting for table metadata lock | alter table "milou_assoc_table"
The Slurm code already postpones the ALTER TABLE call until the end of the
transaction, noting that a rollback won’t be possible afterwards. Mixing
DDL and DML SQL statements in the same transaction, for the same table,
might not be wise.
A quicker solution that I opted for, in the middle of a service stop with
our systems down, was to change the MariaDB configuration. Instead of 1, I
set innodb_autoinc_lock_mode=2, allowing for looser locks.
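As a rough sketch, the change would look something like this in my.cnf. The section placement is an assumption based on a stock MariaDB install, and the comments reflect the documented meaning of the values rather than anything tested here. Note that innodb_autoinc_lock_mode cannot be changed at runtime, so the database server has to be restarted for it to take effect.

```ini
[mysqld]
# Auto-increment lock mode:
#   0 = traditional, 1 = consecutive (the default), 2 = interleaved
# Changed from 1 to 2 to allow looser locking.
innodb_autoinc_lock_mode = 2
```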
We are running Slurm 17.11.3 on a 300-node CentOS 7 cluster with MariaDB
5.5.56-2. We have all old and new users in our LDAP, and information on
expiration of projects in a separate external structure. Only projects that
are active (not expired), and users belonging to at least one such project,
are listed in the Slurm database. At regular intervals, expired data is
removed using sacctmgr delete.
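To make the cleanup step concrete, here is a minimal sketch of what such a periodic job could look like. It is not our actual script: the function names and the two input lists are hypothetical, and how you obtain the list of active projects from your external structure is left out. The only thing taken from above is the sacctmgr command line itself.

```python
import subprocess


def accounts_to_delete(in_slurm, active):
    """Return accounts that are in the Slurm database but no longer active."""
    return sorted(set(in_slurm) - set(active))


def delete_command(account):
    """Build: sacctmgr -i delete account where account=<account>"""
    return ["sacctmgr", "-i", "delete", "account", "where",
            f"account={account}"]


def purge(in_slurm, active, dry_run=True):
    """Delete expired accounts one at a time; returns the commands used.

    With dry_run=True (the default) the commands are only printed, which
    is handy for checking the list before touching the database.
    """
    commands = [delete_command(a) for a in accounts_to_delete(in_slurm, active)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first failing delete
            subprocess.run(cmd, check=True)
    return commands
```

Running the deletes one at a time, and stopping at the first failure, makes it easier to see which account triggered a lock timeout if slurmdbd goes down in the middle of a run.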
Since we moved the database to MariaDB and upgraded to 17.11 at the same
time, I don’t know how MariaDB behaved with previous Slurm versions.
We got this issue with delete, and changing this configuration fixed it.
There might be problems with other queries too.
Changing to a looser lock mode might introduce new issues, especially
depending on what backup and recovery solutions you have planned for your
database. I set innodb_autoinc_lock_mode=2, but it is possible that the
“traditional” value of 0 will also work.
That’s it! It would be interesting to hear if someone else has encountered
this problem and how you solved it.
Jessica Nettelblad, UPPMAX