TL;DR: If Slurmdbd gets a lock wait timeout from the Slurm database, and a higher innodb_lock_wait_timeout doesn't help, you might want to consider loosening the auto-increment lock mode (innodb_autoinc_lock_mode) in MariaDB.
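For reference, here is a minimal sketch of the my.cnf change described further down. The file location and the [mysqld] section are assumptions based on a stock MariaDB install on CentOS 7; your layout may differ.

```ini
# /etc/my.cnf (assumed location; adjust to your installation)
[mysqld]
# 1 ("consecutive") is the default; 2 ("interleaved") uses looser auto-increment locking
innodb_autoinc_lock_mode = 2
```

This variable is read at server startup and cannot be changed at runtime, so MariaDB needs a restart. Afterwards, SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode'; should report 2.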
The long story!

So, we've just upgraded our main cluster to 17.11.3 and moved our database to MariaDB. There have been some glitches. This one falls into the category of not being an actual bug, but our experience might still be interesting to anyone who runs sacctmgr delete and finds Slurmdbd crashing. After changing the MariaDB configuration, it worked again, and I didn't try to reproduce the issue or test it further. But here's what I saw while fixing the problem for us.

THE ERROR
Slurmdbd repeatedly died, with the error message “fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error.” Setting innodb_lock_wait_timeout in my.cnf to a higher value didn't solve the problem. One single query from a script seemed to be the only thing needed to create this lock situation:

    sacctmgr -i delete account where account=$accountname cluster=$cluster_name

THE PROBLEM
A delete by sacctmgr is followed up with an ALTER TABLE, in the same transaction:
https://github.com/SchedMD/slurm/blob/master/src/plugins/accounting_storage/mysql/accounting_storage_mysql.c
This seems to be problematic with a fairly standard MariaDB configuration on CentOS 7. The query seems to create a lock conflict with itself:
“Waiting for table metadata lock | alter table "milou_assoc_table" AUTO_INCREMENT=0”

THE FIX
The Slurm code already postpones the ALTER TABLE call until the end of the transaction, noting that a rollback won't be possible afterwards. Mixing DDL and DML SQL statements in the same transaction, for the same table, might not be wise. A quicker solution than changing the code, and the one I opted for in the middle of a service stop with our systems down, was to change the MariaDB configuration. Instead of 1, I set innodb_autoinc_lock_mode=2, allowing for looser locks.

OUR SETUP
We are running Slurm 17.11.3 on a 300-node CentOS 7 cluster with MariaDB 5.5.56-2. We have all old and new users in our LDAP, and information on the expiration of projects in a separate external structure.
Only projects that are active (not expired), and users belonging to at least one such project, are listed in the Slurm database. At regular intervals, expired data is removed using sacctmgr delete.

SOME COMMENTS
Since we moved the database to MariaDB and upgraded to 17.11 at the same time, I don't know how MariaDB behaved with previous Slurm versions. We got this issue with delete, and changing this configuration fixed it. There might be problems with other queries too. Changing to a looser lock mode might introduce new issues, especially depending on what backup and recovery solutions you have planned for your database. I set innodb_autoinc_lock_mode=2, but it is possible that the “traditional” value of 0 will also work.

That's it! It would be interesting to hear if someone else has encountered this problem and how you solved it.

Best regards,
Jessica Nettelblad, UPPMAX