Hello,
For two months now we have been finding the slurmdbd daemon down on every 1st of the month. Error messages appear in the logs shortly after midnight:

    2017-10-01T00:02:42.468823+02:00 tars-acct slurmdbd[7762]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012insert into "tars_step_table" (job_db_inx, id_step, time_start, step_name, state, tres_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq, req_cpufreq_min, req_cpufreq_gov) values (48088499, -2, 1506747882, 'batch', 1, '1=1,2=5000,4=1', 1, 1, 'tars-584', '252', 0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='tars-584', node_inx='252', task_dis
    2017-10-01T00:02:42.468854+02:00 tars-acct slurmdbd[7762]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program

In slurmdbd.conf we have:

    # CONTROLLER
    AuthType=auth/munge
    DbdHost=tars-acct
    PidFile=/var/run/slurmdbd.pid
    SlurmUser=slurm
    # DEBUG
    DebugLevel=info
    #DebugLevel=debug2
    #DebugFlags=DB_ARCHIVE
    # DATABASE
    StorageType=accounting_storage/mysql
    StoragePass=slurmdbd
    StorageUser=slurmdbd
    # ARCHIVES
    ArchiveDir=/path/to/archives
    ArchiveEvents=yes
    ArchiveJobs=yes
    ArchiveResvs=yes
    ArchiveSteps=yes
    ArchiveSuspend=yes
    PurgeEventAfter=12months
    PurgeJobAfter=12months
    PurgeResvAfter=12months
    PurgeStepAfter=1months
    PurgeSuspendAfter=12months

I have read in the slurmdbd.conf documentation: "The purge takes place at the start of each purge interval. For example, if the purge time is 2 months, the purge would happen at the beginning of each month."

So, since jobs are running even at midnight, I suppose that what happens is:
- slurmdbd tries to insert a record into the step table while the database is locked for the purge.
- As the purge takes a long time, the insert request times out.
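In the meantime, I wonder whether raising MySQL's InnoDB lock wait timeout (50 seconds by default, if I am not mistaken) would at least let the inserts survive a long purge. If I understood the Slurm accounting documentation correctly, it suggests something like this in the [mysqld] section of my.cnf (the value 900 is just what the documentation gives as an example, not something I have tested):

    # my.cnf -- server side; restart mysqld after changing.
    # Assumption: a longer lock wait should keep slurmdbd's insert from
    # hitting ER_LOCK_WAIT_TIMEOUT while the monthly purge holds locks.
    [mysqld]
    innodb_lock_wait_timeout=900

Of course this would only hide the symptom, not make the purge itself faster.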
We didn't have this problem before, but maybe that was because we had fewer jobs (usage of our cluster is always increasing), so the purge took less time.

If I understood the documentation correctly: at the beginning of the month, and according to my configuration, Slurm purges all jobs and events that are older than 1 year and all job steps that are older than 1 month. Can you confirm that I understood correctly?

If so, is there a way to have a shorter "purge interval"? I would like to see whether the problem still happens if we purged every week.

Any feedback regarding this problem is welcome.

Regards,

Véronique

--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03