Hello,

For two months now, we have been finding the slurmdbd daemon down on the 1st of
every month.
Error messages appear in the logs shortly after midnight:
2017-10-01T00:02:42.468823+02:00 tars-acct slurmdbd[7762]: error: mysql_query 
failed: 1205 Lock wait timeout exceeded; try restarting transaction#012insert 
into "tars_step_table" (job_db_inx, id_step, time_start, step_name, state, 
tres_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq, 
req_cpufreq_min, req_cpufreq_gov) values (48088499, -2, 1506747882, 'batch', 1, 
'1=1,2=5000,4=1', 1, 1, 'tars-584', '252', 0, 0, 0, 0) on duplicate key update 
nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='tars-584', 
node_inx='252', task_dis
2017-10-01T00:02:42.468854+02:00 tars-acct slurmdbd[7762]: fatal: mysql gave 
ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the 
calling program

In our slurmdbd.conf we have:
# CONTROLLER
AuthType=auth/munge
DbdHost=tars-acct
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm

# DEBUG
DebugLevel=info
#DebugLevel=debug2
#DebugFlags=DB_ARCHIVE

# DATABASE
StorageType=accounting_storage/mysql
StoragePass=slurmdbd
StorageUser=slurmdbd

# ARCHIVES
ArchiveDir=/path/to/archives
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=12months
PurgeStepAfter=1months
PurgeSuspendAfter=12months

I have read in the slurmdbd.conf documentation: "The purge takes place at the
start of each purge interval. For example, if the purge time is 2 months, the
purge would happen at the beginning of each month."
So I suppose that what happens, since jobs are still running at midnight, is:
- slurmdbd tries to insert a record into the job step table while the database
is locked for the purge.
- As the purge takes a long time, the insert request times out.
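To get an idea of how much work that monthly purge has to do, something like
the rough count below could be run against the accounting database. This is
only a sketch: it assumes the tars_step_table name seen in the log above, and
it uses time_start as an approximation of whichever timestamp column the purge
actually keys on.

-- Sketch: count step records older than one month, i.e. roughly what
-- PurgeStepAfter=1months has to delete on the 1st of the month.
-- Assumes the tars_step_table name from the log; time_start is only an
-- approximation of the column the purge really uses.
SELECT COUNT(*)
FROM tars_step_table
WHERE time_start < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 1 MONTH));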

We did not have this problem before, but that may be because we had fewer jobs
(usage of our cluster keeps increasing), so the purge took less time.

If I understood the documentation correctly: at the beginning of each month,
and with my configuration, Slurm purges all jobs and events that are older
than one year and all job steps that are older than one month.
Can you confirm that my understanding is correct?

If this is so, is there a way to set a shorter "purge interval"? I would like
to see whether the problem still happens if we purged every week.
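From what I can see in the slurmdbd.conf man page, the purge values also accept
"hours" and "days" suffixes, and the purge then runs at the start of each hour
or day instead of each month. If I am reading that correctly, a configuration
like the sketch below would keep roughly the same retention but spread the work
over much smaller daily batches (please correct me if I am misreading the
documentation):

# Sketch only: same retention expressed in days, so the purge runs
# daily in smaller batches instead of once a month.
PurgeEventAfter=365days
PurgeJobAfter=365days
PurgeResvAfter=365days
PurgeStepAfter=31days
PurgeSuspendAfter=365days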

Any feedback regarding this problem is welcome.

Regards,


Véronique


--
Véronique Legrand
IT engineer – scientific calculation & software development
https://research.pasteur.fr/en/member/veronique-legrand/
Cluster and computing group
IT department
Institut Pasteur Paris
Tel : 95 03
