Hi,

We have noticed that when some of our users run large queries with sacct, the
slurmdbd daemon uses too much RAM and the OOM killer kills the slurmdbd
process.

We have around 11 million jobs in our database. If we run the following
query, slurmdbd's memory usage goes above 15 GB and the process is killed by
the OOM killer (the VM hosting it has 16 GB of RAM):

sacct --format=jobid,user,group,account,cluster,cputime,cputimeraw,elapsed,ncpus,state,start,end \
      -X --allusers -S 2017-01-01T00:00 -E 2017-12-30T00:00 \
      -s COMPLETED,FAILED,CANCELLED,TIMEOUT

If we run a query that returns ~3 million jobs, slurmdbd's memory usage
stays around 4 GB.
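
Since smaller result sets stay within memory, one client-side workaround we
could imagine is splitting the period into monthly windows and running one
sacct per window. Below is a minimal sketch in Python, untested against a real
slurmdbd. The helper names month_windows and sacct_commands are made up for
illustration, and the script only prints the commands instead of running them:

```python
from datetime import date


def month_windows(start, end):
    """Yield (window_start, window_end) date pairs covering [start, end)
    in calendar-month steps. Hypothetical helper, for illustration only."""
    cur = start
    while cur < end:
        # First day of the month following `cur`.
        if cur.month == 12:
            nxt = date(cur.year + 1, 1, 1)
        else:
            nxt = date(cur.year, cur.month + 1, 1)
        # Clamp the last window to the requested end date.
        yield cur, min(nxt, end)
        cur = nxt


def sacct_commands(start, end):
    """Build one sacct argument list per window; nothing is executed here."""
    fmt = ("jobid,user,group,account,cluster,cputime,"
           "cputimeraw,elapsed,ncpus,state,start,end")
    states = "COMPLETED,FAILED,CANCELLED,TIMEOUT"
    cmds = []
    for s, e in month_windows(start, end):
        cmds.append([
            "sacct", "--format=" + fmt, "-X", "--allusers",
            "-S", s.strftime("%Y-%m-%dT00:00"),
            "-E", e.strftime("%Y-%m-%dT00:00"),
            "-s", states,
        ])
    return cmds


if __name__ == "__main__":
    # Same period as the query above, one command per month.
    for cmd in sacct_commands(date(2017, 1, 1), date(2017, 12, 30)):
        print(" ".join(cmd))
```

This does not stop another user from issuing the full-year query, and jobs
that span a window boundary may appear in more than one window, so the output
may need de-duplicating by job ID.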

After some debugging we have noticed that MySQL can handle the query
without issues, so there is no fine tuning that we can do on the MySQL
server. It is slurmdbd's memory usage that grows really fast, and then
the OOM killer does its job. The problem we see is that any user in the
cluster doing some testing with sacct can crash the slurmdbd daemon.

Does anyone know of a workaround for this issue?

Thanks in advance for any help or suggestions.

regards,
Pablo.

P.S. I know I can increase the memory of the VM as a short-term solution,
but I guess this won't scale in the long term.
