[slurm-dev] race condition in accounting

Alejandro Lucero Palau Mon, 14 Nov 2011 02:21:46 -0800

Hi,

We have seen sacct reports some jobs with with time_start = time_end, so
elapsed time is reported as 0, but these jobs have steps with execution
time > 0 (time_end - time_start for those steps).


Peering at slurmdbd log we can see "DBD_JOB_START: ELIGIBLE CALL ID" for
those jobs creating an entry in job_table with time_start=0. However, 
there are not any  log for "DBD_JOB_START: START CALL ID" for those jobs
as there are such log lines for other jobs without this problem.

Studying the code related with this accouting process, we have found a
potential race condition between _set_db_inx_thread thread at
plugins/accounting_storage/slurmdbd/accounting_storage_slurmdbd.c and
call to jobacct_storage_g_job_start in select_nodes function at
slurmctl/node_scheduler.c

The problem is the call from select_nodes is done when the job is set to
RUNNING but just if a db_index does exist. The _set_db_inx_thread 
thread is executed each 5 seconds, and it sends DBD JOB START messages
to slurmdbd. When slurmdbd processes those messages, it looks at which
is the job state for sending either a ELEGIBLE CALL ID or START CALL ID.
Most of the times the code works fine but there's a possibility where a
race conditions appears:

1) _set_db_inx_thread sends a DBD JOB START message to slurmdbd
2) slurmdbd looks at job state so it sends a ELIGIGLE CALL ID to mysql
3) Before mysql backend can process the operation, this is before a
db_index for the job can be created,  select_nodes sets the job to
RUNNING but it DOES not call jobacct_storage_g_job_start since there is
no a db_index for the job.
4) A step starts for the job
5) When the job completes an update in job_table for that job sets the
right completion time but it leaves start time to 0.

Those clusters with thousand of jobs by day are probably suffering this
problem. This situation has been studied for slurm 2.2.7 but same
problems arise with slurm 2.3.1:

select
id_job,job_db_inx,id_user,from_unixtime(time_submit),from_unixtime(time_start),from_unixtime(time_end),
state from bsc_cluster_job_table t1 where time_start=0 and time_end > 0
and exists (select * from bsc_cluster_step_table t2 where
t1.job_db_inx=t2.job_db_inx);

First thought was to solve the problem updating the job table entry with
start time at completion time but it would lead to sreport not showing
consumed time till the job completes. Other option could be to check
when first step is created then update the job table entry if job start
time is 0.

Probably someone else can see a better way to solve it or other
implications around.








WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer.htm

[slurm-dev] race condition in accounting

Reply via email to