Re: [slurm-dev] race condition in accounting

Danny Auble Thu, 24 Nov 2011 09:55:43 -0800

Yes, this should work with 2.2 as well.

I am aware of the slow down that could potentially happen here. I don'tthink the slow down will be that big of a deal though. I suppose anargument could be posed that you are only changing a uint32_t and awrite lock most likely isn't necessary since that variable can only bechanged when processing the return of the function. With that said,perhaps no real harm can come from keeping it a read lock. I figured Iwould err on the safe side though. If you could run some numbers tofigure out the slow down the write lock causes that could be useful. Myguess is it is a trivial amount, but that is only a guess.

As a base line in the current code, before this change and with HTCoptions set, we were consistently able to run around 1000 jobs a second.

In a system that is only managing a few thousand jobs a day I don'tthink this is going to be noticed at any moment. Once we try to get tothe point of multiple 1000's a second that is where I would be mostinterested in seeing if a slow down was obvious.


Danny

On 11/24/11 09:32, Alejandro Lucero Palau wrote:

Hi Danny,

I've just installed the patch on a slurm-2.2.7 system.

I'll let you know if it works.

I guess you are aware of adding the writer lock could have an impact on
performance. It seems there's a raising interest on managing thousands
of jobs by day and I wonder how using a writer lock for a loop like that
in set_db_inx_thread can make slurm less efficient.


On 11/23/2011 07:01 PM, Danny Auble wrote:

Alex,

Thanks for finding this.

https://github.com/SchedMD/slurm/commit/a3bb24095ace5e28048fe514135392cdb2b1e9ae


appears to fix the problem.  Let me know if you see any other issues
with this.  The fix will be in 2.3.2.

Danny

On 11/14/11 02:21, Alejandro Lucero Palau wrote:

Hi,

We have seen sacct reports some jobs with with time_start = time_end, so
elapsed time is reported as 0, but these jobs have steps with execution
time>   0 (time_end - time_start for those steps).

Peering at slurmdbd log we can see "DBD_JOB_START: ELIGIBLE CALL ID" for
those jobs creating an entry in job_table with time_start=0. However,
there are not any  log for "DBD_JOB_START: START CALL ID" for those jobs
as there are such log lines for other jobs without this problem.

Studying the code related with this accouting process, we have found a
potential race condition between _set_db_inx_thread thread at
plugins/accounting_storage/slurmdbd/accounting_storage_slurmdbd.c and
call to jobacct_storage_g_job_start in select_nodes function at
slurmctl/node_scheduler.c

The problem is the call from select_nodes is done when the job is set to
RUNNING but just if a db_index does exist. The _set_db_inx_thread
thread is executed each 5 seconds, and it sends DBD JOB START messages
to slurmdbd. When slurmdbd processes those messages, it looks at which
is the job state for sending either a ELEGIBLE CALL ID or START CALL ID.
Most of the times the code works fine but there's a possibility where a
race conditions appears:

1) _set_db_inx_thread sends a DBD JOB START message to slurmdbd
2) slurmdbd looks at job state so it sends a ELIGIGLE CALL ID to mysql
3) Before mysql backend can process the operation, this is before a
db_index for the job can be created,  select_nodes sets the job to
RUNNING but it DOES not call jobacct_storage_g_job_start since there is
no a db_index for the job.
4) A step starts for the job
5) When the job completes an update in job_table for that job sets the
right completion time but it leaves start time to 0.

Those clusters with thousand of jobs by day are probably suffering this
problem. This situation has been studied for slurm 2.2.7 but same
problems arise with slurm 2.3.1:

select
id_job,job_db_inx,id_user,from_unixtime(time_submit),from_unixtime(time_start),from_unixtime(time_end),

state from bsc_cluster_job_table t1 where time_start=0 and time_end>   0
and exists (select * from bsc_cluster_step_table t2 where
t1.job_db_inx=t2.job_db_inx);

First thought was to solve the problem updating the job table entry with
start time at completion time but it would lead to sreport not showing
consumed time till the job completes. Other option could be to check
when first step is created then update the job table entry if job start
time is 0.

Probably someone else can see a better way to solve it or other
implications around.








WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer.htm


WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer.htm

Re: [slurm-dev] race condition in accounting

Reply via email to