Hi Matthieu,

I had the same concern and approached the problem somewhat differently
than you, by waiting until there is data ready to read before calling
jobacct_gather_g_getinfo(); see:
https://github.com/SchedMD/slurm/commit/4c0eea7b8c20ccb1cacad51838a1ea8257cc637d
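
As a minimal sketch of that idea (this is not the code from the commit;
wait_for_stepd_reply() and the timeout value are placeholders), the socket
to slurmstepd can be polled first, so that jobacct_gather_g_getinfo(), and
the lock it takes, is only entered once a reply is actually available to
read:

#include <poll.h>

/* Sketch only: wait until the stepd socket is readable before calling
 * jobacct_gather_g_getinfo(), so the context lock is never held across
 * a read that could block indefinitely. */
static int wait_for_stepd_reply(int stepd_fd, int timeout_ms)
{
        struct pollfd pfd = { .fd = stepd_fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc <= 0)            /* timeout or error: skip the getinfo call */
                return -1;
        return (pfd.revents & POLLIN) ? 0 : -1;
}

/* Caller (pseudocode): only call jobacct_gather_g_getinfo() on the fd
 * after wait_for_stepd_reply(fd, timeout) returns 0. */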

I was hoping to avoid making this change to SLURM version 2.3, but if you
have also seen the problem then perhaps either your patch or mine should be
applied to v2.3. I need to look at your patch more closely, but it seems to
be a better approach than mine.

Moe

Quoting Matthieu Hautreux <matthieu.hautr...@gmail.com>:

Don,

Having encountered the exact same problem and applied this patch, I
can report that it does not work as expected.

Indeed, there is another layer of locking in
jobacct_common_{setinfo,getinfo} that triggers the exact same
deadlock.

I think the reason you did not see the problem is the size of your
jobs. The deadlock is caused by the use of both
jobacct_gather_g_setinfo and jobacct_gather_g_getinfo over a pipe, in
both slurmd and slurmstepd, in their communications. As long as the
two are not used simultaneously within each daemon, the problem does
not appear. jobacct_gather_g_getinfo over a pipe in slurmd is mostly
used to collect statistics from the slurmstepds (for example when
enforcing memory limits). jobacct_gather_g_setinfo over a pipe in
slurmd is used in stepd_completion, when slurmd receives a completion
message from another node of the same job and pushes the accounting
data to the related local slurmstepd for aggregation. This does not
occur with small jobs.

I think it is harmful to use both setinfo and getinfo over a pipe
when communicating between slurmd and slurmstepd. I modified the code
to ensure that slurmd only does getinfo calls and slurmstepd only does
setinfo calls. I have installed a patched version of SLURM that has
not reproduced the problem so far. You will find the patch enclosed.
It does not require the patch that removes the locks.
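
As a hypothetical illustration of that direction change (this is not the
enclosed patch; send_completion_acct() and the length-prefixed framing are
assumptions), on the stepd_completion path slurmd could write the
already-packed accounting buffer to the slurmstepd socket directly instead
of going through jobacct_gather_g_setinfo(), so it never takes the jobacct
context lock there, and slurmstepd would read the buffer and do the
setinfo/aggregation locally:

#include <stdint.h>
#include <unistd.h>

/* Hypothetical slurmd-side helper: push serialized accounting data to
 * slurmstepd with plain length-prefixed writes.  No jobacct lock is
 * taken here, so the only jobacct calls slurmd still makes over pipes
 * are getinfo ones. */
static int send_completion_acct(int stepd_fd, const void *buf, uint32_t len)
{
        if (write(stepd_fd, &len, sizeof(len)) != (ssize_t) sizeof(len))
                return -1;
        if (write(stepd_fd, buf, len) != (ssize_t) len)
                return -1;
        return 0;
}

slurmstepd would read the length and the buffer symmetrically and then
apply setinfo on the unpacked data itself, so each daemon uses the
accounting pipe in only one direction.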

Note that a node must be drained before this patch is applied to it.
Because the communication protocol is modified to avoid the deadlock,
both slurmd and slurmstepd must be aware of the modification.
Hot-swapping the RPMs could result in an old slurmstepd replying to a
new slurmd in the old way, resulting in a failure in job termination.

Moe, Danny, what is your view concerning this deadlock scenario? Do
you think that this patch is sufficient?

Thanks in advance for your feedback

Regards,
Matthieu



2011/12/16  <don.alb...@bull.com>:
I tried to reproduce the deadlock described in the previous message on my
local system by increasing the frequency at which REQUEST_PING was issued
and the frequency of calls to the job accounting routines.  I did this by
setting "SlurmdTimeout=5" to increase the rate of PINGs to every 2-3
seconds, and also setting "--acctg-freq=1" in the job submissions (see
script below) to increase the rate of attempts to get job accounting
statistics.  I also added a couple of "info" displays where the PING and
STEP_COMPLETE requests were processed so I could see them in the log file
without turning on all the detailed traces.
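
For reference, the relevant configuration looked roughly like the excerpt
below. Only SlurmdTimeout=5 is taken from the description above; the
accounting lines are assumptions about a typical jobacct-enabled setup, and
--acctg-freq=1 was passed per job on the srun lines in the script.

# slurm.conf excerpt (illustrative)
SlurmdTimeout=5                         # controller pings the node every 2-3 seconds
JobAcctGatherType=jobacct_gather/linux  # assumption: accounting plugin enabled
JobAcctGatherFrequency=30               # default; overridden with srun --acctg-freq=1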

I then ran this script:

#!/bin/bash
count=0
until [ $count = 10000 ]
do
   srun --acctg-freq=1 -N4 -n4 sleep 8 &
   srun --acctg-freq=1 -N4 -n4 sleep 7 &
   sleep 5
   count=`expr $count + 1`
done

I had to play around with the sleep times a bit, because if I submitted jobs
too quickly,  I would not get the PING requests to the node because it was
*too* busy.  But with the above script,  I could get the PING to occur
fairly close to the step complete requests.  For example:

[2011-12-16T11:46:44] launch task 22945.0 request from
605.100@141.112.17.124 (port 37606)
[2011-12-16T11:46:44] launch task 22944.0 request from
605.100@141.112.17.124 (port 37094)
[2011-12-16T11:46:46] Ping request received
[2011-12-16T11:46:46] Step complete request received
[2011-12-16T11:46:46] Step complete request received
[2011-12-16T11:46:46] Step complete request received
[2011-12-16T11:46:47] Step complete request received
[2011-12-16T11:46:47] Step complete request received
[2011-12-16T11:46:47] Step complete request received
[2011-12-16T11:46:47] [22943.0] done with job
[2011-12-16T11:46:47] [22942.0] done with job

I ran more than 10000 iterations of jobs and never encountered the deadlock.
This is possibly because I only have a four node cluster to try it on, so
the probability of hitting the deadlock is low.

As I described below,  I believe the deadlock is caused by excessive
locking/unlocking of the "g_jobacct_gather_context_lock" when calling the
job accounting plugin routines.  I removed all the lock/unlock pairs around
the calls and recompiled.  I then ran the same set of jobs again, to see if
removing the lock/unlock calls caused any problems.   None were apparent.

Attached is a SLURM 2.4.0-pre2 patch to "slurm_jobacct_gather.c" that
removes the lock calls.

  -Don Albert-






owner-slurm-...@lists.llnl.gov wrote on 12/07/2011 03:06:44 PM:

From: don.alb...@bull.com
To: slurm-dev@lists.llnl.gov,
Date: 12/07/2011 03:07 PM
Subject: [slurm-dev] Slurmd/slurmstepd race condition and deadlock
on "g_jobacct_gather_context_lock"
Sent by: owner-slurm-...@lists.llnl.gov

I have a bug report which describes a node in the "down" state with
a reason of "Not responding".  The node turns out to have a
deadlock between two pairs of threads, one half of each pair being
in slurmd and the other half in slurmstepd.  The original stack
traces and analysis by the site support, taken from the bug report,
are attached below.

Briefly, the deadlock occurs because slurmd has received two
requests:  a REQUEST_PING, and a REQUEST_STEP_COMPLETE, and has
started a thread to handle each of them.   The REQUEST_PING code
takes the opportunity to perform an "_enforce_job_mem_limit" call,
which requires getting some job accounting information.  To get this
information, it opens a socket to slurmstepd and sends a
REQUEST_STEP_STAT,  and then eventually calls
"jobacct_gather_g_getinfo" to receive the response.  This routine
locks the "g_jobacct_gather_context_lock" before eventually calling
"safe_read" to read from the socket.

Meanwhile the REQUEST_STEP_COMPLETE code opens a socket to
slurmstepd and sends a REQUEST_STEP_COMPLETION request,  then
eventually calls "jobacct_gather_g_setinfo".  This would lead to
sending the data via "safe_write" except that the thread attempts to
lock the "g_jobacct_gather_context_lock" and waits behind the
REQUEST_PING thread.

Since both of the slurmd threads issued their requests to slurmstepd
over their respective sockets before attempting to lock the above
lock,  slurmstepd has kicked off two threads to process them.  Each
of these threads needs to call some job_account_gather routines,
which attempt to lock "g_jobacct_gather_context_lock".   But since
slurmstepd is a separate process from slurmd,  this copy of the lock
is not the same as the slurmd one.   One of the threads in
slurmstepd gets the lock first and the other waits.  In the deadlock
case, we have the one that needs to respond to the REQUEST_STEP_STAT
stuck behind the lock,  while the other slurmstepd thread is
attempting to receive the REQUEST_STEP_COMPLETE information,  and
can't because its corresponding slurmd thread is waiting on the
slurmd copy of the  "g_jobacct_gather_context_lock" lock.
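
The essential pattern reduces to a small self-contained program. This is a
collapsed, single-process sketch rather than SLURM code: in the real case
the data would arrive from slurmstepd, and only after slurmd's setinfo path
has run, but the shape is the same. One thread blocks in read() on a pipe
while holding the mutex, and the thread that would produce the data must
first acquire that same mutex, so the program hangs (build with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
static int pipefd[2];

static void *getinfo_like(void *arg)    /* analogue of the REQUEST_PING path */
{
        char c;
        ssize_t n;
        (void) arg;
        pthread_mutex_lock(&ctx_lock);  /* "g_jobacct_gather_context_lock" */
        n = read(pipefd[0], &c, 1);     /* like safe_read(): blocks forever */
        (void) n;
        pthread_mutex_unlock(&ctx_lock);
        return NULL;
}

static void *setinfo_like(void *arg)    /* analogue of the STEP_COMPLETE path */
{
        ssize_t n;
        (void) arg;
        sleep(1);                       /* let the other thread take the lock */
        pthread_mutex_lock(&ctx_lock);  /* waits behind getinfo_like() ...    */
        n = write(pipefd[1], "x", 1);   /* ... so the data is never written   */
        (void) n;
        pthread_mutex_unlock(&ctx_lock);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;
        if (pipe(pipefd) != 0)
                return 1;
        pthread_create(&t1, NULL, getinfo_like, NULL);
        pthread_create(&t2, NULL, setinfo_like, NULL);
        pthread_join(t1, NULL);         /* never returns: deadlock */
        pthread_join(t2, NULL);
        puts("not reached");
        return 0;
}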

Looking at the code in "slurm_jobacct_gather.c",  where all the
"jobacct_gather_g_<xxx>" routines reside,  it seems that each of
these routines locks the "g_jobacct_gather_context_lock" lock before
proceeding and calling the job_account_gather plugin routines.
This lock ostensibly protects the "g_jobacct_gather_context" data
structure,  but it seems to me that there isn't anything there that
needs protecting, except for the initialization of the structure.
The routines all call "_slurm_jobacct_gather_init" first,  which
checks for the existence of the structure, and if necessary creates
it and populates the "ops" table by causing the loading of the
appropriate plugin.  Once this is done,  there doesn't seem to me to
be any reason to lock this particular mutex.  Removing the lock/
unlock calls from these routines would prevent the deadlock described
above.
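
A self-contained sketch of that alternative (not the SLURM code; the types
and names below are made up) is to protect only the one-time creation of
the context and ops table, for example with pthread_once(), and then
dispatch to the plugin without holding any global lock, so a call that
blocks on a pipe cannot stall unrelated callers:

#include <pthread.h>
#include <stdio.h>

typedef struct {
        int (*getinfo)(void);           /* stands in for the plugin ops table */
} ops_t;

static ops_t *g_context;
static pthread_once_t g_context_once = PTHREAD_ONCE_INIT;

static int fake_getinfo(void) { return 42; }

static void context_init(void)
{
        static ops_t ops = { .getinfo = fake_getinfo };
        g_context = &ops;               /* "load the plugin" exactly once */
}

static int gather_getinfo(void)
{
        pthread_once(&g_context_once, context_init);
        /* no global lock is held across the (potentially blocking) call */
        return g_context->getinfo();
}

int main(void)
{
        printf("%d\n", gather_getinfo());
        return 0;
}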

But not being the author of this code,  I am not sure if removing
this locking/unlocking is really safe, or might lead to other problems.

  -Don Albert-



[attachment "analysis.txt" deleted by Don Albert/US/BULL]



