Don, having encountered the exact same problem and applied this patch, I can confirm that it does not work as expected.
Indeed, there is another layer of locking in jobacct_common-{setinfo,getinfo}
that triggers the exact same deadlock. I think the reason you did not see the
problem is the size of your jobs.

The cause of the deadlock is the use of both jobacct_gather_g_setinfo and
jobacct_gather_g_getinfo over a pipe, in both slurmd and slurmstepd, for their
communications. As long as the two are not used simultaneously in each daemon,
the problem does not appear. jobacct_gather_g_getinfo over a pipe in slurmd is
mostly used to collect statistics from the slurmstepds (for example when
enforcing memory limits). jobacct_gather_g_setinfo over a pipe in slurmd is
used in stepd_completion, when slurmd receives a completion message from
another node of the same job and pushes the accounting data to the related
local slurmstepd for aggregation. This does not occur with small jobs. (A
small standalone sketch of the resulting interleaving is included below, after
the quoted messages.)

I think it is harmful to use both setinfo and getinfo over a pipe when
communicating between slurmd and slurmstepd, so I modified the code to ensure
that slurmd only does getinfo calls and slurmstepd only does setinfo calls. I
have installed a patched version of SLURM and it has not reproduced the
problem so far. You will find the patch enclosed. It does not require the
patch removing the locks.

Note that before applying this patch on a node, the node must be drained.
Because the communication protocol is modified to avoid the deadlock, both
slurmd and slurmstepd must be aware of the modification. A hot swap of the
rpms could result in an old slurmstepd replying to a new slurmd in the old
way, and thus in a failure at job termination.

Moe, Danny, what is your view on this deadlock scenario? Do you think this
patch is sufficient?

Thanks in advance for your feedback.

Regards,
Matthieu

2011/12/16 <don.alb...@bull.com>:
> I tried to reproduce the deadlock described in the previous message on my
> local system by increasing the frequency at which REQUEST_PING was issued
> and the frequency of calls to the job accounting routines. I did this by
> setting "SlurmdTimeout=5" to increase the rate of PINGs to every 2-3
> seconds, and also setting "--acctg-freq=1" in the job submissions (see
> script below) to increase attempts to get job accounting statistics. I
> also added a couple of "info" displays where the PING and STEP_COMPLETE
> requests were processed, so I could see them in the log file without
> turning on all the detailed traces.
>
> I then ran this script:
>
> #!/bin/bash
> count=0
> until [ $count = 10000 ]
> do
>     srun --acctg-freq=1 -N4 -n4 sleep 8 &
>     srun --acctg-freq=1 -N4 -n4 sleep 7 &
>     sleep 5
>     count=`expr $count + 1`
> done
>
> I had to play around with the sleep times a bit, because if I submitted
> jobs too quickly, I would not get the PING requests to the node because it
> was *too* busy. But with the above script, I could get the PING to occur
> fairly close to the step complete requests.
> For example:
>
> [2011-12-16T11:46:44] launch task 22945.0 request from
> 605.100@141.112.17.124 (port 37606)
> [2011-12-16T11:46:44] launch task 22944.0 request from
> 605.100@141.112.17.124 (port 37094)
> [2011-12-16T11:46:46] Ping request received
> [2011-12-16T11:46:46] Step complete request received
> [2011-12-16T11:46:46] Step complete request received
> [2011-12-16T11:46:46] Step complete request received
> [2011-12-16T11:46:47] Step complete request received
> [2011-12-16T11:46:47] Step complete request received
> [2011-12-16T11:46:47] Step complete request received
> [2011-12-16T11:46:47] [22943.0] done with job
> [2011-12-16T11:46:47] [22942.0] done with job
>
> I ran more than 10000 iterations of jobs and never encountered the
> deadlock. This is possibly because I only have a four-node cluster to try
> it on, so the probability of hitting the deadlock is low.
>
> As I described below, I believe the deadlock is caused by excessive
> locking/unlocking of the "g_jobacct_gather_context_lock" when calling the
> job accounting plugin routines. I removed all the lock/unlock pairs around
> the calls and recompiled. I then ran the same set of jobs again, to see if
> removing the lock/unlock calls caused any problems. None were apparent.
>
> Attached is a SLURM 2.4.0-pre2 patch to "slurm_jobacct_gather.c" that
> removes the lock calls.
>
> -Don Albert-
>
>
> owner-slurm-...@lists.llnl.gov wrote on 12/07/2011 03:06:44 PM:
>
>> From: don.alb...@bull.com
>> To: slurm-dev@lists.llnl.gov,
>> Date: 12/07/2011 03:07 PM
>> Subject: [slurm-dev] Slurmd/slurmstepd race condition and deadlock
>> on "g_jobacct_gather_context_lock"
>> Sent by: owner-slurm-...@lists.llnl.gov
>>
>> I have a bug report which describes a node in the "down" state with
>> a reason of "Not responding". The node turns out to have a deadlock
>> between two pairs of threads, one half of each pair being in slurmd
>> and the other half in slurmstepd. The original stack traces and
>> analysis by the site support, taken from the bug report, are
>> attached below.
>>
>> Briefly, the deadlock occurs because slurmd has received two
>> requests, a REQUEST_PING and a REQUEST_STEP_COMPLETE, and has
>> started a thread to handle each of them. The REQUEST_PING code
>> takes the opportunity to perform an "_enforce_job_mem_limit" call,
>> which requires getting some job accounting information. To get this
>> information, it opens a socket to slurmstepd and sends a
>> REQUEST_STEP_STAT, and then eventually calls
>> "jobacct_gather_g_getinfo" to receive the response. This routine
>> locks the "g_jobacct_gather_context_lock" before eventually calling
>> "safe_read" to read from the socket.
>>
>> Meanwhile, the REQUEST_STEP_COMPLETE code opens a socket to
>> slurmstepd and sends a REQUEST_STEP_COMPLETION request, then
>> eventually calls "jobacct_gather_g_setinfo". This would lead to
>> sending the data via "safe_write", except that the thread attempts
>> to lock the "g_jobacct_gather_context_lock" and waits behind the
>> REQUEST_PING thread.
>>
>> Since both of the slurmd threads issued their requests to slurmstepd
>> over their respective sockets before attempting to lock the above
>> lock, slurmstepd has kicked off two threads to process them. Each
>> of these threads needs to call some job_account_gather routines,
>> which attempt to lock "g_jobacct_gather_context_lock". But since
>> slurmstepd is a separate process from slurmd, this copy of the lock
>> is not the same as the slurmd one. One of the threads in
>> slurmstepd gets the lock first and the other waits. In the deadlock
>> case, we have the one that needs to respond to the REQUEST_STEP_STAT
>> stuck behind the lock, while the other slurmstepd thread is
>> attempting to receive the REQUEST_STEP_COMPLETE information, and
>> can't because its corresponding slurmd thread is waiting on the
>> slurmd copy of the "g_jobacct_gather_context_lock" lock.
>>
>> Looking at the code in "slurm_jobacct_gather.c", where all the
>> "jobacct_gather_g_<xxx>" routines reside, it seems that each of
>> these routines locks the "g_jobacct_gather_context_lock" lock before
>> proceeding to call the job_account_gather plugin routines. This
>> lock ostensibly protects the "g_jobacct_gather_context" data
>> structure, but it seems to me that there isn't anything there that
>> needs protecting, except for the initialization of the structure.
>> The routines all call "_slurm_jobacct_gather_init" first, which
>> checks for the existence of the structure and, if necessary, creates
>> it and populates the "ops" table by loading the appropriate plugin.
>> Once this is done, there doesn't seem to me to be any reason to lock
>> this particular mutex. Removing the lock/unlock calls from these
>> routines would prevent the deadlock described above.
>>
>> But not being the author of this code, I am not sure whether removing
>> this locking/unlocking is really safe, or whether it might lead to
>> other problems.
>>
>> -Don Albert-
>>
>> [attachment "analysis.txt" deleted by Don Albert/US/BULL]
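To make the interleaving easier to see, here is a small standalone sketch of
the pattern described above. It is not SLURM code, only an illustration with
POSIX pipes and pthreads; the thread names are just meant to suggest the roles
of the four threads in the traces. Each process takes its own copy of a
process-local mutex around blocking pipe I/O, the same way the
jobacct_gather_g_{getinfo,setinfo} wrappers take g_jobacct_gather_context_lock
around safe_read/safe_write, and the two crossed requests are enough to block
all four threads:

/*
 * deadlock_sketch.c - illustration only, not SLURM code.
 * Parent plays the role of slurmd, child plays the role of slurmstepd.
 * Channel A: "slurmd" reads statistics that "slurmstepd" should write.
 * Channel B: "slurmstepd" reads completion data that "slurmd" should write.
 * Each thread holds its process-local mutex around the blocking pipe I/O.
 * Build with: cc -pthread deadlock_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t local_lock = PTHREAD_MUTEX_INITIALIZER;

struct io_arg { int fd; int do_read; const char *name; };

static void *locked_io(void *v)
{
	struct io_arg *a = v;
	char byte = 'x';

	printf("%s: waiting for local lock\n", a->name);
	pthread_mutex_lock(&local_lock);	/* like jobacct_gather_g_*info() */
	printf("%s: got the lock, doing pipe I/O\n", a->name);
	if (a->do_read)
		(void) read(a->fd, &byte, 1);	/* like safe_read(): blocks on peer */
	else
		(void) write(a->fd, &byte, 1);	/* like safe_write() */
	pthread_mutex_unlock(&local_lock);
	printf("%s: done\n", a->name);
	return NULL;
}

int main(void)
{
	int stat_pipe[2], compl_pipe[2];	/* A: statistics, B: completion */
	pthread_t t1, t2;

	(void) pipe(stat_pipe);
	(void) pipe(compl_pipe);

	if (fork() == 0) {
		/* "slurmstepd": completion handler runs first, stat handler after */
		struct io_arg rd = { compl_pipe[0], 1, "stepd/handle_completion" };
		struct io_arg wr = { stat_pipe[1],  0, "stepd/handle_step_stat" };
		pthread_create(&t1, NULL, locked_io, &rd);
		sleep(1);	/* let the reader grab the local lock first */
		pthread_create(&t2, NULL, locked_io, &wr);
	} else {
		/* "slurmd": the PING/getinfo thread runs first, setinfo after */
		struct io_arg rd = { stat_pipe[0],  1, "slurmd/getinfo (PING)" };
		struct io_arg wr = { compl_pipe[1], 0, "slurmd/setinfo (STEP_COMPLETE)" };
		pthread_create(&t1, NULL, locked_io, &rd);
		sleep(1);
		pthread_create(&t2, NULL, locked_io, &wr);
	}
	pthread_join(t1, NULL);	/* never returns: the four threads are stuck */
	pthread_join(t2, NULL);
	return 0;
}

In each process the reader grabs the local lock and then blocks in read(), so
the local writer can never take the lock and the remote reader never receives
its data: the program hangs, which is the same cycle as in the stack traces.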
From 786312ed5222e6c37f96c6aaf62602ca5cef7912 Mon Sep 17 00:00:00 2001
From: Matthieu Hautreux <matthieu.hautr...@cea.fr>
Date: Tue, 24 Jan 2012 09:42:41 +0100
Subject: [PATCH 1/1] slurmd/slurmstepd: remove a deadlock scenario in completion mgmt

In the tightly coupled functions slurmd:stepd_completion and
slurmstepd:_handle_completion, a jobacct structure is sent from the main
daemon to the step daemon to provide the statistics of the children
slurmstepds and do the aggregation.

The methodology used to send the structure is jobacct_gather_g_{setinfo,
getinfo} over a pipe (JOBACCT_DATA_PIPE). As {setinfo,getinfo} share a
common internal lock, and reading from or writing to a pipe is equivalent
to holding a lock, slurmd and slurmstepd have to avoid using both setinfo
and getinfo over a pipe, or deadlock situations can occur. For example:
slurmd(lockforread,write)/slurmstepd(write,lockforread).

This patch removes the call to jobacct_gather_g_setinfo in slurmd and the
call to jobacct_gather_g_getinfo in slurmstepd, ensuring that slurmd only
does getinfo operations over a pipe and slurmstepd only does setinfo
operations over a pipe. Instead, jobacct_gather_g_{pack,unpack} are used
to marshall/unmarshall the data for transmission over the pipe.
---
 src/common/stepd_api.c      |   20 +++++++++++++++++++-
 src/slurmd/slurmstepd/req.c |   22 ++++++++++++++++++++--
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/src/common/stepd_api.c b/src/common/stepd_api.c
index 5339b96..640b09d 100644
--- a/src/common/stepd_api.c
+++ b/src/common/stepd_api.c
@@ -874,6 +874,10 @@ stepd_completion(int fd, step_complete_msg_t *sent)
 	int req = REQUEST_STEP_COMPLETION;
 	int rc;
 	int errnum = 0;
+	Buf buffer;
+	int len = 0;
+
+	buffer = init_buf(0);
 
 	debug("Entering stepd_completion, range_first = %d, range_last = %d",
 	      sent->range_first, sent->range_last);
@@ -881,7 +885,21 @@
 	safe_write(fd, &sent->range_first, sizeof(int));
 	safe_write(fd, &sent->range_last, sizeof(int));
 	safe_write(fd, &sent->step_rc, sizeof(int));
-	jobacct_gather_g_setinfo(sent->jobacct, JOBACCT_DATA_PIPE, &fd);
+	/*
+	 * We must not use setinfo over a pipe with slurmstepd here.
+	 * Indeed, slurmd makes heavy use of getinfo over a pipe with
+	 * slurmstepd, and doing the reverse can result in a deadlock
+	 * scenario with slurmstepd:
+	 * slurmd(lockforread,write)/slurmstepd(write,lockforread)
+	 * Do pack/unpack instead to keep slurmd and slurmstepd
+	 * independent.
+	 */
+	jobacct_gather_g_pack(sent->jobacct, SLURM_PROTOCOL_VERSION, buffer);
+	len = get_buf_offset(buffer);
+	safe_write(fd, &len, sizeof(int));
+	safe_write(fd, get_buf_data(buffer), len);
+	free_buf(buffer);
+
 	/* Receive the return code and errno */
 	safe_read(fd, &rc, sizeof(int));
 	safe_read(fd, &errnum, sizeof(int));
diff --git a/src/slurmd/slurmstepd/req.c b/src/slurmd/slurmstepd/req.c
index 478279d..00cce00 100644
--- a/src/slurmd/slurmstepd/req.c
+++ b/src/slurmd/slurmstepd/req.c
@@ -1282,6 +1282,10 @@ _handle_completion(int fd, slurmd_job_t *job, uid_t uid)
 	jobacctinfo_t *jobacct = NULL;
 	int step_rc;
 
+	char* buf;
+	int len;
+	Buf buffer;
+
 	debug("_handle_completion for job %u.%u",
 	      job->jobid, job->stepid);
 
@@ -1300,8 +1304,22 @@
 	safe_read(fd, &first, sizeof(int));
 	safe_read(fd, &last, sizeof(int));
 	safe_read(fd, &step_rc, sizeof(int));
-	jobacct = jobacct_gather_g_create(NULL);
-	jobacct_gather_g_getinfo(jobacct, JOBACCT_DATA_PIPE, &fd);
+
+	/*
+	 * We must not use getinfo over a pipe with slurmd here.
+	 * Indeed, slurmstepd makes heavy use of setinfo over a pipe
+	 * with slurmd, and doing the reverse can result in a deadlock
+	 * scenario with slurmd:
+	 * slurmd(lockforread,write)/slurmstepd(write,lockforread)
+	 * Do pack/unpack instead to keep slurmd and slurmstepd
+	 * independent.
+	 */
+	safe_read(fd, &len, sizeof(int));
+	buf = xmalloc(len);
+	safe_read(fd, buf, len);
+	buffer = create_buf(buf, len);
+	jobacct_gather_g_unpack(&jobacct, SLURM_PROTOCOL_VERSION, buffer);
+	free_buf(buffer);
 
 	/*
 	 * Record the completed nodes
-- 
1.7.6.2
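For completeness, the exchange the patch switches to is just a length-prefixed
blob: slurmd serializes the jobacct data into a private buffer with
jobacct_gather_g_pack() and writes the length followed by the bytes, and
slurmstepd reads the length, then exactly that many bytes, and calls
jobacct_gather_g_unpack(). The toy example below shows that framing with plain
POSIX calls; it is not the real Buf/pack API and it omits the short-read/
short-write handling that safe_read/safe_write provide. It is only meant to
show that no jobacct-level lock is involved in the transfer itself:

/*
 * framing_sketch.c - illustration of the length-prefixed framing only,
 * not the SLURM Buf API. The sender writes <int len><len bytes>; the
 * receiver reads the length and then exactly that many bytes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* sender side: analogue of pack + safe_write(len) + safe_write(data) */
static int send_blob(int fd, const void *data, int len)
{
	if (write(fd, &len, sizeof(int)) != (ssize_t) sizeof(int))
		return -1;
	if (write(fd, data, len) != (ssize_t) len)
		return -1;
	return 0;
}

/* receiver side: analogue of safe_read(len) + safe_read(data) + unpack */
static void *recv_blob(int fd, int *lenp)
{
	int len;
	char *buf;

	if (read(fd, &len, sizeof(int)) != (ssize_t) sizeof(int) || len <= 0)
		return NULL;
	buf = malloc(len);
	if (read(fd, buf, len) != (ssize_t) len) {
		free(buf);
		return NULL;
	}
	*lenp = len;
	return buf;
}

int main(void)
{
	int fds[2];
	const char payload[] = "packed jobacct data";
	int len = 0;
	char *blob;

	(void) pipe(fds);
	send_blob(fds[1], payload, sizeof(payload));	/* "slurmd" side */
	blob = recv_blob(fds[0], &len);			/* "slurmstepd" side */
	if (blob) {
		printf("received %d bytes: %s\n", len, blob);
		free(blob);
	}
	return 0;
}

Since each daemon only touches its own private buffer while doing the pipe
I/O, the getinfo/setinfo lock is no longer part of the exchange, and the
crossed-request scenario described above can no longer deadlock on it.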