Mark,

you are right: the problem is that each daemon is waiting to read()
data from the other daemon before writing the data that the other side
is waiting for. This happens because the blocking read calls hold the
locks needed to do the writes... (there are no process-shared mutexes
involved; the blocking read acts as a kind of shared mutex). That is
the main problem with using setinfo and getinfo over a pipe (the
JOBACCT_DATA_PIPE argument of setinfo/getinfo) for communication
between slurmd and slurmstepd. Ensuring that only one side does
setinfo (the write action) and the other does getinfo (the read
action) guarantees that no deadlock can occur, and that is how my
workaround works. I agree that keeping setinfo and getinfo over the
pipe in the other parts of slurmd and slurmstepd is harmful, as we
cannot guarantee that a new call in the reverse direction will not be
added in the future, causing the same problem again.
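
To make the pattern concrete, here is a stripped-down sketch of what I
mean. This is not the actual SLURM code path: the names and the whole
little program are invented for illustration. Each process locks its
own copy of a mutex and then blocks in read() on a pipe, while the
matching write() on the peer side also needs the peer's own mutex, so
if both daemons enter the read path first, neither write can ever
happen:

    /* deadlock_sketch.c -- hypothetical illustration only, not SLURM code.
     * Build with: cc -pthread deadlock_sketch.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER; /* one copy per process */
    static int read_fd, write_fd;

    /* "getinfo" shape: take the per-process lock, then block in read() */
    static void *getinfo_thread(void *arg)
    {
        char buf[16];
        (void) arg;
        pthread_mutex_lock(&ctx_lock);
        if (read(read_fd, buf, sizeof(buf)) < 0)   /* blocks holding ctx_lock */
            perror("read");
        pthread_mutex_unlock(&ctx_lock);
        return NULL;
    }

    /* "setinfo" shape: needs the same (local) lock before it can write the
     * data the peer's read() is waiting for */
    static void fake_setinfo(const char *msg, size_t len)
    {
        pthread_mutex_lock(&ctx_lock);             /* stuck behind getinfo */
        if (write(write_fd, msg, len) < 0)
            perror("write");
        pthread_mutex_unlock(&ctx_lock);
    }

    int main(void)
    {
        int a[2], b[2];            /* a: parent -> child, b: child -> parent */
        pthread_t t;

        if (pipe(a) || pipe(b))
            return 1;
        if (fork() == 0) {         /* child plays "slurmstepd" */
            read_fd = a[0]; write_fd = b[1];
        } else {                   /* parent plays "slurmd" */
            read_fd = b[0]; write_fd = a[1];
        }
        pthread_create(&t, NULL, getinfo_thread, NULL);
        sleep(1);                  /* let the getinfo thread win the lock */
        fake_setinfo("stats", 6);  /* blocks on ctx_lock; nobody ever writes */
        pthread_join(t, NULL);
        return 0;
    }

Both processes hang forever in the same kind of four-way wait that Don
described.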

I would be in favor of replacing every setinfo/getinfo call that uses
JOBACCT_DATA_PIPE between processes with the pack/unpack variation,
which looks safer with respect to deadlock scenarios.
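
Roughly, the one-directional exchange I have in mind would look like
the sketch below. The names and the trivial "packing" are invented
(real code would use the SLURM pack/unpack functions); the point is
that on a given exchange one daemon only writes and the other only
reads, and the jobacct mutex is only taken around data that is already
in memory, never across a blocking pipe operation:

    /* One-directional sketch, hypothetical names; the "packing" here is
     * just a fixed-size struct copy for brevity. */
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    struct acct_rec {                    /* toy stand-in for jobacct data */
        uint32_t taskid;
        uint64_t max_rss;
    };

    static pthread_mutex_t acct_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct acct_rec aggregated;

    /* sender side: serialize and write; it never read()s on this pipe */
    static int acct_send(int fd, const struct acct_rec *r)
    {
        return (write(fd, r, sizeof(*r)) == (ssize_t) sizeof(*r)) ? 0 : -1;
    }

    /* receiver side: read the whole record first, then lock and aggregate;
     * the lock is never held across the pipe I/O */
    static int acct_recv(int fd)
    {
        struct acct_rec r;

        if (read(fd, &r, sizeof(r)) != (ssize_t) sizeof(r))
            return -1;
        pthread_mutex_lock(&acct_lock);
        if (r.max_rss > aggregated.max_rss)
            aggregated.max_rss = r.max_rss;
        pthread_mutex_unlock(&acct_lock);
        return 0;
    }

Since neither side ever blocks in read() while holding a lock that the
peer's write depends on, the circular wait cannot form.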

One problem that I outlined is that, because I had to modify the
protocol between slurmd and slurmstepd, my patch prevents
administrators from doing a hot upgrade of their systems (while jobs
are running on the execution nodes). Moe's patch does not have this
issue, so perhaps it would be better for 2.3. My proposal (extended to
the other communications too) could then be used for 2.4, since
breaking compatibility between major releases is not a big problem. I
have already applied my patch to some of the systems I am in charge
of, but I could do a cold upgrade again with Moe's patch if that
sounds better.

One interesting lesson about the communication protocol is that adding
a protocol version to the messages exchanged would make this kind of
upgrade/modification easier in the future. Maybe that could be
included in a new protocol if something has to be rewritten.
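
For example (a purely hypothetical layout, not the current
slurmd/slurmstepd protocol), every message on the pipe could start
with a small versioned header, so that after a hot upgrade a newer
daemon could at least detect an older peer and fall back, or fail
cleanly, instead of misparsing the stream:

    /* Hypothetical versioned header for the slurmd <-> slurmstepd pipe. */
    #include <stdint.h>
    #include <unistd.h>

    #define JOBACCT_PROTO_VERSION 2   /* bump when the payload layout changes */

    struct msg_header {
        uint16_t version;
        uint16_t type;
        uint32_t length;              /* payload bytes that follow */
    };

    static int msg_send(int fd, uint16_t type, const void *payload, uint32_t len)
    {
        struct msg_header h = { JOBACCT_PROTO_VERSION, type, len };

        if (write(fd, &h, sizeof(h)) != (ssize_t) sizeof(h))
            return -1;
        return (write(fd, payload, len) == (ssize_t) len) ? 0 : -1;
    }
    /* The receiver reads the header first; if h.version differs from its
     * own, it can switch to the older layout or reject the message cleanly. */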

Regards,
Matthieu

2012/1/25 Mark A. Grondona <mgrond...@llnl.gov>:
> On Tue, 24 Jan 2012 22:21:05 -0700, don.alb...@bull.com wrote:
>> I have been struggling with this problem for some weeks now.  Once the
>> two pairs of threads in the two processes are in the deadlock, it is
>> easy enough to see how it happened and why they are stuck waiting on
>> each other.  The root cause is that both slurmd and slurmstepd use the
>> "job accounting gather plugin" and its associated common code within
>> their own processes, but in certain cases, need to transfer data
>> between the processes.  The functions within the job accounting gather
>> code contain a mutex that is shut for each function.  But since slurmd
>> and slurmstepd are separate processes, there is a separate copy of
>> this mutex in each process.  So when a thread in slurmd invokes a job
>> accounting gather function that needs to communicate to slurmstepd,
>> and the associated thread in slurmstepd also has to invoke a job
>> accounting gather function, two separate mutexes must be shut.  This
>> creates an ordering problem, and in the case where one of the threads
>> in slurmd is doing a "getinfo" and the other a "setinfo", the current
>> arrangement of the code ends up trying to shut the mutexes in opposite
>> order, leading to a race condition and deadlock.
>
> Are the mutexes in the jobacct code set as process-shared? If not,
> I'm not sure how locking order in separate processes could cause
> a deadlock.
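
To answer this inline: no, they are plain static pthread mutexes, so
each daemon has its own private copy; the coupling between the
processes comes from the blocking reads, as I said above. For
reference, a mutex only becomes visible across processes if it lives
in shared memory and is created with the PTHREAD_PROCESS_SHARED
attribute, roughly like this hypothetical snippet, and the jobacct
code does nothing of the sort:

    /* Hypothetical example of what a process-shared mutex would require.
     * A static pthread_mutex_t like g_jobacct_gather_context_lock is
     * private to each process. */
    #include <pthread.h>
    #include <sys/mman.h>

    static pthread_mutex_t *make_shared_mutex(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        if (m == MAP_FAILED)
            return NULL;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return m;   /* usable by this process and any child fork()ed later */
    }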
>
>
>> When I first looked at the job accounting gather code, the mutex in
>> question seemed to me to be unnecessary.  You can see my reason for
>> this in my original message at the end of this message.  Therefore I
>> proposed that the mutex be simply eliminated, which would avoid the
>> deadlock.  In private email, Moe suggested that this mutex might in
>> fact be needed to protect some structures inside of the plugin itself,
>> and Matthieu has shown that eliminating the mutex calls does not
>> prevent a deadlock further down in the code.  So that approach isn't
>> the solution.
>>
>> If I understand it correctly, Moe has proposed a change which delays
>> the call in slurmd to "getinfo" by adding a test and wait on the pipe
>> to slurmstepd to make sure that there is something to read before
>> calling the code that locks the mutex and reads the data. This avoids
>> the deadlock by ensuring that the slurmstepd side has successfully
>> obtained its copy of the mutex and written the data (i.e., is not
>> waiting on its side).  It resolves the ordering problem by always
>> having the sender shut its mutex first. This should fix the deadlock
>> that we have already seen, but I am concerned that if the race
>> condition on the mutex goes the other way, we could end up with both
>> sides trying to write instead of trying to read, and still have a
>> deadlock.  If we can be sure that "safe_write" will always complete
>> immediately (i.e., that it will never block on the write to the pipe)
>> and return to open its side of the mutex, then this may not be a
>> problem.
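
As an aside, the "test and wait on the pipe" described here is
essentially a readiness check before the blocking read. Very roughly,
and without claiming this is what Moe's commit actually does, the
shape is something like:

    /* Hypothetical readiness check: only enter the (lock-taking, blocking)
     * getinfo path once the stepd has actually written something, so the
     * reader cannot sit on the mutex while the pipe is still empty. */
    #include <poll.h>

    static int wait_for_reply(int fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc <= 0)
            return -1;                 /* timeout or error: give up */
        return (pfd.revents & POLLIN) ? 0 : -1;
    }

    /* caller:
     *   if (wait_for_reply(fd, 1000) == 0)
     *       jobacct_gather_g_getinfo(...);   // data is already there
     */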
>>
>> Again, if I understand correctly, Matthieu's change takes a different
>> approach and solves the deadlock by eliminating the locking on the
>> "setinfo" call.  It does this by doing its own packing/unpacking of
>> the data and calls to write the data over the pipe outside of the job
>> account gather routines, so the mutex lock/unlock calls for this
>> direction of data movement are never done.  Since his system is the
>> one where the original problem showed up, and with this change the
>> problem has gone away, I am leaning toward this solution.  My only
>> comment is that eliminating the "jobacct_gather_g_setinfo" and
>> "jobacct_gather_g_getinfo" calls in this case means that the job
>> account gather plugin is not invoked, and any special handling for
>> this case in the plugin is not possible (e.g., doing something
>> different for aix vs linux).  But that probably isn't a big issue.
>
> It sounds to me like the real deadlock has nothing to do with
> the mutex (which only protects the data between threads within
> a process), but instead is due to a poorly designed protocol for
> the jobacct_gather code between slurmd/slurmstepd, and you are
> getting into a state where each daemon is waiting to read() data
> from the other daemon (before they write the data that the other
> side is waiting for). Does that sound right?
>
> If so, this sounds like a very flawed protocol to me.
> Matthieu's approach sounds like it eliminates the blocking
> read from one daemon, if I understand it correctly (?), and
> Moe's approach makes the read() non-blocking on the slurmd
> side of the protocol. Don't both of these approaches just seem
> to be patching up a protocol that needs to be redesigned to
> be robust?
>
> My opinion (in as much as it counts) would be to
> apply Matthieu's fix in slurm-2.3, but for 2.4, please consider
> rethinking the whole design of the jobacct_gather protocol,
> and design something that isn't affected by this flaw. If you
> make it generic enough, perhaps other facilities could make
> use of the protocol in the future.
>
> mark
>
>
>
>
>>   -Don Albert-
>>
>> owner-slurm-...@lists.llnl.gov wrote on 01/24/2012 03:52:10 PM:
>>
>> > From: Moe Jette <je...@schedmd.com>
>> > To: slurm-dev@lists.llnl.gov,
>> > Date: 01/24/2012 03:52 PM
>> > Subject: Re: [slurm-dev] Slurmd/slurmstepd race condition and
>> > deadlock on "g_jobacct_gather_context_lock"
>> > Sent by: owner-slurm-...@lists.llnl.gov
>> >
>> > Hi Matthieu,
>> >
>> > I had the same concern and approached the problem somewhat differently
>> > than you by waiting until there is data ready to read before calling
>> > jobacct_gather_g_getinfo(), see:
>> > https://github.com/SchedMD/slurm/commit/4c0eea7b8c20ccb1cacad51838a1ea8257cc637d
>> >
>> > I was hoping to avoid making this change to SLURM version 2.3, but if
>> > you have also seen the problem then perhaps either your patch or mine
>> > should be made to v2.3. I need to look at your patch more closely, but
>> > it seems a better approach than my patch.
>> >
>> > Moe
>> >
>> > Quoting Matthieu Hautreux <matthieu.hautr...@gmail.com>:
>> >
>> > > Don,
>> > >
>> > > having encountered the exact same problem... and applied this patch, I
>> > > can confirm that it does not work as expected.
>> > >
>> > > Indeed, there is another layer of locking in
>> > > jobacct_common-{setinfo,getinfo} that triggers the exact same
>> > > deadlock.
>> > >
>> > > I think the reason you did not see the problem is the size of your
>> > > jobs. The cause of the deadlock is the use of both
>> > > jobacct_gather_g_setinfo and jobacct_gather_g_getinfo over a pipe in
>> > > both slurmd and slurmstepd for their communications. As long as both
>> > > are not used simultaneously in each daemon, the problem does not
>> > > appear.  jobacct_gather_g_getinfo over a pipe in slurmd is mostly
>> > > used to collect statistics from the slurmstepds (for example when
>> > > enforcing memory limits). jobacct_gather_g_setinfo over a pipe in
>> > > slurmd is used in stepd_completion, when slurmd receives a completion
>> > > message from another node of the same job and pushes the accounting
>> > > data to the related local slurmstepd for aggregation. This does not
>> > > occur with small jobs.
>> > >
>> > > I think it is harmful to use both setinfo and getinfo over a pipe
>> > > when communicating between slurmd and slurmstepd. I modified the code
>> > > to ensure that slurmd only does getinfo calls and slurmstepd only
>> > > does setinfo calls. I have installed a patched version of slurm that
>> > > has not reproduced the problem so far. You will find it enclosed. It
>> > > does not require the patch that removes the locks.
>> > >
>> > > Note that before applying this patch on a node, the node must be
>> > > drained. Since the communication protocol is modified to avoid the
>> > > deadlock, both slurmd and slurmstepd must be aware of the
>> > > modification. A hot swap of the RPMs could result in an old
>> > > slurmstepd trying to reply to a new slurmd in the old way, resulting
>> > > in a failure in job termination.
>> > >
>> > > Moe, Danny, what is your view concerning this deadlock scenario? Do
>> > > you think that this patch is sufficient?
>> > >
>> > > Thanks in advance for your feedback
>> > >
>> > > Regards,
>> > > Matthieu
>> > >
>> > >
>> > >
>> > > 2011/12/16  <don.alb...@bull.com>:
>> > >> I tried to reproduce the deadlock described in the previous message
>> > >> on my local system by increasing the frequency that the REQUEST_PING
>> > >> was issued and the frequency of calls to the job accounting routines.
>> > >> I did this by setting "SlurmdTimeout=5" to increase the rate of PINGs
>> > >> to every 2-3 seconds, and also setting "--acctg-freq=1" in the job
>> > >> submissions (see script below) to increase attempts to get job
>> > >> accounting statistics.  I also added a couple of "info" displays
>> > >> where the PING and STEP_COMPLETE requests were processed so I could
>> > >> see them in the log file without turning on all the detailed traces.
>> > >>
>> > >> I then ran this script:
>> > >>
>> > >> #!/bin/bash
>> > >> count=0
>> > >> until [ $count = 10000 ]
>> > >> do
>> > >>    srun --acctg-freq=1 -N4 -n4 sleep 8 &
>> > >>    srun --acctg-freq=1 -N4 -n4 sleep 7 &
>> > >>    sleep 5
>> > >>    count=`expr $count + 1`
>> > >> done
>> > >>
>> > >> I had to play around with the sleep times a bit, because if I
>> > >> submitted jobs too quickly, I would not get the PING requests to the
>> > >> node because it was *too* busy.  But with the above script, I could
>> > >> get the PING to occur fairly close to the step complete requests.
>> > >> For example:
>> > >>
>> > >> [2011-12-16T11:46:44] launch task 22945.0 request from
>> > >> 605.100@141.112.17.124 (port 37606)
>> > >> [2011-12-16T11:46:44] launch task 22944.0 request from
>> > >> 605.100@141.112.17.124 (port 37094)
>> > >> [2011-12-16T11:46:46] Ping request received
>> > >> [2011-12-16T11:46:46] Step complete request received
>> > >> [2011-12-16T11:46:46] Step complete request received
>> > >> [2011-12-16T11:46:46] Step complete request received
>> > >> [2011-12-16T11:46:47] Step complete request received
>> > >> [2011-12-16T11:46:47] Step complete request received
>> > >> [2011-12-16T11:46:47] Step complete request received
>> > >> [2011-12-16T11:46:47] [22943.0] done with job
>> > >> [2011-12-16T11:46:47] [22942.0] done with job
>> > >>
>> > >> I ran more than 10000 iterations of jobs and never encountered the
>> > >> deadlock.  This is possibly because I only have a four-node cluster
>> > >> to try it on, so the probability of hitting the deadlock is low.
>> > >>
>> > >> As I described below, I believe the deadlock is caused by excessive
>> > >> locking/unlocking of the "g_jobacct_gather_context_lock" when calling
>> > >> the job accounting plugin routines.  I removed all the lock/unlock
>> > >> pairs around the calls and recompiled.  I then ran the same set of
>> > >> jobs again, to see if removing the lock/unlock calls caused any
>> > >> problems.  None were apparent.
>> > >>
>> > >> Attached is a SLURM 2.4.0-pre2 patch to "slurm_jobacct_gather.c" that
>> > >> removes the lock calls.
>> > >>
>> > >>   -Don Albert-
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> owner-slurm-...@lists.llnl.gov wrote on 12/07/2011 03:06:44 PM:
>> > >>
>> > >>> From: don.alb...@bull.com
>> > >>> To: slurm-dev@lists.llnl.gov,
>> > >>> Date: 12/07/2011 03:07 PM
>> > >>> Subject: [slurm-dev] Slurmd/slurmstepd race condition and deadlock
>> > >>> on "g_jobacct_gather_context_lock"
>> > >>> Sent by: owner-slurm-...@lists.llnl.gov
>> > >>>
>> > >>> I have a bug report which describes a node in the "down" state with
>> > >>> a reason of "Not responding".   The node turns out to have a
>> > >>> deadlock between two pairs of threads, one-half of each pair being
>> > >>> in slurmd, and the other half in slurmstepd.  The original stack
>> > >>> traces and analysis by the site support taken from the bug report are
>> > >>> attached below.
>> > >>>
>> > >>> Briefly, the deadlock occurs because slurmd has received two
>> > >>> requests:  a REQUEST_PING, and a REQUEST_STEP_COMPLETE, and has
>> > >>> started a thread to handle each of them.   The REQUEST_PING code
>> > >>> takes the opportunity to perform an "_enforce_job_mem_limit" call,
>> > >>> which requires getting some job accounting information.  To get this
>> > >>> information, it opens a socket to slurmstepd and sends a
>> > >>> REQUEST_STEP_STAT,  and then eventually calls
>> > >>> "jobacct_gather_g_getinfo" to receive the response.  This routine
>> > >>> locks the "g_jobacct_gather_context_lock" before eventually calling
>> > >>> "safe_read" to read from the socket.
>> > >>>
>> > >>> Meanwhile the REQUEST_STEP_COMPLETE code opens a socket to
>> > >>> slurmstepd and sends a REQUEST_STEP_COMPLETION request,  then
>> > >>> eventually calls "jobacct_gather_g_setinfo".  This would lead to
>> > >>> sending the data via "safe_write" except that the thread attempts to
>> > >>> lock the "g_jobacct_gather_context_lock" and waits behind the
>> > >>> REQUEST_PING thread.
>> > >>>
>> > >>> Since both of the slurmd threads issued their requests to slumstepd
>> > >>> over their respective sockets before attempting to lock the above
>> > >>> lock,  slurmstepd has kicked off two threads to process them.  Each
>> > >>> of these threads needs to call some job_account_gather routines,
>> > >>> which attempt to lock "g_jobacct_gather_context_lock".   But since
>> > >>> slurmstepd is a separate process from slurmd,  this copy of the lock
>> > >>> is not the same as the slurmd one.   One of the threads in
>> > >>> slurmstepd gets the lock first and the other waits.  In the deadlock
>> > >>> case, we have the one that needs to respond to the REQUEST_STEP_STAT
>> > >>> stuck behind the lock,  while the other slurmstepd thread is
>> > >>> attempting to receive the REQUEST_STEP_COMPLETE information,  and
>> > >>> can't because its corresponding slurmd thread is waiting on the
>> > >>> slurmd copy of the  "g_jobacct_gather_context_lock" lock.
>> > >>>
>> > >>> Looking at the code in "slurm_jobacct_gather.c",  where all the
>> > >>> "jobacct_gather_g_<xxx>" routines reside,  it seems that each of
>> > >>> these routines locks the "g_jobacct_gather_context_lock" lock before
>> > >>> proceeding and calling the job_account_gather plugin routines.
>> > >>> This lock ostensibly protects the "g_jobacct_gather_context" data
>> > >>> structure,  but it seems to me that there isn't anything there that
>> > >>> needs protecting, except for the initialization of the structure.
>> > >>> The routines all call "_slurm_jobacct_gather_init" first,  which
>> > >>> checks for the existence of the structure, and if necessary creates
>> > >>> it and populates the "ops" table by causing the loading of the
>> > >>> appropriate plugin.  Once this is done,  there doesn't seem to me to
>> > >>> be any reason to lock this particular mutex.  Removing the lock/
>> > >>> unlock calls from these routines would prevent the deadlock
>> > >>> described above.
>> > >>>
>> > >>> But not being the author of this code,  I am not sure if removing
>> > >>> this locking/unlocking is really safe, or might lead to other
>> > >>> problems.
>> > >>>
>> > >>>   -Don Albert-
>> > >>>
>> > >>>
>> > >>>
>> > >>> [attachment "analysis.txt" deleted by Don Albert/US/BULL]
>> > >
>> >
>> >
>> >
>>
