Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-23 Thread Mike Galbraith
On Thu, 2017-11-23 at 15:42 +0100, Sebastian Siewior wrote:
> On 2017-11-21 22:20:51 [+0100], Mike Galbraith wrote:
> > On Tue, 2017-11-21 at 14:56 -0500, Mikulas Patocka wrote:
> > > 
> > > If we don't have any reason why block requests need to be unplugged
> > > when a spinlock is taken, then let's not do this.
> > 
> > That's perfectly fine.  I guess I shouldn't have even mentioned having
> > encountered unplug at mutex being insufficient.
> 
> While at it, I intend to drop
> fs-jbd2-pull-your-plug-when-waiting-for-space.patch from the -RT queue
> for v4.14 which does
> 
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -116,6 +116,8 @@ void __jbd2_log_wait_for_space(journal_t
>   nblocks = jbd2_space_needed(journal);
>   while (jbd2_log_space_left(journal) < nblocks) {
>   write_unlock(&journal->j_state_lock);
> + if (current->plug)
> + io_schedule();
>   mutex_lock(&journal->j_checkpoint_mutex);
>  
>   /*
> 
> and is/was probably a workaround for the missing schedule while blocking
> on mutex/rwsem.

Yeah, that's now code without a meaningful job.

-Mike


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-23 Thread Sebastian Siewior
On 2017-11-21 22:20:51 [+0100], Mike Galbraith wrote:
> On Tue, 2017-11-21 at 14:56 -0500, Mikulas Patocka wrote:
> > 
> > If we don't have any reason why block requests need to be unplugged
> > when a spinlock is taken, then let's not do this.
> 
> That's perfectly fine.  I guess I shouldn't have even mentioned having
> encountered unplug at mutex being insufficient.

While at it, I intend to drop
fs-jbd2-pull-your-plug-when-waiting-for-space.patch from the -RT queue
for v4.14 which does

--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -116,6 +116,8 @@ void __jbd2_log_wait_for_space(journal_t
         nblocks = jbd2_space_needed(journal);
         while (jbd2_log_space_left(journal) < nblocks) {
                 write_unlock(&journal->j_state_lock);
+                if (current->plug)
+                        io_schedule();
                 mutex_lock(&journal->j_checkpoint_mutex);
 
                 /*

and is/was probably a workaround for the missing schedule while blocking
on mutex/rwsem.

>   -Mike

Sebastian


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-21 Thread Mike Galbraith
On Tue, 2017-11-21 at 14:56 -0500, Mikulas Patocka wrote:
> 
> If we don't have any reason why block requests need to be unplugged
> when a spinlock is taken, then let's not do this.

That's perfectly fine.  I guess I shouldn't have even mentioned having
encountered unplug at mutex being insufficient.

-Mike


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-21 Thread Mikulas Patocka


On Tue, 21 Nov 2017, Mike Galbraith wrote:

> On Tue, 2017-11-21 at 11:11 -0500, Mikulas Patocka wrote:
> > 
> > So, drop the spinlock unplugging and leave only mutex unplugging, 
> > reproduce the deadlock and send the stacktraces.
> 
> Nah, I reproduced it five years ago.  Is any of that relevant today?
>  Damned if I know.  Your report was the first I've noticed since.
> 
>   -Mike

If we don't have any reason why block requests need to be unplugged when
a spinlock is taken, then let's not do this.

If you add code to the spinlock path and you don't know why, it is cargo 
cult programming.

Mikulas

Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-21 Thread Mike Galbraith
On Tue, 2017-11-21 at 11:11 -0500, Mikulas Patocka wrote:
> 
> So, drop the spinlock unplugging and leave only mutex unplugging, 
> reproduce the deadlock and send the stacktraces.

Nah, I reproduced it five years ago.  Is any of that relevant today?
 Damned if I know.  Your report was the first I've noticed since.

-Mike


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-21 Thread Mikulas Patocka


On Tue, 21 Nov 2017, Mike Galbraith wrote:

> On Tue, 2017-11-21 at 09:37 +0100, Thomas Gleixner wrote:
> > On Tue, 21 Nov 2017, Mike Galbraith wrote:
> > > On Mon, 2017-11-20 at 16:33 -0500, Mikulas Patocka wrote:
> > > > 
> > > > Is there some specific scenario where you need to call 
> > > > blk_schedule_flush_plug from rt_spin_lock_fastlock?
> > > 
> > > Excellent question.  What's the difference between not getting IO
> > > started because you meet a mutex with an rt_mutex under the hood, and
> > > not getting IO started because you meet a spinlock with an rt_mutex
> > > under the hood?  If just doing the mutex side puts this thing back to
> > > sleep, I'm happy.
> > 
> > Think about it from the mainline POV.
> > 
> > The spinlock cannot ever go to schedule and therefore cannot create a
> > situation which requires an unplug. The RT substitution of the spinlock
> > with a rtmutex based sleeping spinlock should not change that at all.
> > 
> > A regular mutex/rwsem etc. can and will unplug when the lock is contended
> > and the caller blocks. The RT conversion of these locks to rtmutex based
> > variants creates the problem: Unplug cannot be called when the task has
> > pi_blocked_on set because the unplug path might contend on yet another
> > lock. So unplugging in the slow path before setting pi_blocked_on is the
> > right thing to do.
> 
> Sure.  What alarms me about IO deadlocks reappearing after all this
> time is that at the time I met them, I needed every last bit of that
> patchlet I showed to kill them, whether that should have been the case
> or not.  'course that tree contained roughly a zillion patches..
> 
> Whatever, time will tell if I'm properly alarmed, or merely paranoid :)
> 
>   -Mike

So, drop the spinlock unplugging and leave only mutex unplugging, 
reproduce the deadlock and send the stacktraces.

Mikulas

Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-21 Thread Mike Galbraith
On Tue, 2017-11-21 at 09:37 +0100, Thomas Gleixner wrote:
> On Tue, 21 Nov 2017, Mike Galbraith wrote:
> > On Mon, 2017-11-20 at 16:33 -0500, Mikulas Patocka wrote:
> > > 
> > > Is there some specific scenario where you need to call 
> > > blk_schedule_flush_plug from rt_spin_lock_fastlock?
> > 
> > Excellent question.  What's the difference between not getting IO
> > started because you meet a mutex with an rt_mutex under the hood, and
> > not getting IO started because you meet a spinlock with an rt_mutex
> > under the hood?  If just doing the mutex side puts this thing back to
> > sleep, I'm happy.
> 
> Think about it from the mainline POV.
> 
> The spinlock cannot ever go to schedule and therefore cannot create a
> situation which requires an unplug. The RT substitution of the spinlock
> with a rtmutex based sleeping spinlock should not change that at all.
> 
> A regular mutex/rwsem etc. can and will unplug when the lock is contended
> and the caller blocks. The RT conversion of these locks to rtmutex based
> variants creates the problem: Unplug cannot be called when the task has
> pi_blocked_on set because the unplug path might contend on yet another
> lock. So unplugging in the slow path before setting pi_blocked_on is the
> right thing to do.

Sure.  What alarms me about IO deadlocks reappearing after all this
time is that at the time I met them, I needed every last bit of that
patchlet I showed to kill them, whether that should have been the case
or not.  'course that tree contained roughly a zillion patches..

Whatever, time will tell if I'm properly alarmed, or merely paranoid :)

-Mike


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-21 Thread Thomas Gleixner
On Tue, 21 Nov 2017, Mike Galbraith wrote:
> On Mon, 2017-11-20 at 16:33 -0500, Mikulas Patocka wrote:
> > 
> > Is there some specific scenario where you need to call 
> > blk_schedule_flush_plug from rt_spin_lock_fastlock?
> 
> Excellent question.  What's the difference between not getting IO
> started because you meet a mutex with an rt_mutex under the hood, and
> not getting IO started because you meet a spinlock with an rt_mutex
> under the hood?  If just doing the mutex side puts this thing back to
> sleep, I'm happy.

Think about it from the mainline POV.

The spinlock cannot ever go to schedule and therefore cannot create a
situation which requires an unplug. The RT substitution of the spinlock
with a rtmutex based sleeping spinlock should not change that at all.

A regular mutex/rwsem etc. can and will unplug when the lock is contended
and the caller blocks. The RT conversion of these locks to rtmutex based
variants creates the problem: Unplug cannot be called when the task has
pi_blocked_on set because the unplug path might contend on yet another
lock. So unplugging in the slow path before setting pi_blocked_on is the
right thing to do.
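
Roughly, the ordering being described, as an illustrative sketch (the
function name is made up and the actual locking is elided into comments;
this is not the applied patch):

static void rt_mutex_slowlock_sketch(struct rt_mutex *lock)
{
        /*
         * Flush plugged IO while it is still safe to do so: pi_blocked_on
         * is not yet set, so the flush path may itself block on another
         * sleeping lock without creating a problem.
         */
        if (blk_needs_flush_plug(current))
                blk_schedule_flush_plug(current);

        /*
         * Only now take wait_lock, set current->pi_blocked_on, enqueue the
         * waiter and schedule() until the lock is granted.  From this point
         * on, sched_submit_work() must not (and does not) try to unplug.
         */
}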

Thanks,

tglx

Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-20 Thread Mike Galbraith
On Mon, 2017-11-20 at 16:33 -0500, Mikulas Patocka wrote:
> 
> Is there some specific scenario where you need to call 
> blk_schedule_flush_plug from rt_spin_lock_fastlock?

Excellent question.  What's the difference between not getting IO
started because you meet a mutex with an rt_mutex under the hood, and
not getting IO started because you meet a spinlock with an rt_mutex
under the hood?  If just doing the mutex side puts this thing back to
sleep, I'm happy.

-Mike


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-20 Thread Mikulas Patocka


On Mon, 20 Nov 2017, Mikulas Patocka wrote:

> 
> 
> On Mon, 20 Nov 2017, Sebastian Siewior wrote:
> 
> > On 2017-11-18 19:37:10 [+0100], Mike Galbraith wrote:
> > > Below is my 2012 3.0-rt version of that for reference; at that time we
> > > were using slab, and slab_lock referenced below was a local_lock.  The
> > > comment came from crash analysis of a deadlock I met before adding the
> > > (yeah, hacky) __migrate_disabled() blocker.  At the time, that was not
> > > an optional hack, you WOULD deadlock if you ground disks to fine powder
> > > the way SUSE QA did.  Pulling the plug before blocking cured the
> > > xfs/ext[34] IO deadlocks they griped about, but I had to add that hack
> > > to not trade their nasty IO deadlocks for the more mundane variety.  So
> > > my question is: are we sure that in the here and now flush won't want
> > > any lock we may be holding?  In days of yore, it most definitely would
> > > turn your box into a doorstop if you tried hard enough.
> > 
> > I have a deadlock in ftest01/LTP which is cured by that.
> > The root-problem (as I understand it) is that !RT does
> >   schedule() -> sched_submit_work() -> blk_schedule_flush_plug()
> > 
> > on a lock contention (that is that bit_spinlock or rwsem in jbd2/ext4
> > for instance). On RT this does not happen because of tsk_is_pi_blocked()
> > check in sched_submit_work(). That check is needed because an additional
> > (rtmutex) lock can not be acquired at this point.
> 
> bit_spin_lock on a non-RT kernel doesn't call blk_schedule_flush_plug(). So, 
> if not calling it causes a deadlock, it should be fixed in the non-RT kernel 
> as well.
> 
> It is highly questionable - how could bit_spin_lock depend on 
> blk_schedule_flush_plug() at all? bit_spin_lock spins in a loop until the 
> specified bit is clear. blk_schedule_flush_plug() submits disk requests to 
> the disk driver.
> 
> If some part of the kernel spins until disk requests are completed, it is 
> already seriously misdesigned. Spinning until disk requests complete 
> wouldn't work on a uniprocessor non-preemptive kernel at all.
> 
> So, I suspect that the cause of the deadlock is something completely 
> different.
> 
> Mikulas

BTW, if you have some deadlock on jbd_lock_bh_state, 
jbd_lock_bh_journal_head or bh_uptodate_lock_irqsave (these functions use 
bit_spin_lock on a non-RT kernel), then there must be some other task that 
holds the lock, so identify that task and send its stacktrace for 
analysis.

It may be a generic bug in jbd2 code (that is just triggered because the 
RT patch changes timings) or it may be jbd2 code incorrectly patched by 
the RT patch.

Spinlocks have nothing to do with blk_schedule_flush_plug().

Mikulas

Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-20 Thread Mikulas Patocka


On Sat, 18 Nov 2017, Mike Galbraith wrote:

> On Fri, 2017-11-17 at 15:57 +0100, Sebastian Siewior wrote:
> > On 2017-11-13 12:56:53 [-0500], Mikulas Patocka wrote:
> > > Hi
> > Hi,
> > 
> > > I'm submitting this patch for the CONFIG_PREEMPT_RT patch. It fixes 
> > > deadlocks in device mapper when real time preemption is used.
> > 
> > applied, thank you.
> 
> Below is my 2012 3.0-rt version of that for reference; at that time we
> were using slab, and slab_lock referenced below was a local_lock.  The
> comment came from crash analysis of a deadlock I met before adding the
> (yeah, hacky) __migrate_disabled() blocker.  At the time, that was not
> an optional hack, you WOULD deadlock if you ground disks to fine powder
> the way SUSE QA did.  Pulling the plug before blocking cured the
> xfs/ext[34] IO deadlocks they griped about, but I had to add that hack
> to not trade their nasty IO deadlocks for the more mundane variety.  So
> my question is: are we sure that in the here and now flush won't want
> any lock we may be holding?  In days of yore, it most definitely would
> turn your box into a doorstop if you tried hard enough.

I think that you shouldn't call blk_schedule_flush_plug when attempting to 
take an rt-spinlock at all.

An rt-mutex can't depend on an rt-spinlock (although they use the same 
internal implementation), so a task waiting on an rt-spinlock can't block 
any other task attempting to take an rt-mutex from making progress.

Is there some specific scenario where you need to call 
blk_schedule_flush_plug from rt_spin_lock_fastlock?

Mikulas

> Subject: rt: pull your plug before blocking
> Date: Wed Jul 18 14:43:15 CEST 2012
> From: Mike Galbraith 
> 
> Queued IO can lead to IO deadlock should a task require wakeup from a task
> which is blocked on that queued IO.
> 
> ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
> pulled.  dbench2 mutex owner is waiting for kjournald, who is waiting for
> the buffer queued by dbench1.  Game over.
> 
> Signed-off-by: Mike Galbraith 
> ---
>  kernel/locking/rtmutex.c |   18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "rtmutex_common.h"
>  
> @@ -930,8 +931,18 @@ static inline void rt_spin_lock_fastlock
>  
>   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
>   rt_mutex_deadlock_account_lock(lock, current);
> - else
> + else {
> + /*
> +  * We can't pull the plug if we're already holding a lock
> +  * else we can deadlock.  eg, if we're holding slab_lock,
> +  * ksoftirqd can block while processing BLOCK_SOFTIRQ after
> +  * having acquired q->queue_lock.  If _we_ then block on
> +  * that q->queue_lock while flushing our plug, deadlock.
> +  */
> + if (__migrate_disabled(current) < 2 && blk_needs_flush_plug(current))
> + blk_schedule_flush_plug(current);
>   slowfn(lock);
> + }
>  }
>  
>  static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
> @@ -1892,9 +1903,12 @@ rt_mutex_fastlock(struct rt_mutex *lock,
>   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
>   rt_mutex_deadlock_account_lock(lock, current);
>   return 0;
> - } else
> + } else {
> + if (blk_needs_flush_plug(current))
> + blk_schedule_flush_plug(current);
>   return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK,
> ww_ctx);
> + }
>  }
>  
>  static inline int
> 

Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-20 Thread Mikulas Patocka


On Mon, 20 Nov 2017, Sebastian Siewior wrote:

> On 2017-11-18 19:37:10 [+0100], Mike Galbraith wrote:
> > Below is my 2012 3.0-rt version of that for reference; at that time we
> > were using slab, and slab_lock referenced below was a local_lock.  The
> > comment came from crash analysis of a deadlock I met before adding the
> > (yeah, hacky) __migrate_disabled() blocker.  At the time, that was not
> > an optional hack, you WOULD deadlock if you ground disks to fine powder
> > the way SUSE QA did.  Pulling the plug before blocking cured the
> > xfs/ext[34] IO deadlocks they griped about, but I had to add that hack
> > to not trade their nasty IO deadlocks for the more mundane variety.  So
> > my question is: are we sure that in the here and now flush won't want
> > any lock we may be holding?  In days of yore, it most definitely would
> > turn your box into a doorstop if you tried hard enough.
> 
> I have a deadlock in ftest01/LTP which is cured by that.
> The root-problem (as I understand it) is that !RT does
>   schedule() -> sched_submit_work() -> blk_schedule_flush_plug()
> 
> on a lock contention (that is that bit_spinlock or rwsem in jbd2/ext4
> for instance). On RT this does not happen because of tsk_is_pi_blocked()
> check in sched_submit_work(). That check is needed because an additional
> (rtmutex) lock can not be acquired at this point.

bit_spin_lock on a non-RT kernel doesn't call blk_schedule_flush_plug(). So, 
if not calling it causes a deadlock, it should be fixed in the non-RT kernel 
as well.

It is highly questionable - how could bit_spin_lock depend on 
blk_schedule_flush_plug() at all? bit_spin_lock spins in a loop until the 
specified bit is clear. blk_schedule_flush_plug() submits disk requests to 
the disk driver.
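
For reference, a simplified sketch of the non-RT bit_spin_lock() being
described - pure spinning with preemption disabled, no schedule() and
therefore no plug flush (the SMP/debug #ifdefs and sparse annotations of
the real include/linux/bit_spinlock.h are omitted):

static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
        preempt_disable();
        while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
                preempt_enable();
                do {
                        cpu_relax();
                } while (test_bit(bitnum, addr));
                preempt_disable();
        }
}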

If some part of the kernel spins until disk requests are completed, it is 
already seriously misdesigned. Spinning until disk requests complete 
wouldn't work on a uniprocessor non-preemptive kernel at all.

So, I suspect that the cause of the deadlock is something completely 
different.

Mikulas

> With this change we get closer to what !RT does. In regard to that
> migrate_disable() we could check if it is possible to do that after we
> acquired the lock (which is something we tried before but failed due to
> CPU-hotplug requirement, maybe something changed now) or flush the plug
> before disabling migration if it is really a problem.
> 
> Your question whether or not delaying IO can cause any deadlocks is
> something that I can't answer, and it is something that would affect !RT,
> too. I tried to add lockdep to bit-spinlocks but this does not work
> because one context acquires the lock and another does the unlock. It
> has been explained to me that no deadlocks should happen as long as
> the IO is flushed before we block/wait on a lock.
> 
> > Subject: rt: pull your plug before blocking
> > Date: Wed Jul 18 14:43:15 CEST 2012
> > From: Mike Galbraith 
> > 
> > Queued IO can lead to IO deadlock should a task require wakeup from a task
> > which is blocked on that queued IO.
> > 
> > ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
> > pulled.  dbench2 mutex owner is waiting for kjournald, who is waiting for
> > the buffer queued by dbench1.  Game over.
> > 
> > Signed-off-by: Mike Galbraith 
> > ---
> >  kernel/locking/rtmutex.c |   18 --
> >  1 file changed, 16 insertions(+), 2 deletions(-)
> > 
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "rtmutex_common.h"
> >  
> > @@ -930,8 +931,18 @@ static inline void rt_spin_lock_fastlock
> >  
> > if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
> > rt_mutex_deadlock_account_lock(lock, current);
> > -   else
> > +   else {
> > +   /*
> > +* We can't pull the plug if we're already holding a lock
> > +* else we can deadlock.  eg, if we're holding slab_lock,
> > +* ksoftirqd can block while processing BLOCK_SOFTIRQ after
> > +* having acquired q->queue_lock.  If _we_ then block on
> > +* that q->queue_lock while flushing our plug, deadlock.
> > +*/
> > +   if (__migrate_disabled(current) < 2 && blk_needs_flush_plug(current))
> > +   blk_schedule_flush_plug(current);
> > slowfn(lock);
> > +   }
> >  }
> >  
> >  static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
> > @@ -1892,9 +1903,12 @@ rt_mutex_fastlock(struct rt_mutex *lock,
> > if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
> > rt_mutex_deadlock_account_lock(lock, current);
> > return 0;
> > -   } else
> > +   } else {
> > +   if (blk_needs_flush_plug(current))
> > +   blk_schedule_flush_plug(current);
> > return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK,
> >

Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-20 Thread Mike Galbraith
On Mon, 2017-11-20 at 13:43 +0100, Mike Galbraith wrote:

> The problem was converted spinlocks (and added RT locks).

But I suppose lockdep should gripe about that these days...


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-20 Thread Mike Galbraith
On Mon, 2017-11-20 at 11:53 +0100, Sebastian Siewior wrote:
> 
> Your question whether or not delaying IO can cause any deadlocks is
> something that I can't answer, and it is something that would affect !RT,
> too. I tried to add lockdep to bit-spinlocks but this does not work
> because one context acquires the lock and another does the unlock. It
> has been explained to me that no deadlocks should happen as long as
> the IO is flushed before we block/wait on a lock.

That wasn't the question (guess I didn't formulate it well).  What I
was concerned about was the possibility that the situation that caused
me to add that __migrate_disabled() qualifier might arise anew, but
with different players.

At the time, I had to add that qualifier to prevent ABBA between the
owner of q->queue_lock, who was blocked by a lock ALREADY held by the
task trying to pull its plug, who then met the locked q->queue_lock.
Ergo the less-than-perfect hack to only allow pulling the plug when
NOT YET holding a spinlock.  The problem was converted spinlocks (and
added RT locks).

-Mike


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-20 Thread Sebastian Siewior
On 2017-11-18 19:37:10 [+0100], Mike Galbraith wrote:
> Below is my 2012 3.0-rt version of that for reference; at that time we
> were using slab, and slab_lock referenced below was a local_lock.  The
> comment came from crash analysis of a deadlock I met before adding the
> (yeah, hacky) __migrate_disabled() blocker.  At the time, that was not
> an optional hack, you WOULD deadlock if you ground disks to fine powder
> the way SUSE QA did.  Pulling the plug before blocking cured the
> xfs/ext[34] IO deadlocks they griped about, but I had to add that hack
> to not trade their nasty IO deadlocks for the more mundane variety.  So
> my question is: are we sure that in the here and now flush won't want
> any lock we may be holding?  In days of yore, it most definitely would
> turn your box into a doorstop if you tried hard enough.

I have a deadlock in ftest01/LTP which is cured by that.
The root-problem (as I understand it) is that !RT does
  schedule() -> sched_submit_work() -> blk_schedule_flush_plug()

on a lock contention (that is that bit_spinlock or rwsem in jbd2/ext4
for instance). On RT this does not happen because of tsk_is_pi_blocked()
check in sched_submit_work(). That check is needed because an additional
(rtmutex) lock can not be acquired at this point.
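
For reference, sched_submit_work() in that era looks roughly like this
(a sketch of mainline kernel/sched/core.c from memory, so treat the
details as approximate):

static inline void sched_submit_work(struct task_struct *tsk)
{
        if (!tsk->state || tsk_is_pi_blocked(tsk))
                return;
        /*
         * If we are going to sleep and we have plugged IO queued,
         * make sure to submit it to avoid deadlocks.
         */
        if (blk_needs_flush_plug(tsk))
                blk_schedule_flush_plug(tsk);
}

On RT, the tsk_is_pi_blocked() test is what short-circuits the flush for
rtmutex-based waits.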

With this change we get closer to what !RT does. In regard to that
migrate_disable() we could check if it is possible to do that after we
acquired the lock (which is something we tried before but failed due to
CPU-hotplug requirement, maybe something changed now) or flush the plug
before disabling migration if it is really a problem.

Your question whether or not delaying IO can cause any deadlocks is
something that I can't answer, and it is something that would affect !RT,
too. I tried to add lockdep to bit-spinlocks but this does not work
because one context acquires the lock and another does the unlock. It
has been explained to me that no deadlocks should happen as long as
the IO is flushed before we block/wait on a lock.

> Subject: rt: pull your plug before blocking
> Date: Wed Jul 18 14:43:15 CEST 2012
> From: Mike Galbraith 
> 
> Queued IO can lead to IO deadlock should a task require wakeup from a task
> which is blocked on that queued IO.
> 
> ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
> pulled.  dbench2 mutex owner is waiting for kjournald, who is waiting for
> the buffer queued by dbench1.  Game over.
> 
> Signed-off-by: Mike Galbraith 
> ---
>  kernel/locking/rtmutex.c |   18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> --- a/kernel/locking/rtmutex.c
> +++ b/kernel/locking/rtmutex.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "rtmutex_common.h"
>  
> @@ -930,8 +931,18 @@ static inline void rt_spin_lock_fastlock
>  
>   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
>   rt_mutex_deadlock_account_lock(lock, current);
> - else
> + else {
> + /*
> +  * We can't pull the plug if we're already holding a lock
> +  * else we can deadlock.  eg, if we're holding slab_lock,
> +  * ksoftirqd can block while processing BLOCK_SOFTIRQ after
> +  * having acquired q->queue_lock.  If _we_ then block on
> +  * that q->queue_lock while flushing our plug, deadlock.
> +  */
> + if (__migrate_disabled(current) < 2 && blk_needs_flush_plug(current))
> + blk_schedule_flush_plug(current);
>   slowfn(lock);
> + }
>  }
>  
>  static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
> @@ -1892,9 +1903,12 @@ rt_mutex_fastlock(struct rt_mutex *lock,
>   if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
>   rt_mutex_deadlock_account_lock(lock, current);
>   return 0;
> - } else
> + } else {
> + if (blk_needs_flush_plug(current))
> + blk_schedule_flush_plug(current);
>   return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK,
> ww_ctx);
> + }
>  }
>  
>  static inline int

Sebastian


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-18 Thread Mike Galbraith
On Fri, 2017-11-17 at 15:57 +0100, Sebastian Siewior wrote:
> On 2017-11-13 12:56:53 [-0500], Mikulas Patocka wrote:
> > Hi
> Hi,
> 
> > I'm submitting this patch for the CONFIG_PREEMPT_RT patch. It fixes 
> > deadlocks in device mapper when real time preemption is used.
> 
> applied, thank you.

Below is my 2012 3.0-rt version of that for reference; at that time we
were using slab, and slab_lock referenced below was a local_lock.  The
comment came from crash analysis of a deadlock I met before adding the
(yeah, hacky) __migrate_disabled() blocker.  At the time, that was not
an optional hack, you WOULD deadlock if you ground disks to fine powder
the way SUSE QA did.  Pulling the plug before blocking cured the
xfs/ext[34] IO deadlocks they griped about, but I had to add that hack
to not trade their nasty IO deadlocks for the more mundane variety.  So
my question is: are we sure that in the here and now flush won't want
any lock we may be holding?  In days of yore, it most definitely would
turn your box into a doorstop if you tried hard enough.
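
For readers without the block-layer context: "pulling the plug" means
flushing the per-task plug list.  A minimal usage sketch of the standard
plugging API as of the v4.14 era (from <linux/blkdev.h>, nothing specific
to this patch):

static void submit_with_plug(struct bio *bio)
{
        struct blk_plug plug;

        blk_start_plug(&plug);          /* current->plug now points at &plug */
        submit_bio(bio);                /* IO may be held back on the plug list */
        blk_finish_plug(&plug);         /* dispatch whatever was queued */
}

If the task goes to sleep before blk_finish_plug(), the queued IO only goes
out if something - normally sched_submit_work() on the way into schedule() -
flushes the plug, which is exactly what the blocked waiter in the scenario
below never gets to do.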

Subject: rt: pull your plug before blocking
Date: Wed Jul 18 14:43:15 CEST 2012
From: Mike Galbraith 

Queued IO can lead to IO deadlock should a task require wakeup from a task
which is blocked on that queued IO.

ext3: dbench1 queues a buffer, blocks on journal mutex, its plug is not
pulled.  dbench2 mutex owner is waiting for kjournald, who is waiting for
the buffer queued by dbench1.  Game over.

Signed-off-by: Mike Galbraith 
---
 kernel/locking/rtmutex.c |   18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rtmutex_common.h"
 
@@ -930,8 +931,18 @@ static inline void rt_spin_lock_fastlock
 
if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current)))
rt_mutex_deadlock_account_lock(lock, current);
-   else
+   else {
+   /*
+* We can't pull the plug if we're already holding a lock
+* else we can deadlock.  eg, if we're holding slab_lock,
+* ksoftirqd can block while processing BLOCK_SOFTIRQ after
+* having acquired q->queue_lock.  If _we_ then block on
+* that q->queue_lock while flushing our plug, deadlock.
+*/
+   if (__migrate_disabled(current) < 2 && blk_needs_flush_plug(current))
+   blk_schedule_flush_plug(current);
slowfn(lock);
+   }
 }
 
 static inline void rt_spin_lock_fastunlock(struct rt_mutex *lock,
@@ -1892,9 +1903,12 @@ rt_mutex_fastlock(struct rt_mutex *lock,
if (likely(rt_mutex_cmpxchg_acquire(lock, NULL, current))) {
rt_mutex_deadlock_account_lock(lock, current);
return 0;
-   } else
+   } else {
+   if (blk_needs_flush_plug(current))
+   blk_schedule_flush_plug(current);
return slowfn(lock, state, NULL, RT_MUTEX_MIN_CHAINWALK,
  ww_ctx);
+   }
 }
 
 static inline int


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-17 Thread Sebastian Siewior
On 2017-11-13 12:56:53 [-0500], Mikulas Patocka wrote:
> Hi
Hi,

> I'm submitting this patch for the CONFIG_PREEMPT_RT patch. It fixes 
> deadlocks in device mapper when real time preemption is used.

applied, thank you.

> Mikulas

Sebastian


Re: [PATCH PREEMPT RT] rt-mutex: fix deadlock in device mapper

2017-11-15 Thread Mikulas Patocka


On Tue, 14 Nov 2017, Sebastian Siewior wrote:

> [ minimize CC ]
> 
> On 2017-11-13 12:56:53 [-0500], Mikulas Patocka wrote:
> > Hi
> > 
> > I'm submitting this patch for the CONFIG_PREEMPT_RT patch. It fixes 
> > deadlocks in device mapper when real time preemption is used.
> 
> I run into a EXT4 deadlock which is fixed by your patch. I think we had
> earlier issues which were duct taped via
>fs-jbd2-pull-your-plug-when-waiting-for-space.patch
> 
> What I observed from that deadlock I meant to debug is that I had one
> task owning a bit_spinlock (buffer lock) and wanting W i_data_sem and
> another task owning W i_data_sem and waiting for the same bit_spinlock.
> Here it was wb_writeback() vs vfs_truncate() (to keep it short).

So, send the stacktraces to the VFS maintainers. This could deadlock on a 
non-RT kernel too.

> Could you please explain how this locking is supposed to work

The scenario for the deadlock is explained here 
https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
It was fixed by commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755

> and why RT deadlocks while !RT does not? Or does !RT rely on the flush 
> in sched_submit_work() which is often skipped in RT because most locks 
> are rtmutex based?

Yes - that's the reason. The non-rt kernel uses rt mutexes very rarely 
(they are only used in kernel/rcu/tree.h, include/linux/i2c.h and 
kernel/futex.c).

If the non-rt kernel used rt mutexes in the I/O stack, the deadlock would 
also happen there.

> Because if that is
> the case then we might get the deadlock upstream, too once we get a
> rtmutex somewhere in VFS (since I doubt it would be possible with a
> futex based test case).
> 
> > Mikulas
> 
> Sebastian

Mikulas