Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-14 Thread Dave Chinner
On Tue, Aug 13, 2019 at 12:35:49PM -0400, Mikulas Patocka wrote:
> 
> 
> On Sat, 10 Aug 2019, Dave Chinner wrote:
> 
> > No, you misunderstand. I'm talking about blocking kswapd being
> > wrong.  i.e. Blocking kswapd in shrinkers causes problems
> > because the memory reclaim code does not expect kswapd to be
> > arbitrarily delayed by waiting on IO. We've had this problem with
> > the XFS inode cache shrinker for years, and there are many reports
> > of extremely long reclaim latencies for both direct and kswapd
> > reclaim that result from kswapd not making progress while waiting
> > in shrinkers for IO to complete.
> > 
> > The work I'm currently doing to fix this XFS problem can be found
> > here:
> > 
> > https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-da...@fromorbit.com/
> > 
> > 
> > i.e. the point I'm making is that waiting for IO in kswapd reclaim
> > context is considered harmful - kswapd context shrinker reclaim
> > should be as non-blocking as possible, and any back-off to wait for
> > IO to complete should be done by the high level reclaim core once
> > it's completed an entire reclaim scan cycle of everything.
> > 
> > What follows from that, and is pertinent in this situation, is
> > that if you don't block kswapd, then other reclaim contexts are not
> > going to get stuck waiting for it regardless of the reclaim context
> > they use.
> > 
> > Cheers,
> > 
> > Dave.
> 
> So, what do you think the dm-bufio shrinker should do?

I'm not familiar with the constraints the code operates under, so
I can't guarantee that I have an answer for you... :/

> Currently it tries to free buffers on the clean list and if there are not 
> enough buffers on the clean list, it goes into the dirty list - it writes 
> the buffers back and then frees them.
> 
> What should it do? Should it just start writeback of the dirty list 
> without waiting for it? What should it do if all the buffers are under 
> writeback?

For kswapd, it should do what it can without blocking. e.g. kick
an async writer thread rather than submitting the IO itself. That's
what I changed XFS to do.

And if you look at the patchset in the above link, it also
introduced a mechanism for shrinkers to communicate back to the high
level reclaim code that kswapd needs to back off
(reclaim_state->need_backoff).

With this mechanism, the shrinker can start IO without blocking
kswapd on congested request queues and tell memory reclaim to wait
before calling this shrinker again. This allows kswapd to aggregate
all the waits that shrinkers and page reclaim require to make
progress into a single backoff event. That means kswapd does other
scanning work while background writeback goes on, and once
everything is scanned it does a single wait for everything that
needs time to make progress...
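
As a rough sketch of the shape I mean (untested pseudocode: the
my_*() helpers are placeholders for whatever the driver uses to free
clean objects and kick background writeback, and need_backoff only
exists in the patchset linked above, not in mainline):

static unsigned long my_shrink_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	unsigned long freed;

	/* Free whatever is already clean; never block on IO here. */
	freed = my_reclaim_clean_objects(sc->nr_to_scan);

	if (freed < sc->nr_to_scan && current_is_kswapd()) {
		/*
		 * kswapd context: don't submit or wait for IO.  Kick
		 * background writeback and tell the reclaim core to
		 * back off once this scan cycle completes.
		 */
		my_kick_async_writeback();
		current->reclaim_state->need_backoff = true;
		return freed;
	}

	/* Direct reclaim contexts that may block can wait for IO here. */
	return freed + my_writeback_and_reclaim(sc->nr_to_scan - freed);
}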

I think that should also work for the dm-bufio shrinker, and the
direct reclaim backoff parts of the patchset should work for
non-blocking direct reclaim scanning as well, like it now does for
XFS.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-13 Thread Mikulas Patocka



On Sat, 10 Aug 2019, Dave Chinner wrote:

> No, you misunderstand. I'm talking about blocking kswapd being
> wrong.  i.e. Blocking kswapd in shrinkers causes problems
> because the memory reclaim code does not expect kswapd to be
> arbitrarily delayed by waiting on IO. We've had this problem with
> the XFS inode cache shrinker for years, and there are many reports
> of extremely long reclaim latencies for both direct and kswapd
> reclaim that result from kswapd not making progress while waiting
> in shrinkers for IO to complete.
> 
> The work I'm currently doing to fix this XFS problem can be found
> here:
> 
> https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-da...@fromorbit.com/
> 
> 
> i.e. the point I'm making is that waiting for IO in kswapd reclaim
> context is considered harmful - kswapd context shrinker reclaim
> should be as non-blocking as possible, and any back-off to wait for
> IO to complete should be done by the high level reclaim core once
> it's completed an entire reclaim scan cycle of everything.
> 
> What follows from that, and is pertinent in this situation, is
> that if you don't block kswapd, then other reclaim contexts are not
> going to get stuck waiting for it regardless of the reclaim context
> they use.
> 
> Cheers,
> 
> Dave.

So, what do you think the dm-bufio shrinker should do?

Currently it tries to free buffers on the clean list and if there are not 
enough buffers on the clean list, it goes into the dirty list - it writes 
the buffers back and then frees them.

What should it do? Should it just start writeback of the dirty list 
without waiting for it? What should it do if all the buffers are under 
writeback?

Mikulas


Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-09 Thread Dave Chinner
On Fri, Aug 09, 2019 at 07:30:00AM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 9 Aug 2019, Dave Chinner wrote:
> 
> > And, FWIW, there's an argument to be made here that the underlying
> > bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
> > completions while holding a mutex that other IO-level reclaim
> > contexts require to make progress.
> > 
> > Cheers,
> > 
> > Dave.
> 
> The IO-level reclaim contexts should use GFP_NOIO. If the dm-bufio 
> shrinker is called with GFP_NOIO, it cannot be blocked by kswapd, because:

No, you misunderstand. I'm talking about blocking kswapd being
wrong.  i.e. Blocking kswapd in shrinkers causes problems
because the memory reclaim code does not expect kswapd to be
arbitrarily delayed by waiting on IO. We've had this problem with
the XFS inode cache shrinker for years, and there are many reports
of extremely long reclaim latencies for both direct and kswapd
reclaim that result from kswapd not making progress while waiting
in shrinkers for IO to complete.

The work I'm currently doing to fix this XFS problem can be found
here:

https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-da...@fromorbit.com/


i.e. the point I'm making is that waiting for IO in kswapd reclaim
context is considered harmful - kswapd context shrinker reclaim
should be as non-blocking as possible, and any back-off to wait for
IO to complete should be done by the high level reclaim core once
it's completed an entire reclaim scan cycle of everything.

What follows from that, and is pertinent in this situation, is
that if you don't block kswapd, then other reclaim contexts are not
going to get stuck waiting for it regardless of the reclaim context
they use.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-09 Thread Mikulas Patocka



On Fri, 9 Aug 2019, Dave Chinner wrote:

> And, FWIW, there's an argument to be made here that the underlying
> bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
> completions while holding a mutex that other IO-level reclaim
> contexts require to make progress.
> 
> Cheers,
> 
> Dave.

The IO-level reclaim contexts should use GFP_NOIO. If the dm-bufio 
shrinker is called with GFP_NOIO, it cannot be blocked by kswapd, because:
* it uses trylock to acquire the mutex
* it doesn't attempt to free buffers that are dirty or under I/O (see 
  __try_evict_buffer)

I think that this logic is sufficient to prevent the deadlock.
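
Roughly, the logic I'm referring to looks like this (a simplified
paraphrase of the dm-bufio shrinker, not the verbatim source; details
differ between kernel versions):

static unsigned long
bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	struct dm_bufio_client *c =
		container_of(shrink, struct dm_bufio_client, shrinker);
	unsigned long freed;

	/* Reclaim contexts that forbid FS recursion never sleep on the mutex. */
	if (sc->gfp_mask & __GFP_FS)
		dm_bufio_lock(c);
	else if (!dm_bufio_trylock(c))
		return SHRINK_STOP;

	/*
	 * __scan() uses __try_evict_buffer(), which skips buffers that
	 * are dirty or under IO when __GFP_FS is not set, so a GFP_NOIO
	 * caller never waits for writeback here.
	 */
	freed = __scan(c, sc->nr_to_scan, sc->gfp_mask);
	dm_bufio_unlock(c);
	return freed;
}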

Mikulas


Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-08 Thread Dave Chinner
On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
> 
> The obvious problem here is that in the call chain 
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
>  
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a 
> block device driver.
> 
> This patch changes the direct-io code to use GFP_NOIO.
> 
> PID: 474   TASK: 8813e11f4600  CPU: 10  COMMAND: "kswapd0"
>#0 [8813dedfb938] __schedule at 8173f405
>#1 [8813dedfb990] schedule at 8173fa27
>#2 [8813dedfb9b0] schedule_timeout at 81742fec
>#3 [8813dedfba60] io_schedule_timeout at 8173f186
>#4 [8813dedfbaa0] bit_wait_io at 8174034f
>#5 [8813dedfbac0] __wait_on_bit at 8173fec8
>#6 [8813dedfbb10] out_of_line_wait_on_bit at 8173ff81
>#7 [8813dedfbb90] __make_buffer_clean at a038736f [dm_bufio]
>#8 [8813dedfbbb0] __try_evict_buffer at a0387bb8 [dm_bufio]
>#9 [8813dedfbbd0] dm_bufio_shrink_scan at a0387cc3 [dm_bufio]
>   #10 [8813dedfbc40] shrink_slab at 811a87ce
>   #11 [8813dedfbd30] shrink_zone at 811ad778
>   #12 [8813dedfbdc0] kswapd at 811ae92f
>   #13 [8813dedfbec0] kthread at 810a8428
>   #14 [8813dedfbf50] ret_from_fork at 81745242
> 
>   PID: 14127  TASK: 881455749c00  CPU: 11  COMMAND: "loop1"
>#0 [88272f5af228] __schedule at 8173f405
>#1 [88272f5af280] schedule at 8173fa27
>#2 [88272f5af2a0] schedule_preempt_disabled at 8173fd5e
>#3 [88272f5af2b0] __mutex_lock_slowpath at 81741fb5
>#4 [88272f5af330] mutex_lock at 81742133
>#5 [88272f5af350] dm_bufio_shrink_count at a03865f9 [dm_bufio]
>#6 [88272f5af380] shrink_slab at 811a86bd
>#7 [88272f5af470] shrink_zone at 811ad778
>#8 [88272f5af500] do_try_to_free_pages at 811adb34
>#9 [88272f5af590] try_to_free_pages at 811adef8
>   #10 [88272f5af610] __alloc_pages_nodemask at 811a09c3
>   #11 [88272f5af710] alloc_pages_current at 811e8b71
>   #12 [88272f5af760] new_slab at 811f4523
>   #13 [88272f5af7b0] __slab_alloc at 8173a1b5
>   #14 [88272f5af880] kmem_cache_alloc at 811f484b
>   #15 [88272f5af8d0] do_blockdev_direct_IO at 812535b3
>   #16 [88272f5afb00] __blockdev_direct_IO at 81255dc3
>   #17 [88272f5afb30] xfs_vm_direct_IO at a01fe3fc [xfs]
>   #18 [88272f5afb90] generic_file_read_iter at 81198994

Um, what kernel is this? XFS stopped using __blockdev_direct_IO some
time around 4.8 or 4.9, IIRC. Perhaps it would be best to reproduce
problems on a TOT kernel first?

And, FWIW, there's an argument to be made here that the underlying
bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
completions while holding a mutex that other IO-level reclaim
contexts require to make progress.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-08 Thread Mikulas Patocka



On Thu, 8 Aug 2019, Matthew Wilcox wrote:

> On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> > A deadlock with this stacktrace was observed.
> > 
> > The obvious problem here is that in the call chain 
> > xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
> >  
> > we do a GFP_KERNEL allocation while we are in a filesystem driver and in a 
> > block device driver.
> 
> But that's not the problem.  The problem is the loop driver calls into the
> filesystem without calling memalloc_noio_save() / memalloc_noio_restore().
> There are dozens of places in XFS which use GFP_KERNEL allocations and
> all can trigger this same problem if called from the loop driver.

OK. I'll send a new patch that sets PF_MEMALLOC_NOIO in the loop driver.
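
Something along these lines (sketch only; the wrapper name and the
exact placement in drivers/block/loop.c are still to be decided):

/*
 * Mark the loop worker thread PF_MEMALLOC_NOIO so that every
 * allocation it makes while servicing IO for the backing file is
 * implicitly GFP_NOIO.
 */
static int loop_kthread_worker_fn(void *worker_ptr)
{
	current->flags |= PF_MEMALLOC_NOIO;
	return kthread_worker_fn(worker_ptr);
}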

Mikulas


Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-08 Thread Junxiao Bi

Hi Mikulas,

This doesn't seem to be an issue on mainline; the mutex in
dm_bufio_shrink_count() has been removed from mainline.


Thanks,

Junxiao.

On 8/8/19 2:50 AM, Mikulas Patocka wrote:

A deadlock with this stacktrace was observed.

The obvious problem here is that in the call chain
xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
we do a GFP_KERNEL allocation while we are in a filesystem driver and in a
block device driver.

This patch changes the direct-io code to use GFP_NOIO.

PID: 474   TASK: 8813e11f4600  CPU: 10  COMMAND: "kswapd0"
#0 [8813dedfb938] __schedule at 8173f405
#1 [8813dedfb990] schedule at 8173fa27
#2 [8813dedfb9b0] schedule_timeout at 81742fec
#3 [8813dedfba60] io_schedule_timeout at 8173f186
#4 [8813dedfbaa0] bit_wait_io at 8174034f
#5 [8813dedfbac0] __wait_on_bit at 8173fec8
#6 [8813dedfbb10] out_of_line_wait_on_bit at 8173ff81
#7 [8813dedfbb90] __make_buffer_clean at a038736f [dm_bufio]
#8 [8813dedfbbb0] __try_evict_buffer at a0387bb8 [dm_bufio]
#9 [8813dedfbbd0] dm_bufio_shrink_scan at a0387cc3 [dm_bufio]
   #10 [8813dedfbc40] shrink_slab at 811a87ce
   #11 [8813dedfbd30] shrink_zone at 811ad778
   #12 [8813dedfbdc0] kswapd at 811ae92f
   #13 [8813dedfbec0] kthread at 810a8428
   #14 [8813dedfbf50] ret_from_fork at 81745242

   PID: 14127  TASK: 881455749c00  CPU: 11  COMMAND: "loop1"
#0 [88272f5af228] __schedule at 8173f405
#1 [88272f5af280] schedule at 8173fa27
#2 [88272f5af2a0] schedule_preempt_disabled at 8173fd5e
#3 [88272f5af2b0] __mutex_lock_slowpath at 81741fb5
#4 [88272f5af330] mutex_lock at 81742133
#5 [88272f5af350] dm_bufio_shrink_count at a03865f9 [dm_bufio]
#6 [88272f5af380] shrink_slab at 811a86bd
#7 [88272f5af470] shrink_zone at 811ad778
#8 [88272f5af500] do_try_to_free_pages at 811adb34
#9 [88272f5af590] try_to_free_pages at 811adef8
   #10 [88272f5af610] __alloc_pages_nodemask at 811a09c3
   #11 [88272f5af710] alloc_pages_current at 811e8b71
   #12 [88272f5af760] new_slab at 811f4523
   #13 [88272f5af7b0] __slab_alloc at 8173a1b5
   #14 [88272f5af880] kmem_cache_alloc at 811f484b
   #15 [88272f5af8d0] do_blockdev_direct_IO at 812535b3
   #16 [88272f5afb00] __blockdev_direct_IO at 81255dc3
   #17 [88272f5afb30] xfs_vm_direct_IO at a01fe3fc [xfs]
   #18 [88272f5afb90] generic_file_read_iter at 81198994
   #19 [88272f5afc50] __dta_xfs_file_read_iter_2398 at a020c970 [xfs]
   #20 [88272f5afcc0] lo_rw_aio at a0377042 [loop]
   #21 [88272f5afd70] loop_queue_work at a0377c3b [loop]
   #22 [88272f5afe60] kthread_worker_fn at 810a8a0c
   #23 [88272f5afec0] kthread at 810a8428
   #24 [88272f5aff50] ret_from_fork at 81745242

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

---
  fs/direct-io.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2019-08-02 08:51:56.0 +0200
+++ linux-2.6/fs/direct-io.c	2019-08-08 11:22:18.0 +0200
@@ -436,7 +436,7 @@ dio_bio_alloc(struct dio *dio, struct di
 	 * bio_alloc() is guaranteed to return a bio when allowed to sleep and
 	 * we request a valid number of vectors.
 	 */
-	bio = bio_alloc(GFP_KERNEL, nr_vecs);
+	bio = bio_alloc(GFP_NOIO, nr_vecs);
 
 	bio_set_dev(bio, bdev);
 	bio->bi_iter.bi_sector = first_sector;
@@ -1197,7 +1197,7 @@ do_blockdev_direct_IO(struct kiocb *iocb
 	if (iov_iter_rw(iter) == READ && !count)
 		return 0;
 
-	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
+	dio = kmem_cache_alloc(dio_cache, GFP_NOIO);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;


Re: [PATCH] direct-io: use GFP_NOIO to avoid deadlock

2019-08-08 Thread Matthew Wilcox
On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
> 
> The obvious problem here is that in the call chain 
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
>  
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a 
> block device driver.

But that's not the problem.  The problem is the loop driver calls into the
filesystem without calling memalloc_noio_save() / memalloc_noio_restore().
There are dozens of places in XFS which use GFP_KERNEL allocations and
all can trigger this same problem if called from the loop driver.
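
i.e. something like this around the call into the backing filesystem
(illustrative only; issue_backing_io() is a placeholder, not a real
function in drivers/block/loop.c):

static int lo_issue_backing_io(struct loop_device *lo, struct request *rq)
{
	unsigned int noio_flags;
	int ret;

	/* Every allocation below this point is implicitly GFP_NOIO. */
	noio_flags = memalloc_noio_save();
	ret = issue_backing_io(lo, rq);		/* the real read/write work */
	memalloc_noio_restore(noio_flags);

	return ret;
}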

>   #14 [88272f5af880] kmem_cache_alloc at 811f484b
>   #15 [88272f5af8d0] do_blockdev_direct_IO at 812535b3
>   #16 [88272f5afb00] __blockdev_direct_IO at 81255dc3
>   #17 [88272f5afb30] xfs_vm_direct_IO at a01fe3fc [xfs]
>   #18 [88272f5afb90] generic_file_read_iter at 81198994
>   #19 [88272f5afc50] __dta_xfs_file_read_iter_2398 at a020c970 [xfs]
>   #20 [88272f5afcc0] lo_rw_aio at a0377042 [loop]
>   #21 [88272f5afd70] loop_queue_work at a0377c3b [loop]
>   #22 [88272f5afe60] kthread_worker_fn at 810a8a0c
>   #23 [88272f5afec0] kthread at 810a8428
>   #24 [88272f5aff50] ret_from_fork at 81745242