Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Michal Hocko
On Tue 04-09-18 13:30:44, Mikulas Patocka wrote:
> 
> 
> On Tue, 4 Sep 2018, Michal Hocko wrote:
> 
> > Regarding the other workloads. AFAIR the problem was due to the
> > wait_iff_congested in the direct reclaim. And I've been arguing that
> > special casing __GFP_NORETRY is not a proper way to handle that case.
> > We have PF_LESS_THROTTLE to handle cases where the caller cannot be
> > really throttled because it is a part of the congestion control. I dunno
> > what happened in that regard since then, though.
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> So, could you take this patch https://patchwork.kernel.org/patch/10505523/ 
> and commit it to the kernel?
> 
> You agreed with that patch, but you didn't commit it.

Because I do not maintain any tree that Linus pulls from. You need to
involve Andrew Morton to get this to the mm tree and then to Linus.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Mikulas Patocka



On Tue, 4 Sep 2018, Michal Hocko wrote:

> Regarding the other workloads. AFAIR the problem was due to the
> wait_iff_congested in the direct reclaim. And I've been arguing that
> special casing __GFP_NORETRY is not a proper way to handle that case.
> We have PF_LESS_THROTTLE to handle cases where the caller cannot be
> really throttled because it is a part of the congestion control. I dunno
> what happened in that regard since then, though.
> -- 
> Michal Hocko
> SUSE Labs

So, could you take this patch https://patchwork.kernel.org/patch/10505523/ 
and commit it to the kernel?

You agreed with that patch, but you didn't commit it.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Michal Hocko
On Tue 04-09-18 11:18:44, Mike Snitzer wrote:
> On Tue, Sep 04 2018 at  3:08am -0400,
> Michal Hocko  wrote:
> 
> > On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Wed, 1 Aug 2018, jing xia wrote:
> > > 
> > > > We reproduced this issue again and found out the root cause.
> > > > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > > > takes a long time in soft_limit_reclaim, because the memcg exceeds its
> > > > memory limit by a huge amount.
> > > > Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> > > > 
> > > > Any suggestions for this? Thanks.
> > > 
> > > There's hardly any solution because Michal Hocko refuses to change 
> > > __GFP_NORETRY behavior.
> > > 
> > > The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and 
> > > d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention 
> > > on the dm-bufio lock - the patches don't fix the high CPU consumption 
> > > inside the memory allocation, but the kernel code should wait less on the 
> > > bufio lock.
> > 
> > If you actually looked at the bottom line of the problem then you would
> > quickly find out that the dm-bufio lock is the least of the problems with
> > soft limit reclaim. This is a misfeature which has been merged and we
> > have to live with it. All we can do is discourage people from using
> > it and use the much saner low limit instead.
> > 
> > So please stop this stupid blaming and try to understand the reasoning
> > behind my arguments.
> 
> Yes, this bickering isn't productive.  Michal, your responses are pretty
> hard to follow.  I'm just trying to follow along on what it is you're
> saying should be done.  It isn't clear to me.
> 
> PLEASE, restate what we should be doing differently.  Or what changes
> need to happen outside of DM, etc.

For this particular case I can only recommend to not use the memcg soft
limit. This is guaranteed to stall and there is no way around it because
this is the semantic of the soft limit. Sad, I know.

Regarding the other workloads. AFAIR the problem was due to the
wait_iff_congested in the direct reclaim. And I've been arguing that
special casing __GFP_NORETRY is not a proper way to handle that case.
We have PF_LESS_THROTTLE to handle cases where the caller cannot be
really throttled because it is a part of the congestion control. I dunno
what happened in that regard since then, though.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Mike Snitzer
On Tue, Sep 04 2018 at  3:08am -0400,
Michal Hocko  wrote:

> On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 1 Aug 2018, jing xia wrote:
> > 
> > > We reproduced this issue again and found out the root cause.
> > > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > > takes a long time in soft_limit_reclaim, because the memcg exceeds its
> > > memory limit by a huge amount.
> > > Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> > > 
> > > Any suggestions for this? Thanks.
> > 
> > There's hardly any solution because Michal Hocko refuses to change 
> > __GFP_NORETRY behavior.
> > 
> > The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and 
> > d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention 
> > on the dm-bufio lock - the patches don't fix the high CPU consumption 
> > inside the memory allocation, but the kernel code should wait less on the 
> > bufio lock.
> 
> If you actually looked at the bottom line of the problem then you would
> quickly find out that the dm-bufio lock is the least of the problems with
> soft limit reclaim. This is a misfeature which has been merged and we
> have to live with it. All we can do is discourage people from using
> it and use the much saner low limit instead.
> 
> So please stop this stupid blaming and try to understand the reasoning
> behind my arguments.

Yes, this bickering isn't productive.  Michal, your responses are pretty
hard to follow.  I'm just trying to follow along on what it is you're
saying should be done.  It isn't clear to me.

PLEASE, restate what we should be doing differently.  Or what changes
need to happen outside of DM, etc.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Michal Hocko
On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> 
> 
> On Wed, 1 Aug 2018, jing xia wrote:
> 
> > We reproduced this issue again and found out the root cause.
> > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > takes a long time in soft_limit_reclaim, because the memcg exceeds its
> > memory limit by a huge amount.
> > Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> > 
> > Any suggestions for this? Thanks.
> 
> There's hardly any solution because Michal Hocko refuses to change 
> __GFP_NORETRY behavior.
> 
> The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and 
> d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention 
> on the dm-bufio lock - the patches don't fix the high CPU consumption 
> inside the memory allocation, but the kernel code should wait less on the 
> bufio lock.

If you actually looked at the bottom line of the problem then you would
quickly find out that the dm-bufio lock is the least of the problems with
soft limit reclaim. This is a misfeature which has been merged and we
have to live with it. All we can do is discourage people from using
it and use the much saner low limit instead.

So please stop this stupid blaming and try to understand the reasoning
behind my arguments.
-- 
Michal Evil Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-03 Thread Mikulas Patocka



On Wed, 1 Aug 2018, jing xia wrote:

> We reproduced this issue again and found out the root cause.
> dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> takes a long time in soft_limit_reclaim, because the memcg exceeds its
> memory limit by a huge amount.
> Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> 
> Any suggestions for this? Thanks.

There's hardly any solution because Michal Hocko refuses to change 
__GFP_NORETRY behavior.

The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and 
d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention 
on the dm-bufio lock - the patches don't fix the high CPU consumption 
inside the memory allocation, but the kernel code should wait less on the 
bufio lock.
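
(For illustration only - a generic sketch, not the actual dm-bufio code, of
the usual way a shrinker avoids piling up behind a lock whose holder may
itself be stuck in direct reclaim; "client" and client_free_some_buffers()
are made-up names:)

static unsigned long client_shrink_scan(struct shrinker *shrink,
                                        struct shrink_control *sc)
{
        struct client *c = container_of(shrink, struct client, shrinker);
        unsigned long freed;

        /*
         * Do not sleep on the lock: its holder may be blocked in direct
         * reclaim (soft limit reclaim, wait_iff_congested), so every task
         * running shrink_slab() would otherwise queue up here.
         */
        if (!mutex_trylock(&c->lock))
                return SHRINK_STOP;

        freed = client_free_some_buffers(c, sc->nr_to_scan);
        mutex_unlock(&c->lock);
        return freed;
}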

Mikulas


> On Thu, Jun 14, 2018 at 3:31 PM, Michal Hocko  wrote:
> > On Thu 14-06-18 15:18:58, jing xia wrote:
> > [...]
> >> PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> >>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> >>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> >>  #2 [ffc0282af450] schedule at ff8008850f4c
> >>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> >>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> >>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> >
> > This trace doesn't provide the full picture unfortunately. Waiting in
> > the direct reclaim means that the underlying bdi is congested. The real
> > question is why it doesn't flush IO in time.
> >
> >>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
> >>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
> >>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
> >>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
> >> #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
> >> #11 [ffc0282af990] try_to_free_pages at ff8008178f3c
> >> #12 [ffc0282afa30] __perform_reclaim at ff8008169130
> >> #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
> >> #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
> >> #15 [ffc0282afbe0] alloc_buffer at ff8008591a94
> >> #16 [ffc0282afc20] __bufio_new at ff8008592e94
> >> #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
> >> #18 [ffc0282afd20] verity_prefetch_io at ff8008598384
> >> #19 [ffc0282afd70] process_one_work at ff80080b5b3c
> >> #20 [ffc0282afdc0] worker_thread at ff80080b64fc
> >> #21 [ffc0282afe20] kthread at ff80080bae34
> >>
> >> > Mikulas
> >
> > --
> > Michal Hocko
> > SUSE Labs
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-08-05 Thread Michal Hocko
On Wed 01-08-18 10:48:00, jing xia wrote:
> We reproduced this issue again and found out the root cause.
> dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> takes a long time in soft_limit_reclaim, because the memcg exceeds its
> memory limit by a huge amount.
> Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> 
> Any suggestions for this? Thanks.

Do not use soft limit?
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-08-01 Thread jing xia
We reproduced this issue again and found out the root cause.
dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
takes a long time in soft_limit_reclaim, because the memcg exceeds its
memory limit by a huge amount.
Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.

Any suggestions for this? Thanks.

On Thu, Jun 14, 2018 at 3:31 PM, Michal Hocko  wrote:
> On Thu 14-06-18 15:18:58, jing xia wrote:
> [...]
>> PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
>>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>>  #2 [ffc0282af450] schedule at ff8008850f4c
>>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
>
> This trace doesn't provide the full picture unfortunately. Waiting in
> the direct reclaim means that the underlying bdi is congested. The real
> question is why it doesn't flush IO in time.
>
>>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
>> #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
>> #11 [ffc0282af990] try_to_free_pages at ff8008178f3c
>> #12 [ffc0282afa30] __perform_reclaim at ff8008169130
>> #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
>> #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
>> #15 [ffc0282afbe0] alloc_buffer at ff8008591a94
>> #16 [ffc0282afc20] __bufio_new at ff8008592e94
>> #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
>> #18 [ffc0282afd20] verity_prefetch_io at ff8008598384
>> #19 [ffc0282afd70] process_one_work at ff80080b5b3c
>> #20 [ffc0282afdc0] worker_thread at ff80080b64fc
>> #21 [ffc0282afe20] kthread at ff80080bae34
>>
>> > Mikulas
>
> --
> Michal Hocko
> SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-29 Thread Michal Hocko
On Thu 28-06-18 22:43:29, Mikulas Patocka wrote:
> 
> 
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> 
> > On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Mon, 25 Jun 2018, Michal Hocko wrote:
> > > 
> > > > > And the throttling in dm-bufio prevents kswapd from making forward 
> > > > > progress, causing this situation...
> > > > 
> > > > Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> > > > circles like that? Are you even listening?
> > > > 
> > > > [...]
> > > > 
> > > > > And so what do you want to do to prevent block drivers from sleeping?
> > > > 
> > > > use the existing means we have.
> > > > -- 
> > > > Michal Hocko
> > > > SUSE Labs
> > > 
> > > So - do you want this patch?
> > > 
> > > There is no behavior difference between changing the allocator (so that 
> > > it 
> > > implies PF_THROTTLE_LESS for block drivers) and changing all the block 
> > > drivers to explicitly set PF_THROTTLE_LESS.
> > 
> > As long as you can reliably detect those users. And using gfp_mask is
> 
> You can detect them if __GFP_IO is not set and __GFP_NORETRY is set. You 
> can grep the kernel for __GFP_NORETRY to find all the users.

It seems that arguing doesn't make much sense here. I will not repeat
myself...

> > about the worst way to achieve that because users tend to be creative
> > when it comes to using gfp mask. PF_THROTTLE_LESS in general is a
> > way to tell the allocator that _you_ are the one to help the reclaim by
> > cleaning data.
> 
> But using PF_LESS_THROTTLE explicitly adds more lines of code than 
> implying PF_LESS_THROTTLE in the allocator.

Yes, and it will also make the code more explicit about the intention and
so it will be easier to maintain long term.

> From: Mikulas Patocka 
> Subject: [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o
> 
> When doing __GFP_NORETRY allocation, the system may sleep in
> wait_iff_congested if there are too many dirty pages. Unfortunately this
> sleeping may slow down kswapd, preventing it from doing writeback and
> resolving the congestion.

This description is misleading at best if not outright wrong.

        if (!sc->hibernation_mode && !current_is_kswapd() &&
            current_may_throttle() && pgdat_memcg_congested(pgdat, root))
                wait_iff_congested(BLK_RW_ASYNC, HZ/10);

so this is an explicit throttling of the direct reclaim.

So I would use the following wording instead:
"
It has been noticed that congestion throttling can slow down allocation
paths that participate in the IO and thus help the memory reclaim.
Stalling those allocations is therefore not productive. Moreover, the
mempool allocator and the md variants of the same already implement their
own throttling, which is feedback driven and therefore better. Stalling at
the page allocator is thus even counterproductive.

PF_LESS_THROTTLE is a task flag denoting an allocation context that
participates in the memory reclaim, which fits the model of these IO
paths, so use the flag to make the page allocator aware that such tasks
are special and do not really want any dirty data throttling.


"

With a clearer patch description and some data to back it up, you
can add

Acked-by: Michal Hocko  # mempool_alloc and bvec_alloc

I cannot really comment on other md allocators though because I am not
familiar with those. 

> This patch fixes it by setting PF_LESS_THROTTLE when allocating memory for
> block device drivers.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org
> 
> ---
>  block/bio.c   |4 
>  drivers/md/dm-bufio.c |   14 +++---
>  drivers/md/dm-crypt.c |8 
>  drivers/md/dm-integrity.c |4 
>  drivers/md/dm-kcopyd.c|3 +++
>  drivers/md/dm-verity-target.c |4 
>  drivers/md/dm-writecache.c|4 
>  mm/mempool.c  |4 
>  8 files changed, 42 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/mm/mempool.c
> ===
> --- linux-2.6.orig/mm/mempool.c   2018-06-29 03:47:16.29000 +0200
> +++ linux-2.6/mm/mempool.c2018-06-29 03:47:16.27000 +0200
> @@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
>   unsigned long flags;
>   wait_queue_entry_t wait;
>   gfp_t gfp_temp;
> + unsigned old_flags;
>  
>   VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
>   might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> @@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
>  
>  repeat_alloc:
>  
> + old_flags = current->flags & PF_LESS_THROTTLE;
> + current->flags |= PF_LESS_THROTTLE;
>   element = pool->alloc(gfp_temp, pool->pool_data);
> + current_restore_flags(old_flags, PF_LESS_THROTTLE);
>   if (likely(element != NULL))
>   return element;
>  
> Index: linux-2.6/block/bio.c
> ===
> --- 

Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-28 Thread Mikulas Patocka



On Mon, 25 Jun 2018, Michal Hocko wrote:

> On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 25 Jun 2018, Michal Hocko wrote:
> > 
> > > > And the throttling in dm-bufio prevents kswapd from making forward 
> > > > progress, causing this situation...
> > > 
> > > Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> > > circles like that? Are you even listening?
> > > 
> > > [...]
> > > 
> > > > And so what do you want to do to prevent block drivers from sleeping?
> > > 
> > > use the existing means we have.
> > > -- 
> > > Michal Hocko
> > > SUSE Labs
> > 
> > So - do you want this patch?
> > 
> > There is no behavior difference between changing the allocator (so that it 
> > implies PF_THROTTLE_LESS for block drivers) and changing all the block 
> > drivers to explicitly set PF_THROTTLE_LESS.
> 
> As long as you can reliably detect those users. And using gfp_mask is

You can detect them if __GFP_IO is not set and __GFP_NORETRY is set. You 
can grep the kernel for __GFP_NORETRY to find all the users.
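
(In code, the detection described above would read something like this - just
a sketch of the predicate, not a proposed patch:)

        if ((gfp_mask & (__GFP_NORETRY | __GFP_IO)) == __GFP_NORETRY) {
                /* __GFP_NORETRY set and __GFP_IO clear: treat as an IO-path caller */
        }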

> about the worst way to achieve that because users tend to be creative
> when it comes to using gfp mask. PF_THROTTLE_LESS in general is a
> way to tell the allocator that _you_ are the one to help the reclaim by
> cleaning data.

But using PF_LESS_THROTTLE explicitly adds more lines of code than 
implying PF_LESS_THROTTLE in the allocator.

> > But if you insist that the allocator can't be changed, we have to repeat 
> > the same code over and over again in the block drivers.
> 
> I am not familiar with the patched code but mempool change at least
> makes sense (bvec_alloc seems to fallback to mempool which then makes
> sense as well). If others in md/ do the same thing
> 
> I would just use current_restore_flags rather than open code it.
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs

So, do you accept this patch?

Mikulas



From: Mikulas Patocka 
Subject: [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o

When doing __GFP_NORETRY allocation, the system may sleep in
wait_iff_congested if there are too many dirty pages. Unfortunately this
sleeping may slow down kswapd, preventing it from doing writeback and
resolving the congestion.

This patch fixes it by setting PF_LESS_THROTTLE when allocating memory for
block device drivers.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

---
 block/bio.c   |4 
 drivers/md/dm-bufio.c |   14 +++---
 drivers/md/dm-crypt.c |8 
 drivers/md/dm-integrity.c |4 
 drivers/md/dm-kcopyd.c|3 +++
 drivers/md/dm-verity-target.c |4 
 drivers/md/dm-writecache.c|4 
 mm/mempool.c  |4 
 8 files changed, 42 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/mempool.c
===
--- linux-2.6.orig/mm/mempool.c 2018-06-29 03:47:16.29000 +0200
+++ linux-2.6/mm/mempool.c  2018-06-29 03:47:16.27000 +0200
@@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
unsigned long flags;
wait_queue_entry_t wait;
gfp_t gfp_temp;
+   unsigned old_flags;
 
VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
@@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
 
 repeat_alloc:
 
+   old_flags = current->flags & PF_LESS_THROTTLE;
+   current->flags |= PF_LESS_THROTTLE;
element = pool->alloc(gfp_temp, pool->pool_data);
+   current_restore_flags(old_flags, PF_LESS_THROTTLE);
if (likely(element != NULL))
return element;
 
Index: linux-2.6/block/bio.c
===
--- linux-2.6.orig/block/bio.c  2018-06-29 03:47:16.29000 +0200
+++ linux-2.6/block/bio.c   2018-06-29 03:47:16.27000 +0200
@@ -217,6 +217,7 @@ fallback:
} else {
struct biovec_slab *bvs = bvec_slabs + *idx;
gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | 
__GFP_IO);
+   unsigned old_flags;
 
/*
 * Make this allocation restricted and don't dump info on
@@ -229,7 +230,10 @@ fallback:
 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
 * is set, retry with the 1-entry mempool
 */
+   old_flags = current->flags & PF_LESS_THROTTLE;
+   current->flags |= PF_LESS_THROTTLE;
bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
+   current_restore_flags(old_flags, PF_LESS_THROTTLE);
if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
*idx = BVEC_POOL_MAX;
goto fallback;
Index: linux-2.6/drivers/md/dm-bufio.c
===
--- linux-2.6.orig/drivers/md/dm-bufio.c

Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Michal Hocko
On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> 
> 
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> 
> > > And the throttling in dm-bufio prevents kswapd from making forward 
> > > progress, causing this situation...
> > 
> > Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> > circles like that? Are you even listening?
> > 
> > [...]
> > 
> > > And so what do you want to do to prevent block drivers from sleeping?
> > 
> > use the existing means we have.
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> So - do you want this patch?
> 
> There is no behavior difference between changing the allocator (so that it 
> implies PF_THROTTLE_LESS for block drivers) and changing all the block 
> drivers to explicitly set PF_THROTTLE_LESS.

As long as you can reliably detect those users. And using gfp_mask is
about the worst way to achieve that because users tend to be creative
when it comes to using gfp mask. PF_THROTTLE_LESS in general is a
way to tell the allocator that _you_ are the one to help the reclaim by
cleaning data.
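
(For reference, the check that gates this throttling in mm/vmscan.c looks
roughly like the following - paraphrased from memory, so treat it as a
sketch rather than a verbatim quote:)

static bool current_may_throttle(void)
{
        /* PF_LESS_THROTTLE tasks are throttled only on their own congested bdi */
        return !(current->flags & PF_LESS_THROTTLE) ||
                current->backing_dev_info == NULL ||
                bdi_write_congested(current->backing_dev_info);
}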

> But if you insist that the allocator can't be changed, we have to repeat 
> the same code over and over again in the block drivers.

I am not familiar with the patched code but mempool change at least
makes sense (bvec_alloc seems to fallback to mempool which then makes
sense as well). If others in md/ do the same thing

I would just use current_restore_flags rather than open code it.
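
(i.e. something along these lines instead of rebuilding current->flags by
hand - a sketch against the mempool hunk of the patch being discussed, where
only the PF_LESS_THROTTLE bit of the saved value matters:)

        unsigned int pflags = current->flags & PF_LESS_THROTTLE;

        current->flags |= PF_LESS_THROTTLE;
        element = pool->alloc(gfp_temp, pool->pool_data);
        /* restore the saved PF_LESS_THROTTLE bit, leave other flags untouched */
        current_restore_flags(pflags, PF_LESS_THROTTLE);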

Thanks!
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Mikulas Patocka



On Mon, 25 Jun 2018, Michal Hocko wrote:

> > And the throttling in dm-bufio prevents kswapd from making forward 
> > progress, causing this situation...
> 
> Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> circles like that? Are you even listening?
> 
> [...]
> 
> > And so what do you want to do to prevent block drivers from sleeping?
> 
> use the existing means we have.
> -- 
> Michal Hocko
> SUSE Labs

So - do you want this patch?

There is no behavior difference between changing the allocator (so that it 
implies PF_THROTTLE_LESS for block drivers) and changing all the block 
drivers to explicitly set PF_THROTTLE_LESS.

But if you insist that the allocator can't be changed, we have to repeat 
the same code over and over again in the block drivers.

Mikulas



---
 block/bio.c   |4 
 drivers/md/dm-bufio.c |   14 +++---
 drivers/md/dm-crypt.c |8 
 drivers/md/dm-integrity.c |4 
 drivers/md/dm-kcopyd.c|3 +++
 drivers/md/dm-verity-target.c |4 
 drivers/md/dm-writecache.c|4 
 mm/mempool.c  |4 
 8 files changed, 42 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/mempool.c
===
--- linux-2.6.orig/mm/mempool.c 2018-06-25 16:32:19.21000 +0200
+++ linux-2.6/mm/mempool.c  2018-06-25 16:32:19.2 +0200
@@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
unsigned long flags;
wait_queue_entry_t wait;
gfp_t gfp_temp;
+   unsigned old_flags;
 
VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
@@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
 
 repeat_alloc:
 
+   old_flags = current->flags & PF_LESS_THROTTLE;
+   current->flags |= PF_LESS_THROTTLE;
element = pool->alloc(gfp_temp, pool->pool_data);
+   current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
if (likely(element != NULL))
return element;
 
Index: linux-2.6/block/bio.c
===
--- linux-2.6.orig/block/bio.c  2018-06-25 16:32:19.21000 +0200
+++ linux-2.6/block/bio.c   2018-06-25 16:32:19.2 +0200
@@ -217,6 +217,7 @@ fallback:
} else {
struct biovec_slab *bvs = bvec_slabs + *idx;
gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | 
__GFP_IO);
+   unsigned old_flags;
 
/*
 * Make this allocation restricted and don't dump info on
@@ -229,7 +230,10 @@ fallback:
 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
 * is set, retry with the 1-entry mempool
 */
+   old_flags = current->flags & PF_LESS_THROTTLE;
+   current->flags |= PF_LESS_THROTTLE;
bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
+   current->flags = (current->flags & ~PF_LESS_THROTTLE) | 
old_flags;
if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
*idx = BVEC_POOL_MAX;
goto fallback;
Index: linux-2.6/drivers/md/dm-bufio.c
===
--- linux-2.6.orig/drivers/md/dm-bufio.c2018-06-25 16:32:19.21000 
+0200
+++ linux-2.6/drivers/md/dm-bufio.c 2018-06-25 16:32:19.2 +0200
@@ -356,6 +356,7 @@ static void __cache_size_refresh(void)
 static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
   unsigned char *data_mode)
 {
+   void *ptr;
if (unlikely(c->slab_cache != NULL)) {
*data_mode = DATA_MODE_SLAB;
return kmem_cache_alloc(c->slab_cache, gfp_mask);
@@ -363,9 +364,14 @@ static void *alloc_buffer_data(struct dm
 
if (c->block_size <= KMALLOC_MAX_SIZE &&
gfp_mask & __GFP_NORETRY) {
+   unsigned old_flags;
*data_mode = DATA_MODE_GET_FREE_PAGES;
-   return (void *)__get_free_pages(gfp_mask,
+   old_flags = current->flags & PF_LESS_THROTTLE;
+   current->flags |= PF_LESS_THROTTLE;
+   ptr = (void *)__get_free_pages(gfp_mask,
c->sectors_per_block_bits - 
(PAGE_SHIFT - SECTOR_SHIFT));
+   current->flags = (current->flags & ~PF_LESS_THROTTLE) | 
old_flags;
+   return ptr;
}
 
*data_mode = DATA_MODE_VMALLOC;
@@ -381,8 +387,10 @@ static void *alloc_buffer_data(struct dm
 */
if (gfp_mask & __GFP_NORETRY) {
unsigned noio_flag = memalloc_noio_save();
-   void *ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
-
+   unsigned old_flags = current->flags & PF_LESS_THROTTLE;
+  

Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Michal Hocko
On Mon 25-06-18 09:53:34, Mikulas Patocka wrote:
> 
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > 
> > > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > > > 
> > > > > 
> > > > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > > > 
> > > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set 
> > > > > > > > (i.e. the 
> > > > > > > > request comes from a block device driver or a filesystem), we 
> > > > > > > > should not 
> > > > > > > > sleep.
> > > > > > > 
> > > > > > > Why? How are you going to audit all the callers that the behavior 
> > > > > > > makes
> > > > > > > sense and moreover how are you going to ensure that future usage 
> > > > > > > will
> > > > > > > still make sense. The more subtle side effects gfp flags have the 
> > > > > > > harder
> > > > > > > they are to maintain.
> > > > > > 
> > > > > > So just as an exercise. Try to explain the above semantic to 
> > > > > > users. We
> > > > > > currently have the following.
> > > > > > 
> > > > > >  * __GFP_NORETRY: The VM implementation will try only very 
> > > > > > lightweight
> > > > > >  *   memory direct reclaim to get some memory under memory pressure 
> > > > > > (thus
> > > > > >  *   it can sleep). It will avoid disruptive actions like OOM 
> > > > > > killer. The
> > > > > >  *   caller must handle the failure which is quite likely to happen 
> > > > > > under
> > > > > >  *   heavy memory pressure. The flag is suitable when failure can 
> > > > > > easily be
> > > > > >  *   handled at small cost, such as reduced throughput
> > > > > > 
> > > > > >  * __GFP_FS can call down to the low-level FS. Clearing the flag 
> > > > > > avoids the
> > > > > >  *   allocator recursing into the filesystem which might already be 
> > > > > > holding
> > > > > >  *   locks.
> > > > > > 
> > > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? 
> > > > > > What
> > > > > > is the actual semantic without explaining the whole reclaim or force
> > > > > > users to look into the code to understand that? What about GFP_NOIO 
> > > > > > |
> > > > > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > > > > > shrinkers have to follow that as well?
> > > > > 
> > > > > My reasoning was that there is broken code that uses __GFP_NORETRY 
> > > > > and 
> > > > > assumes that it can't fail - so conditioning the change on !__GFP_FS 
> > > > > would 
> > > > > minimize the disruption to the broken code.
> > > > > 
> > > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> > > > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with 
> > > > > that.
> > > > 
> > > > As I've already said, this is a subtle change which is really hard to
> > > > reason about. Throttling on congestion has its meaning and reason. Look
> > > > at why we are doing that in the first place. You cannot simply say this
> > > 
> > > So - explain why throttling is needed. You support throttling, I don't,
> > > so you have to explain it :)
> > > 
> > > > is ok based on your specific usecase. We do have means to achieve that.
> > > > It is explicit and thus it will be applied only where it makes sense.
> > > > You keep repeating that implicit behavior change for everybody is
> > > > better.
> > > 
> > > I don't want to change it for everybody. I want to change it for block 
> > > device drivers. I don't care what you do with non-block drivers.
> > 
> > Well, it is usually the onus of the patch submitter to justify any change.
> > But let me be nice to you, for once. This throttling is triggered only
> > if all the pages we have encountered during the reclaim attempt are
> > dirty and that means that we are rushing through the LRU list quicker
> > than flushers are able to clean. If we didn't throttle we could hit
> > stronger reclaim priorities (aka scan more to reclaim memory) and
> > reclaim more pages as a result.
> 
> And the throttling in dm-bufio prevents kswapd from making forward 
> progress, causing this situation...

Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
circles like that? Are you even listening?

[...]

> And so what do you want to do to prevent block drivers from sleeping?

use the existing means we have.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Mikulas Patocka

On Mon, 25 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > 
> > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > > 
> > > > 
> > > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > > 
> > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > > [...]
> > > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set 
> > > > > > > (i.e. the 
> > > > > > > request comes from a block device driver or a filesystem), we 
> > > > > > > should not 
> > > > > > > sleep.
> > > > > > 
> > > > > > Why? How are you going to audit all the callers that the behavior 
> > > > > > makes
> > > > > > sense and moreover how are you going to ensure that future usage 
> > > > > > will
> > > > > > still make sense. The more subtle side effects gfp flags have the 
> > > > > > harder
> > > > > > they are to maintain.
> > > > > 
> > > > > So just as an exercise. Try to explain the above semantic to users. 
> > > > > We
> > > > > currently have the following.
> > > > > 
> > > > >  * __GFP_NORETRY: The VM implementation will try only very lightweight
> > > > >  *   memory direct reclaim to get some memory under memory pressure 
> > > > > (thus
> > > > >  *   it can sleep). It will avoid disruptive actions like OOM killer. 
> > > > > The
> > > > >  *   caller must handle the failure which is quite likely to happen 
> > > > > under
> > > > >  *   heavy memory pressure. The flag is suitable when failure can 
> > > > > easily be
> > > > >  *   handled at small cost, such as reduced throughput
> > > > > 
> > > > >  * __GFP_FS can call down to the low-level FS. Clearing the flag 
> > > > > avoids the
> > > > >  *   allocator recursing into the filesystem which might already be 
> > > > > holding
> > > > >  *   locks.
> > > > > 
> > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? 
> > > > > What
> > > > > is the actual semantic without explaining the whole reclaim or force
> > > > > users to look into the code to understand that? What about GFP_NOIO |
> > > > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > > > > shrinkers have to follow that as well?
> > > > 
> > > > My reasoning was that there is broken code that uses __GFP_NORETRY and 
> > > > assumes that it can't fail - so conditioning the change on !__GFP_FS 
> > > > would 
> > > > minimize the disruption to the broken code.
> > > > 
> > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> > > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with 
> > > > that.
> > > 
> > > As I've already said, this is a subtle change which is really hard to
> > > reason about. Throttling on congestion has its meaning and reason. Look
> > > at why we are doing that in the first place. You cannot simply say this
> > 
> > So - explain why throttling is needed. You support throttling, I don't, so 
> > you have to explain it :)
> > 
> > > is ok based on your specific usecase. We do have means to achieve that.
> > > It is explicit and thus it will be applied only where it makes sense.
> > > You keep repeating that implicit behavior change for everybody is
> > > better.
> > 
> > I don't want to change it for everybody. I want to change it for block 
> > device drivers. I don't care what you do with non-block drivers.
> 
> Well, it is usually the onus of the patch submitter to justify any change.
> But let me be nice to you, for once. This throttling is triggered only
> if all the pages we have encountered during the reclaim attempt are
> dirty and that means that we are rushing through the LRU list quicker
> than flushers are able to clean. If we didn't throttle we could hit
> stronger reclaim priorities (aka scan more to reclaim memory) and
> reclaim more pages as a result.

And the throttling in dm-bufio prevents kswapd from making forward 
progress, causing this situation...

> > I'm sure you'll come up with another creative excuse why GFP_NORETRY
> > allocations need to incur deliberate 100ms delays in block device drivers.
>
> ... is not really productive. I've tried to explain why I am not _sure_ what
> possible side effects such a change might have and your hand waving
> didn't really convince me. MD is not the only user of the page
> allocator...

But you are just doing that now - you're just coming up with another great 
excuse why block device drivers need to sleep 100ms. The system slows to a 
crawl when block device requests take 100ms and you - instead of fixing it 
- are just arguing that it is needed.

> > > I guess we will not agree on that part. I consider it a hack
> > > rather than a systematic solution. I can easily imagine that we just
> > > find out other call sites that would cause over reclaim or similar
> > 
> > If a __GFP_NORETRY allocation does overreclaim - it could be fixed by 
> > returning NULL instead of doing overreclaim.

Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Michal Hocko
On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> 
> 
> On Fri, 22 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > 
> > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > [...]
> > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set 
> > > > > > (i.e. the 
> > > > > > request comes from a block device driver or a filesystem), we 
> > > > > > should not 
> > > > > > sleep.
> > > > > 
> > > > > Why? How are you going to audit all the callers that the behavior 
> > > > > makes
> > > > > sense and moreover how are you going to ensure that future usage will
> > > > > still make sense. The more subtle side effects gfp flags have the 
> > > > > harder
> > > > > they are to maintain.
> > > > 
> > > > So just as an exercise. Try to explain the above semantic to users. We
> > > > currently have the following.
> > > > 
> > > >  * __GFP_NORETRY: The VM implementation will try only very lightweight
> > > >  *   memory direct reclaim to get some memory under memory pressure 
> > > > (thus
> > > >  *   it can sleep). It will avoid disruptive actions like OOM killer. 
> > > > The
> > > >  *   caller must handle the failure which is quite likely to happen 
> > > > under
> > > >  *   heavy memory pressure. The flag is suitable when failure can 
> > > > easily be
> > > >  *   handled at small cost, such as reduced throughput
> > > > 
> > > >  * __GFP_FS can call down to the low-level FS. Clearing the flag avoids 
> > > > the
> > > >  *   allocator recursing into the filesystem which might already be 
> > > > holding
> > > >  *   locks.
> > > > 
> > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > > > is the actual semantic without explaining the whole reclaim or force
> > > > users to look into the code to understand that? What about GFP_NOIO |
> > > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > > > shrinkers have to follow that as well?
> > > 
> > > My reasoning was that there is broken code that uses __GFP_NORETRY and 
> > > assumes that it can't fail - so conditioning the change on !__GFP_FS 
> > > would 
> > > minimize the disruption to the broken code.
> > > 
> > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.
> > 
> > As I've already said, this is a subtle change which is really hard to
> > reason about. Throttling on congestion has its meaning and reason. Look
> > at why we are doing that in the first place. You cannot simply say this
> 
> So - explain why throttling is needed. You support throttling, I don't, so 
> you have to explain it :)
> 
> > is ok based on your specific usecase. We do have means to achieve that.
> > It is explicit and thus it will be applied only where it makes sense.
> > You keep repeating that implicit behavior change for everybody is
> > better.
> 
> I don't want to change it for everybody. I want to change it for block 
> device drivers. I don't care what you do with non-block drivers.

Well, it is usually the onus of the patch submitter to justify any change.
But let me be nice to you, for once. This throttling is triggered only
if all the pages we have encountered during the reclaim attempt are
dirty and that means that we are rushing through the LRU list quicker
than flushers are able to clean. If we didn't throttle we could hit
stronger reclaim priorities (aka scan more to reclaim memory) and
reclaim more pages as a result.

> > I guess we will not agree on that part. I consider it a hack
> > rather than a systematic solution. I can easily imagine that we just
> > find out other call sites that would cause over reclaim or similar
> 
> If a __GFP_NORETRY allocation does overreclaim - it could be fixed by 
> returning NULL instead of doing overreclaim. The specification says that 
> callers must handle failure of __GFP_NORETRY allocations.
> 
> So yes - if you think that just skipping throttling on __GFP_NORETRY could 
> cause excessive CPU consumption trying to reclaim unreclaimable pages or 
> something like that - then you can add more points where the __GFP_NORETRY 
> is failed with NULL to avoid the excessive CPU consumption.

Which is exactly something I do not want to do. Spread __GFP_NORETRY all
over the reclaim code. Especially for something we already have means
for...
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Mikulas Patocka



On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > 
> > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > [...]
> > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. 
> > > > > the 
> > > > > request comes from a block device driver or a filesystem), we should 
> > > > > not 
> > > > > sleep.
> > > > 
> > > > Why? How are you going to audit all the callers that the behavior makes
> > > > sense and moreover how are you going to ensure that future usage will
> > > > still make sense. The more subtle side effects gfp flags have the harder
> > > > they are to maintain.
> > > 
> > > So just as an exercise. Try to explain the above semantic to users. We
> > > currently have the following.
> > > 
> > >  * __GFP_NORETRY: The VM implementation will try only very lightweight
> > >  *   memory direct reclaim to get some memory under memory pressure (thus
> > >  *   it can sleep). It will avoid disruptive actions like OOM killer. The
> > >  *   caller must handle the failure which is quite likely to happen under
> > >  *   heavy memory pressure. The flag is suitable when failure can easily 
> > > be
> > >  *   handled at small cost, such as reduced throughput
> > > 
> > >  * __GFP_FS can call down to the low-level FS. Clearing the flag avoids 
> > > the
> > >  *   allocator recursing into the filesystem which might already be 
> > > holding
> > >  *   locks.
> > > 
> > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > > is the actual semantic without explaining the whole reclaim or force
> > > users to look into the code to understand that? What about GFP_NOIO |
> > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > > shrinkers have to follow that as well?
> > 
> > My reasoning was that there is broken code that uses __GFP_NORETRY and 
> > assumes that it can't fail - so conditioning the change on !__GFP_FS would 
> > minimize the disruption to the broken code.
> > 
> > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.
> 
> As I've already said, this is a subtle change which is really hard to
> reason about. Throttling on congestion has its meaning and reason. Look
> at why we are doing that in the first place. You cannot simply say this

So - explain why throttling is needed. You support throttling, I don't, so 
you have to explain it :)

> is ok based on your specific usecase. We do have means to achieve that.
> It is explicit and thus it will be applied only where it makes sense.
> You keep repeating that implicit behavior change for everybody is
> better.

I don't want to change it for everybody. I want to change it for block 
device drivers. I don't care what you do with non-block drivers.

> I guess we will not agree on that part. I consider it a hack
> rather than a systematic solution. I can easily imagine that we just
> find out other call sites that would cause over reclaim or similar

If a __GFP_NORETRY allocation does overreclaim - it could be fixed by 
returning NULL instead of doing overreclaim. The specification says that 
callers must handle failure of __GFP_NORETRY allocations.
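
(For reference, that contract means a caller is expected to look roughly like
this - a generic sketch, not code from any of the drivers discussed here:)

        /* optimistic, lightweight attempt that is allowed to fail */
        buf = kmalloc(size, GFP_NOIO | __GFP_NORETRY);
        if (!buf) {
                /* fallback: an allocation that may block and retry */
                buf = kmalloc(size, GFP_NOIO);
        }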

So yes - if you think that just skipping throttling on __GFP_NORETRY could 
cause excessive CPU consumption trying to reclaim unreclaimable pages or 
something like that - then you can add more points where the __GFP_NORETRY 
is failed with NULL to avoid the excessive CPU consumption.

> problems because they are not throttled on too many dirty pages due to
congested storage. What are we going to do then? Add another magic gfp
> flag? Really, try to not add even more subtle side effects for gfp
> flags. We _do_ have ways to accomplish what your particular usecase
> requires.
> 
> -- 
> Michal Hocko
> SUSE Labs

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Michal Hocko
On Fri 22-06-18 08:44:52, Mikulas Patocka wrote:
> On Fri, 22 Jun 2018, Michal Hocko wrote:
[...]
> > Why? How are you going to audit all the callers that the behavior makes
> > sense and moreover how are you going to ensure that future usage will
> > still make sense. The more subtle side effects gfp flags have the harder
> > they are to maintain.
> 
> I did audit them - see the previous email in this thread: 
> https://www.redhat.com/archives/dm-devel/2018-June/thread.html

I do not see any mention of throttling expectations for those users.
You have focused only on the allocation failure fallback, AFAIR.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Michal Hocko
On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> 
> 
> On Fri, 22 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > [...]
> > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. 
> > > > the 
> > > > request comes from a block device driver or a filesystem), we should 
> > > > not 
> > > > sleep.
> > > 
> > > Why? How are you going to audit all the callers that the behavior makes
> > > sense and moreover how are you going to ensure that future usage will
> > > still make sense. The more subtle side effects gfp flags have the harder
> > > they are to maintain.
> > 
> > So just as an exercise. Try to explain the above semantic to users. We
> > currently have the following.
> > 
> >  * __GFP_NORETRY: The VM implementation will try only very lightweight
> >  *   memory direct reclaim to get some memory under memory pressure (thus
> >  *   it can sleep). It will avoid disruptive actions like OOM killer. The
> >  *   caller must handle the failure which is quite likely to happen under
> >  *   heavy memory pressure. The flag is suitable when failure can easily be
> >  *   handled at small cost, such as reduced throughput
> > 
> >  * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> >  *   allocator recursing into the filesystem which might already be holding
> >  *   locks.
> > 
> > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > is the actual semantic without explaining the whole reclaim or force
> > users to look into the code to understand that? What about GFP_NOIO |
> > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > shrinkers have to follow that as well?
> 
> My reasoning was that there is broken code that uses __GFP_NORETRY and 
> assumes that it can't fail - so conditioning the change on !__GFP_FS would 
> minimize the disruption to the broken code.
> 
> Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.

As I've already said, this is a subtle change which is really hard to
reason about. Throttling on congestion has its meaning and reason. Look
at why we are doing that in the first place. You cannot simply say this
is ok based on your specific usecase. We do have means to achieve that.
It is explicit and thus it will be applied only where it makes sense.
You keep repeating that implicit behavior change for everybody is
better. I guess we will not agree on that part. I consider it a hack
rather than a systematic solution. I can easily imagine that we just
find out other call sites that would cause over reclaim or similar
problems because they are not throttled on too many dirty pages due to
congested storage. What are we going to do then? Add another magic gfp
flag? Really, try to not add even more subtle side effects for gfp
flags. We _do_ have ways to accomplish what your particular usecase
requires.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Mikulas Patocka



On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> [...]
> > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
> > > request comes from a block device driver or a filesystem), we should not 
> > > sleep.
> > 
> > Why? How are you going to audit all the callers that the behavior makes
> > sense and moreover how are you going to ensure that future usage will
> > still make sense. The more subtle side effects gfp flags have the harder
> > they are to maintain.
> 
> So just as an exercise. Try to explain the above semantic to users. We
> currently have the following.
> 
>  * __GFP_NORETRY: The VM implementation will try only very lightweight
>  *   memory direct reclaim to get some memory under memory pressure (thus
>  *   it can sleep). It will avoid disruptive actions like OOM killer. The
>  *   caller must handle the failure which is quite likely to happen under
>  *   heavy memory pressure. The flag is suitable when failure can easily be
>  *   handled at small cost, such as reduced throughput
> 
>  * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
>  *   allocator recursing into the filesystem which might already be holding
>  *   locks.
> 
> So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> is the actual semantic without explaining the whole reclaim or force
> users to look into the code to understand that? What about GFP_NOIO |
> __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> shrinkers have to follow that as well?

My reasoning was that there is broken code that uses __GFP_NORETRY and 
assumes that it can't fail - so conditioning the change on !__GFP_FS would 
minimize the disruption to the broken code.

Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.

Mikulas


Index: linux-2.6/mm/vmscan.c
===
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
 * the LRU too quickly.
 */
if (!sc->hibernation_mode && !current_is_kswapd() &&
+  !(sc->gfp_mask & __GFP_NORETRY) &&
   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Mikulas Patocka



On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> [...]
> > > But seriously, isn't the best way around the throttling issue to use
> > > PF_LESS_THROTTLE?
> > 
> > Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would 
> > be better to change it just in one place than to add PF_LESS_THROTTLE to 
> > every block device driver (because adding it to every block driver results 
> > in more code).
> 
> Why would every block device need this? I thought we were talking about
> mempool allocator and the md variant of it. They are explicitly doing
> their own back off so PF_LESS_THROTTLE sounds appropriate to me.

Because every block driver is susceptible to this problem. Two years ago, 
there was a bug where, when the user was swapping to a dm-crypt device, memory 
management would stall the allocations done by dm-crypt by 100ms - that 
slowed down the swapping rate and made the machine unusable.

Then, people are complaining about the same problem in dm-bufio.

Next time, it will be something else.

(you will answer : "we will not fix bugs unless people are reporting them" :-)

> > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
> > request comes from a block device driver or a filesystem), we should not 
> > sleep.
> 
> Why? How are you going to audit all the callers that the behavior makes
> sense and moreover how are you going to ensure that future usage will
> still make sense. The more subtle side effects gfp flags have the harder
> they are to maintain.

I did audit them - see the previous email in this thread: 
https://www.redhat.com/archives/dm-devel/2018-June/thread.html

Mikulas

> > Signed-off-by: Mikulas Patocka 
> > Cc: sta...@vger.kernel.org
> > 
> > Index: linux-2.6/mm/vmscan.c
> > ===
> > --- linux-2.6.orig/mm/vmscan.c
> > +++ linux-2.6/mm/vmscan.c
> > @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
> >  * the LRU too quickly.
> >  */
> > if (!sc->hibernation_mode && !current_is_kswapd() &&
> > +  (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY 
> > &&
> >current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> > wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> >  
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Michal Hocko
On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
[...]
> > But seriously, isn't the best way around the throttling issue to use
> > PF_LESS_THROTTLE?
> 
> Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would 
> be better to change it just in one place than to add PF_LESS_THROTTLE to 
> every block device driver (because adding it to every block driver results 
> in more code).

Why would every block device need this? I thought we were talking about
mempool allocator and the md variant of it. They are explicitly doing
their own back off so PF_LESS_THROTTLE sounds appropriate to me.

> What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set 
> (i.e. the request comes from a block device driver or a filesystem), we 
> should not sleep.

Why? How are you going to audit all the callers that the behavior makes
sense and moreover how are you going to ensure that future usage will
still make sense. The more subtle side effects gfp flags have the harder
they are to maintain.

> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org
> 
> Index: linux-2.6/mm/vmscan.c
> ===
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
>* the LRU too quickly.
>*/
>   if (!sc->hibernation_mode && !current_is_kswapd() &&
> +(sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
>  current_may_throttle() && pgdat_memcg_congested(pgdat, root))
>   wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>  

-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-21 Thread Mikulas Patocka



On Tue, 19 Jun 2018, Michal Hocko wrote:

> On Mon 18-06-18 18:11:26, Mikulas Patocka wrote:
> [...]
> > I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases 
> > without a fallback - those are bugs that make various functions randomly 
> > return -ENOMEM.
> 
> Well, maybe those are just optimistic attempts to allocate memory and
> have a fallback somewhere else. So I would be careful calling them
> outright bugs. But maybe you are right.

I was trying to find the fallback code when I triaged them and marked as 
"BUG" those cases where I didn't find it. You can search harder and 
perhaps you'll find something that I didn't.

> > Most of the callers provide a fallback.
> > 
> > There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two 
> > different functions - if the allocation is larger than 
> > PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller. 
> > If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER, 
> > __GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations 
> > don't trigger the oom killer at all).
> 
> Well, the primary purpose of this flag is to provide a consistent
> failure behavior for all requests regardless of the size.
> 
> > So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in 
> > the cases where the caller wants to avoid triggering the oom killer (the 
> > problem is that __GFP_NORETRY causes random failure even in no-oom 
> > situations but __GFP_RETRY_MAYFAIL doesn't).
> 
> myabe yes.
> 
> > So my suggestion is - fix these obvious bugs when someone allocates memory 
> > with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be 
> > just changed to return NULL instead of sleeping.
> 
> No real objection to fixing wrong __GFP_NORETRY usage. But __GFP_NORETRY
> can sleep. Nothing will really change in that regards.  It does a
> reclaim and that _might_ sleep.
> 
> But seriously, isn't the best way around the throttling issue to use
> PF_LESS_THROTTLE?

Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would 
be better to change it just in one place than to add PF_LESS_THROTTLE to 
every block device driver (because adding it to every block driver results 
in more code).

What about this patch? If __GFP_NORETRY is set and __GFP_FS is not set 
(i.e. the request comes from a block device driver or a filesystem), we 
should not sleep.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

Index: linux-2.6/mm/vmscan.c
===
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
 * the LRU too quickly.
 */
if (!sc->hibernation_mode && !current_is_kswapd() &&
+  (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-18 Thread Mikulas Patocka



On Fri, 15 Jun 2018, Michal Hocko wrote:

> On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 15 Jun 2018, Michal Hocko wrote:
> > 
> > > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > > > 
> > > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO | 
> > > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses 
> > > > these flags too. dm-bufio is just a big mempool.
> > > 
> > > This doesn't answer my question though. Somebody else is doing it is not
> > > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > > allocation AFAICS. So why do you really need it now? Why cannot you
> > 
> > dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | 
> > __GFP_NOWARN" since the kernel 3.2 when it was introduced.
> > 
> > In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT 
> > allocation, then drops the lock and does GFP_NOIO with the dropped lock 
> > (because someone was likely experiencing the same issue that is reported 
> > in this thread) - there are two commits that change it - 9ea61cac0 and 
> > 41c73a49df31.
> 
> OK, I see. Is there any fundamental reason why this path has to do one
> round of GFP_NOIO, or can it keep NOWAIT, drop the lock, sleep and retry
> again?

If the process is woken up, there was some buffer added to the freelist, 
or refcount of some buffer was dropped to 0. In this case, we don't want 
to drop the lock and use GFP_NOIO, because the freed buffer may disappear 
when we drop the lock.
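
A rough sketch of the ordering being described - the helper names are
hypothetical, not the actual dm-bufio functions:

    mutex_lock(&c->lock);
    b = get_buffer_from_freelist(c);    /* hypothetical helper */
    if (!b) {
        /* nothing to reuse - only now drop the lock for a blocking allocation */
        mutex_unlock(&c->lock);
        data = (void *)__get_free_pages(GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN, order);
        mutex_lock(&c->lock);
    }
    /* had we dropped the lock before checking the freelist, the buffer that
     * woke us up could have been grabbed by another task in the meantime */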

> [...]
> > > is the same class of problem, honestly, I dunno. And I've already said
> > > that stalling __GFP_NORETRY might be a good way around that but that
> > > needs much more consideration and existing users examination. I am not
> > > aware anybody has done that. Doing changes like that based on a single
> > > user is certainly risky.
> > 
> > Why don't you set any rules how these flags should be used?
> 
> It is really hard to change rules during the game. You basically have to
> examine all existing users and that is well beyond my time scope. I've
> tried that where it was possible. E.g. I took __GFP_REPEAT and turned it into
> a well-defined semantic. __GFP_NORETRY is a bigger beast.
> 
> Anyway, I believe that it would be much safer to look at the problem
> from a highlevel perspective. You seem to be focused on __GFP_NORETRY
> little bit too much IMHO. We are not throttling callers which explicitly
> do not want to or cannot - see current_may_throttle. Is it possible that
> both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE
> or use other means? E.g. do you have backing_dev_info at that time?
> -- 
> Michal Hocko
> SUSE Labs

I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases 
without a fallback - those are bugs that make various functions randomly 
return -ENOMEM.

Most of the callers provide a fallback.

There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two 
different functions - if the allocation is larger than 
PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller. 
If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER, 
__GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations 
don't trigger the oom killer at all).

So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in 
the cases where the caller wants to avoid triggering the oom killer (the 
problem is that __GFP_NORETRY causes random failure even in no-oom 
situations but __GFP_RETRY_MAYFAIL doesn't).
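
For illustration only, the kind of caller this is aimed at - an allocation that
wants to avoid the OOM killer but still has a fallback path (the vmalloc
fallback here is an assumed example, not one of the audited call sites):

    void *buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);

    if (!buf)
        buf = vmalloc(size);    /* assumed fallback; caller still handles NULL */
    if (!buf)
        return -ENOMEM;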


So my suggestion is - fix these obvious bugs when someone allocates memory 
with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be 
just changed to return NULL instead of sleeping.




arch/arm/mm/dma-mapping.c - fallback to a smaller size without __GFP_NORETRY

arch/mips/mm/dma-default.c - says that it uses __GFP_NORETRY to avoid the 
oom killer, provides no fallback - it seems to be a BUG

arch/sparc/mm/tsb.c - fallback to a smaller size without __GFP_NORETRY

arch/x86/include/asm/floppy.h - __GFP_NORETRY doesn't seem to serve any 
purpose, it may cause random failures during initialization, can be 
removed - BUG

arch/powerpc/mm/mmu_context_iommu.c - uses it just during moving pages, 
there's no problem with failure

arch/powerpc/platforms/pseries/cmm.c - a vm balloon driver, no problem 
with failure

block/bio.c - falls back to mempool

block/blk-mq.c - erroneous use of __GFP_NORETRY during initialization, it 
falls back to a smaller size, but doesn't drop the __GFP_NORETRY flag 
(BUG)

drivers/gpu/drm/i915/i915_gem.c - it starts with __GFP_NORETRY and on 
failure, it ORs it with __GFP_RETRY_MAYFAIL (which of these conflicting 
flags wins?)

drivers/gpu/drm/i915/i915_gem_gtt.c - __GFP_NORETRY is used during 
initialization (BUG), it shouldn't be used

drivers/gpu/drm/i915/i915_gem_execbuffer.c - fallback to a smaller size without 
__GFP_NORETRY

drivers/gpu/drm/i915/i915_gem_internal.c - fallback 

Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Michal Hocko
On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > > 
> > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO | 
> > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses 
> > > these flags too. dm-bufio is just a big mempool.
> > 
> > This doesn't answer my question though. Somebody else is doing it is not
> > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > allocation AFAICS. So why do you really need it now? Why cannot you
> 
> dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | 
> __GFP_NOWARN" since the kernel 3.2 when it was introduced.
> 
> In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT 
> allocation, then drops the lock and does GFP_NOIO with the dropped lock 
> (because someone was likely experiencing the same issue that is reported 
> in this thread) - there are two commits that change it - 9ea61cac0 and 
> 41c73a49df31.

OK, I see. Is there any fundamental reason why this path has to do one
round of GFP_NOIO, or can it keep NOWAIT, drop the lock, sleep and retry
again?

[...]
> > is the same class of problem, honestly, I dunno. And I've already said
> > that stalling __GFP_NORETRY might be a good way around that but that
> > needs much more consideration and existing users examination. I am not
> > aware anybody has done that. Doing changes like that based on a single
> > user is certainly risky.
> 
> Why don't you set any rules how these flags should be used?

It is really hard to change rules during the game. You basically have to
examine all existing users and that is well beyond my time scope. I've
tried that where it was possible. E.g. I took __GFP_REPEAT and turned it into
a well-defined semantic. __GFP_NORETRY is a bigger beast.

Anyway, I believe that it would be much safer to look at the problem
from a highlevel perspective. You seem to be focused on __GFP_NORETRY
little bit too much IMHO. We are not throttling callers which explicitly
do not want to or cannot - see current_may_throttle. Is it possible that
both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE
or use other means? E.g. do you have backing_dev_info at that time?
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Mikulas Patocka



On Fri, 15 Jun 2018, Michal Hocko wrote:

> On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > 
> > Because mempool uses it. Mempool uses allocations with "GFP_NOIO | 
> > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses 
> > these flags too. dm-bufio is just a big mempool.
> 
> This doesn't answer my question though. Somebody else is doing it is not
> an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> allocation AFAICS. So why do you really need it now? Why cannot you

dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | 
__GFP_NOWARN" since the kernel 3.2 when it was introduced.

In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT 
allocation, then drops the lock and does GFP_NOIO with the dropped lock 
(because someone was likely experiencing the same issue that is reported 
in this thread) - there are two commits that change it - 9ea61cac0 and 
41c73a49df31.

> simply keep retrying GFP_NOWAIT with your own throttling?
> 
> Note that I am not trying to say that 41c73a49df31 is wrong, I am merely trying
> to understand why this blocking allocation is done in the first place.
>  
> > If you argue that these flags are incorrect - then fix mempool_alloc.
> 
> AFAICS there is no report about mempool_alloc stalling here. Maybe this

If the page allocator can stall dm-bufio, it can stall mempool_alloc as 
well. dm-bufio is just bigger, so it will hit this bug sooner.

> is the same class of problem, honestly, I dunno. And I've already said
> that stalling __GFP_NORETRY might be a good way around that but that
> needs much more consideration and existing users examination. I am not
> aware anybody has done that. Doing changes like that based on a single
> user is certainly risky.

Why don't you set any rules how these flags should be used?

If you use GFP_NOIO | __GFP_NORETRY in your own code and blame other 
people for doing so - you are just as evil as Linus, who praised people 
for reverse-engineering hardware and blamed them for reverse-engineering 
bitkeeper :-)

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Michal Hocko
On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jun 2018, Michal Hocko wrote:
> 
> > On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Thu, 14 Jun 2018, Michal Hocko wrote:
> > > 
> > > > On Thu 14-06-18 15:18:58, jing xia wrote:
> > > > [...]
> > > > > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> > > > >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> > > > >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> > > > >  #2 [ffc0282af450] schedule at ff8008850f4c
> > > > >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> > > > >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> > > > >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> > > > 
> > > > This trace doesn't provide the full picture unfortunately. Waiting in
> > > > the direct reclaim means that the underlying bdi is congested. The real
> > > > question is why it doesn't flush IO in time.
> > > 
> > > I pointed this out two years ago and you just refused to fix it:
> > > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html
> > 
> > Let me be evil again and let me quote the old discussion:
> > : > I agree that mempool_alloc should _primarily_ sleep on their own
> > : > throttling mechanism. I am not questioning that. I am just saying that
> > : > the page allocator has its own throttling which it relies on and that
> > : > cannot be just ignored because that might have other undesirable side
> > : > effects. So if the right approach is really to never throttle certain
> > : > requests then we have to bail out from a congested nodes/zones as soon
> > : > as the congestion is detected.
> > : >
> > : > Now, I would like to see that something like that is _really_ necessary.
> > :
> > : Currently, it is not a problem - device mapper reports the device as
> > : congested only if the underlying physical disks are congested.
> > :
> > : But once we change it so that device mapper reports congested state on its
> > : own (when it has too many bios in progress), this starts being a problem.
> > 
> > So has this changed since then? If yes then we can think of a proper
> > solution but that would require to actually describe why we see the
> > congestion, why it does help to wait on the caller rather than the
> > allocator etc...
> 
> Device mapper doesn't report congested state - but something else does 
> (perhaps the user inserted a cheap slow usb stick or sdcard?). And device 
> mapper is just a victim of that.

Maybe yes and that would require some more debugging to find out,
analyze and think of a proper solution.

> Why should device mapper sleep because some other random block device is 
> congested?

Well, the direct reclaim is a way to throttle memory allocations. There
is no real concept of who is asking for the memory. If you do not want
to get blocked then use GFP_NOWAIT.

> > Throwing statements like ...
> > 
> > > I'm sure you'll come up with another creative excuse why GFP_NORETRY 
> > > allocations need to incur deliberate 100ms delays in block device drivers.
> > 
> > ... is not really productive. I've tried to explain why I am not _sure_ what
> > possible side effects such a change might have and your hand waving
> > didn't really convince me. MD is not the only user of the page
> > allocator...
> > 
> > E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
> > allocation") even added GFP_NOIO request in the first place when you
> > keep retrying and sleep yourself?
> 
> Because mempool uses it. Mempool uses allocations with "GFP_NOIO | 
> __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses 
> these flags too. dm-bufio is just a big mempool.

This doesn't answer my question though. Somebody else is doing it is not
an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
allocation AFAICS. So why do you really need it now? Why cannot you
simply keep retrying GFP_NOWAIT with your own throttling?
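
Something like the following loop is what is meant here - the 10ms sleep is an
arbitrary placeholder for whatever driver-specific throttling would be used:

    for (;;) {
        void *p = kmalloc(size, GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);

        if (p)
            return p;
        /* the driver's own throttling: wait for its in-flight I/O to
         * complete and release memory, then try again */
        msleep(10);
    }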

Note that I am not trying to say that 41c73a49df31 is wrong, I am merely trying
to understand why this blocking allocation is done in the first place.
 
> If you argue that these flags are incorrect - then fix mempool_alloc.

AFAICS there is no report about mempool_alloc stalling here. Maybe this
is the same class of problem, honestly, I dunno. And I've already said
that stalling __GFP_NORETRY might be a good way around that but that
needs much more consideration and existing users examination. I am not
aware anybody has done that. Doing changes like that based on a single
user is certainly risky.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Mikulas Patocka



On Fri, 15 Jun 2018, Michal Hocko wrote:

> On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 14 Jun 2018, Michal Hocko wrote:
> > 
> > > On Thu 14-06-18 15:18:58, jing xia wrote:
> > > [...]
> > > > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> > > >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> > > >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> > > >  #2 [ffc0282af450] schedule at ff8008850f4c
> > > >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> > > >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> > > >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> > > 
> > > This trace doesn't provide the full picture unfortunately. Waiting in
> > > the direct reclaim means that the underlying bdi is congested. The real
> > > question is why it doesn't flush IO in time.
> > 
> > I pointed this out two years ago and you just refused to fix it:
> > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html
> 
> Let me be evil again and let me quote the old discussion:
> : > I agree that mempool_alloc should _primarily_ sleep on their own
> : > throttling mechanism. I am not questioning that. I am just saying that
> : > the page allocator has its own throttling which it relies on and that
> : > cannot be just ignored because that might have other undesirable side
> : > effects. So if the right approach is really to never throttle certain
> : > requests then we have to bail out from a congested nodes/zones as soon
> : > as the congestion is detected.
> : >
> : > Now, I would like to see that something like that is _really_ necessary.
> :
> : Currently, it is not a problem - device mapper reports the device as
> : congested only if the underlying physical disks are congested.
> :
> : But once we change it so that device mapper reports congested state on its
> : own (when it has too many bios in progress), this starts being a problem.
> 
> So has this changed since then? If yes then we can think of a proper
> solution but that would require to actually describe why we see the
> congestion, why it does help to wait on the caller rather than the
> allocator etc...

Device mapper doesn't report congested state - but something else does 
(perhaps the user inserted a cheap slow usb stick or sdcard?). And device 
mapper is just a victim of that.

Why should device mapper sleep because some other random block device is 
congested?

> Throwing statements like ...
> 
> > I'm sure you'll come up with another creative excuse why GFP_NORETRY 
> > allocations need to incur deliberate 100ms delays in block device drivers.
> 
> ... is not really productive. I've tried to explain why I am not _sure_ what
> possible side effects such a change might have and your hand waving
> didn't really convince me. MD is not the only user of the page
> allocator...
> 
> E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
> allocation") even added GFP_NOIO request in the first place when you
> keep retrying and sleep yourself?

Because mempool uses it. Mempool uses allocations with "GFP_NOIO | 
__GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses 
these flags too. dm-bufio is just a big mempool.
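
For reference, this is roughly what mempool_alloc() does with the caller's
flags in kernels of this era - paraphrased from memory, not a verbatim quote
of mm/mempool.c:

    gfp_mask |= __GFP_NOMEMALLOC;   /* don't dip into emergency reserves */
    gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages() */
    gfp_mask |= __GFP_NOWARN;       /* failures are expected and handled */

    /* first attempt: no direct reclaim, no I/O */
    gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
    element = pool->alloc(gfp_temp, pool->pool_data);
    /* on failure, take an element from the pool, or sleep on the pool's
     * wait queue and retry with the full gfp_mask */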

If you argue that these flags are incorrect - then fix mempool_alloc.

> The changelog only describes what but
> doesn't explain why. Or did I misread the code and this is not the
> allocation which is stalling due to congestion?

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Michal Hocko
On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
> 
> 
> On Thu, 14 Jun 2018, Michal Hocko wrote:
> 
> > On Thu 14-06-18 15:18:58, jing xia wrote:
> > [...]
> > > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> > >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> > >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> > >  #2 [ffc0282af450] schedule at ff8008850f4c
> > >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> > >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> > >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> > 
> > This trace doesn't provide the full picture unfortunately. Waiting in
> > the direct reclaim means that the underlying bdi is congested. The real
> > question is why it doesn't flush IO in time.
> 
> I pointed this out two years ago and you just refused to fix it:
> http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html

Let me be evil again and let me quote the old discussion:
: > I agree that mempool_alloc should _primarily_ sleep on their own
: > throttling mechanism. I am not questioning that. I am just saying that
: > the page allocator has its own throttling which it relies on and that
: > cannot be just ignored because that might have other undesirable side
: > effects. So if the right approach is really to never throttle certain
: > requests then we have to bail out from a congested nodes/zones as soon
: > as the congestion is detected.
: >
: > Now, I would like to see that something like that is _really_ necessary.
:
: Currently, it is not a problem - device mapper reports the device as
: congested only if the underlying physical disks are congested.
:
: But once we change it so that device mapper reports congested state on its
: own (when it has too many bios in progress), this starts being a problem.

So has this changed since then? If yes then we can think of a proper
solution but that would require to actually describe why we see the
congestion, why it does help to wait on the caller rather than the
allocator etc...

Throwing statements like ...

> I'm sure you'll come up with another creative excuse why GFP_NORETRY 
> allocations need to incur deliberate 100ms delays in block device drivers.

... is not really productive. I've tried to explain why I am not _sure_ what
possible side effects such a change might have and your hand waving
didn't really convince me. MD is not the only user of the page
allocator...

E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
allocation") even added GFP_NOIO request in the first place when you
keep retrying and sleep yourself? The changelog only describes what but
doesn't explain why. Or did I misread the code and this is not the
allocation which is stalling due to congestion?
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-14 Thread Mikulas Patocka



On Thu, 14 Jun 2018, Michal Hocko wrote:

> On Thu 14-06-18 15:18:58, jing xia wrote:
> [...]
> > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> >  #2 [ffc0282af450] schedule at ff8008850f4c
> >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> 
> This trace doesn't provide the full picture unfortunately. Waiting in
> the direct reclaim means that the underlying bdi is congested. The real
> question is why it doesn't flush IO in time.

I pointed this out two years ago and you just refused to fix it:
http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html

I'm sure you'll come up with another creative excuse why GFP_NORETRY 
allocations need to incur deliberate 100ms delays in block device drivers.

Mikulas

> >  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
> >  #7 [ffc0282af680] shrink_lruvec at ff8008178510
> >  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
> >  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
> > #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
> > #11 [ffc0282af990] try_to_free_pages at ff8008178f3c
> > #12 [ffc0282afa30] __perform_reclaim at ff8008169130
> > #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
> > #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
> > #15 [ffc0282afbe0] alloc_buffer at ff8008591a94
> > #16 [ffc0282afc20] __bufio_new at ff8008592e94
> > #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
> > #18 [ffc0282afd20] verity_prefetch_io at ff8008598384
> > #19 [ffc0282afd70] process_one_work at ff80080b5b3c
> > #20 [ffc0282afdc0] worker_thread at ff80080b64fc
> > #21 [ffc0282afe20] kthread at ff80080bae34
> > 
> > > Mikulas
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-14 Thread jing xia
Thanks for your comment.

On Wed, Jun 13, 2018 at 10:02 PM, Mikulas Patocka  wrote:
>
>
> On Tue, 12 Jun 2018, Mike Snitzer wrote:
>
>> On Tue, Jun 12 2018 at  4:03am -0400,
>> Jing Xia  wrote:
>>
>> > Performance test in android reports that the phone sometimes gets
>> > hung and shows a black screen for several minutes. The sysdump shows:
>> > 1. kswapd and other tasks who enter the direct-reclaim path are waiting
>> > on the dm_bufio_lock;
>>
>> Do you have an understanding of where they are waiting?  Is it in
>> dm_bufio_shrink_scan()?
>>
>> > 2. the task who gets the dm_bufio_lock is stalled for IO completions,
>> > the relevant stack trace as :
>> >
>> > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
>> >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>> >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>> >  #2 [ffc0282af450] schedule at ff8008850f4c
>> >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>> >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>> >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
>> >  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>> >  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>> >  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>> >  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
>
> Please send the full stacktrace of this task.
>
> Then, we can see, why is it waiting here.
>
Please refer to:

PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
 #0 [ffc0282af3d0] __switch_to at ff8008085e48
 #1 [ffc0282af3f0] __schedule at ff8008850cc8
 #2 [ffc0282af450] schedule at ff8008850f4c
 #3 [ffc0282af470] schedule_timeout at ff8008853a0c
 #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
 #5 [ffc0282af530] wait_iff_congested at ff8008181b40
 #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
 #7 [ffc0282af680] shrink_lruvec at ff8008178510
 #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
 #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
#10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
#11 [ffc0282af990] try_to_free_pages at ff8008178f3c
#12 [ffc0282afa30] __perform_reclaim at ff8008169130
#13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
#14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
#15 [ffc0282afbe0] alloc_buffer at ff8008591a94
#16 [ffc0282afc20] __bufio_new at ff8008592e94
#17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
#18 [ffc0282afd20] verity_prefetch_io at ff8008598384
#19 [ffc0282afd70] process_one_work at ff80080b5b3c
#20 [ffc0282afdc0] worker_thread at ff80080b64fc
#21 [ffc0282afe20] kthread at ff80080bae34

> Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-14 Thread jing xia
Thanks for your comment, I appreciate it.

On Wed, Jun 13, 2018 at 5:20 AM, Mike Snitzer  wrote:
> On Tue, Jun 12 2018 at  4:03am -0400,
> Jing Xia  wrote:
>
>> Performance test in android reports that the phone sometimes gets
>> hung and shows a black screen for several minutes. The sysdump shows:
>> 1. kswapd and other tasks who enter the direct-reclaim path are waiting
>> on the dm_bufio_lock;
>
> Do you have an understanding of where they are waiting?  Is it in
> dm_bufio_shrink_scan()?
>

I'm sorry I didn't make the issue clear. The kernel version on our
platform is 4.4, and there the implementation of dm_bufio_shrink_count()
still needs to take the dm_bufio_lock. So kswapd and the other tasks in
direct reclaim are blocked in dm_bufio_shrink_count().
According to the sysdump,
1. those tasks are doing the shrink through the same dm-bufio shrinker
(the first dm-bufio shrinker in the shrink_list), and are waiting for
the same mutex;
2. __GFP_FS is set for all of those tasks;

The dm_bufio_lock is no longer needed in dm_bufio_shrink_count() in the
latest version, but the issue may still happen, because the lock is
still needed in dm_bufio_shrink_scan().
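
For reference, the lockless count in the newer code looks roughly like this
(reconstructed from the "-" lines of the patch quoted at the end of this
thread, so treat it as approximate):

static unsigned long
dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
    struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
    unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
                          READ_ONCE(c->n_buffers[LIST_DIRTY]);
    unsigned long retain_target = get_retain_buffers(c);

    return (count < retain_target) ? 0 : (count - retain_target);
}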

the relevant stack traces please refer to:

PID: 855TASK: ffc07844a700  CPU: 1   COMMAND: "kswapd0"
 #0 [ffc078357920] __switch_to at ff8008085e48
 #1 [ffc078357940] __schedule at ff8008850cc8
 #2 [ffc0783579a0] schedule at ff8008850f4c
 #3 [ffc0783579c0] schedule_preempt_disabled at ff8008851284
 #4 [ffc0783579e0] __mutex_lock_slowpath at ff8008852898
 #5 [ffc078357a50] mutex_lock at ff8008852954
 #6 [ffc078357a70] dm_bufio_shrink_count at ff8008591d08
 #7 [ffc078357ab0] do_shrink_slab at ff80081753e0
 #8 [ffc078357b30] shrink_slab at ff8008175f3c
 #9 [ffc078357bd0] shrink_zone at ff8008178860
#10 [ffc078357c70] balance_pgdat at ff8008179838
#11 [ffc078357d70] kswapd at ff8008179e48
#12 [ffc078357e20] kthread at ff80080bae34

PID: 15538  TASK: ffc006833400  CPU: 3   COMMAND: "com.qiyi.video"
 #0 [ffc027637660] __switch_to at ff8008085e48
 #1 [ffc027637680] __schedule at ff8008850cc8
 #2 [ffc0276376e0] schedule at ff8008850f4c
 #3 [ffc027637700] schedule_preempt_disabled at ff8008851284
 #4 [ffc027637720] __mutex_lock_slowpath at ff8008852898
 #5 [ffc027637790] mutex_lock at ff8008852954
 #6 [ffc0276377b0] dm_bufio_shrink_count at ff8008591d08
 #7 [ffc0276377f0] do_shrink_slab at ff80081753e0
 #8 [ffc027637870] shrink_slab at ff8008175f3c
 #9 [ffc027637910] shrink_zone at ff8008178860
#10 [ffc0276379b0] do_try_to_free_pages at ff8008178bc4
#11 [ffc027637a50] try_to_free_pages at ff8008178f3c
#12 [ffc027637af0] __perform_reclaim at ff8008169130
#13 [ffc027637b70] __alloc_pages_nodemask at ff800816c9b8
#14 [ffc027637c90] handle_mm_fault at ff800819025c
#15 [ffc027637d80] do_page_fault at ff80080949f8
#16 [ffc027637df0] do_mem_abort at ff80080812f8


>> 2. the task who gets the dm_bufio_lock is stalled for IO completions,
>> the relevant stack trace as :
>>
>> PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
>>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>>  #2 [ffc0282af450] schedule at ff8008850f4c
>>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
>>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
>
> Understanding the root cause of why the IO isn't completing quickly
> enough would be nice.  Is the backing storage just overwhelmed?
>

Some information we can get from the sysdump:
1. swap space is completely used up;
2. many tasks need to allocate a lot of memory;
3. IO wait is 100%;

>> This patch aims to reduce the dm_bufio_lock contention when multiple
>> tasks do shrink_slab() at the same time. It is acceptable that a task
>> will be allowed to reclaim from other shrinkers or reclaim from dm-bufio
>> next time, rather than stall on the dm_bufio_lock.
>
> Your patch just looks to be papering over the issue.  Like you're
> treating the symptom rather than the problem.
>
Yes, maybe you are right.

>> Signed-off-by: Jing Xia 
>> Signed-off-by: Jing Xia 
>
> You only need one Signed-off-by.
>
Thanks, I got it.

>> ---
>>  drivers/md/dm-bufio.c | 13 +++--
>>  1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>> index c546b56..402a028 100644
>> --- a/drivers/md/dm-bufio.c
>> +++ 

Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-14 Thread Michal Hocko
On Thu 14-06-18 15:18:58, jing xia wrote:
[...]
> PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>  #2 [ffc0282af450] schedule at ff8008850f4c
>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40

This trace doesn't provide the full picture unfortunately. Waiting in
the direct reclaim means that the underlying bdi is congested. The real
question is why it doesn't flush IO in time.

>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
> #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
> #11 [ffc0282af990] try_to_free_pages at ff8008178f3c
> #12 [ffc0282afa30] __perform_reclaim at ff8008169130
> #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
> #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
> #15 [ffc0282afbe0] alloc_buffer at ff8008591a94
> #16 [ffc0282afc20] __bufio_new at ff8008592e94
> #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
> #18 [ffc0282afd20] verity_prefetch_io at ff8008598384
> #19 [ffc0282afd70] process_one_work at ff80080b5b3c
> #20 [ffc0282afdc0] worker_thread at ff80080b64fc
> #21 [ffc0282afe20] kthread at ff80080bae34
> 
> > Mikulas

-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-13 Thread Mikulas Patocka



On Tue, 12 Jun 2018, Mike Snitzer wrote:

> On Tue, Jun 12 2018 at  4:03am -0400,
> Jing Xia  wrote:
> 
> > Performance test in android reports that the phone sometimes gets
> > hung and shows a black screen for several minutes. The sysdump shows:
> > 1. kswapd and other tasks who enter the direct-reclaim path are waiting
> > on the dm_bufio_lock;
> 
> Do you have an understanding of where they are waiting?  Is it in
> dm_bufio_shrink_scan()?
> 
> > 2. the task who gets the dm_bufio_lock is stalled for IO completions,
> > the relevant stack trace as :
> > 
> > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> >  #2 [ffc0282af450] schedule at ff8008850f4c
> >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> >  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
> >  #7 [ffc0282af680] shrink_lruvec at ff8008178510
> >  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
> >  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040

Please send the full stacktrace of this task.

Then, we can see, why is it waiting here.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-12 Thread Mike Snitzer
On Tue, Jun 12 2018 at  4:03am -0400,
Jing Xia  wrote:

> Performance test in android reports that the phone sometimes gets
> hung and shows a black screen for several minutes. The sysdump shows:
> 1. kswapd and other tasks who enter the direct-reclaim path are waiting
> on the dm_bufio_lock;

Do you have an understanding of where they are waiting?  Is it in
dm_bufio_shrink_scan()?

> 2. the task who gets the dm_bufio_lock is stalled for IO completions,
> the relevant stack trace as :
> 
> PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>  #2 [ffc0282af450] schedule at ff8008850f4c
>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040

Understanding the root cause of why the IO isn't completing quickly
enough would be nice.  Is the backing storage just overwhelmed?

> This patch aims to reduce the dm_bufio_lock contention when multiple
> tasks do shrink_slab() at the same time. It is acceptable that a task
> will be allowed to reclaim from other shrinkers or reclaim from dm-bufio
> next time, rather than stall on the dm_bufio_lock.

Your patch just looks to be papering over the issue.  Like you're
treating the symptom rather than the problem.

> Signed-off-by: Jing Xia 
> Signed-off-by: Jing Xia 

You only need one Signed-off-by.

> ---
>  drivers/md/dm-bufio.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index c546b56..402a028 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -1647,10 +1647,19 @@ static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
>  static unsigned long
>  dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
> + unsigned long count;
> + unsigned long retain_target;
> +
>   struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
> - unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
> +
> + if (!dm_bufio_trylock(c))
> + return 0;
> +
> + count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
> READ_ONCE(c->n_buffers[LIST_DIRTY]);
> - unsigned long retain_target = get_retain_buffers(c);
> + retain_target = get_retain_buffers(c);
> +
> + dm_bufio_unlock(c);
>  
>   return (count < retain_target) ? 0 : (count - retain_target);
>  }
> -- 
> 1.9.1
> 

The reality of your patch is, on a heavily used bufio-backed volume,
you're effectively disabling the ability to reclaim bufio memory via the
shrinker.

Because chances are the bufio lock will always be contended for a
heavily used bufio client.

But after a quick look, I'm left wondering why dm_bufio_shrink_scan()'s
dm_bufio_trylock() isn't sufficient to short-circuit the shrinker for
your use-case?
Maybe __GFP_FS is set so dm_bufio_shrink_scan() only ever uses
dm_bufio_lock()?

Is a shrinker able to be reentered by the VM subsystem
(e.g. shrink_slab() calls down into the same shrinker from multiple tasks
that hit direct reclaim)?
If so, a better fix could be to add a flag to the bufio client so we can
know if the same client is being re-entered via the shrinker (though
it'd likely be a bug for the shrinker to do that!)... and have
dm_bufio_shrink_scan() check that flag and return SHRINK_STOP if set.
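
A rough sketch of that idea - the "shrinking" bit is an assumed new field in
struct dm_bufio_client and the __scan() arguments are approximate, so this is
not an actual patch:

static unsigned long
dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
    struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
    unsigned long freed;

    if (test_and_set_bit(0, &c->shrinking))     /* assumed new field */
        return SHRINK_STOP;

    if (sc->gfp_mask & __GFP_FS)
        dm_bufio_lock(c);
    else if (!dm_bufio_trylock(c)) {
        clear_bit(0, &c->shrinking);
        return SHRINK_STOP;
    }

    freed = __scan(c, sc->nr_to_scan, sc->gfp_mask);
    dm_bufio_unlock(c);
    clear_bit(0, &c->shrinking);
    return freed;
}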

That said, it could be that other parts of dm-bufio are monopolizing the
lock as part of issuing normal IO (to your potentially slow
backend).. in which case just taking the lock from the shrinker even
once will block like you've reported.

It does seem like additional analysis is needed to pinpoint exactly what
is occurring.  Or some additional clarification is needed (e.g. are the
multiple tasks waiting for the bufio lock, as you reported with "1"
above, waiting for the same exact shrinker's ability to get the same
bufio lock?)

But Mikulas, please have a look at this reported issue and let us know
your thoughts.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel