Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Tue 04-09-18 13:30:44, Mikulas Patocka wrote:
> On Tue, 4 Sep 2018, Michal Hocko wrote:
> > Regarding other workloads. AFAIR the problem was due to the
> > wait_iff_congested in the direct reclaim. And I've been arguing that
> > special casing __GFP_NORETRY is not a proper way to handle that case.
> > We have PF_LESS_THROTTLE to handle cases where the caller cannot be
> > really throttled because it is a part of the congestion control. I dunno
> > what happened in that regard since then though.
>
> So, could you take this patch https://patchwork.kernel.org/patch/10505523/
> and commit it to the kernel?
>
> You agreed with that patch, but you didn't commit it.

Because I do not maintain any tree that Linus pulls from. You need to
involve Andrew Morton to get this to the mm tree and then to Linus.

--
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Tue, 4 Sep 2018, Michal Hocko wrote:
> Regarding other workloads. AFAIR the problem was due to the
> wait_iff_congested in the direct reclaim. And I've been arguing that
> special casing __GFP_NORETRY is not a proper way to handle that case.
> We have PF_LESS_THROTTLE to handle cases where the caller cannot be
> really throttled because it is a part of the congestion control. I dunno
> what happened in that regard since then though.

So, could you take this patch https://patchwork.kernel.org/patch/10505523/
and commit it to the kernel?

You agreed with that patch, but you didn't commit it.

Mikulas
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Tue 04-09-18 11:18:44, Mike Snitzer wrote:
> On Tue, Sep 04 2018 at 3:08am -0400, Michal Hocko wrote:
> > On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> > > On Wed, 1 Aug 2018, jing xia wrote:
> > > > We reproduced this issue again and found out the root cause.
> > > > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > > > takes a long time to do the soft_limit_reclaim, because of the huge
> > > > number of memory excess of the memcg.
> > > > Then, all the tasks who do shrink_slab() wait for dm_bufio_lock.
> > > >
> > > > Any suggestions for this? Thanks.
> > >
> > > There's hardly any solution because Michal Hocko refuses to change
> > > __GFP_NORETRY behavior.
> > >
> > > The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
> > > d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock
> > > contention on the dm-bufio lock - the patches don't fix the high CPU
> > > consumption inside the memory allocation, but the kernel code should
> > > wait less on the bufio lock.
> >
> > If you actually looked at the bottom line of the problem then you would
> > quickly find out that the dm-bufio lock is the least of the problems with
> > the soft limit reclaim. This is a misfeature which has been merged and we
> > have to live with it. All we can do is to discourage people from using
> > it and use the much saner low limit instead.
> >
> > So please stop this stupid blaming and try to understand the reasoning
> > behind my arguments.
>
> Yes, this bickering isn't productive. Michal, your responses are pretty
> hard to follow. I'm just trying to follow along on what it is you're
> saying should be done. It isn't clear to me.
>
> PLEASE, restate what we should be doing differently. Or what changes
> need to happen outside of DM, etc.

For this particular case I can only recommend not using the memcg soft
limit. This is guaranteed to stall and there is no way around it because
this is the semantic of the soft limit. Sad, I know.

Regarding other workloads: AFAIR the problem was due to the
wait_iff_congested in the direct reclaim. And I've been arguing that
special casing __GFP_NORETRY is not a proper way to handle that case.
We have PF_LESS_THROTTLE to handle cases where the caller cannot be
really throttled because it is a part of the congestion control. I dunno
what happened in that regard since then though.

--
Michal Hocko
SUSE Labs
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Tue, Sep 04 2018 at 3:08am -0400, Michal Hocko wrote:
> On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> > On Wed, 1 Aug 2018, jing xia wrote:
> > > We reproduced this issue again and found out the root cause.
> > > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > > takes a long time to do the soft_limit_reclaim, because of the huge
> > > number of memory excess of the memcg.
> > > Then, all the tasks who do shrink_slab() wait for dm_bufio_lock.
> > >
> > > Any suggestions for this? Thanks.
> >
> > There's hardly any solution because Michal Hocko refuses to change
> > __GFP_NORETRY behavior.
> >
> > The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
> > d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock
> > contention on the dm-bufio lock - the patches don't fix the high CPU
> > consumption inside the memory allocation, but the kernel code should
> > wait less on the bufio lock.
>
> If you actually looked at the bottom line of the problem then you would
> quickly find out that the dm-bufio lock is the least of the problems with
> the soft limit reclaim. This is a misfeature which has been merged and we
> have to live with it. All we can do is to discourage people from using
> it and use the much saner low limit instead.
>
> So please stop this stupid blaming and try to understand the reasoning
> behind my arguments.

Yes, this bickering isn't productive. Michal, your responses are pretty
hard to follow. I'm just trying to follow along on what it is you're
saying should be done. It isn't clear to me.

PLEASE, restate what we should be doing differently. Or what changes
need to happen outside of DM, etc.

Thanks,
Mike
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> On Wed, 1 Aug 2018, jing xia wrote:
> > We reproduced this issue again and found out the root cause.
> > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > takes a long time to do the soft_limit_reclaim, because of the huge
> > number of memory excess of the memcg.
> > Then, all the tasks who do shrink_slab() wait for dm_bufio_lock.
> >
> > Any suggestions for this? Thanks.
>
> There's hardly any solution because Michal Hocko refuses to change
> __GFP_NORETRY behavior.
>
> The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
> d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention
> on the dm-bufio lock - the patches don't fix the high CPU consumption
> inside the memory allocation, but the kernel code should wait less on the
> bufio lock.

If you actually looked at the bottom line of the problem then you would
quickly find out that the dm-bufio lock is the least of the problems with
the soft limit reclaim. This is a misfeature which has been merged and we
have to live with it. All we can do is to discourage people from using
it and use the much saner low limit instead.

So please stop this stupid blaming and try to understand the reasoning
behind my arguments.

--
Michal Hocko
SUSE Labs
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Wed, 1 Aug 2018, jing xia wrote:
> We reproduced this issue again and found out the root cause.
> dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> takes a long time to do the soft_limit_reclaim, because of the huge
> number of memory excess of the memcg.
> Then, all the tasks who do shrink_slab() wait for dm_bufio_lock.
>
> Any suggestions for this? Thanks.

There's hardly any solution because Michal Hocko refuses to change
__GFP_NORETRY behavior.

The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and
d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention
on the dm-bufio lock - the patches don't fix the high CPU consumption
inside the memory allocation, but the kernel code should wait less on the
bufio lock.

Mikulas

> On Thu, Jun 14, 2018 at 3:31 PM, Michal Hocko wrote:
> > On Thu 14-06-18 15:18:58, jing xia wrote:
> > [...]
> >> PID: 22920  TASK: ffc0120f1a00  CPU: 1  COMMAND: "kworker/u8:2"
> >>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> >>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> >>  #2 [ffc0282af450] schedule at ff8008850f4c
> >>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> >>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> >>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> >
> > This trace doesn't provide the full picture unfortunately. Waiting in
> > the direct reclaim means that the underlying bdi is congested. The real
> > question is why it doesn't flush IO in time.
> >
> >>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
> >>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
> >>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
> >>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
> >> #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
> >> #11 [ffc0282af990] try_to_free_pages at ff8008178f3c
> >> #12 [ffc0282afa30] __perform_reclaim at ff8008169130
> >> #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
> >> #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
> >> #15 [ffc0282afbe0] alloc_buffer at ff8008591a94
> >> #16 [ffc0282afc20] __bufio_new at ff8008592e94
> >> #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
> >> #18 [ffc0282afd20] verity_prefetch_io at ff8008598384
> >> #19 [ffc0282afd70] process_one_work at ff80080b5b3c
> >> #20 [ffc0282afdc0] worker_thread at ff80080b64fc
> >> #21 [ffc0282afe20] kthread at ff80080bae34
> >>
> >> > Mikulas
> >
> > --
> > Michal Hocko
> > SUSE Labs
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Wed 01-08-18 10:48:00, jing xia wrote:
> We reproduced this issue again and found out the root cause.
> dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> takes a long time to do the soft_limit_reclaim, because of the huge
> number of memory excess of the memcg.
> Then, all the tasks who do shrink_slab() wait for dm_bufio_lock.
>
> Any suggestions for this? Thanks.

Do not use the soft limit?

--
Michal Hocko
SUSE Labs
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
We reproduced this issue again and found out the root cause.
dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
takes a long time to do the soft_limit_reclaim, because of the huge
number of memory excess of the memcg.
Then, all the tasks who do shrink_slab() wait for dm_bufio_lock.

Any suggestions for this? Thanks.

On Thu, Jun 14, 2018 at 3:31 PM, Michal Hocko wrote:
> On Thu 14-06-18 15:18:58, jing xia wrote:
> [...]
>> PID: 22920  TASK: ffc0120f1a00  CPU: 1  COMMAND: "kworker/u8:2"
>>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>>  #2 [ffc0282af450] schedule at ff8008850f4c
>>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
>
> This trace doesn't provide the full picture unfortunately. Waiting in
> the direct reclaim means that the underlying bdi is congested. The real
> question is why it doesn't flush IO in time.
>
>>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
>> #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
>> #11 [ffc0282af990] try_to_free_pages at ff8008178f3c
>> #12 [ffc0282afa30] __perform_reclaim at ff8008169130
>> #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
>> #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
>> #15 [ffc0282afbe0] alloc_buffer at ff8008591a94
>> #16 [ffc0282afc20] __bufio_new at ff8008592e94
>> #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
>> #18 [ffc0282afd20] verity_prefetch_io at ff8008598384
>> #19 [ffc0282afd70] process_one_work at ff80080b5b3c
>> #20 [ffc0282afdc0] worker_thread at ff80080b64fc
>> #21 [ffc0282afe20] kthread at ff80080bae34
>>
>> > Mikulas
>
> --
> Michal Hocko
> SUSE Labs
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Thu 28-06-18 22:43:29, Mikulas Patocka wrote:
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> > On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> > > On Mon, 25 Jun 2018, Michal Hocko wrote:
> > > > > And the throttling in dm-bufio prevents kswapd from making forward
> > > > > progress, causing this situation...
> > > >
> > > > Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> > > > circles like that? Are you even listening?
> > > >
> > > > [...]
> > > >
> > > > > And so what do you want to do to prevent block drivers from sleeping?
> > > >
> > > > use the existing means we have.
> > >
> > > So - do you want this patch?
> > >
> > > There is no behavior difference between changing the allocator (so that
> > > it implies PF_THROTTLE_LESS for block drivers) and changing all the
> > > block drivers to explicitly set PF_THROTTLE_LESS.
> >
> > As long as you can reliably detect those users. And using gfp_mask is
>
> You can detect them if __GFP_IO is not set and __GFP_NORETRY is set. You
> can grep the kernel for __GFP_NORETRY to find all the users.

It seems that arguing doesn't make much sense here. I will not repeat
myself...

> > about the worst way to achieve that because users tend to be creative
> > when it comes to using gfp mask. PF_THROTTLE_LESS in general is a
> > way to tell the allocator that _you_ are the one to help the reclaim by
> > cleaning data.
>
> But using PF_LESS_THROTTLE explicitly adds more lines of code than
> implying PF_LESS_THROTTLE in the allocator.

Yes, and it will also make the code more explicit about the intention and
so it will be easier to maintain long term.

> From: Mikulas Patocka
> Subject: [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o
>
> When doing __GFP_NORETRY allocation, the system may sleep in
> wait_iff_congested if there are too many dirty pages. Unfortunately this
> sleeping may slow down kswapd, preventing it from doing writeback and
> resolving the congestion.

This description is misleading at best if not outright wrong.

	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
		wait_iff_congested(BLK_RW_ASYNC, HZ/10);

so this is an explicit throttling of the direct reclaim. So I would use
the following wording instead:

"
It has been noticed that congestion throttling can slow down allocation
paths that participate in the IO and thus help the memory reclaim.
Stalling those allocations is therefore not productive. Moreover the
mempool allocator and md variants of the same already implement their
own throttling which has a better way to be feedback driven. Stalling
at the page allocator is therefore even counterproductive.

PF_LESS_THROTTLE is a task flag denoting an allocation context that is
participating in the memory reclaim which fits into these IO paths'
model, so use the flag and make the page allocator aware they are
special and they do not really want any dirty data throttling.
"

With a clearer patch description and some data to back it up, you can
add

Acked-by: Michal Hocko	# mempool_alloc and bvec_alloc

I cannot really comment on the other md allocators though because I am
not familiar with those.

> This patch fixes it by setting PF_LESS_THROTTLE when allocating memory
> for block device drivers.
>
> Signed-off-by: Mikulas Patocka
> Cc: sta...@vger.kernel.org
>
> ---
>  block/bio.c                   |    4 
>  drivers/md/dm-bufio.c         |   14 +++---
>  drivers/md/dm-crypt.c         |    8 
>  drivers/md/dm-integrity.c     |    4 
>  drivers/md/dm-kcopyd.c        |    3 +++
>  drivers/md/dm-verity-target.c |    4 
>  drivers/md/dm-writecache.c    |    4 
>  mm/mempool.c                  |    4 
>  8 files changed, 42 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/mm/mempool.c
> ===================================================================
> --- linux-2.6.orig/mm/mempool.c	2018-06-29 03:47:16.29000 +0200
> +++ linux-2.6/mm/mempool.c	2018-06-29 03:47:16.27000 +0200
> @@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
>  	unsigned long flags;
>  	wait_queue_entry_t wait;
>  	gfp_t gfp_temp;
> +	unsigned old_flags;
>
>  	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> @@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
>
>  repeat_alloc:
>
> +	old_flags = current->flags & PF_LESS_THROTTLE;
> +	current->flags |= PF_LESS_THROTTLE;
>  	element = pool->alloc(gfp_temp, pool->pool_data);
> +	current_restore_flags(old_flags, PF_LESS_THROTTLE);
>  	if (likely(element != NULL))
>  		return element;
>
> Index: linux-2.6/block/bio.c
> ===================================================================
> ---
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Mon, 25 Jun 2018, Michal Hocko wrote:
> On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> > On Mon, 25 Jun 2018, Michal Hocko wrote:
> > > > And the throttling in dm-bufio prevents kswapd from making forward
> > > > progress, causing this situation...
> > >
> > > Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> > > circles like that? Are you even listening?
> > >
> > > [...]
> > >
> > > > And so what do you want to do to prevent block drivers from sleeping?
> > >
> > > use the existing means we have.
> >
> > So - do you want this patch?
> >
> > There is no behavior difference between changing the allocator (so that
> > it implies PF_THROTTLE_LESS for block drivers) and changing all the
> > block drivers to explicitly set PF_THROTTLE_LESS.
>
> As long as you can reliably detect those users. And using gfp_mask is

You can detect them if __GFP_IO is not set and __GFP_NORETRY is set. You
can grep the kernel for __GFP_NORETRY to find all the users.

> about the worst way to achieve that because users tend to be creative
> when it comes to using gfp mask. PF_THROTTLE_LESS in general is a
> way to tell the allocator that _you_ are the one to help the reclaim by
> cleaning data.

But using PF_LESS_THROTTLE explicitly adds more lines of code than
implying PF_LESS_THROTTLE in the allocator.

> > But if you insist that the allocator can't be changed, we have to
> > repeat the same code over and over again in the block drivers.
>
> I am not familiar with the patched code but the mempool change at least
> makes sense (bvec_alloc seems to fall back to mempool which then makes
> sense as well). If others in md/ do the same thing
>
> I would just use current_restore_flags rather than open code it.
>
> Thanks!

So, do you accept this patch?

Mikulas


From: Mikulas Patocka
Subject: [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o

When doing __GFP_NORETRY allocation, the system may sleep in
wait_iff_congested if there are too many dirty pages. Unfortunately this
sleeping may slow down kswapd, preventing it from doing writeback and
resolving the congestion.

This patch fixes it by setting PF_LESS_THROTTLE when allocating memory
for block device drivers.

Signed-off-by: Mikulas Patocka
Cc: sta...@vger.kernel.org

---
 block/bio.c                   |    4 
 drivers/md/dm-bufio.c         |   14 +++---
 drivers/md/dm-crypt.c         |    8 
 drivers/md/dm-integrity.c     |    4 
 drivers/md/dm-kcopyd.c        |    3 +++
 drivers/md/dm-verity-target.c |    4 
 drivers/md/dm-writecache.c    |    4 
 mm/mempool.c                  |    4 
 8 files changed, 42 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c	2018-06-29 03:47:16.29000 +0200
+++ linux-2.6/mm/mempool.c	2018-06-29 03:47:16.27000 +0200
@@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
 	unsigned long flags;
 	wait_queue_entry_t wait;
 	gfp_t gfp_temp;
+	unsigned old_flags;
 
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
@@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
 
 repeat_alloc:
 
+	old_flags = current->flags & PF_LESS_THROTTLE;
+	current->flags |= PF_LESS_THROTTLE;
 	element = pool->alloc(gfp_temp, pool->pool_data);
+	current_restore_flags(old_flags, PF_LESS_THROTTLE);
 	if (likely(element != NULL))
 		return element;
 
Index: linux-2.6/block/bio.c
===================================================================
--- linux-2.6.orig/block/bio.c	2018-06-29 03:47:16.29000 +0200
+++ linux-2.6/block/bio.c	2018-06-29 03:47:16.27000 +0200
@@ -217,6 +217,7 @@ fallback:
 	} else {
 		struct biovec_slab *bvs = bvec_slabs + *idx;
 		gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
+		unsigned old_flags;
 
 		/*
 		 * Make this allocation restricted and don't dump info on
@@ -229,7 +230,10 @@ fallback:
 		 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
 		 * is set, retry with the 1-entry mempool
 		 */
+		old_flags = current->flags & PF_LESS_THROTTLE;
+		current->flags |= PF_LESS_THROTTLE;
 		bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
+		current_restore_flags(old_flags, PF_LESS_THROTTLE);
 		if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
 			*idx = BVEC_POOL_MAX;
 			goto fallback;

Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> > > And the throttling in dm-bufio prevents kswapd from making forward
> > > progress, causing this situation...
> >
> > Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> > circles like that? Are you even listening?
> >
> > [...]
> >
> > > And so what do you want to do to prevent block drivers from sleeping?
> >
> > use the existing means we have.
>
> So - do you want this patch?
>
> There is no behavior difference between changing the allocator (so that
> it implies PF_THROTTLE_LESS for block drivers) and changing all the
> block drivers to explicitly set PF_THROTTLE_LESS.

As long as you can reliably detect those users. And using gfp_mask is
about the worst way to achieve that because users tend to be creative
when it comes to using gfp mask. PF_THROTTLE_LESS in general is a
way to tell the allocator that _you_ are the one to help the reclaim by
cleaning data.

> But if you insist that the allocator can't be changed, we have to repeat
> the same code over and over again in the block drivers.

I am not familiar with the patched code but the mempool change at least
makes sense (bvec_alloc seems to fall back to mempool which then makes
sense as well). If others in md/ do the same thing

I would just use current_restore_flags rather than open code it.

Thanks!

--
Michal Hocko
SUSE Labs
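The "use current_restore_flags rather than open code it" remark refers to the task-flag save/restore idiom that the patch wraps around each allocation. A minimal userspace sketch of that idiom follows; `task_flags` stands in for the kernel's `current->flags`, the flag value is only illustrative, and `alloc_less_throttled` is a hypothetical wrapper, but the macro body mirrors the kernel's `current_restore_flags()` semantics:

```c
/* PF_LESS_THROTTLE is a real task flag bit in the kernel; the value here
 * is only for illustration, and task_flags stands in for current->flags. */
#define PF_LESS_THROTTLE 0x00100000

static unsigned int task_flags;

/* Mirrors the kernel's current_restore_flags(): clear the chosen flag
 * bits, then put back whichever of them were set beforehand. */
#define current_restore_flags(orig_flags, flags) \
	do { \
		task_flags &= ~(flags); \
		task_flags |= (orig_flags) & (flags); \
	} while (0)

/* The pattern repeated around each allocation in the patch: remember the
 * old state of PF_LESS_THROTTLE, set it for the duration of the call,
 * then restore it - so nesting works even if the flag was already set. */
static void *alloc_less_throttled(void *(*alloc_fn)(void))
{
	unsigned int old_flags = task_flags & PF_LESS_THROTTLE;
	void *element;

	task_flags |= PF_LESS_THROTTLE;	/* tell reclaim not to stall us */
	element = alloc_fn();
	current_restore_flags(old_flags, PF_LESS_THROTTLE);
	return element;
}
```

The point of the masked save (`old_flags = task_flags & PF_LESS_THROTTLE`) is that restoring must not clear the flag for a caller that already had it set.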
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Mon, 25 Jun 2018, Michal Hocko wrote:
> > And the throttling in dm-bufio prevents kswapd from making forward
> > progress, causing this situation...
>
> Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
> circles like that? Are you even listening?
>
> [...]
>
> > And so what do you want to do to prevent block drivers from sleeping?
>
> use the existing means we have.

So - do you want this patch?

There is no behavior difference between changing the allocator (so that
it implies PF_THROTTLE_LESS for block drivers) and changing all the
block drivers to explicitly set PF_THROTTLE_LESS.

But if you insist that the allocator can't be changed, we have to repeat
the same code over and over again in the block drivers.

Mikulas

---
 block/bio.c                   |    4 
 drivers/md/dm-bufio.c         |   14 +++---
 drivers/md/dm-crypt.c         |    8 
 drivers/md/dm-integrity.c     |    4 
 drivers/md/dm-kcopyd.c        |    3 +++
 drivers/md/dm-verity-target.c |    4 
 drivers/md/dm-writecache.c    |    4 
 mm/mempool.c                  |    4 
 8 files changed, 42 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c	2018-06-25 16:32:19.21000 +0200
+++ linux-2.6/mm/mempool.c	2018-06-25 16:32:19.2 +0200
@@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
 	unsigned long flags;
 	wait_queue_entry_t wait;
 	gfp_t gfp_temp;
+	unsigned old_flags;
 
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
@@ -381,7 +382,10 @@ void *mempool_alloc(mempool_t *pool, gfp
 
 repeat_alloc:
 
+	old_flags = current->flags & PF_LESS_THROTTLE;
+	current->flags |= PF_LESS_THROTTLE;
 	element = pool->alloc(gfp_temp, pool->pool_data);
+	current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
 	if (likely(element != NULL))
 		return element;
 
Index: linux-2.6/block/bio.c
===================================================================
--- linux-2.6.orig/block/bio.c	2018-06-25 16:32:19.21000 +0200
+++ linux-2.6/block/bio.c	2018-06-25 16:32:19.2 +0200
@@ -217,6 +217,7 @@ fallback:
 	} else {
 		struct biovec_slab *bvs = bvec_slabs + *idx;
 		gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
+		unsigned old_flags;
 
 		/*
 		 * Make this allocation restricted and don't dump info on
@@ -229,7 +230,10 @@ fallback:
 		 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
 		 * is set, retry with the 1-entry mempool
 		 */
+		old_flags = current->flags & PF_LESS_THROTTLE;
+		current->flags |= PF_LESS_THROTTLE;
 		bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
+		current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
 		if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
 			*idx = BVEC_POOL_MAX;
 			goto fallback;

Index: linux-2.6/drivers/md/dm-bufio.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-bufio.c	2018-06-25 16:32:19.21000 +0200
+++ linux-2.6/drivers/md/dm-bufio.c	2018-06-25 16:32:19.2 +0200
@@ -356,6 +356,7 @@ static void __cache_size_refresh(void)
 static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
 			       unsigned char *data_mode)
 {
+	void *ptr;
 	if (unlikely(c->slab_cache != NULL)) {
 		*data_mode = DATA_MODE_SLAB;
 		return kmem_cache_alloc(c->slab_cache, gfp_mask);
@@ -363,9 +364,14 @@ static void *alloc_buffer_data(struct dm
 	if (c->block_size <= KMALLOC_MAX_SIZE &&
 	    gfp_mask & __GFP_NORETRY) {
+		unsigned old_flags;
 		*data_mode = DATA_MODE_GET_FREE_PAGES;
-		return (void *)__get_free_pages(gfp_mask,
+		old_flags = current->flags & PF_LESS_THROTTLE;
+		current->flags |= PF_LESS_THROTTLE;
+		ptr = (void *)__get_free_pages(gfp_mask,
 			c->sectors_per_block_bits - (PAGE_SHIFT - SECTOR_SHIFT));
+		current->flags = (current->flags & ~PF_LESS_THROTTLE) | old_flags;
+		return ptr;
 	}
 
 	*data_mode = DATA_MODE_VMALLOC;
@@ -381,8 +387,10 @@ static void *alloc_buffer_data(struct dm
 	 */
 	if (gfp_mask & __GFP_NORETRY) {
 		unsigned noio_flag = memalloc_noio_save();
-		void *ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
-
+		unsigned old_flags = current->flags & PF_LESS_THROTTLE;
+
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Mon 25-06-18 09:53:34, Mikulas Patocka wrote:
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> > On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not
> > > > > > > > set (i.e. the request comes from a block device driver or a
> > > > > > > > filesystem), we should not sleep.
> > > > > > >
> > > > > > > Why? How are you going to audit all the callers that the
> > > > > > > behavior makes sense and moreover how are you going to ensure
> > > > > > > that future usage will still make sense. The more subtle side
> > > > > > > effects gfp flags have the harder they are to maintain.
> > > > > >
> > > > > > So just as an exercise. Try to explain the above semantic to
> > > > > > users. We currently have the following.
> > > > > >
> > > > > > * __GFP_NORETRY: The VM implementation will try only very
> > > > > >   lightweight memory direct reclaim to get some memory under
> > > > > >   memory pressure (thus it can sleep). It will avoid disruptive
> > > > > >   actions like OOM killer. The caller must handle the failure
> > > > > >   which is quite likely to happen under heavy memory pressure.
> > > > > >   The flag is suitable when failure can easily be handled at
> > > > > >   small cost, such as reduced throughput
> > > > > >
> > > > > > * __GFP_FS can call down to the low-level FS. Clearing the flag
> > > > > >   avoids the allocator recursing into the filesystem which might
> > > > > >   already be holding locks.
> > > > > >
> > > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)?
> > > > > > What is the actual semantic without explaining the whole reclaim
> > > > > > or force users to look into the code to understand that? What
> > > > > > about GFP_NOIO | __GFP_NORETRY? What does it mean to that "should
> > > > > > not sleep". Do all shrinkers have to follow that as well?
> > > > >
> > > > > My reasoning was that there is broken code that uses __GFP_NORETRY
> > > > > and assumes that it can't fail - so conditioning the change on
> > > > > !__GFP_FS would minimize the disruption to the broken code.
> > > > >
> > > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those
> > > > > 16 broken cases that assume that __GFP_NORETRY can't fail), I'm OK
> > > > > with that.
> > > >
> > > > As I've already said, this is a subtle change which is really hard to
> > > > reason about. Throttling on congestion has its meaning and reason.
> > > > Look at why we are doing that in the first place. You cannot simply
> > > > say this
> > >
> > > So - explain why throttling is needed. You support throttling, I
> > > don't, so you have to explain it :)
> > >
> > > > is ok based on your specific usecase. We do have means to achieve
> > > > that. It is explicit and thus it will be applied only where it makes
> > > > sense. You keep repeating that implicit behavior change for everybody
> > > > is better.
> > >
> > > I don't want to change it for everybody. I want to change it for block
> > > device drivers. I don't care what you do with non-block drivers.
> >
> > Well, it is usually the onus of the patch submitter to justify any
> > change. But let me be nice on you, for once. This throttling is
> > triggered only if all the pages we have encountered during the reclaim
> > attempt are dirty and that means that we are rushing through the LRU
> > list quicker than the flushers are able to clean. If we didn't throttle
> > we could hit stronger reclaim priorities (aka scan more to reclaim
> > memory) and reclaim more pages as a result.
>
> And the throttling in dm-bufio prevents kswapd from making forward
> progress, causing this situation...

Which is what we have PF_THROTTLE_LESS for. Geez, do we have to go in
circles like that? Are you even listening?

[...]

> And so what do you want to do to prevent block drivers from sleeping?

use the existing means we have.

--
Michal Hocko
SUSE Labs
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Mon, 25 Jun 2018, Michal Hocko wrote: > On Fri 22-06-18 14:57:10, Mikulas Patocka wrote: > > > > > > On Fri, 22 Jun 2018, Michal Hocko wrote: > > > > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote: > > > > > > > > > > > > On Fri, 22 Jun 2018, Michal Hocko wrote: > > > > > > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote: > > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote: > > > > > [...] > > > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set > > > > > > > (i.e. the > > > > > > > request comes from a block device driver or a filesystem), we > > > > > > > should not > > > > > > > sleep. > > > > > > > > > > > > Why? How are you going to audit all the callers that the behavior > > > > > > makes > > > > > > sense and moreover how are you going to ensure that future usage > > > > > > will > > > > > > still make sense. The more subtle side effects gfp flags have the > > > > > > harder > > > > > > they are to maintain. > > > > > > > > > > So just as an exercise. Try to explain the above semantic to users. > > > > > We > > > > > currently have the following. > > > > > > > > > > * __GFP_NORETRY: The VM implementation will try only very lightweight > > > > > * memory direct reclaim to get some memory under memory pressure > > > > > (thus > > > > > * it can sleep). It will avoid disruptive actions like OOM killer. > > > > > The > > > > > * caller must handle the failure which is quite likely to happen > > > > > under > > > > > * heavy memory pressure. The flag is suitable when failure can > > > > > easily be > > > > > * handled at small cost, such as reduced throughput > > > > > > > > > > * __GFP_FS can call down to the low-level FS. Clearing the flag > > > > > avoids the > > > > > * allocator recursing into the filesystem which might already be > > > > > holding > > > > > * locks. > > > > > > > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? 
> > > > > What > > > > > is the actual semantic without explaining the whole reclaim or force > > > > > users to look into the code to understand that? What about GFP_NOIO | > > > > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all > > > > > shrinkers have to follow that as well? > > > > > > > > My reasoning was that there is broken code that uses __GFP_NORETRY and > > > > assumes that it can't fail - so conditioning the change on !__GFP_FS > > > > would > > > > minimize the diruption to the broken code. > > > > > > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 > > > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with > > > > that. > > > > > > As I've already said, this is a subtle change which is really hard to > > > reason about. Throttling on congestion has its meaning and reason. Look > > > at why we are doing that in the first place. You cannot simply say this > > > > So - explain why is throttling needed. You support throttling, I don't, so > > you have to explain it :) > > > > > is ok based on your specific usecase. We do have means to achieve that. > > > It is explicit and thus it will be applied only where it makes sense. > > > You keep repeating that implicit behavior change for everybody is > > > better. > > > > I don't want to change it for everybody. I want to change it for block > > device drivers. I don't care what you do with non-block drivers. > > Well, it is usually onus of the patch submitter to justify any change. > But let me be nice on you, for once. This throttling is triggered only > if we all the pages we have encountered during the reclaim attempt are > dirty and that means that we are rushing through the LRU list quicker > than flushers are able to clean. If we didn't throttle we could hit > stronger reclaim priorities (aka scan more to reclaim memory) and > reclaim more pages as a result. 
And the throttling in dm-bufio prevents kswapd from making forward progress, causing this situation... > > I'm sure you'll come up with another creative excuse why GFP_NORETRY > > allocations need incur deliberate 100ms delays in block device drivers. > > ... is not really productive. I've tried to explain why I am not _sure_ what > possible side effects such a change might have and your hand waving > didn't really convince me. MD is not the only user of the page > allocator... But you are just doing that now - you're just coming up with another great excuse why block device drivers need to sleep 100ms. The system stops to a crawl when block device requests take 100ms and you - instead of fixing it - are just arguing how is it needed. > > > I guess we will not agree on that part. I consider it a hack > > > rather than a systematic solution. I can easily imagine that we just > > > find out other call sites that would cause over reclaim or similar > > > > If a __GFP_NORETRY allocation does overreclaim - it could be fixed by > > returning NULL instead of doing
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri 22-06-18 14:57:10, Mikulas Patocka wrote: > > > On Fri, 22 Jun 2018, Michal Hocko wrote: > > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote: > > > > > > > > > On Fri, 22 Jun 2018, Michal Hocko wrote: > > > > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote: > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote: > > > > [...] > > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set > > > > > > (i.e. the > > > > > > request comes from a block device driver or a filesystem), we > > > > > > should not > > > > > > sleep. > > > > > > > > > > Why? How are you going to audit all the callers that the behavior > > > > > makes > > > > > sense and moreover how are you going to ensure that future usage will > > > > > still make sense. The more subtle side effects gfp flags have the > > > > > harder > > > > > they are to maintain. > > > > > > > > So just as an excercise. Try to explain the above semantic to users. We > > > > currently have the following. > > > > > > > > * __GFP_NORETRY: The VM implementation will try only very lightweight > > > > * memory direct reclaim to get some memory under memory pressure > > > > (thus > > > > * it can sleep). It will avoid disruptive actions like OOM killer. > > > > The > > > > * caller must handle the failure which is quite likely to happen > > > > under > > > > * heavy memory pressure. The flag is suitable when failure can > > > > easily be > > > > * handled at small cost, such as reduced throughput > > > > > > > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids > > > > the > > > > * allocator recursing into the filesystem which might already be > > > > holding > > > > * locks. > > > > > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What > > > > is the actual semantic without explaining the whole reclaim or force > > > > users to look into the code to understand that? What about GFP_NOIO | > > > > __GFP_NORETRY? 
What does it mean to that "should not sleep". Do all > > > > shrinkers have to follow that as well? > > > > > > My reasoning was that there is broken code that uses __GFP_NORETRY and > > > assumes that it can't fail - so conditioning the change on !__GFP_FS > > > would > > > minimize the disruption to the broken code. > > > > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 > > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that. > > > > As I've already said, this is a subtle change which is really hard to > > reason about. Throttling on congestion has its meaning and reason. Look > > at why we are doing that in the first place. You cannot simply say this > > So - explain why is throttling needed. You support throttling, I don't, so > you have to explain it :) > > > is ok based on your specific usecase. We do have means to achieve that. > > It is explicit and thus it will be applied only where it makes sense. > > You keep repeating that implicit behavior change for everybody is > > better. > > I don't want to change it for everybody. I want to change it for block > device drivers. I don't care what you do with non-block drivers. Well, it is usually the onus of the patch submitter to justify any change. But let me be nice on you, for once. This throttling is triggered only if all the pages we have encountered during the reclaim attempt are dirty and that means that we are rushing through the LRU list quicker than flushers are able to clean. If we didn't throttle we could hit stronger reclaim priorities (aka scan more to reclaim memory) and reclaim more pages as a result. 
The specification says that > callers must handle failure of __GFP_NORETRY allocations. > > So yes - if you think that just skipping throttling on __GFP_NORETRY could > cause excessive CPU consumption trying to reclaim unreclaimable pages or > something like that - then you can add more points where the __GFP_NORETRY > is failed with NULL to avoid the excessive CPU consumption. Which is exactly something I do not want to do. Spread __GFP_NORETRY all over the reclaim code. Especially for something we already have means for... -- Michal Hocko SUSE Labs -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri, 22 Jun 2018, Michal Hocko wrote: > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote: > > > > > > On Fri, 22 Jun 2018, Michal Hocko wrote: > > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote: > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote: > > > [...] > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. > > > > > the > > > > > request comes from a block device driver or a filesystem), we should > > > > > not > > > > > sleep. > > > > > > > > Why? How are you going to audit all the callers that the behavior makes > > > > sense and moreover how are you going to ensure that future usage will > > > > still make sense. The more subtle side effects gfp flags have the harder > > > > they are to maintain. > > > > > > So just as an excercise. Try to explain the above semantic to users. We > > > currently have the following. > > > > > > * __GFP_NORETRY: The VM implementation will try only very lightweight > > > * memory direct reclaim to get some memory under memory pressure (thus > > > * it can sleep). It will avoid disruptive actions like OOM killer. The > > > * caller must handle the failure which is quite likely to happen under > > > * heavy memory pressure. The flag is suitable when failure can easily > > > be > > > * handled at small cost, such as reduced throughput > > > > > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids > > > the > > > * allocator recursing into the filesystem which might already be > > > holding > > > * locks. > > > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What > > > is the actual semantic without explaining the whole reclaim or force > > > users to look into the code to understand that? What about GFP_NOIO | > > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all > > > shrinkers have to follow that as well? 
> > > > My reasoning was that there is broken code that uses __GFP_NORETRY and > > assumes that it can't fail - so conditioning the change on !__GFP_FS would > > minimize the diruption to the broken code. > > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that. > > As I've already said, this is a subtle change which is really hard to > reason about. Throttling on congestion has its meaning and reason. Look > at why we are doing that in the first place. You cannot simply say this So - explain why is throttling needed. You support throttling, I don't, so you have to explain it :) > is ok based on your specific usecase. We do have means to achieve that. > It is explicit and thus it will be applied only where it makes sense. > You keep repeating that implicit behavior change for everybody is > better. I don't want to change it for everybody. I want to change it for block device drivers. I don't care what you do with non-block drivers. > I guess we will not agree on that part. I consider it a hack > rather than a systematic solution. I can easily imagine that we just > find out other call sites that would cause over reclaim or similar If a __GFP_NORETRY allocation does overreclaim - it could be fixed by returning NULL instead of doing overreclaim. The specification says that callers must handle failure of __GFP_NORETRY allocations. So yes - if you think that just skipping throttling on __GFP_NORETRY could cause excessive CPU consumption trying to reclaim unreclaimable pages or something like that - then you can add more points where the __GFP_NORETRY is failed with NULL to avoid the excessive CPU consumption. > problems because they are not throttled on too many dirty pages due to > congested storage. What are we going to then? Add another magic gfp > flag? Really, try to not add even more subtle side effects for gfp > flags. 
We _do_ have ways to accomplish what your particular usecase > requires. > > -- > Michal Hocko > SUSE Labs Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri 22-06-18 08:44:52, Mikulas Patocka wrote: > On Fri, 22 Jun 2018, Michal Hocko wrote: [...] > > Why? How are you going to audit all the callers that the behavior makes > > sense and moreover how are you going to ensure that future usage will > > still make sense. The more subtle side effects gfp flags have the harder > > they are to maintain. > > I did audit them - see the previous email in this thread: > https://www.redhat.com/archives/dm-devel/2018-June/thread.html I do not see any mention of throttling expectations for those users. You have focused only on the allocation failure fallback AFAIR. -- Michal Hocko SUSE Labs -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri 22-06-18 08:52:09, Mikulas Patocka wrote: > > > On Fri, 22 Jun 2018, Michal Hocko wrote: > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote: > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote: > > [...] > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. > > > > the > > > > request comes from a block device driver or a filesystem), we should > > > > not > > > > sleep. > > > > > > Why? How are you going to audit all the callers that the behavior makes > > > sense and moreover how are you going to ensure that future usage will > > > still make sense. The more subtle side effects gfp flags have the harder > > > they are to maintain. > > > > So just as an excercise. Try to explain the above semantic to users. We > > currently have the following. > > > > * __GFP_NORETRY: The VM implementation will try only very lightweight > > * memory direct reclaim to get some memory under memory pressure (thus > > * it can sleep). It will avoid disruptive actions like OOM killer. The > > * caller must handle the failure which is quite likely to happen under > > * heavy memory pressure. The flag is suitable when failure can easily be > > * handled at small cost, such as reduced throughput > > > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the > > * allocator recursing into the filesystem which might already be holding > > * locks. > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What > > is the actual semantic without explaining the whole reclaim or force > > users to look into the code to understand that? What about GFP_NOIO | > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all > > shrinkers have to follow that as well? > > My reasoning was that there is broken code that uses __GFP_NORETRY and > assumes that it can't fail - so conditioning the change on !__GFP_FS would > minimize the diruption to the broken code. 
> > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that. As I've already said, this is a subtle change which is really hard to reason about. Throttling on congestion has its meaning and reason. Look at why we are doing that in the first place. You cannot simply say this is ok based on your specific usecase. We do have means to achieve that. It is explicit and thus it will be applied only where it makes sense. You keep repeating that implicit behavior change for everybody is better. I guess we will not agree on that part. I consider it a hack rather than a systematic solution. I can easily imagine that we just find out other call sites that would cause over reclaim or similar problems because they are not throttled on too many dirty pages due to congested storage. What are we going to do then? Add another magic gfp flag? Really, try to not add even more subtle side effects for gfp flags. We _do_ have ways to accomplish what your particular usecase requires. -- Michal Hocko SUSE Labs -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri, 22 Jun 2018, Michal Hocko wrote: > On Fri 22-06-18 11:01:51, Michal Hocko wrote: > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote: > [...] > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the > > > request comes from a block device driver or a filesystem), we should not > > > sleep. > > > > Why? How are you going to audit all the callers that the behavior makes > > sense and moreover how are you going to ensure that future usage will > > still make sense. The more subtle side effects gfp flags have the harder > > they are to maintain. > > So just as an exercise. Try to explain the above semantic to users. We > currently have the following. > > * __GFP_NORETRY: The VM implementation will try only very lightweight > * memory direct reclaim to get some memory under memory pressure (thus > * it can sleep). It will avoid disruptive actions like OOM killer. The > * caller must handle the failure which is quite likely to happen under > * heavy memory pressure. The flag is suitable when failure can easily be > * handled at small cost, such as reduced throughput > > * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the > * allocator recursing into the filesystem which might already be holding > * locks. > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What > is the actual semantic without explaining the whole reclaim or force > users to look into the code to understand that? What about GFP_NOIO | > __GFP_NORETRY? What does it mean to that "should not sleep". Do all > shrinkers have to follow that as well? My reasoning was that there is broken code that uses __GFP_NORETRY and assumes that it can't fail - so conditioning the change on !__GFP_FS would minimize the disruption to the broken code. Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that. 
Mikulas

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
 	 * the LRU too quickly.
 	 */
 	if (!sc->hibernation_mode && !current_is_kswapd() &&
+	    !(sc->gfp_mask & __GFP_NORETRY) &&
 	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);

-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri, 22 Jun 2018, Michal Hocko wrote: > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote: > [...] > > > But seriously, isn't the best way around the throttling issue to use > > > PF_LESS_THROTTLE? > > > > Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would > > be better to change it just in one place than to add PF_LESS_THROTTLE to > > every block device driver (because adding it to every block driver results > > in more code). > > Why would every block device need this? I thought we were talking about > mempool allocator and the md variant of it. They are explicitly doing > their own back off so PF_LESS_THROTTLE sounds appropriate to me. Because every block driver is susceptible to this problem. Two years ago, there was a bug that when the user was swapping to a dm-crypt device, memory management would stall the allocations done by dm-crypt by 100ms - that slowed down the swapping rate and made the machine unusable. Then, people are complaining about the same problem in dm-bufio. Next time, it will be something else. (you will answer: "we will not fix bugs unless people are reporting them" :-) > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the > > request comes from a block device driver or a filesystem), we should not > > sleep. > > Why? How are you going to audit all the callers that the behavior makes > sense and moreover how are you going to ensure that future usage will > still make sense. The more subtle side effects gfp flags have the harder > they are to maintain. I did audit them - see the previous email in this thread: https://www.redhat.com/archives/dm-devel/2018-June/thread.html Mikulas

> > Signed-off-by: Mikulas Patocka
> > Cc: sta...@vger.kernel.org
> >
> > Index: linux-2.6/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.orig/mm/vmscan.c
> > +++ linux-2.6/mm/vmscan.c
> > @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
> >  	 * the LRU too quickly.
> >  	 */
> >  	if (!sc->hibernation_mode && !current_is_kswapd() &&
> > +	    (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
> >  	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> >  		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>
> --
> Michal Hocko
> SUSE Labs

-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Thu 21-06-18 21:17:24, Mikulas Patocka wrote: [...] > > But seriously, isn't the best way around the throttling issue to use > > PF_LESS_THROTTLE? > > Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would > be better to change it just in one place than to add PF_LESS_THROTTLE to > every block device driver (because adding it to every block driver results > in more code). Why would every block device need this? I thought we were talking about mempool allocator and the md variant of it. They are explicitly doing their own back off so PF_LESS_THROTTLE sounds appropriate to me. > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the > request comes from a block device driver or a filesystem), we should not > sleep. Why? How are you going to audit all the callers that the behavior makes sense and moreover how are you going to ensure that future usage will still make sense. The more subtle side effects gfp flags have the harder they are to maintain.

> Signed-off-by: Mikulas Patocka
> Cc: sta...@vger.kernel.org
>
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
>  	 * the LRU too quickly.
>  	 */
>  	if (!sc->hibernation_mode && !current_is_kswapd() &&
> +	    (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
>  	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
>  		wait_iff_congested(BLK_RW_ASYNC, HZ/10);

-- Michal Hocko SUSE Labs -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Tue, 19 Jun 2018, Michal Hocko wrote: > On Mon 18-06-18 18:11:26, Mikulas Patocka wrote: > [...] > > I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases > > without a fallback - those are bugs that make various functions randomly > > return -ENOMEM. > > Well, maybe those are just optimistic attempts to allocate memory and > have a fallback somewhere else. So I would be careful calling them > outright bugs. But maybe you are right. I was trying to find the fallback code when I triaged them and marked as "BUG" those cases where I didn't find it. You can search harder and perhaps you'll find something that I didn't. > > Most of the callers provide a fallback. > > > > There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two > > different functions - if the allocation is larger than > > PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller. > > If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER, > > __GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations > > don't trigger the oom killer at all). > > Well, the primary purpose of this flag is to provide a consistent > failure behavior for all requests regardless of the size. > > > So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in > > the cases where the caller wants to avoid triggering the oom killer (the > > problem is that __GFP_NORETRY causes random failure even in no-oom > > situations but __GFP_RETRY_MAYFAIL doesn't). > > maybe yes. > > > So my suggestion is - fix these obvious bugs when someone allocates memory > > with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be > > just changed to return NULL instead of sleeping. > > No real objection to fixing wrong __GFP_NORETRY usage. But __GFP_NORETRY > can sleep. Nothing will really change in that regard. It does a > reclaim and that _might_ sleep. > > But seriously, isn't the best way around the throttling issue to use > PF_LESS_THROTTLE? 
Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would be better to change it just in one place than to add PF_LESS_THROTTLE to every block device driver (because adding it to every block driver results in more code). What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the request comes from a block device driver or a filesystem), we should not sleep.

Signed-off-by: Mikulas Patocka
Cc: sta...@vger.kernel.org

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
 	 * the LRU too quickly.
 	 */
 	if (!sc->hibernation_mode && !current_is_kswapd() &&
+	    (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
 	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);

-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri, 15 Jun 2018, Michal Hocko wrote: > On Fri 15-06-18 08:47:52, Mikulas Patocka wrote: > > > > > > On Fri, 15 Jun 2018, Michal Hocko wrote: > > > > > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote: > > > > > > > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO | > > > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". An so dm-bufio uses > > > > these flags too. dm-bufio is just a big mempool. > > > > > > This doesn't answer my question though. Somebody else is doing it is not > > > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO > > > allocation AFAICS. So why do you really need it now? Why cannot you > > > > dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | > > __GFP_NOWARN" since the kernel 3.2 when it was introduced. > > > > In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT > > allocation, then drops the lock and does GFP_NOIO with the dropped lock > > (because someone was likely experiencing the same issue that is reported > > in this thread) - there are two commits that change it - 9ea61cac0 and > > 41c73a49df31. > > OK, I see. Is there any fundamental reason why this path has to do one > round of GFP_IO or it can keep NOWAIT, drop the lock, sleep and retry > again? If the process is woken up, there was some buffer added to the freelist, or refcount of some buffer was dropped to 0. In this case, we don't want to drop the lock and use GFP_NOIO, because the freed buffer may disappear when we drop the lock. > [...] > > > is the same class of problem, honestly, I dunno. And I've already said > > > that stalling __GFP_NORETRY might be a good way around that but that > > > needs much more consideration and existing users examination. I am not > > > aware anybody has done that. Doing changes like that based on a single > > > user is certainly risky. > > > > Why don't you set any rules how these flags should be used? > > It is really hard to change rules during the game. 
You basically have to > examine all existing users and that is well beyond my time scope. I've > tried that where it was possible. E.g. __GFP_REPEAT and turned it into a > well defined semantic. __GFP_NORETRY is a bigger beast. > > Anyway, I believe that it would be much safer to look at the problem > from a high-level perspective. You seem to be focused on __GFP_NORETRY > little bit too much IMHO. We are not throttling callers which explicitly > do not want to or cannot - see current_may_throttle. Is it possible that > both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE > or use other means? E.g. do you have backing_dev_info at that time? > -- > Michal Hocko > SUSE Labs I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases without a fallback - those are bugs that make various functions randomly return -ENOMEM. Most of the callers provide a fallback. There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two different functions - if the allocation is larger than PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller. If the allocation is smaller than PAGE_ALLOC_COSTLY_ORDER, __GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations don't trigger the oom killer at all). So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in the cases where the caller wants to avoid triggering the oom killer (the problem is that __GFP_NORETRY causes random failure even in no-oom situations but __GFP_RETRY_MAYFAIL doesn't). So my suggestion is - fix these obvious bugs when someone allocates memory with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be just changed to return NULL instead of sleeping. 
arch/arm/mm/dma-mapping.c - fallback to a smaller size without __GFP_NORETRY
arch/mips/mm/dma-default.c - says that it uses __GFP_NORETRY to avoid the oom killer, provides no fallback - it seems to be a BUG
arch/sparc/mm/tsb.c - fallback to a smaller size without __GFP_NORETRY
arch/x86/include/asm/floppy.h - __GFP_NORETRY doesn't seem to serve any purpose, it may cause random failures during initialization, can be removed - BUG
arch/powerpc/mm/mmu_context_iommu.c - uses it just during moving pages, there's no problem with failure
arch/powerpc/platforms/pseries/cmm.c - a vm balloon driver, no problem with failure
block/bio.c - falls back to mempool
block/blk-mq.c - erroneous use of __GFP_NORETRY during initialization, it falls back to a smaller size, but doesn't drop the __GFP_NORETRY flag (BUG)
drivers/gpu/drm/i915/i915_gem.c - it starts with __GFP_NORETRY and on failure, it ORs it with __GFP_RETRY_MAYFAIL (which of these conflicting flags wins?)
drivers/gpu/drm/i915/i915_gem_gtt.c - __GFP_NORETRY is used during initialization (BUG), it shouldn't be used
drivers/gpu/drm/i915/i915_gem_execbuffer.c - fallback to a smaller size without __GFP_NORETRY
drivers/gpu/drm/i915/i915_gem_internal.c - fallback
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
> On Fri, 15 Jun 2018, Michal Hocko wrote:
> > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
> > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses
> > > these flags too. dm-bufio is just a big mempool.
> >
> > This doesn't answer my question though. Somebody else is doing it is not
> > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > allocation AFAICS. So why do you really need it now? Why cannot you
>
> dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC |
> __GFP_NOWARN" since the kernel 3.2 when it was introduced.
>
> In the kernel 4.10, dm-bufio was changed so that it does a GFP_NOWAIT
> allocation, then drops the lock and does GFP_NOIO with the lock dropped
> (because someone was likely experiencing the same issue that is reported
> in this thread) - there are two commits that change it - 9ea61cac0 and
> 41c73a49df31.

OK, I see. Is there any fundamental reason why this path has to do one
round of GFP_NOIO, or can it keep NOWAIT, drop the lock, sleep and retry
again?

[...]
> > is the same class of problem, honestly, I dunno. And I've already said
> > that stalling __GFP_NORETRY might be a good way around that but that
> > needs much more consideration and examination of the existing users. I
> > am not aware anybody has done that. Doing changes like that based on a
> > single user is certainly risky.
>
> Why don't you set any rules for how these flags should be used?

It is really hard to change the rules during the game. You basically have
to examine all existing users and that is well beyond my time scope. I've
tried that where it was possible. E.g. __GFP_REPEAT, which I turned into
a well-defined semantic. __GFP_NORETRY is a bigger beast.

Anyway, I believe that it would be much safer to look at the problem from
a high-level perspective.
You seem to be focused on __GFP_NORETRY a little bit too much IMHO. We
are not throttling callers which explicitly do not want to or cannot -
see current_may_throttle. Is it possible that both your md and mempool
allocators can either (ab)use PF_LESS_THROTTLE or use other means? E.g.
do you have backing_dev_info at that time?
--
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri, 15 Jun 2018, Michal Hocko wrote: > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote: > > > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO | > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". An so dm-bufio uses > > these flags too. dm-bufio is just a big mempool. > > This doesn't answer my question though. Somebody else is doing it is not > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO > allocation AFAICS. So why do you really need it now? Why cannot you dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN" since the kernel 3.2 when it was introduced. In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT allocation, then drops the lock and does GFP_NOIO with the dropped lock (because someone was likely experiencing the same issue that is reported in this thread) - there are two commits that change it - 9ea61cac0 and 41c73a49df31. > simply keep retrying GFP_NOWAIT with your own throttling? > > Note that I am not trying to say that 41c73a49df31, I am merely trying > to understand why this blocking allocation is done in the first place. > > > If you argue that these flags are incorrect - then fix mempool_alloc. > > AFAICS there is no report about mempool_alloc stalling here. Maybe this If the page allocator can stall dm-bufio, it can stall mempool_alloc as well. dm-bufio is just bigger, so it will hit this bug sooner. > is the same class of problem, honestly, I dunno. And I've already said > that stalling __GFP_NORETRY might be a good way around that but that > needs much more consideration and existing users examination. I am not > aware anybody has done that. Doing changes like that based on a single > user is certainly risky. Why don't you set any rules how these flags should be used? 
If you use GFP_NOIO | __GFP_NORETRY in your own code and blame other
people for doing so - you are as evil as Linus, who praised people for
reverse-engineering hardware and blamed them for reverse-engineering
bitkeeper :-)

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri 15-06-18 07:35:07, Mikulas Patocka wrote: > > > On Fri, 15 Jun 2018, Michal Hocko wrote: > > > On Thu 14-06-18 14:34:06, Mikulas Patocka wrote: > > > > > > > > > On Thu, 14 Jun 2018, Michal Hocko wrote: > > > > > > > On Thu 14-06-18 15:18:58, jing xia wrote: > > > > [...] > > > > > PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" > > > > > #0 [ffc0282af3d0] __switch_to at ff8008085e48 > > > > > #1 [ffc0282af3f0] __schedule at ff8008850cc8 > > > > > #2 [ffc0282af450] schedule at ff8008850f4c > > > > > #3 [ffc0282af470] schedule_timeout at ff8008853a0c > > > > > #4 [ffc0282af520] schedule_timeout_uninterruptible at > > > > > ff8008853aa8 > > > > > #5 [ffc0282af530] wait_iff_congested at ff8008181b40 > > > > > > > > This trace doesn't provide the full picture unfortunately. Waiting in > > > > the direct reclaim means that the underlying bdi is congested. The real > > > > question is why it doesn't flush IO in time. > > > > > > I pointed this out two years ago and you just refused to fix it: > > > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html > > > > Let me be evil again and let me quote the old discussion: > > : > I agree that mempool_alloc should _primarily_ sleep on their own > > : > throttling mechanism. I am not questioning that. I am just saying that > > : > the page allocator has its own throttling which it relies on and that > > : > cannot be just ignored because that might have other undesirable side > > : > effects. So if the right approach is really to never throttle certain > > : > requests then we have to bail out from a congested nodes/zones as soon > > : > as the congestion is detected. > > : > > > : > Now, I would like to see that something like that is _really_ necessary. > > : > > : Currently, it is not a problem - device mapper reports the device as > > : congested only if the underlying physical disks are congested. 
> > : > > : But once we change it so that device mapper reports congested state on its > > : own (when it has too many bios in progress), this starts being a problem. > > > > So has this changed since then? If yes then we can think of a proper > > solution but that would require to actually describe why we see the > > congestion, why it does help to wait on the caller rather than the > > allocator etc... > > Device mapper doesn't report congested state - but something else does > (perhaps the user inserted a cheap slow usb stick or sdcard?). And device > mapper is just a victim of that. Maybe yes and that would require some more debugging to find out, analyze and think of a proper solution. > Why should device mapper sleep because some other random block device is > congested? Well, the direct reclaim is a way to throttle memory allocations. There is no real concept on who is asking for the memory. If you do not want to get blocked then use GFP_NOWAIT. > > Throwing statements like ... > > > > > I'm sure you'll come up with another creative excuse why GFP_NORETRY > > > allocations need incur deliberate 100ms delays in block device drivers. > > > > ... is not really productive. I've tried to explain why I am not _sure_ what > > possible side effects such a change might have and your hand waving > > didn't really convince me. MD is not the only user of the page > > allocator... > > > > E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO > > allocation") even added GFP_NOIO request in the first place when you > > keep retrying and sleep yourself? > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO | > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". An so dm-bufio uses > these flags too. dm-bufio is just a big mempool. This doesn't answer my question though. Somebody else is doing it is not an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO allocation AFAICS. So why do you really need it now? 
Why cannot you simply keep retrying GFP_NOWAIT with your own throttling?

Note that I am not trying to say that 41c73a49df31 is wrong; I am merely
trying to understand why this blocking allocation is done in the first
place.

> If you argue that these flags are incorrect - then fix mempool_alloc.

AFAICS there is no report about mempool_alloc stalling here. Maybe this
is the same class of problem; honestly, I dunno. And I've already said
that stalling __GFP_NORETRY might be a good way around that, but that
needs much more consideration and examination of the existing users. I am
not aware anybody has done that. Doing changes like that based on a
single user is certainly risky.
--
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Fri, 15 Jun 2018, Michal Hocko wrote: > On Thu 14-06-18 14:34:06, Mikulas Patocka wrote: > > > > > > On Thu, 14 Jun 2018, Michal Hocko wrote: > > > > > On Thu 14-06-18 15:18:58, jing xia wrote: > > > [...] > > > > PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" > > > > #0 [ffc0282af3d0] __switch_to at ff8008085e48 > > > > #1 [ffc0282af3f0] __schedule at ff8008850cc8 > > > > #2 [ffc0282af450] schedule at ff8008850f4c > > > > #3 [ffc0282af470] schedule_timeout at ff8008853a0c > > > > #4 [ffc0282af520] schedule_timeout_uninterruptible at > > > > ff8008853aa8 > > > > #5 [ffc0282af530] wait_iff_congested at ff8008181b40 > > > > > > This trace doesn't provide the full picture unfortunately. Waiting in > > > the direct reclaim means that the underlying bdi is congested. The real > > > question is why it doesn't flush IO in time. > > > > I pointed this out two years ago and you just refused to fix it: > > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html > > Let me be evil again and let me quote the old discussion: > : > I agree that mempool_alloc should _primarily_ sleep on their own > : > throttling mechanism. I am not questioning that. I am just saying that > : > the page allocator has its own throttling which it relies on and that > : > cannot be just ignored because that might have other undesirable side > : > effects. So if the right approach is really to never throttle certain > : > requests then we have to bail out from a congested nodes/zones as soon > : > as the congestion is detected. > : > > : > Now, I would like to see that something like that is _really_ necessary. > : > : Currently, it is not a problem - device mapper reports the device as > : congested only if the underlying physical disks are congested. > : > : But once we change it so that device mapper reports congested state on its > : own (when it has too many bios in progress), this starts being a problem. > > So has this changed since then? 
> If yes then we can think of a proper solution but that would require to
> actually describe why we see the congestion, why it does help to wait on
> the caller rather than the allocator etc...

Device mapper doesn't report congested state - but something else does
(perhaps the user inserted a cheap slow usb stick or sdcard?). And device
mapper is just a victim of that.

Why should device mapper sleep because some other random block device is
congested?

> Throwing statements like ...
>
> > I'm sure you'll come up with another creative excuse why GFP_NORETRY
> > allocations need to incur deliberate 100ms delays in block device
> > drivers.
>
> ... is not really productive. I've tried to explain why I am not _sure_
> what possible side effects such a change might have and your hand waving
> didn't really convince me. MD is not the only user of the page
> allocator...
>
> E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
> allocation") even added GFP_NOIO request in the first place when you
> keep retrying and sleep yourself?

Because mempool uses it. Mempool uses allocations with "GFP_NOIO |
__GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses
these flags too. dm-bufio is just a big mempool.

If you argue that these flags are incorrect - then fix mempool_alloc.

> The changelog only describes what but doesn't explain why. Or did I
> misread the code and this is not the allocation which is stalling due
> to congestion?

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Thu 14-06-18 14:34:06, Mikulas Patocka wrote: > > > On Thu, 14 Jun 2018, Michal Hocko wrote: > > > On Thu 14-06-18 15:18:58, jing xia wrote: > > [...] > > > PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" > > > #0 [ffc0282af3d0] __switch_to at ff8008085e48 > > > #1 [ffc0282af3f0] __schedule at ff8008850cc8 > > > #2 [ffc0282af450] schedule at ff8008850f4c > > > #3 [ffc0282af470] schedule_timeout at ff8008853a0c > > > #4 [ffc0282af520] schedule_timeout_uninterruptible at > > > ff8008853aa8 > > > #5 [ffc0282af530] wait_iff_congested at ff8008181b40 > > > > This trace doesn't provide the full picture unfortunately. Waiting in > > the direct reclaim means that the underlying bdi is congested. The real > > question is why it doesn't flush IO in time. > > I pointed this out two years ago and you just refused to fix it: > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html Let me be evil again and let me quote the old discussion: : > I agree that mempool_alloc should _primarily_ sleep on their own : > throttling mechanism. I am not questioning that. I am just saying that : > the page allocator has its own throttling which it relies on and that : > cannot be just ignored because that might have other undesirable side : > effects. So if the right approach is really to never throttle certain : > requests then we have to bail out from a congested nodes/zones as soon : > as the congestion is detected. : > : > Now, I would like to see that something like that is _really_ necessary. : : Currently, it is not a problem - device mapper reports the device as : congested only if the underlying physical disks are congested. : : But once we change it so that device mapper reports congested state on its : own (when it has too many bios in progress), this starts being a problem. So has this changed since then? 
If yes then we can think of a proper solution but that would require to actually describe why we see the congestion, why it does help to wait on the caller rather than the allocator etc... Throwing statements like ... > I'm sure you'll come up with another creative excuse why GFP_NORETRY > allocations need incur deliberate 100ms delays in block device drivers. ... is not really productive. I've tried to explain why I am not _sure_ what possible side effects such a change might have and your hand waving didn't really convince me. MD is not the only user of the page allocator... E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO allocation") even added GFP_NOIO request in the first place when you keep retrying and sleep yourself? The changelog only describes what but doesn't explain why. Or did I misread the code and this is not the allocation which is stalling due to congestion? -- Michal Hocko SUSE Labs -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Thu, 14 Jun 2018, Michal Hocko wrote: > On Thu 14-06-18 15:18:58, jing xia wrote: > [...] > > PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" > > #0 [ffc0282af3d0] __switch_to at ff8008085e48 > > #1 [ffc0282af3f0] __schedule at ff8008850cc8 > > #2 [ffc0282af450] schedule at ff8008850f4c > > #3 [ffc0282af470] schedule_timeout at ff8008853a0c > > #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8 > > #5 [ffc0282af530] wait_iff_congested at ff8008181b40 > > This trace doesn't provide the full picture unfortunately. Waiting in > the direct reclaim means that the underlying bdi is congested. The real > question is why it doesn't flush IO in time. I pointed this out two years ago and you just refused to fix it: http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html I'm sure you'll come up with another creative excuse why GFP_NORETRY allocations need incur deliberate 100ms delays in block device drivers. Mikulas > > #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80 > > #7 [ffc0282af680] shrink_lruvec at ff8008178510 > > #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc > > #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040 > > #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c > > #11 [ffc0282af990] try_to_free_pages at ff8008178f3c > > #12 [ffc0282afa30] __perform_reclaim at ff8008169130 > > #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8 > > #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c > > #15 [ffc0282afbe0] alloc_buffer at ff8008591a94 > > #16 [ffc0282afc20] __bufio_new at ff8008592e94 > > #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198 > > #18 [ffc0282afd20] verity_prefetch_io at ff8008598384 > > #19 [ffc0282afd70] process_one_work at ff80080b5b3c > > #20 [ffc0282afdc0] worker_thread at ff80080b64fc > > #21 [ffc0282afe20] kthread at ff80080bae34 > > > > > Mikulas > > -- > Michal Hocko > SUSE Labs > -- dm-devel mailing list dm-devel@redhat.com 
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
Thanks for your comment. On Wed, Jun 13, 2018 at 10:02 PM, Mikulas Patocka wrote: > > > On Tue, 12 Jun 2018, Mike Snitzer wrote: > >> On Tue, Jun 12 2018 at 4:03am -0400, >> Jing Xia wrote: >> >> > Performance test in android reports that the phone sometimes gets >> > hanged and shows black screen for about several minutes.The sysdump shows: >> > 1. kswapd and other tasks who enter the direct-reclaim path are waiting >> > on the dm_bufio_lock; >> >> Do you have an understanding of where they are waiting? Is it in >> dm_bufio_shrink_scan()? >> >> > 2. the task who gets the dm_bufio_lock is stalled for IO completions, >> > the relevant stack trace as : >> > >> > PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" >> > #0 [ffc0282af3d0] __switch_to at ff8008085e48 >> > #1 [ffc0282af3f0] __schedule at ff8008850cc8 >> > #2 [ffc0282af450] schedule at ff8008850f4c >> > #3 [ffc0282af470] schedule_timeout at ff8008853a0c >> > #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8 >> > #5 [ffc0282af530] wait_iff_congested at ff8008181b40 >> > #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80 >> > #7 [ffc0282af680] shrink_lruvec at ff8008178510 >> > #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc >> > #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040 > > Please send the full stacktrace of this task. > > Then, we can see, why is it waiting here. 
Please refer to:

PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
 #0 [ffc0282af3d0] __switch_to at ff8008085e48
 #1 [ffc0282af3f0] __schedule at ff8008850cc8
 #2 [ffc0282af450] schedule at ff8008850f4c
 #3 [ffc0282af470] schedule_timeout at ff8008853a0c
 #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
 #5 [ffc0282af530] wait_iff_congested at ff8008181b40
 #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
 #7 [ffc0282af680] shrink_lruvec at ff8008178510
 #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
 #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
#10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
#11 [ffc0282af990] try_to_free_pages at ff8008178f3c
#12 [ffc0282afa30] __perform_reclaim at ff8008169130
#13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
#14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
#15 [ffc0282afbe0] alloc_buffer at ff8008591a94
#16 [ffc0282afc20] __bufio_new at ff8008592e94
#17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
#18 [ffc0282afd20] verity_prefetch_io at ff8008598384
#19 [ffc0282afd70] process_one_work at ff80080b5b3c
#20 [ffc0282afdc0] worker_thread at ff80080b64fc
#21 [ffc0282afe20] kthread at ff80080bae34

> Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
Thanks for your comment, I appreciate it.

On Wed, Jun 13, 2018 at 5:20 AM, Mike Snitzer wrote:
> On Tue, Jun 12 2018 at 4:03am -0400,
> Jing Xia wrote:
>
>> Performance test in android reports that the phone sometimes hangs
>> and shows a black screen for several minutes. The sysdump shows:
>> 1. kswapd and other tasks who enter the direct-reclaim path are waiting
>> on the dm_bufio_lock;
>
> Do you have an understanding of where they are waiting? Is it in
> dm_bufio_shrink_scan()?
>

I'm sorry I didn't make the issue clear. The kernel version on our
platform is 4.4, where the implementation of dm_bufio_shrink_count()
still needs to take the dm_bufio_lock. So kswapd and other tasks in
direct reclaim are blocked in dm_bufio_shrink_count(). According to the
sysdump:
1. those tasks are doing the shrink in the same dm_bufio_shrinker (the
first dm_bufio_shrinker in the shrink_list), and waiting for the same
mutex lock;
2. __GFP_FS is set for all of those tasks.

The dm_bufio_lock is no longer needed in dm_bufio_shrink_count() in the
latest version, but the issue may still happen, because the lock is also
needed in dm_bufio_shrink_scan().
For the relevant stack traces, please refer to:

PID: 855   TASK: ffc07844a700 CPU: 1 COMMAND: "kswapd0"
 #0 [ffc078357920] __switch_to at ff8008085e48
 #1 [ffc078357940] __schedule at ff8008850cc8
 #2 [ffc0783579a0] schedule at ff8008850f4c
 #3 [ffc0783579c0] schedule_preempt_disabled at ff8008851284
 #4 [ffc0783579e0] __mutex_lock_slowpath at ff8008852898
 #5 [ffc078357a50] mutex_lock at ff8008852954
 #6 [ffc078357a70] dm_bufio_shrink_count at ff8008591d08
 #7 [ffc078357ab0] do_shrink_slab at ff80081753e0
 #8 [ffc078357b30] shrink_slab at ff8008175f3c
 #9 [ffc078357bd0] shrink_zone at ff8008178860
#10 [ffc078357c70] balance_pgdat at ff8008179838
#11 [ffc078357d70] kswapd at ff8008179e48
#12 [ffc078357e20] kthread at ff80080bae34

PID: 15538 TASK: ffc006833400 CPU: 3 COMMAND: "com.qiyi.video"
 #0 [ffc027637660] __switch_to at ff8008085e48
 #1 [ffc027637680] __schedule at ff8008850cc8
 #2 [ffc0276376e0] schedule at ff8008850f4c
 #3 [ffc027637700] schedule_preempt_disabled at ff8008851284
 #4 [ffc027637720] __mutex_lock_slowpath at ff8008852898
 #5 [ffc027637790] mutex_lock at ff8008852954
 #6 [ffc0276377b0] dm_bufio_shrink_count at ff8008591d08
 #7 [ffc0276377f0] do_shrink_slab at ff80081753e0
 #8 [ffc027637870] shrink_slab at ff8008175f3c
 #9 [ffc027637910] shrink_zone at ff8008178860
#10 [ffc0276379b0] do_try_to_free_pages at ff8008178bc4
#11 [ffc027637a50] try_to_free_pages at ff8008178f3c
#12 [ffc027637af0] __perform_reclaim at ff8008169130
#13 [ffc027637b70] __alloc_pages_nodemask at ff800816c9b8
#14 [ffc027637c90] handle_mm_fault at ff800819025c
#15 [ffc027637d80] do_page_fault at ff80080949f8
#16 [ffc027637df0] do_mem_abort at ff80080812f8

>> 2.
the task who gets the dm_bufio_lock is stalled for IO completions, >> the relevant stack trace as : >> >> PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" >> #0 [ffc0282af3d0] __switch_to at ff8008085e48 >> #1 [ffc0282af3f0] __schedule at ff8008850cc8 >> #2 [ffc0282af450] schedule at ff8008850f4c >> #3 [ffc0282af470] schedule_timeout at ff8008853a0c >> #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8 >> #5 [ffc0282af530] wait_iff_congested at ff8008181b40 >> #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80 >> #7 [ffc0282af680] shrink_lruvec at ff8008178510 >> #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc >> #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040 > > Understanding the root cause for why the IO isn't completing quick > enough would be nice. Is the backing storage just overwhelmed? > Some information we can get from the sysdump: 1. swap-space is used out; 2. many tasks need to allocate a lot of memory; 3. IOwait is 100%; >> This patch aims to reduce the dm_bufio_lock contention when multiple >> tasks do shrink_slab() at the same time.It is acceptable that task >> will be allowed to reclaim from other shrinkers or reclaim from dm-bufio >> next time, rather than stalled for the dm_bufio_lock. > > Your patch just looks to be papering over the issue. Like you're > treating the symptom rather than the problem. > Yes, maybe you are right. >> Signed-off-by: Jing Xia >> Signed-off-by: Jing Xia > > You only need one Signed-off-by. > Thanks, I got it. >> --- >> drivers/md/dm-bufio.c | 13 +++-- >> 1 file changed, 11 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c >> index c546b56..402a028 100644 >> --- a/drivers/md/dm-bufio.c >> +++
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Thu 14-06-18 15:18:58, jing xia wrote: [...] > PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" > #0 [ffc0282af3d0] __switch_to at ff8008085e48 > #1 [ffc0282af3f0] __schedule at ff8008850cc8 > #2 [ffc0282af450] schedule at ff8008850f4c > #3 [ffc0282af470] schedule_timeout at ff8008853a0c > #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8 > #5 [ffc0282af530] wait_iff_congested at ff8008181b40 This trace doesn't provide the full picture unfortunately. Waiting in the direct reclaim means that the underlying bdi is congested. The real question is why it doesn't flush IO in time. > #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80 > #7 [ffc0282af680] shrink_lruvec at ff8008178510 > #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc > #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040 > #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c > #11 [ffc0282af990] try_to_free_pages at ff8008178f3c > #12 [ffc0282afa30] __perform_reclaim at ff8008169130 > #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8 > #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c > #15 [ffc0282afbe0] alloc_buffer at ff8008591a94 > #16 [ffc0282afc20] __bufio_new at ff8008592e94 > #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198 > #18 [ffc0282afd20] verity_prefetch_io at ff8008598384 > #19 [ffc0282afd70] process_one_work at ff80080b5b3c > #20 [ffc0282afdc0] worker_thread at ff80080b64fc > #21 [ffc0282afe20] kthread at ff80080bae34 > > > Mikulas -- Michal Hocko SUSE Labs -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Tue, 12 Jun 2018, Mike Snitzer wrote: > On Tue, Jun 12 2018 at 4:03am -0400, > Jing Xia wrote: > > > Performance test in android reports that the phone sometimes gets > > hanged and shows black screen for about several minutes.The sysdump shows: > > 1. kswapd and other tasks who enter the direct-reclaim path are waiting > > on the dm_bufio_lock; > > Do you have an understanding of where they are waiting? Is it in > dm_bufio_shrink_scan()? > > > 2. the task who gets the dm_bufio_lock is stalled for IO completions, > > the relevant stack trace as : > > > > PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2" > > #0 [ffc0282af3d0] __switch_to at ff8008085e48 > > #1 [ffc0282af3f0] __schedule at ff8008850cc8 > > #2 [ffc0282af450] schedule at ff8008850f4c > > #3 [ffc0282af470] schedule_timeout at ff8008853a0c > > #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8 > > #5 [ffc0282af530] wait_iff_congested at ff8008181b40 > > #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80 > > #7 [ffc0282af680] shrink_lruvec at ff8008178510 > > #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc > > #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040 Please send the full stacktrace of this task. Then, we can see, why is it waiting here. Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention
On Tue, Jun 12 2018 at 4:03am -0400,
Jing Xia wrote:

> Performance test in android reports that the phone sometimes hangs and
> shows a black screen for several minutes. The sysdump shows:
> 1. kswapd and other tasks who enter the direct-reclaim path are waiting
> on the dm_bufio_lock;

Do you have an understanding of where they are waiting? Is it in
dm_bufio_shrink_scan()?

> 2. the task who gets the dm_bufio_lock is stalled waiting for IO
> completions; the relevant stack trace is:
>
> PID: 22920 TASK: ffc0120f1a00 CPU: 1 COMMAND: "kworker/u8:2"
>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>  #2 [ffc0282af450] schedule at ff8008850f4c
>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040

Understanding the root cause for why the IO isn't completing quickly
enough would be nice. Is the backing storage just overwhelmed?

> This patch aims to reduce the dm_bufio_lock contention when multiple
> tasks do shrink_slab() at the same time. It is acceptable that a task
> will be allowed to reclaim from other shrinkers or reclaim from
> dm-bufio next time, rather than stall on the dm_bufio_lock.

Your patch just looks to be papering over the issue. Like you're
treating the symptom rather than the problem.

> Signed-off-by: Jing Xia
> Signed-off-by: Jing Xia

You only need one Signed-off-by.
> ---
>  drivers/md/dm-bufio.c | 13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index c546b56..402a028 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -1647,10 +1647,19 @@ static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
>  static unsigned long
>  dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
> +	unsigned long count;
> +	unsigned long retain_target;
> +
>  	struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
> -	unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
> +
> +	if (!dm_bufio_trylock(c))
> +		return 0;
> +
> +	count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
>  		READ_ONCE(c->n_buffers[LIST_DIRTY]);
> -	unsigned long retain_target = get_retain_buffers(c);
> +	retain_target = get_retain_buffers(c);
> +
> +	dm_bufio_unlock(c);
>
>  	return (count < retain_target) ? 0 : (count - retain_target);
>  }
> --
> 1.9.1
>

The reality of your patch is, on a heavily used bufio-backed volume,
you're effectively disabling the ability to reclaim bufio memory via the
shrinker, because chances are the bufio lock will always be contended
for a heavily used bufio client.

But after a quick look, I'm left wondering why dm_bufio_shrink_scan()'s
dm_bufio_trylock() isn't sufficient to short-circuit the shrinker for
your use-case? Maybe __GFP_FS is set so dm_bufio_shrink_scan() only ever
uses dm_bufio_lock()?

Is a shrinker able to be reentered by the VM subsystem (e.g.
shrink_slab() calls down into the same shrinker from multiple tasks that
hit direct reclaim)? If so, a better fix could be to add a flag to the
bufio client so we can know if the same client is being re-entered via
the shrinker (though it'd likely be a bug for the shrinker to do
that!).. and have dm_bufio_shrink_scan() check that flag and return
SHRINK_STOP if set.
That said, it could be that other parts of dm-bufio are monopolizing the
lock as part of issuing normal IO (to your potentially slow backend).. in
which case just taking the lock from the shrinker even once will block
like you've reported.

It does seem like additional analysis is needed to pinpoint exactly what
is occurring. Or some additional clarification is needed (e.g. are the
multiple tasks waiting for the bufio lock, as you reported in "1" above,
waiting for the same exact shrinker's ability to get the same bufio
lock?)

But Mikulas, please have a look at this reported issue and let us know
your thoughts.

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel