Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

2021-03-01 Thread Yang Shi
On Mon, Mar 1, 2021 at 7:05 AM Johannes Weiner wrote:
>
> Hello Yang,
>
> On Thu, Feb 25, 2021 at 09:00:16AM -0800, Yang Shi wrote:
> > Hi Andrew,
> >
> > Just checking in on whether this series is on your radar. Patches 1/13
> > through 12/13 have been reviewed and acked. Vlastimil had some comments
> > on patch 13/13, but I'm not sure whether he is going to continue
> > reviewing that one. I hope the last patch can get into the -mm tree
> > along with the others so that it gets broader testing. What do you
> > think?
>
> The merge window for 5.12 is/has been open, which is when maintainers
> are busy getting everything from the previous development cycle ready
> to send upstream. Usually, only fixes but no new features are picked
> up during that time. If you don't hear back, try resending in a week.

Thanks, Johannes. Totally understand.

>
> That reminds me, I also have patches I need to resend :)


Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

2021-03-01 Thread Johannes Weiner
Hello Yang,

On Thu, Feb 25, 2021 at 09:00:16AM -0800, Yang Shi wrote:
> Hi Andrew,
> 
> Just checking in on whether this series is on your radar. Patches 1/13
> through 12/13 have been reviewed and acked. Vlastimil had some comments
> on patch 13/13, but I'm not sure whether he is going to continue
> reviewing that one. I hope the last patch can get into the -mm tree
> along with the others so that it gets broader testing. What do you
> think?

The merge window for 5.12 is/has been open, which is when maintainers
are busy getting everything from the previous development cycle ready
to send upstream. Usually, only fixes but no new features are picked
up during that time. If you don't hear back, try resending in a week.

That reminds me, I also have patches I need to resend :)


Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

2021-02-25 Thread Yang Shi
Hi Andrew,

Just checking in on whether this series is on your radar. Patches 1/13
through 12/13 have been reviewed and acked. Vlastimil had some comments
on patch 13/13, but I'm not sure whether he is going to continue
reviewing that one. I hope the last patch can get into the -mm tree
along with the others so that it gets broader testing. What do you
think?

Thanks,
Yang


[v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

2021-02-16 Thread Yang Shi


Changelog
v7 --> v8:
* Added lockdep assert in expand_shrinker_info() per Roman.
* Added patch 05/13 to use kvfree_rcu() instead of call_rcu() per Roman
  and Kirill.
* Moved rwsem acquire/release out of unregister_memcg_shrinker() per Roman.
* Renamed count_nr_deferred_{memcg} to xchg_nr_deferred_{memcg} per Roman.
* Fixed the next_deferred logic per Vlastimil.
* Misc minor code cleanup, refactor and spelling correction per Roman
  and Shakeel.
* Collected more ack and review tags from Roman, Shakeel and Vlastimil.
v6 --> v7:
* Expanded shrinker_info in a batch of BITS_PER_LONG per Kirill.
* Added patch 06/12 to introduce a helper for dereferencing shrinker_info
  per Kirill.
* Renamed set_nr_deferred_memcg to add_nr_deferred_memcg per Kirill.
* Collected Acked-by from Kirill.
v5 --> v6:
* Rebased on top of
  https://lore.kernel.org/linux-mm/1611216029-34397-1-git-send-email-abaci-bug...@linux.alibaba.com/
  per Kirill.
* Don't register shrinker idr with NULL and remove idr_replace() per
  Vlastimil.
* Move nr_deferred before map to guarantee the alignment per Vlastimil.
* Misc minor code cleanup and refactor per Kirill and Vlastimil.
* Added Acked-by from Vlastimil for patch #1, #2, #3, #5, #9 and #10.
v4 --> v5:
* Incorporated the comments from Kirill.
* Rebased to v5.11-rc5.
v3 --> v4:
* Removed "memcg_" prefix for shrinker_maps related functions per Roman.
* Use write lock instead of read lock per Kirill. Also removed Johannes's
  ack since write lock is used.
* Incorporated the comments from Kirill.
* Removed RFC.
* Rebased to v5.11-rc4.
v2 --> v3:
* Moved shrinker_maps related code to vmscan.c per Dave.
* Removed memcg_shrinker_map_size. Calculated the size of map via
  shrinker_nr_max per Johannes.
* Consolidated shrinker_deferred with shrinker_maps into one struct per
  Dave.
* Simplified the nr_deferred related code.
* Dropped the memory barrier from v2.
* Moved nr_deferred reparent code to vmscan.c per Dave.
* Added test coverage information in patch #11. Dave is concerned about the
  potential regression. I didn't notice any regression with my tests, but
  suggestions about more test coverage are definitely welcome. Having this
  patch in the -mm tree and then the linux-next tree may also help spot
  regressions, so I keep it in this version.
* The code cleanup and consolidation resulted in the series growing to 11
  patches.
* Rebased onto 5.11-rc2.
v1 --> v2:
* Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
* Folded patch #1 into patch #6 per Roman.
* Added memory barrier to prevent shrink_slab_memcg from seeing NULL
  shrinker_maps/shrinker_deferred per Kirill.
* Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
  allocations from expand with shrinker_rwsem per Johannes.

Recently a huge one-off slab drop was seen on some vfs metadata heavy
workloads. It turned out that the shrinker had seen a huge number of
accumulated nr_deferred objects.

On our production machine, I saw an absurd number of nr_deferred, as shown in
the tracing result below:

<...>-48776 [032]  27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node. Assuming all of them
are dentries (192 bytes per object), the total size of the deferred objects
on one node is ~480TB, which is definitely ridiculous.
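
As a rough cross-check of the estimate above, the deferred count can be
pulled out of a mm_shrink_slab_start trace line and multiplied by the
per-object size. This is only a sketch: the helper name deferred_bytes is
made up for illustration, and the 192-byte object size is the same
all-dentries assumption as above.

```shell
# Hypothetical helper: report the deferred object count from a
# mm_shrink_slab_start trace line and its estimated size.
#   $1: trace text (may be wrapped over several lines)
#   $2: assumed bytes per object (default 192, i.e. all dentries)
deferred_bytes() {
    obj_size=${2:-192}
    # Flatten wrapped trace output, then take the number that follows
    # "objects to shrink".
    nr=$(printf '%s' "$1" | tr '\n' ' ' |
         sed -n 's/.*objects to shrink  *\([0-9][0-9]*\).*/\1/p')
    [ -n "$nr" ] || return 1
    bytes=$((nr * obj_size))
    echo "$nr objects, $bytes bytes, ~$((bytes / 1000000000000)) TB"
}

# Example with the count from the trace above:
deferred_bytes 'nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE'
# prints: 2531805877005 objects, 486106728384960 bytes, ~486 TB
```

The exact product lands in the same range as the ~480TB figure, which
rounds the object count down to 2.5 trillion.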

I managed to reproduce this problem with a kernel build workload plus a
negative dentry generator.

First, run the kernel build test script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
    cgcreate -g memory:kern_build
    echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

    echo 3 > /proc/sys/vm/drop_caches
    cgexec -g memory:kern_build make clean > /dev/null 2>&1
    cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

    cgdelete -g memory:kern_build
done

Then run the negative dentry generator script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
    while true; do
        FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
        cat $FILE 2>/dev/null
    done &
done
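
While the generator runs, the growth of the dentry cache can be watched
through /proc/sys/fs/dentry-state. The sketch below is not part of the
original reproduction steps. The helper name dentry_summary is made up,
and it assumes a kernel that reports nr_negative in the fifth field, with
the field order documented in Documentation/admin-guide/sysctl/fs.rst
(nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy).

```shell
# Hypothetical helper: summarize one line of /proc/sys/fs/dentry-state.
# Assumed field order:
#   nr_dentry nr_unused age_limit want_pages nr_negative dummy
dentry_summary() {
    # Word-split the counters into positional parameters.
    set -- $1
    [ $# -ge 5 ] || return 1
    echo "dentries=$1 unused=$2 negative=$5"
}

# Example: sample once per second while the generator runs.
# while sleep 1; do
#     dentry_summary "$(cat /proc/sys/fs/dentry-state)"
# done
```

A steadily rising negative count would be consistent with the generator
doing its job, since every lookup of a random nonexistent name is expected
to leave a negative dentry behind.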

Then kswapd will shrink half of the dentry cache in just one loop, as the
tracing result below shows:

kswapd0-475   [028]  305968.252561: mm_shrink_slab_start: 
super_cache_scan+0x0/0x190 24acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 
45746 total_scan 46844936 priority 12
kswapd0-475