Re: [PATCH] mm: Fix a race during split THP

2020-10-09 Thread Huang, Ying
"Huang, Ying"  writes:

> From: Huang Ying 
>
> It is reported that the following bug is triggered when an HDD is used as the
> swap device,
>
> [ 5758.157556] BUG: kernel NULL pointer dereference, address: 0007
> [ 5758.165331] #PF: supervisor write access in kernel mode
> [ 5758.171161] #PF: error_code(0x0002) - not-present page
> [ 5758.176894] PGD 0 P4D 0
> [ 5758.179721] Oops: 0002 [#1] SMP PTI
> [ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S  
>  - ---  5.9.0-0.rc3.1.tst.el8.x86_64 #1
> [ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS 
> SE5C600.86B.02.01.0002.082220131453 08/22/2013
> [ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
> [ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 
> 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 
> 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
> [ 5758.234478] RSP: 0018:b147442d7af0 EFLAGS: 00010246
> [ 5758.240309] RAX:  RBX: 0014b217 RCX: 
> b14779fd9000
> [ 5758.248281] RDX: 0014b217 RSI: 9c52f2ab1400 RDI: 
> 0014b217
> [ 5758.256246] RBP: e00c51168080 R08: e00c5116fe08 R09: 
> 9c52fffd3000
> [ 5758.264208] R10: e00c511537c8 R11: 9c52fffd3c90 R12: 
> 
> [ 5758.272172] R13: e00c5117 R14: e00c5117 R15: 
> e00c51168040
> [ 5758.280134] FS:  () GS:9c52f2a8() 
> knlGS:
> [ 5758.289163] CS:  0010 DS:  ES:  CR0: 80050033
> [ 5758.295575] CR2: 0007 CR3: 22a0e003 CR4: 
> 000606e0
> [ 5758.303538] Call Trace:
> [ 5758.306273]  split_huge_page_to_list+0x88b/0x950
> [ 5758.311433]  deferred_split_scan+0x1ca/0x310
> [ 5758.316202]  do_shrink_slab+0x12c/0x2a0
> [ 5758.320491]  shrink_slab+0x20f/0x2c0
> [ 5758.324482]  shrink_node+0x240/0x6c0
> [ 5758.328469]  balance_pgdat+0x2d1/0x550
> [ 5758.332652]  kswapd+0x201/0x3c0
> [ 5758.336157]  ? finish_wait+0x80/0x80
> [ 5758.340147]  ? balance_pgdat+0x550/0x550
> [ 5758.344525]  kthread+0x114/0x130
> [ 5758.348126]  ? kthread_park+0x80/0x80
> [ 5758.352214]  ret_from_fork+0x22/0x30
> [ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr 
> intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
> mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper 
> crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt 
> fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm 
> intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei 
> ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci 
> libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash 
> dm_log dm_mod
> [ 5758.412673] CR2: 0007
> [0.00] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 
> (mockbu...@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red 
> Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020
>
> After further digging it's found that the following race condition exists in
> the original implementation,
>
> CPU1                                     CPU2
> ----                                     ----
> deferred_split_scan()
>   split_huge_page(page) /* page isn't compound head */
>     split_huge_page_to_list(page, NULL)
>       __split_huge_page(page, )
>         ClearPageCompound(head)
>         /* unlock all subpages except page (not head) */
>                                          add_to_swap(head)  /* not THP */
>                                            get_swap_page(head)
>                                            add_to_swap_cache(head, )
>                                              SetPageSwapCache(head)
>       if PageSwapCache(head)
>         split_swap_cluster(/* swap entry of head */)
>           /* Deref sis->cluster_info: NULL accessing! */
>
> So, in split_huge_page_to_list(), PageSwapCache() is called for the already
> split and unlocked "head", which may be added to the swap cache on another
> CPU.  So split_swap_cluster() may be called wrongly.
>
> To fix the race, the call to split_swap_cluster() is moved to
> __split_huge_page() before all subpages are unlocked, so that
> PageSwapCache() is stable.
>
> Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
> Reported-and-tested-by: Rafael Aquini 
> Signed-off-by: "Huang, Ying" 
> Cc: Hugh Dickins 
> Cc: Kirill A. Shutemov 
> Cc: Andrea Arcangeli 

Sorry, should have added

Cc: sta...@vger.kernel.org

Best Regards,
Huang, Ying


[PATCH] mm: Fix a race during split THP

2020-10-09 Thread Huang, Ying
From: Huang Ying 

It is reported that the following bug is triggered when an HDD is used as the
swap device,

[ 5758.157556] BUG: kernel NULL pointer dereference, address: 0007
[ 5758.165331] #PF: supervisor write access in kernel mode
[ 5758.171161] #PF: error_code(0x0002) - not-present page
[ 5758.176894] PGD 0 P4D 0
[ 5758.179721] Oops: 0002 [#1] SMP PTI
[ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S
   - ---  5.9.0-0.rc3.1.tst.el8.x86_64 #1
[ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS 
SE5C600.86B.02.01.0002.082220131453 08/22/2013
[ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
[ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 
80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 25 
07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
[ 5758.234478] RSP: 0018:b147442d7af0 EFLAGS: 00010246
[ 5758.240309] RAX:  RBX: 0014b217 RCX: b14779fd9000
[ 5758.248281] RDX: 0014b217 RSI: 9c52f2ab1400 RDI: 0014b217
[ 5758.256246] RBP: e00c51168080 R08: e00c5116fe08 R09: 9c52fffd3000
[ 5758.264208] R10: e00c511537c8 R11: 9c52fffd3c90 R12: 
[ 5758.272172] R13: e00c5117 R14: e00c5117 R15: e00c51168040
[ 5758.280134] FS:  () GS:9c52f2a8() 
knlGS:
[ 5758.289163] CS:  0010 DS:  ES:  CR0: 80050033
[ 5758.295575] CR2: 0007 CR3: 22a0e003 CR4: 000606e0
[ 5758.303538] Call Trace:
[ 5758.306273]  split_huge_page_to_list+0x88b/0x950
[ 5758.311433]  deferred_split_scan+0x1ca/0x310
[ 5758.316202]  do_shrink_slab+0x12c/0x2a0
[ 5758.320491]  shrink_slab+0x20f/0x2c0
[ 5758.324482]  shrink_node+0x240/0x6c0
[ 5758.328469]  balance_pgdat+0x2d1/0x550
[ 5758.332652]  kswapd+0x201/0x3c0
[ 5758.336157]  ? finish_wait+0x80/0x80
[ 5758.340147]  ? balance_pgdat+0x550/0x550
[ 5758.344525]  kthread+0x114/0x130
[ 5758.348126]  ? kthread_park+0x80/0x80
[ 5758.352214]  ret_from_fork+0x22/0x30
[ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr 
intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper 
crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops 
cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 
ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs 
libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit 
crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
[ 5758.412673] CR2: 0007
[0.00] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 
(mockbu...@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red 
Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020

After further digging it's found that the following race condition exists in the
original implementation,

CPU1                                     CPU2
----                                     ----
deferred_split_scan()
  split_huge_page(page) /* page isn't compound head */
    split_huge_page_to_list(page, NULL)
      __split_huge_page(page, )
        ClearPageCompound(head)
        /* unlock all subpages except page (not head) */
                                         add_to_swap(head)  /* not THP */
                                           get_swap_page(head)
                                           add_to_swap_cache(head, )
                                             SetPageSwapCache(head)
      if PageSwapCache(head)
        split_swap_cluster(/* swap entry of head */)
          /* Deref sis->cluster_info: NULL accessing! */

So, in split_huge_page_to_list(), PageSwapCache() is called for the already
split and unlocked "head", which may be added to the swap cache on another
CPU.  So split_swap_cluster() may be called wrongly.

To fix the race, the call to split_swap_cluster() is moved to
__split_huge_page() before all subpages are unlocked, so that
PageSwapCache() is stable.

Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
Reported-and-tested-by: Rafael Aquini 
Signed-off-by: "Huang, Ying" 
Cc: Hugh Dickins 
Cc: Kirill A. Shutemov 
Cc: Andrea Arcangeli 
---
 mm/huge_memory.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cba3812a5c3e..87b0389673dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2478,6 +2478,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	remap_page(head, nr);
 
+	if (PageSwapCache(head)) {
+		swp_entry_t entry = { .val = page_private(head) };
+
+		split_swap_cluster(entry);
+	}
+

Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-10-08 Thread Huang, Ying
Rafael Aquini  writes:

> On Thu, Oct 01, 2020 at 10:31:57AM -0400, Rafael Aquini wrote:
>> On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
>> > Rafael Aquini  writes:
>> > >> Or, can you help to run the test with a debug kernel based on upstream
>> > >> kernel.  I can provide some debug patch.
>> > >> 
>> > >
>> > > Sure, I can set your patches to run with the test cases we have that 
>> > > tend to 
>> > > reproduce the issue with some degree of success.
>> > 
>> > Thanks!
>> > 
>> > I found a race condition.  During THP splitting, "head" may be unlocked
>> > before calling split_swap_cluster(), because head != page during
>> > deferred splitting.  So we should call split_swap_cluster() before
>> > unlocking.  The debug patch to do that is as below.  Can you help to
>> > test it?
>> > 
>> > Best Regards,
>> > Huang, Ying
>> > 
>> > 8<
>> > From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
>> > From: Huang Ying 
>> > Date: Fri, 25 Sep 2020 11:10:56 +0800
>> > Subject: [PATCH] dbg: Call split_swap_cluster() before unlock page during
>> >  split THP
>> > 
>> > ---
>> >  mm/huge_memory.c | 13 +++--
>> >  1 file changed, 7 insertions(+), 6 deletions(-)
>> > 
>> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> > index faadc449cca5..8d79e5e6b46e 100644
>> > --- a/mm/huge_memory.c
>> > +++ b/mm/huge_memory.c
>> > @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, 
>> > struct list_head *list,
>> >  
>> >remap_page(head);
>> >  
>> > +  if (PageSwapCache(head)) {
>> > +  swp_entry_t entry = { .val = page_private(head) };
>> > +
>> > +  split_swap_cluster(entry);
>> > +  }
>> > +
>> >for (i = 0; i < HPAGE_PMD_NR; i++) {
>> >struct page *subpage = head + i;
>> >if (subpage == page)
>> > @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, 
>> > struct list_head *list)
>> >}
>> >  
>> >__split_huge_page(page, list, end, flags);
>> > -  if (PageSwapCache(head)) {
>> > -  swp_entry_t entry = { .val = page_private(head) };
>> > -
>> > -  ret = split_swap_cluster(entry);
>> > -  } else
>> > -  ret = 0;
>> > +  ret = 0;
>> >} else {
>> >if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>> >pr_alert("total_mapcount: %u, page_count(): %u\n",
>> > -- 
>> > 2.28.0
>> > 
>> 
>> I left it running for several days, on several systems that had seen the
>> crash hitting before, and no crashes were observed for either the upstream
>> kernel nor the distro build 4.18-based kernel.
>> 
>> I guess we can comfortably go with your patch. Thanks!
>> 
>>
> Ping
>
> Are you going to post this patchfix soon? Or do you rather have me
> posting it?

Sorry for the late reply.  I have just come back from a long local holiday.
Thanks a lot for testing!  I will prepare the formal fix patch.

Best Regards,
Huang, Ying


Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-26 Thread Huang, Ying
Rafael Aquini  writes:

> On Fri, Sep 25, 2020 at 11:21:58AM +0800, Huang, Ying wrote:
>> Rafael Aquini  writes:
>> >> Or, can you help to run the test with a debug kernel based on upstream
>> >> kernel.  I can provide some debug patch.
>> >> 
>> >
>> > Sure, I can set your patches to run with the test cases we have that tend 
>> > to 
>> > reproduce the issue with some degree of success.
>> 
>> Thanks!
>> 
>> I found a race condition.  During THP splitting, "head" may be unlocked
>> before calling split_swap_cluster(), because head != page during
>> deferred splitting.  So we should call split_swap_cluster() before
>> unlocking.  The debug patch to do that is as below.  Can you help to
>> test it?
>>
>
>
> I finally could grab a good crashdump and confirm that head is really
> not locked.

Thanks!  That's really helpful for us to root cause the bug.

> I still need to dig into it to figure out more about the
> crash. I guess that your patch will guarantee the lock on head, but
> it still doesn't help explain how we got the THP marked as
> PG_swapcache, given that it should fail add_to_swap()->get_swap_page(),
> right?

Because ClearPageCompound(head) is called in __split_huge_page(), all
subpages except "page" are unlocked afterwards.  So previously, when
split_swap_cluster() is called in split_huge_page_to_list(), the THP has
already been split and "head" may be unlocked.  Then the normal page
"head" can be added to the swap cache.

CPU1                                     CPU2
----                                     ----
deferred_split_scan()
  split_huge_page(page) /* page isn't compound head */
    split_huge_page_to_list(page, NULL)
      __split_huge_page(page, )
        ClearPageCompound(head)
        /* unlock all subpages except page (not head) */
                                         add_to_swap(head)  /* not THP */
                                           get_swap_page(head)
                                           add_to_swap_cache(head, )
                                             SetPageSwapCache(head)
      if PageSwapCache(head)
        split_swap_cluster(/* swap entry of head */)
          /* Deref sis->cluster_info: NULL accessing! */

> I'll give your patch a run over the weekend, hopefully we'll have more
> info on this next week.

Thanks!

Best Regards,
Huang, Ying

>> Best Regards,
>> Huang, Ying
>> 
>> 8<
>> From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
>> From: Huang Ying 
>> Date: Fri, 25 Sep 2020 11:10:56 +0800
>> Subject: [PATCH] dbg: Call split_swap_cluster() before unlock page during
>>  split THP
>> 
>> ---
>>  mm/huge_memory.c | 13 +++--
>>  1 file changed, 7 insertions(+), 6 deletions(-)
>> 
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index faadc449cca5..8d79e5e6b46e 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, 
>> struct list_head *list,
>>  
>>  remap_page(head);
>>  
>> +if (PageSwapCache(head)) {
>> +swp_entry_t entry = { .val = page_private(head) };
>> +
>> +split_swap_cluster(entry);
>> +}
>> +
>>  for (i = 0; i < HPAGE_PMD_NR; i++) {
>>  struct page *subpage = head + i;
>>  if (subpage == page)
>> @@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct 
>> list_head *list)
>>  }
>>  
>>  __split_huge_page(page, list, end, flags);
>> -if (PageSwapCache(head)) {
>> -swp_entry_t entry = { .val = page_private(head) };
>> -
>> -ret = split_swap_cluster(entry);
>> -} else
>> -ret = 0;
>> +ret = 0;
>>  } else {
>>  if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>>  pr_alert("total_mapcount: %u, page_count(): %u\n",
>> -- 
>> 2.28.0
>> 


Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-24 Thread Huang, Ying
Rafael Aquini  writes:
>> Or, can you help to run the test with a debug kernel based on upstream
>> kernel.  I can provide some debug patch.
>> 
>
> Sure, I can set your patches to run with the test cases we have that tend to 
> reproduce the issue with some degree of success.

Thanks!

I found a race condition.  During THP splitting, "head" may be unlocked
before calling split_swap_cluster(), because head != page during
deferred splitting.  So we should call split_swap_cluster() before
unlocking.  The debug patch to do that is as below.  Can you help to
test it?

Best Regards,
Huang, Ying

8<
From 24ce0736a9f587d2dba12f12491c88d3e296a491 Mon Sep 17 00:00:00 2001
From: Huang Ying 
Date: Fri, 25 Sep 2020 11:10:56 +0800
Subject: [PATCH] dbg: Call split_swap_cluster() before unlock page during
 split THP

---
 mm/huge_memory.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faadc449cca5..8d79e5e6b46e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2444,6 +2444,12 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
 
remap_page(head);
 
+   if (PageSwapCache(head)) {
+   swp_entry_t entry = { .val = page_private(head) };
+
+   split_swap_cluster(entry);
+   }
+
for (i = 0; i < HPAGE_PMD_NR; i++) {
struct page *subpage = head + i;
if (subpage == page)
@@ -2678,12 +2684,7 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
}
 
__split_huge_page(page, list, end, flags);
-   if (PageSwapCache(head)) {
-   swp_entry_t entry = { .val = page_private(head) };
-
-   ret = split_swap_cluster(entry);
-   } else
-   ret = 0;
+   ret = 0;
} else {
if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
pr_alert("total_mapcount: %u, page_count(): %u\n",
-- 
2.28.0



Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-24 Thread Huang, Ying
Hi, Andrew,

Andrew Morton  writes:

> On Wed, 23 Sep 2020 09:42:51 -0400 Rafael Aquini  wrote:
>
>> On Tue, Sep 22, 2020 at 12:47:50PM -0700, Andrew Morton wrote:
>> > On Tue, 22 Sep 2020 14:48:38 -0400 Rafael Aquini  wrote:
>> > 
>> > > The swap area descriptor only gets struct swap_cluster_info *cluster_info
>> > > allocated if the swapfile is backed by non-rotational storage.
>> > > When the swap area is laid on top of ordinary disk spindles, 
>> > > lock_cluster()
>> > > will naturally return NULL.
>> > > 
>> > > CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number 
>> > > of
>> > > use cases, and split_swap_cluster(), which is the counterpart of 
>> > > split_huge_page()
>> > > for the THPs in the swapcache, misses checking the return of 
>> > > lock_cluster before
>> > > operating on the cluster_info pointer.
>> > > 
>> > > This patch addresses that issue by adding a proper check for the pointer
>> > > not being NULL in the wrappers cluster_{is,clear}_huge(), in order to 
>> > > avoid
>> > > crashes similar to the one below:
>> > > 
>> > > ...
>> > >
>> > > Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap 
>> > > out")
>> > > Signed-off-by: Rafael Aquini 
>> > 
>> > Did you consider cc:stable?
>> >
>> 
>> UGH! I missed adding it to my cc list. Shall I just forward it, now, or
>> do you prefer a fresh repost?
>
> I added the cc:stable to my copy.

Please don't merge this patch.  This patch doesn't fix the bug, but hides
the real bug.  I will work with Rafael on root-causing and fixing it.

Best Regards,
Huang, Ying


Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-23 Thread Huang, Ying
Rafael Aquini  writes:

>> 
>> If there's a race, we should fix the race.  But the code path for
>> swapcache insertion is,
>> 
>> add_to_swap()
>>   get_swap_page() /* Return if fails to allocate */
>>   add_to_swap_cache()
>> SetPageSwapCache()
>> 
>> While the code path to split THP is,
>> 
>> split_huge_page_to_list()
>>   if PageSwapCache()
>> split_swap_cluster()
>> 
>> Both code paths are protected by the page lock.  So there should be some
>> other reasons to trigger the bug.
>
> As mentioned above, no, they seem not to be protected (at least, not for the
> same page, depending on the case). While add_to_swap() will assure a
> page lock on the compound head, split_huge_page_to_list() does not.
>

int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
struct deferred_split *ds_queue = get_deferred_split_queue(head);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
int count, mapcount, extra_pins, ret;
unsigned long flags;
pgoff_t end;

VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
VM_BUG_ON_PAGE(!PageLocked(head), head);

I found there's page lock checking in split_huge_page_to_list().

Best Regards,
Huang, Ying


Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-23 Thread Huang, Ying
Rafael Aquini  writes:
> The bug here is quite simple: split_swap_cluster() misses checking for
> lock_cluster() returning NULL before committing to change cluster_info->flags.

I don't think so.  We shouldn't run into this situation in the first place.  So
the "fix" hides the real bug instead of fixing it.  Just like we call
VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
instead of silently returning if !PageLocked(head).

> The fundamental problem has nothing to do with allocating, or not allocating
> a swap cluster, but it has to do with the fact that the THP deferred split 
> scan
> can transiently race with swapcache insertion, and the fact that when you run
> your swap area on rotational storage cluster_info is _always_ NULL.
> split_swap_cluster() needs to check for lock_cluster() returning NULL because
> that's one possible case, and it clearly fails to do so.

If there's a race, we should fix the race.  But the code path for
swapcache insertion is,

add_to_swap()
  get_swap_page() /* Return if fails to allocate */
  add_to_swap_cache()
SetPageSwapCache()

While the code path to split THP is,

split_huge_page_to_list()
  if PageSwapCache()
split_swap_cluster()

Both code paths are protected by the page lock.  So there should be some
other reason that triggers the bug.

And again, for HDD, a THP shouldn't have PageSwapCache() set in the
first place.  If it does, the bug is that the flag is set, and we should fix
the setting.

> Run a workload that causes multiple THP COWs, and add a memory hogger to create
> memory pressure so you'll force the reclaimers to kick the registered
> shrinkers. The trigger is not heavy swapping, and that's probably why
> most swap test cases don't hit it. The window is tight, but you will get the
> NULL pointer dereference.

Do you have a script to reproduce the bug?

> Regardless of whether you find further bugs or not, this patch is needed to
> correct a blunt coding mistake.

As above.  I don't agree with that.

Best Regards,
Huang, Ying


Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-23 Thread Huang, Ying
Rafael Aquini  writes:

> On Wed, Sep 23, 2020 at 01:13:49PM +0800, Huang, Ying wrote:
>> Rafael Aquini  writes:
>> 
>> > On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
>> >> Hi, Rafael,
>> >> 
>> >> Rafael Aquini  writes:
>> >> 
>> >> > The swap area descriptor only gets struct swap_cluster_info 
>> >> > *cluster_info
>> >> > allocated if the swapfile is backed by non-rotational storage.
>> >> > When the swap area is laid on top of ordinary disk spindles, 
>> >> > lock_cluster()
>> >> > will naturally return NULL.
>> >> 
>> >> Thanks for reporting.  But the bug looks strange.  Because in a system
>> >> with only HDD swap devices, during THP swap out, the swap cluster
>> >> shouldn't be allocated, as in
>> >> 
>> >> shrink_page_list()
>> >>   add_to_swap()
>> >> get_swap_page()
>> >>   get_swap_pages()
>> >> swap_alloc_cluster()
>> >>
>> >
>> > The underlying problem is that swap_info_struct.cluster_info is always 
>> > NULL 
>> > on the rotational storage case.
>> 
>> Yes.
>> 
>> > So, it's very easy to follow that constructions 
>> > like this one, in split_swap_cluster 
>> >
>> > ...
>> > ci = lock_cluster(si, offset);
>> > cluster_clear_huge(ci);
>> > ...
>> >
>> > will go for a NULL pointer dereference, in that case, given that 
>> > lock_cluster 
>> > reads:
>> >
>> > ...
>> >struct swap_cluster_info *ci;
>> > ci = si->cluster_info;
>> > if (ci) {
>> > ci += offset / SWAPFILE_CLUSTER;
>> > spin_lock(&ci->lock);
>> > }
>> > return ci;
>> > ...
>> 
>> But on HDD, we shouldn't call split_swap_cluster() at all, because we
>> will not allocate swap cluster firstly.  So, if we run into this,
>> there should be some other bug, we need to figure it out.
>>
>
> split_swap_cluster() gets called by split_huge_page_to_list(),
> if the page happens to be in the swapcache, and it will always
> go that way, regardless the backing storage type:
>
> ...
> __split_huge_page(page, list, end, flags);
> if (PageSwapCache(head)) {
> swp_entry_t entry = { .val = page_private(head) };
>
> ret = split_swap_cluster(entry);
> } else
> ret = 0;
> ...
>
> The problem is not about allocating the swap_cluster -- it's obviously
> not allocated in these cases. The problem is that on rotational
> storage you don't even have the base structure that allows you to
> keep the swap clusters (cluster_info does not get allocated, at all,
> so si->cluster_info is always NULL)
>
> You can argue about other bugs all you want, it doesn't change
> the fact that this code is incomplete as it sits, because it 
> misses checking for a real case where lock_cluster() will return NULL.

I don't want to argue about anything.  I just want to fix the bug.  The
fix here will hide the real bug instead of fixing it.  For the situation
you described (PageSwapCache() returns true for a THP backed by a normal
swap entry (not a swap cluster)), we will run into other trouble too.  So
we need to find the root cause and fix it.

Can you help me collect more information to fix the real bug?  Or, can
you tell me how to reproduce it?

Best Regards,
Huang, Ying


Re: [RFC -V2] autonuma: Migrate on fault among multiple bound nodes

2020-09-22 Thread Huang, Ying
Phil Auld  writes:

> Hi,
>
> On Tue, Sep 22, 2020 at 02:54:01PM +0800 Huang Ying wrote:
>> Now, AutoNUMA can only optimize the page placement among the NUMA nodes if 
>> the
>> default memory policy is used.  Because the memory policy specified 
>> explicitly
>> should take precedence.  But this seems too strict in some situations.  For
>> example, on a system with 4 NUMA nodes, if the memory of an application is 
>> bound
>> to the node 0 and 1, AutoNUMA can potentially migrate the pages between the 
>> node
>> 0 and 1 to reduce cross-node accessing without breaking the explicit memory
>> binding policy.
>> 
>> So in this patch, if mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY) is used to 
>> bind
>> the memory of the application to multiple nodes, and in the hint page fault
>> handler both the faulting page node and the accessing node are in the policy
>> nodemask, the page will be tried to be migrated to the accessing node to 
>> reduce
>> the cross-node accessing.
>>
>
> Do you have any performance numbers that show the effects of this on
> a workload?

I have done some simple tests to confirm that NUMA balancing works in the
target configuration.

As for performance numbers, they are exactly the same as those of the original
NUMA balancing in a different configuration: with memory bound to all NUMA
nodes versus without memory binding.

>
>> [Peter Zijlstra: provided the simplified implementation method.]
>> 
>> Questions:
>> 
>> Sysctl knob kernel.numa_balancing can enable/disable AutoNUMA optimizing
>> globally.  But for the memory areas that are bound to multiple NUMA nodes, 
>> even
>> if the AutoNUMA is enabled globally via the sysctl knob, we still need to 
>> enable
>> AutoNUMA again with a special flag.  Why not just optimize the page 
>> placement if
>> possible as long as AutoNUMA is enabled globally?  The interface would look
>> simpler with that.
>
>
> I agree. I think it should try to do this if globally enabled.

Thanks!

>> 
>> Signed-off-by: "Huang, Ying" 
>> Cc: Andrew Morton 
>> Cc: Ingo Molnar 
>> Cc: Mel Gorman 
>> Cc: Rik van Riel 
>> Cc: Johannes Weiner 
>> Cc: "Matthew Wilcox (Oracle)" 
>> Cc: Dave Hansen 
>> Cc: Andi Kleen 
>> Cc: Michal Hocko 
>> Cc: David Rientjes 
>> ---
>>  mm/mempolicy.c | 17 +++--
>>  1 file changed, 11 insertions(+), 6 deletions(-)
>> 
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index eddbe4e56c73..273969204732 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2494,15 +2494,19 @@ int mpol_misplaced(struct page *page, struct 
>> vm_area_struct *vma, unsigned long
>>  break;
>>  
>>  case MPOL_BIND:
>> -
>>  /*
>> - * allows binding to multiple nodes.
>> - * use current page if in policy nodemask,
>> - * else select nearest allowed node, if any.
>> - * If no allowed nodes, use current [!misplaced].
>> + * Allows binding to multiple nodes.  If both current and
>> + * accessing nodes are in policy nodemask, migrate to
>> + * accessing node to optimize page placement. Otherwise,
>> + * use current page if in policy nodemask, else select
>> + * nearest allowed node, if any.  If no allowed nodes, use
>> + * current [!misplaced].
>>   */
>> -    if (node_isset(curnid, pol->v.nodes))
>> +if (node_isset(curnid, pol->v.nodes)) {
>> +if (node_isset(thisnid, pol->v.nodes))
>> +goto moron;
>
> Nice label :)

OK.  Because quite a few people pay attention to this, I will rename all
"moron" to "mopron" as suggested by Matthew.  Although MPOL_F_MORON is
defined in include/uapi/linux/mempolicy.h, it is explicitly marked as an
internal flag.

Best Regards,
Huang, Ying

>>  goto out;
>> +}
>>  z = first_zones_zonelist(
>>  node_zonelist(numa_node_id(), GFP_HIGHUSER),
>>  gfp_zone(GFP_HIGHUSER),
>> @@ -2516,6 +2520,7 @@ int mpol_misplaced(struct page *page, struct 
>> vm_area_struct *vma, unsigned long
>>  
>>  /* Migrate the page towards the node whose CPU is referencing it */
>>  if (pol->flags & MPOL_F_MORON) {
>> +moron:
>>  polnid = thisnid;
>>  
>>  if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
>> -- 
>> 2.28.0
>> 
>
>
> Cheers,
> Phil


Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-22 Thread Huang, Ying
Rafael Aquini  writes:

> On Wed, Sep 23, 2020 at 10:21:36AM +0800, Huang, Ying wrote:
>> Hi, Rafael,
>> 
>> Rafael Aquini  writes:
>> 
>> > The swap area descriptor only gets struct swap_cluster_info *cluster_info
>> > allocated if the swapfile is backed by non-rotational storage.
>> > When the swap area is laid on top of ordinary disk spindles, lock_cluster()
>> > will naturally return NULL.
>> 
>> Thanks for reporting.  But the bug looks strange.  Because in a system
>> with only HDD swap devices, during THP swap out, the swap cluster
>> shouldn't be allocated, as in
>> 
>> shrink_page_list()
>>   add_to_swap()
>> get_swap_page()
>>   get_swap_pages()
>> swap_alloc_cluster()
>>
>
> The underlying problem is that swap_info_struct.cluster_info is always NULL 
> on the rotational storage case.

Yes.

> So, it's very easy to follow that constructions 
> like this one, in split_swap_cluster 
>
> ...
> ci = lock_cluster(si, offset);
> cluster_clear_huge(ci);
> ...
>
> will go for a NULL pointer dereference, in that case, given that lock_cluster 
> reads:
>
> ...
>   struct swap_cluster_info *ci;
> ci = si->cluster_info;
> if (ci) {
> ci += offset / SWAPFILE_CLUSTER;
> spin_lock(&ci->lock);
> }
> return ci;
> ...

But on HDD, we shouldn't call split_swap_cluster() at all, because we
will not allocate a swap cluster in the first place.  So, if we run into this,
there should be some other bug that we need to figure out.

Best Regards,
Huang, Ying


Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference

2020-09-22 Thread Huang, Ying
Hi, Rafael,

Rafael Aquini  writes:

> The swap area descriptor only gets struct swap_cluster_info *cluster_info
> allocated if the swapfile is backed by non-rotational storage.
> When the swap area is laid on top of ordinary disk spindles, lock_cluster()
> will naturally return NULL.

Thanks for reporting.  But the bug looks strange, because in a system
with only HDD swap devices, the swap cluster shouldn't be allocated
during THP swap out, as in

shrink_page_list()
  add_to_swap()
get_swap_page()
  get_swap_pages()
swap_alloc_cluster()

There, si->free_clusters is checked, and it should be empty for HDD.  So
in shrink_page_list(), the THP should have been split.  And in
split_huge_page_to_list(), PageSwapCache() is checked before calling
split_swap_cluster().  So this appears strange.

All in all, it appears that we need to find the real root cause of the
bug.

Did you test with the latest upstream kernel?  Can you help trace the
return value of swap_alloc_cluster()?  Can you share the swap device
information?
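
If it helps, one possible way to capture that return value is a small
kretprobe module like the sketch below (illustrative only: it assumes kprobes
and kallsyms are enabled, and that swap_alloc_cluster(), being a static
function, is in kallsyms and has not been inlined; if registering the probe
fails, probing get_swap_pages() could be tried instead).

/*
 * Sketch of a throwaway debug module (not a formal patch): log the return
 * value of swap_alloc_cluster().  Assumes CONFIG_KPROBES is enabled and the
 * static function is visible in kallsyms and not inlined.
 */
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ptrace.h>

static int sac_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	/* swap_alloc_cluster() returns bool: 0 means no free cluster. */
	pr_info("swap_alloc_cluster() returned %lu\n", regs_return_value(regs));
	return 0;
}

static struct kretprobe sac_kretprobe = {
	.handler	= sac_ret_handler,
	.maxactive	= 16,
	.kp.symbol_name	= "swap_alloc_cluster",
};

static int __init sac_trace_init(void)
{
	return register_kretprobe(&sac_kretprobe);
}

static void __exit sac_trace_exit(void)
{
	unregister_kretprobe(&sac_kretprobe);
}

module_init(sac_trace_init);
module_exit(sac_trace_exit);
MODULE_LICENSE("GPL");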

Best Regards,
Huang, Ying

> CONFIG_THP_SWAP exposes cluster_info infrastructure to a broader number of
> use cases, and split_swap_cluster(), which is the counterpart of 
> split_huge_page()
> for the THPs in the swapcache, misses checking the return of lock_cluster 
> before
> operating on the cluster_info pointer.
>
> This patch addresses that issue by adding a proper check for the pointer
> not being NULL in the wrappers cluster_{is,clear}_huge(), in order to avoid
> crashes similar to the one below:
>
> [ 5758.157556] BUG: kernel NULL pointer dereference, address: 0007
> [ 5758.165331] #PF: supervisor write access in kernel mode
> [ 5758.171161] #PF: error_code(0x0002) - not-present page
> [ 5758.176894] PGD 0 P4D 0
> [ 5758.179721] Oops: 0002 [#1] SMP PTI
> [ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S  
>  - ---  5.9.0-0.rc3.1.tst.el8.x86_64 #1
> [ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS 
> SE5C600.86B.02.01.0002.082220131453 08/22/2013
> [ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
> [ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 
> 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 <80> 24 
> 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
> [ 5758.234478] RSP: 0018:b147442d7af0 EFLAGS: 00010246
> [ 5758.240309] RAX:  RBX: 0014b217 RCX: 
> b14779fd9000
> [ 5758.248281] RDX: 0014b217 RSI: 9c52f2ab1400 RDI: 
> 0014b217
> [ 5758.256246] RBP: e00c51168080 R08: e00c5116fe08 R09: 
> 9c52fffd3000
> [ 5758.264208] R10: e00c511537c8 R11: 9c52fffd3c90 R12: 
> 
> [ 5758.272172] R13: e00c5117 R14: e00c5117 R15: 
> e00c51168040
> [ 5758.280134] FS:  () GS:9c52f2a8() 
> knlGS:
> [ 5758.289163] CS:  0010 DS:  ES:  CR0: 80050033
> [ 5758.295575] CR2: 0007 CR3: 22a0e003 CR4: 
> 000606e0
> [ 5758.303538] Call Trace:
> [ 5758.306273]  split_huge_page_to_list+0x88b/0x950
> [ 5758.311433]  deferred_split_scan+0x1ca/0x310
> [ 5758.316202]  do_shrink_slab+0x12c/0x2a0
> [ 5758.320491]  shrink_slab+0x20f/0x2c0
> [ 5758.324482]  shrink_node+0x240/0x6c0
> [ 5758.328469]  balance_pgdat+0x2d1/0x550
> [ 5758.332652]  kswapd+0x201/0x3c0
> [ 5758.336157]  ? finish_wait+0x80/0x80
> [ 5758.340147]  ? balance_pgdat+0x550/0x550
> [ 5758.344525]  kthread+0x114/0x130
> [ 5758.348126]  ? kthread_park+0x80/0x80
> [ 5758.352214]  ret_from_fork+0x22/0x30
> [ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr 
> intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
> mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper 
> crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt 
> fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm 
> intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei 
> ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci 
> libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash 
> dm_log dm_mod
> [ 5758.412673] CR2: 0007
> [0.00] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 
> (mockbu...@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red 
> Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020
>
> Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
> Signed-off-by: Rafael Aquini 
> ---
>  mm/swapfile.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)

[RFC -V2] autonuma: Migrate on fault among multiple bound nodes

2020-09-21 Thread Huang Ying
Now, AutoNUMA can only optimize the page placement among the NUMA nodes if the
default memory policy is used, because the memory policy specified explicitly
should take precedence.  But this seems too strict in some situations.  For
example, on a system with 4 NUMA nodes, if the memory of an application is bound
to nodes 0 and 1, AutoNUMA can potentially migrate the pages between nodes
0 and 1 to reduce cross-node accesses without breaking the explicit memory
binding policy.

So in this patch, if mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY) is used to bind
the memory of the application to multiple nodes, and in the hint page fault
handler both the faulting page node and the accessing node are in the policy
nodemask, an attempt will be made to migrate the page to the accessing node to
reduce cross-node accesses.
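
For example, from user space the intended usage looks something like the
following sketch (illustrative only: it assumes the target kernel accepts
MPOL_MF_LAZY from mbind() as described above, that nodes 0 and 1 exist, and
that MPOL_MF_LAZY is (1 << 3) when <numaif.h> does not define it).

/*
 * Illustrative user-space sketch (not part of the patch): bind an anonymous
 * region to nodes 0 and 1 with MPOL_BIND + MPOL_MF_LAZY so that, with this
 * patch, NUMA balancing may migrate its pages between the two bound nodes on
 * hint page faults.  Build with -lnuma on a machine with at least 2 nodes.
 */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* assumed value, from uapi mempolicy.h */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB test region */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);	/* nodes 0 and 1 */
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Bind to nodes {0,1}; MPOL_MF_LAZY requests migrate-on-fault. */
	if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
		  MPOL_MF_MOVE | MPOL_MF_LAZY)) {
		perror("mbind");
		return 1;
	}

	/* Touch the pages so they are allocated and can later be sampled
	 * by the NUMA hint page fault mechanism. */
	memset(p, 1, len);
	return 0;
}

With that in place, the mpol_misplaced() change below decides whether a
faulting page is migrated to the accessing node.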

[Peter Zijlstra: provided the simplified implementation method.]

Questions:

The sysctl knob kernel.numa_balancing can enable/disable AutoNUMA optimization
globally.  But for the memory areas that are bound to multiple NUMA nodes, even
if AutoNUMA is enabled globally via the sysctl knob, we still need to enable
AutoNUMA again with a special flag.  Why not just optimize the page placement
whenever possible as long as AutoNUMA is enabled globally?  The interface would
look simpler that way.

Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Dave Hansen 
Cc: Andi Kleen 
Cc: Michal Hocko 
Cc: David Rientjes 
---
 mm/mempolicy.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..273969204732 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2494,15 +2494,19 @@ int mpol_misplaced(struct page *page, struct 
vm_area_struct *vma, unsigned long
break;
 
case MPOL_BIND:
-
/*
-* allows binding to multiple nodes.
-* use current page if in policy nodemask,
-* else select nearest allowed node, if any.
-* If no allowed nodes, use current [!misplaced].
+* Allows binding to multiple nodes.  If both current and
+* accessing nodes are in policy nodemask, migrate to
+* accessing node to optimize page placement. Otherwise,
+* use current page if in policy nodemask, else select
+* nearest allowed node, if any.  If no allowed nodes, use
+* current [!misplaced].
 */
-   if (node_isset(curnid, pol->v.nodes))
+   if (node_isset(curnid, pol->v.nodes)) {
+   if (node_isset(thisnid, pol->v.nodes))
+   goto moron;
goto out;
+   }
z = first_zones_zonelist(
node_zonelist(numa_node_id(), GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
@@ -2516,6 +2520,7 @@ int mpol_misplaced(struct page *page, struct 
vm_area_struct *vma, unsigned long
 
/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
+moron:
polnid = thisnid;
 
if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
-- 
2.28.0



Re: [PATCH 2/2] mm,swap: skip swap readahead if page was obtained instantaneously

2020-09-21 Thread huang ying
On Tue, Sep 22, 2020 at 10:02 AM Rik van Riel  wrote:
>
> Check whether a swap page was obtained instantaneously, for example
> because it is in zswap, or on a very fast IO device which uses busy
> waiting, and we did not wait on IO to swap in this page.
> If no IO was needed to get the swap page we want, kicking off readahead
> on surrounding swap pages is likely to be counterproductive, because the
> extra loads will cause additional latency, use up extra memory, and chances
> are the surrounding pages in swap are just as fast to load as this one,
> making readahead pointless.
>
> Signed-off-by: Rik van Riel 
> ---
>  mm/swap_state.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index aacb9ba53f63..6919f9d5fe88 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -637,6 +637,7 @@ static struct page *swap_cluster_read_one(swp_entry_t 
> entry,
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_fault *vmf)

Why not do this for swap_vma_readahead() too?  swap_cluster_read_one()
can be used in swap_vma_readahead() too.

>  {
> +   struct page *page;
> unsigned long entry_offset = swp_offset(entry);
> unsigned long offset = entry_offset;
> unsigned long start_offset, end_offset;
> @@ -668,11 +669,18 @@ struct page *swap_cluster_readahead(swp_entry_t entry, 
> gfp_t gfp_mask,
> end_offset = si->max - 1;
>
> blk_start_plug(&plug);
> +   /* If we read the page without waiting on IO, skip readahead. */
> +   page = swap_cluster_read_one(entry, offset, gfp_mask, vma, addr, 
> false);
> +   if (page && PageUptodate(page))
> +   goto skip_unplug;
> +
> +   /* Ok, do the async read-ahead now. */
> for (offset = start_offset; offset <= end_offset ; offset++) {
> -   /* Ok, do the async read-ahead now */
> -   swap_cluster_read_one(entry, offset, gfp_mask, vma, addr,
> - offset != entry_offset);
> +   if (offset == entry_offset)
> +   continue;
> +   swap_cluster_read_one(entry, offset, gfp_mask, vma, addr, 
> true);
> }
> +skip_unplug:
> blk_finish_plug(&plug);
>
> lru_add_drain();/* Push any new pages onto the LRU now */

Best Regards,
Huang, Ying


Re: [RFC] autonuma: Migrate on fault among multiple bound nodes

2020-09-16 Thread Huang, Ying
pet...@infradead.org writes:

> On Wed, Sep 16, 2020 at 08:59:36AM +0800, Huang Ying wrote:
>> +static bool mpol_may_mof(struct mempolicy *pol)
>> +{
>> +/* May migrate among bound nodes for MPOL_BIND */
>> +return pol->flags & MPOL_F_MOF ||
>> +(pol->mode == MPOL_BIND && nodes_weight(pol->v.nodes) > 1);
>> +}
>
> This is weird, why not just set F_MOF on the policy?
>
> In fact, why wouldn't something like:
>
>   mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY);
>
> work today? Afaict MF_LAZY will unconditionally result in M_MOF.

Another question.

This means that for all VMAs that are mbind()-ed without MPOL_MF_LAZY, and for
tasks which bind memory via set_mempolicy(), we will not try to optimize
their page placement among the bound nodes even if the sysctl knob
kernel.numa_balancing is enabled.

Is this the intended behavior?  Although we enable AutoNUMA globally, we
will not try to use it everywhere possible; in some places, it
needs to be enabled again.

Best Regards,
Huang, Ying


Re: [RFC][PATCH 4/9] mm/migrate: make migrate_pages() return nr_succeeded

2020-09-16 Thread Huang, Ying
Dave Hansen  writes:

> diff -puN mm/migrate.c~migrate_pages-add-success-return mm/migrate.c
> --- a/mm/migrate.c~migrate_pages-add-success-return   2020-08-18 
> 11:36:51.284583183 -0700
> +++ b/mm/migrate.c2020-08-18 11:36:51.295583183 -0700
> @@ -1432,6 +1432,7 @@ out:
>   * @mode:The migration mode that specifies the constraints for
>   *   page migration, if any.
>   * @reason:  The reason for page migration.
> + * @nr_succeeded:The number of pages migrated successfully.
>   *
>   * The function returns after 10 attempts or if no pages are movable any more
>   * because the list has become empty or no retryable pages exist any more.
> @@ -1442,11 +1443,10 @@ out:
>   */
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   free_page_t put_new_page, unsigned long private,
> - enum migrate_mode mode, int reason)
> + enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
>  {
>   int retry = 1;
>   int nr_failed = 0;
> - int nr_succeeded = 0;
>   int pass = 0;
>   struct page *page;
>   struct page *page2;
> @@ -1500,7 +1500,7 @@ retry:
>   retry++;
>   break;
>   case MIGRATEPAGE_SUCCESS:
> - nr_succeeded++;
> + (*nr_succeeded)++;

I think we should take THP into account in the counting now, because later
nr_succeeded will be used to count the number of reclaimed pages,
and THP is respected for that.
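
For example, something like the following adjustment of the hunk above (just a
sketch; it assumes thp_nr_pages() is available in this tree, hpage_nr_pages()
on older trees):

 	case MIGRATEPAGE_SUCCESS:
-		(*nr_succeeded)++;
+		/*
+		 * Count base pages, so that a successfully migrated THP
+		 * contributes all of its subpages to the statistics, the
+		 * same way reclaim accounts pages.
+		 */
+		*nr_succeeded += thp_nr_pages(page);
 		break;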

Best Regards,
Huang, Ying

>   break;
>   default:
>   /*
> @@ -1517,11 +1517,11 @@ retry:
>   nr_failed += retry;
>   rc = nr_failed;
>  out:
> - if (nr_succeeded)
> - count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
> + if (*nr_succeeded)
> + count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
>   if (nr_failed)
>   count_vm_events(PGMIGRATE_FAIL, nr_failed);
> - trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
> + trace_mm_migrate_pages(*nr_succeeded, nr_failed, mode, reason);
>  
>   if (!swapwrite)
>   current->flags &= ~PF_SWAPWRITE;


Re: [RFC] autonuma: Migrate on fault among multiple bound nodes

2020-09-16 Thread Huang, Ying
Hi, Peter,

Thanks for comments!

pet...@infradead.org writes:

> On Wed, Sep 16, 2020 at 08:59:36AM +0800, Huang Ying wrote:
>
>> So in this patch, if MPOL_BIND is used to bind the memory of the
>> application to multiple nodes, and in the hint page fault handler both
>> the faulting page node and the accessing node are in the policy
>> nodemask, the page will be tried to be migrated to the accessing node
>> to reduce the cross-node accessing.
>
> Seems fair enough..
>
>> Questions:
>> 
>> Sysctl knob kernel.numa_balancing can enable/disable AutoNUMA
>> optimizing globally.  And now, it appears that the explicit NUMA
>> memory policy specifying (e.g. via numactl, mbind(), etc.) acts like
>> an implicit per-thread/VMA knob to enable/disable the AutoNUMA
>> optimizing for the thread/VMA.  Although this looks like a side effect
>> instead of an API, from commit fc3147245d19 ("mm: numa: Limit NUMA
>> scanning to migrate-on-fault VMAs"), this is used by some users?  So
>> the question is, do we need an explicit per-thread/VMA knob to
>> enable/disable AutoNUMA optimizing for the thread/VMA?  Or just use
>> the global knob, either optimize all thread/VMAs as long as the
>> explicitly specified memory policies are respected, or don't optimize
>> at all.
>
> I don't understand the question; that commit is not about disabling numa
> balancing, it's about avoiding pointless work and overhead. What's the
> point of scanning memory if you're not going to be allowed to move it
> anyway.

Because we are going to enable the moving, scanning is no longer
pointless, but it may also introduce overhead.

>> Signed-off-by: "Huang, Ying" 
>> Cc: Andrew Morton 
>> Cc: Ingo Molnar 
>> Cc: Mel Gorman 
>> Cc: Rik van Riel 
>> Cc: Johannes Weiner 
>> Cc: "Matthew Wilcox (Oracle)" 
>> Cc: Dave Hansen 
>> Cc: Andi Kleen 
>> Cc: Michal Hocko 
>> Cc: David Rientjes 
>> ---
>>  mm/mempolicy.c | 43 +++
>>  1 file changed, 31 insertions(+), 12 deletions(-)
>> 
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index eddbe4e56c73..a941eab2de24 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1827,6 +1827,13 @@ static struct mempolicy *get_vma_policy(struct 
>> vm_area_struct *vma,
>>  return pol;
>>  }
>>  
>> +static bool mpol_may_mof(struct mempolicy *pol)
>> +{
>> +/* May migrate among bound nodes for MPOL_BIND */
>> +return pol->flags & MPOL_F_MOF ||
>> +(pol->mode == MPOL_BIND && nodes_weight(pol->v.nodes) > 1);
>> +}
>
> This is weird, why not just set F_MOF on the policy?
>
> In fact, why wouldn't something like:
>
>   mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY);
>
> work today? Afaict MF_LAZY will unconditionally result in M_MOF.

There are some subtle differences.

- LAZY appears unnecessary for the per-task memory policy set via
  set_mempolicy(), while migrating among multiple bound nodes appears
  reasonable as a per-task memory policy.

- LAZY also means moving the pages that are not on the bound nodes to the
  bound nodes if memory is available.  Some users may want to do that only
  if should_numa_migrate_memory() returns true.

>> @@ -2494,20 +2503,30 @@ int mpol_misplaced(struct page *page, struct 
>> vm_area_struct *vma, unsigned long
>>  break;
>>  
>>  case MPOL_BIND:
>>  /*
>> + * Allows binding to multiple nodes.  If both current and
>> + * accessing nodes are in policy nodemask, migrate to
>> + * accessing node to optimize page placement. Otherwise,
>> + * use current page if in policy nodemask or MPOL_F_MOF not
>> + * set, else select nearest allowed node, if any.  If no
>> + * allowed nodes, use current [!misplaced].
>>   */
>> +if (node_isset(curnid, pol->v.nodes)) {
>> +if (node_isset(thisnid, pol->v.nodes)) {
>> +moron = true;
>> +polnid = thisnid;
>> +} else {
>> +goto out;
>> +}
>> +} else if (!(pol->flags & MPOL_F_MOF)) {
>>  goto out;
>> +} else {
>> +z = first_zones_zonelist(
>>  node_zonelist(numa_node_id(), GFP_HIGHUSER),
>>  gfp_zone(GFP_HIGHUSER),
>>

[PATCH] x86, fakenuma: Avoid too large emulated node

2020-09-07 Thread Huang Ying
On a test system with 2 physical NUMA nodes, 8GB of memory, a small
memory hole from 640KB to 1MB, and a large memory hole from 3GB to
4GB, if "numa=fake=1G" is used in the kernel command line, the resulting
fake NUMA nodes are as follows,

NUMA: Node 0 [mem 0x-0x0009] + [mem 0x0010-0xbfff] -> 
[mem 0x-0xbfff]
NUMA: Node 0 [mem 0x-0xbfff] + [mem 0x1-0x13fff] -> 
[mem 0x-0x13fff]
Faking node 0 at [mem 0x-0x41ff] (1056MB)
Faking node 1 at [mem 0x00014000-0x00017fff] (1024MB)
Faking node 2 at [mem 0x4200-0x81ff] (1024MB)
Faking node 3 at [mem 0x00018000-0x0001bfff] (1024MB)
Faking node 4 at [mem 0x8200-0x00013fff] (3040MB)
Faking node 5 at [mem 0x0001c000-0x0001] (1024MB)
Faking node 6 at [mem 0x0002-0x00023fff] (1024MB)

Here, 7 fake NUMA nodes are emulated, and the usable size of fake node 4
is 3040 - 1024 = 2016MB, because the 3GB-4GB memory hole falls inside
it.  This is nearly 2 times the size of the other fake nodes (about
1024MB), which isn't a reasonable split.  The better way is to keep the
fake node size from being too large or too small.  So in this patch, the
splitting algorithm is changed to keep the fake node size between 1/2
and 3/2 of the specified node size.  After applying this patch, the
resulting fake NUMA nodes become,

Faking node 0 at [mem 0x-0x41ff] (1056MB)
Faking node 1 at [mem 0x00014000-0x00017fff] (1024MB)
Faking node 2 at [mem 0x4200-0x81ff] (1024MB)
Faking node 3 at [mem 0x00018000-0x0001bfff] (1024MB)
Faking node 4 at [mem 0x8200-0x000103ff] (2080MB)
Faking node 5 at [mem 0x0001c000-0x0001] (1024MB)
Faking node 6 at [mem 0x00010400-0x00013fff] (960MB)
Faking node 7 at [mem 0x0002-0x00023fff] (1024MB)

The newly added node 6 is a little smaller than the specified node
size (960MB vs. 1024MB).  But the overall results look more
reasonable.

Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dan Williams 
Cc: David Rientjes 
Cc: Dave Jiang 
---
 arch/x86/mm/numa_emulation.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 683cd12f4793..231469e1de6a 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -300,9 +300,10 @@ static int __init 
split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
/*
 * If there won't be enough non-reserved memory for the
 * next node, this one must extend to the end of the
-* physical node.
+* physical node.  The size of the emulated node should
+* be between size/2 and size*3/2.
 */
-   if ((limit - end - mem_hole_size(end, limit) < size)
+   if ((limit - end - mem_hole_size(end, limit) < size / 2)
&& !uniform)
end = limit;
 
-- 
2.28.0



[tip: x86/urgent] x86, fakenuma: Fix invalid starting node ID

2020-09-04 Thread tip-bot2 for Huang Ying
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: ccae0f36d500aef727f98acd8d0601e6b262a513
Gitweb:
https://git.kernel.org/tip/ccae0f36d500aef727f98acd8d0601e6b262a513
Author:Huang Ying 
AuthorDate:Fri, 04 Sep 2020 14:10:47 +08:00
Committer: Ingo Molnar 
CommitterDate: Fri, 04 Sep 2020 08:56:13 +02:00

x86, fakenuma: Fix invalid starting node ID

Commit:

  cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability")

uses "-1" as the starting node ID, which causes the strange kernel log as
follows, when "numa=fake=32G" is added to the kernel command line:

Faking node -1 at [mem 0x-0x000893ff] (35136MB)
Faking node 0 at [mem 0x00184000-0x00203fff] (32768MB)
Faking node 1 at [mem 0x00089400-0x00183fff] (64192MB)
Faking node 2 at [mem 0x00204000-0x00283fff] (32768MB)
Faking node 3 at [mem 0x00284000-0x00303fff] (32768MB)

And finally the kernel crashes:

BUG: Bad page state in process swapper  pfn:00011
page:(ptrval) refcount:0 mapcount:1 mapping:(ptrval) 
index:0x55cd7e44b270 pfn:0x11
failed to read mapping contents, not a valid kernel address?
flags: 0x5(locked|uptodate)
raw: 0005 55cd7e44af30 55cd7e44af50 00010006
raw: 55cd7e44b270 55cd7e44b290  55cd7e44b510
page dumped because: page still charged to cgroup
page->mem_cgroup:55cd7e44b510
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS 
SE5C620.86B.02.01.0008.031920191559 03/19/2019
Call Trace:
 dump_stack+0x57/0x80
 bad_page.cold+0x63/0x94
 __free_pages_ok+0x33f/0x360
 memblock_free_all+0x127/0x195
 mem_init+0x23/0x1f5
 start_kernel+0x219/0x4f5
 secondary_startup_64+0xb6/0xc0

Fix this bug via using 0 as the starting node ID.  This restores the
original behavior before cc9aec03e58f.

[ mingo: Massaged the changelog. ]

Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability")
Signed-off-by: "Huang, Ying" 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20200904061047.612950-1-ying.hu...@intel.com
---
 arch/x86/mm/numa_emulation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index c5174b4..683cd12 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -321,7 +321,7 @@ static int __init split_nodes_size_interleave(struct 
numa_meminfo *ei,
  u64 addr, u64 max_addr, u64 size)
 {
return split_nodes_size_interleave_uniform(ei, pi, addr, max_addr, size,
-   0, NULL, NUMA_NO_NODE);
+   0, NULL, 0);
 }
 
 static int __init setup_emu2phys_nid(int *dfl_phys_nid)


[PATCH RESEND] x86, fakenuma: Fix invalid starting node ID

2020-09-03 Thread Huang Ying
Commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split
capability") uses "-1" as the starting node ID, which causes the
strange kernel log as follows when "numa=fake=32G" is added to the
kernel command line.

Faking node -1 at [mem 0x-0x000893ff] (35136MB)
Faking node 0 at [mem 0x00184000-0x00203fff] (32768MB)
Faking node 1 at [mem 0x00089400-0x00183fff] (64192MB)
Faking node 2 at [mem 0x00204000-0x00283fff] (32768MB)
Faking node 3 at [mem 0x00284000-0x00303fff] (32768MB)

And finally the kernel BUG as follows,

BUG: Bad page state in process swapper  pfn:00011
page:(ptrval) refcount:0 mapcount:1 mapping:(ptrval) 
index:0x55cd7e44b270 pfn:0x11
failed to read mapping contents, not a valid kernel address?
flags: 0x5(locked|uptodate)
raw: 0005 55cd7e44af30 55cd7e44af50 00010006
raw: 55cd7e44b270 55cd7e44b290  55cd7e44b510
page dumped because: page still charged to cgroup
page->mem_cgroup:55cd7e44b510
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS 
SE5C620.86B.02.01.0008.031920191559 03/19/2019
Call Trace:
 dump_stack+0x57/0x80
 bad_page.cold+0x63/0x94
 __free_pages_ok+0x33f/0x360
 memblock_free_all+0x127/0x195
 mem_init+0x23/0x1f5
 start_kernel+0x219/0x4f5
 secondary_startup_64+0xb6/0xc0

Fix this bug by using 0 as the starting node ID.  This restores the
original behavior before commit cc9aec03e58f ("x86/numa_emulation:
Introduce uniform split capability").

Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability")
Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Dan Williams 
Cc: David Rientjes 
Cc: Dave Jiang 
---
 arch/x86/mm/numa_emulation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index c5174b4e318b..683cd12f4793 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -321,7 +321,7 @@ static int __init split_nodes_size_interleave(struct 
numa_meminfo *ei,
  u64 addr, u64 max_addr, u64 size)
 {
return split_nodes_size_interleave_uniform(ei, pi, addr, max_addr, size,
-   0, NULL, NUMA_NO_NODE);
+   0, NULL, 0);
 }
 
 static int __init setup_emu2phys_nid(int *dfl_phys_nid)
-- 
2.28.0



[PATCH] x86, fakenuma: Fix invalid starting node ID

2020-08-28 Thread Huang Ying
Commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split
capability") uses "-1" as the starting node ID, which causes the
strange kernel log as follows when "numa=fake=32G" is added to the
kernel command line.

Faking node -1 at [mem 0x-0x000893ff] (35136MB)
Faking node 0 at [mem 0x00184000-0x00203fff] (32768MB)
Faking node 1 at [mem 0x00089400-0x00183fff] (64192MB)
Faking node 2 at [mem 0x00204000-0x00283fff] (32768MB)
Faking node 3 at [mem 0x00284000-0x00303fff] (32768MB)

And finally the kernel BUG as follows,

BUG: Bad page state in process swapper  pfn:00011
page:(ptrval) refcount:0 mapcount:1 mapping:(ptrval) 
index:0x55cd7e44b270 pfn:0x11
failed to read mapping contents, not a valid kernel address?
flags: 0x5(locked|uptodate)
raw: 0005 55cd7e44af30 55cd7e44af50 00010006
raw: 55cd7e44b270 55cd7e44b290  55cd7e44b510
page dumped because: page still charged to cgroup
page->mem_cgroup:55cd7e44b510
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.9.0-rc2 #1
Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS 
SE5C620.86B.02.01.0008.031920191559 03/19/2019
Call Trace:
 dump_stack+0x57/0x80
 bad_page.cold+0x63/0x94
 __free_pages_ok+0x33f/0x360
 memblock_free_all+0x127/0x195
 mem_init+0x23/0x1f5
 start_kernel+0x219/0x4f5
 secondary_startup_64+0xb6/0xc0

Fix this bug by using 0 as the starting node ID.  This restores the
original behavior before commit cc9aec03e58f ("x86/numa_emulation:
Introduce uniform split capability").

Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability")
Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: Dan Williams 
Cc: David Rientjes 
Cc: Dave Jiang 
---
 arch/x86/mm/numa_emulation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index c5174b4e318b..683cd12f4793 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -321,7 +321,7 @@ static int __init split_nodes_size_interleave(struct 
numa_meminfo *ei,
  u64 addr, u64 max_addr, u64 size)
 {
return split_nodes_size_interleave_uniform(ei, pi, addr, max_addr, size,
-   0, NULL, NUMA_NO_NODE);
+   0, NULL, 0);
 }
 
 static int __init setup_emu2phys_nid(int *dfl_phys_nid)
-- 
2.28.0



[RFC -V3 0/5] autonuma: Optimize memory placement for memory tiering system

2020-08-24 Thread Huang Ying
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory).  The
memory subsystem of such machines can be called a memory tiering
system, because the performance of the different types of memory is
usually different.

After commit c221c0b0308f ("device-dax: "Hotplug" persistent memory
for use like normal RAM"), the PMEM could be used as the
cost-effective volatile memory in separate NUMA nodes.  In a typical
memory tiering system, there are CPUs, DRAM and PMEM in each physical
NUMA node.  The CPUs and the DRAM will be put in one logical node,
while the PMEM will be put in another (faked) logical node.

To optimize the system overall performance, the hot pages should be
placed in DRAM node.  To do that, we need to identify the hot pages in
the PMEM node and migrate them to DRAM node via NUMA migration.

The original AutoNUMA already has a set of mechanisms to identify the
pages recently accessed by the CPUs in a node and migrate those pages
to that node.  So we can reuse these mechanisms to optimize the page
placement in the memory tiering system.  This has been implemented in
this patchset.

On the other hand, the cold pages should be placed in the PMEM node.
So we also need to identify the cold pages in the DRAM node and
migrate them to the PMEM node.

In the following patchset,

[RFC][PATCH 0/9] [v3] Migrate Pages in lieu of discard
https://lkml.kernel.org/lkml/20200818184122.29c41...@viggo.jf.intel.com/

A mechanism to demote the cold DRAM pages to PMEM node under memory
pressure is implemented.  Based on that, the cold DRAM pages can be
demoted to PMEM node proactively to free some memory space on DRAM
node.  And this frees the space on DRAM node for the hot PMEM pages to
be promoted to.  This has been implemented in this patchset too.

The patchset is based on the following not-yet-merged patchset,

[RFC][PATCH 0/9] [v3] Migrate Pages in lieu of discard
https://lkml.kernel.org/lkml/20200818184122.29c41...@viggo.jf.intel.com/

This is part of a larger patch set.  If you want to apply these or
play with them, I'd suggest using the tree from below,

https://github.com/hying-caritas/linux/commits/autonuma-r3

We have tested the solution with the pmbench memory accessing
benchmark with the 80:20 read/write ratio and the normal access
address distribution on a 2 socket Intel server with Optane DC
Persistent Memory Model.  The test results of the base kernel and step
by step optimizations are as follows,

Throughput  Promotion  DRAM bandwidth
  access/s   MB/sMB/s
   --- --  --
Base63868367.1 3626.7
Patch 1137611105.1  353.5  8608.5
Patch 2136124113.3  351.8  8480.7
Patch 316015.7  208.2  9407.8
Patch 4158461356.4  105.3  8790.0
Patch 5163254205.3   73.6  8800.2

The whole patchset improves the benchmark score by up to 155.6%.  The
basic AutoNUMA based optimization solution, the hot page selection
algorithm, and the automatic threshold adjustment algorithm contribute
most of the performance improvement and overhead (promotion MB/s)
reduction.

Changelog:

v3:

- Move the rate limit control as late as possible per Mel Gorman's
  comments.

- Revise the hot page selection implementation to store page scan time
  in struct page.

- Code cleanup.

- Rebased on the latest page demotion patchset.

v2:

- Addressed comments for V1.

- Rebased on v5.5.

Best Regards,
Huang, Ying


[RFC -V3 2/5] autonuma, memory tiering: Skip to scan fast memory

2020-08-24 Thread Huang Ying
If AutoNUMA is used only to optimize the page placement among memory
types rather than among sockets, the hot pages in the fast memory node
cannot be migrated (promoted) anywhere.  So it's unnecessary to scan
the pages in the fast memory node by changing their PTE/PMD mappings
to PROT_NONE, and the resulting page faults can be avoided as well.

In the test, if only the memory tiering AutoNUMA mode is enabled, the
number of AutoNUMA hint faults for the DRAM node is reduced to almost
0 with the patch, while the benchmark score doesn't change visibly.

Signed-off-by: "Huang, Ying" 
Suggested-by: Dave Hansen 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Dan Williams 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/node.h |  5 +
 mm/huge_memory.c | 30 +-
 mm/mprotect.c| 13 -
 3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/include/linux/node.h b/include/linux/node.h
index f7a539390c81..ac0e7a45edff 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -189,4 +189,9 @@ static inline int next_demotion_node(int node)
 }
 #endif
 
+static inline bool node_is_toptier(int node)
+{
+   return node_state(node, N_CPU);
+}
+
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84bee7e29..7d5db965a48c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1821,17 +1822,28 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
}
 #endif
 
-   /*
-* Avoid trapping faults against the zero page. The read-only
-* data is likely to be read-cached on the local CPU and
-* local/remote hits to the zero page are not interesting.
-*/
-   if (prot_numa && is_huge_zero_pmd(*pmd))
-   goto unlock;
+   if (prot_numa) {
+   struct page *page;
+   /*
+* Avoid trapping faults against the zero page. The read-only
+* data is likely to be read-cached on the local CPU and
+* local/remote hits to the zero page are not interesting.
+*/
+   if (is_huge_zero_pmd(*pmd))
+   goto unlock;
 
-   if (prot_numa && pmd_protnone(*pmd))
-   goto unlock;
+   if (pmd_protnone(*pmd))
+   goto unlock;
 
+   page = pmd_page(*pmd);
+   /*
+* Skip scanning top tier node if normal numa
+* balancing is disabled
+*/
+   if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+   node_is_toptier(page_to_nid(page)))
+   goto unlock;
+   }
/*
 * In case prot_numa, we are under mmap_read_lock(mm). It's critical
 * to not clear pmd intermittently to avoid race with MADV_DONTNEED
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..8abec0c267fa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -83,6 +84,7 @@ static unsigned long change_pte_range(struct vm_area_struct 
*vma, pmd_t *pmd,
 */
if (prot_numa) {
struct page *page;
+   int nid;
 
/* Avoid TLB flush if possible */
if (pte_protnone(oldpte))
@@ -109,7 +111,16 @@ static unsigned long change_pte_range(struct 
vm_area_struct *vma, pmd_t *pmd,
 * Don't mess with PTEs if page is already on 
the node
 * a single-threaded process is running on.
 */
-   if (target_node == page_to_nid(page))
+   nid = page_to_nid(page);
+   if (target_node == nid)
+   continue;
+
+   /*
+* Skip scanning top tier node if normal numa
+* balancing is disabled
+*/
+   if (!(sysctl_numa_balancing_mode & 
NUMA_BALANCING_NORMAL) &&
+   node_is_toptier(nid))
continue;
}
 
-- 
2.27.0



[RFC -V3 4/5] autonuma, memory tiering: Rate limit NUMA migration throughput

2020-08-24 Thread Huang Ying
In AutoNUMA memory tiering mode, the hot slow memory pages can be
promoted to the fast memory node via AutoNUMA.  But this incurs some
overhead too, so the workload performance may sometimes be hurt.  To
avoid disturbing the workload too much in these situations, we should
make it possible to rate limit the promotion throughput.

So, in this patch, we implement a simple rate limit algorithm as
follows.  The number of candidate pages to be promoted to the fast
memory node via AutoNUMA is counted; if the count exceeds the limit
specified by the user, AutoNUMA promotion is stopped until the next
second.
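
For illustration only, a minimal user-space sketch of this counting
scheme follows.  All names here are made up; the real logic lives in
numa_migration_check_rate_limit() in the diff below, which uses a node
statistics counter and jiffies instead.

#include <stdbool.h>
#include <time.h>

/* Per-node bookkeeping for the 1-second promotion window (illustrative). */
struct promo_window {
	unsigned long candidates;	/* candidates counted so far */
	unsigned long window_base;	/* count at the start of the window */
	time_t window_start;		/* start of the current 1s window */
};

/* rate_limit is in pages per second; nr is the size of the new batch of
 * candidate pages.  Returns true if promotion may still proceed. */
static bool promotion_rate_ok(struct promo_window *w,
			      unsigned long rate_limit, unsigned long nr)
{
	time_t now = time(NULL);

	w->candidates += nr;
	if (now > w->window_start) {		/* a new 1-second window */
		w->window_start = now;
		w->window_base = w->candidates;
	}
	/* Stop promoting once this window has exceeded the limit. */
	return w->candidates - w->window_base <= rate_limit;
}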

We tested the patch with the pmbench memory accessing benchmark with
an 80:20 read/write ratio and a normal access address distribution on
a 2 socket Intel server with Optane DC Persistent Memory.  In the
test, the page promotion throughput decreases 49.4% (from 208.2 MB/s
to 105.3 MB/s) with the patch, while the benchmark score decreases
only 1.1%.

A new sysctl knob kernel.numa_balancing_rate_limit_mbps is added for
the users to specify the limit.

TODO: Add ABI document for new sysctl knob.

Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Dave Hansen 
Cc: Dan Williams 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/mmzone.h   |  7 +++
 include/linux/sched/sysctl.h |  6 ++
 kernel/sched/fair.c  | 29 +++--
 kernel/sysctl.c  |  8 
 mm/vmstat.c  |  3 +++
 5 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f6f884970511..6e1e138cf61c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -203,6 +203,9 @@ enum node_stat_item {
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
NR_FOLL_PIN_ACQUIRED,   /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED,   /* pages returned via unpin_user_page() */
+#ifdef CONFIG_NUMA_BALANCING
+   NUMA_NR_CANDIDATE,  /* candidate pages to migrate */
+#endif
NR_VM_NODE_STAT_ITEMS
 };
 
@@ -746,6 +749,10 @@ typedef struct pglist_data {
struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_NUMA_BALANCING
+   unsigned long numa_ts;
+   unsigned long numa_nr_candidate;
+#endif
/* Fields commonly accessed by the page reclaim scanner */
 
/*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 435d66269d0a..40a3b6b3e0f8 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -50,6 +50,12 @@ extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 
+#ifdef CONFIG_NUMA_BALANCING
+extern unsigned int sysctl_numa_balancing_rate_limit;
+#else
+#define sysctl_numa_balancing_rate_limit   0
+#endif
+
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62510b435a89..7835485e4b8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1078,6 +1078,11 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
 
 /* The page with hint page fault latency < threshold in ms is considered hot */
 unsigned int sysctl_numa_balancing_hot_threshold = 1000;
+/*
+ * Restrict the NUMA migration per second in MB for each target node
+ * if no enough free space in target node
+ */
+unsigned int sysctl_numa_balancing_rate_limit = 65536;
 
 struct numa_group {
refcount_t refcount;
@@ -1450,6 +1455,23 @@ static int numa_hint_fault_latency(struct page *page)
return (time - last_time) & PAGE_ACCESS_TIME_MASK;
 }
 
+static bool numa_migration_check_rate_limit(struct pglist_data *pgdat,
+   unsigned long rate_limit, int nr)
+{
+   unsigned long nr_candidate;
+   unsigned long now = jiffies, last_ts;
+
+   mod_node_page_state(pgdat, NUMA_NR_CANDIDATE, nr);
+   nr_candidate = node_page_state(pgdat, NUMA_NR_CANDIDATE);
+   last_ts = pgdat->numa_ts;
+   if (now > last_ts + HZ &&
+   cmpxchg(&pgdat->numa_ts, last_ts, now) == last_ts)
+   pgdat->numa_nr_candidate = nr_candidate;
+   if (nr_candidate - pgdat->numa_nr_candidate > rate_limit)
+   return false;
+   return true;
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
int src_nid, int dst_cpu)
 {
@@ -1464,7 +1486,7 @@ bool should_numa_migrate_memory(struct task_struct *p, 
struct page * page,
if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
!nod

[RFC -V3 5/5] autonuma, memory tiering: Adjust hot threshold automatically

2020-08-24 Thread Huang Ying
It isn't easy for the administrator to determine the hot threshold.
So in this patch, a method to adjust the hot threshold automatically
is implemented.  The basic idea is to control the number of candidate
promotion pages so that it matches the promotion rate limit.  If the
hint page fault latency of a page is less than the hot threshold, we
will try to promote the page; such a page is called a candidate
promotion page.

If the number of the candidate promotion pages in the statistics
interval is much more than the promotion rate limit, the hot threshold
will be decreased to reduce the number of the candidate promotion
pages.  Otherwise, the hot threshold will be increased to increase the
number of the candidate promotion pages.
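
For illustration, a small sketch of this feedback step, assuming the
threshold and its maximum are plain numbers; the names and the 16-step
granularity are illustrative, and the kernel version is
numa_migration_adjust_threshold() in the diff below.

#define HOT_THRESHOLD_ADJUST_STEPS	16

unsigned long adjust_hot_threshold(unsigned long threshold,
				   unsigned long max_threshold,
				   unsigned long candidates,
				   unsigned long target)
{
	unsigned long step = max_threshold / HOT_THRESHOLD_ADJUST_STEPS;

	if (candidates > target * 11 / 10)
		/* Too many candidates: tighten, but keep at least one step. */
		threshold = threshold >= 2 * step ? threshold - step : step;
	else if (candidates < target * 9 / 10)
		/* Too few candidates: relax, but never above the maximum. */
		threshold = threshold + step <= max_threshold ?
			    threshold + step : max_threshold;

	return threshold;
}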

To make the above method work, in each statistics interval the total
number of pages to check (those on which the hint page faults occur)
and the hot/cold distribution need to be stable.  Because the page
tables are scanned linearly in AutoNUMA, but the hot/cold distribution
isn't uniform along the address space, the statistics interval should
be larger than the AutoNUMA scan period.  So in the patch, the max
scan period is used as the statistics interval, and it works well in
our tests.

The sysctl knob kernel.numa_balancing_hot_threshold_ms becomes the
initial value and max value of the hot threshold.

The patch improves the score of the pmbench memory accessing benchmark
with an 80:20 read/write ratio and a normal access address distribution
by 3%, with 30% fewer NUMA page migrations, on a 2 socket Intel server
with Optane DC Persistent Memory, because it improves the accuracy of
the hot page selection.

Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Dave Hansen 
Cc: Dan Williams 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/mmzone.h |  3 +++
 kernel/sched/fair.c| 40 
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e1e138cf61c..f7a7f0c374d5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -752,6 +752,9 @@ typedef struct pglist_data {
 #ifdef CONFIG_NUMA_BALANCING
unsigned long numa_ts;
unsigned long numa_nr_candidate;
+   unsigned long numa_threshold_ts;
+   unsigned long numa_threshold_nr_candidate;
+   unsigned long numa_threshold;
 #endif
/* Fields commonly accessed by the page reclaim scanner */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7835485e4b8a..110e3c847a29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1472,6 +1472,35 @@ static bool numa_migration_check_rate_limit(struct 
pglist_data *pgdat,
return true;
 }
 
+#define NUMA_MIGRATION_ADJUST_STEPS16
+
+static void numa_migration_adjust_threshold(struct pglist_data *pgdat,
+   unsigned long rate_limit,
+   unsigned long ref_th)
+{
+   unsigned long now = jiffies, last_th_ts, th_period;
+   unsigned long unit_th, th;
+   unsigned long nr_cand, ref_cand, diff_cand;
+
+   th_period = msecs_to_jiffies(sysctl_numa_balancing_scan_period_max);
+   last_th_ts = pgdat->numa_threshold_ts;
+   if (now > last_th_ts + th_period &&
+   cmpxchg(&pgdat->numa_threshold_ts, last_th_ts, now) == last_th_ts) {
+   ref_cand = rate_limit *
+   sysctl_numa_balancing_scan_period_max / 1000;
+   nr_cand = node_page_state(pgdat, NUMA_NR_CANDIDATE);
+   diff_cand = nr_cand - pgdat->numa_threshold_nr_candidate;
+   unit_th = ref_th / NUMA_MIGRATION_ADJUST_STEPS;
+   th = pgdat->numa_threshold ? : ref_th;
+   if (diff_cand > ref_cand * 11 / 10)
+   th = max(th - unit_th, unit_th);
+   else if (diff_cand < ref_cand * 9 / 10)
+   th = min(th + unit_th, ref_th);
+   pgdat->numa_threshold_nr_candidate = nr_cand;
+   pgdat->numa_threshold = th;
+   }
+}
+
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
int src_nid, int dst_cpu)
 {
@@ -1486,19 +1515,22 @@ bool should_numa_migrate_memory(struct task_struct *p, 
struct page * page,
if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
!node_is_toptier(src_nid)) {
struct pglist_data *pgdat;
-   unsigned long rate_limit, latency, th;
+   unsigned long rate_limit, latency, th, def_th;
 
pgdat = NODE_DATA(dst_nid);
if (pgdat_free_space_enough(pgdat))
return true;
 
-   th = sysctl_numa_balancing_hot_threshold;
+   d

[RFC -V3 3/5] autonuma, memory tiering: Hot page selection with hint page fault latency

2020-08-24 Thread Huang Ying
To optimize page placement in a memory tiering system with AutoNUMA,
the hot pages in the slow memory node need to be identified.
Essentially, the original AutoNUMA implementation selects the most
recently accessed (MRU) pages as the hot pages.  But this isn't a very
good algorithm to identify the hot pages.

So, in this patch we implemented a better hot page selection
algorithm, which is based on AutoNUMA page table scanning and hint
page faults as follows,

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, a hint page fault will occur.  The scan
  time is read from the struct page, and the hint page fault latency
  is defined as

hint page fault time - scan time

The shorter the hint page fault latency of a page is, the more likely
the page is accessed frequently.  So the hint page fault latency is a
good estimate of how hot or cold a page is.
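
A minimal sketch of this estimate and the threshold comparison
described further below, assuming times in an arbitrary monotonic unit
(the helper names are illustrative only):

#include <stdbool.h>

static inline unsigned long hint_fault_latency(unsigned long scan_time,
					       unsigned long fault_time)
{
	return fault_time - scan_time;
}

/* A short latency means the page was touched soon after it was made
 * PROT_NONE, i.e. it is likely to be accessed frequently. */
static inline bool page_is_hot(unsigned long scan_time,
			       unsigned long fault_time,
			       unsigned long hot_threshold)
{
	return hint_fault_latency(scan_time, fault_time) < hot_threshold;
}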

But it's hard to find some extra space in struct page to hold the scan
time.  Fortunately, we can reuse some bits used by the original
AutoNUMA.

AutoNUMA uses some bits in struct page to store the accessing CPU and
PID (see page_cpupid_xchg_last()).  These are used by the multi-stage
node selection algorithm to avoid migrating pages that are accessed by
multiple NUMA nodes back and forth.  But for pages in the slow memory
node, even if they are accessed by multiple NUMA nodes, as long as the
pages are hot they need to be promoted to the fast memory node.  So
the accessing CPU and PID information is unnecessary for the slow
memory pages, and we can reuse these bits in struct page to record
their scan time.  For the fast memory pages, these bits are used as
before.
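
A sketch of the bit reuse, with made-up bit widths (the kernel derives
them from LAST_CPUPID_SHIFT, see PAGE_ACCESS_TIME_BUCKETS in the diff
below): drop some low-order time bits so that the remaining bits still
cover a useful interval, and compute differences at that reduced
resolution.

#define SCAN_TIME_BITS		10	/* assumed bits available in struct page */
#define SCAN_TIME_BUCKETS	2	/* low-order bits dropped */
#define SCAN_TIME_MASK		((1UL << SCAN_TIME_BITS) - 1)

static inline unsigned long pack_scan_time(unsigned long now)
{
	return (now >> SCAN_TIME_BUCKETS) & SCAN_TIME_MASK;
}

static inline unsigned long scan_time_delta(unsigned long now,
					    unsigned long packed)
{
	/* Wrap-safe difference, still at bucket granularity. */
	return ((now >> SCAN_TIME_BUCKETS) - packed) & SCAN_TIME_MASK;
}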

The remaining problem is how to determine the hot threshold.  It's not
easy to do automatically.  So we provide a sysctl knob:
kernel.numa_balancing_hot_threshold_ms.  All pages with a hint page
fault latency below the threshold will be considered hot.  The system
administrator can determine the hot threshold from various
information, such as the PMEM bandwidth limit, the average number of
pages that pass the hot threshold, etc.  The default hot threshold is
1 second, which works well in our performance test.

The patch improves the score of the pmbench memory accessing benchmark
with an 80:20 read/write ratio and a normal access address distribution
by 16.8%, with 41.1% fewer pages promoted (that is, less overhead), on
a 2 socket Intel server with Optane DC Persistent Memory.

The downside of the patch is that the response time to a change of the
workload hot spot may be much longer.  For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this,

- If there is enough free space in the fast memory node, the hot
  threshold will not be used; all pages will be promoted upon the hint
  page fault for fast response.

- If fast response is more important for system performance, the
  administrator can set a higher hot threshold.

Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Dave Hansen 
Cc: Dan Williams 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/mm.h   | 29 
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/fair.c  | 67 
 kernel/sysctl.c  |  7 
 mm/huge_memory.c | 13 +--
 mm/memory.c  | 11 +-
 mm/migrate.c | 12 +++
 mm/mmzone.c  | 17 +
 mm/mprotect.c|  8 -
 9 files changed, 160 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..0eac5049c153 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1278,6 +1278,18 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
+/* page access time bits needs to hold at least 4 seconds */
+#define PAGE_ACCESS_TIME_MIN_BITS  12
+#if LAST_CPUPID_SHIFT < PAGE_ACCESS_TIME_MIN_BITS
+#define PAGE_ACCESS_TIME_BUCKETS	\
+	(PAGE_ACCESS_TIME_MIN_BITS - LAST_CPUPID_SHIFT)
+#else
+#define PAGE_ACCESS_TIME_BUCKETS	0
+#endif
+
+#define PAGE_ACCESS_TIME_MASK	\
+	(LAST_CPUPID_MASK << PAGE_ACCESS_TIME_BUCKETS)
+
 static inline int cpu_pid_to_cpupid(int cpu, int pid)
 {
return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & 
L

[RFC -V3 1/5] autonuma: Optimize page placement for memory tiering system

2020-08-24 Thread Huang Ying
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory).  The
memory subsystem of such machines can be called a memory tiering
system, because the performance of the different types of memory is
usually different.

In such a system, because the memory access pattern changes over time,
some pages in the slow memory may become globally hot.  So in this
patch, the AutoNUMA mechanism is enhanced to optimize the page
placement among the different memory types dynamically according to
hot/cold.

In a typical memory tiering system, there are CPUs, fast memory and
slow memory in each physical NUMA node.  The CPUs and the fast memory
will be put in one logical node (called fast memory node), while the
slow memory will be put in another (faked) logical node (called slow
memory node).  That is, the fast memory is regarded as local while the
slow memory is regarded as remote.  So it's possible for the recently
accessed pages in the slow memory node to be promoted to the fast
memory node via the existing AutoNUMA mechanism.

The original AutoNUMA mechanism will stop migrating pages if the free
memory of the target node would drop below the high watermark.  This
is a reasonable policy if there's only one memory type.  But it makes
the original AutoNUMA mechanism largely ineffective for optimizing
page placement among different memory types.  Details are as follows.

It's common that the working-set size of the workload is larger than
the size of the fast memory nodes.  Otherwise, it would be unnecessary
to use the slow memory at all.  So in the common case, there are
almost never enough free pages in the fast memory nodes, so the
globally hot pages in the slow memory node cannot be promoted to the
fast memory node.  To solve the issue, we have 2 choices as follows,

a. Ignore the free pages watermark checking when promoting hot pages
   from the slow memory node to the fast memory node.  This will
   create some memory pressure in the fast memory node, thus
   triggering memory reclaim, so that the cold pages in the fast
   memory node will be demoted to the slow memory node.

b. Make kswapd of the fast memory node reclaim pages until the free
   pages are a little more (about 10MB) than the high watermark.  Then,
   if the free pages of the fast memory node reach the high watermark
   and some hot pages need to be promoted, kswapd of the fast memory
   node will be woken up to demote some cold pages in the fast memory
   node to the slow memory node.  This will free some extra space in
   the fast memory node, so the hot pages in the slow memory node can
   be promoted to the fast memory node.

The choice "a" will create the memory pressure in the fast memory
node.  If the memory pressure of the workload is high, the memory
pressure may become so high that the memory allocation latency of the
workload is influenced, e.g. the direct reclaiming may be triggered.

The choice "b" works much better at this aspect.  If the memory
pressure of the workload is high, the hot pages promotion will stop
earlier because its allocation watermark is higher than that of the
normal memory allocation.  So in this patch, choice "b" is
implemented.
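
A rough sketch of the choice "b" policy, with illustrative names and
an assumed 4KB page size:

#include <stdbool.h>

/* About 10MB of extra headroom above the high watermark, in 4K pages. */
#define PROMOTE_HEADROOM_PAGES	(10UL * 1024 * 1024 / 4096)

/* Promotion proceeds only while free memory stays above the high
 * watermark of the fast memory node. */
static bool promotion_allowed(unsigned long free_pages,
			      unsigned long high_wmark)
{
	return free_pages > high_wmark;
}

/* Target for kswapd on the fast node: rebuild a small margin above the
 * high watermark by demoting cold pages to the slow node. */
static unsigned long kswapd_demotion_target(unsigned long high_wmark)
{
	return high_wmark + PROMOTE_HEADROOM_PAGES;
}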

In addition to the original page placement optimization among sockets,
the AutoNUMA mechanism is extended to optimize the page placement
according to hot/cold among different memory types.  So the sysctl
user space interface (numa_balancing) is extended in a backward
compatible way as follows, so that users can enable/disable these
functionalities individually.

The sysctl is converted from a Boolean value to a bit field.  The
flags are defined as follows (see the sketch after the list),

- 0x0: NUMA_BALANCING_DISABLED
- 0x1: NUMA_BALANCING_NORMAL
- 0x2: NUMA_BALANCING_MEMORY_TIERING
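
Because the value is now a bit field, both behaviors can be enabled at
once; for example, writing 3 to the sysctl sets both flags.  A minimal
sketch of how the mode bits can be checked (the variable below stands
in for the kernel's sysctl_numa_balancing_mode):

#include <stdbool.h>

#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2

static int numa_balancing_mode;

static bool cross_socket_balancing_enabled(void)
{
	return numa_balancing_mode & NUMA_BALANCING_NORMAL;
}

static bool tiering_promotion_enabled(void)
{
	return numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING;
}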

TODO:

- Update ABI document: Documentation/sysctl/kernel.txt

Signed-off-by: "Huang, Ying" 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Dave Hansen 
Cc: Dan Williams 
Cc: linux-kernel@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/sched/sysctl.h |  5 +
 kernel/sched/core.c  |  9 +++--
 kernel/sysctl.c  |  7 ---
 mm/migrate.c | 30 +++---
 mm/vmscan.c  | 15 +++
 5 files changed, 54 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 660ac49f2b53..bdd38045d14c 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -39,6 +39,11 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 
+#define NUMA_BALANCING_DISABLED0x0
+#define NUMA_BALANCING_NORMAL  0x1
+#define NUMA_BALANCING_MEMORY_TIERING  0x2
+
+extern int sysctl_numa_balancing_mode;
 extern unsigne

Re: [PATCH v2] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-20 Thread Huang, Ying
Dave Chinner  writes:

> On Fri, Aug 21, 2020 at 08:21:45AM +0800, Gao Xiang wrote:
>> Hi Dave,
>> 
>> On Fri, Aug 21, 2020 at 09:34:46AM +1000, Dave Chinner wrote:
>> > On Thu, Aug 20, 2020 at 12:53:23PM +0800, Gao Xiang wrote:
>> > > SWP_FS is used to make swap_{read,write}page() go through
>> > > the filesystem, and it's only used for swap files over
>> > > NFS. So, !SWP_FS means non NFS for now, it could be either
>> > > file backed or device backed. Something similar goes with
>> > > legacy SWP_FILE.
>> > > 
>> > > So in order to achieve the goal of the original patch,
>> > > SWP_BLKDEV should be used instead.
>> > > 
>> > > FS corruption can be observed with SSD device + XFS +
>> > > fragmented swapfile due to CONFIG_THP_SWAP=y.
>> > > 
>> > > I reproduced the issue with the following details:
>> > > 
>> > > Environment:
>> > > QEMU + upstream kernel + buildroot + NVMe (2 GB)
>> > > 
>> > > Kernel config:
>> > > CONFIG_BLK_DEV_NVME=y
>> > > CONFIG_THP_SWAP=y
>> > 
>> > Ok, so at it's core this is a swap file extent versus THP swap
>> > cluster alignment issue?
>> 
>> I think yes.
>> 
>> > 
>> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
>> > > index 6c26916e95fd..2937daf3ca02 100644
>> > > --- a/mm/swapfile.c
>> > > +++ b/mm/swapfile.c
>> > > @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t 
>> > > swp_entries[], int entry_size)
>> > >  goto nextsi;
>> > >  }
>> > >  if (size == SWAPFILE_CLUSTER) {
>> > > -if (!(si->flags & SWP_FS))
>> > > +if (si->flags & SWP_BLKDEV)
>> > >  n_ret = swap_alloc_cluster(si, 
>> > > swp_entries);
>> > >  } else
>> > >  n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>> > 
>> > IOWs, if you don't make this change, does the corruption problem go
>> > away if you align swap extents in iomap_swapfile_add_extent() to
>> > (SWAPFILE_CLUSTER * PAGE_SIZE) instead of just PAGE_SIZE?
>> > 
>> > I.e. if the swapfile extents are aligned correctly to huge page swap
>> > cluster size and alignment, does the swap clustering optimisations
>> > for swapping THP pages work correctly? And, if so, is there any
>> > performance benefit we get from enabling proper THP swap clustering
>> > on swapfiles?
>> > 
>> 
>> Yeah, I once think about some similiar thing as well. My thought for now is
>> 
>>  - First, SWAP THP doesn't claim to support such swapfile for now.
>>And the original author tried to explicitly avoid the whole thing in
>> 
>>f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed 
>> swap device")
>> 
>>So such thing would be considered as some new feature and need
>>more testing at least. But for now I think we just need a quick
>>fix to fix the commit f0eea189e8e9 to avoid regression and for
>>backport use.
>
> Sure, a quick fix is fine for the current issue. I'm asking
> questions about the design/architecture of how THP_SWAP is supposed
> to work and whether swapfiles are violating some other undocumented
> assumption about swapping THP files...

The main requirement for THP_SWAP is that the swap cluster needs to be
mapped to contiguous block device space.

So yes, in theory it's possible to support THP_SWAP for swapfiles.
But I don't know whether people need it or not.
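
For reference, a small sketch of the inward rounding discussed below,
assuming a 4KB page size and a 512-page swap cluster (2MB THP on
x86-64); the helper name is made up:

#include <stdint.h>

#define PAGE_SHIFT	12
#define CLUSTER_PAGES	512
#define CLUSTER_BYTES	((uint64_t)CLUSTER_PAGES << PAGE_SHIFT)

/* Trim a swapfile extent [start, start + len) inward so that both ends
 * are aligned to the THP swap cluster size.  Returns the usable length
 * (0 if no cluster-aligned region remains) and the aligned start. */
static uint64_t trim_extent_to_cluster(uint64_t start, uint64_t len,
				       uint64_t *aligned_start)
{
	uint64_t first = (start + CLUSTER_BYTES - 1) & ~(CLUSTER_BYTES - 1);
	uint64_t last = (start + len) & ~(CLUSTER_BYTES - 1);

	*aligned_start = first;
	return last > first ? last - first : 0;
}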

Best Regards,
Huang, Ying

>>  - It is hard for users to control swapfile in
>>SWAPFILE_CLUSTER * PAGE_SIZE extents, especially users'
>>disk are fragmented or have some on-disk metadata limitation or
>>something. It's very hard for users to utilize this and arrange
>>their swapfile physical addr alignment and fragments for now.
>
> This isn't something users control. The swapfile extent mapping code
> rounds the swap extents inwards so that the parts of the on-disk
> extents that are not aligned or cannot hold a full page are
> omitted from the ranges of the file that can be swapped to.
>
> i.e. a file that extents aligned to 4kB is fine for a 4KB page size
> machine, but needs additional alignment to allow swapping to work on
> a 64kB page size machine. Hence the swap code rounds the file
> extents inwards to PAGE_SIZE to align them correctly. We really
> should be doing this for THP page size rather than PAGE_SIZE if
> THP_SWAP is enabled, regardless of whether swap clustering is
> enabled or not...
>
> Cheers,
>
> Dave.


Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim

2020-08-20 Thread Huang, Ying
Yang Shi  writes:

> On Thu, Aug 20, 2020 at 8:22 AM Dave Hansen  wrote:
>>
>> On 8/20/20 1:06 AM, Huang, Ying wrote:
>> >> +/* Migrate pages selected for demotion */
>> >> +nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, 
>> >> sc);
>> >> +
>> >>  pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
>> >>
>> >>  mem_cgroup_uncharge_list(&free_pages);
>> >> _
>> > Generally, it's good to batch the page migration.  But one side effect
>> > is that, if the pages are failed to be migrated, they will be placed
>> > back to the LRU list instead of falling back to be reclaimed really.
>> > This may cause some issue in some situation.  For example, if there's no
>> > enough space in the PMEM (slow) node, so the page migration fails, OOM
>> > may be triggered, because the direct reclaiming on the DRAM (fast) node
>> > may make no progress, while it can reclaim some pages really before.
>>
>> Yes, agreed.
>
> Kind of. But I think that should be transient and very rare. The
> kswapd on pmem nodes will be waken up to drop pages when we try to
> allocate migration target pages. It should be very rare that there is
> not reclaimable page on pmem nodes.
>
>>
>> There are a couple of ways we could fix this.  Instead of splicing
>> 'demote_pages' back into 'ret_pages', we could try to get them back on
>> 'page_list' and goto the beginning on shrink_page_list().  This will
>> probably yield the best behavior, but might be a bit ugly.
>>
>> We could also add a field to 'struct scan_control' and just stop trying
>> to migrate after it has failed one or more times.  The trick will be
>> picking a threshold that doesn't mess with either the normal reclaim
>> rate or the migration rate.
>
> In my patchset I implemented a fallback mechanism via adding a new
> PGDAT_CONTENDED node flag. Please check this out:
> https://patchwork.kernel.org/patch/10993839/.
>
> Basically the PGDAT_CONTENDED flag will be set once migrate_pages()
> return -ENOMEM which indicates the target pmem node is under memory
> pressure, then it would fallback to regular reclaim path. The flag
> would be cleared by clear_pgdat_congested() once the pmem node memory
> pressure is gone.

There may be some races between the flag set and clear.  For example,

- try to migrate some pages from the DRAM node to the PMEM node

- not enough free pages on the PMEM node, so wake up kswapd

- kswapd on the PMEM node reclaims some pages and tries to clear
  PGDAT_CONTENDED on the DRAM node

- set PGDAT_CONTENDED on the DRAM node
 
This may be resolvable.  But I still prefer to fall back to real page
reclaim directly for the pages that failed to be migrated.  That looks
more robust.

Best Regards,
Huang, Ying

> We already use node flags to indicate the state of node in reclaim
> code, i.e. PGDAT_WRITEBACK, PGDAT_DIRTY, etc. So, adding a new flag
> sounds more straightforward to me IMHO.
>
>>
>> This is on my list to fix up next.
>>


Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim

2020-08-20 Thread Huang, Ying
>   }
>  
> + /* Migrate pages selected for demotion */
> + nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, sc);
> +
>   pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
>  
>   mem_cgroup_uncharge_list(&free_pages);
> _

Generally, it's good to batch the page migration.  But one side effect
is that, if the pages fail to be migrated, they will be placed back on
the LRU list instead of actually falling back to reclaim.  This may
cause problems in some situations.  For example, if there's not enough
space in the PMEM (slow) node, the page migration fails, and OOM may
be triggered, because the direct reclaim on the DRAM (fast) node may
make no progress, while previously it could actually reclaim some
pages.

Best Regards,
Huang, Ying


Re: [PATCH v2] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Huang, Ying
Gao Xiang  writes:

> SWP_FS is used to make swap_{read,write}page() go through
> the filesystem, and it's only used for swap files over
> NFS. So, !SWP_FS means non NFS for now, it could be either
> file backed or device backed. Something similar goes with
> legacy SWP_FILE.
>
> So in order to achieve the goal of the original patch,
> SWP_BLKDEV should be used instead.
>
> FS corruption can be observed with SSD device + XFS +
> fragmented swapfile due to CONFIG_THP_SWAP=y.
>
> I reproduced the issue with the following details:
>
> Environment:
> QEMU + upstream kernel + buildroot + NVMe (2 GB)
>
> Kernel config:
> CONFIG_BLK_DEV_NVME=y
> CONFIG_THP_SWAP=y
>
> Some reproducable steps:
> mkfs.xfs -f /dev/nvme0n1
> mkdir /tmp/mnt
> mount /dev/nvme0n1 /tmp/mnt
> bs="32k"
> sz="1024m"# doesn't matter too much, I also tried 16m
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
>
> mkswap /tmp/mnt/sw
> swapon /tmp/mnt/sw
>
> stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
>
> Symptoms:
>  - FS corruption (e.g. checksum failure)
>  - memory corruption at: 0xd2808010
>  - segfault
>
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file 
> backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Cc: "Huang, Ying" 
> Cc: Yang Shi 
> Cc: Rafael Aquini 
> Cc: Dave Chinner 
> Cc: stable 
> Signed-off-by: Gao Xiang 

Thanks!

Reviewed-by: "Huang, Ying" 

Best Regards,
Huang, Ying


Re: [PATCH] mm, THP, swap: fix allocating cluster for swapfile by mistake

2020-08-19 Thread Huang, Ying
Gao Xiang  writes:

> SWP_FS doesn't mean the device is file-backed swap device,
> which just means each writeback request should go through fs
> by DIO. Or it'll just use extents added by .swap_activate(),
> but it also works as file-backed swap device.
>
> So in order to achieve the goal of the original patch,
> SWP_BLKDEV should be used instead.
>
> FS corruption can be observed with SSD device + XFS +
> fragmented swapfile due to CONFIG_THP_SWAP=y.
>
> Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file 
> backed swap device")
> Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
> Cc: "Huang, Ying" 
> Cc: stable 
> Signed-off-by: Gao Xiang 

Good catch!  The fix itself looks good to me, although the description
is a little confusing.

After some digging, it seems that SWP_FS is set on the swap devices
which make swap entry read/write go through the file system specific
callback (now used by swap over NFS only).

Best Regards,
Huang, Ying

> ---
>
> I reproduced the issue with the following details:
>
> Environment:
> QEMU + upstream kernel + buildroot + NVMe (2 GB)
>
> Kernel config:
> CONFIG_BLK_DEV_NVME=y
> CONFIG_THP_SWAP=y
>
> Some reproducable steps:
> mkfs.xfs -f /dev/nvme0n1
> mkdir /tmp/mnt
> mount /dev/nvme0n1 /tmp/mnt
> bs="32k"
> sz="1024m"# doesn't matter too much, I also tried 16m
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
> xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
>
> mkswap /tmp/mnt/sw
> swapon /tmp/mnt/sw
>
> stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
>
> Symptoms:
>  - FS corruption (e.g. checksum failure)
>  - memory corruption at: 0xd2808010
>  - segfault
>  ... 
>
>  mm/swapfile.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6c26916e95fd..2937daf3ca02 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1074,7 +1074,7 @@ int get_swap_pages(int n_goal, swp_entry_t 
> swp_entries[], int entry_size)
>   goto nextsi;
>   }
>   if (size == SWAPFILE_CLUSTER) {
> - if (!(si->flags & SWP_FS))
> + if (si->flags & SWP_BLKDEV)
>   n_ret = swap_alloc_cluster(si, swp_entries);
>   } else
>   n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,


Re: linux-next: not-present page at swap_vma_readahead()

2020-07-19 Thread Huang, Ying
Qian Cai  writes:

> On Mon, Jul 20, 2020 at 03:32:59AM +0000, Huang, Ying wrote:
>> Thanks!  Can you try the dbg patch attached?  That will print more debugging 
>> information when abnormal PTE pointer is detected.
>
> Here with both of your patches applied,
>
> [  183.627876][ T3959] ra_info: 8, 3, 4, aabe3209
> [  183.633160][ T3959] i: 0, pte: aabe3209, faddr: 0
> [  183.638574][ T3959] ra_info: 8, 3, 4, aabe3209
> [  183.643787][ T3959] i: 1, pte: 06e61f24, faddr: 0
> [  183.649189][ T3959] ra_info: 8, 3, 4, aabe3209
> [  183.654371][ T3959] i: 2, pte: ce16a68e, faddr: 0
> [  183.851372][ T3839] ra_info: 8, 3, 4, 85efad17
> [  183.856550][ T3839] i: 0, pte: 85efad17, faddr: 0
> [  183.862503][ T3839] 
> ==
> [  183.870563][ T3839] BUG: KASAN: slab-out-of-bounds in 
> swapin_readahead+0x840/0xd60
> [  183.878147][ T3839] Read of size 8 at addr 008919f1ffe8 by task 
> trinity-c128/3839
> [  183.886001][ T3839] CPU: 9 PID: 3839 Comm: trinity-c128 Not tainted 
> 5.8.0-rc5-next-20200717+ #2
> [  183.894710][ T3839] Hardware name: HPE Apollo 70 
> /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
> [  183.905157][ T3839] Call trace:
> [  183.908314][ T3839]  dump_backtrace+0x0/0x398
> [  183.912680][ T3839]  show_stack+0x14/0x20
> [  183.916704][ T3839]  dump_stack+0x140/0x1c8
> [  183.920910][ T3839]  print_address_description.constprop.10+0x54/0x550
> [  183.927454][ T3839]  kasan_report+0x134/0x1b8
> [  183.931833][ T3839]  __asan_report_load8_noabort+0x2c/0x50
> [  183.937334][ T3839]  swapin_readahead+0x840/0xd60
> [  183.942049][ T3839]  do_swap_page+0xb1c/0x1a78
> [  183.946508][ T3839]  handle_mm_fault+0xfd0/0x2c50
> [  183.948789][ T3754] ra_info: 8, 3, 4, d0b6ebd5
> [  183.951229][ T3839]  do_page_fault+0x230/0x818
> [  183.956402][ T3754] i: 0, pte: d0b6ebd5, faddr: 0
> [  183.960896][ T3839]  do_translation_fault+0x90/0xb0
> [  183.966330][ T3754] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
> [  183.971172][ T3839]  do_mem_abort+0x64/0x180
> [  183.971192][ T3839]  el0_sync_handler+0x2a0/0x410
> [  183.971207][ T3839]  el0_sync+0x140/0x180
> [  183.977984][ T3754] ra_info: 8, 3, 4, d0b6ebd5
> [  183.977997][ T3754] i: 1, pte: 530a7b17, faddr: 0
> [  183.982278][ T3839] Allocated by task 3699:
> [  183.982296][ T3839]  kasan_save_stack+0x24/0x50
> [  183.982310][ T3839]  __kasan_kmalloc.isra.10+0xc4/0xe0
> [  183.987003][ T3754] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
> [  183.987019][ T3754] ra_info: 8, 3, 4, d0b6ebd5
> [  183.991033][ T3839]  kasan_slab_alloc+0x14/0x20
> [  183.991048][ T3839]  slab_post_alloc_hook+0x58/0x5d0
> [  183.991064][ T3839]  kmem_cache_alloc+0x19c/0x448
> [  183.996185][ T3754] i: 2, pte: 031f0751, faddr: 0
> [  183.996200][ T3754] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
> [  184.001617][ T3839]  create_object+0x58/0x960
> [  184.001639][ T3839]  kmemleak_alloc+0x2c/0x38
> [  184.001657][ T3839]  slab_post_alloc_hook+0x78/0x5d0
> [  184.025674][ T3830] ra_info: 8, 3, 4, d77f2b57
> [  184.027442][ T3839]  kmem_cache_alloc+0x19c/0x448
> [  184.032002][ T3830] i: 0, pte: 0026 (3737) used g  184.047047][ T193][ 
> T3839]  co T3932] i: 0, pt59417][ T3932] i: 1, pte: e38ee039, faddr: 0
> [  184.059424][ T3932] ra_info: 8, 3, 4, 4ae69ce9
> [  184.059431][ T3932] i: 2, pte: 35544c25, faddr: 0
> [  184.062563][ T3830] ra_info: 8, 3, 4, d77f2b57
> [  184.067511][ T3839]  _do_fork+0x128/0x11f8
> [  184.072663][ T3830] i: 2, pte: 2f241b20, faddr: 0
> [  184.077369][ T3839]  __do_sys_clone+0xac/0xd8
> [  184.110993][ T3997] ra_info: 8, 3, 4, d40684b7
> [  184.113421][ T3839]  __arm64_sys_clone+0xa0/0xf8

This appears to run on ARM64.  Can you help to try this on x86?  I'm
not familiar with ARM.

Best Regards,
Huang, Ying

> [  184.116524][ T3832] ra_info: 8, 3, 4, b572965a
> [  184.116534][ T3832] i: 0, pte: b572965a, faddr: 0
> [  184.116541][ T3832] ra_info: 8, 3, 4, b572965a
> [  184.116549][ T3832] i: 1, pte: 7c91cc64, faddr: 0
> [  184.116556][ T3832] ra_info: 8, 3, 4, b572965a
> [  184.116563][ T3832] i: 2, pte: 24f944e4, faddr: 0
> [  184.118541][ T3997] i: 0, pte: d40684b7, faddr: 0
> [  184.118552][ T3997] ra_info: 8, 3, 4, d40684b7
> [  184.123956][ T3839]  do_el0_svc+0x124/0x228
> [  184.123970][ T3839]  el0_sync_handler+0x260/0x410
> [  184.123988][ T3839]  el0_sync+0x140/0x180
> [  184.129119][ T3997] i: 1, pte: 35d81ad0, faddr: 0
> [  184.134523][ T3839] T

RE: linux-next: not-present page at swap_vma_readahead()

2020-07-19 Thread Huang, Ying
Thanks!  Can you try the dbg patch attached?  That will print more debugging 
information when abnormal PTE pointer is detected.

Best Regards,
Huang, Ying

From: Qian Cai [c...@lca.pw]
Sent: Monday, July 20, 2020 10:12 AM
To: Huang, Ying
Cc: Linux-MM; LKML; Minchan Kim; Hugh Dickins; Andrew Morton
Subject: Re: linux-next: not-present page at swap_vma_readahead()

On Mon, Jul 20, 2020 at 12:37:30AM +, Huang, Ying wrote:
> Hi,
>
> Sorry for late reply.  I found a problem in the swap readahead code.  Can you 
> help to check whether it can fix this?

Unfortunately, I can still reproduce it easily after applied the patch.

# git clone https://gitlab.com/cailca/linux-mm
# git checkout v5.8-rc1 -- *.sh
# dnf -y install tar wget golang libseccomp-devel jq
# ./runc.sh

[  575.517290][T28667] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.522901][T28650] BUG: KASAN: slab-out-of-bounds in 
swapin_readahead+0x780/0xbd8
swap_vma_readahead at mm/swap_state.c:758
(inlined by) swapin_readahead at mm/swap_state.c:802
[  575.522928][T28650] Read of size 8 at addr 0089a603ffe8 by task 
trinity-c92/28650
[  575.522947][T28650] CPU: 126 PID: 28650 Comm: trinity-c92 Not tainted 
5.8.0-rc5-next-20200717+ #1
[  575.522958][T28650] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.11 06/18/2019
[  575.522966][T28650] Call trace:
[  575.529895][T28667] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.535819][T28590] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.535829][T28590] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.535836][T28590] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.537424][T28650]  dump_backtrace+0x0/0x398
[  575.537438][T28650]  show_stack+0x14/0x20
[  575.545308][T28667] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.554134][T28650]  dump_stack+0x140/0x1c8
[  575.554148][T28650]  print_address_description.constprop.10+0x54/0x550
[  575.554159][T28650]  kasan_report+0x134/0x1b8
[  575.554173][T28650]  __asan_report_load8_noabort+0x2c/0x50
[  575.559496][T28588] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.559506][T28588] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.559513][T28588] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.562203][T28586] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.562215][T28586] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.562223][T28586] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.665163][T28560] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.671260][T28650]  swapin_readahead+0x780/0xbd8
[  575.671280][T28650]  do_swap_page+0xb1c/0x1a78
do_swap_page at mm/memory.c:3166
[  575.678067][T28560] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.682774][T28650]  handle_mm_fault+0xfd0/0x2c50
handle_pte_fault at mm/memory.c:4234
(inlined by) __handle_mm_fault at mm/memory.c:4368
(inlined by) handle_mm_fault at mm/memory.c:4466
[  575.682789][T28650]  do_page_fault+0x230/0x818
[  575.682804][T28650]  do_translation_fault+0x90/0xb0
[  575.682819][T28650]  do_mem_abort+0x64/0x180
[  575.687259][T28560] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.694051][T28650]  el1_sync_handler+0x188/0x1b8
[  575.694064][T28650]  el1_sync+0x7c/0x100
[  575.694079][T28650]  strncpy_from_user+0x270/0x3e8
[  575.694100][T28650]  getname_flags+0x80/0x330
[  575.698001][T28827] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.698048][T28827] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.698056][T28827] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.755679][T28620] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.757304][T28650]  user_path_at_empty+0x2c/0x60
[  575.764131][T28620] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.768782][T28650]  do_linkat+0x10c/0x528
[  575.768792][T28650]  __arm64_sys_linkat+0xa0/0xf8
[  575.768802][T28650]  do_el0_svc+0x124/0x228
[  575.768812][T28650]  el0_sync_handler+0x260/0x410
[  575.768820][T28650]  el0_sytack+0x24/0x50+0x14/0x20
[  5ap file entry 58_object+0x58/0x968c/0x1880
[  575.779790][T28650]  __alloc_percpu_gfp+0x14/0x20
[  575.779799][T28650]  qdisc_alloc+0x2bc/0xb98
[  575.779809][T28650]  qdisc_create_dflt+0x60/0x748
[  575.803406][T28643] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.806107][T28650]  mq_init+0x1a0/0x3b8
[  575.806120][T28650]  qdisc_create_dflt+0xc8/0x748
[  575.811321][T28643] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.815788][T28650]  dev_activate+0x488/0x8b8
[  575.815806][T28650]  __dev_open+0x240/0x360
[  575.820848][T28643] get_swap_device: Bad swap file entry 58025a5a5a5a5a5a
[  575.827542][T28650]  __dev_change_flags+0x344/0x480
[  575.827553][T28650]  dev_change_flags+0x74/0x140
[  575.906574][T28650]  do_setlink+0x7c8/0x2760
[

RE: linux-next: not-present page at swap_vma_readahead()

2020-07-19 Thread Huang, Ying
Hi,

Sorry for late reply.  I found a problem in the swap readahead code.  Can you 
help to check whether it can fix this?

Best Regards,
Huang, Ying

From: Qian Cai [c...@lca.pw]
Sent: Tuesday, June 16, 2020 9:13 AM
To: Huang, Ying
Cc: Linux-MM; LKML; Minchan Kim; Hugh Dickins; Andrew Morton
Subject: Re: linux-next: not-present page at swap_vma_readahead()

On Wed, Apr 15, 2020 at 10:01:53AM +0800, Huang, Ying wrote:
> Qian Cai  writes:
>
> >> On Apr 14, 2020, at 10:32 AM, Qian Cai  wrote:
> >>
> >> Fuzzers are unhappy. Thoughts?
> >
> > This is rather to reproduce. All the traces so far are from 
> > copy_from_user() to trigger a page fault,
> > and then it dereferences a bad pte in swap_vma_readahead(),
> >
> > for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
> >  i++, pte++) {
> > pentry = *pte;   <— crashed here.
> > if (pte_none(pentry))
>
> Is it possible to bisect this?
>
> Because the crash point is identified, it may be helpful to collect and
> analyze the status of the faulting page table and readahead ptes.  But I
> am not familiar with the ARM64 architecture.  So I cannot help much
> here.

Ying, looks like the bug is still there today which manifests itself
into a different form. Looking at the logs, I believe it was involved
with swapoff(). Any other thought? I still have not found time to bisect
this yet.

[  785.477183][ T8727] BUG: KASAN: slab-out-of-bounds in 
swapin_readahead+0x7b8/0xbc0
swap_vma_readahead at mm/swap_state.c:759
(inlined by) swapin_readahead at mm/swap_state.c:803
[  785.484752][ T8727] Read of size 8 at addr 00886ecaffe8 by task 
trinity-c35/8727
[  785.492488][ T8727]
[  785.494675][ T8727] CPU: 35 PID: 8727 Comm: trinity-c35 Not tainted 
5.7.0-next-20200610 #3
[  785.502942][ T8727] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.11 06/18/2019
[  785.513387][ T8727] Call trace:
[  785.516538][ T8727]  dump_backtrace+0x0/0x398
[  785.520891][ T8727]  show_stack+0x14/0x20
[  785.524900][ T8727]  dump_stack+0x140/0x1b8
[  785.529087][ T8727]  print_address_description.isra.12+0x54/0x4a8
[  785.535185][ T8727]  kasan_report+0x134/0x1b8
[  785.539545][ T8727]  __asan_report_load8_noabort+0x2c/0x50
[  785.545036][ T8727]  swapin_readahead+0x7b8/0xbc0
[  785.549745][ T8727]  do_swap_page+0xb1c/0x19a0
[  785.554195][ T8727]  handle_mm_fault+0xf10/0x2b30
[  785.558905][ T8727]  do_page_fault+0x230/0x908
[  785.563354][ T8727]  do_translation_fault+0xe0/0x108
[  785.568323][ T8727]  do_mem_abort+0x64/0x180
[  785.572597][ T8727]  el1_sync_handler+0x188/0x1b8
[  785.577305][ T8727]  el1_sync+0x7c/0x100
[  785.581232][ T8727]  __arch_copy_to_user+0xc4/0x158
[  785.586115][ T8727]  __arm64_sys_sysinfo+0x2c/0xd0
[  785.590912][ T8727]  do_el0_svc+0x124/0x220
[  785.595100][ T8727]  el0_sync_handler+0x260/0x408
[  785.599807][ T8727]  el0_sync+0x140/0x180
[  785.603818][ T8727]
[  785.606007][ T8727] Allocated by task 8673:
[  785.610193][ T8727]  save_stack+0x24/0x50
[  785.614208][ T8727]  __kasan_kmalloc.isra.13+0xc4/0xe0
[  785.619350][ T8727]  kasan_slab_alloc+0x14/0x20
[  785.623885][ T8727]  slab_post_alloc_hook+0x50/0xa8
[  785.628769][ T8727]  kmem_cache_alloc+0x18c/0x438
[  785.633479][ T8727]  create_object+0x58/0x960
[  785.637844][ T8727]  kmemleak_alloc+0x2c/0x38
[  785.642205][ T8727]  slab_post_alloc_hook+0x70/0xa8
[  785.647089][ T8727]  kmem_cache_alloc_trace+0x178/0x308
[  785.652322][ T8727]  refill_pi_state_cache.part.10+0x3c/0x1a8
[  785.658073][ T8727]  futex_lock_pi+0x404/0x5e0
[  785.662519][ T8727]  do_futex+0x790/0x1448
[  785.18][ T8727]  __arm64_sys_futex+0x204/0x588
[  785.671411][ T8727]  do_el0_svc+0x124/0x220
[  785.675603][ T8727]  el0_sync_handler+0x260/0x408
[  785.680312][ T8727]  el0_sync+0x140/0x180
[  785.684322][ T8727]
[  785.686510][ T8727] Freed by task 0:
[  785.690088][ T8727]  save_stack+0x24/0x50
[  785.694104][ T8727]  __kasan_slab_free+0x124/0x198
[  785.698899][ T8727]  kasan_slab_free+0x10/0x18
[  785.703340][ T8727]  slab_free_freelist_hook+0x110/0x298
[  785.708648][ T8727]  kmem_cache_free+0xc8/0x3e0
[  785.713175][ T8727]  free_object_rcu+0x1e0/0x3b8
[  785.717796][ T8727]  rcu_core+0x8bc/0xf40
[  785.721810][ T8727]  rcu_core_si+0xc/0x18
[  785.725825][ T8727]  efi_header_end+0x2d8/0x1204
[  785.730442][ T8727]
[  785.732625][ T8727] The buggy address belongs to the object at 
00886ecafd28
[  785.732625][ T8727]  which belongs to the cache kmemleak_object of size 368
[  785.746875][ T8727] The buggy address is located 336 bytes to the right of
[  785.746875][ T8727]  368-byte region [00886ecafd28, 00886ecafe98)
[  785.760519][ T8727] The buggy address belongs to the page:
[  785.766009][ T8727] page:ffe021fbb280 refcount:1 mapcount:0 
mapping: index:0x

Re: [mm] 4e2c82a409: ltp.overcommit_memory01.fail

2020-07-06 Thread Huang, Ying
Feng Tang  writes:

> On Mon, Jul 06, 2020 at 06:34:34AM -0700, Andi Kleen wrote:
>> >ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>> > -  if (ret == 0 && write)
>> > +  if (ret == 0 && write) {
>> > +  if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
>> > +  schedule_on_each_cpu(sync_overcommit_as);
>> 
>> The schedule_on_each_cpu is not atomic, so the problem could still happen
>> in that window.
>> 
>> I think it may be ok if it eventually resolves, but certainly needs
>> a comment explaining it. Can you do some stress testing toggling the
>> policy all the time on different CPUs and running the test on
>> other CPUs and see if the test fails?
>
> For the raw test case reported by 0day, this patch passed in 200 times
> run. And I will read the ltp code and try stress testing it as you
> suggested.
>
>
>> The other alternative would be to define some intermediate state
>> for the sysctl variable and only switch to never once the 
>> schedule_on_each_cpu
>> returned. But that's more complexity.
>
> One thought I had is to put this schedule_on_each_cpu() before
> the proc_dointvec_minmax() to do the sync before sysctl_overcommit_memory
> is really changed. But the window still exists, as the batch is
> still the larger one. 

Can we change the batch first, then sync the global counter, and
finally change the overcommit policy?
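
Something like the following ordering is what I mean; the helpers are
hypothetical stand-ins for the pieces of the sysctl handler, not real
mm functions:

/* Hypothetical helpers for illustration only. */
void shrink_vm_committed_batch(void);
void sync_vm_committed_as_on_each_cpu(void);
void publish_overcommit_policy(int policy);

#define OVERCOMMIT_NEVER	2

void switch_to_overcommit_never(void)
{
	/* 1. Make the per-CPU batch small so later updates hit the
	 *    global counter promptly. */
	shrink_vm_committed_batch();
	/* 2. Fold the existing per-CPU deltas into the global counter. */
	sync_vm_committed_as_on_each_cpu();
	/* 3. Only then make the strict policy visible to readers. */
	publish_overcommit_policy(OVERCOMMIT_NEVER);
}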

Best Regards,
Huang, Ying


Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

2020-07-03 Thread Huang, Ying
Dave Hansen  writes:
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE
> + * without needing to recalcuate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> +  unsigned long action, void 
> *arg)
> +{
> + switch (action) {
> + case MEM_GOING_OFFLINE:
> + /*
> +  * Make sure there are not transient states where
> +  * an offline node is a migration target.  This
> +  * will leave migration disabled until the offline
> +  * completes and the MEM_OFFLINE case below runs.
> +  */
> + disable_all_migrate_targets();
> + break;
> + case MEM_OFFLINE:
> + case MEM_ONLINE:
> + /*
> +  * Recalculate the target nodes once the node
> +  * reaches its final state (online or offline).
> +  */
> + set_migration_target_nodes();
> + break;
> + case MEM_CANCEL_OFFLINE:
> + /*
> +  * MEM_GOING_OFFLINE disabled all the migration
> +  * targets.  Reenable them.
> +  */
> + set_migration_target_nodes();
> + break;
> + case MEM_GOING_ONLINE:
> + case MEM_CANCEL_ONLINE:
> + break;

I think we need to call
disable_all_migrate_targets()/set_migration_target_nodes() for CPU
online/offline events too, because those events change node_state(nid,
N_CPU), which in turn influences the node demotion relationship.  (A
rough sketch of such a hook is below the quoted hunk.)

> + }
> +
> + return notifier_from_errno(0);
>  }
> +
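
A rough sketch of the CPU hotplug wiring (the function names here are
made up; cpuhp_setup_state()/CPUHP_AP_ONLINE_DYN are the existing
hotplug interfaces; the point is only that CPU online/offline should
trigger the same recomputation as memory hotplug):

static int migration_cpu_online(unsigned int cpu)
{
	/* A new CPU may turn a memory-only node into a node with CPUs. */
	set_migration_target_nodes();
	return 0;
}

static int migration_cpu_offline(unsigned int cpu)
{
	/* Losing the last CPU of a node changes node_state(nid, N_CPU). */
	set_migration_target_nodes();
	return 0;
}

static int __init migrate_on_reclaim_cpuhp_init(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/demotion:online",
				migration_cpu_online, migration_cpu_offline);
	return ret < 0 ? ret : 0;
}
late_initcall(migrate_on_reclaim_cpuhp_init);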

Best Regards,
Huang, Ying


Re: [PATCH 1/3] mm/vmscan: restore zone_reclaim_mode ABI

2020-07-02 Thread Huang, Ying
Dave Hansen  writes:

> From: Dave Hansen 
>
> I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
> sysctl.  Like a good kernel developer, I also went to go update the
> documentation.  I noticed that the bits in the documentation didn't
> match the bits in the #defines.
>
> The VM never explicitly checks the RECLAIM_ZONE bit.  The bit is,
> however implicitly checked when checking 'node_reclaim_mode==0'.
> The RECLAIM_ZONE #define was removed in a cleanup.  That, by itself
> is fine.
>
> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed.  That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel.  The end result is that
> if someone had a script that did:
>
>   sysctl vm.zone_reclaim_mode=1
>
> That script went from doing nothing

Per my understanding, this script would have enabled node reclaim for
clean unmapped pages before commit 648b5cf368e0 ("mm/vmscan: remove
unused RECLAIM_OFF/RECLAIM_ZONE").  So we should revise the description
here?

> to writing out pages during
> node reclaim after the commit in question.  That's not great.
>
> Put the bits back the way they were and add a comment so something
> like this is a bit harder to do again.  Update the documentation to
> make it clear that the first bit is ignored.
>

Best Regards,
Huang, Ying


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread Huang, Ying
David Rientjes  writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> > Could this cause us to break a user's mbind() or allow a user to 
>> > circumvent their cpuset.mems?
>> 
>> In its current form, yes.
>> 
>> My current rationale for this is that while it's not as deferential as
>> it can be to the user/kernel ABI contract, it's good *overall* behavior.
>>  The auto-migration only kicks in when the data is about to go away.  So
>> while the user's data might be slower than they like, it is *WAY* faster
>> than they deserve because it should be off on the disk.
>> 
>
> It's outside the scope of this patchset, but eventually there will be a 
> promotion path that I think requires a strict 1:1 relationship between 
> DRAM and PMEM nodes because otherwise mbind(), set_mempolicy(), and 
> cpuset.mems become ineffective for nodes facing memory pressure.

I have posted a patchset for AutoNUMA-based promotion support,

https://lore.kernel.org/lkml/20200218082634.1596727-1-ying.hu...@intel.com/

There, the page is promoted upon a NUMA hint page fault, so all memory
policies (mbind(), set_mempolicy(), and cpuset.mems) are available.  We
can refuse to promote the page to DRAM nodes that are not allowed by the
memory policy.  So a 1:1 relationship isn't necessary for promotion.
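
As an illustration of that check (the helper name mpol_allowed_nid() is
made up, not from the patchset above), the promotion path could simply
bail out when the target DRAM node is not allowed:

static bool promotion_target_allowed(struct vm_area_struct *vma, int target_nid)
{
	/* NUMA hint faults run in task context, so cpuset can be consulted. */
	if (!node_isset(target_nid, cpuset_current_mems_allowed))
		return false;

	/* Hypothetical: does the applicable mempolicy allow target_nid? */
	return mpol_allowed_nid(vma, current, target_nid);
}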

> For the purposes of this patchset, agreed that DRAM -> PMEM -> swap makes 
> perfect sense.  Theoretically, I think you could have DRAM N0 and N1 and 
> then a single PMEM N2 and this N2 can be the terminal node for both N0 and 
> N1.  On promotion, I think we need to rely on something stronger than 
> autonuma to decide which DRAM node to promote to: specifically any user 
> policy put into effect (memory tiering or autonuma shouldn't be allowed to 
> subvert these user policies).
>
> As others have mentioned, we lose the allocation or process context at the 
> time of demotion or promotion

As above, we have the process context at the time of promotion.

> and any workaround for that requires some 
> hacks, such as mapping the page to cpuset (what is the right solution for 
> shared pages?) or adding NUMA locality handling to memcg.

It sounds natural to me to add a NUMA node restriction to memcg.

Best Regards,
Huang, Ying


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread Huang, Ying
David Rientjes  writes:

> On Wed, 1 Jul 2020, Dave Hansen wrote:
>
>> Even if they don't allocate directly from PMEM, is it OK for such an app
>> to get its cold data migrated to PMEM?  That's a much more subtle
>> question and I suspect the kernel isn't going to have a single answer
>> for it.  I suspect we'll need a cpuset-level knob to turn auto-demotion
>> on or off.
>> 
>
> I think the answer is whether the app's cold data can be reclaimed, 
> otherwise migration to PMEM is likely better in terms of performance.  So 
> any such app today should just be mlocking its cold data if it can't 
> handle overhead from reclaim?

Yes.  That's a way to solve the problem.  A cpuset-level knob may be
more flexible, because you don't need to change the application source
code.

Best Regards,
Huang, Ying


Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order

2020-07-01 Thread Huang, Ying
Dave Hansen  writes:

> On 6/30/20 1:22 AM, Huang, Ying wrote:
>>> +   /*
>>> +* To avoid cycles in the migration "graph", ensure
>>> +* that migration sources are not future targets by
>>> +* setting them in 'used_targets'.
>>> +*
>>> +* But, do this only once per pass so that multiple
>>> +* source nodes can share a target node.
>> establish_migrate_target() calls find_next_best_node(), which will set
>> target_node in used_targets.  So it seems that the nodes_or() below is
>> only necessary to initialize used_targets, and multiple source nodes
>> cannot share one target node in current implementation.
>
> Yes, that is true.  My focus on this implementation was simplicity and
> sanity for common configurations.  I can certainly imagine scenarios
> where this is suboptimal.
>
> I'm totally open to other ways of doing this.

OK.  So when we really need to share one target node among multiple source
nodes, we can add a parameter to find_next_best_node() to specify whether
to set target_node in used_targets, as in the sketch below.
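
Something along these lines (a simplified stand-in for the real
find_next_best_node() selection logic, just to show the extra parameter):

static int find_next_best_node(int node, nodemask_t *used, bool mark_used)
{
	int n, best = NUMA_NO_NODE, best_dist = INT_MAX;

	for_each_node_state(n, N_MEMORY) {
		if (n == node || node_isset(n, *used))
			continue;
		if (node_distance(node, n) < best_dist) {
			best_dist = node_distance(node, n);
			best = n;
		}
	}

	/* Only consume the target when the caller asks for it. */
	if (mark_used && best != NUMA_NO_NODE)
		node_set(best, *used);

	return best;
}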

Best Regards,
Huang, Ying


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-07-01 Thread Huang, Ying
David Rientjes  writes:

> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>> > > From: Dave Hansen 
>> > > 
>> > > If a memory node has a preferred migration path to demote cold pages,
>> > > attempt to move those inactive pages to that migration node before
>> > > reclaiming. This will better utilize available memory, provide a faster
>> > > tier than swapping or discarding, and allow such pages to be reused
>> > > immediately without IO to retrieve the data.
>> > > 
>> > > When handling anonymous pages, this will be considered before swap if
>> > > enabled. Should the demotion fail for any reason, the page reclaim
>> > > will proceed as if the demotion feature was not enabled.
>> > > 
>> > Thanks for sharing these patches and kick-starting the conversation, Dave.
>> > 
>> > Could this cause us to break a user's mbind() or allow a user to
>> > circumvent their cpuset.mems?
>> > 
>> > Because we don't have a mapping of the page back to its allocation
>> > context (or the process context in which it was allocated), it seems like
>> > both are possible.
>> 
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on mailing list and tried to find a way
>> to solve it, but unfortunately it seems not easy as what you mentioned above.
>> The memory policy and cpuset is stored in task_struct rather than mm_struct.
>> It is not easy to trace back to task_struct from page (owner field of
>> mm_struct might be helpful, but it depends on CONFIG_MEMCG and is not
>> preferred way).
>> 
>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from 
> the system perspective, however, but rather the socket perspective.  In 
> other words, a node can only demote to a series of exclusive pmem ranges 
> and promote to the same series of ranges in reverse order.  So DRAM node 0 
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM 
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than 
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem 
> just to be slower volatile memory and we don't need to deal with the 
> latency concerns of cross socket migration.  A user page will never be 
> demoted to a pmem range across the socket and will never be promoted to a 
> different DRAM node that it doesn't have access to.
>
> That can work with the NUMA abstraction for pmem, but it could also 
> theoretically be a new memory zone instead.  If all memory living on pmem 
> is migratable (the natural way that memory hotplug is done, so we can 
> offline), this zone would live above ZONE_MOVABLE.  Zonelist ordering 
> would determine whether we can allocate directly from this memory based on 
> system config or a new gfp flag that could be set for users of a mempolicy 
> that allows allocations directly from pmem.  If abstracted as a NUMA node 
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't 
> make much sense.

Why can't we just bind the memory of the application to nodes 0, 2, 3
via mbind() or cpuset.mems?  Then the application can allocate memory
directly from PMEM.  And if we bind the memory of the application via
mbind() to node 0, it can only allocate memory directly from DRAM.
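
For example, from user space (illustrative node numbers, following the
DRAM {0, 1} / PMEM {2, 3} layout above):

#include <numaif.h>

/* Allow allocations from DRAM node 0 plus PMEM nodes 2 and 3. */
static long bind_to_dram_and_pmem(void *addr, unsigned long len)
{
	unsigned long nodemask = (1UL << 0) | (1UL << 2) | (1UL << 3);

	return mbind(addr, len, MPOL_BIND, &nodemask,
		     sizeof(nodemask) * 8, 0);
}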

Best Regards,
Huang, Ying


Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard

2020-06-30 Thread Huang, Ying
David Rientjes  writes:

> On Mon, 29 Jun 2020, Dave Hansen wrote:
>
>> From: Dave Hansen 
>> 
>> If a memory node has a preferred migration path to demote cold pages,
>> attempt to move those inactive pages to that migration node before
>> reclaiming. This will better utilize available memory, provide a faster
>> tier than swapping or discarding, and allow such pages to be reused
>> immediately without IO to retrieve the data.
>> 
>> When handling anonymous pages, this will be considered before swap if
>> enabled. Should the demotion fail for any reason, the page reclaim
>> will proceed as if the demotion feature was not enabled.
>> 
>
> Thanks for sharing these patches and kick-starting the conversation, Dave.
>
> Could this cause us to break a user's mbind() or allow a user to 
> circumvent their cpuset.mems?
>
> Because we don't have a mapping of the page back to its allocation 
> context (or the process context in which it was allocated), it seems like 
> both are possible.

For mbind, I think we don't have enough information during reclaim to
enforce the node binding policy.  But for cpuset, if cgroup v2 (with the
unified hierarchy) is used, it's possible to get the node binding policy
via something like,

  cgroup_get_e_css(page->mem_cgroup, &cpuset_cgrp_subsys)
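
Spelled out a little more (cpuset_css_mems_allowed() is a made-up helper
standing in for however the effective cpuset.mems would be read;
cgroup_get_e_css() and css_put() are existing cgroup interfaces), a
demotion-time check could look like:

static bool demotion_allowed_by_cpuset(struct page *page, int target_nid)
{
	struct cgroup_subsys_state *css;
	bool allowed = true;

	if (!page->mem_cgroup)
		return true;

	css = cgroup_get_e_css(page->mem_cgroup->css.cgroup,
			       &cpuset_cgrp_subsys);
	if (css) {
		/* Hypothetical helper returning the effective cpuset.mems. */
		nodemask_t mems = cpuset_css_mems_allowed(css);

		allowed = node_isset(target_nid, mems);
		css_put(css);
	}

	return allowed;
}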

> So let's assume that migration nodes cannot be other DRAM nodes.  
> Otherwise, memory pressure could be intentionally or unintentionally 
> induced to migrate these pages to another node.  Do we have such a 
> restriction on migration nodes?
>
>> Some places we would like to see this used:
>> 
>>   1. Persistent memory being as a slower, cheaper DRAM replacement
>>   2. Remote memory-only "expansion" NUMA nodes
>>   3. Resolving memory imbalances where one NUMA node is seeing more
>>  allocation activity than another.  This helps keep more recent
>>  allocations closer to the CPUs on the node doing the allocating.
>> 
>
> (3) is the concerning one given the above if we are to use 
> migrate_demote_mapping() for DRAM node balancing.
>
>> Yang Shi's patches used an alternative approach where to-be-discarded
>> pages were collected on a separate discard list and then discarded
>> as a batch with migrate_pages().  This results in simpler code and
>> has all the performance advantages of batching, but has the
>> disadvantage that pages which fail to migrate never get swapped.
>> 
>> #Signed-off-by: Keith Busch 
>> Signed-off-by: Dave Hansen 
>> Cc: Keith Busch 
>> Cc: Yang Shi 
>> Cc: David Rientjes 
>> Cc: Huang Ying 
>> Cc: Dan Williams 
>> ---
>> 
>>  b/include/linux/migrate.h|6 
>>  b/include/trace/events/migrate.h |3 +-
>>  b/mm/debug.c |1 
>>  b/mm/migrate.c   |   52 
>> +++
>>  b/mm/vmscan.c|   25 ++
>>  5 files changed, 86 insertions(+), 1 deletion(-)
>> 
>> diff -puN 
>> include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
>>  include/linux/migrate.h
>> --- 
>> a/include/linux/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
>>   2020-06-29 16:34:38.950312604 -0700
>> +++ b/include/linux/migrate.h2020-06-29 16:34:38.963312604 -0700
>> @@ -25,6 +25,7 @@ enum migrate_reason {
>>  MR_MEMPOLICY_MBIND,
>>  MR_NUMA_MISPLACED,
>>  MR_CONTIG_RANGE,
>> +MR_DEMOTION,
>>  MR_TYPES
>>  };
>>  
>> @@ -78,6 +79,7 @@ extern int migrate_huge_page_move_mappin
>>struct page *newpage, struct page *page);
>>  extern int migrate_page_move_mapping(struct address_space *mapping,
>>  struct page *newpage, struct page *page, int extra_count);
>> +extern int migrate_demote_mapping(struct page *page);
>>  #else
>>  
>>  static inline void putback_movable_pages(struct list_head *l) {}
>> @@ -104,6 +106,10 @@ static inline int migrate_huge_page_move
>>  return -ENOSYS;
>>  }
>>  
>> +static inline int migrate_demote_mapping(struct page *page)
>> +{
>> +return -ENOSYS;
>> +}
>>  #endif /* CONFIG_MIGRATION */
>>  
>>  #ifdef CONFIG_COMPACTION
>> diff -puN 
>> include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
>>  include/trace/events/migrate.h
>> --- 
>> a/include/trace/events/migrate.h~0008-mm-vmscan-Attempt-to-migrate-page-in-lieu-of-discard
>>2020-06-2

Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

2020-06-30 Thread Huang, Ying
Yang Shi  writes:

> On 6/30/20 5:48 PM, Huang, Ying wrote:
>> Hi, Yang,
>>
>> Yang Shi  writes:
>>
>>>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>>>> --- a/mm/vmscan.c~enable-numa-demotion2020-06-29 16:35:01.017312549 
>>>>> -0700
>>>>> +++ b/mm/vmscan.c 2020-06-29 16:35:01.023312549 -0700
>>>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>>> * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>>> * ABI.  New bits are OK, but existing bits can never change.
>>>>> */
>>>>> -#define RECLAIM_RSVD  (1<<0) /* (currently ignored/unused) */
>>>>> -#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
>>>>> -#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
>>>>> +#define RECLAIM_RSVD (1<<0)  /* (currently ignored/unused) */
>>>>> +#define RECLAIM_WRITE(1<<1)  /* Writeout pages during reclaim */
>>>>> +#define RECLAIM_UNMAP(1<<2)  /* Unmap pages during reclaim */
>>>>> +#define RECLAIM_MIGRATE  (1<<3)  /* Migrate pages during reclaim */
>>>>>  /*
>>>>> * Priority for NODE_RECLAIM. This determines the fraction of pages
>>>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>>>> patch.
>>>>
>>>> If my understanding of the code were correct, shrink_do_demote_mapping()
>>>> is called by shrink_page_list(), which is used by kswapd and direct
>>>> reclaim.  So as long as the persistent memory node is onlined,
>>>> reclaim-based migration will be enabled regardless of node reclaim mode.
>>> It looks so according to the code. But the intention of a new node
>>> reclaim mode is to do migration on reclaim *only when* the
>>> RECLAIM_MODE is enabled by the users.
>>>
>>> It looks the patch just clear the migration target node masks if the
>>> memory is offlined.
>>>
>>> So, I'm supposed you need check if node_reclaim is enabled before
>>> doing migration in shrink_page_list() and also need make node reclaim
>>> to adopt the new mode.
>> But why shouldn't we migrate in kswapd and direct reclaim?  I think that
>> we may need a way to control it, but shouldn't disable it
>> unconditionally.
>
> Let me share some background. In the past discussions on LKML and last
> year's LSFMM the opt-in approach was preferred since the new feature
> might be not stable and mature.  So the new node reclaim mode was
> suggested by both Mel and Michal. I'm supposed this is still a valid
> point now.

Is there any technical reason?  I think the code isn't very complex.  If
we really worry about stability and maturity, isn't it enough to provide
some way to enable/disable the feature, even for kswapd and direct reclaim?
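
For instance, a single gate on the demotion attempt would already give
users that control while still covering both kswapd and direct reclaim
(sketch only; migrate_demote_mapping() is from the patchset and is
assumed to return 0 on success):

static bool try_demote_page(struct page *page)
{
	/* Only demote when the user has opted in via RECLAIM_MIGRATE. */
	if (!(node_reclaim_mode & RECLAIM_MIGRATE))
		return false;

	return migrate_demote_mapping(page) == 0;
}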

Best Regards,
Huang, Ying

> Once it is mature and stable enough we definitely could make it
> universally preferred and default behavior.
>
>>
>>> Please refer to
>>> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang@linux.alibaba.com/
>>>
>> Best Regards,
>> Huang, Ying


Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

2020-06-30 Thread Huang, Ying
Hi, Yang,

Yang Shi  writes:

>>> diff -puN mm/vmscan.c~enable-numa-demotion mm/vmscan.c
>>> --- a/mm/vmscan.c~enable-numa-demotion  2020-06-29 16:35:01.017312549 
>>> -0700
>>> +++ b/mm/vmscan.c   2020-06-29 16:35:01.023312549 -0700
>>> @@ -4165,9 +4165,10 @@ int node_reclaim_mode __read_mostly;
>>>* These bit locations are exposed in the vm.zone_reclaim_mode sysctl
>>>* ABI.  New bits are OK, but existing bits can never change.
>>>*/
>>> -#define RECLAIM_RSVD  (1<<0)   /* (currently ignored/unused) */
>>> -#define RECLAIM_WRITE (1<<1)   /* Writeout pages during reclaim */
>>> -#define RECLAIM_UNMAP (1<<2)   /* Unmap pages during reclaim */
>>> +#define RECLAIM_RSVD   (1<<0)  /* (currently ignored/unused) */
>>> +#define RECLAIM_WRITE  (1<<1)  /* Writeout pages during reclaim */
>>> +#define RECLAIM_UNMAP  (1<<2)  /* Unmap pages during reclaim */
>>> +#define RECLAIM_MIGRATE(1<<3)  /* Migrate pages during reclaim */
>>> /*
>>>* Priority for NODE_RECLAIM. This determines the fraction of pages
>> I found that RECLAIM_MIGRATE is defined but never referenced in the
>> patch.
>>
>> If my understanding of the code were correct, shrink_do_demote_mapping()
>> is called by shrink_page_list(), which is used by kswapd and direct
>> reclaim.  So as long as the persistent memory node is onlined,
>> reclaim-based migration will be enabled regardless of node reclaim mode.
>
> It looks so according to the code. But the intention of a new node
> reclaim mode is to do migration on reclaim *only when* the
> RECLAIM_MODE is enabled by the users.
>
> It looks the patch just clear the migration target node masks if the
> memory is offlined.
>
> So, I'm supposed you need check if node_reclaim is enabled before
> doing migration in shrink_page_list() and also need make node reclaim
> to adopt the new mode.

But why shouldn't we migrate in kswapd and direct reclaim?  I think that
we may need a way to control it, but shouldn't disable it
unconditionally.

> Please refer to
> https://lore.kernel.org/linux-mm/1560468577-101178-6-git-send-email-yang@linux.alibaba.com/
>

Best Regards,
Huang, Ying


Re: [RFC][PATCH 5/8] mm/numa: automatically generate node migration order

2020-06-30 Thread Huang, Ying
Dave Hansen  writes:

> +/*
> + * Find an automatic demotion target for 'node'.
> + * Failing here is OK.  It might just indicate
> + * being at the end of a chain.
> + */
> +static int establish_migrate_target(int node, nodemask_t *used)
> +{
> + int migration_target;
> +
> + /*
> +  * Can not set a migration target on a
> +  * node with it already set.
> +  *
> +  * No need for READ_ONCE() here since this
> +  * in the write path for node_demotion[].
> +  * This should be the only thread writing.
> +  */
> + if (node_demotion[node] != NUMA_NO_NODE)
> + return NUMA_NO_NODE;
> +
> + migration_target = find_next_best_node(node, used);
> + if (migration_target == NUMA_NO_NODE)
> + return NUMA_NO_NODE;
> +
> + node_demotion[node] = migration_target;
> +
> + return migration_target;
> +}
> +
> +/*
> + * When memory fills up on a node, memory contents can be
> + * automatically migrated to another node instead of
> + * discarded at reclaim.
> + *
> + * Establish a "migration path" which will start at nodes
> + * with CPUs and will follow the priorities used to build the
> + * page allocator zonelists.
> + *
> + * The difference here is that cycles must be avoided.  If
> + * node0 migrates to node1, then neither node1, nor anything
> + * node1 migrates to can migrate to node0.
> + *
> + * This function can run simultaneously with readers of
> + * node_demotion[].  However, it can not run simultaneously
> + * with itself.  Exclusion is provided by memory hotplug events
> + * being single-threaded.
> + */
> +void set_migration_target_nodes(void)
> +{
> + nodemask_t next_pass = NODE_MASK_NONE;
> + nodemask_t this_pass = NODE_MASK_NONE;
> + nodemask_t used_targets = NODE_MASK_NONE;
> + int node;
> +
> + get_online_mems();
> + /*
> +  * Avoid any oddities like cycles that could occur
> +  * from changes in the topology.  This will leave
> +  * a momentary gap when migration is disabled.
> +  */
> + disable_all_migrate_targets();
> +
> + /*
> +  * Ensure that the "disable" is visible across the system.
> +  * Readers will see either a combination of before+disable
> +  * state or disable+after.  They will never see before and
> +  * after state together.
> +  *
> +  * The before+after state together might have cycles and
> +  * could cause readers to do things like loop until this
> +  * function finishes.  This ensures they can only see a
> +  * single "bad" read and would, for instance, only loop
> +  * once.
> +  */
> + smp_wmb();
> +
> + /*
> +  * Allocations go close to CPUs, first.  Assume that
> +  * the migration path starts at the nodes with CPUs.
> +  */
> + next_pass = node_states[N_CPU];
> +again:
> + this_pass = next_pass;
> + next_pass = NODE_MASK_NONE;
> + /*
> +  * To avoid cycles in the migration "graph", ensure
> +  * that migration sources are not future targets by
> +  * setting them in 'used_targets'.
> +  *
> +  * But, do this only once per pass so that multiple
> +      * source nodes can share a target node.

establish_migrate_target() calls find_next_best_node(), which will set
target_node in used_targets.  So it seems that the nodes_or() below is
only necessary to initialize used_targets, and multiple source nodes
cannot share one target node in current implementation.

Best Regards,
Huang, Ying

> +  */
> + nodes_or(used_targets, used_targets, this_pass);
> + for_each_node_mask(node, this_pass) {
> + int target_node = establish_migrate_target(node, &used_targets);
> +
> + if (target_node == NUMA_NO_NODE)
> + continue;
> +
> + /* Visit targets from this pass in the next pass: */
> + node_set(target_node, next_pass);
> + }
> + /* Is another pass necessary? */
> + if (!nodes_empty(next_pass))
> + goto again;
> +
> + put_online_mems();
> +}


Re: [RFC][PATCH 8/8] mm/numa: new reclaim mode to enable reclaim-based migration

2020-06-30 Thread Huang, Ying
Hi, Dave,

Dave Hansen  writes:

> From: Dave Hansen 
>
> Some method is obviously needed to enable reclaim-based migration.
>
> Just like traditional autonuma, there will be some workloads that
> will benefit like workloads with more "static" configurations where
> hot pages stay hot and cold pages stay cold.  If pages come and go
> from the hot and cold sets, the benefits of this approach will be
> more limited.
>
> The benefits are truly workload-based and *not* hardware-based.
> We do not believe that there is a viable threshold where certain
> hardware configurations should have this mechanism enabled while
> others do not.
>
> To be conservative, earlier work defaulted to disable reclaim-
> based migration and did not include a mechanism to enable it.
> This proposes extending the existing "zone_reclaim_mode" (now
> really node_reclaim_mode) as a method to enable it.
>
> We are open to any alternative that allows end users to enable
> this mechanism or disable it if workload harm is detected (just
> like traditional autonuma).
>
> The implementation here is pretty simple and entirely unoptimized.
> On any memory hotplug events, assume that a node was added or
> removed and recalculate all migration targets.  This ensures that
> the node_demotion[] array is always ready to be used in case the
> new reclaim mode is enabled.  This recalculation is far from
> optimal, most glaringly that it does not even attempt to figure
> out if nodes are actually coming or going.
>
> Signed-off-by: Dave Hansen 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> ---
>
>  b/Documentation/admin-guide/sysctl/vm.rst |9 
>  b/mm/migrate.c|   61 
> +-
>  b/mm/vmscan.c |7 +--
>  3 files changed, 73 insertions(+), 4 deletions(-)
>
> diff -puN Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion 
> Documentation/admin-guide/sysctl/vm.rst
> --- a/Documentation/admin-guide/sysctl/vm.rst~enable-numa-demotion
> 2020-06-29 16:35:01.012312549 -0700
> +++ b/Documentation/admin-guide/sysctl/vm.rst 2020-06-29 16:35:01.021312549 
> -0700
> @@ -941,6 +941,7 @@ This is value OR'ed together of
>  1(bit currently ignored)
>  2Zone reclaim writes dirty pages out
>  4Zone reclaim swaps pages
> +8Zone reclaim migrates pages
>  ====
>  
>  zone_reclaim_mode is disabled by default.  For file servers or workloads
> @@ -965,3 +966,11 @@ of other processes running on other node
>  Allowing regular swap effectively restricts allocations to the local
>  node unless explicitly overridden by memory policies or cpuset
>  configurations.
> +
> +Page migration during reclaim is intended for systems with tiered memory
> +configurations.  These systems have multiple types of memory with varied
> +performance characteristics instead of plain NUMA systems where the same
> +kind of memory is found at varied distances.  Allowing page migration
> +during reclaim enables these systems to migrate pages from fast tiers to
> +slow tiers when the fast tier is under pressure.  This migration is
> +performed before swap.
> diff -puN mm/migrate.c~enable-numa-demotion mm/migrate.c
> --- a/mm/migrate.c~enable-numa-demotion   2020-06-29 16:35:01.015312549 
> -0700
> +++ b/mm/migrate.c2020-06-29 16:35:01.022312549 -0700
> @@ -49,6 +49,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -3165,6 +3166,10 @@ void set_migration_target_nodes(void)
>* Avoid any oddities like cycles that could occur
>* from changes in the topology.  This will leave
>* a momentary gap when migration is disabled.
> +  *
> +  * This is superfluous for memory offlining since
> +  * MEM_GOING_OFFLINE does it independently, but it
> +  * does not hurt to do it a second time.
>*/
>   disable_all_migrate_targets();
>  
> @@ -3211,6 +3216,60 @@ again:
>   /* Is another pass necessary? */
>   if (!nodes_empty(next_pass))
>   goto again;
> +}
>  
> - put_online_mems();
> +/*
> + * React to hotplug events that might online or offline
> + * NUMA nodes.
> + *
> + * This leaves migrate-on-reclaim transiently disabled
> + * between the MEM_GOING_OFFLINE and MEM_OFFLINE events.
> + * This runs whether RECLAIM_MIGRATE is enabled or not.
> + * That ensures that the user can turn RECLAIM_MIGRATE on and off
> + * without needing to recalculate migration targets.
> + */
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +static int __meminit migrate_on_reclaim_callback(struct notifier_

Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

2020-06-23 Thread Huang, Ying
"Huang, Ying"  writes:

> Andrew Morton  writes:
>
>> On Wed, 20 May 2020 11:15:02 +0800 Huang Ying  wrote:
>>
>>> In some swap scalability test, it is found that there are heavy lock
>>> contention on swap cache even if we have split one swap cache radix
>>> tree per swap device to one swap cache radix tree every 64 MB trunk in
>>> commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>>> 
>>> The reason is as follow.  After the swap device becomes fragmented so
>>> that there's no free swap cluster, the swap device will be scanned
>>> linearly to find the free swap slots.  swap_info_struct->cluster_next
>>> is the next scanning base that is shared by all CPUs.  So nearby free
>>> swap slots will be allocated for different CPUs.  The probability for
>>> multiple CPUs to operate on the same 64 MB trunk is high.  This causes
>>> the lock contention on the swap cache.
>>> 
>>> To solve the issue, in this patch, for SSD swap device, a percpu
>>> version next scanning base (cluster_next_cpu) is added.  Every CPU
>>> will use its own per-cpu next scanning base.  And after finishing
>>> scanning a 64MB trunk, the per-cpu scanning base will be changed to
>>> the beginning of another randomly selected 64MB trunk.  In this way,
>>> the probability for multiple CPUs to operate on the same 64 MB trunk
>>> is reduced greatly.  Thus the lock contention is reduced too.  For
>>> HDD, because sequential access is more important for IO performance,
>>> the original shared next scanning base is used.
>>> 
>>> To test the patch, we have run 16-process pmbench memory benchmark on
>>> a 2-socket server machine with 48 cores.  One ram disk is configured
>>
>> What does "ram disk" mean here?  Which drivers(s) are in use and backed
>> by what sort of memory?
>
> We use the following kernel command line
>
> memmap=48G!6G memmap=48G!68G
>
> to create 2 DRAM based /dev/pmem disks (48GB each).  Then we use these
> ram disks as swap devices.
>
>>> as the swap device per socket.  The pmbench working-set size is much
>>> larger than the available memory so that swapping is triggered.  The
>>> memory read/write ratio is 80/20 and the accessing pattern is random.
>>> In the original implementation, the lock contention on the swap cache
>>> is heavy.  The perf profiling data of the lock contention code path is
>>> as following,
>>> 
>>> _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:  7.91
>>> _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:   7.11
>>> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
>>> _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
>>> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  1.29
>>> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
>>> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:0.93
>>> 
>>> After applying this patch, it becomes,
>>> 
>>> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
>>> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  2.3
>>> _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
>>> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:1.8
>>> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19
>>> 
>>> The lock contention on the swap cache is almost eliminated.
>>> 
>>> And the pmbench score increases 18.5%.  The swapin throughput
>>> increases 18.7% from 2.96 GB/s to 3.51 GB/s.  While the swapout
>>> throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.
>>
>> If this was backed by plain old RAM, can we assume that the performance
>> improvement on SSD swap is still good?
>
> We need really fast disk to show the benefit.  I have tried this on 2
> Intel P3600 NVMe disks.  The performance improvement is only about 1%.
> The improvement should be better on the faster disks, such as Intel
> Optane disk.  I will try to find some to test.

I finally found 2 Intel Optane disks to test with.  The pmbench throughput
(page accesses per second) increases ~1.7% with the patch.  The swapin
throughput increases 2% (~1.36 GB/s to ~1.39 GB/s), and the swapout
throughput increases 1.7% (~1.61 GB/s to ~1.63 GB/s).  The perf profile
shows that the CPU cycles spent on the swap cache radix tree spinlock are
reduced from ~1.76% to nearly 0.  So the performance difference is much
smaller, but still measurable.

Best Regards,
Huang, Ying


Re: [PATCH 13/16] mm: support THP migration to device private memory

2020-06-22 Thread Huang, Ying
Ralph Campbell  writes:

> On 6/22/20 4:54 PM, Yang Shi wrote:
>> On Mon, Jun 22, 2020 at 4:02 PM John Hubbard  wrote:
>>>
>>> On 2020-06-22 15:33, Yang Shi wrote:
>>>> On Mon, Jun 22, 2020 at 3:30 PM Yang Shi  wrote:
>>>>> On Mon, Jun 22, 2020 at 2:53 PM Zi Yan  wrote:
>>>>>> On 22 Jun 2020, at 17:31, Ralph Campbell wrote:
>>>>>>> On 6/22/20 1:10 PM, Zi Yan wrote:
>>>>>>>> On 22 Jun 2020, at 15:36, Ralph Campbell wrote:
>>>>>>>>> On 6/21/20 4:20 PM, Zi Yan wrote:
>>>>>>>>>> On 19 Jun 2020, at 17:56, Ralph Campbell wrote:
>>> ...
>>>>>> Ying(cc’d) developed the code to swapout and swapin THP in one piece: 
>>>>>> https://lore.kernel.org/linux-mm/20181207054122.27822-1-ying.hu...@intel.com/.
>>>>>> I am not sure whether the patchset makes into mainstream or not. It 
>>>>>> could be a good technical reference
>>>>>> for swapping in device private pages, although swapping in pages from 
>>>>>> disk and from device private
>>>>>> memory are two different scenarios.
>>>>>>
>>>>>> Since the device private memory swapin impacts core mm performance, we 
>>>>>> might want to discuss your patches
>>>>>> with more people, like the ones from Ying’s patchset, in the next 
>>>>>> version.
>>>>>
>>>>> I believe Ying will give you more insights about how THP swap works.
>>>>>
>>>>> But, IMHO device memory migration (migrate to system memory) seems
>>>>> like THP CoW more than swap.
>>>
>>>
>>> A fine point: overall, the desired behavior is "migrate", not CoW.
>>> That's important. Migrate means that you don't leave a page behind, even
>>> a read-only one. And that's exactly how device private migration is
>>> specified.
>>>
>>> We should try to avoid any erosion of clarity here. Even if somehow
>>> (really?) the underlying implementation calls this THP CoW, the actual
>>> goal is to migrate pages over to the device (and back).
>>>
>>>
>>>>>
>>>>> When migrating in:
>>>>
>>>> Sorry for my fat finger, hit sent button inadvertently, let me finish here.
>>>>
>>>> When migrating in:
>>>>
>>>>   - if THP is enabled: allocate THP, but need handle allocation
>>>> failure by falling back to base page
>>>>   - if THP is disabled: fallback to base page
>>>>
>>>
>>> OK, but *all* page entries (base and huge/large pages) need to be cleared,
>>> when migrating to device memory, unless I'm really confused here.
>>> So: not CoW.
>>
>> I realized the comment caused more confusion. I apologize for the
>> confusion. Yes, the trigger condition for swap/migration and CoW are
>> definitely different. Here I mean the fault handling part of migrating
>> into system memory.
>>
>> Swap-in just needs to handle the base page case since THP swapin is
>> not supported in upstream yet and the PMD is split in swap-out phase
>> (see shrink_page_list).
>>
>> The patch adds THP migration support to device memory, but you need to
>> handle migrate in (back to system memory) case correctly. The fault
>> handling should look like THP CoW fault handling behavior (before
>> 5.8):
>>  - if THP is enabled: allocate THP, fallback if allocation is failed
>>  - if THP is disabled: fallback to base page
>>
>> Swap fault handling doesn't look like the above. So, I said it seems
>> like more THP CoW (fault handling part only before 5.8). I hope I
>> articulate my mind.
>>
>> However, I didn't see such fallback is handled. It looks if THP
>> allocation is failed, it just returns SIGBUS; and no check about THP
>> status if I read the patches correctly. The THP might be disabled for
>> the specific vma or system wide before migrating from device memory
>> back to system memory.
>
> You are correct, the patch wasn't handling the fallback case.
> I'll add that in the next version.

For the fallback, you need to split the THP in the device memory first,
because you will then migrate a base page instead of a whole THP.
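
Roughly like this (sketch only; migrate_device_base_page() is a made-up
placeholder for the existing base-page migration path back to system
memory):

static vm_fault_t migrate_back_fallback(struct vm_fault *vmf, struct page *page)
{
	/* split_huge_page() expects a locked page and may itself fail. */
	if (split_huge_page(page))
		return VM_FAULT_SIGBUS;

	/* Now only the faulting base page needs to be migrated back. */
	return migrate_device_base_page(vmf, page);
}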

Best Regards,
Huang, Ying

>>>
>>> thanks,
>>> --
>>> John Hubbard
>>> NVIDIA


[PATCH -V4] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-28 Thread Huang, Ying
From: Huang Ying 

In some swap scalability tests, it is found that there is heavy lock
contention on the swap cache even though we have split the one swap cache
radix tree per swap device into one swap cache radix tree per 64 MB trunk
in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

The reason is as follows.  After the swap device becomes fragmented so
that there's no free swap cluster, the swap device will be scanned
linearly to find the free swap slots.  swap_info_struct->cluster_next
is the next scanning base that is shared by all CPUs.  So nearby free
swap slots will be allocated for different CPUs.  The probability for
multiple CPUs to operate on the same 64 MB trunk is high.  This causes
the lock contention on the swap cache.

To solve the issue, in this patch, for SSD swap devices, a percpu
version of the next scanning base (cluster_next_cpu) is added.  Every CPU
will use its own per-cpu next scanning base.  And after finishing
scanning a 64MB trunk, the per-cpu scanning base will be changed to
the beginning of another randomly selected 64MB trunk.  In this way,
the probability for multiple CPUs to operate on the same 64 MB trunk
is reduced greatly.  Thus the lock contention is reduced too.  For
HDD, because sequential access is more important for IO performance,
the original shared next scanning base is used.

To test the patch, we have run 16-process pmbench memory benchmark on
a 2-socket server machine with 48 cores.  One ram disk is configured
as the swap device per socket.  The pmbench working-set size is much
larger than the available memory so that swapping is triggered.  The
memory read/write ratio is 80/20 and the accessing pattern is random.
In the original implementation, the lock contention on the swap cache
is heavy.  The perf profiling data of the lock contention code path is
as follows,

_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:  7.91
_raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:   7.11
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  1.29
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:0.93

After applying this patch, it becomes,

_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  2.3
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:1.8
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19

The lock contention on the swap cache is almost eliminated.

And the pmbench score increases 18.5%.  The swapin throughput
increases 18.7% from 2.96 GB/s to 3.51 GB/s.  While the swapout
throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.

Signed-off-by: "Huang, Ying" 
Reviewed-by: Daniel Jordan 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Tim Chen 
Cc: Hugh Dickins 
---

Changelog:

v4:

- Fix wrong ALIGN() usage with ALIGN_DOWN().  Thanks Daniel's comments!

- Add some comments.  Thanks Daniel's comments!

v3:

- Fix cluster_next_cpu allocation and freeing.  Thanks Daniel's comments!

v2:

- Rebased on latest mmotm tree (v5.7-rc5-mmots-2020-05-15-16-36), the
  mem cgroup change has influence on performance data.

- Fix cluster_next_cpu initialization per Andrew and Daniel's comments.

- Change per-cpu scan base every 64MB per Andrew's comments.

---
 include/linux/swap.h |  1 +
 mm/swapfile.c| 61 
 2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..e96820fb7472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
unsigned int inuse_pages;   /* number of those currently in use */
unsigned int cluster_next;  /* likely index for next allocation */
unsigned int cluster_nr;/* countdown to next cluster search */
+   unsigned int __percpu *cluster_next_cpu; /*percpu index for next 
allocation */
struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap 
location */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev;  /* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 423c234aca15..c12e1fe6b067 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -612,10 +612,12 @@ static bool scan_swap_map_try_ssd_cluster(struct 
swap_info_struct *si,
} else if (!cluster_list_empty(&si->discard_clusters)) {
/*
  

Re: [PATCH -V3] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-28 Thread Huang, Ying
Daniel Jordan  writes:

> On Thu, May 28, 2020 at 01:32:40PM +0800, Huang, Ying wrote:
>> Daniel Jordan  writes:
>> 
>> > On Mon, May 25, 2020 at 08:26:48AM +0800, Huang Ying wrote:
>> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> index 423c234aca15..0abd93d2a4fc 100644
>> >> --- a/mm/swapfile.c
>> >> +++ b/mm/swapfile.c
>> >> @@ -615,7 +615,8 @@ static bool scan_swap_map_try_ssd_cluster(struct 
>> >> swap_info_struct *si,
>> >>* discarding, do discard now and reclaim them
>> >>*/
>> >>   swap_do_scheduled_discard(si);
>> >> - *scan_base = *offset = si->cluster_next;
>> >> + *scan_base = this_cpu_read(*si->cluster_next_cpu);
>> >> + *offset = *scan_base;
>> >>   goto new_cluster;
>> >
>> > Why is this done?  As far as I can tell, the values always get overwritten 
>> > at
>> > the end of the function with tmp and tmp isn't derived from them.  Seems
>> > ebc2a1a69111 moved some logic that used to make sense but doesn't have any
>> > effect now.
>> 
>> If we fail to allocate from cluster, "scan_base" and "offset" will not
>> be overridden.
>
> Ok, if another task races to allocate the clusters the first just discarded.
>
>> And "cluster_next" or "cluster_next_cpu" may be changed
>> in swap_do_scheduled_discard(), because the lock is released and
>> re-acquired there.
>
> I see, by another task on the same cpu for cluster_next_cpu.
>
> Both probably unlikely, but at least it tries to pick up where the racing task
> left off.  You might tack this onto the comment:
>
>* discarding, do discard now and reclaim them, then reread
>  * cluster_next_cpu since we dropped si->lock
>     /*

Sure.  Will add this in the next version.

>> The code may not have much value.
>
> No, it makes sense.
>
>> > These aside, patch looks good to me.
>> 
>> Thanks for your review!  It really help me to improve the quality of the
>> patch.  Can I add your "Reviewed-by" in the next version?
>
> Sure,
> Reviewed-by: Daniel Jordan 

Thanks!

Best Regards,
Huang, Ying


Re: [PATCH -V3] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-27 Thread Huang, Ying
Daniel Jordan  writes:

> On Mon, May 25, 2020 at 08:26:48AM +0800, Huang Ying wrote:
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 423c234aca15..0abd93d2a4fc 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -615,7 +615,8 @@ static bool scan_swap_map_try_ssd_cluster(struct 
>> swap_info_struct *si,
>>   * discarding, do discard now and reclaim them
>>   */
>>  swap_do_scheduled_discard(si);
>> -*scan_base = *offset = si->cluster_next;
>> +*scan_base = this_cpu_read(*si->cluster_next_cpu);
>> +*offset = *scan_base;
>>  goto new_cluster;
>
> Why is this done?  As far as I can tell, the values always get overwritten at
> the end of the function with tmp and tmp isn't derived from them.  Seems
> ebc2a1a69111 moved some logic that used to make sense but doesn't have any
> effect now.

If we fail to allocate from a cluster, "scan_base" and "offset" will not
be overwritten.  And "cluster_next" or "cluster_next_cpu" may be changed
in swap_do_scheduled_discard(), because the lock is released and
re-acquired there.

The code may not have much value.  And you may think that it's better to
remove it.  But that should be in another patch.

>>  } else
>>  return false;
>> @@ -721,6 +722,34 @@ static void swap_range_free(struct swap_info_struct 
>> *si, unsigned long offset,
>>  }
>>  }
>>  
>> +static void set_cluster_next(struct swap_info_struct *si, unsigned long 
>> next)
>> +{
>> +unsigned long prev;
>> +
>> +if (!(si->flags & SWP_SOLIDSTATE)) {
>> +si->cluster_next = next;
>> +return;
>> +}
>> +
>> +prev = this_cpu_read(*si->cluster_next_cpu);
>> +/*
>> + * Cross the swap address space size aligned trunk, choose
>> + * another trunk randomly to avoid lock contention on swap
>> + * address space if possible.
>> + */
>> +if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
>> +(next >> SWAP_ADDRESS_SPACE_SHIFT)) {
>> +/* No free swap slots available */
>> +if (si->highest_bit <= si->lowest_bit)
>> +return;
>> +next = si->lowest_bit +
>> +prandom_u32_max(si->highest_bit - si->lowest_bit + 1);
>> +next = ALIGN(next, SWAP_ADDRESS_SPACE_PAGES);
>> +next = max_t(unsigned int, next, si->lowest_bit);
>
> next is always greater than lowest_bit because it's aligned up.  I think the
> intent of the max_t line is to handle when next is aligned outside the valid
> range, so it'd have to be ALIGN_DOWN instead?

Oops.  I misunderstood "ALIGN()" here.  Yes, we should use ALIGN_DOWN()
instead.  Thanks for pointing this out!
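
For example, with SWAP_ADDRESS_SPACE_PAGES = 16384 and made-up bounds
lowest_bit = 100, highest_bit = 70000: if the random pick gives next =
69000, then ALIGN(69000, 16384) = 81920, which is beyond highest_bit,
while ALIGN_DOWN(69000, 16384) = 65536 stays inside the range, and the
following max_t(..., lowest_bit) covers the case where aligning down
would drop below lowest_bit.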

>
> These aside, patch looks good to me.

Thanks for your review!  It really helps me improve the quality of the
patch.  Can I add your "Reviewed-by" in the next version?

Best Regards,
Huang, Ying


[PATCH -V3] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-24 Thread Huang Ying
In some swap scalability tests, it is found that there is heavy lock
contention on the swap cache even though we have split the one swap cache
radix tree per swap device into one swap cache radix tree per 64 MB trunk
in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

The reason is as follows.  After the swap device becomes fragmented so
that there's no free swap cluster, the swap device will be scanned
linearly to find the free swap slots.  swap_info_struct->cluster_next
is the next scanning base that is shared by all CPUs.  So nearby free
swap slots will be allocated for different CPUs.  The probability for
multiple CPUs to operate on the same 64 MB trunk is high.  This causes
the lock contention on the swap cache.

To solve the issue, in this patch, for SSD swap devices, a percpu
version of the next scanning base (cluster_next_cpu) is added.  Every CPU
will use its own per-cpu next scanning base.  And after finishing
scanning a 64MB trunk, the per-cpu scanning base will be changed to
the beginning of another randomly selected 64MB trunk.  In this way,
the probability for multiple CPUs to operate on the same 64 MB trunk
is reduced greatly.  Thus the lock contention is reduced too.  For
HDD, because sequential access is more important for IO performance,
the original shared next scanning base is used.

To test the patch, we have run 16-process pmbench memory benchmark on
a 2-socket server machine with 48 cores.  One ram disk is configured
as the swap device per socket.  The pmbench working-set size is much
larger than the available memory so that swapping is triggered.  The
memory read/write ratio is 80/20 and the accessing pattern is random.
In the original implementation, the lock contention on the swap cache
is heavy.  The perf profiling data of the lock contention code path is
as follows,

_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:  7.91
_raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:   7.11
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  1.29
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:0.93

After applying this patch, it becomes,

_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  2.3
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:1.8
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19

The lock contention on the swap cache is almost eliminated.

And the pmbench score increases 18.5%.  The swapin throughput
increases 18.7% from 2.96 GB/s to 3.51 GB/s.  While the swapout
throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.

Signed-off-by: "Huang, Ying" 
Cc: Daniel Jordan 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Tim Chen 
Cc: Hugh Dickins 
---

Changelog:

v3:

- Fix cluster_next_cpu allocation and freeing.  Thanks Daniel's comments!

v2:

- Rebased on latest mmotm tree (v5.7-rc5-mmots-2020-05-15-16-36), the
  mem cgroup change has influence on performance data.

- Fix cluster_next_cpu initialization per Andrew and Daniel's comments.

- Change per-cpu scan base every 64MB per Andrew's comments.

---
 include/linux/swap.h |  1 +
 mm/swapfile.c| 58 +---
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..e96820fb7472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
unsigned int inuse_pages;   /* number of those currently in use */
unsigned int cluster_next;  /* likely index for next allocation */
unsigned int cluster_nr;/* countdown to next cluster search */
+   unsigned int __percpu *cluster_next_cpu; /*percpu index for next 
allocation */
struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap 
location */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev;  /* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 423c234aca15..0abd93d2a4fc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -615,7 +615,8 @@ static bool scan_swap_map_try_ssd_cluster(struct 
swap_info_struct *si,
 * discarding, do discard now and reclaim them
 */
swap_do_scheduled_discard(si);
-   *scan_base = *offset = si->cluster_next;
+   *scan_base = this_cpu_read(*si->cluster_n

Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-21 Thread Huang, Ying
Daniel Jordan  writes:

> On Wed, May 20, 2020 at 11:15:02AM +0800, Huang Ying wrote:
>> @@ -2827,6 +2865,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
>>  if (!p)
>>  return ERR_PTR(-ENOMEM);
>> +p->cluster_next_cpu = alloc_percpu(unsigned int);
>> +if (!p->cluster_next_cpu) {
>> +kvfree(p);
>> +return ERR_PTR(-ENOMEM);
>> +}
>
> There should be free_percpu()s at two places after this, but I think the
> allocation really belongs right...
>
>> @@ -3202,7 +3245,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
>> specialfile, int, swap_flags)
>>   * select a random position to start with to help wear leveling
>>   * SSD
>>   */
>> -p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>
> ...here because then it's only allocated when it's actually used.

Good catch!  And yes, this is the better place to allocate memory.  I
will fix this in the new version!  Thanks a lot!

Best Regards,
Huang, Ying

>> +for_each_possible_cpu(cpu) {
>> +per_cpu(*p->cluster_next_cpu, cpu) =
>> +1 + prandom_u32_max(p->highest_bit);
>> +}
>>  nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>>  
>>  cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
>> -- 
>> 2.26.2
>> 
>> 


Re: [PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-20 Thread Huang, Ying
Andrew Morton  writes:

> On Wed, 20 May 2020 11:15:02 +0800 Huang Ying  wrote:
>
>> In some swap scalability test, it is found that there are heavy lock
>> contention on swap cache even if we have split one swap cache radix
>> tree per swap device to one swap cache radix tree every 64 MB trunk in
>> commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>> 
>> The reason is as follow.  After the swap device becomes fragmented so
>> that there's no free swap cluster, the swap device will be scanned
>> linearly to find the free swap slots.  swap_info_struct->cluster_next
>> is the next scanning base that is shared by all CPUs.  So nearby free
>> swap slots will be allocated for different CPUs.  The probability for
>> multiple CPUs to operate on the same 64 MB trunk is high.  This causes
>> the lock contention on the swap cache.
>> 
>> To solve the issue, in this patch, for SSD swap device, a percpu
>> version next scanning base (cluster_next_cpu) is added.  Every CPU
>> will use its own per-cpu next scanning base.  And after finishing
>> scanning a 64MB trunk, the per-cpu scanning base will be changed to
>> the beginning of another randomly selected 64MB trunk.  In this way,
>> the probability for multiple CPUs to operate on the same 64 MB trunk
>> is reduced greatly.  Thus the lock contention is reduced too.  For
>> HDD, because sequential access is more important for IO performance,
>> the original shared next scanning base is used.
>> 
>> To test the patch, we have run 16-process pmbench memory benchmark on
>> a 2-socket server machine with 48 cores.  One ram disk is configured
>
> What does "ram disk" mean here?  Which drivers(s) are in use and backed
> by what sort of memory?

We use the following kernel command line

memmap=48G!6G memmap=48G!68G

to create 2 DRAM based /dev/pmem disks (48GB each).  Then we use these
ram disks as swap devices.

>> as the swap device per socket.  The pmbench working-set size is much
>> larger than the available memory so that swapping is triggered.  The
>> memory read/write ratio is 80/20 and the accessing pattern is random.
>> In the original implementation, the lock contention on the swap cache
>> is heavy.  The perf profiling data of the lock contention code path is
>> as following,
>> 
>> _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:  7.91
>> _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:   7.11
>> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
>> _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
>> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  1.29
>> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
>> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:0.93
>> 
>> After applying this patch, it becomes,
>> 
>> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
>> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  2.3
>> _raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
>> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:1.8
>> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19
>> 
>> The lock contention on the swap cache is almost eliminated.
>> 
>> And the pmbench score increases 18.5%.  The swapin throughput
>> increases 18.7% from 2.96 GB/s to 3.51 GB/s.  While the swapout
>> throughput increases 18.5% from 2.99 GB/s to 3.54 GB/s.
>
> If this was backed by plain old RAM, can we assume that the performance
> improvement on SSD swap is still good?

We need a really fast disk to show the benefit.  I have tried this on 2
Intel P3600 NVMe disks.  The performance improvement is only about 1%.
The improvement should be better on faster disks, such as an Intel
Optane disk.  I will try to find some to test.

> Does the ram disk actually set SWP_SOLIDSTATE?

Yes.  "blk_queue_flag_set(QUEUE_FLAG_NONROT, q)" is called in
drivers/nvdimm/pmem.c.

Best Regards,
Huang, Ying


[PATCH -V2] swap: Reduce lock contention on swap cache from swap slots allocation

2020-05-19 Thread Huang Ying
In some swap scalability tests, it is found that there is heavy lock
contention on the swap cache even though we have split the one swap cache
radix tree per swap device into one swap cache radix tree per 64 MB trunk
in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

The reason is as follows.  After the swap device becomes fragmented so
that there's no free swap cluster, the swap device will be scanned
linearly to find free swap slots.  swap_info_struct->cluster_next
is the next scanning base that is shared by all CPUs.  So nearby free
swap slots will be allocated for different CPUs.  The probability for
multiple CPUs to operate on the same 64 MB trunk is high.  This causes
the lock contention on the swap cache.

To solve the issue, this patch adds a per-CPU next scanning base
(cluster_next_cpu) for SSD swap devices.  Every CPU will use its own
per-CPU next scanning base, and after finishing scanning a 64MB trunk,
the per-CPU scanning base will be changed to the beginning of another,
randomly selected 64MB trunk.  In this way, the probability that
multiple CPUs operate on the same 64 MB trunk is greatly reduced, and
so is the lock contention.  For HDD, because sequential access is more
important for IO performance, the original shared next scanning base is
kept.

To test the patch, we ran a 16-process pmbench memory benchmark on a
2-socket server machine with 48 cores.  One ram disk is configured as
the swap device per socket.  The pmbench working-set size is much larger
than the available memory, so swapping is triggered.  The memory
read/write ratio is 80/20 and the access pattern is random.  In the
original implementation, the lock contention on the swap cache is
heavy.  The perf profiling data of the lock contention code paths is as
follows,

_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:  7.91
_raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:   7.11
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.51
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 1.66
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  1.29
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.03
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:0.93

After applying this patch, it becomes,

_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.58
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  2.3
_raw_spin_lock_irqsave.swap_cgroup_record.mem_cgroup_uncharge_swap: 2.26
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:1.8
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.19

The lock contention on the swap cache is almost eliminated.

And the pmbench score increases by 18.5%.  The swapin throughput
increases by 18.7%, from 2.96 GB/s to 3.51 GB/s, while the swapout
throughput increases by 18.5%, from 2.99 GB/s to 3.54 GB/s.

Signed-off-by: "Huang, Ying" 
Cc: Daniel Jordan 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Tim Chen 
Cc: Hugh Dickins 
---

Changelog:

v2:

- Rebased on latest mmotm tree (v5.7-rc5-mmots-2020-05-15-16-36), the
  mem cgroup change has influence on performance data.

- Fix cluster_next_cpu initialization per Andrew and Daniel's comments.

- Change per-cpu scan base every 64MB per Andrew's comments.

---
 include/linux/swap.h |  1 +
 mm/swapfile.c| 54 
 2 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..e96820fb7472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
unsigned int inuse_pages;   /* number of those currently in use */
unsigned int cluster_next;  /* likely index for next allocation */
unsigned int cluster_nr;/* countdown to next cluster search */
unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev;  /* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 423c234aca15..f5e3ab06bf18 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -615,7 +615,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 * discarding, do discard now and reclaim them
 */
swap_do_scheduled_discard(si);
-   *scan_base = *offset = si->cluster_next;
+   *scan_base = this_cpu_read(*si->cluster_next_cpu);
+   *offset = *scan_base;
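
Condensed, the allocation-path change amounts to the sketch below (drawn
from the scan_swap_map_slots() hunks quoted in the v1 posting later in
this thread; a sketch only, not the complete patch):

        si->flags += SWP_SCANNING;
        /* SSD: use the per-CPU scan base so CPUs tend to work in different trunks */
        if (si->flags & SWP_SOLIDSTATE)
                scan_base = this_cpu_read(*si->cluster_next_cpu);
        else
                scan_base = si->cluster_next;
        offset = scan_base;
        ...
        /* after allocating a slot, remember the next base per CPU (SSD) or globally (HDD) */
        swap_range_alloc(si, offset, 1);
        if (si->flags & SWP_SOLIDSTATE)
                this_cpu_write(*si->cluster_next_cpu, offset + 1);
        else
                si->cluster_next = offset + 1;

On top of this, v2 moves the per-CPU base to a randomly selected trunk
once it crosses a 64MB boundary, as described in the changelog above.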
   

Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache

2020-05-17 Thread Huang, Ying
Daniel Jordan  writes:

> On Thu, May 14, 2020 at 03:04:24PM +0800, Huang Ying wrote:
>> And the pmbench score increases 15.9%.
>
> What metric is that, and how long did you run the benchmark for?

I ran the benchmark for 1800 seconds.  The metric comes from the
following pmbench output,

[1] Benchmark done - took 1800.088 sec for 12291 page access

That is, the throughput is 12291 / 1800.088 = 68280.0 (accesses/s).
Then we sum the values from the different processes.

> Given that this thing is probabilistic, did you notice much variance from run
> to run?

The results look quite stable to me.  The run-to-run standard deviation
is less than 1%.

>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 35be7a7271f4..9f1343b066c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct 
>> *si,
>>   */
>>  
>>  si->flags += SWP_SCANNING;
>> -scan_base = offset = si->cluster_next;
>> +/*
>> + * Use percpu scan base for SSD to reduce lock contention on
>> + * cluster and swap cache.  For HDD, sequential access is more
>> + * important.
>> + */
>> +if (si->flags & SWP_SOLIDSTATE)
>> +scan_base = this_cpu_read(*si->cluster_next_cpu);
>> +else
>> +scan_base = si->cluster_next;
>> +offset = scan_base;
>>  
>>  /* SSD algorithm */
>>  if (si->cluster_info) {
>
> It's just a nit but SWP_SOLIDSTATE and 'if (si->cluster_info)' are two ways to
> check the same thing and I'd stick with the one that's already there.

Yes.  In effect, (si->flags & SWP_SOLIDSTATE) and (si->cluster_info)
always have the same value, at least for now.  But I don't think they
are exactly the same semantically, so I would rather use the one whose
semantics match the intention here.

>> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct 
>> swap_info_struct *p,
>>  
>>  p->lowest_bit  = 1;
>>  p->cluster_next = 1;
>> +for_each_possible_cpu(i)
>> +per_cpu(*p->cluster_next_cpu, i) = 1;
>
> These are later overwritten if the device is an SSD which seems to be the only
> case where these are used, so why have this?

Yes, you are right.  I will remove this in future versions.

>>  p->cluster_nr = 0;
>>  
>>  maxpages = max_swapfile_size();
>> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
>> specialfile, int, swap_flags)
>>   * SSD
>>   */
>>  p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>> +for_each_possible_cpu(cpu) {
>> +per_cpu(*p->cluster_next_cpu, cpu) =
>> +1 + prandom_u32_max(p->highest_bit);
>> +}
>
> Is there a reason for adding one?  The history didn't enlighten me about why
> cluster_next does it.

The first swap slot holds the swap partition header; you can find the
corresponding code in the swapon() syscall, below the comment "Read the
swap header.".

Best Regards,
Huang, Ying


Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache

2020-05-17 Thread Huang, Ying
Hi, Andrew,

Andrew Morton  writes:

> On Thu, 14 May 2020 15:04:24 +0800 Huang Ying  wrote:
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 35be7a7271f4..9f1343b066c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct 
>> *si,
>>   */
>>  
>>  si->flags += SWP_SCANNING;
>> -scan_base = offset = si->cluster_next;
>> +/*
>> + * Use percpu scan base for SSD to reduce lock contention on
>> + * cluster and swap cache.  For HDD, sequential access is more
>> + * important.
>> + */
>> +if (si->flags & SWP_SOLIDSTATE)
>> +scan_base = this_cpu_read(*si->cluster_next_cpu);
>> +else
>> +scan_base = si->cluster_next;
>> +offset = scan_base;
>
> Do we need to make SSD differ from spinning here?  Do bad things happen
> if !SWP_SOLIDSTATE devices use the per-cpu cache?

I think the swapout throughput may be affected, because if a per-CPU
cluster_next is used, HDD seeks become necessary when multiple CPUs swap
out.  But I just realized that the per-CPU swap slots cache will cause
seeks too.  If we really care about the performance of HDD swap, maybe
we should disable the per-CPU swap slots cache for HDD too?

>>  /* SSD algorithm */
>>  if (si->cluster_info) {
>> @@ -835,7 +844,10 @@ static int scan_swap_map_slots(struct swap_info_struct 
>> *si,
>>  unlock_cluster(ci);
>>  
>>  swap_range_alloc(si, offset, 1);
>> -si->cluster_next = offset + 1;
>> +if (si->flags & SWP_SOLIDSTATE)
>> +this_cpu_write(*si->cluster_next_cpu, offset + 1);
>> +else
>> +si->cluster_next = offset + 1;
>>  slots[n_ret++] = swp_entry(si->type, offset);
>>  
>>  /* got enough slots or reach max slots? */
>> @@ -2828,6 +2840,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
>>  if (!p)
>>  return ERR_PTR(-ENOMEM);
>> +p->cluster_next_cpu = alloc_percpu(unsigned int);
>> +if (!p->cluster_next_cpu) {
>> +kvfree(p);
>> +return ERR_PTR(-ENOMEM);
>> +}
>>  
>>  spin_lock(&swap_lock);
>>  for (type = 0; type < nr_swapfiles; type++) {
>> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct 
>> swap_info_struct *p,
>>  
>>  p->lowest_bit  = 1;
>>  p->cluster_next = 1;
>> +for_each_possible_cpu(i)
>> +per_cpu(*p->cluster_next_cpu, i) = 1;
>>  p->cluster_nr = 0;
>>  
>>  maxpages = max_swapfile_size();
>> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
>> specialfile, int, swap_flags)
>>   * SSD
>>   */
>>  p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>
> We shouldn't need to do this now?

Yes, thanks for pointing this out.  I will delete this in a future
version.

>> +for_each_possible_cpu(cpu) {
>> +per_cpu(*p->cluster_next_cpu, cpu) =
>> +1 + prandom_u32_max(p->highest_bit);
>> +}
>
> Would there be any benefit in spreading these out evenly?  Intervals of
> (p->highest_bit/num_possible_cpus())?  That would reduce collisions,
> but not for very long I guess.

These may be spread more evenly with intervals of
(p->highest_bit / num_possible_cpus()).  I just worry about the
situation where num_possible_cpus() >> num_online_cpus(), in which the
current method may work better.
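
For illustration, the even spreading suggested above might look like the
sketch below (hypothetical initialization code, not part of the posted
patch; it assumes the usual lowest_bit/highest_bit bounds set up during
swapon):

        int cpu;
        unsigned int nr = num_possible_cpus();
        unsigned int span = p->highest_bit - p->lowest_bit + 1;

        /* place the per-CPU scan bases at evenly spaced offsets */
        for_each_possible_cpu(cpu)
                per_cpu(*p->cluster_next_cpu, cpu) =
                        p->lowest_bit + (unsigned int)((u64)span * cpu / nr);

Whether this beats random placement depends on how sparse the online
CPUs are, as noted above.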

> Speaking of which, I wonder if there are failure modes in which all the
> CPUs end up getting into sync.
>
> And is it the case that if two or more CPUs have the same (or similar)
> per_cpu(*p->cluster_next_cpu, cpu), they'll each end up pointlessly
> scanning slots which another CPU has just scanned, thus rather
> defeating the purpose of having the cluster_next cache?
>
> IOW, should there be some additional collision avoidance scheme to
> prevent a CPU from pointing its cluster_ext into a 64MB trunk which
> another CPU is already using?

Yes, that sounds reasonable.  How about something like the below:

When the per-CPU cluster_next is assigned, if the new value is in a
different 64MB (or larger) trunk than the old value, we will assign a
random value between p->lowest_bit and p->highest_bit to the per-CPU
cluster_next instead.

This can reduce the possibility of collision to almost 0 if there are
enough free swap slots.
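
A minimal sketch of that idea (the helper name and exact placement are
hypothetical, not the code that was actually posted):

        static void set_percpu_cluster_next(struct swap_info_struct *si,
                                            unsigned long next)
        {
                unsigned long prev = this_cpu_read(*si->cluster_next_cpu);

                /* Crossed into a different 64MB trunk: pick a random new base. */
                if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
                    (next >> SWAP_ADDRESS_SPACE_SHIFT))
                        next = si->lowest_bit +
                                prandom_u32_max(si->highest_bit - si->lowest_bit + 1);
                this_cpu_write(*si->cluster_next_cpu, next);
        }

Here SWAP_ADDRESS_SPACE_SHIFT (one swap address space per 64MB) is
reused to detect when the base leaves its current trunk.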

[PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache

2020-05-14 Thread Huang Ying
In some swap scalability tests, it is found that there is heavy lock
contention on the swap cache even though we have split the single swap
cache radix tree per swap device into one radix tree per 64 MB trunk in
commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").

The reason is as follows.  After the swap device becomes fragmented so
that there's no free swap cluster, the swap device will be scanned
linearly to find free swap slots.  swap_info_struct->cluster_next
is the next scanning base that is shared by all CPUs.  So nearby free
swap slots will be allocated for different CPUs.  The probability for
multiple CPUs to operate on the same 64 MB trunk is high.  This causes
the lock contention on the swap cache.

To solve the issue, this patch adds a per-CPU next scanning base
(cluster_next_cpu) for SSD swap devices.  Every CPU will use its own
next scanning base, so the probability that multiple CPUs operate on
the same 64 MB trunk is greatly reduced, and so is the lock contention.
For HDD, because sequential access is more important for IO
performance, the original shared next scanning base is kept.

To test the patch, we ran a 16-process pmbench memory benchmark on a
2-socket server machine with 48 cores.  One ram disk is configured as
the swap device per socket.  The pmbench working-set size is much larger
than the available memory, so swapping is triggered.  The memory
read/write ratio is 80/20 and the access pattern is random.  In the
original implementation, the lock contention on the swap cache is
heavy.  The perf profiling data of the lock contention code paths is as
follows,

_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:  7.93
_raw_spin_lock_irqsave.__remove_mapping.shrink_page_list:   7.03
_raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page:   3.7
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.9
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  1.32
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 1.01
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:0.87

After applying this patch, it becomes,

_raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page:   3.99
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.0
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:  1.47
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:1.31
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 0.88
_raw_spin_lock.scan_swap_map_slots.get_swap_pages.get_swap_page:0.76
_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:  0.53

The lock contention on the swap cache is almost eliminated.

And the pmbench score increases by 15.9%.  The swapin throughput
increases by 16.2%, from 2.84 GB/s to 3.3 GB/s, while the swapout
throughput increases by 16.1%, from 2.87 GB/s to 3.33 GB/s.

Signed-off-by: "Huang, Ying" 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Tim Chen 
Cc: Hugh Dickins 
---
 include/linux/swap.h |  1 +
 mm/swapfile.c| 27 +--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..e96820fb7472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
unsigned int inuse_pages;   /* number of those currently in use */
unsigned int cluster_next;  /* likely index for next allocation */
unsigned int cluster_nr;/* countdown to next cluster search */
unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
struct rb_root swap_extent_root;/* root of the swap extent rbtree */
struct block_device *bdev;  /* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 35be7a7271f4..9f1343b066c1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 */
 
si->flags += SWP_SCANNING;
-   scan_base = offset = si->cluster_next;
+   /*
+* Use percpu scan base for SSD to reduce lock contention on
+* cluster and swap cache.  For HDD, sequential access is more
+* important.
+*/
+   if (si->flags & SWP_SOLIDSTATE)
+   scan_base = this_cpu_read(*si->cluster_next_cpu);
+   else
+   scan_base = si->cluster_next;
+   offset = scan_base;
 
/* SSD algorithm */
if (si->cluster_info) {
@@ -835,7 +844,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
unlock_cluster(ci);
 
swap_range_alloc(si, offset, 1);
-   si->cluster

[PATCH -V2] mm, swap: Use prandom_u32_max()

2020-05-12 Thread Huang Ying
Use prandom_u32_max() to improve the code readability and take
advantage of the common implementation.

Signed-off-by: "Huang, Ying" 
Acked-by: Michal Hocko 
Cc: Minchan Kim 
Cc: Tim Chen 
Cc: Hugh Dickins 
---

Changelog:

v2:

- Revise the patch description per Michal's comments.

---
 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a0a123e59ce6..2ec8b21201d6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3220,7 +3220,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 * select a random position to start with to help wear leveling
 * SSD
 */
-   p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
+   p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
-- 
2.26.2



Re: [PATCH] mm, swap: Use prandom_u32_max()

2020-05-12 Thread Huang, Ying
Michal Hocko  writes:

> On Tue 12-05-20 15:14:46, Huang, Ying wrote:
>> Michal Hocko  writes:
>> 
>> > On Tue 12-05-20 14:41:46, Huang Ying wrote:
>> >> To improve the code readability and get random number with higher
>> >> quality.
>> >
>> > I understand the readability argument but why should prandom_u32_max
>> > (which I was not aware of) provide a higher quality randomness?
>> 
>> I am not expert on random number generator.  I have heard about that the
>> randomness of the low order bits of some random number generator isn't
>> good enough.  Anyway, by using the common implementation, the real
>> random number generator expert can fix the possible issue once for all
>> users.
>
> Please drop the quality argument if you cannot really justify it. This
> will likely just confuse future readers the same way it confused me
> here. Because prandom_u32_max uses the same source of randomness the
> only difference is the way how modulo vs. u64 overflow arithmetic is
> used for distributing values. I am not aware the later would be
> a way to achieve a higher quality randomness. If the interval
> distribution is better with the later then it would be great to have it
> documented.

OK. Fair enough.
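
To spell out the difference described above: both forms reduce the same
prandom_u32() output to the range [0, n); they differ only in the
mapping used (a hedged sketch, not the kernel's exact helpers):

        /* modulo reduction, as in the old code */
        static inline u32 reduce_mod(u32 r, u32 n)
        {
                return r % n;
        }

        /* multiply-and-shift reduction, which is what prandom_u32_max() amounts to */
        static inline u32 reduce_mul_shift(u32 r, u32 n)
        {
                return (u32)(((u64)r * n) >> 32);
        }

Since both draw from the same generator, neither gives higher quality
randomness; they only distribute the values differently.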

Best Regards,
Huang, Ying

>> >> Signed-off-by: "Huang, Ying" 
>> >> Cc: Michal Hocko 
>> >> Cc: Minchan Kim 
>> >> Cc: Tim Chen 
>> >> Cc: Hugh Dickins 
>> >
>> > To the change itself
>> > Acked-by: Michal Hocko 
>> 
>> Thanks!
>> 
>> Best Regards,
>> Huang, Ying
>> 
>> >> ---
>> >>  mm/swapfile.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> 
>> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> index a0a123e59ce6..2ec8b21201d6 100644
>> >> --- a/mm/swapfile.c
>> >> +++ b/mm/swapfile.c
>> >> @@ -3220,7 +3220,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
>> >> specialfile, int, swap_flags)
>> >>* select a random position to start with to help wear leveling
>> >>* SSD
>> >>*/
>> >> - p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
>> >> + p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>> >>   nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>> >>  
>> >>   cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
>> >> -- 
>> >> 2.26.2


Re: [PATCH] mm, swap: Use prandom_u32_max()

2020-05-12 Thread Huang, Ying
Michal Hocko  writes:

> On Tue 12-05-20 14:41:46, Huang Ying wrote:
>> To improve the code readability and get random number with higher
>> quality.
>
> I understand the readability argument but why should prandom_u32_max
> (which I was not aware of) provide a higher quality randomness?

I am not an expert on random number generators.  I have heard that the
randomness of the low-order bits of some random number generators isn't
good enough.  Anyway, by using the common implementation, the real
random number generator experts can fix any possible issue once for all
users.

>> Signed-off-by: "Huang, Ying" 
>> Cc: Michal Hocko 
>> Cc: Minchan Kim 
>> Cc: Tim Chen 
>> Cc: Hugh Dickins 
>
> To the change itself
> Acked-by: Michal Hocko 

Thanks!

Best Regards,
Huang, Ying

>> ---
>>  mm/swapfile.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index a0a123e59ce6..2ec8b21201d6 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -3220,7 +3220,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
>> specialfile, int, swap_flags)
>>   * select a random position to start with to help wear leveling
>>   * SSD
>>   */
>> -p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
>> +p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>>  nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>>  
>>  cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
>> -- 
>> 2.26.2


[PATCH] mm, swap: Use prandom_u32_max()

2020-05-11 Thread Huang Ying
To improve the code readability and get random numbers with higher
quality.

Signed-off-by: "Huang, Ying" 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Tim Chen 
Cc: Hugh Dickins 
---
 mm/swapfile.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a0a123e59ce6..2ec8b21201d6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3220,7 +3220,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 * select a random position to start with to help wear leveling
 * SSD
 */
-   p->cluster_next = 1 + (prandom_u32() % p->highest_bit);
+   p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
-- 
2.26.2



Re: [PATCH 3/3] mm/swapfile.c: count won't be bigger than SWAP_MAP_MAX

2020-05-07 Thread Huang, Ying
Wei Yang  writes:

> On Wed, May 06, 2020 at 04:22:54PM +0800, Huang, Ying wrote:
>>Wei Yang  writes:
>>
>>> On Fri, May 01, 2020 at 03:48:53PM -0700, Andrew Morton wrote:
>>>>On Fri,  1 May 2020 01:52:59 + Wei Yang  
>>>>wrote:
>>>>
>>>>> When the condition is true, there are two possibilities:
>>>>
>>>>I'm struggling with this one.
>>>>
>>>>>1. count == SWAP_MAP_BAD
>>>>>2. count == (SWAP_MAP_MAX & COUNT_CONTINUED) == SWAP_MAP_SHMEM
>>>>
>>>>I'm not sure what 2. is trying to say.  For a start, (SWAP_MAP_MAX &
>>>>COUNT_CONTINUED) is zero.  I guess it meant "|"?
>>>
>>> Oops, you are right. It should be (SWAP_MAP_MAX | COUNT_CONTINUED).
>>>
>>> Sorry for the confusion.
>>>
>>>>
>>>>Also, the return value documentation says we return EINVAL for migration
>>>>entries.  Where's that happening, or is the comment out of date?
>>>>
>>>
>>> Not paid attention to this.
>>>
>>> Take look into the code, I don't find a relationship between the swap count
>>> and migration. Seems we just make a migration entry but not duplicate it.  
>>> If my understanding is correct.
>>
>>Per my understanding, one functionality of the error path is to catch
>>the behavior that shouldn't happen at all.  For example, if
>>__swap_duplicate() is called for the migration entry because of some
>>race condition.
>>
>
> If __swap_duplicate() run for a migration entry, it returns since
> get_swap_entry() couldn't find a swap_info_struct. So the return value is
> -EINVAL.
>
> While when this situation would happen? And the race condition you mean is?

Sorry for the confusion.  I don't mean that there are known race
conditions in the current kernel that will trigger the error code path;
I mean that we may use the error path to identify some race conditions
in the future.

I remember that Matthew thought the swap code should work reasonably
even for a garbage PTE.

Best Regards,
Huang, Ying

>>Best Regards,
>>Huang, Ying


Re: [PATCH 3/3] mm/swapfile.c: count won't be bigger than SWAP_MAP_MAX

2020-05-06 Thread Huang, Ying
Wei Yang  writes:

> On Fri, May 01, 2020 at 03:48:53PM -0700, Andrew Morton wrote:
>>On Fri,  1 May 2020 01:52:59 + Wei Yang  wrote:
>>
>>> When the condition is true, there are two possibilities:
>>
>>I'm struggling with this one.
>>
>>>1. count == SWAP_MAP_BAD
>>>2. count == (SWAP_MAP_MAX & COUNT_CONTINUED) == SWAP_MAP_SHMEM
>>
>>I'm not sure what 2. is trying to say.  For a start, (SWAP_MAP_MAX &
>>COUNT_CONTINUED) is zero.  I guess it meant "|"?
>
> Oops, you are right. It should be (SWAP_MAP_MAX | COUNT_CONTINUED).
>
> Sorry for the confusion.
>
>>
>>Also, the return value documentation says we return EINVAL for migration
>>entries.  Where's that happening, or is the comment out of date?
>>
>
> Not paid attention to this.
>
> Take look into the code, I don't find a relationship between the swap count
> and migration. Seems we just make a migration entry but not duplicate it.  
> If my understanding is correct.

In my understanding, one function of the error path is to catch
behavior that shouldn't happen at all, for example, if
__swap_duplicate() is called for a migration entry because of some race
condition.

Best Regards,
Huang, Ying


Re: [PATCH v2] mm/swapfile.c: simplify the scan loop in scan_swap_map_slots()

2020-04-28 Thread Huang, Ying
Wei Yang  writes:

> On Mon, Apr 27, 2020 at 08:55:33AM +0800, Huang, Ying wrote:
>>Wei Yang  writes:
>>
>>> On Sun, Apr 26, 2020 at 09:07:11AM +0800, Huang, Ying wrote:
>>>>Wei Yang  writes:
>>>>
>>>>> On Fri, Apr 24, 2020 at 10:02:58AM +0800, Huang, Ying wrote:
>>>>>>Wei Yang  writes:
>>>>>>
>>>>> [...]
>>>>>>>>
>>>>>>>>if "offset > si->highest_bit" is true and "offset < scan_base" is true,
>>>>>>>>scan_base need to be returned.
>>>>>>>>
>>>>>>>
>>>>>>> When this case would happen in the original code?
>>>>>>
>>>>>>In the original code, the loop can still stop.
>>>>>>
>>>>>
>>>>> Sorry, I don't get your point yet.
>>>>>
>>>>> In original code, there are two separate loops
>>>>>
>>>>> while (++offset <= si->highest_bit) {
>>>>> }
>>>>>
>>>>> while (offset < scan_base) {
>>>>> }
>>>>>
>>>>> And for your condition, (offset > highest_bit) && (offset < scan_base), 
>>>>> which
>>>>> terminates the first loop and fits the second loop well.
>>>>>
>>>>> Not sure how this condition would stop the loop in original code?
>>>>
>>>>Per my understanding, in your code, if some other task changes
>>>>si->highest_bit to be less than scan_base in parallel.  The loop may
>>>>cannot stop.
>>>
>>> When (offset > scan_base), (offset >  si->highest_bit) means offset will be
>>> set to si->lowest_bit.
>>>
>>> When (offset < scan_base), next_offset() would always increase offset till
>>> offset is scan_base.
>>>
>>> Sorry, I didn't catch your case. Would you minding giving more detail?
>>
>>Don't think in single thread model.  There's no lock to prevent other
>>tasks to change si->highest_bit simultaneously.  For example, task B may
>>change si->highest_bit to be less than scan_base in task A.
>>
>
> Yes, I am trying to think about it in parallel mode.
>
> Here are the cases, it might happen in parallel when task B change highest_bit
> to be less than scan_base.
>
> (1)
>  offset
>v
>   +---+--+
> ^   ^  ^
>   lowest_bit   highest_bitscan_base
>
>
> (2)
>offset
>  v
>   +---+--+
> ^   ^  ^
>   lowest_bit   highest_bitscan_base
>

This is the case I had in mind.  But my original understanding of your
code wasn't correct: as you said, the loop can stop because offset keeps
increasing.  Sorry about that.

But I still don't like the new code.  It's not as obvious as the
original one.

Best Regards,
Huang, Ying

> (3)
>    offset
>      v
>   +---+--+
> ^   ^  ^
>   lowest_bit   highest_bitscan_base
>
> Case (1), (offset > highest) && (offset > scan_base),  offset would be set to
> lowest_bit. This  looks good.
>
> Case (2), (offset > highest) && (offset < scan_base),  since offset is less
> than scan_base, it wouldn't be set to lowest. Instead it will continue to
> scan_base.
>
> Case (3), almost the same as Case (2).
>
> In Case (2) and (3), one thing interesting is the loop won't stop at
> highest_bit, while the behavior is the same as original code.
>
> Maybe your concern is this one? I still not figure out your point about the
> infinite loop. Hope you would share some light on it.
>
>
>>Best Regards,
>>Huang, Ying
>>
>>>>
>>>>Best Regards,
>>>>Huang, Ying
>>>>
>>>>>>Best Regards,
>>>>>>Huang, Ying
>>>>>>
>>>>>>>>Again, the new code doesn't make it easier to find this kind of issues.
>>>>>>>>
>>>>>>>>Best Regards,
>>>>>>>>Huang, Ying


Re: [PATCH RESEND] autonuma: Fix scan period updating

2019-07-29 Thread Huang, Ying
Mel Gorman  writes:

> On Mon, Jul 29, 2019 at 04:16:28PM +0800, Huang, Ying wrote:
>> Srikar Dronamraju  writes:
>> 
>> >> >> 
>> >> >> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >> >> slow down scanning
>> >> >> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>> >> >> if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >> >> speed up scanning
>> >> 
>> >> Thought about this again.  For example, a multi-threads workload runs on
>> >> a 4-sockets machine, and most memory accesses are shared.  The optimal
>> >> situation will be pseudo-interleaving, that is, spreading memory
>> >> accesses evenly among 4 NUMA nodes.  Where "share" >> "private", and
>> >> "remote" > "local".  And we should slow down scanning to reduce the
>> >> overhead.
>> >> 
>> >> What do you think about this?
>> >
>> > If all 4 nodes have equal access, then all 4 nodes will be active nodes.
>> >
>> > From task_numa_fault()
>> >
>> >if (!priv && !local && ng && ng->active_nodes > 1 &&
>> >numa_is_active_node(cpu_node, ng) &&
>> >numa_is_active_node(mem_node, ng))
>> >local = 1;
>> >
>> > Hence all accesses will be accounted as local. Hence scanning would slow
>> > down.
>> 
>> Yes.  You are right!  Thanks a lot!
>> 
>> There may be another case.  For example, a workload with 9 threads runs
>> on a 2-sockets machine, and most memory accesses are shared.  7 threads
>> runs on the node 0 and 2 threads runs on the node 1 based on CPU load
>> balancing.  Then the 2 threads on the node 1 will have "share" >>
>> "private" and "remote" >> "local".  But it doesn't help to speed up
>> scanning.
>> 
>
> Ok, so the results from the patch are mostly neutral. There are some
> small differences in scan rates depending on the workload but it's not
> universal and the headline performance is sometimes worse. I couldn't
> find something that would justify the change on its own.

Thanks a lot for your help!

> I think in the short term -- just fix the comments.

Then we would change the comment to something like,

"Slow down scanning if most memory accesses are private."

That is hard to understand.  Maybe we should just keep the code and
comments as they are until we have a better understanding.

> For the shared access consideration, the scan rate is important but so too
> is the decision on when pseudo interleaving should be used. Both should
> probably be taken into account when making changes in this area. The
> current code may not be optimal but it also has not generated bug reports,
> high CPU usage or obviously bad locality decision in the field.  Hence,
> for this patch or a similar series, it is critical that some workloads are
> selected that really care about the locality of shared access and evaluate
> based on that. Initially it was done with a large battery of tests run
> by different people but some of those people have changed role since and
> would not be in a position to rerun the tests. There also was the issue
> that when those were done, NUMA balancing was new so it's comparative
> baseline was "do nothing at all".

Yes.  I totally agree that we should change the behavior based on
testing.

Best Regards,
Huang, Ying


Re: [PATCH RESEND] autonuma: Fix scan period updating

2019-07-29 Thread Huang, Ying
Srikar Dronamraju  writes:

>> >> 
>> >> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >> slow down scanning
>> >> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>> >> if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> >> speed up scanning
>> 
>> Thought about this again.  For example, a multi-threads workload runs on
>> a 4-sockets machine, and most memory accesses are shared.  The optimal
>> situation will be pseudo-interleaving, that is, spreading memory
>> accesses evenly among 4 NUMA nodes.  Where "share" >> "private", and
>> "remote" > "local".  And we should slow down scanning to reduce the
>> overhead.
>> 
>> What do you think about this?
>
> If all 4 nodes have equal access, then all 4 nodes will be active nodes.
>
> From task_numa_fault()
>
>   if (!priv && !local && ng && ng->active_nodes > 1 &&
>   numa_is_active_node(cpu_node, ng) &&
>   numa_is_active_node(mem_node, ng))
>   local = 1;
>
> Hence all accesses will be accounted as local. Hence scanning would slow
> down.

Yes.  You are right!  Thanks a lot!

There may be another case.  For example, a workload with 9 threads runs
on a 2-socket machine, and most memory accesses are shared.  Based on
CPU load balancing, 7 threads run on node 0 and 2 threads run on node 1.
Then the 2 threads on node 1 will have "shared" >> "private" and
"remote" >> "local", but speeding up scanning doesn't help them.
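
To make that concrete, assuming NUMA_PERIOD_SLOTS == 10 and
NUMA_PERIOD_THRESHOLD == 7 (the values in kernel/sched/fair.c at the
time, stated here as an assumption) and the ratio definitions in
update_task_scan_period(), one of the two node 1 threads might see:

        local = 1,  remote = 9   =>  lr_ratio = 10 * 1 / (1 + 9) = 1
        private = 1, shared = 9  =>  sp_ratio = 10 * 9 / (1 + 9) = 9 >= 7

So with the fixed logic those threads slow down scanning, and, as argued
above, speeding up scanning would not have helped them anyway.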

Best Regards,
Huang, Ying


Re: [PATCH RESEND] autonuma: Fix scan period updating

2019-07-28 Thread Huang, Ying
Srikar Dronamraju  writes:

> * Huang, Ying  [2019-07-26 15:45:39]:
>
>> Hi, Srikar,
>> 
>> >
>> > More Remote + Private page Accesses:
>> > Most likely the Private accesses are going to be local accesses.
>> >
>> > In the unlikely event of the private accesses not being local, we should
>> > scan faster so that the memory and task consolidates.
>> >
>> > More Remote + Shared page Accesses: This means the workload has not
>> > consolidated and needs to scan faster. So we need to scan faster.
>> 
>> This sounds reasonable.  But
>> 
>> lr_ratio < NUMA_PERIOD_THRESHOLD
>> 
>> doesn't indicate More Remote.  If Local = Remote, it is also true.  If
>
> less lr_ratio means more remote.
>
>> there are also more Shared, we should slow down the scanning.  So, the
>
> Why should we slowing down if there are more remote shared accesses?
>
>> logic could be
>> 
>> if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> slow down scanning
>> else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>> if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
>> speed up scanning

I thought about this again.  For example, a multi-threaded workload
runs on a 4-socket machine, and most memory accesses are shared.  The
optimal situation will be pseudo-interleaving, that is, spreading memory
accesses evenly among the 4 NUMA nodes.  There, "shared" >> "private"
and "remote" > "local", and we should slow down scanning to reduce the
overhead.

What do you think about this?

Best Regards,
Huang, Ying

>> else
>> slow down scanning
>> } else
>>speed up scanning
>> 
>> This follows your idea better?
>> 
>> Best Regards,
>> Huang, Ying


Re: [PATCH RESEND] autonuma: Fix scan period updating

2019-07-26 Thread Huang, Ying
Hi, Srikar,

Srikar Dronamraju  writes:

> * Huang, Ying  [2019-07-25 16:01:24]:
>
>> From: Huang Ying 
>> 
>> From the commit log and comments of commit 37ec97deb3a8 ("sched/numa:
>> Slow down scan rate if shared faults dominate"), the autonuma scan
>> period should be increased (scanning is slowed down) if the majority
>> of the page accesses are shared with other processes.  But in current
>> code, the scan period will be decreased (scanning is speeded up) in
>> that situation.
>> 
>> The commit log and comments make more sense.  So this patch fixes the
>> code to make it match the commit log and comments.  And this has been
>> verified via tracing the scan period changing and /proc/vmstat
>> numa_pte_updates counter when running a multi-threaded memory
>> accessing program (most memory areas are accessed by multiple
>> threads).
>> 
>
> Lets split into 4 modes.
> More Local and Private Page Accesses:
> We definitely want to scan slowly i.e increase the scan window.
>
> More Local and Shared Page Accesses:
> We still want to scan slowly because we have consolidated and there is no
> point in scanning faster. So scan slowly + increase the scan window.
> (Do remember access on any active node counts as local!!!)
>
> More Remote + Private page Accesses:
> Most likely the Private accesses are going to be local accesses.
>
> In the unlikely event of the private accesses not being local, we should
> scan faster so that the memory and task consolidates.
>
> More Remote + Shared page Accesses: This means the workload has not
> consolidated and needs to scan faster. So we need to scan faster.

This sounds reasonable.  But

lr_ratio < NUMA_PERIOD_THRESHOLD

doesn't necessarily indicate more remote accesses; it is also true when
local equals remote.  And if there are also more shared accesses, we
should slow down the scanning.  So, the logic could be

if (lr_ratio >= NUMA_PERIOD_THRESHOLD)
slow down scanning
else if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
if (NUMA_PERIOD_SLOTS - lr_ratio >= NUMA_PERIOD_THRESHOLD)
speed up scanning
else
slow down scanning
} else
   speed up scanning

Does this follow your idea better?

Best Regards,
Huang, Ying

> So I would think we should go back to before 37ec97deb3a8.
>
> i.e 
>
>   int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
>
>   if (!slot)
>   slot = 1;
>   diff = slot * period_slot;
>
>
> No?
>
>> Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults 
>> dominate")
>> Signed-off-by: "Huang, Ying" 
>> Cc: Rik van Riel 
>> Cc: Peter Zijlstra (Intel) 
>> Cc: Mel Gorman 
>> Cc: jhla...@redhat.com
>> Cc: lvena...@redhat.com
>> Cc: Ingo Molnar 
>> Cc: Andrew Morton 
>> ---
>>  kernel/sched/fair.c | 20 ++--
>>  1 file changed, 10 insertions(+), 10 deletions(-)
>> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 036be95a87e9..468a1c5038b2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1940,7 +1940,7 @@ static void update_task_scan_period(struct task_struct 
>> *p,
>>  unsigned long shared, unsigned long private)
>>  {
>>  unsigned int period_slot;
>> -int lr_ratio, ps_ratio;
>> +int lr_ratio, sp_ratio;
>>  int diff;
>>  
>>  unsigned long remote = p->numa_faults_locality[0];
>> @@ -1971,22 +1971,22 @@ static void update_task_scan_period(struct 
>> task_struct *p,
>>   */
>>  period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
>>  lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
>> -ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
>> +sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (private + shared);
>>  
>> -if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
>> +if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
>>  /*
>> - * Most memory accesses are local. There is no need to
>> - * do fast NUMA scanning, since memory is already local.
>> + * Most memory accesses are shared with other tasks.
>> + * There is no point in continuing fast NUMA scanning,
>> + * since other tasks may just move the memory elsewhere.
>
> With this change, I would expect that with Shared page accesses,
> consolidation to take a hit.
>
>>   */
>> -int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
>> +int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;
>>

Re: kernel BUG at mm/swap_state.c:170!

2019-07-25 Thread Huang, Ying
Matthew Wilcox  writes:

> On Tue, Jul 23, 2019 at 01:08:42PM +0800, Huang, Ying wrote:
>> @@ -2489,6 +2491,14 @@ static void __split_huge_page(struct page *page, 
>> struct list_head *list,
>>  /* complete memcg works before add pages to LRU */
>>  mem_cgroup_split_huge_fixup(head);
>>  
>> +if (PageAnon(head) && PageSwapCache(head)) {
>> +swp_entry_t entry = { .val = page_private(head) };
>> +
>> +offset = swp_offset(entry);
>> +swap_cache = swap_address_space(entry);
>> +xa_lock(&swap_cache->i_pages);
>> +}
>> +
>>  for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
>>  __split_huge_page_tail(head, i, lruvec, list);
>>  /* Some pages can be beyond i_size: drop them from page cache */
>> @@ -2501,6 +2511,9 @@ static void __split_huge_page(struct page *page, 
>> struct list_head *list,
>>  } else if (!PageAnon(page)) {
>>  __xa_store(&head->mapping->i_pages, head[i].index,
>>  head + i, 0);
>> +} else if (swap_cache) {
>> +__xa_store(&swap_cache->i_pages, offset + i,
>> +   head + i, 0);
>
> I tried something along these lines (though I think I messed up the offset
> calculation which is why it wasn't working for me).  My other concern
> was with the case where SWAPFILE_CLUSTER was less than HPAGE_PMD_NR.
> Don't we need to drop the lock and look up a new swap_cache if offset >=
> SWAPFILE_CLUSTER?

In swapfile.c, there is

#ifdef CONFIG_THP_SWAP
#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
...
#else
#define SWAPFILE_CLUSTER	256
...
#endif

So, if a THP is in the swap cache, SWAPFILE_CLUSTER equals
HPAGE_PMD_NR.

And there is one swap address space for each 64MB of swap space, so one
THP will always be within a single swap address space.

In swap.h, there is

/* One swap address space for each 64M swap space */
#define SWAP_ADDRESS_SPACE_SHIFT	14
#define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
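
Spelling out the arithmetic (assuming x86-64 with 4KB pages):

        HPAGE_PMD_NR             =   512 slots =  2MB of swap
        SWAP_ADDRESS_SPACE_PAGES = 16384 slots = 64MB of swap

A THP's swap entries come from one whole, SWAPFILE_CLUSTER-aligned
cluster, and 16384 is a multiple of 512, so the 512 entries of a THP
never straddle two swap address spaces.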

Best Regards,
Huang, Ying


[PATCH RESEND] autonuma: Fix scan period updating

2019-07-25 Thread Huang, Ying
From: Huang Ying 

From the commit log and comments of commit 37ec97deb3a8 ("sched/numa:
Slow down scan rate if shared faults dominate"), the autonuma scan
period should be increased (scanning is slowed down) if the majority
of the page accesses are shared with other processes.  But in the
current code, the scan period is decreased (scanning is sped up) in
that situation.

The commit log and comments make more sense.  So this patch fixes the
code to make it match the commit log and comments.  And this has been
verified via tracing the scan period changing and /proc/vmstat
numa_pte_updates counter when running a multi-threaded memory
accessing program (most memory areas are accessed by multiple
threads).

Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults 
dominate")
Signed-off-by: "Huang, Ying" 
Cc: Rik van Riel 
Cc: Peter Zijlstra (Intel) 
Cc: Mel Gorman 
Cc: jhla...@redhat.com
Cc: lvena...@redhat.com
Cc: Ingo Molnar 
Cc: Andrew Morton 
---
 kernel/sched/fair.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..468a1c5038b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,7 +1940,7 @@ static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private)
 {
unsigned int period_slot;
-   int lr_ratio, ps_ratio;
+   int lr_ratio, sp_ratio;
int diff;
 
unsigned long remote = p->numa_faults_locality[0];
@@ -1971,22 +1971,22 @@ static void update_task_scan_period(struct task_struct *p,
 */
period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
-   ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
+   sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (private + shared);
 
-   if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
+   if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
/*
-* Most memory accesses are local. There is no need to
-* do fast NUMA scanning, since memory is already local.
+* Most memory accesses are shared with other tasks.
+* There is no point in continuing fast NUMA scanning,
+* since other tasks may just move the memory elsewhere.
 */
-   int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
+   int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;
if (!slot)
slot = 1;
diff = slot * period_slot;
} else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
/*
-* Most memory accesses are shared with other tasks.
-* There is no point in continuing fast NUMA scanning,
-* since other tasks may just move the memory elsewhere.
+* Most memory accesses are local. There is no need to
+* do fast NUMA scanning, since memory is already local.
 */
int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
if (!slot)
@@ -1998,7 +1998,7 @@ static void update_task_scan_period(struct task_struct *p,
 * yet they are not on the local NUMA node. Speed up
 * NUMA scanning to get the memory moved over.
 */
-   int ratio = max(lr_ratio, ps_ratio);
+   int ratio = max(lr_ratio, sp_ratio);
diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
}
 
-- 
2.20.1



Re: kernel BUG at mm/swap_state.c:170!

2019-07-25 Thread Huang, Ying
Mikhail Gavrilov  writes:

> On Tue, 23 Jul 2019 at 10:08, Huang, Ying  wrote:
>>
>> Thanks!  I have found another (easier way) to reproduce the panic.
>> Could you try the below patch on top of v5.2-rc2?  It can fix the panic
>> for me.
>>
>
> Thanks! Amazing work! The patch fixes the issue completely. The system
> worked at a high load of 16 hours without failures.

Thanks a lot for your help!

Hi, Matthew and Kirill,

I think we can fold this fix patch into your original patch and try
again.

> But still seems to me that page cache is being too actively crowded
> out with a lack of memory. Since, in addition to the top speed SSD on
> which the swap is located, there is also the slow HDD in the system
> that just starts to rustle continuously when swap being used. It would
> seem better to push some of the RAM onto a fast SSD into the swap
> partition than to leave the slow HDD without a cache.
>
> https://imgur.com/a/e8TIkBa
>
> But I am afraid it will be difficult to implement such an algorithm
> that analyzes the waiting time for the file I/O and waiting for paging
> (memory) and decides to leave parts in memory where the waiting time
> is more higher it would be more efficient for systems with several
> drives with access speeds can vary greatly. By waiting time I mean
> waiting time reading/writing to storage multiplied on the count of
> hits. Thus, we will not just keep in memory the most popular parts of
> the memory/disk, but also those parts of which read/write where was
> most costly.

Yes, this is a valid problem.  I remember Johannes had a solution for
it long ago, but I don't know why he gave up on it.  Some information
can be found at the following URL.

https://lwn.net/Articles/690079/

Best Regards,
Huang, Ying

> --
> Best Regards,
> Mike Gavrilov.


Re: kernel BUG at mm/swap_state.c:170!

2019-07-22 Thread Huang, Ying
Mikhail Gavrilov  writes:

> On Mon, 22 Jul 2019 at 12:53, Huang, Ying  wrote:
>>
>> Yes.  This is quite complex.  Is the transparent huge page enabled in
>> your system?  You can check the output of
>>
>> $ cat /sys/kernel/mm/transparent_hugepage/enabled
>
> always [madvise] never
>
>> And, whether is the swap device you use a SSD or NVMe disk (not HDD)?
>
> NVMe INTEL Optane 905P SSDPE21D480GAM3

Thanks!  I have found another (easier) way to reproduce the panic.
Could you try the patch below on top of v5.2-rc2?  It fixes the panic
for me.

Best Regards,
Huang, Ying

---8<--
From 5e519c2de54b9fd4b32b7a59e47ce7f94beb8845 Mon Sep 17 00:00:00 2001
From: Huang Ying 
Date: Tue, 23 Jul 2019 08:49:57 +0800
Subject: [PATCH] dbg xa head

---
 mm/huge_memory.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f8bce9a6b32..c6ca1c7157ed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2482,6 +2482,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
struct page *head = compound_head(page);
pg_data_t *pgdat = page_pgdat(head);
struct lruvec *lruvec;
+   struct address_space *swap_cache = NULL;
+   unsigned long offset;
int i;
 
lruvec = mem_cgroup_page_lruvec(head, pgdat);
@@ -2489,6 +2491,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);
 
+   if (PageAnon(head) && PageSwapCache(head)) {
+   swp_entry_t entry = { .val = page_private(head) };
+
+   offset = swp_offset(entry);
+   swap_cache = swap_address_space(entry);
+   xa_lock(&swap_cache->i_pages);
+   }
+
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
/* Some pages can be beyond i_size: drop them from page cache */
@@ -2501,6 +2511,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
} else if (!PageAnon(page)) {
__xa_store(&head->mapping->i_pages, head[i].index,
head + i, 0);
+   } else if (swap_cache) {
+   __xa_store(&swap_cache->i_pages, offset + i,
+  head + i, 0);
}
}
 
@@ -2508,9 +2521,10 @@ static void __split_huge_page(struct page *page, struct list_head *list,
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
/* Additional pin to swap cache */
-   if (PageSwapCache(head))
+   if (PageSwapCache(head)) {
page_ref_add(head, 2);
-   else
+   xa_unlock(&swap_cache->i_pages);
+   } else
page_ref_inc(head);
} else {
/* Additional pin to page cache */
-- 
2.20.1



Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression

2019-07-22 Thread Huang, Ying
"Huang, Ying"  writes:

> Rong Chen  writes:
>
>> On 6/26/19 11:17 AM, Josef Bacik wrote:
>>> On Wed, Jun 26, 2019 at 10:39:36AM +0800, Rong Chen wrote:
>>>> On 6/25/19 10:22 PM, Josef Bacik wrote:
>>>>> On Fri, Jun 21, 2019 at 08:48:03AM +0800, Huang, Ying wrote:
>>>>>> "Huang, Ying"  writes:
>>>>>>
>>>>>>> "Huang, Ying"  writes:
>>>>>>>
>>>>>>>> Hi, Josef,
>>>>>>>>
>>>>>>>> kernel test robot  writes:
>>>>>>>>
>>>>>>>>> Greeting,
>>>>>>>>>
>>>>>>>>> FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to 
>>>>>>>>> commit:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve
>>>>>>>>> delalloc metadata differently")
>>>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>>>>>
>>>>>>>>> in testcase: aim7
>>>>>>>>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @
>>>>>>>>> 3.00GHz with 384G memory
>>>>>>>>> with following parameters:
>>>>>>>>>
>>>>>>>>>   disk: 4BRD_12G
>>>>>>>>>   md: RAID0
>>>>>>>>>   fs: btrfs
>>>>>>>>>   test: disk_rr
>>>>>>>>>   load: 1500
>>>>>>>>>   cpufreq_governor: performance
>>>>>>>>>
>>>>>>>>> test-description: AIM7 is a traditional UNIX system level benchmark
>>>>>>>>> suite which is used to test and measure the performance of multiuser
>>>>>>>>> system.
>>>>>>>>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>>>>>>> Here's another regression, do you have time to take a look at this?
>>>>>>> Ping
>>>>>> Ping again ...
>>>>>>
>>>>> Finally got time to look at this but I can't get the reproducer to work
>>>>>
>>>>> root@destiny ~/lkp-tests# bin/lkp run ~/job-aim.yaml
>>>>> Traceback (most recent call last):
>>>>>   11: from /root/lkp-tests/bin/run-local:18:in `'
>>>>>   10: from 
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>>9: from 
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>>8: from /root/lkp-tests/lib/yaml.rb:5:in `'
>>>>>7: from 
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>>6: from 
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>>5: from /root/lkp-tests/lib/common.rb:9:in `'
>>>>>4: from 
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>>3: from 
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>>2: from /root/lkp-tests/lib/array_ext.rb:3:in `>>>> (required)>'
>>>>>1: from 
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require': 
>>>>> cannot load such file -- active_support/core_ext/enumerable (LoadError)
>>>> Hi Josef,
>>>>
>>>> I tried the latest lkp-tests, and didn't have the problem. Could you please
>>>> update the lkp-tests repo and run "lkp install" again?
>>>>
>>> I updated it this morning, and I just updated it now, my tree is on
>>>
>>> 2c5b1a95b08dbe81bba64419c482a877a3b424ac
>>>
>>> lkp install says everything is installed except
>>>
>>> No match for argument: libipc-run-perl
>>
>> I've just fixed it. could you add "libipc-run-perl: perl-IPC-Run" to
>> the end of distro/adaptation/fedora?
>>
>> Thanks,
>> Rong Chen
>>
>>
>>>
>>> and it still doesn't run properly.  Thanks,
>
> Hi, Josef,
>
> Do you have time to try it again?  The latest lkp-tests code has the fix 
> merged.

Ping...

Best Regards,
Huang, Ying


Re: kernel BUG at mm/swap_state.c:170!

2019-07-22 Thread Huang, Ying
Mikhail Gavrilov  writes:

> On Mon, 22 Jul 2019 at 06:37, huang ying  wrote:
>>
>> I am trying to reproduce this bug.  Can you give me some information
>> about your test case?
>
> It not easy, but I try to explain:
>
> 1. I have the system with 32Gb RAM, 64GB swap and after boot, I always
> launch follow applications:
> a. Google Chrome dev channel
> Note: here you should have 3 windows full of tabs on my
> monitor 118 tabs in each window.
> Don't worry modern Chrome browser is wise and load tabs only on 
> demand.
> We will use this feature later (on the last step).
> b. Firefox Nightly ASAN this build with enabled address sanitizer.
> c. Virtual Machine Manager (virt-manager) and start a virtual
> machine with Windows 10 (2048 MiB RAM allocated)
> d. Evolution
> e. Steam client
> f. Telegram client
> g. DeadBeef music player
>
> After all launched applications 15GB RAM should be allocated.
>
> 2. This step the most difficult, because we should by using Firefox
> allocated 27-28GB RAM.
> I use the infinite scroll on sites Facebook, VK, Pinterest, Tumblr
> and open many tabs in Firefox as I could.
> Note: our goal is 27-28GB allocated RAM in the system.
>
> 3. When we hit our goal in the second step now go to Google Chrome and
> click as fast as you can on all unloaded tabs.
> As usual, after 60 tabs this issue usually happens. 100%
> reproducible for me.
>
> Of course, I tried to simplify my workflow case by using stress-ng but
> without success.
>
> I hope it will help to make autotests.

Yes, this is quite complex.  Is transparent huge page enabled in your
system?  You can check the output of

$ cat /sys/kernel/mm/transparent_hugepage/enabled

And is the swap device you use an SSD or NVMe disk (not an HDD)?

Best Regards,
Huang, Ying

> --
> Best Regards,
> Mike Gavrilov.


Re: kernel BUG at mm/swap_state.c:170!

2019-07-21 Thread huang ying
Hi, Mikhail,

On Wed, May 29, 2019 at 12:05 PM Mikhail Gavrilov
 wrote:
>
> Hi folks.
> I am observed kernel panic after update to git tag 5.2-rc2.
> This crash happens at memory pressing when swap being used.
>
> Unfortunately in journalctl saved only this:
>
> May 29 08:02:02 localhost.localdomain kernel: page:e9095823
> refcount:1 mapcount:1 mapping:8f3ffeb36949 index:0x625002ab2
> May 29 08:02:02 localhost.localdomain kernel: anon
> May 29 08:02:02 localhost.localdomain kernel: flags:
> 0x17fffe00080034(uptodate|lru|active|swapbacked)
> May 29 08:02:02 localhost.localdomain kernel: raw: 0017fffe00080034
> e90944640888 e90956e208c8 8f3ffeb36949
> May 29 08:02:02 localhost.localdomain kernel: raw: 000625002ab2
>  0001 8f41aeeff000
> May 29 08:02:02 localhost.localdomain kernel: page dumped because:
> VM_BUG_ON_PAGE(entry != page)
> May 29 08:02:02 localhost.localdomain kernel: 
> page->mem_cgroup:8f41aeeff000
> May 29 08:02:02 localhost.localdomain kernel: [ cut here
> ]
> May 29 08:02:02 localhost.localdomain kernel: kernel BUG at 
> mm/swap_state.c:170!

I am trying to reproduce this bug.  Can you give me some information
about your test case?

Best Regards,
Huang, Ying


Re: [PATCH -mm] autonuma: Fix scan period updating

2019-07-15 Thread Huang, Ying
Mel Gorman  writes:

> On Fri, Jul 12, 2019 at 06:48:05PM +0800, Huang, Ying wrote:
>> > Ordinarily I would hope that the patch was motivated by observed
>> > behaviour so you have a metric for goodness. However, for NUMA balancing
>> > I would typically run basic workloads first -- dbench, tbench, netperf,
>> > hackbench and pipetest. The objective would be to measure the degree
>> > automatic NUMA balancing is interfering with a basic workload to see if
>> > they patch reduces the number of minor faults incurred even though there
>> > is no NUMA balancing to be worried about. This measures the general
>> > overhead of a patch. If your reasoning is correct, you'd expect lower
>> > overhead.
>> >
>> > For balancing itself, I usually look at Andrea's original autonuma
>> > benchmark, NAS Parallel Benchmark (D class usually although C class for
>> > much older or smaller machines) and spec JBB 2005 and 2015. Of the JBB
>> > benchmarks, 2005 is usually more reasonable for evaluating NUMA balancing
>> > than 2015 is (which can be unstable for a variety of reasons). In this
>> > case, I would be looking at whether the overhead is reduced, whether the
>> > ratio of local hits is the same or improved and the primary metric of
>> > each (time to completion for Andrea's and NAS, throughput for JBB).
>> >
>> > Even if there is no change to locality and the primary metric but there
>> > is less scanning and overhead overall, it would still be an improvement.
>> 
>> Thanks a lot for your detailed guidance.
>> 
>
> No problem.
>
>> > If you have trouble doing such an evaluation, I'll queue tests if they
>> > are based on a patch that addresses the specific point of concern (scan
>> > period not updated) as it's still not obvious why flipping the logic of
>> > whether shared or private is considered was necessary.
>> 
>> I can do the evaluation, but it will take quite some time for me to
>> setup and run all these benchmarks.  So if these benchmarks have already
>> been setup in your environment, so that your extra effort is minimal, it
>> will be great if you can queue tests for the patch.  Feel free to reject
>> me for any inconvenience.
>> 
>
> They're not setup as such, but my testing infrastructure is heavily
> automated so it's easy to do and I think it's worth looking at. If you
> update your patch to target just the scan period aspects, I'll queue it
> up and get back to you. It usually takes a few days for the automation
> to finish whatever it's doing and pick up a patch for evaluation.

Thanks a lot for your help!  The updated patch is as follows.  It
targets only the scan period aspects.

Best Regards,
Huang, Ying

--8<
From 910a52cbf5a521c1562a573904c9507d0367bb0f Mon Sep 17 00:00:00 2001
From: Huang Ying 
Date: Sat, 22 Jun 2019 17:36:29 +0800
Subject: [PATCH] autonuma: Fix scan period updating

From the commit log and comments of commit 37ec97deb3a8 ("sched/numa:
Slow down scan rate if shared faults dominate"), the autonuma scan
period should be increased (scanning is slowed down) if the majority
of the page accesses are shared with other processes.  But in the
current code, the scan period is decreased (scanning is sped up) in
that situation.

The commit log and comments make more sense, so this patch fixes the
code to match them.  The fix has been verified by tracing the scan
period changes and the /proc/vmstat numa_pte_updates counter while
running a multi-threaded memory-accessing program (most memory areas
are accessed by multiple threads).

Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults 
dominate")
Signed-off-by: "Huang, Ying" 
Cc: Rik van Riel 
Cc: Peter Zijlstra (Intel) 
Cc: Mel Gorman 
Cc: jhla...@redhat.com
Cc: lvena...@redhat.com
Cc: Ingo Molnar 
---
 kernel/sched/fair.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..468a1c5038b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,7 +1940,7 @@ static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private)
 {
unsigned int period_slot;
-   int lr_ratio, ps_ratio;
+   int lr_ratio, sp_ratio;
int diff;
 
unsigned long remote = p->numa_faults_locality[0];
@@ -1971,22 +1971,22 @@ static void update_task_scan_period(struct task_struct 
*p,
 */
period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PE

Re: [PATCH -mm] autonuma: Fix scan period updating

2019-07-12 Thread Huang, Ying
Mel Gorman  writes:

> On Thu, Jul 04, 2019 at 08:32:06AM +0800, Huang, Ying wrote:
>> Mel Gorman  writes:
>> 
>> > On Tue, Jun 25, 2019 at 09:23:22PM +0800, huang ying wrote:
>> >> On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman  wrote:
>> >> >
>> >> > On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
>> >> > > The autonuma scan period should be increased (scanning is slowed down)
>> >> > > if the majority of the page accesses are shared with other processes.
>> >> > > But in current code, the scan period will be decreased (scanning is
>> >> > > speeded up) in that situation.
>> >> > >
>> >> > > This patch fixes the code.  And this has been tested via tracing the
>> >> > > scan period changing and /proc/vmstat numa_pte_updates counter when
>> >> > > running a multi-threaded memory accessing program (most memory
>> >> > > areas are accessed by multiple threads).
>> >> > >
>> >> >
>> >> > The patch somewhat flips the logic on whether shared or private is
>> >> > considered and it's not immediately obvious why that was required. That
>> >> > aside, other than the impact on numa_pte_updates, what actual
>> >> > performance difference was measured and on on what workloads?
>> >> 
>> >> The original scanning period updating logic doesn't match the original
>> >> patch description and comments.  I think the original patch
>> >> description and comments make more sense.  So I fix the code logic to
>> >> make it match the original patch description and comments.
>> >> 
>> >> If my understanding to the original code logic and the original patch
>> >> description and comments were correct, do you think the original patch
>> >> description and comments are wrong so we need to fix the comments
>> >> instead?  Or you think we should prove whether the original patch
>> >> description and comments are correct?
>> >> 
>> >
>> > I'm about to get knocked offline so cannot answer properly. The code may
>> > indeed be wrong and I have observed higher than expected NUMA scanning
>> > behaviour than expected although not enough to cause problems. A comment
>> > fix is fine but if you're changing the scanning behaviour, it should be
>> > backed up with data justifying that the change both reduces the observed
>> > scanning and that it has no adverse performance implications.
>> 
>> Got it!  Thanks for comments!  As for performance testing, do you have
>> some candidate workloads?
>> 
>
> Ordinarily I would hope that the patch was motivated by observed
> behaviour so you have a metric for goodness. However, for NUMA balancing
> I would typically run basic workloads first -- dbench, tbench, netperf,
> hackbench and pipetest. The objective would be to measure the degree
> automatic NUMA balancing is interfering with a basic workload to see if
> they patch reduces the number of minor faults incurred even though there
> is no NUMA balancing to be worried about. This measures the general
> overhead of a patch. If your reasoning is correct, you'd expect lower
> overhead.
>
> For balancing itself, I usually look at Andrea's original autonuma
> benchmark, NAS Parallel Benchmark (D class usually although C class for
> much older or smaller machines) and spec JBB 2005 and 2015. Of the JBB
> benchmarks, 2005 is usually more reasonable for evaluating NUMA balancing
> than 2015 is (which can be unstable for a variety of reasons). In this
> case, I would be looking at whether the overhead is reduced, whether the
> ratio of local hits is the same or improved and the primary metric of
> each (time to completion for Andrea's and NAS, throughput for JBB).
>
> Even if there is no change to locality and the primary metric but there
> is less scanning and overhead overall, it would still be an improvement.

Thanks a lot for your detailed guidance.

> If you have trouble doing such an evaluation, I'll queue tests if they
> are based on a patch that addresses the specific point of concern (scan
> period not updated) as it's still not obvious why flipping the logic of
> whether shared or private is considered was necessary.

I can do the evaluation, but it will take quite some time for me to
set up and run all these benchmarks.  So if these benchmarks are
already set up in your environment, so that your extra effort is
minimal, it would be great if you could queue tests for the patch.
Feel free to decline if this is inconvenient.

Best Regards,
Huang, Ying


Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression

2019-07-08 Thread Huang, Ying
Rong Chen  writes:

> On 6/26/19 11:17 AM, Josef Bacik wrote:
>> On Wed, Jun 26, 2019 at 10:39:36AM +0800, Rong Chen wrote:
>>> On 6/25/19 10:22 PM, Josef Bacik wrote:
>>>> On Fri, Jun 21, 2019 at 08:48:03AM +0800, Huang, Ying wrote:
>>>>> "Huang, Ying"  writes:
>>>>>
>>>>>> "Huang, Ying"  writes:
>>>>>>
>>>>>>> Hi, Josef,
>>>>>>>
>>>>>>> kernel test robot  writes:
>>>>>>>
>>>>>>>> Greeting,
>>>>>>>>
>>>>>>>> FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to commit:
>>>>>>>>
>>>>>>>>
>>>>>>>> commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve
>>>>>>>> delalloc metadata differently")
>>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>>>>
>>>>>>>> in testcase: aim7
>>>>>>>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz 
>>>>>>>> with 384G memory
>>>>>>>> with following parameters:
>>>>>>>>
>>>>>>>>disk: 4BRD_12G
>>>>>>>>md: RAID0
>>>>>>>>fs: btrfs
>>>>>>>>test: disk_rr
>>>>>>>>load: 1500
>>>>>>>>cpufreq_governor: performance
>>>>>>>>
>>>>>>>> test-description: AIM7 is a traditional UNIX system level benchmark
>>>>>>>> suite which is used to test and measure the performance of multiuser
>>>>>>>> system.
>>>>>>>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>>>>>> Here's another regression, do you have time to take a look at this?
>>>>>> Ping
>>>>> Ping again ...
>>>>>
>>>> Finally got time to look at this but I can't get the reproducer to work
>>>>
>>>> root@destiny ~/lkp-tests# bin/lkp run ~/job-aim.yaml
>>>> Traceback (most recent call last):
>>>>   11: from /root/lkp-tests/bin/run-local:18:in `'
>>>>   10: from 
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>9: from 
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>8: from /root/lkp-tests/lib/yaml.rb:5:in `'
>>>>7: from 
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>6: from 
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>5: from /root/lkp-tests/lib/common.rb:9:in `'
>>>>4: from 
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>3: from 
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>>2: from /root/lkp-tests/lib/array_ext.rb:3:in `'
>>>>1: from 
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require'
>>>> /usr/share/rubygems/rubygems/core_ext/kernel_require.rb:54:in `require': 
>>>> cannot load such file -- active_support/core_ext/enumerable (LoadError)
>>> Hi Josef,
>>>
>>> I tried the latest lkp-tests, and didn't have the problem. Could you please
>>> update the lkp-tests repo and run "lkp install" again?
>>>
>> I updated it this morning, and I just updated it now, my tree is on
>>
>> 2c5b1a95b08dbe81bba64419c482a877a3b424ac
>>
>> lkp install says everything is installed except
>>
>> No match for argument: libipc-run-perl
>
> I've just fixed it. could you add "libipc-run-perl: perl-IPC-Run" to
> the end of distro/adaptation/fedora?
>
> Thanks,
> Rong Chen
>
>
>>
>> and it still doesn't run properly.  Thanks,

Hi, Josef,

Do you have time to try it again?  The latest lkp-tests code has the fix merged.

Best Regards,
Huang, Ying

>>
>> Josef


Re: [PATCH -mm] autonuma: Fix scan period updating

2019-07-03 Thread Huang, Ying
Mel Gorman  writes:

> On Tue, Jun 25, 2019 at 09:23:22PM +0800, huang ying wrote:
>> On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman  wrote:
>> >
>> > On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
>> > > The autonuma scan period should be increased (scanning is slowed down)
>> > > if the majority of the page accesses are shared with other processes.
>> > > But in current code, the scan period will be decreased (scanning is
>> > > speeded up) in that situation.
>> > >
>> > > This patch fixes the code.  And this has been tested via tracing the
>> > > scan period changing and /proc/vmstat numa_pte_updates counter when
>> > > running a multi-threaded memory accessing program (most memory
>> > > areas are accessed by multiple threads).
>> > >
>> >
>> > The patch somewhat flips the logic on whether shared or private is
>> > considered and it's not immediately obvious why that was required. That
>> > aside, other than the impact on numa_pte_updates, what actual
>> > performance difference was measured and on on what workloads?
>> 
>> The original scanning period updating logic doesn't match the original
>> patch description and comments.  I think the original patch
>> description and comments make more sense.  So I fix the code logic to
>> make it match the original patch description and comments.
>> 
>> If my understanding to the original code logic and the original patch
>> description and comments were correct, do you think the original patch
>> description and comments are wrong so we need to fix the comments
>> instead?  Or you think we should prove whether the original patch
>> description and comments are correct?
>> 
>
> I'm about to get knocked offline so cannot answer properly. The code may
> indeed be wrong and I have observed higher than expected NUMA scanning
> behaviour than expected although not enough to cause problems. A comment
> fix is fine but if you're changing the scanning behaviour, it should be
> backed up with data justifying that the change both reduces the observed
> scanning and that it has no adverse performance implications.

Got it!  Thanks for comments!  As for performance testing, do you have
some candidate workloads?

Best Regards,
Huang, Ying


Re: [PATCH -mm] autonuma: Fix scan period updating

2019-06-25 Thread huang ying
On Mon, Jun 24, 2019 at 10:25 PM Mel Gorman  wrote:
>
> On Mon, Jun 24, 2019 at 10:56:04AM +0800, Huang Ying wrote:
> > The autonuma scan period should be increased (scanning is slowed down)
> > if the majority of the page accesses are shared with other processes.
> > But in current code, the scan period will be decreased (scanning is
> > speeded up) in that situation.
> >
> > This patch fixes the code.  And this has been tested via tracing the
> > scan period changing and /proc/vmstat numa_pte_updates counter when
> > running a multi-threaded memory accessing program (most memory
> > areas are accessed by multiple threads).
> >
>
> The patch somewhat flips the logic on whether shared or private is
> considered and it's not immediately obvious why that was required. That
> aside, other than the impact on numa_pte_updates, what actual
> performance difference was measured and on on what workloads?

The original scan period updating logic doesn't match the original
patch description and comments.  I think the original patch
description and comments make more sense, so I fixed the code logic to
match them.

If my understanding of the original code logic and of the original
patch description and comments is correct, do you think the original
patch description and comments are wrong, so that we need to fix the
comments instead?  Or do you think we should prove whether the original
patch description and comments are correct?
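
To make the intended behaviour concrete, here is a minimal userspace
sketch of the fixed decision, simplified from update_task_scan_period().
The constant values (NUMA_PERIOD_SLOTS = 10, NUMA_PERIOD_THRESHOLD = 7)
and the fault counts below are assumptions used only for illustration:

#include <stdio.h>

#define NUMA_PERIOD_SLOTS	10
#define NUMA_PERIOD_THRESHOLD	7

int main(void)
{
	unsigned long shared = 800, priv = 200;	/* assumed fault counts */
	unsigned long period = 1000;		/* assumed scan period (ms) */
	unsigned long period_slot =
		(period + NUMA_PERIOD_SLOTS - 1) / NUMA_PERIOD_SLOTS;
	int sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (priv + shared);

	if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
		/* shared faults dominate: increase period (slow down) */
		int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;

		if (!slot)
			slot = 1;
		period += slot * period_slot;
	}
	printf("sp_ratio=%d, new scan period=%lu ms\n", sp_ratio, period);
	return 0;
}

With 80% shared faults, sp_ratio is 8, so the scan period grows
(scanning slows down), which is what the commit log of 37ec97deb3a8
describes.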

Best Regards,
Huang, Ying


[PATCH -mm -V2] mm, swap: Fix THP swap out

2019-06-24 Thread Huang, Ying
From: Huang Ying 

0-Day test system reported some OOM regressions for several
THP (Transparent Huge Page) swap test cases.  These regressions are
bisected to 6861428921b5 ("block: always define BIO_MAX_PAGES as
256").  In the commit, BIO_MAX_PAGES is set to 256 even when THP swap
is enabled.  So the bio_alloc(gfp_flags, 512) in get_swap_bio() may
fail when swapping out THP.  That causes the OOM.

As in the patch description of 6861428921b5 ("block: always define
BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write
THP to swap space.  So the issue is fixed via doing that in
get_swap_bio().

BTW: I remember that I checked the THP swap code when
6861428921b5 ("block: always define BIO_MAX_PAGES as 256") was merged,
and thought the THP swap code needn't be changed.  But apparently,
I was wrong.  I should have done this at that time.

Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
Signed-off-by: "Huang, Ying" 
Cc: Ming Lei 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Daniel Jordan 

Changelogs:

V2:

- Replace __bio_add_page() with bio_add_page() per Ming's comments.

---
 mm/page_io.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 2e8019d0e048..189415852077 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -29,10 +29,9 @@
 static struct bio *get_swap_bio(gfp_t gfp_flags,
struct page *page, bio_end_io_t end_io)
 {
-   int i, nr = hpage_nr_pages(page);
struct bio *bio;
 
-   bio = bio_alloc(gfp_flags, nr);
+   bio = bio_alloc(gfp_flags, 1);
if (bio) {
struct block_device *bdev;
 
@@ -41,9 +40,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
bio->bi_end_io = end_io;
 
-   for (i = 0; i < nr; i++)
-   bio_add_page(bio, page + i, PAGE_SIZE, 0);
-   VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr);
+   bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
}
return bio;
 }
-- 
2.20.1



Re: [PATCH -mm] mm, swap: Fix THP swap out

2019-06-24 Thread Huang, Ying
Ming Lei  writes:

> On Mon, Jun 24, 2019 at 12:44:41PM +0800, Huang, Ying wrote:
>> Ming Lei  writes:
>> 
>> > Hi Huang Ying,
>> >
>> > On Mon, Jun 24, 2019 at 10:23:36AM +0800, Huang, Ying wrote:
>> >> From: Huang Ying 
>> >> 
>> >> 0-Day test system reported some OOM regressions for several
>> >> THP (Transparent Huge Page) swap test cases.  These regressions are
>> >> bisected to 6861428921b5 ("block: always define BIO_MAX_PAGES as
>> >> 256").  In the commit, BIO_MAX_PAGES is set to 256 even when THP swap
>> >> is enabled.  So the bio_alloc(gfp_flags, 512) in get_swap_bio() may
>> >> fail when swapping out THP.  That causes the OOM.
>> >> 
>> >> As in the patch description of 6861428921b5 ("block: always define
>> >> BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write
>> >> THP to swap space.  So the issue is fixed via doing that in
>> >> get_swap_bio().
>> >> 
>> >> BTW: I remember I have checked the THP swap code when
>> >> 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") was merged,
>> >> and thought the THP swap code needn't to be changed.  But apparently,
>> >> I was wrong.  I should have done this at that time.
>> >> 
>> >> Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
>> >> Signed-off-by: "Huang, Ying" 
>> >> Cc: Ming Lei 
>> >> Cc: Michal Hocko 
>> >> Cc: Johannes Weiner 
>> >> Cc: Hugh Dickins 
>> >> Cc: Minchan Kim 
>> >> Cc: Rik van Riel 
>> >> Cc: Daniel Jordan 
>> >> ---
>> >>  mm/page_io.c | 7 ++-
>> >>  1 file changed, 2 insertions(+), 5 deletions(-)
>> >> 
>> >> diff --git a/mm/page_io.c b/mm/page_io.c
>> >> index 2e8019d0e048..4ab997f84061 100644
>> >> --- a/mm/page_io.c
>> >> +++ b/mm/page_io.c
>> >> @@ -29,10 +29,9 @@
>> >>  static struct bio *get_swap_bio(gfp_t gfp_flags,
>> >>   struct page *page, bio_end_io_t end_io)
>> >>  {
>> >> - int i, nr = hpage_nr_pages(page);
>> >>   struct bio *bio;
>> >>  
>> >> - bio = bio_alloc(gfp_flags, nr);
>> >> + bio = bio_alloc(gfp_flags, 1);
>> >>   if (bio) {
>> >>   struct block_device *bdev;
>> >>  
>> >> @@ -41,9 +40,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
>> >>   bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
>> >>   bio->bi_end_io = end_io;
>> >>  
>> >> - for (i = 0; i < nr; i++)
>> >> - bio_add_page(bio, page + i, PAGE_SIZE, 0);
>> >
>> > bio_add_page() supposes to work, just wondering why it doesn't recently.
>> 
>> Yes.  Just checked and bio_add_page() works too.  I should have used
>> that.  The problem isn't bio_add_page(), but bio_alloc(), because nr ==
>> 512 > 256, mempool cannot be used during swapout, so swapout will fail.
>
> Then we can pass 1 to bio_alloc(), together with single bio_add_page()
> for making the code more readable.
>

Yes.  Will send out v2 to replace __bio_add_page() with bio_add_page().

Best Regards,
Huang, Ying


Re: [PATCH -mm] mm, swap: Fix THP swap out

2019-06-23 Thread Huang, Ying
Ming Lei  writes:

> Hi Huang Ying,
>
> On Mon, Jun 24, 2019 at 10:23:36AM +0800, Huang, Ying wrote:
>> From: Huang Ying 
>> 
>> 0-Day test system reported some OOM regressions for several
>> THP (Transparent Huge Page) swap test cases.  These regressions are
>> bisected to 6861428921b5 ("block: always define BIO_MAX_PAGES as
>> 256").  In the commit, BIO_MAX_PAGES is set to 256 even when THP swap
>> is enabled.  So the bio_alloc(gfp_flags, 512) in get_swap_bio() may
>> fail when swapping out THP.  That causes the OOM.
>> 
>> As in the patch description of 6861428921b5 ("block: always define
>> BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write
>> THP to swap space.  So the issue is fixed via doing that in
>> get_swap_bio().
>> 
>> BTW: I remember I have checked the THP swap code when
>> 6861428921b5 ("block: always define BIO_MAX_PAGES as 256") was merged,
>> and thought the THP swap code needn't to be changed.  But apparently,
>> I was wrong.  I should have done this at that time.
>> 
>> Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
>> Signed-off-by: "Huang, Ying" 
>> Cc: Ming Lei 
>> Cc: Michal Hocko 
>> Cc: Johannes Weiner 
>> Cc: Hugh Dickins 
>> Cc: Minchan Kim 
>> Cc: Rik van Riel 
>> Cc: Daniel Jordan 
>> ---
>>  mm/page_io.c | 7 ++-
>>  1 file changed, 2 insertions(+), 5 deletions(-)
>> 
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index 2e8019d0e048..4ab997f84061 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -29,10 +29,9 @@
>>  static struct bio *get_swap_bio(gfp_t gfp_flags,
>>  struct page *page, bio_end_io_t end_io)
>>  {
>> -int i, nr = hpage_nr_pages(page);
>>  struct bio *bio;
>>  
>> -bio = bio_alloc(gfp_flags, nr);
>> +bio = bio_alloc(gfp_flags, 1);
>>  if (bio) {
>>  struct block_device *bdev;
>>  
>> @@ -41,9 +40,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
>>  bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
>>  bio->bi_end_io = end_io;
>>  
>> -for (i = 0; i < nr; i++)
>> -bio_add_page(bio, page + i, PAGE_SIZE, 0);
>
> bio_add_page() supposes to work, just wondering why it doesn't recently.

Yes.  Just checked, and bio_add_page() works too.  I should have used
that.  The problem isn't bio_add_page() but bio_alloc(): because nr ==
512 > 256, the mempool cannot be used during swapout, so swapout will
fail.
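
With a single multi-page bvec, one bvec slot is enough for a whole THP.
A minimal sketch of the resulting allocation (essentially what the -V2
patch does in get_swap_bio(); bdev setup and error handling omitted):

	struct bio *bio = bio_alloc(gfp_flags, 1);	/* one bvec slot */

	if (bio)
		/* add the whole (possibly huge) page as one multi-page bvec */
		bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);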

Best Regards,
Huang, Ying

> Could you share me one test case for reproducing it?
>
>> -VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr);
>> +__bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
>>  }
>>  return bio;
>
> Actually the above code can be simplified as:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 2e8019d0e048..c20b4189d0a1 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -29,7 +29,7 @@
>  static struct bio *get_swap_bio(gfp_t gfp_flags,
>   struct page *page, bio_end_io_t end_io)
>  {
> - int i, nr = hpage_nr_pages(page);
> + int nr = hpage_nr_pages(page);
>   struct bio *bio;
>  
>   bio = bio_alloc(gfp_flags, nr);
> @@ -41,8 +41,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
>   bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
>   bio->bi_end_io = end_io;
>  
> - for (i = 0; i < nr; i++)
> - bio_add_page(bio, page + i, PAGE_SIZE, 0);
> + bio_add_page(bio, page, PAGE_SIZE * nr, 0);
>   VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr);
>   }
>   return bio;
>
>
> Thanks,
> Ming


[PATCH -mm] autonuma: Fix scan period updating

2019-06-23 Thread Huang Ying
The autonuma scan period should be increased (scanning is slowed down)
if the majority of the page accesses are shared with other processes.
But in current code, the scan period will be decreased (scanning is
speeded up) in that situation.

This patch fixes the code.  And this has been tested via tracing the
scan period changing and /proc/vmstat numa_pte_updates counter when
running a multi-threaded memory accessing program (most memory
areas are accessed by multiple threads).

Fixes: 37ec97deb3a8 ("sched/numa: Slow down scan rate if shared faults 
dominate")
Signed-off-by: "Huang, Ying" 
Cc: Rik van Riel 
Cc: Peter Zijlstra (Intel) 
Cc: Mel Gorman 
Cc: jhla...@redhat.com
Cc: lvena...@redhat.com
Cc: Ingo Molnar 
---
 kernel/sched/fair.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f5e528..79bc4d2d1e58 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1923,7 +1923,7 @@ static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private)
 {
unsigned int period_slot;
-   int lr_ratio, ps_ratio;
+   int lr_ratio, sp_ratio;
int diff;
 
unsigned long remote = p->numa_faults_locality[0];
@@ -1954,22 +1954,22 @@ static void update_task_scan_period(struct task_struct 
*p,
 */
period_slot = DIV_ROUND_UP(p->numa_scan_period, NUMA_PERIOD_SLOTS);
lr_ratio = (local * NUMA_PERIOD_SLOTS) / (local + remote);
-   ps_ratio = (private * NUMA_PERIOD_SLOTS) / (private + shared);
+   sp_ratio = (shared * NUMA_PERIOD_SLOTS) / (private + shared);
 
-   if (ps_ratio >= NUMA_PERIOD_THRESHOLD) {
+   if (sp_ratio >= NUMA_PERIOD_THRESHOLD) {
/*
-* Most memory accesses are local. There is no need to
-* do fast NUMA scanning, since memory is already local.
+* Most memory accesses are shared with other tasks.
+* There is no point in continuing fast NUMA scanning,
+* since other tasks may just move the memory elsewhere.
 */
-   int slot = ps_ratio - NUMA_PERIOD_THRESHOLD;
+   int slot = sp_ratio - NUMA_PERIOD_THRESHOLD;
if (!slot)
slot = 1;
diff = slot * period_slot;
} else if (lr_ratio >= NUMA_PERIOD_THRESHOLD) {
/*
-* Most memory accesses are shared with other tasks.
-* There is no point in continuing fast NUMA scanning,
-* since other tasks may just move the memory elsewhere.
+* Most memory accesses are local. There is no need to
+* do fast NUMA scanning, since memory is already local.
 */
int slot = lr_ratio - NUMA_PERIOD_THRESHOLD;
if (!slot)
@@ -1981,7 +1981,7 @@ static void update_task_scan_period(struct task_struct *p,
 * yet they are not on the local NUMA node. Speed up
 * NUMA scanning to get the memory moved over.
 */
-   int ratio = max(lr_ratio, ps_ratio);
+   int ratio = max(lr_ratio, sp_ratio);
diff = -(NUMA_PERIOD_THRESHOLD - ratio) * period_slot;
}
 
-- 
2.21.0



[PATCH -mm] mm, swap: Fix THP swap out

2019-06-23 Thread Huang, Ying
From: Huang Ying 

0-Day test system reported some OOM regressions for several
THP (Transparent Huge Page) swap test cases.  These regressions are
bisected to 6861428921b5 ("block: always define BIO_MAX_PAGES as
256").  In the commit, BIO_MAX_PAGES is set to 256 even when THP swap
is enabled.  So the bio_alloc(gfp_flags, 512) in get_swap_bio() may
fail when swapping out THP.  That causes the OOM.

As in the patch description of 6861428921b5 ("block: always define
BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write
THP to swap space.  So the issue is fixed via doing that in
get_swap_bio().

BTW: I remember that I checked the THP swap code when
6861428921b5 ("block: always define BIO_MAX_PAGES as 256") was merged,
and thought the THP swap code needn't be changed.  But apparently,
I was wrong.  I should have done this at that time.

Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
Signed-off-by: "Huang, Ying" 
Cc: Ming Lei 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Daniel Jordan 
---
 mm/page_io.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 2e8019d0e048..4ab997f84061 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -29,10 +29,9 @@
 static struct bio *get_swap_bio(gfp_t gfp_flags,
struct page *page, bio_end_io_t end_io)
 {
-   int i, nr = hpage_nr_pages(page);
struct bio *bio;
 
-   bio = bio_alloc(gfp_flags, nr);
+   bio = bio_alloc(gfp_flags, 1);
if (bio) {
struct block_device *bdev;
 
@@ -41,9 +40,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
bio->bi_end_io = end_io;
 
-   for (i = 0; i < nr; i++)
-   bio_add_page(bio, page + i, PAGE_SIZE, 0);
-   VM_BUG_ON(bio->bi_iter.bi_size != PAGE_SIZE * nr);
+   __bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
}
return bio;
 }
-- 
2.20.1



Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression

2019-06-20 Thread Huang, Ying
"Huang, Ying"  writes:

> "Huang, Ying"  writes:
>
>> Hi, Josef,
>>
>> kernel test robot  writes:
>>
>>> Greeting,
>>>
>>> FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to commit:
>>>
>>>
>>> commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve
>>> delalloc metadata differently")
>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>
>>> in testcase: aim7
>>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 
>>> 384G memory
>>> with following parameters:
>>>
>>> disk: 4BRD_12G
>>> md: RAID0
>>> fs: btrfs
>>> test: disk_rr
>>> load: 1500
>>> cpufreq_governor: performance
>>>
>>> test-description: AIM7 is a traditional UNIX system level benchmark
>>> suite which is used to test and measure the performance of multiuser
>>> system.
>>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>>
>> Here's another regression, do you have time to take a look at this?
>
> Ping

Ping again ...

Best Regards,
Huang, Ying


[PATCH -mm RESEND] mm: fix race between swapoff and mincore

2019-06-10 Thread Huang, Ying
From: Huang Ying 

Since commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"),
after swapoff, the address_space associated with the swap device will be
freed.  So swap_address_space() users which touch the address_space need
some kind of mechanism to prevent the address_space from being freed
while it is being accessed.

When mincore processes an unmapped range for swapped shmem pages, it
doesn't hold the lock to prevent the swap device from being swapoff'ed.
So the following race is possible:

CPU1					CPU2
do_mincore()				swapoff()
  walk_page_range()
    mincore_unmapped_range()
      __mincore_unmapped_range
        mincore_page
          as = swap_address_space()
          ...				  exit_swap_address_space()
          ...				    kvfree(spaces)
          find_get_page(as)

The address space may be accessed after being freed.

To fix the race, get_swap_device()/put_swap_device() is used to enclose
find_get_page(), to check whether the swap entry is valid and to prevent
the swap device from being swapoff'ed while it is being accessed.
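
In short, the pattern used by the fix is the following sketch, where
'entry' stands for the swp_entry_t of interest and find_get_page() is
the access being protected:

	struct swap_info_struct *si;
	struct page *page = NULL;

	si = get_swap_device(entry);
	if (si) {
		/* swapoff cannot free the swap address_space here */
		page = find_get_page(swap_address_space(entry),
				     swp_offset(entry));
		put_swap_device(si);
	}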

Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks")
Signed-off-by: "Huang, Ying" 
Reviewed-by: Andrew Morton 
Acked-by: Michal Hocko 
Cc: Hugh Dickins 
Cc: Paul E. McKenney 
Cc: Minchan Kim 
Cc: Johannes Weiner 
Cc: Tim Chen 
Cc: Mel Gorman 
Cc: Jérôme Glisse 
Cc: Andrea Arcangeli 
Cc: Yang Shi 
Cc: David Rientjes 
Cc: Rik van Riel 
Cc: Jan Kara 
Cc: Dave Jiang 
Cc: Daniel Jordan 
Cc: Andrea Parri 
---
 mm/mincore.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index c3f058bd0faf..4fe91d497436 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -68,8 +68,16 @@ static unsigned char mincore_page(struct address_space 
*mapping, pgoff_t pgoff)
 */
if (xa_is_value(page)) {
swp_entry_t swp = radix_to_swp_entry(page);
-   page = find_get_page(swap_address_space(swp),
-swp_offset(swp));
+   struct swap_info_struct *si;
+
+   /* Prevent swap device to being swapoff under us */
+   si = get_swap_device(swp);
+   if (si) {
+   page = find_get_page(swap_address_space(swp),
+swp_offset(swp));
+   put_swap_device(si);
+   } else
+   page = NULL;
}
} else
page = find_get_page(mapping, pgoff);
-- 
2.20.1



Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression

2019-06-10 Thread Huang, Ying
"Huang, Ying"  writes:

> Hi, Josef,
>
> kernel test robot  writes:
>
>> Greeting,
>>
>> FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to commit:
>>
>>
>> commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve
>> delalloc metadata differently")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: aim7
>> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 
>> 384G memory
>> with following parameters:
>>
>>  disk: 4BRD_12G
>>  md: RAID0
>>  fs: btrfs
>>  test: disk_rr
>>  load: 1500
>>  cpufreq_governor: performance
>>
>> test-description: AIM7 is a traditional UNIX system level benchmark
>> suite which is used to test and measure the performance of multiuser
>> system.
>> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/
>
> Here's another regression, do you have time to take a look at this?

Ping

Best Regards,
Huang, Ying


Re: [PATCH -mm] mm, swap: Fix bad swap file entry warning

2019-05-31 Thread Huang, Ying
Michal Hocko  writes:

> On Fri 31-05-19 10:41:02, Huang, Ying wrote:
>> From: Huang Ying 
>> 
>> Mike reported the following warning messages
>> 
>>   get_swap_device: Bad swap file entry 1401
>> 
>> This is produced by
>> 
>> - total_swapcache_pages()
>>   - get_swap_device()
>> 
>> Where get_swap_device() is used to check whether the swap device is
>> valid and prevent it from being swapoff if so.  But get_swap_device()
>> may produce warning message as above for some invalid swap devices.
>> This is fixed via calling swp_swap_info() before get_swap_device() to
>> filter out the swap devices that may cause warning messages.
>> 
>> Fixes: 6a946753dbe6 ("mm/swap_state.c: simplify total_swapcache_pages() with 
>> get_swap_device()")
>
> I suspect this is referring to a mmotm patch right?

Yes.

> There doesn't seem
> to be any sha like this in Linus' tree AFAICS. If that is the case then
> please note that mmotm patch showing up in linux-next do not have a
> stable sha1 and therefore you shouldn't reference them in the commit
> message. Instead please refer to the specific mmotm patch file so that
> Andrew knows it should be folded in to it.

Thanks for the reminder!  I will be more careful in the future.  It seems
that Andrew has identified the right patch to fold it into.

Best Regards,
Huang, Ying


Re: mmotm 2019-05-29-20-52 uploaded

2019-05-30 Thread Huang, Ying
"Huang, Ying"  writes:

> Hi, Mike,
>
> Mike Kravetz  writes:
>
>> On 5/29/19 8:53 PM, a...@linux-foundation.org wrote:
>>> The mm-of-the-moment snapshot 2019-05-29-20-52 has been uploaded to
>>> 
>>>http://www.ozlabs.org/~akpm/mmotm/
>>> 
>>
>> With this kernel, I seem to get many messages such as:
>>
>> get_swap_device: Bad swap file entry 1401
>>
>> It would seem to be related to commit 3e2c19f9bef7e
>>> * mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch
>
> Hi, Mike,
>
> Thanks for reporting!  I find an issue in my patch and I can reproduce
> your problem now.  The reason is total_swapcache_pages() will call
> get_swap_device() for invalid swap device.  So we need to find a way to
> silence the warning.  I will post a fix ASAP.

I have sent out a fix patch in another thread with title

"[PATCH -mm] mm, swap: Fix bad swap file entry warning"

Can you try it?

Best Regards,
Huang, Ying



[PATCH -mm] mm, swap: Fix bad swap file entry warning

2019-05-30 Thread Huang, Ying
From: Huang Ying 

Mike reported the following warning messages

  get_swap_device: Bad swap file entry 1401

This is produced by

- total_swapcache_pages()
  - get_swap_device()

Here get_swap_device() is used to check whether the swap device is
valid and, if so, to prevent it from being swapoff'ed.  But
get_swap_device() may produce warning messages like the above for some
invalid swap devices.  This is fixed by calling swp_swap_info() before
get_swap_device() to filter out the swap devices that would trigger the
warning.

Fixes: 6a946753dbe6 ("mm/swap_state.c: simplify total_swapcache_pages() with 
get_swap_device()")
Signed-off-by: "Huang, Ying" 
Cc: Mike Kravetz 
Cc: Andrea Parri 
Cc: Paul E. McKenney 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Hugh Dickins 
---
 mm/swap_state.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index b84c58b572ca..62da25b7f2ed 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -76,8 +76,13 @@ unsigned long total_swapcache_pages(void)
struct swap_info_struct *si;
 
for (i = 0; i < MAX_SWAPFILES; i++) {
+   swp_entry_t entry = swp_entry(i, 1);
+
+   /* Avoid get_swap_device() to warn for bad swap entry */
+   if (!swp_swap_info(entry))
+   continue;
/* Prevent swapoff to free swapper_spaces */
-   si = get_swap_device(swp_entry(i, 1));
+   si = get_swap_device(entry);
if (!si)
continue;
nr = nr_swapper_spaces[i];
-- 
2.20.1



Re: mmotm 2019-05-29-20-52 uploaded

2019-05-30 Thread Huang, Ying
Hi, Mike,

Mike Kravetz  writes:

> On 5/29/19 8:53 PM, a...@linux-foundation.org wrote:
>> The mm-of-the-moment snapshot 2019-05-29-20-52 has been uploaded to
>> 
>>http://www.ozlabs.org/~akpm/mmotm/
>> 
>
> With this kernel, I seem to get many messages such as:
>
> get_swap_device: Bad swap file entry 1401
>
> It would seem to be related to commit 3e2c19f9bef7e
>> * mm-swap-fix-race-between-swapoff-and-some-swap-operations.patch

Hi, Mike,

Thanks for reporting!  I found an issue in my patch and I can reproduce
your problem now.  The reason is that total_swapcache_pages() will call
get_swap_device() for invalid swap devices.  So we need to find a way to
silence the warning.  I will post a fix ASAP.

Best Regards,
Huang, Ying


Re: [LKP] [btrfs] c8eaeac7b7: aim7.jobs-per-min -11.7% regression

2019-05-29 Thread Huang, Ying
Hi, Josef,

kernel test robot  writes:

> Greeting,
>
> FYI, we noticed a -11.7% regression of aim7.jobs-per-min due to commit:
>
>
> commit: c8eaeac7b734347c3afba7008b7af62f37b9c140 ("btrfs: reserve delalloc 
> metadata differently")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: aim7
> on test machine: 40 threads Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz with 
> 384G memory
> with following parameters:
>
>   disk: 4BRD_12G
>   md: RAID0
>   fs: btrfs
>   test: disk_rr
>   load: 1500
>   cpufreq_governor: performance
>
> test-description: AIM7 is a traditional UNIX system level benchmark suite 
> which is used to test and measure the performance of multiuser system.
> test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/

Here's another regression, do you have time to take a look at this?

Best Regards,
Huang, Ying


Re: [v7 PATCH 2/2] mm: vmscan: correct some vmscan counters for THP swapout

2019-05-28 Thread Huang, Ying
Yang Shi  writes:

> Since commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after
> swapped out"), THP can be swapped out in a whole.  But, nr_reclaimed
> and some other vm counters still get inc'ed by one even though a whole
> THP (512 pages) gets swapped out.
>
> This doesn't make too much sense to memory reclaim.  For example, direct
> reclaim may just need reclaim SWAP_CLUSTER_MAX pages, reclaiming one THP
> could fulfill it.  But, if nr_reclaimed is not increased correctly,
> direct reclaim may just waste time to reclaim more pages,
> SWAP_CLUSTER_MAX * 512 pages in worst case.
>
> And, it may cause pgsteal_{kswapd|direct} is greater than
> pgscan_{kswapd|direct}, like the below:
>
> pgsteal_kswapd 122933
> pgsteal_direct 26600225
> pgscan_kswapd 174153
> pgscan_direct 14678312
>
> nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would
> break some page reclaim logic, e.g.
>
> vmpressure: this looks at the scanned/reclaimed ratio so it won't
> change semantics as long as scanned & reclaimed are fixed in parallel.
>
> compaction/reclaim: compaction wants a certain number of physical pages
> freed up before going back to compacting.
>
> kswapd priority raising: kswapd raises priority if we scan fewer pages
> than the reclaim target (which itself is obviously expressed in order-0
> pages). As a result, kswapd can falsely raise its aggressiveness even
> when it's making great progress.
>
> Other than nr_scanned and nr_reclaimed, some other counters, e.g.
> pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed
> too since they are user visible via cgroup, /proc/vmstat or trace
> points, otherwise they would be underreported.
>
> When isolating pages from LRUs, nr_taken has been accounted in base
> page, but nr_scanned and nr_skipped are still accounted in THP.  It
> doesn't make too much sense too since this may cause trace point
> underreport the numbers as well.
>
> So accounting those counters in base page instead of accounting THP as
> one page.
>
> nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
> file cache, so they are not impacted by THP swap.
>
> This change may result in lower steal/scan ratio in some cases since
> THP may get split during page reclaim, then a part of tail pages get
> reclaimed instead of the whole 512 pages, but nr_scanned is accounted
> by 512, particularly for direct reclaim.  But, this should be not a
> significant issue.
>
> Cc: "Huang, Ying" 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: "Kirill A . Shutemov" 
> Cc: Hugh Dickins 
> Cc: Shakeel Butt 
> Cc: Hillf Danton 
> Signed-off-by: Yang Shi 

Looks good to me!  Thanks for your effort!

Reviewed-by: "Huang, Ying" 

Best Regards,
Huang, Ying


[PATCH -mm] mm, swap: Simplify total_swapcache_pages() with get_swap_device()

2019-05-27 Thread Huang, Ying
From: Huang Ying 

total_swapcache_pages() may race with swapper_spaces[] allocation and
freeing.  Previously, this was protected with a swapper_spaces[]-specific
RCU mechanism.  To simplify the logic and reduce code complexity, it is
replaced with get/put_swap_device().  The line count is reduced too.
Although not so important, swapoff() performance also improves because
one synchronize_rcu() call during swapoff() is removed.

Signed-off-by: "Huang, Ying" 
Cc: Hugh Dickins 
Cc: Paul E. McKenney 
Cc: Minchan Kim 
Cc: Johannes Weiner 
Cc: Tim Chen 
Cc: Mel Gorman 
Cc: Jérôme Glisse 
Cc: Michal Hocko 
Cc: Andrea Arcangeli 
Cc: Yang Shi 
Cc: David Rientjes 
Cc: Rik van Riel 
Cc: Jan Kara 
Cc: Dave Jiang 
Cc: Daniel Jordan 
Cc: Andrea Parri 
---
 mm/swap_state.c | 28 ++--
 1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index f509cdaa81b1..b84c58b572ca 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -73,23 +73,19 @@ unsigned long total_swapcache_pages(void)
unsigned int i, j, nr;
unsigned long ret = 0;
struct address_space *spaces;
+   struct swap_info_struct *si;
 
-   rcu_read_lock();
for (i = 0; i < MAX_SWAPFILES; i++) {
-   /*
-* The corresponding entries in nr_swapper_spaces and
-* swapper_spaces will be reused only after at least
-* one grace period.  So it is impossible for them
-* belongs to different usage.
-*/
-   nr = nr_swapper_spaces[i];
-   spaces = rcu_dereference(swapper_spaces[i]);
-   if (!nr || !spaces)
+   /* Prevent swapoff to free swapper_spaces */
+   si = get_swap_device(swp_entry(i, 1));
+   if (!si)
continue;
+   nr = nr_swapper_spaces[i];
+   spaces = swapper_spaces[i];
for (j = 0; j < nr; j++)
ret += spaces[j].nrpages;
+   put_swap_device(si);
}
-   rcu_read_unlock();
return ret;
 }
 
@@ -611,20 +607,16 @@ int init_swap_address_space(unsigned int type, unsigned 
long nr_pages)
mapping_set_no_writeback_tags(space);
}
nr_swapper_spaces[type] = nr;
-   rcu_assign_pointer(swapper_spaces[type], spaces);
+   swapper_spaces[type] = spaces;
 
return 0;
 }
 
 void exit_swap_address_space(unsigned int type)
 {
-   struct address_space *spaces;
-
-   spaces = swapper_spaces[type];
+   kvfree(swapper_spaces[type]);
nr_swapper_spaces[type] = 0;
-   rcu_assign_pointer(swapper_spaces[type], NULL);
-   synchronize_rcu();
-   kvfree(spaces);
+   swapper_spaces[type] = NULL;
 }
 
 static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
-- 
2.20.1



Re: [v6 PATCH 2/2] mm: vmscan: correct some vmscan counters for THP swapout

2019-05-27 Thread Huang, Ying
Yang Shi  writes:

> On 5/27/19 3:06 PM, Huang, Ying wrote:
>> Yang Shi  writes:
>>
>>> Since commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after
>>> swapped out"), THP can be swapped out in a whole.  But, nr_reclaimed
>>> and some other vm counters still get inc'ed by one even though a whole
>>> THP (512 pages) gets swapped out.
>>>
>>> This doesn't make too much sense to memory reclaim.  For example, direct
>>> reclaim may just need reclaim SWAP_CLUSTER_MAX pages, reclaiming one THP
>>> could fulfill it.  But, if nr_reclaimed is not increased correctly,
>>> direct reclaim may just waste time to reclaim more pages,
>>> SWAP_CLUSTER_MAX * 512 pages in worst case.
>>>
>>> And, it may cause pgsteal_{kswapd|direct} is greater than
>>> pgscan_{kswapd|direct}, like the below:
>>>
>>> pgsteal_kswapd 122933
>>> pgsteal_direct 26600225
>>> pgscan_kswapd 174153
>>> pgscan_direct 14678312
>>>
>>> nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would
>>> break some page reclaim logic, e.g.
>>>
>>> vmpressure: this looks at the scanned/reclaimed ratio so it won't
>>> change semantics as long as scanned & reclaimed are fixed in parallel.
>>>
>>> compaction/reclaim: compaction wants a certain number of physical pages
>>> freed up before going back to compacting.
>>>
>>> kswapd priority raising: kswapd raises priority if we scan fewer pages
>>> than the reclaim target (which itself is obviously expressed in order-0
>>> pages). As a result, kswapd can falsely raise its aggressiveness even
>>> when it's making great progress.
>>>
>>> Other than nr_scanned and nr_reclaimed, some other counters, e.g.
>>> pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed
>>> too since they are user visible via cgroup, /proc/vmstat or trace
>>> points, otherwise they would be underreported.
>>>
>>> When isolating pages from LRUs, nr_taken has been accounted in base
>>> page, but nr_scanned and nr_skipped are still accounted in THP.  It
>>> doesn't make too much sense too since this may cause trace point
>>> underreport the numbers as well.
>>>
>>> So accounting those counters in base page instead of accounting THP as
>>> one page.
>>>
>>> nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
>>> file cache, so they are not impacted by THP swap.
>>>
>>> This change may result in lower steal/scan ratio in some cases since
>>> THP may get split during page reclaim, then a part of tail pages get
>>> reclaimed instead of the whole 512 pages, but nr_scanned is accounted
>>> by 512, particularly for direct reclaim.  But, this should be not a
>>> significant issue.
>>>
>>> Cc: "Huang, Ying" 
>>> Cc: Johannes Weiner 
>>> Cc: Michal Hocko 
>>> Cc: Mel Gorman 
>>> Cc: "Kirill A . Shutemov" 
>>> Cc: Hugh Dickins 
>>> Cc: Shakeel Butt 
>>> Cc: Hillf Danton 
>>> Signed-off-by: Yang Shi 
>>> ---
>>> v6: Fixed the other double account issue introduced by v5 per Huang Ying
>>> v5: Fixed sc->nr_scanned double accounting per Huang Ying
>>>  Added some comments to address the concern about premature OOM per 
>>> Hillf Danton
>>> v4: Fixed the comments from Johannes and Huang Ying
>>> v3: Removed Shakeel's Reviewed-by since the patch has been changed 
>>> significantly
>>>  Switched back to use compound_order per Matthew
>>>  Fixed more counters per Johannes
>>> v2: Added Shakeel's Reviewed-by
>>>  Use hpage_nr_pages instead of compound_order per Huang Ying and 
>>> William Kucharski
>>>
>>>   mm/vmscan.c | 47 +++
>>>   1 file changed, 35 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index b65bc50..378edff 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1118,6 +1118,7 @@ static unsigned long shrink_page_list(struct 
>>> list_head *page_list,
>>> int may_enter_fs;
>>> enum page_references references = PAGEREF_RECLAIM_CLEAN;
>>> bool dirty, writeback;
>>> +   unsigned int nr_pages;
>>> cond_resched();
>>>   @@ -1129,7 +113

Re: [v6 PATCH 2/2] mm: vmscan: correct some vmscan counters for THP swapout

2019-05-27 Thread Huang, Ying
Yang Shi  writes:

> Since commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after
> swapped out"), THP can be swapped out in a whole.  But, nr_reclaimed
> and some other vm counters still get inc'ed by one even though a whole
> THP (512 pages) gets swapped out.
>
> This doesn't make too much sense to memory reclaim.  For example, direct
> reclaim may just need reclaim SWAP_CLUSTER_MAX pages, reclaiming one THP
> could fulfill it.  But, if nr_reclaimed is not increased correctly,
> direct reclaim may just waste time to reclaim more pages,
> SWAP_CLUSTER_MAX * 512 pages in worst case.
>
> And, it may cause pgsteal_{kswapd|direct} is greater than
> pgscan_{kswapd|direct}, like the below:
>
> pgsteal_kswapd 122933
> pgsteal_direct 26600225
> pgscan_kswapd 174153
> pgscan_direct 14678312
>
> nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would
> break some page reclaim logic, e.g.
>
> vmpressure: this looks at the scanned/reclaimed ratio so it won't
> change semantics as long as scanned & reclaimed are fixed in parallel.
>
> compaction/reclaim: compaction wants a certain number of physical pages
> freed up before going back to compacting.
>
> kswapd priority raising: kswapd raises priority if we scan fewer pages
> than the reclaim target (which itself is obviously expressed in order-0
> pages). As a result, kswapd can falsely raise its aggressiveness even
> when it's making great progress.
>
> Other than nr_scanned and nr_reclaimed, some other counters, e.g.
> pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed
> too since they are user visible via cgroup, /proc/vmstat or trace
> points, otherwise they would be underreported.
>
> When isolating pages from LRUs, nr_taken has been accounted in base
> page, but nr_scanned and nr_skipped are still accounted in THP.  It
> doesn't make too much sense too since this may cause trace point
> underreport the numbers as well.
>
> So accounting those counters in base page instead of accounting THP as
> one page.
>
> nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
> file cache, so they are not impacted by THP swap.
>
> This change may result in lower steal/scan ratio in some cases since
> THP may get split during page reclaim, then a part of tail pages get
> reclaimed instead of the whole 512 pages, but nr_scanned is accounted
> by 512, particularly for direct reclaim.  But, this should be not a
> significant issue.
>
> Cc: "Huang, Ying" 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: "Kirill A . Shutemov" 
> Cc: Hugh Dickins 
> Cc: Shakeel Butt 
> Cc: Hillf Danton 
> Signed-off-by: Yang Shi 
> ---
> v6: Fixed the other double account issue introduced by v5 per Huang Ying
> v5: Fixed sc->nr_scanned double accounting per Huang Ying
> Added some comments to address the concern about premature OOM per Hillf 
> Danton 
> v4: Fixed the comments from Johannes and Huang Ying
> v3: Removed Shakeel's Reviewed-by since the patch has been changed 
> significantly
> Switched back to use compound_order per Matthew
> Fixed more counters per Johannes
> v2: Added Shakeel's Reviewed-by
> Use hpage_nr_pages instead of compound_order per Huang Ying and William 
> Kucharski
>
>  mm/vmscan.c | 47 +++
>  1 file changed, 35 insertions(+), 12 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b65bc50..378edff 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1118,6 +1118,7 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   int may_enter_fs;
>   enum page_references references = PAGEREF_RECLAIM_CLEAN;
>   bool dirty, writeback;
> + unsigned int nr_pages;
>  
>   cond_resched();
>  
> @@ -1129,7 +1130,10 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>  
>   VM_BUG_ON_PAGE(PageActive(page), page);
>  
> - sc->nr_scanned++;
> + nr_pages = 1 << compound_order(page);
> +
> + /* Account the number of base pages even though THP */
> + sc->nr_scanned += nr_pages;
>  
>   if (unlikely(!page_evictable(page)))
>   goto activate_locked;
> @@ -1250,7 +1254,7 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   case PAGEREF_ACTIVATE:
>   goto activate_locked;
>   case PAGEREF_KEEP:
> - stat->nr_ref_keep++;
> + sta

Re: [RESEND v5 PATCH 2/2] mm: vmscan: correct some vmscan counters for THP swapout

2019-05-26 Thread Huang, Ying
Yang Shi  writes:

> On 5/27/19 10:11 AM, Huang, Ying wrote:
>> Yang Shi  writes:
>>
>>> Since commit bd4c82c22c36 ("mm, THP, swap: delay splitting THP after
>>> swapped out"), THP can be swapped out in a whole.  But, nr_reclaimed
>>> and some other vm counters still get inc'ed by one even though a whole
>>> THP (512 pages) gets swapped out.
>>>
>>> This doesn't make too much sense to memory reclaim.  For example, direct
>>> reclaim may just need reclaim SWAP_CLUSTER_MAX pages, reclaiming one THP
>>> could fulfill it.  But, if nr_reclaimed is not increased correctly,
>>> direct reclaim may just waste time to reclaim more pages,
>>> SWAP_CLUSTER_MAX * 512 pages in worst case.
>>>
>>> And, it may cause pgsteal_{kswapd|direct} is greater than
>>> pgscan_{kswapd|direct}, like the below:
>>>
>>> pgsteal_kswapd 122933
>>> pgsteal_direct 26600225
>>> pgscan_kswapd 174153
>>> pgscan_direct 14678312
>>>
>>> nr_reclaimed and nr_scanned must be fixed in parallel otherwise it would
>>> break some page reclaim logic, e.g.
>>>
>>> vmpressure: this looks at the scanned/reclaimed ratio so it won't
>>> change semantics as long as scanned & reclaimed are fixed in parallel.
>>>
>>> compaction/reclaim: compaction wants a certain number of physical pages
>>> freed up before going back to compacting.
>>>
>>> kswapd priority raising: kswapd raises priority if we scan fewer pages
>>> than the reclaim target (which itself is obviously expressed in order-0
>>> pages). As a result, kswapd can falsely raise its aggressiveness even
>>> when it's making great progress.
>>>
>>> Other than nr_scanned and nr_reclaimed, some other counters, e.g.
>>> pgactivate, nr_skipped, nr_ref_keep and nr_unmap_fail need to be fixed
>>> too since they are user visible via cgroup, /proc/vmstat or trace
>>> points, otherwise they would be underreported.
>>>
>>> When isolating pages from LRUs, nr_taken has been accounted in base
>>> page, but nr_scanned and nr_skipped are still accounted in THP.  It
>>> doesn't make too much sense too since this may cause trace point
>>> underreport the numbers as well.
>>>
>>> So accounting those counters in base page instead of accounting THP as
>>> one page.
>>>
>>> nr_dirty, nr_unqueued_dirty, nr_congested and nr_writeback are used by
>>> file cache, so they are not impacted by THP swap.
>>>
>>> This change may result in lower steal/scan ratio in some cases since
>>> THP may get split during page reclaim, then a part of tail pages get
>>> reclaimed instead of the whole 512 pages, but nr_scanned is accounted
>>> by 512, particularly for direct reclaim.  But, this should be not a
>>> significant issue.
>>>
>>> Cc: "Huang, Ying" 
>>> Cc: Johannes Weiner 
>>> Cc: Michal Hocko 
>>> Cc: Mel Gorman 
>>> Cc: "Kirill A . Shutemov" 
>>> Cc: Hugh Dickins 
>>> Cc: Shakeel Butt 
>>> Signed-off-by: Yang Shi 
>>> ---
>>> v5: Fixed sc->nr_scanned double accounting per Huang Ying
>>>  Added some comments to address the concern about premature OOM per 
>>> Hillf Danton
>>> v4: Fixed the comments from Johannes and Huang Ying
>>> v3: Removed Shakeel's Reviewed-by since the patch has been changed 
>>> significantly
>>>  Switched back to use compound_order per Matthew
>>>  Fixed more counters per Johannes
>>> v2: Added Shakeel's Reviewed-by
>>>  Use hpage_nr_pages instead of compound_order per Huang Ying and 
>>> William Kucharski
>>>
>>>   mm/vmscan.c | 42 +++---
>>>   1 file changed, 31 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index b65bc50..f4f4d57 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1118,6 +1118,7 @@ static unsigned long shrink_page_list(struct 
>>> list_head *page_list,
>>> int may_enter_fs;
>>> enum page_references references = PAGEREF_RECLAIM_CLEAN;
>>> bool dirty, writeback;
>>> +   unsigned int nr_pages;
>>> cond_resched();
>>>   @@ -1129,6 +1130,13 @@ static unsigned long
>>> shrink_page_list(struct list_head *page_list,
>>> V
