Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-15 Thread Zi Yan
On 15 Apr 2021, at 2:45, Huang, Ying wrote:

> "Zi Yan"  writes:
>
>> On 13 Apr 2021, at 23:00, Huang, Ying wrote:
>>
>>> Yang Shi  writes:
>>>
>>>> The generic migration path will check refcount, so no need check refcount
>>>> here. But the old code actually prevents from migrating shared THP (mapped
>>>> by multiple processes), so bail out early if mapcount is > 1 to keep the
>>>> behavior.
>>>
>>> What prevents us from migrating shared THP?  If no, why not just remove
>>> the old refcount checking?
>>
>> If two or more processes are in different NUMA nodes, a THP shared by them
>> can be migrated back and forth between NUMA nodes, which is quite costly.
>> Unless we have a better way of figuring out a good location for such pages
>> to reduce the number of migrations, it might be better not to move them,
>> right?
>>
>
> Some mechanism has been provided in should_numa_migrate_memory() to
> identify the shared pages from the private pages.  Do you find it
> doesn't work well in some situations?
>
> The multiple threads in one process which run on different NUMA nodes
> may share pages too.  So it isn't a good solution to exclude pages
> shared by multiple processes.

After rechecking the patch, it seems that not migrating shared THPs here is a
side effect of the original page_count check, which might not be intended and
could be worth fixing. But Yang just wants to solve one problem at a time,
namely simplifying THP NUMA migration. Maybe a separate patch would be better
for both discussing and fixing this problem.


—
Best Regards,
Yan Zi




Re: [PATCH] mm: Optimise nth_page for contiguous memmap

2021-04-14 Thread Zi Yan
On 13 Apr 2021, at 15:46, Matthew Wilcox (Oracle) wrote:

> If the memmap is virtually contiguous (either because we're using
> a virtually mapped memmap or because we don't support a discontig
> memmap at all), then we can implement nth_page() by simple addition.
> Contrary to popular belief, the compiler is not able to optimise this
> itself for a vmemmap configuration.  This reduces one example user (sg.c)
> by four instructions:
>
> struct page *page = nth_page(rsv_schp->pages[k], offset >> 
> PAGE_SHIFT);
>
> before:
>49 8b 45 70 mov0x70(%r13),%rax
>48 63 c9movslq %ecx,%rcx
>48 c1 eb 0c shr$0xc,%rbx
>48 8b 04 c8 mov(%rax,%rcx,8),%rax
>48 2b 05 00 00 00 00sub0x0(%rip),%rax
>R_X86_64_PC32  vmemmap_base-0x4
>48 c1 f8 06 sar$0x6,%rax
>48 01 d8add%rbx,%rax
>48 c1 e0 06 shl$0x6,%rax
>48 03 05 00 00 00 00add0x0(%rip),%rax
>R_X86_64_PC32  vmemmap_base-0x4
>
> after:
>49 8b 45 70 mov0x70(%r13),%rax
>48 63 c9movslq %ecx,%rcx
>48 c1 eb 0c shr$0xc,%rbx
>48 c1 e3 06 shl$0x6,%rbx
>48 03 1c c8 add(%rax,%rcx,8),%rbx
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 
> ---
>  include/linux/mm.h | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 25b9041f9925..2327f99b121f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -234,7 +234,11 @@ int overcommit_policy_handler(struct ctl_table *, int, 
> void *, size_t *,
>  int __add_to_page_cache_locked(struct page *page, struct address_space 
> *mapping,
>   pgoff_t index, gfp_t gfp, void **shadowp);
>
> +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
> +#else
> +#define nth_page(page,n) ((page) + (n))
> +#endif
>
>  /* to align the pointer to the (next) page boundary */
>  #define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
> -- 
> 2.30.2
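
For context, a minimal sketch (not from the patch) of the kind of caller that
benefits, walking a buffer that may span several struct pages; the function
name is made up for illustration:

    /* visit each page of a possibly multi-page buffer */
    static void for_each_buffer_page(struct page *first, unsigned int nr_pages)
    {
            unsigned int i;

            for (i = 0; i < nr_pages; i++) {
                    /*
                     * nth_page() is needed because "first + i" is not valid
                     * across discontiguous memmap sections; with a vmemmap it
                     * now compiles down to plain pointer addition.
                     */
                    struct page *page = nth_page(first, i);

                    /* ... operate on "page", e.g. map or flush it ... */
            }
    }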

LGTM. Thanks.

Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi




Re: [v2 PATCH 6/7] mm: migrate: check mapcount for THP instead of ref count

2021-04-14 Thread Zi Yan
On 13 Apr 2021, at 23:00, Huang, Ying wrote:

> Yang Shi  writes:
>
>> The generic migration path will check refcount, so no need check refcount
>> here. But the old code actually prevents from migrating shared THP (mapped
>> by multiple processes), so bail out early if mapcount is > 1 to keep the
>> behavior.
>
> What prevents us from migrating shared THP?  If no, why not just remove
> the old refcount checking?

If two or more processes are in different NUMA nodes, a THP shared by them
can be migrated back and forth between NUMA nodes, which is quite costly.
Unless we have a better way of figuring out a good location for such pages
to reduce the number of migrations, it might be better not to move them,
right?

>
> Best Regards,
> Huang, Ying
>
>> Signed-off-by: Yang Shi 
>> ---
>>  mm/migrate.c | 16 
>>  1 file changed, 4 insertions(+), 12 deletions(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index a72994c68ec6..dc7cc7f3a124 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2067,6 +2067,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, 
>> struct page *page)
>>
>>  VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
>>
>> +/* Do not migrate THP mapped by multiple processes */
>> +if (PageTransHuge(page) && page_mapcount(page) > 1)
>> +return 0;
>> +
>>  /* Avoid migrating to a node that is nearly full */
>>  if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
>>  return 0;
>> @@ -2074,18 +2078,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, 
>> struct page *page)
>>  if (isolate_lru_page(page))
>>  return 0;
>>
>> -/*
>> - * migrate_misplaced_transhuge_page() skips page migration's usual
>> - * check on page_count(), so we must do it here, now that the page
>> - * has been isolated: a GUP pin, or any other pin, prevents migration.
>> - * The expected page count is 3: 1 for page's mapcount and 1 for the
>> - * caller's pin and 1 for the reference taken by isolate_lru_page().
>> - */
>> -if (PageTransHuge(page) && page_count(page) != 3) {
>> -putback_lru_page(page);
>> -return 0;
>> -}
>> -
>>  page_lru = page_is_file_lru(page);
>>  mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
>>  thp_nr_pages(page));


—
Best Regards,
Yan Zi




[PATCH v8 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-31 Thread Zi Yan
From: Zi Yan 

We did not have a direct user interface of splitting the compound page
backing a THP and there is no need unless we want to expose the THP
implementation details to users. Make <debugfs>/split_huge_pages accept
a new command to do that.

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages, THPs within the given virtual address range
from the process with the given pid are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.

This does not change the old behavior, i.e., writing 1 to the interface
to split all THPs in the system.
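
For illustration, a minimal user-space sketch of driving the interface
(assuming debugfs is mounted at /sys/kernel/debug; the helper name and error
handling are made up for this example):

    #include <stdio.h>
    #include <stdlib.h>

    /* ask the kernel to split THPs mapped in [start, end) of process pid */
    static void split_thp_range(int pid, unsigned long start, unsigned long end)
    {
            FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

            if (!f) {
                    perror("split_huge_pages");
                    exit(EXIT_FAILURE);
            }
            /* same "<pid>,<vaddr_start>,<vaddr_end>" format described above */
            fprintf(f, "%d,0x%lx,0x%lx", pid, start, end);
            fclose(f);
    }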

Changelog:
From v7:
1. Used the right way of looping through page cache pages. (the error
   was pointed out by Matthew Wilcox)

From v6:
1. pr_info -> pr_debug.
2. Added cond_resched() in all split loops. (suggested by David Rientjes)

From v5:
1. Skipped special VMAs and other fixes. (suggested by Yang Shi)

From v4:
1. Fixed the error code return issue, spotted by kernel test robot
   .

From v3:
1. Factored out split huge pages in the given pid code to a separate
   function.
2. Added the missing put_page for not split pages.
3. pr_debug -> pr_info, make reading results simpler.

From v2:
1. Reused existing /split_huge_pages interface. (suggested by
   Yang Shi)

From v1:
1. Removed unnecessary calling to vma_migratable, spotted by kernel test
   robot .
2. Dropped the use of find_mm_struct and code it directly, since there
   is no need for the permission check in that function and the function
   is only available when migration is on.
3. Added some comments in the selftest program to clarify how PTE-mapped
   THPs are formed.

Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
---
 mm/huge_memory.c  | 155 -
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 318 ++
 4 files changed, 467 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8eba529a0f17..1bcab247aea8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2927,16 +2928,14 @@ static struct shrinker deferred_split_shrinker = {
 };
 
 #ifdef CONFIG_DEBUG_FS
-static int split_huge_pages_set(void *data, u64 val)
+static void split_huge_pages_all(void)
 {
struct zone *zone;
struct page *page;
unsigned long pfn, max_zone_pfn;
unsigned long total = 0, split = 0;
 
-   if (val != 1)
-   return -EINVAL;
-
+   pr_debug("Split all THPs\n");
for_each_populated_zone(zone) {
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
@@ -2960,15 +2959,155 @@ static int split_huge_pages_set(void *data, u64 val)
unlock_page(page);
 next:
put_page(page);
+   cond_resched();
}
}
 
-   pr_info("%lu of %lu THP split\n", split, total);
+   pr_debug("%lu of %lu THP split\n", split, total);
+}
 
-   return 0;
+static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
+{
+   return vma_is_special_huge(vma) || (vma->vm_flags & VM_IO) ||
+   is_vm_hugetlb_page(vma);
+}
+
+static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
+   unsigned long vaddr_end)
+{
+   int ret = 0;
+   struct task_struct *task;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+   unsigned long addr;
+
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   /* Find the task_struct from pid */
+   rcu_read_lock();
+   task = find_task_by_vpid(pid);
+   if (!task) {
+   rcu_read_unlock();
+   ret = -ESRCH;
+   goto out;
+   }
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   /* Find the mm_struct */
+   mm = get_task_mm(task);
+   put_task_struct(task);
+
+   if (!mm) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   pr_debug("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+  

[PATCH v8 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-31 Thread Zi Yan
From: Zi Yan 

Further extend <debugfs>/split_huge_pages to accept
"<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
tmpfs may have file backed by THP that mapped nowhere.

Update selftest program to test file-backed THP split too.
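
For illustration, a hypothetical user-space helper for the file-backed case
(assuming debugfs is mounted at /sys/kernel/debug; offsets are page offsets
into the file, and the helper name is made up):

    #include <stdio.h>

    /* split file-backed THPs covering page offsets [off_start, off_end) */
    static int split_file_thp_range(const char *path, unsigned long off_start,
                                    unsigned long off_end)
    {
            FILE *f = fopen("/sys/kernel/debug/split_huge_pages", "w");

            if (!f)
                    return -1;
            /* "<path>,<pgoff_start>,<pgoff_end>", matching PATH_FMT in the selftest */
            fprintf(f, "%s,0x%lx,0x%lx", path, off_start, off_end);
            fclose(f);
            return 0;
    }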

Suggested-by: Kirill A. Shutemov 
Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
---
 mm/huge_memory.c  | 90 ++-
 .../selftests/vm/split_huge_page_test.c   | 82 +++--
 2 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1bcab247aea8..eb0f3aaf49f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3062,6 +3062,65 @@ static int split_huge_pages_pid(int pid, unsigned long 
vaddr_start,
return ret;
 }
 
+static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
+   pgoff_t off_end)
+{
+   struct filename *file;
+   struct file *candidate;
+   struct address_space *mapping;
+   int ret = -EINVAL;
+   pgoff_t index;
+   int nr_pages = 1;
+   unsigned long total = 0, split = 0;
+
+   file = getname_kernel(file_path);
+   if (IS_ERR(file))
+   return ret;
+
+   candidate = file_open_name(file, O_RDONLY, 0);
+   if (IS_ERR(candidate))
+   goto out;
+
+   pr_debug("split file-backed THPs in file: %s, page offset: [0x%lx - 
0x%lx]\n",
+file_path, off_start, off_end);
+
+   mapping = candidate->f_mapping;
+
+   for (index = off_start; index < off_end; index += nr_pages) {
+   struct page *fpage = pagecache_get_page(mapping, index,
+   FGP_ENTRY | FGP_HEAD, 0);
+
+   nr_pages = 1;
+   if (xa_is_value(fpage) || !fpage)
+   continue;
+
+   if (!is_transparent_hugepage(fpage))
+   goto next;
+
+   total++;
+   nr_pages = thp_nr_pages(fpage);
+
+   if (!trylock_page(fpage))
+   goto next;
+
+   if (!split_huge_page(fpage))
+   split++;
+
+   unlock_page(fpage);
+next:
+   put_page(fpage);
+   cond_resched();
+   }
+
+   filp_close(candidate, NULL);
+   ret = 0;
+
+   pr_debug("%lu of %lu file-backed THP split\n", split, total);
+out:
+   putname(file);
+   return ret;
+}
+
 #define MAX_INPUT_BUF_SZ 255
 
 static ssize_t split_huge_pages_write(struct file *file, const char __user 
*buf,
@@ -3069,7 +3128,8 @@ static ssize_t split_huge_pages_write(struct file *file, 
const char __user *buf,
 {
static DEFINE_MUTEX(split_debug_mutex);
ssize_t ret;
-   char input_buf[MAX_INPUT_BUF_SZ]; /* hold pid, start_vaddr, end_vaddr */
+   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end */
+   char input_buf[MAX_INPUT_BUF_SZ];
int pid;
unsigned long vaddr_start, vaddr_end;
 
@@ -3084,6 +3144,34 @@ static ssize_t split_huge_pages_write(struct file *file, 
const char __user *buf,
goto out;
 
input_buf[MAX_INPUT_BUF_SZ - 1] = '\0';
+
+   if (input_buf[0] == '/') {
+   char *tok;
+   char *buf = input_buf;
+   char file_path[MAX_INPUT_BUF_SZ];
+   pgoff_t off_start = 0, off_end = 0;
+   size_t input_len = strlen(input_buf);
+
+   tok = strsep(&buf, ",");
+   if (tok) {
+   strncpy(file_path, tok, MAX_INPUT_BUF_SZ);
+   } else {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end);
+   if (ret != 2) {
+   ret = -EINVAL;
+   goto out;
+   }
+   ret = split_huge_pages_in_file(file_path, off_start, off_end);
+   if (!ret)
+   ret = input_len;
+
+   goto out;
+   }
+
ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
if (ret == 1 && pid == 1) {
split_huge_pages_all();
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c 
b/tools/testing/selftests/vm/split_huge_page_test.c
index 2c0c18e60c57..1af16d2c2a0a 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -7,11 +7,13 @@
 #define _GNU_SOURCE
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -24,6 +26,9 @@ uint64_t pmd_pagesize;
 #define SMAP_PATH "/proc/self/smaps"
 #define INPUT_MAX 80
 
+#define PID_FMT "%d,0x%lx,0x%lx"
+#define PATH_FMT "%s,0x%lx,0x%lx"
+
 #define PFN_MASK ((1UL<<55)-1)
 #define KPF_TH

Re: [PATCH v7 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-31 Thread Zi Yan
On 31 Mar 2021, at 12:44, Matthew Wilcox wrote:

> On Mon, Mar 29, 2021 at 11:39:32AM -0400, Zi Yan wrote:
>> +for (off_cur = off_start; off_cur < off_end;) {
>> +struct page *fpage = pagecache_get_page(mapping, off_cur,
>> +FGP_ENTRY | FGP_HEAD, 0);
>> +
>> +if (xa_is_value(fpage) || !fpage) {
>> +off_cur += PAGE_SIZE;
>> +continue;
>> +}
>> +
>> +if (!is_transparent_hugepage(fpage)) {
>> +off_cur += PAGE_SIZE;
>> +goto next;
>> +}
>> +total++;
>> +off_cur = fpage->index + thp_size(fpage);
>
> That can't be right.  fpage->index is in units of pages and thp_size is
> in units of bytes.  I wish C had a better type system.
> I think you meant:
>
>   off_cur = fpage->index + thp_nr_pages(fpage);
>
> Also, I think this loop would read better as ...
>
>   for (index = off_start; index < off_end; index += nr_pages) {
>   struct page *fpage = pagecache_get_page(mapping, index,
>   FGP_ENTRY | FGP_HEAD, 0);
>   nr_pages = 1;
>   if (xa_is_value(fpage) || !fpage)
>   continue;
>   if (!is_transparent_hugepage(fpage))
>   goto next;
>   total++;
>   nr_pages = thp_nr_pages(fpage);
> ...

Thanks for catching this! I mixed it up with looping through VMAs, where
addresses are in units of bytes. I will fix this and use your suggested loop code.
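
To make the unit difference concrete (an illustrative fragment, not part of
the patch; "index" and "addr" are hypothetical locals): page cache offsets
count pages, while virtual addresses count bytes, so the strides differ:

    /* page cache walk: index is a pgoff_t, counted in pages */
    index = fpage->index + thp_nr_pages(fpage);

    /* virtual address walk: addr is counted in bytes */
    addr += thp_size(fpage);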

—
Best Regards,
Yan Zi




[PATCH v7 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-29 Thread Zi Yan
From: Zi Yan 

Further extend <debugfs>/split_huge_pages to accept
"<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
tmpfs may have file backed by THP that mapped nowhere.

Update selftest program to test file-backed THP split too.

Suggested-by: Kirill A. Shutemov 
Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
---
 mm/huge_memory.c  | 91 ++-
 .../selftests/vm/split_huge_page_test.c   | 81 -
 2 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1bcab247aea8..ca47f5a317f3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3062,6 +3062,66 @@ static int split_huge_pages_pid(int pid, unsigned long 
vaddr_start,
return ret;
 }
 
+static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
+   pgoff_t off_end)
+{
+   struct filename *file;
+   struct file *candidate;
+   struct address_space *mapping;
+   int ret = -EINVAL;
+   pgoff_t off_cur;
+   unsigned long total = 0, split = 0;
+
+   file = getname_kernel(file_path);
+   if (IS_ERR(file))
+   return ret;
+
+   candidate = file_open_name(file, O_RDONLY, 0);
+   if (IS_ERR(candidate))
+   goto out;
+
+   pr_debug("split file-backed THPs in file: %s, offset: [0x%lx - 
0x%lx]\n",
+file_path, off_start, off_end);
+
+   mapping = candidate->f_mapping;
+
+   for (off_cur = off_start; off_cur < off_end;) {
+   struct page *fpage = pagecache_get_page(mapping, off_cur,
+   FGP_ENTRY | FGP_HEAD, 0);
+
+   if (xa_is_value(fpage) || !fpage) {
+   off_cur += PAGE_SIZE;
+   continue;
+   }
+
+   if (!is_transparent_hugepage(fpage)) {
+   off_cur += PAGE_SIZE;
+   goto next;
+   }
+   total++;
+   off_cur = fpage->index + thp_size(fpage);
+
+   if (!trylock_page(fpage))
+   goto next;
+
+   if (!split_huge_page(fpage))
+   split++;
+
+   unlock_page(fpage);
+next:
+   put_page(fpage);
+   cond_resched();
+   }
+
+   filp_close(candidate, NULL);
+   ret = 0;
+
+   pr_debug("%lu of %lu file-backed THP split\n", split, total);
+out:
+   putname(file);
+   return ret;
+}
+
 #define MAX_INPUT_BUF_SZ 255
 
 static ssize_t split_huge_pages_write(struct file *file, const char __user 
*buf,
@@ -3069,7 +3129,8 @@ static ssize_t split_huge_pages_write(struct file *file, 
const char __user *buf,
 {
static DEFINE_MUTEX(split_debug_mutex);
ssize_t ret;
-   char input_buf[MAX_INPUT_BUF_SZ]; /* hold pid, start_vaddr, end_vaddr */
+   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end */
+   char input_buf[MAX_INPUT_BUF_SZ];
int pid;
unsigned long vaddr_start, vaddr_end;
 
@@ -3084,6 +3145,34 @@ static ssize_t split_huge_pages_write(struct file *file, 
const char __user *buf,
goto out;
 
input_buf[MAX_INPUT_BUF_SZ - 1] = '\0';
+
+   if (input_buf[0] == '/') {
+   char *tok;
+   char *buf = input_buf;
+   char file_path[MAX_INPUT_BUF_SZ];
+   pgoff_t off_start = 0, off_end = 0;
+   size_t input_len = strlen(input_buf);
+
+   tok = strsep(&buf, ",");
+   if (tok) {
+   strncpy(file_path, tok, MAX_INPUT_BUF_SZ);
+   } else {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end);
+   if (ret != 2) {
+   ret = -EINVAL;
+   goto out;
+   }
+   ret = split_huge_pages_in_file(file_path, off_start, off_end);
+   if (!ret)
+   ret = input_len;
+
+   goto out;
+   }
+
ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
if (ret == 1 && pid == 1) {
split_huge_pages_all();
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c 
b/tools/testing/selftests/vm/split_huge_page_test.c
index 2c0c18e60c57..845a63cdb052 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -7,11 +7,13 @@
 #define _GNU_SOURCE
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -24,6 +26,9 @@ uint64_t pmd_pagesize;
 #define SMAP_PATH "/proc/self/smaps"
 #define INPUT_MAX 80
 
+#define PID_FMT "%d,0x%lx,0x%lx"
+#define PATH_FMT "%s,0x%lx,0x%lx"

[PATCH v7 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-29 Thread Zi Yan
From: Zi Yan 

We did not have a direct user interface of splitting the compound page
backing a THP and there is no need unless we want to expose the THP
implementation details to users. Make <debugfs>/split_huge_pages accept
a new command to do that.

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages, THPs within the given virtual address range
from the process with the given pid are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.

This does not change the old behavior, i.e., writing 1 to the interface
to split all THPs in the system.

Changelog:
From v6:
1. pr_info -> pr_debug.
2. Added cond_resched() in all split loops. (suggested by David Rientjes)

From v5:
1. Skipped special VMAs and other fixes. (suggested by Yang Shi)

From v4:
1. Fixed the error code return issue, spotted by kernel test robot
   .

From v3:
1. Factored out split huge pages in the given pid code to a separate
   function.
2. Added the missing put_page for not split pages.
3. pr_debug -> pr_info, make reading results simpler.

From v2:
1. Reused existing /split_huge_pages interface. (suggested by
   Yang Shi)

From v1:
1. Removed unnecessary calling to vma_migratable, spotted by kernel test
   robot .
2. Dropped the use of find_mm_struct and code it directly, since there
   is no need for the permission check in that function and the function
   is only available when migration is on.
3. Added some comments in the selftest program to clarify how PTE-mapped
   THPs are formed.

Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
---
 mm/huge_memory.c  | 155 -
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 318 ++
 4 files changed, 467 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8eba529a0f17..1bcab247aea8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2927,16 +2928,14 @@ static struct shrinker deferred_split_shrinker = {
 };
 
 #ifdef CONFIG_DEBUG_FS
-static int split_huge_pages_set(void *data, u64 val)
+static void split_huge_pages_all(void)
 {
struct zone *zone;
struct page *page;
unsigned long pfn, max_zone_pfn;
unsigned long total = 0, split = 0;
 
-   if (val != 1)
-   return -EINVAL;
-
+   pr_debug("Split all THPs\n");
for_each_populated_zone(zone) {
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
@@ -2960,15 +2959,155 @@ static int split_huge_pages_set(void *data, u64 val)
unlock_page(page);
 next:
put_page(page);
+   cond_resched();
}
}
 
-   pr_info("%lu of %lu THP split\n", split, total);
+   pr_debug("%lu of %lu THP split\n", split, total);
+}
 
-   return 0;
+static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
+{
+   return vma_is_special_huge(vma) || (vma->vm_flags & VM_IO) ||
+   is_vm_hugetlb_page(vma);
+}
+
+static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
+   unsigned long vaddr_end)
+{
+   int ret = 0;
+   struct task_struct *task;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+   unsigned long addr;
+
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   /* Find the task_struct from pid */
+   rcu_read_lock();
+   task = find_task_by_vpid(pid);
+   if (!task) {
+   rcu_read_unlock();
+   ret = -ESRCH;
+   goto out;
+   }
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   /* Find the mm_struct */
+   mm = get_task_mm(task);
+   put_task_struct(task);
+
+   if (!mm) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   pr_debug("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start)
+   break;
+
+

Re: [PATCH 0/3] Cleanup for khugepaged

2021-03-25 Thread Zi Yan
On 25 Mar 2021, at 9:56, Miaohe Lin wrote:

> Hi all,
> This series contains cleanups to remove unnecessary out label and
> meaningless !pte_present() check. Also use helper function to simplify
> the code. More details can be found in the respective changelogs.
> Thanks!
>
> Miaohe Lin (3):
>   khugepaged: use helper function range_in_vma() in
> collapse_pte_mapped_thp()
>   khugepaged: remove unnecessary out label in collapse_huge_page()
>   khugepaged: remove meaningless !pte_present() check in
> khugepaged_scan_pmd()
>
>  mm/khugepaged.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)

All looks good to me. Thanks.

Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi




Re: [PATCH v6 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-24 Thread Zi Yan
On 24 Mar 2021, at 15:16, David Rientjes wrote:

> On Mon, 22 Mar 2021, Zi Yan wrote:
>
>> From: Zi Yan 
>>
>> We did not have a direct user interface of splitting the compound page
>> backing a THP and there is no need unless we want to expose the THP
>> implementation details to users. Make <debugfs>/split_huge_pages accept
>> a new command to do that.
>>
>> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
>> <debugfs>/split_huge_pages, THPs within the given virtual address range
>> from the process with the given pid are split. It is used to test
>> split_huge_page function. In addition, a selftest program is added to
>> tools/testing/selftests/vm to utilize the interface by splitting
>> PMD THPs and PTE-mapped THPs.
>>
>
> I'm curious if this is the only use case or whether you have additional
> use cases or extensions in mind?

At the moment, this is the only use case I have in mind. I am developing
1GB THP support and splitting huge pages to any lower order (this will be
useful once Matthew Wilcox’s large pages in the page cache land), and I found
there is no way of splitting specific THPs (the compound pages) from user
space. So I added this interface for testing, mostly for splitting one or two
specified THPs. A potential extension might be to add a new order parameter
to test my split_huge_page_to_any_order(), after it gets in.

>
> Specifically, I'm wondering if you are looking into more appropriately
> dividing the limited number of hugepages available on the system amongst
> the most latency sensitive processes?
>
> The set of hugepages available on the system can be limited by
> fragmentation.  We've found opportunities where latency sensitive
> processes would benefit from memory backed by thp, but they cannot be
> allocated at fault for this reason.  Yet, there are other processes that
> have memory backed by hugepages that may not be benefiting from them.
>
> Could this be useful to split a hugepage for a latency tolerant
> application, migrate the pages elsewhere, and make the now-free hugepage
> available for a latency sensitive application (through something like my
> MADV_COLLAPSE proposal?)

The idea sounds quite interesting and reasonable. In the scenario you described,
I just wonder whether we want to do SPLIT + COLLAPSE or whether we can combine
them into a single swap of the THP and the base pages (potentially using some
extra free pages as tmp space), because there is no guarantee that the split
subpages can all be migrated to make space for a new THP, and other allocations
might steal a free page from the split THP range, causing subsequent THP
allocation failures.

I am OK with exposing this via a proper user space interface and can prepare
a patch for it. I just want to know if there are other use cases of splitting
THPs (the compound pages).

>
> Couple questions inline.
>
>> This does not change the old behavior, i.e., writing 1 to the interface
>> to split all THPs in the system.
>>
>> Changelog:
>>
>> From v5:
>> 1. Skipped special VMAs and other fixes. (suggested by Yang Shi)
>>
>> From v4:
>> 1. Fixed the error code return issue, spotted by kernel test robot
>>.
>>
>> From v3:
>> 1. Factored out split huge pages in the given pid code to a separate
>>function.
>> 2. Added the missing put_page for not split pages.
>> 3. pr_debug -> pr_info, make reading results simpler.
>>
>> From v2:
>> 1. Reused existing /split_huge_pages interface. (suggested by
>>Yang Shi)
>>
>> From v1:
>> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
>>robot .
>> 2. Dropped the use of find_mm_struct and code it directly, since there
>>is no need for the permission check in that function and the function
>>is only available when migration is on.
>> 3. Added some comments in the selftest program to clarify how PTE-mapped
>>THPs are formed.
>>
>> Signed-off-by: Zi Yan 
>> Reviewed-by: Yang Shi 
>> ---
>>  mm/huge_memory.c  | 151 -
>>  tools/testing/selftests/vm/.gitignore |   1 +
>>  tools/testing/selftests/vm/Makefile   |   1 +
>>  .../selftests/vm/split_huge_page_test.c   | 318 ++
>>  4 files changed, 464 insertions(+), 7 deletions(-)
>>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index bff92dea5ab3..b653255a548e 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -7,6 +7,7 @@
>>
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -2922,16 +2923,14

[PATCH v6 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-22 Thread Zi Yan
From: Zi Yan 

Further extend <debugfs>/split_huge_pages to accept
"<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
tmpfs may have file backed by THP that mapped nowhere.

Update selftest program to test file-backed THP split too.

Suggested-by: Kirill A. Shutemov 
Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
---
 mm/huge_memory.c  | 91 ++-
 .../selftests/vm/split_huge_page_test.c   | 79 +++-
 2 files changed, 164 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b653255a548e..d3b20e101df2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3055,6 +3055,65 @@ static int split_huge_pages_pid(int pid, unsigned long 
vaddr_start,
return ret;
 }
 
+static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
+   pgoff_t off_end)
+{
+   struct filename *file;
+   struct file *candidate;
+   struct address_space *mapping;
+   int ret = -EINVAL;
+   pgoff_t off_cur;
+   unsigned long total = 0, split = 0;
+
+   file = getname_kernel(file_path);
+   if (IS_ERR(file))
+   return ret;
+
+   candidate = file_open_name(file, O_RDONLY, 0);
+   if (IS_ERR(candidate))
+   goto out;
+
+   pr_info("split file-backed THPs in file: %s, offset: [0x%lx - 0x%lx]\n",
+file_path, off_start, off_end);
+
+   mapping = candidate->f_mapping;
+
+   for (off_cur = off_start; off_cur < off_end;) {
+   struct page *fpage = pagecache_get_page(mapping, off_cur,
+   FGP_ENTRY | FGP_HEAD, 0);
+
+   if (xa_is_value(fpage) || !fpage) {
+   off_cur += PAGE_SIZE;
+   continue;
+   }
+
+   if (!is_transparent_hugepage(fpage)) {
+   off_cur += PAGE_SIZE;
+   goto next;
+   }
+   total++;
+   off_cur = fpage->index + thp_size(fpage);
+
+   if (!trylock_page(fpage))
+   goto next;
+
+   if (!split_huge_page(fpage))
+   split++;
+
+   unlock_page(fpage);
+next:
+   put_page(fpage);
+   }
+
+   filp_close(candidate, NULL);
+   ret = 0;
+
+   pr_info("%lu of %lu file-backed THP split\n", split, total);
+out:
+   putname(file);
+   return ret;
+}
+
 #define MAX_INPUT_BUF_SZ 255
 
 static ssize_t split_huge_pages_write(struct file *file, const char __user 
*buf,
@@ -3062,7 +3121,8 @@ static ssize_t split_huge_pages_write(struct file *file, 
const char __user *buf,
 {
static DEFINE_MUTEX(split_debug_mutex);
ssize_t ret;
-   char input_buf[MAX_INPUT_BUF_SZ]; /* hold pid, start_vaddr, end_vaddr */
+   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end */
+   char input_buf[MAX_INPUT_BUF_SZ];
int pid;
unsigned long vaddr_start, vaddr_end;
 
@@ -3077,6 +3137,35 @@ static ssize_t split_huge_pages_write(struct file *file, 
const char __user *buf,
goto out;
 
input_buf[MAX_INPUT_BUF_SZ - 1] = '\0';
+
+   if (input_buf[0] == '/') {
+   char *tok;
+   char *buf = input_buf;
+   char file_path[MAX_INPUT_BUF_SZ];
+   pgoff_t off_start = 0, off_end = 0;
+   size_t input_len = strlen(input_buf);
+
+   tok = strsep(&buf, ",");
+   if (tok) {
+   strncpy(file_path, tok, MAX_INPUT_BUF_SZ);
+   } else {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end);
+   if (ret != 2) {
+   pr_info("ret: %ld\n", ret);
+   ret = -EINVAL;
+   goto out;
+   }
+   ret = split_huge_pages_in_file(file_path, off_start, off_end);
+   if (!ret)
+   ret = input_len;
+
+   goto out;
+   }
+
ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
if (ret == 1 && pid == 1) {
split_huge_pages_all();
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c 
b/tools/testing/selftests/vm/split_huge_page_test.c
index 2c0c18e60c57..ebdf2d738978 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -7,11 +7,13 @@
 #define _GNU_SOURCE
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -24,6 +26,9 @@ uint64_t pmd_pagesize;
 #define SMAP_PATH "/proc/self/smaps"
 #define INPUT_MAX 80
 
+#define PID_FMT "%d,0x%lx,0x%lx"
+#define PATH_FM

[PATCH v6 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-22 Thread Zi Yan
From: Zi Yan 

We did not have a direct user interface of splitting the compound page
backing a THP and there is no need unless we want to expose the THP
implementation details to users. Make <debugfs>/split_huge_pages accept
a new command to do that.

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages, THPs within the given virtual address range
from the process with the given pid are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.

This does not change the old behavior, i.e., writing 1 to the interface
to split all THPs in the system.

Changelog:

From v5:
1. Skipped special VMAs and other fixes. (suggested by Yang Shi)

From v4:
1. Fixed the error code return issue, spotted by kernel test robot
   .

From v3:
1. Factored out split huge pages in the given pid code to a separate
   function.
2. Added the missing put_page for not split pages.
3. pr_debug -> pr_info, make reading results simpler.

From v2:
1. Reused existing /split_huge_pages interface. (suggested by
   Yang Shi)

From v1:
1. Removed unnecessary calling to vma_migratable, spotted by kernel test
   robot .
2. Dropped the use of find_mm_struct and code it directly, since there
   is no need for the permission check in that function and the function
   is only available when migration is on.
3. Added some comments in the selftest program to clarify how PTE-mapped
   THPs are formed.

Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
---
 mm/huge_memory.c  | 151 -
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 318 ++
 4 files changed, 464 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bff92dea5ab3..b653255a548e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
 };
 
 #ifdef CONFIG_DEBUG_FS
-static int split_huge_pages_set(void *data, u64 val)
+static void split_huge_pages_all(void)
 {
struct zone *zone;
struct page *page;
unsigned long pfn, max_zone_pfn;
unsigned long total = 0, split = 0;
 
-   if (val != 1)
-   return -EINVAL;
-
+   pr_info("Split all THPs\n");
for_each_populated_zone(zone) {
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
@@ -2959,11 +2958,149 @@ static int split_huge_pages_set(void *data, u64 val)
}
 
pr_info("%lu of %lu THP split\n", split, total);
+}
 
-   return 0;
+static inline bool vma_not_suitable_for_thp_split(struct vm_area_struct *vma)
+{
+   return vma_is_special_huge(vma) || (vma->vm_flags & VM_IO) ||
+   is_vm_hugetlb_page(vma);
 }
-DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
-   "%llu\n");
+
+static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
+   unsigned long vaddr_end)
+{
+   int ret = 0;
+   struct task_struct *task;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+   unsigned long addr;
+
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   /* Find the task_struct from pid */
+   rcu_read_lock();
+   task = find_task_by_vpid(pid);
+   if (!task) {
+   rcu_read_unlock();
+   ret = -ESRCH;
+   goto out;
+   }
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   /* Find the mm_struct */
+   mm = get_task_mm(task);
+   put_task_struct(task);
+
+   if (!mm) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   pr_info("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start)
+   break;
+
+   /* skip special VMA and hugetlb VMA */
+   if (vma_not_suitable_for_thp_split(vma)) {
+   addr = vma->vm_end;
+   continue;
+   }
+
+  

Re: [PATCH v5 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-21 Thread Zi Yan
On 19 Mar 2021, at 19:37, Yang Shi wrote:

> On Thu, Mar 18, 2021 at 5:52 PM Zi Yan  wrote:
>>
>> From: Zi Yan 
>>
>> We did not have a direct user interface of splitting the compound page
>> backing a THP and there is no need unless we want to expose the THP
>> implementation details to users. Make <debugfs>/split_huge_pages accept
>> a new command to do that.
>>
>> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
>> <debugfs>/split_huge_pages, THPs within the given virtual address range
>> from the process with the given pid are split. It is used to test
>> split_huge_page function. In addition, a selftest program is added to
>> tools/testing/selftests/vm to utilize the interface by splitting
>> PMD THPs and PTE-mapped THPs.
>>
>> This does not change the old behavior, i.e., writing 1 to the interface
>> to split all THPs in the system.
>>
>> Changelog:
>>
>> From v5:
>> 1. Skipped special VMAs and other fixes. (suggested by Yang Shi)
>
> Looks good to me. Reviewed-by: Yang Shi 
>
> Some nits below:
>
>>
>> From v4:
>> 1. Fixed the error code return issue, spotted by kernel test robot
>>.
>>
>> From v3:
>> 1. Factored out split huge pages in the given pid code to a separate
>>function.
>> 2. Added the missing put_page for not split pages.
>> 3. pr_debug -> pr_info, make reading results simpler.
>>
>> From v2:
>> 1. Reused existing /split_huge_pages interface. (suggested by
>>Yang Shi)
>>
>> From v1:
>> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
>>robot .
>> 2. Dropped the use of find_mm_struct and code it directly, since there
>>is no need for the permission check in that function and the function
>>is only available when migration is on.
>> 3. Added some comments in the selftest program to clarify how PTE-mapped
>>THPs are formed.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  mm/huge_memory.c  | 143 +++-
>>  tools/testing/selftests/vm/.gitignore |   1 +
>>  tools/testing/selftests/vm/Makefile   |   1 +
>>  .../selftests/vm/split_huge_page_test.c   | 318 ++
>>  4 files changed, 456 insertions(+), 7 deletions(-)
>>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index bff92dea5ab3..9bf9bc489228 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -7,6 +7,7 @@
>>
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
>>  };
>>
>>  #ifdef CONFIG_DEBUG_FS
>> -static int split_huge_pages_set(void *data, u64 val)
>> +static void split_huge_pages_all(void)
>>  {
>> struct zone *zone;
>> struct page *page;
>> unsigned long pfn, max_zone_pfn;
>> unsigned long total = 0, split = 0;
>>
>> -   if (val != 1)
>> -   return -EINVAL;
>> -
>> +   pr_info("Split all THPs\n");
>> for_each_populated_zone(zone) {
>> max_zone_pfn = zone_end_pfn(zone);
>> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>> @@ -2959,11 +2958,141 @@ static int split_huge_pages_set(void *data, u64 val)
>> }
>>
>> pr_info("%lu of %lu THP split\n", split, total);
>> +}
>>
>> -   return 0;
>> +static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>> +   unsigned long vaddr_end)
>> +{
>> +   int ret = 0;
>> +   struct task_struct *task;
>> +   struct mm_struct *mm;
>> +   unsigned long total = 0, split = 0;
>> +   unsigned long addr;
>> +
>> +   vaddr_start &= PAGE_MASK;
>> +   vaddr_end &= PAGE_MASK;
>> +
>> +   /* Find the task_struct from pid */
>> +   rcu_read_lock();
>> +   task = find_task_by_vpid(pid);
>> +   if (!task) {
>> +   rcu_read_unlock();
>> +   ret = -ESRCH;
>> +   goto out;
>> +   }
>> +   get_task_struct(task);
>> +   rcu_read_unlock();
>> +
>> +   /* Find the mm_struct */
>> +   mm = get_task_mm(task);
>> +   put_task_struct(task);
>> +
>> +   if (!mm) {
>> +   ret = -EINVAL;
>> +

[PATCH v5 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-18 Thread Zi Yan
From: Zi Yan 

We did not have a direct user interface of splitting the compound page
backing a THP and there is no need unless we want to expose the THP
implementation details to users. Make <debugfs>/split_huge_pages accept
a new command to do that.

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages, THPs within the given virtual address range
from the process with the given pid are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.

This does not change the old behavior, i.e., writing 1 to the interface
to split all THPs in the system.

Changelog:

From v5:
1. Skipped special VMAs and other fixes. (suggested by Yang Shi)

From v4:
1. Fixed the error code return issue, spotted by kernel test robot
   .

From v3:
1. Factored out split huge pages in the given pid code to a separate
   function.
2. Added the missing put_page for not split pages.
3. pr_debug -> pr_info, make reading results simpler.

From v2:
1. Reused existing /split_huge_pages interface. (suggested by
   Yang Shi)

From v1:
1. Removed unnecessary calling to vma_migratable, spotted by kernel test
   robot .
2. Dropped the use of find_mm_struct and code it directly, since there
   is no need for the permission check in that function and the function
   is only available when migration is on.
3. Added some comments in the selftest program to clarify how PTE-mapped
   THPs are formed.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  | 143 +++-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 318 ++
 4 files changed, 456 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bff92dea5ab3..9bf9bc489228 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
 };
 
 #ifdef CONFIG_DEBUG_FS
-static int split_huge_pages_set(void *data, u64 val)
+static void split_huge_pages_all(void)
 {
struct zone *zone;
struct page *page;
unsigned long pfn, max_zone_pfn;
unsigned long total = 0, split = 0;
 
-   if (val != 1)
-   return -EINVAL;
-
+   pr_info("Split all THPs\n");
for_each_populated_zone(zone) {
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
@@ -2959,11 +2958,141 @@ static int split_huge_pages_set(void *data, u64 val)
}
 
pr_info("%lu of %lu THP split\n", split, total);
+}
 
-   return 0;
+static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
+   unsigned long vaddr_end)
+{
+   int ret = 0;
+   struct task_struct *task;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+   unsigned long addr;
+
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   /* Find the task_struct from pid */
+   rcu_read_lock();
+   task = find_task_by_vpid(pid);
+   if (!task) {
+   rcu_read_unlock();
+   ret = -ESRCH;
+   goto out;
+   }
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   /* Find the mm_struct */
+   mm = get_task_mm(task);
+   put_task_struct(task);
+
+   if (!mm) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   pr_info("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start)
+   break;
+
+   /* skip special VMA and hugetlb VMA */
+   if (vma_is_special_huge(vma) || is_vm_hugetlb_page(vma)) {
+   addr = vma->vm_end;
+   continue;
+   }
+
+   /* FOLL_DUMP to ignore special (like zero) pages */
+   follflags = FOLL_GET | FOLL_DUMP;
+   page = follow_page(vma, addr, follflags);
+
+   if (IS_ERR(page))
+   continue;
+   if (!page)
+   continue;
+
+   if (!is_tr

[PATCH v5 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-18 Thread Zi Yan
From: Zi Yan 

Further extend <debugfs>/split_huge_pages to accept
"<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
tmpfs may have file backed by THP that mapped nowhere.

Update selftest program to test file-backed THP split too.

Suggested-by: Kirill A. Shutemov 
Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  | 97 ++-
 .../selftests/vm/split_huge_page_test.c   | 79 ++-
 2 files changed, 168 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9bf9bc489228..6d6537cc8c56 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3049,12 +3049,74 @@ static int split_huge_pages_pid(int pid, unsigned long 
vaddr_start,
return ret;
 }
 
+static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
+   pgoff_t off_end)
+{
+   struct filename *file;
+   struct file *candidate;
+   struct address_space *mapping;
+   int ret = -EINVAL;
+   pgoff_t off_cur;
+   unsigned long total = 0, split = 0;
+
+   file = getname_kernel(file_path);
+   if (IS_ERR(file))
+   return ret;
+
+   candidate = file_open_name(file, O_RDONLY, 0);
+   if (IS_ERR(candidate))
+   goto out;
+
+   pr_info("split file-backed THPs in file: %s, offset: [0x%lx - 0x%lx]\n",
+file_path, off_start, off_end);
+
+   mapping = candidate->f_mapping;
+
+   for (off_cur = off_start; off_cur < off_end;) {
+   struct page *fpage = pagecache_get_page(mapping, off_cur,
+   FGP_ENTRY | FGP_HEAD, 0);
+
+   if (xa_is_value(fpage) || !fpage) {
+   off_cur += PAGE_SIZE;
+   continue;
+   }
+
+   if (!is_transparent_hugepage(fpage)) {
+   off_cur += PAGE_SIZE;
+   goto next;
+   }
+   total++;
+   off_cur = fpage->index + thp_size(fpage);
+
+   if (!trylock_page(fpage))
+   goto next;
+
+   if (!split_huge_page(fpage))
+   split++;
+
+   unlock_page(fpage);
+next:
+   put_page(fpage);
+   }
+
+   filp_close(candidate, NULL);
+   ret = 0;
+
+   pr_info("%lu of %lu file-backed THP split\n", split, total);
+out:
+   putname(file);
+   return ret;
+}
+
+#define MAX_INPUT_BUF_SZ 255
+
 static ssize_t split_huge_pages_write(struct file *file, const char __user 
*buf,
size_t count, loff_t *ppops)
 {
static DEFINE_MUTEX(split_debug_mutex);
ssize_t ret;
-   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end */
+   char input_buf[MAX_INPUT_BUF_SZ];
int pid;
unsigned long vaddr_start, vaddr_end;
 
@@ -3064,11 +3126,40 @@ static ssize_t split_huge_pages_write(struct file 
*file, const char __user *buf,
 
ret = -EFAULT;
 
-   memset(input_buf, 0, 80);
+   memset(input_buf, 0, MAX_INPUT_BUF_SZ);
if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
goto out;
 
-   input_buf[79] = '\0';
+   input_buf[MAX_INPUT_BUF_SZ - 1] = '\0';
+
+   if (input_buf[0] == '/') {
+   char *tok;
+   char *buf = input_buf;
+   char file_path[MAX_INPUT_BUF_SZ];
+   pgoff_t off_start = 0, off_end = 0;
+   size_t input_len = strlen(input_buf);
+
+   tok = strsep(&buf, ",");
+   if (tok) {
+   strncpy(file_path, tok, MAX_INPUT_BUF_SZ);
+   } else {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end);
+   if (ret != 2) {
+   pr_info("ret: %ld\n", ret);
+   ret = -EINVAL;
+   goto out;
+   }
+   ret = split_huge_pages_in_file(file_path, off_start, off_end);
+   if (!ret)
+   ret = input_len;
+
+   goto out;
+   }
+
ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
if (ret == 1 && pid == 1) {
split_huge_pages_all();
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c 
b/tools/testing/selftests/vm/split_huge_page_test.c
index 2c0c18e60c57..ebdf2d738978 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -7,11 +7,13 @@
 #define _GNU_SOURCE
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -24,6 +26,9 @@ uint64_t pmd_pagesize;
 #de

Re: [PATCH v3 5/6] mm/huge_memory.c: remove unused macro TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG

2021-03-18 Thread Zi Yan
On 18 Mar 2021, at 8:27, Miaohe Lin wrote:

> The commit 4958e4d86ecb ("mm: thp: remove debug_cow switch") forgot to
> remove TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG macro. Remove it here.
>
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/huge_mm.h | 3 ---
>  1 file changed, 3 deletions(-)

LGTM. Zi Yan 

—
Best Regards,
Yan Zi




Re: [PATCH v3 3/6] mm/huge_memory.c: rework the function do_huge_pmd_numa_page() slightly

2021-03-18 Thread Zi Yan
On 18 Mar 2021, at 8:27, Miaohe Lin wrote:

> The current code that checks if migrating misplaced transhuge page is
> needed is pretty hard to follow. Rework it and add a comment to make
> its logic more clear and improve readability.
>
> Signed-off-by: Miaohe Lin 
> ---
>  mm/huge_memory.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
LGTM. Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi




Re: [PATCH v3 2/6] mm/huge_memory.c: make get_huge_zero_page() return bool

2021-03-18 Thread Zi Yan
On 18 Mar 2021, at 8:27, Miaohe Lin wrote:

> It's guaranteed that huge_zero_page will not be NULL if huge_zero_refcount
> is increased successfully. When READ_ONCE(huge_zero_page) is returned,
> there must be a huge_zero_page and it can be replaced with returning 'true'
> when we do not care about the value of huge_zero_page. We can thus make it
> return bool to save READ_ONCE cpu cycles as the return value is just used
> to check if huge_zero_page exists.
>
> Signed-off-by: Miaohe Lin 
> ---
>  mm/huge_memory.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
LGTM. Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi




Re: [PATCH v4 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-17 Thread Zi Yan
On 16 Mar 2021, at 19:18, Yang Shi wrote:

> On Mon, Mar 15, 2021 at 1:34 PM Zi Yan  wrote:
>>
>> From: Zi Yan 
>>
>> Further extend <debugfs>/split_huge_pages to accept
>> "<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests since
>> tmpfs may have file backed by THP that mapped nowhere.
>>
>> Update selftest program to test file-backed THP split too.
>>
>> Suggested-by: Kirill A. Shutemov 
>> Signed-off-by: Zi Yan 
>> ---
>>  mm/huge_memory.c  | 95 ++-
>>  .../selftests/vm/split_huge_page_test.c   | 79 ++-
>>  2 files changed, 166 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 3bfee54e2cd0..da91ee97d944 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3043,12 +3043,72 @@ static int split_huge_pages_pid(int pid, unsigned 
>> long vaddr_start,
>> return ret;
>>  }
>>
>> +static int split_huge_pages_in_file(const char *file_path, pgoff_t 
>> off_start,
>> +   pgoff_t off_end)
>> +{
>> +   struct filename *file;
>> +   struct file *candidate;
>> +   struct address_space *mapping;
>> +   int ret = -EINVAL;
>> +   pgoff_t off_cur;
>> +   unsigned long total = 0, split = 0;
>> +
>> +   file = getname_kernel(file_path);
>> +   if (IS_ERR(file))
>> +   return ret;
>> +
>> +   candidate = file_open_name(file, O_RDONLY, 0);
>> +   if (IS_ERR(candidate))
>> +   goto out;
>> +
>> +   pr_info("split file-backed THPs in file: %s, offset: [0x%lx - 
>> 0x%lx]\n",
>> +file_path, off_start, off_end);
>> +
>> +   mapping = candidate->f_mapping;
>> +
>> +   for (off_cur = off_start; off_cur < off_end;) {
>> +   struct page *fpage = pagecache_get_page(mapping, off_cur,
>> +   FGP_ENTRY | FGP_HEAD, 0);
>> +
>> +   if (xa_is_value(fpage) || !fpage) {
>
> Why do you have FGP_ENTRY? It seems it would return page instead of
> NULL if page is value. So I think you could remove FGP_ENTRY and
> xa_is_value() check as well.

The comment on FGP_ENTRY says “If there is a shadow/swap/DAX entry, return
it instead of allocating a new page to replace it”. I do not think we
want to allocate new pages here. I mostly follow the use of pagecache_get_page()
in shmem_getpage_gfp without swapin or allocating new pages.

>
>> +   off_cur += PAGE_SIZE;
>> +   continue;
>> +   }
>> +
>> +   if (!is_transparent_hugepage(fpage)) {
>> +   off_cur += PAGE_SIZE;
>> +   goto next;
>> +   }
>> +   total++;
>> +   off_cur = fpage->index + thp_size(fpage);
>> +
>> +   if (!trylock_page(fpage))
>> +   goto next;
>> +
>> +   if (!split_huge_page(fpage))
>> +   split++;
>> +
>> +   unlock_page(fpage);
>> +next:
>> +   put_page(fpage);
>> +   }
>> +
>> +   filp_close(candidate, NULL);
>> +   ret = 0;
>> +
>> +   pr_info("%lu of %lu file-backed THP split\n", split, total);
>> +out:
>> +   putname(file);
>> +   return ret;
>> +}
>> +
>>  static ssize_t split_huge_pages_write(struct file *file, const char __user 
>> *buf,
>> size_t count, loff_t *ppops)
>>  {
>> static DEFINE_MUTEX(mutex);
>> ssize_t ret;
>> -   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
>> +   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end 
>> */
>> +   char input_buf[MAX_INPUT];
>
> I didn't find where MAX_INPUT is defined in your patch. Just saw
> include/uapi/linux/limits.h have it defined. Is it the one you really
> refer to?

Yeah, I wanted to use 255 as the max input size and found MAX_INPUT. From your
comment, I think it is better to define a macro here explicitly.


>> int pid;
>> unsigned long vaddr_start, vaddr_end;
>>
>> @@ -3058,11 +3118,40 @@ static ssize_t split_huge_pages_write(struct file 
>> *file, const char __user *buf,
>>
>> ret = -EFAULT;
>>
>> -   memset(input_buf, 0, 80);
>> +   memset(input_buf, 0, MAX_I

Re: [PATCH v4 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-17 Thread Zi Yan
On 16 Mar 2021, at 18:23, Yang Shi wrote:

> On Mon, Mar 15, 2021 at 1:34 PM Zi Yan  wrote:
>>
>> From: Zi Yan 
>>
>> We did not have a direct user interface of splitting the compound page
>> backing a THP and there is no need unless we want to expose the THP
>> implementation details to users. Make <debugfs>/split_huge_pages accept
>> a new command to do that.
>>
>> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
>> <debugfs>/split_huge_pages, THPs within the given virtual address range
>> from the process with the given pid are split. It is used to test
>> split_huge_page function. In addition, a selftest program is added to
>> tools/testing/selftests/vm to utilize the interface by splitting
>> PMD THPs and PTE-mapped THPs.
>>
>> This does not change the old behavior, i.e., writing 1 to the interface
>> to split all THPs in the system.
>>
>> Changelog:
>>
>> From v3:
>> 1. Factored out split huge pages in the given pid code to a separate
>>function.
>> 2. Added the missing put_page for not split pages.
>> 3. pr_debug -> pr_info, make reading results simpler.
>>
>> From v2:
>>
>> 1. Reused existing /split_huge_pages interface. (suggested by
>>Yang Shi)
>>
>> From v1:
>>
>> 1. Removed unnecessary calling to vma_migratable, spotted by kernel test
>>robot .
>> 2. Dropped the use of find_mm_struct and code it directly, since there
>>    is no need for the permission check in that function and the function
>>is only available when migration is on.
>> 3. Added some comments in the selftest program to clarify how PTE-mapped
>>THPs are formed.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  mm/huge_memory.c  | 136 +++-
>>  tools/testing/selftests/vm/.gitignore |   1 +
>>  tools/testing/selftests/vm/Makefile   |   1 +
>>  .../selftests/vm/split_huge_page_test.c   | 313 ++
>>  4 files changed, 444 insertions(+), 7 deletions(-)
>>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index bff92dea5ab3..3bfee54e2cd0 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -7,6 +7,7 @@
>>
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
>>  };
>>
>>  #ifdef CONFIG_DEBUG_FS
>> -static int split_huge_pages_set(void *data, u64 val)
>> +static void split_huge_pages_all(void)
>>  {
>> struct zone *zone;
>> struct page *page;
>> unsigned long pfn, max_zone_pfn;
>> unsigned long total = 0, split = 0;
>>
>> -   if (val != 1)
>> -   return -EINVAL;
>> -
>> +   pr_info("Split all THPs\n");
>> for_each_populated_zone(zone) {
>> max_zone_pfn = zone_end_pfn(zone);
>> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>> @@ -2959,11 +2958,134 @@ static int split_huge_pages_set(void *data, u64 val)
>> }
>>
>> pr_info("%lu of %lu THP split\n", split, total);
>> +}
>>
>> -   return 0;
>> +static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>> +   unsigned long vaddr_end)
>> +{
>> +   int ret = 0;
>> +   struct task_struct *task;
>> +   struct mm_struct *mm;
>> +   unsigned long total = 0, split = 0;
>> +   unsigned long addr;
>> +
>> +   vaddr_start &= PAGE_MASK;
>> +   vaddr_end &= PAGE_MASK;
>> +
>> +   /* Find the task_struct from pid */
>> +   rcu_read_lock();
>> +   task = find_task_by_vpid(pid);
>> +   if (!task) {
>> +   rcu_read_unlock();
>> +   ret = -ESRCH;
>> +   goto out;
>> +   }
>> +   get_task_struct(task);
>> +   rcu_read_unlock();
>> +
>> +   /* Find the mm_struct */
>> +   mm = get_task_mm(task);
>> +   put_task_struct(task);
>> +
>> +   if (!mm) {
>> +   ret = -EINVAL;
>> +   goto out;
>> +   }
>> +
>> +   pr_info("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
>> +pid, vaddr_start, vaddr_end);
>> +
>> +   mmap_read_lock(mm);
>> +   /*
>> +* alway

Re: [PATCH v4 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-15 Thread Zi Yan
On 15 Mar 2021, at 20:36, kernel test robot wrote:

> Hi Zi,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on kselftest/next]
> [also build test WARNING on linux/master linus/master v5.12-rc3]
> [cannot apply to hnaz-linux-mm/master next-20210315]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:
> https://github.com/0day-ci/linux/commits/Zi-Yan/mm-huge_memory-a-new-debugfs-interface-for-splitting-THP-tests/20210316-043528
> base:   
> https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
> config: i386-randconfig-m021-20210315 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> smatch warnings:
> mm/huge_memory.c:3086 split_huge_pages_write() warn: sscanf doesn't return 
> error codes
>
> vim +3086 mm/huge_memory.c
>
>   3051
>   3052static ssize_t split_huge_pages_write(struct file *file, const 
> char __user *buf,
>   3053size_t count, loff_t *ppops)
>   3054{
>   3055static DEFINE_MUTEX(mutex);
>   3056ssize_t ret;
>   3057char input_buf[80]; /* hold pid, start_vaddr, end_vaddr 
> */
>   3058int pid;
>   3059unsigned long vaddr_start, vaddr_end;
>   3060
>   3061ret = mutex_lock_interruptible();
>   3062if (ret)
>   3063return ret;
>   3064
>   3065ret = -EFAULT;
>   3066
>   3067memset(input_buf, 0, 80);
>   3068if (copy_from_user(input_buf, buf, min_t(size_t, count, 
> 80)))
>   3069goto out;
>   3070
>   3071input_buf[79] = '\0';
>   3072ret = sscanf(input_buf, "%d,0x%lx,0x%lx", , 
> _start, _end);
>   3073if (ret == 1 && pid == 1) {
>   3074split_huge_pages_all();
>   3075ret = strlen(input_buf);
>   3076goto out;
>   3077} else if (ret != 3) {
>   3078ret = -EINVAL;
>   3079goto out;
>   3080}
>   3081
>   3082if (!split_huge_pages_pid(pid, vaddr_start, vaddr_end))
>   3083ret = strlen(input_buf);

Change this to:

ret = split_huge_pages_pid(pid, vaddr_start, vaddr_end);
if (!ret)
        ret = strlen(input_buf);

should fix the warning. I will resend after I get feedback for the patches.


>   3084out:
>   3085mutex_unlock();
>> 3086 return ret;
>   3087





—
Best Regards,
Yan Zi




[PATCH v4 1/2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-15 Thread Zi Yan
From: Zi Yan 

We did not have a direct user interface for splitting the compound page
backing a THP, and there is no need for one unless we want to expose the THP
implementation details to users. Make <debugfs>/split_huge_pages accept
a new command to do that.

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages, THPs within the given virtual address range
from the process with the given pid are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.

This does not change the old behavior, i.e., writing 1 to the interface
to split all THPs in the system.

Changelog:

From v3:
1. Factored out split huge pages in the given pid code to a separate
   function.
2. Added the missing put_page for not split pages.
3. pr_debug -> pr_info, make reading results simpler.

From v2:

1. Reused existing /split_huge_pages interface. (suggested by
   Yang Shi)

From v1:

1. Removed unnecessary calling to vma_migratable, spotted by kernel test
   robot .
2. Dropped the use of find_mm_struct and code it directly, since there
   is no need for the permission check in that function and the function
   is only available when migration is on.
3. Added some comments in the selftest program to clarify how PTE-mapped
   THPs are formed.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  | 136 +++-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 313 ++
 4 files changed, 444 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bff92dea5ab3..3bfee54e2cd0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2922,16 +2923,14 @@ static struct shrinker deferred_split_shrinker = {
 };
 
 #ifdef CONFIG_DEBUG_FS
-static int split_huge_pages_set(void *data, u64 val)
+static void split_huge_pages_all(void)
 {
struct zone *zone;
struct page *page;
unsigned long pfn, max_zone_pfn;
unsigned long total = 0, split = 0;
 
-   if (val != 1)
-   return -EINVAL;
-
+   pr_info("Split all THPs\n");
for_each_populated_zone(zone) {
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
@@ -2959,11 +2958,134 @@ static int split_huge_pages_set(void *data, u64 val)
}
 
pr_info("%lu of %lu THP split\n", split, total);
+}
 
-   return 0;
+static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
+   unsigned long vaddr_end)
+{
+   int ret = 0;
+   struct task_struct *task;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+   unsigned long addr;
+
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   /* Find the task_struct from pid */
+   rcu_read_lock();
+   task = find_task_by_vpid(pid);
+   if (!task) {
+   rcu_read_unlock();
+   ret = -ESRCH;
+   goto out;
+   }
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   /* Find the mm_struct */
+   mm = get_task_mm(task);
+   put_task_struct(task);
+
+   if (!mm) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   pr_info("Split huge pages in pid: %d, vaddr: [0x%lx - 0x%lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start)
+   break;
+
+   /* FOLL_DUMP to ignore special (like zero) pages */
+   follflags = FOLL_GET | FOLL_DUMP;
+   page = follow_page(vma, addr, follflags);
+
+   if (IS_ERR(page))
+   break;
+   if (!page)
+   break;
+
+   if (!is_transparent_hugepage(page))
+   goto next;
+
+   total++;
+   if (!can_split_huge_page(compound_head(page), NULL))
+   goto next;
+
+   if (!trylock_page(page))
+   goto next;
+
+   if (!split_huge_page(page))
+   split++;
+
+   unlock_page(page);
+next:
+

[PATCH v4 2/2] mm: huge_memory: debugfs for file-backed THP split.

2021-03-15 Thread Zi Yan
From: Zi Yan 

Further extend <debugfs>/split_huge_pages to accept
"<path>,<pgoff_start>,<pgoff_end>" for file-backed THP split tests, since
tmpfs may have files backed by THPs that are mapped nowhere.

Update selftest program to test file-backed THP split too.

Suggested-by: Kirill A. Shutemov 
Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  | 95 ++-
 .../selftests/vm/split_huge_page_test.c   | 79 ++-
 2 files changed, 166 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3bfee54e2cd0..da91ee97d944 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3043,12 +3043,72 @@ static int split_huge_pages_pid(int pid, unsigned long 
vaddr_start,
return ret;
 }
 
+static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
+   pgoff_t off_end)
+{
+   struct filename *file;
+   struct file *candidate;
+   struct address_space *mapping;
+   int ret = -EINVAL;
+   pgoff_t off_cur;
+   unsigned long total = 0, split = 0;
+
+   file = getname_kernel(file_path);
+   if (IS_ERR(file))
+   return ret;
+
+   candidate = file_open_name(file, O_RDONLY, 0);
+   if (IS_ERR(candidate))
+   goto out;
+
+   pr_info("split file-backed THPs in file: %s, offset: [0x%lx - 0x%lx]\n",
+file_path, off_start, off_end);
+
+   mapping = candidate->f_mapping;
+
+   for (off_cur = off_start; off_cur < off_end;) {
+   struct page *fpage = pagecache_get_page(mapping, off_cur,
+   FGP_ENTRY | FGP_HEAD, 0);
+
+   if (xa_is_value(fpage) || !fpage) {
+   off_cur += PAGE_SIZE;
+   continue;
+   }
+
+   if (!is_transparent_hugepage(fpage)) {
+   off_cur += PAGE_SIZE;
+   goto next;
+   }
+   total++;
+   off_cur = fpage->index + thp_size(fpage);
+
+   if (!trylock_page(fpage))
+   goto next;
+
+   if (!split_huge_page(fpage))
+   split++;
+
+   unlock_page(fpage);
+next:
+   put_page(fpage);
+   }
+
+   filp_close(candidate, NULL);
+   ret = 0;
+
+   pr_info("%lu of %lu file-backed THP split\n", split, total);
+out:
+   putname(file);
+   return ret;
+}
+
 static ssize_t split_huge_pages_write(struct file *file, const char __user 
*buf,
size_t count, loff_t *ppops)
 {
static DEFINE_MUTEX(mutex);
ssize_t ret;
-   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+   /* hold pid, start_vaddr, end_vaddr or file_path, off_start, off_end */
+   char input_buf[MAX_INPUT];
int pid;
unsigned long vaddr_start, vaddr_end;
 
@@ -3058,11 +3118,40 @@ static ssize_t split_huge_pages_write(struct file 
*file, const char __user *buf,
 
ret = -EFAULT;
 
-   memset(input_buf, 0, 80);
+   memset(input_buf, 0, MAX_INPUT);
if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
goto out;
 
-   input_buf[79] = '\0';
+   input_buf[MAX_INPUT - 1] = '\0';
+
+   if (input_buf[0] == '/') {
+   char *tok;
+   char *buf = input_buf;
+   char file_path[MAX_INPUT];
+   pgoff_t off_start = 0, off_end = 0;
+   size_t input_len = strlen(input_buf);
+
+   tok = strsep(&buf, ",");
+   if (tok) {
+   strncpy(file_path, tok, MAX_INPUT);
+   } else {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end);
+   if (ret != 2) {
+   pr_info("ret: %ld\n", ret);
+   ret = -EINVAL;
+   goto out;
+   }
+   ret = split_huge_pages_in_file(file_path, off_start, off_end);
+   if (!ret)
+   ret = input_len;
+
+   goto out;
+   }
+
ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
if (ret == 1 && pid == 1) {
split_huge_pages_all();
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c 
b/tools/testing/selftests/vm/split_huge_page_test.c
index 9f33ddbb3182..0202702f7eda 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -7,11 +7,13 @@
 #define _GNU_SOURCE
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -24,6 +26,9 @@ uint64_t pmd_pagesize;
 #define SMAP_PATH "/proc/self/smaps"
 #define INPUT_MAX 80
 
+#define

Re: [PATCH v3] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-15 Thread Zi Yan
On 15 Mar 2021, at 8:07, Kirill A. Shutemov wrote:

> On Thu, Mar 11, 2021 at 07:57:12PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> We do not have a direct user interface of splitting the compound page
>> backing a THP
>
> But we do. You expand it.
>
>> and there is no need unless we want to expose the THP
>> implementation details to users. Make /split_huge_pages accept
>> a new command to do that.
>>
>> By writing ",," to
>> /split_huge_pages, THPs within the given virtual address range
>> from the process with the given pid are split. It is used to test
>> split_huge_page function. In addition, a selftest program is added to
>> tools/testing/selftests/vm to utilize the interface by splitting
>> PMD THPs and PTE-mapped THPs.
>>
>
> Okay, makes sense.
>
> But it doesn't cover non-mapped THPs. tmpfs may have file backed by THP
> that mapped nowhere. Do we want to cover this case too?

Sure. It would be useful once large pages land in the page cache too. I will
send v4 with tmpfs THP split. I will definitely need a review for it, since
I am not familiar with getting a page from a file path.

> Maybe have PID:<pid>,<vaddr_start>,<vaddr_end> and
> FILE:<path>,<off_start>,<off_end> ?

Or just check input[0] == '/' for file path input, roughly as in the sketch below.
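
Roughly this kind of dispatch (only a sketch; split_huge_pages_in_file() is a
helper that does not exist yet):

        if (input_buf[0] == '/') {
                /* "<file_path>,<off_start>,<off_end>" */
                char *buf = input_buf;
                char *path = strsep(&buf, ",");
                pgoff_t off_start, off_end;

                if (!path || !buf ||
                    sscanf(buf, "0x%lx,0x%lx", &off_start, &off_end) != 2) {
                        ret = -EINVAL;
                        goto out;
                }
                ret = split_huge_pages_in_file(path, off_start, off_end);
                goto out;
        }
        /* otherwise keep the existing "<pid>,<vaddr_start>,<vaddr_end>" parsing */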


—
Best Regards,
Yan Zi




Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries

2021-03-14 Thread Zi Yan
On 14 Mar 2021, at 20:03, Yu Zhao wrote:

> On Sun, Mar 14, 2021 at 10:51:03PM +, Matthew Wilcox wrote:
>> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
>>> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
>>>
>>>> Some architectures support the accessed bit on non-leaf PMD entries
>>>> (parents) in addition to leaf PTE entries (children) where pages are
>>>> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
>>>> as part of linear-address translation [1]. Page table walkers who are
>>>> interested in the accessed bit on children can take advantage of this:
>>>> they do not need to search the children when the accessed bit is not
>>>> set on a parent, given that they have previously cleared the accessed
>>>> bit on this parent in addition to its children.
>>>>
>>>> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>>>>  Volume 3 (October 2019), section 4.8
>>>
>>> Just curious. Does this also apply to non-leaf PUD entries? Do you
>>> mind sharing which sentence from the manual gives the information?
>>
>> The first few sentences from 4.8:
>>
>> : For any paging-structure entry that is used during linear-address
>> : translation, bit 5 is the accessed flag. For paging-structure
>> : entries that map a page (as opposed to referencing another paging
>> : structure), bit 6 is the dirty flag. These flags are provided for
>> : use by memory-management software to manage the transfer of pages and
>> : paging structures into and out of physical memory.
>>
>> : Whenever the processor uses a paging-structure entry as part of
>> : linear-address translation, it sets the accessed flag in that entry
>> : (if it is not already set).

Matthew, thanks for the pointer.

>
> As far as I know x86 is the one that supports this.
>
>> The way they differentiate between the A and D bits makes it clear to
>> me that the A bit is set at each level of the tree, but the D bit is
>> only set on leaf entries.
>
> And the difference makes perfect sense (to me). Kudos to Intel.

Hi Yu,

You only introduced HAVE_ARCH_PARENT_PMD_YOUNG but no
HAVE_ARCH_PARENT_PUD_YOUNG. Is PUD granularity too large to be useful for
the multigenerational LRU algorithm?

Thanks.

—
Best Regards,
Yan Zi




Re: [PATCH v1 00/14] Multigenerational LRU

2021-03-14 Thread Zi Yan
On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> TLDR
> 
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> a performant, versatile and straightforward augment.
>
> Repo
> 
> git fetch https://linux-mm.googlesource.com/page-reclaim 
> refs/changes/01/1101/1
>
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1101
>
> Background
> ==
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.
>
> Problems
> 
> Notion of the active/inactive
> -
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. And false active/inactive rates are relatively high. In
> addition, scans of largely varying numbers of pages are unpredictable
> because inactive_is_low() is based on magic numbers.
>
> For phones and laptops, the eviction is biased toward file pages
> because the selection has to resort to heuristics as direct
> comparisons between anon and file types are infeasible. On Android and
> Chrome OS, executable pages are frequently evicted despite the fact
> that there are many less recently used anon pages. This causes "janks"
> (slow UI rendering) and negatively impacts user experience.
>
> For systems with multiple nodes and/or memcgs, it is impossible to
> compare lruvecs based on the notion of the active/inactive.
>
> Incremental scans via the rmap
> --
> Each incremental scan picks up at where the last scan left off and
> stops after it has found a handful of unreferenced pages. For most of
> the systems running cloud workloads, incremental scans lose the
> advantage under sustained memory pressure due to high ratios of the
> number of scanned pages to the number of reclaimed pages. In our case,
> the average ratio of pgscan to pgsteal is about 7.
>
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>4.63%  do_raw_spin_lock
>3.89%  vma_interval_tree_iter_next
>3.33%  vma_interval_tree_subtree_search
>
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>7.61%  do_raw_spin_lock
>5.69%  vma_interval_tree_iter_next
>4.91%  vma_interval_tree_subtree_search
>3.71%  page_referenced_one
>
> Solutions
> =
> Notion of generation numbers
> 
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> configurable generations, and thus they have relatively low false
> active/inactive rates. Each generation includes all pages that have
> been referenced since the last generation.
>
> Given an lruvec, scans and the selections between anon and file types
> are all based on generation numbers, which are simple and yet
> effective. For different lruvecs, comparisons are still possible based
> on birth times of generations.
>
> Differential scans via page tables
> --
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for most of the
> systems running cloud workloads.
>
> On Chrome OS, our real-world benchmark that browses popular websites
> in multiple tabs demonstrates 51% less CPU usage from kswapd and 52%
> (full) less PSI on v5.11. And kswapd profile looks like:
>   49.36%  lzo1x_1_do_compress
>4.54%  page_vma_mapped_walk
>4.45%  memset_erms
>3.47%  walk_pte_range
>2.88%  zram_bvec_rw

Is this profile from a system with this patchset applied or not?
Do you mind sharing some profiling data from before and after applying
the patchset? That would make it easier to see the improvement brought by
this patchset.

>
> In addition, direct reclaim latency is reduced by 22% at 99th
> 

Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries

2021-03-14 Thread Zi Yan
On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page table walkers who are
> interested in the accessed bit on children can take advantage of this:
> they do not need to search the children when the accessed bit is not
> set on a parent, given that they have previously cleared the accessed
> bit on this parent in addition to its children.
>
> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>  Volume 3 (October 2019), section 4.8

Just curious. Does this also apply to non-leaf PUD entries? Do you
mind sharing which sentence from the manual gives the information?

Thanks.

—
Best Regards,
Yan Zi




[PATCH v3] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-11 Thread Zi Yan
From: Zi Yan 

We do not have a direct user interface for splitting the compound page
backing a THP, and there is no need for one unless we want to expose the THP
implementation details to users. Make <debugfs>/split_huge_pages accept
a new command to do that.

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages, THPs within the given virtual address range
from the process with the given pid are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.

This does not change the old behavior, i.e., writing 1 to the interface
to split all THPs in the system.

Changelog:

From v2:

1. Reused existing /split_huge_pages interface. (suggested by
   Yang Shi)

From v1:

1. Removed unnecessary calling to vma_migratable, spotted by kernel test
   robot .
2. Dropped the use of find_mm_struct and code it directly, since there
   is no need for the permission check in that function and the function
   is only available when migration is on.
3. Added some comments in the selftest program to clarify how PTE-mapped
   THPs are formed.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  | 122 ++-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 313 ++
 4 files changed, 430 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bff92dea5ab3..f9fdff286a94 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2922,16 +2923,13 @@ static struct shrinker deferred_split_shrinker = {
 };
 
 #ifdef CONFIG_DEBUG_FS
-static int split_huge_pages_set(void *data, u64 val)
+static void split_huge_pages_all(void)
 {
struct zone *zone;
struct page *page;
unsigned long pfn, max_zone_pfn;
unsigned long total = 0, split = 0;
 
-   if (val != 1)
-   return -EINVAL;
-
for_each_populated_zone(zone) {
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
@@ -2959,11 +2957,121 @@ static int split_huge_pages_set(void *data, u64 val)
}
 
pr_info("%lu of %lu THP split\n", split, total);
+}
+
+static ssize_t split_huge_pages_write(struct file *file, const char __user 
*buf,
+   size_t count, loff_t *ppops)
+{
+   static DEFINE_MUTEX(mutex);
+   ssize_t ret;
+   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+   int pid;
+   unsigned long vaddr_start, vaddr_end, addr;
+   struct task_struct *task;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+
+   ret = mutex_lock_interruptible(&mutex);
+   if (ret)
+   return ret;
+
+   ret = -EFAULT;
+
+   memset(input_buf, 0, 80);
+   if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
+   goto out;
+
+   input_buf[79] = '\0';
+   ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
+   if (ret == 1 && pid == 1) {
+   split_huge_pages_all();
+   ret = strlen(input_buf);
+   goto out;
+   } else if (ret != 3) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   ret = strlen(input_buf);
+   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   /* Find the task_struct from pid */
+   rcu_read_lock();
+   task = find_task_by_vpid(pid);
+   if (!task) {
+   rcu_read_unlock();
+   ret = -ESRCH;
+   goto out;
+   }
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   /* Find the mm_struct */
+   mm = get_task_mm(task);
+   put_task_struct(task);
+
+   if (!mm) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start)
+   break;
+
+   /* FOLL_DUMP to ignore special (like zero) pages */
+   follflags = FOLL_GET | FOLL_DUMP;
+   page = follow_page(vma, addr, follflags);
+
+

Re: [PATCH v2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-11 Thread Zi Yan
On 10 Mar 2021, at 20:12, Yang Shi wrote:

> On Wed, Mar 10, 2021 at 7:36 AM Zi Yan  wrote:
>>
>> From: Zi Yan 
>>
>> We do not have a direct user interface of splitting the compound page
>> backing a THP and there is no need unless we want to expose the THP
>> implementation details to users. Adding an interface for debugging.
>>
>> By writing ",," to
>> /split_huge_pages_in_range_pid, THPs within the given virtual
>
> Can we reuse the existing split_huge_page knob instead of creating a new one?
>
> Two knobs for splitting huge pages on debugging purpose seem
> overkilling to me IMHO. I'm wondering if we could check if a special
> value (e.g. 1 or -1) is written then split all THPs as split_huge_page
> knob does?
>
> I don't think this interface is used widely so the risk should be very
> low for breaking userspace.

Thanks for the suggestion.

I prefer a separate interface to keep the input handling simpler. I am also
planning to enhance this interface later to enable splitting huge pages to
any lower order once Matthew Wilcox’s large pages in the page cache land,
so it is better to keep it separate from the existing split_huge_pages.
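
For example, the input could later grow an optional order field, i.e.
"<pid>,<vaddr_start>,<vaddr_end>[,<new_order>]" (purely illustrative, nothing
is decided yet):

        int new_order = 0;      /* not given: split all the way to base pages */

        ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d",
                     &pid, &vaddr_start, &vaddr_end, &new_order);
        if (ret != 3 && ret != 4) {
                ret = -EINVAL;
                goto out;
        }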

—
Best Regards,
Yan Zi




Re: [RFC PATCH 0/3] hugetlb: add demote/split page functionality

2021-03-10 Thread Zi Yan
On 10 Mar 2021, at 12:05, Michal Hocko wrote:

> On Wed 10-03-21 11:46:57, Zi Yan wrote:
>> On 10 Mar 2021, at 11:23, Michal Hocko wrote:
>>
>>> On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
>>> [...]
>>>> Converting larger to smaller hugetlb pages can be accomplished today by
>>>> first freeing the larger page to the buddy allocator and then allocating
>>>> the smaller pages.  However, there are two issues with this approach:
>>>> 1) This process can take quite some time, especially if allocation of
>>>>the smaller pages is not immediate and requires migration/compaction.
>>>> 2) There is no guarantee that the total size of smaller pages allocated
>>>>will match the size of the larger page which was freed.  This is
>>>>because the area freed by the larger page could quickly be
>>>>fragmented.
>>>
>>> I will likely not surprise to show some level of reservation. While your
>>> concerns about reconfiguration by existing interfaces are quite real is
>>> this really a problem in practice? How often do you need such a
>>> reconfiguration?
>>>
>>> Is this all really worth the additional code to something as tricky as
>>> hugetlb code base?
>>>
>>>>  include/linux/hugetlb.h |   8 ++
>>>>  mm/hugetlb.c| 199 +++-
>>>>  2 files changed, 204 insertions(+), 3 deletions(-)
>>>>
>>>> -- 
>>>> 2.29.2
>>>>
>>
>> The high level goal of this patchset seems to enable flexible huge page
>> allocation from a single pool, when multiple huge page sizes are available
>> to use. The limitation of existing mechanism is that user has to specify
>> how many huge pages he/she wants and how many gigantic pages he/she wants
>> before the actual use.
>
> I believe I have understood this part. And I am not questioning that.
> This seems useful. I am mostly asking whether we need such a
> flexibility. Mostly because of the additional code and future
> maintenance complexity which has turned to be a problem for a long time.
> Each new feature tends to just add on top of the existing complexity.

I totally agree. This patchset looks to me like a partial functional
replication of splitting high-order free pages into lower-order ones in the
buddy allocator. That is why I had the crazy idea below.

>
>> I just want to throw an idea here, please ignore if it is too crazy.
>> Could we have a variant buddy allocator for huge page allocations,
>> which only has available huge page orders in the free list? For example,
>> if user wants 2MB and 1GB pages, the allocator will only have order-9 and
>> order-18 pages; when order-9 pages run out, we can split order-18 pages;
>> if possible, adjacent order-9 pages can be merged back to order-18 pages.
>
> I assume you mean to remove those pages from the allocator when they
> are reserved rather than really used, right? I am not really sure how

No. The allocator maintains all the pages reserved for huge page allocations,
replacing the existing cma_alloc or alloc_contig_pages path. The kernel builds
the free lists when pages are reserved, either at boot time or at runtime.

> you want to deal with lower orders consuming/splitting too much from
> higher orders which then makes those unusable for the use even though
> they were preallocated for a specific workload. Another worry is that a
> gap between 2MB and 1GB pages is just too big so a single 2MB request
> from 1G pool will make the whole 1GB page unusable even when the smaller
> pool needs few pages.

Yeah, the gap between 2MB and 1GB is large, so fragmentation will be
a problem. Maybe we do not need merging right now, since this patchset does not
propose promoting/merging pages. Or we could reuse the existing
anti-fragmentation mechanisms, but with the pageblock size set to the gigantic
page size in this pool.

I admit my idea is a much more intrusive change, but if more functional
replications of core mm keep being added to the hugetlb code, why not reuse
the core mm code instead?


—
Best Regards,
Yan Zi




Re: [RFC PATCH 0/3] hugetlb: add demote/split page functionality

2021-03-10 Thread Zi Yan
On 10 Mar 2021, at 11:23, Michal Hocko wrote:

> On Mon 08-03-21 16:18:52, Mike Kravetz wrote:
> [...]
>> Converting larger to smaller hugetlb pages can be accomplished today by
>> first freeing the larger page to the buddy allocator and then allocating
>> the smaller pages.  However, there are two issues with this approach:
>> 1) This process can take quite some time, especially if allocation of
>>the smaller pages is not immediate and requires migration/compaction.
>> 2) There is no guarantee that the total size of smaller pages allocated
>>will match the size of the larger page which was freed.  This is
>>because the area freed by the larger page could quickly be
>>fragmented.
>
> I will likely not surprise to show some level of reservation. While your
> concerns about reconfiguration by existing interfaces are quite real is
> this really a problem in practice? How often do you need such a
> reconfiguration?
>
> Is this all really worth the additional code to something as tricky as
> hugetlb code base?
>
>>  include/linux/hugetlb.h |   8 ++
>>  mm/hugetlb.c| 199 +++-
>>  2 files changed, 204 insertions(+), 3 deletions(-)
>>
>> -- 
>> 2.29.2
>>

The high-level goal of this patchset seems to be enabling flexible huge page
allocation from a single pool when multiple huge page sizes are available
to use. The limitation of the existing mechanism is that the user has to
specify how many huge pages and how many gigantic pages they want before
the actual use.

I just want to throw an idea here, please ignore if it is too crazy.
Could we have a variant of the buddy allocator for huge page allocations,
which only has the available huge page orders on its free lists? For example,
if the user wants 2MB and 1GB pages, the allocator will only have order-9 and
order-18 pages; when order-9 pages run out, we can split order-18 pages;
if possible, adjacent order-9 pages can be merged back into order-18 pages.
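
To make the idea a bit more concrete, here is a very rough sketch of the data
structure and the split-on-allocation path (all names are hypothetical,
hp_expand() stands in for an expand()-like helper, and it glosses over which
intermediate orders the pool would really keep):

/* hypothetical pool: one free list per supported huge page order,
 * e.g. order-9 (2MB) and order-18 (1GB) with a 4KB base page */
#define HP_MAX_ORDER    18

struct hugepage_pool {
        struct list_head free_list[HP_MAX_ORDER + 1];
        unsigned long nr_free[HP_MAX_ORDER + 1];
};

/* take one page of @order, splitting a larger free page when the
 * requested order has run out (like __rmqueue_smallest()/expand()) */
static struct page *hp_alloc(struct hugepage_pool *pool, unsigned int order)
{
        unsigned int o;

        for (o = order; o <= HP_MAX_ORDER; o++) {
                struct page *page;

                if (list_empty(&pool->free_list[o]))
                        continue;
                page = list_first_entry(&pool->free_list[o], struct page, lru);
                list_del(&page->lru);
                pool->nr_free[o]--;
                /* hp_expand(): put the unused remainders of @page back on
                 * the free lists between @order and @o, like expand() does */
                hp_expand(pool, page, order, o);
                return page;
        }
        return NULL;    /* pool exhausted */
}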


—
Best Regards,
Yan Zi




[PATCH v2] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-10 Thread Zi Yan
From: Zi Yan 

We do not have a direct user interface for splitting the compound page
backing a THP, and there is no need for one unless we want to expose the THP
implementation details to users. Add an interface for debugging.

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages_in_range_pid, THPs within the given virtual
address range from the process with the given pid are split. It is used
to test split_huge_page function. In addition, a selftest program is
added to tools/testing/selftests/vm to utilize the interface by
splitting PMD THPs and PTE-mapped THPs.

Changelog:

From v1:

1. Removed unnecessary calling to vma_migratable, spotted by kernel test
   robot .
2. Dropped the use of find_mm_struct and code it directly, since there
   is no need for the permission check in that function and the function
   is only available when migration is on.
3. Added some comments in the selftest program to clarify how PTE-mapped
   THPs are formed.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  | 112 ++
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 320 ++
 4 files changed, 434 insertions(+)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bff92dea5ab3..7797e8b2aba0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2965,10 +2966,121 @@ static int split_huge_pages_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
"%llu\n");
 
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
+   const char __user *buf, size_t count, loff_t *ppops)
+{
+   static DEFINE_MUTEX(mutex);
+   ssize_t ret;
+   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+   int pid;
+   unsigned long vaddr_start, vaddr_end, addr;
+   struct task_struct *task;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+
+   ret = mutex_lock_interruptible(&mutex);
+   if (ret)
+   return ret;
+
+   ret = -EFAULT;
+
+   memset(input_buf, 0, 80);
+   if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
+   goto out;
+
+   input_buf[79] = '\0';
+   ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
+   if (ret != 3) {
+   ret = -EINVAL;
+   goto out;
+   }
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   ret = strlen(input_buf);
+   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   /* Find the task_struct from pid */
+   rcu_read_lock();
+   task = find_task_by_vpid(pid);
+   if (!task) {
+   rcu_read_unlock();
+   ret = -ESRCH;
+   goto out;
+   }
+   get_task_struct(task);
+   rcu_read_unlock();
+
+   /* Find the mm_struct */
+   mm = get_task_mm(task);
+   put_task_struct(task);
+
+   if (!mm) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start)
+   break;
+
+   /* FOLL_DUMP to ignore special (like zero) pages */
+   follflags = FOLL_GET | FOLL_DUMP;
+   page = follow_page(vma, addr, follflags);
+
+   if (IS_ERR(page))
+   break;
+   if (!page)
+   break;
+
+   if (!is_transparent_hugepage(page))
+   continue;
+
+   total++;
+   if (!can_split_huge_page(compound_head(page), NULL))
+   continue;
+
+   if (!trylock_page(page))
+   continue;
+
+   if (!split_huge_page(page))
+   split++;
+
+   unlock_page(page);
+   put_page(page);
+   }
+   mmap_read_unlock(mm);
+   mmput(mm);
+
+   pr_debug("%lu of %lu THP split\n", split, total);
+out:
+   mutex_unlock(&mutex);
+   return ret;
+
+}
+
+static const struct file_operations split_huge_pages_in_range_pid_fops = {
+   .owner   = THIS_MODULE,
+   .write   = split_huge_pages_in_range_pid_write,
+   .llseek  = no_llseek,
+};
+
 static in

Re: [PATCH v2 1/5] userfaultfd: support minor fault handling for shmem

2021-03-09 Thread Zi Yan
On 1 Mar 2021, at 19:01, Axel Rasmussen wrote:

> Modify the userfaultfd register API to allow registering shmem VMAs in
> minor mode. Modify the shmem mcopy implementation to support
> UFFDIO_CONTINUE in order to resolve such faults.
>
> Combine the shmem mcopy handler functions into a single
> shmem_mcopy_atomic_pte, which takes a mode parameter. This matches how
> the hugetlbfs implementation is structured, and lets us remove a good
> chunk of boilerplate.
>
> Signed-off-by: Axel Rasmussen 
> ---
>  fs/userfaultfd.c |  6 +--
>  include/linux/shmem_fs.h | 26 -
>  include/uapi/linux/userfaultfd.h |  4 +-
>  mm/memory.c  |  8 +--
>  mm/shmem.c   | 92 +++-
>  mm/userfaultfd.c | 27 +-
>  6 files changed, 79 insertions(+), 84 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 14f92285d04f..9f3b8684cf3c 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(struct 
> vm_area_struct *vma,
>   }
>
>   if (vm_flags & VM_UFFD_MINOR) {
> - /* FIXME: Add minor fault interception for shmem. */
> - if (!is_vm_hugetlb_page(vma))
> + if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma)))
>   return false;
>   }
>
> @@ -1941,7 +1940,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
>   /* report all available features and ioctls to userland */
>   uffdio_api.features = UFFD_API_FEATURES;
>  #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
> - uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS;
> + uffdio_api.features &=
> + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
>  #endif
>   uffdio_api.ioctls = UFFD_API_IOCTLS;
>   ret = -EFAULT;
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index d82b6f396588..f0919c3722e7 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  /* inode in-kernel data */
>
> @@ -122,21 +123,16 @@ static inline bool shmem_file(struct file *file)
>  extern bool shmem_charge(struct inode *inode, long pages);
>  extern void shmem_uncharge(struct inode *inode, long pages);
>
> +#ifdef CONFIG_USERFAULTFD
>  #ifdef CONFIG_SHMEM
> -extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> -   struct vm_area_struct *dst_vma,
> -   unsigned long dst_addr,
> -   unsigned long src_addr,
> -   struct page **pagep);
> -extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm,
> - pmd_t *dst_pmd,
> - struct vm_area_struct *dst_vma,
> - unsigned long dst_addr);
> -#else
> -#define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \
> -src_addr, pagep)({ BUG(); 0; })
> -#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \
> -  dst_addr)  ({ BUG(); 0; })
> -#endif
> +int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> +struct vm_area_struct *dst_vma,
> +unsigned long dst_addr, unsigned long src_addr,
> +enum mcopy_atomic_mode mode, struct page **pagep);
> +#else /* !CONFIG_SHMEM */
> +#define shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \
> +src_addr, mode, pagep)({ BUG(); 0; })
> +#endif /* CONFIG_SHMEM */
> +#endif /* CONFIG_USERFAULTFD */
>
>  #endif
> diff --git a/include/uapi/linux/userfaultfd.h 
> b/include/uapi/linux/userfaultfd.h
> index bafbeb1a2624..47d9790d863d 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -31,7 +31,8 @@
>  UFFD_FEATURE_MISSING_SHMEM | \
>  UFFD_FEATURE_SIGBUS |\
>  UFFD_FEATURE_THREAD_ID | \
> -UFFD_FEATURE_MINOR_HUGETLBFS)
> +UFFD_FEATURE_MINOR_HUGETLBFS |   \
> +UFFD_FEATURE_MINOR_SHMEM)
>  #define UFFD_API_IOCTLS  \
>   ((__u64)1 << _UFFDIO_REGISTER | \
>(__u64)1 << _UFFDIO_UNREGISTER |   \
> @@ -196,6 +197,7 @@ struct uffdio_api {
>  #define UFFD_FEATURE_SIGBUS  (1<<7)
>  #define UFFD_FEATURE_THREAD_ID   (1<<8)
>  #define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9)
> +#define UFFD_FEATURE_MINOR_SHMEM (1<<10)
>   __u64 features;
>
>   __u64 ioctls;
> diff --git a/mm/memory.c b/mm/memory.c
> index c8e357627318..a1e5ff55027e 100644
> --- a/mm/memory.c
> +++ 

Re: [PATCH v4 3/3] x86/vmemmap: Handle unpopulated sub-pmd ranges

2021-03-09 Thread Zi Yan
On 1 Mar 2021, at 3:32, Oscar Salvador wrote:

> When the size of a struct page is not multiple of 2MB, sections do
> not span a PMD anymore and so when populating them some parts of the
> PMD will remain unused.
> Because of this, PMDs will be left behind when depopulating sections
> since remove_pmd_table() thinks that those unused parts are still in
> use.
>
> Fix this by marking the unused parts with PAGE_UNUSED, so memchr_inv()
> will do the right thing and will let us free the PMD when the last user
> of it is gone.
>
> This patch is based on a similar patch by David Hildenbrand:
>
> https://lore.kernel.org/linux-mm/20200722094558.9828-9-da...@redhat.com/
> https://lore.kernel.org/linux-mm/20200722094558.9828-10-da...@redhat.com/
>
> Signed-off-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 
> ---
>  arch/x86/mm/init_64.c | 106 
> ++
>  1 file changed, 98 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ecb3c488ac8..7e8de63f02b3 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -871,7 +871,93 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   return add_pages(nid, start_pfn, nr_pages, params);
>  }
>
> -#define PAGE_INUSE 0xFD
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +#define PAGE_UNUSED 0xFD
> +
> +/*
> + * The unused vmemmap range, which was not yet memset(PAGE_UNUSED) ranges
> + * from unused_pmd_start to next PMD_SIZE boundary.
> + */
> +static unsigned long unused_pmd_start __meminitdata;
> +
> +static void __meminit vmemmap_flush_unused_pmd(void)
> +{
> + if (!unused_pmd_start)
> + return;
> + /*
> +  * Clears (unused_pmd_start, PMD_END]
> +  */
> + memset((void *)unused_pmd_start, PAGE_UNUSED,
> +ALIGN(unused_pmd_start, PMD_SIZE) - unused_pmd_start);
> + unused_pmd_start = 0;
> +}
> +
> +/* Returns true if the PMD is completely unused and thus it can be freed */
> +static bool __meminit vmemmap_unuse_sub_pmd(unsigned long addr, unsigned 
> long end)
> +{
> + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> +
> + vmemmap_flush_unused_pmd();
> + memset((void *)addr, PAGE_UNUSED, end - addr);
> +
> + return !memchr_inv((void *)start, PAGE_UNUSED, PMD_SIZE);
> +}
> +
> +static void __meminit __vmemmap_use_sub_pmd(unsigned long start)
> +{
> + /*
> +  * As we expect to add in the same granularity as we remove, it's
> +  * sufficient to mark only some piece used to block the memmap page from
> +  * getting removed when removing some other adjacent memmap (just in
> +  * case the first memmap never gets initialized e.g., because the memory
> +  * block never gets onlined).
> +  */
> + memset((void *)start, 0, sizeof(struct page));
> +}
> +
> +static void __meminit vmemmap_use_sub_pmd(unsigned long start, unsigned long 
> end)
> +{
> + /*
> +  * We only optimize if the new used range directly follows the
> +  * previously unused range (esp., when populating consecutive sections).
> +  */
> + if (unused_pmd_start == start) {
> + if (likely(IS_ALIGNED(end, PMD_SIZE)))
> + unused_pmd_start = 0;
> + else
> + unused_pmd_start = end;
> + return;
> + }
> +
> + vmemmap_flush_unused_pmd();
> + __vmemmap_use_sub_pmd(start);
> +}
> +
> +static void __meminit vmemmap_use_new_sub_pmd(unsigned long start, unsigned 
> long end)
> +{
> + vmemmap_flush_unused_pmd();
> +
> + /*
> +  * Could be our memmap page is filled with PAGE_UNUSED already from a
> +  * previous remove.
> +  */
> + __vmemmap_use_sub_pmd(start);
> +
> + /*
> +  * Mark the unused parts of the new memmap range
> +  */
> + if (!IS_ALIGNED(start, PMD_SIZE))
> + memset((void *)start, PAGE_UNUSED,
> +start - ALIGN_DOWN(start, PMD_SIZE));
> + /*
> +  * We want to avoid memset(PAGE_UNUSED) when populating the vmemmap of
> +  * consecutive sections. Remember for the last added PMD the last
> +  * unused range in the populated PMD.
> +  */
> + if (!IS_ALIGNED(end, PMD_SIZE))
> + unused_pmd_start = end;
> +}
> +#endif
>
>  static void __meminit free_pagetable(struct page *page, int order)
>  {
> @@ -1006,7 +1092,6 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, 
> unsigned long end,
>   unsigned long next, pages = 0;
>   pte_t *pte_base;
>   pmd_t *pmd;
> - void *page_addr;
>
>   pmd = pmd_start + pmd_index(addr);
>   for (; addr < end; addr = next, pmd++) {
> @@ -1027,12 +1112,11 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long 
> addr, unsigned long end,
>   spin_unlock(_mm.page_table_lock);
>   pages++;
>   } else {
> - /* If here, we are freeing vmemmap pages. */
> -  

Re: [PATCH] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-08 Thread Zi Yan
On 8 Mar 2021, at 14:23, kernel test robot wrote:

> Hi Zi,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on kselftest/next]
> [also build test ERROR on linux/master linus/master v5.12-rc2 next-20210305]
> [cannot apply to hnaz-linux-mm/master]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:
> https://github.com/0day-ci/linux/commits/Zi-Yan/mm-huge_memory-a-new-debugfs-interface-for-splitting-THP-tests/20210308-232339
> base:   
> https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next
> config: x86_64-randconfig-a015-20210308 (attached as .config)
> compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 
> 3a11a41795bec548e91621caaa4cc00fc31b2212)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install x86_64 cross compiling tool for clang build
> # apt-get install binutils-x86-64-linux-gnu
> # 
> https://github.com/0day-ci/linux/commit/961321af55684845ebc1e13e4c4e7c0da14a476a
> git remote add linux-review https://github.com/0day-ci/linux
> git fetch --no-tags linux-review 
> Zi-Yan/mm-huge_memory-a-new-debugfs-interface-for-splitting-THP-tests/20210308-232339
> git checkout 961321af55684845ebc1e13e4c4e7c0da14a476a
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All errors (new ones prefixed by >>):
>
>>> mm/huge_memory.c:3026:40: error: implicit declaration of function 
>>> 'vma_migratable' [-Werror,-Wimplicit-function-declaration]
>if (!vma || addr < vma->vm_start || !vma_migratable(vma))
> ^
>1 error generated.

There is no need to call vma_migratable() here. Will remove it.

Thanks for catching it.


—
Best Regards,
Yan Zi




Re: [PATCH] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-08 Thread Zi Yan
On 8 Mar 2021, at 13:11, David Hildenbrand wrote:

> On 08.03.21 18:49, Zi Yan wrote:
>> On 8 Mar 2021, at 11:17, David Hildenbrand wrote:
>>
>>> On 08.03.21 16:22, Zi Yan wrote:
>>>> From: Zi Yan 
>>>>
>>>> By writing ",," to
>>>> /split_huge_pages_in_range_pid, THPs in the process with the
>>>> given pid and virtual address range are split. It is used to test
>>>> split_huge_page function. In addition, a selftest program is added to
>>>> tools/testing/selftests/vm to utilize the interface by splitting
>>>> PMD THPs and PTE-mapped THPs.
>>>
>>> Won't something like
>>>
>>> 1. MADV_HUGEPAGE
>>>
>>> 2. Access memory
>>>
>>> 3. MADV_NOHUGEPAGE
>>>
>>> Have a similar effect? What's the benefit of this?
>>
>> Thanks for checking the patch.
>>
>> No, MADV_NOHUGEPAGE just replaces VM_HUGEPAGE with VM_NOHUGEPAGE,
>> nothing else will be done.
>
> Ah, okay - maybe my memory was tricking me. There is some s390x KVM code that 
> forces MADV_NOHUGEPAGE and force-splits everything.
>
> I do wonder, though, if this functionality would be worth a proper user 
> interface (e.g., madvise), though. There might be actual benefit in having 
> this as a !debug interface.
>
> I think you aware of the discussion in 
> https://lkml.kernel.org/r/d098c392-273a-36a4-1a29-59731cdf5...@google.com

Yes. Thanks for bringing this up.

>
> If there will be an interface to collapse a THP -- "this memory area is worth 
> extra performance now by collapsing a THP if possible" -- it might also be 
> helpful to have the opposite functionality -- "this memory area is not worth 
> a THP, rather use that somehwere else".
>
> MADV_HUGE_COLLAPSE vs. MADV_HUGE_SPLIT

I agree that MADV_HUGE_SPLIT would be useful as the opposite of COLLAPSE when
the user might just want PAGE_SIZE mappings. Right now, HUGE_SPLIT is implicit
in mapping changes like mprotect or MADV_DONTNEED.

My debugfs interface is a little different here, since it splits the compound
pages mapped by the PMD mappings (of course the PMD mappings are split too),
whereas madvise only splits the PMDs. I did not put it in a !debug interface
because I do not think we want to expose the kernel mechanism (the compound
page) to users and let them decide when to split the compound page or not.
MADV_HUGE_COLLAPSE is different, because we need to form a compound THP to be
able to get PMD mappings. But I am happy to change my mind if we find it
useful to let users split compound THPs via a !debug interface.


>
> Just a thought.
>
>>
>> Without this, we do not have a way of splitting a specific THP
>> (the compound page) via any user interface for debugging.
>> We can only write to /split_huge_pages to split all THPs
>> in the system, which has an unwanted effect on the whole system
>> and can overwhelm us with a lot of information. This new debugfs
>> interface provides a more precise method.
>>
>>>>
>>>> Signed-off-by: Zi Yan 
>>>> ---
>>>>mm/huge_memory.c  |  98 ++
>>>>mm/internal.h |   1 +
>>>>mm/migrate.c  |   2 +-
>>>>tools/testing/selftests/vm/.gitignore |   1 +
>>>>tools/testing/selftests/vm/Makefile   |   1 +
>>>>.../selftests/vm/split_huge_page_test.c   | 318 ++
>>>>6 files changed, 420 insertions(+), 1 deletion(-)
>>>>create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 395c75111d33..818172f887bf 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -7,6 +7,7 @@
>>>> #include 
>>>>#include 
>>>> +#include 
>>>>#include 
>>>>#include 
>>>>#include 
>>>> @@ -2971,10 +2972,107 @@ static int split_huge_pages_set(void *data, u64 
>>>> val)
>>>>DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, 
>>>> split_huge_pages_set,
>>>>"%llu\n");
>>>>   +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
>>>> +  const char __user *buf, size_t count, loff_t *ppops)
>>>> +{
>>>> +  static DEFINE_MUTEX(mutex);
>>>> +  ssize_t ret;
>>>> +  char input_buf[80]; /* hold pid, start_vaddr, 

Re: [PATCH] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-08 Thread Zi Yan
+ the rest of cc back and move your reply inline.

On 8 Mar 2021, at 12:47, Mika Penttilä wrote:
>>
>>
>> On 8.3.2021 17.22, Zi Yan wrote:
>>> From: Zi Yan 
>>>
>>> By writing ",," to
>>> /split_huge_pages_in_range_pid, THPs in the process with the
>>> given pid and virtual address range are split. It is used to test
>>> split_huge_page function. In addition, a selftest program is added to
>>> tools/testing/selftests/vm to utilize the interface by splitting
>>> PMD THPs and PTE-mapped THPs.
>>>
>>> Signed-off-by: Zi Yan 
>>
>> Hi!
>>
>> I think your test program is not correct. The mremaps shrink to one page, 
>> after the first mremap the pointers are bogus.
>> Also, mremap splits pmds with split_huge_pmd().. And those you can't split  
>> with split_huge_page because it is a normal pmd.
>> Maybe you didn't indent to shrink to page size?
>>
>>
>> --Mika
> Hi,
>
> Sorry, wrote too fast.. the splits are okay of course from pte mapped thp to 
> plain pages (mremap -> split pmd -> debugfs write ->split pages).
> But the remap offsets are, I think, maybe not what you wanted.


You mean that I mremap the first PAGESIZE from the first THP, the second
PAGESIZE from the second THP, and so on, to create PTE-mapped THPs? Yes, I did
that on purpose so that split_huge_page() can work on different parts of the
THPs.
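
In other words, the test does roughly the following (a simplified sketch
rather than the literal selftest code; here thps is the start of the four
PMD-mapped THPs created with MADV_HUGEPAGE, dst is a separate scratch mapping,
and pmd_pagesize/pagesize are the usual sizes):

        int i;

        for (i = 0; i < 4; i++) {
                /*
                 * Moving a single page out of a THP forces split_huge_pmd()
                 * on the source, so the rest of that THP stays a compound
                 * page but becomes PTE-mapped.
                 */
                if (mremap(thps + i * pmd_pagesize + i * pagesize, pagesize,
                           pagesize, MREMAP_MAYMOVE | MREMAP_FIXED,
                           dst + i * pagesize) == MAP_FAILED)
                        err(2, "mremap");
        }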


>>
>>
>>> ---
>>>   mm/huge_memory.c  |  98 ++
>>>   mm/internal.h |   1 +
>>>   mm/migrate.c  |   2 +-
>>>   tools/testing/selftests/vm/.gitignore |   1 +
>>>   tools/testing/selftests/vm/Makefile   |   1 +
>>>   .../selftests/vm/split_huge_page_test.c   | 318 ++
>>>   6 files changed, 420 insertions(+), 1 deletion(-)
>>>   create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 395c75111d33..818172f887bf 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -7,6 +7,7 @@
>>>     #include 
>>>   #include 
>>> +#include 
>>>   #include 
>>>   #include 
>>>   #include 
>>> @@ -2971,10 +2972,107 @@ static int split_huge_pages_set(void *data, u64 
>>> val)
>>>   DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, 
>>> split_huge_pages_set,
>>>   "%llu\n");
>>>   +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
>>> +    const char __user *buf, size_t count, loff_t *ppops)
>>> +{
>>> +    static DEFINE_MUTEX(mutex);
>>> +    ssize_t ret;
>>> +    char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
>>> +    int pid;
>>> +    unsigned long vaddr_start, vaddr_end, addr;
>>> +    nodemask_t task_nodes;
>>> +    struct mm_struct *mm;
>>> +    unsigned long total = 0, split = 0;
>>> +
>>> +    ret = mutex_lock_interruptible(&mutex);
>>> +    if (ret)
>>> +    return ret;
>>> +
>>> +    ret = -EFAULT;
>>> +
>>> +    memset(input_buf, 0, 80);
>>> +    if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
>>> +    goto out;
>>> +
>>> +    input_buf[79] = '\0';
>>> +    ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, 
>>> &vaddr_end);
>>> +    if (ret != 3) {
>>> +    ret = -EINVAL;
>>> +    goto out;
>>> +    }
>>> +    vaddr_start &= PAGE_MASK;
>>> +    vaddr_end &= PAGE_MASK;
>>> +
>>> +    ret = strlen(input_buf);
>>> +    pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
>>> + pid, vaddr_start, vaddr_end);
>>> +
>>> +    mm = find_mm_struct(pid, &task_nodes);
>>> +    if (IS_ERR(mm)) {
>>> +    ret = -EINVAL;
>>> +    goto out;
>>> +    }
>>> +
>>> +    mmap_read_lock(mm);
>>> +    /*
>>> + * always increase addr by PAGE_SIZE, since we could have a PTE page
>>> + * table filled with PTE-mapped THPs, each of which is distinct.
>>> + */
>>> +    for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
>>> +    struct vm_area_struct *vma = find_vma(mm, addr);
>>> +    unsigned int follflags;
>>> +    struct page *page;
>>> +
>>> +    if (!vma || addr < vma->vm

Re: [PATCH] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-08 Thread Zi Yan
On 8 Mar 2021, at 11:17, David Hildenbrand wrote:

> On 08.03.21 16:22, Zi Yan wrote:
>> From: Zi Yan 
>>
>> By writing "<pid>,<vaddr_start>,<vaddr_end>" to
>> <debugfs>/split_huge_pages_in_range_pid, THPs in the process with the
>> given pid and virtual address range are split. It is used to test
>> split_huge_page function. In addition, a selftest program is added to
>> tools/testing/selftests/vm to utilize the interface by splitting
>> PMD THPs and PTE-mapped THPs.
>
> Won't something like
>
> 1. MADV_HUGEPAGE
>
> 2. Access memory
>
> 3. MADV_NOHUGEPAGE
>
> Have a similar effect? What's the benefit of this?

Thanks for checking the patch.

No, MADV_NOHUGEPAGE just replaces VM_HUGEPAGE with VM_NOHUGEPAGE,
nothing else will be done.

Without this, we do not have a way of splitting a specific THP
(the compound page) via any user interface for debugging.
We can only write to <debugfs>/split_huge_pages to split all THPs
in the system, which has an unwanted effect on the whole system
and can overwhelm us with a lot of information. This new debugfs
interface provides a more precise method.
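
For example, a minimal (hypothetical) user-space helper exercising the new
interface could look like this, assuming debugfs is mounted at
/sys/kernel/debug; the string format matches the sscanf() in the handler:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* ask the kernel to split the THPs backing [start, end) of process `pid` */
static int split_thp_range(int pid, unsigned long start, unsigned long end)
{
	char buf[80];
	int fd, ret;

	fd = open("/sys/kernel/debug/split_huge_pages_in_range_pid", O_WRONLY);
	if (fd < 0)
		return -1;

	snprintf(buf, sizeof(buf), "%d,0x%lx,0x%lx", pid, start, end);
	ret = (write(fd, buf, strlen(buf)) == (ssize_t)strlen(buf)) ? 0 : -1;
	close(fd);
	return ret;
}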

>>
>> Signed-off-by: Zi Yan 
>> ---
>>   mm/huge_memory.c  |  98 ++
>>   mm/internal.h |   1 +
>>   mm/migrate.c  |   2 +-
>>   tools/testing/selftests/vm/.gitignore |   1 +
>>   tools/testing/selftests/vm/Makefile   |   1 +
>>   .../selftests/vm/split_huge_page_test.c   | 318 ++
>>   6 files changed, 420 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 395c75111d33..818172f887bf 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -7,6 +7,7 @@
>>#include 
>>   #include 
>> +#include 
>>   #include 
>>   #include 
>>   #include 
>> @@ -2971,10 +2972,107 @@ static int split_huge_pages_set(void *data, u64 val)
>>   DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
>>  "%llu\n");
>>  +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
>> +const char __user *buf, size_t count, loff_t *ppops)
>> +{
>> +static DEFINE_MUTEX(mutex);
>> +ssize_t ret;
>> +char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
>> +int pid;
>> +unsigned long vaddr_start, vaddr_end, addr;
>> +nodemask_t task_nodes;
>> +struct mm_struct *mm;
>> +unsigned long total = 0, split = 0;
>> +
>> +ret = mutex_lock_interruptible(&mutex);
>> +if (ret)
>> +return ret;
>> +
>> +ret = -EFAULT;
>> +
>> +memset(input_buf, 0, 80);
>> +if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
>> +goto out;
>> +
>> +input_buf[79] = '\0';
>> +ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, 
>> &vaddr_end);
>> +if (ret != 3) {
>> +ret = -EINVAL;
>> +goto out;
>> +}
>> +vaddr_start &= PAGE_MASK;
>> +vaddr_end &= PAGE_MASK;
>> +
>> +ret = strlen(input_buf);
>> +pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
>> + pid, vaddr_start, vaddr_end);
>> +
>> +mm = find_mm_struct(pid, &task_nodes);
>> +if (IS_ERR(mm)) {
>> +ret = -EINVAL;
>> +goto out;
>> +}
>> +
>> +mmap_read_lock(mm);
>> +/*
>> + * always increase addr by PAGE_SIZE, since we could have a PTE page
>> + * table filled with PTE-mapped THPs, each of which is distinct.
>> + */
>> +for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
>> +struct vm_area_struct *vma = find_vma(mm, addr);
>> +unsigned int follflags;
>> +struct page *page;
>> +
>> +if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>> +break;
>> +
>> +/* FOLL_DUMP to ignore special (like zero) pages */
>> +follflags = FOLL_GET | FOLL_DUMP;
>> +page = follow_page(vma, addr, follflags);
>> +
>> +if (IS_ERR(page))
>> +break;
>> +if (!page)
>> +break;
>> +
>> +if (!is_transparent_hugepage(page))
>> +continue;
>> +
>> +total++;
>> +   

[PATCH] mm: huge_memory: a new debugfs interface for splitting THP tests.

2021-03-08 Thread Zi Yan
From: Zi Yan 

By writing "<pid>,<vaddr_start>,<vaddr_end>" to
<debugfs>/split_huge_pages_in_range_pid, THPs in the process with the
given pid and virtual address range are split. It is used to test
split_huge_page function. In addition, a selftest program is added to
tools/testing/selftests/vm to utilize the interface by splitting
PMD THPs and PTE-mapped THPs.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  |  98 ++
 mm/internal.h |   1 +
 mm/migrate.c  |   2 +-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 318 ++
 6 files changed, 420 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 395c75111d33..818172f887bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2971,10 +2972,107 @@ static int split_huge_pages_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
"%llu\n");
 
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
+   const char __user *buf, size_t count, loff_t *ppops)
+{
+   static DEFINE_MUTEX(mutex);
+   ssize_t ret;
+   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+   int pid;
+   unsigned long vaddr_start, vaddr_end, addr;
+   nodemask_t task_nodes;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+
+   ret = mutex_lock_interruptible(&mutex);
+   if (ret)
+   return ret;
+
+   ret = -EFAULT;
+
+   memset(input_buf, 0, 80);
+   if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
+   goto out;
+
+   input_buf[79] = '\0';
+   ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, 
&vaddr_end);
+   if (ret != 3) {
+   ret = -EINVAL;
+   goto out;
+   }
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   ret = strlen(input_buf);
+   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mm = find_mm_struct(pid, &task_nodes);
+   if (IS_ERR(mm)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+   break;
+
+   /* FOLL_DUMP to ignore special (like zero) pages */
+   follflags = FOLL_GET | FOLL_DUMP;
+   page = follow_page(vma, addr, follflags);
+
+   if (IS_ERR(page))
+   break;
+   if (!page)
+   break;
+
+   if (!is_transparent_hugepage(page))
+   continue;
+
+   total++;
+   if (!can_split_huge_page(compound_head(page), NULL))
+   continue;
+
+   if (!trylock_page(page))
+   continue;
+
+   if (!split_huge_page(page))
+   split++;
+
+   unlock_page(page);
+   put_page(page);
+   }
+   mmap_read_unlock(mm);
+   mmput(mm);
+
+   pr_debug("%lu of %lu THP split\n", split, total);
+out:
+   mutex_unlock(&mutex);
+   return ret;
+
+}
+
+static const struct file_operations split_huge_pages_in_range_pid_fops = {
+   .owner   = THIS_MODULE,
+   .write   = split_huge_pages_in_range_pid_write,
+   .llseek  = no_llseek,
+};
+
 static int __init split_huge_pages_debugfs(void)
 {
debugfs_create_file("split_huge_pages", 0200, NULL, NULL,
&split_huge_pages_fops);
+   debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL,
+   &split_huge_pages_in_range_pid_fops);
return 0;
 }
 late_initcall(split_huge_pages_debugfs);
diff --git a/mm/internal.h b/mm/internal.h
index 9902648f2206..1659d00100ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -623,4 +623,5 @@ struct migration_target_control {
gfp_t gfp_mask;
 };
 
+struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 62b81d5257aa..ce5f213debb2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1

Re: [PATCH v3 1/5] mm,memory_hotplug: Allocate memmap from the added memory range

2021-03-07 Thread Zi Yan
On 4 Mar 2021, at 4:59, Oscar Salvador wrote:

> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
> which has performance drawbacks.
>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
> populated with base pages.
>
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being done on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
>
> In this way, we have:
>
> [start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> [buddy_start_pfn, end_pfn - 1]   = Initialized and sent to buddy

+Mike for hugetlb discussion.

Just thinking about how it might impact gigantic page allocation, e.g., for hugetlb.
When MHP_MEMMAP_ON_MEMORY is on, memmap pages are placed at the beginning
of each hot-added memory block, so the available PFNs from two consecutive
hot-added memory blocks are never all contiguous; they are separated by the memmap pages.
If the memory block size is <= 1GB, there is then no way of reserving gigantic
pages for hugetlb at runtime with alloc_contig_pages() from any hot-added
memory. Am I getting this right?
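
(Rough arithmetic for the 128MB-block example, assuming 64-byte struct pages:
128MB / 4KB = 32768 pages per block, so each block spends 32768 * 64B = 2MB of
its start on the memmap, consistent with the ~2MB per 128MB section figure
quoted above, and a 1GB-long run of contiguous free PFNs can never be
assembled from such blocks.)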

I see this implication is documented at a high level in patch 3. I just
wonder if we want to be more specific. Or maybe hugetlb is rarely used along
with hot-added memory.

Thanks.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 2/2] mm/memcg: set memcg when split page

2021-03-04 Thread Zi Yan
On 4 Mar 2021, at 2:40, Zhou Guanghui wrote:

> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the fisrt page is valid, the tail pages cannot be uncharged

s/fisrt/first/

> when be freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
>
> Therefore, the memcg of the tail page needs to be set when split page.
>
> Signed-off-by: Zhou Guanghui 
> ---
>  mm/page_alloc.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..3ed783e25c3c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3310,6 +3310,7 @@ void split_page(struct page *page, unsigned int order)
>   for (i = 1; i < (1 << order); i++)
>   set_page_refcounted(page + i);
>   split_page_owner(page, 1 << order);
> + split_page_memcg(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
>
> -- 
> 2.25.0

LGTM. Thanks.

Reviewed-by: Zi Yan 


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg

2021-03-04 Thread Zi Yan
On 4 Mar 2021, at 2:40, Zhou Guanghui wrote:

> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
>
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
>
> Signed-off-by: Zhou Guanghui 
> ---
>  include/linux/memcontrol.h |  6 ++
>  mm/huge_memory.c   |  2 +-
>  mm/memcontrol.c| 15 ++-
>  3 files changed, 9 insertions(+), 14 deletions(-)
>
LGTM. Thanks.

Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 4/8] mm/rmap: Split migration into its own function

2021-03-02 Thread Zi Yan
On 26 Feb 2021, at 2:18, Alistair Popple wrote:

> Migration is currently implemented as a mode of operation for
> try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
> or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.
>
> However it does not have much in common with the rest of the unmap
> functionality of try_to_unmap_one() and thus splitting it into a
> separate function reduces the complexity of try_to_unmap_one() making it
> more readable.
>
> Several simplifications can also be made in try_to_migrate_one() based
> on the following observations:
>
>  - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
>  - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
>  - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.
>
> TTU_SPLIT_FREEZE is a special case of migration used when splitting an
> anonymous page. This is most easily dealt with by calling the correct
> function from unmap_page() in mm/huge_memory.c  - either
> try_to_migrate() for PageAnon or try_to_unmap().
>
> Signed-off-by: Alistair Popple 
> ---
>  include/linux/rmap.h |   4 +-
>  mm/huge_memory.c |  10 +-
>  mm/migrate.c |   9 +-
>  mm/rmap.c| 352 +++
>  4 files changed, 269 insertions(+), 106 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 7f1ee411bd7b..77fa17de51d7 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -86,8 +86,6 @@ struct anon_vma_chain {
>  };
>
>  enum ttu_flags {
> - TTU_MIGRATION   = 0x1,  /* migration mode */
> -
>   TTU_SPLIT_HUGE_PMD  = 0x4,  /* split huge PMD if any */

It implies freeze in try_to_migrate() and no freeze in try_to_unmap(). I think
we need some comments here, above try_to_migrate(), and above try_to_unmap()
to clarify the implication.

>   TTU_IGNORE_MLOCK= 0x8,  /* ignore mlock */
>   TTU_IGNORE_HWPOISON = 0x20, /* corrupted page is recoverable */
> @@ -96,7 +94,6 @@ enum ttu_flags {
>* do a final flush if necessary */
>   TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock:
>* caller holds it */
> - TTU_SPLIT_FREEZE= 0x100,/* freeze pte under 
> splitting thp */
>  };
>
>  #ifdef CONFIG_MMU
> @@ -193,6 +190,7 @@ static inline void page_dup_rmap(struct page *page, bool 
> compound)
>  int page_referenced(struct page *, int is_locked,
>   struct mem_cgroup *memcg, unsigned long *vm_flags);
>
> +bool try_to_migrate(struct page *page, enum ttu_flags flags);
>  bool try_to_unmap(struct page *, enum ttu_flags flags);
>
>  /* Avoid racy checks */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d00b93dc2d9e..357052a4567b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2351,16 +2351,16 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>
>  static void unmap_page(struct page *page)
>  {
> - enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
> - TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> + enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
>   bool unmap_success;
>
>   VM_BUG_ON_PAGE(!PageHead(page), page);
>
>   if (PageAnon(page))
> - ttu_flags |= TTU_SPLIT_FREEZE;
> -
> - unmap_success = try_to_unmap(page, ttu_flags);
> + unmap_success = try_to_migrate(page, ttu_flags);
> + else
> + unmap_success = try_to_unmap(page, ttu_flags |
> + TTU_IGNORE_MLOCK);

I think we need a comment here about why anonymous pages need try_to_migrate()
and others need try_to_unmap().

Thanks.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 01/25] mm: Introduce struct folio

2021-03-02 Thread Zi Yan
On 2 Mar 2021, at 8:22, Matthew Wilcox wrote:

> On Mon, Mar 01, 2021 at 03:26:11PM -0500, Zi Yan wrote:
>>> +static inline struct folio *next_folio(struct folio *folio)
>>> +{
>>> +   return folio + folio_nr_pages(folio);
>>
>> Are you planning to make hugetlb use folio too?
>>
>> If yes, this might not work if we have CONFIG_SPARSEMEM && 
>> !CONFIG_SPARSEMEM_VMEMMAP
>> with a hugetlb folio > MAX_ORDER, because struct page might not be virtually 
>> contiguous.
>> See the experiment I did in [1].
>
> Actually, how about proofing this against a future change?
>
> static inline struct folio *next_folio(struct folio *folio)
> {
> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>   pfn_t next_pfn = page_to_pfn(&folio->page) + folio_nr_pages(folio);
>   return (struct folio *)pfn_to_page(next_pfn);
> #else
>   return folio + folio_nr_pages(folio);
> #endif
> }
>
> (not compiled)

Yes, it should work. A better version might be to check the folio order first
in the top half: if the order is >= MAX_ORDER, use the complicated code,
otherwise just return folio + folio_nr_pages(folio).

This CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP combination is really not friendly
to >= MAX_ORDER pages. Most likely I am going to make 1GB THP
rely on CONFIG_SPARSEMEM_VMEMMAP to avoid complicated code.
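
An untested sketch of what I mean (the MAX_ORDER check is my addition, so
take it as an illustration rather than a final patch):

static inline struct folio *next_folio(struct folio *folio)
{
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
	/* struct pages may not be virtually contiguous across a
	 * MAX_ORDER boundary, so fall back to pfn arithmetic. */
	if (folio_order(folio) >= MAX_ORDER) {
		unsigned long next_pfn = page_to_pfn(&folio->page) +
					 folio_nr_pages(folio);

		return (struct folio *)pfn_to_page(next_pfn);
	}
#endif
	return folio + folio_nr_pages(folio);
}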

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH] mm/memcg: set memcg when split pages

2021-03-02 Thread Zi Yan
On 2 Mar 2021, at 2:05, Zhouguanghui (OS Kernel) wrote:

> 在 2021/3/2 10:00, Zi Yan 写道:
>> On 1 Mar 2021, at 20:34, Zhou Guanghui wrote:
>>
>>> When split page, the memory cgroup info recorded in first page is
>>> not copied to tail pages. In this case, when the tail pages are
>>> freed, the uncharge operation is not performed. As a result, the
>>> usage of this memcg keeps increasing, and the OOM may occur.
>>>
>>> So, the copying of first page's memory cgroup info to tail pages
>>> is needed when split page.
>>>
>>> Signed-off-by: Zhou Guanghui 
>>> ---
>>>   include/linux/memcontrol.h | 10 ++
>>>   mm/page_alloc.c|  4 +++-
>>>   2 files changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index e6dc793d587d..c7e2b4421dc1 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup 
>>> *memcg);
>>>   extern bool cgroup_memory_noswap;
>>>   #endif
>>>
>>> +static inline void copy_page_memcg(struct page *dst, struct page *src)
>>> +{
>>> +   if (src->memcg_data)
>>> +   dst->memcg_data = src->memcg_data;
>>> +}
>>> +
>>>   struct mem_cgroup *lock_page_memcg(struct page *page);
>>>   void __unlock_page_memcg(struct mem_cgroup *memcg);
>>>   void unlock_page_memcg(struct page *page);
>>> @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup 
>>> *memcg)
>>>   {
>>>   }
>>>
>>> +static inline void copy_page_memcg(struct page *dst, struct page *src)
>>> +{
>>> +}
>>> +
>>>   static inline struct mem_cgroup *lock_page_memcg(struct page *page)
>>>   {
>>> return NULL;
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 3e4b29ee2b1e..ee0a63dc1c9b 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int 
>>> order)
>>> VM_BUG_ON_PAGE(PageCompound(page), page);
>>> VM_BUG_ON_PAGE(!page_count(page), page);
>>>
>>> -   for (i = 1; i < (1 << order); i++)
>>> +   for (i = 1; i < (1 << order); i++) {
>>> set_page_refcounted(page + i);
>>> +   copy_page_memcg(page + i, page);
>>> +   }
>>> split_page_owner(page, 1 << order);
>>>   }
>>>   EXPORT_SYMBOL_GPL(split_page);
>>> -- 
>>> 2.25.0
>>
>> +memcg maintainers
>>
>> split_page() is used for non-compound higher-order pages. I am not sure
>> if there is any such pages monitored by memcg. Please let me know
>> if I miss anything.
>
> Thank you for taking time for this.
>
> This should be put in kmemcg, and I'll modify it.
>
> When kmemcg is enabled and __GFP_ACCOUNT is set, the charged and
> uncharged sizes do not match when the alloc/free_pages_exact methods are used
> to allocate or free memory of an exact size. This is because the memcg data
> of the tail pages is not set when the page is split.

Thanks for your clarification. I missed kmemcg.

I have a question on copy_page_memcg above. Reading __memcg_kmem_charge_page
and __memcg_kmem_uncharge_page, it seems to me that every single page requires
a css_get(&memcg->css) at charge time and a css_put(&memcg->css) at uncharge
time.
But your copy_page_memcg does not do a css_get for the split subpages. Will that
cause a memcg->css refcount underflow when the subpages are uncharged?
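
For illustration, roughly what I would expect instead (a hypothetical sketch,
not a tested patch): propagate memcg_data to the tail pages and take one css
reference per tail page, e.g. with css_get_many():

void split_page_memcg(struct page *head, unsigned int nr)
{
	struct mem_cgroup *memcg = page_memcg(head);
	int i;

	if (mem_cgroup_disabled() || !memcg)
		return;

	/* each tail page will be uncharged (and css_put) on its own later */
	for (i = 1; i < nr; i++)
		head[i].memcg_data = head->memcg_data;
	css_get_many(&memcg->css, nr - 1);
}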


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH] mm/memcg: set memcg when split pages

2021-03-01 Thread Zi Yan
On 1 Mar 2021, at 20:34, Zhou Guanghui wrote:

> When split page, the memory cgroup info recorded in first page is
> not copied to tail pages. In this case, when the tail pages are
> freed, the uncharge operation is not performed. As a result, the
> usage of this memcg keeps increasing, and the OOM may occur.
>
> So, the copying of first page's memory cgroup info to tail pages
> is needed when split page.
>
> Signed-off-by: Zhou Guanghui 
> ---
>  include/linux/memcontrol.h | 10 ++
>  mm/page_alloc.c|  4 +++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..c7e2b4421dc1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -867,6 +867,12 @@ void mem_cgroup_print_oom_group(struct mem_cgroup 
> *memcg);
>  extern bool cgroup_memory_noswap;
>  #endif
>
> +static inline void copy_page_memcg(struct page *dst, struct page *src)
> +{
> + if (src->memcg_data)
> + dst->memcg_data = src->memcg_data;
> +}
> +
>  struct mem_cgroup *lock_page_memcg(struct page *page);
>  void __unlock_page_memcg(struct mem_cgroup *memcg);
>  void unlock_page_memcg(struct page *page);
> @@ -1291,6 +1297,10 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>  {
>  }
>
> +static inline void copy_page_memcg(struct page *dst, struct page *src)
> +{
> +}
> +
>  static inline struct mem_cgroup *lock_page_memcg(struct page *page)
>  {
>   return NULL;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..ee0a63dc1c9b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3307,8 +3307,10 @@ void split_page(struct page *page, unsigned int order)
>   VM_BUG_ON_PAGE(PageCompound(page), page);
>   VM_BUG_ON_PAGE(!page_count(page), page);
>
> - for (i = 1; i < (1 << order); i++)
> + for (i = 1; i < (1 << order); i++) {
>   set_page_refcounted(page + i);
> + copy_page_memcg(page + i, page);
> + }
>   split_page_owner(page, 1 << order);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
> -- 
> 2.25.0

+memcg maintainers

split_page() is used for non-compound higher-order pages. I am not sure
if there are any such pages monitored by memcg. Please let me know
if I missed anything.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 06/25] mm: Add get_folio

2021-03-01 Thread Zi Yan
On 28 Jan 2021, at 2:03, Matthew Wilcox (Oracle) wrote:

> If we know we have a folio, we can call get_folio() instead of get_page()
> and save the overhead of calling compound_head().
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/mm.h | 19 ++-
>  1 file changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 873d649107ba..d71c5776b571 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1192,18 +1192,19 @@ static inline bool is_pci_p2pdma_page(const struct 
> page *page)
>  }
>
>  /* 127: arbitrary random number, small enough to assemble well */
> -#define page_ref_zero_or_close_to_overflow(page) \
> - ((unsigned int) page_ref_count(page) + 127u <= 127u)
> +#define folio_ref_zero_or_close_to_overflow(folio) \
> + ((unsigned int) page_ref_count(&folio->page) + 127u <= 127u)
> +
> +static inline void get_folio(struct folio *folio)
> +{
> + /* Getting a page requires an already elevated page->_refcount. */
> + VM_BUG_ON_FOLIO(folio_ref_zero_or_close_to_overflow(folio), folio);
> + page_ref_inc(&folio->page);
> +}
>
>  static inline void get_page(struct page *page)
>  {
> - page = compound_head(page);
> - /*
> -  * Getting a normal page or the head of a compound page
> -  * requires to already have an elevated page->_refcount.
> -  */
> - VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page);
> - page_ref_inc(page);
> + get_folio(page_folio(page));
>  }
>
>  bool __must_check try_grab_page(struct page *page, unsigned int flags);
> -- 
> 2.29.2

LGTM.

Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 05/25] mm: Add put_folio

2021-03-01 Thread Zi Yan
On 28 Jan 2021, at 2:03, Matthew Wilcox (Oracle) wrote:

> If we know we have a folio, we can call put_folio() instead of put_page()
> and save the overhead of calling compound_head().  Also skips the
> devmap checks.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/mm.h | 15 ++-
>  1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7d787229dd40..873d649107ba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1220,9 +1220,15 @@ static inline __must_check bool try_get_page(struct 
> page *page)
>   return true;
>  }
>
> +static inline void put_folio(struct folio *folio)
> +{
> + if (put_page_testzero(&folio->page))
> + __put_page(&folio->page);
> +}
> +
>  static inline void put_page(struct page *page)
>  {
> - page = compound_head(page);
> + struct folio *folio = page_folio(page);
>
>   /*
>* For devmap managed pages we need to catch refcount transition from
> @@ -1230,13 +1236,12 @@ static inline void put_page(struct page *page)
>* need to inform the device driver through callback. See
>* include/linux/memremap.h and HMM for details.
>*/
> - if (page_is_devmap_managed(page)) {
> - put_devmap_managed_page(page);
> + if (page_is_devmap_managed(&folio->page)) {
> + put_devmap_managed_page(&folio->page);
>   return;
>   }
>
> -     if (put_page_testzero(page))
> - __put_page(page);
> + put_folio(folio);
>  }
>
>  /*
> -- 
> 2.29.2

LGTM.

Reviewed-by: Zi Yan 


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 04/25] mm/debug: Add VM_BUG_ON_FOLIO and VM_WARN_ON_ONCE_FOLIO

2021-03-01 Thread Zi Yan
On 28 Jan 2021, at 2:03, Matthew Wilcox (Oracle) wrote:

> These are the folio equivalents of VM_BUG_ON_PAGE and VM_WARN_ON_ONCE_PAGE.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/mmdebug.h | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h
> index 5d0767cb424a..77d24e1dcaec 100644
> --- a/include/linux/mmdebug.h
> +++ b/include/linux/mmdebug.h
> @@ -23,6 +23,13 @@ void dump_mm(const struct mm_struct *mm);
>   BUG();  \
>   }   \
>   } while (0)
> +#define VM_BUG_ON_FOLIO(cond, folio) \
> + do {\
> + if (unlikely(cond)) {   \
> + dump_page(&folio->page, "VM_BUG_ON_FOLIO(" 
> __stringify(cond)")");\
> + BUG();  \
> + }   \
> + } while (0)
>  #define VM_BUG_ON_VMA(cond, vma) \
>   do {\
>   if (unlikely(cond)) {   \
> @@ -48,6 +55,17 @@ void dump_mm(const struct mm_struct *mm);
>   }   \
>   unlikely(__ret_warn_once);  \
>  })
> +#define VM_WARN_ON_ONCE_FOLIO(cond, folio)   ({  \
> + static bool __section(".data.once") __warned;   \
> + int __ret_warn_once = !!(cond); \
> + \
> + if (unlikely(__ret_warn_once && !__warned)) {   \
> + dump_page(&folio->page, "VM_WARN_ON_ONCE_FOLIO(" 
> __stringify(cond)")");\
> + __warned = true;\
> + WARN_ON(1); \
> + }   \
> + unlikely(__ret_warn_once);  \
> +})
>
>  #define VM_WARN_ON(cond) (void)WARN_ON(cond)
>  #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond)
> @@ -56,11 +74,13 @@ void dump_mm(const struct mm_struct *mm);
>  #else
>  #define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond)
>  #define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond)
> +#define VM_BUG_ON_FOLIO(cond, folio) VM_BUG_ON(cond)
>  #define VM_BUG_ON_VMA(cond, vma) VM_BUG_ON(cond)
>  #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond)
>  #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond)
>  #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond)
>  #define VM_WARN_ON_ONCE_PAGE(cond, page)  BUILD_BUG_ON_INVALID(cond)
> +#define VM_WARN_ON_ONCE_FOLIO(cond, folio)  BUILD_BUG_ON_INVALID(cond)
>  #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond)
>  #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond)
>  #endif
> -- 
> 2.29.2

LGTM.

Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 03/25] mm/vmstat: Add folio stat wrappers

2021-03-01 Thread Zi Yan
On 28 Jan 2021, at 2:03, Matthew Wilcox (Oracle) wrote:

> Allow page counters to be more readily modified by callers which have
> a folio.  Name these wrappers with 'stat' instead of 'state' as requested
> by Linus here:
> https://lore.kernel.org/linux-mm/CAHk-=wj847sudr-kt+46ft3+xffgiwpgthvm7djwgdi4cvr...@mail.gmail.com/
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/vmstat.h | 60 ++
>  1 file changed, 60 insertions(+)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 773135fc6e19..3c3373c2c3c2 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -396,6 +396,54 @@ static inline void drain_zonestat(struct zone *zone,
>   struct per_cpu_pageset *pset) { }
>  #endif   /* CONFIG_SMP */
>
> +static inline
> +void __inc_zone_folio_stat(struct folio *folio, enum zone_stat_item item)
> +{
> + __inc_zone_page_state(&folio->page, item);

Shouldn’t we change the stats by folio_nr_pages(folio) here, and in all the
changes below? Otherwise one folio is always counted as a single page.
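
A minimal sketch of what I have in mind for the zone case (assuming the zone
is taken from the head page; the node and lruvec variants would be analogous):

static inline
void __inc_zone_folio_stat(struct folio *folio, enum zone_stat_item item)
{
	__mod_zone_page_state(page_zone(&folio->page), item,
			      folio_nr_pages(folio));
}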

> +}
> +
> +static inline
> +void __dec_zone_folio_stat(struct folio *folio, enum zone_stat_item item)
> +{
> + __dec_zone_page_state(&folio->page, item);
> +}
> +
> +static inline
> +void inc_zone_folio_stat(struct folio *folio, enum zone_stat_item item)
> +{
> + inc_zone_page_state(&folio->page, item);
> +}
> +
> +static inline
> +void dec_zone_folio_stat(struct folio *folio, enum zone_stat_item item)
> +{
> + dec_zone_page_state(&folio->page, item);
> +}
> +
> +static inline
> +void __inc_node_folio_stat(struct folio *folio, enum node_stat_item item)
> +{
> + __inc_node_page_state(&folio->page, item);
> +}
> +
> +static inline
> +void __dec_node_folio_stat(struct folio *folio, enum node_stat_item item)
> +{
> + __dec_node_page_state(&folio->page, item);
> +}
> +
> +static inline
> +void inc_node_folio_stat(struct folio *folio, enum node_stat_item item)
> +{
> + inc_node_page_state(&folio->page, item);
> +}
> +
> +static inline
> +void dec_node_folio_stat(struct folio *folio, enum node_stat_item item)
> +{
> + dec_node_page_state(&folio->page, item);
> +}
> +
>  static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages,
>int migratetype)
>  {
> @@ -530,6 +578,18 @@ static inline void __dec_lruvec_page_state(struct page 
> *page,
>   __mod_lruvec_page_state(page, idx, -1);
>  }
>
> +static inline void __inc_lruvec_folio_stat(struct folio *folio,
> +enum node_stat_item idx)
> +{
> + __mod_lruvec_page_state(&folio->page, idx, 1);
> +}
> +
> +static inline void __dec_lruvec_folio_stat(struct folio *folio,
> +enum node_stat_item idx)
> +{
> + __mod_lruvec_page_state(&folio->page, idx, -1);
> +}
> +
>  static inline void inc_lruvec_state(struct lruvec *lruvec,
>   enum node_stat_item idx)
>  {
> -- 
> 2.29.2


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 02/25] mm: Add folio_pgdat

2021-03-01 Thread Zi Yan
On 28 Jan 2021, at 2:03, Matthew Wilcox (Oracle) wrote:

> This is just a convenience wrapper for callers with folios; pgdat can
> be reached from tail pages as well as head pages.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/mm.h | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f20504017adf..7d787229dd40 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1503,6 +1503,11 @@ static inline pg_data_t *page_pgdat(const struct page 
> *page)
>   return NODE_DATA(page_to_nid(page));
>  }
>
> +static inline pg_data_t *folio_pgdat(const struct folio *folio)
> +{
> + return page_pgdat(&folio->page);
> +}
> +
>  #ifdef SECTION_IN_PAGE_FLAGS
>  static inline void set_page_section(struct page *page, unsigned long section)
>  {
> -- 
> 2.29.2

LGTM.

Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 01/25] mm: Introduce struct folio

2021-03-01 Thread Zi Yan
On 1 Mar 2021, at 15:53, Matthew Wilcox wrote:

> On Mon, Mar 01, 2021 at 03:26:11PM -0500, Zi Yan wrote:
>>> +static inline struct folio *next_folio(struct folio *folio)
>>> +{
>>> +   return folio + folio_nr_pages(folio);
>>
>> Are you planning to make hugetlb use folio too?
>
> Eventually, probably.  It's not my focus.
>
>> If yes, this might not work if we have CONFIG_SPARSEMEM && 
>> !CONFIG_SPARSEMEM_VMEMMAP
>> with a hugetlb folio > MAX_ORDER, because struct page might not be virtually 
>> contiguous.
>> See the experiment I did in [1].
>>
>> [1] 
>> https://lore.kernel.org/linux-mm/16f7c58b-4d79-41c5-9b64-a1a1628f4...@nvidia.com/
>
> I thought we were going to forbid that configuration?  ie no pages
> larger than MAX_ORDER with (SPARSEMEM && !SPARSEMEM_VMEMMAP)
>
> https://lore.kernel.org/linux-mm/312aecbd-ca6d-4e93-a6c1-1df87babd...@nvidia.com/
>
> is somewhere else we were discussing this.

That is my plan for 1GB THP, making it depend on SPARSEMEM_VMEMMAP,
otherwise the THP code will be too complicated to read. My concern
is just about using folio in hugetlb, since hugetlb gigantic pages can be
larger than MAX_ORDER and would hit the issue above.

If hugetlb is not going to use folio soon, the patch looks good to me.

Reviewed-by: Zi Yan 


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 01/25] mm: Introduce struct folio

2021-03-01 Thread Zi Yan
On 28 Jan 2021, at 2:03, Matthew Wilcox (Oracle) wrote:

> We have trouble keeping track of whether we've already called
> compound_head() to ensure we're not operating on a tail page.  Further,
> it's never clear whether we intend a struct page to refer to PAGE_SIZE
> bytes or page_size(compound_head(page)).
>
> Introduce a new type 'struct folio' that always refers to an entire
> (possibly compound) page, and points to the head page (or base page).
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/mm.h   | 26 ++
>  include/linux/mm_types.h | 17 +
>  2 files changed, 43 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2d6e715ab8ea..f20504017adf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -924,6 +924,11 @@ static inline unsigned int compound_order(struct page 
> *page)
>   return page[1].compound_order;
>  }
>
> +static inline unsigned int folio_order(struct folio *folio)
> +{
> + return compound_order(&folio->page);
> +}
> +
>  static inline bool hpage_pincount_available(struct page *page)
>  {
>   /*
> @@ -975,6 +980,26 @@ static inline unsigned int page_shift(struct page *page)
>
>  void free_compound_page(struct page *page);
>
> +static inline unsigned long folio_nr_pages(struct folio *folio)
> +{
> + return compound_nr(&folio->page);
> +}
> +
> +static inline struct folio *next_folio(struct folio *folio)
> +{
> + return folio + folio_nr_pages(folio);

Are you planning to make hugetlb use folio too?

If yes, this might not work if we have CONFIG_SPARSEMEM && 
!CONFIG_SPARSEMEM_VMEMMAP
with a hugetlb folio > MAX_ORDER, because struct page might not be virtually 
contiguous.
See the experiment I did in [1].


[1] 
https://lore.kernel.org/linux-mm/16f7c58b-4d79-41c5-9b64-a1a1628f4...@nvidia.com/


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v1] mm/page_alloc: drop pr_info_ratelimited() in alloc_contig_range()

2021-03-01 Thread Zi Yan
On 1 Mar 2021, at 10:09, David Hildenbrand wrote:

> The information that some PFNs are busy is:
> a) not helpful for ordinary users: we don't even know *who* called
>alloc_contig_range(). This is certainly not worth a pr_info.*().
> b) not really helpful for debugging: we don't have any details *why*
>these PFNs are busy, and that is what we usually care about.
> c) not complete: there are other cases where we fail alloc_contig_range()
>using different paths that are not getting recorded.
>
> For example, we reach this path once we succeeded in isolating pageblocks,
> but failed to migrate some pages - which can happen easily on
> ZONE_NORMAL (i.e., has_unmovable_pages() is racy) but also on ZONE_MOVABLE
> i.e., we would have to retry longer to migrate).
>
> For example via virtio-mem when unplugging memory, we can create quite
> some noise (especially with ZONE_NORMAL) that is not of interest to
> users - it's expected that some allocations may fail as memory is busy.
>
> Let's just drop that pr_info_ratelimit() and rather implement a dynamic
> debugging mechanism in the future that can give us a better reason why
> alloc_contig_range() failed on specific pages.
>
> Cc: Andrew Morton 
> Cc: Minchan Kim 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Vlastimil Babka 
> Signed-off-by: David Hildenbrand 
> ---

LGTM. I agree that the printout is not quite useful.

Reviewed-by: Zi Yan 


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] hugetlb: fix update_and_free_page contig page struct assumption

2021-02-18 Thread Zi Yan
On 18 Feb 2021, at 12:51, Mike Kravetz wrote:

> On 2/18/21 9:40 AM, Zi Yan wrote:
>> On 18 Feb 2021, at 12:32, Jason Gunthorpe wrote:
>>
>>> On Thu, Feb 18, 2021 at 12:27:58PM -0500, Zi Yan wrote:
>>>> On 18 Feb 2021, at 12:25, Jason Gunthorpe wrote:
>>>>
>>>>> On Thu, Feb 18, 2021 at 02:45:54PM +, Matthew Wilcox wrote:
>>>>>> On Wed, Feb 17, 2021 at 11:02:52AM -0800, Andrew Morton wrote:
>>>>>>> On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz 
>>>>>>>  wrote:
>>>>>>>> page structs are not guaranteed to be contiguous for gigantic pages.  
>>>>>>>> The
>>>>>>>
>>>>>>> June 2014.  That's a long lurk time for a bug.  I wonder if some later
>>>>>>> commit revealed it.
>>>>>>
>>>>>> I would suggest that gigantic pages have not seen much use.  Certainly
>>>>>> performance with Intel CPUs on benchmarks that I've been involved with
>>>>>> showed lower performance with 1GB pages than with 2MB pages until quite
>>>>>> recently.
>>>>>
>>>>> I suggested in another thread that maybe it is time to consider
>>>>> dropping this "feature"
>>>>
>>>> You mean dropping gigantic page support in hugetlb?
>>>
>>> No, I mean dropping support for arches that want to do:
>>>
>>>tail_page != head_page + tail_page_nr
>>>
>>> because they can't allocate the required page array either virtually
>>> or physically contiguously.
>>>
>>> It seems like quite a burden on the core mm for a very niche, and
>>> maybe even non-existent, case.
>>>
>>> It was originally done for PPC, can these PPC systems use VMEMMAP now?
>>>
>>>>> The cost to fix GUP to be compatible with this will hurt normal
>>>>> GUP performance - and again, that nobody has hit this bug in GUP
>>>>> further suggests the feature isn't used..
>>>>
>>>> An easy fix might be to make gigantic hugetlb pages depend on
>>>> CONFIG_SPARSEMEM_VMEMMAP, which guarantees all struct pages are contiguous.
>>>
>>> Yes, exactly.
>>
>> I actually have a question on CONFIG_SPARSEMEM_VMEMMAP. Can we assume
>> PFN_A - PFN_B == struct_page_A - struct_page_B, meaning all struct pages
>> are ordered based on physical addresses? I just wonder for two PFN ranges,
>> e.g., [0 - 128MB], [128MB - 256MB], if it is possible to first online
>> [128MB - 256MB] then [0 - 128MB] and the struct pages of [128MB - 256MB]
>> are in front of [0 - 128MB] in the vmemmap due to online ordering.
>
> I have not looked at the code which does the onlining and vmemmap setup.
> But, these definitions make me believe it is true:
>
> #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
>
> /* memmap is virtually contiguous.  */
> #define __pfn_to_page(pfn)  (vmemmap + (pfn))
> #define __page_to_pfn(page) (unsigned long)((page) - vmemmap)

Makes sense. Thank you for checking.

I guess making gigantic pages depend on CONFIG_SPARSEMEM_VMEMMAP might
be a good way of simplifying the code and avoiding future bugs, unless
there is an arch that really needs gigantic pages and cannot have VMEMMAP.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] hugetlb: fix update_and_free_page contig page struct assumption

2021-02-18 Thread Zi Yan
On 18 Feb 2021, at 12:32, Jason Gunthorpe wrote:

> On Thu, Feb 18, 2021 at 12:27:58PM -0500, Zi Yan wrote:
>> On 18 Feb 2021, at 12:25, Jason Gunthorpe wrote:
>>
>>> On Thu, Feb 18, 2021 at 02:45:54PM +, Matthew Wilcox wrote:
>>>> On Wed, Feb 17, 2021 at 11:02:52AM -0800, Andrew Morton wrote:
>>>>> On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz  
>>>>> wrote:
>>>>>> page structs are not guaranteed to be contiguous for gigantic pages.  The
>>>>>
>>>>> June 2014.  That's a long lurk time for a bug.  I wonder if some later
>>>>> commit revealed it.
>>>>
>>>> I would suggest that gigantic pages have not seen much use.  Certainly
>>>> performance with Intel CPUs on benchmarks that I've been involved with
>>>> showed lower performance with 1GB pages than with 2MB pages until quite
>>>> recently.
>>>
>>> I suggested in another thread that maybe it is time to consider
>>> dropping this "feature"
>>
>> You mean dropping gigantic page support in hugetlb?
>
> No, I mean dropping support for arches that want to do:
>
>tail_page != head_page + tail_page_nr
>
> because they can't allocate the required page array either virtually
> or physically contiguously.
>
> It seems like quite a burden on the core mm for a very niche, and
> maybe even non-existent, case.
>
> It was originally done for PPC, can these PPC systems use VMEMMAP now?
>
>>> The cost to fix GUP to be compatible with this will hurt normal
>>> GUP performance - and again, that nobody has hit this bug in GUP
>>> further suggests the feature isn't used..
>>
>> An easy fix might be to make gigantic hugetlb pages depend on
>> CONFIG_SPARSEMEM_VMEMMAP, which guarantees all struct pages are contiguous.
>
> Yes, exactly.

I actually have a question on CONFIG_SPARSEMEM_VMEMMAP. Can we assume
PFN_A - PFN_B == struct_page_A - struct_page_B, meaning all struct pages
are ordered based on physical addresses? I just wonder for two PFN ranges,
e.g., [0 - 128MB], [128MB - 256MB], if it is possible to first online
[128MB - 256MB] then [0 - 128MB] and the struct pages of [128MB - 256MB]
are in front of [0 - 128MB] in the vmemmap due to online ordering.


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH 1/2] hugetlb: fix update_and_free_page contig page struct assumption

2021-02-18 Thread Zi Yan
On 18 Feb 2021, at 12:25, Jason Gunthorpe wrote:

> On Thu, Feb 18, 2021 at 02:45:54PM +, Matthew Wilcox wrote:
>> On Wed, Feb 17, 2021 at 11:02:52AM -0800, Andrew Morton wrote:
>>> On Wed, 17 Feb 2021 10:49:25 -0800 Mike Kravetz  
>>> wrote:
 page structs are not guaranteed to be contiguous for gigantic pages.  The
>>>
>>> June 2014.  That's a long lurk time for a bug.  I wonder if some later
>>> commit revealed it.
>>
>> I would suggest that gigantic pages have not seen much use.  Certainly
>> performance with Intel CPUs on benchmarks that I've been involved with
>> showed lower performance with 1GB pages than with 2MB pages until quite
>> recently.
>
> I suggested in another thread that maybe it is time to consider
> dropping this "feature"

You mean dropping gigantic page support in hugetlb?

>
> If it has been slightly broken for 7 years it seems a good bet it
> isn't actually being used.
>
> The cost to fix GUP to be compatible with this will hurt normal
> GUP performance - and again, that nobody has hit this bug in GUP
> further suggests the feature isn't used..

An easy fix might be to make gigantic hugetlb pages depend on
CONFIG_SPARSEMEM_VMEMMAP, which guarantees all struct pages are contiguous.


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording

2021-02-13 Thread Zi Yan
On 11 Feb 2021, at 18:44, Mike Kravetz wrote:

> On 2/11/21 12:47 PM, Zi Yan wrote:
>> On 28 Jan 2021, at 16:53, Mike Kravetz wrote:
>>
>>> On 1/28/21 10:26 AM, Joao Martins wrote:
>>>> For a given hugepage backing a VA, there's a rather inefficient
>>>> loop which is solely responsible for storing subpages in GUP
>>>> @pages/@vmas array. For each subpage we check whether it's within
>>>> range or size of @pages and keep increment @pfn_offset and a couple
>>>> other variables per subpage iteration.
>>>>
>>>> Simplify this logic and minimize the cost of each iteration to just
>>>> store the output page/vma. Instead of incrementing number of @refs
>>>> iteratively, we do it through pre-calculation of @refs and only
>>>> with a tight loop for storing pinned subpages/vmas.
>>>>
>>>> Additionally, retain existing behaviour with using mem_map_offset()
>>>> when recording the subpages for configurations that don't have a
>>>> contiguous mem_map.
>>>>
>>>> pinning consequently improves bringing us close to
>>>> {pin,get}_user_pages_fast:
>>>>
>>>>   - 16G with 1G huge page size
>>>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>>>
>>>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>>>> PIN_FAST_BENCHMARK: ~3.7k us
>>>>
>>>> Signed-off-by: Joao Martins 
>>>> ---
>>>>  mm/hugetlb.c | 49 -
>>>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>>
>>> Thanks for updating this.
>>>
>>> Reviewed-by: Mike Kravetz 
>>>
>>> I think there still is an open general question about whether we can always
>>> assume page structs are contiguous for really big pages.  That is outside
>>
>> I do not think page structs need to be contiguous, but PFNs within a big page
>> need to be contiguous, at least based on existing code like mem_map_offset() 
>> we have.
>
> Thanks for looking Zi,
> Yes, PFNs need to be contiguous.  Also, as you say page structs do not need
> to be contiguous.  The issue is that there is code that assumes page structs
> are contiguous for gigantic pages.  hugetlb code does not make this assumption
> and does a pfn_to_page() when looping through page structs for gigantic pages.
>
> I do not believe this to be a huge issue.  In most cases 
> CONFIG_VIRTUAL_MEM_MAP
> is defined and struct pages can be accessed contiguously.  I 'think' we could
> run into problems with CONFIG_SPARSEMEM and without CONFIG_VIRTUAL_MEM_MAP
> and doing hotplug operations.  However, I still need to look into more.

Yeah, you are right about this. The combination of CONFIG_SPARSEMEM,
!CONFIG_SPARSEMEM_VMEMMAP and memory hotplug does cause errors: something as simple as
dynamically reserving gigantic hugetlb pages and then freeing them breaks on a system
with CONFIG_SPARSEMEM_VMEMMAP not set and some hot-plugged memory.

Here are the steps to reproduce:
0. Configure a kernel with CONFIG_SPARSEMEM_VMEMMAP not set.
1. Create a VM using qemu with “-m size=8g,slots=16,maxmem=16g” to enable 
hotplug.
2. After boot the machine, add large enough memory using
   “object_add memory-backend-ram,id=mem1,size=7g” and
   “device_add pc-dimm,id=dimm1,memdev=mem1”.
3. In the guest OS, online all hot-plugged memory. My VM has 128MB memory block 
size.
If you have larger memory block size, I think you will need to plug in more 
memory.
4. Reserve gigantic hugetlb pages so that hot-plugged memory will be used. I 
reserved
12GB, like “echo 12 | sudo tee 
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
5. Free all hugetlb gigantic pages,
“echo 0 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages”.
6. You will get “BUG: Bad page state in process …” errors.

The patch below can fix the error, but I suspect there might be other places
missing
the necessary mem_map_offset()/mem_map_next() handling too.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4bdb58ab14cb..aae99c6984f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1319,7 +1319,8 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
h->nr_huge_pages--;
h->nr_huge_pages_node[page_to_nid(page)]--;
for (i = 0; i < pages_per_huge_page(h); i++) {
-   page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
+   struct page *subpage = mem_map_offset(page, i);
+   subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
1 << PG_referenced | 1 << PG_dirty |
1 << PG_active | 1 << PG_private |
1 << PG_writeback);


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2 2/2] mm/hugetlb: refactor subpage recording

2021-02-11 Thread Zi Yan
On 28 Jan 2021, at 16:53, Mike Kravetz wrote:

> On 1/28/21 10:26 AM, Joao Martins wrote:
>> For a given hugepage backing a VA, there's a rather inefficient
>> loop which is solely responsible for storing subpages in GUP
>> @pages/@vmas array. For each subpage we check whether it's within
>> range or size of @pages and keep increment @pfn_offset and a couple
>> other variables per subpage iteration.
>>
>> Simplify this logic and minimize the cost of each iteration to just
>> store the output page/vma. Instead of incrementing number of @refs
>> iteratively, we do it through pre-calculation of @refs and only
>> with a tight loop for storing pinned subpages/vmas.
>>
>> Additionally, retain existing behaviour with using mem_map_offset()
>> when recording the subpages for configurations that don't have a
>> contiguous mem_map.
>>
>> pinning consequently improves bringing us close to
>> {pin,get}_user_pages_fast:
>>
>>   - 16G with 1G huge page size
>>   gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w
>>
>> PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
>> PIN_FAST_BENCHMARK: ~3.7k us
>>
>> Signed-off-by: Joao Martins 
>> ---
>>  mm/hugetlb.c | 49 -
>>  1 file changed, 28 insertions(+), 21 deletions(-)
>
> Thanks for updating this.
>
> Reviewed-by: Mike Kravetz 
>
> I think there still is an open general question about whether we can always
> assume page structs are contiguous for really big pages.  That is outside

I do not think page structs need to be contiguous, but PFNs within a big page
need to be contiguous, at least based on existing code like mem_map_offset() we 
have.
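
For reference, that helper is roughly the following (from mm/internal.h,
quoted from memory, so please double check the exact definition):

/* Return the i-th subpage of a maximally aligned gigantic page, handling
 * any memmap discontiguity at MAX_ORDER_NR_PAGES boundaries. */
static inline struct page *mem_map_offset(struct page *base, int offset)
{
	if (unlikely(offset >= MAX_ORDER_NR_PAGES))
		return nth_page(base, offset);
	return base + offset;
}
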
The assumption seems valid according to the existing big page allocation 
methods,
which use alloc_contig_pages() at the end of the day. alloc_contig_pages()
calls pfn_range_valid_contig() to make sure all PFNs are contiguous.
On the other hand, the buddy allocator only merges contiguous PFNs, so there
will be no problem even if someone configures the buddy allocator to allocate
gigantic pages.

Unless someone comes up with some fancy way of allocating big pages from
contiguous page structs in the SPARSEMEM_VMEMMAP case (where non-contiguous
PFNs with contiguous page structs are possible), or out of arbitrary adjacent
pages in the !SPARSEMEM_VMEMMAP case (where both non-contiguous page structs
and non-contiguous PFNs are possible), we should be good.


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH V5] x86/mm: Tracking linear mapping split events

2021-01-28 Thread Zi Yan
On 28 Jan 2021, at 11:41, Dave Hansen wrote:

> On 1/28/21 8:33 AM, Zi Yan wrote:
>>> One of the many lasting (as we don't coalesce back) sources for
>>> huge page splits is tracing as the granular page
>>> attribute/permission changes would force the kernel to split code
>>> segments mapped to huge pages to smaller ones thereby increasing
>>> the probability of TLB miss/reload even after tracing has been
>>> stopped.
>> It is interesting to see this statement saying splitting kernel
>> direct mappings causes performance loss, when Zhengjun (cc’d) from
>> Intel recently posted a kernel direct mapping performance report[1]
>> saying 1GB mappings are good but not much better than 2MB and 4KB
>> mappings.
>
> No, that's not what the report said.
>
> *Overall*, there is no clear winner between 4k, 2M and 1G.  In other
> words, no one page size is best for *ALL* workloads.
>
> There were *ABSOLUTELY* individual workloads in those tests that saw
> significant deltas between the direct map sizes.  There are also
> real-world workloads that feel the impact here.

Yes, it is what I understand from the report. But this patch says
“
Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
The splintering of huge direct pages into smaller ones does result in
a measurable performance hit caused by frequent TLB miss and reloads.
”,

indicating large mappings (2MB, 1GB) are generally better. It is
different from what the report said, right?

The above text could be improved to make sure readers get both sides
of the story and do not get scared of a performance loss after seeing
a lot of direct_map_xxx_splits events.



—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH V5] x86/mm: Tracking linear mapping split events

2021-01-28 Thread Zi Yan
On 28 Jan 2021, at 5:49, Saravanan D wrote:

> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic lifetime hugepage split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
>
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> 
> swap_ra 0
> swap_ra_hit 0
> direct_map_level2_splits 94
> direct_map_level3_splits 4
> nr_unstable 0
> 
>
> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.

It is interesting to see this statement saying splitting kernel direct mappings
causes performance loss, when Zhengjun (cc’d) from Intel recently posted
a kernel direct mapping performance report[1] saying 1GB mappings are good
but not much better than 2MB and 4KB mappings.

I would love to hear the stories from both sides. Or maybe I misunderstand
something.


[1]https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884e...@linux.intel.com/
>
> Documentation regarding linear mapping split events added to admin-guide
> as requested in V3 of the patch.
>
> Signed-off-by: Saravanan D 
> ---
>  .../admin-guide/mm/direct_mapping_splits.rst  | 59 +++
>  Documentation/admin-guide/mm/index.rst|  1 +
>  arch/x86/mm/pat/set_memory.c  |  8 +++
>  include/linux/vm_event_item.h |  4 ++
>  mm/vmstat.c   |  4 ++
>  5 files changed, 76 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/direct_mapping_splits.rst
>
> diff --git a/Documentation/admin-guide/mm/direct_mapping_splits.rst 
> b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> new file mode 100644
> index ..298751391deb
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/direct_mapping_splits.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=
> +Direct Mapping Splits
> +=
> +
> +The kernel maps all of physical memory in linear/direct mapped pages, where
> +translation from a kernel virtual address to a physical address is a simple
> +offset subtraction. CPUs maintain a cache of these translations in fast
> +caches called TLBs. CPU architectures like x86 allow
> +direct mapping large portions of memory into hugepages (2M, 1G, etc) in
> +various page table levels.
> +
> +Maintaining huge direct mapped pages greatly reduces TLB miss pressure.
> +The splintering of huge direct pages into smaller ones does result in
> +a measurable performance hit caused by frequent TLB miss and reloads.
> +
> +One of the many lasting (as we don't coalesce back) sources for huge page
> +splits is tracing as the granular page attribute/permission changes would
> +force the kernel to split code segments mapped to hugepages to smaller
> +ones thus increasing the probability of TLB miss/reloads even after
> +tracing has been stopped.
> +
> +On x86 systems, we can track the splitting of huge direct mapped pages
> +through lifetime event counters in ``/proc/vmstat``
> +
> + direct_map_level2_splits xxx
> + direct_map_level3_splits yyy
> +
> +where:
> +
> +direct_map_level2_splits
> + are 2M/4M hugepage split events
> +direct_map_level3_splits
> + are 1G hugepage split events
> +
> +The distribution of direct mapped system memory in various page sizes
> +post splits can be viewed through ``/proc/meminfo`` whose output
> +will include the following lines depending upon supporting CPU
> +architecture
> +
> + DirectMap4k:x kB
> + DirectMap2M:y kB
> + DirectMap1G:z kB
> +
> +where:
> +
> +DirectMap4k
> + is the total amount of direct mapped memory (in kB)
> + accessed through 4k pages
> +DirectMap2M
> + is the total amount of direct mapped memory (in kB)
> + accessed through 2M pages
> +DirectMap1G
> + is the total amount of direct mapped memory (in kB)
> + accessed through 1G pages
> +
> +
> +-- Saravanan D, Jan 27, 2021
> diff --git a/Documentation/admin-guide/mm/index.rst 
> b/Documentation/admin-guide/mm/index.rst
> index 4b14d8b50e9e..9439780f3f07 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -38,3 +38,4 @@ the Linux memory management.
> soft-dirty
> transhuge
> userfaultfd
> +   direct_mapping_splits
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..a7b3c5f1d316 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -16,6 +16,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include 
>  #include 
> @@ -91,6 +93,12 @@ static void split_page_count(int level)

Re: [PATCH v1 1/2] mm/cma: expose all pages to the buddy if activation of an area fails

2021-01-27 Thread Zi Yan
On 27 Jan 2021, at 5:18, David Hildenbrand wrote:

> Right now, if activation fails, we might already have exposed some pages to
> the buddy for CMA use (although they will never get actually used by CMA),
> and some pages won't be exposed to the buddy at all.
>
> Let's check for "single zone" early and on error, don't expose any pages
> for CMA use - instead, expose them to the buddy available for any use.
> Simply call free_reserved_page() on every single page - easier than
> going via free_reserved_area(), converting back and forth between pfns
> and virt addresses.
>
> In addition, make sure to fixup totalcma_pages properly.
>
> Example: 6 GiB QEMU VM with "... hugetlb_cma=2G movablecore=20% ...":
>   [0.006891] hugetlb_cma: reserve 2048 MiB, up to 2048 MiB per node
>   [0.006893] cma: Reserved 2048 MiB at 0x0001
>   [0.006893] hugetlb_cma: reserved 2048 MiB on node 0
>   ...
>   [0.175433] cma: CMA area hugetlb0 could not be activated
>
> Before this patch:
>   # cat /proc/meminfo
>   MemTotal:5867348 kB
>   MemFree: 5692808 kB
>   MemAvailable:5542516 kB
>   ...
>   CmaTotal:2097152 kB
>   CmaFree: 1884160 kB
>
> After this patch:
>   # cat /proc/meminfo
>   MemTotal:6077308 kB
>   MemFree: 5904208 kB
>   MemAvailable:5747968 kB
>   ...
>   CmaTotal:  0 kB
>   CmaFree:   0 kB
>
> Note: cma_init_reserved_mem() makes sure that we always cover full
> pageblocks / MAX_ORDER - 1 pages.
>
> Cc: Andrew Morton 
> Cc: Thomas Gleixner 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: Mike Rapoport 
> Cc: Oscar Salvador 
> Cc: Michal Hocko 
> Cc: Wei Yang 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/cma.c | 43 +--
>  1 file changed, 21 insertions(+), 22 deletions(-)

LGTM. Reviewed-by: Zi Yan 

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/2] mm/hugetlb: refactor subpage recording

2021-01-26 Thread Zi Yan
On 26 Jan 2021, at 21:24, Matthew Wilcox wrote:

> On Tue, Jan 26, 2021 at 08:07:30PM -0400, Jason Gunthorpe wrote:
>> I'm looking at Matt's folio patches and see:
>>
>> +static inline struct folio *next_folio(struct folio *folio)
>> +{
>> +   return folio + folio_nr_pages(folio);
>> +}
>
> This is a replacement for places that would do 'page++'.  eg it's
> used by the bio iterator where we already checked that the phys addr
> and the struct page are contiguous.
>
>> And checking page_trans_huge_mapcount():
>>
>>  for (i = 0; i < thp_nr_pages(page); i++) {
>>  mapcount = atomic_read(&page[i]._mapcount) + 1;
>
> I think we are guaranteed this for transparent huge pages.  At least
> for now.  Zi Yan may have some thoughts for his work on 1GB transhuge
> pages ...

It should work for 1GB THP too. My implementation allocates 1GB pages
from cma_alloc(), which calls alloc_contig_range(). At least for now
subpages from a 1GB THP are physically contiguous.

It will be a concern if we use other ways (like migrating in-use pages)
of forming 1GB THPs. Thanks for pointing this out.

>
>> And we have the same logic in hmm_vma_walk_pud():
>>
>>  if (pud_huge(pud) && pud_devmap(pud)) {
>>  pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>>  for (i = 0; i < npages; ++i, ++pfn)
>>  hmm_pfns[i] = pfn | cpu_flags;
>>
>> So, if page[n] does not access the tail pages of a compound we have
>> many more people who are surprised by this than just GUP.
>>
>> Where are these special rules for hugetlb compound tails documented?
>> Why does it need to be like this?
>>
>> Isn't it saner to forbid a compound and its tails from being
>> non-linear in the page array? That limits when compounds can be
>> created, but seems more likely to happen than a full mm audit to find
>> all the places that assume linearity.
>>
>> Jason


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


[PATCH 6/7] mm: truncate: split thp to a non-zero order if possible.

2020-11-19 Thread Zi Yan
From: Zi Yan 

To minimize the number of pages after a truncation, when truncating a
THP, we do not need to split it all the way down to order-0. The THP has
at most three parts: the part before the offset, the part to be truncated,
and the part remaining at the end. Use the non-zero minimum of their sizes
to decide what order we split the THP to.
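
For illustration, a minimal userspace sketch of the same order selection,
assuming a 4KB PAGE_SIZE; pick_split_order() is a hypothetical stand-in for
the logic added to truncate_inode_partial_page() below:

#include <stdio.h>

#define PAGE_SIZE 4096u

static unsigned int ilog2_uint(unsigned int x)
{
	unsigned int r = 0;

	while (x >>= 1)
		r++;
	return r;
}

/* offset, length, remaining are in bytes, as in truncate_inode_partial_page() */
static unsigned int pick_split_order(unsigned int offset, unsigned int length,
				     unsigned int remaining)
{
	unsigned int min_subpage_size = PAGE_SIZE;
	unsigned int new_order;

	/* non-zero minimum of offset, length, and remaining */
	if (offset && remaining)
		min_subpage_size = offset < length ?
			(offset < remaining ? offset : remaining) :
			(length < remaining ? length : remaining);
	else if (!offset)
		min_subpage_size = length < remaining ? length : remaining;
	else	/* remaining == 0 */
		min_subpage_size = length < offset ? length : offset;

	if (min_subpage_size < PAGE_SIZE)
		min_subpage_size = PAGE_SIZE;

	new_order = ilog2_uint(min_subpage_size / PAGE_SIZE);
	if (new_order == 1)	/* order-1 THP is not supported */
		new_order = 0;
	return new_order;
}

int main(void)
{
	/* a 2MB THP: 512KB kept in front, 1MB truncated, 512KB kept at the end */
	printf("new order: %u\n",
	       pick_split_order(512 * 1024, 1024 * 1024, 512 * 1024));
	return 0;
}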

Signed-off-by: Zi Yan 
---
 mm/truncate.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 20bd17538ec2..2e93d702f2c6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -237,7 +237,9 @@ int truncate_inode_page(struct address_space *mapping, 
struct page *page)
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 {
loff_t pos = page_offset(page);
-   unsigned int offset, length;
+   unsigned int offset, length, remaining, min_subpage_size = PAGE_SIZE;
+   unsigned int new_order;
+
 
if (pos < start)
offset = start - pos;
@@ -248,6 +250,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t 
start, loff_t end)
length = length - offset;
else
length = end + 1 - pos - offset;
+   remaining = thp_size(page) - offset - length;
 
wait_on_page_writeback(page);
if (length == thp_size(page)) {
@@ -267,7 +270,29 @@ bool truncate_inode_partial_page(struct page *page, loff_t 
start, loff_t end)
do_invalidatepage(page, offset, length);
if (!PageTransHuge(page))
return true;
-   return split_huge_page(page) == 0;
+
+   /*
+* find the non-zero minimum of offset, length, and remaining and use it
+* to decide the new order of the page after split
+*/
+   if (offset && remaining)
+   min_subpage_size = min_t(unsigned int,
+min_t(unsigned int, offset, length),
+remaining);
+   else if (!offset)
+   min_subpage_size = min_t(unsigned int, length, remaining);
+   else /* remaining == 0 */
+   min_subpage_size = min_t(unsigned int, length, offset);
+
+   min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
+
+   new_order = ilog2(min_subpage_size/PAGE_SIZE);
+
+   /* order-1 THP not supported, downgrade to order-0 */
+   if (new_order == 1)
+   new_order = 0;
+
+   return split_huge_page_to_list_to_order(page, NULL, new_order) == 0;
 }
 
 /*
-- 
2.28.0



[PATCH 4/7] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-19 Thread Zi Yan
From: Zi Yan 

It adds a new_order parameter to set the new page order in page owner and
uses old_order instead of nr to make the parameters look consistent. It
prepares for upcoming changes to support splitting a huge page to any
lower order.

Signed-off-by: Zi Yan 
---
 include/linux/page_owner.h | 10 ++
 mm/huge_memory.c   |  3 ++-
 mm/page_alloc.c|  2 +-
 mm/page_owner.c| 13 +++--
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 3468794f83d2..9caaed51403c 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -11,7 +11,8 @@ extern struct page_ext_operations page_owner_ops;
 extern void __reset_page_owner(struct page *page, unsigned int order);
 extern void __set_page_owner(struct page *page,
unsigned int order, gfp_t gfp_mask);
-extern void __split_page_owner(struct page *page, unsigned int nr);
+extern void __split_page_owner(struct page *page, unsigned int old_order,
+   unsigned int new_order);
 extern void __copy_page_owner(struct page *oldpage, struct page *newpage);
 extern void __set_page_owner_migrate_reason(struct page *page, int reason);
 extern void __dump_page_owner(struct page *page);
@@ -31,10 +32,11 @@ static inline void set_page_owner(struct page *page,
__set_page_owner(page, order, gfp_mask);
 }
 
-static inline void split_page_owner(struct page *page, unsigned int nr)
+static inline void split_page_owner(struct page *page, unsigned int old_order,
+   unsigned int new_order)
 {
if (static_branch_unlikely(&page_owner_inited))
-   __split_page_owner(page, nr);
+   __split_page_owner(page, old_order, new_order);
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
 {
@@ -60,7 +62,7 @@ static inline void set_page_owner(struct page *page,
 {
 }
 static inline void split_page_owner(struct page *page,
-   unsigned int order)
+   unsigned int old_order, unsigned int new_order)
 {
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7ab5cac5851..aae7405a0989 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2422,6 +2422,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
struct lruvec *lruvec;
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
+   unsigned int order = thp_order(head);
unsigned int nr = thp_nr_pages(head);
int i;
 
@@ -2458,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
 
ClearPageCompound(head);
 
-   split_page_owner(head, nr);
+   split_page_owner(head, order, 0);
 
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63d8d8b72c10..414f26950190 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3297,7 +3297,7 @@ void split_page(struct page *page, unsigned int order)
 
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
-   split_page_owner(page, 1 << order);
+   split_page_owner(page, order, 0);
 }
 EXPORT_SYMBOL_GPL(split_page);
 
diff --git a/mm/page_owner.c b/mm/page_owner.c
index b735a8eafcdb..00a679a1230b 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -204,19 +204,20 @@ void __set_page_owner_migrate_reason(struct page *page, 
int reason)
page_owner->last_migrate_reason = reason;
 }
 
-void __split_page_owner(struct page *page, unsigned int nr)
+void __split_page_owner(struct page *page, unsigned int old_order,
+   unsigned int new_order)
 {
-   int i;
-   struct page_ext *page_ext = lookup_page_ext(page);
+   int i, old_nr = 1 << old_order, new_nr = 1 << new_order;
+   struct page_ext *page_ext;
struct page_owner *page_owner;
 
if (unlikely(!page_ext))
return;
 
-   for (i = 0; i < nr; i++) {
+   for (i = 0; i < old_nr; i += new_nr) {
+   page_ext = lookup_page_ext(page + i);
page_owner = get_page_owner(page_ext);
-   page_owner->order = 0;
-   page_ext = page_ext_next(page_ext);
+   page_owner->order = new_order;
}
 }
 
-- 
2.28.0



[PATCH 2/7] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.

2020-11-19 Thread Zi Yan
From: Zi Yan 

Huge pages in the process with the given pid and within the given virtual
address range are split. It is used to test the split huge page function.
In addition, a testing program is added to tools/testing/selftests/vm to
exercise the interface by splitting PMD THPs and PTE-mapped THPs.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  |  98 ++
 mm/internal.h |   1 +
 mm/migrate.c  |   2 +-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 313 ++
 6 files changed, 415 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1bf51d3f2f2d..88d8b7fce5d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2934,10 +2935,107 @@ static int split_huge_pages_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
"%llu\n");
 
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
+   const char __user *buf, size_t count, loff_t *ppops)
+{
+   static DEFINE_MUTEX(mutex);
+   ssize_t ret;
+   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+   int pid;
+   unsigned long vaddr_start, vaddr_end, addr;
+   nodemask_t task_nodes;
+   struct mm_struct *mm;
+   unsigned long total = 0, split = 0;
+
+   ret = mutex_lock_interruptible(&mutex);
+   if (ret)
+   return ret;
+
+   ret = -EFAULT;
+
+   memset(input_buf, 0, 80);
+   if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
+   goto out;
+
+   input_buf[79] = '\0';
+   ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
+   if (ret != 3) {
+   ret = -EINVAL;
+   goto out;
+   }
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   ret = strlen(input_buf);
+   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mm = find_mm_struct(pid, &task_nodes);
+   if (IS_ERR(mm)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   mmap_read_lock(mm);
+   /*
+* always increase addr by PAGE_SIZE, since we could have a PTE page
+* table filled with PTE-mapped THPs, each of which is distinct.
+*/
+   for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+   break;
+
+   /* FOLL_DUMP to ignore special (like zero) pages */
+   follflags = FOLL_GET | FOLL_DUMP;
+   page = follow_page(vma, addr, follflags);
+
+   if (IS_ERR(page))
+   break;
+   if (!page)
+   break;
+
+   if (!is_transparent_hugepage(page))
+   continue;
+
+   total++;
+   if (!can_split_huge_page(compound_head(page), NULL))
+   continue;
+
+   if (!trylock_page(page))
+   continue;
+
+   if (!split_huge_page(page))
+   split++;
+
+   unlock_page(page);
+   put_page(page);
+   }
+   mmap_read_unlock(mm);
+   mmput(mm);
+
+   pr_debug("%lu of %lu THP split\n", split, total);
+out:
+   mutex_unlock(&mutex);
+   return ret;
+
+}
+
+static const struct file_operations split_huge_pages_in_range_pid_fops = {
+   .owner   = THIS_MODULE,
+   .write   = split_huge_pages_in_range_pid_write,
+   .llseek  = no_llseek,
+};
+
 static int __init split_huge_pages_debugfs(void)
 {
debugfs_create_file("split_huge_pages", 0200, NULL, NULL,
&split_huge_pages_fops);
+   debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL,
+   &split_huge_pages_in_range_pid_fops);
return 0;
 }
 late_initcall(split_huge_pages_debugfs);
diff --git a/mm/internal.h b/mm/internal.h
index fbebc3ff288c..b94b2d96e47a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -627,4 +627,5 @@ struct migration_target_control {
 
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
 void page_cache_free_page(struct address_space *mapping, struct page *page);
+struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 6dfc7

[PATCH 5/7] mm: thp: split huge page to any lower order pages.

2020-11-19 Thread Zi Yan
From: Zi Yan 

To split a THP into pages of any lower order, we need to re-form THPs on
the subpages at the given order and add page refcounts based on the new
page order. Also we need to reinitialize page_deferred_list after removing
the page from the split_queue, otherwise a subsequent split will see
list corruption when checking the page_deferred_list again.

It has many uses, like minimizing the number of pages after
truncating a pagecache THP. For anonymous THPs, we can only split them
to order-0 like before until we add support for any size anonymous THPs.
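
As a usage sketch (not a definitive caller), a path that already holds the
page lock could drive the new interface roughly like this; try_split_to_order()
is a hypothetical helper mirroring what the truncate and debugfs patches in
this series do:

#include <linux/errno.h>
#include <linux/huge_mm.h>
#include <linux/mm.h>

/* Sketch: split a locked THP down to new_order, or report why we cannot. */
static int try_split_to_order(struct page *page, unsigned int new_order)
{
	struct page *head = compound_head(page);

	VM_BUG_ON_PAGE(!PageLocked(head), head);

	if (!PageTransHuge(head))
		return 0;	/* nothing to split */

	/* extra pins mean the split would fail anyway */
	if (!can_split_huge_page(head, NULL))
		return -EBUSY;

	/* NULL list: tail pages go back to the LRU as usual */
	return split_huge_page_to_list_to_order(head, NULL, new_order);
}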

Signed-off-by: Zi Yan 
---
 include/linux/huge_mm.h |   8 +++
 mm/huge_memory.c| 119 +---
 mm/swap.c   |   1 -
 3 files changed, 96 insertions(+), 32 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7723deda33e2..0c856f805617 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -182,6 +182,8 @@ bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+   unsigned int new_order);
 static inline int split_huge_page(struct page *page)
 {
return split_huge_page_to_list(page, NULL);
@@ -385,6 +387,12 @@ split_huge_page_to_list(struct page *page, struct 
list_head *list)
 {
return 0;
 }
+static inline int
+split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+   unsigned int new_order)
+{
+   return 0;
+}
 static inline int split_huge_page(struct page *page)
 {
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aae7405a0989..cc70f70862d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2325,12 +2325,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 
 static void unmap_page(struct page *page)
 {
-   enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
-   TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+   enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_RMAP_LOCKED;
bool unmap_success;
 
VM_BUG_ON_PAGE(!PageHead(page), page);
 
+   if (thp_order(page) >= HPAGE_PMD_ORDER)
+   ttu_flags |= TTU_SPLIT_HUGE_PMD;
+
if (PageAnon(page))
ttu_flags |= TTU_SPLIT_FREEZE;
 
@@ -2338,21 +2340,23 @@ static void unmap_page(struct page *page)
VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
-static void remap_page(struct page *page, unsigned int nr)
+static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
 {
-   int i;
-   if (PageTransHuge(page)) {
+   unsigned int i;
+
+   if (thp_nr_pages(page) == nr) {
remove_migration_ptes(page, page, true);
} else {
-   for (i = 0; i < nr; i++)
+   for (i = 0; i < nr; i += new_nr)
remove_migration_ptes(page + i, page + i, true);
}
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
-   struct lruvec *lruvec, struct list_head *list)
+   struct lruvec *lruvec, struct list_head *list, unsigned int 
new_order)
 {
struct page *page_tail = head + tail;
+   unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
 
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
 
@@ -2376,6 +2380,7 @@ static void __split_huge_page_tail(struct page *head, int 
tail,
 #ifdef CONFIG_64BIT
 (1L << PG_arch_2) |
 #endif
+compound_head_flag |
 (1L << PG_dirty)));
 
/* ->mapping in first tail page is compound_mapcount */
@@ -2384,7 +2389,10 @@ static void __split_huge_page_tail(struct page *head, 
int tail,
page_tail->mapping = head->mapping;
page_tail->index = head->index + tail;
 
-   /* Page flags must be visible before we make the page non-compound. */
+   /*
+* Page flags must be visible before we make the page non-compound or
+* a compound page in new_order.
+*/
smp_wmb();
 
/*
@@ -2394,10 +2402,15 @@ static void __split_huge_page_tail(struct page *head, 
int tail,
 * which needs correct compound_head().
 */
clear_compound_head(page_tail);
+   if (new_order) {
+   prep_compound_page(page_tail, new_order);
+   thp_prep(page_tail);
+   }
 
/* Finally unfreeze refcount. Additional reference from page cache. */
-   page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
- PageSwapCache(head)));
+   page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
+  PageSwapCache(head)) ?
+   thp_nr_pages(page_

[PATCH 0/7] Split huge pages to any lower order pages and selftests.

2020-11-19 Thread Zi Yan
From: Zi Yan 

Hi all,

With Matthew's THP in pagecache patches[1], we will be able to handle any size
pagecache THPs, but currently split_huge_page can only split a THP to order-0
pages. This can easily erase the benefit of having pagecache THPs, when
operations like truncate might want to keep pages larger than order-0. In
response, here are the patches to add support for splitting a THP to any lower
order pages. In addition, this patchset prepares for my PUD THP patchset[2],
since splitting a PUD THP into multiple PMD THPs can be handled by the
split_huge_page_to_list_to_order function added by this patchset, which avoids
a lot of redundant code that replicating split_huge_page for PUD THPs would need.

To help the tests of splitting huge pages, I added a new debugfs interface
at /split_huge_pages_in_range_pid, so developers can split THPs in a
given range from a process with the given pid by writing
"<pid>,<vaddr_start>,<vaddr_end>" to the interface. I also added a
new test program to test 1) splitting PMD THPs, 2) splitting PTE-mapped THPs,
3) splitting pagecache THPs to any lower order, 4) truncating a pagecache
THP to a page with a lower order, and 5) punching holes in a pagecache THP to
cause splitting THPs to lower order THPs.
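
For instance, a minimal sketch of driving the interface from userspace (the
debugfs path and input format are the ones used by the selftest in this
series; the address range below is only an example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"

/* Ask the kernel to split all THPs of <pid> mapped in [start, end). */
static int split_range(int pid, unsigned long start, unsigned long end)
{
	char input[80];
	int fd, len, ret = 0;

	len = snprintf(input, sizeof(input), "%d,0x%lx,0x%lx", pid, start, end);
	if (len < 0 || len >= (int)sizeof(input))
		return -1;

	fd = open(SPLIT_DEBUGFS, O_WRONLY);
	if (fd < 0)
		return -1;
	/* include the terminating NUL, as the selftest does */
	if (write(fd, input, len + 1) != len + 1)
		ret = -1;
	close(fd);
	return ret;
}

int main(void)
{
	/* illustrative range; a real caller would pass the mapping it cares about */
	return split_range(getpid(), 0x700000000000UL, 0x700000200000UL) ? 1 : 0;
}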

The patchset is on top of Matthew's pagecache/next tree[3].

* Patch 1 is cherry-picked from Matthew's recent xarray fix [4] just to make sure
  Patches 3 to 7 can run without problems. I will let Matthew decide how it should
  get picked up.
* Patch 2 is self-contained and can be merged if it looks OK.

Comments and/or suggestions are welcome.

ChangeLog
===
From RFC:
1. Fixed debugfs to handle splitting PTE-mapped THPs properly and added stats
   for split THPs.
2. Added a new test case for splitting PTE-mapped THPs. Each of the four PTEs
   points to a different subpage from four THPs and used kpageflags to check
   whether a PTE points to a THP or not (AnonHugePages from smap does not show
   PTE-mapped THPs).
3. mem_cgroup_split_huge_fixup() takes order instead of nr.
4. split_page_owner takes old_order and new_order instead of nr and new_order.
5. Corrected __split_page_owner declaration and fixed its implementation when
   splitting a THP to a new order.
6. Renamed left to remaining in truncate_inode_partial_page().
7. Use VM_BUG_ON instead of WARN_ONCE when splitting a THP to the unsupported
   order-0 and splitting anonymous THPs to non-zero orders.
8. Added punching holes in a file as a new pagecache THP split test case, which
   uncovered an xarray bug.


[1] https://lore.kernel.org/linux-mm/20201029193405.29125-1-wi...@infradead.org/
[2] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi@sent.com/
[3] https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/next
[4] https://git.infradead.org/users/willy/xarray.git


Matthew Wilcox (Oracle) (1):
  XArray: Fix splitting to non-zero orders

Zi Yan (6):
  mm: huge_memory: add new debugfs interface to trigger split huge page
on any page range.
  mm: memcg: make memcg huge page split support any order split.
  mm: page_owner: add support for splitting to any order in split
page_owner.
  mm: thp: split huge page to any lower order pages.
  mm: truncate: split thp to a non-zero order if possible.
  mm: huge_memory: enable debugfs to split huge pages to any order.

 include/linux/huge_mm.h   |   8 +
 include/linux/memcontrol.h|   5 +-
 include/linux/page_owner.h|  10 +-
 lib/test_xarray.c |  26 +-
 lib/xarray.c  |   4 +-
 mm/huge_memory.c  | 219 ++--
 mm/internal.h |   1 +
 mm/memcontrol.c   |   6 +-
 mm/migrate.c  |   2 +-
 mm/page_alloc.c   |   2 +-
 mm/page_owner.c   |  13 +-
 mm/swap.c |   1 -
 mm/truncate.c |  29 +-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 479 ++
 16 files changed, 742 insertions(+), 65 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

--
2.28.0



[PATCH 7/7] mm: huge_memory: enable debugfs to split huge pages to any order.

2020-11-19 Thread Zi Yan
From: Zi Yan 

It is used to test split_huge_page_to_list_to_order for pagecache THPs.
Also add test cases for split_huge_page_to_list_to_order via debugfs,
truncating a file, and punching holes in a file.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  |  13 +-
 .../selftests/vm/split_huge_page_test.c   | 192 --
 2 files changed, 186 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cc70f70862d8..d6ce7be65fb2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2999,7 +2999,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
static DEFINE_MUTEX(mutex);
ssize_t ret;
char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
-   int pid;
+   int pid, to_order = 0;
unsigned long vaddr_start, vaddr_end, addr;
nodemask_t task_nodes;
struct mm_struct *mm;
@@ -3016,8 +3016,9 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
goto out;
 
input_buf[79] = '\0';
-   ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
-   if (ret != 3) {
+   ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start, &vaddr_end, &to_order);
+   /* cannot split to order-1 THP, which is not possible */
+   if ((ret != 3 && ret != 4) || to_order == 1) {
ret = -EINVAL;
goto out;
}
@@ -3025,8 +3026,8 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
vaddr_end &= PAGE_MASK;
 
ret = strlen(input_buf);
-   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
-pid, vaddr_start, vaddr_end);
+   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx], to order: %d\n",
+pid, vaddr_start, vaddr_end, to_order);
 
mm = find_mm_struct(pid, &task_nodes);
if (IS_ERR(mm)) {
@@ -3066,7 +3067,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
if (!trylock_page(page))
continue;
 
-   if (!split_huge_page(page))
+   if (!split_huge_page_to_list_to_order(page, NULL, to_order))
split++;
 
unlock_page(page);
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c 
b/tools/testing/selftests/vm/split_huge_page_test.c
index cd2ced8c1261..bfd35ae9cfd2 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 uint64_t pagesize;
 unsigned int pageshift;
@@ -24,6 +25,7 @@ uint64_t pmd_pagesize;
 #define PMD_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
 #define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"
 #define SMAP_PATH "/proc/self/smaps"
+#define THP_FS_PATH "/mnt/thp_fs"
 #define INPUT_MAX 80
 
 #define PFN_MASK ((1UL<<55)-1)
@@ -89,19 +91,20 @@ static int write_file(const char *path, const char *buf, 
size_t buflen)
return (unsigned int) numwritten;
 }
 
-static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end)
+static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end, 
int order)
 {
char input[INPUT_MAX];
int ret;
 
-   ret = snprintf(input, INPUT_MAX, "%d,0x%lx,0x%lx", pid, vaddr_start,
-   vaddr_end);
+   ret = snprintf(input, INPUT_MAX, "%d,0x%lx,0x%lx,%d", pid, vaddr_start,
+   vaddr_end, order);
if (ret >= INPUT_MAX) {
printf("%s: Debugfs input is too long\n", __func__);
exit(EXIT_FAILURE);
}
 
-   if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) {
+   /* order == 1 is an invalid input that should be detected. */
+   if (order != 1 && !write_file(SPLIT_DEBUGFS, input, ret + 1)) {
perror(SPLIT_DEBUGFS);
exit(EXIT_FAILURE);
}
@@ -118,7 +121,7 @@ static bool check_for_pattern(FILE *fp, const char 
*pattern, char *buf)
return false;
 }
 
-static uint64_t check_huge(void *addr)
+static uint64_t check_huge(void *addr, const char *prefix)
 {
uint64_t thp = 0;
int ret;
@@ -143,13 +146,13 @@ static uint64_t check_huge(void *addr)
goto err_out;
 
/*
-* Fetch the AnonHugePages: in the same block and check the number of
+* Fetch the @prefix in the same block and check the number of
 * hugepages.
 */
-   if (!check_for_pattern(fp, "AnonHugePages:", buffer))
+   if (!check_for_pattern(fp, prefix, buffer))
goto err_out;
 
-   if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) {
+   if (sscanf([strlen(prefix)], &q

[PATCH 1/7] XArray: Fix splitting to non-zero orders

2020-11-19 Thread Zi Yan
From: "Matthew Wilcox (Oracle)" 

Splitting an order-4 entry into order-2 entries would leave the array
containing pointers to 40008000c000 instead of .
This is a one-character fix, but enhance the test suite to check this
case.

Reported-by: Zi Yan 
Signed-off-by: Matthew Wilcox (Oracle) 
---
 lib/test_xarray.c | 26 ++
 lib/xarray.c  |  4 ++--
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8294f43f4981..8b1c318189ce 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1530,24 +1530,24 @@ static noinline void check_store_range(struct xarray 
*xa)
 
 #ifdef CONFIG_XARRAY_MULTI
 static void check_split_1(struct xarray *xa, unsigned long index,
-   unsigned int order)
+   unsigned int order, unsigned int new_order)
 {
-   XA_STATE(xas, xa, index);
-   void *entry;
-   unsigned int i = 0;
+   XA_STATE_ORDER(xas, xa, index, new_order);
+   unsigned int i;
 
xa_store_order(xa, index, order, xa, GFP_KERNEL);
 
xas_split_alloc(&xas, xa, order, GFP_KERNEL);
xas_lock(&xas);
xas_split(&xas, xa, order);
+   for (i = 0; i < (1 << order); i += (1 << new_order))
+   __xa_store(xa, index + i, xa_mk_index(index + i), 0);
xas_unlock(&xas);
 
-   xa_for_each(xa, index, entry) {
-   XA_BUG_ON(xa, entry != xa);
-   i++;
+   for (i = 0; i < (1 << order); i++) {
+   unsigned int val = index + (i & ~((1 << new_order) - 1));
+   XA_BUG_ON(xa, xa_load(xa, index + i) != xa_mk_index(val));
}
-   XA_BUG_ON(xa, i != 1 << order);
 
xa_set_mark(xa, index, XA_MARK_0);
XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0));
@@ -1557,14 +1557,16 @@ static void check_split_1(struct xarray *xa, unsigned 
long index,
 
 static noinline void check_split(struct xarray *xa)
 {
-   unsigned int order;
+   unsigned int order, new_order;
 
XA_BUG_ON(xa, !xa_empty(xa));
 
for (order = 1; order < 2 * XA_CHUNK_SHIFT; order++) {
-   check_split_1(xa, 0, order);
-   check_split_1(xa, 1UL << order, order);
-   check_split_1(xa, 3UL << order, order);
+   for (new_order = 0; new_order < order; new_order++) {
+   check_split_1(xa, 0, order, new_order);
+   check_split_1(xa, 1UL << order, order, new_order);
+   check_split_1(xa, 3UL << order, order, new_order);
+   }
}
 }
 #else
diff --git a/lib/xarray.c b/lib/xarray.c
index fc70e37c4c17..74915ba018c4 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1012,7 +1012,7 @@ void xas_split_alloc(struct xa_state *xas, void *entry, 
unsigned int order,
 
do {
unsigned int i;
-   void *sibling;
+   void *sibling = NULL;
struct xa_node *node;
 
node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
@@ -1022,7 +1022,7 @@ void xas_split_alloc(struct xa_state *xas, void *entry, 
unsigned int order,
for (i = 0; i < XA_CHUNK_SIZE; i++) {
if ((i & mask) == 0) {
RCU_INIT_POINTER(node->slots[i], entry);
-   sibling = xa_mk_sibling(0);
+   sibling = xa_mk_sibling(i);
} else {
RCU_INIT_POINTER(node->slots[i], sibling);
}
-- 
2.28.0



[PATCH 3/7] mm: memcg: make memcg huge page split support any order split.

2020-11-19 Thread Zi Yan
From: Zi Yan 

It sets memcg information for the pages after the split. A new parameter
new_order is added to tell the new page order, always 0 for now. It
prepares for upcoming changes to support splitting a huge page to any lower order.

Signed-off-by: Zi Yan 
Reviewed-by: Ralph Campbell 
Acked-by: Roman Gushchin 
---
 include/linux/memcontrol.h | 5 +++--
 mm/huge_memory.c   | 2 +-
 mm/memcontrol.c| 6 +++---
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a8d5daf95988..39707feae505 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1062,7 +1062,7 @@ static inline void memcg_memory_event_mm(struct mm_struct 
*mm,
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_order);
 #endif
 
 #else /* CONFIG_MEMCG */
@@ -1396,7 +1396,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t 
*pgdat, int order,
return 0;
 }
 
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
+static inline void mem_cgroup_split_huge_fixup(struct page *head,
+  unsigned int new_order)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88d8b7fce5d7..d7ab5cac5851 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2428,7 +2428,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
lruvec = mem_cgroup_page_lruvec(head, pgdat);
 
/* complete memcg works before add pages to LRU */
-   mem_cgroup_split_huge_fixup(head);
+   mem_cgroup_split_huge_fixup(head, 0);
 
if (PageAnon(head) && PageSwapCache(head)) {
swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index de5869dd354d..4521ed3a51b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3223,15 +3223,15 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
size_t size)
  * Because tail pages are not marked as "used", set it. We're under
  * pgdat->lru_lock and migration entries setup in all page mappings.
  */
-void mem_cgroup_split_huge_fixup(struct page *head)
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_order)
 {
struct mem_cgroup *memcg = page_memcg(head);
-   int i;
+   int i, new_nr = 1 << new_order;
 
if (mem_cgroup_disabled())
return;
 
-   for (i = 1; i < thp_nr_pages(head); i++) {
+   for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
css_get(&memcg->css);
head[i].memcg_data = (unsigned long)memcg;
}
-- 
2.28.0



Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-17 Thread Zi Yan
On 17 Nov 2020, at 16:22, Matthew Wilcox wrote:

> On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
>> On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
>>
>>> On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
>>>> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
>>>>> Matthew recently converted split_page_owner to take nr instead of 
>>>>> order.[1]
>>>>> But I am not
>>>>> sure why, since it seems to me that two call sites (__split_huge_page in
>>>>> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
>>>>> information.
>>>>
>>>> Yeah, I'm not sure why too. Maybe Matthew has some input here?
>>>> You can also pass new_nr, but IMO orders look so much better here.
>>>
>>> If only I'd written that information in the changelog ... oh wait, I did!
>>>
>>> mm/page_owner: change split_page_owner to take a count
>>>
>>> The implementation of split_page_owner() prefers a count rather than the
>>> old order of the page.  When we support a variable size THP, we won't
>>> have the order at this point, but we will have the number of pages.
>>> So change the interface to what the caller and callee would prefer.
>>
>> There are two callers, split_page in mm/page_alloc.c and __split_huge_page in
>> mm/huge_memory.c. The former has the page order. The latter has the page 
>> order
>> information before __split_huge_page_tail is called, so we can do
>> old_order = thp_order(head) instead of nr = thp_nr_page(head) and use 
>> old_order.
>> What am I missing there?
>
> Sure, we could also do that.  But what I wrote was true at the time I
> wrote it.

Got it. Thanks. Will change it to use old_order to make split_page_owner 
parameters
look more consistent.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-17 Thread Zi Yan
On 17 Nov 2020, at 16:10, Matthew Wilcox wrote:

> On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
>> -for (i = 0; i < nr; i++) {
>> +for (i = 0; i < nr; i += (1 << new_order)) {
>>  page_owner = get_page_owner(page_ext);
>> -page_owner->order = 0;
>> +page_owner->order = new_order;
>>  page_ext = page_ext_next(page_ext);
>>  }
>
> This doesn't do what you're hoping it will.  It's going to set ->order to
> new_order for the first N pages instead of every 1/N pages.
>
> You'll need to do something like
>
>   page_ext = lookup_page_ext(page + i);

Will use this. Thanks.

>
> or add a new page_ext_add(page_ext, 1 << new_order);
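
For reference, with that suggestion the function would end up looking roughly
like this (a sketch only, keeping the RFC's (nr, new_order) parameters):

/* One page_ext lookup per new-order chunk, instead of walking the
 * first N page_ext entries. */
void __split_page_owner(struct page *page, unsigned int nr,
			unsigned int new_order)
{
	unsigned int i, new_nr = 1 << new_order;
	struct page_ext *page_ext;
	struct page_owner *page_owner;

	for (i = 0; i < nr; i += new_nr) {
		page_ext = lookup_page_ext(page + i);
		if (unlikely(!page_ext))
			return;
		page_owner = get_page_owner(page_ext);
		page_owner->order = new_order;
	}
}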


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-17 Thread Zi Yan
On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:

> On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
>> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
>>> Matthew recently converted split_page_owner to take nr instead of order.[1]
>>> But I am not
>>> sure why, since it seems to me that two call sites (__split_huge_page in
>>> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
>>> information.
>>
>> Yeah, I'm not sure why too. Maybe Matthew has some input here?
>> You can also pass new_nr, but IMO orders look so much better here.
>
> If only I'd written that information in the changelog ... oh wait, I did!
>
> mm/page_owner: change split_page_owner to take a count
>
> The implementation of split_page_owner() prefers a count rather than the
> old order of the page.  When we support a variable size THP, we won't
> have the order at this point, but we will have the number of pages.
> So change the interface to what the caller and callee would prefer.

There are two callers, split_page in mm/page_alloc.c and __split_huge_page in
mm/huge_memory.c. The former has the page order. The latter has the page order
information before __split_huge_page_tail is called, so we can do
old_order = thp_order(head) instead of nr = thp_nr_page(head) and use old_order.
What am I missing there?

Thanks.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH] docs/vm: remove unused 3 items explanation for /proc/vmstat

2020-11-16 Thread Zi Yan
On 16 Nov 2020, at 4:51, Alex Shi wrote:

> Commit 5647bc293ab1 ("mm: compaction: Move migration fail/success
> stats to migrate.c"), removed 3 items in /proc/vmstat. but the docs
> still has their explanation. let's remove them.
>
> "compact_blocks_moved",
> "compact_pages_moved",
> "compact_pagemigrate_failed",
>
> Signed-off-by: Alex Shi 
> Cc: Jonathan Corbet 
> Cc: Andrew Morton 
> Cc: Yang Shi 
> Cc: "Kirill A. Shutemov" 
> Cc: David Rientjes 
> Cc: Zi Yan 
> Cc: linux-...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 15 ---
>  1 file changed, 15 deletions(-)
>

LGTM. Reviewed-by: Zi Yan .

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.

2020-11-16 Thread Zi Yan
On 16 Nov 2020, at 11:06, Kirill A. Shutemov wrote:

> On Wed, Nov 11, 2020 at 03:40:03PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> Huge pages in the process with the given pid and virtual address range
>> are split. It is used to test split huge page function. In addition,
>> a testing program is added to tools/testing/selftests/vm to utilize the
>> interface by splitting PMD THPs.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  mm/huge_memory.c  |  98 +++
>>  mm/internal.h |   1 +
>>  mm/migrate.c  |   2 +-
>>  tools/testing/selftests/vm/Makefile   |   1 +
>>  .../selftests/vm/split_huge_page_test.c   | 161 ++
>>  5 files changed, 262 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 207ebca8c654..c4fead5ead31 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -7,6 +7,7 @@
>>
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val)
>>  DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
>>  "%llu\n");
>>
>> +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
>> +const char __user *buf, size_t count, loff_t *ppops)
>> +{
>> +static DEFINE_MUTEX(mutex);
>> +ssize_t ret;
>> +char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
>> +int pid;
>> +unsigned long vaddr_start, vaddr_end, addr;
>> +nodemask_t task_nodes;
>> +struct mm_struct *mm;
>> +
>> +ret = mutex_lock_interruptible(&mutex);
>> +if (ret)
>> +return ret;
>> +
>> +ret = -EFAULT;
>> +
>> +memset(input_buf, 0, 80);
>> +if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
>> +goto out;
>> +
>> +input_buf[80] = '\0';
>
> Hm. Out-of-buffer access?

Sorry. Will fix it.

>
>> +ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
>
> Why hex without 0x prefix?

No particular reason. Let me add the prefix.

>
>> +if (ret != 3) {
>> +ret = -EINVAL;
>> +goto out;
>> +}
>> +vaddr_start &= PAGE_MASK;
>> +vaddr_end &= PAGE_MASK;
>> +
>> +ret = strlen(input_buf);
>> +pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
>> + pid, vaddr_start, vaddr_end);
>> +
>> +mm = find_mm_struct(pid, &task_nodes);
>
> I don't follow why you need nodemask.

I don’t need it. I just reuse the find_mm_struct function from
mm/migrate.c.

>
>> +if (IS_ERR(mm)) {
>> +ret = -EINVAL;
>> +goto out;
>> +}
>> +
>> +mmap_read_lock(mm);
>> +for (addr = vaddr_start; addr < vaddr_end;) {
>> +struct vm_area_struct *vma = find_vma(mm, addr);
>> +unsigned int follflags;
>> +struct page *page;
>> +
>> +if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>> +break;
>> +
>> +/* FOLL_DUMP to ignore special (like zero) pages */
>> +follflags = FOLL_GET | FOLL_DUMP;
>> +page = follow_page(vma, addr, follflags);
>> +
>> +if (IS_ERR(page))
>> +break;
>> +if (!page)
>> +break;
>> +
>> +if (!is_transparent_hugepage(page))
>> +goto next;
>> +
>> +if (!can_split_huge_page(page, NULL))
>> +goto next;
>> +
>> +if (!trylock_page(page))
>> +goto next;
>> +
>> +addr += page_size(page) - PAGE_SIZE;
>
> Who said it was mapped as huge? mremap() allows to construct an PTE page
> table that filled with PTE-mapped THPs, each of them distinct.

I forgot about this. I was trying to be smart and skip the rest of the
subpages when we split a THP. I will always increase addr by PAGE_SIZE
to handle this situation.

>> +
>> +/* reset addr if split fails */
>> +if (split_huge_page(page))
>> +addr -= (page_size(page) - PAGE_SIZE);
>> +
>> +unlock_page(p

Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-16 Thread Zi Yan
On 16 Nov 2020, at 11:25, Kirill A. Shutemov wrote:

> On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> It adds a new_order parameter to set new page order in page owner.
>> It prepares for upcoming changes to support split huge page to any lower
>> order.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  include/linux/page_owner.h | 7 ---
>>  mm/huge_memory.c   | 2 +-
>>  mm/page_alloc.c| 2 +-
>>  mm/page_owner.c| 6 +++---
>>  4 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
>> index 3468794f83d2..215cbb159568 100644
>> --- a/include/linux/page_owner.h
>> +++ b/include/linux/page_owner.h
>> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>>  __set_page_owner(page, order, gfp_mask);
>>  }
>>
>> -static inline void split_page_owner(struct page *page, unsigned int nr)
>> +static inline void split_page_owner(struct page *page, unsigned int nr,
>> +unsigned int new_order)
>>  {
>>  if (static_branch_unlikely(&page_owner_inited))
>> -__split_page_owner(page, nr);
>> +__split_page_owner(page, nr, new_order);
>
> Hm. Where do you correct __split_page_owner() declaration. I don't see it.

I missed it. Will add it. Thanks.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-13 Thread Zi Yan

On 13 Nov 2020, at 19:15, Roman Gushchin wrote:


On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:

From: Zi Yan 

It adds a new_order parameter to set new page order in page owner.
It prepares for upcoming changes to support split huge page to any lower
order.

Signed-off-by: Zi Yan 
---
 include/linux/page_owner.h | 7 ---
 mm/huge_memory.c   | 2 +-
 mm/page_alloc.c| 2 +-
 mm/page_owner.c| 6 +++---
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 3468794f83d2..215cbb159568 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -31,10 +31,11 @@ static inline void set_page_owner(struct page 
*page,

__set_page_owner(page, order, gfp_mask);
 }

-static inline void split_page_owner(struct page *page, unsigned int 
nr)
+static inline void split_page_owner(struct page *page, unsigned int 
nr,

+   unsigned int new_order)
 {
if (static_branch_unlikely(&page_owner_inited))
-   __split_page_owner(page, nr);
+   __split_page_owner(page, nr, new_order);
 }
 static inline void copy_page_owner(struct page *oldpage, struct page 
*newpage)

 {
@@ -60,7 +61,7 @@ static inline void set_page_owner(struct page 
*page,

 {
 }
 static inline void split_page_owner(struct page *page,
-   unsigned int order)
+   unsigned int nr, unsigned int new_order)


With the addition of the new argument it's a bit hard to understand
what the function is supposed to do. It seems like nr == page_order(page),
is it right? Maybe we can pass old_order and new_order? Or just the page
and the new order?


Yeah, it is a bit confusing. Please see more below.




 {
 }
 static inline void copy_page_owner(struct page *oldpage, struct page 
*newpage)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f599f5b9bf7f..8b7d771ee962 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page 
*page, struct list_head *list,


ClearPageCompound(head);

-   split_page_owner(head, nr);
+   split_page_owner(head, nr, 1);

/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..a9eead0e091a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int 
order)


for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
-   split_page_owner(page, 1 << order);
+   split_page_owner(page, 1 << order, 1);
 }
 EXPORT_SYMBOL_GPL(split_page);

diff --git a/mm/page_owner.c b/mm/page_owner.c
index b735a8eafcdb..2b7f7e9056dc 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page 
*page, int reason)

page_owner->last_migrate_reason = reason;
 }

-void __split_page_owner(struct page *page, unsigned int nr)
+void __split_page_owner(struct page *page, unsigned int nr, unsigned 
int new_order)

 {
int i;
struct page_ext *page_ext = lookup_page_ext(page);
@@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, 
unsigned int nr)

if (unlikely(!page_ext))
return;

-   for (i = 0; i < nr; i++) {
+   for (i = 0; i < nr; i += (1 << new_order)) {
page_owner = get_page_owner(page_ext);
-   page_owner->order = 0;
+   page_owner->order = new_order;
page_ext = page_ext_next(page_ext);


I believe there cannot be any leftovers because nr is always a power of 2.
Is it true? Converting nr argument to order (if it's possible) will make it
obvious.


Right. nr = thp_nr_pages(head), which is a power of 2. There would not be any
leftover.

Matthew recently converted split_page_owner to take nr instead of order.[1]
But I am not sure why, since it seems to me that two call sites
(__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c)
can pass the order information.



[1]https://lore.kernel.org/linux-mm/20200908195539.25896-4-wi...@infradead.org/


—
Best Regards,
Yan Zi


Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.

2020-11-13 Thread Zi Yan
On 13 Nov 2020, at 19:52, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:06PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> To split a THP to any lower order pages, we need to reform THPs on
>> subpages at given order and add page refcount based on the new page
>> order. Also we need to reinitialize page_deferred_list after removing
>> the page from the split_queue, otherwise a subsequent split will see
>> list corruption when checking the page_deferred_list again.
>>
>> It has many uses, like minimizing the number of pages after
>> truncating a pagecache THP. For anonymous THPs, we can only split them
>> to order-0 like before until we add support for any size anonymous THPs.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  include/linux/huge_mm.h |  8 +
>>  mm/huge_memory.c| 78 +
>>  mm/swap.c   |  1 -
>>  3 files changed, 63 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 60a907a19f7d..9819cd9b4619 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>>
>>  bool can_split_huge_page(struct page *page, int *pextra_pins);
>>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head 
>> *list,
>> +unsigned int new_order);
>>  static inline int split_huge_page(struct page *page)
>>  {
>>  return split_huge_page_to_list(page, NULL);
>> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct 
>> list_head *list)
>>  {
>>  return 0;
>>  }
>> +static inline int
>> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
>> +unsigned int new_order)
>
> It was
> int split_huge_page_to_list_to_order(struct page *page, struct list_head 
> *list,
>   unsigned int new_order);
> above.

Right. It should be split_huge_page_to_list_to_order. Will fix it.

>
>> +{
>> +return 0;
>> +}
>>  static inline int split_huge_page(struct page *page)
>>  {
>>  return 0;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8b7d771ee962..88f50da40c9b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct 
>> *vma,
>>  static void unmap_page(struct page *page)
>>  {
>>  enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
>> -TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
>> +TTU_RMAP_LOCKED;
>>  bool unmap_success;
>>
>>  VM_BUG_ON_PAGE(!PageHead(page), page);
>>
>> +if (thp_order(page) >= HPAGE_PMD_ORDER)
>> +ttu_flags |= TTU_SPLIT_HUGE_PMD;
>> +
>>  if (PageAnon(page))
>>  ttu_flags |= TTU_SPLIT_FREEZE;
>>
>> @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>>  VM_BUG_ON_PAGE(!unmap_success, page);
>>  }
>>
>> -static void remap_page(struct page *page, unsigned int nr)
>> +static void remap_page(struct page *page, unsigned int nr, unsigned int 
>> new_nr)
>>  {
>>  int i;
>> -if (PageTransHuge(page)) {
>> +if (thp_nr_pages(page) == nr) {
>>  remove_migration_ptes(page, page, true);
>>  } else {
>> -for (i = 0; i < nr; i++)
>> +for (i = 0; i < nr; i += new_nr)
>>  remove_migration_ptes(page + i, page + i, true);
>>  }
>>  }
>>
>>  static void __split_huge_page_tail(struct page *head, int tail,
>> -struct lruvec *lruvec, struct list_head *list)
>> +struct lruvec *lruvec, struct list_head *list, unsigned int 
>> new_order)
>>  {
>>  struct page *page_tail = head + tail;
>> +unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>>
>>  VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>>
>> @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, 
>> int tail,
>>  #ifdef CONFIG_64BIT
>>   (1L << PG_arch_2) |
>>  #endif
>> + compound_head_flag |
>>   (1L << PG_dirty)));
>>
>>  /* ->mapping in first tail page is compound_mapcount */
>> @@ -2395,10 +2400,15 @@ static voi

Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.

2020-11-13 Thread Zi Yan
On 13 Nov 2020, at 19:23, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:04PM -0500, Zi Yan wrote:
>> From: Zi Yan 
>>
>> It reads thp_nr_pages and splits to provided new_nr. It prepares for
>> upcoming changes to support split huge page to any lower order.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>  include/linux/memcontrol.h | 5 +++--
>>  mm/huge_memory.c   | 2 +-
>>  mm/memcontrol.c| 4 ++--
>>  3 files changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 0f4dd7829fb2..b3bac79ceed6 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct 
>> mm_struct *mm,
>>  }
>>
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -void mem_cgroup_split_huge_fixup(struct page *head);
>> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
>>  #endif
>>
>>  #else /* CONFIG_MEMCG */
>> @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t 
>> *pgdat, int order,
>>  return 0;
>>  }
>>
>> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
>> +static inline void mem_cgroup_split_huge_fixup(struct page *head,
>> +   unsigned int new_nr)
>>  {
>>  }
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c4fead5ead31..f599f5b9bf7f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, 
>> struct list_head *list,
>>  lruvec = mem_cgroup_page_lruvec(head, pgdat);
>>
>>  /* complete memcg works before add pages to LRU */
>> -mem_cgroup_split_huge_fixup(head);
>> +mem_cgroup_split_huge_fixup(head, 1);
>>
>>  if (PageAnon(head) && PageSwapCache(head)) {
>>  swp_entry_t entry = { .val = page_private(head) };
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 33f632689cee..e9705ba6bbcc 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
>> size_t size)
>>   * Because tail pages are not marked as "used", set it. We're under
>>   * pgdat->lru_lock and migration entries setup in all page mappings.
>>   */
>> -void mem_cgroup_split_huge_fixup(struct page *head)
>> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
>
> I'd go with unsigned int new_order, then it's obvious that we can split
> the original page without any leftovers.

Makes sense. Will change it.

>
> Other than that the patch looks good!
> Acked-by: Roman Gushchin 

Thanks.

>>  {
>>  struct mem_cgroup *memcg = page_memcg(head);
>>  int i;
>> @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>>  if (mem_cgroup_disabled())
>>  return;
>>
>> -for (i = 1; i < thp_nr_pages(head); i++) {
>> +for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
>>  css_get(&memcg->css);
>>  head[i].memcg_data = (unsigned long)memcg;
>>  }
>> -- 
>> 2.28.0
>>


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.

2020-11-12 Thread Zi Yan
On 12 Nov 2020, at 17:22, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan 
>>
>> Huge pages in the process with the given pid and virtual address range
>> are split. It is used to test split huge page function. In addition,
>> a testing program is added to tools/testing/selftests/vm to utilize the
>> interface by splitting PMD THPs.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>   mm/huge_memory.c  |  98 +++
>>   mm/internal.h |   1 +
>>   mm/migrate.c  |   2 +-
>>   tools/testing/selftests/vm/Makefile   |   1 +
>>   .../selftests/vm/split_huge_page_test.c   | 161 ++
>>   5 files changed, 262 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>
> Don't forget to update ".gitignore" to include "split_huge_page_test".

Sure. Thanks for pointing this out.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible.

2020-11-12 Thread Zi Yan
On 12 Nov 2020, at 17:08, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan 
>>
>> To minimize the number of pages after a truncation, when truncating a
>> THP, we do not need to split it all the way down to order-0. The THP has
>> at most three parts, the part before offset, the part to be truncated,
>> the part left at the end. Use the non-zero minimum of them to decide
>> what order we split the THP to.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>   mm/truncate.c | 22 --
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 20bd17538ec2..6d8e3c6115bc 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, 
>> struct page *page)
>>   bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t 
>> end)
>>   {
>>  loff_t pos = page_offset(page);
>> -unsigned int offset, length;
>> +unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;
>
> Maybe use "remaining" instead of "left" since I think of the latter as the 
> length of the
> left side (offset).

Sure. Will change the name.

>
>>  if (pos < start)
>>  offset = start - pos;
>> @@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, 
>> loff_t start, loff_t end)
>>  length = length - offset;
>>  else
>>  length = end + 1 - pos - offset;
>> +left = thp_size(page) - offset - length;
>>  wait_on_page_writeback(page);
>>  if (length == thp_size(page)) {
>> @@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, 
>> loff_t start, loff_t end)
>>  do_invalidatepage(page, offset, length);
>>  if (!PageTransHuge(page))
>>  return true;
>> -return split_huge_page(page) == 0;
>> +
>> +/*
>> + * find the non-zero minimum of offset, length, and left and use it to
>> + * decide the new order of the page after split
>> + */
>> +if (offset && left)
>> +min_subpage_size = min_t(unsigned int,
>> + min_t(unsigned int, offset, length),
>> + left);
>> +else if (!offset)
>> +min_subpage_size = min_t(unsigned int, length, left);
>> +else /* !left */
>> +min_subpage_size = min_t(unsigned int, length, offset);
>> +
>> +min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
>> +
>> +return split_huge_page_to_list_to_order(page, NULL,
>> +ilog2(min_subpage_size/PAGE_SIZE)) == 0;
>>   }
>
> What if "min_subpage_size" is 1/2 the THP but offset isn't aligned to 1/2?
> Splitting the page in half wouldn't result in a page that could be freed
> but maybe splitting to 1/4 would (assuming the THP is at least 8x PAGE_SIZE).

Is it possible? The whole THP is divided into three parts: offset, length, and
remaining (renamed from left). If offset is not aligned to 1/2, it is either
greater than 1/2 or smaller than 1/2. If it is the former, length and remaining
will both be smaller than 1/2, so min_subpage_size cannot be 1/2. If it is the
latter, min_subpage_size cannot be 1/2 either, because min_subpage_size is the
smallest non-zero value of offset, length, and remaining. Let me know if I miss
anything.
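
For illustration, here is a small userspace model of how the split order gets
picked from the three parts (purely a sketch; PAGE_SIZE and the example offsets
below are made up, and the helper name is not from the kernel):

#include <stdio.h>

#define PAGE_SIZE 4096u
#define THP_SIZE  (512u * PAGE_SIZE)	/* one 2MB PMD THP */

/* smallest non-zero value of the three parts; length is never zero here */
static unsigned int min_nonzero3(unsigned int offset, unsigned int length,
				 unsigned int remaining)
{
	unsigned int m = length;

	if (offset && offset < m)
		m = offset;
	if (remaining && remaining < m)
		m = remaining;
	return m;
}

int main(void)
{
	unsigned int offset = 16u * PAGE_SIZE;	/* bytes kept before the hole */
	unsigned int length = 64u * PAGE_SIZE;	/* bytes being truncated */
	unsigned int remaining = THP_SIZE - offset - length;
	unsigned int min_sz = min_nonzero3(offset, length, remaining);
	unsigned int order = 0, n;

	if (min_sz < PAGE_SIZE)
		min_sz = PAGE_SIZE;
	for (n = min_sz / PAGE_SIZE; n > 1; n >>= 1)	/* ilog2() */
		order++;

	printf("offset=%u length=%u remaining=%u pages -> split to order %u\n",
	       offset / PAGE_SIZE, length / PAGE_SIZE,
	       remaining / PAGE_SIZE, order);
	return 0;
}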

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.

2020-11-12 Thread Zi Yan
On 12 Nov 2020, at 17:01, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan 
>>
>> To split a THP to any lower order pages, we need to reform THPs on
>> subpages at given order and add page refcount based on the new page
>> order. Also we need to reinitialize page_deferred_list after removing
>> the page from the split_queue, otherwise a subsequent split will see
>> list corruption when checking the page_deferred_list again.
>>
>> It has many uses, like minimizing the number of pages after
>> truncating a pagecache THP. For anonymous THPs, we can only split them
>> to order-0 like before until we add support for any size anonymous THPs.
>>
>> Signed-off-by: Zi Yan 
>> ---
>>   include/linux/huge_mm.h |  8 +
>>   mm/huge_memory.c| 78 +
>>   mm/swap.c   |  1 -
>>   3 files changed, 63 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 60a907a19f7d..9819cd9b4619 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>>bool can_split_huge_page(struct page *page, int *pextra_pins);
>>   int split_huge_page_to_list(struct page *page, struct list_head *list);
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head 
>> *list,
>> +unsigned int new_order);
>>   static inline int split_huge_page(struct page *page)
>>   {
>>  return split_huge_page_to_list(page, NULL);
>> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct 
>> list_head *list)
>>   {
>>  return 0;
>>   }
>> +static inline int
>> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
>> +unsigned int new_order)
>> +{
>> +return 0;
>> +}
>>   static inline int split_huge_page(struct page *page)
>>   {
>>  return 0;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8b7d771ee962..88f50da40c9b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct 
>> *vma,
>>   static void unmap_page(struct page *page)
>>   {
>>  enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
>> -TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
>> +TTU_RMAP_LOCKED;
>>  bool unmap_success;
>>  VM_BUG_ON_PAGE(!PageHead(page), page);
>>  +   if (thp_order(page) >= HPAGE_PMD_ORDER)
>> +ttu_flags |= TTU_SPLIT_HUGE_PMD;
>> +
>>  if (PageAnon(page))
>>  ttu_flags |= TTU_SPLIT_FREEZE;
>>  @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>>  VM_BUG_ON_PAGE(!unmap_success, page);
>>   }
>>  -static void remap_page(struct page *page, unsigned int nr)
>> +static void remap_page(struct page *page, unsigned int nr, unsigned int 
>> new_nr)
>>   {
>>  int i;
>
> Use unsigned int i?
> Maybe a blank line here and the {}'s around if/else aren't needed.
>
>> -if (PageTransHuge(page)) {
>> +if (thp_nr_pages(page) == nr) {
>>  remove_migration_ptes(page, page, true);
>>  } else {
>> -for (i = 0; i < nr; i++)
>> +for (i = 0; i < nr; i += new_nr)
>>  remove_migration_ptes(page + i, page + i, true);
>>  }
>>   }
>>static void __split_huge_page_tail(struct page *head, int tail,
>> -struct lruvec *lruvec, struct list_head *list)
>> +struct lruvec *lruvec, struct list_head *list, unsigned int 
>> new_order)
>>   {
>>  struct page *page_tail = head + tail;
>> +unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>>  VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>>  @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, 
>> int tail,
>>   #ifdef CONFIG_64BIT
>>   (1L << PG_arch_2) |
>>   #endif
>> + compound_head_flag |
>>   (1L << PG_dirty)));
>>  /* ->mapping in first tail page is compound_mapcount */
>> @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page 
>> *head, int tail,
>>   * which needs correct compound_head().
>>   */
>>  clear_compound_head(page_tail);
>> +if (new_or

Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.

2020-11-12 Thread Zi Yan
On 12 Nov 2020, at 12:58, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan 
>>
>> It reads thp_nr_pages and splits to provided new_nr. It prepares for
>> upcoming changes to support split huge page to any lower order.
>>
>> Signed-off-by: Zi Yan 
>
> Looks OK to me.
> Reviewed-by: Ralph Campbell 

Thanks.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-12 Thread Zi Yan
On 12 Nov 2020, at 12:57, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan 
>>
>> It adds a new_order parameter to set new page order in page owner.
>> It prepares for upcoming changes to support split huge page to any lower
>> order.
>>
>> Signed-off-by: Zi Yan 
>
> Except for a minor fix below, you can add:
> Reviewed-by: Ralph Campbell 

Thanks.

>
>> ---
>>   include/linux/page_owner.h | 7 ---
>>   mm/huge_memory.c   | 2 +-
>>   mm/page_alloc.c| 2 +-
>>   mm/page_owner.c| 6 +++---
>>   4 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
>> index 3468794f83d2..215cbb159568 100644
>> --- a/include/linux/page_owner.h
>> +++ b/include/linux/page_owner.h
>> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>>  __set_page_owner(page, order, gfp_mask);
>>   }
>>  -static inline void split_page_owner(struct page *page, unsigned int nr)
>> +static inline void split_page_owner(struct page *page, unsigned int nr,
>> +unsigned int new_order)
>>   {
>>  if (static_branch_unlikely(_owner_inited))
>> -__split_page_owner(page, nr);
>> +__split_page_owner(page, nr, new_order);
>>   }
>>   static inline void copy_page_owner(struct page *oldpage, struct page 
>> *newpage)
>>   {
>> @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
>>   {
>>   }
>>   static inline void split_page_owner(struct page *page,
>> -unsigned int order)
>> +unsigned int nr, unsigned int new_order)
>>   {
>>   }
>>   static inline void copy_page_owner(struct page *oldpage, struct page 
>> *newpage)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index f599f5b9bf7f..8b7d771ee962 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, 
>> struct list_head *list,
>>  ClearPageCompound(head);
>>  -   split_page_owner(head, nr);
>> +split_page_owner(head, nr, 1);
>
> Shouldn't this be 0, not 1?
> (new_order not new_nr).
>

Yes, I forgot to fix the call site after I changed the function signature.
Thanks.

>>  /* See comment in __split_huge_page_tail() */
>>  if (PageAnon(head)) {
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d77220615fd5..a9eead0e091a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
>>  for (i = 1; i < (1 << order); i++)
>>  set_page_refcounted(page + i);
>> -split_page_owner(page, 1 << order);
>> +split_page_owner(page, 1 << order, 1);
>
> Ditto, 0.
>

Sure, will fix this too.


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


[RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.

2020-11-11 Thread Zi Yan
From: Zi Yan 

It reads thp_nr_pages and splits to provided new_nr. It prepares for
upcoming changes to support split huge page to any lower order.

Signed-off-by: Zi Yan 
---
 include/linux/memcontrol.h | 5 +++--
 mm/huge_memory.c   | 2 +-
 mm/memcontrol.c| 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..b3bac79ceed6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct 
*mm,
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
 #endif
 
 #else /* CONFIG_MEMCG */
@@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t 
*pgdat, int order,
return 0;
 }
 
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
+static inline void mem_cgroup_split_huge_fixup(struct page *head,
+  unsigned int new_nr)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c4fead5ead31..f599f5b9bf7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
lruvec = mem_cgroup_page_lruvec(head, pgdat);
 
/* complete memcg works before add pages to LRU */
-   mem_cgroup_split_huge_fixup(head);
+   mem_cgroup_split_huge_fixup(head, 1);
 
if (PageAnon(head) && PageSwapCache(head)) {
swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33f632689cee..e9705ba6bbcc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t 
size)
  * Because tail pages are not marked as "used", set it. We're under
  * pgdat->lru_lock and migration entries setup in all page mappings.
  */
-void mem_cgroup_split_huge_fixup(struct page *head)
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
 {
struct mem_cgroup *memcg = page_memcg(head);
int i;
@@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
if (mem_cgroup_disabled())
return;
 
-   for (i = 1; i < thp_nr_pages(head); i++) {
+   for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
css_get(&memcg->css);
head[i].memcg_data = (unsigned long)memcg;
}
-- 
2.28.0



[RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-11 Thread Zi Yan
From: Zi Yan 

It adds a new_order parameter to set new page order in page owner.
It prepares for upcoming changes to support split huge page to any lower
order.

Signed-off-by: Zi Yan 
---
 include/linux/page_owner.h | 7 ---
 mm/huge_memory.c   | 2 +-
 mm/page_alloc.c| 2 +-
 mm/page_owner.c| 6 +++---
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 3468794f83d2..215cbb159568 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
__set_page_owner(page, order, gfp_mask);
 }
 
-static inline void split_page_owner(struct page *page, unsigned int nr)
+static inline void split_page_owner(struct page *page, unsigned int nr,
+   unsigned int new_order)
 {
if (static_branch_unlikely(_owner_inited))
-   __split_page_owner(page, nr);
+   __split_page_owner(page, nr, new_order);
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
 {
@@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
 {
 }
 static inline void split_page_owner(struct page *page,
-   unsigned int order)
+   unsigned int nr, unsigned int new_order)
 {
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f599f5b9bf7f..8b7d771ee962 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
 
ClearPageCompound(head);
 
-   split_page_owner(head, nr);
+   split_page_owner(head, nr, 1);
 
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..a9eead0e091a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
 
for (i = 1; i < (1 << order); i++)
set_page_refcounted(page + i);
-   split_page_owner(page, 1 << order);
+   split_page_owner(page, 1 << order, 1);
 }
 EXPORT_SYMBOL_GPL(split_page);
 
diff --git a/mm/page_owner.c b/mm/page_owner.c
index b735a8eafcdb..2b7f7e9056dc 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int 
reason)
page_owner->last_migrate_reason = reason;
 }
 
-void __split_page_owner(struct page *page, unsigned int nr)
+void __split_page_owner(struct page *page, unsigned int nr, unsigned int 
new_order)
 {
int i;
struct page_ext *page_ext = lookup_page_ext(page);
@@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
if (unlikely(!page_ext))
return;
 
-   for (i = 0; i < nr; i++) {
+   for (i = 0; i < nr; i += (1 << new_order)) {
page_owner = get_page_owner(page_ext);
-   page_owner->order = 0;
+   page_owner->order = new_order;
page_ext = page_ext_next(page_ext);
}
 }
-- 
2.28.0



[RFC PATCH 6/6] mm: huge_memory: enable debugfs to split huge pages to any order.

2020-11-11 Thread Zi Yan
From: Zi Yan 

It is used to test split_huge_page_to_list_to_order for pagecache THPs.
Also add test cases for split_huge_page_to_list_to_order via both
debugfs and truncating a file.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  |  13 +--
 .../selftests/vm/split_huge_page_test.c   | 102 +-
 2 files changed, 105 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88f50da40c9b..b7470607a08b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2974,7 +2974,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
static DEFINE_MUTEX(mutex);
ssize_t ret;
char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
-   int pid;
+   int pid, to_order = 0;
unsigned long vaddr_start, vaddr_end, addr;
nodemask_t task_nodes;
struct mm_struct *mm;
@@ -2990,8 +2990,9 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
goto out;
 
input_buf[80] = '\0';
-   ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
-   if (ret != 3) {
+   ret = sscanf(input_buf, "%d,%lx,%lx,%d", &pid, &vaddr_start, 
&vaddr_end, &to_order);
+   /* cannot split to order-1 THP, which is not possible */
+   if ((ret != 3 && ret != 4) || to_order == 1) {
ret = -EINVAL;
goto out;
}
@@ -2999,8 +3000,8 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
vaddr_end &= PAGE_MASK;
 
ret = strlen(input_buf);
-   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
-pid, vaddr_start, vaddr_end);
+   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx], to order: 
%d\n",
+pid, vaddr_start, vaddr_end, to_order);
 
mm = find_mm_struct(pid, &task_nodes);
if (IS_ERR(mm)) {
@@ -3038,7 +3039,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct 
file *file,
addr += page_size(page) - PAGE_SIZE;
 
/* reset addr if split fails */
-   if (split_huge_page(page))
+   if (split_huge_page_to_list_to_order(page, NULL, to_order))
addr -= (page_size(page) - PAGE_SIZE);
 
unlock_page(page);
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c 
b/tools/testing/selftests/vm/split_huge_page_test.c
index c8a32ae9e13a..bcbc5a9d327c 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define PAGE_4KB (4096UL)
 #define PAGE_2MB (512UL*PAGE_4KB)
@@ -31,6 +32,7 @@
 
 #define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"
 #define SMAP_PATH "/proc/self/smaps"
+#define THP_FS_PATH "/mnt/thp_fs"
 #define INPUT_MAX 80
 
 static int write_file(const char *path, const char *buf, size_t buflen)
@@ -50,13 +52,13 @@ static int write_file(const char *path, const char *buf, 
size_t buflen)
return (unsigned int) numwritten;
 }
 
-static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end)
+static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end, 
int order)
 {
char input[INPUT_MAX];
int ret;
 
-   ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx", pid, vaddr_start,
-   vaddr_end);
+   ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx,%d", pid, vaddr_start,
+   vaddr_end, order);
if (ret >= INPUT_MAX) {
printf("%s: Debugfs input is too long\n", __func__);
exit(EXIT_FAILURE);
@@ -139,7 +141,7 @@ void split_pmd_thp(void)
}
 
/* split all possible huge pages */
-   write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
+   write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len, 
0);
 
*one_page = 0;
 
@@ -153,9 +155,101 @@ void split_pmd_thp(void)
free(one_page);
 }
 
+void create_pagecache_thp_and_fd(size_t fd_size, int *fd, char **addr)
+{
+   const char testfile[] = THP_FS_PATH "/test";
+   size_t i;
+   int dummy;
+
+   srand(time(NULL));
+
+   *fd = open(testfile, O_CREAT | O_RDWR, 0664);
+
+   for (i = 0; i < fd_size; i++) {
+   unsigned char byte = rand();
+
+   write(*fd, &byte, sizeof(byte));
+   }
+   close(*fd);
+   sync();
+   *fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
+   if (*fd == -1) {
+   perror("open drop_caches");
+   exit(EXIT_FAILURE);
+   }
+   if (write(*fd, "3", 1) != 1) {
+   perror("write to drop_caches");
+   exit(EXIT_FAILURE);
+   }
+   close(*fd);
+
+   

[RFC PATCH 0/6] Split huge pages to any lower order pages.

2020-11-11 Thread Zi Yan
From: Zi Yan 

Hi all,

With Matthew's THP in pagecache patches[1], we will be able to handle any size
pagecache THPs, but currently split_huge_page can only split a THP to order-0
pages. This can easily erase the benefit of having pagecache THPs, when
operations like truncate might want to keep pages larger than order-0. In
response, here is the patches to add support for splitting a THP to any lower
order pages. In addition, this patchset prepares for my PUD THP patchset[2],
since splitting a PUD THP to multiple PMD THPs can be handled by
split_huge_page_to_list_to_order function added by this patchset, which reduces
a lot of redundant code without just replicating split_huge_page for PUD THP.

The patchset is on top of Matthew's pagecache/next tree[3].

To ease the tests of split_huge_page functions, I added a new debugfs interface
at /sys/kernel/debug/split_huge_pages_in_range_pid, so developers can split
THPs in a given range from a process with the given pid by writing
"<pid>,<vaddr_start>,<vaddr_end>,<order>" to the interface. I also added a
new test program to test 1) split PMD THPs, 2) split pagecache THPs to any lower
order, and 3) truncating a pagecache THP to a page with a lower order.

Suggestions and comments are welcome. Thanks.


[1] https://lore.kernel.org/linux-mm/20201029193405.29125-1-wi...@infradead.org/
[2] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi@sent.com/
[3] https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/next

Zi Yan (6):
  mm: huge_memory: add new debugfs interface to trigger split huge page
on any page range.
  mm: memcg: make memcg huge page split support any order split.
  mm: page_owner: add support for splitting to any order in split
page_owner.
  mm: thp: add support for split huge page to any lower order pages.
  mm: truncate: split thp to a non-zero order if possible.
  mm: huge_memory: enable debugfs to split huge pages to any order.

 include/linux/huge_mm.h   |   8 +
 include/linux/memcontrol.h|   5 +-
 include/linux/page_owner.h|   7 +-
 mm/huge_memory.c  | 177 ++--
 mm/internal.h |   1 +
 mm/memcontrol.c   |   4 +-
 mm/migrate.c  |   2 +-
 mm/page_alloc.c   |   2 +-
 mm/page_owner.c   |   6 +-
 mm/swap.c |   1 -
 mm/truncate.c |  22 +-
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 255 ++
 13 files changed, 453 insertions(+), 38 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

--
2.28.0



[RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.

2020-11-11 Thread Zi Yan
From: Zi Yan 

Huge pages in the process with the given pid and virtual address range
are split. It is used to test split huge page function. In addition,
a testing program is added to tools/testing/selftests/vm to utilize the
interface by splitting PMD THPs.

Signed-off-by: Zi Yan 
---
 mm/huge_memory.c  |  98 +++
 mm/internal.h |   1 +
 mm/migrate.c  |   2 +-
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/split_huge_page_test.c   | 161 ++
 5 files changed, 262 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 207ebca8c654..c4fead5ead31 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
"%llu\n");
 
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
+   const char __user *buf, size_t count, loff_t *ppops)
+{
+   static DEFINE_MUTEX(mutex);
+   ssize_t ret;
+   char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+   int pid;
+   unsigned long vaddr_start, vaddr_end, addr;
+   nodemask_t task_nodes;
+   struct mm_struct *mm;
+
+   ret = mutex_lock_interruptible(&mutex);
+   if (ret)
+   return ret;
+
+   ret = -EFAULT;
+
+   memset(input_buf, 0, 80);
+   if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
+   goto out;
+
+   input_buf[80] = '\0';
+   ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
+   if (ret != 3) {
+   ret = -EINVAL;
+   goto out;
+   }
+   vaddr_start &= PAGE_MASK;
+   vaddr_end &= PAGE_MASK;
+
+   ret = strlen(input_buf);
+   pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
+pid, vaddr_start, vaddr_end);
+
+   mm = find_mm_struct(pid, &task_nodes);
+   if (IS_ERR(mm)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   mmap_read_lock(mm);
+   for (addr = vaddr_start; addr < vaddr_end;) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+   unsigned int follflags;
+   struct page *page;
+
+   if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+   break;
+
+   /* FOLL_DUMP to ignore special (like zero) pages */
+   follflags = FOLL_GET | FOLL_DUMP;
+   page = follow_page(vma, addr, follflags);
+
+   if (IS_ERR(page))
+   break;
+   if (!page)
+   break;
+
+   if (!is_transparent_hugepage(page))
+   goto next;
+
+   if (!can_split_huge_page(page, NULL))
+   goto next;
+
+   if (!trylock_page(page))
+   goto next;
+
+   addr += page_size(page) - PAGE_SIZE;
+
+   /* reset addr if split fails */
+   if (split_huge_page(page))
+   addr -= (page_size(page) - PAGE_SIZE);
+
+   unlock_page(page);
+next:
+   /* next page */
+   addr += page_size(page);
+   put_page(page);
+   }
+   mmap_read_unlock(mm);
+
+
+   mmput(mm);
+out:
+   mutex_unlock(&mutex);
+   return ret;
+
+}
+
+static const struct file_operations split_huge_pages_in_range_pid_fops = {
+   .owner   = THIS_MODULE,
+   .write   = split_huge_pages_in_range_pid_write,
+   .llseek  = no_llseek,
+};
+
 static int __init split_huge_pages_debugfs(void)
 {
debugfs_create_file("split_huge_pages", 0200, NULL, NULL,
&split_huge_pages_fops);
+   debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL,
+   &split_huge_pages_in_range_pid_fops);
return 0;
 }
 late_initcall(split_huge_pages_debugfs);
diff --git a/mm/internal.h b/mm/internal.h
index 3ea43642b99d..fd841a38830f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -624,4 +624,5 @@ struct migration_target_control {
 
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
 void page_cache_free_page(struct address_space *mapping, struct page *page);
+struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index a50bbb0e029b..e35654d1087d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1851,7 +1851,7 @@ static int do_pages_stat(struct mm_struct *mm, unsigned 
long nr_pages,
return nr_pages ? -EFAULT : 0;

[RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible.

2020-11-11 Thread Zi Yan
From: Zi Yan 

To minimize the number of pages after a truncation, when truncating a
THP, we do not need to split it all the way down to order-0. The THP has
at most three parts, the part before offset, the part to be truncated,
the part left at the end. Use the non-zero minimum of them to decide
what order we split the THP to.

Signed-off-by: Zi Yan 
---
 mm/truncate.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 20bd17538ec2..6d8e3c6115bc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, 
struct page *page)
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 {
loff_t pos = page_offset(page);
-   unsigned int offset, length;
+   unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;
 
if (pos < start)
offset = start - pos;
@@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t 
start, loff_t end)
length = length - offset;
else
length = end + 1 - pos - offset;
+   left = thp_size(page) - offset - length;
 
wait_on_page_writeback(page);
if (length == thp_size(page)) {
@@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t 
start, loff_t end)
do_invalidatepage(page, offset, length);
if (!PageTransHuge(page))
return true;
-   return split_huge_page(page) == 0;
+
+   /*
+* find the non-zero minimum of offset, length, and left and use it to
+* decide the new order of the page after split
+*/
+   if (offset && left)
+   min_subpage_size = min_t(unsigned int,
+min_t(unsigned int, offset, length),
+left);
+   else if (!offset)
+   min_subpage_size = min_t(unsigned int, length, left);
+   else /* !left */
+   min_subpage_size = min_t(unsigned int, length, offset);
+
+   min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
+
+   return split_huge_page_to_list_to_order(page, NULL,
+   ilog2(min_subpage_size/PAGE_SIZE)) == 0;
 }
 
 /*
-- 
2.28.0



[RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.

2020-11-11 Thread Zi Yan
From: Zi Yan 

To split a THP to any lower order pages, we need to reform THPs on
subpages at given order and add page refcount based on the new page
order. Also we need to reinitialize page_deferred_list after removing
the page from the split_queue, otherwise a subsequent split will see
list corruption when checking the page_deferred_list again.

It has many uses, like minimizing the number of pages after
truncating a pagecache THP. For anonymous THPs, we can only split them
to order-0 like before until we add support for any size anonymous THPs.

Signed-off-by: Zi Yan 
---
 include/linux/huge_mm.h |  8 +
 mm/huge_memory.c| 78 +
 mm/swap.c   |  1 -
 3 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 60a907a19f7d..9819cd9b4619 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+   unsigned int new_order);
 static inline int split_huge_page(struct page *page)
 {
return split_huge_page_to_list(page, NULL);
@@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct 
list_head *list)
 {
return 0;
 }
+static inline int
+split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
+   unsigned int new_order)
+{
+   return 0;
+}
 static inline int split_huge_page(struct page *page)
 {
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8b7d771ee962..88f50da40c9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 static void unmap_page(struct page *page)
 {
enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
-   TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+   TTU_RMAP_LOCKED;
bool unmap_success;
 
VM_BUG_ON_PAGE(!PageHead(page), page);
 
+   if (thp_order(page) >= HPAGE_PMD_ORDER)
+   ttu_flags |= TTU_SPLIT_HUGE_PMD;
+
if (PageAnon(page))
ttu_flags |= TTU_SPLIT_FREEZE;
 
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
-static void remap_page(struct page *page, unsigned int nr)
+static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
 {
int i;
-   if (PageTransHuge(page)) {
+   if (thp_nr_pages(page) == nr) {
remove_migration_ptes(page, page, true);
} else {
-   for (i = 0; i < nr; i++)
+   for (i = 0; i < nr; i += new_nr)
remove_migration_ptes(page + i, page + i, true);
}
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
-   struct lruvec *lruvec, struct list_head *list)
+   struct lruvec *lruvec, struct list_head *list, unsigned int 
new_order)
 {
struct page *page_tail = head + tail;
+   unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
 
VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
 
@@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int 
tail,
 #ifdef CONFIG_64BIT
 (1L << PG_arch_2) |
 #endif
+compound_head_flag |
 (1L << PG_dirty)));
 
/* ->mapping in first tail page is compound_mapcount */
@@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, 
int tail,
 * which needs correct compound_head().
 */
clear_compound_head(page_tail);
+   if (new_order) {
+   prep_compound_page(page_tail, new_order);
+   thp_prep(page_tail);
+   }
 
/* Finally unfreeze refcount. Additional reference from page cache. */
-   page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
- PageSwapCache(head)));
+   page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
+  PageSwapCache(head)) ?
+   thp_nr_pages(page_tail) : 0));
 
if (page_is_young(head))
set_page_young(page_tail);
@@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int 
tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-   pgoff_t end, unsigned long flags)
+   pgoff_t end, unsigned long flags, unsigned int new_order)
 {
struct page *head = compound_head(page);
pg_data_t *pgdat = page_pgdat(head);
@@ -2424,12 +2434,13 @@ sta

Re: [PATCH 5/5] mm: migrate: return -ENOSYS if THP migration is unsupported

2020-11-06 Thread Zi Yan
On 3 Nov 2020, at 8:03, Yang Shi wrote:

> In the current implementation unmap_and_move() would return -ENOMEM if
> THP migration is unsupported, then the THP will be split.  If split is
> failed just exit without trying to migrate other pages.  It doesn't make
> too much sense since there may be enough free memory to migrate other
> pages and there may be a lot base pages on the list.
>
> Return -ENOSYS to make consistent with hugetlb.  And if THP split is
> failed just skip and try other pages on the list.
>
> Just skip the whole list and exit when free memory is really low.
>
> Signed-off-by: Yang Shi 
> ---
>  mm/migrate.c | 62 ++--
>  1 file changed, 46 insertions(+), 16 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8f6a61c9274b..b3466d8c7f03 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1172,7 +1172,7 @@ static int unmap_and_move(new_page_t get_new_page,
>   struct page *newpage = NULL;
>
>   if (!thp_migration_supported() && PageTransHuge(page))
> - return -ENOMEM;
> + return -ENOSYS;
>
>   if (page_count(page) == 1) {
>   /* page was freed from under us. So we are done. */
> @@ -1376,6 +1376,20 @@ static int unmap_and_move_huge_page(new_page_t 
> get_new_page,
>   return rc;
>  }
>
> +static inline int try_split_thp(struct page *page, struct page *page2,
> + struct list_head *from)
> +{
> + int rc = 0;
> +
> + lock_page(page);
> + rc = split_huge_page_to_list(page, from);
> + unlock_page(page);
> + if (!rc)
> + list_safe_reset_next(page, page2, lru);

This does not work as expected, right? After macro expansion, we have
page2 = list_next_entry(page, lru). Since page2 is passed by value, the
assignment does not propagate back to the caller. You need to pass a pointer
to page2 here.
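
Something like the following untested sketch would propagate the update back to
the caller (same logic, just taking a pointer to page2):

static inline int try_split_thp(struct page *page, struct page **page2,
				struct list_head *from)
{
	int rc;

	lock_page(page);
	rc = split_huge_page_to_list(page, from);
	unlock_page(page);
	if (!rc)
		/* update the caller's cursor, not a local copy */
		*page2 = list_next_entry(page, lru);

	return rc;
}

and the call site would pass &page2.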

> +
> + return rc;
> +}
> +
>  /*
>   * migrate_pages - migrate the pages specified in a list, to the free pages
>   *  supplied as the target for the page migration
> @@ -1445,24 +1459,40 @@ int migrate_pages(struct list_head *from, new_page_t 
> get_new_page,
>   reason, _pages);
>
>   switch(rc) {
> + /*
> +  * THP migration might be unsupported or the
> +  * allocation could've failed so we should
> +  * retry on the same page with the THP split
> +  * to base pages.
> +  *
> +  * Head page is retried immediately and tail
> +  * pages are added to the tail of the list so
> +  * we encounter them after the rest of the list
> +  * is processed.
> +  */
> + case -ENOSYS:
> + /* THP migration is unsupported */
> + if (is_thp) {
> + if (!try_split_thp(page, page2, from)) {
> + nr_thp_split++;
> + goto retry;
> + }
> +
> + nr_thp_failed++;
> + nr_failed += nr_subpages;
> + break;
> + }
> +
> + /* Hugetlb migration is unsupported */
> + nr_failed++;
> + break;
>   case -ENOMEM:
>   /*
> -  * THP migration might be unsupported or the
> -  * allocation could've failed so we should
> -  * retry on the same page with the THP split
> -  * to base pages.
> -  *
> -  * Head page is retried immediately and tail
> -  * pages are added to the tail of the list so
> -  * we encounter them after the rest of the list
> -  * is processed.
> +  * When memory is low, don't bother to try to 
> migrate
> +  * other pages, just exit.

The comment does not match the code below. For THPs, the code tries to split 
the THP
and migrate the base pages if the split is successful.
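
Maybe something along these lines would match the new behaviour (the wording is
only a suggestion):

			/*
			 * When memory is low, try to split a THP and retry
			 * the migration with its base pages; if the split
			 * fails (or the page is not a THP), don't bother
			 * with the rest of the list and just exit.
			 */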

>*/
>   if (is_thp) {
> - lock_page(page);
> - rc = split_huge_page_to_list(page, 
> from);
> - unlock_page(page);
> - if (!rc) {
> - 

Re: [PATCH 4/5] mm: migrate: clean up migrate_prep{_local}

2020-11-06 Thread Zi Yan
On 3 Nov 2020, at 8:03, Yang Shi wrote:

> The migrate_prep{_local} never fails, so it is pointless to have return
> value and check the return value.
>
> Signed-off-by: Yang Shi 
> ---
>  include/linux/migrate.h | 4 ++--
>  mm/mempolicy.c  | 8 ++--
>  mm/migrate.c| 8 ++--
>  3 files changed, 6 insertions(+), 14 deletions(-)
>

LGTM. Thanks. Reviewed-by: Zi Yan 


—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


Re: [PATCH 2/5] mm: migrate: simplify the logic for handling permanent failure

2020-11-06 Thread Zi Yan
On 3 Nov 2020, at 8:03, Yang Shi wrote:

> When unmap_and_move{_huge_page}() returns !-EAGAIN and !MIGRATEPAGE_SUCCESS,
> the page would be put back to LRU or proper list if it is non-LRU movable
> page.  But, the callers always call putback_movable_pages() to put the
> failed pages back later on, so it seems not very efficient to put every
> single page back immediately, and the code looks convoluted.
>
> Put the failed page on a separate list, then splice the list to migrate
> list when all pages are tried.  It is the caller's responsibility to
> call putback_movable_pages() to handle failures.  This also makes the
> code simpler and more readable.
>
> After the change the rules are:
> * Success: non hugetlb page will be freed, hugetlb page will be put
>back
> * -EAGAIN: stay on the from list
> * -ENOMEM: stay on the from list
> * Other errno: put on ret_pages list then splice to from list

Can you put this before the switch statement in migrate_pages()? That would
be very helpful for understanding the code.
>
> The from list would be empty iff all pages are migrated successfully, it

s/iff/if unless you really mean if and only if. :)


Everything else looks good to me. Thanks for making the code cleaner.
With the changes above, you can add Reviewed-by: Zi Yan .

> was not so before.  This has no impact to current existing callsites.
>
> Signed-off-by: Yang Shi 
> ---
>  mm/migrate.c | 58 ++--
>  1 file changed, 29 insertions(+), 29 deletions(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8a2e7e19e27b..c33c92495ead 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1169,7 +1169,8 @@ static int unmap_and_move(new_page_t get_new_page,
>  free_page_t put_new_page,
>  unsigned long private, struct page *page,
>  int force, enum migrate_mode mode,
> -enum migrate_reason reason)
> +enum migrate_reason reason,
> +struct list_head *ret)
>  {
>   int rc = MIGRATEPAGE_SUCCESS;
>   struct page *newpage = NULL;
> @@ -1206,7 +1207,14 @@ static int unmap_and_move(new_page_t get_new_page,
>* migrated will have kept its references and be restored.
>*/
>   list_del(&page->lru);
> + }
>
> + /*
> +  * If migration is successful, releases reference grabbed during
> +  * isolation. Otherwise, restore the page to right list unless
> +  * we want to retry.
> +  */
> + if (rc == MIGRATEPAGE_SUCCESS) {
>   /*
>* Compaction can migrate also non-LRU pages which are
>* not accounted to NR_ISOLATED_*. They can be recognized
> @@ -1215,35 +1223,16 @@ static int unmap_and_move(new_page_t get_new_page,
>   if (likely(!__PageMovable(page)))
>   mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
>   page_is_file_lru(page), 
> -thp_nr_pages(page));
> - }
>
> - /*
> -  * If migration is successful, releases reference grabbed during
> -  * isolation. Otherwise, restore the page to right list unless
> -  * we want to retry.
> -  */
> - if (rc == MIGRATEPAGE_SUCCESS) {
>   if (reason != MR_MEMORY_FAILURE)
>   /*
>* We release the page in page_handle_poison.
>*/
>   put_page(page);
>   } else {
> - if (rc != -EAGAIN) {
> - if (likely(!__PageMovable(page))) {
> - putback_lru_page(page);
> - goto put_new;
> - }
> + if (rc != -EAGAIN)
> + list_add_tail(&page->lru, ret);
>
> - lock_page(page);
> - if (PageMovable(page))
> - putback_movable_page(page);
> - else
> - __ClearPageIsolated(page);
> - unlock_page(page);
> - put_page(page);
> - }
> -put_new:
>   if (put_new_page)
>   put_new_page(newpage, private);
>   else
> @@ -1274,7 +1263,8 @@ static int unmap_and_move(new_page_t get_new_page,
>  static int unmap_and_move_huge_page(new_page_t get_new_page,
>   free_page_t put_new_page, unsigned long private,
>   struct page *hpage, int force,
> -   

Re: [PATCH] mm/compaction: count pages and stop correctly during page isolation.

2020-10-30 Thread Zi Yan
On 30 Oct 2020, at 14:33, Yang Shi wrote:

> On Fri, Oct 30, 2020 at 6:36 AM Michal Hocko  wrote:
>>
>> On Fri 30-10-20 08:20:50, Zi Yan wrote:
>>> On 30 Oct 2020, at 5:43, Michal Hocko wrote:
>>>
>>>> [Cc Vlastimil]
>>>>
>>>> On Thu 29-10-20 16:04:35, Zi Yan wrote:
>>>>> From: Zi Yan 
>>>>>
>>>>> In isolate_migratepages_block, when cc->alloc_contig is true, we are
>>>>> able to isolate compound pages, nr_migratepages and nr_isolated did not
>>>>> count compound pages correctly, causing us to isolate more pages than we
>>>>> thought. Use thp_nr_pages to count pages. Otherwise, we might be trapped
>>>>> in too_many_isolated while loop, since the actual isolated pages can go
>>>>> up to COMPACT_CLUSTER_MAX*512=16384, where COMPACT_CLUSTER_MAX is 32,
>>>>> since we stop isolation after cc->nr_migratepages reaches to
>>>>> COMPACT_CLUSTER_MAX.
>>>>>
>>>>> In addition, after we fix the issue above, cc->nr_migratepages could
>>>>> never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
>>>>> thus page isolation could not stop as we intended. Change the isolation
>>>>> stop condition to >=.
>>>>>
>>>>> Signed-off-by: Zi Yan 
>>>>> ---
>>>>>  mm/compaction.c | 8 
>>>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>>> index ee1f8439369e..0683a4999581 100644
>>>>> --- a/mm/compaction.c
>>>>> +++ b/mm/compaction.c
>>>>> @@ -1012,8 +1012,8 @@ isolate_migratepages_block(struct compact_control 
>>>>> *cc, unsigned long low_pfn,
>>>>>
>>>>>  isolate_success:
>>>>>list_add(&page->lru, &cc->migratepages);
>>>>> -  cc->nr_migratepages++;
>>>>> -  nr_isolated++;
>>>>> +  cc->nr_migratepages += thp_nr_pages(page);
>>>>> +  nr_isolated += thp_nr_pages(page);
>>>>
>>>> Does thp_nr_pages work for __PageMovable pages?
>>>
>>> Yes. It is the same as compound_nr() but compiled
>>> to 1 when THP is not enabled.
>>
>> I am sorry but I do not follow. First of all the implementation of the
>> two is different and also I was asking about __PageMovable which should
>> never be THP IIRC. Can they be compound though?
>
> I have the same question, can they be compound? If they can be
> compound, PageTransHuge() can't tell from THP and compound movable
> page, right?

Right. I have updated the patch to use compound_nr() instead.
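
For intuition, here is a toy userspace model of the counting problem the patch
fixes (COMPACT_CLUSTER_MAX = 32 and 512 base pages per PMD THP are taken from
the patch description; nothing below is kernel code):

#include <stdio.h>

#define COMPACT_CLUSTER_MAX 32
#define PAGES_PER_PMD_THP   512

int main(void)
{
	unsigned long count = 0, base_pages = 0;

	/* pre-fix: each isolated compound page bumps the counter by one */
	while (count < COMPACT_CLUSTER_MAX) {
		base_pages += PAGES_PER_PMD_THP;
		count += 1;
	}
	printf("counting 1 per page:             stop after %lu base pages\n",
	       base_pages);

	count = 0;
	base_pages = 0;
	/* post-fix: count compound_nr() base pages per isolated page */
	while (count < COMPACT_CLUSTER_MAX) {
		base_pages += PAGES_PER_PMD_THP;
		count += PAGES_PER_PMD_THP;
	}
	printf("counting compound_nr() per page: stop after %lu base pages\n",
	       base_pages);
	return 0;
}

It also shows why the stop condition has to become >=: with compound counting,
cc->nr_migratepages jumps straight past COMPACT_CLUSTER_MAX and never equals it.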

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


[PATCH v3 1/2] mm/compaction: count pages and stop correctly during page isolation.

2020-10-30 Thread Zi Yan
From: Zi Yan 

In isolate_migratepages_block, when cc->alloc_contig is true, we are
able to isolate compound pages, nr_migratepages and nr_isolated did not
count compound pages correctly, causing us to isolate more pages than we
thought. Count compound pages as the number of base pages they contain.
Otherwise, we might be trapped in too_many_isolated while loop,
since the actual isolated pages can go up to
COMPACT_CLUSTER_MAX*512=16384, where COMPACT_CLUSTER_MAX is 32,
since we stop isolation after cc->nr_migratepages reaches to
COMPACT_CLUSTER_MAX.

In addition, after we fix the issue above, cc->nr_migratepages could
never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
thus page isolation could not stop as we intended. Change the isolation
stop condition to >=.

The issue can be triggered as follows:
In a system with 16GB memory and an 8GB CMA region reserved by
hugetlb_cma, if we first allocate 10GB THPs and mlock them
(so some THPs are allocated in the CMA region and mlocked), reserving
6 1GB hugetlb pages via
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will get stuck
(looping in too_many_isolated function) until we kill either task.
With the patch applied, oom will kill the application with 10GB THPs and
let hugetlb page reservation finish.

Fixes: 1da2f328fa64 (“mm,thp,compaction,cma: allow THP migration for CMA 
allocations”)
Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
Cc: 
---
 mm/compaction.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ee1f8439369e..3e834ac402f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1012,8 +1012,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
-   cc->nr_migratepages++;
-   nr_isolated++;
+   cc->nr_migratepages += compound_nr(page);
+   nr_isolated += compound_nr(page);
 
/*
 * Avoid isolating too much unless this block is being
@@ -1021,7 +1021,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 * or a lock is contended. For contention, isolate quickly to
 * potentially remove one source of contention.
 */
-   if (cc->nr_migratepages == COMPACT_CLUSTER_MAX &&
+   if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
!cc->rescan && !cc->contended) {
++low_pfn;
break;
@@ -1132,7 +1132,7 @@ isolate_migratepages_range(struct compact_control *cc, 
unsigned long start_pfn,
if (!pfn)
break;
 
-   if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+   if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
break;
}
 
-- 
2.28.0



[PATCH v3 2/2] mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate.

2020-10-30 Thread Zi Yan
From: Zi Yan 

In isolate_migratepages_block, if we have too many isolated pages and
nr_migratepages is not zero, we should try to migrate what we have
without wasting time on isolating.

Fixes: 1da2f328fa64 (“mm,thp,compaction,cma: allow THP migration for CMA 
allocations”)
Suggested-by: Vlastimil Babka 
Signed-off-by: Zi Yan 
Cc: 
---
 mm/compaction.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 3e834ac402f1..4d237a7c3830 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -817,6 +817,10 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 * delay for some time until fewer pages are isolated
 */
while (unlikely(too_many_isolated(pgdat))) {
+   /* stop isolation if there are still pages not migrated */
+   if (cc->nr_migratepages)
+   return 0;
+
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
return 0;
-- 
2.28.0



Re: [PATCH v2 1/2] mm/compaction: count pages and stop correctly during page isolation.

2020-10-30 Thread Zi Yan
On 30 Oct 2020, at 14:12, Matthew Wilcox wrote:

> On Fri, Oct 30, 2020 at 11:57:15AM -0400, Zi Yan wrote:
>> In isolate_migratepages_block, when cc->alloc_contig is true, we are
>> able to isolate compound pages, nr_migratepages and nr_isolated did not
>> count compound pages correctly, causing us to isolate more pages than we
>> thought. Use thp_nr_pages to count pages. Otherwise, we might be trapped
>
> Maybe replace that sentence with "Count compound pages as the number of
> base pages they contain"?

Sure. In fact, compound_nr() is now used instead of thp_nr_pages().

OK. V3 is coming.

—
Best Regards,
Yan Zi


signature.asc
Description: OpenPGP digital signature


[PATCH v2 1/2] mm/compaction: count pages and stop correctly during page isolation.

2020-10-30 Thread Zi Yan
From: Zi Yan 

In isolate_migratepages_block, when cc->alloc_contig is true, we are
able to isolate compound pages, nr_migratepages and nr_isolated did not
count compound pages correctly, causing us to isolate more pages than we
thought. Use thp_nr_pages to count pages. Otherwise, we might be trapped
in too_many_isolated while loop, since the actual isolated pages can go
up to COMPACT_CLUSTER_MAX*512=16384, where COMPACT_CLUSTER_MAX is 32,
since we stop isolation after cc->nr_migratepages reaches to
COMPACT_CLUSTER_MAX.

In addition, after we fix the issue above, cc->nr_migratepages could
never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
thus page isolation could not stop as we intended. Change the isolation
stop condition to >=.

The issue can be triggered as follows:
In a system with 16GB memory and an 8GB CMA region reserved by
hugetlb_cma, if we first allocate 10GB THPs and mlock them
(so some THPs are allocated in the CMA region and mlocked), reserving
6 1GB hugetlb pages via
/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will get stuck
(looping in too_many_isolated function) until we kill either task.
With the patch applied, oom will kill the application with 10GB THPs and
let hugetlb page reservation finish.

Fixes: 1da2f328fa64 (“mm,thp,compaction,cma: allow THP migration for CMA 
allocations”)
Signed-off-by: Zi Yan 
Reviewed-by: Yang Shi 
Cc: 
---
 mm/compaction.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ee1f8439369e..3e834ac402f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1012,8 +1012,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
-   cc->nr_migratepages++;
-   nr_isolated++;
+   cc->nr_migratepages += compound_nr(page);
+   nr_isolated += compound_nr(page);
 
/*
 * Avoid isolating too much unless this block is being
@@ -1021,7 +1021,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 * or a lock is contended. For contention, isolate quickly to
 * potentially remove one source of contention.
 */
-   if (cc->nr_migratepages == COMPACT_CLUSTER_MAX &&
+   if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
!cc->rescan && !cc->contended) {
++low_pfn;
break;
@@ -1132,7 +1132,7 @@ isolate_migratepages_range(struct compact_control *cc, 
unsigned long start_pfn,
if (!pfn)
break;
 
-   if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+   if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX)
break;
}
 
-- 
2.28.0


