Re: [PATCH] lib/test_hmm.c: fix harmless shift wrapping bug

2021-03-16 Thread Ralph Campbell



On 3/16/21 4:43 AM, Dan Carpenter wrote:

The "cmd.npages" variable is a u64 that comes from the user.  I noticed
during review that it could have a shift wrapping bug when it is used
in the integer overflow test on the next line.

It turns out this is harmless.  The users all do:

unsigned long size = cmd->npages << PAGE_SHIFT;

and after that "size" is used consistently and "cmd->npages" is never
used again.  So even when there is an integer overflow, everything works
fine.

Even though this is harmless, I believe syzbot will complain and fixing
it makes the code easier to read.

Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM")
Signed-off-by: Dan Carpenter 
---
  lib/test_hmm.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..541466034a6b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -930,6 +930,8 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
  
 	if (cmd.addr & ~PAGE_MASK)
 		return -EINVAL;
+	if (cmd.npages > ULONG_MAX >> PAGE_SHIFT)
+		return -EINVAL;
 	if (cmd.addr >= (cmd.addr + (cmd.npages << PAGE_SHIFT)))
 		return -EINVAL;
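
For anyone curious, a minimal illustration of the wrap the new check rejects
(assuming a 64-bit unsigned long and 4K pages; the values are made up):

	u64 npages = (1ULL << 52) + 1;	/* now rejected: > ULONG_MAX >> PAGE_SHIFT */
	unsigned long size = npages << PAGE_SHIFT;	/* silently wraps to 0x1000 */

After the wrap, "cmd.addr >= cmd.addr + size" stays false, so the old test on
the next line never caught values like this.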


Looks good to me too. Thanks for sending this.
Reviewed-by: Ralph Campbell 


Re: [PATCH v4 5/8] mm: Device exclusive memory access

2021-03-08 Thread Ralph Campbell



On 3/3/21 10:16 PM, Alistair Popple wrote:

Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Signed-off-by: Alistair Popple 


I see in the next two patches how make_device_exclusive_entry() and
check_device_exclusive_range() are used. This points out a problem similar
to the one migrate_vma_setup() had before I added the
mmu_notifier_range_init_migrate() helper, which passes a cookie from
migrate_vma_setup() to the invalidation callback so the device driver can
ignore the invalidation triggered by the caller itself, avoiding either a
deadlock or needlessly invalidating device PTEs that aren't being migrated.

I think you can eliminate the need for check_device_exclusive_range() in
the same way by adding a "void *" pointer to make_device_exclusive_entry()
and passing that through to try_to_protect(), setting rmap_walk_control rwc.arg
and then passing arg to mmu_notifier_range_init_migrate().
Although, maybe it would be better to define a new
mmu_notifier_range_init_exclusive() and event type MMU_NOTIFY_EXCLUSIVE so
that a device driver can revoke atomic/exclusive access but keep read/write
access to other parts of the page.
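
Whichever variant is used, the driver side would then look roughly like the
sketch below (MMU_NOTIFY_EXCLUSIVE and reusing the migrate_pgmap_owner field
as the cookie are the proposal above, not something in the tree; struct
drv_svmm is just an illustrative driver type):

	struct drv_svmm {
		struct mmu_interval_notifier notifier;
		/* ... device private state ... */
	};

	static bool drv_svmm_invalidate(struct mmu_interval_notifier *mni,
					const struct mmu_notifier_range *range,
					unsigned long cur_seq)
	{
		struct drv_svmm *svmm = container_of(mni, struct drv_svmm, notifier);

		/* Ignore the invalidation triggered by our own
		 * make_device_exclusive_entry() call; everything else revokes
		 * the device's access to the range.
		 */
		if (range->event == MMU_NOTIFY_EXCLUSIVE &&
		    range->migrate_pgmap_owner == svmm)
			return true;

		mmu_interval_set_seq(mni, cur_seq);
		/* ... invalidate device PTEs for [range->start, range->end) ... */
		return true;
	}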

I thought about how make_device_exclusive_entry() is similar to 
hmm_range_fault()
and whether it would be possible to add a new HMM_PFN_REQ_EXCLUSIVE flag but I
see that make_device_exclusive_entry() returns the pages locked and with an
additional get_page() reference. This doesn't fit well with the other
hmm_range_fault() entries being returned as a "snapshot" so having a different
API makes sense. I think it would be useful to add a HMM_PFN_EXCLUSIVE flag so
that snapshots of the page tables can at least report that a page is exclusively
being accessed by *some* device. Unfortunately, there is no pgmap pointer to be
able to tell which device has exclusive access (since any struct page could be
exclusively accessed, not just device private ones).


Re: [PATCH v4 4/8] mm/rmap: Split migration into its own function

2021-03-08 Thread Ralph Campbell



On 3/3/21 10:16 PM, Alistair Popple wrote:

Migration is currently implemented as a mode of operation for
try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

However it does not have much in common with the rest of the unmap
functionality of try_to_unmap_one() and thus splitting it into a
separate function reduces the complexity of try_to_unmap_one() making it
more readable.

Several simplifications can also be made in try_to_migrate_one() based
on the following observations:

  - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
  - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
  - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

TTU_SPLIT_FREEZE is a special case of migration used when splitting an
anonymous page. This is most easily dealt with by calling the correct
function from unmap_page() in mm/huge_memory.c  - either
try_to_migrate() for PageAnon or try_to_unmap().

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 


Looks reasonable to me. I do worry a bit about code duplication.
At some point in the future, other combinations of TTU_XXX flags may turn
out to be needed, in which case both try_to_migrate() and try_to_unmap()
will need a careful check.

Reviewed-by: Ralph Campbell 


Re: [PATCH v4 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-08 Thread Ralph Campbell



On 3/3/21 10:16 PM, Alistair Popple wrote:

The behaviour of try_to_unmap_one() is difficult to follow because it
performs different operations based on a fairly large set of flags used
in different combinations.

TTU_MUNLOCK is one such flag. However it is exclusively used by
try_to_munlock() which specifies no other flags. Therefore rather than
overload try_to_unmap_one() with unrelated behaviour split this out into
its own function and remove the flag.

Signed-off-by: Alistair Popple 


Looks good to me.
Reviewed-by: Ralph Campbell 


Re: [PATCH v4 2/8] mm/swapops: Rework swap entry manipulation code

2021-03-08 Thread Ralph Campbell



On 3/3/21 10:16 PM, Alistair Popple wrote:

Both migration and device private pages use special swap entries that
are manipulated by a range of inline functions. The arguments to these
are somewhat inconsistent so rework them to remove flag type arguments
and to make the arguments similar for both read and write entry
creation.

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jason Gunthorpe 


Looks good to me too.
Reviewed-by: Ralph Campbell 


Re: [PATCH v4 1/8] mm: Remove special swap entry functions

2021-03-08 Thread Ralph Campbell



On 3/3/21 10:16 PM, Alistair Popple wrote:

Remove the migration and device private entry_to_page() and
entry_to_pfn() inline functions and instead open code them directly.
This results in shorter code which is easier to understand.

Signed-off-by: Alistair Popple 


Looks OK to me.
Reviewed-by: Ralph Campbell 


---

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
   builds
---
  arch/s390/mm/pgtable.c  |  2 +-
  fs/proc/task_mmu.c  | 23 +-
  include/linux/swap.h|  4 +--
  include/linux/swapops.h | 69 ++---
  mm/hmm.c|  5 ++-
  mm/huge_memory.c|  4 +--
  mm/memcontrol.c |  2 +-
  mm/memory.c | 10 +++---
  mm/migrate.c|  6 ++--
  mm/page_vma_mapped.c|  6 ++--
  10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..aae001096c46 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, 
swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
-   struct page *page = migration_entry_to_page(entry);
+	struct page *page = pfn_swap_entry_to_page(entry);
  
  		dec_mm_counter(mm, mm_counter(page));

}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 602e3a52884d..7c75af0fc423 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
-   } else if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   } else if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);
  
  		if (is_migration_entry(entry))

-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -691,10 +689,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
  
-		if (is_migration_entry(swpent))

-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1382,11 +1378,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct 
pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
-   if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
-
-   if (is_device_private_entry(entry))
-   page = device_private_entry_to_page(entry);
+   if (is_pfn_swap_entry(entry))
+   page = pfn_swap_entry_to_page(entry);
}
  
  	if (page && !PageAnon(page))

@@ -1443,7 +1436,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long 
addr, unsigned long end,
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
  #endif
  
diff --git a/include/linux/swap.h b/include/linux/swap.h

index 596bc2f4d9b0..57a7690966a4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -519,8 +519,8 @@ static inline void show_swap_cache_info(void)
  {
  }
  
-#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})

-#define swapcache_prepare(e) ({(is_migration_entry(e) || 
is_device_private_entry(e)

RE: [PATCH v3 6/8] mm: Selftests for exclusive device memory

2021-03-01 Thread Ralph Campbell
> From: Alistair Popple 
> Sent: Thursday, February 25, 2021 11:19 PM
> To: linux...@kvack.org; nouv...@lists.freedesktop.org;
> bske...@redhat.com; a...@linux-foundation.org
> Cc: linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; dri-
> de...@lists.freedesktop.org; John Hubbard ; Ralph
> Campbell ; jgli...@redhat.com; Jason Gunthorpe
> ; h...@infradead.org; dan...@ffwll.ch; Alistair Popple
> 
> Subject: [PATCH v3 6/8] mm: Selftests for exclusive device memory
> 
> Adds some selftests for exclusive device memory.
> 
> Signed-off-by: Alistair Popple 

One minor nit below, but you can add
Tested-by: Ralph Campbell 
Reviewed-by: Ralph Campbell 

> +static int dmirror_exclusive(struct dmirror *dmirror,
> +  struct hmm_dmirror_cmd *cmd)
> +{
> + unsigned long start, end, addr;
> + unsigned long size = cmd->npages << PAGE_SHIFT;
> + struct mm_struct *mm = dmirror->notifier.mm;
> + struct page *pages[64];
> + struct dmirror_bounce bounce;
> + unsigned long next;
> + int ret;
> +
> + start = cmd->addr;
> + end = start + size;
> + if (end < start)
> + return -EINVAL;
> +
> + /* Since the mm is for the mirrored process, get a reference first. */
> + if (!mmget_not_zero(mm))
> + return -EINVAL;
> +
> + mmap_read_lock(mm);
> + for (addr = start; addr < end; addr = next) {
> + int i, mapped;
> +
> + if (end < addr + (64 << PAGE_SHIFT))
> + next = end;
> + else
> + next = addr + (64 << PAGE_SHIFT);

I suggest using ARRAY_SIZE(pages) instead of '64' to make the meaning clear.
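
i.e., something like this (a sketch of the suggested change, keeping the
existing if/else structure):

	if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT))
		next = end;
	else
		next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT);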



RE: [PATCH v3 5/8] mm: Device exclusive memory access

2021-03-01 Thread Ralph Campbell
> From: Alistair Popple 
> Sent: Thursday, February 25, 2021 11:18 PM
> To: linux...@kvack.org; nouv...@lists.freedesktop.org;
> bske...@redhat.com; a...@linux-foundation.org
> Cc: linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; dri-
> de...@lists.freedesktop.org; John Hubbard ; Ralph
> Campbell ; jgli...@redhat.com; Jason Gunthorpe
> ; h...@infradead.org; dan...@ffwll.ch; Alistair Popple
> 
> Subject: [PATCH v3 5/8] mm: Device exclusive memory access
> 
> Some devices require exclusive write access to shared virtual memory (SVM)
> ranges to perform atomic operations on that memory. This requires CPU page
> tables to be updated to deny access whilst atomic operations are occurring.
> 
> In order to do this introduce a new swap entry type (SWP_DEVICE_EXCLUSIVE).
> When a SVM range needs to be marked for exclusive access by a device all page
> table mappings for the particular range are replaced with device exclusive 
> swap
> entries. This causes any CPU access to the page to result in a fault.
> 
> Faults are resolved by replacing the faulting entry with the original 
> mapping. This
> results in MMU notifiers being called which a driver uses to update access
> permissions such as revoking atomic access. After notifiers have been called 
> the
> device will no longer have exclusive access to the region.
> 
> Signed-off-by: Alistair Popple 
> ---
>  Documentation/vm/hmm.rst |  15 
>  include/linux/rmap.h |   3 +
>  include/linux/swap.h |   4 +-
>  include/linux/swapops.h  |  44 ++-
>  mm/hmm.c |   5 ++
>  mm/memory.c  | 108 +-
>  mm/mprotect.c|   8 ++
>  mm/page_vma_mapped.c |   9 ++-
>  mm/rmap.c| 163 +++
>  9 files changed, 352 insertions(+), 7 deletions(-)
...
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct page **pages) {
> + long npages = (end - start) >> PAGE_SHIFT;
> + long i;

Nit: you should use unsigned long for 'i' and 'npages' to match start/end.
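
i.e., a sketch of the suggested declarations:

	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long i;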


RE: [PATCH v3 6/8] mm: Selftests for exclusive device memory

2021-03-01 Thread Ralph Campbell

> From: Jason Gunthorpe 
> Sent: Monday, March 1, 2021 9:56 AM
> To: Alistair Popple 
> Cc: linux...@kvack.org; nouv...@lists.freedesktop.org;
> bske...@redhat.com; a...@linux-foundation.org; linux-...@vger.kernel.org;
> linux-kernel@vger.kernel.org; dri-de...@lists.freedesktop.org; John Hubbard
> ; Ralph Campbell ;
> jgli...@redhat.com; h...@infradead.org; dan...@ffwll.ch
> Subject: Re: [PATCH v3 6/8] mm: Selftests for exclusive device memory
> 
> On Fri, Feb 26, 2021 at 06:18:30PM +1100, Alistair Popple wrote:
> > Adds some selftests for exclusive device memory.
> >
> > Signed-off-by: Alistair Popple 
> > ---
> >  lib/test_hmm.c | 124 ++
> >  lib/test_hmm_uapi.h|   2 +
> >  tools/testing/selftests/vm/hmm-tests.c | 219 +
> >  3 files changed, 345 insertions(+)
> 
> Please get Ralph to review this, otherwise:
> 
> Acked-by: Jason Gunthorpe 
> 
> Jason

I'm working on it. Thanks for the encouragement.



Re: [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.

2020-11-12 Thread Ralph Campbell



On 11/11/20 12:40 PM, Zi Yan wrote:

From: Zi Yan 

Huge pages in the process with the given pid and virtual address range
are split. This is used to test the split huge page function. In addition,
a testing program is added to tools/testing/selftests/vm to utilize the
interface by splitting PMD THPs.

Signed-off-by: Zi Yan 
---
  mm/huge_memory.c  |  98 +++
  mm/internal.h |   1 +
  mm/migrate.c  |   2 +-
  tools/testing/selftests/vm/Makefile   |   1 +
  .../selftests/vm/split_huge_page_test.c   | 161 ++
  5 files changed, 262 insertions(+), 1 deletion(-)
  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c


Don't forget to update ".gitignore" to include "split_huge_page_test".
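
For reference, that is a one line addition (assuming the existing
tools/testing/selftests/vm/.gitignore file):

	split_huge_page_test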


Re: [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible.

2020-11-12 Thread Ralph Campbell



On 11/11/20 12:40 PM, Zi Yan wrote:

From: Zi Yan 

To minimize the number of pages after a truncation, when truncating a
THP, we do not need to split it all the way down to order-0. The THP has
at most three parts, the part before offset, the part to be truncated,
the part left at the end. Use the non-zero minimum of them to decide
what order we split the THP to.

Signed-off-by: Zi Yan 
---
  mm/truncate.c | 22 --
  1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 20bd17538ec2..6d8e3c6115bc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, 
struct page *page)
  bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
  {
loff_t pos = page_offset(page);
-   unsigned int offset, length;
+   unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;


Maybe use "remaining" instead of "left" since I think of the latter as the
length of the left side (offset).
 

if (pos < start)
offset = start - pos;
@@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t 
start, loff_t end)
length = length - offset;
else
length = end + 1 - pos - offset;
+   left = thp_size(page) - offset - length;
  
  	wait_on_page_writeback(page);

if (length == thp_size(page)) {
@@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t 
start, loff_t end)
do_invalidatepage(page, offset, length);
if (!PageTransHuge(page))
return true;
-   return split_huge_page(page) == 0;
+
+   /*
+* find the non-zero minimum of offset, length, and left and use it to
+* decide the new order of the page after split
+*/
+   if (offset && left)
+   min_subpage_size = min_t(unsigned int,
+min_t(unsigned int, offset, length),
+left);
+   else if (!offset)
+   min_subpage_size = min_t(unsigned int, length, left);
+   else /* !left */
+   min_subpage_size = min_t(unsigned int, length, offset);
+
+   min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
+
+   return split_huge_page_to_list_to_order(page, NULL,
+   ilog2(min_subpage_size/PAGE_SIZE)) == 0;
  }


What if "min_subpage_size" is 1/2 the THP but offset isn't aligned to 1/2?
Splitting the page in half wouldn't result in a page that could be freed
but maybe splitting to 1/4 would (assuming the THP is at least 8x PAGE_SIZE).
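For example (illustrative numbers, 4K base pages): truncate an order-3 THP
(8 pages) starting at its fourth page, so offset is 3 pages, length is 5 and
left is 0. min_subpage_size works out to 3 pages, which ilog2() rounds down
to an order-1 split, but the order-1 chunk covering pages 2-3 holds both kept
and truncated data, so only pages 4-7 can actually be freed.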


Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.

2020-11-12 Thread Ralph Campbell



On 11/11/20 12:40 PM, Zi Yan wrote:

From: Zi Yan 

To split a THP to any lower order pages, we need to reform THPs on
subpages at given order and add page refcount based on the new page
order. Also we need to reinitialize page_deferred_list after removing
the page from the split_queue, otherwise a subsequent split will see
list corruption when checking the page_deferred_list again.

It has many uses, like minimizing the number of pages after
truncating a pagecache THP. For anonymous THPs, we can only split them
to order-0 like before until we add support for any size anonymous THPs.

Signed-off-by: Zi Yan 
---
  include/linux/huge_mm.h |  8 +
  mm/huge_memory.c| 78 +
  mm/swap.c   |  1 -
  3 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 60a907a19f7d..9819cd9b4619 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
  
  bool can_split_huge_page(struct page *page, int *pextra_pins);

  int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+   unsigned int new_order);
  static inline int split_huge_page(struct page *page)
  {
return split_huge_page_to_list(page, NULL);
@@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct 
list_head *list)
  {
return 0;
  }
+static inline int
+split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
+   unsigned int new_order)
+{
+   return 0;
+}
  static inline int split_huge_page(struct page *page)
  {
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8b7d771ee962..88f50da40c9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
  static void unmap_page(struct page *page)
  {
enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
-   TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+   TTU_RMAP_LOCKED;
bool unmap_success;
  
  	VM_BUG_ON_PAGE(!PageHead(page), page);
  
+	if (thp_order(page) >= HPAGE_PMD_ORDER)

+   ttu_flags |= TTU_SPLIT_HUGE_PMD;
+
if (PageAnon(page))
ttu_flags |= TTU_SPLIT_FREEZE;
  
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)

VM_BUG_ON_PAGE(!unmap_success, page);
  }
  
-static void remap_page(struct page *page, unsigned int nr)

+static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
  {
int i;


Use unsigned int i?
Maybe a blank line here and the {}'s around if/else aren't needed.


-   if (PageTransHuge(page)) {
+   if (thp_nr_pages(page) == nr) {
remove_migration_ptes(page, page, true);
} else {
-   for (i = 0; i < nr; i++)
+   for (i = 0; i < nr; i += new_nr)
remove_migration_ptes(page + i, page + i, true);
}
  }
  
  static void __split_huge_page_tail(struct page *head, int tail,

-   struct lruvec *lruvec, struct list_head *list)
+   struct lruvec *lruvec, struct list_head *list, unsigned int 
new_order)
  {
struct page *page_tail = head + tail;
+   unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
  
	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
  
@@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail,

  #ifdef CONFIG_64BIT
 (1L << PG_arch_2) |
  #endif
+compound_head_flag |
 (1L << PG_dirty)));
  
  	/* ->mapping in first tail page is compound_mapcount */

@@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, 
int tail,
 * which needs correct compound_head().
 */
clear_compound_head(page_tail);
+   if (new_order) {
+   prep_compound_page(page_tail, new_order);
+   thp_prep(page_tail);
+   }
  
  	/* Finally unfreeze refcount. Additional reference from page cache. */

-   page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
- PageSwapCache(head)));
+   page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
+  PageSwapCache(head)) ?
+   thp_nr_pages(page_tail) : 0));
  
  	if (page_is_young(head))

set_page_young(page_tail);
@@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int 
tail,
  }
  
  static void __split_huge_page(struct page *page, struct list_head *list,

-   pgoff_t end, unsigned long flags)
+   pgoff_t end, unsigned long flags, unsigned int new_order)
  {

Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.

2020-11-12 Thread Ralph Campbell



On 11/11/20 12:40 PM, Zi Yan wrote:

From: Zi Yan 

It reads thp_nr_pages() and splits to the provided new_nr. This prepares for
upcoming changes to support splitting huge pages to any lower order.

Signed-off-by: Zi Yan 


Looks OK to me.
Reviewed-by: Ralph Campbell 


---
  include/linux/memcontrol.h | 5 +++--
  mm/huge_memory.c   | 2 +-
  mm/memcontrol.c| 4 ++--
  3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..b3bac79ceed6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct 
*mm,
  }
  
  #ifdef CONFIG_TRANSPARENT_HUGEPAGE

-void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
  #endif
  
  #else /* CONFIG_MEMCG */

@@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t 
*pgdat, int order,
return 0;
  }
  
-static inline void mem_cgroup_split_huge_fixup(struct page *head)

+static inline void mem_cgroup_split_huge_fixup(struct page *head,
+  unsigned int new_nr)
  {
  }
  
diff --git a/mm/huge_memory.c b/mm/huge_memory.c

index c4fead5ead31..f599f5b9bf7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
lruvec = mem_cgroup_page_lruvec(head, pgdat);
  
  	/* complete memcg works before add pages to LRU */

-   mem_cgroup_split_huge_fixup(head);
+   mem_cgroup_split_huge_fixup(head, 1);
  
  	if (PageAnon(head) && PageSwapCache(head)) {

swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33f632689cee..e9705ba6bbcc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t 
size)
   * Because tail pages are not marked as "used", set it. We're under
   * pgdat->lru_lock and migration entries setup in all page mappings.
   */
-void mem_cgroup_split_huge_fixup(struct page *head)
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
  {
struct mem_cgroup *memcg = page_memcg(head);
int i;
@@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
if (mem_cgroup_disabled())
return;
  
-	for (i = 1; i < thp_nr_pages(head); i++) {

+   for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
		css_get(&memcg->css);
head[i].memcg_data = (unsigned long)memcg;
}



Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.

2020-11-12 Thread Ralph Campbell



On 11/11/20 12:40 PM, Zi Yan wrote:

From: Zi Yan 

It adds a new_order parameter to set the new page order in page owner.
This prepares for upcoming changes to support splitting huge pages to any
lower order.

Signed-off-by: Zi Yan 


Except for a minor fix below, you can add:
Reviewed-by: Ralph Campbell 


---
  include/linux/page_owner.h | 7 ---
  mm/huge_memory.c   | 2 +-
  mm/page_alloc.c| 2 +-
  mm/page_owner.c| 6 +++---
  4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 3468794f83d2..215cbb159568 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
__set_page_owner(page, order, gfp_mask);
  }
  
-static inline void split_page_owner(struct page *page, unsigned int nr)

+static inline void split_page_owner(struct page *page, unsigned int nr,
+   unsigned int new_order)
  {
	if (static_branch_unlikely(&page_owner_inited))
-   __split_page_owner(page, nr);
+   __split_page_owner(page, nr, new_order);
  }
  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
  {
@@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
  {
  }
  static inline void split_page_owner(struct page *page,
-   unsigned int order)
+   unsigned int nr, unsigned int new_order)
  {
  }
  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f599f5b9bf7f..8b7d771ee962 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
  
  	ClearPageCompound(head);
  
-	split_page_owner(head, nr);

+   split_page_owner(head, nr, 1);


Shouldn't this be 0, not 1?
(new_order not new_nr).
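
In other words, since the third argument is an order rather than a page
count, a full split would be (sketch of the corrected calls):

	split_page_owner(head, nr, 0);		/* __split_huge_page(): to order 0 */
	split_page_owner(page, 1 << order, 0);	/* split_page(): likewise */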


/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..a9eead0e091a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
  
  	for (i = 1; i < (1 << order); i++)

set_page_refcounted(page + i);
-   split_page_owner(page, 1 << order);
+   split_page_owner(page, 1 << order, 1);


Ditto, 0.


  }
  EXPORT_SYMBOL_GPL(split_page);



diff --git a/mm/page_owner.c b/mm/page_owner.c
index b735a8eafcdb..2b7f7e9056dc 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int 
reason)
page_owner->last_migrate_reason = reason;
  }
  
-void __split_page_owner(struct page *page, unsigned int nr)

+void __split_page_owner(struct page *page, unsigned int nr, unsigned int 
new_order)
  {
int i;
struct page_ext *page_ext = lookup_page_ext(page);
@@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
if (unlikely(!page_ext))
return;
  
-	for (i = 0; i < nr; i++) {

+   for (i = 0; i < nr; i += (1 << new_order)) {
page_owner = get_page_owner(page_ext);
-   page_owner->order = 0;
+   page_owner->order = new_order;
page_ext = page_ext_next(page_ext);
}
  }



Re: [PATCH v3 3/6] mm: support THP migration to device private memory

2020-11-11 Thread Ralph Campbell



On 11/9/20 1:14 AM, Christoph Hellwig wrote:

On Fri, Nov 06, 2020 at 01:26:50PM -0800, Ralph Campbell wrote:


On 11/6/20 12:03 AM, Christoph Hellwig wrote:

I hate the extra pin count magic here.  IMHO we really need to finish
off the series to get rid of the extra references on the ZONE_DEVICE
pages first.


First, thanks for the review comments.

I don't like the extra refcount either, that is why I tried to fix that up
before resending this series. However, you didn't like me just fixing the
refcount only for device private pages and I don't know the dax/pmem code
and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
long it will take me to fix all the use cases.
So I wanted to make progress on the THP migration code in the mean time.


I think P2P is pretty trivial, given that ZONE_DEVICE pages are used like
a normal memory allocator.  DAX is the interesting case, any specific
help that you need with that?


There are 4 types of ZONE_DEVICE struct pages:
MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and
MEMORY_DEVICE_PCI_P2PDMA.

Currently, memremap_pages() allocates struct pages for a physical address range
with a page_ref_count(page) of one and increments the pgmap->ref per CPU
reference count by the number of pages created since each ZONE_DEVICE struct
page has a pointer to the pgmap.

The struct pages are not freed until memunmap_pages() is called which
calls put_page() which calls put_dev_pagemap() which releases a reference to
pgmap->ref. memunmap_pages() blocks waiting for pgmap->ref reference count
to be zero. As far as I can tell, the put_page() in memunmap_pages() has to
be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA).
My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that
the struct page reference count can go to zero and back to non-zero without
changing the pgmap->ref reference count.

Q1: Is that safe? Is there some code that depends on put_page() dropping
the pgmap->ref reference count as part of memunmap_pages()?
My testing of [1] seems OK but I'm sure there are lots of cases I didn't test.

MEMORY_DEVICE_PCI_P2PDMA:
Struct pages are created in pci_p2pdma_add_resource() and represent device
memory accessible by PCIe bar address space. Memory is allocated with
pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
call will allocate memory in a minimum of PAGE_SIZE units.
Reference counting is +1 per *allocation* on the pgmap->ref reference count.
Note that this is not +1 per page which is what put_page() expects. So
currently, a get_page()/put_page() works OK because the page reference count
only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
would be incorrect if the allocation size was greater than one page.

I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
There are a huge number of places sg_page() is called so it is hard to tell
whether or not get_page()/put_page() is ever called on MEMORY_DEVICE_PCI_P2PDMA
pages. pci_p2pmem_virt_to_bus() will return the physical address and I guess
pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page.

Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be
modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page
reference count. Or maybe just leave it at one like it is now.

MEMORY_DEVICE_GENERIC:
Struct pages are created in dev_dax_probe() and represent non-volatile memory.
The device can be mmap()'ed which calls dax_mmap() which sets
vma->vm_flags | VM_HUGEPAGE.
A CPU page fault will result in a PTE, PMD, or PUD sized page
(but not compound) to be inserted by vmf_insert_mixed() which will call either
insert_pfn() or insert_page().
Neither insert_pfn() nor insert_page() increments the page reference count.
Invalidations don't callback into the device driver so I don't see how page
reference counts can be tracked without adding a mmu_interval_notifier.

I think just leaving the page reference count at one is better than trying
to use the mmu_interval_notifier or changing vmf_insert_mixed() and
invalidations of pfn_t_devmap(pfn) to adjust the page reference count.

MEMORY_DEVICE_PRIVATE:
This case has the most core mm code having to specially check for
is_device_private_page() and adjusting the expected reference count when the
page isn't mapped by any process. There is a clear allocation and free so it
can be changed to use a reference count of zero while free (see [2]).

MEMORY_DEVICE_FS_DAX:
Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
an initial reference count of one.
The problem I see is that there are 3 states that are important:
a) memory is free and not allocated to any file (page_ref_count() == 0).
b) memory is allocated to a file and in th

Re: [PATCH v3 3/6] mm: support THP migration to device private memory

2020-11-09 Thread Ralph Campbell



On 11/9/20 1:14 AM, Christoph Hellwig wrote:

On Fri, Nov 06, 2020 at 01:26:50PM -0800, Ralph Campbell wrote:


On 11/6/20 12:03 AM, Christoph Hellwig wrote:

I hate the extra pin count magic here.  IMHO we really need to finish
off the series to get rid of the extra references on the ZONE_DEVICE
pages first.


First, thanks for the review comments.

I don't like the extra refcount either, that is why I tried to fix that up
before resending this series. However, you didn't like me just fixing the
refcount only for device private pages and I don't know the dax/pmem code
and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
long it will take me to fix all the use cases.
So I wanted to make progress on the THP migration code in the mean time.


I think P2P is pretty trivial, given that ZONE_DEVICE pages are used like
a normal memory allocator.  DAX is the interesting case, any specific
help that you need with that?


Thanks for the offer. I'm putting a list together... :-)


[PATCH] include/linux/huge_mm.h: remove extern keyword

2020-11-06 Thread Ralph Campbell
The external function declarations don't need the "extern" keyword.
Remove them so future changes don't copy this declaration style.

Signed-off-by: Ralph Campbell 
---

This applies cleanly to linux-mm 5.10.0-rc2 and is for Andrew's tree.

 include/linux/huge_mm.h | 93 ++---
 1 file changed, 41 insertions(+), 52 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0365aa97f8e7..6a19f35f836b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -7,43 +7,37 @@
 
 #include  /* only for vma_is_dax() */
 
-extern vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
-extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
-struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
-extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
-struct vm_area_struct *vma);
+vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *vma);
+void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
+int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+ struct vm_area_struct *vma);
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
 }
 #endif
 
-extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
-extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
- unsigned long addr,
- pmd_t *pmd,
- unsigned int flags);
-extern bool madvise_free_huge_pmd(struct mmu_gather *tlb,
-   struct vm_area_struct *vma,
-   pmd_t *pmd, unsigned long addr, unsigned long next);
-extern int zap_huge_pmd(struct mmu_gather *tlb,
-   struct vm_area_struct *vma,
-   pmd_t *pmd, unsigned long addr);
-extern int zap_huge_pud(struct mmu_gather *tlb,
-   struct vm_area_struct *vma,
-   pud_t *pud, unsigned long addr);
-extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
-unsigned long new_addr,
-pmd_t *old_pmd, pmd_t *new_pmd);
-extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, pgprot_t newprot,
-   unsigned long cp_flags);
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
+struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
+  unsigned long addr, pmd_t *pmd,
+  unsigned int flags);
+bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+  pmd_t *pmd, unsigned long addr, unsigned long next);
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t 
*pmd,
+unsigned long addr);
+int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t 
*pud,
+unsigned long addr);
+bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+  unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
+int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+   pgprot_t newprot, unsigned long cp_flags);
 vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
   pgprot_t pgprot, bool write);
 
@@ -100,13 +94,13 @@ enum transparent_hugepage_flag {
 struct kobject;
 struct kobj_attribute;
 
-extern ssize_t single_hugepage_flag_store(struct kobject *kobj,
-struct kobj_attribute *attr,
-const char *buf, size_t count,
-enum transparent_hugepage_flag flag);
-extern ssize_t single_hugepage_flag_show(struct kobject *kobj,
-   struct kobj_attribute *attr, char *buf,
-   enum transparent_hugepage_flag flag);
+ssize_t single_hugepage_flag_store(struct kobject *kobj,
+  struct kobj_attribute *attr,
+  const char *buf, si

Re: [PATCH v3 3/6] mm: support THP migration to device private memory

2020-11-06 Thread Ralph Campbell



On 11/6/20 12:03 AM, Christoph Hellwig wrote:

I hate the extra pin count magic here.  IMHO we really need to finish
off the series to get rid of the extra references on the ZONE_DEVICE
pages first.


First, thanks for the review comments.

I don't like the extra refcount either, that is why I tried to fix that up
before resending this series. However, you didn't like me just fixing the
refcount only for device private pages and I don't know the dax/pmem code
and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
long it will take me to fix all the use cases.
So I wanted to make progress on the THP migration code in the mean time.



Re: [PATCH v3 4/6] mm/thp: add THP allocation helper

2020-11-06 Thread Ralph Campbell



On 11/6/20 12:01 AM, Christoph Hellwig wrote:

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
+   unsigned long addr);


No need for the extern.  And also here: do we actually need the stub,
or can the caller make sure (using IS_ENABLED and similar) that the
compiler knows the code is dead?


Same problem as with prep_transhuge_device_private_page(), since
alloc_hugepage_direct_gfpmask() and alloc_hugepage_vma() are not
EXPORT_SYMBOL_GPL.
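
For reference, what I understand the caller-side suggestion to be (a sketch
only; the is_huge flag mirrors the nouveau fault path in patch 6/6): with an
IS_ENABLED() guard the THP branch is compiled out when THP support is
disabled, so an unconditional declaration would be enough and the stub could
be dropped.

	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && is_huge)
		dpage = alloc_transhugepage(vmf->vma, args->start);
	else
		dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);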


+struct page *alloc_transhugepage(struct vm_area_struct *vma,
+unsigned long haddr)
+{
+   gfp_t gfp;
+   struct page *page;
+
+   gfp = alloc_hugepage_direct_gfpmask(vma);
+   page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+   if (page)
+   prep_transhuge_page(page);
+   return page;


I think do_huge_pmd_anonymous_page should be switched to use this
helper as well.


Sure, I'll do that for v4.


Re: [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()

2020-11-06 Thread Ralph Campbell



On 11/5/20 11:55 PM, Christoph Hellwig wrote:

On Thu, Nov 05, 2020 at 04:51:42PM -0800, Ralph Campbell wrote:

+extern void prep_transhuge_device_private_page(struct page *page);


No need for the extern.


Right, I was just copying the style.
Would you like to see a preparatory patch that removes extern for the other
declarations in huge_mm.h?


+static inline void prep_transhuge_device_private_page(struct page *page)
+{
+}


Is the code to call this even reachable if THP support is configured
out?  If not just declaring it unconditionally and letting dead code
elimination do its job might be a tad cleaner.


The HMM test driver depends on TRANSPARENT_HUGEPAGE but the nouveau SVM
option doesn't and SVM is still useful if TRANSPARENT_HUGEPAGE is not 
configured.

The problem with defining prep_transhuge_device_private_page() in huge_mm.h
as a static inline function is that prep_compound_page() and
prep_transhuge_page() would then have to be EXPORT_SYMBOL_GPL, and they are
currently mm internal only.
The intent is to make this helper callable by separate device driver modules
using struct pages created with memremap_pages().


+void prep_transhuge_device_private_page(struct page *page)


I think a kerneldoc comment explaining what this function is useful for
would be helpful.


That is a good idea. I'll add it to v4.


Re: [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()

2020-11-06 Thread Ralph Campbell



On 11/6/20 4:14 AM, Matthew Wilcox wrote:

On Thu, Nov 05, 2020 at 04:51:42PM -0800, Ralph Campbell wrote:

Add a helper function to allow device drivers to create device private
transparent huge pages. This is intended to help support device private
THP migrations.


I think you'd be better off with these calling conventions:

-void prep_transhuge_page(struct page *page)
+struct page *thp_prep(struct page *page)
  {
+   if (!page || compound_order(page) == 0)
+   return page;
 /*
-* we use page->mapping and page->indexlru in second tail page
+* we use page->mapping and page->index in second tail page
  * as list_head: assuming THP order >= 2
  */
+   BUG_ON(compound_order(page) == 1);
  
 INIT_LIST_HEAD(page_deferred_list(page));

 set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+
+   return page;
  }

It simplifies the users.


I'm not sure what the simplification is.
If you mean the name change from prep_transhuge_page() to thp_prep(),
that seems fine to me. The following could also be renamed to
thp_prep_device_private_page() or similar.


+void prep_transhuge_device_private_page(struct page *page)
+{
+   prep_compound_page(page, HPAGE_PMD_ORDER);
+   prep_transhuge_page(page);
+   /* Only the head page has a reference to the pgmap. */
+   percpu_ref_put_many(page->pgmap->ref, HPAGE_PMD_NR - 1);
+}
+EXPORT_SYMBOL_GPL(prep_transhuge_device_private_page);


Something else that may interest you from my patch series is support
for page sizes other than PMD_SIZE.  I don't know what page sizes
hardware supports.  There's no support for page sizes other than PMD
for anonymous memory, so this might not be too useful for you yet.


I did see those changes. It might help some device drivers to do DMA in
larger than PAGE_SIZE blocks but less than PMD_SIZE. It might help
reduce page table sizes since 2MB, 64K, and 4K are commonly supported
GPU page sizes. The MIGRATE_PFN_COMPOUND flag is intended to indicate
that the page size is determined by page_size() so I was thinking ahead
to other than PMD sized pages. However, when migrating a pte_none() or
pmd_none() page, there is no source page to determine the size.
Maybe I need to encode the page order in the migrate PFN entry like
hmm_range_fault().

Anyway, I agree that thinking about page sizes other than PMD is good.
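
A rough sketch of that encoding (the names and bit position below are
hypothetical, not existing migrate_vma flags; it just mirrors how
hmm_range_fault() reports the mapping order via hmm_pfn_to_map_order(), and
assumes a 64-bit unsigned long so the bits above the shifted PFN are free):

	#define MIGRATE_PFN_ORDER_SHIFT	(BITS_PER_LONG - 8)	/* hypothetical */

	static inline unsigned long migrate_pfn_order(unsigned int order)
	{
		return (unsigned long)order << MIGRATE_PFN_ORDER_SHIFT;
	}

	static inline unsigned int migrate_pfn_to_order(unsigned long mpfn)
	{
		return (mpfn >> MIGRATE_PFN_ORDER_SHIFT) & 0xff;
	}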


[PATCH v3 5/6] mm/hmm/test: add self tests for THP migration

2020-11-05 Thread Ralph Campbell
Add some basic stand alone self tests for migrating system memory to device
private memory and back.

Signed-off-by: Ralph Campbell 
---
 lib/test_hmm.c | 437 +
 lib/test_hmm_uapi.h|   3 +
 tools/testing/selftests/vm/hmm-tests.c | 404 +++
 3 files changed, 775 insertions(+), 69 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..456f1a90bcc3 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -66,6 +66,7 @@ struct dmirror {
struct xarray   pt;
	struct mmu_interval_notifier	notifier;
struct mutexmutex;
+   __u64   flags;
 };
 
 /*
@@ -91,6 +92,7 @@ struct dmirror_device {
unsigned long   calloc;
unsigned long   cfree;
struct page *free_pages;
+   struct page *free_huge_pages;
spinlock_t  lock;   /* protects the above */
 };
 
@@ -450,6 +452,7 @@ static int dmirror_write(struct dmirror *dmirror, struct 
hmm_dmirror_cmd *cmd)
 }
 
 static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+  bool is_huge,
   struct page **ppage)
 {
struct dmirror_chunk *devmem;
@@ -503,28 +506,51 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
 
	mutex_unlock(&mdevice->devmem_lock);
 
-   pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 
0x%lx)\n",
+   pr_info("dev %u added %u MB (total %u chunks, %u MB) PFNs [0x%lx 
0x%lx)\n",
+   MINOR(mdevice->cdevice.dev),
DEVMEM_CHUNK_SIZE / (1024 * 1024),
mdevice->devmem_count,
mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
pfn_first, pfn_last);
 
	spin_lock(&mdevice->lock);
-   for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+   for (pfn = pfn_first; pfn < pfn_last; ) {
struct page *page = pfn_to_page(pfn);
 
+   if (is_huge && (pfn & (HPAGE_PMD_NR - 1)) == 0 &&
+   pfn + HPAGE_PMD_NR <= pfn_last) {
+   prep_transhuge_device_private_page(page);
+   page->zone_device_data = mdevice->free_huge_pages;
+   mdevice->free_huge_pages = page;
+   pfn += HPAGE_PMD_NR;
+   continue;
+   }
page->zone_device_data = mdevice->free_pages;
mdevice->free_pages = page;
+   pfn++;
}
if (ppage) {
-   *ppage = mdevice->free_pages;
-   mdevice->free_pages = (*ppage)->zone_device_data;
-   mdevice->calloc++;
+   if (is_huge) {
+   if (!mdevice->free_huge_pages)
+   goto err_unlock;
+   *ppage = mdevice->free_huge_pages;
+   mdevice->free_huge_pages = (*ppage)->zone_device_data;
+   mdevice->calloc += thp_nr_pages(*ppage);
+   } else if (mdevice->free_pages) {
+   *ppage = mdevice->free_pages;
+   mdevice->free_pages = (*ppage)->zone_device_data;
+   mdevice->calloc++;
+   } else
+   goto err_unlock;
}
	spin_unlock(&mdevice->lock);
 
return true;
 
+err_unlock:
+	spin_unlock(&mdevice->lock);
+   return false;
+
 err_release:
	mutex_unlock(&mdevice->devmem_lock);
	release_mem_region(devmem->pagemap.range.start,
			   range_len(&devmem->pagemap.range));
@@ -534,7 +560,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
return false;
 }
 
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice,
+ bool is_huge)
 {
struct page *dpage = NULL;
struct page *rpage;
@@ -549,17 +576,40 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
 
	spin_lock(&mdevice->lock);
 
-   if (mdevice->free_pages) {
+   if (is_huge && mdevice->free_huge_pages) {
+   dpage = mdevice->free_huge_pages;
+   mdevice->free_huge_pages = dpage->zone_device_data;
+   mdevice->calloc += thp_nr_pages(dpage);
+		spin_unlock(&mdevice->lock);
+   } else if (!is_huge && mdevice->free_pages) {
dpage = mdevice->free_pages;
mdevice->free_pages = dpage->zone_device_data;
mdevice->calloc++;
		spin_unlock(&mdevice->lock);
} e

[PATCH v3 6/6] nouveau: support THP migration to private memory

2020-11-05 Thread Ralph Campbell
Add support for migrating transparent huge pages to and from device
private memory.

Signed-off-by: Ralph Campbell 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 289 ++---
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  11 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 3 files changed, 215 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..93eea8e9d987 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -82,6 +82,7 @@ struct nouveau_dmem {
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
+   struct page *free_huge_pages;
spinlock_t lock;
 };
 
@@ -112,8 +113,13 @@ static void nouveau_dmem_page_free(struct page *page)
struct nouveau_dmem *dmem = chunk->drm->dmem;
 
	spin_lock(&dmem->lock);
-   page->zone_device_data = dmem->free_pages;
-   dmem->free_pages = page;
+   if (PageHead(page)) {
+   page->zone_device_data = dmem->free_huge_pages;
+   dmem->free_huge_pages = page;
+   } else {
+   page->zone_device_data = dmem->free_pages;
+   dmem->free_pages = page;
+   }
 
WARN_ON(!chunk->callocated);
chunk->callocated--;
@@ -139,51 +145,100 @@ static void nouveau_dmem_fence_done(struct nouveau_fence 
**fence)
 
 static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
struct vm_fault *vmf, struct migrate_vma *args,
-   dma_addr_t *dma_addr)
+   struct page *spage, bool is_huge, dma_addr_t *dma_addr)
 {
+   struct nouveau_svmm *svmm = spage->zone_device_data;
struct device *dev = drm->dev->dev;
-   struct page *dpage, *spage;
-   struct nouveau_svmm *svmm;
-
-   spage = migrate_pfn_to_page(args->src[0]);
-   if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-   return 0;
+   struct page *dpage;
+   unsigned int i;
 
-   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+   if (is_huge)
+   dpage = alloc_transhugepage(vmf->vma, args->start);
+   else
+   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
if (!dpage)
-   return VM_FAULT_SIGBUS;
-   lock_page(dpage);
+   return VM_FAULT_OOM;
+   WARN_ON_ONCE(thp_order(spage) != thp_order(dpage));
 
-   *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   *dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
goto error_free_page;
 
-   svmm = spage->zone_device_data;
+   lock_page(dpage);
+   i = (vmf->address - args->start) >> PAGE_SHIFT;
+   spage += i;
	mutex_lock(&svmm->mutex);
nouveau_svmm_invalidate(svmm, args->start, args->end);
-   if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-   NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
+   if (drm->dmem->migrate.copy_func(drm, thp_nr_pages(dpage),
+   NOUVEAU_APER_HOST, *dma_addr, NOUVEAU_APER_VRAM,
+   nouveau_dmem_page_addr(spage)))
goto error_dma_unmap;
	mutex_unlock(&svmm->mutex);
 
-   args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   args->dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   if (is_huge)
+   args->dst[i] |= MIGRATE_PFN_COMPOUND;
return 0;
 
 error_dma_unmap:
	mutex_unlock(&svmm->mutex);
-   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   unlock_page(dpage);
+   dma_unmap_page(dev, *dma_addr, page_size(dpage), DMA_BIDIRECTIONAL);
 error_free_page:
__free_page(dpage);
return VM_FAULT_SIGBUS;
 }
 
+static vm_fault_t nouveau_dmem_fault_chunk(struct nouveau_drm *drm,
+   struct vm_fault *vmf, struct migrate_vma *args)
+{
+   struct device *dev = drm->dev->dev;
+   struct nouveau_fence *fence;
+   struct page *spage;
+   unsigned long src = args->src[0];
+   bool is_huge = (src & (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND)) ==
+   (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND);
+   unsigned long dma_page_size;
+   dma_addr_t dma_addr;
+   vm_fault_t ret = 0;
+
+   spage = migrate_pfn_to_page(src);
+   if (!spage) {
+   ret = VM_FAULT_SIGBUS;
+   goto out;
+   }
+   if (is_huge) {
+   dma_page_size = PMD_SIZE;
+   ret = nouveau_dmem_fault_copy_one(drm, vmf, args, spage, true,
+						  &dma_addr);
+ 

[PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_*

2020-11-05 Thread Ralph Campbell
This series adds support for transparent huge page migration to
migrate_vma_*() and adds nouveau SVM and HMM selftests as consumers.
Earlier versions were posted previously [1] and [2].

The patches apply cleanly to the linux-mm 5.10.0-rc2 tree. There are a
lot of other THP patches being posted. I don't think there are any
semantic conflicts but there may be some merge conflicts depending on
the order Andrew applies these.

Changes in v3:
Sent the patch ("mm/thp: fix __split_huge_pmd_locked() for migration PMD")
as a separate patch from this series.

Rebased to linux-mm 5.10.0-rc2.

Changes in v2:
Added splitting a THP midway in the migration process:
i.e., in migrate_vma_pages().

[1] https://lore.kernel.org/linux-mm/20200619215649.32297-1-rcampb...@nvidia.com
[2] https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampb...@nvidia.com

Ralph Campbell (6):
  mm/thp: add prep_transhuge_device_private_page()
  mm/migrate: move migrate_vma_collect_skip()
  mm: support THP migration to device private memory
  mm/thp: add THP allocation helper
  mm/hmm/test: add self tests for THP migration
  nouveau: support THP migration to private memory

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 289 +++-
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  11 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/gfp.h|  10 +
 include/linux/huge_mm.h|  12 +
 include/linux/memremap.h   |   9 +
 include/linux/migrate.h|   2 +
 lib/test_hmm.c | 437 +
 lib/test_hmm_uapi.h|   3 +
 mm/huge_memory.c   | 147 +++--
 mm/memcontrol.c|  25 +-
 mm/memory.c|  10 +-
 mm/memremap.c  |   4 +-
 mm/migrate.c   | 429 +++-
 mm/rmap.c  |   2 +-
 tools/testing/selftests/vm/hmm-tests.c | 404 +++
 16 files changed, 1522 insertions(+), 275 deletions(-)

-- 
2.20.1



[PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()

2020-11-05 Thread Ralph Campbell
Move the definition of migrate_vma_collect_skip() to make it callable
by migrate_vma_collect_hole(). This helps make the next patch easier
to read.

Signed-off-by: Ralph Campbell 
---
 mm/migrate.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c1585ec29827..665516319b66 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2253,6 +2253,21 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
*mm,
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_DEVICE_PRIVATE
+static int migrate_vma_collect_skip(unsigned long start,
+   unsigned long end,
+   struct mm_walk *walk)
+{
+   struct migrate_vma *migrate = walk->private;
+   unsigned long addr;
+
+   for (addr = start; addr < end; addr += PAGE_SIZE) {
+   migrate->dst[migrate->npages] = 0;
+   migrate->src[migrate->npages++] = 0;
+   }
+
+   return 0;
+}
+
 static int migrate_vma_collect_hole(unsigned long start,
unsigned long end,
__always_unused int depth,
@@ -2281,21 +2296,6 @@ static int migrate_vma_collect_hole(unsigned long start,
return 0;
 }
 
-static int migrate_vma_collect_skip(unsigned long start,
-   unsigned long end,
-   struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   unsigned long addr;
-
-   for (addr = start; addr < end; addr += PAGE_SIZE) {
-   migrate->dst[migrate->npages] = 0;
-   migrate->src[migrate->npages++] = 0;
-   }
-
-   return 0;
-}
-
 static int migrate_vma_collect_pmd(pmd_t *pmdp,
   unsigned long start,
   unsigned long end,
-- 
2.20.1



[PATCH v3 4/6] mm/thp: add THP allocation helper

2020-11-05 Thread Ralph Campbell
Transparent huge page allocation policy is controlled by several sysfs
variables. Rather than expose these to each device driver that needs to
allocate THPs, provide a helper function.

Signed-off-by: Ralph Campbell 
---
 include/linux/gfp.h | 10 ++
 mm/huge_memory.c| 14 ++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..242398c4b556 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -564,6 +564,16 @@ static inline struct page *alloc_pages(gfp_t gfp_mask, 
unsigned int order)
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)\
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
+   unsigned long addr);
+#else
+static inline struct page *alloc_transhugepage(struct vm_area_struct *vma,
+   unsigned long addr)
+{
+   return NULL;
+}
+#endif
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a073e66d0ee2..c2c1d3e7c35f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,6 +765,20 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
return __do_huge_pmd_anonymous_page(vmf, page, gfp);
 }
 
+struct page *alloc_transhugepage(struct vm_area_struct *vma,
+unsigned long haddr)
+{
+   gfp_t gfp;
+   struct page *page;
+
+   gfp = alloc_hugepage_direct_gfpmask(vma);
+   page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+   if (page)
+   prep_transhuge_page(page);
+   return page;
+}
+EXPORT_SYMBOL_GPL(alloc_transhugepage);
+
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
pgtable_t pgtable)
-- 
2.20.1



[PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()

2020-11-05 Thread Ralph Campbell
Add a helper function to allow device drivers to create device private
transparent huge pages. This is intended to help support device private
THP migrations.

Signed-off-by: Ralph Campbell 
---
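
As a rough illustration (not part of this patch), a driver that hands out
2MB chunks of its device private pages might use the helper like this;
mdevice and free_huge_pages are made-up names for a driver-managed pool:

	/*
	 * Take a physically contiguous run of HPAGE_PMD_NR device private
	 * pages off the driver's free list and make the head page a THP.
	 */
	spin_lock(&mdevice->lock);
	dpage = mdevice->free_huge_pages;	/* head page of a 2MB chunk */
	mdevice->free_huge_pages = dpage->zone_device_data;
	spin_unlock(&mdevice->lock);

	prep_transhuge_device_private_page(dpage);
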
 include/linux/huge_mm.h | 5 +
 mm/huge_memory.c| 9 +
 2 files changed, 14 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0365aa97f8e7..3ec26ef27a93 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -184,6 +184,7 @@ extern unsigned long thp_get_unmapped_area(struct file 
*filp,
unsigned long flags);
 
 extern void prep_transhuge_page(struct page *page);
+extern void prep_transhuge_device_private_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 bool is_transparent_hugepage(struct page *page);
 
@@ -377,6 +378,10 @@ static inline bool transhuge_vma_suitable(struct 
vm_area_struct *vma,
 
 static inline void prep_transhuge_page(struct page *page) {}
 
+static inline void prep_transhuge_device_private_page(struct page *page)
+{
+}
+
 static inline bool is_transparent_hugepage(struct page *page)
 {
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 08a183f6c3ab..b4141f12ff31 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -498,6 +498,15 @@ void prep_transhuge_page(struct page *page)
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
+void prep_transhuge_device_private_page(struct page *page)
+{
+   prep_compound_page(page, HPAGE_PMD_ORDER);
+   prep_transhuge_page(page);
+   /* Only the head page has a reference to the pgmap. */
+   percpu_ref_put_many(page->pgmap->ref, HPAGE_PMD_NR - 1);
+}
+EXPORT_SYMBOL_GPL(prep_transhuge_device_private_page);
+
 bool is_transparent_hugepage(struct page *page)
 {
if (!PageCompound(page))
-- 
2.20.1



[PATCH v3 3/6] mm: support THP migration to device private memory

2020-11-05 Thread Ralph Campbell
Support transparent huge page migration to ZONE_DEVICE private memory.
A new selection flag (MIGRATE_VMA_SELECT_COMPOUND) is added to request
THP migration. Otherwise, THPs are split when filling in the source PFN
array. A new flag (MIGRATE_PFN_COMPOUND) is added to the source PFN array
to indicate a huge page can be migrated. If the device driver can allocate
a huge page, it sets the MIGRATE_PFN_COMPOUND flag in the destination PFN
array. migrate_vma_pages() will fallback to PAGE_SIZE pages if
MIGRATE_PFN_COMPOUND is not set in both source and destination arrays.

Signed-off-by: Ralph Campbell 
---
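
For reference, a hedged sketch of the driver-side view of the new flags;
the field values below (vma, start, end, src_pfns, dst_pfns, owner) are
illustrative, not taken from this patch:

	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.pgmap_owner	= owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};

	if (migrate_vma_setup(&args))
		return -EINVAL;
	/*
	 * For each src[i] with MIGRATE_PFN_COMPOUND set, the driver may
	 * supply a compound destination page and set MIGRATE_PFN_COMPOUND
	 * in dst[i]; otherwise migrate_vma_pages() falls back to PAGE_SIZE
	 * pages.
	 */
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
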
 include/linux/huge_mm.h  |   7 +
 include/linux/memremap.h |   9 +
 include/linux/migrate.h  |   2 +
 mm/huge_memory.c | 124 +---
 mm/memcontrol.c  |  25 ++-
 mm/memory.c  |  10 +-
 mm/memremap.c|   4 +-
 mm/migrate.c | 413 ---
 mm/rmap.c|   2 +-
 9 files changed, 486 insertions(+), 110 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3ec26ef27a93..1e8625cc233c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -190,6 +190,8 @@ bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_migrating_huge_page(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address, struct page *page);
 static inline int split_huge_page(struct page *page)
 {
return split_huge_page_to_list(page, NULL);
@@ -456,6 +458,11 @@ static inline bool is_huge_zero_page(struct page *page)
return false;
 }
 
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+   return false;
+}
+
 static inline bool is_huge_zero_pud(pud_t pud)
 {
return false;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..9b39a896af37 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,15 @@ struct dev_pagemap_ops {
 * the page back to a CPU accessible page.
 */
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+   /*
+* Used for private (un-addressable) device memory only.
+* This is called when a compound device private page is split.
+* The driver uses this callback to set tail_page->pgmap and
+* tail_page->zone_device_data appropriately based on the head
+* page.
+*/
+   void (*page_split)(struct page *head, struct page *tail_page);
 };
 
 #define PGMAP_ALTMAP_VALID (1 << 0)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0f8d1583fa8e..92179bf360d1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -144,6 +144,7 @@ static inline int migrate_misplaced_transhuge_page(struct 
mm_struct *mm,
 #define MIGRATE_PFN_MIGRATE(1UL << 1)
 #define MIGRATE_PFN_LOCKED (1UL << 2)
 #define MIGRATE_PFN_WRITE  (1UL << 3)
+#define MIGRATE_PFN_COMPOUND   (1UL << 4)
 #define MIGRATE_PFN_SHIFT  6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -161,6 +162,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+   MIGRATE_VMA_SELECT_COMPOUND = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b4141f12ff31..a073e66d0ee2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1682,23 +1682,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
} else {
struct page *page = NULL;
int flush_needed = 1;
+   bool is_anon = false;
 
if (pmd_present(orig_pmd)) {
page = pmd_page(orig_pmd);
+   is_anon = PageAnon(page);
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
swp_entry_t entry;
 
-   VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
entry = pmd_to_swp_entry(orig_pmd);
-   page = pfn_to_page(swp_offset(entry));
+   if (is_device_private_entry(entry)) {
+   page = device_private_entry_to_page(entry);
+   is_anon = PageAnon(page);
+   page_remove_rmap(page, true);
+   VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+   VM_BUG_ON_PAGE(!PageHead(page), page);
+ 

[PATCH v2] nouveau: fix the start/end range for migration

2020-10-26 Thread Ralph Campbell
The user level OpenCL code shouldn't have to align start and end
addresses to a page boundary. That is better handled in the nouveau
driver. The npages field is also redundant since it can be computed
from the start and end addresses.

Signed-off-by: Ralph Campbell 
---

I thought I sent this out earlier but I don't see any record of that so
I'm resending it. It applies cleanly to linux-5.10.0-rc1.

This is for Ben Skegg's nouveau tree.

I have been working with Karol Herbst on the OpenCL mesa changes for
nouveau which will be merged upstream soon.
With or without those changes, the user-visible effect of this patch is
only to extend the range by one page (rounding va_end up to a page
boundary instead of down).
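
Concretely, with 4KB pages an unaligned va_end such as 0x12345 used to be
rounded down and now rounds up by one page:

	args->va_end &= PAGE_MASK;			/* old: 0x12345 -> 0x12000 */
	args->va_end = ALIGN(args->va_end, PAGE_SIZE);	/* new: 0x12345 -> 0x13000 */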

Changes in v2:
I changed the start/end check to require a size so start has to be less
than end.

 drivers/gpu/drm/nouveau/nouveau_svm.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 2df1c0460559..4f69e4c3dafd 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -105,11 +105,11 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
struct nouveau_cli *cli = nouveau_cli(file_priv);
struct drm_nouveau_svm_bind *args = data;
unsigned target, cmd, priority;
-   unsigned long addr, end, size;
+   unsigned long addr, end;
struct mm_struct *mm;
 
args->va_start &= PAGE_MASK;
-   args->va_end &= PAGE_MASK;
+   args->va_end = ALIGN(args->va_end, PAGE_SIZE);
 
/* Sanity check arguments */
if (args->reserved0 || args->reserved1)
@@ -118,8 +118,6 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
return -EINVAL;
if (args->va_start >= args->va_end)
return -EINVAL;
-   if (!args->npages)
-   return -EINVAL;
 
cmd = args->header >> NOUVEAU_SVM_BIND_COMMAND_SHIFT;
cmd &= NOUVEAU_SVM_BIND_COMMAND_MASK;
@@ -151,12 +149,6 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
if (args->stride)
return -EINVAL;
 
-   size = ((unsigned long)args->npages) << PAGE_SHIFT;
-   if ((args->va_start + size) <= args->va_start)
-   return -EINVAL;
-   if ((args->va_start + size) > args->va_end)
-   return -EINVAL;
-
/*
 * Ok we are ask to do something sane, for now we only support migrate
 * commands but we will add things like memory policy (what to do on
@@ -171,7 +163,7 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
return -EINVAL;
}
 
-   for (addr = args->va_start, end = args->va_start + size; addr < end;) {
+   for (addr = args->va_start, end = args->va_end; addr < end;) {
struct vm_area_struct *vma;
unsigned long next;
 
-- 
2.20.1



[PATCH] mm: handle zone device pages in release_pages()

2020-10-21 Thread Ralph Campbell
release_pages() is an optimized, inlined version of __put_page() except
that zone device struct pages that are not page_is_devmap_managed()
(i.e., memory_type MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA)
fall through to the code that could return the zone device page to the
page allocator instead of adjusting the pgmap reference count.
Clearly these types of pages are not having their reference counts
decremented to zero via release_pages() or page allocation problems would
be seen. Just to be safe, handle the one-to-zero refcount transition in
release_pages() the same way __put_page() does.

Signed-off-by: Ralph Campbell 
---

I found this by code inspection while working on converting ZONE_DEVICE
struct pages to have zero based reference counts. I don't think there is
an actual problem that this fixes, it's more to future proof new uses of
release_pages().

This is for Andrew Morton's mm tree after the merge window.
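
For clarity, the intended flow of the ZONE_DEVICE branch in release_pages()
after this patch is roughly the following (simplified sketch, not the full
hunk):

		if (is_zone_device_page(page)) {
			if (page_is_devmap_managed(page)) {
				/* MEMORY_DEVICE_PRIVATE and FS_DAX */
				put_devmap_managed_page(page);
				continue;
			}
			/* MEMORY_DEVICE_GENERIC and PCI_P2PDMA */
			if (put_page_testzero(page))
				put_dev_pagemap(page->pgmap);
			continue;
		}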

 mm/swap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/swap.c b/mm/swap.c
index 0eb057141a04..106f519c45ac 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -907,6 +907,9 @@ void release_pages(struct page **pages, int nr)
put_devmap_managed_page(page);
continue;
}
+   if (put_page_testzero(page))
+   put_dev_pagemap(page->pgmap);
+   continue;
}
 
if (!put_page_testzero(page))
-- 
2.20.1



[PATCH] mm: optimize migrate_vma_pages() mmu notifier

2020-10-21 Thread Ralph Campbell
When migrating a zero page or pte_none() anonymous page to device private
memory, migrate_vma_setup() will initialize the src[] array with a NULL
PFN. This lets the device driver allocate device private memory and clear
it instead of DMAing a page of zeros over the device bus. Since the source
page didn't exist at the time, no struct page was locked nor a migration
PTE inserted into the CPU page tables. The actual PTE insertion happens
in migrate_vma_pages() when it tries to insert the device private struct
page PTE into the CPU page tables. migrate_vma_pages() has to call the
mmu notifiers again since another device could fault on the same page
before the page table locks are acquired. Allow device drivers to optimize
the invalidation similar to migrate_vma_setup() by calling
mmu_notifier_range_init() which sets struct mmu_notifier_range event type
to MMU_NOTIFY_MIGRATE and the migrate_pgmap_owner field.

Signed-off-by: Ralph Campbell 
---

This is for Andrew Morton's mm tree after the merge window.
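
A driver can then filter these events in its interval notifier callback,
along the lines of what the hmm test driver and nouveau already do. A
minimal sketch, assuming the mmu_interval_notifier API; example_dev and
owner are illustrative names:

static bool example_interval_invalidate(struct mmu_interval_notifier *mni,
					const struct mmu_notifier_range *range,
					unsigned long cur_seq)
{
	struct example_dev *dev = container_of(mni, struct example_dev, notifier);

	/*
	 * Ignore invalidations triggered by this driver's own migration;
	 * the device page tables are updated as part of that migration.
	 */
	if (range->event == MMU_NOTIFY_MIGRATE &&
	    range->migrate_pgmap_owner == dev->owner)
		return true;

	mmu_interval_set_seq(mni, cur_seq);
	/* ...invalidate device mappings for range->start .. range->end... */
	return true;
}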

 mm/migrate.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5ca5842df5db..560b57dde960 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2999,11 +2999,10 @@ void migrate_vma_pages(struct migrate_vma *migrate)
if (!notified) {
notified = true;
 
-   mmu_notifier_range_init(&range,
-   MMU_NOTIFY_CLEAR, 0,
-   NULL,
-   migrate->vma->vm_mm,
-   addr, migrate->end);
+   mmu_notifier_range_init_migrate(&range, 0,
+   migrate->vma, migrate->vma->vm_mm,
+   addr, migrate->end,
+   migrate->pgmap_owner);
			mmu_notifier_invalidate_range_start(&range);
}
migrate_vma_insert_page(migrate, addr, newpage,
-- 
2.20.1



[PATCH v2] mm/hmm: make device private reference counts zero based

2020-10-12 Thread Ralph Campbell
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for device private pages, leaving DAX as still being
a special case.

Signed-off-by: Ralph Campbell 
---

I'm sending this as a separate patch since I think it is ready to
merge. Originally, this was part of an RFC:
https://lore.kernel.org/linux-mm/20201001181715.17416-1-rcampb...@nvidia.com
and is changed to only make device private struct page reference
counts be zero based since DAX needs to detect when a page is no
longer referenced by GUP, not when it has no references at all.

It applies cleanly to linux-5.9.0-rc8-mm1 plus my patch
("ext4/xfs: add page refcount helper")
https://lore.kernel.org/linux-mm/20201007214925.11181-1-rcampb...@nvidia.com
and also
("mm/memcg: fix device private memcg accounting")
https://lore.kernel.org/linux-mm/20201009215952.2726-1-rcampb...@nvidia.com
but doesn't really depend on them, just simple merge conflicts
without them.

This is for Andrew Morton after the 5.10.0-rc1 merge window closes.

Changes in v2:
Fixed the call to page_ref_add_unless() when moving a process from one
memory cgroup to another thanks to Ira Weiny's comment.

 arch/powerpc/kvm/book3s_hv_uvmem.c | 13 +++-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 include/linux/memremap.h   |  6 ++--
 include/linux/mm.h |  9 +-
 lib/test_hmm.c |  7 -
 mm/internal.h  |  8 +
 mm/memcontrol.c| 11 ++-
 mm/memremap.c  | 42 ++
 mm/migrate.c   |  5 ---
 mm/swap.c  |  6 ++--
 10 files changed, 54 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..a0d08b1d8c1e 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
 out_clear:
@@ -1151,6 +1151,7 @@ int kvmppc_uvmem_init(void)
struct resource *res;
void *addr;
unsigned long pfn_last, pfn_first;
+   unsigned long pfn;
 
size = kvmppc_get_secmem_size();
if (!size) {
@@ -1191,6 +1192,16 @@ int kvmppc_uvmem_init(void)
goto out_unmap;
}
 
+   /*
+* Pages are created with an initial reference count of one but should
+* have a reference count of zero while in the free state.
+*/
+   for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+   struct page *dpage = pfn_to_page(pfn);
+
+   set_page_count(dpage, 0);
+   }
+
pr_info("KVMPPC-UVMEM: Secure Memory size 0x%lx\n", size);
return ret;
 out_unmap:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
+   init_page_count(page);
lock_page(page);
return page;
 }
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..43860870bc51 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,9 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count should
+* be reset to one with init_page_count(page) before reusing the page.
+* This allows device drivers to implement their own memory management.
 */
void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..7a7013c57a4a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1110,14 +1110,7 @@ static inline bool page_is_devmap_managed(struct page 
*page)
return false;
if (!is_zone_device_page(page))
return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-  

Re: [PATCH] mm/memcg: fix device private memcg accounting

2020-10-12 Thread Ralph Campbell



On 10/12/20 6:28 AM, Johannes Weiner wrote:

On Fri, Oct 09, 2020 at 05:00:37PM -0700, Ralph Campbell wrote:


On 10/9/20 3:50 PM, Andrew Morton wrote:

On Fri, 9 Oct 2020 14:59:52 -0700 Ralph Campbell  wrote:


The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
NULL before checking is_device_private_entry() so device private pages
are never handled.
Fix this by checking for non_swap_entry() after handling device private
swap PTEs.


The fix looks good to me.

Acked-by: Johannes Weiner 


But this makes me suspect the answer is "there aren't any that we know
of".  Are you sure a cc:stable is warranted?



I assume the memory cgroup accounting would be off somehow when moving
a process to another memory cgroup.
Currently, the device private page is charged like a normal anonymous page
when allocated and is uncharged when the page is freed so I think that path is 
OK.
Maybe someone who knows more about memory cgroup accounting can comment?


As for whether to CC stable, I'm leaning toward no:

- When moving tasks, we'd leave their device pages behind in the old
   cgroup. This isn't great, but it doesn't cause counter imbalances or
   corruption or anything - we also skip locked pages, we used to skip
   pages mapped by more than one pte, the user can select whether to
   move pages along tasks at all, and if so, whether only anon or file.

- Charge moving itself is a bit of a questionable feature, and users
   have been moving away from it. Leaving tasks in a cgroup and
   changing the configuration is a heck of a lot cheaper than moving
   potentially gigabytes of pages to another configuration domain.

- According to the Fixes tag, this isn't a regression, either. Since
   their inception, we have never migrated device pages.


Thanks for the Acked-by and the comments.
I assume Andrew will update the tags when queuing this up unless he wants me to 
resend.


Re: [PATCH] mm/memcg: fix device private memcg accounting

2020-10-09 Thread Ralph Campbell



On 10/9/20 3:50 PM, Andrew Morton wrote:

On Fri, 9 Oct 2020 14:59:52 -0700 Ralph Campbell  wrote:


The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
NULL before checking is_device_private_entry() so device private pages
are never handled.
Fix this by checking for non_swap_entry() after handling device private
swap PTEs.

Cc: sta...@vger.kernel.org


I was going to ask "what are the end-user visible effects of the bug".
This is important information with a cc:stable.



I'm not sure exactly how to test this. I ran the HMM self tests but
that is a minimal sanity check. I think moving the self test from one
memory cgroup to another while it is running would exercise this patch.
I'm looking at how the test could move itself to another group after
migrating some anonymous memory to the test driver.



But this makes me suspect the answer is "there aren't any that we know
of".  Are you sure a cc:stable is warranted?



I assume the memory cgroup accounting would be off somehow when moving
a process to another memory cgroup.
Currently, the device private page is charged like a normal anonymous page
when allocated and is uncharged when the page is freed so I think that path is 
OK.
Maybe someone who knows more about memory cgroup accounting can comment?


[PATCH] mm/memcg: fix device private memcg accounting

2020-10-09 Thread Ralph Campbell
The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
NULL before checking is_device_private_entry() so device private pages
are never handled.
Fix this by checking for non_swap_entry() after handling device private
swap PTEs.

Cc: sta...@vger.kernel.org
Fixes: c733a82874a7 ("mm/memcontrol: support MEMORY_DEVICE_PRIVATE")
Signed-off-by: Ralph Campbell 
---

I'm not sure exactly how to test this. I ran the HMM self tests but
that is a minimal sanity check. I think moving the self test from one
memory cgroup to another while it is running would exercise this patch.
I'm looking at how the test could move itself to another group after
migrating some anonymous memory to the test driver.

This applies cleanly to linux-5.9.0-rc8-mm1 and is for Andrew Morton's
tree.

 mm/memcontrol.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..3a24e3b619f5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5549,7 +5549,7 @@ static struct page *mc_handle_swap_pte(struct 
vm_area_struct *vma,
struct page *page = NULL;
swp_entry_t ent = pte_to_swp_entry(ptent);
 
-   if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
+   if (!(mc.flags & MOVE_ANON))
return NULL;
 
/*
@@ -5568,6 +5568,9 @@ static struct page *mc_handle_swap_pte(struct 
vm_area_struct *vma,
return page;
}
 
+   if (non_swap_entry(ent))
+   return NULL;
+
/*
 * Because lookup_swap_cache() updates some statistics counter,
 * we call find_get_page() with swapper_space directly.
-- 
2.20.1



Re: [PATCH] mm: make device private reference counts zero based

2020-10-09 Thread Ralph Campbell



On 10/9/20 9:53 AM, Ira Weiny wrote:

On Thu, Oct 08, 2020 at 10:25:44AM -0700, Ralph Campbell wrote:

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for device private pages, leaving DAX as still being
a special case.


What about the check in mc_handle_swap_pte()?

mm/memcontrol.c:

5513 /*
5514  * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which 
have
5515  * a refcount of 1 when free (unlike normal page)
5516  */
5517 if (!page_ref_add_unless(page, 1, 1))
5518 return NULL;

... does that need to change?  Perhaps just the comment?


Thanks for spotting this.

Actually, this whole bit of code is never reached because the

	if (non_swap_entry(ent))
		return NULL;

check above it covers device private pages and returns early.

The device private pages are accounted for in mem_cgroup so this
probably needs fixing. I'll probably move the code before the
non_swap_entry() check and change the refcount increment to
page_ref_add_unless(page, 1, 0).
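
Roughly, the idea is something like this in mc_handle_swap_pte() (a sketch
of the intended reordering, not the exact v2 hunk):

	if (is_device_private_entry(ent)) {
		page = device_private_entry_to_page(ent);
		/*
		 * Device private pages are free at refcount zero, so only
		 * take a reference if the page is still in use.
		 */
		if (!page_ref_add_unless(page, 1, 0))
			return NULL;
		return page;
	}

	if (non_swap_entry(ent))
		return NULL;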



Signed-off-by: Ralph Campbell 
---



[snip]

  
  void put_devmap_managed_page(struct page *page);

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e151a7f10519..bf92a261fa6f 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -509,10 +509,15 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
pfn_first, pfn_last);
  
+	/*

+* Pages are created with an initial reference count of one but should
+* have a reference count of zero while in the free state.
+*/
	spin_lock(&mdevice->lock);
for (pfn = pfn_first; pfn < pfn_last; pfn++) {
struct page *page = pfn_to_page(pfn);
  
+		set_page_count(page, 0);


This confuses me.  How does this and init_page_count() not confuse the buddy
allocator?  Don't you have to reset the refcount somewhere after the test?


Device private struct pages are created with memremap_pages() and destroyed
with memunmap_pages(). They are never put on the LRU and never freed to the
page allocator. The refcount dropping to zero in put_page() triggers the
call to pgmap->page_free() instead, so only the driver handles the free
pages it creates.
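
In other words, the driver's page_free() callback just recycles the idle
page into its own pool, roughly like the hmm test driver does (sketch;
dmirror_device and free_pages are the test driver's names):

static void example_devmem_free(struct page *page)
{
	struct dmirror_device *mdevice = dmirror_page_to_device(page);

	/* Return the now-idle device private page to the driver free list. */
	spin_lock(&mdevice->lock);
	page->zone_device_data = mdevice->free_pages;
	mdevice->free_pages = page;
	spin_unlock(&mdevice->lock);
}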



page->zone_device_data = mdevice->free_pages;
mdevice->free_pages = page;
}
@@ -561,7 +566,7 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
}
  
  	dpage->zone_device_data = rpage;

-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
  
diff --git a/mm/internal.h b/mm/internal.h

index c43ccdddb0f6..e1443b73aa9b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
  


[snip]


diff --git a/mm/swap.c b/mm/swap.c
index 0eb057141a04..93d880c6f73c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -116,12 +116,11 @@ static void __put_compound_page(struct page *page)
  void __put_page(struct page *page)
  {
if (is_zone_device_page(page)) {
-   put_dev_pagemap(page->pgmap);
-
/*
 * The page belongs to the device that created pgmap. Do
 * not return it to page allocator.
 */
+   free_zone_device_page(page);


I really like this.

Ira



[PATCH] mm: make device private reference counts zero based

2020-10-08 Thread Ralph Campbell
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for device private pages, leaving DAX as still being
a special case.

Signed-off-by: Ralph Campbell 
---

I'm sending this as a separate patch since I think it is ready to
merge. Originally, this was part of an RFC:
https://lore.kernel.org/linux-mm/20201001181715.17416-1-rcampb...@nvidia.com
and is changed to only make device private struct page reference
counts be zero based since DAX needs to detect when a page is no
longer referenced by GUP, not when it has no references at all.

It applies cleanly to linux-5.9.0-rc8-mm1 plus my patch
("ext4/xfs: add page refcount helper")
https://lore.kernel.org/linux-mm/20201007214925.11181-1-rcampb...@nvidia.com
but doesn't really depend on it, just a simple merge conflict
without it.

This is for Andrew Morton after the 5.10.0-rc1 merge window closes.

 arch/powerpc/kvm/book3s_hv_uvmem.c | 13 +++-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 include/linux/memremap.h   |  6 ++--
 include/linux/mm.h |  9 +-
 lib/test_hmm.c |  7 -
 mm/internal.h  |  8 +
 mm/memremap.c  | 42 ++
 mm/migrate.c   |  5 ---
 mm/swap.c  |  6 ++--
 9 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..a0d08b1d8c1e 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
 out_clear:
@@ -1151,6 +1151,7 @@ int kvmppc_uvmem_init(void)
struct resource *res;
void *addr;
unsigned long pfn_last, pfn_first;
+   unsigned long pfn;
 
size = kvmppc_get_secmem_size();
if (!size) {
@@ -1191,6 +1192,16 @@ int kvmppc_uvmem_init(void)
goto out_unmap;
}
 
+   /*
+* Pages are created with an initial reference count of one but should
+* have a reference count of zero while in the free state.
+*/
+   for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+   struct page *dpage = pfn_to_page(pfn);
+
+   set_page_count(dpage, 0);
+   }
+
pr_info("KVMPPC-UVMEM: Secure Memory size 0x%lx\n", size);
return ret;
 out_unmap:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
+   init_page_count(page);
lock_page(page);
return page;
 }
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..43860870bc51 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,9 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count should
+* be reset to one with init_page_count(page) before reusing the page.
+* This allows device drivers to implement their own memory management.
 */
void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6b8e30dce2e..ac848eeb2a1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1110,14 +1110,7 @@ static inline bool page_is_devmap_managed(struct page 
*page)
return false;
if (!is_zone_device_page(page))
return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
+   return page->pgmap->type == MEMORY_DEVICE_FS_DAX;
 }
 
 void put_devmap_managed_page(struct page *page);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e151a7f10519..bf92a261fa6f 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -509,10 +509,15 @@ static bool dmirror_allocate

Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-10-08 Thread Ralph Campbell



On 10/7/20 10:17 PM, Ram Pai wrote:

On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.



I was hoping this patch would resolve a page-reference issue that we run
into at random times while migrating a page to a device page backed by
secure-memory.

Unfortunately I continue to see the problem. There is a reference
held on that page, which fails the migration.

FYI
RP


I'm willing to look into it but I would need more information.
Can you give any more details about the conditions when it happens?
It would be great if you have a program that reproduces the problem.


[PATCH v2] ext4/xfs: add page refcount helper

2020-10-07 Thread Ralph Campbell
There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add helper functions to hide this detail.

Signed-off-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 
Acked-by: Darrick J. Wong 
Acked-by: Theodore Ts'o  # for fs/ext4/inode.c
---

Changes in v2:
I strongly resisted the idea of extending this patch but after Jan
Kara's comment about there being more places that could be cleaned
up, I felt compelled to make this one teensy-weensy change to add
a dax_wakeup_page() to match the dax_wait_page().
I kept the Reviewed/Acked-bys since I don't think this substantially
changes the patch.

 fs/dax.c|  4 ++--
 fs/ext4/inode.c |  5 +
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 15 +++
 mm/memremap.c   |  3 ++-
 5 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..85c63f735909 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (page_ref_count(page) > 1)
+   if (!dax_layout_is_idle_page(page))
return page;
}
return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 771ed8b1fadb..132620cbfa13 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3937,10 +3937,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
 
-   error = ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1,
-   TASK_INTERRUPTIBLE, 0, 0,
-   ext4_wait_dax_page(ei));
+   error = dax_wait_page(ei, page, ext4_wait_dax_page);
} while (error == 0);
 
return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3d1b95124744..a5304aaeaa3a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -749,9 +749,7 @@ xfs_break_dax_layouts(
return 0;
 
*retry = true;
-   return ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, xfs_wait_dax_page(inode));
+   return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..e2da78e87338 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,21 @@ static inline bool dax_mapping(struct address_space 
*mapping)
return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+   return page_ref_count(page) == 1;
+}
+
+static inline void dax_wakeup_page(struct page *page)
+{
+   wake_up_var(&page->_refcount);
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb) \
+   ___wait_var_event(&(_page)->_refcount,  \
+   dax_layout_is_idle_page(_page), \
+   TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
diff --git a/mm/memremap.c b/mm/memremap.c
index 2bb276680837..504a10ff2edf 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -508,7 +509,7 @@ void free_devmap_managed_page(struct page *page)
 {
/* notify page idle for dax */
if (!is_device_private_page(page)) {
-   wake_up_var(&page->_refcount);
+   dax_wakeup_page(page);
return;
}
 
-- 
2.20.1



Re: [PATCH] ext4/xfs: add page refcount helper

2020-10-07 Thread Ralph Campbell



On 10/7/20 1:25 AM, Jan Kara wrote:

On Tue 06-10-20 16:09:30, Ralph Campbell wrote:

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 


Looks like a sane direction, but if we are going to abstract the checks
for when a ZONE_DEVICE page is idle, we should also update e.g.
mm/swap.c:put_devmap_managed_page() or
mm/gup.c:__unpin_devmap_managed_user_page() (there may be more places like
this but I found at least these two...). Maybe Dan has more thoughts about
this.

Honza


I think this is a good point but I would like to make that a follow-on
patch rather than add to this one.


[PATCH] ext4/xfs: add page refcount helper

2020-10-06 Thread Ralph Campbell
There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 
---

I'm resending this as a separate patch since I think it is ready to
merge. Originally, this was part of an RFC and is unchanged from v3:
https://lore.kernel.org/linux-mm/20201001181715.17416-1-rcampb...@nvidia.com

It applies cleanly to linux-5.9.0-rc7-mm1 but doesn't really
depend on anything, just simple merge conflicts when applied to
other trees.
I'll let the various maintainers decide which tree and when to merge.
It isn't urgent since it is a clean up patch.

 fs/dax.c|  4 ++--
 fs/ext4/inode.c |  5 +
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..85c63f735909 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (page_ref_count(page) > 1)
+   if (!dax_layout_is_idle_page(page))
return page;
}
return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 771ed8b1fadb..132620cbfa13 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3937,10 +3937,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
 
-   error = ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1,
-   TASK_INTERRUPTIBLE, 0, 0,
-   ext4_wait_dax_page(ei));
+   error = dax_wait_page(ei, page, ext4_wait_dax_page);
} while (error == 0);
 
return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3d1b95124744..a5304aaeaa3a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -749,9 +749,7 @@ xfs_break_dax_layouts(
return 0;
 
*retry = true;
-   return ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, xfs_wait_dax_page(inode));
+   return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..8909a91cd381 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space 
*mapping)
return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+   return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb) \
+   ___wait_var_event(&(_page)->_refcount,  \
+   dax_layout_is_idle_page(_page), \
+   TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
-- 
2.20.1



Re: [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-10-05 Thread Ralph Campbell



On 10/1/20 10:59 PM, Christoph Hellwig wrote:

On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell 


Looks good,

Reviewed-by: Christoph Hellwig 


Thanks for the review.

I still have reservations about making this an official patch.
Did you see the updated cover letter?
Basically, I'm concerned about ZONE_DEVICE struct pages being inserted into
the process page tables with a zero reference count by vmf_insert_mixed().
If they are supposed to have a non-zero reference count there, then DAX,
pmem, and other users of ZONE_DEVICE pages (or vmf_insert_mixed() itself)
need to be changed to increment/decrement the count in the appropriate
places, but I don't feel I know that code well enough to make those changes.


Re: [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper

2020-10-05 Thread Ralph Campbell




On 10/1/20 10:56 PM, Christoph Hellwig wrote:

On Thu, Oct 01, 2020 at 11:17:14AM -0700, Ralph Campbell wrote:

There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell 


Looks good:

Reviewed-by: Christoph Hellwig 



Thanks for the review.
I'll resend this as an independent patch.


[RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-10-01 Thread Ralph Campbell
This is still an RFC because after looking at the pmem/dax code some
more, I realized that the ZONE_DEVICE struct pages are being inserted
into the process' page tables with vmf_insert_mixed() and a zero
refcount on the ZONE_DEVICE struct page. This is sort of OK because
insert_pfn() increments the reference count on the pgmap which is what
prevents memunmap_pages() from freeing the struct pages and it doesn't
check for a non-zero struct page reference count.
But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
checks for a reference count == 0.

// mmap() an ext4 file that is mounted -o dax.
ext4_dax_fault()
  ext4_dax_huge_fault()
dax_iomap_fault(_iomap_ops)
  dax_iomap_pte_fault()
ops->iomap_begin() // ext4_iomap_begin()
  ext4_map_blocks()
  ext4_set_iomap()
dax_iomap_pfn()
dax_insert_entry()
vmf_insert_mixed(pfn)
  __vm_insert_mixed()
if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
!pfn_t_devmap(pfn) && pfn_t_valid(pfn))
  insert_page()
get_page(page) // XXX would trigger VM_BUG_ON_PAGE()
page_add_file_rmap()
set_pte_at()
else
  insert_pfn()
pte_mkdevmap()
set_pte_at()

Should pmem set the page reference count to one before inserting the
pfn into the page tables (and decrement when removing devmap PTEs)?
What about MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use cases?
Where should they increment/decrement the page reference count?
I don't know enough about how these are used to really know what to
do at this point. If people want me to continue to work on this series,
I will need some guidance.

---

Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly because the page is "free" when the
reference count is one instead of zero. This leads to explicit checks
for ZONE_DEVICE pages in places like put_page(), GUP, THP splitting, and
page migration which have to adjust the expected reference count when
determining if the page is isolated or idle. This is my attempt to make
ZONE_DEVICE pages be free when the reference count is zero and removing
the special cases.

I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have been able to successfully run xfstests on ext4 with the memmap
kernel boot option to simulate pmem.

Changes in v3:
Rebase to linux-mm 5.9.0-rc7-mm1.
Added a check for page_free() as suggested by Christoph Hellwig.
Added a helper for dax_wait_page() as suggested by Christoph Hellwig.

Changes in v2:
One of the big changes in v2 is that devm_memremap_pages() and
memremap_pages() now return the struct pages' reference count set to
zero instead of one. Normally, get_page() will VM_BUG_ON_PAGE() if
page->_refcount is zero. I didn't see any such warnings running the
xfstests with dax/pmem, but I'm not clear on how the zero-to-one reference
count transition is handled.

Other changes in v2:
Rebased to Linux-5.9.0-rc6 to include pmem fixes.
I added patch 1 to introduce a page refcount helper for ext4 and xfs as
suggested by Christoph Hellwig.
I also applied Christoph Hellwig's other suggested changes for removing
the devmap_managed_key, etc.

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c   |  8 +--
 fs/ext4/inode.c|  5 +-
 fs/xfs/xfs_file.c  |  4 +-
 include/linux/dax.h| 10 +++
 include/linux/memremap.h   |  7 ++-
 include/linux/mm.h | 44 --
 lib/test_hmm.c |  2 +-
 mm/gup.c   | 44 --
 mm/internal.h  |  8 +++
 mm/memremap.c  | 84 +++---
 mm/migrate.c   |  5 --
 mm/page_alloc.c|  3 +
 mm/swap.c  | 44 ++
 15 files changed, 63 insertions(+), 209 deletions(-)

-- 
2.20.1



[RFC PATCH v3 1/2] ext4/xfs: add page refcount helper

2020-10-01 Thread Ralph Campbell
There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell 
---
 fs/dax.c|  4 ++--
 fs/ext4/inode.c |  5 +
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..85c63f735909 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (page_ref_count(page) > 1)
+   if (!dax_layout_is_idle_page(page))
return page;
}
return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..4c3b80e68121 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3926,10 +3926,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
 
-   error = ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1,
-   TASK_INTERRUPTIBLE, 0, 0,
-   ext4_wait_dax_page(ei));
+   error = dax_wait_page(ei, page, ext4_wait_dax_page);
} while (error == 0);
 
return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3d1b95124744..a5304aaeaa3a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -749,9 +749,7 @@ xfs_break_dax_layouts(
return 0;
 
*retry = true;
-   return ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, xfs_wait_dax_page(inode));
+   return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..f906cf4db1cc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space 
*mapping)
return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+   return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb) \
+   ___wait_var_event(&(_page)->_refcount,  \
+   dax_layout_is_idle_page(_page), \
+   TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
-- 
2.20.1



[RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-10-01 Thread Ralph Campbell
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c   |  4 +-
 include/linux/dax.h|  2 +-
 include/linux/memremap.h   |  7 ++-
 include/linux/mm.h | 44 --
 lib/test_hmm.c |  2 +-
 mm/gup.c   | 44 --
 mm/internal.h  |  8 +++
 mm/memremap.c  | 84 +++---
 mm/migrate.c   |  5 --
 mm/page_alloc.c|  3 +
 mm/swap.c  | 44 ++
 13 files changed, 50 insertions(+), 201 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
+   init_page_count(page);
lock_page(page);
return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 85c63f735909..4804348f62e6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *   pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f906cf4db1cc..e4920ea6abd3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space 
*mapping)
 
 static inline bool dax_layout_is_idle_page(struct page *page)
 {
-   return page_ref_count(page) == 1;
+   return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..2f63747caf56 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count
+* should be reset to one with init_page_count(page) before reusing
+* the page. This allows the device driver to implement its own
+* memory management.
 */
void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27c64d0d7520..3d71a820ae38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1100,39 +1100,6 @@ static inline bool is_zone_device_page(const struct page 
*page)
 }
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   if (!static_branch_unlikely(&devmap_managed_key))
-   return false;
-   if (!is_zone_device_page(page))
-   return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-

Re: [PATCH 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-28 Thread Ralph Campbell



On 9/25/20 11:41 PM, Christoph Hellwig wrote:

On Fri, Sep 25, 2020 at 01:44:42PM -0700, Ralph Campbell wrote:

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell 
---
  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
  include/linux/dax.h|  2 +-
  include/linux/memremap.h   |  7 ++-
  include/linux/mm.h | 44 --
  lib/test_hmm.c |  2 +-
  mm/gup.c   | 44 --
  mm/internal.h  |  8 +++
  mm/memremap.c  | 82 ++
  mm/migrate.c   |  5 --
  mm/page_alloc.c|  3 +
  mm/swap.c  | 46 +++
  12 files changed, 44 insertions(+), 203 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 7705d5557239..e6ec98325fab 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
  
  	dpage = pfn_to_page(uvmem_pfn);

dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
  out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 4e8112fde3e6..ca2e3c3edc36 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -323,7 +323,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
  
-	get_page(page);

+   init_page_count(page);
lock_page(page);
return page;
  }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3f78ed78d1d6..8d29f38645aa 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -240,7 +240,7 @@ static inline bool dax_mapping(struct address_space 
*mapping)
  
  static inline bool dax_layout_is_idle_page(struct page *page)

  {
-   return page_ref_count(page) <= 1;
+   return page_ref_count(page) == 0;
  }
  
  #endif

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..f9224f88e4cd 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -65,9 +65,10 @@ enum memory_type {
  
  struct dev_pagemap_ops {

/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count
+* should be reset to one with init_page_count(page) before reusing
+* the page. This allows the device driver to implement its own
+* memory management.
 */
void (*page_free)(struct page *page);
  
diff --git a/include/linux/mm.h b/include/linux/mm.h

index b2f370f0b420..2159c2477aa3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,39 +1092,6 @@ static inline bool is_zone_device_page(const struct page 
*page)
  }
  #endif
  
-#ifdef CONFIG_DEV_PAGEMAP_OPS

-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   if (!static_branch_unlikely(&devmap_managed_key))
-   return false;
-   if (!is_zone_device_page(page))
-   return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
  static inline bool is_device_private_page(const struct page *page)
  {
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1171,17 +1138,6 @@ static inline void put_page(struct page *page)
  {
page = compound_head(page);
  
-	/*

-* For devmap managed pages we need to catch refcount transition from
-* 2 to 1, when refcount reach one it means the page is free and we
-* need to inform the device driver through callback. See
-* include/linux/memremap.h and HMM for details.
-*/
-   if (page_is

Re: [PATCH 1/2] ext4/xfs: add page refcount helper

2020-09-28 Thread Ralph Campbell



On 9/25/20 11:35 PM, Christoph Hellwig wrote:

On Fri, Sep 25, 2020 at 01:44:41PM -0700, Ralph Campbell wrote:

error = ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1,
+   dax_layout_is_idle_page(page),
TASK_INTERRUPTIBLE, 0, 0,
ext4_wait_dax_page(ei));



+++ b/fs/xfs/xfs_file.c
@@ -750,7 +750,7 @@ xfs_break_dax_layouts(
  
  	*retry = true;

return ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1, TASK_INTERRUPTIBLE,
+   dax_layout_is_idle_page(page), TASK_INTERRUPTIBLE,
0, 0, xfs_wait_dax_page(inode));
  }


I still think a litte helper macro would be nice here:

#define dax_wait_page(_inode, _page, _wait_cb)  \
___wait_var_event(&(_page)->_refcount,   \
atomic_read(&(_page)->_refcount) == 1,   \
TASK_INTERRUPTIBLE, dax_layout_is_idle_page(_page), \
TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode));


OK, I'll add it in v3.


Re: [PATCH 1/2] ext4/xfs: add page refcount helper

2020-09-25 Thread Ralph Campbell



On 9/25/20 1:51 PM, Dan Williams wrote:

On Fri, Sep 25, 2020 at 1:45 PM Ralph Campbell  wrote:


There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell 
---
  fs/dax.c| 8 
  fs/ext4/inode.c | 2 +-
  fs/xfs/xfs_file.c   | 2 +-
  include/linux/dax.h | 5 +
  4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 994ab66a9907..8eddbcc0e149 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
 for_each_mapped_pfn(entry, pfn) {
 struct page *page = pfn_to_page(pfn);

-   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
 WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 page->mapping = NULL;
 page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 for_each_mapped_pfn(entry, pfn) {
 struct page *page = pfn_to_page(pfn);

-   if (page_ref_count(page) > 1)
+   if (!dax_layout_is_idle_page(page))
 return page;
 }
 return NULL;
@@ -560,11 +560,11 @@ static void *grab_mapping_entry(struct xa_state *xas,

  /**
   * dax_layout_busy_page - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
   *
   * DAX requires ZONE_DEVICE mapped pages. These pages are never
   * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
   * any page in the mapping is busy, i.e. for DMA, or other
   * get_user_pages() usages.
   *
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..d9f8ad55523a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3927,7 +3927,7 @@ int ext4_break_layouts(struct inode *inode)
 return 0;

 error = ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1,
+   dax_layout_is_idle_page(page),
 TASK_INTERRUPTIBLE, 0, 0,
 ext4_wait_dax_page(ei));
 } while (error == 0);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a29f78a663ca..29ab96541bc1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -750,7 +750,7 @@ xfs_break_dax_layouts(

 *retry = true;
 return ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1, TASK_INTERRUPTIBLE,
+   dax_layout_is_idle_page(page), TASK_INTERRUPTIBLE,
 0, 0, xfs_wait_dax_page(inode));
  }

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 43b39ab9de1a..3f78ed78d1d6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -238,4 +238,9 @@ static inline bool dax_mapping(struct address_space 
*mapping)
 return mapping->host && IS_DAX(mapping->host);
  }

+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+   return page_ref_count(page) <= 1;


Why convert the check from "== 1" to "<= 1" and then back to the ==
operator in the next patch? A refcount < 1 in this path before your
other change is a bug.


Mostly I was thinking > 1 was busy so <= 1 is idle. And yes, <=0 is never
supposed to happen. Checking for == 1 is probably better though.


[RFC PATCH v2 0/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-25 Thread Ralph Campbell
Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly because they are "free" when the
reference count is one instead of zero. This leads to explicit checks
for ZONE_DEVICE pages in places like put_page(), GUP, THP splitting, and
page migration which have to adjust the expected reference count when
determining if the page is isolated or idle. This is my attempt to make
ZONE_DEVICE pages be free when the reference count is zero and removing
the special cases.

I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have been able to successfully run xfstests on ext4 with the memmap
kernel boot option to simulate pmem.

One of the big changes in v2 is that devm_memremap_pages() and
memremap_pages() now return the struct pages' reference count set to
zero instead of one. Normally, get_page() will VM_BUG_ON_PAGE() if
page->_refcount is zero. I didn't see any such warnings running the
xfstests with dax/pmem but I'm not clear how the zero to one reference
count is handled.

Other changes in v2:
Rebased to Linux-5.9.0-rc6 to include pmem fixes.
I added patch 1 to introduce a page refcount helper for ext4 and xfs as
suggested by Christoph Hellwig.
I also applied Christoph Hellwig's other suggested changes for removing
the devmap_managed_key, etc.

Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c   |  8 +--
 fs/ext4/inode.c|  2 +-
 fs/xfs/xfs_file.c  |  2 +-
 include/linux/dax.h|  5 ++
 include/linux/memremap.h   |  7 ++-
 include/linux/mm.h | 44 --
 lib/test_hmm.c |  2 +-
 mm/gup.c   | 44 --
 mm/internal.h  |  8 +++
 mm/memremap.c  | 82 ++
 mm/migrate.c   |  5 --
 mm/page_alloc.c|  3 +
 mm/swap.c  | 46 +++
 15 files changed, 54 insertions(+), 208 deletions(-)

-- 
2.20.1



[PATCH 2/2] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-25 Thread Ralph Campbell
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 include/linux/dax.h|  2 +-
 include/linux/memremap.h   |  7 ++-
 include/linux/mm.h | 44 --
 lib/test_hmm.c |  2 +-
 mm/gup.c   | 44 --
 mm/internal.h  |  8 +++
 mm/memremap.c  | 82 ++
 mm/migrate.c   |  5 --
 mm/page_alloc.c|  3 +
 mm/swap.c  | 46 +++
 12 files changed, 44 insertions(+), 203 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 7705d5557239..e6ec98325fab 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 4e8112fde3e6..ca2e3c3edc36 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -323,7 +323,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
+   init_page_count(page);
lock_page(page);
return page;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3f78ed78d1d6..8d29f38645aa 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -240,7 +240,7 @@ static inline bool dax_mapping(struct address_space 
*mapping)
 
 static inline bool dax_layout_is_idle_page(struct page *page)
 {
-   return page_ref_count(page) <= 1;
+   return page_ref_count(page) == 0;
 }
 
 #endif
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5862746751b..f9224f88e4cd 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -65,9 +65,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count
+* should be reset to one with init_page_count(page) before reusing
+* the page. This allows the device driver to implement its own
+* memory management.
 */
void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b2f370f0b420..2159c2477aa3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,39 +1092,6 @@ static inline bool is_zone_device_page(const struct page 
*page)
 }
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   if (!static_branch_unlikely(_managed_key))
-   return false;
-   if (!is_zone_device_page(page))
-   return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static inline bool is_device_private_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1171,17 +1138,6 @@ static inline void put_page(struct page *page)
 {
page = compound_head(page);
 
-   /*
-* For devmap managed pages we need to catch refcount transition from
-* 2 to 1, when refcount reach one it means the page is free and we
-* need to inform the device driver through callback. See
-* include/linux/memremap.h and HMM for details.
-*/
-   if (page_is_devmap_managed(page)) {
-   put_devmap_managed_page(page);
-   return;
-   }
-
if (pu

[PATCH 1/2] ext4/xfs: add page refcount helper

2020-09-25 Thread Ralph Campbell
There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell 
---
 fs/dax.c| 8 
 fs/ext4/inode.c | 2 +-
 fs/xfs/xfs_file.c   | 2 +-
 include/linux/dax.h | 5 +
 4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 994ab66a9907..8eddbcc0e149 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
-   if (page_ref_count(page) > 1)
+   if (!dax_layout_is_idle_page(page))
return page;
}
return NULL;
@@ -560,11 +560,11 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..d9f8ad55523a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3927,7 +3927,7 @@ int ext4_break_layouts(struct inode *inode)
return 0;
 
error = ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1,
+   dax_layout_is_idle_page(page),
TASK_INTERRUPTIBLE, 0, 0,
ext4_wait_dax_page(ei));
} while (error == 0);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a29f78a663ca..29ab96541bc1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -750,7 +750,7 @@ xfs_break_dax_layouts(
 
*retry = true;
return ___wait_var_event(>_refcount,
-   atomic_read(>_refcount) == 1, TASK_INTERRUPTIBLE,
+   dax_layout_is_idle_page(page), TASK_INTERRUPTIBLE,
0, 0, xfs_wait_dax_page(inode));
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 43b39ab9de1a..3f78ed78d1d6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -238,4 +238,9 @@ static inline bool dax_mapping(struct address_space 
*mapping)
return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+   return page_ref_count(page) <= 1;
+}
+
 #endif
-- 
2.20.1



Re: [PATCH v2] mm/hmm/test: use after free in dmirror_allocate_chunk()

2020-09-24 Thread Ralph Campbell



On 9/24/20 6:46 AM, Dan Carpenter wrote:

The error handling code does this:

err_free:
kfree(devmem);
 ^
err_release:
release_mem_region(devmem->pagemap.range.start, 
range_len(>pagemap.range));

The problem is that when we use "devmem->pagemap.range.start" the
"devmem" pointer is either NULL or freed.

Neither the allocation nor the call to request_free_mem_region() has to
be done under the lock so I moved those to the start of the function.

Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM")
Signed-off-by: Dan Carpenter 


Looks good.
Reviewed-by: Ralph Campbell 


Re: [PATCH] mm/hmm/test: use after free in dmirror_allocate_chunk()

2020-09-22 Thread Ralph Campbell



On 9/22/20 1:12 AM, Dan Carpenter wrote:

The error handling code does this:

err_free:
kfree(devmem);
 ^
err_release:
release_mem_region(devmem->pagemap.range.start, 
range_len(>pagemap.range));

The problem is that when we use "devmem->pagemap.range.start" the
"devmem" pointer is either NULL or freed.

Neither the allocation nor the call to request_free_mem_region() has to
be done under the lock so I moved those to the start of the function.

Fixes: b2ef9f5a5cb3 ("mm/hmm/test: add selftest driver for HMM")
Signed-off-by: Dan Carpenter 
---
It's weird that I didn't catch the use after free when this code was
merged in May...  My bad.  Not sure what happened there.  How I found
this was that I have been reviewing release_mem_region() leaks and the
NULL dereference path is a leak.



Thanks for fixing this. I missed it too. :-)


  lib/test_hmm.c | 47 ---
  1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c8133f50160b..0503c78cb322 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -459,6 +459,22 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
unsigned long pfn_last;
void *ptr;
  
+	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);

+   if (!devmem)
+   return -ENOMEM;
+
+   res = request_free_mem_region(_resource, DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+   if (IS_ERR(res))
+   goto err_devmem;
+
+   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+   devmem->pagemap.range.start = res->start;
+   devmem->pagemap.range.end = res->end;
+   devmem->pagemap.nr_range = 1;
+   devmem->pagemap.ops = _devmem_ops;
+   devmem->pagemap.owner = mdevice;
+
mutex_lock(>devmem_lock);
  
  	if (mdevice->devmem_count == mdevice->devmem_capacity) {

@@ -471,30 +487,16 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
sizeof(new_chunks[0]) * new_capacity,
GFP_KERNEL);
if (!new_chunks)


Need to call mutex_unlock(>devmem_lock).
In fact, why not make this goto err_unlock and add
err_unlock: mutex_unlock() before the err_release:.


-   goto err;
+   goto err_release;>   
mdevice->devmem_capacity = new_capacity;
mdevice->devmem_chunks = new_chunks;
}
  
-	res = request_free_mem_region(_resource, DEVMEM_CHUNK_SIZE,

-   "hmm_dmirror");
-   if (IS_ERR(res))
-   goto err;
-
-   devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
-   if (!devmem)
-   goto err_release;
-
-   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-   devmem->pagemap.range.start = res->start;
-   devmem->pagemap.range.end = res->end;
-   devmem->pagemap.nr_range = 1;
-   devmem->pagemap.ops = _devmem_ops;
-   devmem->pagemap.owner = mdevice;
-
ptr = memremap_pages(>pagemap, numa_node_id());
-   if (IS_ERR(ptr))
-   goto err_free;
+   if (IS_ERR(ptr)) {
+   mutex_unlock(>devmem_lock);
+   goto err_release;
+   }


This could then be just goto err_unlock.


devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -525,12 +527,11 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
  
  	return true;
  
-err_free:

-   kfree(devmem);
  err_release:
release_mem_region(devmem->pagemap.range.start, 
range_len(>pagemap.range));
-err:
-   mutex_unlock(>devmem_lock);
+err_devmem:
+   kfree(devmem);
+
return false;
  }
  


With the suggested change, you can add
Reviewed-by: Ralph Campbell 



[PATCH] mm: move call to compound_head() in release_pages()

2020-09-17 Thread Ralph Campbell
The function is_huge_zero_page() doesn't call compound_head() to make sure
the page pointer is a head page. The call to is_huge_zero_page() in
release_pages() is made before compound_head() is called so the test would
fail if release_pages() was called with a tail page of the huge_zero_page
and put_page_testzero() would be called releasing the page.
This is unlikely to be happening in normal use or we would be seeing all
sorts of process data corruption when accessing a THP zero page.

Looking at other places where is_huge_zero_page() is called, all seem to
only pass a head page so I think the right solution is to move the call
to compound_head() in release_pages() to a point before calling
is_huge_zero_page().

Signed-off-by: Ralph Campbell 
---

I found this by code inspection while working on my patch
("mm: remove extra ZONE_DEVICE struct page refcount").

This applies cleanly on the latest linux-mm and is for Andrew Morton's
tree.

 mm/swap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index eca95afe7ad4..7e79829a2e73 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -889,6 +889,7 @@ void release_pages(struct page **pages, int nr)
locked_pgdat = NULL;
}
 
+   page = compound_head(page);
if (is_huge_zero_page(page))
continue;
 
@@ -910,7 +911,6 @@ void release_pages(struct page **pages, int nr)
}
}
 
-   page = compound_head(page);
if (!put_page_testzero(page))
continue;
 
-- 
2.20.1



Re: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-16 Thread Ralph Campbell



On 9/15/20 11:09 PM, Christoph Hellwig wrote:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 517751310dd2..5a82037a4b26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page 
*page)
  #ifdef CONFIG_DEV_PAGEMAP_OPS
  void free_devmap_managed_page(struct page *page);
  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);


The export for devmap_managed_key can be dropped now.  In fact I think
we can remove devmap_managed_key entirely now - it is only checked in
the actual page free path instead of for each refcount manipulation,
so a good old unlikely is probably enough.

Also free_devmap_managed_page can move to mm/internal.h.


Good suggestion.


+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static void __put_devmap_managed_page(struct page *page)
+{
+   if (!static_branch_unlikely(_managed_key))
+   return;
+
+   switch (page->pgmap->type) {
+   case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_FS_DAX:
+   free_devmap_managed_page(page);
+   break;
+   default:
+   break;
+   }
+}
+#else
+static inline void __put_devmap_managed_page(struct page *page)
+{
+}
+#endif


I think this should be moved to mm/memremap.c or even better
actually be folded into free_devmap_managed_page, which would need
a new name like free_zone_device_page().

Something like this incremental patch:


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7bb9e93cf86cde..29350dc27cd0cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1090,11 +1090,6 @@ static inline bool is_zone_device_page(const struct page 
*page)
  }
  #endif
  
-#ifdef CONFIG_DEV_PAGEMAP_OPS

-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
  static inline bool is_device_private_page(const struct page *page)
  {
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/internal.h b/mm/internal.h
index 6345b08ce86ccf..629959a6f26d7c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -618,4 +618,12 @@ struct migration_target_control {
gfp_t gfp_mask;
  };
  
+#ifdef CONFIG_DEV_PAGEMAP_OPS

+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
  #endif/* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index d549e3733f4098..b15ad2264a4f1c 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
  #include 
  #include 
  #include 
+#include "internal.h"
  
  static DEFINE_XARRAY(pgmap_array);
  
@@ -37,36 +38,6 @@ unsigned long memremap_compat_align(void)

  EXPORT_SYMBOL_GPL(memremap_compat_align);
  #endif
  
-#ifdef CONFIG_DEV_PAGEMAP_OPS

-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(void)
-{
-   static_branch_dec(_managed_key);
-}
-
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-   if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
-   (!pgmap->ops || !pgmap->ops->page_free)) {
-   WARN(1, "Missing page_free method\n");
-   return -EINVAL;
-   }
-
-   static_branch_inc(_managed_key);
-   return 0;
-}
-#else
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-   return -EINVAL;
-}
-static void devmap_managed_enable_put(void)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
  static void pgmap_array_delete(struct range *range)
  {
xa_store_range(_array, PHYS_PFN(range->start), 
PHYS_PFN(range->end),
@@ -181,7 +152,6 @@ void memunmap_pages(struct dev_pagemap *pgmap)
pageunmap_range(pgmap, i);
  
  	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");

-   devmap_managed_enable_put();
  }
  EXPORT_SYMBOL_GPL(memunmap_pages);
  
@@ -319,7 +289,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)

.pgprot = PAGE_KERNEL,
};
const int nr_range = pgmap->nr_range;
-   bool need_devmap_managed = true;
int error, i;
  
  	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))

@@ -331,8 +300,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
WARN(1, "Device private memory not supported\n");
return ERR_PTR(-EINVAL);
}
-   if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
-   WARN(1, "Missing migrate_to_ram method\n");
+   if (!pgmap->ops ||
+   !pgmap->ops->migrate_to_ram || !pgmap->ops->page_free) {
+   WARN(1, "Missing ops\n");
return ERR_PTR(-EINVAL);
}
if (!pgmap->owner) {
@@ -348,11 +318,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_GENERIC:
-  

Re: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-16 Thread Ralph Campbell



On 9/15/20 11:10 PM, Christoph Hellwig wrote:

On Mon, Sep 14, 2020 at 04:10:38PM -0700, Dan Williams wrote:

You also need to fix up ext4_break_layouts() and
xfs_break_dax_layouts() to expect ->_refcount is 0 instead of 1. This
also needs some fstests exposure.


While we're at it, can we add a wait_fsdax_unref helper macro that hides
the _refcount access from the file systems?


Sure. I'll add a separate patch for it in v2.


Re: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-16 Thread Ralph Campbell



On 9/15/20 10:36 PM, Christoph Hellwig wrote:

On Tue, Sep 15, 2020 at 09:39:47AM -0700, Ralph Campbell wrote:

I don't think any of the three ->page_free instances even cares about
the page refcount.


Not true. The page_free() callback records the page is free by setting
a bit or putting the page on a free list but when it allocates a free
device private struct page to be used with migrate_vma_setup(), it needs to
increment the refcount.

For the ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA
struct pages, I think you are correct because they don't define page_free()
and from what I can see, don't decrement the page refcount to zero.


Umm, the whole point of ZONE_DEVICE is to have a struct page for
something that is not system memory.  For both the ppc kvm case (magic
hypervisor pool) and Noveau (device internal) memory that clear is the
case.  But looks like test_hmm uses normal pages to fake this up, so
I was wrong about the third caller.  But I think we can just call
set_page_count just before freeing the page there with a comment
explaining what is goin on.


Dan Williams thought that having the ZONE_DEVICE struct pages
be on a free list with a refcount of one was a bit strange and
that the driver should handle the zero to one transition.
But, that would mean a bit more invasive change to the 3 drivers
to set the reference count to zero after calling memremap_pages()
and setting the reference count to one when allocating a struct
page. What you are suggesting is what I also proposed in v1.


Re: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-15 Thread Ralph Campbell



On 9/15/20 9:29 AM, Christoph Hellwig wrote:

On Mon, Sep 14, 2020 at 04:53:25PM -0700, Ralph Campbell wrote:

Since set_page_refcounted() is defined in mm_interal.h I would have to
move the definition to someplace like page_ref.h or have the drivers
cal init_page_count() or set_page_count() since get_page() calls
VM_BUG_ON_PAGE() if refcount == 0.
I'll move set_page_refcounted() since that is what the page allocator
uses and seems named for the purpose.


I don't think any of the three ->page_free instances even cares about
the page refcount.


Not true. The page_free() callback records the page is free by setting
a bit or putting the page on a free list but when it allocates a free
device private struct page to be used with migrate_vma_setup(), it needs to
increment the refcount.

For the ZONE_DEVICE MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA
struct pages, I think you are correct because they don't define page_free()
and from what I can see, don't decrement the page refcount to zero.


Re: [PATCH] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-14 Thread Ralph Campbell



On 9/14/20 4:10 PM, Dan Williams wrote:

On Mon, Sep 14, 2020 at 3:45 PM Ralph Campbell  wrote:


ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell 
---

Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly/broken. This is my attempt to
fix it and it works for the HMM migration self tests.


Can you link to a technical description of what's broken? Or better
yet, summarize that argument in the changelog?


I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.


Back in the 4.15 days I could not convince myself that some code paths
blindly assumed that pages with refcount==0 were on an lru list. Since
then, struct page has been reorganized to not collide the ->pgmap back
pointer with the ->lru list and there have been other cleanups for
page pinning that might make this incremental cleanup viable.

You also need to fix up ext4_break_layouts() and
xfs_break_dax_layouts() to expect ->_refcount is 0 instead of 1. This
also needs some fstests exposure.


Got it. Thanks!


I have a modified THP migration patch series that applies on top of
this one and is cleaner since I don't have to add code to handle the
+1 reference count. The link below is for the earlier v2:
("mm/hmm/nouveau: add THP migration to migrate_vma_*")
https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampb...@nvidia.com


  arch/powerpc/kvm/book3s_hv_uvmem.c |  1 -
  drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 -
  include/linux/memremap.h   |  6 +--
  include/linux/mm.h | 39 ---
  lib/test_hmm.c |  1 -
  mm/gup.c   | 44 -
  mm/memremap.c  | 20 
  mm/migrate.c   |  5 --
  mm/swap.c  | 66 +++---
  9 files changed, 41 insertions(+), 142 deletions(-)


This diffstat is indeed appealing.



diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..00d97050d7ff 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)

 dpage = pfn_to_page(uvmem_pfn);
 dpage->zone_device_data = pvt;
-   get_page(dpage);
 lock_page(dpage);
 return dpage;
  out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a13c6215bba8..2a4bbe01a455 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 return NULL;
 }

-   get_page(page);
 lock_page(page);
 return page;
  }
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4e9c738f4b31..7dd9802d2612 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -67,9 +67,9 @@ enum memory_type {

  struct dev_pagemap_ops {
 /*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count is
+* reset to 1 before calling page_free(). This allows the
+* device driver to implement its own memory management.


I'd clarify the order events / responsibility of the common core
page_free() and the device specific page_free(). At the same time, why
not update drivers to expect that the page is already refcount==0 on
entry? Seems odd to go through all this trouble to make the reference
count appear to be zero to the wider kernel but expect that drivers
get a fake reference on entry to their ->page_free() callbacks.


Good point.

Since set_page_refcounted() is defined in mm_interal.h I would have to
move the definition to someplace like page_ref.h or have the drivers
cal init_page_count() or set_page_count() since get_page() calls
VM_BUG_ON_PAGE() if refcount == 0.
I'll move set_page_refcounted() since that is what the page allocator
uses and seems named for the purpose.



[PATCH] mm: remove extra ZONE_DEVICE struct page refcount

2020-09-14 Thread Ralph Campbell
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell 
---

Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly/broken. This is my attempt to
fix it and it works for the HMM migration self tests.

I'm only sending this out as a RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages() but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.

I have a modified THP migration patch series that applies on top of
this one and is cleaner since I don't have to add code to handle the
+1 reference count. The link below is for the earlier v2:
("mm/hmm/nouveau: add THP migration to migrate_vma_*")
https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampb...@nvidia.com


 arch/powerpc/kvm/book3s_hv_uvmem.c |  1 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 -
 include/linux/memremap.h   |  6 +--
 include/linux/mm.h | 39 ---
 lib/test_hmm.c |  1 -
 mm/gup.c   | 44 -
 mm/memremap.c  | 20 
 mm/migrate.c   |  5 --
 mm/swap.c  | 66 +++---
 9 files changed, 41 insertions(+), 142 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..00d97050d7ff 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a13c6215bba8..2a4bbe01a455 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
lock_page(page);
return page;
 }
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4e9c738f4b31..7dd9802d2612 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -67,9 +67,9 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count is
+* reset to 1 before calling page_free(). This allows the
+* device driver to implement its own memory management.
 */
void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 517751310dd2..5a82037a4b26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1093,34 +1093,6 @@ static inline bool is_zone_device_page(const struct page 
*page)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   if (!static_branch_unlikely(_managed_key))
-   return false;
-   if (!is_zone_device_page(page))
-   return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
-   return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1169,17 +1141,6 @@ static inline void put_page(struct page *page)
 {
page = compound_head(page);
 
-   /*
-* For devmap managed pages we need to catch refcount transition from
-* 2 to 1, when refcount reach one it means the page is free and we
-* need to inform the device driver through callback. See
-* include/linux/memremap.h and HMM for details.
-*/
-   if (page_is_

[PATCH] hmm/test: remove unused dmirror_zero_page

2020-09-14 Thread Ralph Campbell
The variable dmirror_zero_page is unused in the HMM self test driver
which was probably intended to demonstrate how a driver could use
migrate_vma_setup() to share a single read-only device private zero page
similar to how the CPU does. However, this isn't needed for the self tests
so remove it.

Signed-off-by: Ralph Campbell 
---

This applies to linux-mm and is intended for Andrew Morton's git tree.

 lib/test_hmm.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e3065d6123f0..c8133f50160b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -36,7 +36,6 @@
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
-static struct page *dmirror_zero_page;
 
 struct dmirror_device;
 
@@ -1127,17 +1126,6 @@ static int __init hmm_dmirror_init(void)
goto err_chrdev;
}
 
-   /*
-* Allocate a zero page to simulate a reserved page of device private
-* memory which is always zero. The zero_pfn page isn't used just to
-* make the code here simpler (i.e., we need a struct page for it).
-*/
-   dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-   if (!dmirror_zero_page) {
-   ret = -ENOMEM;
-   goto err_chrdev;
-   }
-
pr_info("HMM test module loaded. This is only for testing HMM.\n");
return 0;
 
@@ -1153,8 +1141,6 @@ static void __exit hmm_dmirror_exit(void)
 {
int id;
 
-   if (dmirror_zero_page)
-   __free_page(dmirror_zero_page);
for (id = 0; id < DMIRROR_NDEVICES; id++)
dmirror_device_remove(dmirror_devices + id);
unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
-- 
2.20.1



[PATCH] mm/doc: add usage description for migrate_vma_*()

2020-09-09 Thread Ralph Campbell
The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
API usage by device drivers is not well documented.
Add a description for how device drivers are expected to use it.

Signed-off-by: Ralph Campbell 
---

There shouldn't be any merge conflict with my previous patch which
touched hmm.rst but since Jonathan Corbet took my last patch, perhaps he
would like to take this one through his tree too.

 Documentation/vm/hmm.rst | 137 +--
 1 file changed, 133 insertions(+), 4 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 7453acc1ea4f..dd9f76a4ef29 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -271,10 +271,139 @@ map those pages from the CPU side.
 Migration to and from device memory
 ===
 
-Because the CPU cannot access device memory, migration must use the device DMA
-engine to perform copy from and to device memory. For this we need to use
-migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() helpers.
-
+Because the CPU cannot access device memory directly, the device driver must
+use hardware DMA or device specific load/store instructions to migrate data.
+The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
+functions are designed to make drivers easier to write and to centralize common
+code across drivers.
+
+Before migrating pages to device private memory, special device private
+``struct page`` need to be created. These will be used as special "swap"
+page table entries so that a CPU process will fault if it tries to access
+a page that has been migrated to device private memory.
+
+These can be allocated and freed with::
+
+struct resource *res;
+struct dev_pagemap pagemap;
+
+res = request_free_mem_region(_resource, /* number of bytes */,
+  "name of driver resource");
+pagemap.type = MEMORY_DEVICE_PRIVATE;
+pagemap.range.start = res->start;
+pagemap.range.end = res->end;
+pagemap.nr_range = 1;
+pagemap.ops = _devmem_ops;
+memremap_pages(, numa_node_id());
+
+memunmap_pages();
+release_mem_region(pagemap.range.start, range_len());
+
+There are also devm_request_free_mem_region(), devm_memremap_pages(),
+devm_memunmap_pages(), and devm_release_mem_region() when the resources can
+be tied to a ``struct device``.
+
+The overall migration steps are similar to migrating NUMA pages within system
+memory (see :ref:`Page migration `) but the steps are split
+between device driver specific code and shared common code:
+
+1. ``mmap_read_lock()``
+
+   The device driver has to pass a ``struct vm_area_struct`` to
+   migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to
+   be held for the duration of the migration.
+
+2. ``migrate_vma_setup(struct migrate_vma *args)``
+
+   The device driver initializes the ``struct migrate_vma`` fields and passes
+   the pointer to migrate_vma_setup(). The ``args->flags`` field is used to
+   filter which source pages should be migrated. For example, setting
+   ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and
+   ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in
+   device private memory. If the latter flag is set, the ``args->pgmap_owner``
+   field is used to identify device private pages owned by the driver. This
+   avoids trying to migrate device private pages residing in other devices.
+   Currently only anonymous private VMA ranges can be migrated to or from
+   system memory and device private memory.
+
+   One of the first steps migrate_vma_setup() does is to invalidate other
+   device's MMUs with the ``mmu_notifier_invalidate_range_start(()`` and
+   ``mmu_notifier_invalidate_range_end()`` calls around the page table
+   walks to fill in the ``args->src`` array with PFNs to be migrated.
+   The ``invalidate_range_start()`` callback is passed a
+   ``struct mmu_notifier_range`` with the ``event`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
+   allows the device driver to skip the invalidation callback and only
+   invalidate device private MMU mappings that are actually migrating.
+   This is explained more in the next section.
+
+   While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()``
+   entry results in a valid "zero" PFN stored in the ``args->src`` array.
+   This lets the driver allocate device private memory and clear it instead
+   of copying a page of zeros. Valid PTE entries to system memory or
+   device private struct pages will be locked with ``lock_page()``, isolated
+   from the LRU (if system memory since device private pages are not on
+   the LRU), unmapped from the process, and a special migration PTE is
+   inserted in place of the original PTE.

[PATCH v3] mm/thp: fix __split_huge_pmd_locked() for migration PMD

2020-09-03 Thread Ralph Campbell
A migrating transparent huge page has to already be unmapped. Otherwise,
the page could be modified while it is being copied to a new page and
data could be lost. The function __split_huge_pmd() checks for a PMD
migration entry before calling __split_huge_pmd_locked() leading one to
think that __split_huge_pmd_locked() can handle splitting a migrating PMD.
However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.
Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).
Fix these problems by checking for a PMD migration entry.

Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
cc: sta...@vger.kernel.org # 4.14+
Signed-off-by: Ralph Campbell 
Reviewed-by: Yang Shi 
Reviewed-by: Zi Yan 
---

No changes in v3 to this patch, just added reviewed-by and fixes to the
change log and sending this as a separate patch from the rest of the
series ("mm/hmm/nouveau: add THP migration to migrate_vma_*").
I'll hold off resending the series without this patch unless there are
changes needed.

 mm/huge_memory.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a468a4acb0a..606d712d9505 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2023,7 +2023,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
put_page(page);
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
-   } else if (is_huge_zero_pmd(*pmd)) {
+   } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
/*
 * FIXME: Do we want to invalidate secondary mmu by calling
 * mmu_notifier_invalidate_range() see comments below inside
@@ -2117,30 +2117,34 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
-   atomic_inc([i]._mapcount);
-   pte_unmap(pte);
-   }
-
-   /*
-* Set PG_double_map before dropping compound_mapcount to avoid
-* false-negative page_mapped().
-*/
-   if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
-   for (i = 0; i < HPAGE_PMD_NR; i++)
+   if (!pmd_migration)
atomic_inc([i]._mapcount);
+   pte_unmap(pte);
}
 
-   lock_page_memcg(page);
-   if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
-   /* Last compound_mapcount is gone. */
-   __dec_lruvec_page_state(page, NR_ANON_THPS);
-   if (TestClearPageDoubleMap(page)) {
-   /* No need in mapcount reference anymore */
+   if (!pmd_migration) {
+   /*
+* Set PG_double_map before dropping compound_mapcount to avoid
+* false-negative page_mapped().
+*/
+   if (compound_mapcount(page) > 1 &&
+   !TestSetPageDoubleMap(page)) {
for (i = 0; i < HPAGE_PMD_NR; i++)
-   atomic_dec([i]._mapcount);
+   atomic_inc([i]._mapcount);
+   }
+
+   lock_page_memcg(page);
+   if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+   /* Last compound_mapcount is gone. */
+   __dec_lruvec_page_state(page, NR_ANON_THPS);
+   if (TestClearPageDoubleMap(page)) {
+   /* No need in mapcount reference anymore */
+   for (i = 0; i < HPAGE_PMD_NR; i++)
+   atomic_dec([i]._mapcount);
+   }
}
+   unlock_page_memcg(page);
}
-   unlock_page_memcg(page);
 
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
-- 
2.20.1



Re: [PATCH 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics

2020-09-03 Thread Ralph Campbell



On 9/3/20 9:56 AM, Catalin Marinas wrote:

On Mon, Aug 17, 2020 at 02:49:43PM +0530, Anshuman Khandual wrote:

pmd_present() and pmd_trans_huge() are expected to behave in the following
manner during various phases of a given PMD. It is derived from a previous
detailed discussion on this topic [1] and present THP documentation [2].

pmd_present(pmd):

- Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
- Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)


The second bullet doesn't make much sense. If you have a pmd mapping of
some I/O memory, pmd_present() still returns true (as does
pte_present()).


diff --git a/arch/arm64/include/asm/pgtable-prot.h 
b/arch/arm64/include/asm/pgtable-prot.h
index 4d867c6446c4..28792fdd9627 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,13 @@
  #define PTE_DEVMAP(_AT(pteval_t, 1) << 57)
  #define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when 
!PTE_VALID */
  
+/*

+ * This help indicate that the entry is present i.e pmd_page()


Nit: add another . after i.e


Another nit: "This help indicate" => "This helper indicates"

Maybe I should look at the series more. :-)


[PATCH v2] mm/doc: editorial pass on page migration

2020-09-02 Thread Ralph Campbell
Add Sphinx reference links to HMM and CPUSETS, and numerous small
editorial changes to make the page_migration.rst document more readable.

Signed-off-by: Ralph Campbell 
---

The patch applies cleanly to the latest linux or linux-mm tree.
Since this is MM relatated, perhaps Andrew Morton would like to
take this into the linux-mm tree.

Changes in v2:
Applied suggestions from Randy Dunlap:
Replace outdated ftp:// link to https://github
Changed "off node" to "off-node" and "non-lru" to "non-LRU"

 .../admin-guide/cgroup-v1/cpusets.rst |   2 +
 Documentation/vm/hmm.rst  |   2 +-
 Documentation/vm/page_migration.rst   | 164 +-
 3 files changed, 87 insertions(+), 81 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst 
b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 7ade3abd342a..5d844ed4df69 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -1,3 +1,5 @@
+.. _cpusets:
+
 ===
 CPUSETS
 ===
diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 6f9e000757fa..7453acc1ea4f 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -1,4 +1,4 @@
-.. hmm:
+.. _hmm:
 
 =
 Heterogeneous Memory Management (HMM)
diff --git a/Documentation/vm/page_migration.rst 
b/Documentation/vm/page_migration.rst
index 68883ac485fa..91a98a6b43bb 100644
--- a/Documentation/vm/page_migration.rst
+++ b/Documentation/vm/page_migration.rst
@@ -4,25 +4,28 @@
 Page migration
 ==
 
-Page migration allows the moving of the physical location of pages between
-nodes in a numa system while the process is running. This means that the
+Page migration allows moving the physical location of pages between
+nodes in a NUMA system while the process is running. This means that the
 virtual addresses that the process sees do not change. However, the
 system rearranges the physical location of those pages.
 
-The main intend of page migration is to reduce the latency of memory access
+Also see :ref:`Heterogeneous Memory Management (HMM) `
+for migrating pages to or from device private memory.
+
+The main intent of page migration is to reduce the latency of memory accesses
 by moving pages near to the processor where the process accessing that memory
 is running.
 
 Page migration allows a process to manually relocate the node on which its
 pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
-a new memory policy via mbind(). The pages of process can also be relocated
+a new memory policy via mbind(). The pages of a process can also be relocated
 from another process using the sys_migrate_pages() function call. The
-migrate_pages function call takes two sets of nodes and moves pages of a
+migrate_pages() function call takes two sets of nodes and moves pages of a
 process that are located on the from nodes to the destination nodes.
 Page migration functions are provided by the numactl package by Andi Kleen
 (a version later than 0.9.3 is required. Get it from
-ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
-which provides an interface similar to other numa functionality for page
+https://github.com/numactl/numactl.git). numactl provides libnuma
+which provides an interface similar to other NUMA functionality for page
 migration.  cat ``/proc//numa_maps`` allows an easy review of where the
 pages of a process are located. See also the numa_maps documentation in the
 proc(5) man page.
@@ -30,19 +33,19 @@ proc(5) man page.
 Manual migration is useful if for example the scheduler has relocated
 a process to a processor on a distant node. A batch scheduler or an
 administrator may detect the situation and move the pages of the process
-nearer to the new processor. The kernel itself does only provide
+nearer to the new processor. The kernel itself only provides
 manual page migration support. Automatic page migration may be implemented
 through user space processes that move pages. A special function call
 "move_pages" allows the moving of individual pages within a process.
-A NUMA profiler may f.e. obtain a log showing frequent off node
+For example, A NUMA profiler may obtain a log showing frequent off-node
 accesses and may use the result to move pages to more advantageous
 locations.
 
 Larger installations usually partition the system using cpusets into
 sections of nodes. Paul Jackson has equipped cpusets with the ability to
 move pages when a task is moved to another cpuset (See
-Documentation/admin-guide/cgroup-v1/cpusets.rst).
-Cpusets allows the automation of process locality. If a task is moved to
+:ref:`CPUSETS `).
+Cpusets allow the automation of process locality. If a task is moved to
 a new cpuset then also all its pages are moved with it so that the
 performance of the process does not sink dramatically. 

Re: [PATCH v2 1/7] mm/thp: fix __split_huge_pmd_locked() for migration PMD

2020-09-02 Thread Ralph Campbell



On 9/2/20 2:47 PM, Zi Yan wrote:

On 2 Sep 2020, at 12:58, Ralph Campbell wrote:


A migrating transparent huge page has to already be unmapped. Otherwise,
the page could be modified while it is being copied to a new page and
data could be lost. The function __split_huge_pmd() checks for a PMD
migration entry before calling __split_huge_pmd_locked() leading one to
think that __split_huge_pmd_locked() can handle splitting a migrating PMD.
However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.
Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).
Fix these problems by checking for a PMD migration entry.

Signed-off-by: Ralph Campbell 


Thanks for the fix. You can add Reviewed-by: Zi Yan 

I think you also want to add the Fixes tag and cc stable.

Fixes 84c3fc4e9c56 (“mm: thp: check pmd migration entry in common path”)
cc: sta...@vger.kernel.org # 4.14+


Thanks, I'll add these.


Re: [PATCH] mm/doc: editorial pass on page migration

2020-09-02 Thread Ralph Campbell



On 9/2/20 12:41 PM, Randy Dunlap wrote:

Hey Ralph,

Thanks for the update/corrections. Nice job.

A few nits/comments below:

On 9/2/20 12:06 PM, Ralph Campbell wrote:

Add Sphinx reference links to HMM and CPUSETS, and numerous small
editorial changes to make the page_migration.rst document more readable.

Signed-off-by: Ralph Campbell 
---
  .../admin-guide/cgroup-v1/cpusets.rst |   2 +
  Documentation/vm/hmm.rst  |   2 +-
  Documentation/vm/page_migration.rst   | 150 +-
  3 files changed, 80 insertions(+), 74 deletions(-)




diff --git a/Documentation/vm/page_migration.rst 
b/Documentation/vm/page_migration.rst
index 68883ac485fa..bde21cd2f21f 100644
--- a/Documentation/vm/page_migration.rst
+++ b/Documentation/vm/page_migration.rst
@@ -4,25 +4,28 @@
  Page migration
  ==
  
-Page migration allows the moving of the physical location of pages between

-nodes in a numa system while the process is running. This means that the
+Page migration allows moving the physical location of pages between
+nodes in a NUMA system while the process is running. This means that the
  virtual addresses that the process sees do not change. However, the
  system rearranges the physical location of those pages.
  
-The main intend of page migration is to reduce the latency of memory access

+Also see :ref:`Heterogeneous Memory Management (HMM) `
+for migrating pages to or from device private memory.
+
+The main intent of page migration is to reduce the latency of memory accesses
  by moving pages near to the processor where the process accessing that memory
  is running.
  
  Page migration allows a process to manually relocate the node on which its

  pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
-a new memory policy via mbind(). The pages of process can also be relocated
+a new memory policy via mbind(). The pages of a process can also be relocated
  from another process using the sys_migrate_pages() function call. The
-migrate_pages function call takes two sets of nodes and moves pages of a
+migrate_pages() function call takes two sets of nodes and moves pages of a
  process that are located on the from nodes to the destination nodes.
  Page migration functions are provided by the numactl package by Andi Kleen
  (a version later than 0.9.3 is required. Get it from
  ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma


URL not valid/working AFAICT.


I'll update it to https://github.com/numactl/numactl.git


-which provides an interface similar to other numa functionality for page
+which provides an interface similar to other NUMA functionality for page
  migration.  cat ``/proc/<pid>/numa_maps`` allows an easy review of where the
  pages of a process are located. See also the numa_maps documentation in the
  proc(5) man page.
@@ -30,19 +33,19 @@ proc(5) man page.
  Manual migration is useful if for example the scheduler has relocated
  a process to a processor on a distant node. A batch scheduler or an
  administrator may detect the situation and move the pages of the process
-nearer to the new processor. The kernel itself does only provide
+nearer to the new processor. The kernel itself only provides
  manual page migration support. Automatic page migration may be implemented
  through user space processes that move pages. A special function call
  "move_pages" allows the moving of individual pages within a process.
-A NUMA profiler may f.e. obtain a log showing frequent off node
+For example, A NUMA profiler may obtain a log showing frequent off node


nit only: off-node


OK


  accesses and may use the result to move pages to more advantageous
  locations.
  
  Larger installations usually partition the system using cpusets into

  sections of nodes. Paul Jackson has equipped cpusets with the ability to
  move pages when a task is moved to another cpuset (See
-Documentation/admin-guide/cgroup-v1/cpusets.rst).
-Cpusets allows the automation of process locality. If a task is moved to
+:ref:`CPUSETS <cpusets>`).
+Cpusets allow the automation of process locality. If a task is moved to
  a new cpuset then also all its pages are moved with it so that the
  performance of the process does not sink dramatically. Also the pages
  of processes in a cpuset are moved if the allowed memory nodes of a
@@ -67,9 +70,9 @@ In kernel use of migrate_pages()
 Lists of pages to be migrated are generated by scanning over
 pages and moving them into lists. This is done by
 calling isolate_lru_page().
-   Calling isolate_lru_page increases the references to the page
+   Calling isolate_lru_page() increases the references to the page
 so that it cannot vanish while the page migration occurs.
-   It also prevents the swapper or other scans to encounter
+   It also prevents the swapper or other scans from encountering
 the page.
  
  2. We ne

[PATCH] mm/doc: editorial pass on page migration

2020-09-02 Thread Ralph Campbell
Add Sphinx reference links to HMM and CPUSETS, and numerous small
editorial changes to make the page_migration.rst document more readable.

Signed-off-by: Ralph Campbell 
---
 .../admin-guide/cgroup-v1/cpusets.rst |   2 +
 Documentation/vm/hmm.rst  |   2 +-
 Documentation/vm/page_migration.rst   | 150 +-
 3 files changed, 80 insertions(+), 74 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst 
b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 7ade3abd342a..5d844ed4df69 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -1,3 +1,5 @@
+.. _cpusets:
+
 ===
 CPUSETS
 ===
diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 6f9e000757fa..7453acc1ea4f 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -1,4 +1,4 @@
-.. hmm:
+.. _hmm:
 
 =
 Heterogeneous Memory Management (HMM)
diff --git a/Documentation/vm/page_migration.rst 
b/Documentation/vm/page_migration.rst
index 68883ac485fa..bde21cd2f21f 100644
--- a/Documentation/vm/page_migration.rst
+++ b/Documentation/vm/page_migration.rst
@@ -4,25 +4,28 @@
 Page migration
 ==
 
-Page migration allows the moving of the physical location of pages between
-nodes in a numa system while the process is running. This means that the
+Page migration allows moving the physical location of pages between
+nodes in a NUMA system while the process is running. This means that the
 virtual addresses that the process sees do not change. However, the
 system rearranges the physical location of those pages.
 
-The main intend of page migration is to reduce the latency of memory access
+Also see :ref:`Heterogeneous Memory Management (HMM) <hmm>`
+for migrating pages to or from device private memory.
+
+The main intent of page migration is to reduce the latency of memory accesses
 by moving pages near to the processor where the process accessing that memory
 is running.
 
 Page migration allows a process to manually relocate the node on which its
 pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
-a new memory policy via mbind(). The pages of process can also be relocated
+a new memory policy via mbind(). The pages of a process can also be relocated
 from another process using the sys_migrate_pages() function call. The
-migrate_pages function call takes two sets of nodes and moves pages of a
+migrate_pages() function call takes two sets of nodes and moves pages of a
 process that are located on the from nodes to the destination nodes.
 Page migration functions are provided by the numactl package by Andi Kleen
 (a version later than 0.9.3 is required. Get it from
 ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
-which provides an interface similar to other numa functionality for page
+which provides an interface similar to other NUMA functionality for page
 migration.  cat ``/proc/<pid>/numa_maps`` allows an easy review of where the
 pages of a process are located. See also the numa_maps documentation in the
 proc(5) man page.
@@ -30,19 +33,19 @@ proc(5) man page.
 Manual migration is useful if for example the scheduler has relocated
 a process to a processor on a distant node. A batch scheduler or an
 administrator may detect the situation and move the pages of the process
-nearer to the new processor. The kernel itself does only provide
+nearer to the new processor. The kernel itself only provides
 manual page migration support. Automatic page migration may be implemented
 through user space processes that move pages. A special function call
 "move_pages" allows the moving of individual pages within a process.
-A NUMA profiler may f.e. obtain a log showing frequent off node
+For example, A NUMA profiler may obtain a log showing frequent off node
 accesses and may use the result to move pages to more advantageous
 locations.
 
 Larger installations usually partition the system using cpusets into
 sections of nodes. Paul Jackson has equipped cpusets with the ability to
 move pages when a task is moved to another cpuset (See
-Documentation/admin-guide/cgroup-v1/cpusets.rst).
-Cpusets allows the automation of process locality. If a task is moved to
+:ref:`CPUSETS <cpusets>`).
+Cpusets allow the automation of process locality. If a task is moved to
 a new cpuset then also all its pages are moved with it so that the
 performance of the process does not sink dramatically. Also the pages
 of processes in a cpuset are moved if the allowed memory nodes of a
@@ -67,9 +70,9 @@ In kernel use of migrate_pages()
Lists of pages to be migrated are generated by scanning over
pages and moving them into lists. This is done by
calling isolate_lru_page().
-   Calling isolate_lru_page increases the references to the page
+   Calling isolate_lru_page() increases the references to the page
so that it cannot va

[PATCH v2 2/7] mm/migrate: move migrate_vma_collect_skip()

2020-09-02 Thread Ralph Campbell
Move the definition of migrate_vma_collect_skip() to make it callable
by migrate_vma_collect_hole(). This helps make the next patch easier
to read.

Signed-off-by: Ralph Campbell 
---
 mm/migrate.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 4f89360d9e77..ce16ed3deab6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2254,6 +2254,21 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
*mm,
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_DEVICE_PRIVATE
+static int migrate_vma_collect_skip(unsigned long start,
+   unsigned long end,
+   struct mm_walk *walk)
+{
+   struct migrate_vma *migrate = walk->private;
+   unsigned long addr;
+
+   for (addr = start; addr < end; addr += PAGE_SIZE) {
+   migrate->dst[migrate->npages] = 0;
+   migrate->src[migrate->npages++] = 0;
+   }
+
+   return 0;
+}
+
 static int migrate_vma_collect_hole(unsigned long start,
unsigned long end,
__always_unused int depth,
@@ -2282,21 +2297,6 @@ static int migrate_vma_collect_hole(unsigned long start,
return 0;
 }
 
-static int migrate_vma_collect_skip(unsigned long start,
-   unsigned long end,
-   struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   unsigned long addr;
-
-   for (addr = start; addr < end; addr += PAGE_SIZE) {
-   migrate->dst[migrate->npages] = 0;
-   migrate->src[migrate->npages++] = 0;
-   }
-
-   return 0;
-}
-
 static int migrate_vma_collect_pmd(pmd_t *pmdp,
   unsigned long start,
   unsigned long end,
-- 
2.20.1



[PATCH v2 3/7] mm: support THP migration to device private memory

2020-09-02 Thread Ralph Campbell
Support transparent huge page migration to ZONE_DEVICE private memory.
A new selection flag (MIGRATE_VMA_SELECT_COMPOUND) is added to request
THP migration. Otherwise, THPs are split when filling in the source PFN
array. A new flag (MIGRATE_PFN_COMPOUND) is added to the source PFN array
to indicate a huge page can be migrated. If the device driver can allocate
a huge page, it sets the MIGRATE_PFN_COMPOUND flag in the destination PFN
array. migrate_vma_pages() will fallback to PAGE_SIZE pages if
MIGRATE_PFN_COMPOUND is not set in both source and destination arrays.

Signed-off-by: Ralph Campbell 
---
 include/linux/huge_mm.h  |   7 +
 include/linux/memremap.h |   9 +
 include/linux/migrate.h  |   2 +
 mm/huge_memory.c | 113 ---
 mm/memory.c  |  10 +-
 mm/migrate.c | 413 ---
 mm/rmap.c|   2 +-
 7 files changed, 459 insertions(+), 97 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 8a8bc46a2432..87b42c81dedc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -192,6 +192,8 @@ bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_migrating_huge_page(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address, struct page *page);
 static inline int split_huge_page(struct page *page)
 {
return split_huge_page_to_list(page, NULL);
@@ -454,6 +456,11 @@ static inline bool is_huge_zero_page(struct page *page)
return false;
 }
 
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+   return false;
+}
+
 static inline bool is_huge_zero_pud(pud_t pud)
 {
return false;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 4e9c738f4b31..5175b1eaea01 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -88,6 +88,15 @@ struct dev_pagemap_ops {
 * the page back to a CPU accessible page.
 */
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+   /*
+* Used for private (un-addressable) device memory only.
+* This is called when a compound device private page is split.
+* The driver uses this callback to set tail_page->pgmap and
+* tail_page->zone_device_data appropriately based on the head
+* page.
+*/
+   void (*page_split)(struct page *head, struct page *tail_page);
 };
 
 #define PGMAP_ALTMAP_VALID (1 << 0)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0f8d1583fa8e..92179bf360d1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -144,6 +144,7 @@ static inline int migrate_misplaced_transhuge_page(struct 
mm_struct *mm,
 #define MIGRATE_PFN_MIGRATE(1UL << 1)
 #define MIGRATE_PFN_LOCKED (1UL << 2)
 #define MIGRATE_PFN_WRITE  (1UL << 3)
+#define MIGRATE_PFN_COMPOUND   (1UL << 4)
 #define MIGRATE_PFN_SHIFT  6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -161,6 +162,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+   MIGRATE_VMA_SELECT_COMPOUND = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 606d712d9505..a8d48994481a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1644,23 +1644,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
} else {
struct page *page = NULL;
int flush_needed = 1;
+   bool is_anon = false;
 
if (pmd_present(orig_pmd)) {
page = pmd_page(orig_pmd);
+   is_anon = PageAnon(page);
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
swp_entry_t entry;
 
-   VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
entry = pmd_to_swp_entry(orig_pmd);
-   page = pfn_to_page(swp_offset(entry));
+   if (is_device_private_entry(entry)) {
+   page = device_private_entry_to_page(entry);
+   is_anon = PageAnon(page);
+   page_remove_rmap(page, true);
+   VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+   VM_BUG_ON_PAGE(!PageHead(page), page);
+   put_page(page);
+   } else {
+ 
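
[Aside: a rough caller-side sketch of the flag protocol described in the commit
message above, assuming a single-entry range; alloc_device_huge_page() and the
error handling are placeholders, not part of the patch.]

	struct page *dpage;

	args.flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_COMPOUND;
	if (migrate_vma_setup(&args))
		return;

	if ((args.src[0] & (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND)) ==
	    (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND) &&
	    (dpage = alloc_device_huge_page()) != NULL) {
		/* Source is a THP and a huge destination page is available. */
		args.dst[0] = migrate_pfn(page_to_pfn(dpage)) |
			      MIGRATE_PFN_LOCKED | MIGRATE_PFN_COMPOUND;
	} else {
		/*
		 * Otherwise fill dst[] with PAGE_SIZE pages;
		 * migrate_vma_pages() falls back to small pages when
		 * MIGRATE_PFN_COMPOUND is not set in both arrays.
		 */
	}
	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);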

[PATCH v2 0/7] mm/hmm/nouveau: add THP migration to migrate_vma_*

2020-09-02 Thread Ralph Campbell
This series adds support for transparent huge page migration to
migrate_vma_*() and adds nouveau SVM and HMM selftests as consumers.
An earlier version was posted previously [1]. This version now
supports splitting a THP midway in the migration process which
led to a number of changes.

The patches apply cleanly to the current linux-mm tree. Since there
are a couple of patches in linux-mm from Dan Williams that modify
lib/test_hmm.c and drivers/gpu/drm/nouveau/nouveau_dmem.c, it might
be easiest if Andrew could take these through the linux-mm tree
assuming that's OK with other maintainers like Ben Skeggs.

[1] https://lore.kernel.org/linux-mm/20200619215649.32297-1-rcampb...@nvidia.com

Ralph Campbell (7):
  mm/thp: fix __split_huge_pmd_locked() for migration PMD
  mm/migrate: move migrate_vma_collect_skip()
  mm: support THP migration to device private memory
  mm/thp: add prep_transhuge_device_private_page()
  mm/thp: add THP allocation helper
  mm/hmm/test: add self tests for THP migration
  nouveau: support THP migration to private memory

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 289 +++-
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  11 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/gfp.h|  10 +
 include/linux/huge_mm.h|  12 +
 include/linux/memremap.h   |   9 +
 include/linux/migrate.h|   2 +
 lib/test_hmm.c | 439 +
 lib/test_hmm_uapi.h|   3 +
 mm/huge_memory.c   | 177 +++---
 mm/memory.c|  10 +-
 mm/migrate.c   | 429 +++-
 mm/rmap.c  |   2 +-
 tools/testing/selftests/vm/hmm-tests.c | 404 +++
 14 files changed, 1519 insertions(+), 281 deletions(-)

-- 
2.20.1



[PATCH v2 6/7] mm/hmm/test: add self tests for THP migration

2020-09-02 Thread Ralph Campbell
Add some basic stand alone self tests for migrating system memory to device
private memory and back.

Signed-off-by: Ralph Campbell 
---
 lib/test_hmm.c | 439 +
 lib/test_hmm_uapi.h|   3 +
 tools/testing/selftests/vm/hmm-tests.c | 404 +++
 3 files changed, 777 insertions(+), 69 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e3065d6123f0..41c005c55bcf 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -67,6 +67,7 @@ struct dmirror {
struct xarray   pt;
struct mmu_interval_notifiernotifier;
struct mutexmutex;
+   __u64   flags;
 };
 
 /*
@@ -92,6 +93,7 @@ struct dmirror_device {
unsigned long   calloc;
unsigned long   cfree;
struct page *free_pages;
+   struct page *free_huge_pages;
spinlock_t  lock;   /* protects the above */
 };
 
@@ -451,6 +453,7 @@ static int dmirror_write(struct dmirror *dmirror, struct 
hmm_dmirror_cmd *cmd)
 }
 
 static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+  bool is_huge,
   struct page **ppage)
 {
struct dmirror_chunk *devmem;
@@ -504,28 +507,51 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
 
	mutex_unlock(&mdevice->devmem_lock);
 
-   pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 
0x%lx)\n",
+   pr_info("dev %u added %u MB (total %u chunks, %u MB) PFNs [0x%lx 
0x%lx)\n",
+   MINOR(mdevice->cdevice.dev),
DEVMEM_CHUNK_SIZE / (1024 * 1024),
mdevice->devmem_count,
mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
pfn_first, pfn_last);
 
	spin_lock(&mdevice->lock);
-   for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+   for (pfn = pfn_first; pfn < pfn_last; ) {
struct page *page = pfn_to_page(pfn);
 
+   if (is_huge && (pfn & (HPAGE_PMD_NR - 1)) == 0 &&
+   pfn + HPAGE_PMD_NR <= pfn_last) {
+   prep_transhuge_device_private_page(page);
+   page->zone_device_data = mdevice->free_huge_pages;
+   mdevice->free_huge_pages = page;
+   pfn += HPAGE_PMD_NR;
+   continue;
+   }
page->zone_device_data = mdevice->free_pages;
mdevice->free_pages = page;
+   pfn++;
}
if (ppage) {
-   *ppage = mdevice->free_pages;
-   mdevice->free_pages = (*ppage)->zone_device_data;
-   mdevice->calloc++;
+   if (is_huge) {
+   if (!mdevice->free_huge_pages)
+   goto err_unlock;
+   *ppage = mdevice->free_huge_pages;
+   mdevice->free_huge_pages = (*ppage)->zone_device_data;
+   mdevice->calloc += compound_nr(*ppage);
+   } else if (mdevice->free_pages) {
+   *ppage = mdevice->free_pages;
+   mdevice->free_pages = (*ppage)->zone_device_data;
+   mdevice->calloc++;
+   } else
+   goto err_unlock;
}
	spin_unlock(&mdevice->lock);
 
return true;
 
+err_unlock:
+   spin_unlock(&mdevice->lock);
+   return false;
+
 err_free:
kfree(devmem);
 err_release:
@@ -535,7 +561,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
return false;
 }
 
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice,
+ bool is_huge)
 {
struct page *dpage = NULL;
struct page *rpage;
@@ -550,17 +577,40 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
 
	spin_lock(&mdevice->lock);
 
-   if (mdevice->free_pages) {
+   if (is_huge && mdevice->free_huge_pages) {
+   dpage = mdevice->free_huge_pages;
+   mdevice->free_huge_pages = dpage->zone_device_data;
+   mdevice->calloc += compound_nr(dpage);
+   spin_unlock(&mdevice->lock);
+   } else if (!is_huge && mdevice->free_pages) {
dpage = mdevice->free_pages;
mdevice->free_pages = dpage->zone_device_data;
mdevice->calloc++;
	spin_unlock(&mdevice->lock);
} else {
	spin_unlock(&mdevice->lock);
-   if (!dmirror_allocate_chunk(mdevice, &dpage))
+ 
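
[Aside: the chunk bookkeeping above is hard to read in diff form. The pattern
in use is simply a singly linked free list threaded through
page->zone_device_data; a minimal sketch with made-up helper names:]

/* Push a free device private page (or compound head) onto a free list. */
static void devmem_push(struct page **free_list, struct page *page)
{
	page->zone_device_data = *free_list;
	*free_list = page;
}

/* Pop one entry; the caller re-purposes zone_device_data afterwards. */
static struct page *devmem_pop(struct page **free_list)
{
	struct page *page = *free_list;

	if (page)
		*free_list = page->zone_device_data;
	return page;
}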

[PATCH v2 7/7] nouveau: support THP migration to private memory

2020-09-02 Thread Ralph Campbell
Add support for migrating transparent huge pages to and from device
private memory.

Signed-off-by: Ralph Campbell 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 289 ++---
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  11 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 3 files changed, 215 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a13c6215bba8..78ad0ee77b3d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -82,6 +82,7 @@ struct nouveau_dmem {
struct list_head chunks;
struct mutex mutex;
struct page *free_pages;
+   struct page *free_huge_pages;
spinlock_t lock;
 };
 
@@ -112,8 +113,13 @@ static void nouveau_dmem_page_free(struct page *page)
struct nouveau_dmem *dmem = chunk->drm->dmem;
 
	spin_lock(&dmem->lock);
-   page->zone_device_data = dmem->free_pages;
-   dmem->free_pages = page;
+   if (PageHead(page)) {
+   page->zone_device_data = dmem->free_huge_pages;
+   dmem->free_huge_pages = page;
+   } else {
+   page->zone_device_data = dmem->free_pages;
+   dmem->free_pages = page;
+   }
 
WARN_ON(!chunk->callocated);
chunk->callocated--;
@@ -139,51 +145,100 @@ static void nouveau_dmem_fence_done(struct nouveau_fence 
**fence)
 
 static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
struct vm_fault *vmf, struct migrate_vma *args,
-   dma_addr_t *dma_addr)
+   struct page *spage, bool is_huge, dma_addr_t *dma_addr)
 {
+   struct nouveau_svmm *svmm = spage->zone_device_data;
struct device *dev = drm->dev->dev;
-   struct page *dpage, *spage;
-   struct nouveau_svmm *svmm;
-
-   spage = migrate_pfn_to_page(args->src[0]);
-   if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-   return 0;
+   struct page *dpage;
+   unsigned int i;
 
-   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+   if (is_huge)
+   dpage = alloc_transhugepage(vmf->vma, args->start);
+   else
+   dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
if (!dpage)
-   return VM_FAULT_SIGBUS;
-   lock_page(dpage);
+   return VM_FAULT_OOM;
+   WARN_ON_ONCE(compound_order(spage) != compound_order(dpage));
 
-   *dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   *dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, *dma_addr))
goto error_free_page;
 
-   svmm = spage->zone_device_data;
+   lock_page(dpage);
+   i = (vmf->address - args->start) >> PAGE_SHIFT;
+   spage += i;
	mutex_lock(&svmm->mutex);
nouveau_svmm_invalidate(svmm, args->start, args->end);
-   if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-   NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
+   if (drm->dmem->migrate.copy_func(drm, compound_nr(dpage),
+   NOUVEAU_APER_HOST, *dma_addr, NOUVEAU_APER_VRAM,
+   nouveau_dmem_page_addr(spage)))
goto error_dma_unmap;
	mutex_unlock(&svmm->mutex);
 
-   args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   args->dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+   if (is_huge)
+   args->dst[i] |= MIGRATE_PFN_COMPOUND;
return 0;
 
 error_dma_unmap:
	mutex_unlock(&svmm->mutex);
-   dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   unlock_page(dpage);
+   dma_unmap_page(dev, *dma_addr, page_size(dpage), DMA_BIDIRECTIONAL);
 error_free_page:
__free_page(dpage);
return VM_FAULT_SIGBUS;
 }
 
+static vm_fault_t nouveau_dmem_fault_chunk(struct nouveau_drm *drm,
+   struct vm_fault *vmf, struct migrate_vma *args)
+{
+   struct device *dev = drm->dev->dev;
+   struct nouveau_fence *fence;
+   struct page *spage;
+   unsigned long src = args->src[0];
+   bool is_huge = (src & (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND)) ==
+   (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND);
+   unsigned long dma_page_size;
+   dma_addr_t dma_addr;
+   vm_fault_t ret = 0;
+
+   spage = migrate_pfn_to_page(src);
+   if (!spage) {
+   ret = VM_FAULT_SIGBUS;
+   goto out;
+   }
+   if (is_huge) {
+   dma_page_size = PMD_SIZE;
+   ret = nouveau_dmem_fault_copy_one(drm, vmf, args, spage, true,
+ 

[PATCH v2 5/7] mm/thp: add THP allocation helper

2020-09-02 Thread Ralph Campbell
Transparent huge page allocation policy is controlled by several sysfs
variables. Rather than expose these to each device driver that needs to
allocate THPs, provide a helper function.

Signed-off-by: Ralph Campbell 
---
 include/linux/gfp.h | 10 ++
 mm/huge_memory.c| 14 ++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..6faf4ea5501b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -562,6 +562,16 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int 
order,
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
 #define alloc_page_vma_node(gfp_mask, vma, addr, node) \
alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
+   unsigned long addr);
+#else
+static inline struct page *alloc_transhugepage(struct vm_area_struct *vma,
+   unsigned long addr)
+{
+   return NULL;
+}
+#endif
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e848cc0c3dc..e4e1fe199dc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -764,6 +764,20 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
return __do_huge_pmd_anonymous_page(vmf, page, gfp);
 }
 
+struct page *alloc_transhugepage(struct vm_area_struct *vma,
+unsigned long haddr)
+{
+   gfp_t gfp;
+   struct page *page;
+
+   gfp = alloc_hugepage_direct_gfpmask(vma);
+   page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+   if (page)
+   prep_transhuge_page(page);
+   return page;
+}
+EXPORT_SYMBOL_GPL(alloc_transhugepage);
+
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
pgtable_t pgtable)
-- 
2.20.1
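
[Aside: a minimal usage sketch of the new helper, loosely mirroring how the
nouveau patch later in this series calls it from a fault path; the fallback
policy and the vma/addr variables are illustrative only.]

	struct page *dpage;

	dpage = alloc_transhugepage(vma, addr & HPAGE_PMD_MASK);
	if (!dpage) {
		/* No THP available: fall back to a single small page. */
		dpage = alloc_page_vma(GFP_HIGHUSER, vma, addr);
		if (!dpage)
			return VM_FAULT_OOM;
	}
	lock_page(dpage);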



[PATCH v2 4/7] mm/thp: add prep_transhuge_device_private_page()

2020-09-02 Thread Ralph Campbell
Add a helper function to allow device drivers to create device private
transparent huge pages. This is intended to help support device private
THP migrations.

Signed-off-by: Ralph Campbell 
---
 include/linux/huge_mm.h | 5 +
 mm/huge_memory.c| 8 
 2 files changed, 13 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 87b42c81dedc..126e54da4fee 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -187,6 +187,7 @@ extern unsigned long thp_get_unmapped_area(struct file 
*filp,
unsigned long flags);
 
 extern void prep_transhuge_page(struct page *page);
+extern void prep_transhuge_device_private_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 bool is_transparent_hugepage(struct page *page);
 
@@ -382,6 +383,10 @@ static inline bool transhuge_vma_suitable(struct 
vm_area_struct *vma,
 
 static inline void prep_transhuge_page(struct page *page) {}
 
+static inline void prep_transhuge_device_private_page(struct page *page)
+{
+}
+
 static inline bool is_transparent_hugepage(struct page *page)
 {
return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a8d48994481a..1e848cc0c3dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -498,6 +498,14 @@ void prep_transhuge_page(struct page *page)
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
+void prep_transhuge_device_private_page(struct page *page)
+{
+   prep_compound_page(page, HPAGE_PMD_ORDER);
+   prep_transhuge_page(page);
+   percpu_ref_put_many(page->pgmap->ref, HPAGE_PMD_NR - 1);
+}
+EXPORT_SYMBOL_GPL(prep_transhuge_device_private_page);
+
 bool is_transparent_hugepage(struct page *page)
 {
if (!PageCompound(page))
-- 
2.20.1
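
[Aside: for illustration, this is (simplified) how the lib/test_hmm patch in
this series uses the helper when carving huge pages out of a newly added
device private chunk.]

	if ((pfn & (HPAGE_PMD_NR - 1)) == 0 && pfn + HPAGE_PMD_NR <= pfn_last) {
		struct page *head = pfn_to_page(pfn);

		/*
		 * Turn HPAGE_PMD_NR consecutive device private pages into
		 * one compound page ready for THP migration.
		 */
		prep_transhuge_device_private_page(head);
		pfn += HPAGE_PMD_NR;
	}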



[PATCH v2 1/7] mm/thp: fix __split_huge_pmd_locked() for migration PMD

2020-09-02 Thread Ralph Campbell
A migrating transparent huge page has to already be unmapped. Otherwise,
the page could be modified while it is being copied to a new page and
data could be lost. The function __split_huge_pmd() checks for a PMD
migration entry before calling __split_huge_pmd_locked() leading one to
think that __split_huge_pmd_locked() can handle splitting a migrating PMD.
However, the code always increments the page->_mapcount and adjusts the
memory control group accounting assuming the page is mapped.
Also, if the PMD entry is a migration PMD entry, the call to
is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).
Fix these problems by checking for a PMD migration entry.

Signed-off-by: Ralph Campbell 
---
 mm/huge_memory.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a468a4acb0a..606d712d9505 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2023,7 +2023,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
put_page(page);
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
-   } else if (is_huge_zero_pmd(*pmd)) {
+   } else if (pmd_trans_huge(*pmd) && is_huge_zero_pmd(*pmd)) {
/*
 * FIXME: Do we want to invalidate secondary mmu by calling
 * mmu_notifier_invalidate_range() see comments below inside
@@ -2117,30 +2117,34 @@ static void __split_huge_pmd_locked(struct 
vm_area_struct *vma, pmd_t *pmd,
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
-   atomic_inc(&page[i]._mapcount);
-   pte_unmap(pte);
-   }
-
-   /*
-* Set PG_double_map before dropping compound_mapcount to avoid
-* false-negative page_mapped().
-*/
-   if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
-   for (i = 0; i < HPAGE_PMD_NR; i++)
+   if (!pmd_migration)
				atomic_inc(&page[i]._mapcount);
+   pte_unmap(pte);
}
 
-   lock_page_memcg(page);
-   if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
-   /* Last compound_mapcount is gone. */
-   __dec_lruvec_page_state(page, NR_ANON_THPS);
-   if (TestClearPageDoubleMap(page)) {
-   /* No need in mapcount reference anymore */
+   if (!pmd_migration) {
+   /*
+* Set PG_double_map before dropping compound_mapcount to avoid
+* false-negative page_mapped().
+*/
+   if (compound_mapcount(page) > 1 &&
+   !TestSetPageDoubleMap(page)) {
for (i = 0; i < HPAGE_PMD_NR; i++)
-   atomic_dec(&page[i]._mapcount);
+   atomic_inc(&page[i]._mapcount);
+   }
+
+   lock_page_memcg(page);
+   if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+   /* Last compound_mapcount is gone. */
+   __dec_lruvec_page_state(page, NR_ANON_THPS);
+   if (TestClearPageDoubleMap(page)) {
+   /* No need in mapcount reference anymore */
+   for (i = 0; i < HPAGE_PMD_NR; i++)
+   atomic_dec(&page[i]._mapcount);
+   }
}
+   unlock_page_memcg(page);
}
-   unlock_page_memcg(page);
 
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
-- 
2.20.1
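
[Aside: to spell out the pfn point from the commit message, for a migration
entry the pfn lives in the swap entry, not in the PMD itself, so the two cases
have to be handled roughly as follows (sketch, not code from the patch):]

	if (pmd_present(*pmd))
		pfn = pmd_pfn(*pmd);		/* normal THP mapping */
	else if (is_pmd_migration_entry(*pmd))
		pfn = migration_entry_to_pfn(pmd_to_swp_entry(*pmd));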



[PATCH 1/2] mm/migrate: remove unnecessary is_zone_device_page() check

2020-08-31 Thread Ralph Campbell
The check for is_zone_device_page() and is_device_private_page() is
unnecessary since the latter is sufficient to determine if the page
is a device private page. Simplify the code for easier reading.

Signed-off-by: Ralph Campbell 
---
 mm/migrate.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 659d3d8a3e1f..fe339a847328 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -246,13 +246,11 @@ static bool remove_migration_pte(struct page *page, 
struct vm_area_struct *vma,
else if (pte_swp_uffd_wp(*pvmw.pte))
pte = pte_mkuffd_wp(pte);
 
-   if (unlikely(is_zone_device_page(new))) {
-   if (is_device_private_page(new)) {
-   entry = make_device_private_entry(new, 
pte_write(pte));
-   pte = swp_entry_to_pte(entry);
-   if (pte_swp_uffd_wp(*pvmw.pte))
-   pte = pte_swp_mkuffd_wp(pte);
-   }
+   if (unlikely(is_device_private_page(new))) {
+   entry = make_device_private_entry(new, pte_write(pte));
+   pte = swp_entry_to_pte(entry);
+   if (pte_swp_uffd_wp(*pvmw.pte))
+   pte = pte_swp_mkuffd_wp(pte);
}
 
 #ifdef CONFIG_HUGETLB_PAGE
-- 
2.20.1
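
[Aside: the reason the inner check alone is enough is that
is_device_private_page() already implies the zone-device test. Paraphrased
from memory, so treat the exact form as an assumption, the helper reads
roughly:]

static inline bool is_device_private_page(const struct page *page)
{
	return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
		is_zone_device_page(page) &&
		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
}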



[PATCH 0/2] mm/migrate: preserve soft dirty in remove_migration_pte()

2020-08-31 Thread Ralph Campbell
Two small patches for Andrew Morton's mm tree.
I happened to notice this from code inspection after seeing Alistair
Popple's patch ("mm/rmap: Fixup copying of soft dirty and uffd ptes").

Ralph Campbell (2):
  mm/migrate: remove unnecessary is_zone_device_page() check
  mm/migrate: preserve soft dirty in remove_migration_pte()

 mm/migrate.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

-- 
2.20.1



[PATCH 2/2] mm/migrate: preserve soft dirty in remove_migration_pte()

2020-08-31 Thread Ralph Campbell
The code to remove a migration PTE and replace it with a device private
PTE was not copying the soft dirty bit from the migration entry.
This could lead to page contents not being marked dirty when faulting
the page back from device private memory.

Signed-off-by: Ralph Campbell 
---
 mm/migrate.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index fe339a847328..4f89360d9e77 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -249,6 +249,8 @@ static bool remove_migration_pte(struct page *page, struct 
vm_area_struct *vma,
if (unlikely(is_device_private_page(new))) {
entry = make_device_private_entry(new, pte_write(pte));
pte = swp_entry_to_pte(entry);
+   if (pte_swp_soft_dirty(*pvmw.pte))
+   pte = pte_swp_mksoft_dirty(pte);
if (pte_swp_uffd_wp(*pvmw.pte))
pte = pte_swp_mkuffd_wp(pte);
}
-- 
2.20.1



[PATCH v2] nouveau: fix the start/end range for migration

2020-08-31 Thread Ralph Campbell
The user level OpenCL code shouldn't have to align start and end
addresses to a page boundary. That is better handled in the nouveau
driver. The npages field is also redundant since it can be computed
from the start and end addresses.

Signed-off-by: Ralph Campbell 
---

This is for Ben Skegg's nouveau tree.

I have been working with Karol Herbst on the OpenCL mesa changes for
nouveau which will be merged upstream soon.
With or without those changes, the user visible effect of this patch
only extends the range by one page (round up vs. round down to page
boundary).

Changes in v2:
I changed the start/end check to require a size so start has to be less
than end.

 drivers/gpu/drm/nouveau/nouveau_svm.c | 14 +++---
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 2df1c0460559..4f69e4c3dafd 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -105,11 +105,11 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
struct nouveau_cli *cli = nouveau_cli(file_priv);
struct drm_nouveau_svm_bind *args = data;
unsigned target, cmd, priority;
-   unsigned long addr, end, size;
+   unsigned long addr, end;
struct mm_struct *mm;
 
args->va_start &= PAGE_MASK;
-   args->va_end &= PAGE_MASK;
+   args->va_end = ALIGN(args->va_end, PAGE_SIZE);
 
/* Sanity check arguments */
if (args->reserved0 || args->reserved1)
@@ -118,8 +118,6 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
return -EINVAL;
if (args->va_start >= args->va_end)
return -EINVAL;
-   if (!args->npages)
-   return -EINVAL;
 
cmd = args->header >> NOUVEAU_SVM_BIND_COMMAND_SHIFT;
cmd &= NOUVEAU_SVM_BIND_COMMAND_MASK;
@@ -151,12 +149,6 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
if (args->stride)
return -EINVAL;
 
-   size = ((unsigned long)args->npages) << PAGE_SHIFT;
-   if ((args->va_start + size) <= args->va_start)
-   return -EINVAL;
-   if ((args->va_start + size) > args->va_end)
-   return -EINVAL;
-
/*
 * Ok we are ask to do something sane, for now we only support migrate
 * commands but we will add things like memory policy (what to do on
@@ -171,7 +163,7 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
return -EINVAL;
}
 
-   for (addr = args->va_start, end = args->va_start + size; addr < end;) {
+   for (addr = args->va_start, end = args->va_end; addr < end;) {
struct vm_area_struct *vma;
unsigned long next;
 
-- 
2.20.1
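
[Aside: a concrete example of the new rounding, with made-up addresses and a
4 KiB page size.]

	/* va_start = 0x100123, va_end = 0x102001 */
	args->va_start &= PAGE_MASK;                   /* 0x100000, round down */
	args->va_end = ALIGN(args->va_end, PAGE_SIZE); /* 0x103000, round up   */
	/* The old code masked va_end down to 0x102000, losing the last page. */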



Re: [PATCH] nouveau: fix the start/end range for migration

2020-08-31 Thread Ralph Campbell



On 8/31/20 11:02 AM, Jason Gunthorpe wrote:

On Mon, Aug 31, 2020 at 10:21:41AM -0700, Ralph Campbell wrote:


On 8/31/20 4:51 AM, Jason Gunthorpe wrote:

On Thu, Aug 27, 2020 at 02:37:44PM -0700, Ralph Campbell wrote:

The user level OpenCL code shouldn't have to align start and end
addresses to a page boundary. That is better handled in the nouveau
driver. The npages field is also redundant since it can be computed
from the start and end addresses.

Signed-off-by: Ralph Campbell 

This is for Ben Skegg's nouveau tree.

I have been working with Karol Herbst on the OpenCL mesa changes for
nouveau which will be merged upstream soon.
With or without those changes, the user visible effect of this patch
only extends the range by one page (round up vs. round down to page
boundary).

   drivers/gpu/drm/nouveau/nouveau_svm.c | 17 ++---
   1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 2df1c0460559..888aa0908c5a 100644
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -105,11 +105,14 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
struct nouveau_cli *cli = nouveau_cli(file_priv);
struct drm_nouveau_svm_bind *args = data;
unsigned target, cmd, priority;
-   unsigned long addr, end, size;
+   unsigned long addr, end;
struct mm_struct *mm;
args->va_start &= PAGE_MASK;
-   args->va_end &= PAGE_MASK;
+   args->va_end = ALIGN(args->va_end, PAGE_SIZE);
+   /* If no end address is given, assume a single page. */
+   if (args->va_end == 0)
+   args->va_end = args->va_start + PAGE_SIZE;


That is really weird, how is it useful for the kernel to map a region
of unknown size and alignment to the GPU?

Jason


I agree it is somewhat weird. The OpenCL 2.2 specification says that
clEnqueueSVMMigrateMem() takes an array of pointers and sizes (in bytes)
but the size is optional. There is no alignment required.
This "works" because the pointers have to be allocated with clSVMAlloc()
and presumably, the implementation for clSVMAlloc()
keeps track of the length allocated and can fill that in if size is zero.
However, requiring all allocations to be made with clSVMAlloc() defeats the
goal of being able to use regular malloc() and mmap() allocations for OpenCL
implementations that support fine-grained system allocations.
(See https://github.com/KhronosGroup/OpenCL-Docs/issues/392)

So if the size isn't specified, the most logical choices are do nothing and
return OK, return an error, or assume that at least one byte is being migrated
and try to migrate it.


So if the app migrates the wrong memory then nothing bad happens, it
just might not get the performance from migration? Seems fine but
really weird.

Jason



Given the principle of least surprise, I'm thinking the better choice is to
return an error from the driver and leave any other actions to the user level
library. I'll post a v2.


Re: [PATCH] nouveau: fix the start/end range for migration

2020-08-31 Thread Ralph Campbell



On 8/31/20 4:51 AM, Jason Gunthorpe wrote:

On Thu, Aug 27, 2020 at 02:37:44PM -0700, Ralph Campbell wrote:

The user level OpenCL code shouldn't have to align start and end
addresses to a page boundary. That is better handled in the nouveau
driver. The npages field is also redundant since it can be computed
from the start and end addresses.

Signed-off-by: Ralph Campbell 

This is for Ben Skegg's nouveau tree.

I have been working with Karol Herbst on the OpenCL mesa changes for
nouveau which will be merged upstream soon.
With or without those changes, the user visible effect of this patch
only extends the range by one page (round up vs. round down to page
boundary).

  drivers/gpu/drm/nouveau/nouveau_svm.c | 17 ++---
  1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 2df1c0460559..888aa0908c5a 100644
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -105,11 +105,14 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
struct nouveau_cli *cli = nouveau_cli(file_priv);
struct drm_nouveau_svm_bind *args = data;
unsigned target, cmd, priority;
-   unsigned long addr, end, size;
+   unsigned long addr, end;
struct mm_struct *mm;
  
  	args->va_start &= PAGE_MASK;

-   args->va_end &= PAGE_MASK;
+   args->va_end = ALIGN(args->va_end, PAGE_SIZE);
+   /* If no end address is given, assume a single page. */
+   if (args->va_end == 0)
+   args->va_end = args->va_start + PAGE_SIZE;


That is really weird, how is it useful for the kernel to map a region
of unknown size and alignment to the GPU?

Jason


I agree it is somewhat weird. The OpenCL 2.2 specification says that
clEnqueueSVMMigrateMem() takes an array of pointers and sizes (in bytes)
but the size is optional. There is no alignment required.
This "works" because the pointers have to be allocated with clSVMAlloc()
and presumably, the implementation for clSVMAlloc()
keeps track of the length allocated and can fill that in if size is zero.
However, requiring all allocations to be made with clSVMAlloc() defeats the
goal of being able to use regular malloc() and mmap() allocations for OpenCL
implementations that support fine-grained system allocations.
(See https://github.com/KhronosGroup/OpenCL-Docs/issues/392)

So if the size isn't specified, the most logical choices are do nothing and
return OK, return an error, or assume that at least one byte is being migrated
and try to migrate it.


[PATCH] nouveau: fix the start/end range for migration

2020-08-27 Thread Ralph Campbell
The user level OpenCL code shouldn't have to align start and end
addresses to a page boundary. That is better handled in the nouveau
driver. The npages field is also redundant since it can be computed
from the start and end addresses.

Signed-off-by: Ralph Campbell 
---

This is for Ben Skegg's nouveau tree.

I have been working with Karol Herbst on the OpenCL mesa changes for
nouveau which will be merged upstream soon.
With or without those changes, the user visible effect of this patch
only extends the range by one page (round up vs. round down to page
boundary).

 drivers/gpu/drm/nouveau/nouveau_svm.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 2df1c0460559..888aa0908c5a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -105,11 +105,14 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
struct nouveau_cli *cli = nouveau_cli(file_priv);
struct drm_nouveau_svm_bind *args = data;
unsigned target, cmd, priority;
-   unsigned long addr, end, size;
+   unsigned long addr, end;
struct mm_struct *mm;
 
args->va_start &= PAGE_MASK;
-   args->va_end &= PAGE_MASK;
+   args->va_end = ALIGN(args->va_end, PAGE_SIZE);
+   /* If no end address is given, assume a single page. */
+   if (args->va_end == 0)
+   args->va_end = args->va_start + PAGE_SIZE;
 
/* Sanity check arguments */
if (args->reserved0 || args->reserved1)
@@ -118,8 +121,6 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
return -EINVAL;
if (args->va_start >= args->va_end)
return -EINVAL;
-   if (!args->npages)
-   return -EINVAL;
 
cmd = args->header >> NOUVEAU_SVM_BIND_COMMAND_SHIFT;
cmd &= NOUVEAU_SVM_BIND_COMMAND_MASK;
@@ -151,12 +152,6 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
if (args->stride)
return -EINVAL;
 
-   size = ((unsigned long)args->npages) << PAGE_SHIFT;
-   if ((args->va_start + size) <= args->va_start)
-   return -EINVAL;
-   if ((args->va_start + size) > args->va_end)
-   return -EINVAL;
-
/*
 * Ok we are ask to do something sane, for now we only support migrate
 * commands but we will add things like memory policy (what to do on
@@ -171,7 +166,7 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
return -EINVAL;
}
 
-   for (addr = args->va_start, end = args->va_start + size; addr < end;) {
+   for (addr = args->va_start, end = args->va_end; addr < end;) {
struct vm_area_struct *vma;
unsigned long next;
 
-- 
2.20.1



[PATCH] mm/migrate: remove cpages-- in migrate_vma_finalize()

2020-08-27 Thread Ralph Campbell
The variable struct migrate_vma->cpages is only used in
migrate_vma_setup(). There is no need to decrement it in
migrate_vma_finalize() since it is never checked.

Signed-off-by: Ralph Campbell 
---

This applies to linux-mm and is for Andrew Morton's tree.

 mm/migrate.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0b945c8031be..eb859f7a9811 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -3087,7 +3087,6 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
 
remove_migration_ptes(page, newpage, false);
unlock_page(page);
-   migrate->cpages--;
 
if (is_zone_device_page(page))
put_page(page);
-- 
2.20.1



[PATCH] mm/migrate: remove obsolete comment about device public

2020-08-27 Thread Ralph Campbell
Device public memory never had an in tree consumer and was removed in
commit 25b2995a35b6 ("mm: remove MEMORY_DEVICE_PUBLIC support").
Delete the obsolete comment.

Signed-off-by: Ralph Campbell 
---

This applies to linux-mm and is for Andrew Morton's tree.

 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index eb859f7a9811..34fdd25a26db 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -381,7 +381,7 @@ static int expected_page_refs(struct address_space 
*mapping, struct page *page)
int expected_count = 1;
 
/*
-* Device public or private pages have an extra refcount as they are
+* Device private pages have an extra refcount as they are
 * ZONE_DEVICE pages.
 */
expected_count += is_device_private_page(page);
-- 
2.20.1



[PATCH] mm/test: use the new SKIP() macro

2020-08-27 Thread Ralph Campbell
Some tests might not be able to be run if resources like huge pages are
not available. Mark these tests as skipped instead of simply passing.

Signed-off-by: Ralph Campbell 
---

This applies to linux-mm and is for Andrew Morton's tree.

 tools/testing/selftests/vm/hmm-tests.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 93fc5cadce61..0a28a6a29581 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -680,7 +680,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
 
n = gethugepagesizes(pagesizes, 4);
if (n <= 0)
-   return;
+   SKIP(return, "Huge page size could not be determined");
for (idx = 0; --n > 0; ) {
if (pagesizes[n] < pagesizes[idx])
idx = n;
@@ -694,7 +694,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
buffer->ptr = get_hugepage_region(size, GHR_STRICT);
if (buffer->ptr == NULL) {
free(buffer);
-   return;
+   SKIP(return, "Huge page could not be allocated");
}
 
buffer->fd = -1;
-- 
2.20.1



[PATCH] mm/migrate: fix migrate_pgmap_owner w/o CONFIG_MMU_NOTIFIER

2020-08-06 Thread Ralph Campbell
On x86_64, when CONFIG_MMU_NOTIFIER is not set/enabled, there is a
compiler error:

../mm/migrate.c: In function 'migrate_vma_collect':
../mm/migrate.c:2481:7: error: 'struct mmu_notifier_range' has no member
named 'migrate_pgmap_owner'
  range.migrate_pgmap_owner = migrate->pgmap_owner;
   ^

Fixes: 998427b3ad2c ("mm/notifier: add migration invalidation type")
Signed-off-by: Ralph Campbell 
Reported-by: Randy Dunlap 
---

This is based on the latest linux and is for Andrew Morton's mm tree.
MMU_NOTIFIER is selected automatically by a number of other config
options so I missed this in my own testing. Thanks to Randy Dunlap for
finding it.

 include/linux/mmu_notifier.h | 13 +
 mm/migrate.c |  6 +++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index c6f0708195cd..b8200782dede 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -521,6 +521,16 @@ static inline void mmu_notifier_range_init(struct 
mmu_notifier_range *range,
range->flags = flags;
 }
 
+static inline void mmu_notifier_range_init_migrate(
+   struct mmu_notifier_range *range, unsigned int flags,
+   struct vm_area_struct *vma, struct mm_struct *mm,
+   unsigned long start, unsigned long end, void *pgmap)
+{
+   mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
+   start, end);
+   range->migrate_pgmap_owner = pgmap;
+}
+
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
 ({ \
int __young;\
@@ -645,6 +655,9 @@ static inline void _mmu_notifier_range_init(struct 
mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
_mmu_notifier_range_init(range, start, end)
+#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
+   pgmap) \
+   _mmu_notifier_range_init(range, start, end)
 
 static inline bool
 mmu_notifier_range_blockable(const struct mmu_notifier_range *range)
diff --git a/mm/migrate.c b/mm/migrate.c
index 4fcc465736ff..d179657f8685 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2386,9 +2386,9 @@ static void migrate_vma_collect(struct migrate_vma 
*migrate)
 * that the registered device driver can skip invalidating device
 * private page mappings that won't be migrated.
 */
-   mmu_notifier_range_init(&range, MMU_NOTIFY_MIGRATE, 0, migrate->vma,
-   migrate->vma->vm_mm, migrate->start, migrate->end);
-   range.migrate_pgmap_owner = migrate->pgmap_owner;
+   mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
+   migrate->vma->vm_mm, migrate->start, migrate->end,
+   migrate->pgmap_owner);
	mmu_notifier_invalidate_range_start(&range);
 
walk_page_range(migrate->vma->vm_mm, migrate->start, migrate->end,
-- 
2.20.1
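
[Aside: for context, a driver-side consumer of the MMU_NOTIFY_MIGRATE event
might filter its interval-notifier callback along these lines. This is a
sketch based on the series description; the function name, the owner value,
and the "invalidate device mappings" step are placeholders.]

static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	/*
	 * Skip invalidations triggered by our own migrate_vma_setup()
	 * calls: those ranges only cover device private pages we are
	 * migrating ourselves, identified by the pgmap owner cookie.
	 */
	if (range->event == MMU_NOTIFY_MIGRATE &&
	    range->migrate_pgmap_owner == my_pgmap_owner)
		return true;

	/* ... otherwise invalidate the device's mappings for the range ... */
	return true;
}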



Re: linux-next: Tree for Aug 6 (mm/migrate.c)

2020-08-06 Thread Ralph Campbell



On 8/6/20 7:50 AM, Randy Dunlap wrote:

On 8/5/20 11:21 PM, Stephen Rothwell wrote:

Hi all,



on x86_64:

when CONFIG_MMU_NOTIFIER is not set/enabled:

../mm/migrate.c: In function 'migrate_vma_collect':
../mm/migrate.c:2481:7: error: 'struct mmu_notifier_range' has no member named 
'migrate_pgmap_owner'
   range.migrate_pgmap_owner = migrate->pgmap_owner;
^



Thanks for finding this.
I'm working on it.


Re: [PATCH v4 6/6] mm/migrate: remove range invalidation in migrate_vma_pages()

2020-07-31 Thread Ralph Campbell



On 7/31/20 12:15 PM, Jason Gunthorpe wrote:

On Tue, Jul 28, 2020 at 03:04:07PM -0700, Ralph Campbell wrote:


On 7/28/20 12:19 PM, Jason Gunthorpe wrote:

On Thu, Jul 23, 2020 at 03:30:04PM -0700, Ralph Campbell wrote:

When migrating the special zero page, migrate_vma_pages() calls
mmu_notifier_invalidate_range_start() before replacing the zero page
PFN in the CPU page tables. This is unnecessary since the range was
invalidated in migrate_vma_setup() and the page table entry is checked
to be sure it hasn't changed between migrate_vma_setup() and
migrate_vma_pages(). Therefore, remove the redundant invalidation.


I don't follow this logic, the purpose of the invalidation is also to
clear out anything that may be mirroring this VA, and "the page hasn't
changed" doesn't seem to rule out that case?

I'm also not sure I follow where the zero page came from?


The zero page comes from an anonymous private VMA that is read-only
and the user level CPU process tries to read the page data (or any
other read page fault).


Jason



The overall migration process is:

mmap_read_lock()

migrate_vma_setup()
   // invalidates range, locks/isolates pages, puts migration entry in page 
table



migrate_vma_pages()
   // moves source struct page info to destination struct page info.
   // clears migration flag for pages that can't be migrated.



migrate_vma_finalize()
   // replaces migration page table entry with destination page PFN.

mmap_read_unlock()

Since the address range is invalidated in the migrate_vma_setup() stage,
and the page is isolated from the LRU cache, locked, unmapped, and the page 
table
holds a migration entry (so the page can't be faulted and the CPU page table set
valid again), and there are no extra page references (pins), the page
"should not be modified".


That is the physical page though, it doesn't prove nobody else is
reading the PTE.
  

For pte_none()/is_zero_pfn() entries, migrate_vma_setup() leaves the
pte_none()/is_zero_pfn() entry in place but does still call
mmu_notifier_invalidate_range_start() for the whole range being migrated.


Ok..


In the migrate_vma_pages() step, the pte page table is locked and the
pte entry checked to be sure it is still pte_none/is_zero_pfn(). If not,
the new page isn't inserted. If it is still none/zero, the new device private
struct page is inserted into the page table, replacing the 
pte_none()/is_zero_pfn()
page table entry. The secondary MMUs were already invalidated in the 
migrate_vma_setup()
step and a pte_none() or zero page can't be modified so the only invalidation 
needed
is the CPU TLB(s) for clearing the special zero page PTE entry.


No, the secondary MMU was invalidated but the invalidation start/end
range was exited. That means a secondary MMU is immediately able to
reload the zero page into its MMU cache.

When this code replaces the PTE that has a zero page it also has to
invalidate again so that secondary MMU's are guaranteed to pick up the
new PTE value.

So, I still don't understand how this is safe?

Jason


Oops, you are right of course. I was only thinking of the device doing the 
migration
and forgetting about a second device faulting on the same page.
You can drop this patch from the series.


Re: linux-next: manual merge of the hmm tree with the drm tree

2020-07-30 Thread Ralph Campbell



On 7/30/20 5:03 AM, Jason Gunthorpe wrote:

On Thu, Jul 30, 2020 at 07:21:10PM +1000, Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the hmm tree got a conflict in:

   drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c

between commit:

   7763d24f3ba0 ("drm/nouveau/vmm/gp100-: fix mapping 2MB sysmem pages")

from the drm tree and commits:

   4725c6b82a48 ("nouveau: fix mapping 2MB sysmem pages")
   1a77decd0cae ("nouveau: fix storing invalid ptes")

from the hmm tree.

7763d24f3ba0 and 4725c6b82a48 are exactly the same patch.


Oh? Ralph? What happened here?


Ben did email me saying he was planning to take this patch into
his nouveau tree and I did reply to him saying you had also taken it
into your tree and that I had more nouveau/SVM patches for you on the way.
So, I'm not sure what happened.


Ben can you drop 7763d24f3ba0 ?

Jason



Re: [PATCH v4 6/6] mm/migrate: remove range invalidation in migrate_vma_pages()

2020-07-28 Thread Ralph Campbell



On 7/28/20 12:19 PM, Jason Gunthorpe wrote:

On Thu, Jul 23, 2020 at 03:30:04PM -0700, Ralph Campbell wrote:

When migrating the special zero page, migrate_vma_pages() calls
mmu_notifier_invalidate_range_start() before replacing the zero page
PFN in the CPU page tables. This is unnecessary since the range was
invalidated in migrate_vma_setup() and the page table entry is checked
to be sure it hasn't changed between migrate_vma_setup() and
migrate_vma_pages(). Therefore, remove the redundant invalidation.


I don't follow this logic, the purpose of the invalidation is also to
clear out anything that may be mirroring this VA, and "the page hasn't
changed" doesn't seem to rule out that case?

I'm also not sure I follow where the zero page came from?


The zero page comes from an anonymous private VMA that is read-only
and the user level CPU process tries to read the page data (or any
other read page fault).


Jason



The overall migration process is:

mmap_read_lock()

migrate_vma_setup()
  // invalidates range, locks/isolates pages, puts migration entry in page 
table



migrate_vma_pages()
  // moves source struct page info to destination struct page info.
  // clears migration flag for pages that can't be migrated.



migrate_vma_finalize()
  // replaces migration page table entry with destination page PFN.

mmap_read_unlock()

Since the address range is invalidated in the migrate_vma_setup() stage,
and the page is isolated from the LRU cache, locked, unmapped, and the page 
table
holds a migration entry (so the page can't be faulted and the CPU page table set
valid again), and there are no extra page references (pins), the page
"should not be modified".

For pte_none()/is_zero_pfn() entries, migrate_vma_setup() leaves the
pte_none()/is_zero_pfn() entry in place but does still call
mmu_notifier_invalidate_range_start() for the whole range being migrated.

In the migrate_vma_pages() step, the pte page table is locked and the
pte entry checked to be sure it is still pte_none/is_zero_pfn(). If not,
the new page isn't inserted. If it is still none/zero, the new device private
struct page is inserted into the page table, replacing the 
pte_none()/is_zero_pfn()
page table entry. The secondary MMUs were already invalidated in the 
migrate_vma_setup()
step and a pte_none() or zero page can't be modified so the only invalidation 
needed
is the CPU TLB(s) for clearing the special zero page PTE entry.

Two devices could both try to do the migrate_vma_*() sequence and proceed in 
parallel up
to the migrate_vma_pages() step and try to install a new page for the hole/zero 
PTE but
only one will win and the other fail.


Re: [PATCH v4 3/6] mm/notifier: add migration invalidation type

2020-07-28 Thread Ralph Campbell



On 7/28/20 12:15 PM, Jason Gunthorpe wrote:

On Thu, Jul 23, 2020 at 03:30:01PM -0700, Ralph Campbell wrote:

  static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -513,6 +519,7 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
range->start = start;
range->end = end;
range->flags = flags;
+   range->migrate_pgmap_owner = NULL;
  }


Since this function is commonly called and nobody should read
migrate_pgmap_owner unless MMU_NOTIFY_MIGRATE is set as the event,
this assignment can be dropped.

Jason


I agree.
Acked-by: Ralph Campbell 


[PATCH v4 3/6] mm/notifier: add migration invalidation type

2020-07-23 Thread Ralph Campbell
Currently migrate_vma_setup() calls mmu_notifier_invalidate_range_start()
which flushes all device private page mappings whether or not a page
is being migrated to/from device private memory. In order to not disrupt
device mappings that are not being migrated, shift the responsibility
for clearing device private mappings to the device driver and leave
CPU page table unmapping handled by migrate_vma_setup(). To support
this, the caller of migrate_vma_setup() should always set struct
migrate_vma::pgmap_owner to a non NULL value that matches the device
private page->pgmap->owner. This value is then passed to the struct
mmu_notifier_range with a new event type which the driver's invalidation
function can use to avoid device MMU invalidations.

Signed-off-by: Ralph Campbell 
---
 include/linux/migrate.h  | 3 +++
 include/linux/mmu_notifier.h | 7 +++
 mm/migrate.c | 8 +++-
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index aafec0ca7b41..4cc7097d0271 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -206,6 +206,9 @@ struct migrate_vma {
 * Set to the owner value also stored in page->pgmap->owner for
 * migrating out of device private memory. The flags also need to
 * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
+* The caller should always set this field when using mmu notifier
+* callbacks to avoid device MMU invalidations for device private
+* pages that are not being migrated.
 */
void*pgmap_owner;
unsigned long   flags;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index fc68f3570e19..1921fcf6be5b 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -38,6 +38,10 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_RELEASE: used during mmu_interval_notifier invalidate to signal
  * that the mm refcount is zero and the range is no longer accessible.
+ *
+ * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
+ * a device driver to possibly ignore the invalidation if the
+ * migrate_pgmap_owner field matches the driver's device private pgmap owner.
  */
 enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -46,6 +50,7 @@ enum mmu_notifier_event {
MMU_NOTIFY_PROTECTION_PAGE,
MMU_NOTIFY_SOFT_DIRTY,
MMU_NOTIFY_RELEASE,
+   MMU_NOTIFY_MIGRATE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -264,6 +269,7 @@ struct mmu_notifier_range {
unsigned long end;
unsigned flags;
enum mmu_notifier_event event;
+   void *migrate_pgmap_owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -513,6 +519,7 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
range->start = start;
range->end = end;
range->flags = flags;
+   range->migrate_pgmap_owner = NULL;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
diff --git a/mm/migrate.c b/mm/migrate.c
index e3ea68e3a08b..96e1f41a991e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2392,8 +2392,14 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 {
struct mmu_notifier_range range;
 
-   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL,
+   /*
+* Note that the pgmap_owner is passed to the mmu notifier callback so
+* that the registered device driver can skip invalidating device
+* private page mappings that won't be migrated.
+*/
+   mmu_notifier_range_init(&range, MMU_NOTIFY_MIGRATE, 0, migrate->vma,
migrate->vma->vm_mm, migrate->start, migrate->end);
+   range.migrate_pgmap_owner = migrate->pgmap_owner;
	mmu_notifier_invalidate_range_start(&range);
 
walk_page_range(migrate->vma->vm_mm, migrate->start, migrate->end,
-- 
2.20.1



[PATCH v4 0/6] mm/migrate: avoid device private invalidations

2020-07-23 Thread Ralph Campbell
The goal for this series is to avoid device private memory TLB
invalidations when migrating a range of addresses from system
memory to device private memory and some of those pages have already
been migrated. The approach taken is to introduce a new mmu notifier
invalidation event type and use that in the device driver to skip
invalidation callbacks from migrate_vma_setup(). The device driver is
also then expected to handle device MMU invalidations as part of the
migrate_vma_setup(), migrate_vma_pages(), migrate_vma_finalize() process.
Note that this is opt-in. A device driver can simply invalidate its MMU
in the mmu notifier callback and not handle MMU invalidations in the
migration sequence.

This series is based on Jason Gunthorpe's HMM tree (linux-5.8.0-rc4).

Also, this replaces the need for the following two patches I sent:
("mm: fix migrate_vma_setup() src_owner and normal pages")
https://lore.kernel.org/linux-mm/2020062008.9971-1-rcampb...@nvidia.com
("nouveau: fix mixed normal and device private page migration")
https://lore.kernel.org/lkml/20200622233854.10889-3-rcampb...@nvidia.com

Changes in v4:
Added reviewed-by from Bharata B Rao.
Removed dead code checking for source device private page in lib/test_hmm.c
  dmirror_migrate_alloc_and_copy() since the source filter flag guarantees
  that.
Added patch 6 to remove a redundant invalidation in migrate_vma_pages().

Changes in v3:
Changed the direction field "dir" to a "flags" field and renamed
  src_owner to pgmap_owner.
Fixed a locking issue in nouveau for the migration invalidation.
Added a HMM selftest test case to exercise the HMM test driver
  invalidation changes.
Removed reviewed-by Bharata B Rao since this version is moderately
  changed.

Changes in v2:
Rebase to Jason Gunthorpe's HMM tree.
Added reviewed-by from Bharata B Rao.
Rename the mmu_notifier_range::data field to migrate_pgmap_owner as
  suggested by Jason Gunthorpe.

Ralph Campbell (6):
  nouveau: fix storing invalid ptes
  mm/migrate: add a flags parameter to migrate_vma
  mm/notifier: add migration invalidation type
  nouveau/svm: use the new migration invalidation
  mm/hmm/test: use the new migration invalidation
  mm/migrate: remove range invalidation in migrate_vma_pages()

 arch/powerpc/kvm/book3s_hv_uvmem.c|  4 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c| 19 ++--
 drivers/gpu/drm/nouveau/nouveau_svm.c | 21 -
 drivers/gpu/drm/nouveau/nouveau_svm.h | 13 +-
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c| 13 --
 include/linux/migrate.h   | 16 +--
 include/linux/mmu_notifier.h  |  7 +++
 lib/test_hmm.c| 43 +--
 mm/migrate.c  | 34 +--
 tools/testing/selftests/vm/hmm-tests.c| 18 ++--
 10 files changed, 112 insertions(+), 76 deletions(-)

-- 
2.20.1



[PATCH v4 1/6] nouveau: fix storing invalid ptes

2020-07-23 Thread Ralph Campbell
When migrating a range of system memory to device private memory, some
of the pages in the address range may not be migrating. In this case,
the non-migrating pages won't have a new GPU MMU entry to store, but
the nvif_object_ioctl() NVIF_VMM_V0_PFNMAP method doesn't check the input
and stores a bad GPU page table entry that is nonetheless marked valid.
Fix this by skipping the invalid input PTEs when updating the GPU page
tables.

Signed-off-by: Ralph Campbell 
---
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
index ed37fddd063f..7eabe9fe0d2b 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c
@@ -79,8 +79,12 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
dma_addr_t addr;
 
nvkm_kmap(pt->memory);
-   while (ptes--) {
+   for (; ptes; ptes--, map->pfn++) {
u64 data = 0;
+
+   if (!(*map->pfn & NVKM_VMM_PFN_V))
+   continue;
+
if (!(*map->pfn & NVKM_VMM_PFN_W))
data |= BIT_ULL(6); /* RO. */
 
@@ -100,7 +104,6 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
}
 
VMM_WO064(pt, vmm, ptei++ * 8, data);
-   map->pfn++;
}
nvkm_done(pt->memory);
 }
@@ -310,9 +313,12 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
dma_addr_t addr;
 
nvkm_kmap(pt->memory);
-   while (ptes--) {
+   for (; ptes; ptes--, map->pfn++) {
u64 data = 0;
 
+   if (!(*map->pfn & NVKM_VMM_PFN_V))
+   continue;
+
if (!(*map->pfn & NVKM_VMM_PFN_W))
data |= BIT_ULL(6); /* RO. */
 
@@ -332,7 +338,6 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt,
}
 
VMM_WO064(pt, vmm, ptei++ * 16, data);
-   map->pfn++;
}
nvkm_done(pt->memory);
 }
-- 
2.20.1



[PATCH v4 6/6] mm/migrate: remove range invalidation in migrate_vma_pages()

2020-07-23 Thread Ralph Campbell
When migrating the special zero page, migrate_vma_pages() calls
mmu_notifier_invalidate_range_start() before replacing the zero page
PFN in the CPU page tables. This is unnecessary since the range was
invalidated in migrate_vma_setup() and the page table entry is checked
to be sure it hasn't changed between migrate_vma_setup() and
migrate_vma_pages(). Therefore, remove the redundant invalidation.

[PATCH v4 4/6] nouveau/svm: use the new migration invalidation

2020-07-23 Thread Ralph Campbell
Use the new MMU_NOTIFY_MIGRATE event to skip GPU MMU invalidations of
device private memory and handle the invalidation in the driver as part
of migrating device private memory.

Signed-off-by: Ralph Campbell 
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 15 ---
 drivers/gpu/drm/nouveau/nouveau_svm.c  | 21 +
 drivers/gpu/drm/nouveau/nouveau_svm.h  | 13 -
 3 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 78b9e3c2a5b3..511fe04cd680 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -140,6 +140,7 @@ static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
 {
struct device *dev = drm->dev->dev;
struct page *dpage, *spage;
+   struct nouveau_svmm *svmm;
 
spage = migrate_pfn_to_page(args->src[0]);
if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
@@ -154,14 +155,19 @@ static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
if (dma_mapping_error(dev, *dma_addr))
goto error_free_page;
 
+   svmm = spage->zone_device_data;
+   mutex_lock(&svmm->mutex);
+   nouveau_svmm_invalidate(svmm, args->start, args->end);
if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
goto error_dma_unmap;
+   mutex_unlock(&svmm->mutex);
 
args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
return 0;
 
 error_dma_unmap:
+   mutex_unlock(&svmm->mutex);
dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
 error_free_page:
__free_page(dpage);
@@ -531,7 +537,8 @@ nouveau_dmem_init(struct nouveau_drm *drm)
 }
 
 static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
-   unsigned long src, dma_addr_t *dma_addr, u64 *pfn)
+   struct nouveau_svmm *svmm, unsigned long src,
+   dma_addr_t *dma_addr, u64 *pfn)
 {
struct device *dev = drm->dev->dev;
struct page *dpage, *spage;
@@ -561,6 +568,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
goto out_free_page;
}
 
+   dpage->zone_device_data = svmm;
*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
if (src & MIGRATE_PFN_WRITE)
@@ -584,8 +592,8 @@ static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
unsigned long addr = args->start, nr_dma = 0, i;
 
for (i = 0; addr < args->end; i++) {
-   args->dst[i] = nouveau_dmem_migrate_copy_one(drm, args->src[i],
-   dma_addrs + nr_dma, pfns + i);
+   args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
+   args->src[i], dma_addrs + nr_dma, pfns + i);
if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
nr_dma++;
addr += PAGE_SIZE;
@@ -616,6 +624,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
struct migrate_vma args = {
.vma= vma,
.start  = start,
+   .pgmap_owner= drm->dev,
.flags  = MIGRATE_VMA_SELECT_SYSTEM,
};
unsigned long i;
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index c5f8ca6fb2e3..67b068ac57b2 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -93,17 +93,6 @@ nouveau_ivmm_find(struct nouveau_svm *svm, u64 inst)
return NULL;
 }
 
-struct nouveau_svmm {
-   struct mmu_notifier notifier;
-   struct nouveau_vmm *vmm;
-   struct {
-   unsigned long start;
-   unsigned long limit;
-   } unmanaged;
-
-   struct mutex mutex;
-};
-
 #define SVMM_DBG(s,f,a...)                                                   \
	NV_DEBUG((s)->vmm->cli->drm, "svm-%p: "f"\n", (s), ##a)
 #define SVMM_ERR(s,f,a...)                                                   \
@@ -246,7 +235,7 @@ nouveau_svmm_join(struct nouveau_svmm *svmm, u64 inst)
 }
 
 /* Invalidate SVMM address-range on GPU. */
-static void
+void
 nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit)
 {
if (limit > start) {
@@ -279,6 +268,14 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
if (unlikely(!svmm->vmm))
goto out;
 
+   /*
+* Ignore invalidation callbacks for device private pages since
+* the invalidation is handled as part of the migration proces

[PATCH v4 2/6] mm/migrate: add a flags parameter to migrate_vma

2020-07-23 Thread Ralph Campbell
The src_owner field in struct migrate_vma is being used for two purposes:
it acts as a selection filter for which types of pages are to be migrated,
and it identifies device private pages owned by the caller. Split this
into separate parameters so the src_owner field can be used just to
identify device private pages owned by the caller of migrate_vma_setup().
Rename the src_owner field to pgmap_owner to reflect it is now used only
to identify which device private pages to migrate.

Signed-off-by: Ralph Campbell 
Reviewed-by: Bharata B Rao 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c |  4 +++-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  4 +++-
 include/linux/migrate.h| 13 +
 lib/test_hmm.c | 15 ---
 mm/migrate.c   |  6 --
 5 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 09d8119024db..6850bd04bcb9 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -400,6 +400,7 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
mig.end = end;
	mig.src = &src_pfn;
	mig.dst = &dst_pfn;
+   mig.flags = MIGRATE_VMA_SELECT_SYSTEM;
 
/*
 * We come here with mmap_lock write lock held just for
@@ -577,7 +578,8 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start,
mig.end = end;
	mig.src = &src_pfn;
	mig.dst = &dst_pfn;
-   mig.src_owner = &kvmppc_uvmem_pgmap;
+   mig.pgmap_owner = &kvmppc_uvmem_pgmap;
+   mig.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
 
mutex_lock(>arch.uvmem_lock);
/* The requested page is already paged-out, nothing to do */
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e5c230d9ae24..78b9e3c2a5b3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -182,7 +182,8 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
.end= vmf->address + PAGE_SIZE,
	.src		= &src,
	.dst		= &dst,
-   .src_owner  = drm->dev,
+   .pgmap_owner= drm->dev,
+   .flags  = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
};
 
/*
@@ -615,6 +616,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
struct migrate_vma args = {
.vma= vma,
.start  = start,
+   .flags  = MIGRATE_VMA_SELECT_SYSTEM,
};
unsigned long i;
u64 *pfns;
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3e546cbf03dd..aafec0ca7b41 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -180,6 +180,11 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
 }
 
+enum migrate_vma_direction {
+   MIGRATE_VMA_SELECT_SYSTEM = (1UL << 0),
+   MIGRATE_VMA_SELECT_DEVICE_PRIVATE = (1UL << 1),
+};
+
 struct migrate_vma {
struct vm_area_struct   *vma;
/*
@@ -199,11 +204,11 @@ struct migrate_vma {
 
/*
 * Set to the owner value also stored in page->pgmap->owner for
-* migrating out of device private memory.  If set only device
-* private pages with this owner are migrated.  If not set
-* device private pages are not migrated at all.
+* migrating out of device private memory. The flags also need to
+* be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
 */
-   void*src_owner;
+   void*pgmap_owner;
+   unsigned long   flags;
 };
 
 int migrate_vma_setup(struct migrate_vma *args);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 9aa577afc269..e78a1414f58e 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -585,15 +585,6 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 */
spage = migrate_pfn_to_page(*src);
 
-   /*
-* Don't migrate device private pages from our own driver or
-* others. For our own we would do a device private memory copy
-* not a migration and for others, we would need to fault the
-* other device's page into system memory first.
-*/
-   if (spage && is_zone_device_page(spage))
-   continue;
-
dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;
@@ -702,7 +693,8 @@ static int dmirror_migrate(struct dmirror *dmirror,
args.dst = dst_pfns;
args.start = addr;
args.end = next;
-   args.src_owner = N

[PATCH v4 5/6] mm/hmm/test: use the new migration invalidation

2020-07-23 Thread Ralph Campbell
Use the new MMU_NOTIFY_MIGRATE event to skip MMU invalidations of device
private memory and handle the invalidation in the driver as part of
migrating device private memory.

Signed-off-by: Ralph Campbell 
---
 lib/test_hmm.c | 30 +++---
 tools/testing/selftests/vm/hmm-tests.c | 18 
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e78a1414f58e..e7dc3de355b7 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -214,6 +214,14 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
 {
struct dmirror *dmirror = container_of(mni, struct dmirror, notifier);
 
+   /*
+* Ignore invalidation callbacks for device private pages since
+* the invalidation is handled as part of the migration process.
+*/
+   if (range->event == MMU_NOTIFY_MIGRATE &&
+   range->migrate_pgmap_owner == dmirror->mdevice)
+   return true;
+
if (mmu_notifier_range_blockable(range))
		mutex_lock(&dmirror->mutex);
	else if (!mutex_trylock(&dmirror->mutex))
@@ -693,7 +701,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
args.dst = dst_pfns;
args.start = addr;
args.end = next;
-   args.pgmap_owner = NULL;
+   args.pgmap_owner = dmirror->mdevice;
args.flags = MIGRATE_VMA_SELECT_SYSTEM;
	ret = migrate_vma_setup(&args);
if (ret)
@@ -983,7 +991,7 @@ static void dmirror_devmem_free(struct page *page)
 }
 
 static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
-   struct dmirror_device *mdevice)
+ struct dmirror *dmirror)
 {
const unsigned long *src = args->src;
unsigned long *dst = args->dst;
@@ -1005,6 +1013,7 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
continue;
 
lock_page(dpage);
+   xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
copy_highpage(dpage, spage);
*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
if (*src & MIGRATE_PFN_WRITE)
@@ -1013,15 +1022,6 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
return 0;
 }
 
-static void dmirror_devmem_fault_finalize_and_map(struct migrate_vma *args,
- struct dmirror *dmirror)
-{
-   /* Invalidate the device's page table mapping. */
-   mutex_lock(&dmirror->mutex);
-   dmirror_do_update(dmirror, args->start, args->end);
-   mutex_unlock(&dmirror->mutex);
-}
-
 static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 {
struct migrate_vma args;
@@ -1051,11 +1051,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
	if (migrate_vma_setup(&args))
return VM_FAULT_SIGBUS;
 
-   ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror->mdevice);
+   ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
if (ret)
return ret;
	migrate_vma_pages(&args);
-   dmirror_devmem_fault_finalize_and_map(&args, dmirror);
+   /*
+* No device finalize step is needed since
+* dmirror_devmem_fault_alloc_and_copy() will have already
+* invalidated the device page table.
+*/
	migrate_vma_finalize(&args);
return 0;
 }
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index b533dd08da1d..91d38a29956b 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -881,8 +881,9 @@ TEST_F(hmm, migrate)
 }
 
 /*
- * Migrate anonymous memory to device private memory and fault it back to 
system
- * memory.
+ * Migrate anonymous memory to device private memory and fault some of it back
+ * to system memory, then try migrating the resulting mix of system and device
+ * private memory to the device.
  */
 TEST_F(hmm, migrate_fault)
 {
@@ -924,8 +925,17 @@ TEST_F(hmm, migrate_fault)
for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
ASSERT_EQ(ptr[i], i);
 
-   /* Fault pages back to system memory and check them. */
-   for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+   /* Fault half the pages back to system memory and check them. */
+   for (i = 0, ptr = buffer->ptr; i < size / (2 * sizeof(*ptr)); ++i)
+   ASSERT_EQ(ptr[i], i);
+
+   /* Migrate memory to the device again. */
+   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+
+   /* Check what the device read. */
+   for (i = 0, ptr
