Re: [PATCH 1/1] arch/fault: don't print logs for simulated poison errors

2024-05-09 Thread John Hubbard

On 5/9/24 1:39 PM, Axel Rasmussen wrote:

For real MCEs, various architectures print log messages when poisoned
memory is accessed (which results in a SIGBUS). These messages can be
important for users to understand the issue.

On the other hand, we have the userfaultfd UFFDIO_POISON operation,
which can "simulate" memory poisoning. That particular process will get
SIGBUS on access to the memory, but this effect is tied to an MM, rather
than being global like a real poison event. So, we don't want to log
about this case to the global kernel log; instead, let the process
itself log or whatever else it wants to do. This avoids spamming the
kernel log, and avoids e.g. drowning out real events with simulated
ones.
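
(For reference, the userspace side of such a simulated poison looks roughly
like this; a minimal sketch, assuming a kernel and uapi headers with
UFFDIO_POISON support, error handling omitted:)

#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Sketch: mark [addr, addr + len) as poisoned via an already-registered
 * userfaultfd descriptor 'uffd'. Later accesses raise SIGBUS in this mm
 * only; no global MCE state is involved.
 */
static int simulate_poison(int uffd, void *addr, size_t len)
{
	struct uffdio_poison poison = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = 0,
	};

	return ioctl(uffd, UFFDIO_POISON, &poison);
}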

To identify this situation, add a new VM_FAULT_HWPOISON_SIM flag. This
is expected to be set *in addition to* one of the existing
VM_FAULT_HWPOISON or VM_FAULT_HWPOISON_LARGE flags (which are mutually
exclusive).

Signed-off-by: Axel Rasmussen 
---
  arch/parisc/mm/fault.c   | 7 +++++--
  arch/powerpc/mm/fault.c  | 6 ++++--
  arch/x86/mm/fault.c      | 6 ++++--
  include/linux/mm_types.h | 5 +++++
  mm/hugetlb.c | 3 ++-
  mm/memory.c  | 2 +-
  6 files changed, 21 insertions(+), 8 deletions(-)



This completely fixes the uffd-unit-test behavior, I just did a quick
test run to be sure as well.

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA



diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index c39de84e98b0..e5370bcadf27 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -400,9 +400,12 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
  #ifdef CONFIG_MEMORY_FAILURE
if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
unsigned int lsb = 0;
-   printk(KERN_ERR
+
+   if (!(fault & VM_FAULT_HWPOISON_SIM)) {
+   pr_err(
"MCE: Killing %s:%d due to hardware memory corruption fault at %08lx\n",
-   tsk->comm, tsk->pid, address);
+   tsk->comm, tsk->pid, address);
+   }
/*
 * Either small page or large page may be poisoned.
 * In other words, VM_FAULT_HWPOISON_LARGE and
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 53335ae21a40..ac5e8a3c7fba 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -140,8 +140,10 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address,
if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
unsigned int lsb = 0; /* shutup gcc */
  
-		pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
-			current->comm, current->pid, address);
+   if (!(fault & VM_FAULT_HWPOISON_SIM)) {
+			pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+   current->comm, current->pid, address);
+   }
  
 		if (fault & VM_FAULT_HWPOISON_LARGE)
 			lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e4f3c7721f45..16d077a3ad14 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -928,9 +928,11 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
struct task_struct *tsk = current;
unsigned lsb = 0;
  
-		pr_err_ratelimited(
+   if (!(fault & VM_FAULT_HWPOISON_SIM)) {
+   pr_err_ratelimited(
"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
-   tsk->comm, tsk->pid, address);
+   tsk->comm, tsk->pid, address);
+   }
if (fault & VM_FAULT_HWPOISON_LARGE)
lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
if (fault & VM_FAULT_HWPOISON)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5240bd7bca33..7f8fc3efc5b2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1226,6 +1226,9 @@ typedef __bitwise unsigned int vm_fault_t;
  * @VM_FAULT_HWPOISON_LARGE:  Hit poisoned large page. Index encoded
  *in upper bits
  * @VM_FAULT_SIGSEGV: segmentation fault
+ * @VM_FAULT_HWPOISON_SIM: Hit poisoned PTE marker; this indicates a
+ * simulated poison (e.g. via userfaultfd's
+ *  UFFDIO_POISON), not a "real" hw error.
  * @VM_FAULT_NOPAGE:  ->fault installed the pte, not return page
  * @VM_FAULT_LOCKED:  ->fault locked the returned page

Re: [PATCH 0/1] arch/fault: don't print logs for simulated poison errors

2024-05-09 Thread John Hubbard

On 5/9/24 1:39 PM, Axel Rasmussen wrote:

This patch expects to be applied on top of both of the following related
fixes:

- x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs
   https://lore.kernel.org/r/20240507022939.236896-1-jhubb...@nvidia.com


This was, in effect, mostly nacked. I disagree with that but do not
intend to "appeal". :)


- [0/2] Minor fixups for hugetlb fault path
   https://lore.kernel.org/r/20240509100148.22384-1-osalva...@suse.de

The latter is in mm-unstable currently, but the former is not (yet?). It
would need to be taken before this patch for it to apply cleanly.

Axel Rasmussen (1):
   arch/fault: don't print logs for simulated poison errors

  arch/parisc/mm/fault.c   | 7 +++++--
  arch/powerpc/mm/fault.c  | 6 ++++--
  arch/x86/mm/fault.c      | 6 ++++--
  include/linux/mm_types.h | 5 +++++
  mm/hugetlb.c | 3 ++-
  mm/memory.c  | 2 +-
  6 files changed, 21 insertions(+), 8 deletions(-)

--
2.45.0.118.g7fe29c98d7-goog



thanks,
--
John Hubbard
NVIDIA



Re: [PATCH v1 3/3] mm: use "GUP-fast" instead "fast GUP" in remaining comments

2024-04-13 Thread John Hubbard

On 4/2/24 5:55 AM, David Hildenbrand wrote:

Let's fixup the remaining comments to consistently call that thing
"GUP-fast". With this change, we consistently call it "GUP-fast".

Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: David Hildenbrand 
---
  mm/filemap.c| 2 +-
  mm/khugepaged.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)


Yes, everything is changed over now, confirmed.

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA



Re: [PATCH v1 2/3] mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST

2024-04-13 Thread John Hubbard

On 4/2/24 5:55 AM, David Hildenbrand wrote:

Nowadays, we call it "GUP-fast", the external interface includes
functions like "get_user_pages_fast()", and we renamed all internal
functions to reflect that as well.

Let's make the config option reflect that.

Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: David Hildenbrand 
---
  arch/arm/Kconfig   |  2 +-
  arch/arm64/Kconfig |  2 +-
  arch/loongarch/Kconfig |  2 +-
  arch/mips/Kconfig  |  2 +-
  arch/powerpc/Kconfig   |  2 +-
  arch/riscv/Kconfig |  2 +-
  arch/s390/Kconfig  |  2 +-
  arch/sh/Kconfig|  2 +-
  arch/x86/Kconfig   |  2 +-
  include/linux/rmap.h   |  8 ++++----
  kernel/events/core.c   |  4 ++--
  mm/Kconfig |  2 +-
  mm/gup.c   | 10 +++++-----
  mm/internal.h  |  2 +-
  14 files changed, 22 insertions(+), 22 deletions(-)



Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA



Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-13 Thread John Hubbard

On 4/2/24 5:55 AM, David Hildenbrand wrote:

Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
all relevant internal functions to start with "gup_fast", to make it
clearer that this is not ordinary GUP. The current mixture of
"lockless", "gup" and "gup_fast" is confusing.

Further, avoid the term "huge" when talking about a "leaf" -- for
example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
"hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

The high-level internal functions for GUP-fast (+slow fallback) are now:
* internal_get_user_pages_fast() -> gup_fast_fallback()
* lockless_pages_from_mm() -> gup_fast()

The basic GUP-fast walker functions:
* gup_pgd_range() -> gup_fast_pgd_range()
* gup_p4d_range() -> gup_fast_p4d_range()
* gup_pud_range() -> gup_fast_pud_range()
* gup_pmd_range() -> gup_fast_pmd_range()
* gup_pte_range() -> gup_fast_pte_range()
* gup_huge_pgd()  -> gup_fast_pgd_leaf()
* gup_huge_pud()  -> gup_fast_pud_leaf()
* gup_huge_pmd()  -> gup_fast_pmd_leaf()


This is my favorite cleanup of 2024 so far. The above mix was confusing
even if one worked on this file all day long--you constantly have to
translate from function name to "is this fast or slow?". Whew.




The weird hugepd stuff:
* gup_huge_pd() -> gup_fast_hugepd()
* gup_hugepte() -> gup_fast_hugepte()

The weird devmap stuff:
* __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf()
* __gup_device_huge_pmd   -> gup_fast_devmap_pmd_leaf()
* __gup_device_huge() -> gup_fast_devmap_leaf()
* undo_dev_pagemap()  -> gup_fast_undo_dev_pagemap()

Helper functions:
* unpin_user_pages_lockless() -> gup_fast_unpin_user_pages()
* gup_fast_folio_allowed() is already properly named
* gup_fast_permitted() is already properly named

With "gup_fast()", we now even have a function that is referred to in
comment in mm/mmu_gather.c.

Reviewed-by: Jason Gunthorpe 
Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: David Hildenbrand 
---
  mm/gup.c | 205 ---
  1 file changed, 103 insertions(+), 102 deletions(-)



Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA



Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread John Hubbard

On 2/16/24 08:56, Catalin Marinas wrote:
...

The problem is that the contpte_* symbols are called from the ptep_* inline
functions. So where those inlines are called from modules, we need to make sure
the contpte_* symbols are available.

John Hubbard originally reported this problem against v1 and I enumerated all
the drivers that call into the ptep_* inlines here:
https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t

So they definitely need to be exported. Perhaps we can tighten it to


Yes. Let's keep the in-tree modules working.


EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break anything
out-of-tree. I'm not sure what the normal policy is? arm64 seems to use ~equal
amounts of both.


EXPORT_SYMBOL_GPL() seems appropriate and low risk. As Catalin says below,
these really are deeply core mm routines, and any module operating at this
level is not going to be able to survive on EXPORT_SYMBOL alone, IMHO.

Now, if only I could find an out of tree module to test that claim on... :)
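
(For the record, the choice being discussed is just which export macro the
contpte symbols get; a sketch, using one of the symbols in question, with the
signature taken from the series and possibly differing in detail:)

/* GPL-only export, the option favoured above: */
EXPORT_SYMBOL_GPL(contpte_ptep_get);

/* vs. the unrestricted export, also usable by non-GPL out-of-tree modules:
 *	EXPORT_SYMBOL(contpte_ptep_get);
 */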



I don't think we are consistent here. For example set_pte_at() can't be
called from non-GPL modules because of __sync_icache_dcache. OTOH, such
driver is probably doing something dodgy. Same with
apply_to_page_range(), it's GPL-only (called from i915).

Let's see if others have any view over the next week or so, otherwise
I'd go for _GPL and relax it later if someone has a good use-case (can
be a patch on top adding _GPL).


I think going directly to _GPL for these is fine, actually.


thanks,
--
John Hubbard
NVIDIA



Re: [PATCH v3 3/7] mm/hotplug: Allow architecture to override memmap on memory support check

2023-07-14 Thread John Hubbard

On 7/13/23 02:08, David Hildenbrand wrote:
...

    **WEAK_DECLARATION**
  Using weak declarations like __attribute__((weak)) or __weak
  can have unintended link defects.  Avoid using them.

...which seems deeply out of touch with how arch layers work these days,
doesn't it? (This is not rhetorical; I'm asking in order to get an
opinion or two on the topic.)


Did some digging:

commit 65d9a9a60fd71be964effb2e94747a6acb6e7015
Author: Naveen N. Rao 
Date:   Fri Jul 1 13:04:04 2022 +0530

     kexec_file: drop weak attribute from functions
     As requested
     (http://lkml.kernel.org/r/87ee0q7b92@email.froward.int.ebiederm.org),
     this series converts weak functions in kexec to use the #ifdef approach.
     Quoting the 3e35142ef99fe ("kexec_file: drop weak attribute from
     arch_kexec_apply_relocations[_add]") changelog:
     : Since commit d1bcae833b32f1 ("ELF: Don't generate unused section symbols")
     : [1], binutils (v2.36+) started dropping section symbols that it thought
     : were unused.  This isn't an issue in general, but with kexec_file.c, gcc
     : is placing kexec_arch_apply_relocations[_add] into a separate
     : .text.unlikely section and the section symbol ".text.unlikely" is being
     : dropped.  Due to this, recordmcount is unable to find a non-weak symbol in
     : .text.unlikely to generate a relocation record against.
     This patch (of 2);
     Drop __weak attribute from functions in kexec_file.c:
     - arch_kexec_kernel_image_probe()
     - arch_kimage_file_post_load_cleanup()
     - arch_kexec_kernel_image_load()
     - arch_kexec_locate_mem_hole()
     - arch_kexec_kernel_verify_sig()
     arch_kexec_kernel_image_load() calls into kexec_image_load_default(), so
     drop the static attribute for the latter.
     arch_kexec_kernel_verify_sig() is not overridden by any architecture, so
     drop the __weak attribute.
     Link: https://lkml.kernel.org/r/cover.1656659357.git.naveen.n@linux.vnet.ibm.com
     Link: https://lkml.kernel.org/r/2cd7ca1fe4d6bb6ca38e3283c717878388ed6788.1656659357.git.naveen.n@linux.vnet.ibm.com
     Signed-off-by: Naveen N. Rao 
     Suggested-by: Eric Biederman 
     Signed-off-by: Andrew Morton 
     Signed-off-by: Mimi Zohar 


So, in general, its use seems to be fine (unless some tool actually bails out).

https://lore.kernel.org/all/87ee0q7b92@email.froward.int.ebiederm.org/T/#u

Also mentions that __weak and non-__weak variants end up in the vmlinux. Did not
check if that's actually (still) the case.



OK, I looked at that commit and the associated discussion, and now have a
pretty clear picture of the preferred ways to do arch overrides.

Thanks for taking the time to look into it, and also to explain it.
Much appreciated!


thanks,
--
John Hubbard
NVIDIA



Re: [PATCH v3 3/7] mm/hotplug: Allow architecture to override memmap on memory support check

2023-07-12 Thread John Hubbard

On 7/11/23 09:09, David Hildenbrand wrote:
...

Can we make that a __weak function instead?


We can. It is confusing because we do have these two patterns within the kernel where we use

#ifndef x
#endif

vs

__weak x

What is the recommended way to override? I have mostly been using #ifndef for most of the arch overrides till now.



I think when placing the implementation in a C file, it's __weak. But don't ask me :)

We do this already for arch_get_mappable_range() in mm/memory_hotplug.c and IMHO it looks quite nice.



It does look nice. I always forget which parts are supposed to be
__weak, so I went to check Documentation/ , and it was quite
entertaining. There are only two search hits: one trivial reference in
Documentation/conf.py, and the other in checkpatch.rst, which says:

  **WEAK_DECLARATION**
Using weak declarations like __attribute__((weak)) or __weak
can have unintended link defects.  Avoid using them.

...which seems deeply out of touch with how arch layers work these days,
doesn't it? (This is not rhetorical; I'm asking in order to get an
opinion or two on the topic.)
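
(For context, the two override patterns being compared look roughly like this;
a sketch using the hook name from this patch, signature assumed:)

/* (a) #ifndef in a header: an arch that overrides the hook defines both
 * the function and a macro of the same name; otherwise this default is
 * used. E.g. in a header such as include/linux/memory_hotplug.h:
 */
#ifndef arch_supports_memmap_on_memory
static inline bool arch_supports_memmap_on_memory(unsigned long vmemmap_size)
{
	return true;
}
#endif

/* (b) __weak in a C file (the alternative to (a), not in addition to it):
 * this generic definition links unless an arch supplies a strong
 * definition of the same symbol. E.g. in mm/memory_hotplug.c:
 */
bool __weak arch_supports_memmap_on_memory(unsigned long vmemmap_size)
{
	return true;
}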


thanks,
--
John Hubbard
NVIDIA



Re: [PATCH mm-unstable v1 09/20] mm/gup: reliable R/O long-term pinning in COW mappings

2022-11-23 Thread John Hubbard

On 11/16/22 02:26, David Hildenbrand wrote:
...

With this change, the new R/O long-term pinning tests for non-anonymous
memory succeed:
   # [RUN] R/O longterm GUP pin ... with shared zeropage
   ok 151 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with memfd
   ok 152 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with tmpfile
   ok 153 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with huge zeropage
   ok 154 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
   ok 155 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
   ok 156 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with shared zeropage
   ok 157 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with memfd
   ok 158 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with tmpfile
   ok 159 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with huge zeropage
   ok 160 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
   ok 161 Longterm R/O pin is reliable
   # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
   ok 162 Longterm R/O pin is reliable


Yes. I was able to reproduce these results, after some minor distractions
involving huge pages, don't ask. :)



Note 1: We don't care about short-term R/O-pinning, because they have
snapshot semantics: they are not supposed to observe modifications that
happen after pinning.

As one example, assume we start direct I/O to read from a page and store
page content into a file: modifications to page content after starting
direct I/O are not guaranteed to end up in the file. So even if we'd pin
the shared zeropage, the end result would be as expected -- getting zeroes
stored to the file.

Note 2: For shared mappings we'll now always fallback to the slow path to
lookup the VMA when R/O long-term pinning. While that's the necessary price
we have to pay right now, it's actually not that bad in practice: most
FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with
FOLL_FORCE because they tried dealing with COW mappings correctly ...

Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE,
such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd
populate exclusive anon pages that we can pin. There was a concern that
this could affect the memlock limit of existing setups.

For example, a VM running with VFIO could run into the memlock limit and
fail to run. However, we essentially had the same behavior already in
commit 17839856fd58 ("gup: document and work around "COW can break either
way" issue") which got merged into some enterprise distros, and there were
not any such complaints. So most probably, we're fine.

Signed-off-by: David Hildenbrand 
---
  include/linux/mm.h | 27 ---
  mm/gup.c   | 10 +-
  mm/huge_memory.c   |  2 +-
  mm/hugetlb.c   |  7 ---
  4 files changed, 34 insertions(+), 12 deletions(-)



Looks good,

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6bd2ee5872dd..e8cc838f42f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3095,8 +3095,12 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
   * Must be called with the (sub)page that's actually referenced via the
   * page table entry, which might not necessarily be the head page for a
   * PTE-mapped THP.
+ *
+ * If the vma is NULL, we're coming from the GUP-fast path and might have
+ * to fallback to the slow path just to lookup the vma.
   */
-static inline bool gup_must_unshare(unsigned int flags, struct page *page)
+static inline bool gup_must_unshare(struct vm_area_struct *vma,
+   unsigned int flags, struct page *page)
  {
/*
 * FOLL_WRITE is implicitly handled correctly as the page table entry
@@ -3109,8 +3113,25 @@ static inline bool gup_must_unshare(unsigned int flags, struct page *page)
 * Note: PageAnon(page) is stable until the page is actually getting
 * freed.
 */
-   if (!PageAnon(page))
-   return false;
+   if (!PageAnon(page)) {
+   /*
+* We only care about R/O long-term pinning: R/O short-term
+* pinning does not have the semantics to observe successive
+* changes through the process page tables.
+*/
+   if (!(flags & FOLL_LONGTERM))
+   return false;
+
+   /* We really need the vma ... */
+   if (!vma)
+   return true;
+
+   /*
+* ... because we only care about writable private ("COW")
+* mapping

Re: [PATCH 6/7] nouveau/dmem: Evict device private memory during release

2022-09-26 Thread John Hubbard
On 9/26/22 14:35, Lyude Paul wrote:
>> +for (i = 0; i < npages; i++) {
>> +if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
>> +struct page *dpage;
>> +
>> +/*
>> + * __GFP_NOFAIL because the GPU is going away and there
>> + * is nothing sensible we can do if we can't copy the
>> + * data back.
>> + */
> 
> You'll have to excuse me for a moment since this area of nouveau isn't one of
> my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite
> retry, in the case of a GPU hotplug event I would assume we would rather just
> stop trying to migrate things to the GPU and just drop the data instead of
> hanging on infinite retries.
> 
Hi Lyude!

Actually, I really think it's better in this case to keep trying
(presumably not necessarily infinitely, but only until memory becomes
available), rather than failing out and corrupting data.

That's because I'm not sure it's completely clear that this memory is
discardable. And at some point, we're going to make this all work with
file-backed memory, which will *definitely* not be discardable--I
realize that we're not there yet, of course.

But here, it's reasonable to commit to just retrying indefinitely,
really. Memory should eventually show up. And if it doesn't, then
restarting the machine is better than corrupting data, generally.


thanks,

-- 
John Hubbard
NVIDIA



Re: [PATCH v1 1/3] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-09-22 Thread John Hubbard
On 9/22/22 19:26, John Hubbard wrote:
> 
> Reviewed-by: John Hubbard 
> 

I forgot to mention that I had applied your fix to Akira's
issue, before reviewing. So that fix works and builds and
looks nice too.

thanks,

-- 
John Hubbard
NVIDIA



Re: [PATCH v1 1/3] coding-style.rst: document BUG() and WARN() rules ("do not crash the kernel")

2022-09-22 Thread John Hubbard
On 9/20/22 05:23, David Hildenbrand wrote:
> [1] 
> https://lkml.kernel.org/r/CAHk-=wiEAH+ojSpAgx_Ep=nkpwhu8ado3v56bxccsu97oyj...@mail.gmail.com
> [2] 
> https://lore.kernel.org/r/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com
> [2] 
> https://lkml.kernel.org/r/CAHk-=wit-dmhmfqery29jspjfgebx_ld+pnerc4j2ag990w...@mail.gmail.com

s/2/3/

...
> diff --git a/Documentation/process/coding-style.rst 
> b/Documentation/process/coding-style.rst
> index 03eb53fd029a..e05899cbfd49 100644
> --- a/Documentation/process/coding-style.rst
> +++ b/Documentation/process/coding-style.rst
> @@ -1186,6 +1186,67 @@ expression used.  For instance:
>   #endif /* CONFIG_SOMETHING */
>  
>  
> +22) Do not crash the kernel
> +---------------------------
> +
> +In general, it is not the kernel developer's decision to crash the kernel.

What do you think of this alternate wording:

In general, the decision to crash the kernel belongs to the user, rather
than to the kernel developer.


> +
> +Avoid panic()
> +=============
> +
> +panic() should be used with care and primarily only during system boot.
> +panic() is, for example, acceptable when running out of memory during boot 
> and
> +not being able to continue.
> +
> +Use WARN() rather than BUG()
> +============================
> +
> +Do not add new code that uses any of the BUG() variants, such as BUG(),
> +BUG_ON(), or VM_BUG_ON(). Instead, use a WARN*() variant, preferably
> +WARN_ON_ONCE(), and possibly with recovery code. Recovery code is not
> +required if there is no reasonable way to at least partially recover.
> +
> +"I'm too lazy to do error handling" is not an excuse for using BUG(). Major
> +internal corruptions with no way of continuing may still use BUG(), but need
> +good justification.
> +
> +Use WARN_ON_ONCE() rather than WARN() or WARN_ON()
> +**************************************************
> +
> +WARN_ON_ONCE() is generally preferred over WARN() or WARN_ON(), because it
> +is common for a given warning condition, if it occurs at all, to occur
> +multiple times. This can fill up and wrap the kernel log, and can even slow
> +the system enough that the excessive logging turns into its own, additional
> +problem.
> +
> +Do not WARN lightly
> +*******************
> +
> +WARN*() is intended for unexpected, this-should-never-happen situations.
> +WARN*() macros are not to be used for anything that is expected to happen
> +during normal operation. These are not pre- or post-condition asserts, for
> +example. Again: WARN*() must not be used for a condition that is expected
> +to trigger easily, for example, by user space actions. pr_warn_once() is a
> +possible alternative, if you need to notify the user of a problem.
> +
> +Do not worry about panic_on_warn users
> +**************************************
> +
> +A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> +available kernel option, and that many users set this option. This is why
> +there is a "Do not WARN lightly" writeup, above. However, the existence of
> +panic_on_warn users is not a valid reason to avoid the judicious use of
> +WARN*(). That is because, whoever enables panic_on_warn has explicitly
> +asked the kernel to crash if a WARN*() fires, and such users must be
> +prepared to deal with the consequences of a system that is somewhat more
> +likely to crash.
> +
> +Use BUILD_BUG_ON() for compile-time assertions
> +**********************************************
> +
> +The use of BUILD_BUG_ON() is acceptable and encouraged, because it is a
> +compile-time assertion that has no effect at runtime.
> +
>  Appendix I) References
>  --
>  

I like the wording, it feels familiar somehow! :)

Reviewed-by: John Hubbard 
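
As a concrete illustration of the "WARN plus recovery code" pattern that the
new section recommends (my sketch, not from the patch; frob_widget and
struct widget are made up):

/* Recover instead of crashing when an invariant is violated: */
static int frob_widget(struct widget *w)
{
	if (WARN_ON_ONCE(!w))	/* unexpected, but survivable */
		return -EINVAL;	/* recovery code */

	w->frobbed = true;
	return 0;
}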

thanks,

-- 
John Hubbard
NVIDIA



Re: [PATCH v1 3/3] checkpatch: warn on usage of VM_BUG_ON() and other BUG variants

2022-09-22 Thread John Hubbard
On 9/22/22 19:11, Joe Perches wrote:
>> Should this be a separate patch? Adding a bunch of exceptions to the BUG()
>> rules is a separate and distinct thing from adding VM_BUG_ON() and other
>> *BUG*() variants to the mix.
> 
> Not in my opinion.

OK, then. :)

> 
>>>			my $msg_level = \&WARN;
>>>			$msg_level = \&CHK if ($file);
>>>			&{$msg_level}("AVOID_BUG",
>>> -				"Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr);
>>> +				"Do not crash the kernel unless it is unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than BUG() or variants.\n" . $herecurr);
>>
>> Here's a requested tweak, to clean up the output and fix punctuation:
>>
>> "Avoid crashing the kernel--use WARN_ON_ONCE() plus recovery code (if 
>> feasible) instead of BUG() or variants.\n" . $herecurr);
> 
> Fixing punctuation here would be removing the trailing period as checkpatch
> only has periods for multi-sentence output messages.

OK, let's do that too. 

> 
> And I think that "Do not crash" is a stronger statement than "Avoid crashing"
> so I prefer the original suggestion but it's not a big deal either way.

Yes, stronger wording is better. So how about this:

"Do not crash the kernel unless it is absolutely unavoidable--use 
WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants\n" 
. $herecurr);


thanks,

-- 
John Hubbard
NVIDIA



Re: [PATCH v1 3/3] checkpatch: warn on usage of VM_BUG_ON() and other BUG variants

2022-09-22 Thread John Hubbard
On 9/20/22 05:23, David Hildenbrand wrote:
> checkpatch does not point out that VM_BUG_ON() and friends should be
> avoided, however, Linus notes:
> 
> VM_BUG_ON() has the exact same semantics as BUG_ON. It is literally
> no different, the only difference is "we can make the code smaller
> because these are less important". [1]
> 
> So let's warn on VM_BUG_ON() and other BUG variants as well. While at it,
> make it clearer that the kernel really shouldn't be crashed.
> 
> As there are some subsystem BUG macros that actually don't end up crashing
> the kernel -- for example, KVM_BUG_ON() -- exclude these manually.
> 
> [1] 
> https://lore.kernel.org/r/CAHk-=wg40eazofo16eviaj7mfqdhz2gvebvfsmf6gyzsprj...@mail.gmail.com
> 
> Signed-off-by: David Hildenbrand 
> ---
>  scripts/checkpatch.pl | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index 79e759aac543..21f3a79aa46f 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -4695,12 +4695,12 @@ sub process {
>   }
>   }
>  
> -# avoid BUG() or BUG_ON()
> - if ($line =~ /\b(?:BUG|BUG_ON)\b/) {
> +# do not use BUG() or variants
> +	if ($line =~ /\b(?!AA_|BUILD_|DCCP_|IDA_|KVM_|RWLOCK_|snd_|SPIN_)(?:[a-zA-Z_]*_)?BUG(?:_ON)?(?:_[A-Z_]+)?\s*\(/) {

Should this be a separate patch? Adding a bunch of exceptions to the BUG()
rules is a separate and distinct thing from adding VM_BUG_ON() and other
*BUG*() variants to the mix.

>			my $msg_level = \&WARN;
>			$msg_level = \&CHK if ($file);
>			&{$msg_level}("AVOID_BUG",
> -				"Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()\n" . $herecurr);
> +				"Do not crash the kernel unless it is unavoidable - use WARN_ON_ONCE & recovery code (if reasonable) rather than BUG() or variants.\n" . $herecurr);

Here's a requested tweak, to clean up the output and fix punctuation:

"Avoid crashing the kernel--use WARN_ON_ONCE() plus recovery code (if feasible) 
instead of BUG() or variants.\n" . $herecurr);


thanks,

-- 
John Hubbard
NVIDIA



Re: [v2 PATCH 1/2] mm: gup: fix the fast GUP race against THP collapse

2022-09-07 Thread John Hubbard

On 9/7/22 11:01, Yang Shi wrote:

Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
sufficient to handle concurrent GUP-fast in all cases, it only handles
traditional IPI-based GUP-fast correctly.  On architectures that send
an IPI broadcast on TLB flush, it works as expected.  But on the
architectures that do not use IPI to broadcast TLB flush, it may have
the below race:

CPU A  CPU B
THP collapse fast GUP
   gup_pmd_range() <-- see valid pmd
   gup_pte_range() <-- work on pte
pmdp_collapse_flush() <-- clear pmd and flush
__collapse_huge_page_isolate()
 check page pinned <-- before GUP bump refcount
   pin the page
   check PTE <-- no change
__collapse_huge_page_copy()
 copy data to huge page
 ptep_clear()
install huge pmd for the huge page
   return the stale page
discard the stale page

The race could be fixed by checking whether PMD is changed or not after
taking the page pin in fast GUP, just like what it does for PTE.  If the
PMD is changed it means there may be parallel THP collapse, so GUP
should back off.

Also update the stale comment about serializing against fast GUP in
khugepaged.

Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Acked-by: David Hildenbrand 
Acked-by: Peter Xu 
Signed-off-by: Yang Shi 
---
v2: * Incorporated the comment from Peter about the comment.
 * Moved the comment right before gup_pte_range() instead of in the
   body of the function, per John.
 * Added patch 2/2 per Aneesh.

  mm/gup.c| 34 --
  mm/khugepaged.c | 10 ++
  2 files changed, 34 insertions(+), 10 deletions(-)



Looks good.

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


diff --git a/mm/gup.c b/mm/gup.c
index f3fc1f08d90c..40aa1c937212 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2380,8 +2380,28 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
  }
  
  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-unsigned int flags, struct page **pages, int *nr)
+/*
+ * Fast-gup relies on pte change detection to avoid concurrent pgtable
+ * operations.
+ *
+ * To pin the page, fast-gup needs to do below in order:
+ * (1) pin the page (by prefetching pte), then (2) check pte not changed.
+ *
+ * For the rest of pgtable operations where pgtable updates can be racy
+ * with fast-gup, we need to do (1) clear pte, then (2) check whether page
+ * is pinned.
+ *
+ * Above will work for all pte-level operations, including THP split.
+ *
+ * For THP collapse, it's a bit more complicated because fast-gup may be
+ * walking a pgtable page that is being freed (pte is still valid but pmd
+ * can be cleared already).  To avoid race in such condition, we need to
+ * also check pmd here to make sure pmd doesn't change (corresponds to
+ * pmdp_collapse_flush() in the THP collapse code path).
+ */
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
  {
struct dev_pagemap *pgmap = NULL;
int nr_start = *nr, ret = 0;
@@ -2423,7 +2443,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
goto pte_unmap;
}
  
-		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+   if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
+   unlikely(pte_val(pte) != pte_val(*ptep))) {
gup_put_folio(folio, 1, flags);
goto pte_unmap;
}
@@ -2470,8 +2491,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
   * get_user_pages_fast_only implementation that can pin pages. Thus it's still
   * useful to have gup_huge_pmd even if we can't operate on ptes.
   */
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
-unsigned int flags, struct page **pages, int *nr)
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
  {
return 0;
  }
@@ -2791,7 +2813,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
 PMD_SHIFT, next, flags, pages, nr))
 

Re: [PATCH v4 4/4] selftests/hmm-tests: Add test for dirty bits

2022-09-07 Thread John Hubbard

On 9/7/22 04:13, Alistair Popple wrote:

+   /*
+* Attempt to migrate memory to device, which should fail because
+* hopefully some pages are backed by swap storage.
+*/
+   ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));


Are you really sure that you want to assert on that? Because doing so
guarantees a test failure if and when we every upgrade the kernel to
be able to migrate swap-backed pages. And I seem to recall that this
current inability to migrate swap-backed pages is considered a flaw
to be fixed, right?


Right, that's a good point. I was using failure (ASSERT_TRUE) here as a
way of detecting that at least some pages are swap-backed, because if no
pages end up being swap-backed the test is invalid.


Yes. But "invalid" or "waived" is a much different test result than
"failed".



I'm not really sure what to do about it though. It's likely the fix for


Remove the assert. If the test framework allows and you prefer, you
can print a warning.


swap-backed migration may make this bug impossible to hit anyway,
because the obvious fix is to just drop the pages from the swapcache
during migration which would force writeback during subsequent reclaim.

So I'm inclined to leave this here even if it only serves to remind us
about it when we do fix migration of swap-backed pages, because we will
of course run hmm-tests before submitting that fix :-) We can then
either fix the test or drop it if we think it's no longer possible to
hit.


Oh no no no, please. This is not how to do tests. If you want a TODO
list somewhere, there are other ways. But tests that require maintenance
when you change something are an anti-pattern.
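
(For instance, with the kselftest harness the "invalid" outcome could be
reported as a skip rather than a failure; a sketch:)

	/*
	 * Sketch: if the migration unexpectedly succeeds, no page was
	 * swap-backed and the test is inconclusive -- skip, don't fail.
	 */
	if (hmm_migrate_sys_to_dev(self->fd, buffer, npages) == 0)
		SKIP(return, "no pages were swap-backed; test inconclusive");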


thanks,
--
John Hubbard
NVIDIA



Re: [PATCH v4 4/4] selftests/hmm-tests: Add test for dirty bits

2022-09-04 Thread John Hubbard
On 9/1/22 17:35, Alistair Popple wrote:
> We were not correctly copying PTE dirty bits to pages during
> migrate_vma_setup() calls. This could potentially lead to data loss, so
> add a test for this.
> 
> Signed-off-by: Alistair Popple 
> ---
>  tools/testing/selftests/vm/hmm-tests.c | 124 ++-
>  1 file changed, 124 insertions(+)
> 
> diff --git a/tools/testing/selftests/vm/hmm-tests.c 
> b/tools/testing/selftests/vm/hmm-tests.c
> index 529f53b..70fdb49 100644
> --- a/tools/testing/selftests/vm/hmm-tests.c
> +++ b/tools/testing/selftests/vm/hmm-tests.c
> @@ -1200,6 +1200,130 @@ TEST_F(hmm, migrate_multiple)
>   }
>  }
>  
> +static char cgroup[] = "/sys/fs/cgroup/hmm-test-XXXXXX";
> +static int write_cgroup_param(char *cgroup_path, char *param, long value)
> +{
> + int ret;
> + FILE *f;
> + char *filename;
> +
> +	if (asprintf(&filename, "%s/%s", cgroup_path, param) < 0)
> + return -1;
> +
> + f = fopen(filename, "w");
> + if (!f) {
> + ret = -1;
> + goto out;
> + }
> +
> + ret = fprintf(f, "%ld\n", value);
> + if (ret < 0)
> + goto out1;
> +
> + ret = 0;
> +
> +out1:
> + fclose(f);
> +out:
> + free(filename);
> +
> + return ret;
> +}
> +
> +static int setup_cgroup(void)
> +{
> + pid_t pid = getpid();
> + int ret;
> +
> + if (!mkdtemp(cgroup))
> + return -1;
> +
> + ret = write_cgroup_param(cgroup, "cgroup.procs", pid);
> + if (ret)
> + return ret;
> +
> + return 0;
> +}
> +
> +static int destroy_cgroup(void)
> +{
> + pid_t pid = getpid();
> + int ret;
> +
> + ret = write_cgroup_param("/sys/fs/cgroup/cgroup.procs",
> + "cgroup.proc", pid);
> + if (ret)
> + return ret;
> +
> + if (rmdir(cgroup))
> + return -1;
> +
> + return 0;
> +}
> +
> +/*
> + * Try and migrate a dirty page that has previously been swapped to disk. This
> + * checks that we don't loose dirty bits.

s/loose/lose/

> + */
> +TEST_F(hmm, migrate_dirty_page)
> +{
> + struct hmm_buffer *buffer;
> + unsigned long npages;
> + unsigned long size;
> + unsigned long i;
> + int *ptr;
> + int tmp = 0;
> +
> + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> + ASSERT_NE(npages, 0);
> + size = npages << self->page_shift;
> +
> + buffer = malloc(sizeof(*buffer));
> + ASSERT_NE(buffer, NULL);
> +
> + buffer->fd = -1;
> + buffer->size = size;
> + buffer->mirror = malloc(size);
> + ASSERT_NE(buffer->mirror, NULL);
> +
> + ASSERT_EQ(setup_cgroup(), 0);
> +
> + buffer->ptr = mmap(NULL, size,
> +PROT_READ | PROT_WRITE,
> +MAP_PRIVATE | MAP_ANONYMOUS,
> +buffer->fd, 0);
> + ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> + /* Initialize buffer in system memory. */
> + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> + ptr[i] = 0;
> +
> + ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
> +
> + /* Fault pages back in from swap as clean pages */
> + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> + tmp += ptr[i];
> +
> + /* Dirty the pte */
> + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> + ptr[i] = i;
> +
> + /*
> +  * Attempt to migrate memory to device, which should fail because
> +  * hopefully some pages are backed by swap storage.
> +  */
> + ASSERT_TRUE(hmm_migrate_sys_to_dev(self->fd, buffer, npages));

Are you really sure that you want to assert on that? Because doing so
guarantees a test failure if and when we ever upgrade the kernel to
be able to migrate swap-backed pages. And I seem to recall that this
current inability to migrate swap-backed pages is considered a flaw
to be fixed, right?


> +
> + ASSERT_FALSE(write_cgroup_param(cgroup, "memory.reclaim", 1UL<<30));
> +
> + /* Check we still see the updated data after restoring from swap. */
> + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> + ASSERT_EQ(ptr[i], i);
> +
> + hmm_buffer_free(buffer);
> + destroy_cgroup();
> +}
> +
>  /*
>   * Read anonymous memory multiple times.
>   */

thanks,

-- 
John Hubbard
NVIDIA



Re: [PATCH RFC PKS/PMEM 57/58] nvdimm/pmem: Stray access protection for pmem->virt_addr

2020-10-09 Thread John Hubbard

On 10/9/20 12:50 PM, ira.we...@intel.com wrote:

From: Ira Weiny 

The pmem driver uses a cached virtual address to access its memory
directly.  Because the nvdimm driver is well aware of the special
protections it has mapped memory with, we call dev_access_[en|dis]able()
around the direct pmem->virt_addr (pmem_addr) usage instead of the
unnecessary overhead of trying to get a page to kmap.

Signed-off-by: Ira Weiny 
---
  drivers/nvdimm/pmem.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372..e4dc1ae990fc 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -148,7 +148,9 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
return BLK_STS_IOERR;
  
+	dev_access_enable(false);
 	rc = read_pmem(page, page_off, pmem_addr, len);
+	dev_access_disable(false);


Hi Ira!

The APIs should be tweaked to use a symbol (GLOBAL, PER_THREAD), instead of
true/false. Try reading the above and you'll see that it sounds like it's
doing the opposite of what it is ("enable_this(false)" sounds like a clumsy
API design to *disable*, right?). And there is no hint about the scope.

And it *could* be so much more readable like this:

dev_access_enable(DEV_ACCESS_THIS_THREAD);
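
A slightly fuller sketch of that shape (hypothetical names, purely
illustrative, not the patch's actual API):

enum dev_access_scope {
	DEV_ACCESS_GLOBAL,
	DEV_ACCESS_THIS_THREAD,
};

void dev_access_enable(enum dev_access_scope scope);
void dev_access_disable(enum dev_access_scope scope);

...and then the pmem read path above would read:

	dev_access_enable(DEV_ACCESS_THIS_THREAD);
	rc = read_pmem(page, page_off, pmem_addr, len);
	dev_access_disable(DEV_ACCESS_THIS_THREAD);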



thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding

2020-09-15 Thread John Hubbard

On 9/11/20 1:36 PM, Vasily Gorbik wrote:

Currently to make sure that every page table entry is read just once
gup_fast walks perform READ_ONCE and pass pXd value down to the next
gup_pXd_range function by value e.g.:

static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
  unsigned int flags, struct page **pages, int *nr)
...
 pudp = pud_offset(&p4d, addr);

This function passes a reference on that local value copy to pXd_offset,
and might get the very same pointer in return. This happens when the
level is folded (on most arches), and that pointer should not be iterated.

On s390 due to the fact that each task might have different 5,4 or
3-level address translation and hence different levels folded the logic
is more complex and non-iteratable pointer to a local copy leads to
severe problems.

Here is an example of what happens with gup_fast on s390, for a task
with 3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
  unsigned int flags, struct page **pages, int *nr)
{
 unsigned long next;
 pud_t *pudp;

 // pud_offset returns &p4d itself (a pointer to a value on stack)
 pudp = pud_offset(&p4d, addr);
 do {
 // on second iteratation reading "random" stack value
 pud_t pud = READ_ONCE(*pudp);

 // next = 0x1008000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
 next = pud_addr_end(addr, end);
 ...
 } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

 return 1;
}

This happens since s390 moved to common gup code with
commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
and commit 1a42010cdc26 ("s390/mm: convert to the generic
get_user_pages_fast code"). s390 tried to mimic static level folding by
changing pXd_offset primitives to always calculate top level page table
offset in pgd_offset and just return the value passed when pXd_offset
has to act as folded.

What is crucial for gup_fast and what has been overlooked is
that PxD_SIZE/MASK and thus pXd_addr_end should also change
correspondingly. And the latter is not possible with dynamic folding.

To fix the issue in addition to pXd values pass original
pXdp pointers down to gup_pXd_range functions. And introduce
pXd_offset_lockless helpers, which take an additional pXd
entry value parameter. This has already been discussed in
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1

Cc:  # 5.2+
Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Reviewed-by: Gerald Schaefer 
Reviewed-by: Alexander Gordeev 
Signed-off-by: Vasily Gorbik 
---


Looks cleaner than I'd dared hope for. :)
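
(For reference, the generic fallbacks for the new lockless helpers take
roughly this shape; a sketch, see the s390 versions in the diff below:)

/* include/linux/pgtable.h: archs that don't need a lockless variant
 * simply fall back to the regular offset helper on the value copy:
 */
#ifndef p4d_offset_lockless
#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
#endif

#ifndef pud_offset_lockless
#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
#endif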

Reviewed-by: John Hubbard 


thanks,
--
John Hubbard
NVIDIA


v2: added brackets &pgd -> &(pgd)

  arch/s390/include/asm/pgtable.h | 42 +++--
  include/linux/pgtable.h | 10 ++++++++++
  mm/gup.c| 18 +++---
  3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..b55561cc8786 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
  
  #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
  
-static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
+static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
  {
-   if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
-   return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
-   return (p4d_t *) pgd;
+   if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
+   return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
+   return (p4d_t *) pgdp;
  }
+#define p4d_offset_lockless p4d_offset_lockless
  
-static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
  {
-   if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
-   return (pud_t *) p4d_deref(*p4d) + pud_index(address);
-   return (pud_t *) p4d;
+   return p4d_offset_lockless(pgdp, *pgdp, address);
+}
+
+static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
+{
+   if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
+   return (pud_t *) p4d_deref(p4d) + pud_index(address);
+   return (pud_t *) p4dp;
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *

Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread John Hubbard

On 9/10/20 3:11 PM, Jason Gunthorpe wrote:

On Thu, Sep 10, 2020 at 02:22:37PM -0700, John Hubbard wrote:


Or am I way off here, and it really is possible (aside from the current
s390 situation) to observe something that "is no longer a page table"?


Yes, that is the issue. Remember there is no locking for GUP
fast. While a page table cannot be freed there is nothing preventing
the page table entry from being concurrently modified.



OK, then we are saying the same thing after all, good.


Without the stack variable it looks like this:

pud_t pud = READ_ONCE(*pudp);
if (!pud_present(pud))
 return
pmd_offset(pudp, address);

And pmd_offset() expands to

 return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);

Between the READ_ONCE(*pudp) and (*pud) inside pmd_offset() the value
of *pud can change, eg to !pud_present.

Then pud_page_vaddr(*pud) will crash. It is not use after free, it
is using data that has not been validated.



Right, that matches what I had in mind, too: you can still have a problem
even though you're in the same page table. I just wanted to confirm that
there's not some odd way to launch out into completely non-page-table
memory.


thanks,
--
John Hubbard
NVIDIA


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread John Hubbard

On 9/10/20 11:13 AM, Jason Gunthorpe wrote:

On Thu, Sep 10, 2020 at 10:35:38AM -0700, Linus Torvalds wrote:

On Thu, Sep 10, 2020 at 2:40 AM Alexander Gordeev
 wrote:


It is only gup_fast case that exposes the issue. It hits because
pointers to stack copies are passed to gup_pXd_range iterators, not
pointers to real page tables itself.


Can we possibly change fast-gup to not do the stack copies?

I'd actually rather do something like that, than the "addr_end" thing.



As you say, none of the other page table walking code does what the
GUP code does, and I don't think it's required.


As I understand it, the requirement is because fast-gup walks without
the page table spinlock, or mmap_sem held so it must READ_ONCE the
*pXX.

It then checks that it is a valid page table pointer, then calls
pXX_offset().

The arch implementation of pXX_offset() derefs again the passed pXX
pointer. So it defeats the READ_ONCE and the 2nd load could observe
something that is no longer a page table pointer and crash.


Just to be clear, though, that makes it sound a little wilder and
reckless than it really is, right?

Because actually, the page tables cannot be freed while gup_fast is
walking them, due to either IPI blocking during the walk, or the moral
equivalent (MMU_GATHER_RCU_TABLE_FREE) for non-IPI architectures. So the
pages tables can *change* underneath gup_fast, and for example pages can
be unmapped. But they remain valid page tables, it's just that their
contents are unstable. Even if pXd_none()==true.

Or am I way off here, and it really is possible (aside from the current
s390 situation) to observe something that "is no longer a page table"?
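
(For reference, the interlock being described is the IRQ-disabled region
around the walk; roughly, from the GUP-fast path in mm/gup.c:)

	/*
	 * With IRQs off, an IPI-based TLB flush cannot complete, so the
	 * page tables cannot be freed out from under the walk;
	 * MMU_GATHER_RCU_TABLE_FREE provides the equivalent guarantee on
	 * non-IPI architectures. The table *contents* may still change.
	 */
	local_irq_save(flags);
	gup_pgd_range(addr, end, gup_flags, pages, &nr);
	local_irq_restore(flags);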


thanks,
--
John Hubbard
NVIDIA


Re: [linux-next RFC v2] mm/gup.c: Convert to use get_user_{page|pages}_fast_only()

2020-05-24 Thread John Hubbard

On 2020-05-23 21:27, Souptick Joarder wrote:

API __get_user_pages_fast() renamed to get_user_pages_fast_only()
to align with pin_user_pages_fast_only().

As part of this we will get rid of write parameter. Instead caller
will pass FOLL_WRITE to get_user_pages_fast_only(). This will not
change any existing functionality of the API.

All the callers are changed to pass FOLL_WRITE.


This looks good. A few nits below, but with those fixed, feel free to
add:

Reviewed-by: John Hubbard 



There are few places where 1 is passed to 2nd parameter of
__get_user_pages_fast() and return value is checked for 1
like [1]. Those are replaced with new inline
get_user_page_fast_only().

[1] if (__get_user_pages_fast(hva, 1, 1, &page) == 1)



We try to avoid talking *too* much about the previous version of
the code. Just enough. So, instead of the above two paragraphs,
I'd compress it down to:

Also: introduce get_user_page_fast_only(), and use it in a few
places that hard-code nr_pages to 1.

...

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 93d93bd..8d4597f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1817,10 +1817,16 @@ extern int mprotect_fixup(struct vm_area_struct *vma,
  /*
   * doesn't attempt to fault and will return short.
   */
-int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
- struct page **pages);
+int get_user_pages_fast_only(unsigned long start, int nr_pages,
+   unsigned int gup_flags, struct page **pages);


Silly nit:

Can you please leave the original indentation in place? I don't normally
comment about this, but I like the original indentation better, and it matches
the pin_user_pages_fast() below, too.

...

@@ -2786,8 +2792,8 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
   * If the architecture does not support this function, simply return with no
   * pages pinned.
   */
-int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
- struct page **pages)
+int get_user_pages_fast_only(unsigned long start, int nr_pages,
+   unsigned int gup_flags, struct page **pages)



Same thing here: you've changed the original indentation, which was (arguably,
but to my mind anyway) more readable, and for no reason. It still would have
fit within 80 cols.

I'm sure it's a perfect 50/50 mix of people who prefer either indentation
style, and so for brand new code, I'll remain silent, as long as it is
consistent with either itself and/or the surrounding code. But changing it
back and forth is a bit aggravating, and best avoided. :)



thanks,
--
John Hubbard
NVIDIA


Re: [linux-next RFC] mm/gup.c: Convert to use get_user_pages_fast_only()

2020-05-23 Thread John Hubbard

On 2020-05-23 12:35, Souptick Joarder wrote:
...

Everything you have done here is an improvement, and I'd be happy to
see it go in (after fixing the bug I note below).

But in reading through it, I noticed almost every user ...


- if (__get_user_pages_fast(hva, 1, 1, &page) == 1) {
+ if (get_user_pages_fast_only(hva, 1, FOLL_WRITE, &page) == 1) {


passes '1' as the second parameter.  So do we want to add:

static inline bool get_user_page_fast_only(unsigned long addr,
 unsigned int gup_flags, struct page **pagep)
{
 return get_user_pages_fast_only(addr, 1, gup_flags, pagep) == 1;
}
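
(Call sites like the KVM example above would then read, roughly:)

	/* The hard-coded nr_pages == 1 and the "== 1" check go away: */
	if (get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
		/* got the single page */
	}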


Yes, this can be added. Is the get_user_page_fast_only() naming fine?



It seems like a good name to me. And I think that the new wrapper call is
a good move, too.

I did pause and reflect for a moment about the number gup/pup API calls we
are building up, but that's merely an indication of the wide usage of this
functionality. So it all feels about right.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v6 02/11] mm/gup: Use functions to track lockless pgtbl walks on gup_pgd_range

2020-02-07 Thread John Hubbard
On 2/5/20 7:25 PM, Leonardo Bras wrote:
> On Thu, 2020-02-06 at 00:08 -0300, Leonardo Bras wrote:
>> gup_pgd_range(addr, end, gup_flags, pages, &nr);
>> -   local_irq_enable();
>> +   end_lockless_pgtbl_walk(IRQS_ENABLED);
>> ret = nr;
>> }
>>  
> 
> Just noticed IRQS_ENABLED is not available on other archs than ppc64.
> I will fix this for v7.
> 

What's the fix going to look like, approximately?


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v12 04/22] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2020-01-16 Thread John Hubbard

On 1/16/20 1:37 AM, Christoph Hellwig wrote:

On Wed, Jan 15, 2020 at 01:19:41PM -0800, John Hubbard wrote:

On 1/15/20 7:23 AM, Christoph Hellwig wrote:
...


I'm really not sold on this scheme.  Note that I think it is
particularly bad, but it also doesn't seem any better than what
we had before, and it introduced quite a bit more code.



Hi Christoph,

All by itself, yes. But the very next patch (which needs a little
rework for other reasons, so not included here) needs to reuse some of
these functions within __unpin_devmap_managed_user_page():


Well, then combine it with the series that actually does the change.



OK, that makes sense. I just double-checked with a quick test run, that it
doesn't have dependencies with the rest of this series, and it came out clean,
so:

Andrew, could you please remove just this one patch from mmotm and linux-next?




Also my vaguely recollection is that we had some idea on how to get rid
of the off by one refcounting for the zone device pages, which would be
a much better outcome.



Yes, I recall that Dan Williams mentioned it, but I don't think he provided
any details yet.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v12 11/22] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2020-01-15 Thread John Hubbard
On 1/15/20 7:30 AM, Christoph Hellwig wrote:
> On Tue, Jan 07, 2020 at 02:45:47PM -0800, John Hubbard wrote:
>> Introduce pin_user_pages*() variations of get_user_pages*() calls,
>> and also pin_longterm_pages*() variations.
>>
>> For now, these are placeholder calls, until the various call sites
>> are converted to use the correct get_user_pages*() or
>> pin_user_pages*() API.
> 
> What do the pure placeholders buy us?  The API itself looks ok,
> but until it actually is properly implemented it doesn't help at
> all, and we've had all kinds of bad experiences with these sorts
> of stub APIs.
> 

Hi Christoph,

Absolutely agreed, and in fact, after spending some time in this area I 
am getting a much better understanding of just how problematic "this will 
be used soon" APIs really are. However, this is not quite that case.

The following things make this different from a "pure placeholder" API:

1) These APIs are all exercised in the following patches in this series, 
unless I've overlooked one, and

2) The pages are actually tracked in the very next patch that I want to
post. That patch was posted as part of the v11 series [1], but 
Leon Romanovsky reported a problem with it, and so I'm going to add in
the ability to handle larger "pin" refcounts for the huge page cases.

Meanwhile, I wanted to get these long-simmering simpler preparatory
patches submitted, because it's clear that the API is unaffected by the
huge page refcount fix. (That fix will likely use the second struct page of
the compound page, to count up higher.)
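
(For context, the scheme being fixed counts pins with a large refcount bias;
roughly this shape, with the helper name here made up -- the eventual one may
differ:)

/* Each FOLL_PIN adds a large bias to the page refcount, so pins can be
 * told apart from ordinary references; a huge page pinned very many
 * times can overflow this, hence the plan to count in a dedicated field
 * of the compound page's second struct page.
 */
#define GUP_PIN_COUNTING_BIAS	(1U << 10)

static inline bool page_maybe_dma_pinned_sketch(struct page *page)
{
	return page_ref_count(compound_head(page)) >= GUP_PIN_COUNTING_BIAS;
}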


[1] https://lore.kernel.org/r/20191216222537.491123-24-jhubb...@nvidia.com  
[PATCH v11 23/25] mm/gup: track FOLL_PIN pages

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v12 04/22] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2020-01-15 Thread John Hubbard
On 1/15/20 7:23 AM, Christoph Hellwig wrote:
...
> 
> I'm really not sold on this scheme.  Note that I think it is
> particularly bad, but it also doesn't seem any better than what
> we had before, and it introduced quite a bit more code.
> 

Hi Christoph,

All by itself, yes. But the very next patch (which needs a little 
rework for other reasons, so not included here) needs to reuse some of 
these functions within __unpin_devmap_managed_user_page():

page_is_devmap_managed()
free_devmap_managed_page()

That patch was posted as part of the v11 series [1], and it did this:

+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+   int count;
+
+   if (!page_is_devmap_managed(page))
+   return false;
+
+   count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+   __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+
+   return true;
+}
+#else
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+   return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * unpin_user_page() - release a dma-pinned page
+ * @page: pointer to page to be released
+ *
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
+ */
+void unpin_user_page(struct page *page)
+{
+   page = compound_head(page);
+
+   /*
+* For devmap managed pages we need to catch refcount transition from
+* GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
+* page is free and we need to inform the device driver through
+* callback. See include/linux/memremap.h and HMM for details.
+*/
+   if (__unpin_devmap_managed_user_page(page))
+   return;
+
+   if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+   __put_page(page);
+
+   __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+}
+EXPORT_SYMBOL(unpin_user_page);


[1] https://lore.kernel.org/r/20191216222537.491123-24-jhubb...@nvidia.com  
[PATCH v11 23/25] mm/gup: track FOLL_PIN pages
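
To show how the pieces fit together from a caller's point of view, here is
a hedged sketch of the pin/unpin pairing (not from any real driver):

static int demo_pin_cycle(unsigned long uaddr, int nr, struct page **pages)
{
	int i, pinned;

	pinned = pin_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
	if (pinned < 0)
		return pinned;

	/* ...hand pages[0..pinned-1] to DMA, or access them directly... */

	for (i = 0; i < pinned; i++)
		unpin_user_page(pages[i]);	/* devmap-aware, as shown above */

	return pinned == nr ? 0 : -EFAULT;
}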

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v12 00/22] mm/gup: prereqs to track dma-pinned pages: FOLL_PIN

2020-01-14 Thread John Hubbard
On 1/9/20 2:07 PM, John Hubbard wrote:
> On 1/7/20 2:45 PM, John Hubbard wrote:
>> Hi,
>>
>> The "track FOLL_PIN pages" would have been the very next patch, but it is
>> not included here because I'm still debugging a bug report from Leon.
>> Let's get all of the prerequisite work (it's been reviewed) into the tree
>> so that future reviews are easier. It's clear that any fixes that are
>> required to the tracking patch, won't affect these patches here.
>>
>> This implements an API naming change (put_user_page*() -->
>> unpin_user_page*()), and also adds FOLL_PIN page support, up to
>> *but not including* actually tracking FOLL_PIN pages. It extends
>> the FOLL_PIN support to a few select subsystems. More subsystems will
>> be added in follow up work.
>>
> 
> Hi Andrew and all,
> 
> To clarify: I'm hoping that this series can go into 5.6.
> 
> Meanwhile, I'm working on tracking down and solving the problem that Leon
> reported, in the "track FOLL_PIN pages" patch, and that patch is not part of
> this series.
> 

Hi Andrew and all,

Any thoughts on this?

As for the not-included-yet tracking patch, my local testing still suggests the
need to allow for larger refcounts of huge pages (in other words, I can write a
test to pin huge pages many times, and overflow with the same backtrace that
Leon has reported).

The second struct page (I recall Jan suggested) can hold those, so I'm going to
proceed with that approach, while waiting to see if Leon has any more test data
for me.

Again, I think this series is worth getting out of the way, in the meantime.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v12 00/22] mm/gup: prereqs to track dma-pinned pages: FOLL_PIN

2020-01-09 Thread John Hubbard
On 1/7/20 2:45 PM, John Hubbard wrote:
> Hi,
> 
> The "track FOLL_PIN pages" would have been the very next patch, but it is
> not included here because I'm still debugging a bug report from Leon.
> Let's get all of the prerequisite work (it's been reviewed) into the tree
> so that future reviews are easier. It's clear that any fixes that are
> required to the tracking patch, won't affect these patches here.
> 
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also adds FOLL_PIN page support, up to
> *but not including* actually tracking FOLL_PIN pages. It extends
> the FOLL_PIN support to a few select subsystems. More subsystems will
> be added in follow up work.
> 

Hi Andrew and all,

To clarify: I'm hoping that this series can go into 5.6.

Meanwhile, I'm working on tracking down and solving the problem that Leon
reported, in the "track FOLL_PIN pages" patch, and that patch is not part of
this series.

thanks,
-- 
John Hubbard
NVIDIA


[PATCH v12 16/22] fs/io_uring: set FOLL_PIN via pin_user_pages()

2020-01-07 Thread John Hubbard
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Reviewed-by: Jens Axboe 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 562e3a1a1bf9..9f804cb25c61 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4815,7 +4815,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx 
*ctx, void __user *arg,
 
ret = 0;
down_read(&current->mm->mmap_sem);
-   pret = get_user_pages(ubuf, nr_pages,
+   pret = pin_user_pages(ubuf, nr_pages,
  FOLL_WRITE | FOLL_LONGTERM,
  pages, vmas);
if (pret == nr_pages) {
-- 
2.24.1



[PATCH v12 14/22] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2020-01-07 Thread John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/process_vm_access.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
if (copy > len)
copy = len;
 
-   if (vm_write) {
+   if (vm_write)
copied = copy_page_from_iter(page, offset, copy, iter);
-   set_page_dirty_lock(page);
-   } else {
+   else
copied = copy_page_to_iter(page, offset, copy, iter);
-   }
+
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
flags |= FOLL_WRITE;
 
while (!rc && nr_pages && iov_iter_count(iter)) {
-   int pages = min(nr_pages, max_pages_per_loop);
+   int pinned_pages = min(nr_pages, max_pages_per_loop);
int locked = 1;
size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 * current/current->mm
 */
down_read(&mm->mmap_sem);
-   pages = get_user_pages_remote(task, mm, pa, pages, flags,
- process_pages, NULL, &locked);
+   pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+flags, process_pages,
+NULL, );
if (locked)
up_read(&mm->mmap_sem);
-   if (pages <= 0)
+   if (pinned_pages <= 0)
return -EFAULT;
 
-   bytes = pages * PAGE_SIZE - start_offset;
+   bytes = pinned_pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 vm_write);
len -= bytes;
start_offset = 0;
-   nr_pages -= pages;
-   pa += pages * PAGE_SIZE;
-   while (pages)
-   put_page(process_pages[--pages]);
+   nr_pages -= pinned_pages;
+   pa += pinned_pages * PAGE_SIZE;
+
+   /* If vm_write is set, the pages need to be made dirty: */
+   put_user_pages_dirty_lock(process_pages, pinned_pages,
+ vm_write);
}
 
return rc;
-- 
2.24.1



[PATCH v12 09/22] IB/umem: use get_user_pages_fast() to pin DMA pages

2020-01-07 Thread John Hubbard
And get rid of the mmap_sem calls, as part of that. Note
that get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Reviewed-by: Leon Romanovsky 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7a3b99597ead..214e87aa609d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,16 +266,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(cur_base,
-min_t(unsigned long, npages,
-  PAGE_SIZE / sizeof (struct page *)),
-gup_flags | FOLL_LONGTERM,
-page_list, NULL);
-   if (ret < 0) {
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_fast(cur_base,
+ min_t(unsigned long, npages,
+   PAGE_SIZE /
+   sizeof(struct page *)),
+ gup_flags | FOLL_LONGTERM, page_list);
+   if (ret < 0)
goto umem_release;
-   }
 
cur_base += ret * PAGE_SIZE;
npages   -= ret;
@@ -283,8 +280,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = ib_umem_add_sg_table(sg, page_list, ret,
dma_get_max_seg_size(context->device->dma_device),
&umem->sg_nents);
-
-   up_read(&mm->mmap_sem);
}
 
sg_mark_end(sg);
-- 
2.24.1



[PATCH v12 11/22] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2020-01-07 Thread John Hubbard
Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

For now, these are placeholder calls, until the various call sites
are converted to use the correct get_user_pages*() or
pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
* Start with "pin_user_pages" instead of "get_user_pages". That
  makes it easy to find and audit the call sites.
* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded upon it.)
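
Here is a hedged sketch of a call site that follows the above rules
(illustrative only, not taken from this series):

static int demo_dio_call_site(unsigned long start, int nr, struct page **pages)
{
	int i, ret;

	/* No FOLL_PIN here: the pin_user_pages*() wrapper sets it internally. */
	ret = pin_user_pages_fast(start, nr, FOLL_WRITE, pages);
	if (ret <= 0)
		return ret ? ret : -EFAULT;

	/* ...do Direct IO against pages[0..ret-1]... */

	for (i = 0; i < ret; i++)
		put_user_page(pages[i]);	/* required for FOLL_PIN pages */

	return 0;
}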

Reviewed-by: Jan Kara 
Reviewed-by: Mike Rapoport   # Documentation
Reviewed-by: Jérôme Glisse 
Cc: Jonathan Corbet 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/pin_user_pages.rst | 232 ++
 include/linux/mm.h|  63 --
 mm/gup.c  | 164 +--
 4 files changed, 426 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+   pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index ..71849830cd48
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+pin_user_pages() and related calls
+==================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=============================
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for 
problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper 
calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pins pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ --------
+ pin_user_pages  FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast FOLL_PIN i

[PATCH v12 22/22] mm, tree-wide: rename put_user_page*() to unpin_user_page*()

2020-01-07 Thread John Hubbard
In order to provide a clearer, more symmetric API for pinning
and unpinning DMA pages. This way, pin_user_pages*() calls
match up with unpin_user_pages*() calls, and the API is a lot
closer to being self-explanatory.

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst   |  2 +-
 arch/powerpc/mm/book3s64/iommu_api.c|  4 +--
 drivers/gpu/drm/via/via_dmablit.c   |  4 +--
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 ++--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  4 +--
 drivers/platform/goldfish/goldfish_pipe.c   |  4 +--
 drivers/vfio/vfio_iommu_type1.c |  2 +-
 fs/io_uring.c   |  4 +--
 include/linux/mm.h  | 26 -
 mm/gup.c| 32 ++---
 mm/process_vm_access.c  |  4 +--
 net/xdp/xdp_umem.c  |  2 +-
 18 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index 71849830cd48..1d490155ecd7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -219,7 +219,7 @@ since the system was booted, via two new /proc/vmstat 
entries: ::
 /proc/vmstat/nr_foll_pin_requested
 
 Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in put_user_page(), when they
+because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
 References
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index a86547822034..eba73ebd8ae5 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -168,7 +168,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
 
 free_exit:
/* free the references taken */
-   put_user_pages(mem->hpages, pinned);
+   unpin_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -214,7 +214,7 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_user_page(page);
+   unpin_user_page(page);
 
mem->hpas[i] = 0;
}
diff --git a/drivers/gpu/drm/via/via_dmablit.c 
b/drivers/gpu/drm/via/via_dmablit.c
index 37c5e572993a..719d036c9384 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t 
*vsg)
kfree(vsg->desc_pages);
/* fall through */
case dr_via_pages_locked:
-   put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
- (vsg->direction == DMA_FROM_DEVICE));
+   unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+  (vsg->direction == DMA_FROM_DEVICE));
/* fall through */
case dr_via_pages_alloc:
vfree(vsg->pages);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 55daefaa9b88..a6094766b6f5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
 
for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
page = sg_page_iter_page(_iter);
-   put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
+   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 9a94761765c0..3b505006c0a6 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   put_user_pages_dirty_lock(p, npages, dirty);
+   unpin_user_pages_dirty_lock(p, npages, dirty);
 
if (mm) { /* during close after signal, mm can be NULL */
atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/mthca/mthc

[PATCH v12 08/22] mm/gup: allow FOLL_FORCE for get_user_pages_fast()

2020-01-07 Thread John Hubbard
Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
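
With the fix, a call shaped like the following (a hedged example, not an
existing call site) passes the flags check and behaves the same on both the
fast path and the slow-gup fallback:

	ret = get_user_pages_fast(start, nr_pages,
				  FOLL_WRITE | FOLL_FORCE, pages);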

Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Cc: Christoph Hellwig 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index b61bd5c469ae..a594bc708367 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2411,7 +2411,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned long addr, len, end;
int nr = 0, ret = 0;
 
-   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+  FOLL_FORCE)))
return -EINVAL;
 
start = untagged_addr(start) & PAGE_MASK;
-- 
2.24.1



[PATCH v12 06/22] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2020-01-07 Thread John Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.
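
In other words, a call shaped like this is now permitted (a sketch of the
pattern; it mirrors what the VFIO conversion in a later patch does):

	down_read(&mm->mmap_sem);
	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
				    flags | FOLL_LONGTERM, page,
				    NULL, /* locked = */ NULL);
	up_read(&mm->mmap_sem);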

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Kirill A. Shutemov 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 174 +--
 1 file changed, 92 insertions(+), 82 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 5938e29a5a8b..b61bd5c469ae 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -,88 +,6 @@ static __always_inline long 
__get_user_pages_locked(struct task_struct *tsk,
return pages_done;
 }
 
-/*
- * get_user_pages_remote() - pin user pages in memory
- * @tsk:   the task_struct to use for page fault accounting, or
- * NULL if faults are not to be recorded.
- * @mm:mm_struct of target mm
- * @start: starting user address
- * @nr_pages:  number of pages from start to pin
- * @gup_flags: flags modifying lookup behaviour
- * @pages: array that receives pointers to the pages pinned.
- * Should be at least nr_pages long. Or NULL, if caller
- * only intends to ensure the pages are faulted in.
- * @vmas:  array of pointers to vmas corresponding to each page.
- * Or NULL if the caller does not require them.
- * @locked:pointer to lock flag indicating whether lock is held and
- * subsequently whether VM_FAULT_RETRY functionality can be
- * utilised. Lock must initially be held.
- *
- * Returns either number of pages pinned (which may be less than the
- * number requested), or an error. Details about the return value:
- *
- * -- If nr_pages is 0, returns 0.
- * -- If nr_pages is >0, but no pages were pinned, returns -errno.
- * -- If nr_pages is >0, and some pages were pinned, returns the number of
- *pages pinned. Again, this may be less than nr_pages.
- *
- * The caller is responsible for releasing returned @pages, via put_page().
- *
- * @vmas are valid only as long as mmap_sem is held.
- *
- * Must be called with mmap_sem held for read or write.
- *
- * get_user_pages walks a process's page tables and takes a reference to
- * each struct page that each user address corresponds to at a given
- * instant. That is, it takes the page that would be accessed if a user
- * thread accesses the given user virtual address at that instant.
- *
- * This does not guarantee that the page exists in the user mappings when
- * get_user_pages returns, and there may even be a completely different
- * page there in some cases (eg. if mmapped pagecache has been invalidated
- * and subsequently re faulted). However it does guarantee that the page
- * won't be freed completely. And mostly callers simply care that the page
- * contains data that was valid *at some point in time*. Typically, an IO
- * or similar operation cannot guarantee anything stronger anyway because
- * locks can't be held over the syscall boundary.
- *
- * If gup_flags & FOLL_WRITE == 0, the page must not be written to. If the page
- * is written to, set_page_dirty (or set_page_dirty_lock, as appropriate) must
- * be called after the page is finished with, and before put_page is called.
- *
- * get_user_pages is typically used for fewer-copy IO operations, to get a
- * handle on the memory by some means other than accesses via the user virtual
- * addresses. The pages may be submitted for DMA to devices or accessed via
- * their kernel linear mapping (via the kmap APIs). Care should be taken to
- * use the correct cache flushing APIs.
- *
- * See also get_user_pages_fast, for performance critical applications.
- *
- * get_user_pages should be phased out in favor of
- * get_user_pages_locked|unlocked or get_user_pages_fast. Nothing
- * should use get_user_pages because it cannot pass
- * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
- */
-long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
-   unsigned long

[PATCH v12 13/22] IB/{core, hw, umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

2020-01-07 Thread John Hubbard
Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.
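
To spell out the two models side by side (a hedged sketch; variable names
are made up):

	/* Case 3 (ODP): hardware can re-fault, so plain gup plus put_page(): */
	npages = get_user_pages_remote(owning_process, owning_mm, user_virt, 1,
				       flags, local_page_list, NULL, NULL);
	if (npages == 1)
		put_page(local_page_list[0]);

	/* DMA-pinned (non-ODP umem): pin, then release via put_user_page*(): */
	npages = pin_user_pages_fast(cur_base, 1, gup_flags | FOLL_LONGTERM,
				     page_list);
	if (npages == 1)
		put_user_page(page_list[0]);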

Reviewed-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  | 13 ++---
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 214e87aa609d..55daefaa9b88 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,7 +266,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   ret = get_user_pages_fast(cur_base,
+   ret = pin_user_pages_fast(cur_base,
  min_t(unsigned long, npages,
PAGE_SIZE /
sizeof(struct page *)),
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index e42d44e501fd..abc3bb6578cc 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -308,9 +308,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For 
on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
@@ -363,7 +362,7 @@ static int ib_umem_odp_map_dma_single_page(
}
 
 out:
-   put_user_page(page);
+   put_page(page);
return ret;
 }
 
@@ -473,7 +472,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
ret = -EFAULT;
break;
}
-   put_user_page(local_page_list[j]);
+   put_page(local_page_list[j]);
continue;
}
 
@@ -500,8 +499,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
 * ib_umem_odp_map_dma_single_page().
 */
if (npages - (j + 1) > 0)
-   put_user_pages(&local_page_list[j+1],
-  npages - (j + 1));
+   release_pages(&local_page_list[j+1],
+ npages - (j + 1));
break;
}
}
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9a94761765c0 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
int ret;
unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
 
-   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+   ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
if (ret < 0)
return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..8269ab040c21 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
goto out;
}
 
-   ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
+   ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1,
  FOLL_WRITE | FOLL_LONGTERM, pages);
if (ret < 0)
goto out;
diff --g

[PATCH v12 17/22] net/xdp: set FOLL_PIN via pin_user_pages()

2020-01-07 Thread John Hubbard
Convert net/xdp to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Acked-by: Björn Töpel 
Signed-off-by: John Hubbard 
---
 net/xdp/xdp_umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..d071003b5e76 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
 
down_read(&current->mm->mmap_sem);
-   npgs = get_user_pages(umem->address, umem->npgs,
+   npgs = pin_user_pages(umem->address, umem->npgs,
gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
up_read(&current->mm->mmap_sem);
 
-- 
2.24.1



[PATCH v12 18/22] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2020-01-07 Thread John Hubbard
1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..162a2633b1e3 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = pin_user_pages(data & PAGE_MASK, dma->nr_pages,
 flags | FOLL_LONGTERM, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+   dprintk(1, "pin_user_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++) {
-   if (dma->direction == DMA_FROM_DEVICE)
-   set_page_dirty_lock(dma->pages[i]);
-   put_page(dma->pages[i]);
-   }
+   put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.1



[PATCH v12 07/22] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

2020-01-07 Thread John Hubbard
Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnecessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.
So remove it.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 30 +-
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..b800fc9a0251 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -322,7 +322,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
 {
struct page *page[1];
struct vm_area_struct *vma;
-   struct vm_area_struct *vmas[1];
unsigned int flags = 0;
int ret;
 
@@ -330,33 +329,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(&mm->mmap_sem);
-   if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
-   } else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
-   /*
-* The lifetime of a vaddr_get_pfn() page pin is
-* userspace-controlled. In the fs-dax case this could
-* lead to indefinite stalls in filesystem operations.
-* Disallow attempts to pin fs-dax pages via this
-* interface.
-*/
-   if (ret > 0 && vma_is_fsdax(vmas[0])) {
-   ret = -EOPNOTSUPP;
-   put_page(page[0]);
-   }
-   }
-   up_read(&mm->mmap_sem);
-
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-   return 0;
+   ret = 0;
+   goto done;
}
 
-   down_read(&mm->mmap_sem);
-
vaddr = untagged_addr(vaddr);
 
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
@@ -366,7 +346,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
if (is_invalid_reserved_pfn(*pfn))
ret = 0;
}
-
+done:
up_read(&mm->mmap_sem);
return ret;
 }
-- 
2.24.1



[PATCH v12 04/22] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2020-01-07 Thread John Hubbard
An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle
  decrementing the refcount for ZONE_DEVICE pages.

* Change callers (just release_pages() and put_page()) to check
  page_is_devmap_managed() before calling the new
  put_devmap_managed_page() routine. This is a performance
  point: put_page() is a hot path, so we need to avoid non-
  inline function calls where possible.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.

Cc: Christoph Hellwig 
Cc: Kirill A. Shutemov 
Suggested-by: Jérôme Glisse 
Reviewed-by: Dan Williams 
Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 18 +-
 mm/memremap.c  | 15 +--
 mm/swap.c  | 27 ++-
 3 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80a9162b406c..e2032ff640eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page 
*page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page 
*page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -971,11 +971,17 @@ static inline bool put_devmap_managed_page(struct page 
*page)
return false;
 }
 
+void put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool put_devmap_managed_page(struct page *page)
+static inline bool page_is_devmap_managed(struct page *page)
 {
return false;
 }
+
+static inline void put_devmap_managed_page(struct page *page)
+{
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1028,8 +1034,10 @@ static inline void put_page(struct page *page)
 * need to inform the device driver through callback. See
 * include/linux/memremap.h and HMM for details.
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
return;
+   }
 
if (put_page_testzero(page))
__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index f915d074ac20..4c723d2049d5 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-   int count = page_ref_dec_return(page);
-
-   /* still busy */
-   if (count > 1)
-   return;
-
-   /* only triggered by the dev_pagemap shutdown path */
-   if (count == 0) {
-   __put_page(page);
-   return;
-   }
-
/* notify page idle for dax */
if (!is_device_private_page(page)) {
wake_up_var(&page->_refcount);
@@ -461,5 +449,4 @@ void __put_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..cf39d24ada2a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -813,8 +813,10 @@ void release_pages(struct page **pages, int nr)
 * processing, and instead, expect a call to
 * put_page_testzero().
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
continue;
+   }
}

[PATCH v12 01/22] mm/gup: factor out duplicate code from four routines

2020-01-07 Thread John Hubbard
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Jan Kara 
Cc: Kirill A. Shutemov 
Cc: Ira Weiny 
Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 95 
 1 file changed, 40 insertions(+), 55 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..d56c6d6b85d3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1978,6 +1978,29 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, 
unsigned long addr,
 }
 #endif
 
+static int record_subpages(struct page *page, unsigned long addr,
+  unsigned long end, struct page **pages)
+{
+   int nr;
+
+   for (nr = 0; addr != end; addr += PAGE_SIZE)
+   pages[nr++] = page++;
+
+   return nr;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+   VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+   /*
+* Calling put_page() for each ref is unnecessarily slow. Only the last
+* ref needs a put_page().
+*/
+   if (refs > 1)
+   page_ref_sub(page, refs - 1);
+   put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
@@ -2007,32 +2030,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   refs = 0;
head = pte_page(pte);
-
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-   do {
-   VM_BUG_ON(compound_head(page) != head);
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(head, refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-   /* Could be optimized better */
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2079,28 +2090,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
unsigned long addr,
return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pmd_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2120,28 +2122,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
unsigned long addr,
return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pud_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compo

[PATCH v12 03/22] mm: Cleanup __put_devmap_managed_page() vs ->page_free()

2020-01-07 Thread John Hubbard
From: Dan Williams 

After the removal of the device-public infrastructure there are only 2
->page_free() callbacks in the kernel. One of those is a device-private
callback in the nouveau driver, the other is a generic wakeup needed in
the DAX case. In the hopes that all ->page_free() callbacks can be
migrated to common core kernel functionality, move the device-private
specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Cc: Jan Kara 
Cc: Ira Weiny 
Signed-off-by: Dan Williams 
Signed-off-by: John Hubbard 
---
 drivers/nvdimm/pmem.c |  6 
 mm/memremap.c | 80 ---
 2 files changed, 44 insertions(+), 42 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)
-{
-   wake_up_var(&page->_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-   .page_free  = pmem_pagemap_page_free,
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
 };
diff --git a/mm/memremap.c b/mm/memremap.c
index c51c6bd2fe34..f915d074ac20 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(void)
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-   if (!pgmap->ops || !pgmap->ops->page_free) {
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+   (!pgmap->ops || !pgmap->ops->page_free)) {
WARN(1, "Missing page_free method\n");
return -EINVAL;
}
@@ -414,44 +415,51 @@ void __put_devmap_managed_page(struct page *page)
 {
int count = page_ref_dec_return(page);
 
-   /*
-* If refcount is 1 then page is freed and refcount is stable as nobody
-* holds a reference on the page.
-*/
-   if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
+   /* still busy */
+   if (count > 1)
+   return;
 
-   mem_cgroup_uncharge(page);
+   /* only triggered by the dev_pagemap shutdown path */
+   if (count == 0) {
+   __put_page(page);
+   return;
+   }
 
-   /*
-* When a device_private page is freed, the page->mapping field
-* may still contain a (stale) mapping value. For example, the
-* lower bits of page->mapping may still identify the page as
-* an anonymous page. Ultimately, this entire field is just
-* stale and wrong, and it will cause errors if not cleared.
-* One example is:
-*
-*  migrate_vma_pages()
-*migrate_vma_insert_page()
-*  page_add_new_anon_rmap()
-*__page_set_anon_rmap()
-*  ...checks page->mapping, via PageAnon(page) call,
-*and incorrectly concludes that the page is an
-*anonymous page. Therefore, it incorrectly,
-*silently fails to set up the new anon rmap.
-*
-* For other types of ZONE_DEVICE pages, migration is either
-* handled differently or not done at all, so there is no need
-* to clear page->mapping.
-*/
-   if (is_device_private_page(page))
-   page->mapping = NULL;
+   /* notify page idle for dax */
+   if (!is_device_private_page(page)) {
+   wake_up_var(&page->_refcount);
+   return;
+   }
 
-   page->pgmap->ops->page_free(page);
-   } else if (!count)
-   __put_page(page);
+   /* Clear Active bit in case of parallel mark_page_accessed */
+   __ClearPageActive(page);
+   __ClearPageWaiters(page);
+
+   mem_cgroup_uncharge(page);
+
+   /*
+* When a device_private page is freed, the page->mapping field
+* may still contain a (stale) mapping value. For example, the
+* lower bits of page->mapping may still identify the page as an
+* anonymous page. Ultimately, this entire field is just st

[PATCH v12 12/22] goldish_pipe: convert to pin_user_pages() and put_user_page()

2020-01-07 Thread John Hubbard
1. Call the new global pin_user_pages_fast() from goldfish_pin_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index ef50c264db71..2a5901efecde 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int goldfish_pin_pages(unsigned long first_page,
*iter_last_page_size = last_page_size;
}
 
-   ret = get_user_pages_fast(first_page, requested_pages,
+   ret = pin_user_pages_fast(first_page, requested_pages,
  !is_write ? FOLL_WRITE : 0,
  pages);
if (ret <= 0)
@@ -285,18 +285,6 @@ static int goldfish_pin_pages(unsigned long first_page,
return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-  int is_write, s32 consumed_size)
-{
-   int i;
-
-   for (i = 0; i < pages_count; i++) {
-   if (!is_write && consumed_size > 0)
-   set_page_dirty(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
   int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+   put_user_pages_dirty_lock(pipe->pages, pages_count,
+ !is_write && *consumed_size > 0);
 
mutex_unlock(&pipe->lock);
return 0;
-- 
2.24.1



[PATCH v12 10/22] media/v4l2-core: set pages dirty upon releasing DMA buffers

2020-01-07 Thread John Hubbard
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Reviewed-by: Christoph Hellwig 
Acked-by: Hans Verkuil 
Cc: Mauro Carvalho Chehab 
Cc: 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++)
+   for (i = 0; i < dma->nr_pages; i++) {
+   if (dma->direction == DMA_FROM_DEVICE)
+   set_page_dirty_lock(dma->pages[i]);
put_page(dma->pages[i]);
+   }
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.1



[PATCH v12 02/22] mm/gup: move try_get_compound_head() to top, fix minor issues

2020-01-07 Thread John Hubbard
An upcoming patch uses try_get_compound_head() more widely,
so move it to the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index d56c6d6b85d3..5938e29a5a8b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+   struct page *head = compound_head(page);
+
+   if (WARN_ON_ONCE(page_ref_count(head) < 0))
+   return NULL;
+   if (unlikely(!page_cache_add_speculative(head, refs)))
+   return NULL;
+   return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1807,20 +1822,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int 
nr_start,
}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-   struct page *head = compound_head(page);
-   if (WARN_ON_ONCE(page_ref_count(head) < 0))
-   return NULL;
-   if (unlikely(!page_cache_add_speculative(head, refs)))
-   return NULL;
-   return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
-- 
2.24.1



[PATCH v12 20/22] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2020-01-07 Thread John Hubbard
1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..a86547822034 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
 
-   ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+   ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
mem->hpages + entry, NULL);
if (ret == n) {
@@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
return 0;
 
 free_exit:
-   /* free the reference taken */
-   for (i = 0; i < pinned; i++)
-   put_page(mem->hpages[i]);
+   /* free the references taken */
+   put_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_page(page);
+   put_user_page(page);
+
mem->hpas[i] = 0;
}
 }
-- 
2.24.1



[PATCH v12 19/22] vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2020-01-07 Thread John Hubbard
1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b800fc9a0251..18bfc2fc8e6d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
if (!is_invalid_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
-   if (prot & IOMMU_WRITE)
-   SetPageDirty(page);
-   put_page(page);
+
+   put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
return 1;
}
return 0;
@@ -329,7 +328,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(&mm->mmap_sem);
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-- 
2.24.1



[PATCH v12 15/22] drm/via: set FOLL_PIN via pin_user_pages_fast()

2020-01-07 Thread John Hubbard
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().
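
As a minimal sketch of that model (an illustrative caller, not this
driver's code; helper names are as of this series, before the later
unpin_user_page*() rename):

#include <linux/mm.h>

/*
 * Illustrative only: pages acquired via pin_user_pages*() must be
 * released with put_user_page*(), never with plain put_page().
 */
static int demo_pin_and_release(unsigned long uaddr, int nr_pages,
				struct page **pages)
{
	int i, got;

	got = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (got <= 0)
		return got ? got : -EFAULT;

	/* ... hand the pages to the device for DMA ... */

	for (i = 0; i < got; i++)
		put_user_page(pages[i]);	/* pairs with FOLL_PIN */
	return 0;
}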

Acked-by: Daniel Vetter 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer)
vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
if (NULL == vsg->pages)
return -ENOMEM;
-   ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+   ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
vsg->pages);
-- 
2.24.1



[PATCH v12 21/22] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

2020-01-07 Thread John Hubbard
Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
instead of a hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing
it once before issuing any of the get_user_pages*() calls. This
makes it harder to overlook, instead of having little "gup_flags & 1"
phrases in the function calls.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 9 ++---
 tools/testing/selftests/vm/gup_benchmark.c | 6 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index ad9d5b1c4473..8dba38e79a9f 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -49,18 +49,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
+   /* Filter out most gup flags: only allow a tiny subset here: */
+   gup->flags &= FOLL_WRITE;
+
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+   nr = get_user_pages_fast(addr, nr, gup->flags,
 pages + i);
break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages(addr, nr,
-   (gup->flags & 1) | FOLL_LONGTERM,
+   gup->flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case GUP_BENCHMARK:
-   nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+   nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
default:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..389327e9b30a 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01 /* check pte is writable */
+
 struct gup_benchmark {
__u64 get_delta_usec;
__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
}
 
gup.nr_pages_per_call = nr_pages;
-   gup.flags = write;
+   if (write)
+   gup.flags |= FOLL_WRITE;
 
fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
if (fd == -1)
-- 
2.24.1



[PATCH v12 05/22] goldish_pipe: rename local pin_user_pages() routine

2020-01-07 Thread John Hubbard
1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..ef50c264db71 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
}
 }
 
-static int pin_user_pages(unsigned long first_page,
- unsigned long last_page,
- unsigned int last_page_size,
- int is_write,
- struct page *pages[MAX_BUFFERS_PER_COMMAND],
- unsigned int *iter_last_page_size)
+static int goldfish_pin_pages(unsigned long first_page,
+ unsigned long last_page,
+ unsigned int last_page_size,
+ int is_write,
+ struct page *pages[MAX_BUFFERS_PER_COMMAND],
+ unsigned int *iter_last_page_size)
 {
int ret;
int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
	if (mutex_lock_interruptible(&pipe->lock))
return -ERESTARTSYS;
 
-   pages_count = pin_user_pages(first_page, last_page,
-last_page_size, is_write,
-pipe->pages, &iter_last_page_size);
+   pages_count = goldfish_pin_pages(first_page, last_page,
+last_page_size, is_write,
+pipe->pages, &iter_last_page_size);
if (pages_count < 0) {
	mutex_unlock(&pipe->lock);
return pages_count;
-- 
2.24.1



[PATCH v12 00/22] mm/gup: prereqs to track dma-pinned pages: FOLL_PIN

2020-01-07 Thread John Hubbard
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): 
https://lwn.net/Articles/753027/


Dan Williams (1):
  mm: Cleanup __put_devmap_managed_page() vs ->page_free()

John Hubbard (21):
  mm/gup: factor out duplicate code from four routines
  mm/gup: move try_get_compound_head() to top, fix minor issues
  mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  goldish_pipe: rename local pin_user_pages() routine
  mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
  mm/gup: allow FOLL_FORCE for get_user_pages_fast()
  IB/umem: use get_user_pages_fast() to pin DMA pages
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldish_pipe: convert to pin_user_pages() and put_user_page()
  IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
  vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
  mm, tree-wide: rename put_user_page*() to unpin_user_page*()

 Documentation/core-api/index.rst|   1 +
 Documentation/core-api/pin_user_pages.rst   | 232 +
 arch/powerpc/mm/book3s64/iommu_api.c|  10 +-
 drivers/gpu/drm/via/via_dmablit.c   |   6 +-
 drivers/infiniband/core/umem.c  |  19 +-
 drivers/infiniband/core/umem_odp.c  |  13 +-
 drivers/infiniband/hw/hfi1/user_pages.c |   4 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   8 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   4 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   8 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|   4 +-
 drivers/infiniband/sw/siw/siw_mem.c |   4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   8 +-
 drivers/nvdimm/pmem.c   |   6 -
 drivers/platform/goldfish/goldfish_pipe.c   |  35 +-
 drivers/vfio/vfio_iommu_type1.c |  35 +-
 fs/io_uring.c   |   6 +-
 include/linux/mm.h  |  95 +++-
 mm/gup.c| 495 
 mm/gup_benchmark.c  |   9 +-
 mm/memremap.c   |  75 ++-
 mm/process_vm_access.c  |  28 +-
 mm/swap.c   |  27 +-
 net/xdp/xdp_umem.c  |   4 +-
 tools/testing/selftests/vm/gup_benchmark.c  |   6 +-
 25 files changed, 762 insertions(+), 380 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

-- 
2.24.1



Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2020-01-06 Thread John Hubbard

On 1/6/20 1:01 AM, Jan Kara wrote:
...

Also, looking ahead:

a) if the problem disappears with the latest above test, then we likely have
a huge page refcount overflow, and there are a couple of different ways to
fix it.

b) if it still reproduces with the above, then it's some other random mistake,
and in that case I'd be inclined to do a sort of guided (or classic, 
unguided)
git bisect of the series. Because it could be any of several patches.

If that's too much trouble, then I'd have to fall back to submitting a few
patches at a time and working my way up to the tracking patch...


It could also be that an ordinary page reference is dropped with 'unpin'
thus underflowing the page refcount...

Honza



Yes.

And, I think I'm about out of time for this release cycle, so I'm probably
going to submit the prerequisite patches (patches 1-10, or more boldly,
1-22), as candidates for 5.6.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-28 Thread John Hubbard
On 12/27/19 1:56 PM, John Hubbard wrote:
...
>> It is ancient verification test (~10y) which is not an easy task to
>> make it understandable and standalone :).
>>
> 
> Is this the only test that fails, btw? No other test failures or hints of
> problems?
> 
> (Also, maybe hopeless, but can *anyone* on the RDMA list provide some
> characterization of the test, such as how many pins per page, what page
> sizes are used? I'm still hoping to write a test to trigger something
> close to this...)
> 
> I do have a couple more ideas for test runs:
> 
> 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
> page->_refcount into a no-op, and so if all is well (it may not be!) with the
> rest of the patch, then we'd expect this problem to not reappear.
> 
> 2. Activate /proc/vmstat *foll_pin* statistics unconditionally (just for these
> tests, of course), so we can see if there is a get/put mismatch. However, that
> will change the timing, and so it must be attempted independently of (1), in
> order to see if it ends up hiding the repro.
> 
> I've updated this branch to implement (1), but not (2), hoping you can give
> this one a spin?
> 
>     git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> 
> 

Also, looking ahead:

a) if the problem disappears with the latest above test, then we likely have
   a huge page refcount overflow, and there are a couple of different ways to
   fix it. 

b) if it still reproduces with the above, then it's some other random mistake,
   and in that case I'd be inclined to do a sort of guided (or classic, 
unguided)
   git bisect of the series. Because it could be any of several patches.

   If that's too much trouble, then I'd have to fall back to submitting a few
   patches at a time and working my way up to the tracking patch...


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-27 Thread John Hubbard

On 12/24/19 9:26 PM, Leon Romanovsky wrote:
...

The branch is here (I just tested it and it seems healthy):

git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags


Hi,

We tested the following branch and here comes results:


Thanks for this testing run!


[root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
nr_foll_pin_requested 0
nr_foll_pin_returned 0



Zero pinned pages!


Maybe we are missing some CONFIG_* option?
https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714...@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f



Ah OK, it must be that CONFIG_DEBUG_VM is not set, thanks!



...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
not happening. And although the backtraces below show some of my new
routines (like try_grab_page), they also confirm the above: there is no
pin_user_page*() call in the stack.

In particular, it looks like ib_umem_get() is calling through to
get_user_pages*(), rather than pin_user_pages*(). I don't see how this
is possible, because the code on my screen shows ib_umem_get() calling
pin_user_pages_fast().



It must be that pin_user_pages() is in the call stack, but just not getting
printed. There's no other way to explain this.


Any thoughts or ideas are welcome here.

However, glossing over all of that and assuming that the new
GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
see any overflow. I'm less confident now that this is a true refcount
overflow.


Earlier in this email thread, I posted possible function call chain which
doesn't involve refcount overflow, but for some reason the refcount
overflow was chosen as a way to explore.



Well, both of the WARN() calls are asserting that the refcount went negative
(well, one asserts negative, and the other asserts "<=0"). So it's pretty
hard to talk our way out of a refcount overflow here.



Also, any information that would get me closer to being able to attempt
my own reproduction of the problem are *very* welcome. :)


It is ancient verification test (~10y) which is not an easy task to
make it understandable and standalone :).



Is this the only test that fails, btw? No other test failures or hints of
problems?

(Also, maybe hopeless, but can *anyone* on the RDMA list provide some
characterization of the test, such as how many pins per page, what page
sizes are used? I'm still hoping to write a test to trigger something
close to this...)

I do have a couple more ideas for test runs:

1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
page->_refcount into a no-op, and so if all is well (it may not be!) with the
rest of the patch, then we'd expect this problem to not reappear.

2. Activate /proc/vmstat *foll_pin* statistics unconditionally (just for these
tests, of course), so we can see if there is a get/put mismatch. However, that
will change the timing, and so it must be attempted independently of (1), in
order to see if it ends up hiding the repro.

I've updated this branch to implement (1), but not (2), hoping you can give
this one a spin?

git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags


thanks,
--
John Hubbard
NVIDIA





[root@serer consume_mtts]# (master) $ dmesg
[  425.221459] [ cut here ]
[  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 
try_grab_compound_head+0x90/0xa0
[  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp 
pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel 
rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp 
scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm 
ib_cm ib_core [last unloaded: mlxfw]
[  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G   O  
5.5.0-rc2+ #1
[  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
[  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f 
b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 
90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
[  425.245739] RSP: 0018:c96878a8 EFLAGS: 00010082
[  425.247124] RAX: 8001 RBX: 7f780488a000 RCX: 0bb0
[  425.248956] RDX: ea000e031087 RSI: 8a00 RDI: ea000dc58000
[  425.250761] RBP: ea000e031080 R08: c9687974 R09: 000fffe0
[  425.252661] R10:  R11: 88836256 R12: 008a
[  425.254487] R13: 8003716000e7 R14: 7f780488a000 R15: c9687974
[  425.256309] FS:  7f780d9d3740() GS:8883b1c8() 
knlGS:
[  425.258401] CS:  0010 DS:  ES:  CR0: 80050033
[  425.259949] CR2: 02334048 CR3: 00039c68c001 CR4: 001606a0
[  425.261884] Call Trace:

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-24 Thread John Hubbard

On 12/22/19 5:23 AM, Leon Romanovsky wrote:

On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:

On 12/20/19 10:29 AM, Leon Romanovsky wrote:
...

$ ./build.sh
$ build/bin/run_tests.py

If you get things that far I think Leon can get a reproduction for you


I'm not so optimistic about that.



OK, I'm going to proceed for now on the assumption that I've got an overflow
problem that happens when huge pages are pinned. If I can get more information,
great, otherwise it's probably enough.

One thing: for your repro, if you know the huge page size, and the system
page size for that case, that would really help. Also the number of pins per
page, more or less, that you'd expect. Because Jason says that only 2M huge
pages are used...

Because the other possibility is that the refcount really is going negative,
likely due to a mismatched pin/unpin somehow.

If there's not an obvious repro case available, but you do have one (is it easy
to repro, though?), then *if* you have the time, I could point you to a github
branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb44c4d2ada7..8526fd03b978 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
   * get_user_pages and page_mkclean and other calls that race to set up page
   * table entries.
   */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 8)

  void unpin_user_page(struct page *page);
  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

If that fails to repro, then we would be zeroing in on the root cause.

The branch is here (I just tested it and it seems healthy):

git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags


Hi,

We tested the following branch and here comes results:


Thanks for this testing run!


[root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
nr_foll_pin_requested 0
nr_foll_pin_returned 0



Zero pinned pages!

...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
not happening. And although the backtraces below show some of my new
routines (like try_grab_page), they also confirm the above: there is no
pin_user_page*() call in the stack.

In particular, it looks like ib_umem_get() is calling through to
get_user_pages*(), rather than pin_user_pages*(). I don't see how this
is possible, because the code on my screen shows ib_umem_get() calling
pin_user_pages_fast().

Any thoughts or ideas are welcome here.

However, glossing over all of that and assuming that the new
GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
see any overflow. I'm less confident now that this is a true refcount
overflow.

Also, any information that would get me closer to being able to attempt
my own reproduction of the problem are *very* welcome. :)

thanks,
--
John Hubbard
NVIDIA


[root@serer consume_mtts]# (master) $ dmesg
[  425.221459] [ cut here ]
[  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 
try_grab_compound_head+0x90/0xa0
[  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp 
pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel 
rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp 
scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm 
ib_cm ib_core [last unloaded: mlxfw]
[  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G   O  
5.5.0-rc2+ #1
[  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
[  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f 
b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 
90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
[  425.245739] RSP: 0018:c96878a8 EFLAGS: 00010082
[  425.247124] RAX: 8001 RBX: 7f780488a000 RCX: 0bb0
[  425.248956] RDX: ea000e031087 RSI: 8a00 RDI: ea000dc58000
[  425.250761] RBP: ea000e031080 R08: c9687974 R09: 000fffe0
[  425.252661] R10:  R11: 88836256 R12: 008a
[  425.254487] R13: 8003716000e7 R14: 7f780488a000 R15: c9687974
[  425.256309] FS:  7f780d9d3740() GS:8883b1c8() 
knlGS:
[  425.258401] CS:  0010 DS:  ES:  CR0: 80050033
[  425.259949] CR2: 02334048 CR3: 00039c68c001 CR4: 001606a0
[  425.261884] Call Trace:
[  425.262735]  gup_pgd_range+0x517/0x5a0
[  425.263819]  internal_get_user_pages_fast+0x210/0x250
[  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
[  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
[  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
[  425.26913

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-21 Thread John Hubbard

On 12/21/19 2:08 AM, Leon Romanovsky wrote:

On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:

On 12/20/19 10:29 AM, Leon Romanovsky wrote:
...

$ ./build.sh
$ build/bin/run_tests.py

If you get things that far I think Leon can get a reproduction for you


I'm not so optimistic about that.



OK, I'm going to proceed for now on the assumption that I've got an overflow
problem that happens when huge pages are pinned. If I can get more information,
great, otherwise it's probably enough.

One thing: for your repro, if you know the huge page size, and the system
page size for that case, that would really help. Also the number of pins per
page, more or less, that you'd expect. Because Jason says that only 2M huge
pages are used...

Because the other possibility is that the refcount really is going negative,
likely due to a mismatched pin/unpin somehow.

If there's not an obvious repro case available, but you do have one (is it easy
to repro, though?), then *if* you have the time, I could point you to a github
branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:


I'll see what I can do this Sunday.



The other data point that might shed light on whether it's a mismatch (this only
works if the system is not actually crashing, though), is checking the new
vmstat items, like this:

$ grep foll_pin /proc/vmstat
nr_foll_pin_requested 16288188
nr_foll_pin_returned 16288188

...but OTOH, if you've got long-term pins, then those are *supposed* to be
mismatched, so it only really helps in between tests.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 4:51 PM, Dan Williams wrote:
> On Fri, Dec 20, 2019 at 4:41 PM John Hubbard  wrote:
>>
>> On 12/20/19 4:33 PM, Dan Williams wrote:
>> ...
>>>> I believe there might be also a different solution for this: For
>>>> transparent huge pages, we could find a space in 'struct page' of the
>>>> second page in the huge page for proper pin counter and just account pins
>>>> there so we'd have full width of 32-bits for it.
>>>
>>> That would require THP accounting for dax pages. It is something that
>>> was probably going to be needed, but this would seem to force the
>>> issue.
>>>
>>
>> Thanks for mentioning that, it wasn't obvious to me yet.
>>
>> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
>> test setup? I'd hate to go into this without having that coverage up
>> and running. It's been sketchy enough as it is. :)
> 
> You too can have the power of the gods for the low low price of a
> kernel command line parameter, or a qemu setup.
> 
> Details here:
> 
> https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
> https://nvdimm.wiki.kernel.org/pmem_in_qemu
> 

Sweet! Now I can really cause some damage. :)

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 4:33 PM, Dan Williams wrote:
...
>> I believe there might be also a different solution for this: For
>> transparent huge pages, we could find a space in 'struct page' of the
>> second page in the huge page for proper pin counter and just account pins
>> there so we'd have full width of 32-bits for it.
> 
> That would require THP accounting for dax pages. It is something that
> was probably going to be needed, but this would seem to force the
> issue.
> 

Thanks for mentioning that, it wasn't obvious to me yet. 

How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
test setup? I'd hate to go into this without having that coverage up
and running. It's been sketchy enough as it is. :)

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 1:21 AM, Jan Kara wrote:
...
>> So, ideas and next steps:
>>
>> 1. Assuming that you *are* hitting this, I think I may have to fall back to
>> implementing the "deferred" part of this design, as part of this series, 
>> after
>> all. That means:
>>
>>   For the pin/unpin calls at least, stop treating all pages as if they are
>>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>>   That's not how it works now, and the need to hand back a huge array of
>>   subpages is part of the problem. This affects the callers too, so it's not
>>   a super quick change to make. (I was really hoping not to have to do this
>>   yet.)
> 
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
> 

Well, pretty much yes. It's really just the pin_user_pages*() callers, but
the internals, follow_page() and such, are so interconnected right now that
it would probably blow up into a huge effort, as you point out.

> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.
> 
>   Honza
> 

OK, let me pursue that. Given that I shouldn't need to handle pages
splitting, it should be not *too* bad.
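
For illustration only, here's a sketch of where such a counter could
live (the field and helper names below are hypothetical, not something
this series adds):

/* Hypothetical: a dedicated pin count in the second page of a compound
 * page, so pins stop consuming bits of the head page's _refcount.
 */
static inline atomic_t *compound_pin_count_ptr(struct page *head)
{
	return &head[1].hpage_pin_count;	/* hypothetical field */
}

static inline void hpage_pin(struct page *head)
{
	atomic_inc(compound_pin_count_ptr(head));
}

static inline void hpage_unpin(struct page *head)
{
	atomic_dec(compound_pin_count_ptr(head));
}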

I am starting to think that I should just post the first 9 or so 
prerequisite patches (first 9 patches, plus the v4l2 fix that arguably should 
have been earlier in the sequence I guess), as 5.6 candidates, while I go
back to the drawing board here. 

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 10:29 AM, Leon Romanovsky wrote:
...
>> $ ./build.sh
>> $ build/bin/run_tests.py
>>
>> If you get things that far I think Leon can get a reproduction for you
> 
> I'm not so optimistic about that.
> 

OK, I'm going to proceed for now on the assumption that I've got an overflow
problem that happens when huge pages are pinned. If I can get more information,
great, otherwise it's probably enough.

One thing: for your repro, if you know the huge page size, and the system
page size for that case, that would really help. Also the number of pins per
page, more or less, that you'd expect. Because Jason says that only 2M huge 
pages are used...

Because the other possibility is that the refcount really is going negative, 
likely due to a mismatched pin/unpin somehow.

If there's not an obvious repro case available, but you do have one (is it easy
to repro, though?), then *if* you have the time, I could point you to a github
branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb44c4d2ada7..8526fd03b978 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 8)
 
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

If that fails to repro, then we would be zeroing in on the root cause. 

The branch is here (I just tested it and it seems healthy):

git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags



thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 10:48 AM, Leon Romanovsky wrote:
...
>> test_query_qp (tests.test_qp.QPTest) ... ok
>> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No 
>> devices with net interface'
>>
>> ==
>> FAIL: test_query_port (tests.test_device.DeviceTest)
>> --
>> Traceback (most recent call last):
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in 
>> test_query_port
>> self.verify_port_attr(port_attr)
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in 
>> verify_port_attr
>> assert 'Invalid' not in d.speed_to_str(attr.active_speed)
>> AssertionError
> 
> I'm very curious how did you get this assert "d.speed_to_str" covers all
> known speeds according to the IBTA.
> 

Hi Leon,

Short answer: I can make that one pass, with a small fix to the rdma-core test
suite:

commit a1b9fb0846e1b2356d7a16f4fbdd1960cf8dcbe5 (HEAD -> fix_speed_to_str)
Author: John Hubbard 
Date:   Fri Dec 20 15:07:47 2019 -0800

device: fix speed_to_str(), to handle disabled links

For disabled links, the raw speed token is 0. However,
speed_to_str() doesn't have that in the list. This leads
to an assertion when running tests (test_query_port) when
one link is down and other link(s) are up.

Fix this by returning '(Disabled/down)' for the zero speed
case.

diff --git a/pyverbs/device.pyx b/pyverbs/device.pyx
index 33d133fd..f8b7826b 100755
--- a/pyverbs/device.pyx
+++ b/pyverbs/device.pyx
@@ -923,8 +923,8 @@ def width_to_str(width):
 
 
 def speed_to_str(speed):
-l = {1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps', 8: '10.0 Gbps',
- 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
+l = {0: '(Disabled/down)', 1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps',
+ 8: '10.0 Gbps', 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
 try:
 return '{s} ({n})'.format(s=l[speed], n=speed)
 except KeyError:


Longer answer:
==

It looks like this test suite assumes that every link is connected! (Probably
in most test systems, they are.) But in my setup, the ConnectX cards each have
two slots, and I only have (and only need) one cable. So one link is up, and
the other is disabled. 

This leads to the other problem, which is that if a link is disabled, the
test suite finds a "0" token for attr.active_speed. That token is not in the
approved list, and so d.speed_to_str() asserts.

With some diagnostics added, I can see it checking each link: one passes, and
the other asserts:

diff --git a/tests/test_device.py b/tests/test_device.py
index 524e0e89..7b33d7db 100644
--- a/tests/test_device.py
+++ b/tests/test_device.py
@@ -110,6 +110,12 @@ class DeviceTest(unittest.TestCase):
 assert 'Invalid' not in d.translate_mtu(attr.max_mtu)
 assert 'Invalid' not in d.translate_mtu(attr.active_mtu)
 assert 'Invalid' not in d.width_to_str(attr.active_width)
+print("")
+print('Diagnostics ===')
+print('phys_state:', d.phys_state_to_str(attr.phys_state))
+print('active_width): ', d.width_to_str(attr.active_width))
+print('active_speed:  ',   d.speed_to_str(attr.active_speed))
+print('END of Diagnostics ')
 assert 'Invalid' not in d.speed_to_str(attr.active_speed)
 assert 'Invalid' not in d.translate_link_layer(attr.link_layer)
 assert attr.max_msg_sz > 0x1000

...and the test run from that is:

# ./build/bin/run_tests.py --verbose tests.test_device.DeviceTest
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ... 
Diagnostics ===
phys_state: Link up (5)
active_width):  4X (2)
active_speed:   25.0 Gbps (32)
END of Diagnostics 

Diagnostics ===
phys_state: Disabled (3)
active_width):  4X (2)
active_speed:   Invalid speed
END of Diagnostics 
FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok

==
FAIL: test_query_port (tests.test_device.DeviceTest)
--
Traceback (most recent call last):
  File "/kernel_work/rdma-core/tests/test_device.py", 

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread John Hubbard
self.verify_port_attr(port_attr)
  File "/kernel_work/rdma-core/tests/test_device.py", line 113, in 
verify_port_attr
assert 'Invalid' not in d.speed_to_str(attr.active_speed)
AssertionError

--
Ran 67 tests in 10.058s

FAILED (failures=1, skipped=3)


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread John Hubbard

On 12/19/19 1:07 PM, Jason Gunthorpe wrote:

On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:

On 12/19/19 5:26 AM, Leon Romanovsky wrote:

On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:

Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.


Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated single memory block and registered multiple MRs using the single
block.

The possible bad flow is:
  ib_umem_get() ->
    pin_user_pages_fast(FOLL_WRITE) ->
      internal_get_user_pages_fast(FOLL_WRITE) ->
        gup_pgd_range() ->
          gup_huge_pd() ->
            gup_hugepte() ->
              try_grab_compound_head() ->


Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.


Considering that establishing these pins is entirely under user
control, we can't have a limit here.


There's already a limit, it's just a much larger one. :) What does "no limit"
really mean, numerically, to you in this case?



If the number of allowed pins are exhausted then the
pin_user_pages_fast() must fail back to the user.



I'll poke around the IB call stack and see how much of that return path
is in place, if any. Because it's the same situation for get_user_pages_fast().
This code just added a warning on overflow so we could spot it early.




3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
test setup, but I have done only the tiniest bit of user space IB coding, so
if you have any test programs that aren't too hard to deal with that could
possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
that I'm not an advanced IB programmer. At all. :)


Clone this:

https://github.com/linux-rdma/rdma-core.git

Install all the required deps to build it (notably cython), see the README.md

$ ./build.sh
$ build/bin/run_tests.py

If you get things that far I think Leon can get a reproduction for you



OK, here goes.

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread John Hubbard

On 12/19/19 5:26 AM, Leon Romanovsky wrote:

On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:

Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.


Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated single memory block and registered multiple MRs using the single
block.

The possible bad flow is:
  ib_umem_get() ->
    pin_user_pages_fast(FOLL_WRITE) ->
      internal_get_user_pages_fast(FOLL_WRITE) ->
        gup_pgd_range() ->
          gup_huge_pd() ->
            gup_hugepte() ->
              try_grab_compound_head() ->


Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
Do you have any idea how many pins (repeated pins on the same page, which
it sounds like you have) might be involved in your test case,
and the huge page and system page sizes? That would allow calculating
if we're likely overflowing for that reason.
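
For concreteness, here's the back-of-the-envelope arithmetic behind the
"about 7 pins" figure above (assuming a 32-bit page->_refcount, 4KB base
pages, and this series' GUP_PIN_COUNTING_BIAS of 1024):

#include <stdio.h>

int main(void)
{
	long long refcount_max = 0x7fffffffLL;		/* 2^31 - 1 */
	long long subpages = 1LL << 18;			/* 1GB / 4KB */
	long long bias = 1LL << 10;			/* GUP_PIN_COUNTING_BIAS */
	long long refs_per_pin = subpages * bias;	/* 2^28 per full pin */

	printf("refs added per full pin: %lld\n", refs_per_pin);
	printf("max pins before overflow: %lld\n",
	       refcount_max / refs_per_pin);		/* prints 7 */
	return 0;
}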

So, ideas and next steps:

1. Assuming that you *are* hitting this, I think I may have to fall back to
implementing the "deferred" part of this design, as part of this series, after
all. That means:

  For the pin/unpin calls at least, stop treating all pages as if they are
  a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
  That's not how it works now, and the need to hand back a huge array of
  subpages is part of the problem. This affects the callers too, so it's not
  a super quick change to make. (I was really hoping not to have to do this
  yet.)

2. OR, maybe if you're *close* to the overflow, I could buy some development
time by moving the boundary between pinned vs get_page() refcounts, by
reducing GUP_PIN_COUNTING_BIAS. That's less robust, but I don't want
to rule it out just yet. After all, 1024 is a big chunk to count up with,
and even if get_page() calls add up to, say, 512 refs on a page, it's still
just a false positive on page_dma_pinned() (see the sketch after this list).
And false positives, if transient, are OK.

3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
test setup, but I have done only the tiniest bit of user space IB coding, so
if you have any test programs that aren't too hard to deal with that could
possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
that I'm not an advanced IB programmer. At all. :)

4. (minor note to self) This also uncovers a minor weakness in diagnostics:
there's no page dump in these reports, because I chickened out and didn't
include my WARN_ONCE_PAGE() macro that would have provided it. Although,
even without it, it's obvious that this is a page overflow.
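
And here is the sketch referenced in idea (2) above: the dma-pinned
check, in roughly the shape this series gives it. Up to BIAS-1 stray
get_page() refs can masquerade as a pin (the false positive), but a
real pin can never be missed:

static inline bool page_dma_pinned(struct page *page)
{
	return page_ref_count(compound_head(page)) >=
		GUP_PIN_COUNTING_BIAS;
}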


thanks,
--
John Hubbard
NVIDIA




static __maybe_unused struct page *try_grab_compound_head(struct page *page,
							   int refs,
							   unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_compound_head(page, refs);
	else if (flags & FOLL_PIN)
		return try_pin_compound_head(page, refs);

	WARN_ON_ONCE(1);
	return NULL;
}

# (master) $ dmesg
[10924.70] mlx5_core :00:08.0 eth2: Link up
[10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[10960.902254] [ cut here ]
[10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 
try_grab_compound_head+0x92/0xd0
[10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss 
nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt 
target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm 
ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc 
virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 
ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core 
sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last 
unloaded: mlxkvl]
[10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G   OE 
5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
[10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0
[10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 cb 

Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-18 Thread John Hubbard

On 12/18/19 10:52 PM, Dan Williams wrote:

On Wed, Dec 18, 2019 at 9:51 PM John Hubbard  wrote:


On 12/18/19 9:27 PM, Dan Williams wrote:
...

@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
  page->mapping = NULL;
  page->pgmap->ops->page_free(page);
   }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);


This patch does not have a module consumer for
free_devmap_managed_page(), so the export should move to the patch
that needs the new export.


Hi Dan,

OK, I know that's a policy--although it seems quite pointless here given
that this is definitely going to need an EXPORT.

At the moment, the series doesn't use it in any module at all, so I'll just
delete the EXPORT for now.



Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL
instead of EXPORT_SYMBOL_GPL is that there was no practical way to
hide the devmap details from every module in the kernel that did
put_page(). I would expect free_devmap_managed_page() to
EXPORT_SYMBOL_GPL if it is not inlined into an existing exported
static inline api.



Sure, I'll change it to EXPORT_SYMBOL_GPL when the time comes. We do have
to be careful that we don't shut out normal put_page() types of callers,
but...glancing through the current callers, that doesn't look to be a problem.
Good. So it should be OK to do EXPORT_SYMBOL_GPL here.

Are you *sure* you don't want to just pre-emptively EXPORT now, and save
looking at it again?


I'm positive. There is enough history for "trust me the consumer is
coming" turning out not to be true to justify the hassle in my mind. I
do trust you, but things happen.



OK, it's deleted locally. Thanks for looking at the patch. I'll post a v12 
series
that includes the change, once it looks like reviews are slowing down.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-18 Thread John Hubbard

On 12/18/19 9:27 PM, Dan Williams wrote:
...

@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
 page->mapping = NULL;
 page->pgmap->ops->page_free(page);
  }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);


This patch does not have a module consumer for
free_devmap_managed_page(), so the export should move to the patch
that needs the new export.


Hi Dan,

OK, I know that's a policy--although it seems quite pointless here given
that this is definitely going to need an EXPORT.

At the moment, the series doesn't use it in any module at all, so I'll just
delete the EXPORT for now.



Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL
instead of EXPORT_SYMBOL_GPL is that there was no practical way to
hide the devmap details from every module in the kernel that did
put_page(). I would expect free_devmap_managed_page() to
EXPORT_SYMBOL_GPL if it is not inlined into an existing exported
static inline api.



Sure, I'll change it to EXPORT_SYMBOL_GPL when the time comes. We do have
to be careful that we don't shut out normal put_page() types of callers,
but...glancing through the current callers, that doesn't look to be a problem.
Good. So it should be OK to do EXPORT_SYMBOL_GPL here.

Are you *sure* you don't want to just pre-emptively EXPORT now, and save
looking at it again?

thanks,
--
John Hubbard
NVIDIA


[PATCH v12] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-18 Thread John Hubbard
An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle
  decrementing the refcount for ZONE_DEVICE pages.

* Change callers (just release_pages() and put_page()) to check
  page_is_devmap_managed() before calling the new
  put_devmap_managed_page() routine. This is a performance
  point: put_page() is a hot path, so we need to avoid non-
  inline function calls where possible.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.

Cc: Christoph Hellwig 
Suggested-by: Jérôme Glisse 
Reviewed-by: Dan Williams 
Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 18 +-
 mm/memremap.c  | 16 ++--
 mm/swap.c  | 27 ++-
 3 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c97ea3b694e6..87b54126e46d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
	if (!static_branch_unlikely(&devmap_managed_key))
return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page *page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -971,11 +971,17 @@ static inline bool put_devmap_managed_page(struct page *page)
return false;
 }
 
+void put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool put_devmap_managed_page(struct page *page)
+static inline bool page_is_devmap_managed(struct page *page)
 {
return false;
 }
+
+static inline void put_devmap_managed_page(struct page *page)
+{
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1028,8 +1034,10 @@ static inline void put_page(struct page *page)
 * need to inform the device driver through callback. See
 * include/linux/memremap.h and HMM for details.
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
return;
+   }
 
if (put_page_testzero(page))
__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index e899fa876a62..2ba773859031 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-   int count = page_ref_dec_return(page);
-
-   /* still busy */
-   if (count > 1)
-   return;
-
-   /* only triggered by the dev_pagemap shutdown path */
-   if (count == 0) {
-   __put_page(page);
-   return;
-   }
-
/* notify page idle for dax */
if (!is_device_private_page(page)) {
	wake_up_var(&page->_refcount);
@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..cf39d24ada2a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -813,8 +813,10 @@ void release_pages(struct page **pages, int nr)
 * processing, and instead, expect a call to
 * put_page_testzero().
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
continue;
+ 

Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-18 Thread John Hubbard

On 12/18/19 8:04 AM, Kirill A. Shutemov wrote:

On Mon, Dec 16, 2019 at 02:25:16PM -0800, John Hubbard wrote:

An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
   and limit the functionality to "read only": return a bool,
   with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
   what kind of page it is, and what kind of refcount handling it
   requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
   and limit the functionality to unconditionally freeing a devmap
   page.


What the reason to separate put_devmap_managed_page() from
free_devmap_managed_page() if free_devmap_managed_page() has exacly one
caller? Is it preparation for the next patches?



Yes. A later patch, #23, adds another caller: 
__unpin_devmap_managed_user_page().

...

@@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page *page)
return false;
  }
  
+bool put_devmap_managed_page(struct page *page);
+
  #else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool page_is_devmap_managed(struct page *page)
+{
+   return false;
+}
+
  static inline bool put_devmap_managed_page(struct page *page)
  {
return false;
@@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
 * need to inform the device driver through callback. See
 * include/linux/memremap.h and HMM for details.
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);


put_devmap_managed_page() has yet another page_is_devmap_managed() check
inside. It looks strange.



Good point, it's an extra unnecessary check. So to clean it up, I'll note
that the "if" check is required here in put_page(), in order to stay out of
non-inlined function calls in the hot path (put_page()). So I'll do the
following:

* Leave the above code as it is here

* Simplify put_devmap_managed_page(), it was trying to do two separate things,
  and those two things have different requirements. So change it to a void
  function, with a WARN_ON_ONCE to assert that page_is_devmap_managed() is true,

* And change the other caller (release_pages()) to do that check.
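
Putting those together, the reworked helper would look roughly like
this (a sketch of the plan above, not necessarily the exact v12 code):

void put_devmap_managed_page(struct page *page)
{
	int count;

	/* Callers are expected to have checked this already: */
	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
		return;

	count = page_ref_dec_return(page);

	/* devmap page refcounts are 1-based; 1 means "idle, free it". */
	if (count == 1)
		free_devmap_managed_page(page);
	else if (!count)
		__put_page(page);
}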

...

@@ -1102,3 +1102,27 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
  }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+bool put_devmap_managed_page(struct page *page)
+{
+   bool is_devmap = page_is_devmap_managed(page);
+
+   if (is_devmap) {


Reversing the condition would save you an indentation level.


Yes. Done.

I'll also git-reply with an updated patch so you can see what it looks like.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines

2019-12-18 Thread John Hubbard

On 12/18/19 7:52 AM, Kirill A. Shutemov wrote:

On Mon, Dec 16, 2019 at 02:25:13PM -0800, John Hubbard wrote:

+static void put_compound_head(struct page *page, int refs)
+{
+   /* Do a get_page() first, in case refs == page->_refcount */
+   get_page(page);
+   page_ref_sub(page, refs);
+   put_page(page);
+}


It's not terribly efficient. Maybe something like:

VM_BUG_ON_PAGE(page_ref_count(page) < ref, page);
if (refs > 2)
page_ref_sub(page, refs - 1);
put_page(page);

?


OK, but how about this instead? I don't see the need for a "2", as that
is a magic number that requires explanation. Whereas "1" is not a magic
number--here it means: either there are "many" (>1) refs, or not.

And the routine won't be called with refs less than about 32 (2MB huge
page, 64KB base page == 32 subpages) anyway.

VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
/*
 * Calling put_page() for each ref is unnecessarily slow. Only the last
 * ref needs a put_page().
 */
if (refs > 1)
page_ref_sub(page, refs - 1);
    put_page(page);


thanks,
--
John Hubbard
NVIDIA
 


Re: [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2019-12-18 Thread John Hubbard

On 12/18/19 8:19 AM, Kirill A. Shutemov wrote:
...

diff --git a/mm/gup.c b/mm/gup.c
index 3ecce297a47f..c0c56888e7cc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,13 @@ struct follow_page_context {
unsigned int page_mask;
  };
  
+static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
+						  struct mm_struct *mm,
+						  unsigned long start,
+						  unsigned long nr_pages,
+						  struct page **pages,
+						  struct vm_area_struct **vmas,
+						  unsigned int flags);


Any particular reason for the forward declaration? Maybe move
get_user_pages_remote() down?



Yes, that's exactly why: I was thinking it would be cleaner to put in the
forward declaration, rather than moving code blocks, but either way seems
reasonable. I'll go ahead and move the code blocks and delete the forward
declaration, now that someone has weighed in in favor of that.

thanks,
--
John Hubbard
NVIDIA


[PATCH v12 23/25] mm/gup: track FOLL_PIN pages

2019-12-17 Thread John Hubbard
Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via unpin_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], and [3].

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/

Reviewed-by: Jan Kara 
Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: John Hubbard 
---

Hi,

The kbuild test robot noticed that try_pin_compound_head() can be
declared static, in mm/gup.c. This updated patch does that.

thanks,
John Hubbard
NVIDIA

 Documentation/core-api/pin_user_pages.rst |   2 +-
 include/linux/mm.h|  83 -
 include/linux/mmzone.h|   2 +
 include/linux/page_ref.h  |  10 +
 mm/gup.c  | 410 +-
 mm/huge_memory.c  |  29 +-
 mm/hugetlb.c  |  38 +-
 mm/vmstat.c   |   2 +
 8 files changed, 440 insertions(+), 136 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 1d490155ecd7..2db14df1f2d7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -53,7 +53,7 @@ Which flags are set by each wrapper
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+value: GUP_PIN_COUNTING_BIAS.::
 
  Function
  
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a1a357e7d86..bb44c4d2ada7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1016,6 +1016,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
page = compound_head(page);
@@ -1044,29 +1046,80 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many pin_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual 
inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: th

Re: [RFC PATCH] mm/gup: try_pin_compound_head() can be static

2019-12-17 Thread John Hubbard

On 12/17/19 12:03 AM, kbuild test robot wrote:


Fixes: 8086d1c61970 ("mm/gup: track FOLL_PIN pages")
Signed-off-by: kbuild test robot 
---
  gup.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 038b71165a761..849a6f55938e6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -75,7 +75,7 @@ static inline struct page *try_get_compound_head(struct page 
*page, int refs)
   * @Return:   the compound head page, with ref appropriately incremented,
   * or NULL upon failure.
   */
-__must_check struct page *try_pin_compound_head(struct page *page, int refs)
+static __must_check struct page *try_pin_compound_head(struct page *page, int refs)
  {
struct page *head = try_get_compound_head(page,
  GUP_PIN_COUNTING_BIAS * refs);



Yes, it should have been declared static. And this also applies to the latest
version (v11). The preferred fix would stay within 80 columns, like this:

diff --git a/mm/gup.c b/mm/gup.c
index c2793a86450e..39b2f683bd2e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -75,7 +75,8 @@ static inline struct page *try_get_compound_head(struct page 
*page, int refs)
  * @Return:the compound head page, with ref appropriately incremented,
  * or NULL upon failure.
  */
-__must_check struct page *try_pin_compound_head(struct page *page, int refs)
+static __must_check struct page *try_pin_compound_head(struct page *page,
+  int refs)
 {
struct page *head = try_get_compound_head(page,
  GUP_PIN_COUNTING_BIAS * refs);


thanks,
--
John Hubbard
NVIDIA


[PATCH v11 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

2019-12-16 Thread John Hubbard
It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 tools/testing/selftests/vm/run_vmtests | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests 
b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..df6a6bf3f238 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,28 @@ echo "NOTE: The above hugetlb tests provide minimal 
coverage.  Use"
 echo "  https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "  hugetlb regression testing."
 
+echo ""
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo ""
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
+echo "--"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "--"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
 echo "---"
 echo "running userfaultfd"
 echo "---"
-- 
2.24.1



[PATCH v11 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*()

2019-12-16 Thread John Hubbard
In order to provide a clearer, more symmetric API for pinning
and unpinning DMA pages. This way, pin_user_pages*() calls
match up with unpin_user_pages*() calls, and the API is a lot
closer to being self-explanatory.
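
A minimal kernel-context sketch of the resulting symmetry (hypothetical
call site, not part of this patch; user_addr is assumed to be a
userspace address that the caller has already validated):

#include <linux/mm.h>

static int example_dma_readwrite(unsigned long user_addr)
{
	struct page *pages[16];
	int nr;

	nr = pin_user_pages_fast(user_addr, 16, FOLL_WRITE, pages);
	if (nr <= 0)
		return nr ? nr : -EFAULT;

	/* ... program a device to DMA into pages[0..nr-1] ... */

	unpin_user_pages(pages, nr);	/* pairs with pin_user_pages_fast() */
	return 0;
}

The release side now names the pin side, rather than sounding like a
variant of put_page().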

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst   |  2 +-
 arch/powerpc/mm/book3s64/iommu_api.c|  4 +--
 drivers/gpu/drm/via/via_dmablit.c   |  4 +--
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 ++--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  4 +--
 drivers/platform/goldfish/goldfish_pipe.c   |  4 +--
 drivers/vfio/vfio_iommu_type1.c |  2 +-
 fs/io_uring.c   |  4 +--
 include/linux/mm.h  | 26 -
 mm/gup.c| 32 ++---
 mm/process_vm_access.c  |  4 +--
 net/xdp/xdp_umem.c  |  2 +-
 18 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index 71849830cd48..1d490155ecd7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -219,7 +219,7 @@ since the system was booted, via two new /proc/vmstat 
entries: ::
 /proc/vmstat/nr_foll_pin_requested
 
 Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in put_user_page(), when they
+because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
 References
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index a86547822034..eba73ebd8ae5 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -168,7 +168,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
 
 free_exit:
/* free the references taken */
-   put_user_pages(mem->hpages, pinned);
+   unpin_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -214,7 +214,7 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_user_page(page);
+   unpin_user_page(page);
 
mem->hpas[i] = 0;
}
diff --git a/drivers/gpu/drm/via/via_dmablit.c 
b/drivers/gpu/drm/via/via_dmablit.c
index 37c5e572993a..719d036c9384 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t 
*vsg)
kfree(vsg->desc_pages);
/* fall through */
case dr_via_pages_locked:
-   put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
- (vsg->direction == DMA_FROM_DEVICE));
+   unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+  (vsg->direction == DMA_FROM_DEVICE));
/* fall through */
case dr_via_pages_alloc:
vfree(vsg->pages);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 55daefaa9b88..a6094766b6f5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
 
for_each_sg_page(umem->sg_head.sgl, _iter, umem->sg_nents, 0) {
page = sg_page_iter_page(_iter);
-   put_user_pages_dirty_lock(, 1, umem->writable && dirty);
+   unpin_user_pages_dirty_lock(, 1, umem->writable && dirty);
}
 
sg_free_table(>sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 9a94761765c0..3b505006c0a6 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   put_user_pages_dirty_lock(p, npages, dirty);
+   unpin_user_pages_dirty_lock(p, npages, dirty);
 
if (mm) { /* during close after signal, mm can be NULL */
atomic64_sub(npages, >pinned_vm);
diff --git a/drivers/infiniband/hw/mthca/mthc

[PATCH v11 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

2019-12-16 Thread John Hubbard
Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnecessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.
So remove it.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 30 +-
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..b800fc9a0251 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -322,7 +322,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
 {
struct page *page[1];
struct vm_area_struct *vma;
-   struct vm_area_struct *vmas[1];
unsigned int flags = 0;
int ret;
 
@@ -330,33 +329,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(>mmap_sem);
-   if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
-   } else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
-   /*
-* The lifetime of a vaddr_get_pfn() page pin is
-* userspace-controlled. In the fs-dax case this could
-* lead to indefinite stalls in filesystem operations.
-* Disallow attempts to pin fs-dax pages via this
-* interface.
-*/
-   if (ret > 0 && vma_is_fsdax(vmas[0])) {
-   ret = -EOPNOTSUPP;
-   put_page(page[0]);
-   }
-   }
-   up_read(>mmap_sem);
-
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-   return 0;
+   ret = 0;
+   goto done;
}
 
-   down_read(>mmap_sem);
-
vaddr = untagged_addr(vaddr);
 
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
@@ -366,7 +346,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
if (is_invalid_reserved_pfn(*pfn))
ret = 0;
}
-
+done:
up_read(>mmap_sem);
return ret;
 }
-- 
2.24.1



[PATCH v11 12/25] IB/{core, hw, umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

2019-12-16 Thread John Hubbard
Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  | 13 ++---
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 214e87aa609d..55daefaa9b88 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,7 +266,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   ret = get_user_pages_fast(cur_base,
+   ret = pin_user_pages_fast(cur_base,
  min_t(unsigned long, npages,
PAGE_SIZE /
sizeof(struct page *)),
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index e42d44e501fd..abc3bb6578cc 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -308,9 +308,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
@@ -363,7 +362,7 @@ static int ib_umem_odp_map_dma_single_page(
}
 
 out:
-   put_user_page(page);
+   put_page(page);
return ret;
 }
 
@@ -473,7 +472,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
ret = -EFAULT;
break;
}
-   put_user_page(local_page_list[j]);
+   put_page(local_page_list[j]);
continue;
}
 
@@ -500,8 +499,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
 * ib_umem_odp_map_dma_single_page().
 */
if (npages - (j + 1) > 0)
-   put_user_pages(_page_list[j+1],
-  npages - (j + 1));
+   release_pages(_page_list[j+1],
+ npages - (j + 1));
break;
}
}
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9a94761765c0 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
int ret;
unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
 
-   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+   ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
if (ret < 0)
return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..8269ab040c21 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
goto out;
}
 
-   ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
+   ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1,
  FOLL_WRITE | FOLL_LONGTERM, pages);
if (ret < 0)
goto out;
diff --g

[PATCH v11 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-12-16 Thread John Hubbard
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Reviewed-by: Christoph Hellwig 
Acked-by: Hans Verkuil 
Cc: Mauro Carvalho Chehab 
Cc: 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++)
+   for (i = 0; i < dma->nr_pages; i++) {
+   if (dma->direction == DMA_FROM_DEVICE)
+   set_page_dirty_lock(dma->pages[i]);
put_page(dma->pages[i]);
+   }
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.1



[PATCH v11 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

2019-12-16 Thread John Hubbard
Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
instead of a hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing
it once before issuing any of the get_user_pages*() calls. This
makes it harder to overlook, instead of having little "gup_flags & 1"
phrases in the function calls.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 9 ++---
 tools/testing/selftests/vm/gup_benchmark.c | 6 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..7fc44d25eca7 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -48,18 +48,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
+   /* Filter out most gup flags: only allow a tiny subset here: */
+   gup->flags &= FOLL_WRITE;
+
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+   nr = get_user_pages_fast(addr, nr, gup->flags,
 pages + i);
break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages(addr, nr,
-   (gup->flags & 1) | FOLL_LONGTERM,
+   gup->flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case GUP_BENCHMARK:
-   nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+   nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
default:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..389327e9b30a 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01/* check pte is writable */
+
 struct gup_benchmark {
__u64 get_delta_usec;
__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
}
 
gup.nr_pages_per_call = nr_pages;
-   gup.flags = write;
+   if (write)
+   gup.flags |= FOLL_WRITE;
 
fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
if (fd == -1)
-- 
2.24.1



[PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread John Hubbard
Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via unpin_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page is left to later
patchsets. There is discussion about this in [1], [2], and [3].
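
As a purely illustrative sketch (hypothetical code, not from this
patch), a future caller might consult the new query like this, with the
response policy deliberately left to the later patchsets:

	/* Kernel-context sketch; what to actually do here is an open
	 * question in this series: */
	if (page_dma_pinned(page)) {
		/* e.g. skip writeback, bounce the buffer, or wait for
		 * the pin to be released */
	}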

This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/

Reviewed-by: Jan Kara 
Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst |   2 +-
 include/linux/mm.h|  83 -
 include/linux/mmzone.h|   2 +
 include/linux/page_ref.h  |  10 +
 mm/gup.c  | 409 +-
 mm/huge_memory.c  |  29 +-
 mm/hugetlb.c  |  38 +-
 mm/vmstat.c   |   2 +
 8 files changed, 439 insertions(+), 136 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index 1d490155ecd7..2db14df1f2d7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -53,7 +53,7 @@ Which flags are set by each wrapper
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+value: GUP_PIN_COUNTING_BIAS.::
 
  Function
  
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a1a357e7d86..bb44c4d2ada7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1016,6 +1016,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
page = compound_head(page);
@@ -1044,29 +1046,80 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many pin_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual 
inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkc
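
To make the bias arithmetic in the comment above concrete, here is a
small standalone sketch (not part of the patch; the value 1024 is an
assumption, consistent with the series' requirement that
GUP_PIN_COUNTING_BIAS be a power of two):

#include <stdio.h>

#define GUP_PIN_COUNTING_BIAS	1024	/* assumed value */

int main(void)
{
	/* Two pin_user_pages() pins plus three ordinary get_page() refs: */
	unsigned int refcount = 2 * GUP_PIN_COUNTING_BIAS + 3;	/* 2051 */

	/* While the ordinary refs stay small, each count is recoverable: */
	printf("pins=%u, normal refs=%u\n",
	       refcount / GUP_PIN_COUNTING_BIAS,	/* prints 2 */
	       refcount % GUP_PIN_COUNTING_BIAS);	/* prints 3 */
	return 0;
}

This is also what allows page_dma_pinned() to be a simple refcount
threshold test: a refcount of at least one bias unit implies at least
one outstanding FOLL_PIN pin.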

[PATCH v11 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-12-16 Thread John Hubbard
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Reviewed-by: Jens Axboe 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9b1833fedc5c..c6ff9cc7fe71 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4539,7 +4539,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx 
*ctx, void __user *arg,
 
ret = 0;
down_read(>mm->mmap_sem);
-   pret = get_user_pages(ubuf, nr_pages,
+   pret = pin_user_pages(ubuf, nr_pages,
  FOLL_WRITE | FOLL_LONGTERM,
  pages, vmas);
if (pret == nr_pages) {
-- 
2.24.1



[PATCH v11 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-12-16 Thread John Hubbard
1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..a86547822034 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
 
-   ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+   ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
mem->hpages + entry, NULL);
if (ret == n) {
@@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
return 0;
 
 free_exit:
-   /* free the reference taken */
-   for (i = 0; i < pinned; i++)
-   put_page(mem->hpages[i]);
+   /* free the references taken */
+   put_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_page(page);
+   put_user_page(page);
+
mem->hpas[i] = 0;
}
 }
-- 
2.24.1



[PATCH v11 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast()

2019-12-16 Thread John Hubbard
Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().

Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Cc: Christoph Hellwig 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index c0c56888e7cc..958ab0757389 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2414,7 +2414,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned long addr, len, end;
int nr = 0, ret = 0;
 
-   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+  FOLL_FORCE)))
return -EINVAL;
 
start = untagged_addr(start) & PAGE_MASK;
-- 
2.24.1



[PATCH v11 16/25] net/xdp: set FOLL_PIN via pin_user_pages()

2019-12-16 Thread John Hubbard
Convert net/xdp to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Acked-by: Björn Töpel 
Signed-off-by: John Hubbard 
---
 net/xdp/xdp_umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..d071003b5e76 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
 
down_read(>mm->mmap_sem);
-   npgs = get_user_pages(umem->address, umem->npgs,
+   npgs = pin_user_pages(umem->address, umem->npgs,
  gup_flags | FOLL_LONGTERM, >pgs[0], NULL);
up_read(>mm->mmap_sem);
 
-- 
2.24.1



[PATCH v11 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-12-16 Thread John Hubbard
Introduce pin_user_pages*() variations of the get_user_pages*() calls.

For now, these are placeholder calls, until the various call sites
are converted to use the correct get_user_pages*() or
pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

All pages that are pinned via the above calls must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
* Start with "pin_user_pages" instead of "get_user_pages". That
  makes it easy to find and audit the call sites.
* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded upon it.)
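
A hedged sketch of a conforming call site (hypothetical, not part of
this patch). Note that FOLL_PIN never appears at the call site, and
that at this point in the series the release function is still named
put_user_page():

#include <linux/mm.h>

static int example_dio_pin(unsigned long uaddr, struct page **pages,
			   int nr_pages)
{
	int i, nr;

	/* The wrapper sets FOLL_PIN internally; the caller must not: */
	nr = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);

	/* ... perform the DIO against pages[0..nr-1] ... */

	for (i = 0; i < nr; i++)
		put_user_page(pages[i]);	/* required for FOLL_PIN pages */
	return nr;
}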

Reviewed-by: Jan Kara 
Reviewed-by: Mike Rapoport   # Documentation
Reviewed-by: Jérôme Glisse 
Cc: Jonathan Corbet 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/pin_user_pages.rst | 232 ++
 include/linux/mm.h|  63 --
 mm/gup.c  | 161 +--
 4 files changed, 423 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+   pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index ..71849830cd48
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+pin_user_pages() and related calls
+
+
+.. contents:: :local:
+
+Overview
+
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions  (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pin pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ 
+ pin_user_pages  FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast FOLL_PIN i

[PATCH v11 05/25] goldish_pipe: rename local pin_user_pages() routine

2019-12-16 Thread John Hubbard
1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..ef50c264db71 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
}
 }
 
-static int pin_user_pages(unsigned long first_page,
- unsigned long last_page,
- unsigned int last_page_size,
- int is_write,
- struct page *pages[MAX_BUFFERS_PER_COMMAND],
- unsigned int *iter_last_page_size)
+static int goldfish_pin_pages(unsigned long first_page,
+ unsigned long last_page,
+ unsigned int last_page_size,
+ int is_write,
+ struct page *pages[MAX_BUFFERS_PER_COMMAND],
+ unsigned int *iter_last_page_size)
 {
int ret;
int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
if (mutex_lock_interruptible(>lock))
return -ERESTARTSYS;
 
-   pages_count = pin_user_pages(first_page, last_page,
-last_page_size, is_write,
-pipe->pages, _last_page_size);
+   pages_count = goldfish_pin_pages(first_page, last_page,
+last_page_size, is_write,
+pipe->pages, _last_page_size);
if (pages_count < 0) {
mutex_unlock(>lock);
return pages_count;
-- 
2.24.1



[PATCH v11 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-12-16 Thread John Hubbard
And get rid of the mmap_sem calls, as part of that. Note
that get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Reviewed-by: Leon Romanovsky 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7a3b99597ead..214e87aa609d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,16 +266,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   down_read(>mmap_sem);
-   ret = get_user_pages(cur_base,
-min_t(unsigned long, npages,
-  PAGE_SIZE / sizeof (struct page *)),
-gup_flags | FOLL_LONGTERM,
-page_list, NULL);
-   if (ret < 0) {
-   up_read(>mmap_sem);
+   ret = get_user_pages_fast(cur_base,
+ min_t(unsigned long, npages,
+   PAGE_SIZE /
+   sizeof(struct page *)),
+ gup_flags | FOLL_LONGTERM, page_list);
+   if (ret < 0)
goto umem_release;
-   }
 
cur_base += ret * PAGE_SIZE;
npages   -= ret;
@@ -283,8 +280,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = ib_umem_add_sg_table(sg, page_list, ret,
dma_get_max_seg_size(context->device->dma_device),
>sg_nents);
-
-   up_read(>mmap_sem);
}
 
sg_mark_end(sg);
-- 
2.24.1



[PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-16 Thread John Hubbard
An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
  what kind of page it is, and what kind of refcount handling it
  requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.

Cc: Christoph Hellwig 
Suggested-by: Jérôme Glisse 
Reviewed-by: Dan Williams 
Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 17 +
 mm/memremap.c  | 16 ++--
 mm/swap.c  | 24 
 3 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c97ea3b694e6..77a4df06c8a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page 
*page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
if (!static_branch_unlikely(_managed_key))
return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page 
*page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page 
*page)
return false;
 }
 
+bool put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool page_is_devmap_managed(struct page *page)
+{
+   return false;
+}
+
 static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
@@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
 * need to inform the device driver through callback. See
 * include/linux/memremap.h and HMM for details.
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
return;
+   }
 
if (put_page_testzero(page))
__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index e899fa876a62..2ba773859031 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-   int count = page_ref_dec_return(page);
-
-   /* still busy */
-   if (count > 1)
-   return;
-
-   /* only triggered by the dev_pagemap shutdown path */
-   if (count == 0) {
-   __put_page(page);
-   return;
-   }
-
/* notify page idle for dax */
if (!is_device_private_page(page)) {
wake_up_var(>_refcount);
@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..49f7c2eea0ba 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1102,3 +1102,27 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+bool put_devmap_managed_page(struct page *page)
+{
+   bool is_devmap = page_is_devmap_managed(page);
+
+   if (is_devmap) {
+   int count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+   }
+
+   return is_devmap;
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
+#endif
-- 
2.24.1



[PATCH v11 19/25] vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-16 Thread John Hubbard
1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b800fc9a0251..18bfc2fc8e6d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
if (!is_invalid_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
-   if (prot & IOMMU_WRITE)
-   SetPageDirty(page);
-   put_page(page);
+
+   put_user_pages_dirty_lock(, 1, prot & IOMMU_WRITE);
return 1;
}
return 0;
@@ -329,7 +328,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(>mmap_sem);
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-- 
2.24.1



[PATCH v11 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-16 Thread John Hubbard
1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..162a2633b1e3 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct 
videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = pin_user_pages(data & PAGE_MASK, dma->nr_pages,
 flags | FOLL_LONGTERM, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+   dprintk(1, "pin_user_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++) {
-   if (dma->direction == DMA_FROM_DEVICE)
-   set_page_dirty_lock(dma->pages[i]);
-   put_page(dma->pages[i]);
-   }
+   put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.1



[PATCH v11 24/25] mm/gup_benchmark: support pin_user_pages() and related calls

2019-12-16 Thread John Hubbard
Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages():  via the '-b' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new is_dma_pinned() routine.
Those commands are:

PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_BENCHMARK  : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.
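
For reference, here is a hedged userspace sketch of driving the new
PIN_BENCHMARK command (hypothetical code, not part of this patch; the
ioctl number comes from the kernel-side diff below, and the struct
layout is assumed to match the selftest's):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/types.h>

#define FOLL_WRITE	0x01	/* copied from mm.h, as the selftest does */

struct gup_benchmark {		/* layout assumed from the selftest */
	__u64 get_delta_usec;
	__u64 put_delta_usec;
	__u64 addr;
	__u64 size;
	__u32 nr_pages_per_call;
	__u32 flags;
	__u64 expansion[10];
};

#define PIN_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)

int main(void)
{
	const size_t size = 128 * 4096;
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct gup_benchmark gup = {
		.addr = (unsigned long)buf,
		.size = size,
		.nr_pages_per_call = 128,
		.flags = FOLL_WRITE,
	};
	int fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);

	if (buf == MAP_FAILED || fd < 0 || ioctl(fd, PIN_BENCHMARK, &gup))
		return 1;

	printf("pin: %llu usec, unpin: %llu usec\n",
	       (unsigned long long)gup.get_delta_usec,
	       (unsigned long long)gup.put_delta_usec);
	return 0;
}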

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 65 --
 tools/testing/selftests/vm/gup_benchmark.c | 15 -
 2 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7fc44d25eca7..76d32db48af8 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK  _IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -19,6 +21,42 @@ struct gup_benchmark {
__u64 expansion[10];/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long 
nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case GUP_FAST_BENCHMARK:
+   case GUP_LONGTERM_BENCHMARK:
+   case GUP_BENCHMARK:
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   break;
+
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   unpin_user_pages(pages, nr_pages);
+   break;
+   }
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   for (i = 0; i < nr_pages; i++) {
+   if (WARN(!page_dma_pinned(pages[i]),
+"pages[%d] is NOT dma-pinned\n", i))
+   break;
+   }
+   break;
+   }
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
struct gup_benchmark *gup)
 {
@@ -65,6 +103,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
+   case PIN_FAST_BENCHMARK:
+   nr = pin_user_pages_fast(addr, nr, gup->flags,
+pages + i);
+   break;
+   case PIN_BENCHMARK:
+   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   NULL);
+   break;
default:
return -1;
}
@@ -75,15 +121,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();
 
+   /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+   nr_pages = i;
+
gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;
 
+   /*
+* Take an un-benchmark-timed moment to verify DMA pinned
+* state: print a warning if any non-dma-pinned pages are found:
+*/
+   verify_dma_pinned(cmd, pages, nr_pages);
+
start_time = ktime_get();
-   for (i = 0; i < nr_pages; i++) {
-   if (!pages[i])
-   break;
-   put_page(pages[i]);
-   }
+
+   put_back_pages(cmd, pages, nr_pages);
+
end_time = ktime_get();
gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -101,6 +154,8 @@ static long gup_benchmark_ioctl(struct file *filep, 
unsigned int cmd,
case GUP_FAST_BENCHMARK:
case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
break;
default:
return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_b

[PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-16 Thread John Hubbard
d routines, but not so much on several of
  the call sites--but those are generally just a couple of lines
  changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Runtime testing for the call sites so far is pretty light:

* io_uring: Some directed tests from liburing exercise this, and
they pass.
* process_vm_access.c: A small directed test passes.
* gup_benchmark: the enhanced version hits the new gup.c code, and
 passes.
* infiniband: ran "ib_write_bw", which exercises the umem.c changes,
  but not the other changes.
* VFIO: compiles (I'm vowing to set up a run time test soon, but it's
  not ready just yet)
* powerpc: it compiles...
* drm/via: compiles...
* goldfish: compiles...
* net/xdp: compiles...
* media/v4l2: compiles...

[1] Some slow progress on get_user_pages() (Apr 2, 2019): 
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): 
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): 
https://lwn.net/Articles/753027/

Dan Williams (1):
  mm: Cleanup __put_devmap_managed_page() vs ->page_free()

John Hubbard (24):
  mm/gup: factor out duplicate code from four routines
  mm/gup: move try_get_compound_head() to top, fix minor issues
  mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  goldish_pipe: rename local pin_user_pages() routine
  mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
  mm/gup: allow FOLL_FORCE for get_user_pages_fast()
  IB/umem: use get_user_pages_fast() to pin DMA pages
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldish_pipe: convert to pin_user_pages() and put_user_page()
  IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
conversion
  vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
"1"
  mm, tree-wide: rename put_user_page*() to unpin_user_page*()
  mm/gup: track FOLL_PIN pages
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
coverage

 Documentation/core-api/index.rst|   1 +
 Documentation/core-api/pin_user_pages.rst   | 232 
 arch/powerpc/mm/book3s64/iommu_api.c|  10 +-
 drivers/gpu/drm/via/via_dmablit.c   |   6 +-
 drivers/infiniband/core/umem.c  |  19 +-
 drivers/infiniband/core/umem_odp.c  |  13 +-
 drivers/infiniband/hw/hfi1/user_pages.c |   4 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   8 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   4 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   8 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|   4 +-
 drivers/infiniband/sw/siw/siw_mem.c |   4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   8 +-
 drivers/nvdimm/pmem.c   |   6 -
 drivers/platform/goldfish/goldfish_pipe.c   |  35 +-
 drivers/vfio/vfio_iommu_type1.c |  35 +-
 fs/io_uring.c   |   6 +-
 include/linux/mm.h  | 155 -
 include/linux/mmzone.h  |   2 +
 include/linux/page_ref.h|  10 +
 mm/gup.c| 626 +++-
 mm/gup_benchmark.c  |  74 ++-
 mm/huge_memory.c|  29 +-
 mm/hugetlb.c|  38 +-
 mm/memremap.c   |  76 ++-
 mm/process_vm_access.c  |  28 +-
 mm/swap.c   |  24 +
 mm/vmstat.c |   2 +
 net/xdp/xdp_umem.c  |   4 +-
 tools/testing/selftests/vm/gup_benchmark.c  |  21 +-
 tools/testing/selftests/vm/run_vmtests  |  22 +
 31 files changed, 1145 insertions(+), 369 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

--
2.24.1



[PATCH v11 01/25] mm/gup: factor out duplicate code from four routines

2019-12-16 Thread John Hubbard
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Jan Kara 
Cc: Ira Weiny 
Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 91 ++--
 1 file changed, 36 insertions(+), 55 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..f764432914c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1978,6 +1978,25 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, 
unsigned long addr,
 }
 #endif
 
+static int record_subpages(struct page *page, unsigned long addr,
+  unsigned long end, struct page **pages)
+{
+   int nr;
+
+   for (nr = 0; addr != end; addr += PAGE_SIZE)
+   pages[nr++] = page++;
+
+   return nr;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+   /* Do a get_page() first, in case refs == page->_refcount */
+   get_page(page);
+   page_ref_sub(page, refs);
+   put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
@@ -2007,32 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   refs = 0;
head = pte_page(pte);
-
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-   do {
-   VM_BUG_ON(compound_head(page) != head);
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(head, refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-   /* Could be optimized better */
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2079,28 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
unsigned long addr,
return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pmd_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2120,28 +2118,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pud_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2157,28 +2146,20 @@ static int gup_huge_

[PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2019-12-16 Thread John Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.
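
For reference, the vma check that __gup_longterm_locked() ends up
applying is essentially the following shape (a simplified sketch, not
the verbatim kernel code; the real check_dax_vmas() also skips
duplicate vma entries):

	static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
	{
		long i;

		for (i = 0; i < nr_pages; i++) {
			/*
			 * FS DAX pages can be truncated out from under a
			 * long-term pin, so refuse to pin them:
			 */
			if (vma_is_fsdax(vmas[i]))
				return true;
		}
		return false;
	}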

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3ecce297a47f..c0c56888e7cc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,13 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int flags);
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -1179,13 +1186,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct **vmas, int *locked)
 {
/*
-* FIXME: Current FOLL_LONGTERM behavior is incompatible with
+* Parts of FOLL_LONGTERM behavior are incompatible with
 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-* vmas.  As there are no users of this flag in this call we simply
-* disallow this option for now.
+* vmas. However, this only comes up if locked is set, and there are
+* callers that do request FOLL_LONGTERM, but do not set locked. So,
+* allow what we can.
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-   return -EINVAL;
+   if (gup_flags & FOLL_LONGTERM) {
+   if (WARN_ON_ONCE(locked))
+   return -EINVAL;
+   /*
+* This will check the vmas (even if our vmas arg is NULL)
+* and return -ENOTSUPP if DAX isn't allowed in this case:
+*/
+   return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+vmas, gup_flags | FOLL_TOUCH |
+FOLL_REMOTE);
+   }
 
return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
   locked,
   gup_flags | FOLL_TOUCH | FOLL_REMOTE);
-- 
2.24.1



[PATCH v11 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-12-16 Thread John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().
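
For reference, pin_user_pages_remote() is a drop-in replacement for
get_user_pages_remote() (same arguments; it sets FOLL_PIN internally),
and the release-side pairing this series establishes is:

	get_user_pages*()  ->  put_page()
	pin_user_pages*()  ->  put_user_page(), or
	                       put_user_pages_dirty_lock(pages, npages,
	                                                 make_dirty)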

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/process_vm_access.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
if (copy > len)
copy = len;
 
-   if (vm_write) {
+   if (vm_write)
copied = copy_page_from_iter(page, offset, copy, iter);
-   set_page_dirty_lock(page);
-   } else {
+   else
copied = copy_page_to_iter(page, offset, copy, iter);
-   }
+
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
flags |= FOLL_WRITE;
 
while (!rc && nr_pages && iov_iter_count(iter)) {
-   int pages = min(nr_pages, max_pages_per_loop);
+   int pinned_pages = min(nr_pages, max_pages_per_loop);
int locked = 1;
size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 * current/current->mm
 */
down_read(&mm->mmap_sem);
-   pages = get_user_pages_remote(task, mm, pa, pages, flags,
- process_pages, NULL, &locked);
+   pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+flags, process_pages,
+NULL, &locked);
if (locked)
up_read(&mm->mmap_sem);
-   if (pages <= 0)
+   if (pinned_pages <= 0)
return -EFAULT;
 
-   bytes = pages * PAGE_SIZE - start_offset;
+   bytes = pinned_pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 vm_write);
len -= bytes;
start_offset = 0;
-   nr_pages -= pages;
-   pa += pages * PAGE_SIZE;
-   while (pages)
-   put_page(process_pages[--pages]);
+   nr_pages -= pinned_pages;
+   pa += pinned_pages * PAGE_SIZE;
+
+   /* If vm_write is set, the pages need to be made dirty: */
+   put_user_pages_dirty_lock(process_pages, pinned_pages,
+ vm_write);
}
 
return rc;
-- 
2.24.1



[PATCH v11 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-12-16 Thread John Hubbard
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().

Acked-by: Daniel Vetter 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer)
vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
if (NULL == vsg->pages)
return -ENOMEM;
-   ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+   ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
vsg->pages);
-- 
2.24.1



[PATCH v11 11/25] goldish_pipe: convert to pin_user_pages() and put_user_page()

2019-12-16 Thread John Hubbard
1. Call the new global pin_user_pages_fast(), from goldfish_pin_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified, because
the page[] loop now lives in gup.c instead of here. So, delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly instead.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de
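
For reference, the gup.c helper that replaces the local loop behaves
roughly like this (a simplified sketch of put_user_pages_dirty_lock(),
not the verbatim implementation):

	void put_user_pages_dirty_lock(struct page **pages,
				       unsigned long npages, bool make_dirty)
	{
		unsigned long i;

		for (i = 0; i < npages; i++) {
			struct page *page = compound_head(pages[i]);

			/* Uses the _lock variant, per Christoph's point above */
			if (make_dirty && !PageDirty(page))
				set_page_dirty_lock(page);
			put_user_page(page);
		}
	}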

Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index ef50c264db71..2a5901efecde 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int goldfish_pin_pages(unsigned long first_page,
*iter_last_page_size = last_page_size;
}
 
-   ret = get_user_pages_fast(first_page, requested_pages,
+   ret = pin_user_pages_fast(first_page, requested_pages,
  !is_write ? FOLL_WRITE : 0,
  pages);
if (ret <= 0)
@@ -285,18 +285,6 @@ static int goldfish_pin_pages(unsigned long first_page,
return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-  int is_write, s32 consumed_size)
-{
-   int i;
-
-   for (i = 0; i < pages_count; i++) {
-   if (!is_write && consumed_size > 0)
-   set_page_dirty(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
   int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+   put_user_pages_dirty_lock(pipe->pages, pages_count,
+ !is_write && *consumed_size > 0);
 
mutex_unlock(&pipe->lock);
return 0;
-- 
2.24.1



[PATCH v11 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free()

2019-12-16 Thread John Hubbard
From: Dan Williams 

After the removal of the device-public infrastructure there are only 2
->page_free() callbacks in the kernel. One of those is a device-private
callback in the nouveau driver, the other is a generic wakeup needed in
the DAX case. In the hopes that all ->page_free() callbacks can be
migrated to common core kernel functionality, move the device-private
specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
cases.
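
In outline, the restructured function dispatches on the post-decrement
refcount (a sketch of the result, abbreviated; the full device-private
tail appears in the diff below):

	void __put_devmap_managed_page(struct page *page)
	{
		int count = page_ref_dec_return(page);

		if (count > 1)		/* still busy */
			return;
		if (count == 0) {	/* dev_pagemap shutdown path */
			__put_page(page);
			return;
		}
		/* count == 1: the page is now idle */
		if (!is_device_private_page(page)) {
			/* notify page idle: open-coded generic wakeup */
			wake_up_var(&page->_refcount);
			return;
		}
		/*
		 * Device-private only: clear stale page state, then call
		 * the driver's ->page_free():
		 */
		...
	}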

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Cc: Jan Kara 
Cc: Ira Weiny 
Signed-off-by: Dan Williams 
Signed-off-by: John Hubbard 
---
 drivers/nvdimm/pmem.c |  6 
 mm/memremap.c | 80 ---
 2 files changed, 44 insertions(+), 42 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)
-{
-   wake_up_var(&page->_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-   .page_free  = pmem_pagemap_page_free,
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
 };
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..e899fa876a62 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(void)
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-   if (!pgmap->ops || !pgmap->ops->page_free) {
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+   (!pgmap->ops || !pgmap->ops->page_free)) {
WARN(1, "Missing page_free method\n");
return -EINVAL;
}
@@ -414,44 +415,51 @@ void __put_devmap_managed_page(struct page *page)
 {
int count = page_ref_dec_return(page);
 
-   /*
-* If refcount is 1 then page is freed and refcount is stable as nobody
-* holds a reference on the page.
-*/
-   if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
+   /* still busy */
+   if (count > 1)
+   return;
 
-   mem_cgroup_uncharge(page);
+   /* only triggered by the dev_pagemap shutdown path */
+   if (count == 0) {
+   __put_page(page);
+   return;
+   }
 
-   /*
-* When a device_private page is freed, the page->mapping field
-* may still contain a (stale) mapping value. For example, the
-* lower bits of page->mapping may still identify the page as
-* an anonymous page. Ultimately, this entire field is just
-* stale and wrong, and it will cause errors if not cleared.
-* One example is:
-*
-*  migrate_vma_pages()
-*migrate_vma_insert_page()
-*  page_add_new_anon_rmap()
-*__page_set_anon_rmap()
-*  ...checks page->mapping, via PageAnon(page) call,
-*and incorrectly concludes that the page is an
-*anonymous page. Therefore, it incorrectly,
-*silently fails to set up the new anon rmap.
-*
-* For other types of ZONE_DEVICE pages, migration is either
-* handled differently or not done at all, so there is no need
-* to clear page->mapping.
-*/
-   if (is_device_private_page(page))
-   page->mapping = NULL;
+   /* notify page idle for dax */
+   if (!is_device_private_page(page)) {
+   wake_up_var(&page->_refcount);
+   return;
+   }
 
-   page->pgmap->ops->page_free(page);
-   } else if (!count)
-   __put_page(page);
+   /* Clear Active bit in case of parallel mark_page_accessed */
+   __ClearPageActive(page);
+   __ClearPageWaiters(page);
+
+   mem_cgroup_uncharge(page);
+
+   /*
+* When a device_private page is freed, the page->mapping field
+* may still contain a (stale) mapping value. For example, the
+* lower bits of page->mapping may still identify the page as an
+* anonymous page. Ultimately, this entire field is just st

[PATCH v11 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-12-16 Thread John Hubbard
An upcoming patch uses try_get_compound_head() more widely,
so move it to the top of gup.c.

Also fix a tiny spelling error ("compund" -> "compound") and a
checkpatch.pl warning (missing blank line after a declaration).

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f764432914c4..3ecce297a47f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+   struct page *head = compound_head(page);
+
+   if (WARN_ON_ONCE(page_ref_count(head) < 0))
+   return NULL;
+   if (unlikely(!page_cache_add_speculative(head, refs)))
+   return NULL;
+   return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1807,20 +1822,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-   struct page *head = compound_head(page);
-   if (WARN_ON_ONCE(page_ref_count(head) < 0))
-   return NULL;
-   if (unlikely(!page_cache_add_speculative(head, refs)))
-   return NULL;
-   return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
-- 
2.24.1



Re: [PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread John Hubbard

On 12/16/19 4:53 AM, Jan Kara wrote:
...


> I'd move this still a bit higher - just after VM_BUG_ON_PAGE() and before
> if (flags & FOLL_TOUCH) test. Because touch_pmd() can update page tables
> and we don't want that if we're going to fail the fault.



Done. I'll post a full v11 series shortly.
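
(For readers following along: the resulting order of operations in
follow_trans_huge_pmd() is roughly the sketch below. Take the
reference first, so a failure can never happen after touch_pmd() has
already updated the page tables. Reconstructed from the discussion,
not quoted from the posted patch.)

	page = pmd_page(*pmd);
	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

	if (!try_grab_page(page, flags))	/* FOLL_PIN or FOLL_GET ref */
		return ERR_PTR(-ENOMEM);

	if (flags & FOLL_TOUCH)
		touch_pmd(vma, addr, pmd, flags);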


> With this fixed, the patch looks good to me so you can then add:
>
> Reviewed-by: Jan Kara 
>
> Honza



btw, thanks for the thorough review of this critical patch (and for your
patience with my mistakes). I really appreciate it, and this patchset would
not have made it this far without your detailed help and explanations.


thanks,
--
John Hubbard
NVIDIA

