Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-07 Thread Mike Rapoport
On Tue, Nov 05, 2019 at 11:00:06AM -0800, John Hubbard wrote:
> On 11/5/19 5:10 AM, Mike Rapoport wrote:
> ...
> >> ---
> >>  Documentation/vm/index.rst  |   1 +
> >>  Documentation/vm/pin_user_pages.rst | 212 ++
> > 
> > I think it belongs to Documentation/core-api.
> 
> Done:
> 
> diff --git a/Documentation/core-api/index.rst 
> b/Documentation/core-api/index.rst
> index ab0eae1c153a..413f7d7c8642 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
> generic-radix-tree
> memory-allocation
> mm-api
> +   pin_user_pages
> gfp_mask-from-fs-io
> timekeeping
> boot-time-mm

Thanks!
 
> ...
> >> diff --git a/Documentation/vm/pin_user_pages.rst 
> >> b/Documentation/vm/pin_user_pages.rst
> >> new file mode 100644
> >> index ..3910f49ca98c
> >> --- /dev/null
> >> +++ b/Documentation/vm/pin_user_pages.rst
> >> @@ -0,0 +1,212 @@
> >> +.. SPDX-License-Identifier: GPL-2.0
> >> +
> >> +
> >> +pin_user_pages() and related calls
> >> +
> > 
> > I know this is too much to ask, but having pin_user_pages() a part of more
> > general GUP description would be really great :)
> > 
> 
> Yes, definitely. But until I saw the reaction to the pin_user_pages() API
> family, I didn't want to write too much--it could have all been tossed out
> in favor of a whole different API. But now that we've had some initial
> reviews, I'm much more confident in being able to write about the larger 
> API set.
> 
> So yes, I'll put that on my pending list.
> 
> 
> ...
> >> +This document describes the following functions: ::
> >> +
> >> + pin_user_pages
> >> + pin_user_pages_fast
> >> + pin_user_pages_remote
> >> +
> >> + pin_longterm_pages
> >> + pin_longterm_pages_fast
> >> + pin_longterm_pages_remote
> >> +
> >> +Basic description of FOLL_PIN
> >> +=
> >> +
> >> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN 
> >> has
> > 
> > Consider reading this after, say, half a year ;-)
> > 
> 
> OK, OK. I knew when I wrote that that it was not going to stay new forever, 
> but
> somehow failed to write the right thing anyway. :) 
> 
> Here's a revised set of paragraphs:
> 
> Basic description of FOLL_PIN
> =
> 
> FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
> get_user_pages*()
> ("gup") family of functions. FOLL_PIN has significant interactions and
> interdependencies with FOLL_LONGTERM, so both are covered here.
> 
> Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
> FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This 
> allows
> the associated wrapper functions  (pin_user_pages() and others) to set the
> correct combination of these flags, and to check for problems as well.

Great, thanks! 
 
> thanks,
> 
> John Hubbard
> NVIDIA

-- 
Sincerely yours,
Mike.


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-06 Thread Ira Weiny
> 
> 
> ...
> >> +This document describes the following functions: ::
> >> +
> >> + pin_user_pages
> >> + pin_user_pages_fast
> >> + pin_user_pages_remote
> >> +
> >> + pin_longterm_pages
> >> + pin_longterm_pages_fast
> >> + pin_longterm_pages_remote
> >> +
> >> +Basic description of FOLL_PIN
> >> +=
> >> +
> >> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN 
> >> has
> > 
> > Consider reading this after, say, half a year ;-)
> > 
> 
> OK, OK. I knew when I wrote that that it was not going to stay new forever, 
> but
> somehow failed to write the right thing anyway. :) 
> 
> Here's a revised set of paragraphs:
> 
> Basic description of FOLL_PIN
> =
> 
> FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
> get_user_pages*()
> ("gup") family of functions. FOLL_PIN has significant interactions and
> interdependencies with FOLL_LONGTERM, so both are covered here.
> 
> Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
> FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This 
> allows
> the associated wrapper functions  (pin_user_pages() and others) to set the
> correct combination of these flags, and to check for problems as well.

I like this revision as well.

Ira



Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-05 Thread John Hubbard
On 11/5/19 5:10 AM, Mike Rapoport wrote:
...
>> ---
>>  Documentation/vm/index.rst  |   1 +
>>  Documentation/vm/pin_user_pages.rst | 212 ++
> 
> I think it belongs to Documentation/core-api.

Done:

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+   pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm


...
>> diff --git a/Documentation/vm/pin_user_pages.rst 
>> b/Documentation/vm/pin_user_pages.rst
>> new file mode 100644
>> index ..3910f49ca98c
>> --- /dev/null
>> +++ b/Documentation/vm/pin_user_pages.rst
>> @@ -0,0 +1,212 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +
>> +pin_user_pages() and related calls
>> +
> 
> I know this is too much to ask, but having pin_user_pages() a part of more
> general GUP description would be really great :)
> 

Yes, definitely. But until I saw the reaction to the pin_user_pages() API
family, I didn't want to write too much--it could have all been tossed out
in favor of a whole different API. But now that we've had some initial
reviews, I'm much more confident in being able to write about the larger 
API set.

So yes, I'll put that on my pending list.


...
>> +This document describes the following functions: ::
>> +
>> + pin_user_pages
>> + pin_user_pages_fast
>> + pin_user_pages_remote
>> +
>> + pin_longterm_pages
>> + pin_longterm_pages_fast
>> + pin_longterm_pages_remote
>> +
>> +Basic description of FOLL_PIN
>> +=
>> +
>> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
> 
> Consider reading this after, say, half a year ;-)
> 

OK, OK. I knew when I wrote that that it was not going to stay new forever, but
somehow failed to write the right thing anyway. :) 

Here's a revised set of paragraphs:

Basic description of FOLL_PIN
=

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This allows
the associated wrapper functions  (pin_user_pages() and others) to set the
correct combination of these flags, and to check for problems as well.


thanks,

John Hubbard
NVIDIA


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-05 Thread Mike Rapoport
On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
> pin_user_pages()
> pin_user_pages_remote()
> pin_user_pages_fast()
> 
> pin_longterm_pages()
> pin_longterm_pages_remote()
> pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
> * Start with "pin_user_pages" instead of "get_user_pages". That
>   makes it easy to find and audit the call sites.
> * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  Documentation/vm/index.rst  |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++

I think it belongs to Documentation/core-api.

>  include/linux/mm.h  |  62 ++-
>  mm/gup.c| 265 +---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index e8d943b21cf9..7194efa3554a 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -44,6 +44,7 @@ descriptions of data structures and algorithms.
> page_migration
> page_frags
> page_owner
> +   pin_user_pages
> remap_file_pages
> slub
> split_page_table_lock
> diff --git a/Documentation/vm/pin_user_pages.rst 
> b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index ..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst
> @@ -0,0 +1,212 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +
> +pin_user_pages() and related calls
> +

I know this is too much to ask, but having pin_user_pages() a part of more
general GUP description would be really great :)

> +
> +.. contents:: :local:
> +
> +Overview
> +
> +
> +This document describes the following functions: ::
> +
> + pin_user_pages
> + pin_user_pages_fast
> + pin_user_pages_remote
> +
> + pin_longterm_pages
> + pin_longterm_pages_fast
> + pin_longterm_pages_remote
> +
> +Basic description of FOLL_PIN
> +=
> +
> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has

Consider reading this after, say, half a year ;-)

> +significant interactions and interdependencies with FOLL_LONGTERM, so both 
> are
> +covered here.
> +
> +Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither
> +FOLL_PIN nor FOLL_LONGTERM should not appear at the gup call sites. This 
> allows
> +the associated wrapper functions  (pin_user_pages and others) to set the 
> correct
> +combination of these flags, and to check for problems as well.
> +
> +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
> +multiple threads and call sites are free to pin the same struct pages, via 
> both
> +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or 
> the
> +other, not the struct page(s).
> +
> +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
> FOLL_PIN
> +uses a different reference counting technique.
> +
> +FOLL_PIN is a prerequisite to FOLL_LONGTGERM. Another way of saying that is,
> +FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
> +
> +Which flags are set by each wrapper
> +===
> +
> +Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
> +whatever flags the caller provides::
> +
> + Functiongup flags (FOLL_PIN or FOLL_LONGTERM only)
> + --
> + pin_user_pages  FOLL_PIN
> + pin_user_pages_fast FOLL_PIN
> + pin_user_pages_remote   

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread John Hubbard
On 11/4/19 1:15 PM, Jason Gunthorpe wrote:
...
>> Right, and I thought about this when converting, and realized that the above 
>> code is working around the current gup.c limitations, which are "cannot 
>> support
>> gup remote with FOLL_LONGTERM".
> 
> But AFAICT it doesn't have a problem, the protection test is just too
> strict, and I guess the control flow needs a bit of fixing..
> 
> The issue is this:
> 
> static __always_inline long __get_user_pages_locked():
> {
> if (locked) {
> /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> BUG_ON(vmas);
> /* check caller initialized locked */
> BUG_ON(*locked != 1);
> }
> 
> 
> so remote could be written as:
> 
> if (gup_flags & FOLL_LONGTERM) {
>if (WARN_ON_ONCE(locked))
> return -EINVAL;
>return __gup_longterm_locked(...)
> }
> 
> return __get_user_pages_locked(...)
> 
> ??

Yes, that loosens it up just enough for the vfio case (which doesn't set 
"locked") to get through, great! OK, I'll put that (the above plus 
corresponding vfio fix) in a separate patch first. 

This should clear things up nicely.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 12:57:59PM -0800, John Hubbard wrote:
> On 11/4/19 12:37 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> >>> Note for Jason: the (a) or (b) items are talking about the vfio case, 
> >>> which is
> >>> one of the two call sites that now use pin_longterm_pages_remote(), and 
> >>> the
> >>> other one is infiniband:
> >>>
> >>> drivers/infiniband/core/umem_odp.c:646: npages = 
> >>> pin_longterm_pages_remote(owning_process, owning_mm,
> >>> drivers/vfio/vfio_iommu_type1.c:353:ret = 
> >>> pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> >>
> >> vfio should be reverted until it can be properly implemented.
> >> The issue is that when you fix the implementation you might
> >> break vfio existing user and thus regress the kernel from user
> >> point of view. So i rather have the change to vfio reverted,
> >> i believe it was not well understood when it got upstream,
> >> between in my 5.4 tree it is still gup_remote not longterm.
> > 
> > It is clearly a bug, vfio must use LONGTERM, and does right above this
> > remote call:
> > 
> > if (mm == current->mm) {
> > ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> >  vmas);
> > } else {
> > ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> > vmas, NULL);
> > 
> > 
> > I'm not even sure that it really makes any sense to build a 'if' like
> > that, surely just always call remote??
> > 
> 
> 
> Right, and I thought about this when converting, and realized that the above 
> code is working around the current gup.c limitations, which are "cannot 
> support
> gup remote with FOLL_LONGTERM".

But AFAICT it doesn't have a problem, the protection test is just too
strict, and I guess the control flow needs a bit of fixing..

The issue is this:

static __always_inline long __get_user_pages_locked():
{
if (locked) {
/* if VM_FAULT_RETRY can be returned, vmas become invalid */
BUG_ON(vmas);
/* check caller initialized locked */
BUG_ON(*locked != 1);
}


so remote could be written as:

if (gup_flags & FOLL_LONGTERM) {
   if (WARN_ON_ONCE(locked))
return -EINVAL;
   return __gup_longterm_locked(...)
}

return __get_user_pages_locked(...)

??

Jason


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread John Hubbard
On 11/4/19 12:37 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
>>> Note for Jason: the (a) or (b) items are talking about the vfio case, which 
>>> is
>>> one of the two call sites that now use pin_longterm_pages_remote(), and the
>>> other one is infiniband:
>>>
>>> drivers/infiniband/core/umem_odp.c:646: npages = 
>>> pin_longterm_pages_remote(owning_process, owning_mm,
>>> drivers/vfio/vfio_iommu_type1.c:353:ret = 
>>> pin_longterm_pages_remote(NULL, mm, vaddr, 1,
>>
>> vfio should be reverted until it can be properly implemented.
>> The issue is that when you fix the implementation you might
>> break vfio existing user and thus regress the kernel from user
>> point of view. So i rather have the change to vfio reverted,
>> i believe it was not well understood when it got upstream,
>> between in my 5.4 tree it is still gup_remote not longterm.
> 
> It is clearly a bug, vfio must use LONGTERM, and does right above this
> remote call:
> 
> if (mm == current->mm) {
> ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
>  vmas);
> } else {
> ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> vmas, NULL);
> 
> 
> I'm not even sure that it really makes any sense to build a 'if' like
> that, surely just always call remote??
> 


Right, and I thought about this when converting, and realized that the above 
code is working around the current gup.c limitations, which are "cannot support
gup remote with FOLL_LONGTERM".

Given that observation, the code is getting itself some FOLL_LONGTERM support
for the non-remote case, and only hitting the limitation if the mm really is
non-current.

And if you look at my patch, it keeps the same behavior, while adding in the
new wrapper calls.

So...thoughts, preferences?


thanks,

John Hubbard
NVIDIA


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 12:33:09PM -0800, David Rientjes wrote:
> 
> 
> On Sun, 3 Nov 2019, John Hubbard wrote:
> 
> > Introduce pin_user_pages*() variations of get_user_pages*() calls,
> > and also pin_longterm_pages*() variations.
> > 
> > These variants all set FOLL_PIN, which is also introduced, and
> > thoroughly documented.
> > 
> > The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> > to FOLL_PIN:
> > 
> > pin_user_pages()
> > pin_user_pages_remote()
> > pin_user_pages_fast()
> > 
> > pin_longterm_pages()
> > pin_longterm_pages_remote()
> > pin_longterm_pages_fast()
> > 
> > All pages that are pinned via the above calls, must be unpinned via
> > put_user_page().
> > 
> 
> Hi John,
> 
> I'm curious what consideration is given to what pageblock migrate types 
> that FOLL_PIN and FOLL_LONGTERM pages originate from, assuming that 
> longterm would want to originate from MIGRATE_UNMOVABLE pageblocks for the 
> purposes of anti-fragmentation?

We do not control page block, GUP can happens on _any_ page that is
map inside a process (anonymous private vma or regular file back one).

Cheers,
Jérôme



Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread John Hubbard
On 11/4/19 12:31 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> 
>> Note for Jason: the (a) or (b) items are talking about the vfio case, which 
>> is
>> one of the two call sites that now use pin_longterm_pages_remote(), and the
>> other one is infiniband:
>>
>> drivers/infiniband/core/umem_odp.c:646: npages = 
>> pin_longterm_pages_remote(owning_process, owning_mm,
> 
> This is a mistake, it is not a longterm pin and does not need FOLL_PIN
> semantics

OK! So it really just wants to be get_user_pages_remote() / put_page()? I'll
change it back to that.

> 
>> Jason should weigh in on how he wants this to go, with respect to branching
>> and merging, since it sounds like that will conflict with the hmm branch 
> 
> I think since you don't need to change this site things should be
> fine?
> 

Right. 


thanks,

John Hubbard
NVIDIA


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> > Note for Jason: the (a) or (b) items are talking about the vfio case, which 
> > is
> > one of the two call sites that now use pin_longterm_pages_remote(), and the
> > other one is infiniband:
> > 
> > drivers/infiniband/core/umem_odp.c:646: npages = 
> > pin_longterm_pages_remote(owning_process, owning_mm,
> > drivers/vfio/vfio_iommu_type1.c:353:ret = 
> > pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> 
> vfio should be reverted until it can be properly implemented.
> The issue is that when you fix the implementation you might
> break vfio existing user and thus regress the kernel from user
> point of view. So i rather have the change to vfio reverted,
> i believe it was not well understood when it got upstream,
> between in my 5.4 tree it is still gup_remote not longterm.

It is clearly a bug, vfio must use LONGTERM, and does right above this
remote call:

if (mm == current->mm) {
ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
 vmas);
} else {
ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
vmas, NULL);


I'm not even sure that it really makes any sense to build a 'if' like
that, surely just always call remote??

Jason


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> Jason, a question for you at the bottom.
> 
> On 11/4/19 11:52 AM, Jerome Glisse wrote:
> ...
> >> CASE 3: ODP
> >> ---
> >> RDMA hardware with page faulting support. Here, a well-written driver 
> >> doesn't
> > 
> > CASE3: Hardware with page fault support
> > ---
> > 
> > Here, a well-written 
> > 
> 
> Ah, OK. So just drop the first sentence, yes.
> 
> ...
> >> +   */
> >> +  gup_flags |= FOLL_REMOTE | FOLL_PIN;
> >
> > Wouldn't it be better to not add pin_longterm_pages_remote() until
> > it can be properly implemented ?
> >
> 
>  Well, the problem is that I need each call site that requires FOLL_PIN
>  to use a proper wrapper. It's the FOLL_PIN that is the focus here, 
>  because
>  there is a hard, bright rule, which is: if and only if a caller sets
>  FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
>  be called.
> 
>  So this leaves me with only two reasonable choices:
> 
>  a) Convert the call site as above: pin_longterm_pages_remote(), which 
>  sets
>  FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
>  as it has been so far. When the FOLL_LONGTERM situation is fixed, the 
>  call
>  site *might* not need any changes to adopt the working gup.c code.
> 
>  b) Convert the call site to pin_user_pages_remote(), which also sets
>  FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
>  There would also be a comment at the call site, to the effect of, "this
>  is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> 
>  When the FOLL_LONGTERM situation is fixed, the call site will need to be
>  changed to pin_longterm_pages_remote().
> 
>  So you can probably see why I picked (a).
> >>>
> >>> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
> >>> never have the need for pin_longterm_pages_remote(). My fear is that
> >>> longterm has implication and it would be better to not drop this 
> >>> implication
> >>> by adding a wrapper that does not do what the name says.
> >>>
> >>> So do not introduce pin_longterm_pages_remote() until its first user
> >>> happens. This is option c)
> >>>
> >>
> >> Almost forgot, though: there is already another user: Infiniband:
> >>
> >> drivers/infiniband/core/umem_odp.c:646: npages = 
> >> pin_longterm_pages_remote(owning_process, owning_mm,
> > 
> > odp do not need that, i thought the HMM convertion was already upstream
> > but seems not, in any case odp do not need the longterm case it only
> > so best is to revert that user to gup_fast or something until it get
> > converted to HMM.
> > 
> 
> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646: npages = 
> pin_longterm_pages_remote(owning_process, owning_mm,
> drivers/vfio/vfio_iommu_type1.c:353:ret = 
> pin_longterm_pages_remote(NULL, mm, vaddr, 1,

vfio should be reverted until it can be properly implemented.
The issue is that when you fix the implementation you might
break vfio existing user and thus regress the kernel from user
point of view. So i rather have the change to vfio reverted,
i believe it was not well understood when it got upstream,
between in my 5.4 tree it is still gup_remote not longterm.


> Jerome, Jason: I really don't want to revert the put_page() to 
> put_user_page() 
> conversions that are already throughout the IB driver--pointless churn, right?
> I'd rather either delete them in Jason's tree, or go with what I have here
> while waiting for the deletion.
> 
> Maybe we should just settle on (a) or (b), so that the IB driver ends up with
> the wrapper functions? In fact, if it's getting deleted, then I'd prefer 
> leaving
> it at (a), since that's simple...
> 
> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 
> (ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
> being late).
> 
> thanks,
> 
> John Hubbard
> NVIDIA



Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:

> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646: npages = 
> pin_longterm_pages_remote(owning_process, owning_mm,

This is a mistake, it is not a longterm pin and does not need FOLL_PIN
semantics

> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 

I think since you don't need to change this site things should be
fine?

Jason


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread John Hubbard
Jason, a question for you at the bottom.

On 11/4/19 11:52 AM, Jerome Glisse wrote:
...
>> CASE 3: ODP
>> ---
>> RDMA hardware with page faulting support. Here, a well-written driver doesn't
> 
> CASE3: Hardware with page fault support
> ---
> 
> Here, a well-written 
> 

Ah, OK. So just drop the first sentence, yes.

...
>> + */
>> +gup_flags |= FOLL_REMOTE | FOLL_PIN;
>
> Wouldn't it be better to not add pin_longterm_pages_remote() until
> it can be properly implemented ?
>

 Well, the problem is that I need each call site that requires FOLL_PIN
 to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
 there is a hard, bright rule, which is: if and only if a caller sets
 FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
 be called.

 So this leaves me with only two reasonable choices:

 a) Convert the call site as above: pin_longterm_pages_remote(), which sets
 FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
 as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
 site *might* not need any changes to adopt the working gup.c code.

 b) Convert the call site to pin_user_pages_remote(), which also sets
 FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
 There would also be a comment at the call site, to the effect of, "this
 is the wrong call to make: it really requires FOLL_LONGTERM behavior".

 When the FOLL_LONGTERM situation is fixed, the call site will need to be
 changed to pin_longterm_pages_remote().

 So you can probably see why I picked (a).
>>>
>>> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
>>> never have the need for pin_longterm_pages_remote(). My fear is that
>>> longterm has implication and it would be better to not drop this implication
>>> by adding a wrapper that does not do what the name says.
>>>
>>> So do not introduce pin_longterm_pages_remote() until its first user
>>> happens. This is option c)
>>>
>>
>> Almost forgot, though: there is already another user: Infiniband:
>>
>> drivers/infiniband/core/umem_odp.c:646: npages = 
>> pin_longterm_pages_remote(owning_process, owning_mm,
> 
> odp do not need that, i thought the HMM convertion was already upstream
> but seems not, in any case odp do not need the longterm case it only
> so best is to revert that user to gup_fast or something until it get
> converted to HMM.
> 

Note for Jason: the (a) or (b) items are talking about the vfio case, which is
one of the two call sites that now use pin_longterm_pages_remote(), and the
other one is infiniband:

drivers/infiniband/core/umem_odp.c:646: npages = 
pin_longterm_pages_remote(owning_process, owning_mm,
drivers/vfio/vfio_iommu_type1.c:353:ret = 
pin_longterm_pages_remote(NULL, mm, vaddr, 1,


Jerome, Jason: I really don't want to revert the put_page() to put_user_page() 
conversions that are already throughout the IB driver--pointless churn, right?
I'd rather either delete them in Jason's tree, or go with what I have here
while waiting for the deletion.

Maybe we should just settle on (a) or (b), so that the IB driver ends up with
the wrapper functions? In fact, if it's getting deleted, then I'd prefer leaving
it at (a), since that's simple...

Jason should weigh in on how he wants this to go, with respect to branching
and merging, since it sounds like that will conflict with the hmm branch 
(ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
being late).

thanks,

John Hubbard
NVIDIA


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 11:30:32AM -0800, John Hubbard wrote:
> On 11/4/19 11:18 AM, Jerome Glisse wrote:
> > On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> >> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> >> ...
> >>>
> >>> Few nitpick belows, nonetheless:
> >>>
> >>> Reviewed-by: Jérôme Glisse 
> >>> [...]
>  +
>  +CASE 3: ODP
>  +---
>  +(Mellanox/Infiniband On Demand Paging: the hardware supports
>  +replayable page faulting). There are GUP references to pages serving as 
>  DMA
>  +buffers. For ODP, MMU notifiers are used to synchronize with 
>  page_mkclean()
>  +and munmap(). Therefore, normal GUP calls are sufficient, so neither 
>  flag
>  +needs to be set.
> >>>
> >>> I would not include ODP or anything like it here, they do not use
> >>> GUP anymore and i believe it is more confusing here. I would how-
> >>> ever include some text in this documentation explaining that hard-
> >>> ware that support page fault is superior as it does not incur any
> >>> of the issues described here.
> >>
> >> OK, agreed, here's a new write up that I'll put in v3:
> >>
> >>
> >> CASE 3: ODP
> >> ---
> > 
> > ODP is RDMA, maybe Hardware with page fault support instead
> > 
> >> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> 
> OK, so:
> 
> "RDMA hardware with page faulting support."
> 
> for the first sentence.

I would drop RDMA completely, RDMA is just one example, they are GPU, FPGA and
others that are in that category. See below

> 
> 
> >> Here, a well-written driver doesn't normally need to pin pages at all. 
> >> However,
> >> if the driver does choose to do so, it can register MMU notifiers for the 
> >> range,
> >> and will be called back upon invalidation. Either way (avoiding page 
> >> pinning, or
> >> using MMU notifiers to unpin upon request), there is proper 
> >> synchronization with 
> >> both filesystem and mm (page_mkclean(), munmap(), etc).
> >>
> >> Therefore, neither flag needs to be set.
> > 
> > In fact GUP should never be use with those.
> 
> 
> Yes. The next paragraph says that, but maybe not strong enough.
> 
> 
> >>
> >> It's worth mentioning here that pinning pages should not be the first 
> >> design
> >> choice. If page fault capable hardware is available, then the software 
> >> should
> >> be written so that it does not pin pages. This allows mm and filesystems to
> >> operate more efficiently and reliably.
> 
> Here's what we have after the above changes:
> 
> CASE 3: ODP
> ---
> RDMA hardware with page faulting support. Here, a well-written driver doesn't

CASE3: Hardware with page fault support
---

Here, a well-written 


> normally need to pin pages at all. However, if the driver does choose to do 
> so,
> it can register MMU notifiers for the range, and will be called back upon
> invalidation. Either way (avoiding page pinning, or using MMU notifiers to 
> unpin
> upon request), there is proper synchronization with both filesystem and mm
> (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.
> 
> In this case, ideally, neither get_user_pages() nor pin_user_pages() should 
> be 
> called. Instead, the software should be written so that it does not pin 
> pages. 
> This allows mm and filesystems to operate more efficiently and reliably.
> 
> >>> [...]
> >>>
>  @@ -1014,7 +1018,16 @@ static __always_inline long 
>  __get_user_pages_locked(struct task_struct *tsk,
>   BUG_ON(*locked != 1);
>   }
>   
>  -if (pages)
>  +/*
>  + * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional 
>  behavior
>  + * is to set FOLL_GET if the caller wants pages[] filled in 
>  (but has
>  + * carelessly failed to specify FOLL_GET), so keep doing that, 
>  but only
>  + * for FOLL_GET, not for the newer FOLL_PIN.
>  + *
>  + * FOLL_PIN always expects pages to be non-null, but no need to 
>  assert
>  + * that here, as any failures will be obvious enough.
>  + */
>  +if (pages && !(flags & FOLL_PIN))
>   flags |= FOLL_GET;
> >>>
> >>> Did you look at user that have pages and not FOLL_GET set ?
> >>> I believe it would be better to first fix them to end up
> >>> with FOLL_GET set and then error out if pages is != NULL but
> >>> nor FOLL_GET or FOLL_PIN is set.
> >>>
> >>
> >> I was perhaps overly cautious, and didn't go there. However, it's probably
> >> doable, given that there was already the following in __get_user_pages():
> >>
> >> VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> >>
> >> ...which will have conditioned people and code to set FOLL_GET together 
> >> with
> >> pages. So I agree that the time is right.
> >>
> >> In order to make bisecting future failures simpler, I can insert a patch 
> 

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread John Hubbard
On 11/4/19 11:18 AM, Jerome Glisse wrote:
> On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
>> On 11/4/19 9:33 AM, Jerome Glisse wrote:
>> ...
>>>
>>> Few nitpick belows, nonetheless:
>>>
>>> Reviewed-by: Jérôme Glisse 
>>> [...]
 +
 +CASE 3: ODP
 +---
 +(Mellanox/Infiniband On Demand Paging: the hardware supports
 +replayable page faulting). There are GUP references to pages serving as 
 DMA
 +buffers. For ODP, MMU notifiers are used to synchronize with 
 page_mkclean()
 +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
 +needs to be set.
>>>
>>> I would not include ODP or anything like it here, they do not use
>>> GUP anymore and i believe it is more confusing here. I would how-
>>> ever include some text in this documentation explaining that hard-
>>> ware that support page fault is superior as it does not incur any
>>> of the issues described here.
>>
>> OK, agreed, here's a new write up that I'll put in v3:
>>
>>
>> CASE 3: ODP
>> ---
> 
> ODP is RDMA, maybe Hardware with page fault support instead
> 
>> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.

OK, so:

"RDMA hardware with page faulting support."

for the first sentence.


>> Here, a well-written driver doesn't normally need to pin pages at all. 
>> However,
>> if the driver does choose to do so, it can register MMU notifiers for the 
>> range,
>> and will be called back upon invalidation. Either way (avoiding page 
>> pinning, or
>> using MMU notifiers to unpin upon request), there is proper synchronization 
>> with 
>> both filesystem and mm (page_mkclean(), munmap(), etc).
>>
>> Therefore, neither flag needs to be set.
> 
> In fact GUP should never be use with those.


Yes. The next paragraph says that, but maybe not strong enough.


>>
>> It's worth mentioning here that pinning pages should not be the first design
>> choice. If page fault capable hardware is available, then the software should
>> be written so that it does not pin pages. This allows mm and filesystems to
>> operate more efficiently and reliably.

Here's what we have after the above changes:

CASE 3: ODP
---
RDMA hardware with page faulting support. Here, a well-written driver doesn't
normally need to pin pages at all. However, if the driver does choose to do so,
it can register MMU notifiers for the range, and will be called back upon
invalidation. Either way (avoiding page pinning, or using MMU notifiers to unpin
upon request), there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

In this case, ideally, neither get_user_pages() nor pin_user_pages() should be 
called. Instead, the software should be written so that it does not pin pages. 
This allows mm and filesystems to operate more efficiently and reliably.

>>> [...]
>>>
 @@ -1014,7 +1018,16 @@ static __always_inline long 
 __get_user_pages_locked(struct task_struct *tsk,
BUG_ON(*locked != 1);
}
  
 -  if (pages)
 +  /*
 +   * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
 +   * is to set FOLL_GET if the caller wants pages[] filled in (but has
 +   * carelessly failed to specify FOLL_GET), so keep doing that, but only
 +   * for FOLL_GET, not for the newer FOLL_PIN.
 +   *
 +   * FOLL_PIN always expects pages to be non-null, but no need to assert
 +   * that here, as any failures will be obvious enough.
 +   */
 +  if (pages && !(flags & FOLL_PIN))
flags |= FOLL_GET;
>>>
>>> Did you look at user that have pages and not FOLL_GET set ?
>>> I believe it would be better to first fix them to end up
>>> with FOLL_GET set and then error out if pages is != NULL but
>>> nor FOLL_GET or FOLL_PIN is set.
>>>
>>
>> I was perhaps overly cautious, and didn't go there. However, it's probably
>> doable, given that there was already the following in __get_user_pages():
>>
>> VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
>>
>> ...which will have conditioned people and code to set FOLL_GET together with
>> pages. So I agree that the time is right.
>>
>> In order to make bisecting future failures simpler, I can insert a patch 
>> right 
>> before this one, that changes the FOLL_GET setting into an assert, like this:
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 8f236a335ae9..be338961e80d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1014,8 +1014,8 @@ static __always_inline long 
>> __get_user_pages_locked(struct task_struct *tsk,
>> BUG_ON(*locked != 1);
>> }
>>  
>> -   if (pages)
>> -   flags |= FOLL_GET;
>> +   if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
>> +   return -EINVAL;
>>  
>> pages_done = 0;
>> lock_dropped = false;
>>
>>
>> ...and then add in FOLL_PIN, with this patch.
> 
> looks good but double check that 

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> ...
> > 
> > Few nitpick belows, nonetheless:
> > 
> > Reviewed-by: Jérôme Glisse 
> > [...]
> >> +
> >> +CASE 3: ODP
> >> +---
> >> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >> +replayable page faulting). There are GUP references to pages serving as 
> >> DMA
> >> +buffers. For ODP, MMU notifiers are used to synchronize with 
> >> page_mkclean()
> >> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> >> +needs to be set.
> > 
> > I would not include ODP or anything like it here, they do not use
> > GUP anymore and i believe it is more confusing here. I would how-
> > ever include some text in this documentation explaining that hard-
> > ware that support page fault is superior as it does not incur any
> > of the issues described here.
> 
> OK, agreed, here's a new write up that I'll put in v3:
> 
> 
> CASE 3: ODP
> ---

ODP is RDMA, maybe Hardware with page fault support instead

> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> Here, a well-written driver doesn't normally need to pin pages at all. 
> However,
> if the driver does choose to do so, it can register MMU notifiers for the 
> range,
> and will be called back upon invalidation. Either way (avoiding page pinning, 
> or
> using MMU notifiers to unpin upon request), there is proper synchronization 
> with 
> both filesystem and mm (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.

In fact GUP should never be use with those.

> 
> It's worth mentioning here that pinning pages should not be the first design
> choice. If page fault capable hardware is available, then the software should
> be written so that it does not pin pages. This allows mm and filesystems to
> operate more efficiently and reliably.
> 
> > [...]
> > 
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 199da99e8ffc..1aea48427879 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> > 
> > [...]
> > 
> >> @@ -1014,7 +1018,16 @@ static __always_inline long 
> >> __get_user_pages_locked(struct task_struct *tsk,
> >>BUG_ON(*locked != 1);
> >>}
> >>  
> >> -  if (pages)
> >> +  /*
> >> +   * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >> +   * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >> +   * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >> +   * for FOLL_GET, not for the newer FOLL_PIN.
> >> +   *
> >> +   * FOLL_PIN always expects pages to be non-null, but no need to assert
> >> +   * that here, as any failures will be obvious enough.
> >> +   */
> >> +  if (pages && !(flags & FOLL_PIN))
> >>flags |= FOLL_GET;
> > 
> > Did you look at user that have pages and not FOLL_GET set ?
> > I believe it would be better to first fix them to end up
> > with FOLL_GET set and then error out if pages is != NULL but
> > nor FOLL_GET or FOLL_PIN is set.
> > 
> 
> I was perhaps overly cautious, and didn't go there. However, it's probably
> doable, given that there was already the following in __get_user_pages():
> 
> VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> 
> ...which will have conditioned people and code to set FOLL_GET together with
> pages. So I agree that the time is right.
> 
> In order to make bisecting future failures simpler, I can insert a patch 
> right 
> before this one, that changes the FOLL_GET setting into an assert, like this:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..be338961e80d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1014,8 +1014,8 @@ static __always_inline long 
> __get_user_pages_locked(struct task_struct *tsk,
> BUG_ON(*locked != 1);
> }
>  
> -   if (pages)
> -   flags |= FOLL_GET;
> +   if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
> +   return -EINVAL;
>  
> pages_done = 0;
> lock_dropped = false;
> 
> 
> ...and then add in FOLL_PIN, with this patch.

looks good but double check that it should not happens, i will try
to check on my side too.

> 
> >>  
> >>pages_done = 0;
> > 
> >> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long 
> >> start, int nr_pages,
> >>return ret;
> >>  }
> >>  
> >> -/**
> >> - * get_user_pages_fast() - pin user pages in memory
> >> - * @start:starting user address
> >> - * @nr_pages: number of pages from start to pin
> >> - * @gup_flags:flags modifying pin behaviour
> >> - * @pages:array that receives pointers to the pages pinned.
> >> - *Should be at least nr_pages long.
> >> - *
> >> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
> >> - * If not successful, it will fall back to taking the lock and
> >> - * calling get_user_pages().
> >> - *
> >> - * Returns number of pages pinned. This may be fewer than the number
> >> - 

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread John Hubbard
On 11/4/19 9:33 AM, Jerome Glisse wrote:
...
> 
> Few nitpick belows, nonetheless:
> 
> Reviewed-by: Jérôme Glisse 
> [...]
>> +
>> +CASE 3: ODP
>> +---
>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
>> +replayable page faulting). There are GUP references to pages serving as DMA
>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
>> +needs to be set.
> 
> I would not include ODP or anything like it here, they do not use
> GUP anymore and i believe it is more confusing here. I would how-
> ever include some text in this documentation explaining that hard-
> ware that support page fault is superior as it does not incur any
> of the issues described here.

OK, agreed, here's a new write up that I'll put in v3:


CASE 3: ODP
---
Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
Here, a well-written driver doesn't normally need to pin pages at all. However,
if the driver does choose to do so, it can register MMU notifiers for the range,
and will be called back upon invalidation. Either way (avoiding page pinning, or
using MMU notifiers to unpin upon request), there is proper synchronization 
with 
both filesystem and mm (page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

It's worth mentioning here that pinning pages should not be the first design
choice. If page fault capable hardware is available, then the software should
be written so that it does not pin pages. This allows mm and filesystems to
operate more efficiently and reliably.

> [...]
> 
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 199da99e8ffc..1aea48427879 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
> 
> [...]
> 
>> @@ -1014,7 +1018,16 @@ static __always_inline long 
>> __get_user_pages_locked(struct task_struct *tsk,
>>  BUG_ON(*locked != 1);
>>  }
>>  
>> -if (pages)
>> +/*
>> + * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
>> + * is to set FOLL_GET if the caller wants pages[] filled in (but has
>> + * carelessly failed to specify FOLL_GET), so keep doing that, but only
>> + * for FOLL_GET, not for the newer FOLL_PIN.
>> + *
>> + * FOLL_PIN always expects pages to be non-null, but no need to assert
>> + * that here, as any failures will be obvious enough.
>> + */
>> +if (pages && !(flags & FOLL_PIN))
>>  flags |= FOLL_GET;
> 
> Did you look at user that have pages and not FOLL_GET set ?
> I believe it would be better to first fix them to end up
> with FOLL_GET set and then error out if pages is != NULL but
> nor FOLL_GET or FOLL_PIN is set.
> 

I was perhaps overly cautious, and didn't go there. However, it's probably
doable, given that there was already the following in __get_user_pages():

VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

...which will have conditioned people and code to set FOLL_GET together with
pages. So I agree that the time is right.

In order to make bisecting future failures simpler, I can insert a patch right 
before this one, that changes the FOLL_GET setting into an assert, like this:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..be338961e80d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1014,8 +1014,8 @@ static __always_inline long 
__get_user_pages_locked(struct task_struct *tsk,
BUG_ON(*locked != 1);
}
 
-   if (pages)
-   flags |= FOLL_GET;
+   if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
+   return -EINVAL;
 
pages_done = 0;
lock_dropped = false;


...and then add in FOLL_PIN, with this patch.

>>  
>>  pages_done = 0;
> 
>> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long 
>> start, int nr_pages,
>>  return ret;
>>  }
>>  
>> -/**
>> - * get_user_pages_fast() - pin user pages in memory
>> - * @start:  starting user address
>> - * @nr_pages:   number of pages from start to pin
>> - * @gup_flags:  flags modifying pin behaviour
>> - * @pages:  array that receives pointers to the pages pinned.
>> - *  Should be at least nr_pages long.
>> - *
>> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
>> - * If not successful, it will fall back to taking the lock and
>> - * calling get_user_pages().
>> - *
>> - * Returns number of pages pinned. This may be fewer than the number
>> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
>> - * were pinned, returns -errno.
>> - */
>> -int get_user_pages_fast(unsigned long start, int nr_pages,
>> -unsigned int gup_flags, struct page **pages)
>> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>> +unsigned int gup_flags,
>> +struct page **pages)
> 
> Usualy function are rename to _old_func_name ie add _ in front. So
> here 

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
> pin_user_pages()
> pin_user_pages_remote()
> pin_user_pages_fast()
> 
> pin_longterm_pages()
> pin_longterm_pages_remote()
> pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
> * Start with "pin_user_pages" instead of "get_user_pages". That
>   makes it easy to find and audit the call sites.
> * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 

Few nitpick belows, nonetheless:

Reviewed-by: Jérôme Glisse 

> ---
>  Documentation/vm/index.rst  |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++
>  include/linux/mm.h  |  62 ++-
>  mm/gup.c| 265 +---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 

[...]

> diff --git a/Documentation/vm/pin_user_pages.rst 
> b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index ..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst

[...]

> +
> +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
> +==
> +
> +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for 
> describing
> +these categories:
> +
> +CASE 1: Direct IO (DIO)
> +---
> +There are GUP references to pages that are serving
> +as DIO buffers. These buffers are needed for a relatively short time (so they
> +are not "long term"). No special synchronization with page_mkclean() or
> +munmap() is provided. Therefore, flags to set at the call site are: ::
> +
> +FOLL_PIN
> +
> +...but rather than setting FOLL_PIN directly, call sites should use one of
> +the pin_user_pages*() routines that set FOLL_PIN.
> +
> +CASE 2: RDMA
> +
> +There are GUP references to pages that are serving as DMA
> +buffers. These buffers are needed for a long time ("long term"). No special
> +synchronization with page_mkclean() or munmap() is provided. Therefore, flags
> +to set at the call site are: ::
> +
> +FOLL_PIN | FOLL_LONGTERM
> +
> +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. 
> That's
> +because DAX pages do not have a separate page cache, and so "pinning" implies
> +locking down file system blocks, which is not (yet) supported in that way.
> +
> +CASE 3: ODP
> +---
> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> +replayable page faulting). There are GUP references to pages serving as DMA
> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> +needs to be set.

I would not include ODP or anything like it here, they do not use
GUP anymore and i believe it is more confusing here. I would how-
ever include some text in this documentation explaining that hard-
ware that support page fault is superior as it does not incur any
of the issues described here.

> +
> +CASE 4: Pinning for struct page manipulation only
> +-
> +Here, normal GUP calls are sufficient, so neither flag needs to be set.
> +

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..1aea48427879 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1014,7 +1018,16 @@ static __always_inline long 
> __get_user_pages_locked(struct task_struct *tsk,
>   BUG_ON(*locked != 1);
>   }
>  
> - if (pages)
> + /*
> +  * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> +  * is to set FOLL_GET if the caller wants pages[] filled in (but has
> +  * carelessly failed