Re: [PATCH] mm/gup: combine put_compound_head() and unpin_user_page()

2020-12-09 Thread Ira Weiny
On Wed, Dec 09, 2020 at 03:13:57PM -0400, Jason Gunthorpe wrote:
> These functions accomplish the same thing but have different
> implementations.
> 
> unpin_user_page() has a bug where it calls mod_node_page_state() after
> calling put_page() which creates a risk that the page could have been
> hot-unplugged from the system.
> 
> Fix this by using put_compound_head() as the only implementation.
> 
> __unpin_devmap_managed_user_page() and related can be deleted as well in
> favour of the simpler, but slower, version in put_compound_head() that has
> an extra atomic page_ref_sub, but always calls put_page() which internally
> contains the special devmap code.
> 
> Move put_compound_head() to be directly after try_grab_compound_head() so
> people can find it in future.
> 
> Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) 
> reporting")
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Ira Weiny 

> ---
>  mm/gup.c | 103 +--
>  1 file changed, 23 insertions(+), 80 deletions(-)
> 
> With Matt's folio idea I'd next go on to make a
>   put_folio(folio, refs)
> 
> Which would cleanly eliminate that extra atomic here without duplicating the
> devmap special case.
> 
> This should also be called 'ungrab_compound_head' as we seem to be using the
> word 'grab' to mean 'pin or get' depending on GUP flags.
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609c3..7b33b7d4b324d7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -123,6 +123,28 @@ static __maybe_unused struct page 
> *try_grab_compound_head(struct page *page,
>   return NULL;
>  }
>  
> +static void put_compound_head(struct page *page, int refs, unsigned int 
> flags)
> +{
> + if (flags & FOLL_PIN) {
> + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
> + refs);
> +
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, refs);
> + else
> + refs *= GUP_PIN_COUNTING_BIAS;
> + }
> +
> + VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> + /*
> +  * Calling put_page() for each ref is unnecessarily slow. Only the last
> +  * ref needs a put_page().
> +  */
> + if (refs > 1)
> + page_ref_sub(page, refs - 1);
> + put_page(page);
> +}
> +
>  /**
>   * try_grab_page() - elevate a page's refcount by a flag-dependent amount
>   *
> @@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, 
> unsigned int flags)
>   return true;
>  }
>  
> -#ifdef CONFIG_DEV_PAGEMAP_OPS
> -static bool __unpin_devmap_managed_user_page(struct page *page)
> -{
> - int count, refs = 1;
> -
> - if (!page_is_devmap_managed(page))
> - return false;
> -
> - if (hpage_pincount_available(page))
> - hpage_pincount_sub(page, 1);
> - else
> - refs = GUP_PIN_COUNTING_BIAS;
> -
> - count = page_ref_sub_return(page, refs);
> -
> - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
> - /*
> -  * devmap page refcounts are 1-based, rather than 0-based: if
> -  * refcount is 1, then the page is free and the refcount is
> -  * stable because nobody holds a reference on the page.
> -  */
> - if (count == 1)
> - free_devmap_managed_page(page);
> - else if (!count)
> - __put_page(page);
> -
> - return true;
> -}
> -#else
> -static bool __unpin_devmap_managed_user_page(struct page *page)
> -{
> - return false;
> -}
> -#endif /* CONFIG_DEV_PAGEMAP_OPS */
> -
>  /**
>   * unpin_user_page() - release a dma-pinned page
>   * @page:pointer to page to be released
> @@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page 
> *page)
>   */
>  void unpin_user_page(struct page *page)
>  {
> - int refs = 1;
> -
> - page = compound_head(page);
> -
> - /*
> -  * For devmap managed pages we need to catch refcount transition from
> -  * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> -  * page is free and we need to inform the device driver through
> -  * callback. See include/linux/memremap.h and HMM for details.
> -  */
> - if (__unpin_devmap_managed_user_page(page))
> - return;
> -
> - if (hpage_pincount_available(page))
> - hpage_pincount_sub(page, 1);
> - else
> - refs = GUP_PIN_COUNTING_BIAS;
> -
> - if (page_ref_sub_and_test(page, refs))
> - __put_page(page);
> -
> - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
> + put_compound_head(compound_head(page), 1, FOLL_PIN);
>  }
>  EXPORT_SYMBOL(unpin_user_page);
>  
> @@ -2062,29 +2028,6 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
>   * This code is based heavily on the PowerPC implementation by Nick Piggin.
>   */
>  #ifdef CONFIG_HAVE_FAST_GUP
> -
> -static void put_compound_head(struct page 



Re: [PATCH] mm/gup: combine put_compound_head() and unpin_user_page()

2020-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2020 at 12:57:38PM -0800, John Hubbard wrote:
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 98eb8e6d2609c3..7b33b7d4b324d7 100644
> > +++ b/mm/gup.c
> > @@ -123,6 +123,28 @@ static __maybe_unused struct page 
> > *try_grab_compound_head(struct page *page,
> > return NULL;
> >   }
> > +static void put_compound_head(struct page *page, int refs, unsigned int 
> > flags)
> > +{
> 
> It might be nice to rename "page" to "head", here.
> 
> While reading this I toyed with the idea of having this at the top:
> 
>   VM_BUG_ON_PAGE(compound_head(page) != page, page);
> 
> ...but it's overkill in a static function with pretty clear call sites. So I
> think it's just right as-is.

Matt's folio patches would take care of this; when that is available
all this becomes a lot better.

Thanks,
Jason


Re: [PATCH] mm/gup: combine put_compound_head() and unpin_user_page()

2020-12-09 Thread John Hubbard

On 12/9/20 11:13 AM, Jason Gunthorpe wrote:

These functions accomplish the same thing but have different
implementations.

unpin_user_page() has a bug where it calls mod_node_page_state() after
calling put_page() which creates a risk that the page could have been
hot-unplugged from the system.

Fix this by using put_compound_head() as the only implementation.

__unpin_devmap_managed_user_page() and related can be deleted as well in
favour of the simpler, but slower, version in put_compound_head() that has
an extra atomic page_ref_sub, but always calls put_page() which internally
contains the special devmap code.

Move put_compound_head() to be directly after try_grab_compound_head() so
people can find it in future.

Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) 
reporting")
Signed-off-by: Jason Gunthorpe 
---
  mm/gup.c | 103 +--
  1 file changed, 23 insertions(+), 80 deletions(-)



Reviewed-by: John Hubbard 

With a couple of minor notes below:


With Matt's folio idea I'd next go on to make a
   put_folio(folio, refs)

Which would cleanly eliminate that extra atomic here without duplicating the
devmap special case.

This should also be called 'ungrab_compound_head' as we seem to be using the
word 'grab' to mean 'pin or get' depending on GUP flags.

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609c3..7b33b7d4b324d7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,28 @@ static __maybe_unused struct page 
*try_grab_compound_head(struct page *page,
return NULL;
  }
  
+static void put_compound_head(struct page *page, int refs, unsigned int flags)

+{


It might be nice to rename "page" to "head", here.

While reading this I toyed with the idea of having this at the top:

VM_BUG_ON_PAGE(compound_head(page) != page, page);

...but it's overkill in a static function with pretty clear call sites. So I
think it's just right as-is.



+   if (flags & FOLL_PIN) {
+   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+   refs);
+
+   if (hpage_pincount_available(page))
+   hpage_pincount_sub(page, refs);
+   else
+   refs *= GUP_PIN_COUNTING_BIAS;
+   }
+
+   VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+   /*
+* Calling put_page() for each ref is unnecessarily slow. Only the last
+* ref needs a put_page().
+*/
+   if (refs > 1)
+   page_ref_sub(page, refs - 1);
+   put_page(page);
+}
+
  /**
   * try_grab_page() - elevate a page's refcount by a flag-dependent amount
   *
@@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, 
unsigned int flags)
return true;
  }
  
-#ifdef CONFIG_DEV_PAGEMAP_OPS

-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-   int count, refs = 1;
-
-   if (!page_is_devmap_managed(page))
-   return false;
-
-   if (hpage_pincount_available(page))
-   hpage_pincount_sub(page, 1);
-   else
-   refs = GUP_PIN_COUNTING_BIAS;
-
-   count = page_ref_sub_return(page, refs);
-
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
-   /*
-* devmap page refcounts are 1-based, rather than 0-based: if
-* refcount is 1, then the page is free and the refcount is
-* stable because nobody holds a reference on the page.
-*/
-   if (count == 1)
-   free_devmap_managed_page(page);
-   else if (!count)
-   __put_page(page);
-
-   return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-   return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-


Wow, getting rid of that duplication is beautiful!

thanks,
--
John Hubbard
NVIDIA


  /**
   * unpin_user_page() - release a dma-pinned page
   * @page:pointer to page to be released
@@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page 
*page)
   */
  void unpin_user_page(struct page *page)
  {
-   int refs = 1;
-
-   page = compound_head(page);
-
-   /*
-* For devmap managed pages we need to catch refcount transition from
-* GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
-* page is free and we need to inform the device driver through
-* callback. See include/linux/memremap.h and HMM for details.
-*/
-   if (__unpin_devmap_managed_user_page(page))
-   return;
-
-   if (hpage_pincount_available(page))
-   hpage_pincount_sub(page, 1);
-   else
-   refs = GUP_PIN_COUNTING_BIAS;
-
-   if (page_ref_sub_and_test(page, refs))
-   __put_page(page);
-
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
+   put_compound_head(compound_head(page), 1, FOLL_PIN);
  }
  

[PATCH] mm/gup: combine put_compound_head() and unpin_user_page()

2020-12-09 Thread Jason Gunthorpe
These functions accomplish the same thing but have different
implementations.

unpin_user_page() has a bug where it calls mod_node_page_state() after
calling put_page() which creates a risk that the page could have been
hot-unplugged from the system.

Fix this by using put_compound_head() as the only implementation.

__unpin_devmap_managed_user_page() and related can be deleted as well in
favour of the simpler, but slower, version in put_compound_head() that has
an extra atomic page_ref_sub, but always calls put_page() which internally
contains the special devmap code.

Move put_compound_head() to be directly after try_grab_compound_head() so
people can find it in future.

Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) 
reporting")
Signed-off-by: Jason Gunthorpe 
---
 mm/gup.c | 103 +--
 1 file changed, 23 insertions(+), 80 deletions(-)

With Matt's folio idea I'd next go on to make a
  put_folio(folio, refs)

Which would cleanly eliminate that extra atomic here without duplicating the
devmap special case.
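
As a rough sketch of that direction (hypothetical only; struct folio and the
final-free helper named here do not exist at the time of this posting), the
release side could become a single atomic on the head:

	/*
	 * Hypothetical sketch: a folio is by definition a compound head, so a
	 * single sub-and-test covers the whole batch, and the final-free path
	 * (including any devmap handling) lives in one place.
	 */
	static inline void put_folio(struct folio *folio, int refs)
	{
		if (page_ref_sub_and_test(&folio->page, refs))
			__put_folio(folio);	/* hypothetical final-free path */
	}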

This should also be called 'ungrab_compound_head' as we seem to be using the
word 'grab' to mean 'pin or get' depending on GUP flags.

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609c3..7b33b7d4b324d7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,28 @@ static __maybe_unused struct page 
*try_grab_compound_head(struct page *page,
return NULL;
 }
 
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
+{
+   if (flags & FOLL_PIN) {
+   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+   refs);
+
+   if (hpage_pincount_available(page))
+   hpage_pincount_sub(page, refs);
+   else
+   refs *= GUP_PIN_COUNTING_BIAS;
+   }
+
+   VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+   /*
+* Calling put_page() for each ref is unnecessarily slow. Only the last
+* ref needs a put_page().
+*/
+   if (refs > 1)
+   page_ref_sub(page, refs - 1);
+   put_page(page);
+}
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  *
@@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, 
unsigned int flags)
return true;
 }
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-   int count, refs = 1;
-
-   if (!page_is_devmap_managed(page))
-   return false;
-
-   if (hpage_pincount_available(page))
-   hpage_pincount_sub(page, 1);
-   else
-   refs = GUP_PIN_COUNTING_BIAS;
-
-   count = page_ref_sub_return(page, refs);
-
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
-   /*
-* devmap page refcounts are 1-based, rather than 0-based: if
-* refcount is 1, then the page is free and the refcount is
-* stable because nobody holds a reference on the page.
-*/
-   if (count == 1)
-   free_devmap_managed_page(page);
-   else if (!count)
-   __put_page(page);
-
-   return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-   return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:pointer to page to be released
@@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page 
*page)
  */
 void unpin_user_page(struct page *page)
 {
-   int refs = 1;
-
-   page = compound_head(page);
-
-   /*
-* For devmap managed pages we need to catch refcount transition from
-* GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
-* page is free and we need to inform the device driver through
-* callback. See include/linux/memremap.h and HMM for details.
-*/
-   if (__unpin_devmap_managed_user_page(page))
-   return;
-
-   if (hpage_pincount_available(page))
-   hpage_pincount_sub(page, 1);
-   else
-   refs = GUP_PIN_COUNTING_BIAS;
-
-   if (page_ref_sub_and_test(page, refs))
-   __put_page(page);
-
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
+   put_compound_head(compound_head(page), 1, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_page);
 
@@ -2062,29 +2028,6 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * This code is based heavily on the PowerPC implementation by Nick Piggin.
  */
 #ifdef CONFIG_HAVE_FAST_GUP
-
-static void put_compound_head(struct page *page, int refs, unsigned int flags)
-{
-   if (flags & FOLL_PIN) {
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
-   refs);
-
-   if (hpage_pincount_available(page))
-   

Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages

2020-12-09 Thread Matthew Wilcox
On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +, Joao Martins wrote:
> 
> > Today (without the series) struct pages are not represented the way they
> > are expressed in the page tables, which is what I am hoping to fix in this
> > series thus initializing these as compound pages of a given order. But me
> > introducing PGMAP_COMPOUND was to conservatively keep both old 
> > (non-compound)
> > and new (compound pages) co-exist.
> 
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.

Uh, yes.  I'm surprised it hasn't caused more problems.

> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.

But we also see good performance improvements from doing all reference
counts on the head page instead of spread throughout the pages, so we
really want compound pages.
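
For illustration, a minimal sketch of the per-pfn iteration being discussed
(hypothetical helper name; error unwind omitted, and note that this per-page
grabbing is exactly the refcount cost the compound-page approach avoids):

	static int gup_pfn_range(unsigned long pfn, unsigned long addr,
				 unsigned long end, unsigned int flags,
				 struct page **pages, int *nr)
	{
		/* Walk a PMD-sized range that may be backed by order-0 pages,
		 * taking a reference on each page individually. */
		for (; addr != end; addr += PAGE_SIZE, pfn++) {
			struct page *page = pfn_to_page(pfn);

			if (unlikely(!try_grab_page(page, flags)))
				return 0;
			pages[(*nr)++] = page;
		}
		return 1;
	}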


Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages

2020-12-09 Thread Joao Martins
On 12/9/20 4:24 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +, Joao Martins wrote:
> 
>> Today (without the series) struct pages are not represented the way they
>> are expressed in the page tables, which is what I am hoping to fix in this
>> series thus initializing these as compound pages of a given order. But me
>> introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
>> and new (compound pages) co-exist.
> 
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.
> 
> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.
> 
/me nods

> This way it can remain general with no particularly special path for
> devmap or a special PGMAP_COMPOUND check here.

The less special paths the better, indeed.

Joao


Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages

2020-12-09 Thread Joao Martins
On 12/9/20 3:15 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 11:05:39AM +, Joao Martins wrote:
>>> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
>>> PFN we can apply this optimization. How come device has its own
>>> special path to do this?? 
>>
>> I think the reason is that zone_device struct pages have no
>> relationship to one another. So you anyway need to change individual
>> pages, as opposed to just the head page.
> 
> Huh? That can't be, unpin doesn't know the memory type when it unpins
> it, and as your series shows unpin always operates on the compound
> head. Thus pinning must also operate on compound heads
> 
I was referring to the code without this series, in the paragraph above.
Meaning today zone_device pages are *not* represented as compound pages. And so
compound_head(page) on a non-compound page just returns the page itself.

Otherwise, try_grab_page() (e.g. when pinning pages) would be broken.

>> I made it special to avoid breaking other ZONE_DEVICE users (and
>> gating that with PGMAP_COMPOUND). But if there's no concerns with
>> that, I can unilaterally enable it.
> 
> I didn't understand what PGMAP_COMPOUND was supposed to be for..
>  
PGMAP_COMPOUND's purpose is to online these pages as compound pages (i.e. head
and tails).

Today (without the series) struct pages are not represented the way they
are expressed in the page tables, which is what I am hoping to fix in this
series thus initializing these as compound pages of a given order. But me
introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
and new (compound pages) co-exist.

I wasn't sure I could just enable it regardless, worried that I would be breaking
other ZONE_DEVICE/memremap_pages users.

>>> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
>>> (We already removed that from the hmm version of this, was that wrong?
>>> Is this different?) Dan?
> 
> And this is the key question - why do we need to get a pgmap here?
> 
> I'm going to assert that a pgmap cannot be destroyed concurrently with
> fast gup running. This is surely true on x86 as the TLB flush that
> must have preceded a pgmap destroy excludes fast gup. Other arches
> must emulate this in their pgmap implementations.
> 
> So, why do we need pgmap here? Hoping Dan might know
> 
> If we delete the pgmap then the devmap stop being special.
>
I will let Dan chip in.

> CH and I looked at this and deleted it from the hmm side:
> 
> commit 068354ade5dd9e2b07d9b0c57055a681db6f4e37
> Author: Jason Gunthorpe 
> Date:   Fri Mar 27 17:00:13 2020 -0300
> 
> mm/hmm: remove pgmap checking for devmap pages
> 
> The checking boils down to some racy check if the pagemap is still
> available or not. Instead of checking this, rely entirely on the
> notifiers, if a pagemap is destroyed then all pages that belong to it must
> be removed from the tables and the notifiers triggered.
> 
> Link: https://lore.kernel.org/r/20200327200021.29372-2-...@ziepe.ca
> 
> Though I am wondering if this whole hmm thing is racy with memory
> unplug. Hmm.

Joao




Re: [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate

2020-12-09 Thread Joao Martins



On 12/9/20 6:16 AM, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
>> Replace vmem_altmap with an vmem_context argument. That let us
>> express how the vmemmap is gonna be initialized e.g. passing
>> flags and a page size for reusing pages upon initializing the
>> vmemmap.
> 
> How about this instead:
> 
> Replace the vmem_altmap argument with a vmem_context argument that
> contains vmem_altmap for now. Subsequent patches will add additional
> member elements to vmem_context, such as flags and page size.
> 
> No behavior changes are intended.
> 
> ?
> 
Yeap, it's better that way. Thanks.

>>
>> Signed-off-by: Joao Martins 
>> ---
>>   include/linux/memory_hotplug.h |  6 +-
>>   include/linux/mm.h |  2 +-
>>   mm/memory_hotplug.c|  3 ++-
>>   mm/sparse-vmemmap.c|  6 +-
>>   mm/sparse.c| 16 
>>   5 files changed, 21 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> index 551093b74596..73f8bcbb58a4 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -81,6 +81,10 @@ struct mhp_params {
>>  pgprot_t pgprot;
>>   };
>>   
>> +struct vmem_context {
>> +struct vmem_altmap *altmap;
>> +};
>> +
>>   /*
>>* Zone resizing functions
>>*
>> @@ -353,7 +357,7 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
>> unsigned long nr_pages);
>>   extern bool is_memblock_offlined(struct memory_block *mem);
>>   extern int sparse_add_section(int nid, unsigned long pfn,
>> -unsigned long nr_pages, struct vmem_altmap *altmap);
>> +unsigned long nr_pages, struct vmem_context *ctx);
>>   extern void sparse_remove_section(struct mem_section *ms,
>>  unsigned long pfn, unsigned long nr_pages,
>>  unsigned long map_offset, struct vmem_altmap *altmap);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index db6ae4d3fb4e..2eb44318bb2d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3000,7 +3000,7 @@ static inline void print_vma_addr(char *prefix, 
>> unsigned long rip)
>>   
>>   void *sparse_buffer_alloc(unsigned long size);
>>   struct page * __populate_section_memmap(unsigned long pfn,
>> -unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
>> +unsigned long nr_pages, int nid, struct vmem_context *ctx);
>>   pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>   p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>   pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 63b2e46b6555..f8870c53fe5e 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -313,6 +313,7 @@ int __ref __add_pages(int nid, unsigned long pfn, 
>> unsigned long nr_pages,
>>  unsigned long cur_nr_pages;
>>  int err;
>>  struct vmem_altmap *altmap = params->altmap;
>> +struct vmem_context ctx = { .altmap = params->altmap };
> 
> OK, so this is the one place I can see where ctx is set up. And it's never 
> null.
> Let's remember that point...
> 

(...)

>>   
>>  if (WARN_ON_ONCE(!params->pgprot.pgprot))
>>  return -EINVAL;
>> @@ -341,7 +342,7 @@ int __ref __add_pages(int nid, unsigned long pfn, 
>> unsigned long nr_pages,
>>  /* Select all remaining pages up to the next section boundary */
>>  cur_nr_pages = min(end_pfn - pfn,
>> SECTION_ALIGN_UP(pfn + 1) - pfn);
>> -err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
>> +err = sparse_add_section(nid, pfn, cur_nr_pages, &ctx);
>>  if (err)
>>  break;
>>  cond_resched();
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 16183d85a7d5..bcda68ba1381 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -249,15 +249,19 @@ int __meminit vmemmap_populate_basepages(unsigned long 
>> start, unsigned long end,
>>   }
>>   
>>   struct page * __meminit __populate_section_memmap(unsigned long pfn,
>> -unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
>> +unsigned long nr_pages, int nid, struct vmem_context *ctx)
>>   {
>>  unsigned long start = (unsigned long) pfn_to_page(pfn);
>>  unsigned long end = start + nr_pages * sizeof(struct page);
>> +struct vmem_altmap *altmap = NULL;
>>   
>>  if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>>  !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>  return NULL;
>>   
>> +if (ctx)
> 
> But...ctx can never be null, right?
> 
Indeed.

This is an artifact of an old version of this where the passed parameter
could be null.

> I didn't spot any other issues, though.
> 
> thanks,
> 

Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages

2020-12-09 Thread Joao Martins
On 12/9/20 4:40 AM, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>   $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> 
> "gup_test", now that you're in linux-next, actually.
> 
> (Maybe I'll retrofit that test with getopt_long(), those options are
> getting more elaborate.)
> 
:)

>>
>> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
> 
> That is a beautiful result! I'm very motivated to see if this patchset
> can make it in, in some form.
> 
Cool!

>>
>> Signed-off-by: Joao Martins 
>> ---
>>   mm/gup.c | 67 ++--
>>   1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
>> addr, unsigned long end,
>>   }
>>   #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>   
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +   unsigned long end, struct page **pages)
>> +{
>> +int nr;
>> +
>> +for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +pages[nr++] = page++;
>> +
>> +return nr;
>> +}
>> +
>>   #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
>> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> - unsigned long end, unsigned int flags,
>> - struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +  struct page *head, unsigned long sz,
> 
> If this variable survives (I see Jason requested a reorg of this math stuff,
> and I also like that idea), then I'd like a slightly better name for "sz".
> 
Yeap.

> I was going to suggest one, but then realized that I can't understand how this
> works. See below...
> 
>> +  unsigned long addr, unsigned long end,
>> +  unsigned int flags, struct page **pages)
>> +{
>> +struct page *page;
>> +int refs;
>> +
>> +if (!(pgmap->flags & PGMAP_COMPOUND))
>> +return -1;
> 
> btw, I'm unhappy with returning -1 here and assigning it later to a refs 
> variable.
> (And that will show up even more clearly as an issue if you attempt to make
> refs unsigned everywhere!)
> 
Yes true.

The usage of @refs = -1 (therefore an int) was to differentiate when we are not in a
PGMAP_COMPOUND pgmap (and so the logic is kept as it is today).

Notice that in the PGMAP_COMPOUND case if we fail to grab the head compound 
page we return 0.

> I'm not going to suggest anything because there are a lot of ways to structure
> these routines, and I don't want to overly constrain you. Just please don't 
> assign
> negative values to any refs variables.
> 
OK.

TBH I'm a little afraid this can turn into further complexity if I have to keep 
the
non-compound pgmap around. But I will see how I can adjust this.

>> +
>> +page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> 
> If you pass in PMD_SHIFT or PUD_SHIFT for sz, that's a number-of-bits, isn't it?
> Not a size. And if it's not a size, then sz - 1 doesn't work, does it? If it
> does work, then better naming might help. I'm probably missing a really
> obvious math trick here.

You're right. That was a mistake on my end, indeed. But the mistake wouldn't change the
logic, as the PageReferenced bit only applies to the head page.

Joao


Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages

2020-12-09 Thread Joao Martins
On 12/9/20 6:33 AM, Matthew Wilcox wrote:
> On Tue, Dec 08, 2020 at 09:59:19PM -0800, John Hubbard wrote:
>> On 12/8/20 9:28 AM, Joao Martins wrote:
>>> Add a new flag for struct dev_pagemap which designates that a a pagemap
>>
>> a a
>>
Ugh. Yeah will fix.

>>> is described as a set of compound pages or in other words, that how
>>> pages are grouped together in the page tables are reflected in how we
>>> describe struct pages. This means that rather than initializing
>>> individual struct pages, we also initialize these struct pages, as
>>
>> Let's not say "rather than x, we also do y", because it's self-contradictory.
>> I think you want to just leave out the "also", like this:
>>
>> "This means that rather than initializing> individual struct pages, we
>> initialize these struct pages ..."
>>
>> Is that right?
> 
Nope, my previous text was broken.

> I'd phrase it as:
> 
> Add a new flag for struct dev_pagemap which specifies that a pagemap is
> composed of a set of compound pages instead of individual pages.  When
> these pages are initialised, most are initialised as tail pages
> instead of order-0 pages.
> 
Thanks, I will use this instead.

>>> For certain ZONE_DEVICE users, like device-dax, which have a fixed page
>>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>>> thus playing the same tricks as hugetlb pages.
> 
> Rather than "playing the same tricks", how about "are treated the same
> way as THP or hugetlb pages"?
> 
>>> +   if (pgmap->flags & PGMAP_COMPOUND)
>>> +   percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>> +   - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>>
>> Is there some reason that we cannot use range_len(), instead of pfn_end() 
>> minus
>> pfn_first()? (Yes, this more about the pre-existing code than about your 
>> change.)
>>
Indeed one could use range_len() / pgmap->align and it would work. But (...)

>> And if not, then why are the nearby range_len() uses OK? I realize that 
>> range_len()
>> is simpler and skips a case, but it's not clear that it's required here. But 
>> I'm
>> new to this area so be warned. :)
>>
My use of pfns to calculate the nr of pages was to remain consistent with the rest of the
code in the function taking references in pgmap->ref. The usages one sees of range_len()
are when the hotplug takes place, which works with addresses and not PFNs.

>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what 
>> you
>> happen to want, but is not named accordingly. Can you use or create something
>> more accurately named? Like "number of pages in this large page"?
> 
> We have compound_nr(), but that takes a struct page as an argument.
> We also have HPAGE_NR_PAGES.  I'm not quite clear what you want.
> 
If possible I would rather keep the pfns as with the rest of the code. Another alternative
is a range_nr_pages() helper, but I am not sure it's worth the trouble for one caller.

Joao


Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages

2020-12-09 Thread Joao Martins
On 12/8/20 7:34 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:59PM +, Joao Martins wrote:
>> Rather than decrementing the ref count one by one, we
>> walk the page array checking which pages belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page.
>>
>> Signed-off-by: Joao Martins 
>>  mm/gup.c | 41 -
>>  1 file changed, 32 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 194e6981eb03..3a9a7229f418 100644
>> +++ b/mm/gup.c
>> @@ -212,6 +212,18 @@ static bool __unpin_devmap_managed_user_page(struct 
>> page *page)
>>  }
>>  #endif /* CONFIG_DEV_PAGEMAP_OPS */
>>  
>> +static int record_refs(struct page **pages, int npages)
>> +{
>> +struct page *head = compound_head(pages[0]);
>> +int refs = 1, index;
>> +
>> +for (index = 1; index < npages; index++, refs++)
>> +if (compound_head(pages[index]) != head)
>> +break;
>> +
>> +return refs;
>> +}
>> +
>>  /**
>>   * unpin_user_page() - release a dma-pinned page
>>   * @page:pointer to page to be released
>> @@ -221,9 +233,9 @@ static bool __unpin_devmap_managed_user_page(struct page 
>> *page)
>>   * that such pages can be separately tracked and uniquely handled. In
>>   * particular, interactions with RDMA and filesystems need special handling.
>>   */
>> -void unpin_user_page(struct page *page)
>> +static void __unpin_user_page(struct page *page, int refs)
> 
> Refs should be unsigned everywhere.
> 
/me nods

> I suggest using clear language 'page' here should always be a compound
> head called 'head' (or do we have another common variable name for
> this?)
> 
> 'refs' is number of tail pages within the compound, so 'ntails' or
> something
> 
The usage of 'refs' seems to align with the rest of the GUP code. It's always referring to
tail pages, and the unpin case isn't any different IIUC.

I suppose we can always change that, but maybe it's better to do that renaming in one
shot as a post-cleanup?

>>  {
>> -int refs = 1;
>> +int orig_refs = refs;
>>  
>>  page = compound_head(page);
> 
> Caller should always do this
> 
/me nods

>> @@ -237,14 +249,19 @@ void unpin_user_page(struct page *page)
>>  return;
>>  
>>  if (hpage_pincount_available(page))
>> -hpage_pincount_sub(page, 1);
>> +hpage_pincount_sub(page, refs);
>>  else
>> -refs = GUP_PIN_COUNTING_BIAS;
>> +refs *= GUP_PIN_COUNTING_BIAS;
>>  
>>  if (page_ref_sub_and_test(page, refs))
>>  __put_page(page);
>>  
>> -mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
>> +mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
>> +}
> 
> And really this should be placed directly after
> try_grab_compound_head() and be given a similar name
> 'unpin_compound_head()'. Even better would be to split the FOLL_PIN
> part into a function so there was a clear logical pairing.
> 
> And reviewing it like that I want to ask if this unpin sequence is in
> the right order.. I would expect it to be the reverse order of the get
> 
> John?
> 
> Is it safe to call mod_node_page_state() after releasing the refcount?
> This could race with hot-unplugging the struct pages so I think it is
> wrong.
> 
It appears to be the case, based on John's follow-up comment.

>> +void unpin_user_page(struct page *page)
>> +{
>> +__unpin_user_page(page, 1);
> 
> Thus this is
> 
>   __unpin_user_page(compound_head(page), 1);
> 
Got it.

>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
>> unsigned long npages,
>>   bool make_dirty)
>>  {
>>  unsigned long index;
>> +int refs = 1;
>>  
>>  /*
>>   * TODO: this can be optimized for huge pages: if a series of pages is
>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
>> unsigned long npages,
>>  return;
>>  }
>>  
>> -for (index = 0; index < npages; index++) {
>> +for (index = 0; index < npages; index += refs) {
>>  struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compound_head(page_list, page_list_len, &head, &ntails) {
>   if (!PageDirty(head))
>   set_page_dirty_lock(head, ntails);
>   unpin_user_page(head, ntails);
> }
> 
/me nods. Let me attempt that.

> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
I like the idea of a page range API alternative to unpin_user_pages(), but
improving current unpin_user_pages() would improve other unpin users too.

Perhaps the logic can be common, and the current unpin_user_pages() would have
the second iteration part, while the new (faster) API would be based on computation.
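
For reference, a sketch of the computation in question, i.e. finding the
compound head of pages[i] and how many of the following entries are tails of
the same compound page (hypothetical helper name, mirroring record_refs()
above):

	static unsigned int count_ntails(struct page **pages, unsigned long npages,
					 unsigned long i, struct page **head)
	{
		struct page *h = compound_head(pages[i]);
		unsigned int ntails = 1;

		/* Count consecutive entries sharing the same compound head. */
		while (i + ntails < npages && compound_head(pages[i + ntails]) == h)
			ntails++;

		*head = h;
		return ntails;
	}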


Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas

2020-12-09 Thread Joao Martins
On 12/8/20 7:57 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:29:01PM +, Joao Martins wrote:
>> Similar to follow_hugetlb_page() add a follow_devmap_page which rather
>> than calling follow_page() per 4K page in a PMD/PUD it does so for the
>> entire PMD, where we lock the pmd/pud, get all pages , unlock.
>>
>> While doing so, we only change the refcount once when PGMAP_COMPOUND is
>> passed in.
>>
>> This let us improve {pin,get}_user_pages{,_longterm}() considerably:
>>
>> $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w
>>
>> () [before] -> [after]
>> (get_user_pages 2M pages) ~150k us -> ~8.9k us
>> (pin_user_pages 2M pages) ~192k us -> ~9k us
>> (pin_user_pages_longterm 2M pages) ~200k us -> ~19k us
>>
>> Signed-off-by: Joao Martins 
>> ---
>> I've special-cased this to device-dax vmas given its similar page size
>> guarantees as hugetlbfs, but I feel this is a bit wrong. I am
>> replicating follow_hugetlb_page() as RFC ought to seek feedback whether
>> this should be generalized if no fundamental issues exist. In such case,
>> should I be changing follow_page_mask() to take either an array of pages
>> or a function pointer and opaque arguments which would let caller pick
>> its structure?
> 
> I would be extremely sad if this was the only way to do this :(
> 
> We should be trying to make things more general. 

Yeap, indeed.

Especially since a similar problem is observed for THP, at least from the
measurements I saw. It is all slow, except for hugetlbfs.

> The
> hmm_range_fault_path() doesn't have major special cases for device, I
> am struggling to understand why gup fast and slow do.
> 
> What we've talked about is changing the calling convention across all
> of this to something like:
> 
> struct gup_output {
>struct page **cur;
>struct page **end;
>unsigned long vaddr;
>[..]
> }
> 
> And making the manipulator like you saw for GUP common:
> 
> gup_output_single_page()
> gup_output_pages()
> 
> Then putting this eveywhere. This is the pattern that we ended up with
> in hmm_range_fault, and it seems to be working quite well.
> 
> fast/slow should be much more symmetric in code than they are today,
> IMHO.. 

Thanks for the suggestions above.

> I think those differences mainly exist because it used to be
> siloed in arch code. Some of the differences might be bugs, we've seen
> that a few times at least..

Interesting, wasn't aware of the siloing.

I'll go investigate how this all refactoring goes together, at the point
of which a future iteration of this particular patch probably needs to
move independently from this series.
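
For reference, a very rough sketch of the output-cursor idea quoted above
(hypothetical structure layout and helper name; not an existing API):

	struct gup_output {
		struct page **cur;
		struct page **end;
		unsigned long vaddr;
	};

	/* Record @ntails subpages starting at @head and advance the cursor;
	 * returns false if the caller-provided array would overflow. */
	static bool gup_output_pages(struct gup_output *out, struct page *head,
				     unsigned int ntails)
	{
		unsigned int i;

		if (out->cur + ntails > out->end)
			return false;
		for (i = 0; i < ntails; i++)
			*out->cur++ = head + i;
		out->vaddr += (unsigned long)ntails << PAGE_SHIFT;
		return true;
	}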

Joao


Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages

2020-12-09 Thread Joao Martins



On 12/8/20 7:49 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:58PM +, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>  $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>>
>> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
>>
>> Signed-off-by: Joao Martins 
>>  mm/gup.c | 67 ++--
>>  1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
>> addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>  
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +   unsigned long end, struct page **pages)
>> +{
>> +int nr;
>> +
>> +for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +pages[nr++] = page++;
>> +
>> +return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
>> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> - unsigned long end, unsigned int flags,
>> - struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +  struct page *head, unsigned long sz,
>> +  unsigned long addr, unsigned long end,
>> +  unsigned int flags, struct page **pages)
>> +{
>> +struct page *page;
>> +int refs;
>> +
>> +if (!(pgmap->flags & PGMAP_COMPOUND))
>> +return -1;
>> +
>> +page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> 
> All the places that call record_subpages do some kind of maths like
> this, it should be placed inside record_subpages and not opencoded
> everywhere.
> 
Makes sense.
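
Something along these lines, perhaps -- a sketch only, assuming @sz is the
mapping size in bytes (e.g. PMD_SIZE), so the head-offset maths lives in one
place instead of in every caller:

	static int record_subpages(struct page *head, unsigned long sz,
				   unsigned long addr, unsigned long end,
				   struct page **pages)
	{
		/* First subpage of the compound page covered by this mapping. */
		struct page *page = head + ((addr & (sz - 1)) >> PAGE_SHIFT);
		int nr;

		for (nr = 0; addr != end; addr += PAGE_SIZE)
			pages[nr++] = page++;

		return nr;
	}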

>> +refs = record_subpages(page, addr, end, pages);
>> +
>> +SetPageReferenced(page);
>> +head = try_grab_compound_head(head, refs, flags);
>> +if (!head) {
>> +ClearPageReferenced(page);
>> +return 0;
>> +}
>> +
>> +return refs;
>> +}
> 
> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
> PFN we can apply this optimization. How come device has its own
> special path to do this?? 
> 
I think the reason is that zone_device struct pages have no relationship to one another. So
you anyway need to change individual pages, as opposed to just the head page.

I made it special to avoid breaking other ZONE_DEVICE users (and gating that 
with
PGMAP_COMPOUND). But if there's no concerns with that, I can unilaterally 
enable it.

> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
> (We already removed that from the hmm version of this, was that wrong?
> Is this different?) Dan?
> 
> Also undo_dev_pagemap() is now out of date, we have unpin_user_pages()
> for that and no other error unwind touches ClearPageReferenced..
> 
/me nods. Yeap, I saw that too.

> Basic idea is good though!
> 
Cool, thanks!

Joao


Re: [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release()

2020-12-09 Thread Joao Martins



On 12/8/20 7:29 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:29:00PM +, Joao Martins wrote:
> 
>>  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, 
>> int dirty)
>>  {
>> +bool make_dirty = umem->writable && dirty;
>> +struct page **page_list = NULL;
>>  struct sg_page_iter sg_iter;
>> +unsigned long nr = 0;
>>  struct page *page;
>>  
>> +page_list = (struct page **) __get_free_page(GFP_KERNEL);
> 
> Gah, no, don't do it like this!
> 
> Instead something like:
> 
>   for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
> unpin_use_pages_range_dirty_lock(sg_page(sg), 
> sg->length/PAGE_SIZE,
>umem->writable && dirty);
> 
> And have the mm implementation split the contiguous range of pages into
> pairs of (compound head, ntails) with a bit of maths.
> 
Got it :)

I was trying to avoid another exported symbol.

Albeit, given your suggestion, it doesn't justify the efficiency/clarity lost.
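
As a sketch of the "bit of maths" (hypothetical helper; assumes the run of
pages starting at @page is physically contiguous):

	static unsigned long first_compound_ntails(struct page *page,
						   unsigned long npages)
	{
		struct page *head = compound_head(page);
		unsigned long remaining = compound_nr(head) - (page - head);

		/* How many of the first @npages fall within @page's compound page. */
		return min(npages, remaining);
	}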

Joao


Re: [External] [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2020-12-09 Thread Muchun Song
On Wed, Dec 9, 2020 at 1:32 AM Joao Martins  wrote:
>
> Hey,
>
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE.
>
> [0] 
> https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuc...@bytedance.com/
>

Oh, well. It looks like you agree with my optimization approach
and have fully understood it. Also, you are welcome to help me review that
series if you have time. :)

> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap 
> pages).
>
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
>
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total 
> memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total 
> memory)
>
> Along the way I've extended it past 'struct page' overhead *trying* to 
> address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even for the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
>
> So to summarize what the series does:
>
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or 
> a
> given @align order. The main difference though, is that contrary to the 
> hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it. IOW no
> freeing of pages of already initialized vmemmap like the case for hugetlbfs,
> which simplifies the logic (besides not being arch-specific). After these,
> there's quite visible region bootstrap of pmem memmap given that we would
> initialize fewer struct pages depending on the page size.
>
> NVDIMM namespace bootstrap improves from ~750ms to ~190ms/<=1ms on 
> emulated NVDIMMs
> with 2M and 1G respectively. The net gain in improvement is similarly 
> observed
> in proportion when running on actual NVDIMMs.
>
> Patch 6 - 8: Optimize grabbing/releasing a page refcount given that we
> are working with compound pages, i.e. we do 1 increment/decrement to the head
> page for a given set of N subpages, as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably, and unpin_user_pages() improves as well when passed a
> set of consecutive pages:
>
>before  after
> (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
>
> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
> user. For unpin_user_pages() we have an additional test to demonstrate the
> improvement.  The test performs MR reg/unreg continuously and measuring its
> rate for a given period. So essentially ib_mem_get and ib_mem_release being
> stress tested which at the end of day means: pin_user_pages_longterm() and
> unpin_user_pages() for a scatterlist:
>
> Before:
> 159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
> 466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>
> After:
> 305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
> 1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
>
> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
> is very experimental, and I imported most of follow_hugetlb_page(), except
> that we do the same trick as gup-fast. In doing the patch I feel this batching
> should live in follow_page_mask() and having that being changed to return a 
> set
> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. 
> This
> patch then brings the previous test of mr reg/unreg (above) on parity
> between device-dax and hugetlbfs.
>
> Some of the patches are a little fresh/WIP (especially patches 3 and 9) and we 
> are
> still running tests. Hence the RFC, asking for comments and general direction
> of the work before continuing.
>
> Patches apply on top of linux-next tag next-20201208 (commit a9e26cb5f261).
>
> Comments and suggestions very much appreciated!
>
> Thanks,
> Joao
>
> Joao Martins (9):
>   memremap: add ZONE_DEVICE 

Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps

2020-12-09 Thread David Hildenbrand
On 08.12.20 18:28, Joao Martins wrote:
> Hey,
> 
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE. 
> 
> [0] 
> https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuc...@bytedance.com/
> 
> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap 
> pages).
> 
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
> 
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total 
> memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total 
> memory)
> 

That's the dream :)
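
(Sanity-checking those numbers, assuming 64-byte struct pages and 4K vmemmap
pages: a 2M compound page spans 512 struct pages = 32KB = 8 vmemmap pages, so
keeping the first two and reusing the rest saves 6; a 1G compound page spans
262144 struct pages = 16MB = 4096 vmemmap pages, saving 4094.)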

> Along the way I've extended it past 'struct page' overhead *trying* to 
> address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even for the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
> 
> So to summarize what the series does:
> 
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or 
> a
> given @align order. The main difference though, is that contrary to the 
> hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it.

Yeah, I'd argue that this case is a lot easier to handle. When the buddy
is involved, things get more complicated.

-- 
Thanks,

David / dhildenb


Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas

2020-12-09 Thread Christoph Hellwig
On Tue, Dec 08, 2020 at 03:57:54PM -0400, Jason Gunthorpe wrote:
> What we've talked about is changing the calling convention across all
> of this to something like:
> 
> struct gup_output {
>struct page **cur;
>struct page **end;
>unsigned long vaddr;
>[..]
> }
> 
> And making the manipulator like you saw for GUP common:
> 
> gup_output_single_page()
> gup_output_pages()
> 
> Then putting this eveywhere. This is the pattern that we ended up with
> in hmm_range_fault, and it seems to be working quite well.
> 
> fast/slow should be much more symmetric in code than they are today,
> IMHO.. I think those differences mainly exist because it used to be
> siloed in arch code. Some of the differences might be bugs, we've seen
> that a few times at least..

something like this:

http://git.infradead.org/users/hch/misc.git/commitdiff/c3d019802dbde5a4cc4160e7ec8ccba479b19f97

from this old and not fully working series:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec