Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

2021-04-01 Thread Joao Martins
On 3/24/21 7:00 PM, Joao Martins wrote:
> On 3/24/21 5:45 PM, Dan Williams wrote:
>> On Thu, Mar 18, 2021 at 3:02 AM Joao Martins  
>> wrote:
>>> On 3/18/21 4:08 AM, Dan Williams wrote:
>>>> Now that device-dax and filesystem-dax are guaranteed to unmap all user
>>>> mappings of devmap / DAX pages before tearing down the 'struct page'
>>>> array, get_user_pages_fast() can rely on its traditional synchronization
>>>> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
>>>> device shutdown. Specifically the unmap guarantee ensures that gup-fast
>>>> either succeeds in taking a page reference (lock-less), or it detects a
>>>> need to fall back to the slow path where the device presence can be
>>>> revalidated with locks held.
>>>
>>> [...]
>>>
>>> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
>>> try_grab_page() for checking pgmap type to see if we are in a device-dax
>>> longterm pin?
>>
>> So, there is an effort to add a new pte bit p{m,u}d_special to disable
>> gup-fast for huge pages [1]. I'd like to investigate whether we could
>> use devmap + special as an encoding for "no longterm" and never
>> consult the pgmap in the gup-fast path.
>>
>> [1]: 
>> https://lore.kernel.org/linux-mm/a1fa7fa2-914b-366d-9902-e5b784e84...@shipmail.org/
>>
> 
> Oh, nice! That would be ideal indeed, as we would skip the pgmap lookup
> entirely.
> 
> I suppose device-dax would use pfn_t PFN_MAP while fs-dax memory device would
> set PFN_MAP | PFN_DEV (provided vmf_insert_pfn_{pmd,pud} calls mkspecial on
> PFN_DEV).
> 
> I haven't been following that thread, but for PMD/PUD special in vmf_* these 
> might be useful:
> 
> https://lore.kernel.org/linux-mm/20200110190313.17144-2-joao.m.mart...@oracle.com/
> https://lore.kernel.org/linux-mm/20200110190313.17144-4-joao.m.mart...@oracle.com/
> 

On second thought, maybe it doesn't need to be that complicated for {fs,dev}dax
if the added special bit is just a subcase of devmap pte/pmd/puds. See below the
scissors mark for a rough estimation of the changes (nothing formal/proper as it
isn't properly split). Running gup_test with devdax/fsdax, with and without
FOLL_LONGTERM, does the intended
(gup_test -m 16384 -r 10 -a -S -n 512 -w -f  [-F 0x1]).

Unless this is about using special PMD/PUD bits without page structs (thus
without devmap bits) as in the previous two links.
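
For illustration only: the patch below is truncated in this archive before the
mm/gup.c changes, so here is a rough sketch (not taken from the patch; the
helper name is made up) of the kind of check the gup-fast path could gain,
assuming pmd_special() on a devmap entry ends up meaning "fsdax, no long-term
pins":

static inline bool gup_fast_longterm_allowed(pmd_t pmd, unsigned int flags)
{
        /* devmap + special would encode a fsdax page: refuse FOLL_LONGTERM */
        if ((flags & FOLL_LONGTERM) && pmd_devmap(pmd) && pmd_special(pmd))
                return false;
        return true;
}

A false return would make gup-fast bail out, so the pin gets refused (or
retried) in the slow path with locks held.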

-- >8 --

Subject: mm/gup, nvdimm: allow FOLL_LONGTERM for device-dax gup-fast

Signed-off-by: Joao Martins 

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 47027796c2f9..8b5d68d89cde 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -439,11 +439,16 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_devmap(pmd)        pte_devmap(pmd_pte(pmd))
+#define pmd_special(pmd)   pte_special(pmd_pte(pmd))
 #endif
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 {
return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
 }
+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+   return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_SPECIAL)));
+}

 #define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd))
 #define __phys_to_pmd_val(phys)        __phys_to_pte_val(phys)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..45449ee86d4f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -259,13 +259,13 @@ static inline int pmd_large(pmd_t pte)
 /* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_large */
 static inline int pmd_trans_huge(pmd_t pmd)
 {
-   return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+   return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
 }

 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_trans_huge(pud_t pud)
 {
-   return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+   return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_SPECIAL)) == _PAGE_PSE;
 }
 #endif

@@ -297,6 +297,19 @@ static inline int pgd_devmap(pgd_t pgd)
 {
return 0;
 }
+
+static inline bool pmd_special(pmd_t pmd)
+{
+   return !!(pmd_flags(pmd) & _PAGE_SPECIAL);
+}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline bool pud_special(pud_t pud)
+{
+   return !!(pud_flags(pud) & _PAGE_SPECIAL);
+}
+#endif
+
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -452,6 +465,11 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
return pmd_set_flags(pmd, _PAGE_DEVMAP);
 }

+static inline pmd_t pmd_mkspecial(pmd_t pmd)
+{
+   return pmd_set_flags(pmd, _PAGE_SPECIAL);
+}

Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

2021-03-24 Thread Joao Martins
On 3/24/21 5:45 PM, Dan Williams wrote:
> On Thu, Mar 18, 2021 at 3:02 AM Joao Martins  
> wrote:
>> On 3/18/21 4:08 AM, Dan Williams wrote:
>>> Now that device-dax and filesystem-dax are guaranteed to unmap all user
>>> mappings of devmap / DAX pages before tearing down the 'struct page'
>>> array, get_user_pages_fast() can rely on its traditional synchronization
>>> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
>>> device shutdown. Specifically the unmap guarantee ensures that gup-fast
>>> either succeeds in taking a page reference (lock-less), or it detects a
>>> need to fall back to the slow path where the device presence can be
>>> revalidated with locks held.
>>
>> [...]
>>
>>> @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
>>> addr, unsigned long end,
>>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>>
>>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
>>> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>> +
>>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>>unsigned long end, unsigned int flags,
>>>struct page **pages, int *nr)
>>>  {
>>>   int nr_start = *nr;
>>> - struct dev_pagemap *pgmap = NULL;
>>>
>>>   do {
>>> - struct page *page = pfn_to_page(pfn);
>>> + struct page *page;
>>> +
>>> + /*
>>> +  * Typically pfn_to_page() on a devmap pfn is not safe
>>> +  * without holding a live reference on the hosting
>>> +  * pgmap. In the gup-fast path it is safe because any
>>> +  * races will be resolved by either gup-fast taking a
>>> +  * reference or the shutdown path unmapping the pte to
>>> +  * trigger gup-fast to fall back to the slow path.
>>> +  */
>>> + page = pfn_to_page(pfn);
>>>
>>> - pgmap = get_dev_pagemap(pfn, pgmap);
>>> - if (unlikely(!pgmap)) {
>>> - undo_dev_pagemap(nr, nr_start, flags, pages);
>>> - return 0;
>>> - }
>>>   SetPageReferenced(page);
>>>   pages[*nr] = page;
>>>   if (unlikely(!try_grab_page(page, flags))) {
>>
>> So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
>> try_grab_page() for checking pgmap type to see if we are in a device-dax
>> longterm pin?
> 
> So, there is an effort to add a new pte bit p{m,u}d_special to disable
> gup-fast for huge pages [1]. I'd like to investigate whether we could
> use devmap + special as an encoding for "no longterm" and never
> consult the pgmap in the gup-fast path.
> 
> [1]: 
> https://lore.kernel.org/linux-mm/a1fa7fa2-914b-366d-9902-e5b784e84...@shipmail.org/
>

Oh, nice! That would be ideal indeed, as we would skip the pgmap lookup
entirely.

I suppose device-dax would use pfn_t PFN_MAP while fs-dax memory device would
set PFN_MAP | PFN_DEV (provided vmf_insert_pfn_{pmd,pud} calls mkspecial on
PFN_DEV).

I haven't been following that thread, but for PMD/PUD special in vmf_* these 
might be useful:

https://lore.kernel.org/linux-mm/20200110190313.17144-2-joao.m.mart...@oracle.com/
https://lore.kernel.org/linux-mm/20200110190313.17144-4-joao.m.mart...@oracle.com/


Re: [PATCH 3/3] mm/devmap: Remove pgmap accounting in the get_user_pages_fast() path

2021-03-18 Thread Joao Martins
On 3/18/21 4:08 AM, Dan Williams wrote:
> Now that device-dax and filesystem-dax are guaranteed to unmap all user
> mappings of devmap / DAX pages before tearing down the 'struct page'
> array, get_user_pages_fast() can rely on its traditional synchronization
> method "validate_pte(); get_page(); revalidate_pte()" to catch races with
> device shutdown. Specifically the unmap guarantee ensures that gup-fast
> either succeeds in taking a page reference (lock-less), or it detects a
> need to fall back to the slow path where the device presence can be
> revalidated with locks held.

[...]

> @@ -2087,21 +2078,26 @@ static int gup_pte_range(pmd_t pmd, unsigned long 
> addr, unsigned long end,
>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>unsigned long end, unsigned int flags,
>struct page **pages, int *nr)
>  {
>   int nr_start = *nr;
> - struct dev_pagemap *pgmap = NULL;
>  
>   do {
> - struct page *page = pfn_to_page(pfn);
> + struct page *page;
> +
> + /*
> +  * Typically pfn_to_page() on a devmap pfn is not safe
> +  * without holding a live reference on the hosting
> +  * pgmap. In the gup-fast path it is safe because any
> +  * races will be resolved by either gup-fast taking a
> +  * reference or the shutdown path unmapping the pte to
> +  * trigger gup-fast to fall back to the slow path.
> +  */
> + page = pfn_to_page(pfn);
>  
> - pgmap = get_dev_pagemap(pfn, pgmap);
> - if (unlikely(!pgmap)) {
> - undo_dev_pagemap(nr, nr_start, flags, pages);
> - return 0;
> - }
>   SetPageReferenced(page);
>   pages[*nr] = page;
>   if (unlikely(!try_grab_page(page, flags))) {

So for allowing FOLL_LONGTERM[0] would it be OK if we used page->pgmap after
try_grab_page() for checking pgmap type to see if we are in a device-dax
longterm pin?

Joao

[0] 
https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10f...@oracle.com/


Re: [PATCH v4 0/4] mm/gup: page unpining improvements

2021-02-18 Thread Joao Martins



On 2/18/21 3:33 PM, Joao Martins wrote:
> On 2/18/21 7:24 AM, Christoph Hellwig wrote:
>> On Fri, Feb 12, 2021 at 01:08:39PM +0000, Joao Martins wrote:
>>> Hey,
>>>
>>> This series improves page unpinning, with an eye on improving MR
>>> deregistration for big swaths of memory (which is bound by the page
>>> unpining), particularly:
>>
>> Can you also take a look at the (bdev and iomap) direct I/O code to
>> make use of this?  It should really help there, with a much wider use
>> than RDMA.
>>
> Perhaps by bdev and iomap direct I/O using this, you were thinking of
> replacing bio_release_pages(), which operates on bvecs, and hence releasing
> the contiguous pages in a bvec at once? E.g. something like going from this:
> 
> bio_for_each_segment_all(bvec, bio, iter_all) {
>         if (mark_dirty && !PageCompound(bvec->bv_page))
>                 set_page_dirty_lock(bvec->bv_page);
>         put_page(bvec->bv_page);
> }
> 
> (...) to this instead:
> 
> bio_for_each_bvec_all(bvec, bio, i)
>         unpin_user_page_range_dirty_lock(bvec->bv_page,
>                         DIV_ROUND_UP(bvec->bv_len, PAGE_SIZE),
>                         mark_dirty && !PageCompound(bvec->bv_page));
> 

Quick correction: it should be put_user_page_range_dirty_lock(), given that
unpin is specific to FOLL_PIN users.
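
Purely as an illustration of the shape such a conversion could take (assuming
the bio path had first been switched over to pin_user_pages()/FOLL_PIN so that
the unpin API applies; bio_release_pages_batched() is a made-up name and the
real bio_release_pages() has more cases to handle):

static void bio_release_pages_batched(struct bio *bio, bool mark_dirty)
{
        struct bio_vec *bvec;
        int i;

        if (bio_flagged(bio, BIO_NO_PAGE_REF))
                return;

        /* one call per bvec instead of one put per page */
        bio_for_each_bvec_all(bvec, bio, i)
                unpin_user_page_range_dirty_lock(bvec->bv_page,
                                DIV_ROUND_UP(bvec->bv_len, PAGE_SIZE),
                                mark_dirty && !PageCompound(bvec->bv_page));
}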


Re: [PATCH v4 0/4] mm/gup: page unpining improvements

2021-02-18 Thread Joao Martins
On 2/18/21 7:24 AM, Christoph Hellwig wrote:
> On Fri, Feb 12, 2021 at 01:08:39PM +0000, Joao Martins wrote:
>> Hey,
>>
>> This series improves page unpinning, with an eye on improving MR
>> deregistration for big swaths of memory (which is bound by the page
>> unpining), particularly:
> 
> Can you also take a look at the (bdev and iomap) direct I/O code to
> make use of this?  It should really help there, with a much wider use
> than RDMA.
> 
Perhaps by bdev and iomap direct I/O using this, you were thinking of replacing
bio_release_pages(), which operates on bvecs, and hence releasing the contiguous
pages in a bvec at once? E.g. something like going from this:

bio_for_each_segment_all(bvec, bio, iter_all) {
        if (mark_dirty && !PageCompound(bvec->bv_page))
                set_page_dirty_lock(bvec->bv_page);
        put_page(bvec->bv_page);
}

(...) to this instead:

bio_for_each_bvec_all(bvec, bio, i)
        unpin_user_page_range_dirty_lock(bvec->bv_page,
                        DIV_ROUND_UP(bvec->bv_len, PAGE_SIZE),
                        mark_dirty && !PageCompound(bvec->bv_page));


[PATCH v4 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-12 Thread Joao Martins
Add an unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

To that end, define another iterator, for_each_compound_range(),
that operates on page ranges as opposed to a page array.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages, as is needed
today with unpin_user_pages().

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: John Hubbard 
---
 include/linux/mm.h |  2 ++
 mm/gup.c   | 62 ++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 89fca443e6f1..43f59e4eb760 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1255,6 +1255,8 @@ static inline void put_page(struct page *page)
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty);
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
 /**
diff --git a/mm/gup.c b/mm/gup.c
index 384571be2c66..6b6d861727cf 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -213,6 +213,32 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void compound_range_next(unsigned long i, unsigned long npages,
+  struct page **list, struct page **head,
+  unsigned int *ntails)
+{
+   struct page *next, *page;
+   unsigned int nr = 1;
+
+   if (i >= npages)
+   return;
+
+   next = *list + i;
+   page = compound_head(next);
+   if (PageCompound(page) && compound_order(page) >= 1)
+   nr = min_t(unsigned int,
+  page + compound_nr(page) - next, npages - i);
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_range_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_range_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 static inline void compound_next(unsigned long i, unsigned long npages,
 struct page **list, struct page **head,
 unsigned int *ntails)
@@ -301,6 +327,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 
+/**
+ * unpin_user_page_range_dirty_lock() - release and optionally dirty
+ * gup-pinned page range
+ *
+ * @page:  the starting page of a range maybe marked dirty, and definitely released.
+ * @npages: number of consecutive pages to release.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page range" refers to a range of pages that has had one of the
+ * pin_user_pages() variants called on that page.
+ *
+ * For the page ranges defined by [page .. page+npages], make that range (or
+ * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
+ * page range was previously listed as clean.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), unpin_user_page().
+ *
+ */
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, &page, npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
+   }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+
 /**
  * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
-- 
2.17.1



[PATCH v4 4/4] RDMA/umem: batch page unpin in __ib_umem_release()

2021-02-12 Thread Joao Martins
Use the newly added unpin_user_page_range_dirty_lock()
for more quickly unpinning a consecutive range of pages
represented as compound pages. This will also calculate the
number of pages to unpin (for the tail pages which match the
head page) and thus batch the refcount update.

Running a test program which calls mr reg/unreg on a 1G region
and measures the cost of both operations together (in a guest using rxe)
with THP and hugetlbfs:

Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2688 rounds in 5.002 sec: 1860.786 usec / round
32517 rounds in 60.001 sec: 1845.225 usec / round

Signed-off-by: Joao Martins 
Acked-by: Jason Gunthorpe 
---
 drivers/infiniband/core/umem.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..9b607013e2a2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
-   struct sg_page_iter sg_iter;
-   struct page *page;
+   bool make_dirty = umem->writable && dirty;
+   struct scatterlist *sg;
+   unsigned int i;
 
if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
DMA_BIDIRECTIONAL);
 
-   for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
-   page = sg_page_iter_page(&sg_iter);
-   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
-   }
+   for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+   unpin_user_page_range_dirty_lock(sg_page(sg),
+   DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
 sg_free_table(&umem->sg_head);
 }
-- 
2.17.1



[PATCH v4 0/4] mm/gup: page unpining improvements

2021-02-12 Thread Joao Martins
Hey,

This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpinning), particularly:

 1) Decrement the head page refcount by @ntails, thus greatly reducing the
number of atomic operations per compound page. This is done by comparing
individual tail pages' heads, counting the number of consecutive tails that
share the same head, and updating the head page refcount based on that count.
This should give a visible improvement in all page (un)pinners which use
compound pages.

 2) Introduce a new API for unpinning page ranges (to avoid the trick in the
previous item and instead be based on math), and use that in RDMA
ib_mem_release (used for mr deregistration).

Performance improvements: unpin_user_pages() for hugetlbfs and THP improves ~3x
(through gup_test) and RDMA MR dereg improves ~4.5x with the new API.
See patches 2 and 4 for those.
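
For readers who don't want to dig into the diffs, a minimal sketch of the
batching in item 1 (the helper name here is made up; the series itself uses
put_compound_head(), which does the FOLL_PIN-aware equivalent of this):

static void put_head_ntails(struct page *head, unsigned int ntails)
{
        /* apply all the tail references in one write to the head page */
        if (ntails > 1)
                page_ref_sub(head, ntails - 1);
        put_page(head);
}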

These patches used to be in this RFC:

https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.mart...@oracle.com/,
"[PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps"

But were moved separately at the suggestion of Jason, given it's applicable
to page unpinning in general. Thanks Jason and John for all the comments.

These patches apply on top of linux-next tag next-20210202.

Suggestions, comments, welcomed as usual.

Joao

Changelog since,

v3 -> v4:
 * Add the Reviewed-by/Acked-by by Jason on all patches.
 * Add the Reviewed-by by John on the third patch.
 * Fix the wrong mention of get_user_pages() in
   unpin_user_page_range_dirty_lock() docs (third patch).

v2 -> v3:
 * Handle compound_order = 1 as well and move subtraction to min_t()
   on patch 3.
 * Remove stale paragraph on patch 3 commit description (John)
 * Rename range_next to compound_range_next() (John)
 * Add John's Reviewed-by on patch 1 (John)
 * Clean and rework compound_next() on patch 1 (John)

v1 -> v2:
 * Prefix macro arguments with __ to avoid collisions with other defines (John)
 * Remove count_tails() and have the logic for the two iterators split into
   range_next() and compound_next() (John)
 * Remove the @range boolean from the iterator helpers (John)
 * Add docs on unpin_user_page_range_dirty_lock() on patch 3 (John)
 * Use unsigned for @i on patch 4 (John)
 * Fix subject line of patch 4 (John)
 * Add John's Reviewed-by on the second patch
 * Fix incorrect use of @nmap and use @sg_nents instead (Jason)

RFC -> v1:
 * Introduce a head/ntails iterator and change unpin_*_pages() to use that,
   inspired by folio iterators (Jason)
 * Introduce an alternative unpin_user_page_range_dirty_lock() to unpin based
   on a consecutive page range without having to walk page arrays (Jason)
 * Use unsigned for number of tails (Jason)


Joao Martins (4):
  mm/gup: add compound page list iterator
  mm/gup: decrement head page once for group of subpages
  mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  RDMA/umem: batch page unpin in __ib_umem_release()

 drivers/infiniband/core/umem.c |  12 ++--
 include/linux/mm.h |   2 +
 mm/gup.c   | 117 -
 3 files changed, 107 insertions(+), 24 deletions(-)

-- 
2.17.1



[PATCH v4 1/4] mm/gup: add compound page list iterator

2021-02-12 Thread Joao Martins
Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current one. This is going to be used by the
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
Reviewed-by: John Hubbard 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..1a709eae2bfd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -213,6 +213,32 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   struct page *page;
+   unsigned int nr;
+
+   if (i >= npages)
+   return;
+
+   page = compound_head(list[i]);
+   for (nr = i + 1; nr < npages; nr++) {
+   if (compound_head(list[nr]) != page)
+   break;
+   }
+
+   *head = page;
+   *ntails = nr - i;
+}
+
+#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
-- 
2.17.1



[PATCH v4 2/4] mm/gup: decrement head page once for group of subpages

2021-02-12 Thread Joao Martins
Rather than decrementing the head page refcount one by one, we
walk the page array and check which pages belong to the same
compound_head. Later on we decrement the calculated amount
of references in a single write to the head page. To that
end, switch to for_each_compound_head(), which does most of the work.

set_page_dirty() needs no adjustment as it's a nop for
non-dirty head pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and
hugetlbfs:

- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Signed-off-by: Joao Martins 
Reviewed-by: John Hubbard 
Reviewed-by: Jason Gunthorpe 
---
 mm/gup.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1a709eae2bfd..384571be2c66 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -265,20 +265,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 bool make_dirty)
 {
unsigned long index;
-
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
+   struct page *head;
+   unsigned int ntails;
 
if (!make_dirty) {
unpin_user_pages(pages, npages);
return;
}
 
-   for (index = 0; index < npages; index++) {
-   struct page *page = compound_head(pages[index]);
+   for_each_compound_head(index, pages, npages, head, ntails) {
/*
 * Checking PageDirty at this point may race with
 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -299,9 +294,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 * written back, so it gets written back again in the
 * next writeback cycle. This is harmless.
 */
-   if (!PageDirty(page))
-   set_page_dirty_lock(page);
-   unpin_user_page(page);
+   if (!PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
}
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -318,6 +313,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 void unpin_user_pages(struct page **pages, unsigned long npages)
 {
unsigned long index;
+   struct page *head;
+   unsigned int ntails;
 
/*
 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -326,13 +323,9 @@ void unpin_user_pages(struct page **pages, unsigned long 
npages)
 */
if (WARN_ON(IS_ERR_VALUE(npages)))
return;
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
-   for (index = 0; index < npages; index++)
-   unpin_user_page(pages[index]);
+
+   for_each_compound_head(index, pages, npages, head, ntails)
+   put_compound_head(head, ntails, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_pages);
 
-- 
2.17.1



Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-11 Thread Joao Martins



On 2/10/21 11:19 PM, John Hubbard wrote:
> On 2/5/21 12:41 PM, Joao Martins wrote:
>> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
>> and how many consecutive pages we want to unpin and optionally dirty.
>>
>> To that end, define another iterator for_each_compound_range()
>> that operates in page ranges as opposed to page array.
>>
>> For users (like RDMA mr_dereg) where each sg represents a
>> contiguous set of pages, we're able to more efficiently unpin
>> pages without having to supply an array of pages much of what
>> happens today with unpin_user_pages().
>>
>> Suggested-by: Jason Gunthorpe 
>> Signed-off-by: Joao Martins 
>> ---
>>   include/linux/mm.h |  2 ++
>>   mm/gup.c   | 62 ++
>>   2 files changed, 64 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a608feb0d42e..b76063f7f18a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
>>   void unpin_user_page(struct page *page);
>>   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   bool make_dirty);
>> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long 
>> npages,
>> +  bool make_dirty);
>>   void unpin_user_pages(struct page **pages, unsigned long npages);
>>   
>>   /**
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 467a11df216d..938964d31494 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
>>   }
>>   EXPORT_SYMBOL(unpin_user_page);
>>   
>> +static inline void compound_range_next(unsigned long i, unsigned long 
>> npages,
>> +   struct page **list, struct page **head,
>> +   unsigned int *ntails)
> 
> Yes, the new names look good, and I have failed to find any logic errors, so:
> 
> Reviewed-by: John Hubbard 
> 

Thanks again for all the input!


Re: [PATCH v3 1/4] mm/gup: add compound page list iterator

2021-02-11 Thread Joao Martins
On 2/10/21 11:20 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:24PM +0000, Joao Martins wrote:
>> Add an helper that iterates over head pages in a list of pages. It
>> essentially counts the tails until the next page to process has a
>> different head that the current. This is going to be used by
>> unpin_user_pages() family of functions, to batch the head page refcount
>> updates once for all passed consecutive tail pages.
>>
>> Suggested-by: Jason Gunthorpe 
>> Signed-off-by: Joao Martins 
>> Reviewed-by: John Hubbard 
>> ---
>>  mm/gup.c | 26 ++
>>  1 file changed, 26 insertions(+)
> 
> Reviewed-by: Jason Gunthorpe 
> 
Thanks!

> This can be used for check_and_migrate_cma_pages() too (there is a
> series around to change this logic though, not sure if it is landed
> yet)

It got unqueued AFAIUI.

It makes sense for most users today except hugetlb pages, which are also
the fastest page pinner today. And unilaterally using this iterator makes
all page types pay the added cost. So either keep the current loop with
the exception for PageHuge() head pages, or do it correctly with that
split logic we were talking about on the other thread.

Joao


Re: [PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release()

2021-02-11 Thread Joao Martins
On 2/10/21 11:17 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:27PM +0000, Joao Martins wrote:
>> Use the newly added unpin_user_page_range_dirty_lock()
>> for more quickly unpinning a consecutive range of pages
>> represented as compound pages. This will also calculate
>> number of pages to unpin (for the tail pages which matching
>> head page) and thus batch the refcount update.
>>
>> Running a test program which calls mr reg/unreg on a 1G in size
>> and measures cost of both operations together (in a guest using rxe)
>> with THP and hugetlbfs:
>>
>> Before:
>> 590 rounds in 5.003 sec: 8480.335 usec / round
>> 6898 rounds in 60.001 sec: 8698.367 usec / round
>>
>> After:
>> 2688 rounds in 5.002 sec: 1860.786 usec / round
>> 32517 rounds in 60.001 sec: 1845.225 usec / round
>>
>> Signed-off-by: Joao Martins 
>> ---
>>  drivers/infiniband/core/umem.c | 12 ++--
>>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> Would best for this to go through Andrew's tree
> 
> Acked-by: Jason Gunthorpe 
> 
> 4x improvement is pretty good!
> 

It would only be half of that improvement if it wasn't for your
unpin_user_page_range_dirty_lock() suggestion, so thanks for all
the input :)

Joao


Re: [PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-11 Thread Joao Martins



On 2/10/21 11:15 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:26PM +0000, Joao Martins wrote:
>> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
>> and how many consecutive pages we want to unpin and optionally dirty.
>>
>> To that end, define another iterator for_each_compound_range()
>> that operates in page ranges as opposed to page array.
>>
>> For users (like RDMA mr_dereg) where each sg represents a
>> contiguous set of pages, we're able to more efficiently unpin
>> pages without having to supply an array of pages much of what
>> happens today with unpin_user_pages().
>>
>> Suggested-by: Jason Gunthorpe 
>> Signed-off-by: Joao Martins 
>> ---
>>  include/linux/mm.h |  2 ++
>>  mm/gup.c   | 62 ++
>>  2 files changed, 64 insertions(+)
> 
> Reviewed-by: Jason Gunthorpe 
> 
Thanks!

>> +/**
>> + * unpin_user_page_range_dirty_lock() - release and optionally dirty
>> + * gup-pinned page range
>> + *
>> + * @page:  the starting page of a range maybe marked dirty, and definitely 
>> released.
>> + * @npages: number of consecutive pages to release.
>> + * @make_dirty: whether to mark the pages dirty
>> + *
>> + * "gup-pinned page range" refers to a range of pages that has had one of 
>> the
>> + * get_user_pages() variants called on that page.
> 
> Tidy this language though, this only works with the pin_user_pages
> variants because it hardwires FOLL_PIN
> 
Yes, I can respin a v4 with that adjustment.


Re: [PATCH v3 2/4] mm/gup: decrement head page once for group of subpages

2021-02-11 Thread Joao Martins



On 2/10/21 9:02 PM, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 08:41:25PM +0000, Joao Martins wrote:
>> Rather than decrementing the head page refcount one by one, we
>> walk the page array and checking which belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page. To that
>> end switch to for_each_compound_head() does most of the work.
>>
>> set_page_dirty() needs no adjustment as it's a nop for
>> non-dirty head pages and it doesn't operate on tail pages.
>>
>> This considerably improves unpinning of pages with THP and
>> hugetlbfs:
>>
>> - THP
>> gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
>> PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us
>>
>> - 16G with 1G huge page size
>> gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
>> PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us
>>
>> Signed-off-by: Joao Martins 
>> Reviewed-by: John Hubbard 
>> ---
>>  mm/gup.c | 29 +++--
>>  1 file changed, 11 insertions(+), 18 deletions(-)
> 
> Looks fine
> 
> Reviewed-by: Jason Gunthorpe 
> 
Thanks!

> I was wondering why this only touches the FOLL_PIN path, 

That's just because I was looking at pinning mostly.

> it would make
> sense to also use this same logic for release_pages()

Yeah, indeed -- any place tearing down potentially consecutive sets of pages
is a candidate.
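
A rough sketch of what that could look like for release_pages(), ignoring the
LRU/zone bookkeeping the real function does (the function name is made up and
this is not part of the series):

static void release_pages_batched(struct page **pages, int nr)
{
        unsigned long i, npages = nr;
        struct page *head;
        unsigned int ntails;

        for_each_compound_head(i, pages, npages, head, ntails) {
                if (is_huge_zero_page(head))
                        continue;
                /* one refcount update for the whole group of tails */
                if (ntails > 1)
                        page_ref_sub(head, ntails - 1);
                put_page(head);
        }
}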

> 
> for (i = 0; i < nr; i++) {
>         struct page *page = pages[i];
>         page = compound_head(page);
>         if (is_huge_zero_page(page))
>                 continue;
> 
> Jason
> 


Re: [PATCH] dax: fix default return code of range_parse()

2021-02-10 Thread Joao Martins
On 1/26/21 2:13 AM, Shiyang Ruan wrote:
> The return value of range_parse() indicates the size when it is
> positive.  The error code should be negative.
> 
> Signed-off-by: Shiyang Ruan 

Reviewed-by: Joao Martins 

Although, FWIW, there was another patch exactly like this a couple of
months ago, though it didn't get picked up for some reason:

https://lore.kernel.org/linux-nvdimm/20201026110425.136629-1-zhangqilo...@huawei.com/

> ---
>  drivers/dax/bus.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 737b207c9e30..3003558c1a8b 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1038,7 +1038,7 @@ static ssize_t range_parse(const char *opt, size_t len, 
> struct range *range)
>  {
>   unsigned long long addr = 0;
>   char *start, *end, *str;
> - ssize_t rc = EINVAL;
> + ssize_t rc = -EINVAL;
>  
>   str = kstrdup(opt, GFP_KERNEL);
>   if (!str)
> 


Re: [PATCH 0/2] KVM: do not assume PTE is writable after follow_pfn

2021-02-09 Thread Joao Martins
On 2/9/21 8:19 AM, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 07:26:25PM -0400, Jason Gunthorpe wrote:
>>>> page_mkclean() has some technique to make the notifier have the right
>>>> size without becoming entangled in the PTL locks..
>>>
>>> Right.  I guess it's because dax doesn't have "struct page*" on the
>>> back, so it
>>
>> It doesn't? I thought DAX cases did?
> 
> File system DAX has a struct page, device DAX does not.  Which means
> everything using iomap should have a page available, but i'm adding
> Dan as he should know the details :)
> 
Both fsdax and device-dax have struct page. It's the pmem block device
that doesn't.


Re: [PATCH] KVM: x86/xen: Use hva_t for holding hypercall page address

2021-02-08 Thread Joao Martins



On 2/8/21 8:15 PM, Sean Christopherson wrote:
> Use hva_t, a.k.a. unsigned long, for the local variable that holds the
> hypercall page address.  On 32-bit KVM, gcc complains about using a u64
> due to the implicit cast from a 64-bit value to a 32-bit pointer.
> 
>   arch/x86/kvm/xen.c: In function ‘kvm_xen_write_hypercall_page’:
>   arch/x86/kvm/xen.c:300:22: error: cast to pointer from integer of
>  different size [-Werror=int-to-pointer-cast]
>   300 |   page = memdup_user((u8 __user *)blob_addr, PAGE_SIZE);
> 
> Cc: Joao Martins 
> Cc: David Woodhouse 
> Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled")
> Signed-off-by: Sean Christopherson 

Reviewed-by: Joao Martins 

> ---
>  arch/x86/kvm/xen.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 2cee0376455c..deda1ba8c18a 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -286,8 +286,12 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, 
> u64 data)
>   return 1;
>   }
>   } else {
> - u64 blob_addr = lm ? kvm->arch.xen_hvm_config.blob_addr_64
> -: kvm->arch.xen_hvm_config.blob_addr_32;
> + /*
> +  * Note, truncation is a non-issue as 'lm' is guaranteed to be
> +  * false for a 32-bit kernel, i.e. when hva_t is only 4 bytes.
> +  */
> + hva_t blob_addr = lm ? kvm->arch.xen_hvm_config.blob_addr_64
> +  : kvm->arch.xen_hvm_config.blob_addr_32;
>   u8 blob_size = lm ? kvm->arch.xen_hvm_config.blob_size_64
> : kvm->arch.xen_hvm_config.blob_size_32;
>   u8 *page;
> 


Re: [PATCH v14 0/8] Free some vmemmap pages of HugeTLB page

2021-02-05 Thread Joao Martins
On 2/4/21 3:50 AM, Muchun Song wrote:
> Hi all,
> 

[...]

> When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> vmemmap pages and restore the previous mapping relationship.
> 
> Apart from 2MB HugeTLB page, we also have 1GB HugeTLB page. It is similar
> to the 2MB HugeTLB page. We also can use this approach to free the vmemmap
> pages.
> 
> In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a
> very substantial gain. On our server, run some SPDK/QEMU applications which
> will use 1024GB hugetlbpage. With this feature enabled, we can save ~16GB
> (1G hugepage)/~12GB (2MB hugepage) memory.
> 
> Because there are vmemmap page tables reconstruction on the freeing/allocating
> path, it increases some overhead. Here are some overhead analysis.

[...]

> Although the overhead has increased, the overhead is not significant. Like 
> Mike
> said, "However, remember that the majority of use cases create hugetlb pages 
> at
> or shortly after boot time and add them to the pool. So, additional overhead 
> is
> at pool creation time. There is no change to 'normal run time' operations of
> getting a page from or returning a page to the pool (think page fault/unmap)".
> 

Despite the overhead, and in addition to the memory gains from this series,
there's an additional benefit of your vmemmap page reuse trick that isn't
mentioned here: page (un)pinners will see an improvement, presumably because
there are fewer memmap pages and thus the tail/head pages stay in cache more
often.
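
To make that concrete, this is the per-page work an unpinner ends up doing in a
tight loop (a sketch mirroring what unpin_user_pages() did before the batching
series, not new code):

static void unpin_example(struct page **pages, unsigned long npages)
{
        unsigned long i;

        for (i = 0; i < npages; i++) {
                /*
                 * unpin_user_page() does compound_head(pages[i]) internally,
                 * which reads page->compound_head from the tail page's
                 * struct page in the vmemmap; the fewer distinct vmemmap
                 * pages back a huge page, the more of those reads hit cache.
                 */
                unpin_user_page(pages[i]);
        }
}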

Out of the box I saw (when comparing linux-next against linux-next + this
series) with gup_test and pinning a 16G hugetlb file (with 1G pages):

get_user_pages(): ~32k -> ~9k
unpin_user_pages(): ~75k -> ~70k

Usually any tight loop fetching compound_head(), or reading tail page data
(e.g. compound_head), benefits a lot. There are some unpinning inefficiencies
I am fixing[0], but with those added in it shows even more:

unpin_user_pages(): ~27k -> ~3.8k

FWIW, I was also seeing that with devdax and the equivalent ZONE_DEVICE vmemmap
page reuse series[1], but it was mixed with other numbers.

Anyways, JFYI :)

Joao

[0] 
https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.mart...@oracle.com/
[1] 
https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.mart...@oracle.com/


[PATCH v3 1/4] mm/gup: add compound page list iterator

2021-02-05 Thread Joao Martins
Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current one. This is going to be used by the
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
Reviewed-by: John Hubbard 
---
 mm/gup.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..8defe4f670d5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   struct page *page;
+   unsigned int nr;
+
+   if (i >= npages)
+   return;
+
+   page = compound_head(list[i]);
+   for (nr = i + 1; nr < npages; nr++) {
+   if (compound_head(list[nr]) != page)
+   break;
+   }
+
+   *head = page;
+   *ntails = nr - i;
+}
+
+#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
-- 
2.17.1



[PATCH v3 2/4] mm/gup: decrement head page once for group of subpages

2021-02-05 Thread Joao Martins
Rather than decrementing the head page refcount one by one, we
walk the page array and check which pages belong to the same
compound_head. Later on we decrement the calculated amount
of references in a single write to the head page. To that
end, switch to for_each_compound_head(), which does most of the work.

set_page_dirty() needs no adjustment as it's a nop for
non-dirty head pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and
hugetlbfs:

- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Signed-off-by: Joao Martins 
Reviewed-by: John Hubbard 
---
 mm/gup.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8defe4f670d5..467a11df216d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -267,20 +267,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 bool make_dirty)
 {
unsigned long index;
-
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
+   struct page *head;
+   unsigned int ntails;
 
if (!make_dirty) {
unpin_user_pages(pages, npages);
return;
}
 
-   for (index = 0; index < npages; index++) {
-   struct page *page = compound_head(pages[index]);
+   for_each_compound_head(index, pages, npages, head, ntails) {
/*
 * Checking PageDirty at this point may race with
 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -301,9 +296,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 * written back, so it gets written back again in the
 * next writeback cycle. This is harmless.
 */
-   if (!PageDirty(page))
-   set_page_dirty_lock(page);
-   unpin_user_page(page);
+   if (!PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
}
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -320,6 +315,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 void unpin_user_pages(struct page **pages, unsigned long npages)
 {
unsigned long index;
+   struct page *head;
+   unsigned int ntails;
 
/*
 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -328,13 +325,9 @@ void unpin_user_pages(struct page **pages, unsigned long 
npages)
 */
if (WARN_ON(IS_ERR_VALUE(npages)))
return;
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
-   for (index = 0; index < npages; index++)
-   unpin_user_page(pages[index]);
+
+   for_each_compound_head(index, pages, npages, head, ntails)
+   put_compound_head(head, ntails, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_pages);
 
-- 
2.17.1



[PATCH v3 4/4] RDMA/umem: batch page unpin in __ib_umem_release()

2021-02-05 Thread Joao Martins
Use the newly added unpin_user_page_range_dirty_lock()
for more quickly unpinning a consecutive range of pages
represented as compound pages. This will also calculate the
number of pages to unpin (for the tail pages which match the
head page) and thus batch the refcount update.

Running a test program which calls mr reg/unreg on a 1G region
and measures the cost of both operations together (in a guest using rxe)
with THP and hugetlbfs:

Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2688 rounds in 5.002 sec: 1860.786 usec / round
32517 rounds in 60.001 sec: 1845.225 usec / round

Signed-off-by: Joao Martins 
---
 drivers/infiniband/core/umem.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..9b607013e2a2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
-   struct sg_page_iter sg_iter;
-   struct page *page;
+   bool make_dirty = umem->writable && dirty;
+   struct scatterlist *sg;
+   unsigned int i;
 
if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
DMA_BIDIRECTIONAL);
 
-   for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
-   page = sg_page_iter_page(&sg_iter);
-   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
-   }
+   for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+   unpin_user_page_range_dirty_lock(sg_page(sg),
+   DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
 sg_free_table(&umem->sg_head);
 }
-- 
2.17.1



[PATCH v3 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-05 Thread Joao Martins
Add an unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

To that end, define another iterator, for_each_compound_range(),
that operates on page ranges as opposed to a page array.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages, as is needed
today with unpin_user_pages().

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
 include/linux/mm.h |  2 ++
 mm/gup.c   | 62 ++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a608feb0d42e..b76063f7f18a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty);
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
 /**
diff --git a/mm/gup.c b/mm/gup.c
index 467a11df216d..938964d31494 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void compound_range_next(unsigned long i, unsigned long npages,
+  struct page **list, struct page **head,
+  unsigned int *ntails)
+{
+   struct page *next, *page;
+   unsigned int nr = 1;
+
+   if (i >= npages)
+   return;
+
+   next = *list + i;
+   page = compound_head(next);
+   if (PageCompound(page) && compound_order(page) >= 1)
+   nr = min_t(unsigned int,
+  page + compound_nr(page) - next, npages - i);
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_range_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_range_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 static inline void compound_next(unsigned long i, unsigned long npages,
 struct page **list, struct page **head,
 unsigned int *ntails)
@@ -303,6 +329,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 
+/**
+ * unpin_user_page_range_dirty_lock() - release and optionally dirty
+ * gup-pinned page range
+ *
+ * @page:  the starting page of a range maybe marked dirty, and definitely released.
+ * @npages: number of consecutive pages to release.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page range" refers to a range of pages that has had one of the
+ * get_user_pages() variants called on that page.
+ *
+ * For the page ranges defined by [page .. page+npages], make that range (or
+ * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
+ * page range was previously listed as clean.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), unpin_user_page().
+ *
+ */
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, &page, npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
+   }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+
 /**
  * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
-- 
2.17.1



[PATCH v3 0/4] mm/gup: page unpining improvements

2021-02-05 Thread Joao Martins
Hey,

This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpinning), particularly:

 1) Decrement the head page refcount by @ntails, thus greatly reducing the
number of atomic operations per compound page. This is done by comparing
individual tail pages' heads, counting the number of consecutive tails that
share the same head, and updating the head page refcount based on that count.
This should give a visible improvement in all page (un)pinners which use
compound pages.

 2) Introduce a new API for unpinning page ranges (to avoid the trick in the
previous item and instead be based on math), and use that in RDMA
ib_mem_release (used for mr deregistration).

Performance improvements: unpin_user_pages() for hugetlbfs and THP improves ~3x
(through gup_test) and RDMA MR dereg improves ~4.5x with the new API.
See patches 2 and 4 for those.

These patches used to be in this RFC:

https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.mart...@oracle.com/,
"[PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps"

But were moved separately at the suggestion of Jason, given it's applicable
to page unpinning in general. Thanks for all the comments so far.

These patches apply on top of linux-next tag next-20210202.

Suggestions, comments, welcomed as usual.

Joao

Changelog since,

v2 -> v3:
 * Handle compound_order = 1 as well and move subtraction to min_t()
   on patch 3.
 * Remove stale paragraph on patch 3 commit description (John)
 * Rename range_next to compound_range_next() (John)
 * Add John's Reviewed-by on patch 1 (John)
 * Clean and rework compound_next() on patch 1 (John)

v1 -> v2:
 * Prefix macro arguments with __ to avoid collisions with other defines (John)
 * Remove count_tails() and have the logic for the two iterators split into
   range_next() and compound_next() (John)
 * Remove the @range boolean from the iterator helpers (John)
 * Add docs on unpin_user_page_range_dirty_lock() on patch 3 (John)
 * Use unsigned for @i on patch 4 (John)
 * Fix subject line of patch 4 (John)
 * Add John's Reviewed-by on the second patch
 * Fix incorrect use of @nmap and use @sg_nents instead (Jason)

RFC -> v1:
 * Introduce a head/ntails iterator and change unpin_*_pages() to use that,
   inspired by folio iterators (Jason)
 * Introduce an alternative unpin_user_page_range_dirty_lock() to unpin based
   on a consecutive page range without having to walk page arrays (Jason)
 * Use unsigned for number of tails (Jason)

Joao Martins (4):
  mm/gup: add compound page list iterator
  mm/gup: decrement head page once for group of subpages
  mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  RDMA/umem: batch page unpin in __ib_umem_release()

 drivers/infiniband/core/umem.c |  12 ++--
 include/linux/mm.h |   2 +
 mm/gup.c   | 117 -
 3 files changed, 107 insertions(+), 24 deletions(-)

-- 
2.17.1



Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()

2021-02-05 Thread Joao Martins



On 2/4/21 8:00 PM, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 04:15:53PM -0800, John Hubbard wrote:
>>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>>> index 2dde99a9ba07..ea4ebb3261d9 100644
>>> +++ b/drivers/infiniband/core/umem.c
>>> @@ -47,17 +47,17 @@
>>>   static void __ib_umem_release(struct ib_device *dev, struct ib_umem 
>>> *umem, int dirty)
>>>   {
>>> -   struct sg_page_iter sg_iter;
>>> -   struct page *page;
>>> +   bool make_dirty = umem->writable && dirty;
>>> +   struct scatterlist *sg;
>>> +   int i;
>>
>> Maybe unsigned int is better, so as to perfectly match the 
>> scatterlist.length.
> 
> Yes please
> 
Fixed in v2.

>>> if (umem->nmap > 0)
>>> ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
>>> DMA_BIDIRECTIONAL);
>>> -   for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>>> -   page = sg_page_iter_page(&sg_iter);
>>> -   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
>>> -   }
>>> +   for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
>>
>> The change from umem->sg_nents to umem->nmap looks OK, although we should get
>> IB people to verify that there is not some odd bug or reason to leave it as 
>> is.
> 
> No, nmap wouldn't be right here. nmap is the number of dma mapped SGLs
> in the list and should only be used by things doing sg_dma* stuff.
> 
> umem->sg_nents is the number of CPU SGL entries and is the correct
> thing here.
> 

And this was fixed in v2 as well.


Re: linux-next: Signed-off-by missing for commit in the kvm tree

2021-02-05 Thread Joao Martins
On 2/4/21 8:18 PM, Stephen Rothwell wrote:
> Hi all,
> 
> Commit
> 
>   79033bebf6fa ("KVM: x86/xen: Fix coexistence of Xen and Hyper-V hypercalls")
> 
> is missing a Signed-off-by from its author.
> 
Except that David is the author of this particular patch, not me.

Joao


Re: [PATCH v2 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-05 Thread Joao Martins
On 2/5/21 4:49 AM, John Hubbard wrote:
> On 2/4/21 12:24 PM, Joao Martins wrote:
>> Add a unpin_user_page_range_dirty_lock() API which takes a starting page
>> and how many consecutive pages we want to unpin and optionally dirty.
>>
>> Given that we won't be iterating on a list of changes, change
>> compound_next() to receive a bool, whether to calculate from the starting
> 
> Thankfully, that claim is stale and can now be removed from this commit
> description.
> 
Yes, I'll delete it.

>> page, or walk the page array. Finally add a separate iterator,
>> for_each_compound_range() that just operate in page ranges as opposed
>> to page array.
>>
>> For users (like RDMA mr_dereg) where each sg represents a
>> contiguous set of pages, we're able to more efficiently unpin
>> pages without having to supply an array of pages much of what
>> happens today with unpin_user_pages().
>>
>> Suggested-by: Jason Gunthorpe 
>> Signed-off-by: Joao Martins 
>> ---
>>   include/linux/mm.h |  2 ++
>>   mm/gup.c   | 64 ++
>>   2 files changed, 66 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a608feb0d42e..b76063f7f18a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
>>   void unpin_user_page(struct page *page);
>>   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   bool make_dirty);
>> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long 
>> npages,
>> +  bool make_dirty);
>>   void unpin_user_pages(struct page **pages, unsigned long npages);
>>   
>>   /**
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 5a3dd235017a..3426736a01b2 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,6 +215,34 @@ void unpin_user_page(struct page *page)
>>   }
>>   EXPORT_SYMBOL(unpin_user_page);
>>   
>> +static inline void range_next(unsigned long i, unsigned long npages,
>> +  struct page **list, struct page **head,
>> +  unsigned int *ntails)
> 
> Would compound_range_next() be a better name?
> 
Yeah, will change to that instead. range_next() might actually get confused with
operations done on struct range *.

One other thing about my naming is that unpin_user_page_range_dirty_lock() is
*huge*. But it seems to adhere to the naming of the rest of the unpin_* family of
functions. Couldn't find a better alternative :/

>> +{
>> +struct page *next, *page;
>> +unsigned int nr = 1;
>> +
>> +if (i >= npages)
>> +return;
>> +
>> +npages -= i;

I will fold this @npages subtraction into the min_t() calculation, as that's the
only place it's used (see the sketch at the end of this mail).

>> +next = *list + i;
>> +
>> +page = compound_head(next);
>> +if (PageCompound(page) && compound_order(page) > 1)

I am not handling compound_order == 1, so I will change to >= in the condition
above; @compound_nr is placed on the second page.

>> +nr = min_t(unsigned int,
>> +   page + compound_nr(page) - next, npages);
> 
> This pointer arithmetic will involve division. Which may be unnecessarily
> expensive, if there is a way to calculate this with indices instead of
> pointer arithmetic. I'm not sure if there is, off hand, but thought it
> worth mentioning because the point is sometimes overlooked.
> 
Sadly, I can't think of one :( hence I had to adhere to what seems to be the
pattern today.

Any conversion to PFNs (page_to_pfn) will do the same arithmetic, and
I don't think we can reliably use page_index (and even that is only available
on the head page).

>> +
>> +*head = page;
>> +*ntails = nr;
>> +}
>> +
>> +#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
>> +for (__i = 0, \
>> + range_next(__i, __npages, __list, &(__head), &(__ntails)); \
>> + __i < __npages; __i += __ntails, \
>> + range_next(__i, __npages, __list, &(__head), &(__ntails)))
>> +
>>   static inline void compound_next(unsigned long i, unsigned long npages,
>>   struct page **list, struct page **head,
>>   unsigned int *ntails)
>> @@ -306,6 +334,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
>> unsigned long npages,
>>   }
>>   EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
>>
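
Putting the notes above together, this is roughly what I have in mind for v3
(a rough sketch only, renamed to compound_range_next() as suggested, untested):

static inline void compound_range_next(unsigned long i, unsigned long npages,
				       struct page **list, struct page **head,
				       unsigned int *ntails)
{
	struct page *next, *page;
	unsigned int nr = 1;

	if (i >= npages)
		return;

	next = *list + i;
	page = compound_head(next);
	/* >= 1 because @compound_nr lives in the second (tail) page */
	if (PageCompound(page) && compound_order(page) >= 1)
		/* the @npages subtraction folded into the min_t() clamp */
		nr = min_t(unsigned int,
			   page + compound_nr(page) - next, npages - i);

	*head = page;
	*ntails = nr;
}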

Re: [PATCH v2 1/4] mm/gup: add compound page list iterator

2021-02-05 Thread Joao Martins



On 2/5/21 4:11 AM, John Hubbard wrote:
> On 2/4/21 12:24 PM, Joao Martins wrote:
>> Add an helper that iterates over head pages in a list of pages. It
>> essentially counts the tails until the next page to process has a
>> different head that the current. This is going to be used by
>> unpin_user_pages() family of functions, to batch the head page refcount
>> updates once for all passed consecutive tail pages.
>>
>> Suggested-by: Jason Gunthorpe 
>> Signed-off-by: Joao Martins 
>> ---
>>   mm/gup.c | 29 +
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index d68bcb482b11..d1549c61c2f6 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
>>   }
>>   EXPORT_SYMBOL(unpin_user_page);
>>   
>> +static inline void compound_next(unsigned long i, unsigned long npages,
>> + struct page **list, struct page **head,
>> + unsigned int *ntails)
>> +{
>> +struct page *page;
>> +unsigned int nr;
>> +
>> +if (i >= npages)
>> +return;
>> +
>> +list += i;
>> +npages -= i;
> 
> It is worth noting that this is slightly more complex to read than it needs 
> to be.
> You are changing both endpoints of a loop at once. That's hard to read for a 
> human.
> And you're only doing it in order to gain the small benefit of being able to
> use nr directly at the end of the routine.
> 
> If instead you keep npages constant like it naturally wants to be, you could
> just do a "(*ntails)++" in the loop, to take care of *ntails.
> 
I didn't do it that way because I would need to deref @ntails per iteration, so
it felt more efficient to do it as above. On second thought, I could
alternatively do the following, thoughts?

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..8defe4f670d5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,32 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);

+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   struct page *page;
+   unsigned int nr;
+
+   if (i >= npages)
+   return;
+
+   page = compound_head(list[i]);
+   for (nr = i + 1; nr < npages; nr++) {
+   if (compound_head(list[nr]) != page)
+   break;
+   }
+
+   *head = page;
+   *ntails = nr - i;
+}
+

> However, given that the patch is correct and works as-is, the above is really 
> just
> an optional idea, so please feel free to add:
> 
> Reviewed-by: John Hubbard 
> 
> 
Thanks!

Hopefully I can retain that if the snippet above is preferred?

Joao


[PATCH v2 4/4] RDMA/umem: batch page unpin in __ib_umem_release()

2021-02-04 Thread Joao Martins
Use the newly added unpin_user_page_range_dirty_lock()
for more quickly unpinning a consecutive range of pages
represented as compound pages. This will also calculate the
number of pages to unpin (for the tail pages which match the
head page) and thus batch the refcount update.

Running a test program which calls MR reg/unreg on a 1G region
and measures the cost of both operations together (in a guest using rxe)
with THP and hugetlbfs:

Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2688 rounds in 5.002 sec: 1860.786 usec / round
32517 rounds in 60.001 sec: 1845.225 usec / round

Signed-off-by: Joao Martins 
---
 drivers/infiniband/core/umem.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..9b607013e2a2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int 
dirty)
 {
-   struct sg_page_iter sg_iter;
-   struct page *page;
+   bool make_dirty = umem->writable && dirty;
+   struct scatterlist *sg;
+   unsigned int i;
 
if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
DMA_BIDIRECTIONAL);
 
-   for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
-   page = sg_page_iter_page(&sg_iter);
-   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
-   }
+   for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+   unpin_user_page_range_dirty_lock(sg_page(sg),
+   DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
 sg_free_table(&umem->sg_head);
 }
-- 
2.17.1



[PATCH v2 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-04 Thread Joao Martins
Add an unpin_user_page_range_dirty_lock() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

Given that we won't be iterating on a list of changes, change
compound_next() to receive a bool indicating whether to calculate from the
starting page or to walk the page array. Finally add a separate iterator,
for_each_compound_range(), that just operates on page ranges as opposed
to page arrays.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages, much like what
happens today with unpin_user_pages().

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
 include/linux/mm.h |  2 ++
 mm/gup.c   | 64 ++
 2 files changed, 66 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a608feb0d42e..b76063f7f18a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty);
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
 /**
diff --git a/mm/gup.c b/mm/gup.c
index 5a3dd235017a..3426736a01b2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,34 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void range_next(unsigned long i, unsigned long npages,
+ struct page **list, struct page **head,
+ unsigned int *ntails)
+{
+   struct page *next, *page;
+   unsigned int nr = 1;
+
+   if (i >= npages)
+   return;
+
+   npages -= i;
+   next = *list + i;
+
+   page = compound_head(next);
+   if (PageCompound(page) && compound_order(page) > 1)
+   nr = min_t(unsigned int,
+  page + compound_nr(page) - next, npages);
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+range_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+range_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 static inline void compound_next(unsigned long i, unsigned long npages,
 struct page **list, struct page **head,
 unsigned int *ntails)
@@ -306,6 +334,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 
+/**
+ * unpin_user_page_range_dirty_lock() - release and optionally dirty
+ * gup-pinned page range
+ *
+ * @page:  the starting page of a range maybe marked dirty, and definitely 
released.
+ * @npages: number of consecutive pages to release.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page range" refers to a range of pages that has had one of the
+ * get_user_pages() variants called on that page.
+ *
+ * For the page ranges defined by [page .. page+npages], make that range (or
+ * its head pages, if a compound page) dirty, if @make_dirty is true, and if 
the
+ * page range was previously listed as clean.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), unpin_user_page().
+ *
+ */
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, &page, npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
+   }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+
 /**
  * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
-- 
2.17.1



[PATCH v2 2/4] mm/gup: decrement head page once for group of subpages

2021-02-04 Thread Joao Martins
Rather than decrementing the head page refcount one by one, we
walk the page array and check which pages belong to the same
compound_head. Later on we decrement the calculated amount
of references in a single write to the head page. To that
end, switch to for_each_compound_head(), which does most of the work.

set_page_dirty() needs no adjustment as it's a nop for
non-dirty head pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and
hugetlbfs:

- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Signed-off-by: Joao Martins 
Reviewed-by: John Hubbard 
---
 mm/gup.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index d1549c61c2f6..5a3dd235017a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -270,20 +270,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 bool make_dirty)
 {
unsigned long index;
-
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
+   struct page *head;
+   unsigned int ntails;
 
if (!make_dirty) {
unpin_user_pages(pages, npages);
return;
}
 
-   for (index = 0; index < npages; index++) {
-   struct page *page = compound_head(pages[index]);
+   for_each_compound_head(index, pages, npages, head, ntails) {
/*
 * Checking PageDirty at this point may race with
 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -304,9 +299,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 * written back, so it gets written back again in the
 * next writeback cycle. This is harmless.
 */
-   if (!PageDirty(page))
-   set_page_dirty_lock(page);
-   unpin_user_page(page);
+   if (!PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
}
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -323,6 +318,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 void unpin_user_pages(struct page **pages, unsigned long npages)
 {
unsigned long index;
+   struct page *head;
+   unsigned int ntails;
 
/*
 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -331,13 +328,9 @@ void unpin_user_pages(struct page **pages, unsigned long 
npages)
 */
if (WARN_ON(IS_ERR_VALUE(npages)))
return;
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
-   for (index = 0; index < npages; index++)
-   unpin_user_page(pages[index]);
+
+   for_each_compound_head(index, pages, npages, head, ntails)
+   put_compound_head(head, ntails, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_pages);
 
-- 
2.17.1



[PATCH v2 0/4] mm/gup: page unpining improvements

2021-02-04 Thread Joao Martins
Hey,

This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpining), particularly:

 1) Decrement the head page refcount by @ntails, thus greatly reducing the number
of atomic operations per compound page. This is done by comparing the heads of
individual tail pages, counting the number of consecutive tails whose heads
match, and updating the head page refcount based on that count. Should be a
visible improvement in all page (un)pinners which use compound pages.

 2) Introduce a new API for unpinning page ranges (to avoid the scan in the
previous item and rely on arithmetic instead), and use that in RDMA
__ib_umem_release() (used for MR deregistration).

Performance improvements: unpin_user_pages() for hugetlbfs and THP improves ~3x
(through gup_test) and RDMA MR dereg improves ~4.5x with the new API.
See patches 2 and 4 for those.
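
As a rough usage sketch (illustration only, not part of the series; the caller
and the 'pinned_chunk' descriptor are hypothetical), a pinner that tracks
physically contiguous ranges can now do one call per range instead of one call
per page:

/* Hypothetical descriptor for a physically contiguous pinned range. */
struct pinned_chunk {
	struct page *first;
	unsigned long npages;
};

static void release_pinned_chunks(struct pinned_chunk *chunks,
				  unsigned int nchunks, bool make_dirty)
{
	unsigned int i;

	/* One ranged call per chunk; tail pages sharing a compound head
	 * get batched into a single head-page refcount update. */
	for (i = 0; i < nchunks; i++)
		unpin_user_page_range_dirty_lock(chunks[i].first,
						 chunks[i].npages, make_dirty);
}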

These patches used to be in this RFC:

https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.mart...@oracle.com/,
"[PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps"

But they were moved out separately at the suggestion of Jason, given they're
applicable to page unpinning in general. Thanks for all the comments in the RFC above.

These patches apply on top of linux-next tag next-20210202.

Suggestions, comments, welcomed as usual.

Joao

Changelog since,

v1 -> v2:
 * Prefix macro arguments with __ to avoid collisions with other defines (John)
 * Remove count_tails() and have the logic for the two iterators split into
   range_next() and compound_next() (John)
 * Remove the @range boolean from the iterator helpers (John)
 * Add docs on unpin_user_page_range_dirty_lock() on patch 3 (John)
 * Use unsigned for @i on patch 4 (John)
 * Fix subject line of patch 4 (John)
 * Add John's Reviewed-by on the second patch
 * Fix incorrect use of @nmap and use @sg_nents instead (Jason)

RFC -> v1:
 * Introduce a head/ntails iterator and change unpin_*_pages() to use that,
   inspired by folio iterators (Jason)
 * Introduce an alternative unpin_user_page_range_dirty_lock() to unpin based
   on a consecutive page range without having to walk page arrays (Jason)
 * Use unsigned for number of tails (Jason)

Joao Martins (4):
  mm/gup: add compound page list iterator
  mm/gup: decrement head page once for group of subpages
  mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  RDMA/umem: batch page unpin in __ib_umem_release()

 drivers/infiniband/core/umem.c |  12 ++--
 include/linux/mm.h |   2 +
 mm/gup.c   | 122 -
 3 files changed, 112 insertions(+), 24 deletions(-)

-- 
2.17.1



[PATCH v2 1/4] mm/gup: add compound page list iterator

2021-02-04 Thread Joao Martins
Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current one. This is going to be used by the
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
 mm/gup.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..d1549c61c2f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   struct page *page;
+   unsigned int nr;
+
+   if (i >= npages)
+   return;
+
+   list += i;
+   npages -= i;
+   page = compound_head(*list);
+
+   for (nr = 1; nr < npages; nr++) {
+   if (compound_head(list[nr]) != page)
+   break;
+   }
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
-- 
2.17.1



Re: [PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-04 Thread Joao Martins
On 2/4/21 11:35 AM, Joao Martins wrote:
> On 2/3/21 11:37 PM, John Hubbard wrote:
>> On 2/3/21 2:00 PM, Joao Martins wrote:
>>> -static inline unsigned int count_ntails(struct page **pages, unsigned long 
>>> npages)
>>> +static inline unsigned int count_ntails(struct page **pages,
>>> +   unsigned long npages, bool range)
>>>   {
>>> -   struct page *head = compound_head(pages[0]);
>>> +   struct page *page = pages[0], *head = compound_head(page);
>>> unsigned int ntails;
>>>   
>>> +   if (range)
>>> +   return (!PageCompound(head) || compound_order(head) <= 1) ? 1 :
>>> +  min_t(unsigned int, (head + compound_nr(head) - page), 
>>> npages);
>>
>> Here, you clearly should use a separate set of _range routines. Because 
>> you're basically
>> creating two different routines here! Keep it simple.
>>
>> Once you're in a separate routine, you might feel more comfortable expanding 
>> that to
>> a more readable form, too:
>>
>>  if (!PageCompound(head) || compound_order(head) <= 1)
>>  return 1;
>>
>>  return min_t(unsigned int, (head + compound_nr(head) - page), npages);
>>
> Yes.
> 
> Let me also try instead to put move everything into two sole iterator helper 
> routines,
> compound_next() and compound_next_range(), and thus get rid of this 
> count_ntails(). It
> should also help in removing a compound_head() call which should save cycles.
> 

As mentioned earlier, I got rid of count_ntails and the ugly boolean. Plus I
addressed the missing docs -- fwiw, I borrowed the unpin_user_pages_dirty_lock()
docs and modified them a bit.

Partial diff below, hopefully it is looking better now:

diff --git a/mm/gup.c b/mm/gup.c
index 5a3dd235017a..4ef36c8990e3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,34 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);

+static inline void range_next(unsigned long i, unsigned long npages,
+ struct page **list, struct page **head,
+ unsigned int *ntails)
+{
+   struct page *next, *page;
+   unsigned int nr = 1;
+
+   if (i >= npages)
+   return;
+
+   npages -= i;
+   next = *list + i;
+
+   page = compound_head(next);
+   if (PageCompound(page) && compound_order(page) > 1)
+   nr = min_t(unsigned int,
+  page + compound_nr(page) - next, npages);
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_range(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+range_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+range_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 static inline void compound_next(unsigned long i, unsigned long npages,
 struct page **list, struct page **head,
 unsigned int *ntails)
@@ -306,6 +334,42 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long
npages,
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);

+/**
+ * unpin_user_page_range_dirty_lock() - release and optionally dirty
+ * gup-pinned page range
+ *
+ * @page:  the starting page of a range maybe marked dirty, and definitely 
released.
+ * @npages: number of consecutive pages to release.
+ * @make_dirty: whether to mark the pages dirty
+ *
+ * "gup-pinned page range" refers to a range of pages that has had one of the
+ * get_user_pages() variants called on that page.
+ *
+ * For the page ranges defined by [page .. page+npages], make that range (or
+ * its head pages, if a compound page) dirty, if @make_dirty is true, and if 
the
+ * page range was previously listed as clean.
+ *
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), unpin_user_page().
+ *
+ */
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, &page, npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
+   }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+


Re: [PATCH 1/4] mm/gup: add compound page list iterator

2021-02-04 Thread Joao Martins
On 2/4/21 11:27 AM, Joao Martins wrote:
> On 2/3/21 11:00 PM, John Hubbard wrote:
>> On 2/3/21 2:00 PM, Joao Martins wrote:
>>> Add an helper that iterates over head pages in a list of pages. It
>>> essentially counts the tails until the next page to process has a
>>> different head that the current. This is going to be used by
>>> unpin_user_pages() family of functions, to batch the head page refcount
>>> updates once for all passed consecutive tail pages.
>>>
>>> Suggested-by: Jason Gunthorpe 
>>> Signed-off-by: Joao Martins 
>>> ---
>>>   mm/gup.c | 29 +
>>>   1 file changed, 29 insertions(+)
>>>
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index d68bcb482b11..4f88dcef39f2 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
>>>   }
>>>   EXPORT_SYMBOL(unpin_user_page);
>>>   
>>> +static inline unsigned int count_ntails(struct page **pages, unsigned long 
>>> npages)
>>
>> Silly naming nit: could we please name this function count_pagetails()? 
>> count_ntails
>> is a bit redundant, plus slightly less clear.
>>
> Hmm, pagetails is also a tiny bit redundant. Perhaps count_subpages() instead?
> 
> count_ntails is meant to be 'count number of tails' i.e. to align terminology 
> with head +
> tails which was also suggested over the other series.
> 
Given your comment on the third patch, I reworked it a bit and got rid of
count_ntails.

So it's looking like this (with the macro arguments renamed as well):

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..d1549c61c2f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);

+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   struct page *page;
+   unsigned int nr;
+
+   if (i >= npages)
+   return;
+
+   list += i;
+   npages -= i;
+   page = compound_head(*list);
+
+   for (nr = 1; nr < npages; nr++) {
+   if (compound_head(list[nr]) != page)
+   break;
+   }
+
+   *head = page;
+   *ntails = nr;
+}
+
+#define for_each_compound_head(__i, __list, __npages, __head, __ntails) \
+   for (__i = 0, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)); \
+__i < __npages; __i += __ntails, \
+compound_next(__i, __npages, __list, &(__head), &(__ntails)))
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.



Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()

2021-02-04 Thread Joao Martins



On 2/4/21 12:15 AM, John Hubbard wrote:
> On 2/3/21 2:00 PM, Joao Martins wrote:
>> Use the newly added unpin_user_page_range_dirty_lock()
>> for more quickly unpinning a consecutive range of pages
>> represented as compound pages. This will also calculate
>> number of pages to unpin (for the tail pages which matching
>> head page) and thus batch the refcount update.
>>
>> Running a test program which calls mr reg/unreg on a 1G in size
>> and measures cost of both operations together (in a guest using rxe)
>> with THP and hugetlbfs:
> 
> In the patch subject line:
> 
> s/__ib_mem_release/__ib_umem_release/
> 
Ah, yes.

>>
>> Before:
>> 590 rounds in 5.003 sec: 8480.335 usec / round
>> 6898 rounds in 60.001 sec: 8698.367 usec / round
>>
>> After:
>> 2631 rounds in 5.001 sec: 1900.618 usec / round
>> 31625 rounds in 60.001 sec: 1897.267 usec / round
>>
>> Signed-off-by: Joao Martins 
>> ---
>>   drivers/infiniband/core/umem.c | 12 ++--
>>   1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index 2dde99a9ba07..ea4ebb3261d9 100644
>> --- a/drivers/infiniband/core/umem.c
>> +++ b/drivers/infiniband/core/umem.c
>> @@ -47,17 +47,17 @@
>>   
>>   static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, 
>> int dirty)
>>   {
>> -struct sg_page_iter sg_iter;
>> -struct page *page;
>> +bool make_dirty = umem->writable && dirty;
>> +struct scatterlist *sg;
>> +int i;
> 
> Maybe unsigned int is better, so as to perfectly match the scatterlist.length.
> 
Will fix.

>>   
>>  if (umem->nmap > 0)
>>  ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
>>  DMA_BIDIRECTIONAL);
>>   
>> -for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>> -page = sg_page_iter_page(&sg_iter);
>> -unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
>> -}
>> +for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
> 
> The change from umem->sg_nents to umem->nmap looks OK, although we should get
> IB people to verify that there is not some odd bug or reason to leave it as 
> is.
> 
/me nods

fwiw this was suggested by Jason :) as the way I had done it was unnecessarily
allocating a page to unpin pages.

>> +unpin_user_page_range_dirty_lock(sg_page(sg),
>> +DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
> 
> Is it really OK to refer directly to sg->length? The scatterlist library goes
> to some effort to avoid having callers directly access the struct member 
> variables.
> 
> Actually, the for_each_sg() code and its behavior with sg->length and 
> sg_page(sg)
> confuses me because I'm new to it, and I don't quite understand how this 
> works.

So IIUC this can be done given how ib_umem_get allocates scatterlists (i.e. see
the call to __sg_alloc_table_from_pages()). It builds a scatterlist with segments
capped at the device's DMA max segment size (e.g. 64K, 2G, etc., depending on
what the device sets it to), and once created, each scatterlist entry I am
iterating over represents a contiguous range of PFNs with a starting page. And
if you keep pinning contiguous amounts of memory, it keeps coalescing into the
previously allocated sgl.
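
In other words (rough sketch of the reasoning only, not part of the patch), each
coalesced sg entry hands us a starting page plus a byte length, so the number of
pages to unpin per entry reduces to:

/* Sketch: pages covered by one coalesced umem scatterlist entry.
 * sg_page(sg) is the first page of the contiguous PFN range and
 * sg->length its size in bytes -- hence the DIV_ROUND_UP in the patch. */
static unsigned long umem_sg_npages(struct scatterlist *sg)
{
	return DIV_ROUND_UP(sg->length, PAGE_SIZE);
}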

> Especially with SG_CHAIN. I'm assuming that you've monitored /proc/vmstat for
> nr_foll_pin* ?
> 
Yeap I did. I see no pages left unpinned.

>>   
>>  sg_free_table(&umem->sg_head);
>>   }
>>
> 
> thanks,
> 


Re: [PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-04 Thread Joao Martins



On 2/4/21 12:11 AM, John Hubbard wrote:
> On 2/3/21 2:00 PM, Joao Martins wrote:
> ...
>> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long 
>> npages,
>> +  bool make_dirty)
>> +{
>> +unsigned long index;
>> +struct page *head;
>> +unsigned int ntails;
>> +
>> +for_each_compound_range(index, &page, npages, head, ntails) {
>> +if (make_dirty && !PageDirty(head))
>> +set_page_dirty_lock(head);
>> +put_compound_head(head, ntails, FOLL_PIN);
>> +}
>> +}
>> +EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
>> +
> 
> Also, looking at this again while trying to verify the sg diffs in patch #4, 
> I noticed
> that the name is tricky. Usually a "range" would not have a single struct 
> page* as the
> argument. So in this case, perhaps a comment above the function would help, 
> something
> approximately like this:
> 
> /*
>   * The "range" in the function name refers to the fact that the page may be a
>   * compound page. If so, then that compound page will be treated as one or 
> more
>   * ranges of contiguous tail pages.
>   */
> 
> ...I guess. Or maybe the name is just wrong (a comment block explaining a 
> name is
> always a bad sign). 

Naming aside, a comment is probably worthwhile nonetheless to explain what the
function does.

> Perhaps we should rename it to something like:
> 
>   unpin_user_compound_page_dirty_lock()
> 
> ?

The thing, though, is that it doesn't *only* unpin a compound page. It unpins a
contiguous range of pages (hence page_range), *and* if these are compound pages
it further accelerates things.

That said, your name suggestion probably hints to the caller that a compound
page should be passed anyway, so your suggestion does have a ring to it.

Joao


Re: [PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-04 Thread Joao Martins



On 2/3/21 11:37 PM, John Hubbard wrote:
> On 2/3/21 2:00 PM, Joao Martins wrote:
>> Add a unpin_user_page_range() API which takes a starting page
>> and how many consecutive pages we want to dirty.
>>
>> Given that we won't be iterating on a list of changes, change
>> compound_next() to receive a bool, whether to calculate from the starting
>> page, or walk the page array. Finally add a separate iterator,
> 
> A bool arg is sometimes, but not always, a hint that you really just want
> a separate set of routines. Below...
> 
Yes.

I was definitely wrestling back and forth a lot about having separate routines
for two different iterator helpers, i.e. compound_next_head(), or having it all
merged into one compound_next() / count_ntails().

>> for_each_compound_range() that just operate in page ranges as opposed
>> to page array.
>>
>> For users (like RDMA mr_dereg) where each sg represents a
>> contiguous set of pages, we're able to more efficiently unpin
>> pages without having to supply an array of pages much of what
>> happens today with unpin_user_pages().
>>
>> Suggested-by: Jason Gunthorpe 
>> Signed-off-by: Joao Martins 
>> ---
>>   include/linux/mm.h |  2 ++
>>   mm/gup.c   | 48 ++
>>   2 files changed, 42 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a608feb0d42e..b76063f7f18a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
>>   void unpin_user_page(struct page *page);
>>   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   bool make_dirty);
>> +void unpin_user_page_range_dirty_lock(struct page *page, unsigned long 
>> npages,
>> +  bool make_dirty);
>>   void unpin_user_pages(struct page **pages, unsigned long npages);
>>   
>>   /**
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 971a24b4b73f..1b57355d5033 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,11 +215,16 @@ void unpin_user_page(struct page *page)
>>   }
>>   EXPORT_SYMBOL(unpin_user_page);
>>   
>> -static inline unsigned int count_ntails(struct page **pages, unsigned long 
>> npages)
>> +static inline unsigned int count_ntails(struct page **pages,
>> +unsigned long npages, bool range)
>>   {
>> -struct page *head = compound_head(pages[0]);
>> +struct page *page = pages[0], *head = compound_head(page);
>>  unsigned int ntails;
>>   
>> +if (range)
>> +return (!PageCompound(head) || compound_order(head) <= 1) ? 1 :
>> +   min_t(unsigned int, (head + compound_nr(head) - page), 
>> npages);
> 
> Here, you clearly should use a separate set of _range routines. Because 
> you're basically
> creating two different routines here! Keep it simple.
> 
> Once you're in a separate routine, you might feel more comfortable expanding 
> that to
> a more readable form, too:
> 
>   if (!PageCompound(head) || compound_order(head) <= 1)
>   return 1;
> 
>   return min_t(unsigned int, (head + compound_nr(head) - page), npages);
> 
Yes.

Let me also try instead to move everything into two sole iterator helper routines,
compound_next() and compound_next_range(), and thus get rid of this
count_ntails(). It should also help in removing a compound_head() call, which
should save cycles.

Joao


Re: [PATCH 1/4] mm/gup: add compound page list iterator

2021-02-04 Thread Joao Martins
On 2/3/21 11:00 PM, John Hubbard wrote:
> On 2/3/21 2:00 PM, Joao Martins wrote:
>> Add an helper that iterates over head pages in a list of pages. It
>> essentially counts the tails until the next page to process has a
>> different head that the current. This is going to be used by
>> unpin_user_pages() family of functions, to batch the head page refcount
>> updates once for all passed consecutive tail pages.
>>
>> Suggested-by: Jason Gunthorpe 
>> Signed-off-by: Joao Martins 
>> ---
>>   mm/gup.c | 29 +
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index d68bcb482b11..4f88dcef39f2 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
>>   }
>>   EXPORT_SYMBOL(unpin_user_page);
>>   
>> +static inline unsigned int count_ntails(struct page **pages, unsigned long 
>> npages)
> 
> Silly naming nit: could we please name this function count_pagetails()? 
> count_ntails
> is a bit redundant, plus slightly less clear.
> 
Hmm, pagetails is also a tiny bit redundant. Perhaps count_subpages() instead?

count_ntails is meant to be 'count number of tails', i.e. to align terminology
with head + tails, which was also suggested on the other series.

>> +{
>> +struct page *head = compound_head(pages[0]);
>> +unsigned int ntails;
>> +
>> +for (ntails = 1; ntails < npages; ntails++) {
>> +if (compound_head(pages[ntails]) != head)
>> +break;
>> +}
>> +
>> +return ntails;
>> +}
>> +
>> +static inline void compound_next(unsigned long i, unsigned long npages,
>> + struct page **list, struct page **head,
>> + unsigned int *ntails)
>> +{
>> +if (i >= npages)
>> +return;
>> +
>> +*ntails = count_ntails(list + i, npages - i);
>> +*head = compound_head(list[i]);
>> +}
>> +
>> +#define for_each_compound_head(i, list, npages, head, ntails) \
> 
> When using macros, which are dangerous in general, you have to worry about
> things like name collisions. I really dislike that C has forced this unsafe
> pattern upon us, but of course we are stuck with it, for iterator helpers.
> 
/me nods

> Given that we're stuck, you should probably use names such as __i, __list, 
> etc,
> in the the above #define. Otherwise you could stomp on existing variables.

Will do.

Joao


Re: [PATCH 2/4] mm/gup: decrement head page once for group of subpages

2021-02-04 Thread Joao Martins



On 2/3/21 11:28 PM, John Hubbard wrote:
> On 2/3/21 2:00 PM, Joao Martins wrote:
>> Rather than decrementing the head page refcount one by one, we
>> walk the page array and checking which belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page. To that
>> end switch to for_each_compound_head() does most of the work.
>>
>> set_page_dirty() needs no adjustment as it's a nop for
>> non-dirty head pages and it doesn't operate on tail pages.
>>
>> This considerably improves unpinning of pages with THP and
>> hugetlbfs:
>>
>> - THP
>> gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
>> PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us
>>
>> - 16G with 1G huge page size
>> gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
>> PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us
>>
>> Signed-off-by: Joao Martins 
>> ---
>>   mm/gup.c | 29 +++--
>>   1 file changed, 11 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 4f88dcef39f2..971a24b4b73f 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -270,20 +270,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
>> unsigned long npages,
>>   bool make_dirty)
>>   {
>>  unsigned long index;
>> -
>> -/*
>> - * TODO: this can be optimized for huge pages: if a series of pages is
>> - * physically contiguous and part of the same compound page, then a
>> - * single operation to the head page should suffice.
>> - */
> 
> Great to see this TODO (and the related one below) finally done!
> 
> Everything looks correct here.
> 
> Reviewed-by: John Hubbard 
> 
Thank you!

Joao


[PATCH 0/4] mm/gup: page unpining improvements

2021-02-03 Thread Joao Martins
Hey,

This series improves page unpinning, with an eye on improving MR
deregistration for big swaths of memory (which is bound by the page
unpining), particularly:

 1) Decrement the head page refcount by @ntails, thus greatly reducing the number
of atomic operations per compound page. This is done by comparing the heads of
individual tail pages, counting the number of consecutive tails whose heads
match, and updating the head page refcount based on that count. Should be a
visible improvement in all page (un)pinners which use compound pages.

 2) Introduce a new API for unpinning page ranges (to avoid the scan in the
previous item and rely on arithmetic instead), and use that in RDMA
ib_mem_release (used for MR deregistration).

Performance improvements: unpin_user_pages() for hugetlbfs and THP improves ~3x
(through gup_test) and RDMA MR dereg improves ~4.5x with the new API.
See patches 2 and 4 for those.

These patches used to be in this RFC:

https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.mart...@oracle.com/,
"[PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps"

But were moved separately at the suggestion of Jason, given it's applicable
to page unpinning in general. Thanks for all the comments in the RFC above.

These patches apply on top of linux-next tag next-20210202.

Suggestions, comments, welcomed as usual.

Joao

Changelog since the RFC above:

* Introduce a head/ntails iterator and change unpin_*_pages() to use that,
  inspired by folio iterators (Jason)
* Introduce an alternative unpin_user_page_range_dirty_lock() to unpin based
  on a consecutive page range without having to walk page arrays (Jason)
* Use unsigned for number of tails (Jason)

Joao Martins (4):
  mm/gup: add compound page list iterator
  mm/gup: decrement head page once for group of subpages
  mm/gup: add a range variant of unpin_user_pages_dirty_lock()
  RDMA/umem: batch page unpin in __ib_mem_release()

 drivers/infiniband/core/umem.c | 12 ++---
 include/linux/mm.h |  2 +
 mm/gup.c   | 90 +++---
 3 files changed, 80 insertions(+), 24 deletions(-)

-- 
2.17.1



[PATCH 3/4] mm/gup: add a range variant of unpin_user_pages_dirty_lock()

2021-02-03 Thread Joao Martins
Add an unpin_user_page_range() API which takes a starting page
and how many consecutive pages we want to unpin and optionally dirty.

Given that we won't be iterating on a list of changes, change
compound_next() to receive a bool indicating whether to calculate from the
starting page or to walk the page array. Finally add a separate iterator,
for_each_compound_range(), that just operates on page ranges as opposed
to page arrays.

For users (like RDMA mr_dereg) where each sg represents a
contiguous set of pages, we're able to more efficiently unpin
pages without having to supply an array of pages, much like what
happens today with unpin_user_pages().

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
 include/linux/mm.h |  2 ++
 mm/gup.c   | 48 ++
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a608feb0d42e..b76063f7f18a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1265,6 +1265,8 @@ static inline void put_page(struct page *page)
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 bool make_dirty);
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty);
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
 /**
diff --git a/mm/gup.c b/mm/gup.c
index 971a24b4b73f..1b57355d5033 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,11 +215,16 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
-static inline unsigned int count_ntails(struct page **pages, unsigned long 
npages)
+static inline unsigned int count_ntails(struct page **pages,
+   unsigned long npages, bool range)
 {
-   struct page *head = compound_head(pages[0]);
+   struct page *page = pages[0], *head = compound_head(page);
unsigned int ntails;
 
+   if (range)
+   return (!PageCompound(head) || compound_order(head) <= 1) ? 1 :
+  min_t(unsigned int, (head + compound_nr(head) - page), 
npages);
+
for (ntails = 1; ntails < npages; ntails++) {
if (compound_head(pages[ntails]) != head)
break;
@@ -229,20 +234,32 @@ static inline unsigned int count_ntails(struct page 
**pages, unsigned long npage
 }
 
 static inline void compound_next(unsigned long i, unsigned long npages,
-struct page **list, struct page **head,
-unsigned int *ntails)
+struct page **list, bool range,
+struct page **head, unsigned int *ntails)
 {
+   struct page *p, **next = &p;
+
if (i >= npages)
return;
 
-   *ntails = count_ntails(list + i, npages - i);
-   *head = compound_head(list[i]);
+   if (range)
+   *next = *list + i;
+   else
+   next = list + i;
+
+   *ntails = count_ntails(next, npages - i, range);
+   *head = compound_head(*next);
 }
 
+#define for_each_compound_range(i, list, npages, head, ntails) \
+   for (i = 0, compound_next(i, npages, list, true, &head, &ntails); \
+i < npages; i += ntails, \
+compound_next(i, npages, list, true, &head, &ntails))
+
 #define for_each_compound_head(i, list, npages, head, ntails) \
-   for (i = 0, compound_next(i, npages, list, &head, &ntails); \
+   for (i = 0, compound_next(i, npages, list, false, &head, &ntails); \
 i < npages; i += ntails, \
-compound_next(i, npages, list, &head, &ntails))
+compound_next(i, npages, list, false, &head, &ntails))
 
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
@@ -306,6 +323,21 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 
+void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
+ bool make_dirty)
+{
+   unsigned long index;
+   struct page *head;
+   unsigned int ntails;
+
+   for_each_compound_range(index, &page, npages, head, ntails) {
+   if (make_dirty && !PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
+   }
+}
+EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
+
 /**
  * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
-- 
2.17.1



[PATCH 2/4] mm/gup: decrement head page once for group of subpages

2021-02-03 Thread Joao Martins
Rather than decrementing the head page refcount one by one, we
walk the page array and check which pages belong to the same
compound_head. Later on we decrement the calculated amount
of references in a single write to the head page. To that
end, switch to for_each_compound_head(), which does most of the work.

set_page_dirty() needs no adjustment as it's a nop for
non-dirty head pages and it doesn't operate on tail pages.

This considerably improves unpinning of pages with THP and
hugetlbfs:

- THP
gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us

- 16G with 1G huge page size
gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w
PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us

Signed-off-by: Joao Martins 
---
 mm/gup.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 4f88dcef39f2..971a24b4b73f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -270,20 +270,15 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 bool make_dirty)
 {
unsigned long index;
-
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
+   struct page *head;
+   unsigned int ntails;
 
if (!make_dirty) {
unpin_user_pages(pages, npages);
return;
}
 
-   for (index = 0; index < npages; index++) {
-   struct page *page = compound_head(pages[index]);
+   for_each_compound_head(index, pages, npages, head, ntails) {
/*
 * Checking PageDirty at this point may race with
 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -304,9 +299,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, 
unsigned long npages,
 * written back, so it gets written back again in the
 * next writeback cycle. This is harmless.
 */
-   if (!PageDirty(page))
-   set_page_dirty_lock(page);
-   unpin_user_page(page);
+   if (!PageDirty(head))
+   set_page_dirty_lock(head);
+   put_compound_head(head, ntails, FOLL_PIN);
}
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -323,6 +318,8 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 void unpin_user_pages(struct page **pages, unsigned long npages)
 {
unsigned long index;
+   struct page *head;
+   unsigned int ntails;
 
/*
 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -331,13 +328,9 @@ void unpin_user_pages(struct page **pages, unsigned long 
npages)
 */
if (WARN_ON(IS_ERR_VALUE(npages)))
return;
-   /*
-* TODO: this can be optimized for huge pages: if a series of pages is
-* physically contiguous and part of the same compound page, then a
-* single operation to the head page should suffice.
-*/
-   for (index = 0; index < npages; index++)
-   unpin_user_page(pages[index]);
+
+   for_each_compound_head(index, pages, npages, head, ntails)
+   put_compound_head(head, ntails, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_pages);
 
-- 
2.17.1



[PATCH 1/4] mm/gup: add compound page list iterator

2021-02-03 Thread Joao Martins
Add a helper that iterates over head pages in a list of pages. It
essentially counts the tails until the next page to process has a
different head than the current one. This is going to be used by the
unpin_user_pages() family of functions, to batch the head page refcount
updates once for all passed consecutive tail pages.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Joao Martins 
---
 mm/gup.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index d68bcb482b11..4f88dcef39f2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,35 @@ void unpin_user_page(struct page *page)
 }
 EXPORT_SYMBOL(unpin_user_page);
 
+static inline unsigned int count_ntails(struct page **pages, unsigned long 
npages)
+{
+   struct page *head = compound_head(pages[0]);
+   unsigned int ntails;
+
+   for (ntails = 1; ntails < npages; ntails++) {
+   if (compound_head(pages[ntails]) != head)
+   break;
+   }
+
+   return ntails;
+}
+
+static inline void compound_next(unsigned long i, unsigned long npages,
+struct page **list, struct page **head,
+unsigned int *ntails)
+{
+   if (i >= npages)
+   return;
+
+   *ntails = count_ntails(list + i, npages - i);
+   *head = compound_head(list[i]);
+}
+
+#define for_each_compound_head(i, list, npages, head, ntails) \
+   for (i = 0, compound_next(i, npages, list, &head, &ntails); \
+i < npages; i += ntails, \
+compound_next(i, npages, list, &head, &ntails))
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
-- 
2.17.1



[PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()

2021-02-03 Thread Joao Martins
Use the newly added unpin_user_page_range_dirty_lock()
for more quickly unpinning a consecutive range of pages
represented as compound pages. This will also calculate the
number of pages to unpin (for the tail pages which match the
head page) and thus batch the refcount update.

Running a test program which calls MR reg/unreg on a 1G region
and measures the cost of both operations together (in a guest using rxe)
with THP and hugetlbfs:

Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2631 rounds in 5.001 sec: 1900.618 usec / round
31625 rounds in 60.001 sec: 1897.267 usec / round

Signed-off-by: Joao Martins 
---
 drivers/infiniband/core/umem.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..ea4ebb3261d9 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@
 
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int 
dirty)
 {
-   struct sg_page_iter sg_iter;
-   struct page *page;
+   bool make_dirty = umem->writable && dirty;
+   struct scatterlist *sg;
+   int i;
 
if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
DMA_BIDIRECTIONAL);
 
-   for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
-   page = sg_page_iter_page(&sg_iter);
-   unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
-   }
+   for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
+   unpin_user_page_range_dirty_lock(sg_page(sg),
+   DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
 sg_free_table(&umem->sg_head);
 }
-- 
2.17.1



Re: [PATCH v8 02/14] mm/gup: check every subpage of a compound page during isolation

2021-02-03 Thread Joao Martins
On 2/3/21 3:32 PM, Joao Martins wrote:
> On 2/3/21 2:51 PM, Pavel Tatashin wrote:
>> On Wed, Feb 3, 2021 at 8:23 AM Joao Martins  
>> wrote:
>>> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
>>> for compound pages but when !is_transparent_hugepage(head) or just 
>>> PageHuge(head) like:
>>>
>>> +   if (!is_transparent_hugepage(head) && PageCompound(page))
>>> +   i += (compound_nr(head) - (pages[i] - head));
>>>
>>> Or making specific to hugetlbfs:
>>>
>>> +   if (PageHuge(head))
>>> +   i += (compound_nr(head) - (pages[i] - head));
>>
>> Yes, this is reasonable optimization. I will submit a follow up patch
>> against linux-next.

Realized it late, but the previous step calculation was already broken, and I
inherited its brokenness when copy-pasting the deleted chunk:

The @step should be capped at the remaining pages to iterate:

i += min(nr_pages - i, compound_nr(head) - (pages[i] - head));
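
Combining it with the PageHuge() check from my earlier reply, the follow-up
would look roughly like this (a rough sketch, assuming the loop advances only
via the step, as it did before the removal):

	struct page *head = compound_head(pages[i]);
	unsigned long step = 1;

	/* Only hugetlb pages can safely be skipped as a whole; cap the
	 * step at the remaining pages to iterate. */
	if (PageHuge(head))
		step = min(nr_pages - i, compound_nr(head) - (pages[i] - head));

	i += step;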

  Joao


Re: [PATCH v8 02/14] mm/gup: check every subpage of a compound page during isolation

2021-02-03 Thread Joao Martins
On 2/3/21 2:53 PM, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 01:22:18PM +0000, Joao Martins wrote:
> 
>> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k 
>> usecs when
>> pinning a 16G hugetlb file.
> 
> Yes, but correctness demands it.
> 
Hence the quotes around the word regress.

But still, we are adding unnecessary overhead (previously nonexistent) to
compound pages, even those where the issue doesn't exist (hugetlb).

> The solution is to track these pages as we discover them so we know if
> a PMD/PUD points and can directly skip the duplicated work
> 
That looks better. It would require moving check_and_migrate_movable_pages() 
elsewhere,
and/or possibly reworking the entire series?

>> Splitting can only occur on THP right? If so, perhaps we could
>> retain the @step increment for compound pages but when
>> !is_transparent_hugepage(head) or just PageHuge(head) like:
> 
> Honestly I'd rather see it fixed properly which will give even bigger
> performance gains - avoiding the entire rescan of the page list will
> be a win
> 

If check_and_migrate_movable_pages() is meant to migrate unpinned pages, then
rather than pinning+unpinning+moving, perhaps it could be called in the
__get_user_pages() path, where we are walking the page tables and know if it's a
PUD/PMD, and can skip all the subpages and just record and migrate those
instead. Was that your thinking?

Joao


Re: [PATCH v8 02/14] mm/gup: check every subpage of a compound page during isolation

2021-02-03 Thread Joao Martins
On 2/3/21 2:51 PM, Pavel Tatashin wrote:
> On Wed, Feb 3, 2021 at 8:23 AM Joao Martins  wrote:
>>
>> On 1/25/21 7:47 PM, Pavel Tatashin wrote:
>>> When pages are isolated in check_and_migrate_movable_pages() we skip
>>> compound number of pages at a time. However, as Jason noted, it is
>>> not necessary correct that pages[i] corresponds to the pages that
>>> we skipped. This is because it is possible that the addresses in
>>> this range had split_huge_pmd()/split_huge_pud(), and these functions
>>> do not update the compound page metadata.
>>>
>>> The problem can be reproduced if something like this occurs:
>>>
>>> 1. User faulted huge pages.
>>> 2. split_huge_pmd() was called for some reason
>>> 3. User has unmapped some sub-pages in the range
>>> 4. User tries to longterm pin the addresses.
>>>
>>> The resulting pages[i] might end-up having pages which are not compound
>>> size page aligned.
>>>
>>> Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on 
>>> huge page")
>>> Reported-by: Jason Gunthorpe 
>>> Signed-off-by: Pavel Tatashin 
>>> Reviewed-by: Jason Gunthorpe 
>>> ---
>>
>> [...]
>>
>>>   /*
>>>* If we get a page from the CMA zone, since we are going to
>>>* be pinning these entries, we might as well move them out
>>> @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct 
>>> mm_struct *mm,
>>>   }
>>>   }
>>>   }
>>> -
>>> - i += step;
>>>   }
>>>
>>
> 
> Hi Joao,
> 
>> With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> ~32k 
>> usecs when
>> pinning a 16G hugetlb file.
> 
> Estimate or you actually measured?
> 
It's what I had measured before sending. The ~ is because there's error 
variance.

>>
> 
>> Splitting can only occur on THP right? If so, perhaps we could retain the 
>> @step increment
> 
> Yes, I do not think we can split HugePage, only THP.
> 
Right, that's my impression too.

>> for compound pages but when !is_transparent_hugepage(head) or just 
>> PageHuge(head) like:
>>
>> +   if (!is_transparent_hugepage(head) && PageCompound(page))
>> +   i += (compound_nr(head) - (pages[i] - head));
>>
>> Or making specific to hugetlbfs:
>>
>> +   if (PageHuge(head))
>> +   i += (compound_nr(head) - (pages[i] - head));
> 
> Yes, this is reasonable optimization. I will submit a follow up patch
> against linux-next.


Re: [PATCH v8 02/14] mm/gup: check every subpage of a compound page during isolation

2021-02-03 Thread Joao Martins
On 1/25/21 7:47 PM, Pavel Tatashin wrote:
> When pages are isolated in check_and_migrate_movable_pages() we skip
> compound number of pages at a time. However, as Jason noted, it is
> not necessary correct that pages[i] corresponds to the pages that
> we skipped. This is because it is possible that the addresses in
> this range had split_huge_pmd()/split_huge_pud(), and these functions
> do not update the compound page metadata.
> 
> The problem can be reproduced if something like this occurs:
> 
> 1. User faulted huge pages.
> 2. split_huge_pmd() was called for some reason
> 3. User has unmapped some sub-pages in the range
> 4. User tries to longterm pin the addresses.
> 
> The resulting pages[i] might end-up having pages which are not compound
> size page aligned.
> 
> Fixes: aa712399c1e8 ("mm/gup: speed up check_and_migrate_cma_pages() on huge 
> page")
> Reported-by: Jason Gunthorpe 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Jason Gunthorpe 
> ---

[...]

>   /*
>* If we get a page from the CMA zone, since we are going to
>* be pinning these entries, we might as well move them out
> @@ -1599,8 +1596,6 @@ static long check_and_migrate_cma_pages(struct 
> mm_struct *mm,
>   }
>   }
>   }
> -
> - i += step;
>   }
>  

With this, longterm gup will 'regress' for hugetlbfs e.g. from ~6k -> 32k usecs 
when
pinning a 16G hugetlb file.

Splitting can only occur on THP, right? If so, perhaps we could retain the @step
increment for compound pages, but only when !is_transparent_hugepage(head), or
just for PageHuge(head), like:

+   if (!is_transparent_hugepage(head) && PageCompound(page))
+   i += (compound_nr(head) - (pages[i] - head));

Or making specific to hugetlbfs:

+   if (PageHuge(head))
+   i += (compound_nr(head) - (pages[i] - head));
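
To make the idea concrete, here is a rough sketch of how the loop in
check_and_migrate_cma_pages() could look with that hugetlb-only skip (purely
illustrative; @i/@pages/@nr_pages are the function's existing variables, and the
assumption is that only hugetlb pages are guaranteed to be unsplittable):

	for (i = 0; i < nr_pages;) {
		struct page *head = compound_head(pages[i]);

		/* ... existing CMA / movable isolation logic on @head ... */

		/*
		 * hugetlb pages cannot be split, so pages[i] up to the end of
		 * @head can be skipped in one go; THP may have been split, so
		 * walk those page by page.
		 */
		if (PageHuge(head))
			i += compound_nr(head) - (pages[i] - head);
		else
			i++;
	}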


Re: dax alignment problem on arm64 (and other architectures)

2021-01-29 Thread Joao Martins



On 1/29/21 4:32 PM, Pavel Tatashin wrote:
> On Fri, Jan 29, 2021 at 9:51 AM Joao Martins  
> wrote:
>>
>> Hey Pavel,
>>
>> On 1/29/21 1:50 PM, Pavel Tatashin wrote:
>>>> Since we last talked about this the enabling for EFI "Special Purpose"
>>>> / Soft Reserved Memory has gone upstream and instantiates device-dax
>>>> instances for address ranges marked with EFI_MEMORY_SP attribute.
>>>> Critically this way of declaring device-dax removes the consideration
>>>> of it as persistent memory and as such no metadata reservation. So, if
>>>> you are willing to maintain the metadata external to the device (which
>>>> seems reasonable for your environment) and have your platform firmware
>>>> / kernel command line mark it as EFI_CONVENTIONAL_MEMORY +
>>>> EFI_MEMORY_SP, then these reserve-free dax-devices will surface.
>>>
>>> Hi Dan,
>>>
>>> This is cool. Does it allow conversion between devdax and fsdax so DAX
>>> aware filesystem can be installed and data can be put there to be
>>> preserved across the reboot?
>>>
>>
>> fwiw wrt to the 'preserved across kexec' part, you are going to need
>> something conceptually similar to snippet below the scissors mark.
>> Alternatively, we could fix kexec userspace to add conventional memory
>> ranges (without the SP attribute part) when it sees a Soft-Reserved region.
>> But can't tell which one is the right thing to do.
> 
> Hi Joao,
> 
> Is not it just a matter of appending arguments to the kernel parameter
> during kexec reboot with Soft-Reserved region specified, or am I
> missing something? I understand with fileload kexec syscall we might
> accidently load segments onto reserved region, but with the original
> kexec syscall, where we can specify destinations for each segment that
> should not be a problem with today's kexec tools.
> 
efi_fake_mem only works with EFI_MEMMAP conventional memory ranges, thus
not having an EFI_MEMMAP with RAM ranges means it's a nop for the soft-reserved
regions. Unless you are trying to suggest something like:

memmap=%+0xefff

... to mark soft-reserved on top of existing RAM? Sadly I don't know if there's
an equivalent for ARM.


> I agree that preserving it automatically as you are proposing, would
> make more sense, instead of fiddling with kernel parameters and
> segment destinations.
> 
> Thank you,
> Pasha
> 
>>
>> At the moment, HMAT ranges (or those defined with efi_fake_mem=) aren't
>> preserved not because of anything special with HMAT, but simply because
>> the EFI memmap conventional ram ranges are not preserved (only runtime
>> services). And HMAT/efi_fake_mem expects these to based on EFI memmap.
>>

[snip]


[PATCH v2 0/2] mm/hugetlb: follow_hugetlb_page() improvements

2021-01-28 Thread Joao Martins
Hey,

While looking at ZONE_DEVICE struct page reuse, particularly the last
patch[0], I found two possible improvements for follow_hugetlb_page(),
which is solely used for get_user_pages()/pin_user_pages().

The first patch batches page refcount updates while the second tidies
up storing the subpages/vmas. Together they bring the cost of the slow
variant of gup() from ~87.6k usecs down to ~5.8k usecs.

libhugetlbfs tests seem to pass, as do the gup_test benchmarks
with hugetlbfs vmas.

v2:
  * switch from refs++ to ++refs;
  * add Mike's Rb on patch 1;
  * switch from page++ to mem_map_offset() on the second patch;
  
[0] 
https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.mart...@oracle.com/

Joao Martins (2):
  mm/hugetlb: grab head page refcount once for group of subpages
  mm/hugetlb: refactor subpage recording

 include/linux/mm.h |  3 +++
 mm/gup.c   |  5 ++--
 mm/hugetlb.c   | 66 +++---
 3 files changed, 44 insertions(+), 30 deletions(-)

-- 
2.17.1



[PATCH v2 1/2] mm/hugetlb: grab head page refcount once for group of subpages

2021-01-28 Thread Joao Martins
Once follow_hugetlb_page() locks the pmd/pud, it checks all N subpages
in a huge page and grabs a reference for each one.
Similar to gup-fast, have follow_hugetlb_page() grab the head
page refcount only after counting all the subpages that are part
of the just-faulted huge page.

Consequently we reduce the number of atomics necessary to pin
said huge page, which improves non-fast gup() considerably:

  - 16G with 1G huge page size
  gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w

PIN_LONGTERM_BENCHMARK: ~87.6k us -> ~12.8k us

Reviewed-by: Mike Kravetz 
Signed-off-by: Joao Martins 
---
 include/linux/mm.h |  3 +++
 mm/gup.c   |  5 ++---
 mm/hugetlb.c   | 43 ---
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5d618d08506..0d793486822b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1182,6 +1182,9 @@ static inline void get_page(struct page *page)
 }
 
 bool __must_check try_grab_page(struct page *page, unsigned int flags);
+__maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
+  unsigned int flags);
+
 
 static inline __must_check bool try_get_page(struct page *page)
 {
diff --git a/mm/gup.c b/mm/gup.c
index 3e086b073624..ecadc80934b2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -79,9 +79,8 @@ static inline struct page *try_get_compound_head(struct page 
*page, int refs)
  * considered failure, and furthermore, a likely bug in the caller, so a 
warning
  * is also emitted.
  */
-static __maybe_unused struct page *try_grab_compound_head(struct page *page,
- int refs,
- unsigned int flags)
+__maybe_unused struct page *try_grab_compound_head(struct page *page,
+  int refs, unsigned int flags)
 {
if (flags & FOLL_GET)
return try_get_compound_head(page, refs);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a6bad1f686c5..becef936ec21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4798,7 +4798,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
unsigned long vaddr = *position;
unsigned long remainder = *nr_pages;
struct hstate *h = hstate_vma(vma);
-   int err = -EFAULT;
+   int err = -EFAULT, refs;
 
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
@@ -4918,26 +4918,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
continue;
}
 
+   refs = 0;
+
 same_page:
-   if (pages) {
+   if (pages)
pages[i] = mem_map_offset(page, pfn_offset);
-   /*
-* try_grab_page() should always succeed here, because:
-* a) we hold the ptl lock, and b) we've just checked
-* that the huge page is present in the page tables. If
-* the huge page is present, then the tail pages must
-* also be present. The ptl prevents the head page and
-* tail pages from being rearranged in any way. So this
-* page must be available at this point, unless the page
-* refcount overflowed:
-*/
-   if (WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
-   spin_unlock(ptl);
-   remainder = 0;
-   err = -ENOMEM;
-   break;
-   }
-   }
 
if (vmas)
vmas[i] = vma;
@@ -4946,6 +4931,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
++pfn_offset;
--remainder;
++i;
+   ++refs;
if (vaddr < vma->vm_end && remainder &&
pfn_offset < pages_per_huge_page(h)) {
/*
@@ -4953,6 +4939,25 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * of this compound page.
 */
goto same_page;
+   } else if (pages) {
+   /*
+* try_grab_compound_head() should always succeed here,
+* because: a) we hold the ptl lock, and b) we've just
+* checked that the huge page is present in the page
+* tables. If the huge page is present, then the tail
+* pages must also be present. The ptl prevents the
+

[PATCH v2 2/2] mm/hugetlb: refactor subpage recording

2021-01-28 Thread Joao Martins
For a given hugepage backing a VA, there's a rather inefficient
loop which is solely responsible for storing subpages in the GUP
@pages/@vmas arrays. For each subpage we check whether it's within
range or size of @pages and keep incrementing @pfn_offset and a couple
of other variables per subpage iteration.

Simplify this logic and minimize the cost of each iteration to just
store the output page/vma. Instead of incrementing number of @refs
iteratively, we do it through pre-calculation of @refs and only
with a tight loop for storing pinned subpages/vmas.

Additionally, retain existing behaviour with using mem_map_offset()
when recording the subpages for configurations that don't have a
contiguous mem_map.

Pinning consequently improves, bringing us close to
{pin,get}_user_pages_fast:

  - 16G with 1G huge page size
  gup_test -f /mnt/huge/file -m 16384 -r 30 -L -S -n 512 -w

PIN_LONGTERM_BENCHMARK: ~12.8k us -> ~5.8k us
PIN_FAST_BENCHMARK: ~3.7k us

Signed-off-by: Joao Martins 
---
 mm/hugetlb.c | 49 -
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index becef936ec21..f3baabbda432 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4789,6 +4789,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
goto out;
 }
 
+static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
+int refs, struct page **pages,
+struct vm_area_struct **vmas)
+{
+   int nr;
+
+   for (nr = 0; nr < refs; nr++) {
+   if (likely(pages))
+   pages[nr] = mem_map_offset(page, nr);
+   if (vmas)
+   vmas[nr] = vma;
+   }
+}
+
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 struct page **pages, struct vm_area_struct **vmas,
 unsigned long *position, unsigned long *nr_pages,
@@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
continue;
}
 
-   refs = 0;
+   refs = min3(pages_per_huge_page(h) - pfn_offset,
+   (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
 
-same_page:
-   if (pages)
-   pages[i] = mem_map_offset(page, pfn_offset);
+   if (pages || vmas)
+   record_subpages_vmas(mem_map_offset(page, pfn_offset),
+vma, refs,
+likely(pages) ? pages + i : NULL,
+vmas ? vmas + i : NULL);
 
-   if (vmas)
-   vmas[i] = vma;
-
-   vaddr += PAGE_SIZE;
-   ++pfn_offset;
-   --remainder;
-   ++i;
-   ++refs;
-   if (vaddr < vma->vm_end && remainder &&
-   pfn_offset < pages_per_huge_page(h)) {
-   /*
-* We use pfn_offset to avoid touching the pageframes
-* of this compound page.
-*/
-   goto same_page;
-   } else if (pages) {
+   if (pages) {
/*
 * try_grab_compound_head() should always succeed here,
 * because: a) we hold the ptl lock, and b) we've just
@@ -4950,7 +4952,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * any way. So this page must be available at this
 * point, unless the page refcount overflowed:
 */
-   if (WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
+   if (WARN_ON_ONCE(!try_grab_compound_head(pages[i],
 refs,
 flags))) {
spin_unlock(ptl);
@@ -4959,6 +4961,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
break;
}
}
+
+   vaddr += (refs << PAGE_SHIFT);
+   remainder -= refs;
+   i += refs;
+
spin_unlock(ptl);
}
*nr_pages = remainder;
-- 
2.17.1



Re: [PATCH 2/2] mm/hugetlb: refactor subpage recording

2021-01-26 Thread Joao Martins
On 1/26/21 6:08 PM, Mike Kravetz wrote:
> On 1/25/21 12:57 PM, Joao Martins wrote:
>> For a given hugepage backing a VA, there's a rather ineficient
>> loop which is solely responsible for storing subpages in the passed
>> pages/vmas array. For each subpage we check whether it's within
>> range or size of @pages and keep incrementing @pfn_offset and a couple
>> other variables per subpage iteration.
>>
>> Simplify this logic and minimize ops per iteration to just
>> store the output page/vma. Instead of incrementing number of @refs
>> iteratively, we do it through a precalculation of @refs and having
>> only a tight loop for storing pinned subpages/vmas.
>>
>> pinning consequently improves considerably, bringing us close to
>> {pin,get}_user_pages_fast:
>>
>>  - 16G with 1G huge page size
>>  gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w
>>
>>  PIN_LONGTERM_BENCHMARK: ~11k us -> ~4400 us
>>  PIN_FAST_BENCHMARK: ~3700 us
>>
>> Signed-off-by: Joao Martins 
>> ---
>>  mm/hugetlb.c | 49 -
>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 016addc8e413..1f7a95bc7c87 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4789,6 +4789,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>>  goto out;
>>  }
>>  
>> +static void record_subpages_vmas(struct page *page, struct vm_area_struct 
>> *vma,
>> + int refs, struct page **pages,
>> + struct vm_area_struct **vmas)
>> +{
>> +int nr;
>> +
>> +for (nr = 0; nr < refs; nr++) {
>> +if (likely(pages))
>> +pages[nr] = page++;
>> +if (vmas)
>> +vmas[nr] = vma;
>> +}
>> +}
>> +
>>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>   struct page **pages, struct vm_area_struct **vmas,
>>   unsigned long *position, unsigned long *nr_pages,
>> @@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, 
>> struct vm_area_struct *vma,
>>  continue;
>>  }
>>  
>> -refs = 0;
>> +refs = min3(pages_per_huge_page(h) - pfn_offset,
>> +(vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
>>  
>> -same_page:
>> -if (pages)
>> -pages[i] = mem_map_offset(page, pfn_offset);
>> +if (pages || vmas)
>> +record_subpages_vmas(mem_map_offset(page, pfn_offset),
> 
> The assumption made here is that mem_map is contiguous for the range of
> pages in the hugetlb page.  I do not believe you can make this assumption
> for (gigantic) hugetlb pages which are > MAX_ORDER_NR_PAGES.  For example,
> 

That would mean get_user_pages_fast() and put_user_pages_fast() are broken for 
anything
handling PUDs or above? See record_subpages() in gup_huge_pud() or even 
gup_huge_pgd().
It's using the same page++.
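
For reference, gup-fast's record_subpages() in mm/gup.c looks roughly like the
below at this point in time -- the same struct page pointer arithmetic:

static int record_subpages(struct page *page, unsigned long addr,
			   unsigned long end, struct page **pages)
{
	int nr;

	/* relies on the struct pages of the huge page being contiguous */
	for (nr = 0; addr != end; addr += PAGE_SIZE)
		pages[nr++] = page++;

	return nr;
}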

The adjustment below is probably what you're trying to suggest.

Also, nth_page() is slightly more expensive and so the numbers above change 
from ~4.4k
usecs to ~7.8k usecs.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1f7a95bc7c87..cf66f8c2f92a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4789,15 +4789,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
goto out;
 }

-static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
+static void record_subpages_vmas(struct page *page, unsigned long pfn_offset,
+struct vm_area_struct *vma,
 int refs, struct page **pages,
 struct vm_area_struct **vmas)
 {
-   int nr;
+   unsigned long nr;

for (nr = 0; nr < refs; nr++) {
if (likely(pages))
-   pages[nr] = page++;
+   pages[nr] = mem_map_offset(page, pfn_offset + nr);
if (vmas)
vmas[nr] = vma;
}
@@ -4936,8 +4937,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct
*vma,
(vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);

if (pages || vmas)
-   record_subpages_vmas(mem_map_offset(page, pfn_offset),
-vma, refs,
+   record_subpages_vmas(page, pfn_offset, vma, refs,
 likely(pages) ? pages + i : NULL,
 vmas ? vmas + i : NULL);


Re: [PATCH 2/2] mm/hugetlb: refactor subpage recording

2021-01-26 Thread Joao Martins
On 1/26/21 9:21 PM, Mike Kravetz wrote:
> On 1/26/21 11:21 AM, Joao Martins wrote:
>> On 1/26/21 6:08 PM, Mike Kravetz wrote:
>>> On 1/25/21 12:57 PM, Joao Martins wrote:
>>>>
>>>> +static void record_subpages_vmas(struct page *page, struct vm_area_struct 
>>>> *vma,
>>>> +   int refs, struct page **pages,
>>>> +   struct vm_area_struct **vmas)
>>>> +{
>>>> +  int nr;
>>>> +
>>>> +  for (nr = 0; nr < refs; nr++) {
>>>> +  if (likely(pages))
>>>> +  pages[nr] = page++;
>>>> +  if (vmas)
>>>> +  vmas[nr] = vma;
>>>> +  }
>>>> +}
>>>> +
>>>>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>>> struct page **pages, struct vm_area_struct **vmas,
>>>> unsigned long *position, unsigned long *nr_pages,
>>>> @@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, 
>>>> struct vm_area_struct *vma,
>>>>continue;
>>>>}
>>>>  
>>>> -  refs = 0;
>>>> +  refs = min3(pages_per_huge_page(h) - pfn_offset,
>>>> +  (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
>>>>  
>>>> -same_page:
>>>> -  if (pages)
>>>> -  pages[i] = mem_map_offset(page, pfn_offset);
>>>> +  if (pages || vmas)
>>>> +  record_subpages_vmas(mem_map_offset(page, pfn_offset),
>>>
>>> The assumption made here is that mem_map is contiguous for the range of
>>> pages in the hugetlb page.  I do not believe you can make this assumption
>>> for (gigantic) hugetlb pages which are > MAX_ORDER_NR_PAGES.  For example,
>>>
> 
> Thinking about this a bit more ...
> 
> mem_map can be accessed contiguously if we have a virtual memmap.  Correct?

Right.

> I suspect virtual memmap may be the most common configuration today.  However,
> it seems we do need to handle other configurations.
> 

At the moment, for offsets >= MAX_ORDER_NR_PAGES, mem_map_offset(page, n) in
turn does:

pfn_to_page(page_to_pfn(page) + n)

For CONFIG_SPARSEMEM_VMEMMAP or FLATMEM this will resolve into something like:

vmemmap + ((page - vmemmap) + n)

It isn't really different from incrementing the @page.

I can only think of CONFIG_SPARSEMEM (without VMEMMAP) and CONFIG_DISCONTIGMEM
as the offending cases, which respectively look into the section info or the
pgdat.

[CONFIG_DISCONTIGMEM isn't auto-selected by any arch at the moment.]
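
For reference, the two helpers being discussed are roughly the following
(include/linux/mm.h and mm/internal.h respectively):

#define nth_page(page, n)	pfn_to_page(page_to_pfn((page)) + (n))

/* Handle a possible mem_map discontiguity at MAX_ORDER_NR_PAGES boundaries. */
static inline struct page *mem_map_offset(struct page *base, int offset)
{
	if (unlikely(offset >= MAX_ORDER_NR_PAGES))
		return nth_page(base, offset);
	return base + offset;
}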

>> That would mean get_user_pages_fast() and pin_user_pages_fast() are broken 
>> for anything
>> handling PUDs or above? See record_subpages() in gup_huge_pud() or even 
>> gup_huge_pgd().
>> It's using the same page++.
> 
> Yes, I believe those would also have the issue.
> Cc: John and Jason as they have spent a significant amount of time in gup
> code recently.  There may be something that makes that code safe?
> 
Maybe -- Looking back, gup-fast has always relied on that page pointer 
arithmetic, even
before its refactors around __record_subpages() and what not.

>> This adjustment below probably is what you're trying to suggest.
>>
>> Also, nth_page() is slightly more expensive and so the numbers above change 
>> from ~4.4k
>> usecs to ~7.8k usecs.
> 
> If my thoughts about virtual memmap are correct, then could we simply have
> a !vmemmap version of mem_map_offset (or similar routine) to avoid overhead?

In that case, we could ifdef out on SPARSEMEM || DISCONTIGMEM for
mem_map_offset(), either internally or within the helper I added for
follow_hugetlb_page().
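
A minimal sketch of that ifdef, assuming it sits next to the
record_subpages_vmas() helper (the helper name below is hypothetical):

/*
 * With SPARSEMEM (without VMEMMAP) or DISCONTIGMEM the mem_map may be
 * discontiguous at MAX_ORDER_NR_PAGES boundaries, so fall back to
 * nth_page(); otherwise plain pointer arithmetic is fine.
 */
static inline struct page *hugetlb_subpage(struct page *base, unsigned long offset)
{
#if defined(CONFIG_DISCONTIGMEM) || \
	(defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP))
	return nth_page(base, offset);
#else
	return base + offset;
#endif
}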


Re: [PATCH 2/2] mm/hugetlb: refactor subpage recording

2021-01-26 Thread Joao Martins



On 1/26/21 7:21 PM, Joao Martins wrote:
> On 1/26/21 6:08 PM, Mike Kravetz wrote:
>> On 1/25/21 12:57 PM, Joao Martins wrote:
>>> For a given hugepage backing a VA, there's a rather ineficient
>>> loop which is solely responsible for storing subpages in the passed
>>> pages/vmas array. For each subpage we check whether it's within
>>> range or size of @pages and keep incrementing @pfn_offset and a couple
>>> other variables per subpage iteration.
>>>
>>> Simplify this logic and minimize ops per iteration to just
>>> store the output page/vma. Instead of incrementing number of @refs
>>> iteratively, we do it through a precalculation of @refs and having
>>> only a tight loop for storing pinned subpages/vmas.
>>>
>>> pinning consequently improves considerably, bringing us close to
>>> {pin,get}_user_pages_fast:
>>>
>>>  - 16G with 1G huge page size
>>>  gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w
>>>
>>>  PIN_LONGTERM_BENCHMARK: ~11k us -> ~4400 us
>>>  PIN_FAST_BENCHMARK: ~3700 us
>>>
>>> Signed-off-by: Joao Martins 
>>> ---
>>>  mm/hugetlb.c | 49 -
>>>  1 file changed, 28 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 016addc8e413..1f7a95bc7c87 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4789,6 +4789,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct 
>>> *dst_mm,
>>> goto out;
>>>  }
>>>  
>>> +static void record_subpages_vmas(struct page *page, struct vm_area_struct 
>>> *vma,
>>> +int refs, struct page **pages,
>>> +struct vm_area_struct **vmas)
>>> +{
>>> +   int nr;
>>> +
>>> +   for (nr = 0; nr < refs; nr++) {
>>> +   if (likely(pages))
>>> +   pages[nr] = page++;
>>> +   if (vmas)
>>> +   vmas[nr] = vma;
>>> +   }
>>> +}
>>> +
>>>  long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>>  struct page **pages, struct vm_area_struct **vmas,
>>>  unsigned long *position, unsigned long *nr_pages,
>>> @@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, 
>>> struct vm_area_struct *vma,
>>> continue;
>>> }
>>>  
>>> -   refs = 0;
>>> +   refs = min3(pages_per_huge_page(h) - pfn_offset,
>>> +   (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
>>>  
>>> -same_page:
>>> -   if (pages)
>>> -   pages[i] = mem_map_offset(page, pfn_offset);
>>> +   if (pages || vmas)
>>> +   record_subpages_vmas(mem_map_offset(page, pfn_offset),
>>
>> The assumption made here is that mem_map is contiguous for the range of
>> pages in the hugetlb page.  I do not believe you can make this assumption
>> for (gigantic) hugetlb pages which are > MAX_ORDER_NR_PAGES.  For example,
>>
> 
> That would mean get_user_pages_fast() and put_user_pages_fast() are broken 
> for anything
> handling PUDs or above? See record_subpages() in gup_huge_pud() or even 
> gup_huge_pgd().
> It's using the same page++.
> 

Err ... I meant pin_user_pages_fast(), sorry about that.

> This adjustment below probably is what you're trying to suggest.
> 
> Also, nth_page() is slightly more expensive and so the numbers above change 
> from ~4.4k
> usecs to ~7.8k usecs.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1f7a95bc7c87..cf66f8c2f92a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4789,15 +4789,16 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
> goto out;
>  }
> 
> -static void record_subpages_vmas(struct page *page, struct vm_area_struct 
> *vma,
> +static void record_subpages_vmas(struct page *page, unsigned long pfn_offset,
> +struct vm_area_struct *vma,
>  int refs, struct page **pages,
>  struct vm_area_struct **vmas)
>  {
> -   int nr;
> +   unsigned long nr;
> 
> for (nr = 0; nr < refs; nr++) {
> if (likely(pages))
> -   pages[nr] = page++;
> +   pages[nr] = mem_map_offset(page, pfn_offset + nr);
> if (vmas)
> vmas[nr] = vma;
> }
> @@ -4936,8 +4937,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
> vm_area_struct
> *vma,
> (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
> 
> if (pages || vmas)
> -   record_subpages_vmas(mem_map_offset(page, pfn_offset),
> -vma, refs,
> +   record_subpages_vmas(page, pfn_offset, vma, refs,
>  likely(pages) ? pages + i : NULL,
>  vmas ? vmas + i : NULL);
> 


[PATCH 1/2] mm/hugetlb: grab head page refcount once per group of subpages

2021-01-25 Thread Joao Martins
Once follow_hugetlb_page() locks the pmd/pud, it checks all the
subpages in a huge page and grabs a reference for each one,
depending on how many pages we can store and the size of the VA range.
Similar to gup-fast, have follow_hugetlb_page() grab the head
page refcount only after counting all its subpages that are part
of the just faulted huge page.

Consequently we reduce the number of atomics necessary to pin
said huge page, which improves non-fast gup() considerably:

 - 16G with 1G huge page size
 gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w

 PIN_LONGTERM_BENCHMARK: ~87.6k us -> ~11k us

Signed-off-by: Joao Martins 
---
 include/linux/mm.h |  3 +++
 mm/gup.c   |  5 ++---
 mm/hugetlb.c   | 43 ---
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5d618d08506..0d793486822b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1182,6 +1182,9 @@ static inline void get_page(struct page *page)
 }
 
 bool __must_check try_grab_page(struct page *page, unsigned int flags);
+__maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
+  unsigned int flags);
+
 
 static inline __must_check bool try_get_page(struct page *page)
 {
diff --git a/mm/gup.c b/mm/gup.c
index 3e086b073624..ecadc80934b2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -79,9 +79,8 @@ static inline struct page *try_get_compound_head(struct page 
*page, int refs)
  * considered failure, and furthermore, a likely bug in the caller, so a 
warning
  * is also emitted.
  */
-static __maybe_unused struct page *try_grab_compound_head(struct page *page,
- int refs,
- unsigned int flags)
+__maybe_unused struct page *try_grab_compound_head(struct page *page,
+  int refs, unsigned int flags)
 {
if (flags & FOLL_GET)
return try_get_compound_head(page, refs);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a6bad1f686c5..016addc8e413 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4798,7 +4798,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
unsigned long vaddr = *position;
unsigned long remainder = *nr_pages;
struct hstate *h = hstate_vma(vma);
-   int err = -EFAULT;
+   int err = -EFAULT, refs;
 
while (vaddr < vma->vm_end && remainder) {
pte_t *pte;
@@ -4918,26 +4918,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
continue;
}
 
+   refs = 0;
+
 same_page:
-   if (pages) {
+   if (pages)
pages[i] = mem_map_offset(page, pfn_offset);
-   /*
-* try_grab_page() should always succeed here, because:
-* a) we hold the ptl lock, and b) we've just checked
-* that the huge page is present in the page tables. If
-* the huge page is present, then the tail pages must
-* also be present. The ptl prevents the head page and
-* tail pages from being rearranged in any way. So this
-* page must be available at this point, unless the page
-* refcount overflowed:
-*/
-   if (WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
-   spin_unlock(ptl);
-   remainder = 0;
-   err = -ENOMEM;
-   break;
-   }
-   }
 
if (vmas)
vmas[i] = vma;
@@ -4946,6 +4931,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
++pfn_offset;
--remainder;
++i;
+   refs++;
if (vaddr < vma->vm_end && remainder &&
pfn_offset < pages_per_huge_page(h)) {
/*
@@ -4953,6 +4939,25 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * of this compound page.
 */
goto same_page;
+   } else if (pages) {
+   /*
+* try_grab_compound_head() should always succeed here,
+* because: a) we hold the ptl lock, and b) we've just
+* checked that the huge page is present in the page
+* tables. If the huge page is present, then the tail
+

[PATCH 2/2] mm/hugetlb: refactor subpage recording

2021-01-25 Thread Joao Martins
For a given hugepage backing a VA, there's a rather inefficient
loop which is solely responsible for storing subpages in the passed
pages/vmas array. For each subpage we check whether it's within
range or size of @pages and keep incrementing @pfn_offset and a couple
other variables per subpage iteration.

Simplify this logic and minimize ops per iteration to just
store the output page/vma. Instead of incrementing number of @refs
iteratively, we do it through a precalculation of @refs and having
only a tight loop for storing pinned subpages/vmas.

Pinning consequently improves considerably, bringing us close to
{pin,get}_user_pages_fast:

 - 16G with 1G huge page size
 gup_test -f /mnt/huge/file -m 16384 -r 10 -L -S -n 512 -w

 PIN_LONGTERM_BENCHMARK: ~11k us -> ~4400 us
 PIN_FAST_BENCHMARK: ~3700 us

Signed-off-by: Joao Martins 
---
 mm/hugetlb.c | 49 -
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 016addc8e413..1f7a95bc7c87 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4789,6 +4789,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
goto out;
 }
 
+static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
+int refs, struct page **pages,
+struct vm_area_struct **vmas)
+{
+   int nr;
+
+   for (nr = 0; nr < refs; nr++) {
+   if (likely(pages))
+   pages[nr] = page++;
+   if (vmas)
+   vmas[nr] = vma;
+   }
+}
+
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 struct page **pages, struct vm_area_struct **vmas,
 unsigned long *position, unsigned long *nr_pages,
@@ -4918,28 +4932,16 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
continue;
}
 
-   refs = 0;
+   refs = min3(pages_per_huge_page(h) - pfn_offset,
+   (vma->vm_end - vaddr) >> PAGE_SHIFT, remainder);
 
-same_page:
-   if (pages)
-   pages[i] = mem_map_offset(page, pfn_offset);
+   if (pages || vmas)
+   record_subpages_vmas(mem_map_offset(page, pfn_offset),
+vma, refs,
+likely(pages) ? pages + i : NULL,
+vmas ? vmas + i : NULL);
 
-   if (vmas)
-   vmas[i] = vma;
-
-   vaddr += PAGE_SIZE;
-   ++pfn_offset;
-   --remainder;
-   ++i;
-   refs++;
-   if (vaddr < vma->vm_end && remainder &&
-   pfn_offset < pages_per_huge_page(h)) {
-   /*
-* We use pfn_offset to avoid touching the pageframes
-* of this compound page.
-*/
-   goto same_page;
-   } else if (pages) {
+   if (pages) {
/*
 * try_grab_compound_head() should always succeed here,
 * because: a) we hold the ptl lock, and b) we've just
@@ -4950,7 +4952,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 * any way. So this page must be available at this
 * point, unless the page refcount overflowed:
 */
-   if (WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
+   if (WARN_ON_ONCE(!try_grab_compound_head(pages[i],
 refs,
 flags))) {
spin_unlock(ptl);
@@ -4959,6 +4961,11 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
break;
}
}
+
+   vaddr += (refs << PAGE_SHIFT);
+   remainder -= refs;
+   i += refs;
+
spin_unlock(ptl);
}
*nr_pages = remainder;
-- 
2.17.1



[PATCH 0/2] mm/hugetlb: follow_hugetlb_page() improvements

2021-01-25 Thread Joao Martins
Hey,

While looking at ZONE_DEVICE struct page reuse, particularly the last
patch[0], I found two possible improvements for follow_hugetlb_page(),
which is solely used for get_user_pages()/pin_user_pages().

The first patch batches page refcount updates while the second tidies
up storing the subpages/vmas. Together they bring the cost of the slow
variant of gup() from ~86k usecs down to ~4.4k usecs.

libhugetlbfs tests seem to pass, as do the gup_test benchmarks
with hugetlbfs vmas.

[0] 
https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.mart...@oracle.com/

Joao Martins (2):
  mm/hugetlb: grab head page refcount once per group of subpages
  mm/hugetlb: refactor subpage recording

 include/linux/mm.h |  3 +++
 mm/gup.c   |  5 ++--
 mm/hugetlb.c   | 66 +++---
 3 files changed, 44 insertions(+), 30 deletions(-)

-- 
2.17.1



Re: [LSFMMBPF 2021] A status update

2021-01-05 Thread Joao Martins
On 12/12/20 5:29 PM, Matthew Wilcox wrote:
> And most urgently, when should we have the GUP meeting?  On the call,
> I suggested Friday the 8th of January, but I'm happy to set something
> up for next week if we'd like to talk more urgently.  Please propose a
> date & time.  I know we have people in Portugal and Nova Scotia who need
> to be involved live, so a time friendly to UTC+0 and UTC-4 would be good.

FWIW, I would suggest the same time as you had for PageFolio (18h GMT / 10am PT
/ 13h ET), given it covers many timezones at a not-so-bothersome time.

But instead of Jan 8, perhaps next week (Jan 15) would be better, in case folks
are still on new year holidays (given we are in the first week of the year).

Joao


Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2021-01-05 Thread Joao Martins
On 1/1/21 2:33 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>>> But if the kernel is going to short-circuit the IPIs and VIRQs, then
>>> it's already going to have to handle the evtchn_pending/evtchn_mask
>>> bitmaps, and actually injecting interrupts.
>>>
>>
>> Right. I was trying to point that out in the discussion we had
>> in next patch. But true be told, more about touting the idea of kernel
>> knowing if a given event channel is registered for userspace handling,
>> rather than fully handling the event channel.
>>
>> I suppose we are able to provide both options to the VMM anyway
>> i.e. 1) letting them handle it enterily in userspace by intercepting
>> EVTCHNOP_send, or through the irq route if we want kernel to offload it.
>>
>>> Given that it has to have that functionality anyway, it seems saner to
>>> let the kernel have full control over it and to just expose
>>> 'evtchn_send' to userspace.
>>>
>>> The alternative is to have userspace trying to play along with the
>>> atomic handling of those bitmasks too 
>>
>> That part is not particularly hard -- having it done already.
> 
> Right, for 2-level it works out fairly well. I like your model of
> installing { vcpu_id, port } into the KVM IRQ routing table and that's
> enough for the kernel to raise the event channels that it *needs* to
> know about, while userspace can do the others for itself. It's just
> atomic test-and-set bitop stuff with no real need for coordination.
> 
> For FIFO event channels it gets more fun, because the updates aren't
> truly atomic — they require a spinlock around the three operations that
> the host has to perform when linking an event into a queue: 
> 
>  • Set the new port's LINKED bit
>  • Set the previous head's LINK to point to the new port
>  • Store the new port# as the head.
> 
> One option might be to declare that for FIFO, all updates for a given
> queue *must* be handled either by the kernel, or by userspace, and
> there's sharing of control.
> 
> Or maybe there's a way to keep the kernel side simple by avoiding the
> tail handling completely. Surely we only really care about kernel
> handling of the *fast* path, where a single event channel is triggered
> and handled immediately? In the high-latency case where we're gathering
> multiple events in a queue before the guest ever gets to process them, 
> we might as well let that be handled by userspace, right?
> 
> So in that case, perhaps the kernel part could forget all the horrid
> nonsense with tracking the tail of the queue. It would handle the event
> in-kernel *only* in the case where the event is the *first* in the
> queue, and the head was previously zero?
> 
> But even that isn't a simple atomic operation though; we still have to
> mark the event LINKED, then update the head pointer to reference it.
> And what if we set the 'LINKED' bit but then find that userspace has
> already linked another port so ->head is no longer zero?
> 
> Can we just clear the LINKED bit and then punt the whole event for
> userspace to (re)handle? Or raise a special event to userspace so it
> knows it needs to go ahead and link the port even though its LINKED bit
> has already been set?
> 
> None of the available options really fill me with joy; I'm somewhat
> inclined just to declare that the kernel won't support acceleration of
> FIFO event channels at all.
> 
> None of which matters a *huge* amount right now if I was only going to
> leave that as a future optimisation anyway.
> 
ACK.

> What it does actually mean in the short term is that as I update your
> KVM_IRQ_ROUTING_XEN_EVTCHN support, I probably *won't* bother to add a
> 'priority' field to struct kvm_irq_routing_xen_evtchn to make it
> extensible to FIFO event channels. We can always add that later.
> 
> Does that seem reasonable?
> 
Yes, makes sense IMHO. Guests need to fall back to 2L anyway if the
evtchnop_init_control hypercall fails, and the way we are handling events
doesn't warrant making FIFO event channel support mandatory.

Despite the many FIFO event features, IIRC the main driving motivation
was to go beyond the 1K/4K port limit for 32b/64b guests, to a maximum of
128K ports per guest instead. But that was mostly a limit for Domain-0, as it
hosts all the backend handling in the traditional (i.e. without driver
domains) deployment, therefore limiting how many vdevs one could host in
the system. And in dense VM consolidation cases where you host 1K
guests with e.g. 3 block devices and 1 network interface each, one would
quickly run out of interdomain event channel ports in Dom0. But that is not
the case here.

We are anyway ABI-limited to HVM_MAX_VCPUS (128), and if we assume
6 event channels per vCPU for the legacy guest (4 IPI evts, 1 timer,
1 debug) then we still have 3328 ports left for interdomain event
channels for a 128-vCPU HVM guest ... when using the 2L event channels ABI.
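
(For the record, the arithmetic under the 2-level ABI for a 64-bit guest:
64 words * 64 bits = 4096 ports total; 128 vCPUs * 6 per-vCPU channels = 768;
4096 - 768 = 3328 ports left for interdomain event channels.)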

Joao


Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2020-12-09 Thread Joao Martins
On 12/9/20 3:41 PM, David Woodhouse wrote:
> On 9 December 2020 13:26:55 GMT, Joao Martins  
> wrote:
>> On 12/9/20 11:39 AM, David Woodhouse wrote:
>>> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
>>> return the domain-wide vector in preference to the one in the LAPIC,
>> if
>>> it actually gets invoked. 
>>
>> Only if the callback installed is HVMIRQ_callback_vector IIUC.
>>
>> Otherwise the vector would be pending like any other LAPIC vector.
> 
> Ah, right.
> 
> For some reason I had it in my head that you could only set the per-vCPU 
> lapic vector if the domain was set to HVMIRQ_callback_vector. If the domain 
> is set to HVMIRQ_callback_none, that clearly makes more sense.
> 
> Still, my patch should do the same as Xen does in the case where a guest does 
> set both, I think.
> 
> Faithful compatibility with odd Xen behaviour FTW :)
> 
Ah, yes. In that case, HVMIRQ_callback_vector does take precedence.

But it would be very weird for a guest to set up two callback vectors :)


Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2020-12-09 Thread Joao Martins
On 12/9/20 11:39 AM, David Woodhouse wrote:
> On Wed, 2020-12-09 at 10:51 +0000, Joao Martins wrote:
>> Isn't this what the first half of this patch was doing initially (minus the
>> irq routing) ? Looks really similar:
>>
>> https://lore.kernel.org/kvm/20190220201609.28290-11-joao.m.mart...@oracle.com/
> 
> Absolutely! This thread is in reply to your original posting of
> precisely that patch, and I've had your tree open in gitk to crib from
> for most of the last week.
> 
I had forgotten about this patch given all the discussion so far, and had to
look at it again because your snippet looked familiar. But I ended up being a
little pedantic -- sorry about that.

> There's a *reason* why my tree looks like yours, and why more than half
> of the patches in it still show you as being the author :)
> 
Btw, for this particular patch the author would be Ankur.

More importantly, thanks a lot for picking it up and for all the awesome stuff 
you're
doing with it.

>> Albeit, I gotta after seeing the irq routing removed it ends much simpler, 
>> if we just
>> replace the irq routing with a domain-wide upcall vector ;)
> 
> I like "simpler".
> 
> I also killed the ->cb.queued flag you had because I think it's
> redundant with evtchn_upcall_pending anyway.
> 
Yeap, indeed.

>> Albeit it wouldn't cover the Windows Xen open source drivers which use the 
>> EVTCHN method
>> (which is a regular LAPIC vector) [to answer your question on what uses it] 
>> For the EVTCHN
>> you can just inject a regular vector through lapic deliver, and guest acks 
>> it. Sadly Linux
>> doesn't use it,
>> and if it was the case we would probably get faster upcalls with APICv/AVIC.
> 
> It doesn't need to, because those can just be injected with
> KVM_SIGNAL_MSI.
> 
/me nods

> At most, we just need to make sure that kvm_xen_has_interrupt() returns
> false if the per-vCPU LAPIC vector is configured. But I didn't do that
> because I checked Xen and it doesn't do it either.
> 
Oh! I have this strange recollection that it was, when we were looking at the 
Xen
implementation.

> As far as I can tell, Xen's hvm_vcpu_has_pending_irq() will still
> return the domain-wide vector in preference to the one in the LAPIC, if
> it actually gets invoked. 

Only if the callback installed is HVMIRQ_callback_vector IIUC.

Otherwise the vector would be pending like any other LAPIC vector.

> And if the guest sets ->evtchn_upcall_pending
> to reinject the IRQ (as Linux does on unmask) Xen won't use the per-
> vCPU vector to inject that; it'll use the domain-wide vector.

Right -- I don't think Linux even installs a per-CPU upcall LAPIC vector other
than the domain's callback irq.

>>> I'm not sure that condition wasn't *already* broken some cases for
>>> KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
>>> vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
>>> sure.
>>>
>>> But what if vcpu_enter_guest() processes that the first time (and
>>> clears KVM_REQ_EVENT), and then exits for some *other* reason with
>>> interrupts still disabled? Next time through vcpu_enter_guest(), even
>>> though kvm_cpu_has_injectable_intr() is still true, we don't enable the
>>> IRQ windowvmexit because KVM_REQ_EVENT got lost so we don't even call
>>> inject_pending_event().
>>>
>>> So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
>>> if we need to use kvm_cpu_has_injectable_intr() to fix the existing
>>> bug? Or am I missing something there and there isn't a bug after all?
>>
>> Given that the notion of an event channel pending is Xen specific handling, 
>> I am not sure
>> we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt() logic. 
>> Much of the
>> reason that we ended up checking on vmenter that checks event channels 
>> pendings..
> 
> Sure, we definitely need the check I added in vcpu_enter_guest() for
> Xen unless I'm going to come up with a way to set KVM_REQ_EVENT at the
> appropriate time.
> 
> But I'm looking at the ExtINT handling and as far as I can tell it's
> buggy. So I didn't want to use it as a model for setting KVM_REQ_EVENT
> for Xen events.
> 
/me nods


Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2020-12-09 Thread Joao Martins



On 12/9/20 10:27 AM, David Woodhouse wrote:
> On Tue, 2020-12-08 at 22:35 -0800, Ankur Arora wrote:
>>> It looks like Linux doesn't use the per-vCPU upcall vector that you
>>> called 'KVM_XEN_CALLBACK_VIA_EVTCHN'. So I'm delivering interrupts via
>>> KVM_INTERRUPT as if they were ExtINT
>>>
>>> ... except I'm not. Because the kernel really does expect that to be an
>>> ExtINT from a legacy PIC, and kvm_apic_accept_pic_intr() only returns
>>> true if LVT0 is set up for EXTINT and unmasked.
>>>
>>> I messed around with this hack and increasingly desperate variations on
>>> the theme (since this one doesn't cause the IRQ window to be opened to
>>> userspace in the first place), but couldn't get anything working:
>>
>> Increasingly desperate variations,  about sums up my process as well while
>> trying to get the upcall vector working.
> 
> :)
> 
> So this seems to work, and I think it's about as minimal as it can get.
> 
> All it does is implement a kvm_xen_has_interrupt() which checks the
> vcpu_info->evtchn_upcall_pending flag, just like Xen does.
> 
> With this, my completely userspace implementation of event channels is
> working. And of course this forms a basis for adding the minimal
> acceleration of IPI/VIRQ that we need to the kernel, too.
> 
> My only slight concern is the bit in vcpu_enter_guest() where we have
> to add the check for kvm_xen_has_interrupt(), because nothing is
> setting KVM_REQ_EVENT. I suppose I could look at having something —
> even an explicit ioctl from userspace — set that for us BUT...
> 
Isn't this what the first half of this patch was doing initially (minus the
irq routing) ? Looks really similar:

https://lore.kernel.org/kvm/20190220201609.28290-11-joao.m.mart...@oracle.com/

Albeit, I gotta say that after seeing the irq routing removed it ends up much
simpler, if we just replace the irq routing with a domain-wide upcall vector ;)

It wouldn't cover the Windows Xen open source drivers, though, which use the
EVTCHN method (which is a regular LAPIC vector) [to answer your question on what
uses it]. For EVTCHN you can just inject a regular vector through LAPIC delivery,
and the guest acks it. Sadly Linux doesn't use it; if it did, we would probably
get faster upcalls with APICv/AVIC.

> I'm not sure that condition wasn't *already* broken some cases for
> KVM_INTERRUPT anyway. In kvm_vcpu_ioctl_interrupt() we set
> vcpu->arch.pending_userspace_vector and we *do* request KVM_REQ_EVENT,
> sure.
> 
> But what if vcpu_enter_guest() processes that the first time (and
> clears KVM_REQ_EVENT), and then exits for some *other* reason with
> interrupts still disabled? Next time through vcpu_enter_guest(), even
> though kvm_cpu_has_injectable_intr() is still true, we don't enable the
> IRQ windowvmexit because KVM_REQ_EVENT got lost so we don't even call
> inject_pending_event().
> 
> So instead of just kvm_xen_has_interrupt() in my patch below, I wonder
> if we need to use kvm_cpu_has_injectable_intr() to fix the existing
> bug? Or am I missing something there and there isn't a bug after all?
> 
Given that the notion of a pending event channel is Xen-specific handling, I am
not sure we can remove the kvm_xen_has_interrupt()/kvm_xen_get_interrupt()
logic. That is much of the reason I ended up adding a check on vmenter for
pending event channels.

That or the autoEOI hack you and Ankur were talking about.

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d8716ef27728..4a63f212fdfc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -902,6 +902,7 @@ struct msr_bitmap_range {
>  /* Xen emulation context */
>  struct kvm_xen {
>   bool long_mode;
> + u8 upcall_vector;
>   struct kvm_host_map shinfo_map;
>   void *shinfo;
>  };
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 814698e5b152..24668b51b5c8 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -14,6 +14,7 @@
>  #include "irq.h"
>  #include "i8254.h"
>  #include "x86.h"
> +#include "xen.h"
>  
>  /*
>   * check if there are pending timer events
> @@ -56,6 +57,9 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v)
>   if (!lapic_in_kernel(v))
>   return v->arch.interrupt.injected;
>  
> + if (kvm_xen_has_interrupt(v))
> + return 1;
> +
>   if (!kvm_apic_accept_pic_intr(v))
>   return 0;
>  
> @@ -110,6 +114,9 @@ static int kvm_cpu_get_extint(struct kvm_vcpu *v)
>   if (!lapic_in_kernel(v))
>   return v->arch.interrupt.nr;
>  
> + if (kvm_xen_has_interrupt(v))
> + return v->kvm->arch.xen.upcall_vector;
> +
>   if (irqchip_split(v->kvm)) {
>   int vector = v->arch.pending_external_vector;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ad9eea8f4f26..1711072b3616 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8891,7 +8891,8 @@ static int vcpu_enter_guest(struct kvm_vcpu 

Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2020-12-02 Thread Joao Martins
On 12/2/20 7:02 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 18:34 +0000, Joao Martins wrote:
>> On 12/2/20 4:47 PM, David Woodhouse wrote:
>>> On Wed, 2020-12-02 at 13:12 +, Joao Martins wrote:
>>>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>> For the VMM
>>> API I think we should follow the Xen model, mixing the domain-wide and
>>> per-vCPU configuration. It's the best way to faithfully model the
>>> behaviour a true Xen guest would experience.
>>>
>>> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
>>>  • HVMIRQ_callback_vector, taking a vector#
>>>  • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
>>>
>>> And *maybe* in a later patch it could also handle
>>>  • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
>>>  • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
>>>
>>
>> Most of the Xen versions we were caring had callback_vector and
>> vcpu callback vector (despite Linux not using the latter). But if you're
>> dating back to 3.2 and 4.1 well (or certain Windows drivers), I suppose
>> gsi and pci-intx are must-haves.
> 
> Note sure about GSI but PCI-INTX is definitely something I've seen in
> active use by customers recently. I think SLES10 will use that.
> 

Some of the Windows drivers we used were relying on GSI.

I don't know what kernel SLES10 ships, but Linux has been aware
of XENFEAT_hvm_callback_vector since 2.6.35, i.e. for about 10 years.
Unless some out-of-tree patch opts it out, I suppose.

> 
>> But kinda have mixed feelings in having kernel handling all event channels 
>> ABI,
>> as opposed to only the ones userspace asked to offload. It looks a tad 
>> unncessary besides
>> the added gain to VMMs that don't need to care about how the internals of 
>> event channels.
>> But performance-wise it wouldn't bring anything better. But maybe, the 
>> former is reason
>> enough to consider it.
> 
> Yeah, we'll see. Especially when it comes to implementing FIFO event
> channels, I'd rather just do it in one place — and if the kernel does
> it anyway then it's hardly difficult to hook into that.
> 
Fortunately that's Xen 4.3 and up, *I think* :) (the FIFO events)

And Linux is the only user I am aware of, IIRC.

> But I've been about as coherent as I can be in email, and I think we're
> generally aligned on the direction. 

Yes, definitely.

> I'll do some more experiments and
> see what I can get working, and what it looks like.
> 
> I'm focusing on making the shinfo stuff all use kvm_map_gfn() first.
> 
I was chatting with Ankur, and we can't 100% remember why we dropped using
kvm_vcpu_map()/kvm_map_gfn(). We were using kvm_vcpu_map(), but at the time the
new guest mapping series was under discussion, so we dropped those until it
settled in.

One "side effect" of mapping shared_info with kvm_vcpu_map() is that we have to
loop over all vcpus unless we move shared_info elsewhere, IIRC. But switching
vcpu_info, vcpu_time_info (and steal clock) to kvm_vcpu_map() is trivial... at
least based on our old WIP branches here.

Joao


Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2020-12-02 Thread Joao Martins
On 12/2/20 4:47 PM, David Woodhouse wrote:
> On Wed, 2020-12-02 at 13:12 +0000, Joao Martins wrote:
>> On 12/2/20 11:17 AM, David Woodhouse wrote:
>>> I might be more inclined to go for a model where the kernel handles the
>>> evtchn_pending/evtchn_mask for us. What would go into the irq routing
>>> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
>>
>> But passing port to the routing and handling the sending of events wouldn't 
>> it lead to
>> unnecessary handling of event channels which aren't handled by the kernel, 
>> compared to
>> just injecting caring about the upcall?
> 
> Well, I'm generally in favour of *not* doing things in the kernel that
> don't need to be there.
> 
> But if the kernel is going to short-circuit the IPIs and VIRQs, then
> it's already going to have to handle the evtchn_pending/evtchn_mask
> bitmaps, and actually injecting interrupts.
> 
Right. I was trying to point that out in the discussion we had on the next
patch. But truth be told, it was more about touting the idea of the kernel
knowing whether a given event channel is registered for userspace handling,
rather than fully handling the event channel.

I suppose we are able to provide both options to the VMM anyway,
i.e. 1) letting it handle it entirely in userspace by intercepting
EVTCHNOP_send, or 2) through the irq route if we want the kernel to offload it.

> Given that it has to have that functionality anyway, it seems saner to
> let the kernel have full control over it and to just expose
> 'evtchn_send' to userspace.
> 
> The alternative is to have userspace trying to play along with the
> atomic handling of those bitmasks too 

That part is not particularly hard -- I have it done already.

>, and injecting events through
> KVM_INTERRUPT/KVM_SIGNAL_MSI in parallel to the kernel doing so. That
> seems like *more* complexity, not less.
>
/me nods

>> I wanted to mention the GSI callback method too, but wasn't enterily sure if 
>> eventfd was
>> enough.
> 
> That actually works quite nicely even for userspace irqchip.
> 
> Forgetting Xen for the moment... my model for userspace I/OAPIC with
> interrupt remapping is that during normal runtime, the irqfd is
> assigned and things all work and we can even have IRQ posting for
> eventfds which came from VFIO. 
> 
> When the IOMMU invalidates an IRQ translation, all it does is
> *deassign* the irqfd from the KVM IRQ. So the next time that eventfd
> fires, it's caught in the userspace event loop instead. Userspace can
> then retranslate through the IOMMU and reassign the irqfd for next
> time.
> 
> So, back to Xen. As things stand with just the hypercall+shinfo patches
> I've already rebased, we have enough to do fully functional Xen
> hosting. 

Yes -- the rest becomes optimizations in performance-sensitive paths.

TBH (and this is slightly off-topic) the somewhat hairy part is xenbus/xenstore.
The alternative to playing nice with xenstored is the VMM learning
to parse the xenbus messages and fake the xenstored content/transactions
individually per VM.

> The event channels are slow but it *can* be done entirely in

From my old notes: about twice as slow if done in userspace.

> userspace — handling *all* the hypercalls, and delivering interrupts to
> the guest in whatever mode is required.
> 
> Event channels are a very important optimisation though. 

/me nods

> For the VMM
> API I think we should follow the Xen model, mixing the domain-wide and
> per-vCPU configuration. It's the best way to faithfully model the
> behaviour a true Xen guest would experience.
> 
> So KVM_XEN_ATTR_TYPE_CALLBACK_VIA can be used to set one of
>  • HVMIRQ_callback_vector, taking a vector#
>  • HVMIRQ_callback_gsi for the in-kernel irqchip, taking a GSI#
> 
> And *maybe* in a later patch it could also handle
>  • HVMIRQ_callback_gsi for split-irqchip, taking an eventfd
>  • HVMIRQ_callback_pci_intx, taking an eventfd (or a pair, for EOI?)
> 
Most of the Xen versions we cared about had callback_vector and the per-vCPU
callback vector (despite Linux not using the latter). But if you're
dating back to 3.2 and 4.1 (or certain Windows drivers), I suppose
GSI and PCI-INTx are must-haves.

> I don't know if the latter two really make sense. After all the
> argument for handling IPI/VIRQ in kernel kind of falls over if we have
> to bounce out to userspace anyway. 

Might as well let userspace EVTCHNOP_send handle it, I wonder.

> So it *only* makes sense if those
> eventfds actually end up wired *through* userspace to a KVM IRQFD as I
> described for the IOMMU stuff.
> 
We didn't implement the phy event channels, but most old kernels we
tested at the time (back to 2.6.XX) seemed to play along.

> In addition to that

Re: [PATCH RFC 10/39] KVM: x86/xen: support upcall vector

2020-12-02 Thread Joao Martins
On 12/2/20 11:17 AM, David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> @@ -176,6 +177,9 @@ int kvm_arch_set_irq_inatomic(struct 
>> kvm_kernel_irq_routing_entry *e,
>> int r;
>>  
>> switch (e->type) {
>> +   case KVM_IRQ_ROUTING_XEN_EVTCHN:
>> +   return kvm_xen_set_evtchn(e, kvm, irq_source_id, level,
>> +  line_status);
>> case KVM_IRQ_ROUTING_HV_SINT:
>> return kvm_hv_set_sint(e, kvm, irq_source_id, level,
>>line_status);
>> @@ -325,6 +329,13 @@ int kvm_set_routing_entry(struct kvm *kvm,
>> e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
>> e->hv_sint.sint = ue->u.hv_sint.sint;
>> break;
>> +   case KVM_IRQ_ROUTING_XEN_EVTCHN:
>> +   e->set = kvm_xen_set_evtchn;
>> +   e->evtchn.vcpu = ue->u.evtchn.vcpu;
>> +   e->evtchn.vector = ue->u.evtchn.vector;
>> +   e->evtchn.via = ue->u.evtchn.via;
>> +
>> +   return kvm_xen_setup_evtchn(kvm, e);
>> default:
>> return -EINVAL;
>> }
> 
> 
> Hmm. I'm not sure I've have done it that way.
> 
> These IRQ routing entries aren't for individual event channel ports;
> they don't map to kvm_xen_evtchn_send().
> 
> They actually represent the upcall to the given vCPU when any event
> channel is signalled, and it's really per-vCPU configuration.
> 
Right.

> When the kernel raises (IPI, VIRQ) events on a given CPU, it doesn't
> actually use these routing entries; it just uses the values in
> vcpu_xen->cb.{via,vector} which were cached from the last of these IRQ
> routing entries that happens to have been processed?
> 
Correct.

> The VMM is *expected* to set up precisely one of these for each vCPU,
> right?
> 
From the guest PoV, the hypercall:

HYPERVISOR_hvm_op(HVMOP_set_param, HVM_PARAM_CALLBACK_IRQ, ...)

(...) is global.

But this one (on more recent versions of Xen, used particularly by recent
Windows guests):

HVMOP_set_evtchn_upcall_vector

Is per-vCPU, and it's a LAPIC vector.

But indeed the VMM ends up changing the @via / @vector on an individual vCPU.
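
For reference, the two guest-side interfaces look roughly like this (types from
Xen's public hvm_op.h; the callback-type encoding in the top byte of the param
value follows the comments in params.h, with 2 meaning "vector"; 'vector' and
'vcpu_id' below are placeholders):

/* Global: HVM_PARAM_CALLBACK_IRQ carries the delivery method + vector/GSI. */
struct xen_hvm_param p = {
	.domid = DOMID_SELF,
	.index = HVM_PARAM_CALLBACK_IRQ,
	.value = (2ULL << 56) | vector,	/* type 2 == LAPIC vector */
};
HYPERVISOR_hvm_op(HVMOP_set_param, &p);

/* Per-vCPU: newer Xen (and Windows drivers) set a vector per vCPU instead. */
struct xen_hvm_evtchn_upcall_vector u = {
	.vcpu = vcpu_id,
	.vector = vector,
};
HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &u);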

> Would it not be better to do that via KVM_XEN_HVM_SET_ATTR?

It's definitely an interesting (better?) alternative, considering we would be
using it as a vCPU attribute.

I suppose you're suggesting something like:

KVM_XEN_ATTR_TYPE_CALLBACK

passing the vector and the callback type.

> The usage model for userspace is presumably that the VMM should set the
> appropriate bit in evtchn_pending, check evtchn_mask and then call into
> the kernel to do the set_irq() to inject the callback vector to the
> guest?
> 
Correct, that's how it works for userspace handled event channels.
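
A minimal sketch of that userspace flow, using the layouts from Xen's public
headers (shared_info/vcpu_info); inject_callback_vector() stands in for whatever
injection mechanism the VMM uses (KVM_INTERRUPT, an irqfd, a GSI line, ...):

static void deliver_evtchn(struct shared_info *shinfo, struct vcpu_info *vi,
			   unsigned int port, unsigned int vcpu)
{
	unsigned int word = port / (sizeof(unsigned long) * 8);
	unsigned long bit = 1UL << (port % (sizeof(unsigned long) * 8));

	/* Mark the port pending; if it already was, nothing more to do. */
	if (__atomic_fetch_or(&shinfo->evtchn_pending[word], bit,
			      __ATOMIC_SEQ_CST) & bit)
		return;

	/* Masked ports don't generate an upcall. */
	if (shinfo->evtchn_mask[word] & bit)
		return;

	__atomic_fetch_or(&vi->evtchn_pending_sel, 1UL << word, __ATOMIC_SEQ_CST);
	__atomic_store_n(&vi->evtchn_upcall_pending, 1, __ATOMIC_SEQ_CST);

	inject_callback_vector(vcpu);	/* hypothetical helper */
}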

> I might be more inclined to go for a model where the kernel handles the
> evtchn_pending/evtchn_mask for us. What would go into the irq routing
> table is { vcpu, port# } which get passed to kvm_xen_evtchn_send().
> 

But wouldn't passing the port to the routing table, and handling the sending of
events there, lead to unnecessary handling of event channels which aren't
handled by the kernel, compared to just caring about injecting the upcall?

I thought from previous feedback that it was something you wanted to avoid.

> Does that seem reasonable?
> 
Otherwise, it seems reasonable to have it.

I'll let Ankur chip in too, as this was something he was modelling more closely.

> Either way, I do think we need a way for events raised in the kernel to
> be signalled to userspace, if they are targeted at a vCPU which has
> CALLBACK_VIA_INTX that the kernel can't do directly. So we probably
> *do* need that eventfd I was objecting to earlier, except that it's not
> a per-evtchn thing; it's per-vCPU.
> 

Ah!

I wanted to mention the GSI callback method too, but wasn't entirely sure if an
eventfd was enough.

Joao


Re: [PATCH RFC 02/39] KVM: x86/xen: intercept xen hypercalls if enabled

2020-12-02 Thread Joao Martins
On 12/1/20 11:19 AM, David Woodhouse wrote:
> On Tue, 2020-12-01 at 09:48 +, David Woodhouse wrote:
>> So I switched it to generate the hypercall page directly from the
>> kernel, just like we do for the Hyper-V hypercall page.
> 
> Speaking of Hyper-V... I'll post this one now so you can start heckling
> it.
> 
Xen viridian mode is indeed one thing that needed fixing. And definitely a
separate patch, as you do here. Both this one and the previous one look good.

I suppose that because you switch to kvm_vcpu_write_guest() you no longer need
to validate that the hypercall page is correct, nor mark it dirty. Probably
worth making that explicit in the commit message.

> I won't post the rest as I go; not much of the rest of the series when
> I eventually post it will be very new and exciting. It'll just be
> rebasing and tweaking your originals and perhaps adding some tests.
> 

Awesome!!


Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

2020-12-02 Thread Joao Martins
On 12/2/20 5:17 AM, Ankur Arora wrote:
> On 2020-12-01 5:26 p.m., David Woodhouse wrote
>> On Tue, 2020-12-01 at 16:40 -0800, Ankur Arora wrote:
>>> On 2020-12-01 5:07 a.m., David Woodhouse wrote:

[...]

 If that was allowed, wouldn't it have been a much simpler fix for
 CVE-2019-3016? What am I missing?
>>>
>>> Agreed.
>>>
>>> Perhaps, Paolo can chime in with why KVM never uses pinned page
>>> and always prefers to do cached mappings instead?
>>>

 Should I rework these to use kvm_write_guest_cached()?
>>>
>>> kvm_vcpu_map() would be better. The event channel logic does RMW operations
>>> on shared_info->vcpu_info.
>>
>> I've ported the shared_info/vcpu_info parts and made a test case, and
>> was going back through to make it use kvm_write_guest_cached(). But I
>> should probably plug on through the evtchn bits before I do that.
>>
>> I also don't see much locking on the cache, and can't convince myself
>> that accessing the shared_info page from multiple CPUs with
>> kvm_write_guest_cached() or kvm_map_gfn() is sane unless they each have
>> their own cache.
> 
> I think you could get a VCPU specific cache with kvm_vcpu_map().
> 

steal clock handling creates such a mapping cache (struct gfn_to_pfn_cache).

Joao


Re: [PATCH RFC 03/39] KVM: x86/xen: register shared_info page

2020-12-02 Thread Joao Martins
[late response - was on holiday yesterday]

On 12/2/20 12:40 AM, Ankur Arora wrote:
> On 2020-12-01 5:07 a.m., David Woodhouse wrote:
>> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>>> +static int kvm_xen_shared_info_init(struct kvm *kvm, gfn_t gfn)
>>> +{
>>> +   struct shared_info *shared_info;
>>> +   struct page *page;
>>> +
>>> +   page = gfn_to_page(kvm, gfn);
>>> +   if (is_error_page(page))
>>> +   return -EINVAL;
>>> +
>>> +   kvm->arch.xen.shinfo_addr = gfn;
>>> +
>>> +   shared_info = page_to_virt(page);
>>> +   memset(shared_info, 0, sizeof(struct shared_info));
>>> +   kvm->arch.xen.shinfo = shared_info;
>>> +   return 0;
>>> +}
>>> +
>>
>> Hm.
>>
>> How come we get to pin the page and directly dereference it every time,
>> while kvm_setup_pvclock_page() has to use kvm_write_guest_cached()
>> instead?
> 
> So looking at my WIP trees from the time, this is something that
> we went back and forth on as well with using just a pinned page or a
> persistent kvm_vcpu_map().
> 
> I remember distinguishing shared_info/vcpu_info from kvm_setup_pvclock_page()
> as shared_info is created early and is not expected to change during the
> lifetime of the guest which didn't seem true for MSR_KVM_SYSTEM_TIME (or
> MSR_KVM_STEAL_TIME) so that would either need to do a kvm_vcpu_map()
> kvm_vcpu_unmap() dance or do some kind of synchronization.
> 
> That said, I don't think this code explicitly disallows any updates
> to shared_info.
> 
>>
>> If that was allowed, wouldn't it have been a much simpler fix for
>> CVE-2019-3016? What am I missing?
> 
> Agreed.
> 
> Perhaps, Paolo can chime in with why KVM never uses pinned page
> and always prefers to do cached mappings instead?
> 
Part of the CVE fix was to not use the cached versions.

It's not a longterm pin of the page, unlike what we try to do here (partly due
to the nature of the pages we are mapping), but we still map the gpa, RMW the
steal time struct, and then unmap the page.

See record_steal_time() -- but more specifically commit b043138246 ("x86/KVM: 
Make sure
KVM_VCPU_FLUSH_TLB flag is not missed").

But I am not sure it's a good idea to follow the same approach as
record_steal_time(), given that this is a fairly sensitive code path for event
channels.

>>
>> Should I rework these to use kvm_write_guest_cached()?
> 
> kvm_vcpu_map() would be better. The event channel logic does RMW operations
> on shared_info->vcpu_info.
> 
Indeed, yes.

Ankur, IIRC we saw missed event channel notifications when we were using the
{write,read}_cached() version of the patch.

But I can't remember what the cause was -- either the evtchn_pending or the mask
word -- which would make it not inject an upcall.

Joao


Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

2020-11-30 Thread Joao Martins



On 11/30/20 7:04 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 18:41 +0000, Joao Martins wrote:
>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>> {
>> ...
>> if (kvm_hv_hypercall_enabled(vcpu->kvm))
>> return kvm_hv_hypercall(...);
>>
>> if (kvm_xen_hypercall_enabled(vcpu->kvm))
>> return kvm_xen_hypercall(...);
>> ...
>> }
>>
>> And on kvm_xen_hypercall() for the cases VMM offloads to demarshal what the 
>> registers mean
>> e.g. for event channel send 64-bit guest: RAX for opcode and RDI/RSI for cmd 
>> and port.
> 
> Right, although it's a little more abstract than that: "RDI/RSI for
> arg#0, arg#1 respectively".
> 
> And those are RDI/RSI for 64-bit Xen, EBX/ECX for 32-bit Xen, and
> RBX/RDI for Hyper-V. (And Hyper-V seems to use only the two, while Xen
> theoretically has up to 6).
> 
Indeed, it almost reminds me of my other patch for Xen hypercalls -- it was
handling 32-bit and 64-bit that way:

https://lore.kernel.org/kvm/20190220201609.28290-3-joao.m.mart...@oracle.com/

>> The kernel logic wouldn't be much different at the core, so thought of tihs 
>> consolidation.
>> But the added complexity would have come from having to deal with two 
>> userspace exit types
>> -- indeed probably not worth the trouble as you pointed out.
> 
> Yeah, I think I'm just going to move the 'kvm_userspace_hypercall()'
> from my patch to be 'kvm_xen_hypercall()' in a new xen.c but still
> using KVM_EXIT_HYPERCALL. Then I can rebase your other patches on top
> of that, with the evtchn bypass.
> 
Yeap, makes sense.

Joao


Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

2020-11-30 Thread Joao Martins
On 11/30/20 6:01 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 17:15 +0000, Joao Martins wrote:
>> On 11/30/20 4:48 PM, David Woodhouse wrote:
>>> On Mon, 2020-11-30 at 15:08 +, Joao Martins wrote:
>>>> On 11/30/20 12:55 PM, David Woodhouse wrote:
>>>>> On Mon, 2020-11-30 at 12:17 +, Joao Martins wrote:
>>>>>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>>>>>> On Wed, 2019-02-20 at 20:15 +, Joao Martins wrote:

[...]

>>>> I should comment on your other patch but: if we're going to make it 
>>>> generic for
>>>> the userspace hypercall handling, might as well move hyper-v there too. In 
>>>> this series,
>>>> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a 
>>>> generic version
>>>> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both 
>>>> guest types, while
>>>> disabling KVM_EXIT_HYPERV. But this is probably subject of its own 
>>>> separate patch :)
>>>
>>> There's a limit to how much consolidation we can do because the ABI is
>>> different; the args are in different registers.
>>>
>>
>> Yes. It would be optionally enabled of course and VMM would have to adjust 
>> to the new ABI
>> -- surely wouldn't want to break current users of KVM_EXIT_HYPERV.
> 
> True, but that means we'd have to keep KVM_EXIT_HYPERV around anyway,
> and can't actually *remove* it. The "consolidation" gives us more
> complexity, not less.
>
Fair point.

>>> I do suspect Hyper-V should have marshalled its arguments into the
>>> existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
>>> don't think it makes sense to change it now since it's a user-facing
>>> ABI. I don't want to follow its lead by inventing *another* gratuitous
>>> exit type for Xen though.
>>>
>>
>> I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
>> exit type ;)
>>
>> But I guess you still need to co-relate a type of hypercall (Xen guest cap 
>> enabled?) to
>> tell it's Xen or KVM to specially enlighten certain opcodes (EVTCHNOP_send).
> 
> Sure, but if the VMM doesn't know what kind of guest it's hosting, we
> have bigger problems... :)
> 
Right :)

I was referring to the kernel here.

Eventually we need to special-case things for a given guest type, e.g.:

int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
...
if (kvm_hv_hypercall_enabled(vcpu->kvm))
return kvm_hv_hypercall(...);

if (kvm_xen_hypercall_enabled(vcpu->kvm))
return kvm_xen_hypercall(...);
...
}

And in kvm_xen_hypercall(), for the cases the VMM offloads, demarshal what the
registers mean, e.g. for an event channel send on a 64-bit guest: RAX for the
opcode and RDI/RSI for cmd and port.

The kernel logic wouldn't be much different at the core, so I thought of this
consolidation. But the added complexity would have come from having to deal with
two userspace exit types -- indeed probably not worth the trouble, as you
pointed out.
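
For illustration, the demarshalling step would look roughly like the below,
using KVM's existing register accessors (the function name is made up; only the
64-bit and 32-bit Xen conventions discussed here are handled):

static void xen_hypercall_args(struct kvm_vcpu *vcpu, u64 *op, u64 *a0, u64 *a1)
{
	*op = kvm_register_read(vcpu, VCPU_REGS_RAX);

	if (is_64_bit_mode(vcpu)) {
		*a0 = kvm_register_read(vcpu, VCPU_REGS_RDI);
		*a1 = kvm_register_read(vcpu, VCPU_REGS_RSI);
	} else {
		*a0 = (u32)kvm_register_read(vcpu, VCPU_REGS_RBX);
		*a1 = (u32)kvm_register_read(vcpu, VCPU_REGS_RCX);
	}
}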


Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

2020-11-30 Thread Joao Martins
On 11/30/20 4:48 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 15:08 +0000, Joao Martins wrote:
>> On 11/30/20 12:55 PM, David Woodhouse wrote:
>>> On Mon, 2020-11-30 at 12:17 +, Joao Martins wrote:
>>>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>>>> On Wed, 2019-02-20 at 20:15 +, Joao Martins wrote:
>>>> One thing I didn't quite do at the time, is the whitelisting of 
>>>> unregistered
>>>> ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
>>>> the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
>>>> ones which go to userspace should be explicitly requested as such
>>>> and otherwise return -ENOENT in the hypercall.
>>>
>>> Hm, why would -ENOENT be a fast path which needs to be handled in the
>>> kernel?
>>>
>>
>> It's not that it's a fast path.
>>
>> Like sending an event channel to an unbound vector, now becomes an possible 
>> vector to
>> worry about in userspace VMM e.g. should that port lookup logic be fragile.
>>
>> So it's more along the lines of Nack-ing the invalid port earlier to rather 
>> go
>> to go userspace to invalidate it, provided we do the lookup anyway in the 
>> kernel.
> 
> If the port lookup logic is fragile, I *want* it in the sandboxed
> userspace VMM and not in the kernel :)
> 
Yes definitely -- I think we are on the same page on that.

But it's just that we do the lookup *anyway* to check if the kernel has a given
evtchn port registered. That's the lookup I am talking about here, with just an
extra bit to tell that it's a userspace-handled port.

> And unless we're going to do *all* of the EVTCHNOP_bind*, EVTCHN_close,
> etc. handling in the kernel, doesn't userspace have to have all that
> logic for managing the port space anyway?
> 
Indeed.

> I think it's better to let userspace own it outright, and use the
> kernel bypass purely for the fast paths. The VMM can even implement
> IPI/VIRQ support in userspace, then use the kernel bypass if/when it's
> available.
> 
True, and it's pretty much how it's implemented today.

But I felt it was still worth having this discussion on whether this should be
considered or discarded. I suppose we stick with the latter for now.

>>>> Perhaps eventfd could be a way to express this? Like if you register
>>>> without an eventfd it's offloaded, otherwise it's assigned to userspace,
>>>> or if neither it's then returned an error without bothering the VMM.
>>>
>>> I much prefer the simple model where the *only* event channels that the
>>> kernel knows about are the ones it's expected to handle.
>>>
>>> For any others, the bypass doesn't kick in, and userspace gets the
>>> KVM_EXIT_HYPERCALL exit.
>>>
>>
>> /me nods
>>
>> I should comment on your other patch but: if we're going to make it generic 
>> for
>> the userspace hypercall handling, might as well move hyper-v there too. In 
>> this series,
>> I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a 
>> generic version
>> I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest 
>> types, while
>> disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate 
>> patch :)
> 
> There's a limit to how much consolidation we can do because the ABI is
> different; the args are in different registers.
> 
Yes. It would be optionally enabled of course and VMM would have to adjust to 
the new ABI
-- surely wouldn't want to break current users of KVM_EXIT_HYPERV.

> I do suspect Hyper-V should have marshalled its arguments into the
> existing kvm_run->arch.hypercall and used KVM_EXIT_HYPERCALL but I
> don't think it makes sense to change it now since it's a user-facing
> ABI. I don't want to follow its lead by inventing *another* gratuitous
> exit type for Xen though.
> 
I definitely like the KVM_EXIT_HYPERCALL better than a KVM_EXIT_XEN userspace
exit type ;)

But I guess you still need to co-relate a type of hypercall (Xen guest cap 
enabled?) to
tell it's Xen or KVM to specially enlighten certain opcodes (EVTCHNOP_send).

Joao


Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

2020-11-30 Thread Joao Martins
On 11/30/20 12:55 PM, David Woodhouse wrote:
> On Mon, 2020-11-30 at 12:17 +0000, Joao Martins wrote:
>> On 11/30/20 9:41 AM, David Woodhouse wrote:
>>> On Wed, 2019-02-20 at 20:15 +, Joao Martins wrote:
>>>> EVTCHNOP_send short-circuiting happens by marking the event as pending
>>>> in the shared info and vcpu info pages and doing the upcall. For IPIs
>>>> and interdomain event channels, we do the upcall on the assigned vcpu.
>>>
>>> This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
>>> that we can short-circuit IPI delivery without it having to bounce
>>> through userspace.
>>>
>>> But why would I then want then short-circuit the short-circuit,
>>> providing an eventfd for it to signal... so that I can then just
>>> receive the event in userspace in a *different* form to the original
>>> hypercall exit I would have got?
>>>
>>
>> One thing I didn't quite do at the time, is the whitelisting of unregistered
>> ports to userspace. Right now, it's a blacklist i.e. if it's not handled in
>> the kernel (IPIs, timer vIRQ, etc) it goes back to userspace. When the only
>> ones which go to userspace should be explicitly requested as such
>> and otherwise return -ENOENT in the hypercall.
> 
> Hm, why would -ENOENT be a fast path which needs to be handled in the
> kernel?
> 
It's not that it's a fast path.

It's that sending an event channel to an unbound port now becomes a possible
vector to worry about in the userspace VMM, e.g. should that port lookup logic
be fragile.

So it's more along the lines of NACK-ing the invalid port earlier, rather than
going to userspace to invalidate it, provided we do the lookup anyway in the
kernel.

>> Perhaps eventfd could be a way to express this? Like if you register
>> without an eventfd it's offloaded, otherwise it's assigned to userspace,
>> or if neither it's then returned an error without bothering the VMM.
> 
> I much prefer the simple model where the *only* event channels that the
> kernel knows about are the ones it's expected to handle.
> 
> For any others, the bypass doesn't kick in, and userspace gets the
> KVM_EXIT_HYPERCALL exit.
> 
/me nods

I should comment on your other patch but: if we're going to make it generic for
the userspace hypercall handling, might as well move hyper-v there too. In this 
series,
I added KVM_EXIT_XEN, much like it exists KVM_EXIT_HYPERV -- but with a generic 
version
I wonder if a capability could gate KVM_EXIT_HYPERCALL to handle both guest 
types, while
disabling KVM_EXIT_HYPERV. But this is probably subject of its own separate 
patch :)

>> But still, eventfd is probably unnecessary complexity when another @type
>> (XEN_EVTCHN_TYPE_USER) would serve, and then just exiting to userspace
>> and let it route its evtchn port handling to the its own I/O handling thread.
> 
> Hmm... so the benefit of the eventfd is that we can wake the I/O thread
> directly instead of bouncing out to userspace on the vCPU thread only
> for it to send a signal and return to the guest? Did you ever use that,
> and it is worth the additional in-kernel code?
> 
This was my own preemptive optimization of the interface -- it's not worth the
added code for vIRQ and IPI at this point, which are what the kernel will handle
*for sure*.

> Is there any reason we'd want that for IPI or VIRQ event channels, or
> can it be only for INTERDOM/UNBOUND event channels which come later?
> 
/me nods.

No reason to keep that for IPI/vIRQ.

> I'm tempted to leave it out of the first patch, and *maybe* add it back
> in a later patch, putting it in the union alongside .virq.type.
> 
> 
> struct kvm_xen_eventfd {
>  
>  #define XEN_EVTCHN_TYPE_VIRQ 0
>  #define XEN_EVTCHN_TYPE_IPI  1
> __u32 type;
> __u32 port;
> __u32 vcpu;
> -   __s32 fd;
>  
>  #define KVM_XEN_EVENTFD_DEASSIGN   (1 << 0)
>  #define KVM_XEN_EVENTFD_UPDATE (1 << 1)
> __u32 flags;
> union {
> struct {
> __u8 type;
> } virq;
> +  struct {
> +  __s32 eventfd;
> +  } interdom; /* including unbound */
> __u32 padding[2];
> };
>} evtchn;
> 
> Does that make sense to you?
> 
Yeap! :)

Joao


Re: [PATCH RFC 11/39] KVM: x86/xen: evtchn signaling via eventfd

2020-11-30 Thread Joao Martins
On 11/30/20 9:41 AM, David Woodhouse wrote:
> On Wed, 2019-02-20 at 20:15 +0000, Joao Martins wrote:
>> userspace registers a @port to an @eventfd, that is bound to a
>>  @vcpu. This information is then used when the guest does an
>> EVTCHNOP_send with a port registered with the kernel.
> 
> Why do I want this part?
> 
It is unnecessary churn to support eventfd at this point.

The patch subject/name is also a tad misleading, as
it implements the event channel port offloading with the optional fd
being just a small detail in addition.

>> EVTCHNOP_send short-circuiting happens by marking the event as pending
>> in the shared info and vcpu info pages and doing the upcall. For IPIs
>> and interdomain event channels, we do the upcall on the assigned vcpu.
> 
> This part I understand, 'peeking' at the EVTCHNOP_send hypercall so
> that we can short-circuit IPI delivery without it having to bounce
> through userspace.
> 
> But why would I then want then short-circuit the short-circuit,
> providing an eventfd for it to signal... so that I can then just
> receive the event in userspace in a *different* form to the original
> hypercall exit I would have got?
> 

One thing I didn't quite do at the time is the whitelisting of which
unregistered ports go to userspace. Right now it's a blacklist, i.e. if a port
isn't handled in the kernel (IPIs, timer vIRQ, etc.) it goes back to userspace,
whereas the only ones which go to userspace should be those explicitly requested
as such, with the hypercall returning -ENOENT otherwise.

Perhaps eventfd could be a way to express this? Like: if you register without an
eventfd it's offloaded, otherwise it's assigned to userspace, and if it's
neither, an error is returned without bothering the VMM.

But still, eventfd is probably unnecessary complexity when another @type
(XEN_EVTCHN_TYPE_USER) would serve, just exiting to userspace and letting it
route the evtchn port handling to its own I/O handling thread.
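
To make that concrete, the dispatch I have in mind is roughly the below -- all
of these names (port_lookup(), XEN_EVTCHN_TYPE_USER, the "bounce to userspace"
return value, the kvm_xen_evtchn_send() signature) are illustrative, not an
existing API:

static int evtchn_send_dispatch(struct kvm_vcpu *vcpu, u32 port)
{
	struct evtchnfd *e = port_lookup(vcpu->kvm, port);	/* hypothetical */

	if (!e)
		return -ENOENT;		/* unknown port: NACK the guest */

	if (e->type == XEN_EVTCHN_TYPE_USER)
		return -EAGAIN;		/* placeholder for "exit to the VMM" */

	return kvm_xen_evtchn_send(vcpu, e);	/* in-kernel IPI/vIRQ fast path */
}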

Joao


Re: [PATCH] KVM: x86: clflushopt should be treated as a no-op by emulation

2020-11-13 Thread Joao Martins
On 11/3/20 12:04 PM, David Edmondson wrote:
> The instruction emulator ignores clflush instructions, yet fails to
> support clflushopt. Treat both similarly.
> 
> Fixes: 13e457e0eebf ("KVM: x86: Emulator does not decode clflush well")
> Signed-off-by: David Edmondson 

FWIW,

Reviewed-by: Joao Martins 

> ---
>  arch/x86/kvm/emulate.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 0d917eb70319..56cae1ff9e3f 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -4046,6 +4046,12 @@ static int em_clflush(struct x86_emulate_ctxt *ctxt)
>   return X86EMUL_CONTINUE;
>  }
>  
> +static int em_clflushopt(struct x86_emulate_ctxt *ctxt)
> +{
> + /* emulating clflushopt regardless of cpuid */
> + return X86EMUL_CONTINUE;
> +}
> +
>  static int em_movsxd(struct x86_emulate_ctxt *ctxt)
>  {
>   ctxt->dst.val = (s32) ctxt->src.val;
> @@ -4585,7 +4591,7 @@ static const struct opcode group11[] = {
>  };
>  
>  static const struct gprefix pfx_0f_ae_7 = {
> - I(SrcMem | ByteOp, em_clflush), N, N, N,
> + I(SrcMem | ByteOp, em_clflush), I(SrcMem | ByteOp, em_clflushopt), N, N,
>  };
>  
>  static const struct group_dual group15 = { {
> 


Re: [PATCH 00/35] Enhance memory utilization with DMEMFS

2020-10-19 Thread Joao Martins
On 10/19/20 2:37 PM, Paolo Bonzini wrote:
> On 15/10/20 00:25, Dan Williams wrote:
>> Now, with recent device-dax extensions, it
>> also has a coarse grained memory management system for  physical
>> address-space partitioning and a path for struct-page-less backing for
>> VMs. What feature gaps remain vs dmemfs, and can those gaps be closed
>> with incremental improvements to the 2 existing memory-management
>> systems?
> 
> If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
> would still create the "struct page" albeit lazily?  KVM then would use
> the usual get_user_pages() path.
> 
Correct.

The removal of struct page would be one of the added incremental improvements,
e.g. a 'map' sysfs attribute with a 'raw' mode for dynamic dax regions that
wouldn't online/create the struct pages. The remaining plumbing (...)

> Looking more closely at the implementation of dmemfs, I don't understand
> is why dmemfs needs VM_DMEM etc. and cannot provide access to mmap-ed
> memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.  If it
> did that KVM would get physical addresses using fixup_user_fault and
> never need pfn_to_page() or get_user_pages().  I'm not saying that would
> instantly be an approval, but it would make remove a lot of hooks.
> 

(...) is similar to what you describe above. Albeit there's probably no need to 
do a
remap_pfn_range at mmap(), as DAX supplies a fault/huge_fault. Also, using that 
means it's
limited to a single contiguous PFN chunk.

KVM has the bits to make it work without struct pages; I don't think there's a
need for new pg/pfn_t/VM_* bits (aside from relying on {PFN,PAGE}_SPECIAL), as
mentioned at the start of the thread. I'm storing my WIP here:

https://github.com/jpemartins/linux pageless-dax

Which is based on the first series that had been submitted earlier this year:


https://lore.kernel.org/kvm/20200110190313.17144-1-joao.m.mart...@oracle.com/

  Joao


Re: [PATCH 00/35] Enhance memory utilization with DMEMFS

2020-10-12 Thread Joao Martins
On 10/10/20 9:15 AM, yulei zhang wrote:
> On Fri, Oct 9, 2020 at 7:53 PM Joao Martins  wrote:
>> On 10/9/20 12:39 PM, yulei zhang wrote:
>>> Joao, thanks a lot for the feedback. One more thing needs to mention
>>> is that dmemfs also support fine-grained
>>> memory management which makes it more flexible for tenants with
>>> different requirements.
>>>
>> So as DAX when it allows to partition a region (starting 5.10). Meaning you 
>> have a region
>> which you dedicated to userspace. That region can then be partitioning into 
>> devices which
>> give you access to multiple (possibly discontinuous) extents with at a given 
>> page
>> granularity (selectable when you create the device), accessed through mmap().
>> You can then give that device to a cgroup. Or you can return that memory 
>> back to the
>> kernel (should you run into OOM situation), or you recreate the same 
>> mappings across
>> reboot/kexec.
>>
>> I probably need to read your patches again, but can you extend on the 
>> 'dmemfs also support
>> fine-grained memory management' to understand what is the gap that you 
>> mention?
>>
> sure, dmemfs uses bitmap to track the memory usage in the reserved
> memory region in
> a given page size granularity. And for each user the memory can be
> discrete as well.
> 
That same functionality of tracking reserved region usage across different users
at any page granularity is covered by the DAX series I mentioned below. The
discrete part -- IIUC what you meant -- is then handled by using the DAX
ABI/tools to create a device file vs a filesystem.

>>> On Fri, Oct 9, 2020 at 3:01 AM Joao Martins  
>>> wrote:
>>>>
>>>> [adding a couple folks that directly or indirectly work on the subject]
>>>>
>>>> On 10/8/20 8:53 AM, yulei.ker...@gmail.com wrote:
>>>>> From: Yulei Zhang 
>>>>>
>>>>> In current system each physical memory page is assocaited with
>>>>> a page structure which is used to track the usage of this page.
>>>>> But due to the memory usage rapidly growing in cloud environment,
>>>>> we find the resource consuming for page structure storage becomes
>>>>> highly remarkable. So is it an expense that we could spare?
>>>>>
>>>> Happy to see another person working to solve the same problem!
>>>>
>>>> I am really glad to see more folks being interested in solving
>>>> this problem and I hope we can join efforts?
>>>>
>>>> BTW, there is also a second benefit in removing struct page -
>>>> which is carving out memory from the direct map.
>>>>
>>>>> This patchset introduces an idea about how to save the extra
>>>>> memory through a new virtual filesystem -- dmemfs.
>>>>>
>>>>> Dmemfs (Direct Memory filesystem) is device memory or reserved
>>>>> memory based filesystem. This kind of memory is special as it
>>>>> is not managed by kernel and most important it is without 'struct page'.
>>>>> Therefore we can leverage the extra memory from the host system
>>>>> to support more tenants in our cloud service.
>>>>>
>>>> This is like a walk down the memory lane.
>>>>
>>>> About a year ago we followed the same exact idea/motivation to
>>>> have memory outside of the direct map (and removing struct page overhead)
>>>> and started with our own layer/thingie. However we realized that DAX
>>>> is one the subsystems which already gives you direct access to memory
>>>> for free (and is already upstream), plus a couple of things which we
>>>> found more handy.
>>>>
>>>> So we sent an RFC a couple months ago:
>>>>
>>>> https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.mart...@oracle.com/
>>>>
>>>> Since then majority of the work has been in improving DAX[1].
>>>> But now that is done I am going to follow up with the above patchset.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.st...@dwillia2-desk3.amr.corp.intel.com/
>>>>
>>>> (Give me a couple of days and I will send you the link to the latest
>>>> patches on a git-tree - would love feedback!)
>>>>
>>>> The struct page removal for DAX would then be small, and ticks the
>>>> same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
>&

Re: [PATCH 00/35] Enhance memory utilization with DMEMFS

2020-10-09 Thread Joao Martins
On 10/9/20 12:39 PM, yulei zhang wrote:
> Joao, thanks a lot for the feedback. One more thing needs to mention
> is that dmemfs also support fine-grained
> memory management which makes it more flexible for tenants with
> different requirements.
> 
So as DAX when it allows to partition a region (starting 5.10). Meaning you 
have a region
which you dedicated to userspace. That region can then be partitioning into 
devices which
give you access to multiple (possibly discontinuous) extents with at a given 
page
granularity (selectable when you create the device), accessed through mmap().
You can then give that device to a cgroup. Or you can return that memory back 
to the
kernel (should you run into OOM situation), or you recreate the same mappings 
across
reboot/kexec.

I probably need to read your patches again, but can you extend on the 'dmemfs 
also support
fine-grained memory management' to understand what is the gap that you mention?

> On Fri, Oct 9, 2020 at 3:01 AM Joao Martins  wrote:
>>
>> [adding a couple folks that directly or indirectly work on the subject]
>>
>> On 10/8/20 8:53 AM, yulei.ker...@gmail.com wrote:
>>> From: Yulei Zhang 
>>>
>>> In current system each physical memory page is assocaited with
>>> a page structure which is used to track the usage of this page.
>>> But due to the memory usage rapidly growing in cloud environment,
>>> we find the resource consuming for page structure storage becomes
>>> highly remarkable. So is it an expense that we could spare?
>>>
>> Happy to see another person working to solve the same problem!
>>
>> I am really glad to see more folks being interested in solving
>> this problem and I hope we can join efforts?
>>
>> BTW, there is also a second benefit in removing struct page -
>> which is carving out memory from the direct map.
>>
>>> This patchset introduces an idea about how to save the extra
>>> memory through a new virtual filesystem -- dmemfs.
>>>
>>> Dmemfs (Direct Memory filesystem) is device memory or reserved
>>> memory based filesystem. This kind of memory is special as it
>>> is not managed by kernel and most important it is without 'struct page'.
>>> Therefore we can leverage the extra memory from the host system
>>> to support more tenants in our cloud service.
>>>
>> This is like a walk down the memory lane.
>>
>> About a year ago we followed the same exact idea/motivation to
>> have memory outside of the direct map (and removing struct page overhead)
>> and started with our own layer/thingie. However we realized that DAX
>> is one the subsystems which already gives you direct access to memory
>> for free (and is already upstream), plus a couple of things which we
>> found more handy.
>>
>> So we sent an RFC a couple months ago:
>>
>> https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.mart...@oracle.com/
>>
>> Since then majority of the work has been in improving DAX[1].
>> But now that is done I am going to follow up with the above patchset.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.st...@dwillia2-desk3.amr.corp.intel.com/
>>
>> (Give me a couple of days and I will send you the link to the latest
>> patches on a git-tree - would love feedback!)
>>
>> The struct page removal for DAX would then be small, and ticks the
>> same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
>> support) that we both do, with a smaller diffstat and it doesn't
>> touch KVM (not at least fundamentally).
>>
>> 15 files changed, 401 insertions(+), 38 deletions(-)
>>
>> The things needed in core-mm is for handling PMD/PUD PAGE_SPECIAL much
>> like we both do. Furthermore there wouldn't be a need for a new vm type,
>> consuming an extra page bit (in addition to PAGE_SPECIAL) or new filesystem.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.st...@dwillia2-desk3.amr.corp.intel.com/
>>
>>
>>> We uses a kernel boot parameter 'dmem=' to reserve the system
>>> memory when the host system boots up, the details can be checked
>>> in /Documentation/admin-guide/kernel-parameters.txt.
>>>
>>> Theoretically for each 4k physical page it can save 64 bytes if
>>> we drop the 'struct page', so for guest memory with 320G it can
>>> save about 5G physical memory totally.
>>>
>> Also worth mentioning that if you only care about 'struct page' cost, and 
>> not on the
>> security boundary, there's also some work

Re: [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page

2020-10-09 Thread Joao Martins
On 10/9/20 1:58 AM, Sean Christopherson wrote:
> On Thu, Oct 08, 2020 at 03:54:12PM +0800, yulei.ker...@gmail.com wrote:
>> From: Yulei Zhang 
>>
>> Dmem page is pfn invalid but not mmio. Support cacheable
>> dmem page for kvm.
>>
>> Signed-off-by: Chen Zhuo 
>> Signed-off-by: Yulei Zhang 
>> ---
>>  arch/x86/kvm/mmu/mmu.c | 5 +++--
>>  include/linux/dmem.h   | 7 +++
>>  mm/dmem.c  | 7 +++
>>  3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 71aa3da2a0b7..0115c1767063 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -41,6 +41,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -2962,9 +2963,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>>   */
>>  (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
>>  
>> -return !e820__mapped_raw_any(pfn_to_hpa(pfn),
>> +return (!e820__mapped_raw_any(pfn_to_hpa(pfn),
>>   pfn_to_hpa(pfn + 1) - 1,
>> - E820_TYPE_RAM);
>> + E820_TYPE_RAM)) || (!is_dmem_pfn(pfn));
> 
> This is wrong.  As is, the logic reads "A PFN is MMIO if it is INVALID &&
> (!RAM || !DMEM)".  The obvious fix would be to change it to "INVALID &&
> !RAM && !DMEM", but that begs the question of whether or DMEM is reported
> as RAM.  I don't see any e820 related changes in the series, i.e. no evidence
> that dmem yanks its memory out of the e820 tables, which makes me think this
> change is unnecessary.
> 
Even if there were e820 changes, e820__mapped_raw_any() checks against the
hardware-provided e820 we are given before any changes happen, i.e. not the one
the kernel has modified (e820_table_firmware). So unless you're having that
memory carved from an MMIO range (which would be wrong), or the BIOS is
misrepresenting its memory map... the e820__mapped_raw_any(E820_TYPE_RAM) check
ought to be enough to cover RAM.

Or at least that has been my experience with similar work.
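
(For completeness, if the dmem check were kept, it would need to be the
conjunction Sean describes, e.g. the below -- assuming dmem regions really
aren't E820_TYPE_RAM, and with is_dmem_pfn() as introduced by this series:

static bool pfn_is_mmio_candidate(kvm_pfn_t pfn)
{
	/* "INVALID && !RAM && !DMEM", not "INVALID && (!RAM || !DMEM)" */
	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
				     pfn_to_hpa(pfn + 1) - 1,
				     E820_TYPE_RAM) &&
	       !is_dmem_pfn(pfn);
}
)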

Joao


Re: [PATCH 00/35] Enhance memory utilization with DMEMFS

2020-10-08 Thread Joao Martins
[adding a couple folks that directly or indirectly work on the subject]

On 10/8/20 8:53 AM, yulei.ker...@gmail.com wrote:
> From: Yulei Zhang 
> 
> In current system each physical memory page is assocaited with
> a page structure which is used to track the usage of this page.
> But due to the memory usage rapidly growing in cloud environment,
> we find the resource consuming for page structure storage becomes
> highly remarkable. So is it an expense that we could spare?
> 
Happy to see another person working to solve the same problem!

I am really glad to see more folks being interested in solving
this problem and I hope we can join efforts?

BTW, there is also a second benefit in removing struct page -
which is carving out memory from the direct map.

> This patchset introduces an idea about how to save the extra
> memory through a new virtual filesystem -- dmemfs.
> 
> Dmemfs (Direct Memory filesystem) is device memory or reserved
> memory based filesystem. This kind of memory is special as it
> is not managed by kernel and most important it is without 'struct page'.
> Therefore we can leverage the extra memory from the host system
> to support more tenants in our cloud service.
> 
This is like a walk down the memory lane.

About a year ago we followed the same exact idea/motivation to
have memory outside of the direct map (and removing struct page overhead)
and started with our own layer/thingie. However we realized that DAX
is one of the subsystems which already gives you direct access to memory
for free (and is already upstream), plus a couple of things which we
found more handy.

So we sent an RFC a couple months ago:

https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.mart...@oracle.com/

Since then majority of the work has been in improving DAX[1].
But now that is done I am going to follow up with the above patchset.

[1]
https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.st...@dwillia2-desk3.amr.corp.intel.com/

(Give me a couple of days and I will send you the link to the latest
patches on a git-tree - would love feedback!)

The struct page removal for DAX would then be small, and ticks the
same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
support) that we both do, with a smaller diffstat and it doesn't
touch KVM (not at least fundamentally).

15 files changed, 401 insertions(+), 38 deletions(-)

The things needed in core-mm is for handling PMD/PUD PAGE_SPECIAL much
like we both do. Furthermore there wouldn't be a need for a new vm type,
consuming an extra page bit (in addition to PAGE_SPECIAL) or new filesystem.

[1]
https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.st...@dwillia2-desk3.amr.corp.intel.com/


> We uses a kernel boot parameter 'dmem=' to reserve the system
> memory when the host system boots up, the details can be checked
> in /Documentation/admin-guide/kernel-parameters.txt. 
> 
> Theoretically for each 4k physical page it can save 64 bytes if
> we drop the 'struct page', so for guest memory with 320G it can
> save about 5G physical memory totally. 
> 
Also worth mentioning that if you only care about 'struct page' cost, and not
about the security boundary, there's also some work on hugetlbfs preallocation
of hugepages that tricks vmemmap into reusing tail pages.

  
https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuc...@bytedance.com/

Going forward that could also make sense for device-dax, to avoid allocating so
many struct pages (which would require its transition to compound struct pages
like hugetlbfs, which we are looking at too). In addition, an idea would perhaps
be to have a stricter mode in DAX where we initialize/use the metadata ('struct
page') but remove the underlying PFNs (of the 'struct page') from the direct
map, bearing the cost of mapping/unmapping on gup/pup.

Joao


Re: [PATCH v5 00/17] device-dax: support sub-dividing soft-reserved ranges

2020-09-25 Thread Joao Martins
On 9/25/20 10:01 PM, Dan Williams wrote:
> On Fri, Sep 25, 2020 at 1:52 PM Joao Martins  
> wrote:
>>
>> Hey Dan,
>>
>> On 9/25/20 8:11 PM, Dan Williams wrote:
>>> Changes since v4 [1]:
>>> - Rebased on
>>>   device-dax-move-instance-creation-parameters-to-struct-dev_dax_data.patch
>>>   in -mm [2]. I.e. patches that did not need fixups from v4 are not
>>>   included.
>>>
>>> - Folded all fixes
>>>
>>
>> Hmm, perhaps you missed the fixups before the above mentioned patch?
>>
>> From:
>>
>> https://www.ozlabs.org/~akpm/mmots/series
>>
>> under "mm/dax", I am listing those fixups here:
>>
>> x86-numa-add-nohmat-option-fix.patch
>> acpi-hmat-refactor-hmat_register_target_device-to-hmem_register_device-fix.patch
>> mm-memory_hotplug-introduce-default-phys_to_target_node-implementation-fix.patch
>> acpi-hmat-attach-a-device-for-each-soft-reserved-range-fix.patch
>>
>> (in https://www.ozlabs.org/~akpm/mmots/broken-out/)
> 
> I left those for Andrew to handle. I actually should have started this
> set one more down in his stack because that's where my new changes
> start.
> 
Ah, got it!


Re: [PATCH v5 00/17] device-dax: support sub-dividing soft-reserved ranges

2020-09-25 Thread Joao Martins
Hey Dan,

On 9/25/20 8:11 PM, Dan Williams wrote:
> Changes since v4 [1]:
> - Rebased on
>   device-dax-move-instance-creation-parameters-to-struct-dev_dax_data.patch
>   in -mm [2]. I.e. patches that did not need fixups from v4 are not
>   included.
> 
> - Folded all fixes
> 

Hmm, perhaps you missed the fixups before the above mentioned patch?

From:

https://www.ozlabs.org/~akpm/mmots/series

under "mm/dax", I am listing those fixups here:

x86-numa-add-nohmat-option-fix.patch
acpi-hmat-refactor-hmat_register_target_device-to-hmem_register_device-fix.patch
mm-memory_hotplug-introduce-default-phys_to_target_node-implementation-fix.patch
acpi-hmat-attach-a-device-for-each-soft-reserved-range-fix.patch

(in https://www.ozlabs.org/~akpm/mmots/broken-out/)

[...]

> ---
> 
> Andrew, this series replaces
> 
> device-dax-make-pgmap-optional-for-instance-creation.patch
> 
> ...through...
> 
> dax-hmem-introduce-dax_hmemregion_idle-parameter.patch
> 
> ...in your stack.
> 
> Let me know if there is a different / preferred way to refresh a bulk of
> patches in your queue when only a subset need updates.
> 



Re: [PATCH v2] iommu/amd: Restore IRTE.RemapEn bit for amd_iommu_activate_guest_mode

2020-09-16 Thread Joao Martins
On 9/16/20 12:17 PM, Suravee Suthikulpanit wrote:
> Commit e52d58d54a32 ("iommu/amd: Use cmpxchg_double() when updating
> 128-bit IRTE") removed an assumption that modify_irte_ga always set
> the valid bit, which requires the callers to set the appropriate value
> for the struct irte_ga.valid bit before calling the function.
> 
> Similar to the commit 26e495f34107 ("iommu/amd: Restore IRTE.RemapEn
> bit after programming IRTE"), which is for the function
> amd_iommu_deactivate_guest_mode().
> 
> The same change is also needed for the amd_iommu_activate_guest_mode().
> Otherwise, this could trigger IO_PAGE_FAULT for the VFIO based VMs with
> AVIC enabled.
> 
> Reported-by: Maxim Levitsky 
> Tested-by: Maxim Levitsky 
> Cc: Joao Martins 
> Fixes: e52d58d54a321 ("iommu/amd: Use cmpxchg_double() when updating 128-bit 
> IRTE")
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  drivers/iommu/amd/iommu.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index e938677af8bc..db4fb840c59c 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3900,14 +3900,18 @@ int amd_iommu_activate_guest_mode(void *data)
>  {
>   struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
>   struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
> + u64 valid;
>  
>   if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
>   !entry || entry->lo.fields_vapic.guest_mode)
>   return 0;
>  
> + valid = entry->lo.fields_vapic.valid;
> +
>   entry->lo.val = 0;
>   entry->hi.val = 0;
>  
> + entry->lo.fields_vapic.valid   = valid;
>   entry->lo.fields_vapic.guest_mode  = 1;
>   entry->lo.fields_vapic.ga_log_intr = 1;
>   entry->hi.fields.ga_root_ptr   = ir_data->ga_root_ptr;
> 
FWIW,

Reviewed-by: Joao Martins 


Re: [PATCH] iommu/amd: fix interrupt remapping for avic

2020-09-15 Thread Joao Martins
On 9/15/20 1:30 PM, Suravee Suthikulpanit wrote:
> On 9/15/20 6:25 PM, Maxim Levitsky wrote:
>> On Mon, 2020-09-14 at 21:48 +0700, Suravee Suthikulpanit wrote:
>>> Could you please try with the following patch instead?
>>>
>>> --- a/drivers/iommu/amd/iommu.c
>>> +++ b/drivers/iommu/amd/iommu.c
>>> @@ -3840,14 +3840,18 @@ int amd_iommu_activate_guest_mode(void *data)
>>>{
>>>   struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
>>>   struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
>>> +   u64 valid;
>>>
>>>   if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
>>>   !entry || entry->lo.fields_vapic.guest_mode)
>>>   return 0;
>>>
>>> +   valid = entry->lo.fields_vapic.valid;
>>> +
>>>   entry->lo.val = 0;
>>>   entry->hi.val = 0;
>>>
>>> +   entry->lo.fields_vapic.valid   = valid;
>>>   entry->lo.fields_vapic.guest_mode  = 1;
>>>   entry->lo.fields_vapic.ga_log_intr = 1;
>>>   entry->hi.fields.ga_root_ptr   = ir_data->ga_root_ptr;
>>> @@ -3864,12 +3868,14 @@ int amd_iommu_deactivate_guest_mode(void *data)
>>>   struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
>>>   struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
>>>   struct irq_cfg *cfg = ir_data->cfg;
>>> -   u64 valid = entry->lo.fields_remap.valid;
>>> +   u64 valid;
>>>
>>>   if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
>>>   !entry || !entry->lo.fields_vapic.guest_mode)
>>>   return 0;
>>>
>>> +   valid = entry->lo.fields_remap.valid;
>>> +
>>>   entry->lo.val = 0;
>>>   entry->hi.val = 0;
>> I see. I based my approach on the fact that valid bit was
>> set always to true anyway before, plus that amd_iommu_activate_guest_mode
>> should be really only called when someone activates a valid interrupt 
>> remapping
>> entry, but IMHO the approach of preserving the valid bit is safer anyway.
>>
>> It works on my system (I applied the patch manually, since either your or my 
>> email client,
>> seems to mangle the patch)
>>
> 
> Sorry for the mangled patch. I'll submit the patch w/ your information. 
> Thanks for your help reporting, debugging, and 
> testing the patch.
> 
I assume you're only doing the valid bit preservation in
amd_iommu_activate_guest_mode()? The null deref in
amd_iommu_deactivate_guest_mode() was fixed elsewhere[0] -- or are you planning
on merging both changes, like the diff you attached?

Asking also because commit 26e495f341 ("iommu/amd: Restore IRTE.RemapEn bit 
after
programming IRTE") was added in v5.4 and v5.8 stable trees but the v5.4 
backport didn't
include e52d58d54a321 ("iommu/amd: Use cmpxchg_double() when updating 128-bit 
IRTE").

Joao

[0] 
https://lore.kernel.org/linux-iommu/20200910171621.12879-1-joao.m.mart...@oracle.com/


[PATCH] iommu/amd: Fix potential @entry null deref

2020-09-10 Thread Joao Martins
After commit 26e495f34107 ("iommu/amd: Restore IRTE.RemapEn bit after
programming IRTE"), smatch warns:

drivers/iommu/amd/iommu.c:3870 amd_iommu_deactivate_guest_mode()
warn: variable dereferenced before check 'entry' (see line 3867)

Fix this by moving the @valid assignment to after @entry has been checked
for NULL.

Cc: Suravee Suthikulpanit 
Fixes: 26e495f34107 ("iommu/amd: Restore IRTE.RemapEn bit after programming 
IRTE")
Reported-by: Dan Carpenter 
Signed-off-by: Joao Martins 
---
 drivers/iommu/amd/iommu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 07ae8b93887e..8abe1c7ad45b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3864,12 +3864,14 @@ int amd_iommu_deactivate_guest_mode(void *data)
struct amd_ir_data *ir_data = (struct amd_ir_data *)data;
struct irte_ga *entry = (struct irte_ga *) ir_data->entry;
struct irq_cfg *cfg = ir_data->cfg;
-   u64 valid = entry->lo.fields_remap.valid;
+   u64 valid;
 
if (!AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir) ||
!entry || !entry->lo.fields_vapic.guest_mode)
return 0;
 
+   valid = entry->lo.fields_remap.valid;
+
entry->lo.val = 0;
entry->hi.val = 0;
 
-- 
2.17.1



Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

2020-09-08 Thread Joao Martins
[Sorry for the late response]

On 8/21/20 11:06 AM, David Hildenbrand wrote:
> On 03.08.20 07:03, Dan Williams wrote:
>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>   * could be mixed in a node with faster memory, causing
>>   * unavoidable performance issues.
>>   */
>> -numa_node = dev_dax->target_node;
>>  if (numa_node < 0) {
>>  dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>>  numa_node);
>>  return -EINVAL;
>>  }
>>  
>> -/* Hotplug starting at the beginning of the next block: */
>> -kmem_start = ALIGN(range->start, memory_block_size_bytes());
>> -
>> -kmem_size = range_len(range);
>> -/* Adjust the size down to compensate for moving up kmem_start: */
>> -kmem_size -= kmem_start - range->start;
>> -/* Align the size down to cover only complete blocks: */
>> -kmem_size &= ~(memory_block_size_bytes() - 1);
>> -kmem_end = kmem_start + kmem_size;
>> -
>> -new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>> -if (!new_res_name)
>> +res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>> +if (!res_name)
>>  return -ENOMEM;
>>  
>> -/* Region is permanently reserved if hotremove fails. */
>> -new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>> -if (!new_res) {
>> -dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>> - _start, _end);
>> -kfree(new_res_name);
>> +res = request_mem_region(range.start, range_len(), res_name);
> 
> I think our range could be empty after aligning. I assume
> request_mem_region() would check that, but maybe we could report a
> better error/warning in that case.
> 
dax_kmem_range() already returns a memory-block-aligned @range, but IIUC
request_mem_region() isn't checking for that. Having said that, the returned
@res wouldn't be different from the passed range.start.
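
Something along these lines is what I mean by memory-block-aligned (field and
helper names per the series under review; purely illustrative, not the literal
patch):

static struct range dax_kmem_range(struct dev_dax *dev_dax)
{
	struct range range = dev_dax->range;

	/* memory hotplug only deals in whole memory blocks */
	range.start = ALIGN(range.start, memory_block_size_bytes());
	range.end = ALIGN_DOWN(range.end + 1, memory_block_size_bytes()) - 1;
	return range;
}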

>>  /*
>>   * Ensure that future kexec'd kernels will not treat this as RAM
>>   * automatically.
>>   */
>> -rc = add_memory_driver_managed(numa_node, new_res->start,
>> -   resource_size(new_res), kmem_name);
>> +rc = add_memory_driver_managed(numa_node, res->start,
>> +   resource_size(res), kmem_name);
>> +
>> +res->flags |= IORESOURCE_BUSY;
> 
> Hm, I don't think that's correct. Any specific reason why to mark the
> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
> suddenly stumble over it - and e.g., similarly kexec code when trying to
> find memory for placing kexec images. I think we should leave this
> !BUSY, just as it is right now.
> 
Agreed.

>>  if (rc) {
>> -release_resource(new_res);
>> -kfree(new_res);
>> -kfree(new_res_name);
>> +release_mem_region(range.start, range_len());
>> +kfree(res_name);
>>  return rc;
>>  }
>> -dev_dax->dax_kmem_res = new_res;
>> +
>> +dev_set_drvdata(dev, res_name);
>>  
>>  return 0;
>>  }
>>  
>>  #ifdef CONFIG_MEMORY_HOTREMOVE
>> -static int dev_dax_kmem_remove(struct device *dev)
>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>>  {
>> -struct dev_dax *dev_dax = to_dev_dax(dev);
>> -struct resource *res = dev_dax->dax_kmem_res;
>> -resource_size_t kmem_start = res->start;
>> -resource_size_t kmem_size = resource_size(res);
>> -const char *res_name = res->name;
>>  int rc;
>> +struct device *dev = _dax->dev;
>> +const char *res_name = dev_get_drvdata(dev);
>> +struct range range = dax_kmem_range(dev_dax);
>>  
>>  /*
>>   * We have one shot for removing memory, if some memory blocks were not
>>   * offline prior to calling this function remove_memory() will fail, and
>>   * there is no way to hotremove this memory until reboot because device
>> - * unbind will succeed even if we return failure.
>> + * unbind will proceed regardless of the remove_memory result.
>>   */
>> -rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
>> -if (rc) {
>> -any_hotremove_failed = true;
>> -dev_err(dev,
>> -"DAX region %pR cannot be hotremoved until the next 
>> reboot\n",
>> -res);
>> -return rc;
>> +rc = remove_memory(dev_dax->target_node, range.start, 
>> range_len());
>> +if (rc == 0) {
> 
> if (!rc) ?
> 
It would be better to keep the old order:

if (rc) {
any_hotremove_failed = true;
dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next 
reboot\n",
range.start, range.end);
return;
}

release_mem_region(range.start, range_len());
dev_set_drvdata(dev, NULL);
kfree(res_name);
return;


>> +

Re: [PATCH 2/2] iommu: amd: Use cmpxchg_double() when updating 128-bit IRTE

2020-09-02 Thread Joao Martins
On 9/2/20 5:51 AM, Suravee Suthikulpanit wrote:
> When using 128-bit interrupt-remapping table entry (IRTE) (a.k.a GA mode),
> current driver disables interrupt remapping when it updates the IRTE
> so that the upper and lower 64-bit values can be updated safely.
> 
> However, this creates a small window, where the interrupt could
> arrive and result in IO_PAGE_FAULT (for interrupt) as shown below.
> 
>   IOMMU DriverDevice IRQ
>   ===
>   irte.RemapEn=0
>...
>change IRTEIRQ from device ==> IO_PAGE_FAULT !!
>...
>   irte.RemapEn=1
> 
> This scenario has been observed when changing irq affinity on a system
> running I/O-intensive workload, in which the destination APIC ID
> in the IRTE is updated.
> 
> Instead, use cmpxchg_double() to update the 128-bit IRTE at once without
> disabling the interrupt remapping. However, this means several features,
> which require GA (128-bit IRTE) support will also be affected if cmpxchg16b
> is not supported (which is unprecedented for AMD processors w/ IOMMU).
> 
Probably requires:

 Fixes: 880ac60e2538 ("iommu/amd: Introduce interrupt remapping ops structure")

?

> Reported-by: Sean Osborne 
> Tested-by: Erik Rockstrom 
> Signed-off-by: Suravee Suthikulpanit 

With the comments below addressed, FWIW:

 Reviewed-by: Joao Martins 

> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
> index c652f16eb702..ad30467f6930 100644
> --- a/drivers/iommu/amd/init.c
> +++ b/drivers/iommu/amd/init.c
> @@ -1511,7 +1511,14 @@ static int __init init_iommu_one(struct amd_iommu 
> *iommu, struct ivhd_header *h)
>   iommu->mmio_phys_end = MMIO_REG_END_OFFSET;
>   else
>   iommu->mmio_phys_end = MMIO_CNTR_CONF_OFFSET;
> - if (((h->efr_attr & (0x1 << IOMMU_FEAT_GASUP_SHIFT)) == 0))
> +
> + /*
> +  * Note: GA (128-bit IRTE) mode requires cmpxchg16b supports.
> +  * GAM also requires GA mode. Therefore, we need to
> +  * check cmbxchg16b support before enabling it.
> +  */

s/cmbxchg16b/cmpxchg16b

> + if (!boot_cpu_has(X86_FEATURE_CX16) ||
> + ((h->efr_attr & (0x1 << IOMMU_FEAT_GASUP_SHIFT)) == 0))
>   amd_iommu_guest_ir = AMD_IOMMU_GUEST_IR_LEGACY;
>   break;
>   case 0x11:
> @@ -1520,8 +1527,18 @@ static int __init init_iommu_one(struct amd_iommu 
> *iommu, struct ivhd_header *h)
>   iommu->mmio_phys_end = MMIO_REG_END_OFFSET;
>   else
>   iommu->mmio_phys_end = MMIO_CNTR_CONF_OFFSET;
> - if (((h->efr_reg & (0x1 << IOMMU_EFR_GASUP_SHIFT)) == 0))
> +
> + /*
> +  * Note: GA (128-bit IRTE) mode requires cmpxchg16b supports.
> +  * XT, GAM also requires GA mode. Therefore, we need to
> +  * check cmbxchg16b support before enabling them.

s/cmbxchg16b/cmpxchg16b

> +  */
> + if (boot_cpu_has(X86_FEATURE_CX16) ||

You probably want !boot_cpu_has(X86_FEATURE_CX16) ?

> + ((h->efr_reg & (0x1 << IOMMU_EFR_GASUP_SHIFT)) == 0)) {
>   amd_iommu_guest_ir = AMD_IOMMU_GUEST_IR_LEGACY;
> + break;
> + }
> +
>   /*
>* Note: Since iommu_update_intcapxt() leverages
>* the IOMMU MMIO access to MSI capability block registers


Re: [PATCH 1/2] iommu: amd: Restore IRTE.RemapEn bit after programming IRTE

2020-09-02 Thread Joao Martins
On 9/2/20 5:51 AM, Suravee Suthikulpanit wrote:
> Currently, the RemapEn (valid) bit is accidentally cleared when
> programming IRTE w/ guestMode=0. It should be restored to
> the prior state.
> 
Probably requires:

 Fixes: b9fc6b56f478 ("iommu/amd: Implements irq_set_vcpu_affinity() hook to 
setup vapic
mode for pass-through devices")

?

> Signed-off-by: Suravee Suthikulpanit 

FWIW,

 Reviewed-by: Joao Martins 


Re: [PATCH v2 22/22] device-dax: Introduce 'mapping' devices

2020-07-16 Thread Joao Martins
On 7/16/20 5:00 PM, Dan Williams wrote:
> On Thu, Jul 16, 2020 at 6:19 AM Joao Martins  
> wrote:
>> On 7/12/20 5:28 PM, Dan Williams wrote:
>>> In support of interrogating the physical address layout of a device with
>>> dis-contiguous ranges, introduce a sysfs directory with 'start', 'end',
>>> and 'page_offset' attributes. The alternative is trying to parse
>>> /proc/iomem, and that file will not reflect the extent layout until the
>>> device is enabled.
>>>
>>> Cc: Vishal Verma 
>>> Signed-off-by: Dan Williams 
>>> ---
>>>  drivers/dax/bus.c |  191 
>>> +
>>>  drivers/dax/dax-private.h |   14 +++
>>>  2 files changed, 203 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>>> index f342e36c69a1..8b6c4ddc5f42 100644
>>> --- a/drivers/dax/bus.c
>>> +++ b/drivers/dax/bus.c
>>> @@ -579,6 +579,167 @@ struct dax_region *alloc_dax_region(struct device 
>>> *parent, int region_id,
>>>  }
>>>  EXPORT_SYMBOL_GPL(alloc_dax_region);
>>>
>>> +static void dax_mapping_release(struct device *dev)
>>> +{
>>> + struct dax_mapping *mapping = to_dax_mapping(dev);
>>> + struct dev_dax *dev_dax = to_dev_dax(dev->parent);
>>> +
>>> + ida_free(&dev_dax->ida, mapping->id);
>>> + kfree(mapping);
>>> +}
>>> +
>>> +static void unregister_dax_mapping(void *data)
>>> +{
>>> + struct device *dev = data;
>>> + struct dax_mapping *mapping = to_dax_mapping(dev);
>>> + struct dev_dax *dev_dax = to_dev_dax(dev->parent);
>>> + struct dax_region *dax_region = dev_dax->region;
>>> +
>>> + dev_dbg(dev, "%s\n", __func__);
>>> +
>>> + device_lock_assert(dax_region->dev);
>>> +
>>> + dev_dax->ranges[mapping->range_id].mapping = NULL;
>>> + mapping->range_id = -1;
>>> +
>>> + device_del(dev);
>>> + put_device(dev);
>>> +}
>>> +
>>> +static struct dev_dax_range *get_dax_range(struct device *dev)
>>> +{
>>> + struct dax_mapping *mapping = to_dax_mapping(dev);
>>> + struct dev_dax *dev_dax = to_dev_dax(dev->parent);
>>> + struct dax_region *dax_region = dev_dax->region;
>>> +
>>> + device_lock(dax_region->dev);
>>> + if (mapping->range_id < 1) {
>> ^ it's 'mapping->range_id < 0'
>>
>> Otherwise 'mapping0' sysfs entries won't work.
>> Disabled ranges use id -1.
> 
> Whoops, yes. Needs a unit test.
> 
If it helps, this particular patch for daxctl:

https://lore.kernel.org/linux-nvdimm/20200716184707.23018-7-joao.m.mart...@oracle.com/

may help in the immediate term: if it's broken, no mappings are listed.

But yeah, a unit test in 'test/daxctl-create.sh' should be added.

Joao


[PATCH v1 3/4] dax/hmem: Introduce dax_hmem.region_idle parameter

2020-07-16 Thread Joao Martins
Introduce a new module parameter for dax_hmem which
initializes all region devices as free, rather than allocating
a pagemap for the region by default.

All hmem devices created with dax_hmem.region_idle=1 will have full
available size for creating dynamic dax devices.

Signed-off-by: Joao Martins 
---
 drivers/dax/hmem/hmem.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1a3347bb6143..1bf040dbc834 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -5,6 +5,9 @@
 #include 
 #include "../bus.h"
 
+static bool region_idle;
+module_param_named(region_idle, region_idle, bool, 0644);
+
 static int dax_hmem_probe(struct platform_device *pdev)
 {
struct device *dev = &pdev->dev;
@@ -30,7 +33,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = -1,
-   .size = resource_size(res),
+   .size = region_idle ? 0 : resource_size(res),
};
dev_dax = devm_create_dev_dax(&data);
if (IS_ERR(dev_dax))
-- 
2.17.1



[PATCH v1 2/4] device-dax: Add an 'align' attribute

2020-07-16 Thread Joao Martins
Introduce a device align attribute. While doing so, rename the
region align attribute to be more explicitly named as such, but
keep it named @align in sysfs to retain the API for tools like
daxctl.

Changes to the align value may not always be valid, e.g. when
certain mappings were created with 2M and we then switch to 1G.
So we validate all existing ranges against the new value being
attempted, after resizing.

Signed-off-by: Joao Martins 
---
 drivers/dax/bus.c | 101 +-
 1 file changed, 92 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2578651c596e..eb384dd6a376 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -230,14 +230,15 @@ static ssize_t region_size_show(struct device *dev,
 static struct device_attribute dev_attr_region_size = __ATTR(size, 0444,
region_size_show, NULL);
 
-static ssize_t align_show(struct device *dev,
+static ssize_t region_align_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
struct dax_region *dax_region = dev_get_drvdata(dev);
 
return sprintf(buf, "%u\n", dax_region->align);
 }
-static DEVICE_ATTR_RO(align);
+static struct device_attribute dev_attr_region_align =
+   __ATTR(align, 0400, region_align_show, NULL);
 
 #define for_each_dax_region_resource(dax_region, res) \
for (res = (dax_region)->res.child; res; res = res->sibling)
@@ -488,7 +489,7 @@ static umode_t dax_region_visible(struct kobject *kobj, 
struct attribute *a,
 static struct attribute *dax_region_attributes[] = {
&dev_attr_available_size.attr,
&dev_attr_region_size.attr,
-   &dev_attr_align.attr,
+   &dev_attr_region_align.attr,
&dev_attr_create.attr,
&dev_attr_seed.attr,
&dev_attr_delete.attr,
@@ -855,14 +856,13 @@ static ssize_t size_show(struct device *dev,
return sprintf(buf, "%llu\n", size);
 }
 
-static bool alloc_is_aligned(struct dax_region *dax_region,
-   resource_size_t size)
+static bool alloc_is_aligned(resource_size_t size, unsigned long align)
 {
/*
 * The minimum mapping granularity for a device instance is a
 * single subsection, unless the arch says otherwise.
 */
-   return IS_ALIGNED(size, max_t(unsigned long, dax_region->align,
+   return IS_ALIGNED(size, max_t(unsigned long, align,
memremap_compat_align()));
 }
 
@@ -958,7 +958,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return dev_dax_shrink(dev_dax, size);
 
to_alloc = size - dev_size;
-   if (dev_WARN_ONCE(dev, !alloc_is_aligned(dax_region, to_alloc),
+   if (dev_WARN_ONCE(dev, !alloc_is_aligned(to_alloc, dev_dax->align),
"resize of %pa misaligned\n", _alloc))
return -ENXIO;
 
@@ -1022,7 +1022,7 @@ static ssize_t size_store(struct device *dev, struct 
device_attribute *attr,
if (rc)
return rc;
 
-   if (!alloc_is_aligned(dax_region, val)) {
+   if (!alloc_is_aligned(val, dev_dax->align)) {
dev_dbg(dev, "%s: size: %lld misaligned\n", __func__, val);
return -EINVAL;
}
@@ -1041,6 +1041,87 @@ static ssize_t size_store(struct device *dev, struct 
device_attribute *attr,
 }
 static DEVICE_ATTR_RW(size);
 
+static ssize_t align_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+
+   return sprintf(buf, "%d\n", dev_dax->align);
+}
+
+static ssize_t dev_dax_set_align(struct dev_dax *dev_dax,
+unsigned long long align)
+{
+   resource_size_t dev_size = dev_dax_size(dev_dax);
+   struct device *dev = &dev_dax->dev;
+   ssize_t rc, i;
+
+   if (dev->driver)
+   return -EBUSY;
+
+   rc = -EINVAL;
+   if (dev_size > 0 && !alloc_is_aligned(dev_size, align)) {
+   dev_dbg(dev, "%s: align %llx invalid for size %llu\n",
+   __func__, align, dev_size);
+   return rc;
+   }
+
+   for (i = 0; i < dev_dax->nr_range; i++) {
+   size_t len = range_len(&dev_dax->ranges[i].range);
+
+   if (!alloc_is_aligned(len, align)) {
+   dev_dbg(dev, "%s: align %llx invalid for range %ld\n",
+   __func__, align, i);
+   return rc;
+   }
+   }
+
+   switch (align) {
+   case PUD_SIZE:
+   if (!IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+   break;
+   fallthrough;
+   case PMD_SIZE:
+   if (!has_transparent_hugepage())
+   break;
+   fallthrough;
+   case PAGE_SIZE:
+   rc = 0;
+   

[PATCH v1 1/4] device-dax: Make align a per-device property

2020-07-16 Thread Joao Martins
Introduce @align to struct dev_dax.

When creating a new device, we still initialize to the default
dax_region @align. Child devices belonging to a region may wish
to keep a different alignment property instead of a global
region-defined one.

Signed-off-by: Joao Martins 
---
 drivers/dax/bus.c |  1 +
 drivers/dax/dax-private.h |  1 +
 drivers/dax/device.c  | 35 +++
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 8b6c4ddc5f42..2578651c596e 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1215,6 +1215,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data 
*data)
 
dev_dax->dax_dev = dax_dev;
dev_dax->target_node = dax_region->target_node;
+   dev_dax->align = dax_region->align;
ida_init(&dev_dax->ida);
kref_get(&dax_region->kref);
 
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 13780f62b95e..96ef5a8ae0ba 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -62,6 +62,7 @@ struct dax_mapping {
 struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
+   unsigned int align;
int target_node;
int id;
struct ida ida;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2bfc5c83e3b0..346c7bb8cf06 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -17,7 +17,6 @@
 static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
const char *func)
 {
-   struct dax_region *dax_region = dev_dax->region;
struct device *dev = &dev_dax->dev;
unsigned long mask;
 
@@ -32,7 +31,7 @@ static int check_vma(struct dev_dax *dev_dax, struct 
vm_area_struct *vma,
return -EINVAL;
}
 
-   mask = dax_region->align - 1;
+   mask = dev_dax->align - 1;
if (vma->vm_start & mask || vma->vm_end & mask) {
dev_info_ratelimited(dev,
"%s: %s: fail, unaligned vma (%#lx - %#lx, 
%#lx)\n",
@@ -86,13 +85,13 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
 
dax_region = dev_dax->region;
-   if (dax_region->align > PAGE_SIZE) {
+   if (dev_dax->align > PAGE_SIZE) {
dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
-   dax_region->align, fault_size);
+   dev_dax->align, fault_size);
return VM_FAULT_SIGBUS;
}
 
-   if (fault_size != dax_region->align)
+   if (fault_size != dev_dax->align)
return VM_FAULT_SIGBUS;
 
phys = dax_pgoff_to_phys(dev_dax, vmf->pgoff, PAGE_SIZE);
@@ -120,15 +119,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
 
dax_region = dev_dax->region;
-   if (dax_region->align > PMD_SIZE) {
+   if (dev_dax->align > PMD_SIZE) {
dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
-   dax_region->align, fault_size);
+   dev_dax->align, fault_size);
return VM_FAULT_SIGBUS;
}
 
-   if (fault_size < dax_region->align)
+   if (fault_size < dev_dax->align)
return VM_FAULT_SIGBUS;
-   else if (fault_size > dax_region->align)
+   else if (fault_size > dev_dax->align)
return VM_FAULT_FALLBACK;
 
/* if we are outside of the VMA */
@@ -164,15 +163,15 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
 
dax_region = dev_dax->region;
-   if (dax_region->align > PUD_SIZE) {
+   if (dev_dax->align > PUD_SIZE) {
dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
-   dax_region->align, fault_size);
+   dev_dax->align, fault_size);
return VM_FAULT_SIGBUS;
}
 
-   if (fault_size < dax_region->align)
+   if (fault_size < dev_dax->align)
return VM_FAULT_SIGBUS;
-   else if (fault_size > dax_region->align)
+   else if (fault_size > dev_dax->align)
return VM_FAULT_FALLBACK;
 
/* if we are outside of the VMA */
@@ -267,9 +266,8 @@ static int dev_dax_split(struct vm_area_struct *vma, 
unsigned long addr)
 {
struct file *filp = vma->vm_file;
struct dev_dax *dev_dax = filp->private_data;
-   struct dax_region *dax_region = dev_dax->region;
 
-   if (!IS_ALIGNED(addr, dax_region->align))
+   if (!IS_ALIGNED(addr, dev_dax->align))
return -EINVAL;
return 0;
 }
@@ -278,9 

[PATCH v1 0/4] device-dax: Further improvements to subdivision

2020-07-16 Thread Joao Martins
Hey,

Here's a small set of improvements that I was hoping you would consider as part
of your Soft-Reserved series[0]; should you think they make sense. It does the
following:

  Patch 1-2: Add an align sysfs attribute, as opposed to being limited to 2M.
  This brings parity to static dax regions, which support 1G.

  Patch 3: Add a module parameter to hmem to initialize the region as idle,
  with the full available_size left for child devices. When the region is
  going to be partitioned and assigned to guests by default, it doesn't help
  to initialize the region up front. Also, if the region starts mostly
  non-idle and its memmap is bigger than System RAM, there may not be
  enough space for the device to be probed successfully.

  Patch 4: Add a sysfs attribute for range allocation. It is a single entry
  where you pass a range, and where the write ordering plus range length
  designate the page_offset. It is meant for recreating the same mappings
  after kexec (retaining the GPA->HPA association), but could also allow
  applications to implement their own allocation strategy. See the short
  usage sketch right after this list.
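
For illustration, a minimal userspace sketch of the intended usage (device
path and addresses are made up for the example; the attribute takes a
'<start>-<end>' pair in hex, and pgoff is implied by write order):

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical physical range, 1G worth, end is inclusive */
		unsigned long long start = 0x240000000ULL;
		unsigned long long end = start + (1ULL << 30) - 1;
		FILE *f = fopen("/sys/bus/dax/devices/dax0.1/mapping", "w");

		if (!f)
			return 1;
		/* the first successful write maps pgoff 0.., the next follows it */
		fprintf(f, "%llx-%llx", start, end);
		return fclose(f) ? 1 : 0;
	}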

I will submit a separate daxctl counterpart shortly, to help visualize its use.

[0] 
https://lore.kernel.org/linux-mm/159457116473.754248.7879464730875147365.st...@dwillia2-desk3.amr.corp.intel.com

Joao

Joao Martins (4):
  device-dax: Make align a per-device property
  device-dax: Add an 'align' attribute
  dax/hmem: Introduce dax_hmem.region_idle parameter
  device-dax: Add a range mapping allocation attribute

 drivers/dax/bus.c | 166 +++---
 drivers/dax/dax-private.h |   1 +
 drivers/dax/device.c  |  35 
 drivers/dax/hmem/hmem.c   |   5 +-
 4 files changed, 177 insertions(+), 30 deletions(-)

-- 
2.17.1



[PATCH v1 4/4] device-dax: Add a range mapping allocation attribute

2020-07-16 Thread Joao Martins
Add a sysfs attribute which denotes a range from the dax region
to be allocated. It's a write-only @mapping sysfs attribute in
the format of '<start>-<end>' to allocate a range. @start and
@end use hexadecimal values and the @pgoff is implicitly ordered
wrt previous writes to the @mapping sysfs file, e.g. for a first
write of a 1G range the pgoff spans 0..1G(-4K), and a second
write will use @pgoff starting at 1G+4K, and so on.

This range mapping interface is useful for:

 1) Application which want to implement its own allocation logic,
 and thus pick the desired ranges from dax_region.

 2) For use cases like VMM fast restart[0] where after kexec we
 want to the same gpa<->phys mappings (as originally created
 before kexec).

[0] 
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Joao Martins 
---
 drivers/dax/bus.c | 64 +++
 1 file changed, 64 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index eb384dd6a376..83cc55d7517d 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1041,6 +1041,67 @@ static ssize_t size_store(struct device *dev, struct 
device_attribute *attr,
 }
 static DEVICE_ATTR_RW(size);
 
+static ssize_t range_parse(const char *opt, size_t len, struct range *range)
+{
+   unsigned long long addr = 0;
+   char *start, *end, *str;
+   ssize_t rc = EINVAL;
+
+   str = kstrdup(opt, GFP_KERNEL);
+   if (!str)
+   return rc;
+
+   end = str;
+   start = strsep(&end, "-");
+   if (!start || !end)
+   goto err;
+
+   rc = kstrtoull(start, 16, &addr);
+   if (rc)
+   goto err;
+   range->start = addr;
+
+   rc = kstrtoull(end, 16, &addr);
+   if (rc)
+   goto err;
+   range->end = addr;
+
+err:
+   kfree(str);
+   return rc;
+}
+
+static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
+   const char *buf, size_t len)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+   struct dax_region *dax_region = dev_dax->region;
+   size_t to_alloc;
+   struct range r;
+   ssize_t rc;
+
+   rc = range_parse(buf, len, &r);
+   if (rc)
+   return rc;
+
+   rc = -ENXIO;
+   device_lock(dax_region->dev);
+   if (!dax_region->dev->driver) {
+   device_unlock(dax_region->dev);
+   return rc;
+   }
+   device_lock(dev);
+
+   to_alloc = range_len(&r);
+   if (alloc_is_aligned(to_alloc, dev_dax->align))
+   rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+   device_unlock(dev);
+   device_unlock(dax_region->dev);
+
+   return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_WO(mapping);
+
 static ssize_t align_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -1182,6 +1243,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, 
struct attribute *a, int n)
return 0;
if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
return 0;
+   if (a == &dev_attr_mapping.attr && is_static(dax_region))
+   return 0;
if ((a == &dev_attr_align.attr ||
 a == &dev_attr_size.attr) && is_static(dax_region))
return 0444;
@@ -1191,6 +1254,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, 
struct attribute *a, int n)
 static struct attribute *dev_dax_attributes[] = {
&dev_attr_modalias.attr,
&dev_attr_size.attr,
+   &dev_attr_mapping.attr,
&dev_attr_target_node.attr,
&dev_attr_align.attr,
&dev_attr_resource.attr,
-- 
2.17.1



Re: [PATCH v2 22/22] device-dax: Introduce 'mapping' devices

2020-07-16 Thread Joao Martins
On 7/12/20 5:28 PM, Dan Williams wrote:
> In support of interrogating the physical address layout of a device with
> dis-contiguous ranges, introduce a sysfs directory with 'start', 'end',
> and 'page_offset' attributes. The alternative is trying to parse
> /proc/iomem, and that file will not reflect the extent layout until the
> device is enabled.
> 
> Cc: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/bus.c |  191 
> +
>  drivers/dax/dax-private.h |   14 +++
>  2 files changed, 203 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index f342e36c69a1..8b6c4ddc5f42 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -579,6 +579,167 @@ struct dax_region *alloc_dax_region(struct device 
> *parent, int region_id,
>  }
>  EXPORT_SYMBOL_GPL(alloc_dax_region);
>  
> +static void dax_mapping_release(struct device *dev)
> +{
> + struct dax_mapping *mapping = to_dax_mapping(dev);
> + struct dev_dax *dev_dax = to_dev_dax(dev->parent);
> +
> + ida_free(&dev_dax->ida, mapping->id);
> + kfree(mapping);
> +}
> +
> +static void unregister_dax_mapping(void *data)
> +{
> + struct device *dev = data;
> + struct dax_mapping *mapping = to_dax_mapping(dev);
> + struct dev_dax *dev_dax = to_dev_dax(dev->parent);
> + struct dax_region *dax_region = dev_dax->region;
> +
> + dev_dbg(dev, "%s\n", __func__);
> +
> + device_lock_assert(dax_region->dev);
> +
> + dev_dax->ranges[mapping->range_id].mapping = NULL;
> + mapping->range_id = -1;
> +
> + device_del(dev);
> + put_device(dev);
> +}
> +
> +static struct dev_dax_range *get_dax_range(struct device *dev)
> +{
> + struct dax_mapping *mapping = to_dax_mapping(dev);
> + struct dev_dax *dev_dax = to_dev_dax(dev->parent);
> + struct dax_region *dax_region = dev_dax->region;
> +
> + device_lock(dax_region->dev);
> + if (mapping->range_id < 1) {
^ it's 'mapping->range_id < 0'

Otherwise 'mapping0' sysfs entries won't work.
Disabled ranges use id -1.

Joao


Re: [PATCH 11/12] device-dax: Add dis-contiguous resource support

2020-05-12 Thread Joao Martins
On 3/23/20 11:55 PM, Dan Williams wrote:
> @@ -561,13 +580,26 @@ static int __alloc_dev_dax_range(struct dev_dax 
> *dev_dax, u64 start,
>   if (start == U64_MAX)
>   return -EINVAL;
>  
> + ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
> + * (dev_dax->nr_range + 1), GFP_KERNEL);
> + if (!ranges)
> + return -ENOMEM;
> +
>   alloc = __request_region(res, start, size, dev_name(dev), 0);
> - if (!alloc)
> + if (!alloc) {
> + kfree(ranges);
>   return -ENOMEM;
> + }

Noticed this yesterday while looking at alloc_dev_dax_range().

Is it correct to free @ranges here on __request_region failure?

IIUC, a successful krealloc() has already freed the old dev_dax->ranges, so
freeing @ranges here on __request_region() failure would leave us without
any valid ranges. These @ranges are used afterwards when we delete the
interface and free the assigned regions. Perhaps we should remove the
kfree() above and assign dev_dax->ranges before calling __request_region();
or alternatively change the call order between krealloc() and
__request_region() (rough sketch of the latter below). FWIW, krealloc()
checks whether the object being reallocated already meets the requested
size, so perhaps there's no harm in going with the former.
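
Something like this, for the latter (untested, just to illustrate the
reordering; the krealloc()'d array is only committed once the region
request succeeded):

	alloc = __request_region(res, start, size, dev_name(dev), 0);
	if (!alloc)
		return -ENOMEM;

	ranges = krealloc(dev_dax->ranges,
			  sizeof(*ranges) * (dev_dax->nr_range + 1), GFP_KERNEL);
	if (!ranges) {
		__release_region(res, start, size);
		return -ENOMEM;
	}
	dev_dax->ranges = ranges;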

Joao


Re: [PATCH] cpuidle-haltpoll: make haltpoll aware of 'idle=' override

2019-10-17 Thread Joao Martins
On 10/17/19 8:20 PM, Rafael J. Wysocki wrote:
> On Thu, Oct 17, 2019 at 2:41 AM Zhenzhong Duan
>  wrote:
>> and haltpoll
>> is built in. If haltpoll is built as a module, still give a chance for
>> admin to use it despite 'idle='.
> 
> Why exactly?  Do you have any particular use case in mind?
> 
There was no concrete use-case in particular; we thought that, when built as
a module, the general usage would be for cpuidle-haltpoll to be inserted (or
removed) at some point in time, and hence to let it override the boot
'idle=' setting, given that you can always remove the module and go back to
'idle='.

> Otherwise I'd prefer the behavior to be consistent regardless of
> whether or not it is a module..
> 
It's best we just remove the conditional and keep it consistent across builtin
and module.

Joao


[PATCH v3 0/4] cpuidle, haltpoll: governor switching on idle register

2019-09-07 Thread Joao Martins
Hey,

Presented herewith is a series which aims to tie together the haltpoll
idle driver and governor, without sacrificing previous governor setups.
In addition, there are a few fixes with respect to module loading for
cpuidle-haltpoll.

The series is organized as follows:

 Patch 1: Allows an idle driver to state a preferred governor that it
  wants to use, based on the discussion here:

  https://lore.kernel.org/kvm/457e8ca1-beb3-ca39-b257-e7bc6bb35...@oracle.com/

 Patch 2: Decreases the rating of the haltpoll governor, so that previous
  defaults remain as they were before haltpoll, while using @governor to
  switch to haltpoll when the haltpoll driver is registered (see the sketch
  after this list);

 Patch 3 - 4: Module loading fixes. The first fixes the incorrect error
  reporting and the second supports module unloading.
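
To make the end result concrete, this is roughly what the driver side looks
like once the series is applied (simplified; see patches 1-2 for the real
wiring):

	static struct cpuidle_driver haltpoll_driver = {
		.name = "haltpoll",
		/* preferred governor: cpuidle_register_driver() looks it up
		 * by name and switches to it on registration */
		.governor = "haltpoll",
		.states = {
			...
		},
	};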

Thanks,
Joao

v3:
* Fixed ARM build issues.

v2:
* Add missing Fixes tag on patches 3 and 4.

Joao Martins (4):
  cpuidle: allow governor switch on cpuidle_register_driver()
  cpuidle-haltpoll: set haltpoll as preferred governor
  cpuidle-haltpoll: return -ENODEV on modinit failure
  cpuidle-haltpoll: do not set an owner to allow modunload

 drivers/cpuidle/cpuidle-haltpoll.c   |  4 ++--
 drivers/cpuidle/cpuidle.h|  2 ++
 drivers/cpuidle/driver.c | 25 +
 drivers/cpuidle/governor.c   |  7 ---
 drivers/cpuidle/governors/haltpoll.c |  2 +-
 include/linux/cpuidle.h  |  3 +++
 6 files changed, 37 insertions(+), 6 deletions(-)

-- 
2.17.1



[PATCH v3 4/4] cpuidle-haltpoll: do not set an owner to allow modunload

2019-09-07 Thread Joao Martins
cpuidle-haltpoll can be built as a module to allow optional late load.
Given we are setting @owner to THIS_MODULE, cpuidle will attempt to grab a
module reference every time a cpuidle_device is registered -- so
essentially all online cpus get a reference.

This prevents the module from being unloaded later, which makes the
module_exit callback entirely unused. Thus remove the @owner and allow
the module to be unloaded.

Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver")
Signed-off-by: Joao Martins 
---
v2:
 * Added Fixes tag
---
 drivers/cpuidle/cpuidle-haltpoll.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-haltpoll.c 
b/drivers/cpuidle/cpuidle-haltpoll.c
index 7a0239ef717e..49a65c6fe91e 100644
--- a/drivers/cpuidle/cpuidle-haltpoll.c
+++ b/drivers/cpuidle/cpuidle-haltpoll.c
@@ -35,7 +35,6 @@ static int default_enter_idle(struct cpuidle_device *dev,
 static struct cpuidle_driver haltpoll_driver = {
.name = "haltpoll",
.governor = "haltpoll",
-   .owner = THIS_MODULE,
.states = {
{ /* entry 0 is for polling */ },
{
-- 
2.17.1


