[PATCH v2 2/6] mm: introduce put_user_page*(), placeholder versions

2018-11-10 Thread john . hubbard
From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.
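
For illustration, here is the shape of a typical call-site conversion
(this is essentially what the infiniband patch later in this series does;
the variable names are just for the example):

    /* Before: open-coded release of get_user_pages()-pinned pages */
    for (i = 0; i < npages; i++) {
            if (dirty)
                    set_page_dirty_lock(pages[i]);
            put_page(pages[i]);
    }

    /* After: */
    if (dirty)
            put_user_pages_dirty_lock(pages, npages);
    else
            put_user_pages(pages, npages);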

This is the first step of fixing the problem described in [1]. The steps
are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jan Kara 
Cc: Al Viro 
Cc: Jerome Glisse 
Cc: Christoph Hellwig 
Cc: Ralph Campbell 

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 20 
 mm/swap.c  | 80 ++
 2 files changed, 100 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..09fbb2c81aba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+static inline void put_user_page(struct page *page)
+{
+   put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index aa483719922e..bb8c32595e5f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+  unsigned long npages,
+  set_dirty_func sdf)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   struct page *page = compound_head(pages[index]);
+
+   if (!PageDirty(page))
+   sdf(page);
+
+   put_user_page(page);
+   }
+}
+
+/*
+ * put_user_pages_dirty() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+   __put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/*
+ * put_user_pages_dirty_lock() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.

[PATCH v2 1/6] mm/gup: finish consolidating error handling

2018-11-10 Thread john . hubbard
From: John Hubbard 

An upcoming patch wants to be able to operate on each page that
get_user_pages has retrieved. In order to do that, it's best to
have a common exit point from the routine. Most of this has been
taken care of by commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while
pinning pages"), but there was one case remaining.

Also, there was still an unnecessary shadow declaration (with a
different type) of the "ret" variable, which this commit removes.
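
For illustration only (not part of this patch): with a single exit point
in place, a later patch in this series can post-process each pinned page
in one spot. Roughly sketched, with pin_page_for_dma() being the hook
that patch 6/6 adds:

    out:
            if (pages)
                    for (j = 0; j < i; j++)
                            pin_page_for_dma(pages[j]); /* hook added by 6/6 */

            return i ? i : ret;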

Cc: Keith Busch 
Cc: Dan Williams 
Cc: Kirill A. Shutemov 
Cc: Dave Hansen 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f76e77a2d34b..55a41dee0340 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -696,12 +696,11 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
-   int ret;
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
-   return i ? : ret;
+   goto out;
ctx.page_mask = 0;
goto next_page;
}
-- 
2.19.1



[PATCH v2 3/6] infiniband/mm: convert put_page() to put_user_page*()

2018-11-10 Thread john . hubbard
From: John Hubbard 

For infiniband code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page().

This is a tiny part of the second step of fixing the problem described
in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Cc: Doug Ledford 
Cc: Jason Gunthorpe 
Cc: Mike Marciniszyn 
Cc: Dennis Dalessandro 
Cc: Christian Benvenuti 

Reviewed-by: Jan Kara 
Reviewed-by: Dennis Dalessandro 
Acked-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  7 ---
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
 7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c6144df47ea4..c2898bc7b3b2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
page = sg_page(sg);
-   if (!PageDirty(page) && umem->writable && dirty)
-   set_page_dirty_lock(page);
-   put_page(page);
+   if (umem->writable && dirty)
+   put_user_pages_dirty_lock(&page, 1);
+   else
+   put_user_page(page);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 2b4c5e7dd5a1..8b5116b52d2a 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -667,7 +667,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
ret = -EFAULT;
break;
}
-   put_page(local_page_list[j]);
+   put_user_page(local_page_list[j]);
continue;
}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, 
unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   size_t i;
-
-   for (i = 0; i < npages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_page(p[i]);
-   }
+   if (dirty)
+   put_user_pages_dirty_lock(p, npages);
+   else
+   put_user_pages(p, npages);
 
if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 
ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
-   put_page(pages[0]);
+   put_user_page(pages[0]);
goto out;
}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, 
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->p

[PATCH v2 6/6] mm: track gup pages with page->dma_pinned_* fields

2018-11-10 Thread john . hubbard
From: John Hubbard 

This patch sets and restores the new page->dma_pinned_flags and
page->dma_pinned_count fields, but does not actually use them for
anything yet.

In order to use these fields at all, the page must be removed from
any LRU list that it's on. The patch also adds some precautions that
prevent the page from getting moved back onto an LRU, once it is
in this state.

This is in preparation to fix some problems that came up when using
devices (NICs, GPUs, for example) that set up direct access to a chunk
of system (CPU) memory, so that they can DMA to/from that memory.
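
Roughly sketched (illustrative only: this is not the literal mm/swap.c hunk,
and it assumes putback_lru_page() is how LRU state gets restored), the
release side undoes the pin:

    /* Illustrative sketch of the release-side counterpart: */
    void put_user_page(struct page *page)
    {
            struct zone *zone;

            page = compound_head(page);
            zone = page_zone(page);

            spin_lock(zone_gup_lock(zone));

            if (atomic_dec_and_test(&page->dma_pinned_count)) {
                    ClearPageDmaPinned(page);

                    /* Only pages that came off an LRU go back to one. */
                    if (PageDmaPinnedWasLru(page)) {
                            ClearPageDmaPinnedWasLru(page);
                            putback_lru_page(page);
                    }
            }

            spin_unlock(zone_gup_lock(zone));

            put_page(page);
    }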

Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jan Kara 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 19 +--
 mm/gup.c   | 55 +--
 mm/memcontrol.c|  8 +++
 mm/swap.c  | 58 ++
 4 files changed, 125 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 09fbb2c81aba..6c64b1e0b777 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -950,6 +950,10 @@ static inline void put_page(struct page *page)
 {
page = compound_head(page);
 
+   VM_BUG_ON_PAGE(PageDmaPinned(page) &&
+  page_ref_count(page) <
+   atomic_read(&page->dma_pinned_count),
+  page);
/*
 * For devmap managed pages we need to catch refcount transition from
 * 2 to 1, when refcount reach one it means the page is free and we
@@ -964,21 +968,10 @@ static inline void put_page(struct page *page)
 }
 
 /*
- * put_user_page() - release a page that had previously been acquired via
- * a call to one of the get_user_pages*() functions.
- *
  * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * one of these put_user_pages*() routines:
  */
-static inline void put_user_page(struct page *page)
-{
-   put_page(page);
-}
-
+void put_user_page(struct page *page);
 void put_user_pages_dirty(struct page **pages, unsigned long npages);
 void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
 void put_user_pages(struct page **pages, unsigned long npages);
diff --git a/mm/gup.c b/mm/gup.c
index 55a41dee0340..ec1b26591532 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -25,6 +25,50 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+static void pin_page_for_dma(struct page *page)
+{
+   int ret = 0;
+   struct zone *zone;
+
+   page = compound_head(page);
+   zone = page_zone(page);
+
+   spin_lock(zone_gup_lock(zone));
+
+   if (PageDmaPinned(page)) {
+   /* Page was not on an LRU list, because it was DMA-pinned. */
+   VM_BUG_ON_PAGE(PageLRU(page), page);
+
+   atomic_inc(&page->dma_pinned_count);
+   goto unlock_out;
+   }
+
+   /*
+* Note that page->dma_pinned_flags is unioned with page->lru.
+* The rules are: reading PageDmaPinned(page) is allowed even if
+* PageLRU(page) is true. That works because of pointer alignment:
+* the PageDmaPinned bit is less than the pointer alignment, so
+* either the page is on an LRU, or (maybe) the PageDmaPinned
+* bit is set.
+*
+* However, SetPageDmaPinned requires that the page is both locked,
+* and also, removed from the LRU.
+*
+* The other flag, PageDmaPinnedWasLru, is not used for
+* synchronization, and so is only read or written after we are
+* certain that the full page->dma_pinned_flags field is available.
+*/
+   ret = isolate_lru_page(page);
+   if (ret == 0)
+   SetPageDmaPinnedWasLru(page);
+
+   atomic_set(&page->dma_pinned_count, 1);
+   SetPageDmaPinned(page);
+
+unlock_out:
+   spin_unlock(zone_gup_lock(zone));
+}
+
 static struct page *no_page_table(struct vm_area_struct *vma,
unsigned int flags)
 {
@@ -670,7 +714,7 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *nonblocking)
 {
-   long ret = 0, i = 0;
+   long ret = 0, i = 0, j;
struct vm_area_struct *vma = NULL;
struct follow_page_context ctx = { NULL };
 
@@ -774,6 +818,10 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
nr_pages -= page_increm;
} while (nr_pages);
 out:
+   if (pages)
+   for (j = 0; j < i; j++)
+ 

[PATCH v2 5/6] mm: introduce zone_gup_lock, for dma-pinned pages

2018-11-10 Thread john . hubbard
From: John Hubbard 

The page->dma_pinned_flags and _count fields require
lock protection. A lock at approximately the granularity
of the zone_lru_lock is called for, but adding to the
locking contention of zone_lru_lock is undesirable,
because that is a pre-existing hot spot. Fortunately,
these new dma_pinned_* fields can use an independent
lock, so this patch creates an entirely new lock, right
next to the zone_lru_lock.

Why "zone_gup_lock"?

Most of the naming refers to "DMA-pinned pages", but
"zone DMA lock" has other meanings already, so this is
called zone_gup_lock instead. The "dma pinning" is a result
of get_user_pages (gup) being called, so the name still
helps explain its use.
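
The intended usage pattern, illustrated here with the re-pin case that
pin_page_for_dma() in patch 6/6 actually implements:

    page = compound_head(page);

    spin_lock(zone_gup_lock(page_zone(page)));
    if (PageDmaPinned(page))
            atomic_inc(&page->dma_pinned_count);
    spin_unlock(zone_gup_lock(page_zone(page)));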

Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jan Kara 
Signed-off-by: John Hubbard 
---
 include/linux/mmzone.h | 6 ++
 mm/page_alloc.c| 1 +
 2 files changed, 7 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 847705a6d0ec..125a6f34f6ba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -660,6 +660,7 @@ typedef struct pglist_data {
enum zone_type kswapd_classzone_idx;
 
int kswapd_failures;/* Number of 'reclaimed == 0' runs */
+   spinlock_t pinned_dma_lock;
 
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
@@ -729,6 +730,11 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
return &zone->zone_pgdat->lru_lock;
 }
 
+static inline spinlock_t *zone_gup_lock(struct zone *zone)
+{
+   return &zone->zone_pgdat->pinned_dma_lock;
+}
+
 static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
return &pgdat->lruvec;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a919ba5cb3c8..7cc0d9bdba17 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6305,6 +6305,7 @@ static void __meminit pgdat_init_internals(struct 
pglist_data *pgdat)
 
pgdat_page_ext_init(pgdat);
spin_lock_init(&pgdat->lru_lock);
+   spin_lock_init(&pgdat->pinned_dma_lock);
lruvec_init(node_lruvec(pgdat));
 }
 
-- 
2.19.1



[PATCH v2 4/6] mm: introduce page->dma_pinned_flags, _count

2018-11-10 Thread john . hubbard
From: John Hubbard 

Add two struct page fields that, combined, are unioned with
struct page->lru. There is no change in the size of
struct page. These new fields are for type safety and clarity.

Also add page flag accessors to test, set and clear the new
page->dma_pinned_flags field.

The page->dma_pinned_count field will be used in upcoming
patches.
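
Because the scheme relies on dma_pinned_flags aliasing page->lru.next
(so that its low bits are guaranteed clear for any page that really is
on an LRU list), a compile-time check along these lines could make the
layout assumption explicit. This is illustrative only, not part of the
patch:

    /* Hypothetical helper, not in the patch: layout sanity check. */
    static inline void page_dma_pinned_layout_check(void)
    {
            BUILD_BUG_ON(offsetof(struct page, dma_pinned_flags) !=
                         offsetof(struct page, lru.next));
            BUILD_BUG_ON(offsetof(struct page, dma_pinned_count) !=
                         offsetof(struct page, lru.prev));
    }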

Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jan Kara 
Cc: Balbir Singh 
Signed-off-by: John Hubbard 
---
 include/linux/mm_types.h   | 22 ++
 include/linux/page-flags.h | 61 ++
 2 files changed, 77 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..017ab82e36ca 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,12 +78,22 @@ struct page {
 */
union {
struct {/* Page cache and anonymous pages */
-   /**
-* @lru: Pageout list, eg. active_list protected by
-* zone_lru_lock.  Sometimes used as a generic list
-* by the page owner.
-*/
-   struct list_head lru;
+   union {
+   /**
+* @lru: Pageout list, eg. active_list protected
+* by zone_lru_lock.  Sometimes used as a
+* generic list by the page owner.
+*/
+   struct list_head lru;
+   /* Used by get_user_pages*(). Pages may not be
+* on an LRU while these dma_pinned_* fields
+* are in use.
+*/
+   struct {
+   unsigned long dma_pinned_flags;
+   atomic_t  dma_pinned_count;
+   };
+   };
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
pgoff_t index;  /* Our offset within mapping. */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..3190b6b6a82f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -437,6 +437,67 @@ static __always_inline int __PageMovable(struct page *page)
PAGE_MAPPING_MOVABLE;
 }
 
+/*
+ * Because page->dma_pinned_flags is unioned with page->lru, any page that
+ * uses these flags must NOT be on an LRU. That's partly enforced by
+ * ClearPageDmaPinned, which gives the page back to LRU.
+ *
+ * PageDmaPinned is checked without knowing whether it is a tail page or a
+ * PageDmaPinned page. For that reason, PageDmaPinned avoids PageTail (the 0th
+ * bit in the first union of struct page), and instead uses bit 1 (0x2),
+ * rather than bit 0.
+ *
+ * PageDmaPinned can only be used if no other systems are using the same bit
+ * across the first struct page union. In this regard, it is similar to
+ * PageTail, and in fact, because of PageTail's constraint that bit 0 be left
+ * alone, bit 1 is also left alone so far: other union elements (ignoring tail
+ * pages) put pointers there, and pointer alignment leaves the lower two bits
+ * available.
+ *
+ * So, constraints include:
+ *
+ * -- Only use PageDmaPinned on non-tail pages.
+ * -- Remove the page from any LRU list first.
+ */
+#define PAGE_DMA_PINNED0x2UL
+#define PAGE_DMA_PINNED_WAS_LRU0x4UL
+
+static __always_inline int PageDmaPinned(struct page *page)
+{
+   VM_BUG_ON(page != compound_head(page));
+   return test_bit(PAGE_DMA_PINNED, &page->dma_pinned_flags);
+}
+
+static __always_inline void SetPageDmaPinned(struct page *page)
+{
+   VM_BUG_ON(page != compound_head(page));
+   set_bit(PAGE_DMA_PINNED, &page->dma_pinned_flags);
+}
+
+static __always_inline void ClearPageDmaPinned(struct page *page)
+{
+   VM_BUG_ON(page != compound_head(page));
+   clear_bit(PAGE_DMA_PINNED, &page->dma_pinned_flags);
+}
+
+static __always_inline int PageDmaPinnedWasLru(struct page *page)
+{
+   VM_BUG_ON(page != compound_head(page));
+   return test_bit(PAGE_DMA_PINNED_WAS_LRU, &page->dma_pinned_flags);
+}
+
+static __always_inline void SetPageDmaPinnedWasLru(struct page *page)
+{
+   VM_BUG_ON(page != compound_head(page));
+   set_bit(PAGE_DMA_PINNED_WAS_LRU, &page->dma_pinned_flags);
+}
+
+static __always_inline void ClearPageDmaPinnedWasLru(struct page *page)
+{
+   VM_BUG_ON(page != compound_head(page));
+   clear_bit(PAGE_DMA_PINNED_WAS_LRU, &page->dma_pinned_flags);
+}
+
 #ifdef CONFIG_KSM
 /*
  * A KSM page is one of those write-protecte

Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count

2018-11-06 Thread John Hubbard
On 11/6/18 12:41 PM, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
>> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
>>> On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
>>>> On 11/5/18 1:54 AM, Jan Kara wrote:
>>>>> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
>>>>> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
>>>>> has the advantage that it is pretty much standard for a test like this so
>>>>> everyone knows what the test does from a glimpse) and run with it 
>>>>> something
>>>>> like the following workfile:
>>>>>
>>>>> [reader]
>>>>> direct=1
>>>>> ioengine=libaio
>>>>> blocksize=4096
>>>>> size=1g
>>>>> numjobs=1
>>>>> rw=read
>>>>> iodepth=64
>>>>>
>>>>> And see how the numbers with and without your patches compare?
>>>>>
>>>>>   Honza
>>>>>
>>>>> [1] https://github.com/axboe/fio
>>>>
>>>> That program is *very* good to have. Whew. Anyway, it looks like read 
>>>> bandwidth 
>>>> is approximately 74 MiB/s with my patch (it varies a bit, run to run),
>>>> as compared to around 85 without the patch, so still showing about a 20%
>>>> performance degradation, assuming I'm reading this correctly.
>>>>
>>>> Raw data follows, using the fio options you listed above:
>>>>
>>>> Baseline (without my patch):
>>>>  
>>> 
>>>>  lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
>>>> clat percentiles (usec):
>>>>  |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>>>>  | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>>>>  | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>>>>  | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>>>>  | 99.99th=[12256]
>>> .
>>>> Modified (with my patch):
>>>>  
>>> .
>>>>  lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
>>>> clat percentiles (usec):
>>>>  |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>>>>  | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>>>>  | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>>>>  | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>>>>  | 99.99th=[14484]
>>>
>>> So it's adding at least 500us of completion latency to every IO?
>>> I'd argue that the IO latency impact is far worse than the a 20%
>>> throughput drop.
>>
>> Hum, right. So for each IO we have to remove the page from LRU on submit
> 
> Which cost us less then 10us on average:
> 
>   slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
> vs
>   slat (usec): min=18, max=4378, avg=52.59, stdev=63.66
> 
>> and then put it back on IO completion (which is going to race with new
>> submits so LRU lock contention might be an issue).
> 
> Removal has to take the same LRU lock, so I don't think contention
> is the problem here. More likely the overhead is in selecting the
> LRU to put it on. e.g. list_lru_from_kmem() which may well be doing
> a memcg lookup.
> 
>> Spending 500 us on that
>> is not unthinkable when the lock is contended but it is more expensive than
>> I'd have thought. John, could you perhaps profile where the time is spent?
> 

OK, some updates on that:

1. First of all, I fixed a direct-io-related call site (it was still calling
put_page instead of put_user_page), and that not only got rid of a problem,
it also changed performance: it makes the impact of the patch a bit less.
(Sorry about that! I was hoping that I could get away with temporarily
ignoring that failure, but no.)
The bandwidth numbers in particular look much closer to each other.

2. Second, note that these fio results are noisy. The std deviation is large
enough that some of this could be noise. In order to highlight that, I did
5 runs each, with and without the patch, and while there is definitely a
performance drop on average, it's also true that there is overlap in the
results. In other words, the best "with patch" run is

Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count

2018-11-05 Thread John Hubbard
On 11/5/18 1:54 AM, Jan Kara wrote:
> On Sun 04-11-18 23:10:12, John Hubbard wrote:
>> On 10/13/18 9:47 AM, Christoph Hellwig wrote:
>>> On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
>>>> In patch 6/6, pin_page_for_dma(), which is called at the end of 
>>>> get_user_pages(),
>>>> unceremoniously rips the pages out of the LRU, as a prerequisite to using
>>>> either of the page->dma_pinned_* fields. 
>>>>
>>>> The idea is that LRU is not especially useful for this situation anyway,
>>>> so we'll just make it one or the other: either a page is dma-pinned, and
>>>> just hanging out doing RDMA most likely (and LRU is less meaningful during 
>>>> that
>>>> time), or it's possibly on an LRU list.
>>>
>>> Have you done any benchmarking what this does to direct I/O performance,
>>> especially for small I/O directly to a (fast) block device?
>>>
>>
>> Hi Christoph,
>>
>> I'm seeing about 20% slower in one case: lots of reads and writes of size 
>> 8192 B,
>> on a fast NVMe device. My put_page() --> put_user_page() conversions are 
>> incomplete 
>> and buggy yet, but I've got enough of them done to briefly run the test.
>>
>> One thing that occurs to me is that jumping on and off the LRU takes time, 
>> and
>> if we limited this to 64-bit platforms, maybe we could use a real page flag? 
>> I 
>> know that leaves 32-bit out in the cold, but...maybe use this slower approach
>> for 32-bit, and the pure page flag for 64-bit? uggh, we shouldn't slow down 
>> anything
>> by 20%. 
>>
>> Test program is below. I hope I didn't overlook something obvious, but it's 
>> definitely possible, given my lack of experience with direct IO. 
>>
>> I'm preparing to send an updated RFC this week, that contains the feedback 
>> to date,
>> and also many converted call sites as well, so that everyone can see what 
>> the whole
>> (proposed) story would look like in its latest incarnation.
> 
> Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> has the advantage that it is pretty much standard for a test like this so
> everyone knows what the test does from a glimpse) and run with it something
> like the following workfile:
> 
> [reader]
> direct=1
> ioengine=libaio
> blocksize=4096
> size=1g
> numjobs=1
> rw=read
> iodepth=64
> 
> And see how the numbers with and without your patches compare?
> 
>   Honza
> 
> [1] https://github.com/axboe/fio

That program is *very* good to have. Whew. Anyway, it looks like read bandwidth 
is approximately 74 MiB/s with my patch (it varies a bit, run to run),
as compared to around 85 without the patch, so still showing about a 20%
performance degradation, assuming I'm reading this correctly.

Raw data follows, using the fio options you listed above:

Baseline (without my patch):
 
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=87.2MiB/s,w=0KiB/s][r=22.3k,w=0 IOPS][eta 
00m:00s]
reader: (groupid=0, jobs=1): err= 0: pid=1775: Mon Nov  5 12:08:45 2018
   read: IOPS=21.9k, BW=85.7MiB/s (89.9MB/s)(1024MiB/11945msec)
slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
clat (usec): min=71, max=13093, avg=2869.40, stdev=1225.23
 lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
clat percentiles (usec):
 |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
 | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
 | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
 | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
 | 99.99th=[12256]
   bw (  KiB/s): min=80648, max=93288, per=99.80%, avg=87608.57, stdev=3201.35, 
samples=23
   iops: min=20162, max=23322, avg=21902.09, stdev=800.37, samples=23
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=88.47%, 10=11.27%, 20=0.25%
  cpu  : usr=2.68%, sys=94.68%, ctx=408, majf=0, minf=73
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs)

Re: [PATCH v2 1/6] mm/gup: finish consolidating error handling

2018-11-14 Thread John Hubbard

On 11/12/18 8:14 AM, Dan Williams wrote:

On Mon, Nov 12, 2018 at 7:45 AM Keith Busch  wrote:


On Sat, Nov 10, 2018 at 12:50:36AM -0800, john.hubb...@gmail.com wrote:

From: John Hubbard 

An upcoming patch wants to be able to operate on each page that
get_user_pages has retrieved. In order to do that, it's best to
have a common exit point from the routine. Most of this has been
taken care of by commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while
pinning pages"), but there was one case remaining.

Also, there was still an unnecessary shadow declaration (with a
different type) of the "ret" variable, which this commit removes.

Cc: Keith Busch 
Cc: Dan Williams 
Cc: Kirill A. Shutemov 
Cc: Dave Hansen 
Signed-off-by: John Hubbard 
---
  mm/gup.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f76e77a2d34b..55a41dee0340 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -696,12 +696,11 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
   if (!vma || start >= vma->vm_end) {
   vma = find_extend_vma(mm, start);
   if (!vma && in_gate_area(mm, start)) {
- int ret;
   ret = get_gate_page(mm, start & PAGE_MASK,
   gup_flags, &vma,
   pages ? &pages[i] : NULL);
   if (ret)
- return i ? : ret;
+ goto out;
   ctx.page_mask = 0;
   goto next_page;
   }


This also fixes a potentially leaked dev_pagemap reference count if a
failure occurs when an iteration crosses a vma boundary. I don't think
it's normal to have different VMAs on a user's mapped zone device memory,
but it's good to fix anyway.


Does not sound abnormal to me, we should promote this as a fix for the
current cycle with an updated changelog.



Andrew, should I send this patch separately, or do you have what you 
need already?


thanks,
--
John Hubbard
NVIDIA


[PATCH v2 1/3] mm: get_user_pages: consolidate error handling

2018-10-04 Thread john . hubbard
From: John Hubbard 

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
struct vm_area_struct **vmas, int *nonblocking)
 {
long i = 0;
+   int err = 0;
unsigned int page_mask;
struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
-   int ret;
-   ret = get_gate_page(mm, start & PAGE_MASK,
+   err = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
-   if (ret)
-   return i ? : ret;
+   if (err)
+   goto out;
page_mask = 0;
goto next_page;
}
 
-   if (!vma || check_vma_flags(vma, gup_flags))
-   return i ? : -EFAULT;
+   if (!vma || check_vma_flags(vma, gup_flags)) {
+   err = -EFAULT;
+   goto out;
+   }
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
-   if (unlikely(fatal_signal_pending(current)))
-   return i ? i : -ERESTARTSYS;
+   if (unlikely(fatal_signal_pending(current))) {
+   err = -ERESTARTSYS;
+   goto out;
+   }
cond_resched();
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {
-   int ret;
-   ret = faultin_page(tsk, vma, start, &foll_flags,
+   err = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
-   switch (ret) {
+   switch (err) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
-   return i ? i : ret;
+   goto out;
case -EBUSY:
-   return i;
+   err = 0;
+   goto out;
case -ENOENT:
goto next_page;
}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 */
goto next_page;
} else if (IS_ERR(page)) {
-   return i ? i : PTR_ERR(page);
+   err = PTR_ERR(page);
+   goto out;
}
if (pages) {
pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
} while (nr_pages);
-   return i;
+
+out:
+   return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.0



[PATCH v2 0/3] get_user_pages*() and RDMA: first steps

2018-10-04 Thread john . hubbard
From: John Hubbard 

Changes since v1:

-- Renamed release_user_pages*() to put_user_pages*(), from Jan's feedback.

-- Removed the goldfish.c changes, and instead, only included a single
   user (infiniband) of the new functions. That is because goldfish.c no
   longer has a name collision (it has a release_user_pages() routine), and
   also because infiniband exercises both the put_user_page() and
   put_user_pages*() paths.

-- Updated links to discussions and plans, so as to be sure to include
   bounce buffers, thanks to Jerome's feedback.

Also:

-- Dennis, thanks for your earlier review, and I have not yet added your
   Reviewed-by tag, because this revision changes the things that you had
   previously reviewed, thus potentially requiring another look.

This short series prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

I'd like to get the first two patches into the -mm tree.

Patch 1, although not technically critical to do now, is still nice to
have, because it's already been reviewed by Jan, and it's just one more
thing on the long TODO list here, that is ready to be checked off.

Patch 2 is required in order to allow me (and others, if I'm lucky) to
start submitting changes to convert all of the callsites of
get_user_pages*() and put_page().  I think this will work a lot better
than trying to maintain a massive patchset and submitting all at once.

Patch 3 converts infiniband drivers: put_page() --> put_user_page(), and
also exercises put_user_pages_dirty_lock().

Once these are all in, then the floodgates can open up to convert the large
number of get_user_pages*() callsites.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 

John Hubbard (3):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page[s](), placeholder versions
  infiniband/mm: convert to the new put_user_page[s]() calls

 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ++
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 include/linux/mm.h  | 42 -
 mm/gup.c| 37 ++
 9 files changed, 80 insertions(+), 41 deletions(-)

-- 
2.19.0



[PATCH v2 3/3] infiniband/mm: convert to the new put_user_page[s]() calls

2018-10-04 Thread john . hubbard
From: John Hubbard 

For code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(),
instead of put_page().

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: Mike Marciniszyn 
CC: Dennis Dalessandro 
CC: Christian Benvenuti 

CC: linux-r...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux...@kvack.org
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 7 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..9430d697cb9f 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
page = sg_page(sg);
if (!PageDirty(page) && umem->writable && dirty)
set_page_dirty_lock(page);
-   put_page(page);
+   put_user_page(page);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
user_virt, u64 bcnt,
ret = -EFAULT;
break;
}
-   put_page(local_page_list[j]);
+   put_user_page(local_page_list[j]);
continue;
}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, 
unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   size_t i;
-
-   for (i = 0; i < npages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_page(p[i]);
-   }
+   if (dirty)
+   put_user_pages_dirty_lock(p, npages);
+   else
+   put_user_pages(p, npages);
 
if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 
ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
-   put_page(pages[0]);
+   put_user_page(pages[0]);
goto out;
}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, 
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->page[i].mem));
goto out;
}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, 
struct mthca_uar *uar,
if (db_tab->page[i].uvirt) {
mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, 
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->page[i].mem));
}
}
 
diff --git a/drivers/infiniband/h

[PATCH v2 2/3] mm: introduce put_user_page[s](), placeholder versions

2018-10-04 Thread john . hubbard
From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), for the same reasons.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..1a9aae7c659f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -137,6 +137,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, 
int, void __user *,
size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
size_t *, loff_t *);
+int set_page_dirty(struct page *page);
+int set_page_dirty_lock(struct page *page);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
@@ -943,6 +945,44 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
+/* Placeholder version, until all get_user_pages*() callers are updated. */
+static inline void put_user_page(struct page *page)
+{
+   put_page(page);
+}
+
+/* For get_user_pages*()-pinned pages, use these variants instead of
+ * release_pages():
+ */
+static inline void put_user_pages_dirty(struct page **pages,
+   unsigned long npages)
+{
+   while (npages) {
+   --npages;
+   set_page_dirty(pages[npages]);
+   put_user_page(pages[npages]);
+   }
+}
+
+static inline void put_user_pages_dirty_lock(struct page **pages,
+unsigned long npages)
+{
+   while (npages) {
+   --npages;
+   set_page_dirty_lock(pages[npages]);
+   put_user_page(pages[npages]);
+   }
+}
+
+static inline void put_user_pages(struct page **pages,
+ unsigned long npages)
+{
+   while (npages) {
+   --npages;
+   put_user_page(pages[npages]);
+   }
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
@@ -1534,8 +1574,6 @@ int redirty_page_for_writepage(struct writeback_control 
*wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
  struct bdi_writeback *wb);
-int set_page_dirty(struct page *page);
-int set_page_dirty_lock(struct page *page);
 void __cancel_dirty_page(struct page *page);
 static inline void cancel_dirty_page(struct page *page)
 {
-- 
2.19.0



[PATCH v5 3/3] infiniband/mm: convert put_page() to put_user_page*()

2018-10-09 Thread john . hubbard
From: John Hubbard 

For infiniband code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page().

This is a tiny part of the second step of fixing the problem described
in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: Mike Marciniszyn 
CC: Dennis Dalessandro 
CC: Christian Benvenuti 

CC: linux-r...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux...@kvack.org

Reviewed-by: Jan Kara 
Reviewed-by: Dennis Dalessandro 
Acked-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  7 ---
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
 7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..7ab7a3a35eb4 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
page = sg_page(sg);
-   if (!PageDirty(page) && umem->writable && dirty)
-   set_page_dirty_lock(page);
-   put_page(page);
+   if (umem->writable && dirty)
+   put_user_pages_dirty_lock(&page, 1);
+   else
+   put_user_page(page);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
user_virt, u64 bcnt,
ret = -EFAULT;
break;
}
-   put_page(local_page_list[j]);
+   put_user_page(local_page_list[j]);
continue;
}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, 
unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   size_t i;
-
-   for (i = 0; i < npages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_page(p[i]);
-   }
+   if (dirty)
+   put_user_pages_dirty_lock(p, npages);
+   else
+   put_user_pages(p, npages);
 
if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 
ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
-   put_page(pages[0]);
+   put_user_page(pages[0]);
goto out;
}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, 
PCI_DMA_TODEVICE);
-   put_p

[PATCH v5 2/3] mm: introduce put_user_page*(), placeholder versions

2018-10-09 Thread john . hubbard
From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing the problem described in [1]. The steps
are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 
CC: Ralph Campbell 

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 20 +++
 mm/swap.c  | 83 ++
 2 files changed, 103 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..76d18aada9f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -943,6 +943,26 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+static inline void put_user_page(struct page *page)
+{
+   put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index 26fc9b5f1b6c..efab3a6b6f91 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -134,6 +134,89 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+/*
+ * put_user_pages_dirty() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   struct page *page = compound_head(pages[index]);
+
+   if (!PageDirty(page))
+   set_page_dirty(page);
+
+   put_user_page(page);
+   }
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/*
+ * put_user_pages_dirty_lock() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   struct page *page = compound_head(pages[index]);
+
+  

[PATCH v5 0/3] get_user_pages*() and RDMA: first steps

2018-10-09 Thread john . hubbard
From: John Hubbard 

Changes since v4:

-- Changed the new put_user_page*() functions to operate only on the head
   page, because that's how the final version of those functions will work.
   (Andrew Morton's feedback prompted this, thanks!)

-- Added proper documentation of the new put_user_page*() functions.

-- Moved most of the new put_user_page*() functions out of the header file,
   and into swap.c, because they have grown a little bigger than static
   inline functions should be. The trivial put_user_page() was left as
   a static inline for now, though.

-- Picked up Andrew Morton's Reviewed-by, for the first patch. I left
   Jan's Reviewed-by in place for now, but we should verify that it still
   holds, with the various changes above. The main difference is the change
   to use the head page, the rest is just code movement and documentation.

-- Fixed a bug in the infiniband patch, found by the kbuild bot.

-- Rewrote the changelogs (and part of this cover letter) to be clearer.
   Part of that is less reliance on links, and instead, just writing the
   steps directly.

Changes since v3:

-- Picks up Reviewed-by tags from Jan Kara and Dennis Dalessandro.

-- Picks up Acked-by tag from Jason Gunthorpe, in case this ends up *not*
   going in via the RDMA tree.

-- Fixes formatting of a comment.

Changes since v2:

-- Absorbed more dirty page handling logic into the put_user_page*() routines, and
   handled some page releasing loops in infiniband more thoroughly, as per
   Jason Gunthorpe's feedback.

-- Fixed a bug in the put_user_pages*() routines' loops (thanks to
   Ralph Campbell for spotting it).

Changes since v1:

-- Renamed release_user_pages*() to put_user_pages*(), from Jan's feedback.

-- Removed the goldfish.c changes, and instead, only included a single
   user (infiniband) of the new functions. That is because goldfish.c no
   longer has a name collision (it has a release_user_pages() routine), and
   also because infiniband exercises both the put_user_page() and
   put_user_pages*() paths.

-- Updated links to discussions and plans, so as to be sure to include
   bounce buffers, thanks to Jerome's feedback.

Also:

This short series prepares for eventually fixing the problem described
in [1]. The steps are:

1) (This patchset): Provide put_user_page*() routines, intended to be used
for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
invoke put_user_page*(), instead of put_page(). This involves dozens of
call sites, and will take some time; a sketch of a typical conversion
appears just after this list. Patch 3/3 here kicks off the effort, by
applying it to infiniband.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem. Again, [1] provides details as to why that is
desirable.
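
To make step (2) concrete, here is a sketch of a typical conversion. Every
name in it is made up for illustration; only the put_page() -->
put_user_page*() substitution is the point:

/*
 * Illustrative only: a made-up driver helper showing the step (2)
 * conversion pattern for a get_user_pages_fast() caller.
 */
static int example_pin_and_use(unsigned long start, struct page **pages,
                               int npages)
{
        int got;

        got = get_user_pages_fast(start, npages, 1, pages);
        if (got <= 0)
                return got ? got : -EFAULT;

        /* ... DMA or CPU access to the pinned pages happens here ... */

        /*
         * Old style:
         *      for (i = 0; i < got; i++) {
         *              set_page_dirty_lock(pages[i]);
         *              put_page(pages[i]);
         *      }
         * New style:
         */
        put_user_pages_dirty_lock(pages, got);
        return 0;
}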

Patch 1, although not technically critical to do now, is still nice to
have, because it's already been reviewed by Jan (and Andrew, now), and
it's just one more thing on the long TODO list here, that is ready to be
checked off.

Patch 2 is required in order to allow me (and others, if I'm lucky) to
start submitting changes to convert all of the callsites of
get_user_pages*() and put_page().  I think this will work a lot better
than trying to maintain a massive patchset and submitting all at once.

Patch 3 converts infiniband drivers: put_page() --> put_user_page(), and
also exercises put_user_pages_dirty_lock().

Once these are all in, then the floodgates can open up to convert the large
number of remaining get_user_pages*() callsites.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 
CC: Ralph Campbell 
CC: Andrew Morton 

John Hubbard (3):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()

 drivers/infiniband/core/umem.c  |  7 +-
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 +--
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 +--
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  

[PATCH v5 1/3] mm: get_user_pages: consolidate error handling

2018-10-09 Thread john . hubbard
From: John Hubbard 

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.

Reviewed-by: Jan Kara 
Reviewed-by: Andrew Morton 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
struct vm_area_struct **vmas, int *nonblocking)
 {
long i = 0;
+   int err = 0;
unsigned int page_mask;
struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
-   int ret;
-   ret = get_gate_page(mm, start & PAGE_MASK,
+   err = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
-   if (ret)
-   return i ? : ret;
+   if (err)
+   goto out;
page_mask = 0;
goto next_page;
}
 
-   if (!vma || check_vma_flags(vma, gup_flags))
-   return i ? : -EFAULT;
+   if (!vma || check_vma_flags(vma, gup_flags)) {
+   err = -EFAULT;
+   goto out;
+   }
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
-   if (unlikely(fatal_signal_pending(current)))
-   return i ? i : -ERESTARTSYS;
+   if (unlikely(fatal_signal_pending(current))) {
+   err = -ERESTARTSYS;
+   goto out;
+   }
cond_resched();
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {
-   int ret;
-   ret = faultin_page(tsk, vma, start, &foll_flags,
+   err = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
-   switch (ret) {
+   switch (err) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
-   return i ? i : ret;
+   goto out;
case -EBUSY:
-   return i;
+   err = 0;
+   goto out;
case -ENOENT:
goto next_page;
}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 */
goto next_page;
} else if (IS_ERR(page)) {
-   return i ? i : PTR_ERR(page);
+   err = PTR_ERR(page);
+   goto out;
}
if (pages) {
pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
} while (nr_pages);
-   return i;
+
+out:
+   return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.1



Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions

2018-10-09 Thread John Hubbard
On 10/8/18 5:14 PM, Andrew Morton wrote:
> On Mon,  8 Oct 2018 14:16:22 -0700 john.hubb...@gmail.com wrote:
> 
>> From: John Hubbard 
[...]
>> +/*
>> + * Pages that were pinned via get_user_pages*() should be released via
>> + * either put_user_page(), or one of the put_user_pages*() routines
>> + * below.
>> + */
>> +static inline void put_user_page(struct page *page)
>> +{
>> +put_page(page);
>> +}
>> +
>> +static inline void put_user_pages_dirty(struct page **pages,
>> +unsigned long npages)
>> +{
>> +unsigned long index;
>> +
>> +for (index = 0; index < npages; index++) {
>> +if (!PageDirty(pages[index]))
> 
> Both put_page() and set_page_dirty() handle compound pages.  But
> because of the above statement, put_user_pages_dirty() might misbehave? 
> Or maybe it won't - perhaps the intent here is to skip dirtying the
> head page if the sub page is clean?  Please clarify, explain and add
> comment if so.
> 

Yes, technically, the accounting is wrong: we normally use the head page to 
track dirtiness, and here, that is not done. (Nor was it done before this
patch). However, it's not causing problems in code today because sub pages
are released at about the same time as head pages, so the head page does get 
properly checked at some point. And that means that set_page_dirty*() gets
called if it needs to be called. 

Obviously this is a little fragile, in that it depends on the caller behaving 
a certain way. And in any case, the long-term fix (coming later) *also* only
operates on the head page. So actually, instead of a comment, I think it's good 
to just insert

page = compound_head(page);

...into these new routines, right now. I'll do that.
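
Concretely, the loop body would then look something like this (just a sketch
here, but it matches where later revisions of this series end up):

static inline void put_user_pages_dirty(struct page **pages,
                                        unsigned long npages)
{
        unsigned long index;

        for (index = 0; index < npages; index++) {
                /* Operate on the head page, as discussed above. */
                struct page *page = compound_head(pages[index]);

                if (!PageDirty(page))
                        set_page_dirty(page);

                put_user_page(page);
        }
}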

[...]
> 
> Otherwise looks OK.  Ish.  But it would be nice if that comment were to
> explain *why* get_user_pages() pages must be released with
> put_user_page().
> 

Yes, will do.

> Also, maintainability.  What happens if someone now uses put_page() by
> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
> this from occurring as code evolves?  Is there a cheap way of detecting
> this bug at runtime?
> 

It might be possible to do a few run-time checks, such as "does the page that 
came back to put_user_page() have the correct flags?", but it's harder (without 
having a dedicated page flag) to detect the other direction: "did someone pass 
a get_user_pages page to put_page()?"
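
For the first direction, if we did eventually add a dedicated page flag, the
check could be quite cheap. Purely hypothetical sketch (PG_user_pinned and
PageUserPinned() do not exist; they just stand in for whatever flag we might
add):

static inline void put_user_page(struct page *page)
{
        /*
         * Hypothetical: catch pages that did not come from
         * get_user_pages*(). VM_BUG_ON_PAGE() already compiles away
         * unless CONFIG_DEBUG_VM is set.
         */
        VM_BUG_ON_PAGE(!PageUserPinned(page), page);
        put_page(page);
}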

As Jan said in his reply, converting get_user_pages (and put_user_page) to 
work with a new data type that wraps struct pages, would solve it, but that's
an awfully large change. Still...given how much of a mess this can turn into 
if it's wrong, I wonder if it's worth it--maybe? 

thanks,
-- 
John Hubbard
NVIDIA
 


[PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions

2018-10-08 Thread john . hubbard
From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 
CC: Ralph Campbell 

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 49 --
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..0490f4a71b9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -137,6 +137,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, 
int, void __user *,
size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
size_t *, loff_t *);
+int set_page_dirty(struct page *page);
+int set_page_dirty_lock(struct page *page);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
@@ -943,6 +945,51 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
+/*
+ * Pages that were pinned via get_user_pages*() should be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below.
+ */
+static inline void put_user_page(struct page *page)
+{
+   put_page(page);
+}
+
+static inline void put_user_pages_dirty(struct page **pages,
+   unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   if (!PageDirty(pages[index]))
+   set_page_dirty(pages[index]);
+
+   put_user_page(pages[index]);
+   }
+}
+
+static inline void put_user_pages_dirty_lock(struct page **pages,
+unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   if (!PageDirty(pages[index]))
+   set_page_dirty_lock(pages[index]);
+
+   put_user_page(pages[index]);
+   }
+}
+
+static inline void put_user_pages(struct page **pages,
+ unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++)
+   put_user_page(pages[index]);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
@@ -1534,8 +1581,6 @@ int redirty_page_for_writepage(struct writeback_control 
*wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
  struct bdi_writeback *wb);
-int set_page_dirty(struct page *page);
-int set_page_dirty_lock(struct page *page);
 void __cancel_dirty_page(struct page *page);
 static inline void cancel_dirty_page(struct page *page)
 {
-- 
2.19.0



[PATCH v4 0/3] get_user_pages*() and RDMA: first steps

2018-10-08 Thread john . hubbard
From: John Hubbard 

Andrew, do you have a preference for which tree (MM or RDMA) this should
go in? If not, then could you please ACK this so that Jason can pick it
up for the RDMA tree?

Changes since v3:

-- Picks up Reviewed-by tags from Jan Kara and Dennis Dalessandro.

-- Picks up Acked-by tag from Jason Gunthorpe, in case this ends up *not*
   going in via the RDMA tree.

-- Fixes formatting of a comment.

Changes since v2:

-- Absorbed more dirty page handling logic into the put_user_page*() routines,
   handled some page releasing loops in infiniband more thoroughly, as per
   Jason Gunthorpe's feedback.

-- Fixed a bug in the put_user_pages*() routines' loops (thanks to
   Ralph Campbell for spotting it).

Changes since v1:

-- Renamed release_user_pages*() to put_user_pages*(), from Jan's feedback.

-- Removed the goldfish.c changes, and instead, only included a single
   user (infiniband) of the new functions. That is because goldfish.c no
   longer has a name collision (it has a release_user_pages() routine), and
   also because infiniband exercises both the put_user_page() and
   put_user_pages*() paths.

-- Updated links to discussions and plans, so as to be sure to include
   bounce buffers, thanks to Jerome's feedback.

Also:

-- Dennis, thanks for your earlier review, and I have not yet added your
   Reviewed-by tag, because this revision changes the things that you had
   previously reviewed, thus potentially requiring another look.

This short series prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

Patch 1, although not technically critical to do now, is still nice to
have, because it's already been reviewed by Jan, and it's just one more
thing on the long TODO list here, that is ready to be checked off.

Patch 2 is required in order to allow me (and others, if I'm lucky) to
start submitting changes to convert all of the callsites of
get_user_pages*() and put_page().  I think this will work a lot better
than trying to maintain a massive patchset and submitting all at once.

Patch 3 converts infiniband drivers: put_page() --> put_user_page(), and
also exercises put_user_pages_dirty_lock().

Once these are all in, then the floodgates can open up to convert the large
number of get_user_pages*() callsites.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 
CC: Ralph Campbell 
CC: Andrew Morton 

John Hubbard (3):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()

 drivers/infiniband/core/umem.c  |  7 +--
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ++---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 +--
 include/linux/mm.h  | 49 -
 mm/gup.c| 37 +---
 9 files changed, 93 insertions(+), 45 deletions(-)

-- 
2.19.0



[PATCH v4 3/3] infiniband/mm: convert put_page() to put_user_page*()

2018-10-08 Thread john . hubbard
From: John Hubbard 

For code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page().

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3] https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: Mike Marciniszyn 
CC: Dennis Dalessandro 
CC: Christian Benvenuti 

CC: linux-r...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux...@kvack.org

Reviewed-by: Jan Kara 
Reviewed-by: Dennis Dalessandro 
Acked-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  7 ---
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
 7 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..7ab7a3a35eb4 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
page = sg_page(sg);
-   if (!PageDirty(page) && umem->writable && dirty)
-   set_page_dirty_lock(page);
-   put_page(page);
+   if (umem->writable && dirty)
+   put_user_pages_dirty_lock(&page, 1);
+   else
+   put_user_page(page);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
user_virt, u64 bcnt,
ret = -EFAULT;
break;
}
-   put_page(local_page_list[j]);
+   put_user_page(local_page_list[j]);
continue;
}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, 
unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   size_t i;
-
-   for (i = 0; i < npages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_page(p[i]);
-   }
+   if (dirty)
+   put_user_pages_dirty_lock(p, npages);
+   else
+   put_user_pages(p, npages);
 
if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 
ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
-   put_page(pages[0]);
+   put_user_page(pages[0]);
goto out;
}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1,
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->page[i].mem));
goto out;
}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, 
struct mthca_uar *uar,
if (db_tab->page[i].uvirt) {
mthca_UNMAP_ICM(dev, m

[PATCH v4 1/3] mm: get_user_pages: consolidate error handling

2018-10-08 Thread john . hubbard
From: John Hubbard 

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
struct vm_area_struct **vmas, int *nonblocking)
 {
long i = 0;
+   int err = 0;
unsigned int page_mask;
struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
-   int ret;
-   ret = get_gate_page(mm, start & PAGE_MASK,
+   err = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
-   if (ret)
-   return i ? : ret;
+   if (err)
+   goto out;
page_mask = 0;
goto next_page;
}
 
-   if (!vma || check_vma_flags(vma, gup_flags))
-   return i ? : -EFAULT;
+   if (!vma || check_vma_flags(vma, gup_flags)) {
+   err = -EFAULT;
+   goto out;
+   }
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
-   if (unlikely(fatal_signal_pending(current)))
-   return i ? i : -ERESTARTSYS;
+   if (unlikely(fatal_signal_pending(current))) {
+   err = -ERESTARTSYS;
+   goto out;
+   }
cond_resched();
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {
-   int ret;
-   ret = faultin_page(tsk, vma, start, &foll_flags,
+   err = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
-   switch (ret) {
+   switch (err) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
-   return i ? i : ret;
+   goto out;
case -EBUSY:
-   return i;
+   err = 0;
+   goto out;
case -ENOENT:
goto next_page;
}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 */
goto next_page;
} else if (IS_ERR(page)) {
-   return i ? i : PTR_ERR(page);
+   err = PTR_ERR(page);
+   goto out;
}
if (pages) {
pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
} while (nr_pages);
-   return i;
+
+out:
+   return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.0



Re: [PATCH v3 3/3] infiniband/mm: convert put_page() to put_user_page*()

2018-10-08 Thread John Hubbard
On 10/8/18 12:42 PM, Jason Gunthorpe wrote:
> On Fri, Oct 05, 2018 at 07:49:49PM -0700, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
[...]
>>  drivers/infiniband/core/umem.c  |  7 ---
>>  drivers/infiniband/core/umem_odp.c  |  2 +-
>>  drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
>>  drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
>>  drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
>>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 
>>  drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
>>  7 files changed, 24 insertions(+), 28 deletions(-)
> 
> I have no issues with this, do you want this series to go through the
> rdma tree? Otherwise:
> 
> Acked-by: Jason Gunthorpe 
> 

The RDMA tree seems like a good path for this, yes, glad you suggested
that.

I'll post a v4 with the comment fix and the recent reviewed-by's, which
should be ready for that.  It's based on today's linux.git tree at the 
moment, but let me know if I should re-apply it to the RDMA tree.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 3/3] infiniband/mm: convert put_page() to put_user_page*()

2018-10-08 Thread John Hubbard
On 10/8/18 1:56 PM, Jason Gunthorpe wrote:
> On Mon, Oct 08, 2018 at 01:37:35PM -0700, John Hubbard wrote:
>> On 10/8/18 12:42 PM, Jason Gunthorpe wrote:
>>> On Fri, Oct 05, 2018 at 07:49:49PM -0700, john.hubb...@gmail.com wrote:
>>>> From: John Hubbard 
>> [...]
>>>>  drivers/infiniband/core/umem.c  |  7 ---
>>>>  drivers/infiniband/core/umem_odp.c  |  2 +-
>>>>  drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
>>>>  drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
>>>>  drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
>>>>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 
>>>>  drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
>>>>  7 files changed, 24 insertions(+), 28 deletions(-)
>>>
>>> I have no issues with this, do you want this series to go through the
>>> rdma tree? Otherwise:
>>>
>>> Acked-by: Jason Gunthorpe 
>>>
>>
>> The RDMA tree seems like a good path for this, yes, glad you suggested
>> that.
>>
>> I'll post a v4 with the comment fix and the recent reviewed-by's, which
>> should be ready for that.  It's based on today's linux.git tree at the 
>> moment, but let me know if I should re-apply it to the RDMA tree.
> 
> I'm unclear who needs to ack the MM sections for us to take it to
> RDMA?
> 
> Otherwise it is no problem..
> 

It needs Andrew Morton (+CC) and preferably also Michal Hocko (already on CC).

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions

2018-10-09 Thread John Hubbard
On 10/9/18 4:20 PM, Andrew Morton wrote:
> On Tue, 9 Oct 2018 10:30:25 +0200 Jan Kara  wrote:
> 
>>> Also, maintainability.  What happens if someone now uses put_page() by
>>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
>>> this from occurring as code evolves?  Is there a cheap way of detecting
>>> this bug at runtime?
>>
>> The same will happen as with any other reference counting bug - the special
>> user reference will leak. It will be pretty hard to debug I agree. I was
>> thinking about whether we could provide some type safety against such bugs
>> such as get_user_pages() not returning struct page pointers but rather some
>> other special type but it would result in a big amount of additional churn
>> as we'd have to propagate this different type e.g. through the IO path so
>> that IO completion routines could properly call put_user_pages(). So I'm
>> not sure it's really worth it.
> 
> I'm not really understanding.  Patch 3/3 changes just one infiniband
> driver to use put_user_page().  But the changelogs here imply (to me)
> that every user of get_user_pages() needs to be converted to
> s/put_page/put_user_page/.
> 
> Methinks a bit more explanation is needed in these changelogs?
> 

OK, yes, it does sound like the explanation is falling short. I'll work on 
something 
clearer. Did the proposed steps in the changelogs, such as:
  
[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

help at all, or is it just too many references, and I should write the words
directly in the changelog?

Anyway, patch 3/3 is just a working example (which we do want to submit, 
though), and
many more conversions will follow. But they don't have to be done all 
upfront--they
can be done in follow up patchsets. 

The put_user_page*() routines are, at this point, not going to significantly 
change
behavior. 

I'm working on an RFC that will show what the long-term fix to get_user_pages 
and
put_user_pages will look like. But meanwhile it's good to get started on 
converting
all of the call sites.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call

2018-10-02 Thread John Hubbard
On 10/1/18 7:35 AM, Dennis Dalessandro wrote:
> On 9/28/2018 11:12 PM, John Hubbard wrote:
>> On 9/28/18 8:39 AM, Jason Gunthorpe wrote:
>>> On Thu, Sep 27, 2018 at 10:39:47PM -0700, john.hubb...@gmail.com wrote:
>>>> From: John Hubbard 
>> [...]
>>>>
>>>> diff --git a/drivers/infiniband/core/umem.c 
>>>> b/drivers/infiniband/core/umem.c
>>>> index a41792dbae1f..9430d697cb9f 100644
>>>> +++ b/drivers/infiniband/core/umem.c
>>>> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, 
>>>> struct ib_umem *umem, int d
>>>>   page = sg_page(sg);
>>>>   if (!PageDirty(page) && umem->writable && dirty)
>>>>   set_page_dirty_lock(page);
>>>> -    put_page(page);
>>>> +    put_user_page(page);
>>>
>>> Would it make sense to have a release/put_user_pages_dirtied to absorb
>>> the set_page_dirty pattern too? I notice in this patch there is some
>>> variety here, I wonder what is the right way?
>>>
>>> Also, I'm told this code here is a big performance bottleneck when the
>>> number of pages becomes very long (think >> GB of memory), so having a
>>> future path to use some kind of batching/threading sound great.
>>>
>>
>> Yes. And you asked for this the first time, too. Consistent! :) Sorry for
>> being slow to pick it up. It looks like there are several patterns, and
>> we have to support both set_page_dirty() and set_page_dirty_lock(). So
>> the best combination looks to be adding a few variations of
>> release_user_pages*(), but leaving put_user_page() alone, because it's
>> the "do it yourself" basic one. Scatter-gather will be stuck with that.
>>
>> Here's a differential patch with that, that shows a nice little cleanup in
>> a couple of IB places, and as you point out, it also provides the hooks for
>> performance upgrades (via batching) in the future.
>>
>> Does this API look about right?
> 
> I'm on board with that and the changes to hfi1 and qib.
> 
> Reviewed-by: Dennis Dalessandro 

Hi Dennis, thanks for the review!

I'll add those new routines in and send out a v2 soon, now that it appears, 
from 
the recent discussion, that this aspect of the approach is still viable.


thanks,
-- 
John Hubbard
NVIDIA


[PATCH 0/4] get_user_pages*() and RDMA: first steps

2018-09-27 Thread john . hubbard
From: John Hubbard 

Hi,

This short series prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

I'd like to get the first two patches into the -mm tree.

Patch 1, although not technically critical to do now, is still nice to have,
because it's already been reviewed by Jan, and it's just one more thing on the
long TODO list here, that is ready to be checked off.

Patch 2 is required in order to allow me (and others, if I'm lucky) to start
submitting changes to convert all of the callsites of get_user_pages*() and
put_page().  I think this will work a lot better than trying to maintain a
massive patchset and submitting all at once.

Patch 3 converts infiniband drivers: put_page() --> put_user_page(). I picked
a fairly small and easy example.

Patch 4 converts a small driver from put_page() --> release_user_pages(). This
could just as easily have been done as a change from put_page() to
put_user_page(). The reason I did it this way is that this provides a small and
simple caller of the new release_user_pages() routine. I wanted both of the
new routines, even though just placeholders, to have callers.

Once these are all in, then the floodgates can open up to convert the large
number of get_user_pages*() callsites.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

CC: Al Viro 
CC: Christian Benvenuti 
CC: Christopher Lameter 
CC: Dan Williams 
CC: Dennis Dalessandro 
CC: Doug Ledford 
CC: Jan Kara 
CC: Jason Gunthorpe 
CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Mike Marciniszyn 
CC: linux-kernel@vger.kernel.org
CC: linux...@kvack.org
CC: linux-r...@vger.kernel.org

John Hubbard (4):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page(), placeholder version
  infiniband/mm: convert to the new put_user_page() call
  goldfish_pipe/mm: convert to the new release_user_pages() call

 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 ++--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 ++---
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/platform/goldfish/goldfish_pipe.c   |  7 ++--
 include/linux/mm.h  | 14 
 mm/gup.c| 37 -
 10 files changed, 52 insertions(+), 30 deletions(-)

-- 
2.19.0



[PATCH 1/4] mm: get_user_pages: consolidate error handling

2018-09-27 Thread john . hubbard
From: John Hubbard 

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
struct vm_area_struct **vmas, int *nonblocking)
 {
long i = 0;
+   int err = 0;
unsigned int page_mask;
struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
-   int ret;
-   ret = get_gate_page(mm, start & PAGE_MASK,
+   err = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
-   if (ret)
-   return i ? : ret;
+   if (err)
+   goto out;
page_mask = 0;
goto next_page;
}
 
-   if (!vma || check_vma_flags(vma, gup_flags))
-   return i ? : -EFAULT;
+   if (!vma || check_vma_flags(vma, gup_flags)) {
+   err = -EFAULT;
+   goto out;
+   }
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
-   if (unlikely(fatal_signal_pending(current)))
-   return i ? i : -ERESTARTSYS;
+   if (unlikely(fatal_signal_pending(current))) {
+   err = -ERESTARTSYS;
+   goto out;
+   }
cond_resched();
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {
-   int ret;
-   ret = faultin_page(tsk, vma, start, &foll_flags,
+   err = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
-   switch (ret) {
+   switch (err) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
-   return i ? i : ret;
+   goto out;
case -EBUSY:
-   return i;
+   err = 0;
+   goto out;
case -ENOENT:
goto next_page;
}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 */
goto next_page;
} else if (IS_ERR(page)) {
-   return i ? i : PTR_ERR(page);
+   err = PTR_ERR(page);
+   goto out;
}
if (pages) {
pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
} while (nr_pages);
-   return i;
+
+out:
+   return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.0



[PATCH 4/4] goldfish_pipe/mm: convert to the new release_user_pages() call

2018-09-27 Thread john . hubbard
From: John Hubbard 

For code that retains pages via get_user_pages*(),
release those pages via the new release_user_pages(),
instead of calling put_page().

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

CC: Al Viro 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index fad0345376e0..1e9455a86698 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -340,8 +340,9 @@ static void __release_user_pages(struct page **pages, int 
pages_count,
for (i = 0; i < pages_count; i++) {
if (!is_write && consumed_size > 0)
set_page_dirty(pages[i]);
-   put_page(pages[i]);
}
+
+   release_user_pages(pages, pages_count);
 }
 
 /* Populate the call parameters, merging adjacent pages together */
-- 
2.19.0



[PATCH 3/4] infiniband/mm: convert to the new put_user_page() call

2018-09-27 Thread john . hubbard
From: John Hubbard 

For code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(),
instead of put_page().

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: Mike Marciniszyn 
CC: Dennis Dalessandro 
CC: Christian Benvenuti 

CC: linux-r...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux...@kvack.org
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  | 2 +-
 drivers/infiniband/core/umem_odp.c  | 2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c | 6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   | 8 
 drivers/infiniband/hw/usnic/usnic_uiom.c| 2 +-
 7 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..9430d697cb9f 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
page = sg_page(sg);
if (!PageDirty(page) && umem->writable && dirty)
set_page_dirty_lock(page);
-   put_page(page);
+   put_user_page(page);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
user_virt, u64 bcnt,
ret = -EFAULT;
break;
}
-   put_page(local_page_list[j]);
+   put_user_page(local_page_list[j]);
continue;
}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..c7516029af33 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -126,7 +126,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct 
page **p,
for (i = 0; i < npages; i++) {
if (dirty)
set_page_dirty_lock(p[i]);
-   put_page(p[i]);
+   put_user_page(p[i]);
}
 
if (mm) { /* during close after signal, mm can be NULL */
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 
ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
-   put_page(pages[0]);
+   put_user_page(pages[0]);
goto out;
}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1,
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->page[i].mem));
goto out;
}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, 
struct mthca_uar *uar,
if (db_tab->page[i].uvirt) {
mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1,
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->page[i].mem));
}
}
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c 
b/drivers/infiniband/hw/qib/qib_user_pages.c
index 16543d5e80c3..3f8fd42dd7fc 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -45,7 +45,7 @@ static void __qib_release_user_pages(struct page **p, size_t 
num_pages,
for (i = 0; i < num_pages; i++) {
if (dirty)
set_page_dirty_lock(p[i]);
-   put_page(p[i]);
+   put_user_page(p[i]);
}
 }
 
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c 
b/

[PATCH 2/4] mm: introduce put_user_page(), placeholder version

2018-09-27 Thread john . hubbard
From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also adds release_user_pages(), a drop-in replacement for
release_pages(). This is intended to be easily grep-able,
for later performance improvements, since release_user_pages
is not batched like release_pages() is, and is significantly
slower.

Also: rename goldfish_pipe.c's release_user_pages(), in order
to avoid a naming conflict with the new external function of
the same name.

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c |  4 ++--
 include/linux/mm.h| 14 ++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index 2da567540c2d..fad0345376e0 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -332,7 +332,7 @@ static int pin_user_pages(unsigned long first_page, 
unsigned long last_page,
 
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
+static void __release_user_pages(struct page **pages, int pages_count,
int is_write, s32 consumed_size)
 {
int i;
@@ -410,7 +410,7 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pages, pages_count, is_write, *consumed_size);
+   __release_user_pages(pages, pages_count, is_write, *consumed_size);
 
mutex_unlock(&pipe->lock);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a61ebe8ad4ca..72caf803115f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -943,6 +943,20 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
+/* Placeholder version, until all get_user_pages*() callers are updated. */
+static inline void put_user_page(struct page *page)
+{
+   put_page(page);
+}
+
+/* A drop-in replacement for release_pages(): */
+static inline void release_user_pages(struct page **pages,
+ unsigned long npages)
+{
+   while (npages)
+   put_user_page(pages[--npages]);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
-- 
2.19.0



Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps

2018-09-28 Thread John Hubbard
On 9/28/18 8:29 AM, Jerome Glisse wrote:
> On Thu, Sep 27, 2018 at 10:39:45PM -0700, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
>>
>> Hi,
>>
>> This short series prepares for eventually fixing the problem described
>> in [1], and is following a plan listed in [2].
>>
>> I'd like to get the first two patches into the -mm tree.
>>
>> Patch 1, although not technically critical to do now, is still nice to have,
>> because it's already been reviewed by Jan, and it's just one more thing on 
>> the
>> long TODO list here, that is ready to be checked off.
>>
>> Patch 2 is required in order to allow me (and others, if I'm lucky) to start
>> submitting changes to convert all of the callsites of get_user_pages*() and
>> put_page().  I think this will work a lot better than trying to maintain a
>> massive patchset and submitting all at once.
>>
>> Patch 3 converts infiniband drivers: put_page() --> put_user_page(). I picked
>> a fairly small and easy example.
>>
>> Patch 4 converts a small driver from put_page() --> release_user_pages(). 
>> This
>> could just as easily have been done as a change from put_page() to
>> put_user_page(). The reason I did it this way is that this provides a small 
>> and
>> simple caller of the new release_user_pages() routine. I wanted both of the
>> new routines, even though just placeholders, to have callers.
>>
>> Once these are all in, then the floodgates can open up to convert the large
>> number of get_user_pages*() callsites.
>>
>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
>> Proposed steps for fixing get_user_pages() + DMA problems.
>>
> 
> So the solution is to wait (possibly for days, months, years) that the
> RDMA or GPU which did GUP and do not have mmu notifier, release the page
> (or put_user_page()) ?
> 
> This sounds bad. Like I said during LSF/MM there is no way to properly
> fix hardware that can not be preempted/invalidated ... most GPU are fine.
> Few RDMA are fine, most can not ...
> 

Hi Jerome,

Personally, I think that this particular design is the best one I've seen
so far, but if other, better designs show up, then let's do those instead, sure.

I guess your main concern is that this might take longer than other approaches.

As for time frame, perhaps I made it sound worse than it really is. I have 
patches
staged already for all of the simpler call sites, and for about half of the more
complicated ones. The core solution in mm is not large, and we've gone through 
a 
few discussion threads about it back in July or so, so it shouldn't take too 
long
to perfect it.

So it may be a few months to get it all reviewed and submitted, but I don't
see "years" by any stretch.


> If it is just about fixing the set_page_dirty() bug then just looking at
> refcount versus mapcount should already tell you if you can remove the
> buffer head from the page or not. Which would fix the bug without complex
> changes (I still like the put_user_page just for symmetry with GUP).
> 

It's about more than that. The goal is to make it safe and correct to
use a non-CPU device to read and write to "pinned" memory, especially when
that memory is backed by a file system.

I recall there were objections to just narrowly fixing the set_page_dirty()
bug, because the underlying problem is large and serious. So here we are.

thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH] mm: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE

2018-10-10 Thread John Hubbard
On 10/10/18 10:26 AM, Jann Horn wrote:
> On Wed, Oct 10, 2018 at 7:19 PM Michal Hocko  wrote:
>> On Wed 10-10-18 17:27:36, Jann Horn wrote:
>>> Daniel Micay reports that attempting to use MAP_FIXED_NOREPLACE in an
>>> application causes that application to randomly crash. The existing check
>>> for handling MAP_FIXED_NOREPLACE looks up the first VMA that either
>>> overlaps or follows the requested region, and then bails out if that VMA
>>> overlaps *the start* of the requested region. It does not bail out if the
>>> VMA only overlaps another part of the requested region.
>>
>> I do not understand. Could you give me an example?
> 
> Sure.
> 
> ===
> user@debian:~$ cat mmap_fixed_simple.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <sys/mman.h>
> #include <unistd.h>
> 
> #ifndef MAP_FIXED_NOREPLACE
> #define MAP_FIXED_NOREPLACE 0x100000
> #endif
> 
> int main(void) {
>   char *p;
> 
>   errno = 0;
>   p = mmap((void*)0x10001000, 0x4000, PROT_NONE,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
>   printf("p1=%p err=%m\n", p);
> 
>   errno = 0;
> p = mmap((void*)0x10000000, 0x2000, PROT_READ,
> MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
>   printf("p2=%p err=%m\n", p);
> 
>   char cmd[100];
>   sprintf(cmd, "cat /proc/%d/maps", getpid());
>   system(cmd);
> 
>   return 0;
> }
> user@debian:~$ gcc -o mmap_fixed_simple mmap_fixed_simple.c
> user@debian:~$ ./mmap_fixed_simple
> p1=0x10001000 err=Success
> p2=0x10000000 err=Success
> 10000000-10002000 r--p 00000000 00:00 0
> 10002000-10005000 ---p 00000000 00:00 0
> 564a9a06f000-564a9a07 r-xp  fe:01 264004
>   /home/user/mmap_fixed_simple
> 564a9a26f000-564a9a27 r--p  fe:01 264004
>   /home/user/mmap_fixed_simple
> 564a9a27-564a9a271000 rw-p 1000 fe:01 264004
>   /home/user/mmap_fixed_simple
> 564a9a54a000-564a9a56b000 rw-p  00:00 0  
> [heap]
> 7f8eba447000-7f8eba5dc000 r-xp  fe:01 405885
>   /lib/x86_64-linux-gnu/libc-2.24.so
> 7f8eba5dc000-7f8eba7dc000 ---p 00195000 fe:01 405885
>   /lib/x86_64-linux-gnu/libc-2.24.so
> 7f8eba7dc000-7f8eba7e r--p 00195000 fe:01 405885
>   /lib/x86_64-linux-gnu/libc-2.24.so
> 7f8eba7e-7f8eba7e2000 rw-p 00199000 fe:01 405885
>   /lib/x86_64-linux-gnu/libc-2.24.so
> 7f8eba7e2000-7f8eba7e6000 rw-p  00:00 0
> 7f8eba7e6000-7f8eba809000 r-xp  fe:01 405876
>   /lib/x86_64-linux-gnu/ld-2.24.so
> 7f8eba9e9000-7f8eba9eb000 rw-p  00:00 0
> 7f8ebaa06000-7f8ebaa09000 rw-p  00:00 0
> 7f8ebaa09000-7f8ebaa0a000 r--p 00023000 fe:01 405876
>   /lib/x86_64-linux-gnu/ld-2.24.so
> 7f8ebaa0a000-7f8ebaa0b000 rw-p 00024000 fe:01 405876
>   /lib/x86_64-linux-gnu/ld-2.24.so
> 7f8ebaa0b000-7f8ebaa0c000 rw-p  00:00 0
> 7ffcc99fa000-7ffcc9a1b000 rw-p  00:00 0  
> [stack]
> 7ffcc9b44000-7ffcc9b47000 r--p  00:00 0  
> [vvar]
> 7ffcc9b47000-7ffcc9b49000 r-xp  00:00 0  
> [vdso]
> ff60-ff601000 r-xp  00:00 0
>   [vsyscall]
> user@debian:~$ uname -a
> Linux debian 4.19.0-rc6+ #181 SMP Wed Oct 3 23:43:42 CEST 2018 x86_64 
> GNU/Linux
> user@debian:~$
> ===
> 
> As you can see, the first page of the mapping at 0x10001000 was clobbered.

This looks good to me. The short example really helped, thanks for that.

(I think my first reply got bounced, sorry if this ends up as a duplicate email
for anyone.)

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH] mm: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE

2018-10-10 Thread John Hubbard
On 10/10/18 10:38 AM, Michal Hocko wrote:
> On Wed 10-10-18 19:26:50, Jann Horn wrote:
> [...]
>> As you can see, the first page of the mapping at 0x10001000 was clobbered.
>>
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 5f2b2b184c60..f7cd9cb966c0 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -1410,7 +1410,7 @@ unsigned long do_mmap(struct file *file, unsigned 
>>>> long addr,
>>>>   if (flags & MAP_FIXED_NOREPLACE) {
>>>>   struct vm_area_struct *vma = find_vma(mm, addr);
>>>>
>>>> - if (vma && vma->vm_start <= addr)
>>>> + if (vma && vma->vm_start < addr + len)
>>>
>>> find_vma is documented to - Look up the first VMA which satisfies addr <
>>> vm_end, NULL if none.
>>> This means that the above check guarantees that
>>> vm_start <= addr < vm_end
>>> so an overlap is guaranteed. Why should we care how much we overlap?
>>
>> "an overlap is guaranteed"? I have no idea what you're trying to say.
> 
> I have misread your changelog and the patch. Sorry about that. I thought
> you meant a false positive but you in fact meant a false negative. Now it
> makes complete sense.
> 
> Acked-by: Michal Hocko 
> 
> And thanks a lot for catching that!
> 

This also looks good to me. 

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v5 2/3] mm: introduce put_user_page*(), placeholder versions

2018-10-10 Thread John Hubbard
On 10/10/18 1:03 AM, Jan Kara wrote:
> On Tue 09-10-18 21:11:33, john.hubb...@gmail.com wrote:
>> +/*
>> + * put_user_pages() - for each page in the @pages array, release the page
>> + * using put_user_page().
>> + *
>> + * Please see the put_user_page() documentation for details.
>> + *
>> + * This is just like put_user_pages_dirty(), except that it invokes
>> + * set_page_dirty_lock(), instead of set_page_dirty().
> 
> This paragraph should be deleted. Other than that the patch looks good.
> 

Good catch. Fixed locally, and it will go up with the next spin.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v4 2/3] mm: introduce put_user_page*(), placeholder versions

2018-10-10 Thread John Hubbard
On 10/10/18 1:59 AM, Jan Kara wrote:
> On Tue 09-10-18 17:42:09, John Hubbard wrote:
>> On 10/8/18 5:14 PM, Andrew Morton wrote:
>>> Also, maintainability.  What happens if someone now uses put_page() by
>>> mistake?  Kernel fails in some mysterious fashion?  How can we prevent
>>> this from occurring as code evolves?  Is there a cheap way of detecting
>>> this bug at runtime?
>>>
>>
>> It might be possible to do a few run-time checks, such as "does the page that
>> came back to put_user_page() have the correct flags?", but it's harder (without
>> having a dedicated page flag) to detect the other direction: "did someone pass
>> a get_user_pages page to put_page()?"
>>
>> As Jan said in his reply, converting get_user_pages (and put_user_page) to 
>> work with a new data type that wraps struct pages, would solve it, but that's
>> an awfully large change. Still...given how much of a mess this can turn into 
>> if it's wrong, I wonder if it's worth it--maybe? 
> 
> I'm certainly not opposed to looking into it. But after looking into this
> for a while it is not clear to me how to convert e.g. fs/direct-io.c or
> fs/iomap.c. They pass the reference from gup() via
> bio->bi_io_vec[]->bv_page and then release it after IO completion.
> Propagating the new type to ->bv_page is not good as lower layers do not
> really care how the page is pinned in memory. But we do need to somehow
> pass the information to the IO completion functions in a robust manner.
> 

You know, that problem has to be solved in either case: even if we do not
use a new data type for get_user_pages, we still need to clearly, accurately
match up the get/put pairs. And for the complicated systems (block IO, and
GPU DRM layer, especially) one of the things that has caused me concern is 
the way the pages all end up in a large, complicated pool, and put_page is
used to free all of them, indiscriminately.

So I'm glad you're looking at ways to disambiguate this for the bio system.

> Hmm, what about the following:
> 
> 1) Make gup() return new type - struct user_page *? In practice that would
> be just a struct page pointer with 0 bit set so that people are forced to
> use proper helpers and not just force types (and the setting of bit 0 and
> masking back would be hidden behind CONFIG_DEBUG_USER_PAGE_REFERENCES for
> performance reasons). Also the transition would have to be gradual so we'd
> have to name the function differently and use it from converted code.

Yes. That seems perfect: it just fades away if you're not debugging, but we
can catch lots of problems when CONFIG_DEBUG_USER_PAGE_REFERENCES is set. 
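
Just to make sure I understand the shape of it, here is a rough sketch of
what I think you mean (everything below is illustrative; only the
struct user_page name and CONFIG_DEBUG_USER_PAGE_REFERENCES come from your
description, the helper names are made up):

struct user_page;       /* opaque; never dereferenced directly */

static inline struct user_page *page_to_user_page(struct page *page)
{
#ifdef CONFIG_DEBUG_USER_PAGE_REFERENCES
        return (struct user_page *)((unsigned long)page | 1UL);
#else
        return (struct user_page *)page;
#endif
}

static inline struct page *user_page_to_page(struct user_page *upage)
{
#ifdef CONFIG_DEBUG_USER_PAGE_REFERENCES
        VM_BUG_ON(!((unsigned long)upage & 1UL));
        return (struct page *)((unsigned long)upage & ~1UL);
#else
        return (struct page *)upage;
#endif
}

That way, get_user_pages() hands back struct user_page pointers, and
put_user_page() accepts only those, so a raw put_page() on one of them
simply fails to compile.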

> 
> 2) Provide helper bio_add_user_page() that will take user_page, convert it
> to struct page, add it to the bio, and flag the bio as having pages with
> user references. That code would also make sure the bio is consistent in
> having only user-referenced pages in that case. IO completion (like
> bio_check_pages_dirty(), bio_release_pages() etc.) will check the flag and
> use the appropriate release function.

I'm very new to bio, so I have to ask: can we be sure that the same types of 
pages are always used, within each bio? Because otherwise, we'll have to plumb 
it all the way down to bio_vec's--or so it appears, based on my reading of 
bio_release_pages() and surrounding code.
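
And just to check my understanding of (2), maybe something roughly like this
(untested sketch; BIO_USER_PAGES is a flag name I just made up, and the
"only one kind of page per bio" enforcement is hand-waved):

/* Reuses the made-up user_page_unpack() helper sketched above. */
static inline int bio_add_user_page(struct bio *bio, struct user_page *upage,
                                    unsigned int len, unsigned int offset)
{
        /* A real version should also complain if this bio already holds
         * ordinary (non-gup) page references.
         */
        bio_set_flag(bio, BIO_USER_PAGES);
        return bio_add_page(bio, user_page_unpack(upage), len, offset);
}

...and then bio_release_pages(), bio_check_pages_dirty() and friends would
test bio_flagged(bio, BIO_USER_PAGES) and call put_user_page() instead of
put_page().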

> 
> 3) I have noticed fs/direct-io.c may submit zero page for IO when it needs
> to clear stuff so we'll probably need a helper function to acquire 'user pin'
> reference given a page pointer so that that code can be kept reasonably
> simple and pass user_page references all around.
>

This only works if we don't set page flags, because if we do set page flags 
on the single, global zero page, that will break the world. So I'm not sure
that the zero page usage in fs/direct-io.c is going to survive the conversion
to this new approach. :)
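
Although, if we end up going with the pointer-tagging flavor rather than page
flags, then maybe the helper for (3) can stay as trivial as this (again, an
untested sketch with an invented name), since it never touches page flags:

static inline struct user_page *get_user_page_ref(struct page *page)
{
        /* Take a normal page reference and hand back a "user pin" handle. */
        get_page(page);
        return user_page_pack(page);
}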
 
> So this way we could maintain reasonable confidence that refcounts didn't
> get mixed up. Thoughts?
> 

After thinking some more about the complicated situations in bio and DRM,
and looking into the future (bug reports...), I am leaning toward your 
struct user_page approach. 

I'm looking forward to hearing other opinions on whether it's worth it to go
and do this fairly intrusive change, in return for, probably, fewer bugs along
the way.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 2/4] mm: introduce put_user_page(), placeholder version

2018-10-03 Thread John Hubbard
On 10/3/18 9:22 AM, Jan Kara wrote:
> On Thu 27-09-18 22:39:48, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also adds release_user_pages(), a drop-in replacement for
>> release_pages(). This is intended to be easily grep-able,
>> for later performance improvements, since release_user_pages
>> is not batched like release_pages() is, and is significantly
>> slower.
> 
> A small nit but can we maybe call this put_user_pages() for symmetry with
> put_user_page()? I don't really care too much but it would look natural to
> me.
> 

Sure. It started out as "make it a drop-in replacement for release_pages()",
but now it's not quite a drop-in replacement anymore. And in any case it's an 
opportunity to make the name more intuitive, so that seems like a good
idea.

If anyone hates put_user_pages() and wants to campaign relentlessly for
release_pages*(), then now is the time! :)


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call

2018-10-03 Thread John Hubbard
On 10/3/18 9:27 AM, Jan Kara wrote:
> On Fri 28-09-18 20:12:33, John Hubbard wrote:
>>  static inline void release_user_pages(struct page **pages,
>> - unsigned long npages)
>> + unsigned long npages,
>> + bool set_dirty)
>>  {
>> -   while (npages)
>> -   put_user_page(pages[--npages]);
>> +   if (set_dirty)
>> +   release_user_pages_dirty(pages, npages);
>> +   else
>> +   release_user_pages_basic(pages, npages);
>> +}
> 
> Is there a good reason to have this with set_dirty argument? Generally bool
> arguments are not great for readability (or greppability for that matter).
> Also in this case callers can just as easily do:
>   if (set_dirty)
>   release_user_pages_dirty(...);
>   else
>   release_user_pages(...);
> 
> And furthermore it makes the code author think more whether he needs
> set_page_dirty() or set_page_dirty_lock(), rather than just passing 'true'
> and hoping the function magically does the right thing for him.
> 

Ha, I went through *precisely* that argument in my head, too--and then
got seduced by the idea that it pretties up the existing calling code,
because it's a drop-in one-liner at the call sites. But yes, I'll change it
back to omit the bool set_dirty argument.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v2 2/3] mm: introduce put_user_page[s](), placeholder versions

2018-10-05 Thread John Hubbard
On 10/5/18 2:48 PM, Jason Gunthorpe wrote:
> On Fri, Oct 05, 2018 at 12:49:06PM -0700, John Hubbard wrote:
>> On 10/5/18 8:17 AM, Jason Gunthorpe wrote:
>>> On Thu, Oct 04, 2018 at 09:02:24PM -0700, john.hubb...@gmail.com wrote:
>>>> From: John Hubbard 
>>>>
>>>> Introduces put_user_page(), which simply calls put_page().
>>>> This provides a way to update all get_user_pages*() callers,
>>>> so that they call put_user_page(), instead of put_page().
>>>>
>>>> Also introduces put_user_pages(), and a few dirty/locked variations,
>>>> as a replacement for release_pages(), for the same reasons.
>>>> These may be used for subsequent performance improvements,
>>>> via batching of pages to be released.
>>>>
>>>> This prepares for eventually fixing the problem described
>>>> in [1], and is following a plan listed in [2], [3], [4].
>>>>
>>>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>>>
>>>> [2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
>>>> Proposed steps for fixing get_user_pages() + DMA problems.
>>>>
>>>> [3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
>>>> Bounce buffers (otherwise [2] is not really viable).
>>>>
>>>> [4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
>>>> Follow-up discussions.
>>>>
>> [...]
>>>>  
>>>> +/* Placeholder version, until all get_user_pages*() callers are updated. 
>>>> */
>>>> +static inline void put_user_page(struct page *page)
>>>> +{
>>>> +  put_page(page);
>>>> +}
>>>> +
>>>> +/* For get_user_pages*()-pinned pages, use these variants instead of
>>>> + * release_pages():
>>>> + */
>>>> +static inline void put_user_pages_dirty(struct page **pages,
>>>> +  unsigned long npages)
>>>> +{
>>>> +  while (npages) {
>>>> +  set_page_dirty(pages[npages]);
>>>> +  put_user_page(pages[npages]);
>>>> +  --npages;
>>>> +  }
>>>> +}
>>>
>>> Shouldn't these do the !PageDirty(page) thing?
>>>
>>
>> Well, not yet. This is the "placeholder" patch, in which I planned to keep
>> the behavior the same, while I go to all the get_user_pages call sites and 
>> change 
>> put_page() and release_pages() over to use these new routines.
> 
> Hmm.. Well, if it is the right thing to do here, why not include it and
> take it out of callers when doing the conversion?
> 
> If it is the wrong thing, then let us still take it out of callers
> when doing the conversion :)
> 
> Just seems like things will be in a better place to make future
> changes if all the call sites are de-duplicated and correct.
> 

OK, yes. Let me send out a v3 with that included, then.

thanks,
-- 
John Hubbard
NVIDIA


[PATCH v3 0/3] get_user_pages*() and RDMA: first steps

2018-10-05 Thread john . hubbard
From: John Hubbard 

Changes since v2:

-- Absorbed more dirty page handling logic into the put_user_page*(), and
   handled some page releasing loops in infiniband more thoroughly, as per
   Jason Gunthorpe's feedback.

-- Fixed a bug in the put_user_pages*() routines' loops (thanks to
   Ralph Campbell for spotting it).

Changes since v1:

-- Renamed release_user_pages*() to put_user_pages*(), from Jan's feedback.

-- Removed the goldfish.c changes, and instead, only included a single
   user (infiniband) of the new functions. That is because goldfish.c no
   longer has a name collision (it has a release_user_pages() routine), and
   also because infiniband exercises both the put_user_page() and
   put_user_pages*() paths.

-- Updated links to discussions and plans, so as to be sure to include
   bounce buffers, thanks to Jerome's feedback.

Also:

-- Dennis, thanks for your earlier review, and I have not yet added your
   Reviewed-by tag, because this revision changes the things that you had
   previously reviewed, thus potentially requiring another look.

This short series prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

Patch 1, although not technically critical to do now, is still nice to
have, because it's already been reviewed by Jan, and it's just one more
thing on the long TODO list here, that is ready to be checked off.

Patch 2 is required in order to allow me (and others, if I'm lucky) to
start submitting changes to convert all of the callsites of
get_user_pages*() and put_page().  I think this will work a lot better
than trying to maintain a massive patchset and submitting all at once.

Patch 3 converts infiniband drivers: put_page() --> put_user_page(), and
also exercises put_user_pages_dirty_lock().

Once these are all in, then the floodgates can open up to convert the large
number of get_user_pages*() callsites.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 
CC: Ralph Campbell 

John Hubbard (3):
  mm: get_user_pages: consolidate error handling
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()

 drivers/infiniband/core/umem.c  |  7 +--
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ++---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 +--
 include/linux/mm.h  | 48 -
 mm/gup.c| 37 +---
 9 files changed, 92 insertions(+), 45 deletions(-)

-- 
2.19.0



[PATCH v3 3/3] infiniband/mm: convert put_page() to put_user_page*()

2018-10-05 Thread john . hubbard
From: John Hubbard 

For code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page()

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Doug Ledford 
CC: Jason Gunthorpe 
CC: Mike Marciniszyn 
CC: Dennis Dalessandro 
CC: Christian Benvenuti 

CC: linux-r...@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux...@kvack.org
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  7 ---
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  8 
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
 7 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a41792dbae1f..7ab7a3a35eb4 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
page = sg_page(sg);
-   if (!PageDirty(page) && umem->writable && dirty)
-   set_page_dirty_lock(page);
-   put_page(page);
+   if (umem->writable && dirty)
+   put_user_pages_dirty_lock(&page, 1);
+   else
+   put_user_page(page);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 6ec748eccff7..6227b89cf05c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -717,7 +717,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
user_virt, u64 bcnt,
ret = -EFAULT;
break;
}
-   put_page(local_page_list[j]);
+   put_user_page(local_page_list[j]);
continue;
}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, 
unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   size_t i;
-
-   for (i = 0; i < npages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_page(p[i]);
-   }
+   if (dirty)
+   put_user_pages_dirty_lock(p, npages);
+   else
+   put_user_pages(p, npages);
 
if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 
ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
-   put_page(pages[0]);
+   put_user_page(pages[0]);
goto out;
}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, 
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->page[i].mem));
goto out;
}
 
@@ -555,7 +555,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, 
struct mthca_uar *uar,
if (db_tab->page[i].uvirt) {
mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
pci_unmap_sg(dev->pdev, &db_tab-

[PATCH v3 1/3] mm: get_user_pages: consolidate error handling

2018-10-05 Thread john . hubbard
From: John Hubbard 

An upcoming patch requires a way to operate on each page that
any of the get_user_pages_*() variants returns.

In preparation for that, consolidate the error handling for
__get_user_pages(). This provides a single location (the "out:" label)
for operating on the collected set of pages that are about to be returned.

As long as every use of the "ret" variable is being edited, rename
"ret" --> "err", so that its name matches its true role.
This also gets rid of two shadowed variable declarations, as a
tiny beneficial side effect.

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 37 ++---
 1 file changed, 22 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1abc8b4afff6..05ee7c18e59a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -660,6 +660,7 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
struct vm_area_struct **vmas, int *nonblocking)
 {
long i = 0;
+   int err = 0;
unsigned int page_mask;
struct vm_area_struct *vma = NULL;
 
@@ -685,18 +686,19 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
-   int ret;
-   ret = get_gate_page(mm, start & PAGE_MASK,
+   err = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
-   if (ret)
-   return i ? : ret;
+   if (err)
+   goto out;
page_mask = 0;
goto next_page;
}
 
-   if (!vma || check_vma_flags(vma, gup_flags))
-   return i ? : -EFAULT;
+   if (!vma || check_vma_flags(vma, gup_flags)) {
+   err = -EFAULT;
+   goto out;
+   }
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
@@ -709,23 +711,25 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 * If we have a pending SIGKILL, don't keep faulting pages and
 * potentially allocating memory.
 */
-   if (unlikely(fatal_signal_pending(current)))
-   return i ? i : -ERESTARTSYS;
+   if (unlikely(fatal_signal_pending(current))) {
+   err = -ERESTARTSYS;
+   goto out;
+   }
cond_resched();
page = follow_page_mask(vma, start, foll_flags, &page_mask);
if (!page) {
-   int ret;
-   ret = faultin_page(tsk, vma, start, _flags,
err = faultin_page(tsk, vma, start, &foll_flags,
nonblocking);
-   switch (ret) {
+   switch (err) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
-   return i ? i : ret;
+   goto out;
case -EBUSY:
-   return i;
+   err = 0;
+   goto out;
case -ENOENT:
goto next_page;
}
@@ -737,7 +741,8 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
 */
goto next_page;
} else if (IS_ERR(page)) {
-   return i ? i : PTR_ERR(page);
+   err = PTR_ERR(page);
+   goto out;
}
if (pages) {
pages[i] = page;
@@ -757,7 +762,9 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
start += page_increm * PAGE_SIZE;
nr_pages -= page_increm;
} while (nr_pages);
-   return i;
+
+out:
+   return i ? i : err;
 }
 
 static bool vma_permits_fault(struct vm_area_struct *vma,
-- 
2.19.0



[PATCH v3 2/3] mm: introduce put_user_page*(), placeholder versions

2018-10-05 Thread john . hubbard
From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This prepares for eventually fixing the problem described
in [1], and is following a plan listed in [2], [3], [4].

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

[2] https://lkml.kernel.org/r/20180709080554.21931-1-jhubb...@nvidia.com
Proposed steps for fixing get_user_pages() + DMA problems.

[3]https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz
Bounce buffers (otherwise [2] is not really viable).

[4] https://lkml.kernel.org/r/20181003162115.gg24...@quack2.suse.cz
Follow-up discussions.

CC: Matthew Wilcox 
CC: Michal Hocko 
CC: Christopher Lameter 
CC: Jason Gunthorpe 
CC: Dan Williams 
CC: Jan Kara 
CC: Al Viro 
CC: Jerome Glisse 
CC: Christoph Hellwig 
CC: Ralph Campbell 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 48 --
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0416a7204be3..305b206e6851 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -137,6 +137,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, 
int, void __user *,
size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
size_t *, loff_t *);
+int set_page_dirty(struct page *page);
+int set_page_dirty_lock(struct page *page);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
@@ -943,6 +945,50 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
+/* Pages that were pinned via get_user_pages*() should be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below.
+ */
+static inline void put_user_page(struct page *page)
+{
+   put_page(page);
+}
+
+static inline void put_user_pages_dirty(struct page **pages,
+   unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   if (!PageDirty(pages[index]))
+   set_page_dirty(pages[index]);
+
+   put_user_page(pages[index]);
+   }
+}
+
+static inline void put_user_pages_dirty_lock(struct page **pages,
+unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   if (!PageDirty(pages[index]))
+   set_page_dirty_lock(pages[index]);
+
+   put_user_page(pages[index]);
+   }
+}
+
+static inline void put_user_pages(struct page **pages,
+ unsigned long npages)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++)
+   put_user_page(pages[index]);
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
@@ -1534,8 +1580,6 @@ int redirty_page_for_writepage(struct writeback_control 
*wbc,
 void account_page_dirtied(struct page *page, struct address_space *mapping);
 void account_page_cleaned(struct page *page, struct address_space *mapping,
  struct bdi_writeback *wb);
-int set_page_dirty(struct page *page);
-int set_page_dirty_lock(struct page *page);
 void __cancel_dirty_page(struct page *page);
 static inline void cancel_dirty_page(struct page *page)
 {
-- 
2.19.0



Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps

2018-09-28 Thread John Hubbard
On 9/28/18 2:49 PM, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 12:06:12PM -0700, John Hubbard wrote:
>> On 9/28/18 8:29 AM, Jerome Glisse wrote:
>>> On Thu, Sep 27, 2018 at 10:39:45PM -0700, john.hubb...@gmail.com wrote:
>>>> From: John Hubbard 
[...]
>>> So the solution is to wait (possibly for days, months, years) for the
>>> RDMA or GPU which did GUP and does not have an mmu notifier to release
>>> the page (or put_user_page()) ?
>>>
>>> This sounds bad. Like I said during LSF/MM there is no way to properly
>>> fix hardware that can not be preempted/invalidated ... most GPUs are fine.
>>> Few RDMA devices are fine, most can not ...
>>>
>>
>> Hi Jerome,
>>
>> Personally, I think that this particular design is the best one I've seen
>> so far, but if other, better designs show up, then let's do those instead,
>> sure.
>>
>> I guess your main concern is that this might take longer than other 
>> approaches.
>>
>> As for time frame, perhaps I made it sound worse than it really is. I have 
>> patches
>> staged already for all of the simpler call sites, and for about half of the 
>> more
>> complicated ones. The core solution in mm is not large, and we've gone 
>> through a 
>> few discussion threads about it back in July or so, so it shouldn't take too 
>> long
>> to perfect it.
>>
>> So it may be a few months to get it all reviewed and submitted, but I don't
>> see "years" by any stretch.
> 
> Bit of miscomprehension there :) By months or years, I am talking about
> the time it will take for some user to release the pin they have on the
> page. Not the time to push something upstream.
> 
> AFAICT RDMA drivers do not have any upper bound on how long they can hold
> a page reference and thus your solution can leave one CPU core stuck for
> as long as the pin is active. Worst case might lead to all CPU cores waiting
> for something that might never happen.
> 

Actually, the latest direction on that discussion was toward periodically
writing back, even while under RDMA, via bounce buffers:

  https://lkml.kernel.org/r/20180710082100.mkdwngdv5kkrc...@quack2.suse.cz

I still think that's viable. Of course, there are other things besides 
writeback (see below) that might also lead to waiting.

>>> If it is just about fixing the set_page_dirty() bug then just looking at
>>> refcount versus mapcount should already tell you if you can remove the
>>> buffer head from the page or not. Which would fix the bug without complex
>>> changes (I still like the put_user_page just for symmetry with GUP).
>>>
>>
>> It's about more than that. The goal is to make it safe and correct to
>> use a non-CPU device to read and write to "pinned" memory, especially when
>> that memory is backed by a file system.
>>
>> I recall there were objections to just narrowly fixing the set_page_dirty()
>> bug, because the underlying problem is large and serious. So here we are.
> 
> Except that you can not solve that issue without proper hardware. GPU are
> fine. RDMA are broken except the mellanox5 hardware which can invalidate
> at anytime its page table thus allowing to write protect the page at any
> time.

Today, people are out there using RDMA without page-fault-capable hardware.
And they are hitting problems, as we've seen. From the discussions so far,
I don't think it's impossible to solve the problems, even for "lesser", 
non-fault-capable hardware. Especially once we decide on what is reasonable
and supported.  Certainly the existing situation needs *something* to 
change, even if it's (I don't recommend this) "go forth and tell the world
to stop using RDMA with their current hardware".

> 
> With the solution put forward here you can potentially wait _forever_ for
> the driver that holds a pin to drop it. This was the point I was trying to
> get across during LSF/MM.

I agree that just blocking indefinitely is generally unacceptable for kernel
code, but we can probably avoid it for many cases (bounce buffers), and
if we think it is really appropriate (file system unmounting, maybe?) then
maybe tolerate it in some rare cases.  

>You can not fix broken hardware that decided to
> use GUP to do a feature they can't reliably do because their hardware is
> not capable to behave.
> 
> Because code is easier here is what i was meaning:
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup=a5dbc0fe7e71d347067579f13579df372ec48389
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup=01677bc039c791a16d5f82b3ef84917d62fac826
> 

While that may work sometimes, I don't think it is reliable enough to trus

Re: [PATCH 3/4] infiniband/mm: convert to the new put_user_page() call

2018-09-28 Thread John Hubbard
On 9/28/18 8:39 AM, Jason Gunthorpe wrote:
> On Thu, Sep 27, 2018 at 10:39:47PM -0700, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
[...]
>>
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index a41792dbae1f..9430d697cb9f 100644
>> +++ b/drivers/infiniband/core/umem.c
>> @@ -60,7 +60,7 @@ static void __ib_umem_release(struct ib_device *dev, 
>> struct ib_umem *umem, int d
>>  page = sg_page(sg);
>>  if (!PageDirty(page) && umem->writable && dirty)
>>  set_page_dirty_lock(page);
>> -put_page(page);
>> +put_user_page(page);
> 
> Would it make sense to have a release/put_user_pages_dirtied to absorb
> the set_page_dity pattern too? I notice in this patch there is some
> variety here, I wonder what is the right way?
> 
> Also, I'm told this code here is a big performance bottleneck when the
> number of pages becomes very long (think >> GB of memory), so having a
> future path to use some kind of batching/threading sounds great.
> 

Yes. And you asked for this the first time, too. Consistent! :) Sorry for
being slow to pick it up. It looks like there are several patterns, and
we have to support both set_page_dirty() and set_page_dirty_lock(). So
the best combination looks to be adding a few variations of
release_user_pages*(), but leaving put_user_page() alone, because it's
the "do it yourself" basic one. Scatter-gather will be stuck with that.

Here's a differential patch with that, which shows a nice little cleanup in
a couple of IB places, and as you point out, it also provides the hooks for 
performance upgrades (via batching) in the future.

Does this API look about right?

diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index c7516029af33..48afec362c31 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -123,11 +123,7 @@ void hfi1_release_user_pages(struct mm_struct *mm, struct 
page **p,
 {
size_t i;
 
-   for (i = 0; i < npages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_user_page(p[i]);
-   }
+   release_user_pages_lock(p, npages, dirty);
 
if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c 
b/drivers/infiniband/hw/qib/qib_user_pages.c
index 3f8fd42dd7fc..c57a3a6730b6 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,13 +40,7 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 int dirty)
 {
-   size_t i;
-
-   for (i = 0; i < num_pages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_user_page(p[i]);
-   }
+   release_user_pages_lock(p, num_pages, dirty);
 }
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72caf803115f..b280d0181e06 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -138,6 +138,9 @@ extern int overcommit_ratio_handler(struct ctl_table *, 
int, void __user *,
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
size_t *, loff_t *);
 
+int set_page_dirty(struct page *page);
+int set_page_dirty_lock(struct page *page);
+
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
 /* to align the pointer to the (next) page boundary */
@@ -949,12 +952,56 @@ static inline void put_user_page(struct page *page)
put_page(page);
 }
 
-/* A drop-in replacement for release_pages(): */
+/* For get_user_pages*()-pinned pages, use these variants instead of
+ * release_pages():
+ */
+static inline void release_user_pages_dirty(struct page **pages,
+   unsigned long npages)
+{
+   while (npages) {
+   set_page_dirty(pages[npages]);
+   put_user_page(pages[npages]);
+   --npages;
+   }
+}
+
+static inline void release_user_pages_dirty_lock(struct page **pages,
+unsigned long npages)
+{
+   while (npages) {
+   set_page_dirty_lock(pages[npages]);
+   put_user_page(pages[npages]);
+   --npages;
+   }
+}
+
+static inline void release_user_pages_basic(struct page **pages,
+   unsigned long npages)
+{
+   while (npages) {
+   put_user_page(pages[npages]);
+   --npages;
+   }
+}
+
 static inline void release_user_pages(struct page **pages,
- unsigned long npages)
+ 

Re: [BUG] mm: direct I/O (using GUP) can write to COW anonymous pages

2018-09-25 Thread John Hubbard
On 9/18/18 2:58 AM, Jan Kara wrote:
> On Tue 18-09-18 02:35:43, Jann Horn wrote:
>> On Tue, Sep 18, 2018 at 2:05 AM Hugh Dickins  wrote:
> 
> Thanks for CC Hugh.
> 
>>> On Mon, 17 Sep 2018, Jann Horn wrote:
>>>
>>
>> Makes sense, I guess.
>>
>> I wonder whether there's a concise way to express this in the fork.2
>> manpage, or something like that. Maybe I'll take a stab at writing
>> something. The biggest issue I see with documenting this edgecase is
>> that, as an application developer, if you don't know whether some file
>> might be coming from a FUSE filesystem that has opted out of using the
>> disk cache, the "don't do that" essentially becomes "don't read() into
>> heap buffers while fork()ing in another thread", since with FUSE,
>> direct I/O can happen even if you don't open files as O_DIRECT as long
>> as the filesystem requests direct I/O, and get_user_pages_fast() will
>> AFAIU be used for non-page-aligned buffers, meaning that an adjacent
>> heap memory access could trigger CoW page duplication. But then, FUSE
>> filesystems that opt out of the disk cache are probably so rare that
>> it's not a concern in practice...
> 
> So at least for shared file mappings we do need to fix this issue as it's
> currently userspace triggerable Oops if you try hard enough. And with RDMA
> you don't even have to try that hard. Properly dealing with private
> mappings should not be that hard once the infrastructure is there I hope
> but I didn't seriously look into that. I've added Miklos and John to CC as
> they are interested as well. John was working on fixing this problem -
> https://lkml.org/lkml/2018/7/9/158 - but I didn't hear from him for quite a
> while so I'm not sure whether it died off or what's the current situation.
> 

Hi,

Sorry for missing this even though I was CC'd, I only just now noticed it, while
trying to get caught up again.

Anyway, I've been sidetracked for a...while (since July!), but am jumping back 
in and working on this now. And I've got time allocated for it. So here goes.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread John Hubbard
On 11/29/18 6:18 PM, Tom Talpey wrote:
> On 11/29/2018 8:39 PM, John Hubbard wrote:
>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>> [...]
>>>>> I'm super-limited here this week hardware-wise and have not been able
>>>>> to try testing with the patched kernel.
>>>>>
>>>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>>>> test, and without your change.
>>>>>
>>>>
>>>> So just to double check (again): you are running fio with these parameters,
>>>> right?
>>>>
>>>> [reader]
>>>> direct=1
>>>> ioengine=libaio
>>>> blocksize=4096
>>>> size=1g
>>>> numjobs=1
>>>> rw=read
>>>> iodepth=64
>>>
>>> Correct, I copy/pasted these directly. I also ran with size=10g because
>>> the 1g provides a really small sample set.
>>>
>>> There was one other difference, your results indicated fio 3.3 was used.
>>> My Bionic install has fio 3.1. I don't find that relevant because our
>>> goal is to compare before/after, which I haven't done yet.
>>>
>>
>> OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
>> options
>> set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
>> rated
>> speed of the Samsung NVMe device, so now we should have a clearer picture of 
>> the
>> performance that real users will see.
> 
> Oh, good! I'm especially glad because I was having a heck of a time
> reconfiguring the one machine I have available for this.
> 
>> Continuing on, then: running a before and after test, I don't see any 
>> significant
>> difference in the fio results:
> 
> Excerpting from below:
> 
>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> 
> vs
> 
>> With patches applied:
>> read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>    cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
> 
> Perfect results, not CPU limited, and full IOPS.
> 
> Curiously identical, so I trust you've checked that you measured
> both targets, but if so, I say it's good.
> 

Argh, copy-paste error in the email. The real "before" is ever so slightly
better, at 194K IOPS and 759 MB/s:

 $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1715: Thu Nov 29 17:07:09 2018
   read: IOPS=194k, BW=759MiB/s (795MB/s)(1024MiB/1350msec)
slat (nsec): min=1245, max=2812.7k, avg=1538.03, stdev=5519.61
clat (usec): min=148, max=755, avg=326.85, stdev=18.13
 lat (usec): min=150, max=3483, avg=328.41, stdev=19.53
clat percentiles (usec):
 |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
 | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
 | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
 | 99.00th=[  355], 99.50th=[  537], 99.90th=[  553], 99.95th=[  553],
 | 99.99th=[  619]
   bw (  KiB/s): min=767816, max=783096, per=99.84%, avg=775456.00, 
stdev=10804.59, samples=2
   iops: min=191954, max=195774, avg=193864.00, stdev=2701.15, samples=2
  lat (usec)   : 250=0.09%, 500=99.30%, 750=0.61%, 1000=0.01%
  cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=759MiB/s (795MB/s), 759MiB/s-759MiB/s (795MB/s-795MB/s), io=1024MiB 
(1074MB), run=1350-1350msec

Disk stats (read/write):
  nvme0n1: ios=222

Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread John Hubbard
On 11/29/18 6:30 PM, Tom Talpey wrote:
> On 11/29/2018 9:21 PM, John Hubbard wrote:
>> On 11/29/18 6:18 PM, Tom Talpey wrote:
>>> On 11/29/2018 8:39 PM, John Hubbard wrote:
>>>> On 11/28/18 5:59 AM, Tom Talpey wrote:
>>>>> On 11/27/2018 9:52 PM, John Hubbard wrote:
>>>>>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>>>>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>>>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>>>>> [...]
>>> Excerpting from below:
>>>
>>>> Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:
>>>>   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> vs
>>>
>>>> With patches applied:
>>>>   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
>>>>  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
>>>
>>> Perfect results, not CPU limited, and full IOPS.
>>>
>>> Curiously identical, so I trust you've checked that you measured
>>> both targets, but if so, I say it's good.
>>>
>>
>> Argh, copy-paste error in the email. The real "before" is ever so slightly
>> better, at 194K IOPS and 759 MB/s:
> 
> Definitely better - note the system CPU is lower, which is probably the
> reason for the increased IOPS.
> 
>>    cpu  : usr=18.24%, sys=44.77%, ctx=251527, majf=0, minf=73
> 
> Good result - a correct implementation, and faster.
> 

Thanks, Tom, I really appreciate your experience and help on what performance 
should look like here. (I'm sure you can guess that this is the first time 
I've worked with fio, heh.)

I'll send out a new, non-RFC patchset soon, then.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-29 Thread John Hubbard
On 11/28/18 5:59 AM, Tom Talpey wrote:
> On 11/27/2018 9:52 PM, John Hubbard wrote:
>> On 11/27/18 5:21 PM, Tom Talpey wrote:
>>> On 11/21/2018 5:06 PM, John Hubbard wrote:
>>>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>> [...]
>>> I'm super-limited here this week hardware-wise and have not been able
>>> to try testing with the patched kernel.
>>>
>>> I was able to compare my earlier quick test with a Bionic 4.15 kernel
>>> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
>>> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
>>> test, and without your change.
>>>
>>
>> So just to double check (again): you are running fio with these parameters,
>> right?
>>
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
> 
> Correct, I copy/pasted these directly. I also ran with size=10g because
> the 1g provides a really small sample set.
> 
> There was one other difference, your results indicated fio 3.3 was used.
> My Bionic install has fio 3.1. I don't find that relevant because our
> goal is to compare before/after, which I haven't done yet.
> 

OK, the 50 MB/s was due to my particular .config. I had some expensive debug 
options
set in mm, fs and locking subsystems. Turning those off, I'm back up to the 
rated
speed of the Samsung NVMe device, so now we should have a clearer picture of the
performance that real users will see.

Continuing on, then: running a before and after test, I don't see any 
significant 
difference in the fio results:

fio.conf:

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64

-
Baseline 4.20.0-rc3 (commit f2ce1065e767), as before:

$ fio ./experimental-fio.conf 
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
clat percentiles (usec):
 |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
 | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
 | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
 | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
 | 99.99th=[12125]
   bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, 
stdev=22112.64, samples=2
   iops: min=187910, max=195728, avg=191819.00, stdev=5528.16, samples=2
  lat (usec)   : 250=0.08%, 500=99.30%, 750=0.59%
  lat (msec)   : 20=0.02%
  cpu  : usr=16.26%, sys=48.05%, ctx=251258, majf=0, minf=73
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB 
(1074MB), run=1360-1360msec

Disk stats (read/write):
  nvme0n1: ios=220798/0, merge=0/0, ticks=71481/0, in_queue=71966, util=100.00%

-
With patches applied:

 fast_256GB $ fio ./experimental-fio.conf
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1738: Thu Nov 29 17:20:07 2018
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1360msec)
slat (nsec): min=1381, max=46469, avg=1649.48, stdev=594.46
clat (usec): min=162, max=12247, avg=330.00, stdev=185.55
 lat (usec): min=165, max=12253, avg=331.68, stdev=185.69
clat percentiles (usec):
 |  1.00th=[  322],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
 | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
 | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
 | 99.00th=[  379], 99.50th=[  594], 99.90th=[  603], 99.95th=[  611],
 | 99.99th=[12125]
   bw (  KiB/s): min=751640, max=782912, per=99.52%, avg=767276.00, 
stdev=22112.64, samples=2
   iops: min=187910, max=195728, avg=191819.00, s

[PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-03 Thread john . hubbard
From: John Hubbard 

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

Also introduces put_user_pages(), and a few dirty/locked variations,
as a replacement for release_pages(), and also as a replacement
for open-coded loops that release multiple pages.
These may be used for subsequent performance improvements,
via batching of pages to be released.

This is the first step of fixing the problem described in [1]. The steps
are:

1) (This patch): provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara 

Cc: Matthew Wilcox 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jan Kara 
Cc: Al Viro 
Cc: Jerome Glisse 
Cc: Christoph Hellwig 
Cc: Ralph Campbell 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 20 
 mm/swap.c  | 80 ++
 2 files changed, 100 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..09fbb2c81aba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
+/*
+ * put_user_page() - release a page that had previously been acquired via
+ * a call to one of the get_user_pages*() functions.
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+static inline void put_user_page(struct page *page)
+{
+   put_page(page);
+}
+
+void put_user_pages_dirty(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages(struct page **pages, unsigned long npages);
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/swap.c b/mm/swap.c
index aa483719922e..bb8c32595e5f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
 }
 EXPORT_SYMBOL(put_pages_list);
 
+typedef int (*set_dirty_func)(struct page *page);
+
+static void __put_user_pages_dirty(struct page **pages,
+  unsigned long npages,
+  set_dirty_func sdf)
+{
+   unsigned long index;
+
+   for (index = 0; index < npages; index++) {
+   struct page *page = compound_head(pages[index]);
+
+   if (!PageDirty(page))
+   sdf(page);
+
+   put_user_page(page);
+   }
+}
+
+/*
+ * put_user_pages_dirty() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * set_page_dirty(), which does not lock the page, is used here.
+ * Therefore, it is the caller's responsibility to ensure that this is
+ * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages array.
+ *
+ */
+void put_user_pages_dirty(struct page **pages, unsigned long npages)
+{
+   __put_user_pages_dirty(pages, npages, set_page_dirty);
+}
+EXPORT_SYMBOL(put_user_pages_dirty);
+
+/*
+ * put_user_pages_dirty_lock() - for each page in the @pages array, make
+ * that page (or its head page, if a compound page) dirty, if it was
+ * previously listed as clean. Then, release the page using
+ * put_user_page().
+ *
+ * Please see the put_user_page() documentation for details.
+ *
+ * This is just like put_user_pages_dirty(), except that it invokes
+ * set_page_dirty_lock(), instead of set_page_dirty().
+ *
+ * @pages:  array of pages to be marked dirty and released.
+ * @npages: number of pages in the @pages 

[PATCH 2/2] infiniband/mm: convert put_page() to put_user_page*()

2018-12-03 Thread john . hubbard
From: John Hubbard 

For infiniband code that retains pages via get_user_pages*(),
release those pages via the new put_user_page(), or
put_user_pages*(), instead of put_page()

This is a tiny part of the second step of fixing the problem described
in [1]. The steps are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem. Again, [1] provides details as to why that is
   desirable.

[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"

Reviewed-by: Jan Kara 
Reviewed-by: Dennis Dalessandro 
Acked-by: Jason Gunthorpe 

Cc: Doug Ledford 
Cc: Jason Gunthorpe 
Cc: Mike Marciniszyn 
Cc: Dennis Dalessandro 
Cc: Christian Benvenuti 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  7 ---
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ---
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +++---
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ---
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +++---
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 ---
 7 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c6144df47ea4..c2898bc7b3b2 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -58,9 +58,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
 
page = sg_page(sg);
-   if (!PageDirty(page) && umem->writable && dirty)
-   set_page_dirty_lock(page);
-   put_page(page);
+   if (umem->writable && dirty)
+   put_user_pages_dirty_lock(&page, 1);
+   else
+   put_user_page(page);
}
 
sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 676c1fd1119d..99715049cd3b 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -659,7 +659,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
ret = -EFAULT;
break;
}
-   put_page(local_page_list[j]);
+   put_user_page(local_page_list[j]);
continue;
}
 
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index e341e6dcc388..99ccc0483711 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -121,13 +121,10 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, 
unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   size_t i;
-
-   for (i = 0; i < npages; i++) {
-   if (dirty)
-   set_page_dirty_lock(p[i]);
-   put_page(p[i]);
-   }
+   if (dirty)
+   put_user_pages_dirty_lock(p, npages);
+   else
+   put_user_pages(p, npages);
 
if (mm) { /* during close after signal, mm can be NULL */
down_write(&mm->mmap_sem);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index cc9c0c8ccba3..b8b12effd009 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -481,7 +481,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 
ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
if (ret < 0) {
-   put_page(pages[0]);
+   put_user_page(pages[0]);
goto out;
}
 
@@ -489,7 +489,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
 mthca_uarc_virt(dev, uar, i));
if (ret) {
pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, 
PCI_DMA_TODEVICE);
-   put_page(sg_page(&db_tab->page[i].mem));
+   put_user_page(sg_page(&db_tab->p

[PATCH 0/2] put_user_page*(): start converting the call sites

2018-12-03 Thread john . hubbard
From: John Hubbard 

Hi,

Summary: I'd like these two patches to go into the next convenient cycle.
I *think* that means 4.21.

Details

At the Linux Plumbers Conference, we talked about this approach [1], and
the primary lingering concern was over performance. Tom Talpey helped me
through a much more accurate run of the fio performance test, and now
it's looking like an under 1% performance cost, to add and remove pages
from the LRU (this is only paid when dealing with get_user_pages) [2]. So
we should be fine to start converting call sites.

This patchset gets the conversion started. Both patches already had a fair
amount of review.

(Tom, I'll add your Tested-by to the actual implementation that moves
pages on and off the LRU. These first two patches don't do that.)

[1] https://linuxplumbersconf.org/event/2/contributions/126/
"RDMA and get_user_pages"

[2] https://lore.kernel.org/r/79d1ee27-9ea0-3d15-3fc4-97c1bd79c...@talpey.com

John Hubbard (2):
  mm: introduce put_user_page*(), placeholder versions
  infiniband/mm: convert put_page() to put_user_page*()

 drivers/infiniband/core/umem.c  |  7 +-
 drivers/infiniband/core/umem_odp.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c | 11 ++-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  | 11 ++-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  7 +-
 include/linux/mm.h  | 20 ++
 mm/swap.c   | 80 +
 9 files changed, 123 insertions(+), 27 deletions(-)

-- 
2.19.2



Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-07 Thread John Hubbard
On 12/7/18 11:16 AM, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
>> On 12/4/18 5:57 PM, John Hubbard wrote:
>>> On 12/4/18 5:44 PM, Jerome Glisse wrote:
>>>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>>>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>>>>> does this proposal interact with those?
>>>>>>
>>>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there 
>>>>>> an entire
>>>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? 
>>>>>> Said another
>>>>>> way: is it reasonable to disallow calling get_user_pages() on 
>>>>>> ZONE_DEVICE pages?
>>>>>>
>>>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the 
>>>>>> whole 
>>>>>> LRU field approach is unusable.
>>>>>
>>>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>>>>> damage:
>>>>>
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -151,10 +151,12 @@ struct page {
>>>>>  #endif
>>>>> };
>>>>> struct {/* ZONE_DEVICE pages */
>>>>> +   unsigned long _zd_pad_2;/* LRU */
>>>>> +   unsigned long _zd_pad_3;/* LRU */
>>>>> +   unsigned long _zd_pad_1;/* uses mapping */
>>>>> /** @pgmap: Points to the hosting device page 
>>>>> map. */
>>>>> struct dev_pagemap *pgmap;
>>>>> unsigned long hmm_data;
>>>>> -   unsigned long _zd_pad_1;/* uses mapping */
>>>>> };
>>>>>  
>>>>> /** @rcu_head: You can use this to free a page by RCU. */
>>>>>
>>>>> You don't use page->private or page->index, do you Dan?
>>>>
>> page->private and page->index are used by HMM DEVICE pages.
>>>>
>>>
>>> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining 
>>> for 
>>> dma-pinned information. Which might work. To recap, we need:
>>>
>>> -- 1 bit for PageDmaPinned
>>> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
>>> -- N bits for a reference count
>>>
>>> Those *could* be packed into a single 64-bit field, if really necessary.
>>>
>>
>> ...actually, this needs to work on 32-bit systems, as well. And HMM is using 
>> a lot.
>> However, it is still possible for this to work.
>>
>> Matthew, can I have that bit now please? I'm about out of options, and now 
>> it will actually
>> solve the problem here.
>>
>> Given:
>>
>> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on 
>> the LRU.
>> That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) 
>> is required, 
>> for that case. 
>>
>> 2) There is an independent bit available (according to Matthew). 
>>
>> 3) HMM uses 4 of the 5 struct page fields, so only one field is available 
>> for a counter 
>>in that case.
> 
> To expand on this, HMM private pages are used for anonymous pages,
> so the index and mapping fields have the values you expect for
> such pages. Down the road I also want to support file-backed
> pages with HMM private (mapping, private, index).
> 
> For HMM public, both anonymous and file-backed pages are supported
> today (HMM public is only useful on platforms with something like
> OpenCAPI, CCIX or NVlink ... so PowerPC for now).
> 
>> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
> 
> get_user_pages() only needs to work with HMM public pages, not the
> private ones, as we cannot allow _anyone_ to pin an HMM private page.
> So get_user_pages() on HMM private memory triggers a page fault and
> the page is migrated back to regular memory.
> 
> 
>> 5) For a proper atomic counter for both 32- and 64-bit, we really do need a 
>> complete
>> unsigned long field.
>>
>> So that leads to the following approach:
>

Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-06 Thread John Hubbard
On 12/4/18 5:57 PM, John Hubbard wrote:
> On 12/4/18 5:44 PM, Jerome Glisse wrote:
>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>>> does this proposal interact with those?
>>>>
>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an 
>>>> entire
>>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said 
>>>> another
>>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE 
>>>> pages?
>>>>
>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the 
>>>> whole 
>>>> LRU field approach is unusable.
>>>
>>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>>> damage:
>>>
>>> +++ b/include/linux/mm_types.h
>>> @@ -151,10 +151,12 @@ struct page {
>>>  #endif
>>> };
>>> struct {/* ZONE_DEVICE pages */
>>> +   unsigned long _zd_pad_2;/* LRU */
>>> +   unsigned long _zd_pad_3;/* LRU */
>>> +   unsigned long _zd_pad_1;/* uses mapping */
>>> /** @pgmap: Points to the hosting device page map. 
>>> */
>>> struct dev_pagemap *pgmap;
>>> unsigned long hmm_data;
>>> -   unsigned long _zd_pad_1;/* uses mapping */
>>> };
>>>  
>>> /** @rcu_head: You can use this to free a page by RCU. */
>>>
>>> You don't use page->private or page->index, do you Dan?
>>
>> page->private and page->index are used by HMM DEVICE pages.
>>
> 
> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining 
> for 
> dma-pinned information. Which might work. To recap, we need:
> 
> -- 1 bit for PageDmaPinned
> -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
> -- N bits for a reference count
> 
> Those *could* be packed into a single 64-bit field, if really necessary.
> 

...actually, this needs to work on 32-bit systems, as well. And HMM is using a
lot of the struct page fields. However, it is still possible for this to work.

Matthew, can I have that bit now please? I'm about out of options, and now it 
will actually
solve the problem here.

Given:

1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on 
the LRU.
That, in turn, means only 1 bit instead of 2 bits (in addition to a counter) is 
required, 
for that case. 

2) There is an independent bit available (according to Matthew). 

3) HMM uses 4 of the 5 struct page fields, so only one field is available for a 
counter 
   in that case.

4) get_user_pages() must work on ZONE_DEVICE and HMM pages.

5) For a proper atomic counter for both 32- and 64-bit, we really do need a 
complete
unsigned long field.

So that leads to the following approach:

-- Use a single unsigned long field as an atomic reference count: the DMA
pinned count. For normal pages, this will be the *second* field of the LRU
(in order to avoid the PageTail bit).

For ZONE_DEVICE pages, we can also line up the fields so that the second LRU
field is available and reserved for this DMA pinned count. Basically,
_zd_pad_1 gets moved up and is optionally renamed:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 017ab82e36ca..b5dcd9398cae 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -90,8 +90,8 @@ struct page {
 * are in use.
 */
struct {
-   unsigned long dma_pinned_flags;
-   atomic_t  dma_pinned_count;
+   unsigned long dma_pinned_flags; /* LRU.next */
+   atomic_t  dma_pinned_count; /* LRU.prev */
};
};
/* See page-flags.h for PAGE_MAPPING_FLAGS */
@@ -161,9 +161,9 @@ struct page {
};
struct {/* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
-   struct dev_pagemap *pgmap;
-   unsigned long hmm_data;
-   unsigned long _zd_pad_1;    /* uses mapping */
+   struct dev_pagemap *pgmap;  /* LRU.next */
+   unsigned long _zd_pad_1;/* LRU.prev or dma_pinned_count */
+   unsigned long hmm_data; /* uses mapping */
};
 
/** @rcu_head: You can use this to free a page by RCU. */



-- Use an additional, fully independent page bit (from Matthew) for 
PageDmaPinned.

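Putting those two pieces together, the release path could end up looking
roughly like the sketch below. This is only meant to show the intended
pairing; the names (PageDmaPinned, dma_pinned_count) follow the RFC, and none
of this is in the current patch, which still just calls put_page():

    /*
     * Hypothetical sketch, not part of this patch: assumes a PageDmaPinned
     * page flag plus an atomic_t dma_pinned_count overlaid on the second
     * LRU field, as described above. Locking and huge page details omitted.
     */
    static inline void put_user_page(struct page *page)
    {
            if (PageDmaPinned(page)) {
                    /* The last pinner clears the dma-pinned state. */
                    if (atomic_dec_and_test(&page->dma_pinned_count))
                            ClearPageDmaPinned(page);
            }
            put_page(page);
    }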

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread John Hubbard
On 12/4/18 12:28 PM, Dan Williams wrote:
> On Mon, Dec 3, 2018 at 4:17 PM  wrote:
>>
>> From: John Hubbard 
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also introduces put_user_pages(), and a few dirty/locked variations,
>> as a replacement for release_pages(), and also as a replacement
>> for open-coded loops that release multiple pages.
>> These may be used for subsequent performance improvements,
>> via batching of pages to be released.
>>
>> This is the first step of fixing the problem described in [1]. The steps
>> are:
>>
>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>for releasing pages that were pinned via get_user_pages*().
>>
>> 2) Convert all of the call sites for get_user_pages*(), to
>>invoke put_user_page*(), instead of put_page(). This involves dozens of
>>call sites, and will take some time.
>>
>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>implement tracking of these pages. This tracking will be separate from
>>the existing struct page refcounting.
>>
>> 4) Use the tracking and identification of these pages, to implement
>>special handling (especially in writeback paths) when the pages are
>>backed by a filesystem. Again, [1] provides details as to why that is
>>desirable.
> 
> I thought at Plumbers we talked about using a page bit to tag pages
> that have had their reference count elevated by get_user_pages()? That
> way there is no need to distinguish put_page() from put_user_page() it
> just happens internally to put_page(). At the conference Matthew was
> offering to free up a page bit for this purpose.
> 

...but then, upon further discussion in that same session, we realized that
that doesn't help. You need a reference count. Otherwise a random put_page
could affect your dma-pinned pages, etc, etc.
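
To spell out the problem with a bit-only scheme (an illustration, not code
from any patch):

    /*
     * Why a single tag bit is not enough:
     *
     *    get_user_pages(..., &page);   // pin A: sets the "pinned" bit
     *    get_user_pages(..., &page);   // pin B: bit is already set
     *    put_user_page(page);          // pin A drops: clears the bit...
     *
     * ...but pin B still has the page set up for DMA, and anything that
     * checks the bit (writeback, for example) now believes the page is no
     * longer pinned. A per-page reference count avoids exactly that.
     */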

I was not able to actually find any place where a single additional page
bit would help our situation, which is why this still uses LRU fields for
both the two bits required (the RFC [1] still applies), and the 
dma_pinned_count.


[1] https://lore.kernel.org/r/20181110085041.10071-7-jhubb...@nvidia.com



>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> Reviewed-by: Jan Kara 
> 
> Wish, you could have been there Jan. I'm missing why it's safe to
> assume that a single put_user_page() is paired with a get_user_page()?
> 

A put_user_page() per page, or a put_user_pages() for an array of pages. See
patch 0002 for several examples.
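
In other words, the conversions are mechanical. A made-up before/after pair
(the real call sites are in patch 0002; this driver function is hypothetical):

    /* Before: open-coded release loop in a driver */
    static void example_unpin_old(struct page **pages, unsigned long npages,
                                  bool dirty)
    {
            unsigned long i;

            for (i = 0; i < npages; i++) {
                    if (dirty)
                            set_page_dirty_lock(pages[i]);
                    put_page(pages[i]);
            }
    }

    /* After: the loop collapses into the new helpers */
    static void example_unpin_new(struct page **pages, unsigned long npages,
                                  bool dirty)
    {
            if (dirty)
                    put_user_pages_dirty_lock(pages, npages);
            else
                    put_user_pages(pages, npages);
    }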

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread John Hubbard
On 12/4/18 3:03 PM, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard  wrote:
>>
>> On 12/4/18 12:28 PM, Dan Williams wrote:
>>> On Mon, Dec 3, 2018 at 4:17 PM  wrote:
>>>>
>>>> From: John Hubbard 
>>>>
>>>> Introduces put_user_page(), which simply calls put_page().
>>>> This provides a way to update all get_user_pages*() callers,
>>>> so that they call put_user_page(), instead of put_page().
>>>>
>>>> Also introduces put_user_pages(), and a few dirty/locked variations,
>>>> as a replacement for release_pages(), and also as a replacement
>>>> for open-coded loops that release multiple pages.
>>>> These may be used for subsequent performance improvements,
>>>> via batching of pages to be released.
>>>>
>>>> This is the first step of fixing the problem described in [1]. The steps
>>>> are:
>>>>
>>>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>>>for releasing pages that were pinned via get_user_pages*().
>>>>
>>>> 2) Convert all of the call sites for get_user_pages*(), to
>>>>invoke put_user_page*(), instead of put_page(). This involves dozens of
>>>>call sites, and will take some time.
>>>>
>>>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>>>implement tracking of these pages. This tracking will be separate from
>>>>the existing struct page refcounting.
>>>>
>>>> 4) Use the tracking and identification of these pages, to implement
>>>>special handling (especially in writeback paths) when the pages are
>>>>backed by a filesystem. Again, [1] provides details as to why that is
>>>>desirable.
>>>
>>> I thought at Plumbers we talked about using a page bit to tag pages
>>> that have had their reference count elevated by get_user_pages()? That
>>> way there is no need to distinguish put_page() from put_user_page() it
>>> just happens internally to put_page(). At the conference Matthew was
>>> offering to free up a page bit for this purpose.
>>>
>>
>> ...but then, upon further discussion in that same session, we realized that
>> that doesn't help. You need a reference count. Otherwise a random put_page
>> could affect your dma-pinned pages, etc, etc.
> 
> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
> the end of the page pin event separate from the final 'put' of the
> page? Makes sense.
> 

Yes, that's it exactly.

>> I was not able to actually find any place where a single additional page
>> bit would help our situation, which is why this still uses LRU fields for
>> both the two bits required (the RFC [1] still applies), and the 
>> dma_pinned_count.
> 
> Except the LRU fields are already in use for ZONE_DEVICE pages... how
> does this proposal interact with those?

Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an 
entire
use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said 
another
way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE pages?

If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
LRU field approach is unusable.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread John Hubbard
On 12/4/18 4:40 PM, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 4:37 PM Jerome Glisse  wrote:
>>
>> On Tue, Dec 04, 2018 at 03:03:02PM -0800, Dan Williams wrote:
>>> On Tue, Dec 4, 2018 at 1:56 PM John Hubbard  wrote:
>>>>
>>>> On 12/4/18 12:28 PM, Dan Williams wrote:
>>>>> On Mon, Dec 3, 2018 at 4:17 PM  wrote:
>>>>>>
>>>>>> From: John Hubbard 
>>>>>>
>>>>>> Introduces put_user_page(), which simply calls put_page().
>>>>>> This provides a way to update all get_user_pages*() callers,
>>>>>> so that they call put_user_page(), instead of put_page().
>>>>>>
>>>>>> Also introduces put_user_pages(), and a few dirty/locked variations,
>>>>>> as a replacement for release_pages(), and also as a replacement
>>>>>> for open-coded loops that release multiple pages.
>>>>>> These may be used for subsequent performance improvements,
>>>>>> via batching of pages to be released.
>>>>>>
>>>>>> This is the first step of fixing the problem described in [1]. The steps
>>>>>> are:
>>>>>>
>>>>>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>>>>>for releasing pages that were pinned via get_user_pages*().
>>>>>>
>>>>>> 2) Convert all of the call sites for get_user_pages*(), to
>>>>>>invoke put_user_page*(), instead of put_page(). This involves dozens 
>>>>>> of
>>>>>>call sites, and will take some time.
>>>>>>
>>>>>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>>>>>implement tracking of these pages. This tracking will be separate from
>>>>>>the existing struct page refcounting.
>>>>>>
>>>>>> 4) Use the tracking and identification of these pages, to implement
>>>>>>special handling (especially in writeback paths) when the pages are
>>>>>>backed by a filesystem. Again, [1] provides details as to why that is
>>>>>>desirable.
>>>>>
>>>>> I thought at Plumbers we talked about using a page bit to tag pages
>>>>> that have had their reference count elevated by get_user_pages()? That
>>>>> way there is no need to distinguish put_page() from put_user_page() it
>>>>> just happens internally to put_page(). At the conference Matthew was
>>>>> offering to free up a page bit for this purpose.
>>>>>
>>>>
>>>> ...but then, upon further discussion in that same session, we realized that
>>>> that doesn't help. You need a reference count. Otherwise a random put_page
>>>> could affect your dma-pinned pages, etc, etc.
>>>
>>> Ok, sorry, I mis-remembered. So, you're effectively trying to capture
>>> the end of the page pin event separate from the final 'put' of the
>>> page? Makes sense.
>>>
>>>> I was not able to actually find any place where a single additional page
>>>> bit would help our situation, which is why this still uses LRU fields for
>>>> both the two bits required (the RFC [1] still applies), and the 
>>>> dma_pinned_count.
>>>
>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>> does this proposal interact with those?
>>>
>>>> [1] https://lore.kernel.org/r/20181110085041.10071-7-jhubb...@nvidia.com
>>>>
>>>>>> [1] https://lwn.net/Articles/753027/ : "The Trouble with 
>>>>>> get_user_pages()"
>>>>>>
>>>>>> Reviewed-by: Jan Kara 
>>>>>
>>>>> Wish, you could have been there Jan. I'm missing why it's safe to
>>>>> assume that a single put_user_page() is paired with a get_user_page()?
>>>>>
>>>>
>>>> A put_user_page() per page, or a put_user_pages() for an array of pages. 
>>>> See
>>>> patch 0002 for several examples.
>>>
>>> Yes, however I was more concerned about validation and trying to
>>> locate missed places where put_page() is used instead of
>>> put_user_page().
>>>
>>> It would be interesting to see if we could have a debug mode where
>>> get_user_pages() returned dynamically allocated pages from a known
>>> address range and ca

Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread John Hubbard
On 12/3/18 11:53 PM, Mike Rapoport wrote:
> Hi John,
> 
> Thanks for having documentation as a part of the patch. Some kernel-doc
> nits below.
> 
> On Mon, Dec 03, 2018 at 04:17:19PM -0800, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
>>
>> Introduces put_user_page(), which simply calls put_page().
>> This provides a way to update all get_user_pages*() callers,
>> so that they call put_user_page(), instead of put_page().
>>
>> Also introduces put_user_pages(), and a few dirty/locked variations,
>> as a replacement for release_pages(), and also as a replacement
>> for open-coded loops that release multiple pages.
>> These may be used for subsequent performance improvements,
>> via batching of pages to be released.
>>
>> This is the first step of fixing the problem described in [1]. The steps
>> are:
>>
>> 1) (This patch): provide put_user_page*() routines, intended to be used
>>for releasing pages that were pinned via get_user_pages*().
>>
>> 2) Convert all of the call sites for get_user_pages*(), to
>>invoke put_user_page*(), instead of put_page(). This involves dozens of
>>call sites, and will take some time.
>>
>> 3) After (2) is complete, use get_user_pages*() and put_user_page*() to
>>implement tracking of these pages. This tracking will be separate from
>>the existing struct page refcounting.
>>
>> 4) Use the tracking and identification of these pages, to implement
>>special handling (especially in writeback paths) when the pages are
>>backed by a filesystem. Again, [1] provides details as to why that is
>>desirable.
>>
>> [1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
>>
>> Reviewed-by: Jan Kara 
>>
>> Cc: Matthew Wilcox 
>> Cc: Michal Hocko 
>> Cc: Christopher Lameter 
>> Cc: Jason Gunthorpe 
>> Cc: Dan Williams 
>> Cc: Jan Kara 
>> Cc: Al Viro 
>> Cc: Jerome Glisse 
>> Cc: Christoph Hellwig 
>> Cc: Ralph Campbell 
>> Signed-off-by: John Hubbard 
>> ---
>>  include/linux/mm.h | 20 
>>  mm/swap.c  | 80 ++
>>  2 files changed, 100 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5411de93a363..09fbb2c81aba 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -963,6 +963,26 @@ static inline void put_page(struct page *page)
>>  __put_page(page);
>>  }
>>
>> +/*
>> + * put_user_page() - release a page that had previously been acquired via
>> + * a call to one of the get_user_pages*() functions.
> 
> Please add @page parameter description, otherwise kernel-doc is unhappy

Hi Mike,

Sorry I missed these kerneldoc points from your earlier review! I'll fix it
up now and it will show up in the next posting.

> 
>> + *
>> + * Pages that were pinned via get_user_pages*() must be released via
>> + * either put_user_page(), or one of the put_user_pages*() routines
>> + * below. This is so that eventually, pages that are pinned via
>> + * get_user_pages*() can be separately tracked and uniquely handled. In
>> + * particular, interactions with RDMA and filesystems need special
>> + * handling.
>> + */
>> +static inline void put_user_page(struct page *page)
>> +{
>> +put_page(page);
>> +}
>> +
>> +void put_user_pages_dirty(struct page **pages, unsigned long npages);
>> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
>> +void put_user_pages(struct page **pages, unsigned long npages);
>> +
>>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>  #define SECTION_IN_PAGE_FLAGS
>>  #endif
>> diff --git a/mm/swap.c b/mm/swap.c
>> index aa483719922e..bb8c32595e5f 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -133,6 +133,86 @@ void put_pages_list(struct list_head *pages)
>>  }
>>  EXPORT_SYMBOL(put_pages_list);
>>
>> +typedef int (*set_dirty_func)(struct page *page);
>> +
>> +static void __put_user_pages_dirty(struct page **pages,
>> +   unsigned long npages,
>> +   set_dirty_func sdf)
>> +{
>> +unsigned long index;
>> +
>> +for (index = 0; index < npages; index++) {
>> +struct page *page = compound_head(pages[index]);
>> +
>> +if (!PageDirty(page))
>> +sdf(page);
>> +
>> +put_user_page(page);

Re: [PATCH 0/2] put_user_page*(): start converting the call sites

2018-12-04 Thread John Hubbard
On 12/4/18 9:10 AM, David Laight wrote:
> From: john.hubb...@gmail.com
>> Sent: 04 December 2018 00:17
>>
>> Summary: I'd like these two patches to go into the next convenient cycle.
>> I *think* that means 4.21.
>>
>> Details
>>
>> At the Linux Plumbers Conference, we talked about this approach [1], and
>> the primary lingering concern was over performance. Tom Talpey helped me
>> through a much more accurate run of the fio performance test, and now
>> it's looking like an under 1% performance cost, to add and remove pages
>> from the LRU (this is only paid when dealing with get_user_pages) [2]. So
>> we should be fine to start converting call sites.
>>
>> This patchset gets the conversion started. Both patches already had a fair
>> amount of review.
> 
> Shouldn't the commit message contain actual details of the change?
> 

Hi David,

This "patch " is not a commit message, as it never shows up in git log.
Each of the follow-up patches does have details about the changes it makes.

But maybe you are really asking for more background information, which I
should have added in this cover letter. Here's a start:

https://lore.kernel.org/r/20181110085041.10071-1-jhubb...@nvidia.com

...and it looks like this small patch series is not going to work out--I'm
going to have to fall back to another RFC spin. So I'll be sure to include 
you and everyone on that. Hope that helps.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-04 Thread John Hubbard
On 12/4/18 5:44 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>> does this proposal interact with those?
>>>
>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an 
>>> entire
>>> use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages? Said 
>>> another
>>> way: is it reasonable to disallow calling get_user_pages() on ZONE_DEVICE 
>>> pages?
>>>
>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole 
>>> LRU field approach is unusable.
>>
>> We just need to rearrange ZONE_DEVICE pages.  Please excuse the whitespace
>> damage:
>>
>> +++ b/include/linux/mm_types.h
>> @@ -151,10 +151,12 @@ struct page {
>>  #endif
>> };
>> struct {/* ZONE_DEVICE pages */
>> +   unsigned long _zd_pad_2;/* LRU */
>> +   unsigned long _zd_pad_3;/* LRU */
>> +   unsigned long _zd_pad_1;/* uses mapping */
>> /** @pgmap: Points to the hosting device page map. */
>> struct dev_pagemap *pgmap;
>> unsigned long hmm_data;
>> -   unsigned long _zd_pad_1;/* uses mapping */
>> };
>>  
>> /** @rcu_head: You can use this to free a page by RCU. */
>>
>> You don't use page->private or page->index, do you Dan?
> 
> page->private and page->index are used by HMM DEVICE pages.
> 

OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining for 
dma-pinned information. Which might work. To recap, we need:

-- 1 bit for PageDmaPinned
-- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru.
-- N bits for a reference count

Those *could* be packed into a single 64-bit field, if really necessary.
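
For example (purely illustrative packing, made-up names, not in any patch):

    /* One 64-bit word: two flag bits on top, a 62-bit pin count below. */
    #define PAGE_DMA_PINNED             (1ULL << 63)
    #define PAGE_DMA_PINNED_WAS_LRU     (1ULL << 62)
    #define PAGE_DMA_PINNED_COUNT_MASK  (PAGE_DMA_PINNED_WAS_LRU - 1)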


thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-21 Thread John Hubbard
On 11/21/18 8:49 AM, Tom Talpey wrote:
> On 11/21/2018 1:09 AM, John Hubbard wrote:
>> On 11/19/18 10:57 AM, Tom Talpey wrote:
>>> ~14000 4KB read IOPS is really, really low for an NVMe disk.
>>
>> Yes, but Jan Kara's original config file for fio is *intended* to highlight
>> the get_user_pages/put_user_pages changes. It was *not* intended to get max
>> performance,  as you can see by the numjobs and direct IO parameters:
>>
>> cat fio.conf
>> [reader]
>> direct=1
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> numjobs=1
>> rw=read
>> iodepth=64
> 
> To be clear - I used those identical parameters, on my lower-spec
> machine, and got 400,000 4KB read IOPS. Those results are nearly 30x
> higher than yours!

OK, then something really is wrong here...

> 
>> So I'm thinking that this is not a "tainted" test, but rather, we're 
>> constraining
>> things a lot with these choices. It's hard to find a good test config to run 
>> that
>> allows decisions, but so far, I'm not really seeing anything that says "this
>> is so bad that we can't afford to fix the brokenness." I think.
> 
> I'm not suggesting we tune the benchmark, I'm suggesting the results
> on your system are not meaningful since they are orders of magnitude
> low. And without meaningful data it's impossible to see the performance
> impact of the change...
> 
>>> Can you confirm what type of hardware you're running this test on?
>>> CPU, memory speed and capacity, and NVMe device especially?
>>>
>>> Tom.
>>
>> Yes, it's a nice new system, I don't expect any strange perf problems:
>>
>> CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
>>  (Intel X299 chipset)
>> Block device: nvme-Samsung_SSD_970_EVO_250GB
>> DRAM: 32 GB
> 
> The Samsung Evo 970 250GB is speced to yield 200,000 random read IOPS
> with a 4KB QD32 workload:
> 
> 
> https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-evo-nvme-m-2-250gb-mz-v7e250bw/#specs
> 
> And the I7-7800X is a 6-core processor (12 hyperthreads).
> 
>> So, here's a comparison using 20 threads, direct IO, for the baseline vs.
>> patched kernel (below). Highlights:
>>
>> -- IOPS are similar, around 60k.
>> -- BW gets worse, dropping from 290 to 220 MB/s.
>> -- CPU is well under 100%.
>> -- latency is incredibly long, but...20 threads.
>>
>> Baseline:
>>
>> $ ./run.sh
>> fio configuration:
>> [reader]
>> ioengine=libaio
>> blocksize=4096
>> size=1g
>> rw=read
>> group_reporting
>> iodepth=256
>> direct=1
>> numjobs=20
> 
> Ouch - 20 threads issuing 256 io's each!? Of course latency skyrockets.
> That's going to cause tremendous queuing, and context switching, far
> outside of the get_user_pages() change.
> 
> But even so, it only brings IOPS to 74.2K, which is still far short of
> the device's 200K spec.
> 
> Comparing anyway:
> 
> 
>> Patched:
>>
>>  Running fio:
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=256
>> ...
>> fio-3.3
>> Starting 20 processes
>> Jobs: 13 (f=8): 
>> [_(1),R(1),_(1),f(1),R(2),_(1),f(2),_(1),R(1),f(1),R(1),f(1),R(1),_(2),R(1),_(1),R(1)][97.9%][r=229MiB/s,w=0KiB/s][r=58.5k,w=0
>>  IOPS][eta 00m:02s]
>> reader: (groupid=0, jobs=20): err= 0: pid=2104: Tue Nov 20 22:01:58 2018
>>     read: IOPS=56.8k, BW=222MiB/s (232MB/s)(20.0GiB/92385msec)
>> ...
>> Thoughts?
> 
> Concern - the 74.2K IOPS unpatched drops to 56.8K patched!

ACK. :)

> 
> What I'd really like to see is to go back to the original fio parameters
> (1 thread, 64 iodepth) and try to get a result that gets at least close
> to the speced 200K IOPS of the NVMe device. There seems to be something
> wrong with yours, currently.

I'll dig into what has gone wrong with the test. I see fio putting data files
in the right place, so the obvious "using the wrong drive" is (probably)
not it. Even though it really feels like that sort of thing. We'll see. 

> 
> Then of course, the result with the patched get_user_pages, and
> compare whichever of IOPS or CPU% changes, and how much.
> 
> If these are within a few percent, I agree it's good to go. If it's
> roughly 25% like the result just above, that's a rocky road.
> 
> I can try this after the holiday on some basic hardware and might
> be able to scrounge up better. Can you post that github link?
> 

Here:

   git@github.com:johnhubbard/linux (branch: gup_dma_testing)


-- 
thanks,
John Hubbard
NVIDIA


[PATCH 0/1] mm/gup: finish consolidating error handling

2018-11-21 Thread john . hubbard
From: John Hubbard 

Hi,

Keith Busch and Dan Williams noticed that this patch
(which was part of my RFC[1] for the get_user_pages + DMA
fix) also fixes a bug. Accordingly, I'm adjusting
the changelog and posting this as its own patch.

[1] https://lkml.kernel.org/r/20181110085041.10071-1-jhubb...@nvidia.com

John Hubbard (1):
  mm/gup: finish consolidating error handling

 mm/gup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

-- 
2.19.1



[PATCH] mm/gup: finish consolidating error handling

2018-11-21 Thread john . hubbard
From: John Hubbard 

Commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while pinning pages")
attempted to operate on each page that get_user_pages had retrieved. In
order to do that, it created a common exit point from the routine.
However, one case was missed, which this patch fixes up.

Also, there was still an unnecessary shadow declaration (with a
different type) of the "ret" variable, which this patch removes.

Fixes: df06b37ffe5a4 ("mm/gup: cache dev_pagemap while pinning pages")

Reviewed-by: Keith Busch 
Cc: Dan Williams 
Cc: Kirill A. Shutemov 
Cc: Dave Hansen 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index aa43620a3270..8cb68a50dbdf 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -702,12 +702,11 @@ static long __get_user_pages(struct task_struct *tsk, 
struct mm_struct *mm,
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
-   int ret;
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
-   return i ? : ret;
+   goto out;
ctx.page_mask = 0;
goto next_page;
}
-- 
2.19.1



Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-20 Thread John Hubbard
On 11/19/18 10:57 AM, Tom Talpey wrote:
> John, thanks for the discussion at LPC. One of the concerns we
> raised however was the performance test. The numbers below are
> rather obviously tainted. I think we need to get a better baseline
> before concluding anything...
> 
> Here's my main concern:
> 

Hi Tom,

Thanks again for looking at this!


> On 11/10/2018 3:50 AM, john.hubb...@gmail.com wrote:
>> From: John Hubbard 
>> ...
>> --
>> WITHOUT the patch:
>> --
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=55.5MiB/s,w=0KiB/s][r=14.2k,w=0 IOPS][eta 
>> 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=1750: Tue Nov  6 20:18:06 2018
>>     read: IOPS=13.9k, BW=54.4MiB/s (57.0MB/s)(1024MiB/18826msec)
> 
> ~14000 4KB read IOPS is really, really low for an NVMe disk.

Yes, but Jan Kara's original config file for fio is *intended* to highlight
the get_user_pages/put_user_pages changes. It was *not* intended to get max
performance,  as you can see by the numjobs and direct IO parameters:

cat fio.conf 
[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64


So I'm thinking that this is not a "tainted" test, but rather, we're 
constraining
things a lot with these choices. It's hard to find a good test config to run 
that
allows decisions, but so far, I'm not really seeing anything that says "this
is so bad that we can't afford to fix the brokenness." I think.

After talking with you and reading this email, I did a bunch more test runs, 
varying the following fio parameters:

-- direct
-- numjobs
-- iodepth

...with both the baseline 4.20-rc3 kernel, and with my patches applied. (btw, if
anyone cares, I'll post a github link that has a complete, testable 
patchset--not
ready for submission as such, but it works cleanly and will allow others to 
attempt to reproduce my results).

What I'm seeing is that I can get 10x or better improvements in IOPS and BW,
just by going to 10 threads and turning off direct IO--as expected. So in the 
end,
I increased the number of threads, and also increased iodepth a bit. 


Test results below...


> 
>>    cpu  : usr=2.39%, sys=95.30%, ctx=669, majf=0, minf=72
> 
> CPU is obviously the limiting factor. At these IOPS, it should be far
> less.
>> --
>> OR, here's a better run WITH the patch applied, and you can see that this is 
>> nearly as good
>> as the "without" case:
>> --
>>
>> reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 
>> 4096B-4096B, ioengine=libaio, iodepth=64
>> fio-3.3
>> Starting 1 process
>> Jobs: 1 (f=1): [R(1)][100.0%][r=53.2MiB/s,w=0KiB/s][r=13.6k,w=0 IOPS][eta 
>> 00m:00s]
>> reader: (groupid=0, jobs=1): err= 0: pid=2521: Tue Nov  6 20:01:33 2018
>>     read: IOPS=13.4k, BW=52.5MiB/s (55.1MB/s)(1024MiB/19499msec)
> 
> Similar low IOPS.
> 
>>    cpu  : usr=3.47%, sys=94.61%, ctx=370, majf=0, minf=73
> 
> Similar CPU saturation.
> 
>>
> 
> I get nearly 400,000 4KB IOPS on my tiny desktop, which has a 25W
> i7-7500 and a Samsung PM961 128GB NVMe (stock Bionic 4.15 kernel
> and fio version 3.1). Even then, the CPU saturates, so it's not
> necessarily a perfect test. I'd like to see your runs both get to
> "max" IOPS, i.e. CPU < 100%, and compare the CPU numbers. This would
> give the best comparison for making a decision.

I can get to CPU < 100% by increasing to 10 or 20 threads, although it
makes latency ever so much worse.

> 
> Can you confirm what type of hardware you're running this test on?
> CPU, memory speed and capacity, and NVMe device especially?
> 
> Tom.

Yes, it's a nice new system, I don't expect any strange perf problems:

CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
(Intel X299 chipset)
Block device: nvme-Samsung_SSD_970_EVO_250GB
DRAM: 32 GB

So, here's a comparison using 20 threads, direct IO, for the baseline vs. 
patched kernel (below). Highlights:

-- IOPS are similar, around 60k. 
-- BW gets worse, dropping from 290 to 220 MB/s.
-- CPU is well under 100%.
-- latency is incredibly long, but...20 threads.

Baseline:

$ ./run.sh
fio configuration:
[reader]
ioengine=libaio
blocksize=4096
size=1g
rw=read
group_reporting
iodepth=256
direct=1
numjobs=20
 Running fio:
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, 
ioengine=liba

Re: [PATCH v2 0/6] RFC: gup+dma: tracking dma-pinned pages

2018-11-27 Thread John Hubbard
On 11/27/18 5:21 PM, Tom Talpey wrote:
> On 11/21/2018 5:06 PM, John Hubbard wrote:
>> On 11/21/18 8:49 AM, Tom Talpey wrote:
>>> On 11/21/2018 1:09 AM, John Hubbard wrote:
>>>> On 11/19/18 10:57 AM, Tom Talpey wrote:
[...]
>>>
>>> What I'd really like to see is to go back to the original fio parameters
>>> (1 thread, 64 iodepth) and try to get a result that gets at least close
>>> to the speced 200K IOPS of the NVMe device. There seems to be something
>>> wrong with yours, currently.
>>
>> I'll dig into what has gone wrong with the test. I see fio putting data files
>> in the right place, so the obvious "using the wrong drive" is (probably)
>> not it. Even though it really feels like that sort of thing. We'll see.
>>
>>>
>>> Then of course, the result with the patched get_user_pages, and
>>> compare whichever of IOPS or CPU% changes, and how much.
>>>
>>> If these are within a few percent, I agree it's good to go. If it's
>>> roughly 25% like the result just above, that's a rocky road.
>>>
>>> I can try this after the holiday on some basic hardware and might
>>> be able to scrounge up better. Can you post that github link?
>>>
>>
>> Here:
>>
>>     git@github.com:johnhubbard/linux (branch: gup_dma_testing)
> 
> I'm super-limited here this week hardware-wise and have not been able
> to try testing with the patched kernel.
> 
> I was able to compare my earlier quick test with a Bionic 4.15 kernel
> (400K IOPS) against a similar 4.20rc3 kernel, and the rate dropped to
> ~_375K_ IOPS. Which I found perhaps troubling. But it was only a quick
> test, and without your change.
> 

So just to double check (again): you are running fio with these parameters,
right?

[reader]
direct=1
ioengine=libaio
blocksize=4096
size=1g
numjobs=1
rw=read
iodepth=64



> Say, that branch reports it has not had a commit since June 30. Is that
> the right one? What about gup_dma_for_lpc_2018?
> 

That's the right branch, but the AuthorDate for the head commit (only) somehow
got stuck in the past. I just now amended that patch with a new date and pushed 
it, so the head commit now shows Nov 27:

   https://github.com/johnhubbard/linux/commits/gup_dma_testing


The actual code is the same, though. (It is still based on Nov 19th's 
f2ce1065e767
commit.)


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH] mm/gup: finish consolidating error handling

2018-11-21 Thread John Hubbard
On 11/21/18 2:44 PM, Andrew Morton wrote:
> On Wed, 21 Nov 2018 00:14:02 -0800 john.hubb...@gmail.com wrote:
> 
>> Commit df06b37ffe5a4 ("mm/gup: cache dev_pagemap while pinning pages")
>> attempted to operate on each page that get_user_pages had retrieved. In
>> order to do that, it created a common exit point from the routine.
>> However, one case was missed, which this patch fixes up.
>>
>> Also, there was still an unnecessary shadow declaration (with a
>> different type) of the "ret" variable, which this patch removes.
>>
> 
> What is the bug which this supposedly fixes and what is that bug's
> user-visible impact?
> 

Keith's description of the situation is:

  This also fixes a potentially leaked dev_pagemap reference count if a
  failure occurs when an iteration crosses a vma boundary. I don't think
  it's normal to have different vma's on a users mapped zone device memory,
  but good to fix anyway.

I actually thought that this code:

/* first iteration or cross vma bound */
if (!vma || start >= vma->vm_end) {
vma = find_extend_vma(mm, start);
if (!vma && in_gate_area(mm, start)) {
ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
goto out;

...dealt with the "you're trying to pin the gate page, as part of this call",
rather than the generic case of crossing a vma boundary. (I think there's a fine
point that I must be overlooking.) But it's still a valid case, either way.

-- 
thanks,
John Hubbard
NVIDIA


Re: [PATCH 4/6] mmu_notifier: pass through vma to invalidate_range and invalidate_page

2014-06-29 Thread John Hubbard
date_range_start(vma->vm_mm, mmun_start,
> + mmu_notifier_invalidate_range_start(vma->vm_mm, vma, mmun_start,
>   mmun_end, MMU_MIGRATE);
>  
>   for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
> @@ -229,7 +229,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>   if (likely(need_flush))
>   flush_tlb_range(vma, old_end-len, old_addr);
>  
> - mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start,
> + mmu_notifier_invalidate_range_end(vma->vm_mm, vma, mmun_start,
> mmun_end, MMU_MIGRATE);
>  
>   return len + old_addr - old_end;/* how much done */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bd7e6d7..f1be50d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -840,7 +840,7 @@ static int page_mkclean_one(struct page *page, struct 
> vm_area_struct *vma,
>   pte_unmap_unlock(pte, ptl);
>  
>   if (ret) {
> - mmu_notifier_invalidate_page(mm, address, MMU_WB);
> + mmu_notifier_invalidate_page(mm, vma, address, MMU_WB);
>   (*cleaned)++;
>   }
>  out:
> @@ -1237,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct 
> vm_area_struct *vma,
>  out_unmap:
>   pte_unmap_unlock(pte, ptl);
>   if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> - mmu_notifier_invalidate_page(mm, address, event);
> + mmu_notifier_invalidate_page(mm, vma, address, event);
>  out:
>   return ret;
>  
> @@ -1325,7 +1325,8 @@ static int try_to_unmap_cluster(unsigned long cursor, 
> unsigned int *mapcount,
>  
>   mmun_start = address;
>   mmun_end   = end;
> - mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end, event);
> + mmu_notifier_invalidate_range_start(mm, vma, mmun_start,
> + mmun_end, event);
>  
>   /*
>* If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
> @@ -1390,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, 
> unsigned int *mapcount,
>   (*mapcount)--;
>   }
>   pte_unmap_unlock(pte - 1, ptl);
> - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end, event);
> + mmu_notifier_invalidate_range_end(mm, vma, mmun_start, mmun_end, event);
>   if (locked_vma)
>   up_read(&vma->vm_mm->mmap_sem);
>   return ret;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6e1992f..c4b7bf9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -262,6 +262,7 @@ static inline struct kvm *mmu_notifier_to_kvm(struct 
> mmu_notifier *mn)
>  
>  static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
>struct mm_struct *mm,
> +  struct vm_area_struct *vma,
>unsigned long address,
>enum mmu_event event)
>  {
> @@ -318,6 +319,7 @@ static void kvm_mmu_notifier_change_pte(struct 
> mmu_notifier *mn,
>  
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   struct mm_struct *mm,
> + struct vm_area_struct *vma,
>   unsigned long start,
>   unsigned long end,
>   enum mmu_event event)
> @@ -345,6 +347,7 @@ static void 
> kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> struct mm_struct *mm,
> +   struct vm_area_struct *vma,
> unsigned long start,
> unsigned long end,
> enum mmu_event event)
> -- 
> 1.9.0
> 
> 

Other than the refinements suggested above, I can't seem to find anything 
wrong with this patch, so:

Reviewed-by: John Hubbard 

thanks,
John H.

Re: [PATCH 1/6] mmput: use notifier chain to call subsystem exit handler.

2014-06-29 Thread John Hubbard
 down_write(&mm->mmap_sem);
>   up_write(&mm->mmap_sem);
>   }
> + return 0;
>  }
>  
>  struct page *ksm_might_need_to_copy(struct page *page,
> @@ -2305,11 +2312,20 @@ static struct attribute_group ksm_attr_group = {
>  };
>  #endif /* CONFIG_SYSFS */
>  
> +static struct notifier_block ksm_mmput_nb = {
> + .notifier_call  = ksm_exit,
> + .priority   = 2,
> +};
> +
>  static int __init ksm_init(void)
>  {
>   struct task_struct *ksm_thread;
>   int err;
>  
> + err = mmput_register_notifier(&ksm_mmput_nb);
> + if (err)
> + return err;
> +

In order to be perfectly consistent with this routine's existing code, you 
would want to write:

if (err)
goto out;

...but it does the same thing as your code. It's just a consistency thing.

>   err = ksm_slab_init();
>   if (err)
>   goto out;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 61aec93..b684a21 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2775,6 +2775,9 @@ void exit_mmap(struct mm_struct *mm)
>   struct vm_area_struct *vma;
>   unsigned long nr_accounted = 0;
>  
> + /* Important to call this first. */
> + khugepaged_exit(mm);
> +
>   /* mm's last user has gone, and its about to be pulled down */
>   mmu_notifier_release(mm);
>  
> -- 
> 1.9.0
> 
> 

Above points are extremely minor, so:

Reviewed-by: John Hubbard 

thanks,
John H.

Re: [PATCH 2/6] mm: differentiate unmap for vmscan from other unmap.

2014-06-29 Thread John Hubbard
On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse 
> 
> New code will need to be able to differentiate between a regular unmap and
> an unmap trigger by vmscan in which case we want to be as quick as possible.
> 
> Signed-off-by: Jérôme Glisse 
> ---
>  include/linux/rmap.h | 15 ---
>  mm/memory-failure.c  |  2 +-
>  mm/vmscan.c  |  4 ++--
>  3 files changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index be57450..eddbc07 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -72,13 +72,14 @@ struct anon_vma_chain {
>  };
>  
>  enum ttu_flags {
> - TTU_UNMAP = 1,  /* unmap mode */
> - TTU_MIGRATION = 2,  /* migration mode */
> - TTU_MUNLOCK = 4,/* munlock mode */
> -
> - TTU_IGNORE_MLOCK = (1 << 8),/* ignore mlock */
> - TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
> - TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
> + TTU_VMSCAN = 1, /* unmap for vmscan */
> + TTU_POISON = 2, /* unmap for poison */
> + TTU_MIGRATION = 4,  /* migration mode */
> + TTU_MUNLOCK = 8,/* munlock mode */
> +
> + TTU_IGNORE_MLOCK = (1 << 9),/* ignore mlock */
> + TTU_IGNORE_ACCESS = (1 << 10),  /* don't age */
> + TTU_IGNORE_HWPOISON = (1 << 11),/* corrupted page is recoverable */

Unless there is a deeper purpose that I am overlooking, I think it would 
be better to leave the _MLOCK, _ACCESS, and _HWPOISON at their original 
values. I just can't quite see why they would need to start at bit 9 
instead of bit 8...
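
In other words, keep the new low mode values, but leave the high flag bits
where they were. A sketch of the suggestion (not a patch):

    enum ttu_flags {
            TTU_VMSCAN = 1,                 /* unmap for vmscan */
            TTU_POISON = 2,                 /* unmap for poison */
            TTU_MIGRATION = 4,              /* migration mode */
            TTU_MUNLOCK = 8,                /* munlock mode */

            TTU_IGNORE_MLOCK = (1 << 8),    /* ignore mlock */
            TTU_IGNORE_ACCESS = (1 << 9),   /* don't age */
            TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
    };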

>  };
>  
>  #ifdef CONFIG_MMU
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a7a89eb..ba176c4 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -887,7 +887,7 @@ static int page_action(struct page_state *ps, struct page 
> *p,
>  static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
> int trapno, int flags, struct page **hpagep)
>  {
> - enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> + enum ttu_flags ttu = TTU_POISON | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
>   struct address_space *mapping;
>   LIST_HEAD(tokill);
>   int ret;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d24fd6..5a7d286 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1163,7 +1163,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone 
> *zone,
>   }
>  
>   ret = shrink_page_list(&clean_pages, zone, &sc,
> - TTU_UNMAP|TTU_IGNORE_ACCESS,
> + TTU_VMSCAN|TTU_IGNORE_ACCESS,
>   &dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
>   list_splice(_pages, page_list);
>   mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
> @@ -1518,7 +1518,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct 
> lruvec *lruvec,
>   if (nr_taken == 0)
>   return 0;
>  
> - nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
> + nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_VMSCAN,
>   &nr_dirty, &nr_unqueued_dirty, &nr_congested,
>   &nr_writeback, &nr_immediate,
>   false);
> -- 
> 1.9.0
> 
> 

Other than that, looks good.

Reviewed-by: John Hubbard 

thanks,
John H.

Re: [PATCH 3/6] mmu_notifier: add event information to address invalidation v2

2014-06-29 Thread John Hubbard
On Fri, 27 Jun 2014, Jérôme Glisse wrote:

> From: Jérôme Glisse 
> 
> The event information will be useful for new users of the mmu_notifier API.
> The event argument differentiates between a vma disappearing, a page
> being write protected, or simply a page being unmapped. This allows new
> users to take different paths for different events: for instance, on unmap,
> the resources used to track a vma are still valid and should stay around,
> while if the event says that a vma is being destroyed, it means that any
> resources used to track this vma can be freed.
> 
> Changed since v1:
>   - renamed action into event (updated commit message too).
>   - simplified the event names and clarified their intended usage,
> also documenting what expectations the listener can have with
> respect to each event.
> 
> Signed-off-by: Jérôme Glisse 
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c |   3 +-
>  drivers/iommu/amd_iommu_v2.c|  14 ++--
>  drivers/misc/sgi-gru/grutlbpurge.c  |   9 ++-
>  drivers/xen/gntdev.c|   9 ++-
>  fs/proc/task_mmu.c  |   6 +-
>  include/linux/hugetlb.h |   7 +-
>  include/linux/mmu_notifier.h| 117 
> ++--
>  kernel/events/uprobes.c |  10 ++-
>  mm/filemap_xip.c|   2 +-
>  mm/huge_memory.c|  51 --
>  mm/hugetlb.c|  25 ---
>  mm/ksm.c|  18 +++--
>  mm/memory.c |  27 +---
>  mm/migrate.c|   9 ++-
>  mm/mmu_notifier.c   |  28 +---
>  mm/mprotect.c   |  33 ++---
>  mm/mremap.c |   6 +-
>  mm/rmap.c   |  24 +--
>  virt/kvm/kvm_main.c |  12 ++--
>  19 files changed, 291 insertions(+), 119 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 21ea928..ed6f35e 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -56,7 +56,8 @@ struct i915_mmu_object {
>  static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier 
> *_mn,
>  struct mm_struct *mm,
>  unsigned long start,
> -unsigned long end)
> +unsigned long end,
> +enum mmu_event event)
>  {
>   struct i915_mmu_notifier *mn = container_of(_mn, struct 
> i915_mmu_notifier, mn);
>   struct interval_tree_node *it = NULL;
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 499b436..2bb9771 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -414,21 +414,25 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
>  static void mn_change_pte(struct mmu_notifier *mn,
> struct mm_struct *mm,
> unsigned long address,
> -   pte_t pte)
> +   pte_t pte,
> +   enum mmu_event event)
>  {
>   __mn_flush_page(mn, address);
>  }
>  
>  static void mn_invalidate_page(struct mmu_notifier *mn,
>  struct mm_struct *mm,
> -unsigned long address)
> +unsigned long address,
> +enum mmu_event event)
>  {
>   __mn_flush_page(mn, address);
>  }
>  
>  static void mn_invalidate_range_start(struct mmu_notifier *mn,
> struct mm_struct *mm,
> -   unsigned long start, unsigned long end)
> +   unsigned long start,
> +   unsigned long end,
> +   enum mmu_event event)
>  {
>   struct pasid_state *pasid_state;
>   struct device_state *dev_state;
> @@ -449,7 +453,9 @@ static void mn_invalidate_range_start(struct mmu_notifier 
> *mn,
>  
>  static void mn_invalidate_range_end(struct mmu_notifier *mn,
>   struct mm_struct *mm,
> - unsigned long start, unsigned long end)
> + unsigned long start,
> + unsigned long end,
> + enum mmu_event event)
>  {
>   struct pasid_state *pasid_state;
>   struct device_state *dev_state;
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c 
> b/drivers/misc/sgi-gru/grutlbpurge.c
> index 2129274..e67fed1 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ 

Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v3

2018-03-22 Thread John Hubbard
On 03/21/2018 06:28 PM, jgli...@redhat.com wrote:
> From: Ralph Campbell 
> 
> The hmm_mirror_register() function registers a callback for when
> the CPU pagetable is modified. Normally, the device driver will
> call hmm_mirror_unregister() when the process using the device is
> finished. However, if the process exits uncleanly, the struct_mm
> can be destroyed with no warning to the device driver.
> 
> Changed since v1:
>   - dropped VM_BUG_ON()
>   - cc stable
> Changed since v2:
>   - drop stable
>   - Split list removal and call to driver release callback. This
> allows the release callback to wait on any pending fault handler
> without deadlock.
> 
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 10 ++
>  mm/hmm.c| 29 -
>  2 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 36dd21fe5caf..fa7b51f65905 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,6 +218,16 @@ enum hmm_update_type {
>   * @update: callback to update range on a device
>   */
>  struct hmm_mirror_ops {
> + /* release() - release hmm_mirror
> +  *
> +  * @mirror: pointer to struct hmm_mirror
> +  *
> +  * This is called when the mm_struct is being released.
> +  * The callback should make sure no references to the mirror occur
> +  * after the callback returns.
> +  */
> + void (*release)(struct hmm_mirror *mirror);
> +
>   /* sync_cpu_device_pagetables() - synchronize page tables
>*
>* @mirror: pointer to struct hmm_mirror
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98ff5..34c16297f65e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -160,6 +160,32 @@ static void hmm_invalidate_range(struct hmm *hmm,
>   up_read(&hmm->mirrors_sem);
>  }
>  
> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + struct hmm_mirror *mirror;
> + struct hmm *hmm = mm->hmm;
> +
> + down_write(&hmm->mirrors_sem);
> + mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> +   list);
> + while (mirror) {
> + list_del_init(&mirror->list);
> + if (mirror->ops->release) {
> + /*
> +  * Drop mirrors_sem so callback can wait on any pending
> +  * work that might itself trigger mmu_notifier callback
> +  * and thus would deadlock with us.
> +  */
> + up_write(&hmm->mirrors_sem);
> + mirror->ops->release(mirror);
> + down_write(&hmm->mirrors_sem);
> + }
> + mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> +   list);
> + }
> + up_write(&hmm->mirrors_sem);
> +}
> +

Hi Jerome,

This looks good (and the list handling is way better than my horrible 
copy-the-list idea)!

Reviewed-by: John Hubbard 

thanks,
-- 
John Hubbard
NVIDIA

>  static void hmm_invalidate_range_start(struct mmu_notifier *mn,
>  struct mm_struct *mm,
>  unsigned long start,
> @@ -185,6 +211,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier 
> *mn,
>  }
>  
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
> + .release= hmm_release,
>   .invalidate_range_start = hmm_invalidate_range_start,
>   .invalidate_range_end   = hmm_invalidate_range_end,
>  };
> @@ -230,7 +257,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
>   struct hmm *hmm = mirror->hmm;
>  
>   down_write(&hmm->mirrors_sem);
> - list_del(&mirror->list);
> + list_del_init(&mirror->list);
>   up_write(&hmm->mirrors_sem);
>  }
>  EXPORT_SYMBOL(hmm_mirror_unregister);
> 


Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2

2018-03-22 Thread John Hubbard
On 03/21/2018 04:41 PM, Jerome Glisse wrote:
> On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
>> On 03/21/2018 11:16 AM, jgli...@redhat.com wrote:
>>> From: Jérôme Glisse 
>>>
>>> This code was lost in translation at one point. This properly calls
>>> mmu_notifier_unregister_no_release() once the last user is gone. This
>>> fixes the zombie mm_struct, as without this patch we do not drop the
>>> refcount we have on it.
>>>
>>> Changed since v1:
>>>   - close race window between a last mirror unregistering and a new
>>> mirror registering, which could have lead to use after free()
>>> kind of bug
>>>
>>> Signed-off-by: Jérôme Glisse 
>>> Cc: Evgeny Baskakov 
>>> Cc: Ralph Campbell 
>>> Cc: Mark Hairgrove 
>>> Cc: John Hubbard 
>>> ---
>>>  mm/hmm.c | 35 +--
>>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index 6088fa6ed137..f75aa8df6e97 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -222,13 +222,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, 
>>> struct mm_struct *mm)
>>> if (!mm || !mirror || !mirror->ops)
>>> return -EINVAL;
>>>  
>>> +again:
>>> mirror->hmm = hmm_register(mm);
>>> if (!mirror->hmm)
>>> return -ENOMEM;
>>>  
>>> down_write(&mirror->hmm->mirrors_sem);
>>> -   list_add(&mirror->list, &mirror->hmm->mirrors);
>>> -   up_write(&mirror->hmm->mirrors_sem);
>>> +   if (mirror->hmm->mm == NULL) {
>>> +   /*
>>> +* A racing hmm_mirror_unregister() is about to destroy the hmm
>>> +* struct. Try again to allocate a new one.
>>> +*/
>>> +   up_write(&mirror->hmm->mirrors_sem);
>>> +   mirror->hmm = NULL;
>>
>> This is being set outside of locks, so now there is another race with
>> another hmm_mirror_register...
>>
>> I'll take a moment and draft up what I have in mind here, which is a more
>> symmetrical locking scheme for these routines.
>>
> 
> No this code is correct. hmm->mm is set after hmm struct is allocated
> and before it is public so no one can race with that. It is cleared in
> hmm_mirror_unregister() under the write lock, hence checking it here
> under that same lock is correct.

Are you implying that code that calls hmm_mirror_register() should do 
its own locking, to prevent simultaneous calls to that function? Because
as things are right now, multiple threads can arrive at this point. The
fact that mirror->hmm is not "public" is irrelevant; what matters is that
>1 thread can change it simultaneously.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2

2018-03-22 Thread John Hubbard
On 03/22/2018 04:37 PM, Jerome Glisse wrote:
> On Thu, Mar 22, 2018 at 03:47:16PM -0700, John Hubbard wrote:
>> On 03/21/2018 04:41 PM, Jerome Glisse wrote:
>>> On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
>>>> On 03/21/2018 11:16 AM, jgli...@redhat.com wrote:
>>>>> From: Jérôme Glisse 



>>>
>>> No this code is correct. hmm->mm is set after hmm struct is allocated
>>> and before it is public so no one can race with that. It is cleared in
>>> hmm_mirror_unregister() under the write lock, hence checking it here
>>> under that same lock is correct.
>>
>> Are you implying that code that calls hmm_mirror_register() should do 
>> its own locking, to prevent simultaneous calls to that function? Because
>> as things are right now, multiple threads can arrive at this point. The
>> fact that mirror->hmm is not "public" is irrelevant; what matters is that
>>> 1 thread can change it simultaneously.
> 
> The content of struct hmm_mirror should not be modified by code outside
> HMM after hmm_mirror_register() and before hmm_mirror_unregister(). This
> is a private structure to HMM and the driver should not touch it, ie it
> should be considered as read only/const from driver code point of view.

Yes, that point is clear and obvious.

> 
> It is also expected (which was obvious to me) that driver only call once
> and only once hmm_mirror_register(), and only once hmm_mirror_unregister()
> for any given hmm_mirror struct. Note that driver can register multiple
> _different_ mirror struct to same mm or differents mm.
> 
> There is no need of locking on the driver side whatsoever as long as the
> above rules are respected. I am puzzle if they were not obvious :)

Those rules were not obvious. It's unusual to claim that register and unregister
can run concurrently, but register and register cannot. Let's please document
the rules a bit in the comments.
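
Something like this above hmm_mirror_register() would capture it (a wording
sketch only, based on the rules you describe above):

/*
 * hmm_mirror_register() - register a mirror for an mm
 *
 * A given struct hmm_mirror must be registered once and only once, and
 * later unregistered once and only once; concurrent calls for the same
 * mirror are not allowed. A driver may register multiple different
 * mirror structs against the same or different mm.
 */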

> 
> Note that the above rule means that for any given struct hmm_mirror their
> can only be one and only one call to hmm_mirror_register() happening, no
> concurrent call. If you are doing the latter then something is seriously
> wrong in your design.
> 
> So to be clear on what variable are you claiming race ?
>   mirror->hmm ?
>   mirror->hmm->mm which is really hmm->mm (mirror part does not matter) ?
> 
> I will hold resending v4 until tomorrow morning (eastern time) so that
> you can convince yourself that this code is right or prove me wrong.

No need to wait. The documentation request above is a minor point, and
we're OK with you resending v4 whenever you're ready.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2

2018-03-22 Thread John Hubbard
On 03/22/2018 05:50 PM, Jerome Glisse wrote:
> On Thu, Mar 22, 2018 at 05:13:14PM -0700, John Hubbard wrote:
>> On 03/22/2018 04:37 PM, Jerome Glisse wrote:
>>> On Thu, Mar 22, 2018 at 03:47:16PM -0700, John Hubbard wrote:
>>>> On 03/21/2018 04:41 PM, Jerome Glisse wrote:
>>>>> On Wed, Mar 21, 2018 at 04:22:49PM -0700, John Hubbard wrote:
>>>>>> On 03/21/2018 11:16 AM, jgli...@redhat.com wrote:
>>>>>>> From: Jérôme Glisse 
>>
>> 
>>
>>>>>
>>>>> No this code is correct. hmm->mm is set after hmm struct is allocated
>>>>> and before it is public so no one can race with that. It is clear in
>>>>> hmm_mirror_unregister() under the write lock hence checking it here
>>>>> under that same lock is correct.
>>>>
>>>> Are you implying that code that calls hmm_mirror_register() should do 
>>>> it's own locking, to prevent simultaneous calls to that function? Because
>>>> as things are right now, multiple threads can arrive at this point. The
>>>> fact that mirror->hmm is not "public" is irrelevant; what matters is that
>>>>> 1 thread can change it simultaneously.
>>>
>>> The content of struct hmm_mirror should not be modified by code outside
>>> HMM after hmm_mirror_register() and before hmm_mirror_unregister(). This
>>> is a private structure to HMM and the driver should not touch it, ie it
>>> should be considered as read only/const from driver code point of view.
>>
>> Yes, that point is clear and obvious.
>>
>>>
>>> It is also expected (which was obvious to me) that driver only call once
>>> and only once hmm_mirror_register(), and only once hmm_mirror_unregister()
>>> for any given hmm_mirror struct. Note that driver can register multiple
>>> _different_ mirror struct to same mm or differents mm.
>>>
>>> There is no need of locking on the driver side whatsoever as long as the
>>> above rules are respected. I am puzzle if they were not obvious :)
>>
>> Those rules were not obvious. It's unusual to claim that register and 
>> unregister
>> can run concurrently, but regiser and register cannot. Let's please document
>> the rules a bit in the comments.
> 
> I am really surprise this was not obvious. All existing _register API
> in the kernel follow this. You register something once only and doing
> it twice for same structure (ie unique struct hmm_mirror *mirror pointer
> value) leads to serious bugs (doing so concurently or not).
> 
> For instance if you call mmu_notifier_register() twice (concurrently
> or not) with same pointer value for struct mmu_notifier *mn then bad
> thing will happen. Same for driver_register() but this one actualy
> have sanity check and complain loudly if that happens. I doubt there
> is any single *_register/unregister() in the kernel that does not
> follow this.

OK, then I guess no need to document it. In any case we know what to
expect here, so no problem there.

thanks,
-- 
John Hubbard
NVIDIA

> 
> Note that doing register/unregister concurrently for the same unique
> hmm_mirror struct is also illegal. However concurrent register and
> unregister of different hmm_mirror struct is legal and this is the
> reasons for races we were discussing.
> 
> Cheers,
> Jérôme
> 


Re: [PATCH 09/14] mm/hmm: do not differentiate between empty entry or missing directory

2018-03-19 Thread John Hubbard
On 03/16/2018 12:14 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> There is no point in differentiating between a range for which there
> is not even a directory (and thus entries) and empty entry (pte_none()
> or pmd_none() returns true).
> 
> Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
> duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h |  8 +++-
>  mm/hmm.c| 45 +++--
>  2 files changed, 18 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 78b3ed6d7977..6d2b6bf6da4b 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -84,7 +84,6 @@ struct hmm;
>   * HMM_PFN_VALID: pfn is valid
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned 
> memory
> - * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should 
> not
>   *  be mirrored by a device, because the entry will never have 
> HMM_PFN_VALID
> @@ -94,10 +93,9 @@ struct hmm;
>  #define HMM_PFN_VALID (1 << 0)
>  #define HMM_PFN_WRITE (1 << 1)
>  #define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_EMPTY (1 << 3)
> -#define HMM_PFN_SPECIAL (1 << 4)
> -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
> -#define HMM_PFN_SHIFT 6
> +#define HMM_PFN_SPECIAL (1 << 3)
> +#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
> +#define HMM_PFN_SHIFT 5
>  
>  /*
>   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 04595a994542..2118e42cb838 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -305,6 +305,16 @@ static void hmm_pfns_clear(uint64_t *pfns,
>   *pfns = 0;
>  }
>  
> +/*
> + * hmm_vma_walk_hole() - handle a range back by no pmd or no pte


Maybe write it like this:

 * hmm_vma_walk_hole() - handle a range that is not backed by any pmd or pte
 

> + * @start: range virtual start address (inclusive)
> + * @end: range virtual end address (exclusive)
> + * @walk: mm_walk structure
> + * Returns: 0 on success, -EAGAIN after page fault, or page fault error
> + *
> + * This is an helper call whenever pmd_none() or pte_none() returns true
> + * or when there is no directory covering the range.

Instead of those two lines, how about:

 * This routine will be called whenever pmd_none() or pte_none() returns
 * true, or whenever there is no page directory covering the VA range.


> + */
>  static int hmm_vma_walk_hole(unsigned long addr,
>unsigned long end,
>struct mm_walk *walk)
> @@ -314,31 +324,6 @@ static int hmm_vma_walk_hole(unsigned long addr,
>   uint64_t *pfns = range->pfns;
>   unsigned long i;
>  
> - hmm_vma_walk->last = addr;
> - i = (addr - range->start) >> PAGE_SHIFT;
> - for (; addr < end; addr += PAGE_SIZE, i++) {
> - pfns[i] = HMM_PFN_EMPTY;
> - if (hmm_vma_walk->fault) {
> - int ret;
> -
> - ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
> - if (ret != -EAGAIN)
> - return ret;
> - }
> - }
> -
> - return hmm_vma_walk->fault ? -EAGAIN : 0;
> -}
> -
> -static int hmm_vma_walk_clear(unsigned long addr,
> -   unsigned long end,
> -   struct mm_walk *walk)
> -{
> - struct hmm_vma_walk *hmm_vma_walk = walk->private;
> - struct hmm_range *range = hmm_vma_walk->range;
> - uint64_t *pfns = range->pfns;
> - unsigned long i;
> -

Nice consolidation!

>   hmm_vma_walk->last = addr;
>   i = (addr - range->start) >> PAGE_SHIFT;
>   for (; addr < end; addr += PAGE_SIZE, i++) {
> @@ -397,10 +382,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>   if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
>   goto again;
>   if (pmd_protnone(pmd))
> - return hmm_vma_walk_clear(start, end, walk);
> + return hmm_vma_walk_hole(start, end, walk);
>  
>   if (write_fault && !pmd_write(pmd))
> - return hmm_vma_walk_clear(start, end, walk

Re: [PATCH 10/14] mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE

2018-03-19 Thread John Hubbard
On 03/16/2018 01:35 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> Make naming consistent accross code, DEVICE_PRIVATE is the name use
> outside HMM code so use that one.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 4 ++--
>  mm/hmm.c| 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)

Seems entirely harmless. :)

Reviewed-by: John Hubbard 

> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 6d2b6bf6da4b..78018b3e7a9f 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -88,13 +88,13 @@ struct hmm;
>   *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should 
> not
>   *  be mirrored by a device, because the entry will never have 
> HMM_PFN_VALID
>   *  set and the pfn value is undefined.
> - * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
> + * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
>   */
>  #define HMM_PFN_VALID (1 << 0)
>  #define HMM_PFN_WRITE (1 << 1)
>  #define HMM_PFN_ERROR (1 << 2)
>  #define HMM_PFN_SPECIAL (1 << 3)
> -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
> +#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
>  #define HMM_PFN_SHIFT 5
>  
>  /*
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 2118e42cb838..857eec622c98 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -429,7 +429,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>   pfns[i] |= HMM_PFN_WRITE;
>   } else if (write_fault)
>   goto fault;
> - pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
> + pfns[i] |= HMM_PFN_DEVICE_PRIVATE;
>       } else if (is_migration_entry(entry)) {
>   if (hmm_vma_walk->fault) {
>   pte_unmap(ptep);
> 

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 11/14] mm/hmm: move hmm_pfns_clear() closer to where it is use

2018-03-19 Thread John Hubbard
On 03/16/2018 01:35 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> Move hmm_pfns_clear() closer to where it is use to make it clear it
> is not use by page table walkers.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  mm/hmm.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)

Reviewed-by: John Hubbard 

> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 857eec622c98..3a708f500b80 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -297,14 +297,6 @@ static int hmm_pfns_bad(unsigned long addr,
>   return 0;
>  }
>  
> -static void hmm_pfns_clear(uint64_t *pfns,
> -unsigned long addr,
> -unsigned long end)
> -{
> - for (; addr < end; addr += PAGE_SIZE, pfns++)
> - *pfns = 0;
> -}
> -
>  /*
>   * hmm_vma_walk_hole() - handle a range back by no pmd or no pte
>   * @start: range virtual start address (inclusive)
> @@ -463,6 +455,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>   return 0;
>  }
>  
> +static void hmm_pfns_clear(uint64_t *pfns,
> +unsigned long addr,
> +unsigned long end)
> +{
> + for (; addr < end; addr += PAGE_SIZE, pfns++)
> + *pfns = 0;
> +}
> +

Yep, identical, so no functional changes.

>  static void hmm_pfns_special(struct hmm_range *range)
>  {
>   unsigned long addr = range->start, i = 0;

thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH 14/14] mm/hmm: use device driver encoding for HMM pfn

2018-03-19 Thread John Hubbard
On 03/16/2018 01:35 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> User of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
> and pfn shift value allowing them to define their own encoding for HMM
> pfn that are fill inside the pfns array of the hmm_range struct. With
> this device driver can get pfn that match their own private encoding
> out of HMM without having to do any convertion.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 91 
> -
>  mm/hmm.c| 83 +++-
>  2 files changed, 102 insertions(+), 72 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index ee758c4e4bec..cb9af99f9371 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -80,68 +80,106 @@
>  struct hmm;
>  
>  /*
> + * hmm_pfn_flag_e - HMM uses its own pfn type to keep several flags per page

OK, so here's the patch that switches over from bits to enum-based flags. But
it is still mysterious to me.

Maybe this is the place to write some details about how this array of flags
actually works. At first reading it is deeply confusing.

p.s. I still need to review the large patches: #11-13. I should get to those
tomorrow morning.

thanks,
-- 
John Hubbard
NVIDIA

> + *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned 
> memory
> + * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should 
> not
>   *  be mirrored by a device, because the entry will never have 
> HMM_PFN_VALID
>   *  set and the pfn value is undefined.
> - * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
> + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
> + */
> +enum hmm_pfn_flag_e {
> + HMM_PFN_VALID = 0,
> + HMM_PFN_WRITE,
> + HMM_PFN_ERROR,
> + HMM_PFN_NONE,
> + HMM_PFN_SPECIAL,
> + HMM_PFN_DEVICE_PRIVATE,
> + HMM_PFN_FLAG_MAX
> +};
> +
> +/*
> + * struct hmm_range - track invalidation lock on virtual address range
> + *
> + * @vma: the vm area struct for the range
> + * @list: all range lock are on a list
> + * @start: range virtual start address (inclusive)
> + * @end: range virtual end address (exclusive)
> + * @pfns: array of pfns (big enough for the range)
> + * @flags: pfn flags to match device driver page table
> + * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> + * @valid: pfns array did not change since it has been fill by an HMM 
> function
>   */
> -#define HMM_PFN_VALID (1 << 0)
> -#define HMM_PFN_WRITE (1 << 1)
> -#define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_SPECIAL (1 << 3)
> -#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
> -#define HMM_PFN_SHIFT 5
> +struct hmm_range {
> + struct vm_area_struct   *vma;
> + struct list_headlist;
> + unsigned long   start;
> + unsigned long   end;
> + uint64_t*pfns;
> + const uint64_t  *flags;
> + uint8_t pfn_shift;
> + boolvalid;
> +};
>  
>  /*
>   * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> + * @range: range use to decode HMM pfn value
>   * @pfn: HMM pfn value to get corresponding struct page from
>   * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
>   *
>   * If the uint64_t is valid (ie valid flag set) then return the struct page
>   * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
>   */
> -static inline struct page *hmm_pfn_to_page(uint64_t pfn)
> +static inline struct page *hmm_pfn_to_page(const struct hmm_range *range,
> +uint64_t pfn)
>  {
> - if (!(pfn & HMM_PFN_VALID))
> + if (!(pfn & range->flags[HMM_PFN_VALID]))
>   return NULL;
> - return pfn_to_page(pfn >> HMM_PFN_SHIFT);
> + return pfn_to_page(pfn >> range->pfn_shift);
>  }
>  
>  /*
>   * hmm_pfn_to_pfn() - return pfn value store in a HMM pfn
> + * @range: range use to decode HMM pfn value
>   * @pfn: HMM pfn value to extract pfn from
>   * Returns: pfn value if HMM pfn is valid, -1UL otherwise
>   */
> -static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
> +st

Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-20 Thread John Hubbard
On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
> From: Ralph Campbell 
> 
> The hmm_mirror_register() function registers a callback for when
> the CPU pagetable is modified. Normally, the device driver will
> call hmm_mirror_unregister() when the process using the device is
> finished. However, if the process exits uncleanly, the struct_mm
> can be destroyed with no warning to the device driver.
> 
> Changed since v1:
>   - dropped VM_BUG_ON()
>   - cc stable
> 
> Signed-off-by: Ralph Campbell 
> Signed-off-by: Jérôme Glisse 
> Cc: sta...@vger.kernel.org
> Cc: Evgeny Baskakov 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 10 ++
>  mm/hmm.c| 18 +-
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 36dd21fe5caf..fa7b51f65905 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,6 +218,16 @@ enum hmm_update_type {
>   * @update: callback to update range on a device
>   */
>  struct hmm_mirror_ops {
> + /* release() - release hmm_mirror
> +  *
> +  * @mirror: pointer to struct hmm_mirror
> +  *
> +  * This is called when the mm_struct is being released.
> +  * The callback should make sure no references to the mirror occur
> +  * after the callback returns.
> +  */
> + void (*release)(struct hmm_mirror *mirror);
> +
>   /* sync_cpu_device_pagetables() - synchronize page tables
>*
>* @mirror: pointer to struct hmm_mirror
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 320545b98ff5..6088fa6ed137 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -160,6 +160,21 @@ static void hmm_invalidate_range(struct hmm *hmm,
>   up_read(&hmm->mirrors_sem);
>  }
>  
> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + struct hmm *hmm = mm->hmm;
> + struct hmm_mirror *mirror;
> + struct hmm_mirror *mirror_next;
> +
> + down_write(&hmm->mirrors_sem);
> + list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> + list_del_init(&mirror->list);
> + if (mirror->ops->release)
> + mirror->ops->release(mirror);

Hi Jerome,

This presents a deadlock problem (details below). As for solution ideas, 
Mark Hairgrove points out that the MMU notifiers had to solve the
same sort of problem, and part of the solution involves "avoid
holding locks when issuing these callbacks". That's not an entire 
solution description, of course, but it seems like a good start.

Anyway, for the deadlock problem:

Each of these ->release callbacks potentially has to wait for the 
hmm_invalidate_range() callbacks to finish. That is not shown in any
code directly, but it's because: when a device driver is processing 
the above ->release callback, it has to allow any in-progress operations 
to finish up (as specified clearly in your comment documentation above). 

Some of those operations will invariably need to do things that result 
in page invalidations, thus triggering the hmm_invalidate_range() callback.
Then, the hmm_invalidate_range() callback tries to acquire the same 
hmm->mirrors_sem lock, thus leading to deadlock:

hmm_invalidate_range():
// ...
    down_read(&hmm->mirrors_sem);
    list_for_each_entry(mirror, &hmm->mirrors, list)
        mirror->ops->sync_cpu_device_pagetables(mirror, action,
                                                start, end);
    up_read(&hmm->mirrors_sem);
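
To make the lock cycle concrete, here is a tiny userspace analogy (my own
sketch, not HMM code; a pthread rwlock stands in for mirrors_sem). One thread
plays the ->release side, holding the lock for write while waiting for
in-flight work to drain; the other plays a driver thread whose drain path
needs the same lock for read. Build with -pthread; it hangs by design:

#include <pthread.h>
#include <unistd.h>

static pthread_rwlock_t mirrors_sem = PTHREAD_RWLOCK_INITIALIZER;

/* Driver work already in flight (think: migrate_vma -> invalidate). */
static void *driver_fault_work(void *arg)
{
	sleep(1);				/* give "release" time to take the lock */
	pthread_rwlock_rdlock(&mirrors_sem);	/* "hmm_invalidate_range()": blocks */
	pthread_rwlock_unlock(&mirrors_sem);
	return NULL;
}

int main(void)
{
	pthread_t driver;

	pthread_create(&driver, NULL, driver_fault_work, NULL);

	pthread_rwlock_wrlock(&mirrors_sem);	/* "hmm_release()" holds it for write */
	pthread_join(driver, NULL);		/* ->release waits for the drain: deadlock */
	pthread_rwlock_unlock(&mirrors_sem);
	return 0;
}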

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit

2018-03-20 Thread John Hubbard
On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> This code was lost in translation at one point. This properly call
> mmu_notifier_unregister_no_release() once last user is gone. This
> fix the zombie mm_struct as without this patch we do not drop the
> refcount we have on it.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  mm/hmm.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6088fa6ed137..667944630dc9 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -244,10 +244,29 @@ EXPORT_SYMBOL(hmm_mirror_register);
>  void hmm_mirror_unregister(struct hmm_mirror *mirror)
>  {
>   struct hmm *hmm = mirror->hmm;
> + struct mm_struct *mm = NULL;
> + bool unregister = false;
>  
>   down_write(&hmm->mirrors_sem);
>   list_del_init(&mirror->list);
> + unregister = list_empty(&hmm->mirrors);

Hi Jerome,

This first minor point may be irrelevant, depending on how you fix 
the other problem below, but: tiny naming idea: rename unregister 
to either "should_unregister", or "mirror_snapshot_empty"...the 
latter helps show that this is stale information, once the lock is 
dropped. 

>   up_write(&hmm->mirrors_sem);
> +
> + if (!unregister)
> + return;

Whee, here I am, lock-free, in the middle of a race condition
window. :)  Right here, someone (hmm_mirror_register) could be adding
another mirror.

It's not immediately clear to me what the best solution is.
I'd be happier if we didn't have to drop one lock and take
another like this, but if we do, then maybe rechecking that
the list hasn't changed...safely, somehow, is a way forward here.


> +
> + spin_lock(&hmm->mm->page_table_lock);
> + if (hmm->mm->hmm == hmm) {
> + mm = hmm->mm;
> + mm->hmm = NULL;
> + }
> + spin_unlock(&hmm->mm->page_table_lock);
> +
> + if (mm == NULL)
> + return;
> +
> + mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
> + kfree(hmm);
>  }
>  EXPORT_SYMBOL(hmm_mirror_unregister);
>  

thanks,
-- 
John Hubbard
NVIDIA
 


Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2

2018-03-20 Thread John Hubbard
On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> User of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
> and pfn shift value allowing them to define their own encoding for HMM
> pfn that are fill inside the pfns array of the hmm_range struct. With
> this device driver can get pfn that match their own private encoding
> out of HMM without having to do any conversion.
> 
> Changed since v1:
>   - Split flags and special values for clarification
>   - Improved comments and provide examples
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 130 
> +---
>  mm/hmm.c|  85 +++---
>  2 files changed, 142 insertions(+), 73 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 0f7ea3074175..5d26e0a223d9 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -80,68 +80,145 @@
>  struct hmm;
>  
>  /*
> + * hmm_pfn_flag_e - HMM flag enums
> + *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
>   * HMM_PFN_WRITE: CPU page table has write permission set
> + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
> + *
> + * The driver provide a flags array, if driver valid bit for an entry is bit
> + * 3 ie (entry & (1 << 3)) is true if entry is valid then driver must provide
> + * an array in hmm_range.flags with hmm_range.flags[HMM_PFN_VALID] == 1 << 3.
> + * Same logic apply to all flags. This is same idea as vm_page_prot in vma
> + * except that this is per device driver rather than per architecture.

Hi Jerome,

If we go with this approach--and I hope not, I'll try to talk you down from the
ledge, in a moment--then maybe we should add the following to the comments: 

"There is only one bit ever set in each hmm_range.flags[entry]." 

Or maybe we'll get pushback, that the code shows that already, but IMHO this is
a strange way to do things (especially when there is a much easier way), and
deserves that extra bit of helpful documentation.

More below...

> + */
> +enum hmm_pfn_flag_e {
> + HMM_PFN_VALID = 0,
> + HMM_PFN_WRITE,
> + HMM_PFN_DEVICE_PRIVATE,
> + HMM_PFN_FLAG_MAX
> +};
> +
> +/*
> + * hmm_pfn_value_e - HMM pfn special value
> + *
> + * Flags:
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned 
> memory
> + * HMM_PFN_NONE: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should 
> not
>   *  be mirrored by a device, because the entry will never have 
> HMM_PFN_VALID
>   *  set and the pfn value is undefined.
> - * HMM_PFN_DEVICE_PRIVATE: unaddressable device memory (ZONE_DEVICE)
> + *
> + * Driver provide entry value for none entry, error entry and special entry,
> + * driver can alias (ie use same value for error and special for instance). 
> It
> + * should not alias none and error or special.
> + *
> + * HMM pfn value returned by hmm_vma_get_pfns() or hmm_vma_fault() will be:
> + * hmm_range.values[HMM_PFN_ERROR] if CPU page table entry is poisonous,
> + * hmm_range.values[HMM_PFN_NONE] if there is no CPU page table
> + * hmm_range.values[HMM_PFN_SPECIAL] if CPU page table entry is a special one
>   */
> -#define HMM_PFN_VALID (1 << 0)
> -#define HMM_PFN_WRITE (1 << 1)
> -#define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_SPECIAL (1 << 3)
> -#define HMM_PFN_DEVICE_PRIVATE (1 << 4)
> -#define HMM_PFN_SHIFT 5
> +enum hmm_pfn_value_e {
> + HMM_PFN_ERROR,
> + HMM_PFN_NONE,
> + HMM_PFN_SPECIAL,
> + HMM_PFN_VALUE_MAX
> +};

I can think of perhaps two good solid ways to get what you want, without
moving to what I consider an unnecessary excursion into arrays of flags. 
If I understand correctly, you want to let each architecture
specify which bit to use for each of the above HMM_PFN_* flags. 

The way you have it now, the code does things like this:

cpu_flags & range->flags[HMM_PFN_WRITE]

but that array entry is mostly empty space, and it's confusing. It would
be nicer to see:

cpu_flags & HMM_PFN_WRITE

...which you can easily do, by defining HMM_PFN_WRITE and friends in an
arch-specific header file.

The other way to make this more readable would be to use helper routines
similar to what the vm_pgprot* routines do:

static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
}

...but that's also unnecessary.

Let's just keep it simple, and go back to the bitmap flags!
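
To put the two styles side by side, here is a tiny self-contained sketch
(illustration only; the bit positions in driver_flags are made up, and the
enum is trimmed down) of the array-indexed encoding this patch introduces
versus plain bitmask defines:

#include <stdint.h>
#include <stdio.h>

/* This patch: the driver supplies an array that HMM indexes by enum. */
enum hmm_pfn_flag_e { HMM_PFN_VALID, HMM_PFN_WRITE, HMM_PFN_FLAG_MAX };

static const uint64_t driver_flags[HMM_PFN_FLAG_MAX] = {
	[HMM_PFN_VALID] = 1ULL << 3,	/* hypothetical driver bit layout */
	[HMM_PFN_WRITE] = 1ULL << 4,
};

/* The alternative: fixed, greppable bitmask defines. */
#define HMM_PFN_VALID_BIT (1ULL << 0)
#define HMM_PFN_WRITE_BIT (1ULL << 1)

int main(void)
{
	uint64_t entry_array  = driver_flags[HMM_PFN_VALID] |
				driver_flags[HMM_PFN_WRITE];
	uint64_t entry_bitmap = HMM_PFN_VALID_BIT | HMM_PFN_WRITE_BIT;

	/* Same "is it writable?" test, two spellings. */
	printf("array encoding writable:  %d\n",
	       !!(entry_array & driver_flags[HMM_PFN_WRITE]));
	printf("bitmap encoding writable: %d\n",
	       !!(entry_bitmap & HMM_PFN_WRITE_BIT));
	return 0;
}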

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()

2018-03-20 Thread John Hubbard
On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> No functional change, just create one function to handle pmd and one
> to handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  mm/hmm.c | 174 
> +--
>  1 file changed, 102 insertions(+), 72 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 52cdceb35733..dc703e9c3a95 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -351,6 +351,99 @@ static int hmm_vma_walk_hole(unsigned long addr,
>   return hmm_vma_walk->fault ? -EAGAIN : 0;
>  }
>  
> +static int hmm_vma_handle_pmd(struct mm_walk *walk,
> +   unsigned long addr,
> +   unsigned long end,
> +   uint64_t *pfns,

Hi Jerome,

Nice cleanup, it makes it much easier to follow the code now.

Let's please rename the pfns argument above to "pfn", because in this
helper (and the _pte helper too), there is only one pfn involved, rather
than an array of them.

> +   pmd_t pmd)
> +{
> + struct hmm_vma_walk *hmm_vma_walk = walk->private;
> + unsigned long pfn, i;
> + uint64_t flag = 0;
> +
> + if (pmd_protnone(pmd))
> + return hmm_vma_walk_hole(addr, end, walk);
> +
> + if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pmd_write(pmd))
> + return hmm_vma_walk_hole(addr, end, walk);
> +
> + pfn = pmd_pfn(pmd) + pte_index(addr);
> + flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
> + for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
> + pfns[i] = hmm_pfn_from_pfn(pfn) | flag;
> + hmm_vma_walk->last = end;
> + return 0;
> +}
> +
> +static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> +   unsigned long end, pmd_t *pmdp, pte_t *ptep,
> +   uint64_t *pfns)

Same thing here: rename pfns --> pfn.

I moved diffs around to attempt to confirm that this is just a refactoring,
and it does look the same. It's easy to overlook things here, but:

Reviewed-by: John Hubbard 

thanks,
-- 
John Hubbard
NVIDIA

> +{
> + struct hmm_vma_walk *hmm_vma_walk = walk->private;
> + struct vm_area_struct *vma = walk->vma;
> + pte_t pte = *ptep;
> +
> + *pfns = 0;
> +
> + if (pte_none(pte)) {
> + *pfns = 0;
> + if (hmm_vma_walk->fault)
> + goto fault;
> + return 0;
> + }
> +
> + if (!pte_present(pte)) {
> + swp_entry_t entry = pte_to_swp_entry(pte);
> +
> + if (!non_swap_entry(entry)) {
> + if (hmm_vma_walk->fault)
> + goto fault;
> + return 0;
> + }
> +
> + /*
> +  * This is a special swap entry, ignore migration, use
> +  * device and report anything else as error.
> +  */
> + if (is_device_private_entry(entry)) {
> + *pfns = hmm_pfn_from_pfn(swp_offset(entry));
> + if (is_write_device_private_entry(entry)) {
> + *pfns |= HMM_PFN_WRITE;
> + } else if ((hmm_vma_walk->fault & hmm_vma_walk->write))
> + goto fault;
> + *pfns |= HMM_PFN_DEVICE_PRIVATE;
> + return 0;
> + }
> +
> + if (is_migration_entry(entry)) {
> + if (hmm_vma_walk->fault) {
> + pte_unmap(ptep);
> + hmm_vma_walk->last = addr;
> + migration_entry_wait(vma->vm_mm,
> + pmdp, addr);
> + return -EAGAIN;
> + }
> + return 0;
> + }
> +
> + /* Report error for everything else */
> + *pfns = HMM_PFN_ERROR;
> + return -EFAULT;
> + }
> +
> + if ((hmm_vma_walk->fault & hmm_vma_walk->write) && !pte_write(pte))
> + goto fault;
> +
> + *pfns = hmm_pfn_from_pfn(pte_pfn(pte));
> + *pfns |= pte_write(pte) ? HMM_PFN_WRITE : 0;
> + return 0;
> +
> +fault:
> + pte_unmap(ptep);
> + /* Fault any virtual address we were ask to fault */
> + return hmm_vma_walk_hole(addr, end, walk);
> +}
>

Re: [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v2

2018-03-20 Thread John Hubbard
On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> There is no point in differentiating between a range for which there
> is not even a directory (and thus entries) and empty entry (pte_none()
> or pmd_none() returns true).
> 
> Simply drop the distinction ie remove HMM_PFN_EMPTY flag and merge now
> duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.
> 
> Changed since v1:
>   - Improved comments
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h |  8 +++-
>  mm/hmm.c| 45 +++--
>  2 files changed, 18 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 54d684fe3b90..cf283db22106 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -84,7 +84,6 @@ struct hmm;
>   * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned 
> memory
> - * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
>   *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should 
> not
>   *  be mirrored by a device, because the entry will never have 
> HMM_PFN_VALID
> @@ -94,10 +93,9 @@ struct hmm;
>  #define HMM_PFN_VALID (1 << 0)
>  #define HMM_PFN_WRITE (1 << 1)
>  #define HMM_PFN_ERROR (1 << 2)
> -#define HMM_PFN_EMPTY (1 << 3)

Hi Jerome,

Nearly done with this one...see below for a bit more detail, but I think if we 
did this:

#define HMM_PFN_EMPTY (0)

...it would work out nicely.

> -#define HMM_PFN_SPECIAL (1 << 4)
> -#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
> -#define HMM_PFN_SHIFT 6
> +#define HMM_PFN_SPECIAL (1 << 3)
> +#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
> +#define HMM_PFN_SHIFT 5
>  



> @@ -438,7 +423,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>   pfns[i] = 0;
>  
>   if (pte_none(pte)) {
> - pfns[i] = HMM_PFN_EMPTY;
> + pfns[i] = 0;

This works, but why not keep HMM_PFN_EMPTY, and just define it as zero?
Symbols are better than raw numbers here.


>   if (hmm_vma_walk->fault)
>   goto fault;
>   continue;
> @@ -489,8 +474,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  
>  fault:
>   pte_unmap(ptep);
> - /* Fault all pages in range */
> - return hmm_vma_walk_clear(start, end, walk);
> + /* Fault any virtual address we were ask to fault */

 asked to fault

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-21 Thread John Hubbard
On 03/21/2018 11:03 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
>>> From: Ralph Campbell 



>> Hi Jerome,
>>
>> This presents a deadlock problem (details below). As for solution ideas, 
>> Mark Hairgrove points out that the MMU notifiers had to solve the
>> same sort of problem, and part of the solution involves "avoid
>> holding locks when issuing these callbacks". That's not an entire 
>> solution description, of course, but it seems like a good start.
>>
>> Anyway, for the deadlock problem:
>>
>> Each of these ->release callbacks potentially has to wait for the 
>> hmm_invalidate_range() callbacks to finish. That is not shown in any
>> code directly, but it's because: when a device driver is processing 
>> the above ->release callback, it has to allow any in-progress operations 
>> to finish up (as specified clearly in your comment documentation above). 
>>
>> Some of those operations will invariably need to do things that result 
>> in page invalidations, thus triggering the hmm_invalidate_range() callback.
>> Then, the hmm_invalidate_range() callback tries to acquire the same 
>> hmm->mirrors_sem lock, thus leading to deadlock:
>>
>> hmm_invalidate_range():
>> // ...
>>  down_read(&hmm->mirrors_sem);
>>  list_for_each_entry(mirror, &hmm->mirrors, list)
>>  mirror->ops->sync_cpu_device_pagetables(mirror, action,
>>  start, end);
>>  up_read(&hmm->mirrors_sem);
> 
> That is just illegal, the release callback is not allowed to trigger
> invalidation all it does is kill all device's threads and stop device
> page fault from happening. So there is no deadlock issues. I can re-
> inforce the comment some more (see [1] for example on what it should
> be).

That rule is fine, and it is true that the .release callback will not 
directly trigger any invalidations. However, the problem is in letting 
any *existing* outstanding operations finish up. We have to let 
existing operations "drain", in order to meet the requirement that 
everything is done when .release returns.

For example, if a device driver thread is in the middle of working through
its fault buffer, it will call migrate_vma(), which will in turn unmap
pages. That will cause an hmm_invalidate_range() callback, which tries
to take hmm->mirrors_sem, and we deadlock.

There's no way to "kill" such a thread while it's in the middle of
migrate_vma(), you have to let it finish up.

> 
> Also it is illegal for the sync callback to trigger any mmu_notifier
> callback. I thought this was obvious. The sync callback should only
> update device page table and do _nothing else_. No way to make this
> re-entrant.

That is obvious, yes. I am not trying to say there is any problem with
that rule. It's the "drain outstanding operations during .release", 
above, that is the real problem.

thanks,
-- 
John Hubbard
NVIDIA

> 
> For anonymous private memory migrated to device memory it is freed
> shortly after the release callback (see exit_mmap()). For share memory
> you might want to migrate back to regular memory but that will be fine
> as you will not get mmu_notifier callback any more.
> 
> So i don't see any deadlock here.
> 
> Cheers,
> Jérôme
> 
> [1] 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=nouveau-hmm=93adb3e6b4f39d5d146b6a8afb4175d37bdd4890
> 


Re: [PATCH 13/15] mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()

2018-03-21 Thread John Hubbard
On 03/21/2018 08:08 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 10:07:29PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
>>> From: Jérôme Glisse 


 
>>> +static int hmm_vma_handle_pmd(struct mm_walk *walk,
>>> + unsigned long addr,
>>> + unsigned long end,
>>> + uint64_t *pfns,
>>
>> Hi Jerome,
>>
>> Nice cleanup, it makes it much easier to follow the code now.
>>
>> Let's please rename the pfns argument above to "pfn", because in this
>> helper (and the _pte helper too), there is only one pfn involved, rather
>> than an array of them.
> 
> This is only true to handle_pte, for handle_pmd it will go over several
> pfn entries. But they will all get fill with same value modulo pfn which
> will increase monotically (ie same flag as pmd permissions apply to all
> entries).

oops, yes you are right about handle_pmd.

> 
> Note sure s/pfns/pfn for hmm_vma_handle_pte() warrant a respin.

Probably not, unless there is some other reason to respin. Anyway, this patch
looks good either way, I think, so you can still add:

Reviewed-by: John Hubbard 

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-21 Thread John Hubbard
On 03/21/2018 03:46 PM, Jerome Glisse wrote:
> On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
>> On 03/21/2018 11:03 AM, Jerome Glisse wrote:
>>> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
>>>> On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
>>>>> From: Ralph Campbell 
>>
>> 
>>
>>>> Hi Jerome,
>>>>
>>>> This presents a deadlock problem (details below). As for solution ideas, 
>>>> Mark Hairgrove points out that the MMU notifiers had to solve the
>>>> same sort of problem, and part of the solution involves "avoid
>>>> holding locks when issuing these callbacks". That's not an entire 
>>>> solution description, of course, but it seems like a good start.
>>>>
>>>> Anyway, for the deadlock problem:
>>>>
>>>> Each of these ->release callbacks potentially has to wait for the 
>>>> hmm_invalidate_range() callbacks to finish. That is not shown in any
>>>> code directly, but it's because: when a device driver is processing 
>>>> the above ->release callback, it has to allow any in-progress operations 
>>>> to finish up (as specified clearly in your comment documentation above). 
>>>>
>>>> Some of those operations will invariably need to do things that result 
>>>> in page invalidations, thus triggering the hmm_invalidate_range() callback.
>>>> Then, the hmm_invalidate_range() callback tries to acquire the same 
>>>> hmm->mirrors_sem lock, thus leading to deadlock:
>>>>
>>>> hmm_invalidate_range():
>>>> // ...
>>>>down_read(&hmm->mirrors_sem);
>>>>list_for_each_entry(mirror, &hmm->mirrors, list)
>>>>mirror->ops->sync_cpu_device_pagetables(mirror, action,
>>>>start, end);
>>>>up_read(&hmm->mirrors_sem);
>>>
>>> That is just illegal, the release callback is not allowed to trigger
>>> invalidation all it does is kill all device's threads and stop device
>>> page fault from happening. So there is no deadlock issues. I can re-
>>> inforce the comment some more (see [1] for example on what it should
>>> be).
>>
>> That rule is fine, and it is true that the .release callback will not 
>> directly trigger any invalidations. However, the problem is in letting 
>> any *existing* outstanding operations finish up. We have to let 
>> existing operations "drain", in order to meet the requirement that 
>> everything is done when .release returns.
>>
>> For example, if a device driver thread is in the middle of working through
>> its fault buffer, it will call migrate_vma(), which will in turn unmap
>> pages. That will cause an hmm_invalidate_range() callback, which tries
>> to take hmm->mirrors_sems, and we deadlock.
>>
>> There's no way to "kill" such a thread while it's in the middle of
>> migrate_vma(), you have to let it finish up.
>>
>>> Also it is illegal for the sync callback to trigger any mmu_notifier
>>> callback. I thought this was obvious. The sync callback should only
>>> update device page table and do _nothing else_. No way to make this
>>> re-entrant.
>>
>> That is obvious, yes. I am not trying to say there is any problem with
>> that rule. It's the "drain outstanding operations during .release", 
>> above, that is the real problem.
> 
> Maybe just relax the release callback wording, it should stop any
> more processing of fault buffer but not wait for it to finish. In
> nouveau code i kill thing but i do not wait hence i don't deadlock.

But you may crash, because that approach allows .release to finish
up, thus removing the mm entirely, out from under (for example)
a migrate_vma call--or any other call that refers to the mm.

It doesn't seem too hard to avoid the problem, though: maybe we
can just drop the lock while doing the mirror->ops->release callback.
There are a few ways to do this, but one example is: 

-- take the lock,
-- copy the list to a local list, deleting entries as you go,
-- drop the lock, 
-- iterate through the local list copy and 
-- issue the mirror->ops->release callbacks.

At this point, more items could have been added to the list, so repeat
the above until the original list is empty. 

This is subject to a limited starvation case if mirrors keep getting
registered, but I think we can ignore that, because it only lasts as long as
mirrors keep getting added, and then it finishes up.
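
In code, the loop I have in mind would look roughly like this (an untested
sketch against hmm_release(), just to show the shape, not a real patch):

static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct hmm *hmm = mm->hmm;

	for (;;) {
		struct hmm_mirror *mirror, *next;
		LIST_HEAD(local);

		down_write(&hmm->mirrors_sem);
		if (list_empty(&hmm->mirrors)) {
			up_write(&hmm->mirrors_sem);
			break;
		}
		/* Move the current set of mirrors onto a private list... */
		list_splice_init(&hmm->mirrors, &local);
		/* ...and drop the lock before calling back into drivers. */
		up_write(&hmm->mirrors_sem);

		list_for_each_entry_safe(mirror, next, &local, list) {
			list_del_init(&mirror->list);
			if (mirror->ops->release)
				mirror->ops->release(mirror);
		}
		/* Mirrors registered meanwhile are handled on the next pass. */
	}
}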

Re: [PATCH 10/15] mm/hmm: do not differentiate between empty entry or missing directory v2

2018-03-21 Thread John Hubbard
On 03/21/2018 07:48 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 10:24:34PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
>>> From: Jérôme Glisse 
>>>



>>
>> 
>>
>>> @@ -438,7 +423,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>>> pfns[i] = 0;
>>>  
>>> if (pte_none(pte)) {
>>> -   pfns[i] = HMM_PFN_EMPTY;
>>> +   pfns[i] = 0;
>>
>> This works, but why not keep HMM_PFN_EMPTY, and just define it as zero?
>> Symbols are better than raw numbers here.
>>
> 
> The last patch do that so i don't think it is worth respinning
> just to make this intermediate state prettier.
> 

Yes, you're right, of course. And, no other problems found, so:

Reviewed-by: John Hubbard 

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 15/15] mm/hmm: use device driver encoding for HMM pfn v2

2018-03-21 Thread John Hubbard
On 03/21/2018 08:52 AM, Jerome Glisse wrote:
> On Tue, Mar 20, 2018 at 09:39:27PM -0700, John Hubbard wrote:
>> On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
>>> From: Jérôme Glisse 
> 
> [...]
> 



>>
>> Let's just keep it simple, and go back to the bitmap flags!
> 
> This simplify nouveau code and it is the reason why i did that patch.
> I am sure it can simplify NVidia uvm code, i can look into it if you
> want to give pointers. Idea here is that HMM can fill array with some-
> thing that match device driver internal format and avoid the conversion
> step from HMM format to driver format (saving CPU cycles and memory
> doing so). I am open to alternative that give the same end result.
> 
> [Just because code is worth 2^32 words :)
> 
> Without this patch:
> int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
> {
> uint64_t *hmm_pfns, *nouveau_pfns;
> 
> hmm_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
> nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
> hmm_vma_fault(..., hmm_pfns, ...);
> 
> for (i = 0; i < npages; ++i) {
> nouveau_pfns[i] = nouveau_pfn_from_hmm_pfn(hmm_pfns[i]);
> }
> ...
> }
> 
> With this patch:
> int nouveau_do_fault(..., ulong addr, unsigned npages, ...)
> {
> uint64_t *nouveau_pfns;
> 
> nouveau_pfns = kmalloc(sizeof(uint64_t) * npages, GFP_KERNEL);
> hmm_vma_fault(..., nouveau_pfns, ...);
> 
> ...
> }
> 
> Benefit from this patch is quite obvious to me. Down the road with bit
> more integration between HMM and IOMMU/DMA this can turn into something
> directly ready for hardware consumptions.
> 
> Note that you could argue that i can convert nouveau to use HMM format
> but this would not work, first because it requires a lot of changes in
> nouuveau, second because HMM do not have all the flags needed by the
> drivers (nor does HMM need them). HMM being the helper here, i feel it
> is up to HMM to adapt to drivers than the other way around.]
> 

OK, if this simplifies Nouveau and potentially other drivers, then I'll 
drop my earlier objections! Thanks for explaining what's going on, in detail.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 04/15] mm/hmm: unregister mmu_notifier when last HMM client quit v2

2018-03-21 Thread John Hubbard
On 03/21/2018 11:16 AM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> This code was lost in translation at one point. This properly call
> mmu_notifier_unregister_no_release() once last user is gone. This
> fix the zombie mm_struct as without this patch we do not drop the
> refcount we have on it.
> 
> Changed since v1:
>   - close race window between a last mirror unregistering and a new
> mirror registering, which could have lead to use after free()
> kind of bug
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  mm/hmm.c | 35 +--
>  1 file changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6088fa6ed137..f75aa8df6e97 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -222,13 +222,24 @@ int hmm_mirror_register(struct hmm_mirror *mirror, 
> struct mm_struct *mm)
>   if (!mm || !mirror || !mirror->ops)
>   return -EINVAL;
>  
> +again:
>   mirror->hmm = hmm_register(mm);
>   if (!mirror->hmm)
>   return -ENOMEM;
>  
>   down_write(&mirror->hmm->mirrors_sem);
> - list_add(&mirror->list, &mirror->hmm->mirrors);
> - up_write(&mirror->hmm->mirrors_sem);
> + if (mirror->hmm->mm == NULL) {
> + /*
> +  * A racing hmm_mirror_unregister() is about to destroy the hmm
> +  * struct. Try again to allocate a new one.
> +  */
> + up_write(&mirror->hmm->mirrors_sem);
> + mirror->hmm = NULL;

This is being set outside of locks, so now there is another race with
another hmm_mirror_register...

I'll take a moment and draft up what I have in mind here, which is a more
symmetrical locking scheme for these routines.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 03/15] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-21 Thread John Hubbard
On 03/21/2018 04:37 PM, Jerome Glisse wrote:
> On Wed, Mar 21, 2018 at 04:10:32PM -0700, John Hubbard wrote:
>> On 03/21/2018 03:46 PM, Jerome Glisse wrote:
>>> On Wed, Mar 21, 2018 at 03:16:04PM -0700, John Hubbard wrote:
>>>> On 03/21/2018 11:03 AM, Jerome Glisse wrote:
>>>>> On Tue, Mar 20, 2018 at 09:14:34PM -0700, John Hubbard wrote:
>>>>>> On 03/19/2018 07:00 PM, jgli...@redhat.com wrote:
>>>>>>> From: Ralph Campbell 
> 
> [...]
> 
>>>>> That is just illegal, the release callback is not allowed to trigger
>>>>> invalidation all it does is kill all device's threads and stop device
>>>>> page fault from happening. So there is no deadlock issues. I can re-
>>>>> inforce the comment some more (see [1] for example on what it should
>>>>> be).
>>>>
>>>> That rule is fine, and it is true that the .release callback will not 
>>>> directly trigger any invalidations. However, the problem is in letting 
>>>> any *existing* outstanding operations finish up. We have to let 
>>>> existing operations "drain", in order to meet the requirement that 
>>>> everything is done when .release returns.
>>>>
>>>> For example, if a device driver thread is in the middle of working through
>>>> its fault buffer, it will call migrate_vma(), which will in turn unmap
>>>> pages. That will cause an hmm_invalidate_range() callback, which tries
>>>> to take hmm->mirrors_sems, and we deadlock.
>>>>
>>>> There's no way to "kill" such a thread while it's in the middle of
>>>> migrate_vma(), you have to let it finish up.
>>>>
>>>>> Also it is illegal for the sync callback to trigger any mmu_notifier
>>>>> callback. I thought this was obvious. The sync callback should only
>>>>> update device page table and do _nothing else_. No way to make this
>>>>> re-entrant.
>>>>
>>>> That is obvious, yes. I am not trying to say there is any problem with
>>>> that rule. It's the "drain outstanding operations during .release", 
>>>> above, that is the real problem.
>>>
>>> Maybe just relax the release callback wording, it should stop any
>>> more processing of fault buffer but not wait for it to finish. In
>>> nouveau code i kill thing but i do not wait hence i don't deadlock.
>>
>> But you may crash, because that approach allows .release to finish
>> up, thus removing the mm entirely, out from under (for example)
>> a migrate_vma call--or any other call that refers to the mm.
> 
> No you can not crash on mm as it will not vanish before you are done
> with it as mm will not be freed before you call hmm_unregister() and
> you should not call that from release, nor should you call it before
> everything is flush. However vma struct might vanish ... i might have
> assume wrongly about the down_write() always happening in exit_mmap()
> This might be a solution to force serialization.
> 
 
OK. My details on mm destruction were inaccurate, but we do agree now
that the whole virtual address space is being torn down at the same
time as we're trying to use it, so I think we're on the same page now.

>>
>> It doesn't seem too hard to avoid the problem, though: maybe we
>> can just drop the lock while doing the mirror->ops->release callback.
>> There are a few ways to do this, but one example is: 
>>
>> -- take the lock,
>> -- copy the list to a local list, deleting entries as you go,
>> -- drop the lock, 
>> -- iterate through the local list copy and 
>> -- issue the mirror->ops->release callbacks.
>>
>> At this point, more items could have been added to the list, so repeat
>> the above until the original list is empty. 
>>
>> This is subject to a limited starvation case if mirror keep getting 
>> registered, but I think we can ignore that, because it only lasts as long as 
>> mirrors keep getting added, and then it finishes up.
> 
> The down_write is better solution and easier just 2 line of code.

OK. I'll have a better idea when I see it.

> 
>>
>>>
>>> What matter is to stop any further processing. Yes some fault might
>>> be in flight but they will serialize on various lock. 
>>
>> Those faults in flight could already be at a point where they have taken
>> whatever locks they need, so we don't dare let the mm get destroyed while
>> such fault handling is in progress.
> 
> mm ca

Re: [PATCH 3/4] mm/hmm: HMM should have a callback before MM is destroyed

2018-03-15 Thread John Hubbard
On 03/15/2018 05:54 PM, Jerome Glisse wrote:
> On Thu, Mar 15, 2018 at 03:48:29PM -0700, Andrew Morton wrote:
>> On Thu, 15 Mar 2018 14:36:59 -0400 jgli...@redhat.com wrote:
>>
>>> From: Ralph Campbell 
>>>
>>> The hmm_mirror_register() function registers a callback for when
>>> the CPU pagetable is modified. Normally, the device driver will
>>> call hmm_mirror_unregister() when the process using the device is
>>> finished. However, if the process exits uncleanly, the struct_mm
>>> can be destroyed with no warning to the device driver.
>>
>> The changelog doesn't tell us what the runtime effects of the bug are. 
>> This makes it hard for me to answer the "did Jerome consider doing
>> cc:stable" question.
> 
> The impact is low, they might be issue only if application is kill,
> and we don't have any upstream user yet hence why i did not cc
> stable.
> 

Hi Jerome and Andrew,

I'd claim that it is not possible to make a safe and correct device
driver, without this patch. That's because, without the .release callback
that you're adding here, the driver could end up doing operations on a 
stale struct_mm, leading to crashes and other disasters.

Even if people think that maybe that window is "small", it's not really
any smaller than lots of race condition problems that we've seen. And
it is definitely not that hard to hit it: just a good directed stress
test involving multiple threads that are doing early process termination
while also doing lots of migrations and page faults, should suffice.

It is probably best to add this patch to stable, for that reason. 

thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH 4/4] mm/hmm: change CPU page table snapshot functions to simplify drivers

2018-03-15 Thread John Hubbard
On 03/15/2018 11:37 AM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> This change hmm_vma_fault() and hmm_vma_get_pfns() API to allow HMM
> to directly write entry that can match any device page table entry
> format. Device driver now provide an array of flags value and we use
> enum to index this array for each flag.
> 
> This also allow the device driver to ask for write fault on a per page
> basis making API more flexible to service multiple device page faults
> in one go.
> 

Hi Jerome,

This is a large patch, so I'm going to review it in two passes. The first 
pass is just an overview plus the hmm.h changes (now), and tomorrow I will
review the hmm.c, which is where the real changes are.

Overview: the hmm.c changes are doing several things, and it is difficult to
review, because refactoring, plus new behavior, makes diffs less useful here.
It would probably be good to split the hmm.c changes into a few patches, such
as:

-- HMM_PFN_FLAG_* changes, plus function signature changes (hmm_range*
   being passed to functions), and
-- New behavior in the page handling loops, and 
-- Refactoring into new routines (hmm_vma_handle_pte, and others)

That way, reviewers can see more easily that things are correct. 

> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 130 +++--
>  mm/hmm.c| 331 
> +---
>  2 files changed, 249 insertions(+), 212 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 61b0e1c05ee1..34e8a8c65bbd 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -80,11 +80,10 @@
>  struct hmm;
>  
>  /*
> - * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
> + * uint64_t - HMM uses its own pfn type to keep several flags per page

This line now is a little odd, because it looks like it's trying to document
uint64_t as an HMM pfn type. :) Maybe:

* HMM pfns are of type uint64_t

...or else just delete it, either way.

>   *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid

All of these are missing a _FLAG_ piece. The above should be HMM_PFN_FLAG_VALID,
to match the enum below.

> - * HMM_PFN_READ:  CPU page table has read permission set

So why is it that we don't need the _READ flag anymore? I looked at the
corresponding hmm.c but still don't quite get it. Is it that we just expect
that _READ is always set if there is an entry at all? Or something else?

>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned 
> memory
>   * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
> @@ -92,64 +91,94 @@ struct hmm;
>   *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should 
> not
>   *  be mirrored by a device, because the entry will never have 
> HMM_PFN_VALID
>   *  set and the pfn value is undefined.
> - * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
> + * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
>   */
> -typedef unsigned long hmm_pfn_t;
> +enum hmm_pfn_flag_e {
> + HMM_PFN_FLAG_VALID = 0,
> + HMM_PFN_FLAG_WRITE,
> + HMM_PFN_FLAG_ERROR,
> + HMM_PFN_FLAG_NONE,
> + HMM_PFN_FLAG_SPECIAL,
> + HMM_PFN_FLAG_DEVICE_PRIVATE,
> + HMM_PFN_FLAG_MAX
> +};
> +
> +/*
> + * struct hmm_range - track invalidation lock on virtual address range
> + *
> + * @vma: the vm area struct for the range
> + * @list: all range lock are on a list
> + * @start: range virtual start address (inclusive)
> + * @end: range virtual end address (exclusive)
> + * @pfns: array of pfns (big enough for the range)
> + * @flags: pfn flags to match device driver page table
> + * @valid: pfns array did not change since it has been fill by an HMM 
> function
> + */
> +struct hmm_range {
> + struct vm_area_struct   *vma;
> + struct list_headlist;
> + unsigned long   start;
> + unsigned long   end;
> + uint64_t*pfns;
> + const uint64_t  *flags;
> + uint8_t pfn_shift;
> + boolvalid;
> +};
> +#define HMM_RANGE_PFN_FLAG(f) (range->flags[HMM_PFN_FLAG_##f])

Please please please no. :)  This breaks grep without actually adding any value.
It's not as if you need to build up a whole set of symmetric macros like
the Page* flags do, after all. So we can keep this very simple, instead.

I've looked through the hmm.c and it's always just something like
HMM_RANGE_PFN_FLAG(WRITE), so there really is no need for this macro at all.

Just use the flags array directly at each call site.
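Something like this, just to illustrate (the second form is simply the macro's own
expansion, using the hmm_range definition quoted above):

    HMM_RANGE_PFN_FLAG(WRITE)               /* hides "range", breaks grep */
    range->flags[HMM_PFN_FLAG_WRITE]        /* identical, and grep-able   */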

Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-16 Thread John Hubbard
On 03/16/2018 02:26 PM, Jerome Glisse wrote:
> On Fri, Mar 16, 2018 at 02:12:21PM -0700, Andrew Morton wrote:
>> On Fri, 16 Mar 2018 15:14:08 -0400 jgli...@redhat.com wrote:
>>
>>> The hmm_mirror_register() function registers a callback for when
>>> the CPU pagetable is modified. Normally, the device driver will
>>> call hmm_mirror_unregister() when the process using the device is
>>> finished. However, if the process exits uncleanly, the struct_mm
>>> can be destroyed with no warning to the device driver.
>>
>> Again, what are the user-visible effects of the bug?  Such info is
>> needed when others review our request for a -stable backport.  And the
>> many people who review -stable patches for integration into their own
>> kernel trees will want to understand the benefit of the patch to their
>> users.
> 
> I have not had any issues in any of my own testing but nouveau driver
> is not as advance as the NVidia closed driver in respect to HMM inte-
> gration yet.
> 
> If any issues they will happen between exit_mm() and exit_files() in
> do_exit() (kernel/exit.c) exit_mm() tear down the mm struct but without
> this callback the device driver might still be handling page fault and
> thus might potentialy tries to handle them against a dead mm_struct.
> 
> So i am not sure what are the symptoms. To be fair there is no public
> driver using that part of HMM beside nouveau rfc patches. So at this
> point the impact on anybody is non existent. If anyone want to back-
> port nouveau HMM support once it make it upstream it will probably
> have to backport more things along the way. This is why i am not that
> aggressive on ccing stable so far.

The problem I'd like to avoid is: having a version of HMM in stable that
is missing this new callback. And without it, once the driver starts doing
actual concurrent operations, we can expect that the race condition will
happen.

It just seems unfortunate to have stable versions out there that would
be exposed to this, when it only requires a small patch to avoid it.

On the other hand, it's also reasonable to claim that this is part of the
evolving HMM feature, and as such, this new feature does not belong in
stable.  I'm not sure which argument carries more weight here.

thanks,
-- 
John Hubbard
NVIDIA

> 
> Cheers,
> Jérôme
> 


Re: [PATCH 02/14] mm/hmm: fix header file if/else/endif maze

2018-03-16 Thread John Hubbard
On 03/16/2018 02:35 PM, Andrew Morton wrote:
> On Fri, 16 Mar 2018 17:18:02 -0400 Jerome Glisse  wrote:
> 
>> On Fri, Mar 16, 2018 at 02:09:59PM -0700, Andrew Morton wrote:
>>> On Fri, 16 Mar 2018 15:14:07 -0400 jgli...@redhat.com wrote:
>>>
>>>> From: Jérôme Glisse 
>>>>
>>>> The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.
>>>
>>> "were wrong" is not a sufficient explanation of the problem, especially
>>> if we're requesting a -stable backport.  Please fully describe the
>>> effects of a bug when fixing it?
>>
>> Build issue (compilation failure) if you have multiple includes of
>> hmm.h through different headers is the most obvious issue. So it
>> will be very obvious with any big driver that include the file in
>> different headers.
> 
> That doesn't seem to warrant a -stable backport?  The developer of such
> a driver will simply fix the headers?

Right. For this patch, I would strongly request a -stable backport.  It's 
really going to cause problems if anyone tries to use -stable with HMM,
without this fix.

thanks,
-- 
John Hubbard
NVIDIA

> 
>> I can respin with that. Sorry again for not being more explanatory
>> it is always hard for me to figure what is not obvious to others.
> 
> I updated the changelog, no respin needed.
> 


Re: [PATCH 04/14] mm/hmm: hmm_pfns_bad() was accessing wrong struct

2018-03-16 Thread John Hubbard
On 03/16/2018 12:14 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> The private field of mm_walk struct point to an hmm_vma_walk struct and
> not to the hmm_range struct desired. Fix to get proper struct pointer.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: sta...@vger.kernel.org
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  mm/hmm.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6088fa6ed137..64d9e7dae712 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -293,7 +293,8 @@ static int hmm_pfns_bad(unsigned long addr,
>   unsigned long end,
>   struct mm_walk *walk)
>  {
> - struct hmm_range *range = walk->private;
> + struct hmm_vma_walk *hmm_vma_walk = walk->private;
> + struct hmm_range *range = hmm_vma_walk->range;
>   hmm_pfn_t *pfns = range->pfns;
>   unsigned long i;
>  

This fix looks good. I also checked the other uses of walk->private, of course, 
but it was only this one that was wrong.

I think this patch also belongs in -stable, because it is a simple bug fix.

For the description, well...actually, because ->range is the first element in
struct hmm_vma_walk, you probably end up with the same pointer value, both
before and after this fix. So maybe there are no symptoms to see. Maybe that's
an argument for *not* putting it in -stable, too. I'll leave that question
to more experienced people.

Either way, you can add: 

Reviewed-by: John Hubbard 

thanks,
-- 
John Hubbard
NVIDIA
 



Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-16 Thread John Hubbard
On 03/16/2018 12:14 PM, jgli...@redhat.com wrote:
> From: Ralph Campbell 
> 



> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> +     struct hmm *hmm = mm->hmm;
> +     struct hmm_mirror *mirror;
> +     struct hmm_mirror *mirror_next;
> +
> +     down_write(&hmm->mirrors_sem);
> +     list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
> +             list_del_init(&mirror->list);
> +             if (mirror->ops->release)
> +                     mirror->ops->release(mirror);
> +     }
> +     up_write(&hmm->mirrors_sem);
> +}
> +

OK, as for actual code review:

This part of the locking looks good. However, I think it can race against
hmm_mirror_register(), because hmm_mirror_register() will just add a new 
mirror regardless.

So:

thread 1                              thread 2
------------------------------------------------------------------
hmm_release                           hmm_mirror_register
down_write(&hmm->mirrors_sem);
// deletes all list items
up_write
                                      unblocked: adds new mirror

...so I think we need a way to back out of any pending hmm_mirror_register()
calls, as part of the .release steps, right? It seems hard for the device driver,
which could be inside of hmm_mirror_register(), to handle that. Especially
considering that right now, hmm_mirror_register() will return success in this
case--so there is no indication that anything is wrong.

Maybe hmm_mirror_register() could return an error (and not add to the mirror
list) in such a situation -- how's that sound?

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 05/14] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters

2018-03-16 Thread John Hubbard
On 03/16/2018 12:14 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 

Hi Jerome,

I failed to find any problems in this patch, so:

Reviewed-by: John Hubbard 

There are a couple of documentation recommended typo fixes listed
below, which are very minor but as long as I'm here I'll point them out.

> Both hmm_vma_fault() and hmm_vma_get_pfns() were taking a hmm_range
> struct as parameter and were initializing that struct with others of
> their parameters. Have caller of those function do this as they are
> likely to already do and only pass this struct to both function this
> shorten function signature and make it easiers in the future to add

 easier

> new parameters by simply adding them to the structure.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 18 -
>  mm/hmm.c| 78 
> +++--
>  2 files changed, 33 insertions(+), 63 deletions(-)



>  
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 64d9e7dae712..49f0f6b337ed 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -490,11 +490,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>  
>  /*
>   * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual 
> addresses
> - * @vma: virtual memory area containing the virtual address range
> - * @range: used to track snapshot validity
> - * @start: range virtual start address (inclusive)
> - * @end: range virtual end address (exclusive)
> - * @entries: array of hmm_pfn_t: provided by the caller, filled in by 
> function
> + * @range: range being snapshoted and all needed informations

Let's just say this:

* @range: range being snapshotted




> @@ -628,11 +617,7 @@ EXPORT_SYMBOL(hmm_vma_range_done);
>  
>  /*
>   * hmm_vma_fault() - try to fault some address in a virtual address range
> - * @vma: virtual memory area containing the virtual address range
> - * @range: use to track pfns array content validity
> - * @start: fault range virtual start address (inclusive)
> - * @end: fault range virtual end address (exclusive)
> - * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
> + * @range: range being faulted and all needed informations

Similarly here, let's just write it like this:

* @range: range being faulted


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 06/14] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture

2018-03-16 Thread John Hubbard
On 03/16/2018 12:14 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> Only peculiar architecture allow write without read thus assume that
> any valid pfn do allow for read. Note we do not care for write only
> because it does make sense with thing like atomic compare and exchange
> or any other operations that allow you to get the memory value through
> them.
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 14 ++
>  mm/hmm.c| 28 
>  2 files changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index b65e527dd120..4bdc58ffe9f3 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -84,7 +84,6 @@ struct hmm;
>   *
>   * Flags:
>   * HMM_PFN_VALID: pfn is valid

Maybe write it like this:

* HMM_PFN_VALID: pfn is valid. This implies that it has, at least, read permission.

> - * HMM_PFN_READ:  CPU page table has read permission set
>   * HMM_PFN_WRITE: CPU page table has write permission set
>   * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned 
> memory
>   * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
> @@ -97,13 +96,12 @@ struct hmm;
>  typedef unsigned long hmm_pfn_t;
>  
>  #define HMM_PFN_VALID (1 << 0)



>  
> @@ -536,6 +534,17 @@ int hmm_vma_get_pfns(struct hmm_range *range)
>   list_add_rcu(&range->list, &hmm->ranges);
>   spin_unlock(&hmm->lock);
>  
> + if (!(vma->vm_flags & VM_READ)) {
> + /*
> +  * If vma do not allow read assume it does not allow write as
> +  * only peculiar architecture allow write without read and this
> +  * is not a case we care about (some operation like atomic no
> +  * longer make sense).
> +  */
> + hmm_pfns_clear(range->pfns, range->start, range->end);
> + return 0;

1. Shouldn't we return an error here? All is not well. No one has any pfns, even
   though they tried to get some. :)

2. I think this check needs to be done much earlier, right after the "Sanity
   check, this should not happen" code in this routine.

> + }
> +
>   hmm_vma_walk.fault = false;
>   hmm_vma_walk.range = range;
>   mm_walk.private = &hmm_vma_walk;
> @@ -690,6 +699,17 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
>   list_add_rcu(&range->list, &hmm->ranges);
>   spin_unlock(&hmm->lock);
>  
> + if (!(vma->vm_flags & VM_READ)) {
> + /*
> +  * If vma do not allow read assume it does not allow write as
> +  * only peculiar architecture allow write without read and this
> +  * is not a case we care about (some operation like atomic no
> +  * longer make sense).
> +  */

For the comment wording (for this one, and the one above), how about:

/*
 * If the vma does not allow read access, then assume that 
 * it does not allow write access, either.
 */

...and then leave the more extensive explanation to the commit log. Or,
if we really want a longer explanation right here, then:

/*
 * If the vma does not allow read access, then assume that 
 * it does not allow write access, either. Architectures that
 * allow write without read access are not supported by HMM,
 * because operations such as atomic access would not work.
 */


> + hmm_pfns_clear(range->pfns, range->start, range->end);
> + return 0;
> + }

Similar points as above: it seems like an error case, and the check should be
right near the beginning of the function.
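Roughly, here is what I have in mind for both hmm_vma_get_pfns() and
hmm_vma_fault(): do the check right after the existing sanity checks (where vma
is already set up) and report a real error. Just an untested sketch, and -EPERM
is only a placeholder for whatever error code you think fits best:

if (!(vma->vm_flags & VM_READ)) {
        /*
         * If the vma does not allow read access, then assume that it
         * does not allow write access, either. Report it as an error,
         * rather than returning a cleared pfns array and pretending
         * that all is well.
         */
        hmm_pfns_clear(range->pfns, range->start, range->end);
        return -EPERM;
}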

thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-16 Thread John Hubbard
On 03/16/2018 07:36 PM, John Hubbard wrote:
> On 03/16/2018 12:14 PM, jgli...@redhat.com wrote:
>> From: Ralph Campbell 
>>
> 
> 
> 
>> +static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>> +{
>> +    struct hmm *hmm = mm->hmm;
>> +    struct hmm_mirror *mirror;
>> +    struct hmm_mirror *mirror_next;
>> +
>> +    down_write(&hmm->mirrors_sem);
>> +    list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
>> +            list_del_init(&mirror->list);
>> +            if (mirror->ops->release)
>> +                    mirror->ops->release(mirror);
>> +    }
>> +    up_write(&hmm->mirrors_sem);
>> +}
>> +
> 
> OK, as for actual code review:
> 
> This part of the locking looks good. However, I think it can race against
> hmm_mirror_register(), because hmm_mirror_register() will just add a new 
> mirror regardless.
> 
> So:
> 
> thread 1                              thread 2
> ------------------------------------------------------------------
> hmm_release                           hmm_mirror_register
> down_write(&hmm->mirrors_sem);
> // deletes all list items
> up_write
>                                       unblocked: adds new mirror
> 
> 
> ...so I think we need a way to back out of any pending hmm_mirror_register()
> calls, as part of the .release steps, right? It seems hard for the device driver,
> which could be inside of hmm_mirror_register(), to handle that. Especially
> considering that right now, hmm_mirror_register() will return success in this
> case--so there is no indication that anything is wrong.
> 
> Maybe hmm_mirror_register() could return an error (and not add to the mirror
> list) in such a situation -- how's that sound?
> 

In other words, I think this would help (not tested yet beyond a quick compile,
but it's pretty simple):

diff --git a/mm/hmm.c b/mm/hmm.c
index 7ccca5478ea1..da39f8522dca 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -66,6 +66,7 @@ struct hmm {
         struct list_head        mirrors;
         struct mmu_notifier     mmu_notifier;
         struct rw_semaphore     mirrors_sem;
+        bool                    shutting_down;
 };
 
 /*
@@ -99,6 +100,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
         INIT_LIST_HEAD(&hmm->ranges);
         spin_lock_init(&hmm->lock);
         hmm->mm = mm;
+        hmm->shutting_down = false;
 
         /*
          * We should only get here if hold the mmap_sem in write mode ie on
@@ -167,6 +169,7 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
         struct hmm_mirror *mirror_next;
 
         down_write(&hmm->mirrors_sem);
+        hmm->shutting_down = true;
         list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
                 list_del_init(&mirror->list);
                 if (mirror->ops->release)
@@ -227,6 +230,10 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
                 return -ENOMEM;
 
         down_write(&mirror->hmm->mirrors_sem);
+        if (mirror->hmm->shutting_down) {
+                up_write(&mirror->hmm->mirrors_sem);
+                return -ESRCH;
+        }
         list_add(&mirror->list, &mirror->hmm->mirrors);
         up_write(&mirror->hmm->mirrors_sem);
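
On the driver side, the only new thing to handle would be the error return.
Hypothetically, it could look something like this (driver names made up here,
purely to show the intended usage):

        /* Hypothetical driver code: mydev_mirror embeds a struct hmm_mirror. */
        int ret = hmm_mirror_register(&mydev_mirror->mirror, current->mm);
        if (ret) {
                /* -ESRCH means the mm is already being torn down; bail out. */
                kfree(mydev_mirror);
                return ret;
        }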


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH 07/14] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong

2018-03-16 Thread John Hubbard
On 03/16/2018 12:14 PM, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 

Hi Jerome,

This one looks great. A couple of trivial typo fixes are listed below.

You can add:

Reviewed-by: John Hubbard 

> All device driver we care about are using 64bits page table entry. In
> order to match this and to avoid useless define convert all HMM pfn to
> directly use uint64_t. It is a first step on the road to allow driver
> to directly use pfn value return by HMM (saving memory and CPU cycles
> use for convertion between the two).

  used for conversion
> 
> Signed-off-by: Jérôme Glisse 
> Cc: Evgeny Baskakov 
> Cc: Ralph Campbell 
> Cc: Mark Hairgrove 
> Cc: John Hubbard 
> ---
>  include/linux/hmm.h | 46 +-
>  mm/hmm.c| 26 +-
>  2 files changed, 34 insertions(+), 38 deletions(-)
> 



> @@ -104,14 +100,14 @@ typedef unsigned long hmm_pfn_t;
>  #define HMM_PFN_SHIFT 6
>  
>  /*
> - * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
> - * @pfn: hmm_pfn_t to convert to struct page
> - * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
> + * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
> + * @pfn: HMM pfn value to get corresponding struct page from
> + * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
>   *
> - * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
> - * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
> + * If the uint64_t is valid (ie valid flag set) then return the struct page

  If the HMM pfn is valid



>  
> @@ -634,8 +634,8 @@ EXPORT_SYMBOL(hmm_vma_range_done);
>   * This is similar to a regular CPU page fault except that it will not 
> trigger
>   * any memory migration if the memory being faulted is not accessible by 
> CPUs.
>   *
> - * On error, for one virtual address in the range, the function will set the
> - * hmm_pfn_t error flag for the corresponding pfn entry.
> + * On error, for one virtual address in the range, the function will mark the
> + * correspond HMM pfn entry with error flag.

  corresponding HMM pfn entry with an error flag.

thanks,
-- 
John Hubbard
NVIDIA


