[PATCH] powerpc/512x: Use dma_request_chan() instead of dma_request_slave_channel()

2019-12-16 Thread Peter Ujfalusi
dma_request_slave_channel() is a wrapper on top of dma_request_chan()
that eats up the error code.

By using dma_request_chan() directly, the driver can support deferred
probing against DMA.
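
For drivers doing the same conversion, the probe-time pattern looks
roughly like this (a minimal sketch, not this driver's exact code; the
separate error message for the non-deferral case is an addition on top
of this patch):

	int ret;

	/*
	 * Request the channel; failure may just mean the DMA controller
	 * has not probed yet, so pass -EPROBE_DEFER up instead of
	 * swallowing it.
	 */
	lpbfifo.chan = dma_request_chan(&pdev->dev, "rx-tx");
	if (IS_ERR(lpbfifo.chan)) {
		ret = PTR_ERR(lpbfifo.chan);
		if (ret != -EPROBE_DEFER)
			dev_err(&pdev->dev, "DMA channel request failed: %d\n", ret);
		return ret;
	}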

Signed-off-by: Peter Ujfalusi 
---
 arch/powerpc/platforms/512x/mpc512x_lpbfifo.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
index 13631f35cd14..04bf6ecf7d55 100644
--- a/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
+++ b/arch/powerpc/platforms/512x/mpc512x_lpbfifo.c
@@ -434,9 +434,9 @@ static int mpc512x_lpbfifo_probe(struct platform_device *pdev)
 	memset(&lpbfifo, 0, sizeof(struct lpbfifo_data));
 	spin_lock_init(&lpbfifo.spinlock);
 
-	lpbfifo.chan = dma_request_slave_channel(&pdev->dev, "rx-tx");
-	if (lpbfifo.chan == NULL)
-		return -EPROBE_DEFER;
+	lpbfifo.chan = dma_request_chan(&pdev->dev, "rx-tx");
+	if (IS_ERR(lpbfifo.chan))
+		return PTR_ERR(lpbfifo.chan);
 
 	if (of_address_to_resource(pdev->dev.of_node, 0, &r) != 0) {
 		dev_err(&pdev->dev, "bad 'reg' in 'sclpc' device tree node\n");
-- 
Peter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki



Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-16 Thread Jan Kara
Hi!

On Mon 16-12-19 14:25:12, John Hubbard wrote:
> Hi,
> 
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Just a note for Andrew and others watching this series: at this point I'm
fine with the series, so if someone still has review feedback or wants to
check the series, now is the right time. Otherwise I think Andrew can push
the series to the MM tree so that it gets wider testing exposure and is
prepared for the next merge window.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread Jan Kara
On Mon 16-12-19 14:18:59, John Hubbard wrote:
> On 12/16/19 4:53 AM, Jan Kara wrote:
> > With this fixed, the patch looks good to me so you can then add:
> > 
> > Reviewed-by: Jan Kara 
> > 
> > Honza
> > 
> 
> btw, thanks for the thorough review of this critical patch (and for your
> patience with my mistakes). I really appreciate it, and this patchset would
> not have made it this far without your detailed help and explanations.

You're welcome! I'd also like to thank you for persistently driving this
series :)

Honza
-- 
Jan Kara 
SUSE Labs, CR


[RFC PATCH 1/2] mm/mmu_gather: Invalidate TLB correctly on batch allocation failure and flush

2019-12-16 Thread Aneesh Kumar K.V
Architectures that have hardware walkers of the Linux page table should
flush the TLB on mmu gather batch allocation failures and batch flush.
Some architectures like POWER support multiple translation modes (hash
and radix), and in the case of POWER only the radix translation mode
needs the above TLBI. This is because in hash translation mode the
kernel wants to avoid this extra flush, since there are no hardware
walkers of the Linux page table. With radix translation, the hardware
also walks the Linux page table, and with that, the kernel needs to make
sure to TLB invalidate the page walk cache before page table pages are
freed.
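
For context, the generic code after this change behaves roughly as
follows (a simplified sketch of mm/mmu_gather.c, with details elided):

	static inline void tlb_table_invalidate(struct mmu_gather *tlb)
	{
		if (tlb_needs_table_invalidate()) {
			/*
			 * Invalidate page-table caches used by hardware
			 * walkers. Only needed when the architecture's
			 * hardware walks the Linux page tables (e.g.
			 * radix on POWER, but not hash).
			 */
			tlb_flush_mmu_tlbonly(tlb);
		}
	}

	void tlb_remove_table(struct mmu_gather *tlb, void *table)
	{
		struct mmu_table_batch **batch = &tlb->batch;

		if (*batch == NULL) {
			*batch = (struct mmu_table_batch *)
				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
			if (*batch == NULL) {
				/* Batch allocation failed: invalidate
				 * before freeing the page directly. */
				tlb_table_invalidate(tlb);
				tlb_remove_table_one(table);
				return;
			}
			(*batch)->nr = 0;
		}
		(*batch)->entries[(*batch)->nr++] = table;
		if ((*batch)->nr == MAX_TABLE_BATCH)
			tlb_table_flush(tlb);
	}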

More details in
commit: d86564a2f085 ("mm/tlb, x86/mm: Support invalidating TLB caches for 
RCU_TABLE_FREE")

Based on changes from Peter Zijlstra 

Signed-off-by: Aneesh Kumar K.V 
---
 arch/Kconfig|  3 ---
 arch/powerpc/Kconfig|  1 -
 arch/powerpc/include/asm/tlb.h  |  4 
 arch/sparc/Kconfig  |  1 -
 arch/sparc/include/asm/tlb_64.h |  9 +
 include/asm-generic/tlb.h   | 22 +++---
 mm/mmu_gather.c | 16 
 7 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 48b5e103bdb0..208aad121630 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -396,9 +396,6 @@ config HAVE_ARCH_JUMP_LABEL_RELATIVE
 config HAVE_RCU_TABLE_FREE
bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-   bool
-
 config HAVE_MMU_GATHER_PAGE_SIZE
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1ec34e16ed65..a15f5584b0de 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,7 +223,6 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MMU_GATHER_PAGE_SIZE
select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b2c0be93929d..feea1a09bbce 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -27,6 +27,10 @@
 #define tlb_flush tlb_flush
 extern void tlb_flush(struct mmu_gather *tlb);
 
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   radix_enabled()
+#endif
+
 /* Get the generic bits... */
 #include <asm-generic/tlb.h>
 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index eb24cb1afc11..18e9fb6fcf1b 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -65,7 +65,6 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_RCU_TABLE_FREE if SMP
-   select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb) flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()   (false)
+#endif
+
 #include <asm-generic/tlb.h>
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2b10036fefd0..dcdf13fc0a0b 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -137,13 +137,6 @@
  *  When used, an architecture is expected to provide __tlb_remove_table()
  *  which does the actual freeing of these pages.
  *
- *  HAVE_RCU_TABLE_NO_INVALIDATE
- *
- *  This makes HAVE_RCU_TABLE_FREE avoid calling tlb_flush_mmu_tlbonly() before
- *  freeing the page-table pages. This can be avoided if you use
- *  HAVE_RCU_TABLE_FREE and your architecture does _NOT_ use the Linux
- *  page-tables natively.
- *
  *  MMU_GATHER_NO_RANGE
  *
  *  Use this if your architecture lacks an efficient flush_tlb_range().
@@ -189,8 +182,23 @@ struct mmu_table_batch {
 
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_needs_table_invalidate() (true)
+#endif
+
+#else
+
+#ifdef tlb_needs_table_invalidate
+#error tlb_needs_table_invalidate() requires MMU_GATHER_RCU_TABLE_FREE
 #endif
 
+#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
+
+
 #ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
 /*
  * If we can't allocate a page to make a big batch of page pointers
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 

[RFC PATCH 2/2] mm/mmu_gather: Avoid multiple page walk cache flush

2019-12-16 Thread Aneesh Kumar K.V
On tlb_finish_mmu() the kernel does a TLB flush before the mmu gather
table invalidate. The mmu gather table invalidate, depending on kernel
config, also does another TLBI. Avoid the latter on tlb_finish_mmu().
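
The sequence being optimised looks roughly like this (a simplified
sketch; tlb_flush_mmu() is called from tlb_finish_mmu()):

	void tlb_flush_mmu(struct mmu_gather *tlb)
	{
		/* First TLBI: flushes the TLB and, on radix, the page
		 * walk cache for the gathered range. */
		tlb_flush_mmu_tlbonly(tlb);
		/* The table-batch free used to issue a second TLBI via
		 * tlb_table_invalidate(); with this patch it passes
		 * table_inval = false and skips it. */
		tlb_flush_mmu_free(tlb);
	}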

Signed-off-by: Aneesh Kumar K.V 
---
 mm/mmu_gather.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 7c1b8f67af7b..7e2bd43b9084 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -143,17 +143,23 @@ static void tlb_remove_table_rcu(struct rcu_head *head)
free_page((unsigned long)batch);
 }
 
-static void tlb_table_flush(struct mmu_gather *tlb)
+static void __tlb_table_flush(struct mmu_gather *tlb, bool table_inval)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
if (*batch) {
-   tlb_table_invalidate(tlb);
+   if (table_inval)
+   tlb_table_invalidate(tlb);
call_rcu(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
}
 }
 
+static void tlb_table_flush(struct mmu_gather *tlb)
+{
+   __tlb_table_flush(tlb, true);
+}
+
 void tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
@@ -178,7 +184,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 {
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
-   tlb_table_flush(tlb);
+   __tlb_table_flush(tlb, false);
 #endif
 #ifndef CONFIG_HAVE_MMU_GATHER_NO_GATHER
tlb_batch_pages_flush(tlb);
-- 
2.23.0



Re: [PATCH v18 11/13] open: introduce openat2(2) syscall

2019-12-16 Thread Aleksa Sarai
On 2019-12-16, Florian Weimer  wrote:
> > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> > index 1d338357df8a..58c3a0e543c6 100644
> > --- a/include/uapi/linux/fcntl.h
> > +++ b/include/uapi/linux/fcntl.h
> > @@ -93,5 +93,40 @@
> >  
> >  #define AT_RECURSIVE	0x8000	/* Apply to the entire subtree */
> >  
> > +/*
> > + * Arguments for how openat2(2) should open the target path. If @resolve is
> > + * zero, then openat2(2) operates very similarly to openat(2).
> > + *
> > + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> > + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> > + * O_TMPFILE} are set.
> > + *
> > + * @flags: O_* flags.
> > + * @mode: O_CREAT/O_TMPFILE file mode.
> > + * @resolve: RESOLVE_* flags.
> > + */
> > +struct open_how {
> > +   __aligned_u64 flags;
> > +   __u16 mode;
> > +   __u16 __padding[3]; /* must be zeroed */
> > +   __aligned_u64 resolve;
> > +};
> > +
> > +#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
> > +#define OPEN_HOW_SIZE_LATEST   OPEN_HOW_SIZE_VER0
> > +
> > +/* how->resolve flags for openat2(2). */
> > +#define RESOLVE_NO_XDEV		0x01 /* Block mount-point crossings
> > +					(includes bind-mounts). */
> > +#define RESOLVE_NO_MAGICLINKS	0x02 /* Block traversal through procfs-style
> > +					"magic-links". */
> > +#define RESOLVE_NO_SYMLINKS	0x04 /* Block traversal through all symlinks
> > +					(implies OEXT_NO_MAGICLINKS) */
> > +#define RESOLVE_BENEATH		0x08 /* Block "lexical" trickery like
> > +					"..", symlinks, and absolute
> > +					paths which escape the dirfd. */
> > +#define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
> > +					be scoped inside the dirfd
> > +					(similar to chroot(2)). */
> >  
> >  #endif /* _UAPI_LINUX_FCNTL_H */
> 
> Would it be possible to move these to a new UAPI header?
> 
> In glibc, we currently do not #include .  We need some of
> the AT_* constants in POSIX mode, and the header is not necessarily
> namespace-clean.  If there was a separate header for openat2 support, we
> could use that easily, and we would only have to maintain the baseline
> definitions (which never change).

Sure, (assuming nobody objects) I can move it to "linux/openat2.h".
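
For anyone experimenting before a libc wrapper exists, calling
openat2(2) goes through syscall(2) either way, so the header move mostly
decides where struct open_how and the RESOLVE_* constants come from. A
minimal sketch (the linux/openat2.h include path is the proposed new
location, and __NR_openat2 is assumed to be wired up in the installed
kernel headers):

	#define _GNU_SOURCE		/* for O_PATH */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/openat2.h>	/* proposed home of struct open_how */

	int main(void)
	{
		struct open_how how;
		int dirfd, fd;

		memset(&how, 0, sizeof(how));	/* padding must be zeroed */
		how.flags = O_RDONLY;
		how.resolve = RESOLVE_IN_ROOT;	/* treat dirfd as "/" */

		dirfd = open("/srv/root", O_PATH | O_DIRECTORY);
		fd = syscall(__NR_openat2, dirfd, "etc/passwd",
			     &how, sizeof(how));
		if (fd < 0)
			perror("openat2");
		else
			close(fd);
		return 0;
	}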

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





[PATCH V3 13/13] powerpc/vas: Free send window in VAS instance after credits returned

2019-12-16 Thread Haren Myneni


NX may still be processing requests while the window is being closed.
Wait until all credits are returned, and only then free the send window
from the VAS instance.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index e6ea7d3..aded7a5 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1316,14 +1316,14 @@ int vas_win_close(struct vas_window *window)
 
unmap_paste_region(window);
 
-   clear_vinst_win(window);
-
poll_window_busy_state(window);
 
unpin_close_window(window);
 
poll_window_credits(window);
 
+   clear_vinst_win(window);
+
poll_window_castout(window);
 
/* if send window, drop reference to matching receive window */
-- 
1.8.3.1





[PATCH V3 12/13] powerpc/vas: Display process stuck message

2019-12-16 Thread Haren Myneni


A process can not close its send window until all requests are
processed: wait until the window state is no longer busy and all send
credits are returned. Display a debug message in case closing the
window takes longer than expected.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index 8d90d25..e6ea7d3 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1183,6 +1183,7 @@ static void poll_window_credits(struct vas_window *window)
 {
u64 val;
int creds, mode;
+   int count = 0;
 
val = read_hvwc_reg(window, VREG(WINCTL));
if (window->tx_win)
@@ -1201,10 +1202,25 @@ static void poll_window_credits(struct vas_window *window)
creds = GET_FIELD(VAS_LRX_WCRED, val);
}
 
+   /*
+	 * Takes around a few microseconds to complete all pending requests
+* and return credits.
+* TODO: Issue CRB Kill to stop all pending requests. Need only
+*   if there is a bug in NX or fault handling in kernel.
+*/
if (creds < window->wcreds_max) {
val = 0;
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(msecs_to_jiffies(10));
+   count++;
+   /*
+* Process can not close send window until all credits are
+* returned.
+*/
+   if (!(count % 1))
+   pr_debug("%s() pid %d stuck? retries %d\n", __func__,
+   vas_window_pid(window), count);
+
goto retry;
}
 }
@@ -1218,6 +1234,7 @@ static void poll_window_busy_state(struct vas_window *window)
 {
int busy;
u64 val;
+   int count = 0;
 
 retry:
val = read_hvwc_reg(window, VREG(WIN_STATUS));
@@ -1226,6 +1243,15 @@ static void poll_window_busy_state(struct vas_window *window)
val = 0;
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(msecs_to_jiffies(5));
+   count++;
+   /*
+* Takes around 5 microseconds to process all pending
+* requests.
+*/
+   if (!(count % 1))
+   pr_debug("%s() pid %d stuck? retries %d\n", __func__,
+   vas_window_pid(window), count);
+
goto retry;
}
 }
-- 
1.8.3.1





[PATCH V3 11/13] powerpc/VAS: Return credits after handling fault

2019-12-16 Thread Haren Myneni


NX expects the OS to return a credit for the send window after
processing each fault. A credit also has to be returned for the fault
window itself.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c  | 10 ++
 arch/powerpc/platforms/powernv/vas-window.c | 17 +
 arch/powerpc/platforms/powernv/vas.h|  1 +
 3 files changed, 28 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
index 6e8b8c7..eadfee7 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -238,6 +238,11 @@ irqreturn_t vas_fault_handler(int irq, void *data)
memset(fifo, 0, CRB_SIZE);
mutex_unlock(>mutex);
 
+   /*
+* Return credit for the fault window.
+*/
+   vas_return_credit(vinst->fault_win, 0);
+
pr_devel("VAS[%d] fault_fifo %p, fifo %p, fault_crbs %d\n",
vinst->vas_id, vinst->fault_fifo, fifo,
vinst->fault_crbs);
@@ -264,6 +269,11 @@ irqreturn_t vas_fault_handler(int irq, void *data)
}
 
update_csb(window, crb);
+   /*
+* Return credit for send window after processing
+* fault CRB.
+*/
+   vas_return_credit(window, 1);
} while (true);
 
return IRQ_HANDLED;
diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index 8428970..8d90d25 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1319,6 +1319,23 @@ int vas_win_close(struct vas_window *window)
 }
 EXPORT_SYMBOL_GPL(vas_win_close);
 
+/*
+ * Return credit for the given window.
+ */
+void vas_return_credit(struct vas_window *window, bool tx)
+{
+   uint64_t val;
+
+   val = 0ULL;
+   if (tx) { /* send window */
+   val = SET_FIELD(VAS_TX_WCRED, val, 1);
+   write_hvwc_reg(window, VREG(TX_WCRED_ADDER), val);
+   } else {
+   val = SET_FIELD(VAS_LRX_WCRED, val, 1);
+   write_hvwc_reg(window, VREG(LRX_WCRED_ADDER), val);
+   }
+}
+
 struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
uint32_t pswid)
 {
diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
index f5f45ea..495937a 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -415,6 +415,7 @@ struct vas_winctx {
 extern void vas_window_free_dbgdir(struct vas_window *win);
 extern int vas_setup_fault_window(struct vas_instance *vinst);
 extern irqreturn_t vas_fault_handler(int irq, void *data);
+extern void vas_return_credit(struct vas_window *window, bool tx);
 extern struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
uint32_t pswid);
 
-- 
1.8.3.1





[PATCH V3 10/13] powerpc/vas: Do not use default credits for receive window

2019-12-16 Thread Haren Myneni


The system checkstops if the RxFIFO overruns with more requests than
the maximum number of CRBs allowed in the FIFO at any time. So the max
credits value (rxattr.wcreds_max) is now always set by the driver and
passed to vas_rx_win_open().

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 4 ++--
 arch/powerpc/platforms/powernv/vas.h| 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index 0f27ac5..8428970 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -772,7 +772,7 @@ static bool rx_win_args_valid(enum vas_cop_type cop,
if (attr->rx_fifo_size > VAS_RX_FIFO_SIZE_MAX)
return false;
 
-   if (attr->wcreds_max > VAS_RX_WCREDS_MAX)
+   if (!attr->wcreds_max)
return false;
 
if (attr->nx_win) {
@@ -878,7 +878,7 @@ struct vas_window *vas_rx_win_open(int vasid, enum vas_cop_type cop,
rxwin->nx_win = rxattr->nx_win;
rxwin->user_win = rxattr->user_win;
rxwin->cop = cop;
-   rxwin->wcreds_max = rxattr->wcreds_max ?: VAS_WCREDS_DEFAULT;
+   rxwin->wcreds_max = rxattr->wcreds_max;
 
init_winctx_for_rxwin(rxwin, rxattr, );
init_winctx_regs(rxwin, );
diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
index af03aa0..f5f45ea 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -101,11 +101,9 @@
 /*
  * Initial per-process credits.
  * Max send window credits:4K-1 (12-bits in VAS_TX_WCRED)
- * Max receive window credits: 64K-1 (16 bits in VAS_LRX_WCRED)
  *
  * TODO: Needs tuning for per-process credits
  */
-#define VAS_RX_WCREDS_MAX  ((64 << 10) - 1)
 #define VAS_TX_WCREDS_MAX  ((4 << 10) - 1)
 #define VAS_WCREDS_DEFAULT (1 << 10)
 
-- 
1.8.3.1





[PATCH V3 09/13] powerpc/vas: Print CRB and FIFO values

2019-12-16 Thread Haren Myneni


Dump the FIFO entry values when the send window can not be found, and
print the CRB for debugging.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
index 45bea15c..6e8b8c7 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -26,6 +26,28 @@
  */
 #define VAS_FAULT_WIN_FIFO_SIZE	(4 << 20)
 
+static void dump_crb(struct coprocessor_request_block *crb)
+{
+   struct data_descriptor_entry *dde;
+   struct nx_fault_stamp *nx;
+
+	dde = &crb->source;
+   pr_devel("SrcDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
+   be64_to_cpu(dde->address), be32_to_cpu(dde->length),
+   dde->count, dde->index, dde->flags);
+
+	dde = &crb->target;
+   pr_devel("TgtDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
+   be64_to_cpu(dde->address), be32_to_cpu(dde->length),
+   dde->count, dde->index, dde->flags);
+
+	nx = &crb->stamp.nx;
+   pr_devel("NX Stamp: PSWID 0x%x, FSA 0x%llx, flags 0x%x, FS 0x%x\n",
+   be32_to_cpu(nx->pswid),
+   be64_to_cpu(crb->stamp.nx.fault_storage_addr),
+   nx->flags, be32_to_cpu(nx->fault_status));
+}
+
 /*
  * Update the CSB to indicate a translation error.
  *
@@ -145,6 +167,23 @@ static void update_csb(struct vas_window *window,
pid_vnr(pid), rc);
 }
 
+static void dump_fifo(struct vas_instance *vinst, void *entry)
+{
+   int i;
+   unsigned long *fifo = entry;
+
+   pr_err("Fault fifo size %d, max crbs %d, crb size %lu\n",
+   vinst->fault_fifo_size,
+   vinst->fault_fifo_size / CRB_SIZE,
+   sizeof(struct coprocessor_request_block));
+
+   pr_err("Fault FIFO Entry Dump:\n");
+   for (i = 0; i < CRB_SIZE; i += 4, fifo += 4) {
+   pr_err("[%.3d, %p]: 0x%.16lx 0x%.16lx 0x%.16lx 0x%.16lx\n",
+   i, fifo, *fifo, *(fifo+1), *(fifo+2), *(fifo+3));
+   }
+}
+
 /*
  * Process CRBs that we receive on the fault window.
  */
@@ -203,6 +242,7 @@ irqreturn_t vas_fault_handler(int irq, void *data)
vinst->vas_id, vinst->fault_fifo, fifo,
vinst->fault_crbs);
 
+   dump_crb(crb);
window = vas_pswid_to_window(vinst,
be32_to_cpu(crb->stamp.nx.pswid));
 
@@ -213,6 +253,7 @@ irqreturn_t vas_fault_handler(int irq, void *data)
 * even clean it up (return credit).
 * But we should not get here.
 */
+   dump_fifo(vinst, (void *)crb);
pr_err("VAS[%d] fault_fifo %p, fifo %p, pswid 0x%x, 
fault_crbs %d bad CRB?\n",
vinst->vas_id, vinst->fault_fifo, fifo,
be32_to_cpu(crb->stamp.nx.pswid),
-- 
1.8.3.1





[PATCH V3 08/13] powerpc/vas: Update CSB and notify process for fault CRBs

2019-12-16 Thread Haren Myneni


For each fault CRB, update the fault address in the CRB
(fault_storage_addr) and the translation error status in the CSB so
that user space can touch the fault address and resend the request. If
user space passed an invalid CSB address, notify the process with a
SIGSEGV signal.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c | 121 +
 1 file changed, 121 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
index 57f21ea..45bea15c 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include <linux/mmu_context.h>
 #include 
 #include 
 
@@ -26,6 +27,125 @@
 #define VAS_FAULT_WIN_FIFO_SIZE	(4 << 20)
 
 /*
+ * Update the CSB to indicate a translation error.
+ *
+ * If the fault is in the CSB address itself or if we are unable to
+ * update the CSB, send a signal to the process, because we have no
+ * other way of notifying the user process.
+ *
+ * Remaining settings in the CSB are based on wait_for_csb() of
+ * NX-GZIP.
+ */
+static void update_csb(struct vas_window *window,
+   struct coprocessor_request_block *crb)
+{
+   int rc;
+   struct pid *pid;
+   void __user *csb_addr;
+   struct task_struct *tsk;
+   struct kernel_siginfo info;
+   struct coprocessor_status_block csb;
+
+   /*
+* NX user space windows can not be opened for task->mm=NULL
+* and faults will not be generated for kernel requests.
+*/
+   if (!window->mm || !window->user_win)
+   return;
+
+   csb_addr = (void *)be64_to_cpu(crb->csb_addr);
+
+   csb.cc = CSB_CC_TRANSLATION;
+   csb.ce = CSB_CE_TERMINATION;
+   csb.cs = 0;
+   csb.count = 0;
+
+   /*
+* Returns the fault address in CPU format since it is passed with
+* signal. But if the user space expects BE format, need changes.
+* i.e either kernel (here) or user should convert to CPU format.
+* Not both!
+*/
+   csb.address = be64_to_cpu(crb->stamp.nx.fault_storage_addr);
+   csb.flags = 0;
+
+   use_mm(window->mm);
+	rc = copy_to_user(csb_addr, &csb, sizeof(csb));
+   /*
+* User space polls on csb.flags (first byte). So add barrier
+* then copy first byte with csb flags update.
+*/
+   smp_mb();
+   if (!rc) {
+   csb.flags = CSB_V;
+		rc = copy_to_user(csb_addr, &csb, sizeof(u8));
+   }
+   unuse_mm(window->mm);
+
+   /* Success */
+   if (!rc)
+   return;
+
+   /*
+* User space passed invalid CSB address, Notify process with
+* SEGV signal.
+*/
+   pid = window->pid;
+   tsk = get_pid_task(pid, PIDTYPE_PID);
+   /*
+* Send window will be closed after processing all NX requests
+* and process exits after closing all windows. In multi-thread
+* applications, thread may not exists, but does not close FD
+* (means send window) upon exit. Parent thread (tgid) can use
+* and close the window later.
+* pid and mm references are taken when window is opened by
+* process (pid). So tgid is used only when child thread is not
+* available in multithread tasks.
+*
+*/
+   if (!tsk) {
+   pid = window->tgid;
+   tsk = get_pid_task(pid, PIDTYPE_PID);
+   /*
+* Parent thread will be closing window during its exit.
+* So should not get here.
+*/
+   if (!tsk)
+   return;
+   }
+
+   /* Do not notify if the task is exiting. */
+   if (tsk->flags & PF_EXITING) {
+   put_task_struct(tsk);
+   return;
+   }
+   put_task_struct(tsk);
+
+   pr_err("Invalid CSB address 0x%p signalling pid(%d)\n",
+   csb_addr, pid_vnr(pid));
+
+	clear_siginfo(&info);
+   info.si_signo = SIGSEGV;
+   info.si_errno = EFAULT;
+   info.si_code = SEGV_MAPERR;
+   info.si_addr = csb_addr;
+
+   /*
+* process will be polling on csb.flags after request is sent to
+* NX. So generally CSB update should not fail except when an
+* application does not follow the process properly. So an error
+* message will be displayed and leave it to user space whether
+* to ignore or handle this signal.
+*/
+   rcu_read_lock();
+	rc = kill_pid_info(SIGSEGV, &info, pid);
+   rcu_read_unlock();
+
+   pr_devel("%s(): pid %d kill_proc_info() rc %d\n", __func__,
+   pid_vnr(pid), rc);
+}
+
+/*
  * Process CRBs that we receive on the fault window.
  */
 irqreturn_t vas_fault_handler(int irq, void *data)
@@ -102,6 +222,7 @@ irqreturn_t vas_fault_handler(int irq, void *data)
  

[PATCH V3 07/13] powerpc/vas: Take reference to PID and mm for userspace windows

2019-12-16 Thread Haren Myneni


A process closes its windows after its requests are completed. In
multi-threaded applications, a child thread can open a window, but the
FD release will not be called when it exits; the parent thread will
close the window later upon its own exit.

The parent can also send NX requests with this window, and NX can
generate page faults. After the kernel handles a page fault, it sends a
signal to the process (using the PID) if the CSB address is invalid.
The parent thread would not receive the signal, since its PID differs
from the one saved in the vas_window. So use the tgid when the task for
the PID saved in the window is no longer running, and signal its parent
instead.

To prevent the PID from being reused until the window is closed, take
references to the pid and the task's mm.
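
The take/release pattern this describes, in sketch form (simplified;
field names follow this series, and note that get_task_mm() returns
with a temporary mm_users reference that should only be dropped after
the long-lived mm_count reference has been taken):

	/* At window open, in the opening task's context: */
	txwin->pid = get_task_pid(current, PIDTYPE_PID); /* pin the pid */
	txwin->mm = get_task_mm(current);	/* takes an mm_users ref */
	if (txwin->mm) {
		mmgrab(txwin->mm);	/* long-lived mm_count reference */
		mmput(txwin->mm);	/* drop the temporary mm_users ref */
		mm_context_add_copro(txwin->mm);
	}
	/* Fallback signal target if the opening thread already exited: */
	txwin->tgid = find_get_pid(task_tgid_vnr(current));

	/* At window close: */
	put_pid(txwin->pid);
	if (txwin->mm) {
		mm_context_remove_copro(txwin->mm);
		mmdrop(txwin->mm);
	}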

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-debug.c  |  2 +-
 arch/powerpc/platforms/powernv/vas-window.c | 53 ++---
 arch/powerpc/platforms/powernv/vas.h|  9 -
 3 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-debug.c b/arch/powerpc/platforms/powernv/vas-debug.c
index 09e63df..ef9a717 100644
--- a/arch/powerpc/platforms/powernv/vas-debug.c
+++ b/arch/powerpc/platforms/powernv/vas-debug.c
@@ -38,7 +38,7 @@ static int info_show(struct seq_file *s, void *private)
 
seq_printf(s, "Type: %s, %s\n", cop_to_str(window->cop),
window->tx_win ? "Send" : "Receive");
-   seq_printf(s, "Pid : %d\n", window->pid);
+   seq_printf(s, "Pid : %d\n", vas_window_pid(window));
 
 unlock:
 	mutex_unlock(&vas_mutex);
diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index e36c5d2..0f27ac5 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "vas.h"
@@ -877,8 +879,6 @@ struct vas_window *vas_rx_win_open(int vasid, enum vas_cop_type cop,
rxwin->user_win = rxattr->user_win;
rxwin->cop = cop;
rxwin->wcreds_max = rxattr->wcreds_max ?: VAS_WCREDS_DEFAULT;
-   if (rxattr->user_win)
-   rxwin->pid = task_pid_vnr(current);
 
init_winctx_for_rxwin(rxwin, rxattr, );
init_winctx_regs(rxwin, );
@@ -1028,7 +1028,6 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
txwin->tx_win = 1;
txwin->rxwin = rxwin;
txwin->nx_win = txwin->rxwin->nx_win;
-   txwin->pid = attr->pid;
txwin->user_win = attr->user_win;
txwin->wcreds_max = attr->wcreds_max ?: VAS_WCREDS_DEFAULT;
 
@@ -1069,8 +1068,43 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
goto free_window;
}
 
-   set_vinst_win(vinst, txwin);
+   if (txwin->user_win) {
+   /*
+* Window opened by child thread may not be closed when
+* it exits. So take reference to its pid and release it
+* when the window is free by parent thread.
+* Acquire a reference to the task's pid to make sure
+* pid will not be re-used - needed only for multithread
+* applications.
+*/
+   txwin->pid = get_task_pid(current, PIDTYPE_PID);
+   /*
+* Acquire a reference to the task's mm.
+*/
+   txwin->mm = get_task_mm(current);
 
+   if (!txwin->mm) {
+   put_pid(txwin->pid);
+   pr_err("VAS: pid(%d): mm_struct is not found\n",
+   current->pid);
+   rc = -EPERM;
+   goto free_window;
+   }
+
+   mmgrab(txwin->mm);
+   mmput(txwin->mm);
+   mm_context_add_copro(txwin->mm);
+   /*
+* Process closes window during exit. In the case of
+* multithread application, child can open window and
+* can exit without closing it. Expects parent thread
+* to use and close the window. So do not need to take
+* pid reference for parent thread.
+*/
+   txwin->tgid = find_get_pid(task_tgid_vnr(current));
+   }
+
+   set_vinst_win(vinst, txwin);
return txwin;
 
 free_window:
@@ -1267,8 +1301,17 @@ int vas_win_close(struct vas_window *window)
poll_window_castout(window);
 
/* if send window, drop reference to matching receive window */
-   if (window->tx_win)
+   if (window->tx_win) {
+   if (window->user_win) {
+   /* Drop references to pid and mm */
+   put_pid(window->pid);
+   if (window->mm) {
+   mmdrop(window->mm);
+   

[PATCH V3 06/13] powerpc/vas: Register NX with fault window ID and IRQ port value

2019-12-16 Thread Haren Myneni


For each user space send window, register NX with the fault window ID
and port value so that NX pastes CRBs into this fault FIFO when it sees
a fault on the request buffer.
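
The PSWID packing that makes the later window lookup possible (added to
vas.h in this patch) can be sanity-checked in isolation; a standalone
sketch mirroring encode_pswid()/decode_pswid():

	#include <assert.h>
	#include <stdint.h>

	/* Bits 0:7 = VAS id, bits 16:31 = window id, in IBM bit
	 * numbering where bit 0 is the MSB of the 32-bit word. */
	static uint32_t encode_pswid(int vasid, int winid)
	{
		return (uint32_t)winid | ((uint32_t)vasid << (31 - 7));
	}

	static void decode_pswid(uint32_t pswid, int *vasid, int *winid)
	{
		if (vasid)
			*vasid = (pswid >> (31 - 7)) & 0xff;
		if (winid)
			*winid = pswid & 0xffff;
	}

	int main(void)
	{
		int vasid, winid;

		assert(encode_pswid(5, 0x1234) == 0x05001234);
		decode_pswid(encode_pswid(5, 0x1234), &vasid, &winid);
		assert(vasid == 5 && winid == 0x1234);
		return 0;
	}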

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 15 +--
 arch/powerpc/platforms/powernv/vas.h| 15 +++
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index cec1b41..e36c5d2 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -373,7 +373,7 @@ int init_winctx_regs(struct vas_window *window, struct vas_winctx *winctx)
init_xlate_regs(window, winctx->user_win);
 
val = 0ULL;
-   val = SET_FIELD(VAS_FAULT_TX_WIN, val, 0);
+   val = SET_FIELD(VAS_FAULT_TX_WIN, val, winctx->fault_win_id);
write_hvwc_reg(window, VREG(FAULT_TX_WIN), val);
 
/* In PowerNV, interrupts go to HV. */
@@ -748,6 +748,8 @@ static void init_winctx_for_rxwin(struct vas_window *rxwin,
 
winctx->min_scope = VAS_SCOPE_LOCAL;
winctx->max_scope = VAS_SCOPE_VECTORED_GROUP;
+   if (rxwin->vinst->virq)
+   winctx->irq_port = rxwin->vinst->irq_port;
 }
 
 static bool rx_win_args_valid(enum vas_cop_type cop,
@@ -945,13 +947,22 @@ static void init_winctx_for_txwin(struct vas_window *txwin,
winctx->lpid = txattr->lpid;
winctx->pidr = txattr->pidr;
winctx->rx_win_id = txwin->rxwin->winid;
+   /*
+* IRQ and fault window setup is successful. Set fault window
+* for the send window so that ready to handle faults.
+*/
+   if (txwin->vinst->virq)
+   winctx->fault_win_id = txwin->vinst->fault_win->winid;
 
winctx->dma_type = VAS_DMA_TYPE_INJECT;
winctx->tc_mode = txattr->tc_mode;
winctx->min_scope = VAS_SCOPE_LOCAL;
winctx->max_scope = VAS_SCOPE_VECTORED_GROUP;
+   if (txwin->vinst->virq)
+   winctx->irq_port = txwin->vinst->irq_port;
 
-   winctx->pswid = 0;
+   winctx->pswid = txattr->pswid ? txattr->pswid :
+   encode_pswid(txwin->vinst->vas_id, txwin->winid);
 }
 
 static bool tx_win_args_valid(enum vas_cop_type cop,
diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
index 879f5b4..2621df1 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -455,6 +455,21 @@ static inline u64 read_hvwc_reg(struct vas_window *win,
return in_be64(win->hvwc_map+reg);
 }
 
+/*
+ * Encode/decode the Partition Send Window ID (PSWID) for a window in
+ * a way that we can uniquely identify any window in the system. i.e.
+ * we should be able to locate the 'struct vas_window' given the PSWID.
+ *
+ * Bits	Usage
+ * 0:7	VAS id (8 bits)
+ * 8:15	Unused, 0 (8 bits)
+ * 16:31	Window id (16 bits)
+ */
+static inline u32 encode_pswid(int vasid, int winid)
+{
+   return ((u32)winid | (vasid << (31 - 7)));
+}
+
 static inline void decode_pswid(u32 pswid, int *vasid, int *winid)
 {
if (vasid)
-- 
1.8.3.1





[PATCH V3 05/13] powerpc/vas: Setup thread IRQ handler per VAS instance

2019-12-16 Thread Haren Myneni


Set up a threaded IRQ handler for each VAS instance. When NX sees a
fault on a CRB, the kernel gets an interrupt and vas_fault_handler is
executed to process the fault CRBs: read all valid CRBs from the fault
FIFO, determine the corresponding send window from each CRB, and
process the fault requests.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c  | 83 +
 arch/powerpc/platforms/powernv/vas-window.c | 60 +
 arch/powerpc/platforms/powernv/vas.c| 21 +++-
 arch/powerpc/platforms/powernv/vas.h|  4 ++
 4 files changed, 167 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
index b0258ed..57f21ea 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "vas.h"
@@ -25,6 +26,88 @@
 #define VAS_FAULT_WIN_FIFO_SIZE	(4 << 20)
 
 /*
+ * Process CRBs that we receive on the fault window.
+ */
+irqreturn_t vas_fault_handler(int irq, void *data)
+{
+   struct vas_instance *vinst = data;
+   struct coprocessor_request_block buf, *crb;
+   struct vas_window *window;
+   void *fifo;
+
+   /*
+* VAS can interrupt with multiple page faults. So process all
+* valid CRBs within fault FIFO until reaches invalid CRB.
+* NX updates nx_fault_stamp in CRB and pastes in fault FIFO.
+	 * kernel retrieves send window from partition send window ID
+* (pswid) in nx_fault_stamp. So pswid should be non-zero and
+* use this to check whether CRB is valid.
+* After reading CRB entry, it is reset with 0's in fault FIFO.
+*
+* In case kernel receives another interrupt with different page
+* fault and CRBs are processed by the previous handling, will be
+* returned from this function when it sees invalid CRB (means 0's).
+*/
+   do {
+		mutex_lock(&vinst->mutex);
+
+   /*
+* Advance the fault fifo pointer to next CRB.
+* Use CRB_SIZE rather than sizeof(*crb) since the latter is
+* aligned to CRB_ALIGN (256) but the CRB written to by VAS is
+* only CRB_SIZE in len.
+*/
+   fifo = vinst->fault_fifo + (vinst->fault_crbs * CRB_SIZE);
+   crb = fifo;
+
+   /*
+* pswid returned from NX will be in _be32, but just
+* checking non-zero value to make sure the CRB is valid.
+* Return if reached invalid CRB.
+*/
+   if (!crb->stamp.nx.pswid) {
+			mutex_unlock(&vinst->mutex);
+   return IRQ_HANDLED;
+   }
+
+   vinst->fault_crbs++;
+   if (vinst->fault_crbs == (vinst->fault_fifo_size / CRB_SIZE))
+   vinst->fault_crbs = 0;
+
+		crb = &buf;
+   memcpy(crb, fifo, CRB_SIZE);
+   memset(fifo, 0, CRB_SIZE);
+		mutex_unlock(&vinst->mutex);
+
+   pr_devel("VAS[%d] fault_fifo %p, fifo %p, fault_crbs %d\n",
+   vinst->vas_id, vinst->fault_fifo, fifo,
+   vinst->fault_crbs);
+
+   window = vas_pswid_to_window(vinst,
+   be32_to_cpu(crb->stamp.nx.pswid));
+
+   if (IS_ERR(window)) {
+   /*
+* We got an interrupt about a specific send
+* window but we can't find that window and we can't
+* even clean it up (return credit).
+* But we should not get here.
+*/
+   pr_err("VAS[%d] fault_fifo %p, fifo %p, pswid 0x%x, 
fault_crbs %d bad CRB?\n",
+   vinst->vas_id, vinst->fault_fifo, fifo,
+   be32_to_cpu(crb->stamp.nx.pswid),
+   vinst->fault_crbs);
+
+   WARN_ON_ONCE(1);
+   return IRQ_HANDLED;
+   }
+
+   } while (true);
+
+   return IRQ_HANDLED;
+}
+
+/*
  * Fault window is opened per VAS instance. NX pastes fault CRB in fault
  * FIFO upon page faults.
  */
diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index f07f49a..cec1b41 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1041,6 +1041,15 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
}
} else {
/*
+		 * Interrupt handler or fault window setup failed. Means
+* NX can not generate fault for page fault. So not
+   

[PATCH V3 04/13] powerpc/vas: Setup fault window per VAS instance

2019-12-16 Thread Haren Myneni


Set up a fault window for each VAS instance. When NX gets a fault on a
request buffer, it writes a fault CRB into the corresponding fault FIFO
and then sends an interrupt to the OS.
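
The FIFO sizing trade-off explained in the new file below can be
checked with quick arithmetic (CRB_SIZE is 128 bytes in this series):

	#include <assert.h>

	#define CRB_SIZE	128		/* bytes per CRB */

	int main(void)
	{
		/* A 4MB fault FIFO holds 32K fault CRBs per instance... */
		assert((4u << 20) / CRB_SIZE == 32768);
		/* ...while the 8MB maximum would want 64K credits, more
		 * than the 16-bit receive-credits field (max 0xffff). */
		assert((8u << 20) / CRB_SIZE == 65536);
		return 0;
	}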

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/Makefile |   2 +-
 arch/powerpc/platforms/powernv/vas-fault.c  |  73 
 arch/powerpc/platforms/powernv/vas-window.c |   3 +-
 arch/powerpc/platforms/powernv/vas.c|  20 
 arch/powerpc/platforms/powernv/vas.h|   5 ++
 scripts/pnmtologo   | Bin 0 -> 71976 bytes
 6 files changed, 101 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/vas-fault.c
 create mode 100755 scripts/pnmtologo

diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index c0f8120..395789f 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -17,7 +17,7 @@ obj-$(CONFIG_MEMORY_FAILURE)  += opal-memory-errors.o
 obj-$(CONFIG_OPAL_PRD) += opal-prd.o
 obj-$(CONFIG_PERF_EVENTS) += opal-imc.o
 obj-$(CONFIG_PPC_MEMTRACE) += memtrace.o
-obj-$(CONFIG_PPC_VAS)  += vas.o vas-window.o vas-debug.o
+obj-$(CONFIG_PPC_VAS)  += vas.o vas-window.o vas-debug.o vas-fault.o
 obj-$(CONFIG_OCXL_BASE)+= ocxl.o
 obj-$(CONFIG_SCOM_DEBUGFS) += opal-xscom.o
 obj-$(CONFIG_PPC_SECURE_BOOT) += opal-secvar.o
diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
new file mode 100644
index 000..b0258ed
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -0,0 +1,73 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * VAS Fault handling.
+ * Copyright 2019, IBM Corporation
+ */
+
+#define pr_fmt(fmt) "vas: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vas.h"
+
+/*
+ * The maximum FIFO size for a fault window can be 8MB
+ * (VAS_RX_FIFO_SIZE_MAX). Use a 4MB FIFO since each VAS
+ * instance will have its own fault window.
+ * An 8MB FIFO can be used if more faults are expected for
+ * each VAS instance.
+ */
+#define VAS_FAULT_WIN_FIFO_SIZE	(4 << 20)
+
+/*
+ * Fault window is opened per VAS instance. NX pastes fault CRB in fault
+ * FIFO upon page faults.
+ */
+int vas_setup_fault_window(struct vas_instance *vinst)
+{
+   struct vas_rx_win_attr attr;
+
+   vinst->fault_fifo_size = VAS_FAULT_WIN_FIFO_SIZE;
+   vinst->fault_fifo = kzalloc(vinst->fault_fifo_size, GFP_KERNEL);
+   if (!vinst->fault_fifo) {
+   pr_err("Unable to alloc %d bytes for fault_fifo\n",
+   vinst->fault_fifo_size);
+   return -ENOMEM;
+   }
+
+	vas_init_rx_win_attr(&attr, VAS_COP_TYPE_FAULT);
+
+   attr.rx_fifo_size = vinst->fault_fifo_size;
+   attr.rx_fifo = vinst->fault_fifo;
+
+   /*
+	 * Max creds is based on the number of CRBs that can fit in
+	 * the FIFO (fault_fifo_size/CRB_SIZE). If an 8MB FIFO is used,
+	 * max creds will be 0xffff since the receive creds field is
+	 * 16 bits wide.
+*/
+   attr.wcreds_max = vinst->fault_fifo_size / CRB_SIZE;
+   attr.lnotify_lpid = 0;
+   attr.lnotify_pid = mfspr(SPRN_PID);
+   attr.lnotify_tid = mfspr(SPRN_PID);
+
+   vinst->fault_win = vas_rx_win_open(vinst->vas_id, VAS_COP_TYPE_FAULT,
+					&attr);
+
+   if (IS_ERR(vinst->fault_win)) {
+   pr_err("VAS: Error %ld opening FaultWin\n",
+   PTR_ERR(vinst->fault_win));
+   kfree(vinst->fault_fifo);
+   return PTR_ERR(vinst->fault_win);
+   }
+
+   pr_devel("VAS: Created FaultWin %d, LPID/PID/TID [%d/%d/%d]\n",
+   vinst->fault_win->winid, attr.lnotify_lpid,
+   attr.lnotify_pid, attr.lnotify_tid);
+
+   return 0;
+}
diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
index 0c0d27d..f07f49a 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -827,9 +827,10 @@ void vas_init_rx_win_attr(struct vas_rx_win_attr *rxattr, enum vas_cop_type cop)
rxattr->fault_win = true;
rxattr->notify_disable = true;
rxattr->rx_wcred_mode = true;
-   rxattr->tx_wcred_mode = true;
rxattr->rx_win_ord_mode = true;
rxattr->tx_win_ord_mode = true;
+   rxattr->rej_no_credit = true;
+   rxattr->tc_mode = VAS_THRESH_DISABLED;
} else if (cop == VAS_COP_TYPE_FTW) {
rxattr->user_win = true;
rxattr->intr_disable = true;
diff --git a/arch/powerpc/platforms/powernv/vas.c b/arch/powerpc/platforms/powernv/vas.c
index 40d8213..850863c 100644
--- a/arch/powerpc/platforms/powernv/vas.c
+++ b/arch/powerpc/platforms/powernv/vas.c
@@ -23,6 +23,11 @@
 
 static 

[PATCH V3 03/13] powerpc/vas: Read interrupts and vas-port device tree properties

2019-12-16 Thread Haren Myneni


Read the interrupts and vas-port device tree properties for each VAS
instance. NX generates an interrupt when it sees a page fault on the
request buffer. The interrupts property is used to set up the IRQ for
handling the fault, and the port value is set for each user space send
window.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas.c | 40 
 arch/powerpc/platforms/powernv/vas.h |  2 ++
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas.c b/arch/powerpc/platforms/powernv/vas.c
index ed9cc6d..40d8213 100644
--- a/arch/powerpc/platforms/powernv/vas.c
+++ b/arch/powerpc/platforms/powernv/vas.c
@@ -25,10 +25,11 @@
 
 static int init_vas_instance(struct platform_device *pdev)
 {
-   int rc, cpu, vasid;
-   struct resource *res;
-   struct vas_instance *vinst;
struct device_node *dn = pdev->dev.of_node;
+   int rc, cpu, vasid, nresources = 5;
+   struct vas_instance *vinst;
+   struct resource *res;
+   uint64_t port;
 
 	rc = of_property_read_u32(dn, "ibm,vas-id", &vasid);
if (rc) {
@@ -36,7 +37,18 @@ static int init_vas_instance(struct platform_device *pdev)
return -ENODEV;
}
 
-   if (pdev->num_resources != 4) {
+	rc = of_property_read_u64(dn, "ibm,vas-port", &port);
+   if (rc) {
+   pr_err("No ibm,vas-port property for %s?\n", pdev->name);
+   /* No interrupts property */
+   nresources = 4;
+   }
+
+   /*
+* interrupts property is available with 'ibm,vas-port' property.
+* 4 Resources and 1 IRQ if interrupts property is available.
+*/
+   if (pdev->num_resources != nresources) {
pr_err("Unexpected DT configuration for [%s, %d]\n",
pdev->name, vasid);
return -ENODEV;
@@ -51,6 +63,7 @@ static int init_vas_instance(struct platform_device *pdev)
 	mutex_init(&vinst->mutex);
vinst->vas_id = vasid;
vinst->pdev = pdev;
+   vinst->irq_port = port;
 
 	res = &pdev->resource[0];
vinst->hvwc_bar_start = res->start;
@@ -66,12 +79,23 @@ static int init_vas_instance(struct platform_device *pdev)
pr_err("Bad 'paste_win_id_shift' in DT, %llx\n", res->end);
goto free_vinst;
}
-
vinst->paste_win_id_shift = 63 - res->end;
 
-   pr_devel("Initialized instance [%s, %d], paste_base 0x%llx, "
-   "paste_win_id_shift 0x%llx\n", pdev->name, vasid,
-   vinst->paste_base_addr, vinst->paste_win_id_shift);
+   /* interrupts property */
+   if (pdev->num_resources == 5) {
+		res = &pdev->resource[4];
+   vinst->virq = res->start;
+   if (vinst->virq <= 0) {
+   pr_err("IRQ resource is not available for [%s, %d]\n",
+   pdev->name, vasid);
+   vinst->virq = 0;
+   }
+   }
+
+   pr_devel("Initialized instance [%s, %d] paste_base 0x%llx 
paste_win_id_shift 0x%llx IRQ %d Port 0x%llx\n",
+   pdev->name, vasid, vinst->paste_base_addr,
+   vinst->paste_win_id_shift, vinst->virq,
+   vinst->irq_port);
 
for_each_possible_cpu(cpu) {
if (cpu_to_chip_id(cpu) == of_get_ibm_chip_id(dn))
diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
index 5574aec..598608b 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -313,6 +313,8 @@ struct vas_instance {
u64 paste_base_addr;
u64 paste_win_id_shift;
 
+   u64 irq_port;
+   int virq;
struct mutex mutex;
struct vas_window *rxwin[VAS_COP_TYPE_MAX];
struct vas_window *windows[VAS_WINDOWS_PER_CHIP];
-- 
1.8.3.1





[PATCH V3 02/13] powerpc/vas: Define nx_fault_stamp in coprocessor_request_block

2019-12-16 Thread Haren Myneni


The kernel sets the fault address and status in the CRB for an NX page
fault on a user space address after processing the page fault. User
space gets the signal and handles the fault mentioned in the CRB by
bringing the page into memory and sending the NX request again.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/icswx.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
index 9872f85..b233d1e 100644
--- a/arch/powerpc/include/asm/icswx.h
+++ b/arch/powerpc/include/asm/icswx.h
@@ -108,6 +108,17 @@ struct data_descriptor_entry {
__be64 address;
 } __packed __aligned(DDE_ALIGN);
 
+/* 4.3.2 NX-stamped Fault CRB */
+
+#define NX_STAMP_ALIGN  (0x10)
+
+struct nx_fault_stamp {
+   __be64 fault_storage_addr;
+   __be16 reserved;
+   __u8   flags;
+   __u8   fault_status;
+   __be32 pswid;
+} __packed __aligned(NX_STAMP_ALIGN);
 
 /* Chapter 6.5.2 Coprocessor-Request Block (CRB) */
 
@@ -135,7 +146,12 @@ struct coprocessor_request_block {
 
struct coprocessor_completion_block ccb;
 
-   u8 reserved[48];
+   union {
+   struct nx_fault_stamp nx;
+   u8 reserved[16];
+   } stamp;
+
+   u8 reserved[32];
 
struct coprocessor_status_block csb;
 } __packed __aligned(CRB_ALIGN);
-- 
1.8.3.1





[PATCH V3 01/13] powerpc/vas: Describe vas-port and interrupts properties

2019-12-16 Thread Haren Myneni


Signed-off-by: Haren Myneni 
---
 Documentation/devicetree/bindings/powerpc/ibm,vas.txt | 5 +
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/powerpc/ibm,vas.txt b/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
index bf11d2f..12de08b 100644
--- a/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
+++ b/Documentation/devicetree/bindings/powerpc/ibm,vas.txt
@@ -11,6 +11,8 @@ Required properties:
   window context start and length, OS/User window context start and length,
   "Paste address" start and length, "Paste window id" start bit and number
   of bits)
+- ibm,vas-port : Port address for the interrupt.
+- interrupts: IRQ value for each VAS instance and level.
 
 Example:
 
@@ -18,5 +20,8 @@ Example:
compatible = "ibm,vas", "ibm,power9-vas";
reg = <0x60191 0x200 0x60190 0x1 
0x8 0x1 0x20 0x10>;
name = "vas";
+   interrupts = <0x1f 0>;
+   interrupt-parent = <>;
ibm,vas-id = <0x1>;
+   ibm,vas-port = <0x601000100>;
};
-- 
1.8.3.1





[Bug 205885] BUG: KASAN: null-ptr-deref in strncpy+0x3c/0x60

2019-12-16 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205885

Christophe Leroy (christophe.le...@c-s.fr) changed:

   What|Removed |Added

 CC||christophe.le...@c-s.fr

--- Comment #2 from Christophe Leroy (christophe.le...@c-s.fr) ---
You didn't get that with 5.5-rc1 ?
You get that as well when KASAN is not activated ?

If answer to both is 'yes', can you bisect ?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[PATCH V3 00/13] powerpc/vas: Page fault handling for user space NX requests

2019-12-16 Thread Haren Myneni


On power9, the Virtual Accelerator Switchboard (VAS) allows user space
or the kernel to communicate with the Nest Accelerator (NX) directly
using COPY/PASTE instructions. NX provides various functions such as
compression and encryption, but only compression (the 842 and GZIP
formats) is supported in the Linux kernel on power9.

The 842 compression driver (drivers/crypto/nx/nx-842-powernv.c) is
already included in Linux. GZIP support will only be available from
user space.

Applications can issue GZIP compression / decompression requests to NX
with COPY/PASTE instructions. While NX is processing these requests, it
can hit a fault on the request buffer (when it is not in memory). It
then issues an interrupt and pastes a fault CRB into the fault FIFO,
expecting the kernel to handle the fault and return credits for both
the send and fault windows after processing.

This patch series adds IRQ and fault window setup, and NX fault handling:
- Read IRQ# from "interrupts" property and configure IRQ per VAS instance.
- Set port# for each window to generate an interrupt when noticed fault.
- Set fault window and FIFO on which NX paste fault CRB.
- Setup IRQ thread fault handler per VAS instance.
- When receiving an interrupt, read CRBs from the fault FIFO and update the
  coprocessor_status_block (CSB) in the corresponding CRB with a translation
  failure (CSB_CC_TRANSLATION). After issuing NX requests, the process polls
  on the CSB address; when it sees the translation error, it can touch the
  request buffer to bring the page into memory and reissue the NX request
  (see the sketch after this list).
- If copy_to_user fails on user space CSB address, OS sends SEGV signal.
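
From the application's point of view, the flow above looks roughly like
this (a hypothetical user-space sketch; the struct below only
approximates the kernel's coprocessor_status_block layout, and the
CSB_V valid bit / CSB_CC_TRANSLATION code follow asm/icswx.h):

	#include <stdint.h>

	#define CSB_V			0x80	/* CSB valid */
	#define CSB_CC_TRANSLATION	5	/* translation fault */

	struct csb {			/* approximate CSB layout */
		uint8_t  flags;		/* CSB_V set when complete */
		uint8_t  cs;
		uint8_t  cc;		/* completion code */
		uint8_t  ce;
		uint32_t count;
		uint64_t reserved;
		uint64_t address;	/* fault address from the kernel */
	};

	/* Returns 0 on success, 1 on hard error, -1 if the caller should
	 * touch the fault address and re-paste the request to NX. */
	static int wait_for_csb(volatile struct csb *csb)
	{
		while (!(csb->flags & CSB_V))
			;	/* real code would pause/yield here */

		if (csb->cc == CSB_CC_TRANSLATION) {
			/* Bring the faulting page back into memory... */
			*(volatile char *)(uintptr_t)csb->address;
			return -1;	/* ...then resend the request */
		}
		return csb->cc ? 1 : 0;
	}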

Tested these patches with NX-GZIP support and will be posting this series
soon.

Patch 2: Define nx_fault_stamp, on which NX writes the fault status for the
 fault CRB
Patch 3: Read interrupts and port properties per VAS instance
Patch 4: Set up a fault window per VAS instance. This window is used by
 NX to paste the fault CRB in the FIFO.
Patches 5 & 6: Set up a threaded IRQ per VAS and register NX with the fault
 window ID and port number for each send window so that NX pastes the
 fault CRB in this window.
Patch 7: Take references to the pid and mm so that the pid is not re-used
 until the window is closed. Needed for multi-threaded applications
 where a child thread can open a window that the parent uses later.
Patches 8 and 9: Process CRBs from the fault FIFO and notify tasks by
 updating the CSB or through signals.
Patches 10 and 11: Return credits for send and fault windows after handling
 faults.
Patch 13: Fix closing the send window only after all credits are returned.
 This issue happens only for user space requests; there are no page
 faults on kernel request buffers.

Changelog:
V2:
  - Use threaded IRQ instead of own kernel thread handler
  - Use pswid insted of user space CSB address to find valid CRB
  - Removed unused macros and other changes as suggested by Christoph Hellwig

V3:
  - Rebased to 5.5-rc2
  - Use struct pid * instead of pid_t for vas_window tgid
  - Code cleanup as suggested by Christoph Hellwig

Haren Myneni (13):
  powerpc/vas: Describe vas-port and interrupts properties
  powerpc/vas: Define nx_fault_stamp in coprocessor_request_block
  powerpc/vas: Read interrupts and vas-port device tree properties
  powerpc/vas: Setup fault window per VAS instance
  powerpc/vas: Setup thread IRQ handler per VAS instance
  powerpc/vas: Register NX with fault window ID and IRQ port value
  powerpc/vas: Take reference to PID and mm for user space windows
  powerpc/vas: Update CSB and notify process for fault CRBs
  powerpc/vas: Print CRB and FIFO values
  powerpc/vas: Do not use default credits for receive window
  powerpc/VAS: Return credits after handling fault
  powerpc/vas: Display process stuck message
  powerpc/vas: Free send window in VAS instance after credits returned

 .../devicetree/bindings/powerpc/ibm,vas.txt|   5 +
 arch/powerpc/include/asm/icswx.h   |  18 +-
 arch/powerpc/platforms/powernv/Makefile|   2 +-
 arch/powerpc/platforms/powernv/vas-debug.c |   2 +-
 arch/powerpc/platforms/powernv/vas-fault.c | 328 +
 arch/powerpc/platforms/powernv/vas-window.c| 182 +++-
 arch/powerpc/platforms/powernv/vas.c   |  79 -
 arch/powerpc/platforms/powernv/vas.h   |  38 ++-
 scripts/pnmtologo  | Bin 0 -> 71976 bytes
 9 files changed, 628 insertions(+), 26 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/vas-fault.c
 create mode 100755 scripts/pnmtologo

-- 
1.8.3.1





Re: [PATCH V2 07/13] powerpc/vas: Take reference to PID and mm for user space windows

2019-12-16 Thread Haren Myneni
On Thu, 2019-12-12 at 05:02 -0800, Christoph Hellwig wrote:
> > +   if (txwin->user_win) {
> > +   /*
> > +* Window opened by child thread may not be closed when
> > +* it exits. So take reference to its pid and release it
> > +* when the window is free by parent thread.
> > +* Acquire a reference to the task's pid to make sure
> > +* pid will not be re-used.
> > +*/
> > +   txwin->pid = get_task_pid(current, PIDTYPE_PID);
> > +   /*
> > +* Acquire a reference to the task's mm.
> > +*/
> > +   txwin->mm = get_task_mm(current);
> > +
> > +   if (txwin->mm) {
> > +   mmput(txwin->mm);
> > +   mmgrab(txwin->mm);
> 
> Doesn't the mmgrab need to come before the mmput?
> 
> > +   mm_context_add_copro(txwin->mm);
> > +   } else {
> > +   put_pid(txwin->pid);
> > +   pr_err("VAS: pid(%d): mm_struct is not found\n",
> > +   current->pid);
> > +   rc = -EPERM;
> > +   goto free_window;
> > +   }
> 
> Also the code is much easier to follow if you handle the error
> first and avoid the else:
> 
>   txwin->mm = get_task_mm(current);
>   if (!txwin->mm) {
>   put_pid(txwin->pid);
>   pr_err("VAS: pid(%d): mm_struct is not found\n",
>   current->pid);
>   rc = -EPERM;
>   goto free_window;
>   }
>   mmgrab(txwin->mm);
>   mmput(txwin->mm);
> 
> Also don't you need to take a reference to the struct pid for the
> tgid as well?

A process opens a window and closes it upon its exit; we do not have to
take a pid reference in this case. But in multithreaded applications, a
child thread can open a window and exit without closing it. The parent
thread can use this window and is expected to close it later. So in this
case a pid reference for the child thread is needed, whereas the parent
thread, which closes the window later, does not need its pid reference
taken.

Basically we take a pid reference on the process that opens the window to
cover multithreaded applications.

Thanks for your review comments. I will post V3 patches.




Re: [PATCH v3] powerpc/smp: Use nid as fallback for package_id

2019-12-16 Thread Vasant Hegde

On 12/16/19 7:51 PM, Srikar Dronamraju wrote:

Package_id is used to find all cores that are part of the same chip. On
PowerNV machines, package_id defaults to chip_id. However, the ibm,chip_id
property is not present in the device tree of PowerVM LPARs. Hence lscpu
output shows one core per socket, with as many sockets as there are cores.

To overcome this, use nid as the package_id on PowerVM LPARs.

Before the patch.
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1   <--
Socket(s):   16  <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
  #
  # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
  -1
  #

After the patch
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  8  <--
Socket(s):   2   <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
  #
  # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
  0
  #
Now lscpu output is more in line with the system configuration.
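
For reference, a minimal sketch of the fallback described above; the helper
names here are assumptions for illustration, not the exact patch contents:

	int get_physical_package_id(int cpu)
	{
		int ppid = cpu_to_chip_id(cpu);	/* -1 when ibm,chip_id is absent */

		if (ppid == -1)			/* PowerVM LPAR: fall back to nid */
			ppid = cpu_to_node(cpu);

		return ppid;
	}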

Signed-off-by: Srikar Dronamraju 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Cc: Vasant Hegde 
Cc: Vaidyanathan Srinivasan 


Looks good to me.

Reviewed-by: Vasant Hegde 

-Vasant



Re: [PATCH kernel v2 2/4] powerpc/pseries: Allow not having ibm,hypertas-functions::hcall-multi-tce for DDW

2019-12-16 Thread Thiago Jung Bauermann


Alexey Kardashevskiy  writes:

> On 17/12/2019 10:07, Thiago Jung Bauermann wrote:
>>
>> Alexey Kardashevskiy  writes:
>>
>>> By default a pseries guest supports a H_PUT_TCE hypercall which maps
>>> a single IOMMU page in a DMA window. Additionally the hypervisor may
>>> support H_PUT_TCE_INDIRECT/H_STUFF_TCE which update multiple TCEs at once;
>>> this is advertised via the device tree /rtas/ibm,hypertas-functions
>>> property which Linux converts to FW_FEATURE_MULTITCE.
>>>
>>> FW_FEATURE_MULTITCE is checked when dma_iommu_ops is used; however
>>> the code managing the huge DMA window (DDW) ignores it and calls
>>> H_PUT_TCE_INDIRECT even if it is explicitly disabled via
>>> the "multitce=off" kernel command line parameter.
>>>
>>> This adds FW_FEATURE_MULTITCE checking to the DDW code path.
>>>
>>> This changes tce_build_pSeriesLP to take liobn and page size as
>>> the huge window does not have iommu_table descriptor which usually
>>> the place to store these numbers.
>>>
>>> Fixes: 4e8b0cf46b25 ("powerpc/pseries: Add support for dynamic dma windows")
>>> Signed-off-by: Alexey Kardashevskiy 
>>
>> Reviewed-by: Thiago Jung Bauermann 
>> Tested-by: Thiago Jung Bauermann 
>>
>> Some minor nits below. Feel free to ignore.
>>
>>> @@ -146,25 +146,25 @@ static int tce_build_pSeriesLP(struct iommu_table 
>>> *tbl, long tcenum,
>>> int ret = 0;
>>> long tcenum_start = tcenum, npages_start = npages;
>>>
>>> -   rpn = __pa(uaddr) >> TCE_SHIFT;
>>> +   rpn = __pa(uaddr) >> tceshift;
>>> proto_tce = TCE_PCI_READ;
>>> if (direction != DMA_TO_DEVICE)
>>> proto_tce |= TCE_PCI_WRITE;
>>>
>>> while (npages--) {
>>> -   tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
>>> -   rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
>>> +   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
>>> +   rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
>>
>> Is it necessary to cast to u64 here? plpar_tce_put() takes unsigned long
>> for both arguments.
>
> Looked as an unrelated change. Small but still unrelated.

Ah, I hadn't noticed that the cast was already in the original code.

>>> @@ -400,6 +402,20 @@ static int tce_setrange_multi_pSeriesLP(unsigned long 
>>> start_pfn,
>>> u64 rc = 0;
>>> long l, limit;
>>>
>>> +   if (!firmware_has_feature(FW_FEATURE_MULTITCE)) {
>>> +   unsigned long tceshift = be32_to_cpu(maprange->tce_shift);
>>> +   unsigned long dmastart = (start_pfn << PAGE_SHIFT) +
>>> +   be64_to_cpu(maprange->dma_base);
>>> +   unsigned long tcenum = dmastart >> tceshift;
>>> +   unsigned long npages = num_pfn << PAGE_SHIFT >>
>>> +   be32_to_cpu(maprange->tce_shift);
>>
>> Could use the tceshift variable here.
>
>
> True, overlooked.
> Thanks for the reviews!

Thank you for the patches!

--
Thiago Jung Bauermann
IBM Linux Technology Center


[Bug 205885] BUG: KASAN: null-ptr-deref in strncpy+0x3c/0x60

2019-12-16 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205885

--- Comment #1 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 286333
  --> https://bugzilla.kernel.org/attachment.cgi?id=286333&action=edit
kernel .config (5.5-rc2, PowerMac G4 DP)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 205885] New: BUG: KASAN: null-ptr-deref in strncpy+0x3c/0x60

2019-12-16 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205885

Bug ID: 205885
   Summary: BUG: KASAN: null-ptr-deref in strncpy+0x3c/0x60
   Product: Platform Specific/Hardware
   Version: 2.5
Kernel Version: 5.5-rc2
  Hardware: PPC-32
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: PPC-32
  Assignee: platform_ppc...@kernel-bugs.osdl.org
  Reporter: erhar...@mailbox.org
Regression: No

Created attachment 286331
  --> https://bugzilla.kernel.org/attachment.cgi?id=286331&action=edit
screenshot (5.5-rc2, PowerMac G4 DP)

I get this hit when booting kernel 5.5-rc2 on my G4 DP:

[...]
BUG: KASAN: null-ptr-deref in strncpy+0x3c/0x60
Read of size 1 at addr  by task swapper/0/1

CPU: 1 PID: 1 Comm: swapper/0 Tainted: GW5.5.0-rc2-PowerMacG4
Call Trace:
[ee8edd78] [c07819e0] dump_stack+0xbc/0x118 (unreliable)
[ee8edda8] [c0244b48] __kasan_report+0x174/0x180
[ee8edde8] [c07949dc] strncpy+0x3c/0x60
[ee8ede18] [c0b6979c] mount_block_root+0x200/0x3e0
[ee8edef8] [c0b69b74] prepare_namespace+0x164/0x174
[ee8edf18] [c0005f3c] kernel_init+0x14/0xf0
[ee8edf38] [c001a348] ret_from_kernel_thread+0x14/0x1c
=
BUG: Kernel NULL pointer dereference on read at 0x000
Faulting instruction address: 0xc07949dc
Oops: Kernel access of bad area, sig: 11 [#1]
[...]

For details see the screenshot (I applied a median filter but tesseract was
still not able to make much text out of it).

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH kernel v2 2/4] powerpc/pseries: Allow not having ibm,hypertas-functions::hcall-multi-tce for DDW

2019-12-16 Thread Alexey Kardashevskiy



On 17/12/2019 10:07, Thiago Jung Bauermann wrote:
> 
> Alexey Kardashevskiy  writes:
> 
>> By default a pseries guest supports a H_PUT_TCE hypercall which maps
>> a single IOMMU page in a DMA window. Additionally the hypervisor may
>> support H_PUT_TCE_INDIRECT/H_STUFF_TCE which update multiple TCEs at once;
>> this is advertised via the device tree /rtas/ibm,hypertas-functions
>> property which Linux converts to FW_FEATURE_MULTITCE.
>>
>> FW_FEATURE_MULTITCE is checked when dma_iommu_ops is used; however
>> the code managing the huge DMA window (DDW) ignores it and calls
>> H_PUT_TCE_INDIRECT even if it is explicitly disabled via
>> the "multitce=off" kernel command line parameter.
>>
>> This adds FW_FEATURE_MULTITCE checking to the DDW code path.
>>
>> This changes tce_build_pSeriesLP to take liobn and page size as
>> the huge window does not have iommu_table descriptor which usually
>> the place to store these numbers.
>>
>> Fixes: 4e8b0cf46b25 ("powerpc/pseries: Add support for dynamic dma windows")
>> Signed-off-by: Alexey Kardashevskiy 
> 
> Reviewed-by: Thiago Jung Bauermann 
> Tested-by: Thiago Jung Bauermann 
> 
> Some minor nits below. Feel free to ignore.
> 
>> @@ -146,25 +146,25 @@ static int tce_build_pSeriesLP(struct iommu_table 
>> *tbl, long tcenum,
>>  int ret = 0;
>>  long tcenum_start = tcenum, npages_start = npages;
>>
>> -rpn = __pa(uaddr) >> TCE_SHIFT;
>> +rpn = __pa(uaddr) >> tceshift;
>>  proto_tce = TCE_PCI_READ;
>>  if (direction != DMA_TO_DEVICE)
>>  proto_tce |= TCE_PCI_WRITE;
>>
>>  while (npages--) {
>> -tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
>> -rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
>> +tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
>> +rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
> 
> Is it necessary to cast to u64 here? plpar_tce_put() takes unsigned long
> for both arguments.


Looked as an unrelated change. Small but still unrelated.


> 
>> @@ -261,16 +263,16 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
>> *tbl, long tcenum,
>>  return ret;
>>  }
>>
>> -static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long 
>> npages)
>> +static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long 
>> npages)
>>  {
>>  u64 rc;
>>
>>  while (npages--) {
>> -rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, 0);
>> +rc = plpar_tce_put((u64)liobn, (u64)tcenum << 12, 0);
> 
> Same comment regarding cast to u64.
> 
>> @@ -400,6 +402,20 @@ static int tce_setrange_multi_pSeriesLP(unsigned long 
>> start_pfn,
>>  u64 rc = 0;
>>  long l, limit;
>>
>> +if (!firmware_has_feature(FW_FEATURE_MULTITCE)) {
>> +unsigned long tceshift = be32_to_cpu(maprange->tce_shift);
>> +unsigned long dmastart = (start_pfn << PAGE_SHIFT) +
>> +be64_to_cpu(maprange->dma_base);
>> +unsigned long tcenum = dmastart >> tceshift;
>> +unsigned long npages = num_pfn << PAGE_SHIFT >>
>> +be32_to_cpu(maprange->tce_shift);
> 
> Could use the tceshift variable here.


True, overlooked.
Thanks for the reviews!


> 
>> +void *uaddr = __va(start_pfn << PAGE_SHIFT);
>> +
>> +return tce_build_pSeriesLP(be32_to_cpu(maprange->liobn),
>> +tcenum, tceshift, npages, (unsigned long) uaddr,
>> +DMA_BIDIRECTIONAL, 0);
>> +}
>> +
>>  local_irq_disable();/* to protect tcep and the page behind it */
>>  tcep = __this_cpu_read(tce_page);
> 
> 
> --
> Thiago Jung Bauermann
> IBM Linux Technology Center
> 

-- 
Alexey


Re: [PATCH kernel v2 4/4] powerpc/pseries/svm: Allow IOMMU to work in SVM

2019-12-16 Thread Thiago Jung Bauermann


Alexey Kardashevskiy  writes:

> H_PUT_TCE_INDIRECT uses a shared page to send up to 512 TCE to
> a hypervisor in a single hypercall. This does not work for secure VMs
> as the page needs to be shared or the VM should use H_PUT_TCE instead.
>
> This disables H_PUT_TCE_INDIRECT by clearing the FW_FEATURE_PUT_TCE_IND
> feature bit so SVMs will map TCEs using H_PUT_TCE.
>
> This is not a part of init_svm() as it is called too late after FW
> patching is done and may result in a warning like this:
>
> [3.727716] Firmware features changed after feature patching!
> [3.727965] WARNING: CPU: 0 PID: 1 at 
> (...)arch/powerpc/lib/feature-fixups.c:466 check_features+0xa4/0xc0
>
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Thiago Jung Bauermann 
Tested-by: Thiago Jung Bauermann 

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH kernel v2 3/4] powerpc/pseries/iommu: Separate FW_FEATURE_MULTITCE to put/stuff features

2019-12-16 Thread Thiago Jung Bauermann


Alexey Kardashevskiy  writes:

> H_PUT_TCE_INDIRECT allows packing up to 512 TCE updates into a single
> hypercall; H_STUFF_TCE can clear lots in a single hypercall too.
>
> However, unlike H_STUFF_TCE (which writes the same TCE to all entries),
> H_PUT_TCE_INDIRECT uses a 4K page with new TCEs. In a secure VM
> environment this means sharing a secure VM page with a hypervisor which
> we would rather avoid.
>
> This splits the FW_FEATURE_MULTITCE feature into FW_FEATURE_PUT_TCE_IND
> and FW_FEATURE_STUFF_TCE. "hcall-multi-tce" in
> the "/rtas/ibm,hypertas-functions" device tree property sets both;
> the "multitce=off" kernel command line parameter disables both.
>
> This should not cause behavioural change.
>
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Thiago Jung Bauermann 
Tested-by: Thiago Jung Bauermann 

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH kernel v2 2/4] powerpc/pseries: Allow not having ibm,hypertas-functions::hcall-multi-tce for DDW

2019-12-16 Thread Thiago Jung Bauermann


Alexey Kardashevskiy  writes:

> By default a pseries guest supports a H_PUT_TCE hypercall which maps
> a single IOMMU page in a DMA window. Additionally the hypervisor may
> support H_PUT_TCE_INDIRECT/H_STUFF_TCE which update multiple TCEs at once;
> this is advertised via the device tree /rtas/ibm,hypertas-functions
> property which Linux converts to FW_FEATURE_MULTITCE.
>
> FW_FEATURE_MULTITCE is checked when dma_iommu_ops is used; however
> the code managing the huge DMA window (DDW) ignores it and calls
> H_PUT_TCE_INDIRECT even if it is explicitly disabled via
> the "multitce=off" kernel command line parameter.
>
> This adds FW_FEATURE_MULTITCE checking to the DDW code path.
>
> This changes tce_build_pSeriesLP to take liobn and page size as
> the huge window does not have iommu_table descriptor which usually
> the place to store these numbers.
>
> Fixes: 4e8b0cf46b25 ("powerpc/pseries: Add support for dynamic dma windows")
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Thiago Jung Bauermann 
Tested-by: Thiago Jung Bauermann 

Some minor nits below. Feel free to ignore.

> @@ -146,25 +146,25 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, 
> long tcenum,
>   int ret = 0;
>   long tcenum_start = tcenum, npages_start = npages;
>
> - rpn = __pa(uaddr) >> TCE_SHIFT;
> + rpn = __pa(uaddr) >> tceshift;
>   proto_tce = TCE_PCI_READ;
>   if (direction != DMA_TO_DEVICE)
>   proto_tce |= TCE_PCI_WRITE;
>
>   while (npages--) {
> - tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
> - rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
> + tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
> + rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);

Is it necessary to cast to u64 here? plpar_tce_put() takes unsigned long
for both arguments.

> @@ -261,16 +263,16 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
> *tbl, long tcenum,
>   return ret;
>  }
>
> -static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long 
> npages)
> +static void tce_free_pSeriesLP(unsigned long liobn, long tcenum, long npages)
>  {
>   u64 rc;
>
>   while (npages--) {
> - rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, 0);
> + rc = plpar_tce_put((u64)liobn, (u64)tcenum << 12, 0);

Same comment regarding cast to u64.

> @@ -400,6 +402,20 @@ static int tce_setrange_multi_pSeriesLP(unsigned long 
> start_pfn,
>   u64 rc = 0;
>   long l, limit;
>
> + if (!firmware_has_feature(FW_FEATURE_MULTITCE)) {
> + unsigned long tceshift = be32_to_cpu(maprange->tce_shift);
> + unsigned long dmastart = (start_pfn << PAGE_SHIFT) +
> + be64_to_cpu(maprange->dma_base);
> + unsigned long tcenum = dmastart >> tceshift;
> + unsigned long npages = num_pfn << PAGE_SHIFT >>
> + be32_to_cpu(maprange->tce_shift);

Could use the tceshift variable here.

> + void *uaddr = __va(start_pfn << PAGE_SHIFT);
> +
> + return tce_build_pSeriesLP(be32_to_cpu(maprange->liobn),
> + tcenum, tceshift, npages, (unsigned long) uaddr,
> + DMA_BIDIRECTIONAL, 0);
> + }
> +
>   local_irq_disable();/* to protect tcep and the page behind it */
>   tcep = __this_cpu_read(tce_page);


--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH kernel v2 1/4] Revert "powerpc/pseries/iommu: Don't use dma_iommu_ops on secure guests"

2019-12-16 Thread Thiago Jung Bauermann


Hello Alexey,

Alexey Kardashevskiy  writes:

> From: Ram Pai 
>
> This reverts commit edea902c1c1efb855f77e041f9daf1abe7a9768a.
>
> At the time the change allowed direct DMA ops for secure VMs; however
> since then we switched on using SWIOTLB backed with IOMMU (direct mapping)
> and to make this work, we need dma_iommu_ops which handles all cases
> including TCE mapping I/O pages in the presence of an IOMMU.
>
> Fixes: edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on 
> secure guests")
> Signed-off-by: Ram Pai 
> [aik: added "revert" and "fixes:"]
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: Thiago Jung Bauermann 
Tested-by: Thiago Jung Bauermann 

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


[PATCH v11 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

2019-12-16 Thread John Hubbard
It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
causes no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 tools/testing/selftests/vm/run_vmtests | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests 
b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..df6a6bf3f238 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,28 @@ echo "NOTE: The above hugetlb tests provide minimal 
coverage.  Use"
 echo "  https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "  hugetlb regression testing."
 
+echo ""
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo ""
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
+echo "--"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "--"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
 echo "---"
 echo "running userfaultfd"
 echo "---"
-- 
2.24.1



[PATCH v11 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*()

2019-12-16 Thread John Hubbard
Rename these routines in order to provide a clearer, more symmetric API
for pinning and unpinning DMA pages. This way, pin_user_pages*() calls
match up with unpin_user_pages*() calls, and the API is a lot closer to
being self-explanatory.
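
For illustration only (not part of the patch), the paired calls now read
symmetrically; addr, NPAGES and the surrounding driver context are assumed:

	struct page *pages[NPAGES];
	int i, n;

	n = pin_user_pages_fast(addr, NPAGES, FOLL_WRITE, pages);
	if (n <= 0)
		return n ? n : -EFAULT;

	/* ... DMA to or from the pinned pages ... */

	for (i = 0; i < n; i++)
		unpin_user_page(pages[i]);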

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst   |  2 +-
 arch/powerpc/mm/book3s64/iommu_api.c|  4 +--
 drivers/gpu/drm/via/via_dmablit.c   |  4 +--
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 ++--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  4 +--
 drivers/platform/goldfish/goldfish_pipe.c   |  4 +--
 drivers/vfio/vfio_iommu_type1.c |  2 +-
 fs/io_uring.c   |  4 +--
 include/linux/mm.h  | 26 -
 mm/gup.c| 32 ++---
 mm/process_vm_access.c  |  4 +--
 net/xdp/xdp_umem.c  |  2 +-
 18 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index 71849830cd48..1d490155ecd7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -219,7 +219,7 @@ since the system was booted, via two new /proc/vmstat 
entries: ::
 /proc/vmstat/nr_foll_pin_requested
 
 Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in put_user_page(), when they
+because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
 References
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index a86547822034..eba73ebd8ae5 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -168,7 +168,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
 
 free_exit:
/* free the references taken */
-   put_user_pages(mem->hpages, pinned);
+   unpin_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -214,7 +214,7 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_user_page(page);
+   unpin_user_page(page);
 
mem->hpas[i] = 0;
}
diff --git a/drivers/gpu/drm/via/via_dmablit.c 
b/drivers/gpu/drm/via/via_dmablit.c
index 37c5e572993a..719d036c9384 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t 
*vsg)
kfree(vsg->desc_pages);
/* fall through */
case dr_via_pages_locked:
-   put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
- (vsg->direction == DMA_FROM_DEVICE));
+   unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+  (vsg->direction == DMA_FROM_DEVICE));
/* fall through */
case dr_via_pages_alloc:
vfree(vsg->pages);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 55daefaa9b88..a6094766b6f5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
 
for_each_sg_page(umem->sg_head.sgl, _iter, umem->sg_nents, 0) {
page = sg_page_iter_page(_iter);
-   put_user_pages_dirty_lock(, 1, umem->writable && dirty);
+   unpin_user_pages_dirty_lock(, 1, umem->writable && dirty);
}
 
sg_free_table(>sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 9a94761765c0..3b505006c0a6 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 size_t npages, bool dirty)
 {
-   put_user_pages_dirty_lock(p, npages, dirty);
+   unpin_user_pages_dirty_lock(p, npages, dirty);
 
if (mm) { /* during close after signal, mm can be NULL */
atomic64_sub(npages, >pinned_vm);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index 

[PATCH v11 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call

2019-12-16 Thread John Hubbard
Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnecessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.
So remove it.

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 30 +-
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..b800fc9a0251 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -322,7 +322,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
 {
struct page *page[1];
struct vm_area_struct *vma;
-   struct vm_area_struct *vmas[1];
unsigned int flags = 0;
int ret;
 
@@ -330,33 +329,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(>mmap_sem);
-   if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
-   } else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
-   /*
-* The lifetime of a vaddr_get_pfn() page pin is
-* userspace-controlled. In the fs-dax case this could
-* lead to indefinite stalls in filesystem operations.
-* Disallow attempts to pin fs-dax pages via this
-* interface.
-*/
-   if (ret > 0 && vma_is_fsdax(vmas[0])) {
-   ret = -EOPNOTSUPP;
-   put_page(page[0]);
-   }
-   }
-   up_read(>mmap_sem);
-
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-   return 0;
+   ret = 0;
+   goto done;
}
 
-   down_read(>mmap_sem);
-
vaddr = untagged_addr(vaddr);
 
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
@@ -366,7 +346,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
if (is_invalid_reserved_pfn(*pfn))
ret = 0;
}
-
+done:
up_read(>mmap_sem);
return ret;
 }
-- 
2.24.1



[PATCH v11 12/25] IB/{core, hw, umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP

2019-12-16 Thread John Hubbard
Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() are the APIs to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  |  2 +-
 drivers/infiniband/core/umem_odp.c  | 13 ++---
 drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
 drivers/infiniband/sw/siw/siw_mem.c |  2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 214e87aa609d..55daefaa9b88 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,7 +266,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   ret = get_user_pages_fast(cur_base,
+   ret = pin_user_pages_fast(cur_base,
  min_t(unsigned long, npages,
PAGE_SIZE /
sizeof(struct page *)),
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index e42d44e501fd..abc3bb6578cc 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -308,9 +308,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For 
on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
@@ -363,7 +362,7 @@ static int ib_umem_odp_map_dma_single_page(
}
 
 out:
-   put_user_page(page);
+   put_page(page);
return ret;
 }
 
@@ -473,7 +472,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
ret = -EFAULT;
break;
}
-   put_user_page(local_page_list[j]);
+   put_page(local_page_list[j]);
continue;
}
 
@@ -500,8 +499,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, 
u64 user_virt,
 * ib_umem_odp_map_dma_single_page().
 */
if (npages - (j + 1) > 0)
-   put_user_pages(_page_list[j+1],
-  npages - (j + 1));
+   release_pages(_page_list[j+1],
+ npages - (j + 1));
break;
}
}
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c 
b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9a94761765c0 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned 
long vaddr, size_t np
int ret;
unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
 
-   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+   ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
if (ret < 0)
return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c 
b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..8269ab040c21 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct 
mthca_uar *uar,
goto out;
}
 
-   ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
+   ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1,
  FOLL_WRITE | FOLL_LONGTERM, pages);
if (ret < 0)
goto out;
diff --git 

[PATCH v11 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-12-16 Thread John Hubbard
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Reviewed-by: Christoph Hellwig 
Acked-by: Hans Verkuil 
Cc: Mauro Carvalho Chehab 
Cc: 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++)
+   for (i = 0; i < dma->nr_pages; i++) {
+   if (dma->direction == DMA_FROM_DEVICE)
+   set_page_dirty_lock(dma->pages[i]);
put_page(dma->pages[i]);
+   }
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.1



[PATCH v11 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

2019-12-16 Thread John Hubbard
Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
instead of a hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing
it once before issuing any of the get_user_pages*() calls. This
makes it harder to overlook, instead of having little "gup_flags & 1"
phrases in the function calls.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 9 ++---
 tools/testing/selftests/vm/gup_benchmark.c | 6 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..7fc44d25eca7 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -48,18 +48,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
+   /* Filter out most gup flags: only allow a tiny subset here: */
+   gup->flags &= FOLL_WRITE;
+
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+   nr = get_user_pages_fast(addr, nr, gup->flags,
 pages + i);
break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages(addr, nr,
-   (gup->flags & 1) | FOLL_LONGTERM,
+   gup->flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case GUP_BENCHMARK:
-   nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+   nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
default:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..389327e9b30a 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01/* check pte is writable */
+
 struct gup_benchmark {
__u64 get_delta_usec;
__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
}
 
gup.nr_pages_per_call = nr_pages;
-   gup.flags = write;
+   if (write)
+   gup.flags |= FOLL_WRITE;
 
fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
if (fd == -1)
-- 
2.24.1



[PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread John Hubbard
Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via unpin_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], and [3].

This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
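
For illustration only, a hedged sketch of how a later caller (e.g. a
writeback path) might consult the new query; the response policy is exactly
what the discussions above leave open, so this is not part of the patch:

	/*
	 * Hypothetical caller, illustration only. A page pinned via
	 * FOLL_PIN carries at least one GUP_PIN_COUNTING_BIAS (a large
	 * power of two) in its refcount, which page_dma_pinned() detects.
	 */
	if (page_dma_pinned(page)) {
		/* DMA may still target this page: skip or defer it. */
		return -EBUSY;
	}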

Reviewed-by: Jan Kara 
Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Cc: Kirill A. Shutemov 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/pin_user_pages.rst |   2 +-
 include/linux/mm.h|  83 -
 include/linux/mmzone.h|   2 +
 include/linux/page_ref.h  |  10 +
 mm/gup.c  | 409 +-
 mm/huge_memory.c  |  29 +-
 mm/hugetlb.c  |  38 +-
 mm/vmstat.c   |   2 +
 8 files changed, 439 insertions(+), 136 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
index 1d490155ecd7..2db14df1f2d7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -53,7 +53,7 @@ Which flags are set by each wrapper
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+value: GUP_PIN_COUNTING_BIAS.::
 
  Function
  
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a1a357e7d86..bb44c4d2ada7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1016,6 +1016,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
page = compound_head(page);
@@ -1044,29 +1046,80 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original 
page
+ * reference count, and also a new count of how many pin_user_pages() calls 
were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. 
In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual 
inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that 

[PATCH v11 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-12-16 Thread John Hubbard
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Reviewed-by: Jens Axboe 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9b1833fedc5c..c6ff9cc7fe71 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4539,7 +4539,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx 
*ctx, void __user *arg,
 
ret = 0;
down_read(>mm->mmap_sem);
-   pret = get_user_pages(ubuf, nr_pages,
+   pret = pin_user_pages(ubuf, nr_pages,
  FOLL_WRITE | FOLL_LONGTERM,
  pages, vmas);
if (pret == nr_pages) {
-- 
2.24.1



[PATCH v11 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-12-16 Thread John Hubbard
1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..a86547822034 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
 
-   ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+   ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
mem->hpages + entry, NULL);
if (ret == n) {
@@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
return 0;
 
 free_exit:
-   /* free the reference taken */
-   for (i = 0; i < pinned; i++)
-   put_page(mem->hpages[i]);
+   /* free the references taken */
+   put_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
SetPageDirty(page);
 
-   put_page(page);
+   put_user_page(page);
+
mem->hpas[i] = 0;
}
 }
-- 
2.24.1



[PATCH v11 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast()

2019-12-16 Thread John Hubbard
Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
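
For illustration only, a hypothetical caller that needs this combination;
the surrounding context is assumed, not taken from the patch:

	/* e.g. a debugger-style writer into a read-only private mapping */
	n = get_user_pages_fast(addr, 1, FOLL_WRITE | FOLL_FORCE, pages);
	if (n != 1)
		return -EFAULT;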

Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Cc: Christoph Hellwig 
Reviewed-by: Leon Romanovsky 
Reviewed-by: Jan Kara 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index c0c56888e7cc..958ab0757389 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2414,7 +2414,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned long addr, len, end;
int nr = 0, ret = 0;
 
-   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+   if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+  FOLL_FORCE)))
return -EINVAL;
 
start = untagged_addr(start) & PAGE_MASK;
-- 
2.24.1



[PATCH v11 16/25] net/xdp: set FOLL_PIN via pin_user_pages()

2019-12-16 Thread John Hubbard
Convert net/xdp to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Acked-by: Björn Töpel 
Signed-off-by: John Hubbard 
---
 net/xdp/xdp_umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..d071003b5e76 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
 
down_read(>mm->mmap_sem);
-   npgs = get_user_pages(umem->address, umem->npgs,
+   npgs = pin_user_pages(umem->address, umem->npgs,
  gup_flags | FOLL_LONGTERM, >pgs[0], NULL);
up_read(>mm->mmap_sem);
 
-- 
2.24.1



[PATCH v11 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-12-16 Thread John Hubbard
Introduce pin_user_pages*() variations of the get_user_pages*() calls.

For now, these are placeholder calls, until the various call sites
are converted to use the correct get_user_pages*() or
pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
* Start with "pin_user_pages" instead of "get_user_pages". That
  makes it easy to find and audit the call sites.
* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded upon it.)
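
For illustration only, the rules above in code form (driver context
assumed); note that at this point in the series the release side is still
spelled put_user_page():

	n = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
	if (n <= 0)
		return n ? n : -EFAULT;

	/* FOLL_PIN was set internally by the wrapper, never by this caller */
	/* ... DIO-like work on the pinned pages ... */

	for (i = 0; i < n; i++)
		put_user_page(pages[i]);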

Reviewed-by: Jan Kara 
Reviewed-by: Mike Rapoport   # Documentation
Reviewed-by: Jérôme Glisse 
Cc: Jonathan Corbet 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/pin_user_pages.rst | 232 ++
 include/linux/mm.h|  63 --
 mm/gup.c  | 161 +--
 4 files changed, 423 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+   pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index ..71849830cd48
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+pin_user_pages() and related calls
+
+
+.. contents:: :local:
+
+Overview
+
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions  (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for 
problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper 
calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pin pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ 
+ pin_user_pages  FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast FOLL_PIN is always set internally by this function.
+ 

[PATCH v11 05/25] goldish_pipe: rename local pin_user_pages() routine

2019-12-16 Thread John Hubbard
1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..ef50c264db71 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
}
 }
 
-static int pin_user_pages(unsigned long first_page,
- unsigned long last_page,
- unsigned int last_page_size,
- int is_write,
- struct page *pages[MAX_BUFFERS_PER_COMMAND],
- unsigned int *iter_last_page_size)
+static int goldfish_pin_pages(unsigned long first_page,
+ unsigned long last_page,
+ unsigned int last_page_size,
+ int is_write,
+ struct page *pages[MAX_BUFFERS_PER_COMMAND],
+ unsigned int *iter_last_page_size)
 {
int ret;
int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
if (mutex_lock_interruptible(>lock))
return -ERESTARTSYS;
 
-   pages_count = pin_user_pages(first_page, last_page,
-last_page_size, is_write,
-pipe->pages, _last_page_size);
+   pages_count = goldfish_pin_pages(first_page, last_page,
+last_page_size, is_write,
+pipe->pages, _last_page_size);
if (pages_count < 0) {
mutex_unlock(>lock);
return pages_count;
-- 
2.24.1



[PATCH v11 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-12-16 Thread John Hubbard
And get rid of the mmap_sem calls, as part of that. Note
that get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Reviewed-by: Leon Romanovsky 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7a3b99597ead..214e87aa609d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,16 +266,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   down_read(>mmap_sem);
-   ret = get_user_pages(cur_base,
-min_t(unsigned long, npages,
-  PAGE_SIZE / sizeof (struct page *)),
-gup_flags | FOLL_LONGTERM,
-page_list, NULL);
-   if (ret < 0) {
-   up_read(>mmap_sem);
+   ret = get_user_pages_fast(cur_base,
+ min_t(unsigned long, npages,
+   PAGE_SIZE /
+   sizeof(struct page *)),
+ gup_flags | FOLL_LONGTERM, page_list);
+   if (ret < 0)
goto umem_release;
-   }
 
cur_base += ret * PAGE_SIZE;
npages   -= ret;
@@ -283,8 +280,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = ib_umem_add_sg_table(sg, page_list, ret,
dma_get_max_seg_size(context->device->dma_device),
>sg_nents);
-
-   up_read(>mmap_sem);
}
 
sg_mark_end(sg);
-- 
2.24.1



[PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-12-16 Thread John Hubbard
An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
  what kind of page it is, and what kind of refcount handling it
  requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.
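
As an illustration of the intended division of labor, a caller such as
put_page() ends up with roughly this shape (a condensed sketch of the diff
below, not a verbatim copy):

	if (page_is_devmap_managed(page)) {	/* read only, no side effects */
		put_devmap_managed_page(page);	/* refcounting + freeing */
		return;
	}
	if (put_page_testzero(page))
		__put_page(page);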

Cc: Christoph Hellwig 
Suggested-by: Jérôme Glisse 
Reviewed-by: Dan Williams 
Reviewed-by: Jan Kara 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 17 +
 mm/memremap.c  | 16 ++--
 mm/swap.c  | 24 
 3 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c97ea3b694e6..77a4df06c8a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
if (!static_branch_unlikely(_managed_key))
return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page *page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page *page)
return false;
 }
 
+bool put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool page_is_devmap_managed(struct page *page)
+{
+   return false;
+}
+
 static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
@@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
 * need to inform the device driver through callback. See
 * include/linux/memremap.h and HMM for details.
 */
-   if (put_devmap_managed_page(page))
+   if (page_is_devmap_managed(page)) {
+   put_devmap_managed_page(page);
return;
+   }
 
if (put_page_testzero(page))
__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index e899fa876a62..2ba773859031 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-   int count = page_ref_dec_return(page);
-
-   /* still busy */
-   if (count > 1)
-   return;
-
-   /* only triggered by the dev_pagemap shutdown path */
-   if (count == 0) {
-   __put_page(page);
-   return;
-   }
-
/* notify page idle for dax */
if (!is_device_private_page(page)) {
wake_up_var(>_refcount);
@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..49f7c2eea0ba 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1102,3 +1102,27 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+bool put_devmap_managed_page(struct page *page)
+{
+   bool is_devmap = page_is_devmap_managed(page);
+
+   if (is_devmap) {
+   int count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+   }
+
+   return is_devmap;
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
+#endif
-- 
2.24.1



[PATCH v11 19/25] vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-16 Thread John Hubbard
1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de
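
For reference, the new call is roughly equivalent to this open-coded loop (a
sketch of the intended semantics, not the exact mm/gup.c implementation):

	for (i = 0; i < npages; i++) {
		if (make_dirty && !PageDirty(pages[i]))
			set_page_dirty_lock(pages[i]);
		put_user_page(pages[i]);
	}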

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b800fc9a0251..18bfc2fc8e6d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
if (!is_invalid_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
-   if (prot & IOMMU_WRITE)
-   SetPageDirty(page);
-   put_page(page);
+
+   put_user_pages_dirty_lock(, 1, prot & IOMMU_WRITE);
return 1;
}
return 0;
@@ -329,7 +328,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
flags |= FOLL_WRITE;
 
down_read(>mmap_sem);
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
-- 
2.24.1



[PATCH v11 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion

2019-12-16 Thread John Hubbard
1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..162a2633b1e3 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+   err = pin_user_pages(data & PAGE_MASK, dma->nr_pages,
 flags | FOLL_LONGTERM, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+   dprintk(1, "pin_user_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++) {
-   if (dma->direction == DMA_FROM_DEVICE)
-   set_page_dirty_lock(dma->pages[i]);
-   put_page(dma->pages[i]);
-   }
+   put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.1



[PATCH v11 24/25] mm/gup_benchmark: support pin_user_pages() and related calls

2019-12-16 Thread John Hubbard
Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages():  via the '-b' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new page_dma_pinned() routine.
Those commands are:

PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_BENCHMARK  : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.
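
Assuming the selftest binary is built from tools/testing/selftests/vm (the
selftest half of this diff is truncated below), a typical run looks like:

	./gup_benchmark	# default: get_user_pages_fast(), same as -u
	./gup_benchmark -a	# pin_user_pages_fast()
	./gup_benchmark -b	# pin_user_pages()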

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 65 --
 tools/testing/selftests/vm/gup_benchmark.c | 15 -
 2 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7fc44d25eca7..76d32db48af8 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK  _IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -19,6 +21,42 @@ struct gup_benchmark {
__u64 expansion[10];/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case GUP_FAST_BENCHMARK:
+   case GUP_LONGTERM_BENCHMARK:
+   case GUP_BENCHMARK:
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   break;
+
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   unpin_user_pages(pages, nr_pages);
+   break;
+   }
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
+   for (i = 0; i < nr_pages; i++) {
+   if (WARN(!page_dma_pinned(pages[i]),
+"pages[%d] is NOT dma-pinned\n", i))
+   break;
+   }
+   break;
+   }
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
struct gup_benchmark *gup)
 {
@@ -65,6 +103,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
+   case PIN_FAST_BENCHMARK:
+   nr = pin_user_pages_fast(addr, nr, gup->flags,
+pages + i);
+   break;
+   case PIN_BENCHMARK:
+   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   NULL);
+   break;
default:
return -1;
}
@@ -75,15 +121,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();
 
+   /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+   nr_pages = i;
+
gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;
 
+   /*
+* Take an un-benchmark-timed moment to verify DMA pinned
+* state: print a warning if any non-dma-pinned pages are found:
+*/
+   verify_dma_pinned(cmd, pages, nr_pages);
+
start_time = ktime_get();
-   for (i = 0; i < nr_pages; i++) {
-   if (!pages[i])
-   break;
-   put_page(pages[i]);
-   }
+
+   put_back_pages(cmd, pages, nr_pages);
+
end_time = ktime_get();
gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -101,6 +154,8 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
case GUP_FAST_BENCHMARK:
case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
+   case PIN_FAST_BENCHMARK:
+   case PIN_BENCHMARK:
break;
default:
return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 

[PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-16 Thread John Hubbard
Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.

Christoph Hellwig, a point of interest:

a) I've moved the bulk of the code out of the inline functions, as
   requested, for the devmap changes (patch 4: "mm: devmap: refactor
   1-based refcounting for ZONE_DEVICE pages").

Changes since v10: Remaining fixes resulting from Jan Kara's reviews:

* Shifted to using the sign bit in page_dma_pinned() to allow accurate
  results even in the overflow case. See the comments in that routine
  for details. This allowed getting rid of the new
  page_ref_zero_or_close_to_bias_overflow(), in favor of a simple
  sign check via "page_ref_count() <= 0".

* Simplified some of the huge_memory.c changes, and simplified a gup.c
  WARN invocation.

* Now using a standard -ENOMEM for most try_grab_page() failures.

* Got rid of tabs in the comment headers (I had thought they were
  required there, but it's actually the reverse: they are not
  allowed there).

* Rebased against 5.5-rc2 and retested.

* Added Jan Kara's reviewed-by tag for patch 23 (the main patch of the
  series).

Changes since v9: Fixes resulting from Jan Kara's and Jonathan Corbet's
reviews:

* Removed reviewed-by tags from the "mm/gup: track FOLL_PIN pages" (those
  were improperly inherited from the much smaller refactoring patch that
  was merged into it).

* Made try_grab_compound_head() and try_grab_page() similar in their
  behavior with flags, in order to avoid "gotchas" later.

* follow_trans_huge_pmd(): moved the try_grab_page() to earlier in the
  routine, in order to avoid having to undo mlock_vma_page().

* follow_hugetlb_page(): removed a refcount overflow check that is now
  extraneous (and weaker than what try_grab_page() provides a few lines
  further down).

* Fixed up two Documentation flaws, pointed out by Jonathan Corbet's
  review.

Changes since v8:

* Merged the "mm/gup: pass flags arg to __gup_device_* functions" patch
  into the "mm/gup: track FOLL_PIN pages" patch, as requested by
  Christoph and Jan.

* Changed void grab_page() to bool try_grab_page(), and handled errors
  at the call sites. (From Jan's review comments.) try_grab_page()
  attempts to avoid page refcount overflows, even when counting up with
  GUP_PIN_COUNTING_BIAS increments.

* Fixed a bug that I'd introduced, when changing a BUG() to a WARN().

* Added Jan's reviewed-by tag to the " mm/gup: allow FOLL_FORCE for
  get_user_pages_fast()" patch.

* Documentation: pin_user_pages.rst: fixed an incorrect gup_benchmark
  invocation, left over from the pin_longterm days, spotted while preparing
  this version.

* Rebased onto today's linux.git (-rc1), and re-tested.

Changes since v7:

* Rebased onto Linux 5.5-rc1

* Reworked the grab_page() and try_grab_compound_head(), for API
  consistency and less diffs (thanks to Jan Kara's reviews).

* Added Leon Romanovsky's reviewed-by tags for two of the IB-related
  patches.

* patch 4 refactoring changes, as mentioned above.

There is a git repo and branch, for convenience:

g...@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8

For the remaining list of "changes since version N", those are all in
v7, which is here:

  https://lore.kernel.org/r/20191121071354.456618-1-jhubb...@nvidia.com


Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:
pin_user_pages() (sets FOLL_PIN)
unpin_user_page()
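
As a concrete, purely illustrative driver-side sketch of that conversion
(using the dirty-tracking variants from this series):

	/* before: */
	ret = get_user_pages_fast(addr, npages, FOLL_WRITE, pages);
	...
	for (i = 0; i < ret; i++) {
		set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}

	/* after: */
	ret = pin_user_pages_fast(addr, npages, FOLL_WRITE, pages);
	...
	unpin_user_pages_dirty_lock(pages, ret, true);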


Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've
  been able to runtime test the core get_user_pages() and
  pin_user_pages() and related routines, but not so much on several of
  the call sites--but those are generally just a couple of lines
  

[PATCH v11 01/25] mm/gup: factor out duplicate code from four routines

2019-12-16 Thread John Hubbard
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.
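
The reason for that get_page()/put_page() bracket is worth spelling out (the
same helper appears in the diff below, shown here with extra comments):

	static void put_compound_head(struct page *page, int refs)
	{
		/* If refs == page->_refcount, a bare page_ref_sub() would
		 * drop the count to zero without running the freeing path.
		 * The extra reference keeps the count positive across the
		 * subtraction, and the final put_page() then frees the page
		 * normally if this was the last reference. */
		get_page(page);
		page_ref_sub(page, refs);
		put_page(page);
	}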

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Jan Kara 
Cc: Ira Weiny 
Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 91 ++--
 1 file changed, 36 insertions(+), 55 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..f764432914c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1978,6 +1978,25 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
+static int record_subpages(struct page *page, unsigned long addr,
+  unsigned long end, struct page **pages)
+{
+   int nr;
+
+   for (nr = 0; addr != end; addr += PAGE_SIZE)
+   pages[nr++] = page++;
+
+   return nr;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+   /* Do a get_page() first, in case refs == page->_refcount */
+   get_page(page);
+   page_ref_sub(page, refs);
+   put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
@@ -2007,32 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   refs = 0;
head = pte_page(pte);
-
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-   do {
-   VM_BUG_ON(compound_head(page) != head);
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(head, refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-   /* Could be optimized better */
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2079,28 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pmd_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2120,28 +2118,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
}
 
-   refs = 0;
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = record_subpages(page, addr, end, pages + *nr);
 
head = try_get_compound_head(pud_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
+   *nr += refs;
SetPageReferenced(head);
return 1;
 }
@@ -2157,28 +2146,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long

[PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM

2019-12-16 Thread John Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.
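
Summarizing the resulting policy (a sketch in comment form, not verbatim
kernel code):

	/* get_user_pages_remote() + FOLL_LONGTERM, locked != NULL:
	 *	-EINVAL (still unsupported)
	 * get_user_pages_remote() + FOLL_LONGTERM, locked == NULL:
	 *	allowed; routed through __gup_longterm_locked(), which
	 *	runs the DAX vma check (check_dax_vmas()) internally
	 */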

Tested-by: Alex Williamson 
Acked-by: Alex Williamson 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 27 ++-
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3ecce297a47f..c0c56888e7cc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,13 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int flags);
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -1179,13 +1186,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct **vmas, int *locked)
 {
/*
-* FIXME: Current FOLL_LONGTERM behavior is incompatible with
+* Parts of FOLL_LONGTERM behavior are incompatible with
 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-* vmas.  As there are no users of this flag in this call we simply
-* disallow this option for now.
+* vmas. However, this only comes up if locked is set, and there are
+* callers that do request FOLL_LONGTERM, but do not set locked. So,
+* allow what we can.
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-   return -EINVAL;
+   if (gup_flags & FOLL_LONGTERM) {
+   if (WARN_ON_ONCE(locked))
+   return -EINVAL;
+   /*
+* This will check the vmas (even if our vmas arg is NULL)
+* and return -ENOTSUPP if DAX isn't allowed in this case:
+*/
+   return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+vmas, gup_flags | FOLL_TOUCH |
+FOLL_REMOTE);
+   }
 
return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
   locked,
-- 
2.24.1



[PATCH v11 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-12-16 Thread John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().

Reviewed-by: Jan Kara 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/process_vm_access.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
if (copy > len)
copy = len;
 
-   if (vm_write) {
+   if (vm_write)
copied = copy_page_from_iter(page, offset, copy, iter);
-   set_page_dirty_lock(page);
-   } else {
+   else
copied = copy_page_to_iter(page, offset, copy, iter);
-   }
+
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
flags |= FOLL_WRITE;
 
while (!rc && nr_pages && iov_iter_count(iter)) {
-   int pages = min(nr_pages, max_pages_per_loop);
+   int pinned_pages = min(nr_pages, max_pages_per_loop);
int locked = 1;
size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 * current/current->mm
 */
down_read(>mmap_sem);
-   pages = get_user_pages_remote(task, mm, pa, pages, flags,
- process_pages, NULL, );
+   pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+flags, process_pages,
+NULL, );
if (locked)
up_read(>mmap_sem);
-   if (pages <= 0)
+   if (pinned_pages <= 0)
return -EFAULT;
 
-   bytes = pages * PAGE_SIZE - start_offset;
+   bytes = pinned_pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 vm_write);
len -= bytes;
start_offset = 0;
-   nr_pages -= pages;
-   pa += pages * PAGE_SIZE;
-   while (pages)
-   put_page(process_pages[--pages]);
+   nr_pages -= pinned_pages;
+   pa += pinned_pages * PAGE_SIZE;
+
+   /* If vm_write is set, the pages need to be made dirty: */
+   put_user_pages_dirty_lock(process_pages, pinned_pages,
+ vm_write);
}
 
return rc;
-- 
2.24.1



[PATCH v11 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-12-16 Thread John Hubbard
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().

Acked-by: Daniel Vetter 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  drm_via_dmablit_t *xfer)
vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
if (NULL == vsg->pages)
return -ENOMEM;
-   ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+   ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
vsg->pages);
-- 
2.24.1



[PATCH v11 11/25] goldfish_pipe: convert to pin_user_pages() and put_user_page()

2019-12-16 Thread John Hubbard
1. Call the new global pin_user_pages_fast(), from goldfish_pin_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index ef50c264db71..2a5901efecde 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int goldfish_pin_pages(unsigned long first_page,
*iter_last_page_size = last_page_size;
}
 
-   ret = get_user_pages_fast(first_page, requested_pages,
+   ret = pin_user_pages_fast(first_page, requested_pages,
  !is_write ? FOLL_WRITE : 0,
  pages);
if (ret <= 0)
@@ -285,18 +285,6 @@ static int goldfish_pin_pages(unsigned long first_page,
return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-  int is_write, s32 consumed_size)
-{
-   int i;
-
-   for (i = 0; i < pages_count; i++) {
-   if (!is_write && consumed_size > 0)
-   set_page_dirty(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
   int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+   put_user_pages_dirty_lock(pipe->pages, pages_count,
+ !is_write && *consumed_size > 0);
 
mutex_unlock(>lock);
return 0;
-- 
2.24.1



[PATCH v11 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free()

2019-12-16 Thread John Hubbard
From: Dan Williams 

After the removal of the device-public infrastructure there are only 2
->page_free() callbacks in the kernel. One of those is a device-private
callback in the nouveau driver, the other is a generic wakeup needed in
the DAX case. In the hopes that all ->page_free() callbacks can be
migrated to common core kernel functionality, move the device-private
specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
cases.
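
Condensed, the control flow after this patch becomes (a sketch of the diff
below):

	if (count > 1)
		return;				/* still busy */
	if (count == 0) {
		__put_page(page);		/* dev_pagemap shutdown path */
		return;
	}
	if (!is_device_private_page(page)) {
		wake_up_var(&page->_refcount);	/* generic page-idle wakeup */
		return;
	}
	/* device-private only: uncharge, clear the stale page->mapping,
	 * then invoke the driver's ->page_free() callback */
	page->pgmap->ops->page_free(page);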

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jérôme Glisse 
Cc: Jan Kara 
Cc: Ira Weiny 
Signed-off-by: Dan Williams 
Signed-off-by: John Hubbard 
---
 drivers/nvdimm/pmem.c |  6 
 mm/memremap.c | 80 ---
 2 files changed, 44 insertions(+), 42 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)
-{
-   wake_up_var(>_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-   .page_free  = pmem_pagemap_page_free,
.kill   = pmem_pagemap_kill,
.cleanup= pmem_pagemap_cleanup,
 };
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..e899fa876a62 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(void)
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-   if (!pgmap->ops || !pgmap->ops->page_free) {
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+   (!pgmap->ops || !pgmap->ops->page_free)) {
WARN(1, "Missing page_free method\n");
return -EINVAL;
}
@@ -414,44 +415,51 @@ void __put_devmap_managed_page(struct page *page)
 {
int count = page_ref_dec_return(page);
 
-   /*
-* If refcount is 1 then page is freed and refcount is stable as nobody
-* holds a reference on the page.
-*/
-   if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
+   /* still busy */
+   if (count > 1)
+   return;
 
-   mem_cgroup_uncharge(page);
+   /* only triggered by the dev_pagemap shutdown path */
+   if (count == 0) {
+   __put_page(page);
+   return;
+   }
 
-   /*
-* When a device_private page is freed, the page->mapping field
-* may still contain a (stale) mapping value. For example, the
-* lower bits of page->mapping may still identify the page as
-* an anonymous page. Ultimately, this entire field is just
-* stale and wrong, and it will cause errors if not cleared.
-* One example is:
-*
-*  migrate_vma_pages()
-*migrate_vma_insert_page()
-*  page_add_new_anon_rmap()
-*__page_set_anon_rmap()
-*  ...checks page->mapping, via PageAnon(page) call,
-*and incorrectly concludes that the page is an
-*anonymous page. Therefore, it incorrectly,
-*silently fails to set up the new anon rmap.
-*
-* For other types of ZONE_DEVICE pages, migration is either
-* handled differently or not done at all, so there is no need
-* to clear page->mapping.
-*/
-   if (is_device_private_page(page))
-   page->mapping = NULL;
+   /* notify page idle for dax */
+   if (!is_device_private_page(page)) {
+   wake_up_var(>_refcount);
+   return;
+   }
 
-   page->pgmap->ops->page_free(page);
-   } else if (!count)
-   __put_page(page);
+   /* Clear Active bit in case of parallel mark_page_accessed */
+   __ClearPageActive(page);
+   __ClearPageWaiters(page);
+
+   mem_cgroup_uncharge(page);
+
+   /*
+* When a device_private page is freed, the page->mapping field
+* may still contain a (stale) mapping value. For example, the
+* lower bits of page->mapping may still identify the page as an
+* anonymous page. Ultimately, this entire field is just stale
+* and wrong, and it will cause errors if not cleared.  One
+* example is:
+*
+*  

[PATCH v11 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-12-16 Thread John Hubbard
An upcoming patch uses try_get_compound_head() more widely,
so move it to the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f764432914c4..3ecce297a47f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+   struct page *head = compound_head(page);
+
+   if (WARN_ON_ONCE(page_ref_count(head) < 0))
+   return NULL;
+   if (unlikely(!page_cache_add_speculative(head, refs)))
+   return NULL;
+   return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1807,20 +1822,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-   struct page *head = compound_head(page);
-   if (WARN_ON_ONCE(page_ref_count(head) < 0))
-   return NULL;
-   if (unlikely(!page_cache_add_speculative(head, refs)))
-   return NULL;
-   return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
-- 
2.24.1



Re: [PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread John Hubbard

On 12/16/19 4:53 AM, Jan Kara wrote:
...


I'd move this still a bit higher - just after VM_BUG_ON_PAGE() and before
the if (flags & FOLL_TOUCH) test. Because touch_pmd() can update page tables
and we don't want that if we're going to fail the fault.



Done. I'll post a full v11 series shortly.


With this fixed, the patch looks good to me so you can then add:

Reviewed-by: Jan Kara 

Honza



btw, thanks for the thorough review of this critical patch (and for your
patience with my mistakes). I really appreciate it, and this patchset would
not have made it this far without your detailed help and explanations.


thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v5 1/2] ASoC: dt-bindings: fsl_asrc: add compatible string for imx8qm & imx8qxp

2019-12-16 Thread Rob Herring
On Wed,  4 Dec 2019 20:00:18 +0800, Shengjiu Wang wrote:
> Add compatible string "fsl,imx8qm-asrc" for imx8qm platform,
> "fsl,imx8qxp-asrc" for imx8qxp platform.
> 
> There are two asrc modules in imx8qm & imx8qxp, the clock mapping is
> different for each other, so add new property "fsl,asrc-clk-map"
> to distinguish them.
> 
> Signed-off-by: Shengjiu Wang 
> ---
> changes in v2
> -none
> 
> changes in v3
> -use only one compatible string "fsl,imx8qm-asrc",
> -add new property "fsl,asrc-clk-map".
> 
> changes in v4
> -add "fsl,imx8qxp-asrc"
> 
> changes in v5
> -refine the comments for compatible
> 
>  Documentation/devicetree/bindings/sound/fsl,asrc.txt | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 

Acked-by: Rob Herring 


[PATCH v3 7/7] parisc/perf: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to monitoring for CAP_SYS_PERFMON privileged processes.
For backward compatibility reasons, access to the monitoring remains open
to CAP_SYS_ADMIN privileged processes, but using CAP_SYS_ADMIN for secure
monitoring is discouraged in favor of the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 arch/parisc/kernel/perf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/parisc/kernel/perf.c b/arch/parisc/kernel/perf.c
index 676683641d00..c4208d027794 100644
--- a/arch/parisc/kernel/perf.c
+++ b/arch/parisc/kernel/perf.c
@@ -300,7 +300,7 @@ static ssize_t perf_write(struct file *file, const char __user *buf,
else
return -EFAULT;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!perfmon_capable())
return -EACCES;
 
if (count != sizeof(uint32_t))
-- 
2.20.1




[PATCH v3 6/7] powerpc/perf: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to monitoring for CAP_SYS_PERFMON privileged processes.
For backward compatibility reasons, access to the monitoring remains open
to CAP_SYS_ADMIN privileged processes, but using CAP_SYS_ADMIN for secure
monitoring is discouraged in favor of the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 arch/powerpc/perf/imc-pmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index cb50a9e1fd2d..e837717492e4 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -898,7 +898,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!perfmon_capable())
return -EACCES;
 
/* Sampling not supported */
@@ -1307,7 +1307,7 @@ static int trace_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!perfmon_capable())
return -EACCES;
 
/* Return if this is a couting event */
-- 
2.20.1




[PATCH v3 5/7] trace/bpf_trace: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to bpf_trace monitoring for CAP_SYS_PERFMON privileged processes.
For backward compatibility reasons, access to bpf_trace monitoring remains
open to CAP_SYS_ADMIN privileged processes, but using CAP_SYS_ADMIN for
secure bpf_trace monitoring is discouraged in favor of the CAP_SYS_PERFMON
capability.

Signed-off-by: Alexey Budankov 
---
 kernel/trace/bpf_trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 44bd08f2443b..bafe21ac6d92 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1272,7 +1272,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
u32 *ids, prog_cnt, ids_len;
int ret;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!perfmon_capable())
return -EPERM;
if (event->attr.type != PERF_TYPE_TRACEPOINT)
return -EINVAL;
-- 
2.20.1




[PATCH v3 4/7] drm/i915/perf: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to i915_perf monitoring for CAP_SYS_PERFMON privileged processes.
For backward compatibility reasons, access to the i915_perf subsystem remains
open to CAP_SYS_ADMIN privileged processes, but using CAP_SYS_ADMIN for secure
i915_perf monitoring is discouraged in favor of the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 drivers/gpu/drm/i915/i915_perf.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index e42b86827d6b..e2697f8d04de 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -2748,10 +2748,10 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv,
/* Similar to perf's kernel.perf_paranoid_cpu sysctl option
 * we check a dev.i915.perf_stream_paranoid sysctl option
 * to determine if it's ok to access system wide OA counters
-* without CAP_SYS_ADMIN privileges.
+* without CAP_SYS_PERFMON or CAP_SYS_ADMIN privileges.
 */
if (privileged_op &&
-   i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+   i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to open system-wide i915 
perf stream\n");
ret = -EACCES;
goto err_ctx;
@@ -2939,9 +2939,8 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
} else
oa_freq_hz = 0;
 
-   if (oa_freq_hz > i915_oa_max_sample_rate &&
-   !capable(CAP_SYS_ADMIN)) {
-   DRM_DEBUG("OA exponent would exceed the max sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without root privileges\n",
+   if (oa_freq_hz > i915_oa_max_sample_rate && !perfmon_capable()) {
+   DRM_DEBUG("OA exponent would exceed the max sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without CAP_SYS_PERFMON or CAP_SYS_ADMIN privileges\n",
  i915_oa_max_sample_rate);
return -EACCES;
}
@@ -3328,7 +3327,7 @@ int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
}
 
-   if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+   if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to add i915 OA config\n");
return -EACCES;
}
@@ -3474,7 +3473,7 @@ int i915_perf_remove_config_ioctl(struct drm_device *dev, void *data,
return -ENOTSUPP;
}
 
-   if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+   if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to remove i915 OA config\n");
return -EACCES;
}
-- 
2.20.1




[PATCH v3 3/7] perf tool: extend Perf tool with CAP_SYS_PERFMON capability support

2019-12-16 Thread Alexey Budankov


Extend error messages to mention CAP_SYS_PERFMON capability as an option
to substitute CAP_SYS_ADMIN capability for secure system performance
monitoring and observability. Make perf_event_paranoid_check() aware of
the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 tools/perf/design.txt   |  3 ++-
 tools/perf/util/cap.h   |  4 
 tools/perf/util/evsel.c | 10 +-
 tools/perf/util/util.c  |  1 +
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/perf/design.txt b/tools/perf/design.txt
index 0453ba26cdbd..71755b3e1303 100644
--- a/tools/perf/design.txt
+++ b/tools/perf/design.txt
@@ -258,7 +258,8 @@ gets schedule to. Per task counters can be created by any user, for
 their own tasks.
 
 A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
-all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+all events on CPU-x. Per CPU counters need CAP_SYS_PERFMON or
+CAP_SYS_ADMIN privilege.
 
 The 'flags' parameter is currently unused and must be zero.
 
diff --git a/tools/perf/util/cap.h b/tools/perf/util/cap.h
index 051dc590ceee..0f79fbf6638b 100644
--- a/tools/perf/util/cap.h
+++ b/tools/perf/util/cap.h
@@ -29,4 +29,8 @@ static inline bool perf_cap__capable(int cap __maybe_unused)
 #define CAP_SYSLOG 34
 #endif
 
+#ifndef CAP_SYS_PERFMON
+#define CAP_SYS_PERFMON 38
+#endif
+
 #endif /* __PERF_CAP_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f4dea055b080..3a46325e3702 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2468,14 +2468,14 @@ int perf_evsel__open_strerror(struct evsel *evsel, struct target *target,
 "You may not have permission to collect %sstats.\n\n"
 "Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n"
 "which controls use of the performance events system by\n"
-"unprivileged users (without CAP_SYS_ADMIN).\n\n"
+"unprivileged users (without CAP_SYS_PERFMON or 
CAP_SYS_ADMIN).\n\n"
 "The current value is %d:\n\n"
 "  -1: Allow use of (almost) all events by all users\n"
 "  Ignore mlock limit after perf_event_mlock_kb without 
CAP_IPC_LOCK\n"
-">= 0: Disallow ftrace function tracepoint by users without 
CAP_SYS_ADMIN\n"
-"  Disallow raw tracepoint access by users without 
CAP_SYS_ADMIN\n"
-">= 1: Disallow CPU event access by users without 
CAP_SYS_ADMIN\n"
-">= 2: Disallow kernel profiling by users without 
CAP_SYS_ADMIN\n\n"
+">= 0: Disallow ftrace function tracepoint by users without 
CAP_SYS_PERFMON or CAP_SYS_ADMIN\n"
+"  Disallow raw tracepoint access by users without 
CAP_SYS_PERFMON or CAP_SYS_ADMIN\n"
+">= 1: Disallow CPU event access by users without 
CAP_SYS_PERFMON or CAP_SYS_ADMIN\n"
+">= 2: Disallow kernel profiling by users without 
CAP_SYS_PERFMON or CAP_SYS_ADMIN\n\n"
 "To make this setting permanent, edit /etc/sysctl.conf too, 
e.g.:\n\n"
 "  kernel.perf_event_paranoid = -1\n" ,
 target->system_wide ? "system-wide " : "",
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 969ae560dad9..9981db0d8d09 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -272,6 +272,7 @@ int perf_event_paranoid(void)
 bool perf_event_paranoid_check(int max_level)
 {
return perf_cap__capable(CAP_SYS_ADMIN) ||
+   perf_cap__capable(CAP_SYS_PERFMON) ||
perf_event_paranoid() <= max_level;
 }
 
-- 
2.20.1




[PATCH v3 2/7] perf/core: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to perf_events monitoring for CAP_SYS_PERFMON privileged processes.
For backward compatibility reasons access to perf_events subsystem remains open
for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for secure
perf_events monitoring is discouraged with respect to CAP_SYS_PERFMON 
capability.

Signed-off-by: Alexey Budankov 
---
 include/linux/perf_event.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 34c7c6910026..f46acd69425f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1285,7 +1285,7 @@ static inline int perf_is_paranoid(void)
 
 static inline int perf_allow_kernel(struct perf_event_attr *attr)
 {
-   if (sysctl_perf_event_paranoid > 1 && !capable(CAP_SYS_ADMIN))
+   if (sysctl_perf_event_paranoid > 1 && !perfmon_capable())
return -EACCES;
 
return security_perf_event_open(attr, PERF_SECURITY_KERNEL);
@@ -1293,7 +1293,7 @@ static inline int perf_allow_kernel(struct perf_event_attr *attr)
 
 static inline int perf_allow_cpu(struct perf_event_attr *attr)
 {
-   if (sysctl_perf_event_paranoid > 0 && !capable(CAP_SYS_ADMIN))
+   if (sysctl_perf_event_paranoid > 0 && !perfmon_capable())
return -EACCES;
 
return security_perf_event_open(attr, PERF_SECURITY_CPU);
@@ -1301,7 +1301,7 @@ static inline int perf_allow_cpu(struct perf_event_attr *attr)
 
 static inline int perf_allow_tracepoint(struct perf_event_attr *attr)
 {
-   if (sysctl_perf_event_paranoid > -1 && !capable(CAP_SYS_ADMIN))
+   if (sysctl_perf_event_paranoid > -1 && !perfmon_capable())
return -EPERM;
 
return security_perf_event_open(attr, PERF_SECURITY_TRACEPOINT);
-- 
2.20.1



[PATCH v3 1/7] capabilities: introduce CAP_SYS_PERFMON to kernel and user space

2019-12-16 Thread Alexey Budankov


Introduce CAP_SYS_PERFMON capability devoted to secure system performance
monitoring and observability so that CAP_SYS_PERFMON would assist
CAP_SYS_ADMIN capability in its governing role for perf_events, i915_perf
and other subsystems of the kernel.

CAP_SYS_PERFMON intends to harden system security and integrity during
system performance monitoring and observability by decreasing the attack
surface available to CAP_SYS_ADMIN privileged processes.

CAP_SYS_PERFMON intends to take over CAP_SYS_ADMIN credentials related to
system performance monitoring and observability and to balance the amount of
CAP_SYS_ADMIN credentials in accordance with the recommendations provided
in the man page for CAP_SYS_ADMIN [1]: "Note: this capability is overloaded;
see Notes to kernel developers, below."

[1] http://man7.org/linux/man-pages/man7/capabilities.7.html

Signed-off-by: Alexey Budankov 
---
 include/linux/capability.h  | 1 +
 include/uapi/linux/capability.h | 8 +++-
 security/selinux/include/classmap.h | 4 ++--
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index ecce0f43c73a..6342502c4c2a 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -251,6 +251,7 @@ extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct
 extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
 extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
 extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
+#define perfmon_capable() (capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN))
 
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 240fdb9a60f6..98e03cc76c7c 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -366,8 +366,14 @@ struct vfs_ns_cap_data {
 
 #define CAP_AUDIT_READ 37
 
+/*
+ * Allow system performance and observability privileged operations
+ * using perf_events, i915_perf and other kernel subsystems
+ */
+
+#define CAP_SYS_PERFMON 38
 
-#define CAP_LAST_CAP CAP_AUDIT_READ
+#define CAP_LAST_CAP CAP_SYS_PERFMON
 
 #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
 
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 7db24855e12d..bae602c623b0 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,9 @@
"audit_control", "setfcap"
 
 #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \
-   "wake_alarm", "block_suspend", "audit_read"
+   "wake_alarm", "block_suspend", "audit_read", "sys_perfmon"
 
-#if CAP_LAST_CAP > CAP_AUDIT_READ
+#if CAP_LAST_CAP > CAP_SYS_PERFMON
 #error New capability defined, please update COMMON_CAP2_PERMS.
 #endif
 
-- 
2.20.1



[PATCH v3 0/7] Introduce CAP_SYS_PERFMON to secure system performance monitoring and observability

2019-12-16 Thread Alexey Budankov


Currently access to perf_events, i915_perf and other performance monitoring and
observability subsystems of the kernel is open for a privileged process [1] with
CAP_SYS_ADMIN capability enabled in the process effective set [2].

This patch set introduces CAP_SYS_PERFMON capability devoted to secure system
performance monitoring and observability operations so that CAP_SYS_PERFMON
would assist CAP_SYS_ADMIN capability in its governing role for perf_events,
i915_perf and other performance monitoring and observability subsystems of
the kernel.

CAP_SYS_PERFMON intends to meet the demand to secure system performance
monitoring and observability operations in security-sensitive, restricted,
production environments (e.g. HPC clusters, cloud and virtual compute
environments) where root or CAP_SYS_ADMIN credentials are not available to
mass users of a system because of security considerations.

CAP_SYS_PERFMON intends to harden system security and integrity during system
performance monitoring and observability operations by decreasing the attack
surface available to CAP_SYS_ADMIN privileged processes [2].

CAP_SYS_PERFMON intends to take over CAP_SYS_ADMIN credentials related to system
performance monitoring and observability operations and to balance the amount of
CAP_SYS_ADMIN credentials following the recommendations in the capabilities man
page [2] for CAP_SYS_ADMIN: "Note: this capability is overloaded; see Notes to
kernel developers, below."

For backward compatibility reasons, access to the system performance monitoring
and observability subsystems of the kernel remains open to CAP_SYS_ADMIN
privileged processes, but using CAP_SYS_ADMIN for secure system performance
monitoring and observability operations is discouraged in favor of the
introduced CAP_SYS_PERFMON capability.

The patch set is for tip perf/core repository:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip perf/core
sha1: ceb9e77324fa661b1001a0ae66f061b5fcb4e4e6

---
Changes in v3:
- implemented a perfmon_capable() macro aggregating the required capability checks
Changes in v2:
- made perf_events trace points available to CAP_SYS_PERFMON privileged
  processes
- made perf_event_paranoid_check() treat CAP_SYS_PERFMON equally to
  CAP_SYS_ADMIN
- applied CAP_SYS_PERFMON to i915_perf, bpf_trace, powerpc and parisc system
  performance monitoring and observability related subsystems

---
Alexey Budankov (7):
  capabilities: introduce CAP_SYS_PERFMON to kernel and user space
  perf/core: open access for CAP_SYS_PERFMON privileged process
  perf tool: extend Perf tool with CAP_SYS_PERFMON capability support
  drm/i915/perf: open access for CAP_SYS_PERFMON privileged process
  trace/bpf_trace: open access for CAP_SYS_PERFMON privileged process
  powerpc/perf: open access for CAP_SYS_PERFMON privileged process
  parisc/perf: open access for CAP_SYS_PERFMON privileged process

 arch/parisc/kernel/perf.c   |  2 +-
 arch/powerpc/perf/imc-pmu.c |  4 ++--
 drivers/gpu/drm/i915/i915_perf.c| 13 ++---
 include/linux/capability.h  |  1 +
 include/linux/perf_event.h  |  6 +++---
 include/uapi/linux/capability.h |  8 +++-
 kernel/trace/bpf_trace.c|  2 +-
 security/selinux/include/classmap.h |  4 ++--
 tools/perf/design.txt   |  3 ++-
 tools/perf/util/cap.h   |  4 
 tools/perf/util/evsel.c | 10 +-
 tools/perf/util/util.c  |  1 +
 12 files changed, 35 insertions(+), 23 deletions(-)

---
Testing and validation (Intel Skylake, 8 cores, Fedora 29, 5.4.0-rc8+, x86_64):

The libcap library [3], [4] and the perf tool can be used to apply the
CAP_SYS_PERFMON capability for secure system performance monitoring and
observability beyond the scope permitted by the system-wide
perf_event_paranoid kernel setting [5]. Below are the steps for evaluation:

  - patch, build and boot the kernel
  - patch, build Perf tool e.g. to /home/user/perf
  ...
  # git clone git://git.kernel.org/pub/scm/libs/libcap/libcap.git libcap
  # pushd libcap
  # patch libcap/include/uapi/linux/capabilities.h with [PATCH 1]
  # make
  # pushd progs
  # ./setcap "cap_sys_perfmon,cap_sys_ptrace,cap_syslog=ep" /home/user/perf
  # ./setcap -v "cap_sys_perfmon,cap_sys_ptrace,cap_syslog=ep" /home/user/perf
  /home/user/perf: OK
  # ./getcap /home/user/perf
  /home/user/perf = cap_sys_ptrace,cap_syslog,cap_sys_perfmon+ep
  # echo 2 > /proc/sys/kernel/perf_event_paranoid
  # cat /proc/sys/kernel/perf_event_paranoid 
  2
  ...
  $ /home/user/perf top
... works as expected ...
  $ cat /proc/`pidof perf`/status
  Name:     perf
  Umask:    0002
  State:    S (sleeping)
  Tgid:     2958
  Ngid:     0
  Pid:      2958
  PPid:     9847
  TracerPid:    0
  Uid:  500 500 500 500
  Gid:  500 500 500 500
  FDSize:   256
  ...
  CapInh:   0000000000000000
  CapPrm:   0000004400080000
  CapEff:   0000004400080000  (bit 19 = cap_sys_ptrace, bit 34 = cap_syslog, bit 38 = cap_sys_perfmon)

RE: [PATCH v2 2/7] perf/core: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Lubashev, Igor
On Mon, Dec 16, 2019 at 2:15 AM, Alexey Budankov 
 wrote:
> 
> Open access to perf_events monitoring for CAP_SYS_PERFMON privileged
> processes.
> For backward compatibility reasons access to perf_events subsystem remains
> open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage
> for secure perf_events monitoring is discouraged with respect to
> CAP_SYS_PERFMON capability.
> 
> Signed-off-by: Alexey Budankov 
> ---
>  include/linux/perf_event.h | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index
> 34c7c6910026..52313d2cc343 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1285,7 +1285,8 @@ static inline int perf_is_paranoid(void)
> 
>  static inline int perf_allow_kernel(struct perf_event_attr *attr)  {
> - if (sysctl_perf_event_paranoid > 1 && !capable(CAP_SYS_ADMIN))
> + if (sysctl_perf_event_paranoid > 1 &&
> +!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>   return -EACCES;
> 
>   return security_perf_event_open(attr, PERF_SECURITY_KERNEL); @@
> -1293,7 +1294,8 @@ static inline int perf_allow_kernel(struct
> perf_event_attr *attr)
> 
>  static inline int perf_allow_cpu(struct perf_event_attr *attr)  {
> - if (sysctl_perf_event_paranoid > 0 && !capable(CAP_SYS_ADMIN))
> + if (sysctl_perf_event_paranoid > 0 &&
> + !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>   return -EACCES;
> 
>   return security_perf_event_open(attr, PERF_SECURITY_CPU); @@ -
> 1301,7 +1303,8 @@ static inline int perf_allow_cpu(struct perf_event_attr
> *attr)
> 
>  static inline int perf_allow_tracepoint(struct perf_event_attr *attr)  {
> - if (sysctl_perf_event_paranoid > -1 && !capable(CAP_SYS_ADMIN))
> + if (sysctl_perf_event_paranoid > -1 &&
> + !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>   return -EPERM;
> 
>   return security_perf_event_open(attr, PERF_SECURITY_TRACEPOINT);
> --
> 2.20.1

Thanks.  I like the idea of CAP_SYS_PERFMON that does not require
CAP_SYS_ADMIN.  It makes granting users the ability to run perf a bit safer.

I see a lot of "(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN))" constructs
now.  Maybe wrap it in an inline bool perfmon_capable() defined
somewhere (like in /include/linux/capability.h)?
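For instance, a minimal sketch (illustrative only, not an actual patch), placed
next to the other helpers in include/linux/capability.h:

	/*
	 * Sketch: treat CAP_SYS_PERFMON as sufficient, falling back to
	 * CAP_SYS_ADMIN for backward compatibility.
	 */
	static inline bool perfmon_capable(void)
	{
		return capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN);
	}

so that the call sites above become e.g.
"if (sysctl_perf_event_paranoid > 1 && !perfmon_capable())".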

- Igor


Re: [PATCH v2 1/7] capabilities: introduce CAP_SYS_PERFMON to kernel and user space

2019-12-16 Thread Stephen Smalley

On 12/16/19 2:14 AM, Alexey Budankov wrote:


Introduce CAP_SYS_PERFMON capability devoted to secure system performance
monitoring and observability operations so that CAP_SYS_PERFMON would assist
CAP_SYS_ADMIN capability in its governing role for perf_events, i915_perf
and other performance monitoring and observability subsystems of the kernel.

CAP_SYS_PERFMON intends to harden system security and integrity during
system performance monitoring and observability operations by decreasing
attack surface that is available to CAP_SYS_ADMIN privileged processes.

CAP_SYS_PERFMON intends to take over CAP_SYS_ADMIN credentials related to
system performance monitoring and observability operations and balance the amount
of CAP_SYS_ADMIN credentials, following the recommendations provided
in the capabilities man page [1] for CAP_SYS_ADMIN: "Note: this capability
is overloaded; see Notes to kernel developers, below."

[1] http://man7.org/linux/man-pages/man7/capabilities.7.html

Signed-off-by: Alexey Budankov 


Acked-by: Stephen Smalley 


---
  include/uapi/linux/capability.h | 8 +++-
  security/selinux/include/classmap.h | 4 ++--
  2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 240fdb9a60f6..7d1f8606c3e6 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -366,8 +366,14 @@ struct vfs_ns_cap_data {
  
  #define CAP_AUDIT_READ		37
  
+/*
+ * Allow system performance and observability privileged operations
+ * using perf_events, i915_perf and other kernel subsystems
+ */
+
+#define CAP_SYS_PERFMON		38
  
-#define CAP_LAST_CAP CAP_AUDIT_READ
+#define CAP_LAST_CAP CAP_SYS_PERFMON
  
  #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
  
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 7db24855e12d..bae602c623b0 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,9 @@
"audit_control", "setfcap"
  
  #define COMMON_CAP2_PERMS  "mac_override", "mac_admin", "syslog", \

-   "wake_alarm", "block_suspend", "audit_read"
+   "wake_alarm", "block_suspend", "audit_read", "sys_perfmon"
  
-#if CAP_LAST_CAP > CAP_AUDIT_READ
+#if CAP_LAST_CAP > CAP_SYS_PERFMON
  #error New capability defined, please update COMMON_CAP2_PERMS.
  #endif
  





[PATCH] powerpc/83xx: update kmeter1 defconfig and dts

2019-12-16 Thread Holger Brunck
From: Matteo Ghidoni 

The defconfig is synchronized and the missing
MTD_PHYSMAP, DEVTMPFS and I2C MUX support are switched on.

Additionally the I2C mux device is added to the DTS with
its attached temperature sensors and I2C clock frequency
is lowered.

Signed-off-by: Matteo Ghidoni 
Signed-off-by: Holger Brunck 
CC: Heiko Schocher 
CC: Michael Ellerman 
---
 arch/powerpc/boot/dts/kmeter1.dts   | 40 -
 arch/powerpc/configs/83xx/kmeter1_defconfig | 20 +--
 2 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/boot/dts/kmeter1.dts b/arch/powerpc/boot/dts/kmeter1.dts
index 154f5d293fd3..bc33f3ad19a3 100644
--- a/arch/powerpc/boot/dts/kmeter1.dts
+++ b/arch/powerpc/boot/dts/kmeter1.dts
@@ -70,7 +70,45 @@
reg = <0x3000 0x100>;
interrupts = <14 0x8>;
interrupt-parent = <>;
-   clock-frequency = <400000>;
+   clock-frequency = <100000>;
+
+   mux@70 {
+   compatible = "nxp,pca9547";
+   reg = <0x70>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   i2c@2 {
+   reg = <2>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   /* Temperature sensors */
+   temp@48 {
+   label = "Top";
+   compatible = "national,lm75";
+   reg = <0x48>;
+   };
+
+   temp@49 {
+   label = "Control";
+   compatible = "national,lm75";
+   reg = <0x49>;
+   };
+
+   temp@4a {
+   label = "Power";
+   compatible = "national,lm75";
+   reg = <0x4a>;
+   };
+
+   temp@4b {
+   label = "Front";
+   compatible = "national,lm75";
+   reg = <0x4b>;
+   };
+   };
+   };
};
 
serial0: serial@4500 {
diff --git a/arch/powerpc/configs/83xx/kmeter1_defconfig b/arch/powerpc/configs/83xx/kmeter1_defconfig
index 648c6b3dccf9..72abd8ae654a 100644
--- a/arch/powerpc/configs/83xx/kmeter1_defconfig
+++ b/arch/powerpc/configs/83xx/kmeter1_defconfig
@@ -3,22 +3,20 @@ CONFIG_SYSVIPC=y
 CONFIG_POSIX_MQUEUE=y
 CONFIG_NO_HZ=y
 CONFIG_HIGH_RES_TIMERS=y
+CONFIG_PREEMPT=y
 CONFIG_LOG_BUF_SHIFT=14
 CONFIG_EXPERT=y
 CONFIG_SLAB=y
-CONFIG_MODULES=y
-CONFIG_MODULE_UNLOAD=y
-# CONFIG_BLK_DEV_BSG is not set
-CONFIG_PARTITION_ADVANCED=y
-# CONFIG_MSDOS_PARTITION is not set
-# CONFIG_IOSCHED_DEADLINE is not set
-# CONFIG_IOSCHED_CFQ is not set
 # CONFIG_PPC_CHRP is not set
 # CONFIG_PPC_PMAC is not set
 CONFIG_PPC_83xx=y
 CONFIG_KMETER1=y
-CONFIG_PREEMPT=y
 # CONFIG_SECCOMP is not set
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+# CONFIG_BLK_DEV_BSG is not set
+CONFIG_PARTITION_ADVANCED=y
+# CONFIG_MSDOS_PARTITION is not set
 CONFIG_NET=y
 CONFIG_PACKET=y
 CONFIG_UNIX=y
@@ -29,12 +27,15 @@ CONFIG_IP_PNP=y
 CONFIG_TIPC=y
 CONFIG_BRIDGE=m
 CONFIG_VLAN_8021Q=y
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
 CONFIG_MTD=y
 CONFIG_MTD_CMDLINE_PARTS=y
 CONFIG_MTD_BLOCK=y
 CONFIG_MTD_CFI=y
 CONFIG_MTD_CFI_INTELEXT=y
 CONFIG_MTD_CFI_AMDSTD=y
+CONFIG_MTD_PHYSMAP=y
 CONFIG_MTD_PHYSMAP_OF=y
 CONFIG_MTD_PHRAM=y
 CONFIG_MTD_UBI=y
@@ -57,7 +58,10 @@ CONFIG_SERIAL_8250_CONSOLE=y
 CONFIG_HW_RANDOM=y
 CONFIG_I2C=y
 CONFIG_I2C_CHARDEV=y
+CONFIG_I2C_MUX=y
+CONFIG_I2C_MUX_PCA954x=y
 CONFIG_I2C_MPC=y
+CONFIG_GPIOLIB=y
 # CONFIG_HWMON is not set
 # CONFIG_USB_SUPPORT is not set
 CONFIG_UIO=y
-- 
2.11.1



Re: [PATCH v18 11/13] open: introduce openat2(2) syscall

2019-12-16 Thread Florian Weimer
* Aleksa Sarai:

> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 1d338357df8a..58c3a0e543c6 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -93,5 +93,40 @@
>  
>  #define AT_RECURSIVE 0x8000  /* Apply to the entire subtree */
>  
> +/*
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates very similarly to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> + * O_TMPFILE} are set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __aligned_u64 flags;
> + __u16 mode;
> + __u16 __padding[3]; /* must be zeroed */
> + __aligned_u64 resolve;
> +};
> +
> +#define OPEN_HOW_SIZE_VER0   24 /* sizeof first published struct */
> +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
> +
> +/* how->resolve flags for openat2(2). */
> +#define RESOLVE_NO_XDEV  0x01 /* Block mount-point crossings
> + (includes bind-mounts). */
> +#define RESOLVE_NO_MAGICLINKS 0x02 /* Block traversal through procfs-style
> + "magic-links". */
> +#define RESOLVE_NO_SYMLINKS  0x04 /* Block traversal through all symlinks
> + (implies OEXT_NO_MAGICLINKS) */
> +#define RESOLVE_BENEATH  0x08 /* Block "lexical" trickery like
> + "..", symlinks, and absolute
> + paths which escape the dirfd. */
> +#define RESOLVE_IN_ROOT  0x10 /* Make all jumps to "/" and ".."
> + be scoped inside the dirfd
> + (similar to chroot(2)). */
>  
>  #endif /* _UAPI_LINUX_FCNTL_H */

Would it be possible to move these to a new UAPI header?

In glibc, we currently do not #include .  We need some of
the AT_* constants in POSIX mode, and the header is not necessarily
namespace-clean.  If there was a separate header for openat2 support, we
could use that easily, and we would only have to maintain the baseline
definitions (which never change).
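
As a purely illustrative sketch (the header name and layout are a guess on my
part, not something proposed in this series), the split could look like:

	/* include/uapi/linux/openat2.h -- hypothetical */
	#ifndef _UAPI_LINUX_OPENAT2_H
	#define _UAPI_LINUX_OPENAT2_H

	#include <linux/types.h>

	struct open_how {
		__aligned_u64 flags;
		__u16 mode;
		__u16 __padding[3];	/* must be zeroed */
		__aligned_u64 resolve;
	};

	#define RESOLVE_NO_XDEV		0x01
	#define RESOLVE_NO_MAGICLINKS	0x02
	#define RESOLVE_NO_SYMLINKS	0x04
	#define RESOLVE_BENEATH		0x08
	#define RESOLVE_IN_ROOT		0x10

	#endif /* _UAPI_LINUX_OPENAT2_H */

fcntl.h could then include it for the kernel's benefit, while glibc includes it
directly.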

Thanks,
Florian



Re: [PATCH v2 2/7] perf/core: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


On 16.12.2019 19:12, Lubashev, Igor wrote:
> On Mon, Dec 16, 2019 at 2:15 AM, Alexey Budankov 
>  wrote:
>>
>> Open access to perf_events monitoring for CAP_SYS_PERFMON privileged
>> processes.
>> For backward compatibility reasons access to perf_events subsystem remains
>> open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage
>> for secure perf_events monitoring is discouraged with respect to
>> CAP_SYS_PERFMON capability.
>>
>> Signed-off-by: Alexey Budankov 
>> ---
>>  include/linux/perf_event.h | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index
>> 34c7c6910026..52313d2cc343 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1285,7 +1285,8 @@ static inline int perf_is_paranoid(void)
>>
>>  static inline int perf_allow_kernel(struct perf_event_attr *attr)  {
>> -if (sysctl_perf_event_paranoid > 1 && !capable(CAP_SYS_ADMIN))
>> +if (sysctl_perf_event_paranoid > 1 &&
>> +   !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>>  return -EACCES;
>>
>>  return security_perf_event_open(attr, PERF_SECURITY_KERNEL); @@
>> -1293,7 +1294,8 @@ static inline int perf_allow_kernel(struct
>> perf_event_attr *attr)
>>
>>  static inline int perf_allow_cpu(struct perf_event_attr *attr)  {
>> -if (sysctl_perf_event_paranoid > 0 && !capable(CAP_SYS_ADMIN))
>> +if (sysctl_perf_event_paranoid > 0 &&
>> +!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>>  return -EACCES;
>>
>>  return security_perf_event_open(attr, PERF_SECURITY_CPU); @@ -
>> 1301,7 +1303,8 @@ static inline int perf_allow_cpu(struct perf_event_attr
>> *attr)
>>
>>  static inline int perf_allow_tracepoint(struct perf_event_attr *attr)  {
>> -if (sysctl_perf_event_paranoid > -1 && !capable(CAP_SYS_ADMIN))
>> +if (sysctl_perf_event_paranoid > -1 &&
>> +!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>>  return -EPERM;
>>
>>  return security_perf_event_open(attr, PERF_SECURITY_TRACEPOINT);
>> --
>> 2.20.1
> 
> Thanks.  I like the idea of CAP_SYS_PERFMON that does not require 
> CAP_SYS_ADMIN.  It makes granting users the ability to run perf a bit safer.
> 
> I see a lot of "(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN))" 
> constructs now.  Maybe wrap it in an inline bool perfmon_capable() 
> defined somewhere (like in /include/linux/capability.h)?

Sounds reasonable, thanks!

~Alexey

> 
> - Igor
> 


Re: [PATCH v2 2/7] perf/core: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov
On 16.12.2019 19:12, Lubashev, Igor wrote:
> On Mon, Dec 16, 2019 at 2:15 AM, Alexey Budankov 
>  wrote:
>>
>> Open access to perf_events monitoring for CAP_SYS_PERFMON privileged
>> processes.
>> For backward compatibility reasons access to perf_events subsystem remains
>> open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage
>> for secure perf_events monitoring is discouraged with respect to
>> CAP_SYS_PERFMON capability.
>>
>> Signed-off-by: Alexey Budankov 
>> ---
>>  include/linux/perf_event.h | 9 ++---
>>  1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index
>> 34c7c6910026..52313d2cc343 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1285,7 +1285,8 @@ static inline int perf_is_paranoid(void)
>>
>>  static inline int perf_allow_kernel(struct perf_event_attr *attr)  {
>> -if (sysctl_perf_event_paranoid > 1 && !capable(CAP_SYS_ADMIN))
>> +if (sysctl_perf_event_paranoid > 1 &&
>> +   !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>>  return -EACCES;
>>
>>  return security_perf_event_open(attr, PERF_SECURITY_KERNEL); @@
>> -1293,7 +1294,8 @@ static inline int perf_allow_kernel(struct
>> perf_event_attr *attr)
>>
>>  static inline int perf_allow_cpu(struct perf_event_attr *attr)  {
>> -if (sysctl_perf_event_paranoid > 0 && !capable(CAP_SYS_ADMIN))
>> +if (sysctl_perf_event_paranoid > 0 &&
>> +!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>>  return -EACCES;
>>
>>  return security_perf_event_open(attr, PERF_SECURITY_CPU); @@ -
>> 1301,7 +1303,8 @@ static inline int perf_allow_cpu(struct perf_event_attr
>> *attr)
>>
>>  static inline int perf_allow_tracepoint(struct perf_event_attr *attr)  {
>> -if (sysctl_perf_event_paranoid > -1 && !capable(CAP_SYS_ADMIN))
>> +if (sysctl_perf_event_paranoid > -1 &&
>> +!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
>>  return -EPERM;
>>
>>  return security_perf_event_open(attr, PERF_SECURITY_TRACEPOINT);
>> --
>> 2.20.1
> 
> Thanks.  I like the idea of CAP_SYS_PERFMON that does not require 
> CAP_SYS_ADMIN.  It makes granting users the ability to run perf a bit safer.
> 
> I see a lot of "(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN))" 
> constructs now.  Maybe wrap it in an inline bool perfmon_capable() 
> defined somewhere (like in /include/linux/capability.h)?

Yes, it makes sense.

Thanks,
Alexey

> 
> - Igor
> 


Re: [PATCH net v2] net/ibmvnic: Fix typo in retry check

2019-12-16 Thread Thomas Falcon



On 12/13/19 7:27 PM, Jakub Kicinski wrote:

On Wed, 11 Dec 2019 09:38:39 -0600, Thomas Falcon wrote:

This conditional is missing a bang, with the intent
being to break when the retry count reaches zero.

Fixes: 476d96ca9c ("ibmvnic: Bound waits for device queries")
Suggested-by: Juliet Kim 
Signed-off-by: Thomas Falcon 

Ah damn, looks like this originates from my pseudo code.

I had to fix the fixes tag:

Commit: 847496ccfa22 ("net/ibmvnic: Fix typo in retry check")
Fixes tag: Fixes: 476d96ca9c ("ibmvnic: Bound waits for device queries")
Has these problem(s):
- SHA1 should be at least 12 digits long
  Can be fixed by setting core.abbrev to 12 (or more) or (for git v2.11
  or later) just making sure it is not set (or set to "auto").


Thanks, I'll keep that in mind next time.  IIRC I was making some last 
minute cosmetic changes before sending, and it might have slipped in 
that way.  In any case, I should have been more thorough in testing it.
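
(For the record: "git config --global core.abbrev 12", or simply leaving
core.abbrev unset on git v2.11 or later, makes git emit SHA1s long enough
for Fixes: tags.)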


Thanks,

Tom





Applied to net, thanks!


diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index efb0f10..2d84523 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -184,7 +184,7 @@ static int ibmvnic_wait_for_completion(struct ibmvnic_adapter *adapter,
netdev_err(netdev, "Device down!\n");
return -ENODEV;
}
-   if (retry--)
+   if (!retry--)
break;
if (wait_for_completion_timeout(comp_done, div_timeout))
return 0;


Re: [v5 3/3] Documentation: dt: binding: fsl: Add 'fsl,ippdexpcr1-alt-addr' property

2019-12-16 Thread Rob Herring
On Tue,  3 Dec 2019 20:28:18 +0800, Biwen Li wrote:
> The 'fsl,ippdexpcr1-alt-addr' property is used to handle an errata A-008646
> on LS1021A
> 
> Signed-off-by: Biwen Li 
> ---
> Change in v5:
>   - none
> 
> Change in v4:
>   - rename property name
> fsl,ippdexpcr-alt-addr -> fsl,ippdexpcr1-alt-addr
> 
> Change in v3:
>   - rename property name
> fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr
> 
> Change in v2:
>   - update desc of the property
> 'fsl,rcpm-scfg'
> 
>  .../devicetree/bindings/soc/fsl/rcpm.txt  | 21 +++
>  1 file changed, 21 insertions(+)
> 

Reviewed-by: Rob Herring 


Re: [PATCH 1/2] spi: fsl: don't map irq during probe

2019-12-16 Thread Sasha Levin
Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 500a32abaf81 ("spi: fsl: Call irq_dispose_mapping in err path").

The bot has tested the following trees: v5.4.2, v5.3.15, v4.19.88, v4.14.158.

v5.4.2: Build OK!
v5.3.15: Failed to apply! Possible dependencies:
Unable to calculate

v4.19.88: Failed to apply! Possible dependencies:
Unable to calculate

v4.14.158: Failed to apply! Possible dependencies:
Unable to calculate


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks,
Sasha


[PATCH v5 5/5] powerpc/numa: Remove late request for home node associativity

2019-12-16 Thread Srikar Dronamraju
With the previous commit ("powerpc/numa: Early request for home node
associativity"), commit 2ea626306810 ("powerpc/topology: Get topology for
shared processors at boot"), which was requesting home node associativity,
becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
 arch/powerpc/include/asm/topology.h | 4 
 arch/powerpc/kernel/smp.c   | 5 -
 arch/powerpc/mm/numa.c  | 9 -
 3 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea5089e..9bd396fba7c2 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,7 +98,6 @@ extern int stop_topology_update(void);
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
-extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -121,9 +120,6 @@ static inline int timed_topology_update(int nsecs)
return 0;
 }
 
-#ifdef CONFIG_SMP
-static inline void shared_proc_topology_init(void) {}
-#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..cdd39a0bc49d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1359,11 +1359,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
-   /*
-* On a shared LPAR, associativity needs to be requested.
-* Hence, get numa topology before dumping cpu topology
-*/
-   shared_proc_topology_init();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f837a0e725bc..3ba8dc0b7bc5 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1628,15 +1628,6 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
-void __init shared_proc_topology_init(void)
-{
-   if (lppaca_shared_proc(get_lppaca())) {
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-   numa_update_cpu_topology(false);
-   }
-}
-
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
2.18.1



[PATCH v5 4/5] powerpc/numa: Early request for home node associativity

2019-12-16 Thread Srikar Dronamraju
Currently the kernel detects if it's running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are set up. However, between the time NUMA setup is initialized and the
request for home node associativity, the workqueue code initializes its per
node cpumask. The per node workqueue possible cpumask may turn invalid
after the home node associativity update, resulting in weird situations like
the workqueue possible cpumask being a subset of the workqueue online cpumask.

This can be fixed by requesting home node associativity earlier, just
before NUMA setup. However, at NUMA setup time the kernel may not be in
a position to detect if it's running on a shared lpar platform. So
request home node associativity and, if the request fails, fall back
on the device tree property.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v2->v3):
- Handled comments from Nathan Lynch
  * Use first thread of the core for cpu-to-node map.
  * get hardware-id in numa_setup_cpu

Changelog (v1->v2):
- Handled comments from Nathan Lynch
  * Dont depend on pacas to be setup for the hwid


 arch/powerpc/mm/numa.c | 45 +-
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 63ec0c3c817f..f837a0e725bc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -461,13 +461,27 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return nid;
 }
 
+static int vphn_get_nid(long hwid)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long rc;
+
+   rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
+   if (rc == H_SUCCESS)
+   return associativity_to_nid(associativity);
+
+   return NUMA_NO_NODE;
+}
+
 /*
  * Figure out to which domain a cpu belongs and stick it there.
+ * cpu_to_phys_id is only valid between smp_setup_cpu_maps() and
+ * smp_setup_pacas(). If called outside this window, set get_hwid to true.
  * Return the id of the domain used.
  */
-static int numa_setup_cpu(unsigned long lcpu)
+static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
 {
-   struct device_node *cpu;
+   struct device_node *cpu = NULL;
int fcpu = cpu_first_thread_sibling(lcpu);
int nid = NUMA_NO_NODE;
 
@@ -485,6 +499,27 @@ static int numa_setup_cpu(unsigned long lcpu)
return nid;
}
 
+   /*
+* On a shared lpar, device tree will not have node associativity.
+* At this time lppaca, or its __old_status field may not be
+* updated. Hence kernel cannot detect if it's on a shared lpar. So
+* request an explicit associativity irrespective of whether the
+* lpar is shared or dedicated. Use the device tree property as a
+* fallback.
+*/
+   if (firmware_has_feature(FW_FEATURE_VPHN)) {
+   long hwid;
+
+   if (get_hwid)
+   hwid = get_hard_smp_processor_id(lcpu);
+   else
+   hwid = cpu_to_phys_id[lcpu];
+   nid = vphn_get_nid(hwid);
+   }
+
+   if (nid != NUMA_NO_NODE)
+   goto out_present;
+
cpu = of_get_cpu_node(lcpu, NULL);
 
if (!cpu) {
@@ -496,6 +531,7 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
 
 out_present:
if (nid < 0 || !node_possible(nid))
@@ -515,7 +551,6 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
map_cpu_to_node(lcpu, nid);
-   of_node_put(cpu);
 out:
return nid;
 }
@@ -546,7 +581,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
 {
int nid;
 
-   nid = numa_setup_cpu(cpu);
+   nid = numa_setup_cpu(cpu, true);
verify_cpu_node_mapping(cpu, nid);
return 0;
 }
@@ -893,7 +928,7 @@ void __init mem_topology_setup(void)
reset_numa_cpu_lookup_table();
 
for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   numa_setup_cpu(cpu, false);
 }
 
 void __init initmem_init(void)
-- 
2.18.1



[PATCH v5 2/5] powerpc/numa: Handle extra hcall_vphn error cases

2019-12-16 Thread Srikar Dronamraju
Currently the code handles the H_FUNCTION, H_SUCCESS and H_HARDWARE return
codes. However, hcall_vphn can return other return codes. Now it also handles
the H_PARAMETER return code, and the remaining return codes are handled under
the default case.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v2->v1):
 Handled comments from Nathan:
  - Split patch from patch 1.
  - Corrected a problem where I missed calling stop_topology_update().
  - Using pr_err_ratelimited instead of printk.

 arch/powerpc/mm/numa.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..8fbe57c29153 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1191,23 +1191,30 @@ static long vphn_get_associativity(unsigned long cpu,
VPHN_FLAG_VCPU, associativity);
 
switch (rc) {
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   goto out;
+
case H_FUNCTION:
-   printk_once(KERN_INFO
-   "VPHN is not supported. Disabling polling...\n");
-   stop_topology_update();
+   pr_err_ratelimited("VPHN unsupported. Disabling polling...\n");
break;
case H_HARDWARE:
-   printk(KERN_ERR
-   "hcall_vphn() experienced a hardware fault "
+   pr_err_ratelimited("hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
-   stop_topology_update();
break;
-   case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
-   timed_topology_update(0);
+   case H_PARAMETER:
+   pr_err_ratelimited("hcall_vphn() was passed an invalid 
parameter. "
+   "Disabling polling...\n");
+   break;
+   default:
+   pr_err_ratelimited("hcall_vphn() returned %ld. Disabling 
polling...\n"
+   , rc);
break;
}
 
+   stop_topology_update();
+out:
return rc;
 }
 
-- 
2.18.1



[PATCH v5 3/5] powerpc/numa: Use cpu node map of first sibling thread

2019-12-16 Thread Srikar Dronamraju
All the sibling threads of a core have to be part of the same node.
To ensure that all the sibling threads map to the same node, always
lookup/update the cpu-to-node map of the first thread in the core.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog: v3->v4
As suggested by Nathan, add a warning while mapping offline cpu.

 arch/powerpc/mm/numa.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8fbe57c29153..63ec0c3c817f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -467,15 +467,20 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
  */
 static int numa_setup_cpu(unsigned long lcpu)
 {
-   int nid = NUMA_NO_NODE;
struct device_node *cpu;
+   int fcpu = cpu_first_thread_sibling(lcpu);
+   int nid = NUMA_NO_NODE;
 
/*
 * If a valid cpu-to-node mapping is already available, use it
 * directly instead of querying the firmware, since it represents
 * the most recent mapping notified to us by the platform (eg: VPHN).
+* Since cpu_to_node binding remains the same for all threads in the
+* core. If a valid cpu-to-node mapping is already available, for
+* the first thread in the core, use it.
 */
-   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+   nid = numa_cpu_lookup_table[fcpu];
+   if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}
@@ -496,6 +501,19 @@ static int numa_setup_cpu(unsigned long lcpu)
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
+   /*
+* Update for the first thread of the core. All threads of a core
+* have to be part of the same node. This not only avoids querying
+* for every other thread in the core, but always avoids a case
+* where virtual node associativity change causes subsequent threads
+* of a core to be associated with different nid. However if first
+* thread is already online, expect it to have a valid mapping.
+*/
+   if (fcpu != lcpu) {
+   WARN_ON(cpu_online(fcpu));
+   map_cpu_to_node(fcpu, nid);
+   }
+
map_cpu_to_node(lcpu, nid);
of_node_put(cpu);
 out:
-- 
2.18.1



[PATCH v5 0/5] Early node associativity

2019-12-16 Thread Srikar Dronamraju
Note: (Only patch 3 changes from v3)

Abdul reported  a warning on a shared lpar.
"WARNING: workqueue cpumask: online intersect > possible intersect".
This is because per node workqueue possible mask is set very early in the
boot process even before the system was querying the home node
associativity. However per node workqueue online cpumask gets updated
dynamically. Hence there is a chance when per node workqueue online cpumask
is a superset of per node workqueue possible mask.

Link for v4: https://patchwork.ozlabs.org/patch/1161979
Changelog: v4->v5:
- Rebased to v5.5-rc2

Link for v3: 
http://lkml.kernel.org/r/20190906135020.19772-1-sri...@linux.vnet.ibm.com
Changelog: v3->v4
 - Added a warning as suggested by Nathan Lynch.

Link for v2: 
http://lkml.kernel.org/r/20190829055023.6171-1-sri...@linux.vnet.ibm.com
Changelog: v2->v3
 - Handled comments from Nathan Lynch.

Link for v1: https://patchwork.ozlabs.org/patch/1151658
Changelog: v1->v2
 - Handled comments from Nathan Lynch.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Abdul Haleem 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 

Srikar Dronamraju (5):
  powerpc/vphn: Check for error from hcall_vphn
  powerpc/numa: Handle extra hcall_vphn error cases
  powerpc/numa: Use cpu node map of first sibling thread
  powerpc/numa: Early request for home node associativity
  powerpc/numa: Remove late request for home node associativity

 arch/powerpc/include/asm/topology.h   |  4 --
 arch/powerpc/kernel/smp.c |  5 --
 arch/powerpc/mm/numa.c| 99 ---
 arch/powerpc/platforms/pseries/vphn.c |  3 +-
 4 files changed, 77 insertions(+), 34 deletions(-)

-- 
2.18.1



[PATCH v5 1/5] powerpc/vphn: Check for error from hcall_vphn

2019-12-16 Thread Srikar Dronamraju
There is no value in unpacking associativity, if
H_HOME_NODE_ASSOCIATIVITY hcall has returned an error.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v2->v1):
- Split the patch into 2(Suggested by Nathan).

 arch/powerpc/platforms/pseries/vphn.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vphn.c b/arch/powerpc/platforms/pseries/vphn.c
index 3f07bf6c670e..cca474a2c396 100644
--- a/arch/powerpc/platforms/pseries/vphn.c
+++ b/arch/powerpc/platforms/pseries/vphn.c
@@ -82,7 +82,8 @@ long hcall_vphn(unsigned long cpu, u64 flags, __be32 
*associativity)
long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
 
rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
-   vphn_unpack_associativity(retbuf, associativity);
+   if (rc == H_SUCCESS)
+   vphn_unpack_associativity(retbuf, associativity);
 
return rc;
 }
-- 
2.18.1



[PATCH v3] powerpc/smp: Use nid as fallback for package_id

2019-12-16 Thread Srikar Dronamraju
Package_id is used to find out all cores that are part of the same chip. On
PowerNV machines, package_id defaults to chip_id. However, the ibm,chip_id
property is not present in the device-tree of PowerVM LPARs. Hence lscpu
output shows one core per socket, and as many sockets as there are cores.

To overcome this, use nid as the package_id on PowerVM LPARs.

Before the patch.
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1   <--
Socket(s):   16  <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 -1
 #

After the patch
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8  <--
Core(s) per socket:  8  <--
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 0
 #
Now lscpu output is more in line with the system configuration.

Signed-off-by: Srikar Dronamraju 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Cc: Vasant Hegde 
Cc: Vaidyanathan Srinivasan 
---
Changelog from v2: https://patchwork.ozlabs.org/patch/1151649
- Rebased to v5.5-rc2
- Renamed the new function to cpu_to_nid
- Removed checks to fadump property. (Looked too excessive)
- Moved pseries specific code to pseries/lpar.c

Changelog from v1: https://patchwork.ozlabs.org/patch/1126145
- In v1 cpu_to_chip_id was overloaded to fallback on nid.  Michael
  Ellerman wasn't comfortable with nid being shown up as chip_id.

 arch/powerpc/include/asm/topology.h   |  7 ++-
 arch/powerpc/kernel/smp.c |  6 +++---
 arch/powerpc/platforms/pseries/lpar.c | 22 ++
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea5089e..7422ef913c75 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -130,11 +130,16 @@ static inline void shared_proc_topology_init(void) {}
 
 #ifdef CONFIG_SMP
 #include 
+#ifdef CONFIG_PPC_SPLPAR
+int cpu_to_nid(int);
+#else
+#define cpu_to_nid(cpu)cpu_to_chip_id(cpu)
+#endif
 
 #ifdef CONFIG_PPC64
 #include 
 
-#define topology_physical_package_id(cpu)  (cpu_to_chip_id(cpu))
+#define topology_physical_package_id(cpu)  (cpu_to_nid(cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)  (cpu_to_core_id(cpu))
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..b0c1438d8d9a 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1188,7 +1188,7 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 static void add_cpu_to_masks(int cpu)
 {
int first_thread = cpu_first_thread_sibling(cpu);
-   int chipid = cpu_to_chip_id(cpu);
+   int nid = cpu_to_nid(cpu);
int i;
 
/*
@@ -1217,11 +1217,11 @@ static void add_cpu_to_masks(int cpu)
for_each_cpu(i, cpu_l2_cache_mask(cpu))
set_cpus_related(cpu, i, cpu_core_mask);
 
-   if (chipid == -1)
+   if (nid == -1)
return;
 
for_each_cpu(i, cpu_online_mask)
-   if (cpu_to_chip_id(i) == chipid)
+   if (cpu_to_nid(i) == nid)
set_cpus_related(cpu, i, cpu_core_mask);
 }
 
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 60cb29ae4739..99583b9f00e4 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -650,6 +650,28 @@ static int __init vcpudispatch_stats_procfs_init(void)
 }
 
 machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
+
+int cpu_to_nid(int cpu)
+{
+   int nid = cpu_to_chip_id(cpu);
+
+   /*
+* If the platform is PowerNV or Guest on KVM, ibm,chip-id is
+* defined. Hence we would return the chip-id as the
+* cpu_to_nid.
+*/
+   if (nid == -1 && 

[PATCH v3] powerpc: Fix __clear_user() with KUAP enabled

2019-12-16 Thread Michael Ellerman
From: Andrew Donnellan 

The KUAP implementation adds calls in clear_user() to enable and
disable access to userspace memory. However, it doesn't add these to
__clear_user(), which is used in the ptrace regset code.

As there's only one direct user of __clear_user() (the regset code),
and the time taken to set the AMR for KUAP purposes is going to
dominate the cost of a quick access_ok(), there's not much point
having a separate path.

Rename __clear_user() to __arch_clear_user(), and make __clear_user()
just call clear_user().

Reported-by: syzbot+f25ecf4b2982d8c7a...@syzkaller-ppc64.appspotmail.com
Reported-by: Daniel Axtens 
Suggested-by: Michael Ellerman 
Fixes: de78a9c42a79 ("powerpc: Add a framework for Kernel Userspace Access Protection")
Signed-off-by: Andrew Donnellan 
[mpe: Use __arch_clear_user() for the asm version like arm64 & nds32]
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20191209132221.15328-1-...@linux.ibm.com
---
 arch/powerpc/include/asm/uaccess.h | 9 +++--
 arch/powerpc/lib/string_32.S   | 4 ++--
 arch/powerpc/lib/string_64.S   | 6 +++---
 3 files changed, 12 insertions(+), 7 deletions(-)

v3: mpe: Use __arch_clear_user() for the asm version like arm64 & nds32

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 15002b51ff18..c92fe7fe9692 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -401,7 +401,7 @@ copy_to_user_mcsafe(void __user *to, const void *from, unsigned long n)
return n;
 }
 
-extern unsigned long __clear_user(void __user *addr, unsigned long size);
+unsigned long __arch_clear_user(void __user *addr, unsigned long size);
 
 static inline unsigned long clear_user(void __user *addr, unsigned long size)
 {
@@ -409,12 +409,17 @@ static inline unsigned long clear_user(void __user *addr, unsigned long size)
might_fault();
if (likely(access_ok(addr, size))) {
allow_write_to_user(addr, size);
-   ret = __clear_user(addr, size);
+   ret = __arch_clear_user(addr, size);
prevent_write_to_user(addr, size);
}
return ret;
 }
 
+static inline unsigned long __clear_user(void __user *addr, unsigned long size)
+{
+   return clear_user(addr, size);
+}
+
 extern long strncpy_from_user(char *dst, const char __user *src, long count);
 extern __must_check long strnlen_user(const char __user *str, long n);
 
diff --git a/arch/powerpc/lib/string_32.S b/arch/powerpc/lib/string_32.S
index f69a6aab7bfb..1ddb26394e8a 100644
--- a/arch/powerpc/lib/string_32.S
+++ b/arch/powerpc/lib/string_32.S
@@ -17,7 +17,7 @@ CACHELINE_BYTES = L1_CACHE_BYTES
 LG_CACHELINE_BYTES = L1_CACHE_SHIFT
 CACHELINE_MASK = (L1_CACHE_BYTES-1)
 
-_GLOBAL(__clear_user)
+_GLOBAL(__arch_clear_user)
 /*
  * Use dcbz on the complete cache lines in the destination
  * to set them to zero.  This requires that the destination
@@ -87,4 +87,4 @@ _GLOBAL(__clear_user)
EX_TABLE(8b, 91b)
EX_TABLE(9b, 91b)
 
-EXPORT_SYMBOL(__clear_user)
+EXPORT_SYMBOL(__arch_clear_user)
diff --git a/arch/powerpc/lib/string_64.S b/arch/powerpc/lib/string_64.S
index 507b18b1660e..169872bc0892 100644
--- a/arch/powerpc/lib/string_64.S
+++ b/arch/powerpc/lib/string_64.S
@@ -17,7 +17,7 @@
.section".text"
 
 /**
- * __clear_user: - Zero a block of memory in user space, with less checking.
+ * __arch_clear_user: - Zero a block of memory in user space, with less checking.
  * @to:   Destination address, in user space.
  * @n:Number of bytes to zero.
  *
@@ -58,7 +58,7 @@ err3; stb r0,0(r3)
mr  r3,r4
blr
 
-_GLOBAL_TOC(__clear_user)
+_GLOBAL_TOC(__arch_clear_user)
cmpdi   r4,32
neg r6,r3
li  r0,0
@@ -181,4 +181,4 @@ err1;   dcbz0,r3
cmpdi   r4,32
blt .Lshort_clear
b   .Lmedium_clear
-EXPORT_SYMBOL(__clear_user)
+EXPORT_SYMBOL(__arch_clear_user)
-- 
2.21.0



Re: [PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread Jan Kara
On Fri 13-12-19 19:26:17, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 
> ---
> 
> Hi Jan,
> 
> This should address all of your comments for patch 23!

Thanks. One comment below:

> @@ -1486,6 +1500,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>   VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>   if (flags & FOLL_TOUCH)
>   touch_pmd(vma, addr, pmd, flags);
> +
> + if (!try_grab_page(page, flags))
> + return ERR_PTR(-ENOMEM);
> +
>   if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>   /*
>* We don't mlock() pte-mapped THPs. This way we can avoid

I'd move this still a bit higher - just after the VM_BUG_ON_PAGE() and before
the if (flags & FOLL_TOUCH) test. Because touch_pmd() can update page tables
and we don't want that if we're going to fail the fault.
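
I.e., roughly this ordering (a sketch of the suggestion, not a tested diff):

	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

	if (!try_grab_page(page, flags))
		return ERR_PTR(-ENOMEM);

	if (flags & FOLL_TOUCH)
		touch_pmd(vma, addr, pmd, flags);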

With this fixed, the patch looks good to me so you can then add:

Reviewed-by: Jan Kara 

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 0/1] macintosh: replace i2c_new_probed_device with an ERR_PTR variant

2019-12-16 Thread Wolfram Sang
In the ongoing mission to let i2c_new_* calls return an ERR_PTR instead of
NULL, here is a series for this subsystem converting i2c_new_probed_device() to
the newly introduced i2c_new_scanned_device(). Based on v5.5-rc1 and build tested.
Please apply via your tree.
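
For callers, the practical difference is the error check; a minimal sketch
(illustrative only, not taken from the patch below):

	struct i2c_client *client;

	client = i2c_new_scanned_device(adapter, &info, addr_list, NULL);
	if (IS_ERR(client))
		return PTR_ERR(client);	/* instead of checking for NULL */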

Thanks,

   Wolfram


Wolfram Sang (1):
  macintosh: convert to i2c_new_scanned_device

 drivers/macintosh/therm_windtunnel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
2.20.1



[PATCH 1/1] macintosh: convert to i2c_new_scanned_device

2019-12-16 Thread Wolfram Sang
Move from the deprecated i2c_new_probed_device() to the new
i2c_new_scanned_device(). Make use of the new ERR_PTR return value if suitable.

Signed-off-by: Wolfram Sang 
---
Build tested only.

 drivers/macintosh/therm_windtunnel.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/macintosh/therm_windtunnel.c b/drivers/macintosh/therm_windtunnel.c
index 8c744578122a..f15fec5e1cb6 100644
--- a/drivers/macintosh/therm_windtunnel.c
+++ b/drivers/macintosh/therm_windtunnel.c
@@ -321,10 +321,10 @@ do_attach( struct i2c_adapter *adapter )
 
	memset(&info, 0, sizeof(struct i2c_board_info));
	strlcpy(info.type, "therm_ds1775", I2C_NAME_SIZE);
-	i2c_new_probed_device(adapter, &info, scan_ds1775, NULL);
+	i2c_new_scanned_device(adapter, &info, scan_ds1775, NULL);

	strlcpy(info.type, "therm_adm1030", I2C_NAME_SIZE);
-	i2c_new_probed_device(adapter, &info, scan_adm1030, NULL);
+	i2c_new_scanned_device(adapter, &info, scan_adm1030, NULL);
 
if( x.thermostat && x.fan ) {
x.running = 1;
-- 
2.20.1



Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-16 Thread Arnd Bergmann
On Mon, Dec 16, 2019 at 11:28 AM Will Deacon  wrote:
> On Fri, Dec 13, 2019 at 02:17:08PM +0100, Arnd Bergmann wrote:
> > On Thu, Dec 12, 2019 at 9:50 PM Linus Torvalds
> >  wrote:
> > > On Thu, Dec 12, 2019 at 11:34 AM Will Deacon  wrote:
> > > > The root of my concern in all of this, and what started me looking at 
> > > > it in
> > > > the first place, is the interaction with 'typeof()'. Inheriting 
> > > > 'volatile'
> > > > for a pointer means that local variables in macros declared using 
> > > > typeof()
> > > > suddenly start generating *hideous* code, particularly when pointless 
> > > > stack
> > > > spills get stackprotector all excited.
> > >
> > > Yeah, removing volatile can be a bit annoying.
> > >
> > > For the particular case of the bitops, though, it's not an issue.
> > > Since you know the type there, you can just cast it.
> > >
> > > And if we had the rule that READ_ONCE() was an arithmetic type, you could 
> > > do
> > >
> > > typeof(0+(*p)) __var;
> > >
> > > since you might as well get the integer promotion anyway (on the
> > > non-volatile result).
> > >
> > > But that doesn't work with structures or unions, of course.
> > >
> > > I'm not entirely sure we have READ_ONCE() with a struct. I do know we
> > > have it with 64-bit entities on 32-bit machines, but that's ok with
> > > the "0+" trick.
> >
> > I'll have my randconfig builder look for instances, so far I found one,
> > see below. My feeling is that it would be better to enforce at least
> > the size being a 1/2/4/8, to avoid cases where someone thinks
> > the access is atomic, but it falls back on a memcpy.
>
> I've been using something similar built on compiletime_assert_atomic_type()
> and I spotted another instance in the xdp code (xskq_validate_desc()) which
> tries to READ_ONCE() on a 128-bit descriptor, although a /very/ quick read
> of the code suggests that this probably can't be concurrently modified if
> the ring indexes are synchronised properly.

That's the only other one I found. I have not checked how many are structs
that are the size of a normal u8/u16/u32/u64, or if there are types that
have a lower alignment than their size, such as a __u16[2] that might
span a page boundary.

> However, enabling this for 32-bit ARM is total carnage; as Linus mentioned,
> a whole bunch of code appears to be relying on atomic 64-bit access of
> READ_ONCE(); the perf ring buffer, io_uring, the scheduler, pm_runtime,
> cpuidle, ... :(
>
> Unfortunately, at least some of these *do* look like bugs, but I can't see
> how we can fix them, not least because the first two are user ABI afaict. It
> may also be that in practice we get 2x32-bit stores, and that works out fine
> when storing a 32-bit virtual address. I'm not sure what (if anything) the
> compiler guarantees in these cases.

Would it help if 32-bit architectures use atomic64_read() and atomic64_set()
to implement a 64-bit READ_ONCE()/WRITE_ONCE(), or would that make it
worse in other ways?

On mips32, riscv32 and some minor 32-bit architectures with SMP support
(xtensa,  csky, hexagon, openrisc, parisc32, sparc32 and ppc32 AFAICT) this
ends up using a spinlock for GENERIC_ATOMIC64, but at least ARMv6+,
i586+ and most ARC should be fine.

(Side note: the ARMv7 implementation is suboptimal for ARMv7VE+
if LPAE is disabled. I think we really need to add Kconfig options for
ARMv7VE and 32-bit ARMv8 to improve this and things like integer divide.)
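
Roughly, the fallback could look like this (hypothetical sketch, names made up):

	/* 32-bit only: route 64-bit {READ,WRITE}_ONCE() through atomic64_*() */
	#define READ_ONCE_64(x)		((u64)atomic64_read((const atomic64_t *)&(x)))
	#define WRITE_ONCE_64(x, val)	atomic64_set((atomic64_t *)&(x), (val))

with the generic READ_ONCE()/WRITE_ONCE() dispatching on sizeof(x) == 8.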

  Arnd


Applied "spi: fsl: use platform_get_irq() instead of of_irq_to_resource()" to the spi tree

2019-12-16 Thread Mark Brown
The patch

   spi: fsl: use platform_get_irq() instead of of_irq_to_resource()

has been applied to the spi tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-5.5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 63aa6a692595d47a0785297b481072086b9272d2 Mon Sep 17 00:00:00 2001
From: Christophe Leroy 
Date: Thu, 12 Dec 2019 17:47:24 +
Subject: [PATCH] spi: fsl: use platform_get_irq() instead of
 of_irq_to_resource()

Unlike irq_of_parse_and_map() which has a dummy definition on SPARC,
of_irq_to_resource() hasn't.

But as platform_get_irq() can be used instead and is generic, use it.

Reported-by: kbuild test robot 
Suggested-by: Mark Brown 
Fixes:  3194d2533eff ("spi: fsl: don't map irq during probe")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
Link: 
https://lore.kernel.org/r/091a277fd0b3356dca1e29858c1c96983fc9cb25.1576172743.git.christophe.le...@c-s.fr
Signed-off-by: Mark Brown 
---
 drivers/spi/spi-fsl-spi.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
index d0ad9709f4a6..fb4159ad6bf6 100644
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -746,9 +746,9 @@ static int of_fsl_spi_probe(struct platform_device *ofdev)
if (ret)
goto err;
 
-   irq = of_irq_to_resource(np, 0, NULL);
-   if (irq <= 0) {
-   ret = -EINVAL;
+   irq = platform_get_irq(ofdev, 0);
+   if (irq < 0) {
+   ret = irq;
goto err;
}
 
-- 
2.20.1



Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-16 Thread Peter Zijlstra
On Mon, Dec 16, 2019 at 10:28:06AM +, Will Deacon wrote:
> However, enabling this for 32-bit ARM is total carnage; as Linus mentioned,
> a whole bunch of code appears to be relying on atomic 64-bit access of
> READ_ONCE(); the perf ring buffer, io_uring, the scheduler, pm_runtime,
> cpuidle, ... :(
> 
> Unfortunately, at least some of these *do* look like bugs, but I can't see
> how we can fix them, not least because the first two are user ABI afaict. It
> may also be that in practice we get 2x32-bit stores, and that works out fine
> when storing a 32-bit virtual address. I'm not sure what (if anything) the
> compiler guarantees in these cases.

Perf does indeed have a (known) problem here for the head/tail values.
Last time we looked at that nobody could really come up with a sane
solution that wouldn't break something.

I'll try and dig out that thread. Perhaps casting the value to 'unsigned
long' internally might work, I forgot the details.


[PATCH v2] powerpc/pseries/cmm: fix managed page counts when migrating pages between zones

2019-12-16 Thread David Hildenbrand
Commit 63341ab03706 ("virtio-balloon: fix managed page counts when migrating
pages between zones") fixed a long-existing BUG in the virtio-balloon
driver when pages would get migrated between zones.  I did not try to
reproduce on powerpc, but looking at the code, the same should apply to
powerpc/cmm ever since it started using the balloon compaction
infrastructure (luckily just recently).

In case we have to migrate a balloon page to a newpage of another zone, the
managed page count of both zones is wrong. Paired with memory offlining
(which will adjust the managed page count), we can trigger kernel crashes
and all kinds of different symptoms.

Fix it by properly adjusting the managed page count when migrating if
the zone changed.

We'll temporarily modify the totalram page count. If this ever becomes a
problem, we can fine tune by providing helpers that don't touch
the totalram pages (e.g., adjust_zone_managed_page_count()).

Fixes: fe030c9b85e6 ("powerpc/pseries/cmm: Implement balloon compaction")
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Richard Fontana 
Cc: Greg Kroah-Hartman 
Cc: Arun KS 
Cc: Thomas Gleixner 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: David Hildenbrand 
---

v1 -> v2:
- Link virtio-balloon fix commit
- Check if the zone changed
- Move fixup further up, before enqueuing the new newpage (where we are
  guaranteed to hold a reference to both pages)

---
 arch/powerpc/platforms/pseries/cmm.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/cmm.c b/arch/powerpc/platforms/pseries/cmm.c
index 91571841df8a..9dba7e880885 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -539,6 +539,16 @@ static int cmm_migratepage(struct balloon_dev_info *b_dev_info,
/* balloon page list reference */
get_page(newpage);
 
+   /*
+* When we migrate a page to a different zone, we have to fixup the
+* count of both involved zones as we adjusted the managed page count
+* when inflating.
+*/
+   if (page_zone(page) != page_zone(newpage)) {
+   adjust_managed_page_count(page, 1);
+   adjust_managed_page_count(newpage, -1);
+   }
+
	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
balloon_page_insert(b_dev_info, newpage);
balloon_page_delete(page);
-- 
2.23.0



Re: READ_ONCE() + STACKPROTECTOR_STRONG == :/ (was Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.5-2 tag (topic/kasan-bitops))

2019-12-16 Thread Will Deacon
On Fri, Dec 13, 2019 at 02:17:08PM +0100, Arnd Bergmann wrote:
> On Thu, Dec 12, 2019 at 9:50 PM Linus Torvalds
>  wrote:
> > On Thu, Dec 12, 2019 at 11:34 AM Will Deacon  wrote:
> > > The root of my concern in all of this, and what started me looking at it 
> > > in
> > > the first place, is the interaction with 'typeof()'. Inheriting 'volatile'
> > > for a pointer means that local variables in macros declared using typeof()
> > > suddenly start generating *hideous* code, particularly when pointless 
> > > stack
> > > spills get stackprotector all excited.
> >
> > Yeah, removing volatile can be a bit annoying.
> >
> > For the particular case of the bitops, though, it's not an issue.
> > Since you know the type there, you can just cast it.
> >
> > And if we had the rule that READ_ONCE() was an arithmetic type, you could do
> >
> > typeof(0+(*p)) __var;
> >
> > since you might as well get the integer promotion anyway (on the
> > non-volatile result).
> >
> > But that doesn't work with structures or unions, of course.
> >
> > I'm not entirely sure we have READ_ONCE() with a struct. I do know we
> > have it with 64-bit entities on 32-bit machines, but that's ok with
> > the "0+" trick.
> 
> I'll have my randconfig builder look for instances, so far I found one,
> see below. My feeling is that it would be better to enforce at least
> the size being a 1/2/4/8, to avoid cases where someone thinks
> the access is atomic, but it falls back on a memcpy.

I've been using something similar built on compiletime_assert_atomic_type()
and I spotted another instance in the xdp code (xskq_validate_desc()) which
tries to READ_ONCE() on a 128-bit descriptor, although a /very/ quick read
of the code suggests that this probably can't be concurrently modified if
the ring indexes are synchronised properly.

However, enabling this for 32-bit ARM is total carnage; as Linus mentioned,
a whole bunch of code appears to be relying on atomic 64-bit access of
READ_ONCE(); the perf ring buffer, io_uring, the scheduler, pm_runtime,
cpuidle, ... :(

Unfortunately, at least some of these *do* look like bugs, but I can't see
how we can fix them, not least because the first two are user ABI afaict. It
may also be that in practice we get 2x32-bit stores, and that works out fine
when storing a 32-bit virtual address. I'm not sure what (if anything) the
compiler guarantees in these cases.

Will
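
For illustration, the compile-time size check Arnd suggests could be built
on the kernel's compiletime_assert(); a minimal sketch, with a made-up
macro name (the checks mentioned above are built on
compiletime_assert_atomic_type() instead):

#define READ_ONCE_SIZED(x)                                              \
({                                                                      \
        compiletime_assert(sizeof(x) == 1 || sizeof(x) == 2 ||         \
                           sizeof(x) == 4 || sizeof(x) == 8,           \
                           "Unsupported access size for READ_ONCE()"); \
        READ_ONCE(x);                                                   \
})

Anything wider than a native word would then fail the build instead of
silently falling back on a non-atomic copy.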


[PATCH v2 7/7] parisc/perf: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to monitoring for CAP_SYS_PERFMON privileged processes.
For backward compatibility, access to the monitoring remains open for
CAP_SYS_ADMIN privileged processes, but using CAP_SYS_ADMIN for secure
monitoring is discouraged in favour of the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 arch/parisc/kernel/perf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/parisc/kernel/perf.c b/arch/parisc/kernel/perf.c
index 676683641d00..58e7d1444e4f 100644
--- a/arch/parisc/kernel/perf.c
+++ b/arch/parisc/kernel/perf.c
@@ -300,7 +300,7 @@ static ssize_t perf_write(struct file *file, const char __user *buf,
else
return -EFAULT;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
return -EACCES;
 
if (count != sizeof(uint32_t))
-- 
2.20.1
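
The open-coded capability test repeated throughout this series could
arguably live in one place; a minimal sketch of such a helper, which is
not part of the series as posted:

static inline bool perfmon_capable(void)
{
        /* CAP_SYS_PERFMON is preferred; CAP_SYS_ADMIN kept for compat */
        return capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN);
}

Each !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)) check in the
other patches of this series would then read !perfmon_capable().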




[PATCH v2 6/7] powerpc/perf: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to monitoring for CAP_SYS_PERFMON privileged processes.
For backward compatibility, access to the monitoring remains open for
CAP_SYS_ADMIN privileged processes, but using CAP_SYS_ADMIN for secure
monitoring is discouraged in favour of the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 arch/powerpc/perf/imc-pmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index cb50a9e1fd2d..d8f936d1d6cc 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -898,7 +898,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
return -EACCES;
 
/* Sampling not supported */
@@ -1307,7 +1307,7 @@ static int trace_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
return -EACCES;
 
/* Return if this is a counting event */
-- 
2.20.1



[PATCH v2 5/7] trace/bpf_trace: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to bpf_trace monitoring for CAP_SYS_PERFMON privileged
processes. For backward compatibility, access to bpf_trace monitoring
remains open for CAP_SYS_ADMIN privileged processes, but using
CAP_SYS_ADMIN for secure bpf_trace monitoring is discouraged in favour
of the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 kernel/trace/bpf_trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 44bd08f2443b..0231bb363ef9 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1272,7 +1272,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
u32 *ids, prog_cnt, ids_len;
int ret;
 
-   if (!capable(CAP_SYS_ADMIN))
+   if (!(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN)))
return -EPERM;
if (event->attr.type != PERF_TYPE_TRACEPOINT)
return -EINVAL;
-- 
2.20.1




[PATCH v2 4/7] drm/i915/perf: open access for CAP_SYS_PERFMON privileged process

2019-12-16 Thread Alexey Budankov


Open access to i915_perf monitoring for CAP_SYS_PERFMON privileged
processes. For backward compatibility, access to the i915_perf subsystem
remains open for CAP_SYS_ADMIN privileged processes, but using
CAP_SYS_ADMIN for secure i915_perf monitoring is discouraged in favour
of the CAP_SYS_PERFMON capability.

Signed-off-by: Alexey Budankov 
---
 drivers/gpu/drm/i915/i915_perf.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index e42b86827d6b..8a9ff40b1b0b 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -2748,10 +2748,11 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv,
/* Similar to perf's kernel.perf_paranoid_cpu sysctl option
 * we check a dev.i915.perf_stream_paranoid sysctl option
 * to determine if it's ok to access system wide OA counters
-* without CAP_SYS_ADMIN privileges.
+* without CAP_SYS_PERFMON or CAP_SYS_ADMIN privileges.
 */
if (privileged_op &&
-   i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+   i915_perf_stream_paranoid &&
+   !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN))) {
DRM_DEBUG("Insufficient privileges to open system-wide i915 
perf stream\n");
ret = -EACCES;
goto err_ctx;
@@ -2940,8 +2941,8 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv,
oa_freq_hz = 0;
 
if (oa_freq_hz > i915_oa_max_sample_rate &&
-   !capable(CAP_SYS_ADMIN)) {
-   DRM_DEBUG("OA exponent would exceed the max 
sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without root 
privileges\n",
+   !(capable(CAP_SYS_PERFMON) || 
capable(CAP_SYS_ADMIN))) {
+   DRM_DEBUG("OA exponent would exceed the max 
sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without 
CAP_SYS_PERFMON or CAP_SYS_ADMIN privileges\n",
  i915_oa_max_sample_rate);
return -EACCES;
}
@@ -3328,7 +3329,7 @@ int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
}
 
-   if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+   if (i915_perf_stream_paranoid && !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN))) {
DRM_DEBUG("Insufficient privileges to add i915 OA config\n");
return -EACCES;
}
@@ -3474,7 +3475,7 @@ int i915_perf_remove_config_ioctl(struct drm_device *dev, void *data,
return -ENOTSUPP;
}
 
-   if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+   if (i915_perf_stream_paranoid && !(capable(CAP_SYS_PERFMON) || capable(CAP_SYS_ADMIN))) {
DRM_DEBUG("Insufficient privileges to remove i915 OA config\n");
return -EACCES;
}
-- 
2.20.1
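
As with perf_event_paranoid, the i915 gate combines a sysctl with a
capability check; schematically, reusing the perfmon_capable() sketch
from earlier (a paraphrase of the checks above, not new code):

        if (i915_perf_stream_paranoid && !perfmon_capable())
                return -EACCES;

so an administrator can still drop the capability requirement entirely by
setting dev.i915.perf_stream_paranoid=0.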




[PATCH v2 3/7] perf tool: extend Perf tool with CAP_SYS_PERFMON capability support

2019-12-16 Thread Alexey Budankov


Extend error messages to mention the CAP_SYS_PERFMON capability as a
substitute for the CAP_SYS_ADMIN capability in secure system performance
monitoring and observability operations [1]. Make
perf_event_paranoid_check() aware of the CAP_SYS_PERFMON capability.

[1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

Signed-off-by: Alexey Budankov 
---
 tools/perf/design.txt   |  3 ++-
 tools/perf/util/cap.h   |  4 
 tools/perf/util/evsel.c | 10 +-
 tools/perf/util/util.c  |  1 +
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/perf/design.txt b/tools/perf/design.txt
index 0453ba26cdbd..71755b3e1303 100644
--- a/tools/perf/design.txt
+++ b/tools/perf/design.txt
@@ -258,7 +258,8 @@ gets schedule to. Per task counters can be created by any user, for
 their own tasks.
 
 A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
-all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+all events on CPU-x. Per CPU counters need CAP_SYS_PERFMON or
+CAP_SYS_ADMIN privilege.
 
 The 'flags' parameter is currently unused and must be zero.
 
diff --git a/tools/perf/util/cap.h b/tools/perf/util/cap.h
index 051dc590ceee..0f79fbf6638b 100644
--- a/tools/perf/util/cap.h
+++ b/tools/perf/util/cap.h
@@ -29,4 +29,8 @@ static inline bool perf_cap__capable(int cap __maybe_unused)
 #define CAP_SYSLOG 34
 #endif
 
+#ifndef CAP_SYS_PERFMON
+#define CAP_SYS_PERFMON 38
+#endif
+
 #endif /* __PERF_CAP_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f4dea055b080..3a46325e3702 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2468,14 +2468,14 @@ int perf_evsel__open_strerror(struct evsel *evsel, struct target *target,
"You may not have permission to collect %sstats.\n\n"
"Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n"
"which controls use of the performance events system by\n"
-"unprivileged users (without CAP_SYS_ADMIN).\n\n"
+"unprivileged users (without CAP_SYS_PERFMON or CAP_SYS_ADMIN).\n\n"
"The current value is %d:\n\n"
"  -1: Allow use of (almost) all events by all users\n"
"  Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK\n"
-">= 0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN\n"
-"  Disallow raw tracepoint access by users without CAP_SYS_ADMIN\n"
-">= 1: Disallow CPU event access by users without CAP_SYS_ADMIN\n"
-">= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN\n\n"
+">= 0: Disallow ftrace function tracepoint by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN\n"
+"  Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN\n"
+">= 1: Disallow CPU event access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN\n"
+">= 2: Disallow kernel profiling by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN\n\n"
"To make this setting permanent, edit /etc/sysctl.conf too, e.g.:\n\n"
 "  kernel.perf_event_paranoid = -1\n" ,
 target->system_wide ? "system-wide " : "",
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 969ae560dad9..9981db0d8d09 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -272,6 +272,7 @@ int perf_event_paranoid(void)
 bool perf_event_paranoid_check(int max_level)
 {
return perf_cap__capable(CAP_SYS_ADMIN) ||
+   perf_cap__capable(CAP_SYS_PERFMON) ||
perf_event_paranoid() <= max_level;
 }
 
-- 
2.20.1
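
To verify from userspace that a process actually carries the new
capability, libcap can be queried directly; a self-contained sketch
(build with -lcap), mirroring the #ifndef fallback above since older
headers may not define CAP_SYS_PERFMON:

#include <stdio.h>
#include <sys/capability.h>

#ifndef CAP_SYS_PERFMON
#define CAP_SYS_PERFMON 38      /* fallback, as in tools/perf/util/cap.h */
#endif

int main(void)
{
        cap_t caps = cap_get_proc();
        cap_flag_value_t val = CAP_CLEAR;

        if (caps) {
                /* query the effective set for CAP_SYS_PERFMON */
                cap_get_flag(caps, CAP_SYS_PERFMON, CAP_EFFECTIVE, &val);
                cap_free(caps);
        }
        printf("CAP_SYS_PERFMON %s\n", val == CAP_SET ? "set" : "not set");
        return 0;
}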




  1   2   >