Re: [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution

2019-11-12 Thread Aleksa Sarai
On 2019-11-13, Al Viro  wrote:
> On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:
> 
> > One other possible alternative (which previous versions of this patch
> > used) would be to check with path_is_under() if there was a racing
> > rename or mount (after re-taking the relevant seqlocks). While this does
> > work, it results in possible O(n*m) behaviour if there are many renames
> or mounts occurring *anywhere on the system*.
> 
> BTW, do you realize that open-by-fhandle (or working nfsd, for that matter)
> will trigger arseloads of write_seqlock(&rename_lock) simply on
> d_splice_alias() bringing disconnected subtrees in contact with parent?

I wasn't aware of that -- that makes path_is_under() even less viable.
I'll reword it to be clearer that path_is_under() isn't a good idea and
why we went with -EAGAIN over an in-kernel retry.
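
For the commit message, the userspace side of that contract would look
roughly like the following (a sketch only: openat2() and RESOLVE_IN_ROOT
are from this series, so the syscall number, header, and struct layout
here are assumptions until something is merged):

	#include <fcntl.h>
	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/types.h>

	#ifndef __NR_openat2
	#define __NR_openat2 437		/* assumed x86-64 number */
	#endif
	#define RESOLVE_IN_ROOT_SKETCH 0x10	/* value as proposed here */

	struct open_how_sketch {		/* layout per this series */
		__u64 flags;
		__u64 mode;
		__u64 resolve;
	};

	static int open_in_root(int root_fd, const char *path)
	{
		struct open_how_sketch how = {
			.flags = O_RDONLY,
			.resolve = RESOLVE_IN_ROOT_SKETCH,
		};
		int fd;

		/* A racing rename or mount can make ".." unsafe, so the
		 * kernel returns -EAGAIN instead of retrying internally;
		 * the caller just tries again. */
		do {
			fd = syscall(__NR_openat2, root_fd, path,
				     &how, sizeof(how));
		} while (fd < 0 && errno == EAGAIN);
		return fd;
	}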

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution

2019-11-12 Thread Aleksa Sarai
On 2019-11-13, Al Viro  wrote:
> Minor nit here - I'd split "move the conditional call of set_root()
> into nd_jump_root()" into a separate patch before that one.  Makes
> for fewer distractions in this one.  I'd probably fold "and be
> ready for errors other than -ECHILD" into the same preliminary
> patch.

Will do.

> > +   /* Not currently safe for scoped-lookups. */
> > +   if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> > +   return ERR_PTR(-EXDEV);
> 
> Also a candidate for doing in nd_jump_link()...
> 
> > @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
> > struct inode *inode = nd->inode;
> >  
> > while (1) {
> > -   if (path_equal(&nd->path, &nd->root))
> > +   if (path_equal(&nd->path, &nd->root)) {
> > +   if (unlikely(nd->flags & LOOKUP_BENEATH))
> > +   return -EXDEV;
> 
> Umm...  Are you sure it's not -ECHILD?

It wouldn't hurt to be -ECHILD -- though it's not clear to me how likely
a success would be in REF-walk if the parent components didn't already
trigger an unlazy_walk() in RCU-walk.

I guess that also means LOOKUP_NO_XDEV should trigger -ECHILD in
follow_dotdot_rcu()?
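
Something like this, mirroring the LOOKUP_BENEATH hunk quoted above
(sketch only, not a tested patch):

	/* in the mount-crossing leg of follow_dotdot_rcu(): */
	if (unlikely(nd->flags & LOOKUP_NO_XDEV))
		return -ECHILD;	/* retry in REF-walk, not a hard -EXDEV */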

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





[PATCH 3/3] powerpc: remove support for NULL dev in __phys_to_dma / __dma_to_phys

2019-11-12 Thread Christoph Hellwig
Support for calling the DMA API functions without a valid device pointer
was removed a while ago, so remove the stale support for that from the
powerpc __phys_to_dma / __dma_to_phys helpers.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/dma-direct.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-direct.h 
b/arch/powerpc/include/asm/dma-direct.h
index e29e8a236b8d..abc154d784b0 100644
--- a/arch/powerpc/include/asm/dma-direct.h
+++ b/arch/powerpc/include/asm/dma-direct.h
@@ -4,15 +4,11 @@
 
 static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
 {
-   if (!dev)
-   return paddr + PCI_DRAM_OFFSET;
return paddr + dev->archdata.dma_offset;
 }
 
 static inline phys_addr_t __dma_to_phys(struct device *dev, dma_addr_t daddr)
 {
-   if (!dev)
-   return daddr - PCI_DRAM_OFFSET;
return daddr - dev->archdata.dma_offset;
 }
 #endif /* ASM_POWERPC_DMA_DIRECT_H */
-- 
2.20.1



[PATCH 1/3] dma-direct: unify the dma_capable definitions

2019-11-12 Thread Christoph Hellwig
Currently each architecture that wants to override dma_to_phys and
phys_to_dma also has to provide dma_capable.  But there isn't really
any good reason for that.  powerpc and mips just have copies of the
generic one minus the latest fix, and the arm one was the inspiration
for said fix, but misses the bus_dma_mask handling.
Make all architectures use the generic version instead.
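
As context, a sketch of the kind of caller that relies on the generic
definition (loosely based on kernel/dma/direct.c of this era; simplified
and illustrative, not part of the patch):

	static dma_addr_t sketch_map_page(struct device *dev, struct page *page,
					  unsigned long offset, size_t size)
	{
		phys_addr_t phys = page_to_phys(page) + offset;
		dma_addr_t dma_addr = phys_to_dma(dev, phys);

		/* one shared dma_capable() now enforces both dma_mask and
		 * bus_dma_mask on every architecture: */
		if (unlikely(!dma_capable(dev, dma_addr, size)))
			return DMA_MAPPING_ERROR; /* real code may bounce via swiotlb */
		return dma_addr;
	}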

Signed-off-by: Christoph Hellwig 
---
 arch/arm/include/asm/dma-direct.h | 19 ---
 arch/mips/include/asm/dma-direct.h|  8 
 arch/powerpc/include/asm/dma-direct.h |  9 -
 include/linux/dma-direct.h|  2 +-
 4 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/arch/arm/include/asm/dma-direct.h 
b/arch/arm/include/asm/dma-direct.h
index b67e5fc1fe43..7c3001a6a775 100644
--- a/arch/arm/include/asm/dma-direct.h
+++ b/arch/arm/include/asm/dma-direct.h
@@ -14,23 +14,4 @@ static inline phys_addr_t __dma_to_phys(struct device *dev, 
dma_addr_t dev_addr)
return __pfn_to_phys(dma_to_pfn(dev, dev_addr)) + offset;
 }
 
-static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
-{
-   u64 limit, mask;
-
-   if (!dev->dma_mask)
-   return 0;
-
-   mask = *dev->dma_mask;
-
-   limit = (mask + 1) & ~mask;
-   if (limit && size > limit)
-   return 0;
-
-   if ((addr | (addr + size - 1)) & ~mask)
-   return 0;
-
-   return 1;
-}
-
 #endif /* ASM_ARM_DMA_DIRECT_H */
diff --git a/arch/mips/include/asm/dma-direct.h 
b/arch/mips/include/asm/dma-direct.h
index b5c240806e1b..14e352651ce9 100644
--- a/arch/mips/include/asm/dma-direct.h
+++ b/arch/mips/include/asm/dma-direct.h
@@ -2,14 +2,6 @@
 #ifndef _MIPS_DMA_DIRECT_H
 #define _MIPS_DMA_DIRECT_H 1
 
-static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
-{
-   if (!dev->dma_mask)
-   return false;
-
-   return addr + size - 1 <= *dev->dma_mask;
-}
-
 dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr);
 phys_addr_t __dma_to_phys(struct device *dev, dma_addr_t daddr);
 
diff --git a/arch/powerpc/include/asm/dma-direct.h 
b/arch/powerpc/include/asm/dma-direct.h
index a2912b47102c..e29e8a236b8d 100644
--- a/arch/powerpc/include/asm/dma-direct.h
+++ b/arch/powerpc/include/asm/dma-direct.h
@@ -2,15 +2,6 @@
 #ifndef ASM_POWERPC_DMA_DIRECT_H
 #define ASM_POWERPC_DMA_DIRECT_H 1
 
-static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
-{
-   if (!dev->dma_mask)
-   return false;
-
-   return addr + size - 1 <=
-   min_not_zero(*dev->dma_mask, dev->bus_dma_mask);
-}
-
 static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
 {
if (!dev)
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 6db863c3eb93..991f8aa2676e 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -24,6 +24,7 @@ static inline phys_addr_t __dma_to_phys(struct device *dev, 
dma_addr_t dev_addr)
 
return paddr + ((phys_addr_t)dev->dma_pfn_offset << PAGE_SHIFT);
 }
+#endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */
 
 static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
 {
@@ -38,7 +39,6 @@ static inline bool dma_capable(struct device *dev, dma_addr_t 
addr, size_t size)
 
return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_mask);
 }
-#endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */
 
 #ifdef CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED
 bool force_dma_unencrypted(struct device *dev);
-- 
2.20.1



[PATCH 2/3] dma-direct: avoid a forward declaration for phys_to_dma

2019-11-12 Thread Christoph Hellwig
Move dma_capable down a bit so that we don't need a forward declaration
for phys_to_dma.
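
The constraint being worked around is just C's declare-before-use rule
for inline functions; a minimal standalone illustration (not kernel
code):

	/* callee defined first, so no forward declaration is needed: */
	static inline int callee(int x) { return x + 1; }
	static inline int caller(int x) { return callee(x); }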

Signed-off-by: Christoph Hellwig 
---
 include/linux/dma-direct.h | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 991f8aa2676e..f8959f75e496 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -6,8 +6,6 @@
 #include  /* for min_low_pfn */
 #include 
 
-static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr);
-
 #ifdef CONFIG_ARCH_HAS_PHYS_TO_DMA
 #include 
 #else
@@ -26,20 +24,6 @@ static inline phys_addr_t __dma_to_phys(struct device *dev, 
dma_addr_t dev_addr)
 }
 #endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */
 
-static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
-{
-   dma_addr_t end = addr + size - 1;
-
-   if (!dev->dma_mask)
-   return false;
-
-   if (!IS_ENABLED(CONFIG_ARCH_DMA_ADDR_T_64BIT) &&
-   min(addr, end) < phys_to_dma(dev, PFN_PHYS(min_low_pfn)))
-   return false;
-
-   return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_mask);
-}
-
 #ifdef CONFIG_ARCH_HAS_FORCE_DMA_UNENCRYPTED
 bool force_dma_unencrypted(struct device *dev);
 #else
@@ -65,6 +49,20 @@ static inline phys_addr_t dma_to_phys(struct device *dev, 
dma_addr_t daddr)
return __sme_clr(__dma_to_phys(dev, daddr));
 }
 
+static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
+{
+   dma_addr_t end = addr + size - 1;
+
+   if (!dev->dma_mask)
+   return false;
+
+   if (!IS_ENABLED(CONFIG_ARCH_DMA_ADDR_T_64BIT) &&
+   min(addr, end) < phys_to_dma(dev, PFN_PHYS(min_low_pfn)))
+   return false;
+
+   return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_mask);
+}
+
 u64 dma_direct_get_required_mask(struct device *dev);
 void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t gfp, unsigned long attrs);
-- 
2.20.1



unify the dma_capable definition

2019-11-12 Thread Christoph Hellwig
Hi all,

there is no good reason to have different versions of dma_capable.
This series removes the arch overrides and also adds a few cleanups
in the same area.


Re: [PATCH v4 00/47] QUICC Engine support on ARM and ARM64

2019-11-12 Thread Rasmus Villemoes
On 12/11/2019 21.45, Li Yang wrote:
> On Mon, Nov 11, 2019 at 5:39 PM Li Yang  wrote:
>>
>> On Fri, Nov 8, 2019 at 7:05 AM Rasmus Villemoes
>>  wrote:
>>>
>>
>> I'm generally ok with these enhancements and cleanups.  But as the
>> whole patch series touched multiple subsystems, I would like to
>> collect the Acked-by from Scott, Greg and David if we want the whole
>> series to go through the fsl/soc tree.
> 
> Rasmus,
> 
> Since the patches also touch the net and serial subsystems, can you
> also repost these patches (maybe just the related ones) on the netdev
> and linux-serial mailing lists?

They were sent to those lists already. For example, according to
the list archive, the recipients for 28/47 were

To: Qiang Zhao , Li Yang ,
Christophe Leroy 
Cc: linuxppc-dev@lists.ozlabs.org,
linux-arm-ker...@lists.infradead.org,
linux-ker...@vger.kernel.org, Scott Wood ,
Rasmus Villemoes ,
linux-ser...@vger.kernel.org

same for 29-33, and 43-46 was cc'ed to netdev@.

Rasmus


[PATCH v3] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h

2019-11-12 Thread Masahiro Yamada
The DTC v1.5.1 added references to (U)INT32_MAX.

This is no problem for user-space programs since <stdint.h> defines
(U)INT32_MAX along with (u)int32_t.

For the kernel space, libfdt_env.h needs to be adjusted before we
pull in the changes.

In the kernel, we usually use s/u32 instead of (u)int32_t for the
fixed-width types.

Accordingly, we already have S/U32_MAX for their max values.
So, we should not add (U)INT32_MAX to <linux/limits.h> any more.

Instead, add them to the in-kernel libfdt_env.h to compile the
latest libfdt.

Signed-off-by: Masahiro Yamada 
---

My initial plan was to change this in a series of 3 patches
since it is clean, and reduces the code.

[1/3] https://lore.kernel.org/patchwork/patch/1147095/
[2/3] https://lore.kernel.org/patchwork/patch/1147096/
[3/3] https://lore.kernel.org/patchwork/patch/1147097/

1/3 is stuck in the license bikeshed.

For 2/3, I have not been able to get Ack from Russell.

So, I chose a straightforward fixup.


Changes in v3:
 - Resend as a single patch

 arch/arm/boot/compressed/libfdt_env.h | 4 +++-
 arch/powerpc/boot/libfdt_env.h| 2 ++
 include/linux/libfdt_env.h| 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/compressed/libfdt_env.h 
b/arch/arm/boot/compressed/libfdt_env.h
index b36c0289a308..6a0f1f524466 100644
--- a/arch/arm/boot/compressed/libfdt_env.h
+++ b/arch/arm/boot/compressed/libfdt_env.h
@@ -2,11 +2,13 @@
 #ifndef _ARM_LIBFDT_ENV_H
 #define _ARM_LIBFDT_ENV_H
 
+#include <linux/limits.h>
 #include <linux/types.h>
 #include <linux/string.h>
 #include <asm/byteorder.h>
 
-#define INT_MAX		((int)(~0U>>1))
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
 
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
diff --git a/arch/powerpc/boot/libfdt_env.h b/arch/powerpc/boot/libfdt_env.h
index 2abc8e83b95e..9757d4f6331e 100644
--- a/arch/powerpc/boot/libfdt_env.h
+++ b/arch/powerpc/boot/libfdt_env.h
@@ -6,6 +6,8 @@
 #include <string.h>
 
 #define INT_MAX	((int)(~0U>>1))
+#define UINT32_MAX ((u32)~0U)
+#define INT32_MAX  ((s32)(UINT32_MAX >> 1))
 
 #include "of.h"
 
diff --git a/include/linux/libfdt_env.h b/include/linux/libfdt_env.h
index edb0f0c30904..1adf54aad2df 100644
--- a/include/linux/libfdt_env.h
+++ b/include/linux/libfdt_env.h
@@ -7,6 +7,9 @@
 
 #include <asm/byteorder.h>
 
+#define INT32_MAX  S32_MAX
+#define UINT32_MAX U32_MAX
+
 typedef __be16 fdt16_t;
 typedef __be32 fdt32_t;
 typedef __be64 fdt64_t;
-- 
2.17.1



Re: [PATCH v4 0/3] Append new variables to vmcoreinfo (TCR_EL1.T1SZ for arm64 and MAX_PHYSMEM_BITS for all archs)

2019-11-12 Thread AKASHI Takahiro
Hi Bhupesh,

Do you have a corresponding patch for userspace tools,
including crash util and/or makedumpfile?
Otherwise, we can't verify that a generated core file is
correctly handled.

Thanks,
-Takahiro Akashi

On Mon, Nov 11, 2019 at 01:31:19PM +0530, Bhupesh Sharma wrote:
> Changes since v3:
> 
> - v3 can be seen here:
>   http://lists.infradead.org/pipermail/kexec/2019-March/022590.html
> - Addressed comments from James and exported TCR_EL1.T1SZ in vmcoreinfo
>   instead of PTRS_PER_PGD.
> - Added a new patch (via [PATCH 3/3]), which fixes a simple typo in
>   'Documentation/arm64/memory.rst'
> 
> Changes since v2:
> 
> - v2 can be seen here:
>   http://lists.infradead.org/pipermail/kexec/2019-March/022531.html
> - Protected 'MAX_PHYSMEM_BITS' vmcoreinfo variable under CONFIG_SPARSEMEM
>   ifdef sections, as suggested by Kazu.
> - Updated vmcoreinfo documentation to add description about
>   'MAX_PHYSMEM_BITS' variable (via [PATCH 3/3]).
> 
> Changes since v1:
> 
> - v1 was sent out as a single patch which can be seen here:
>   http://lists.infradead.org/pipermail/kexec/2019-February/022411.html
> 
> - v2 breaks the single patch into two independent patches:
>   [PATCH 1/2] appends 'PTRS_PER_PGD' to vmcoreinfo for arm64 arch, whereas
>   [PATCH 2/2] appends 'MAX_PHYSMEM_BITS' to vmcoreinfo in core kernel code 
> (all archs)
> 
> This patchset primarily fixes the regression reported in user-space
> utilities like 'makedumpfile' and 'crash-utility' on arm64 architecture
> with the availability of 52-bit address space feature in underlying
> kernel. These regressions have been reported both on CPUs which don't
> support ARMv8.2 extensions (i.e. LVA, LPA) and are running newer kernels
> and also on prototype platforms (like ARMv8 FVP simulator model) which
> support ARMv8.2 extensions and are running newer kernels.
> 
> The reason for these regressions is that right now user-space tools
> have no direct access to these values (since these are not exported
> from the kernel) and hence need to rely on a best-guess method of
> determining value of 'vabits_actual' and 'MAX_PHYSMEM_BITS' supported
> by underlying kernel.
> 
> Exporting these values via vmcoreinfo will help user-land in such cases.
> In addition, as per suggestion from makedumpfile maintainer (Kazu),
> it makes more sense to append 'MAX_PHYSMEM_BITS' to
> vmcoreinfo in the core code itself rather than in arm64 arch-specific
> code, so that the user-space code for other archs can also benefit from
> this addition to the vmcoreinfo and use it as a standard way of
> determining 'SECTIONS_SHIFT' value in user-land.
> 
> Cc: Boris Petkov 
> Cc: Ingo Molnar 
> Cc: Thomas Gleixner 
> Cc: Jonathan Corbet 
> Cc: James Morse 
> Cc: Mark Rutland 
> Cc: Will Deacon 
> Cc: Steve Capper 
> Cc: Catalin Marinas 
> Cc: Ard Biesheuvel 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Dave Anderson 
> Cc: Kazuhito Hagio 
> Cc: x...@kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: ke...@lists.infradead.org
> 
> Bhupesh Sharma (3):
>   crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo
>   arm64/crash_core: Export TCR_EL1.T1SZ in vmcoreinfo
>   Documentation/arm64: Fix a simple typo in memory.rst
> 
>  Documentation/arm64/memory.rst | 2 +-
>  arch/arm64/include/asm/pgtable-hwdef.h | 1 +
>  arch/arm64/kernel/crash_core.c | 9 +
>  kernel/crash_core.c| 1 +
>  4 files changed, 12 insertions(+), 1 deletion(-)
> 
> -- 
> 2.7.4
> 
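
For reference, the arch-independent part of what the cover letter
describes amounts to roughly this in kernel/crash_core.c (a sketch; the
exact hunks, including the arm64 TCR_EL1.T1SZ export, are in the
individual patches):

	#ifdef CONFIG_SPARSEMEM
		VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
	#endif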


Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-12 Thread Ram Pai
On Wed, Nov 13, 2019 at 11:14:27AM +1100, Paul Mackerras wrote:
> On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> > On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > > There is subtle problem removing that code from the assembly.
> > > > 
> > > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without clearing
> > > > kvm->arch.secure_guest, the hypervisor will continue to think that the
> > > > VM is a secure VM.   However the primary reason the H_SVM_INIT_ABORT
> > > > hcall was invoked, was to inform the Hypervisor that it should no longer
> > > > consider the VM as a Secure VM. So there is an inconsistency there.
> > > 
> > > Most of the checks that look at whether a VM is a secure VM use code
> > > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that will
> > > take the false branch once we have set kvm->arch.secure_guest to
> > > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > > most places we will treat the VM as a normal VM from then on.  If
> > > there are any places where we still need to treat the VM as a secure
> > > VM while we are processing the abort we can easily do that too.
> > 
> > Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return back
> > to the Ultravisor?   Because removing that assembly code will NOT lead the
> 
> No.  The suggestion is that vcpu->arch.secure_guest stays set to
> KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.

In the fast_guest_return path, if it finds 
(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_ABORT) is true, should it return to
UV or not?

Ideally it should return to the ultravisor the first time
KVMPPC_SECURE_INIT_ABORT is set, and not from then onwards.
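
For reference, the check pattern Paul describes would behave like this
(sketch; KVMPPC_SECURE_INIT_ABORT is 4 per his note above, and
KVMPPC_SECURE_INIT_DONE is assumed to be a disjoint bit):

	static bool sketch_vm_is_secure(struct kvm *kvm)
	{
		/* once secure_guest is set to KVMPPC_SECURE_INIT_ABORT,
		 * this test is false and the VM is treated as a normal
		 * VM again in most paths: */
		return kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE;
	}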

RP



Re: [PATCH v4 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 8:26 PM, John Hubbard wrote:
> OK, here we go. Any VFIO and Infiniband runtime testing from anyone is
> especially welcome here.
> 

Oh, and to make that easier, there is a git repo and branch, here:

git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v4


thanks,
-- 
John Hubbard
NVIDIA


[PATCH v4 23/23] mm/gup: remove support for gup(FOLL_LONGTERM)

2019-11-12 Thread John Hubbard
Now that all other kernel callers of get_user_pages(FOLL_LONGTERM)
have been converted to pin_longterm_pages(), lock it down:

1) Add an assertion to get_user_pages(), preventing callers from
   passing FOLL_LONGTERM (in addition to the existing assertion that
   prevents FOLL_PIN).

2) Remove the associated GUP_LONGTERM_BENCHMARK test.

Signed-off-by: John Hubbard 
---
 mm/gup.c   | 8 
 mm/gup_benchmark.c | 9 +
 tools/testing/selftests/vm/gup_benchmark.c | 7 ++-
 3 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 82e7e4ce5027..90f5f95ee7ac 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1756,11 +1756,11 @@ long get_user_pages(unsigned long start, unsigned long 
nr_pages,
struct vm_area_struct **vmas)
 {
/*
-* FOLL_PIN must only be set internally by the pin_user_page*() and
-* pin_longterm_*() APIs, never directly by the caller, so enforce that
-* with an assertion:
+* FOLL_PIN and FOLL_LONGTERM must only be set internally by the
+* pin_user_page*() and pin_longterm_*() APIs, never directly by the
+* caller, so enforce that with an assertion:
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+   if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))
return -EINVAL;
 
return __gup_longterm_locked(current, current->mm, start, nr_pages,
diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 8f980d91dbf5..679f0e6a0adb 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,7 +6,7 @@
 #include 
 
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+/* Command 2 has been deleted. */
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 #define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
 #define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark)
@@ -28,7 +28,6 @@ static void put_back_pages(int cmd, struct page **pages, 
unsigned long nr_pages)
 
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
@@ -97,11 +96,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages_fast(addr, nr, gup->flags,
 pages + i);
break;
-   case GUP_LONGTERM_BENCHMARK:
-   nr = get_user_pages(addr, nr,
-   gup->flags | FOLL_LONGTERM,
-   pages + i, NULL);
-   break;
case GUP_BENCHMARK:
nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
@@ -159,7 +153,6 @@ static long gup_benchmark_ioctl(struct file *filep, 
unsigned int cmd,
 
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   case GUP_LONGTERM_BENCHMARK:
case GUP_BENCHMARK:
case PIN_FAST_BENCHMARK:
case PIN_LONGTERM_BENCHMARK:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
b/tools/testing/selftests/vm/gup_benchmark.c
index 03928e47a86f..836b7082db13 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,7 +15,7 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
+/* Command 2 has been deleted. */
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
 /*
@@ -49,7 +49,7 @@ int main(int argc, char **argv)
char *file = "/dev/zero";
char *p;
 
-   while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) {
+   while ((opt = getopt(argc, argv, "m:r:n:f:abctTUuwSH")) != -1) {
switch (opt) {
case 'a':
cmd = PIN_FAST_BENCHMARK;
@@ -75,9 +75,6 @@ int main(int argc, char **argv)
case 'T':
thp = 0;
break;
-   case 'L':
-   cmd = GUP_LONGTERM_BENCHMARK;
-   break;
case 'U':
cmd = GUP_BENCHMARK;
break;
-- 
2.24.0



[PATCH v4 06/23] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-11-12 Thread John Hubbard
And get rid of the mmap_sem calls, as part of that. Note
that get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 24244a2f68cc..3d664a2539eb 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -271,16 +271,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   down_read(&mm->mmap_sem);
-   ret = get_user_pages(cur_base,
-min_t(unsigned long, npages,
-  PAGE_SIZE / sizeof (struct page *)),
-gup_flags | FOLL_LONGTERM,
-page_list, NULL);
-   if (ret < 0) {
-   up_read(&mm->mmap_sem);
+   ret = get_user_pages_fast(cur_base,
+ min_t(unsigned long, npages,
+   PAGE_SIZE /
+   sizeof(struct page *)),
+ gup_flags | FOLL_LONGTERM, page_list);
+   if (ret < 0)
goto umem_release;
-   }
 
cur_base += ret * PAGE_SIZE;
npages   -= ret;
@@ -288,8 +285,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
sg = ib_umem_add_sg_table(sg, page_list, ret,
dma_get_max_seg_size(context->device->dma_device),
&umem->sg_nents);
-
-   up_read(&mm->mmap_sem);
}
 
sg_mark_end(sg);
-- 
2.24.0



[PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-12 Thread John Hubbard
Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

These variants all set FOLL_PIN, which is also introduced, and
thoroughly documented.

The pin_longterm*() variants also set FOLL_LONGTERM, in addition
to FOLL_PIN:

pin_user_pages()
pin_user_pages_remote()
pin_user_pages_fast()

pin_longterm_pages()
pin_longterm_pages_remote()
pin_longterm_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* These are gup-internal flags, so the call sites should not directly
set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
assertions, for the new FOLL_PIN flag. However, for the pre-existing
FOLL_LONGTERM flag, which has some call sites that still directly
set FOLL_LONGTERM, there is no assertion yet.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
* Start with "pin_user_pages" instead of "get_user_pages". That
  makes it easy to find and audit the call sites.
* Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded upon it.)
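
A minimal caller sketch following the rules above (error handling
trimmed; NPAGES and user_addr are illustrative):

	struct page *pages[NPAGES];
	int rc;

	rc = pin_user_pages(user_addr, NPAGES, FOLL_WRITE, pages, NULL);
	if (rc <= 0)
		return rc ? rc : -EFAULT;

	/* ... hardware DMAs into the pinned pages ... */

	/* all FOLL_PIN pages go back via put_user_page*(): */
	put_user_pages_dirty_lock(pages, rc, true);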

Reviewed-by: Mike Rapoport   # Documentation
Reviewed-by: Jérôme Glisse 
Cc: Jonathan Corbet 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 Documentation/core-api/index.rst  |   1 +
 Documentation/core-api/pin_user_pages.rst | 218 +
 include/linux/mm.h|  75 +-
 mm/gup.c  | 275 --
 4 files changed, 535 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
generic-radix-tree
memory-allocation
mm-api
+   pin_user_pages
gfp_mask-from-fs-io
timekeeping
boot-time-mm
diff --git a/Documentation/core-api/pin_user_pages.rst 
b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index ..ce819e709435
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,218 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+pin_user_pages() and related calls
+
+
+.. contents:: :local:
+
+Overview
+
+
+This document describes the following functions: ::
+
+ pin_user_pages
+ pin_user_pages_fast
+ pin_user_pages_remote
+
+ pin_longterm_pages
+ pin_longterm_pages_fast
+ pin_longterm_pages_remote
+
+Basic description of FOLL_PIN
+=
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the 
get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
+FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
+the associated wrapper functions (pin_user_pages() and others) to set the
+correct combination of these flags, and to check for problems as well.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that 
FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===
+
+Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
+whatever flags the caller provides::
+
+ Functiongup flags (FOLL_PIN or FOLL_LONGTERM only)
+ --
+ pin_user_pages  FOLL_PIN
+ pin_user_pages_fast FOLL_PIN
+ pin_user_pages_remote   FOLL_PIN
+
+ pin_longterm_pages  FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_fast FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_remote   FOLL_PIN | FOLL_LONGTERM
+
+Tracking dma-pinned pages
+=
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  

[PATCH v4 20/23] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"

2019-11-12 Thread John Hubbard
Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
instead of a hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing
it once before issuing any of the get_user_pages*() calls. This
makes it harder to overlook, instead of having little "gup_flags & 1"
phrases in the function calls.

Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 9 ++---
 tools/testing/selftests/vm/gup_benchmark.c | 6 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..7fc44d25eca7 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -48,18 +48,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = (next - addr) / PAGE_SIZE;
}
 
+   /* Filter out most gup flags: only allow a tiny subset here: */
+   gup->flags &= FOLL_WRITE;
+
switch (cmd) {
case GUP_FAST_BENCHMARK:
-   nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+   nr = get_user_pages_fast(addr, nr, gup->flags,
 pages + i);
break;
case GUP_LONGTERM_BENCHMARK:
nr = get_user_pages(addr, nr,
-   (gup->flags & 1) | FOLL_LONGTERM,
+   gup->flags | FOLL_LONGTERM,
pages + i, NULL);
break;
case GUP_BENCHMARK:
-   nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+   nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
default:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c 
b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..389327e9b30a 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01/* check pte is writable */
+
 struct gup_benchmark {
__u64 get_delta_usec;
__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
}
 
gup.nr_pages_per_call = nr_pages;
-   gup.flags = write;
+   if (write)
+   gup.flags |= FOLL_WRITE;
 
fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
if (fd == -1)
-- 
2.24.0



[PATCH v4 22/23] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage

2019-11-12 Thread John Hubbard
It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provide a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Signed-off-by: John Hubbard 
---
 tools/testing/selftests/vm/run_vmtests | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests 
b/tools/testing/selftests/vm/run_vmtests
index 951c507a27f7..93e8dc9a7cad 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -104,6 +104,28 @@ echo "NOTE: The above hugetlb tests provide minimal 
coverage.  Use"
 echo "  https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "  hugetlb regression testing."
 
+echo ""
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo ""
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
+echo "--"
+echo "running gup_benchmark -c (pin_user_pages)"
+echo "--"
+./gup_benchmark -c
+if [ $? -ne 0 ]; then
+   echo "[FAIL]"
+   exitcode=1
+else
+   echo "[PASS]"
+fi
+
 echo "---"
 echo "running userfaultfd"
 echo "---"
-- 
2.24.0



[PATCH v4 07/23] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-11-12 Thread John Hubbard
After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Acked-by: Hans Verkuil 
Cc: Mauro Carvalho Chehab 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c 
b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++)
+   for (i = 0; i < dma->nr_pages; i++) {
+   if (dma->direction == DMA_FROM_DEVICE)
+   set_page_dirty_lock(dma->pages[i]);
put_page(dma->pages[i]);
+   }
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.0



[PATCH v4 02/23] mm/gup: factor out duplicate code from four routines

2019-11-12 Thread John Hubbard
There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine to increment *nr.

Reviewed-by: Jérôme Glisse 
Cc: Ira Weiny 
Cc: Christoph Hellwig 
Cc: Aneesh Kumar K.V 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 104 ---
 1 file changed, 45 insertions(+), 59 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 85caf76b3012..199da99e8ffc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, 
unsigned long addr,
 }
 #endif
 
+static int __record_subpages(struct page *page, unsigned long addr,
+unsigned long end, struct page **pages, int nr)
+{
+   int nr_recorded_pages = 0;
+
+   do {
+   pages[nr] = page;
+   nr++;
+   page++;
+   nr_recorded_pages++;
+   } while (addr += PAGE_SIZE, addr != end);
+   return nr_recorded_pages;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+   /* Do a get_page() first, in case refs == page->_refcount */
+   get_page(page);
+   page_ref_sub(page, refs);
+   put_page(page);
+}
+
+static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
+{
+   *nr += nr_recorded_pages;
+   SetPageReferenced(head);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
  unsigned long sz)
@@ -1998,33 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   refs = 0;
head = pte_page(pte);
-
page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-   do {
-   VM_BUG_ON(compound_head(page) != head);
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = __record_subpages(page, addr, end, pages, *nr);
 
head = try_get_compound_head(head, refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-   /* Could be optimized better */
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
-   SetPageReferenced(head);
+   __huge_pt_done(head, refs, nr);
return 1;
 }
 
@@ -2071,29 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
unsigned long addr,
 pages, nr);
}
 
-   refs = 0;
page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = __record_subpages(page, addr, end, pages, *nr);
 
head = try_get_compound_head(pmd_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   put_compound_head(head, refs);
return 0;
}
 
-   SetPageReferenced(head);
+   __huge_pt_done(head, refs, nr);
return 1;
 }
 
@@ -2114,29 +2119,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
unsigned long addr,
 pages, nr);
}
 
-   refs = 0;
page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   do {
-   pages[*nr] = page;
-   (*nr)++;
-   page++;
-   refs++;
-   } while (addr += PAGE_SIZE, addr != end);
+   refs = __record_subpages(page, addr, end, pages, *nr);
 
head = try_get_compound_head(pud_page(orig), refs);
-   if (!head) {
-   *nr -= refs;
+   if (!head)
return 0;
-   }
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-   *nr -= refs;
-   while (refs--)
-   put_page(head);
+   

[PATCH v4 18/23] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion

2019-11-12 Thread John Hubbard
1. Change vfio from get_user_pages(FOLL_LONGTERM), to
pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Cc: Alex Williamson 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7301b710c9a4..1603459805f1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
if (!is_invalid_reserved_pfn(pfn)) {
struct page *page = pfn_to_page(pfn);
-   if (prot & IOMMU_WRITE)
-   SetPageDirty(page);
-   put_page(page);
+
+   put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
return 1;
}
return 0;
@@ -347,8 +346,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
long vaddr,
flags |= FOLL_WRITE;
 
down_read(&mm->mmap_sem);
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
-   page, NULL, NULL);
+   ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1, flags, page, NULL,
+   NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
return 0;
-- 
2.24.0



[PATCH v4 12/23] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-11-12 Thread John Hubbard
Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().

Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 mm/process_vm_access.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
if (copy > len)
copy = len;
 
-   if (vm_write) {
+   if (vm_write)
copied = copy_page_from_iter(page, offset, copy, iter);
-   set_page_dirty_lock(page);
-   } else {
+   else
copied = copy_page_to_iter(page, offset, copy, iter);
-   }
+
len -= copied;
if (copied < copy && iov_iter_count(iter))
return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
flags |= FOLL_WRITE;
 
while (!rc && nr_pages && iov_iter_count(iter)) {
-   int pages = min(nr_pages, max_pages_per_loop);
+   int pinned_pages = min(nr_pages, max_pages_per_loop);
int locked = 1;
size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 * current/current->mm
 */
down_read(&mm->mmap_sem);
-   pages = get_user_pages_remote(task, mm, pa, pages, flags,
- process_pages, NULL, &locked);
+   pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+flags, process_pages,
+NULL, &locked);
if (locked)
up_read(&mm->mmap_sem);
-   if (pages <= 0)
+   if (pinned_pages <= 0)
return -EFAULT;
 
-   bytes = pages * PAGE_SIZE - start_offset;
+   bytes = pinned_pages * PAGE_SIZE - start_offset;
if (bytes > len)
bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 vm_write);
len -= bytes;
start_offset = 0;
-   nr_pages -= pages;
-   pa += pages * PAGE_SIZE;
-   while (pages)
-   put_page(process_pages[--pages]);
+   nr_pages -= pinned_pages;
+   pa += pinned_pages * PAGE_SIZE;
+
+   /* If vm_write is set, the pages need to be made dirty: */
+   put_user_pages_dirty_lock(process_pages, pinned_pages,
+ vm_write);
}
 
return rc;
-- 
2.24.0



[PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-12 Thread John Hubbard
An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
  what kind of page it is, and what kind of refcount handling it
  requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.
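
With this split, the put_page() fast path reads roughly as follows
(sketch, simplified from include/linux/mm.h after this patch):

	static inline void sketch_put_page(struct page *page)
	{
		page = compound_head(page);

		/* devmap pages are 1-based refcounted and freed early: */
		if (put_devmap_managed_page(page))
			return;

		if (put_page_testzero(page))
			__put_page(page);
	}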

Suggested-by: Jérôme Glisse 
Signed-off-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h | 27 ---
 mm/memremap.c  | 67 --
 2 files changed, 53 insertions(+), 41 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a2adf95b3f9c..96228376139c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
*page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
@@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
*page)
switch (page->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_FS_DAX:
-   __put_devmap_managed_page(page);
return true;
default:
break;
@@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
*page)
return false;
 }
 
+static inline bool put_devmap_managed_page(struct page *page)
+{
+   bool is_devmap = page_is_devmap_managed(page);
+
+   if (is_devmap) {
+   int count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+   }
+
+   return is_devmap;
+}
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline bool put_devmap_managed_page(struct page *page)
 {
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..bc7e2a27d025 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-   int count = page_ref_dec_return(page);
+   /* Clear Active bit in case of parallel mark_page_accessed */
+   __ClearPageActive(page);
+   __ClearPageWaiters(page);
+
+   mem_cgroup_uncharge(page);
 
/*
-* If refcount is 1 then page is freed and refcount is stable as nobody
-* holds a reference on the page.
+* When a device_private page is freed, the page->mapping field
+* may still contain a (stale) mapping value. For example, the
+* lower bits of page->mapping may still identify the page as
+* an anonymous page. Ultimately, this entire field is just
+* stale and wrong, and it will cause errors if not cleared.
+* One example is:
+*
+*  migrate_vma_pages()
+*migrate_vma_insert_page()
+*  page_add_new_anon_rmap()
+*__page_set_anon_rmap()
+*  ...checks page->mapping, via PageAnon(page) call,
+*and incorrectly concludes that the page is an
+*anonymous page. Therefore, it incorrectly,
+*silently fails to set up the new anon rmap.
+*
+* For other types of ZONE_DEVICE pages, migration is either
+* handled differently or not done at all, so there is no need
+* to clear page->mapping.
 */
-   if (count == 1) {
-   /* Clear Active bit in case of parallel mark_page_accessed */
-   __ClearPageActive(page);
-   __ClearPageWaiters(page);
-
-   mem_cgroup_uncharge(page);
-
-   /*
-* When a device_private page is freed, the page->mapping field
-* 

[PATCH v4 21/23] mm/gup_benchmark: support pin_user_pages() and related calls

2019-11-12 Thread John Hubbard
Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages(): via the '-c' command line option
* pin_longterm_pages(): via the '-b' command line option
* pin_user_pages_fast(): via the '-a' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the three commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new page_dma_pinned() routine.
Those commands are:

PIN_FAST_BENCHMARK : calls pin_user_pages_fast()
PIN_LONGTERM_BENCHMARK : calls pin_longterm_pages()
PIN_BENCHMARK  : calls pin_user_pages()

In between the calls to pin_*() and put_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.

Signed-off-by: John Hubbard 
---
 mm/gup_benchmark.c | 73 --
 tools/testing/selftests/vm/gup_benchmark.c | 23 ++-
 2 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7fc44d25eca7..8f980d91dbf5 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,9 @@
 #define GUP_FAST_BENCHMARK _IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK _IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK  _IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK _IOWR('g', 4, struct gup_benchmark)
+#define PIN_LONGTERM_BENCHMARK _IOWR('g', 5, struct gup_benchmark)
+#define PIN_BENCHMARK  _IOWR('g', 6, struct gup_benchmark)
 
 struct gup_benchmark {
__u64 get_delta_usec;
@@ -19,6 +22,44 @@ struct gup_benchmark {
__u64 expansion[10];/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long 
nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case GUP_FAST_BENCHMARK:
+   case GUP_LONGTERM_BENCHMARK:
+   case GUP_BENCHMARK:
+   for (i = 0; i < nr_pages; i++)
+   put_page(pages[i]);
+   break;
+
+   case PIN_FAST_BENCHMARK:
+   case PIN_LONGTERM_BENCHMARK:
+   case PIN_BENCHMARK:
+   put_user_pages(pages, nr_pages);
+   break;
+   }
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+ unsigned long nr_pages)
+{
+   int i;
+
+   switch (cmd) {
+   case PIN_FAST_BENCHMARK:
+   case PIN_LONGTERM_BENCHMARK:
+   case PIN_BENCHMARK:
+   for (i = 0; i < nr_pages; i++) {
+   if (WARN(!page_dma_pinned(pages[i]),
+"pages[%d] is NOT dma-pinned\n", i))
+   break;
+   }
+   break;
+   }
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
struct gup_benchmark *gup)
 {
@@ -65,6 +106,18 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
nr = get_user_pages(addr, nr, gup->flags, pages + i,
NULL);
break;
+   case PIN_FAST_BENCHMARK:
+   nr = pin_user_pages_fast(addr, nr, gup->flags,
+pages + i);
+   break;
+   case PIN_LONGTERM_BENCHMARK:
+   nr = pin_longterm_pages(addr, nr, gup->flags, pages + i,
+   NULL);
+   break;
+   case PIN_BENCHMARK:
+   nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+   NULL);
+   break;
default:
return -1;
}
@@ -75,15 +128,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
}
end_time = ktime_get();
 
+   /* Shifting the meaning of nr_pages: now it is actual number pinned: */
+   nr_pages = i;
+
gup->get_delta_usec = ktime_us_delta(end_time, start_time);
gup->size = addr - gup->addr;
 
+   /*
+* Take an un-benchmark-timed moment to verify DMA pinned
+* state: print a warning if any non-dma-pinned pages are found:
+*/
+   verify_dma_pinned(cmd, pages, nr_pages);
+
start_time = ktime_get();
-   for (i = 0; i < nr_pages; i++) {
-   if (!pages[i])
-   break;
-   put_page(pages[i]);
-   }
+
+   put_back_pages(cmd, pages, nr_pages);
+
end_time = ktime_get();
gup->put_delta_usec = 

[PATCH v4 19/23] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()

2019-11-12 Thread John Hubbard
1. Convert from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

3. Release each page in mem->hpages[] (instead of mem->hpas[]), because
that is the array that pin_longterm_pages() filled in. This is more
accurate and should be a little safer from a maintenance point of
view.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Signed-off-by: John Hubbard 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..69d79cb50d47 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,9 +103,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
for (entry = 0; entry < entries; entry += chunk) {
unsigned long n = min(entries - entry, chunk);
 
-   ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
-   FOLL_WRITE | FOLL_LONGTERM,
-   mem->hpages + entry, NULL);
+   ret = pin_longterm_pages(ua + (entry << PAGE_SHIFT), n,
+FOLL_WRITE, mem->hpages + entry, NULL);
if (ret == n) {
pinned += n;
continue;
@@ -167,9 +166,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
return 0;
 
 free_exit:
-   /* free the reference taken */
-   for (i = 0; i < pinned; i++)
-   put_page(mem->hpages[i]);
+   /* free the references taken */
+   put_user_pages(mem->hpages, pinned);
 
vfree(mem->hpas);
kfree(mem);
@@ -212,10 +210,9 @@ static void mm_iommu_unpin(struct 
mm_iommu_table_group_mem_t *mem)
if (!page)
continue;
 
-   if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
-   SetPageDirty(page);
+   put_user_pages_dirty_lock(&mem->hpages[i], 1,
+ MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);
 
-   put_page(page);
mem->hpas[i] = 0;
}
 }
-- 
2.24.0



[PATCH v4 16/23] mm/gup: track FOLL_PIN pages

2019-11-12 Thread John Hubbard
Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via put_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page is left to later
patchsets. There is discussion about this in [1].

This also changes a BUG_ON() to a WARN_ON(), in follow_page_mask().
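
As a hint of the intended use (illustrative only; the real policy work
is left to those later patchsets):

	static bool sketch_safe_to_write_protect(struct page *page)
	{
		/* false positives are acceptable by design: the caller
		 * merely takes the conservative path. */
		return !page_dma_pinned(page);
	}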

Suggested-by: Jan Kara 
Suggested-by: Jérôme Glisse 
Signed-off-by: John Hubbard 
---
 include/linux/mm.h   |  74 +++
 include/linux/mmzone.h   |   2 +
 include/linux/page_ref.h |  10 +++
 mm/gup.c | 189 +--
 mm/huge_memory.c |  54 ++-
 mm/hugetlb.c |  39 +++-
 mm/vmstat.c  |   2 +
 7 files changed, 321 insertions(+), 49 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c351e1b0b4b7..19b3fa68a4da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page 
*page)
return true;
 }
 
+__must_check bool user_page_ref_inc(struct page *page);
+
 static inline void put_page(struct page *page)
 {
page = compound_head(page);
@@ -1071,30 +1073,70 @@ static inline void put_page(struct page *page)
__put_page(page);
 }
 
-/**
- * put_user_page() - release a gup-pinned page
- * @page:pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original 
page
+ * reference count, and also a new count of how many get_user_pages() calls 
were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, get_user_pages() becomes special: such pages are marked
+ * as distinct from normal pages. As such, the new put_user_page() call (and
+ * its variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
  *
- * Pages that were pinned via either pin_user_pages*() or pin_longterm_pages*()
- * must be released via either put_user_page(), or one of the put_user_pages*()
- * routines. This is so that eventually such pages can be separately tracked 
and
- * uniquely handled. In particular, interactions with RDMA and filesystems need
- * special handling.
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to get_user_pages() and put_user_page() becomes simpler,
+ * due to the fact that adding an even power of two to the page refcount has
+ * the effect of using only the upper N bits, for the code that counts up using
+ * the bias value. This means that the lower bits are left for the exclusive
+ * use of the original code that increments and decrements by one (or at least,
+ * by much smaller values than the bias value).
  *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual 
inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
+ *
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
  */
-static inline void put_user_page(struct page *page)
-{
-   put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1UL << 10)
 
+void put_user_page(struct page *page);
 void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
   bool make_dirty);
-
 void put_user_pages(struct page **pages, unsigned long npages);
 
+/**
+ * page_dma_pinned() - report if a page is pinned for DMA.
+ *
+ * This function checks if a page has been pinned via a call to
+ * pin_user_pages*() or pin_longterm_pages*().
+ *
+ * The return value is partially fuzzy: false is not fuzzy, because it means
+ * "definitely not pinned for DMA", but true means "probably pinned for DMA, 
but
+ * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth
+ * of normal page references".
+ *
+ * False positives are OK, because: a) it's unlikely for a page to get that many
+ * refcounts, and b) all the callers of this routine are expected to be able to
+ * deal gracefully with a false positive.
+ *
+ * For more information, please 

[PATCH v4 01/23] mm/gup: pass flags arg to __gup_device_* functions

2019-11-12 Thread John Hubbard
A subsequent patch requires access to gup flags, so
pass the flags argument through to the __gup_device_*
functions.

Also placate checkpatch.pl by shortening a nearby line.

Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Cc: Kirill A. Shutemov 
Signed-off-by: John Hubbard 
---
 mm/gup.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..85caf76b3012 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
int nr_start = *nr;
struct dev_pagemap *pgmap = NULL;
@@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
unsigned long fault_pfn;
int nr_start = *nr;
 
fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+   if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
return 0;
 
if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
unsigned long fault_pfn;
int nr_start = *nr;
 
fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-   if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+   if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
return 0;
 
if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
BUILD_BUG();
return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-   unsigned long end, struct page **pages, int *nr)
+unsigned long end, unsigned int flags,
+struct page **pages, int *nr)
 {
BUILD_BUG();
return 0;
@@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
if (pmd_devmap(orig)) {
if (unlikely(flags & FOLL_LONGTERM))
return 0;
-   return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+   return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+pages, nr);
}
 
refs = 0;
@@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-   unsigned long end, unsigned int flags, struct page **pages, int *nr)
+   unsigned long end, unsigned int flags,
+   struct page **pages, int *nr)
 {
struct page *head, *page;
int refs;
@@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
if (pud_devmap(orig)) {
if (unlikely(flags & FOLL_LONGTERM))
return 0;
-   return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+   return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+pages, nr);
}
 
refs = 0;
-- 
2.24.0



[PATCH v4 15/23] net/xdp: set FOLL_PIN via pin_user_pages()

2019-11-12 Thread John Hubbard
Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_longterm_pages().

Reviewed-by: Ira Weiny 
Acked-by: Björn Töpel 
Signed-off-by: John Hubbard 
---
 net/xdp/xdp_umem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..66c814863cfd 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,8 +291,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
return -ENOMEM;
 
	down_read(&current->mm->mmap_sem);
-   npgs = get_user_pages(umem->address, umem->npgs,
- gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
+   npgs = pin_longterm_pages(umem->address, umem->npgs, gup_flags,
+ &umem->pgs[0], NULL);
	up_read(&current->mm->mmap_sem);
 
if (npgs != umem->npgs) {
-- 
2.24.0



[PATCH v4 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard
As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments accordingly, and update the VFIO caller
to take advantage of this, fixing a bug as a result: the VFIO caller
is logically a FOLL_LONGTERM user.

Also, remove an unnecessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, move the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). It's lightly explained in the comments as well.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.
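
To make the resulting control flow concrete, the FOLL_LONGTERM branch in
get_user_pages_remote() ends up looking roughly like this (a sketch based
on the description above; the exact flag combination is an assumption):

	if (gup_flags & FOLL_LONGTERM) {
		if (WARN_ON_ONCE(locked))
			return -EINVAL;
		/*
		 * __gup_longterm_locked() checks the vmas itself, including
		 * the fs-dax check via check_dax_vmas():
		 */
		return __gup_longterm_locked(tsk, mm, start, nr_pages,
					     pages, vmas,
					     gup_flags | FOLL_TOUCH | FOLL_REMOTE);
	}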

Suggested-by: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Jerome Glisse 
Cc: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/vfio/vfio_iommu_type1.c | 25 ++---
 mm/gup.c| 27 ++-
 2 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d864277ea16f..7301b710c9a4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -340,7 +340,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
struct page *page[1];
struct vm_area_struct *vma;
-   struct vm_area_struct *vmas[1];
unsigned int flags = 0;
int ret;
 
@@ -348,33 +347,13 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
flags |= FOLL_WRITE;
 
	down_read(&mm->mmap_sem);
-   if (mm == current->mm) {
-   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-vmas);
-   } else {
-   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-   vmas, NULL);
-   /*
-* The lifetime of a vaddr_get_pfn() page pin is
-* userspace-controlled. In the fs-dax case this could
-* lead to indefinite stalls in filesystem operations.
-* Disallow attempts to pin fs-dax pages via this
-* interface.
-*/
-   if (ret > 0 && vma_is_fsdax(vmas[0])) {
-   ret = -EOPNOTSUPP;
-   put_page(page[0]);
-   }
-   }
-   up_read(&mm->mmap_sem);
-
+   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+   page, NULL, NULL);
if (ret == 1) {
*pfn = page_to_pfn(page[0]);
return 0;
}
 
-   down_read(&mm->mmap_sem);
-
vaddr = untagged_addr(vaddr);
 
vma = find_vma_intersection(mm, vaddr, vaddr + 1);
diff --git a/mm/gup.c b/mm/gup.c
index 933524de6249..83702b2e86c8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,13 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long nr_pages,
+ struct page **pages,
+ struct vm_area_struct **vmas,
+ unsigned int flags);
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -1167,13 +1174,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct **vmas, int *locked)
 {
/*
-* FIXME: Current FOLL_LONGTERM behavior is incompatible with
+* Parts of FOLL_LONGTERM behavior are incompatible with
 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-* vmas.  As there are no users of this flag in this call we simply
-* disallow this option for now.
+* vmas. However, this only comes up if locked is set, and there are
+* callers that do request FOLL_LONGTERM, but do not set locked. So,
+* allow what we can.
 */
-   if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-   return -EINVAL;
+   if (gup_flags & FOLL_LONGTERM) {
+   if 

[PATCH v4 17/23] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion

2019-11-12 Thread John Hubbard
1. Change v4l2 from get_user_pages(FOLL_LONGTERM), to
pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil 
Reviewed-by: Ira Weiny 
Cc: Mauro Carvalho Chehab 
Signed-off-by: John Hubbard 
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..9b9c5b37bf59 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
data, size, dma->nr_pages);
 
-   err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
-flags | FOLL_LONGTERM, dma->pages, NULL);
+   err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages,
+flags, dma->pages, NULL);
 
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
-   dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+   dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err,
dma->nr_pages);
return err < 0 ? err : -EINVAL;
}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
 
if (dma->pages) {
-   for (i = 0; i < dma->nr_pages; i++) {
-   if (dma->direction == DMA_FROM_DEVICE)
-   set_page_dirty_lock(dma->pages[i]);
-   put_page(dma->pages[i]);
-   }
+   put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+ dma->direction == DMA_FROM_DEVICE);
kfree(dma->pages);
dma->pages = NULL;
}
-- 
2.24.0



[PATCH v4 05/23] goldish_pipe: rename local pin_user_pages() routine

2019-11-12 Thread John Hubbard
1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "pin_goldfish_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..7ed2a21a0bac 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
}
 }
 
-static int pin_user_pages(unsigned long first_page,
- unsigned long last_page,
- unsigned int last_page_size,
- int is_write,
- struct page *pages[MAX_BUFFERS_PER_COMMAND],
- unsigned int *iter_last_page_size)
+static int pin_goldfish_pages(unsigned long first_page,
+ unsigned long last_page,
+ unsigned int last_page_size,
+ int is_write,
+ struct page *pages[MAX_BUFFERS_PER_COMMAND],
+ unsigned int *iter_last_page_size)
 {
int ret;
int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
	if (mutex_lock_interruptible(&pipe->lock))
return -ERESTARTSYS;
 
-   pages_count = pin_user_pages(first_page, last_page,
-last_page_size, is_write,
-pipe->pages, &iter_last_page_size);
+   pages_count = pin_goldfish_pages(first_page, last_page,
+last_page_size, is_write,
+pipe->pages, &iter_last_page_size);
if (pages_count < 0) {
	mutex_unlock(&pipe->lock);
return pages_count;
-- 
2.24.0



[PATCH v4 10/23] goldish_pipe: convert to pin_user_pages() and put_user_page()

2019-11-12 Thread John Hubbard
1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.

[1] https://lore.kernel.org/r/20190723153640.gb...@lst.de

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index 7ed2a21a0bac..635a8bc1b480 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int pin_goldfish_pages(unsigned long first_page,
*iter_last_page_size = last_page_size;
}
 
-   ret = get_user_pages_fast(first_page, requested_pages,
+   ret = pin_user_pages_fast(first_page, requested_pages,
  !is_write ? FOLL_WRITE : 0,
  pages);
if (ret <= 0)
@@ -285,18 +285,6 @@ static int pin_goldfish_pages(unsigned long first_page,
return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-  int is_write, s32 consumed_size)
-{
-   int i;
-
-   for (i = 0; i < pages_count; i++) {
-   if (!is_write && consumed_size > 0)
-   set_page_dirty(pages[i]);
-   put_page(pages[i]);
-   }
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
   int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-   release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+   put_user_pages_dirty_lock(pipe->pages, pages_count,
+ !is_write && *consumed_size > 0);
 
	mutex_unlock(&pipe->lock);
return 0;
-- 
2.24.0



[PATCH v4 11/23] IB/{core, hw, umem}: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()

2019-11-12 Thread John Hubbard
Convert infiniband to use the new wrapper calls, and stop
explicitly setting FOLL_LONGTERM at the call sites.

The new pin_longterm_*() calls replace get_user_pages*()
calls, and set both FOLL_LONGTERM and a new FOLL_PIN
flag. The FOLL_PIN flag requires that the caller must
return the pages via put_user_page*() calls, but
infiniband was already doing that as part of an earlier
commit.

Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/infiniband/core/umem.c  | 10 +-
 drivers/infiniband/core/umem_odp.c  | 13 ++---
 drivers/infiniband/hw/hfi1/user_pages.c |  4 ++--
 drivers/infiniband/hw/mthca/mthca_memfree.c |  3 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  8 
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  9 -
 drivers/infiniband/sw/siw/siw_mem.c |  5 ++---
 8 files changed, 25 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 3d664a2539eb..7f03f72e6dce 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -271,11 +271,11 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
sg = umem->sg_head.sgl;
 
while (npages) {
-   ret = get_user_pages_fast(cur_base,
- min_t(unsigned long, npages,
-   PAGE_SIZE /
-   sizeof(struct page *)),
- gup_flags | FOLL_LONGTERM, page_list);
+   ret = pin_longterm_pages_fast(cur_base,
+ min_t(unsigned long, npages,
+   PAGE_SIZE /
+   sizeof(struct page *)),
+ gup_flags, page_list);
if (ret < 0)
goto umem_release;
 
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 163ff7ba92b7..11249406148a 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -495,9 +495,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
@@ -542,7 +541,7 @@ static int ib_umem_odp_map_dma_single_page(
}
 
 out:
-   put_user_page(page);
+   put_page(page);
 
if (remove_existing_mapping) {
ib_umem_notifier_start_account(umem_odp);
@@ -665,7 +664,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
ret = -EFAULT;
break;
}
-   put_user_page(local_page_list[j]);
+   put_page(local_page_list[j]);
continue;
}
 
@@ -692,8 +691,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 * ib_umem_odp_map_dma_single_page().
 */
if (npages - (j + 1) > 0)
-   put_user_pages(&local_page_list[j+1],
-  npages - (j + 1));
+   release_pages(&local_page_list[j+1],
+ npages - (j + 1));
break;
}
}
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9b55b0a73e29 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -104,9 +104,9 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
bool writable, struct page **pages)
 {
int ret;
-   unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
+   unsigned int gup_flags = (writable ? FOLL_WRITE : 0);
 
-   ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+   ret = pin_longterm_pages_fast(vaddr, npages, gup_flags, pages);
if (ret < 0)
return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..beec7e4b8a96 

[PATCH v4 14/23] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-11-12 Thread John Hubbard
Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_longterm_pages().

Reviewed-by: Ira Weiny 
Reviewed-by: Jens Axboe 
Signed-off-by: John Hubbard 
---
 fs/io_uring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f9a38998f2fc..0f307f2c7cac 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3433,9 +3433,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
ret = 0;
	down_read(&current->mm->mmap_sem);
-   pret = get_user_pages(ubuf, nr_pages,
- FOLL_WRITE | FOLL_LONGTERM,
- pages, vmas);
+   pret = pin_longterm_pages(ubuf, nr_pages, FOLL_WRITE, pages,
+ vmas);
if (pret == nr_pages) {
/* don't support file backed memory */
for (j = 0; j < nr_pages; j++) {
-- 
2.24.0



[PATCH v4 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-12 Thread John Hubbard
OK, here we go. Any VFIO and InfiniBand runtime testing from anyone is
especially welcome here.

Changes since v3:

* VFIO fix (patch 8): applied further cleanup: removed a pre-existing,
  unnecessary release and reacquire of mmap_sem. Moved the DAX vma
  checks from the vfio call site, to gup internals, and added comments
  (and commit log) to clarify.

* Due to the above, made a corresponding fix to the
  pin_longterm_pages_remote(), which was actually calling the wrong
  gup internal function.

* Changed put_user_page() comments, to refer to pin*() APIs, rather than
  get_user_pages*() APIs.

* Reverted an accidental whitespace-only change in the IB ODP code.

* Added a few more reviewed-by tags.


Changes since v2:

* Added a patch to convert IB/umem from normal gup, to gup_fast(). This
  is also posted separately, in order to hopefully get some runtime
  testing.

* Changed the page devmap code to be a little clearer,
  thanks to Jerome for that.

* Split out the page devmap changes into a separate patch (and moved
  Ira's Signed-off-by to that patch).

* Fixed my bug in IB: ODP code does not require pin_user_pages()
  semantics. Therefore, revert the put_user_page() calls to put_page(),
  and leave the get_user_pages() call as-is.

  * As part of the revert, I am proposing here a change directly
from put_user_pages(), to release_pages(). I'd feel better if
someone agrees that this is the best way. It uses the more
efficient release_pages(), instead of put_page() in a loop,
and keeps the change to just a few characters on one line,
but OTOH it is not a pure revert.

* Loosened the FOLL_LONGTERM restrictions in the
  __get_user_pages_locked() implementation, and used that in order
  to fix up a VFIO bug. Thanks to Jason for that idea.

* Note the use of release_pages() in IB: is that OK?

* Added a few more WARN's and clarifying comments nearby.

* Many documentation improvements in various comments.

* Moved the new pin_user_pages.rst from Documentation/vm/ to
  Documentation/core-api/.

* Commit descriptions: added clarifying notes to the three patches
  (drm/via, fs/io_uring, net/xdp) that already had put_user_page()
  calls in place.

* Collected all pending Reviewed-by and Acked-by tags, from v1 and v2
  email threads.

* Lot of churn from v2 --> v3, so it's possible that new bugs
  sneaked in.

NOT DONE: separate patchset is required:

* __get_user_pages_locked(): stop compensating for
  buggy callers who failed to set FOLL_GET. Instead, assert
  that FOLL_GET is set (and fail if it's not).
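
  Presumably something along these lines (an editor's sketch of the
  intent, not code from this series):

	/* in __get_user_pages_locked(): */
	if (WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
		return -EINVAL;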

==
Original cover letter (edited to fix up the patch description numbers)

This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.

This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

get_user_pages() (sets FOLL_GET)
put_page()

to this:
pin_user_pages() (sets FOLL_PIN)
put_user_page()

Because there are interdependencies with FOLL_LONGTERM, a similar
conversion as for FOLL_PIN, was applied. The change was from this:

get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
put_page()

to this:
pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
put_user_page()
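
As a concrete illustration of that second pattern, a typical caller
conversion from this series looks like the following (sketch only; error
handling omitted):

	/* before: */
	ret = get_user_pages(addr, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
			     pages, NULL);
	...
	for (i = 0; i < nr_pages; i++)
		put_page(pages[i]);

	/* after: */
	ret = pin_longterm_pages(addr, nr_pages, FOLL_WRITE, pages, NULL);
	...
	for (i = 0; i < nr_pages; i++)
		put_user_page(pages[i]);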


Patch summary:

* Patches 1-8: refactoring and preparatory cleanup, independent fixes

* Patch 9: introduce pin_user_pages(), FOLL_PIN, but no functional
   changes yet
* Patches 10-15: Convert existing put_user_page() callers, to use the
new pin*()
* Patch 16: Activate tracking of FOLL_PIN pages.
* Patches 17-19: convert FOLL_LONGTERM callers
* Patches: 20-22: gup_benchmark and run_vmtests support
* Patch 23: enforce FOLL_LONGTERM as a gup-internal (only) flag


Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some 

[PATCH v4 03/23] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-11-12 Thread John Hubbard
An upcoming patch uses try_get_compound_head() more widely,
so move it to the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Signed-off-by: John Hubbard 
---
 mm/gup.c | 29 +++--
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 199da99e8ffc..933524de6249 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+   struct page *head = compound_head(page);
+
+   if (WARN_ON_ONCE(page_ref_count(head) < 0))
+   return NULL;
+   if (unlikely(!page_cache_add_speculative(head, refs)))
+   return NULL;
+   return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1793,20 +1808,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-   struct page *head = compound_head(page);
-   if (WARN_ON_ONCE(page_ref_count(head) < 0))
-   return NULL;
-   if (unlikely(!page_cache_add_speculative(head, refs)))
-   return NULL;
-   return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
-- 
2.24.0



[PATCH v4 13/23] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-11-12 Thread John Hubbard
Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().

Acked-by: Daniel Vetter 
Reviewed-by: Jérôme Glisse 
Reviewed-by: Ira Weiny 
Signed-off-by: John Hubbard 
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg, drm_via_dmablit_t *xfer)
vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
if (NULL == vsg->pages)
return -ENOMEM;
-   ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+   ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
vsg->pages);
-- 
2.24.0



Re: [PATCH 00/50] Add log level to show_stack()

2019-11-12 Thread Sergey Senozhatsky
On (19/11/13 02:41), Dmitry Safonov wrote:
[..]
> 
> I don't strongly disagree, but if you look at those results:
> git grep 'printk("%s.*", \(lvl\|level\)'
> 
> it seems to be used in quite a few places.

Yes, you are right, it is used in some places. That's why I said
that I'd prefer to keep that number low (minimize it). But it's
not 0 (that ship has sailed).

-ss


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-12 Thread Dmitry Safonov
On 11/12/19 4:25 AM, Sergey Senozhatsky wrote:
> On (19/11/12 02:40), Dmitry Safonov wrote:
> [..]
>> In my point of view the cost of one-time [mostly build] testing every
>> architecture is cheaper than introducing some new smart code that will
>> live forever.
> 
> Well, there may be the need to pass loglevel deeper due to "hey __show_stack()
> on that arch invokes bar(), which invokes foo() and now foo() does printk(),
> but we don't see it". The context which decided to backtaraces decided
> to do so for a reason, probably, so I guess we can look at it as "a special
> error reporting code block".
> 
> The proposed patch set passes loglevel via a slightly unusual channel -
> via sprintf(). We probably can do it, but I would prefer to minimize
> the number of such printk-s in the kernel. The code snippet which I
> posted also does a pretty unusual thing w.r.t. loglevel. Both approaches
> are "non-standard" from that POV.

I don't strongly disagree, but if you look at those results:
git grep 'printk("%s.*", \(lvl\|level\)'

it seems to be used in quite a few places.
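
That is, the pattern of passing the level in as a string prefix, along the
lines of:

	printk("%s" "Call trace:\n", loglvl);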

Thanks,
  Dmitry


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-12 Thread Dmitry Safonov
On 11/13/19 1:23 AM, Sergey Senozhatsky wrote:
> On (19/11/12 19:12), Sergey Senozhatsky wrote:
>> On (19/11/12 09:35), Petr Mladek wrote:
>> [..]
>>> This is getting too complicated. It would introduce too many
>>> hidden rules. While the explicitly passed loglevel parameter
>>> is straightforward and clear.
>>
>> If loglevel is DEFAULT or NOTICE or INFO then we can overwrite it
>> (either downgrade or upgrade). That's one rule, basically. Not too
>> complicated, I guess.
> 
> Can be taken even a bit further than
> 
>   show_stack(NULL, NULL, LOGLEVEL_DEBUG);
> or
>   show_stack(NULL, NULL, LOGLEVEL_ERR);
> 
> For instance,
> 
>   spin_lock_irqsave(&rq->lock, flags);
>   printk_emergency_enter(LOGLEVEL_SCHED);
>   ...
>   show_stack(...);
>   printk();
>   printk();
>   ...
>   spin_unlock_irqrestore(&rq->lock, flags);
> 
> or
> 
>   spin_lock_irqsave(&uart_port->lock, flags);
>   printk_emergency_enter(LOGLEVEL_SCHED);
>   ...
>   printk();
>   printk();
>   ...

Yeah, that makes sense.

I believe it's up to you, Petr and Steven to decide what's better for
printk(). I mean, after all you care for all this code.

I guess I've already pointed out that, in my view, the price of one-time
testing code is cheaper than adding a new printk feature to swap log levels
on the fly. But again, it's just how I see it - if we fix those printing
problems, then in half a year we will forget they ever existed, whereas in
your proposal there will still be some clever printk code.

On the other side, it's also worth noting that the current patch set fixes
problems for kdb (and for my hung-task printing patch), but it's incomplete
for the sysrq driver. I've gone through the functions used by the sysrq
driver, and the same changes introducing a log level parameter would be
needed for: sched_show_task(), debug_show_all_locks(), show_regs(),
show_state(), show_mem(). Some of them don't need any platform changes, but
at least show_regs() does.
So, you also need to keep that in mind when making the decision.

Thanks,
  Dmitry


Re: [PATCH 00/50] Add log level to show_stack()

2019-11-12 Thread Sergey Senozhatsky
On (19/11/12 19:12), Sergey Senozhatsky wrote:
> On (19/11/12 09:35), Petr Mladek wrote:
> [..]
> > This is getting too complicated. It would introduce too many
> > hidden rules. While the explicitly passed loglevel parameter
> > is straightforward and clear.
>
> If loglevel is DEFAULT or NOTICE or INFO then we can overwrite it
> (either downgrade or upgrade). That's one rule, basically. Not too
> complicated, I guess.

Can be taken even a bit further than

show_stack(NULL, NULL, LOGLEVEL_DEBUG);
or
show_stack(NULL, NULL, LOGLEVEL_ERR);

For instance,

	spin_lock_irqsave(&rq->lock, flags);
printk_emergency_enter(LOGLEVEL_SCHED);
...
show_stack(...);
printk();
printk();
...
	spin_unlock_irqrestore(&rq->lock, flags);

or

	spin_lock_irqsave(&uart_port->lock, flags);
printk_emergency_enter(LOGLEVEL_SCHED);
...
printk();
printk();
...

and so on.
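
A minimal sketch of how printk_emergency_enter()/_exit() could be wired up
(entirely hypothetical - no such API exists yet, and callers are assumed to
run with IRQs disabled, as in the examples above):

	/* -1 means "no override" (LOGLEVEL_EMERG is 0) */
	static DEFINE_PER_CPU(int, printk_emergency_level) = -1;

	void printk_emergency_enter(int level)
	{
		/* make default-level printk()s on this CPU use @level */
		this_cpu_write(printk_emergency_level, level);
	}

	void printk_emergency_exit(void)
	{
		this_cpu_write(printk_emergency_level, -1);
	}

vprintk_emit() would then consult the per-CPU override whenever a message
carries no explicit loglevel.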

-ss


Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution

2019-11-12 Thread Aleksa Sarai
On 2019-11-13, Al Viro  wrote:
> On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> > On 2019-11-13, Al Viro  wrote:
> > > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > > 
> > > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > > >  
> > > > nd->m_seq = read_seqbegin(&mount_lock);
> > > >  
> > > > -   /* Figure out the starting path and root (if needed). */
> > > > -   if (*s == '/') {
> > > > +   /* Absolute pathname -- fetch the root. */
> > > > +   if (flags & LOOKUP_IN_ROOT) {
> > > > +   /* With LOOKUP_IN_ROOT, act as a relative path. */
> > > > +   while (*s == '/')
> > > > +   s++;
> > > 
> > > Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> > > will skip them just fine, you are actually risking breakage in this:
> > > if (*s && unlikely(!d_can_lookup(dentry))) {
> > > fdput(f);
> > > return ERR_PTR(-ENOTDIR);
> > > }
> > > which is downstream from there with your patch, AFAICS.
> > 
> > I switched to stripping the slashes at your suggestion a few revisions
> > ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
> > path_init(). But you're quite right about link_path_walk() -- and I'd be
> > more than happy to drop it.
> 
> That, IIRC, was about untangling the weirdness around multiple calls of
> dirfd_path_init() and basically went "we might want just strip the slashes
> in case of that flag very early in the entire thing, so that later the
> normal logics for absolute/relative would DTRT".

Ah okay, I'd misunderstood the point you were making in that thread.

> Since your check is right next to checking for absolute pathnames (and
> not in the very beginning of path_init()), we might as well turn the
> check for absolute pathname into *s == '/' && !(flags &
> LOOKUP_IN_ROOT) and be done with that.

Yup, agreed.
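
For reference, the combined check would then read along these lines (a
sketch of where the discussion above ends up):

	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->root). */
	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
		error = nd_jump_root(nd);
		if (unlikely(error))
			return ERR_PTR(error);
		return s;
	}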

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH 31/33] powerpc/64: make buildable without CONFIG_COMPAT

2019-11-12 Thread Nicholas Piggin
Michal Suchanek's on November 13, 2019 2:52 am:
> There are numerous references to 32bit functions in generic and 64bit
> code so ifdef them out.
> 
> Signed-off-by: Michal Suchanek 

For the most part these seem okay to me.

> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 45f1d5e54671..35874119b398 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -44,16 +44,16 @@ CFLAGS_btext.o += -DDISABLE_BRANCH_PROFILING
>  endif
>  
>  obj-y:= cputable.o ptrace.o syscalls.o \
> -irq.o align.o signal_32.o pmc.o vdso.o \
> +irq.o align.o signal_$(BITS).o pmc.o vdso.o \
>  process.o systbl.o idle.o \
>  signal.o sysfs.o cacheinfo.o time.o \
>  prom.o traps.o setup-common.o \
>  udbg.o misc.o io.o misc_$(BITS).o \
>  of_platform.o prom_parse.o
> -obj-$(CONFIG_PPC64)  += setup_64.o sys_ppc32.o \
> -signal_64.o ptrace32.o \
> +obj-$(CONFIG_PPC64)  += setup_64.o \
>  paca.o nvram_64.o firmware.o note.o \
>  syscall_64.o
> +obj-$(CONFIG_COMPAT) += sys_ppc32.o ptrace32.o signal_32.o
>  obj-$(CONFIG_VDSO32) += vdso32/
>  obj-$(CONFIG_PPC_WATCHDOG)   += watchdog.o
>  obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 00173cc904ef..c339a984958f 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -52,8 +52,10 @@
>  SYS_CALL_TABLE:
>   .tc sys_call_table[TC],sys_call_table
>  
> +#ifdef CONFIG_COMPAT
>  COMPAT_SYS_CALL_TABLE:
>   .tc compat_sys_call_table[TC],compat_sys_call_table
> +#endif
>  
>  /* This value is used to mark exception frames on the stack. */
>  exception_marker:
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index 60436432399f..61678cb0e6a1 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -247,7 +247,6 @@ static void do_signal(struct task_struct *tsk)
>   sigset_t *oldset = sigmask_to_save();
>   struct ksignal ksig = { .sig = 0 };
>   int ret;
> - int is32 = is_32bit_task();
>  
>   BUG_ON(tsk != current);
>  
> @@ -277,7 +276,7 @@ static void do_signal(struct task_struct *tsk)
>  
>   rseq_signal_deliver(&ksig, tsk->thread.regs);
>  
> - if (is32) {
> + if (is_32bit_task()) {
>   if (ksig.ka.sa.sa_flags & SA_SIGINFO)
>   ret = handle_rt_signal32(, oldset, tsk);
>   else

This is just a clean up I guess.

> diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
> index d00cfc4a39a9..319ebd4f494d 100644
> --- a/arch/powerpc/kernel/syscall_64.c
> +++ b/arch/powerpc/kernel/syscall_64.c
> @@ -17,7 +17,6 @@ typedef long (*syscall_fn)(long, long, long, long, long, long);
>  
>  long system_call_exception(long r3, long r4, long r5, long r6, long r7, long r8, unsigned long r0, struct pt_regs *regs)
>  {
> - unsigned long ti_flags;
>   syscall_fn f;
>  
>   if (IS_ENABLED(CONFIG_PPC_BOOK3S))
> @@ -64,8 +63,7 @@ long system_call_exception(long r3, long r4, long r5, long r6, long r7, long r8,
>  
>   __hard_irq_enable();
>  
> - ti_flags = current_thread_info()->flags;
> - if (unlikely(ti_flags & _TIF_SYSCALL_DOTRACE)) {
> + if (unlikely(current_thread_info()->flags & _TIF_SYSCALL_DOTRACE)) {
>   /*
>* We use the return value of do_syscall_trace_enter() as the
>* syscall number. If the syscall was rejected for any reason
> @@ -81,7 +79,7 @@ long system_call_exception(long r3, long r4, long r5, long r6, long r7, long r8,
>   /* May be faster to do array_index_nospec? */
>   barrier_nospec();
>  
> - if (unlikely(ti_flags & _TIF_32BIT)) {
> + if (unlikely(is_32bit_task())) {
>   f = (void *)compat_sys_call_table[r0];
>  
>   r3 &= 0xULL;

I guess this is okay, I did want to be careful about where ti_flags
was loaded exactly, but I think DOTRACE and 32BIT are not volatile.
Is it possible to define _TIF_32BIT to zero for 64-bit !compat case
and have the original branch eliminated, or does that cause other
problems?
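
Something like the following, presumably (editor's sketch of the
suggestion, assuming TIF_32BIT is only defined for compat builds):

	#ifdef CONFIG_COMPAT
	#define _TIF_32BIT	(1 << TIF_32BIT)
	#else
	#define _TIF_32BIT	0
	#endif

so that the _TIF_32BIT test constant-folds to false on 64-bit !COMPAT
builds and the compiler can drop the compat branch entirely.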

Thanks,
Nick


Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution

2019-11-12 Thread Al Viro
On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro  wrote:
> > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  
> > >   nd->m_seq = read_seqbegin(&mount_lock);
> > >  
> > > - /* Figure out the starting path and root (if needed). */
> > > - if (*s == '/') {
> > > + /* Absolute pathname -- fetch the root. */
> > > + if (flags & LOOKUP_IN_ROOT) {
> > > + /* With LOOKUP_IN_ROOT, act as a relative path. */
> > > + while (*s == '/')
> > > + s++;
> > 
> > Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> > will skip them just fine, you are actually risking breakage in this:
> > if (*s && unlikely(!d_can_lookup(dentry))) {
> > fdput(f);
> > return ERR_PTR(-ENOTDIR);
> > }
> > which is downstream from there with your patch, AFAICS.
> 
> I switched to stripping the slashes at your suggestion a few revisions
> ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
> path_init(). But you're quite right about link_path_walk() -- and I'd be
> more than happy to drop it.

That, IIRC, was about untangling the weirdness around multiple calls of
dirfd_path_init() and basically went "we might want just strip the slashes
in case of that flag very early in the entire thing, so that later the
normal logics for absolute/relative would DTRT".  Since your check is
right next to checking for absolute pathnames (and not in the very
beginning of path_init()), we might as well turn the check for
absolute pathname into *s == '/' && !(flags & LOOKUP_IN_ROOT) and be
done with that.


RE: [PATCH v4 47/47] soc: fsl: qe: remove PPC32 dependency from CONFIG_QUICC_ENGINE

2019-11-12 Thread Qiang Zhao
On Fri, Nov 8, 2019 at 21:01, Rasmus Villemoes  wrote:

> -Original Message-
> From: Rasmus Villemoes 
> Sent: 8 November 2019 21:01
> To: Qiang Zhao ; Leo Li ;
> Christophe Leroy 
> Cc: linuxppc-dev@lists.ozlabs.org; linux-arm-ker...@lists.infradead.org;
> linux-ker...@vger.kernel.org; Scott Wood ; Rasmus
> Villemoes 
> Subject: [PATCH v4 47/47] soc: fsl: qe: remove PPC32 dependency from
> CONFIG_QUICC_ENGINE
> 
> There are also ARM and ARM64 based SOCs with a QUICC Engine, and the core
> QE code as well as net/wan/fsl_ucc_hdlc and tty/serial/ucc_uart has now been
> modified to not rely on ppcisms.
> 
> So extend the architectures that can select QUICC_ENGINE, and add the rather
> modest requirements of OF && HAS_IOMEM.
> 
> The core code as well as the ucc_uart driver has been tested on an LS1021A
> (arm), and it has also been tested that the QE code still works on an mpc8309
> (ppc).
> 
> Signed-off-by: Rasmus Villemoes 
> ---
>  drivers/soc/fsl/qe/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/soc/fsl/qe/Kconfig b/drivers/soc/fsl/qe/Kconfig
> index cfa4b2939992..f1974f811572 100644
> --- a/drivers/soc/fsl/qe/Kconfig
> +++ b/drivers/soc/fsl/qe/Kconfig
> @@ -5,7 +5,8 @@
> 
>  config QUICC_ENGINE
>   bool "QUICC Engine (QE) framework support"
> - depends on FSL_SOC && PPC32
> + depends on OF && HAS_IOMEM
> + depends on PPC32 || ARM || ARM64 || COMPILE_TEST
>   select GENERIC_ALLOCATOR
>   select CRC32
>   help
> --
Tested-by: Qiang Zhao 
Tested QE-HDLC on ARM64!

Best Regards
Qiang Zhao


Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution

2019-11-12 Thread Aleksa Sarai
On 2019-11-13, Al Viro  wrote:
> On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> 
> > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  
> > nd->m_seq = read_seqbegin(&mount_lock);
> >  
> > -   /* Figure out the starting path and root (if needed). */
> > -   if (*s == '/') {
> > +   /* Absolute pathname -- fetch the root. */
> > +   if (flags & LOOKUP_IN_ROOT) {
> > +   /* With LOOKUP_IN_ROOT, act as a relative path. */
> > +   while (*s == '/')
> > +   s++;
> 
> Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> will skip them just fine, you are actually risking breakage in this:
> if (*s && unlikely(!d_can_lookup(dentry))) {
> fdput(f);
> return ERR_PTR(-ENOTDIR);
> }
> which is downstream from there with your patch, AFAICS.

I switched to stripping the slashes at your suggestion a few revisions
ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
path_init(). But you're quite right about link_path_walk() -- and I'd be
more than happy to drop it.

[1]: https://lore.kernel.org/lkml/20190712125552.gl17...@zeniv.linux.org.uk/

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v15 7/9] open: introduce openat2(2) syscall

2019-11-12 Thread Aleksa Sarai
On 2019-11-13, Al Viro  wrote:
> On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
> > +/*
> > + * Arguments for how openat2(2) should open the target path. If @resolve is
> > + * zero, then openat2(2) operates very similarly to openat(2).
> > + *
> > + * However, unlike openat(2), unknown bits in @flags result in -EINVAL 
> > rather
> > + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> > + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> > + *
> > + * @flags: O_* flags.
> > + * @mode: O_CREAT/O_TMPFILE file mode.
> > + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).
> 
> ???

Sorry, that was left over from a previous revision (where the magic-link
re-opening restrictions were part of this series).

> > + * @resolve: RESOLVE_* flags.
> > + */
> > +struct open_how {
> > +   __aligned_u64 flags;
> > +   __u16 mode;
> > +   __u16 __padding[3]; /* must be zeroed */
> > +   __aligned_u64 resolve;
> > +};
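
For illustration, userspace would fill this structure along the following
lines (an editor's sketch against this revision; the raw syscall invocation
is an assumption, since there is no libc wrapper yet):

	struct open_how how = {
		.flags   = O_RDONLY | O_CLOEXEC,
		.resolve = RESOLVE_IN_ROOT | RESOLVE_NO_MAGICLINKS,
	};

	fd = syscall(__NR_openat2, dirfd, "etc/passwd", &how);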


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH RFC 02/15] powerpc/eeh: Rename eeh_pe_get() to eeh_pe_find()

2019-11-12 Thread Oliver O'Halloran
On Wed, Oct 2, 2019 at 4:03 PM Sam Bobroff  wrote:
>
> There are now functions eeh_get_pe() and eeh_pe_get() which seems
> likely to cause confusion.
>
> Keep eeh_get_pe() because "get" is commonly used to refer to acquiring
> a reference (which it does), and rename eeh_pe_get() to eeh_pe_find()
> because it performs a search.
>
> Signed-off-by: Sam Bobroff 

Good idea.

Reviewed-by: Oliver O'Halloran 


Re: [PATCH v15 7/9] open: introduce openat2(2) syscall

2019-11-12 Thread Al Viro
On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
  
> +/*
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates very similarly to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).

???

> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __aligned_u64 flags;
> + __u16 mode;
> + __u16 __padding[3]; /* must be zeroed */
> + __aligned_u64 resolve;
> +};


Re: [PATCH v15 6/9] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution

2019-11-12 Thread Al Viro
On Tue, Nov 05, 2019 at 08:05:50PM +1100, Aleksa Sarai wrote:

> One other possible alternative (which previous versions of this patch
> used) would be to check with path_is_under() if there was a racing
> rename or mount (after re-taking the relevant seqlocks). While this does
> work, it results in possible O(n*m) behaviour if there are many renames
> or mounts occurring *anywhere on the system*.

BTW, do you realize that open-by-fhandle (or working nfsd, for that matter)
will trigger arseloads of write_seqlock(&rename_lock) simply on d_splice_alias()
bringing disconnected subtrees in contact with parent?


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 5:35 PM, Dan Williams wrote:
> On Tue, Nov 12, 2019 at 5:08 PM John Hubbard  wrote:
>>
>> On 11/12/19 4:58 PM, Dan Williams wrote:
>> ...
> It's not redundant relative to upstream, which does not do anything with
> FOLL_LONGTERM in the gup-slow path... but I have not looked at patches
> 1-7 to see if something there made it redundant.

 Oh, the hunk John had below for get_user_pages_remote() also needs to
 call __gup_longterm_locked() when FOLL_LONGTERM is specified, then
 that calls check_dax_vmas() which duplicates the vma_is_fsdax() check
 above.
>>>
>>> Oh true, good eye. It is redundant if it does additionally call
>>> __gup_longterm_locked(), and it needs to do that otherwises it undoes
>>> the CMA migration magic that Aneesh added.
>>>
>>
>> OK. So just to be clear, I'll be removing this from the patch:
>>
>> /*
>>  * The lifetime of a vaddr_get_pfn() page pin is
>>  * userspace-controlled. In the fs-dax case this could
>>  * lead to indefinite stalls in filesystem operations.
>>  * Disallow attempts to pin fs-dax pages via this
>>  * interface.
>>  */
>> if (ret > 0 && vma_is_fsdax(vmas[0])) {
>> ret = -EOPNOTSUPP;
>> put_page(page[0]);
>> }
>>
>> (and the declaration of "vmas", as well).
> 
> ...and add a call to __gup_longterm_locked internal to
> get_user_pages_remote(), right?
> 

Yes, and thanks for double-checking. I think I got a little dizzy following
the call stack there. :)  And now I see that this also affects the
implementation of pin_longterm_pages_remote(), because that will need the
same logic that get_user_pages_remote() has. 



thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution

2019-11-12 Thread Al Viro
On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:

> @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  
>   nd->m_seq = read_seqbegin(&mount_lock);
>  
> - /* Figure out the starting path and root (if needed). */
> - if (*s == '/') {
> + /* Absolute pathname -- fetch the root. */
> + if (flags & LOOKUP_IN_ROOT) {
> + /* With LOOKUP_IN_ROOT, act as a relative path. */
> + while (*s == '/')
> + s++;

Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
will skip them just fine, you are actually risking breakage in this:
if (*s && unlikely(!d_can_lookup(dentry))) {
fdput(f);
return ERR_PTR(-ENOTDIR);
}
which is downstream from there with your patch, AFAICS.


[PATCH AUTOSEL 4.4 21/48] KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LR

2019-11-12 Thread Sasha Levin
From: Cameron Kaiser 

[ Upstream commit 1006284c5e411872333967b1970c2ca46a9e225f ]

When an OS (currently only classic Mac OS) is running in KVM-PR and makes a
linked jump from code with split hack addressing enabled into code that does
not, LR is not correctly updated and reflects the previously munged PC.

To fix this, this patch undoes the address munge when exiting split
hack mode so that code relying on LR being a proper address will now
execute. This does not affect OS X or other operating systems running
on KVM-PR.

Signed-off-by: Cameron Kaiser 
Signed-off-by: Paul Mackerras 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 4aab1c9c83e1a..41ac54bfdfdd9 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -70,8 +70,11 @@ void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
 {
if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
ulong pc = kvmppc_get_pc(vcpu);
+   ulong lr = kvmppc_get_lr(vcpu);
if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+   if ((lr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+   kvmppc_set_lr(vcpu, lr & ~SPLIT_HACK_MASK);
vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
}
 }
-- 
2.20.1



[PATCH AUTOSEL 4.4 15/48] powerpc/pseries: Fix how we iterate over the DTL entries

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit 9258227e9dd1da8feddb07ad9702845546a581c9 ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we look up dtl_idx in
the lppaca to determine the number of entries in the buffer. Since
lppaca is in big endian, we need to do an endian conversion before using
this in our calculation to determine the number of entries in the
buffer. Without this, we do not iterate over the existing entries in the
DTL buffer properly.

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 37de83c5ef172..7a4d172c93765 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -185,7 +185,7 @@ static void dtl_stop(struct dtl *dtl)
 
 static u64 dtl_current_index(struct dtl *dtl)
 {
-   return lppaca_of(dtl->cpu).dtl_idx;
+   return be64_to_cpu(lppaca_of(dtl->cpu).dtl_idx);
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
-- 
2.20.1



[PATCH AUTOSEL 4.4 14/48] powerpc/pseries: Fix DTL buffer registration

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit db787af1b8a6b4be428ee2ea7d409dafcaa4a43c ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we register the DTL
buffer for a cpu when the associated file under powerpc/dtl in debugfs
is opened. When doing so, we need to set the size of the buffer being
registered in the second u32 word of the buffer. This needs to be in big
endian, but we are not doing the conversion resulting in the below error
showing up in dmesg:

dtl_start: DTL registration for cpu 0 (hw 0) failed with -4

Fix this in the obvious manner.

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 39049e4884fbd..37de83c5ef172 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -150,7 +150,7 @@ static int dtl_start(struct dtl *dtl)
 
/* Register our dtl buffer with the hypervisor. The HV expects the
 * buffer size to be passed in the second word of the buffer */
-   ((u32 *)dtl->buf)[1] = DISPATCH_LOG_BYTES;
+   ((u32 *)dtl->buf)[1] = cpu_to_be32(DISPATCH_LOG_BYTES);
 
hwcpu = get_hard_smp_processor_id(dtl->cpu);
addr = __pa(dtl->buf);
-- 
2.20.1



[PATCH AUTOSEL 4.9 65/68] misc: cxl: Fix possible null pointer dereference

2019-11-12 Thread Sasha Levin
From: zhong jiang 

[ Upstream commit 3dac3583bf1a61db6aaf31dfd752c677a4400afd ]

It is not safe to dereference an object before a null test. The debug
print is not needed anyway, so just remove it; ftrace can be used instead.

Signed-off-by: zhong jiang 
Acked-by: Andrew Donnellan 
Acked-by: Frederic Barrat 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Sasha Levin 
---
 drivers/misc/cxl/guest.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 3e102cd6ed914..d08509cd978a4 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -1026,8 +1026,6 @@ int cxl_guest_init_afu(struct cxl *adapter, int slice, struct device_node *afu_n
 
 void cxl_guest_remove_afu(struct cxl_afu *afu)
 {
-   pr_devel("in %s - AFU(%d)\n", __func__, afu->slice);
-
if (!afu)
return;
 
-- 
2.20.1



[PATCH AUTOSEL 4.9 29/68] KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LR

2019-11-12 Thread Sasha Levin
From: Cameron Kaiser 

[ Upstream commit 1006284c5e411872333967b1970c2ca46a9e225f ]

When an OS (currently only classic Mac OS) is running in KVM-PR and makes a
linked jump from code with split hack addressing enabled into code that does
not, LR is not correctly updated and reflects the previously munged PC.

To fix this, this patch undoes the address munge when exiting split
hack mode so that code relying on LR being a proper address will now
execute. This does not affect OS X or other operating systems running
on KVM-PR.
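
(Editorial note: schematically, the fix applies to LR the same unmunge
test already applied to PC; an illustration of the pattern in the diff
below, with "addr" standing in for either register:)

	/* If addr lies in the split-hack window, strip the offset. */
	if ((addr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
		addr &= ~SPLIT_HACK_MASK;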

Signed-off-by: Cameron Kaiser 
Signed-off-by: Paul Mackerras 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 73c3c127d8584..209cad89a11a5 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -78,8 +78,11 @@ void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
 {
if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
ulong pc = kvmppc_get_pc(vcpu);
+   ulong lr = kvmppc_get_lr(vcpu);
if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+   if ((lr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+   kvmppc_set_lr(vcpu, lr & ~SPLIT_HACK_MASK);
vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
}
 }
-- 
2.20.1



[PATCH AUTOSEL 4.9 20/68] powerpc/pseries: Fix how we iterate over the DTL entries

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit 9258227e9dd1da8feddb07ad9702845546a581c9 ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we look up dtl_idx in
the lppaca to determine the number of entries in the buffer. Since
lppaca is in big endian, we need to do an endian conversion before using
this in our calculation to determine the number of entries in the
buffer. Without this, we do not iterate over the existing entries in the
DTL buffer properly.
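
(Editorial note: a hedged sketch of the corrected index handling; the
ring arithmetic is illustrative, with N_DISPATCH_LOG being the number of
entries that fit in DISPATCH_LOG_BYTES:)

	u64 idx = be64_to_cpu(lppaca_of(cpu).dtl_idx);	/* entries written */
	u64 pos = idx % N_DISPATCH_LOG;			/* slot in the ring */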

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 37de83c5ef172..7a4d172c93765 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -185,7 +185,7 @@ static void dtl_stop(struct dtl *dtl)
 
 static u64 dtl_current_index(struct dtl *dtl)
 {
-   return lppaca_of(dtl->cpu).dtl_idx;
+   return be64_to_cpu(lppaca_of(dtl->cpu).dtl_idx);
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
-- 
2.20.1



[PATCH AUTOSEL 4.9 19/68] powerpc/pseries: Fix DTL buffer registration

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit db787af1b8a6b4be428ee2ea7d409dafcaa4a43c ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we register the DTL
buffer for a cpu when the associated file under powerpc/dtl in debugfs
is opened. When doing so, we need to set the size of the buffer being
registered in the second u32 word of the buffer. This needs to be in big
endian, but we are not doing the conversion resulting in the below error
showing up in dmesg:

dtl_start: DTL registration for cpu 0 (hw 0) failed with -4

Fix this in the obvious manner.

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 39049e4884fbd..37de83c5ef172 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -150,7 +150,7 @@ static int dtl_start(struct dtl *dtl)
 
/* Register our dtl buffer with the hypervisor. The HV expects the
 * buffer size to be passed in the second word of the buffer */
-   ((u32 *)dtl->buf)[1] = DISPATCH_LOG_BYTES;
+   ((u32 *)dtl->buf)[1] = cpu_to_be32(DISPATCH_LOG_BYTES);
 
hwcpu = get_hard_smp_processor_id(dtl->cpu);
addr = __pa(dtl->buf);
-- 
2.20.1



[PATCH AUTOSEL 4.14 114/115] powerpc/time: Fix clockevent_decrementer initalisation for PR KVM

2019-11-12 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit b4d16ab58c41ff0125822464bdff074cebd0fe47 ]

In the recent commit 8b78fdb045de ("powerpc/time: Use
clockevents_register_device(), fixing an issue with large
decrementer") we changed the way we initialise the decrementer
clockevent(s).

We no longer initialise the mult & shift values of
decrementer_clockevent itself.

This has the effect of breaking PR KVM, because it uses those values
in kvmppc_emulate_dec(). The symptom is guest kernels spin forever
mid-way through boot.

For now fix it by assigning back to decrementer_clockevent the mult
and shift values.
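
(Editorial note: mult and shift encode a fixed-point ratio between
decrementer ticks and nanoseconds. A hedged sketch of the conversion
these fields drive, following the clockevent_delta2ns() convention --
illustrative, not the exact kvmppc_emulate_dec() code:)

	static u64 dec_ticks_to_ns(u64 ticks)
	{
		u64 ns = ticks << decrementer_clockevent.shift;

		do_div(ns, decrementer_clockevent.mult);	/* ns /= mult */
		return ns;
	}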

Fixes: 8b78fdb045de ("powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer")
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 870e75d304591..7c7c5a16284d2 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -988,6 +988,10 @@ static void register_decrementer_clockevent(int cpu)
 
printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
dec->name, dec->mult, dec->shift, cpu);
+
+   /* Set values for KVM, see kvm_emulate_dec() */
+   decrementer_clockevent.mult = dec->mult;
+   decrementer_clockevent.shift = dec->shift;
 }
 
 static void enable_large_decrementer(void)
-- 
2.20.1



[PATCH AUTOSEL 4.14 106/115] misc: cxl: Fix possible null pointer dereference

2019-11-12 Thread Sasha Levin
From: zhong jiang 

[ Upstream commit 3dac3583bf1a61db6aaf31dfd752c677a4400afd ]

It is not safe to dereference an object before a null test. It is
not needed and just remove them. Ftrace can be used instead.

Signed-off-by: zhong jiang 
Acked-by: Andrew Donnellan 
Acked-by: Frederic Barrat 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Sasha Levin 
---
 drivers/misc/cxl/guest.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 1a64eb185cfd5..de2ce55395454 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -1028,8 +1028,6 @@ int cxl_guest_init_afu(struct cxl *adapter, int slice, struct device_node *afu_np)
 
 void cxl_guest_remove_afu(struct cxl_afu *afu)
 {
-   pr_devel("in %s - AFU(%d)\n", __func__, afu->slice);
-
if (!afu)
return;
 
-- 
2.20.1



[PATCH AUTOSEL 4.14 055/115] KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LR

2019-11-12 Thread Sasha Levin
From: Cameron Kaiser 

[ Upstream commit 1006284c5e411872333967b1970c2ca46a9e225f ]

When an OS (currently only classic Mac OS) is running in KVM-PR and makes a
linked jump from code with split hack addressing enabled into code that does
not, LR is not correctly updated and reflects the previously munged PC.

To fix this, this patch undoes the address munge when exiting split
hack mode so that code relying on LR being a proper address will now
execute. This does not affect OS X or other operating systems running
on KVM-PR.

Signed-off-by: Cameron Kaiser 
Signed-off-by: Paul Mackerras 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index d38280b01ef08..1eda812499376 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -79,8 +79,11 @@ void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
 {
if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
ulong pc = kvmppc_get_pc(vcpu);
+   ulong lr = kvmppc_get_lr(vcpu);
if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+   if ((lr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+   kvmppc_set_lr(vcpu, lr & ~SPLIT_HACK_MASK);
vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
}
 }
-- 
2.20.1



[PATCH AUTOSEL 4.14 048/115] powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer

2019-11-12 Thread Sasha Levin
From: Anton Blanchard 

[ Upstream commit 8b78fdb045de60a4eb35460092bbd3cffa925353 ]

We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.

Signed-off-by: Anton Blanchard 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index fe6f3a2854557..870e75d304591 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -984,10 +984,10 @@ static void register_decrementer_clockevent(int cpu)
*dec = decrementer_clockevent;
dec->cpumask = cpumask_of(cpu);
 
+   clockevents_config_and_register(dec, ppc_tb_freq, 2, decrementer_max);
+
printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
dec->name, dec->mult, dec->shift, cpu);
-
-   clockevents_register_device(dec);
 }
 
 static void enable_large_decrementer(void)
@@ -1035,18 +1035,7 @@ static void __init set_decrementer_max(void)
 
 static void __init init_decrementer_clockevent(void)
 {
-   int cpu = smp_processor_id();
-
-   clockevents_calc_mult_shift(&decrementer_clockevent, ppc_tb_freq, 4);
-
-   decrementer_clockevent.max_delta_ns =
-   clockevent_delta2ns(decrementer_max, &decrementer_clockevent);
-   decrementer_clockevent.max_delta_ticks = decrementer_max;
-   decrementer_clockevent.min_delta_ns =
-   clockevent_delta2ns(2, &decrementer_clockevent);
-   decrementer_clockevent.min_delta_ticks = 2;
-
-   register_decrementer_clockevent(cpu);
+   register_decrementer_clockevent(smp_processor_id());
 }
 
 void secondary_cpu_time_init(void)
-- 
2.20.1



[PATCH AUTOSEL 4.14 033/115] powerpc/xive: Move a dereference below a NULL test

2019-11-12 Thread Sasha Levin
From: zhong jiang 

[ Upstream commit cd5ff94577e004e0a4457e70d0ef3a030f4010b8 ]

Move the dereference of xc below the NULL test.

Signed-off-by: zhong jiang 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/sysdev/xive/common.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 818fc5351591c..110d8bb16ebbb 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1008,12 +1008,13 @@ static void xive_ipi_eoi(struct irq_data *d)
 {
struct xive_cpu *xc = __this_cpu_read(xive_cpu);
 
-   DBG_VERBOSE("IPI eoi: irq=%d [0x%lx] (HW IRQ 0x%x) pending=%02x\n",
-   d->irq, irqd_to_hwirq(d), xc->hw_ipi, xc->pending_prio);
-
/* Handle possible race with unplug and drop stale IPIs */
if (!xc)
return;
+
+   DBG_VERBOSE("IPI eoi: irq=%d [0x%lx] (HW IRQ 0x%x) pending=%02x\n",
+   d->irq, irqd_to_hwirq(d), xc->hw_ipi, xc->pending_prio);
+
xive_do_source_eoi(xc->hw_ipi, &xc->ipi_data);
xive_do_queue_eoi(xc);
 }
-- 
2.20.1



[PATCH AUTOSEL 4.14 032/115] powerpc/pseries: Fix how we iterate over the DTL entries

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit 9258227e9dd1da8feddb07ad9702845546a581c9 ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we look up dtl_idx in
the lppaca to determine the number of entries in the buffer. Since
lppaca is in big endian, we need to do an endian conversion before using
this in our calculation to determine the number of entries in the
buffer. Without this, we do not iterate over the existing entries in the
DTL buffer properly.

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index c762689e0eb33..ef6595153642e 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -184,7 +184,7 @@ static void dtl_stop(struct dtl *dtl)
 
 static u64 dtl_current_index(struct dtl *dtl)
 {
-   return lppaca_of(dtl->cpu).dtl_idx;
+   return be64_to_cpu(lppaca_of(dtl->cpu).dtl_idx);
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
-- 
2.20.1



[PATCH AUTOSEL 4.14 031/115] powerpc/pseries: Fix DTL buffer registration

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit db787af1b8a6b4be428ee2ea7d409dafcaa4a43c ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we register the DTL
buffer for a cpu when the associated file under powerpc/dtl in debugfs
is opened. When doing so, we need to set the size of the buffer being
registered in the second u32 word of the buffer. This needs to be in big
endian, but we are not doing the conversion resulting in the below error
showing up in dmesg:

dtl_start: DTL registration for cpu 0 (hw 0) failed with -4

Fix this in the obvious manner.

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 18014cdeb590a..c762689e0eb33 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -149,7 +149,7 @@ static int dtl_start(struct dtl *dtl)
 
/* Register our dtl buffer with the hypervisor. The HV expects the
 * buffer size to be passed in the second word of the buffer */
-   ((u32 *)dtl->buf)[1] = DISPATCH_LOG_BYTES;
+   ((u32 *)dtl->buf)[1] = cpu_to_be32(DISPATCH_LOG_BYTES);
 
hwcpu = get_hard_smp_processor_id(dtl->cpu);
addr = __pa(dtl->buf);
-- 
2.20.1



[PATCH AUTOSEL 4.14 022/115] KVM: PPC: Inform the userspace about TCE update failures

2019-11-12 Thread Sasha Levin
From: Alexey Kardashevskiy 

[ Upstream commit f7960e299f13f069d6f3d4e157d91bfca2669677 ]

We return H_TOO_HARD from TCE update handlers when we think that
the next handler (realmode -> virtual mode -> user mode) has a chance to
handle the request; H_HARDWARE/H_CLOSED otherwise.

This changes the handlers to return H_TOO_HARD on every error giving
the userspace an opportunity to handle any request or at least log
them all.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++----
 arch/powerpc/kvm/book3s_64_vio_hv.c | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 2c6cce8e7cfd0..5e44462960213 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -404,7 +404,7 @@ static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
long ret;
 
if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
if (dir == DMA_NONE)
return H_SUCCESS;
@@ -434,15 +434,15 @@ long kvmppc_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
return H_TOO_HARD;
 
if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
if (mm_iommu_mapped_inc(mem))
-   return H_CLOSED;
+   return H_TOO_HARD;
 
ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
if (WARN_ON_ONCE(ret)) {
mm_iommu_mapped_dec(mem);
-   return H_HARDWARE;
+   return H_TOO_HARD;
}
 
if (dir != DMA_NONE)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 23d6d1592f117..c75e5664fe3d8 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -264,14 +264,14 @@ static long kvmppc_rm_tce_iommu_map(struct kvm *kvm, struct iommu_table *tbl,
 
if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
&hpa)))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
pua = (void *) vmalloc_to_phys(pua);
if (WARN_ON_ONCE_RM(!pua))
return H_HARDWARE;
 
if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
-   return H_CLOSED;
+   return H_TOO_HARD;
 
ret = iommu_tce_xchg_rm(tbl, entry, &hpa, &dir);
if (ret) {
@@ -448,7 +448,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
rmap = (void *) vmalloc_to_phys(rmap);
if (WARN_ON_ONCE_RM(!rmap))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
/*
 * Synchronize with the MMU notifier callbacks in
-- 
2.20.1



Re: [PATCH v15 4/9] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution

2019-11-12 Thread Al Viro
On Tue, Nov 05, 2019 at 08:05:48PM +1100, Aleksa Sarai wrote:

Minor nit here - I'd split "move the conditional call of set_root()
into nd_jump_root()" into a separate patch before that one.  Makes
for fewer distractions in this one.  I'd probably fold "and be
ready for errors other than -ECHILD" into the same preliminary
patch.

> + /* Not currently safe for scoped-lookups. */
> + if (unlikely(nd->flags & LOOKUP_IS_SCOPED))
> + return ERR_PTR(-EXDEV);

Also a candidate for doing in nd_jump_link()...

> @@ -1373,8 +1403,11 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>   struct inode *inode = nd->inode;
>  
>   while (1) {
> - if (path_equal(&nd->path, &nd->root))
> + if (path_equal(&nd->path, &nd->root)) {
> + if (unlikely(nd->flags & LOOKUP_BENEATH))
> + return -EXDEV;

Umm...  Are you sure it's not -ECHILD?


[PATCH AUTOSEL 4.19 207/209] powerpc/time: Fix clockevent_decrementer initalisation for PR KVM

2019-11-12 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit b4d16ab58c41ff0125822464bdff074cebd0fe47 ]

In the recent commit 8b78fdb045de ("powerpc/time: Use
clockevents_register_device(), fixing an issue with large
decrementer") we changed the way we initialise the decrementer
clockevent(s).

We no longer initialise the mult & shift values of
decrementer_clockevent itself.

This has the effect of breaking PR KVM, because it uses those values
in kvmppc_emulate_dec(). The symptom is guest kernels spin forever
mid-way through boot.

For now fix it by assigning back to decrementer_clockevent the mult
and shift values.

Fixes: 8b78fdb045de ("powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer")
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 6a1f0a084ca35..7707990c4c169 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -988,6 +988,10 @@ static void register_decrementer_clockevent(int cpu)
 
printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
dec->name, dec->mult, dec->shift, cpu);
+
+   /* Set values for KVM, see kvm_emulate_dec() */
+   decrementer_clockevent.mult = dec->mult;
+   decrementer_clockevent.shift = dec->shift;
 }
 
 static void enable_large_decrementer(void)
-- 
2.20.1



[PATCH AUTOSEL 4.19 196/209] misc: cxl: Fix possible null pointer dereference

2019-11-12 Thread Sasha Levin
From: zhong jiang 

[ Upstream commit 3dac3583bf1a61db6aaf31dfd752c677a4400afd ]

It is not safe to dereference an object before a null test. It is
not needed and just remove them. Ftrace can be used instead.

Signed-off-by: zhong jiang 
Acked-by: Andrew Donnellan 
Acked-by: Frederic Barrat 
Signed-off-by: Greg Kroah-Hartman 
Signed-off-by: Sasha Levin 
---
 drivers/misc/cxl/guest.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index b83a373e3a8dd..08f4a512afad2 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -1020,8 +1020,6 @@ int cxl_guest_init_afu(struct cxl *adapter, int slice, struct device_node *afu_np)
 
 void cxl_guest_remove_afu(struct cxl_afu *afu)
 {
-   pr_devel("in %s - AFU(%d)\n", __func__, afu->slice);
-
if (!afu)
return;
 
-- 
2.20.1



[PATCH AUTOSEL 4.19 122/209] soc: fsl: bman_portals: defer probe after bman's probe

2019-11-12 Thread Sasha Levin
From: Laurentiu Tudor 

[ Upstream commit e0940b34c40e95d1879691d2474d182c57aae0de ]

A crash in bman portal probing could not be triggered (as is the case
with qman portals) but it does make calls [1] into the bman driver so
let's make sure the bman portal probing happens after bman's.

[1]  bman_p_irqsource_add() (in bman) called by:
   init_pcfg() called by:
 bman_portal_probe()
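
(Editorial note: the check added below is the standard probe-ordering
pattern; bman_is_probed() returns 1 when bman has probed, 0 when it has
not yet, and a negative value when its probe failed:)

	err = bman_is_probed();
	if (!err)
		return -EPROBE_DEFER;	/* bman not probed yet, retry later */
	if (err < 0)
		return -ENODEV;		/* bman itself failed, give up */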

Signed-off-by: Laurentiu Tudor 
Signed-off-by: Li Yang 
Signed-off-by: Sasha Levin 
---
 drivers/soc/fsl/qbman/bman_portal.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/soc/fsl/qbman/bman_portal.c b/drivers/soc/fsl/qbman/bman_portal.c
index 2f71f7df3465a..f9edd28894fda 100644
--- a/drivers/soc/fsl/qbman/bman_portal.c
+++ b/drivers/soc/fsl/qbman/bman_portal.c
@@ -91,7 +91,15 @@ static int bman_portal_probe(struct platform_device *pdev)
struct device_node *node = dev->of_node;
struct bm_portal_config *pcfg;
struct resource *addr_phys[2];
-   int irq, cpu;
+   int irq, cpu, err;
+
+   err = bman_is_probed();
+   if (!err)
+   return -EPROBE_DEFER;
+   if (err < 0) {
dev_err(&pdev->dev, "failing probe due to bman probe error\n");
+   return -ENODEV;
+   }
 
pcfg = devm_kmalloc(dev, sizeof(*pcfg), GFP_KERNEL);
if (!pcfg)
-- 
2.20.1



[PATCH AUTOSEL 4.19 095/209] KVM: PPC: Book3S PR: Exiting split hack mode needs to fixup both PC and LR

2019-11-12 Thread Sasha Levin
From: Cameron Kaiser 

[ Upstream commit 1006284c5e411872333967b1970c2ca46a9e225f ]

When an OS (currently only classic Mac OS) is running in KVM-PR and makes a
linked jump from code with split hack addressing enabled into code that does
not, LR is not correctly updated and reflects the previously munged PC.

To fix this, this patch undoes the address munge when exiting split
hack mode so that code relying on LR being a proper address will now
execute. This does not affect OS X or other operating systems running
on KVM-PR.

Signed-off-by: Cameron Kaiser 
Signed-off-by: Paul Mackerras 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 281f074581a3b..cc05f346e0421 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -78,8 +78,11 @@ void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
 {
if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
ulong pc = kvmppc_get_pc(vcpu);
+   ulong lr = kvmppc_get_lr(vcpu);
if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+   if ((lr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+   kvmppc_set_lr(vcpu, lr & ~SPLIT_HACK_MASK);
vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
}
 }
-- 
2.20.1



[PATCH AUTOSEL 4.19 085/209] powerpc/64s/radix: Explicitly flush ERAT with local LPID invalidation

2019-11-12 Thread Sasha Levin
From: Nicholas Piggin 

[ Upstream commit 053c5a753e951c5dd1729af2cf4d8107f2e6e09b ]

Local radix TLB flush operations that operate on congruence classes
have explicit ERAT flushes for POWER9. The process scoped LPID flush
did not have a flush, so add it.

Signed-off-by: Nicholas Piggin 
Signed-off-by: Nicholas Piggin 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/tlb-radix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 0cddae4263f96..21441ff17b92b 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -366,6 +366,7 @@ static inline void _tlbiel_lpid_guest(unsigned long lpid, unsigned long ric)
__tlbiel_lpid_guest(lpid, set, RIC_FLUSH_TLB);
 
asm volatile("ptesync": : :"memory");
+   asm volatile(PPC_INVALIDATE_ERAT : : :"memory");
 }
 
 
-- 
2.20.1



[PATCH AUTOSEL 4.19 084/209] powerpc/time: Use clockevents_register_device(), fixing an issue with large decrementer

2019-11-12 Thread Sasha Levin
From: Anton Blanchard 

[ Upstream commit 8b78fdb045de60a4eb35460092bbd3cffa925353 ]

We currently cap the decrementer clockevent at 4 seconds, even on systems
with large decrementer support. Fix this by converting the code to use
clockevents_register_device() which calculates the upper bound based on
the max_delta passed in.

Signed-off-by: Anton Blanchard 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/time.c | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 70f145e024877..6a1f0a084ca35 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -984,10 +984,10 @@ static void register_decrementer_clockevent(int cpu)
*dec = decrementer_clockevent;
dec->cpumask = cpumask_of(cpu);
 
+   clockevents_config_and_register(dec, ppc_tb_freq, 2, decrementer_max);
+
printk_once(KERN_DEBUG "clockevent: %s mult[%x] shift[%d] cpu[%d]\n",
dec->name, dec->mult, dec->shift, cpu);
-
-   clockevents_register_device(dec);
 }
 
 static void enable_large_decrementer(void)
@@ -1035,18 +1035,7 @@ static void __init set_decrementer_max(void)
 
 static void __init init_decrementer_clockevent(void)
 {
-   int cpu = smp_processor_id();
-
-   clockevents_calc_mult_shift(&decrementer_clockevent, ppc_tb_freq, 4);
-
-   decrementer_clockevent.max_delta_ns =
-   clockevent_delta2ns(decrementer_max, &decrementer_clockevent);
-   decrementer_clockevent.max_delta_ticks = decrementer_max;
-   decrementer_clockevent.min_delta_ns =
-   clockevent_delta2ns(2, &decrementer_clockevent);
-   decrementer_clockevent.min_delta_ticks = 2;
-
-   register_decrementer_clockevent(cpu);
+   register_decrementer_clockevent(smp_processor_id());
 }
 
 void secondary_cpu_time_init(void)
-- 
2.20.1



[PATCH AUTOSEL 4.19 057/209] powerpc/xive: Move a dereference below a NULL test

2019-11-12 Thread Sasha Levin
From: zhong jiang 

[ Upstream commit cd5ff94577e004e0a4457e70d0ef3a030f4010b8 ]

Move the dereference of xc below the NULL test.

Signed-off-by: zhong jiang 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/sysdev/xive/common.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 0b24b10312213..f3af53abd40fb 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1009,12 +1009,13 @@ static void xive_ipi_eoi(struct irq_data *d)
 {
struct xive_cpu *xc = __this_cpu_read(xive_cpu);
 
-   DBG_VERBOSE("IPI eoi: irq=%d [0x%lx] (HW IRQ 0x%x) pending=%02x\n",
-   d->irq, irqd_to_hwirq(d), xc->hw_ipi, xc->pending_prio);
-
/* Handle possible race with unplug and drop stale IPIs */
if (!xc)
return;
+
+   DBG_VERBOSE("IPI eoi: irq=%d [0x%lx] (HW IRQ 0x%x) pending=%02x\n",
+   d->irq, irqd_to_hwirq(d), xc->hw_ipi, xc->pending_prio);
+
xive_do_source_eoi(xc->hw_ipi, &xc->ipi_data);
xive_do_queue_eoi(xc);
 }
-- 
2.20.1



[PATCH AUTOSEL 4.19 056/209] powerpc/pseries: Fix how we iterate over the DTL entries

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit 9258227e9dd1da8feddb07ad9702845546a581c9 ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we look up dtl_idx in
the lppaca to determine the number of entries in the buffer. Since
lppaca is in big endian, we need to do an endian conversion before using
this in our calculation to determine the number of entries in the
buffer. Without this, we do not iterate over the existing entries in the
DTL buffer properly.

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index c762689e0eb33..ef6595153642e 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -184,7 +184,7 @@ static void dtl_stop(struct dtl *dtl)
 
 static u64 dtl_current_index(struct dtl *dtl)
 {
-   return lppaca_of(dtl->cpu).dtl_idx;
+   return be64_to_cpu(lppaca_of(dtl->cpu).dtl_idx);
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
-- 
2.20.1



[PATCH AUTOSEL 4.19 055/209] powerpc/pseries: Fix DTL buffer registration

2019-11-12 Thread Sasha Levin
From: "Naveen N. Rao" 

[ Upstream commit db787af1b8a6b4be428ee2ea7d409dafcaa4a43c ]

When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set, we register the DTL
buffer for a cpu when the associated file under powerpc/dtl in debugfs
is opened. When doing so, we need to set the size of the buffer being
registered in the second u32 word of the buffer. This needs to be in big
endian, but we are not doing the conversion resulting in the below error
showing up in dmesg:

dtl_start: DTL registration for cpu 0 (hw 0) failed with -4

Fix this in the obvious manner.

Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.")
Signed-off-by: Naveen N. Rao 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/pseries/dtl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 18014cdeb590a..c762689e0eb33 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -149,7 +149,7 @@ static int dtl_start(struct dtl *dtl)
 
/* Register our dtl buffer with the hypervisor. The HV expects the
 * buffer size to be passed in the second word of the buffer */
-   ((u32 *)dtl->buf)[1] = DISPATCH_LOG_BYTES;
+   ((u32 *)dtl->buf)[1] = cpu_to_be32(DISPATCH_LOG_BYTES);
 
hwcpu = get_hard_smp_processor_id(dtl->cpu);
addr = __pa(dtl->buf);
-- 
2.20.1



[PATCH AUTOSEL 4.19 042/209] KVM: PPC: Inform the userspace about TCE update failures

2019-11-12 Thread Sasha Levin
From: Alexey Kardashevskiy 

[ Upstream commit f7960e299f13f069d6f3d4e157d91bfca2669677 ]

We return H_TOO_HARD from TCE update handlers when we think that
the next handler (realmode -> virtual mode -> user mode) has a chance to
handle the request; H_HARDWARE/H_CLOSED otherwise.

This changes the handlers to return H_TOO_HARD on every error giving
the userspace an opportunity to handle any request or at least log
them all.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Signed-off-by: Michael Ellerman 
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++----
 arch/powerpc/kvm/book3s_64_vio_hv.c | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 07a8004c3c237..65486c3d029b5 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -401,7 +401,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm,
long ret;
 
if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir)))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
if (dir == DMA_NONE)
return H_SUCCESS;
@@ -449,15 +449,15 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
return H_TOO_HARD;
 
if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
if (mm_iommu_mapped_inc(mem))
-   return H_CLOSED;
+   return H_TOO_HARD;
 
ret = iommu_tce_xchg(tbl, entry, &hpa, &dir);
if (WARN_ON_ONCE(ret)) {
mm_iommu_mapped_dec(mem);
-   return H_HARDWARE;
+   return H_TOO_HARD;
}
 
if (dir != DMA_NONE)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index eb8b11515a7ff..d258ed4ef77c3 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -300,10 +300,10 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 
if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
&hpa)))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
if (WARN_ON_ONCE_RM(mm_iommu_mapped_inc(mem)))
-   return H_CLOSED;
+   return H_TOO_HARD;
 
ret = iommu_tce_xchg_rm(kvm->mm, tbl, entry, &hpa, &dir);
if (ret) {
@@ -501,7 +501,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
rmap = (void *) vmalloc_to_phys(rmap);
if (WARN_ON_ONCE_RM(!rmap))
-   return H_HARDWARE;
+   return H_TOO_HARD;
 
/*
 * Synchronize with the MMU notifier callbacks in
-- 
2.20.1



Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing

2019-11-12 Thread Al Viro
On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:

> @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
>  void nd_jump_link(struct path *path)
>  {
>   struct nameidata *nd = current->nameidata;
> +
> + nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
>   path_put(&nd->path);
>  
>   nd->path = *path;
> @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
>   if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
>   if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
>   return ERR_PTR(-ELOOP);
> + if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> + if (!nd->last_magiclink.same_mnt)
> + return ERR_PTR(-EXDEV);
> + }
>   }

Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
into nd_jump_link()?  Or just have nd_jump_link() return an error...

I mean, look at the callers of nd_jump_link().
static const char *policy_get_link(struct dentry *dentry,
   struct inode *inode,
   struct delayed_call *done)
{
struct aa_ns *ns;
struct path path;  

if (!dentry)   
return ERR_PTR(-ECHILD);
ns = aa_get_current_ns();
path.mnt = mntget(aafs_mnt);
path.dentry = dget(ns_dir(ns));
nd_jump_link(&path);
aa_put_ns(ns);

return NULL;
}
- very close to the end of ->get_link() instance.

static const char *proc_pid_get_link(struct dentry *dentry,
 struct inode *inode,
 struct delayed_call *done)
{ 
struct path path;
int error = -EACCES;

if (!dentry)
return ERR_PTR(-ECHILD);

/* Are we allowed to snoop on the tasks file descriptors? */
if (!proc_fd_access_allowed(inode))
goto out;

error = PROC_I(inode)->op.proc_get_link(dentry, &path);
if (error)
goto out;

nd_jump_link(&path);
return NULL;
out:   
return ERR_PTR(error);
}
Ditto.

static const char *proc_ns_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops;
struct task_struct *task;
struct path ns_path;
void *error = ERR_PTR(-EACCES);

if (!dentry)
return ERR_PTR(-ECHILD);

task = get_proc_task(inode);
if (!task)
return error;

if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
error = ns_get_path(&ns_path, task, ns_ops);
if (!error)
nd_jump_link(&ns_path);
}
put_task_struct(task);
return error;
}

The same.  And that's it - there's no more of them.  So how about
this in the beginning of the series, then having your magiclink
error handling done in nd_jump_link()?

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..8ec924813c30 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -859,7 +859,7 @@ static int nd_jump_root(struct nameidata *nd)
  * Helper to directly jump to a known parsed path from ->get_link,
  * caller must have taken a reference to path beforehand.
  */
-void nd_jump_link(struct path *path)
+const char *nd_jump_link(struct path *path)
 {
struct nameidata *nd = current->nameidata;
path_put(&nd->path);
@@ -867,6 +867,7 @@ void nd_jump_link(struct path *path)
nd->path = *path;
nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
+   return NULL;
 }
 
 static inline void put_link(struct nameidata *nd)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..ac4e57a3dfa5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1626,8 +1626,7 @@ static const char *proc_pid_get_link(struct dentry *dentry,
if (error)
goto out;
 
-   nd_jump_link(&path);
-   return NULL;
+   return nd_jump_link(&path);
 out:
return ERR_PTR(error);
 }
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..dde0c501b2f3 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -54,7 +54,7 @@ static const char *proc_ns_get_link(struct dentry *dentry,
if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) {
error = ns_get_path(_path, task, ns_ops);
if (!error)
-   nd_jump_link(&ns_path);
+   error = nd_jump_link(&ns_path);
}
put_task_struct(task);
return error;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 397a08ade6a2..f3e8438e5631 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -68,7 +68,7 @@ extern int follow_up(struct path *);
 

Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Dan Williams
On Tue, Nov 12, 2019 at 5:08 PM John Hubbard  wrote:
>
> On 11/12/19 4:58 PM, Dan Williams wrote:
> ...
> >>> It's not redundant relative to upstream which does not do anything with
> >>> FOLL_LONGTERM in the gup-slow path... but I have not looked at patches
> >>> 1-7 to see if something there made it redundant.
> >>
> >> Oh, the hunk John had below for get_user_pages_remote() also needs to
> >> call __gup_longterm_locked() when FOLL_LONGTERM is specified, then
> >> that calls check_dax_vmas() which duplicates the vma_is_fsdax() check
> >> above.
> >
> > Oh true, good eye. It is redundant if it does additionally call
> > __gup_longterm_locked(), and it needs to do that otherwises it undoes
> > the CMA migration magic that Aneesh added.
> >
>
> OK. So just to be clear, I'll be removing this from the patch:
>
> /*
>  * The lifetime of a vaddr_get_pfn() page pin is
>  * userspace-controlled. In the fs-dax case this could
>  * lead to indefinite stalls in filesystem operations.
>  * Disallow attempts to pin fs-dax pages via this
>  * interface.
>  */
> if (ret > 0 && vma_is_fsdax(vmas[0])) {
> ret = -EOPNOTSUPP;
> put_page(page[0]);
> }
>
> (and the declaration of "vmas", as well).

...and add a call to __gup_longterm_locked internal to
get_user_pages_remote(), right?
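
(Editorial sketch of that internal routing, assuming the v5.4-era
__gup_longterm_locked() signature -- this is not the merged patch:)

	if (gup_flags & FOLL_LONGTERM) {
		if (WARN_ON_ONCE(locked))
			return -EINVAL;
		return __gup_longterm_locked(tsk, mm, start, nr_pages,
					     pages, vmas, gup_flags);
	}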


Re: [PATCH v15 2/9] namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution

2019-11-12 Thread Al Viro
On Tue, Nov 05, 2019 at 08:05:46PM +1100, Aleksa Sarai wrote:
> @@ -1078,6 +1079,10 @@ const char *get_link(struct nameidata *nd)
>   } else {
>   res = get(dentry, inode, &nd->done);
>   }
> + if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> + if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> + return ERR_PTR(-ELOOP);
> + }

Minor nit - the first check probably wants unlikely() more than the
second one; it's probably noise anyway, but most of the symlinks
traversed are not going to be procfs ones, so you get test + branch
taken most of the time.

OTOH, that just might compile into
fetch nd->flags
and with LOOKUP_MAGICLINK_JUMPED | LOOKUP_NO_MAGICLINKS
compare with the same constant
unlikely branch when equal

Anyway, that's no more than a minor nit and can be dealt with later (if
at all)
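
(Editorially, the folded test Al describes would read roughly as follows
-- one fetch, one mask, one compare; illustrative only:)

	if (unlikely((nd->flags & (LOOKUP_MAGICLINK_JUMPED | LOOKUP_NO_MAGICLINKS)) ==
		     (LOOKUP_MAGICLINK_JUMPED | LOOKUP_NO_MAGICLINKS)))
		return ERR_PTR(-ELOOP);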


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard

On 11/12/19 4:58 PM, Dan Williams wrote:
...

It's not redundant relative to upstream which does not do anything with
FOLL_LONGTERM in the gup-slow path... but I have not looked at patches
1-7 to see if something there made it redundant.


Oh, the hunk John had below for get_user_pages_remote() also needs to
call __gup_longterm_locked() when FOLL_LONGTERM is specified, then
that calls check_dax_vmas() which duplicates the vma_is_fsdax() check
above.


Oh true, good eye. It is redundant if it does additionally call
__gup_longterm_locked(), and it needs to do that otherwises it undoes
the CMA migration magic that Aneesh added.



OK. So just to be clear, I'll be removing this from the patch:

/*
 * The lifetime of a vaddr_get_pfn() page pin is
 * userspace-controlled. In the fs-dax case this could
 * lead to indefinite stalls in filesystem operations.
 * Disallow attempts to pin fs-dax pages via this
 * interface.
 */
if (ret > 0 && vma_is_fsdax(vmas[0])) {
ret = -EOPNOTSUPP;
put_page(page[0]);
}

(and the declaration of "vmas", as well).

thanks,
--
John Hubbard
NVIDIA


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Dan Williams
On Tue, Nov 12, 2019 at 3:43 PM Jason Gunthorpe  wrote:
>
> On Tue, Nov 12, 2019 at 02:45:51PM -0800, Dan Williams wrote:
> > On Tue, Nov 12, 2019 at 2:43 PM John Hubbard  wrote:
> > >
> > > On 11/12/19 12:43 PM, Jason Gunthorpe wrote:
> > > ...
> > > >> -}
> > > >> +ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> > > >> +page, vmas, NULL);
> > > >> +/*
> > > >> + * The lifetime of a vaddr_get_pfn() page pin is
> > > >> + * userspace-controlled. In the fs-dax case this could
> > > >> + * lead to indefinite stalls in filesystem operations.
> > > >> + * Disallow attempts to pin fs-dax pages via this
> > > >> + * interface.
> > > >> + */
> > > >> +if (ret > 0 && vma_is_fsdax(vmas[0])) {
> > > >> +ret = -EOPNOTSUPP;
> > > >> +put_page(page[0]);
> > > >>  }
> > > >
> > > > AFAIK this chunk is redundant now as it is some hack to emulate
> > > > FOLL_LONGTERM? So vmas can be deleted too.
> > >
> > > Let me first make sure I understand what Dan has in mind for the vma
> > > checking, in the other thread...
> >
> > It's not redundant relative to upstream which does not do anything with
> > FOLL_LONGTERM in the gup-slow path... but I have not looked at patches
> > 1-7 to see if something there made it redundant.
>
> Oh, the hunk John had below for get_user_pages_remote() also needs to
> call __gup_longterm_locked() when FOLL_LONGTERM is specified, then
> that calls check_dax_vmas() which duplicates the vma_is_fsdax() check
> above.

Oh true, good eye. It is redundant if it does additionally call
__gup_longterm_locked(), and it needs to do that otherwises it undoes
the CMA migration magic that Aneesh added.

>
> Certainly no caller of FOLL_LONGTERM should have to do dax specific
> VMA checking.

Agree, that was my comment about cleaning up the vma_is_fsdax() check
to be internal to the gup core.


Re: [PATCH v15 0/9] open: introduce openat2(2) syscall

2019-11-12 Thread Aleksa Sarai
On 2019-11-12, Kees Cook  wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai  wrote:
> > > This patchset is being developed here:
> > >   
> > > 
> > > Patch changelog:
> > >  v15:
> > >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus 
> > > Torvalds]
> > >   * Split out patches for each individual LOOKUP flag.
> > >   * Reword commit messages to give more background information about the
> > > series, as well as mention the semantics of each flag in more detail.
> > > [...]
> > 
> > Ping -- this patch hasn't been touched for a week. Thanks.
> 
> If I've been following correctly, everyone is happy with this series.
> (i.e. Linus's comment appear to have been addressed.)
> 
> Perhaps the next question is should this go via a pull request by you to
> Linus directly during the v5.5 merge window, via akpm, via
> Christian, or some other path? Besides Linus, it's not been clear who
> should "claim" this series. :)

Given the namei changes, I wanted to avoid stepping on Al's toes. Though
he did review the series a few versions ago, the discussion didn't focus
on the openat2(2) semantics (which have also changed since then). I'm
not sure whether to interpret the silence to mean he's satisfied with
things as they are, or if he hasn't had more time to look at the series.

As for which tree it should be routed to, I don't mind -- Christian is
the most straight-forward choice (but if Al wants to route it, that's
fine with me too).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-12 Thread Paul Mackerras
On Tue, Nov 12, 2019 at 06:45:55AM -0800, Ram Pai wrote:
> On Tue, Nov 12, 2019 at 10:32:04PM +1100, Paul Mackerras wrote:
> > On Mon, Nov 11, 2019 at 11:52:15PM -0800, Ram Pai wrote:
> > > There is subtle problem removing that code from the assembly.
> > > 
> > > If the H_SVM_INIT_ABORT hcall returns to the ultravisor without clearing
> > > kvm->arch.secure_guest, the hypervisor will continue to think that the
> > > VM is a secure VM.   However the primary reason the H_SVM_INIT_ABORT
> > > hcall was invoked, was to inform the Hypervisor that it should no longer
> > > consider the VM as a Secure VM. So there is a inconsistency there.
> > 
> > Most of the checks that look at whether a VM is a secure VM use code
> > like "if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)".  Now
> > since KVMPPC_SECURE_INIT_ABORT is 4, an if statement such as that will
> > take the false branch once we have set kvm->arch.secure_guest to
> > KVMPPC_SECURE_INIT_ABORT in kvmppc_h_svm_init_abort.  So in fact in
> > most places we will treat the VM as a normal VM from then on.  If
> > there are any places where we still need to treat the VM as a secure
> > VM while we are processing the abort we can easily do that too.
> 
> Is the suggestion --  KVMPPC_SECURE_INIT_ABORT should never return back
> to the Ultravisor?   Because removing that assembly code will NOT lead the

No.  The suggestion is that vcpu->arch.secure_guest stays set to
KVMPPC_SECURE_INIT_ABORT until userspace calls KVM_PPC_SVM_OFF.

> Hypervisor back into the Ultravisor.  This is fine with the
> ultravisor. But then the hypervisor will not know where to return to.
> If it wants to return directly to the VM, it wont know to
> which address. It will be in a limbo.
> 
> > 
> > > This is fine, as long as the VM does not invoke any hcall or does not
> > > receive any hypervisor-exceptions.  The moment either of those happen,
> > > the control goes into the hypervisor, the hypervisor services
> > > the exception/hcall and while returning, it will see that the
> > > kvm->arch.secure_guest flag is set and **incorrectly** return
> > > to the ultravisor through a UV_RETURN ucall.  Ultravisor will
> > > not know what to do with it, because it does not consider that
> > > VM as a Secure VM.  Bad things happen.
> > 
> > If bad things happen in the ultravisor because the hypervisor did
> > something it shouldn't, then it's game over, you just lost, thanks for
> > playing.  The ultravisor MUST be able to cope with bogus UV_RETURN
> > calls for a VM that it doesn't consider to be a secure VM.  You need
> > to work out how to handle such calls safely and implement that in the
> > ultravisor.
> 
> Actually we do handle this gracefully in the ultravisor :). 
> We just return back to the hypervisor saying "sorry, don't know what
> to do with it, please handle it yourself".
> 
> However the hypervisor would not know what to do with that return, and bad
> things happen in the hypervisor.

Right.  We need something after the "sc 2" to handle the case where
the ultravisor returns with an error from the UV_RETURN.

Paul.


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Jason Gunthorpe
On Tue, Nov 12, 2019 at 02:45:51PM -0800, Dan Williams wrote:
> On Tue, Nov 12, 2019 at 2:43 PM John Hubbard  wrote:
> >
> > On 11/12/19 12:43 PM, Jason Gunthorpe wrote:
> > ...
> > >> -}
> > >> +ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> > >> +page, vmas, NULL);
> > >> +/*
> > >> + * The lifetime of a vaddr_get_pfn() page pin is
> > >> + * userspace-controlled. In the fs-dax case this could
> > >> + * lead to indefinite stalls in filesystem operations.
> > >> + * Disallow attempts to pin fs-dax pages via this
> > >> + * interface.
> > >> + */
> > >> +if (ret > 0 && vma_is_fsdax(vmas[0])) {
> > >> +ret = -EOPNOTSUPP;
> > >> +put_page(page[0]);
> > >>  }
> > >
> > > AFAIK this chunk is redundant now as it is some hack to emulate
> > > FOLL_LONGTERM? So vmas can be deleted too.
> >
> > Let me first make sure I understand what Dan has in mind for the vma
> > checking, in the other thread...
> 
> > It's not redundant relative to upstream which does not do anything with
> FOLL_LONGTERM in the gup-slow path... but I have not looked at patches
> 1-7 to see if something there made it redundant.

Oh, the hunk John had below for get_user_pages_remote() also needs to
call __gup_longterm_locked() when FOLL_LONGTERM is specified, then
that calls check_dax_vmas() which duplicates the vma_is_fsdax() check
above.

Certainly no caller of FOLL_LONGTERM should have to do dax specific
VMA checking.

Jason


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 3:14 PM, Dan Williams wrote:
...
>>
>> Is that OK, or did you want to go further (possibly in a follow-up
>> patchset, as I'm hoping to get this one in soon)?
> 
> That looks ok. Something to maybe push down into the core in a future


Great! I'll post a cleaned up v4 (with the extraneous up_read()/down_read()
removed), then.


> cleanup, but not something that needs to be done now.
> 

Yes. I've put the FOLL_LONGTERM cleanup items on my list now, in case
they don't get done as part of something else. There's a lot more
change coming in this area.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 2:45 PM, Dan Williams wrote:
> On Tue, Nov 12, 2019 at 2:43 PM John Hubbard  wrote:
>>
>> On 11/12/19 12:43 PM, Jason Gunthorpe wrote:
>> ...
 -}
 +ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
 +page, vmas, NULL);
 +/*
 + * The lifetime of a vaddr_get_pfn() page pin is
 + * userspace-controlled. In the fs-dax case this could
 + * lead to indefinite stalls in filesystem operations.
 + * Disallow attempts to pin fs-dax pages via this
 + * interface.
 + */
 +if (ret > 0 && vma_is_fsdax(vmas[0])) {
 +ret = -EOPNOTSUPP;
 +put_page(page[0]);
  }
>>>
>>> AFAIK this chunk is redundant now as it is some hack to emulate
>>> FOLL_LONGTERM? So vmas can be deleted too.
>>
>> Let me first make sure I understand what Dan has in mind for the vma
>> checking, in the other thread...
> 
> It's not redundant relative to upstream which does not do anything the
> FOLL_LONGTERM in the gup-slow path... but I have not looked at patches
> 1-7 to see if something there made it redundant.
> 

There is nothing in patches 1-7 that would make it redundant. 

About the only thing that you might find interesting in that subset is
patch 4 ("mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages"),
for devmap and ZONE_DEVICE interest. But it doesn't affect this
discussion directly.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Dan Williams
On Tue, Nov 12, 2019 at 3:08 PM John Hubbard  wrote:
>
> On 11/12/19 2:43 PM, Dan Williams wrote:
> ...
> > Ah, sorry. This was the first time I had looked at this series and
> > jumped in without reading the background.
> >
> > Your patch as is looks ok, I assume you've removed the FOLL_LONGTERM
> > warning in get_user_pages_remote in another patch?
> >
>
> Actually, I haven't gone quite that far. This patch is the last
> change to that function. Therefore, at the end of this patchset,
> get_user_pages_remote() ends up with this check in it which
> is a less-restrictive version of the warning:
>
> /*
>  * Current FOLL_LONGTERM behavior is incompatible with
>  * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
>  * vmas. However, this only comes up if locked is set, and there are
>  * callers that do request FOLL_LONGTERM, but do not set locked. So,
>  * allow what we can.
>  */
> if (gup_flags & FOLL_LONGTERM) {
> if (WARN_ON_ONCE(locked))
> return -EINVAL;
> }
>
> Is that OK, or did you want to go further (possibly in a follow-up
> patchset, as I'm hoping to get this one in soon)?

That looks ok. Something to maybe push down into the core in a future
cleanup, but not something that needs to be done now.

> ...
> >>> I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) &&
> >>> vma_is_fsdax()) check and that would also remove the need for
> >>> __gup_longterm_locked.
> >>>
> >>
> >> Good idea, but there is still the call to check_and_migrate_cma_pages(),
> >> inside __gup_longterm_locked().  So it's a little more involved and
> >> we can't trivially delete __gup_longterm_locked() yet, right?
> >
> > [ add Aneesh ]
> >
> > Yes, you're right. I had overlooked that had snuck in there. That to
> > me similarly needs to be pushed down into the core with its own FOLL
> > flag, or it needs to be an explicit fixup that each caller does after
> > get_user_pages. The fact that migration silently happens as a side
> > effect of gup is too magical for my taste.
> >
>
> Yes. It's an intrusive side effect that is surprising, and not in a
> "happy surprise" way. :) .   Fixing up the CMA pages by splitting that
> functionality into separate function calls sounds like an improvement
> worth exploring.

Right, future work.


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 2:43 PM, Dan Williams wrote:
... 
> Ah, sorry. This was the first time I had looked at this series and
> jumped in without reading the background.
> 
> Your patch as is looks ok, I assume you've removed the FOLL_LONGTERM
> warning in get_user_pages_remote in another patch?
> 

Actually, I haven't gone quite that far. This patch is the last
change to that function. Therefore, at the end of this patchset, 
get_user_pages_remote() ends up with this check in it which
is a less-restrictive version of the warning:

/*
 * Current FOLL_LONGTERM behavior is incompatible with
 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
 * vmas. However, this only comes up if locked is set, and there are
 * callers that do request FOLL_LONGTERM, but do not set locked. So,
 * allow what we can.
 */
if (gup_flags & FOLL_LONGTERM) {
if (WARN_ON_ONCE(locked))
return -EINVAL;
}

Is that OK, or did you want to go further (possibly in a follow-up
patchset, as I'm hoping to get this one in soon)?  

...
>>> I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) &&
>>> vma_is_fsdax()) check and that would also remove the need for
>>> __gup_longterm_locked.
>>>
>>
>> Good idea, but there is still the call to check_and_migrate_cma_pages(),
>> inside __gup_longterm_locked().  So it's a little more involved and
>> we can't trivially delete __gup_longterm_locked() yet, right?
> 
> [ add Aneesh ]
> 
> Yes, you're right. I had overlooked that had snuck in there. That to
> me similarly needs to be pushed down into the core with its own FOLL
> flag, or it needs to be an explicit fixup that each caller does after
> get_user_pages. The fact that migration silently happens as a side
> effect of gup is too magical for my taste.
> 

Yes. It's an intrusive side effect that is surprising, and not in a 
"happy surprise" way. :) .   Fixing up the CMA pages by splitting that
functionality into separate function calls sounds like an improvement
worth exploring.


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v15 0/9] open: introduce openat2(2) syscall

2019-11-12 Thread Christian Brauner
On Tue, Nov 12, 2019 at 03:01:26PM -0800, Kees Cook wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai  wrote:
> > > This patchset is being developed here:
> > >   
> > > 
> > > Patch changelog:
> > >  v15:
> > >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus 
> > > Torvalds]
> > >   * Split out patches for each individual LOOKUP flag.
> > >   * Reword commit messages to give more background information about the
> > > series, as well as mention the semantics of each flag in more detail.
> > > [...]
> > 
> > Ping -- this patch hasn't been touched for a week. Thanks.
> 
> If I've been following correctly, everyone is happy with this series.
> (i.e. Linus's comment appear to have been addressed.)
> 
> Perhaps the next question is should this go via a pull request by you to
> Linus directly during the v5.5 merge window, via akpm, via
> Christian, or some other path? Besides Linus, it's not been clear who
> should "claim" this series. :)

I like this series and the same with the copy_struct_from_user() part of
it I've taken I'm happy to stuff this into a dedicated branch, merge it
into my for-next and send it for v5.5.
Though I'd _much_ rather see Al pick this up or have him give his
blessing first.

Christian


Re: [PATCH v15 0/9] open: introduce openat2(2) syscall

2019-11-12 Thread Kees Cook
On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> On 2019-11-05, Aleksa Sarai  wrote:
> > This patchset is being developed here:
> >   
> > 
> > Patch changelog:
> >  v15:
> >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus 
> > Torvalds]
> >   * Split out patches for each individual LOOKUP flag.
> >   * Reword commit messages to give more background information about the
> > series, as well as mention the semantics of each flag in more detail.
> > [...]
> 
> Ping -- this patch hasn't been touched for a week. Thanks.

If I've been following correctly, everyone is happy with this series.
(i.e. Linus's comments appear to have been addressed.)

Perhaps the next question is should this go via a pull request by you to
Linus directly during the v5.5 merge window, via akpm, via
Christian, or some other path? Besides Linus, it's not been clear who
should "claim" this series. :)

-- 
Kees Cook


Re: Bug 205201 - overflow of DMA mask and bus mask

2019-11-12 Thread Christian Zigotzky

On 12 November 2019 at 3:41 pm, Christoph Hellwig wrote:
> On Mon, Nov 11, 2019 at 01:22:55PM +0100, Christian Zigotzky wrote:
>> Now, I can definitely say that this patch does not solve the issue.
>>
>> Do you have another patch for testing or shall I bisect?
>
> I'm still interested in the .config and dmesg.  Especially if the
> board is using arch/powerpc/sysdev/fsl_pci.c, which is the only
> place inside the powerpc arch code doing funny things with the
> bus_dma_mask, which Robin pointed out looks fishy.


Here you are:

.config: https://bugzilla.kernel.org/attachment.cgi?id=285815

dmesg: https://bugzilla.kernel.org/attachment.cgi?id=285813

Thanks


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Dan Williams
On Tue, Nov 12, 2019 at 2:43 PM John Hubbard  wrote:
>
> On 11/12/19 12:43 PM, Jason Gunthorpe wrote:
> ...
> >> -	}
> >> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> >> +				    page, vmas, NULL);
> >> +	/*
> >> +	 * The lifetime of a vaddr_get_pfn() page pin is
> >> +	 * userspace-controlled. In the fs-dax case this could
> >> +	 * lead to indefinite stalls in filesystem operations.
> >> +	 * Disallow attempts to pin fs-dax pages via this
> >> +	 * interface.
> >> +	 */
> >> +	if (ret > 0 && vma_is_fsdax(vmas[0])) {
> >> +		ret = -EOPNOTSUPP;
> >> +		put_page(page[0]);
> >>  	}
> >
> > AFAIK this chunk is redundant now as it is some hack to emulate
> > FOLL_LONGTERM? So vmas can be deleted too.
>
> Let me first make sure I understand what Dan has in mind for the vma
> checking, in the other thread...

It's not redundant relative to upstream, which does not do anything with
FOLL_LONGTERM in the gup slow path... but I have not looked at patches
1-7 to see if something there made it redundant.
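
For reference, that upstream handling is the vma scan in
__gup_longterm_locked(), whose shape is roughly this (abridged from
mm/gup.c; not exact code):

	/* gup first, then scan the returned vmas and undo on fs-dax. */
	rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages,
				     vmas, NULL, gup_flags);
	if (rc > 0 && check_dax_vmas(vmas, rc)) {
		for (i = 0; i < rc; i++)
			put_page(pages[i]);
		rc = -EOPNOTSUPP;
	}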


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Dan Williams
On Tue, Nov 12, 2019 at 2:24 PM John Hubbard  wrote:
>
> On 11/12/19 1:57 PM, Dan Williams wrote:
> ...
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >> b/drivers/vfio/vfio_iommu_type1.c
> >> index d864277ea16f..017689b7c32b 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -348,24 +348,20 @@ static int vaddr_get_pfn(struct mm_struct *mm, 
> >> unsigned long vaddr,
> >> flags |= FOLL_WRITE;
> >>
> >> down_read(&mm->mmap_sem);
> >> -   if (mm == current->mm) {
> >> -   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> >> -vmas);
> >> -   } else {
> >> -   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, 
> >> page,
> >> -   vmas, NULL);
> >> -   /*
> >> -* The lifetime of a vaddr_get_pfn() page pin is
> >> -* userspace-controlled. In the fs-dax case this could
> >> -* lead to indefinite stalls in filesystem operations.
> >> -* Disallow attempts to pin fs-dax pages via this
> >> -* interface.
> >> -*/
> >> -   if (ret > 0 && vma_is_fsdax(vmas[0])) {
> >> -   ret = -EOPNOTSUPP;
> >> -   put_page(page[0]);
> >> -   }
> >> +   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | 
> >> FOLL_LONGTERM,
> >> +   page, vmas, NULL);
> >
> > Hmm, what's the point of passing FOLL_LONGTERM to
> > get_user_pages_remote() if get_user_pages_remote() is not going to
> > check the vma? I think we got to this code state because the
>
> FOLL_LONGTERM is short-lived in this location, because patch 23
> ("mm/gup: remove support for gup(FOLL_LONGTERM)") removes it, after
> callers are changed over to pin_longterm_pages*().
>
> So FOLL_LONGTERM is not doing much now, but it is basically a marker for
> "change gup(FOLL_LONGTERM) to pin_longterm_pages()", and patch 18
> actually makes that change.
>
> And then pin_longterm_pages*() is, in turn, a way to mark all the
> places that need file system and/or user space interactions (layout
> leases, etc), as per "Case 2: RDMA" in the new
> Documentation/vm/pin_user_pages.rst.

Ah, sorry. This was the first time I had looked at this series and
jumped in without reading the background.

Your patch as is looks ok, I assume you've removed the FOLL_LONGTERM
warning in get_user_pages_remote in another patch?

>
> > get_user_pages() vs get_user_pages_remote() split predated the
> > introduction of FOLL_LONGTERM.
>
> Yes. And I do want to clean this up as I go, so we don't end up with
> stale concepts lingering in gup.c...
>
> >
> > I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) &&
> > vma_is_fsdax()) check and that would also remove the need for
> > __gup_longterm_locked.
> >
>
> Good idea, but there is still the call to check_and_migrate_cma_pages(),
> inside __gup_longterm_locked().  So it's a little more involved and
> we can't trivially delete __gup_longterm_locked() yet, right?

[ add Aneesh ]

Yes, you're right. I had overlooked that it had snuck in there. That to
me similarly needs to be pushed down into the core with its own FOLL
flag, or it needs to be an explicit fixup that each caller does after
get_user_pages. The fact that migration silently happens as a side
effect of gup is too magical for my taste.


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 12:43 PM, Jason Gunthorpe wrote:
...
>> -	}
>> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
>> +				    page, vmas, NULL);
>> +	/*
>> +	 * The lifetime of a vaddr_get_pfn() page pin is
>> +	 * userspace-controlled. In the fs-dax case this could
>> +	 * lead to indefinite stalls in filesystem operations.
>> +	 * Disallow attempts to pin fs-dax pages via this
>> +	 * interface.
>> +	 */
>> +	if (ret > 0 && vma_is_fsdax(vmas[0])) {
>> +		ret = -EOPNOTSUPP;
>> +		put_page(page[0]);
>>  	}
> 
> AFAIK this chunk is redundant now as it is some hack to emulate
> FOLL_LONGTERM? So vmas can be deleted too.

Let me first make sure I understand what Dan has in mind for the vma
checking, in the other thread...

> 
> Also unclear why this function has this:
> 
> 	up_read(&mm->mmap_sem);
> 
> 	if (ret == 1) {
> 		*pfn = page_to_pfn(page[0]);
> 		return 0;
> 	}
> 
> 	down_read(&mm->mmap_sem);
> 

Yes, that's really odd. It's generally not good to release and retake
the lock (without re-checking things) anyway, and it is certainly not
required to release mmap_sem in order to call page_to_pfn().

I've removed that up_read()/down_read() pair, for v4.
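
For reference, the resulting shape is roughly this -- a sketch of the
v4 direction, not the final diff:

	down_read(&mm->mmap_sem);
	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
				    flags | FOLL_LONGTERM, page, NULL, NULL);
	if (ret == 1) {
		/* page_to_pfn() needs no locking at all */
		*pfn = page_to_pfn(page[0]);
		ret = 0;
	} else {
		/* fall back to the vma/PFNMAP lookup, still under mmap_sem */
	}
	up_read(&mm->mmap_sem);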


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 1:57 PM, Dan Williams wrote:
...
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index d864277ea16f..017689b7c32b 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -348,24 +348,20 @@ static int vaddr_get_pfn(struct mm_struct *mm, 
>> unsigned long vaddr,
>> flags |= FOLL_WRITE;
>>
>> down_read(&mm->mmap_sem);
>> -   if (mm == current->mm) {
>> -   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
>> -vmas);
>> -   } else {
>> -   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
>> -   vmas, NULL);
>> -   /*
>> -* The lifetime of a vaddr_get_pfn() page pin is
>> -* userspace-controlled. In the fs-dax case this could
>> -* lead to indefinite stalls in filesystem operations.
>> -* Disallow attempts to pin fs-dax pages via this
>> -* interface.
>> -*/
>> -   if (ret > 0 && vma_is_fsdax(vmas[0])) {
>> -   ret = -EOPNOTSUPP;
>> -   put_page(page[0]);
>> -   }
>> +   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | 
>> FOLL_LONGTERM,
>> +   page, vmas, NULL);
> 
> Hmm, what's the point of passing FOLL_LONGTERM to
> get_user_pages_remote() if get_user_pages_remote() is not going to
> check the vma? I think we got to this code state because the

FOLL_LONGTERM is short-lived in this location, because patch 23 
("mm/gup: remove support for gup(FOLL_LONGTERM)") removes it, after
callers are changed over to pin_longterm_pages*().

So FOLL_LONGTERM is not doing much now, but it is basically a marker for
"change gup(FOLL_LONGTERM) to pin_longterm_pages()", and patch 18
actually makes that change.

And then pin_longterm_pages*() is, in turn, a way to mark all the 
places that need file system and/or user space interactions (layout
leases, etc), as per "Case 2: RDMA" in the new 
Documentation/vm/pin_user_pages.rst.

> get_user_pages() vs get_user_pages_remote() split predated the
> introduction of FOLL_LONGTERM.

Yes. And I do want to clean this up as I go, so we don't end up with
stale concepts lingering in gup.c...

> 
> I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) &&
> vma_is_fsdax()) check and that would also remove the need for
> __gup_longterm_locked.
> 

Good idea, but there is still the call to check_and_migrate_cma_pages()
inside __gup_longterm_locked(). So it's a little more involved, and
we can't trivially delete __gup_longterm_locked() yet, right?


thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 08/23] vfio, mm: fix get_user_pages_remote() and FOLL_LONGTERM

2019-11-12 Thread Dan Williams
On Mon, Nov 11, 2019 at 4:07 PM John Hubbard  wrote:
>
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
>
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
>
> Update the code and comments accordingly, and update the VFIO caller
> to take advantage of this, fixing a bug as a result: the VFIO caller
> is logically a FOLL_LONGTERM user.
>
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this.
>
> Suggested-by: Jason Gunthorpe 
> Cc: Jerome Glisse 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 30 +-
>  mm/gup.c| 13 -
>  2 files changed, 21 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index d864277ea16f..017689b7c32b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -348,24 +348,20 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned 
> long vaddr,
> flags |= FOLL_WRITE;
>
> down_read(&mm->mmap_sem);
> -   if (mm == current->mm) {
> -   ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> -vmas);
> -   } else {
> -   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> -   vmas, NULL);
> -   /*
> -* The lifetime of a vaddr_get_pfn() page pin is
> -* userspace-controlled. In the fs-dax case this could
> -* lead to indefinite stalls in filesystem operations.
> -* Disallow attempts to pin fs-dax pages via this
> -* interface.
> -*/
> -   if (ret > 0 && vma_is_fsdax(vmas[0])) {
> -   ret = -EOPNOTSUPP;
> -   put_page(page[0]);
> -   }
> +   ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
> +   page, vmas, NULL);

Hmm, what's the point of passing FOLL_LONGTERM to
get_user_pages_remote() if get_user_pages_remote() is not going to
check the vma? I think we got to this code state because the
get_user_pages() vs get_user_pages_remote() split predated the
introduction of FOLL_LONGTERM.

I think check_vma_flags() should do the ((FOLL_LONGTERM | FOLL_GET) &&
vma_is_fsdax()) check and that would also remove the need for
__gup_longterm_locked.
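
Concretely, a sketch of what that consolidated check could look like
(untested and hypothetical, with the condition reduced to the
FOLL_LONGTERM case; not from this series):

	/*
	 * Inside check_vma_flags() in mm/gup.c: reject long-term pins of
	 * fs-dax vmas up front, so the special-cased vma scan in
	 * __gup_longterm_locked() can go away. vma_is_fsdax() is the
	 * existing helper from <linux/fs.h>.
	 */
	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
		return -EOPNOTSUPP;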


Re: [PATCH v3 11/23] IB/{core,hw,umem}: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()

2019-11-12 Thread John Hubbard


On 11/12/19 12:44 PM, Jason Gunthorpe wrote:
> On Mon, Nov 11, 2019 at 04:06:48PM -0800, John Hubbard wrote:
>> @@ -542,7 +541,7 @@ static int ib_umem_odp_map_dma_single_page(
>>  }
>>  
>>  out:
>> -put_user_page(page);
>> +put_page(page);
>>  
>>  if (remove_existing_mapping) {
>>  ib_umem_notifier_start_account(umem_odp);
>> @@ -639,13 +638,14 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp 
>> *umem_odp, u64 user_virt,
>>  /*
>>   * Note: this might result in redundent page getting. We can
>>   * avoid this by checking dma_list to be 0 before calling
>> - * get_user_pages. However, this make the code much more
>> - * complex (and doesn't gain us much performance in most use
>> - * cases).
>> + * get_user_pages. However, this makes the code much
>> + * more complex (and doesn't gain us much performance in most
>> + * use cases).
>>   */
>>  npages = get_user_pages_remote(owning_process, owning_mm,
>> -user_virt, gup_num_pages,
>> -flags, local_page_list, NULL, NULL);
>> +   user_virt, gup_num_pages,
>> +   flags, local_page_list, NULL,
>> +   NULL);
>> up_read(&owning_mm->mmap_sem);
> 
> This is just whitespace churn? Drop it..
> 


Whoops, yes. It got there because of going through the pin*() conversion
and then a revert, and now it's just whitespace. I'll drop it, thanks for
catching that.


thanks,

John Hubbard
NVIDIA


Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-12 Thread John Hubbard
On 11/12/19 12:38 PM, Jason Gunthorpe wrote:
> On Mon, Nov 11, 2019 at 04:06:37PM -0800, John Hubbard wrote:
>> Hi,
>>
>> The cover letter is long, so the more important stuff is first:
>>
>> * Jason, if you or someone could look at the the VFIO cleanup (patch 8)
>>   and conversion to FOLL_PIN (patch 18), to make sure it's use of
>>   remote and longterm gup matches what we discussed during the review
>>   of v2, I'd appreciate it.
>>
>> * Also for Jason and IB: as noted below, in patch 11, I am (too?) boldly
>>   converting from put_user_pages() to release_pages().
> 
> Why are we doing this? I think things got confused here someplace, as


Because:

a) These need put_page() calls, and

b) there is no put_pages() call, but there is a release_pages() call that
is, arguably, what put_pages() would be.
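
To make the equivalence concrete (illustrative only -- the two forms
are alternatives, not to be combined):

	/* Either a plain loop over put_page() ... */
	for (i = 0; i < npages; i++)
		put_page(pages[i]);

	/* ... or, equivalently, the batched helper from mm/swap.c: */
	release_pages(pages, npages);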


> the comment still says:
> 
> /**
>  * put_user_page() - release a gup-pinned page
>  * @page:pointer to page to be released
>  *
>  * Pages that were pinned via get_user_pages*() must be released via
>  * either put_user_page(), or one of the put_user_pages*() routines
>  * below.


Ohhh, I missed those comments. They all need to be changed over to
say "pages that were pinned via pin_user_pages*() or 
pin_longterm_pages*() must be released via put_user_page*()."

The get_user_pages*() pages must still be released via put_page.

The churn is due to a fairly significant change in strategy, which
is: instead of changing all get_user_pages*() sites to call
put_user_page(), change selected sites to call pin_user_pages*() or 
pin_longterm_pages*(), plus put_user_page().

That allows incrementally converting the kernel over to using the
new pin APIs, without taking on the huge risk of a big one-shot
conversion. 
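
The per-call-site pattern looks roughly like this (hypothetical call
site; the function names are from this series, details abbreviated):

	/* Before: gup with FOLL_LONGTERM, released via put_page(): */
	ret = get_user_pages(start, nr_pages, gup_flags | FOLL_LONGTERM,
			     pages, NULL);
	/* ... use the pages ... */
	for (i = 0; i < ret; i++)
		put_page(pages[i]);

	/* After: the new wrapper, released via put_user_pages(): */
	ret = pin_longterm_pages(start, nr_pages, gup_flags, pages, NULL);
	/* ... use the pages ... */
	put_user_pages(pages, ret);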

So, I've ended up with one place that actually needs to get reverted
back to get_user_pages(), and that's the IB ODP code.

> 
> I feel like if put_user_pages() is not the correct way to undo
> get_user_pages() then it needs to be deleted.
> 

Yes, you're right. I'll fix the put_user_page() comments as described.


thanks,

John Hubbard
NVIDIA


Re: [PATCH v2 11/29] vmlinux.lds.h: Replace RODATA with RO_DATA

2019-11-12 Thread Kees Cook
On Mon, Nov 11, 2019 at 07:08:51PM +0100, Geert Uytterhoeven wrote:
> Hi Kees,
> 
> On Mon, Nov 11, 2019 at 6:23 PM Kees Cook  wrote:
> > On Mon, Nov 11, 2019 at 05:58:06PM +0100, Geert Uytterhoeven wrote:
> > > On Fri, Oct 11, 2019 at 2:07 AM Kees Cook  wrote:
> > > > There's no reason to keep the RODATA macro: replace the callers with
> > > > the expected RO_DATA macro.
> > > >
> > > > Signed-off-by: Kees Cook 
> > > > ---
> > > >  arch/alpha/kernel/vmlinux.lds.S  | 2 +-
> > > >  arch/ia64/kernel/vmlinux.lds.S   | 2 +-
> > > >  arch/microblaze/kernel/vmlinux.lds.S | 2 +-
> > > >  arch/mips/kernel/vmlinux.lds.S   | 2 +-
> > > >  arch/um/include/asm/common.lds.S | 2 +-
> > > >  arch/xtensa/kernel/vmlinux.lds.S | 2 +-
> > > >  include/asm-generic/vmlinux.lds.h| 4 +---
> > > >  7 files changed, 7 insertions(+), 9 deletions(-)
> > >
> > > Somehow you missed:
> > >
> > > arch/m68k/kernel/vmlinux-std.lds:  RODATA
> > > arch/m68k/kernel/vmlinux-sun3.lds:  RODATA
> >
> > Argh. I've sent a patch; sorry and thanks for catching this. For my own
> > cross-build testing, which defconfig targets will hit these two linker
> > scripts?
> 
> vmlinux-sun3.lds: sun3_defconfig
> vmlinux-std.lds: All other classic 680x0 targets with an MMU, e.g. plain
>  defconfig aka multi_defconfig.

Excellent, thank you; I've updated my multi-arch build list. :)
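
For the archives, the missed fixup is presumably the same one-liner as
in the other linker scripts (a sketch, assuming RODATA's old definition
of RO_DATA(4096); likewise for vmlinux-sun3.lds):

--- a/arch/m68k/kernel/vmlinux-std.lds
+++ b/arch/m68k/kernel/vmlinux-std.lds
-  RODATA
+  RO_DATA(4096)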

-- 
Kees Cook


Re: [PATCH v4 00/47] QUICC Engine support on ARM and ARM64

2019-11-12 Thread Li Yang
On Mon, Nov 11, 2019 at 5:39 PM Li Yang  wrote:
>
> On Fri, Nov 8, 2019 at 7:05 AM Rasmus Villemoes
>  wrote:
> >
>
> I'm generally ok with these enhancements and cleanups.  But as the
> whole patch series touched multiple subsystems, I would like to
> collect the Acked-by from Scott, Greg and David if we want the whole
> series to go through the fsl/soc tree.

Rasmus,

Since the patches also touch the net and serial subsystems, can you also
repost these patches (maybe just the related ones) to the netdev and
linux-serial mailing lists?

Regards,
Leo
>
> Also Qiang, can you help to test the latest version and provide your
> Tested-by?  Thanks.
>
> > There have been several attempts in the past few years to allow
> > building the QUICC engine drivers for platforms other than PPC. This
> > is yet another attempt.
> >
> > v3 can be found here: 
> > https://lore.kernel.org/lkml/20191101124210.14510-1-li...@rasmusvillemoes.dk/
> >
> > v4 adds some patches to fix (ab)use of IS_ERR_VALUE which fails when
> > sizeof(u32) != sizeof(long), i.e. on 64-bit platforms. Freescale
> > drivers are some of the last holdouts using that macro (outside of
> > arch/ and core mm code), so I decided to try to simply get rid of it
> > instead of papering over it by using a temporary long to store the
> > result in. Doing that I stumbled on some other things that should be
> > fixed. These are the new patches 34-45.
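
For context, the IS_ERR_VALUE() pitfall being fixed here looks like
this (an illustrative example, not code from the series):

	u32 offset = (u32)-ENOMEM;	/* a 32-bit "error" cookie */

	/*
	 * IS_ERR_VALUE() compares against (unsigned long)-MAX_ERRNO. On
	 * 64-bit, the u32 zero-extends to 0x00000000fffffff4, far below
	 * that threshold, so the check silently never fires:
	 */
	if (IS_ERR_VALUE(offset))	/* never true when sizeof(long) == 8 */
		return -ENOMEM;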
> >
> > Patch 35 from v3 (which added a PPC32 dependency to FSL_UCC_HDLC) is
> > gone from this version, so that that driver can indeed now be built
> > for arm and arm64.
> >
> > 1-5 are about replacing in_be32 etc. in the core QE code
> > (drivers/soc/fsl/qe); a sketch of the resulting accessors follows
> > this list.
> >
> > 6-8 handle miscellaneous other ppcisms.
> >
> > 9-21 deal with qe_ic: Simplifying the driver significantly by removing
> > unused code, and removing the platform-specific initialization from
> > arch/powerpc/.
> >
> > 22-25 deal with raw access to devicetree properties in native endianness.
> >
> > 26-33 makes drivers/tty/serial/ucc_uart.c (CONFIG_SERIAL_QE) ready to build 
> > on arm.
> >
> > 34-45 deal with IS_ERR_VALUE() and some other things found while
> > digging around that part of the code.
> >
> > 46 adds a PPC32 dependency to UCC_GETH - it has some of the same
> > issues that have been fixed in the ucc_uart and ucc_hdlc cases. Nobody
> > has requested that I allow that driver to be built for arm{,64}, so
> > instead of growing this series even bigger, I kept that addition. It's
> > trivial to remove if somebody cares enough to fix the build
> > errors/warnings and actually has a platform to test the result on.
> >
> > Finally patch 47 lifts the PPC32 restriction from QUICC_ENGINE. At the
> > request of Li Yang, it doesn't remove the PPC32 dependency but instead
> > changes it to PPC32 || ARM || ARM64 (or COMPILE_TEST), i.e. listing
> > the platforms that may have a QE.
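
As a concrete example of the accessor wrappers from items 1-5 above (a
sketch based on the patch titles; the exact bodies are my guess):

	/*
	 * Thin wrappers so the QE core can drop the ppc-only
	 * in_be32()/out_be32() accessors in favor of portable ones.
	 */
	static inline u32 qe_ioread32be(const u32 __iomem *addr)
	{
		return ioread32be(addr);
	}

	static inline void qe_iowrite32be(u32 val, u32 __iomem *addr)
	{
		iowrite32be(val, addr);
	}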
> >
> > The series has been built and booted on both an mpc8309-based platform
> > (ppc) as well as an ls1021a-based platform (arm). The core QE code is
> > exercised on both, while I could only test the ucc_uart on arm, since
> > the uarts are not wired up on our mpc8309 board. Qiang Zhao reports
> > that the ucc_hdlc driver does indeed work on a ls1043ardb (arm64)
> > board, I hope he'll formally add a Tested-by: to the relevant patches
> > since I don't have any arm64 board with QE.
> >
> > Rasmus Villemoes (47):
> >   soc: fsl: qe: remove space-before-tab
> >   soc: fsl: qe: drop volatile qualifier of struct qe_ic::regs
> >   soc: fsl: qe: rename qe_(clr/set/clrset)bit* helpers
> >   soc: fsl: qe: introduce qe_io{read,write}* wrappers
> >   soc: fsl: qe: avoid ppc-specific io accessors
> >   soc: fsl: qe: replace spin_event_timeout by readx_poll_timeout_atomic
> >   soc: fsl: qe: qe.c: guard use of pvr_version_is() with CONFIG_PPC32
> >   soc: fsl: qe: drop unneeded #includes
> >   soc: fsl: qe: drop assign-only high_active in qe_ic_init
> >   soc: fsl: qe: remove pointless sysfs registration in qe_ic.c
> >   soc: fsl: qe: use qe_ic_cascade_{low,high}_mpic also on 83xx
> >   soc: fsl: qe: move calls of qe_ic_init out of arch/powerpc/
> >   powerpc/83xx: remove mpc83xx_ipic_and_qe_init_IRQ
> >   powerpc/85xx: remove mostly pointless mpc85xx_qe_init()
>
> Scott,
> What do you think about the PPC changes?
>
> >   soc: fsl: qe: move qe_ic_cascade_* functions to qe_ic.c
> >   soc: fsl: qe: rename qe_ic_cascade_low_mpic -> qe_ic_cascade_low
> >   soc: fsl: qe: remove unused qe_ic_set_* functions
> >   soc: fsl: qe: don't use NO_IRQ in qe_ic.c
> >   soc: fsl: qe: make qe_ic_get_{low,high}_irq static
> >   soc: fsl: qe: simplify qe_ic_init()
> >   soc: fsl: qe: merge qe_ic.h headers into qe_ic.c
> >   soc: fsl: qe: qe.c: use of_property_read_* helpers
> >   soc: fsl: qe: qe_io.c: don't open-code of_parse_phandle()
> >   soc: fsl: qe: qe_io.c: access device tree property using be32_to_cpu
> >   soc: fsl: qe: qe_io.c: use of_property_read_u32() in par_io_init()
> >   soc: fsl: move cpm.h from 
