Re: Onboard SD card doesn't work anymore after the 'mmc-v5.4-2' updates

2019-10-22 Thread Michael Ellerman
Russell King - ARM Linux admin  writes:
> On Tue, Oct 15, 2019 at 03:12:49PM +0200, Christian Zigotzky wrote:
>> Hello Russell,
>> 
>> You asked me about "dma-coherent" in the Cyrus device tree. Unfortunately I
>> don't find the property "dma-coherent" in the dtb source files.
>> 
>> Output of "fdtdump cyrus_p5020_eth_poweroff.dtb | grep dma":
>> 
>> dma0 = "/soc@ffe00/dma@100300";
>> dma1 = "/soc@ffe00/dma@101300";
>> dma@100300 {
>>     compatible = "fsl,eloplus-dma";
>>     dma-channel@0 {
>>         compatible = "fsl,eloplus-dma-channel";
>>     dma-channel@80 {
>>         compatible = "fsl,eloplus-dma-channel";
>>     dma-channel@100 {
>>         compatible = "fsl,eloplus-dma-channel";
>>     dma-channel@180 {
>>         compatible = "fsl,eloplus-dma-channel";
>> dma@101300 {
>>     compatible = "fsl,eloplus-dma";
>>     dma-channel@0 {
>>         compatible = "fsl,eloplus-dma-channel";
>>     dma-channel@80 {
>>         compatible = "fsl,eloplus-dma-channel";
>>     dma-channel@100 {
>>         compatible = "fsl,eloplus-dma-channel";
>>     dma-channel@180 {
>>         compatible = "fsl,eloplus-dma-channel";
>
> Hmm, so it looks like PowerPC doesn't mark devices that are dma
> coherent with a property that describes them as such.
>
> I think this opens a wider question - what should of_dma_is_coherent()
> return for PowerPC?  It seems right now that it returns false for
> devices that are DMA coherent, which seems to me to be a recipe for
> future mistakes.

Right, it seems of_dma_is_coherent() has baked in the assumption that
devices are non-coherent unless explicitly marked as coherent.

Which is wrong on all or at least most existing powerpc systems
according to Ben.

> Any ideas from the PPC maintainers?

Fixing it at the source seems like the best option to prevent future
breakage.

So I guess that would mean making of_dma_is_coherent() return true/false
based on CONFIG_NOT_COHERENT_CACHE on powerpc.

We could do it like below, which would still allow the dma-coherent
property to work if it ever makes sense on a future powerpc platform.

I don't really know any of this embedded stuff well, so happy to take
other suggestions on how to handle this mess.

cheers


diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 25aaa3903000..b96c9010acb6 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -760,6 +760,22 @@ static int __init check_cache_coherency(void)
 late_initcall(check_cache_coherency);
 #endif /* CONFIG_CHECK_CACHE_COHERENCY */
 
+#ifndef CONFIG_NOT_COHERENT_CACHE
+/*
+ * For historical reasons powerpc kernels are built with hard wired knowledge of
+ * whether or not DMA accesses are cache coherent. Additionally device trees on
+ * powerpc do not typically support the dma-coherent property.
+ *
+ * So when we know that DMA is coherent, override arch_of_dma_is_coherent() to
+ * tell the drivers/of code that all devices are coherent regardless of whether
+ * they have a dma-coherent property.
+ */
+bool arch_of_dma_is_coherent(struct device_node *np)
+{
+   return true;
+}
+#endif
+
 #ifdef CONFIG_DEBUG_FS
 struct dentry *powerpc_debugfs_root;
 EXPORT_SYMBOL(powerpc_debugfs_root);
diff --git a/drivers/of/address.c b/drivers/of/address.c
index 978427a9d5e6..3a4b2949a322 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -993,6 +993,14 @@ int of_dma_get_range(struct device_node *np, u64 *dma_addr, u64 *paddr, u64 *siz
 }
 EXPORT_SYMBOL_GPL(of_dma_get_range);
 
+/*
+ * arch_of_dma_is_coherent - Arch hook to determine if device is coherent for DMA
+ */
+bool __weak arch_of_dma_is_coherent(struct device_node *np)
+{
+   return false;
+}
+
 /**
  * of_dma_is_coherent - Check if device is coherent
  * @np: device node
@@ -1002,8 +1010,12 @@ EXPORT_SYMBOL_GPL(of_dma_get_range);
  */
 bool of_dma_is_coherent(struct device_node *np)
 {
-   struct device_node *node = of_node_get(np);
+   struct device_node *node;
+
+   if (arch_of_dma_is_coherent(np))
+   return true;
 
+   node = of_node_get(np);
while (node) {
if (of_property_read_bool(node, "dma-coherent")) {
of_node_put(node);


Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-10-22 Thread Bharata B Rao
On Wed, Oct 23, 2019 at 03:17:54PM +1100, Paul Mackerras wrote:
> On Tue, Oct 22, 2019 at 11:59:35AM +0530, Bharata B Rao wrote:
> The mapping of pages in userspace memory, and the mapping of userspace
> memory to guest physical space, are two distinct things.  The memslots
> describe the mapping of userspace addresses to guest physical
> addresses, but don't say anything about what is mapped at those
> userspace addresses.  So you can indeed get a page fault on a
> userspace address at the same time that a memslot is being deleted
> (even a memslot that maps that particular userspace address), because
> removing the memslot does not unmap anything from userspace memory,
> it just breaks the association between that userspace memory and guest
> physical memory.  Deleting the memslot does unmap the pages from the
> guest but doesn't unmap them from the userspace process (e.g. QEMU).
> 
> It is an interesting question what the semantics should be when a
> memslot is deleted and there are pages of userspace currently paged
> out to the device (i.e. the ultravisor).  One approach might be to say
> that all those pages have to come back to the host before we finish
> the memslot deletion, but that is probably not necessary; I think we
> could just say that those pages are gone and can be replaced by zero
> pages if they get accessed on the host side.  If userspace then unmaps
> the corresponding region of the userspace memory map, we can then just
> forget all those pages with very little work.

There are 5 scenarios currently where we are replacing the device mappings:

1. Guest reset
2. Memslot free (Memory unplug) (Not present in this version though)
3. Converting secure page to shared page
4. HV touching the secure page
5. H_SVM_INIT_ABORT hcall to abort SVM due to errors when transitioning
   to secure mode (Not present in this version)

In the first 3 cases, we don't need to fetch the page back to the HV from
the secure side and hence can skip the page-out. However, currently we do
allocate a fresh page and replace the mapping with the new one.
 
> > However if that sounds fragile, maybe I can go back to my initial
> > design where we weren't using rmap[] to store device PFNs. That will
> > increase the memory usage but will give us an easy option to have a
> > per-guest mutex to protect concurrent page-ins/outs/faults.
> 
> That sounds like it would be the best option, even if only in the
> short term.  At least it would give us a working solution, even if
> it's not the best performing solution.

Sure, will avoid using rmap[] in the next version.

Regards,
Bharata.



Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-10-22 Thread Paul Mackerras
On Tue, Oct 22, 2019 at 11:59:35AM +0530, Bharata B Rao wrote:
> On Fri, Oct 18, 2019 at 8:31 AM Paul Mackerras  wrote:
> >
> > On Wed, Sep 25, 2019 at 10:36:43AM +0530, Bharata B Rao wrote:
> > > Manage migration of pages between normal and secure memory of secure
> > > guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> > >
> > > H_SVM_PAGE_IN: Move the content of a normal page to secure page
> > > H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> > >
> > > Private ZONE_DEVICE memory equal to the amount of secure memory
> > > available in the platform for running secure guests is created.
> > > Whenever a page belonging to the guest becomes secure, a page from
> > > this private device memory is used to represent and track that secure
> > > page on the HV side. The movement of pages between normal and secure
> > > memory is done via migrate_vma_pages() using UV_PAGE_IN and
> > > UV_PAGE_OUT ucalls.
> >
> > As we discussed privately, but mentioning it here so there is a
> > record:  I am concerned about this structure
> >
> > > +struct kvmppc_uvmem_page_pvt {
> > > + unsigned long *rmap;
> > > + struct kvm *kvm;
> > > + unsigned long gpa;
> > > +};
> >
> > which keeps a reference to the rmap.  The reference could become stale
> > if the memslot is deleted or moved, and nothing in the patch series
> > ensures that the stale references are cleaned up.
> 
> I will add code to release the device PFNs when memslot goes away. In
> fact the early versions of the patchset had this, but it subsequently
> got removed.
> 
> >
> > If it is possible to do without the long-term rmap reference, and
> > instead find the rmap via the memslots (with the srcu lock held) each
> > time we need the rmap, that would be safer, I think, provided that we
> > can sort out the lock ordering issues.
> 
> All paths except the fault handler access rmap[] under the srcu lock. Even in
> the case of the fault handler, for those faults induced by us (shared page
> handling, releasing device pfns), we do hold the srcu lock. The difficult
> case is when we fault due to the HV accessing a device page. In this case
> we come to the fault handler with mmap_sem already held and are not in a
> position to take the kvm srcu lock as that would lead to lock order
> reversal. Given that we still have pages mapped in, I assume the memslot
> can't go away while we access rmap[], so I think we should be ok here.

The mapping of pages in userspace memory, and the mapping of userspace
memory to guest physical space, are two distinct things.  The memslots
describe the mapping of userspace addresses to guest physical
addresses, but don't say anything about what is mapped at those
userspace addresses.  So you can indeed get a page fault on a
userspace address at the same time that a memslot is being deleted
(even a memslot that maps that particular userspace address), because
removing the memslot does not unmap anything from userspace memory,
it just breaks the association between that userspace memory and guest
physical memory.  Deleting the memslot does unmap the pages from the
guest but doesn't unmap them from the userspace process (e.g. QEMU).

It is an interesting question what the semantics should be when a
memslot is deleted and there are pages of userspace currently paged
out to the device (i.e. the ultravisor).  One approach might be to say
that all those pages have to come back to the host before we finish
the memslot deletion, but that is probably not necessary; I think we
could just say that those pages are gone and can be replaced by zero
pages if they get accessed on the host side.  If userspace then unmaps
the corresponding region of the userspace memory map, we can then just
forget all those pages with very little work.

> However if that sounds fragile, maybe I can go back to my initial
> design where we weren't using rmap[] to store device PFNs. That will
> increase the memory usage but will give us an easy option to have a
> per-guest mutex to protect concurrent page-ins/outs/faults.

That sounds like it would be the best option, even if only in the
short term.  At least it would give us a working solution, even if
it's not the best performing solution.

Paul.


Re: [PATCH v3 1/4] powerpc/mm: Implement set_memory() routines

2019-10-22 Thread Michael Ellerman
Russell Currey  writes:
> diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
> new file mode 100644
> index ..fe3ecbfb8e10
> --- /dev/null
> +++ b/arch/powerpc/mm/pageattr.c
> @@ -0,0 +1,60 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * MMU-generic set_memory implementation for powerpc
> + *
> + * Author: Russell Currey 

Please don't add email addresses in new files, they just risk
bit-rotting, they're in the git log anyway.

> + *
> + * Copyright 2019, IBM Corporation.
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
> +{
> + int action = *((int *)data);
> + pte_t pte_val;
> +
> + // invalidate the PTE so it's safe to modify
> + pte_val = ptep_get_and_clear(&init_mm, addr, ptep);
> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

This doesn't work if for example we're setting the text mapping we're
executing from read-only, which in principle should work.

Or if another CPU is concurrently reading from a mapping we're marking
read-only.

I /think/ that's acceptable for all the current users, but I don't know
that for sure and it's not documented anywhere AFAICS.

At the very least it needs a big comment, and to be mentioned in the
change log.


Also there's no locking here, or in apply_to_page_range() AFAICS.

And because we're doing clear/modify/write, two CPUs that race doing eg.
set_memory_ro() and set_memory_nx() will potentially result in some PTEs
being marked permanently invalid, if one CPU sees the other CPUs clear
of the PTE before the write.

Again I'm not sure any current callers do that, but it's a bit fragile.

I think we can fix the race at least by taking the init_mm
page_table_lock around the clear/modify/write.

> + // modify the PTE bits as desired, then apply
> + switch (action) {
> + case SET_MEMORY_RO:
> + pte_val = pte_wrprotect(pte_val);
> + break;
> + case SET_MEMORY_RW:
> + pte_val = pte_mkwrite(pte_val);
> + break;
> + case SET_MEMORY_NX:
> + pte_val = pte_exprotect(pte_val);
> + break;
> + case SET_MEMORY_X:
> + pte_val = pte_mkexec(pte_val);
> + break;
> + default:
> + WARN_ON(true);
> + return -EINVAL;
> + }
> +
> + set_pte_at(&init_mm, addr, ptep, pte_val);
> +
> + return 0;
> +}

cheers



RE: [PATCH 0/7] towards QE support on ARM

2019-10-22 Thread Qiang Zhao
On 22/10/2019 18:18, Rasmus Villemoes  wrote:
> -Original Message-
> From: Rasmus Villemoes 
> Sent: 2019年10月22日 18:18
> To: Qiang Zhao ; Leo Li 
> Cc: Timur Tabi ; Greg Kroah-Hartman
> ; linux-ker...@vger.kernel.org;
> linux-ser...@vger.kernel.org; Jiri Slaby ;
> linuxppc-dev@lists.ozlabs.org; linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH 0/7] towards QE support on ARM
> 
> On 22/10/2019 04.24, Qiang Zhao wrote:
> > On Mon, Oct 22, 2019 at 6:11 AM Leo Li wrote
> 
> >> Right.  I'm really interested in getting this applied to my tree and
> >> make it upstream.  Zhao Qiang, can you help to review Rasmus's
> >> patches and comment?
> >
> > As you know, I maintained a similar patchset removing PPC, and someone
> > told me qe_ic should be moved into drivers/irqchip/.
> > I also think qe_ic is an interrupt controller driver and should be moved
> > into drivers/irqchip.
> 
> Yes, and I also plan to do that at some point. However, that's orthogonal to
> making the driver build on ARM, so I don't want to mix the two. Making it
> usable on ARM is my/our priority currently.
> 
> I'd appreciate your input on my patches.

Yes, we can take this patchset first, make sure it builds and works on
ARM, and then push another patchset to move qe_ic.

Best Regards,
Qiang



[PATCH] powerpc/boot: Fix the initrd being overwritten under qemu

2019-10-22 Thread Oliver O'Halloran
When booting under OF the zImage expects the initrd address and size to be
passed to it using registers r3 and r4. SLOF (guest firmware used by QEMU)
currently doesn't do this so the zImage is not aware of the initrd
location.  This can result in initrd corruption either through the zImage
extracting the vmlinux over the initrd, or by the vmlinux overwriting the
initrd when relocating itself.

QEMU does put the linux,initrd-start and linux,initrd-end properties into
the devicetree to allow the vmlinux to find the initrd. We can work around
the SLOF bug by also looking for those properties in the zImage.

Cc: sta...@vger.kernel.org
Cc: Alexey Kardashevskiy 
Signed-off-by: Oliver O'Halloran 
---
First noticed here: 
https://unix.stackexchange.com/questions/547023/linux-kernel-on-ppc64le-vmlinux-equivalent-in-arch-powerpc-boot
---
 arch/powerpc/boot/devtree.c | 21 +
 arch/powerpc/boot/main.c|  7 +++
 arch/powerpc/boot/of.h  | 16 
 arch/powerpc/boot/ops.h |  1 +
 arch/powerpc/boot/swab.h| 17 +
 5 files changed, 46 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/boot/devtree.c b/arch/powerpc/boot/devtree.c
index 5d91036..ac5c26b 100644
--- a/arch/powerpc/boot/devtree.c
+++ b/arch/powerpc/boot/devtree.c
@@ -13,6 +13,7 @@
 #include "string.h"
 #include "stdio.h"
 #include "ops.h"
+#include "swab.h"
 
 void dt_fixup_memory(u64 start, u64 size)
 {
@@ -318,6 +319,26 @@ int dt_xlate_reg(void *node, int res, unsigned long *addr, unsigned long *size)
return dt_xlate(node, res, reglen, addr, size);
 }
 
+int dt_read_addr(void *node, const char *prop, unsigned long *out_addr)
+{
+   int reglen;
+
+   *out_addr = 0;
+
+   reglen = getprop(node, prop, prop_buf, sizeof(prop_buf)) / 4;
+   if (reglen == 2) {
+   u64 v0 = be32_to_cpu(prop_buf[0]);
+   u64 v1 = be32_to_cpu(prop_buf[1]);
+   *out_addr = (v0 << 32) | v1;
+   } else if (reglen == 1) {
+   *out_addr = be32_to_cpu(prop_buf[0]);
+   } else {
+   return 0;
+   }
+
+   return 1;
+}
+
 int dt_xlate_addr(void *node, u32 *buf, int buflen, unsigned long *xlated_addr)
 {
 
diff --git a/arch/powerpc/boot/main.c b/arch/powerpc/boot/main.c
index a9d2091..518af24 100644
--- a/arch/powerpc/boot/main.c
+++ b/arch/powerpc/boot/main.c
@@ -112,6 +112,13 @@ static struct addr_range prep_initrd(struct addr_range vmlinux, void *chosen,
} else if (initrd_size > 0) {
printf("Using loader supplied ramdisk at 0x%lx-0x%lx\n\r",
   initrd_addr, initrd_addr + initrd_size);
+   } else if (chosen) {
+   unsigned long initrd_end;
+
+   dt_read_addr(chosen, "linux,initrd-start", &initrd_addr);
+   dt_read_addr(chosen, "linux,initrd-end", &initrd_end);
+
+   initrd_size = initrd_end - initrd_addr;
}
 
/* If there's no initrd at all, we're done */
diff --git a/arch/powerpc/boot/of.h b/arch/powerpc/boot/of.h
index 31b2f5d..dc24770 100644
--- a/arch/powerpc/boot/of.h
+++ b/arch/powerpc/boot/of.h
 typedef u16	__be16;
 typedef u32	__be32;
 typedef u64	__be64;
 
-#ifdef __LITTLE_ENDIAN__
-#define cpu_to_be16(x) swab16(x)
-#define be16_to_cpu(x) swab16(x)
-#define cpu_to_be32(x) swab32(x)
-#define be32_to_cpu(x) swab32(x)
-#define cpu_to_be64(x) swab64(x)
-#define be64_to_cpu(x) swab64(x)
-#else
-#define cpu_to_be16(x) (x)
-#define be16_to_cpu(x) (x)
-#define cpu_to_be32(x) (x)
-#define be32_to_cpu(x) (x)
-#define cpu_to_be64(x) (x)
-#define be64_to_cpu(x) (x)
-#endif
-
 #define PROM_ERROR (-1u)
 
 #endif /* _PPC_BOOT_OF_H_ */
diff --git a/arch/powerpc/boot/ops.h b/arch/powerpc/boot/ops.h
index e060676..5100dd7 100644
--- a/arch/powerpc/boot/ops.h
+++ b/arch/powerpc/boot/ops.h
@@ -95,6 +95,7 @@ void *simple_alloc_init(char *base, unsigned long heap_size,
 extern void flush_cache(void *, unsigned long);
 int dt_xlate_reg(void *node, int res, unsigned long *addr, unsigned long *size);
 int dt_xlate_addr(void *node, u32 *buf, int buflen, unsigned long *xlated_addr);
+int dt_read_addr(void *node, const char *prop, unsigned long *out);
 int dt_is_compatible(void *node, const char *compat);
 void dt_get_reg_format(void *node, u32 *naddr, u32 *nsize);
 int dt_get_virtual_reg(void *node, void **addr, int nres);
diff --git a/arch/powerpc/boot/swab.h b/arch/powerpc/boot/swab.h
index 11d2069..82db2c1 100644
--- a/arch/powerpc/boot/swab.h
+++ b/arch/powerpc/boot/swab.h
@@ -27,4 +27,21 @@ static inline u64 swab64(u64 x)
	(u64)((x & (u64)0x00ff000000000000ULL) >> 40) |
	(u64)((x & (u64)0xff00000000000000ULL) >> 56);
 }
+
+#ifdef __LITTLE_ENDIAN__
+#define cpu_to_be16(x) swab16(x)
+#define be16_to_cpu(x) swab16(x)
+#define cpu_to_be32(x) swab32(x)
+#define be32_to_cpu(x) swab32(x)
+#define cpu_to_be64(x) swab64(x)
+#define be64_to_cpu(x) swab64(x)

Re: [PATCH v8 3/8] powerpc: detect the trusted boot state of the system

2019-10-22 Thread Michael Ellerman
Nayna Jain  writes:
> diff --git a/arch/powerpc/kernel/secure_boot.c b/arch/powerpc/kernel/secure_boot.c
> index 99bba7915629..9753470ab08a 100644
> --- a/arch/powerpc/kernel/secure_boot.c
> +++ b/arch/powerpc/kernel/secure_boot.c
> @@ -28,3 +39,16 @@ bool is_ppc_secureboot_enabled(void)
>   pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
>   return enabled;
>  }
> +
> +bool is_ppc_trustedboot_enabled(void)
> +{
> + struct device_node *node;
> + bool enabled = false;
> +
> + node = get_ppc_fw_sb_node();
> + enabled = of_property_read_bool(node, "trusted-enabled");

Also here you need:

of_node_put(node);

> +
> + pr_info("Trusted boot mode %s\n", enabled ? "enabled" : "disabled");
> +
> + return enabled;
> +}

cheers


Re: [PATCH v8 1/8] powerpc: detect the secure boot mode of the system

2019-10-22 Thread Michael Ellerman
Nayna Jain  writes:
> diff --git a/arch/powerpc/kernel/secure_boot.c b/arch/powerpc/kernel/secure_boot.c
> new file mode 100644
> index ..99bba7915629
> --- /dev/null
> +++ b/arch/powerpc/kernel/secure_boot.c
> @@ -0,0 +1,30 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain
> + */
> +#include 
> +#include 
> +#include 
> +
> +bool is_ppc_secureboot_enabled(void)
> +{
> + struct device_node *node;
> + bool enabled = false;
> +
> + node = of_find_compatible_node(NULL, NULL, "ibm,secvar-v1");

If this found a node then you have a node with an elevated refcount
which you need to drop on the way out.

> + if (!of_device_is_available(node)) {
> + pr_err("Cannot find secure variable node in device tree; failing to secure state\n");
> + goto out;
> + }
> +
> + /*
> +  * secureboot is enabled if os-secure-enforcing property exists,
> +  * else disabled.
> +  */
> + enabled = of_property_read_bool(node, "os-secure-enforcing");
> +
> +out:

So here you need:

of_node_put(node);


> + pr_info("Secure boot mode %s\n", enabled ? "enabled" : "disabled");
> + return enabled;
> +}

cheers


Re: [PATCH v7 00/12] implement KASLR for powerpc/fsl_booke/32

2019-10-22 Thread Scott Wood
On Mon, 2019-10-21 at 11:34 +0800, Jason Yan wrote:
> 
> On 2019/10/10 2:46, Scott Wood wrote:
> > On Wed, 2019-10-09 at 16:41 +0800, Jason Yan wrote:
> > > Hi Scott,
> > > 
> > > On 2019/10/9 15:13, Scott Wood wrote:
> > > > On Wed, 2019-10-09 at 14:10 +0800, Jason Yan wrote:
> > > > > Hi Scott,
> > > > > 
> > > > > Would you please take sometime to test this?
> > > > > 
> > > > > Thank you so much.
> > > > > 
> > > > > On 2019/9/24 13:52, Jason Yan wrote:
> > > > > > Hi Scott,
> > > > > > 
> > > > > > Can you test v7 to see if it works to load a kernel at a non-zero
> > > > > > address?
> > > > > > 
> > > > > > Thanks,
> > > > 
> > > > Sorry for the delay.  Here's the output:
> > > > 
> > > 
> > > Thanks for the test.
> > > 
> > > > ## Booting kernel from Legacy Image at 1000 ...
> > > >  Image Name:   Linux-5.4.0-rc2-00050-g8ac2cf5b4
> > > >  Image Type:   PowerPC Linux Kernel Image (gzip compressed)
> > > >  Data Size:7521134 Bytes = 7.2 MiB
> > > >  Load Address: 0400
> > > >  Entry Point:  0400
> > > >  Verifying Checksum ... OK
> > > > ## Flattened Device Tree blob at 1fc0
> > > >  Booting using the fdt blob at 0x1fc0
> > > >  Uncompressing Kernel Image ... OK
> > > >  Loading Device Tree to 07fe, end 07fff65c ... OK
> > > > KASLR: No safe seed for randomizing the kernel base.
> > > > OF: reserved mem: initialized node qman-fqd, compatible id fsl,qman-
> > > > fqd
> > > > OF: reserved mem: initialized node qman-pfdr, compatible id fsl,qman-
> > > > pfdr
> > > > OF: reserved mem: initialized node bman-fbpr, compatible id fsl,bman-
> > > > fbpr
> > > > Memory CAM mapping: 64/64/64 Mb, residual: 12032Mb
> > > 
> > > When boot from 0400, the max CAM value is 64M. And
> > > you have a board with 12G memory, CONFIG_LOWMEM_CAM_NUM=3 means only
> > > 192M of memory is mapped, and when the kernel is randomized in the middle of
> > > this 192M, we will not have enough contiguous memory for the node
> > > map.
> > > 
> > > Can you set CONFIG_LOWMEM_CAM_NUM=8 and see if it works?
> > 
> > OK, that worked.
> > 
> 
> Hi Scott, any more cases should be tested or any more comments?
> What else need to be done before this feature can be merged?

I've just applied it and sent a pull request.

-Scott




Pull request: scottwood/linux.git next

2019-10-22 Thread Scott Wood
This contains KASLR support for book3e 32-bit.

The following changes since commit 612ee81b9461475b5a5612c2e8d71559dd3c7920:

  powerpc/papr_scm: Fix an off-by-one check in papr_scm_meta_{get, set} (2019-10-10 20:15:53 +1100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next

for you to fetch changes up to 9df1ef3f1376ec5d3a1b51a4546c94279bcd88ca:

  powerpc/fsl_booke/32: Document KASLR implementation (2019-10-21 16:09:16 -0500)


Jason Yan (12):
  powerpc: unify definition of M_IF_NEEDED
  powerpc: move memstart_addr and kernstart_addr to init-common.c
  powerpc: introduce kernstart_virt_addr to store the kernel base
  powerpc/fsl_booke/32: introduce create_kaslr_tlb_entry() helper
  powerpc/fsl_booke/32: introduce reloc_kernel_entry() helper
  powerpc/fsl_booke/32: implement KASLR infrastructure
  powerpc/fsl_booke/32: randomize the kernel image offset
  powerpc/fsl_booke/kaslr: clear the original kernel if randomized
  powerpc/fsl_booke/kaslr: support nokaslr cmdline parameter
  powerpc/fsl_booke/kaslr: dump out kernel offset information on panic
  powerpc/fsl_booke/kaslr: export offset in VMCOREINFO ELF notes
  powerpc/fsl_booke/32: Document KASLR implementation

 Documentation/powerpc/kaslr-booke32.rst   |  42 +++
 arch/powerpc/Kconfig  |  11 +
 arch/powerpc/include/asm/nohash/mmu-book3e.h  |  11 +-
 arch/powerpc/include/asm/page.h   |   7 +
 arch/powerpc/kernel/early_32.c|   5 +-
 arch/powerpc/kernel/exceptions-64e.S  |  12 +-
 arch/powerpc/kernel/fsl_booke_entry_mapping.S |  25 +-
 arch/powerpc/kernel/head_fsl_booke.S  |  61 +++-
 arch/powerpc/kernel/machine_kexec.c   |   1 +
 arch/powerpc/kernel/misc_64.S |   7 +-
 arch/powerpc/kernel/setup-common.c|  20 ++
 arch/powerpc/mm/init-common.c |   7 +
 arch/powerpc/mm/init_32.c |   5 -
 arch/powerpc/mm/init_64.c |   5 -
 arch/powerpc/mm/mmu_decl.h|  11 +
 arch/powerpc/mm/nohash/Makefile   |   1 +
 arch/powerpc/mm/nohash/fsl_booke.c|   8 +-
 arch/powerpc/mm/nohash/kaslr_booke.c  | 401 ++
 18 files changed, 587 insertions(+), 53 deletions(-)
 create mode 100644 Documentation/powerpc/kaslr-booke32.rst
 create mode 100644 arch/powerpc/mm/nohash/kaslr_booke.c


Re: [PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-22 Thread Dan Williams
Hi David,

Thanks for tackling this!

On Tue, Oct 22, 2019 at 10:13 AM David Hildenbrand  wrote:
>
> This series is based on [2], which should pop up in linux/next soon:
> https://lkml.org/lkml/2019/10/21/1034
>
> This is the result of a recent discussion with Michal ([1], [2]). Right
> now we set all pages PG_reserved when initializing hotplugged memmaps. This
> includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
> cleared again when onlining the memory, in case of ZONE_DEVICE memory
> never. In ancient times, we needed PG_reserved, because there was no way
> to tell whether the memmap was already properly initialized. We now have
> SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
> memory is already initialized deferred, and there shouldn't be a visible
> change in that regard.
>
> I remember that some time ago, we already talked about stopping to set
> ZONE_DEVICE pages PG_reserved on the list, but I never saw any patches.
> Also, I forgot who was part of the discussion :)

You got me, Alex, and KVM folks on the Cc, so I'd say that was it.

> One of the biggest fears was side effects. I went ahead and audited all
> users of PageReserved(). The ones that don't need any care (patches)
> can be found below. I will double check and hope I am not missing something
> important.
>
> I am probably a little bit too careful (but I don't want to break things).
> In most places (besides KVM and vfio that are nuts), the
> pfn_to_online_page() check could most probably be avoided by a
> is_zone_device_page() check. However, I usually get suspicious when I see
> a pfn_valid() check (especially after I learned that people mmap parts of
> /dev/mem into user space, including memory without memmaps. Also, people
> could memmap offline memory blocks this way :/). As long as this does not
> hurt performance, I think we should rather do it the clean way.

I'm concerned about using is_zone_device_page() in places that are not
known to already have a reference to the page. Here's an audit of
current usages, and the ones I think need to cleaned up. The "unsafe"
ones do not appear to have any protections against the device page
being removed (get_dev_pagemap()). Yes, some of these were added by
me. The "unsafe? HMM" ones need HMM eyes because HMM leaks device
pages into anonymous memory paths and I'm not up to speed on how it
guarantees 'struct page' validity vs device shutdown without using
get_dev_pagemap().

smaps_pmd_entry(): unsafe

put_devmap_managed_page(): safe, page reference is held

is_device_private_page(): safe? gpu driver manages private page lifetime

is_pci_p2pdma_page(): safe, page reference is held

uncharge_page(): unsafe? HMM

add_to_kill(): safe, protected by get_dev_pagemap() and dax_lock_page()

soft_offline_page(): unsafe

remove_migration_pte(): unsafe? HMM

move_to_new_page(): unsafe? HMM

migrate_vma_pages() and helpers: unsafe? HMM

try_to_unmap_one(): unsafe? HMM

__put_page(): safe

release_pages(): safe

I'm hoping all the HMM ones can be converted to
is_device_private_page() directly and have that routine grow a nice
comment about how it knows it can always safely de-reference its @page
argument.

For the rest I'd like to propose that we add a facility to determine
ZONE_DEVICE by pfn rather than page. The most straightforward way I
can think of would be to just add another bitmap to mem_section_usage
to indicate if a subsection is ZONE_DEVICE or not.

>
> I only gave it a quick test with DIMMs on x86-64, but didn't test the
> ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
> on x86-64 and PPC.

I'll give it a spin, but I don't think the kernel wants to grow more
is_zone_device_page() users.


Re: [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes

2019-10-22 Thread David Hildenbrand

On 22.10.19 19:55, Matt Sickler wrote:

>> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change that.
>>
>> The pages are obtained via get_user_pages_fast(). I assume, these could be
>> ZONE_DEVICE pages. Let's just exclude them as well explicitly.
>
> I'm not sure what ZONE_DEVICE pages are, but these pages are normal system RAM,
> typically HugePages (but not always).


ZONE_DEVICE, a.k.a. devmem, are pages that bypass the pagecache (e.g., 
DAX) completely and will therefore never get swapped. These pages are 
not managed by any page allocator (especially not the buddy), they are 
rather "directly mapped device memory".


E.g., a NVDIMM. It is mapped into the physical address space similar to 
ordinary RAM (a DIMM). Any write to such a PFN will directly end up on 
the target device. In contrast to a DIMM, the memory is persistent
across reboots.


Now, if you mmap such an NVDIMM into a user space process, you will end 
up with ZONE_DEVICE pages as part of the user space mapping (VMA). 
get_user_pages_fast() on this memory will result in "struct pages" that 
belong to ZONE_DEVICE. This is where this patch comes into play.


This patch makes sure that there is absolutely no change once we stop 
setting these ZONE_DEVICE pages PG_reserved. E.g., AFAIK, setting a 
ZONE_DEVICE page dirty does not make too much sense (never swapped).


Yes, it might not be a likely setup, however, it is possible. In this 
series I collect all places that *could* be affected. Whether that change is
really needed has to be decided. I can see that the two staging drivers 
I have patches for might be able to just live with the change - but then 
we talked about it and are aware of the change.


Thanks!

--

Thanks,

David / dhildenb



RE: [PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes

2019-10-22 Thread Matt Sickler
>Right now, ZONE_DEVICE memory is always set PG_reserved. We want to change 
>that.
>
>The pages are obtained via get_user_pages_fast(). I assume, these could be 
>ZONE_DEVICE pages. Let's just exclude them as well explicitly.

I'm not sure what ZONE_DEVICE pages are, but these pages are normal system RAM, 
typically HugePages (but not always).

>
>Cc: Greg Kroah-Hartman 
>Cc: Vandana BN 
>Cc: "Simon Sandström" 
>Cc: Dan Carpenter 
>Cc: Nishka Dasgupta 
>Cc: Madhumitha Prabakaran 
>Cc: Fabio Estevam 
>Cc: Matt Sickler 
>Cc: Jeremy Sowden 
>Signed-off-by: David Hildenbrand 
>---
> drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c 
>b/drivers/staging/kpc2000/kpc_dma/fileops.c
>index cb52bd9a6d2f..457adcc81fe6 100644
>--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
>+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
>@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t 
>xfr_count, u32 flags)
>BUG_ON(acd->ldev->pldev == NULL);
>
>for (i = 0 ; i < acd->page_count ; i++) {
>-   if (!PageReserved(acd->user_pages[i])) {
>+   if (!PageReserved(acd->user_pages[i]) &&
>+   !is_zone_device_page(acd->user_pages[i])) {
>set_page_dirty(acd->user_pages[i]);
>}
>}
>--
>2.21.0



[PATCH RFC v1 02/12] mm/usercopy.c: Prepare check_page_span() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

Let's make sure that the logic in the function won't change. Once we no
longer set these pages to reserved, we can rework this function to
perform separate checks for ZONE_DEVICE (split from PG_reserved checks).

Cc: Kees Cook 
Cc: Andrew Morton 
Cc: Kate Stewart 
Cc: Allison Randal 
Cc: "Isaac J. Manjarres" 
Cc: Qian Cai 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 mm/usercopy.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/usercopy.c b/mm/usercopy.c
index 660717a1ea5c..a3ac4be35cde 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -203,14 +203,15 @@ static inline void check_page_span(const void *ptr, 
unsigned long n,
 * device memory), or CMA. Otherwise, reject since the object spans
 * several independently allocated pages.
 */
-   is_reserved = PageReserved(page);
+   is_reserved = PageReserved(page) || is_zone_device_page(page);
is_cma = is_migrate_cma_page(page);
if (!is_reserved && !is_cma)
usercopy_abort("spans multiple pages", NULL, to_user, 0, n);
 
for (ptr += PAGE_SIZE; ptr <= end; ptr += PAGE_SIZE) {
page = virt_to_head_page(ptr);
-   if (is_reserved && !PageReserved(page))
+   if (is_reserved && !(PageReserved(page) ||
+is_zone_device_page(page)))
usercopy_abort("spans Reserved and non-Reserved pages",
   NULL, to_user, 0, n);
if (is_cma && !is_migrate_cma_page(page))
-- 
2.21.0



[PATCH RFC v1 00/12] mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE)

2019-10-22 Thread David Hildenbrand
This series is based on [2], which should pop up in linux/next soon:
https://lkml.org/lkml/2019/10/21/1034

This is the result of a recent discussion with Michal ([1], [2]). Right
now we set all pages PG_reserved when initializing hotplugged memmaps. This
includes ZONE_DEVICE memory. In case of system memory, PG_reserved is
cleared again when onlining the memory, in case of ZONE_DEVICE memory
never. In ancient times, we needed PG_reserved, because there was no way
to tell whether the memmap was already properly initialized. We now have
SECTION_IS_ONLINE for that in the case of !ZONE_DEVICE memory. ZONE_DEVICE
memory is already initialized deferred, and there shouldn't be a visible
change in that regard.

I remember that some time ago we already talked on the list about no longer
setting ZONE_DEVICE pages PG_reserved, but I never saw any patches.
Also, I forgot who was part of the discussion :)

One of the biggest fears was side effects. I went ahead and audited all
users of PageReserved(). The ones that don't need any care (patches)
can be found below. I will double-check and hope I am not missing something
important.

I am probably a little bit too careful (but I don't want to break things).
In most places (besides KVM and vfio that are nuts), the
pfn_to_online_page() check could most probably be avoided by a
is_zone_device_page() check. However, I usually get suspicious when I see
a pfn_valid() check (especially after I learned that people mmap parts of
/dev/mem into user space, including memory without memmaps; people
could also memmap offline memory blocks this way :/). As long as this does not
hurt performance, I think we should rather do it the clean way.

I only gave it a quick test with DIMMs on x86-64, but didn't test the
ZONE_DEVICE part at all (any tips for a nice QEMU setup?). Compile-tested
on x86-64 and PPC.

Other users of PageReserved() that should be fine:
- mm/page_owner.c:pagetypeinfo_showmixedcount_print()
  -> Never called for ZONE_DEVICE, (+ pfn_to_online_page(pfn))
- mm/page_owner.c:init_pages_in_zone()
  -> Never called for ZONE_DEVICE (!populated_zone(zone))
- mm/page_ext.c:free_page_ext()
  -> Only a BUG_ON(PageReserved(page)), not relevant
- mm/page_ext.c:has_unmovable_pages()
  -> Not relevant for ZONE_DEVICE
- mm/page_ext.c:pfn_range_valid_contig()
  -> pfn_to_online_page() already guards us
- mm/mempolicy.c:queue_pages_pte_range()
  -> vm_normal_page() checks against pte_devmap()
- mm/memory-failure.c:hwpoison_user_mappings()
  -> Not reached via memory_failure() due to pfn_to_online_page()
  -> Also not reached indirectly via memory_failure_hugetlb()
- mm/hugetlb.c:gather_bootmem_prealloc()
  -> Only a WARN_ON(PageReserved(page)), not relevant
- kernel/power/snapshot.c:saveable_highmem_page()
  -> pfn_to_online_page() already guards us
- kernel/power/snapshot.c:saveable_page()
  -> pfn_to_online_page() already guards us
- fs/proc/task_mmu.c:can_gather_numa_stats()
  -> vm_normal_page() checks against pte_devmap()
- fs/proc/task_mmu.c:can_gather_numa_stats_pmd
  -> vm_normal_page_pmd() checks against pte_devmap()
- fs/proc/page.c:stable_page_flags()
  -> The reserved bit is simply copied, irrelevant
- drivers/firmware/memmap.c:release_firmware_map_entry()
  -> really only a check to detect bootmem. Not relevant for ZONE_DEVICE
- arch/ia64/kernel/mca_drv.c
- arch/mips/mm/init.c
- arch/mips/mm/ioremap.c
- arch/nios2/mm/ioremap.c
- arch/parisc/mm/ioremap.c
- arch/sparc/mm/tlb.c
- arch/xtensa/mm/cache.c
  -> No ZONE_DEVICE support
- arch/powerpc/mm/init_64.c:vmemmap_free()
  -> Special-cases memmap on altmap
  -> Only a check for bootmem
- arch/x86/kernel/alternative.c:__text_poke()
  -> Only a WARN_ON(!PageReserved(pages[0])) to verify it is bootmem
- arch/x86/mm/init_64.c
  -> Only a check for bootmem

[1] https://lkml.org/lkml/2019/10/21/736
[2] https://lkml.org/lkml/2019/10/21/1034

Cc: Michal Hocko 
Cc: Dan Williams 
Cc: kvm-...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: de...@driverdev.osuosl.org
Cc: xen-de...@lists.xenproject.org
Cc: x...@kernel.org
Cc: Alexander Duyck 

David Hildenbrand (12):
  mm/memory_hotplug: Don't allow to online/offline memory blocks with
holes
  mm/usercopy.c: Prepare check_page_span() for PG_reserved changes
  KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes
  KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes
  vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes
  staging/gasket: Prepare gasket_release_page() for PG_reserved changes
  staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved
changes
  powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for
PG_reserved changes
  powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved
changes
  powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes
  x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes
  mm/memory_hotplug: 

[PATCH RFC v1 12/12] mm/memory_hotplug: Don't mark pages PG_reserved when initializing the memmap

2019-10-22 Thread David Hildenbrand
Everything should be prepared to stop setting pages PG_reserved when
initializing the memmap on memory hotplug. Most importantly, we
stop marking ZONE_DEVICE pages PG_reserved.

a) We made sure that any code that relied on PG_reserved to detect
   ZONE_DEVICE memory will no longer rely on PG_reserved - either
   by using pfn_to_online_page() to exclude them right away or by
   checking against is_zone_device_page().
b) We made sure that memory blocks with holes cannot be offlined and
   therefore also not onlined. We have quite some code that relies on
   memory holes being marked PG_reserved. This is now not an issue
   anymore.

generic_online_page() still calls __free_pages_core(), which performs
__ClearPageReserved(p). AFAIKS, this should not hurt.

It is worth noting that the users of online_page_callback_t might see a
change. E.g., until now, pages not freed to the buddy by the HyperV
balloon were set PG_reserved until freed via generic_online_page(). Now,
they would look like ordinarily allocated pages (refcount == 1). This
callback is used by the XEN balloon and the HyperV balloon. To not
introduce any silent errors, keep marking the pages PG_reserved. We can
most probably stop doing that, but have to double-check whether there are
issues (e.g., offlining code aborts right away in has_unmovable_pages()
when it runs into a PageReserved(page)).

Update the documentation at various places.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Andrew Morton 
Cc: Alexander Duyck 
Cc: Pavel Tatashin 
Cc: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Anthony Yznaga 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Dan Williams 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Suggested-by: Michal Hocko 
Signed-off-by: David Hildenbrand 
---
 drivers/hv/hv_balloon.c|  6 ++
 drivers/xen/balloon.c  |  7 +++
 include/linux/page-flags.h |  8 +---
 mm/memory_hotplug.c| 17 +++--
 mm/page_alloc.c| 11 ---
 5 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index c722079d3c24..3214b0ef5247 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -670,6 +670,12 @@ static struct notifier_block hv_memory_nb = {
 /* Check if the particular page is backed and can be onlined and online it. */
 static void hv_page_online_one(struct hv_hotadd_state *has, struct page *pg)
 {
+   /*
+* TODO: The core used to mark the pages reserved. Most probably
+* we can stop doing that now.
+*/
+   __SetPageReserved(pg);
+
if (!has_pfn_is_backed(has, page_to_pfn(pg))) {
if (!PageOffline(pg))
__SetPageOffline(pg);
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 4f2e78a5e4db..af69f057913a 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -374,6 +374,13 @@ static void xen_online_page(struct page *page, unsigned 
int order)
mutex_lock(&balloon_mutex);
for (i = 0; i < size; i++) {
p = pfn_to_page(start_pfn + i);
+   /*
+* TODO: The core used to mark the pages reserved. Most probably
+* we can stop doing that now. However, especially
+* alloc_xenballooned_pages() left PG_reserved set
+* on pages that can get mapped to user space.
+*/
+   __SetPageReserved(p);
balloon_append(p);
}
mutex_unlock(&balloon_mutex);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..d4f85d866b71 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -30,24 +30,18 @@
  * - Pages falling into physical memory gaps - not IORESOURCE_SYSRAM. Trying
  *   to read/write these pages might end badly. Don't touch!
  * - The zero page(s)
- * - Pages not added to the page allocator when onlining a section because
- *   they were excluded via the online_page_callback() or because they are
- *   PG_hwpoison.
  * - Pages allocated in the context of kexec/kdump (loaded kernel image,
  *   control pages, vmcoreinfo)
  * - MMIO/DMA pages. Some architectures don't allow to ioremap pages that are
  *   not marked PG_reserved (as they might be in use by somebody else who does
  *   not respect the caching strategy).
- * - Pages part of an offline section (struct pages of offline sections should
- *   not be trusted as they will be initialized when first onlined).
  * - MCA pages on ia64
  * - Pages holding CPU notes for POWER Firmware Assisted Dump
- * - Device memory (e.g. PMEM, DAX, HMM)
  * Some PG_reserved pages will be excluded from the hibernation image.
  * PG_reserved does in general not hinder anybody from dumping or swapping
  * and is no longer required for remap_pfn_range(). ioremap might require it.
  * Consequently, PG_reserved for 

[PATCH RFC v1 11/12] x86/mm: Prepare __ioremap_check_ram() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Signed-off-by: David Hildenbrand 
---
 arch/x86/mm/ioremap.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index a39dcdb5ae34..db6913b48edf 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -77,10 +77,17 @@ static unsigned int __ioremap_check_ram(struct resource 
*res)
start_pfn = (res->start + PAGE_SIZE - 1) >> PAGE_SHIFT;
stop_pfn = (res->end + 1) >> PAGE_SHIFT;
if (stop_pfn > start_pfn) {
-   for (i = 0; i < (stop_pfn - start_pfn); ++i)
-   if (pfn_valid(start_pfn + i) &&
-   !PageReserved(pfn_to_page(start_pfn + i)))
+   for (i = 0; i < (stop_pfn - start_pfn); ++i) {
+   struct page *page;
+/*
+ * We treat any pages that are not online (not managed
+ * by the buddy) as not being RAM. This includes
+ * ZONE_DEVICE pages.
+ */
+   page = pfn_to_online_page(start_pfn + i);
+   if (page && !PageReserved(page))
return IORES_MAP_SYSTEM_RAM;
+   }
}
 
return 0;
-- 
2.21.0



[PATCH RFC v1 10/12] powerpc/mm: Prepare maybe_pte_to_page() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Allison Randal 
Cc: Nicholas Piggin 
Cc: Thomas Gleixner 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/pgtable.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index e3759b69f81b..613c98fa7dc0 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -55,10 +55,12 @@ static struct page *maybe_pte_to_page(pte_t pte)
unsigned long pfn = pte_pfn(pte);
struct page *page;
 
-   if (unlikely(!pfn_valid(pfn)))
-   return NULL;
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
+   /*
+* We reject any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   page = pfn_to_online_page(pfn);
+   if (unlikely(!page || PageReserved(page)))
return NULL;
return page;
 }
-- 
2.21.0



[PATCH RFC v1 06/12] staging/gasket: Prepare gasket_release_page() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's explicitly exclude them as well.

Cc: Rob Springer 
Cc: Todd Poynor 
Cc: Ben Chan 
Cc: Greg Kroah-Hartman 
Signed-off-by: David Hildenbrand 
---
 drivers/staging/gasket/gasket_page_table.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/gasket/gasket_page_table.c 
b/drivers/staging/gasket/gasket_page_table.c
index f6d715787da8..d43fed58bf65 100644
--- a/drivers/staging/gasket/gasket_page_table.c
+++ b/drivers/staging/gasket/gasket_page_table.c
@@ -447,7 +447,7 @@ static bool gasket_release_page(struct page *page)
if (!page)
return false;
 
-   if (!PageReserved(page))
+   if (!PageReserved(page) && !is_zone_device_page(page))
SetPageDirty(page);
put_page(page);
 
-- 
2.21.0



[PATCH RFC v1 09/12] powerpc/64s: Prepare hash_page_do_lazy_icache() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

We could explicitly check for is_zone_device_page(page). But looking at
the pfn_valid() check, it seems safer to just use pfn_to_online_page()
here, which will skip all ZONE_DEVICE pages right away.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Aneesh Kumar K.V" 
Cc: Christophe Leroy 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: YueHaibing 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..a1566039e747 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1084,13 +1084,15 @@ void hash__early_init_mmu_secondary(void)
  */
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
 {
-   struct page *page;
+   struct page *page = pfn_to_online_page(pte_pfn(pte));
 
-   if (!pfn_valid(pte_pfn(pte)))
+   /*
+* We ignore any pages that are not online (not managed by the buddy).
+* This includes ZONE_DEVICE pages.
+*/
+   if (!page)
return pp;
 
-   page = pte_page(pte);
-
/* page is dirty */
if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
if (trap == 0x400) {
-- 
2.21.0



[PATCH RFC v1 05/12] vfio/type1: Prepare is_invalid_reserved_pfn() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory. They are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Alex Williamson 
Cc: Cornelia Huck 
Signed-off-by: David Hildenbrand 
---
 drivers/vfio/vfio_iommu_type1.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long 
npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0



[PATCH RFC v1 08/12] powerpc/book3s: Prepare kvmppc_book3s_instantiate_page() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory.

Cc: Paul Mackerras 
Cc: Benjamin Herrenschmidt 
Cc: Michael Ellerman 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..05397c0561fc 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -801,12 +801,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
   writing, upgrade_p);
if (is_error_noslot_pfn(pfn))
return -EFAULT;
-   page = NULL;
-   if (pfn_valid(pfn)) {
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
-   page = NULL;
-   }
+   /*
+* We treat any pages that are not online (not managed by the
+* buddy) as reserved - this includes ZONE_DEVICE pages and
+* pages without a memmap (e.g., mapped via /dev/mem).
+*/
+   page = pfn_to_online_page(pfn);
+   if (page && PageReserved(page))
+   page = NULL;
}
 
/*
-- 
2.21.0



[PATCH RFC v1 07/12] staging: kpc2000: Prepare transfer_complete_cb() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

The pages are obtained via get_user_pages_fast(). I assume these
could be ZONE_DEVICE pages, so let's explicitly exclude them as well.

Cc: Greg Kroah-Hartman 
Cc: Vandana BN 
Cc: "Simon Sandström" 
Cc: Dan Carpenter 
Cc: Nishka Dasgupta 
Cc: Madhumitha Prabakaran 
Cc: Fabio Estevam 
Cc: Matt Sickler 
Cc: Jeremy Sowden 
Signed-off-by: David Hildenbrand 
---
 drivers/staging/kpc2000/kpc_dma/fileops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c 
b/drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..457adcc81fe6 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -212,7 +212,8 @@ void  transfer_complete_cb(struct aio_cb_data *acd, size_t 
xfr_count, u32 flags)
BUG_ON(acd->ldev->pldev == NULL);
 
for (i = 0 ; i < acd->page_count ; i++) {
-   if (!PageReserved(acd->user_pages[i])) {
+   if (!PageReserved(acd->user_pages[i]) &&
+   !is_zone_device_page(acd->user_pages[i])) {
set_page_dirty(acd->user_pages[i]);
}
}
-- 
2.21.0



[PATCH RFC v1 01/12] mm/memory_hotplug: Don't allow to online/offline memory blocks with holes

2019-10-22 Thread David Hildenbrand
Our onlining/offlining code is unnecessarily complicated. Only memory
blocks added during boot can have holes. Hotplugged memory never has
holes. That memory is already online.

When we stop allowing memory blocks with holes to be offlined, we
implicitly stop allowing memory blocks with holes to be onlined.

This allows us to simplify the code. For example, we no longer have to
worry about marking pages that fall into memory holes PG_reserved when
onlining memory. We can stop setting pages PG_reserved.

Offlining memory blocks added during boot is usually not guaranteed to work
either way, so no longer supporting that (if anybody really used and tested
it over the years) should not really hurt. For the use case of
offlining memory to unplug DIMMs, we should see no change. (Holes on
DIMMs would be weird.)

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 561371ead39a..7210f4375279 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1447,10 +1447,19 @@ static void node_states_clear_node(int node, struct 
memory_notify *arg)
node_clear_state(node, N_MEMORY);
 }
 
+static int count_system_ram_pages_cb(unsigned long start_pfn,
+unsigned long nr_pages, void *data)
+{
+   unsigned long *nr_system_ram_pages = data;
+
+   *nr_system_ram_pages += nr_pages;
+   return 0;
+}
+
 static int __ref __offline_pages(unsigned long start_pfn,
  unsigned long end_pfn)
 {
-   unsigned long pfn, nr_pages;
+   unsigned long pfn, nr_pages = 0;
unsigned long offlined_pages = 0;
int ret, node, nr_isolate_pageblock;
unsigned long flags;
@@ -1461,6 +1470,20 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
mem_hotplug_begin();
 
+   /*
+* We don't allow to offline memory blocks that contain holes
+* and consequently don't allow to online memory blocks that contain
+* holes. This allows to simplify the code quite a lot and we don't
+* have to mess with PG_reserved pages for memory holes.
+*/
+   walk_system_ram_range(start_pfn, end_pfn - start_pfn, &nr_pages,
+ count_system_ram_pages_cb);
+   if (nr_pages != end_pfn - start_pfn) {
+   ret = -EINVAL;
+   reason = "memory holes";
+   goto failed_removal;
+   }
+
/* This makes hotplug much easier...and readable.
   we assume this for now. .*/
if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start,
@@ -1472,7 +1495,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
zone = page_zone(pfn_to_page(valid_start));
node = zone_to_nid(zone);
-   nr_pages = end_pfn - start_pfn;
 
/* set above range as isolated */
ret = start_isolate_page_range(start_pfn, end_pfn,
-- 
2.21.0



[PATCH RFC v1 04/12] KVM: Prepare kvm_is_reserved_pfn() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap. Note that ZONE_DEVICE memory is
never online (IOW, managed by the buddy).

Switching to pfn_to_online_page() keeps the existing behavior for
PFNs without a memmap and for ZONE_DEVICE memory. They are treated as
reserved and the page is not touched (e.g., to set it dirty or accessed).

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: KarimAllah Ahmed 
Signed-off-by: David Hildenbrand 
---
 virt/kvm/kvm_main.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 66a977472a1c..b233d4129014 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -151,9 +151,15 @@ __weak int kvm_arch_mmu_notifier_invalidate_range(struct 
kvm *kvm,
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
-   if (pfn_valid(pfn))
-   return PageReserved(pfn_to_page(pfn));
+   struct page *page = pfn_to_online_page(pfn);
 
+   /*
+* We treat any pages that are not online (not managed by the buddy)
+* as reserved - this includes ZONE_DEVICE pages and pages without
+* a memmap (e.g., mapped via /dev/mem).
+*/
+   if (page)
+   return PageReserved(page);
return true;
 }
 
-- 
2.21.0



[PATCH RFC v1 03/12] KVM: x86/mmu: Prepare kvm_is_mmio_pfn() for PG_reserved changes

2019-10-22 Thread David Hildenbrand
Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap - however, there is no reliable and
fast check to detect memmaps that were initialized and are ZONE_DEVICE.

Let's rewrite kvm_is_mmio_pfn() so we really only touch initialized
memmaps that are guaranteed to not contain garbage. Make sure that
RAM without a memmap is still not detected as MMIO and that ZONE_DEVICE
that is not UC/UC-/WC is not detected as MMIO.

Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Sean Christopherson 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Cc: Jim Mattson 
Cc: Joerg Roedel 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: KarimAllah Ahmed 
Cc: Michal Hocko 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 arch/x86/kvm/mmu.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 24c23c66b226..795869ffd4bb 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2962,20 +2962,26 @@ static bool mmu_need_write_protect(struct kvm_vcpu 
*vcpu, gfn_t gfn,
 
 static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 {
+   struct page *page = pfn_to_online_page(pfn);
+
+   /*
+* Online pages consist of pages managed by the buddy. Especially,
+* ZONE_DEVICE pages are never online. Online pages that are reserved
+* indicate the zero page and MMIO pages.
+*/
+   if (page)
+   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn));
+
+   /*
+* Anything with a valid memmap could be ZONE_DEVICE - or the
+* memmap could be uninitialized. Treat only UC/UC-/WC pages as MMIO.
+*/
if (pfn_valid(pfn))
-   return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
-   /*
-* Some reserved pages, such as those from NVDIMM
-* DAX devices, are not for MMIO, and can be mapped
-* with cached memory type for better performance.
-* However, the above check misconceives those pages
-* as MMIO, and results in KVM mapping them with UC
-* memory type, which would hurt the performance.
-* Therefore, we check the host memory type in addition
-* and only treat UC/UC-/WC pages as MMIO.
-*/
-   (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+   return !pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn);
 
+   /*
+* Any RAM that has no memmap (e.g., mapped via /dev/mem) is not MMIO.
+*/
return !e820__mapped_raw_any(pfn_to_hpa(pfn),
 pfn_to_hpa(pfn + 1) - 1,
 E820_TYPE_RAM);
-- 
2.21.0



Re: [PATCH 0/7] towards QE support on ARM

2019-10-22 Thread Christophe Leroy




On 10/18/2019 12:52 PM, Rasmus Villemoes wrote:

There have been several attempts in the past few years to allow
building the QUICC engine drivers for platforms other than PPC. This
is (the beginning of) yet another attempt. I hope I can get someone to
pick up these relatively trivial patches (I _think_ they shouldn't
change functionality at all), and then I'll continue slowly working
towards removing the PPC32 dependency for CONFIG_QUICC_ENGINE.

Tested on an MPC8309-derived board.

Rasmus Villemoes (7):
   soc: fsl: qe: remove space-before-tab
   soc: fsl: qe: drop volatile qualifier of struct qe_ic::regs
   soc: fsl: qe: avoid ppc-specific io accessors
   soc: fsl: qe: replace spin_event_timeout by readx_poll_timeout_atomic
   serial: make SERIAL_QE depend on PPC32
   serial: ucc_uart.c: explicitly include asm/cpm.h
   soc/fsl/qe/qe.h: remove include of asm/cpm.h


Please copy the entire series to linuxppc-dev list. We are missing 5/7 
and 7/7 (see 
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=137048)


Christophe



  drivers/soc/fsl/qe/gpio.c | 30 
  drivers/soc/fsl/qe/qe.c   | 44 +++
  drivers/soc/fsl/qe/qe_ic.c|  8 ++---
  drivers/soc/fsl/qe/qe_ic.h|  2 +-
  drivers/soc/fsl/qe/qe_io.c| 40 ++---
  drivers/soc/fsl/qe/qe_tdm.c   |  8 ++---
  drivers/soc/fsl/qe/ucc.c  | 12 +++
  drivers/soc/fsl/qe/ucc_fast.c | 66 ++-
  drivers/soc/fsl/qe/ucc_slow.c | 38 ++--
  drivers/soc/fsl/qe/usb.c  |  2 +-
  drivers/tty/serial/Kconfig|  1 +
  drivers/tty/serial/ucc_uart.c |  1 +
  include/soc/fsl/qe/qe.h   |  1 -
  13 files changed, 126 insertions(+), 127 deletions(-)



Re: [PATCH 3/7] soc: fsl: qe: avoid ppc-specific io accessors

2019-10-22 Thread Christophe Leroy




On 10/18/2019 12:52 PM, Rasmus Villemoes wrote:

In preparation for allowing QE support to be built for architectures
other than PPC, replace the ppc-specific io accessors. Done via



This patch is not transparent in terms of performance; the generated 
functions change significantly.


Before the patch:

0330 :
 330:   81 43 00 04 lwz r10,4(r3)
 334:   7c 00 04 ac hwsync
 338:   81 2a 00 00 lwz r9,0(r10)
 33c:   0c 09 00 00 twi 0,r9,0
 340:   4c 00 01 2c isync
 344:   70 88 00 02 andi.   r8,r4,2
 348:   41 82 00 10 beq 358 
 34c:   39 00 00 01 li  r8,1
 350:   91 03 00 10 stw r8,16(r3)
 354:   61 29 00 10 ori r9,r9,16
 358:   70 88 00 01 andi.   r8,r4,1
 35c:   41 82 00 10 beq 36c 
 360:   39 00 00 01 li  r8,1
 364:   91 03 00 14 stw r8,20(r3)
 368:   61 29 00 20 ori r9,r9,32
 36c:   7c 00 04 ac hwsync
 370:   91 2a 00 00 stw r9,0(r10)
 374:   4e 80 00 20 blr

After the patch:

030c :
 30c:   94 21 ff e0 stwur1,-32(r1)
 310:   7c 08 02 a6 mflrr0
 314:   bf a1 00 14 stmwr29,20(r1)
 318:   7c 9f 23 78 mr  r31,r4
 31c:   90 01 00 24 stw r0,36(r1)
 320:   7c 7e 1b 78 mr  r30,r3
 324:   83 a3 00 04 lwz r29,4(r3)
 328:   7f a3 eb 78 mr  r3,r29
 32c:   48 00 00 01 bl  32c 
32c: R_PPC_REL24ioread32be
 330:   73 e9 00 02 andi.   r9,r31,2
 334:   41 82 00 10 beq 344 
 338:   39 20 00 01 li  r9,1
 33c:   91 3e 00 10 stw r9,16(r30)
 340:   60 63 00 10 ori r3,r3,16
 344:   73 e9 00 01 andi.   r9,r31,1
 348:   41 82 00 10 beq 358 
 34c:   39 20 00 01 li  r9,1
 350:   91 3e 00 14 stw r9,20(r30)
 354:   60 63 00 20 ori r3,r3,32
 358:   80 01 00 24 lwz r0,36(r1)
 35c:   7f a4 eb 78 mr  r4,r29
 360:   bb a1 00 14 lmw r29,20(r1)
 364:   7c 08 03 a6 mtlrr0
 368:   38 21 00 20 addir1,r1,32
 36c:   48 00 00 00 b   36c 
36c: R_PPC_REL24iowrite32be


Christophe


Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-22 Thread Christophe Leroy




Le 22/10/2019 à 11:01, Christophe Leroy a écrit :



Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :

On Mon, 21 Oct 2019, Christophe Leroy wrote:

This is a tentative to switch powerpc/32 vdso to generic C 
implementation.

It will likely not work on 64 bits or even build properly at the moment.

powerpc is a bit special for the VDSO as well as system calls in the
way that it requires setting the CR SO bit, which cannot be done in C.
Therefore, entry/exit and fallback need to be performed in ASM.

To allow that, C fallbacks just return -1 and the ASM entry point
performs the system call when the C function returns -1.

The performance is rather disappointing. That's most likely because all
calculations in the C implementation are based on 64-bit math and
converted to 32 bits at the very end. I guess the C implementation should
use 32-bit math like the assembly VDSO does as of today.



gettimeofday:    vdso: 750 nsec/call

gettimeofday:    vdso: 1533 nsec/call


Small improvement (3%) with the proposed change:

gettimeofday:    vdso: 1485 nsec/call


By inlining do_hres() I get the following:

gettimeofday:vdso: 1072 nsec/call

Christophe



Though still some way to go.

Christophe



The only real 64-bit math which can matter is the 64-bit * 32-bit multiply,
i.e.

static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
 return ((cycles - last) & mask) * mult;
}

Everything else is trivial add/sub/shift, which should be roughly the same
in ASM.

Can you try to replace that with:

static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
 u64 ret, delta = ((cycles - last) & mask);
 u32 dh, dl;

 dl = delta;
 dh = delta >> 32;

 ret = mul_u32_u32(dl, mult);
 if (dh)
  ret += mul_u32_u32(dh, mult) << 32;

 return ret;
}

That's pretty much what __do_get_tspec does in ASM.

Thanks,

tglx



[PATCH] powerpc/powernv: Fix CPU idle to be called with IRQs disabled

2019-10-22 Thread Nicholas Piggin
Commit e78a7614f3876 ("idle: Prevent late-arriving interrupts from
disrupting offline") changes arch_cpu_idle_dead to be called with
interrupts disabled, which triggers the WARN in pnv_smp_cpu_kill_self.

Fix this by fixing up irq_happened after hard disabling, rather than
requiring there are no pending interrupts, similarly to what was done
until commit 2525db04d1cc5 ("powerpc/powernv: Simplify lazy IRQ
handling in CPU offline").

Fixes: e78a7614f3876 ("idle: Prevent late-arriving interrupts from disrupting 
offline")
Reported-by: Paul Mackerras 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/smp.c | 50 +++-
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index fbd6e6b7bbf2..241cfee744d9 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -146,6 +146,18 @@ static int pnv_smp_cpu_disable(void)
return 0;
 }
 
+static void pnv_flush_interrupts(void)
+{
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   if (xive_enabled())
+   xive_flush_interrupt();
+   else
+   icp_opal_flush_interrupt();
+   } else {
+   icp_native_flush_interrupt();
+   }
+}
+
 static void pnv_smp_cpu_kill_self(void)
 {
unsigned int cpu;
@@ -153,13 +165,6 @@ static void pnv_smp_cpu_kill_self(void)
u64 lpcr_val;
 
/* Standard hot unplug procedure */
-   /*
-* This hard disables local interurpts, ensuring we have no lazy
-* irqs pending.
-*/
-   WARN_ON(irqs_disabled());
-   hard_irq_disable();
-   WARN_ON(lazy_irq_pending());
 
idle_task_exit();
current->active_mm = NULL; /* for sanity */
@@ -172,6 +177,26 @@ static void pnv_smp_cpu_kill_self(void)
if (cpu_has_feature(CPU_FTR_ARCH_207S))
wmask = SRR1_WAKEMASK_P8;
 
+   /*
+* This turns the irq soft-disabled state we're called with, into a
+* hard-disabled state with pending irq_happened interrupts cleared.
+*
+* PACA_IRQ_DEC   - Decrementer should be ignored.
+* PACA_IRQ_HMI   - Can be ignored, processing is done in real mode.
+* PACA_IRQ_DBELL, EE, PMI - Unexpected.
+*/
+   hard_irq_disable();
+   if (generic_check_cpu_restart(cpu))
+   goto out;
+   if (local_paca->irq_happened &
+   (PACA_IRQ_DBELL | PACA_IRQ_EE | PACA_IRQ_PMI)) {
+   if (local_paca->irq_happened & PACA_IRQ_EE)
+   pnv_flush_interrupts();
+   DBG("CPU%d Unexpected exit while offline irq_happened=%lx!\n",
+   cpu, local_paca->irq_happened);
+   }
+   local_paca->irq_happened = PACA_IRQ_HARD_DIS;
+
/*
 * We don't want to take decrementer interrupts while we are
 * offline, so clear LPCR:PECE1. We keep PECE2 (and
@@ -197,6 +222,7 @@ static void pnv_smp_cpu_kill_self(void)
 
srr1 = pnv_cpu_offline(cpu);
 
+   WARN_ON(!irqs_disabled());
WARN_ON(lazy_irq_pending());
 
/*
@@ -212,13 +238,7 @@ static void pnv_smp_cpu_kill_self(void)
 */
if (((srr1 & wmask) == SRR1_WAKEEE) ||
((srr1 & wmask) == SRR1_WAKEHVI)) {
-   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
-   if (xive_enabled())
-   xive_flush_interrupt();
-   else
-   icp_opal_flush_interrupt();
-   } else
-   icp_native_flush_interrupt();
+   pnv_flush_interrupts();
} else if ((srr1 & wmask) == SRR1_WAKEHDBELL) {
unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
asm volatile(PPC_MSGCLR(%0) : : "r" (msg));
@@ -266,7 +286,7 @@ static void pnv_smp_cpu_kill_self(void)
 */
lpcr_val = mfspr(SPRN_LPCR) | (u64)LPCR_PECE1;
pnv_program_cpu_hotplug_lpcr(cpu, lpcr_val);
-
+out:
DBG("CPU%d coming online...\n", cpu);
 }
 
-- 
2.23.0



[PATCH] powerpc/powernv/prd: Allow copying partial data to user space

2019-10-22 Thread Vasant Hegde
Allow copying partial data to user space, so that the opal-prd daemon
can read the message size, reallocate memory, and make another read call
to get the rest of the data.

Cc: Jeremy Kerr 
Cc: Vaidyanathan Srinivasan 
Signed-off-by: Vasant Hegde 
---
 arch/powerpc/platforms/powernv/opal-prd.c | 27 ---
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index 45f4223a790f..dac9d18293d8 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -153,20 +153,15 @@ static __poll_t opal_prd_poll(struct file *file,
 static ssize_t opal_prd_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
 {
-   struct opal_prd_msg_queue_item *item;
+   struct opal_prd_msg_queue_item *item = NULL;
unsigned long flags;
-   ssize_t size, err;
+   ssize_t size;
int rc;
 
/* we need at least a header's worth of data */
if (count < sizeof(item->msg))
return -EINVAL;
 
-   if (*ppos)
-   return -ESPIPE;
-
-   item = NULL;
-
for (;;) {
 
spin_lock_irqsave(&opal_prd_msg_queue_lock, flags);
@@ -190,27 +185,23 @@ static ssize_t opal_prd_read(struct file *file, char 
__user *buf,
}
 
size = be16_to_cpu(item->msg.size);
-   if (size > count) {
-   err = -EINVAL;
+   rc = simple_read_from_buffer(buf, count, ppos, &item->msg, size);
+   if (rc < 0)
goto err_requeue;
-   }
-
-   rc = copy_to_user(buf, &item->msg, size);
-   if (rc) {
-   err = -EFAULT;
+   if (*ppos < size)
goto err_requeue;
-   }
 
+   /* Reset position */
+   *ppos = 0;
kfree(item);
-
-   return size;
+   return rc;
 
 err_requeue:
/* eep! re-queue at the head of the list */
spin_lock_irqsave(&opal_prd_msg_queue_lock, flags);
list_add(&item->list, &opal_prd_msg_queue);
spin_unlock_irqrestore(&opal_prd_msg_queue_lock, flags);
-   return err;
+   return rc;
 }
 
 static ssize_t opal_prd_write(struct file *file, const char __user *buf,
-- 
2.21.0



Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers

2019-10-22 Thread Anshuman Khandual


On 10/22/2019 12:41 PM, Christophe Leroy wrote:
> 
> 
> On 10/21/2019 02:42 AM, Anshuman Khandual wrote:
>> This adds tests which will validate architecture page table helpers and
>> other accessors in their compliance with expected generic MM semantics.
>> This will help various architectures in validating changes to existing
>> page table helpers or addition of new ones.
>>
>> This test covers basic page table entry transformations including but not
>> limited to old, young, dirty, clean, write, write protect etc at various
>> levels along with populating intermediate entries with next page table page
>> and validating them.
>>
>> Test page table pages are allocated from system memory with required size
>> and alignments. The mapped pfns at page table levels are derived from a
>> real pfn representing a valid kernel text symbol. This test gets called
>> right after page_alloc_init_late().
>>
>> This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
>> CONFIG_VM_DEBUG. Architectures willing to subscribe to this test also need to
>> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
>> arm64. Going forward, other architectures too can enable this after fixing
>> build or runtime problems (if any) with their page table helpers.
>>
>> Folks interested in making sure that a given platform's page table helpers
>> conform to expected generic MM semantics should enable the above config
>> which will just trigger this test during boot. Any non-conformity here will
>> be reported as a warning which would need to be fixed. This test will help
>> catch any changes to the agreed upon semantics expected from generic MM and
>> enable platforms to accommodate it thereafter.
>>
>> Cc: Andrew Morton 
>> Cc: Vlastimil Babka 
>> Cc: Greg Kroah-Hartman 
>> Cc: Thomas Gleixner 
>> Cc: Mike Rapoport 
>> Cc: Jason Gunthorpe 
>> Cc: Dan Williams 
>> Cc: Peter Zijlstra 
>> Cc: Michal Hocko 
>> Cc: Mark Rutland 
>> Cc: Mark Brown 
>> Cc: Steven Price 
>> Cc: Ard Biesheuvel 
>> Cc: Masahiro Yamada 
>> Cc: Kees Cook 
>> Cc: Tetsuo Handa 
>> Cc: Matthew Wilcox 
>> Cc: Sri Krishna chowdary 
>> Cc: Dave Hansen 
>> Cc: Russell King - ARM Linux 
>> Cc: Michael Ellerman 
>> Cc: Paul Mackerras 
>> Cc: Martin Schwidefsky 
>> Cc: Heiko Carstens 
>> Cc: "David S. Miller" 
>> Cc: Vineet Gupta 
>> Cc: James Hogan 
>> Cc: Paul Burton 
>> Cc: Ralf Baechle 
>> Cc: Kirill A. Shutemov 
>> Cc: Gerald Schaefer 
>> Cc: Christophe Leroy 
>> Cc: Ingo Molnar 
>> Cc: linux-snps-...@lists.infradead.org
>> Cc: linux-m...@vger.kernel.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-i...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux...@vger.kernel.org
>> Cc: sparcli...@vger.kernel.org
>> Cc: x...@kernel.org
>> Cc: linux-ker...@vger.kernel.org
>>
>> Tested-by: Christophe Leroy     #PPC32
>> Suggested-by: Catalin Marinas 
>> Signed-off-by: Andrew Morton 
>> Signed-off-by: Christophe Leroy 
>> Signed-off-by: Anshuman Khandual 
>> ---
> 
> The cover letter has the exact same title as this patch. I think a cover 
> letter is not necessary for a singleton series.

Right, but it became a singleton series in this version :)

> 
> The history (and any other information you don't want to include in the 
> commit message) can be added here, below the '---'. That way it is in the 
> mail but won't be included in the commit.
I was aware of that, but the change log here was big, hence I chose to
keep it separately in a cover letter. As you said, the cover letter is
probably not required anymore. Will add it here in the patch, next time
around.

> 
>>   .../debug/debug-vm-pgtable/arch-support.txt    |  34 ++
>>   arch/arm64/Kconfig |   1 +
>>   arch/x86/Kconfig   |   1 +
>>   arch/x86/include/asm/pgtable_64.h  |   6 +
>>   include/asm-generic/pgtable.h  |   6 +
>>   init/main.c    |   1 +
>>   lib/Kconfig.debug  |  21 ++
>>   mm/Makefile    |   1 +
>>   mm/debug_vm_pgtable.c  | 388 
>> +
>>   9 files changed, 459 insertions(+)
>>   create mode 100644 
>> Documentation/features/debug/debug-vm-pgtable/arch-support.txt
>>   create mode 100644 mm/debug_vm_pgtable.c
>>
>> diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt 
>> b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
>> new file mode 100644
>> index 000..d6b8185
>> --- /dev/null
>> +++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
>> @@ -0,0 +1,34 @@
>> +#
>> +# Feature name:  debug-vm-pgtable
>> +# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
>> +# description:   arch supports pgtable tests for semantics 
>> compliance
>> +#
>> +    ---

[PATCH v2 1/1] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"

2019-10-22 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently on PSeries Linux guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since they are still
under the control of the Linux guests. Hence, when we change the SMT
modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will
continue to count the values for offlined CPUs in extended cede as if
they are online. This breaks the accounting in tools such as lparstat.

To fix this, ensure that by default the offlined CPUs are returned to
the hypervisor via RTAS "stop-self" call by changing the default value
of "cede_offline_enabled" to false.

Fixes: 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an
appropriate offline state")

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/core-api/cpu_hotplug.rst   |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/cpu_hotplug.rst 
b/Documentation/core-api/cpu_hotplug.rst
index 4a50ab7..5319593 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -53,7 +53,7 @@ Command Line Switches
 ``cede_offline={"off","on"}``
   Use this option to disable/enable putting offlined processors to an extended
   ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
-  ``cede_offline`` is set to "on".
+  ``cede_offline`` is set to "off".
 
   This option is limited to the PowerPC architecture.
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index bbda646..f9d0366 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, 
preferred_offline_state) =
 
 static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;
 
-static bool cede_offline_enabled __read_mostly = true;
+/*
+ * Determines whether the offlined CPUs should be put to a long term
+ * processor cede (called extended cede) for power-saving
+ * purposes. The CPUs in extended cede are still with the Linux Guest
+ * and are not returned to the Hypervisor.
+ *
+ * By default, the offlined CPUs are returned to the hypervisor via
+ * RTAS "stop-self". This behaviour can be changed by passing the
+ * kernel commandline parameter "cede_offline=on".
+ */
+static bool cede_offline_enabled __read_mostly;
 
 /*
  * Enable/disable cede_offline when available.
-- 
1.9.4



[PATCH v2 0/1] pseries/hotplug: Change the default behaviour of cede_offline

2019-10-22 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This is the v2 of the fix to change the default behaviour of cede_offline.
The previous version can be found here: https://lkml.org/lkml/2019/9/12/222

The main change from v1 is that the patch2 to create a sysfs file to
report and control the value of cede_offline_enabled has been dropped.

Problem Description:

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since the CPUs are
still under the control of the Linux Guests. Hence, when we change the
SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
will continue to count the values for offlined CPUs in extended cede
as if they are online.

One of the expectations with PURR is that, for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.

In the following data (Generated using
https://github.com/gautshen/misc/blob/master/purr_tb.py):

SD-PURR = Sum of PURR increments on online CPUs of that core in 1 second
  
SMT=off
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0]51200   69883784
core01 [  8]51200   88782536
core02 [ 16]51200   94296824
core03 [ 24]51200   80951968

SMT=2
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1]  51200   136147792   
core01 [  8,9]  51200   128636784   
core02 [ 16,17] 51200   135426488   
core03 [ 24,25] 51200   153027520   

SMT=4
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3]  51200   258331616   
core01 [  8,9,10,11]51200   274220072   
core02 [ 16,17,18,19]   51200   260013736   
core03 [ 24,25,26,27]   51200   260079672   

SMT=on
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3,4,5,6,7]  51200   512941248   
core01 [  8,9,10,11,12,13,14,15]51200   512936544   
core02 [ 16,17,18,19,20,21,22,23]   51200   512931544   
core03 [ 24,25,26,27,28,29,30,31]   51200   512923800

This patchset addresses this issue by ensuring that by default, the
offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call
by changing the default value of "cede_offline_enabled" to false.

With the patches, we see that the observed value of the sum of the
PURR increments across the the online threads of a core in 1-second
matches the number of tb-ticks in 1-second.

SMT=off
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0]51200512527568  
core01 [  8]51200512556128  
core02 [ 16]51200512590016  
core03 [ 24]51200512589440  

SMT=2
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1]  51200   512635328
core01 [  8,9]  51200   512610416   
core02 [ 16,17] 51200   512639360   
core03 [ 24,25] 51200   512638720   

SMT=4
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3]  51200   512757328   
core01 [  8,9,10,11]51200   512727920   
core02 [ 16,17,18,19]   51200   512754712   
core03 [ 24,25,26,27]   51200   512739040   

SMT=on
==
Core   SD-PURR SD-PURR
   (expected)  (observed)
==

Re: [PATCH 0/7] towards QE support on ARM

2019-10-22 Thread Rasmus Villemoes
On 22/10/2019 04.24, Qiang Zhao wrote:
> On Mon, Oct 22, 2019 at 6:11 AM Leo Li wrote

>> Right.  I'm really interested in getting this applied to my tree and make it
>> upstream.  Zhao Qiang, can you help to review Rasmus's patches and
>> comment?
> 
> As you know, I maintained a similar patchset removing PPC, and someone told 
> me qe_ic should be moved into drivers/irqchip/.
> I also thought qe_ic is an interrupt controller driver and should be moved 
> into drivers/irqchip/.

Yes, and I also plan to do that at some point. However, that's
orthogonal to making the driver build on ARM, so I don't want to mix the
two. Making it usable on ARM is my/our priority currently.

I'd appreciate your input on my patches.

Rasmus



Re: [PATCH 0/7] towards QE support on ARM

2019-10-22 Thread Rasmus Villemoes
On 22/10/2019 00.11, Li Yang wrote:
> On Mon, Oct 21, 2019 at 3:46 AM Rasmus Villemoes
>  wrote:
>>

>>> Can you try the 4.14 branch from a newer LSDK release?  LS1021a should
>>> be supported platform on LSDK.  If it is broken, something is wrong.
>>
>> What newer release? LSDK-18.06-V4.14 is the latest -V4.14 tag at
>> https://github.com/qoriq-open-source/linux.git, and identical to the
> 
> That tree has been abandoned for a while, we probably should state
> that in the github.  The latest tree can be found at
> https://source.codeaurora.org/external/qoriq/qoriq-components/linux/

Ah. FYI, googling "LSDK" gives https://lsdk.github.io as one of the
first hits, and (apart from itself being a github url) that says on the
front page "Disaggregated components of LSDK are available in github.".
But yes, navigating to the Components tab and from there to lsdk linux
one does get directed at codeaurora.

>> In any case, we have zero interest in running an NXP kernel. Maybe I
>> should clarify what I meant by "based on commits from" above: We're
>> currently running a mainline 4.14 kernel on LS1021A, with a few patches
>> inspired from the NXP 4.1 branch applied on top - but also with some
>> manual fixes for e.g. the pvr_version_is() issue. Now we want to move
>> that to a 4.19-based kernel (so that it aligns with our MPC8309 platform).
> 
> We also provide 4.19 based kernel in the codeaurora repo.  I think it
> will be helpful to reuse patches there if you want to make your own
> tree.

Again, we don't want to run off an NXP kernel, we want to get the
necessary pieces upstream. For now, we have to live with a patched 4.19
kernel, but hopefully by the time we switch to 5.x (for some x >= 5) we
don't need to supply anything other than our own .dts and defconfig.

>> Yes, as I said, I wanted to try a fresh approach since Zhao
>> Qiang's patches seemed to be getting nowhere. Splitting the patches into
>> smaller pieces is definitely part of that - for example, the completely
>> trivial whitespace fix in patch 1 is to make sure the later coccinelle
>> generated patch is precisely that (i.e., a later respin can just rerun
>> the coccinelle script, with zero manual fixups). I also want to avoid
>> mixing the ppcism cleanups with other things (e.g. replacing some
>> of_get_property() by of_property_read_u32()). And the "testing on ARM"
>> part comes once I get to actually building on ARM. But there's not much
>> point doing all that unless there's some indication that this can be
>> applied to some tree that actually feeds into Linus', which is why I
>> started with a few trivial patches and precisely to start this discussion.
> 
> Right.  I'm really interested in getting this applied to my tree and
> make it upstream.  Zhao Qiang, can you help to review Rasmus's patches
> and comment?

Thanks, this is exactly what I was hoping for. Even just getting these
first rather trivial patches (in that they don't attempt to build for
ARM, or change functionality at all for PPC) merged for 5.5 would reduce
the amount of out-of-tree patches that we (and NXP for that matter)
would have to carry. I'll take the above as a go-ahead for me to try to
post more patches working towards enabling some of the QE drivers for ARM.

Rasmus


Re: [PATCH 1/3] PM: wakeup: Add routine to help fetch wakeup source object.

2019-10-22 Thread Rafael J. Wysocki
On Tue, Oct 22, 2019 at 9:51 AM Ran Wang  wrote:
>
> Some users might want to go through all registered wakeup sources
> and act accordingly. For example, a SoC PM driver might need to
> do HW programming to prevent powering down a specific IP which a wakeup
> source depends on. So add this API to help walk through all registered
> wakeup source objects on that list and return them one by one.
>
> Signed-off-by: Ran Wang 
> Tested-by: Leonard Crestez 
> ---
> Change in v8
> - Rename wakeup_source_get_next() to wakeup_sources_walk_next().
> - Add wakeup_sources_read_lock() to take over the locking job of
>   wakeup_source_get_start().
> - Rename wakeup_source_get_start() to wakeup_sources_walk_start().
> - Replace wakeup_source_get_stop() with wakeup_sources_read_unlock().
> - Define macro for_each_wakeup_source(ws).
>
> Change in v7:
> - Remove define of member *dev in wake_irq to fix conflict with commit
> c8377adfa781 ("PM / wakeup: Show wakeup sources stats in sysfs"), user
> will use ws->dev->parent instead.
> - Remove '#include ' because it is not used.
>
> Change in v6:
> - Add wakeup_source_get_start() and wakeup_source_get_stop() to align
> with wakeup_sources_stats_seq_start/next/stop.
>
> Change in v5:
> - Update commit message, add description of walking through all wakeup
> source objects.
> - Add SCU protection in function wakeup_source_get_next().
> - Rename wakeup_source member 'attached_dev' to 'dev' and move it up
> (before wakeirq).
>
> Change in v4:
> - None.
>
> Change in v3:
> - Adjust indentation of *attached_dev;.
>
> Change in v2:
> - None.
>
>  drivers/base/power/wakeup.c | 42 ++
>  include/linux/pm_wakeup.h   |  9 +
>  2 files changed, 51 insertions(+)
>
> diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
> index 5817b51..8c7a5f9 100644
> --- a/drivers/base/power/wakeup.c
> +++ b/drivers/base/power/wakeup.c
> @@ -248,6 +248,48 @@ void wakeup_source_unregister(struct wakeup_source *ws)
>  EXPORT_SYMBOL_GPL(wakeup_source_unregister);
>
>  /**
> + * wakeup_sources_read_lock - Lock wakeup source list for read.

Please document the return value.

> + */
> +int wakeup_sources_read_lock(void)
> +{
> +   return srcu_read_lock(&wakeup_srcu);
> +}
> +EXPORT_SYMBOL_GPL(wakeup_sources_read_lock);
> +
> +/**
> + * wakeup_sources_read_unlock - Unlock wakeup source list.

Please document the argument.

> + */
> +void wakeup_sources_read_unlock(int idx)
> +{
> +   srcu_read_unlock(&wakeup_srcu, idx);
> +}
> +EXPORT_SYMBOL_GPL(wakeup_sources_read_unlock);
> +
> +/**
> + * wakeup_sources_walk_start - Begin a walk on wakeup source list

Please document the return value and add a note that the wakeup
sources list needs to be locked for reading for this to be safe.

> + */
> +struct wakeup_source *wakeup_sources_walk_start(void)
> +{
> +   struct list_head *ws_head = &wakeup_sources;
> +
> +   return list_entry_rcu(ws_head->next, struct wakeup_source, entry);
> +}
> +EXPORT_SYMBOL_GPL(wakeup_sources_walk_start);
> +
> +/**
> + * wakeup_sources_walk_next - Get next wakeup source from the list
> + * @ws: Previous wakeup source object

Please add a note that the wakeup sources list needs to be locked for
reading for this to be safe.

> + */
> +struct wakeup_source *wakeup_sources_walk_next(struct wakeup_source *ws)
> +{
> +   struct list_head *ws_head = &wakeup_sources;
> +
> +   return list_next_or_null_rcu(ws_head, &ws->entry,
> +   struct wakeup_source, entry);
> +}
> +EXPORT_SYMBOL_GPL(wakeup_sources_walk_next);
> +
> +/**
>   * device_wakeup_attach - Attach a wakeup source object to a device object.
>   * @dev: Device to handle.
>   * @ws: Wakeup source object to attach to @dev.
> diff --git a/include/linux/pm_wakeup.h b/include/linux/pm_wakeup.h
> index 661efa0..aa3da66 100644
> --- a/include/linux/pm_wakeup.h
> +++ b/include/linux/pm_wakeup.h
> @@ -63,6 +63,11 @@ struct wakeup_source {
> boolautosleep_enabled:1;
>  };
>
> +#define for_each_wakeup_source(ws) \
> +   for ((ws) = wakeup_sources_walk_start();\
> +(ws);  \
> +(ws) = wakeup_sources_walk_next((ws)))
> +
>  #ifdef CONFIG_PM_SLEEP
>
>  /*
> @@ -92,6 +97,10 @@ extern void wakeup_source_remove(struct wakeup_source *ws);
>  extern struct wakeup_source *wakeup_source_register(struct device *dev,
> const char *name);
>  extern void wakeup_source_unregister(struct wakeup_source *ws);
> +extern int wakeup_sources_read_lock(void);
> +extern void wakeup_sources_read_unlock(int idx);
> +extern struct wakeup_source *wakeup_sources_walk_start(void);
> +extern struct wakeup_source *wakeup_sources_walk_next(struct wakeup_source 
> *ws);
>  extern int 

Re: [PATCH 3/3] soc: fsl: add RCPM driver

2019-10-22 Thread Rafael J. Wysocki
On Tue, Oct 22, 2019 at 9:52 AM Ran Wang  wrote:
>
> NXP's QorIQ processors based on ARM cores have an RCPM module
> (Run Control and Power Management), which performs system-level
> tasks associated with power management, such as wakeup source control.
>
> This driver depends on the PM wakeup source framework, which helps to
> collect wakeup information.
>
> Signed-off-by: Ran Wang 
> ---
> Change in v8:
> - Adjust related API usage to meet wakeup.c's update in patch 1/3.
> - Add sanity checking for the case of ws->dev or ws->dev->parent
>   is null.
>
> Change in v7:
> - Replace 'ws->dev' with 'ws->dev->parent' to get aligned with
> c8377adfa781 ("PM / wakeup: Show wakeup sources stats in sysfs")
> - Remove '+obj-y += ftm_alarm.o' since it is wrong.
> - Cosmetic work.
>
> Change in v6:
> - Adjust related API usage to meet wakeup.c's update in patch 1/3.
>
> Change in v5:
> - Fix v4 regression where the return value of wakeup_source_get_next()
> wasn't passed to ws in the while loop.
> - Rename wakeup_source member 'attached_dev' to 'dev'.
> - Rename property 'fsl,#rcpm-wakeup-cells' to 
> '#fsl,rcpm-wakeup-cells'.
> please see https://lore.kernel.org/patchwork/patch/1101022/
>
> Change in v4:
> - Remove extra ',' in author line of rcpm.c
> - Update usage of wakeup_source_get_next() to be less confusing to the
> reader; code logic remains the same.
>
> Change in v3:
> - Some whitespace adjustment.
>
> Change in v2:
> - Rebase Kconfig and Makefile update to latest mainline.
>
>  drivers/soc/fsl/Kconfig  |   8 +++
>  drivers/soc/fsl/Makefile |   1 +
>  drivers/soc/fsl/rcpm.c   | 133 
> +++
>  3 files changed, 142 insertions(+)
>  create mode 100644 drivers/soc/fsl/rcpm.c
>
> diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
> index f9ad8ad..4918856 100644
> --- a/drivers/soc/fsl/Kconfig
> +++ b/drivers/soc/fsl/Kconfig
> @@ -40,4 +40,12 @@ config DPAA2_CONSOLE
>   /dev/dpaa2_mc_console and /dev/dpaa2_aiop_console,
>   which can be used to dump the Management Complex and AIOP
>   firmware logs.
> +
> +config FSL_RCPM
> +   bool "Freescale RCPM support"
> +   depends on PM_SLEEP
> +   help
> + The NXP QorIQ Processors based on ARM Core have RCPM module
> + (Run Control and Power Management), which performs all device-level
> + tasks associated with power management, such as wakeup source 
> control.
>  endmenu
> diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
> index 71dee8d..906f1cd 100644
> --- a/drivers/soc/fsl/Makefile
> +++ b/drivers/soc/fsl/Makefile
> @@ -6,6 +6,7 @@
>  obj-$(CONFIG_FSL_DPAA) += qbman/
>  obj-$(CONFIG_QUICC_ENGINE) += qe/
>  obj-$(CONFIG_CPM)  += qe/
> +obj-$(CONFIG_FSL_RCPM) += rcpm.o
>  obj-$(CONFIG_FSL_GUTS) += guts.o
>  obj-$(CONFIG_FSL_MC_DPIO)  += dpio/
>  obj-$(CONFIG_DPAA2_CONSOLE)+= dpaa2-console.o
> diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c
> new file mode 100644
> index 000..3ed135e
> --- /dev/null
> +++ b/drivers/soc/fsl/rcpm.c
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: GPL-2.0
> +//
> +// rcpm.c - Freescale QorIQ RCPM driver
> +//
> +// Copyright 2019 NXP
> +//
> +// Author: Ran Wang 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define RCPM_WAKEUP_CELL_MAX_SIZE  7
> +
> +struct rcpm {
> +   unsigned intwakeup_cells;
> +   void __iomem*ippdexpcr_base;
> +   boollittle_endian;
> +};
> +

Please add a kerneldoc comment describing this routine.

> +static int rcpm_pm_prepare(struct device *dev)
> +{
> +   int i, ret, idx;
> +   void __iomem *base;
> +   struct wakeup_source*ws;
> +   struct rcpm *rcpm;
> +   struct device_node  *np = dev->of_node;
> +   u32 value[RCPM_WAKEUP_CELL_MAX_SIZE + 1], tmp;
> +
> +   rcpm = dev_get_drvdata(dev);
> +   if (!rcpm)
> +   return -EINVAL;
> +
> +   base = rcpm->ippdexpcr_base;
> +   idx = wakeup_sources_read_lock();
> +
> +   /* Begin with first registered wakeup source */
> +   for_each_wakeup_source(ws) {
> +
> +   /* skip object which is not attached to device */
> +   if (!ws->dev || !ws->dev->parent)
> +   continue;
> +
> +   ret = device_property_read_u32_array(ws->dev->parent,
> +   "fsl,rcpm-wakeup", value,
> +   rcpm->wakeup_cells + 1);
> +
> +   /*  Wakeup source should refer to current rcpm device */
> +   if (ret || (np->phandle != value[0])) {
> +   dev_info(dev, "%s doesn't refer to this rcpm\n",
> +  

Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-22 Thread Christophe Leroy




Le 21/10/2019 à 23:29, Thomas Gleixner a écrit :

On Mon, 21 Oct 2019, Christophe Leroy wrote:


This is a tentative patch to switch the powerpc/32 VDSO to the generic C
implementation. It will likely not work on 64 bits or even build properly
at the moment.

powerpc is a bit special for the VDSO, as well as for system calls, in
that it requires setting the CR SO bit, which cannot be done in C.
Therefore, entry/exit and fallback need to be performed in ASM.

To allow that, C fallbacks just return -1 and the ASM entry point
performs the system call when the C function returns -1.

The performance is rather disappointing. That's most likely because all
calculations in the C implementation are based on 64-bit math and
converted to 32 bits at the very end. I guess the C implementation should
use 32-bit math like the assembly VDSO does as of today.



gettimeofday:vdso: 750 nsec/call (existing ASM implementation)

gettimeofday:vdso: 1533 nsec/call (C implementation)


Small improvement (3%) with the proposed change:

gettimeofday:vdso: 1485 nsec/call

Though still some way to go.

Christophe



The only real 64-bit math which can matter is the 64-bit * 32-bit
multiply, i.e.

static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
 return ((cycles - last) & mask) * mult;
}

Everything else is trivial add/sub/shift, which should be roughly the same
in ASM.

Can you try to replace that with:

static __always_inline
u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
{
        u64 res, delta = ((cycles - last) & mask);
        u32 dh, dl;

        dl = delta;
        dh = delta >> 32;

        res = mul_u32_u32(dl, mult);
        if (dh)
                res += mul_u32_u32(dh, mult) << 32;

        return res;
}

That's pretty much what __do_get_tspec does in ASM.

Thanks,

tglx



Re: [PATCH v4 3/3] powerpc/prom_init: Use -ffreestanding to avoid a reference to bcmp

2019-10-22 Thread Segher Boessenkool
On Mon, Oct 21, 2019 at 10:15:29PM -0700, Nathan Chancellor wrote:
> On Fri, Oct 18, 2019 at 03:02:10PM -0500, Segher Boessenkool wrote:
> > I think the proper solution is for the kernel to *do* use -ffreestanding,
> > and then somehow tell the kernel that memcpy etc. are the standard
> > functions.  A freestanding GCC already requires memcpy, memmove, memset,
> > memcmp, and sometimes abort to exist and do the standard thing; why cannot
> > programs then also rely on it to be the standard functions.
> > 
> > What exact functions are the reason the kernel does not use -ffreestanding?
> > Is it just memcpy?  Is more wanted?
> 
> I think Linus summarized it pretty well here:
> 
> https://lore.kernel.org/lkml/CAHk-=wi-epJZfBHDbKKDZ64us7WkF=lpufhvybmzsteo8q0...@mail.gmail.com/

GCC recognises __builtin_memcpy (or any other __builtin) just fine even
with -ffreestanding.

So the kernel wants a warning (or error) whenever a call to one of these
library functions is generated by the compiler without the user asking
for it directly (via a __builtin)?  And that is all that is needed for
the kernel to use -ffreestanding?

That shouldn't be hard.  Anything missing here?


Segher


[PATCH 2/3] ocxl: Add pseries-specific code

2019-10-22 Thread christophe lombard
pseries.c implements the guest-specific callbacks for the backend API.

The hypervisor calls provide an interface to configure and interact with
OpenCAPI devices. They match the latest version of the 'PAPR changes'
document.

The following hcalls are supported:
H_OCXL_CONFIG_ADAPTER   Used to configure OpenCAPI adapter characteristics.

H_OCXL_CONFIG_SPA   Used to configure the Scheduled Process Area (SPA)
table for an OpenCAPI device.

H_OCXL_GET_FAULT_STATE  Used to retrieve fault information from an
OpenCAPI device.

H_OCXL_HANDLE_FAULT Used to respond to an OpenCAPI fault.

Each of these hcalls takes a config flag parameter, to allow the guest
to manage the OpenCAPI device.

The current values 0xf004 to 0xf007 have been chosen according to the
available QEMU hcall values which are specific to qemu / KVM-on-POWER.

Two parameters (buid and config_addr) are common to all hcalls; they are
used to allow QEMU to recover the PCI device.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/ocxl/Makefile|   1 +
 drivers/misc/ocxl/main.c  |   4 +
 drivers/misc/ocxl/ocxl_internal.h |   1 +
 drivers/misc/ocxl/pseries.c   | 450 ++
 4 files changed, 456 insertions(+)
 create mode 100644 drivers/misc/ocxl/pseries.c

diff --git a/drivers/misc/ocxl/Makefile b/drivers/misc/ocxl/Makefile
index bfdaeb232b83..3474e912c402 100644
--- a/drivers/misc/ocxl/Makefile
+++ b/drivers/misc/ocxl/Makefile
@@ -5,6 +5,7 @@ ocxl-y  += main.o pci.o config.o file.o pasid.o mmio.o
 ocxl-y += link.o context.o afu_irq.o sysfs.o trace.o
 ocxl-y += core.o
 ocxl-$(CONFIG_PPC_POWERNV) += powernv.o
+ocxl-$(CONFIG_PPC_PSERIES) += pseries.o
 
 obj-$(CONFIG_OCXL) += ocxl.o
 
diff --git a/drivers/misc/ocxl/main.c b/drivers/misc/ocxl/main.c
index 95df2ba4d473..bdd9ffa7f769 100644
--- a/drivers/misc/ocxl/main.c
+++ b/drivers/misc/ocxl/main.c
@@ -16,6 +16,10 @@ static int __init init_ocxl(void)
 
if (cpu_has_feature(CPU_FTR_HVMODE))
ocxl_ops = _powernv_ops;
+#ifdef CONFIG_PPC_PSERIES
+   else
+   ocxl_ops = _pseries_ops;
+#endif
 
rc = pci_register_driver(_pci_driver);
if (rc) {
diff --git a/drivers/misc/ocxl/ocxl_internal.h b/drivers/misc/ocxl/ocxl_internal.h
index 2bdea279bdc6..c18b32df3fe5 100644
--- a/drivers/misc/ocxl/ocxl_internal.h
+++ b/drivers/misc/ocxl/ocxl_internal.h
@@ -104,6 +104,7 @@ struct ocxl_backend_ops {
 };
 
 extern const struct ocxl_backend_ops ocxl_powernv_ops;
+extern const struct ocxl_backend_ops ocxl_pseries_ops;
 extern const struct ocxl_backend_ops *ocxl_ops;
 
 int ocxl_create_cdev(struct ocxl_afu *afu);
diff --git a/drivers/misc/ocxl/pseries.c b/drivers/misc/ocxl/pseries.c
new file mode 100644
index ..1d4942d713f7
--- /dev/null
+++ b/drivers/misc/ocxl/pseries.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0+
+// Copyright 2018 IBM Corp.
+#include 
+#include "ocxl_internal.h"
+#include 
+
+#define H_OCXL_CONFIG_ADAPTER  0xf004
+#define H_OCXL_CONFIG_SPA  0xf005
+#define H_OCXL_GET_FAULT_STATE 0xf006
+#define H_OCXL_HANDLE_FAULT0xf007
+
+#define H_CONFIG_ADAPTER_SETUP 1
+#define H_CONFIG_ADAPTER_RELEASE   2
+#define H_CONFIG_ADAPTER_GET_ACTAG 3
+#define H_CONFIG_ADAPTER_GET_PASID 4
+#define H_CONFIG_ADAPTER_SET_TL5
+#define H_CONFIG_ADAPTER_ALLOC_IRQ 6
+#define H_CONFIG_ADAPTER_FREE_IRQ  7
+
+#define H_CONFIG_SPA_SET   1
+#define H_CONFIG_SPA_UPDATE2
+#define H_CONFIG_SPA_REMOVE3
+
+static char *config_adapter_names[] = {
+   "UNKNOWN_OP",   /* 0 undefined */
+   "SETUP",/* 1 */
+   "RELEASE",  /* 2 */
+   "GET_ACTAG",/* 3 */
+   "GET_PASID",/* 4 */
+   "SET_TL",   /* 5 */
+   "ALLOC_IRQ",/* 6 */
+   "FREE_IRQ", /* 7 */
+};
+
+static char *config_spa_names[] = {
+   "UNKNOWN_OP",   /* 0 undefined */
+   "SET",  /* 1 */
+   "UPDATE",   /* 2 */
+   "REMOVE",   /* 3 */
+};
+
+static char *op_str(unsigned int op, char *names[], int len)
+{
+   if (op >= len)
+   return "UNKNOWN_OP";
+   return names[op];
+}
+
+#define OP_STR(op, names)  op_str(op, names, ARRAY_SIZE(names))
+#define OP_STR_CA(op)  OP_STR(op, config_adapter_names)
+#define OP_STR_CS(op)  OP_STR(op, config_spa_names)
+
+#define _PRINT_MSG(rc, format, ...)\
+   {   \
+   if (rc != H_SUCCESS && rc != H_CONTINUE)\
+   pr_err(format, __VA_ARGS__);\
+   else\
+  

[PATCH 0/3] ocxl: Support for an OpenCAPI device in a QEMU guest.

2019-10-22 Thread christophe lombard
This series adds support for an OpenCAPI device in a QEMU guest.

It builds on top of the existing ocxl driver +
http://patchwork.ozlabs.org/patch/1177999/

The ocxl module registers either a pci driver or a platform driver, based on
the environment (bare-metal (powernv) or pseries).

Roughly 4/5 of the code is common between the 2 types of driver:
- PCI implementation
- mmio operations
- link management
- sysfs folders
- page fault and context handling

The differences in implementation are essentially based on the interaction
with the OPAL APIs defined in the host. Several hcalls have been defined
(extensions of the PAPR) to:
- configure the Scheduled Process Area
- get specific AFU information
- allocate irqs
- handle page faults and process elements

When the code needs to call a platform-specific implementation, it does so
through an API. The powernv and pseries implementations each describe
their own definition. See struct ocxl_backend_ops.

It has been tested in a bare-metal and QEMU environment using the memcpy and
the AFP AFUs.

christophe lombard (3):
  ocxl: Introduce implementation-specific API
  ocxl: Add pseries-specific code
  powerpc/pseries: Fixup config space size of OpenCAPI devices

 arch/powerpc/platforms/pseries/pci.c |   9 +
 drivers/misc/ocxl/Makefile   |   3 +
 drivers/misc/ocxl/config.c   |   7 +-
 drivers/misc/ocxl/link.c |  31 +-
 drivers/misc/ocxl/main.c |   9 +
 drivers/misc/ocxl/ocxl_internal.h|  25 ++
 drivers/misc/ocxl/powernv.c  |  88 ++
 drivers/misc/ocxl/pseries.c  | 450 +++
 8 files changed, 603 insertions(+), 19 deletions(-)
 create mode 100644 drivers/misc/ocxl/powernv.c
 create mode 100644 drivers/misc/ocxl/pseries.c

-- 
2.21.0



[PATCH 3/3] powerpc/pseries: Fixup config space size of OpenCAPI devices

2019-10-22 Thread christophe lombard
Fix up the PCI config space size of OpenCAPI PCIe devices in the pseries
environment.
Most OpenCAPI PCIe devices have 4096 bytes of configuration space.

Signed-off-by: Christophe Lombard 
---
 arch/powerpc/platforms/pseries/pci.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c
index 1eae1d09980c..3397784767b0 100644
--- a/arch/powerpc/platforms/pseries/pci.c
+++ b/arch/powerpc/platforms/pseries/pci.c
@@ -291,6 +291,15 @@ static void fixup_winbond_82c105(struct pci_dev* dev)
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_WINBOND, PCI_DEVICE_ID_WINBOND_82C105,
 fixup_winbond_82c105);
 
+static void fixup_opencapi_cfg_size(struct pci_dev *pdev)
+{
+   if (!machine_is(pseries))
+   return;
+
+   pdev->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_IBM, 0x062b, fixup_opencapi_cfg_size);
+
 int pseries_root_bridge_prepare(struct pci_host_bridge *bridge)
 {
struct device_node *dn, *pdn;
-- 
2.21.0



[PATCH 1/3] ocxl: Introduce implementation-specific API

2019-10-22 Thread christophe lombard
The backend API (in ocxl.h) lists some low-level functions whose
implementation is different on bare-metal and in a guest. Each
environment implements its own functions, and the common code uses
them through function pointers, defined in ocxl_backend_ops.

A new powernv.c file is created to call the pnv_ocxl_ API for the
bare-metal environment.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/ocxl/Makefile|  2 +
 drivers/misc/ocxl/config.c|  7 ++-
 drivers/misc/ocxl/link.c  | 31 +--
 drivers/misc/ocxl/main.c  |  5 ++
 drivers/misc/ocxl/ocxl_internal.h | 24 +
 drivers/misc/ocxl/powernv.c   | 88 +++
 6 files changed, 138 insertions(+), 19 deletions(-)
 create mode 100644 drivers/misc/ocxl/powernv.c

diff --git a/drivers/misc/ocxl/Makefile b/drivers/misc/ocxl/Makefile
index d07d1bb8e8d4..bfdaeb232b83 100644
--- a/drivers/misc/ocxl/Makefile
+++ b/drivers/misc/ocxl/Makefile
@@ -4,6 +4,8 @@ ccflags-$(CONFIG_PPC_WERROR)+= -Werror
 ocxl-y += main.o pci.o config.o file.o pasid.o mmio.o
 ocxl-y += link.o context.o afu_irq.o sysfs.o trace.o
 ocxl-y += core.o
+ocxl-$(CONFIG_PPC_POWERNV) += powernv.o
+
 obj-$(CONFIG_OCXL) += ocxl.o
 
 # For tracepoints to include our trace.h from tracepoint infrastructure:
diff --git a/drivers/misc/ocxl/config.c b/drivers/misc/ocxl/config.c
index 7ca0f6744125..981a3bcfe742 100644
--- a/drivers/misc/ocxl/config.c
+++ b/drivers/misc/ocxl/config.c
@@ -1,7 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0+
 // Copyright 2017 IBM Corp.
 #include 
-#include 
 #include 
 #include "ocxl_internal.h"
 
@@ -649,7 +648,7 @@ int ocxl_config_get_actag_info(struct pci_dev *dev, u16 *base, u16 *enabled,
 * avoid an external driver using ocxl as a library to call
 * platform-dependent code
 */
-   rc = pnv_ocxl_get_actag(dev, base, enabled, supported);
+   rc = ocxl_ops->get_actag(dev, base, enabled, supported);
if (rc) {
dev_err(>dev, "Can't get actag for device: %d\n", rc);
return rc;
@@ -673,7 +672,7 @@ EXPORT_SYMBOL_GPL(ocxl_config_set_afu_actag);
 
 int ocxl_config_get_pasid_info(struct pci_dev *dev, int *count)
 {
-   return pnv_ocxl_get_pasid_count(dev, count);
+   return ocxl_ops->get_pasid_count(dev, count);
 }
 
 void ocxl_config_set_afu_pasid(struct pci_dev *dev, int pos, int pasid_base,
@@ -715,7 +714,7 @@ int ocxl_config_set_TL(struct pci_dev *dev, int tl_dvsec)
if (PCI_FUNC(dev->devfn) != 0)
return 0;
 
-   return pnv_ocxl_set_TL(dev, tl_dvsec);
+   return ocxl_ops->set_tl(dev, tl_dvsec);
 }
 EXPORT_SYMBOL_GPL(ocxl_config_set_TL);
 
diff --git a/drivers/misc/ocxl/link.c b/drivers/misc/ocxl/link.c
index e936a3bd5957..9f4d164180a7 100644
--- a/drivers/misc/ocxl/link.c
+++ b/drivers/misc/ocxl/link.c
@@ -5,7 +5,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include "ocxl_internal.h"
 #include "trace.h"
@@ -83,7 +82,7 @@ static void ack_irq(struct ocxl_link *link, enum xsl_response r)
 link->xsl_fault.dsisr,
 link->xsl_fault.dar,
 reg);
-   pnv_ocxl_handle_fault(link->platform_data, reg);
+   ocxl_ops->ack_irq(link->platform_data, reg);
}
 }
 
@@ -146,8 +145,8 @@ static irqreturn_t xsl_fault_handler(int irq, void *data)
int pid;
bool schedule = false;
 
-   pnv_ocxl_get_fault_state(link->platform_data, , ,
-_handle, );
+   ocxl_ops->get_fault_state(link->platform_data, , ,
+ _handle, );
trace_ocxl_fault(pe_handle, dsisr, dar, -1);
 
/* We could be reading all null values here if the PE is being
@@ -282,8 +281,8 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, struct ocxl_link **out_l
INIT_WORK(>xsl_fault.fault_work, xsl_fault_handler_bh);
 
/* platform specific hook */
-   rc = pnv_ocxl_platform_setup(dev, PE_mask, _irq,
->platform_data);
+   rc = ocxl_ops->platform_setup(dev, PE_mask, _irq,
+ >platform_data);
if (rc)
goto err_free;
 
@@ -298,7 +297,7 @@ static int alloc_link(struct pci_dev *dev, int PE_mask, struct ocxl_link **out_l
return 0;
 
 err_xsl_irq:
-   pnv_ocxl_platform_release(link->platform_data);
+   ocxl_ops->platform_release(link->platform_data);
 err_free:
kfree(link);
return rc;
@@ -344,7 +343,7 @@ static void release_xsl(struct kref *ref)
 
list_del(>list);
/* call platform code before releasing data */
-   pnv_ocxl_platform_release(link->platform_data);
+   ocxl_ops->platform_release(link->platform_data);
free_link(link);
 }
 
@@ -378,8 +377,8 @@ 

[PATCH 2/3] Documentation: dt: binding: fsl: Add 'little-endian' and update Chassis define

2019-10-22 Thread Ran Wang
By default, the QorIQ SoC's RCPM register block is Big Endian. But
there are some exceptions, such as LS1088A and LS2088A, which are
Little Endian. So add this optional property to help identify
them.

Actually LS1021A and other Layerscape SoCs don't fully follow Chassis
2.1, so separate them from the PowerPC SoCs.

Signed-off-by: Ran Wang 
Reviewed-by: Rob Herring 
---
Change in v8:
- None.

Change in v7:
- None.

Change in v6:
- None.

Change in v5:
- Add 'Reviewed-by: Rob Herring ' to commit message.
- Rename property 'fsl,#rcpm-wakeup-cells' to '#fsl,rcpm-wakeup-cells'.
please see https://lore.kernel.org/patchwork/patch/1101022/

Change in v4:
- Adjust indentation of 'ls1021a, ls1012a, ls1043a, ls1046a'.

Change in v3:
- None.

Change in v2:
- None.

 Documentation/devicetree/bindings/soc/fsl/rcpm.txt | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
index e284e4e..5a33619 100644
--- a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
+++ b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
@@ -5,7 +5,7 @@ and power management.
 
 Required properites:
   - reg : Offset and length of the register set of the RCPM block.
-  - fsl,#rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the
+  - #fsl,rcpm-wakeup-cells : The number of IPPDEXPCR register cells in the
fsl,rcpm-wakeup property.
   - compatible : Must contain a chip-specific RCPM block compatible string
and (if applicable) may contain a chassis-version RCPM compatible
@@ -20,6 +20,7 @@ Required properites:
* "fsl,qoriq-rcpm-1.0": for chassis 1.0 rcpm
* "fsl,qoriq-rcpm-2.0": for chassis 2.0 rcpm
* "fsl,qoriq-rcpm-2.1": for chassis 2.1 rcpm
+   * "fsl,qoriq-rcpm-2.1+": for chassis 2.1+ rcpm
 
 All references to "1.0" and "2.0" refer to the QorIQ chassis version to
 which the chip complies.
@@ -27,14 +28,19 @@ Chassis Version Example Chips
---------------  -------------
 1.0             p4080, p5020, p5040, p2041, p3041
 2.0             t4240, b4860, b4420
-2.1             t1040, ls1021
+2.1             t1040,
+2.1+            ls1021a, ls1012a, ls1043a, ls1046a
+
+Optional properties:
+ - little-endian : RCPM register block is Little Endian. Without it RCPM
+   will be Big Endian (default case).
 
 Example:
 The RCPM node for T4240:
rcpm: global-utilities@e2000 {
compatible = "fsl,t4240-rcpm", "fsl,qoriq-rcpm-2.0";
reg = <0xe2000 0x1000>;
-   fsl,#rcpm-wakeup-cells = <2>;
+   #fsl,rcpm-wakeup-cells = <2>;
};
 
 * Freescale RCPM Wakeup Source Device Tree Bindings
@@ -44,7 +50,7 @@ can be used as a wakeup source.
 
   - fsl,rcpm-wakeup: Consists of a phandle to the rcpm node and the IPPDEXPCR
register cells. The number of IPPDEXPCR register cells is defined in
-   "fsl,#rcpm-wakeup-cells" in the rcpm node. The first register cell is
+   "#fsl,rcpm-wakeup-cells" in the rcpm node. The first register cell is
the bit mask that should be set in IPPDEXPCR0, and the second register
cell is for IPPDEXPCR1, and so on.
 
-- 
2.7.4



[PATCH 3/3] soc: fsl: add RCPM driver

2019-10-22 Thread Ran Wang
NXP's QorIQ processors based on ARM cores have an RCPM module
(Run Control and Power Management), which performs system-level
tasks associated with power management, such as wakeup source control.

This driver depends on the PM wakeup source framework, which helps to
collect wakeup information.

Signed-off-by: Ran Wang 
---
Change in v8:
- Adjust related API usage to meet wakeup.c's update in patch 1/3.
- Add sanity checking for the case of ws->dev or ws->dev->parent
  is null.

Change in v7:
- Replace 'ws->dev' with 'ws->dev->parent' to get aligned with
c8377adfa781 ("PM / wakeup: Show wakeup sources stats in sysfs")
- Remove '+obj-y += ftm_alarm.o' since it is wrong.
- Cosmetic work.

Change in v6:
- Adjust related API usage to meet wakeup.c's update in patch 1/3.

Change in v5:
- Fix v4 regression where the return value of wakeup_source_get_next()
wasn't passed to ws in the while loop.
- Rename wakeup_source member 'attached_dev' to 'dev'.
- Rename property 'fsl,#rcpm-wakeup-cells' to '#fsl,rcpm-wakeup-cells'.
please see https://lore.kernel.org/patchwork/patch/1101022/

Change in v4:
- Remove extra ',' in author line of rcpm.c
- Update usage of wakeup_source_get_next() to be less confusing to the
reader; code logic remains the same.

Change in v3:
- Some whitespace adjustment.

Change in v2:
- Rebase Kconfig and Makefile update to latest mainline.

 drivers/soc/fsl/Kconfig  |   8 +++
 drivers/soc/fsl/Makefile |   1 +
 drivers/soc/fsl/rcpm.c   | 133 +++
 3 files changed, 142 insertions(+)
 create mode 100644 drivers/soc/fsl/rcpm.c

diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
index f9ad8ad..4918856 100644
--- a/drivers/soc/fsl/Kconfig
+++ b/drivers/soc/fsl/Kconfig
@@ -40,4 +40,12 @@ config DPAA2_CONSOLE
  /dev/dpaa2_mc_console and /dev/dpaa2_aiop_console,
  which can be used to dump the Management Complex and AIOP
  firmware logs.
+
+config FSL_RCPM
+   bool "Freescale RCPM support"
+   depends on PM_SLEEP
+   help
+ The NXP QorIQ Processors based on ARM Core have RCPM module
+ (Run Control and Power Management), which performs all device-level
+ tasks associated with power management, such as wakeup source control.
 endmenu
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index 71dee8d..906f1cd 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -6,6 +6,7 @@
 obj-$(CONFIG_FSL_DPAA) += qbman/
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM)  += qe/
+obj-$(CONFIG_FSL_RCPM) += rcpm.o
 obj-$(CONFIG_FSL_GUTS) += guts.o
 obj-$(CONFIG_FSL_MC_DPIO)  += dpio/
 obj-$(CONFIG_DPAA2_CONSOLE)+= dpaa2-console.o
diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c
new file mode 100644
index 000..3ed135e
--- /dev/null
+++ b/drivers/soc/fsl/rcpm.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// rcpm.c - Freescale QorIQ RCPM driver
+//
+// Copyright 2019 NXP
+//
+// Author: Ran Wang 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define RCPM_WAKEUP_CELL_MAX_SIZE  7
+
+struct rcpm {
+   unsigned intwakeup_cells;
+   void __iomem*ippdexpcr_base;
+   boollittle_endian;
+};
+
+static int rcpm_pm_prepare(struct device *dev)
+{
+   int i, ret, idx;
+   void __iomem *base;
+   struct wakeup_source*ws;
+   struct rcpm *rcpm;
+   struct device_node  *np = dev->of_node;
+   u32 value[RCPM_WAKEUP_CELL_MAX_SIZE + 1], tmp;
+
+   rcpm = dev_get_drvdata(dev);
+   if (!rcpm)
+   return -EINVAL;
+
+   base = rcpm->ippdexpcr_base;
+   idx = wakeup_sources_read_lock();
+
+   /* Begin with first registered wakeup source */
+   for_each_wakeup_source(ws) {
+
+   /* skip object which is not attached to device */
+   if (!ws->dev || !ws->dev->parent)
+   continue;
+
+   ret = device_property_read_u32_array(ws->dev->parent,
+   "fsl,rcpm-wakeup", value,
+   rcpm->wakeup_cells + 1);
+
+   /*  Wakeup source should refer to current rcpm device */
+   if (ret || (np->phandle != value[0])) {
+   dev_info(dev, "%s doesn't refer to this rcpm\n",
+   ws->name);
+   continue;
+   }
+
+   for (i = 0; i < rcpm->wakeup_cells; i++) {
+   /* We can only OR related bits */
+   if (value[i + 1]) {
+   if (rcpm->little_endian) {
+   tmp = ioread32(base + i * 4);
+ 

[PATCH 1/3] PM: wakeup: Add routine to help fetch wakeup source object.

2019-10-22 Thread Ran Wang
Some users might want to go through all registered wakeup sources
and act accordingly. For example, a SoC PM driver might need to
do HW programming to prevent powering down a specific IP block which a
wakeup source depends on. So add this API to help walk through all
registered wakeup source objects on that list and return them one by one.

Signed-off-by: Ran Wang 
Tested-by: Leonard Crestez 
---
Change in v8
- Rename wakeup_source_get_next() to wakeup_sources_walk_next().
- Add wakeup_sources_read_lock() to take over the locking job of
  wakeup_source_get_start().
- Rename wakeup_source_get_start() to wakeup_sources_walk_start().
- Replace wakeup_source_get_stop() with wakeup_sources_read_unlock().
- Define macro for_each_wakeup_source(ws).

Change in v7:
- Remove define of member *dev in wake_irq to fix conflict with commit 
c8377adfa781 ("PM / wakeup: Show wakeup sources stats in sysfs"), user 
will use ws->dev->parent instead.
- Remove '#include ' because it is not used.

Change in v6:
- Add wakeup_source_get_start() and wakeup_source_get_stop() to align
with wakeup_sources_stats_seq_start/next/stop.

Change in v5:
- Update commit message, add decription of walk through all wakeup
source objects.
- Add SRCU protection in function wakeup_source_get_next().
- Rename wakeup_source member 'attached_dev' to 'dev' and move it up
(before wakeirq).

Change in v4:
- None.

Change in v3:
- Adjust indentation of *attached_dev;.

Change in v2:
- None.

 drivers/base/power/wakeup.c | 42 ++
 include/linux/pm_wakeup.h   |  9 +
 2 files changed, 51 insertions(+)

diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
index 5817b51..8c7a5f9 100644
--- a/drivers/base/power/wakeup.c
+++ b/drivers/base/power/wakeup.c
@@ -248,6 +248,48 @@ void wakeup_source_unregister(struct wakeup_source *ws)
 EXPORT_SYMBOL_GPL(wakeup_source_unregister);
 
 /**
+ * wakeup_sources_read_lock - Lock wakeup source list for read.
+ */
+int wakeup_sources_read_lock(void)
+{
+   return srcu_read_lock(_srcu);
+}
+EXPORT_SYMBOL_GPL(wakeup_sources_read_lock);
+
+/**
+ * wakeup_sources_read_unlock - Unlock wakeup source list.
+ */
+void wakeup_sources_read_unlock(int idx)
+{
+   srcu_read_unlock(_srcu, idx);
+}
+EXPORT_SYMBOL_GPL(wakeup_sources_read_unlock);
+
+/**
+ * wakeup_sources_walk_start - Begin a walk on wakeup source list
+ */
+struct wakeup_source *wakeup_sources_walk_start(void)
+{
+   struct list_head *ws_head = _sources;
+
+   return list_entry_rcu(ws_head->next, struct wakeup_source, entry);
+}
+EXPORT_SYMBOL_GPL(wakeup_sources_walk_start);
+
+/**
+ * wakeup_sources_walk_next - Get next wakeup source from the list
+ * @ws: Previous wakeup source object
+ */
+struct wakeup_source *wakeup_sources_walk_next(struct wakeup_source *ws)
+{
+   struct list_head *ws_head = _sources;
+
+   return list_next_or_null_rcu(ws_head, >entry,
+   struct wakeup_source, entry);
+}
+EXPORT_SYMBOL_GPL(wakeup_sources_walk_next);
+
+/**
  * device_wakeup_attach - Attach a wakeup source object to a device object.
  * @dev: Device to handle.
  * @ws: Wakeup source object to attach to @dev.
diff --git a/include/linux/pm_wakeup.h b/include/linux/pm_wakeup.h
index 661efa0..aa3da66 100644
--- a/include/linux/pm_wakeup.h
+++ b/include/linux/pm_wakeup.h
@@ -63,6 +63,11 @@ struct wakeup_source {
boolautosleep_enabled:1;
 };
 
+#define for_each_wakeup_source(ws) \
+   for ((ws) = wakeup_sources_walk_start();\
+(ws);  \
+(ws) = wakeup_sources_walk_next((ws)))
+
 #ifdef CONFIG_PM_SLEEP
 
 /*
@@ -92,6 +97,10 @@ extern void wakeup_source_remove(struct wakeup_source *ws);
 extern struct wakeup_source *wakeup_source_register(struct device *dev,
const char *name);
 extern void wakeup_source_unregister(struct wakeup_source *ws);
+extern int wakeup_sources_read_lock(void);
+extern void wakeup_sources_read_unlock(int idx);
+extern struct wakeup_source *wakeup_sources_walk_start(void);
extern struct wakeup_source *wakeup_sources_walk_next(struct wakeup_source *ws);
 extern int device_wakeup_enable(struct device *dev);
 extern int device_wakeup_disable(struct device *dev);
 extern void device_set_wakeup_capable(struct device *dev, bool capable);
-- 
2.7.4



Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers

2019-10-22 Thread Christophe Leroy




On 10/21/2019 02:42 AM, Anshuman Khandual wrote:

This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.

This test covers basic page table entry transformations including but not
limited to old, young, dirty, clean, write, write protect etc. at various
levels, along with populating intermediate entries with the next page table
page and validating them.

Test page table pages are allocated from system memory with required size
and alignments. The mapped pfns at page table levels are derived from a
real pfn representing a valid kernel text symbol. This test gets called
right after page_alloc_init_late().

This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need
to select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86
and arm64. Going forward, other architectures too can enable this after
fixing build or runtime problems (if any) with their page table helpers.

Folks interested in making sure that a given platform's page table helpers
conform to expected generic MM semantics should enable the above config,
which will just trigger this test during boot. Any non-conformity here will
be reported as a warning which would need to be fixed. This test will help
catch any changes to the agreed-upon semantics expected from generic MM and
enable platforms to accommodate them thereafter.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Ingo Molnar 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org

Tested-by: Christophe Leroy  #PPC32
Suggested-by: Catalin Marinas 
Signed-off-by: Andrew Morton 
Signed-off-by: Christophe Leroy 
Signed-off-by: Anshuman Khandual 
---


The cover letter has the exact same title as this patch. I think a 
cover letter is not necessary for a singleton series.


The history (and any other information you don't want to include in the 
commit message) can be added here, below the '---'. That way it is in 
the mail but won't be included in the commit.



  .../debug/debug-vm-pgtable/arch-support.txt|  34 ++
  arch/arm64/Kconfig |   1 +
  arch/x86/Kconfig   |   1 +
  arch/x86/include/asm/pgtable_64.h  |   6 +
  include/asm-generic/pgtable.h  |   6 +
  init/main.c|   1 +
  lib/Kconfig.debug  |  21 ++
  mm/Makefile|   1 +
  mm/debug_vm_pgtable.c  | 388 +
  9 files changed, 459 insertions(+)
  create mode 100644 Documentation/features/debug/debug-vm-pgtable/arch-support.txt
  create mode 100644 mm/debug_vm_pgtable.c

diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
new file mode 100644
index 000..d6b8185
--- /dev/null
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -0,0 +1,34 @@
+#
+# Feature name:  debug-vm-pgtable
+# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
+# description:   arch supports pgtable tests for semantics compliance
+#
+---
+| arch |status|
+---
+|   alpha: | TODO |
+| arc: | TODO |
+| arm: | TODO |
+|   arm64: |  ok  |
+| c6x: | TODO |
+|csky: | TODO |
+|   h8300: | TODO |
+| hexagon: | TODO |
+|ia64: | TODO |
+|m68k: | TODO |
+|  microblaze: | TODO |
+|mips: | TODO |
+|   nds32: | TODO |
+|   nios2: | TODO |
+|openrisc: | TODO |
+|  parisc: | TODO |
+| powerpc: | TODO |


Say ok on ppc32


+|   riscv: | TODO |
+|s390: | TODO |
+|  sh: | TODO |
+|   sparc: 

Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-10-22 Thread Bharata B Rao
On Fri, Oct 18, 2019 at 8:31 AM Paul Mackerras  wrote:
>
> On Wed, Sep 25, 2019 at 10:36:43AM +0530, Bharata B Rao wrote:
> > Manage migration of pages between normal and secure memory of a secure
> > guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> >
> > H_SVM_PAGE_IN: Move the content of a normal page to a secure page
> > H_SVM_PAGE_OUT: Move the content of a secure page to a normal page
> >
> > Private ZONE_DEVICE memory equal to the amount of secure memory
> > available in the platform for running secure guests is created.
> > Whenever a page belonging to the guest becomes secure, a page from
> > this private device memory is used to represent and track that secure
> > page on the HV side. The movement of pages between normal and secure
> > memory is done via migrate_vma_pages() using UV_PAGE_IN and
> > UV_PAGE_OUT ucalls.
>
> As we discussed privately, but mentioning it here so there is a
> record:  I am concerned about this structure
>
> > +struct kvmppc_uvmem_page_pvt {
> > + unsigned long *rmap;
> > + struct kvm *kvm;
> > + unsigned long gpa;
> > +};
>
> which keeps a reference to the rmap.  The reference could become stale
> if the memslot is deleted or moved, and nothing in the patch series
> ensures that the stale references are cleaned up.

I will add code to release the device PFNs when the memslot goes away.
In fact the early versions of the patchset had this, but it
subsequently got removed.

>
> If it is possible to do without the long-term rmap reference, and
> instead find the rmap via the memslots (with the srcu lock held) each
> time we need the rmap, that would be safer, I think, provided that we
> can sort out the lock ordering issues.

All paths except the fault handler access rmap[] under the srcu lock.
Even in the fault handler, for those faults induced by us (shared page
handling, releasing device PFNs), we do hold the srcu lock. The
difficult case is when we fault due to the HV accessing a device page.
In that case we come to the fault handler with mmap_sem already held
and are not in a position to take the kvm srcu lock, as that would
lead to lock order reversal. Given that we still have pages mapped in,
I assume the memslot can't go away while we access rmap[], so I think
we should be OK here.

However, if that sounds fragile, maybe I can go back to my initial
design, where we weren't using rmap[] to store device PFNs. That will
increase memory usage but gives us an easy option to have a per-guest
mutex to protect concurrent page-ins/outs/faults.

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/


Re: [PATCH] powerpc/64s/exception: Fix kaup -> kuap typo

2019-10-22 Thread Russell Currey
On Tue, 2019-10-22 at 17:06 +1100, Andrew Donnellan wrote:
> It's KUAP, not KAUP. Fix typo in INT_COMMON macro.
> 
> Signed-off-by: Andrew Donnellan 

Acked-by: Russell Currey 



[PATCH] powerpc/64s/exception: Fix kaup -> kuap typo

2019-10-22 Thread Andrew Donnellan
It's KUAP, not KAUP. Fix typo in INT_COMMON macro.

Signed-off-by: Andrew Donnellan 
---
 arch/powerpc/kernel/exceptions-64s.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index d0018dd17e0a..46508b148e16 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -514,7 +514,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
  * If stack=0, then the stack is already set in r1, and r1 is saved in r10.
  * PPR save and CPU accounting is not done for the !stack case (XXX why not?)
  */
-.macro INT_COMMON vec, area, stack, kaup, reconcile, dar, dsisr
+.macro INT_COMMON vec, area, stack, kuap, reconcile, dar, dsisr
.if \stack
andi.   r10,r12,MSR_PR  /* See if coming from user  */
mr  r10,r1  /* Save r1  */
@@ -533,7 +533,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
std r10,GPR1(r1)/* save r1 in stackframe*/
 
.if \stack
-   .if \kaup
+   .if \kuap
kuap_save_amr_and_lock r9, r10, cr1, cr0
.endif
beq 101f/* if from kernel mode  */
@@ -541,7 +541,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
SAVE_PPR(\area, r9)
 101:
.else
-   .if \kaup
+   .if \kuap
kuap_save_amr_and_lock r9, r10, cr1
.endif
.endif
-- 
2.20.1