Re: [PATCH v4 3/7] PCI: Separate VF BAR updates from standard BAR updates

2016-11-29 Thread Gavin Shan
On Tue, Nov 29, 2016 at 08:48:26AM -0600, Bjorn Helgaas wrote:
>On Tue, Nov 29, 2016 at 03:55:46PM +1100, Gavin Shan wrote:
>> On Mon, Nov 28, 2016 at 10:15:06PM -0600, Bjorn Helgaas wrote:
>> >Previously pci_update_resource() used the same code path for updating
>> >standard BARs and VF BARs in SR-IOV capabilities.
>> >
>> >Split the VF BAR update into a new pci_iov_update_resource() internal
>> >interface, which makes it simpler to compute the BAR address (we can get
>> >rid of pci_resource_bar() and pci_iov_resource_bar()).
>> >
>> >This patch:
>> >
>> >  - Renames pci_update_resource() to pci_std_update_resource(),
>> >  - Adds pci_iov_update_resource(),
>> >  - Makes pci_update_resource() a wrapper that calls the appropriate one,
>> >
>> >No functional change intended.
>> >
>> >Signed-off-by: Bjorn Helgaas 
>> 
>> With below minor comments fixed:
>> 
>> Reviewed-by: Gavin Shan 
>> 
>> >---
>> > drivers/pci/iov.c   |   49 +++
>> > drivers/pci/pci.h   |1 +
>> > drivers/pci/setup-res.c |   13 +++-
>> > 3 files changed, 61 insertions(+), 2 deletions(-)
>> >
>> >diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> >index d41ec29..d00ed5c 100644
>> >--- a/drivers/pci/iov.c
>> >+++ b/drivers/pci/iov.c
>> >@@ -571,6 +571,55 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
>> >4 * (resno - PCI_IOV_RESOURCES);
>> > }
>> >
>> >+/**
>> >+ * pci_iov_update_resource - update a VF BAR
>> >+ * @dev: the PCI device
>> >+ * @resno: the resource number
>> >+ *
>> >+ * Update a VF BAR in the SR-IOV capability of a PF.
>> >+ */
>> >+void pci_iov_update_resource(struct pci_dev *dev, int resno)
>> >+{
>> >+   struct pci_sriov *iov = dev->is_physfn ? dev->sriov : NULL;
>> >+   struct resource *res = dev->resource + resno;
>> >+   int vf_bar = resno - PCI_IOV_RESOURCES;
>> >+   struct pci_bus_region region;
>> >+   u32 new;
>> >+   int reg;
>> >+
>> >+   /*
>> >+* The generic pci_restore_bars() path calls this for all devices,
>> >+* including VFs and non-SR-IOV devices.  If this is not a PF, we
>> >+* have nothing to do.
>> >+*/
>> >+   if (!iov)
>> >+   return;
>> >+
>> >+   /*
>> >+* Ignore unimplemented BARs, unused resource slots for 64-bit
>> >+* BARs, and non-movable resources, e.g., those described via
>> >+* Enhanced Allocation.
>> >+*/
>> >+   if (!res->flags)
>> >+   return;
>> >+
>> >+   if (res->flags & IORESOURCE_UNSET)
>> >+   return;
>> >+
>> >+   if (res->flags & IORESOURCE_PCI_FIXED)
>> >+   return;
>> >+
>> >+   pcibios_resource_to_bus(dev->bus, &region, res);
>> >+   new = region.start;
>> >+
>> 
>> The bits indicating the BAR's property (e.g. memory, I/O, etc.) are missing
>> from @new.
>
>Hmm, yes.  I omitted those because those bits are supposed to be
>read-only, per spec (PCI r3.0, sec 6.2.5.1).  But I guess it would be
>more conservative to keep them, and this shouldn't be needlessly
>different from pci_std_update_resource().
>

Yeah, agreed.

>However, I don't think this code in pci_update_resource() is obviously
>correct:
>
>  new = region.start | (res->flags & PCI_REGION_FLAG_MASK);
>
>PCI_REGION_FLAG_MASK is 0xf.  For memory BARs, bits 0-3 are read-only
>property bits.  For I/O BARs, bits 0-1 are read-only and bits 2-3 are
>part of the address, so on the face of it, the above could corrupt two
>bits of an I/O address.
>
>It's true that decode_bar() initializes flags correctly, using
>PCI_BASE_ADDRESS_IO_MASK for I/O BARs and PCI_BASE_ADDRESS_MEM_MASK
>for memory BARs, but it would take a little more digging to be sure
>that we never set bits 2-3 of flags for an I/O resource elsewhere.
>

The BAR's property bits are probed from the device-tree, not the hardware,
on some platforms (e.g. pSeries). Also, there is only one (property)
bit if it's a ROM BAR. So more checks, as below, might be needed, because
the code (without the enhancement) should also work fine.

>How about this in pci_std_update_resource():
>
>pcibios_resource_to_bus(dev->bus, &region, res);
>new = region.start;
>
>if (res->flags & IORESOURCE_IO) {
>mask = (u32)PCI_BASE_ADDRESS_IO_MASK;
>new |= res->flags & ~PCI_BASE_ADDRESS_IO_MASK;
>} else {
>mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
>new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;
>}
>

if (res->flags & IORESOURCE_IO) {
mask = (u32)PCI_BASE_ADDRESS_IO_MASK;
new |= res->flags & ~PCI_BASE_ADDRESS_IO_MASK;
} else if (resno < PCI_ROM_RESOURCE) {
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;
} else if (resno == PCI_ROM_RESOURCE) {
mask = ~((u32)IORESOURCE_ROM_ENABLE);
new |= res->flags & IORESOURCE_ROM_ENABLE;
} else {

Re: [PATCH v4 3/7] PCI: Separate VF BAR updates from standard BAR updates

2016-11-29 Thread Bjorn Helgaas
On Wed, Nov 30, 2016 at 10:20:28AM +1100, Gavin Shan wrote:
> On Tue, Nov 29, 2016 at 08:48:26AM -0600, Bjorn Helgaas wrote:
> >On Tue, Nov 29, 2016 at 03:55:46PM +1100, Gavin Shan wrote:
> >> On Mon, Nov 28, 2016 at 10:15:06PM -0600, Bjorn Helgaas wrote:
> >> >Previously pci_update_resource() used the same code path for updating
> >> >standard BARs and VF BARs in SR-IOV capabilities.
> >> >
> >> >Split the VF BAR update into a new pci_iov_update_resource() internal
> >> >interface, which makes it simpler to compute the BAR address (we can get
> >> >rid of pci_resource_bar() and pci_iov_resource_bar()).
> >> >
> >> >This patch:
> >> >
> >> >  - Renames pci_update_resource() to pci_std_update_resource(),
> >> >  - Adds pci_iov_update_resource(),
> >> >  - Makes pci_update_resource() a wrapper that calls the appropriate one,
> >> >
> >> >No functional change intended.

> >However, I don't think this code in pci_update_resource() is obviously
> >correct:
> >
> >  new = region.start | (res->flags & PCI_REGION_FLAG_MASK);
> >
> >PCI_REGION_FLAG_MASK is 0xf.  For memory BARs, bits 0-3 are read-only
> >property bits.  For I/O BARs, bits 0-1 are read-only and bits 2-3 are
> >part of the address, so on the face of it, the above could corrupt two
> >bits of an I/O address.
> >
> >It's true that decode_bar() initializes flags correctly, using
> >PCI_BASE_ADDRESS_IO_MASK for I/O BARs and PCI_BASE_ADDRESS_MEM_MASK
> >for memory BARs, but it would take a little more digging to be sure
> >that we never set bits 2-3 of flags for an I/O resource elsewhere.
> >
> 
> The BAR's property bits are probed from the device-tree, not the hardware,
> on some platforms (e.g. pSeries). Also, there is only one (property)
> bit if it's a ROM BAR. So more checks, as below, might be needed, because
> the code (without the enhancement) should also work fine.

Ah, right, I forgot about that.  I didn't do enough digging :)

> >How about this in pci_std_update_resource():
> >
> >pcibios_resource_to_bus(dev->bus, &region, res);
> >new = region.start;
> >
> >if (res->flags & IORESOURCE_IO) {
> >mask = (u32)PCI_BASE_ADDRESS_IO_MASK;
> >new |= res->flags & ~PCI_BASE_ADDRESS_IO_MASK;
> >} else {
> >mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
> >new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;
> >}
> >
> 
>   if (res->flags & IORESOURCE_IO) {
>   mask = (u32)PCI_BASE_ADDRESS_IO_MASK;
>   new |= res->flags & ~PCI_BASE_ADDRESS_IO_MASK;
>   } else if (resno < PCI_ROM_RESOURCE) {
>   mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
>   new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;
>   } else if (resno == PCI_ROM_RESOURCE) {
>   mask = ~((u32)IORESOURCE_ROM_ENABLE);
>   new |= res->flags & IORESOURCE_ROM_ENABLE;
>   } else {
>   dev_warn(&dev->dev, "BAR#%d out of range\n", resno);
>   return;
>   }

After this patch, the only thing we OR into a ROM BAR value is
PCI_ROM_ADDRESS_ENABLE, and that's done below, only if the ROM is
already enabled.

I did update the ROM mask (to PCI_ROM_ADDRESS_MASK).  I'm not 100%
sure about doing that -- it follows the spec, but it is a change from
what we've been doing before.  I guess it should be safe because it
means we're checking fewer bits than before (only the top 21 bits for
ROMs, where we used to check the top 28), so the only possible difference
is that we might not warn about "error updating" in some case where we
used to.

I'm not really sure about the value of the "error updating" checks to
begin with, though I guess it does help us find broken devices that
put non-BARs where BARs are supposed to be.

Bjorn


[PATCH] powerpc/radix/mm: Fixup storage key mm fault

2016-11-29 Thread Balbir Singh

Aneesh/Ben reported that the change to do_page_fault() needs
to handle the case where CPU_FTR_COHERENT_ICACHE is missing
but we have CPU_FTR_NOEXECUTE. In those cases the check
added for SRR1_ISI_N_OR_G might trigger a false positive.

This patch checks for CPU_FTR_COHERENT_ICACHE in addition to the MSR value.

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Balbir Singh 
---

 Applies on top of powerpc/next and I've not added a fixes tag

 arch/powerpc/mm/fault.c | 9 -
 drivers/crypto/Makefile | 1 -
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a17029aa..eab3ded 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -392,8 +392,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec) {
/*
 * An execution fault + no execute ?
+* We need to check for CPU_FTR_COHERENT_ICACHE, since
+* on some variants, an NX fault is taken and
+* hash_page_do_lazy_icache() does the fixup. Without the
+* check for CPU_FTR_COHERENT_ICACHE we could have a false
+* positive if we have !CPU_FTR_COHERENT_ICACHE and
+* CPU_FTR_NOEXECUTE
 */
-   if (regs->msr & SRR1_ISI_N_OR_G)
+   if (cpu_has_feature(CPU_FTR_COHERENT_ICACHE) &&
+   (regs->msr & SRR1_ISI_N_OR_G))
goto bad_area;
 
/*
diff --git a/drivers/crypto/Makefile b/drivers/crypto/Makefile
index ad7250f..3c6432d 100644
--- a/drivers/crypto/Makefile
+++ b/drivers/crypto/Makefile
@@ -31,4 +31,3 @@ obj-$(CONFIG_CRYPTO_DEV_QCE) += qce/
 obj-$(CONFIG_CRYPTO_DEV_VMX) += vmx/
 obj-$(CONFIG_CRYPTO_DEV_SUN4I_SS) += sunxi-ss/
 obj-$(CONFIG_CRYPTO_DEV_ROCKCHIP) += rockchip/
-obj-$(CONFIG_CRYPTO_DEV_CHELSIO) += chelsio/
-- 
2.5.5



[PATCH] powerpc/opal-irqchip: Use interrupt names if present

2016-11-29 Thread Benjamin Herrenschmidt
Recent versions of OPAL will be able to provide names for the various
OPAL interrupts via a new "opal-interrupts-names" property. So let's
use them to make /proc/interrupts more informative.

This also modernises the code that fetches the interrupt array to use
the helpers provided by the generic code instead of hand-parsing the
property.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/platforms/powernv/opal-irqchip.c | 45 ---
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c b/arch/powerpc/platforms/powernv/opal-irqchip.c
index 998316b..fe9b029 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -183,8 +183,9 @@ void opal_event_shutdown(void)
 int __init opal_event_init(void)
 {
struct device_node *dn, *opal_node;
-   const __be32 *irqs;
-   int i, irqlen, rc = 0;
+   const char **names;
+   u32 *irqs;
+   int i, rc = 0;
 
opal_node = of_find_node_by_path("/ibm,opal");
if (!opal_node) {
@@ -209,37 +210,57 @@ int __init opal_event_init(void)
goto out;
}
 
-   /* Get interrupt property */
-   irqs = of_get_property(opal_node, "opal-interrupts", &irqlen);
-   opal_irq_count = irqs ? (irqlen / 4) : 0;
+   /* Get opal-interrupts property and names if present */
+   rc = of_property_count_u32_elems(opal_node, "opal-interrupts");
+   if (rc < 0)
+   goto out;
+   opal_irq_count = rc;
pr_debug("Found %d interrupts reserved for OPAL\n", opal_irq_count);
+   irqs = kzalloc(rc * sizeof(u32), GFP_KERNEL);
+   if (WARN_ON(!irqs))
+   goto out;
+   rc = of_property_read_u32_array(opal_node, "opal-interrupts",
+   irqs, opal_irq_count);
+   if (rc < 0) {
+   pr_err("Error %d reading opal-interrupts array\n", rc);
+   goto out;
+   }
+   names = kzalloc(opal_irq_count * sizeof(char *), GFP_KERNEL);
+   of_property_read_string_array(opal_node, "opal-interrupts-names",
+ names, opal_irq_count);
 
/* Install interrupt handlers */
opal_irqs = kcalloc(opal_irq_count, sizeof(*opal_irqs), GFP_KERNEL);
-   for (i = 0; irqs && i < opal_irq_count; i++, irqs++) {
-   unsigned int irq, virq;
+   for (i = 0; i < opal_irq_count; i++) {
+   unsigned int virq;
+   char *name;
 
/* Get hardware and virtual IRQ */
-   irq = be32_to_cpup(irqs);
-   virq = irq_create_mapping(NULL, irq);
+   virq = irq_create_mapping(NULL, irqs[i]);
if (!virq) {
-   pr_warn("Failed to map irq 0x%x\n", irq);
+   pr_warn("Failed to map irq 0x%x\n", irqs[i]);
continue;
}
+   if (names && names[i] && strlen(names[i]))
+   name = kasprintf(GFP_KERNEL, "opal-%s", names[i]);
+   else
+   name = kasprintf(GFP_KERNEL, "opal");
 
/* Install interrupt handler */
rc = request_irq(virq, opal_interrupt, IRQF_TRIGGER_LOW,
-"opal", NULL);
+name, NULL);
if (rc) {
irq_dispose_mapping(virq);
pr_warn("Error %d requesting irq %d (0x%x)\n",
-rc, virq, irq);
+rc, virq, irqs[i]);
continue;
}
 
/* Cache IRQ */
opal_irqs[i] = virq;
}
+   kfree(irqs);
+   kfree(names);
 
 out:
of_node_put(opal_node);



Re: [PATCH v7 3/7] powerpc/mm: Introduce _PAGE_LARGE software pte bits

2016-11-29 Thread Balbir Singh


On 28/11/16 17:17, Aneesh Kumar K.V wrote:
> This patch adds a new software defined pte bit. We use the reserved
> fields of ISA 3.0 pte definition since we will only be using this
> on DD1 code paths. We can possibly look at removing this code later.
> 
> The software bit will be used to differentiate between 64K/4K and 2M ptes.
> This helps in finding the page size mapped by a pte so that we can do an
> efficient tlb flush.
> 
> We don't support 1G hugetlb pages yet. So we add a DEBUG WARN_ON to catch
> wrong usage.
> 

I thought we do? In hugetlb_page_init(), don't we register sizes for every
size from 0 to MMU_PAGE_COUNT?

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/hugetlb.h | 20 
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  9 +
>  arch/powerpc/include/asm/book3s/64/radix.h   |  2 ++
>  3 files changed, 31 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
> index d9c283f95e05..c62f14d0bec1 100644
> --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
> +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
> @@ -30,4 +30,24 @@ static inline int hstate_get_psize(struct hstate *hstate)
>   return mmu_virtual_psize;
>   }
>  }
> +
> +#define arch_make_huge_pte arch_make_huge_pte
> +static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
> +struct page *page, int writable)
> +{
> + unsigned long page_shift;
> +
> + if (!cpu_has_feature(CPU_FTR_POWER9_DD1))
> + return entry;
> +
> + page_shift = huge_page_shift(hstate_vma(vma));
> + /*
> +  * We don't support 1G hugetlb pages yet.
> +  */
> + VM_WARN_ON(page_shift == mmu_psize_defs[MMU_PAGE_1G].shift);
> + if (page_shift == mmu_psize_defs[MMU_PAGE_2M].shift)
> + return __pte(pte_val(entry) | _PAGE_LARGE);
> + else
> + return entry;
> +}
>  #endif
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 86870c11917b..6f39b9d134a2 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -26,6 +26,11 @@
>  #define _RPAGE_SW1   0x00800
>  #define _RPAGE_SW2   0x00400
>  #define _RPAGE_SW3   0x00200
> +#define _RPAGE_RSV1  0x1000000000000000UL
> +#define _RPAGE_RSV2  0x0800000000000000UL
> +#define _RPAGE_RSV3  0x0400000000000000UL
> +#define _RPAGE_RSV4  0x0200000000000000UL
> +

We use the top 4 bits and not the _SW bits?

>  #ifdef CONFIG_MEM_SOFT_DIRTY
>  #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */
>  #else
> @@ -34,6 +39,10 @@
>  #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */
>  #define _PAGE_DEVMAP _RPAGE_SW1
>  #define __HAVE_ARCH_PTE_DEVMAP
> +/*
> + * For DD1 only, we need to track whether the pte is huge.

For POWER9_DD1 only

> + */
> +#define _PAGE_LARGE  _RPAGE_RSV1
>  
>  
>  #define _PAGE_PTE (1ul << 62) /* distinguishes PTEs from pointers */
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> index 2a46dea8e1b1..d2c5c064e266 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -243,6 +243,8 @@ static inline int radix__pmd_trans_huge(pmd_t pmd)
>  
>  static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
>  {
> + if (cpu_has_feature(CPU_FTR_POWER9_DD1))
> + return __pmd(pmd_val(pmd) | _PAGE_PTE | _PAGE_LARGE);
>   return __pmd(pmd_val(pmd) | _PAGE_PTE);
>  }
>  static inline void radix__pmdp_huge_split_prepare(struct vm_area_struct *vma,
> 


Re: [PATCH v7 3/7] powerpc/mm: Introduce _PAGE_LARGE software pte bits

2016-11-29 Thread Benjamin Herrenschmidt
On Wed, 2016-11-30 at 11:14 +1100, Balbir Singh wrote:
> > +#define _RPAGE_RSV1  0x1000000000000000UL
> > +#define _RPAGE_RSV2  0x0800000000000000UL
> > +#define _RPAGE_RSV3  0x0400000000000000UL
> > +#define _RPAGE_RSV4  0x0200000000000000UL
> > +
> 
> We use the top 4 bits and not the _SW bits?

Correct, welcome to the discussion we've been having for the last 2 weeks
:-)

We use those bits because we are otherwise short on SW bits (we still
need _PAGE_DEVMAP etc...). We know P9 DD1 is supposed to ignore the
reserved bits, so it's a good placeholder.

Cheers,
Ben.


Re: [PATCH v11 0/8] powerpc: Implement kexec_file_load()

2016-11-29 Thread Thiago Jung Bauermann
Hello Andrew,

On Tuesday, 29 November 2016, 13:45:18 BRST, Andrew Morton wrote:
> On Tue, 29 Nov 2016 23:45:46 +1100, Michael Ellerman wrote:
> > This is v11 of the kexec_file_load() for powerpc series.
> > 
> > I've stripped this down to the minimum we need, so we can get this in for
> > 4.10. Any additions can come later incrementally.
> 
> This made a bit of a mess of Mimi's series "ima: carry the
> measurement list across kexec v10".
> 
> powerpc-ima-get-the-kexec-buffer-passed-by-the-previous-kernel.patch
> ima-on-soft-reboot-restore-the-measurement-list.patch
> ima-permit-duplicate-measurement-list-entries.patch
> ima-maintain-memory-size-needed-for-serializing-the-measurement-list.patch
> powerpc-ima-send-the-kexec-buffer-to-the-next-kernel.patch
> ima-on-soft-reboot-save-the-measurement-list.patch
> ima-store-the-builtin-custom-template-definitions-in-a-list.patch
> ima-support-restoring-multiple-template-formats.patch
> ima-define-a-canonical-binary_runtime_measurements-list-format.patch
> ima-platform-independent-hash-value.patch
> 
> I made the syntactic fixes but I won't be testing it.

Sorry about that. We are preparing an updated version rebased on Michael's 
patches to address that.

Just to explain where v11 is coming from:

kexec_file_load v11 uses a minimal purgatory taken from kexec-lite, resulting
in a purgatory object without relocations. This avoids the need for a lot of
code to process purgatory relocations, which is the problem I was trying to
address in the past couple of versions of this patch series.

The new purgatory also doesn't need the kernel to set global variables to tell 
it where the stack, TOC and OPAL entrypoint are, so that code was dropped from 
setup_purgatory.

The other change was to move the code in elf_util.[ch] into kexec_elf_64.c, 
with no actual code change.

> > If no one objects I'll merge this via the powerpc tree. The three kexec
> > patches have been acked by Dave Young (since forever), and have been in
> > linux-next (via akpm's tree) also for a long time.
> 
> OK, I'll wait for these to appear in -next and I will await advice on

Mimi and I would like to thank you for your support and help with these 
patches, Andrew.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [RFC] fs: add userspace critical mounts event support

2016-11-29 Thread Luis R. Rodriguez
On Tue, Nov 29, 2016 at 10:10:56PM +0100, Tom Gundersen wrote:
> On Tue, Nov 15, 2016 at 10:28 AM, Johannes Berg wrote:
> > My argument basically goes like this:
> >
> > First, given good drivers (i.e. using request_firmware_nowait())
> > putting firmware even for a built-in driver into initramfs or not
> > should be a system integrator decision. If they don't need the device
> > that early, it should be possible for them to delay it. Or, perhaps, if
> > the firmware is too big, etc. I'm sure we can all come up with more
> > examples of why you'd want to do it one way or another.
> 
> This is how I understood the situation, but I never quite bought
> it. What is wrong with the kernel saying "you must put your module and
> your firmware together"? Sure, people may want to do things
> differently, but what is the real blocker?

0) Firmware upgrades are possible
1) Some firmware is optional
2) Firmware licenses may often not be GPLv2 compatible
3) Some firmware may be stupidly large (remoteproc); as such,
   neither built-in firmware nor using the firmware in initramfs
   is reasonable.

But note that Johannes' main point was that today only a few
properly constructed drivers use async fw requests, and furthermore,
given the lack of a deterministic final-rootfs signal, his proposal
was to address the missing semantics between kernel and
userspace with a firmware kobject uevent fallback helper. This
fallback helper would not give a firm "not found" reply for
files until it knows all rootfs firmware paths are ready.

> Fundamentally, it seems to me that if a module needs firmware, it
> makes no sense to make the module available before the firmware. I'm
> probably missing something though :)

You are right but just consider all the above.

  Luis


Re: [PATCH v11 0/8] powerpc: Implement kexec_file_load()

2016-11-29 Thread Andrew Morton
On Tue, 29 Nov 2016 23:45:46 +1100 Michael Ellerman  wrote:

> This is v11 of the kexec_file_load() for powerpc series.
> 
> I've stripped this down to the minimum we need, so we can get this in for 
> 4.10.
> Any additions can come later incrementally.

This made a bit of a mess of Mimi's series "ima: carry the
measurement list across kexec v10".

powerpc-ima-get-the-kexec-buffer-passed-by-the-previous-kernel.patch
ima-on-soft-reboot-restore-the-measurement-list.patch
ima-permit-duplicate-measurement-list-entries.patch
ima-maintain-memory-size-needed-for-serializing-the-measurement-list.patch
powerpc-ima-send-the-kexec-buffer-to-the-next-kernel.patch
ima-on-soft-reboot-save-the-measurement-list.patch
ima-store-the-builtin-custom-template-definitions-in-a-list.patch
ima-support-restoring-multiple-template-formats.patch
ima-define-a-canonical-binary_runtime_measurements-list-format.patch
ima-platform-independent-hash-value.patch

I made the syntactic fixes but I won't be testing it.

> If no one objects I'll merge this via the powerpc tree. The three kexec 
> patches
> have been acked by Dave Young (since forever), and have been in linux-next 
> (via
> akpm's tree) also for a long time.

OK, I'll wait for these to appear in -next and I will await advice on 


Re: [RFC] fs: add userspace critical mounts event support

2016-11-29 Thread Luis R. Rodriguez
On Wed, Nov 09, 2016 at 03:21:07AM -0800, Andy Lutomirski wrote:
> On Wed, Nov 9, 2016 at 1:13 AM, Daniel Wagner wrote:
> > [CC: added Harald]
> >
> > As Harald pointed out over a beer yesterday evening, there is at least
> > one more reason why UMH isn't obsolete. The ordering of the firmware loading
> > might be important. Say you want to greet the user with a splash screen
> > really early on; the graphics card firmware should be loaded first. Also the
> > automotive world has this fancy requirement that the rear camera must be on
> > the screen within 2 seconds. So controlling the firmware loading order is of
> > importance (e.g. also do not overcommit the I/O bandwidth with not-so-important
> > firmware). A user space helper is able to prioritize the requests
> > according to the use case.
> 
> That seems like a valid problem, but I don't think that UMH adequately
> solves it.  Sure, loading firmware in the right order avoids a >2sec
> delay due to firmware loading, but what happens when you have a slow
> USB device that *doesn't* need firmware plugged in to your car's shiny
> USB port when you start the car?
> 
> It seems to me that this use case requires explicit control over
> device probing and, if that gets added, you get your firmware ordering
> for free (just probe the important devices first).

In theory this is correct; the problem comes with the flexibility we have
created with pivot_root() and friends (another is a mount on /lib/firmware),
which enables system integrators to pick and choose a "real rootfs" a few
layers away from the first fs picked up by the kernel. In providing this
flexibility we did not envision, nor have we devised, signals to enable a
deterministic lookup, given the requirements such lookups might have:
in this case, that the final fs is ready and kosher and that all possible
firmware paths are ready. As you can imagine, this race is not only an
issue for firmware but a generic issue.

The generic race on the fs lookup requires an fs-availability event, and
addressing fs suspend. I'll note that the race on init is addressed today
*only* by the firmware UMH (its UMH is the kobject uevent and optionally a
custom binary) by using the UMH lock. During a cleanup by Daniel recently I
realized it was bogus to take the UMH lock if the UMH was not used; it turns
out, though, that this would still expose the direct fs lookup to a race. This
begs the question of whether the UMH lock should be removed / shared with the
other kernel UMHs, or a generic solution provided for the direct fs lookup
with some requirements specified.

This is all a mess, so I've documented each component and the issues / ideas
we've discussed so far separately: the firmware UMH (which we should
probably rebrand as the firmware kobject uevent helper, to avoid confusion)
[0], the real kernel usermode helper [1], and the new common kernel file
loader [2].

[0] https://kernelnewbies.org/KernelProjects/firmware-class-enhancements
[1] https://kernelnewbies.org/KernelProjects/usermode-helper-enhancements
[2] https://kernelnewbies.org/KernelProjects/common-kernel-loader

  Luis


Re: powerpc/ps3: Fix system hang with GCC 5 builds

2016-11-29 Thread Greg KH
On Tue, Nov 29, 2016 at 10:47:32AM -0800, Geoff Levand wrote:
> GCC 5 generates different code for this bootwrapper null check
> that causes the PS3 to hang very early in its bootup.  This
> check is of limited value, so just get rid of it.
> 
> Signed-off-by: Geoff Levand 
> ---
>  arch/powerpc/boot/ps3-head.S | 5 -
>  arch/powerpc/boot/ps3.c  | 8 +---
>  2 files changed, 1 insertion(+), 12 deletions(-)



This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
for how to do this properly.




Re: [RFC] fs: add userspace critical mounts event support

2016-11-29 Thread Tom Gundersen
On Tue, Nov 15, 2016 at 10:28 AM, Johannes Berg wrote:
> My argument basically goes like this:
>
> First, given good drivers (i.e. using request_firmware_nowait())
> putting firmware even for a built-in driver into initramfs or not
> should be a system integrator decision. If they don't need the device
> that early, it should be possible for them to delay it. Or, perhaps, if
> the firmware is too big, etc. I'm sure we can all come up with more
> examples of why you'd want to do it one way or another.

This is how I understood the situation, but I never quite bought
it. What is wrong with the kernel saying "you must put your module and
your firmware together"? Sure, people may want to do things
differently, but what is the real blocker?

Fundamentally, it seems to me that if a module needs firmware, it
makes no sense to make the module available before the firmware. I'm
probably missing something though :)

Cheers,

Tom


Re: [PATCH v7 2/7] powerpc/mm/hugetlb: Handle hugepage size supported by hash config

2016-11-29 Thread Balbir Singh


On 28/11/16 17:16, Aneesh Kumar K.V wrote:
> W.r.t hash page table config, we support 16MB and 16GB as the hugepage
> size. Update the hstate_get_psize to handle 16M and 16G.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/hugetlb.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
> index 499268045306..d9c283f95e05 100644
> --- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
> +++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
> @@ -21,6 +21,10 @@ static inline int hstate_get_psize(struct hstate *hstate)
>   return MMU_PAGE_2M;
>   else if (shift == mmu_psize_defs[MMU_PAGE_1G].shift)
>   return MMU_PAGE_1G;
> + else if (shift == mmu_psize_defs[MMU_PAGE_16M].shift)
> + return MMU_PAGE_16M;
> + else if (shift == mmu_psize_defs[MMU_PAGE_16G].shift)
> + return MMU_PAGE_16G;

Could we reorder this?

We check for 2M, 1G, 16M and 16G. The likely sizes are
2M and 16M. Can we have those up front, so that the order of checks
is 2M, 16M, 1G and 16G?

Balbir


[PATCH] powerpc/8xx: xmon compile fix

2016-11-29 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/xmon/xmon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 7605455..435f5f5 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1213,10 +1213,13 @@ bpt_cmds(void)
 {
int cmd;
unsigned long a;
-   int mode, i;
+   int i;
struct bpt *bp;
+#ifndef CONFIG_8xx
+   int mode;
const char badaddr[] = "Only kernel addresses are permitted "
"for breakpoints\n";
+#endif
 
cmd = inchar();
switch (cmd) {
-- 
2.10.2



Re: [PATCH] powerpc/8xx: xmon compile fix

2016-11-29 Thread Christophe LEROY



On 29/11/2016 at 09:56, Nicholas Piggin wrote:

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/xmon/xmon.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 7605455..435f5f5 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -1213,10 +1213,13 @@ bpt_cmds(void)
 {
int cmd;
unsigned long a;
-   int mode, i;
+   int i;
struct bpt *bp;
+#ifndef CONFIG_8xx


CONFIG_8xx is deprecated (ref arch/powerpc/platforms/Kconfig.cputype).
CONFIG_PPC_8xx should be used instead.


+   int mode;


You could also have moved this declaration inside the switch block,
something like:


switch (cmd) {
#ifndef CONFIG_8xx
+   int mode;
case 'd':


Christophe


const char badaddr[] = "Only kernel addresses are permitted "
"for breakpoints\n";
+#endif

cmd = inchar();
switch (cmd) {



Re: [PATCH] powerpc/8xx: xmon compile fix

2016-11-29 Thread Nicholas Piggin
On Tue, 29 Nov 2016 10:06:43 +0100, Christophe LEROY wrote:

> On 29/11/2016 at 09:56, Nicholas Piggin wrote:
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/xmon/xmon.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> > index 7605455..435f5f5 100644
> > --- a/arch/powerpc/xmon/xmon.c
> > +++ b/arch/powerpc/xmon/xmon.c
> > @@ -1213,10 +1213,13 @@ bpt_cmds(void)
> >  {
> > int cmd;
> > unsigned long a;
> > -   int mode, i;
> > +   int i;
> > struct bpt *bp;
> > +#ifndef CONFIG_8xx  
> 
> CONFIG_8xx is deprecated (ref arch/powerpc/platforms/Kconfig.cputype).
> CONFIG_PPC_8xx should be used instead.

Thanks for picking that up. Michael, can you adjust it if you merge
please?

> 
> > +   int mode;  
> 
> You could also have moved this declaration inside the switch block,
> something like:

I tried that, but couldn't decide whether it was better (you also need badaddr).

Thanks,
Nick


[PATCH 2/2] powerpc/8xx: Implement hw_breakpoint

2016-11-29 Thread Christophe Leroy
This patch implements HW breakpoints on the 8xx. The 8xx has the
capability to manage HW breakpoints, which is slightly different
from BOOK3S:
1/ The breakpoint match doesn't trigger a DSI exception but a
dedicated data breakpoint exception.
2/ The breakpoint happens after the instruction has completed,
no need to single step or emulate the instruction,
3/ Matched address is not set in DAR but in BAR,
4/ DABR register doesn't exist, instead we have registers
LCTRL1, LCTRL2 and CMPx registers,
5/ The match on one comparator is not on a double word but
on a single word.

The patch does:
1/ Prepare the dedicated registers in the call to __set_dabr(). In order
to emulate the double word handling of BOOK3S, comparator E is set to
the DABR address value and comparator F to address + 4. Then breakpoint 1
is set to match comparator E or F,
2/ Skip the singlestepping stage when compiled for CONFIG_PPC_8xx,
3/ Implement the exception. In that exception, the matched address
is taken from SPRN_BAR and managed as if it were from SPRN_DAR.
4/ The I/D TLB error exception routines perform a tlbie on bad TLBs. That
tlbie triggers the breakpoint exception when performed on the
breakpoint address. For this reason, the routine returns if the match
is from one of those two tlbies.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig|  2 +-
 arch/powerpc/include/asm/reg_8xx.h  |  7 +++
 arch/powerpc/kernel/head_8xx.S  | 28 +++-
 arch/powerpc/kernel/hw_breakpoint.c |  6 +-
 arch/powerpc/kernel/process.c   | 22 ++
 5 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5b736e4..75459cf 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -113,7 +113,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
-   select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S
+   select HAVE_HW_BREAKPOINT if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx)
select ARCH_WANT_IPC_PARSE_VERSION
select SPARSE_IRQ
select IRQ_DOMAIN
diff --git a/arch/powerpc/include/asm/reg_8xx.h 
b/arch/powerpc/include/asm/reg_8xx.h
index 0197e12..8c5c7f2 100644
--- a/arch/powerpc/include/asm/reg_8xx.h
+++ b/arch/powerpc/include/asm/reg_8xx.h
@@ -29,6 +29,13 @@
 #define SPRN_EIE   80  /* External interrupt enable (EE=1, RI=1) */
 #define SPRN_EID   81  /* External interrupt disable (EE=0, RI=1) */
 
+/* Debug registers */
+#define SPRN_CMPE  152
+#define SPRN_CMPF  153
+#define SPRN_LCTRL1156
+#define SPRN_LCTRL2157
+#define SPRN_BAR   159
+
 /* Commands.  Only the first few are available to the instruction cache.
 */
 #defineIDC_ENABLE  0x0200  /* Cache enable */
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index fb133a1..d4f3335 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -478,6 +478,7 @@ InstructionTLBError:
andis.  r10,r5,0x4000
beq+1f
tlbie   r4
+itlbie:
/* 0x400 is InstructionAccess exception, needed by bad_page_fault() */
 1: EXC_XFER_LITE(0x400, handle_page_fault)
 
@@ -502,6 +503,7 @@ DARFixed:/* Return from dcbx instruction bug workaround */
andis.  r10,r5,0x4000
beq+1f
tlbie   r4
+dtlbie:
 1: li  r10,RPN_PATTERN
mtspr   SPRN_DAR,r10/* Tag DAR, to be used in DTLB Error */
/* 0x300 is DataAccess exception, needed by bad_page_fault() */
@@ -519,7 +521,27 @@ DARFixed:/* Return from dcbx instruction bug workaround */
  * support of breakpoints and such.  Someday I will get around to
  * using them.
  */
-   EXCEPTION(0x1c00, Trap_1c, unknown_exception, EXC_XFER_EE)
+   . = 0x1c00
+DataBreakpoint:
+   EXCEPTION_PROLOG_0
+   mfcrr10
+   mfspr   r11, SPRN_SRR0
+   cmplwi  cr0, r11, (dtlbie - PAGE_OFFSET)@l
+   cmplwi  cr7, r11, (itlbie - PAGE_OFFSET)@l
+   beq-cr0, 11f
+   beq-cr7, 11f
+   EXCEPTION_PROLOG_1
+   EXCEPTION_PROLOG_2
+   addir3,r1,STACK_FRAME_OVERHEAD
+   mfspr   r4,SPRN_BAR
+   stw r4,_DAR(r11)
+   mfspr   r5,SPRN_DSISR
+   EXC_XFER_EE(0x1c00, do_break)
+11:
+   mtcrr10
+   EXCEPTION_EPILOG_0
+   rfi
+
EXCEPTION(0x1d00, Trap_1d, unknown_exception, EXC_XFER_EE)
EXCEPTION(0x1e00, Trap_1e, unknown_exception, EXC_XFER_EE)
EXCEPTION(0x1f00, Trap_1f, unknown_exception, EXC_XFER_EE)
@@ -870,6 +892,10 @@ initial_mmu:
lis r8, IDC_ENABLE@h
mtspr   SPRN_DC_CST, r8
 #endif
+   /* Disable debug mode entry on data breakpoints */
+   mfspr   r8, SPRN_DER
+   rlwinm  r8, r8, 0, ~0x8
+   mtspr   SPRN_DER, r8
blr
 
 
diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index 03d089b..4b70a53 

Re: [PATCH v7 0/7] Radix pte update tlbflush optimizations.

2016-11-29 Thread Balbir Singh


On 28/11/16 17:16, Aneesh Kumar K.V wrote:
> Changes from v6:
> * restrict the new pte bit to radix and DD1 config
> 
> Changes from V5:
> Switch to use pte bits to track page size.
> 
> 

This series looks much better. I wish there were a better
way of avoiding having to pass the address to the ptep function,
but I guess we get to live with it forever.

Balbir Singh.


[PATCH 0/2] powerpc: hw_breakpoint for book3s/32 and 8xx

2016-11-29 Thread Christophe Leroy
This series provides HW breakpoints on 32-bit Book3S and 8xx.

Tested on mpc8321
Tested on mpc885

Christophe Leroy (2):
  powerpc/32: Enable HW_BREAKPOINT on BOOK3S
  powerpc/8xx: Implement hw_breakpoint

 arch/powerpc/Kconfig |  2 +-
 arch/powerpc/include/asm/processor.h |  2 +-
 arch/powerpc/include/asm/reg_8xx.h   |  7 +++
 arch/powerpc/kernel/head_8xx.S   | 28 +++-
 arch/powerpc/kernel/hw_breakpoint.c  |  6 +-
 arch/powerpc/kernel/process.c| 22 ++
 6 files changed, 63 insertions(+), 4 deletions(-)

-- 
2.10.1



[PATCH] powerpc/32: tlbie provide L operand explicitly

2016-11-29 Thread Nicholas Piggin
The single-operand form of tlbie used to be accepted, with the
second operand (L) implicitly 0. Newer binutils reject this.

Change the remaining single-operand tlbie instructions to pass an
explicit 0 as the second argument.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/ppc_asm.h  | 2 +-
 arch/powerpc/kernel/head_32.S   | 2 +-
 arch/powerpc/kernel/head_8xx.S  | 8 
 arch/powerpc/kernel/swsusp_32.S | 2 +-
 arch/powerpc/mm/hash_low_32.S   | 8 
 arch/powerpc/mm/mmu_decl.h  | 2 +-
 arch/powerpc/platforms/powermac/sleep.S | 2 +-
 7 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h 
b/arch/powerpc/include/asm/ppc_asm.h
index c73750b..5a0a2f9 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -416,7 +416,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_601)
lis r4,KERNELBASE@h;\
.machine push;  \
.machine "power4";  \
-0: tlbie   r4; \
+0: tlbie   r4,0;   \
.machine pop;   \
addir4,r4,0x1000;   \
bdnz0b
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 9d96354..af99545 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -1110,7 +1110,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_USE_HIGH_BATS)
 flush_tlbs:
lis r10, 0x40
 1: addic.  r10, r10, -0x1000
-   tlbie   r10
+   tlbie   r10,0
bgt 1b
sync
blr
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index fb133a1..b967bfa 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -314,9 +314,9 @@ SystemCall:
 #ifdef CONFIG_8xx_CPU15
 #define INVALIDATE_ADJACENT_PAGES_CPU15(tmp, addr) \
additmp, addr, PAGE_SIZE;   \
-   tlbie   tmp;\
+   tlbie   tmp,0;  \
additmp, addr, -PAGE_SIZE;  \
-   tlbie   tmp
+   tlbie   tmp,0
 #else
 #define INVALIDATE_ADJACENT_PAGES_CPU15(tmp, addr)
 #endif
@@ -477,7 +477,7 @@ InstructionTLBError:
mr  r5,r9
andis.  r10,r5,0x4000
beq+1f
-   tlbie   r4
+   tlbie   r4,0
/* 0x400 is InstructionAccess exception, needed by bad_page_fault() */
 1: EXC_XFER_LITE(0x400, handle_page_fault)
 
@@ -501,7 +501,7 @@ DARFixed:/* Return from dcbx instruction bug workaround */
mfspr   r4,SPRN_DAR
andis.  r10,r5,0x4000
beq+1f
-   tlbie   r4
+   tlbie   r4,0
 1: li  r10,RPN_PATTERN
mtspr   SPRN_DAR,r10/* Tag DAR, to be used in DTLB Error */
/* 0x300 is DataAccess exception, needed by bad_page_fault() */
diff --git a/arch/powerpc/kernel/swsusp_32.S b/arch/powerpc/kernel/swsusp_32.S
index ba4dee3..cb26ab3 100644
--- a/arch/powerpc/kernel/swsusp_32.S
+++ b/arch/powerpc/kernel/swsusp_32.S
@@ -302,7 +302,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_USE_HIGH_BATS)
/* Flush all TLBs */
lis r4,0x1000
 1: addic.  r4,r4,-0x1000
-   tlbie   r4
+   tlbie   r4,0
bgt 1b
sync
 
diff --git a/arch/powerpc/mm/hash_low_32.S b/arch/powerpc/mm/hash_low_32.S
index 09cc50c..0675034 100644
--- a/arch/powerpc/mm/hash_low_32.S
+++ b/arch/powerpc/mm/hash_low_32.S
@@ -352,7 +352,7 @@ _GLOBAL(hash_page_patch_A)
 */
andi.   r6,r6,_PAGE_HASHPTE
beq+10f /* no PTE: go look for an empty slot */
-   tlbie   r4
+   tlbie   r4,0
 
addis   r4,r7,htab_hash_searches@ha
lwz r6,htab_hash_searches@l(r4)
@@ -612,7 +612,7 @@ _GLOBAL(flush_hash_patch_B)
 3: li  r0,0
STPTE   r0,0(r12)   /* invalidate entry */
 4: sync
-   tlbie   r4  /* in hw tlb too */
+   tlbie   r4,0/* in hw tlb too */
sync
 
 8: ble cr1,9f  /* if all ptes checked */
@@ -661,7 +661,7 @@ _GLOBAL(_tlbie)
stwcx.  r8,0,r9
bne-10b
eieio
-   tlbie   r3
+   tlbie   r3,0
sync
TLBSYNC
li  r0,0
@@ -670,7 +670,7 @@ _GLOBAL(_tlbie)
SYNC_601
isync
 #else /* CONFIG_SMP */
-   tlbie   r3
+   tlbie   r3,0
sync
 #endif /* CONFIG_SMP */
blr
diff --git a/arch/powerpc/mm/mmu_decl.h b/arch/powerpc/mm/mmu_decl.h
index f988db6..9b9e780 100644
--- a/arch/powerpc/mm/mmu_decl.h
+++ b/arch/powerpc/mm/mmu_decl.h
@@ -55,7 +55,7 @@ extern void _tlbil_pid_noind(unsigned int pid);
 static inline void _tlbil_va(unsigned long address, unsigned int pid,
 unsigned int tsize, unsigned int ind)
 {
-   asm volatile ("tlbie %0; sync" : : "r" (address) : "memory");
+   asm volatile ("tlbie %0,0; sync" : : "r" (address) : "memory");

Re: [PATCH v3 1/3] powernv:idle: Add IDLE_STATE_ENTER_SEQ_NORET macro

2016-11-29 Thread Balbir Singh


On 10/11/16 18:54, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> Currently all the low-power idle states are expected to wake up
> at reset vector 0x100, which is why the macro IDLE_STATE_ENTER_SEQ
> that puts the CPU into an idle state never returns.
> 
> On ISA_300, when the ESL and EC bits in the PSSCR are zero, the
> CPU is expected to wake up at the next instruction of the idle
> instruction.
> 
> This patch adds a new macro named IDLE_STATE_ENTER_SEQ_NORET for the

I think something like IDLE_STATE_ENTER_SEQ_LOSE_CTX would be better?

> no-return variant and reuses the name IDLE_STATE_ENTER_SEQ
> for a variant that allows resuming operation at the instruction next
> to the idle-instruction.
> 

> +
> +#define  IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)   \
> + IDLE_STATE_ENTER_SEQ(IDLE_INST) \

So we start off with both as the same?

>   b   .
>  #endif /* CONFIG_PPC_P7_NAP */

Balbir


Re: [PATCH v7 1/7] powerpc/mm: Rename hugetlb-radix.h to hugetlb.h

2016-11-29 Thread Balbir Singh


On 28/11/16 17:16, Aneesh Kumar K.V wrote:
> We will start moving some book3s specific hugetlb functions there.

You mean for both radix and hash right?

Balbir


[PATCH 1/2] powerpc/32: Enable HW_BREAKPOINT on BOOK3S

2016-11-29 Thread Christophe Leroy
BOOK3S also has a DABR register and the capability to handle data
breakpoints, so this patch enables it on all BOOK3S, not only on 64-bit.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/processor.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d86643..5b736e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -113,7 +113,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
-   select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
+   select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S
select ARCH_WANT_IPC_PARSE_VERSION
select SPARSE_IRQ
select IRQ_DOMAIN
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 1ba8144..2053a4b 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -225,6 +225,7 @@ struct thread_struct {
 #ifdef CONFIG_PPC64
unsigned long   start_tb;   /* Start purr when proc switched in */
unsigned long   accum_tb;   /* Total accumulated purr for process */
+#endif
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
struct perf_event *ptrace_bps[HBP_NUM];
/*
@@ -233,7 +234,6 @@ struct thread_struct {
 */
struct perf_event *last_hit_ubp;
 #endif /* CONFIG_HAVE_HW_BREAKPOINT */
-#endif
struct arch_hw_breakpoint hw_brk; /* info on the hardware breakpoint */
unsigned long   trap_nr;/* last trap # on this thread */
u8 load_fp;
-- 
2.10.1



[PATCH v2] KVM/PPC Patch for KVM issue in real mode

2016-11-29 Thread Balbir Singh

Some KVM functions for book3s_hv are called in real mode.
In real mode the top 4 bits of the address space are ignored,
hence an address beginning with 0xc000+offset is the
same as 0xd000+offset. The issue was observed when
a kvm memslot resolution led to random values when
accessed from kvmppc_h_enter(). The issue is hit if the
KVM host is running with a page size of 4K, since
kvzalloc() compares the size against PAGE_SIZE. On systems with
64K pages the issue is not observed easily; it largely depends
on the size of the structure being allocated.

The proposed fix moves all KVM allocations for book3s_hv
to kzalloc() until all structures used in real mode are
audited. For safety, allocations are moved to kmalloc
space. The impact is a larger allocation on systems with
a 4K page size.

Signed-off-by: Balbir Singh 
---
 Changelog v2:
   Fix build failures reported by the kbuild test robot
   http://www.spinics.net/lists/kvm/msg141727.html

 arch/powerpc/include/asm/kvm_host.h | 19 +++
 include/linux/kvm_host.h| 11 +++
 virt/kvm/kvm_main.c |  2 +-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index f15713a..53f5172 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -734,6 +734,25 @@ struct kvm_vcpu_arch {
 #define __KVM_HAVE_ARCH_WQP
 #define __KVM_HAVE_CREATE_DEVICE
 
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+#define __KVM_HAVE_ARCH_VZALLOC_OVERRIDE
+
+/*
+ * KVM uses some of these data structures -- the ones
+ * from kvzalloc() in real mode. If the data structure
+ * happens to come from a vmalloc'd range then its access
+ * in real mode will lead to problems due to the aliasing
+ * issue (the top 4 bits are ignored).
+ * A 0xd000+offset will point to a 0xc000+offset in real mode.
+ * Hence we want our data structures to come from kmalloc'd
+ * regions, so that we don't have these aliasing issues.
+ */
+static inline void *kvm_arch_vzalloc(unsigned long size)
+{
+   return kzalloc(size, GFP_KERNEL);
+}
+#endif
+
 static inline void kvm_arch_hardware_disable(void) {}
 static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..0c88af5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -793,6 +794,16 @@ static inline bool kvm_arch_has_noncoherent_dma(struct kvm 
*kvm)
return false;
 }
 #endif
+
+#ifdef __KVM_HAVE_ARCH_VZALLOC_OVERRIDE
+static void *kvm_arch_vzalloc(unsigned long size);
+#else
+static inline void *kvm_arch_vzalloc(unsigned long size)
+{
+   return vzalloc(size);
+}
+#endif
+
 #ifdef __KVM_HAVE_ARCH_ASSIGNED_DEVICE
 void kvm_arch_start_assignment(struct kvm *kvm);
 void kvm_arch_end_assignment(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fbf04c0..57e3dca 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -689,7 +689,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 void *kvm_kvzalloc(unsigned long size)
 {
if (size > PAGE_SIZE)
-   return vzalloc(size);
+   return kvm_arch_vzalloc(size);
else
return kzalloc(size, GFP_KERNEL);
 }
-- 
2.5.5



Re: [PATCH net 00/16] net: fix fixed-link phydev leaks

2016-11-29 Thread David Miller
From: Johan Hovold 
Date: Mon, 28 Nov 2016 19:24:53 +0100

> This series fixes failures to deregister and free fixed-link phydevs
> that have been registered using the of_phy_register_fixed_link()
> interface.
> 
> All but two drivers currently fail to do this and this series fixes most
> of them with the exception of a staging driver and the stmmac drivers
> which will be fixed by follow-on patches.
> 
> Included are also a couple of fixes for related of-node leaks.
> 
> Note that all patches except the of_mdio one have been compile-tested
> only.
> 
> Also note that the series is against net due to dependencies not yet in
> net-next.

Series applied, thanks Johan.


[PATCH kernel v7 1/7] powerpc/iommu: Pass mm_struct to init/cleanup helpers

2016-11-29 Thread Alexey Kardashevskiy
We are going to get rid of @current references in mmu_context_book3s64.c
and cache mm_struct in the VFIO container. Since mm_context_t does not
have reference counting, we will be using mm_struct, which does have
a reference counter.

This changes mm_iommu_init/mm_iommu_cleanup to receive mm_struct rather
than mm_context_t (which is embedded into mm).

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/mmu_context.h | 4 ++--
 arch/powerpc/kernel/setup-common.c | 2 +-
 arch/powerpc/mm/mmu_context_book3s64.c | 4 ++--
 arch/powerpc/mm/mmu_context_iommu.c| 9 +
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 5c45114..424844b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -23,8 +23,8 @@ extern bool mm_iommu_preregistered(void);
 extern long mm_iommu_get(unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
-extern void mm_iommu_init(mm_context_t *ctx);
-extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern void mm_iommu_init(struct mm_struct *mm);
+extern void mm_iommu_cleanup(struct mm_struct *mm);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 270ee30..f516ac5 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -915,7 +915,7 @@ void __init setup_arch(char **cmdline_p)
init_mm.context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&init_mm.context);
+   mm_iommu_init(&init_mm);
 #endif
irqstack_early_init();
exc_lvl_early_init();
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index b114f8b..ad82735 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -115,7 +115,7 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
mm->context.pte_frag = NULL;
 #endif
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_init(&mm->context);
+   mm_iommu_init(mm);
 #endif
return 0;
 }
@@ -160,7 +160,7 @@ static inline void destroy_pagetable_page(struct mm_struct 
*mm)
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(&mm->context);
+   mm_iommu_cleanup(mm);
 #endif
 
 #ifdef CONFIG_PPC_ICSWX
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index e0f1c33..ad2e575 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -373,16 +373,17 @@ void mm_iommu_mapped_dec(struct 
mm_iommu_table_group_mem_t *mem)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_mapped_dec);
 
-void mm_iommu_init(mm_context_t *ctx)
+void mm_iommu_init(struct mm_struct *mm)
 {
-   INIT_LIST_HEAD_RCU(&ctx->iommu_group_mem_list);
+   INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
 
-void mm_iommu_cleanup(mm_context_t *ctx)
+void mm_iommu_cleanup(struct mm_struct *mm)
 {
struct mm_iommu_table_group_mem_t *mem, *tmp;
 
-   list_for_each_entry_safe(mem, tmp, &ctx->iommu_group_mem_list, next) {
+   list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
+   next) {
list_del_rcu(&mem->next);
mm_iommu_do_free(mem);
}
-- 
2.5.0.rc3



Re: [PATCH v11 0/8] powerpc: Implement kexec_file_load()

2016-11-29 Thread Michael Ellerman
Andrew Morton  writes:

> On Tue, 29 Nov 2016 23:45:46 +1100 Michael Ellerman  
> wrote:
>
>> This is v11 of the kexec_file_load() for powerpc series.
>> 
>> I've stripped this down to the minimum we need, so we can get this in for 
>> 4.10.
>> Any additions can come later incrementally.
>
> This made a bit of a mess of Mimi's series "ima: carry the
> measurement list across kexec v10".

Urk, sorry about that. I didn't realise there was a big dependency
between them, but I guess I should have tried to do the rebase.

> powerpc-ima-get-the-kexec-buffer-passed-by-the-previous-kernel.patch
> ima-on-soft-reboot-restore-the-measurement-list.patch
> ima-permit-duplicate-measurement-list-entries.patch
> ima-maintain-memory-size-needed-for-serializing-the-measurement-list.patch
> powerpc-ima-send-the-kexec-buffer-to-the-next-kernel.patch
> ima-on-soft-reboot-save-the-measurement-list.patch
> ima-store-the-builtin-custom-template-definitions-in-a-list.patch
> ima-support-restoring-multiple-template-formats.patch
> ima-define-a-canonical-binary_runtime_measurements-list-format.patch
> ima-platform-independent-hash-value.patch
>
> I made the syntactic fixes but I won't be testing it.

Thanks. 

TBH I don't know how to test the IMA part, I'm relying on Thiago and
Mimi to do that.

>> If no one objects I'll merge this via the powerpc tree. The three kexec 
>> patches
>> have been acked by Dave Young (since forever), and have been in linux-next 
>> (via
>> akpm's tree) also for a long time.
>
> OK, I'll wait for these to appear in -next and I will await advice on 

Thanks. I'll let them stew for a few more hours and then put them in my
next for tomorrows linux-next.

cheers


[PATCH kernel v7 4/7] vfio/spapr: Add a helper to create default DMA window

2016-11-29 Thread Alexey Kardashevskiy
There is already a helper to create a DMA window which allocates
a table and programs it into the IOMMU group. However,
tce_iommu_take_ownership_ddw() did not use it and made these 2 calls
itself to simplify the error path.

Since we are going to delay the default window creation until
the default window is accessed/removed or a new window is added,
we need a helper to create a default window for all these cases.

This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container having at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add the group to the container
first and then call the new helper.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v6:
* new to the patchset
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 87 ++---
 1 file changed, 42 insertions(+), 45 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 4efd2b2..a67bbfd 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -710,6 +710,29 @@ static long tce_iommu_remove_window(struct tce_container 
*container,
return 0;
 }
 
+static long tce_iommu_create_default_window(struct tce_container *container)
+{
+   long ret;
+   __u64 start_addr = 0;
+   struct tce_iommu_group *tcegrp;
+   struct iommu_table_group *table_group;
+
+   if (!tce_groups_attached(container))
+   return -ENODEV;
+
+   tcegrp = list_first_entry(&container->group_list,
+   struct tce_iommu_group, next);
+   table_group = iommu_group_get_iommudata(tcegrp->grp);
+   if (!table_group)
+   return -ENODEV;
+
+   ret = tce_iommu_create_window(container, IOMMU_PAGE_SHIFT_4K,
+   table_group->tce32_size, 1, &start_addr);
+   WARN_ON_ONCE(!ret && start_addr);
+
+   return ret;
+}
+
 static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
 {
@@ -1100,9 +1123,6 @@ static void tce_iommu_release_ownership_ddw(struct 
tce_container *container,
 static long tce_iommu_take_ownership_ddw(struct tce_container *container,
struct iommu_table_group *table_group)
 {
-   long i, ret = 0;
-   struct iommu_table *tbl = NULL;
-
if (!table_group->ops->create_table || !table_group->ops->set_window ||
!table_group->ops->release_ownership) {
WARN_ON_ONCE(1);
@@ -1111,47 +1131,7 @@ static long tce_iommu_take_ownership_ddw(struct 
tce_container *container,
 
table_group->ops->take_ownership(table_group);
 
-   /*
-* If it the first group attached, check if there is
-* a default DMA window and create one if none as
-* the userspace expects it to exist.
-*/
-   if (!tce_groups_attached(container) && !container->tables[0]) {
-   ret = tce_iommu_create_table(container,
-   table_group,
-   0, /* window number */
-   IOMMU_PAGE_SHIFT_4K,
-   table_group->tce32_size,
-   1, /* default levels */
-   &tbl);
-   if (ret)
-   goto release_exit;
-   else
-   container->tables[0] = tbl;
-   }
-
-   /* Set all windows to the new group */
-   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
-   tbl = container->tables[i];
-
-   if (!tbl)
-   continue;
-
-   /* Set the default window to a new group */
-   ret = table_group->ops->set_window(table_group, i, tbl);
-   if (ret)
-   goto release_exit;
-   }
-
return 0;
-
-release_exit:
-   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
-   table_group->ops->unset_window(table_group, i);
-
-   table_group->ops->release_ownership(table_group);
-
-   return ret;
 }
 
 static int tce_iommu_attach_group(void *iommu_data,
@@ -1161,6 +1141,7 @@ static int tce_iommu_attach_group(void *iommu_data,
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
struct tce_iommu_group *tcegrp = NULL;
+   bool create_default_window = false;
 
mutex_lock(&container->lock);
 
@@ -1203,14 +1184,30 @@ static int tce_iommu_attach_group(void *iommu_data,
}
 
if (!table_group->ops || !table_group->ops->take_ownership ||
-   !table_group->ops->release_ownership)
+   !table_group->ops->release_ownership) {
ret = tce_iommu_take_ownership(container, table_group);
-   else
+   } else {
ret = tce_iommu_take_ownership_ddw(container, table_group);
+   

[PATCH kernel v7 3/7] vfio/spapr: Postpone allocation of userspace version of TCE table

2016-11-29 Thread Alexey Kardashevskiy
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.

As we are going to allow the userspace to configure a container in one
memory context and pass the container fd to another, we have to postpone
such allocations until the container fd is passed to the destination
user process so that we account the locked memory limit against the
actual container user's constraints.

This postpones the it_userspace array allocation until it is first
used for mapping. The unmapping patch already checks if the array is
allocated.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v6:
* moved missing hunk from the next patch: tce_iommu_create_table()
would decrement locked_vm while new caller - tce_iommu_build_v2() -
will not; this adds a new return code to the DMA mapping path but
this seems to be a minor change.
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 20 +++-
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index d0c38b2..4efd2b2 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -515,6 +515,12 @@ static long tce_iommu_build_v2(struct tce_container 
*container,
unsigned long hpa;
enum dma_data_direction dirtmp;
 
+   if (!tbl->it_userspace) {
+   ret = tce_iommu_userspace_view_alloc(tbl);
+   if (ret)
+   return ret;
+   }
+
for (i = 0; i < pages; ++i) {
struct mm_iommu_table_group_mem_t *mem = NULL;
unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
@@ -588,15 +594,6 @@ static long tce_iommu_create_table(struct tce_container 
*container,
WARN_ON(!ret && !(*ptbl)->it_ops->free);
WARN_ON(!ret && ((*ptbl)->it_allocated_size != table_size));
 
-   if (!ret && container->v2) {
-   ret = tce_iommu_userspace_view_alloc(*ptbl);
-   if (ret)
-   (*ptbl)->it_ops->free(*ptbl);
-   }
-
-   if (ret)
-   decrement_locked_vm(table_size >> PAGE_SHIFT);
-
return ret;
 }
 
@@ -1068,10 +1065,7 @@ static int tce_iommu_take_ownership(struct tce_container 
*container,
if (!tbl || !tbl->it_map)
continue;
 
-   rc = tce_iommu_userspace_view_alloc(tbl);
-   if (!rc)
-   rc = iommu_take_ownership(tbl);
-
+   rc = iommu_take_ownership(tbl);
if (rc) {
for (j = 0; j < i; ++j)
iommu_release_ownership(
-- 
2.5.0.rc3



[PATCH kernel v7 0/7] powerpc/spapr/vfio: Put pages on VFIO container shutdown

2016-11-29 Thread Alexey Kardashevskiy
These patches fix a bug where pages stay pinned for hours
after the QEMU process which requested the pinning has exited.

Changes from v6 are in the last 2 patches; individual patches
have detailed changelogs.

Please comment. Thanks.

Alexey Kardashevskiy (7):
  powerpc/iommu: Pass mm_struct to init/cleanup helpers
  powerpc/iommu: Stop using @current in mm_iommu_xxx
  vfio/spapr: Postpone allocation of userspace version of TCE table
  vfio/spapr: Add a helper to create default DMA window
  vfio/spapr: Postpone default window creation
  vfio/spapr: Reference mm in tce_container
  powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

 arch/powerpc/include/asm/mmu_context.h |  20 +-
 arch/powerpc/kernel/setup-common.c |   2 +-
 arch/powerpc/mm/mmu_context_book3s64.c |   6 +-
 arch/powerpc/mm/mmu_context_iommu.c|  60 ++
 drivers/vfio/vfio_iommu_spapr_tce.c| 328 ++---
 5 files changed, 250 insertions(+), 166 deletions(-)

-- 
2.5.0.rc3



[PATCH kernel v7 5/7] vfio/spapr: Postpone default window creation

2016-11-29 Thread Alexey Kardashevskiy
We are going to allow the userspace to configure a container in
one memory context and pass the container fd to another, so
we are postponing memory allocations accounted against
the locked memory limit. One of the previous patches took care of
it_userspace.

At the moment we create the default DMA window when the first group is
attached to a container; this is done for userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such a client expects the default DMA window to exist.

This postpones the default DMA window allocation until one of
the following happens:
1. the first map/unmap request arrives;
2. a new window is requested.
This adds a no-op for the case when the userspace requested removal
of the default window which has not been created yet.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v6:
* new helper tce_iommu_create_default_window() moved to a separate patch;
* creates a default window when new window is requested; it used to
reset the def_window_pending flag instead;
* def_window_pending handling (mostly) localized in
tce_iommu_create_default_window() now, the only exception is removal
of not yet created default window.
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 40 +++--
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index a67bbfd..88622be 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -97,6 +97,7 @@ struct tce_container {
struct mutex lock;
bool enabled;
bool v2;
+   bool def_window_pending;
unsigned long locked_pages;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
@@ -717,6 +718,9 @@ static long tce_iommu_create_default_window(struct 
tce_container *container)
struct tce_iommu_group *tcegrp;
struct iommu_table_group *table_group;
 
+   if (!container->def_window_pending)
+   return 0;
+
if (!tce_groups_attached(container))
return -ENODEV;
 
@@ -730,6 +734,9 @@ static long tce_iommu_create_default_window(struct 
tce_container *container)
table_group->tce32_size, 1, &start_addr);
WARN_ON_ONCE(!ret && start_addr);
 
+   if (!ret)
+   container->def_window_pending = false;
+
return ret;
 }
 
@@ -823,6 +830,10 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;
 
+   ret = tce_iommu_create_default_window(container);
+   if (ret)
+   return ret;
+
num = tce_iommu_find_table(container, param.iova, &tbl);
if (num < 0)
return -ENXIO;
@@ -886,6 +897,10 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;
 
+   ret = tce_iommu_create_default_window(container);
+   if (ret)
+   return ret;
+
num = tce_iommu_find_table(container, param.iova, &tbl);
if (num < 0)
return -ENXIO;
@@ -1012,6 +1027,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 
mutex_lock(&container->lock);
 
+   ret = tce_iommu_create_default_window(container);
+   if (ret)
+   return ret;
+
ret = tce_iommu_create_window(container, create.page_shift,
create.window_size, create.levels,
&start_addr);
@@ -1044,6 +1063,11 @@ static long tce_iommu_ioctl(void *iommu_data,
if (remove.flags)
return -EINVAL;
 
+   if (container->def_window_pending && !remove.start_addr) {
+   container->def_window_pending = false;
+   return 0;
+   }
+
mutex_lock(&container->lock);
 
ret = tce_iommu_remove_window(container, remove.start_addr);
@@ -1141,7 +1165,6 @@ static int tce_iommu_attach_group(void *iommu_data,
struct tce_container *container = iommu_data;
struct iommu_table_group *table_group;
struct tce_iommu_group *tcegrp = NULL;
-   bool create_default_window = false;
 
mutex_lock(&container->lock);
 
@@ -1189,25 +1212,12 @@ static int tce_iommu_attach_group(void *iommu_data,
} else {
ret = tce_iommu_take_ownership_ddw(container, table_group);
if (!tce_groups_attached(container) && !container->tables[0])
-   create_default_window = true;
+   container->def_window_pending = true;
}
 
if (!ret) {
tcegrp->grp = iommu_group;

[PATCH v2] Fix the message in facility unavailable exception

2016-11-29 Thread Balbir Singh

I ran into this during some testing on qemu. The current
facility_strings[] are correct when the trap address is
0xf80 (hypervisor facility unavailable). When the trap
address is 0xf60, IC (Interruption Cause) a.k.a status
in the code is undefined for values 0 and 1. This patch
adds a check to prevent printing the wrong information
and helps better direct debugging effort.
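The bounds check being added can be illustrated with a self-contained C
sketch; the table entries below are placeholders (the real
facility_strings[] lives in arch/powerpc/kernel/traps.c), only the
(hv || status >= 2) guard mirrors the patch:

```c
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Placeholder table: entries 0 and 1 are only defined for the
 * hypervisor (0xf80) flavour of the exception. */
static const char *facility_strings[] = {
	[0] = "hv-only facility A",
	[1] = "hv-only facility B",
	[2] = "floating point",
	[3] = "vector",
};

/* Returns the facility name, or NULL when the (hv, status) pair falls
 * outside the defined range -- the case the patch now warns about. */
static const char *lookup_facility(bool hv, unsigned int status)
{
	if ((hv || status >= 2) &&
	    status < ARRAY_SIZE(facility_strings) &&
	    facility_strings[status])
		return facility_strings[status];

	return NULL;	/* caller prints the ratelimited warning */
}
```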

Signed-off-by: Balbir Singh 
---
 Changelog v2:
   Redo conditional checks as suggested by Michael

 arch/powerpc/kernel/traps.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 023a462..010b11d 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1519,9 +1519,13 @@ void facility_unavailable_exception(struct pt_regs *regs)
return;
}
 
-   if ((status < ARRAY_SIZE(facility_strings)) &&
-   facility_strings[status])
-   facility = facility_strings[status];
+   if ((hv || status >= 2) &&
+   (status < ARRAY_SIZE(facility_strings)) &&
+   facility_strings[status])
+   facility = facility_strings[status];
+   else
+   pr_warn_ratelimited("Unexpected facility unavailable exception "
+   "interruption cause %d\n", status);
 
/* We restore the interrupt state now */
if (!arch_irq_disabled_regs(regs))
-- 
2.5.5



[PATCH kernel v7 6/7] vfio/spapr: Reference mm in tce_container

2016-11-29 Thread Alexey Kardashevskiy
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).

This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.

This replaces current->mm with container->mm everywhere except debug
prints.

This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.

DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.

This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio; supposedly they will have the same limits
and capabilities, and if they do not, we'll just fail with no harm done.
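The "reference the first caller's mm, reject everyone else" rule can be
sketched in userspace C (the struct names and the -1 stand-in for -EPERM
are invented for illustration):

```c
/* Toy stand-ins for mm_struct and tce_container. */
struct mm {
	int refcount;
};

struct container_ctx {
	struct mm *mm;
};

/* Mirrors tce_iommu_mm_set(): the first caller's mm is latched and
 * referenced; a later caller with a different mm is rejected. */
static int container_mm_set(struct container_ctx *c, struct mm *current_mm)
{
	if (c->mm)
		return c->mm == current_mm ? 0 : -1;	/* -EPERM in the kernel */
	c->mm = current_mm;
	c->mm->refcount++;	/* plays the role of atomic_inc(&mm->mm_count) */
	return 0;
}

/* The reference is dropped only when the container itself goes away. */
static void container_release(struct container_ctx *c)
{
	if (c->mm)
		c->mm->refcount--;
}
```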

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v7:
* WARN_ON_ONCE(!mm) in try_increment_locked_vm()
* s/&&/||/ in a parameter check in decrement_locked_vm()
* instead of failing on unset container,
the VFIO_IOMMU_SPAPR_TCE_REMOVE handler sets mm to container now

v6:
* updated the commit log about not referencing task

v5:
* postpone referencing of mm

v4:
* added a check for container->mm != current->mm in tce_iommu_ioctl()
for all ioctls and removed other redundant checks
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 160 ++--
 1 file changed, 100 insertions(+), 60 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 88622be..4c03c85 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -31,49 +31,49 @@
 static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
 
-static long try_increment_locked_vm(long npages)
+static long try_increment_locked_vm(struct mm_struct *mm, long npages)
 {
long ret = 0, locked, lock_limit;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
+   if (WARN_ON_ONCE(!mm))
+   return -EPERM;
 
if (!npages)
return 0;
 
-   down_write(&current->mm->mmap_sem);
-   locked = current->mm->locked_vm + npages;
+   down_write(&mm->mmap_sem);
+   locked = mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
ret = -ENOMEM;
else
-   current->mm->locked_vm += npages;
+   mm->locked_vm += npages;
 
pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
npages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT,
+   mm->locked_vm << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK),
ret ? " - exceeded" : "");
 
-   up_write(&current->mm->mmap_sem);
+   up_write(&mm->mmap_sem);
 
return ret;
 }
 
-static void decrement_locked_vm(long npages)
+static void decrement_locked_vm(struct mm_struct *mm, long npages)
 {
-   if (!current || !current->mm || !npages)
-   return; /* process exited */
+   if (!mm || !npages)
+   return;
 
-   down_write(&current->mm->mmap_sem);
-   if (WARN_ON_ONCE(npages > current->mm->locked_vm))
-   npages = current->mm->locked_vm;
-   current->mm->locked_vm -= npages;
+   down_write(&mm->mmap_sem);
+   if (WARN_ON_ONCE(npages > mm->locked_vm))
+   npages = mm->locked_vm;
+   mm->locked_vm -= npages;
pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
npages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT,
+   mm->locked_vm << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
-   up_write(&current->mm->mmap_sem);
+   up_write(&mm->mmap_sem);
 }
 
 /*
@@ -99,26 +99,38 @@ struct tce_container {
bool v2;
bool def_window_pending;
unsigned long locked_pages;
+   struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
 };
 
+static long tce_iommu_mm_set(struct tce_container *container)
+{
+   if (container->mm) {
+   if (container->mm == current->mm)
+   return 0;
+   return -EPERM;
+   }
+   BUG_ON(!current->mm);
+   container->mm = current->mm;
+   atomic_inc(&container->mm->mm_count);
+
+   return 0;
+}
+
 static long tce_iommu_unregister_pages(struct 

[PATCH kernel v7 2/7] powerpc/iommu: Stop using @current in mm_iommu_xxx

2016-11-29 Thread Alexey Kardashevskiy
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.

This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.

This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.

This should cause no behavioral change.
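The shape of this refactoring -- a helper that used to dereference
@current now takes the mm explicitly, with the liveness checks moved to
the caller -- can be shown with a tiny C model (all names here are
invented):

```c
#include <stddef.h>

struct mm_ctx {
	int preregistered;
};

/* Stand-in for the kernel's current->mm. */
static struct mm_ctx *current_mm;

/* After the change: the helper takes its mm explicitly and no longer
 * checks for a missing task or mm. */
static int mm_is_preregistered(struct mm_ctx *mm)
{
	return mm->preregistered != 0;
}

/* The caller now owns the !current / !current->mm checks. */
static int caller_check(void)
{
	if (!current_mm)
		return -1;	/* stands in for -ESRCH: process exited */
	return mm_is_preregistered(current_mm);
}
```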

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/mmu_context.h | 16 ++--
 arch/powerpc/mm/mmu_context_iommu.c| 46 +-
 drivers/vfio/vfio_iommu_spapr_tce.c| 14 ---
 3 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 424844b..b9e3f0a 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -19,16 +19,18 @@ extern void destroy_context(struct mm_struct *mm);
 struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);/* from internal.h */
-extern bool mm_iommu_preregistered(void);
-extern long mm_iommu_get(unsigned long ua, unsigned long entries,
+extern bool mm_iommu_preregistered(struct mm_struct *mm);
+extern long mm_iommu_get(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
-extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern long mm_iommu_put(struct mm_struct *mm,
+   struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
 extern void mm_iommu_cleanup(struct mm_struct *mm);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
-   unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(unsigned long ua,
-   unsigned long entries);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm,
+   unsigned long ua, unsigned long size);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+   unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index ad2e575..4c6db09 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -56,7 +56,7 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
}
 
pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-   current->pid,
+   current ? current->pid : 0,
incr ? '+' : '-',
npages << PAGE_SHIFT,
mm->locked_vm << PAGE_SHIFT,
@@ -66,12 +66,9 @@ static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
return ret;
 }
 
-bool mm_iommu_preregistered(void)
+bool mm_iommu_preregistered(struct mm_struct *mm)
 {
-   if (!current || !current->mm)
-   return false;
-
-   return !list_empty(&current->mm->context.iommu_group_mem_list);
+   return !list_empty(&mm->context.iommu_group_mem_list);
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
@@ -124,19 +121,16 @@ static int mm_iommu_move_page_from_cma(struct page *page)
return 0;
 }
 
-long mm_iommu_get(unsigned long ua, unsigned long entries,
+long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
entries,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
long i, j, ret = 0, locked_entries = 0;
struct page *page = NULL;
 
-   if (!current || !current->mm)
-   return -ESRCH; /* process exited */
-
mutex_lock(_list_mutex);
 
-   list_for_each_entry_rcu(mem, &current->mm->context.iommu_group_mem_list,
+   list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
next) {
if ((mem->ua == ua) && (mem->entries == entries)) {
++mem->used;
@@ -154,7 +148,7 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
 
}
 
-   ret = mm_iommu_adjust_locked_vm(current->mm, entries, true);
+   ret = mm_iommu_adjust_locked_vm(mm, entries, true);
if (ret)
goto unlock_exit;
 
@@ -215,11 +209,11 @@ long mm_iommu_get(unsigned long ua, unsigned long entries,
mem->entries = entries;
*pmem = mem;
 
-   list_add_rcu(&mem->next, &current->mm->context.iommu_group_mem_list);
+   list_add_rcu(&mem->next, 

Re: [PATCH v7 0/7] Radix pte update tlbflush optimizations.

2016-11-29 Thread Michael Ellerman
Balbir Singh  writes:

> On 28/11/16 17:16, Aneesh Kumar K.V wrote:
>> Changes from v6:
>> * restrict the new pte bit to radix and DD1 config
>> 
>> Changes from V5:
>> Switch to use pte bits to track page size.
>
> This series looks much better, I wish there was a better
> way of avoiding to have to pass the address to the ptep function,
> but I guess we get to live with it forever

No, we can always revert it when P9 DD1 is dead and buried.

cheers


[PATCH kernel v7 7/7] powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown

2016-11-29 Thread Alexey Kardashevskiy
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().

This approach has a problem: the MM of the userspace process
may live longer than the userspace process itself, as kernel threads
borrow the MM of whatever userspace process was running on the CPU
where the kernel thread was scheduled. If this happens, the MM remains
referenced until that exact kernel thread wakes up again and releases
the very last reference to the MM; on an idle system this can take
hours.

This moves preregistered regions tracking from MM to VFIO; instead of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases the regions which it has pre-registered.

This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However, it should not
have any practical effect as the only userspace tool available now
registers a memory region once per container anyway.

As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
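The cleanup scheme -- each container keeps its own list of referenced
regions and puts them on release -- can be modelled with a short C
sketch using malloc'd list nodes (names are illustrative; the kernel
uses struct tce_iommu_prereg and mm_iommu_put()):

```c
#include <stdlib.h>

/* Node standing in for struct tce_iommu_prereg. */
struct prereg {
	struct prereg *next;
	int *put_counter;	/* incremented in place of mm_iommu_put() */
};

/* Stand-in for the container and its prereg_list head. */
struct box {
	struct prereg *head;
};

static int box_register(struct box *b, int *counter)
{
	struct prereg *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->put_counter = counter;
	p->next = b->head;
	b->head = p;
	return 0;
}

/* On container release every referenced region is put exactly once,
 * without waiting for the last mmdrop() of the process MM. */
static void box_release(struct box *b)
{
	while (b->head) {
		struct prereg *p = b->head;

		b->head = p->next;
		(*p->put_counter)++;
		free(p);
	}
}
```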

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Nicholas Piggin 
---
Changes:
v7:
* left sanity check in destroy_context()
* tce_iommu_prereg_free() does not free tce_iommu_prereg struct if
mm_iommu_put() failed; VFIO SPAPR container release callback now warns
on an error

v4:
* changed tce_iommu_register_pages() to call mm_iommu_find() first and
avoid calling mm_iommu_put() if memory is preregistered already

v3:
* moved tce_iommu_prereg_free() call out of list_for_each_entry()

v2:
* updated commit log
---
 arch/powerpc/mm/mmu_context_book3s64.c |  4 +--
 arch/powerpc/mm/mmu_context_iommu.c| 11 --
 drivers/vfio/vfio_iommu_spapr_tce.c| 61 +-
 3 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index ad82735..73bf6e1 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -156,13 +156,11 @@ static inline void destroy_pagetable_page(struct 
mm_struct *mm)
 }
 #endif
 
-
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-   mm_iommu_cleanup(mm);
+   WARN_ON_ONCE(!list_empty(&mm->context.iommu_group_mem_list));
 #endif
-
 #ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 4c6db09..104bad0 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -365,14 +365,3 @@ void mm_iommu_init(struct mm_struct *mm)
 {
INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list);
 }
-
-void mm_iommu_cleanup(struct mm_struct *mm)
-{
-   struct mm_iommu_table_group_mem_t *mem, *tmp;
-
-   list_for_each_entry_safe(mem, tmp, &mm->context.iommu_group_mem_list,
-   next) {
-   list_del_rcu(&mem->next);
-   mm_iommu_do_free(mem);
-   }
-}
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 4c03c85..c882357 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -89,6 +89,15 @@ struct tce_iommu_group {
 };
 
 /*
+ * A container needs to remember which preregistered region it has
+ * referenced to do proper cleanup at the userspace process exit.
+ */
+struct tce_iommu_prereg {
+   struct list_head next;
+   struct mm_iommu_table_group_mem_t *mem;
+};
+
+/*
  * The container descriptor supports only a single group per container.
  * Required by the API as the container is not supplied with the IOMMU group
  * at the moment of initialization.
@@ -102,6 +111,7 @@ struct tce_container {
struct mm_struct *mm;
struct iommu_table *tables[IOMMU_TABLE_GROUP_MAX_TABLES];
struct list_head group_list;
+   struct list_head prereg_list;
 };
 
 static long tce_iommu_mm_set(struct tce_container *container)
@@ -118,10 +128,27 @@ static long tce_iommu_mm_set(struct tce_container 
*container)
return 0;
 }
 
+static long tce_iommu_prereg_free(struct tce_container *container,
+   struct tce_iommu_prereg *tcemem)
+{
+   long ret;
+
+   ret = mm_iommu_put(container->mm, tcemem->mem);
+   if (ret)
+   return ret;
+
+   list_del(&tcemem->next);
+   kfree(tcemem);
+
+   return 0;
+}
+
 static long tce_iommu_unregister_pages(struct tce_container *container,
__u64 vaddr, __u64 size)
 {
struct mm_iommu_table_group_mem_t *mem;
+   struct tce_iommu_prereg *tcemem;
+   bool found = false;
 
if 

Re: [PATCH v7 3/7] powerpc/mm: Introduce _PAGE_LARGE software pte bits

2016-11-29 Thread Balbir Singh


On 30/11/16 11:35, Benjamin Herrenschmidt wrote:
> On Wed, 2016-11-30 at 11:14 +1100, Balbir Singh wrote:
>>> +#define _RPAGE_RSV1  0x1000UL
>>> +#define _RPAGE_RSV2  0x0800UL
>>> +#define _RPAGE_RSV3  0x0400UL
>>> +#define _RPAGE_RSV4  0x0200UL
>>> +
>>
>> We use the top 4 bits and not the _SW bits?
> 
> Correct, welcome to the discussion we've been having the last 2 weeks
> :-)
> 

I thought we were following Paul's suggestion here

https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-November/151620.html
and I also noticed
https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-November/151624.html

My bad, I thought we had two SW bits to use for DD1

Balbir Singh.


Re: [PATCH] EDAC: mpc85xx: Add T2080 l2-cache support

2016-11-29 Thread Johannes Thumshirn
On Tue, Nov 29, 2016 at 03:20:37PM +1300, Chris Packham wrote:
> The l2-cache controller on the T2080 SoC has similar capabilities to the
> others already supported by the mpc85xx_edac driver. Add it to the list
> of compatible devices.
> 
> Signed-off-by: Chris Packham 
> ---

Looks good,
Acked-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] PPC/CAS Add support for power9 in ibm_architecture_vec

2016-11-29 Thread Denis Kirjanov
On 11/29/16, Balbir Singh  wrote:
>
>
> The PVR list and IBM_ARCH_VEC_NRCORES_OFFSET have been updated.
> This provides the supported CPU versions to the hypervisor and in this case
> tells the hypervisor that the guest supports ISA 3.0 and Power9.
>
> Signed-off-by: Balbir Singh 

Michael rewrote the code so you have to update the patch.
See https://patchwork.ozlabs.org/patch/658627/
> ---
>  arch/powerpc/include/asm/prom.h | 2 ++
>  arch/powerpc/kernel/prom_init.c | 7 +--
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/prom.h
> b/arch/powerpc/include/asm/prom.h
> index 7f436ba..785bc6b 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -121,6 +121,8 @@ struct of_drconf_cell {
>  #define OV1_PPC_2_06 0x02/* set if we support PowerPC 2.06 */
>  #define OV1_PPC_2_07 0x01/* set if we support PowerPC 2.07 */
>
> +#define OV1_PPC_3_00 0x80/* set if we support PowerPC 3.00 */
> +
>  /* Option vector 2: Open Firmware options supported */
>  #define OV2_REAL_MODE0x20/* set if we want OF in real 
> mode */
>
> diff --git a/arch/powerpc/kernel/prom_init.c
> b/arch/powerpc/kernel/prom_init.c
> index 88ac964..2a8d6b0 100644
> --- a/arch/powerpc/kernel/prom_init.c
> +++ b/arch/powerpc/kernel/prom_init.c
> @@ -659,6 +659,8 @@ unsigned char ibm_architecture_vec[] = {
>   W(0x), W(0x004b),   /* POWER8E */
>   W(0x), W(0x004c),   /* POWER8NVL */
>   W(0x), W(0x004d),   /* POWER8 */
> + W(0x), W(0x004e),   /* POWER9 */
> + W(0x), W(0x0f05),   /* all 3.00-compliant */
>   W(0x), W(0x0f04),   /* all 2.07-compliant */
>   W(0x), W(0x0f03),   /* all 2.06-compliant */
>   W(0x), W(0x0f02),   /* all 2.05-compliant */
> @@ -666,10 +668,11 @@ unsigned char ibm_architecture_vec[] = {
>   NUM_VECTORS(6), /* 6 option vectors */
>
>   /* option vector 1: processor architectures supported */
> - VECTOR_LENGTH(2),   /* length */
> + VECTOR_LENGTH(3),   /* length */
>   0,  /* don't ignore, don't halt */
>   OV1_PPC_2_00 | OV1_PPC_2_01 | OV1_PPC_2_02 | OV1_PPC_2_03 |
>   OV1_PPC_2_04 | OV1_PPC_2_05 | OV1_PPC_2_06 | OV1_PPC_2_07,
> + OV1_PPC_3_00,
>
>   /* option vector 2: Open Firmware options supported */
>   VECTOR_LENGTH(33),  /* length */
> @@ -720,7 +723,7 @@ unsigned char ibm_architecture_vec[] = {
>* must match by the macro below. Update the definition if
>* the structure layout changes.
>*/
> -#define IBM_ARCH_VEC_NRCORES_OFFSET  133
> +#define IBM_ARCH_VEC_NRCORES_OFFSET  150
>   W(NR_CPUS), /* number of cores supported */
>   0,
>   0,
> --
> 2.5.5
>
>


Re: [1/3] powerpc/64e: convert cmpi to cmpwi in head_64.S

2016-11-29 Thread Michael Ellerman
On Wed, 2016-11-23 at 13:02:07 UTC, Nicholas Piggin wrote:
> >From 80f23935cadb ("powerpc: Convert cmp to cmpd in idle enter sequence"):
> 
> PowerPC's "cmp" instruction has four operands. Normally people write
> "cmpw" or "cmpd" for the second cmp operand 0 or 1. But, frequently
> people forget, and write "cmp" with just three operands.
> 
> With older binutils this is silently accepted as if this was "cmpw",
> while often "cmpd" is wanted. With newer binutils GAS will complain
> about this for 64-bit code. For 32-bit code it still silently assumes
> "cmpw" is what is meant.
> 
> In this instance the code comes directly from ISA v2.07, including the
> cmp, but cmpd is correct. Backport to stable so that new toolchains can
> build old kernels.
> 
> In this case, cmpwi is called for, so this is just a build fix for
> new toolchains.
> 
> Stable: v3.0
> Cc: Segher Boessenkool 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f87f253bac3ce4a4eb2a60a1ae604d

cheers


[PATCH v11 3/8] kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
implementation to find free memory for the purgatory stack.

Signed-off-by: Thiago Jung Bauermann 
Acked-by: Dave Young 
Signed-off-by: Michael Ellerman 
---
 include/linux/kexec.h |  1 +
 kernel/kexec_file.c   | 25 -
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 437ef1b47428..a33f63351f86 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -176,6 +176,7 @@ struct kexec_buf {
 int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
   int (*func)(u64, u64, void *));
 extern int kexec_add_buffer(struct kexec_buf *kbuf);
+int kexec_locate_mem_hole(struct kexec_buf *kbuf);
 #endif /* CONFIG_KEXEC_FILE */
 
 struct kimage {
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index efd2c094af7e..0c2df7f73792 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -450,6 +450,23 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
 }
 
 /**
+ * kexec_locate_mem_hole - find free memory for the purgatory or the next 
kernel
+ * @kbuf:  Parameters for the memory search.
+ *
+ * On success, kbuf->mem will have the start address of the memory region 
found.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int kexec_locate_mem_hole(struct kexec_buf *kbuf)
+{
+   int ret;
+
+   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
+
+   return ret == 1 ? 0 : -EADDRNOTAVAIL;
+}
+
+/**
  * kexec_add_buffer - place a buffer in a kexec segment
  * @kbuf:  Buffer contents and memory parameters.
  *
@@ -489,11 +506,9 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
-   if (ret != 1) {
-   /* A suitable memory range could not be found for buffer */
-   return -EADDRNOTAVAIL;
-   }
+   ret = kexec_locate_mem_hole(kbuf);
+   if (ret)
+   return ret;
 
/* Found a suitable memory range */
ksegment = &kbuf->image->segment[kbuf->image->nr_segments];
-- 
2.7.4



[PATCH v11 7/8] powerpc/kexec: Enable kexec_file_load() syscall

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

Define the Kconfig symbol so that the kexec_file_load() code can be
built, and wire up the syscall so that it can be called.

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/Kconfig   | 13 +
 arch/powerpc/include/asm/systbl.h  |  1 +
 arch/powerpc/include/asm/unistd.h  |  2 +-
 arch/powerpc/include/uapi/asm/unistd.h |  1 +
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6cb59c6e5ba4..897d0f14447d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -455,6 +455,19 @@ config KEXEC
  interface is strongly in flux, so no good recommendation can be
  made.
 
+config KEXEC_FILE
+   bool "kexec file based system call"
+   select KEXEC_CORE
+   select BUILD_BIN2C
+   depends on PPC64
+   depends on CRYPTO=y
+   depends on CRYPTO_SHA256=y
+   help
+ This is a new version of the kexec system call. This call is
+ file based and takes in file descriptors as system call arguments
+ for kernel and initramfs as opposed to a list of segments as is the
+ case for the older kexec call.
+
 config RELOCATABLE
bool "Build a relocatable kernel"
depends on (PPC64 && !COMPILE_TEST) || (FLATMEM && (44x || FSL_BOOKE))
diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 2fc5d4db503c..4b369d83fe9c 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -386,3 +386,4 @@ SYSCALL(mlock2)
 SYSCALL(copy_file_range)
 COMPAT_SYS_SPU(preadv2)
 COMPAT_SYS_SPU(pwritev2)
+SYSCALL(kexec_file_load)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index e8cdfec8d512..eb1acee91a20 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls382
+#define NR_syscalls383
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index e9f5f41aa55a..2f26335a3c42 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -392,5 +392,6 @@
 #define __NR_copy_file_range   379
 #define __NR_preadv2   380
 #define __NR_pwritev2  381
+#define __NR_kexec_file_load   382
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.7.4



[PATCH v11 8/8] powerpc: Enable CONFIG_KEXEC_FILE in powerpc server defconfigs.

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

Enable CONFIG_KEXEC_FILE in powernv_defconfig, ppc64_defconfig and
pseries_defconfig.

It depends on CONFIG_CRYPTO_SHA256=y, so add that as well.

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/configs/powernv_defconfig | 2 ++
 arch/powerpc/configs/ppc64_defconfig   | 2 ++
 arch/powerpc/configs/pseries_defconfig | 2 ++
 3 files changed, 6 insertions(+)

diff --git a/arch/powerpc/configs/powernv_defconfig 
b/arch/powerpc/configs/powernv_defconfig
index d98b6eb3254f..5a190aa5534b 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -49,6 +49,7 @@ CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_HOTPLUG_CPU=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_NUMA=y
 CONFIG_MEMORY_HOTPLUG=y
@@ -301,6 +302,7 @@ CONFIG_CRYPTO_CCM=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
diff --git a/arch/powerpc/configs/ppc64_defconfig 
b/arch/powerpc/configs/ppc64_defconfig
index 58a98d40086f..0059d2088b9c 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -46,6 +46,7 @@ CONFIG_HZ_100=y
 CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_CRASH_DUMP=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTREMOVE=y
@@ -336,6 +337,7 @@ CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index 8a3bc016b732..f022f657a984 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -52,6 +52,7 @@ CONFIG_HZ_100=y
 CONFIG_BINFMT_MISC=m
 CONFIG_PPC_TRANSACTIONAL_MEM=y
 CONFIG_KEXEC=y
+CONFIG_KEXEC_FILE=y
 CONFIG_IRQ_ALL_CPUS=y
 CONFIG_MEMORY_HOTPLUG=y
 CONFIG_MEMORY_HOTREMOVE=y
@@ -303,6 +304,7 @@ CONFIG_CRYPTO_TEST=m
 CONFIG_CRYPTO_PCBC=m
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_MICHAEL_MIC=m
+CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_TGR192=m
 CONFIG_CRYPTO_WP512=m
 CONFIG_CRYPTO_ANUBIS=m
-- 
2.7.4



[PATCH v11 2/8] kexec_file: Change kexec_add_buffer to take kexec_buf as argument.

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

This is done to simplify the kexec_add_buffer argument list.
Adapt all callers to set up a kexec_buf to pass to kexec_add_buffer.

In addition, change the type of kexec_buf.buffer from char * to void *.
There is no particular reason for it to be a char *, and the change
allows us to get rid of 3 existing casts to char * in the code.
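The refactoring pattern -- folding a long positional argument list into
a descriptor struct with an output field -- can be sketched in C;
struct buf_desc and the toy bump allocator below are invented for
illustration, not the real kexec code:

```c
/* Descriptor bundling what used to be several positional arguments. */
struct buf_desc {
	const void *buffer;
	unsigned long bufsz;
	unsigned long memsz;
	unsigned long buf_align;	/* assumed to be a power of two */
	unsigned long mem;		/* out: chosen load address */
};

/* Toy placement: round a running cursor up to the alignment and
 * report the result through kbuf->mem, like kexec_add_buffer() does. */
static int add_buffer(struct buf_desc *kbuf, unsigned long *cursor)
{
	unsigned long align = kbuf->buf_align ? kbuf->buf_align : 1;

	kbuf->mem = (*cursor + align - 1) & ~(align - 1);
	*cursor = kbuf->mem + kbuf->memsz;
	return 0;
}
```

Callers fill only the fields they care about via designated
initializers, matching the `struct kexec_buf kbuf = { .image = image,
... }` style used in the patch.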

Signed-off-by: Thiago Jung Bauermann 
Acked-by: Dave Young 
Acked-by: Balbir Singh 
Signed-off-by: Michael Ellerman 
---
 arch/x86/kernel/crash.c   | 37 
 arch/x86/kernel/kexec-bzimage64.c | 48 +++--
 include/linux/kexec.h |  8 +---
 kernel/kexec_file.c   | 88 ++-
 4 files changed, 87 insertions(+), 94 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 650830e39e3a..3741461c63a0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -631,9 +631,9 @@ static int determine_backup_region(u64 start, u64 end, void 
*arg)
 
 int crash_load_segments(struct kimage *image)
 {
-   unsigned long src_start, src_sz, elf_sz;
-   void *elf_addr;
int ret;
+   struct kexec_buf kbuf = { .image = image, .buf_min = 0,
+ .buf_max = ULONG_MAX, .top_down = false };
 
/*
 * Determine and load a segment for backup area. First 640K RAM
@@ -647,43 +647,44 @@ int crash_load_segments(struct kimage *image)
if (ret < 0)
return ret;
 
-   src_start = image->arch.backup_src_start;
-   src_sz = image->arch.backup_src_sz;
-
/* Add backup segment. */
-   if (src_sz) {
+   if (image->arch.backup_src_sz) {
+   kbuf.buffer = &crash_zero_bytes;
+   kbuf.bufsz = sizeof(crash_zero_bytes);
+   kbuf.memsz = image->arch.backup_src_sz;
+   kbuf.buf_align = PAGE_SIZE;
/*
 * Ideally there is no source for backup segment. This is
 * copied in purgatory after crash. Just add a zero filled
 * segment for now to make sure checksum logic works fine.
 */
-   ret = kexec_add_buffer(image, (char *)&crash_zero_bytes,
-  sizeof(crash_zero_bytes), src_sz,
-  PAGE_SIZE, 0, -1, 0,
-  &image->arch.backup_load_addr);
+   ret = kexec_add_buffer(&kbuf);
if (ret)
return ret;
+   image->arch.backup_load_addr = kbuf.mem;
pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
-image->arch.backup_load_addr, src_start, src_sz);
+image->arch.backup_load_addr,
+image->arch.backup_src_start, kbuf.memsz);
}
 
/* Prepare elf headers and add a segment */
-   ret = prepare_elf_headers(image, &elf_addr, &elf_sz);
+   ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
if (ret)
return ret;
 
-   image->arch.elf_headers = elf_addr;
-   image->arch.elf_headers_sz = elf_sz;
+   image->arch.elf_headers = kbuf.buffer;
+   image->arch.elf_headers_sz = kbuf.bufsz;
 
-   ret = kexec_add_buffer(image, (char *)elf_addr, elf_sz, elf_sz,
-   ELF_CORE_HEADER_ALIGN, 0, -1, 0,
-   &image->arch.elf_load_addr);
+   kbuf.memsz = kbuf.bufsz;
+   kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+   ret = kexec_add_buffer(&kbuf);
if (ret) {
vfree((void *)image->arch.elf_headers);
return ret;
}
+   image->arch.elf_load_addr = kbuf.mem;
pr_debug("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
-image->arch.elf_load_addr, elf_sz, elf_sz);
+image->arch.elf_load_addr, kbuf.bufsz, kbuf.bufsz);
 
return ret;
 }
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 3407b148c240..d0a814a9d96a 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -331,17 +331,17 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
 
struct setup_header *header;
int setup_sects, kern16_size, ret = 0;
-   unsigned long setup_header_size, params_cmdline_sz, params_misc_sz;
+   unsigned long setup_header_size, params_cmdline_sz;
struct boot_params *params;
unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
unsigned long purgatory_load_addr;
-   unsigned long kernel_bufsz, kernel_memsz, kernel_align;
-   char *kernel_buf;
struct bzimage64_data *ldata;
struct kexec_entry64_regs regs64;
void *stack;
unsigned int setup_hdr_offset = 

[PATCH v11 4/8] powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

Commit 2965faa5e03d ("kexec: split kexec_load syscall from kexec core
code") introduced CONFIG_KEXEC_CORE so that CONFIG_KEXEC means whether
the kexec_load system call should be compiled-in and CONFIG_KEXEC_FILE
means whether the kexec_file_load system call should be compiled-in.
These options can be set independently from each other.

Since until now powerpc only supported kexec_load, CONFIG_KEXEC and
CONFIG_KEXEC_CORE were synonyms. That is not the case anymore, so we
need to make a distinction. Almost all places where CONFIG_KEXEC was
being used should be using CONFIG_KEXEC_CORE instead, since
kexec_file_load also needs that code compiled in.

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/Kconfig  | 2 +-
 arch/powerpc/include/asm/debug.h  | 2 +-
 arch/powerpc/include/asm/kexec.h  | 6 +++---
 arch/powerpc/include/asm/machdep.h| 4 ++--
 arch/powerpc/include/asm/smp.h| 2 +-
 arch/powerpc/kernel/Makefile  | 4 ++--
 arch/powerpc/kernel/head_64.S | 2 +-
 arch/powerpc/kernel/misc_32.S | 2 +-
 arch/powerpc/kernel/misc_64.S | 6 +++---
 arch/powerpc/kernel/prom.c| 2 +-
 arch/powerpc/kernel/setup_64.c| 4 ++--
 arch/powerpc/kernel/smp.c | 6 +++---
 arch/powerpc/kernel/traps.c   | 2 +-
 arch/powerpc/platforms/85xx/corenet_generic.c | 2 +-
 arch/powerpc/platforms/85xx/smp.c | 8 
 arch/powerpc/platforms/cell/spu_base.c| 2 +-
 arch/powerpc/platforms/powernv/setup.c| 6 +++---
 arch/powerpc/platforms/ps3/setup.c| 4 ++--
 arch/powerpc/platforms/pseries/Makefile   | 2 +-
 arch/powerpc/platforms/pseries/setup.c| 4 ++--
 20 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65fba4c34cd7..6cb59c6e5ba4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -489,7 +489,7 @@ config CRASH_DUMP
 
 config FA_DUMP
bool "Firmware-assisted dump"
-   depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC
+   depends on PPC64 && PPC_RTAS && CRASH_DUMP && KEXEC_CORE
help
  A robust mechanism to get reliable kernel crash dump with
  assistance from firmware. This approach does not use kexec,
diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index a954e4975049..86308f177f2d 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -10,7 +10,7 @@ struct pt_regs;
 
 extern struct dentry *powerpc_debugfs_root;
 
-#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
+#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC_CORE)
 
 extern int (*__debugger)(struct pt_regs *regs);
 extern int (*__debugger_ipi)(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index a46f5f45570c..eca2f975bf44 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -53,7 +53,7 @@
 
 typedef void (*crash_shutdown_t)(void);
 
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
 
 /*
  * This function is responsible for capturing register states if coming
@@ -91,7 +91,7 @@ static inline bool kdump_in_progress(void)
return crashing_cpu >= 0;
 }
 
-#else /* !CONFIG_KEXEC */
+#else /* !CONFIG_KEXEC_CORE */
 static inline void crash_kexec_secondary(struct pt_regs *regs) { }
 
 static inline int overlaps_crashkernel(unsigned long start, unsigned long size)
@@ -116,7 +116,7 @@ static inline bool kdump_in_progress(void)
return false;
 }
 
-#endif /* CONFIG_KEXEC */
+#endif /* CONFIG_KEXEC_CORE */
 #endif /* ! __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_KEXEC_H */
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index e02cbc6a6c70..5011b69107a7 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -183,7 +183,7 @@ struct machdep_calls {
 */
void (*machine_shutdown)(void);
 
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_KEXEC_CORE
void (*kexec_cpu_down)(int crash_shutdown, int secondary);
 
/* Called to do what every setup is needed on image and the
@@ -198,7 +198,7 @@ struct machdep_calls {
 * no return.
 */
void (*machine_kexec)(struct kimage *image);
-#endif /* CONFIG_KEXEC */
+#endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_SUSPEND
/* These are called to disable and enable, respectively, IRQs when
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 0d02c11dc331..32db16d2e7ad 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -176,7 +176,7 @@ static inline void 

[PATCH v11 1/8] kexec_file: Allow arch-specific memory walking for kexec_add_buffer

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

Allow architectures to specify a different memory walking function for
kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
PowerPC uses the memblock subsystem.

Signed-off-by: Thiago Jung Bauermann 
Acked-by: Dave Young 
Acked-by: Balbir Singh 
Signed-off-by: Michael Ellerman 
---
 include/linux/kexec.h   | 29 -
 kernel/kexec_file.c | 30 ++
 kernel/kexec_internal.h | 16 
 3 files changed, 50 insertions(+), 25 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 406c33dcae13..5e320ddaaa82 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -148,7 +148,34 @@ struct kexec_file_ops {
kexec_verify_sig_t *verify_sig;
 #endif
 };
-#endif
+
+/**
+ * struct kexec_buf - parameters for finding a place for a buffer in memory
+ * @image: kexec image in which memory to search.
+ * @buffer:Contents which will be copied to the allocated memory.
+ * @bufsz: Size of @buffer.
+ * @mem:   On return will have address of the buffer in memory.
+ * @memsz: Size for the buffer in memory.
+ * @buf_align: Minimum alignment needed.
+ * @buf_min:   The buffer can't be placed below this address.
+ * @buf_max:   The buffer can't be placed above this address.
+ * @top_down:  Allocate from top of memory.
+ */
+struct kexec_buf {
+   struct kimage *image;
+   char *buffer;
+   unsigned long bufsz;
+   unsigned long mem;
+   unsigned long memsz;
+   unsigned long buf_align;
+   unsigned long buf_min;
+   unsigned long buf_max;
+   bool top_down;
+};
+
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+  int (*func)(u64, u64, void *));
+#endif /* CONFIG_KEXEC_FILE */
 
 struct kimage {
kimage_entry_t head;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 037c321c5618..f865674bff51 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -428,6 +428,27 @@ static int locate_mem_hole_callback(u64 start, u64 end, void *arg)
return locate_mem_hole_bottom_up(start, end, kbuf);
 }
 
+/**
+ * arch_kexec_walk_mem - call func(data) on free memory regions
+ * @kbuf:  Context info for the search. Also passed to @func.
+ * @func:  Function to call for each memory region.
+ *
+ * Return: The memory walk will stop when func returns a non-zero value
+ * and that value will be returned. If all free regions are visited without
+ * func returning non-zero, then zero will be returned.
+ */
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+  int (*func)(u64, u64, void *))
+{
+   if (kbuf->image->type == KEXEC_TYPE_CRASH)
+   return walk_iomem_res_desc(crashk_res.desc,
+  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+  crashk_res.start, crashk_res.end,
+  kbuf, func);
+   else
+   return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
+}
+
 /*
  * Helper function for placing a buffer in a kexec segment. This assumes
  * that kexec_mutex is held.
@@ -474,14 +495,7 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
kbuf->top_down = top_down;
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   if (image->type == KEXEC_TYPE_CRASH)
-   ret = walk_iomem_res_desc(crashk_res.desc,
-   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
-   crashk_res.start, crashk_res.end, kbuf,
-   locate_mem_hole_callback);
-   else
-   ret = walk_system_ram_res(0, -1, kbuf,
- locate_mem_hole_callback);
+   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
if (ret != 1) {
/* A suitable memory range could not be found for buffer */
return -EADDRNOTAVAIL;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index 0a52315d9c62..4cef7e4706b0 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -20,22 +20,6 @@ struct kexec_sha_region {
unsigned long len;
 };
 
-/*
- * Keeps track of buffer parameters as provided by caller for requesting
- * memory placement of buffer.
- */
-struct kexec_buf {
-   struct kimage *image;
-   char *buffer;
-   unsigned long bufsz;
-   unsigned long mem;
-   unsigned long memsz;
-   unsigned long buf_align;
-   unsigned long buf_min;
-   unsigned long buf_max;
-   bool top_down;  /* allocate from top of memory hole */
-};
-
 void kimage_file_post_load_cleanup(struct kimage *image);
 #else /* CONFIG_KEXEC_FILE */
 static 

[PATCH v11 6/8] powerpc: Add purgatory for kexec_file_load() implementation.

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

This purgatory implementation is based on the versions from kexec-tools
and kexec-lite, with additional changes.

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/Makefile  |   1 +
 arch/powerpc/kernel/machine_kexec_64.c |   2 +-
 arch/powerpc/purgatory/.gitignore  |   2 +
 arch/powerpc/purgatory/Makefile|  15 
 arch/powerpc/purgatory/trampoline.S| 128 +
 5 files changed, 147 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/purgatory/.gitignore
 create mode 100644 arch/powerpc/purgatory/Makefile
 create mode 100644 arch/powerpc/purgatory/trampoline.S

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 617dece67924..5e7dcdaf93f5 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -249,6 +249,7 @@ core-y  += arch/powerpc/kernel/ \
 core-$(CONFIG_XMON)+= arch/powerpc/xmon/
 core-$(CONFIG_KVM) += arch/powerpc/kvm/
 core-$(CONFIG_PERF_EVENTS) += arch/powerpc/perf/
+core-$(CONFIG_KEXEC_FILE)  += arch/powerpc/purgatory/
 
 drivers-$(CONFIG_OPROFILE) += arch/powerpc/oprofile/
 
diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c
index a205fa3d9bf3..5c12e21d0d1a 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -310,7 +310,7 @@ void default_machine_kexec(struct kimage *image)
if (!kdump_in_progress())
kexec_prepare_cpus();
 
-   pr_debug("kexec: Starting switchover sequence.\n");
+   printk("kexec: Starting switchover sequence.\n");
 
/* switch to a staticly allocated stack.  Based on irq stack code.
 * We setup preempt_count to avoid using VMX in memcpy.
diff --git a/arch/powerpc/purgatory/.gitignore b/arch/powerpc/purgatory/.gitignore
new file mode 100644
index ..e9e66f178a6d
--- /dev/null
+++ b/arch/powerpc/purgatory/.gitignore
@@ -0,0 +1,2 @@
+kexec-purgatory.c
+purgatory.ro
diff --git a/arch/powerpc/purgatory/Makefile b/arch/powerpc/purgatory/Makefile
new file mode 100644
index ..ac8793c13348
--- /dev/null
+++ b/arch/powerpc/purgatory/Makefile
@@ -0,0 +1,15 @@
+targets += trampoline.o purgatory.ro kexec-purgatory.c
+
+LDFLAGS_purgatory.ro := -e purgatory_start -r --no-undefined
+
+$(obj)/purgatory.ro: $(obj)/trampoline.o FORCE
+   $(call if_changed,ld)
+
+CMD_BIN2C = $(objtree)/scripts/basic/bin2c
+quiet_cmd_bin2c = BIN2C   $@
+  cmd_bin2c = $(CMD_BIN2C) kexec_purgatory < $< > $@
+
+$(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
+   $(call if_changed,bin2c)
+
+obj-y  += kexec-purgatory.o
diff --git a/arch/powerpc/purgatory/trampoline.S b/arch/powerpc/purgatory/trampoline.S
new file mode 100644
index ..f9760ccf4032
--- /dev/null
+++ b/arch/powerpc/purgatory/trampoline.S
@@ -0,0 +1,128 @@
+/*
+ * kexec trampoline
+ *
+ * Based on code taken from kexec-tools and kexec-lite.
+ *
+ * Copyright (C) 2004 - 2005, Milton D Miller II, IBM Corporation
+ * Copyright (C) 2006, Mohan Kumar M, IBM Corporation
+ * Copyright (C) 2013, Anton Blanchard, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it under
+ * the terms of the GNU General Public License as published by the Free
+ * Software Foundation (version 2 of the License).
+ */
+
+#if defined(__LITTLE_ENDIAN__)
+#define STWX_BE stwbrx
+#define LWZX_BE lwbrx
+#elif defined(__BIG_ENDIAN__)
+#define STWX_BE stwx
+#define LWZX_BE lwzx
+#else
+#error no endianness defined!
+#endif
+
+   .machine ppc64
+   .balign 256
+   .globl purgatory_start
+purgatory_start:
+   b   master
+
+   /* ABI: possible run_at_load flag at 0x5c */
+   .org purgatory_start + 0x5c
+   .globl run_at_load
+run_at_load:
+   .long 0
+   .size run_at_load, . - run_at_load
+
+   /* ABI: slaves start at 60 with r3=phys */
+   .org purgatory_start + 0x60
+slave:
+   b .
+   /* ABI: end of copied region */
+   .org purgatory_start + 0x100
+   .size purgatory_start, . - purgatory_start
+
+/*
+ * The above 0x100 bytes at purgatory_start are replaced with the
+ * code from the kernel (or next stage) by setup_purgatory().
+ */
+
+master:
+   or  %r1,%r1,%r1 /* low priority to let other threads catchup */
+   isync
+   mr  %r17,%r3/* save cpu id to r17 */
+   mr  %r15,%r4/* save physical address in reg15 */
+
+   or  %r3,%r3,%r3 /* ok now to high priority, lets boot */
+   lis %r6,0x1
+   mtctr   %r6 /* delay a bit for slaves to catch up */
+   bdnz .  /* before we overwrite 0-100 again */
+
+   bl  0f  /* Work out where we're running */

Re: [3/3] powerpc/64e: don't branch to dot symbols

2016-11-29 Thread Michael Ellerman
On Wed, 2016-11-23 at 13:02:09 UTC, Nicholas Piggin wrote:
> This converts one that was missed by b1576fec7f4d ("powerpc: No need
> to use dot symbols when branching to a function").
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ae88f7b9af17a1267f5dd5b87a4487

cheers


Re: [v7,1/7] powerpc/mm: Rename hugetlb-radix.h to hugetlb.h

2016-11-29 Thread Michael Ellerman
On Mon, 2016-11-28 at 06:16:58 UTC, "Aneesh Kumar K.V" wrote:
> We will start moving some book3s specific hugetlb functions there.
> 
> Signed-off-by: Aneesh Kumar K.V 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/bee8b3b56d1dfc4075254a61340ee3

cheers


[PATCH] PPC/CAS Add support for power9 in ibm_architecture_vec

2016-11-29 Thread Balbir Singh


The PVR list and IBM_ARCH_VEC_NRCORES_OFFSET have been updated.
This provides the cpu versions supported by the guest to the hypervisor,
and in this case tells the hypervisor that the guest supports ISA 3.0
and Power9.
Signed-off-by: Balbir Singh 
---
 arch/powerpc/include/asm/prom.h | 2 ++
 arch/powerpc/kernel/prom_init.c | 7 +--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 7f436ba..785bc6b 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -121,6 +121,8 @@ struct of_drconf_cell {
 #define OV1_PPC_2_06   0x02    /* set if we support PowerPC 2.06 */
 #define OV1_PPC_2_07   0x01    /* set if we support PowerPC 2.07 */
 
+#define OV1_PPC_3_00   0x80    /* set if we support PowerPC 3.00 */
+
 /* Option vector 2: Open Firmware options supported */
 #define OV2_REAL_MODE  0x20    /* set if we want OF in real mode */
 
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 88ac964..2a8d6b0 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -659,6 +659,8 @@ unsigned char ibm_architecture_vec[] = {
W(0x), W(0x004b),   /* POWER8E */
W(0x), W(0x004c),   /* POWER8NVL */
W(0x), W(0x004d),   /* POWER8 */
+   W(0x), W(0x004e),   /* POWER9 */
+   W(0x), W(0x0f05),   /* all 3.00-compliant */
W(0x), W(0x0f04),   /* all 2.07-compliant */
W(0x), W(0x0f03),   /* all 2.06-compliant */
W(0x), W(0x0f02),   /* all 2.05-compliant */
@@ -666,10 +668,11 @@ unsigned char ibm_architecture_vec[] = {
NUM_VECTORS(6), /* 6 option vectors */
 
/* option vector 1: processor architectures supported */
-   VECTOR_LENGTH(2),   /* length */
+   VECTOR_LENGTH(3),   /* length */
0,  /* don't ignore, don't halt */
OV1_PPC_2_00 | OV1_PPC_2_01 | OV1_PPC_2_02 | OV1_PPC_2_03 |
OV1_PPC_2_04 | OV1_PPC_2_05 | OV1_PPC_2_06 | OV1_PPC_2_07,
+   OV1_PPC_3_00,
 
/* option vector 2: Open Firmware options supported */
VECTOR_LENGTH(33),  /* length */
@@ -720,7 +723,7 @@ unsigned char ibm_architecture_vec[] = {
 * must match by the macro below. Update the definition if
 * the structure layout changes.
 */
-#define IBM_ARCH_VEC_NRCORES_OFFSET    133
+#define IBM_ARCH_VEC_NRCORES_OFFSET    150
W(NR_CPUS), /* number of cores supported */
0,
0,
-- 
2.5.5



Re: [1/3] powerpc: Stop passing ARCH=ppc64 to boot Makefile

2016-11-29 Thread Michael Ellerman
On Mon, 2016-11-21 at 10:14:33 UTC, Michael Ellerman wrote:
> Back in 2005 when the ppc/ppc64 merge started, we used to build the
> kernel code in arch/powerpc but use the boot code from arch/ppc or
> arch/ppc64 depending on whether we were building for 32 or 64-bit.
> 
> Originally we called the boot Makefile passing ARCH=$(OLDARCH), where
> OLDARCH was ppc or ppc64.
> 
> In commit 20f629549b30 ("powerpc: Make building the boot image work for
> both 32-bit and 64-bit") (2005-10-11) we split the call for 32/64-bit
> using an ifeq check, because the two Makefiles took different targets,
> and explicitly passed ARCH=ppc64 for the 64-bit case and ARCH=ppc for
> the 32-bit case.
> 
> Then in commit 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over
> to arch/powerpc") (2005-11-16) we moved the boot code into arch/powerpc
> and dropped the ppc case, but kept passing ARCH=ppc64 to
> arch/powerpc/boot/Makefile.
> 
> Since then there have been several more boot targets added, all of which
> have copied the ARCH=ppc64 setting, such that now we have four targets
> using it.
> 
> Currently it seems that nothing actually uses the ARCH value, but that's
> basically just luck, and in particular it prevents us from using the
> generic cpp_lds_S rule. It's also clearly wrong, ARCH=ppc64 is dead,
> buried and cremated.
> 
> Fix it by dropping the setting of ARCH completely, the correct value is
> exported by the top level Makefile.
> 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/1196d7aaebf6cdad619310fe283422

cheers


[PATCH v11 0/8] powerpc: Implement kexec_file_load()

2016-11-29 Thread Michael Ellerman
This is v11 of the kexec_file_load() for powerpc series.

I've stripped this down to the minimum we need, so we can get this in for 4.10.
Any additions can come later incrementally.

If no one objects I'll merge this via the powerpc tree. The three kexec patches
have been acked by Dave Young (since forever), and have been in linux-next (via
akpm's tree) also for a long time.

cheers


v11 (Michael Ellerman):
 - Strip back purgatory to the minimal trampoline required. This avoids
   complexity in the purgatory environment where all exceptions are fatal.
 - Reorder the series so we don't start advertising the config symbol, or more
   importantly the syscall, until they're actually implemented.


Original cover letter by Thiago:

This patch series implements the kexec_file_load system call on PowerPC.

This system call moves the reading of the kernel, initrd and the device tree
from the userspace kexec tool to the kernel. This is needed if you want to
do one or both of the following:

1. only allow loading of signed kernels.
2. "measure" (i.e., record the hashes of) the kernel, initrd, kernel
   command line and other boot inputs for the Integrity Measurement
   Architecture subsystem.

The above are the functions kexec already has built into kexec_file_load.
Yesterday I posted a set of patches which allows a third feature:

3. have IMA pass-on its event log (where integrity measurements are
   registered) across kexec to the second kernel, so that the event
   history is preserved.

Because OpenPower uses an intermediary Linux instance as a boot loader
(skiroot), feature 1 is needed to implement secure boot for the platform,
while features 2 and 3 are needed to implement trusted boot.

This patch series starts by removing an x86 assumption from kexec_file:
kexec_add_buffer uses iomem to find reserved memory ranges, but PowerPC
uses the memblock subsystem.  A hook is added so that each arch can
specify how memory ranges can be found.

Also, the memory-walking logic in kexec_add_buffer is useful in this
implementation to find a free area for the purgatory's stack, so the
next patch moves that logic to kexec_locate_mem_hole.

The kexec_file_load system call needs to apply relocations to the
purgatory but adding code for that would duplicate functionality with
the module loading mechanism, which also needs to apply relocations to
the kernel modules.  Therefore, this patch series factors out the module
relocation code so that it can be shared.

One thing that is still missing is crashkernel support, which I intend
to submit shortly. For now, arch_kexec_kernel_image_probe rejects crash
kernels.

This code is based on kexec-tools, but with many modifications to adapt
it to the kernel environment and facilities.


[PATCH v11 5/8] powerpc: Add support code for kexec_file_load()

2016-11-29 Thread Michael Ellerman
From: Thiago Jung Bauermann 

This patch adds the support code needed for implementing
kexec_file_load() on powerpc.

This consists of functions to load the ELF kernel, either big or little
endian, and set up the purgatory environment which switches from the first
kernel to the second kernel.

None of this code is built yet, as it depends on CONFIG_KEXEC_FILE which
we have not yet defined. Although we could define CONFIG_KEXEC_FILE in
this patch, we'd then have a window in history where the kconfig symbol
is present but the syscall is not, which would be awkward.

Signed-off-by: Josh Sklar 
Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/kexec.h|  10 +
 arch/powerpc/kernel/Makefile|   1 +
 arch/powerpc/kernel/kexec_elf_64.c  | 663 
 arch/powerpc/kernel/machine_kexec_file_64.c | 338 
 4 files changed, 1012 insertions(+)
 create mode 100644 arch/powerpc/kernel/kexec_elf_64.c
 create mode 100644 arch/powerpc/kernel/machine_kexec_file_64.c

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index eca2f975bf44..6c3b71502fbc 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -91,6 +91,16 @@ static inline bool kdump_in_progress(void)
return crashing_cpu >= 0;
 }
 
+#ifdef CONFIG_KEXEC_FILE
+extern struct kexec_file_ops kexec_elf64_ops;
+
+int setup_purgatory(struct kimage *image, const void *slave_code,
+   const void *fdt, unsigned long kernel_load_addr,
+   unsigned long fdt_load_addr);
+int setup_new_fdt(void *fdt, unsigned long initrd_load_addr,
+ unsigned long initrd_len, const char *cmdline);
+#endif /* CONFIG_KEXEC_FILE */
+
 #else /* !CONFIG_KEXEC_CORE */
 static inline void crash_kexec_secondary(struct pt_regs *regs) { }
 
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 22534a56c914..41d8ff34ae27 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_PCI)   += pci_$(BITS).o $(pci64-y) \
 obj-$(CONFIG_PCI_MSI)  += msi.o
 obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o crash.o \
   machine_kexec_$(BITS).o
+obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file_$(BITS).o kexec_elf_$(BITS).o
 obj-$(CONFIG_AUDIT)+= audit.o
 obj64-$(CONFIG_AUDIT)  += compat_audit.o
 
diff --git a/arch/powerpc/kernel/kexec_elf_64.c b/arch/powerpc/kernel/kexec_elf_64.c
new file mode 100644
index ..6acffd34a70f
--- /dev/null
+++ b/arch/powerpc/kernel/kexec_elf_64.c
@@ -0,0 +1,663 @@
+/*
+ * Load ELF vmlinux file for the kexec_file_load syscall.
+ *
+ * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
+ * Copyright (C) 2004  IBM Corp.
+ * Copyright (C) 2005  R Sharada (shar...@in.ibm.com)
+ * Copyright (C) 2006  Mohan Kumar M (mo...@in.ibm.com)
+ * Copyright (C) 2016  IBM Corporation
+ *
+ * Based on kexec-tools' kexec-elf-exec.c and kexec-elf-ppc64.c.
+ * Heavily modified for the kernel by
+ * Thiago Jung Bauermann .
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation (version 2 of the License).
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "kexec_elf: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define PURGATORY_STACK_SIZE   (16 * 1024)
+
+#define elf_addr_to_cpu elf64_to_cpu
+
+#ifndef Elf_Rel
+#define Elf_Rel Elf64_Rel
+#endif /* Elf_Rel */
+
+struct elf_info {
+   /*
+* Where the ELF binary contents are kept.
+* Memory managed by the user of the struct.
+*/
+   const char *buffer;
+
+   const struct elfhdr *ehdr;
+   const struct elf_phdr *proghdrs;
+   struct elf_shdr *sechdrs;
+};
+
+static inline bool elf_is_elf_file(const struct elfhdr *ehdr)
+{
+   return memcmp(ehdr->e_ident, ELFMAG, SELFMAG) == 0;
+}
+
+static uint64_t elf64_to_cpu(const struct elfhdr *ehdr, uint64_t value)
+{
+   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+   value = le64_to_cpu(value);
+   else if (ehdr->e_ident[EI_DATA] == ELFDATA2MSB)
+   value = be64_to_cpu(value);
+
+   return value;
+}
+
+static uint16_t elf16_to_cpu(const struct elfhdr *ehdr, uint16_t value)
+{
+   if (ehdr->e_ident[EI_DATA] == ELFDATA2LSB)
+   value = le16_to_cpu(value);
+   else if 

Re: [PATCH v4 3/7] PCI: Separate VF BAR updates from standard BAR updates

2016-11-29 Thread Bjorn Helgaas
On Tue, Nov 29, 2016 at 03:55:46PM +1100, Gavin Shan wrote:
> On Mon, Nov 28, 2016 at 10:15:06PM -0600, Bjorn Helgaas wrote:
> >Previously pci_update_resource() used the same code path for updating
> >standard BARs and VF BARs in SR-IOV capabilities.
> >
> >Split the VF BAR update into a new pci_iov_update_resource() internal
> >interface, which makes it simpler to compute the BAR address (we can get
> >rid of pci_resource_bar() and pci_iov_resource_bar()).
> >
> >This patch:
> >
> >  - Renames pci_update_resource() to pci_std_update_resource(),
> >  - Adds pci_iov_update_resource(),
> >  - Makes pci_update_resource() a wrapper that calls the appropriate one,
> >
> >No functional change intended.
> >
> >Signed-off-by: Bjorn Helgaas 
> 
> With below minor comments fixed:
> 
> Reviewed-by: Gavin Shan 
> 
> >---
> > drivers/pci/iov.c   |   49 +++
> > drivers/pci/pci.h   |1 +
> > drivers/pci/setup-res.c |   13 +++-
> > 3 files changed, 61 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> >index d41ec29..d00ed5c 100644
> >--- a/drivers/pci/iov.c
> >+++ b/drivers/pci/iov.c
> >@@ -571,6 +571,55 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno)
> > 4 * (resno - PCI_IOV_RESOURCES);
> > }
> >
> >+/**
> >+ * pci_iov_update_resource - update a VF BAR
> >+ * @dev: the PCI device
> >+ * @resno: the resource number
> >+ *
> >+ * Update a VF BAR in the SR-IOV capability of a PF.
> >+ */
> >+void pci_iov_update_resource(struct pci_dev *dev, int resno)
> >+{
> >+struct pci_sriov *iov = dev->is_physfn ? dev->sriov : NULL;
> >+struct resource *res = dev->resource + resno;
> >+int vf_bar = resno - PCI_IOV_RESOURCES;
> >+struct pci_bus_region region;
> >+u32 new;
> >+int reg;
> >+
> >+/*
> >+ * The generic pci_restore_bars() path calls this for all devices,
> >+ * including VFs and non-SR-IOV devices.  If this is not a PF, we
> >+ * have nothing to do.
> >+ */
> >+if (!iov)
> >+return;
> >+
> >+/*
> >+ * Ignore unimplemented BARs, unused resource slots for 64-bit
> >+ * BARs, and non-movable resources, e.g., those described via
> >+ * Enhanced Allocation.
> >+ */
> >+if (!res->flags)
> >+return;
> >+
> >+if (res->flags & IORESOURCE_UNSET)
> >+return;
> >+
> >+if (res->flags & IORESOURCE_PCI_FIXED)
> >+return;
> >+
> >+pcibios_resource_to_bus(dev->bus, &region, res);
> >+new = region.start;
> >+
> 
> The bits indicating the BAR's property (e.g. memory, IO etc) are missed in @new.

Hmm, yes.  I omitted those because those bits are supposed to be
read-only, per spec (PCI r3.0, sec 6.2.5.1).  But I guess it would be
more conservative to keep them, and this shouldn't be needlessly
different from pci_std_update_resource().

However, I don't think this code in pci_update_resource() is obviously
correct:

  new = region.start | (res->flags & PCI_REGION_FLAG_MASK);

PCI_REGION_FLAG_MASK is 0xf.  For memory BARs, bits 0-3 are read-only
property bits.  For I/O BARs, bits 0-1 are read-only and bits 2-3 are
part of the address, so on the face of it, the above could corrupt two
bits of an I/O address.

It's true that decode_bar() initializes flags correctly, using
PCI_BASE_ADDRESS_IO_MASK for I/O BARs and PCI_BASE_ADDRESS_MEM_MASK
for memory BARs, but it would take a little more digging to be sure
that we never set bits 2-3 of flags for an I/O resource elsewhere.

How about this in pci_std_update_resource():

pcibios_resource_to_bus(dev->bus, &region, res);
new = region.start;

if (res->flags & IORESOURCE_IO) {
mask = (u32)PCI_BASE_ADDRESS_IO_MASK;
new |= res->flags & ~PCI_BASE_ADDRESS_IO_MASK;
} else {
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;
new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;
}

and this in pci_iov_update_resource():

	pcibios_resource_to_bus(dev->bus, &region, res);
	new = region.start;
	new |= res->flags & ~PCI_BASE_ADDRESS_MEM_MASK;

It shouldn't fix anything, but I think it is more obvious that we
can't corrupt bits 2-3 of an I/O BAR.

> >+	reg = iov->pos + PCI_SRIOV_BAR + 4 * vf_bar;
> >+	pci_write_config_dword(dev, reg, new);
> >+	if (res->flags & IORESOURCE_MEM_64) {
> >+		new = region.start >> 16 >> 16;
> 
> I think it was copied from pci_update_resource(). Why can't we just
> use "new = region.start >> 32"?

Right; I did copy this from pci_update_resource().  The changelog from
cf7bee5a0bf2 ("[PATCH] Fix restore of 64-bit PCI BAR's") says "Also
make sure to write high bits - use "x >> 16 >> 16" (rather than the
simpler ">> 32") to avoid warnings on 32-bit architectures where we're
not going to have any high bits."

I didn't take the time to revalidate whether that's still applicable.

Re: [PATCH] scsi/ipr: Fix runaway IRQs when falling back from MSI to LSI

2016-11-29 Thread Martin K. Petersen
> "Benjamin" == Benjamin Herrenschmidt  writes:

Benjamin> LSIs must be ack'ed with an MMIO otherwise they remain
Benjamin> asserted forever. This is controlled by the "clear_isr" flag.

Benjamin> While we set that flag properly when deciding initially
Benjamin> whether to use LSIs or MSIs, we fail to set it if we first
Benjamin> choose MSIs, the MSI test fails, and we then fall back to LSIs.

Brian: Please review!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH v2 04/14] cxlflash: Avoid command room violation

2016-11-29 Thread Matthew R. Ochs
Uma,

This looks better, thanks for reworking.


-matt

> On Nov 28, 2016, at 6:41 PM, Uma Krishnan  wrote:
> 
> During test, a command room violation interrupt is occasionally seen
> for the master context when the CXL flash devices are stressed.
> 
> After studying the code, there could be gaps in the way the command
> room value is cached in cxlflash. When the cached command room is zero
> the thread attempting to send becomes burdened with updating the cached
> value with the actual value from the AFU. Today, this is handled with an
> atomic set operation of the raw value read. Following the atomic update,
> the thread proceeds to send.
> 
> This behavior is incorrect on two counts:
> 
>   - The update fails to take into account the current thread and its
> consumption of one of the hardware commands.
> 
>   - The update does not take into account other threads also atomically
> updating. Per design, a worker thread updates the cached value when a
> send thread times out. By not protecting the update with a lock, the
> cached value can be incorrectly clobbered.
> 
> To correct these issues, the update of the cached command room has been
> simplified and also protected using a spin lock which is held until the
> MMIO is complete. This ensures the command room is properly consumed by
> the same thread. Update of cached value also takes into account the
> current thread consuming a hardware command.
> 
> Signed-off-by: Uma Krishnan 

Acked-by: Matthew R. Ochs 



powerpc/ps3: Fix system hang with GCC 5 builds

2016-11-29 Thread Geoff Levand

GCC 5 generates different code for this bootwrapper null check
that causes the PS3 to hang very early in its bootup.  This
check is of limited value, so just get rid of it.

Signed-off-by: Geoff Levand 
---
 arch/powerpc/boot/ps3-head.S | 5 -
 arch/powerpc/boot/ps3.c  | 8 +---
 2 files changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/powerpc/boot/ps3-head.S b/arch/powerpc/boot/ps3-head.S
index b6fcbaf..3dc44b0 100644
--- a/arch/powerpc/boot/ps3-head.S
+++ b/arch/powerpc/boot/ps3-head.S
@@ -57,11 +57,6 @@ __system_reset_overlay:
bctr
 
 1:

-   /* Save the value at addr zero for a null pointer write check later. */
-
-   li  r4, 0
-   lwz r3, 0(r4)
-
/* Primary delays then goes to _zimage_start in wrapper. */
 
 	or	31, 31, 31 /* db16cyc */

diff --git a/arch/powerpc/boot/ps3.c b/arch/powerpc/boot/ps3.c
index 4ec2d86..a05558a 100644
--- a/arch/powerpc/boot/ps3.c
+++ b/arch/powerpc/boot/ps3.c
@@ -119,13 +119,12 @@ void ps3_copy_vectors(void)
flush_cache((void *)0x100, 512);
 }
 
-void platform_init(unsigned long null_check)
+void platform_init(void)
 {
const u32 heapsize = 0x100 - (u32)_end; /* 16MiB */
void *chosen;
unsigned long ft_addr;
u64 rm_size;
-   unsigned long val;
 
 	console_ops.write = ps3_console_write;

platform_ops.exit = ps3_exit;
@@ -153,11 +152,6 @@ void platform_init(unsigned long null_check)
 
 	printf(" flat tree at 0x%lx\n\r", ft_addr);
 
-	val = *(unsigned long *)0;
-
-	if (val != null_check)
-		printf("null check failed: %lx != %lx\n\r", val, null_check);
-
((kernel_entry_t)0)(ft_addr, 0, NULL);
 
 	ps3_exit();

--
2.7.4