Re: [PATCH] powerpc/xive: skip ioremap() of ESB pages for LSI interrupts

2019-12-03 Thread Daniel Axtens
Cédric Le Goater  writes:

> The PCI INTx interrupts and other LSI interrupts are handled differently
> under a sPAPR platform. When the interrupt source characteristics are
> queried, the hypervisor returns an H_INT_ESB flag to inform the OS
> that it should be using the H_INT_ESB hcall for interrupt management
> and not loads and stores on the interrupt ESB pages.
>
> A default -1 value is returned for the addresses of the ESB pages. The
> driver ignores this condition today and performs a bogus IO mapping.
> Recent changes and the DEBUG_VM configuration option make the bug
> visible with :
>
> [0.015518] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612!
> [0.015578] Oops: Exception in kernel mode, sig: 5 [#1]
> [0.015627] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries
> [0.015697] Modules linked in:
> [0.015739] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-0.rc6.git0.1.fc32.ppc64le #1
> [0.015812] NIP:  c0f63294 LR: c0f62e44 CTR: 
> [0.015889] REGS: c000fa45f0d0 TRAP: 0700   Not tainted  (5.4.0-0.rc6.git0.1.fc32.ppc64le)
> [0.015971] MSR:  82029033   CR: 44000424  XER: 
> [0.016050] CFAR: c0f63128 IRQMASK: 0
> [0.016050] GPR00: c0f62e44 c000fa45f360 c1be5400 
> [0.016050] GPR04: c19c7d38 c000fa340030 fa330009 c1c15e18
> [0.016050] GPR08: 0040 ffe0  8418dd352dbd190f
> [0.016050] GPR12:  c1e0 c00a8006 c00a8006
> [0.016050] GPR16:  81ae c1c24d98 
> [0.016050] GPR20: c00a8007 c1cafca0 c00a8007 
> [0.016050] GPR24: c00a8008 c00a8008 c1cafca8 c00a8008
> [0.016050] GPR28: c000fa32e010 c00a8006  c000fa33
> [0.016711] NIP [c0f63294] ioremap_page_range+0x4c4/0x6e0
> [0.016778] LR [c0f62e44] ioremap_page_range+0x74/0x6e0
> [0.016846] Call Trace:
> [0.016876] [c000fa45f360] [c0f62e44] ioremap_page_range+0x74/0x6e0 (unreliable)
> [0.016969] [c000fa45f460] [c00934bc] do_ioremap+0x8c/0x120
> [0.017037] [c000fa45f4b0] [c00938e8] __ioremap_caller+0x128/0x140
> [0.017116] [c000fa45f500] [c00931a0] ioremap+0x30/0x50
> [0.017184] [c000fa45f520] [c00d1380] xive_spapr_populate_irq_data+0x170/0x260
> [0.017263] [c000fa45f5c0] [c00cc90c] xive_irq_domain_map+0x8c/0x170
> [0.017344] [c000fa45f600] [c0219124] irq_domain_associate+0xb4/0x2d0
> [0.017424] [c000fa45f690] [c0219fe0] irq_create_mapping+0x1e0/0x3b0
> [0.017506] [c000fa45f730] [c021ad6c] irq_create_fwspec_mapping+0x27c/0x3e0
> [0.017586] [c000fa45f7c0] [c021af68] irq_create_of_mapping+0x98/0xb0
> [0.017666] [c000fa45f830] [c08d4e48] of_irq_parse_and_map_pci+0x168/0x230
> [0.017746] [c000fa45f910] [c0075428] pcibios_setup_device+0x88/0x250
> [0.017826] [c000fa45f9a0] [c0077b84] pcibios_setup_bus_devices+0x54/0x100
> [0.017906] [c000fa45fa10] [c00793f0] __of_scan_bus+0x160/0x310
> [0.017973] [c000fa45faf0] [c0075fc0] pcibios_scan_phb+0x330/0x390
> [0.018054] [c000fa45fba0] [c139217c] pcibios_init+0x8c/0x128
> [0.018121] [c000fa45fc20] [c00107b0] do_one_initcall+0x60/0x2c0
> [0.018201] [c000fa45fcf0] [c1384624] kernel_init_freeable+0x290/0x378
> [0.018280] [c000fa45fdb0] [c0010d24] kernel_init+0x2c/0x148
> [0.018348] [c000fa45fe20] [c000bdbc] ret_from_kernel_thread+0x5c/0x80
> [0.018427] Instruction dump:
> [0.018468] 41820014 3920fe7f 7d494838 7d290074 7929d182 f8e10038 69290001 0b09
> [0.018552] 7a098420 0b09 7bc95960 7929a802 <0b09> 7fc68b78 e8610048 7dc47378

I hit this too, and your patch works for me. I can't claim to understand
it, but I can verify it! :)

Tested-by: Daniel Axtens 

Regards,
Daniel
>
> Cc: stable@vger.kernel.org # v4.14+
> Fixes: bed81ee181dd ("powerpc/xive: introduce H_INT_ESB hcall")
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/sysdev/xive/spapr.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
> index 33c10749edec..55dc61cb4867 100644
> --- a/arch/powerpc/sysdev/xive/spapr.c
> +++ b/arch/powerpc/sysdev/xive/spapr.c
> @@ -392,20 +392,28 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
>   data->esb_shift = esb_shift;
>   data->trig_page = trig_page;
>  
> + 

[PATCH] powerpc/pmem: Fix kernel crash due to wrong range value usage in flush_dcache_range

2019-12-03 Thread Aneesh Kumar K.V
This patch fixes the kernel crash below: flush_dcache_range_chunked() used
min(stop, start + chunk) as the end address of every flush instead of
min(stop, i + chunk), so the flush bound never advanced with the loop.

 BUG: Unable to handle kernel data access on read at 0xc0038000
 Faulting instruction address: 0xc008b6f0
cpu 0x5: Vector: 300 (Data Access) at [c000d8587790]
pc: c008b6f0: arch_remove_memory+0x150/0x210
lr: c008b720: arch_remove_memory+0x180/0x210
sp: c000d8587a20
   msr: 8280b033
   dar: c0038000
 dsisr: 4000
  current = 0xc000d8558600
  paca= 0xcfff8f00   irqmask: 0x03   irq_happened: 0x01
pid   = 1220, comm = ndctl
enter ? for help
 memunmap_pages+0x33c/0x410
 devm_action_release+0x30/0x50
 release_nodes+0x30c/0x3a0
 device_release_driver_internal+0x178/0x240
 unbind_store+0x74/0x190
 drv_attr_store+0x44/0x60
 sysfs_kf_write+0x74/0xa0
 kernfs_fop_write+0x1b0/0x260
 __vfs_write+0x3c/0x70
 vfs_write+0xe4/0x200
 ksys_write+0x7c/0x140
 system_call+0x5c/0x68
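
The failure mode is easiest to see with concrete numbers. Below is a sketch
of the two bounds (illustrative values; it assumes flush_dcache_range()
derives its loop count from the unsigned difference stop - start, so an end
address below the start wraps to a huge length):

	/* chunk = 0x1000, start = 0x1000, stop = 0x5000 (illustrative)
	 *
	 * old bound min(stop, start + chunk) == 0x2000 on every pass:
	 *   i = 0x1000: flush [0x1000, 0x2000)  correct
	 *   i = 0x2000: flush [0x2000, 0x2000)  empty
	 *   i = 0x3000: flush [0x3000, 0x2000)  end < start, the unsigned
	 *               size wraps and the flush walks into unmapped space,
	 *               which is the data access fault above
	 *
	 * new bound min(stop, i + chunk) advances with i, so each pass
	 * flushes exactly its own chunk, with the last clamped to stop.
	 */
	for (i = start; i < stop; i += chunk) {
		flush_dcache_range(i, min(stop, i + chunk));
		cond_resched();
	}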

Fixes: 076265907cf9 ("powerpc: Chunk calls to flush_dcache_range in arch_*_memory")
Reported-by: Sachin Sant 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ad299e72ec30..9488b63dfc87 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -121,7 +121,7 @@ static void flush_dcache_range_chunked(unsigned long start, unsigned long stop,
unsigned long i;
 
for (i = start; i < stop; i += chunk) {
-   flush_dcache_range(i, min(stop, start + chunk));
+   flush_dcache_range(i, min(stop, i + chunk));
cond_resched();
}
 }
-- 
2.23.0



Re: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-03 Thread David Gibson
On Wed, Dec 04, 2019 at 12:08:09PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 04/12/2019 11:49, Ram Pai wrote:
> > On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 04/12/2019 03:52, Ram Pai wrote:
> >>> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> 
> 
>  On 03/12/2019 15:05, Ram Pai wrote:
> > On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 03/12/2019 13:08, Ram Pai wrote:
> >>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> 
> 
>  On 02/12/2019 17:45, Ram Pai wrote:
> > H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as 
> > one of
> > its parameters. One page is dedicated per cpu, for the lifetime of 
> > the
> > kernel for this purpose. On secure VMs, contents of this page, when
> > accessed by the hypervisor, retrieves encrypted TCE entries.  
> > Hypervisor
> > needs to know the unencrypted entries, to update the TCE table
> > accordingly.  There is nothing secret or sensitive about these 
> > entries.
> > Hence share the page with the hypervisor.
> 
>  This unsecures a page in the guest in a random place which creates an
>  additional attack surface which is hard to exploit indeed but
>  nevertheless it is there.
>  A safer option would be not to use the hcall-multi-tce hypertas
>  option (which translates to FW_FEATURE_MULTITCE in the guest).
> >>>
> >>>
> >>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets 
> >>> invoked
> >>> automatically when IOMMU option is enabled.
> >>
> >> It is advertised by QEMU but the guest does not have to use it.
> >
> > Are you suggesting that even normal-guest, not use hcall-multi-tce?
> > or just secure-guest?  
> 
> 
>  Just secure.
> >>>
> >>> hmm..  how are the TCE entries communicated to the hypervisor, if
> >>> hcall-multi-tce is disabled?
> >>
> >> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
> >> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
> >> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
> >> optimization.
> > 
> > Do you still think, secure-VM should use H_PUT_TCE and not
> > H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
> > Is there any advantage of special casing it for secure-VMs.
> 
> 
> Reducing the amount of insecure memory at random location.

The other approach we could use for that - which would still allow
H_PUT_TCE_INDIRECT, would be to allocate the TCE buffer page from the
same pool that we use for the bounce buffers.  I assume there must
already be some sort of allocator for that?

> > In fact, we could make use of as much optimization as possible.
> > 
> > 
> >>
> >> Is not this for pci+swiotlb? 
> > ..snip..
> > This patch is purely to help the hypervisor setup the TCE table, in the
> > presence of a IOMMU.
> 
>  Then the hypervisor should be able to access the guest pages mapped for
>  DMA and these pages should be made unsecure for this to work. Where/when
>  does this happen?
> >>>
> >>> This happens in the SWIOTLB code.  The code to do that is already
> >>> upstream.  
> >>>
> >>> The sharing of the pages containing the SWIOTLB bounce buffers is done
> >>> in init_svm() which calls swiotlb_update_mem_attributes() which calls
> >>> set_memory_decrypted().  In the case of pseries, set_memory_decrypted() 
> >>> calls 
> >>> uv_share_page().
> >>
> >>
> >> This does not seem enough as when you enforce iommu_platform=on, QEMU
> >> starts accessing virtio buffers via IOMMU so bounce buffers have to be
> >> mapped explicitly, via H_PUT_TCE, where does this happen?
> >>
> > 
> > I think, it happens at boot time. Every page of the guest memory is TCE
> > mapped, if iommu is enabled. SWIOTLB pages get implicitly TCE-mapped
> > as part of that operation.
> 
> 
> Ah I see. This works via the huge dma window. Ok, makes sense now.
> 
> It just seems like a waste that we could map swiotlb 1:1 via the always
> existing small DMA window but instead we rely on a huge window to map
> these small buffers. This way we are wasting the entire 32bit window and
> most of the huge window. We may fix it in the future (not right now) but
> for now I would still avoid unsecuring additional memory. Thanks,
> 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v2 1/3] powernv/iov: Ensure the pdn for VFs always contains a valid PE number

2019-12-03 Thread Shawn Anastasio

On 10/28/19 3:54 AM, Oliver O'Halloran wrote:

On pseries there is a bug with adding hotplugged devices to an IOMMU group.
For a number of dumb reasons fixing that bug first requires re-working how
VFs are configured on PowerNV.


Are these patches expected to land soon, or is there another fix in the
pipeline? As far as I can tell this is still an issue.


Re: [PATCH v4 2/2] powerpc/irq: inline call_do_irq() and call_do_softirq()

2019-12-03 Thread Christophe Leroy

Hi,

On 29/11/2019 at 19:46, Segher Boessenkool wrote:

Hi!

On Wed, Nov 27, 2019 at 04:15:15PM +0100, Christophe Leroy wrote:

On 27/11/2019 at 15:59, Segher Boessenkool wrote:

On Wed, Nov 27, 2019 at 02:50:30PM +0100, Christophe Leroy wrote:

So what do we do ? We just drop the "r2" clobber ?


You have to make sure your asm code works for all ABIs.  This is quite
involved if you do a call to an external function.  The compiler does
*not* see this call, so you will have to make sure that all that the
compiler and linker do will work, or prevent some of those things (say,
inlining of the function containing the call).


But the whole purpose of the patch is to inline the call to __do_irq()
in order to avoid the trampoline function.


Yes, so you call __do_irq.  You have to make sure that what you tell the
compiler -- and what you *don't tell the compiler -- works with what the
ABIs require, and what the called function expects and provides.


That does not fix everything.  The called function requires a specific
value in r2 on entry.


Er ... but there is nothing like that when using the existing
call_do_irq().



How does GCC know that call_do_irq() has the same TOC as __do_irq()?


The existing call_do_irq isn't C code.  It doesn't do anything with r2,
as far as I can see; __do_irq just gets whatever the caller of call_do_irq
has.

So I guess all the callers of call_do_irq have the correct r2 value always
already?  In that case everything Just Works.


Indeed, there is only one caller for call_do_irq() which is do_IRQ().
And do_IRQ() is also calling __do_irq() directly (when the stack pointer 
is already set to IRQ stack). do_IRQ() and __do_irq() are both in 
arch/powerpc/kernel/irq.c


As far as I can see, when replacing the call to call_do_irq() by a call 
to __do_irq(), the compiler doesn't do anything special with r2, and 
doesn't add any nop after the bl either, whereas for all calls outside 
irq.c a nop is added. So I guess that's OK?
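
A minimal illustration of the r2/TOC point being discussed (a sketch
assuming the ELFv2 ABI, where the TOC save slot is at 24(r1); the function
names are made up):

	extern void external_fn(void);  /* may use a different TOC */
	static void local_fn(void) { }  /* same translation unit as the caller */

	void caller(void)
	{
		local_fn();     /* emitted as: bl local_fn
				 * same TOC, so no restore needed, no nop */
		external_fn();  /* emitted as: bl external_fn ; nop
				 * the linker may patch the nop into a TOC
				 * restore, ld r2, 24(r1), if the callee
				 * turns out to use a different TOC */
	}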


Now that call_do_irq() is inlined, we can even define __do_irq() as static.

And that's the same for do_softirq_own_stack(), it is only called from 
do_softirq() which is defined in the same file as __do_softirq() 
(kernel/softirq.c)


Christophe


RE: [v5 1/3] soc: fsl: handle RCPM errata A-008646 on SoC LS1021A

2019-12-03 Thread Biwen Li
Hi All,

1. The following patches need to be applied before this series.
1.1 Flextimer dts
https://lore.kernel.org/patchwork/series/405653/mbox/
(https://lore.kernel.org/patchwork/patch/1112493/)
1.2 RCPM driver
https://lore.kernel.org/patchwork/patch/1143809/mbox/ ([v10,1/3] PM: wakeup: Add routine to help fetch wakeup source object)
https://lore.kernel.org/patchwork/patch/1143810/mbox/ ([v10,2/3] Documentation: dt: binding: fsl: Add 'little-endian' and update Chassis define)
https://lore.kernel.org/patchwork/patch/1143811/mbox/ ([v10,3/3] soc: fsl: add RCPM driver)

> Subject: [v5 1/3] soc: fsl: handle RCPM errata A-008646 on SoC LS1021A
> 
> Description:
>   - Reading configuration register RCPM_IPPDEXPCR1
> always returns zero
> 
> Workaround:
>   - Save register RCPM_IPPDEXPCR1's value to
> register SCFG_SPARECR8. (U-Boot's PSCI also
> needs to read the value from register SCFG_SPARECR8
> to set register RCPM_IPPDEXPCR1.)
> 
> Impact:
>   - The FlexTimer module cannot wake the system from
> deep sleep on SoC LS1021A
> 
> Signed-off-by: Biwen Li 
> ---
> Change in v5:
>   - update the patch because the RCPM driver has been updated.
> 
> Change in v4:
>   - rename property name
> fsl,ippdexpcr-alt-addr -> fsl,ippdexpcr1-alt-addr
> 
> Change in v3:
>   - update commit message
>   - rename property name
> fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr
> 
> Change in v2:
>   - fix style problems
> 
>  drivers/soc/fsl/rcpm.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 45 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c
> index a093dbe6d2cb..775c618f0456 100644
> --- a/drivers/soc/fsl/rcpm.c
> +++ b/drivers/soc/fsl/rcpm.c
> @@ -6,13 +6,16 @@
>  //
>  // Author: Ran Wang 
> 
> +#include 
>  #include 
> +#include 
> +#include 
>  #include 
> -#include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
> -#include 
> 
>  #define RCPM_WAKEUP_CELL_MAX_SIZE	7
> 
> @@ -37,6 +40,9 @@ static int rcpm_pm_prepare(struct device *dev)
>   struct device_node  *np = dev->of_node;
>   u32 value[RCPM_WAKEUP_CELL_MAX_SIZE + 1];
>   u32 setting[RCPM_WAKEUP_CELL_MAX_SIZE] = {0};
> + struct regmap *scfg_addr_regmap = NULL;
> + u32 reg_offset[RCPM_WAKEUP_CELL_MAX_SIZE + 1];
> + u32 reg_value = 0;
> 
>   rcpm = dev_get_drvdata(dev);
>   if (!rcpm)
> @@ -90,6 +96,43 @@ static int rcpm_pm_prepare(struct device *dev)
>   tmp |= ioread32be(address);
>   iowrite32be(tmp, address);
>   }
> + /*
> +  * Workaround for errata A-008646 on SoC LS1021A:
> +  * register IPPDEXPCR1 is buggy, so reading configuration
> +  * register RCPM_IPPDEXPCR1 always returns zero. Save
> +  * IPPDEXPCR1's value to register SCFG_SPARECR8, and read
> +  * the value of IPPDEXPCR1 back from SCFG_SPARECR8.
> +  */
> + if (device_property_present(dev, "fsl,ippdexpcr1-alt-addr")) {
> + if (dev_of_node(dev)) {
> + scfg_addr_regmap = syscon_regmap_lookup_by_phandle(np,
> + "fsl,ippdexpcr1-alt-addr");
> + } else if (is_acpi_node(dev->fwnode)) {
> + dev_err(dev, "not support acpi for rcpm\n");
> + continue;
> + }
> +
> + if (scfg_addr_regmap && (i == 1)) {
> + if (device_property_read_u32_array(dev,
> + "fsl,ippdexpcr1-alt-addr",
> + reg_offset,
> + 1 + sizeof(u64)/sizeof(u32))) {
> + scfg_addr_regmap = NULL;
> + continue;
> + }
> + /* Read value from register SCFG_SPARECR8 */
> + regmap_read(scfg_addr_regmap,
> + (u32)(((u64)(reg_offset[1] << (sizeof(u32) * 8) |
> + reg_offset[2])) & 0xffffffff),
> + &reg_value);
> + /* Write value to register SCFG_SPARECR8 */
> + regmap_write(scfg_addr_regmap,
> +  (u32)(((u64)(reg_offset[1] << (sizeof(u32) * 8) |
> +  reg_offset[2])) & 0xffffffff),
> +  tmp | reg_value);
> + }
> + }
>   }
> 
>   return 0;
> --
> 2.17.1
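
For reference, the offset arithmetic in the quoted hunk reduces to roughly
the following (a sketch inferred only from the code above; the property is
assumed to carry a phandle cell plus a 64-bit register offset split across
two u32 cells):

	/* reg_offset[0] is the phandle cell; [1] and [2] are the high and
	 * low halves of the SCFG_SPARECR8 offset, reassembled and then
	 * truncated to the 32-bit offset that regmap expects: */
	u32 offset = (u32)((((u64)reg_offset[1] << 32) | reg_offset[2]) & 0xffffffff);

	regmap_read(scfg_addr_regmap, offset, &reg_value);        /* saved IPPDEXPCR1 bits */
	regmap_write(scfg_addr_regmap, offset, tmp | reg_value);  /* mirror the new bits */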



Re: [PATCH] powerpc/mm: Remove kvm radix prefetch workaround for Power9 DD2.2

2019-12-03 Thread Oliver O'Halloran
On Mon, Dec 2, 2019 at 2:08 PM Jordan Niethe  wrote:
>
> Commit a25bd72badfa ("powerpc/mm/radix: Workaround prefetch issue with
> KVM") introduced a number of workarounds, as coming out of a guest with
> the mmu enabled would make the cpu start running in hypervisor
> state with the PID value from the guest. The cpu will then start
> prefetching for the hypervisor with that PID value.
>
> In Power9 DD2.2 the cpu behaviour was modified to fix this. When
> accessing Quadrant 0 in hypervisor mode with LPID != 0 prefetching will
> not be performed. This means that we can get rid of the workarounds for
> Power9 DD2.2 and later revisions.
>
> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  9 +
>  arch/powerpc/mm/book3s64/radix_pgtable.c | 18 --
>  arch/powerpc/mm/book3s64/radix_tlb.c |  5 +
>  3 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index faebcbb8c4db..6bbc5fbc7ea9 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1793,6 +1793,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> tlbsync
> ptesync
>
> +   /* We do not need this workaround from POWER9 DD2.2 onwards */
> +   mfspr   r3, SPRN_PVR
> +   srwir6, r3, 16
> +   cmpwi   cr0, r6, PVR_POWER9
> +   bne cr0, 2f
> +   andi.   r3, r3, 0xfff
> +   cmpwi   cr0, r3, 0x202
> +   bge cr0, 2f

This seems like the sort of thing we'd want to use a CPU_FTR_SECTION()
for. We probably want to have a CPU_FTR for this anyway so we're not
open-coding the PVR check all over the place.
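
For illustration, with a dedicated feature bit the C side could look
something like the sketch below (CPU_FTR_P9_RADIX_PREFETCH_BUG is a
hypothetical name, assumed to be set during feature init for POWER9 parts
earlier than DD2.2), and the assembly hunk could become a
BEGIN_FTR_SECTION/END_FTR_SECTION_IFSET block keyed on the same bit:

	/* hypothetical feature bit instead of open-coded PVR checks */
	if (cpu_has_feature(CPU_FTR_P9_RADIX_PREFETCH_BUG))
		mmu_base_pid = 1 << (mmu_pid_bits - 1); /* keep clear of guest PIDs */
	else
		mmu_base_pid = 1;                       /* fixed parts: full PID space */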

> +
> /* Radix: Handle the case where the guest used an illegal PID */
> LOAD_REG_ADDR(r4, mmu_base_pid)
> lwz r3, VCPU_GUEST_PID(r9)
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 6ee17d09649c..1f280124994e 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -312,6 +312,7 @@ static void __init radix_init_pgtable(void)
>  {
> unsigned long rts_field;
> struct memblock_region *reg;
> +   unsigned int pvr;
>
> /* We don't support slb for radix */
> mmu_slb_size = 0;
> @@ -336,24 +337,29 @@ static void __init radix_init_pgtable(void)
> }
>
> /* Find out how many PID bits are supported */
> +   pvr = mfspr(SPRN_PVR);
> if (cpu_has_feature(CPU_FTR_HVMODE)) {
> if (!mmu_pid_bits)
> mmu_pid_bits = 20;
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> /*
> -* When KVM is possible, we only use the top half of the
> -* PID space to avoid collisions between host and guest PIDs
> -* which can cause problems due to prefetch when exiting the
> -* guest with AIL=3
> +* Before Power9 DD2.2, when KVM is possible, we only use the
> +* top half of the PID space to avoid collisions between host
> +* and guest PIDs which can cause problems due to prefetch when
> +* exiting the guest with AIL=3
>  */
> -   mmu_base_pid = 1 << (mmu_pid_bits - 1);
> +   if (PVR_VER(pvr) == PVR_POWER9 && ((0xfff & pvr) < 0x202))
> +   mmu_base_pid = 1;
> +   else
> +   mmu_base_pid = 1 << (mmu_pid_bits - 1);
>  #else
> mmu_base_pid = 1;
>  #endif
> } else {
> /* The guest uses the bottom half of the PID space */
> if (!mmu_pid_bits)
> -   mmu_pid_bits = 19;
> +   mmu_pid_bits = (PVR_VER(pvr) == PVR_POWER9 &&
> +   ((0xfff & pvr) < 0x202)) ? 19 : 20;
> mmu_base_pid = 1;
> }
>
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 67af871190c6..cc86d8a88b86 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1217,10 +1217,15 @@ void radix__flush_tlb_all(void)
>  extern void radix_kvm_prefetch_workaround(struct mm_struct *mm)
>  {
> unsigned long pid = mm->context.id;
> +   unsigned int pvr;
>
> if (unlikely(pid == MMU_NO_CONTEXT))
> return;
>
> +   pvr = mfspr(SPRN_PVR);
> +   if (PVR_VER(pvr) != PVR_POWER9 || ((0xfff & pvr) >= 0x202))
> +   return;
> +
> /*
>  * If this context hasn't run on that CPU before and KVM is
>  * around, there's a slim chance that the guest on another
> --
> 2.20.1
>


Re: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-03 Thread Alexey Kardashevskiy



On 04/12/2019 11:49, Ram Pai wrote:
> On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 04/12/2019 03:52, Ram Pai wrote:
>>> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:


 On 03/12/2019 15:05, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 03/12/2019 13:08, Ram Pai wrote:
>>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:


 On 02/12/2019 17:45, Ram Pai wrote:
> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one 
> of
> its parameters. One page is dedicated per cpu, for the lifetime of the
> kernel for this purpose. On secure VMs, contents of this page, when
> accessed by the hypervisor, retrieves encrypted TCE entries.  
> Hypervisor
> needs to know the unencrypted entries, to update the TCE table
> accordingly.  There is nothing secret or sensitive about these 
> entries.
> Hence share the page with the hypervisor.

 This unsecures a page in the guest in a random place which creates an
 additional attack surface which is hard to exploit indeed but
 nevertheless it is there.
 A safer option would be not to use the
 hcall-multi-tce hypertas option (which translates FW_FEATURE_MULTITCE
 in the guest).
>>>
>>>
>>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
>>> automatically when IOMMU option is enabled.
>>
>> It is advertised by QEMU but the guest does not have to use it.
>
> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> or just secure-guest?  


 Just secure.
>>>
>>> hmm..  how are the TCE entries communicated to the hypervisor, if
>>> hcall-multi-tce is disabled?
>>
>> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
>> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
>> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
>> optimization.
> 
> Do you still think, secure-VM should use H_PUT_TCE and not
> H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
> Is there any advantage of special casing it for secure-VMs.


Reducing the amount of insecure memory at random location.


> In fact, we could make use of as much optimization as possible.
> 
> 
>>
>> Is not this for pci+swiotlb? 
> ..snip..
> This patch is purely to help the hypervisor setup the TCE table, in the
> presence of a IOMMU.

 Then the hypervisor should be able to access the guest pages mapped for
 DMA and these pages should be made unsecure for this to work. Where/when
 does this happen?
>>>
>>> This happens in the SWIOTLB code.  The code to do that is already
>>> upstream.  
>>>
>>> The sharing of the pages containing the SWIOTLB bounce buffers is done
>>> in init_svm() which calls swiotlb_update_mem_attributes() which calls
>>> set_memory_decrypted().  In the case of pseries, set_memory_decrypted() 
>>> calls 
>>> uv_share_page().
>>
>>
>> This does not seem enough as when you enforce iommu_platform=on, QEMU
>> starts accessing virtio buffers via IOMMU so bounce buffers have to be
>> mapped explicitly, via H_PUT_TCE, where does this happen?
>>
> 
> I think, it happens at boot time. Every page of the guest memory is TCE
> mapped, if iommu is enabled. SWIOTLB pages get implicitly TCE-mapped
> as part of that operation.


Ah I see. This works via the huge dma window. Ok, makes sense now.

It just seems like a waste that we could map swiotlb 1:1 via the always
existing small DMA window but instead we rely on a huge window to map
these small buffers. This way we are wasting the entire 32bit window and
most of the huge window. We may fix it in the future (not right now) but
for now I would still avoid unsecuring additional memory. Thanks,


-- 
Alexey


RE: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-03 Thread Ram Pai
On Wed, Dec 04, 2019 at 11:04:04AM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 04/12/2019 03:52, Ram Pai wrote:
> > On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 03/12/2019 15:05, Ram Pai wrote:
> >>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> 
> 
>  On 03/12/2019 13:08, Ram Pai wrote:
> > On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 02/12/2019 17:45, Ram Pai wrote:
> >>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one 
> >>> of
> >>> its parameters. One page is dedicated per cpu, for the lifetime of the
> >>> kernel for this purpose. On secure VMs, contents of this page, when
> >>> accessed by the hypervisor, retrieves encrypted TCE entries.  
> >>> Hypervisor
> >>> needs to know the unencrypted entries, to update the TCE table
> >>> accordingly.  There is nothing secret or sensitive about these 
> >>> entries.
> >>> Hence share the page with the hypervisor.
> >>
> >> This unsecures a page in the guest in a random place which creates an
> >> additional attack surface which is hard to exploit indeed but
> >> nevertheless it is there.
> >> A safer option would be not to use the
> >> hcall-multi-tce hypertas option (which translates FW_FEATURE_MULTITCE
> >> in the guest).
> >
> >
> > Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> > automatically when IOMMU option is enabled.
> 
>  It is advertised by QEMU but the guest does not have to use it.
> >>>
> >>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
> >>> or just secure-guest?  
> >>
> >>
> >> Just secure.
> > 
> > hmm..  how are the TCE entries communicated to the hypervisor, if
> > hcall-multi-tce is disabled?
> 
> Via H_PUT_TCE which updates 1 entry at once (sets or clears).
> hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
> H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
> optimization.

Do you still think, secure-VM should use H_PUT_TCE and not
H_PUT_TCE_INDIRECT?  And normal VM should use H_PUT_TCE_INDIRECT?
Is there any advantage of special casing it for secure-VMs.
In fact, we could make use of as much optimization as possible.


> 
>  Is not this for pci+swiotlb? 
..snip..
> >>> This patch is purely to help the hypervisor setup the TCE table, in the
> >>> presence of a IOMMU.
> >>
> >> Then the hypervisor should be able to access the guest pages mapped for
> >> DMA and these pages should be made unsecure for this to work. Where/when
> >> does this happen?
> > 
> > This happens in the SWIOTLB code.  The code to do that is already
> > upstream.  
> >
> > The sharing of the pages containing the SWIOTLB bounce buffers is done
> > in init_svm() which calls swiotlb_update_mem_attributes() which calls
> > set_memory_decrypted().  In the case of pseries, set_memory_decrypted() 
> > calls 
> > uv_share_page().
> 
> 
> This does not seem enough as when you enforce iommu_platform=on, QEMU
> starts accessing virtio buffers via IOMMU so bounce buffers have to be
> mapped explicitly, via H_PUT_TCE, where does this happen?
> 

I think, it happens at boot time. Every page of the guest memory is TCE
mapped, if iommu is enabled. SWIOTLB pages get implicitly TCE-mapped
as part of that operation.

RP



Re: [PATCH v2 00/27] Add support for OpenCAPI SCM devices

2019-12-03 Thread Dan Williams
On Tue, Dec 3, 2019 at 4:43 AM Matthew Wilcox  wrote:
>
> On Tue, Dec 03, 2019 at 03:01:17PM +1100, Alastair D'Silva wrote:
> > On Mon, 2019-12-02 at 19:50 -0800, Matthew Wilcox wrote:
> > > On Tue, Dec 03, 2019 at 02:46:28PM +1100, Alastair D'Silva wrote:
> > > > This series adds support for OpenCAPI SCM devices, exposing
> > >
> > > Could we _not_ introduce yet another term for persistent memory?
> > >
> >
> > "Storage Class Memory" is an industry wide term, and is used repeatedly
> > in the device specifications. It's not something that has been pulled
> > out of thin air.
>
> "Somebody else also wrote down Storage Class Memory".  Don't care.
> Google gets 750k hits for Persistent Memory and 150k hits for
> Storage Class Memory.  This term lost.
>
> > The term is also already in use within the 'papr_scm' driver.
>
> The acronym "SCM" is already in use.  Socket Control Messages go back
> to early Unix (SCM_RIGHTS, SCM_CREDENTIALS, etc).  Really, you're just
> making the case that IBM already uses the term SCM.  You should stop.

I tend to agree. The naming of things under
arch/powerpc/platforms/pseries/ is not under my purview, but
drivers/nvdimm/ is. Since this driver is colocated with the "pmem"
driver let's not proliferate the "scm" vs "pmem" confusion.


Re: [PATCH v2 02/27] nvdimm: remove prototypes for nonexistent functions

2019-12-03 Thread Dan Williams
On Mon, Dec 2, 2019 at 7:48 PM Alastair D'Silva  wrote:
>
> From: Alastair D'Silva 
>
> These functions don't exist, so remove the prototypes for them.
>
> Signed-off-by: Alastair D'Silva 
> Reviewed-by: Frederic Barrat 
> ---

This was already merged as commit:

cda93d6965a1 libnvdimm: Remove prototypes for nonexistent functions

You should have received a notification from the patchwork bot that it
was already accepted.

What baseline did you use for this submission?


Re: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-03 Thread Alexey Kardashevskiy



On 04/12/2019 03:52, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 03/12/2019 15:05, Ram Pai wrote:
>>> On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:


 On 03/12/2019 13:08, Ram Pai wrote:
> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 02/12/2019 17:45, Ram Pai wrote:
>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>> its parameters. One page is dedicated per cpu, for the lifetime of the
>>> kernel for this purpose. On secure VMs, contents of this page, when
>>> accessed by the hypervisor, retrieves encrypted TCE entries.  Hypervisor
>>> needs to know the unencrypted entries, to update the TCE table
>>> accordingly.  There is nothing secret or sensitive about these entries.
>>> Hence share the page with the hypervisor.
>>
>> This unsecures a page in the guest in a random place which creates an
>> additional attack surface which is hard to exploit indeed but
>> nevertheless it is there.
>> A safer option would be not to use the
>> hcall-multi-tce hypertas option (which translates FW_FEATURE_MULTITCE
>> in the guest).
>
>
> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> automatically when IOMMU option is enabled.

 It is advertised by QEMU but the guest does not have to use it.
>>>
>>> Are you suggesting that even normal-guest, not use hcall-multi-tce?
>>> or just secure-guest?  
>>
>>
>> Just secure.
> 
> hmm..  how are the TCE entries communicated to the hypervisor, if
> hcall-multi-tce is disabled?

Via H_PUT_TCE which updates 1 entry at once (sets or clears).
hcall-multi-tce  enables H_PUT_TCE_INDIRECT (512 entries at once) and
H_STUFF_TCE (clearing, up to 4bln at once? many), these are simply an
optimization.
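
For readers not steeped in PAPR, the shapes of the three hcalls mentioned
above are roughly (argument lists abbreviated from the PAPR spec):

	/* H_PUT_TCE(liobn, ioba, tce)
	 *	writes a single TCE; no guest memory needs to be shared
	 * H_PUT_TCE_INDIRECT(liobn, ioba, tce_list, npages)
	 *	tce_list is the guest-physical address of a page holding up
	 *	to 512 TCEs; this is the per-cpu page the patch unsecures
	 * H_STUFF_TCE(liobn, ioba, tce_value, npages)
	 *	replicates one value over many entries (bulk clearing)
	 */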


> 
>>
>>
>>>

> This happens even
> on a normal VM when IOMMU is enabled.
>
>
>>
>> Also what is this for anyway? 
>
> This is for sending indirect-TCE entries to the hypervisor.
> The hypervisor must be able to read those TCE entries, so that it can 
> use those entires to populate the TCE table with the correct mappings.
>
>> if I understand things right, you cannot
>> map any random guest memory, you should only be mapping that 64MB-ish
>> bounce buffer array but 1) I do not see that happening (I may have
>> missed it) 2) it should be done once and it takes a little time for
>> whatever memory size we allow for bounce buffers anyway. Thanks,
>
> Any random guest memory can be shared by the guest. 

 Yes but we do not want this to be this random. 
>>>
>>> It is not sharing some random page. It is sharing a page that is
>>> ear-marked for communicating TCE entries. Yes the address of the page
>>> can be random, depending on where the allocator decides to allocate it.
>>> The purpose of the page is not random.
>>
>> I was talking about the location.
>>
>>
>>> That page is used for one specific purpose; to communicate the TCE
>>> entries to the hypervisor.  
>>>
 I thought the whole idea
 of swiotlb was to restrict the amount of shared memory to bare minimum,
 what do I miss?
>>>
>>> I think you are making an incorrect connection between this patch and
>>> SWIOTLB.  This patch has nothing to do with SWIOTLB.
>>
>> I can see this and this is the confusing part.
>>
>>

> Maybe you are confusing this with the SWIOTLB bounce buffers used by
> PCI devices, to transfer data to the hypervisor?

 Is not this for pci+swiotlb? 
>>>
>>>
>>> No. This patch is NOT for PCI+SWIOTLB.  The SWIOTLB pages are a
>>> different set of pages allocated and earmarked for bounce buffering.
>>>
>>> This patch is purely to help the hypervisor setup the TCE table, in the
>>> presence of a IOMMU.
>>
>> Then the hypervisor should be able to access the guest pages mapped for
>> DMA and these pages should be made unsecure for this to work. Where/when
>> does this happen?
> 
> This happens in the SWIOTLB code.  The code to do that is already
> upstream.  
>
> The sharing of the pages containing the SWIOTLB bounce buffers is done
> in init_svm() which calls swiotlb_update_mem_attributes() which calls
> set_memory_decrypted().  In the case of pseries, set_memory_decrypted() calls 
> uv_share_page().


This does not seem enough as when you enforce iommu_platform=on, QEMU
starts accessing virtio buffers via IOMMU so bounce buffers have to be
mapped explicitly, via H_PUT_TCE, where does this happen?


> 
> The code that bounces the contents of a I/O buffer through the 
> SWIOTLB buffers, is in swiotlb_bounce().
> 
>>
>>
 The cover letter suggests it is for
 virtio-scsi-_pci_ with iommu_platform=on which makes it a
 normal pci device just like emulated XHCI. Thanks,
>>>
>>> Well, I guess, the cover letter is probably confusing. 

Re: [PATCH v2 1/2] kasan: support instrumented bitops combined with generic bitops

2019-12-03 Thread Daniel Axtens
Hi Michael,
> I only just noticed this thread as I was about to send a pull request
> for these two commits.
>
> I think I agree that test_bit() shouldn't move (yet), but I dislike that
> the documentation ends up being confusing due to this patch.
>
> So I'm inclined to append or squash in the patch below, which removes
> the new headers from the documentation. The end result is the docs look
> more or less the same, just the ordering of some of the functions
> changes. But we don't end up with test_bit() under the "Non-atomic"
> header, and then also documented in Documentation/atomic_bitops.txt.
>
> Thoughts?

That sounds good to me.

Regards,
Daniel

>
> cheers
>
>
> diff --git a/Documentation/core-api/kernel-api.rst 
> b/Documentation/core-api/kernel-api.rst
> index 2caaeb55e8dd..4ac53a1363f6 100644
> --- a/Documentation/core-api/kernel-api.rst
> +++ b/Documentation/core-api/kernel-api.rst
> @@ -57,21 +57,12 @@ The Linux kernel provides more basic utility functions.
>  Bit Operations
>  --
>  
> -Atomic Operations
> -~
> -
>  .. kernel-doc:: include/asm-generic/bitops/instrumented-atomic.h
> :internal:
>  
> -Non-atomic Operations
> -~
> -
>  .. kernel-doc:: include/asm-generic/bitops/instrumented-non-atomic.h
> :internal:
>  
> -Locking Operations
> -~~
> -
>  .. kernel-doc:: include/asm-generic/bitops/instrumented-lock.h
> :internal:
>  


[PATCH 4.19 195/321] net/wan/fsl_ucc_hdlc: Avoid double free in ucc_hdlc_probe()

2019-12-03 Thread Greg Kroah-Hartman
From: Wen Yang 

[ Upstream commit 40752b3eae29f8ca2378e978a02bd6dbeeb06d16 ]

This patch fixes potential double frees if register_hdlc_device() fails.

Signed-off-by: Wen Yang 
Reviewed-by: Peng Hao 
CC: Zhao Qiang 
CC: "David S. Miller" 
CC: netdev@vger.kernel.org
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/net/wan/fsl_ucc_hdlc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index 5f0366a125e26..0212f576a838c 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -1113,7 +1113,6 @@ static int ucc_hdlc_probe(struct platform_device *pdev)
if (register_hdlc_device(dev)) {
ret = -ENOBUFS;
pr_err("ucc_hdlc: unable to register hdlc device\n");
-   free_netdev(dev);
goto free_dev;
}
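
For context, the free_dev error label in ucc_hdlc_probe() already frees the
netdev, so the removed line freed it a second time. A sketch of the
resulting flow (the label body is paraphrased from the surrounding driver
code, not quoted from it):

	if (register_hdlc_device(dev)) {
		ret = -ENOBUFS;
		pr_err("ucc_hdlc: unable to register hdlc device\n");
		goto free_dev;          /* no inline free_netdev() here ... */
	}
	...
free_dev:
	free_netdev(dev);               /* ... the error path frees it exactly once */
	...
	return ret;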
 
-- 
2.20.1





Re: [PATCH 2/3] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2019-12-03 Thread kbuild test robot
Hi "Gautham,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v5.4 next-20191203]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:
https://github.com/0day-ci/linux/commits/Gautham-R-Shenoy/powerpc-pseries-Account-for-SPURR-ticks-on-idle-CPUs/20191127-234537
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allnoconfig (attached as .config)
compiler: powerpc-linux-gcc (GCC) 7.5.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.5.0 make.cross ARCH=powerpc

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

Note: the linux-review/Gautham-R-Shenoy/powerpc-pseries-Account-for-SPURR-ticks-on-idle-CPUs/20191127-234537 HEAD 54932d09dae77a0b15c47c7e51f0fb6e7d34900f builds fine.
It only hurts bisectibility.

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/sysfs.c: In function 'idle_purr_show':
>> arch/powerpc/kernel/sysfs.c:1052:34: error: 'paca_ptrs' undeclared (first use in this function); did you mean 'hash_ptr'?
     struct lppaca *cpu_lppaca_ptr = paca_ptrs[cpuid]->lppaca_ptr;
    arch/powerpc/kernel/sysfs.c:1052:34: note: each undeclared identifier is reported only once for each function it appears in
   In file included from include/linux/byteorder/big_endian.h:5:0,
from arch/powerpc/include/uapi/asm/byteorder.h:14,
from include/asm-generic/bitops/le.h:6,
from arch/powerpc/include/asm/bitops.h:243,
from include/linux/bitops.h:26,
from include/linux/kernel.h:12,
from include/linux/list.h:9,
from include/linux/kobject.h:19,
from include/linux/device.h:16,
from arch/powerpc/kernel/sysfs.c:2:
>> arch/powerpc/kernel/sysfs.c:1053:51: error: dereferencing pointer to incomplete type 'struct lppaca'
     u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);
    include/uapi/linux/byteorder/big_endian.h:38:51: note: in definition of macro '__be64_to_cpu'
     #define __be64_to_cpu(x) ((__force __u64)(__be64)(x))
    arch/powerpc/kernel/sysfs.c:1053:25: note: in expansion of macro 'be64_to_cpu'
     u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);

vim +1052 arch/powerpc/kernel/sysfs.c

  1046  
  1047  static ssize_t idle_purr_show(struct device *dev,
  1048struct device_attribute *attr, char *buf)
  1049  {
  1050  struct cpu *cpu = container_of(dev, struct cpu, dev);
  1051  unsigned int cpuid = cpu->dev.id;
> 1052  struct lppaca *cpu_lppaca_ptr = paca_ptrs[cpuid]->lppaca_ptr;
> 1053  u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);
  1054  
  1055  return sprintf(buf, "%llx\n", idle_purr_cycles);
  1056  }
  1057  static DEVICE_ATTR_RO(idle_purr);
  1058  

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org Intel Corporation




[PATCH] powerpc/xive: skip ioremap() of ESB pages for LSI interrupts

2019-12-03 Thread Cédric Le Goater
The PCI INTx interrupts and other LSI interrupts are handled differently
under a sPAPR platform. When the interrupt source characteristics are
queried, the hypervisor returns an H_INT_ESB flag to inform the OS
that it should be using the H_INT_ESB hcall for interrupt management
and not loads and stores on the interrupt ESB pages.

A default -1 value is returned for the addresses of the ESB pages. The
driver ignores this condition today and performs a bogus IO mapping.
Recent changes and the DEBUG_VM configuration option make the bug
visible with :

[0.015518] kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612!
[0.015578] Oops: Exception in kernel mode, sig: 5 [#1]
[0.015627] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries
[0.015697] Modules linked in:
[0.015739] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-0.rc6.git0.1.fc32.ppc64le #1
[0.015812] NIP:  c0f63294 LR: c0f62e44 CTR: 
[0.015889] REGS: c000fa45f0d0 TRAP: 0700   Not tainted  (5.4.0-0.rc6.git0.1.fc32.ppc64le)
[0.015971] MSR:  82029033   CR: 44000424  XER: 
[0.016050] CFAR: c0f63128 IRQMASK: 0
[0.016050] GPR00: c0f62e44 c000fa45f360 c1be5400 
[0.016050] GPR04: c19c7d38 c000fa340030 fa330009 c1c15e18
[0.016050] GPR08: 0040 ffe0  8418dd352dbd190f
[0.016050] GPR12:  c1e0 c00a8006 c00a8006
[0.016050] GPR16:  81ae c1c24d98 
[0.016050] GPR20: c00a8007 c1cafca0 c00a8007 
[0.016050] GPR24: c00a8008 c00a8008 c1cafca8 c00a8008
[0.016050] GPR28: c000fa32e010 c00a8006  c000fa33
[0.016711] NIP [c0f63294] ioremap_page_range+0x4c4/0x6e0
[0.016778] LR [c0f62e44] ioremap_page_range+0x74/0x6e0
[0.016846] Call Trace:
[0.016876] [c000fa45f360] [c0f62e44] ioremap_page_range+0x74/0x6e0 (unreliable)
[0.016969] [c000fa45f460] [c00934bc] do_ioremap+0x8c/0x120
[0.017037] [c000fa45f4b0] [c00938e8] __ioremap_caller+0x128/0x140
[0.017116] [c000fa45f500] [c00931a0] ioremap+0x30/0x50
[0.017184] [c000fa45f520] [c00d1380] xive_spapr_populate_irq_data+0x170/0x260
[0.017263] [c000fa45f5c0] [c00cc90c] xive_irq_domain_map+0x8c/0x170
[0.017344] [c000fa45f600] [c0219124] irq_domain_associate+0xb4/0x2d0
[0.017424] [c000fa45f690] [c0219fe0] irq_create_mapping+0x1e0/0x3b0
[0.017506] [c000fa45f730] [c021ad6c] irq_create_fwspec_mapping+0x27c/0x3e0
[0.017586] [c000fa45f7c0] [c021af68] irq_create_of_mapping+0x98/0xb0
[0.017666] [c000fa45f830] [c08d4e48] of_irq_parse_and_map_pci+0x168/0x230
[0.017746] [c000fa45f910] [c0075428] pcibios_setup_device+0x88/0x250
[0.017826] [c000fa45f9a0] [c0077b84] pcibios_setup_bus_devices+0x54/0x100
[0.017906] [c000fa45fa10] [c00793f0] __of_scan_bus+0x160/0x310
[0.017973] [c000fa45faf0] [c0075fc0] pcibios_scan_phb+0x330/0x390
[0.018054] [c000fa45fba0] [c139217c] pcibios_init+0x8c/0x128
[0.018121] [c000fa45fc20] [c00107b0] do_one_initcall+0x60/0x2c0
[0.018201] [c000fa45fcf0] [c1384624] kernel_init_freeable+0x290/0x378
[0.018280] [c000fa45fdb0] [c0010d24] kernel_init+0x2c/0x148
[0.018348] [c000fa45fe20] [c000bdbc] ret_from_kernel_thread+0x5c/0x80
[0.018427] Instruction dump:
[0.018468] 41820014 3920fe7f 7d494838 7d290074 7929d182 f8e10038 69290001 0b09
[0.018552] 7a098420 0b09 7bc95960 7929a802 <0b09> 7fc68b78 e8610048 7dc47378

Cc: stable@vger.kernel.org # v4.14+
Fixes: bed81ee181dd ("powerpc/xive: introduce H_INT_ESB hcall")
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/spapr.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 33c10749edec..55dc61cb4867 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -392,20 +392,28 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
data->esb_shift = esb_shift;
data->trig_page = trig_page;
 
+   data->hw_irq = hw_irq;
+
/*
 * No chip-id for the sPAPR backend. This has an impact how we
 * pick a target. See xive_pick_irq_target().
 */
data->src_chip = XIVE_INVALID_CHIP_ID;
 
+   /*
+* When the H_INT_ESB flag is set, the H_INT_ESB hcall should
+* be used for interrupt management. Skip the remapping of the
+

Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later

2019-12-03 Thread Christoph Hellwig
Please try the patch below:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 512856a88106..340c15400423 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -28,6 +28,7 @@
 struct iomap_page {
atomic_tread_count;
atomic_twrite_count;
+   spinlock_t  uptodate_lock;
DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
 };
 
@@ -51,6 +52,7 @@ iomap_page_create(struct inode *inode, struct page *page)
iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
	atomic_set(&iop->read_count, 0);
	atomic_set(&iop->write_count, 0);
+	spin_lock_init(&iop->uptodate_lock);
bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
 
/*
@@ -139,25 +141,38 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
 }
 
 static void
-iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
 {
struct iomap_page *iop = to_iomap_page(page);
struct inode *inode = page->mapping->host;
unsigned first = off >> inode->i_blkbits;
unsigned last = (off + len - 1) >> inode->i_blkbits;
-   unsigned int i;
bool uptodate = true;
+   unsigned long flags;
+   unsigned int i;
 
-   if (iop) {
-   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
-   if (i >= first && i <= last)
-   set_bit(i, iop->uptodate);
-   else if (!test_bit(i, iop->uptodate))
-   uptodate = false;
-   }
+	spin_lock_irqsave(&iop->uptodate_lock, flags);
+   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
+   if (i >= first && i <= last)
+   set_bit(i, iop->uptodate);
+   else if (!test_bit(i, iop->uptodate))
+   uptodate = false;
}
 
-   if (uptodate && !PageError(page))
+   if (uptodate)
+   SetPageUptodate(page);
+	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
+}
+
+static void
+iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
+{
+   if (PageError(page))
+   return;
+
+   if (page_has_private(page))
+   iomap_iop_set_range_uptodate(page, off, len);
+   else
SetPageUptodate(page);
 }
 


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-03 Thread Segher Boessenkool
Hi!

On Tue, Dec 03, 2019 at 03:03:22PM +1100, Michael Ellerman wrote:
> Sebastian Andrzej Siewior  writes:
> I've certainly heard it said that on some OF's the phandle was just ==
> the address of the internal representation, and I guess maybe for SLOF
> that is true.

It is (or was).  In many OFs it is just the effective address of some
node structure.  SLOF runs with translation off normally.

> They seem to vary wildly though, eg. on an Apple G5:

Apple OF runs with translation on usually.  IIRC these are effective
addresses as well.

The OF they have on G5 machines is mostly 32-bit, for compatibility is my
guess (for userland things dealing with addresses from OF, importantly).

>   $ find /proc/device-tree/ -name phandle | xargs lsprop | head -10
>   /proc/device-tree/vsp@0,f900/veo@f918/phandle ff970848
>   /proc/device-tree/vsp@0,f900/phandle ff970360
>   /proc/device-tree/vsp@0,f900/veo@f908/phandle ff970730
>   /proc/device-tree/nvram@0,fff04000/phandle ff967fb8
>   /proc/device-tree/xmodem/phandle ff9655e8
>   /proc/device-tree/multiboot/phandle ff9504f0
>   /proc/device-tree/diagnostics/phandle ff965550
>   /proc/device-tree/options/phandle ff893cf0
>   /proc/device-tree/openprom/client-services/phandle ff8925b8
>   /proc/device-tree/openprom/phandle ff892458
> 
> That machine does not have enough RAM for those to be 32-bit real
> addresses. I think Apple OF is running in virtual mode though (?), so
> maybe they are pointers?

Yes, I think the default is to have 8MB ram at the top of 4GB (which is
the physical address of the bootrom, btw) for OF.

> And on an IBM pseries machine they're a bit all over the place:
> 
>   /proc/device-tree/cpus/PowerPC,POWER8@40/ibm,phandle 1040
>   /proc/device-tree/cpus/l2-cache@2005/ibm,phandle 2005
>   /proc/device-tree/cpus/PowerPC,POWER8@30/ibm,phandle 1030
>   /proc/device-tree/cpus/PowerPC,POWER8@20/ibm,phandle 1020
>   /proc/device-tree/cpus/PowerPC,POWER8@10/ibm,phandle 1010
>   /proc/device-tree/cpus/l2-cache@2003/ibm,phandle 2003
>   /proc/device-tree/cpus/l2-cache@200a/ibm,phandle 200a
>   /proc/device-tree/cpus/l3-cache@3108/ibm,phandle 3108
>   /proc/device-tree/cpus/l2-cache@2001/ibm,phandle 2001
>   /proc/device-tree/cpus/l3-cache@3106/ibm,phandle 3106
>   /proc/device-tree/cpus/ibm,phandle fff8
>   /proc/device-tree/cpus/l3-cache@3104/ibm,phandle 3104
>   /proc/device-tree/cpus/l2-cache@2008/ibm,phandle 2008
>   /proc/device-tree/cpus/l3-cache@3102/ibm,phandle 3102
>   /proc/device-tree/cpus/l2-cache@2006/ibm,phandle 2006
>   /proc/device-tree/cpus/l3-cache@3100/ibm,phandle 3100
>   /proc/device-tree/cpus/PowerPC,POWER8@8/ibm,phandle 1008
>   /proc/device-tree/cpus/l2-cache@2004/ibm,phandle 2004
>   /proc/device-tree/cpus/PowerPC,POWER8@48/ibm,phandle 1048
>   /proc/device-tree/cpus/PowerPC,POWER8@38/ibm,phandle 1038
>   /proc/device-tree/cpus/l2-cache@2002/ibm,phandle 2002
>   /proc/device-tree/cpus/PowerPC,POWER8@28/ibm,phandle 1028
>   /proc/device-tree/cpus/l3-cache@3107/ibm,phandle 3107
>   /proc/device-tree/cpus/PowerPC,POWER8@18/ibm,phandle 1018
>   /proc/device-tree/cpus/l2-cache@2000/ibm,phandle 2000
>   /proc/device-tree/cpus/l3-cache@3105/ibm,phandle 3105
>   /proc/device-tree/cpus/l3-cache@3103/ibm,phandle 3103
>   /proc/device-tree/cpus/l3-cache@310a/ibm,phandle 310a
>   /proc/device-tree/cpus/PowerPC,POWER8@0/ibm,phandle 1000
>   /proc/device-tree/cpus/l2-cache@2007/ibm,phandle 2007
>   /proc/device-tree/cpus/l3-cache@3101/ibm,phandle 3101
>   /proc/device-tree/pci@800201b/ibm,phandle 201b

Some (the 1000) look like addresses as well.

> > So the hash array has 64 entries out which only 8 are populated. Using
> > hash_32() populates 29 entries.

> On the G5 it's similarly inefficient:
> [0.007379] OF: of_populate_phandle_cache(242) Used entries: 31, hashed: 111

> And some output from a "real" pseries machine (IBM OF), which is
> slightly better:
> [0.129467] OF: of_populate_phandle_cache(242) Used entries: 39, hashed: 81

> So yeah using hash_32() is quite a bit better in both cases.

Yup, no surprise there.  And hash_32 is very cheap to compute.

> And if I'm reading your patch right it would be a single line change to
> switch, so that seems like it's worth doing to me.

Agreed!
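
For concreteness, the change under discussion amounts to (a sketch;
OF_PHANDLE_CACHE_BITS stands in for log2 of the existing cache size):

	#include <linux/hash.h>

	/* old: slot = phandle & (cache_entries - 1);  degenerate when
	 * phandles are pointer-like values whose low bits barely vary.
	 * new: spread all input bits over the table: */
	slot = hash_32(phandle, OF_PHANDLE_CACHE_BITS);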

Btw.  Some OFs mangle the phandles some way, to make it easier to catch
people using it as an address (and similarly, mangle ihandles differently,
so you catch confusion between ihandles and phandles as well).  Like a
simple xor, with some odd number preferably.  You should assume *nothing*
about phandles, they are opaque identifiers.


Segher


Re: [RFC] Efficiency of the phandle_cache on ppc64/SLOF

2019-12-03 Thread Rob Herring
On Mon, Dec 2, 2019 at 10:28 PM Frank Rowand  wrote:
>
> On 12/2/19 10:12 PM, Michael Ellerman wrote:
> > Frank Rowand  writes:
> >> On 11/29/19 9:10 AM, Sebastian Andrzej Siewior wrote:
> >>> I've been looking at phandle_cache and noticed the following: The raw
> >>> phandle value as generated by dtc starts at zero and is incremented by
> >>> one for each phandle entry. The qemu pSeries model is using Slof (which
> >>> is probably the same thing as used on real hardware) and this looks like
> >>> a pointer value for the phandle.
> >>> With
> >>> qemu-system-ppc64le -m 16G -machine pseries -smp 8
> >>>
> >>> I got the following output:
> >>> | entries: 64
> >>> | phandle 7e732468 slot 28 hash c
> >>> | phandle 7e732ad0 slot 10 hash 27
> >>> | phandle 7e732ee8 slot 28 hash 3a
> >>> | phandle 7e734160 slot 20 hash 36
> >>> | phandle 7e734318 slot 18 hash 3a
> >>> | phandle 7e734428 slot 28 hash 33
> >>> | phandle 7e734538 slot 38 hash 2c
> >>> | phandle 7e734850 slot 10 hash e
> >>> | phandle 7e735220 slot 20 hash 2d
> >>> | phandle 7e735bf0 slot 30 hash d
> >>> | phandle 7e7365c0 slot 0 hash 2d
> >>> | phandle 7e736f90 slot 10 hash d
> >>> | phandle 7e737960 slot 20 hash 2d
> >>> | phandle 7e738330 slot 30 hash d
> >>> | phandle 7e738d00 slot 0 hash 2d
> >>> | phandle 7e739730 slot 30 hash 38
> >>> | phandle 7e73bd08 slot 8 hash 17
> >>> | phandle 7e73c2e0 slot 20 hash 32
> >>> | phandle 7e73c7f8 slot 38 hash 37
> >>> | phandle 7e782420 slot 20 hash 13
> >>> | phandle 7e782ed8 slot 18 hash 1b
> >>> | phandle 7e73ce28 slot 28 hash 39
> >>> | phandle 7e73d390 slot 10 hash 22
> >>> | phandle 7e73d9a8 slot 28 hash 1a
> >>> | phandle 7e73dc28 slot 28 hash 37
> >>> | phandle 7e73de00 slot 0 hash a
> >>> | phandle 7e73e028 slot 28 hash 0
> >>> | phandle 7e7621a8 slot 28 hash 36
> >>> | phandle 7e73e458 slot 18 hash 1e
> >>> | phandle 7e73e608 slot 8 hash 1e
> >>> | phandle 7e740078 slot 38 hash 28
> >>> | phandle 7e740180 slot 0 hash 1d
> >>> | phandle 7e740240 slot 0 hash 33
> >>> | phandle 7e740348 slot 8 hash 29
> >>> | phandle 7e740410 slot 10 hash 2
> >>> | phandle 7e740eb0 slot 30 hash 3e
> >>> | phandle 7e745390 slot 10 hash 33
> >>> | phandle 7e747b08 slot 8 hash c
> >>> | phandle 7e748528 slot 28 hash f
> >>> | phandle 7e74a6e0 slot 20 hash 18
> >>> | phandle 7e74aab0 slot 30 hash b
> >>> | phandle 7e74f788 slot 8 hash d
> >>> | Used entries: 8, hashed: 29
> >>>
> >>> So the hash array has 64 entries out which only 8 are populated. Using
> >>> hash_32() populates 29 entries.
> >>> Could someone with real hardware verify this?
> >>> I'm not sure how important this performance wise, it looks just like a
> >>> waste using only 1/8 of the array.
> >>
> >> The hash used is based on the assumptions you noted, and as stated in the
> >> code, that phandle property values are in a contiguous range of 1..n
> >> (not starting from zero), which is what dtc generates.
> >
> >> We knew that for systems that do not match the assumptions that the hash
> >> will not be optimal.
> >
> > If we're going to have the phandle cache it should at least make some
> > attempt to work across the systems that Linux supports.
> >
> >> Unless there is a serious performance problem for
> >> such systems, I do not want to make the phandle hash code more complicated
> >> to optimize for these cases.  And the pseries have been performing ok
> >> without phandle related performance issues that I remember hearing since
> >> before the cache was added, which could have only helped the performance.
> >> Yes, if your observations are correct, some memory is being wasted, but
> >> a 64 entry cache is not very large on a pseries.
> >
> > A single line change to use an actual hash function is hardly
> > complicating it, compared to the amount of code already there.
>
> With a dtc generated devicetree, the hash is perfect, with no
> misses.  That single line change then makes the hash bad for
> dtc generated devicetrees.

To quantify that, I did some tests with the hash algo and it's about a
10% collision rate using phandle values of 1-$cache_size. There's
never more than 2 phandles colliding in a slot. The actual impact
would be less given 100% thrashing seems unlikely.

> The cache was added to address a problem with a system with a
> dtc generated devicetree.

The cache was added to address a problem with a system with a large
number of nodes and phandles (814 phandles in the reported case). dtc
happens to be used in that case.

> I had not heard of any phandle
> lookup performance issues on ppc systems.  An imperfect hash
> for ppc should not make the ppc performance worse (unless
> every single phandle value hashes to a single bucket).  So
> the ppc performance is likely better than it was before
> the hash was added, even without an optimal hash algorithm.
>
> So the change would not be a single line change.  It would
> be a change to use different hash algorithms for dtc
> generated device trees versus other device trees.
>
> Another 

RE: [PATCH v4 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-03 Thread Ram Pai
On Tue, Dec 03, 2019 at 03:24:37PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 03/12/2019 15:05, Ram Pai wrote:
> > On Tue, Dec 03, 2019 at 01:15:04PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 03/12/2019 13:08, Ram Pai wrote:
> >>> On Tue, Dec 03, 2019 at 11:56:43AM +1100, Alexey Kardashevskiy wrote:
> 
> 
>  On 02/12/2019 17:45, Ram Pai wrote:
> > H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> > its parameters. One page is dedicated per cpu, for the lifetime of the
> > kernel, for this purpose. On secure VMs, the contents of this page read
> > back as encrypted TCE entries when accessed by the hypervisor. The
> > hypervisor needs to know the unencrypted entries, to update the TCE table
> > accordingly.  There is nothing secret or sensitive about these entries.
> > Hence share the page with the hypervisor.
> 
>  This unsecures a page in the guest at a random location, which creates
>  an additional attack surface; it is hard to exploit, indeed, but
>  nevertheless it is there.
>  A safer option would be not to use the
>  hcall-multi-tce hypertas option (which translates to FW_FEATURE_MULTITCE
>  in the guest).
> >>>
> >>>
> >>> Hmm... How do we not use it?  AFAICT hcall-multi-tce option gets invoked
> >>> automatically when IOMMU option is enabled.
> >>
> >> It is advertised by QEMU but the guest does not have to use it.
> > 
> > Are you suggesting that even a normal guest should not use
> > hcall-multi-tce, or just a secure guest?
> 
> 
> Just secure.

hmm..  how are the TCE entries communicated to the hypervisor, if
hcall-multi-tce is disabled?

> 
> 
> > 
> >>
> >>> This happens even
> >>> on a normal VM when IOMMU is enabled.
> >>>
> >>>
> 
>  Also what is this for anyway? 
> >>>
> >>> This is for sending indirect-TCE entries to the hypervisor.
> >>> The hypervisor must be able to read those TCE entries, so that it can 
> >>> use those entries to populate the TCE table with the correct mappings.
> >>>
>  if I understand things right, you cannot
>  map any random guest memory, you should only be mapping that 64MB-ish
>  bounce buffer array but 1) I do not see that happening (I may have
>  missed it) 2) it should be done once and it takes a little time for
>  whatever memory size we allow for bounce buffers anyway. Thanks,
> >>>
> >>> Any random guest memory can be shared by the guest. 
> >>
> >> Yes but we do not want this to be this random. 
> > 
> > It is not sharing some random page. It is sharing a page that is
> > ear-marked for communicating TCE entries. Yes the address of the page
> > can be random, depending on where the allocator decides to allocate it.
> > The purpose of the page is not random.
> 
> I was talking about the location.
> 
> 
> > That page is used for one specific purpose; to communicate the TCE
> > entries to the hypervisor.  
> > 
> >> I thought the whole idea
> >> of swiotlb was to restrict the amount of shared memory to bare minimum,
> >> what do I miss?
> > 
> > I think you are making an incorrect connection between this patch and
> > SWIOTLB.  This patch has nothing to do with SWIOTLB.
> 
> I can see this and this is the confusing part.
> 
> 
> >>
> >>> Maybe you are confusing this with the SWIOTLB bounce buffers used by
> >>> PCI devices, to transfer data to the hypervisor?
> >>
> >> Is not this for pci+swiotlb? 
> > 
> > 
> > No. This patch is NOT for PCI+SWIOTLB.  The SWIOTLB pages are a
> > different set of pages allocated and earmarked for bounce buffering.
> > 
> > This patch is purely to help the hypervisor set up the TCE table, in the
> > presence of an IOMMU.
> 
> Then the hypervisor should be able to access the guest pages mapped for
> DMA and these pages should be made unsecure for this to work. Where/when
> does this happen?

This happens in the SWIOTLB code.  The code to do that is already
upstream.  

The sharing of the pages containing the SWIOTLB bounce buffers is done
in init_svm() which calls swiotlb_update_mem_attributes() which calls
set_memory_decrypted().  In the case of pseries, set_memory_decrypted() calls 
uv_share_page().

The code that bounces the contents of an I/O buffer through the
SWIOTLB buffers is in swiotlb_bounce().
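
For readers following along, a hedged sketch of that sharing path (the
call chain is from the description above; the function body is
illustrative, not the exact upstream code):

/* Sketch of the pseries SVM path:
 * init_svm() -> swiotlb_update_mem_attributes()
 *            -> set_memory_decrypted() -> uv_share_page()
 */
int set_memory_decrypted(unsigned long addr, int numpages)
{
	if (!PAGE_ALIGNED(addr))
		return -EINVAL;

	/* ask the ultravisor to make these pages hypervisor-readable */
	uv_share_page(PHYS_PFN(__pa(addr)), numpages);

	return 0;
}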

> 
> 
> >> The cover letter suggests it is for
> >> virtio-scsi-_pci_ with iommu_platform=on which makes it a
> >> normal pci device just like emulated XHCI. Thanks,
> > 
> > Well, I guess the cover letter is probably confusing. There are two
> > patches, which together enable virtio on secure guests, in the presence
> > of an IOMMU.
> > 
> > The second patch enables virtio in the presence of an IOMMU, to use
> > the DMA-ops + SWIOTLB infrastructure, to correctly route the I/O to
> > virtio devices.
> 
> The second patch does nothing in relation to the problem being solved.

The second patch registers dma_iommu_ops with the PCI system.  Doing so
enables I/O to take the dma_iommu_ops path, which internally 
leads it 

Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later

2019-12-03 Thread Darrick J. Wong
On Tue, Dec 03, 2019 at 09:35:28AM -0500, Jan Stancek wrote:
> 
> - Original Message -
> > On Tue, Dec 03, 2019 at 07:50:39AM -0500, Jan Stancek wrote:
> > > My theory is that there's a race in iomap. There appear to be
> > > interleaved calls to iomap_set_range_uptodate() for the same page
> > > with varying offset and length. Each call sees the bitmap as _not_
> > > entirely "uptodate" and hence doesn't call SetPageUptodate(),
> > > even though every bit in the bitmap ends up uptodate by the time
> > > all the calls finish.
> > 
> > Weird.  That should be prevented by the page lock that all callers
> > of iomap_set_range_uptodate() hold.  But in case I'm missing something,
> > does the patch below trigger?  If it doesn't, it is not just a race, but
> > might be some weird ordering problem with the bitops, especially if it
> > only triggers on ppc, which is very weakly ordered.
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index d33c7bc5ee92..25e942c71590 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -148,6 +148,8 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
> > unsigned int i;
> > bool uptodate = true;
> >  
> > +   WARN_ON_ONCE(!PageLocked(page));
> > +
> > if (iop) {
> > for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> > if (i >= first && i <= last)
> > 
> 
> Hit it pretty quick this time:
> 
> # uptime
>  09:27:42 up 22 min,  2 users,  load average: 0.09, 13.38, 26.18
> 
> # /mnt/testarea/ltp/testcases/bin/genbessel   
>   
> 
> Bus error (core dumped)
> 
> # dmesg | grep -i -e warn -e call 
>   
> 
> [0.00] dt-cpu-ftrs: not enabling: system-call-vectored (disabled or 
> unsupported by kernel)
> [0.00] random: get_random_u64 called from 
> cache_random_seq_create+0x98/0x1e0 with crng_init=0
> [0.00] rcu: Offload RCU callbacks from CPUs: (none).
> [5.312075] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is 
> called outbound_intr_mask:0x4009
> [5.357307] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is 
> called outbound_intr_mask:0x4009
> [5.485126] megaraid_sas 0031:01:00.0: megasas_enable_intr_fusion is 
> called outbound_intr_mask:0x4000
> 
> So, extra WARN_ON_ONCE applied on top of v5.4-8836-g81b6b96475ac
> did not trigger.
> 
> Is it possible for iomap code to submit multiple bios for the same
> locked page and then receive callbacks in parallel?

Yes, if (say) you have 64k pages on a 4k-block filesystem and the extent
mapping for all 16 blocks aren't contiguous, then iomap will issue
separate bios for each physical fragment it finds.  iomap will call
submit_bio on those bios whenever it thinks it's done filling the bio,
so you can indeed get multiple callbacks in parallel.
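
To make that concrete with the numbers from Jan's trace (64k page, 4k
blocks, so a 16-bit per-block uptodate bitmap): the read was split into
bios covering blocks 1-14, 0, and 15. The 1-14 completion sets its bits
but still sees bits 0 and 15 clear; the 0 and 15 completions then run on
two different CPUs at the same timestamp, and if each one's full-bitmap
scan misses the bit the other has just set, neither calls
SetPageUptodate() even though all 16 bits end up set.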

--D


Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-12-03 Thread David Hildenbrand
On 03.12.19 16:10, Oscar Salvador wrote:
> On Sun, Oct 06, 2019 at 10:56:41AM +0200, David Hildenbrand wrote:
>> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
>> Signed-off-by: David Hildenbrand 
> 
> I did not see anything wrong with the approach taken, and it makes sense
> to me.  The only thing that puzzles me is that we no longer seem to
> balance spanned_pages for ZONE_DEVICE.
> memremap_pages() increments them via move_pfn_range_to_zone, but we skip
> ZONE_DEVICE in remove_pfn_range_from_zone.

Yes, documented e.g., in

commit 7ce700bf11b5e2cb84e4352bbdf2123a7a239c84
Author: David Hildenbrand 
Date:   Thu Nov 21 17:53:56 2019 -0800

mm/memory_hotplug: don't access uninitialized memmaps in
shrink_zone_span()

Needs some more thought - but is definitely not urgent (well, now it's
at least no longer completely broken).

> 
> That is not really related to this patch, so I might be missing something,
> but it caught my eye while reviewing this.
> 
> Anyway, for this one:
> 
> Reviewed-by: Oscar Salvador 
> 

Thanks!

> 
> off-topic: I __think__ we really need to trim the CC list.

Yes we should :) - done.

-- 
Thanks,

David / dhildenb



Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-12-03 Thread Oscar Salvador
On Sun, Oct 06, 2019 at 10:56:41AM +0200, David Hildenbrand wrote:
> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> Signed-off-by: David Hildenbrand 

I did not see anything wrong with the approach taken, and it makes sense
to me.  The only thing that puzzles me is that we no longer seem to
balance spanned_pages for ZONE_DEVICE.
memremap_pages() increments them via move_pfn_range_to_zone, but we skip
ZONE_DEVICE in remove_pfn_range_from_zone.

That is not really related to this patch, so I might be missing something,
but it caught my eye while reviewing this.

Anyway, for this one:

Reviewed-by: Oscar Salvador 


off-topic: I __think__ we really need to trim the CC list.

> ---
>  arch/arm64/mm/mmu.c|  4 +---
>  arch/ia64/mm/init.c|  4 +---
>  arch/powerpc/mm/mem.c  |  3 +--
>  arch/s390/mm/init.c|  4 +---
>  arch/sh/mm/init.c  |  4 +---
>  arch/x86/mm/init_32.c  |  4 +---
>  arch/x86/mm/init_64.c  |  4 +---
>  include/linux/memory_hotplug.h |  7 +--
>  mm/memory_hotplug.c| 31 ---
>  mm/memremap.c  |  2 +-
>  10 files changed, 29 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 60c929f3683b..d10247fab0fd 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1069,7 +1069,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
>   /*
>* FIXME: Cleanup page tables (also in arch_add_memory() in case
> @@ -1078,7 +1077,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>* unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
>* unlocked yet.
>*/
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index bf9df2625bc8..a6dd80a2c939 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index be941d382c8d..97e5922cb52e 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
>   int ret;
>  
> - __remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  
>   /* Remove htab bolted mappings for this section of memory */
>   start = (unsigned long)__va(start);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index a124f19f7b3c..c1d96e588152 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -291,10 +291,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>   vmem_remove_mapping(start, size);
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index dfdbaa50946e..d1b1ff2be17a 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = PFN_DOWN(start);
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + __remove_pages(start_pfn, nr_pages, altmap);
>  }
>  #endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 930edeb41ec3..0a74407ef92e 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -865,10 +865,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
>  {
>   unsigned long start_pfn = start >> PAGE_SHIFT;
>   unsigned long nr_pages = size >> PAGE_SHIFT;
> - struct zone *zone;
>  
> - zone = page_zone(pfn_to_page(start_pfn));
> - __remove_pages(zone, start_pfn, nr_pages, altmap);
> + 

Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later

2019-12-03 Thread Jan Stancek


- Original Message -
> On Tue, Dec 03, 2019 at 07:50:39AM -0500, Jan Stancek wrote:
> > My theory is that there's a race in iomap. There appear to be
> > interleaved calls to iomap_set_range_uptodate() for the same page
> > with varying offset and length. Each call sees the bitmap as _not_
> > entirely "uptodate" and hence doesn't call SetPageUptodate(),
> > even though every bit in the bitmap ends up uptodate by the time
> > all the calls finish.
> 
> Weird.  That should be prevented by the page lock that all callers
> of iomap_set_range_uptodate() hold.  But in case I'm missing something,
> does the patch below trigger?  If it doesn't, it is not just a race, but
> might be some weird ordering problem with the bitops, especially if it
> only triggers on ppc, which is very weakly ordered.
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d33c7bc5ee92..25e942c71590 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -148,6 +148,8 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
>   unsigned int i;
>   bool uptodate = true;
>  
> + WARN_ON_ONCE(!PageLocked(page));
> +
>   if (iop) {
>   for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
>   if (i >= first && i <= last)
> 

Hit it pretty quick this time:

# uptime
 09:27:42 up 22 min,  2 users,  load average: 0.09, 13.38, 26.18

# /mnt/testarea/ltp/testcases/bin/genbessel 


Bus error (core dumped)

# dmesg | grep -i -e warn -e call   


[0.00] dt-cpu-ftrs: not enabling: system-call-vectored (disabled or 
unsupported by kernel)
[0.00] random: get_random_u64 called from 
cache_random_seq_create+0x98/0x1e0 with crng_init=0
[0.00] rcu: Offload RCU callbacks from CPUs: (none).
[5.312075] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is called 
outbound_intr_mask:0x4009
[5.357307] megaraid_sas 0031:01:00.0: megasas_disable_intr_fusion is called 
outbound_intr_mask:0x4009
[5.485126] megaraid_sas 0031:01:00.0: megasas_enable_intr_fusion is called 
outbound_intr_mask:0x4000

So, extra WARN_ON_ONCE applied on top of v5.4-8836-g81b6b96475ac
did not trigger.

Is it possible for iomap code to submit multiple bios for the same
locked page and then receive callbacks in parallel?



Re: [PATCH 1/3] powerpc/pseries: Account for SPURR ticks on idle CPUs

2019-12-03 Thread Kamalesh Babulal
On 11/27/19 5:31 PM, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> On PSeries LPARs, to compute the utilization, tools such as lparstat
> need to know the [S]PURR ticks when the CPUs were busy or idle.
> 
> In the pseries cpuidle driver, we keep track of the idle PURR ticks in
> the VPA variable "wait_state_cycles". This patch extends the support
> to account for the idle SPURR ticks.

Thanks for working on it.

> 
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Kamalesh Babulal 
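
For reference, a minimal sketch of the accounting idea described in the
commit message above, with illustrative per-cpu variable names (the
actual patch keeps idle PURR in the VPA's wait_state_cycles and adds an
equivalent accumulator for SPURR):

static DEFINE_PER_CPU(u64, idle_spurr_cycles);
static DEFINE_PER_CPU(u64, idle_entry_spurr_snap);

/* call on cpuidle entry */
static void snapshot_spurr_idle_entry(void)
{
	*this_cpu_ptr(&idle_entry_spurr_snap) = mfspr(SPRN_SPURR);
}

/* call on cpuidle exit: accumulate SPURR ticks spent idle */
static void update_idle_spurr_accounting(void)
{
	u64 *idle_spurr = this_cpu_ptr(&idle_spurr_cycles);

	*idle_spurr += mfspr(SPRN_SPURR) -
			*this_cpu_ptr(&idle_entry_spurr_snap);
}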



Re: [PATCH 2/3] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2019-12-03 Thread Kamalesh Babulal
On 11/27/19 5:31 PM, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> On Pseries LPARs, to calculate utilization, we need to know the
> [S]PURR ticks when the CPUs were busy or idle.
> 
> The total PURR and SPURR ticks are already exposed via the per-cpu
> sysfs files /sys/devices/system/cpu/cpuX/purr and
> /sys/devices/system/cpu/cpuX/spurr.
> 
> This patch adds support for exposing the idle PURR and SPURR ticks via
> /sys/devices/system/cpu/cpuX/idle_purr and
> /sys/devices/system/cpu/cpuX/idle_spurr.

The patch looks good to me, with a minor file mode nitpick mentioned below.

> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/kernel/sysfs.c | 32 
>  1 file changed, 32 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 80a676d..42ade55 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -1044,6 +1044,36 @@ static ssize_t show_physical_id(struct device *dev,
>  }
>  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
> 
> +static ssize_t idle_purr_show(struct device *dev,
> +   struct device_attribute *attr, char *buf)
> +{
> + struct cpu *cpu = container_of(dev, struct cpu, dev);
> + unsigned int cpuid = cpu->dev.id;
> + struct lppaca *cpu_lppaca_ptr = paca_ptrs[cpuid]->lppaca_ptr;
> + u64 idle_purr_cycles = be64_to_cpu(cpu_lppaca_ptr->wait_state_cycles);
> +
> + return sprintf(buf, "%llx\n", idle_purr_cycles);
> +}
> +static DEVICE_ATTR_RO(idle_purr);

The per-cpu purr/spurr sysfs files are created with file mode 0400. Using
DEVICE_ATTR_RO for their idle_* variants will create sysfs files with 0444
as their file mode; you should probably use DEVICE_ATTR() with file mode
0400 to have consistent permissions for both variants.
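
In other words, a sketch of the suggested change (untested):

-static DEVICE_ATTR_RO(idle_purr);
+static DEVICE_ATTR(idle_purr, 0400, idle_purr_show, NULL);

and likewise for idle_spurr.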

-- 
Kamalesh



Re: [PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory

2019-12-03 Thread Oscar Salvador
On Mon, Dec 02, 2019 at 10:09:51AM +0100, David Hildenbrand wrote:
> @Michal, @Oscar, can some of you at least have a look at patch #5 now so
> we can proceed with that? (the other patches can stay in -next some time
> longer)

Hi, 

I will be having a look at patch#5 shortly.

Thanks for the reminder

-- 
Oscar Salvador
SUSE L3


Re: [bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later

2019-12-03 Thread Christoph Hellwig
On Tue, Dec 03, 2019 at 07:50:39AM -0500, Jan Stancek wrote:
> My theory is that there's a race in iomap. There appear to be
> interleaved calls to iomap_set_range_uptodate() for the same page
> with varying offset and length. Each call sees the bitmap as _not_
> entirely "uptodate" and hence doesn't call SetPageUptodate(),
> even though every bit in the bitmap ends up uptodate by the time
> all the calls finish.

Weird.  That should be prevented by the page lock that all callers
of iomap_set_range_uptodate() hold.  But in case I'm missing something,
does the patch below trigger?  If it doesn't, it is not just a race, but
might be some weird ordering problem with the bitops, especially if it
only triggers on ppc, which is very weakly ordered.

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d33c7bc5ee92..25e942c71590 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -148,6 +148,8 @@ iomap_set_range_uptodate(struct page *page, unsigned off, unsigned len)
unsigned int i;
bool uptodate = true;
 
+   WARN_ON_ONCE(!PageLocked(page));
+
if (iop) {
for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
if (i >= first && i <= last)
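
If the race turns out to be real, here is a hedged sketch of one way the
window could be closed, assuming struct iomap_page grows a spinlock (an
illustration, not a proposed patch):

static void
iomap_iop_set_range_uptodate(struct page *page, unsigned off, unsigned len)
{
	struct iomap_page *iop = to_iomap_page(page);
	struct inode *inode = page->mapping->host;
	unsigned first = off >> inode->i_blkbits;
	unsigned last = (off + len - 1) >> inode->i_blkbits;
	unsigned long flags;
	unsigned int i;

	spin_lock_irqsave(&iop->uptodate_lock, flags);
	for (i = first; i <= last; i++)
		set_bit(i, iop->uptodate);
	/* setting the bits and scanning the whole bitmap are now atomic
	 * with respect to other completions for this page */
	if (bitmap_full(iop->uptodate, PAGE_SIZE / i_blocksize(inode)))
		SetPageUptodate(page);
	spin_unlock_irqrestore(&iop->uptodate_lock, flags);
}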


Re: [PATCH v2 1/2] kasan: support instrumented bitops combined with generic bitops

2019-12-03 Thread Michael Ellerman
Marco Elver  writes:
> On Wed, 20 Nov 2019 at 08:42, Daniel Axtens  wrote:
>>
>> > But the docs do seem to indicate that it's atomic (for whatever that
>> > means for a single read operation?), so you are right, it should live in
>> > instrumented-atomic.h.
>>
>> Actually, on further inspection, test_bit has lived in
>> bitops/non-atomic.h since it was added in 4117b02132d1 ("[PATCH] bitops:
>> generic __{,test_and_}{set,clear,change}_bit() and test_bit()")
>>
>> So to match that, the wrapper should live in instrumented-non-atomic.h
>> too.
>>
>> If test_bit should move, that would need to be a different patch. But I
>> don't really know if it makes too much sense to stress about a read
>> operation, as opposed to a read/modify/write...
>
> That's fair enough. I suppose this can stay where it is because it's
> not hurting anyone per se, but the only bad thing about it is that the
> kernel-api documentation will present test_bit() under non-atomic
> operations.

I only just noticed this thread as I was about to send a pull request
for these two commits.

I think I agree that test_bit() shouldn't move (yet), but I dislike that
the documentation ends up being confusing due to this patch.

So I'm inclined to append or squash in the patch below, which removes the
new section headings from the documentation. The end result is the docs look
more or less the same, just the ordering of some of the functions
changes. But we don't end up with test_bit() under the "Non-atomic"
header, and then also documented in Documentation/atomic_bitops.txt.

Thoughts?

cheers


diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 2caaeb55e8dd..4ac53a1363f6 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -57,21 +57,12 @@ The Linux kernel provides more basic utility functions.
 Bit Operations
 --
 
-Atomic Operations
-~~~~~~~~~~~~~~~~~
-
 .. kernel-doc:: include/asm-generic/bitops/instrumented-atomic.h
:internal:
 
-Non-atomic Operations
-~~~~~~~~~~~~~~~~~~~~~
-
 .. kernel-doc:: include/asm-generic/bitops/instrumented-non-atomic.h
:internal:
 
-Locking Operations
-~~~~~~~~~~~~~~~~~~
-
 .. kernel-doc:: include/asm-generic/bitops/instrumented-lock.h
:internal:
 


[bug] userspace hitting sporadic SIGBUS on xfs (Power9, ppc64le), v4.19 and later

2019-12-03 Thread Jan Stancek
Hi,

(This bug report is summary from thread [1] with some additions)

User-space binaries on Power9 ppc64le (with 64k pages) on an xfs
filesystem are sporadically hitting SIGBUS:

-- 8< --
(gdb) r
Starting program: /mnt/testarea/ltp/testcases/bin/genasin

Program received signal SIGBUS, Bus error.
dl_main (phdr=0x1040, phnum=, user_entry=0x7fffe760, 
auxv=) at rtld.c:1362
1362switch (ph->p_type)

(gdb) p ph
$1 = (const Elf64_Phdr *) 0x1040

(gdb) p *ph
Cannot access memory at address 0x1040

(gdb) info proc map
process 1110670
Mapped address spaces:

  Start Addr   End Addr   Size Offset objfile
  0x1000 0x10010x10x0 
/mnt/testarea/ltp/testcases/bin/genasin
  0x1001 0x10030x20x0 
/mnt/testarea/ltp/testcases/bin/genasin
  0x77f9 0x77fb0x20x0 [vdso]
  0x77fb 0x77fe0x30x0 
/usr/lib64/ld-2.30.so
  0x77fe 0x78000x20x2 
/usr/lib64/ld-2.30.so
  0x7ffd 0x80000x30x0 [stack]

(gdb) x/1x 0x1040
0x1040: Cannot access memory at address 0x1040
-- >8 --

When this happens, the binary continues to hit SIGBUS until the page
is released, for example by: echo 3 > /proc/sys/vm/drop_caches

The issue goes back to at least v4.19.

I can semi-reliably reproduce it, with LTP installed to /mnt/testarea/ltp, by:
while [ True ]; do
echo 3 > /proc/sys/vm/drop_caches
rm -f /mnt/testarea/ltp/results/RUNTEST.log 
/mnt/testarea/ltp/output/RUNTEST.run.log
./runltp -p -d results -l RUNTEST.log -o RUNTEST.run.log -f math
grep FAIL /mnt/testarea/ltp/results/RUNTEST.log && exit 1
done

and some stress activity in another terminal (e.g. a kernel build).
Sometimes it hits in minutes, sometimes in hours. It is not reliable
enough to get meaningful bisect results.

My theory is that there's a race in iomap. There appear to be
interleaved calls to iomap_set_range_uptodate() for the same page
with varying offset and length. Each call sees the bitmap as _not_
entirely "uptodate" and hence doesn't call SetPageUptodate(),
even though every bit in the bitmap ends up uptodate by the time
all the calls finish.

For example, with the following traces:

iomap_set_range_uptodate()
...
if (uptodate && !PageError(page))
SetPageUptodate(page);
+   
+   if (mycheck(page)) {
+   trace_printk("page: %px, iop: %px, uptodate: %d, !PageError(page): %d, flags: %lx\n",
+   page, iop, uptodate, !PageError(page), page->flags);
+   trace_printk("first: %u, last: %u, off: %u, len: %u, i: %u\n",
+   first, last, off, len, i);
+   }

I get:
 genacos-18471 [057]    162.465730: iomap_readpages: mapping: 
c03f185a1ab0
 genacos-18471 [057]    162.465732: iomap_page_create: 
iomap_page_create page: c00c0fe26180, page->private: , iop: 
c03fc70a19c0, flags: 380001
 genacos-18471 [057]    162.465736: iomap_set_range_uptodate: page: 
c00c0fe26180, iop: c03fc70a19c0, uptodate: 0, !PageError(page): 1, 
flags: 382001
 genacos-18471 [057]    162.465736: iomap_set_range_uptodate: 
first: 1, last: 14, off: 4096, len: 57344, i: 16
  -0 [060] ..s.   162.534862: iomap_set_range_uptodate: page: 
c00c0fe26180, iop: c03fc70a19c0, uptodate: 0, !PageError(page): 1, 
flags: 382081
  -0 [061] ..s.   162.534862: iomap_set_range_uptodate: page: 
c00c0fe26180, iop: c03fc70a19c0, uptodate: 0, !PageError(page): 1, 
flags: 382081
  -0 [060] ..s.   162.534864: iomap_set_range_uptodate: 
first: 0, last: 0, off: 0, len: 4096, i: 16
  -0 [061] ..s.   162.534864: iomap_set_range_uptodate: 
first: 15, last: 15, off: 61440, len: 4096, i: 16

This page doesn't have the Uptodate flag set, which leads to filemap_fault()
returning VM_FAULT_SIGBUS:

crash> p/x ((struct page *) 0xc00c0fe26180)->flags  

   
$1 = 0x382032

crash> kmem -g 0x382032
FLAGS: 382032
  PAGE-FLAG   BIT  VALUE
  PG_error  1  002
  PG_dirty  4  010
  PG_lru5  020
  PG_private_2 13  0002000
  PG_fscache   13  0002000
  PG_savepinned 4  010
  PG_double_map13  0002000

But iomap_page->uptodate in page->private suggests all bits are uptodate:

crash> p/x ((struct page *) 0xc00c0fe26180)->private
$2 = 0xc03fc70a19c0

crash> p/x ((struct iomap_page *) 0xc03fc70a19c0)->uptodate 

   
$3 = {0xffff, 0x0}
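
(0xffff in the first word is 16 set bits, one per 4k block of the 64k
page, so the per-block bitmap is indeed complete even though PG_uptodate
was never set.)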


It appears (after ~4 hours) that I 

Re: [PATCH v2 00/27] Add support for OpenCAPI SCM devices

2019-12-03 Thread Matthew Wilcox
On Tue, Dec 03, 2019 at 03:01:17PM +1100, Alastair D'Silva wrote:
> On Mon, 2019-12-02 at 19:50 -0800, Matthew Wilcox wrote:
> > On Tue, Dec 03, 2019 at 02:46:28PM +1100, Alastair D'Silva wrote:
> > > This series adds support for OpenCAPI SCM devices, exposing
> > 
> > Could we _not_ introduce yet another term for persistent memory?
> > 
> 
> "Storage Class Memory" is an industry wide term, and is used repeatedly
> in the device specifications. It's not something that has been pulled
> out of thin air.

"Somebody else also wrote down Storage Class Memory".  Don't care.
Google gets 750k hits for Persistent Memory and 150k hits for
Storage Class Memory.  This term lost.

> The term is also already in use within the 'papr_scm' driver.

The acronym "SCM" is already in use.  Socket Control Messages go back
to early Unix (SCM_RIGHTS, SCM_CREDENTIALS, etc).  Really, you're just
making the case that IBM already uses the term SCM.  You should stop.


[v5 1/3] soc: fsl: handle RCPM errata A-008646 on SoC LS1021A

2019-12-03 Thread Biwen Li
Description:
- Reading configuration register RCPM_IPPDEXPCR1
  always returns zero

Workaround:
- Save register RCPM_IPPDEXPCR1's value to
  register SCFG_SPARECR8. (U-Boot's PSCI also
  needs to read the value from register SCFG_SPARECR8
  to set register RCPM_IPPDEXPCR1)

Impact:
- Without the workaround, the FlexTimer module cannot
  wake up the system from deep sleep on SoC LS1021A

Signed-off-by: Biwen Li 
---
Change in v5:
- update the patch, because the rcpm driver has been updated.

Change in v4:
- rename property name
  fsl,ippdexpcr-alt-addr -> fsl,ippdexpcr1-alt-addr

Change in v3:
- update commit message
- rename property name
  fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr

Change in v2:
- fix style problems

 drivers/soc/fsl/rcpm.c | 47 --
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c
index a093dbe6d2cb..775c618f0456 100644
--- a/drivers/soc/fsl/rcpm.c
+++ b/drivers/soc/fsl/rcpm.c
@@ -6,13 +6,16 @@
 //
 // Author: Ran Wang 
 
+#include 
 #include 
+#include 
+#include 
 #include 
-#include 
 #include 
+#include 
+#include 
 #include 
 #include 
-#include 
 
 #define RCPM_WAKEUP_CELL_MAX_SIZE  7
 
@@ -37,6 +40,9 @@ static int rcpm_pm_prepare(struct device *dev)
struct device_node  *np = dev->of_node;
u32 value[RCPM_WAKEUP_CELL_MAX_SIZE + 1];
u32 setting[RCPM_WAKEUP_CELL_MAX_SIZE] = {0};
+   struct regmap *scfg_addr_regmap = NULL;
+   u32 reg_offset[RCPM_WAKEUP_CELL_MAX_SIZE + 1];
+   u32 reg_value = 0;
 
rcpm = dev_get_drvdata(dev);
if (!rcpm)
@@ -90,6 +96,43 @@ static int rcpm_pm_prepare(struct device *dev)
tmp |= ioread32be(address);
iowrite32be(tmp, address);
}
+   /*
+* Workaround for erratum A-008646 on SoC LS1021A:
+* register IPPDEXPCR1 is buggy, so reading configuration
+* register RCPM_IPPDEXPCR1 always returns zero. Save
+* IPPDEXPCR1's value to register SCFG_SPARECR8, and read
+* the value of IPPDEXPCR1 back from SCFG_SPARECR8.
+*/
+   if (device_property_present(dev, "fsl,ippdexpcr1-alt-addr")) {
+   if (dev_of_node(dev)) {
+   scfg_addr_regmap = syscon_regmap_lookup_by_phandle(np,
+   "fsl,ippdexpcr1-alt-addr");
+   } else if (is_acpi_node(dev->fwnode)) {
+   dev_err(dev, "not support acpi for rcpm\n");
+   continue;
+   }
+
+   if (scfg_addr_regmap && (i == 1)) {
+   if (device_property_read_u32_array(dev,
+   "fsl,ippdexpcr1-alt-addr",
+   reg_offset,
+   1 + sizeof(u64)/sizeof(u32))) {
+   scfg_addr_regmap = NULL;
+   continue;
+   }
+   /* Read value from register SCFG_SPARECR8 */
+   regmap_read(scfg_addr_regmap,
+   (u32)(((u64)(reg_offset[1] << (sizeof(u32) * 8) |
+   reg_offset[2])) & 0xffffffff),
+   &reg_value);
+   /* Write value to register SCFG_SPARECR8 */
+   regmap_write(scfg_addr_regmap,
+(u32)(((u64)(reg_offset[1] << (sizeof(u32) * 8) |
+reg_offset[2])) & 0xffffffff),
+tmp | reg_value);
+   }
+   }
}
 
return 0;
-- 
2.17.1



[v5 2/3] arm: dts: ls1021a: fix that FlexTimer cannot wakeup system in deep sleep

2019-12-03 Thread Biwen Li
This patch fixes a bug where the FlexTimer cannot
wake up the system from deep sleep.

Signed-off-by: Biwen Li 
---
Change in v5:
- none

Change in v4:
- update property name
  fsl,ippdexpcr-alt-addr -> fsl,ippdexpcr1-alt-addr

Change in v3:
- update property name
  fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr
  
Change in v2:
- none

 arch/arm/boot/dts/ls1021a.dtsi | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm/boot/dts/ls1021a.dtsi b/arch/arm/boot/dts/ls1021a.dtsi
index 816e2926c448..6659d83c3aa2 100644
--- a/arch/arm/boot/dts/ls1021a.dtsi
+++ b/arch/arm/boot/dts/ls1021a.dtsi
@@ -988,6 +988,12 @@
compatible = "fsl,ls1021a-rcpm", "fsl,qoriq-rcpm-2.1+";
reg = <0x0 0x1ee2140 0x0 0x8>;
#fsl,rcpm-wakeup-cells = <2>;
+
+   /*
+* The second and third entry compose an alt offset
+* address for IPPDEXPCR1(SCFG_SPARECR8)
+*/
fsl,ippdexpcr1-alt-addr = <&scfg 0x0 0x51c>;
};
 
ftm_alarm0: timer0@29d {
-- 
2.17.1



[v5 3/3] Documentation: dt: binding: fsl: Add 'fsl, ippdexpcr1-alt-addr' property

2019-12-03 Thread Biwen Li
The 'fsl,ippdexpcr1-alt-addr' property is used to handle erratum A-008646
on LS1021A

Signed-off-by: Biwen Li 
---
Change in v5:
- none

Change in v4:
- rename property name
  fsl,ippdexpcr-alt-addr -> fsl,ippdexpcr1-alt-addr

Change in v3:
- rename property name
  fsl,rcpm-scfg -> fsl,ippdexpcr-alt-addr

Change in v2:
- update desc of the property
  'fsl,rcpm-scfg'

 .../devicetree/bindings/soc/fsl/rcpm.txt  | 21 +++
 1 file changed, 21 insertions(+)

diff --git a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
index 5a33619d881d..751a7655b694 100644
--- a/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
+++ b/Documentation/devicetree/bindings/soc/fsl/rcpm.txt
@@ -34,6 +34,13 @@ Chassis Version  Example Chips
 Optional properties:
  - little-endian : RCPM register block is Little Endian. Without it RCPM
will be Big Endian (default case).
+ - fsl,ippdexpcr1-alt-addr : This property is related to a hardware issue
+   on SoC LS1021A and is only needed on SoC LS1021A.
+   It must include 1 + 2 entries.
+   The first entry must be a phandle to the SCFG device node.
+   The remaining entries are a register offset within SCFG:
+   the second and third entries compose an alt offset address
+   for IPPDEXPCR1 (SCFG_SPARECR8).
 
 Example:
 The RCPM node for T4240:
@@ -43,6 +50,20 @@ The RCPM node for T4240:
#fsl,rcpm-wakeup-cells = <2>;
};
 
+The RCPM node for LS1021A:
+   rcpm: rcpm@1ee2140 {
+   compatible = "fsl,ls1021a-rcpm", "fsl,qoriq-rcpm-2.1+";
+   reg = <0x0 0x1ee2140 0x0 0x8>;
+   #fsl,rcpm-wakeup-cells = <2>;
+
+   /*
+* The second and third entry compose an alt offset
+* address for IPPDEXPCR1(SCFG_SPARECR8)
+*/
+   fsl,ippdexpcr1-alt-addr = <&scfg 0x0 0x51c>;
+   };
+
+
 * Freescale RCPM Wakeup Source Device Tree Bindings
 ---
 Required fsl,rcpm-wakeup property should be added to a device node if the 
device
-- 
2.17.1



Re: [PATCH v3 2/4] powerpc/fadump: reorganize /sys/kernel/fadump_* sysfs files

2019-12-03 Thread Sourabh Jain
  DEFINE_SHOW_ATTRIBUTE(fadump_region);
  
  static void fadump_init_files(void)
 @@ -1435,6 +1448,11 @@ static void fadump_init_files(void)
struct dentry *debugfs_file;
int rc = 0;
  
 +  fadump_kobj = kobject_create_and_add("fadump", kernel_kobj);
 +  if (!fadump_kobj) {
 +  pr_err("failed to create fadump kobject\n");
 +  return;
 +  }
rc = sysfs_create_file(kernel_kobj, _attr.attr);
if (rc)
printk(KERN_ERR "fadump: unable to create sysfs file"
 @@ -1458,6 +1476,26 @@ static void fadump_init_files(void)
printk(KERN_ERR "fadump: unable to create sysfs file"
" fadump_release_mem (%d)\n", rc);
}
 +  /* Replicating the following sysfs attributes under FADump kobject.
 +   *
 +   *  - fadump_enabled -> enabled
 +   *  - fadump_registered -> registered
 +   *  - fadump_release_mem -> release_mem
 +   */
 +  rc = sysfs_create_file(fadump_kobj, _attr.attr);
 +  if (rc)
 +  pr_err("unable to create enabled sysfs file (%d)\n",
 + rc);
 +  rc = sysfs_create_file(fadump_kobj, _attr.attr);
 +  if (rc)
 +  pr_err("unable to create registered sysfs file (%d)\n",
 + rc);
 +  if (fw_dump.dump_active) {
 +  rc = sysfs_create_file(fadump_kobj, _attr.attr);
 +  if (rc)
 +  pr_err("unable to create release_mem sysfs file (%d)\n",
 + rc);
 +  }
return;
  }
>>> Hello,
>>>
>>
>> I’m so sorry for taking this long to write you back. 
>>
>>> wouldn't it make more sense to create the files in the new location and
>>> add a symlink at the old location?
>>
>> There are APIs which allow creating a symlink for an entire kobject, but
>> I did not find a way to create a symlink for an individual sysfs file.
>>
>> Do you have any approach in mind to achieve the same?
> 
> There is at least one example of plain symlink:
> 
> find /sys -type l -xtype f
> /sys/kernel/security/evm
> 
> If there is no interface to do one sanely, duplicating the files is
> better than nothing.

Hello Michal,

I found a function (__compat_only_sysfs_link_entry_to_kobj) that adds a
symlink to a kobject.

But the problem is that __compat_only_sysfs_link_entry_to_kobj keeps the
symlink file name the same as the target sysfs file and has no option to
change it.

We can't use __compat_only_sysfs_link_entry_to_kobj directly because
our symlink file name must be different from the target file name:

fadump_enabled -> fadump/enabled

But the good thing is we can tweak __compat_only_sysfs_link_entry_to_kobj
and allow the caller to change the symlink file name.

So I am writing a separate patch that adds a wrapper function around
__compat_only_sysfs_link_entry_to_kobj which will allow a custom symlink
file name.
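
For illustration, a hedged sketch of such a wrapper built on the kernfs
primitives that __compat_only_sysfs_link_entry_to_kobj already uses
(locking and duplicate-name handling trimmed; the function name is
illustrative, not the actual patch):

static int sysfs_link_entry_as(struct kobject *kobj,
			       struct kobject *target_kobj,
			       const char *target_name,
			       const char *link_name)
{
	struct kernfs_node *entry, *link;

	/* look up the attribute under the target kobject ... */
	entry = kernfs_find_and_get(target_kobj->sd, target_name);
	if (!entry)
		return -ENOENT;

	/* ... and link to it under the caller-chosen name */
	link = kernfs_create_link(kobj->sd, link_name, entry);
	kernfs_put(entry);

	return PTR_ERR_OR_ZERO(link);
}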

Thanks,
Sourabh Jain



[PATCH resend] ASoC: fsl_audmix: add missed pm_runtime_disable

2019-12-03 Thread Chuhong Yuan
The driver forgets to call pm_runtime_disable on probe failure
and in remove.
Add the missing calls to fix it.

Signed-off-by: Chuhong Yuan 
---
 sound/soc/fsl/fsl_audmix.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/sound/soc/fsl/fsl_audmix.c b/sound/soc/fsl/fsl_audmix.c
index a1db1bce330f..5faecbeb5497 100644
--- a/sound/soc/fsl/fsl_audmix.c
+++ b/sound/soc/fsl/fsl_audmix.c
@@ -505,15 +505,20 @@ static int fsl_audmix_probe(struct platform_device *pdev)
  ARRAY_SIZE(fsl_audmix_dai));
if (ret) {
dev_err(dev, "failed to register ASoC DAI\n");
-   return ret;
+   goto err_disable_pm;
}
 
priv->pdev = platform_device_register_data(dev, mdrv, 0, NULL, 0);
if (IS_ERR(priv->pdev)) {
ret = PTR_ERR(priv->pdev);
dev_err(dev, "failed to register platform %s: %d\n", mdrv, ret);
+   goto err_disable_pm;
}
 
+   return 0;
+
+err_disable_pm:
+   pm_runtime_disable(dev);
return ret;
 }
 
@@ -521,6 +526,8 @@ static int fsl_audmix_remove(struct platform_device *pdev)
 {
struct fsl_audmix *priv = dev_get_drvdata(&pdev->dev);
 
+   pm_runtime_disable(&pdev->dev);
+
if (priv->pdev)
platform_device_unregister(priv->pdev);
 
-- 
2.24.0



Re: [PATCH v5 0/5] Append new variables to vmcoreinfo (TCR_EL1.T1SZ for arm64 and MAX_PHYSMEM_BITS for all archs)

2019-12-03 Thread Will Deacon
On Sat, Nov 30, 2019 at 01:35:36AM +0530, Bhupesh Sharma wrote:
> On Fri, Nov 29, 2019 at 3:54 PM Will Deacon  wrote:
> > On Fri, Nov 29, 2019 at 01:53:36AM +0530, Bhupesh Sharma wrote:
> > > Changes since v4:
> > > 
> > > - v4 can be seen here:
> > >   http://lists.infradead.org/pipermail/kexec/2019-November/023961.html
> > > - Addressed comments from Dave and added patches for documenting
> > >   new variables appended to vmcoreinfo documentation.
> > > - Added testing report shared by Akashi for PATCH 2/5.
> >
> > Please can you fix your mail setup? The last two times you've sent this
> > series it seems to get split into two threads, which is really hard to
> > track in my inbox:
> >
> > First thread:
> >
> > https://lore.kernel.org/lkml/1574972621-25750-1-git-send-email-bhsha...@redhat.com/
> >
> > Second thread:
> >
> > https://lore.kernel.org/lkml/1574972716-25858-1-git-send-email-bhsha...@redhat.com/
> 
> There seems to be some issue with my server's msmtp settings. I have
> tried resending the v5 (see
> ).
> 
> I hope the threading is ok this time.

Much better now, thanks for sorting it out.

Will


Re: [PATCH v11 0/7] KVM: PPC: Driver to manage pages of secure guest

2019-12-03 Thread Bharata B Rao
On Sun, Dec 01, 2019 at 12:24:50PM -0800, Hugh Dickins wrote:
> On Thu, 28 Nov 2019, Bharata B Rao wrote:
> > On Mon, Nov 25, 2019 at 08:36:24AM +0530, Bharata B Rao wrote:
> > > Hi,
> > > 
> > > This is the next version of the patchset that adds required support
> > > in the KVM hypervisor to run secure guests on PEF-enabled POWER platforms.
> > > 
> > 
> > Here is a fix for the issue Hugh identified with the usage of ksm_madvise()
> > in this patchset. It applies on top of this patchset.
> 
> It looks correct to me, and I hope will not spoil your performance in any
> way that matters.  But I have to say, the patch would be so much clearer,
> if you just named your bool "downgraded" instead of "downgrade".

Thanks for confirming. Yes, "downgraded" would have been more
appropriate; I will probably change it the next time we touch this
part of the code.

Regards,
Bharata.



[PATCH V11] mm/debug: Add tests validating architecture page table helpers

2019-12-03 Thread Anshuman Khandual
This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.

This test covers basic page table entry transformations, including but not
limited to old, young, dirty, clean, write, write protect, etc., at various
levels, along with populating intermediate entries with the next page table
page and validating them.

Test page table pages are allocated from system memory with the required
size and alignment. The mapped pfns at the page table levels are derived
from a real pfn representing a valid kernel text symbol. This test gets
called right after page_alloc_init_late().

This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need
to select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86
and arm64. Going forward, other architectures too can enable this after
fixing build or runtime problems (if any) with their page table helpers.

Folks interested in making sure that a given platform's page table helpers
conform to expected generic MM semantics should enable the above config,
which will just trigger this test during boot. Any non-conformity here will
be reported as a warning, which will need to be fixed. This test will help
catch any changes to the agreed-upon semantics expected from generic MM and
enable platforms to accommodate them thereafter.
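
To give a flavor of what such a validation looks like, here is a hedged
sketch of a basic PTE check of the kind described above (illustrative
only, not the exact patch body, which is not quoted here):

static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
{
	pte_t pte = pfn_pte(pfn, prot);

	/* each helper must round-trip through its inverse */
	WARN_ON(!pte_same(pte, pte));
	WARN_ON(!pte_young(pte_mkyoung(pte)));
	WARN_ON(!pte_dirty(pte_mkdirty(pte)));
	WARN_ON(!pte_write(pte_mkwrite(pte)));
	WARN_ON(pte_young(pte_mkold(pte)));
	WARN_ON(pte_dirty(pte_mkclean(pte)));
	WARN_ON(pte_write(pte_wrprotect(pte)));
}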

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Ingo Molnar 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org

Tested-by: Christophe Leroy#PPC32
Reviewed-by: Ingo Molnar 
Suggested-by: Catalin Marinas 
Signed-off-by: Andrew Morton 
Signed-off-by: Christophe Leroy 
Signed-off-by: Anshuman Khandual 
---
This adds a test validation for architecture exported page table helpers.
Patch adds basic transformation tests at various levels of the page table.

This test was originally suggested by Catalin during arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB etc and
platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Needs to be applied on linux V5.4

Changes in V11:

- Rebased the patch on V5.4

Changes in V10: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=205529)

- Always enable DEBUG_VM_PGTABLE when DEBUG_VM is enabled per Ingo
- Added tags from Ingo

Changes in V9: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=201429)

- Changed feature support enumeration for powerpc platforms per Christophe
- Changed config wrapper for basic_[pmd|pud]_tests() to enable ARC platform
- Enabled the test on ARC platform

Changes in V8: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=194297)

- Enabled ARCH_HAS_DEBUG_VM_PGTABLE on PPC32 platform per Christophe
- Updated feature documentation as DEBUG_VM_PGTABLE is now enabled on PPC32 
platform
- Moved ARCH_HAS_DEBUG_VM_PGTABLE earlier to indent it with DEBUG_VM per 
Christophe
- Added an information message in debug_vm_pgtable() per Christophe
- Dropped random_vaddr boundary condition checks per Christophe and Qian
- Replaced virt_addr_valid() check with pfn_valid() check in debug_vm_pgtable()
- Slightly changed pr_fmt(fmt) information

Changes in V7: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=193051)

- Memory allocation and free routines for mapped pages have been dropped
- Mapped pfns are derived from standard kernel text symbol per Matthew
- Moved debug_vm_pgtable() after page_alloc_init_late() per Michal and Qian
- Updated the commit message per Michal
- Updated W=1 GCC warning problem on x86 per Qian Cai
- Addition of new alloc_contig_pages() helper has been submitted separately

Changes in V6: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=187589)

- Moved alloc_gigantic_page_order() into mm/page_alloc.c per Michal
- Moved alloc_gigantic_page_order() 

Re: [PATCH 0/9] Improve boot command line handling

2019-12-03 Thread Christophe Leroy




On 02/04/2019 at 11:08, Christophe Leroy wrote:

The purpose of this series is to improve and enhance the
handling of kernel boot arguments.

It is first focussed on powerpc but also extends the capability
for other arches.

This is based on suggestion from Daniel Walker 


It looks like nobody has been interested in this series.

It doesn't apply anymore and I don't plan to rebase it, so I'll retire it.

Christophe




Christophe Leroy (9):
   powerpc: enable appending of CONFIG_CMDLINE to bootloader's cmdline.
   Add generic function to build command line.
   drivers: of: use cmdline building function
   powerpc/prom_init: get rid of PROM_SCRATCH_SIZE
   powerpc: convert to generic builtin command line
   Add capability to prepend the command line
   powerpc: add capability to prepend default command line
   Gives arches opportunity to use generically defined boot cmdline
 manipulation
   powerpc: use generic CMDLINE manipulations

  arch/powerpc/Kconfig   | 23 ++
  arch/powerpc/kernel/prom_init.c| 38 ++-
  arch/powerpc/kernel/prom_init_check.sh |  2 +-
  drivers/of/fdt.c   | 23 +++---
  include/linux/cmdline.h| 37 ++
  init/Kconfig   | 56 ++
  6 files changed, 117 insertions(+), 62 deletions(-)
  create mode 100644 include/linux/cmdline.h



Re: Build failure on latest powerpc/merge (311ae9e159d8 io_uring: fix dead-hung for non-iter fixed rw)

2019-12-03 Thread Geert Uytterhoeven
Hi Jens,

On Fri, Nov 29, 2019 at 5:06 PM Jens Axboe  wrote:
> On 11/29/19 6:53 AM, Christophe Leroy wrote:
> > CC  fs/io_uring.o
> > fs/io_uring.c: In function ‘loop_rw_iter’:
> > fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
> > [-Werror=implicit-function-declaration]
> >   iovec.iov_base = kmap(iter->bvec->bv_page)
> >^
> > fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
> > without a cast [-Wint-conversion]
> >   iovec.iov_base = kmap(iter->bvec->bv_page)
> >  ^
> > fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
> > [-Werror=implicit-function-declaration]
> >   kunmap(iter->bvec->bv_page);
> >   ^
> >
> >
> > Reverting commit 311ae9e159d8 ("io_uring: fix dead-hung for non-iter
> > fixed rw") clears the failure.
> >
> > Most likely an #include is missing.
>
> Huh weird how the build bots didn't catch that. Does the below work?

Thanks, this fixes the same issue on SuperH:

Tested-by: Geert Uytterhoeven 

> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -69,6 +69,7 @@
>   #include 
>   #include 
>   #include 
> +#include 
>
>   #define CREATE_TRACE_POINTS
>   #include 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [-merge] BUG followed by oops running ndctl tests

2019-12-03 Thread Aneesh Kumar K.V
Sachin Sant  writes:

> The following Oops is seen on the latest (commit 3b4852888d) powerpc merge
> branch code while running ndctl (test_namespace) tests
>
> 85c5b0984e was good.
>
>  (06/12) avocado-misc-tests/memory/ndctl.py:NdctlTest.test_namespace:  [  
> 213.570536] memmap_init_zone_device initialised 1636608 pages in 10ms
> [  213.570835] pmem0: detected capacity change from 0 to 107256741888
> [  216.488983] BUG: Unable to handle kernel data access at 0xc439
> [  216.488996] Faulting instruction address: 0xc0087510
> [  216.489002] Oops: Kernel access of bad area, sig: 11 [#1]
> [  216.489007] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [  216.489019] Dumping ftrace buffer:
> [  216.489029](ftrace buffer empty)
> [  216.489033] Modules linked in: dm_mod nf_conntrack nf_defrag_ipv6 
> nf_defrag_ipv4 libcrc32c ip6_tables nft_compat ip_set nf_tables nfnetlink 
> sunrpc sg pseries_rng papr_scm uio_pdrv_genirq uio sch_fq_codel ip_tables 
> sd_mod ibmvscsi ibmveth scsi_transport_srp
> [  216.489059] CPU: 8 PID: 17523 Comm: lt-ndctl Not tainted 
> 5.4.0-rc7-autotest #1
> [  216.489065] NIP:  c0087510 LR: c008752c CTR: 
> 01ffce80
> [  216.489071] REGS: c07ca84a37d0 TRAP: 0300   Not tainted  
> (5.4.0-rc7-autotest)
> [  216.489076] MSR:  8280b033   CR: 
> 42048224  XER: 
> [  216.489086] CFAR: c0087518 DAR: c439 DSISR: 4000 
> IRQMASK: 0 
> [  216.489086] GPR00: c008752c c07ca84a3a60 c159bb00 
>  
> [  216.489086] GPR04: 40066bdea7010e15 60553194  
> 0080 
> [  216.489086] GPR08: c439 c07f 0180 
>  
> [  216.489086] GPR12: 8000 c0001ec5d200 7897f9e9 
> 1002e088 
> [  216.489086] GPR16:  10050d88 1002f778 
> 1002f770 
> [  216.489086] GPR20:  1002e048 10050e3d 
> 10050e40 
> [  216.489086] GPR24:  c07c8d0a6c10 c07cced28a20 
> c1463048 
> [  216.489086] GPR28: c4208000 c4204000 c439 
> c420 
> [  216.489137] NIP [c0087510] arch_remove_memory+0x100/0x1b0
> [  216.489143] LR [c008752c] arch_remove_memory+0x11c/0x1b0
> [  216.489148] Call Trace:
> [  216.489151] [c07ca84a3a60] [c008752c] 
> arch_remove_memory+0x11c/0x1b0 (unreliable)
> [  216.489159] [c07ca84a3b00] [c0407258] 
> memunmap_pages+0x188/0x2c0
> [  216.489167] [c07ca84a3b80] [c07b0810] 
> devm_action_release+0x30/0x50
> [  216.489174] [c07ca84a3ba0] [c07b18f8] release_nodes+0x2f8/0x3e0
> [  216.489180] [c07ca84a3c50] [c07aa698] 
> device_release_driver_internal+0x168/0x270
> [  216.489187] [c07ca84a3c90] [c07a6ad0] unbind_store+0x130/0x170
> [  216.489193] [c07ca84a3cd0] [c07a5c34] drv_attr_store+0x44/0x60
> [  216.489200] [c07ca84a3cf0] [c04fa0d8] sysfs_kf_write+0x68/0x80
> [  216.489205] [c07ca84a3d10] [c04f9530] 
> kernfs_fop_write+0xf0/0x270
> [  216.489212] [c07ca84a3d60] [c040cbdc] __vfs_write+0x3c/0x70
> [  216.489217] [c07ca84a3d80] [c041052c] vfs_write+0xcc/0x240
> [  216.489223] [c07ca84a3dd0] [c041090c] ksys_write+0x7c/0x140
> [  216.489229] [c07ca84a3e20] [c000b278] system_call+0x5c/0x68
> [  216.489233] Instruction dump:
> [  216.489238] 80fb0008 815b000c 7d0700d0 7d08e038 7c0004ac 4c00012c 3927 
> 7d29ea14 
> [  216.489245] 7d284850 7d2a5437 41820014 7d4903a6 <7c0040ac> 7d083a14 
> 4200fff8 7c0004ac 
> [  216.489254] ---[ end trace d9a4dfc9e158858a ]---
>
> Thanks
> -Sachin

Can you try this patch?

commit 0eb3f28de8ad769c1c559f1269f9a9447af08005
Author: Aneesh Kumar K.V 
Date:   Tue Dec 3 14:23:58 2019 +0530

powerpc/pmem: Fix kernel crash due to wrong range value usage in flush_dcache_range

This patch fixes the kernel crash below.

 BUG: Unable to handle kernel data access on read at 0xc0038000
 Faulting instruction address: 0xc008b6f0
cpu 0x5: Vector: 300 (Data Access) at [c000d8587790]
pc: c008b6f0: arch_remove_memory+0x150/0x210
lr: c008b720: arch_remove_memory+0x180/0x210
sp: c000d8587a20
   msr: 8280b033
   dar: c0038000
 dsisr: 4000
  current = 0xc000d8558600
  paca= 0xcfff8f00   irqmask: 0x03   irq_happened: 0x01
pid   = 1220, comm = ndctl
enter ? for help
 memunmap_pages+0x33c/0x410
 devm_action_release+0x30/0x50
 release_nodes+0x30c/0x3a0
 device_release_driver_internal+0x178/0x240
 unbind_store+0x74/0x190
 drv_attr_store+0x44/0x60
 sysfs_kf_write+0x74/0xa0
 kernfs_fop_write+0x1b0/0x260
 __vfs_write+0x3c/0x70
 vfs_write+0xe4/0x200
 ksys_write+0x7c/0x140