Re: [PATCH 1/3] powerpc/xive: Fix trying to "push" an already active pool VP
On Wed, 2018-04-11 at 15:17 +1000, Benjamin Herrenschmidt wrote:
> When setting up a CPU, we "push" (activate) a pool VP for it.
>
> However it's an error to do so if it already has an active
> pool VP.
>
> This happens when doing soft CPU hotplug on powernv since we
> don't tear down the CPU on unplug. The HW flags the error which
> gets captured by the diagnostics.
>
> Fix this by making sure to "pull" out any already active pool
> first.
>
> Signed-off-by: Benjamin Herrenschmidt
> CC: sta...@vger.kernel.org
> ...
> ---
>  arch/powerpc/sysdev/xive/native.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
> index d22aeb0b69e1..b48454be5b98 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -389,6 +389,10 @@ static void xive_native_setup_cpu(unsigned int cpu, struct xive_cpu *xc)
>  	if (xive_pool_vps == XIVE_INVALID_VP)
>  		return;
>
> +	/* Check if pool VP already active, if it is, pull it */
> +	if (in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2) & TM_QW2W2_VP)
> +		in_be64(xive_tima + TM_SPC_PULL_POOL_CTX);
> +
>  	/* Enable the pool VP */
>  	vp = xive_pool_vps + cpu;
>  	pr_debug("CPU %d setting up pool VP 0x%x\n", cpu, vp);
[PATCH 2/3] powerpc/xive: Remove now useless pr_debug statements
Those overly verbose statements in the setup of the pool VP aren't particularly useful (especially considering we don't actually use the pool; we only configure it because the HW requires it), so remove them, which improves code readability.

Signed-off-by: Benjamin Herrenschmidt
---
 arch/powerpc/sysdev/xive/native.c | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index b48454be5b98..c7088a35eb89 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -395,7 +395,6 @@ static void xive_native_setup_cpu(unsigned int cpu, struct xive_cpu *xc)

 	/* Enable the pool VP */
 	vp = xive_pool_vps + cpu;
-	pr_debug("CPU %d setting up pool VP 0x%x\n", cpu, vp);
 	for (;;) {
 		rc = opal_xive_set_vp_info(vp, OPAL_XIVE_VP_ENABLED, 0);
 		if (rc != OPAL_BUSY)
@@ -415,16 +414,9 @@ static void xive_native_setup_cpu(unsigned int cpu, struct xive_cpu *xc)
 	}
 	vp_cam = be64_to_cpu(vp_cam_be);

-	pr_debug("VP CAM = %llx\n", vp_cam);
-
 	/* Push it on the CPU (set LSMFB to 0xff to skip backlog scan) */
-	pr_debug("(Old HW value: %08x)\n",
-		 in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2));
 	out_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD0, 0xff);
-	out_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2,
-		 TM_QW2W2_VP | vp_cam);
-	pr_debug("(New HW value: %08x)\n",
-		 in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2));
+	out_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2, TM_QW2W2_VP | vp_cam);
 }

 static void xive_native_teardown_cpu(unsigned int cpu, struct xive_cpu *xc)
-- 
2.14.3
[PATCH 3/3] powerpc/xive: Remove xive_kexec_teardown_cpu()
It's identical to xive_teardown_cpu(), so just use the latter.

Signed-off-by: Benjamin Herrenschmidt
---
 arch/powerpc/include/asm/xive.h        |  1 -
 arch/powerpc/platforms/powernv/setup.c |  2 +-
 arch/powerpc/platforms/pseries/kexec.c |  2 +-
 arch/powerpc/sysdev/xive/common.c      | 22 --
 4 files changed, 2 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 8d1a2792484f..3c704f5dd3ae 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -87,7 +87,6 @@ extern int xive_smp_prepare_cpu(unsigned int cpu);
 extern void xive_smp_setup_cpu(void);
 extern void xive_smp_disable_cpu(void);
 extern void xive_teardown_cpu(void);
-extern void xive_kexec_teardown_cpu(int secondary);
 extern void xive_shutdown(void);
 extern void xive_flush_interrupt(void);

diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 092715b9674b..5b4b09816791 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -282,7 +282,7 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
 	u64 reinit_flags;

 	if (xive_enabled())
-		xive_kexec_teardown_cpu(secondary);
+		xive_teardown_cpu();
 	else
 		xics_kexec_teardown_cpu(secondary);

diff --git a/arch/powerpc/platforms/pseries/kexec.c b/arch/powerpc/platforms/pseries/kexec.c
index eeb13429d685..9dabf019556b 100644
--- a/arch/powerpc/platforms/pseries/kexec.c
+++ b/arch/powerpc/platforms/pseries/kexec.c
@@ -53,7 +53,7 @@ void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
 	}

 	if (xive_enabled())
-		xive_kexec_teardown_cpu(secondary);
+		xive_teardown_cpu();
 	else
 		xics_kexec_teardown_cpu(secondary);
 }

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 40c06110821c..c8db51b60b4b 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1408,28 +1408,6 @@ void xive_teardown_cpu(void)
 	xive_cleanup_cpu_queues(cpu, xc);
 }

-void xive_kexec_teardown_cpu(int secondary)
-{
-	struct xive_cpu *xc = __this_cpu_read(xive_cpu);
-	unsigned int cpu = smp_processor_id();
-
-	/* Set CPPR to 0 to disable flow of interrupts */
-	xc->cppr = 0;
-	out_8(xive_tima + xive_tima_offset + TM_CPPR, 0);
-
-	/* Backend cleanup if any */
-	if (xive_ops->teardown_cpu)
-		xive_ops->teardown_cpu(cpu, xc);
-
-#ifdef CONFIG_SMP
-	/* Get rid of IPI */
-	xive_cleanup_cpu_ipi(cpu, xc);
-#endif
-
-	/* Disable and free the queues */
-	xive_cleanup_cpu_queues(cpu, xc);
-}
-
 void xive_shutdown(void)
 {
 	xive_ops->shutdown();
-- 
2.14.3
[PATCH 1/3] powerpc/xive: Fix trying to "push" an already active pool VP
When setting up a CPU, we "push" (activate) a pool VP for it.

However it's an error to do so if it already has an active pool VP.

This happens when doing soft CPU hotplug on powernv since we don't tear down the CPU on unplug. The HW flags the error which gets captured by the diagnostics.

Fix this by making sure to "pull" out any already active pool first.

Signed-off-by: Benjamin Herrenschmidt
---
 arch/powerpc/sysdev/xive/native.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index d22aeb0b69e1..b48454be5b98 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -389,6 +389,10 @@ static void xive_native_setup_cpu(unsigned int cpu, struct xive_cpu *xc)
 	if (xive_pool_vps == XIVE_INVALID_VP)
 		return;

+	/* Check if pool VP already active, if it is, pull it */
+	if (in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2) & TM_QW2W2_VP)
+		in_be64(xive_tima + TM_SPC_PULL_POOL_CTX);
+
 	/* Enable the pool VP */
 	vp = xive_pool_vps + cpu;
 	pr_debug("CPU %d setting up pool VP 0x%x\n", cpu, vp);
-- 
2.14.3
Re: [PATCH 2/2] powerpc/mm/memtrace: Let the arch hotunplug code flush cache
On 06/04/18 15:24, Balbir Singh wrote:
> Don't do this via custom code, instead now that we have support
> in the arch hotplug/hotunplug code, rely on those routines
> to do the right thing.
>
> Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
> because the older code uses ppc64_caches.l1d.size instead of
> ppc64_caches.l1d.line_size
>
> Signed-off-by: Balbir Singh

Reviewed-by: Rashmica Gupta
Re: [PATCH 1/2] powerpc/mm: Flush cache on memory hot(un)plug
On 06/04/18 15:24, Balbir Singh wrote:
> This patch adds support for flushing potentially dirty
> cache lines when memory is hot-plugged/hot-un-plugged.
> The support is currently limited to 64 bit systems.
>
> The bug was exposed when mappings for a device were
> actually hot-unplugged and plugged in back later.
> A similar issue was observed during the development
> of memtrace, but memtrace does its own flushing of
> the region via a custom routine.
>
> These patches do a flush both on hotplug/unplug to
> clear any stale data in the cache w.r.t. mappings;
> there is a small race window where a clean cache
> line may be created again just prior to tearing
> down the mapping.
>
> The patches were tested by disabling the flush
> routines in memtrace and doing I/O on the trace
> file. The system immediately checkstops (quite
> reliably if, prior to the hot-unplug of the memtrace
> region, we memset the regions we are about to
> hot-unplug). After these patches no custom flushing
> is needed in the memtrace code.
>
> Signed-off-by: Balbir Singh

Reviewed-by: Rashmica Gupta
[PATCH] powerpc/eeh: Fix enabling bridge MMIO windows
On boot we save the configuration space of PCIe bridges. We do this so that when we get an EEH event and everything gets reset, we can restore them.

Unfortunately we save this state before we've enabled the MMIO space on the bridges. Hence if we have to reset the bridge, when we come back MMIO is not enabled and we end up taking a PE freeze when the driver starts accessing again.

This patch forces the memory/MMIO and bus mastering bits on when restoring bridges on EEH. Ideally we'd do this correctly by saving the configuration space later (after MMIO has been enabled), but that will have to come in a larger EEH rewrite. For now we have this simple fix.

The original bug can be triggered on a boston machine by doing:

  echo 0x8000 > /sys/kernel/debug/powerpc/PCI0001/err_injct_outbound

On boston, this PHB has a PCIe switch on it. Without this patch, you'll see two EEH events, one expected and one being the failure we are fixing here. The second EEH event causes anything under the PHB to disappear (i.e. the i40e eth). With this patch, only one EEH event occurs and devices properly recover.

Reported-by: Pridhiviraj Paidipeddi
Signed-off-by: Michael Neuling
Cc: sta...@vger.kernel.org
---
 arch/powerpc/kernel/eeh_pe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 2d4956e97a..ee5a67d57a 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -807,7 +807,8 @@ static void eeh_restore_bridge_bars(struct eeh_dev *edev)
 		eeh_ops->write_config(pdn, 15*4, 4, edev->config_space[15]);

 	/* PCI Command: 0x4 */
-	eeh_ops->write_config(pdn, PCI_COMMAND, 4, edev->config_space[1]);
+	eeh_ops->write_config(pdn, PCI_COMMAND, 4, edev->config_space[1] |
+			      PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER);

 	/* Check the PCIe link is ready */
 	eeh_bridge_check_link(edev);
-- 
2.14.1
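The one-line change above boils down to OR-ing two standard PCI command-register bits into the saved value before writing it back. A minimal sketch of that idea — the bit values are the standard PCI ones, but the helper name is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Standard PCI command register bits (config space offset 0x4) */
#define PCI_COMMAND_MEMORY 0x2 /* respond to memory space accesses */
#define PCI_COMMAND_MASTER 0x4 /* allow bus mastering */

/* Hypothetical helper mirroring the fix: whatever command word was
 * saved at boot, force memory decoding and bus mastering on when
 * restoring the bridge after an EEH reset. */
static uint32_t eeh_restore_command(uint32_t saved)
{
	return saved | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER;
}
```

Because the saved value was captured before MMIO was enabled, the OR guarantees the two bits come back on even if they were clear in the snapshot, while any other saved bits are preserved.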
Re: [RFC PATCH 4/5] KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest entry
On Wed, 11 Apr 2018 11:32:12 +1000 Benjamin Herrenschmidt wrote:
> On Tue, 2018-04-10 at 22:48 +1000, Nicholas Piggin wrote:
>>
>> +	/*
>> +	 * Do we need to flush the TLB for the LPAR? (see TLB comment above)
>> +	 * On POWER9, individual threads can come in here, but the
>> +	 * TLB is shared between the 4 threads in a core, hence
>> +	 * invalidating on one thread invalidates for all.
>> +	 * Thus we make all 4 threads use the same bit here.
>> +	 */
>
> This might be true of the P9 implementation but isn't architecturally
> correct. From an ISA perspective, the threads could have dedicated
> tagged TLB entries. Do we need to be careful here vs. backward
> compatibility ?

I think so. I noticed that, just trying to do a like-for-like replacement with this patch. Yes, it should have a feature bit test for this optimization IMO. That can be expanded if other CPUs have the same ability...

Is it even a worthwhile optimisation to do at this point, I wonder? I didn't see it being hit a lot in traces.

> Also this won't flush ERAT entries for another thread afaik.

Yeah, I'm still not entirely clear exactly when ERATs get invalidated. I would like to see more commentary here to show why it's okay.

>> +	tmp = pcpu;
>> +	if (cpu_has_feature(CPU_FTR_ARCH_300))
>> +		tmp &= ~0x3UL;
>> +	if (cpumask_test_cpu(tmp, &vc->kvm->arch.need_tlb_flush)) {
>> +		if (kvm_is_radix(vc->kvm))
>> +			radix__local_flush_tlb_lpid(vc->kvm->arch.lpid);
>> +		else
>> +			hash__local_flush_tlb_lpid(vc->kvm->arch.lpid);
>> +		/* Clear the bit after the TLB flush */
>> +		cpumask_clear_cpu(tmp, &vc->kvm->arch.need_tlb_flush);
>> +	}
>> +
Re: [RFC PATCH 4/5] KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest entry
On Tue, 2018-04-10 at 22:48 +1000, Nicholas Piggin wrote:
>
> +	/*
> +	 * Do we need to flush the TLB for the LPAR? (see TLB comment above)
> +	 * On POWER9, individual threads can come in here, but the
> +	 * TLB is shared between the 4 threads in a core, hence
> +	 * invalidating on one thread invalidates for all.
> +	 * Thus we make all 4 threads use the same bit here.
> +	 */

This might be true of the P9 implementation but isn't architecturally correct. From an ISA perspective, the threads could have dedicated tagged TLB entries. Do we need to be careful here vs. backward compatibility ?

Also this won't flush ERAT entries for another thread afaik.

> +	tmp = pcpu;
> +	if (cpu_has_feature(CPU_FTR_ARCH_300))
> +		tmp &= ~0x3UL;
> +	if (cpumask_test_cpu(tmp, &vc->kvm->arch.need_tlb_flush)) {
> +		if (kvm_is_radix(vc->kvm))
> +			radix__local_flush_tlb_lpid(vc->kvm->arch.lpid);
> +		else
> +			hash__local_flush_tlb_lpid(vc->kvm->arch.lpid);
> +		/* Clear the bit after the TLB flush */
> +		cpumask_clear_cpu(tmp, &vc->kvm->arch.need_tlb_flush);
> +	}
> +
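The `tmp &= ~0x3UL` in the hunk quoted above is what maps all four SMT threads of a core onto one flush bit. A toy sketch of that mapping — the 4-threads-per-core layout (POWER9 SMT4, threads numbered contiguously) and the "shared TLB" flag are exactly the assumptions being debated in this thread:

```c
#include <assert.h>

/* Illustrative only: pick which bit in the need_tlb_flush mask a given
 * hardware thread should test and clear. If the TLB is shared
 * core-wide, clearing the low two bits of the CPU number makes all
 * four siblings use the first thread's bit, so a flush by any one of
 * them covers the core; otherwise each thread keeps its own bit. */
static unsigned long tlb_flush_bit(unsigned long pcpu, int core_shared_tlb)
{
	if (core_shared_tlb)
		return pcpu & ~0x3UL; /* first thread of the 4-thread core */
	return pcpu;                  /* per-thread TLB: own bit */
}
```

This is why a feature-bit test matters: on a hypothetical implementation with per-thread TLBs, collapsing the bits would skip flushes that are still needed.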
[PATCH] ibmvnic: Define vnic_login_client_data name field as unsized array
The "name" field of struct vnic_login_client_data is a char array of undefined length. This should be written as "char name[]" so the compiler can make better decisions about the field (for example, not assuming it's a single character). This was noticed while trying to tighten the CONFIG_FORTIFY_SOURCE checking. Signed-off-by: Kees Cook--- drivers/net/ethernet/ibm/ibmvnic.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index aad5658d79d5..35fbb41cd2d4 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -3170,7 +3170,7 @@ static int send_version_xchg(struct ibmvnic_adapter *adapter) struct vnic_login_client_data { u8 type; __be16 len; - charname; + charname[]; } __packed; static int vnic_client_data_len(struct ibmvnic_adapter *adapter) @@ -3199,21 +3199,21 @@ static void vnic_add_client_data(struct ibmvnic_adapter *adapter, vlcd->type = 1; len = strlen(os_name) + 1; vlcd->len = cpu_to_be16(len); - strncpy(>name, os_name, len); - vlcd = (struct vnic_login_client_data *)((char *)>name + len); + strncpy(vlcd->name, os_name, len); + vlcd = (struct vnic_login_client_data *)(vlcd->name + len); /* Type 2 - LPAR name */ vlcd->type = 2; len = strlen(utsname()->nodename) + 1; vlcd->len = cpu_to_be16(len); - strncpy(>name, utsname()->nodename, len); - vlcd = (struct vnic_login_client_data *)((char *)>name + len); + strncpy(vlcd->name, utsname()->nodename, len); + vlcd = (struct vnic_login_client_data *)(vlcd->name + len); /* Type 3 - device name */ vlcd->type = 3; len = strlen(adapter->netdev->name) + 1; vlcd->len = cpu_to_be16(len); - strncpy(>name, adapter->netdev->name, len); + strncpy(vlcd->name, adapter->netdev->name, len); } static int send_login(struct ibmvnic_adapter *adapter) -- 2.7.4 -- Kees Cook Pixel Security
Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
On Tue, 10 Apr 2018, Laurent Dufour wrote:
>> On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
>>> arch/powerpc/include/asm/pte-common.h | 3 ---
>>> arch/riscv/Kconfig | 1 +
>>> arch/s390/Kconfig | 1 +
>>
>> You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
>> arch/riscv/include/asm/pgtable-bits.h
>
> Damned !
> Thanks for catching it.
> Squashing the two patches together at least allowed it to be caught easily.

After it's fixed, feel free to add

Acked-by: David Rientjes

Thanks for doing this!
Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
On Tue, 10 Apr 2018 09:09:32 PDT (-0700), wi...@infradead.org wrote:
> On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
>> arch/powerpc/include/asm/pte-common.h | 3 ---
>> arch/riscv/Kconfig | 1 +
>> arch/s390/Kconfig | 1 +
>
> You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
> arch/riscv/include/asm/pgtable-bits.h

Thanks -- I was looking for that but couldn't find it and assumed I'd just misunderstood something.
Re: [PATCH, RESEND, pci, v2] pci: Delete PCI disabling informational messages
Bjorn,

On 04/10/2018 04:55 PM, Bjorn Helgaas wrote:
> On Tue, Apr 10, 2018 at 02:36:31PM -0500, Bjorn Helgaas wrote:
>> On Wed, Apr 04, 2018 at 12:10:35PM -0300, Desnes A. Nunes do Rosario wrote:
>>> The disabling informational messages on the PCI subsystem should be deleted
>>> since they do not represent any real value for the system logs.
>>>
>>> These messages are either not presented, or presented for all PCI devices
>>> (e.g., powerpc now realigns all PCI devices to its page size). Thus, they
>>> are flooding system logs and can be interpreted as a false positive for
>>> total PCI failure on the system.
>>>
>>> [root@system user]# dmesg | grep -i disabling
>>> [0.692270] pci 0000:00:00.0: Disabling memory decoding and releasing memory resources
>>> [0.692324] pci 0000:00:00.0: disabling bridge mem windows
>>> [0.729134] pci 0001:00:00.0: Disabling memory decoding and releasing memory resources
>>> [0.737352] pci 0001:00:00.0: disabling bridge mem windows
>>> [0.776295] pci 0002:00:00.0: Disabling memory decoding and releasing memory resources
>>> [0.784509] pci 0002:00:00.0: disabling bridge mem windows
>>> ... and goes on for all PCI devices on the system ...
>>>
>>> Fixes: 38274637699 ("powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned")
>>> Signed-off-by: Desnes A. Nunes do Rosario
>>
>> Applied to pci/resource for v4.18, thanks!
>>
>> I should have gotten this in for v4.17, but I didn't; sorry about that.
>
> This is trivial and I'm planning to squeeze a few more things into v4.17,
> so I moved this to my "for-linus" branch for v4.17.

No need for apologies. On the contrary, thank you very much for your review and branch change.
---
 drivers/pci/pci.c       | 1 -
 drivers/pci/setup-res.c | 2 --
 2 files changed, 3 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8c71d1a66cdd..1563ce1ee091 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5505,7 +5505,6 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
 		return;
 	}

-	pci_info(dev, "Disabling memory decoding and releasing memory resources\n");
 	pci_read_config_word(dev, PCI_COMMAND, &command);
 	command &= ~PCI_COMMAND_MEMORY;
 	pci_write_config_word(dev, PCI_COMMAND, command);
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index 369d48d6c6f1..6bd35e8e7cde 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -172,8 +172,6 @@ EXPORT_SYMBOL(pci_claim_resource);

 void pci_disable_bridge_window(struct pci_dev *dev)
 {
-	pci_info(dev, "disabling bridge mem windows\n");
-
 	/* MMIO Base/Limit */
 	pci_write_config_dword(dev, PCI_MEMORY_BASE, 0xfff0);
-- 
2.14.3

-- 
Desnes A. Nunes do Rosario
Linux Developer - IBM
Re: [PATCH 5/5] powerpc:dts:pm: add power management node
On Wed, Mar 28, 2018 at 8:31 PM, Ran Wang wrote:
> Enable Power Management feature on device tree, including MPC8536,
> MPC8544, MPC8548, MPC8572, P1010, P1020, P1021, P1022, P2020, P2041,
> P3041, T104X, T1024.

There are no device tree bindings documented for the properties and compatible strings used in the patch. Please update the binding documents first before adding them into the device tree.

>
> Signed-off-by: Zhao Chenhui
> Signed-off-by: Ran Wang
> ---
>  arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi | 14 ++-
>  arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi |  2 +
>  arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi |  2 +
>  arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi |  2 +
>  arch/powerpc/boot/dts/fsl/p1010si-post.dtsi   |  8
>  arch/powerpc/boot/dts/fsl/p1020si-post.dtsi   |  5 +++
>  arch/powerpc/boot/dts/fsl/p1021si-post.dtsi   |  5 +++
>  arch/powerpc/boot/dts/fsl/p1022si-post.dtsi   |  9 +++--
>  arch/powerpc/boot/dts/fsl/p2020si-post.dtsi   | 14 +++
>  arch/powerpc/boot/dts/fsl/pq3-power.dtsi      | 48 +
>  arch/powerpc/boot/dts/fsl/t1024rdb.dts        |  2 +-
>  arch/powerpc/boot/dts/fsl/t1040rdb.dts        |  2 +-
>  arch/powerpc/boot/dts/fsl/t1042rdb.dts        |  2 +-
>  arch/powerpc/boot/dts/fsl/t1042rdb_pi.dts     |  2 +-
>  14 files changed, 108 insertions(+), 9 deletions(-)
>  create mode 100644 arch/powerpc/boot/dts/fsl/pq3-power.dtsi
>
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi b/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi
> index 4193570..fba40a1 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi
> @@ -199,6 +199,10 @@
>
>  /include/ "pq3-dma-0.dtsi"
>  /include/ "pq3-etsec1-0.dtsi"
> +	enet0: ethernet@24000 {
> +		fsl,wake-on-filer;
> +		fsl,pmc-handle = <_clk>;
> +	};
>  /include/ "pq3-etsec1-timer-0.dtsi"
>
>  	usb@22000 {
> @@ -222,9 +226,10 @@
>  	};
>
>  /include/ "pq3-etsec1-2.dtsi"
> -
> -	ethernet@26000 {
> +	enet2: ethernet@26000 {
>  		cell-index = <1>;
> +		fsl,wake-on-filer;
> +		fsl,pmc-handle = <_clk>;
>  	};
>
>  	usb@2b000 {
> @@ -249,4 +254,9 @@
>  		reg = <0xe 0x1000>;
>  		fsl,has-rstcr;
>  	};
> +
> +/include/ "pq3-power.dtsi"
> +	power@e0070 {
> +		compatible = "fsl,mpc8536-pmc", "fsl,mpc8548-pmc";
> +	};
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi b/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi
> index b68eb11..ea7416a 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi
> @@ -188,4 +188,6 @@
>  		reg = <0xe 0x1000>;
>  		fsl,has-rstcr;
>  	};
> +
> +/include/ "pq3-power.dtsi"
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi b/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi
> index 579d76c..dddb737 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi
> @@ -156,4 +156,6 @@
>  		reg = <0xe 0x1000>;
>  		fsl,has-rstcr;
>  	};
> +
> +/include/ "pq3-power.dtsi"
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi b/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
> index 49294cf..40a6cff 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
> @@ -193,4 +193,6 @@
>  		reg = <0xe 0x1000>;
>  		fsl,has-rstcr;
>  	};
> +
> +/include/ "pq3-power.dtsi"
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> index 1b4aafc..47b62a8 100644
> --- a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> @@ -173,6 +173,8 @@
>
>  /include/ "pq3-etsec2-0.dtsi"
>  	enet0: ethernet@b {
> +		fsl,pmc-handle = <_clk>;
> +
>  		queue-group@b {
>  			fsl,rx-bit-map = <0xff>;
>  			fsl,tx-bit-map = <0xff>;
> @@ -181,6 +183,8 @@
>
>  /include/ "pq3-etsec2-1.dtsi"
>  	enet1: ethernet@b1000 {
> +		fsl,pmc-handle = <_clk>;
> +
>  		queue-group@b1000 {
>  			fsl,rx-bit-map = <0xff>;
>  			fsl,tx-bit-map = <0xff>;
> @@ -189,6 +193,8 @@
>
>  /include/ "pq3-etsec2-2.dtsi"
>  	enet2: ethernet@b2000 {
> +		fsl,pmc-handle = <_clk>;
> +
>  		queue-group@b2000 {
>  			fsl,rx-bit-map = <0xff>;
>  			fsl,tx-bit-map = <0xff>;
> @@ -201,4 +207,6 @@
>  		reg = <0xe 0x1000>;
>  		fsl,has-rstcr;
>  	};
> +
>
Re: [PATCH, RESEND, pci, v2] pci: Delete PCI disabling informational messages
On Tue, Apr 10, 2018 at 02:36:31PM -0500, Bjorn Helgaas wrote:
> On Wed, Apr 04, 2018 at 12:10:35PM -0300, Desnes A. Nunes do Rosario wrote:
> > The disabling informational messages on the PCI subsystem should be deleted
> > since they do not represent any real value for the system logs.
> >
> > These messages are either not presented, or presented for all PCI devices
> > (e.g., powerpc now realigns all PCI devices to its page size). Thus, they
> > are flooding system logs and can be interpreted as a false positive for
> > total PCI failure on the system.
> >
> > [root@system user]# dmesg | grep -i disabling
> > [0.692270] pci 0000:00:00.0: Disabling memory decoding and releasing memory resources
> > [0.692324] pci 0000:00:00.0: disabling bridge mem windows
> > [0.729134] pci 0001:00:00.0: Disabling memory decoding and releasing memory resources
> > [0.737352] pci 0001:00:00.0: disabling bridge mem windows
> > [0.776295] pci 0002:00:00.0: Disabling memory decoding and releasing memory resources
> > [0.784509] pci 0002:00:00.0: disabling bridge mem windows
> > ... and goes on for all PCI devices on the system ...
> >
> > Fixes: 38274637699 ("powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned")
> > Signed-off-by: Desnes A. Nunes do Rosario
>
> Applied to pci/resource for v4.18, thanks!
>
> I should have gotten this in for v4.17, but I didn't; sorry about that.

This is trivial and I'm planning to squeeze a few more things into v4.17, so I moved this to my "for-linus" branch for v4.17.
> > ---
> >  drivers/pci/pci.c       | 1 -
> >  drivers/pci/setup-res.c | 2 --
> >  2 files changed, 3 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 8c71d1a66cdd..1563ce1ee091 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5505,7 +5505,6 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
> >  		return;
> >  	}
> >
> > -	pci_info(dev, "Disabling memory decoding and releasing memory resources\n");
> >  	pci_read_config_word(dev, PCI_COMMAND, &command);
> >  	command &= ~PCI_COMMAND_MEMORY;
> >  	pci_write_config_word(dev, PCI_COMMAND, command);
> > diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
> > index 369d48d6c6f1..6bd35e8e7cde 100644
> > --- a/drivers/pci/setup-res.c
> > +++ b/drivers/pci/setup-res.c
> > @@ -172,8 +172,6 @@ EXPORT_SYMBOL(pci_claim_resource);
> >
> >  void pci_disable_bridge_window(struct pci_dev *dev)
> >  {
> > -	pci_info(dev, "disabling bridge mem windows\n");
> > -
> >  	/* MMIO Base/Limit */
> >  	pci_write_config_dword(dev, PCI_MEMORY_BASE, 0xfff0);
> >
> > --
> > 2.14.3
> >
Re: [PATCH, RESEND, pci, v2] pci: Delete PCI disabling informational messages
On Wed, Apr 04, 2018 at 12:10:35PM -0300, Desnes A. Nunes do Rosario wrote:
> The disabling informational messages on the PCI subsystem should be deleted
> since they do not represent any real value for the system logs.
>
> These messages are either not presented, or presented for all PCI devices
> (e.g., powerpc now realigns all PCI devices to its page size). Thus, they
> are flooding system logs and can be interpreted as a false positive for
> total PCI failure on the system.
>
> [root@system user]# dmesg | grep -i disabling
> [0.692270] pci 0000:00:00.0: Disabling memory decoding and releasing memory resources
> [0.692324] pci 0000:00:00.0: disabling bridge mem windows
> [0.729134] pci 0001:00:00.0: Disabling memory decoding and releasing memory resources
> [0.737352] pci 0001:00:00.0: disabling bridge mem windows
> [0.776295] pci 0002:00:00.0: Disabling memory decoding and releasing memory resources
> [0.784509] pci 0002:00:00.0: disabling bridge mem windows
> ... and goes on for all PCI devices on the system ...
>
> Fixes: 38274637699 ("powerpc/powernv: Override pcibios_default_alignment() to force PCI devices to be page aligned")
> Signed-off-by: Desnes A. Nunes do Rosario

Applied to pci/resource for v4.18, thanks!

I should have gotten this in for v4.17, but I didn't; sorry about that.
> ---
>  drivers/pci/pci.c       | 1 -
>  drivers/pci/setup-res.c | 2 --
>  2 files changed, 3 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8c71d1a66cdd..1563ce1ee091 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5505,7 +5505,6 @@ void pci_reassigndev_resource_alignment(struct pci_dev *dev)
>  		return;
>  	}
>
> -	pci_info(dev, "Disabling memory decoding and releasing memory resources\n");
>  	pci_read_config_word(dev, PCI_COMMAND, &command);
>  	command &= ~PCI_COMMAND_MEMORY;
>  	pci_write_config_word(dev, PCI_COMMAND, command);
> diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
> index 369d48d6c6f1..6bd35e8e7cde 100644
> --- a/drivers/pci/setup-res.c
> +++ b/drivers/pci/setup-res.c
> @@ -172,8 +172,6 @@ EXPORT_SYMBOL(pci_claim_resource);
>
>  void pci_disable_bridge_window(struct pci_dev *dev)
> {
> -	pci_info(dev, "disabling bridge mem windows\n");
> -
>  	/* MMIO Base/Limit */
>  	pci_write_config_dword(dev, PCI_MEMORY_BASE, 0xfff0);
>
> --
> 2.14.3
>
Re: [PATCH v3] powerpc/64: Fix section mismatch warnings for early boot symbols
On 04/09/2018 11:51 PM, Michael Ellerman wrote:
> Thanks for picking this one up.
>
> I hate to be a pain ... but before we merge this and proliferate these
> names, I'd like to change the names of some of these early asm
> functions. They're terribly named due to historical reasons.

Indeed :) No worries.

> I haven't actually thought of good names yet though :)
>
> I'll try and come up with some and post a patch doing the renames.

Alright. Could you please copy me on that, and I can post an update.

cheers,
Mauricio
Re: [PATCH v2 2/2] mm: remove odd HAVE_PTE_SPECIAL
On 10/04/2018 17:58, Robin Murphy wrote:
> On 10/04/18 16:25, Laurent Dufour wrote:
>> Remove the additional define HAVE_PTE_SPECIAL and rely directly on
>> CONFIG_ARCH_HAS_PTE_SPECIAL.
>>
>> There is no functional change introduced by this patch.
>>
>> Signed-off-by: Laurent Dufour
>> ---
>>  mm/memory.c | 23 ++-
>>  1 file changed, 10 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 96910c625daa..53b6344a90d2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -817,19 +817,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>   * PFNMAP mappings in order to support COWable mappings.
>>   *
>>   */
>> -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> -# define HAVE_PTE_SPECIAL 1
>> -#else
>> -# define HAVE_PTE_SPECIAL 0
>> -#endif
>>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>  			     pte_t pte, bool with_public_device)
>>  {
>>  	unsigned long pfn = pte_pfn(pte);
>>
>> -	if (HAVE_PTE_SPECIAL) {
>> -		if (likely(!pte_special(pte)))
>> -			goto check_pfn;
>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>
> Nit: Couldn't you use IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) within the
> existing code structure to avoid having to add these #ifdefs?

I agree, that would be better. I hadn't thought about this option. Thanks for reporting this.
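For reference, `IS_ENABLED()` works via a token-pasting trick so it can be used in ordinary C expressions, keeping both branches visible to the compiler instead of hiding one behind `#ifdef`. A minimal userspace re-implementation of the idea — the real macro lives in the kernel's `<linux/kconfig.h>`; this is only a sketch of the mechanism:

```c
#include <assert.h>

/* Minimal copy of the kernel's IS_ENABLED() machinery: if the config
 * macro is defined to 1, pasting produces __ARG_PLACEHOLDER_1, which
 * expands to "0," and shifts an extra argument in, so the helper
 * selects 1; otherwise the paste result is not a macro and the helper
 * selects the trailing 0. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

#define CONFIG_ARCH_HAS_PTE_SPECIAL 1 /* pretend the arch selects it */

static int pte_special_supported(void)
{
	/* A plain C conditional: the dead branch is still parsed and
	 * type-checked, then eliminated as dead code -- the advantage
	 * Robin's suggestion has over #ifdef. */
	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL))
		return 1;
	return 0;
}
```

Since `IS_ENABLED()` evaluates to the constants 0 or 1, the compiler folds the branch away, giving the same generated code as the `#ifdef` version without splitting the function body.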
Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
On 10/04/2018 18:09, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
>> arch/powerpc/include/asm/pte-common.h | 3 ---
>> arch/riscv/Kconfig | 1 +
>> arch/s390/Kconfig | 1 +
>
> You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
> arch/riscv/include/asm/pgtable-bits.h

Damned!

Thanks for catching it.
Re: [PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()
On 03/04/2018 01:57, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> When dealing with speculative page fault handler, we may race with VMA >> being split or merged. In this case the vma->vm_start and vm->vm_end >> fields may not match the address the page fault is occurring. >> >> This can only happens when the VMA is split but in that case, the >> anon_vma pointer of the new VMA will be the same as the original one, >> because in __split_vma the new->anon_vma is set to src->anon_vma when >> *new = *vma. >> >> So even if the VMA boundaries are not correct, the anon_vma pointer is >> still valid. >> >> If the VMA has been merged, then the VMA in which it has been merged >> must have the same anon_vma pointer otherwise the merge can't be done. >> >> So in all the case we know that the anon_vma is valid, since we have >> checked before starting the speculative page fault that the anon_vma >> pointer is valid for this VMA and since there is an anon_vma this >> means that at one time a page has been backed and that before the VMA >> is cleaned, the page table lock would have to be grab to clean the >> PTE, and the anon_vma field is checked once the PTE is locked. >> >> This patch introduce a new __page_add_new_anon_rmap() service which >> doesn't check for the VMA boundaries, and create a new inline one >> which do the check. >> >> When called from a page fault handler, if this is not a speculative one, >> there is a guarantee that vm_start and vm_end match the faulting address, >> so this check is useless. In the context of the speculative page fault >> handler, this check may be wrong but anon_vma is still valid as explained >> above. >> >> Signed-off-by: Laurent Dufour> > I'm indifferent on this: it could be argued both sides that the new > function and its variant for a simple VM_BUG_ON() isn't worth it and it > would should rather be done in the callers of page_add_new_anon_rmap(). 
> It feels like it would be better left to the caller and add a comment to > page_add_anon_rmap() itself in mm/rmap.c. Well, there are 11 callers of page_add_new_anon_rmap() which would need to be changed, and future ones too. By introducing __page_add_new_anon_rmap() my goal was to make it clear that this call is *special* and that calling it is not the usual way. This also implies that most of the time the check is done (when built with the right config) and that we will not miss any.
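The checked-wrapper split Laurent describes can be sketched in userspace C. All names here (`fake_vma`, `fake_page_add_new_anon_rmap`) are made up for illustration, and the kernel's VM_BUG_ON() is replaced by a boolean return so the behavior can be observed:

```c
#include <stdbool.h>

/* Sketch of the pattern discussed above: the raw __variant skips the
 * boundary check, while an inline wrapper keeps it for the ordinary
 * (non-speculative) fault paths. */
struct fake_vma { unsigned long vm_start, vm_end; };

static int pages_mapped;

/* Raw variant: no boundary check.  Meant for callers (like a speculative
 * fault path) that know the boundaries may be stale but anon_vma is valid. */
static void __fake_page_add_new_anon_rmap(struct fake_vma *vma, unsigned long addr)
{
    (void)vma; (void)addr;
    pages_mapped++;
}

/* Checked variant: verify the address lies inside the VMA first.
 * In the kernel this check would be a VM_BUG_ON() under a debug config. */
static inline bool fake_page_add_new_anon_rmap(struct fake_vma *vma, unsigned long addr)
{
    if (addr < vma->vm_start || addr >= vma->vm_end)
        return false;
    __fake_page_add_new_anon_rmap(vma, addr);
    return true;
}
```

The point of the naming convention is exactly what the reply argues: the double-underscore prefix flags the unchecked entry point as the exceptional path, while all ordinary callers keep the sanity check for free.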
Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On 03/04/2018 02:11, David Rientjes wrote: > On Tue, 13 Mar 2018, Laurent Dufour wrote: > >> This change is inspired by Peter's proposed patch [1] which was >> protecting the VMA using SRCU. Unfortunately, SRCU does not scale well in >> that particular case, and it introduces major performance degradation >> due to excessive scheduling operations. >> >> To allow access to the mm_rb tree without grabbing the mmap_sem, this patch >> protects access to it using a rwlock. As a lookup in the mm_rb tree is an O(log n) >> search, it is safe to protect it using such a lock. The VMA cache is not >> protected by the new rwlock and it should not be used without holding the >> mmap_sem. >> >> To allow the picked VMA structure to be used once the rwlock is released, a >> use count is added to the VMA structure. When the VMA is allocated it is >> set to 1. Each time the VMA is picked with the rwlock held its use count >> is incremented. Each time the VMA is released it is decremented. When the >> use count hits zero, this means that the VMA is no longer used and should be >> freed. >> >> This patch is preparing for two kinds of VMA access: >> - as usual, under the control of the mmap_sem, >> - without holding the mmap_sem, for the speculative page fault handler. >> >> Accesses done under the control of the mmap_sem don't require grabbing the >> rwlock to protect read access to the mm_rb tree, but write accesses must >> be done under the protection of the rwlock too. This affects inserting and >> removing elements in the RB tree. >> >> The patch introduces 2 new functions: >> - vma_get() to find a VMA based on an address by holding the new rwlock. >> - vma_put() to release the VMA when it is no longer used. >> These services are designed to be used when accesses are made to the RB tree >> without holding the mmap_sem.
>> >> When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and >> we rely on the WMB done when releasing the rwlock to serialize the write >> with the RMB done in a later patch to check for the VMA's validity. >> >> When free_vma is called, the file associated with the VMA is closed >> immediately, but the policy and the file structure remain in use until >> the VMA's use count reaches 0, which may happen later when exiting an >> in-progress speculative page fault. >> >> [1] https://patchwork.kernel.org/patch/5108281/ >> >> Cc: Peter Zijlstra (Intel) >> Cc: Matthew Wilcox >> Signed-off-by: Laurent Dufour > > Can __free_vma() be generalized for mm/nommu.c's delete_vma() and > do_mmap()? Good question! I guess if there is no MMU, there is no page fault, so no speculative page fault, and this patch is clearly required only by the speculative page fault handler. By the way, I should probably make CONFIG_SPECULATIVE_PAGE_FAULT depend on CONFIG_MMU. This being said, if your idea is to extend the mm_rb tree rwlocking to the nommu case, then this is another story, and I'm wondering if there is a real need in that case. But I have to admit I'm not so familiar with kernels built for MMU-less systems. Am I missing something? Thanks, Laurent.
Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote: > arch/powerpc/include/asm/pte-common.h | 3 --- > arch/riscv/Kconfig | 1 + > arch/s390/Kconfig | 1 + You forgot to delete __HAVE_ARCH_PTE_SPECIAL from arch/riscv/include/asm/pgtable-bits.h
Re: [PATCH v2 2/2] mm: remove odd HAVE_PTE_SPECIAL
On 10/04/18 16:25, Laurent Dufour wrote: Remove the additional define HAVE_PTE_SPECIAL and rely directly on CONFIG_ARCH_HAS_PTE_SPECIAL. There is no functional change introduced by this patch Signed-off-by: Laurent Dufour--- mm/memory.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 96910c625daa..53b6344a90d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -817,19 +817,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, * PFNMAP mappings in order to support COWable mappings. * */ -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL -# define HAVE_PTE_SPECIAL 1 -#else -# define HAVE_PTE_SPECIAL 0 -#endif struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte, bool with_public_device) { unsigned long pfn = pte_pfn(pte); - if (HAVE_PTE_SPECIAL) { - if (likely(!pte_special(pte))) - goto check_pfn; +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL Nit: Couldn't you use IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) within the existing code structure to avoid having to add these #ifdefs? Robin. + if (unlikely(pte_special(pte))) { if (vma->vm_ops && vma->vm_ops->find_special_page) return vma->vm_ops->find_special_page(vma, addr); if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) @@ -862,7 +856,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, return NULL; } - /* !HAVE_PTE_SPECIAL case follows: */ +#else /* CONFIG_ARCH_HAS_PTE_SPECIAL */ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { @@ -881,7 +875,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, if (is_zero_pfn(pfn)) return NULL; -check_pfn: +#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ + if (unlikely(pfn > highest_memmap_pfn)) { print_bad_pte(vma, addr, pte, NULL); return NULL; @@ -891,7 +886,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, * NOTE! We still have PageReserved() pages in the page tables. * eg. 
VDSO mappings can cause them to exist. */ -out: +out: __maybe_unused return pfn_to_page(pfn); } @@ -904,7 +899,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, /* * There is no pmd_special() but there may be special pmds, e.g. * in a direct-access (dax) mapping, so let's just replicate the -* !HAVE_PTE_SPECIAL case from vm_normal_page() here. +* !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here. */ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { @@ -1926,6 +1921,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, track_pfn_insert(vma, , pfn); +#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL /* * If we don't have pte special, then we have to use the pfn_valid() * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must* @@ -1933,7 +1929,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP * without pte special, it would there be refcounted as a normal page. */ - if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) { + if (!pfn_t_devmap(pfn) && pfn_t_valid(pfn)) { struct page *page; /* @@ -1944,6 +1940,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, page = pfn_to_page(pfn_t_to_pfn(pfn)); return insert_page(vma, addr, page, pgprot); } +#endif return insert_pfn(vma, addr, pfn, pgprot, mkwrite); }
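Robin's IS_ENABLED() suggestion works because boolean Kconfig options are defined to 1 in the generated autoconf header, and a preprocessor pasting trick turns "defined to 1" into a compile-time constant 1 or 0, so both branches stay visible to the compiler. A minimal re-derivation of that trick (modelled on the kernel's kconfig.h, with a pretend CONFIG symbol):

```c
/* Minimal reimplementation of the IS_ENABLED() machinery.  If `option`
 * is a macro defined to 1, __ARG_PLACEHOLDER_1 expands to "0," and the
 * second-argument selector picks 1; otherwise the junk token shifts the
 * arguments and the selector picks 0. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

#define CONFIG_ARCH_HAS_PTE_SPECIAL 1   /* pretend this arch selects it */

static int has_pte_special(void)
{
    /* Unlike #ifdef, the disabled branch is still parsed and
     * type-checked, then eliminated as dead code by the compiler. */
    if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL))
        return 1;
    return 0;
}
```

That compile-checking of the dead branch is the practical argument for IS_ENABLED() over #ifdef in vm_normal_page(): the !pte_special code would keep building even on architectures that select ARCH_HAS_PTE_SPECIAL.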
[PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL
Currently the PTE special support is turned on in per-architecture header files. Most of the time, it is defined in arch/*/include/asm/pgtable.h, depending or not on some other per-architecture static definition. This patch introduces a new configuration variable to manage this directly in the Kconfig files. It will later replace __HAVE_ARCH_PTE_SPECIAL. Here are notes for some architectures where the definition of __HAVE_ARCH_PTE_SPECIAL is not obvious: arm: __HAVE_ARCH_PTE_SPECIAL is currently defined in arch/arm/include/asm/pgtable-3level.h which is included by arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set. So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE. powerpc: __HAVE_ARCH_PTE_SPECIAL is defined in 2 files: - arch/powerpc/include/asm/book3s/64/pgtable.h - arch/powerpc/include/asm/pte-common.h The first one is included if (PPC_BOOK3S & PPC64) while the second is included in all the other cases. So select ARCH_HAS_PTE_SPECIAL all the time. sparc: __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) && defined(__arch64__), which are defined through the compiler in sparc/Makefile if !SPARC32, which I assume to be if SPARC64. So select ARCH_HAS_PTE_SPECIAL if SPARC64. There is no functional change introduced by this patch.
Suggested-by: Jerome Glisse Reviewed-by: Jerome Glisse Signed-off-by: Laurent Dufour --- Documentation/features/vm/pte_special/arch-support.txt | 2 +- arch/arc/Kconfig | 1 + arch/arc/include/asm/pgtable.h | 2 -- arch/arm/Kconfig | 1 + arch/arm/include/asm/pgtable-3level.h | 1 - arch/arm64/Kconfig | 1 + arch/arm64/include/asm/pgtable.h | 2 -- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 3 --- arch/powerpc/include/asm/pte-common.h | 3 --- arch/riscv/Kconfig | 1 + arch/s390/Kconfig | 1 + arch/s390/include/asm/pgtable.h | 1 - arch/sh/Kconfig | 1 + arch/sh/include/asm/pgtable.h | 2 -- arch/sparc/Kconfig | 1 + arch/sparc/include/asm/pgtable_64.h | 3 --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable_types.h | 1 - include/linux/pfn_t.h | 4 ++-- mm/Kconfig | 3 +++ mm/gup.c | 4 ++-- mm/memory.c | 2 +- 23 files changed, 18 insertions(+), 24 deletions(-) diff --git a/Documentation/features/vm/pte_special/arch-support.txt b/Documentation/features/vm/pte_special/arch-support.txt index 055004f467d2..cd05924ea875 100644 --- a/Documentation/features/vm/pte_special/arch-support.txt +++ b/Documentation/features/vm/pte_special/arch-support.txt @@ -1,6 +1,6 @@ # # Feature name: pte_special -# Kconfig: __HAVE_ARCH_PTE_SPECIAL +# Kconfig: ARCH_HAS_PTE_SPECIAL # description: arch supports the pte_special()/pte_mkspecial() VM APIs # --- diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index d76bf4a83740..8516e2b0239a 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -44,6 +44,7 @@ config ARC select HAVE_GENERIC_DMA_COHERENT select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZMA + select ARCH_HAS_PTE_SPECIAL config MIGHT_HAVE_PCI bool diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h index 08fe33830d4b..8ec5599a0957 100644 --- a/arch/arc/include/asm/pgtable.h +++ b/arch/arc/include/asm/pgtable.h @@ -320,8 +320,6 @@ PTE_BIT_FUNC(mkexec,|= (_PAGE_EXECUTE)); PTE_BIT_FUNC(mkspecial,|= (_PAGE_SPECIAL)); PTE_BIT_FUNC(mkhuge, |=
(_PAGE_HW_SZ)); -#define __HAVE_ARCH_PTE_SPECIAL - static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) { return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot)); diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index a7f8e7f4b88f..c088c851b235 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -8,6 +8,7 @@ config ARM select ARCH_HAS_DEVMEM_IS_ALLOWED select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_FORTIFY_SOURCE + select ARCH_HAS_PTE_SPECIAL if ARM_LPAE select ARCH_HAS_SET_MEMORY select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 2a4836087358..6d50a11d7793 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++
[PATCH v2 2/2] mm: remove odd HAVE_PTE_SPECIAL
Remove the additional define HAVE_PTE_SPECIAL and rely directly on CONFIG_ARCH_HAS_PTE_SPECIAL. There is no functional change introduced by this patch Signed-off-by: Laurent Dufour--- mm/memory.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 96910c625daa..53b6344a90d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -817,19 +817,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, * PFNMAP mappings in order to support COWable mappings. * */ -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL -# define HAVE_PTE_SPECIAL 1 -#else -# define HAVE_PTE_SPECIAL 0 -#endif struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte, bool with_public_device) { unsigned long pfn = pte_pfn(pte); - if (HAVE_PTE_SPECIAL) { - if (likely(!pte_special(pte))) - goto check_pfn; +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL + if (unlikely(pte_special(pte))) { if (vma->vm_ops && vma->vm_ops->find_special_page) return vma->vm_ops->find_special_page(vma, addr); if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) @@ -862,7 +856,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, return NULL; } - /* !HAVE_PTE_SPECIAL case follows: */ +#else /* CONFIG_ARCH_HAS_PTE_SPECIAL */ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { @@ -881,7 +875,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, if (is_zero_pfn(pfn)) return NULL; -check_pfn: +#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ + if (unlikely(pfn > highest_memmap_pfn)) { print_bad_pte(vma, addr, pte, NULL); return NULL; @@ -891,7 +886,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr, * NOTE! We still have PageReserved() pages in the page tables. * eg. VDSO mappings can cause them to exist. 
*/ -out: +out: __maybe_unused return pfn_to_page(pfn); } @@ -904,7 +899,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, /* * There is no pmd_special() but there may be special pmds, e.g. * in a direct-access (dax) mapping, so let's just replicate the -* !HAVE_PTE_SPECIAL case from vm_normal_page() here. +* !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here. */ if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { @@ -1926,6 +1921,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, track_pfn_insert(vma, &pgprot, pfn); +#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL /* * If we don't have pte special, then we have to use the pfn_valid() * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must* @@ -1933,7 +1929,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP * without pte special, it would there be refcounted as a normal page. */ - if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) { + if (!pfn_t_devmap(pfn) && pfn_t_valid(pfn)) { struct page *page; /* @@ -1944,6 +1940,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, page = pfn_to_page(pfn_t_to_pfn(pfn)); return insert_page(vma, addr, page, pgprot); } +#endif return insert_pfn(vma, addr, pfn, pgprot, mkwrite); } -- 2.7.4
[PATCH v2 0/2] move __HAVE_ARCH_PTE_SPECIAL in Kconfig
The per-architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the per-architecture header files. This doesn't allow other configuration options to depend on it. The first patch of this series replaces __HAVE_ARCH_PTE_SPECIAL with CONFIG_ARCH_HAS_PTE_SPECIAL defined in the Kconfig files, setting it automatically for architectures that were already setting it in a header file. The second patch removes the odd define HAVE_PTE_SPECIAL, which is a duplicate of CONFIG_ARCH_HAS_PTE_SPECIAL. There is no functional change introduced by this series. Laurent Dufour (2): mm: introduce ARCH_HAS_PTE_SPECIAL mm: remove odd HAVE_PTE_SPECIAL .../features/vm/pte_special/arch-support.txt | 2 +- arch/arc/Kconfig | 1 + arch/arc/include/asm/pgtable.h | 2 -- arch/arm/Kconfig | 1 + arch/arm/include/asm/pgtable-3level.h | 1 - arch/arm64/Kconfig | 1 + arch/arm64/include/asm/pgtable.h | 2 -- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 3 --- arch/powerpc/include/asm/pte-common.h | 3 --- arch/riscv/Kconfig | 1 + arch/s390/Kconfig | 1 + arch/s390/include/asm/pgtable.h | 1 - arch/sh/Kconfig | 1 + arch/sh/include/asm/pgtable.h | 2 -- arch/sparc/Kconfig | 1 + arch/sparc/include/asm/pgtable_64.h | 3 --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable_types.h | 1 - include/linux/pfn_t.h | 4 ++-- mm/Kconfig | 3 +++ mm/gup.c | 4 ++-- mm/memory.c | 23 ++ 23 files changed, 27 insertions(+), 36 deletions(-) -- 2.7.4
Re: [alsa-devel] [PATCH] ASoC: fsl_esai: Fix divisor calculation failure at lower ratio
Hi Nicolin, On Sun, Apr 8, 2018 at 8:57 PM, Nicolin Chen wrote: > When the desired ratio is less than 256, the savesub (tolerance) > in the calculation would become 0. This will then fail the loop- > search immediately without reporting any errors. > > But if the ratio is small enough, there is no need to calculate > the tolerance because the PM divisor alone is enough to get the ratio. > > So a simple fix could be just to set PM directly instead of going > into the loop-search. > > Reported-by: Marek Vasut > Signed-off-by: Nicolin Chen > Cc: Marek Vasut Thanks for the fix: Reviewed-by: Fabio Estevam
Re: [PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL
On 09/04/2018 22:08, David Rientjes wrote: > On Mon, 9 Apr 2018, Christoph Hellwig wrote: > >>> -#ifdef __HAVE_ARCH_PTE_SPECIAL >>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL >>> # define HAVE_PTE_SPECIAL 1 >>> #else >>> # define HAVE_PTE_SPECIAL 0 >> >> I'd say kill this odd indirection and just use the >> CONFIG_ARCH_HAS_PTE_SPECIAL symbol directly. >> >> > > Agree, and I think it would be easier to audit/review if patches 1 and 3 > were folded together to see the relationship between the newly added > selects and what #define's it is replacing. Otherwise, looks good! > Ok I will fold the 3 patches and introduce a new one removing HAVE_PTE_SPECIAL. Thanks, Laurent.
[PATCH v2 2/2] powerpc/fadump: Do not use hugepages when fadump is active
The FADump capture kernel boots in a restricted memory environment, preserving the context of the previous kernel to save the vmcore. Supporting hugepages in such an environment makes things unnecessarily complicated, as hugepages need memory set aside for them. This means most of the capture kernel's memory is used in supporting hugepages. In most cases, this results in out-of-memory issues while booting the FADump capture kernel. But hugepages are not of much use in the capture kernel, whose only job is to save the vmcore. So, disabling hugepage support when fadump is active is a reliable solution for the out-of-memory issues. Introduce a flag variable to disable HugeTLB support when fadump is active. Signed-off-by: Hari Bathini --- Changes in v2: * Introduce a hugetlb_disabled flag to enable/disable hugepage support & use that flag to disable hugepage support when fadump is active. arch/powerpc/include/asm/page.h |1 + arch/powerpc/kernel/fadump.c|8 arch/powerpc/mm/hash_utils_64.c |6 -- arch/powerpc/mm/hugetlbpage.c |7 +++ 4 files changed, 20 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h index 8da5d4c..40aee93 100644 --- a/arch/powerpc/include/asm/page.h +++ b/arch/powerpc/include/asm/page.h @@ -39,6 +39,7 @@ #ifndef __ASSEMBLY__ #ifdef CONFIG_HUGETLB_PAGE +extern bool hugetlb_disabled; extern unsigned int HPAGE_SHIFT; #else #define HPAGE_SHIFT PAGE_SHIFT diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index bea8d5f..8ceabef4 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -402,6 +402,14 @@ int __init fadump_reserve_mem(void) if (fw_dump.dump_active) { pr_info("Firmware-assisted dump is active.\n"); +#ifdef CONFIG_HUGETLB_PAGE + /* +* FADump capture kernel doesn't care much about hugepages. +* In fact, handling hugepages in capture kernel is asking for +* trouble. So, disable HugeTLB support when fadump is active.
+*/ + hugetlb_disabled = true; +#endif /* * If last boot has crashed then reserve all the memory * above boot_memory_size so that we don't touch it until diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index cf290d41..eab8f1d 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -571,8 +571,10 @@ static void __init htab_scan_page_sizes(void) } #ifdef CONFIG_HUGETLB_PAGE - /* Reserve 16G huge page memory sections for huge pages */ - of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL); + if (!hugetlb_disabled) { + /* Reserve 16G huge page memory sections for huge pages */ + of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL); + } #endif /* CONFIG_HUGETLB_PAGE */ } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 876da2b..18c080a 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -35,6 +35,8 @@ #define PAGE_SHIFT_16M 24 #define PAGE_SHIFT_16G 34 +bool hugetlb_disabled = false; + unsigned int HPAGE_SHIFT; EXPORT_SYMBOL(HPAGE_SHIFT); @@ -653,6 +655,11 @@ static int __init hugetlbpage_init(void) { int psize; + if (hugetlb_disabled) { + pr_info("HugeTLB support is disabled!\n"); + return 0; + } + #if !defined(CONFIG_PPC_FSL_BOOK3E) && !defined(CONFIG_PPC_8xx) if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE)) return -ENODEV;
[PATCH v2 1/2] powerpc/fadump: exclude memory holes while reserving memory in second kernel
From: Mahesh Salgaonkar The second kernel, during early boot after the crash, reserves the rest of the memory above boot memory size to make sure it does not touch any of the dump memory area. It uses memblock_reserve(), which reserves the specified memory region irrespective of memory holes present within that region. There are chances that the previous kernel would have hot-removed some of its memory, leaving memory holes behind. In such cases the fadump kernel reports an incorrect number of reserved pages through the arch_reserved_kernel_pages() hook, causing the kernel to hang or panic. Fix this by excluding memory holes while reserving the rest of the memory above boot memory size during the second kernel boot after a crash. Signed-off-by: Mahesh Salgaonkar Signed-off-by: Hari Bathini --- Changes in v2: * Split crash dump memory reservation into a separate function. arch/powerpc/kernel/fadump.c | 29 +++-- 1 file changed, 23 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index 3c2c268..bea8d5f 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -335,6 +335,26 @@ static unsigned long get_fadump_area_size(void) return size; } +static void __init fadump_reserve_crash_area(unsigned long base, +unsigned long size) +{ + struct memblock_region *reg; + unsigned long mstart, mend, msize; + + for_each_memblock(memory, reg) { + mstart = max_t(unsigned long, base, reg->base); + mend = reg->base + reg->size; + mend = min(base + size, mend); + + if (mstart < mend) { + msize = mend - mstart; + memblock_reserve(mstart, msize); + pr_info("Reserved %ldMB of memory at %#016lx for saving crash dump\n", + (msize >> 20), mstart); + } + } +} + int __init fadump_reserve_mem(void) { unsigned long base, size, memory_boundary; @@ -380,7 +400,8 @@ int __init fadump_reserve_mem(void) memory_boundary = memblock_end_of_DRAM(); if (fw_dump.dump_active) { - printk(KERN_INFO "Firmware-assisted dump is active.\n"); + pr_info("Firmware-assisted dump
is active.\n"); + /* * If last boot has crashed then reserve all the memory * above boot_memory_size so that we don't touch it until @@ -389,11 +410,7 @@ int __init fadump_reserve_mem(void) */ base = fw_dump.boot_memory_size; size = memory_boundary - base; - memblock_reserve(base, size); - printk(KERN_INFO "Reserved %ldMB of memory at %ldMB " - "for saving crash dump\n", - (unsigned long)(size >> 20), - (unsigned long)(base >> 20)); + fadump_reserve_crash_area(base, size); fw_dump.fadumphdr_addr = be64_to_cpu(fdm_active->rmr_region.destination_address) +
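The clamping logic in fadump_reserve_crash_area() above can be exercised in userspace. This is an illustrative stand-in: `struct region` replaces memblock regions, and the function collects the clamped sub-ranges instead of calling memblock_reserve():

```c
/* Sketch of the hole-aware reservation: walk the memory regions and
 * keep only the parts that intersect [base, base + size), so holes left
 * by hot-removed memory are skipped rather than blindly reserved. */
struct region { unsigned long base, size; };

/* Returns total bytes reserved; 'out' receives the clamped sub-regions. */
static unsigned long sketch_reserve_crash_area(const struct region *mem, int nr,
                                               unsigned long base, unsigned long size,
                                               struct region *out, int *nr_out)
{
    unsigned long total = 0;
    *nr_out = 0;
    for (int i = 0; i < nr; i++) {
        /* Clamp the region to the requested [base, base + size) window. */
        unsigned long mstart = mem[i].base > base ? mem[i].base : base;
        unsigned long mend = mem[i].base + mem[i].size;
        if (mend > base + size)
            mend = base + size;
        if (mstart < mend) {
            out[*nr_out].base = mstart;
            out[*nr_out].size = mend - mstart;
            (*nr_out)++;
            total += mend - mstart;
        }
    }
    return total;
}
```

With two regions separated by a hole, the total reserved is the sum of the intersections only, which is exactly the count arch_reserved_kernel_pages() needs to report.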
Re: [PATCH 2/3] powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops
On Tue, 10 Apr 2018 14:07:28 +0200 Alexandre Belloni wrote: > Hi Nicholas, > > I would greatly appreciate a changelog and at least the cover letter > because it is difficult to grasp how this relates to the previous > patches you sent to the RTC mailing list. Yes, good point. Basically this change is "standalone" except for using the OPAL_BUSY_DELAY_MS define from patch 1. That patch has a lot of comments about firmware delays I did not think would be too interesting. Basically we're adding msleep(10) here, because the firmware can repeatedly return OPAL_BUSY for long periods, so we want to context switch and respond to interrupts. > > On 10/04/2018 21:49:32+1000, Nicholas Piggin wrote: > > The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or > > OPAL_BUSY_EVENT from firmware, which causes large scheduling > > latencies; up to 50 seconds have been observed here when RTC stops > > responding (BMC reboot can do it). > > > > Fix this by converting it to the standard form OPAL_BUSY loop that > > sleeps. > > > > Fixes: ("powerpc/powernv: Add RTC and NVRAM support plus RTAS > > fallbacks") > > Cc: Benjamin Herrenschmidt > > Cc: linux-...@vger.kernel.org > > Signed-off-by: Nicholas Piggin > > --- > > arch/powerpc/platforms/powernv/opal-rtc.c | 8 +++-- > > drivers/rtc/rtc-opal.c| 37 ++- > From what I understand, the changes in those files are fairly > independent, they should probably be separated to ease merging. I'm happy to do that. It's using the same firmware call, so I thought a single patch would be fine. But I guess the boot call can be dropped from this patch because it does not solve the problem described in the changelog. Would you be happy for the driver change to be merged via the powerpc tree? The code being fixed here came from the same original patch as a similar issue being fixed in the OPAL NVRAM driver, so it might be easier that way. Thanks, Nick
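The "standard form OPAL_BUSY loop" Nick refers to can be sketched in userspace with a stubbed firmware call. The constants mirror the kernel's opal-api.h values, but the firmware stub and the `msleep`/`opal_poll_events` bodies here are mocks for illustration only:

```c
/* Sketch of the standard-form OPAL_BUSY retry loop: keep retrying while
 * firmware reports busy, sleeping (yielding the CPU) between attempts
 * instead of spinning, and polling events on OPAL_BUSY_EVENT. */
#define OPAL_SUCCESS        0
#define OPAL_BUSY          -2
#define OPAL_BUSY_EVENT    -12
#define OPAL_BUSY_DELAY_MS  10

static int busy_replies_left = 3;     /* stub: firmware is busy 3 times */
static int sleeps_taken;

static int opal_rtc_stub(void)        /* stands in for the real OPAL call */
{
    return busy_replies_left-- > 0 ? OPAL_BUSY : OPAL_SUCCESS;
}

static void msleep(int ms) { (void)ms; sleeps_taken++; }  /* mock sleep */
static void opal_poll_events(void) { }                    /* mock poll */

static int read_rtc(void)
{
    int rc = OPAL_BUSY;

    while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
        rc = opal_rtc_stub();
        if (rc == OPAL_BUSY_EVENT)
            opal_poll_events();
        else if (rc == OPAL_BUSY)
            msleep(OPAL_BUSY_DELAY_MS);   /* yield instead of spinning */
    }
    return rc;
}
```

The point of the msleep() is exactly the scheduling-latency issue described in the changelog: the caller context-switches away between retries rather than monopolizing the CPU while firmware stays busy.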
[RFC PATCH 5/5] KVM: PPC: Book3S HV: Radix do not clear partition scoped page table when page fault races with other vCPUs.
KVM with an SMP radix guest can get into storms of page faults and tlbies because the partition-scoped page tables are invalidated and TLB flushed if they were found to race with another page fault that set them up. This tends to make vCPUs pile up when several hit common addresses: page faults get serialized on common locks, each one invalidates the previous entry, and the delay before the new entry is installed is long enough that more CPUs hit page faults and invalidate that new entry in turn. There doesn't seem to be a need to invalidate in the case of an existing entry. This solves the tlbie storms. Signed-off-by: Nicholas Piggin --- arch/powerpc/kvm/book3s_64_mmu_radix.c | 39 +++--- 1 file changed, 22 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index dab6b622011c..4af177d24f6c 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -243,6 +243,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa, pmd = pmd_offset(pud, gpa); if (pmd_is_leaf(*pmd)) { unsigned long lgpa = gpa & PMD_MASK; + pte_t old_pte = *pmdp_ptep(pmd); /* * If we raced with another CPU which has just put @@ -252,18 +253,17 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa, ret = -EAGAIN; goto out_unlock; } - /* Valid 2MB page here already, remove it */ - old = kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd), - ~0UL, 0, lgpa, PMD_SHIFT); - kvmppc_radix_tlbie_page(kvm, lgpa, PMD_SHIFT); - if (old & _PAGE_DIRTY) { - unsigned long gfn = lgpa >> PAGE_SHIFT; - struct kvm_memory_slot *memslot; - memslot = gfn_to_memslot(kvm, gfn); - if (memslot && memslot->dirty_bitmap) - kvmppc_update_dirty_map(memslot, - gfn, PMD_SIZE); + WARN_ON_ONCE(pte_pfn(old_pte) != pte_pfn(pte)); + if (pte_val(old_pte) == pte_val(pte)) { + ret = -EAGAIN; + goto out_unlock; } + + /* Valid 2MB page here already, remove it */ +
kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd), + 0, pte_val(pte), lgpa, PMD_SHIFT); + ret = 0; + goto out_unlock; } else if (level == 1 && !pmd_none(*pmd)) { /* * There's a page table page here, but we wanted @@ -274,6 +274,8 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa, goto out_unlock; } if (level == 0) { + pte_t old_pte; + if (pmd_none(*pmd)) { if (!new_ptep) goto out_unlock; @@ -281,13 +283,16 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa, new_ptep = NULL; } ptep = pte_offset_kernel(pmd, gpa); - if (pte_present(*ptep)) { + old_pte = *ptep; + if (pte_present(old_pte)) { /* PTE was previously valid, so invalidate it */ - old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_PRESENT, - 0, gpa, 0); - kvmppc_radix_tlbie_page(kvm, gpa, 0); - if (old & _PAGE_DIRTY) - mark_page_dirty(kvm, gpa >> PAGE_SHIFT); + WARN_ON_ONCE(pte_pfn(old_pte) != pte_pfn(pte)); + if (pte_val(old_pte) == pte_val(pte)) { + ret = -EAGAIN; + goto out_unlock; + } + kvmppc_radix_update_pte(kvm, ptep, 0, + pte_val(pte), gpa, 0); } kvmppc_radix_set_pte_at(kvm, gpa, ptep, pte); } else { -- 2.17.0
[RFC PATCH 4/5] KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest entry
Move this flushing out of assembly and have it use the Linux TLB flush implementations introduced earlier. This allows powerpc:tlbie trace events to be used. Signed-off-by: Nicholas Piggin --- arch/powerpc/kvm/book3s_hv.c| 21 +++- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 43 + 2 files changed, 21 insertions(+), 43 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 81e2ea882d97..5d4783b5b47a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -2680,7 +2680,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc) int sub; bool thr0_done; unsigned long cmd_bit, stat_bit; - int pcpu, thr; + int pcpu, thr, tmp; int target_threads; int controlled_threads; int trap; @@ -2780,6 +2780,25 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc) return; } + /* +* Do we need to flush the TLB for the LPAR? (see TLB comment above) + * On POWER9, individual threads can come in here, but the + * TLB is shared between the 4 threads in a core, hence + * invalidating on one thread invalidates for all. + * Thus we make all 4 threads use the same bit here.
+ */ + tmp = pcpu; + if (cpu_has_feature(CPU_FTR_ARCH_300)) + tmp &= ~0x3UL; + if (cpumask_test_cpu(tmp, &vc->kvm->arch.need_tlb_flush)) { + if (kvm_is_radix(vc->kvm)) + radix__local_flush_tlb_lpid(vc->kvm->arch.lpid); + else + hash__local_flush_tlb_lpid(vc->kvm->arch.lpid); + /* Clear the bit after the TLB flush */ + cpumask_clear_cpu(tmp, &vc->kvm->arch.need_tlb_flush); + } + kvmppc_clear_host_core(pcpu); /* Decide on micro-threading (split-core) mode */ diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index bd63fa8a08b5..6a23a0f3ceea 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -647,49 +647,8 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300) mtspr SPRN_LPID,r7 isync - /* See if we need to flush the TLB */ - lhz r6,PACAPACAINDEX(r13) /* test_bit(cpu, need_tlb_flush) */ -BEGIN_FTR_SECTION - /* -* On POWER9, individual threads can come in here, but the -* TLB is shared between the 4 threads in a core, hence -* invalidating on one thread invalidates for all. -* Thus we make all 4 threads use the same bit here. -*/ - clrrdi r6,r6,2 -END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) - clrldi r7,r6,64-6 /* extract bit number (6 bits) */ - srdir6,r6,6 /* doubleword number */ - sldir6,r6,3 /* address offset */ - add r6,r6,r9 - addir6,r6,KVM_NEED_FLUSH/* dword in kvm->arch.need_tlb_flush */ - li r8,1 - sld r8,r8,r7 - ld r7,0(r6) - and.r7,r7,r8 - beq 22f - /* Flush the TLB of any entries for this LPID */ - lwz r0,KVM_TLB_SETS(r9) - mtctr r0 - li r7,0x800/* IS field = 0b10 */ - ptesync - li r0,0/* RS for P9 version of tlbiel */ - bne cr7, 29f -28:tlbiel r7 /* On P9, rs=0, RIC=0, PRS=0, R=0 */ - addir7,r7,0x1000 - bdnz28b - b 30f -29:PPC_TLBIEL(7,0,2,1,1) /* for radix, RIC=2, PRS=1, R=1 */ - addir7,r7,0x1000 - bdnz29b -30:ptesync -23:ldarx r7,0,r6 /* clear the bit after TLB flushed */ - andcr7,r7,r8 - stdcx.
r7,0,r6 - bne 23b - /* Add timebase offset onto timebase */ -22: ld r8,VCORE_TB_OFFSET(r5) + ld r8,VCORE_TB_OFFSET(r5) cmpdi r8,0 beq 37f mftb r6 /* current host timebase */ -- 2.17.0
[RFC PATCH 3/5] KVM: PPC: Book3S HV: kvmhv_p9_set_lpcr use Linux flush function
The existing flush uses the radix value for sets, and uses R=0 tlbiel instructions. This can't be quite right, but I'm not entirely sure if this is the right way to fix it. Signed-off-by: Nicholas Piggin--- arch/powerpc/kvm/book3s_hv_builtin.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 0b9b8e188bfa..577769fbfae9 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -676,7 +676,7 @@ static void wait_for_sync(struct kvm_split_mode *sip, int phase) void kvmhv_p9_set_lpcr(struct kvm_split_mode *sip) { - unsigned long rb, set; + struct kvm *kvm = local_paca->kvm_hstate.kvm_vcpu->kvm; /* wait for every other thread to get to real mode */ wait_for_sync(sip, PHASE_REALMODE); @@ -689,14 +689,10 @@ void kvmhv_p9_set_lpcr(struct kvm_split_mode *sip) /* Invalidate the TLB on thread 0 */ if (local_paca->kvm_hstate.tid == 0) { sip->do_set = 0; - asm volatile("ptesync" : : : "memory"); - for (set = 0; set < POWER9_TLB_SETS_RADIX; ++set) { - rb = TLBIEL_INVAL_SET_LPID + - (set << TLBIEL_INVAL_SET_SHIFT); - asm volatile(PPC_TLBIEL(%0, %1, 0, 0, 0) : : -"r" (rb), "r" (0)); - } - asm volatile("ptesync" : : : "memory"); + if (kvm_is_radix(kvm)) + radix__local_flush_tlb_lpid(kvm->arch.lpid); + else + hash__local_flush_tlb_lpid(kvm->arch.lpid); } /* indicate that we have done so and wait for others */ -- 2.17.0
[RFC PATCH 2/5] KVM: PPC: Book3S HV: kvmppc_radix_tlbie_page use Linux flush function
This has the advantage of consolidating TLB flush code in fewer places, and it also implements powerpc:tlbie trace events. 1GB pages should be handled without further modification. Signed-off-by: Nicholas Piggin--- arch/powerpc/kvm/book3s_64_mmu_radix.c | 26 +++--- 1 file changed, 7 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 81d5ad26f9a1..dab6b622011c 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -139,28 +139,16 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr, return 0; } -#ifdef CONFIG_PPC_64K_PAGES -#define MMU_BASE_PSIZE MMU_PAGE_64K -#else -#define MMU_BASE_PSIZE MMU_PAGE_4K -#endif - static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr, unsigned int pshift) { - int psize = MMU_BASE_PSIZE; - - if (pshift >= PMD_SHIFT) - psize = MMU_PAGE_2M; - addr &= ~0xfffUL; - addr |= mmu_psize_defs[psize].ap << 5; - asm volatile("ptesync": : :"memory"); - asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1) -: : "r" (addr), "r" (kvm->arch.lpid) : "memory"); - if (cpu_has_feature(CPU_FTR_P9_TLBIE_BUG)) - asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1) -: : "r" (addr), "r" (kvm->arch.lpid) : "memory"); - asm volatile("eieio ; tlbsync ; ptesync": : :"memory"); + unsigned long psize = PAGE_SIZE; + + if (pshift) + psize = 1UL << pshift; + + addr &= ~(psize - 1); + radix__flush_tlb_lpid_page(kvm->arch.lpid, addr, psize); } unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep, -- 2.17.0
[RFC PATCH 1/5] powerpc/64s/mm: Implement LPID based TLB flushes to be used by KVM
Implement local TLB flush for entire LPID, for hash and radix, and a global TLB flush for a partition scoped page in an LPID, for radix. These will be used by KVM in subsequent patches. Signed-off-by: Nicholas Piggin --- .../include/asm/book3s/64/tlbflush-hash.h | 2 + .../include/asm/book3s/64/tlbflush-radix.h| 5 ++ arch/powerpc/mm/hash_native_64.c | 8 ++ arch/powerpc/mm/tlb-radix.c | 87 +++ 4 files changed, 102 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h index 64d02a704bcb..8b328fd87722 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h @@ -53,6 +53,8 @@ static inline void arch_leave_lazy_mmu_mode(void) extern void hash__tlbiel_all(unsigned int action); +extern void hash__local_flush_tlb_lpid(unsigned int lpid); + extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize, unsigned long flags); extern void flush_hash_range(unsigned long number, int local); diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h index 19b45ba6caf9..2ddaadf3e9ea 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h @@ -51,4 +51,9 @@ extern void radix__flush_tlb_all(void); extern void radix__flush_tlb_pte_p9_dd1(unsigned long old_pte, struct mm_struct *mm, unsigned long address); +extern void radix__flush_tlb_lpid_page(unsigned int lpid, + unsigned long addr, + unsigned long page_size); +extern void radix__local_flush_tlb_lpid(unsigned int lpid); + #endif diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c index 1d049c78c82a..2f02cd780c19 100644 --- a/arch/powerpc/mm/hash_native_64.c +++ b/arch/powerpc/mm/hash_native_64.c @@ -294,6 +294,14 @@ static inline void tlbie(unsigned long vpn, int psize, int apsize, raw_spin_unlock(&native_tlbie_lock); } +void 
hash__local_flush_tlb_lpid(unsigned int lpid) +{ + VM_BUG_ON(mfspr(SPRN_LPID) != lpid); + + hash__tlbiel_all(TLB_INVAL_SCOPE_LPID); +} +EXPORT_SYMBOL_GPL(hash__local_flush_tlb_lpid); + static inline void native_lock_hpte(struct hash_pte *hptep) { unsigned long *word = (unsigned long *)&hptep->v; diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c index 2fba6170ab3f..f246fb0ac049 100644 --- a/arch/powerpc/mm/tlb-radix.c +++ b/arch/powerpc/mm/tlb-radix.c @@ -119,6 +119,22 @@ static inline void __tlbie_pid(unsigned long pid, unsigned long ric) trace_tlbie(0, 0, rb, rs, ric, prs, r); } +static inline void __tlbiel_lpid(unsigned long lpid, int set, + unsigned long ric) +{ + unsigned long rb,rs,prs,r; + + rb = PPC_BIT(52); /* IS = 2 */ + rb |= set << PPC_BITLSHIFT(51); + rs = 0; /* LPID comes from LPIDR */ + prs = 0; /* partition scoped */ + r = 1; /* radix format */ + + asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1) +: : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory"); + trace_tlbie(lpid, 1, rb, rs, ric, prs, r); +} + static inline void __tlbiel_va(unsigned long va, unsigned long pid, unsigned long ap, unsigned long ric) { @@ -151,6 +167,22 @@ static inline void __tlbie_va(unsigned long va, unsigned long pid, trace_tlbie(0, 0, rb, rs, ric, prs, r); } +static inline void __tlbie_lpid_va(unsigned long va, unsigned long lpid, + unsigned long ap, unsigned long ric) +{ + unsigned long rb,rs,prs,r; + + rb = va & ~(PPC_BITMASK(52, 63)); + rb |= ap << PPC_BITLSHIFT(58); + rs = lpid; + prs = 0; /* partition scoped */ + r = 1; /* radix format */ + + asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1) +: : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory"); + trace_tlbie(lpid, 0, rb, rs, ric, prs, r); +} + static inline void fixup_tlbie(void) { unsigned long pid = 0; @@ -215,6 +247,34 @@ static inline void _tlbie_pid(unsigned long pid, unsigned long ric) asm volatile("eieio; tlbsync; ptesync": : :"memory"); } +static inline void _tlbiel_lpid(unsigned long lpid, 
unsigned long ric) +{ + int set; + + VM_BUG_ON(mfspr(SPRN_LPID) != lpid); + + asm volatile("ptesync": : :"memory"); + + /* +* Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL, +* also flush the entire Page Walk Cache. +*/ + __tlbiel_lpid(lpid, 0, ric); + + /* For PWC, only one flush is needed */ + if (ric == RIC_FLUSH_PWC) { + asm
[RFC PATCH 0/5] KVM TLB flushing improvements
This series adds powerpc:tlbie tracepoints for radix partition scoped invalidations. After I started getting some traces on a 32 vCPU radix guest it showed a problem with partition scoped faults/invalidates, so I had a try at fixing it. This seems to be stable on radix so far (haven't tested hash yet). Thanks, Nick Nicholas Piggin (5): powerpc/64s/mm: Implement LPID based TLB flushes to be used by KVM KVM: PPC: Book3S HV: kvmppc_radix_tlbie_page use Linux flush function KVM: PPC: Book3S HV: kvmhv_p9_set_lpcr use Linux flush function KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest entry KVM: PPC: Book3S HV: Radix do not clear partition scoped page table when page fault races with other vCPUs. .../include/asm/book3s/64/tlbflush-hash.h | 2 + .../include/asm/book3s/64/tlbflush-radix.h| 5 ++ arch/powerpc/kvm/book3s_64_mmu_radix.c| 65 +++--- arch/powerpc/kvm/book3s_hv.c | 21 - arch/powerpc/kvm/book3s_hv_builtin.c | 14 ++- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 43 + arch/powerpc/mm/hash_native_64.c | 8 ++ arch/powerpc/mm/tlb-radix.c | 87 +++ 8 files changed, 157 insertions(+), 88 deletions(-) -- 2.17.0
Re: [PATCH 2/3] powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops
Hi Nicholas, I would greatly appreciate a changelog and at least the cover letter because it is difficult to grasp how this relates to the previous patches you sent to the RTC mailing list. On 10/04/2018 21:49:32+1000, Nicholas Piggin wrote: > The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or > OPAL_BUSY_EVENT from firmware, which causes large scheduling > latencies, up to 50 seconds have been observed here when RTC stops > responding (BMC reboot can do it). > > Fix this by converting it to the standard form OPAL_BUSY loop that > sleeps. > > Fixes 628daa8d5abfd ("powerpc/powernv: Add RTC and NVRAM support plus RTAS > fallbacks" > Cc: Benjamin Herrenschmidt> Cc: linux-...@vger.kernel.org > Signed-off-by: Nicholas Piggin > --- > arch/powerpc/platforms/powernv/opal-rtc.c | 8 +++-- > drivers/rtc/rtc-opal.c| 37 ++- >From what I understand, the changes in those files are fairly independent, they should probably be separated to ease merging. > 2 files changed, 28 insertions(+), 17 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/opal-rtc.c > b/arch/powerpc/platforms/powernv/opal-rtc.c > index f8868864f373..aa2a5139462e 100644 > --- a/arch/powerpc/platforms/powernv/opal-rtc.c > +++ b/arch/powerpc/platforms/powernv/opal-rtc.c > @@ -48,10 +48,12 @@ unsigned long __init opal_get_boot_time(void) > > while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) { > rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms); > - if (rc == OPAL_BUSY_EVENT) > + if (rc == OPAL_BUSY_EVENT) { > + mdelay(OPAL_BUSY_DELAY_MS); > opal_poll_events(NULL); > - else if (rc == OPAL_BUSY) > - mdelay(10); > + } else if (rc == OPAL_BUSY) { > + mdelay(OPAL_BUSY_DELAY_MS); > + } > } > if (rc != OPAL_SUCCESS) > return 0; > diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c > index 304e891e35fc..60f2250fd96b 100644 > --- a/drivers/rtc/rtc-opal.c > +++ b/drivers/rtc/rtc-opal.c > @@ -57,7 +57,7 @@ static void tm_to_opal(struct rtc_time *tm, u32 *y_m_d, u64 > *h_m_s_ms) > > static int 
opal_get_rtc_time(struct device *dev, struct rtc_time *tm) > { > - long rc = OPAL_BUSY; > + s64 rc = OPAL_BUSY; > int retries = 10; > u32 y_m_d; > u64 h_m_s_ms; > @@ -66,13 +66,17 @@ static int opal_get_rtc_time(struct device *dev, struct > rtc_time *tm) > > while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) { > rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms); > - if (rc == OPAL_BUSY_EVENT) > + if (rc == OPAL_BUSY_EVENT) { > + msleep(OPAL_BUSY_DELAY_MS); > opal_poll_events(NULL); > - else if (retries-- && (rc == OPAL_HARDWARE > -|| rc == OPAL_INTERNAL_ERROR)) > - msleep(10); > - else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT) > - break; > + } else if (rc == OPAL_BUSY) { > + msleep(OPAL_BUSY_DELAY_MS); > + } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) { > + if (retries--) { > + msleep(10); /* Wait 10ms before retry */ > + rc = OPAL_BUSY; /* go around again */ > + } > + } > } > > if (rc != OPAL_SUCCESS) > @@ -87,21 +91,26 @@ static int opal_get_rtc_time(struct device *dev, struct > rtc_time *tm) > > static int opal_set_rtc_time(struct device *dev, struct rtc_time *tm) > { > - long rc = OPAL_BUSY; > + s64 rc = OPAL_BUSY; > int retries = 10; > u32 y_m_d = 0; > u64 h_m_s_ms = 0; > > tm_to_opal(tm, _m_d, _m_s_ms); > + > while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) { > rc = opal_rtc_write(y_m_d, h_m_s_ms); > - if (rc == OPAL_BUSY_EVENT) > + if (rc == OPAL_BUSY_EVENT) { > + msleep(OPAL_BUSY_DELAY_MS); > opal_poll_events(NULL); > - else if (retries-- && (rc == OPAL_HARDWARE > -|| rc == OPAL_INTERNAL_ERROR)) > - msleep(10); > - else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT) > - break; > + } else if (rc == OPAL_BUSY) { > + msleep(OPAL_BUSY_DELAY_MS); > + } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) { > + if (retries--) { > + msleep(10); /* Wait 10ms before retry */ > + rc = OPAL_BUSY; /* go around again */ > + } > + } > } > > return rc == OPAL_SUCCESS ? 0 : -EIO; > -- > 2.17.0 > --
[PATCH 3/3] powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops
The OPAL NVRAM driver does not sleep in case it gets OPAL_BUSY or OPAL_BUSY_EVENT from firmware, which causes large scheduling latencies, and various lockup errors to trigger (again, BMC reboot can cause it). Fix this by converting it to the standard form OPAL_BUSY loop that sleeps. Fixes: 628daa8d5abfd ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks") Cc: Benjamin HerrenschmidtSigned-off-by: Nicholas Piggin --- arch/powerpc/platforms/powernv/opal-nvram.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/powernv/opal-nvram.c b/arch/powerpc/platforms/powernv/opal-nvram.c index ba2ff06a2c98..1bceb95f422d 100644 --- a/arch/powerpc/platforms/powernv/opal-nvram.c +++ b/arch/powerpc/platforms/powernv/opal-nvram.c @@ -11,6 +11,7 @@ #define DEBUG +#include #include #include #include @@ -56,8 +57,12 @@ static ssize_t opal_nvram_write(char *buf, size_t count, loff_t *index) while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) { rc = opal_write_nvram(__pa(buf), count, off); - if (rc == OPAL_BUSY_EVENT) + if (rc == OPAL_BUSY_EVENT) { + msleep(OPAL_BUSY_DELAY_MS); opal_poll_events(NULL); + } else if (rc == OPAL_BUSY) { + msleep(OPAL_BUSY_DELAY_MS); + } } if (rc) -- 2.17.0
[PATCH 0/3] Fix RTC and NVRAM OPAL_BUSY loops
This is a couple of important fixes broken out of the series "first step of standardising OPAL_BUSY handling", that prevents the kernel from locking up if the NVRAM or RTC hardware does not respond. Another one, the console driver, has a similar problem that has also been hit in testing, but that requires larger fixes to the opal console and hvc tty driver that won't make it for 4.17. Thanks, Nick Nicholas Piggin (3): powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops arch/powerpc/include/asm/opal.h | 3 ++ arch/powerpc/platforms/powernv/opal-nvram.c | 7 +++- arch/powerpc/platforms/powernv/opal-rtc.c | 8 +++-- drivers/rtc/rtc-opal.c | 37 + 4 files changed, 37 insertions(+), 18 deletions(-) -- 2.17.0
[PATCH 2/3] powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops
The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or OPAL_BUSY_EVENT from firmware, which causes large scheduling latencies; delays of up to 50 seconds have been observed here when the RTC stops responding (a BMC reboot can cause it). Fix this by converting it to the standard form OPAL_BUSY loop that sleeps. Fixes: 628daa8d5abfd ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks") Cc: Benjamin Herrenschmidt Cc: linux-...@vger.kernel.org Signed-off-by: Nicholas Piggin --- arch/powerpc/platforms/powernv/opal-rtc.c | 8 +++-- drivers/rtc/rtc-opal.c| 37 ++- 2 files changed, 28 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/platforms/powernv/opal-rtc.c b/arch/powerpc/platforms/powernv/opal-rtc.c index f8868864f373..aa2a5139462e 100644 --- a/arch/powerpc/platforms/powernv/opal-rtc.c +++ b/arch/powerpc/platforms/powernv/opal-rtc.c @@ -48,10 +48,12 @@ unsigned long __init opal_get_boot_time(void) while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) { rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms); - if (rc == OPAL_BUSY_EVENT) + if (rc == OPAL_BUSY_EVENT) { + mdelay(OPAL_BUSY_DELAY_MS); opal_poll_events(NULL); - else if (rc == OPAL_BUSY) - mdelay(10); + } else if (rc == OPAL_BUSY) { + mdelay(OPAL_BUSY_DELAY_MS); + } } if (rc != OPAL_SUCCESS) return 0; diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c index 304e891e35fc..60f2250fd96b 100644 --- a/drivers/rtc/rtc-opal.c +++ b/drivers/rtc/rtc-opal.c @@ -57,7 +57,7 @@ static void tm_to_opal(struct rtc_time *tm, u32 *y_m_d, u64 *h_m_s_ms) static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm) { - long rc = OPAL_BUSY; + s64 rc = OPAL_BUSY; int retries = 10; u32 y_m_d; u64 h_m_s_ms; @@ -66,13 +66,17 @@ static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm) while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) { rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms); - if (rc == OPAL_BUSY_EVENT) + if (rc == OPAL_BUSY_EVENT) { + msleep(OPAL_BUSY_DELAY_MS); opal_poll_events(NULL); - else if (retries-- && (rc 
== OPAL_HARDWARE - || rc == OPAL_INTERNAL_ERROR)) - msleep(10); - else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT) - break; + } else if (rc == OPAL_BUSY) { + msleep(OPAL_BUSY_DELAY_MS); + } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) { + if (retries--) { + msleep(10); /* Wait 10ms before retry */ + rc = OPAL_BUSY; /* go around again */ + } + } } if (rc != OPAL_SUCCESS) @@ -87,21 +91,26 @@ static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm) static int opal_set_rtc_time(struct device *dev, struct rtc_time *tm) { - long rc = OPAL_BUSY; + s64 rc = OPAL_BUSY; int retries = 10; u32 y_m_d = 0; u64 h_m_s_ms = 0; tm_to_opal(tm, &y_m_d, &h_m_s_ms); + while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) { rc = opal_rtc_write(y_m_d, h_m_s_ms); - if (rc == OPAL_BUSY_EVENT) + if (rc == OPAL_BUSY_EVENT) { + msleep(OPAL_BUSY_DELAY_MS); opal_poll_events(NULL); - else if (retries-- && (rc == OPAL_HARDWARE - || rc == OPAL_INTERNAL_ERROR)) - msleep(10); - else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT) - break; + } else if (rc == OPAL_BUSY) { + msleep(OPAL_BUSY_DELAY_MS); + } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) { + if (retries--) { + msleep(10); /* Wait 10ms before retry */ + rc = OPAL_BUSY; /* go around again */ + } + } } return rc == OPAL_SUCCESS ? 0 : -EIO; -- 2.17.0
[PATCH 1/3] powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops
This is the start of an effort to tidy up and standardise all the delays. Existing loops have a range of delay/sleep periods from 1ms to 20ms, and some have no delay. They all loop forever except rtc, which times out after 10 retries, and that uses 10ms delays. So use 10ms as our standard delay. The OPAL maintainer agrees 10ms is a reasonable starting point. The idea is to use the same recipe everywhere; once this is proven to work, it will be documented as an OPAL API standard. Then both firmware and OS can agree, and if a particular call needs something else, that can be documented with reasoning. This is not the end-all of this effort, it's just a relatively easy change that fixes some existing high-latency delays. There should be provision for standardising timeouts and/or interruptible loops where possible, so non-fatal firmware errors don't cause hangs. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/opal.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 7159e1a6a61a..03e1a920491e 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -21,6 +21,9 @@ /* We calculate number of sg entries based on PAGE_SIZE */ #define SG_ENTRIES_PER_NODE ((PAGE_SIZE - 16) / sizeof(struct opal_sg_entry)) +/* Default time to sleep or delay between OPAL_BUSY/OPAL_BUSY_EVENT loops */ +#define OPAL_BUSY_DELAY_MS 10 + /* /sys/firmware/opal */ extern struct kobject *opal_kobj; -- 2.17.0
Re: [PATCH 00/32] docs/vm: convert to ReST format
Jon, Andrew, How do you suggest to continue with this? On Sun, Apr 01, 2018 at 09:38:58AM +0300, Mike Rapoport wrote: > (added akpm) > > On Thu, Mar 29, 2018 at 03:46:07PM -0600, Jonathan Corbet wrote: > > On Wed, 21 Mar 2018 21:22:16 +0200 > > Mike Rapoportwrote: > > > > > These patches convert files in Documentation/vm to ReST format, add an > > > initial index and link it to the top level documentation. > > > > > > There are no contents changes in the documentation, except few spelling > > > fixes. The relatively large diffstat stems from the indentation and > > > paragraph wrapping changes. > > > > > > I've tried to keep the formatting as consistent as possible, but I could > > > miss some places that needed markup and add some markup where it was not > > > necessary. > > > > So I've been pondering on these for a bit. It looks like a reasonable and > > straightforward RST conversion, no real complaints there. But I do have a > > couple of concerns... > > > > One is that, as we move documentation into RST, I'm really trying to > > organize it a bit so that it is better tuned to the various audiences we > > have. For example, ksm.txt is going to be of interest to sysadmin types, > > who might want to tune it. mmu_notifier.txt is of interest to ... > > somebody, but probably nobody who is thinking in user space. And so on. > > > > So I would really like to see this material split up and put into the > > appropriate places in the RST hierarchy - admin-guide for administrative > > stuff, core-api for kernel development topics, etc. That, of course, > > could be done separately from the RST conversion, but I suspect I know > > what will (or will not) happen if we agree to defer that for now :) > > Well, I was actually planning on doing that ;-) > > My thinking was to start with mechanical RST conversion and then to start > working on the contents and ordering of the documentation. Some of the > existing files, e.g. 
ksm.txt, can be moved as is into the appropriate > places, others, like transhuge.txt should be at least split into admin/user > and developer guides. > > Another problem with many of the existing mm docs is that they are rather > developer notes and it wouldn't be really straight forward to assign them > to a particular topic. > > I believe that keeping the mm docs together will give better visibility of > what (little) mm documentation we have and will make the updates easier. > The documents that fit well into a certain topic could be linked there. For > instance: > > - > diff --git a/Documentation/admin-guide/index.rst > b/Documentation/admin-guide/index.rst > index 5bb9161..8f6c6e6 100644 > --- a/Documentation/admin-guide/index.rst > +++ b/Documentation/admin-guide/index.rst > @@ -63,6 +63,7 @@ configure specific aspects of kernel behavior to your > liking. > pm/index > thunderbolt > LSM/index > + vm/index > > .. only:: subproject and html > > diff --git a/Documentation/admin-guide/vm/index.rst > b/Documentation/admin-guide/vm/index.rst > new file mode 100644 > index 000..d86f1c8 > --- /dev/null > +++ b/Documentation/admin-guide/vm/index.rst > @@ -0,0 +1,5 @@ > +== > +Knobs and Buttons for Memory Management Tuning > +== > + > +* :ref:`ksm ` > - > > > The other is the inevitable merge conflicts that changing that many doc > > files will create. Sending the patches through Andrew could minimize > > that, I guess, or at least make it his problem. Alternatively, we could > > try to do it as an end-of-merge-window sort of thing. I can try to manage > > that, but an ack or two from the mm crowd would be nice to have. > > I can rebase on top of Andrew's tree if that would help to minimize the > merge conflicts. > > > Thanks, > > > > jon > > > > -- > Sincerely yours, > Mike. > -- Sincerely yours, Mike.
Re: Occasionally losing the tick_sched_timer
On Tue, 10 Apr 2018, Nicholas Piggin wrote: > On Tue, 10 Apr 2018 09:42:29 +0200 (CEST) > Thomas Gleixner wrote: > > > Thomas do you have any ideas on what we might look for, or if we can add > > > some BUG_ON()s to catch this at its source? > > > > Not really. Tracing might be a more efficient tool than random BUG_ONs. > Sure, we could try that. Any suggestions? timer events? timer, hrtimer and the tick-sched stuff should be a good start. And make sure to freeze the trace once you hit the fault case. tracing_off() is your friend. Thanks, tglx
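A configuration sketch of tglx's suggestion (paths assume debugfs mounted at /sys/kernel/debug; the trace-file name is arbitrary): the timer event subsystem covers the timer_*, hrtimer_* and tick_stop tracepoints he mentions.

```shell
cd /sys/kernel/debug/tracing
echo 1 > events/timer/enable   # timer_*, hrtimer_*, tick_stop events
echo 1 > tracing_on
# ... reproduce the lockup; freeze the buffer from the watchdog path
# with tracing_off(), or stop by hand once the detector fires:
echo 0 > tracing_on
cat trace > /tmp/lockup-trace.txt
```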
[PATCH] powerpc/8xx: Build fix with Hugetlbfs enabled
8xx uses the slice code when hugetlbfs is enabled. We missed a header include on 8xx, which resulted in the build failure below. config: mpc885_ads_defconfig + CONFIG_HUGETLBFS CC arch/powerpc/mm/slice.o arch/powerpc/mm/slice.c: In function 'slice_get_unmapped_area': arch/powerpc/mm/slice.c:655:2: error: implicit declaration of function 'need_extra_context' [-Werror=implicit-function-declaration] arch/powerpc/mm/slice.c:656:3: error: implicit declaration of function 'alloc_extended_context' [-Werror=implicit-function-declaration] cc1: all warnings being treated as errors make[1]: *** [arch/powerpc/mm/slice.o] Error 1 make: *** [arch/powerpc/mm] Error 2 On PPC64, mmu_context.h was included via linux/pkeys.h. CC: Christophe LEROY Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slice.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 9cd87d11fe4e..205fe557ca10 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -35,6 +35,7 @@ #include #include #include +#include static DEFINE_SPINLOCK(slice_convert_lock); -- 2.14.3
Re: Occasionally losing the tick_sched_timer
On Tue, 10 Apr 2018 09:42:29 +0200 (CEST) Thomas Gleixner wrote: > Nick, > > On Tue, 10 Apr 2018, Nicholas Piggin wrote: > > We are seeing rare hard lockup watchdog timeouts, a CPU seems to have no > > more timers scheduled, despite hard and soft lockup watchdogs should have > > their heart beat timers and probably many others. > > > > The reproducer we have is running a KVM workload. The lockup is in the > > host kernel, quite rare but we may be able to slowly test things. > > > > I have a sysrq+q snippet. CPU3 is the stuck one, you can see its tick has > > stopped for a long time and no hrtimer active. Included CPU4 for what the > > other CPUs look like. > > > > Thomas do you have any ideas on what we might look for, or if we can add > > some BUG_ON()s to catch this at its source? > > Not really. Tracing might be a more efficient tool than random BUG_ONs. Sure, we could try that. Any suggestions? timer events? > > > - CPU3 is sitting in its cpuidle loop (polling idle with all other idle > > states disabled). > > > > - `taskset -c 3 ls` basically revived the CPU and got timers running again. > > > > Which is not surprising because that kicks the CPU out of idle and starts > the tick timer again. Yep. > Does this restart the watchdog timers as well? I think so, but now you ask I'm not 100% sure we directly observed it. We can check that next time it locks up.
> > > .expires_next : 9223372036854775807 nsecs > > .hres_active: 1 > > .nr_events : 1446533 > > .nr_retries : 1434 > > .nr_hangs : 0 > > .max_hang_time : 0 > > .nohz_mode : 2 > > .last_tick : 1776312000 nsecs > > .tick_stopped : 1 > > .idle_jiffies : 4296713609 > > .idle_calls : 2573133 > > .idle_sleeps: 1957794 > > > .idle_waketime : 59550238131639 nsecs > > .idle_sleeptime : 17504617295679 nsecs > > .iowait_sleeptime: 719978688 nsecs > > .last_jiffies : 4296713608 > > So this was the last time when the CPU came out of idle: > > > .idle_exittime : 17763110009176 nsecs > > Here it went back into idle: > > > .idle_entrytime : 1776312625 nsecs > > And this was the next timer wheel timer due for expiry: > > > .next_timer : 1776313000 > > .idle_expires : 1776313000 nsecs > > which makes no sense whatsoever, but this might be stale information. Can't > tell. Wouldn't we expect to see that if there is a timer that was missed somehow because the tick_sched_timer was not set? > > > cpu: 4 > > clock 0: > > .base: 07d8226b > > .index: 0 > > .resolution: 1 nsecs > > .get_time: ktime_get > > .offset: 0 nsecs > > active timers: #0: , tick_sched_timer, S:01 > > # expires at 5955295000-5955295000 nsecs [in > > 2685654802 to 2685654802 nsecs] > > The tick timer is scheduled because the next timer wheel timer is due at > 5955295000, which might be the hard watchdog timer > > > #1: <9b4a3b88>, hrtimer_wakeup, S:01 > > # expires at 59602585423025-59602642458243 nsecs [in > > 52321077827 to 52378113045 nsecs] > > That might be the soft lockup hrtimer. > > I'd try to gather more information about the chain of events via tracing > and stop the tracer once the lockup detector hits. Okay will do, thanks for taking a look. Thanks, Nick
Re: Occasionally losing the tick_sched_timer
Nick, On Tue, 10 Apr 2018, Nicholas Piggin wrote: > We are seeing rare hard lockup watchdog timeouts, a CPU seems to have no > more timers scheduled, despite hard and soft lockup watchdogs should have > their heart beat timers and probably many others. > > The reproducer we have is running a KVM workload. The lockup is in the > host kernel, quite rare but we may be able to slowly test things. > > I have a sysrq+q snippet. CPU3 is the stuck one, you can see its tick has > stopped for a long time and no hrtimer active. Included CPU4 for what the > other CPUs look like. > > Thomas do you have any ideas on what we might look for, or if we can add > some BUG_ON()s to catch this at its source? Not really. Tracing might be a more efficient tool than random BUG_ONs. > - CPU3 is sitting in its cpuidle loop (polling idle with all other idle > states disabled). > > - `taskset -c 3 ls` basically revived the CPU and got timers running again. Which is not surprising because that kicks the CPU out of idle and starts the tick timer again. Does this restart the watchdog timers as well? > cpu: 3 > clock 0: > .base: df30f5ab > .index: 0 > .resolution: 1 nsecs > .get_time: ktime_get > .offset: 0 nsecs > active timers: So in theory the soft lockup watchdog hrtimer should be queued here. 
> .expires_next : 9223372036854775807 nsecs > .hres_active: 1 > .nr_events : 1446533 > .nr_retries : 1434 > .nr_hangs : 0 > .max_hang_time : 0 > .nohz_mode : 2 > .last_tick : 1776312000 nsecs > .tick_stopped : 1 > .idle_jiffies : 4296713609 > .idle_calls : 2573133 > .idle_sleeps: 1957794 > .idle_waketime : 59550238131639 nsecs > .idle_sleeptime : 17504617295679 nsecs > .iowait_sleeptime: 719978688 nsecs > .last_jiffies : 4296713608 So this was the last time when the CPU came out of idle: > .idle_exittime : 17763110009176 nsecs Here it went back into idle: > .idle_entrytime : 1776312625 nsecs And this was the next timer wheel timer due for expiry: > .next_timer : 1776313000 > .idle_expires : 1776313000 nsecs which makes no sense whatsoever, but this might be stale information. Can't tell. > cpu: 4 > clock 0: > .base: 07d8226b > .index: 0 > .resolution: 1 nsecs > .get_time: ktime_get > .offset: 0 nsecs > active timers: #0:, tick_sched_timer, S:01 ># expires at 5955295000-5955295000 nsecs [in > 2685654802 to 2685654802 nsecs] The tick timer is scheduled because the next timer wheel timer is due at 5955295000, which might be the hard watchdog timer >#1: <9b4a3b88>, hrtimer_wakeup, S:01 ># expires at 59602585423025-59602642458243 nsecs [in > 52321077827 to 52378113045 nsecs] That might be the soft lockup hrtimer. I'd try to gather more information about the chain of events via tracing and stop the tracer once the lockup detector hits. Thanks, tglx
[PATCH] powerpc/powernv/opal: Use standard interrupts property when available
For (bad) historical reasons, OPAL used to create a non-standard pair of
properties "opal-interrupts" and "opal-interrupts-names" for representing
the list of interrupts it wants Linux to request on its behalf.

Among other issues, "opal-interrupts" doesn't have a way to carry the type
of the interrupts, so they were all assumed to be level sensitive. This is
wrong on some recent systems where some of them are edge sensitive, causing
warnings in the XIVE code and possible misbehaviours if they need to be
retriggered (typically the NPU2 TCE error interrupts).

This makes Linux switch to using the standard "interrupts" and
"interrupt-names" properties instead when they are available, using the
standard of_irq helpers, which can carry all the desired type information.
Newer versions of OPAL will generate those properties in addition to the
legacy ones.

Signed-off-by: Benjamin Herrenschmidt
---
diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c b/arch/powerpc/platforms/powernv/opal-irqchip.c
index 9d1b8c0aaf93..46785eaf625d 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -23,6 +23,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -39,8 +40,8 @@ struct opal_event_irqchip {
 };
 static struct opal_event_irqchip opal_event_irqchip;
-static unsigned int opal_irq_count;
-static unsigned int *opal_irqs;
+static int opal_irq_count;
+static struct resource *opal_irqs;
 static void opal_handle_irq_work(struct irq_work *work);
 static u64 last_outstanding_events;
@@ -174,24 +175,21 @@ void opal_event_shutdown(void)
 	/* First free interrupts, which will also mask them */
 	for (i = 0; i < opal_irq_count; i++) {
-		if (!opal_irqs[i])
+		if (!opal_irqs || !opal_irqs[i].start)
 			continue;
 		if (in_interrupt())
-			disable_irq_nosync(opal_irqs[i]);
+			disable_irq_nosync(opal_irqs[i].start);
 		else
-			free_irq(opal_irqs[i], NULL);
-
-		opal_irqs[i] = 0;
+			free_irq(opal_irqs[i].start, NULL);
 	}
 }

 int __init opal_event_init(void)
{
 	struct device_node *dn, *opal_node;
-	const char **names;
-	u32 *irqs;
-	int i, rc;
+	bool old_style = false;
+	int i, rc = 0;

 	opal_node = of_find_node_by_path("/ibm,opal");
 	if (!opal_node) {
@@ -216,67 +214,91 @@ int __init opal_event_init(void)
 		goto out;
 	}

-	/* Get opal-interrupts property and names if present */
-	rc = of_property_count_u32_elems(opal_node, "opal-interrupts");
-	if (rc < 0)
-		goto out;
+	/* Look for new-style (standard) "interrupts" property */
+	opal_irq_count = of_irq_count(opal_node);

-	opal_irq_count = rc;
-	pr_debug("Found %d interrupts reserved for OPAL\n", opal_irq_count);
+	/* Absent ? Look for the old one */
+	if (opal_irq_count < 1) {
+		/* Get opal-interrupts property and names if present */
+		rc = of_property_count_u32_elems(opal_node, "opal-interrupts");
+		if (rc > 0)
+			opal_irq_count = rc;
+		old_style = true;
+	}

-	irqs = kcalloc(opal_irq_count, sizeof(*irqs), GFP_KERNEL);
-	names = kcalloc(opal_irq_count, sizeof(*names), GFP_KERNEL);
-	opal_irqs = kcalloc(opal_irq_count, sizeof(*opal_irqs), GFP_KERNEL);
+	/* No interrupts ? Bail out */
+	if (!opal_irq_count)
+		goto out;

-	if (WARN_ON(!irqs || !names || !opal_irqs))
-		goto out_free;
+	pr_debug("OPAL: Found %d interrupts reserved for OPAL using %s scheme\n",
+		 opal_irq_count, old_style ?
"old" : "new"); - rc = of_property_read_u32_array(opal_node, "opal-interrupts", - irqs, opal_irq_count); - if (rc < 0) { - pr_err("Error %d reading opal-interrupts array\n", rc); - goto out_free; + /* Allocate an IRQ resources array */ + opal_irqs = kcalloc(opal_irq_count, sizeof(struct resource), GFP_KERNEL); + if (WARN_ON(!opal_irqs)) { + rc = -ENOMEM; + goto out; } - /* It's not an error for the names to be missing */ - of_property_read_string_array(opal_node, "opal-interrupts-names", - names, opal_irq_count); + /* Build the resources array */ + if (old_style) { + /* Old style "opal-interrupts" property */ + for (i = 0; i < opal_irq_count; i++) { + struct resource *r = _irqs[i]; + const char *name = NULL; + u32 hw_irq; + int virq; + + rc =
Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event
On Mon, 26 Mar 2018, Andi Kleen wrote:
> > Aside: should there be a new spec_flt field for struct task_struct that
> > complements maj_flt and min_flt and be exported through /proc/pid/stat?
>
> No. task_struct is already too bloated. If you need per process tracking
> you can always get it through trace points.

Hi Andi,

We have

	count_vm_event(PGFAULT);
	count_memcg_event_mm(vma->vm_mm, PGFAULT);

in handle_mm_fault() but no counterpart for spf. I think it would be
helpful to be able to determine how much faulting can be done
speculatively if there is no per-process tracking, without tracing.
[PATCH] Revert "powerpc/64: Fix checksum folding in csum_add()"
This reverts commit 6ad966d7303b70165228dba1ee8da1a05c10eefe.

That commit was pointless, because csum_add() sums two 32 bits values, so
the sum is 0x1fffffffe at the maximum. And then when adding the upper part
(1) and the lower part (0xfffffffe), the result is 0xffffffff which doesn't
carry. Any lower value will not carry either.

And beyond the fact that this commit is useless, it also kills the whole
purpose of having an arch specific inline csum_add(), because the resulting
code gets even worse than what is obtained with the generic implementation
of csum_add():

0240 <.csum_add>:
 240:	38 00 ff ff 	li      r0,-1
 244:	7c 84 1a 14 	add     r4,r4,r3
 248:	78 00 00 20 	clrldi  r0,r0,32
 24c:	78 89 00 22 	rldicl  r9,r4,32,32
 250:	7c 80 00 38 	and     r0,r4,r0
 254:	7c 09 02 14 	add     r0,r9,r0
 258:	78 09 00 22 	rldicl  r9,r0,32,32
 25c:	7c 00 4a 14 	add     r0,r0,r9
 260:	78 03 00 20 	clrldi  r3,r0,32
 264:	4e 80 00 20 	blr

In comparison, the generic implementation of csum_add() gives:

0290 <.csum_add>:
 290:	7c 63 22 14 	add     r3,r3,r4
 294:	7f 83 20 40 	cmplw   cr7,r3,r4
 298:	7c 10 10 26 	mfocrf  r0,1
 29c:	54 00 ef fe 	rlwinm  r0,r0,29,31,31
 2a0:	7c 60 1a 14 	add     r3,r0,r3
 2a4:	78 63 00 20 	clrldi  r3,r3,32
 2a8:	4e 80 00 20 	blr

And the reverted implementation for PPC64 gives:

0240 <.csum_add>:
 240:	7c 84 1a 14 	add     r4,r4,r3
 244:	78 80 00 22 	rldicl  r0,r4,32,32
 248:	7c 80 22 14 	add     r4,r0,r4
 24c:	78 83 00 20 	clrldi  r3,r4,32
 250:	4e 80 00 20 	blr

Fixes: 6ad966d7303b7 ("powerpc/64: Fix checksum folding in csum_add()")
Signed-off-by: Christophe Leroy
---
 arch/powerpc/include/asm/checksum.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 842124b199b5..4e63787dc3be 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -112,7 +112,7 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
 #ifdef __powerpc64__
 	res += (__force u64)addend;
-	return (__force __wsum) from64to32(res);
+	return (__force __wsum)((u32)res + (res >> 32));
 #else
 	asm("addc %0,%0,%1;"
 	    "addze %0,%0;"
--
2.13.3
[PATCH] powerpc/64: optimises from64to32()
The current implementation of from64to32() gives a poor result:

0270 <.from64to32>:
 270:	38 00 ff ff 	li      r0,-1
 274:	78 69 00 22 	rldicl  r9,r3,32,32
 278:	78 00 00 20 	clrldi  r0,r0,32
 27c:	7c 60 00 38 	and     r0,r3,r0
 280:	7c 09 02 14 	add     r0,r9,r0
 284:	78 09 00 22 	rldicl  r9,r0,32,32
 288:	7c 00 4a 14 	add     r0,r0,r9
 28c:	78 03 00 20 	clrldi  r3,r0,32
 290:	4e 80 00 20 	blr

This patch modifies from64to32() to operate in the same spirit as
csum_fold().

It swaps the two 32-bit halves of sum, then adds that to the unswapped sum.
If there is a carry from adding the two 32-bit halves, it will carry from
the lower half into the upper half, giving us the correct sum in the upper
half.

The resulting code is:

0260 <.from64to32>:
 260:	78 60 00 02 	rotldi  r0,r3,32
 264:	7c 60 1a 14 	add     r3,r0,r3
 268:	78 63 00 22 	rldicl  r3,r3,32,32
 26c:	4e 80 00 20 	blr

Signed-off-by: Christophe Leroy
---
 arch/powerpc/include/asm/checksum.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/checksum.h b/arch/powerpc/include/asm/checksum.h
index 4e63787dc3be..54065caa40b3 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -12,6 +12,7 @@
 #ifdef CONFIG_GENERIC_CSUM
 #include
 #else
+#include
 /*
  * Computes the checksum of a memory block at src, length len,
  * and adds in "sum" (32-bit), while copying the block to dst.
@@ -55,11 +56,7 @@ static inline __sum16 csum_fold(__wsum sum)

 static inline u32 from64to32(u64 x)
 {
-	/* add up 32-bit and 32-bit for 32+c bit */
-	x = (x & 0xffffffff) + (x >> 32);
-	/* add up carry.. */
-	x = (x & 0xffffffff) + (x >> 32);
-	return (u32)x;
+	return (x + ror64(x, 32)) >> 32;
 }

 static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
--
2.13.3
Re: [PATCH 1/2] KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode
On Tue, 10 Apr 2018 11:25:02 +0530
"Naveen N. Rao" wrote:

> Michael Ellerman wrote:
> > Nicholas Piggin writes:
> >
> >> On Sun, 8 Apr 2018 20:17:47 +1000
> >> Balbir Singh wrote:
> >>
> >>> On Fri, Apr 6, 2018 at 3:56 AM, Nicholas Piggin wrote:
> >>> > This crashes with a "Bad real address for load" attempting to load
> >>> > from the vmalloc region in realmode (faulting address is in DAR).
> >>> >
> >>> > Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
> >>> > LE SMP NR_CPUS=2048 NUMA PowerNV
> >>> > CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 4.16.0-01530-g43d1859f0994
> >>> > NIP: c00155ac LR: c00c2430 CTR: c0015580
> >>> > REGS: c00fff76dd80 TRAP: 0200 Not tainted (4.16.0-01530-g43d1859f0994)
> >>> > MSR: 90201003 CR: 4808 XER:
> >>> > CFAR: 000102900ef0 DAR: d00017fffd941a28 DSISR: 0040 SOFTE: 3
> >>> > NIP [c00155ac] perf_trace_tlbie+0x2c/0x1a0
> >>> > LR [c00c2430] do_tlbies+0x230/0x2f0
> >>> >
> >>> > I suspect the reason is the per-cpu data is not in the linear chunk.
> >>> > This could be restored if that was able to be fixed, but for now,
> >>> > just remove the tracepoints.
> >>>
> >>> Could you share the stack trace as well? I've not observed this in my
> >>> testing.
> >>
> >> I can't seem to find it, I can try reproduce tomorrow. It was coming
> >> from h_remove hcall from the guest. It's 176 logical CPUs.
> >>
> >>> May be I don't have as many cpus. I presume your talking about the per cpu
> >>> data offsets for per cpu trace data?
> >>
> >> It looked like it was dereferencing virtually mapped per-cpu data, yes.
> >> Probably the perf_events deref.
> >
> > Naveen has posted a series to (hopefully) fix this, which just missed
> > the merge window:
> >
> > https://patchwork.ozlabs.org/patch/894757/
>
> I'm afraid that won't actually help here :(
> That series is specific to the function tracer, while this is using
> static tracepoints.
>
> We could convert trace_tlbie() to a TRACE_EVENT_CONDITION() and guard it
> within a check for paca->ftrace_enabled, but that would only be useful
> if the below callsites can ever be hit outside of KVM guest mode.

Right, removing the trace points is the right thing to do here. Doing
tracing in real mode would be a whole effort itself, I'd expect. Or
disabling realmode handling of HPT hcalls if trace points are active.

Thanks,
Nick