Re: [PATCH 1/3] powerpc/xive: Fix trying to "push" an already active pool VP

2018-04-10 Thread Benjamin Herrenschmidt
On Wed, 2018-04-11 at 15:17 +1000, Benjamin Herrenschmidt wrote:
> When setting up a CPU, we "push" (activate) a pool VP for it.
> 
> However it's an error to do so if it already has an active
> pool VP.
> 
> This happens when doing soft CPU hotplug on powernv since we
> don't tear down the CPU on unplug. The HW flags the error which
> gets captured by the diagnostics.
> 
> Fix this by making sure to "pull" out any already active pool
> first.
> 
> Signed-off-by: Benjamin Herrenschmidt 

CC: sta...@vger.kernel.org...

> ---
>  arch/powerpc/sysdev/xive/native.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/sysdev/xive/native.c 
> b/arch/powerpc/sysdev/xive/native.c
> index d22aeb0b69e1..b48454be5b98 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -389,6 +389,10 @@ static void xive_native_setup_cpu(unsigned int cpu, 
> struct xive_cpu *xc)
>   if (xive_pool_vps == XIVE_INVALID_VP)
>   return;
>  
> + /* Check if pool VP already active, if it is, pull it */
> + if (in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2) & TM_QW2W2_VP)
> + in_be64(xive_tima + TM_SPC_PULL_POOL_CTX);
> +
>   /* Enable the pool VP */
>   vp = xive_pool_vps + cpu;
>   pr_debug("CPU %d setting up pool VP 0x%x\n", cpu, vp);


[PATCH 2/3] powerpc/xive: Remove now useless pr_debug statements

2018-04-10 Thread Benjamin Herrenschmidt
Those overly verbose statements in the setup of the pool VP
aren't particularly useful (especially considering we don't
actually use the pool; we only configure it because the HW
requires it). So remove them, which improves code readability.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/sysdev/xive/native.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index b48454be5b98..c7088a35eb89 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -395,7 +395,6 @@ static void xive_native_setup_cpu(unsigned int cpu, struct 
xive_cpu *xc)
 
/* Enable the pool VP */
vp = xive_pool_vps + cpu;
-   pr_debug("CPU %d setting up pool VP 0x%x\n", cpu, vp);
for (;;) {
rc = opal_xive_set_vp_info(vp, OPAL_XIVE_VP_ENABLED, 0);
if (rc != OPAL_BUSY)
@@ -415,16 +414,9 @@ static void xive_native_setup_cpu(unsigned int cpu, struct 
xive_cpu *xc)
}
vp_cam = be64_to_cpu(vp_cam_be);
 
-   pr_debug("VP CAM = %llx\n", vp_cam);
-
/* Push it on the CPU (set LSMFB to 0xff to skip backlog scan) */
-   pr_debug("(Old HW value: %08x)\n",
-in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2));
out_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD0, 0xff);
-   out_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2,
-TM_QW2W2_VP | vp_cam);
-   pr_debug("(New HW value: %08x)\n",
-in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2));
+   out_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2, TM_QW2W2_VP | vp_cam);
 }
 
 static void xive_native_teardown_cpu(unsigned int cpu, struct xive_cpu *xc)
-- 
2.14.3



[PATCH 3/3] powerpc/xive: Remove xive_kexec_teardown_cpu()

2018-04-10 Thread Benjamin Herrenschmidt
It's identical to xive_teardown_cpu(), so just use the latter.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/include/asm/xive.h|  1 -
 arch/powerpc/platforms/powernv/setup.c |  2 +-
 arch/powerpc/platforms/pseries/kexec.c |  2 +-
 arch/powerpc/sysdev/xive/common.c  | 22 ----------------------
 4 files changed, 2 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 8d1a2792484f..3c704f5dd3ae 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -87,7 +87,6 @@ extern int  xive_smp_prepare_cpu(unsigned int cpu);
 extern void xive_smp_setup_cpu(void);
 extern void xive_smp_disable_cpu(void);
 extern void xive_teardown_cpu(void);
-extern void xive_kexec_teardown_cpu(int secondary);
 extern void xive_shutdown(void);
 extern void xive_flush_interrupt(void);
 
diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index 092715b9674b..5b4b09816791 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -282,7 +282,7 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int 
secondary)
u64 reinit_flags;
 
if (xive_enabled())
-   xive_kexec_teardown_cpu(secondary);
+   xive_teardown_cpu();
else
xics_kexec_teardown_cpu(secondary);
 
diff --git a/arch/powerpc/platforms/pseries/kexec.c 
b/arch/powerpc/platforms/pseries/kexec.c
index eeb13429d685..9dabf019556b 100644
--- a/arch/powerpc/platforms/pseries/kexec.c
+++ b/arch/powerpc/platforms/pseries/kexec.c
@@ -53,7 +53,7 @@ void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
}
 
if (xive_enabled())
-   xive_kexec_teardown_cpu(secondary);
+   xive_teardown_cpu();
else
xics_kexec_teardown_cpu(secondary);
 }
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 40c06110821c..c8db51b60b4b 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1408,28 +1408,6 @@ void xive_teardown_cpu(void)
xive_cleanup_cpu_queues(cpu, xc);
 }
 
-void xive_kexec_teardown_cpu(int secondary)
-{
-   struct xive_cpu *xc = __this_cpu_read(xive_cpu);
-   unsigned int cpu = smp_processor_id();
-
-   /* Set CPPR to 0 to disable flow of interrupts */
-   xc->cppr = 0;
-   out_8(xive_tima + xive_tima_offset + TM_CPPR, 0);
-
-   /* Backend cleanup if any */
-   if (xive_ops->teardown_cpu)
-   xive_ops->teardown_cpu(cpu, xc);
-
-#ifdef CONFIG_SMP
-   /* Get rid of IPI */
-   xive_cleanup_cpu_ipi(cpu, xc);
-#endif
-
-   /* Disable and free the queues */
-   xive_cleanup_cpu_queues(cpu, xc);
-}
-
 void xive_shutdown(void)
 {
xive_ops->shutdown();
-- 
2.14.3



[PATCH 1/3] powerpc/xive: Fix trying to "push" an already active pool VP

2018-04-10 Thread Benjamin Herrenschmidt
When setting up a CPU, we "push" (activate) a pool VP for it.

However it's an error to do so if it already has an active
pool VP.

This happens when doing soft CPU hotplug on powernv since we
don't tear down the CPU on unplug. The HW flags the error which
gets captured by the diagnostics.

Fix this by making sure to "pull" out any already active pool
first.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/sysdev/xive/native.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index d22aeb0b69e1..b48454be5b98 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -389,6 +389,10 @@ static void xive_native_setup_cpu(unsigned int cpu, struct 
xive_cpu *xc)
if (xive_pool_vps == XIVE_INVALID_VP)
return;
 
+   /* Check if pool VP already active, if it is, pull it */
+   if (in_be32(xive_tima + TM_QW2_HV_POOL + TM_WORD2) & TM_QW2W2_VP)
+   in_be64(xive_tima + TM_SPC_PULL_POOL_CTX);
+
/* Enable the pool VP */
vp = xive_pool_vps + cpu;
pr_debug("CPU %d setting up pool VP 0x%x\n", cpu, vp);
-- 
2.14.3



Re: [PATCH 2/2] powerpc/mm/memtrace: Let the arch hotunplug code flush cache

2018-04-10 Thread rashmica


On 06/04/18 15:24, Balbir Singh wrote:
> Don't do this via custom code, instead now that we have support
> in the arch hotplug/hotunplug code, rely on those routines
> to do the right thing.
>
> Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory 
> tracing")
> because the older code uses ppc64_caches.l1d.size instead of
> ppc64_caches.l1d.line_size
>
> Signed-off-by: Balbir Singh 

Reviewed-by: Rashmica Gupta 



Re: [PATCH 1/2] powerpc/mm: Flush cache on memory hot(un)plug

2018-04-10 Thread rashmica


On 06/04/18 15:24, Balbir Singh wrote:
> This patch adds support for flushing potentially dirty
> cache lines when memory is hot-plugged/hot-un-plugged.
> The support is currently limited to 64 bit systems.
>
> The bug was exposed when mappings for a device were
> actually hot-unplugged and plugged in back later.
> A similar issue was observed during the development
> of memtrace, but memtrace does its own flushing of
> region via a custom routine.
>
> These patches do a flush both on hotplug/unplug to
> clear any stale data in the cache w.r.t mappings,
> there is a small race window where a clean cache
> line may be created again just prior to tearing
> down the mapping.
>
> The patches were tested by disabling the flush
> routines in memtrace and doing I/O on the trace
> file. The system immediately checkstops (quite
> reliably if prior to the hot-unplug of the memtrace
> region, we memset the regions we are about to
> hot unplug). After these patches no custom flushing
> is needed in the memtrace code.
>
> Signed-off-by: Balbir Singh 

Reviewed-by: Rashmica Gupta 
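
For reference, a minimal sketch of the flush-on-hotplug idea described in
the quoted commit message. This is illustrative only, not the actual patch;
the function name and parameters are invented, but it uses the
ppc64_caches.l1d.line_size field that the 2/2 Fixes note above calls out:

/*
 * Flush potentially dirty cache lines covering a hot-(un)plugged range,
 * stepping by the L1 data cache *line* size (not the total cache size,
 * which is the bug the older memtrace code had).
 */
static void flush_hotplug_range(unsigned long start, unsigned long size)
{
	unsigned long line = ppc64_caches.l1d.line_size;
	unsigned long addr;

	for (addr = start; addr < start + size; addr += line)
		asm volatile("dcbf 0,%0" : : "r" (addr) : "memory");
	asm volatile("sync" : : : "memory");	/* complete the flushes */
}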




[PATCH] powerpc/eeh: Fix enabling bridge MMIO windows

2018-04-10 Thread Michael Neuling
On boot we save the configuration space of PCIe bridges. We do this so
when we get an EEH event and everything gets reset that we can restore
them.

Unfortunately we save this state before we've enabled the MMIO space
on the bridges. Hence if we have to reset the bridge when we come back
MMIO is not enabled and we end up taking a PE freeze when the driver
starts accessing again.

This patch forces the memory/MMIO and bus mastering on when restoring
bridges on EEH. Ideally we'd do this correctly by saving the
configuration space writes later, but that will have to come later in
a larger EEH rewrite. For now we have this simple fix.
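
As a sketch of the boot-time save referred to above (hedged: this
illustrates capturing the first 16 config-space dwords into
edev->config_space[], it is not the exact kernel code):

static void eeh_save_config_sketch(struct pci_dn *pdn, struct eeh_dev *edev)
{
	int i;

	/* Capture 64 bytes of config space; eeh_restore_bridge_bars()
	 * in the diff below writes these values back after a reset. */
	for (i = 0; i < 16; i++)
		eeh_ops->read_config(pdn, i * 4, 4, &edev->config_space[i]);
}

Because this snapshot is taken before MMIO is enabled, config_space[1] (the
PCI command register) lacks the memory and bus-master bits, which is what
the one-line fix below compensates for.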

The original bug can be triggered on a boston machine by doing:
  echo 0x8000 > /sys/kernel/debug/powerpc/PCI0001/err_injct_outbound
On boston, this PHB has a PCIe switch on it.  Without this patch,
you'll see two EEH events, 1 expected and 1 the failure we are fixing
here. The second EEH event causes everything under the PHB to
disappear (i.e. the i40e eth).

With this patch, only 1 EEH event occurs and devices properly recover.

Reported-by: Pridhiviraj Paidipeddi 
Signed-off-by: Michael Neuling 
Cc: sta...@vger.kernel.org
---
 arch/powerpc/kernel/eeh_pe.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 2d4956e97a..ee5a67d57a 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -807,7 +807,8 @@ static void eeh_restore_bridge_bars(struct eeh_dev *edev)
eeh_ops->write_config(pdn, 15*4, 4, edev->config_space[15]);
 
/* PCI Command: 0x4 */
-   eeh_ops->write_config(pdn, PCI_COMMAND, 4, edev->config_space[1]);
+   eeh_ops->write_config(pdn, PCI_COMMAND, 4, edev->config_space[1] |
+ PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER);
 
/* Check the PCIe link is ready */
eeh_bridge_check_link(edev);
-- 
2.14.1



Re: [RFC PATCH 4/5] KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest entry

2018-04-10 Thread Nicholas Piggin
On Wed, 11 Apr 2018 11:32:12 +1000
Benjamin Herrenschmidt  wrote:

> On Tue, 2018-04-10 at 22:48 +1000, Nicholas Piggin wrote:
> >  
> > +   /*
> > +    * Do we need to flush the TLB for the LPAR? (see TLB comment above)
> > +    * On POWER9, individual threads can come in here, but the
> > +    * TLB is shared between the 4 threads in a core, hence
> > +    * invalidating on one thread invalidates for all.
> > +    * Thus we make all 4 threads use the same bit here.
> > +    */
> 
> This might be true of the P9 implementation but isn't architecturally
> correct. From an ISA perspective, the threads could have dedicated
> tagged TLB entries. Do we need to be careful here vs. backward
> compatibility?

I think so. I noticed that; I was just trying to do a like-for-like
replacement with this patch.

Yes it should have a feature bit test for this optimization IMO. That
can be expanded if other CPUs have the same ability... Is it even
a worthwhile optimisation to do at this point, I wonder? I didn't see
it being hit a lot in traces.

> Also this won't flush ERAT entries for another thread afaik.

Yeah, I'm still not entirely clear exactly when ERATs get invalidated.
I would like to see more commentary here to show why it's okay.

> 
> > +   tmp = pcpu;
> > +   if (cpu_has_feature(CPU_FTR_ARCH_300))
> > +   tmp &= ~0x3UL;
> > +   if (cpumask_test_cpu(tmp, &vc->kvm->arch.need_tlb_flush)) {
> > +   if (kvm_is_radix(vc->kvm))
> > +   radix__local_flush_tlb_lpid(vc->kvm->arch.lpid);
> > +   else
> > +   hash__local_flush_tlb_lpid(vc->kvm->arch.lpid);
> > +   /* Clear the bit after the TLB flush */
> > +   cpumask_clear_cpu(tmp, &vc->kvm->arch.need_tlb_flush);
> > +   }
> > +  
> 



Re: [RFC PATCH 4/5] KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest entry

2018-04-10 Thread Benjamin Herrenschmidt
On Tue, 2018-04-10 at 22:48 +1000, Nicholas Piggin wrote:
>  
> +   /*
> +    * Do we need to flush the TLB for the LPAR? (see TLB comment above)
> +    * On POWER9, individual threads can come in here, but the
> +    * TLB is shared between the 4 threads in a core, hence
> +    * invalidating on one thread invalidates for all.
> +    * Thus we make all 4 threads use the same bit here.
> +    */

This might be true of the P9 implementation but isn't architecturally
correct. From an ISA perspective, the threads could have dedicated
tagged TLB entries. Do we need to be careful here vs. backward
compatibility?

Also this won't flush ERAT entries for another thread afaik.

> +   tmp = pcpu;
> +   if (cpu_has_feature(CPU_FTR_ARCH_300))
> +   tmp &= ~0x3UL;
> +   if (cpumask_test_cpu(tmp, &vc->kvm->arch.need_tlb_flush)) {
> +   if (kvm_is_radix(vc->kvm))
> +   radix__local_flush_tlb_lpid(vc->kvm->arch.lpid);
> +   else
> +   hash__local_flush_tlb_lpid(vc->kvm->arch.lpid);
> +   /* Clear the bit after the TLB flush */
> +   cpumask_clear_cpu(tmp, &vc->kvm->arch.need_tlb_flush);
> +   }
> +



[PATCH] ibmvnic: Define vnic_login_client_data name field as unsized array

2018-04-10 Thread Kees Cook
The "name" field of struct vnic_login_client_data is a char array of
undefined length. This should be written as "char name[]" so the compiler
can make better decisions about the field (for example, not assuming
it's a single character). This was noticed while trying to tighten the
CONFIG_FORTIFY_SOURCE checking.
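
For context, the difference in sketch form (the allocation line is
illustrative, not taken from the driver):

struct vnic_login_client_data {
	u8	type;
	__be16	len;
	char	name[];		/* flexible array member, was "char name" */
} __packed;

/* sizeof() excludes the flexible trailer, so a buffer holding one
 * entry plus its string must be sized explicitly: */
size_t bytes = sizeof(struct vnic_login_client_data) + name_len;

With plain "char name", fortified string helpers assume a one-byte
destination and flag the strncpy() calls below; the flexible array member
tells the compiler the field is a variable-length trailer.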

Signed-off-by: Kees Cook 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index aad5658d79d5..35fbb41cd2d4 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3170,7 +3170,7 @@ static int send_version_xchg(struct ibmvnic_adapter 
*adapter)
 struct vnic_login_client_data {
u8  type;
__be16  len;
-   charname;
+   charname[];
 } __packed;
 
 static int vnic_client_data_len(struct ibmvnic_adapter *adapter)
@@ -3199,21 +3199,21 @@ static void vnic_add_client_data(struct ibmvnic_adapter 
*adapter,
vlcd->type = 1;
len = strlen(os_name) + 1;
vlcd->len = cpu_to_be16(len);
-   strncpy(>name, os_name, len);
-   vlcd = (struct vnic_login_client_data *)((char *)>name + len);
+   strncpy(vlcd->name, os_name, len);
+   vlcd = (struct vnic_login_client_data *)(vlcd->name + len);
 
/* Type 2 - LPAR name */
vlcd->type = 2;
len = strlen(utsname()->nodename) + 1;
vlcd->len = cpu_to_be16(len);
-   strncpy(>name, utsname()->nodename, len);
-   vlcd = (struct vnic_login_client_data *)((char *)>name + len);
+   strncpy(vlcd->name, utsname()->nodename, len);
+   vlcd = (struct vnic_login_client_data *)(vlcd->name + len);
 
/* Type 3 - device name */
vlcd->type = 3;
len = strlen(adapter->netdev->name) + 1;
vlcd->len = cpu_to_be16(len);
-   strncpy(>name, adapter->netdev->name, len);
+   strncpy(vlcd->name, adapter->netdev->name, len);
 }
 
 static int send_login(struct ibmvnic_adapter *adapter)
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-10 Thread David Rientjes
On Tue, 10 Apr 2018, Laurent Dufour wrote:

> > On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
> >>  arch/powerpc/include/asm/pte-common.h  | 3 ---
> >>  arch/riscv/Kconfig | 1 +
> >>  arch/s390/Kconfig  | 1 +
> > 
> > You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
> > arch/riscv/include/asm/pgtable-bits.h
> 
> Damned !
> Thanks for catching it.
> 

Squashing the two patches together at least allowed it to be caught 
easily.  After it's fixed, feel free to add

Acked-by: David Rientjes 

Thanks for doing this!


Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-10 Thread Palmer Dabbelt

On Tue, 10 Apr 2018 09:09:32 PDT (-0700), wi...@infradead.org wrote:

On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:

 arch/powerpc/include/asm/pte-common.h  | 3 ---
 arch/riscv/Kconfig | 1 +
 arch/s390/Kconfig  | 1 +


You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
arch/riscv/include/asm/pgtable-bits.h


Thanks -- I was looking for that but couldn't find it and assumed I'd just 
misunderstood something.


Re: [PATCH, RESEND, pci, v2] pci: Delete PCI disabling informational messages

2018-04-10 Thread Desnes A. Nunes do Rosario

Bjorn,

On 04/10/2018 04:55 PM, Bjorn Helgaas wrote:

On Tue, Apr 10, 2018 at 02:36:31PM -0500, Bjorn Helgaas wrote:

On Wed, Apr 04, 2018 at 12:10:35PM -0300, Desnes A. Nunes do Rosario wrote:

The disabling informational messages on the PCI subsystem should be deleted
since they do not represent any real value for the system logs.

These messages are either not presented, or presented for all PCI devices
(e.g., powerpc now realigns all PCI devices to its page size). Thus, they
are flooding system logs and can be interpreted as a false positive for
total PCI failure on the system.

[root@system user]# dmesg | grep -i disabling
[0.692270] pci 0000:00:00.0: Disabling memory decoding and releasing memory 
resources
[0.692324] pci 0000:00:00.0: disabling bridge mem windows
[0.729134] pci 0001:00:00.0: Disabling memory decoding and releasing memory 
resources
[0.737352] pci 0001:00:00.0: disabling bridge mem windows
[0.776295] pci 0002:00:00.0: Disabling memory decoding and releasing memory 
resources
[0.784509] pci 0002:00:00.0: disabling bridge mem windows
... and goes on for all PCI devices on the system ...

Fixes: 38274637699 ("powerpc/powernv: Override pcibios_default_alignment() to force 
PCI devices to be page aligned")
Signed-off-by: Desnes A. Nunes do Rosario 


Applied to pci/resource for v4.18, thanks!

I should have gotten this in for v4.17, but I didn't; sorry about that.


This is trivial and I'm planning to squeeze a few more things into v4.17,
so I moved this to my "for-linus" branch for v4.17.


No need for apologies.

On the contrary, thank you very much for your review and branch change.




---
  drivers/pci/pci.c   | 1 -
  drivers/pci/setup-res.c | 2 --
  2 files changed, 3 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8c71d1a66cdd..1563ce1ee091 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5505,7 +5505,6 @@ void pci_reassigndev_resource_alignment(struct pci_dev 
*dev)
return;
}
  
-	pci_info(dev, "Disabling memory decoding and releasing memory resources\n");

	pci_read_config_word(dev, PCI_COMMAND, &command);
command &= ~PCI_COMMAND_MEMORY;
pci_write_config_word(dev, PCI_COMMAND, command);
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index 369d48d6c6f1..6bd35e8e7cde 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -172,8 +172,6 @@ EXPORT_SYMBOL(pci_claim_resource);
  
  void pci_disable_bridge_window(struct pci_dev *dev)

  {
-   pci_info(dev, "disabling bridge mem windows\n");
-
/* MMIO Base/Limit */
pci_write_config_dword(dev, PCI_MEMORY_BASE, 0xfff0);
  
--

2.14.3





--
Desnes A. Nunes do Rosario
--
Linux Developer - IBM



Re: [PATCH 5/5] powerpc:dts:pm: add power management node

2018-04-10 Thread Li Yang
On Wed, Mar 28, 2018 at 8:31 PM, Ran Wang  wrote:
> Enable Power Management feature on device tree, including MPC8536,
> MPC8544, MPC8548, MPC8572, P1010, P1020, P1021, P1022, P2020, P2041,
> P3041, T104X, T1024.

There are no device tree bindings documented for the properties and
compatible strings used in the patch. Please update the binding
documents first before adding them into device tree.

>
> Signed-off-by: Zhao Chenhui 
> Signed-off-by: Ran Wang 
> ---
>  arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi |   14 ++-
>  arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi |2 +
>  arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi |2 +
>  arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi |2 +
>  arch/powerpc/boot/dts/fsl/p1010si-post.dtsi   |8 
>  arch/powerpc/boot/dts/fsl/p1020si-post.dtsi   |5 +++
>  arch/powerpc/boot/dts/fsl/p1021si-post.dtsi   |5 +++
>  arch/powerpc/boot/dts/fsl/p1022si-post.dtsi   |9 +++--
>  arch/powerpc/boot/dts/fsl/p2020si-post.dtsi   |   14 +++
>  arch/powerpc/boot/dts/fsl/pq3-power.dtsi  |   48 
> +
>  arch/powerpc/boot/dts/fsl/t1024rdb.dts|2 +-
>  arch/powerpc/boot/dts/fsl/t1040rdb.dts|2 +-
>  arch/powerpc/boot/dts/fsl/t1042rdb.dts|2 +-
>  arch/powerpc/boot/dts/fsl/t1042rdb_pi.dts |2 +-
>  14 files changed, 108 insertions(+), 9 deletions(-)
>  create mode 100644 arch/powerpc/boot/dts/fsl/pq3-power.dtsi
>
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi
> index 4193570..fba40a1 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8536si-post.dtsi
> @@ -199,6 +199,10 @@
>
>  /include/ "pq3-dma-0.dtsi"
>  /include/ "pq3-etsec1-0.dtsi"
> +   enet0: ethernet@24000 {
> +   fsl,wake-on-filer;
> +   fsl,pmc-handle = <_clk>;
> +   };
>  /include/ "pq3-etsec1-timer-0.dtsi"
>
> usb@22000 {
> @@ -222,9 +226,10 @@
> };
>
>  /include/ "pq3-etsec1-2.dtsi"
> -
> -   ethernet@26000 {
> +   enet2: ethernet@26000 {
> cell-index = <1>;
> +   fsl,wake-on-filer;
> +   fsl,pmc-handle = <_clk>;
> };
>
> usb@2b000 {
> @@ -249,4 +254,9 @@
> reg = <0xe 0x1000>;
> fsl,has-rstcr;
> };
> +
> +/include/ "pq3-power.dtsi"
> +   power@e0070 {
> +   compatible = "fsl,mpc8536-pmc", "fsl,mpc8548-pmc";
> +   };
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi
> index b68eb11..ea7416a 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8544si-post.dtsi
> @@ -188,4 +188,6 @@
> reg = <0xe 0x1000>;
> fsl,has-rstcr;
> };
> +
> +/include/ "pq3-power.dtsi"
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi
> index 579d76c..dddb737 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8548si-post.dtsi
> @@ -156,4 +156,6 @@
> reg = <0xe 0x1000>;
> fsl,has-rstcr;
> };
> +
> +/include/ "pq3-power.dtsi"
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
> index 49294cf..40a6cff 100644
> --- a/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
> @@ -193,4 +193,6 @@
> reg = <0xe 0x1000>;
> fsl,has-rstcr;
> };
> +
> +/include/ "pq3-power.dtsi"
>  };
> diff --git a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi 
> b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> index 1b4aafc..47b62a8 100644
> --- a/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> +++ b/arch/powerpc/boot/dts/fsl/p1010si-post.dtsi
> @@ -173,6 +173,8 @@
>
>  /include/ "pq3-etsec2-0.dtsi"
> enet0: ethernet@b {
> +   fsl,pmc-handle = <_clk>;
> +
> queue-group@b {
> fsl,rx-bit-map = <0xff>;
> fsl,tx-bit-map = <0xff>;
> @@ -181,6 +183,8 @@
>
>  /include/ "pq3-etsec2-1.dtsi"
> enet1: ethernet@b1000 {
> +   fsl,pmc-handle = <_clk>;
> +
> queue-group@b1000 {
> fsl,rx-bit-map = <0xff>;
> fsl,tx-bit-map = <0xff>;
> @@ -189,6 +193,8 @@
>
>  /include/ "pq3-etsec2-2.dtsi"
> enet2: ethernet@b2000 {
> +   fsl,pmc-handle = <_clk>;
> +
> queue-group@b2000 {
> fsl,rx-bit-map = <0xff>;
> fsl,tx-bit-map = <0xff>;
> @@ -201,4 +207,6 @@
> reg = <0xe 0x1000>;
> fsl,has-rstcr;
> };
> +
> 

Re: [PATCH, RESEND, pci, v2] pci: Delete PCI disabling informational messages

2018-04-10 Thread Bjorn Helgaas
On Tue, Apr 10, 2018 at 02:36:31PM -0500, Bjorn Helgaas wrote:
> On Wed, Apr 04, 2018 at 12:10:35PM -0300, Desnes A. Nunes do Rosario wrote:
> > The disabling informational messages on the PCI subsystem should be deleted
> > since they do not represent any real value for the system logs.
> > 
> > These messages are either not presented, or presented for all PCI devices
> > (e.g., powerpc now realigns all PCI devices to its page size). Thus, they
> > are flooding system logs and can be interpreted as a false positive for
> > total PCI failure on the system.
> > 
> > [root@system user]# dmesg | grep -i disabling
> > [0.692270] pci 0000:00:00.0: Disabling memory decoding and releasing 
> > memory resources
> > [0.692324] pci 0000:00:00.0: disabling bridge mem windows
> > [0.729134] pci 0001:00:00.0: Disabling memory decoding and releasing 
> > memory resources
> > [0.737352] pci 0001:00:00.0: disabling bridge mem windows
> > [0.776295] pci 0002:00:00.0: Disabling memory decoding and releasing 
> > memory resources
> > [0.784509] pci 0002:00:00.0: disabling bridge mem windows
> > ... and goes on for all PCI devices on the system ...
> > 
> > Fixes: 38274637699 ("powerpc/powernv: Override pcibios_default_alignment() 
> > to force PCI devices to be page aligned")
> > Signed-off-by: Desnes A. Nunes do Rosario 
> 
> Applied to pci/resource for v4.18, thanks!
> 
> I should have gotten this in for v4.17, but I didn't; sorry about that.

This is trivial and I'm planning to squeeze a few more things into v4.17,
so I moved this to my "for-linus" branch for v4.17.

> > ---
> >  drivers/pci/pci.c   | 1 -
> >  drivers/pci/setup-res.c | 2 --
> >  2 files changed, 3 deletions(-)
> > 
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 8c71d1a66cdd..1563ce1ee091 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5505,7 +5505,6 @@ void pci_reassigndev_resource_alignment(struct 
> > pci_dev *dev)
> > return;
> > }
> >  
> > -   pci_info(dev, "Disabling memory decoding and releasing memory 
> > resources\n");
> > pci_read_config_word(dev, PCI_COMMAND, &command);
> > command &= ~PCI_COMMAND_MEMORY;
> > pci_write_config_word(dev, PCI_COMMAND, command);
> > diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
> > index 369d48d6c6f1..6bd35e8e7cde 100644
> > --- a/drivers/pci/setup-res.c
> > +++ b/drivers/pci/setup-res.c
> > @@ -172,8 +172,6 @@ EXPORT_SYMBOL(pci_claim_resource);
> >  
> >  void pci_disable_bridge_window(struct pci_dev *dev)
> >  {
> > -   pci_info(dev, "disabling bridge mem windows\n");
> > -
> > /* MMIO Base/Limit */
> > pci_write_config_dword(dev, PCI_MEMORY_BASE, 0xfff0);
> >  
> > -- 
> > 2.14.3
> > 


Re: [PATCH, RESEND, pci, v2] pci: Delete PCI disabling informational messages

2018-04-10 Thread Bjorn Helgaas
On Wed, Apr 04, 2018 at 12:10:35PM -0300, Desnes A. Nunes do Rosario wrote:
> The disabling informational messages on the PCI subsystem should be deleted
> since they do not represent any real value for the system logs.
> 
> These messages are either not presented, or presented for all PCI devices
> (e.g., powerpc now realigns all PCI devices to its page size). Thus, they
> are flooding system logs and can be interpreted as a false positive for
> total PCI failure on the system.
> 
> [root@system user]# dmesg | grep -i disabling
> [0.692270] pci 0000:00:00.0: Disabling memory decoding and releasing 
> memory resources
> [0.692324] pci 0000:00:00.0: disabling bridge mem windows
> [0.729134] pci 0001:00:00.0: Disabling memory decoding and releasing 
> memory resources
> [0.737352] pci 0001:00:00.0: disabling bridge mem windows
> [0.776295] pci 0002:00:00.0: Disabling memory decoding and releasing 
> memory resources
> [0.784509] pci 0002:00:00.0: disabling bridge mem windows
> ... and goes on for all PCI devices on the system ...
> 
> Fixes: 38274637699 ("powerpc/powernv: Override pcibios_default_alignment() to 
> force PCI devices to be page aligned")
> Signed-off-by: Desnes A. Nunes do Rosario 

Applied to pci/resource for v4.18, thanks!

I should have gotten this in for v4.17, but I didn't; sorry about that.

> ---
>  drivers/pci/pci.c   | 1 -
>  drivers/pci/setup-res.c | 2 --
>  2 files changed, 3 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8c71d1a66cdd..1563ce1ee091 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5505,7 +5505,6 @@ void pci_reassigndev_resource_alignment(struct pci_dev 
> *dev)
>   return;
>   }
>  
> - pci_info(dev, "Disabling memory decoding and releasing memory 
> resources\n");
>   pci_read_config_word(dev, PCI_COMMAND, &command);
>   command &= ~PCI_COMMAND_MEMORY;
>   pci_write_config_word(dev, PCI_COMMAND, command);
> diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
> index 369d48d6c6f1..6bd35e8e7cde 100644
> --- a/drivers/pci/setup-res.c
> +++ b/drivers/pci/setup-res.c
> @@ -172,8 +172,6 @@ EXPORT_SYMBOL(pci_claim_resource);
>  
>  void pci_disable_bridge_window(struct pci_dev *dev)
>  {
> - pci_info(dev, "disabling bridge mem windows\n");
> -
>   /* MMIO Base/Limit */
>   pci_write_config_dword(dev, PCI_MEMORY_BASE, 0xfff0);
>  
> -- 
> 2.14.3
> 


Re: [PATCH v3] powerpc/64: Fix section mismatch warnings for early boot symbols

2018-04-10 Thread Mauricio Faria de Oliveira

On 04/09/2018 11:51 PM, Michael Ellerman wrote:

Thanks for picking this one up.

I hate to be a pain ... but before we merge this and proliferate these
names, I'd like to change the names of some of these early asm
functions. They're terribly named due to historical reasons.


Indeed :) No worries.


I haven't actually thought of good names yet though :)

I'll try and come up with some and post a patch doing the renames.


Alright. Could you please copy me on that, and I can post an update.

cheers,
Mauricio



Re: [PATCH v2 2/2] mm: remove odd HAVE_PTE_SPECIAL

2018-04-10 Thread Laurent Dufour


On 10/04/2018 17:58, Robin Murphy wrote:
> On 10/04/18 16:25, Laurent Dufour wrote:
>> Remove the additional define HAVE_PTE_SPECIAL and rely directly on
>> CONFIG_ARCH_HAS_PTE_SPECIAL.
>>
>> There is no functional change introduced by this patch
>>
>> Signed-off-by: Laurent Dufour 
>> ---
>>   mm/memory.c | 23 ++-
>>   1 file changed, 10 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 96910c625daa..53b6344a90d2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -817,19 +817,13 @@ static void print_bad_pte(struct vm_area_struct *vma,
>> unsigned long addr,
>>    * PFNMAP mappings in order to support COWable mappings.
>>    *
>>    */
>> -#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>> -# define HAVE_PTE_SPECIAL 1
>> -#else
>> -# define HAVE_PTE_SPECIAL 0
>> -#endif
>>   struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long 
>> addr,
>>    pte_t pte, bool with_public_device)
>>   {
>>   unsigned long pfn = pte_pfn(pte);
>>   -    if (HAVE_PTE_SPECIAL) {
>> -    if (likely(!pte_special(pte)))
>> -    goto check_pfn;
>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> 
> Nit: Couldn't you use IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) within the
> existing code structure to avoid having to add these #ifdefs?

I agree, that would be better. I didn't think about this option.
Thanks for reporting this.



Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-10 Thread Laurent Dufour
On 10/04/2018 18:09, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
>>  arch/powerpc/include/asm/pte-common.h  | 3 ---
>>  arch/riscv/Kconfig | 1 +
>>  arch/s390/Kconfig  | 1 +
> 
> You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
> arch/riscv/include/asm/pgtable-bits.h

Damned !
Thanks for catching it.



Re: [PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()

2018-04-10 Thread Laurent Dufour
On 03/04/2018 01:57, David Rientjes wrote:
> On Tue, 13 Mar 2018, Laurent Dufour wrote:
> 
>> When dealing with the speculative page fault handler, we may race with a VMA
>> being split or merged. In this case the vma->vm_start and vma->vm_end
>> fields may not match the address at which the page fault is occurring.
>>
>> This can only happen when the VMA is split, but in that case, the
>> anon_vma pointer of the new VMA will be the same as the original one,
>> because in __split_vma the new->anon_vma is set to src->anon_vma when
>> *new = *vma.
>>
>> So even if the VMA boundaries are not correct, the anon_vma pointer is
>> still valid.
>>
>> If the VMA has been merged, then the VMA in which it has been merged
>> must have the same anon_vma pointer otherwise the merge can't be done.
>>
>> So in all cases we know that the anon_vma is valid, since we have
>> checked before starting the speculative page fault that the anon_vma
>> pointer is valid for this VMA. And since there is an anon_vma, this
>> means that at one time a page has been backed and that, before the VMA
>> is cleaned, the page table lock would have to be grabbed to clean the
>> PTE, and the anon_vma field is checked once the PTE is locked.
>>
>> This patch introduces a new __page_add_new_anon_rmap() service which
>> doesn't check for the VMA boundaries, and creates a new inline one
>> which does the check.
>>
>> When called from a page fault handler, if this is not a speculative one,
>> there is a guarantee that vm_start and vm_end match the faulting address,
>> so this check is useless. In the context of the speculative page fault
>> handler, this check may be wrong but anon_vma is still valid as explained
>> above.
>>
>> Signed-off-by: Laurent Dufour 
> 
> I'm indifferent on this: it could be argued both sides that the new 
> function and its variant for a simple VM_BUG_ON() isn't worth it and it 
> would should rather be done in the callers of page_add_new_anon_rmap().  
> It feels like it would be better left to the caller and add a comment to 
> page_add_anon_rmap() itself in mm/rmap.c.

Well, there are 11 calls to page_add_new_anon_rmap() which would need to be
changed, and future ones too.

By introducing __page_add_new_anon_rmap() my goal was to make clear that this
call is *special* and that calling it is not the usual way. This also implies
that most of the time the check is done (when built with the right config) and
that we will not miss any.
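
For reference, the split described above amounts to the following pattern
(a sketch with abbreviated declarations, not the exact patch):

/* No boundary check: for the speculative path, where vm_start/vm_end
 * may not match the faulting address (see the commit message above). */
void __page_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma,
			      unsigned long address, bool compound);

/* Usual entry point: keeps the sanity check for ordinary callers. */
static inline void page_add_new_anon_rmap(struct page *page,
					  struct vm_area_struct *vma,
					  unsigned long address, bool compound)
{
	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
	__page_add_new_anon_rmap(page, vma, address, compound);
}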



Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock

2018-04-10 Thread Laurent Dufour


On 03/04/2018 02:11, David Rientjes wrote:
> On Tue, 13 Mar 2018, Laurent Dufour wrote:
> 
>> This change is inspired by the Peter's proposal patch [1] which was
>> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
>> that particular case, and it is introducing major performance degradation
>> due to excessive scheduling operations.
>>
>> To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
>> is protecting it access using a rwlock.  As the mm_rb tree is a O(log n)
>> search it is safe to protect it using such a lock.  The VMA cache is not
>> protected by the new rwlock and it should not be used without holding the
>> mmap_sem.
>>
>> To allow the picked VMA structure to be used once the rwlock is released, a
>> use count is added to the VMA structure. When the VMA is allocated it is
>> set to 1.  Each time the VMA is picked with the rwlock held its use count
>> is incremented. Each time the VMA is released it is decremented. When the
>> use count hits zero, this means that the VMA is no more used and should be
>> freed.
>>
>> This patch is preparing for 2 kind of VMA access :
>>  - as usual, under the control of the mmap_sem,
>>  - without holding the mmap_sem for the speculative page fault handler.
>>
>> Access done under the control the mmap_sem doesn't require to grab the
>> rwlock to protect read access to the mm_rb tree, but access in write must
>> be done under the protection of the rwlock too. This affects inserting and
>> removing of elements in the RB tree.
>>
>> The patch is introducing 2 new functions:
>>  - vma_get() to find a VMA based on an address by holding the new rwlock.
>>  - vma_put() to release the VMA when its no more used.
>> These services are designed to be used when access are made to the RB tree
>> without holding the mmap_sem.
>>
>> When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
>> we rely on the WMB done when releasing the rwlock to serialize the write
>> with the RMB done in a later patch to check for the VMA's validity.
>>
>> When free_vma is called, the file associated with the VMA is closed
>> immediately, but the policy and the file structure remained in used until
>> the VMA's use count reach 0, which may happens later when exiting an
>> in progress speculative page fault.
>>
>> [1] https://patchwork.kernel.org/patch/5108281/
>>
>> Cc: Peter Zijlstra (Intel) 
>> Cc: Matthew Wilcox 
>> Signed-off-by: Laurent Dufour 
> 
> Can __free_vma() be generalized for mm/nommu.c's delete_vma() and 
> do_mmap()?

Good question!
I guess if there is no MMU, there is no page fault, so no speculative page
fault, and this patch is clearly required by the speculative page fault handler.
By the way, I should probably make CONFIG_SPECULATIVE_PAGE_FAULT dependent on
CONFIG_MMU.

This being said, if your idea is to extend the mm_rb tree rwlocking to the
nommu case, then that is another story, and I am wondering if there is a real
need in such a case. But I have to admit I'm not so familiar with kernels built
for MMU-less systems.

Am I missing something?

Thanks,
Laurent.
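
In sketch form, the vma_get()/vma_put() pair described in the quoted commit
message looks roughly like this (the lock and reference-count field names
follow the description above; find_vma_rb() is a hypothetical stand-in for
the rb-tree walk):

struct vm_area_struct *vma_get(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	read_lock(&mm->mm_rb_lock);
	vma = find_vma_rb(mm, addr);		/* O(log n) lookup */
	if (vma)
		atomic_inc(&vma->vm_ref_count);	/* pin it past the unlock */
	read_unlock(&mm->mm_rb_lock);
	return vma;
}

void vma_put(struct vm_area_struct *vma)
{
	/* Last reference frees the VMA; the mempolicy and file structure
	 * stay alive until this point, as the commit message notes. */
	if (atomic_dec_and_test(&vma->vm_ref_count))
		__free_vma(vma);
}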



Re: [PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-10 Thread Matthew Wilcox
On Tue, Apr 10, 2018 at 05:25:50PM +0200, Laurent Dufour wrote:
>  arch/powerpc/include/asm/pte-common.h  | 3 ---
>  arch/riscv/Kconfig | 1 +
>  arch/s390/Kconfig  | 1 +

You forgot to delete __HAVE_ARCH_PTE_SPECIAL from
arch/riscv/include/asm/pgtable-bits.h


Re: [PATCH v2 2/2] mm: remove odd HAVE_PTE_SPECIAL

2018-04-10 Thread Robin Murphy

On 10/04/18 16:25, Laurent Dufour wrote:

Remove the additional define HAVE_PTE_SPECIAL and rely directly on
CONFIG_ARCH_HAS_PTE_SPECIAL.

There is no functional change introduced by this patch

Signed-off-by: Laurent Dufour 
---
  mm/memory.c | 23 ++-
  1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 96910c625daa..53b6344a90d2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -817,19 +817,13 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
   * PFNMAP mappings in order to support COWable mappings.
   *
   */
-#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-# define HAVE_PTE_SPECIAL 1
-#else
-# define HAVE_PTE_SPECIAL 0
-#endif
  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 pte_t pte, bool with_public_device)
  {
unsigned long pfn = pte_pfn(pte);
  
-	if (HAVE_PTE_SPECIAL) {

-   if (likely(!pte_special(pte)))
-   goto check_pfn;
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL


Nit: Couldn't you use IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) within the 
existing code structure to avoid having to add these #ifdefs?


Robin.


+   if (unlikely(pte_special(pte))) {
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
@@ -862,7 +856,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
return NULL;
}
  
-	/* !HAVE_PTE_SPECIAL case follows: */

+#else  /* CONFIG_ARCH_HAS_PTE_SPECIAL */
  
  	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {

if (vma->vm_flags & VM_MIXEDMAP) {
@@ -881,7 +875,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
  
  	if (is_zero_pfn(pfn))

return NULL;
-check_pfn:
+#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
+
if (unlikely(pfn > highest_memmap_pfn)) {
print_bad_pte(vma, addr, pte, NULL);
return NULL;
@@ -891,7 +886,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
 * NOTE! We still have PageReserved() pages in the page tables.
 * eg. VDSO mappings can cause them to exist.
 */
-out:
+out: __maybe_unused
return pfn_to_page(pfn);
  }
  
@@ -904,7 +899,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

/*
 * There is no pmd_special() but there may be special pmds, e.g.
 * in a direct-access (dax) mapping, so let's just replicate the
-* !HAVE_PTE_SPECIAL case from vm_normal_page() here.
+* !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
 */
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
@@ -1926,6 +1921,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, 
unsigned long addr,
  
  	track_pfn_insert(vma, &pgprot, pfn);
  
+#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL

/*
 * If we don't have pte special, then we have to use the pfn_valid()
 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
@@ -1933,7 +1929,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, 
unsigned long addr,
 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
 * without pte special, it would there be refcounted as a normal page.
 */
-   if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+   if (!pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
struct page *page;
  
  		/*

@@ -1944,6 +1940,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, 
unsigned long addr,
page = pfn_to_page(pfn_t_to_pfn(pfn));
return insert_page(vma, addr, page, pgprot);
}
+#endif
return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
  }
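
A hedged illustration of Robin's IS_ENABLED() suggestion above
(handle_special_pte() is a hypothetical helper standing in for the
special-pte branch):

	/* IS_ENABLED() keeps one code structure and lets the compiler
	 * eliminate the dead branch, instead of #ifdef blocks. */
	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
		if (unlikely(pte_special(pte)))
			return handle_special_pte(vma, addr, pte);
	}

The trade-off is that both branches must still compile in every
configuration, so symbols used only in one configuration need to be
declared unconditionally.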
  



[PATCH v2 1/2] mm: introduce ARCH_HAS_PTE_SPECIAL

2018-04-10 Thread Laurent Dufour
Currently, PTE special support is turned on in per-architecture header
files. Most of the time, it is defined in arch/*/include/asm/pgtable.h,
depending (or not) on some other per-architecture static definition.

This patch introduces a new configuration variable to manage this directly
in the Kconfig files. It would later replace __HAVE_ARCH_PTE_SPECIAL.

Here are notes for some architectures where the definition of
__HAVE_ARCH_PTE_SPECIAL is not obvious:

arm
 __HAVE_ARCH_PTE_SPECIAL is currently defined in
arch/arm/include/asm/pgtable-3level.h, which is included by
arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.

powerpc
__HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
 - arch/powerpc/include/asm/book3s/64/pgtable.h
 - arch/powerpc/include/asm/pte-common.h
The first one is included if (PPC_BOOK3S & PPC64) while the second is
included in all the other cases.
So select ARCH_HAS_PTE_SPECIAL all the time.

sparc:
__HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
defined(__arch64__), which are defined through the compiler in
sparc/Makefile if !SPARC32, which I assume to mean SPARC64.
So select ARCH_HAS_PTE_SPECIAL if SPARC64

There is no functional change introduced by this patch.

Suggested-by: Jerome Glisse 
Reviewed-by: Jerome Glisse 
Signed-off-by: Laurent Dufour 
---
 Documentation/features/vm/pte_special/arch-support.txt | 2 +-
 arch/arc/Kconfig   | 1 +
 arch/arc/include/asm/pgtable.h | 2 --
 arch/arm/Kconfig   | 1 +
 arch/arm/include/asm/pgtable-3level.h  | 1 -
 arch/arm64/Kconfig | 1 +
 arch/arm64/include/asm/pgtable.h   | 2 --
 arch/powerpc/Kconfig   | 1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h   | 3 ---
 arch/powerpc/include/asm/pte-common.h  | 3 ---
 arch/riscv/Kconfig | 1 +
 arch/s390/Kconfig  | 1 +
 arch/s390/include/asm/pgtable.h| 1 -
 arch/sh/Kconfig| 1 +
 arch/sh/include/asm/pgtable.h  | 2 --
 arch/sparc/Kconfig | 1 +
 arch/sparc/include/asm/pgtable_64.h| 3 ---
 arch/x86/Kconfig   | 1 +
 arch/x86/include/asm/pgtable_types.h   | 1 -
 include/linux/pfn_t.h  | 4 ++--
 mm/Kconfig | 3 +++
 mm/gup.c   | 4 ++--
 mm/memory.c| 2 +-
 23 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/Documentation/features/vm/pte_special/arch-support.txt 
b/Documentation/features/vm/pte_special/arch-support.txt
index 055004f467d2..cd05924ea875 100644
--- a/Documentation/features/vm/pte_special/arch-support.txt
+++ b/Documentation/features/vm/pte_special/arch-support.txt
@@ -1,6 +1,6 @@
 #
 # Feature name:  pte_special
-# Kconfig:   __HAVE_ARCH_PTE_SPECIAL
+# Kconfig:   ARCH_HAS_PTE_SPECIAL
 # description:   arch supports the pte_special()/pte_mkspecial() VM 
APIs
 #
 ---
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index d76bf4a83740..8516e2b0239a 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -44,6 +44,7 @@ config ARC
select HAVE_GENERIC_DMA_COHERENT
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
+   select ARCH_HAS_PTE_SPECIAL
 
 config MIGHT_HAVE_PCI
bool
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index 08fe33830d4b..8ec5599a0957 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -320,8 +320,6 @@ PTE_BIT_FUNC(mkexec,|= (_PAGE_EXECUTE));
 PTE_BIT_FUNC(mkspecial,|= (_PAGE_SPECIAL));
 PTE_BIT_FUNC(mkhuge,   |= (_PAGE_HW_SZ));
 
-#define __HAVE_ARCH_PTE_SPECIAL
-
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..c088c851b235 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -8,6 +8,7 @@ config ARM
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
+   select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
diff --git a/arch/arm/include/asm/pgtable-3level.h 
b/arch/arm/include/asm/pgtable-3level.h
index 2a4836087358..6d50a11d7793 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ 

[PATCH v2 2/2] mm: remove odd HAVE_PTE_SPECIAL

2018-04-10 Thread Laurent Dufour
Remove the additional define HAVE_PTE_SPECIAL and rely directly on
CONFIG_ARCH_HAS_PTE_SPECIAL.

There is no functional change introduced by this patch

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 96910c625daa..53b6344a90d2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -817,19 +817,13 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
  * PFNMAP mappings in order to support COWable mappings.
  *
  */
-#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-# define HAVE_PTE_SPECIAL 1
-#else
-# define HAVE_PTE_SPECIAL 0
-#endif
 struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 pte_t pte, bool with_public_device)
 {
unsigned long pfn = pte_pfn(pte);
 
-   if (HAVE_PTE_SPECIAL) {
-   if (likely(!pte_special(pte)))
-   goto check_pfn;
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+   if (unlikely(pte_special(pte))) {
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
@@ -862,7 +856,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
return NULL;
}
 
-   /* !HAVE_PTE_SPECIAL case follows: */
+#else  /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
@@ -881,7 +875,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
 
if (is_zero_pfn(pfn))
return NULL;
-check_pfn:
+#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
+
if (unlikely(pfn > highest_memmap_pfn)) {
print_bad_pte(vma, addr, pte, NULL);
return NULL;
@@ -891,7 +886,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
 * NOTE! We still have PageReserved() pages in the page tables.
 * eg. VDSO mappings can cause them to exist.
 */
-out:
+out: __maybe_unused
return pfn_to_page(pfn);
 }
 
@@ -904,7 +899,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, 
unsigned long addr,
/*
 * There is no pmd_special() but there may be special pmds, e.g.
 * in a direct-access (dax) mapping, so let's just replicate the
-* !HAVE_PTE_SPECIAL case from vm_normal_page() here.
+* !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
 */
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
@@ -1926,6 +1921,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, 
unsigned long addr,
 
	track_pfn_insert(vma, &pgprot, pfn);
 
+#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
/*
 * If we don't have pte special, then we have to use the pfn_valid()
 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
@@ -1933,7 +1929,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, 
unsigned long addr,
 * than insert_pfn).  If a zero_pfn were inserted into a VM_MIXEDMAP
 * without pte special, it would there be refcounted as a normal page.
 */
-   if (!HAVE_PTE_SPECIAL && !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+   if (!pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
struct page *page;
 
/*
@@ -1944,6 +1940,7 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, 
unsigned long addr,
page = pfn_to_page(pfn_t_to_pfn(pfn));
return insert_page(vma, addr, page, pgprot);
}
+#endif
return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
 }
 
-- 
2.7.4



[PATCH v2 0/2] move __HAVE_ARCH_PTE_SPECIAL in Kconfig

2018-04-10 Thread Laurent Dufour
The per-architecture __HAVE_ARCH_PTE_SPECIAL is defined statically in the
per-architecture header files. This doesn't allow making other
configuration options dependent on it.

The first patch of this series replaces __HAVE_ARCH_PTE_SPECIAL with
CONFIG_ARCH_HAS_PTE_SPECIAL defined in the Kconfig files,
setting it automatically for architectures that were already setting it
in a header file.

The second patch is removing the odd define HAVE_PTE_SPECIAL which is a
duplicate of CONFIG_ARCH_HAS_PTE_SPECIAL.

There is no functional change introduced by this series.

Laurent Dufour (2):
  mm: introduce ARCH_HAS_PTE_SPECIAL
  mm: remove odd HAVE_PTE_SPECIAL

 .../features/vm/pte_special/arch-support.txt   |  2 +-
 arch/arc/Kconfig   |  1 +
 arch/arc/include/asm/pgtable.h |  2 --
 arch/arm/Kconfig   |  1 +
 arch/arm/include/asm/pgtable-3level.h  |  1 -
 arch/arm64/Kconfig |  1 +
 arch/arm64/include/asm/pgtable.h   |  2 --
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h   |  3 ---
 arch/powerpc/include/asm/pte-common.h  |  3 ---
 arch/riscv/Kconfig |  1 +
 arch/s390/Kconfig  |  1 +
 arch/s390/include/asm/pgtable.h|  1 -
 arch/sh/Kconfig|  1 +
 arch/sh/include/asm/pgtable.h  |  2 --
 arch/sparc/Kconfig |  1 +
 arch/sparc/include/asm/pgtable_64.h|  3 ---
 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/pgtable_types.h   |  1 -
 include/linux/pfn_t.h  |  4 ++--
 mm/Kconfig |  3 +++
 mm/gup.c   |  4 ++--
 mm/memory.c| 23 ++++++++++-------------
 23 files changed, 27 insertions(+), 36 deletions(-)

-- 
2.7.4



Re: [alsa-devel] [PATCH] ASoC: fsl_esai: Fix divisor calculation failure at lower ratio

2018-04-10 Thread Fabio Estevam
Hi Nicolin,

On Sun, Apr 8, 2018 at 8:57 PM, Nicolin Chen  wrote:
> When the desired ratio is less than 256, the savesub (tolerance)
> in the calculation would become 0. This will then fail the loop-
> search immediately without reporting any errors.
>
> But if the ratio is small enough, there is no need to calculate
> the tolerance because PM divisor alone is enough to get the ratio.
>
> So a simple fix could be just to set PM directly instead of going
> into the loop-search.
>
> Reported-by: Marek Vasut 
> Signed-off-by: Nicolin Chen 
> Cc: Marek Vasut 

Thanks for the fix:

Reviewed-by: Fabio Estevam 
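
For readers of the thread, a sketch of the shortcut Nicolin describes
(assuming the PM divisor alone covers ratios 1..256; the function name and
exact register encoding here are illustrative, not the driver's code):

static int esai_pick_pm(unsigned int ratio)
{
	/* Below 256, the tolerance ("savesub") in the unpatched code
	 * rounds down to 0, so the loop-search fails without an error.
	 * PM alone already reaches these ratios, so set it directly. */
	if (ratio >= 1 && ratio <= 256)
		return ratio - 1;	/* value for the PM register field */
	return -EINVAL;			/* larger ratios: keep the loop-search */
}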


Re: [PATCH 2/3] mm: replace __HAVE_ARCH_PTE_SPECIAL

2018-04-10 Thread Laurent Dufour
On 09/04/2018 22:08, David Rientjes wrote:
> On Mon, 9 Apr 2018, Christoph Hellwig wrote:
> 
>>> -#ifdef __HAVE_ARCH_PTE_SPECIAL
>>> +#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>>>  # define HAVE_PTE_SPECIAL 1
>>>  #else
>>>  # define HAVE_PTE_SPECIAL 0
>>
>> I'd say kill this odd indirection and just use the
>> CONFIG_ARCH_HAS_PTE_SPECIAL symbol directly.
>>
>>
> 
> Agree, and I think it would be easier to audit/review if patches 1 and 3 
> were folded together to see the relationship between the newly added 
> selects and what #define's it is replacing.  Otherwise, looks good!
>

Ok I will fold the 3 patches and introduce a new one removing HAVE_PTE_SPECIAL.

Thanks,
Laurent.



[PATCH v2 2/2] powerpc/fadump: Do not use hugepages when fadump is active

2018-04-10 Thread Hari Bathini
The FADump capture kernel boots in a restricted memory environment,
preserving the context of the previous kernel in order to save the vmcore.
Supporting hugepages in such an environment makes things unnecessarily
complicated, as hugepages need memory set aside for them. This means most
of the capture kernel's memory is used in supporting hugepages. In most
cases, this results in out-of-memory issues while booting the FADump
capture kernel. But hugepages are not of much use in the capture kernel,
whose only job is to save the vmcore. So, disabling hugepage support when
fadump is active is a reliable solution for the out-of-memory issues.
Introduce a flag variable to disable HugeTLB support when fadump is active.

Signed-off-by: Hari Bathini 
---

Changes in v2:
* Introduce a hugetlb_disabled flag to enable/disable hugepage support &
  use that flag to disable hugepage support when fadump is active.


 arch/powerpc/include/asm/page.h |1 +
 arch/powerpc/kernel/fadump.c|8 ++++++++
 arch/powerpc/mm/hash_utils_64.c |6 ++++--
 arch/powerpc/mm/hugetlbpage.c   |7 +++++++
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 8da5d4c..40aee93 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -39,6 +39,7 @@
 
 #ifndef __ASSEMBLY__
 #ifdef CONFIG_HUGETLB_PAGE
+extern bool hugetlb_disabled;
 extern unsigned int HPAGE_SHIFT;
 #else
 #define HPAGE_SHIFT PAGE_SHIFT
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index bea8d5f..8ceabef4 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -402,6 +402,14 @@ int __init fadump_reserve_mem(void)
if (fw_dump.dump_active) {
pr_info("Firmware-assisted dump is active.\n");
 
+#ifdef CONFIG_HUGETLB_PAGE
+   /*
+* FADump capture kernel doesn't care much about hugepages.
+* In fact, handling hugepages in capture kernel is asking for
+* trouble. So, disable HugeTLB support when fadump is active.
+*/
+   hugetlb_disabled = true;
+#endif
/*
 * If last boot has crashed then reserve all the memory
 * above boot_memory_size so that we don't touch it until
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index cf290d41..eab8f1d 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -571,8 +571,10 @@ static void __init htab_scan_page_sizes(void)
}
 
 #ifdef CONFIG_HUGETLB_PAGE
-   /* Reserve 16G huge page memory sections for huge pages */
-   of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
+   if (!hugetlb_disabled) {
+   /* Reserve 16G huge page memory sections for huge pages */
+   of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
+   }
 #endif /* CONFIG_HUGETLB_PAGE */
 }
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 876da2b..18c080a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -35,6 +35,8 @@
 #define PAGE_SHIFT_16M 24
 #define PAGE_SHIFT_16G 34
 
+bool hugetlb_disabled = false;
+
 unsigned int HPAGE_SHIFT;
 EXPORT_SYMBOL(HPAGE_SHIFT);
 
@@ -653,6 +655,11 @@ static int __init hugetlbpage_init(void)
 {
int psize;
 
+   if (hugetlb_disabled) {
+   pr_info("HugeTLB support is disabled!\n");
+   return 0;
+   }
+
 #if !defined(CONFIG_PPC_FSL_BOOK3E) && !defined(CONFIG_PPC_8xx)
if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
return -ENODEV;



[PATCH v2 1/2] powerpc/fadump: exclude memory holes while reserving memory in second kernel

2018-04-10 Thread Hari Bathini
From: Mahesh Salgaonkar 

The second kernel, during early boot after the crash, reserves the rest of
the memory above the boot memory size to make sure it does not touch any of
the dump memory area. It uses memblock_reserve(), which reserves the
specified memory region irrespective of memory holes present within that
region. The previous kernel may have hot-removed some of its memory,
leaving memory holes behind. In such cases the fadump kernel reports an
incorrect number of reserved pages through the arch_reserved_kernel_pages()
hook, causing the kernel to hang or panic.

Fix this by excluding memory holes while reserving rest of the memory
above boot memory size during second kernel boot after crash.

Signed-off-by: Mahesh Salgaonkar 
Signed-off-by: Hari Bathini 
---

Changes in v2:
* Split crash dump memory reservation into a separate function.



 arch/powerpc/kernel/fadump.c |   29 +++--
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3c2c268..bea8d5f 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -335,6 +335,26 @@ static unsigned long get_fadump_area_size(void)
return size;
 }
 
+static void __init fadump_reserve_crash_area(unsigned long base,
+unsigned long size)
+{
+   struct memblock_region *reg;
+   unsigned long mstart, mend, msize;
+
+   for_each_memblock(memory, reg) {
+   mstart = max_t(unsigned long, base, reg->base);
+   mend = reg->base + reg->size;
+   mend = min(base + size, mend);
+
+   if (mstart < mend) {
+   msize = mend - mstart;
+   memblock_reserve(mstart, msize);
+   pr_info("Reserved %ldMB of memory at %#016lx for saving 
crash dump\n",
+   (msize >> 20), mstart);
+   }
+   }
+}
+
 int __init fadump_reserve_mem(void)
 {
unsigned long base, size, memory_boundary;
@@ -380,7 +400,8 @@ int __init fadump_reserve_mem(void)
memory_boundary = memblock_end_of_DRAM();
 
if (fw_dump.dump_active) {
-   printk(KERN_INFO "Firmware-assisted dump is active.\n");
+   pr_info("Firmware-assisted dump is active.\n");
+
/*
 * If last boot has crashed then reserve all the memory
 * above boot_memory_size so that we don't touch it until
@@ -389,11 +410,7 @@ int __init fadump_reserve_mem(void)
 */
base = fw_dump.boot_memory_size;
size = memory_boundary - base;
-   memblock_reserve(base, size);
-   printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
-   "for saving crash dump\n",
-   (unsigned long)(size >> 20),
-   (unsigned long)(base >> 20));
+   fadump_reserve_crash_area(base, size);
 
fw_dump.fadumphdr_addr =

be64_to_cpu(fdm_active->rmr_region.destination_address) +



Re: [PATCH 2/3] powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops

2018-04-10 Thread Nicholas Piggin
On Tue, 10 Apr 2018 14:07:28 +0200
Alexandre Belloni  wrote:

> Hi Nicholas,
> 
> I would greatly appreciate a changelog and at least the cover letter
> because it is difficult to grasp how this relates to the previous
> patches you sent to the RTC mailing list. 

Yes, good point. Basically this change is "standalone" except for using
the OPAL_BUSY_DELAY_MS define from patch 1. That patch has a lot of
comments about firmware delays that I did not think would be too
interesting.

Basically we're adding msleep(10) here, because the firmware can
repeatedly return OPAL_BUSY for long periods, so we want to context
switch and respond to interrupts.
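
For reference, the standard form these call sites are being converted
to looks roughly like this (a minimal sketch only; opal_do_op() is a
made-up stand-in for whichever OPAL call is being retried):

	s64 rc = OPAL_BUSY;

	while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
		rc = opal_do_op();			/* hypothetical OPAL call */
		if (rc == OPAL_BUSY_EVENT) {
			msleep(OPAL_BUSY_DELAY_MS);	/* sleep rather than spin */
			opal_poll_events(NULL);		/* let OPAL process events */
		} else if (rc == OPAL_BUSY) {
			msleep(OPAL_BUSY_DELAY_MS);
		}
	}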

> 
> On 10/04/2018 21:49:32+1000, Nicholas Piggin wrote:
> > The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or
> > OPAL_BUSY_EVENT from firmware, which causes large scheduling
> > latencies, up to 50 seconds have been observed here when RTC stops
> > responding (BMC reboot can do it).
> > 
> > Fix this by converting it to the standard form OPAL_BUSY loop that
> > sleeps.
> > 
> > Fixes("powerpc/powernv: Add RTC and NVRAM support plus RTAS 
> > fallbacks"
> > Cc: Benjamin Herrenschmidt 
> > Cc: linux-...@vger.kernel.org
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/platforms/powernv/opal-rtc.c |  8 +++--
> >  drivers/rtc/rtc-opal.c| 37 ++-  
> 
> From what I understand, the changes in those files are fairly
> independent, they should probably be separated to ease merging.

I'm happy to do that. It's using the same firmware call, so I thought
a single patch would be fine. But I guess the boot call can be
dropped from this patch because it does not solve the problem
described in the changelog.

Would you be happy for the driver change to be merged via the powerpc
tree? The code being fixed here came from the same original patch as
a similar issue being fixed in the OPAL NVRAM driver so it might be
easier that way.

Thanks,
Nick


[RFC PATCH 5/5] KVM: PPC: Book3S HV: Radix do not clear partition scoped page table when page fault races with other vCPUs.

2018-04-10 Thread Nicholas Piggin
KVM with an SMP radix guest can get into storms of page faults and
tlbies, because partition scoped page table entries are invalidated
and TLB flushed when they are found to race with another page fault
that set them up.

This tends to make vCPUs pile up when several hit common addresses:
the page faults get serialized on common locks, each fault then
invalidates the previous entry, and the window before the new entry
is installed is long enough for more CPUs to hit page faults and
invalidate that new entry in turn.

There doesn't seem to be a need to invalidate when an existing entry
already matches, so skip the invalidation in that case. This solves
the tlbie storms.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 39 +++---
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index dab6b622011c..4af177d24f6c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -243,6 +243,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
pmd = pmd_offset(pud, gpa);
if (pmd_is_leaf(*pmd)) {
unsigned long lgpa = gpa & PMD_MASK;
+   pte_t old_pte = *pmdp_ptep(pmd);
 
/*
 * If we raced with another CPU which has just put
@@ -252,18 +253,17 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
ret = -EAGAIN;
goto out_unlock;
}
-   /* Valid 2MB page here already, remove it */
-   old = kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
- ~0UL, 0, lgpa, PMD_SHIFT);
-   kvmppc_radix_tlbie_page(kvm, lgpa, PMD_SHIFT);
-   if (old & _PAGE_DIRTY) {
-   unsigned long gfn = lgpa >> PAGE_SHIFT;
-   struct kvm_memory_slot *memslot;
-   memslot = gfn_to_memslot(kvm, gfn);
-   if (memslot && memslot->dirty_bitmap)
-   kvmppc_update_dirty_map(memslot,
-   gfn, PMD_SIZE);
+   WARN_ON_ONCE(pte_pfn(old_pte) != pte_pfn(pte));
+   if (pte_val(old_pte) == pte_val(pte)) {
+   ret = -EAGAIN;
+   goto out_unlock;
}
+
+   /* Valid 2MB page here already, remove it */
+   kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
+   0, pte_val(pte), lgpa, PMD_SHIFT);
+   ret = 0;
+   goto out_unlock;
} else if (level == 1 && !pmd_none(*pmd)) {
/*
 * There's a page table page here, but we wanted
@@ -274,6 +274,8 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
goto out_unlock;
}
if (level == 0) {
+   pte_t old_pte;
+
if (pmd_none(*pmd)) {
if (!new_ptep)
goto out_unlock;
@@ -281,13 +283,16 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
new_ptep = NULL;
}
ptep = pte_offset_kernel(pmd, gpa);
-   if (pte_present(*ptep)) {
+   old_pte = *ptep;
+   if (pte_present(old_pte)) {
/* PTE was previously valid, so invalidate it */
-   old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_PRESENT,
- 0, gpa, 0);
-   kvmppc_radix_tlbie_page(kvm, gpa, 0);
-   if (old & _PAGE_DIRTY)
-   mark_page_dirty(kvm, gpa >> PAGE_SHIFT);
+   WARN_ON_ONCE(pte_pfn(old_pte) != pte_pfn(pte));
+   if (pte_val(old_pte) == pte_val(pte)) {
+   ret = -EAGAIN;
+   goto out_unlock;
+   }
+   kvmppc_radix_update_pte(kvm, ptep, 0,
+   pte_val(pte), gpa, 0);
}
kvmppc_radix_set_pte_at(kvm, gpa, ptep, pte);
} else {
-- 
2.17.0



[RFC PATCH 4/5] KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest entry

2018-04-10 Thread Nicholas Piggin
Move this flushing out of assembly and have it use Linux TLB
flush implementations introduced earlier. This allows powerpc:tlbie
trace events to be used.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c| 21 +++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 43 +
 2 files changed, 21 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 81e2ea882d97..5d4783b5b47a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2680,7 +2680,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
int sub;
bool thr0_done;
unsigned long cmd_bit, stat_bit;
-   int pcpu, thr;
+   int pcpu, thr, tmp;
int target_threads;
int controlled_threads;
int trap;
@@ -2780,6 +2780,25 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
return;
}
 
+   /*
+    * Do we need to flush the TLB for the LPAR? (see TLB comment above)
+    * On POWER9, individual threads can come in here, but the
+    * TLB is shared between the 4 threads in a core, hence
+    * invalidating on one thread invalidates for all.
+    * Thus we make all 4 threads use the same bit here.
+    */
+   tmp = pcpu;
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   tmp &= ~0x3UL;
+   if (cpumask_test_cpu(tmp, &vc->kvm->arch.need_tlb_flush)) {
+   if (kvm_is_radix(vc->kvm))
+   radix__local_flush_tlb_lpid(vc->kvm->arch.lpid);
+   else
+   hash__local_flush_tlb_lpid(vc->kvm->arch.lpid);
+   /* Clear the bit after the TLB flush */
+   cpumask_clear_cpu(tmp, &vc->kvm->arch.need_tlb_flush);
+   }
+
kvmppc_clear_host_core(pcpu);
 
/* Decide on micro-threading (split-core) mode */
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index bd63fa8a08b5..6a23a0f3ceea 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -647,49 +647,8 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
mtspr   SPRN_LPID,r7
isync
 
-   /* See if we need to flush the TLB */
-   lhz r6,PACAPACAINDEX(r13)   /* test_bit(cpu, need_tlb_flush) */
-BEGIN_FTR_SECTION
-   /*
-* On POWER9, individual threads can come in here, but the
-* TLB is shared between the 4 threads in a core, hence
-* invalidating on one thread invalidates for all.
-* Thus we make all 4 threads use the same bit here.
-*/
-   clrrdi  r6,r6,2
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-   clrldi  r7,r6,64-6  /* extract bit number (6 bits) */
-   srdir6,r6,6 /* doubleword number */
-   sldir6,r6,3 /* address offset */
-   add r6,r6,r9
-   addir6,r6,KVM_NEED_FLUSH/* dword in kvm->arch.need_tlb_flush */
-   li  r8,1
-   sld r8,r8,r7
-   ld  r7,0(r6)
-   and.r7,r7,r8
-   beq 22f
-   /* Flush the TLB of any entries for this LPID */
-   lwz r0,KVM_TLB_SETS(r9)
-   mtctr   r0
-   li  r7,0x800/* IS field = 0b10 */
-   ptesync
-   li  r0,0/* RS for P9 version of tlbiel */
-   bne cr7, 29f
-28:tlbiel  r7  /* On P9, rs=0, RIC=0, PRS=0, R=0 */
-   addir7,r7,0x1000
-   bdnz28b
-   b   30f
-29:PPC_TLBIEL(7,0,2,1,1)   /* for radix, RIC=2, PRS=1, R=1 */
-   addir7,r7,0x1000
-   bdnz29b
-30:ptesync
-23:ldarx   r7,0,r6 /* clear the bit after TLB flushed */
-   andcr7,r7,r8
-   stdcx.  r7,0,r6
-   bne 23b
-
/* Add timebase offset onto timebase */
-22:ld  r8,VCORE_TB_OFFSET(r5)
+   ld  r8,VCORE_TB_OFFSET(r5)
cmpdi   r8,0
beq 37f
mftbr6  /* current host timebase */
-- 
2.17.0



[RFC PATCH 3/5] KVM: PPC: Book3S HV: kvmhv_p9_set_lpcr use Linux flush function

2018-04-10 Thread Nicholas Piggin
The existing flush uses the radix value for sets, and uses R=0
tlbiel instructions. This can't be quite right, but I'm not entirely
sure if this is the right way to fix it.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_builtin.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 0b9b8e188bfa..577769fbfae9 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -676,7 +676,7 @@ static void wait_for_sync(struct kvm_split_mode *sip, int 
phase)
 
 void kvmhv_p9_set_lpcr(struct kvm_split_mode *sip)
 {
-   unsigned long rb, set;
+   struct kvm *kvm = local_paca->kvm_hstate.kvm_vcpu->kvm;
 
/* wait for every other thread to get to real mode */
wait_for_sync(sip, PHASE_REALMODE);
@@ -689,14 +689,10 @@ void kvmhv_p9_set_lpcr(struct kvm_split_mode *sip)
/* Invalidate the TLB on thread 0 */
if (local_paca->kvm_hstate.tid == 0) {
sip->do_set = 0;
-   asm volatile("ptesync" : : : "memory");
-   for (set = 0; set < POWER9_TLB_SETS_RADIX; ++set) {
-   rb = TLBIEL_INVAL_SET_LPID +
-   (set << TLBIEL_INVAL_SET_SHIFT);
-   asm volatile(PPC_TLBIEL(%0, %1, 0, 0, 0) : :
-"r" (rb), "r" (0));
-   }
-   asm volatile("ptesync" : : : "memory");
+   if (kvm_is_radix(kvm))
+   radix__local_flush_tlb_lpid(kvm->arch.lpid);
+   else
+   hash__local_flush_tlb_lpid(kvm->arch.lpid);
}
 
/* indicate that we have done so and wait for others */
-- 
2.17.0



[RFC PATCH 2/5] KVM: PPC: Book3S HV: kvmppc_radix_tlbie_page use Linux flush function

2018-04-10 Thread Nicholas Piggin
This has the advantage of consolidating TLB flush code in fewer
places, and it also implements powerpc:tlbie trace events.

1GB pages should be handled without further modification.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 26 +++---
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 81d5ad26f9a1..dab6b622011c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -139,28 +139,16 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t 
eaddr,
return 0;
 }
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define MMU_BASE_PSIZE MMU_PAGE_64K
-#else
-#define MMU_BASE_PSIZE MMU_PAGE_4K
-#endif
-
 static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
unsigned int pshift)
 {
-   int psize = MMU_BASE_PSIZE;
-
-   if (pshift >= PMD_SHIFT)
-   psize = MMU_PAGE_2M;
-   addr &= ~0xfffUL;
-   addr |= mmu_psize_defs[psize].ap << 5;
-   asm volatile("ptesync": : :"memory");
-   asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1)
-: : "r" (addr), "r" (kvm->arch.lpid) : "memory");
-   if (cpu_has_feature(CPU_FTR_P9_TLBIE_BUG))
-   asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1)
-: : "r" (addr), "r" (kvm->arch.lpid) : "memory");
-   asm volatile("eieio ; tlbsync ; ptesync": : :"memory");
+   unsigned long psize = PAGE_SIZE;
+
+   if (pshift)
+   psize = 1UL << pshift;
+
+   addr &= ~(psize - 1);
+   radix__flush_tlb_lpid_page(kvm->arch.lpid, addr, psize);
 }
 
 unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
-- 
2.17.0



[RFC PATCH 1/5] powerpc/64s/mm: Implement LPID based TLB flushes to be used by KVM

2018-04-10 Thread Nicholas Piggin
Implement a local TLB flush for an entire LPID, for hash and radix,
and a global TLB flush for a partition scoped page in an LPID, for
radix.

These will be used by KVM in subsequent patches.

Signed-off-by: Nicholas Piggin 
---
 .../include/asm/book3s/64/tlbflush-hash.h |  2 +
 .../include/asm/book3s/64/tlbflush-radix.h|  5 ++
 arch/powerpc/mm/hash_native_64.c  |  8 ++
 arch/powerpc/mm/tlb-radix.c   | 87 +++
 4 files changed, 102 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 64d02a704bcb..8b328fd87722 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -53,6 +53,8 @@ static inline void arch_leave_lazy_mmu_mode(void)
 
 extern void hash__tlbiel_all(unsigned int action);
 
+extern void hash__local_flush_tlb_lpid(unsigned int lpid);
+
 extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize,
int ssize, unsigned long flags);
 extern void flush_hash_range(unsigned long number, int local);
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 19b45ba6caf9..2ddaadf3e9ea 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -51,4 +51,9 @@ extern void radix__flush_tlb_all(void);
 extern void radix__flush_tlb_pte_p9_dd1(unsigned long old_pte, struct 
mm_struct *mm,
unsigned long address);
 
+extern void radix__flush_tlb_lpid_page(unsigned int lpid,
+   unsigned long addr,
+   unsigned long page_size);
+extern void radix__local_flush_tlb_lpid(unsigned int lpid);
+
 #endif
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 1d049c78c82a..2f02cd780c19 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -294,6 +294,14 @@ static inline void tlbie(unsigned long vpn, int psize, int 
apsize,
raw_spin_unlock(&native_tlbie_lock);
 }
 
+void hash__local_flush_tlb_lpid(unsigned int lpid)
+{
+   VM_BUG_ON(mfspr(SPRN_LPID) != lpid);
+
+   hash__tlbiel_all(TLB_INVAL_SCOPE_LPID);
+}
+EXPORT_SYMBOL_GPL(hash__local_flush_tlb_lpid);
+
 static inline void native_lock_hpte(struct hash_pte *hptep)
 {
unsigned long *word = (unsigned long *)&hptep->v;
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 2fba6170ab3f..f246fb0ac049 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -119,6 +119,22 @@ static inline void __tlbie_pid(unsigned long pid, unsigned 
long ric)
trace_tlbie(0, 0, rb, rs, ric, prs, r);
 }
 
+static inline void __tlbiel_lpid(unsigned long lpid, int set,
+   unsigned long ric)
+{
+   unsigned long rb,rs,prs,r;
+
+   rb = PPC_BIT(52); /* IS = 2 */
+   rb |= set << PPC_BITLSHIFT(51);
+   rs = 0;  /* LPID comes from LPIDR */
+   prs = 0; /* partition scoped */
+   r = 1;   /* radix format */
+
+   asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
+: : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : 
"memory");
+   trace_tlbie(lpid, 1, rb, rs, ric, prs, r);
+}
+
 static inline void __tlbiel_va(unsigned long va, unsigned long pid,
   unsigned long ap, unsigned long ric)
 {
@@ -151,6 +167,22 @@ static inline void __tlbie_va(unsigned long va, unsigned 
long pid,
trace_tlbie(0, 0, rb, rs, ric, prs, r);
 }
 
+static inline void __tlbie_lpid_va(unsigned long va, unsigned long lpid,
+ unsigned long ap, unsigned long ric)
+{
+   unsigned long rb,rs,prs,r;
+
+   rb = va & ~(PPC_BITMASK(52, 63));
+   rb |= ap << PPC_BITLSHIFT(58);
+   rs = lpid;
+   prs = 0; /* partition scoped */
+   r = 1;   /* radix format */
+
+   asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
+: : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : 
"memory");
+   trace_tlbie(lpid, 0, rb, rs, ric, prs, r);
+}
+
 static inline void fixup_tlbie(void)
 {
unsigned long pid = 0;
@@ -215,6 +247,34 @@ static inline void _tlbie_pid(unsigned long pid, unsigned 
long ric)
asm volatile("eieio; tlbsync; ptesync": : :"memory");
 }
 
+static inline void _tlbiel_lpid(unsigned long lpid, unsigned long ric)
+{
+   int set;
+
+   VM_BUG_ON(mfspr(SPRN_LPID) != lpid);
+
+   asm volatile("ptesync": : :"memory");
+
+   /*
+* Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL,
+* also flush the entire Page Walk Cache.
+*/
+   __tlbiel_lpid(lpid, 0, ric);
+
+   /* For PWC, only one flush is needed */
+   if (ric == RIC_FLUSH_PWC) {
+   asm 

[RFC PATCH 0/5] KVM TLB flushing improvements

2018-04-10 Thread Nicholas Piggin
This series adds powerpc:tlbie tracepoints for radix partition
scoped invalidations. After I started getting some traces on a
32 vCPU radix guest it showed a problem with partition scoped
faults/invalidates, so I had a try at fixing it. This seems to be
stable on radix so far (haven't tested hash yet).

Thanks,
Nick

Nicholas Piggin (5):
  powerpc/64s/mm: Implement LPID based TLB flushes to be used by KVM
  KVM: PPC: Book3S HV: kvmppc_radix_tlbie_page use Linux flush function
  KVM: PPC: Book3S HV: kvmhv_p9_set_lpcr use Linux flush function
  KVM: PPC: Book3S HV: handle need_tlb_flush in C before low-level guest
entry
  KVM: PPC: Book3S HV: Radix do not clear partition scoped page table
when page fault races with other vCPUs.

 .../include/asm/book3s/64/tlbflush-hash.h |  2 +
 .../include/asm/book3s/64/tlbflush-radix.h|  5 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 65 +++---
 arch/powerpc/kvm/book3s_hv.c  | 21 -
 arch/powerpc/kvm/book3s_hv_builtin.c  | 14 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 43 +
 arch/powerpc/mm/hash_native_64.c  |  8 ++
 arch/powerpc/mm/tlb-radix.c   | 87 +++
 8 files changed, 157 insertions(+), 88 deletions(-)

-- 
2.17.0



Re: [PATCH 2/3] powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops

2018-04-10 Thread Alexandre Belloni
Hi Nicholas,

I would greatly appreciate a changelog and at least the cover letter
because it is difficult to grasp how this relates to the previous
patches you sent to the RTC mailing list. 

On 10/04/2018 21:49:32+1000, Nicholas Piggin wrote:
> The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or
> OPAL_BUSY_EVENT from firmware, which causes large scheduling
> latencies, up to 50 seconds have been observed here when RTC stops
> responding (BMC reboot can do it).
> 
> Fix this by converting it to the standard form OPAL_BUSY loop that
> sleeps.
> 
> Fixes: 628daa8d5abfd ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
> Cc: Benjamin Herrenschmidt 
> Cc: linux-...@vger.kernel.org
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/platforms/powernv/opal-rtc.c |  8 +++--
>  drivers/rtc/rtc-opal.c| 37 ++-

From what I understand, the changes in those files are fairly
independent, they should probably be separated to ease merging.

>  2 files changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal-rtc.c 
> b/arch/powerpc/platforms/powernv/opal-rtc.c
> index f8868864f373..aa2a5139462e 100644
> --- a/arch/powerpc/platforms/powernv/opal-rtc.c
> +++ b/arch/powerpc/platforms/powernv/opal-rtc.c
> @@ -48,10 +48,12 @@ unsigned long __init opal_get_boot_time(void)
>  
>   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
>   rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
> - if (rc == OPAL_BUSY_EVENT)
> + if (rc == OPAL_BUSY_EVENT) {
> + mdelay(OPAL_BUSY_DELAY_MS);
>   opal_poll_events(NULL);
> - else if (rc == OPAL_BUSY)
> - mdelay(10);
> + } else if (rc == OPAL_BUSY) {
> + mdelay(OPAL_BUSY_DELAY_MS);
> + }
>   }
>   if (rc != OPAL_SUCCESS)
>   return 0;
> diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c
> index 304e891e35fc..60f2250fd96b 100644
> --- a/drivers/rtc/rtc-opal.c
> +++ b/drivers/rtc/rtc-opal.c
> @@ -57,7 +57,7 @@ static void tm_to_opal(struct rtc_time *tm, u32 *y_m_d, u64 
> *h_m_s_ms)
>  
>  static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm)
>  {
> - long rc = OPAL_BUSY;
> + s64 rc = OPAL_BUSY;
>   int retries = 10;
>   u32 y_m_d;
>   u64 h_m_s_ms;
> @@ -66,13 +66,17 @@ static int opal_get_rtc_time(struct device *dev, struct 
> rtc_time *tm)
>  
>   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
>   rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
> - if (rc == OPAL_BUSY_EVENT)
> + if (rc == OPAL_BUSY_EVENT) {
> + msleep(OPAL_BUSY_DELAY_MS);
>   opal_poll_events(NULL);
> - else if (retries-- && (rc == OPAL_HARDWARE
> -|| rc == OPAL_INTERNAL_ERROR))
> - msleep(10);
> - else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
> - break;
> + } else if (rc == OPAL_BUSY) {
> + msleep(OPAL_BUSY_DELAY_MS);
> + } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) {
> + if (retries--) {
> + msleep(10); /* Wait 10ms before retry */
> + rc = OPAL_BUSY; /* go around again */
> + }
> + }
>   }
>  
>   if (rc != OPAL_SUCCESS)
> @@ -87,21 +91,26 @@ static int opal_get_rtc_time(struct device *dev, struct 
> rtc_time *tm)
>  
>  static int opal_set_rtc_time(struct device *dev, struct rtc_time *tm)
>  {
> - long rc = OPAL_BUSY;
> + s64 rc = OPAL_BUSY;
>   int retries = 10;
>   u32 y_m_d = 0;
>   u64 h_m_s_ms = 0;
>  
>   tm_to_opal(tm, &y_m_d, &h_m_s_ms);
> +
>   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
>   rc = opal_rtc_write(y_m_d, h_m_s_ms);
> - if (rc == OPAL_BUSY_EVENT)
> + if (rc == OPAL_BUSY_EVENT) {
> + msleep(OPAL_BUSY_DELAY_MS);
>   opal_poll_events(NULL);
> - else if (retries-- && (rc == OPAL_HARDWARE
> -|| rc == OPAL_INTERNAL_ERROR))
> - msleep(10);
> - else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
> - break;
> + } else if (rc == OPAL_BUSY) {
> + msleep(OPAL_BUSY_DELAY_MS);
> + } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) {
> + if (retries--) {
> + msleep(10); /* Wait 10ms before retry */
> + rc = OPAL_BUSY; /* go around again */
> + }
> + }
>   }
>  
>   return rc == OPAL_SUCCESS ? 0 : -EIO;
> -- 
> 2.17.0
> 


[PATCH 3/3] powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops

2018-04-10 Thread Nicholas Piggin
The OPAL NVRAM driver does not sleep in case it gets OPAL_BUSY or
OPAL_BUSY_EVENT from firmware, which causes large scheduling
latencies, and various lockup errors to trigger (again, BMC reboot
can cause it).

Fix this by converting it to the standard form OPAL_BUSY loop that
sleeps.

Fixes: 628daa8d5abfd ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
Cc: Benjamin Herrenschmidt 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/opal-nvram.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-nvram.c 
b/arch/powerpc/platforms/powernv/opal-nvram.c
index ba2ff06a2c98..1bceb95f422d 100644
--- a/arch/powerpc/platforms/powernv/opal-nvram.c
+++ b/arch/powerpc/platforms/powernv/opal-nvram.c
@@ -11,6 +11,7 @@
 
 #define DEBUG
 
+#include <linux/delay.h>
 #include 
 #include 
 #include 
@@ -56,8 +57,12 @@ static ssize_t opal_nvram_write(char *buf, size_t count, 
loff_t *index)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_write_nvram(__pa(buf), count, off);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   }
}
 
if (rc)
-- 
2.17.0



[PATCH 0/3] Fix RTC and NVRAM OPAL_BUSY loops

2018-04-10 Thread Nicholas Piggin
This is a couple of important fixes broken out of the series
"first step of standardising OPAL_BUSY handling", that prevents
the kernel from locking up if the NVRAM or RTC hardware does not
respond.

Another one, the console driver, has a similar problem that has
also been hit in testing, but that requires larger fixes to the
opal console and hvc tty driver that won't make it for 4.17.

Thanks,
Nick

Nicholas Piggin (3):
  powerpc/powernv: define a standard delay for OPAL_BUSY type retry
loops
  powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops
  powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops

 arch/powerpc/include/asm/opal.h |  3 ++
 arch/powerpc/platforms/powernv/opal-nvram.c |  7 +++-
 arch/powerpc/platforms/powernv/opal-rtc.c   |  8 +++--
 drivers/rtc/rtc-opal.c  | 37 +
 4 files changed, 37 insertions(+), 18 deletions(-)

-- 
2.17.0



[PATCH 2/3] powerpc/powernv: Fix OPAL RTC driver OPAL_BUSY loops

2018-04-10 Thread Nicholas Piggin
The OPAL RTC driver does not sleep in case it gets OPAL_BUSY or
OPAL_BUSY_EVENT from firmware, which causes large scheduling
latencies, up to 50 seconds have been observed here when RTC stops
responding (BMC reboot can do it).

Fix this by converting it to the standard form OPAL_BUSY loop that
sleeps.

Fixes: 628daa8d5abfd ("powerpc/powernv: Add RTC and NVRAM support plus RTAS fallbacks")
Cc: Benjamin Herrenschmidt 
Cc: linux-...@vger.kernel.org
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/powernv/opal-rtc.c |  8 +++--
 drivers/rtc/rtc-opal.c| 37 ++-
 2 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-rtc.c 
b/arch/powerpc/platforms/powernv/opal-rtc.c
index f8868864f373..aa2a5139462e 100644
--- a/arch/powerpc/platforms/powernv/opal-rtc.c
+++ b/arch/powerpc/platforms/powernv/opal-rtc.c
@@ -48,10 +48,12 @@ unsigned long __init opal_get_boot_time(void)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   mdelay(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else if (rc == OPAL_BUSY)
-   mdelay(10);
+   } else if (rc == OPAL_BUSY) {
+   mdelay(OPAL_BUSY_DELAY_MS);
+   }
}
if (rc != OPAL_SUCCESS)
return 0;
diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c
index 304e891e35fc..60f2250fd96b 100644
--- a/drivers/rtc/rtc-opal.c
+++ b/drivers/rtc/rtc-opal.c
@@ -57,7 +57,7 @@ static void tm_to_opal(struct rtc_time *tm, u32 *y_m_d, u64 
*h_m_s_ms)
 
 static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm)
 {
-   long rc = OPAL_BUSY;
+   s64 rc = OPAL_BUSY;
int retries = 10;
u32 y_m_d;
u64 h_m_s_ms;
@@ -66,13 +66,17 @@ static int opal_get_rtc_time(struct device *dev, struct 
rtc_time *tm)
 
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else if (retries-- && (rc == OPAL_HARDWARE
-  || rc == OPAL_INTERNAL_ERROR))
-   msleep(10);
-   else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
-   break;
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) {
+   if (retries--) {
+   msleep(10); /* Wait 10ms before retry */
+   rc = OPAL_BUSY; /* go around again */
+   }
+   }
}
 
if (rc != OPAL_SUCCESS)
@@ -87,21 +91,26 @@ static int opal_get_rtc_time(struct device *dev, struct 
rtc_time *tm)
 
 static int opal_set_rtc_time(struct device *dev, struct rtc_time *tm)
 {
-   long rc = OPAL_BUSY;
+   s64 rc = OPAL_BUSY;
int retries = 10;
u32 y_m_d = 0;
u64 h_m_s_ms = 0;
 
tm_to_opal(tm, &y_m_d, &h_m_s_ms);
+
while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
rc = opal_rtc_write(y_m_d, h_m_s_ms);
-   if (rc == OPAL_BUSY_EVENT)
+   if (rc == OPAL_BUSY_EVENT) {
+   msleep(OPAL_BUSY_DELAY_MS);
opal_poll_events(NULL);
-   else if (retries-- && (rc == OPAL_HARDWARE
-  || rc == OPAL_INTERNAL_ERROR))
-   msleep(10);
-   else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
-   break;
+   } else if (rc == OPAL_BUSY) {
+   msleep(OPAL_BUSY_DELAY_MS);
+   } else if (rc == OPAL_HARDWARE || rc == OPAL_INTERNAL_ERROR) {
+   if (retries--) {
+   msleep(10); /* Wait 10ms before retry */
+   rc = OPAL_BUSY; /* go around again */
+   }
+   }
}
 
return rc == OPAL_SUCCESS ? 0 : -EIO;
-- 
2.17.0



[PATCH 1/3] powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops

2018-04-10 Thread Nicholas Piggin
This is the start of an effort to tidy up and standardise all the
delays. Existing loops have a range of delay/sleep periods from 1ms
to 20ms, and some have no delay. They all loop forever except rtc,
which times out after 10 retries, and that uses 10ms delays. So use
10ms as our standard delay. The OPAL maintainer agrees 10ms is a
reasonable starting point.

The idea is to use the same recipe everywhere; once this is proven to
work, it will be documented as an OPAL API standard. Then both
firmware and OS can agree, and if a particular call needs something
else, that can be documented along with the reasoning.

This is not the end-all of this effort, it's just a relatively easy
change that fixes some existing high latency delays. There should be
provision for standardising timeouts and/or interruptible loops where
possible, so non-fatal firmware errors don't cause hangs.
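
As one concrete illustration of where this could go, a retry loop with
a timeout provision might look something like the sketch below. This is
hypothetical and not part of this patch; opal_do_op() is a made-up
stand-in, and the timeout policy is just one possible shape:

	s64 rc = OPAL_BUSY;
	int timeout = 100;	/* e.g. give up after ~1s of 10ms sleeps */

	while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
		rc = opal_do_op();		/* hypothetical OPAL call */
		if (rc == OPAL_BUSY_EVENT)
			opal_poll_events(NULL);
		if (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
			if (!timeout--)
				return -ETIMEDOUT;
			msleep(OPAL_BUSY_DELAY_MS);
		}
	}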

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/opal.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 7159e1a6a61a..03e1a920491e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -21,6 +21,9 @@
 /* We calculate number of sg entries based on PAGE_SIZE */
 #define SG_ENTRIES_PER_NODE ((PAGE_SIZE - 16) / sizeof(struct opal_sg_entry))
 
+/* Default time to sleep or delay between OPAL_BUSY/OPAL_BUSY_EVENT loops */
+#define OPAL_BUSY_DELAY_MS 10
+
 /* /sys/firmware/opal */
 extern struct kobject *opal_kobj;
 
-- 
2.17.0



Re: [PATCH 00/32] docs/vm: convert to ReST format

2018-04-10 Thread Mike Rapoport
Jon, Andrew,

How do you suggest to continue with this?

On Sun, Apr 01, 2018 at 09:38:58AM +0300, Mike Rapoport wrote:
> (added akpm)
> 
> On Thu, Mar 29, 2018 at 03:46:07PM -0600, Jonathan Corbet wrote:
> > On Wed, 21 Mar 2018 21:22:16 +0200
> > Mike Rapoport  wrote:
> > 
> > > These patches convert files in Documentation/vm to ReST format, add an
> > > initial index and link it to the top level documentation.
> > > 
> > > There are no contents changes in the documentation, except few spelling
> > > fixes. The relatively large diffstat stems from the indentation and
> > > paragraph wrapping changes.
> > > 
> > > I've tried to keep the formatting as consistent as possible, but I could
> > > miss some places that needed markup and add some markup where it was not
> > > necessary.
> > 
> > So I've been pondering on these for a bit.  It looks like a reasonable and
> > straightforward RST conversion, no real complaints there.  But I do have a
> > couple of concerns...
> > 
> > One is that, as we move documentation into RST, I'm really trying to
> > organize it a bit so that it is better tuned to the various audiences we
> > have.  For example, ksm.txt is going to be of interest to sysadmin types,
> > who might want to tune it.  mmu_notifier.txt is of interest to ...
> > somebody, but probably nobody who is thinking in user space.  And so on.
> > 
> > So I would really like to see this material split up and put into the
> > appropriate places in the RST hierarchy - admin-guide for administrative
> > stuff, core-api for kernel development topics, etc.  That, of course,
> > could be done separately from the RST conversion, but I suspect I know
> > what will (or will not) happen if we agree to defer that for now :)
> 
> Well, I was actually planning on doing that ;-)
> 
> My thinking was to start with mechanical RST conversion and then to start
> working on the contents and ordering of the documentation. Some of the
> existing files, e.g. ksm.txt, can be moved as is into the appropriate
> places, others, like transhuge.txt should be at least split into admin/user
> and developer guides.
> 
> Another problem with many of the existing mm docs is that they are more
> developer notes than documentation, and it wouldn't be straightforward
> to assign them to a particular topic.
> 
> I believe that keeping the mm docs together will give better visibility of
> what (little) mm documentation we have and will make the updates easier.
> The documents that fit well into a certain topic could be linked there. For
> instance:
> 
> -
> diff --git a/Documentation/admin-guide/index.rst 
> b/Documentation/admin-guide/index.rst
> index 5bb9161..8f6c6e6 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -63,6 +63,7 @@ configure specific aspects of kernel behavior to your 
> liking.
> pm/index
> thunderbolt
> LSM/index
> +   vm/index
> 
>  .. only::  subproject and html
> 
> diff --git a/Documentation/admin-guide/vm/index.rst 
> b/Documentation/admin-guide/vm/index.rst
> new file mode 100644
> index 000..d86f1c8
> --- /dev/null
> +++ b/Documentation/admin-guide/vm/index.rst
> @@ -0,0 +1,5 @@
> +==============================================
> +Knobs and Buttons for Memory Management Tuning
> +==============================================
> +
> +* :ref:`ksm <ksm>`
> -
> 
> > The other is the inevitable merge conflicts that changing that many doc
> > files will create.  Sending the patches through Andrew could minimize
> > that, I guess, or at least make it his problem.  Alternatively, we could
> > try to do it as an end-of-merge-window sort of thing.  I can try to manage
> > that, but an ack or two from the mm crowd would be nice to have.
> 
> I can rebase on top of Andrew's tree if that would help to minimize the
> merge conflicts.
> 
> > Thanks,
> > 
> > jon
> > 
> 
> -- 
> Sincerely yours,
> Mike.
> 

-- 
Sincerely yours,
Mike.



Re: Occasionally losing the tick_sched_timer

2018-04-10 Thread Thomas Gleixner
On Tue, 10 Apr 2018, Nicholas Piggin wrote:
> On Tue, 10 Apr 2018 09:42:29 +0200 (CEST)
> Thomas Gleixner  wrote:
> > > Thomas do you have any ideas on what we might look for, or if we can add
> > > some BUG_ON()s to catch this at its source?  
> > 
> > Not really. Tracing might be a more efficient tool than random BUG_ONs.
> 
> Sure, we could try that. Any suggestions? timer events?

timer, hrtimer and the tick-sched stuff should be a good start. And make
sure to freeze the trace once you hit the fault case. tracing_off() is your
friend.
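
For example, one way to wire that up might be to freeze the buffer from
the hard lockup detector itself. A hypothetical sketch (the surrounding
is_hardlockup() check is the existing detector path; only the
tracing_off() call is the addition):

	/* in the NMI watchdog callback, once a hard lockup is detected */
	if (is_hardlockup()) {
		tracing_off();	/* stop recording, keep the buffer for post-mortem */
		pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n",
			 smp_processor_id());
		...
	}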

Thanks,

tglx


[PATCH] powerpc/8xx: Build fix with Hugetlbfs enabled

2018-04-10 Thread Aneesh Kumar K.V
8xx uses the slice code when hugetlbfs is enabled. We missed a header
include on 8xx, which resulted in the build failure below.

config: mpc885_ads_defconfig + CONFIG_HUGETLBFS

   CC  arch/powerpc/mm/slice.o
arch/powerpc/mm/slice.c: In function 'slice_get_unmapped_area':
arch/powerpc/mm/slice.c:655:2: error: implicit declaration of function 'need_extra_context' [-Werror=implicit-function-declaration]
arch/powerpc/mm/slice.c:656:3: error: implicit declaration of function 'alloc_extended_context' [-Werror=implicit-function-declaration]
cc1: all warnings being treated as errors
make[1]: *** [arch/powerpc/mm/slice.o] Error 1
make: *** [arch/powerpc/mm] Error 2

On PPC64, mmu_context.h was included indirectly via linux/pkeys.h.

CC: Christophe LEROY 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/slice.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 9cd87d11fe4e..205fe557ca10 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include <asm/mmu_context.h>
 
 static DEFINE_SPINLOCK(slice_convert_lock);
 
-- 
2.14.3



Re: Occasionally losing the tick_sched_timer

2018-04-10 Thread Nicholas Piggin
On Tue, 10 Apr 2018 09:42:29 +0200 (CEST)
Thomas Gleixner  wrote:

> Nick,
> 
> On Tue, 10 Apr 2018, Nicholas Piggin wrote:
> > We are seeing rare hard lockup watchdog timeouts: a CPU seems to have no
> > more timers scheduled, even though the hard and soft lockup watchdogs
> > should have their heartbeat timers queued, and probably many others.
> >
> > The reproducer we have is running a KVM workload. The lockup is in the
> > host kernel, quite rare but we may be able to slowly test things.
> > 
> > I have a sysrq+q snippet. CPU3 is the stuck one, you can see its tick has
> > stopped for a long time and no hrtimer active. Included CPU4 for what the
> > other CPUs look like.
> > 
> > Thomas do you have any ideas on what we might look for, or if we can add
> > some BUG_ON()s to catch this at its source?  
> 
> Not really. Tracing might be a more efficient tool than random BUG_ONs.

Sure, we could try that. Any suggestions? timer events?

> 
> > - CPU3 is sitting in its cpuidle loop (polling idle with all other idle
> >   states disabled).
> > 
> > - `taskset -c 3 ls` basically revived the CPU and got timers running again. 
> >  
> 
> Which is not surprising because that kicks the CPU out of idle and starts
> the tick timer again.

Yep.
 
> Does this restart the watchdog timers as well?

I think so, but now you ask I'm not 100% sure we directly observed it.
We can check that next time it locks up.

> > cpu: 3
> >  clock 0:
> >   .base:   df30f5ab
> >   .index:  0
> >   .resolution: 1 nsecs
> >   .get_time:   ktime_get
> >   .offset: 0 nsecs
> > active timers:  
> 
> So in theory the soft lockup watchdog hrtimer should be queued here.
> 
> >   .expires_next   : 9223372036854775807 nsecs
> >   .hres_active: 1
> >   .nr_events  : 1446533
> >   .nr_retries : 1434
> >   .nr_hangs   : 0
> >   .max_hang_time  : 0
> >   .nohz_mode  : 2
> >   .last_tick  : 1776312000 nsecs
> >   .tick_stopped   : 1
> >   .idle_jiffies   : 4296713609
> >   .idle_calls : 2573133
> >   .idle_sleeps: 1957794  
> 
> >   .idle_waketime  : 59550238131639 nsecs
> >   .idle_sleeptime : 17504617295679 nsecs
> >   .iowait_sleeptime: 719978688 nsecs
> >   .last_jiffies   : 4296713608  
> 
> So this was the last time when the CPU came out of idle:
> 
> >   .idle_exittime  : 17763110009176 nsecs  
> 
> Here it went back into idle:
> 
> >   .idle_entrytime : 1776312625 nsecs  
> 
> And this was the next timer wheel timer due for expiry:
> 
> >   .next_timer : 1776313000
> >   .idle_expires   : 1776313000 nsecs  
> 
> which makes no sense whatsoever, but this might be stale information. Can't
> tell.

Wouldn't we expect to see that if there is a timer that was missed
somehow because the tick_sched_timer was not set?

> 
> > cpu: 4
> >  clock 0:
> >   .base:   07d8226b
> >   .index:  0
> >   .resolution: 1 nsecs
> >   .get_time:   ktime_get
> >   .offset: 0 nsecs
> > active timers: #0: , tick_sched_timer, S:01
> >  # expires at 5955295000-5955295000 nsecs [in 
> > 2685654802 to 2685654802 nsecs]  
> 
> The tick timer is scheduled because the next timer wheel timer is due at
> 5955295000, which might be the hard watchdog timer
> 
> >  #1: <9b4a3b88>, hrtimer_wakeup, S:01
> >  # expires at 59602585423025-59602642458243 nsecs [in 
> > 52321077827 to 52378113045 nsecs]  
> 
> That might be the soft lockup hrtimer.
> 
> I'd try to gather more information about the chain of events via tracing
> and stop the tracer once the lockup detector hits.

Okay will do, thanks for taking a look.

Thanks,
Nick


Re: Occasionally losing the tick_sched_timer

2018-04-10 Thread Thomas Gleixner
Nick,

On Tue, 10 Apr 2018, Nicholas Piggin wrote:
> We are seeing rare hard lockup watchdog timeouts: a CPU seems to have no
> more timers scheduled, even though the hard and soft lockup watchdogs
> should have their heartbeat timers queued, and probably many others.
>
> The reproducer we have is running a KVM workload. The lockup is in the
> host kernel, quite rare but we may be able to slowly test things.
> 
> I have a sysrq+q snippet. CPU3 is the stuck one, you can see its tick has
> stopped for a long time and no hrtimer active. Included CPU4 for what the
> other CPUs look like.
> 
> Thomas do you have any ideas on what we might look for, or if we can add
> some BUG_ON()s to catch this at its source?

Not really. Tracing might be a more efficient tool than random BUG_ONs.

> - CPU3 is sitting in its cpuidle loop (polling idle with all other idle
>   states disabled).
> 
> - `taskset -c 3 ls` basically revived the CPU and got timers running again.

Which is not surprising because that kicks the CPU out of idle and starts
the tick timer again.

Does this restart the watchdog timers as well?

> cpu: 3
>  clock 0:
>   .base:   df30f5ab
>   .index:  0
>   .resolution: 1 nsecs
>   .get_time:   ktime_get
>   .offset: 0 nsecs
> active timers:

So in theory the soft lockup watchdog hrtimer should be queued here.

>   .expires_next   : 9223372036854775807 nsecs
>   .hres_active: 1
>   .nr_events  : 1446533
>   .nr_retries : 1434
>   .nr_hangs   : 0
>   .max_hang_time  : 0
>   .nohz_mode  : 2
>   .last_tick  : 1776312000 nsecs
>   .tick_stopped   : 1
>   .idle_jiffies   : 4296713609
>   .idle_calls : 2573133
>   .idle_sleeps: 1957794

>   .idle_waketime  : 59550238131639 nsecs
>   .idle_sleeptime : 17504617295679 nsecs
>   .iowait_sleeptime: 719978688 nsecs
>   .last_jiffies   : 4296713608

So this was the last time when the CPU came out of idle:

>   .idle_exittime  : 17763110009176 nsecs

Here it went back into idle:

>   .idle_entrytime : 1776312625 nsecs

And this was the next timer wheel timer due for expiry:

>   .next_timer : 1776313000
>   .idle_expires   : 1776313000 nsecs

which makes no sense whatsoever, but this might be stale information. Can't
tell.

> cpu: 4
>  clock 0:
>   .base:   07d8226b
>   .index:  0
>   .resolution: 1 nsecs
>   .get_time:   ktime_get
>   .offset: 0 nsecs
> active timers: #0: , tick_sched_timer, S:01
># expires at 5955295000-5955295000 nsecs [in 
> 2685654802 to 2685654802 nsecs]

The tick timer is scheduled because the next timer wheel timer is due at
5955295000, which might be the hard watchdog timer

>#1: <9b4a3b88>, hrtimer_wakeup, S:01
># expires at 59602585423025-59602642458243 nsecs [in 
> 52321077827 to 52378113045 nsecs]

That might be the soft lockup hrtimer.

I'd try to gather more information about the chain of events via tracing
and stop the tracer once the lockup detector hits.

Thanks,

tglx




[PATCH] powerpc/powernv/opal: Use standard interrupts property when available

2018-04-10 Thread Benjamin Herrenschmidt
For (bad) historical reasons, OPAL used to create a non-standard pair of
properties "opal-interrupts" and "opal-interrupts-names" for representing
the list of interrupts it wants Linux to request on its behalf.

Among other issues, the opal-interrupts doesn't have a way to carry the
type of interrupts, and they were assumed to be all level sensitive.

This is wrong on some recent systems where some of them are edge sensitive
causing warnings in the XIVE code and possible misbehaviours if they need
to be retriggered (typically the NPU2 TCE error interrupts).

This makes Linux switch to using the standard "interrupts" and
"interrupt-names" properties instead when they are available, using standard
of_irq helpers, which can carry all the desired type information.

Newer versions of OPAL will generate those properties in addition to the
legacy ones.

Signed-off-by: Benjamin Herrenschmidt 
---

diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c 
b/arch/powerpc/platforms/powernv/opal-irqchip.c
index 9d1b8c0aaf93..46785eaf625d 100644
--- a/arch/powerpc/platforms/powernv/opal-irqchip.c
+++ b/arch/powerpc/platforms/powernv/opal-irqchip.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/of_irq.h>
 
 #include 
 #include 
@@ -39,8 +40,8 @@ struct opal_event_irqchip {
 };
 static struct opal_event_irqchip opal_event_irqchip;
 
-static unsigned int opal_irq_count;
-static unsigned int *opal_irqs;
+static int opal_irq_count;
+static struct resource *opal_irqs;
 
 static void opal_handle_irq_work(struct irq_work *work);
 static u64 last_outstanding_events;
@@ -174,24 +175,21 @@ void opal_event_shutdown(void)
 
/* First free interrupts, which will also mask them */
for (i = 0; i < opal_irq_count; i++) {
-   if (!opal_irqs[i])
+   if (!opal_irqs || !opal_irqs[i].start)
continue;
 
if (in_interrupt())
-   disable_irq_nosync(opal_irqs[i]);
+   disable_irq_nosync(opal_irqs[i].start);
else
-   free_irq(opal_irqs[i], NULL);
-
-   opal_irqs[i] = 0;
+   free_irq(opal_irqs[i].start, NULL);
}
 }
 
 int __init opal_event_init(void)
 {
struct device_node *dn, *opal_node;
-   const char **names;
-   u32 *irqs;
-   int i, rc;
+   bool old_style = false;
+   int i, rc = 0;
 
opal_node = of_find_node_by_path("/ibm,opal");
if (!opal_node) {
@@ -216,67 +214,91 @@ int __init opal_event_init(void)
goto out;
}
 
-   /* Get opal-interrupts property and names if present */
-   rc = of_property_count_u32_elems(opal_node, "opal-interrupts");
-   if (rc < 0)
-   goto out;
+   /* Look for new-style (standard) "interrupts" property */
+   opal_irq_count = of_irq_count(opal_node);
 
-   opal_irq_count = rc;
-   pr_debug("Found %d interrupts reserved for OPAL\n", opal_irq_count);
+   /* Absent ? Look for the old one */
+   if (opal_irq_count < 1) {
+   /* Get opal-interrupts property and names if present */
+   rc = of_property_count_u32_elems(opal_node, "opal-interrupts");
+   if (rc > 0)
+   opal_irq_count = rc;
+   old_style = true;
+   }
 
-   irqs = kcalloc(opal_irq_count, sizeof(*irqs), GFP_KERNEL);
-   names = kcalloc(opal_irq_count, sizeof(*names), GFP_KERNEL);
-   opal_irqs = kcalloc(opal_irq_count, sizeof(*opal_irqs), GFP_KERNEL);
+   /* No interrupts ? Bail out */
+   if (!opal_irq_count)
+   goto out;
 
-   if (WARN_ON(!irqs || !names || !opal_irqs))
-   goto out_free;
+   pr_debug("OPAL: Found %d interrupts reserved for OPAL using %s 
scheme\n",
+opal_irq_count, old_style ? "old" : "new");
 
-   rc = of_property_read_u32_array(opal_node, "opal-interrupts",
-   irqs, opal_irq_count);
-   if (rc < 0) {
-   pr_err("Error %d reading opal-interrupts array\n", rc);
-   goto out_free;
+   /* Allocate an IRQ resources array */
+   opal_irqs = kcalloc(opal_irq_count, sizeof(struct resource), GFP_KERNEL);
+   if (WARN_ON(!opal_irqs)) {
+   rc = -ENOMEM;
+   goto out;
}
 
-   /* It's not an error for the names to be missing */
-   of_property_read_string_array(opal_node, "opal-interrupts-names",
- names, opal_irq_count);
+   /* Build the resources array */
+   if (old_style) {
+   /* Old style "opal-interrupts" property */
+   for (i = 0; i < opal_irq_count; i++) {
+   struct resource *r = &opal_irqs[i];
+   const char *name = NULL;
+   u32 hw_irq;
+   int virq;
+
+   rc = 

Re: [PATCH v9 21/24] perf tools: Add support for the SPF perf event

2018-04-10 Thread David Rientjes
On Mon, 26 Mar 2018, Andi Kleen wrote:

> > Aside: should there be a new spec_flt field for struct task_struct that 
> > complements maj_flt and min_flt and be exported through /proc/pid/stat?
> 
> No. task_struct is already too bloated. If you need per process tracking 
> you can always get it through trace points.
> 

Hi Andi,

We have

count_vm_event(PGFAULT);
count_memcg_event_mm(vma->vm_mm, PGFAULT);

in handle_mm_fault() but no counterpart for SPF. If there is no
per-process tracking, I think it would be helpful to be able to
determine, without tracing, how much faulting is done speculatively.
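
Something along these lines in the speculative path would mirror the
existing accounting (a hypothetical sketch; SPECULATIVE_PGFAULT is a
made-up vm_event_item, not something the series defines):

	/* on successfully handling a speculative fault */
	count_vm_event(SPECULATIVE_PGFAULT);
	count_memcg_event_mm(vma->vm_mm, SPECULATIVE_PGFAULT);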


[PATCH] Revert "powerpc/64: Fix checksum folding in csum_add()"

2018-04-10 Thread Christophe Leroy
This reverts commit 6ad966d7303b70165228dba1ee8da1a05c10eefe.

That commit was pointless, because csum_add() sums two 32-bit
values, so the sum is 0x1fffffffe at the maximum.
And then when adding the upper part (1) and the lower part (0xfffffffe),
the result is 0xffffffff, which doesn't carry.
Any lower value will not carry either.

Beyond being useless, the commit also defeats the whole purpose of
having an arch-specific inline csum_add(): the resulting code is even
worse than what is obtained with the generic implementation of
csum_add():

0240 <.csum_add>:
 240:   38 00 ff ff li  r0,-1
 244:   7c 84 1a 14 add r4,r4,r3
 248:   78 00 00 20 clrldi  r0,r0,32
 24c:   78 89 00 22 rldicl  r9,r4,32,32
 250:   7c 80 00 38 and r0,r4,r0
 254:   7c 09 02 14 add r0,r9,r0
 258:   78 09 00 22 rldicl  r9,r0,32,32
 25c:   7c 00 4a 14 add r0,r0,r9
 260:   78 03 00 20 clrldi  r3,r0,32
 264:   4e 80 00 20 blr

In comparison, the generic implementation of csum_add() gives:

0290 <.csum_add>:
 290:   7c 63 22 14 add r3,r3,r4
 294:   7f 83 20 40 cmplw   cr7,r3,r4
 298:   7c 10 10 26 mfocrf  r0,1
 29c:   54 00 ef fe rlwinm  r0,r0,29,31,31
 2a0:   7c 60 1a 14 add r3,r0,r3
 2a4:   78 63 00 20 clrldi  r3,r3,32
 2a8:   4e 80 00 20 blr

And the reverted implementation for PPC64 gives:

0240 <.csum_add>:
 240:   7c 84 1a 14 add r4,r4,r3
 244:   78 80 00 22 rldicl  r0,r4,32,32
 248:   7c 80 22 14 add r4,r0,r4
 24c:   78 83 00 20 clrldi  r3,r4,32
 250:   4e 80 00 20 blr

Fixes: 6ad966d7303b7 ("powerpc/64: Fix checksum folding in csum_add()")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/checksum.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/checksum.h 
b/arch/powerpc/include/asm/checksum.h
index 842124b199b5..4e63787dc3be 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -112,7 +112,7 @@ static inline __wsum csum_add(__wsum csum, __wsum addend)
 
 #ifdef __powerpc64__
res += (__force u64)addend;
-   return (__force __wsum) from64to32(res);
+   return (__force __wsum)((u32)res + (res >> 32));
 #else
asm("addc %0,%0,%1;"
"addze %0,%0;"
-- 
2.13.3



[PATCH] powerpc/64: optimises from64to32()

2018-04-10 Thread Christophe Leroy
The current implementation of from64to32() gives a poor result:

0270 <.from64to32>:
 270:   38 00 ff ff li  r0,-1
 274:   78 69 00 22 rldicl  r9,r3,32,32
 278:   78 00 00 20 clrldi  r0,r0,32
 27c:   7c 60 00 38 and r0,r3,r0
 280:   7c 09 02 14 add r0,r9,r0
 284:   78 09 00 22 rldicl  r9,r0,32,32
 288:   7c 00 4a 14 add r0,r0,r9
 28c:   78 03 00 20 clrldi  r3,r0,32
 290:   4e 80 00 20 blr

This patch modifies from64to32() to operate in the same spirit as
csum_fold().

It swaps the two 32-bit halves of the sum and then adds that to the
unswapped sum. If there is a carry from adding the two 32-bit halves,
it carries from the lower half into the upper half, giving us the
correct sum in the upper half.

The resulting code is:

0260 <.from64to32>:
 260:   78 60 00 02 rotldi  r0,r3,32
 264:   7c 60 1a 14 add r3,r0,r3
 268:   78 63 00 22 rldicl  r3,r3,32,32
 26c:   4e 80 00 20 blr
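
A quick worked example of the carry propagation (illustration only):

	x            = 0xffffffff00000002   (the two halves sum to 0x1_00000001)
	ror64(x, 32) = 0x00000002ffffffff
	x + ror64    = 0x0000000200000001   (mod 2^64: the carry out of the low
	                                     word lands in the high word)
	     >> 32   = 0x00000002           (0xffffffff + 0x00000002 folded with
	                                     end-around carry, as required)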

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/checksum.h | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/checksum.h 
b/arch/powerpc/include/asm/checksum.h
index 4e63787dc3be..54065caa40b3 100644
--- a/arch/powerpc/include/asm/checksum.h
+++ b/arch/powerpc/include/asm/checksum.h
@@ -12,6 +12,7 @@
 #ifdef CONFIG_GENERIC_CSUM
 #include 
 #else
+#include <linux/bitops.h>
 /*
  * Computes the checksum of a memory block at src, length len,
  * and adds in "sum" (32-bit), while copying the block to dst.
@@ -55,11 +56,7 @@ static inline __sum16 csum_fold(__wsum sum)
 
 static inline u32 from64to32(u64 x)
 {
-   /* add up 32-bit and 32-bit for 32+c bit */
-   x = (x & 0x) + (x >> 32);
-   /* add up carry.. */
-   x = (x & 0x) + (x >> 32);
-   return (u32)x;
+   return (x + ror64(x, 32)) >> 32;
 }
 
 static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, __u32 len,
-- 
2.13.3



Re: [PATCH 1/2] KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode

2018-04-10 Thread Nicholas Piggin
On Tue, 10 Apr 2018 11:25:02 +0530
"Naveen N. Rao"  wrote:

> Michael Ellerman wrote:
> > Nicholas Piggin  writes:
> >   
> >> On Sun, 8 Apr 2018 20:17:47 +1000
> >> Balbir Singh  wrote:
> >>  
> >>> On Fri, Apr 6, 2018 at 3:56 AM, Nicholas Piggin  
> >>> wrote:  
> >>> > This crashes with a "Bad real address for load" attempting to load
> >>> > from the vmalloc region in realmode (faulting address is in DAR).
> >>> >
> >>> >   Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
> >>> >   LE SMP NR_CPUS=2048 NUMA PowerNV
> >>> >   CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted 
> >>> > 4.16.0-01530-g43d1859f0994
> >>> >   NIP:  c00155ac LR: c00c2430 CTR: c0015580
> >>> >   REGS: c00fff76dd80 TRAP: 0200   Not tainted  
> >>> > (4.16.0-01530-g43d1859f0994)
> >>> >   MSR:  90201003   CR: 4808  XER: 
> >>> >   CFAR: 000102900ef0 DAR: d00017fffd941a28 DSISR: 0040 SOFTE: 3
> >>> >   NIP [c00155ac] perf_trace_tlbie+0x2c/0x1a0
> >>> >   LR [c00c2430] do_tlbies+0x230/0x2f0
> >>> >
> >>> > I suspect the reason is the per-cpu data is not in the linear chunk.
> >>> > This could be restored if that was able to be fixed, but for now,
> >>> > just remove the tracepoints.
> >>> 
> >>> Could you share the stack trace as well? I've not observed this in my 
> >>> testing.  
> >>
> >> I can't seem to find it, I can try reproduce tomorrow. It was coming
> >> from h_remove hcall from the guest. It's 176 logical CPUs.
> >>  
> >>> May be I don't have as many cpus. I presume your talking about the per cpu
> >>> data offsets for per cpu trace data?  
> >>
> >> It looked like it was dereferencing virtually mapped per-cpu data, yes.
> >> Probably the perf_events deref.  
> > 
> > Naveen has posted a series to (hopefully) fix this, which just missed
> > the merge window:
> > 
> >   https://patchwork.ozlabs.org/patch/894757/  
> 
> I'm afraid that won't actually help here :(
> That series is specific to the function tracer, while this is using 
> static tracepoints.
> 
> We could convert trace_tlbie() to a TRACE_EVENT_CONDITION() and guard it 
> within a check for paca->ftrace_enabled, but that would only be useful 
> if the below callsites can ever be hit outside of KVM guest mode.

Right, removing the trace points is the right thing to do here.

Doing tracing in real mode would be a whole effort itself, I'd expect.
Or disabling realmode handling of HPT hcalls if trace points are
active.
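
For reference, if the tracepoint were ever wanted back, Naveen's
TRACE_EVENT_CONDITION() idea might look roughly like this (a sketch
only, assuming the paca->ftrace_enabled flag from his series; the
TP_STRUCT__entry/TP_fast_assign/TP_printk parts are elided):

	TRACE_EVENT_CONDITION(tlbie,
		TP_PROTO(unsigned long lpid, unsigned long local,
			 unsigned long rb, unsigned long rs,
			 unsigned long ric, unsigned long prs,
			 unsigned long r),
		TP_ARGS(lpid, local, rb, rs, ric, prs, r),
		/* only fire when it is safe to trace on this CPU */
		TP_CONDITION(local_paca->ftrace_enabled),
		...
	);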

Thanks,
Nick