Re: [PATCH v13 3/3] mm, powerpc, x86: introduce an additional vma bit for powerpc pkey

2018-05-04 Thread Dave Hansen
On 05/04/2018 06:12 PM, Ram Pai wrote:
>> That new line boils down to:
>>
>>  [ilog2(0)]  = "",
>>
>> on x86.  It wasn't *obvious* to me that it is OK to do that.  The other
>> possibly undefined bits (VM_SOFTDIRTY for instance) #ifdef themselves
>> out of this array.
>>
>> I would just be a wee bit worried that this would overwrite the 0 entry
>> ("??") with "".
> Yes it would :-( and could potentially break anything that depends on
> 0th entry being "??"
> 
> Is the following fix acceptable?
> 
> #if VM_PKEY_BIT4
> [ilog2(VM_PKEY_BIT4)]   = "",
> #endif

Yep, I think that works for me.


[PATCH] powerpc: cpm_gpio: Remove owner assignment from platform_driver

2018-05-04 Thread Fabio Estevam
From: Fabio Estevam 

Structure platform_driver does not need to set the owner field, as this
will be populated by the driver core.

Generated by scripts/coccinelle/api/platform_no_drv_owner.cocci.

Signed-off-by: Fabio Estevam 
---
 arch/powerpc/sysdev/cpm_gpio.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/sysdev/cpm_gpio.c b/arch/powerpc/sysdev/cpm_gpio.c
index 0badc90..0695d26 100644
--- a/arch/powerpc/sysdev/cpm_gpio.c
+++ b/arch/powerpc/sysdev/cpm_gpio.c
@@ -63,7 +63,6 @@ static struct platform_driver cpm_gpio_driver = {
.probe  = cpm_gpio_probe,
.driver = {
.name   = "cpm-gpio",
-   .owner  = THIS_MODULE,
.of_match_table = cpm_gpio_match,
},
 };
-- 
2.7.4



Re: [PATCH v13 3/3] mm, powerpc, x86: introduce an additional vma bit for powerpc pkey

2018-05-04 Thread Ram Pai
On Fri, May 04, 2018 at 03:57:33PM -0700, Dave Hansen wrote:
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 0c9e392..3c7 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -679,6 +679,7 @@ static void show_smap_vma_flags(struct seq_file *m, 
> > struct vm_area_struct *vma)
> > [ilog2(VM_PKEY_BIT1)]   = "",
> > [ilog2(VM_PKEY_BIT2)]   = "",
> > [ilog2(VM_PKEY_BIT3)]   = "",
> > +   [ilog2(VM_PKEY_BIT4)]   = "",
> >  #endif /* CONFIG_ARCH_HAS_PKEYS */
> ...
> > +#if defined(CONFIG_PPC)
> > +# define VM_PKEY_BIT4  VM_HIGH_ARCH_4
> > +#else 
> > +# define VM_PKEY_BIT4  0
> > +#endif
> >  #endif /* CONFIG_ARCH_HAS_PKEYS */
> 
> That new line boils down to:
> 
>   [ilog2(0)]  = "",
> 
> on x86.  It wasn't *obvious* to me that it is OK to do that.  The other
> possibly undefined bits (VM_SOFTDIRTY for instance) #ifdef themselves
> out of this array.
> 
> I would just be a wee bit worried that this would overwrite the 0 entry
> ("??") with "".

Yes it would :-( and could potentially break anything that depends on
0th entry being "??"

Is the following fix acceptable?

#if VM_PKEY_BIT4
[ilog2(VM_PKEY_BIT4)]   = "",
#endif
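
With VM_PKEY_BIT4 #defined to 0 on x86, the unguarded initializer would
expand to an index-0 entry, i.e.

        [ilog2(0)]      = "",   /* lands on [0], clobbering entry 0 */

whereas the #if above drops the initializer entirely when VM_PKEY_BIT4
is 0, so entry 0 keeps its original string.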

-- 
Ram Pai



Re: [PATCH v13 3/3] mm, powerpc, x86: introduce an additional vma bit for powerpc pkey

2018-05-04 Thread Dave Hansen
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 0c9e392..3c7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -679,6 +679,7 @@ static void show_smap_vma_flags(struct seq_file *m, 
> struct vm_area_struct *vma)
>   [ilog2(VM_PKEY_BIT1)]   = "",
>   [ilog2(VM_PKEY_BIT2)]   = "",
>   [ilog2(VM_PKEY_BIT3)]   = "",
> + [ilog2(VM_PKEY_BIT4)]   = "",
>  #endif /* CONFIG_ARCH_HAS_PKEYS */
...
> +#if defined(CONFIG_PPC)
> +# define VM_PKEY_BIT4VM_HIGH_ARCH_4
> +#else 
> +# define VM_PKEY_BIT40
> +#endif
>  #endif /* CONFIG_ARCH_HAS_PKEYS */

That new line boils down to:

[ilog2(0)]  = "",

on x86.  It wasn't *obvious* to me that it is OK to do that.  The other
possibly undefined bits (VM_SOFTDIRTY for instance) #ifdef themselves
out of this array.

I would just be a wee bit worried that this would overwrite the 0 entry
("??") with "".


Re: [PATCH 4/4] powerpc/xive: prepare all hcalls to support long busy delays

2018-05-04 Thread Benjamin Herrenschmidt
On Fri, 2018-05-04 at 20:42 +1000, Michael Ellerman wrote:
> Cédric Le Goater  writes:
> 
> > This is not the case for the moment, but future releases of pHyp might
> > need to introduce some synchronisation routines under the hood which
> > would make the XIVE hcalls longer to complete.
> > 
> > As this was done for H_INT_RESET, let's wrap the other hcalls in a
> > loop catching the H_LONG_BUSY_* codes.
> 
> Are we sure it's safe to msleep() in all these paths?

Probably not. We can have the IRQ descriptor lock. We might need to
mdelay.

There's a Kconfig option (I forget which one) that will add checks for
attempts to sleep inside locks; you should run with that.
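
The option is most likely CONFIG_DEBUG_ATOMIC_SLEEP (lib/Kconfig.debug),
which makes might_sleep() and friends warn when called from atomic
context, e.g. while holding a spinlock; a minimal config fragment to run
with:

        CONFIG_DEBUG_ATOMIC_SLEEP=y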

Cheers,
Ben.

> 
> cheers
> 
> > diff --git a/arch/powerpc/sysdev/xive/spapr.c 
> > b/arch/powerpc/sysdev/xive/spapr.c
> > index 7113f5d87952..97ea0a67a173 100644
> > --- a/arch/powerpc/sysdev/xive/spapr.c
> > +++ b/arch/powerpc/sysdev/xive/spapr.c
> > @@ -165,7 +165,10 @@ static long plpar_int_get_source_info(unsigned long 
> > flags,
> > unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> > long rc;
> >  
> > -   rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
> > +   do {
> > +   rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
> > +   } while (plpar_busy_delay(rc));
> > +
> > if (rc) {
> > pr_err("H_INT_GET_SOURCE_INFO lisn=%ld failed %ld\n", lisn, rc);
> > return rc;
> > @@ -198,8 +201,11 @@ static long plpar_int_set_source_config(unsigned long 
> > flags,
> > flags, lisn, target, prio, sw_irq);
> >  
> >  
> > -   rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
> > -   target, prio, sw_irq);
> > +   do {
> > +   rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
> > +   target, prio, sw_irq);
> > +   } while (plpar_busy_delay(rc));
> > +
> > if (rc) {
> > pr_err("H_INT_SET_SOURCE_CONFIG lisn=%ld target=%lx prio=%lx 
> > failed %ld\n",
> >lisn, target, prio, rc);
> > @@ -218,7 +224,11 @@ static long plpar_int_get_queue_info(unsigned long 
> > flags,
> > unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> > long rc;
> >  
> > -   rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target, priority);
> > +   do {
> > +   rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target,
> > +priority);
> > +   } while (plpar_busy_delay(rc));
> > +
> > if (rc) {
> > pr_err("H_INT_GET_QUEUE_INFO cpu=%ld prio=%ld failed %ld\n",
> >target, priority, rc);
> > @@ -247,8 +257,11 @@ static long plpar_int_set_queue_config(unsigned long 
> > flags,
> > pr_devel("H_INT_SET_QUEUE_CONFIG flags=%lx target=%lx priority=%lx 
> > qpage=%lx qsize=%lx\n",
> > flags,  target, priority, qpage, qsize);
> >  
> > -   rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
> > -   priority, qpage, qsize);
> > +   do {
> > +   rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
> > +   priority, qpage, qsize);
> > +   } while (plpar_busy_delay(rc));
> > +
> > if (rc) {
> > pr_err("H_INT_SET_QUEUE_CONFIG cpu=%ld prio=%ld qpage=%lx 
> > returned %ld\n",
> >target, priority, qpage, rc);
> > @@ -262,7 +275,10 @@ static long plpar_int_sync(unsigned long flags, 
> > unsigned long lisn)
> >  {
> > long rc;
> >  
> > -   rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
> > +   do {
> > +   rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
> > +   } while (plpar_busy_delay(rc));
> > +
> > if (rc) {
> > pr_err("H_INT_SYNC lisn=%ld returned %ld\n", lisn, rc);
> > return  rc;
> > @@ -285,7 +301,11 @@ static long plpar_int_esb(unsigned long flags,
> > pr_devel("H_INT_ESB flags=%lx lisn=%lx offset=%lx in=%lx\n",
> > flags,  lisn, offset, in_data);
> >  
> > -   rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset, in_data);
> > +   do {
> > +   rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset,
> > +in_data);
> > +   } while (plpar_busy_delay(rc));
> > +
> > if (rc) {
> > pr_err("H_INT_ESB lisn=%ld offset=%ld returned %ld\n",
> >lisn, offset, rc);
> > -- 
> > 2.13.6


[PATCH v11 3/3] mm, x86, powerpc: display pkey in smaps only if arch supports pkeys

2018-05-04 Thread Ram Pai
Currently the architecture-specific code is expected to
display the protection keys in smaps for a given vma.
This can lead to redundant code and possibly to divergent
formats in which the key gets displayed.

This patch changes the implementation. It displays the
pkey only if the architecture supports pkeys, i.e. if
arch_pkeys_enabled() returns true. This patch also
provides the x86 implementation of arch_pkeys_enabled().

The x86 arch_show_smap() function is not needed anymore,
so delete it.

cc: Michael Ellermen 
cc: Benjamin Herrenschmidt 
cc: Andrew Morton 
Reviewed-by: Dave Hansen 
Signed-off-by: Thiago Jung Bauermann 
(fixed compilation errors for x86 configs)
Acked-by: Michal Hocko 
Reviewed-by: Ingo Molnar 
Signed-off-by: Ram Pai 
---
 arch/powerpc/include/asm/mmu_context.h |5 -
 arch/x86/include/asm/mmu_context.h |5 -
 arch/x86/include/asm/pkeys.h   |1 +
 arch/x86/kernel/fpu/xstate.c   |5 +
 arch/x86/kernel/setup.c|8 
 fs/proc/task_mmu.c |   11 ++-
 include/linux/pkeys.h  |7 ++-
 7 files changed, 18 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 1835ca1..896efa5 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -250,11 +250,6 @@ static inline bool arch_vma_access_permitted(struct 
vm_area_struct *vma,
 #define thread_pkey_regs_restore(new_thread, old_thread)
 #define thread_pkey_regs_init(thread)
 
-static inline int vma_pkey(struct vm_area_struct *vma)
-{
-   return 0;
-}
-
 static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
 {
return 0x0UL;
diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index 57e3785..3d748bd 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -296,11 +296,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
-#else
-static inline int vma_pkey(struct vm_area_struct *vma)
-{
-   return 0;
-}
 #endif
 
 /*
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index a0ba1ff..f6c287b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -6,6 +6,7 @@
 
 extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val);
+extern bool arch_pkeys_enabled(void);
 
 /*
  * Try to dedicate one of the protection keys to be used as an
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7..4f566e9 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -945,6 +945,11 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
 
return 0;
 }
+
+bool arch_pkeys_enabled(void)
+{
+   return boot_cpu_has(X86_FEATURE_OSPKE);
+}
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
 /*
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697..960dbab 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1306,11 +1306,3 @@ static int __init register_kernel_offset_dumper(void)
return 0;
 }
 __initcall(register_kernel_offset_dumper);
-
-void arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
-{
-   if (!boot_cpu_has(X86_FEATURE_OSPKE))
-   return;
-
-   seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
-}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3c7..9ce0097 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -18,10 +18,12 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 #define SEQ_PUT_DEC(str, val) \
@@ -728,12 +730,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long 
hmask,
 }
 #endif /* HUGETLB_PAGE */
 
-void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
-{
-}
-
 #define SEQ_PUT_DEC(str, val) \
seq_put_decimal_ull_width(m, str, (val) >> 10, 8)
+
 static int show_smap(struct seq_file *m, void *v, int is_pid)
 {
struct proc_maps_private *priv = m->private;
@@ -836,9 +835,11 @@ static int show_smap(struct seq_file *m, void *v, int 
is_pid)
seq_puts(m, " kB\n");
}
if (!rollup_mode) {
-   arch_show_smap(m, vma);
+   if (arch_pkeys_enabled())
+   seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
show_smap_vma_flags(m, vma);
}
+
m_cache_vma(m, vma);
return ret;
 }
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 0794ca7..49dff15 100644
--- a/include/linux/pkeys.h
+++ 

[PATCH v13 3/3] mm, powerpc, x86: introduce an additional vma bit for powerpc pkey

2018-05-04 Thread Ram Pai
Only 4 bits are allocated in the vma flags to hold 16 keys. This is
sufficient on x86. PowerPC supports 32 keys, which need 5 bits.
Allocate an additional bit.

cc: Dave Hansen 
cc: Michael Ellermen 
cc: Benjamin Herrenschmidt 
cc: Andrew Morton 
Reviewed-by: Ingo Molnar 
Acked-by: Balbir Singh 
Signed-off-by: Ram Pai 
---
 fs/proc/task_mmu.c |1 +
 include/linux/mm.h |8 +++-
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0c9e392..3c7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -679,6 +679,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct 
vm_area_struct *vma)
[ilog2(VM_PKEY_BIT1)]   = "",
[ilog2(VM_PKEY_BIT2)]   = "",
[ilog2(VM_PKEY_BIT3)]   = "",
+   [ilog2(VM_PKEY_BIT4)]   = "",
 #endif /* CONFIG_ARCH_HAS_PKEYS */
};
size_t i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c6a6f24..cca67d1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -230,10 +230,16 @@ extern int overcommit_kbytes_handler(struct ctl_table *, 
int, void __user *,
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0  VM_HIGH_ARCH_0  /* A protection key is a 4-bit value */
+/* Protection key is a 4-bit value on x86 and 5-bit value on ppc64   */
+# define VM_PKEY_BIT0  VM_HIGH_ARCH_0
 # define VM_PKEY_BIT1  VM_HIGH_ARCH_1
 # define VM_PKEY_BIT2  VM_HIGH_ARCH_2
 # define VM_PKEY_BIT3  VM_HIGH_ARCH_3
+#if defined(CONFIG_PPC)
+# define VM_PKEY_BIT4  VM_HIGH_ARCH_4
+#else 
+# define VM_PKEY_BIT4  0
+#endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
 #if defined(CONFIG_X86)
-- 
1.7.1



[PATCH v13 1/3] mm, powerpc, x86: define VM_PKEY_BITx bits if CONFIG_ARCH_HAS_PKEYS is enabled

2018-05-04 Thread Ram Pai
VM_PKEY_BITx are defined only if CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
is enabled. Powerpc also needs these bits. Hence let's define the
VM_PKEY_BITx bits for any architecture that enables
CONFIG_ARCH_HAS_PKEYS.

cc: Michael Ellermen 
cc: Benjamin Herrenschmidt 
cc: Andrew Morton 
Reviewed-by: Dave Hansen 
Signed-off-by: Ram Pai 
Reviewed-by: Ingo Molnar 
Reviewed-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pkeys.h |2 ++
 fs/proc/task_mmu.c   |4 ++--
 include/linux/mm.h   |9 +
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 3a9b82b..425b181 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -26,6 +26,8 @@
 # define VM_PKEY_BIT2  VM_HIGH_ARCH_2
 # define VM_PKEY_BIT3  VM_HIGH_ARCH_3
 # define VM_PKEY_BIT4  VM_HIGH_ARCH_4
+#elif !defined(VM_PKEY_BIT4)
+# define VM_PKEY_BIT4  VM_HIGH_ARCH_4
 #endif
 
 #define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | VM_PKEY_BIT1 | VM_PKEY_BIT2 | \
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 65ae546..0c9e392 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -673,13 +673,13 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
[ilog2(VM_MERGEABLE)]   = "mg",
[ilog2(VM_UFFD_MISSING)]= "um",
[ilog2(VM_UFFD_WP)] = "uw",
-#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_ARCH_HAS_PKEYS
/* These come out via ProtectionKey: */
[ilog2(VM_PKEY_BIT0)]   = "",
[ilog2(VM_PKEY_BIT1)]   = "",
[ilog2(VM_PKEY_BIT2)]   = "",
[ilog2(VM_PKEY_BIT3)]   = "",
-#endif
+#endif /* CONFIG_ARCH_HAS_PKEYS */
};
size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06..c6a6f24 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -228,15 +228,16 @@ extern int overcommit_kbytes_handler(struct ctl_table *, 
int, void __user *,
 #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
-#if defined(CONFIG_X86)
-# define VM_PAT VM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
+#ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0  VM_HIGH_ARCH_0  /* A protection key is a 4-bit value */
 # define VM_PKEY_BIT1  VM_HIGH_ARCH_1
 # define VM_PKEY_BIT2  VM_HIGH_ARCH_2
 # define VM_PKEY_BIT3  VM_HIGH_ARCH_3
-#endif
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+
+#if defined(CONFIG_X86)
# define VM_PAT VM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
 #elif defined(CONFIG_PPC)
# define VM_SAO VM_ARCH_1   /* Strong Access Ordering 
(powerpc) */
 #elif defined(CONFIG_PARISC)
-- 
1.7.1



[PATCH v13 0/3] mm, x86, powerpc: Enhancements to Memory Protection Keys.

2018-05-04 Thread Ram Pai
This patch series provides arch-neutral enhancements to
enable memory keys on new architectures, and the corresponding
changes in x86- and powerpc-specific code to support that.

a) Provides the ability to support up to 32 keys.  PowerPC
can handle 32 keys and hence needs this.

b) Arch-neutral code, and not the arch-specific code,
   determines the format of the string that displays the key
   for each vma in smaps.

History:
---
version 14:
(1) made VM_PKEY_BIT4 unusable on x86, #defined it to 0
-- comment by Dave Hansen
(2) for some reason this patch series continues to
  break one build or another. The last series
  passed everything but created a merge
  conflict followed by a build failure for
  Michael Ellerman. :(

version v13:
(1) fixed a git bisect error. :(

version v12:
(1) fixed compilation errors seen with various x86
configs.
version v11:
(1) the code that displays the key in smaps is no longer
defined under CONFIG_ARCH_HAS_PKEYS.
- Comment by Eric W. Biederman and Michal Hocko
(2) merged two patches that implemented (1).
- comment by Michal Hocko

version prior to v11:
(1) used one additional bit from VM_HIGH_ARCH_*
to support 32 keys.
- Suggestion by Dave Hansen.
(2) powerpc specific changes to support memory keys.


Ram Pai (3):
  mm, powerpc, x86: define VM_PKEY_BITx bits if CONFIG_ARCH_HAS_PKEYS
is enabled
  mm, powerpc, x86: introduce an additional vma bit for powerpc pkey
  mm, x86, powerpc: display pkey in smaps only if arch supports pkeys

 arch/powerpc/include/asm/mmu_context.h |5 -
 arch/powerpc/include/asm/pkeys.h   |2 ++
 arch/x86/include/asm/mmu_context.h |5 -
 arch/x86/include/asm/pkeys.h   |1 +
 arch/x86/kernel/fpu/xstate.c   |5 +
 arch/x86/kernel/setup.c|8 
 fs/proc/task_mmu.c |   16 +---
 include/linux/mm.h |   15 +++
 include/linux/pkeys.h  |7 ++-
 9 files changed, 34 insertions(+), 30 deletions(-)



Re: [PATCH v3] powerpc, pkey: make protection key 0 less special

2018-05-04 Thread Ram Pai
On Fri, May 04, 2018 at 02:31:10PM -0700, Dave Hansen wrote:
> On 05/04/2018 02:26 PM, Michal Suchánek wrote:
> > If it is not ok to change permissions of pkey 0 is it ok to free it?
> 
> It's pretty much never OK to free it on x86 or ppc.  But, we're not
> going to put code in to keep userspace from shooting itself in the foot,
> at least on x86.

and on powerpc aswell.


-- 
Ram Pai



Re: [PATCH v3] powerpc, pkey: make protection key 0 less special

2018-05-04 Thread Dave Hansen
On 05/04/2018 02:26 PM, Michal Suchánek wrote:
> If it is not ok to change permissions of pkey 0 is it ok to free it?

It's pretty much never OK to free it on x86 or ppc.  But, we're not
going to put code in to keep userspace from shooting itself in the foot,
at least on x86.


Re: [PATCH v3] powerpc, pkey: make protection key 0 less special

2018-05-04 Thread Michal Suchánek
On Fri,  4 May 2018 12:22:58 -0700
"Ram Pai"  wrote:

> Applications need the ability to associate an address-range with some
> key and later revert to its initial default key. Pkey-0 comes close
> to providing this function but falls short, because the current
> implementation disallows applications from explicitly associating
> pkey-0 with the address range.
> 
> Let's make pkey-0 less special and treat it almost like any other key.
> Thus it can be explicitly associated with any address range, and can
> be freed. This gives the application more flexibility and power.  The
> ability to free pkey-0 must be used responsibly, since pkey-0 is
> associated with almost all address-ranges by default.
> 
> Even with this change pkey-0 continues to be slightly more special
> from the following point of view.
> (a) it is implicitly allocated.
> (b) it is the default key assigned to any address-range.
> (c) its permissions cannot be modified by userspace.
> 
> NOTE: (c) is specific to powerpc only. pkey-0 is associated by default
> with all pages including kernel pages, and pkeys are also active in
> kernel mode. If any permission is denied on pkey-0, the kernel running
> in the context of the application will be unable to operate.

If it is not ok to change permissions of pkey 0 is it ok to free it?

Thanks

Michal
> 
> Tested on powerpc.
> 
> cc: Thomas Gleixner 
> cc: Dave Hansen 
> cc: Michael Ellermen 
> cc: Ingo Molnar 
> cc: Andrew Morton 
> Signed-off-by: Ram Pai 
> ---
> History:
> 
>   v3: . Corrected a comment in arch_set_user_pkey_access().
>   . Clarified the header, to capture the notion that
> pkey-0 permissions cannot be modified by userspace on
> powerpc. -- comment from Thiago
> 
>   v2: . mm_pkey_is_allocated() continued to treat pkey-0
> special. fixed it.
> 
>  arch/powerpc/include/asm/pkeys.h |   22 ++
>  arch/powerpc/mm/pkeys.c  |   26 +++---
>  2 files changed, 33 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pkeys.h
> b/arch/powerpc/include/asm/pkeys.h index 0409c80..31a6976 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -101,10 +101,14 @@ static inline u16 pte_to_pkey_bits(u64 pteflags)
>  
>  static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int
> pkey) {
> - /* A reserved key is never considered as 'explicitly
> allocated' */
> - return ((pkey < arch_max_pkey()) &&
> - !__mm_pkey_is_reserved(pkey) &&
> - __mm_pkey_is_allocated(mm, pkey));
> + if (pkey < 0 || pkey >= arch_max_pkey())
> + return false;
> +
> + /* Reserved keys are never allocated. */
> + if (__mm_pkey_is_reserved(pkey))
> + return false;
> +
> + return __mm_pkey_is_allocated(mm, pkey);
>  }
>  
>  extern void __arch_activate_pkey(int pkey);
> @@ -200,6 +204,16 @@ static inline int
> arch_set_user_pkey_access(struct task_struct *tsk, int pkey, {
>   if (static_branch_likely(_disabled))
>   return -EINVAL;
> +
> + /*
> +  * userspace should not change pkey-0 permissions.
> +  * pkey-0 is associated with every page in the kernel.
> +  * If userspace denies any permission on pkey-0, the
> +  * kernel cannot operate.
> +  */
> + if (!pkey)
> + return init_val ? -EINVAL : 0;
> +
>   return __arch_set_user_pkey_access(tsk, pkey, init_val);
>  }
>  
> diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
> index 0eafdf0..d6873b4 100644
> --- a/arch/powerpc/mm/pkeys.c
> +++ b/arch/powerpc/mm/pkeys.c
> @@ -119,16 +119,21 @@ int pkey_initialize(void)
>  #else
>   os_reserved = 0;
>  #endif
> + /* Bits are in LE format. */
>   initial_allocation_mask = ~0x0;
> +
> + /* register mask is in BE format */
>   pkey_amr_uamor_mask = ~0x0ul;
>   pkey_iamr_mask = ~0x0ul;
> - /*
> -  * key 0, 1 are reserved.
> -  * key 0 is the default key, which allows read/write/execute.
> -  * key 1 is recommended not to be used. PowerISA(3.0) page
> 1015,
> -  * programming note.
> -  */
> - for (i = 2; i < (pkeys_total - os_reserved); i++) {
> +
> + for (i = 0; i < (pkeys_total - os_reserved); i++) {
> + /*
> +  * key 1 is recommended not to be used.
> +  * PowerISA(3.0) page 1015,
> +  */
> + if (i == 1)
> + continue;
> +
>   initial_allocation_mask &= ~(0x1 << i);
>   pkey_amr_uamor_mask &= ~(0x3ul << pkeyshift(i));
>   pkey_iamr_mask &= ~(0x1ul << pkeyshift(i));
> @@ -142,7 +147,9 @@ void pkey_mm_init(struct mm_struct *mm)
>  {
>   if (static_branch_likely(&pkey_disabled))
>   return;
> - mm_pkey_allocation_map(mm) = 

Re: [PATCH v3] powerpc, pkey: make protection key 0 less special

2018-05-04 Thread Ram Pai
On Fri, May 04, 2018 at 12:59:27PM -0700, Dave Hansen wrote:
> On 05/04/2018 12:22 PM, Ram Pai wrote:
> > @@ -407,9 +414,6 @@ static bool pkey_access_permitted(int pkey, bool write, 
> > bool execute)
> > int pkey_shift;
> > u64 amr;
> >  
> > -   if (!pkey)
> > -   return true;
> > -
> > if (!is_pkey_enabled(pkey))
> > return true;
> 
> Looks fine to me.  Obviously doesn't have any impact on x86 or the
> generic code.
> 
> One question, though.  Which other check makes up for this removed !pkey
> check?

is_pkey_enabled() does take care of it.  We do not enable userspace to
change permissions on pkey-0. This information is tracked in the
UAMOR register.  is_pkey_enabled() refers to UAMOR to determine
whether the given key is modifiable by userspace. Since UAMOR has the bit
corresponding to key-0 set to 0, is_pkey_enabled(key-0) will return
false.

The deleted code above would have done the same job without
referring to UAMOR. However, having special checks on pkey-0 makes
pkey-0 special. It defeats the purpose of this patch, which is to make
pkey-0 less special :).
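
For reference, is_pkey_enabled() boils down to a UAMOR test, roughly
(simplified sketch of arch/powerpc/mm/pkeys.c; details may differ by
kernel version):

        static bool is_pkey_enabled(int pkey)
        {
                u64 uamor = read_uamor();
                u64 pkey_bits = 0x3ul << pkeyshift(pkey);

                /* UAMOR bits for pkey-0 are never set, so this is false for it */
                return !!(uamor & pkey_bits);
        }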


-- 
Ram Pai



[PATCH ] powerpc/pkeys: Detach execute_only key on !PROT_EXEC

2018-05-04 Thread Ram Pai
Disassociate the execute_only pkey from a VMA if the VMA's protection
is no longer PROT_EXEC.  Otherwise the execute_only key continues to be
associated with the vma, causing unexpected behavior.

The problem was reported on x86 by Shakeel Butt and is also
applicable to powerpc.

cc: Shakeel Butt 
Reported-by: Shakeel Butt 
Fixes: 5586cf6 ("powerpc: introduce execute-only pkey")
Signed-off-by: Ram Pai 
---
 arch/powerpc/mm/pkeys.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index e81d59e..fdeb9f5 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -425,9 +425,9 @@ int __arch_override_mprotect_pkey(struct vm_area_struct 
*vma, int prot,
 {
/*
 * If the currently associated pkey is execute-only, but the requested
-* protection requires read or write, move it back to the default pkey.
+* protection is not execute-only, move it back to the default pkey.
 */
-   if (vma_is_pkey_exec_only(vma) && (prot & (PROT_READ | PROT_WRITE)))
+   if (vma_is_pkey_exec_only(vma) && (prot != PROT_EXEC))
return 0;
 
/*
-- 
1.7.1



Re: [PATCH v3] powerpc, pkey: make protection key 0 less special

2018-05-04 Thread Dave Hansen
On 05/04/2018 12:22 PM, Ram Pai wrote:
> @@ -407,9 +414,6 @@ static bool pkey_access_permitted(int pkey, bool write, 
> bool execute)
>   int pkey_shift;
>   u64 amr;
>  
> - if (!pkey)
> - return true;
> -
>   if (!is_pkey_enabled(pkey))
>   return true;

Looks fine to me.  Obviously doesn't have any impact on x86 or the
generic code.

One question, though.  Which other check makes up for this removed !pkey
check?


[PATCH v2] powerpc: do not allow userspace to modify execute-only pkey

2018-05-04 Thread Ram Pai
When mprotect(,PROT_EXEC) is called, the kernel allocates an
execute-only pkey and associates the pkey with the given address range.
The permissions of this key should not be modifiable from userspace.
However, a bug in the current implementation leaves the permissions on
the key modifiable from userspace.

Whenever a key is allocated through mm_pkey_alloc(), the kernel programs
the UAMOR register to allow userspace to change permissions on the key.
This is fine for keys explicitly allocated through sys_pkey_alloc(),
but for the execute-only pkey it must be disallowed. Restructure the
code to fix the bug.

cc: Thiago Jung Bauermann 
cc: Michael Ellermen 

Signed-off-by: Ram Pai 
---
History:

v2: Thiago noticed a bug -- __execute_only_pkey() will always fail
since it calls is_pkey_enabled(), which always returns false
for the execute_only key. is_pkey_enabled() returns false
because the UAMOR bits for the execute_only key are never set.
Fixed it.


 arch/powerpc/include/asm/pkeys.h |   24 
 arch/powerpc/mm/pkeys.c  |   57 ++---
 2 files changed, 52 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 31a6976..3a9b82b 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -113,6 +113,8 @@ static inline bool mm_pkey_is_allocated(struct mm_struct 
*mm, int pkey)
 
 extern void __arch_activate_pkey(int pkey);
 extern void __arch_deactivate_pkey(int pkey);
+extern int __mm_pkey_alloc(struct mm_struct *mm);
+
 /*
  * Returns a positive, 5-bit key on success, or -1 on failure.
  * Relies on the mmap_sem to protect against concurrency in mm_pkey_alloc() and
@@ -120,29 +122,14 @@ static inline bool mm_pkey_is_allocated(struct mm_struct 
*mm, int pkey)
  */
 static inline int mm_pkey_alloc(struct mm_struct *mm)
 {
-   /*
-* Note: this is the one and only place we make sure that the pkey is
-* valid as far as the hardware is concerned. The rest of the kernel
-* trusts that only good, valid pkeys come out of here.
-*/
-   u32 all_pkeys_mask = (u32)(~(0x0));
int ret;
 
if (static_branch_likely(&pkey_disabled))
return -1;
 
+   ret = __mm_pkey_alloc(mm);
/*
-* Are we out of pkeys? We must handle this specially because ffz()
-* behavior is undefined if there are no zeros.
-*/
-   if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
-   return -1;
-
-   ret = ffz((u32)mm_pkey_allocation_map(mm));
-   __mm_pkey_allocated(mm, ret);
-
-   /*
-* Enable the key in the hardware
+* Enable userspace to modify the key permissions.
 */
if (ret > 0)
__arch_activate_pkey(ret);
@@ -158,7 +145,8 @@ static inline int mm_pkey_free(struct mm_struct *mm, int 
pkey)
return -EINVAL;
 
/*
-* Disable the key in the hardware
+* Reset the key and disable userspace
+* from modifying the key permissions.
 */
__arch_deactivate_pkey(pkey);
__mm_pkey_free(mm, pkey);
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index d6873b4..e81d59e 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -190,6 +190,9 @@ static inline void write_uamor(u64 value)
mtspr(SPRN_UAMOR, value);
 }
 
+/*
+ * return true if userspace can modify the pkey permissions.
+ */
 static bool is_pkey_enabled(int pkey)
 {
u64 uamor = read_uamor();
@@ -228,7 +231,10 @@ static void pkey_status_change(int pkey, bool enable)
init_amr(pkey, 0x0);
init_iamr(pkey, 0x0);
 
-   /* Enable/disable key */
+   /*
+* Enable/disable userspace to/from modifying the permissions
+* on the key
+*/
old_uamor = read_uamor();
if (enable)
old_uamor |= (0x3ul << pkeyshift(pkey));
@@ -247,19 +253,35 @@ void __arch_deactivate_pkey(int pkey)
pkey_status_change(pkey, false);
 }
 
-/*
- * Set the access rights in AMR IAMR and UAMOR registers for @pkey to that
- * specified in @init_val.
- */
-int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+int __mm_pkey_alloc(struct mm_struct *mm)
+{
+   /*
+* Note: this is the one and only place we make sure that the pkey is
+* valid as far as the hardware is concerned. The rest of the kernel
+* trusts that only good, valid pkeys come out of here.
+*/
+   u32 all_pkeys_mask = (u32)(~(0x0));
+   int ret;
+
+   /*
+* Are we out of pkeys? We must handle this specially because ffz()
+* behavior is undefined if there are no zeros.
+*/
+   if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+   return -1;
+
+   ret = 

[PATCH v3] powerpc, pkey: make protection key 0 less special

2018-05-04 Thread Ram Pai
Applications need the ability to associate an address-range with some
key and later revert to its initial default key. Pkey-0 comes close to
providing this function but falls short, because the current
implementation disallows applications from explicitly associating
pkey-0 with the address range.

Let's make pkey-0 less special and treat it almost like any other key.
Thus it can be explicitly associated with any address range, and can be
freed. This gives the application more flexibility and power.  The
ability to free pkey-0 must be used responsibly, since pkey-0 is
associated with almost all address-ranges by default.

Even with this change pkey-0 continues to be slightly more special
from the following point of view.
(a) it is implicitly allocated.
(b) it is the default key assigned to any address-range.
(c) its permissions cannot be modified by userspace.

NOTE: (c) is specific to powerpc only. pkey-0 is associated by default
with all pages including kernel pages, and pkeys are also active in
kernel mode. If any permission is denied on pkey-0, the kernel running
in the context of the application will be unable to operate.
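
As an illustration only (glibc pkey_alloc()/pkey_mprotect() wrappers;
addr/len are placeholders and error handling is omitted), the userspace
flow this enables looks roughly like:

        int key = pkey_alloc(0, 0);
        pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, key); /* tag the range */
        ...
        pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, 0);   /* explicitly revert
                                                                   to default pkey-0 */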

Tested on powerpc.

cc: Thomas Gleixner 
cc: Dave Hansen 
cc: Michael Ellermen 
cc: Ingo Molnar 
cc: Andrew Morton 
Signed-off-by: Ram Pai 
---
History:

v3: . Corrected a comment in arch_set_user_pkey_access().
. Clarified the header, to capture the notion that
  pkey-0 permissions cannot be modified by userspace on powerpc.
-- comment from Thiago

v2: . mm_pkey_is_allocated() continued to treat pkey-0 special.
fixed it.

 arch/powerpc/include/asm/pkeys.h |   22 ++
 arch/powerpc/mm/pkeys.c  |   26 +++---
 2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0409c80..31a6976 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -101,10 +101,14 @@ static inline u16 pte_to_pkey_bits(u64 pteflags)
 
 static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
-   /* A reserved key is never considered as 'explicitly allocated' */
-   return ((pkey < arch_max_pkey()) &&
-   !__mm_pkey_is_reserved(pkey) &&
-   __mm_pkey_is_allocated(mm, pkey));
+   if (pkey < 0 || pkey >= arch_max_pkey())
+   return false;
+
+   /* Reserved keys are never allocated. */
+   if (__mm_pkey_is_reserved(pkey))
+   return false;
+
+   return __mm_pkey_is_allocated(mm, pkey);
 }
 
 extern void __arch_activate_pkey(int pkey);
@@ -200,6 +204,16 @@ static inline int arch_set_user_pkey_access(struct 
task_struct *tsk, int pkey,
 {
if (static_branch_likely(&pkey_disabled))
return -EINVAL;
+
+   /*
+* userspace should not change pkey-0 permissions.
+* pkey-0 is associated with every page in the kernel.
+* If userspace denies any permission on pkey-0, the
+* kernel cannot operate.
+*/
+   if (!pkey)
+   return init_val ? -EINVAL : 0;
+
return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 0eafdf0..d6873b4 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -119,16 +119,21 @@ int pkey_initialize(void)
 #else
os_reserved = 0;
 #endif
+   /* Bits are in LE format. */
initial_allocation_mask = ~0x0;
+
+   /* register mask is in BE format */
pkey_amr_uamor_mask = ~0x0ul;
pkey_iamr_mask = ~0x0ul;
-   /*
-* key 0, 1 are reserved.
-* key 0 is the default key, which allows read/write/execute.
-* key 1 is recommended not to be used. PowerISA(3.0) page 1015,
-* programming note.
-*/
-   for (i = 2; i < (pkeys_total - os_reserved); i++) {
+
+   for (i = 0; i < (pkeys_total - os_reserved); i++) {
+   /*
+* key 1 is recommended not to be used.
+* PowerISA(3.0) page 1015,
+*/
+   if (i == 1)
+   continue;
+
initial_allocation_mask &= ~(0x1 << i);
pkey_amr_uamor_mask &= ~(0x3ul << pkeyshift(i));
pkey_iamr_mask &= ~(0x1ul << pkeyshift(i));
@@ -142,7 +147,9 @@ void pkey_mm_init(struct mm_struct *mm)
 {
if (static_branch_likely(&pkey_disabled))
return;
-   mm_pkey_allocation_map(mm) = initial_allocation_mask;
+
+   /* allocate key-0 by default */
+   mm_pkey_allocation_map(mm) = initial_allocation_mask | 0x1;
/* -1 means unallocated or invalid */
mm->context.execute_only_pkey = -1;
 }
@@ -407,9 +414,6 @@ static bool 

[PATCH 2/2] powerpc/ptrace: Disable array-bounds warning with gcc8

2018-05-04 Thread Khem Raj
This masks the new gcc8 warning

regset.h:270:4: error: 'memcpy' offset [-527, -529] is out
of the bounds [0, 16] of object 'vrsave' with type 'union <anonymous>'

Signed-off-by: Khem Raj 
Cc: Benjamin Herrenschmidt 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 7ac5a68ad6b1..ab159a34704a 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -4,6 +4,7 @@
 #
 
 CFLAGS_ptrace.o+= -DUTS_MACHINE='"$(UTS_MACHINE)"' $(call 
cc-disable-warning, attribute-alias)
+CFLAGS_ptrace.o+= $(call cc-disable-warning, array-bounds)
 CFLAGS_syscalls.o  += $(call cc-disable-warning, attribute-alias)
 
 subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
-- 
2.17.0



[PATCH 1/2] powerpc: Disable attribute-alias warnings from gcc8

2018-05-04 Thread Khem Raj
Fixes "alias between functions of incompatible types" warnings,
which are new with gcc8.

Signed-off-by: Khem Raj 
Cc: Benjamin Herrenschmidt 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 2b4c40b255e4..7ac5a68ad6b1 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -3,7 +3,8 @@
 # Makefile for the linux kernel.
 #
 
-CFLAGS_ptrace.o+= -DUTS_MACHINE='"$(UTS_MACHINE)"'
+CFLAGS_ptrace.o+= -DUTS_MACHINE='"$(UTS_MACHINE)"' $(call 
cc-disable-warning, attribute-alias)
+CFLAGS_syscalls.o  += $(call cc-disable-warning, attribute-alias)
 
 subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
-- 
2.17.0



[PATCH 11/11] powerpc/time: account broadcast timer event interrupts separately

2018-05-04 Thread Nicholas Piggin
These are not local timer interrupts but IPIs. It's good to be able
to see how timer offloading is behaving, so split these out into
their own category.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/hardirq.h | 1 +
 arch/powerpc/kernel/irq.c  | 6 ++
 arch/powerpc/kernel/time.c | 5 +
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hardirq.h 
b/arch/powerpc/include/asm/hardirq.h
index 5986d473722b..20b01897ea5d 100644
--- a/arch/powerpc/include/asm/hardirq.h
+++ b/arch/powerpc/include/asm/hardirq.h
@@ -8,6 +8,7 @@
 typedef struct {
unsigned int __softirq_pending;
unsigned int timer_irqs_event;
+   unsigned int broadcast_irqs_event;
unsigned int timer_irqs_others;
unsigned int pmu_irqs;
unsigned int mce_exceptions;
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 6569b593..627db34bb79d 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -518,6 +518,11 @@ int arch_show_interrupts(struct seq_file *p, int prec)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).timer_irqs_event);
 seq_printf(p, "  Local timer interrupts for timer event device\n");
 
+   seq_printf(p, "%*s: ", prec, "BCT");
+   for_each_online_cpu(j)
+   seq_printf(p, "%10u ", per_cpu(irq_stat, 
j).broadcast_irqs_event);
+   seq_printf(p, "  Broadcast timer interrupts for timer event device\n");
+
seq_printf(p, "%*s: ", prec, "LOC");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(irq_stat, j).timer_irqs_others);
@@ -577,6 +582,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
 {
u64 sum = per_cpu(irq_stat, cpu).timer_irqs_event;
 
+   sum += per_cpu(irq_stat, cpu).broadcast_irqs_event;
sum += per_cpu(irq_stat, cpu).pmu_irqs;
sum += per_cpu(irq_stat, cpu).mce_exceptions;
sum += per_cpu(irq_stat, cpu).spurious_irqs;
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 23921f7b6e67..ed6b2abdde15 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -647,13 +647,10 @@ EXPORT_SYMBOL(timer_interrupt);
 void timer_broadcast_interrupt(void)
 {
u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
-   struct pt_regs *regs = get_irq_regs();
 
-   trace_timer_interrupt_entry(regs);
*next_tb = ~(u64)0;
tick_receive_broadcast();
-   __this_cpu_inc(irq_stat.timer_irqs_event);
-   trace_timer_interrupt_exit(regs);
+   __this_cpu_inc(irq_stat.broadcast_irqs_event);
 }
 #endif
 
-- 
2.17.0



[PATCH 10/11] powerpc: move a stray NMI IPI case under NMI_IPI ifdef

2018-05-04 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/smp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 914708eeb43f..28ec1638a540 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -193,7 +193,9 @@ const char *smp_ipi_name[] = {
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
 #endif
+#ifdef CONFIG_NMI_IPI
[PPC_MSG_NMI_IPI] = "nmi ipi",
+#endif
 };
 
 /* optional function to request ipi, for controllers with >= 4 ipis */
-- 
2.17.0



[PATCH 09/11] powerpc: move timer broadcast code under GENERIC_CLOCKEVENTS_BROADCAST ifdef

2018-05-04 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/smp.c  | 8 
 arch/powerpc/kernel/time.c | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5441a47701b1..914708eeb43f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -155,11 +155,13 @@ static irqreturn_t reschedule_action(int irq, void *data)
return IRQ_HANDLED;
 }
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
timer_broadcast_interrupt();
return IRQ_HANDLED;
 }
+#endif
 
 #ifdef CONFIG_NMI_IPI
 static irqreturn_t nmi_ipi_action(int irq, void *data)
@@ -172,7 +174,9 @@ static irqreturn_t nmi_ipi_action(int irq, void *data)
 static irq_handler_t smp_ipi_action[] = {
[PPC_MSG_CALL_FUNCTION] =  call_function_action,
[PPC_MSG_RESCHEDULE] = reschedule_action,
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
[PPC_MSG_TICK_BROADCAST] = tick_broadcast_ipi_action,
+#endif
 #ifdef CONFIG_NMI_IPI
[PPC_MSG_NMI_IPI] = nmi_ipi_action,
 #endif
@@ -186,7 +190,9 @@ static irq_handler_t smp_ipi_action[] = {
 const char *smp_ipi_name[] = {
[PPC_MSG_CALL_FUNCTION] =  "ipi call function",
[PPC_MSG_RESCHEDULE] = "ipi reschedule",
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
[PPC_MSG_TICK_BROADCAST] = "ipi tick-broadcast",
+#endif
[PPC_MSG_NMI_IPI] = "nmi ipi",
 };
 
@@ -277,8 +283,10 @@ irqreturn_t smp_ipi_demux_relaxed(void)
generic_smp_call_function_interrupt();
if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
scheduler_ipi();
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
timer_broadcast_interrupt();
+#endif
 #ifdef CONFIG_NMI_IPI
if (all & IPI_MESSAGE(PPC_MSG_NMI_IPI))
nmi_ipi_action(0, NULL);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 5862a3611795..23921f7b6e67 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -643,6 +643,7 @@ void timer_interrupt(struct pt_regs *regs)
 }
 EXPORT_SYMBOL(timer_interrupt);
 
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 void timer_broadcast_interrupt(void)
 {
u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
@@ -654,6 +655,7 @@ void timer_broadcast_interrupt(void)
__this_cpu_inc(irq_stat.timer_irqs_event);
trace_timer_interrupt_exit(regs);
 }
+#endif
 
 /*
  * Hypervisor decrementer interrupts shouldn't occur but are sometimes
-- 
2.17.0



[PATCH 08/11] powerpc: allow soft-NMI watchdog to cover timer interrupts with large decrementers

2018-05-04 Thread Nicholas Piggin
Large decrementers (e.g., POWER9) can take a very long time to wrap,
so when the timer interrupt handler sets the decrementer to max so as
to avoid taking another decrementer interrupt when hard enabling
interrupts before running timers, it effectively disables the soft
NMI coverage for timer interrupts.

Fix this by using the traditional 31-bit value instead, which wraps
after a few seconds. The masked interrupt code does the same thing, and
in normal operation neither of these paths would ever wrap even the
31-bit value.

Note: the SMP watchdog should catch timer interrupt lockups, but it
is preferable for the local soft-NMI to catch them, mainly to avoid
the IPI.
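
(For scale, assuming the usual 512 MHz timebase, 0x7fffffff decrementer
ticks is 2^31 / 512000000 ~= 4.2 seconds, which is where the "about 4
seconds on most systems" figure in the comment below comes from.)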

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/time.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index ad876906f847..5862a3611795 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -578,22 +578,29 @@ void timer_interrupt(struct pt_regs *regs)
struct pt_regs *old_regs;
u64 now;
 
-   /* Ensure a positive value is written to the decrementer, or else
-* some CPUs will continue to take decrementer exceptions.
-*/
-   set_dec(decrementer_max);
-
/* Some implementations of hotplug will get timer interrupts while
 * offline, just ignore these and we also need to set
 * decrementers_next_tb as MAX to make sure __check_irq_replay
 * don't replay timer interrupt when return, otherwise we'll trap
 * here infinitely :(
 */
-   if (!cpu_online(smp_processor_id())) {
+   if (unlikely(!cpu_online(smp_processor_id()))) {
*next_tb = ~(u64)0;
+   set_dec(decrementer_max);
return;
}
 
+   /* Ensure a positive value is written to the decrementer, or else
+* some CPUs will continue to take decrementer exceptions. When the
+* PPC_WATCHDOG (decrementer based) is configured, keep this at most
+* 31 bits, which is about 4 seconds on most systems, which gives
+* the watchdog a chance of catching timer interrupt hard lockups.
+*/
+   if (IS_ENABLED(CONFIG_PPC_WATCHDOG))
+   set_dec(0x7fffffff);
+   else
+   set_dec(decrementer_max);
+
/* Conditionally hard-enable interrupts now that the DEC has been
 * bumped to its maximum value
 */
-- 
2.17.0



[PATCH 07/11] powerpc: generic clockevents broadcast receiver call tick_receive_broadcast

2018-05-04 Thread Nicholas Piggin
The broadcast tick recipient can call tick_receive_broadcast rather
than re-running the full timer interrupt.

It does not have to check for the next event time, because the sender
already determined the timer has expired. It does not have to test
irq_work_pending, because that's a direct decrementer interrupt and
does not go through the clock events subsystem. And it does not have
to read PURR because that was removed with the previous patch.

This results in no code size change, but both the decrementer and
broadcast path lengths are reduced.

Cc: Srivatsa S. Bhat 
Cc: Preeti U Murthy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/hw_irq.h |  1 +
 arch/powerpc/include/asm/time.h   |  1 -
 arch/powerpc/kernel/smp.c |  4 +-
 arch/powerpc/kernel/time.c| 84 ++-
 4 files changed, 42 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index fbc2d83808aa..46fe8307fc43 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -55,6 +55,7 @@ extern void replay_system_reset(void);
 extern void __replay_interrupt(unsigned int vector);
 
 extern void timer_interrupt(struct pt_regs *);
+extern void timer_broadcast_interrupt(void);
 extern void performance_monitor_exception(struct pt_regs *regs);
 extern void WatchdogException(struct pt_regs *regs);
 extern void unknown_exception(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index c965c79765c4..69b89f941252 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -28,7 +28,6 @@ extern struct clock_event_device decrementer_clockevent;
 
 struct rtc_time;
 extern void to_tm(int tim, struct rtc_time * tm);
-extern void tick_broadcast_ipi_handler(void);
 
 extern void generic_calibrate_decr(void);
 extern void hdec_interrupt(struct pt_regs *regs);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9ca7148b5881..5441a47701b1 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -157,7 +157,7 @@ static irqreturn_t reschedule_action(int irq, void *data)
 
 static irqreturn_t tick_broadcast_ipi_action(int irq, void *data)
 {
-   tick_broadcast_ipi_handler();
+   timer_broadcast_interrupt();
return IRQ_HANDLED;
 }
 
@@ -278,7 +278,7 @@ irqreturn_t smp_ipi_demux_relaxed(void)
if (all & IPI_MESSAGE(PPC_MSG_RESCHEDULE))
scheduler_ipi();
if (all & IPI_MESSAGE(PPC_MSG_TICK_BROADCAST))
-   tick_broadcast_ipi_handler();
+   timer_broadcast_interrupt();
 #ifdef CONFIG_NMI_IPI
if (all & IPI_MESSAGE(PPC_MSG_NMI_IPI))
nmi_ipi_action(0, NULL);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 1fe6a24357e7..ad876906f847 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -567,47 +567,16 @@ void arch_irq_work_raise(void)
 
 #endif /* CONFIG_IRQ_WORK */
 
-static void __timer_interrupt(void)
-{
-   struct pt_regs *regs = get_irq_regs();
-   u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
-   struct clock_event_device *evt = this_cpu_ptr(&decrementers);
-   u64 now;
-
-   trace_timer_interrupt_entry(regs);
-
-   if (test_irq_work_pending()) {
-   clear_irq_work_pending();
-   irq_work_run();
-   }
-
-   now = get_tb_or_rtc();
-   if (now >= *next_tb) {
-   *next_tb = ~(u64)0;
-   if (evt->event_handler)
-   evt->event_handler(evt);
-   __this_cpu_inc(irq_stat.timer_irqs_event);
-   } else {
-   now = *next_tb - now;
-   if (now <= decrementer_max)
-   set_dec(now);
-   /* We may have raced with new irq work */
-   if (test_irq_work_pending())
-   set_dec(1);
-   __this_cpu_inc(irq_stat.timer_irqs_others);
-   }
-
-   trace_timer_interrupt_exit(regs);
-}
-
 /*
  * timer_interrupt - gets called when the decrementer overflows,
  * with interrupts disabled.
  */
-void timer_interrupt(struct pt_regs * regs)
+void timer_interrupt(struct pt_regs *regs)
 {
-   struct pt_regs *old_regs;
+   struct clock_event_device *evt = this_cpu_ptr(&decrementers);
u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
+   struct pt_regs *old_regs;
+   u64 now;
 
/* Ensure a positive value is written to the decrementer, or else
 * some CPUs will continue to take decrementer exceptions.
@@ -638,13 +607,47 @@ void timer_interrupt(struct pt_regs * regs)
 
old_regs = set_irq_regs(regs);
irq_enter();
+   trace_timer_interrupt_entry(regs);
+
+   if (test_irq_work_pending()) {
+   clear_irq_work_pending();
+ 

[PATCH 06/11] powerpc/pseries: lparcfg calculate PURR on demand

2018-05-04 Thread Nicholas Piggin
For SPLPAR, lparcfg provides a sum of PURR registers for all CPUs.
Currently this is done by reading PURR in context switch and timer
interrupt, and storing that into a per-CPU variable. These are summed
to provide the value.

This does not work with all timer schemes (e.g., NO_HZ_FULL), and it
is sub-optimal for performance because it reads the PURR register on
every context switch, although that's been difficult to distinguish
from noise in the context_switch microbenchmark.

This patch implements the sum by calling a function on each CPU, to
read and add PURR values of each CPU.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/time.h  |  8 
 arch/powerpc/kernel/process.c| 14 --
 arch/powerpc/kernel/time.c   |  8 
 arch/powerpc/platforms/pseries/lparcfg.c | 18 ++
 4 files changed, 10 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index db546c034905..c965c79765c4 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -196,14 +196,6 @@ extern u64 mulhdu(u64, u64);
 extern void div128_by_32(u64 dividend_high, u64 dividend_low,
 unsigned divisor, struct div_result *dr);
 
-/* Used to store Processor Utilization register (purr) values */
-
-struct cpu_usage {
-u64 current_tb;  /* Holds the current purr register values */
-};
-
-DECLARE_PER_CPU(struct cpu_usage, cpu_usage_array);
-
 extern void secondary_cpu_time_init(void);
 extern void __init time_init(void);
 
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ff7344d996e3..e6ff36923d84 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -845,10 +845,6 @@ bool ppc_breakpoint_available(void)
 }
 EXPORT_SYMBOL_GPL(ppc_breakpoint_available);
 
-#ifdef CONFIG_PPC64
-DEFINE_PER_CPU(struct cpu_usage, cpu_usage_array);
-#endif
-
 static inline bool hw_brk_match(struct arch_hw_breakpoint *a,
  struct arch_hw_breakpoint *b)
 {
@@ -1181,16 +1177,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
 
WARN_ON(!irqs_disabled());
 
-#ifdef CONFIG_PPC64
-   /*
-* Collect processor utilization data per process
-*/
-   if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-   struct cpu_usage *cu = this_cpu_ptr(&cpu_usage_array);
-   cu->current_tb = mfspr(SPRN_PURR);
-   }
-#endif /* CONFIG_PPC64 */
-
 #ifdef CONFIG_PPC_BOOK3S_64
batch = this_cpu_ptr(&ppc64_tlb_batch);
if (batch->active) {
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index e7e8611e8863..1fe6a24357e7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -597,14 +597,6 @@ static void __timer_interrupt(void)
__this_cpu_inc(irq_stat.timer_irqs_others);
}
 
-#ifdef CONFIG_PPC64
-   /* collect purr register values often, for accurate calculations */
-   if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-   struct cpu_usage *cu = this_cpu_ptr(&cpu_usage_array);
-   cu->current_tb = mfspr(SPRN_PURR);
-   }
-#endif
-
trace_timer_interrupt_exit(regs);
 }
 
diff --git a/arch/powerpc/platforms/pseries/lparcfg.c 
b/arch/powerpc/platforms/pseries/lparcfg.c
index c508c938dc71..7c872dc01bdb 100644
--- a/arch/powerpc/platforms/pseries/lparcfg.c
+++ b/arch/powerpc/platforms/pseries/lparcfg.c
@@ -52,18 +52,20 @@
  * Track sum of all purrs across all processors. This is used to further
  * calculate usage values by different applications
  */
+static void cpu_get_purr(void *arg)
+{
+   atomic64_t *sum = arg;
+
+   atomic64_add(mfspr(SPRN_PURR), sum);
+}
+
 static unsigned long get_purr(void)
 {
-   unsigned long sum_purr = 0;
-   int cpu;
+   atomic64_t purr = ATOMIC64_INIT(0);
 
-   for_each_possible_cpu(cpu) {
-   struct cpu_usage *cu;
+   on_each_cpu(cpu_get_purr, &purr, 1);
 
-   cu = &per_cpu(cpu_usage_array, cpu);
-   sum_purr += cu->current_tb;
-   }
-   return sum_purr;
+   return atomic64_read(&purr);
 }
 
 /*
-- 
2.17.0



[PATCH 05/11] powerpc/64: remove start_tb and accum_tb from thread_struct

2018-05-04 Thread Nicholas Piggin
These fields are only written to.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/processor.h | 4 
 arch/powerpc/kernel/process.c| 6 +-
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index c4b36a494a63..eff269adfa71 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -264,10 +264,6 @@ struct thread_struct {
struct thread_fp_state  *fp_save_area;
int fpexc_mode; /* floating-point exception mode */
unsigned intalign_ctl;  /* alignment handling control */
-#ifdef CONFIG_PPC64
-   unsigned long   start_tb;   /* Start purr when proc switched in */
-   unsigned long   accum_tb;   /* Total accumulated purr for process */
-#endif
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
struct perf_event *ptrace_bps[HBP_NUM];
/*
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 1237f13fed51..ff7344d996e3 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1187,11 +1187,7 @@ struct task_struct *__switch_to(struct task_struct *prev,
 */
if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
struct cpu_usage *cu = this_cpu_ptr(&cpu_usage_array);
-   long unsigned start_tb, current_tb;
-   start_tb = old_thread->start_tb;
-   cu->current_tb = current_tb = mfspr(SPRN_PURR);
-   old_thread->accum_tb += (current_tb - start_tb);
-   new_thread->start_tb = current_tb;
+   cu->current_tb = mfspr(SPRN_PURR);
}
 #endif /* CONFIG_PPC64 */
 
-- 
2.17.0



[PATCH 04/11] powerpc/64s: micro-optimise __hard_irq_enable() for mtmsrd L=1 support

2018-05-04 Thread Nicholas Piggin
Book3S minimum supported ISA version now requires mtmsrd L=1. This
instruction does not require bits other than RI and EE to be supplied,
so __hard_irq_enable() and __hard_irq_disable() do not have to read
the kernel_msr value from the paca.

Interrupt entry code already relies on L=1 support.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/hw_irq.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index 8004d7887ff6..fbc2d83808aa 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -228,8 +228,8 @@ static inline bool arch_irqs_disabled(void)
 #define __hard_irq_enable()asm volatile("wrteei 1" : : : "memory")
 #define __hard_irq_disable()   asm volatile("wrteei 0" : : : "memory")
 #else
-#define __hard_irq_enable()__mtmsrd(local_paca->kernel_msr | MSR_EE, 1)
-#define __hard_irq_disable()   __mtmsrd(local_paca->kernel_msr, 1)
+#define __hard_irq_enable()__mtmsrd(MSR_EE|MSR_RI, 1)
+#define __hard_irq_disable()   __mtmsrd(MSR_RI, 1)
 #endif
 
 #define hard_irq_disable() do {\
-- 
2.17.0



[PATCH 03/11] powerpc/64s: make PACA_IRQ_HARD_DIS track MSR[EE] closely

2018-05-04 Thread Nicholas Piggin
When the masked interrupt handler clears MSR[EE] for an interrupt in
the PACA_IRQ_MUST_HARD_MASK set, it does not set PACA_IRQ_HARD_DIS.
This makes them get out of synch.

With that taken into account, it's only low level irq manipulation
(and interrupt entry before reconcile) where they can be out of synch.
This makes the code less surprising.

It also allows the IRQ replay code to rely on the IRQ_HARD_DIS value
and not have to mtmsrd again in this case (e.g., for an external
interrupt that has been masked). The bigger benefit might just be
that there is not such an element of surprise in these two bits of
state.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/hw_irq.h| 10 ++
 arch/powerpc/kernel/entry_64.S   |  8 
 arch/powerpc/kernel/exceptions-64s.S |  5 -
 arch/powerpc/kernel/irq.c| 28 +++-
 4 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index 855e17d158b1..8004d7887ff6 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -248,14 +248,16 @@ static inline bool lazy_irq_pending(void)
 
 /*
  * This is called by asynchronous interrupts to conditionally
- * re-enable hard interrupts when soft-disabled after having
- * cleared the source of the interrupt
+ * re-enable hard interrupts after having cleared the source
+ * of the interrupt. They are kept disabled if there is a different
+ * soft-masked interrupt pending that requires hard masking.
  */
 static inline void may_hard_irq_enable(void)
 {
-   get_paca()->irq_happened &= ~PACA_IRQ_HARD_DIS;
-   if (!(get_paca()->irq_happened & PACA_IRQ_MUST_HARD_MASK))
+   if (!(get_paca()->irq_happened & PACA_IRQ_MUST_HARD_MASK)) {
+   get_paca()->irq_happened &= ~PACA_IRQ_HARD_DIS;
__hard_irq_enable();
+   }
 }
 
 static inline bool arch_irq_disabled_regs(struct pt_regs *regs)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 51695608c68b..db4df061c33a 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -973,6 +973,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
or  r4,r4,r3
std r4,_TRAP(r1)
 
+   /*
+* PACA_IRQ_HARD_DIS won't always be set here, so set it now
+* to reconcile the IRQ state. Tracing is already accounted for.
+*/
+   ld  r4,PACAIRQHAPPENED(r13)
+   ori r4,r4,PACA_IRQ_HARD_DIS
+   stb r4,PACAIRQHAPPENED(r13)
+
/*
 * Then find the right handler and call it. Interrupts are
 * still soft-disabled and we keep them that way.
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index ae6a849db60b..69172dd41b11 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1498,7 +1498,10 @@ masked_##_H##interrupt:  
\
mfspr   r10,SPRN_##_H##SRR1;\
xorir10,r10,MSR_EE; /* clear MSR_EE */  \
mtspr   SPRN_##_H##SRR1,r10;\
-2: mtcrf   0x80,r9;\
+   ori r11,r11,PACA_IRQ_HARD_DIS;  \
+   stb r11,PACAIRQHAPPENED(r13);   \
+2: /* done */  \
+   mtcrf   0x80,r9;\
ld  r9,PACA_EXGEN+EX_R9(r13);   \
ld  r10,PACA_EXGEN+EX_R10(r13); \
ld  r11,PACA_EXGEN+EX_R11(r13); \
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 061aa0f47bb1..6569b593 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -145,8 +145,20 @@ notrace unsigned int __check_irq_replay(void)
trace_hardirqs_on();
trace_hardirqs_off();
 
+   /*
+* We are always hard disabled here, but PACA_IRQ_HARD_DIS may
+* not be set, which means interrupts have only just been hard
+* disabled as part of the local_irq_restore or interrupt return
+* code. In that case, skip the decrementer check because it's
+* expensive to read the TB.
+*
+* HARD_DIS then gets cleared here, but it's reconciled later.
+* Either local_irq_disable will replay the interrupt and that
+* will reconcile state like other hard interrupts. Or interrupt
+* return will replay the interrupt and in that case it sets
+* PACA_IRQ_HARD_DIS by hand (see comments in entry_64.S).
+*/
if (happened & PACA_IRQ_HARD_DIS) {
-   /* Clear bit 0 which we wouldn't clear otherwise */
local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
 
/*
@@ -256,16 +268,14 @@ notrace void arch_local_irq_restore(unsigned long mask)
 * __check_irq_replay(). We 

[PATCH 02/11] powerpc/pseries: put cede MSR[EE] check under IRQ_SOFT_MASK_DEBUG

2018-05-04 Thread Nicholas Piggin
This check does not catch IRQ soft mask bugs, but this option is
slightly more suitable than TRACE_IRQFLAGS.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/plpar_wrappers.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 96c1a46acbd0..cff5a411e595 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -39,10 +39,10 @@ static inline long extended_cede_processor(unsigned long 
latency_hint)
set_cede_latency_hint(latency_hint);
 
rc = cede_processor();
-#ifdef CONFIG_TRACE_IRQFLAGS
-   /* Ensure that H_CEDE returns with IRQs on */
-   if (WARN_ON(!(mfmsr() & MSR_EE)))
-   __hard_irq_enable();
+#ifdef CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
+   /* Ensure that H_CEDE returns with IRQs on */
+   if (WARN_ON(!(mfmsr() & MSR_EE)))
+   __hard_irq_enable();
 #endif
 
set_cede_latency_hint(old_latency_hint);
-- 
2.17.0



[PATCH 01/11] powerpc/64: irq_work avoid interrupt when called with hardware irqs enabled

2018-05-04 Thread Nicholas Piggin
irq_work_raise should not cause a decrementer exception unless it is
called from NMI context. Doing so often just results in an immediate
masked decrementer interrupt:

   <...>-55090d...4us : update_curr_rt <-dequeue_task_rt
   <...>-55090d...5us : dbs_update_util_handler <-update_curr_rt
   <...>-55090d...6us : arch_irq_work_raise <-irq_work_queue
   <...>-55090d...7us : soft_nmi_interrupt <-soft_nmi_common
   <...>-55090d...7us : printk_nmi_enter <-soft_nmi_interrupt
   <...>-55090d.Z.8us : rcu_nmi_enter <-soft_nmi_interrupt
   <...>-55090d.Z.9us : rcu_nmi_exit <-soft_nmi_interrupt
   <...>-55090d...9us : printk_nmi_exit <-soft_nmi_interrupt
   <...>-55090d...   10us : cpuacct_charge <-update_curr_rt

The soft_nmi_interrupt here is the call into the watchdog, due to the
decrementer interrupt firing with irqs soft-disabled. This is
harmless, but sub-optimal.

When it's not called from NMI context or with interrupts enabled, mark
the decrementer pending in the irq_happened mask directly, rather than
having the masked decrementer interrupt handler do it. This will be
replayed at the next local_irq_enable. See the comment for details.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/time.c | 33 +++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 360e71d455cc..e7e8611e8863 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -513,6 +513,35 @@ static inline void clear_irq_work_pending(void)
"i" (offsetof(struct paca_struct, irq_work_pending)));
 }
 
+void arch_irq_work_raise(void)
+{
+   preempt_disable();
+   set_irq_work_pending_flag();
+   /*
+* Non-nmi code running with interrupts disabled will replay
+* irq_happened before it re-enables interrupts, so set the
+* decrementer there instead of causing a hardware exception
+* which would immediately hit the masked interrupt handler
+* and have the net effect of setting the decrementer in
+* irq_happened.
+*
+* NMI interrupts can not check this when they return, so the
+* decrementer hardware exception is raised, which will fire
+* when interrupts are next enabled.
+*
+* BookE does not support this yet, it must audit all NMI
+* interrupt handlers to ensure they call nmi_enter() so this
+* check would be correct.
+*/
+   if (IS_ENABLED(CONFIG_BOOKE) || !irqs_disabled() || in_nmi()) {
+   set_dec(1);
+   } else {
+   hard_irq_disable();
+   local_paca->irq_happened |= PACA_IRQ_DEC;
+   }
+   preempt_enable();
+}
+
 #else /* 32-bit */
 
 DEFINE_PER_CPU(u8, irq_work_pending);
@@ -521,8 +550,6 @@ DEFINE_PER_CPU(u8, irq_work_pending);
 #define test_irq_work_pending()
__this_cpu_read(irq_work_pending)
 #define clear_irq_work_pending()   __this_cpu_write(irq_work_pending, 0)
 
-#endif /* 32 vs 64 bit */
-
 void arch_irq_work_raise(void)
 {
preempt_disable();
@@ -531,6 +558,8 @@ void arch_irq_work_raise(void)
preempt_enable();
 }
 
+#endif /* 32 vs 64 bit */
+
 #else  /* CONFIG_IRQ_WORK */
 
 #define test_irq_work_pending()0
-- 
2.17.0
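
For context, the 64-bit helpers that the new arch_irq_work_raise() relies on,
set_irq_work_pending_flag() and test_irq_work_pending(), are not part of the
hunk above. By analogy with clear_irq_work_pending() visible in the context
lines, they look roughly like this (a sketch; the exact asm constraints in
time.c are assumed, not quoted):

static inline unsigned long test_irq_work_pending(void)
{
	unsigned long x;

	/* r13 is the paca pointer on 64-bit */
	asm volatile("lbz %0,%1(13)"
		: "=r" (x)
		: "i" (offsetof(struct paca_struct, irq_work_pending)));
	return x;
}

static inline void set_irq_work_pending_flag(void)
{
	asm volatile("stb %0,%1(13)" : :
		"r" (1),
		"i" (offsetof(struct paca_struct, irq_work_pending)));
}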



[PATCH 00/11] assortment of timer, watchdog, interrupt

2018-05-04 Thread Nicholas Piggin
These are a bunch of small things I've built up from looking
through code trying to track down some rare irq latency issues.

None of them actually fix any long irq latencies, but they
hopefully make the code a bit neater, get rid of some small
glitches, increase watchdog coverage etc.

Ben spotted a bug with the first patch last time I posted,
that's fixed.

Thanks,
Nick


Nicholas Piggin (11):
  powerpc/64: irq_work avoid interrupt when called with hardware irqs
enabled
  powerpc/pseries: put cede MSR[EE] check under IRQ_SOFT_MASK_DEBUG
  powerpc/64s: make PACA_IRQ_HARD_DIS track MSR[EE] closely
  powerpc/64s: micro-optimise __hard_irq_enable() for mtmsrd L=1 support
  powerpc/64: remove start_tb and accum_tb from thread_struct
  powerpc/pseries: lparcfg calculate PURR on demand
  powerpc: generic clockevents broadcast receiver call
tick_receive_broadcast
  powerpc: allow soft-NMI watchdog to cover timer interrupts with large
decrementers
  powerpc: move timer broadcast code under GENERIC_CLOCKEVENTS_BROADCAST
ifdef
  powerpc: move a stray NMI IPI case under NMI_IPI ifdef
  powerpc/time: account broadcast timer event interrupts separately

 arch/powerpc/include/asm/hardirq.h|   1 +
 arch/powerpc/include/asm/hw_irq.h |  15 ++-
 arch/powerpc/include/asm/plpar_wrappers.h |   8 +-
 arch/powerpc/include/asm/processor.h  |   4 -
 arch/powerpc/include/asm/time.h   |   9 --
 arch/powerpc/kernel/entry_64.S|   8 ++
 arch/powerpc/kernel/exceptions-64s.S  |   5 +-
 arch/powerpc/kernel/irq.c |  34 +++--
 arch/powerpc/kernel/process.c |  18 ---
 arch/powerpc/kernel/smp.c |  14 ++-
 arch/powerpc/kernel/time.c| 143 +-
 arch/powerpc/platforms/pseries/lparcfg.c  |  18 +--
 12 files changed, 155 insertions(+), 122 deletions(-)

-- 
2.17.0



Re: [PATCH v10 24/25] x86/mm: add speculative pagefault handling

2018-05-04 Thread Punit Agrawal
Laurent Dufour  writes:

> On 30/04/2018 20:43, Punit Agrawal wrote:
>> Hi Laurent,
>> 
>> I am looking to add support for speculative page fault handling to
>> arm64 (effectively porting this patch) and had a few questions.
>> Apologies if I've missed an obvious explanation for my queries. I'm
>> jumping in bit late to the discussion.
>
> Hi Punit,
>
> Thanks for giving this series a review.
> I don't have arm64 hardware to play with, but I'll be happy to add arm64
> patches to my series and to try to maintain them.

I'll be happy to try them on arm64 platforms I have access to and
provide feedback.

>
>> 
>> On Tue, Apr 17, 2018 at 3:33 PM, Laurent Dufour
>>  wrote:
>>> From: Peter Zijlstra 
>>>

[...]

>>>
>>> -   vma = find_vma(mm, address);
>>> +   if (!vma || !can_reuse_spf_vma(vma, address))
>>> +   vma = find_vma(mm, address);
>> 
>> Is there a measurable benefit from reusing the vma?
>> 
>> Dropping the vma reference unconditionally after speculative page
>> fault handling gets rid of the implicit state when "vma != NULL"
>> (increased ref-count). I found it a bit confusing to follow.
>
> I do agree, this is quite confusing. My initial goal was to be able to reuse
> the VMA in the case a protection key error was detected, but it's not really
> necessary on x86 since we know at the beginning of the fault operation that
> protection keys are in the loop. This is not the case on ppc64 but I couldn't
> find a way to easily rely on the speculatively fetched VMA either, so for
> protection keys, this didn't help.
>
> Regarding the measurable benefit of reusing the fetched vma, I did further
> tests using will-it-scale/page_fault2_threads test, and I'm no longer really
> convinced that this is worth the added complexity. I think I'll drop the patch 
> "mm:
> speculative page fault handler return VMA" of the series, and thus remove the
> call to can_reuse_spf_vma().

Makes sense. Thanks for giving this a go.

Punit

[...]
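
For readers following along: with the "return VMA" patch dropped, the arch
fault handler no longer threads a vma through the speculative path. A minimal
sketch of the resulting flow (function name as used in this series; the exact
signature and flag handling are assumed, not taken from the posted patches):

	/*
	 * Sketch only, not the actual series code: the speculative attempt
	 * either handles the fault completely or asks for the classic path;
	 * in the latter case the vma seen during the speculative walk is
	 * simply dropped instead of being reused.
	 */
	fault = handle_speculative_fault(mm, address, flags);
	if (!(fault & VM_FAULT_RETRY))
		goto done;			/* handled without mmap_sem */

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);		/* unconditional lookup */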



Re: [PATCH 1/6] powerpc/syscalls: Switch trivial cases to SYSCALL_DEFINE

2018-05-04 Thread Naveen N. Rao

Michael Ellerman wrote:

From: Al Viro 

Signed-off-by: Al Viro 
---
 arch/powerpc/kernel/pci_32.c   | 6 +++---
 arch/powerpc/kernel/pci_64.c   | 4 ++--
 arch/powerpc/mm/subpage-prot.c | 4 +++-
 arch/powerpc/platforms/cell/spu_syscalls.c | 3 ++-
 4 files changed, 10 insertions(+), 7 deletions(-)



I suppose we can also do this for switch_endian?

diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index 466216506eb2..290265f2700c 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -123,7 +123,7 @@ long ppc_fadvise64_64(int fd, int advice, u32 offset_high, 
u32 offset_low,
 (u64)len_high << 32 | len_low, advice);
}

-long sys_switch_endian(void)
+SYSCALL_DEFINE0(switch_endian)
{
struct thread_info *ti;


- Naveen
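
For completeness, the suggested conversion would look roughly as below. The
body is reproduced from memory of the existing sys_switch_endian() and should
be treated as illustrative; only the declaration style actually changes:

SYSCALL_DEFINE0(switch_endian)
{
	struct thread_info *ti;

	current->thread.regs->msr ^= MSR_LE;

	/*
	 * Set _TIF_RESTOREALL so that r3 isn't clobbered on return to
	 * userspace and the non-volatile GPRs are restored as well.
	 */
	ti = current_thread_info();
	ti->flags |= _TIF_RESTOREALL;

	return 0;
}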




Re: [PATCH 12/17] powerpc/8xx: Remove PTE_ATOMIC_UPDATES

2018-05-04 Thread Joakim Tjernlund
On Fri, 2018-05-04 at 14:34 +0200, Christophe Leroy wrote:
> 
> commit 1bc54c03117b9 ("powerpc: rework 4xx PTE access and TLB miss")
> introduced non atomic PTE updates and started the work of removing
> PTE updates in TLB miss handlers, but kept PTE_ATOMIC_UPDATES for the
> 8xx with the following comment:
> /* Until my rework is finished, 8xx still needs atomic PTE updates */
> 
> commit fe11dc3f9628e ("powerpc/8xx: Update TLB asm so it behaves as linux
> mm expects") removed all PTE updates done in TLB miss handlers

Is that my 7 year old commit ?

> 
> Therefore, atomic PTE updates are not needed anymore for the 8xx

About time removing atomic updates then :)

> 
> Signed-off-by: Christophe Leroy 
> 


[PATCH 2/2] powerpc/trace: Update syscall name matching logic to account for ppc_ prefix

2018-05-04 Thread Naveen N. Rao
Some syscall entry functions on powerpc are prefixed with
ppc_/ppc32_/ppc64_ rather than the usual sys_/__se_sys prefix. fork(),
clone(), swapcontext() are some examples of syscalls with such entry
points. We need to match against these names when initializing ftrace
syscall tracing.

Signed-off-by: Naveen N. Rao 
---
Tracing fork() and clone() was working so far since those had "ppc_" 
prefix and we were jumping past the initial 3 characters to account for 
"SyS_" prefix. We were not enabling tracing for ppc32_/ppc64_ prefixed 
syscall entry points so far.

- Naveen

 arch/powerpc/include/asm/ftrace.h | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 731cb4314a42..331b17b2e46e 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -67,13 +67,30 @@ struct dyn_arch_ftrace {
 #endif
 
 #if defined(CONFIG_FTRACE_SYSCALLS) && !defined(__ASSEMBLY__)
-#ifdef PPC64_ELF_ABI_v1
+/*
+ * Some syscall entry functions on powerpc start with "ppc_" (fork and clone,
+ * for instance) or ppc32_/ppc64_. We should also match the sys_ variant with
+ * those.
+ */
 #define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
+#ifdef PPC64_ELF_ABI_v1
 static inline bool arch_syscall_match_sym_name(const char *sym, const char 
*name)
 {
/* We need to skip past the initial dot, and the __se_sys alias */
return !strcmp(sym + 1, name) ||
-   (!strncmp(sym, ".__se_sys", 9) && !strcmp(sym + 6, name));
+   (!strncmp(sym, ".__se_sys", 9) && !strcmp(sym + 6, name)) ||
+   (!strncmp(sym, ".ppc_", 5) && !strcmp(sym + 5, name + 4)) ||
+   (!strncmp(sym, ".ppc32_", 7) && !strcmp(sym + 7, name + 4)) ||
+   (!strncmp(sym, ".ppc64_", 7) && !strcmp(sym + 7, name + 4));
+}
+#else
+static inline bool arch_syscall_match_sym_name(const char *sym, const char 
*name)
+{
+   return !strcmp(sym, name) ||
+   (!strncmp(sym, "__se_sys", 8) && !strcmp(sym + 5, name)) ||
+   (!strncmp(sym, "ppc_", 4) && !strcmp(sym + 4, name + 4)) ||
+   (!strncmp(sym, "ppc32_", 6) && !strcmp(sym + 6, name + 4)) ||
+   (!strncmp(sym, "ppc64_", 6) && !strcmp(sym + 6, name + 4));
 }
 #endif
 #endif /* CONFIG_FTRACE_SYSCALLS && !__ASSEMBLY__ */
-- 
2.17.0
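
To make the new matching rules concrete, here is a small userspace sketch that
reproduces the ABIv1 matcher from the patch and shows which symbol/name pairs
it now accepts (the symbol names are made-up examples):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Same logic as the ABIv1 arch_syscall_match_sym_name() in the patch. */
static bool match(const char *sym, const char *name)
{
	return !strcmp(sym + 1, name) ||
		(!strncmp(sym, ".__se_sys", 9) && !strcmp(sym + 6, name)) ||
		(!strncmp(sym, ".ppc_", 5) && !strcmp(sym + 5, name + 4)) ||
		(!strncmp(sym, ".ppc32_", 7) && !strcmp(sym + 7, name + 4)) ||
		(!strncmp(sym, ".ppc64_", 7) && !strcmp(sym + 7, name + 4));
}

int main(void)
{
	printf("%d\n", match(".sys_mmap", "sys_mmap"));			/* 1 */
	printf("%d\n", match(".__se_sys_read", "sys_read"));		/* 1 */
	printf("%d\n", match(".ppc_fork", "sys_fork"));			/* 1 */
	printf("%d\n", match(".ppc64_personality", "sys_personality"));/* 1 */
	return 0;
}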



[PATCH 1/2] powerpc/trace elfv1: Update syscall name matching logic

2018-05-04 Thread Naveen N. Rao
On powerpc64 ABIv1, we are enabling syscall tracing for only ~20
syscalls. This is due to commit e145242ea0df6 ("syscalls/core,
syscalls/x86: Clean up syscall stub naming convention") which has
changed the syscall entry wrapper prefix from "SyS" to "__se_sys".

Update the logic for ABIv1 to not just skip the initial dot, but also
the "__se_sys" prefix.

Fixes: commit e145242ea0df6 ("syscalls/core, syscalls/x86: Clean up syscall 
stub naming convention")
Reported-by: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ftrace.h | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index f7a23c2dce74..731cb4314a42 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -71,13 +71,9 @@ struct dyn_arch_ftrace {
 #define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
 static inline bool arch_syscall_match_sym_name(const char *sym, const char 
*name)
 {
-   /*
-* Compare the symbol name with the system call name. Skip the .sys or 
.SyS
-* prefix from the symbol name and the sys prefix from the system call 
name and
-* just match the rest. This is only needed on ppc64 since symbol names 
on
-* 32bit do not start with a period so the generic function will work.
-*/
-   return !strcmp(sym + 4, name + 3);
+   /* We need to skip past the initial dot, and the __se_sys alias */
+   return !strcmp(sym + 1, name) ||
+   (!strncmp(sym, ".__se_sys", 9) && !strcmp(sym + 6, name));
 }
 #endif
 #endif /* CONFIG_FTRACE_SYSCALLS && !__ASSEMBLY__ */
-- 
2.17.0



[PATCH v3] On ppc64le we HAVE_RELIABLE_STACKTRACE

2018-05-04 Thread Torsten Duwe

The "Power Architecture 64-Bit ELF V2 ABI" says in section 2.3.2.3:

[...] There are several rules that must be adhered to in order to ensure
reliable and consistent call chain backtracing:

* Before a function calls any other function, it shall establish its
  own stack frame, whose size shall be a multiple of 16 bytes.

 – In instances where a function’s prologue creates a stack frame, the
   back-chain word of the stack frame shall be updated atomically with
   the value of the stack pointer (r1) when a back chain is implemented.
   (This must be supported as default by all ELF V2 ABI-compliant
   environments.)
[...]
 – The function shall save the link register that contains its return
   address in the LR save doubleword of its caller’s stack frame before
   calling another function.

To me this sounds like the equivalent of HAVE_RELIABLE_STACKTRACE.
This patch may be unnecessarily limited to ppc64le, but OTOH the only
user of this flag so far is livepatching, which is only implemented on
PPCs with 64-LE, a.k.a. ELF ABI v2.

Feel free to add other ppc variants, but so far only ppc64le got tested.

This change also implements save_stack_trace_tsk_reliable() for ppc64le
that checks for the above conditions, where possible.

Signed-off-by: Torsten Duwe 
Signed-off-by: Nicolai Stange 

---
v3:
 * big bunch of fixes, credits go to Nicolai Stange:
   - get the correct return address from the graph tracer,
 should it be active.
 IMO this should be moved into generic code, but I'll
 leave it like this for now, to get things going.
   - bail out on a kretprobe
   - also stop at an exception frame
   - accommodate the different stack layout of the idle task
   - use an even more beautiful test: __kernel_text_address()

v2:
 * implemented save_stack_trace_tsk_reliable(), with a bunch of sanity
   checks. The test for a kernel code pointer is much nicer now, and
   the exit condition is exact (when compared to last week's follow-up)

---
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..54f1daf4f9e5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -220,6 +220,7 @@ config PPC
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
select HAVE_REGS_AND_STACK_ACCESS_API
+   select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index d534ed901538..3d62ecb2587b 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -2,7 +2,7 @@
  * Stack trace utility
  *
  * Copyright 2008 Christoph Hellwig, IBM Corp.
- *
+ * Copyright 2018 SUSE Linux GmbH
  *
  *  This program is free software; you can redistribute it and/or
  *  modify it under the terms of the GNU General Public License
@@ -11,11 +11,16 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Save stack-backtrace addresses into a stack_trace buffer.
@@ -76,3 +81,114 @@ save_stack_trace_regs(struct pt_regs *regs, struct 
stack_trace *trace)
save_context_stack(trace, regs->gpr[1], current, 0);
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_regs);
+
+#ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
+int
+save_stack_trace_tsk_reliable(struct task_struct *tsk,
+   struct stack_trace *trace)
+{
+   unsigned long sp;
+   unsigned long stack_page = (unsigned long)task_stack_page(tsk);
+   unsigned long stack_end;
+   int graph_idx = 0;
+
+   /* The last frame (unwinding first) may not yet have saved
+* its LR onto the stack.
+*/
+   int firstframe = 1;
+
+   if (tsk == current)
+   sp = current_stack_pointer();
+   else
+   sp = tsk->thread.ksp;
+
+   stack_end = stack_page + THREAD_SIZE;
+   if (!is_idle_task(tsk)) {
+   /*
+* For user tasks, this is the SP value loaded on
+* kernel entry, see "PACAKSAVE(r13)" in _switch() and
+* system_call_common()/EXCEPTION_PROLOG_COMMON().
+*
+* Likewise for non-swapper kernel threads,
+* this also happens to be the top of the stack
+* as setup by copy_thread().
+*
+* Note that stack backlinks are not properly setup by
+* copy_thread() and thus, a forked task() will have
+* an unreliable stack trace until it's been
+* _switch()'ed to for the first time.
+*/
+   stack_end -= STACK_FRAME_OVERHEAD + sizeof(struct pt_regs);
+   } else {
+   /*
+* idle tasks have a custom stack layout,
+* 
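
The heart of what the quoted ABI rules buy us is a back-chain walk that can be
validated frame by frame. A stripped-down sketch of that walk (kernel context;
the first-frame, graph-tracer, kretprobe and idle-task special cases handled
by the real patch are omitted, and the LR save slot offset is assumed per the
ELF v2 ABI):

static int walk_reliable(unsigned long sp, unsigned long stack_end,
			 struct stack_trace *trace)
{
	while (sp) {
		unsigned long *frame = (unsigned long *)sp;
		unsigned long newsp = frame[0];			/* back-chain word */
		unsigned long ip = frame[STACK_FRAME_LR_SAVE];	/* saved LR */

		if (!__kernel_text_address(ip))
			return 1;	/* not a kernel text address: unreliable */

		if (trace->nr_entries < trace->max_entries)
			trace->entries[trace->nr_entries++] = ip;

		if (!newsp || newsp >= stack_end)
			break;		/* reached the topmost frame */
		sp = newsp;
	}
	return 0;
}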

Re: [PATCH 1/3] powerpc/nohash: remove hash related code from nohash headers.

2018-05-04 Thread Christophe LEROY

On 04/05/2018 at 13:17, Michael Ellerman wrote:

kbuild test robot  writes:


Hi Christophe,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.17-rc2 next-20180426]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-nohash-remove-hash-related-code-from-nohash-headers/20180425-182026
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-ppc64e_defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

In file included from arch/powerpc/include/asm/nohash/pgtable.h:6:0,
 from arch/powerpc/include/asm/pgtable.h:19,
 from include/linux/memremap.h:8,
 from include/linux/mm.h:27,
 from include/linux/mman.h:5,
 from arch/powerpc/kernel/asm-offsets.c:22:
arch/powerpc/include/asm/nohash/64/pgtable.h: In function 
'__ptep_test_and_clear_young':

arch/powerpc/include/asm/nohash/64/pgtable.h:214:6: error: implicit declaration 
of function 'pte_young'; did you mean 'pte_pud'? 
[-Werror=implicit-function-declaration]

  if (pte_young(*ptep))
  ^
  pte_pud


Urk.

There's a circular dependency here.

I fixed it with the patch below, which seems to be the least worst
solution. Possibly we can clean things up further in future.


Oops, I just sent you a new version of the patch not using pte_young() 
anymore. I'll let you decide what to do.


Christophe



cheers

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 140f8e74b478..987a658b18e1 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -267,6 +267,11 @@ static inline void __ptep_set_access_flags(struct 
mm_struct *mm,
pte_update(ptep, clr, set);
  }
  
+static inline int pte_young(pte_t pte)

+{
+   return pte_val(pte) & _PAGE_ACCESSED;
+}
+
  #define __HAVE_ARCH_PTE_SAME
  #define pte_same(A,B) ((pte_val(A) ^ pte_val(B)) == 0)
  
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h

index e8de7cb4d3fb..6ac8381f4c7a 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -204,6 +204,11 @@ static inline unsigned long pte_update(struct mm_struct 
*mm,
return old;
  }
  
+static inline int pte_young(pte_t pte)

+{
+   return pte_val(pte) & _PAGE_ACCESSED;
+}
+
  static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
  unsigned long addr, pte_t *ptep)
  {
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 077472640b35..2160be2e4339 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -17,7 +17,6 @@ static inline int pte_write(pte_t pte)
  }
  static inline int pte_read(pte_t pte) { return 1; }
  static inline int pte_dirty(pte_t pte){ return pte_val(pte) & 
_PAGE_DIRTY; }
-static inline int pte_young(pte_t pte) { return pte_val(pte) & 
_PAGE_ACCESSED; }
  static inline int pte_special(pte_t pte)  { return pte_val(pte) & 
_PAGE_SPECIAL; }
  static inline int pte_none(pte_t pte) { return (pte_val(pte) & 
~_PTE_NONE_MASK) == 0; }
  static inline pgprot_t pte_pgprot(pte_t pte)  { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }



[PATCH BAD 17/17] powerpc/mm: Use pte_fragment_alloc() on 8xx

2018-05-04 Thread Christophe Leroy
DO NOT APPLY THAT ONE, IT IS BUGGY. But comments are welcome.


In 16k pages mode, the 8xx still needs only 4k for the page table.

This patch makes use of the pte_fragment functions in order
to avoid wasting memory space

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu-8xx.h   |  4 
 arch/powerpc/include/asm/nohash/32/pgalloc.h | 29 +++-
 arch/powerpc/include/asm/nohash/32/pgtable.h |  5 -
 arch/powerpc/mm/mmu_context_nohash.c |  4 
 arch/powerpc/mm/pgtable.c| 10 +-
 arch/powerpc/mm/pgtable_32.c | 12 
 arch/powerpc/platforms/Kconfig.cputype   |  1 +
 7 files changed, 62 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-8xx.h 
b/arch/powerpc/include/asm/mmu-8xx.h
index 193f53116c7a..4f4cb754afd8 100644
--- a/arch/powerpc/include/asm/mmu-8xx.h
+++ b/arch/powerpc/include/asm/mmu-8xx.h
@@ -190,6 +190,10 @@ typedef struct {
struct slice_mask mask_8m;
 # endif
 #endif
+#ifdef CONFIG_NEED_PTE_FRAG
+   /* for 4K PTE fragment support */
+   void *pte_frag;
+#endif
 } mm_context_t;
 
 #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff80000)
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 1c6461e7c6aa..1e3b8f580499 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -93,6 +93,32 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmdp,
((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel_g(pmd, address))? \
NULL: pte_offset_kernel(pmd, address))
 
+#ifdef CONFIG_NEED_PTE_FRAG
+extern pte_t *pte_fragment_alloc(struct mm_struct *, unsigned long, int);
+extern void pte_fragment_free(unsigned long *, int);
+
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
+ unsigned long address)
+{
+   return (pte_t *)pte_fragment_alloc(mm, address, 1);
+}
+
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
+ unsigned long address)
+{
+   return (pgtable_t)pte_fragment_alloc(mm, address, 0);
+}
+
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+   pte_fragment_free((unsigned long *)pte, 1);
+}
+
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
+{
+   pte_fragment_free((unsigned long *)ptepage, 0);
+}
+#else
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
 
@@ -106,11 +132,12 @@ static inline void pte_free(struct mm_struct *mm, 
pgtable_t ptepage)
pgtable_page_dtor(ptepage);
__free_page(ptepage);
 }
+#endif
 
 static inline void pgtable_free(void *table, unsigned index_size)
 {
if (!index_size) {
-   free_page((unsigned long)table);
+   pte_free_kernel(NULL, table);
} else {
BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
kmem_cache_free(PGT_CACHE(index_size), table);
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 3efd616bbc80..e2a22c8dc7f6 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -20,6 +20,9 @@ extern int icache_44x_need_flush;
 
 #if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
 #define PTE_INDEX_SIZE  (PTE_SHIFT - 2)
+#define PTE_FRAG_NR4
+#define PTE_FRAG_SIZE_SHIFT12
+#define PTE_FRAG_SIZE  (1UL << PTE_FRAG_SIZE_SHIFT)
 #else
 #define PTE_INDEX_SIZE PTE_SHIFT
 #endif
@@ -303,7 +306,7 @@ static inline void __ptep_set_access_flags(struct mm_struct 
*mm,
  */
 #ifndef CONFIG_BOOKE
 #define pmd_page_vaddr(pmd)\
-   ((unsigned long) __va(pmd_val(pmd) & PAGE_MASK))
+   ((unsigned long) __va(pmd_val(pmd) & ~(PTE_TABLE_SIZE - 1)))
 #define pmd_page(pmd)  \
pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
 #else
diff --git a/arch/powerpc/mm/mmu_context_nohash.c 
b/arch/powerpc/mm/mmu_context_nohash.c
index e09228a9ad00..8b0ab33673e5 100644
--- a/arch/powerpc/mm/mmu_context_nohash.c
+++ b/arch/powerpc/mm/mmu_context_nohash.c
@@ -390,6 +390,9 @@ int init_new_context(struct task_struct *t, struct 
mm_struct *mm)
 #endif
mm->context.id = MMU_NO_CONTEXT;
mm->context.active = 0;
+#ifdef CONFIG_NEED_PTE_FRAG
+   mm->context.pte_frag = NULL;
+#endif
return 0;
 }
 
@@ -418,6 +421,7 @@ void destroy_context(struct mm_struct *mm)
nr_free_contexts++;
}
raw_spin_unlock_irqrestore(&context_lock, flags);
+   destroy_pagetable_page(mm);
 }
 
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 2d34755ed727..96cc5aa73331 100644
--- a/arch/powerpc/mm/pgtable.c
+++ 

[PATCH 16/17] powerpc/mm: Make pte_fragment_alloc() common to PPC32 and PPC64

2018-05-04 Thread Christophe Leroy
In order to allow the 8xx to handle pte_fragments, this patch
makes it common to PPC32 and PPC64.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu_context.h | 28 ++
 arch/powerpc/mm/mmu_context_book3s64.c | 28 --
 arch/powerpc/mm/pgtable.c  | 67 ++
 arch/powerpc/mm/pgtable_64.c   | 67 --
 arch/powerpc/platforms/Kconfig.cputype |  5 +++
 5 files changed, 100 insertions(+), 95 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 1835ca1505d6..252988f7e219 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -262,5 +262,33 @@ static inline u64 pte_to_hpte_pkey_bits(u64 pteflags)
 
 #endif /* CONFIG_PPC_MEM_KEYS */
 
+#ifdef CONFIG_NEED_PTE_FRAG
+static inline void destroy_pagetable_page(struct mm_struct *mm)
+{
+   int count;
+   void *pte_frag;
+   struct page *page;
+
+   pte_frag = mm->context.pte_frag;
+   if (!pte_frag)
+   return;
+
+   page = virt_to_page(pte_frag);
+   /* drop all the pending references */
+   count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
+   /* We allow PTE_FRAG_NR fragments from a PTE page */
+   if (page_ref_sub_and_test(page, PTE_FRAG_NR - count)) {
+   pgtable_page_dtor(page);
+   free_unref_page(page);
+   }
+}
+
+#else
+static inline void destroy_pagetable_page(struct mm_struct *mm)
+{
+   return;
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index b75194dff64c..2f55a4e3c09a 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -192,34 +192,6 @@ static void destroy_contexts(mm_context_t *ctx)
spin_unlock(_context_lock);
 }
 
-#ifdef CONFIG_PPC_64K_PAGES
-static void destroy_pagetable_page(struct mm_struct *mm)
-{
-   int count;
-   void *pte_frag;
-   struct page *page;
-
-   pte_frag = mm->context.pte_frag;
-   if (!pte_frag)
-   return;
-
-   page = virt_to_page(pte_frag);
-   /* drop all the pending references */
-   count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
-   /* We allow PTE_FRAG_NR fragments from a PTE page */
-   if (page_ref_sub_and_test(page, PTE_FRAG_NR - count)) {
-   pgtable_page_dtor(page);
-   free_unref_page(page);
-   }
-}
-
-#else
-static inline void destroy_pagetable_page(struct mm_struct *mm)
-{
-   return;
-}
-#endif
-
 void destroy_context(struct mm_struct *mm)
 {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9f361ae571e9..2d34755ed727 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -264,3 +264,70 @@ unsigned long vmalloc_to_phys(void *va)
return __pa(pfn_to_kaddr(pfn)) + offset_in_page(va);
 }
 EXPORT_SYMBOL_GPL(vmalloc_to_phys);
+
+#ifdef CONFIG_NEED_PTE_FRAG
+static pte_t *get_from_cache(struct mm_struct *mm)
+{
+   void *pte_frag, *ret;
+
+   spin_lock(&mm->page_table_lock);
+   ret = mm->context.pte_frag;
+   if (ret) {
+   pte_frag = ret + PTE_FRAG_SIZE;
+   /*
+* If we have taken up all the fragments mark PTE page NULL
+*/
+   if (((unsigned long)pte_frag & ~PAGE_MASK) == 0)
+   pte_frag = NULL;
+   mm->context.pte_frag = pte_frag;
+   }
+   spin_unlock(&mm->page_table_lock);
+   return (pte_t *)ret;
+}
+
+static pte_t *__alloc_for_cache(struct mm_struct *mm, int kernel)
+{
+   void *ret = NULL;
+   struct page *page;
+
+   if (!kernel) {
+   page = alloc_page(PGALLOC_GFP | __GFP_ACCOUNT);
+   if (!page)
+   return NULL;
+   if (!pgtable_page_ctor(page)) {
+   __free_page(page);
+   return NULL;
+   }
+   } else {
+   page = alloc_page(PGALLOC_GFP);
+   if (!page)
+   return NULL;
+   }
+
+   ret = page_address(page);
+   spin_lock(&mm->page_table_lock);
+   /*
+* If we find pgtable_page set, we return
+* the allocated page with single fragment
+* count.
+*/
+   if (likely(!mm->context.pte_frag)) {
+   set_page_count(page, PTE_FRAG_NR);
+   mm->context.pte_frag = ret + PTE_FRAG_SIZE;
+   }
+   spin_unlock(&mm->page_table_lock);
+
+   return (pte_t *)ret;
+}
+
+pte_t *pte_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr, int 
kernel)
+{
+   pte_t *pte;
+
+   pte = get_from_cache(mm);
+   if (pte)
+   return pte;
+

[PATCH 15/17] powerpc/8xx: Free up SPRN_SPRG_SCRATCH2

2018-05-04 Thread Christophe Leroy
We can now use SPRN_M_TW in the DAR Fixup code, freeing
SPRN_SPRG_SCRATCH2

Then SPRN_SPRG_SCRATCH2 may be used for something else in the future.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index c98a4ebb5a4d..8e96b526f109 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -624,7 +624,7 @@ InstructionBreakpoint:
  /* define if you don't want to use self modifying code */
 #define NO_SELF_MODIFYING_CODE
 FixupDAR:/* Entry point for dcbx workaround. */
-   mtspr   SPRN_SPRG_SCRATCH2, r10
+   mtspr   SPRN_M_TW, r10
/* fetch instruction from memory. */
mfspr   r10, SPRN_SRR0
mtspr   SPRN_MD_EPN, r10
@@ -668,7 +668,7 @@ _ENTRY(FixupDAR_cmp)
beq+142f
cmpwi   cr0, r10, 1964  /* Is icbi? */
beq+142f
-141:   mfspr   r10,SPRN_SPRG_SCRATCH2
+141:   mfspr   r10,SPRN_M_TW
b   DARFixed/* Nope, go back to normal TLB processing */
 
 200:
@@ -703,7 +703,7 @@ modified_instr:
bne+143f
subfr10,r0,r10  /* r10=r10-r0, only if reg RA is r0 */
 143:   mtdar   r10 /* store faulting EA in DAR */
-   mfspr   r10,SPRN_SPRG_SCRATCH2
+   mfspr   r10,SPRN_M_TW
b   DARFixed/* Go back to normal TLB handling */
 #else
mfctr   r10
@@ -757,7 +757,7 @@ modified_instr:
mfdar   r11
mtctr   r11 /* restore ctr reg from DAR */
mtdar   r10 /* save fault EA to DAR */
-   mfspr   r10,SPRN_SPRG_SCRATCH2
+   mfspr   r10,SPRN_M_TW
b   DARFixed/* Go back to normal TLB handling */
 
/* special handling for r10,r11 since these are modified already */
-- 
2.13.3



[PATCH 14/17] powerpc/8xx: reunify TLB handler routines

2018-05-04 Thread Christophe Leroy
Each handler must not exceed 64 instructions to fit into the main
exception area.
Following the significant size reduction of TLB handler routines,
the side handlers can be brought back close to the main part.

In the worst case:
Main part of ITLB handler is 45 insn, side part is 9 insn ==> total 54
Main part of DTLB handler is 37 insn, side part is 23 insn ==> total 60

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 108 -
 1 file changed, 52 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 4855d5a36f70..c98a4ebb5a4d 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -388,6 +388,23 @@ _ENTRY(itlb_miss_perf)
mfspr   r11, SPRN_SPRG_SCRATCH1
rfi
 
+#ifndef CONFIG_PIN_TLB_TEXT
+ITLBMissLinear:
+   mtcrr11
+   /* Set 8M byte page and mark it valid */
+   li  r11, MI_PS8MEG | MI_SVALID
+   mtspr   SPRN_MI_TWC, r11
+   rlwinm  r10, r10, 20, 0x0f800000  /* 8xx supports max 256Mb RAM */
+   ori r10, r10, 0xf0 | MI_SPS16K | _PAGE_PRIVILEGED | _PAGE_DIRTY | \
+ _PAGE_PRESENT
+   mtspr   SPRN_MI_RPN, r10/* Update TLB entry */
+
+_ENTRY(itlb_miss_exit_2)
+   mfspr   r10, SPRN_SPRG_SCRATCH0
+   mfspr   r11, SPRN_SPRG_SCRATCH1
+   rfi
+#endif
+
. = 0x1200
 DataStoreTLBMiss:
mtspr   SPRN_SPRG_SCRATCH0, r10
@@ -463,6 +480,41 @@ _ENTRY(dtlb_miss_perf)
mfspr   r11, SPRN_SPRG_SCRATCH1
rfi
 
+DTLBMissIMMR:
+   mtcrr11
+   /* Set 512k byte guarded page and mark it valid */
+   li  r10, MD_PS512K | MD_GUARDED | MD_SVALID
+   mtspr   SPRN_MD_TWC, r10
+   mfspr   r10, SPRN_IMMR  /* Get current IMMR */
+   rlwinm  r10, r10, 0, 0xfff80000 /* Get 512 kbytes boundary */
+   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_PRIVILEGED | _PAGE_DIRTY | \
+ _PAGE_PRESENT | _PAGE_NO_CACHE
+   mtspr   SPRN_MD_RPN, r10/* Update TLB entry */
+
+   li  r11, RPN_PATTERN
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+_ENTRY(dtlb_miss_exit_2)
+   mfspr   r10, SPRN_SPRG_SCRATCH0
+   mfspr   r11, SPRN_SPRG_SCRATCH1
+   rfi
+
+DTLBMissLinear:
+   mtcrr11
+   /* Set 8M byte page and mark it valid */
+   li  r11, MD_PS8MEG | MD_SVALID
+   mtspr   SPRN_MD_TWC, r11
+   rlwinm  r10, r10, 20, 0x0f800000  /* 8xx supports max 256Mb RAM */
+   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_PRIVILEGED | _PAGE_DIRTY | \
+ _PAGE_PRESENT
+   mtspr   SPRN_MD_RPN, r10/* Update TLB entry */
+
+   li  r11, RPN_PATTERN
+   mtspr   SPRN_DAR, r11   /* Tag DAR */
+_ENTRY(dtlb_miss_exit_3)
+   mfspr   r10, SPRN_SPRG_SCRATCH0
+   mfspr   r11, SPRN_SPRG_SCRATCH1
+   rfi
+
 /* This is an instruction TLB error on the MPC8xx.  This could be due
  * to many reasons, such as executing guarded memory or illegal instruction
  * addresses.  There is nothing to do but handle a big time error fault.
@@ -565,62 +617,6 @@ InstructionBreakpoint:
 
. = 0x2000
 
-/*
- * Bottom part of DataStoreTLBMiss handlers for IMMR area and linear RAM.
- * not enough space in the DataStoreTLBMiss area.
- */
-DTLBMissIMMR:
-   mtcrr11
-   /* Set 512k byte guarded page and mark it valid */
-   li  r10, MD_PS512K | MD_GUARDED | MD_SVALID
-   mtspr   SPRN_MD_TWC, r10
-   mfspr   r10, SPRN_IMMR  /* Get current IMMR */
-   rlwinm  r10, r10, 0, 0xfff80000 /* Get 512 kbytes boundary */
-   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_PRIVILEGED | _PAGE_DIRTY | \
- _PAGE_PRESENT | _PAGE_NO_CACHE
-   mtspr   SPRN_MD_RPN, r10/* Update TLB entry */
-
-   li  r11, RPN_PATTERN
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
-_ENTRY(dtlb_miss_exit_2)
-   mfspr   r10, SPRN_SPRG_SCRATCH0
-   mfspr   r11, SPRN_SPRG_SCRATCH1
-   rfi
-
-DTLBMissLinear:
-   mtcrr11
-   /* Set 8M byte page and mark it valid */
-   li  r11, MD_PS8MEG | MD_SVALID
-   mtspr   SPRN_MD_TWC, r11
-   rlwinm  r10, r10, 20, 0x0f800000  /* 8xx supports max 256Mb RAM */
-   ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_PRIVILEGED | _PAGE_DIRTY | \
- _PAGE_PRESENT
-   mtspr   SPRN_MD_RPN, r10/* Update TLB entry */
-
-   li  r11, RPN_PATTERN
-   mtspr   SPRN_DAR, r11   /* Tag DAR */
-_ENTRY(dtlb_miss_exit_3)
-   mfspr   r10, SPRN_SPRG_SCRATCH0
-   mfspr   r11, SPRN_SPRG_SCRATCH1
-   rfi
-
-#ifndef CONFIG_PIN_TLB_TEXT
-ITLBMissLinear:
-   mtcrr11
-   /* Set 8M byte page and mark it valid */
-   li  r11, MI_PS8MEG | MI_SVALID
-   mtspr   SPRN_MI_TWC, r11
-   rlwinm  r10, r10, 20, 0x0f80  

[PATCH 13/17] powerpc/mm: Use hardware assistance in TLB handlers on the 8xx

2018-05-04 Thread Christophe Leroy
Today, on the 8xx the TLB handlers do SW tablewalk by doing all
the calculation in ASM, in order to match with the Linux page
table structure.

The 8xx offers hardware assistance which allows significant size
reduction of the TLB handlers, hence also reduces the time spent
in the handlers.

However, using this HW assistance implies some constraints on the
page table structure:
- Regardless of the main page size used (4k or 16k), the
level 1 table (PGD) contains 1024 entries and each PGD entry covers
a 4Mbytes area which is managed by a level 2 table (PTE) containing
also 1024 entries each describing a 4k page.
- 16k pages require 4 identical entries in the L2 table
- 512k pages PTE have to be spread every 128 bytes in the L2 table
- 8M pages PTE are at the address pointed to by the L1 entry and each
8M page requires 2 identical entries in the PGD.

In order to use hardware assistance, this patch does the following
modifications:
- Make PGD size independent of the main page size
- In 16k pages mode, redefine pte_t as a struct with 4 elements,
and populate those 4 elements in __set_pte_at() and pte_update()
- Modify the TLB handlers to use HW assistance
- Adapt the size of the hugepage tables.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/hugetlb.h   |   4 +-
 arch/powerpc/include/asm/nohash/32/pgtable.h |  16 +-
 arch/powerpc/include/asm/nohash/pgtable.h|   4 +
 arch/powerpc/include/asm/pgtable-types.h |   4 +
 arch/powerpc/kernel/head_8xx.S   | 227 +--
 arch/powerpc/mm/8xx_mmu.c|  10 +-
 arch/powerpc/mm/hugetlbpage.c|  12 ++
 7 files changed, 112 insertions(+), 165 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 78540c074d70..6d29be6bac74 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -77,7 +77,9 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned 
long addr,
unsigned long idx = 0;
 
pte_t *dir = hugepd_page(hpd);
-#ifndef CONFIG_PPC_FSL_BOOK3E
+#ifdef CONFIG_PPC_8xx
+   idx = (addr & ((1UL << pdshift) - 1)) >> PAGE_SHIFT;
+#elif !defined(CONFIG_PPC_FSL_BOOK3E)
idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(hpd);
 #endif
 
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 009a5b3d3192..3efd616bbc80 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -18,7 +18,11 @@ extern int icache_44x_need_flush;
 
 #endif /* __ASSEMBLY__ */
 
+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
+#define PTE_INDEX_SIZE  (PTE_SHIFT - 2)
+#else
 #define PTE_INDEX_SIZE PTE_SHIFT
+#endif
 #define PMD_INDEX_SIZE 0
 #define PUD_INDEX_SIZE 0
 #define PGD_INDEX_SIZE (32 - PGDIR_SHIFT)
@@ -47,7 +51,11 @@ extern int icache_44x_need_flush;
  * -Matt
  */
 /* PGDIR_SHIFT determines what a top-level page table entry can map */
+#ifdef CONFIG_PPC_8xx
+#define PGDIR_SHIFT22
+#else
 #define PGDIR_SHIFT(PAGE_SHIFT + PTE_INDEX_SIZE)
+#endif
 #define PGDIR_SIZE (1UL << PGDIR_SHIFT)
 #define PGDIR_MASK (~(PGDIR_SIZE-1))
 
@@ -190,7 +198,13 @@ static inline unsigned long pte_update(pte_t *p,
: "cc" );
 #else /* PTE_ATOMIC_UPDATES */
unsigned long old = pte_val(*p);
-   *p = __pte((old & ~clr) | set);
+   unsigned long new = (old & ~clr) | set;
+
+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
+   p->pte = p->pte1 = p->pte2 = p->pte3 = new;
+#else
+   *p = __pte(new);
+#endif
 #endif /* !PTE_ATOMIC_UPDATES */
 
 #ifdef CONFIG_44x
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 077472640b35..e4b6c084be5c 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -165,7 +165,11 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
/* Anything else just stores the PTE normally. That covers all 64-bit
 * cases, and 32-bit non-hash with 32-bit PTEs.
 */
+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
+   ptep->pte = ptep->pte1 = ptep->pte2 = ptep->pte3 = pte_val(pte);
+#else
*ptep = pte;
+#endif
 
/*
 * With hardware tablewalk, a sync is needed to ensure that
diff --git a/arch/powerpc/include/asm/pgtable-types.h 
b/arch/powerpc/include/asm/pgtable-types.h
index eccb30b38b47..3b0edf041b2e 100644
--- a/arch/powerpc/include/asm/pgtable-types.h
+++ b/arch/powerpc/include/asm/pgtable-types.h
@@ -3,7 +3,11 @@
 #define _ASM_POWERPC_PGTABLE_TYPES_H
 
 /* PTE level */
+#if defined(CONFIG_PPC_8xx) && defined(CONFIG_PPC_16K_PAGES)
+typedef struct { pte_basic_t pte, pte1, pte2, pte3; } pte_t;
+#else
 typedef struct { pte_basic_t pte; } pte_t;
+#endif
 #define __pte(x)   ((pte_t) { (x) })
 static inline 

[PATCH 12/17] powerpc/8xx: Remove PTE_ATOMIC_UPDATES

2018-05-04 Thread Christophe Leroy
commit 1bc54c03117b9 ("powerpc: rework 4xx PTE access and TLB miss")
introduced non atomic PTE updates and started the work of removing
PTE updates in TLB miss handlers, but kept PTE_ATOMIC_UPDATES for the
8xx with the following comment:
/* Until my rework is finished, 8xx still needs atomic PTE updates */

commit fe11dc3f9628e ("powerpc/8xx: Update TLB asm so it behaves as linux
mm expects") removed all PTE updates done in TLB miss handlers

Therefore, atomic PTE updates are not needed anymore for the 8xx

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h 
b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index a9a2919251e0..31401320c1a5 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -54,9 +54,6 @@
 #define _PMD_GUARDED   0x0010
 #define _PMD_USER  0x0020  /* APG 1 */
 
-/* Until my rework is finished, 8xx still needs atomic PTE updates */
-#define PTE_ATOMIC_UPDATES 1
-
 #ifdef CONFIG_PPC_16K_PAGES
 #define _PAGE_PSIZE_PAGE_HUGE
 #endif
-- 
2.13.3



[PATCH 11/17] powerpc/nohash32: set GUARDED attribute in the PMD directly

2018-05-04 Thread Christophe Leroy
On the 8xx, the GUARDED attribute of the pages is managed in the
L1 entry; therefore, to avoid having to copy it into the L1 entry
at each TLB miss, we set it in the PMD.

For this, we split the VM alloc space into two parts: one
for VM alloc and non-Guarded IO, and one for Guarded IO.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/pgalloc.h | 10 ++
 arch/powerpc/include/asm/nohash/32/pgtable.h | 18 --
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |  3 ++-
 arch/powerpc/kernel/head_8xx.S   | 18 +++---
 arch/powerpc/mm/dump_linuxpagetables.c   | 26 --
 arch/powerpc/mm/ioremap.c| 11 ---
 arch/powerpc/mm/mem.c|  9 +
 arch/powerpc/mm/pgtable_32.c | 28 +++-
 arch/powerpc/platforms/Kconfig.cputype   |  3 +++
 9 files changed, 106 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h 
b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 29d37bd1f3b3..1c6461e7c6aa 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -58,6 +58,12 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, 
pmd_t *pmdp,
*pmdp = __pmd(__pa(pte) | _PMD_PRESENT);
 }
 
+static inline void pmd_populate_kernel_g(struct mm_struct *mm, pmd_t *pmdp,
+pte_t *pte)
+{
+   *pmdp = __pmd(__pa(pte) | _PMD_PRESENT | _PMD_GUARDED);
+}
+
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pte_page)
 {
@@ -83,6 +89,10 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmdp,
 #define pmd_pgtable(pmd) pmd_page(pmd)
 #endif
 
+#define pte_alloc_kernel_g(pmd, address)   \
+   ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel_g(pmd, address))? \
+   NULL: pte_offset_kernel(pmd, address))
+
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
 
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 93dc22dbe964..009a5b3d3192 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -69,9 +69,14 @@ extern int icache_44x_need_flush;
  * virtual space that goes below PKMAP and FIXMAP
  */
 #ifdef CONFIG_HIGHMEM
-#define KVIRT_TOP  PKMAP_BASE
+#define _KVIRT_TOP PKMAP_BASE
 #else
-#define KVIRT_TOP  (0xfe00UL)  /* for now, could be FIXMAP_BASE ? */
+#define _KVIRT_TOP (0xfe00UL)  /* for now, could be FIXMAP_BASE ? */
+#endif
+#ifdef CONFIG_PPC_GUARDED_PAGE_IN_PMD
+#define KVIRT_TOP  _ALIGN_DOWN(_KVIRT_TOP, PGDIR_SIZE)
+#else
+#define KVIRT_TOP  _KVIRT_TOP
 #endif
 
 /*
@@ -84,7 +89,11 @@ extern int icache_44x_need_flush;
 #else
 #define IOREMAP_ENDKVIRT_TOP
 #endif
+#ifdef CONFIG_PPC_GUARDED_PAGE_IN_PMD
+#define IOREMAP_BASE   _ALIGN_UP(VMALLOC_BASE + (IOREMAP_END - VMALLOC_BASE) / 
2, PGDIR_SIZE)
+#else
 #define IOREMAP_BASE   VMALLOC_BASE
+#endif
 
 /*
  * Just any arbitrary offset to the start of the vmalloc VM area: the
@@ -103,8 +112,13 @@ extern int icache_44x_need_flush;
 #else
 #define VMALLOC_BASE _ALIGN_DOWN((long)high_memory + VMALLOC_OFFSET, 
VMALLOC_OFFSET)
 #endif
+#ifdef CONFIG_PPC_GUARDED_PAGE_IN_PMD
+#define VMALLOC_START  VMALLOC_BASE
+#define VMALLOC_ENDIOREMAP_BASE
+#else
 #define VMALLOC_START  ioremap_bot
 #define VMALLOC_ENDIOREMAP_END
+#endif
 
 /*
  * Bits in a linux-style PTE.  These match the bits in the
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h 
b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index f04cb46ae8a1..a9a2919251e0 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -47,10 +47,11 @@
 #define _PAGE_RO   0x0600  /* Supervisor RO, User no access */
 
 #define _PMD_PRESENT   0x0001
-#define _PMD_BAD   0x0fd0
+#define _PMD_BAD   0x0fc0
 #define _PMD_PAGE_MASK 0x000c
 #define _PMD_PAGE_8M   0x000c
 #define _PMD_PAGE_512K 0x0004
+#define _PMD_GUARDED   0x0010
 #define _PMD_USER  0x0020  /* APG 1 */
 
 /* Until my rework is finished, 8xx still needs atomic PTE updates */
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index c3b831bb8bad..85b017c67e11 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -345,6 +345,10 @@ _ENTRY(ITLBMiss_cmp)
rlwinm  r10, r10, 32 - (PAGE_SHIFT - 2), 32 - PAGE_SHIFT, 29
 #ifdef CONFIG_HUGETLB_PAGE
mtcrr11
+#endif
+   /* Load the MI_TWC with the attributes for this "segment." */
+   mtspr   SPRN_MI_TWC, r11/* Set segment attributes */
+#ifdef CONFIG_HUGETLB_PAGE
bt- 28, 10f   

[PATCH 10/17] powerpc: use _ALIGN macro

2018-05-04 Thread Christophe Leroy
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index b413abcd5a09..93dc22dbe964 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -99,9 +99,9 @@ extern int icache_44x_need_flush;
  */
 #define VMALLOC_OFFSET (0x1000000) /* 16M */
 #ifdef PPC_PIN_SIZE
-#define VMALLOC_BASE (((_ALIGN((long)high_memory, PPC_PIN_SIZE) + 
VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
+#define VMALLOC_BASE _ALIGN_DOWN(_ALIGN((long)high_memory, PPC_PIN_SIZE) + 
VMALLOC_OFFSET, VMALLOC_OFFSET)
 #else
-#define VMALLOC_BASE ((((long)high_memory + VMALLOC_OFFSET) & 
~(VMALLOC_OFFSET-1)))
+#define VMALLOC_BASE _ALIGN_DOWN((long)high_memory + VMALLOC_OFFSET, 
VMALLOC_OFFSET)
 #endif
 #define VMALLOC_START  ioremap_bot
 #define VMALLOC_ENDIOREMAP_END
-- 
2.13.3
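
The rewrite above is purely cosmetic; a quick userspace check shows the two
forms compute the same value (the _ALIGN_DOWN definition is reproduced here
for the test and assumed to match the kernel's asm/page.h):

#include <stdio.h>

#define _ALIGN_DOWN(addr, size)	((addr) & ~((size) - 1))
#define VMALLOC_OFFSET		0x1000000UL	/* 16M */

int main(void)
{
	unsigned long high_memory = 0xc1234567UL;	/* arbitrary example */
	unsigned long old = ((high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET - 1));
	unsigned long new = _ALIGN_DOWN(high_memory + VMALLOC_OFFSET, VMALLOC_OFFSET);

	printf("%lx %lx\n", old, new);	/* both print c2000000 */
	return 0;
}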



[PATCH 09/17] powerpc: make __ioremap_caller() common to PPC32 and PPC64

2018-05-04 Thread Christophe Leroy
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   1 +
 arch/powerpc/mm/ioremap.c| 126 +++
 2 files changed, 34 insertions(+), 93 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index c5c6ead06bfb..2bebdd8302cb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -18,6 +18,7 @@
 #define _PAGE_RO   0
 #define _PAGE_USER 0
 #define _PAGE_HWWRITE  0
+#define _PAGE_COHERENT 0
 
 #define _PAGE_EXEC 0x1 /* execute permission */
 #define _PAGE_WRITE0x2 /* write access allowed */
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index 65d611d44d38..59be5dfcb3e9 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -33,95 +33,6 @@ unsigned long ioremap_bot;
 unsigned long ioremap_bot = IOREMAP_BASE;
 #endif
 
-#ifdef CONFIG_PPC32
-
-void __iomem *
-__ioremap_caller(phys_addr_t addr, unsigned long size, unsigned long flags,
-void *caller)
-{
-   unsigned long v, i;
-   phys_addr_t p;
-   int err;
-
-   /* Make sure we have the base flags */
-   if ((flags & _PAGE_PRESENT) == 0)
-   flags |= pgprot_val(PAGE_KERNEL);
-
-   /* Non-cacheable page cannot be coherent */
-   if (flags & _PAGE_NO_CACHE)
-   flags &= ~_PAGE_COHERENT;
-
-   /*
-* Choose an address to map it to.
-* Once the vmalloc system is running, we use it.
-* Before then, we use space going up from IOREMAP_BASE
-* (ioremap_bot records where we're up to).
-*/
-   p = addr & PAGE_MASK;
-   size = PAGE_ALIGN(addr + size) - p;
-
-   /*
-* If the address lies within the first 16 MB, assume it's in ISA
-* memory space
-*/
-   if (p < 16*1024*1024)
-   p += _ISA_MEM_BASE;
-
-#ifndef CONFIG_CRASH_DUMP
-   /*
-* Don't allow anybody to remap normal RAM that we're using.
-* mem_init() sets high_memory so only do the check after that.
-*/
-   if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
-   page_is_ram(__phys_to_pfn(p))) {
-   printk("__ioremap(): phys addr 0x%llx is RAM lr %ps\n",
-  (unsigned long long)p, __builtin_return_address(0));
-   return NULL;
-   }
-#endif
-
-   if (size == 0)
-   return NULL;
-
-   /*
-* Is it already mapped?  Perhaps overlapped by a previous
-* mapping.
-*/
-   v = p_block_mapped(p);
-   if (v)
-   goto out;
-
-   if (slab_is_available()) {
-   struct vm_struct *area;
-   area = get_vm_area_caller(size, VM_IOREMAP, caller);
-   if (area == 0)
-   return NULL;
-   area->phys_addr = p;
-   v = (unsigned long) area->addr;
-   } else {
-   v = ioremap_bot;
-   ioremap_bot += size;
-   }
-
-   /*
-* Should check if it is a candidate for a BAT mapping
-*/
-
-   err = 0;
-   for (i = 0; i < size && err == 0; i += PAGE_SIZE)
-   err = map_kernel_page(v+i, p+i, flags);
-   if (err) {
-   if (slab_is_available())
-   vunmap((void *)v);
-   return NULL;
-   }
-
-out:
-   return (void __iomem *) (v + ((unsigned long)addr & ~PAGE_MASK));
-}
-
-#else
-
 /**
  * __ioremap_at - Low level function to establish the page tables
  *for an IO mapping
@@ -135,6 +46,10 @@ void __iomem * __ioremap_at(phys_addr_t pa, void *ea, 
unsigned long size,
if ((flags & _PAGE_PRESENT) == 0)
flags |= pgprot_val(PAGE_KERNEL);
 
+   /* Non-cacheable page cannot be coherent */
+   if (flags & _PAGE_NO_CACHE)
+   flags &= ~_PAGE_COHERENT;
+
/* We don't support the 4K PFN hack with ioremap */
if (flags & H_PAGE_4K_PFN)
return NULL;
@@ -187,6 +102,33 @@ void __iomem * __ioremap_caller(phys_addr_t addr, unsigned 
long size,
if ((size == 0) || (paligned == 0))
return NULL;
 
+   /*
+* If the address lies within the first 16 MB, assume it's in ISA
+* memory space
+*/
+   if (IS_ENABLED(CONFIG_PPC32) && paligned < 16*1024*1024)
+   paligned += _ISA_MEM_BASE;
+
+   /*
+* Don't allow anybody to remap normal RAM that we're using.
+* mem_init() sets high_memory so only do the check after that.
+*/
+   if (!IS_ENABLED(CONFIG_CRASH_DUMP) &&
+   slab_is_available() && (paligned < virt_to_phys(high_memory)) &&
+   page_is_ram(__phys_to_pfn(paligned))) {
+   printk("__ioremap(): phys addr 0x%llx is RAM 

[PATCH 07/17] powerpc: make ioremap_bot common to PPC32 and PPC64

2018-05-04 Thread Christophe Leroy
Today, early ioremap allocates upwards from IOREMAP_BASE on PPC64
and downwards from IOREMAP_TOP on PPC32.

This patch modifies the PPC32 behaviour to match the PPC64 one.
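
Roughly, the common scheme after this patch looks like the following (sketch
only, early_ioremap_area() is an invented name; the real allocation is done in
__ioremap_caller()):

	unsigned long ioremap_bot = IOREMAP_BASE;

	static unsigned long early_ioremap_area(unsigned long size)
	{
		unsigned long v = ioremap_bot;	/* grow up from IOREMAP_BASE ... */

		ioremap_bot += size;		/* ... instead of down from IOREMAP_TOP */
		return v;
	}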

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/pgtable.h | 16 +---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 20 
 arch/powerpc/mm/dma-noncoherent.c|  2 +-
 arch/powerpc/mm/dump_linuxpagetables.c   |  6 +++---
 arch/powerpc/mm/init_32.c|  6 +-
 arch/powerpc/mm/ioremap.c| 22 ++
 arch/powerpc/mm/mem.c|  7 ---
 7 files changed, 40 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index c615abdce119..6cf962ec7a20 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -54,16 +54,17 @@
 #else
 #define KVIRT_TOP  (0xfe00UL)  /* for now, could be FIXMAP_BASE ? */
 #endif
+#define IOREMAP_BASE   VMALLOC_BASE
 
 /*
- * ioremap_bot starts at that address. Early ioremaps move down from there,
- * until mem_init() at which point this becomes the top of the vmalloc
+ * ioremap_bot starts at IOREMAP_BASE. Early ioremaps move up from there,
+ * until mem_init() at which point this becomes the bottom of the vmalloc
  * and ioremap space
  */
 #ifdef CONFIG_NOT_COHERENT_CACHE
-#define IOREMAP_TOP((KVIRT_TOP - CONFIG_CONSISTENT_SIZE) & PAGE_MASK)
+#define IOREMAP_END((KVIRT_TOP - CONFIG_CONSISTENT_SIZE) & PAGE_MASK)
 #else
-#define IOREMAP_TOPKVIRT_TOP
+#define IOREMAP_ENDKVIRT_TOP
 #endif
 
 /*
@@ -85,11 +86,12 @@
  */
 #define VMALLOC_OFFSET (0x1000000) /* 16M */
 #ifdef PPC_PIN_SIZE
-#define VMALLOC_START (((_ALIGN((long)high_memory, PPC_PIN_SIZE) + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
+#define VMALLOC_BASE (((_ALIGN((long)high_memory, PPC_PIN_SIZE) + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
 #else
-#define VMALLOC_START ((((long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
+#define VMALLOC_BASE ((((long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
 #endif
-#define VMALLOC_ENDioremap_bot
+#define VMALLOC_START  ioremap_bot
+#define VMALLOC_ENDIOREMAP_END
 
 #ifndef __ASSEMBLY__
 #include 
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 140f8e74b478..b413abcd5a09 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -80,10 +80,11 @@ extern int icache_44x_need_flush;
  * and ioremap space
  */
 #ifdef CONFIG_NOT_COHERENT_CACHE
-#define IOREMAP_TOP((KVIRT_TOP - CONFIG_CONSISTENT_SIZE) & PAGE_MASK)
+#define IOREMAP_END((KVIRT_TOP - CONFIG_CONSISTENT_SIZE) & PAGE_MASK)
 #else
-#define IOREMAP_TOPKVIRT_TOP
+#define IOREMAP_ENDKVIRT_TOP
 #endif
+#define IOREMAP_BASE   VMALLOC_BASE
 
 /*
  * Just any arbitrary offset to the start of the vmalloc VM area: the
@@ -94,21 +95,16 @@ extern int icache_44x_need_flush;
  * area for the same reason. ;)
  *
  * We no longer map larger than phys RAM with the BATs so we don't have
- * to worry about the VMALLOC_OFFSET causing problems.  We do have to worry
- * about clashes between our early calls to ioremap() that start growing down
- * from IOREMAP_TOP being run into the VM area allocations (growing upwards
- * from VMALLOC_START).  For this reason we have ioremap_bot to check when
- * we actually run into our mappings setup in the early boot with the VM
- * system.  This really does become a problem for machines with good amounts
- * of RAM.  -- Cort
+ * to worry about the VMALLOC_OFFSET causing problems.
  */
 #define VMALLOC_OFFSET (0x1000000) /* 16M */
 #ifdef PPC_PIN_SIZE
-#define VMALLOC_START (((_ALIGN((long)high_memory, PPC_PIN_SIZE) + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
+#define VMALLOC_BASE (((_ALIGN((long)high_memory, PPC_PIN_SIZE) + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
 #else
-#define VMALLOC_START ((((long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
+#define VMALLOC_BASE ((((long)high_memory + VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1)))
 #endif
-#define VMALLOC_ENDioremap_bot
+#define VMALLOC_START  ioremap_bot
+#define VMALLOC_ENDIOREMAP_END
 
 /*
  * Bits in a linux-style PTE.  These match the bits in the
diff --git a/arch/powerpc/mm/dma-noncoherent.c 
b/arch/powerpc/mm/dma-noncoherent.c
index 382528475433..d0a8fe74f5a0 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -43,7 +43,7 @@
  * can be further configured for specific applications under
  * the "Advanced Setup" menu. -Matt
  */
-#define CONSISTENT_BASE(IOREMAP_TOP)
+#define CONSISTENT_BASE(IOREMAP_END)
 #define CONSISTENT_END (CONSISTENT_BASE + 
CONFIG_CONSISTENT_SIZE)
 #define CONSISTENT_OFFSET(x)   

[PATCH 08/17] powerpc: make __iounmap() common to PPC32 and PPC64

2018-05-04 Thread Christophe Leroy
This patch makes __iounmap() common to PPC32 and PPC64.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/ioremap.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index 153657db084e..65d611d44d38 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -120,20 +120,6 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
unsigned long flags,
return (void __iomem *) (v + ((unsigned long)addr & ~PAGE_MASK));
 }
 
-void __iounmap(volatile void __iomem *addr)
-{
-   /*
-* If mapped by BATs then there is nothing to do.
-* Calling vfree() generates a benign warning.
-*/
-   if (v_block_mapped((unsigned long)addr))
-   return;
-
-   if ((unsigned long) addr >= ioremap_bot)
-   vunmap((void *) (PAGE_MASK & (unsigned long)addr));
-}
-EXPORT_SYMBOL(__iounmap);
-
 #else
 
 /**
@@ -225,6 +211,8 @@ void __iomem * __ioremap_caller(phys_addr_t addr, unsigned 
long size,
return ret;
 }
 
+#endif
+
 /*
  * Unmap an IO region and remove it from imalloc'd list.
  * Access to IO memory should be serialized by driver.
@@ -238,6 +226,14 @@ void __iounmap(volatile void __iomem *token)
 
addr = (void *) ((unsigned long __force)
 PCI_FIX_ADDR(token) & PAGE_MASK);
+
+   /*
+* If mapped by BATs then there is nothing to do.
+* Calling vfree() generates a benign warning.
+*/
+   if (v_block_mapped((unsigned long)addr))
+   return;
+
if ((unsigned long)addr < ioremap_bot) {
printk(KERN_WARNING "Attempt to iounmap early bolted mapping"
   " at 0x%p\n", addr);
@@ -247,8 +243,6 @@ void __iounmap(volatile void __iomem *token)
 }
 EXPORT_SYMBOL(__iounmap);
 
-#endif
-
 void __iomem * __ioremap(phys_addr_t addr, unsigned long size,
 unsigned long flags)
 {
-- 
2.13.3



[PATCH 06/17] powerpc: common ioremap functions.

2018-05-04 Thread Christophe Leroy
__ioremap(), ioremap(), ioremap_wc() and ioremap_prot() are
very similar between PPC32 and PPC64; they can easily be
made common.

_PAGE_WRITE is equal to _PAGE_RW on PPC32, and
_PAGE_RO and _PAGE_HWWRITE are 0 on PPC64.

iounmap() can also be made common by renaming the PPC32
iounmap() as __iounmap().
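
As a small illustration (sketch only) of why the zero-valued flags allow the
code to be shared, the common test in ioremap_prot()

	/* writeable implies dirty for kernel addresses */
	if ((flags & (_PAGE_RW | _PAGE_RO)) != _PAGE_RO)
		flags |= _PAGE_DIRTY | _PAGE_HWWRITE;

reduces on PPC64, where _PAGE_RO and _PAGE_HWWRITE are 0, to

	if (flags & _PAGE_RW)
		flags |= _PAGE_DIRTY;

so the PPC32-specific flag handling simply compiles away there.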

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
 arch/powerpc/include/asm/machdep.h   |  2 +-
 arch/powerpc/mm/ioremap.c| 95 +---
 3 files changed, 31 insertions(+), 67 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 47b5ffc8715d..c5c6ead06bfb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -17,6 +17,7 @@
 #define _PAGE_NA   0
 #define _PAGE_RO   0
 #define _PAGE_USER 0
+#define _PAGE_HWWRITE  0
 
 #define _PAGE_EXEC 0x1 /* execute permission */
 #define _PAGE_WRITE0x2 /* write access allowed */
diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index ffe7c71e1132..84d99ed82d5d 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -33,11 +33,11 @@ struct pci_host_bridge;
 
 struct machdep_calls {
char*name;
-#ifdef CONFIG_PPC64
void __iomem *  (*ioremap)(phys_addr_t addr, unsigned long size,
   unsigned long flags, void *caller);
void(*iounmap)(volatile void __iomem *token);
 
+#ifdef CONFIG_PPC64
 #ifdef CONFIG_PM
void(*iommu_save)(void);
void(*iommu_restore)(void);
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index 5d2645193568..f8dc9638c598 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mmu_decl.h"
 
@@ -32,44 +33,6 @@ unsigned long ioremap_bot;
 EXPORT_SYMBOL(ioremap_bot);/* aka VMALLOC_END */
 
 void __iomem *
-ioremap(phys_addr_t addr, unsigned long size)
-{
-   return __ioremap_caller(addr, size, _PAGE_NO_CACHE | _PAGE_GUARDED,
-   __builtin_return_address(0));
-}
-EXPORT_SYMBOL(ioremap);
-
-void __iomem *
-ioremap_wc(phys_addr_t addr, unsigned long size)
-{
-   return __ioremap_caller(addr, size, _PAGE_NO_CACHE,
-   __builtin_return_address(0));
-}
-EXPORT_SYMBOL(ioremap_wc);
-
-void __iomem *
-ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags)
-{
-   /* writeable implies dirty for kernel addresses */
-   if ((flags & (_PAGE_RW | _PAGE_RO)) != _PAGE_RO)
-   flags |= _PAGE_DIRTY | _PAGE_HWWRITE;
-
-   /* we don't want to let _PAGE_USER and _PAGE_EXEC leak out */
-   flags &= ~(_PAGE_USER | _PAGE_EXEC);
-   flags |= _PAGE_PRIVILEGED;
-
-   return __ioremap_caller(addr, size, flags, __builtin_return_address(0));
-}
-EXPORT_SYMBOL(ioremap_prot);
-
-void __iomem *
-__ioremap(phys_addr_t addr, unsigned long size, unsigned long flags)
-{
-   return __ioremap_caller(addr, size, flags, __builtin_return_address(0));
-}
-EXPORT_SYMBOL(__ioremap);
-
-void __iomem *
 __ioremap_caller(phys_addr_t addr, unsigned long size, unsigned long flags,
 void *caller)
 {
@@ -153,7 +116,7 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
unsigned long flags,
return (void __iomem *) (v + ((unsigned long)addr & ~PAGE_MASK));
 }
 
-void iounmap(volatile void __iomem *addr)
+void __iounmap(volatile void __iomem *addr)
 {
/*
 * If mapped by BATs then there is nothing to do.
@@ -165,7 +128,7 @@ void iounmap(volatile void __iomem *addr)
if (addr > high_memory && (unsigned long) addr < ioremap_bot)
vunmap((void *) (PAGE_MASK & (unsigned long)addr));
 }
-EXPORT_SYMBOL(iounmap);
+EXPORT_SYMBOL(__iounmap);
 
 #else
 
@@ -264,6 +227,30 @@ void __iomem * __ioremap_caller(phys_addr_t addr, unsigned 
long size,
return ret;
 }
 
+/*
+ * Unmap an IO region and remove it from imalloc'd list.
+ * Access to IO memory should be serialized by driver.
+ */
+void __iounmap(volatile void __iomem *token)
+{
+   void *addr;
+
+   if (!slab_is_available())
+   return;
+
+   addr = (void *) ((unsigned long __force)
+PCI_FIX_ADDR(token) & PAGE_MASK);
+   if ((unsigned long)addr < ioremap_bot) {
+   printk(KERN_WARNING "Attempt to iounmap early bolted mapping"
+  " at 0x%p\n", addr);
+   return;
+   }
+   vunmap(addr);
+}
+EXPORT_SYMBOL(__iounmap);
+
+#endif
+
 void __iomem * __ioremap(phys_addr_t addr, unsigned long size,
 unsigned long flags)
 {
@@ -299,8 +286,8 @@ void __iomem * 

[PATCH 05/17] powerpc: move io mapping functions into ioremap.c

2018-05-04 Thread Christophe Leroy
This patch is the first of a series that intends to make
io mappings common to PPC32 and PPC64.

It moves the ioremap/iounmap functions into a new file called ioremap.c
with no other modification to the functions.
For the time being, the PPC32 and PPC64 parts are enclosed in #ifdef.
Following patches will aim at making those functions as common as
possible between PPC32 and PPC64.

This patch also moves EXPORT_SYMBOL to the end of each function.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/Makefile |   2 +-
 arch/powerpc/mm/ioremap.c| 350 +++
 arch/powerpc/mm/pgtable_32.c | 139 -
 arch/powerpc/mm/pgtable_64.c | 177 --
 4 files changed, 351 insertions(+), 317 deletions(-)
 create mode 100644 arch/powerpc/mm/ioremap.c

diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index f06f3577d8d1..22d54c1d90e1 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -9,7 +9,7 @@ ccflags-$(CONFIG_PPC64) := $(NO_MINIMAL_TOC)
 
 obj-y  := fault.o mem.o pgtable.o mmap.o \
   init_$(BITS).o pgtable_$(BITS).o \
-  init-common.o mmu_context.o drmem.o
+  init-common.o mmu_context.o drmem.o ioremap.o
 obj-$(CONFIG_PPC_MMU_NOHASH)   += mmu_context_nohash.o tlb_nohash.o \
   tlb_nohash_low.o
 obj-$(CONFIG_PPC_BOOK3E)   += tlb_low_$(BITS)e.o
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
new file mode 100644
index ..5d2645193568
--- /dev/null
+++ b/arch/powerpc/mm/ioremap.c
@@ -0,0 +1,350 @@
+/*
+ * This file contains the routines for mapping IO areas
+ *
+ *  Derived from arch/powerpc/mm/pgtable_32.c and
+ *  arch/powerpc/mm/pgtable_64.c
+ *
+ * SPDX-License-Identifier: GPL-2.0
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mmu_decl.h"
+
+#ifdef CONFIG_PPC32
+
+unsigned long ioremap_bot;
+EXPORT_SYMBOL(ioremap_bot);/* aka VMALLOC_END */
+
+void __iomem *
+ioremap(phys_addr_t addr, unsigned long size)
+{
+   return __ioremap_caller(addr, size, _PAGE_NO_CACHE | _PAGE_GUARDED,
+   __builtin_return_address(0));
+}
+EXPORT_SYMBOL(ioremap);
+
+void __iomem *
+ioremap_wc(phys_addr_t addr, unsigned long size)
+{
+   return __ioremap_caller(addr, size, _PAGE_NO_CACHE,
+   __builtin_return_address(0));
+}
+EXPORT_SYMBOL(ioremap_wc);
+
+void __iomem *
+ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags)
+{
+   /* writeable implies dirty for kernel addresses */
+   if ((flags & (_PAGE_RW | _PAGE_RO)) != _PAGE_RO)
+   flags |= _PAGE_DIRTY | _PAGE_HWWRITE;
+
+   /* we don't want to let _PAGE_USER and _PAGE_EXEC leak out */
+   flags &= ~(_PAGE_USER | _PAGE_EXEC);
+   flags |= _PAGE_PRIVILEGED;
+
+   return __ioremap_caller(addr, size, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(ioremap_prot);
+
+void __iomem *
+__ioremap(phys_addr_t addr, unsigned long size, unsigned long flags)
+{
+   return __ioremap_caller(addr, size, flags, __builtin_return_address(0));
+}
+EXPORT_SYMBOL(__ioremap);
+
+void __iomem *
+__ioremap_caller(phys_addr_t addr, unsigned long size, unsigned long flags,
+void *caller)
+{
+   unsigned long v, i;
+   phys_addr_t p;
+   int err;
+
+   /* Make sure we have the base flags */
+   if ((flags & _PAGE_PRESENT) == 0)
+   flags |= pgprot_val(PAGE_KERNEL);
+
+   /* Non-cacheable page cannot be coherent */
+   if (flags & _PAGE_NO_CACHE)
+   flags &= ~_PAGE_COHERENT;
+
+   /*
+* Choose an address to map it to.
+* Once the vmalloc system is running, we use it.
+* Before then, we use space going down from IOREMAP_TOP
+* (ioremap_bot records where we're up to).
+*/
+   p = addr & PAGE_MASK;
+   size = PAGE_ALIGN(addr + size) - p;
+
+   /*
+* If the address lies within the first 16 MB, assume it's in ISA
+* memory space
+*/
+   if (p < 16*1024*1024)
+   p += _ISA_MEM_BASE;
+
+#ifndef CONFIG_CRASH_DUMP
+   /*
+* Don't allow anybody to remap normal RAM that we're using.
+* mem_init() sets high_memory so only do the check after that.
+*/
+   if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
+   page_is_ram(__phys_to_pfn(p))) {
+   printk("__ioremap(): phys addr 0x%llx is RAM lr %ps\n",
+  (unsigned long long)p, __builtin_return_address(0));
+   return NULL;
+   }
+#endif
+
+   if (size == 0)
+   return NULL;
+
+   /*
+* Is it already mapped?  Perhaps 

[PATCH 04/17] Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP"

2018-05-04 Thread Christophe Leroy
This reverts commit 4f94b2c7462d9720b2afa7e8e8d4c19446bb31ce.

That commit was buggy, as it used rlwinm instead of rlwimi.
Instead of fixing that bug, we revert the previous commit in order to
reduce the dependency between L1 entries and L2 entries.
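
For reference, a minimal illustration of the difference, using the line from
the reverted code:

	rlwinm	r11, r10, 31, _PAGE_ACCESSED >> 1	/* r11 = rotated r10 ANDed with
							   the mask, all other bits of
							   r11 are cleared */
	rlwimi	r11, r10, 31, _PAGE_ACCESSED >> 1	/* rotated r10 is inserted into
							   r11 under the mask, the other
							   bits of r11 are preserved */

Using rlwinm where rlwimi was intended therefore discards the bits that were
supposed to be kept.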

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu-8xx.h | 34 +---
 arch/powerpc/kernel/head_8xx.S | 45 +++---
 arch/powerpc/mm/8xx_mmu.c  |  2 +-
 3 files changed, 34 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-8xx.h 
b/arch/powerpc/include/asm/mmu-8xx.h
index 4f547752ae79..193f53116c7a 100644
--- a/arch/powerpc/include/asm/mmu-8xx.h
+++ b/arch/powerpc/include/asm/mmu-8xx.h
@@ -34,20 +34,12 @@
  * respectively NA for All or X for Supervisor and no access for User.
  * Then we use the APG to say whether accesses are according to Page rules or
  * "all Supervisor" rules (Access to all)
- * We also use the 2nd APG bit for _PAGE_ACCESSED when having SWAP:
- * When that bit is not set access is done iaw "all user"
- * which means no access iaw page rules.
- * Therefore, we define 4 APG groups. lsb is _PMD_USER, 2nd is _PAGE_ACCESSED
- * 0x => No access => 11 (all accesses performed as user iaw page definition)
- * 10 => No user => 01 (all accesses performed according to page definition)
- * 11 => User => 00 (all accesses performed as supervisor iaw page definition)
+ * Therefore, we define 2 APG groups. lsb is _PMD_USER
+ * 0 => No user => 01 (all accesses performed according to page definition)
+ * 1 => User => 00 (all accesses performed as supervisor iaw page definition)
  * We define all 16 groups so that all other bits of APG can take any value
  */
-#ifdef CONFIG_SWAP
-#define MI_APG_INIT0xf4f4f4f4
-#else
 #define MI_APG_INIT0x
-#endif
 
 /* The effective page number register.  When read, contains the information
  * about the last instruction TLB miss.  When MI_RPN is written, bits in
@@ -115,20 +107,12 @@
  * Supervisor and no access for user and NA for ALL.
  * Then we use the APG to say whether accesses are according to Page rules or
  * "all Supervisor" rules (Access to all)
- * We also use the 2nd APG bit for _PAGE_ACCESSED when having SWAP:
- * When that bit is not set access is done iaw "all user"
- * which means no access iaw page rules.
- * Therefore, we define 4 APG groups. lsb is _PMD_USER, 2nd is _PAGE_ACCESSED
- * 0x => No access => 11 (all accesses performed as user iaw page definition)
- * 10 => No user => 01 (all accesses performed according to page definition)
- * 11 => User => 00 (all accesses performed as supervisor iaw page definition)
+ * Therefore, we define 2 APG groups. lsb is _PMD_USER
+ * 0 => No user => 01 (all accesses performed according to page definition)
+ * 1 => User => 00 (all accesses performed as supervisor iaw page definition)
  * We define all 16 groups so that all other bits of APG can take any value
  */
-#ifdef CONFIG_SWAP
-#define MD_APG_INIT0xf4f4f4f4
-#else
 #define MD_APG_INIT0x
-#endif
 
 /* The effective page number register.  When read, contains the information
  * about the last instruction TLB miss.  When MD_RPN is written, bits in
@@ -180,12 +164,6 @@
  */
 #define SPRN_M_TW  799
 
-/* APGs */
-#define M_APG0 0x
-#define M_APG1 0x0020
-#define M_APG2 0x0040
-#define M_APG3 0x0060
-
 #ifdef CONFIG_PPC_MM_SLICES
 #include 
 #define SLICE_ARRAY_SIZE   (1 << (32 - SLICE_LOW_SHIFT - 1))
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index d8670a37d70c..c3b831bb8bad 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -354,13 +354,14 @@ _ENTRY(ITLBMiss_cmp)
 #if defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE)
mtcrr12
 #endif
-
-#ifdef CONFIG_SWAP
-   rlwinm  r11, r10, 31, _PAGE_ACCESSED >> 1
-#endif
/* Load the MI_TWC with the attributes for this "segment." */
mtspr   SPRN_MI_TWC, r11/* Set segment attributes */
 
+#ifdef CONFIG_SWAP
+   rlwinm  r11, r10, 32-5, _PAGE_PRESENT
+   and r11, r11, r10
+   rlwimi  r10, r11, 0, _PAGE_PRESENT
+#endif
li  r11, RPN_PATTERN | 0x200
/* The Linux PTE won't go exactly into the MMU TLB.
 * Software indicator bits 20 and 23 must be clear.
@@ -471,14 +472,22 @@ _ENTRY(DTLBMiss_jmp)
 * above.
 */
rlwimi  r11, r10, 0, _PAGE_GUARDED
-#ifdef CONFIG_SWAP
-   /* _PAGE_ACCESSED has to be set. We use second APG bit for that, 0
-* on that bit will represent a Non Access group
-*/
-   rlwinm  r11, r10, 31, _PAGE_ACCESSED >> 1
-#endif
mtspr   SPRN_MD_TWC, r11
 
+   /* Both _PAGE_ACCESSED and _PAGE_PRESENT has to be set.
+* We also need to know if the insn is a load/store, so:
+* Clear _PAGE_PRESENT and load that which will
+* trap 

[PATCH 03/17] powerpc/nohash: use IS_ENABLED() to simplify __set_pte_at()

2018-05-04 Thread Christophe Leroy
By using IS_ENABLED() we can simplify __set_pte_at() by removing the
redundant *ptep = pte assignment.
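
As a generic sketch of the pattern (CONFIG_FOO, fast_path and special_store()
are invented names), a construct like

	#if defined(CONFIG_FOO)
	if (!fast_path) {
		*ptep = pte;		/* duplicated fallback */
		return;
	}
	special_store(ptep, pte);
	#else
	*ptep = pte;			/* common fallback */
	#endif

can be collapsed into

	if (IS_ENABLED(CONFIG_FOO) && fast_path) {
		special_store(ptep, pte);
		return;
	}
	*ptep = pte;			/* single fallback for every other case */

The dead branch is still optimised away at compile time, but it stays visible
to the compiler and the fallback store is written only once.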

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/pgtable.h | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index f2fe3cbe90af..077472640b35 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -148,40 +148,33 @@ extern void set_pte_at(struct mm_struct *mm, unsigned 
long addr, pte_t *ptep,
 static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, int percpu)
 {
-#if defined(CONFIG_PPC32) && defined(CONFIG_PTE_64BIT)
/* Second case is 32-bit with 64-bit PTE.  In this case, we
 * can just store as long as we do the two halves in the right order
 * with a barrier in between.
 * In the percpu case, we also fallback to the simple update
 */
-   if (percpu) {
-   *ptep = pte;
+   if (IS_ENABLED(CONFIG_PPC32) && IS_ENABLED(CONFIG_PTE_64BIT) && 
!percpu) {
+   __asm__ __volatile__("\
+   stw%U0%X0 %2,%0\n\
+   eieio\n\
+   stw%U0%X0 %L2,%1"
+   : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4))
+   : "r" (pte) : "memory");
return;
}
-   __asm__ __volatile__("\
-   stw%U0%X0 %2,%0\n\
-   eieio\n\
-   stw%U0%X0 %L2,%1"
-   : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4))
-   : "r" (pte) : "memory");
-
-#else
/* Anything else just stores the PTE normally. That covers all 64-bit
 * cases, and 32-bit non-hash with 32-bit PTEs.
 */
*ptep = pte;
 
-#ifdef CONFIG_PPC_BOOK3E_64
/*
 * With hardware tablewalk, a sync is needed to ensure that
 * subsequent accesses see the PTE we just wrote.  Unlike userspace
 * mappings, we can't tolerate spurious faults, so make sure
 * the new PTE will be seen the first time.
 */
-   if (is_kernel_addr(addr))
+   if (IS_ENABLED(CONFIG_PPC_BOOK3E_64) && is_kernel_addr(addr))
mb();
-#endif
-#endif
 }
 
 
-- 
2.13.3



[PATCH 02/17] powerpc/nohash: remove _PAGE_BUSY

2018-05-04 Thread Christophe Leroy
_PAGE_BUSY is always 0, so remove it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/64/pgtable.h | 10 +++---
 arch/powerpc/include/asm/nohash/pte-book3e.h |  5 -
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 4f6f5a27bfb5..c3559d7a94fb 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -186,14 +186,12 @@ static inline unsigned long pte_update(struct mm_struct 
*mm,
 
__asm__ __volatile__(
"1: ldarx   %0,0,%3 # pte_update\n\
-   andi.   %1,%0,%6\n\
-   bne-1b \n\
andc%1,%0,%4 \n\
-   or  %1,%1,%7\n\
+   or  %1,%1,%6\n\
stdcx.  %1,0,%3 \n\
bne-1b"
	: "=&r" (old), "=&r" (tmp), "=m" (*ptep)
-   : "r" (ptep), "r" (clr), "m" (*ptep), "i" (_PAGE_BUSY), "r" (set)
+   : "r" (ptep), "r" (clr), "m" (*ptep), "r" (set)
: "cc" );
 #else
unsigned long old = pte_val(*ptep);
@@ -290,13 +288,11 @@ static inline void __ptep_set_access_flags(struct 
mm_struct *mm,
 
__asm__ __volatile__(
"1: ldarx   %0,0,%4\n\
-   andi.   %1,%0,%6\n\
-   bne-1b \n\
or  %0,%3,%0\n\
stdcx.  %0,0,%4\n\
bne-1b"
	:"=&r" (old), "=&r" (tmp), "=m" (*ptep)
-   :"r" (bits), "r" (ptep), "m" (*ptep), "i" (_PAGE_BUSY)
+   :"r" (bits), "r" (ptep), "m" (*ptep)
:"cc");
 #else
unsigned long old = pte_val(*ptep);
diff --git a/arch/powerpc/include/asm/nohash/pte-book3e.h 
b/arch/powerpc/include/asm/nohash/pte-book3e.h
index 9ff51b4c0cac..12730b81cd98 100644
--- a/arch/powerpc/include/asm/nohash/pte-book3e.h
+++ b/arch/powerpc/include/asm/nohash/pte-book3e.h
@@ -57,13 +57,8 @@
 #define _PAGE_USER (_PAGE_BAP_UR | _PAGE_BAP_SR) /* Can be read */
 #define _PAGE_PRIVILEGED   (_PAGE_BAP_SR)
 
-#define _PAGE_BUSY 0
-
 #define _PAGE_SPECIAL  _PAGE_SW0
 
-/* Flags to be preserved on PTE modifications */
-#define _PAGE_HPTEFLAGS_PAGE_BUSY
-
 /* Base page size */
 #ifdef CONFIG_PPC_64K_PAGES
 #define _PAGE_PSIZE_PAGE_PSIZE_64K
-- 
2.13.3



[PATCH 00/17] Implement use of HW assistance on TLB table walk on 8xx

2018-05-04 Thread Christophe Leroy
The purpose of this series is to implement hardware assistance for TLB table walk
on the 8xx.

The first part is to make L1 entries and L2 entries independent.
For that, we need to alter the ioremap functions in order to handle the GUARD
attribute at the PGD/PMD level.

The last part is to try and reuse the PTE fragment mechanism implemented on PPC64
in order not to waste 16k pages for page tables when only 4k are used. For the
time being, it doesn't work, but I include it in the series anyway in order to
get feedback.

Tested successfully on the 8xx up to the patch before the last one.

I didn't have time to do compilation tests on other configs; I send the series
anyway before leaving for one week of vacation in order to get feedback.

Christophe Leroy (17):
  powerpc/nohash: remove hash related code from nohash headers.
  powerpc/nohash: remove _PAGE_BUSY
  powerpc/nohash: use IS_ENABLED() to simplify __set_pte_at()
  Revert "powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for
CONFIG_SWAP"
  powerpc: move io mapping functions into ioremap.c
  powerpc: common ioremap functions.
  powerpc: make ioremap_bot common to PPC32 and PPC64
  powerpc: make __iounmap() common to PPC32 and PPC64
  powerpc: make __ioremap_caller() common to PPC32 and PPC64
  powerpc: use _ALIGN macro
  powerpc/nohash32: set GUARDED attribute in the PMD directly
  powerpc/8xx: Remove PTE_ATOMIC_UPDATES
  powerpc/mm: Use hardware assistance in TLB handlers on the 8xx
  powerpc/8xx: reunify TLB handler routines
  powerpc/8xx: Free up SPRN_SPRG_SCRATCH2
  powerpc/mm: Make pte_fragment_alloc() common to PPC32 and PPC64
  powerpc/mm: Use pte_fragment_alloc() on 8xx (Not Working yet)

 arch/powerpc/include/asm/book3s/32/pgtable.h |  16 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |   2 +
 arch/powerpc/include/asm/hugetlb.h   |   4 +-
 arch/powerpc/include/asm/machdep.h   |   2 +-
 arch/powerpc/include/asm/mmu-8xx.h   |  38 +--
 arch/powerpc/include/asm/mmu_context.h   |  28 +++
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  39 ++-
 arch/powerpc/include/asm/nohash/32/pgtable.h |  88 +++
 arch/powerpc/include/asm/nohash/32/pte-8xx.h |   6 +-
 arch/powerpc/include/asm/nohash/64/pgtable.h |  26 +-
 arch/powerpc/include/asm/nohash/pgtable.h|  61 ++---
 arch/powerpc/include/asm/nohash/pte-book3e.h |   6 -
 arch/powerpc/include/asm/pgtable-types.h |   4 +
 arch/powerpc/kernel/head_8xx.S   | 350 ++-
 arch/powerpc/mm/8xx_mmu.c|  12 +-
 arch/powerpc/mm/Makefile |   2 +-
 arch/powerpc/mm/dma-noncoherent.c|   2 +-
 arch/powerpc/mm/dump_linuxpagetables.c   |  32 ++-
 arch/powerpc/mm/hugetlbpage.c|  12 +
 arch/powerpc/mm/init_32.c|   6 +-
 arch/powerpc/mm/ioremap.c| 250 +++
 arch/powerpc/mm/mem.c|  16 +-
 arch/powerpc/mm/mmu_context_book3s64.c   |  28 ---
 arch/powerpc/mm/mmu_context_nohash.c |   4 +
 arch/powerpc/mm/pgtable.c|  75 ++
 arch/powerpc/mm/pgtable_32.c | 167 +++--
 arch/powerpc/mm/pgtable_64.c | 244 ---
 arch/powerpc/platforms/Kconfig.cputype   |   9 +
 28 files changed, 730 insertions(+), 799 deletions(-)
 create mode 100644 arch/powerpc/mm/ioremap.c

-- 
2.13.3



[PATCH 01/17] powerpc/nohash: remove hash related code from nohash headers.

2018-05-04 Thread Christophe Leroy
When the nohash and book3s headers were split, some hash-related code
remained in the nohash headers. This patch removes it.

Signed-off-by: Christophe Leroy 
---
 Removed the call to pte_young() as it fails, back to using PAGE_ACCESSED 
directly.

 arch/powerpc/include/asm/nohash/32/pgtable.h | 29 +++--
 arch/powerpc/include/asm/nohash/64/pgtable.h | 16 ++--
 arch/powerpc/include/asm/nohash/pgtable.h| 38 +++-
 arch/powerpc/include/asm/nohash/pte-book3e.h |  1 -
 4 files changed, 10 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 03bbd1149530..140f8e74b478 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -133,7 +133,7 @@ extern int icache_44x_need_flush;
 #ifndef __ASSEMBLY__
 
 #define pte_clear(mm, addr, ptep) \
-   do { pte_update(ptep, ~_PAGE_HASHPTE, 0); } while (0)
+   do { pte_update(ptep, ~0, 0); } while (0)
 
 #define pmd_none(pmd)  (!pmd_val(pmd))
 #definepmd_bad(pmd)(pmd_val(pmd) & _PMD_BAD)
@@ -146,21 +146,6 @@ static inline void pmd_clear(pmd_t *pmdp)
 
 
 /*
- * When flushing the tlb entry for a page, we also need to flush the hash
- * table entry.  flush_hash_pages is assembler (for speed) in hashtable.S.
- */
-extern int flush_hash_pages(unsigned context, unsigned long va,
-   unsigned long pmdval, int count);
-
-/* Add an HPTE to the hash table */
-extern void add_hash_page(unsigned context, unsigned long va,
- unsigned long pmdval);
-
-/* Flush an entry from the TLB/hash table */
-extern void flush_hash_entry(struct mm_struct *mm, pte_t *ptep,
-unsigned long address);
-
-/*
  * PTE updates. This function is called whenever an existing
  * valid PTE is updated. This does -not- include set_pte_at()
  * which nowadays only sets a new PTE.
@@ -246,12 +231,6 @@ static inline int __ptep_test_and_clear_young(unsigned int 
context, unsigned lon
 {
unsigned long old;
old = pte_update(ptep, _PAGE_ACCESSED, 0);
-#if _PAGE_HASHPTE != 0
-   if (old & _PAGE_HASHPTE) {
-   unsigned long ptephys = __pa(ptep) & PAGE_MASK;
-   flush_hash_pages(context, addr, ptephys, 1);
-   }
-#endif
return (old & _PAGE_ACCESSED) != 0;
 }
 #define ptep_test_and_clear_young(__vma, __addr, __ptep) \
@@ -261,7 +240,7 @@ static inline int __ptep_test_and_clear_young(unsigned int 
context, unsigned lon
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long 
addr,
   pte_t *ptep)
 {
-   return __pte(pte_update(ptep, ~_PAGE_HASHPTE, 0));
+   return __pte(pte_update(ptep, ~0, 0));
 }
 
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
@@ -289,7 +268,7 @@ static inline void __ptep_set_access_flags(struct mm_struct 
*mm,
 }
 
 #define __HAVE_ARCH_PTE_SAME
-#define pte_same(A,B)  (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HASHPTE) == 0)
+#define pte_same(A,B)  ((pte_val(A) ^ pte_val(B)) == 0)
 
 /*
  * Note that on Book E processors, the pmd contains the kernel virtual
@@ -330,7 +309,7 @@ static inline void __ptep_set_access_flags(struct mm_struct 
*mm,
 /*
  * Encode and decode a swap entry.
  * Note that the bits we use in a PTE for representing a swap entry
- * must not include the _PAGE_PRESENT bit or the _PAGE_HASHPTE bit (if used).
+ * must not include the _PAGE_PRESENT bit.
  *   -- paulus
  */
 #define __swp_type(entry)  ((entry).val & 0x1f)
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 5c5f75d005ad..4f6f5a27bfb5 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -173,8 +173,6 @@ static inline void pgd_set(pgd_t *pgdp, unsigned long val)
 /* to find an entry in a kernel page-table-directory */
 /* This now only contains the vmalloc pages */
 #define pgd_offset_k(address) pgd_offset(_mm, address)
-extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
-   pte_t *ptep, unsigned long pte, int huge);
 
 /* Atomic PTE updates */
 static inline unsigned long pte_update(struct mm_struct *mm,
@@ -205,11 +203,6 @@ static inline unsigned long pte_update(struct mm_struct 
*mm,
if (!huge)
assert_pte_locked(mm, addr);
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   if (old & _PAGE_HASHPTE)
-   hpte_need_flush(mm, addr, ptep, old, huge);
-#endif
-
return old;
 }
 
@@ -218,7 +211,7 @@ static inline int __ptep_test_and_clear_young(struct 
mm_struct *mm,
 {
unsigned long old;
 
-   if ((pte_val(*ptep) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
+   if ((pte_val(*ptep) & _PAGE_ACCESSED) == 0)
return 0;
old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 

Re: [PATCH 1/4] powerpc/64/kexec: fix race in kexec when XIVE is shutdowned

2018-05-04 Thread Cédric Le Goater
On 05/04/2018 01:13 PM, Cédric Le Goater wrote:
> On 05/04/2018 12:41 PM, Michael Ellerman wrote:
>> Cédric Le Goater  writes:
>>
>>> The kexec_state KEXEC_STATE_IRQS_OFF barrier is reached by all
>>> secondary CPUs before the kexec_cpu_down() operation is called on
>>> secondaries. This can raise conflicts and provoque errors in the XIVE
>>> hcalls when XIVE is shutdowned with H_INT_RESET on the primary CPU.
>>>
>>> To synchronize the kexec_cpu_down() operations and make sure the
>>> secondaries have completed their task before the primary starts doing
>>> the same, let's move the primary kexec_cpu_down() after the
>>> KEXEC_STATE_REAL_MODE barrier.
>>
>> This sounds reasonable, I'm sure you've tested it. I'm just a bit
>> worried that it could potentially break on other platforms because it
>> changes the sequence of operations.
> 
> yes. We are adding a last barrier to be exact making the full sequence 
> a little slower.
> 
>> Looking we only have kexec_cpu_down() implemented for pseries, powernv,
>> ps3 and 85xx.
>>
>> We can easily test the first two. > ps3 doesn't do much so hopefully that's 
>> safe.
>>
>> mpc85xx_smp_kexec_cpu_down() does very little on 32-bit, and on 64-bit
>> it seems to already wait for at least one other CPU to get into
>> KEXEC_STATE_REAL_MODE, so that's probably safe too.
>>
>> So I guess I'm OK to merge this, and we'll fix any fallout. It would be
>> good for the change log to call out the change though, and that we think
>> it's a sensible change for all platforms.
> 
> OK. 

Ah, and can you please fix the 'shutdowned' spelling? It has been bugging me
since I sent the patch :) thx

C.


Re: [PATCH 2/4] powerpc/xive: fix hcall H_INT_RESET to support long busy delays

2018-05-04 Thread Cédric Le Goater
On 05/04/2018 12:41 PM, Michael Ellerman wrote:
> Cédric Le Goater  writes:
> 
>> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
>> b/arch/powerpc/sysdev/xive/spapr.c
>> index 091f1d0d0af1..7113f5d87952 100644
>> --- a/arch/powerpc/sysdev/xive/spapr.c
>> +++ b/arch/powerpc/sysdev/xive/spapr.c
>> @@ -108,6 +109,52 @@ static void xive_irq_bitmap_free(int irq)
>>  }
>>  }
>>  
>> +
>> +/* Based on the similar routines in RTAS */
>> +static unsigned int plpar_busy_delay_time(long rc)
>> +{
>> +unsigned int ms = 0;
>> +
>> +if (H_IS_LONG_BUSY(rc)) {
>> +ms = get_longbusy_msecs(rc);
>> +} else if (rc == H_BUSY) {
>> +ms = 10; /* seems appropriate for XIVE hcalls */
>> +}
>> +
>> +return ms;
>> +}
>> +
>> +static unsigned int plpar_busy_delay(int rc)
>> +{
>> +unsigned int ms;
>> +
>> +might_sleep();
>> +ms = plpar_busy_delay_time(rc);
>> +if (ms && need_resched())
>> +msleep(ms);
> 
> This is called from kexec shutdown isn't it?
> 
> In which case I don't think msleep() is a great idea.>> We could be crashing 
> for example.

yes.

> An mdelay would be safer I think?

I agree, but would mdelay be OK? The hcall can take up to 160ms.

Thanks,

C. 


Re: [PATCH 3/4] powerpc/xive: shutdown XIVE when kexec or kdump is performed

2018-05-04 Thread Cédric Le Goater
On 05/04/2018 12:42 PM, Michael Ellerman wrote:
> Cédric Le Goater  writes:
> 
>> The hcall H_INT_RESET should be called to make sure XIVE is fully
>> reseted.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  arch/powerpc/platforms/pseries/kexec.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/kexec.c 
>> b/arch/powerpc/platforms/pseries/kexec.c
>> index eeb13429d685..1d9bbf8e7357 100644
>> --- a/arch/powerpc/platforms/pseries/kexec.c
>> +++ b/arch/powerpc/platforms/pseries/kexec.c
>> @@ -52,8 +52,11 @@ void pseries_kexec_cpu_down(int crash_shutdown, int 
>> secondary)
>>  }
>>  }
>>  
>> -if (xive_enabled())
>> +if (xive_enabled()) {
>>  xive_kexec_teardown_cpu(secondary);
>> -else
>> +
>> +if (!secondary)
>> +xive_shutdown();
> 
> Couldn't that logic go in xive_kexec_teardown_cpu()?

On powernv, we wait for the secondaries to reach OPAL before doing a 
XIVE shutdown. This is another kexec barrier but it is after the 
KEXEC_STATE_REAL_MODE barrier if I am correct.

So I don't think we can move the code in the  xive_kexec_teardown_cpu()

> Why do we not want to do it on powernv?
>> Actually we *do* do it on powernv, but elsewhere.

yes in a different file.

Thanks,

C.

> cheers
> 
>> +} else
>>  xics_kexec_teardown_cpu(secondary);
>>  }
>> -- 
>> 2.13.6



Re: [PATCH 4/4] powerpc/xive: prepare all hcalls to support long busy delays

2018-05-04 Thread Cédric Le Goater
On 05/04/2018 12:42 PM, Michael Ellerman wrote:
> Cédric Le Goater  writes:
> 
>> This is not the case for the moment, but future releases of pHyp might
>> need to introduce some synchronisation routines under the hood which
>> would make the XIVE hcalls longer to complete.
>>
>> As this was done for H_INT_RESET, let's wrap the other hcalls in a
>> loop catching the H_LONG_BUSY_* codes.
> 
> Are we sure it's safe to msleep() in all these paths?

Hmm. No, we could be under a lock, as these are called at the bottom of
the stack of the irq layer. Should we use mdelay() then?

What about the rtas_busy_delay() in rtas.c? I was wondering why we
were using msleep() there also.
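
A sketch of an mdelay() based variant (name invented, reusing
plpar_busy_delay_time() from the patch) could look like:

	static unsigned int plpar_busy_delay_atomic(int rc)
	{
		unsigned int ms = plpar_busy_delay_time(rc);

		if (ms)
			mdelay(ms);	/* spin instead of sleeping */
		return ms;
	}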

Thanks,

C.

> cheers
> 
>> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
>> b/arch/powerpc/sysdev/xive/spapr.c
>> index 7113f5d87952..97ea0a67a173 100644
>> --- a/arch/powerpc/sysdev/xive/spapr.c
>> +++ b/arch/powerpc/sysdev/xive/spapr.c
>> @@ -165,7 +165,10 @@ static long plpar_int_get_source_info(unsigned long 
>> flags,
>>  unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
>>  long rc;
>>  
>> -rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
>> +do {
>> +rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
>> +} while (plpar_busy_delay(rc));
>> +
>>  if (rc) {
>>  pr_err("H_INT_GET_SOURCE_INFO lisn=%ld failed %ld\n", lisn, rc);
>>  return rc;
>> @@ -198,8 +201,11 @@ static long plpar_int_set_source_config(unsigned long 
>> flags,
>>  flags, lisn, target, prio, sw_irq);
>>  
>>  
>> -rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
>> -target, prio, sw_irq);
>> +do {
>> +rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
>> +target, prio, sw_irq);
>> +} while (plpar_busy_delay(rc));
>> +
>>  if (rc) {
>>  pr_err("H_INT_SET_SOURCE_CONFIG lisn=%ld target=%lx prio=%lx 
>> failed %ld\n",
>> lisn, target, prio, rc);
>> @@ -218,7 +224,11 @@ static long plpar_int_get_queue_info(unsigned long 
>> flags,
>>  unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
>>  long rc;
>>  
>> -rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target, priority);
>> +do {
>> +rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target,
>> + priority);
>> +} while (plpar_busy_delay(rc));
>> +
>>  if (rc) {
>>  pr_err("H_INT_GET_QUEUE_INFO cpu=%ld prio=%ld failed %ld\n",
>> target, priority, rc);
>> @@ -247,8 +257,11 @@ static long plpar_int_set_queue_config(unsigned long 
>> flags,
>>  pr_devel("H_INT_SET_QUEUE_CONFIG flags=%lx target=%lx priority=%lx 
>> qpage=%lx qsize=%lx\n",
>>  flags,  target, priority, qpage, qsize);
>>  
>> -rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
>> -priority, qpage, qsize);
>> +do {
>> +rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
>> +priority, qpage, qsize);
>> +} while (plpar_busy_delay(rc));
>> +
>>  if (rc) {
>>  pr_err("H_INT_SET_QUEUE_CONFIG cpu=%ld prio=%ld qpage=%lx 
>> returned %ld\n",
>> target, priority, qpage, rc);
>> @@ -262,7 +275,10 @@ static long plpar_int_sync(unsigned long flags, 
>> unsigned long lisn)
>>  {
>>  long rc;
>>  
>> -rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
>> +do {
>> +rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
>> +} while (plpar_busy_delay(rc));
>> +
>>  if (rc) {
>>  pr_err("H_INT_SYNC lisn=%ld returned %ld\n", lisn, rc);
>>  return  rc;
>> @@ -285,7 +301,11 @@ static long plpar_int_esb(unsigned long flags,
>>  pr_devel("H_INT_ESB flags=%lx lisn=%lx offset=%lx in=%lx\n",
>>  flags,  lisn, offset, in_data);
>>  
>> -rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset, in_data);
>> +do {
>> +rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset,
>> + in_data);
>> +} while (plpar_busy_delay(rc));
>> +
>>  if (rc) {
>>  pr_err("H_INT_ESB lisn=%ld offset=%ld returned %ld\n",
>> lisn, offset, rc);
>> -- 
>> 2.13.6



Re: [PATCH 1/3] powerpc/nohash: remove hash related code from nohash headers.

2018-05-04 Thread Michael Ellerman
kbuild test robot  writes:

> Hi Christophe,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v4.17-rc2 next-20180426]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-nohash-remove-hash-related-code-from-nohash-headers/20180425-182026
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-ppc64e_defconfig (attached as .config)
> compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=powerpc 
>
> All errors (new ones prefixed by >>):
>
>In file included from arch/powerpc/include/asm/nohash/pgtable.h:6:0,
> from arch/powerpc/include/asm/pgtable.h:19,
> from include/linux/memremap.h:8,
> from include/linux/mm.h:27,
> from include/linux/mman.h:5,
> from arch/powerpc/kernel/asm-offsets.c:22:
>arch/powerpc/include/asm/nohash/64/pgtable.h: In function 
> '__ptep_test_and_clear_young':
>>> arch/powerpc/include/asm/nohash/64/pgtable.h:214:6: error: implicit 
>>> declaration of function 'pte_young'; did you mean 'pte_pud'? 
>>> [-Werror=implicit-function-declaration]
>  if (pte_young(*ptep))
>  ^
>  pte_pud

Urk.

There's a circular dependency here.

I fixed it with the patch below, which seems to be the least worst
solution. Possibly we can clean things up further in future.

cheers

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 140f8e74b478..987a658b18e1 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -267,6 +267,11 @@ static inline void __ptep_set_access_flags(struct 
mm_struct *mm,
pte_update(ptep, clr, set);
 }
 
+static inline int pte_young(pte_t pte)
+{
+   return pte_val(pte) & _PAGE_ACCESSED;
+}
+
 #define __HAVE_ARCH_PTE_SAME
 #define pte_same(A,B)  ((pte_val(A) ^ pte_val(B)) == 0)
 
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index e8de7cb4d3fb..6ac8381f4c7a 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -204,6 +204,11 @@ static inline unsigned long pte_update(struct mm_struct 
*mm,
return old;
 }
 
+static inline int pte_young(pte_t pte)
+{
+   return pte_val(pte) & _PAGE_ACCESSED;
+}
+
 static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
  unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 077472640b35..2160be2e4339 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -17,7 +17,6 @@ static inline int pte_write(pte_t pte)
 }
 static inline int pte_read(pte_t pte)  { return 1; }
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & 
_PAGE_DIRTY; }
-static inline int pte_young(pte_t pte) { return pte_val(pte) & 
_PAGE_ACCESSED; }
 static inline int pte_special(pte_t pte)   { return pte_val(pte) & 
_PAGE_SPECIAL; }
 static inline int pte_none(pte_t pte)  { return (pte_val(pte) & 
~_PTE_NONE_MASK) == 0; }
 static inline pgprot_t pte_pgprot(pte_t pte)   { return __pgprot(pte_val(pte) 
& PAGE_PROT_BITS); }


Re: [PATCH 1/4] powerpc/64/kexec: fix race in kexec when XIVE is shutdowned

2018-05-04 Thread Cédric Le Goater
On 05/04/2018 12:41 PM, Michael Ellerman wrote:
> Cédric Le Goater  writes:
> 
>> The kexec_state KEXEC_STATE_IRQS_OFF barrier is reached by all
>> secondary CPUs before the kexec_cpu_down() operation is called on
>> secondaries. This can raise conflicts and provoque errors in the XIVE
>> hcalls when XIVE is shutdowned with H_INT_RESET on the primary CPU.
>>
>> To synchronize the kexec_cpu_down() operations and make sure the
>> secondaries have completed their task before the primary starts doing
>> the same, let's move the primary kexec_cpu_down() after the
>> KEXEC_STATE_REAL_MODE barrier.
> 
> This sounds reasonable, I'm sure you've tested it. I'm just a bit
> worried that it could potentially break on other platforms because it
> changes the sequence of operations.

yes. We are adding a last barrier to be exact making the full sequence 
a little slower.

> Looking we only have kexec_cpu_down() implemented for pseries, powernv,
> ps3 and 85xx.
> 
> We can easily test the first two. > ps3 doesn't do much so hopefully that's 
> safe.
> 
> mpc85xx_smp_kexec_cpu_down() does very little on 32-bit, and on 64-bit
> it seems to already wait for at least one other CPU to get into
> KEXEC_STATE_REAL_MODE, so that's probably safe too.
> 
> So I guess I'm OK to merge this, and we'll fix any fallout. It would be
> good for the change log to call out the change though, and that we think
> it's a sensible change for all platforms.

OK. 

Thanks,

C. 

 
> cheers
> 
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  arch/powerpc/kernel/machine_kexec_64.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/machine_kexec_64.c 
>> b/arch/powerpc/kernel/machine_kexec_64.c
>> index 49d34d7271e7..212ecb8e829c 100644
>> --- a/arch/powerpc/kernel/machine_kexec_64.c
>> +++ b/arch/powerpc/kernel/machine_kexec_64.c
>> @@ -230,16 +230,16 @@ static void kexec_prepare_cpus(void)
>>  /* we are sure every CPU has IRQs off at this point */
>>  kexec_all_irq_disabled = 1;
>>  
>> -/* after we tell the others to go down */
>> -if (ppc_md.kexec_cpu_down)
>> -ppc_md.kexec_cpu_down(0, 0);
>> -
>>  /*
>>   * Before removing MMU mappings make sure all CPUs have entered real
>>   * mode:
>>   */
>>  kexec_prepare_cpus_wait(KEXEC_STATE_REAL_MODE);
>>  
>> +/* after we tell the others to go down */
>> +if (ppc_md.kexec_cpu_down)
>> +ppc_md.kexec_cpu_down(0, 0);
>> +
>>  put_cpu();
>>  }
>>  
>> -- 
>> 2.13.6



Re: [PATCH 3/4] powerpc/xive: shutdown XIVE when kexec or kdump is performed

2018-05-04 Thread Michael Ellerman
Cédric Le Goater  writes:

> The hcall H_INT_RESET should be called to make sure XIVE is fully
> reseted.
>
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/platforms/pseries/kexec.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/kexec.c 
> b/arch/powerpc/platforms/pseries/kexec.c
> index eeb13429d685..1d9bbf8e7357 100644
> --- a/arch/powerpc/platforms/pseries/kexec.c
> +++ b/arch/powerpc/platforms/pseries/kexec.c
> @@ -52,8 +52,11 @@ void pseries_kexec_cpu_down(int crash_shutdown, int 
> secondary)
>   }
>   }
>  
> - if (xive_enabled())
> + if (xive_enabled()) {
>   xive_kexec_teardown_cpu(secondary);
> - else
> +
> + if (!secondary)
> + xive_shutdown();

Couldn't that logic go in xive_kexec_teardown_cpu()?

Why do we not want to do it on powernv?

Actually we *do* do it on powernv, but elsewhere.

cheers

> + } else
>   xics_kexec_teardown_cpu(secondary);
>  }
> -- 
> 2.13.6


Re: [PATCH 4/4] powerpc/xive: prepare all hcalls to support long busy delays

2018-05-04 Thread Michael Ellerman
Cédric Le Goater  writes:

> This is not the case for the moment, but future releases of pHyp might
> need to introduce some synchronisation routines under the hood which
> would make the XIVE hcalls longer to complete.
>
> As this was done for H_INT_RESET, let's wrap the other hcalls in a
> loop catching the H_LONG_BUSY_* codes.

Are we sure it's safe to msleep() in all these paths?

cheers

> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
> b/arch/powerpc/sysdev/xive/spapr.c
> index 7113f5d87952..97ea0a67a173 100644
> --- a/arch/powerpc/sysdev/xive/spapr.c
> +++ b/arch/powerpc/sysdev/xive/spapr.c
> @@ -165,7 +165,10 @@ static long plpar_int_get_source_info(unsigned long 
> flags,
>   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
>   long rc;
>  
> - rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
> + do {
> + rc = plpar_hcall(H_INT_GET_SOURCE_INFO, retbuf, flags, lisn);
> + } while (plpar_busy_delay(rc));
> +
>   if (rc) {
>   pr_err("H_INT_GET_SOURCE_INFO lisn=%ld failed %ld\n", lisn, rc);
>   return rc;
> @@ -198,8 +201,11 @@ static long plpar_int_set_source_config(unsigned long 
> flags,
>   flags, lisn, target, prio, sw_irq);
>  
>  
> - rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
> - target, prio, sw_irq);
> + do {
> + rc = plpar_hcall_norets(H_INT_SET_SOURCE_CONFIG, flags, lisn,
> + target, prio, sw_irq);
> + } while (plpar_busy_delay(rc));
> +
>   if (rc) {
>   pr_err("H_INT_SET_SOURCE_CONFIG lisn=%ld target=%lx prio=%lx 
> failed %ld\n",
>  lisn, target, prio, rc);
> @@ -218,7 +224,11 @@ static long plpar_int_get_queue_info(unsigned long flags,
>   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
>   long rc;
>  
> - rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target, priority);
> + do {
> + rc = plpar_hcall(H_INT_GET_QUEUE_INFO, retbuf, flags, target,
> +  priority);
> + } while (plpar_busy_delay(rc));
> +
>   if (rc) {
>   pr_err("H_INT_GET_QUEUE_INFO cpu=%ld prio=%ld failed %ld\n",
>  target, priority, rc);
> @@ -247,8 +257,11 @@ static long plpar_int_set_queue_config(unsigned long 
> flags,
>   pr_devel("H_INT_SET_QUEUE_CONFIG flags=%lx target=%lx priority=%lx 
> qpage=%lx qsize=%lx\n",
>   flags,  target, priority, qpage, qsize);
>  
> - rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
> - priority, qpage, qsize);
> + do {
> + rc = plpar_hcall_norets(H_INT_SET_QUEUE_CONFIG, flags, target,
> + priority, qpage, qsize);
> + } while (plpar_busy_delay(rc));
> +
>   if (rc) {
>   pr_err("H_INT_SET_QUEUE_CONFIG cpu=%ld prio=%ld qpage=%lx 
> returned %ld\n",
>  target, priority, qpage, rc);
> @@ -262,7 +275,10 @@ static long plpar_int_sync(unsigned long flags, unsigned 
> long lisn)
>  {
>   long rc;
>  
> - rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
> + do {
> + rc = plpar_hcall_norets(H_INT_SYNC, flags, lisn);
> + } while (plpar_busy_delay(rc));
> +
>   if (rc) {
>   pr_err("H_INT_SYNC lisn=%ld returned %ld\n", lisn, rc);
>   return  rc;
> @@ -285,7 +301,11 @@ static long plpar_int_esb(unsigned long flags,
>   pr_devel("H_INT_ESB flags=%lx lisn=%lx offset=%lx in=%lx\n",
>   flags,  lisn, offset, in_data);
>  
> - rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset, in_data);
> + do {
> + rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset,
> +  in_data);
> + } while (plpar_busy_delay(rc));
> +
>   if (rc) {
>   pr_err("H_INT_ESB lisn=%ld offset=%ld returned %ld\n",
>  lisn, offset, rc);
> -- 
> 2.13.6


Re: [PATCH 2/4] powerpc/xive: fix hcall H_INT_RESET to support long busy delays

2018-05-04 Thread Michael Ellerman
Cédric Le Goater  writes:

> diff --git a/arch/powerpc/sysdev/xive/spapr.c 
> b/arch/powerpc/sysdev/xive/spapr.c
> index 091f1d0d0af1..7113f5d87952 100644
> --- a/arch/powerpc/sysdev/xive/spapr.c
> +++ b/arch/powerpc/sysdev/xive/spapr.c
> @@ -108,6 +109,52 @@ static void xive_irq_bitmap_free(int irq)
>   }
>  }
>  
> +
> +/* Based on the similar routines in RTAS */
> +static unsigned int plpar_busy_delay_time(long rc)
> +{
> + unsigned int ms = 0;
> +
> + if (H_IS_LONG_BUSY(rc)) {
> + ms = get_longbusy_msecs(rc);
> + } else if (rc == H_BUSY) {
> + ms = 10; /* seems appropriate for XIVE hcalls */
> + }
> +
> + return ms;
> +}
> +
> +static unsigned int plpar_busy_delay(int rc)
> +{
> + unsigned int ms;
> +
> + might_sleep();
> + ms = plpar_busy_delay_time(rc);
> + if (ms && need_resched())
> + msleep(ms);

This is called from kexec shutdown isn't it?

In which case I don't think msleep() is a great idea.

We could be crashing for example.

An mdelay would be safer I think?

cheers


Re: [PATCH 1/4] powerpc/64/kexec: fix race in kexec when XIVE is shutdowned

2018-05-04 Thread Michael Ellerman
Cédric Le Goater  writes:

> The kexec_state KEXEC_STATE_IRQS_OFF barrier is reached by all
> secondary CPUs before the kexec_cpu_down() operation is called on
> secondaries. This can raise conflicts and provoque errors in the XIVE
> hcalls when XIVE is shutdowned with H_INT_RESET on the primary CPU.
>
> To synchronize the kexec_cpu_down() operations and make sure the
> secondaries have completed their task before the primary starts doing
> the same, let's move the primary kexec_cpu_down() after the
> KEXEC_STATE_REAL_MODE barrier.

This sounds reasonable, I'm sure you've tested it. I'm just a bit
worried that it could potentially break on other platforms because it
changes the sequence of operations.

Looking we only have kexec_cpu_down() implemented for pseries, powernv,
ps3 and 85xx.

We can easily test the first two. ps3 doesn't do much so hopefully
that's safe.

mpc85xx_smp_kexec_cpu_down() does very little on 32-bit, and on 64-bit
it seems to already wait for at least one other CPU to get into
KEXEC_STATE_REAL_MODE, so that's probably safe too.

So I guess I'm OK to merge this, and we'll fix any fallout. It would be
good for the change log to call out the change though, and that we think
it's a sensible change for all platforms.

cheers

> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/kernel/machine_kexec_64.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/machine_kexec_64.c 
> b/arch/powerpc/kernel/machine_kexec_64.c
> index 49d34d7271e7..212ecb8e829c 100644
> --- a/arch/powerpc/kernel/machine_kexec_64.c
> +++ b/arch/powerpc/kernel/machine_kexec_64.c
> @@ -230,16 +230,16 @@ static void kexec_prepare_cpus(void)
>   /* we are sure every CPU has IRQs off at this point */
>   kexec_all_irq_disabled = 1;
>  
> - /* after we tell the others to go down */
> - if (ppc_md.kexec_cpu_down)
> - ppc_md.kexec_cpu_down(0, 0);
> -
>   /*
>* Before removing MMU mappings make sure all CPUs have entered real
>* mode:
>*/
>   kexec_prepare_cpus_wait(KEXEC_STATE_REAL_MODE);
>  
> + /* after we tell the others to go down */
> + if (ppc_md.kexec_cpu_down)
> + ppc_md.kexec_cpu_down(0, 0);
> +
>   put_cpu();
>  }
>  
> -- 
> 2.13.6


seccomp_bpf.c:2880:global.get_metadata:Expected 0 (0) == seccomp(1, 2, ) (4294967295)

2018-05-04 Thread Mathieu Malaterre
Hi there,

Quick question (I have not investigated the root cause): is support for
seccomp complete on ppc32?

$ make KBUILD_OUTPUT=/tmp/kselftest TARGETS=seccomp kselftest
...
seccomp_bpf.c:1804:TRACE_syscall.ptrace_syscall_dropped:Expected 1 (1)
== syscall(286) (4294967295)
TRACE_syscall.ptrace_syscall_dropped: Test failed at step #13
[ FAIL ] TRACE_syscall.ptrace_syscall_dropped
...
[ RUN  ] global.get_metadata
seccomp_bpf.c:2880:global.get_metadata:Expected 0 (0) == seccomp(1, 2,
) (4294967295)
seccomp_bpf.c:2892:global.get_metadata:Expected 1 (1) ==
read(pipefd[0], , 1) (0)
global.get_metadata: Test terminated by assertion
[ FAIL ] global.get_metadata


Thanks


Re: [PATCH v10 12/25] mm: cache some VMA fields in the vm_fault structure

2018-05-04 Thread Laurent Dufour
On 03/05/2018 17:42, Minchan Kim wrote:
> On Thu, May 03, 2018 at 02:25:18PM +0200, Laurent Dufour wrote:
>> On 23/04/2018 09:42, Minchan Kim wrote:
>>> On Tue, Apr 17, 2018 at 04:33:18PM +0200, Laurent Dufour wrote:
 When handling a speculative page fault, the vma->vm_flags and
 vma->vm_page_prot fields are read once the page table lock is released, so
 there is no longer any guarantee that these fields will not change behind our
 back. They will be saved in the vm_fault structure before the VMA is checked for
 changes.
>>>
>>> Sorry. I cannot understand.
>>> If it is changed under us, what happens? If it's critical, why can't we
>>> check it with the seqcounter?
>>> Clearly, I'm not understanding the logic here. However, it's a global
>>> change without CONFIG_SPF so I want to be more careful.
>>> It would be better to describe why we need to snapshot those values
>>> into vm_fault rather than preventing the race.
>>
>> The idea is to go forward processing the page fault using the VMA field
>> values saved in the vm_fault structure. Then once the PTE is locked, the
>> vma->sequence_counter is checked again and if something has changed behind
>> our back the speculative page fault processing is aborted.
> 
> Sorry, I still don't understand why we should capture some fields into vm_fault.
> If we find vma->seq_cnt has changed under the PTE lock, can't we just bail out
> and fall back to the classic fault handling?
> 
> Maybe I'm missing something obvious here. It would be really helpful to
> understand if you give some example.

I'd rather say that I was not clear enough ;)

Here is the point: when we deal with a speculative page fault, the mmap_sem is
not taken, so parallel VMA changes can occur. When a VMA change is made that
will impact the page fault processing, we assume that the VMA sequence counter
will be changed.

In the page fault processing, at the time the PTE is locked, we check the VMA
sequence counter to detect changes done behind our back. If no change is
detected we can continue further. But this doesn't prevent the VMA from being
changed behind our back while the PTE is locked. So the VMA fields which are
used while the PTE is locked must be saved to ensure that we are using *static*
values. This is important since the PTE changes will be made with regard to
these VMA fields and they need to be consistent. This concerns the
vma->vm_flags and vma->vm_page_prot VMA fields.

I hope that makes it clear enough this time.
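
To illustrate, here is a rough sketch of the pattern (field and helper names
are simplified and hypothetical, not the exact code from the series):

	static int do_speculative_fault(struct vm_area_struct *vma,
					struct vm_fault *vmf)
	{
		unsigned int seq = read_seqcount_begin(&vma->vm_sequence);

		/* Private copies, taken before the VMA can change behind us. */
		vmf->vma_flags = READ_ONCE(vma->vm_flags);
		vmf->vma_page_prot = READ_ONCE(vma->vm_page_prot);

		/* ... walk the page tables and lock the PTE ... */

		if (read_seqcount_retry(&vma->vm_sequence, seq))
			return VM_FAULT_RETRY;	/* fall back to the classic path */

		/*
		 * From here on only vmf->vma_flags and vmf->vma_page_prot are
		 * used, never vma->vm_flags / vma->vm_page_prot directly.
		 */
		return 0;
	}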

Thanks,
Laurent.



Re: [PATCH v2 9/9] powerpc/hugetlb: Enable hugetlb migration for ppc64

2018-05-04 Thread Aneesh Kumar K.V
Christophe LEROY  writes:

> On 16/05/2017 at 11:23, Aneesh Kumar K.V wrote:
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>   arch/powerpc/platforms/Kconfig.cputype | 5 +
>>   1 file changed, 5 insertions(+)
>> 
>> diff --git a/arch/powerpc/platforms/Kconfig.cputype 
>> b/arch/powerpc/platforms/Kconfig.cputype
>> index 8017542d..8acc4f27d101 100644
>> --- a/arch/powerpc/platforms/Kconfig.cputype
>> +++ b/arch/powerpc/platforms/Kconfig.cputype
>> @@ -351,6 +351,11 @@ config PPC_RADIX_MMU
>>is only implemented by IBM Power9 CPUs, if you don't have one of them
>>you can probably disable this.
>>   
>> +config ARCH_ENABLE_HUGEPAGE_MIGRATION
>> +def_bool y
>> +depends on PPC_BOOK3S_64 && HUGETLB_PAGE && MIGRATION
>> +
>> +
>
> Is there a reason why you redefine ARCH_ENABLE_HUGEPAGE_MIGRATION 
> instead of doing a 'select', as it is already defined in mm/Kconfig?
>

That got copied from the x86 Kconfig, I guess.

-aneesh



Re: [PATCH 08/13] powerpc/eeh: Introduce eeh_for_each_pe()

2018-05-04 Thread Russell Currey
On Wed, 2018-05-02 at 16:36 +1000, Sam Bobroff wrote:
> Add a for_each-style macro for iterating through PEs without the
> boilerplate required by a traversal function. eeh_pe_next() is now
> exported, as it is now used directly in place.
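
(For anyone following along: presumably the macro boils down to something like
the sketch below, built on the eeh_pe_next() helper mentioned above.)

	#define eeh_for_each_pe(root, pe) \
		for (pe = (root); pe; pe = eeh_pe_next(pe, (root)))
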
> 
> Signed-off-by: Sam Bobroff 

Reviewed-by: Russell Currey 


Re: [PATCH 07/13] powerpc/eeh: Clean up pci_ers_result handling

2018-05-04 Thread Russell Currey
On Wed, 2018-05-02 at 16:36 +1000, Sam Bobroff wrote:
> As EEH event handling progresses, a cumulative result of type
> pci_ers_result is built up by (some of) the eeh_report_*() functions
> using either:
>   if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
>   if (*res == PCI_ERS_RESULT_NONE) *res = rc;
> or:
>   if ((*res == PCI_ERS_RESULT_NONE) ||
>   (*res == PCI_ERS_RESULT_RECOVERED)) *res = rc;
>   if (*res == PCI_ERS_RESULT_DISCONNECT &&
>   rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
> (Where *res is the accumulator.)
> 
> However, the intent is not immediately clear and the result in some
> situations is order dependent.
> 
> Address this by assigning a priority to each result value, and always
> merging to the highest priority. This renders the intent clear, and
> provides a stable value for all orderings.
> 
> Signed-off-by: Sam Bobroff 
> ---
>  arch/powerpc/kernel/eeh_driver.c | 36 ++--
> 
>  1 file changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c
> b/arch/powerpc/kernel/eeh_driver.c
> index 188d15c4fe3a..f33dd68a9ca2 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -39,6 +39,29 @@ struct eeh_rmv_data {
>   int removed;
>  };
>  
> +static int eeh_result_priority(enum pci_ers_result result)
> +{
> + switch (result) {
> + case PCI_ERS_RESULT_NONE: return 0;
> + case PCI_ERS_RESULT_NO_AER_DRIVER: return 1;
> + case PCI_ERS_RESULT_RECOVERED: return 2;
> + case PCI_ERS_RESULT_CAN_RECOVER: return 3;
> + case PCI_ERS_RESULT_DISCONNECT: return 4;
> + case PCI_ERS_RESULT_NEED_RESET: return 5;
> + default:
> + WARN_ONCE(1, "Unknown pci_ers_result value");

Missing newline, and it would also be good to print what the value was
(both folded into the sketch at the end of this mail).

> + return 0;
> + }
> +};
> +
> +static enum pci_ers_result merge_result(enum pci_ers_result old,
> + enum pci_ers_result new)

merge_result() sounds like something really generic, maybe call it
pci_ers_merge_result() or something?

Note: just learned that it stands for Error Recovery System and that's
a thing (?)

> +{
> + if (eeh_result_priority(new) > eeh_result_priority(old))
> + return new;
> + return old;

MAX would be nicer as per mpe's suggestion

> +}
> +
>  /**
>   * eeh_pcid_get - Get the PCI device driver
>   * @pdev: PCI device
> @@ -206,9 +229,7 @@ static void *eeh_report_error(struct eeh_dev
> *edev, void *userdata)
>  
>   rc = driver->err_handler->error_detected(dev,
> pci_channel_io_frozen);
>  
> - /* A driver that needs a reset trumps all others */
> - if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
> - if (*res == PCI_ERS_RESULT_NONE) *res = rc;
> + *res = merge_result(*res, rc);
>  
>   edev->in_error = true;
>   pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
> @@ -249,9 +270,7 @@ static void *eeh_report_mmio_enabled(struct
> eeh_dev *edev, void *userdata)
>  
>   rc = driver->err_handler->mmio_enabled(dev);
>  
> - /* A driver that needs a reset trumps all others */
> - if (rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
> - if (*res == PCI_ERS_RESULT_NONE) *res = rc;
> + *res = merge_result(*res, rc);
>  
>  out:
>   eeh_pcid_put(dev);
> @@ -294,10 +313,7 @@ static void *eeh_report_reset(struct eeh_dev
> *edev, void *userdata)
>   goto out;
>  
>   rc = driver->err_handler->slot_reset(dev);
> - if ((*res == PCI_ERS_RESULT_NONE) ||
> - (*res == PCI_ERS_RESULT_RECOVERED)) *res = rc;
> - if (*res == PCI_ERS_RESULT_DISCONNECT &&
> -  rc == PCI_ERS_RESULT_NEED_RESET) *res = rc;
> + *res = merge_result(*res, rc);
>  
>  out:
>   eeh_pcid_put(dev);
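
For what it's worth, putting the comments above together, something like this
is what I had in mind (untested sketch only; the max()-style merge ends up
being the same priority comparison):

static int eeh_result_priority(enum pci_ers_result result)
{
	switch (result) {
	case PCI_ERS_RESULT_NONE: return 0;
	case PCI_ERS_RESULT_NO_AER_DRIVER: return 1;
	case PCI_ERS_RESULT_RECOVERED: return 2;
	case PCI_ERS_RESULT_CAN_RECOVER: return 3;
	case PCI_ERS_RESULT_DISCONNECT: return 4;
	case PCI_ERS_RESULT_NEED_RESET: return 5;
	default:
		WARN_ONCE(1, "Unknown pci_ers_result value: %d\n", (int)result);
		return 0;
	}
}

static enum pci_ers_result pci_ers_merge_result(enum pci_ers_result old,
						enum pci_ers_result new)
{
	/* Keep whichever result has the higher recovery priority. */
	return eeh_result_priority(new) > eeh_result_priority(old) ? new : old;
}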


Re: [PATCH 06/13] powerpc/eeh: Add message when PE processing at parent

2018-05-04 Thread Russell Currey
On Wed, 2018-05-02 at 16:35 +1000, Sam Bobroff wrote:
> To aid debugging, add a message to show when EEH processing for a PE
> will be done at the device's parent, rather than directly at the
> device.
> 
> Signed-off-by: Sam Bobroff 

Good idea!

Reviewed-by: Russell Currey 


Re: [PATCH 05/13] powerpc/eeh: Strengthen types of eeh traversal functions

2018-05-04 Thread Russell Currey
On Wed, 2018-05-02 at 16:35 +1000, Sam Bobroff wrote:
> The traversal functions eeh_pe_traverse() and eeh_pe_dev_traverse()
> both provide their first argument as void * but every single user
> casts
> it to the expected type.
> 
> Change the type of the first parameter from void * to the appropriate
> type, and clean up all uses.
> 
> Signed-off-by: Sam Bobroff 

Reviewed-by: Russell Currey 


Re: [PATCH 04/13] powerpc/eeh: Remove unused eeh_pcid_name()

2018-05-04 Thread Russell Currey
On Wed, 2018-05-02 at 16:35 +1000, Sam Bobroff wrote:
> Signed-off-by: Sam Bobroff 

Wow, this has been around since 2006.

Reviewed-by: Russell Currey 


Re: [PATCH 02/13] powerpc/eeh: Add final message for successful recovery

2018-05-04 Thread Russell Currey
On Fri, 2018-05-04 at 12:55 +1000, Michael Ellerman wrote:
> Sam Bobroff  writes:
> 
> > Add a single log line at the end of successful EEH recovery, so
> > that
> > it's clear that event processing has finished.
> > 
> > Signed-off-by: Sam Bobroff 
> > ---
> >  arch/powerpc/kernel/eeh_driver.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/powerpc/kernel/eeh_driver.c
> > b/arch/powerpc/kernel/eeh_driver.c
> > index 56a60b9eb397..07e0a42035ce 100644
> > --- a/arch/powerpc/kernel/eeh_driver.c
> > +++ b/arch/powerpc/kernel/eeh_driver.c
> > @@ -910,6 +910,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
> > pr_info("EEH: Notify device driver to resume\n");
> > eeh_pe_dev_traverse(pe, eeh_report_resume, NULL);
> >  
> > +   pr_info("EEH: Recovery successful.\n");
> Is it possible for recovery for multiple devices to be interleaved?
> 
> Should that message include the device?

Pretty sure EEH will only process a single error at a time, so this
*should* always be inferable from context, but the PHB and PE should
probably be included anyway.  It'd be cool to move pe_{err/warn/info}()
out of powernv for messages like this.
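
Something like this would do (hypothetical, but it follows the PHB#/PE#
spelling the other EEH messages use):

	pr_info("EEH: PHB#%x-PE#%x recovery successful.\n",
		pe->phb->global_number, pe->addr);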

- Russell

> 
> cheers