[PATCH] powerpc/fixmap: Fix the size of the early debug area

2020-08-16 Thread Christophe Leroy
Commit ("03fd42d458fb powerpc/fixmap: Fix FIX_EARLY_DEBUG_BASE when
page size is 256k") reworked the setup of the early debug area and
mistakenly replaced 128 * 1024 by SZ_128.

Change to SZ_128K to restore the original 128 kbytes size of the area.
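
The fixmap reserves one slot per page, so the size of this area
directly determines the number of reserved slots. For illustration,
with a 4K page size:

	/*
	 * ALIGN(SZ_128K, PAGE_SIZE) / PAGE_SIZE = 131072 / 4096 = 32 slots
	 * ALIGN(SZ_128,  PAGE_SIZE) / PAGE_SIZE =   4096 / 4096 =  1 slot
	 *
	 * i.e. the typo shrank the early debug area from 32 pages to a
	 * single page.
	 */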

Fixes: 03fd42d458fb ("powerpc/fixmap: Fix FIX_EARLY_DEBUG_BASE when page size is 256k")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/fixmap.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/fixmap.h b/arch/powerpc/include/asm/fixmap.h
index 925cf89cbf4b..6bfc87915d5d 100644
--- a/arch/powerpc/include/asm/fixmap.h
+++ b/arch/powerpc/include/asm/fixmap.h
@@ -52,7 +52,7 @@ enum fixed_addresses {
FIX_HOLE,
/* reserve the top 128K for early debugging purposes */
FIX_EARLY_DEBUG_TOP = FIX_HOLE,
-   FIX_EARLY_DEBUG_BASE = FIX_EARLY_DEBUG_TOP+(ALIGN(SZ_128, PAGE_SIZE)/PAGE_SIZE)-1,
+   FIX_EARLY_DEBUG_BASE = FIX_EARLY_DEBUG_TOP+(ALIGN(SZ_128K, PAGE_SIZE)/PAGE_SIZE)-1,
 #ifdef CONFIG_HIGHMEM
FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
-- 
2.25.0



[PATCH v3] powerpc/numa: Restrict possible nodes based on platform

2020-08-16 Thread Srikar Dronamraju
As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
Abstraction Services (RTAS) Node" at
https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf,
there are two device tree properties: ibm,max-associativity-domains
(which defines the maximum number of domains that the firmware,
i.e. PowerVM, can support) and ibm,current-associativity-domains (which
defines the maximum number of domains that the platform can support).
The value of the ibm,max-associativity-domains property is always
greater than or equal to that of ibm,current-associativity-domains. If
ibm,current-associativity-domains is not available, use
ibm,max-associativity-domains as a fallback. In this yet-to-be-released
LoPAPR, ibm,current-associativity-domains is mentioned on page 833 /
B.5.3, under the "Appendix B. System Binding" section.

Currently powerpc uses the ibm,max-associativity-domains property when
setting the possible number of nodes. This is currently set at 32.
However the possible number of nodes for a platform may be significantly
less. Hence set the possible number of nodes based on the
ibm,current-associativity-domains property, as sketched below.
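
A minimal sketch of the intended lookup, assuming the usual OF helpers
and the existing min_common_depth variable in arch/powerpc/mm/numa.c
(the actual change may differ in detail):

	struct device_node *rtas = of_find_node_by_path("/rtas");
	u32 numnodes;

	/* Prefer the platform limit, fall back to the firmware maximum. */
	if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
				       min_common_depth, &numnodes))
		of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
					   min_common_depth, &numnodes);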

Nathan Lynch had raised a valid concern that post LPM, a user could
DLPAR add processors and memory with "new" associativity properties.
https://lore.kernel.org/linuxppc-dev/871rljfet9@linux.ibm.com/t/#u

He also pointed out that ibm,max-associativity-domains has the same contents
on all currently available PowerVM systems, unlike
ibm,current-associativity-domains and hence may be better able to handle the
new numa associativity properties.

However with the recent commit dbce45628085 ("powerpc/numa: Limit
possible nodes to within num_possible_nodes"), all new NUMA
associativity properties are capped to the initially set nr_node_ids.
Hence this commit should be safe with any new DLPAR add post LPM.

$ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
/proc/device-tree/rtas/ibm,current-associativity-domains
 00000005 00000001 00000002 00000002 00000002 00000010
/proc/device-tree/rtas/ibm,max-associativity-domains
 00000005 00000001 00000008 00000020 00000020 00000100

$ cat /sys/devices/system/node/possible ##Before patch
0-31

$ cat /sys/devices/system/node/possible ##After patch
0-1

Note that the maximum number of nodes this platform can support is only
2 (0x2 in ibm,current-associativity-domains above), but the possible
nodes were set to 32 (0x20 in ibm,max-associativity-domains).

This is important because a lot of kernel and userspace code allocates
structures for all possible nodes, leading to a lot of memory that is
allocated but never used.
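
As an illustration, any allocation sized by nr_node_ids pays for every
possible node whether or not it ever comes online (the struct name
below is hypothetical):

	/* One entry per possible node, resident even for absent nodes. */
	struct foo_stats *stats = kcalloc(nr_node_ids, sizeof(*stats),
					  GFP_KERNEL);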

I ran a simple experiment to create and destroy 100 memory cgroups on
boot on an 8-node machine (Power8 Alpine).

Before patch
free -k at boot
              total        used        free      shared  buff/cache   available
Mem:  523498176 4106816   518820608   22272  570752   516606720
Swap:   4194240   0 4194240

free -k after creating 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:  523498176 4628416   518246464   22336  623296   516058688
Swap:   4194240   0 4194240

free -k after destroying 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:  523498176 4697408   518173760   22400  627008   515987904
Swap:   4194240   0 4194240

After patch
free -k at boot
              total        used        free      shared  buff/cache   available
Mem:  523498176 3969472   518933888   22272  594816   516731776
Swap:   4194240   0 4194240

free -k after creating 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:  523498176 4181888   518676096   22208  640192   516496448
Swap:   4194240   0 4194240

free -k after destroying 100 memory cgroups
              total        used        free      shared  buff/cache   available
Mem:  523498176 4232320   518619904   22272  645952   516443264
Swap:   4194240   0 4194240

Observations:
Fixed kernel takes 137344 kB (4106816 - 3969472) less to boot.
Fixed kernel takes 309184 kB (4628416 - 4181888 - 137344) less to create 100 memcgs.

Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Anton Blanchard 
Cc: Bharata B Rao 
Cc: Tyrel Datwyler 
Signed-off-by: Srikar Dronamraju 
---
Changelog v2->v3:
v2: 
https://lore.kernel.org/linuxppc-dev/20200707140644.7241-1-sri...@linux.vnet.ibm.com/t/#u
Updated commit msg to mention the LoPAPR reference section and URL
details (Michael Ellerman)
Updated commit msg to discuss possible new NUMA associativity
properties. (Nathan Lynch)

Changelog v1->v2:
v1: 
https://lore.kernel.org/linuxppc-dev/20200706064002.14848-1-sri...@linux.vnet.ibm.com/t/#u
Fallback to ibm,max-associativity-domains if
ibm,current-associativity-domains is not available

[PATCH v1 3/4] powerpc/process: Remove useless #ifdef CONFIG_SPE

2020-08-16 Thread Christophe Leroy
cpu_has_feature(CPU_FTR_SPE) returns false when CONFIG_SPE is
not set.

There is no need to enclose the test in an #ifdef CONFIG_SPE.
Remove it.

CPU_FTR_SPE only exists on 32-bit. Define it as 0 on 64-bit.

We have a couple of places like:

 #ifdef CONFIG_SPE
if (cpu_has_feature(CPU_FTR_SPE)) {
do_something_that_requires_CONFIG_SPE
} else {
return -EINVAL;
}
 #else
return -EINVAL;
 #endif

Replace them by a cleaner version:

if (cpu_has_feature(CPU_FTR_SPE)) {
 #ifdef CONFIG_SPE
do_something_that_requires_CONFIG_SPE
 #endif
} else {
return -EINVAL;
}

When CONFIG_SPE is not set, this resolves to an unconditional
return of -EINVAL.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/cputable.h |  1 +
 arch/powerpc/kernel/process.c       | 21 +++++++--------------
 2 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index fdddb822d564..cc21dbcc6794 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -165,6 +165,7 @@ static inline void cpu_feature_keys_init(void) { }
 #else  /* CONFIG_PPC32 */
 /* Define these to 0 for the sake of tests in common code */
 #define CPU_FTR_PPC_LE (0)
+#define CPU_FTR_SPE(0)
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 360415689f8a..7090c99a60d9 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -413,10 +413,8 @@ static int __init init_msr_all_available(void)
msr_all_available |= MSR_VEC;
if (cpu_has_feature(CPU_FTR_VSX))
msr_all_available |= MSR_VSX;
-#ifdef CONFIG_SPE
if (cpu_has_feature(CPU_FTR_SPE))
msr_all_available |= MSR_SPE;
-#endif
 
return 0;
 }
@@ -446,10 +444,8 @@ void giveup_all(struct task_struct *tsk)
 #endif
if (usermsr & MSR_VEC)
__giveup_altivec(tsk);
-#ifdef CONFIG_SPE
if (usermsr & MSR_SPE)
__giveup_spe(tsk);
-#endif
 
msr_check_and_clear(msr_all_available);
 }
@@ -1843,7 +1839,6 @@ int set_fpexc_mode(struct task_struct *tsk, unsigned int val)
 * fpexc_mode.  fpexc_mode is also used for setting FP exception
 * mode (asyn, precise, disabled) for 'Classic' FP. */
if (val & PR_FP_EXC_SW_ENABLE) {
-#ifdef CONFIG_SPE
if (cpu_has_feature(CPU_FTR_SPE)) {
/*
 * When the sticky exception bits are set
@@ -1857,16 +1852,15 @@ int set_fpexc_mode(struct task_struct *tsk, unsigned int val)
 * anyway to restore the prctl settings from
 * the saved environment.
 */
+#ifdef CONFIG_SPE
tsk->thread.spefscr_last = mfspr(SPRN_SPEFSCR);
tsk->thread.fpexc_mode = val &
(PR_FP_EXC_SW_ENABLE | PR_FP_ALL_EXCEPT);
+#endif
return 0;
} else {
return -EINVAL;
}
-#else
-   return -EINVAL;
-#endif
}
 
/* on a CONFIG_SPE this does not hurt us.  The bits that
@@ -1887,8 +1881,7 @@ int get_fpexc_mode(struct task_struct *tsk, unsigned long adr)
 {
unsigned int val;
 
-   if (tsk->thread.fpexc_mode & PR_FP_EXC_SW_ENABLE)
-#ifdef CONFIG_SPE
+   if (tsk->thread.fpexc_mode & PR_FP_EXC_SW_ENABLE) {
if (cpu_has_feature(CPU_FTR_SPE)) {
/*
 * When the sticky exception bits are set
@@ -1902,15 +1895,15 @@ int get_fpexc_mode(struct task_struct *tsk, unsigned long adr)
 * anyway to restore the prctl settings from
 * the saved environment.
 */
+#ifdef CONFIG_SPE
tsk->thread.spefscr_last = mfspr(SPRN_SPEFSCR);
val = tsk->thread.fpexc_mode;
+#endif
} else
return -EINVAL;
-#else
-   return -EINVAL;
-#endif
-   else
+   } else {
val = __unpack_fe01(tsk->thread.fpexc_mode);
+   }
return put_user(val, (unsigned int __user *) adr);
 }
 
-- 
2.25.0



[PATCH v1 4/4] powerpc/process: Remove useless #ifdef CONFIG_PPC_FPU

2020-08-16 Thread Christophe Leroy
Add a stub for __giveup_fpu() when CONFIG_PPC_FPU is
not selected, as done for CONFIG_SPE and CONFIG_ALTIVEC.

This allows removing some #ifdef CONFIG_PPC_FPU blocks.

Also change one of them to IS_ENABLED().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 7090c99a60d9..8133a6ce3242 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -229,6 +229,8 @@ void enable_kernel_fp(void)
}
 }
 EXPORT_SYMBOL(enable_kernel_fp);
+#else
+static inline void __giveup_fpu(struct task_struct *tsk) { }
 #endif /* CONFIG_PPC_FPU */
 
 #ifdef CONFIG_ALTIVEC
@@ -406,9 +408,8 @@ static unsigned long msr_all_available;
 
 static int __init init_msr_all_available(void)
 {
-#ifdef CONFIG_PPC_FPU
-   msr_all_available |= MSR_FP;
-#endif
+   if (IS_ENABLED(CONFIG_PPC_FPU))
+   msr_all_available |= MSR_FP;
if (cpu_has_feature(CPU_FTR_ALTIVEC))
msr_all_available |= MSR_VEC;
if (cpu_has_feature(CPU_FTR_VSX))
@@ -438,10 +439,8 @@ void giveup_all(struct task_struct *tsk)
 
	WARN_ON((usermsr & MSR_VSX) && !((usermsr & MSR_FP) && (usermsr & MSR_VEC)));
 
-#ifdef CONFIG_PPC_FPU
if (usermsr & MSR_FP)
__giveup_fpu(tsk);
-#endif
if (usermsr & MSR_VEC)
__giveup_altivec(tsk);
if (usermsr & MSR_SPE)
-- 
2.25.0



[PATCH v1 2/4] powerpc/process: Remove useless #ifdef CONFIG_ALTIVEC

2020-08-16 Thread Christophe Leroy
cpu_has_feature(CPU_FTR_ALTIVEC) returns false when CONFIG_ALTIVEC is
not set.

There is no need to enclose the test in an #ifdef CONFIG_ALTIVEC.
Remove it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index b64d71188715..360415689f8a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -409,10 +409,8 @@ static int __init init_msr_all_available(void)
 #ifdef CONFIG_PPC_FPU
msr_all_available |= MSR_FP;
 #endif
-#ifdef CONFIG_ALTIVEC
if (cpu_has_feature(CPU_FTR_ALTIVEC))
msr_all_available |= MSR_VEC;
-#endif
if (cpu_has_feature(CPU_FTR_VSX))
msr_all_available |= MSR_VSX;
 #ifdef CONFIG_SPE
@@ -446,10 +444,8 @@ void giveup_all(struct task_struct *tsk)
if (usermsr & MSR_FP)
__giveup_fpu(tsk);
 #endif
-#ifdef CONFIG_ALTIVEC
if (usermsr & MSR_VEC)
__giveup_altivec(tsk);
-#endif
 #ifdef CONFIG_SPE
if (usermsr & MSR_SPE)
__giveup_spe(tsk);
-- 
2.25.0



[PATCH v1 1/4] powerpc/process: Remove useless #ifdef CONFIG_VSX

2020-08-16 Thread Christophe Leroy
cpu_has_feature(CPU_FTR_VSX) returns false when CONFIG_VSX is
not set.

There is no need to enclose the test in an #ifdef CONFIG_VSX.
Remove it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index a17d0746d55f..b64d71188715 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -124,10 +124,8 @@ unsigned long notrace msr_check_and_set(unsigned long bits)
 
newmsr = oldmsr | bits;
 
-#ifdef CONFIG_VSX
if (cpu_has_feature(CPU_FTR_VSX) && (bits & MSR_FP))
newmsr |= MSR_VSX;
-#endif
 
if (oldmsr != newmsr)
mtmsr_isync(newmsr);
@@ -144,10 +142,8 @@ void notrace __msr_check_and_clear(unsigned long bits)
 
newmsr = oldmsr & ~bits;
 
-#ifdef CONFIG_VSX
if (cpu_has_feature(CPU_FTR_VSX) && (bits & MSR_FP))
newmsr &= ~MSR_VSX;
-#endif
 
if (oldmsr != newmsr)
mtmsr_isync(newmsr);
@@ -162,10 +158,8 @@ static void __giveup_fpu(struct task_struct *tsk)
save_fpu(tsk);
msr = tsk->thread.regs->msr;
msr &= ~(MSR_FP|MSR_FE0|MSR_FE1);
-#ifdef CONFIG_VSX
if (cpu_has_feature(CPU_FTR_VSX))
msr &= ~MSR_VSX;
-#endif
tsk->thread.regs->msr = msr;
 }
 
@@ -245,10 +239,8 @@ static void __giveup_altivec(struct task_struct *tsk)
save_altivec(tsk);
msr = tsk->thread.regs->msr;
msr &= ~MSR_VEC;
-#ifdef CONFIG_VSX
if (cpu_has_feature(CPU_FTR_VSX))
msr &= ~MSR_VSX;
-#endif
tsk->thread.regs->msr = msr;
 }
 
@@ -421,10 +413,8 @@ static int __init init_msr_all_available(void)
if (cpu_has_feature(CPU_FTR_ALTIVEC))
msr_all_available |= MSR_VEC;
 #endif
-#ifdef CONFIG_VSX
if (cpu_has_feature(CPU_FTR_VSX))
msr_all_available |= MSR_VSX;
-#endif
 #ifdef CONFIG_SPE
if (cpu_has_feature(CPU_FTR_SPE))
msr_all_available |= MSR_SPE;
@@ -509,19 +499,18 @@ static bool should_restore_altivec(void) { return false; }
 static void do_restore_altivec(void) { }
 #endif /* CONFIG_ALTIVEC */
 
-#ifdef CONFIG_VSX
 static bool should_restore_vsx(void)
 {
if (cpu_has_feature(CPU_FTR_VSX))
return true;
return false;
 }
+#ifdef CONFIG_VSX
 static void do_restore_vsx(void)
 {
current->thread.used_vsr = 1;
 }
 #else
-static bool should_restore_vsx(void) { return false; }
 static void do_restore_vsx(void) { }
 #endif /* CONFIG_VSX */
 
-- 
2.25.0



[PATCH v1] powerpc/process: Tag an #endif to help locate the matching #ifdef.

2020-08-16 Thread Christophe Leroy
That #endif is more than 100 lines after the matching #ifdef,
and there are several #ifdef/#else/#endif in between.

Tag it as /* CONFIG_PPC_BOOK3S_64 */ to help locate the
matching #ifdef.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 016bd831908e..ffbe79960c73 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -575,7 +575,7 @@ void notrace restore_math(struct pt_regs *regs)
regs->msr |= new_msr;
}
 }
-#endif
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 static void save_all(struct task_struct *tsk)
 {
-- 
2.25.0



[PATCH v1] powerpc/process: Replace an #if defined(CONFIG_4xx) || defined(CONFIG_BOOKE) by IS_ENABLED()

2020-08-16 Thread Christophe Leroy
The #if defined(CONFIG_4xx) || defined(CONFIG_BOOKE) encloses some
printk calls which can be compiled in all cases.

Replace by IS_ENABLED().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index fd3e11663613..087711aa0126 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1446,12 +1446,13 @@ void show_regs(struct pt_regs * regs)
trap = TRAP(regs);
if (!trap_is_syscall(regs) && cpu_has_feature(CPU_FTR_CFAR))
pr_cont("CFAR: "REG" ", regs->orig_gpr3);
-   if (trap == 0x200 || trap == 0x300 || trap == 0x600)
-#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
-   pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, regs->dsisr);
-#else
-   pr_cont("DAR: "REG" DSISR: %08lx ", regs->dar, regs->dsisr);
-#endif
+   if (trap == 0x200 || trap == 0x300 || trap == 0x600) {
+   if (IS_ENABLED(CONFIG_4xx) || IS_ENABLED(CONFIG_BOOKE))
+   pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, 
regs->dsisr);
+   else
+   pr_cont("DAR: "REG" DSISR: %08lx ", regs->dar, 
regs->dsisr);
+   }
+
 #ifdef CONFIG_PPC64
pr_cont("IRQMASK: %lx ", regs->softe);
 #endif
-- 
2.25.0



[PATCH v1] powerpc/process: Replace an #ifdef CONFIG_PPC_BOOK3S_64 by IS_ENABLED()

2020-08-16 Thread Christophe Leroy
This #ifdef CONFIG_PPC_BOOK3S_64 guards a call to
preload_new_slb_context() when radix is not enabled.

radix_enabled() is always defined, and the prototype for
preload_new_slb_context() is always present, so the #ifdef
is unneeded.

Replace it by IS_ENABLED().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 987bc8e73d5e..a17d0746d55f 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1725,10 +1725,8 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
 #ifdef CONFIG_PPC64
unsigned long load_addr = regs->gpr[2]; /* saved by ELF_PLAT_INIT */
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   if (!radix_enabled())
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
preload_new_slb_context(start, sp);
-#endif
 #endif
 
/*
-- 
2.25.0



[PATCH v1] powerpc/process: Remove unnecessary #ifdef CONFIG_FUNCTION_GRAPH_TRACER

2020-08-16 Thread Christophe Leroy
ftrace_graph_ret_addr() is always defined and returns 'ip' when
CONFIG_FUNCTION_GRAPH_TRACER is not set.

So the #ifdef is not needed, remove it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ffbe79960c73..e86e28c28259 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2096,10 +2096,8 @@ void show_stack(struct task_struct *tsk, unsigned long *stack,
unsigned long sp, ip, lr, newsp;
int count = 0;
int firstframe = 1;
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
unsigned long ret_addr;
int ftrace_idx = 0;
-#endif
 
if (tsk == NULL)
tsk = current;
@@ -2127,12 +2125,10 @@ void show_stack(struct task_struct *tsk, unsigned long *stack,
if (!firstframe || ip != lr) {
printk("%s["REG"] ["REG"] %pS",
loglvl, sp, ip, (void *)ip);
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
ret_addr = ftrace_graph_ret_addr(current,
&ftrace_idx, ip, stack);
if (ret_addr != ip)
pr_cont(" (%pS)", (void *)ret_addr);
-#endif
if (firstframe)
pr_cont(" (unreliable)");
pr_cont("\n");
-- 
2.25.0



[PATCH v1] powerpc/process: Replace #ifdef CONFIG_KALLSYMS by IS_ENABLED()

2020-08-16 Thread Christophe Leroy
The #ifdef CONFIG_KALLSYMS encloses some printk calls which can
be compiled in all cases.

Replace by IS_ENABLED().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 087711aa0126..987bc8e73d5e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1469,14 +1469,14 @@ void show_regs(struct pt_regs * regs)
break;
}
pr_cont("\n");
-#ifdef CONFIG_KALLSYMS
/*
 * Lookup NIP late so we have the best change of getting the
 * above info out without failing
 */
-   printk("NIP ["REG"] %pS\n", regs->nip, (void *)regs->nip);
-   printk("LR ["REG"] %pS\n", regs->link, (void *)regs->link);
-#endif
+   if (IS_ENABLED(CONFIG_KALLSYMS)) {
+   printk("NIP ["REG"] %pS\n", regs->nip, (void *)regs->nip);
+   printk("LR ["REG"] %pS\n", regs->link, (void *)regs->link);
+   }
show_stack(current, (unsigned long *) regs->gpr[1], KERN_DEFAULT);
if (!user_mode(regs))
show_instructions(regs);
-- 
2.25.0



[PATCH v1] powerpc/process: Replace an #ifdef CONFIG_PPC_47x by IS_ENABLED()

2020-08-16 Thread Christophe Leroy
isync() is always defined, no need for an #ifdef.

Replace it by IS_ENABLED(CONFIG_PPC_47x).

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index e86e28c28259..fd3e11663613 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -777,9 +777,8 @@ static void switch_hw_breakpoint(struct task_struct *new)
 static inline int __set_dabr(unsigned long dabr, unsigned long dabrx)
 {
mtspr(SPRN_DAC1, dabr);
-#ifdef CONFIG_PPC_47x
-   isync();
-#endif
+   if (IS_ENABLED(CONFIG_PPC_47x))
+   isync();
return 0;
 }
 #elif defined(CONFIG_PPC_BOOK3S)
-- 
2.25.0



Re: [PATCH v2 1/5] powerpc/mm: Introduce temporary mm

2020-08-16 Thread Christopher M. Riedl
On Thu Aug 6, 2020 at 6:27 AM CDT, Daniel Axtens wrote:
> Hi Chris,
>   
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> >  bool ppc_breakpoint_available(void);
> >  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> >  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > index 1a474f6b1992..9269c7c7b04e 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -10,6 +10,7 @@
> >  #include
> >  #include 
> >  #include 
> > +#include 
> >  
> >  /*
> >   * Most if the context management is out of line
> > @@ -300,5 +301,68 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
> > return 0;
> >  }
> >  
> > +struct temp_mm {
> > +   struct mm_struct *temp;
> > +   struct mm_struct *prev;
> > +   bool is_kernel_thread;
> > +   struct arch_hw_breakpoint brk[HBP_NUM_MAX];
> > +};
>
> This is on the nitpicky end, but I wonder if this should be named
> temp_mm, or should be labelled something else to capture its broader
> purpose as a context for code patching? I'm thinking that a store of
> breakpoints is perhaps unusual in a memory-managment structure?
>
> I don't have a better suggestion off the top of my head and I'm happy
> for you to leave it, I just wanted to flag it as a possible way we could
> be clearer.

First of all thank you for the review!

I had actually planned to move all this code into lib/code-patching.c
directly (and it turns out that's what x86 ended up doing as well).

>
> > +
> > +static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
> > +{
> > +   temp_mm->temp = mm;
> > +   temp_mm->prev = NULL;
> > +   temp_mm->is_kernel_thread = false;
> > +   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
> > +}
> > +
> > +static inline void use_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +   lockdep_assert_irqs_disabled();
> > +
> > +   temp_mm->is_kernel_thread = current->mm == NULL;
> > +   if (temp_mm->is_kernel_thread)
> > +   temp_mm->prev = current->active_mm;
>
> You don't seem to restore active_mm below. I don't know what active_mm
> does, so I don't know if this is a problem.

For kernel threads 'current->mm' is NULL since a kthread does not need
a userspace mm; however they still need a mm so they "borrow" one which
is indicated by 'current->active_mm'.

'current->mm' needs to be restored because Hash requires a non-NULL
value when handling a page fault and so 'current->mm' gets set to the
temp_mm. This is a special case for kernel threads and Hash translation.

>
> > +   else
> > +   temp_mm->prev = current->mm;
> > +
> > +   /*
> > +* Hash requires a non-NULL current->mm to allocate a userspace address
> > +* when handling a page fault. Does not appear to hurt in Radix either.
> > +*/
> > +   current->mm = temp_mm->temp;
> > +   switch_mm_irqs_off(NULL, temp_mm->temp, current);
> > +
> > +   if (ppc_breakpoint_available()) {
>
> I wondered if this could be changed during a text-patching operation.
> AIUI, it potentially can on a P9 via "dawr_enable_dangerous" in debugfs.
>
> I don't know if that's a problem. My concern is that you could turn off
> breakpoints, call 'use_temporary_mm', then turn them back on again
> before 'unuse_temporary_mm' and get a breakpoint while that can access
> the temporary mm. Is there something else that makes that safe?
> disabling IRQs maybe?

Hmm, I will have to investigate this more. I'm not sure if there is a
better way to just completely disable breakpoints while the temporary mm
is in use.

>
> > +   struct arch_hw_breakpoint null_brk = {0};
> > +   int i = 0;
> > +
> > +   for (; i < nr_wp_slots(); ++i) {
>
> super nitpicky, and I'm not sure if this is actually documented, but I'd
> usually see this written as:
>
> for (i = 0; i < nr_wp_slots(); i++) {
>
> Not sure if there's any reason that it _shouldn't_ be written the way
> you've written it (and I do like initialising the variable when it's
> defined!), I'm just not used to it. (Likewise with the unuse function.)
>

I've found other places (even in arch/powerpc!) where this is done so I
think it's fine. I prefer using this style when the variable
declaration and initialization are "close" to the loop statement.

> > +   __get_breakpoint(i, &temp_mm->brk[i]);
> > +   if (temp_mm->brk[i].type != 0)
> > +   __set_breakpoint(i, &null_brk);
> > +   }
> > +   }
> > +}
> > +
>
> Kind regards,
> Daniel
>
> > +static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
> > +{
> > +   lockdep_assert_irqs_disabled();
> > +
> > +   if (temp_mm->is_kernel_thread)
> > +   current->mm = NULL;
> > +   else
> > +   current->mm = temp_mm->prev;
> > +   switch_mm_irqs_off(NULL, tem

Re: [PATCH v2] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-08-16 Thread Michael Ellerman
Nathan Lynch  writes:
> Michael Ellerman  writes:
>> Tyrel Datwyler  writes:
>>> On 8/11/20 6:20 PM, Nathan Lynch wrote:
  
 +static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
 +{
 +  const unsigned int resched_interval = 20;
 +
 +  BUG_ON(lmb < drmem_info->lmbs);
 +  BUG_ON(lmb >= drmem_info->lmbs + drmem_info->n_lmbs);
>>>
>>> I think BUG_ON is a pretty big no-no these days unless there is no other 
>>> option.
>>
>> It's complicated, but yes we would like to avoid adding them if we can.
>>
>> In a case like this there is no other option, *if* the check has to be
>> in drmem_lmb_next().
>>
>> But I don't think we really need to check that there.
>>
>> If for some reason this was called with an *lmb pointing outside of the
>> lmbs array it would confuse the cond_resched() logic, but it's not worth
>> crashing the box for that IMHO.
>
> The BUG_ONs are pretty much orthogonal to the cond_resched().

Yes that was kind of my point. We don't need them to implement the
cond_resched() logic.

> It's not apparent from the context of the change, but some users of the
> for_each_drmem_lmb* macros modify elements of the drmem_info->lmbs
> array. If the lmb passed to drmem_lmb_next() violates the bounds check
> (say, if the callsite inappropriately modifies it within the loop), such
> users are guaranteed to corrupt other objects in memory. This was my
> thinking in adding the BUG_ONs, and I don't see another place to do
> it.

If it's really something we're worried about, I think we'd be better off
putting checks for that in the code that's doing those modifications.

That way you have enough context to do something more nuanced than a
BUG(), ie. you can print a useful message and fail whatever operation is
in progress.

If we did that then we could also add those BUG_ONs() as a safety net.

What you really want is a way for the for_each_xxx() construct to
express that there was an error traversing the list, but there isn't
really a nice way to do that in C.

cheers


Re: [PATCH v2 2/5] powerpc/lib: Initialize a temporary mm for code patching

2020-08-16 Thread Christopher M. Riedl
On Thu Aug 6, 2020 at 8:24 AM CDT, Daniel Axtens wrote:
> "Christopher M. Riedl"  writes:
>
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped with permissive memory
> > protections. Currently, a per-cpu vmalloc patch area is used for this
> > purpose. While the patch area is per-cpu, the temporary page mapping is
> > inserted into the kernel page tables for the duration of the patching.
> > The mapping is exposed to CPUs other than the patching CPU - this is
> > undesirable from a hardening perspective.
> >
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > portion. The next patch uses the temporary mm and patching address for
> > code patching.
> >
> > Based on x86 implementation:
> >
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> >
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/lib/code-patching.c | 33 
> >  1 file changed, 33 insertions(+)
> >
> > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> > index 0a051dfeb177..8ae1a9e5fe6e 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -11,6 +11,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -44,6 +46,37 @@ int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
> >  }
> >  
> >  #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +static struct mm_struct *patching_mm __ro_after_init;
> > +static unsigned long patching_addr __ro_after_init;
> > +
> > +void __init poking_init(void)
> > +{
> > +   spinlock_t *ptl; /* for protecting pte table */
> > +   pte_t *ptep;
> > +
> > +   /*
> > +* Some parts of the kernel (static keys for example) depend on
> > +* successful code patching. Code patching under STRICT_KERNEL_RWX
> > +* requires this setup - otherwise we cannot patch at all. We use
> > +* BUG_ON() here and later since an early failure is preferred to
> > +* buggy behavior and/or strange crashes later.
> > +*/
> > +   patching_mm = copy_init_mm();
> > +   BUG_ON(!patching_mm);
> > +
> > +   /*
> > +* In hash we cannot go above DEFAULT_MAP_WINDOW easily.
> > +* XXX: Do we want additional bits of entropy for radix?
> > +*/
> > +   patching_addr = (get_random_long() & PAGE_MASK) %
> > +   (DEFAULT_MAP_WINDOW - PAGE_SIZE);
>
> It took me a while to understand this calculation. I see that it's
> calculating a base address for a page in which to do patching. It does
> the following:

I will add a comment explaining the calculation in the next spin.

>
> - get a random long
>
> - mask with PAGE_MASK so as to get a page aligned value
>
> - make sure that the base address is at least one PAGE_SIZE below
> DEFAULT_MAP_WINDOW so we have a clear page between the base and
> DEFAULT_MAP_WINDOW.
>
> On 64-bit Book3S with 64K pages, that works out to be
>
> PAGE_SIZE = 0x0000_0000_0001_0000
> PAGE_MASK = 0xFFFF_FFFF_FFFF_0000
>
> DEFAULT_MAP_WINDOW = DEFAULT_MAP_WINDOW_USER64 = TASK_SIZE_128TB
> = 0x0000_8000_0000_0000
>
> DEFAULT_MAP_WINDOW - PAGE_SIZE = 0x0000_7FFF_FFFF_0000
>
> It took a while (and a conversation with my wife who studied pure
> maths!) but I am convinced that the modulo preserves the page-alignment
> of the patching address.

I am glad a proper mathematician agrees because my maths are decidedly
unpure :)

>
> One thing I did realise is that patching_addr can be zero at the end of
> this process. That seems dubious and slightly error-prone to me - is
> the patching process robust to that or should we exclude it?

Good catch! I will fix this in the next spin.

>
> Anyway, if I have the maths right, there are 0x7fff_ffff or ~2
> billion possible locations for the patching page, which is just shy of
> 31 bits of entropy.
>
> I think this compares pretty favourably to most (K)ASLR implementations?

I will stress that I am not an expert here, but it looks like this does
compare favorably against other 64-bit ASLR [0].

[0]: https://www.cs.ucdavis.edu/~peisert/research/2017-SecDev-AnalysisASLR.pdf

>
> What's the range if built with 4k pages?

Using the formula from my series coverletter, we should expect 34 bits
of entropy since DEFAULT_MAP_WINDOW_USER64 is 64TB for 4K pages:

bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=4K, DEFAULT_MAP_WINDOW_USER64=64TB
bits of entropy = log2(64TB / 4K)
bits of entropy = 34

>
> Kind regards,
> Daniel
>
> > +
> > +   ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > +   BUG_ON(!ptep);
> > +   pte_unmap_unlock(ptep, ptl);
> > +}
> > +
> >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >

[PATCH 1/2] powerpc/kernel/cputable: cleanup the function declarations

2020-08-16 Thread Madhavan Srinivasan
__machine_check_early_realmode_p*() are currently declared
as extern in cputable.c, and because of this, compiling
with "C=1" (which enables the semantic checker) produces these
warnings:

  CHECK   arch/powerpc/kernel/mce_power.c
arch/powerpc/kernel/mce_power.c:709:6: warning: symbol '__machine_check_early_realmode_p7' was not declared. Should it be static?
arch/powerpc/kernel/mce_power.c:717:6: warning: symbol '__machine_check_early_realmode_p8' was not declared. Should it be static?
arch/powerpc/kernel/mce_power.c:722:6: warning: symbol '__machine_check_early_realmode_p9' was not declared. Should it be static?
arch/powerpc/kernel/mce_power.c:740:6: warning: symbol '__machine_check_early_realmode_p10' was not declared. Should it be static?

This patch moves the declarations to asm/mce.h
and includes that header from cputable.h.

Fixes: ae744f3432d3 ("powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on power8")
Fixes: 7b9f71f974a1 ("powerpc/64s: POWER9 machine check handler")
Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/include/asm/cputable.h | 5 +++++
 arch/powerpc/include/asm/mce.h      | 7 +++++++
 arch/powerpc/kernel/cputable.c      | 3 ---
 arch/powerpc/kernel/dt_cpu_ftrs.c   | 4 ----
 4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index fdddb822d564b..e005b45810232 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -9,6 +9,11 @@
 
 #ifndef __ASSEMBLY__
 
+/*
+ * Added to include __machine_check_early_realmode_* functions
+ */
+#include <asm/mce.h>
+
 /* This structure can grow, it's real size is used by head.S code
  * via the mkdefs mechanism.
  */
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index adf2cda67f9a4..89aa8248a57dd 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -210,6 +210,9 @@ struct mce_error_info {
 #define MCE_EVENT_RELEASE  true
 #define MCE_EVENT_DONTRELEASE  false
 
+struct pt_regs;
+struct notifier_block;
+
 extern void save_mce_event(struct pt_regs *regs, long handled,
   struct mce_error_info *mce_err, uint64_t nip,
   uint64_t addr, uint64_t phys_addr);
@@ -225,5 +228,9 @@ int mce_register_notifier(struct notifier_block *nb);
 int mce_unregister_notifier(struct notifier_block *nb);
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
+long __machine_check_early_realmode_p7(struct pt_regs *regs);
+long __machine_check_early_realmode_p8(struct pt_regs *regs);
+long __machine_check_early_realmode_p9(struct pt_regs *regs);
+long __machine_check_early_realmode_p10(struct pt_regs *regs);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 #endif /* __ASM_PPC64_MCE_H__ */
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 3d406a9626e86..4ca9cf556be92 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -72,9 +72,6 @@ extern void __setup_cpu_power9(unsigned long offset, struct cpu_spec* spec);
 extern void __restore_cpu_power9(void);
 extern void __setup_cpu_power10(unsigned long offset, struct cpu_spec* spec);
 extern void __restore_cpu_power10(void);
-extern long __machine_check_early_realmode_p7(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p8(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p9(struct pt_regs *regs);
 #endif /* CONFIG_PPC64 */
 #if defined(CONFIG_E500)
 extern void __setup_cpu_e5500(unsigned long offset, struct cpu_spec* spec);
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 6f8c0c6b937a1..8dc46f38680b2 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -64,10 +64,6 @@ struct dt_cpu_feature {
  * Set up the base CPU
  */
 
-extern long __machine_check_early_realmode_p8(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p9(struct pt_regs *regs);
-extern long __machine_check_early_realmode_p10(struct pt_regs *regs);
-
 static int hv_mode;
 
 static struct {
-- 
2.26.2



[PATCH 2/2] powerpc: Add POWER10 raw mode cputable entry

2020-08-16 Thread Madhavan Srinivasan
Add a raw mode cputable entry for POWER10. Copies most of the fields
from commit a3ea40d5c736 ("powerpc: Add POWER10 architected mode")
except for the oprofile_cpu_type, machine_check_early, pvr_mask and
pvr_value fields. On bare metal systems we use DT CPU features, which
doesn't need a cputable entry. But in VMs we still rely on the raw
cputable entry to set the correct values for the PMU related fields.

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/kernel/cputable.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 4ca9cf556be92..2aa89c6b28967 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -539,6 +539,25 @@ static struct cpu_spec __initdata cpu_specs[] = {
.machine_check_early= __machine_check_early_realmode_p9,
.platform   = "power9",
},
+   {   /* Power10 */
+   .pvr_mask   = 0xffff0000,
+   .pvr_value  = 0x00800000,
+   .cpu_name   = "POWER10 (raw)",
+   .cpu_features   = CPU_FTRS_POWER10,
+   .cpu_user_features  = COMMON_USER_POWER10,
+   .cpu_user_features2 = COMMON_USER2_POWER10,
+   .mmu_features   = MMU_FTRS_POWER10,
+   .icache_bsize   = 128,
+   .dcache_bsize   = 128,
+   .num_pmcs   = 6,
+   .pmc_type   = PPC_PMC_IBM,
+   .oprofile_cpu_type  = "ppc64/power10",
+   .oprofile_type  = PPC_OPROFILE_INVALID,
+   .cpu_setup  = __setup_cpu_power10,
+   .cpu_restore= __restore_cpu_power10,
+   .machine_check_early= __machine_check_early_realmode_p10,
+   .platform   = "power10",
+   },
{   /* Cell Broadband Engine */
.pvr_mask   = 0xffff0000,
.pvr_value  = 0x00700000,
-- 
2.26.2



Re: fsl_espi errors on v5.7.15

2020-08-16 Thread Chris Packham

On 14/08/20 6:19 pm, Heiner Kallweit wrote:
> On 14.08.2020 04:48, Chris Packham wrote:
>> Hi,
>>
>> I'm seeing a problem with accessing spi-nor after upgrading a T2081
>> based system to linux v5.7.15
>>
>> For this board u-boot and the u-boot environment live on spi-nor.
>>
>> When I use fw_setenv from userspace I get the following kernel logs
>>
>> # fw_setenv foo=1
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> ...
>>
> This error reporting doesn't exist yet in 4.4. So you may have an issue
> under 4.4 too, it's just not reported.
> Did you verify that under 4.4 fw_setenv actually has an effect?
Just double checked and yes under 4.4 the setting does get saved.
>> If I run fw_printenv (before getting it into a bad state) it is able to
>> display the content of the boards u-boot environment.
>>
> This might indicate an issue with spi being locked. I've seen related
> questions, just use the search engine of your choice and check for
> fw_setenv and locked.
I'm running a version of fw_setenv which includes 
https://gitlab.denx.de/u-boot/u-boot/-/commit/db820159 so it shouldn't 
be locking things unnecessarily.
I've been unsuccessful in producing a setup for bisecting the issue. I
do know the issue doesn't occur on the old 4.4.x based kernel but
that's probably not much help.
>>
>> Any pointers on what the issue (and/or solution) might be.
>>
>> Thanks,
>> Chris
>>
> Heiner

[PATCH v4 8/8] mm/vmalloc: Hugepage vmalloc mappings

2020-08-16 Thread Nicholas Piggin
On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps,
vmalloc will attempt to allocate PMD-sized pages first, before falling
back to small pages.

Allocations which use something other than PAGE_KERNEL protections are
not permitted to use huge pages yet, not all callers expect this (e.g.,
module allocations vs strict module rwx).

This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for
a given allocation, so a boot option, nohugevmap, is added to disable
it.
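
The core of the strategy, roughly (a simplified sketch based on the
patch below; vmap_allow_huge and arch_vmap_pmd_supported() are
introduced by this series):

	/* Try a PMD-sized page order first, else fall back to order-0. */
	unsigned int page_shift = PAGE_SHIFT;

	if (vmap_allow_huge && arch_vmap_pmd_supported(prot) &&
	    size >= PMD_SIZE && pgprot_val(prot) == pgprot_val(PAGE_KERNEL))
		page_shift = PMD_SHIFT;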

Signed-off-by: Nicholas Piggin 
---
 .../admin-guide/kernel-parameters.txt |   2 +
 include/linux/vmalloc.h               |   1 +
 mm/vmalloc.c                          | 177 +++++++++++++++++++-------
 3 files changed, 137 insertions(+), 43 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 98ea67f27809..eaef176c597f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,8 @@
 
nohugeiomap [KNL,x86,PPC] Disable kernel huge I/O mappings.
 
+   nohugevmap  [KNL,x86,PPC] Disable kernel huge vmalloc mappings.
+
nosmt   [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e3590e93bfff..8f25dbaca0a1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+   unsigned intpage_order;
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4e5cb7c7f780..c3595d87261c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -45,6 +45,19 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmap(char *str)
+{
+   vmap_allow_huge = false;
+   return 0;
+}
+early_param("nohugevmap", set_nohugevmap);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
 bool is_vmalloc_addr(const void *x)
 {
unsigned long addr = (unsigned long)x;
@@ -468,31 +481,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long en
return 0;
 }
 
-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size specify should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing.  The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
-   pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages)
 {
unsigned long start = addr;
-   unsigned long end = addr + size;
-   unsigned long next;
pgd_t *pgd;
+   unsigned long next;
int err = 0;
int nr = 0;
pgtbl_mod_mask mask = 0;
@@ -514,6 +508,65 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
return 0;
 }
 
+static int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+   WARN_ON(page_shift < PAGE_SHIFT);
+
+   if (page_shift == PAGE_SHIFT) {
+   return vmap_small_pages_range_noflush(addr, end, prot, pages);
+   } else {
+   unsigned int i, nr = (end - addr) >> page_shift;
+
+   for (i = 0; i < nr; i++) {
+   int err;
+
+   err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+   __pa(page_address(pages[i])), prot, page_shift);
+   if (err)
+   return err;
+
+   addr += 1UL << page_shift;
+   }
+
+   return 0;
+   }
+}
+
+static int vmap_pages_range(unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+   int err;
+
+   err = vmap_pages_range_noflush(addr,

[PATCH v4 7/8] mm/vmalloc: add vmap_range_noflush variant

2020-08-16 Thread Nicholas Piggin
As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches
the other callers in this file.

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 129f10545bb1..4e5cb7c7f780 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -234,8 +234,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
return 0;
 }
 
-int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
+static int vmap_range_noflush(unsigned long addr, unsigned long end, phys_addr_t phys_addr,
+   pgprot_t prot, unsigned int max_page_shift)
 {
pgd_t *pgd;
unsigned long start;
@@ -255,14 +255,23 @@ int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgp
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
-   flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
return err;
 }
 
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift)
+{
+   int err;
+
+   err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+   flush_cache_vmap(addr, end);
+
+   return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 pgtbl_mod_mask *mask)
 {
-- 
2.23.0



[PATCH v4 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c

2020-08-16 Thread Nicholas Piggin
This is a generic kernel virtual memory mapper, not specific to ioremap.

Signed-off-by: Nicholas Piggin 
---
 include/linux/vmalloc.h |   2 +
 mm/ioremap.c            | 192 --------------------------------------------
 mm/vmalloc.c            | 191 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 193 insertions(+), 192 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 787d77ad7536..e3590e93bfff 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 #ifdef CONFIG_MMU
+extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift);
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index b0032dbadaf7..cdda0e022740 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,198 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
 static const bool iomap_allow_huge = false;
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
-{
-   pte_t *pte;
-   u64 pfn;
-
-   pfn = phys_addr >> PAGE_SHIFT;
-   pte = pte_alloc_kernel_track(pmd, addr, mask);
-   if (!pte)
-   return -ENOMEM;
-   do {
-   BUG_ON(!pte_none(*pte));
-   set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-   pfn++;
-   } while (pte++, addr += PAGE_SIZE, addr != end);
-   *mask |= PGTBL_PTE_MODIFIED;
-   return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
-   if (max_page_shift < PMD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pmd_supported(prot))
-   return 0;
-
-   if ((end - addr) != PMD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PMD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PMD_SIZE))
-   return 0;
-
-   if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
-   return 0;
-
-   return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
-   pgtbl_mod_mask *mask)
-{
-   pmd_t *pmd;
-   unsigned long next;
-
-   pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
-   if (!pmd)
-   return -ENOMEM;
-   do {
-   next = pmd_addr_end(addr, end);
-
-   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
-   *mask |= PGTBL_PMD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
-   return -ENOMEM;
-   } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
-   return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
-   if (max_page_shift < PUD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pud_supported(prot))
-   return 0;
-
-   if ((end - addr) != PUD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PUD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PUD_SIZE))
-   return 0;
-
-   if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
-   return 0;
-
-   return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
-   pgtbl_mod_mask *mask)
-{
-   pud_t *pud;
-   unsigned long next;
-
-   pud = pud_alloc_track(&init_mm, p4d, addr, mask);
-   if (!pud)
-   return -ENOMEM;
-   do {
-   next = pud_addr_end(addr, end);
-
-   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
-   *mask |= PGTBL_PUD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
-   return -ENOMEM;
-   } while (pud++, phys_addr += (next - addr), addr = next, addr != end);
-   return 

[PATCH v4 5/8] mm: HUGE_VMAP arch support cleanup

2020-08-16 Thread Nicholas Piggin
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.

This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.

This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
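
A hypothetical illustration of how an arch might use the new prot
argument (not part of this patch; _PAGE_NO_CACHE stands in for
whatever arch-specific flag marks an uncacheable mapping):

	bool arch_vmap_pmd_supported(pgprot_t prot)
	{
		/* Refuse huge mappings for uncacheable protections. */
		if (pgprot_val(prot) & _PAGE_NO_CACHE)
			return false;
		return radix_enabled();
	}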

Signed-off-by: Nicholas Piggin 
---
 arch/arm64/mm/mmu.c                      | 12 +--
 arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++-
 arch/x86/mm/ioremap.c                    | 12 +--
 include/linux/io.h                       |  9 ---
 include/linux/vmalloc.h                  | 10 +++
 init/main.c                              |  1 -
 mm/ioremap.c                             | 96 +++-
 7 files changed, 73 insertions(+), 77 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..bbb3ccf6a7ce 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,12 +1304,13 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
return dt_virt;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/*
 * Only 4k granule supports level 1 block mappings.
@@ -1319,11 +1320,12 @@ int __init arch_ioremap_pud_supported(void)
   !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-   /* See arch_ioremap_pud_supported() */
+   /* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
+#endif
 
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..eeb0e8451176 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1134,13 +1134,14 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-int __init arch_ioremap_pud_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return radix_enabled();
 }
@@ -1149,6 +1150,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 {
return 0;
 }
+#endif
 
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
@@ -1234,7 +1236,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
return 1;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 84d85dbd1dad..5b8b495ab4ed 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,24 +481,26 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-int __init arch_ioremap_p4d_supported(void)
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
 #ifdef CONFIG_X86_64
return boot_cpu_has(X86_FEATURE_GBPAGES);
 #else
-   return 0;
+   return false;
 #endif
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return boot_cpu_has(X86_FEATURE_PSE);
 }
+#endif
 
 /*
  * Convert a physical pointer to a virtual kernel pointer for /dev/mem
diff --git a/include/linux/io.h b/include/linux/io.h
index 8394c56babc2..f1effd4d7a3c 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-void __init ioremap_huge_init(void);
-int arch_ioremap_p4d_supported(void);
-int arch_ioremap_pud_supported(void);
-int arch_ioremap_pmd_supported(void);
-#else
-static inline void ioremap_huge_init(void) { }
-#endif
-
 /*
  * Managed iomap interface
  */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 0221f852a7e1..787d77ad7536 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -84,6 +84,16 @@ struct vmap_area {
};
 };
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#else
+static inline bool arch_vmap_p4d_suppo

[PATCH v4 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range

2020-08-16 Thread Nicholas Piggin
This will be moved to mm/ and used as a generic kernel virtual mapping
function, so re-name it in preparation.

Signed-off-by: Nicholas Piggin 
---
 mm/ioremap.c | 55 ++--
 1 file changed, 23 insertions(+), 32 deletions(-)

diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..6016ae3227ad 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; }
 static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
 {
pte_t *pte;
u64 pfn;
@@ -81,9 +80,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pmd_enabled())
return 0;
@@ -103,9 +101,8 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
return pmd_set_huge(pmd, phys_addr, prot);
 }
 
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
 {
pmd_t *pmd;
unsigned long next;
@@ -116,20 +113,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
do {
next = pmd_addr_end(addr, end);
 
-   if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
 
-   if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pud_enabled())
return 0;
@@ -149,9 +145,8 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
return pud_set_huge(pud, phys_addr, prot);
 }
 
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
 {
pud_t *pud;
unsigned long next;
@@ -162,20 +157,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
do {
next = pud_addr_end(addr, end);
 
-   if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
 
-   if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_p4d_enabled())
return 0;
@@ -195,9 +189,8 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
return p4d_set_huge(p4d, phys_addr, prot);
 }
 
-static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t p

[PATCH v4 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range

2020-08-16 Thread Nicholas Piggin
The vmalloc mapper operates on a struct page * array rather than a
linear physical address; rename it to make this distinction clear.
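
For context, the two flavours of mapping involved, shown via their
existing entry points (a sketch; prototypes as in the current tree,
bodies omitted):

    /* ioremap: map a linear, physically contiguous range */
    int ioremap_page_range(unsigned long addr, unsigned long end,
                           phys_addr_t phys_addr, pgprot_t prot);

    /* vmalloc: map an array of (possibly discontiguous) pages */
    int map_kernel_range_noflush(unsigned long addr, unsigned long size,
                                 pgprot_t prot, struct page **pages);

The vmap_pages_* prefix ties the renamed walkers to the second flavour.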

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 49f225b0f855..3a1e45fd1626 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
arch_sync_kernel_mappings(start, end);
 }
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-   unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-   pgtbl_mod_mask *mask)
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
pte_t *pte;
 
@@ -218,9 +217,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-   unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-   pgtbl_mod_mask *mask)
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
pmd_t *pmd;
unsigned long next;
@@ -230,15 +228,14 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
-   if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-   unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-   pgtbl_mod_mask *mask)
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
pud_t *pud;
unsigned long next;
@@ -248,15 +245,14 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
-   if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-   unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-   pgtbl_mod_mask *mask)
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
p4d_t *p4d;
unsigned long next;
@@ -266,7 +262,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
-   if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -307,7 +303,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
-   err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+   err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
-- 
2.23.0



[PATCH v4 2/8] mm: apply_to_pte_range warn and fail if a large pte is encountered

2020-08-16 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 mm/memory.c | 60 +++--
 1 file changed, 44 insertions(+), 16 deletions(-)
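
A condensed view of the per-level pattern the patch installs (pmd level
shown; the pud, p4d and pgd hunks below follow the same shape, with the
comments added here for illustration):

    if (pmd_none(*pmd) && !create)
        continue;                      /* hole, and not populating */
    if (WARN_ON_ONCE(pmd_leaf(*pmd)))
        return -EINVAL;                /* large pte: warn and fail */
    if (WARN_ON_ONCE(pmd_bad(*pmd))) {
        if (!create)
            continue;                  /* only repair when creating */
        pmd_clear_bad(pmd);
    }
    err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create);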

diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..1d5f3093c249 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2260,13 +2260,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
-   if (create || !pmd_none_or_clear_bad(pmd)) {
-   err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (pmd_none(*pmd) && !create)
+   continue;
+   if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(pmd_bad(*pmd))) {
+   if (!create)
+   continue;
+   pmd_clear_bad(pmd);
}
+   err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create);
+   if (err)
+   break;
} while (pmd++, addr = next, addr != end);
+
return err;
 }
 
@@ -2287,13 +2294,20 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
-   if (create || !pud_none_or_clear_bad(pud)) {
-   err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (pud_none(*pud) && !create)
+   continue;
+   if (WARN_ON_ONCE(pud_leaf(*pud)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(pud_bad(*pud))) {
+   if (!create)
+   continue;
+   pud_clear_bad(pud);
}
+   err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create);
+   if (err)
+   break;
} while (pud++, addr = next, addr != end);
+
return err;
 }
 
@@ -2314,13 +2328,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
-   if (create || !p4d_none_or_clear_bad(p4d)) {
-   err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (p4d_none(*p4d) && !create)
+   continue;
+   if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(p4d_bad(*p4d))) {
+   if (!create)
+   continue;
+   p4d_clear_bad(p4d);
}
+   err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create);
+   if (err)
+   break;
} while (p4d++, addr = next, addr != end);
+
return err;
 }
 
@@ -2339,8 +2360,15 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
-   if (!create && pgd_none_or_clear_bad(pgd))
+   if (pgd_none(*pgd) && !create)
continue;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(pgd_bad(*pgd))) {
+   if (!create)
+   continue;
+   pgd_clear_bad(pgd);
+   }
err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create);
if (err)
break;
-- 
2.23.0



[PATCH v4 1/8] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings

2020-08-16 Thread Nicholas Piggin
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on architecture details,
alignments, boot options, etc., which the caller cannot be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages: it returns the
struct page that corresponds to the offset within the large page. This
makes the API agnostic to the mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")
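
To illustrate the offset arithmetic (a sketch assuming 4K base pages and
a PMD-level 2M leaf; the p4d and pud cases in the diff below are
analogous):

    /* A huge leaf maps physically contiguous memory, so the struct
     * page for addr is the mapping's first page plus addr's
     * small-page index within the leaf:
     */
    page = pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);

    /* e.g. for addr = leaf_base + 0x3000, this returns
     * pmd_page(*pmd) + 3.
     */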

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 40 ++--
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b482d240f9a2..49f225b0f855 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -38,6 +38,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 
@@ -343,7 +344,9 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
@@ -363,25 +366,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 
if (pgd_none(*pgd))
return NULL;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return NULL; /* XXX: no allowance for huge pgd */
+   if (WARN_ON_ONCE(pgd_bad(*pgd)))
+   return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
-   pud = pud_offset(p4d, addr);
+   if (p4d_leaf(*p4d))
+   return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(p4d_bad(*p4d)))
+   return NULL;
 
-   /*
-* Don't dereference bad PUD or PMD (below) entries. This will also
-* identify huge mappings, which we may encounter on architectures
-* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
-* identified as vmalloc addresses by is_vmalloc_addr(), but are
-* not [unambiguously] associated with a struct page, so there is
-* no correct value to return for them.
-*/
-   WARN_ON_ONCE(pud_bad(*pud));
-   if (pud_none(*pud) || pud_bad(*pud))
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (pud_leaf(*pud))
+   return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
-   WARN_ON_ONCE(pmd_bad(*pmd));
-   if (pmd_none(*pmd) || pmd_bad(*pmd))
+   if (pmd_none(*pmd))
+   return NULL;
+   if (pmd_leaf(*pmd))
+   return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
 
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +400,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.23.0



[PATCH v4 0/8] huge vmalloc mappings

2020-08-16 Thread Nicholas Piggin
Let's try again.

Thanks,
Nick

Since v3:
- Fixed an off-by-one bug in a loop
- Fixed a !CONFIG_HAVE_ARCH_HUGE_VMAP build failure
- Hopefully fixed the arm64 vmap stack bug this time; thanks to Jonathan
  Cameron for debugging its cause.

Since v2:
- Rebased on vmalloc cleanups, split series into simpler pieces.
- Fixed several compile errors and warnings
- Kept the page array and accounting in small page units, because
  struct vm_struct is an interface (this should fix the x86 vmap stack
  debug assert). [Thanks Zefan]

Nicholas Piggin (8):
  mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
  mm: apply_to_pte_range warn and fail if a large pte is encountered
  mm/vmalloc: rename vmap_*_range vmap_pages_*_range
  lib/ioremap: rename ioremap_*_range to vmap_*_range
  mm: HUGE_VMAP arch support cleanup
  mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
  mm/vmalloc: add vmap_range_noflush variant
  mm/vmalloc: Hugepage vmalloc mappings

 .../admin-guide/kernel-parameters.txt |   2 +
 arch/arm64/mm/mmu.c   |  12 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  10 +-
 arch/x86/mm/ioremap.c |  12 +-
 include/linux/io.h|   9 -
 include/linux/vmalloc.h   |  13 +
 init/main.c   |   1 -
 mm/ioremap.c  | 231 +
 mm/memory.c   |  60 ++-
 mm/vmalloc.c  | 445 +++---
 10 files changed, 461 insertions(+), 334 deletions(-)

-- 
2.23.0