Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-10-18 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of October 19, 2020 11:00 am:
> Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
>> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
>>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
>>> > Michal Suchánek  writes:
>>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>>> >>> > Hello,
>>> >>> > 
>>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>>> >>> > Reimplement book3s idle code in C").
>>> >>> > 
>>> >>> > The symptom is host locking up completely after some hours of KVM
>>> >>> > workload with messages like
>>> >>> > 
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> >>> > 
>>> >>> > printed before the host locks up.
>>> >>> > 
>>> >>> > The machines run sandboxed builds, which is a mixed workload resulting in
>>> >>> > IO/single core/multiple core load over time, and there are periods of no
>>> >>> > activity and no VMs running as well. The VMs are short-lived, so VM
>>> >>> > setup/teardown is somewhat exercised as well.
>>> >>> > 
>>> >>> > POWER9 with the new guest entry fast path does not seem to be 
>>> >>> > affected.
>>> >>> > 
>>> >>> > I reverted the patch and the follow-up idle fixes on top of 5.2.14 and
>>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>>> >>> > after idle"), which gives the same idle code as 5.1.16, and the kernel
>>> >>> > seems stable.
>>> >>> > 
>>> >>> > Config is attached.
>>> >>> > 
>>> >>> > I cannot easily revert this commit, especially if I want to use the 
>>> >>> > same
>>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>>> >>> > only to the new idle code.
>>> >>> > 
>>> >>> > Any idea what can be the problem?
>>> >>> 
>>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>>> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>>> >>> BMC unfortunately.
>>> >>
>>> >> It may be possible to set up fadump with a later kernel version that
>>> >> supports it on powernv and dump the whole kernel.
>>> > 
>>> > Your firmware won't support it AFAIK.
>>> > 
>>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
>>> > good chance it won't work :/
>>> 
>>> I haven't had any luck reproducing this yet. I've been testing with
>>> various subcore combinations, etc. I'll keep trying though.
>> 
>> Hello,
>> 
>> I tried running some KVM guests to simulate the workload, and what I get
>> is guests failing to start with an RCU stall. I tried both the 5.3 and 5.9
>> kernels, and qemu 4.2.1 and 5.1.0.
>> 
>> To start some guests I run
>> 
>> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel kvm -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults -nographic -serial mon:telnet::444$i,server,wait & done
>> 
>> To simulate some workload I run
>> 
>> xz -zc9T0 < /dev/zero > /dev/null &
>> while true; do
>> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
>> done &
>> 
>> on the host and add a job that executes this to the ramdisk. However, most
>> guests never get to the point where the job is executed.
>> 
>> Any idea what might be the problem?
> 
> I would say try without pv queued spin locks (but if the same thing is 
> happening with 5.3 then it must be something else I guess). 
> 
> I'll try to test a similar setup on a POWER8 here.

Couldn't reproduce the guest hang, they seem to run fine even with 
queued spinlocks. Might have a different .config.

I might have got a lockup in the host (although different symptoms than 
the original report). I'll look into that a bit further.

Thanks,
Nick


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-10-18 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of October 17, 2020 6:14 am:
> On Mon, Sep 07, 2020 at 11:13:47PM +1000, Nicholas Piggin wrote:
>> Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
>> > Michal Suchánek  writes:
>> >> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>> >>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>> >>> > Hello,
>> >>> > 
>> >>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>> >>> > Reimplement book3s idle code in C").
>> >>> > 
>> >>> > The symptom is host locking up completely after some hours of KVM
>> >>> > workload with messages like
>> >>> > 
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> >>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> >>> > 
>> >>> > printed before the host locks up.
>> >>> > 
>> >>> > The machines run sandboxed builds, which is a mixed workload resulting in
>> >>> > IO/single core/multiple core load over time, and there are periods of no
>> >>> > activity and no VMs running as well. The VMs are short-lived, so VM
>> >>> > setup/teardown is somewhat exercised as well.
>> >>> > 
>> >>> > POWER9 with the new guest entry fast path does not seem to be affected.
>> >>> > 
>> >>> > I reverted the patch and the follow-up idle fixes on top of 5.2.14 and
>> >>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>> >>> > after idle"), which gives the same idle code as 5.1.16, and the kernel
>> >>> > seems stable.
>> >>> > 
>> >>> > Config is attached.
>> >>> > 
>> >>> > I cannot easily revert this commit, especially if I want to use the 
>> >>> > same
>> >>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>> >>> > only to the new idle code.
>> >>> > 
>> >>> > Any idea what can be the problem?
>> >>> 
>> >>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> >>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>> >>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>> >>> BMC unfortunately.
>> >>
>> >> It may be possible to set up fadump with a later kernel version that
>> >> supports it on powernv and dump the whole kernel.
>> > 
>> > Your firmware won't support it AFAIK.
>> > 
>> > You could try kdump, but if we have CPUs stuck in KVM then there's a
>> > good chance it won't work :/
>> 
>> I haven't had any luck reproducing this yet. I've been testing with
>> various subcore combinations, etc. I'll keep trying though.
> 
> Hello,
> 
> I tried running some KVM guests to simulate the workload, and what I get
> is guests failing to start with an RCU stall. I tried both the 5.3 and 5.9
> kernels, and qemu 4.2.1 and 5.1.0.
> 
> To start some guests I run
> 
> for i in $(seq 0 9) ; do /opt/qemu/bin/qemu-system-ppc64 -m 2048 -accel kvm -smp 8 -kernel /boot/vmlinux -initrd /boot/initrd -nodefaults -nographic -serial mon:telnet::444$i,server,wait & done
> 
> To simulate some workload I run
> 
> xz -zc9T0 < /dev/zero > /dev/null &
> while true; do
> killall -STOP xz; sleep 1; killall -CONT xz; sleep 1;
> done &
> 
> on the host and add a job that executes this to the ramdisk. However, most
> guests never get to the point where the job is executed.
> 
> Any idea what might be the problem?

I would say try without pv queued spin locks (but if the same thing is 
happening with 5.3 then it must be something else I guess). 

I'll try to test a similar setup on a POWER8 here.

Thanks,
Nick

> 
> In the past I was able to boot guests quite reliably

[PATCH] powerpc/security: Fix link stack flush instruction

2020-10-07 Thread Nicholas Piggin
The inline execution path for the hardware assisted branch flush
instruction failed to set CTR to the correct value before bcctr,
causing a crash when the feature is enabled.
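
For reference, a sketch of the sequence the three patch sites expand to when
the hardware-assisted flush is enabled, reconstructed from the instruction
encodings the patch writes below (the exact bcctr operands are whatever
PPC_INST_BCCTR_FLUSH encodes):

	li	r9,0x7fff	# 0x39207fff
	mtctr	r9		# 0x7d2903a6
	bcctr			# PPC_INST_BCCTR_FLUSH, the special flush form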

Fixes: 4d24e21cc694 ("powerpc/security: Allow for processors that flush the link stack using the special bcctr")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/asm-prototypes.h |  4 ++-
 arch/powerpc/kernel/entry_64.S|  8 --
 arch/powerpc/kernel/security.c| 34 ---
 3 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index de14b1a34d56..9652756b0694 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -144,7 +144,9 @@ void _kvmppc_restore_tm_pr(struct kvm_vcpu *vcpu, u64 guest_msr);
 void _kvmppc_save_tm_pr(struct kvm_vcpu *vcpu, u64 guest_msr);
 
 /* Patch sites */
-extern s32 patch__call_flush_branch_caches;
+extern s32 patch__call_flush_branch_caches1;
+extern s32 patch__call_flush_branch_caches2;
+extern s32 patch__call_flush_branch_caches3;
 extern s32 patch__flush_count_cache_return;
 extern s32 patch__flush_link_stack_return;
 extern s32 patch__call_kvm_flush_link_stack;
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 733e40eba4eb..2f3846192ec7 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -430,7 +430,11 @@ _ASM_NOKPROBE_SYMBOL(save_nvgprs);
 
 #define FLUSH_COUNT_CACHE  \
 1: nop;\
-   patch_site 1b, patch__call_flush_branch_caches
+   patch_site 1b, patch__call_flush_branch_caches1; \
+1: nop;\
+   patch_site 1b, patch__call_flush_branch_caches2; \
+1: nop;\
+   patch_site 1b, patch__call_flush_branch_caches3
 
 .macro nops number
.rept \number
@@ -512,7 +516,7 @@ _GLOBAL(_switch)
 
kuap_check_amr r9, r10
 
-   FLUSH_COUNT_CACHE
+   FLUSH_COUNT_CACHE   /* Clobbers r9, ctr */
 
/*
 * On SMP kernels, care must be taken because a task may be
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index c9876aab3142..e4e1a94ccf6a 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -430,30 +430,44 @@ device_initcall(stf_barrier_debugfs_init);
 
 static void update_branch_cache_flush(void)
 {
+   u32 *site;
+
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	site = &patch__call_kvm_flush_link_stack;
 	// This controls the branch from guest_exit_cont to kvm_flush_link_stack
 	if (link_stack_flush_type == BRANCH_CACHE_FLUSH_NONE) {
-		patch_instruction_site(&patch__call_kvm_flush_link_stack,
-				       ppc_inst(PPC_INST_NOP));
+		patch_instruction_site(site, ppc_inst(PPC_INST_NOP));
 	} else {
 		// Could use HW flush, but that could also flush count cache
-		patch_branch_site(&patch__call_kvm_flush_link_stack,
-				  (u64)&kvm_flush_link_stack, BRANCH_SET_LINK);
+		patch_branch_site(site, (u64)&kvm_flush_link_stack, BRANCH_SET_LINK);
 	}
 #endif
 
+	// Patch out the bcctr first, then nop the rest
+	site = &patch__call_flush_branch_caches3;
+	patch_instruction_site(site, ppc_inst(PPC_INST_NOP));
+	site = &patch__call_flush_branch_caches2;
+	patch_instruction_site(site, ppc_inst(PPC_INST_NOP));
+	site = &patch__call_flush_branch_caches1;
+	patch_instruction_site(site, ppc_inst(PPC_INST_NOP));
+
 	// This controls the branch from _switch to flush_branch_caches
 	if (count_cache_flush_type == BRANCH_CACHE_FLUSH_NONE &&
 	    link_stack_flush_type == BRANCH_CACHE_FLUSH_NONE) {
-		patch_instruction_site(&patch__call_flush_branch_caches,
-				       ppc_inst(PPC_INST_NOP));
+		// Nothing to be done
+
 	} else if (count_cache_flush_type == BRANCH_CACHE_FLUSH_HW &&
 		   link_stack_flush_type == BRANCH_CACHE_FLUSH_HW) {
-		patch_instruction_site(&patch__call_flush_branch_caches,
-				       ppc_inst(PPC_INST_BCCTR_FLUSH));
+		// Patch in the bcctr last
+		site = &patch__call_flush_branch_caches1;
+		patch_instruction_site(site, ppc_inst(0x39207fff)); // li r9,0x7fff
+		site = &patch__call_flush_branch_caches2;
+		patch_instruction_site(site, ppc_inst(0x7d2903a6)); // mtctr r9
+		site = &patch__call_flush_branch_caches3;
+		patch_instruction_site(site, ppc_inst(PPC_INST_BCCTR_FLUSH));
+
 	} else {
-		patch_branch_site(&patch__call_flush_branch_caches,
-				  (u64)&flush_branch_caches, BRANCH_SET_LINK);

[PATCH v2 2/2] powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address

2020-09-15 Thread Nicholas Piggin
The copy buffer is implemented as a real address in the nest which is
translated from EA by copy, and used for memory access by paste. This
requires that it be invalidated by TLB invalidation.

TLBIE does invalidate the copy buffer, but TLBIEL does not. Add cp_abort
to the tlbiel sequence.
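
As an illustrative sketch of the exposure (not taken from the patch; copy and
paste. are the ISA 3.0 instructions, the register names are placeholders, and
the EA/RA naming follows the description above):

	copy	rA,rB	# nest latches the real address the EA translates to now
	tlbiel	...	# EA->RA mapping changes; the copy buffer is not invalidated
	ptesync
	paste.	rC,rD	# without cp_abort, the paste can still use the stale RA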

Signed-off-by: Nicholas Piggin 
---
v2:
- Untangle headers that were causing build failures.
- Improve the comment a bit.
- Exempt POWER9 from the workaround, as described by the comment (we
  worked this out already but I forgot about it when doing v1!)

 arch/powerpc/include/asm/synch.h   | 19 ++-
 arch/powerpc/mm/book3s64/hash_native.c |  8 
 arch/powerpc/mm/book3s64/radix_tlb.c   | 12 ++--
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index aca70fb43147..a2e89f27d547 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -3,8 +3,9 @@
 #define _ASM_POWERPC_SYNCH_H 
 #ifdef __KERNEL__
 
+#include 
 #include 
-#include 
+#include 
 
 #ifndef __ASSEMBLY__
 extern unsigned int __start___lwsync_fixup, __stop___lwsync_fixup;
@@ -20,6 +21,22 @@ static inline void isync(void)
 {
__asm__ __volatile__ ("isync" : : : "memory");
 }
+
+static inline void ppc_after_tlbiel_barrier(void)
+{
+	asm volatile("ptesync": : :"memory");
+	/*
+	 * POWER9, POWER10 need a cp_abort after tlbiel to ensure the copy
+	 * is invalidated correctly. If this is not done, the paste can take
+	 * data from the physical address that was translated at copy time.
+	 *
+	 * POWER9 in practice does not need this, because address spaces
+	 * with accelerators mapped will use tlbie (which does invalidate
+	 * the copy) to invalidate translations. It's not possible to limit
+	 * POWER10 this way due to local copy-paste.
+	 */
+	asm volatile(ASM_FTR_IFSET(PPC_CP_ABORT, "", %0) : : "i" (CPU_FTR_ARCH_31) : "memory");
+}
 #endif /* __ASSEMBLY__ */
 
 #if defined(__powerpc64__)
diff --git a/arch/powerpc/mm/book3s64/hash_native.c b/arch/powerpc/mm/book3s64/hash_native.c
index cf20e5229ce1..0203cdf48c54 100644
--- a/arch/powerpc/mm/book3s64/hash_native.c
+++ b/arch/powerpc/mm/book3s64/hash_native.c
@@ -82,7 +82,7 @@ static void tlbiel_all_isa206(unsigned int num_sets, unsigned int is)
for (set = 0; set < num_sets; set++)
tlbiel_hash_set_isa206(set, is);
 
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
 }
 
 static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
@@ -110,7 +110,7 @@ static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
 */
tlbiel_hash_set_isa300(0, is, 0, 2, 1);
 
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
 
asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT "; isync" : : :"memory");
 }
@@ -303,7 +303,7 @@ static inline void tlbie(unsigned long vpn, int psize, int apsize,
asm volatile("ptesync": : :"memory");
if (use_local) {
__tlbiel(vpn, psize, apsize, ssize);
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
} else {
__tlbie(vpn, psize, apsize, ssize);
fixup_tlbie_vpn(vpn, psize, apsize, ssize);
@@ -879,7 +879,7 @@ static void native_flush_hash_range(unsigned long number, int local)
__tlbiel(vpn, psize, psize, ssize);
} pte_iterate_hashed_end();
}
-   asm volatile("ptesync":::"memory");
+   ppc_after_tlbiel_barrier();
} else {
int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
 
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..5c9d2fccacc7 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -65,7 +65,7 @@ static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
for (set = 1; set < num_sets; set++)
tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 1);
 
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
 }
 
 void radix__tlbiel_all(unsigned int action)
@@ -296,7 +296,7 @@ static __always_inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
 
/* For PWC, only one flush is needed */
if (ric == RIC_FLUSH_PWC) {
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
return;
}
 
@@ -304,7 +304,7 @@ static __always_inline void _tlbiel_p

[PATCH v2 1/2] powerpc: untangle cputable mce include

2020-09-15 Thread Nicholas Piggin
Having cputable.h include mce.h means it pulls in a bunch of low level
headers (e.g., synch.h) which then can't use CPU_FTR_ definitions.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cputable.h | 5 -
 arch/powerpc/kernel/cputable.c  | 1 +
 arch/powerpc/kernel/dt_cpu_ftrs.c   | 1 +
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 32a15dc49e8c..f89205eff691 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -9,11 +9,6 @@
 
 #ifndef __ASSEMBLY__
 
-/*
- * Added to include __machine_check_early_realmode_* functions
- */
-#include 
-
 /* This structure can grow, it's real size is used by head.S code
  * via the mkdefs mechanism.
  */
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 2aa89c6b2896..b5bc2edef440 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include   /* for PTRRELOC on ARCH=ppc */
+#include 
 #include 
 #include 
 
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index f204ad79b6b5..1098863e17ee 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -17,6 +17,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
2.23.0



Re: Injecting SLB miltihit crashes kernel 5.9.0-rc5

2020-09-15 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of September 15, 2020 10:54 pm:
> Michal Suchánek  writes:
>> Hello,
>>
>> Using the SLB multihit injection test module (which I did not write, so I
>> do not want to post it here) to verify updates on my 5.3 frankenkernel,
>> I found that the kernel crashes with Oops: kernel bad access.
>>
>> I tested on the latest upstream kernel build that I have at hand and the
>> result is the same (minus the message - nothing was logged and the kernel
>> simply rebooted).
> 
> That's disappointing.

It seems to work okay with qemu and mambo injection on upstream
(powernv_defconfig). I wonder why that nmi_enter is crashing.
Can you post the output of a successful test with the patch
reverted?


qemu injection test output - 
[  195.279885][C0] Disabling lock debugging due to kernel taint
[  195.280891][C0] MCE: CPU0: machine check (Warning) Host SLB Multihit DAR: deadbeef [Recovered]
[  195.282117][C0] MCE: CPU0: NIP: [c003c2b4] isa300_idle_stop_mayloss+0x68/0x6c
[  195.283631][C0] MCE: CPU0: Initiator CPU
[  195.284432][C0] MCE: CPU0: Probable Software error (some chance of hardware cause)
[  220.711577][   T90] MCE: CPU0: machine check (Warning) Host SLB Multihit DAR: deadbeef [Recovered]
[  220.712805][   T90] MCE: CPU0: PID: 90 Comm: yes NIP: [7fff7fdac2e0]
[  220.713553][   T90] MCE: CPU0: Initiator CPU
[  220.714021][   T90] MCE: CPU0: Probable Software error (some chance of hardware cause)

Thanks,
Nick


[PATCH] powerpc/64s: move the last of the page fault handling logic to C

2020-09-15 Thread Nicholas Piggin
The page fault handling still has some complex logic particularly around
hash table handling, in asm. Implement this in C instead.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/bug.h|   1 +
 arch/powerpc/kernel/exceptions-64s.S  | 131 +-
 arch/powerpc/mm/book3s64/hash_utils.c |  77 +--
 arch/powerpc/mm/fault.c   |  55 ++-
 4 files changed, 124 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 338f36cd9934..d714d83bbc7c 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -112,6 +112,7 @@
 
 struct pt_regs;
 extern int do_page_fault(struct pt_regs *, unsigned long, unsigned long);
+extern int hash__do_page_fault(struct pt_regs *, unsigned long, unsigned long);
 extern void bad_page_fault(struct pt_regs *, unsigned long, int);
 extern void _exception(int, struct pt_regs *, int, unsigned long);
 extern void _exception_pkey(struct pt_regs *, unsigned long, int);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index f7d748b88705..f830b893fe03 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1403,14 +1403,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
  *
  * Handling:
  * - Hash MMU
- *   Go to do_hash_page first to see if the HPT can be filled from an entry in
- *   the Linux page table. Hash faults can hit in kernel mode in a fairly
+ *   Go to do_hash_fault, which attempts to fill the HPT from an entry in the
+ *   Linux page table. Hash faults can hit in kernel mode in a fairly
  *   arbitrary state (e.g., interrupts disabled, locks held) when accessing
  *   "non-bolted" regions, e.g., vmalloc space. However these should always be
- *   backed by Linux page tables.
+ *   backed by Linux page table entries.
  *
- *   If none is found, do a Linux page fault. Linux page faults can happen in
- *   kernel mode due to user copy operations of course.
+ *   If no entry is found the Linux page fault handler is invoked (by
+ *   do_hash_fault). Linux page faults can happen in kernel mode due to user
+ *   copy operations of course.
  *
  * - Radix MMU
  *   The hardware loads from the Linux page table directly, so a fault goes
@@ -1438,13 +1439,17 @@ EXC_COMMON_BEGIN(data_access_common)
GEN_COMMON data_access
ld  r4,_DAR(r1)
ld  r5,_DSISR(r1)
+   addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
-   ld  r6,_MSR(r1)
-   li  r3,0x300
-   b   do_hash_page/* Try to handle as hpte fault */
+   bl  do_hash_fault
 MMU_FTR_SECTION_ELSE
-   b   handle_page_fault
+   bl  do_page_fault
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
+cmpdi  r3,0
+   beq+interrupt_return
+   /* We need to restore NVGPRS */
+   REST_NVGPRS(r1)
+   b   interrupt_return
 
GEN_KVM data_access
 
@@ -1539,13 +1544,17 @@ EXC_COMMON_BEGIN(instruction_access_common)
GEN_COMMON instruction_access
ld  r4,_DAR(r1)
ld  r5,_DSISR(r1)
+   addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
-   ld  r6,_MSR(r1)
-   li  r3,0x400
-   b   do_hash_page/* Try to handle as hpte fault */
+   bl  do_hash_fault
 MMU_FTR_SECTION_ELSE
-   b   handle_page_fault
+   bl  do_page_fault
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
+cmpdi  r3,0
+   beq+interrupt_return
+   /* We need to restore NVGPRS */
+   REST_NVGPRS(r1)
+   b   interrupt_return
 
GEN_KVM instruction_access
 
@@ -3197,99 +3206,3 @@ disable_machine_check:
RFI_TO_KERNEL
 1: mtlrr0
blr
-
-/*
- * Hash table stuff
- */
-   .balign IFETCH_ALIGN_BYTES
-do_hash_page:
-#ifdef CONFIG_PPC_BOOK3S_64
-   lis r0,(DSISR_BAD_FAULT_64S | DSISR_DABRMATCH | DSISR_KEYFAULT)@h
-   ori r0,r0,DSISR_BAD_FAULT_64S@l
-   and.r0,r5,r0/* weird error? */
-   bne-handle_page_fault   /* if not, try to insert a HPTE */
-
-   /*
-* If we are in an "NMI" (e.g., an interrupt when soft-disabled), then
-* don't call hash_page, just fail the fault. This is required to
-* prevent re-entrancy problems in the hash code, namely perf
-* interrupts hitting while something holds H_PAGE_BUSY, and taking a
-* hash fault. See the comment in hash_preload().
-*/
-   ld  r11, PACA_THREAD_INFO(r13)
-   lwz r0,TI_PREEMPT(r11)
-   andis.  r0,r0,NMI_MASK@h
-   bne 77f
-
-   /*
-* r3 contains the trap number
-* r4 contains the faulting address
-* r5 contains dsisr
-* r6 msr
-*
-* at return r3 = 0 for success, 1 for page fault, negative for error
-*/
-   bl  __hash_page 

[PATCH 6/6] powerpc/64: irq replay remove decrementer overflow check

2020-09-15 Thread Nicholas Piggin
This is an ad-hoc way to catch some cases of decrementer overflow. It
won't catch cases where interrupts were hard disabled before any soft
masked interrupts fired, for example. And it doesn't catch cases that
have overflowed an even number of times.

It's not clear exactly what problem is being solved here. A lost
timer when we have an IRQ-off latency of more than ~4.3 seconds could
be avoided (so long as it's also less than ~8.6s), but this is already
a hard-lockup order of magnitude event, and the decrementer will wrap
again and provide a timer interrupt within the same latency magnitude.

So the test catches some cases of lost decrementers in very exceptional
(buggy) latency event cases, reducing timer interrupt latency in that
case by up to 4.3 seconds. And for the large decrementer, it's useless. It
is performed in a potentially quite hot path, where reading the TB can be
a noticeable overhead.
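
Roughly where those numbers come from (a sketch, assuming a 32-bit
decrementer and a timebase on the order of 500 MHz; the actual timebase
frequency is platform-specific and is an assumption here). The interrupt
first fires within 2^31 ticks, and if it is missed it only fires again
after a further full 2^32-tick wrap:

	2^{31} / (500 \times 10^{6}\,\mathrm{Hz}) \approx 4.3\,\mathrm{s}
	2^{32} / (500 \times 10^{6}\,\mathrm{Hz}) \approx 8.6\,\mathrm{s}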

Perhaps more importantly it allows the clunky MSR[EE] vs
PACA_IRQ_HARD_DIS incoherency to be removed.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/irq.c | 50 +--
 1 file changed, 1 insertion(+), 49 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 631e6d236c97..d7162f142f24 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -102,14 +102,6 @@ static inline notrace unsigned long get_irq_happened(void)
return happened;
 }
 
-static inline notrace int decrementer_check_overflow(void)
-{
-   u64 now = get_tb_or_rtc();
-   u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
- 
-   return now >= *next_tb;
-}
-
 #ifdef CONFIG_PPC_BOOK3E
 
 /* This is called whenever we are re-enabling interrupts
@@ -142,35 +134,6 @@ notrace unsigned int __check_irq_replay(void)
trace_hardirqs_on();
trace_hardirqs_off();
 
-   /*
-* We are always hard disabled here, but PACA_IRQ_HARD_DIS may
-* not be set, which means interrupts have only just been hard
-* disabled as part of the local_irq_restore or interrupt return
-* code. In that case, skip the decrementr check becaus it's
-* expensive to read the TB.
-*
-* HARD_DIS then gets cleared here, but it's reconciled later.
-* Either local_irq_disable will replay the interrupt and that
-* will reconcile state like other hard interrupts. Or interrupt
-* retur will replay the interrupt and in that case it sets
-* PACA_IRQ_HARD_DIS by hand (see comments in entry_64.S).
-*/
-   if (happened & PACA_IRQ_HARD_DIS) {
-   local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
-
-   /*
-* We may have missed a decrementer interrupt if hard disabled.
-* Check the decrementer register in case we had a rollover
-* while hard disabled.
-*/
-   if (!(happened & PACA_IRQ_DEC)) {
-   if (decrementer_check_overflow()) {
-   local_paca->irq_happened |= PACA_IRQ_DEC;
-   happened |= PACA_IRQ_DEC;
-   }
-   }
-   }
-
if (happened & PACA_IRQ_DEC) {
local_paca->irq_happened &= ~PACA_IRQ_DEC;
return 0x900;
@@ -229,18 +192,6 @@ void replay_soft_interrupts(void)
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
WARN_ON_ONCE(mfmsr() & MSR_EE);
 
-   if (happened & PACA_IRQ_HARD_DIS) {
-   /*
-* We may have missed a decrementer interrupt if hard disabled.
-* Check the decrementer register in case we had a rollover
-* while hard disabled.
-*/
-   if (!(happened & PACA_IRQ_DEC)) {
-   if (decrementer_check_overflow())
-   happened |= PACA_IRQ_DEC;
-   }
-   }
-
/*
 * Force the delivery of pending soft-disabled interrupts on PS3.
 * Any HV call will have this side effect.
@@ -345,6 +296,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
WARN_ON_ONCE(!(mfmsr() & MSR_EE));
__hard_irq_disable();
+   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
} else {
/*
 * We should already be hard disabled here. We had bugs
-- 
2.23.0



[PATCH 5/6] powerpc/64: make restore_interrupts 64e only

2020-09-15 Thread Nicholas Piggin
This is not used by 64s.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/irq.c | 37 +++--
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index b725509f9073..631e6d236c97 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -191,6 +191,25 @@ notrace unsigned int __check_irq_replay(void)
 
return 0;
 }
+
+/*
+ * This is specifically called by assembly code to re-enable interrupts
+ * if they are currently disabled. This is typically called before
+ * schedule() or do_signal() when returning to userspace. We do it
+ * in C to avoid the burden of dealing with lockdep etc...
+ *
+ * NOTE: This is called with interrupts hard disabled but not marked
+ * as such in paca->irq_happened, so we need to resync this.
+ */
+void notrace restore_interrupts(void)
+{
+   if (irqs_disabled()) {
+   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
+   local_irq_enable();
+   } else
+   __hard_irq_enable();
+}
+
 #endif /* CONFIG_PPC_BOOK3E */
 
 void replay_soft_interrupts(void)
@@ -364,24 +383,6 @@ notrace void arch_local_irq_restore(unsigned long mask)
 }
 EXPORT_SYMBOL(arch_local_irq_restore);
 
-/*
- * This is specifically called by assembly code to re-enable interrupts
- * if they are currently disabled. This is typically called before
- * schedule() or do_signal() when returning to userspace. We do it
- * in C to avoid the burden of dealing with lockdep etc...
- *
- * NOTE: This is called with interrupts hard disabled but not marked
- * as such in paca->irq_happened, so we need to resync this.
- */
-void notrace restore_interrupts(void)
-{
-   if (irqs_disabled()) {
-   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
-   local_irq_enable();
-   } else
-   __hard_irq_enable();
-}
-
 /*
  * This is a helper to use when about to go into idle low-power
  * when the latter has the side effect of re-enabling interrupts
-- 
2.23.0



[PATCH 4/6] powerpc/64e: remove 64s specific interrupt soft-mask code

2020-09-15 Thread Nicholas Piggin
Since the assembly soft-masking code was moved to 64e specific, there
are some 64s specific interrupt types still there. Remove them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64e.S | 10 --
 arch/powerpc/kernel/irq.c|  2 +-
 2 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index ca444ca82b8d..f579ce46eef2 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1302,16 +1302,6 @@ fast_exception_return:
addir3,r1,STACK_FRAME_OVERHEAD;
bl  do_IRQ
b   ret_from_except
-1: cmpwi   cr0,r3,0xf00
-   bne 1f
-   addir3,r1,STACK_FRAME_OVERHEAD;
-   bl  performance_monitor_exception
-   b   ret_from_except
-1: cmpwi   cr0,r3,0xe60
-   bne 1f
-   addir3,r1,STACK_FRAME_OVERHEAD;
-   bl  handle_hmi_exception
-   b   ret_from_except
 1: cmpwi   cr0,r3,0x900
bne 1f
addir3,r1,STACK_FRAME_OVERHEAD;
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 736a6b56e7d6..b725509f9073 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -113,7 +113,7 @@ static inline notrace int decrementer_check_overflow(void)
 #ifdef CONFIG_PPC_BOOK3E
 
 /* This is called whenever we are re-enabling interrupts
- * and returns either 0 (nothing to do) or 500/900/280/a00/e80 if
+ * and returns either 0 (nothing to do) or 500/900/280 if
  * there's an EE, DEC or DBELL to generate.
  *
  * This is called in two contexts: From arch_local_irq_restore()
-- 
2.23.0



[PATCH 3/6] powerpc/64e: remove PACA_IRQ_EE_EDGE

2020-09-15 Thread Nicholas Piggin
This is not used anywhere.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/hw_irq.h|  5 ++---
 arch/powerpc/kernel/exceptions-64e.S |  1 -
 arch/powerpc/kernel/irq.c| 23 ---
 3 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index 35060be09073..50dc35711db3 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -25,9 +25,8 @@
 #define PACA_IRQ_DBELL 0x02
 #define PACA_IRQ_EE0x04
 #define PACA_IRQ_DEC   0x08 /* Or FIT */
-#define PACA_IRQ_EE_EDGE   0x10 /* BookE only */
-#define PACA_IRQ_HMI   0x20
-#define PACA_IRQ_PMI   0x40
+#define PACA_IRQ_HMI   0x10
+#define PACA_IRQ_PMI   0x20
 
 /*
  * Some soft-masked interrupts must be hard masked until they are replayed
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index d9ed79415100..ca444ca82b8d 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -988,7 +988,6 @@ kernel_dbg_exc:
 .endm
 
 masked_interrupt_book3e_0x500:
-   // XXX When adding support for EPR, use PACA_IRQ_EE_EDGE
masked_interrupt_book3e PACA_IRQ_EE 1
 
 masked_interrupt_book3e_0x900:
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 3fdad9336885..736a6b56e7d6 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -181,16 +181,6 @@ notrace unsigned int __check_irq_replay(void)
return 0x500;
}
 
-   /*
-* Check if an EPR external interrupt happened this bit is typically
-* set if we need to handle another "edge" interrupt from within the
-* MPIC "EPR" handler.
-*/
-   if (happened & PACA_IRQ_EE_EDGE) {
-   local_paca->irq_happened &= ~PACA_IRQ_EE_EDGE;
-   return 0x500;
-   }
-
if (happened & PACA_IRQ_DBELL) {
local_paca->irq_happened &= ~PACA_IRQ_DBELL;
return 0x280;
@@ -270,19 +260,6 @@ void replay_soft_interrupts(void)
hard_irq_disable();
}
 
-   /*
-* Check if an EPR external interrupt happened this bit is typically
-* set if we need to handle another "edge" interrupt from within the
-* MPIC "EPR" handler.
-*/
-   if (IS_ENABLED(CONFIG_PPC_BOOK3E) && (happened & PACA_IRQ_EE_EDGE)) {
-   local_paca->irq_happened &= ~PACA_IRQ_EE_EDGE;
-   regs.trap = 0x500;
-   do_IRQ();
-   if (!(local_paca->irq_happened & PACA_IRQ_HARD_DIS))
-   hard_irq_disable();
-   }
-
if (IS_ENABLED(CONFIG_PPC_DOORBELL) && (happened & PACA_IRQ_DBELL)) {
local_paca->irq_happened &= ~PACA_IRQ_DBELL;
if (IS_ENABLED(CONFIG_PPC_BOOK3E))
-- 
2.23.0



[PATCH 1/6] powerpc/64: fix irq replay missing preempt

2020-09-15 Thread Nicholas Piggin
Prior to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt
replay in C"), replayed interrupts returned by the regular interrupt
exit code, which performs preemption in case an interrupt had set
need_resched.

This logic was missed by the conversion. Adding preempt_disable/enable
around the interrupt replay and final irq enable will reschedule if
needed.

Fixes: 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/irq.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index bf21ebd36190..77019699606a 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -368,6 +368,12 @@ notrace void arch_local_irq_restore(unsigned long mask)
}
}
 
+   /*
+* Disable preempt here, so that the below preempt_enable will
+* perform resched if required (a replayed interrupt may set
+* need_resched).
+*/
+   preempt_disable();
irq_soft_mask_set(IRQS_ALL_DISABLED);
trace_hardirqs_off();
 
@@ -377,6 +383,7 @@ notrace void arch_local_irq_restore(unsigned long mask)
trace_hardirqs_on();
irq_soft_mask_set(IRQS_ENABLED);
__hard_irq_enable();
+   preempt_enable();
 }
 EXPORT_SYMBOL(arch_local_irq_restore);
 
-- 
2.23.0



[PATCH 2/6] powerpc/64: fix irq replay pt_regs->softe value

2020-09-15 Thread Nicholas Piggin
Replayed interrupts get an "artificial" struct pt_regs constructed to
pass to interrupt handler functions. This did not get the softe field
set correctly; it's as though the interrupt had hit while irqs were
disabled. It should be IRQS_ENABLED.

This is possibly harmless, as asynchronous handlers should not be testing
whether irqs were disabled, but it is possible that, for example, some code
is shared with synchronous or NMI handlers, and it makes more sense if
debug output looks at this.

Fixes: 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 77019699606a..3fdad9336885 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -214,7 +214,7 @@ void replay_soft_interrupts(void)
struct pt_regs regs;
 
ppc_save_regs();
-   regs.softe = IRQS_ALL_DISABLED;
+   regs.softe = IRQS_ENABLED;
 
 again:
if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
-- 
2.23.0



Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

2020-09-14 Thread Nicholas Piggin
Excerpts from David Miller's message of September 15, 2020 5:59 am:
> From: Nicholas Piggin 
> Date: Mon, 14 Sep 2020 14:52:18 +1000
> 
>  ...
>> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>> optimisation could be effectively restored by sending IPIs to mm_cpumask
>> members and having them remove themselves from mm_cpumask. This is more
>> tricky so I leave it as an exercise for someone with a sparc64 SMP.
>> powerpc has a (currently similarly broken) example.
>> 
>> Signed-off-by: Nicholas Piggin 
> 
> Sad to see this optimization go away, but what can I do:
> 
> Acked-by: David S. Miller 
> 

Thanks Dave, any objection if we merge this via the powerpc tree
to keep the commits together?

Thanks,
Nick


Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

2020-09-14 Thread Nicholas Piggin
Excerpts from Anatoly Pugachev's message of September 14, 2020 8:23 pm:
> On Mon, Sep 14, 2020 at 10:00 AM Nicholas Piggin  wrote:
>>
>> Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:
>>
>> [...]
>>
>> > The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>> > optimisation could be effectively restored by sending IPIs to mm_cpumask
>> > members and having them remove themselves from mm_cpumask. This is more
>> > tricky so I leave it as an exercise for someone with a sparc64 SMP.
>> > powerpc has a (currently similarly broken) example.
>>
>> So this compiles and boots on qemu, but qemu does not support any
>> sparc64 machines with SMP. Attempting some simple hacks doesn't get
>> me far because openbios isn't populating an SMP device tree, which
>> blows up everywhere.
>>
>> The patch is _relatively_ simple, hopefully it shouldn't explode, so
>> it's probably ready for testing on real SMP hardware, if someone has
>> a few cycles.
> 
> Nick,
> 
> applied this patch to over 'v5.9-rc5' tag , used my test VM (ldom)
> with 32 vcpus.
> Machine boot, stress-ng test ( run as
> "stress-ng --cpu 8 --io 8 --vm 8 --vm-bytes 2G --fork 8 --timeout 15m" )
> finishes without errors.
> 

Thank you very much Anatoly.

Thanks,
Nick


Re: [PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-09-14 Thread Nicholas Piggin
Excerpts from pet...@infradead.org's message of September 14, 2020 8:56 pm:
> On Mon, Sep 14, 2020 at 02:52:16PM +1000, Nicholas Piggin wrote:
>> Reading and modifying current->mm and current->active_mm and switching
>> mm should be done with irqs off, to prevent races seeing an intermediate
>> state.
>> 
>> This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
>> invalidate"). At exec-time when the new mm is activated, the old one
>> should usually be single-threaded and no longer used, unless something
>> else is holding an mm_users reference (which may be possible).
>> 
>> Absent other mm_users, there is also a race with preemption and lazy tlb
>> switching. Consider the kernel_execve case where the current thread is
>> using a lazy tlb active mm:
>> 
>>   call_usermodehelper()
>> kernel_execve()
>>   old_mm = current->mm;
>>   active_mm = current->active_mm;
>>   *** preempt *** >  schedule()
>>prev->active_mm = NULL;
>>mmdrop(prev active_mm);
>>  ...
>>   <  schedule()
>>   current->mm = mm;
>>   current->active_mm = mm;
>>   if (!old_mm)
>>   mmdrop(active_mm);
>> 
>> If we switch back to the kernel thread from a different mm, there is a
>> double free of the old active_mm, and a missing free of the new one.
>> 
>> Closing this race only requires interrupts to be disabled while ->mm
>> and ->active_mm are being switched, but the TLB problem requires also
>> holding interrupts off over activate_mm. Unfortunately not all archs
>> can do that yet, e.g., arm defers the switch if irqs are disabled and
>> expects finish_arch_post_lock_switch() to be called to complete the
>> flush; um takes a blocking lock in activate_mm().
>> 
>> So as a first step, disable interrupts across the mm/active_mm updates
>> to close the lazy tlb preempt race, and provide an arch option to
>> extend that to activate_mm which allows architectures doing IPI based
>> TLB shootdowns to close the second race.
>> 
>> This is a bit ugly, but in the interest of fixing the bug and backporting
>> before all architectures are converted this is a compromise.
>> 
>> Signed-off-by: Nicholas Piggin 
> 
> Acked-by: Peter Zijlstra (Intel) 
> 
> I'm thinking we want this selected on x86 as well. Andy?

Thanks for the ack. The plan was to take it through the powerpc tree,
but if you'd want x86 to select it, maybe a topic branch? Although
Michael will be away during the next merge window so I don't want to
get too fancy. Would you mind doing it in a follow up merge after
powerpc, being that it's (I think) a small change?

I do think all archs should be selecting this, and we want to remove
the divergent code paths from here as soon as possible. I was planning
to send patches for the N+1 window at least for all the easy archs.
But the sooner the better really, we obviously want to share code
coverage with x86 :)

Thanks,
Nick


Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

2020-09-14 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of September 14, 2020 2:52 pm:

[...]

> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
> optimisation could be effectively restored by sending IPIs to mm_cpumask
> members and having them remove themselves from mm_cpumask. This is more
> tricky so I leave it as an exercise for someone with a sparc64 SMP.
> powerpc has a (currently similarly broken) example.

So this compiles and boots on qemu, but qemu does not support any
sparc64 machines with SMP. Attempting some simple hacks doesn't get
me far because openbios isn't populating an SMP device tree, which
blows up everywhere.

The patch is _relatively_ simple, hopefully it shouldn't explode, so
it's probably ready for testing on real SMP hardware, if someone has
a few cycles.

Thanks,
Nick

> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/sparc/kernel/smp_64.c | 65 --
>  1 file changed, 14 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index e286e2badc8a..e38d8bf454e8 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
>   * are flush_tlb_*() routines, and these run after flush_cache_*()
>   * which performs the flushw.
>   *
> - * The SMP TLB coherency scheme we use works as follows:
> - *
> - * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
> - *space has (potentially) executed on, this is the heuristic
> - *we use to avoid doing cross calls.
> - *
> - *Also, for flushing from kswapd and also for clones, we
> - *use cpu_vm_mask as the list of cpus to make run the TLB.
> - *
> - * 2) TLB context numbers are shared globally across all processors
> - *in the system, this allows us to play several games to avoid
> - *cross calls.
> - *
> - *One invariant is that when a cpu switches to a process, and
> - *that processes tsk->active_mm->cpu_vm_mask does not have the
> - *current cpu's bit set, that tlb context is flushed locally.
> - *
> - *If the address space is non-shared (ie. mm->count == 1) we avoid
> - *cross calls when we want to flush the currently running process's
> - *tlb state.  This is done by clearing all cpu bits except the current
> - *processor's in current->mm->cpu_vm_mask and performing the
> - *flush locally only.  This will force any subsequent cpus which run
> - *this task to flush the context from the local tlb if the process
> - *migrates to another cpu (again).
> - *
> - * 3) For shared address spaces (threads) and swapping we bite the
> - *bullet for most cases and perform the cross call (but only to
> - *the cpus listed in cpu_vm_mask).
> - *
> - *The performance gain from "optimizing" away the cross call for threads is
> - *questionable (in theory the big win for threads is the massive sharing of
> - *address space state across processors).
> + * mm->cpu_vm_mask is a bit mask of which cpus an address
> + * space has (potentially) executed on, this is the heuristic
> + * we use to limit cross calls.
>   */
>  
>  /* This currently is only used by the hugetlb arch pre-fault
> @@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
>  void smp_flush_tlb_mm(struct mm_struct *mm)
>  {
>   u32 ctx = CTX_HWBITS(mm->context);
> - int cpu = get_cpu();
>  
> - if (atomic_read(&mm->mm_users) == 1) {
> - cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> - goto local_flush_and_out;
> - }
> + get_cpu();
>  
>   smp_cross_call_masked(&xcall_flush_tlb_mm,
> ctx, 0, 0,
> mm_cpumask(mm));
>  
> -local_flush_and_out:
>   __flush_tlb_mm(ctx, SECONDARY_CONTEXT);
>  
>   put_cpu();
> @@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
>  {
>   u32 ctx = CTX_HWBITS(mm->context);
>   struct tlb_pending_info info;
> - int cpu = get_cpu();
> +
> + get_cpu();
>  
>   info.ctx = ctx;
>   info.nr = nr;
>   info.vaddrs = vaddrs;
>  
> - if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
> - cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
> - else
> - smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> -    &info, 1);
> + smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
> +    &info, 1);
>  
>   __flush_tlb_pending(ctx, nr, vaddrs);
>  
> @@ -1134,14 +1098,13 @@ void sm

[PATCH v2 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

2020-09-13 Thread Nicholas Piggin
Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
only if it hasn't been switched-to by the time the IPI is processed.

This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.

Reviewed-by: Michael Ellerman 
Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/tlb.h   | 13 -
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 ---
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
return false;
return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
-   WARN_ON(atomic_read(&mm->context.copros) > 0);
-   /*
-* It's possible for mm_access to take a reference on mm_users to
-* access the remote mm from another thread, but it's not allowed
-* to set mm_cpumask, so mm_users may be > 1 here.
-*/
-   WARN_ON(current->mm != mm);
-   atomic_set(&mm->context.active_cpus, 1);
-   cpumask_clear(mm_cpumask(mm));
-   cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..143b4fd396f0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
struct mm_struct *mm = arg;
unsigned long pid = mm->context.id;
 
+   /*
+* A kthread could have done a mmget_not_zero() after the flushing CPU
+* checked mm_is_singlethreaded, and be in the process of
+* kthread_use_mm when interrupted here. In that case, current->mm will
+* be set to mm, because kthread_use_mm() setting ->mm and switching to
+* the mm is done with interrupts off.
+*/
if (current->mm == mm)
-   return; /* Local CPU */
+   goto out_flush;
 
if (current->active_mm == mm) {
-   /*
-* Must be a kernel thread because sender is single-threaded.
-*/
-   BUG_ON(current->mm);
+   WARN_ON_ONCE(current->mm != NULL);
+   /* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
-   switch_mm(mm, &init_mm, current);
current->active_mm = &init_mm;
+   switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
}
+
+   atomic_dec(&mm->context.active_cpus);
+   cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
 
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
 */
smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
(void *)mm, 1);
-   mm_reset_thread_local(mm);
 }
 
 void radix__flush_tlb_mm(struct mm_struct *mm)
-- 
2.23.0



[PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

2020-09-13 Thread Nicholas Piggin
The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to operate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is that code which previously assumed mm ==
current->mm and mm->mm_users == 1 implies the mm will remain
single-threaded, at least until this thread creates another mm_users
reference, has now broken.

arch/sparc/kernel/smp_64.c:

if (atomic_read(&mm->mm_users) == 1) {
cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
goto local_flush_and_out;
}

vs fs/io_uring.c

if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
 !mmget_not_zero(ctx->sqo_mm)))
return -EFAULT;
kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.
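
An illustrative interleaving of that window (a sketch based on the excerpts
above; the CPU numbers and the SQPOLL kthread are only for illustration):

	CPU0 (task using mm)                  CPU1 (io_uring kthread)
	smp_flush_tlb_mm(mm)
	  atomic_read(&mm->mm_users) == 1
	                                      mmget_not_zero(mm)  /* mm_users -> 2 */
	                                      kthread_use_mm(mm)
	                                        /* sets CPU1 in mm_cpumask(mm) */
	  cpumask_copy(mm_cpumask(mm),
	               cpumask_of(CPU0))      /* CPU1's bit is lost */
	  /* later flushes skip CPU1, which can hold stale translations */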

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.

Signed-off-by: Nicholas Piggin 
---
 arch/sparc/kernel/smp_64.c | 65 --
 1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *space has (potentially) executed on, this is the heuristic
- *we use to avoid doing cross calls.
- *
- *Also, for flushing from kswapd and also for clones, we
- *use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *in the system, this allows us to play several games to avoid
- *cross calls.
- *
- *One invariant is that when a cpu switches to a process, and
- *that processes tsk->active_mm->cpu_vm_mask does not have the
- *current cpu's bit set, that tlb context is flushed locally.
- *
- *If the address space is non-shared (ie. mm->count == 1) we avoid
- *cross calls when we want to flush the currently running process's
- *tlb state.  This is done by clearing all cpu bits except the current
- *processor's in current->mm->cpu_vm_mask and performing the
- *flush locally only.  This will force any subsequent cpus which run
- *this task to flush the context from the local tlb if the process
- *migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *bullet for most cases and perform the cross call (but only to
- *the cpus listed in cpu_vm_mask).
- *
- *The performance gain from "optimizing" away the cross call for threads is
- *questionable (in theory the big win for threads is the massive sharing of
- *address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */
 
 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
u32 ctx = CTX_HWBITS(mm->context);
-   int cpu = get_cpu();
 
-   if (atomic_read(&mm->mm_users) == 1) {
-   cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-   goto local_flush_and_out;
-   }
+   get_cpu();
 
smp_cross_call_masked(&xcall_flush_tlb_mm,
  ctx, 0, 0,
  mm_cpumask(mm));
 
-local_flush_and_out:
__flush_tlb_mm(ctx, SECONDARY_CONTEXT);

[PATCH v2 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM

2020-09-13 Thread Nicholas Piggin
powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   | 1 +
 arch/powerpc/include/asm/mmu_context.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65bed1fdeaad..587ba8352d01 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
select ARCH_WANT_IPC_PARSE_VERSION
+   select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index a3a12a8341b2..b42813359f49 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 #define activate_mm activate_mm
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
-   switch_mm(prev, next, current);
+   switch_mm_irqs_off(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
-- 
2.23.0



[PATCH v2 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-09-13 Thread Nicholas Piggin
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

  call_usermodehelper()
    kernel_execve()
      old_mm = current->mm;
      active_mm = current->active_mm;
      *** preempt *** -------------------->  schedule()
                                               prev->active_mm = NULL;
                                               mmdrop(prev active_mm);
                                             ...
      <--------------------------------------  schedule()
      current->mm = mm;
      current->active_mm = mm;
      if (!old_mm)
          mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
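
For reference, the schedule() side of the diagram is the scheduler's lazy
tlb handling; a condensed sketch of the relevant branch (paraphrasing
context_switch()/finish_task_switch() in kernel/sched/core.c, not the exact
code):

    /* context_switch(): switching from a lazy tlb kernel thread
     * (prev->mm == NULL) to a user task */
    if (!prev->mm) {
            /* will mmdrop() in finish_task_switch() */
            rq->prev_mm = prev->active_mm;
            prev->active_mm = NULL;
    }
    ...
    /* finish_task_switch(), mm == rq->prev_mm from above: */
    if (mm)
            mmdrop(mm);

If exec's update of ->mm/->active_mm is preempted part way through, this
accounting and exec's final mmdrop() disagree about who owns the lazy
reference, giving the double free / missing free described above.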

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig |  7 +++
 fs/exec.c| 17 +++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
bool
depends on MMU_GATHER_TABLE_FREE
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+   bool
+   help
+ Temporary select until all architectures can be converted to have
+ irqs disabled over activate_mm. Architectures that do IPI based TLB
+ shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
}
 
task_lock(tsk);
-   active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
-   tsk->mm = mm;
+
+   local_irq_disable();
+   active_mm = tsk->active_mm;
tsk->active_mm = mm;
+   tsk->mm = mm;
+   /*
+* This prevents preemption while active_mm is being loaded and
+* it and mm are being updated, which could cause problems for
+* lazy tlb mm refcounting when these are updated by context
+* switches. Not all architectures can handle irqs off over
+* activate_mm yet.
+*/
+   if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
activate_mm(active_mm, mm);
+   if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
task_unlock(tsk);
-- 
2.23.0



[PATCH v2 0/4] more mm switching vs TLB shootdown and lazy tlb fixes

2020-09-13 Thread Nicholas Piggin
This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.
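
(For illustration only: one possible shape for such a hook, with a made-up
name that nothing in this series adds, would be

    local_irq_disable();
    active_mm = tsk->active_mm;
    tsk->active_mm = mm;
    tsk->mm = mm;
    activate_mm(active_mm, mm);
    local_irq_enable();
    /* hypothetical: archs like arm/um complete the switch here */
    arch_activate_mm_post(active_mm, mm);

i.e. the mm/active_mm switch still happens with irqs off and only the
arch-specific tail runs with irqs back on.)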

Testing so far indicates this has fixed a mm refcounting bug that
powerpc was running into (via distro report and backport). I haven't
had any real feedback on this series outside powerpc (and it doesn't
really affect other archs), so I propose patches 1,2,4 go via the
powerpc tree.

There is no dependency between them and patch 3, I put it there only
because it follows the history of the code (powerpc code was written
using the sparc64 logic), but I guess they have to go via different arch
trees. Dave, I'll leave patch 3 with you.

Thanks,
Nick

Since v1:
- Updates from Michael Ellerman's review comments.

Nicholas Piggin (4):
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

 arch/Kconfig   |  7 +++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/include/asm/tlb.h | 13 --
 arch/powerpc/mm/book3s64/radix_tlb.c   | 23 ++---
 arch/sparc/kernel/smp_64.c | 65 ++
 fs/exec.c  | 17 ++-
 7 files changed, 54 insertions(+), 74 deletions(-)

-- 
2.23.0



Re: [PATCH] spi: fsl-espi: Only process interrupts for expected events

2020-09-13 Thread Nicholas Piggin
Excerpts from Chris Packham's message of September 14, 2020 8:03 am:
> Hi All,
> 
> On 4/09/20 12:28 pm, Chris Packham wrote:
>> The SPIE register contains counts for the TX FIFO so any time the irq
>> handler was invoked we would attempt to process the RX/TX fifos. Use the
>> SPIM value to mask the events so that we only process interrupts that
>> were expected.
>>
>> This was a latent issue exposed by commit 3282a3da25bd ("powerpc/64:
>> Implement soft interrupt replay in C").
>>
>> Signed-off-by: Chris Packham 
>> Cc: sta...@vger.kernel.org
>> ---
> ping?

I don't know the code/hardware but thanks for tracking this down.

Was there anything more to be done with Jocke's observations, or would 
that be a follow-up patch if anything?

If this patch fixes your problem it should probably go in, unless there 
are any objections.

Thanks,
Nick

>>
>> Notes:
>>  I've tested this on a T2080RDB and a custom board using the T2081 SoC. 
>> With
>>  this change I don't see any spurious instances of the "Transfer done but
>>  SPIE_DON isn't set!" or "Transfer done but rx/tx fifo's aren't empty!" 
>> messages
>>  and the updates to spi flash are successful.
>>  
>>  I think this should go into the stable trees that contain 3282a3da25bd 
>> but I
>>  haven't added a Fixes: tag because I think 3282a3da25bd exposed the 
>> issue as
>>  opposed to causing it.
>>
>>   drivers/spi/spi-fsl-espi.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
>> index 7e7c92cafdbb..cb120b68c0e2 100644
>> --- a/drivers/spi/spi-fsl-espi.c
>> +++ b/drivers/spi/spi-fsl-espi.c
>> @@ -574,13 +574,14 @@ static void fsl_espi_cpu_irq(struct fsl_espi *espi, 
>> u32 events)
>>   static irqreturn_t fsl_espi_irq(s32 irq, void *context_data)
>>   {
>>  struct fsl_espi *espi = context_data;
>> -u32 events;
>> +u32 events, mask;
>>   
>>  spin_lock(&espi->lock);
>>   
>>  /* Get interrupt events(tx/rx) */
>>  events = fsl_espi_read_reg(espi, ESPI_SPIE);
>> -if (!events) {
>> +mask = fsl_espi_read_reg(espi, ESPI_SPIM);
>> +if (!(events & mask)) {
>>  spin_unlock(&espi->lock);
>>  return IRQ_NONE;
>>  }


Re: [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification

2020-09-13 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 9, 2020 4:20 pm:
> 
> 
> Le 09/09/2020 à 08:04, Aneesh Kumar K.V a écrit :
>> Christophe Leroy  writes:
>> 
>>> search_exception_tables() is an heavy operation, we have to avoid it.
>>> When KUAP is selected, we'll know the fault has been blocked by KUAP.
>>> Otherwise, it behaves just as if the address was already in the TLBs
>>> and no fault was generated.
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   arch/powerpc/mm/fault.c | 20 +---
>>>   1 file changed, 5 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>>> index 525e0c2b5406..edde169ba3a6 100644
>>> --- a/arch/powerpc/mm/fault.c
>>> +++ b/arch/powerpc/mm/fault.c
>>> @@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
>>> unsigned long error_code,
>>> if (address >= TASK_SIZE)
>>> return true;
>>>   
>>> -   if (!is_exec && (error_code & DSISR_PROTFAULT) &&
>>> -   !search_exception_tables(regs->nip)) {
>>> +   // Read/write fault blocked by KUAP is bad, it can never succeed.
>>> +   if (bad_kuap_fault(regs, address, is_write)) {
>>> pr_crit_ratelimited("Kernel attempted to access user page (%lx) 
>>> - exploit attempt? (uid: %d)\n",
>>> -   address,
>>> -   from_kuid(&init_user_ns, current_uid()));
>>> -   }
>>> -
>>> -   // Fault on user outside of certain regions (eg. copy_tofrom_user()) is 
>>> bad
>>> -   if (!search_exception_tables(regs->nip))
>>> -   return true;
>> 
>> We still need to keep this ? Without that we detect the lack of
>> exception tables pretty late.
> 
> Is that a problem at all to detect the lack of exception tables late ?
> That case is very unlikely and will lead to failure anyway. So, is it 
> worth impacting performance of the likely case which will always have an 
> exception table and where we expect the exception to run as fast as 
> possible ?
> 
> The other architectures I have looked at (arm64 and x86) only have the 
> exception table search together with the down_read_trylock(&mm->mmap_sem).

Yeah, I don't see how it'd be a problem. The user could arrange for the
page table to already map this address and avoid the fault, so it's not
the right way to stop an attacker; KUAP is.

Thanks,
Nick


Re: [PATCH v1 5/5] powerpc/fault: Perform exception fixup in do_page_fault()

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> Exception fixup doesn't require the heady full regs saving,
  heavy

> do it from do_page_fault() directly.
> 
> For that, split bad_page_fault() in two parts.
> 
> As bad_page_fault() can also be called from other places than
> handle_page_fault(), it will still perform exception fixup and
> fallback on __bad_page_fault().
> 
> handle_page_fault() directly calls __bad_page_fault() as the
> exception fixup will now be done by do_page_fault()

Looks good. We can probably get rid of bad_page_fault completely after 
this too.

Hmm, the alignment exception might(?) hit user copies if the user points
them at CI memory. Then you could race with the memory getting unmapped.
In that case the exception table check might be better made explicit
there, with comments.

The first call in do_hash_fault is not required (copy user will never
be in nmi context). The second one and the one in slb_fault could be
made explicit too. Anyway for now this is fine.
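
To make "explicit" concrete, such a call site could do something roughly
like this (illustrative only, not part of this patch; names depend on the
call site):

    const struct exception_table_entry *entry;

    entry = search_exception_tables(regs->nip);
    if (entry) {
            /* e.g. a user copy raced with the memory being unmapped */
            regs->nip = extable_fixup(entry);
            return;
    }
    __bad_page_fault(regs, address, sig);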

Thanks,
Nick

Reviewed-by: Nicholas Piggin 


> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/entry_32.S   |  2 +-
>  arch/powerpc/kernel/exceptions-64e.S |  2 +-
>  arch/powerpc/kernel/exceptions-64s.S |  2 +-
>  arch/powerpc/mm/fault.c  | 33 
>  4 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index f4d0af8e1136..c198786591f9 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -678,7 +678,7 @@ handle_page_fault:
>   mr  r5,r3
>   addir3,r1,STACK_FRAME_OVERHEAD
>   lwz r4,_DAR(r1)
> - bl  bad_page_fault
> + bl  __bad_page_fault
>   b   ret_from_except_full
>  
>  #ifdef CONFIG_PPC_BOOK3S_32
> diff --git a/arch/powerpc/kernel/exceptions-64e.S 
> b/arch/powerpc/kernel/exceptions-64e.S
> index d9ed79415100..dd9161ea5da8 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -1024,7 +1024,7 @@ storage_fault_common:
>   mr  r5,r3
>   addir3,r1,STACK_FRAME_OVERHEAD
>   ld  r4,_DAR(r1)
> - bl  bad_page_fault
> + bl  __bad_page_fault
>   b   ret_from_except
>  
>  /*
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index f7d748b88705..2cb3bcfb896d 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -3254,7 +3254,7 @@ handle_page_fault:
>   mr  r5,r3
>   addir3,r1,STACK_FRAME_OVERHEAD
>   ld  r4,_DAR(r1)
> - bl  bad_page_fault
> + bl  __bad_page_fault
>   b   interrupt_return
>  
>  /* We have a data breakpoint exception - handle it */
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index edde169ba3a6..bd6e397eb84a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -542,10 +542,20 @@ NOKPROBE_SYMBOL(__do_page_fault);
>  int do_page_fault(struct pt_regs *regs, unsigned long address,
> unsigned long error_code)
>  {
> + const struct exception_table_entry *entry;
>   enum ctx_state prev_state = exception_enter();
>   int rc = __do_page_fault(regs, address, error_code);
>   exception_exit(prev_state);
> - return rc;
> + if (likely(!rc))
> + return 0;
> +
> + entry = search_exception_tables(regs->nip);
> + if (unlikely(!entry))
> + return rc;
> +
> + instruction_pointer_set(regs, extable_fixup(entry));
> +
> + return 0;
>  }
>  NOKPROBE_SYMBOL(do_page_fault);
>  
> @@ -554,17 +564,10 @@ NOKPROBE_SYMBOL(do_page_fault);
>   * It is called from the DSI and ISI handlers in head.S and from some
>   * of the procedures in traps.c.
>   */
> -void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
> +void __bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
>  {
> - const struct exception_table_entry *entry;
>   int is_write = page_fault_is_write(regs->dsisr);
>  
> - /* Are we prepared to handle this fault?  */
> - if ((entry = search_exception_tables(regs->nip)) != NULL) {
> - regs->nip = extable_fixup(entry);
> - return;
> - }
> -
>   /* kernel has accessed a bad area */
>  
>   switch (TRAP(regs)) {
> @@ -598,3 +601,15 @@ void bad_page_fault(struct pt_regs *regs, unsigned long 
> address, int sig)
>  
>   die("Kernel access of bad area", regs, sig);

Re: [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> search_exception_tables() is an heavy operation, we have to avoid it.
> When KUAP is selected, we'll know the fault has been blocked by KUAP.
> Otherwise, it behaves just as if the address was already in the TLBs
> and no fault was generated.
> 
> Signed-off-by: Christophe Leroy 

Sorry I missed reviewing this. Yes, we discussed this and decided, I
think, that it's not effective (and KUAP solves it properly).

Reviewed-by: Nicholas Piggin 

> ---
>  arch/powerpc/mm/fault.c | 20 +---
>  1 file changed, 5 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 525e0c2b5406..edde169ba3a6 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>   if (address >= TASK_SIZE)
>   return true;
>  
> - if (!is_exec && (error_code & DSISR_PROTFAULT) &&
> - !search_exception_tables(regs->nip)) {
> + // Read/write fault blocked by KUAP is bad, it can never succeed.
> + if (bad_kuap_fault(regs, address, is_write)) {
>   pr_crit_ratelimited("Kernel attempted to access user page (%lx) 
> - exploit attempt? (uid: %d)\n",
> - address,
> - from_kuid(&init_user_ns, current_uid()));
> - }
> -
> - // Fault on user outside of certain regions (eg. copy_tofrom_user()) is 
> bad
> - if (!search_exception_tables(regs->nip))
> - return true;
> -
> - // Read/write fault in a valid region (the exception table search passed
> - // above), but blocked by KUAP is bad, it can never succeed.
> - if (bad_kuap_fault(regs, address, is_write))
> + address, from_kuid(&init_user_ns, 
> current_uid()));
>   return true;
> + }
>  
> - // What's left? Kernel fault on user in well defined regions (extable
> - // matched), and allowed by KUAP in the faulting context.
> + // What's left? Kernel fault on user and allowed by KUAP in the 
> faulting context.
>   return false;
>  }
>  
> -- 
> 2.25.0
> 
> 


Re: [PATCH v1 3/5] powerpc/fault: Reorder tests in bad_kernel_fault()

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> Check address earlier to simplify the following test.

Good logic reduction.

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 9ef9ee244f72..525e0c2b5406 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -210,17 +210,17 @@ static bool bad_kernel_fault(struct pt_regs *regs, 
> unsigned long error_code,
>   return true;
>   }
>  
> - if (!is_exec && address < TASK_SIZE && (error_code & DSISR_PROTFAULT) &&
> + // Kernel fault on kernel address is bad
> + if (address >= TASK_SIZE)
> + return true;
> +
> + if (!is_exec && (error_code & DSISR_PROTFAULT) &&
>   !search_exception_tables(regs->nip)) {
>   pr_crit_ratelimited("Kernel attempted to access user page (%lx) 
> - exploit attempt? (uid: %d)\n",
>   address,
> from_kuid(&init_user_ns, current_uid()));
>   }
>  
> - // Kernel fault on kernel address is bad
> - if (address >= TASK_SIZE)
> - return true;
> -
>   // Fault on user outside of certain regions (eg. copy_tofrom_user()) is 
> bad
>   if (!search_exception_tables(regs->nip))
>   return true;
> -- 
> 2.25.0
> 
> 


Re: [PATCH v1 2/5] powerpc/fault: Unnest definition of page_fault_is_write() and page_fault_is_bad()

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> To make it more readable, separate page_fault_is_write() and 
> page_fault_is_bad()
> to avoid several levels of #ifdefs

Reviewed-by: Nicholas Piggin 

> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 2efa34d7e644..9ef9ee244f72 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -363,17 +363,19 @@ static void sanity_check_fault(bool is_write, bool 
> is_user,
>   */
>  #if (defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
>  #define page_fault_is_write(__err)   ((__err) & ESR_DST)
> -#define page_fault_is_bad(__err) (0)
>  #else
>  #define page_fault_is_write(__err)   ((__err) & DSISR_ISSTORE)
> -#if defined(CONFIG_PPC_8xx)
> +#endif
> +
> +#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
> +#define page_fault_is_bad(__err) (0)
> +#elif defined(CONFIG_PPC_8xx)
>  #define page_fault_is_bad(__err) ((__err) & DSISR_NOEXEC_OR_G)
>  #elif defined(CONFIG_PPC64)
>  #define page_fault_is_bad(__err) ((__err) & DSISR_BAD_FAULT_64S)
>  #else
>  #define page_fault_is_bad(__err) ((__err) & DSISR_BAD_FAULT_32S)
>  #endif
> -#endif
>  
>  /*
>   * For 600- and 800-family processors, the error_code parameter is DSISR
> -- 
> 2.25.0
> 
> 


Re: [PATCH v1 1/5] powerpc/mm: sanity_check_fault() should work for all, not only BOOK3S

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> The verification and message introduced by commit 374f3f5979f9
> ("powerpc/mm/hash: Handle user access of kernel address gracefully")
> applies to all platforms, it should not be limited to BOOK3S.
> 
> Make the BOOK3S version of sanity_check_fault() the one for all,
> and bail out earlier if not BOOK3S.
> 
> Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address 
> gracefully")
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/fault.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 925a7231abb3..2efa34d7e644 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -303,7 +303,6 @@ static inline void cmo_account_page_fault(void)
>  static inline void cmo_account_page_fault(void) { }
>  #endif /* CONFIG_PPC_SMLPAR */
>  
> -#ifdef CONFIG_PPC_BOOK3S
>  static void sanity_check_fault(bool is_write, bool is_user,
>  unsigned long error_code, unsigned long address)
>  {
> @@ -320,6 +319,9 @@ static void sanity_check_fault(bool is_write, bool 
> is_user,
>   return;
>   }
>  
> + if (!IS_ENABLED(CONFIG_PPC_BOOK3S))
> + return;

Seems okay. Why is address == -1 special though? I guess it's because
it may not be an exploit kernel reference but a buggy pointer underflow?
In that case -1 doesn't seem like it would catch very much. Would it be
better to test for the high bit set, for example ((long)address < 0)?

Anyway for your patch

Reviewed-by: Nicholas Piggin 

> +
>   /*
>* For hash translation mode, we should never get a
>* PROTFAULT. Any update to pte to reduce access will result in us
> @@ -354,10 +356,6 @@ static void sanity_check_fault(bool is_write, bool 
> is_user,
>  
>   WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
>  }
> -#else
> -static void sanity_check_fault(bool is_write, bool is_user,
> -unsigned long error_code, unsigned long address) 
> { }
> -#endif /* CONFIG_PPC_BOOK3S */
>  
>  /*
>   * Define the correct "is_write" bit in error_code based
> -- 
> 2.25.0
> 
> 


Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 7, 2020 9:34 pm:
> On Mon, 2020-09-07 at 11:20 +0200, Christophe Leroy wrote:
>> 
>> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>> > Make interrupt handlers all just take the pt_regs * argument and load
>> > DAR/DSISR etc from that. Make those that return a value return long.
>> 
>> I like this, it will likely simplify a bit the VMAP_STACK mess.
>> 
>> Not sure it is that easy. My board is stuck after the start of init.
>> 
>> 
>> On the 8xx, on Instruction TLB Error exception, we do
>> 
>>  andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>> 
>> On book3s/32, on ISI exception we do:
>>  andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>> 
>> On 40x and bookE, on ISI exception we do:
>>  li  r5,0/* Pass zero as arg3 */
>> 
>> 
>> And regs->dsisr will just contain nothing
>> 
>> So it means we should at least write back r5 into regs->dsisr from there 
>> ? The performance impact should be minimal as we already write _DAR so 
>> the cache line should already be in the cache.
>> 
>> A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick, 
>> allthough we don't want to do it for both ISI and DSI at the end, so 
>> you'll have to do it in every head_xxx.S
> 
> To get you series build and work, I did the following hacks:

Great, thanks for this.

> diff --git a/arch/powerpc/include/asm/interrupt.h
> b/arch/powerpc/include/asm/interrupt.h
> index acfcc7d5779b..c11045d3113a 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -93,7 +93,9 @@ static inline void interrupt_nmi_exit_prepare(struct
> pt_regs *regs, struct inter
>  {
>   nmi_exit();
>  
> +#ifdef CONFIG_PPC64
>   this_cpu_set_ftrace_enabled(state->ftrace_enabled);
> +#endif

This seems okay, not a hack.

>  #ifdef CONFIG_PPC_BOOK3S_64
>   /* Check we didn't change the pending interrupt mask. */
> diff --git a/arch/powerpc/kernel/entry_32.S
> b/arch/powerpc/kernel/entry_32.S
> index f4d0af8e1136..66f7adbe1076 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -663,6 +663,7 @@ ppc_swapcontext:
>   */
>   .globl  handle_page_fault
>  handle_page_fault:
> + stw r5,_DSISR(r1)
>   addir3,r1,STACK_FRAME_OVERHEAD
>  #ifdef CONFIG_PPC_BOOK3S_32
>   andis.  r0,r5,DSISR_DABRMATCH@h

Is this what you want to do for 32, or do you want to separate
ISI and DSI sides?

Thanks,
Nick


Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions

2020-09-08 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 7, 2020 7:20 pm:
> 
> 
> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>> Make interrupt handlers all just take the pt_regs * argument and load
>> DAR/DSISR etc from that. Make those that return a value return long.
> 
> I like this, it will likely simplify a bit the VMAP_STACK mess.
> 
> Not sure it is that easy. My board is stuck after the start of init.
> 
> 
> On the 8xx, on Instruction TLB Error exception, we do
> 
>   andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
> 
> On book3s/32, on ISI exception we do:
>   andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
> 
> On 40x and bookE, on ISI exception we do:
>   li  r5,0/* Pass zero as arg3 */
> 
> 
> And regs->dsisr will just contain nothing
> 
> So it means we should at least write back r5 into regs->dsisr from there 
> ? The performance impact should be minimal as we already write _DAR so 
> the cache line should already be in the cache.

Yes, I think that would be required. Sorry I didn't look closely at
32 bit.

> A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick, 
> allthough we don't want to do it for both ISI and DSI at the end, so 
> you'll have to do it in every head_xxx.S
> 
> 
> While you are at it, it would probably also make sense to do remove the 
> address param of bad_page_fault(), there is no point in loading back 
> regs->dar in handle_page_fault() and machine_check_8xx() and 
> alignment_exception(), just read regs->dar in bad_page_fault()
> 
> The case of do_break() should also be looked at.

Yeah that's valid, I didn't do that because bad_page_fault was also
being called from asm, but an incremental patch should be quite easy.

> Why changing return code from int to long ?

Oh it's to make the next patch work without any changes to function
prototypes. Some handlers are returning int, others long. There is
no reason not to just return long AFAIKS so that's what I changed to.
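
i.e. the handlers that return a value all end up behind the same wrapper
shape (roughly, as in the DEFINE_INTERRUPT_HANDLER_RAW/_NMI macros):

    static __always_inline long ___func(struct pt_regs *regs);
    __visible noinstr long func(struct pt_regs *regs);

so settling on long avoids needing both int and long variants of the
wrapper macros.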

Thanks,
Nick


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-09-07 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of August 31, 2020 8:50 pm:
> Michal Suchánek  writes:
>> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>>> > Hello,
>>> > 
>>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>>> > Reimplement book3s idle code in C").
>>> > 
>>> > The symptom is host locking up completely after some hours of KVM
>>> > workload with messages like
>>> > 
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> > 
>>> > printed before the host locks up.
>>> > 
>>> > The machines run sandboxed builds which is a mixed workload resulting in
>>> > IO/single core/mutiple core load over time and there are periods of no
>>> > activity and no VMS runnig as well. The VMs are shortlived so VM
>>> > setup/terdown is somewhat excercised as well.
>>> > 
>>> > POWER9 with the new guest entry fast path does not seem to be affected.
>>> > 
>>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
>>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
>>> > stable.
>>> > 
>>> > Config is attached.
>>> > 
>>> > I cannot easily revert this commit, especially if I want to use the same
>>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>>> > only to the new idle code.
>>> > 
>>> > Any idea what can be the problem?
>>> 
>>> So hwthread_state is never getting back to to HWTHREAD_IN_IDLE on
>>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>>> BMC unfortunately.
>>
>> It may be possible to set up fadump with a later kernel version that
>> supports it on powernv and dump the whole kernel.
> 
> Your firmware won't support it AFAIK.
> 
> You could try kdump, but if we have CPUs stuck in KVM then there's a
> good chance it won't work :/

I still haven't had any luck reproducing this. I've been testing with
sub-cores in various different combinations, etc. I'll keep trying though.

I don't know if there's much we can add to debug it. Can we run pdbg
on the BMCs on these things?

Thanks,
Nick


Re: [RFC PATCH 12/12] powerpc/64s: power4 nap fixup in C

2020-09-07 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 7, 2020 2:48 pm:
> 
> 
> Le 07/09/2020 à 06:02, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of September 6, 2020 5:32 pm:
>>>
>>>
>>> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>>>> There is no need for this to be in asm, use the new interrupt entry wrapper.
>>>>
>>>> Signed-off-by: Nicholas Piggin 
>>>> ---
>>>>arch/powerpc/include/asm/interrupt.h   | 14 
>>>>arch/powerpc/include/asm/processor.h   |  1 +
>>>>arch/powerpc/include/asm/thread_info.h |  6 
>>>>arch/powerpc/kernel/exceptions-64s.S   | 45 --
>>>>arch/powerpc/kernel/idle_book3s.S  |  4 +++
>>>>5 files changed, 25 insertions(+), 45 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/processor.h 
>>>> b/arch/powerpc/include/asm/processor.h
>>>> index ed0d633ab5aa..3da1dba91386 100644
>>>> --- a/arch/powerpc/include/asm/processor.h
>>>> +++ b/arch/powerpc/include/asm/processor.h
>>>> @@ -424,6 +424,7 @@ extern unsigned long isa300_idle_stop_mayloss(unsigned 
>>>> long psscr_val);
>>>>extern unsigned long isa206_idle_insn_mayloss(unsigned long type);
>>>>#ifdef CONFIG_PPC_970_NAP
>>>>extern void power4_idle_nap(void);
>>>> +extern void power4_idle_nap_return(void);
>>>
>>> Please please please, 'extern' keyword is pointless and deprecated for
>>> function prototypes. Don't add new ones.
>>>
>>> Also, put it outside the #ifdef, so that you can use IS_ENABLED()
>>> instead of #ifdef when using it.
>> 
>> I just copy-pasted and forgot to remove it. I expect someone will do a
>> "cleanup" patch to get rid of them in one go; I find a random assortment
>> of extern and not extern to be even uglier :(
> 
> If we don't want to make fixes backporting a huge headache, some 
> transition with random assortment is the price to pay.
> 
> One day, when 'extern' have become the minority, we can get rid of the 
> few last ones.
> 
> But if someone believe it is not such a problem with backporting, I can 
> provide a cleanup patch now.

I can't really see declarations being a big problem for backporting
or git history. But maybe Michael won't like such a patch.

I will try to remember externs though.

Thanks,
Nick


Re: [RFC PATCH 12/12] powerpc/64s: power4 nap fixup in C

2020-09-06 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of September 6, 2020 5:32 pm:
> 
> 
> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>> There is no need for this to be in asm, use the new interrupt entry wrapper.
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>   arch/powerpc/include/asm/interrupt.h   | 14 
>>   arch/powerpc/include/asm/processor.h   |  1 +
>>   arch/powerpc/include/asm/thread_info.h |  6 
>>   arch/powerpc/kernel/exceptions-64s.S   | 45 --
>>   arch/powerpc/kernel/idle_book3s.S  |  4 +++
>>   5 files changed, 25 insertions(+), 45 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/interrupt.h 
>> b/arch/powerpc/include/asm/interrupt.h
>> index 3ae3d2f93b61..acfcc7d5779b 100644
>> --- a/arch/powerpc/include/asm/interrupt.h
>> +++ b/arch/powerpc/include/asm/interrupt.h
>> @@ -8,6 +8,16 @@
>>   #include 
>>   #include 
>>   
>> +static inline void nap_adjust_return(struct pt_regs *regs)
>> +{
>> +#ifdef CONFIG_PPC_970_NAP
> 
> Avoid #ifdef, you can use IS_ENABLED(CONFIG_PPC_970_NAP) in the 'if' below

Yeah I guess.

>> +if (test_thread_local_flags(_TLF_NAPPING)) {
>> +clear_thread_local_flags(_TLF_NAPPING);
>> +regs->nip = (unsigned long)power4_idle_nap_return;
>> +}
>> +#endif
>> +}
>> +
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   static inline void interrupt_enter_prepare(struct pt_regs *regs)
>>   {
>> @@ -33,6 +43,8 @@ static inline void interrupt_async_enter_prepare(struct 
>> pt_regs *regs)
>>  if (cpu_has_feature(CPU_FTR_CTRL) &&
>>  !test_thread_local_flags(_TLF_RUNLATCH))
>>  __ppc64_runlatch_on();
>> +
>> +nap_adjust_return(regs);
>>   }
>>   
>>   #else /* CONFIG_PPC_BOOK3S_64 */
>> @@ -72,6 +84,8 @@ static inline void interrupt_nmi_enter_prepare(struct 
>> pt_regs *regs, struct inte
>>   
>>  this_cpu_set_ftrace_enabled(0);
>>   
>> +nap_adjust_return(regs);
>> +
>>  nmi_enter();
>>   }
>>   
>> diff --git a/arch/powerpc/include/asm/processor.h 
>> b/arch/powerpc/include/asm/processor.h
>> index ed0d633ab5aa..3da1dba91386 100644
>> --- a/arch/powerpc/include/asm/processor.h
>> +++ b/arch/powerpc/include/asm/processor.h
>> @@ -424,6 +424,7 @@ extern unsigned long isa300_idle_stop_mayloss(unsigned 
>> long psscr_val);
>>   extern unsigned long isa206_idle_insn_mayloss(unsigned long type);
>>   #ifdef CONFIG_PPC_970_NAP
>>   extern void power4_idle_nap(void);
>> +extern void power4_idle_nap_return(void);
> 
> Please please please, 'extern' keyword is pointless and deprecated for 
> function prototypes. Don't add new ones.
> 
> Also, put it outside the #ifdef, so that you can use IS_ENABLED() 
> instead of #ifdef when using it.

I just copy-pasted and forgot to remove it. I expect someone will do a
"cleanup" patch to get rid of them in one go; I find a random assortment
of extern and not extern to be even uglier :(

>>   #endif
>>   
>>   extern unsigned long cpuidle_disable;
>> diff --git a/arch/powerpc/include/asm/thread_info.h 
>> b/arch/powerpc/include/asm/thread_info.h
>> index ca6c97025704..9b15f7edb0cb 100644
>> --- a/arch/powerpc/include/asm/thread_info.h
>> +++ b/arch/powerpc/include/asm/thread_info.h
>> @@ -156,6 +156,12 @@ void arch_setup_new_exec(void);
>>   
>>   #ifndef __ASSEMBLY__
>>   
>> +static inline void clear_thread_local_flags(unsigned int flags)
>> +{
>> +struct thread_info *ti = current_thread_info();
>> +ti->local_flags &= ~flags;
>> +}
>> +
>>   static inline bool test_thread_local_flags(unsigned int flags)
>>   {
>>  struct thread_info *ti = current_thread_info();
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
>> b/arch/powerpc/kernel/exceptions-64s.S
>> index 227bad3a586d..1db6b3438c88 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -692,25 +692,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
>>  ld  r1,GPR1(r1)
>>   .endm
>>   
>> -/*
>> - * When the idle code in power4_idle puts the CPU into NAP mode,
>> - * it has to do so in a loop, and relies on the external interrupt
>> - * and decrementer interrupt entry code to get it out of the loop.
>> - * It sets the _TLF_NAPPING bit in current_thread_info()->local_flags
>> - * to signal that it is in the loop and needs help to get out.
>> - */
>> -#ifde

[RFC PATCH 12/12] powerpc/64s: power4 nap fixup in C

2020-09-05 Thread Nicholas Piggin
There is no need for this to be in asm, use the new interrupt entry wrapper.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h   | 14 
 arch/powerpc/include/asm/processor.h   |  1 +
 arch/powerpc/include/asm/thread_info.h |  6 
 arch/powerpc/kernel/exceptions-64s.S   | 45 --
 arch/powerpc/kernel/idle_book3s.S  |  4 +++
 5 files changed, 25 insertions(+), 45 deletions(-)
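
For context, the idle side of the protocol this replaces looks conceptually
like the below (the real code is asm in idle_book3s.S; this is a sketch of
the contract, not the implementation):

    /* power4_idle_nap(), conceptually: */
    current_thread_info()->local_flags |= _TLF_NAPPING;
    /* nap until an external or decrementer interrupt fires; that
     * interrupt's entry path must see _TLF_NAPPING, clear it, and
     * return to power4_idle_nap_return rather than into the nap loop */

nap_adjust_return() now does that fixup from the C wrapper, replacing the
FINISH_NAP asm macro.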

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 3ae3d2f93b61..acfcc7d5779b 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -8,6 +8,16 @@
 #include 
 #include 
 
+static inline void nap_adjust_return(struct pt_regs *regs)
+{
+#ifdef CONFIG_PPC_970_NAP
+   if (test_thread_local_flags(_TLF_NAPPING)) {
+   clear_thread_local_flags(_TLF_NAPPING);
+   regs->nip = (unsigned long)power4_idle_nap_return;
+   }
+#endif
+}
+
 #ifdef CONFIG_PPC_BOOK3S_64
 static inline void interrupt_enter_prepare(struct pt_regs *regs)
 {
@@ -33,6 +43,8 @@ static inline void interrupt_async_enter_prepare(struct 
pt_regs *regs)
if (cpu_has_feature(CPU_FTR_CTRL) &&
!test_thread_local_flags(_TLF_RUNLATCH))
__ppc64_runlatch_on();
+
+   nap_adjust_return(regs);
 }
 
 #else /* CONFIG_PPC_BOOK3S_64 */
@@ -72,6 +84,8 @@ static inline void interrupt_nmi_enter_prepare(struct pt_regs 
*regs, struct inte
 
this_cpu_set_ftrace_enabled(0);
 
+   nap_adjust_return(regs);
+
nmi_enter();
 }
 
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index ed0d633ab5aa..3da1dba91386 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -424,6 +424,7 @@ extern unsigned long isa300_idle_stop_mayloss(unsigned long 
psscr_val);
 extern unsigned long isa206_idle_insn_mayloss(unsigned long type);
 #ifdef CONFIG_PPC_970_NAP
 extern void power4_idle_nap(void);
+extern void power4_idle_nap_return(void);
 #endif
 
 extern unsigned long cpuidle_disable;
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index ca6c97025704..9b15f7edb0cb 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -156,6 +156,12 @@ void arch_setup_new_exec(void);
 
 #ifndef __ASSEMBLY__
 
+static inline void clear_thread_local_flags(unsigned int flags)
+{
+   struct thread_info *ti = current_thread_info();
+   ti->local_flags &= ~flags;
+}
+
 static inline bool test_thread_local_flags(unsigned int flags)
 {
struct thread_info *ti = current_thread_info();
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 227bad3a586d..1db6b3438c88 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -692,25 +692,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
ld  r1,GPR1(r1)
 .endm
 
-/*
- * When the idle code in power4_idle puts the CPU into NAP mode,
- * it has to do so in a loop, and relies on the external interrupt
- * and decrementer interrupt entry code to get it out of the loop.
- * It sets the _TLF_NAPPING bit in current_thread_info()->local_flags
- * to signal that it is in the loop and needs help to get out.
- */
-#ifdef CONFIG_PPC_970_NAP
-#define FINISH_NAP \
-BEGIN_FTR_SECTION  \
-   ld  r11, PACA_THREAD_INFO(r13); \
-   ld  r9,TI_LOCAL_FLAGS(r11); \
-   andi.   r10,r9,_TLF_NAPPING;\
-   bnelpower4_fixup_nap;   \
-END_FTR_SECTION_IFSET(CPU_FTR_CAN_NAP)
-#else
-#define FINISH_NAP
-#endif
-
 /*
  * There are a few constraints to be concerned with.
  * - Real mode exceptions code/data must be located at their physical location.
@@ -1250,7 +1231,6 @@ EXC_COMMON_BEGIN(machine_check_common)
 */
GEN_COMMON machine_check
 
-   FINISH_NAP
/* Enable MSR_RI when finished with PACA_EXMC */
li  r10,MSR_RI
mtmsrd  r10,1
@@ -1572,7 +1552,6 @@ EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x100)
 EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
 EXC_COMMON_BEGIN(hardware_interrupt_common)
GEN_COMMON hardware_interrupt
-   FINISH_NAP
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_IRQ
b   interrupt_return
@@ -1757,7 +1736,6 @@ EXC_VIRT_BEGIN(decrementer, 0x4900, 0x80)
 EXC_VIRT_END(decrementer, 0x4900, 0x80)
 EXC_COMMON_BEGIN(decrementer_common)
GEN_COMMON decrementer
-   FINISH_NAP
addir3,r1,STACK_FRAME_OVERHEAD
bl  timer_interrupt
b   interrupt_return
@@ -1842,7 +1820,6 @@ EXC_VIRT_BEGIN(doorbell_super, 0x4a00, 0x100)
 EXC_VIRT_END(doorbell_super, 0x4a00, 0x100)
 EXC_COMMON_BEGIN(doorbell_super_common)
GEN_COMMON doorbell_super
-   FINISH_NAP

[RFC PATCH 11/12] powerpc/64s: runlatch interrupt handling in C

2020-09-05 Thread Nicholas Piggin
There is no need for this to be in asm, use the new interrupt entry wrapper.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h | 16 ++--
 arch/powerpc/kernel/exceptions-64s.S | 18 --
 2 files changed, 14 insertions(+), 20 deletions(-)
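
For context: the runlatch is the RUN bit in the CTRL SPR, used to tell the
platform whether a thread is doing useful work (it is turned off in the
idle loop). __ppc64_runlatch_on() is roughly the following (condensed from
arch/powerpc/kernel/process.c, not verbatim):

    if (cpu_has_feature(CPU_FTR_ARCH_206)) {
            /* RUN is the only writable bit, no mfspr needed */
            mtspr(SPRN_CTRLT, CTRL_RUNLATCH);
    } else {
            unsigned long ctrl = mfspr(SPRN_CTRLF);
            ctrl |= CTRL_RUNLATCH;
            mtspr(SPRN_CTRLT, ctrl);
    }
    current_thread_info()->local_flags |= _TLF_RUNLATCH;

The only change here is that the CPU_FTR_CTRL / _TLF_RUNLATCH check and the
call move from the RUNLATCH_ON asm macro into the C wrapper.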

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index c26a7c466416..3ae3d2f93b61 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC_BOOK3S_64
 static inline void interrupt_enter_prepare(struct pt_regs *regs)
@@ -25,10 +26,22 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs)
}
 }
 
+static inline void interrupt_async_enter_prepare(struct pt_regs *regs)
+{
+   interrupt_enter_prepare(regs);
+
+   if (cpu_has_feature(CPU_FTR_CTRL) &&
+   !test_thread_local_flags(_TLF_RUNLATCH))
+   __ppc64_runlatch_on();
+}
+
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline void interrupt_enter_prepare(struct pt_regs *regs)
 {
 }
+static inline void interrupt_async_enter_prepare(struct pt_regs *regs)
+{
+}
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 struct interrupt_nmi_state {
@@ -76,7 +89,6 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs 
*regs, struct inter
 #endif
 }
 
-
 /**
  * DECLARE_INTERRUPT_HANDLER_RAW - Declare raw interrupt handler function
  * @func:  Function name of the entry point
@@ -193,7 +205,7 @@ static __always_inline void ___##func(struct pt_regs 
*regs);\
\
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
-   interrupt_enter_prepare(regs);  \
+   interrupt_async_enter_prepare(regs);\
\
___##func (regs);   \
 }  \
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 0949dd47be59..227bad3a586d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -692,14 +692,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
ld  r1,GPR1(r1)
 .endm
 
-#define RUNLATCH_ON\
-BEGIN_FTR_SECTION  \
-   ld  r3, PACA_THREAD_INFO(r13);  \
-   ld  r4,TI_LOCAL_FLAGS(r3);  \
-   andi.   r0,r4,_TLF_RUNLATCH;\
-   beqlppc64_runlatch_on_trampoline;   \
-END_FTR_SECTION_IFSET(CPU_FTR_CTRL)
-
 /*
  * When the idle code in power4_idle puts the CPU into NAP mode,
  * it has to do so in a loop, and relies on the external interrupt
@@ -1581,7 +1573,6 @@ EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
 EXC_COMMON_BEGIN(hardware_interrupt_common)
GEN_COMMON hardware_interrupt
FINISH_NAP
-   RUNLATCH_ON
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_IRQ
b   interrupt_return
@@ -1767,7 +1758,6 @@ EXC_VIRT_END(decrementer, 0x4900, 0x80)
 EXC_COMMON_BEGIN(decrementer_common)
GEN_COMMON decrementer
FINISH_NAP
-   RUNLATCH_ON
addir3,r1,STACK_FRAME_OVERHEAD
bl  timer_interrupt
b   interrupt_return
@@ -1853,7 +1843,6 @@ EXC_VIRT_END(doorbell_super, 0x4a00, 0x100)
 EXC_COMMON_BEGIN(doorbell_super_common)
GEN_COMMON doorbell_super
FINISH_NAP
-   RUNLATCH_ON
addir3,r1,STACK_FRAME_OVERHEAD
 #ifdef CONFIG_PPC_DOORBELL
bl  doorbell_exception
@@ -2208,7 +2197,6 @@ EXC_COMMON_BEGIN(hmi_exception_early_common)
 EXC_COMMON_BEGIN(hmi_exception_common)
GEN_COMMON hmi_exception
FINISH_NAP
-   RUNLATCH_ON
addir3,r1,STACK_FRAME_OVERHEAD
bl  handle_hmi_exception
b   interrupt_return
@@ -2238,7 +2226,6 @@ EXC_VIRT_END(h_doorbell, 0x4e80, 0x20)
 EXC_COMMON_BEGIN(h_doorbell_common)
GEN_COMMON h_doorbell
FINISH_NAP
-   RUNLATCH_ON
addir3,r1,STACK_FRAME_OVERHEAD
 #ifdef CONFIG_PPC_DOORBELL
bl  doorbell_exception
@@ -2272,7 +2259,6 @@ EXC_VIRT_END(h_virt_irq, 0x4ea0, 0x20)
 EXC_COMMON_BEGIN(h_virt_irq_common)
GEN_COMMON h_virt_irq
FINISH_NAP
-   RUNLATCH_ON
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_IRQ
b   interrupt_return
@@ -2319,7 +2305,6 @@ EXC_VIRT_END(performance_monitor, 0x4f00, 0x20)
 EXC_COMMON_BEGIN(performance_monitor_common)
GEN_COMMON performance_monitor
FINISH_NAP
-   RUNLATCH_ON
addir3,r1,STACK_FRAME_OVERHEAD
bl  performance_monitor_exception
b   interrup

[RFC PATCH 10/12] powerpc/64s: move NMI soft-mask handling to C

2020-09-05 Thread Nicholas Piggin
Saving and restoring soft-mask state can now be done in C using the
interrupt handler wrapper functions.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h | 25 
 arch/powerpc/kernel/exceptions-64s.S | 60 
 2 files changed, 25 insertions(+), 60 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 69eb8a432984..c26a7c466416 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -33,12 +33,30 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs)
 
 struct interrupt_nmi_state {
 #ifdef CONFIG_PPC64
+#ifdef CONFIG_PPC_BOOK3S_64
+   u8 irq_soft_mask;
+   u8 irq_happened;
+#endif
u8 ftrace_enabled;
 #endif
 };
 
 static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct 
interrupt_nmi_state *state)
 {
+#ifdef CONFIG_PPC_BOOK3S_64
+   state->irq_soft_mask = local_paca->irq_soft_mask;
+   state->irq_happened = local_paca->irq_happened;
+   state->ftrace_enabled = this_cpu_get_ftrace_enabled();
+
+   /*
+* Set IRQS_ALL_DISABLED unconditionally so irqs_disabled() does
+* the right thing, and set IRQ_HARD_DIS. We do not want to reconcile
+* because that goes through irq tracing which we don't want in NMI.
+*/
+   local_paca->irq_soft_mask = IRQS_ALL_DISABLED;
+   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
+#endif
+
this_cpu_set_ftrace_enabled(0);
 
nmi_enter();
@@ -49,6 +67,13 @@ static inline void interrupt_nmi_exit_prepare(struct pt_regs 
*regs, struct inter
nmi_exit();
 
this_cpu_set_ftrace_enabled(state->ftrace_enabled);
+
+#ifdef CONFIG_PPC_BOOK3S_64
+   /* Check we didn't change the pending interrupt mask. */
+   WARN_ON_ONCE((state->irq_happened | PACA_IRQ_HARD_DIS) != 
local_paca->irq_happened);
+   local_paca->irq_happened = state->irq_happened;
+   local_paca->irq_soft_mask = state->irq_soft_mask;
+#endif
 }
 
 
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 121a55c87c02..0949dd47be59 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1010,20 +1010,6 @@ EXC_COMMON_BEGIN(system_reset_common)
ld  r1,PACA_NMI_EMERG_SP(r13)
subir1,r1,INT_FRAME_SIZE
__GEN_COMMON_BODY system_reset
-   /*
-* Set IRQS_ALL_DISABLED unconditionally so irqs_disabled() does
-* the right thing. We do not want to reconcile because that goes
-* through irq tracing which we don't want in NMI.
-*
-* Save PACAIRQHAPPENED to RESULT (otherwise unused), and set HARD_DIS
-* as we are running with MSR[EE]=0.
-*/
-   li  r10,IRQS_ALL_DISABLED
-   stb r10,PACAIRQSOFTMASK(r13)
-   lbz r10,PACAIRQHAPPENED(r13)
-   std r10,RESULT(r1)
-   ori r10,r10,PACA_IRQ_HARD_DIS
-   stb r10,PACAIRQHAPPENED(r13)
 
addir3,r1,STACK_FRAME_OVERHEAD
bl  system_reset_exception
@@ -1039,14 +1025,6 @@ EXC_COMMON_BEGIN(system_reset_common)
subir10,r10,1
sth r10,PACA_IN_NMI(r13)
 
-   /*
-* Restore soft mask settings.
-*/
-   ld  r10,RESULT(r1)
-   stb r10,PACAIRQHAPPENED(r13)
-   ld  r10,SOFTE(r1)
-   stb r10,PACAIRQSOFTMASK(r13)
-
kuap_restore_amr r9, r10
EXCEPTION_RESTORE_REGS
RFI_TO_USER_OR_KERNEL
@@ -1192,30 +1170,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
li  r10,MSR_RI
mtmsrd  r10,1
 
-   /*
-* Set IRQS_ALL_DISABLED and save PACAIRQHAPPENED (see
-* system_reset_common)
-*/
-   li  r10,IRQS_ALL_DISABLED
-   stb r10,PACAIRQSOFTMASK(r13)
-   lbz r10,PACAIRQHAPPENED(r13)
-   std r10,RESULT(r1)
-   ori r10,r10,PACA_IRQ_HARD_DIS
-   stb r10,PACAIRQHAPPENED(r13)
-
addir3,r1,STACK_FRAME_OVERHEAD
bl  machine_check_early
std r3,RESULT(r1)   /* Save result */
ld  r12,_MSR(r1)
 
-   /*
-* Restore soft mask settings.
-*/
-   ld  r10,RESULT(r1)
-   stb r10,PACAIRQHAPPENED(r13)
-   ld  r10,SOFTE(r1)
-   stb r10,PACAIRQSOFTMASK(r13)
-
 #ifdef CONFIG_PPC_P7_NAP
/*
 * Check if thread was in power saving mode. We come here when any
@@ -2814,17 +2773,6 @@ EXC_COMMON_BEGIN(soft_nmi_common)
subir1,r1,INT_FRAME_SIZE
__GEN_COMMON_BODY soft_nmi
 
-   /*
-* Set IRQS_ALL_DISABLED and save PACAIRQHAPPENED (see
-* system_reset_common)
-*/
-   li  r10,IRQS_ALL_DISABLED
-   stb r10,PACAIRQSOFTMASK(r13)
-   lbz r10,PACAIRQHAPPENED(r13)
-   std r10,RESULT(r1)
-   ori r10,r10,PACA_IRQ_HARD_D

[RFC PATCH 09/12] powerpc: move NMI entry/exit code into wrapper

2020-09-05 Thread Nicholas Piggin
This moves the common NMI entry and exit code into the interrupt handler
wrappers.

This changes the behaviour of soft-NMI (watchdog) and HMI interrupts, and
also MCE interrupts on 64e, by adding missing parts of the NMI entry to
them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h | 26 +++
 arch/powerpc/kernel/mce.c| 12 -
 arch/powerpc/kernel/traps.c  | 38 +---
 arch/powerpc/kernel/watchdog.c   | 10 +++-
 4 files changed, 37 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 83fe1d64cf23..69eb8a432984 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -31,6 +31,27 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs)
 }
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
+struct interrupt_nmi_state {
+#ifdef CONFIG_PPC64
+   u8 ftrace_enabled;
+#endif
+};
+
+static inline void interrupt_nmi_enter_prepare(struct pt_regs *regs, struct 
interrupt_nmi_state *state)
+{
+   this_cpu_set_ftrace_enabled(0);
+
+   nmi_enter();
+}
+
+static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, struct 
interrupt_nmi_state *state)
+{
+   nmi_exit();
+
+   this_cpu_set_ftrace_enabled(state->ftrace_enabled);
+}
+
+
 /**
  * DECLARE_INTERRUPT_HANDLER_RAW - Declare raw interrupt handler function
  * @func:  Function name of the entry point
@@ -177,10 +198,15 @@ static __always_inline long ___##func(struct pt_regs 
*regs);  \
\
 __visible noinstr long func(struct pt_regs *regs)  \
 {  \
+   struct interrupt_nmi_state state;   \
long ret;   \
\
+   interrupt_nmi_enter_prepare(regs, &state);  \
+   \
ret = ___##func (regs); \
\
+   interrupt_nmi_exit_prepare(regs, &state);   \
+   \
return ret; \
 }  \
\
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index d0bbcc4fe13c..9f39deed4fca 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -592,13 +592,6 @@ EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 DEFINE_INTERRUPT_HANDLER_NMI(machine_check_early)
 {
long handled = 0;
-   bool nested = in_nmi();
-   u8 ftrace_enabled = this_cpu_get_ftrace_enabled();
-
-   this_cpu_set_ftrace_enabled(0);
-
-   if (!nested)
-   nmi_enter();
 
hv_nmi_check_nonrecoverable(regs);
 
@@ -608,11 +601,6 @@ DEFINE_INTERRUPT_HANDLER_NMI(machine_check_early)
if (ppc_md.machine_check_early)
handled = ppc_md.machine_check_early(regs);
 
-   if (!nested)
-   nmi_exit();
-
-   this_cpu_set_ftrace_enabled(ftrace_enabled);
-
return handled;
 }
 
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 3784578db630..01ddbe5ed5a4 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -443,11 +443,6 @@ DEFINE_INTERRUPT_HANDLER_NMI(system_reset_exception)
 {
unsigned long hsrr0, hsrr1;
bool saved_hsrrs = false;
-   u8 ftrace_enabled = this_cpu_get_ftrace_enabled();
-
-   this_cpu_set_ftrace_enabled(0);
-
-   nmi_enter();
 
/*
 * System reset can interrupt code where HSRRs are live and MSR[RI]=1.
@@ -519,10 +514,6 @@ DEFINE_INTERRUPT_HANDLER_NMI(system_reset_exception)
mtspr(SPRN_HSRR1, hsrr1);
}
 
-   nmi_exit();
-
-   this_cpu_set_ftrace_enabled(ftrace_enabled);
-
/* What should we do here? We could issue a shutdown or hard reset. */
 
return 0;
@@ -828,6 +819,12 @@ int machine_check_generic(struct pt_regs *regs)
 #endif /* everything else */
 
 
+/*
+ * BOOK3S_64 does not call this handler as a non-maskable interrupt
+ * (it uses its own early real-mode handler to handle the MCE proper
+ * and then raises irq_work to call this handler when interrupts are
+ * enabled).
+ */
 #ifdef CONFIG_PPC_BOOK3S_64
 DEFINE_INTERRUPT_HANDLER_ASYNC(machine_check_exception)
 #else
@@ -836,20 +833,6 @@ DEFINE_INTERRUPT_HANDLER_NMI(machine_check_exception)
 {
int recover = 0;
 
-   /*
-* BOOK3S_64 does not c

[RFC PATCH 08/12] powerpc/64: entry cpu time accounting in C

2020-09-05 Thread Nicholas Piggin
There is no need for this to be in asm, use the new interrupt entry wrapper.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h |  4 
 arch/powerpc/include/asm/ppc_asm.h   | 24 
 arch/powerpc/kernel/exceptions-64e.S |  1 -
 arch/powerpc/kernel/exceptions-64s.S |  5 -
 4 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 511b3304722b..83fe1d64cf23 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef CONFIG_PPC_BOOK3S_64
@@ -16,6 +17,9 @@ static inline void interrupt_enter_prepare(struct pt_regs 
*regs)
if (user_mode(regs)) {
CT_WARN_ON(ct_state() == CONTEXT_KERNEL);
user_exit_irqoff();
+
+   account_cpu_user_entry();
+   account_stolen_time();
} else {
CT_WARN_ON(ct_state() == CONTEXT_USER);
}
diff --git a/arch/powerpc/include/asm/ppc_asm.h 
b/arch/powerpc/include/asm/ppc_asm.h
index b4cc6608131c..a363c9220ce3 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -25,7 +25,6 @@
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #define ACCOUNT_CPU_USER_ENTRY(ptr, ra, rb)
 #define ACCOUNT_CPU_USER_EXIT(ptr, ra, rb)
-#define ACCOUNT_STOLEN_TIME
 #else
 #define ACCOUNT_CPU_USER_ENTRY(ptr, ra, rb)\
MFTB(ra);   /* get timebase */  \
@@ -44,29 +43,6 @@
PPC_LL  ra, ACCOUNT_SYSTEM_TIME(ptr);   \
add ra,ra,rb;   /* add on to system time */ \
PPC_STL ra, ACCOUNT_SYSTEM_TIME(ptr)
-
-#ifdef CONFIG_PPC_SPLPAR
-#define ACCOUNT_STOLEN_TIME\
-BEGIN_FW_FTR_SECTION;  \
-   beq 33f;\
-   /* from user - see if there are any DTL entries to process */   \
-   ld  r10,PACALPPACAPTR(r13); /* get ptr to VPA */\
-   ld  r11,PACA_DTL_RIDX(r13); /* get log read index */\
-   addir10,r10,LPPACA_DTLIDX;  \
-   LDX_BE  r10,0,r10;  /* get log write index */   \
-   cmpdcr1,r11,r10;\
-   beq+cr1,33f;\
-   bl  accumulate_stolen_time; \
-   ld  r12,_MSR(r1);   \
-   andi.   r10,r12,MSR_PR; /* Restore cr0 (coming from user) */ \
-33:\
-END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
-
-#else  /* CONFIG_PPC_SPLPAR */
-#define ACCOUNT_STOLEN_TIME
-
-#endif /* CONFIG_PPC_SPLPAR */
-
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 /*
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 5988d61783b5..7a9db1a30e82 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -398,7 +398,6 @@ exc_##n##_common:   
\
std r10,_NIP(r1);   /* save SRR0 to stackframe */   \
std r11,_MSR(r1);   /* save SRR1 to stackframe */   \
beq 2f; /* if from kernel mode */   \
-   ACCOUNT_CPU_USER_ENTRY(r13,r10,r11);/* accounting (uses cr0+eq) */  \
 2: ld  r3,excf+EX_R10(r13);/* get back r10 */  \
ld  r4,excf+EX_R11(r13);/* get back r11 */  \
mfspr   r5,scratch; /* get back r13 */  \
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index b36247ad1f64..121a55c87c02 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -577,7 +577,6 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real)
kuap_save_amr_and_lock r9, r10, cr1, cr0
.endif
beq 101f/* if from kernel mode  */
-   ACCOUNT_CPU_USER_ENTRY(r13, r9, r10)
 BEGIN_FTR_SECTION
ld  r9,IAREA+EX_PPR(r13)/* Read PPR from paca   */
std r9,_PPR(r1)
@@ -645,10 +644,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
ld  r11,exception_marker@toc(r2)
std r10,RESULT(r1)  /* clear regs->result   */
std r11,STACK_FRAME_OVERHEAD-16(r1) /* mark the frame   */
-
-   .if ISTACK
-   ACCOUNT_STOLEN_TIME
-   .endif
 .endm
 
 /*
-- 
2.23.0



[RFC PATCH 07/12] powerpc/64: move account_stolen_time into its own function

2020-09-05 Thread Nicholas Piggin
This will be used by interrupt entry as well.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cputime.h | 15 +++
 arch/powerpc/kernel/syscall_64.c   | 10 +-
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index ed75d1c318e3..3f61604e1fcf 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -87,6 +87,18 @@ static notrace inline void account_cpu_user_exit(void)
acct->starttime_user = tb;
 }
 
+static notrace inline void account_stolen_time(void)
+{
+#ifdef CONFIG_PPC_SPLPAR
+   if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) &&
+   firmware_has_feature(FW_FEATURE_SPLPAR)) {
+   struct lppaca *lp = local_paca->lppaca_ptr;
+
+   if (unlikely(local_paca->dtl_ridx != be64_to_cpu(lp->dtl_idx)))
+   accumulate_stolen_time();
+   }
+#endif
+}
 
 #endif /* __KERNEL__ */
 #else /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
@@ -96,5 +108,8 @@ static inline void account_cpu_user_entry(void)
 static inline void account_cpu_user_exit(void)
 {
 }
+static notrace inline void account_stolen_time(void)
+{
+}
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 #endif /* __POWERPC_CPUTIME_H */
diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
index 58eec1c7fdb8..27595aee5777 100644
--- a/arch/powerpc/kernel/syscall_64.c
+++ b/arch/powerpc/kernel/syscall_64.c
@@ -44,15 +44,7 @@ notrace long system_call_exception(long r3, long r4, long r5,
 
account_cpu_user_entry();
 
-#ifdef CONFIG_PPC_SPLPAR
-   if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) &&
-   firmware_has_feature(FW_FEATURE_SPLPAR)) {
-   struct lppaca *lp = local_paca->lppaca_ptr;
-
-   if (unlikely(local_paca->dtl_ridx != be64_to_cpu(lp->dtl_idx)))
-   accumulate_stolen_time();
-   }
-#endif
+   account_stolen_time();
 
/*
 * This is not required for the syscall exit path, but makes the
-- 
2.23.0



[RFC PATCH 06/12] powerpc/64s: reconcile interrupts in C

2020-09-05 Thread Nicholas Piggin
There is no need for this to be in asm; use the new interrupt entry wrapper.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h |  4 
 arch/powerpc/kernel/exceptions-64s.S | 26 --
 2 files changed, 4 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 98acfbb2df04..511b3304722b 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -9,6 +9,10 @@
 #ifdef CONFIG_PPC_BOOK3S_64
 static inline void interrupt_enter_prepare(struct pt_regs *regs)
 {
+   if (irq_soft_mask_set_return(IRQS_ALL_DISABLED) == IRQS_ENABLED)
+   trace_hardirqs_off();
+   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
+
if (user_mode(regs)) {
CT_WARN_ON(ct_state() == CONTEXT_KERNEL);
user_exit_irqoff();
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index f6989321136d..b36247ad1f64 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -139,7 +139,6 @@ name:
 #define IKVM_VIRT  .L_IKVM_VIRT_\name\()   /* Virt entry tests KVM */
 #define ISTACK .L_ISTACK_\name\()  /* Set regular kernel stack */
 #define __ISTACK(name) .L_ISTACK_ ## name
-#define IRECONCILE .L_IRECONCILE_\name\()  /* Do RECONCILE_IRQ_STATE */
 #define IKUAP  .L_IKUAP_\name\()   /* Do KUAP lock */
 
 #define INT_DEFINE_BEGIN(n)\
@@ -203,9 +202,6 @@ do_define_int n
.ifndef ISTACK
ISTACK=1
.endif
-   .ifndef IRECONCILE
-   IRECONCILE=1
-   .endif
.ifndef IKUAP
IKUAP=1
.endif
@@ -653,10 +649,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
.if ISTACK
ACCOUNT_STOLEN_TIME
.endif
-
-   .if IRECONCILE
-   RECONCILE_IRQ_STATE(r10, r11)
-   .endif
 .endm
 
 /*
@@ -935,7 +927,6 @@ INT_DEFINE_BEGIN(system_reset)
 */
ISET_RI=0
ISTACK=0
-   IRECONCILE=0
IKVM_REAL=1
 INT_DEFINE_END(system_reset)
 
@@ -1125,7 +1116,6 @@ INT_DEFINE_BEGIN(machine_check_early)
ISTACK=0
IDAR=1
IDSISR=1
-   IRECONCILE=0
IKUAP=0 /* We don't touch AMR here, we never go to virtual mode */
 INT_DEFINE_END(machine_check_early)
 
@@ -1473,7 +1463,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
 INT_DEFINE_BEGIN(data_access_slb)
IVEC=0x380
IAREA=PACA_EXSLB
-   IRECONCILE=0
IDAR=1
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_SKIP=1
@@ -1502,7 +1491,6 @@ MMU_FTR_SECTION_ELSE
li  r3,-EFAULT
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
std r3,RESULT(r1)
-   RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_bad_slb_fault
b   interrupt_return
@@ -1564,7 +1552,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
 INT_DEFINE_BEGIN(instruction_access_slb)
IVEC=0x480
IAREA=PACA_EXSLB
-   IRECONCILE=0
IISIDE=1
IDAR=1
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
@@ -1593,7 +1580,6 @@ MMU_FTR_SECTION_ELSE
li  r3,-EFAULT
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
std r3,RESULT(r1)
-   RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_bad_slb_fault
b   interrupt_return
@@ -1753,7 +1739,6 @@ EXC_COMMON_BEGIN(program_check_common)
  */
 INT_DEFINE_BEGIN(fp_unavailable)
IVEC=0x800
-   IRECONCILE=0
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
 #endif
@@ -1768,7 +1753,6 @@ EXC_VIRT_END(fp_unavailable, 0x4800, 0x100)
 EXC_COMMON_BEGIN(fp_unavailable_common)
GEN_COMMON fp_unavailable
bne 1f  /* if from user, just load it up */
-   RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
bl  kernel_fp_unavailable_exception
 0: trap
@@ -1787,7 +1771,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
b   fast_interrupt_return
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 2: /* User process was in a transaction */
-   RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
bl  fp_unavailable_tm
b   interrupt_return
@@ -1852,7 +1835,6 @@ INT_DEFINE_BEGIN(hdecrementer)
IVEC=0x980
IHSRR=1
ISTACK=0
-   IRECONCILE=0
IKVM_REAL=1
IKVM_VIRT=1
 INT_DEFINE_END(hdecrementer)
@@ -2226,7 +2208,6 @@ INT_DEFINE_BEGIN(hmi_exception_early)
IHSRR=1
IREALMODE_COMMON=1
ISTACK=0
-   IRECONCILE=0
IKUAP=0 /* We don't touch AMR here, we never go to virtual mode */
IKVM_REAL=1
 INT_DEFINE_END(hmi_exception_early)
@@ -2400,7 +2381,6 @@ EXC_COMMON_BEGIN(performance_monitor_common)
  */
 INT_DEFINE_BEGIN(altivec_unavailable)
I

[RFC PATCH 05/12] powerpc/64s: Do context tracking in interrupt entry wrapper

2020-09-05 Thread Nicholas Piggin
Context tracking is currently very broken. Add a helper function that is
called first thing by any normal interrupt handler function to track user
exits; user entry is done by the interrupt exit prepare functions.

Context tracking is disabled on 64e for now; it must move to interrupt exit
in C before it can be enabled there.
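
For illustration only, a simplified standalone model of the scheme: the shared
entry wrapper tracks the user->kernel transition once, instead of each handler
calling user_exit()/user_enter() itself. The types and names below are made up
for the sketch and are not the kernel implementation:

#include <stdio.h>
#include <stdbool.h>
#include <assert.h>

struct pt_regs { bool from_user; };

enum ctx_state { CTX_USER, CTX_KERNEL };
static enum ctx_state cur_state = CTX_USER;

static void user_exit(void)  { cur_state = CTX_KERNEL; }
static void user_enter(void) { cur_state = CTX_USER; }

/* Entry side: called first thing by every (wrapped) handler. */
static void interrupt_enter_prepare(struct pt_regs *regs)
{
        if (regs->from_user) {
                assert(cur_state != CTX_KERNEL);   /* models CT_WARN_ON() */
                user_exit();
        } else {
                assert(cur_state != CTX_USER);
        }
}

/* Exit side: the return-to-user path re-enters user context once. */
static void interrupt_exit_user_prepare(void)
{
        user_enter();
}

static void example_interrupt_handler(struct pt_regs *regs)
{
        interrupt_enter_prepare(regs);
        printf("handler body runs in kernel context (%d)\n", cur_state);
        if (regs->from_user)
                interrupt_exit_user_prepare();
}

int main(void)
{
        struct pt_regs regs = { .from_user = true };
        example_interrupt_handler(&regs);
        return 0;
}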

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig  |  2 +-
 arch/powerpc/include/asm/interrupt.h  | 23 +
 arch/powerpc/kernel/ptrace/ptrace.c   |  4 --
 arch/powerpc/kernel/signal.c  |  4 --
 arch/powerpc/kernel/syscall_64.c  | 14 +
 arch/powerpc/kernel/traps.c   | 74 ++-
 arch/powerpc/mm/book3s64/hash_utils.c |  2 -
 arch/powerpc/mm/fault.c   |  3 --
 8 files changed, 55 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..3da7bbff46a9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -189,7 +189,7 @@ config PPC
select HAVE_CBPF_JITif !PPC64
select HAVE_STACKPROTECTOR  if PPC64 && 
$(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
select HAVE_STACKPROTECTOR  if PPC32 && 
$(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
-   select HAVE_CONTEXT_TRACKINGif PPC64
+   select HAVE_CONTEXT_TRACKINGif PPC_BOOK3S_64
select HAVE_TIF_NOHZif PPC64
select HAVE_DEBUG_KMEMLEAK
select HAVE_DEBUG_STACKOVERFLOW
diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 7231949fc1c8..98acfbb2df04 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -6,6 +6,23 @@
 #include 
 #include 
 
+#ifdef CONFIG_PPC_BOOK3S_64
+static inline void interrupt_enter_prepare(struct pt_regs *regs)
+{
+   if (user_mode(regs)) {
+   CT_WARN_ON(ct_state() == CONTEXT_KERNEL);
+   user_exit_irqoff();
+   } else {
+   CT_WARN_ON(ct_state() == CONTEXT_USER);
+   }
+}
+
+#else /* CONFIG_PPC_BOOK3S_64 */
+static inline void interrupt_enter_prepare(struct pt_regs *regs)
+{
+}
+#endif /* CONFIG_PPC_BOOK3S_64 */
+
 /**
  * DECLARE_INTERRUPT_HANDLER_RAW - Declare raw interrupt handler function
  * @func:  Function name of the entry point
@@ -60,6 +77,8 @@ static __always_inline void ___##func(struct pt_regs *regs);  
\
\
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
+   interrupt_enter_prepare(regs);  \
+   \
___##func (regs);   \
 }  \
\
@@ -90,6 +109,8 @@ __visible noinstr long func(struct pt_regs *regs)
\
 {  \
long ret;   \
\
+   interrupt_enter_prepare(regs);  \
+   \
ret = ___##func (regs); \
\
return ret; \
@@ -118,6 +139,8 @@ static __always_inline void ___##func(struct pt_regs 
*regs);\
\
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
+   interrupt_enter_prepare(regs);  \
+   \
___##func (regs);   \
 }  \
\
diff --git a/arch/powerpc/kernel/ptrace/ptrace.c 
b/arch/powerpc/kernel/ptrace/ptrace.c
index f6e51be47c6e..8970400e521c 100644
--- a/arch/powerpc/kernel/ptrace/ptrace.c
+++ b/arch/powerpc/kernel/ptrace/ptrace.c
@@ -290,8 +290,6 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 {
u32 flags;
 
-   user_exit();
-
flags = READ_ONCE(current_thread_info()->flags) &
(_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE);
 
@@ -368,8 +366,6 @@ void do_syscall_trace_lea

[RFC PATCH 04/12] powerpc: add interrupt_cond_local_irq_enable helper

2020-09-05 Thread Nicholas Piggin
Add a simple helper for synchronous interrupt handlers to enable interrupts
if the interrupt was taken in an interrupt-enabled context.
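
For illustration only, a trivial standalone model of the pattern being factored
out; the pt_regs field and helper bodies below are stand-ins, not the real
soft-mask state:

#include <stdio.h>
#include <stdbool.h>

/* Made-up stand-in for the saved irq-disabled state in pt_regs. */
struct pt_regs { bool softe_disabled; };

static void local_irq_enable(void)
{
        printf("interrupts enabled\n");
}

static bool arch_irq_disabled_regs(struct pt_regs *regs)
{
        return regs->softe_disabled;
}

/*
 * Many handlers open-coded:
 *      if (!arch_irq_disabled_regs(regs))
 *              local_irq_enable();
 * The helper gives that pattern one name:
 */
static void interrupt_cond_local_irq_enable(struct pt_regs *regs)
{
        if (!arch_irq_disabled_regs(regs))
                local_irq_enable();
}

int main(void)
{
        struct pt_regs regs = { .softe_disabled = false };

        interrupt_cond_local_irq_enable(&regs); /* interrupted context had irqs on */
        return 0;
}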

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/interrupt.h |  7 +++
 arch/powerpc/kernel/traps.c  | 24 +++-
 arch/powerpc/mm/fault.c  |  4 +---
 3 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 7c7e58541171..7231949fc1c8 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -3,6 +3,7 @@
 #define _ASM_POWERPC_INTERRUPT_H
 
 #include 
+#include 
 #include 
 
 /**
@@ -198,4 +199,10 @@ DECLARE_INTERRUPT_HANDLER_ASYNC(unknown_async_exception);
 void replay_system_reset(void);
 void replay_soft_interrupts(void);
 
+static inline void interrupt_cond_local_irq_enable(struct pt_regs *regs)
+{
+   if (!arch_irq_disabled_regs(regs))
+   local_irq_enable();
+}
+
 #endif /* _ASM_POWERPC_INTERRUPT_H */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 96fa2d7e088c..a850647b7062 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -343,8 +343,8 @@ static bool exception_common(int signr, struct pt_regs 
*regs, int code,
 
show_signal_msg(signr, regs, code, addr);
 
-   if (arch_irqs_disabled() && !arch_irq_disabled_regs(regs))
-   local_irq_enable();
+   if (arch_irqs_disabled())
+   interrupt_cond_local_irq_enable(regs);
 
current->thread.trap_nr = code;
 
@@ -1579,9 +1579,7 @@ DEFINE_INTERRUPT_HANDLER(program_check_exception)
if (!user_mode(regs))
goto sigill;
 
-   /* We restore the interrupt state now */
-   if (!arch_irq_disabled_regs(regs))
-   local_irq_enable();
+   interrupt_cond_local_irq_enable(regs);
 
/* (reason & REASON_ILLEGAL) would be the obvious thing here,
 * but there seems to be a hardware bug on the 405GP (RevD)
@@ -1635,9 +1633,7 @@ DEFINE_INTERRUPT_HANDLER(alignment_exception)
int sig, code, fixed = 0;
unsigned long  reason;
 
-   /* We restore the interrupt state now */
-   if (!arch_irq_disabled_regs(regs))
-   local_irq_enable();
+   interrupt_cond_local_irq_enable(regs);
 
reason = get_reason(regs);
 
@@ -1798,9 +1794,7 @@ DEFINE_INTERRUPT_HANDLER(facility_unavailable_exception)
die("Unexpected facility unavailable exception", regs, SIGABRT);
}
 
-   /* We restore the interrupt state now */
-   if (!arch_irq_disabled_regs(regs))
-   local_irq_enable();
+   interrupt_cond_local_irq_enable(regs);
 
if (status == FSCR_DSCR_LG) {
/*
@@ -2145,9 +2139,7 @@ void SPEFloatingPointException(struct pt_regs *regs)
int code = FPE_FLTUNK;
int err;
 
-   /* We restore the interrupt state now */
-   if (!arch_irq_disabled_regs(regs))
-   local_irq_enable();
+   interrupt_cond_local_irq_enable(regs);
 
flush_spe_to_thread(current);
 
@@ -2194,9 +2186,7 @@ void SPEFloatingPointRoundException(struct pt_regs *regs)
extern int speround_handler(struct pt_regs *regs);
int err;
 
-   /* We restore the interrupt state now */
-   if (!arch_irq_disabled_regs(regs))
-   local_irq_enable();
+   interrupt_cond_local_irq_enable(regs);
 
preempt_disable();
if (regs->msr & MSR_SPE)
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 7d63b5512068..fd8e28944293 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -441,9 +441,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
return bad_area_nosemaphore(regs, address);
}
 
-   /* We restore the interrupt state now */
-   if (!arch_irq_disabled_regs(regs))
-   local_irq_enable();
+   interrupt_cond_local_irq_enable(regs);
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
-- 
2.23.0



[RFC PATCH 03/12] powerpc: interrupt handler wrapper functions

2020-09-05 Thread Nicholas Piggin
Add wrapper functions (derived from x86 macros) for interrupt handler
functions. This allows interrupt entry code to be written in C.
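
For illustration only, a simplified standalone model of the wrapper-macro
pattern; this is not the kernel macro set, and the printf calls merely stand in
for the common entry/exit work the real wrappers will grow:

#include <stdio.h>

struct pt_regs { unsigned long trap; };

/*
 * Model of the scheme: the visible entry point func() is what low-level
 * entry code calls; it runs common entry/exit work around a static inner
 * function ___func() that holds the handler body.
 */
#define DEFINE_INTERRUPT_HANDLER(func)                                  \
static void ___##func(struct pt_regs *regs);                            \
                                                                        \
void func(struct pt_regs *regs)                                         \
{                                                                       \
        /* common entry work goes here (context tracking, irq state) */ \
        printf("enter %s, trap=%#lx\n", #func, regs->trap);             \
        ___##func(regs);                                                \
        /* common exit work goes here */                                \
}                                                                       \
                                                                        \
static void ___##func(struct pt_regs *regs)

/* The handler body is then written as if it were a plain function. */
DEFINE_INTERRUPT_HANDLER(example_exception)
{
        printf("handling trap %#lx\n", regs->trap);
}

int main(void)
{
        struct pt_regs regs = { .trap = 0x700 };
        example_exception(&regs);
        return 0;
}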

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/asm-prototypes.h |  28 ---
 arch/powerpc/include/asm/bug.h|   1 -
 arch/powerpc/include/asm/hw_irq.h |   9 -
 arch/powerpc/include/asm/interrupt.h  | 201 ++
 arch/powerpc/include/asm/time.h   |   2 +
 arch/powerpc/kernel/dbell.c   |   3 +-
 arch/powerpc/kernel/exceptions-64s.S  |   8 +-
 arch/powerpc/kernel/irq.c |   3 +-
 arch/powerpc/kernel/mce.c |   5 +-
 arch/powerpc/kernel/tau_6xx.c |   2 +-
 arch/powerpc/kernel/time.c|   3 +-
 arch/powerpc/kernel/traps.c   |  78 ++---
 arch/powerpc/kernel/watchdog.c|   7 +-
 arch/powerpc/kvm/book3s_hv_builtin.c  |   1 +
 arch/powerpc/mm/book3s64/hash_utils.c |   3 +-
 arch/powerpc/mm/book3s64/slb.c|   5 +-
 arch/powerpc/mm/fault.c   |  10 +-
 arch/powerpc/platforms/powernv/idle.c |   1 +
 18 files changed, 290 insertions(+), 80 deletions(-)
 create mode 100644 arch/powerpc/include/asm/interrupt.h

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index fffac9de2922..de4dad05e272 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -56,34 +56,6 @@ int exit_vmx_usercopy(void);
 int enter_vmx_ops(void);
 void *exit_vmx_ops(void *dest);
 
-/* Traps */
-long machine_check_early(struct pt_regs *regs);
-long hmi_exception_realmode(struct pt_regs *regs);
-void SMIException(struct pt_regs *regs);
-void handle_hmi_exception(struct pt_regs *regs);
-void instruction_breakpoint_exception(struct pt_regs *regs);
-void RunModeException(struct pt_regs *regs);
-void single_step_exception(struct pt_regs *regs);
-void program_check_exception(struct pt_regs *regs);
-void alignment_exception(struct pt_regs *regs);
-void StackOverflow(struct pt_regs *regs);
-void kernel_fp_unavailable_exception(struct pt_regs *regs);
-void altivec_unavailable_exception(struct pt_regs *regs);
-void vsx_unavailable_exception(struct pt_regs *regs);
-void fp_unavailable_tm(struct pt_regs *regs);
-void altivec_unavailable_tm(struct pt_regs *regs);
-void vsx_unavailable_tm(struct pt_regs *regs);
-void facility_unavailable_exception(struct pt_regs *regs);
-void TAUException(struct pt_regs *regs);
-void altivec_assist_exception(struct pt_regs *regs);
-void unrecoverable_exception(struct pt_regs *regs);
-void kernel_bad_stack(struct pt_regs *regs);
-void system_reset_exception(struct pt_regs *regs);
-void machine_check_exception(struct pt_regs *regs);
-void emulation_assist_interrupt(struct pt_regs *regs);
-long do_slb_fault(struct pt_regs *regs);
-void do_bad_slb_fault(struct pt_regs *regs);
-
 /* signals, syscalls and interrupts */
 long sys_swapcontext(struct ucontext __user *old_ctx,
struct ucontext __user *new_ctx,
diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 2fa0cf6c6011..7b89cdbbb789 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -111,7 +111,6 @@
 #ifndef __ASSEMBLY__
 
 struct pt_regs;
-extern long do_page_fault(struct pt_regs *);
 extern long hash__do_page_fault(struct pt_regs *);
 extern void bad_page_fault(struct pt_regs *, unsigned long, int);
 extern void _exception(int, struct pt_regs *, int, unsigned long);
diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index 3a0db7b0b46e..19420e48bcd5 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -51,15 +51,6 @@
 
 #ifndef __ASSEMBLY__
 
-extern void replay_system_reset(void);
-extern void replay_soft_interrupts(void);
-
-extern void timer_interrupt(struct pt_regs *);
-extern void timer_broadcast_interrupt(void);
-extern void performance_monitor_exception(struct pt_regs *regs);
-extern void WatchdogException(struct pt_regs *regs);
-extern void unknown_exception(struct pt_regs *regs);
-
 #ifdef CONFIG_PPC64
 #include 
 
diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
new file mode 100644
index ..7c7e58541171
--- /dev/null
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -0,0 +1,201 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _ASM_POWERPC_INTERRUPT_H
+#define _ASM_POWERPC_INTERRUPT_H
+
+#include 
+#include 
+
+/**
+ * DECLARE_INTERRUPT_HANDLER_RAW - Declare raw interrupt handler function
+ * @func:  Function name of the entry point
+ * @returns:   Returns a value back to asm caller
+ */
+#define DECLARE_INTERRUPT_HANDLER_RAW(func)\
+   __visible long func(struct pt_regs *regs)
+
+/**
+ * DEFINE_INTERRUPT_HANDLER_RAW - Define raw interrupt handler function
+ * @func:  Function name of the entry point

[RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions

2020-09-05 Thread Nicholas Piggin
Make interrupt handlers all just take the pt_regs * argument and load
DAR/DSISR etc from that. Make those that return a value return long.

This is done to make the function signatures match more closely, which
will help with a future patch to add wrappers. Explicit arguments could
be re-added for performance in future but that would require more
complex wrapper macros.
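
For illustration only, a minimal standalone model of the signature change: the
caller no longer passes DAR/DSISR as separate arguments, and the handler loads
what it needs from the saved register frame. Field and function names below are
illustrative:

#include <stdio.h>

/* Simplified stand-in for the saved register frame. */
struct pt_regs {
        unsigned long dar;      /* faulting address */
        unsigned long dsisr;    /* fault status */
};

/* Before: long do_page_fault(struct pt_regs *regs, unsigned long ea, unsigned long dsisr); */
/* After: a single pt_regs argument; the handler reads the rest itself.    */
static long do_page_fault(struct pt_regs *regs)
{
        unsigned long ea = regs->dar;
        unsigned long dsisr = regs->dsisr;

        printf("page fault at %#lx, dsisr=%#lx\n", ea, dsisr);
        return 0;
}

int main(void)
{
        struct pt_regs regs = { .dar = 0xdeadbeef, .dsisr = 0x40000000 };

        return (int)do_page_fault(&regs);
}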

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/asm-prototypes.h |  4 ++--
 arch/powerpc/include/asm/bug.h|  4 ++--
 arch/powerpc/kernel/exceptions-64e.S  |  2 --
 arch/powerpc/kernel/exceptions-64s.S  | 14 ++
 arch/powerpc/mm/book3s64/hash_utils.c |  8 +---
 arch/powerpc/mm/book3s64/slb.c| 11 +++
 arch/powerpc/mm/fault.c   | 16 +---
 7 files changed, 27 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index de14b1a34d56..fffac9de2922 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -81,8 +81,8 @@ void kernel_bad_stack(struct pt_regs *regs);
 void system_reset_exception(struct pt_regs *regs);
 void machine_check_exception(struct pt_regs *regs);
 void emulation_assist_interrupt(struct pt_regs *regs);
-long do_slb_fault(struct pt_regs *regs, unsigned long ea);
-void do_bad_slb_fault(struct pt_regs *regs, unsigned long ea, long err);
+long do_slb_fault(struct pt_regs *regs);
+void do_bad_slb_fault(struct pt_regs *regs);
 
 /* signals, syscalls and interrupts */
 long sys_swapcontext(struct ucontext __user *old_ctx,
diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index d714d83bbc7c..2fa0cf6c6011 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -111,8 +111,8 @@
 #ifndef __ASSEMBLY__
 
 struct pt_regs;
-extern int do_page_fault(struct pt_regs *, unsigned long, unsigned long);
-extern int hash__do_page_fault(struct pt_regs *, unsigned long, unsigned long);
+extern long do_page_fault(struct pt_regs *);
+extern long hash__do_page_fault(struct pt_regs *);
 extern void bad_page_fault(struct pt_regs *, unsigned long, int);
 extern void _exception(int, struct pt_regs *, int, unsigned long);
 extern void _exception_pkey(struct pt_regs *, unsigned long, int);
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index d9ed79415100..5988d61783b5 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1012,8 +1012,6 @@ storage_fault_common:
std r14,_DAR(r1)
std r15,_DSISR(r1)
addir3,r1,STACK_FRAME_OVERHEAD
-   mr  r4,r14
-   mr  r5,r15
ld  r14,PACA_EXGEN+EX_R14(r13)
ld  r15,PACA_EXGEN+EX_R15(r13)
bl  do_page_fault
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index f830b893fe03..1f34cfd1887c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1437,8 +1437,6 @@ EXC_VIRT_BEGIN(data_access, 0x4300, 0x80)
 EXC_VIRT_END(data_access, 0x4300, 0x80)
 EXC_COMMON_BEGIN(data_access_common)
GEN_COMMON data_access
-   ld  r4,_DAR(r1)
-   ld  r5,_DSISR(r1)
addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
bl  do_hash_fault
@@ -1491,10 +1489,9 @@ EXC_VIRT_BEGIN(data_access_slb, 0x4380, 0x80)
 EXC_VIRT_END(data_access_slb, 0x4380, 0x80)
 EXC_COMMON_BEGIN(data_access_slb_common)
GEN_COMMON data_access_slb
-   ld  r4,_DAR(r1)
-   addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
/* HPT case, do SLB fault */
+   addir3,r1,STACK_FRAME_OVERHEAD
bl  do_slb_fault
cmpdi   r3,0
bne-1f
@@ -1506,8 +1503,6 @@ MMU_FTR_SECTION_ELSE
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
std r3,RESULT(r1)
RECONCILE_IRQ_STATE(r10, r11)
-   ld  r4,_DAR(r1)
-   ld  r5,RESULT(r1)
addir3,r1,STACK_FRAME_OVERHEAD
bl  do_bad_slb_fault
b   interrupt_return
@@ -1542,8 +1537,6 @@ EXC_VIRT_BEGIN(instruction_access, 0x4400, 0x80)
 EXC_VIRT_END(instruction_access, 0x4400, 0x80)
 EXC_COMMON_BEGIN(instruction_access_common)
GEN_COMMON instruction_access
-   ld  r4,_DAR(r1)
-   ld  r5,_DSISR(r1)
addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
bl  do_hash_fault
@@ -1587,10 +1580,9 @@ EXC_VIRT_BEGIN(instruction_access_slb, 0x4480, 0x80)
 EXC_VIRT_END(instruction_access_slb, 0x4480, 0x80)
 EXC_COMMON_BEGIN(instruction_access_slb_common)
GEN_COMMON instruction_access_slb
-   ld  r4,_DAR(r1)
-   addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
/* HPT case, do SLB fault */
+   addir3,r1,STACK_FRAME_OVERHEAD
bl  do_slb_fault
cmpdi   r3,0
bne

[RFC PATCH 01/12] powerpc/64s: move the last of the page fault handling logic to C

2020-09-05 Thread Nicholas Piggin
The page fault handling still has some complex logic, particularly around
hash table handling, in asm. Implement this in C instead.
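
For illustration only, a standalone sketch of the rough shape this logic takes
in C; it is a simplified model with made-up names and constants, not the actual
do_hash_fault:

#include <stdio.h>
#include <stdbool.h>

struct pt_regs { unsigned long dar, dsisr; };

#define DSISR_BAD_FAULT 0x00400000UL    /* illustrative "weird error" bits */

static bool in_nmi_context;             /* stand-in for the NMI_MASK preempt-count check */

static int hash_fill(unsigned long ea)
{
        (void)ea;
        return -1;                      /* pretend no HPTE could be inserted */
}

static long generic_page_fault(struct pt_regs *regs)
{
        printf("falling back to the Linux page fault path for %#lx\n", regs->dar);
        return 0;
}

/* Rough shape of the control flow being moved out of do_hash_page: */
static long do_hash_fault(struct pt_regs *regs)
{
        /* "Weird" errors bypass the hash path entirely. */
        if (regs->dsisr & DSISR_BAD_FAULT)
                return generic_page_fault(regs);

        /*
         * Refuse to touch the hash table from NMI-like context, to avoid
         * re-entrancy with H_PAGE_BUSY (the asm checked NMI_MASK in the
         * preempt count for the same reason).
         */
        if (in_nmi_context)
                return -1;

        if (hash_fill(regs->dar) == 0)
                return 0;               /* HPTE inserted, just retry the access */

        return generic_page_fault(regs);
}

int main(void)
{
        struct pt_regs regs = { .dar = 0x1000, .dsisr = 0 };

        return (int)do_hash_fault(&regs);
}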

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/bug.h|   1 +
 arch/powerpc/kernel/exceptions-64s.S  | 131 +-
 arch/powerpc/mm/book3s64/hash_utils.c |  77 +--
 arch/powerpc/mm/fault.c   |  55 ++-
 4 files changed, 124 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 338f36cd9934..d714d83bbc7c 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -112,6 +112,7 @@
 
 struct pt_regs;
 extern int do_page_fault(struct pt_regs *, unsigned long, unsigned long);
+extern int hash__do_page_fault(struct pt_regs *, unsigned long, unsigned long);
 extern void bad_page_fault(struct pt_regs *, unsigned long, int);
 extern void _exception(int, struct pt_regs *, int, unsigned long);
 extern void _exception_pkey(struct pt_regs *, unsigned long, int);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index f7d748b88705..f830b893fe03 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1403,14 +1403,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
  *
  * Handling:
  * - Hash MMU
- *   Go to do_hash_page first to see if the HPT can be filled from an entry in
- *   the Linux page table. Hash faults can hit in kernel mode in a fairly
+ *   Go to do_hash_fault, which attempts to fill the HPT from an entry in the
+ *   Linux page table. Hash faults can hit in kernel mode in a fairly
  *   arbitrary state (e.g., interrupts disabled, locks held) when accessing
  *   "non-bolted" regions, e.g., vmalloc space. However these should always be
- *   backed by Linux page tables.
+ *   backed by Linux page table entries.
  *
- *   If none is found, do a Linux page fault. Linux page faults can happen in
- *   kernel mode due to user copy operations of course.
+ *   If no entry is found the Linux page fault handler is invoked (by
+ *   do_hash_fault). Linux page faults can happen in kernel mode due to user
+ *   copy operations of course.
  *
  * - Radix MMU
  *   The hardware loads from the Linux page table directly, so a fault goes
@@ -1438,13 +1439,17 @@ EXC_COMMON_BEGIN(data_access_common)
GEN_COMMON data_access
ld  r4,_DAR(r1)
ld  r5,_DSISR(r1)
+   addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
-   ld  r6,_MSR(r1)
-   li  r3,0x300
-   b   do_hash_page/* Try to handle as hpte fault */
+   bl  do_hash_fault
 MMU_FTR_SECTION_ELSE
-   b   handle_page_fault
+   bl  do_page_fault
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
+cmpdi  r3,0
+   beq+interrupt_return
+   /* We need to restore NVGPRS */
+   REST_NVGPRS(r1)
+   b   interrupt_return
 
GEN_KVM data_access
 
@@ -1539,13 +1544,17 @@ EXC_COMMON_BEGIN(instruction_access_common)
GEN_COMMON instruction_access
ld  r4,_DAR(r1)
ld  r5,_DSISR(r1)
+   addir3,r1,STACK_FRAME_OVERHEAD
 BEGIN_MMU_FTR_SECTION
-   ld  r6,_MSR(r1)
-   li  r3,0x400
-   b   do_hash_page/* Try to handle as hpte fault */
+   bl  do_hash_fault
 MMU_FTR_SECTION_ELSE
-   b   handle_page_fault
+   bl  do_page_fault
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
+cmpdi  r3,0
+   beq+interrupt_return
+   /* We need to restore NVGPRS */
+   REST_NVGPRS(r1)
+   b   interrupt_return
 
GEN_KVM instruction_access
 
@@ -3197,99 +3206,3 @@ disable_machine_check:
RFI_TO_KERNEL
 1: mtlrr0
blr
-
-/*
- * Hash table stuff
- */
-   .balign IFETCH_ALIGN_BYTES
-do_hash_page:
-#ifdef CONFIG_PPC_BOOK3S_64
-   lis r0,(DSISR_BAD_FAULT_64S | DSISR_DABRMATCH | DSISR_KEYFAULT)@h
-   ori r0,r0,DSISR_BAD_FAULT_64S@l
-   and.r0,r5,r0/* weird error? */
-   bne-handle_page_fault   /* if not, try to insert a HPTE */
-
-   /*
-* If we are in an "NMI" (e.g., an interrupt when soft-disabled), then
-* don't call hash_page, just fail the fault. This is required to
-* prevent re-entrancy problems in the hash code, namely perf
-* interrupts hitting while something holds H_PAGE_BUSY, and taking a
-* hash fault. See the comment in hash_preload().
-*/
-   ld  r11, PACA_THREAD_INFO(r13)
-   lwz r0,TI_PREEMPT(r11)
-   andis.  r0,r0,NMI_MASK@h
-   bne 77f
-
-   /*
-* r3 contains the trap number
-* r4 contains the faulting address
-* r5 contains dsisr
-* r6 msr
-*
-* at return r3 = 0 for success, 1 for page fault, negative for error
-*/
-   bl  __hash_page 

[RFC PATCH 00/12] interrupt entry wrappers

2020-09-05 Thread Nicholas Piggin
This series moves more stuff to C, and fixes context tracking on
64s.

Nicholas Piggin (12):
  powerpc/64s: move the last of the page fault handling logic to C
  powerpc: remove arguments from interrupt handler functions
  powerpc: interrupt handler wrapper functions
  powerpc: add interrupt_cond_local_irq_enable helper
  powerpc/64s: Do context tracking in interrupt entry wrapper
  powerpc/64s: reconcile interrupts in C
  powerpc/64: move account_stolen_time into its own function
  powerpc/64: entry cpu time accounting in C
  powerpc: move NMI entry/exit code into wrapper
  powerpc/64s: move NMI soft-mask handling to C
  powerpc/64s: runlatch interrupt handling in C
  powerpc/64s: power4 nap fixup in C

 arch/powerpc/Kconfig  |   2 +-
 arch/powerpc/include/asm/asm-prototypes.h |  28 --
 arch/powerpc/include/asm/bug.h|   2 +-
 arch/powerpc/include/asm/cputime.h|  15 +
 arch/powerpc/include/asm/hw_irq.h |   9 -
 arch/powerpc/include/asm/interrupt.h  | 316 ++
 arch/powerpc/include/asm/ppc_asm.h|  24 --
 arch/powerpc/include/asm/processor.h  |   1 +
 arch/powerpc/include/asm/thread_info.h|   6 +
 arch/powerpc/include/asm/time.h   |   2 +
 arch/powerpc/kernel/dbell.c   |   3 +-
 arch/powerpc/kernel/exceptions-64e.S  |   3 -
 arch/powerpc/kernel/exceptions-64s.S  | 307 ++---
 arch/powerpc/kernel/idle_book3s.S |   4 +
 arch/powerpc/kernel/irq.c |   3 +-
 arch/powerpc/kernel/mce.c |  17 +-
 arch/powerpc/kernel/ptrace/ptrace.c   |   4 -
 arch/powerpc/kernel/signal.c  |   4 -
 arch/powerpc/kernel/syscall_64.c  |  24 +-
 arch/powerpc/kernel/tau_6xx.c |   2 +-
 arch/powerpc/kernel/time.c|   3 +-
 arch/powerpc/kernel/traps.c   | 198 ++
 arch/powerpc/kernel/watchdog.c|  15 +-
 arch/powerpc/kvm/book3s_hv_builtin.c  |   1 +
 arch/powerpc/mm/book3s64/hash_utils.c |  82 --
 arch/powerpc/mm/book3s64/slb.c|  12 +-
 arch/powerpc/mm/fault.c   |  74 -
 arch/powerpc/platforms/powernv/idle.c |   1 +
 28 files changed, 608 insertions(+), 554 deletions(-)
 create mode 100644 arch/powerpc/include/asm/interrupt.h

-- 
2.23.0



Re: [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

2020-09-02 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of September 1, 2020 10:00 pm:
> Nicholas Piggin  writes:
>> Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
>> single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
>> a process under certain conditions. One of the assumptions is that
>> mm_users would not be incremented via a reference outside the process
>> context with mmget_not_zero() then go on to kthread_use_mm() via that
>> reference.
>>
>> That invariant was broken by io_uring code (see previous sparc64 fix),
>> but I'll point Fixes: to the original powerpc commit because we are
>> changing that assumption going forward, so this will make backports
>> match up.
>>
>> Fix this by no longer relying on that assumption, but by having each CPU
>> check the mm is not being used, and clearing their own bit from the mask
>> if it's okay. This fix relies on commit 38cf307c1f20 ("mm: fix
>> kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
>> and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.
> 
> You could use:
> 
> Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")

Good idea, I will.

>> Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of 
>> single-threaded mm_cpumask")
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/include/asm/tlb.h   | 13 -
>>  arch/powerpc/mm/book3s64/radix_tlb.c | 23 ---
>>  2 files changed, 16 insertions(+), 20 deletions(-)
> 
> One minor nit below if you're respinning anyway.
> 
> You know this stuff better than me, but I still reviewed it and it seems
> good to me.
> 
> Reviewed-by: Michael Ellerman 

Thanks.

> 
>> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
>> index fbc6f3002f23..d97f061fecac 100644
>> --- a/arch/powerpc/include/asm/tlb.h
>> +++ b/arch/powerpc/include/asm/tlb.h
>> @@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
>>  return false;
>>  return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
>>  }
>> -static inline void mm_reset_thread_local(struct mm_struct *mm)
>> -{
>> -WARN_ON(atomic_read(&mm->context.copros) > 0);
>> -/*
>> - * It's possible for mm_access to take a reference on mm_users to
>> - * access the remote mm from another thread, but it's not allowed
>> - * to set mm_cpumask, so mm_users may be > 1 here.
>> - */
>> -WARN_ON(current->mm != mm);
>> -atomic_set(&mm->context.active_cpus, 1);
>> -cpumask_clear(mm_cpumask(mm));
>> -cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
>> -}
>>  #else /* CONFIG_PPC_BOOK3S_64 */
>>  static inline int mm_is_thread_local(struct mm_struct *mm)
>>  {
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
>> b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index 0d233763441f..a421a0e3f930 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
>>  struct mm_struct *mm = arg;
>>  unsigned long pid = mm->context.id;
>>  
>> +/*
>> + * A kthread could have done a mmget_not_zero() after the flushing CPU
>> + * checked mm_users == 1, and be in the process of kthread_use_mm when
> ^
> in mm_is_singlethreaded()
> 
> Adding that reference would help join the dots for a new reader I think.

Yes you're right I can change that.

Thanks,
Nick


[PATCH v3 16/23] powerpc: use asm-generic/mmu_context.h for no-op implementations

2020-09-01 Thread Nicholas Piggin
Cc: linuxppc-dev@lists.ozlabs.org
Acked-by: Michael Ellerman  (powerpc)
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/mmu_context.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 7f3658a97384..a3a12a8341b2 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -14,7 +14,9 @@
 /*
  * Most if the context management is out of line
  */
+#define init_new_context init_new_context
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
+#define destroy_context destroy_context
 extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;
@@ -235,27 +237,26 @@ static inline void switch_mm(struct mm_struct *prev, 
struct mm_struct *next,
 }
 #define switch_mm_irqs_off switch_mm_irqs_off
 
-
-#define deactivate_mm(tsk,mm)  do { } while (0)
-
 /*
  * After we have set current->mm to a new value, this activates
  * the context for the new mm so we see the new mappings.
  */
+#define activate_mm activate_mm
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
switch_mm(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
+#ifdef CONFIG_PPC_BOOK3E_64
+#define enter_lazy_tlb enter_lazy_tlb
 static inline void enter_lazy_tlb(struct mm_struct *mm,
  struct task_struct *tsk)
 {
/* 64-bit Book3E keeps track of current PGD in the PACA */
-#ifdef CONFIG_PPC_BOOK3E_64
get_paca()->pgd = NULL;
-#endif
 }
+#endif
 
 extern void arch_exit_mmap(struct mm_struct *mm);
 
@@ -298,5 +299,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
return 0;
 }
 
+#include <asm-generic/mmu_context.h>
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
-- 
2.23.0



Re: fsl_espi errors on v5.7.15

2020-09-01 Thread Nicholas Piggin
Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> 
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>>> 
>>>>>
>>>>>>>>>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>>
>>>>>>>>>>> With my current workaround of emptying the RX FIFO. It seems
>>>>>>>>>>> survivable. Interestingly it only ever seems to be 1 extra byte in 
>>>>>>>>>>> the
>>>>>>>>>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>>>>>>>>>
>>>>>>>>>>> fsl_espi ffe11.spi: tx 70
>>>>>>>>>>> fsl_espi ffe11.spi: rx 03
>>>>>>>>>>> fsl_espi ffe11.spi: Extra RX 00
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe11.spi: tx 05
>>>>>>>>>>> fsl_espi ffe11.spi: rx 00
>>>>>>>>>>> fsl_espi ffe11.spi: Extra RX 03
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>>>>>>>>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>>>>>>>>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>>>>>>>>> fsl_espi ffe11.spi: tx 05
>>>>>>>>>>> fsl_espi ffe11.spi: rx 00
>>>>>>>>>>> fsl_espi ffe11.spi: Extra RX 03
>>>>>>>>>>>
>>>>>>>>>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>>>>>>>>>> possible to continually read the SR/FSR. But I've no idea why it
>>>>>>>>>>> happens some times and not others.
>>>>>>>>>> So I think I've got a reproduction and I think I've bisected the 
>>>>>>>>>> problem
>>>>>>>>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay 
>>>>>>>>>> in
>>>>>>>>>> C"). My day is just finishing now so I haven't applied too much 
>>>>>>>>>> scrutiny
>>>>>>>>>> to this result. Given the various rabbit holes I've been down on this
>>>>>>>>>> issue already I'd take this information with a good degree of 
>>>>>>>>>> skepticism.
>>>>>>>>>>
>>>>>>>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>>>>>>>> It doesn't have yet the change you're referring to, and the fsl-espi 
>>>>>>>>> driver
>>>>>>>>> is basically the same as in 5.7 (just two small changes in 5.7).
>>>>>>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>>>>>>> around this time that could affect book E, so it's good if that exact
>>>>>>>> patch is confirmed.
>>>>>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>>>>>> by cherry-picking 3282a3da25

Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-08-30 Thread Nicholas Piggin
Excerpts from pet...@infradead.org's message of August 28, 2020 9:15 pm:
> On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote:
> 
>> Closing this race only requires interrupts to be disabled while ->mm
>> and ->active_mm are being switched, but the TLB problem requires also
>> holding interrupts off over activate_mm. Unfortunately not all archs
>> can do that yet, e.g., arm defers the switch if irqs are disabled and
>> expects finish_arch_post_lock_switch() to be called to complete the
>> flush; um takes a blocking lock in activate_mm().
> 
> ARM at least has activate_mm() := switch_mm(), so it could be made to
> work.
>

Yeah, so long as that post_lock_switch switch did the right thing with
respect to its TLB flushing. It should do because arm doesn't seem to
check ->mm or ->active_mm (and if it was broken, the scheduler context
switch would be suspect too). I don't think the fix would be hard, just
that I don't have a good way to test it and qemu isn't great for testing
this kind of thing.

um too, I think, could probably defer that lock until after interrupts are
enabled again. I might throw a bunch of arch conversion patches over the
wall if this gets merged and try to move things along.

Thanks,
Nick


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-30 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> Hello,
> 
> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> Reimplement book3s idle code in C").
> 
> The symptom is host locking up completely after some hours of KVM
> workload with messages like
> 
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 
> printed before the host locks up.
> 
> The machines run sandboxed builds which is a mixed workload resulting in
> IO/single core/mutiple core load over time and there are periods of no
> activity and no VMS runnig as well. The VMs are shortlived so VM
> setup/terdown is somewhat excercised as well.
> 
> POWER9 with the new guest entry fast path does not seem to be affected.
> 
> Reverted the patch and the followup idle fixes on top of 5.2.14 and
> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> after idle") which gives same idle code as 5.1.16 and the kernel seems
> stable.
> 
> Config is attached.
> 
> I cannot easily revert this commit, especially if I want to use the same
> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> only to the new idle code.
> 
> Any idea what can be the problem?

So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
those threads. I wonder what they are doing. POWER8 doesn't have a good
NMI IPI and I don't know if it supports pdbg dumping registers from the
BMC unfortunately. Do the messages always come in pairs of CPUs?

I'm not sure where to start with reproducing, I'll have to try. How many
vCPUs in the guests? Do you have several guests running at once?

Thanks,
Nick



Re: fsl_espi errors on v5.7.15

2020-08-30 Thread Nicholas Piggin
Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> On 27/08/20 7:12 pm, Nicholas Piggin wrote:
>> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
>>> On 26.08.2020 08:07, Chris Packham wrote:
>>>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>>> On 26/08/20 10:22 am, Chris Packham wrote:
>>>>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>>>>
>>>>>> 
>>>>>>> I've been staring at spi-fsl-espi.c for while now and I think I've
>>>>>>>> identified a couple of deficiencies that may or may not be related
>>>>>>>> to my
>>>>>>>> issue.
>>>>>>>>
>>>>>>>> First I think the 'Transfer done but SPIE_DON isn't set' message
>>>>>>>> can be
>>>>>>>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
>>>>>>>> register.
>>>>>>>> We also write back to it to clear the current events. We re-read it in
>>>>>>>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>>>>>>>> naturally end up in that situation if we're doing a large read.
>>>>>>>> Consider
>>>>>>>> the messages for reading a block of data from a spi-nor chip
>>>>>>>>
>>>>>>>>     tx = READ_OP + ADDR
>>>>>>>>     rx = data
>>>>>>>>
>>>>>>>> We setup the transfer and pump out the tx_buf. The first interrupt
>>>>>>>> goes
>>>>>>>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>>>>>>>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>>>>>>>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>>>>>>>> continues until we've received all the data and we finish with
>>>>>>>> ESPI_SPIE
>>>>>>>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>>>>>>>> isn't set.
>>>>>>>>
>>>>>>>> The other deficiency is that we only get an interrupt when the
>>>>>>>> amount of
>>>>>>>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>>>>>>>> FSL_ESPI_RXTHR left to be received we will never pull them out of
>>>>>>>> the fifo.
>>>>>>>>
>>>>>>> SPIM_DON will trigger an interrupt once the last characters have been
>>>>>>> transferred, and read the remaining characters from the FIFO.
>>>>>> The T2080RM that I have says the following about the DON bit
>>>>>>
>>>>>> "Last character was transmitted. The last character was transmitted
>>>>>> and a new command can be written for the next frame."
>>>>>>
>>>>>> That does at least seem to fit with my assertion that it's all about
>>>>>> the TX direction. But the fact that it doesn't happen all the time
>>>>>> throws some doubt on it.
>>>>>>
>>>>>>> I think the reason I'm seeing some variability is because of how fast
>>>>>>>> (or slow) the interrupts get processed and how fast the spi-nor
>>>>>>>> chip can
>>>>>>>> fill the CPUs rx fifo.
>>>>>>>>
>>>>>>> To rule out timing issues at high bus frequencies I initially asked
>>>>>>> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
>>>>>>> or even less, then timing shouldn't be an issue.
>>>>>> Yes I've currently got spi-max-frequency = <100>; in my dts. I
>>>>>> would also expect a slower frequency would fit my "DON is for TX"
>>>>>> narrative.
>>>>>>> Last relevant functional changes have been done almost 4 years ago.
>>>>>>> And yours is the first such report I see. So question is what could
>>>>>>> be so
>>>>>>> special with your setup that it seems you're the only one being
>>>>>>> affected.
>>>>>>> The scenarios you describe are standard, therefore much more people
>>>>>>> should be affe

[PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

2020-08-28 Thread Nicholas Piggin
Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
if it's okay. This fix relies on commit 38cf307c1f20 ("mm: fix
kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.

Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of 
single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/tlb.h   | 13 -
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 ---
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
return false;
return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
-   WARN_ON(atomic_read(&mm->context.copros) > 0);
-   /*
-* It's possible for mm_access to take a reference on mm_users to
-* access the remote mm from another thread, but it's not allowed
-* to set mm_cpumask, so mm_users may be > 1 here.
-*/
-   WARN_ON(current->mm != mm);
-   atomic_set(&mm->context.active_cpus, 1);
-   cpumask_clear(mm_cpumask(mm));
-   cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..a421a0e3f930 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
struct mm_struct *mm = arg;
unsigned long pid = mm->context.id;
 
+   /*
+* A kthread could have done a mmget_not_zero() after the flushing CPU
+* checked mm_users == 1, and be in the process of kthread_use_mm when
+* interrupted here. In that case, current->mm will be set to mm,
+* because kthread_use_mm() setting ->mm and switching to the mm is
+* done with interrupts off.
+*/
if (current->mm == mm)
-   return; /* Local CPU */
+   goto out_flush;
 
if (current->active_mm == mm) {
-   /*
-* Must be a kernel thread because sender is single-threaded.
-*/
-   BUG_ON(current->mm);
+   WARN_ON_ONCE(current->mm != NULL);
+   /* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
-   switch_mm(mm, &init_mm, current);
current->active_mm = &init_mm;
+   switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
}
+
+   atomic_dec(&mm->context.active_cpus);
+   cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
 
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
 */
smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
(void *)mm, 1);
-   mm_reset_thread_local(mm);
 }
 
 void radix__flush_tlb_mm(struct mm_struct *mm)
-- 
2.23.0



[PATCH 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

2020-08-28 Thread Nicholas Piggin
The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to operate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is that code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded at
least until this thread creates another mm_users reference is now
broken.

arch/sparc/kernel/smp_64.c:

if (atomic_read(&mm->mm_users) == 1) {
cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
goto local_flush_and_out;
}

vs fs/io_uring.c

if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
 !mmget_not_zero(ctx->sqo_mm)))
return -EFAULT;
kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.

Cc: sparcli...@vger.kernel.org
Signed-off-by: Nicholas Piggin 
---
 arch/sparc/kernel/smp_64.c | 65 --
 1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *space has (potentially) executed on, this is the heuristic
- *we use to avoid doing cross calls.
- *
- *Also, for flushing from kswapd and also for clones, we
- *use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *in the system, this allows us to play several games to avoid
- *cross calls.
- *
- *One invariant is that when a cpu switches to a process, and
- *that processes tsk->active_mm->cpu_vm_mask does not have the
- *current cpu's bit set, that tlb context is flushed locally.
- *
- *If the address space is non-shared (ie. mm->count == 1) we avoid
- *cross calls when we want to flush the currently running process's
- *tlb state.  This is done by clearing all cpu bits except the current
- *processor's in current->mm->cpu_vm_mask and performing the
- *flush locally only.  This will force any subsequent cpus which run
- *this task to flush the context from the local tlb if the process
- *migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *bullet for most cases and perform the cross call (but only to
- *the cpus listed in cpu_vm_mask).
- *
- *The performance gain from "optimizing" away the cross call for threads is
- *questionable (in theory the big win for threads is the massive sharing of
- *address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */
 
 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
u32 ctx = CTX_HWBITS(mm->context);
-   int cpu = get_cpu();
 
-   if (atomic_read(&mm->mm_users) == 1) {
-   cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-   goto local_flush_and_out;
-   }
+   get_cpu();
 
smp_cross_call_masked(&xcall_flush_tlb_mm,
  ctx, 0, 0,
  mm_cpumask(mm));
 
-local_flush_and_out:
__flush_tl

[PATCH 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM

2020-08-28 Thread Nicholas Piggin
powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   | 1 +
 arch/powerpc/include/asm/mmu_context.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..65cb32211574 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
select ARCH_WANT_IPC_PARSE_VERSION
+   select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 7f3658a97384..e02aa793420b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct 
mm_struct *next,
  */
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
-   switch_mm(prev, next, current);
+   switch_mm_irqs_off(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
-- 
2.23.0



[PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-08-28 Thread Nicholas Piggin
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

  call_usermodehelper()
    kernel_execve()
      old_mm = current->mm;
      active_mm = current->active_mm;
      *** preempt *** -------------------->  schedule()
                                               prev->active_mm = NULL;
                                               mmdrop(prev active_mm);
                                             ...
                      <--------------------  schedule()
      current->mm = mm;
      current->active_mm = mm;
      if (!old_mm)
          mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig |  7 +++
 fs/exec.c| 17 +++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
bool
depends on MMU_GATHER_TABLE_FREE
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+   bool
+   help
+ Temporary select until all architectures can be converted to have
+ irqs disabled over activate_mm. Architectures that do IPI based TLB
+ shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
}
 
task_lock(tsk);
-   active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
-   tsk->mm = mm;
+
+   local_irq_disable();
+   active_mm = tsk->active_mm;
tsk->active_mm = mm;
+   tsk->mm = mm;
+   /*
+* This prevents preemption while active_mm is being loaded and
+* it and mm are being updated, which could cause problems for
+* lazy tlb mm refcounting when these are updated by context
+* switches. Not all architectures can handle irqs off over
+* activate_mm yet.
+*/
+   if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
activate_mm(active_mm, mm);
+   if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
task_unlock(tsk);
-- 
2.23.0



[PATCH 0/4] more mm switching vs TLB shootdown and lazy tlb

2020-08-28 Thread Nicholas Piggin
This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.

I'd like some feedback on the sparc/powerpc vs kthread_use_mm
problem too.

Thanks,
Nick

Nicholas Piggin (4):
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

 arch/Kconfig   |  7 +++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/include/asm/tlb.h | 13 --
 arch/powerpc/mm/book3s64/radix_tlb.c   | 23 ++---
 arch/sparc/kernel/smp_64.c | 65 ++
 fs/exec.c  | 17 ++-
 7 files changed, 54 insertions(+), 74 deletions(-)

-- 
2.23.0



Re: fsl_espi errors on v5.7.15

2020-08-27 Thread Nicholas Piggin
Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
> On 26.08.2020 08:07, Chris Packham wrote:
>> 
>> On 26/08/20 1:48 pm, Chris Packham wrote:
>>>
>>> On 26/08/20 10:22 am, Chris Packham wrote:
 On 25/08/20 7:22 pm, Heiner Kallweit wrote:

 
> I've been staring at spi-fsl-espi.c for while now and I think I've
>> identified a couple of deficiencies that may or may not be related 
>> to my
>> issue.
>>
>> First I think the 'Transfer done but SPIE_DON isn't set' message 
>> can be
>> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE 
>> register.
>> We also write back to it to clear the current events. We re-read it in
>> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
>> naturally end up in that situation if we're doing a large read. 
>> Consider
>> the messages for reading a block of data from a spi-nor chip
>>
>>    tx = READ_OP + ADDR
>>    rx = data
>>
>> We setup the transfer and pump out the tx_buf. The first interrupt 
>> goes
>> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
>> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
>> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
>> continues until we've received all the data and we finish with 
>> ESPI_SPIE
>> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
>> isn't set.
>>
>> The other deficiency is that we only get an interrupt when the 
>> amount of
>> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
>> FSL_ESPI_RXTHR left to be received we will never pull them out of 
>> the fifo.
>>
> SPIM_DON will trigger an interrupt once the last characters have been
> transferred, and read the remaining characters from the FIFO.

 The T2080RM that I have says the following about the DON bit

 "Last character was transmitted. The last character was transmitted 
 and a new command can be written for the next frame."

 That does at least seem to fit with my assertion that it's all about 
 the TX direction. But the fact that it doesn't happen all the time 
 throws some doubt on it.

> I think the reason I'm seeing some variability is because of how fast
>> (or slow) the interrupts get processed and how fast the spi-nor 
>> chip can
>> fill the CPUs rx fifo.
>>
> To rule out timing issues at high bus frequencies I initially asked
> for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
> or even less, then timing shouldn't be an issue.
 Yes I've currently got spi-max-frequency = <100>; in my dts. I 
 would also expect a slower frequency would fit my "DON is for TX" 
 narrative.
> Last relevant functional changes have been done almost 4 years ago.
> And yours is the first such report I see. So question is what could 
> be so
> special with your setup that it seems you're the only one being 
> affected.
> The scenarios you describe are standard, therefore much more people
> should be affected in case of a driver bug.
 Agreed. But even on my hardware (which may have a latent issue 
 despite being in the field for going on 5 years) the issue only 
 triggers under some fairly specific circumstances.
> You said that kernel config impacts how frequently the issue happens.
> Therefore question is what's the diff in kernel config, and how could
> the differences be related to SPI.

 It did seem to be somewhat random. Things like CONFIG_PREEMPT have an 
 impact but every time I found something that seemed to be having an 
 impact I've been able to disprove it. I actually think its about how 
 busy the system is which may or may not affect when we get round to 
 processing the interrupts.

 I have managed to get the 'Transfer done but SPIE_DON isn't set!' to 
 occur on the T2080RDB.

 I've had to add the following to expose the environment as a mtd 
 partition

 diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi 
 b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
 index ff87e67c70da..fbf95fc1fd68 100644
 --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
 +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
 @@ -116,6 +116,15 @@ flash@0 {
     compatible = "micron,n25q512ax3", 
 "jedec,spi-nor";
     reg = <0>;
     spi-max-frequency = <1000>; /* 
 input clock */
 +
 +   partition@u-boot {
 +    reg = <0x 0x0010>;
 +    label = "u-boot";
 +    };
 +    

[PATCH v2 16/23] powerpc: use asm-generic/mmu_context.h for no-op implementations

2020-08-26 Thread Nicholas Piggin
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/mmu_context.h | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 7f3658a97384..bc22e247ab55 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -14,7 +14,9 @@
 /*
  * Most if the context management is out of line
  */
+#define init_new_context init_new_context
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
+#define destroy_context destroy_context
 extern void destroy_context(struct mm_struct *mm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 struct mm_iommu_table_group_mem_t;
@@ -235,27 +237,15 @@ static inline void switch_mm(struct mm_struct *prev, 
struct mm_struct *next,
 }
 #define switch_mm_irqs_off switch_mm_irqs_off
 
-
-#define deactivate_mm(tsk,mm)  do { } while (0)
-
-/*
- * After we have set current->mm to a new value, this activates
- * the context for the new mm so we see the new mappings.
- */
-static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
-{
-   switch_mm(prev, next, current);
-}
-
-/* We don't currently use enter_lazy_tlb() for anything */
+#ifdef CONFIG_PPC_BOOK3E_64
+#define enter_lazy_tlb enter_lazy_tlb
 static inline void enter_lazy_tlb(struct mm_struct *mm,
  struct task_struct *tsk)
 {
/* 64-bit Book3E keeps track of current PGD in the PACA */
-#ifdef CONFIG_PPC_BOOK3E_64
get_paca()->pgd = NULL;
-#endif
 }
+#endif
 
 extern void arch_exit_mmap(struct mm_struct *mm);
 
@@ -298,5 +288,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
return 0;
 }
 
+#include <asm-generic/mmu_context.h>
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
-- 
2.23.0



[PATCH v7 12/12] powerpc/64s/radix: Enable huge vmalloc mappings

2020-08-25 Thread Nicholas Piggin
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
---
 Documentation/admin-guide/kernel-parameters.txt | 2 ++
 arch/powerpc/Kconfig| 1 +
 2 files changed, 3 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..6f0b41289a90 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,8 @@
 
nohugeiomap [KNL,X86,PPC] Disable kernel huge I/O mappings.
 
+   nohugevmalloc   [PPC] Disable kernel huge vmalloc mappings.
+
nosmt   [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..9171d25ad7dc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -175,6 +175,7 @@ config PPC
select GENERIC_TIME_VSYSCALL
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
PPC_RADIX_MMU
+   select HAVE_ARCH_HUGE_VMALLOC   if HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KASAN_VMALLOC  if PPC32 && PPC_PAGE_SHIFT <= 14
-- 
2.23.0



[PATCH v7 11/12] mm/vmalloc: Hugepage vmalloc mappings

2020-08-25 Thread Nicholas Piggin
Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
support PMD sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
larger, and fall back to small pages if that was unsuccessful.

Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
pages, because not all callers expect this (e.g., module allocations vs
strict module rwx).

This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for a
given allocation; a boot option, nohugevmalloc, is added to disable it.
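
To illustrate (not part of the patch; the buffer size and printout are only
an example), a caller can see whether its allocation ended up huge-mapped via
the page_order field this patch adds to struct vm_struct, much as the
alloc_large_system_hash() hunk below does:

	/* Sketch: report whether a vmalloc'ed buffer got a huge mapping. */
	void *buf = vmalloc(16 * 1024 * 1024);	/* >= PMD_SIZE, PAGE_KERNEL prot */

	if (buf) {
		struct vm_struct *area = find_vm_area(buf);

		/* page_order > 0 means the area was mapped with huge pages */
		pr_info("vmalloc buffer mapped with order-%u pages\n",
			area->page_order);
		vfree(buf);
	}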

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig|   4 +
 include/linux/vmalloc.h |   1 +
 mm/page_alloc.c |   5 +-
 mm/vmalloc.c| 180 ++--
 4 files changed, 145 insertions(+), 45 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..b2b89d629317 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -616,6 +616,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 config HAVE_ARCH_HUGE_VMAP
bool
 
+config HAVE_ARCH_HUGE_VMALLOC
+   depends on HAVE_ARCH_HUGE_VMAP
+   bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
bool
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 15adb9a14fb6..a7449064fe35 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+   unsigned intpage_order;
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bab486fea..b6427cc7b838 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -8102,6 +8103,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+   bool huge;
 
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8169,6 +8171,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+   huge = (find_vm_area(table)->page_order > 0);
} else {
/*
 * If bucketsize is not a power-of-two, we may free
@@ -8185,7 +8188,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
 
pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
-   virt ? "vmalloc" : "linear");
+   virt ? (huge ? "vmalloc hugepage" : "vmalloc") : "linear");
 
if (_hash_shift)
*_hash_shift = log2qty;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1d6cad16bda3..8db53c2d7f72 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -44,6 +44,19 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmalloc(char *str)
+{
+   vmap_allow_huge = false;
+   return 0;
+}
+early_param("nohugevmalloc", set_nohugevmalloc);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+
 bool is_vmalloc_addr(const void *x)
 {
unsigned long addr = (unsigned long)x;
@@ -477,31 +490,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long 
addr,
return 0;
 }
 
-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size specify 
should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing.  The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
-pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long 
end,
+   pgprot_t prot, struct page **pages)
 {
unsigne

[PATCH v7 10/12] mm/vmalloc: add vmap_range_noflush variant

2020-08-25 Thread Nicholas Piggin
Add a vmap_range_noflush() variant, and have vmap_range() call it followed
by the cache flush. As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches the other
callers in this file.

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 256554d598e6..1d6cad16bda3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -237,7 +237,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
return 0;
 }
 
-int vmap_range(unsigned long addr, unsigned long end,
+static int vmap_range_noflush(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
unsigned int max_page_shift)
 {
@@ -259,14 +259,24 @@ int vmap_range(unsigned long addr, unsigned long end,
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
-   flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
return err;
 }
 
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift)
+{
+   int err;
+
+   err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+   flush_cache_vmap(addr, end);
+
+   return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 pgtbl_mod_mask *mask)
 {
-- 
2.23.0



[PATCH v7 09/12] mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c

2020-08-25 Thread Nicholas Piggin
This is a generic kernel virtual memory mapper, not specific to ioremap.

Signed-off-by: Nicholas Piggin 
---
 include/linux/vmalloc.h |   3 +
 mm/ioremap.c| 197 
 mm/vmalloc.c| 196 +++
 3 files changed, 199 insertions(+), 197 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3f6bba4cc9bc..15adb9a14fb6 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -177,6 +177,9 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 #ifdef CONFIG_MMU
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift);
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index c67f91164401..d1dcc7e744ac 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,203 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
 static const bool iomap_max_page_shift = PAGE_SHIFT;
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
-{
-   pte_t *pte;
-   u64 pfn;
-
-   pfn = phys_addr >> PAGE_SHIFT;
-   pte = pte_alloc_kernel_track(pmd, addr, mask);
-   if (!pte)
-   return -ENOMEM;
-   do {
-   BUG_ON(!pte_none(*pte));
-   set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-   pfn++;
-   } while (pte++, addr += PAGE_SIZE, addr != end);
-   *mask |= PGTBL_PTE_MODIFIED;
-   return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PMD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pmd_supported(prot))
-   return 0;
-
-   if ((end - addr) != PMD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PMD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PMD_SIZE))
-   return 0;
-
-   if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
-   return 0;
-
-   return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pmd_t *pmd;
-   unsigned long next;
-
-   pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
-   if (!pmd)
-   return -ENOMEM;
-   do {
-   next = pmd_addr_end(addr, end);
-
-   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, 
max_page_shift)) {
-   *mask |= PGTBL_PMD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
-   return -ENOMEM;
-   } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
-   return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PUD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pud_supported(prot))
-   return 0;
-
-   if ((end - addr) != PUD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PUD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PUD_SIZE))
-   return 0;
-
-   if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
-   return 0;
-
-   return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pud_t *pud;
-   unsigned long next;
-
-   pud = pud_alloc_track(&init_mm, p4d, addr, mask);
-   if (!pud)
-   return -ENOMEM;
-   do {
-   next = pud_addr_end(addr, end);
-
-   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, 
max_page_shift)) {
-   *mask |= PGTBL_PUD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, 
max_page_shift, mask))
-

[PATCH v7 08/12] x86: inline huge vmap supported functions

2020-08-25 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Signed-off-by: Nicholas Piggin 
---

Ack or objection if this goes via the -mm tree?

 arch/x86/include/asm/vmalloc.h | 22 +++---
 arch/x86/mm/ioremap.c  | 19 ---
 arch/x86/mm/pgtable.c  | 13 -
 3 files changed, 19 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 094ea2b565f3..e714b00fc0ca 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,13 +1,29 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+#ifdef CONFIG_X86_64
+   return boot_cpu_has(X86_FEATURE_GBPAGES);
+#else
+   return false;
+#endif
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return boot_cpu_has(X86_FEATURE_PSE);
+}
 #endif
 
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 159bfca757b9..1465a22a9bfb 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,25 +481,6 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-#ifdef CONFIG_X86_64
-   return boot_cpu_has(X86_FEATURE_GBPAGES);
-#else
-   return false;
-#endif
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return boot_cpu_has(X86_FEATURE_PSE);
-}
-
 /*
  * Convert a physical pointer to a virtual kernel pointer for /dev/mem
  * access
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..801c418ee97d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,14 +780,6 @@ int pmd_clear_huge(pmd_t *pmd)
return 0;
 }
 
-/*
- * Until we support 512GB pages, skip them in the vmap area.
- */
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 #ifdef CONFIG_X86_64
 /**
  * pud_free_pmd_page - Clear pud entry and free pmd page.
@@ -859,11 +851,6 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #else /* !CONFIG_X86_64 */
 
-int pud_free_pmd_page(pud_t *pud, unsigned long addr)
-{
-   return pud_none(*pud);
-}
-
 /*
  * Disable free page handling on x86-PAE. This assures that ioremap()
  * does not update sync'd pmd entries. See vmalloc_sync_one().
-- 
2.23.0



[PATCH v7 07/12] arm64: inline huge vmap supported functions

2020-08-25 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Signed-off-by: Nicholas Piggin 
---

Ack or objection if this goes via the -mm tree?

 arch/arm64/include/asm/vmalloc.h | 23 ---
 arch/arm64/mm/mmu.c  | 26 --
 2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 597b40405319..fc9a12d6cc1a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -4,9 +4,26 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /*
+* Only 4k granule supports level 1 block mappings.
+* SW table walks can't handle removal of intermediate entries.
+*/
+   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   /* See arch_vmap_pud_supported() */
+   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
 #endif
 
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9df7e0058c78..07093e148957 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,27 +1304,6 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /*
-* Only 4k granule supports level 1 block mappings.
-* SW table walks can't handle removal of intermediate entries.
-*/
-   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
-  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   /* See arch_vmap_pud_supported() */
-   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
 {
pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot));
@@ -1416,11 +1395,6 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
return 1;
 }
 
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;   /* Don't attempt a block mapping */
-}
-
 #ifdef CONFIG_MEMORY_HOTPLUG
 static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
 {
-- 
2.23.0



[PATCH v7 06/12] powerpc: inline huge vmap supported functions

2020-08-25 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
---

Ack or objection if this goes via the -mm tree? 

 arch/powerpc/include/asm/vmalloc.h   | 19 ---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 21 -
 2 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index 105abb73f075..3f0c153befb0 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,12 +1,25 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /* HPT does not cope with large pages in the vmalloc area */
+   return radix_enabled();
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return radix_enabled();
+}
 #endif
 
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index eca83a50bf2e..27f5837cf145 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1134,22 +1134,6 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /* HPT does not cope with large pages in the vmalloc area */
-   return radix_enabled();
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return radix_enabled();
-}
-
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
pte_t *ptep = (pte_t *)pud;
@@ -1233,8 +1217,3 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
return 1;
 }
-
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-- 
2.23.0



[PATCH v7 05/12] mm: HUGE_VMAP arch support cleanup

2020-08-25 Thread Nicholas Piggin
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.

This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.

This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
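
As an illustration only (the series itself leaves prot unused for now, and
the flag check below is merely an example of what an arch could do; the body
otherwise mirrors the powerpc conversion later in this series), a per-call
query might look like:

	static inline bool arch_vmap_pmd_supported(pgprot_t prot)
	{
		/* e.g. decline huge mappings for uncacheable memory */
		if (pgprot_val(prot) & _PAGE_NO_CACHE)	/* illustrative flag */
			return false;
		return radix_enabled();
	}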

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Signed-off-by: Nicholas Piggin 
---

Ack or objection from arch maintainers if this goes via the -mm tree?

 arch/arm64/include/asm/vmalloc.h |  8 +++
 arch/arm64/mm/mmu.c  | 10 +--
 arch/powerpc/include/asm/vmalloc.h   |  8 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +--
 arch/x86/include/asm/vmalloc.h   |  7 ++
 arch/x86/mm/ioremap.c| 10 +--
 include/linux/io.h   |  9 ---
 include/linux/vmalloc.h  |  6 ++
 init/main.c  |  1 -
 mm/ioremap.c | 88 +---
 10 files changed, 77 insertions(+), 78 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 2ca708ab9b20..597b40405319 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_ARM64_VMALLOC_H
 #define _ASM_ARM64_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..9df7e0058c78 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,12 +1304,12 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/*
 * Only 4k granule supports level 1 block mappings.
@@ -1319,9 +1319,9 @@ int __init arch_ioremap_pud_supported(void)
   !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-   /* See arch_ioremap_pud_supported() */
+   /* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
diff --git a/arch/powerpc/include/asm/vmalloc.h 
b/arch/powerpc/include/asm/vmalloc.h
index b992dfaaa161..105abb73f075 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..eca83a50bf2e 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1134,13 +1134,13 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return radix_enabled();
 }
@@ -1234,7 +1234,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
return 1;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 29837740b520..094ea2b565f3 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,6 +1,13 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 84d85dbd1dad..159bfca757b9 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,21 +481,21 @@ void iounma

[PATCH v7 04/12] mm/ioremap: rename ioremap_*_range to vmap_*_range

2020-08-25 Thread Nicholas Piggin
This will be used as a generic kernel virtual mapping function, so
re-name it in preparation.

Signed-off-by: Nicholas Piggin 
---
 mm/ioremap.c | 64 +++-
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..3f4d36f9745a 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,9 @@ static inline int ioremap_pud_enabled(void) { return 0; }
 static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pte_t *pte;
u64 pfn;
@@ -81,9 +81,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pmd_enabled())
return 0;
@@ -103,9 +102,9 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long 
addr,
return pmd_set_huge(pmd, phys_addr, prot);
 }
 
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pmd_t *pmd;
unsigned long next;
@@ -116,20 +115,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned 
long addr,
do {
next = pmd_addr_end(addr, end);
 
-   if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
 
-   if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pud_enabled())
return 0;
@@ -149,9 +147,9 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long 
addr,
return pud_set_huge(pud, phys_addr, prot);
 }
 
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pud_t *pud;
unsigned long next;
@@ -162,20 +160,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned 
long addr,
do {
next = pud_addr_end(addr, end);
 
-   if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
 
-   if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_p4d_enabled())
return 0;
@@ -195,9 +192,9 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long 
addr,
return p4d_set_huge(p4d, phys_addr, prot);
 }
 
-static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr

[PATCH v7 03/12] mm/vmalloc: rename vmap_*_range vmap_pages_*_range

2020-08-25 Thread Nicholas Piggin
The vmalloc mapper operates on a struct page * array rather than a
linear physical address, re-name it to make this distinction clear.

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4e9b21adc73d..45cd80ec7eeb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -189,7 +189,7 @@ void unmap_kernel_range_noflush(unsigned long start, 
unsigned long size)
arch_sync_kernel_mappings(start, end);
 }
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -217,7 +217,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -229,13 +229,13 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
-   if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -247,13 +247,13 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
-   if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -265,7 +265,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
-   if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, 
mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -306,7 +306,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned 
long size,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
-   err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+   err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
-- 
2.23.0



[PATCH v7 02/12] mm: apply_to_pte_range warn and fail if a large pte is encountered

2020-08-25 Thread Nicholas Piggin
apply_to_pte_range might mistake a large pte for bad, or treat it as a
page table, resulting in a crash or corruption. Add a test to warn and
return error if large entries are found.

Signed-off-by: Nicholas Piggin 
---
 mm/memory.c | 60 +++--
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 602f4283122f..995b2e790b79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2262,13 +2262,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, 
pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
-   if (create || !pmd_none_or_clear_bad(pmd)) {
-   err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (pmd_none(*pmd) && !create)
+   continue;
+   if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+   return -EINVAL;
+   if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
+   if (!create)
+   continue;
+   pmd_clear_bad(pmd);
}
+   err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create);
+   if (err)
+   break;
} while (pmd++, addr = next, addr != end);
+
return err;
 }
 
@@ -2289,13 +2296,20 @@ static int apply_to_pud_range(struct mm_struct *mm, 
p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
-   if (create || !pud_none_or_clear_bad(pud)) {
-   err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (pud_none(*pud) && !create)
+   continue;
+   if (WARN_ON_ONCE(pud_leaf(*pud)))
+   return -EINVAL;
+   if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
+   if (!create)
+   continue;
+   pud_clear_bad(pud);
}
+   err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create);
+   if (err)
+   break;
} while (pud++, addr = next, addr != end);
+
return err;
 }
 
@@ -2316,13 +2330,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, 
pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
-   if (create || !p4d_none_or_clear_bad(p4d)) {
-   err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (p4d_none(*p4d) && !create)
+   continue;
+   if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+   return -EINVAL;
+   if (!p4d_none(*p4d) && WARN_ON_ONCE(p4d_bad(*p4d))) {
+   if (!create)
+   continue;
+   p4d_clear_bad(p4d);
}
+   err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create);
+   if (err)
+   break;
} while (p4d++, addr = next, addr != end);
+
return err;
 }
 
@@ -2341,8 +2362,15 @@ static int __apply_to_page_range(struct mm_struct *mm, 
unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
-   if (!create && pgd_none_or_clear_bad(pgd))
+   if (pgd_none(*pgd) && !create)
continue;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return -EINVAL;
+   if (!pgd_none(*pgd) && WARN_ON_ONCE(pgd_bad(*pgd))) {
+   if (!create)
+   continue;
+   pgd_clear_bad(pgd);
+   }
err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create);
if (err)
break;
-- 
2.23.0



[PATCH v7 01/12] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings

2020-08-25 Thread Nicholas Piggin
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")
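
For instance (illustrative numbers, assuming 4kB base pages and a PMD-aligned,
2MB PMD-mapped vmalloc area), the leaf-level walk added below resolves to the
small page at the offset within the huge page:

	/*
	 * addr   = area_base + 20 * 1024;              20kB into the area
	 * offset = (addr & ~PMD_MASK) >> PAGE_SHIFT;   == 5
	 * page   = pmd_page(*pmd) + offset;            6th small struct page
	 *
	 * which is the same struct page a small-page mapping of that range
	 * would return, so callers need not care how the area was mapped.
	 */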

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b482d240f9a2..4e9b21adc73d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -36,7 +36,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 #include 
@@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
@@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 
if (pgd_none(*pgd))
return NULL;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return NULL; /* XXX: no allowance for huge pgd */
+   if (WARN_ON_ONCE(pgd_bad(*pgd)))
+   return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
-   pud = pud_offset(p4d, addr);
+   if (p4d_leaf(*p4d))
+   return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(p4d_bad(*p4d)))
+   return NULL;
 
-   /*
-* Don't dereference bad PUD or PMD (below) entries. This will also
-* identify huge mappings, which we may encounter on architectures
-* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
-* identified as vmalloc addresses by is_vmalloc_addr(), but are
-* not [unambiguously] associated with a struct page, so there is
-* no correct value to return for them.
-*/
-   WARN_ON_ONCE(pud_bad(*pud));
-   if (pud_none(*pud) || pud_bad(*pud))
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (pud_leaf(*pud))
+   return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
-   WARN_ON_ONCE(pmd_bad(*pmd));
-   if (pmd_none(*pmd) || pmd_bad(*pmd))
+   if (pmd_none(*pmd))
+   return NULL;
+   if (pmd_leaf(*pmd))
+   return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
 
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.23.0



[PATCH v7 00/12] huge vmalloc mappings

2020-08-25 Thread Nicholas Piggin
I think it's ready to go into -mm if it gets acks for the arch
changes.

Thanks,
Nick

Since v6:
- Fixed a false positive warning introduced in patch 2, found by
  kbuild test robot.

Since v5:
- Split arch changes out better and make the constant folding work
- Avoid most of the 80 column wrap, fix a reference to lib/ioremap.c
- Fix compile error on some archs

Since v4:
- Fixed an off-by-page-order bug in v4
- Several minor cleanups.
- Added page order to /proc/vmallocinfo
- Added hugepage to alloc_large_system_hash output.
- Made an architecture config option, powerpc only for now.

Since v3:
- Fixed an off-by-one bug in a loop
- Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail
- Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan
  Cameron for debugging the cause of this (hopefully).

Since v2:
- Rebased on vmalloc cleanups, split series into simpler pieces.
- Fixed several compile errors and warnings
- Keep the page array and accounting in small page units because
  struct vm_struct is an interface (this should fix x86 vmap stack debug
  assert). [Thanks Zefan]


Nicholas Piggin (12):
  mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
  mm: apply_to_pte_range warn and fail if a large pte is encountered
  mm/vmalloc: rename vmap_*_range vmap_pages_*_range
  mm/ioremap: rename ioremap_*_range to vmap_*_range
  mm: HUGE_VMAP arch support cleanup
  powerpc: inline huge vmap supported functions
  arm64: inline huge vmap supported functions
  x86: inline huge vmap supported functions
  mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c
  mm/vmalloc: add vmap_range_noflush variant
  mm/vmalloc: Hugepage vmalloc mappings
  powerpc/64s/radix: Enable huge vmalloc mappings

 .../admin-guide/kernel-parameters.txt |   2 +
 arch/Kconfig  |   4 +
 arch/arm64/include/asm/vmalloc.h  |  25 +
 arch/arm64/mm/mmu.c   |  26 -
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/vmalloc.h|  21 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  21 -
 arch/x86/include/asm/vmalloc.h|  23 +
 arch/x86/mm/ioremap.c |  19 -
 arch/x86/mm/pgtable.c |  13 -
 include/linux/io.h|   9 -
 include/linux/vmalloc.h   |  10 +
 init/main.c   |   1 -
 mm/ioremap.c  | 225 +
 mm/memory.c   |  60 ++-
 mm/page_alloc.c   |   5 +-
 mm/vmalloc.c  | 443 +++---
 17 files changed, 515 insertions(+), 393 deletions(-)

-- 
2.23.0



[PATCH] powerpc/pseries: add new branch prediction security bits for link stack

2020-08-25 Thread Nicholas Piggin
The hypervisor interface has defined branch prediction security bits for
handling the link stack. Wire them up.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/hvcall.h  | 2 ++
 arch/powerpc/platforms/pseries/setup.c | 6 ++
 2 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index fbb377055471..e66627fc1972 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -375,11 +375,13 @@
 #define H_CPU_CHAR_THREAD_RECONFIG_CTRL(1ull << 57) // IBM bit 6
 #define H_CPU_CHAR_COUNT_CACHE_DISABLED(1ull << 56) // IBM bit 7
 #define H_CPU_CHAR_BCCTR_FLUSH_ASSIST  (1ull << 54) // IBM bit 9
+#define H_CPU_CHAR_BCCTR_LINK_FLUSH_ASSIST (1ull << 52) // IBM bit 11
 
 #define H_CPU_BEHAV_FAVOUR_SECURITY(1ull << 63) // IBM bit 0
 #define H_CPU_BEHAV_L1D_FLUSH_PR   (1ull << 62) // IBM bit 1
 #define H_CPU_BEHAV_BNDS_CHK_SPEC_BAR  (1ull << 61) // IBM bit 2
 #define H_CPU_BEHAV_FLUSH_COUNT_CACHE  (1ull << 58) // IBM bit 5
+#define H_CPU_BEHAV_FLUSH_LINK_STACK   (1ull << 57) // IBM bit 6
 
 /* Flag values used in H_REGISTER_PROC_TBL hcall */
 #define PROC_TABLE_OP_MASK 0x18
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 2f4ee0a90284..633c45ec406d 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -519,9 +519,15 @@ static void init_cpu_char_feature_flags(struct 
h_cpu_char_result *result)
if (result->character & H_CPU_CHAR_BCCTR_FLUSH_ASSIST)
security_ftr_set(SEC_FTR_BCCTR_FLUSH_ASSIST);
 
+   if (result->character & H_CPU_CHAR_BCCTR_LINK_FLUSH_ASSIST)
+   security_ftr_set(SEC_FTR_BCCTR_LINK_FLUSH_ASSIST);
+
if (result->behaviour & H_CPU_BEHAV_FLUSH_COUNT_CACHE)
security_ftr_set(SEC_FTR_FLUSH_COUNT_CACHE);
 
+   if (result->behaviour & H_CPU_BEHAV_FLUSH_LINK_STACK)
+   security_ftr_set(SEC_FTR_FLUSH_LINK_STACK);
+
/*
 * The features below are enabled by default, so we instead look to see
 * if firmware has *disabled* them, and clear them if so.
-- 
2.23.0



[PATCH] powerpc/64s: handle ISA v3.1 local copy-paste context switches

2020-08-25 Thread Nicholas Piggin
In ISA v3.1 the copy-paste facility has new memory move functionality which
allows the copy buffer to be pasted to domestic memory (RAM) as opposed to
foreign memory (accelerator).

This means the POWER9 trick of avoiding the cp_abort on context switch if
the process had not mapped foreign memory does not work on POWER10. Do the
cp_abort unconditionally there.

KVM must also cp_abort on guest exit to prevent copy buffer state leaking
between contexts.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/process.c   | 16 +---
 arch/powerpc/kvm/book3s_hv.c|  7 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  8 
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 016bd831908e..1a572c811ca5 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1250,15 +1250,17 @@ struct task_struct *__switch_to(struct task_struct 
*prev,
restore_math(current->thread.regs);
 
/*
-* The copy-paste buffer can only store into foreign real
-* addresses, so unprivileged processes can not see the
-* data or use it in any way unless they have foreign real
-* mappings. If the new process has the foreign real address
-* mappings, we must issue a cp_abort to clear any state and
-* prevent snooping, corruption or a covert channel.
+* On POWER9 the copy-paste buffer can only paste into
+* foreign real addresses, so unprivileged processes can not
+* see the data or use it in any way unless they have
+* foreign real mappings. If the new process has the foreign
+* real address mappings, we must issue a cp_abort to clear
+* any state and prevent snooping, corruption or a covert
+* channel. ISA v3.1 supports paste into local memory.
 */
if (current->mm &&
-   atomic_read(&current->mm->context.vas_windows))
+   (cpu_has_feature(CPU_FTR_ARCH_31) ||
+   atomic_read(&current->mm->context.vas_windows)))
asm volatile(PPC_CP_ABORT);
}
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4ba06a2a306c..3bd3118c7633 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3530,6 +3530,13 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
 */
asm volatile("eieio; tlbsync; ptesync");
 
+   /*
+* cp_abort is required if the processor supports local copy-paste
+* to clear the copy buffer that was under control of the guest.
+*/
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   asm volatile(PPC_CP_ABORT);
+
mtspr(SPRN_LPID, vcpu->kvm->arch.host_lpid);/* restore host LPID */
isync();
 
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 799d6d0f4ead..cd9995ee8441 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1830,6 +1830,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_RADIX_PREFETCH_BUG)
 2:
 #endif /* CONFIG_PPC_RADIX_MMU */
 
+   /*
+* cp_abort is required if the processor supports local copy-paste
+* to clear the copy buffer that was under control of the guest.
+*/
+BEGIN_FTR_SECTION
+   PPC_CP_ABORT
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_31)
+
/*
 * POWER7/POWER8 guest -> host partition switch code.
 * We don't have to lock against tlbies but we do
-- 
2.23.0



[PATCH] powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address

2020-08-25 Thread Nicholas Piggin
The copy buffer is implemented as a real address in the nest which is
translated from EA by copy, and used for memory access by paste. This
requires that it be invalidated by TLB invalidation.

TLBIE does invalidate the copy buffer, but TLBIEL does not. Add cp_abort
to the tlbiel sequence.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/synch.h   | 13 +
 arch/powerpc/mm/book3s64/hash_native.c |  8 
 arch/powerpc/mm/book3s64/radix_tlb.c   | 12 ++--
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index aca70fb43147..47d036d32828 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -3,7 +3,9 @@
 #define _ASM_POWERPC_SYNCH_H 
 #ifdef __KERNEL__
 
+#include 
 #include 
+#include 
 #include 
 
 #ifndef __ASSEMBLY__
@@ -20,6 +22,17 @@ static inline void isync(void)
 {
__asm__ __volatile__ ("isync" : : : "memory");
 }
+
+static inline void ppc_after_tlbiel_barrier(void)
+{
+   asm volatile("ptesync": : :"memory");
+   /*
+    * POWER9, POWER10 need a cp_abort after tlbiel. For POWER9 this could
+    * possibly be limited to tasks which have mapped foreign address,
+    * similar to cp_abort in context switch.
+    */
+   asm volatile(ASM_FTR_IFSET(PPC_CP_ABORT, "", %0) : : "i" (CPU_FTR_ARCH_300) : "memory");
+}
 #endif /* __ASSEMBLY__ */
 
 #if defined(__powerpc64__)
diff --git a/arch/powerpc/mm/book3s64/hash_native.c 
b/arch/powerpc/mm/book3s64/hash_native.c
index cf20e5229ce1..0203cdf48c54 100644
--- a/arch/powerpc/mm/book3s64/hash_native.c
+++ b/arch/powerpc/mm/book3s64/hash_native.c
@@ -82,7 +82,7 @@ static void tlbiel_all_isa206(unsigned int num_sets, unsigned 
int is)
for (set = 0; set < num_sets; set++)
tlbiel_hash_set_isa206(set, is);
 
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
 }
 
 static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
@@ -110,7 +110,7 @@ static void tlbiel_all_isa300(unsigned int num_sets, 
unsigned int is)
 */
tlbiel_hash_set_isa300(0, is, 0, 2, 1);
 
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
 
asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT "; isync" : : :"memory");
 }
@@ -303,7 +303,7 @@ static inline void tlbie(unsigned long vpn, int psize, int 
apsize,
asm volatile("ptesync": : :"memory");
if (use_local) {
__tlbiel(vpn, psize, apsize, ssize);
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
} else {
__tlbie(vpn, psize, apsize, ssize);
fixup_tlbie_vpn(vpn, psize, apsize, ssize);
@@ -879,7 +879,7 @@ static void native_flush_hash_range(unsigned long number, 
int local)
__tlbiel(vpn, psize, psize, ssize);
} pte_iterate_hashed_end();
}
-   asm volatile("ptesync":::"memory");
+   ppc_after_tlbiel_barrier();
} else {
int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
 
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..5c9d2fccacc7 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -65,7 +65,7 @@ static void tlbiel_all_isa300(unsigned int num_sets, unsigned 
int is)
for (set = 1; set < num_sets; set++)
tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 1);
 
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
 }
 
 void radix__tlbiel_all(unsigned int action)
@@ -296,7 +296,7 @@ static __always_inline void _tlbiel_pid(unsigned long pid, 
unsigned long ric)
 
/* For PWC, only one flush is needed */
if (ric == RIC_FLUSH_PWC) {
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
return;
}
 
@@ -304,7 +304,7 @@ static __always_inline void _tlbiel_pid(unsigned long pid, 
unsigned long ric)
for (set = 1; set < POWER9_TLB_SETS_RADIX ; set++)
__tlbiel_pid(pid, set, RIC_FLUSH_TLB);
 
-   asm volatile("ptesync": : :"memory");
+   ppc_after_tlbiel_barrier();
asm volatile(PPC_RADIX_INVALIDATE_ERAT_USER "; isync" : : :"memory");
 }
 
@@ -431,7 +431,7 @@ static __always_inline void _tlbiel_va(unsigned long va, 
unsigned long pid,
 
asm volatile("ptesync": : :"memory");
__tlbiel_va(va, pid, ap, ric);
-   asm volatile("ptesync

[PATCH] powerpc/64s: scv entry should set PPR

2020-08-25 Thread Nicholas Piggin
Kernel entry sets PPR to HMT_MEDIUM by convention. The scv entry
path missed this.

Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/entry_64.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 33a42e42c56f..733e40eba4eb 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -113,6 +113,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
ld  r11,exception_marker@toc(r2)
std r11,-16(r10)/* "regshere" marker */
 
+BEGIN_FTR_SECTION
+   HMT_MEDIUM
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+
/*
 * RECONCILE_IRQ_STATE without calling trace_hardirqs_off(), which
 * would clobber syscall parameters. Also we always enter with IRQs
-- 
2.23.0



Re: [mm] 8e63b8bbd7: WARNING:at_mm/memory.c:#__apply_to_page_range

2020-08-22 Thread Nicholas Piggin
Excerpts from kernel test robot's message of August 22, 2020 9:24 am:
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: 8e63b8bbd7d17f64ced151cebd151a2cd9f63c64 ("[PATCH v5 2/8] mm: 
> apply_to_pte_range warn and fail if a large pte is encountered")
> url: 
> https://github.com/0day-ci/linux/commits/Nicholas-Piggin/huge-vmalloc-mappings/20200821-124543
> base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next
> 
> in testcase: boot
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> +---+++
> |   | 185311995a | 8e63b8bbd7 |
> +---+++
> | boot_successes| 4  | 0  |
> | boot_failures | 0  | 4  |
> | WARNING:at_mm/memory.c:#__apply_to_page_range | 0  | 4  |
> | RIP:__apply_to_page_range | 0  | 4  |
> +---+++
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 
> [0.786159] WARNING: CPU: 0 PID: 0 at mm/memory.c:2269 
> __apply_to_page_range+0x537/0x9c0

Hmm, I wonder if that's WARN_ON_ONCE(pmd_bad(*pmd)), which would be
odd. I don't know x86 asm well enough to see what the *pmd value would
be there.
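
For reference, the checks the patch adds to apply_to_pmd_range() around that
line (quoted from the v6 posting of 02/12 further down) are:

	if (WARN_ON_ONCE(pmd_leaf(*pmd)))
		return -EINVAL;
	if (WARN_ON_ONCE(pmd_bad(*pmd))) {
		if (!create)
			continue;
		pmd_clear_bad(pmd);
	}

so the warning means the walk found either a leaf (huge) pmd or one that
fails pmd_bad().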

I'll try to reproduce and work out what's going on.

Thanks,
Nick


> [0.786675] Modules linked in:
> [0.786888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> 5.9.0-rc1-2-g8e63b8bbd7d17f #2
> [0.787402] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.12.0-1 04/01/2014
> [0.787935] RIP: 0010:__apply_to_page_range+0x537/0x9c0
> [0.788280] Code: 8b 5c 24 50 48 39 5c 24 38 0f 84 6b 03 00 00 4c 8b 74 24 
> 38 e9 63 fb ff ff 84 d2 0f 84 ba 01 00 00 48 8b 1c 24 e9 3c fe ff ff <0f> 0b 
> 45 84 ed 0f 84 08 01 00 00 48 89 ef e8 8f 8f 02 00 48 89 e8
> [0.789467] RSP: :83e079d0 EFLAGS: 00010293
> [0.789805] RAX:  RBX: f5201000 RCX: 
> 000fffe0
> [0.790260] RDX:  RSI: 000ff000 RDI: 
> 
> [0.790724] RBP: 888107408000 R08: 0001 R09: 
> 000107408000
> [0.791179] R10: 840dcb5b R11: fbfff081b96b R12: 
> f520001f
> [0.791634] R13: 0001 R14: f520 R15: 
> dc00
> [0.792090] FS:  () GS:8881eae0() 
> knlGS:
> [0.792607] CS:  0010 DS:  ES:  CR0: 80050033
> [0.792977] CR2: 88823000 CR3: 03e14000 CR4: 
> 000406b0
> [0.793433] DR0:  DR1:  DR2: 
> 
> [0.793889] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [0.794344] Call Trace:
> [0.794517]  ? memset+0x40/0x40
> [0.794745]  alloc_vmap_area+0x7a9/0x2280
> [0.795054]  ? trace_hardirqs_on+0x4f/0x2e0
> [0.795354]  ? _raw_spin_unlock_irqrestore+0x39/0x60
> [0.795682]  ? free_vmap_area+0x1a20/0x1a20
> [0.795959]  ? __kasan_kmalloc+0xbf/0xe0
> [0.796292]  __get_vm_area_node+0xd1/0x300
> [0.796605]  get_vm_area_caller+0x2d/0x40
> [0.796872]  ? acpi_os_map_iomem+0x3c3/0x4e0
> [0.797155]  __ioremap_caller+0x1d8/0x480
> [0.797486]  ? acpi_os_map_iomem+0x3c3/0x4e0
> [0.797770]  ? iounmap+0x160/0x160
> [0.798002]  ? __kasan_kmalloc+0xbf/0xe0
> [0.798335]  acpi_os_map_iomem+0x3c3/0x4e0
> [0.798612]  acpi_tb_acquire_table+0xb3/0x1c5
> [0.798910]  acpi_tb_validate_table+0x68/0xbf
> [0.799199]  acpi_tb_verify_temp_table+0xa1/0x640
> [0.799512]  ? __down_trylock_console_sem+0x7a/0xa0
> [0.799833]  ? acpi_tb_validate_temp_table+0x9d/0x9d
> [0.800159]  ? acpi_ut_init_stack_ptr_trace+0xaa/0xaa
> [0.800490]  ? vprintk_emit+0x10b/0x2a0
> [0.800748]  ? acpi_ut_acquire_mutex+0x1d7/0x32f
> [0.801056]  acpi_reallocate_root_table+0x339/0x385
> [0.801377]  ? acpi_tb_parse_root_table+0x5a5/0x5a5
> [0.801700]  ? dmi_matches+0xc6/0x120
> [0.801968]  acpi_early_init+0x116/0x3ae
> [0.802230]  start_kernel+0x2f7/0x39f
> [0.802477]  secondary_startup_64+0xa4/0xb0
> [0.802770] irq event stamp: 5137
> [0.802992] hardirqs last  enabled at (5145): [] 
> console_unlock+0x632/0xa00
> 

Re: [PATCH v6 05/12] mm: HUGE_VMAP arch support cleanup

2020-08-21 Thread Nicholas Piggin
Excerpts from Andrew Morton's message of August 22, 2020 6:14 am:
> On Sat, 22 Aug 2020 01:12:09 +1000 Nicholas Piggin  wrote:
> 
>> This changes the awkward approach where architectures provide init
>> functions to determine which levels they can provide large mappings for,
>> to one where the arch is queried for each call.
>> 
>> This removes code and indirection, and allows constant-folding of dead
>> code for unsupported levels.
>> 
>> This also adds a prot argument to the arch query. This is unused
>> currently but could help with some architectures (e.g., some powerpc
>> processors can't map uncacheable memory with large pages).
>> 
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -1,4 +1,12 @@
>>  #ifndef _ASM_ARM64_VMALLOC_H
>>  #define _ASM_ARM64_VMALLOC_H
>>  
>> +#include 
>> +
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +bool arch_vmap_p4d_supported(pgprot_t prot);
>> +bool arch_vmap_pud_supported(pgprot_t prot);
>> +bool arch_vmap_pmd_supported(pgprot_t prot);
>> +#endif
> 
> Moving these out of generic code and into multiple arch headers is
> unfortunate.  Can we leave them in include/linux/somewhere?  And remove
> the ifdefs, if so inclined - they just move the build error from
> link-time to compile-time, and such an error shouldn't occur!

Yeah this was just an intermediate step as you saw. It's a bit 
unfortunate, but I thought it made the arch changes clearer.

Thanks,
Nick



Re: [PATCH v6 01/12] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings

2020-08-21 Thread Nicholas Piggin
Excerpts from Andrew Morton's message of August 22, 2020 6:07 am:
> On Sat, 22 Aug 2020 01:12:05 +1000 Nicholas Piggin  wrote:
> 
>> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
>> Whether or not a vmap is huge depends on the architecture details,
>> alignments, boot options, etc., which the caller can not be expected
>> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
> 
> I assume this doesn't matter in current mainline?
> If wrong, then what are the user-visible effects and why no cc:stable?

I haven't heard any reports, but in theory it could cause a problem. The
commit 029c54b095995 from the changelog was made to paper over it. But
that was fixed properly afterward I think by 737326aa510b.

Not sure of the user visible problems currently. I think generally you
wouldn't do vmalloc_to_page() on ioremap() memory, so maybe calling it
a regression is a bit strong. _Technically_ a regression, maybe.
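
(A minimal sketch of the case in question, assuming addr points into an
ioremap()/vmap region that happens to be backed by a huge mapping:)

	struct page *page = vmalloc_to_page(addr);
	/*
	 * Without patch 01/12 the walk treats the huge PUD/PMD entry as bad
	 * and returns NULL; with it, the struct page for the offset within
	 * the large page is returned.
	 */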

Thanks,
Nick


Re: [PATCH v6 11/12] mm/vmalloc: Hugepage vmalloc mappings

2020-08-21 Thread Nicholas Piggin
Excerpts from Eric Dumazet's message of August 22, 2020 1:38 am:
> 
> On 8/21/20 8:12 AM, Nicholas Piggin wrote:
>> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
>> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
>> supports PMD sized vmap mappings.
>> 
>> vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
>> larger, and fall back to small pages if that was unsuccessful.
>> 
>> Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
>> pages, because not all callers expect this (e.g., module allocations vs
>> strict module rwx).
>> 
>> This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
>> POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
>> 
>> This can result in more internal fragmentation and memory overhead for a
>> given allocation, an option nohugevmalloc is added to disable at boot.
>> 
>>
> 
> Thanks for working on this stuff, I tried something similar in the past,
> but could not really do more than a hack.
> ( https://lkml.org/lkml/2016/12/21/285 )

Oh nice. It might be possible to pick up some ideas from your patch
still. Higher order pages smaller than PMD size, or the memory
policy stuff, perhaps.

> Note that __init alloc_large_system_hash() is used at boot time,
> when NUMA policy is spreading allocations over all NUMA nodes.
> 
> This means that on a dual node system, a hash table should be 50/50 spread.
> 
> With your patch, if a hashtable is exactly the size of one huge page,
> the location of this hashtable will be not balanced, this might have some
> unwanted impact.

In that case it shouldn't because it divides by the number of nodes,
but it will in general have a bit larger granularity in balancing than
smaller pages of course.

There's probably a better way to size these important hashes on NUMA. I
suspect most of the time you have a NUMA machine you actually would
prefer to use large pages now, even if it means taking up to 2MB more
memory per node per hash. It's not a great amount and the allocation 
size is rather arbitrary anyway.
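
(Rough bound on that overhead, assuming 2 MB PMD-sized mappings; the numbers
are only illustrative:)

	/*
	 * Each node's share of the hash rounds up to a multiple of 2 MB, so
	 * the waste is bounded by roughly one huge page per node:
	 *   waste < 2 MB * nr_online_nodes
	 * e.g. a 2-node machine pays at most ~4 MB extra per hash table.
	 */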

Thanks,
Nick


[PATCH v6 12/12] powerpc/64s/radix: Enable huge vmalloc mappings

2020-08-21 Thread Nicholas Piggin
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
---
 Documentation/admin-guide/kernel-parameters.txt | 2 ++
 arch/powerpc/Kconfig| 1 +
 2 files changed, 3 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..6f0b41289a90 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,8 @@
 
nohugeiomap [KNL,X86,PPC] Disable kernel huge I/O mappings.
 
+   nohugevmalloc   [PPC] Disable kernel huge vmalloc mappings.
+
nosmt   [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..9171d25ad7dc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -175,6 +175,7 @@ config PPC
select GENERIC_TIME_VSYSCALL
select HAVE_ARCH_AUDITSYSCALL
	select HAVE_ARCH_HUGE_VMAP		if PPC_BOOK3S_64 && PPC_RADIX_MMU
+   select HAVE_ARCH_HUGE_VMALLOC   if HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KASAN_VMALLOC  if PPC32 && PPC_PAGE_SHIFT <= 14
-- 
2.23.0



[PATCH v6 11/12] mm/vmalloc: Hugepage vmalloc mappings

2020-08-21 Thread Nicholas Piggin
Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
supports PMD sized vmap mappings.

vmalloc will attempt to allocate PMD-sized pages if allocating PMD size or
larger, and fall back to small pages if that was unsuccessful.

Allocations that do not use PAGE_KERNEL prot are not permitted to use huge
pages, because not all callers expect this (e.g., module allocations vs
strict module rwx).

This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for a
given allocation, an option nohugevmalloc is added to disable at boot.
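
(Not part of the patch: from a caller's point of view nothing changes; a
sketch, assuming a radix host with the option left enabled:)

	void *buf = vmalloc(8 * 1024 * 1024);	/* >= PMD size: may be backed by 2 MB mappings */
	if (buf)
		vfree(buf);			/* boot with "nohugevmalloc" to force small pages */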

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig|   4 +
 include/linux/vmalloc.h |   1 +
 mm/page_alloc.c |   5 +-
 mm/vmalloc.c| 180 ++--
 4 files changed, 145 insertions(+), 45 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..b2b89d629317 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -616,6 +616,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 config HAVE_ARCH_HUGE_VMAP
bool
 
+config HAVE_ARCH_HUGE_VMALLOC
+   depends on HAVE_ARCH_HUGE_VMAP
+   bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
bool
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 15adb9a14fb6..a7449064fe35 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+   unsigned intpage_order;
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bab486fea..b6427cc7b838 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -8102,6 +8103,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+   bool huge;
 
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8169,6 +8171,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+   huge = (find_vm_area(table)->page_order > 0);
} else {
/*
 * If bucketsize is not a power-of-two, we may free
@@ -8185,7 +8188,7 @@ void *__init alloc_large_system_hash(const char 
*tablename,
 
pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
-   virt ? "vmalloc" : "linear");
+   virt ? (huge ? "vmalloc hugepage" : "vmalloc") : "linear");
 
if (_hash_shift)
*_hash_shift = log2qty;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1d6cad16bda3..8db53c2d7f72 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -44,6 +44,19 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmalloc(char *str)
+{
+   vmap_allow_huge = false;
+   return 0;
+}
+early_param("nohugevmalloc", set_nohugevmalloc);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+
 bool is_vmalloc_addr(const void *x)
 {
unsigned long addr = (unsigned long)x;
@@ -477,31 +490,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long 
addr,
return 0;
 }
 
-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size specify should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing.  The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
-pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
+   pgprot_t prot, struct page **pages)
 {
unsigne

[PATCH v6 10/12] mm/vmalloc: add vmap_range_noflush variant

2020-08-21 Thread Nicholas Piggin
As a side-effect, the order of flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches
the other callers in this file.

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 256554d598e6..1d6cad16bda3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -237,7 +237,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, 
unsigned long end,
return 0;
 }
 
-int vmap_range(unsigned long addr, unsigned long end,
+static int vmap_range_noflush(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot,
unsigned int max_page_shift)
 {
@@ -259,14 +259,24 @@ int vmap_range(unsigned long addr, unsigned long end,
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
-   flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
return err;
 }
 
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift)
+{
+   int err;
+
+   err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+   flush_cache_vmap(addr, end);
+
+   return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 pgtbl_mod_mask *mask)
 {
-- 
2.23.0



[PATCH v6 09/12] mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c

2020-08-21 Thread Nicholas Piggin
This is a generic kernel virtual memory mapper, not specific to ioremap.

Signed-off-by: Nicholas Piggin 
---
 include/linux/vmalloc.h |   3 +
 mm/ioremap.c| 197 
 mm/vmalloc.c| 196 +++
 3 files changed, 199 insertions(+), 197 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3f6bba4cc9bc..15adb9a14fb6 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -177,6 +177,9 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 #ifdef CONFIG_MMU
+int vmap_range(unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift);
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index c67f91164401..d1dcc7e744ac 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,203 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
 static const bool iomap_max_page_shift = PAGE_SHIFT;
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
-{
-   pte_t *pte;
-   u64 pfn;
-
-   pfn = phys_addr >> PAGE_SHIFT;
-   pte = pte_alloc_kernel_track(pmd, addr, mask);
-   if (!pte)
-   return -ENOMEM;
-   do {
-   BUG_ON(!pte_none(*pte));
-   set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-   pfn++;
-   } while (pte++, addr += PAGE_SIZE, addr != end);
-   *mask |= PGTBL_PTE_MODIFIED;
-   return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PMD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pmd_supported(prot))
-   return 0;
-
-   if ((end - addr) != PMD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PMD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PMD_SIZE))
-   return 0;
-
-   if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
-   return 0;
-
-   return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pmd_t *pmd;
-   unsigned long next;
-
-   pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
-   if (!pmd)
-   return -ENOMEM;
-   do {
-   next = pmd_addr_end(addr, end);
-
-   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
-   *mask |= PGTBL_PMD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
-   return -ENOMEM;
-   } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
-   return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
-{
-   if (max_page_shift < PUD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pud_supported(prot))
-   return 0;
-
-   if ((end - addr) != PUD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PUD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PUD_SIZE))
-   return 0;
-
-   if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
-   return 0;
-
-   return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift, pgtbl_mod_mask *mask)
-{
-   pud_t *pud;
-   unsigned long next;
-
-   pud = pud_alloc_track(&init_mm, p4d, addr, mask);
-   if (!pud)
-   return -ENOMEM;
-   do {
-   next = pud_addr_end(addr, end);
-
-   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
-   *mask |= PGTBL_PUD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
-

[PATCH v6 08/12] x86: inline huge vmap supported functions

2020-08-21 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Signed-off-by: Nicholas Piggin 
---
 arch/x86/include/asm/vmalloc.h | 22 +++---
 arch/x86/mm/ioremap.c  | 19 ---
 arch/x86/mm/pgtable.c  | 13 -
 3 files changed, 19 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 094ea2b565f3..e714b00fc0ca 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,13 +1,29 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+#ifdef CONFIG_X86_64
+   return boot_cpu_has(X86_FEATURE_GBPAGES);
+#else
+   return false;
+#endif
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return boot_cpu_has(X86_FEATURE_PSE);
+}
 #endif
 
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 159bfca757b9..1465a22a9bfb 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,25 +481,6 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-#ifdef CONFIG_X86_64
-   return boot_cpu_has(X86_FEATURE_GBPAGES);
-#else
-   return false;
-#endif
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return boot_cpu_has(X86_FEATURE_PSE);
-}
-
 /*
  * Convert a physical pointer to a virtual kernel pointer for /dev/mem
  * access
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index dfd82f51ba66..801c418ee97d 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,14 +780,6 @@ int pmd_clear_huge(pmd_t *pmd)
return 0;
 }
 
-/*
- * Until we support 512GB pages, skip them in the vmap area.
- */
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 #ifdef CONFIG_X86_64
 /**
  * pud_free_pmd_page - Clear pud entry and free pmd page.
@@ -859,11 +851,6 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
 #else /* !CONFIG_X86_64 */
 
-int pud_free_pmd_page(pud_t *pud, unsigned long addr)
-{
-   return pud_none(*pud);
-}
-
 /*
  * Disable free page handling on x86-PAE. This assures that ioremap()
  * does not update sync'd pmd entries. See vmalloc_sync_one().
-- 
2.23.0



[PATCH v6 07/12] arm64: inline huge vmap supported functions

2020-08-21 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h | 23 ---
 arch/arm64/mm/mmu.c  | 26 --
 2 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 597b40405319..fc9a12d6cc1a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -4,9 +4,26 @@
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /*
+* Only 4k granule supports level 1 block mappings.
+* SW table walks can't handle removal of intermediate entries.
+*/
+   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
+  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   /* See arch_vmap_pud_supported() */
+   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
+}
 #endif
 
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9df7e0058c78..07093e148957 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,27 +1304,6 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /*
-* Only 4k granule supports level 1 block mappings.
-* SW table walks can't handle removal of intermediate entries.
-*/
-   return IS_ENABLED(CONFIG_ARM64_4K_PAGES) &&
-  !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   /* See arch_vmap_pud_supported() */
-   return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
-}
-
 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
 {
pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot));
@@ -1416,11 +1395,6 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
return 1;
 }
 
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;   /* Don't attempt a block mapping */
-}
-
 #ifdef CONFIG_MEMORY_HOTPLUG
 static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
 {
-- 
2.23.0



[PATCH v6 06/12] powerpc: inline huge vmap supported functions

2020-08-21 Thread Nicholas Piggin
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.

Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/vmalloc.h   | 19 ---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 21 -
 2 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/vmalloc.h b/arch/powerpc/include/asm/vmalloc.h
index 105abb73f075..3f0c153befb0 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,12 +1,25 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
 #include 
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-bool arch_vmap_p4d_supported(pgprot_t prot);
-bool arch_vmap_pud_supported(pgprot_t prot);
-bool arch_vmap_pmd_supported(pgprot_t prot);
+static inline bool arch_vmap_p4d_supported(pgprot_t prot)
+{
+   return false;
+}
+
+static inline bool arch_vmap_pud_supported(pgprot_t prot)
+{
+   /* HPT does not cope with large pages in the vmalloc area */
+   return radix_enabled();
+}
+
+static inline bool arch_vmap_pmd_supported(pgprot_t prot)
+{
+   return radix_enabled();
+}
 #endif
 
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index eca83a50bf2e..27f5837cf145 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1134,22 +1134,6 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-bool arch_vmap_pud_supported(pgprot_t prot)
-{
-   /* HPT does not cope with large pages in the vmalloc area */
-   return radix_enabled();
-}
-
-bool arch_vmap_pmd_supported(pgprot_t prot)
-{
-   return radix_enabled();
-}
-
-int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
-{
-   return 0;
-}
-
 int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 {
pte_t *ptep = (pte_t *)pud;
@@ -1233,8 +1217,3 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
return 1;
 }
-
-bool arch_vmap_p4d_supported(pgprot_t prot)
-{
-   return false;
-}
-- 
2.23.0



[PATCH v6 05/12] mm: HUGE_VMAP arch support cleanup

2020-08-21 Thread Nicholas Piggin
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.

This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.

This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
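
(Not part of the patch: the shape of the per-call query as the huge-mapping
helpers end up using it, condensed from the mm/ioremap.c code elsewhere in
this series; the pmd_present()/pmd_free_pte_page() check is omitted here:)

	static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
			phys_addr_t phys_addr, pgprot_t prot)
	{
		if (!arch_vmap_pmd_supported(prot))
			return 0;	/* unsupported level: fall back to PTEs */
		if ((end - addr) != PMD_SIZE || !IS_ALIGNED(addr, PMD_SIZE) ||
		    !IS_ALIGNED(phys_addr, PMD_SIZE))
			return 0;
		return pmd_set_huge(pmd, phys_addr, prot);
	}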

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Signed-off-by: Nicholas Piggin 
---
 arch/arm64/include/asm/vmalloc.h |  8 +++
 arch/arm64/mm/mmu.c  | 10 +--
 arch/powerpc/include/asm/vmalloc.h   |  8 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +--
 arch/x86/include/asm/vmalloc.h   |  7 ++
 arch/x86/mm/ioremap.c| 10 +--
 include/linux/io.h   |  9 ---
 include/linux/vmalloc.h  |  6 ++
 init/main.c  |  1 -
 mm/ioremap.c | 88 +---
 10 files changed, 77 insertions(+), 78 deletions(-)

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 2ca708ab9b20..597b40405319 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_ARM64_VMALLOC_H
 #define _ASM_ARM64_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..9df7e0058c78 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,12 +1304,12 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int 
*size, pgprot_t prot)
return dt_virt;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/*
 * Only 4k granule supports level 1 block mappings.
@@ -1319,9 +1319,9 @@ int __init arch_ioremap_pud_supported(void)
   !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-   /* See arch_ioremap_pud_supported() */
+   /* See arch_vmap_pud_supported() */
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
diff --git a/arch/powerpc/include/asm/vmalloc.h b/arch/powerpc/include/asm/vmalloc.h
index b992dfaaa161..105abb73f075 100644
--- a/arch/powerpc/include/asm/vmalloc.h
+++ b/arch/powerpc/include/asm/vmalloc.h
@@ -1,4 +1,12 @@
 #ifndef _ASM_POWERPC_VMALLOC_H
 #define _ASM_POWERPC_VMALLOC_H
 
+#include 
+
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_POWERPC_VMALLOC_H */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..eca83a50bf2e 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1134,13 +1134,13 @@ void radix__ptep_modify_prot_commit(struct 
vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
/* HPT does not cope with large pages in the vmalloc area */
return radix_enabled();
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
return radix_enabled();
 }
@@ -1234,7 +1234,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
return 1;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-   return 0;
+   return false;
 }
diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 29837740b520..094ea2b565f3 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -1,6 +1,13 @@
 #ifndef _ASM_X86_VMALLOC_H
 #define _ASM_X86_VMALLOC_H
 
+#include 
 #include 
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#endif
+
 #endif /* _ASM_X86_VMALLOC_H */
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 84d85dbd1dad..159bfca757b9 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,21 +481,21 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-

[PATCH v6 04/12] mm/ioremap: rename ioremap_*_range to vmap_*_range

2020-08-21 Thread Nicholas Piggin
This will be used as a generic kernel virtual mapping function, so
re-name it in preparation.

Signed-off-by: Nicholas Piggin 
---
 mm/ioremap.c | 64 +++-
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..3f4d36f9745a 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,9 @@ static inline int ioremap_pud_enabled(void) { return 0; }
 static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pte_t *pte;
u64 pfn;
@@ -81,9 +81,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pmd_enabled())
return 0;
@@ -103,9 +102,9 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long 
addr,
return pmd_set_huge(pmd, phys_addr, prot);
 }
 
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pmd_t *pmd;
unsigned long next;
@@ -116,20 +115,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned 
long addr,
do {
next = pmd_addr_end(addr, end);
 
-   if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PMD_MODIFIED;
continue;
}
 
-   if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_pud_enabled())
return 0;
@@ -149,9 +147,9 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long 
addr,
return pud_set_huge(pud, phys_addr, prot);
 }
 
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot,
+   pgtbl_mod_mask *mask)
 {
pud_t *pud;
unsigned long next;
@@ -162,20 +160,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned 
long addr,
do {
next = pud_addr_end(addr, end);
 
-   if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
*mask |= PGTBL_PUD_MODIFIED;
continue;
}
 
-   if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
return -ENOMEM;
} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
return 0;
 }
 
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
-   unsigned long end, phys_addr_t phys_addr,
-   pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+   phys_addr_t phys_addr, pgprot_t prot)
 {
if (!ioremap_p4d_enabled())
return 0;
@@ -195,9 +192,9 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long 
addr,
return p4d_set_huge(p4d, phys_addr, prot);
 }
 
-static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr

[PATCH v6 03/12] mm/vmalloc: rename vmap_*_range vmap_pages_*_range

2020-08-21 Thread Nicholas Piggin
The vmalloc mapper operates on a struct page * array rather than a
linear physical address, re-name it to make this distinction clear.

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4e9b21adc73d..45cd80ec7eeb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -189,7 +189,7 @@ void unmap_kernel_range_noflush(unsigned long start, 
unsigned long size)
arch_sync_kernel_mappings(start, end);
 }
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -217,7 +217,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
 }
 
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -229,13 +229,13 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
return -ENOMEM;
do {
next = pmd_addr_end(addr, end);
-   if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pmd++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -247,13 +247,13 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
return -ENOMEM;
do {
next = pud_addr_end(addr, end);
-   if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (pud++, addr = next, addr != end);
return 0;
 }
 
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr,
unsigned long end, pgprot_t prot, struct page **pages, int *nr,
pgtbl_mod_mask *mask)
 {
@@ -265,7 +265,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
return -ENOMEM;
do {
next = p4d_addr_end(addr, end);
-   if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+   if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
return -ENOMEM;
} while (p4d++, addr = next, addr != end);
return 0;
@@ -306,7 +306,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned 
long size,
next = pgd_addr_end(addr, end);
if (pgd_bad(*pgd))
mask |= PGTBL_PGD_MODIFIED;
-   err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+   err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
if (err)
return err;
} while (pgd++, addr = next, addr != end);
-- 
2.23.0



[PATCH v6 02/12] mm: apply_to_pte_range warn and fail if a large pte is encountered

2020-08-21 Thread Nicholas Piggin
Currently this might mistake a large pte for bad, or treat it as a
page table, resulting in a crash or corruption.

Signed-off-by: Nicholas Piggin 
---
 mm/memory.c | 60 +++--
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 602f4283122f..3a39a47920e2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2262,13 +2262,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, 
pud_t *pud,
}
do {
next = pmd_addr_end(addr, end);
-   if (create || !pmd_none_or_clear_bad(pmd)) {
-   err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (pmd_none(*pmd) && !create)
+   continue;
+   if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(pmd_bad(*pmd))) {
+   if (!create)
+   continue;
+   pmd_clear_bad(pmd);
}
+   err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create);
+   if (err)
+   break;
} while (pmd++, addr = next, addr != end);
+
return err;
 }
 
@@ -2289,13 +2296,20 @@ static int apply_to_pud_range(struct mm_struct *mm, 
p4d_t *p4d,
}
do {
next = pud_addr_end(addr, end);
-   if (create || !pud_none_or_clear_bad(pud)) {
-   err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (pud_none(*pud) && !create)
+   continue;
+   if (WARN_ON_ONCE(pud_leaf(*pud)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(pud_bad(*pud))) {
+   if (!create)
+   continue;
+   pud_clear_bad(pud);
}
+   err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create);
+   if (err)
+   break;
} while (pud++, addr = next, addr != end);
+
return err;
 }
 
@@ -2316,13 +2330,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, 
pgd_t *pgd,
}
do {
next = p4d_addr_end(addr, end);
-   if (create || !p4d_none_or_clear_bad(p4d)) {
-   err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
-create);
-   if (err)
-   break;
+   if (p4d_none(*p4d) && !create)
+   continue;
+   if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(p4d_bad(*p4d))) {
+   if (!create)
+   continue;
+   p4d_clear_bad(p4d);
}
+   err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create);
+   if (err)
+   break;
} while (p4d++, addr = next, addr != end);
+
return err;
 }
 
@@ -2341,8 +2362,15 @@ static int __apply_to_page_range(struct mm_struct *mm, 
unsigned long addr,
pgd = pgd_offset(mm, addr);
do {
next = pgd_addr_end(addr, end);
-   if (!create && pgd_none_or_clear_bad(pgd))
+   if (pgd_none(*pgd) && !create)
continue;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return -EINVAL;
+   if (WARN_ON_ONCE(pgd_bad(*pgd))) {
+   if (!create)
+   continue;
+   pgd_clear_bad(pgd);
+   }
err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create);
if (err)
break;
-- 
2.23.0



[PATCH v6 01/12] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings

2020-08-21 Thread Nicholas Piggin
vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
fail gracefully on unexpected huge vmap mappings")

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b482d240f9a2..4e9b21adc73d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -36,7 +36,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 #include 
@@ -343,7 +343,9 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
@@ -363,25 +365,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 
if (pgd_none(*pgd))
return NULL;
+   if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+   return NULL; /* XXX: no allowance for huge pgd */
+   if (WARN_ON_ONCE(pgd_bad(*pgd)))
+   return NULL;
+
p4d = p4d_offset(pgd, addr);
if (p4d_none(*p4d))
return NULL;
-   pud = pud_offset(p4d, addr);
+   if (p4d_leaf(*p4d))
+   return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(p4d_bad(*p4d)))
+   return NULL;
 
-   /*
-* Don't dereference bad PUD or PMD (below) entries. This will also
-* identify huge mappings, which we may encounter on architectures
-* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
-* identified as vmalloc addresses by is_vmalloc_addr(), but are
-* not [unambiguously] associated with a struct page, so there is
-* no correct value to return for them.
-*/
-   WARN_ON_ONCE(pud_bad(*pud));
-   if (pud_none(*pud) || pud_bad(*pud))
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return NULL;
+   if (pud_leaf(*pud))
+   return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pud_bad(*pud)))
return NULL;
+
pmd = pmd_offset(pud, addr);
-   WARN_ON_ONCE(pmd_bad(*pmd));
-   if (pmd_none(*pmd) || pmd_bad(*pmd))
+   if (pmd_none(*pmd))
+   return NULL;
+   if (pmd_leaf(*pmd))
+   return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+   if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
 
ptep = pte_offset_map(pmd, addr);
@@ -389,6 +399,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (pte_present(pte))
page = pte_page(pte);
pte_unmap(ptep);
+
return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.23.0



[PATCH v6 00/12] huge vmalloc mappings

2020-08-21 Thread Nicholas Piggin
Thanks Christophe and Christoph for reviews.

Thanks,
Nick

Since v5:
- Split arch changes out better and make the constant folding work
- Avoid most of the 80 column wrap, fix a reference to lib/ioremap.c
- Fix compile error on some archs

Since v4:
- Fixed an off-by-page-order bug in v4
- Several minor cleanups.
- Added page order to /proc/vmallocinfo
- Added hugepage to alloc_large_system_hash output.
- Made an architecture config option, powerpc only for now.

Since v3:
- Fixed an off-by-one bug in a loop
- Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail
- Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan
  Cameron for debugging the cause of this (hopefully).

Since v2:
- Rebased on vmalloc cleanups, split series into simpler pieces.
- Fixed several compile errors and warnings
- Keep the page array and accounting in small page units because
  struct vm_struct is an interface (this should fix x86 vmap stack debug
  assert). [Thanks Zefan]

Nicholas Piggin (12):
  mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
  mm: apply_to_pte_range warn and fail if a large pte is encountered
  mm/vmalloc: rename vmap_*_range vmap_pages_*_range
  lib/ioremap: rename ioremap_*_range to vmap_*_range
  mm: HUGE_VMAP arch support cleanup
  powerpc: inline huge vmap supported functions
  arm64: inline huge vmap supported functions
  x86: inline huge vmap supported functions
  mm: Move vmap_range from mm/ioremap.c to mm/vmalloc.c
  mm/vmalloc: add vmap_range_noflush variant
  mm/vmalloc: Hugepage vmalloc mappings
  powerpc/64s/radix: Enable huge vmalloc mappings

 .../admin-guide/kernel-parameters.txt |   2 +
 arch/Kconfig  |   4 +
 arch/arm64/include/asm/vmalloc.h  |  25 +
 arch/arm64/mm/mmu.c   |  26 -
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/vmalloc.h|  21 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  21 -
 arch/x86/include/asm/vmalloc.h|  23 +
 arch/x86/mm/ioremap.c |  19 -
 arch/x86/mm/pgtable.c |  13 -
 include/linux/io.h|   9 -
 include/linux/vmalloc.h   |  10 +
 init/main.c   |   1 -
 mm/ioremap.c  | 225 +
 mm/memory.c   |  60 ++-
 mm/page_alloc.c   |   5 +-
 mm/vmalloc.c  | 443 +++---
 17 files changed, 515 insertions(+), 393 deletions(-)

-- 
2.23.0



Re: [PATCH v5 0/8] huge vmalloc mappings

2020-08-21 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 21, 2020 3:47 pm:
> 
> 
> Le 21/08/2020 à 06:44, Nicholas Piggin a écrit :
>> I made this powerpc-only for the time being. It shouldn't be too hard to
>> add support for other archs that define HUGE_VMAP. I have booted x86
>> with it enabled, just may not have audited everything.
> 
> I like this series, but if I understand correctly it enables huge 
> vmalloc mappings only for hugepages sizes matching a page directory 
> levels, ie on a PPC32 it would work only for 4M hugepages.

Yeah it really just uses the HUGE_VMAP mapping which is already there.

> On the 8xx, we only have 8M and 512k hugepages. Any change that it can 
> support these as well one day ?

The vmap_range interface can now handle that, then adding support is the
main work. Either make it weak and arch can override it, or add some
arch helpers to make the generic version create the huge pages if it's
not too ugly. Then you get those large pages for ioremap for free.

The vmalloc part to allocate and try to map a bigger page size and use 
it is quite trivial to change from PMD to an arch specific size.
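
(Purely hypothetical sketch of the kind of arch helper meant here, not part
of this series, letting an arch advertise a sub-PMD mapping size such as
512k on the 8xx:)

	/* hypothetical hook, default provided by generic code */
	#ifndef arch_vmap_pte_supported_shift
	static inline int arch_vmap_pte_supported_shift(unsigned long size)
	{
		return PAGE_SHIFT;	/* no sub-PMD large mappings by default */
	}
	#endif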

Thanks,
Nick



Re: [PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup

2020-08-21 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of August 21, 2020 3:40 pm:
> 
> 
> Le 21/08/2020 à 06:44, Nicholas Piggin a écrit :
>> This changes the awkward approach where architectures provide init
>> functions to determine which levels they can provide large mappings for,
>> to one where the arch is queried for each call.
>> 
>> This removes code and indirection, and allows constant-folding of dead
>> code for unsupported levels.
> 
> I think that in order to allow constant-folding of dead code for 
> unsupported levels, you must define arch_vmap_xxx_supported() as static 
> inline in a .h
> 
> If you have them in .c files, you'll get calls to tiny functions that 
> will always return false, but will still be called and dead code won't 
> be eliminated. And performance wise, that's probably not optimal either.

Yeah that's true actually, I think I didn't find a good place to add
the prototypes in the arch code but I'll have another look and either
rewrite the changelog or remove it. Although this does get a step closer
at least.
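
(Illustration of the constant-folding point: once the helper is a static
inline in the header, the compiler can delete the dead branch at the call
site, e.g. for the p4d level which no arch in the series supports:)

	/* in <asm/vmalloc.h> */
	static inline bool arch_vmap_p4d_supported(pgprot_t prot)
	{
		return false;
	}

	/* call site in mm/: the huge-p4d path folds away entirely */
	if (!arch_vmap_p4d_supported(prot))
		return 0;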

Thanks,
Nick


Re: [PATCH 2/2] powerpc/64s: Disallow PROT_SAO in LPARs by default

2020-08-21 Thread Nicholas Piggin
Excerpts from Shawn Anastasio's message of August 21, 2020 11:08 am:
> Since migration of guests using SAO to ISA 3.1 hosts may cause issues,
> disable PROT_SAO in LPARs by default and introduce a new Kconfig option
> PPC_PROT_SAO_LPAR to allow users to enable it if desired.

I think this should be okay. Could you also update the selftest to skip
if we have PPC_FEATURE2_ARCH_3_1 set?
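
(Something like the following in the prot_sao selftest, assuming the usual
powerpc selftest helpers; a sketch, not tested:)

	SKIP_IF(have_hwcap2(PPC_FEATURE2_ARCH_3_1));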

Thanks,
Nick

Acked-by: Nicholas Piggin 

> 
> Signed-off-by: Shawn Anastasio 
> ---
>  arch/powerpc/Kconfig| 12 
>  arch/powerpc/include/asm/mman.h |  9 +++--
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 1f48bbfb3ce9..65bed1fdeaad 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -860,6 +860,18 @@ config PPC_SUBPAGE_PROT
>  
> If unsure, say N here.
>  
> +config PPC_PROT_SAO_LPAR
> + bool "Support PROT_SAO mappings in LPARs"
> + depends on PPC_BOOK3S_64
> + help
> +   This option adds support for PROT_SAO mappings from userspace
> +   inside LPARs on supported CPUs.
> +
> +   This may cause issues when performing guest migration from
> +   a CPU that supports SAO to one that does not.
> +
> +   If unsure, say N here.
> +
>  config PPC_COPRO_BASE
>   bool
>  
> diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
> index 4ba303ea27f5..7cb6d18f5cd6 100644
> --- a/arch/powerpc/include/asm/mman.h
> +++ b/arch/powerpc/include/asm/mman.h
> @@ -40,8 +40,13 @@ static inline bool arch_validate_prot(unsigned long prot, 
> unsigned long addr)
>  {
>   if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
>   return false;
> - if ((prot & PROT_SAO) && !cpu_has_feature(CPU_FTR_SAO))
> - return false;
> + if (prot & PROT_SAO) {
> + if (!cpu_has_feature(CPU_FTR_SAO))
> + return false;
> + if (firmware_has_feature(FW_FEATURE_LPAR) &&
> + !IS_ENABLED(CONFIG_PPC_PROT_SAO_LPAR))
> + return false;
> + }
>   return true;
>  }
>  #define arch_validate_prot arch_validate_prot
> -- 
> 2.28.0
> 
> 


[PATCH v5 8/8] mm/vmalloc: Hugepage vmalloc mappings

2020-08-20 Thread Nicholas Piggin
On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps,
vmalloc will attempt to allocate PMD-sized pages first, before falling
back to small pages.

Allocations which use something other than PAGE_KERNEL protections are
not permitted to use huge pages yet, as not all callers expect this (e.g.,
module allocations vs strict module rwx).

This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for a
given allocation, an option nohugevmalloc is added to disable at boot.

Signed-off-by: Nicholas Piggin 
---
 .../admin-guide/kernel-parameters.txt |   2 +
 arch/Kconfig  |   4 +
 arch/powerpc/Kconfig  |   1 +
 include/linux/vmalloc.h   |   1 +
 mm/page_alloc.c   |   4 +-
 mm/vmalloc.c  | 188 +-
 6 files changed, 152 insertions(+), 48 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdc1f33fd3d1..6f0b41289a90 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,8 @@
 
nohugeiomap [KNL,X86,PPC] Disable kernel huge I/O mappings.
 
+   nohugevmalloc   [PPC] Disable kernel huge vmalloc mappings.
+
nosmt   [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
 
diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..b2b89d629317 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -616,6 +616,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 config HAVE_ARCH_HUGE_VMAP
bool
 
+config HAVE_ARCH_HUGE_VMALLOC
+   depends on HAVE_ARCH_HUGE_VMAP
+   bool
+
 config ARCH_WANT_HUGE_PMD_SHARE
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 95dfd8ef3d4b..044e5a94967a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -175,6 +175,7 @@ config PPC
select GENERIC_TIME_VSYSCALL
select HAVE_ARCH_AUDITSYSCALL
	select HAVE_ARCH_HUGE_VMAP		if PPC_BOOK3S_64 && PPC_RADIX_MMU
+   select HAVE_ARCH_HUGE_VMALLOC   if HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KASAN  if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KASAN_VMALLOC  if PPC32 && PPC_PAGE_SHIFT <= 14
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e3590e93bfff..8f25dbaca0a1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
unsigned long   size;
unsigned long   flags;
struct page **pages;
+   unsigned intpage_order;
unsigned intnr_pages;
phys_addr_t phys_addr;
const void  *caller;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bab486fea..d785e5335529 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8102,6 +8102,7 @@ void *__init alloc_large_system_hash(const char *tablename,
void *table = NULL;
gfp_t gfp_flags;
bool virt;
+   bool huge;
 
/* allow the kernel cmdline to have a say */
if (!numentries) {
@@ -8169,6 +8170,7 @@ void *__init alloc_large_system_hash(const char *tablename,
} else if (get_order(size) >= MAX_ORDER || hashdist) {
table = __vmalloc(size, gfp_flags);
virt = true;
+   huge = (find_vm_area(table)->page_order > 0);
} else {
/*
 * If bucketsize is not a power-of-two, we may free
@@ -8185,7 +8187,7 @@ void *__init alloc_large_system_hash(const char *tablename,
 
pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n",
tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size,
-   virt ? "vmalloc" : "linear");
+   virt ? (huge ? "vmalloc hugepage" : "vmalloc") : "linear");
 
if (_hash_shift)
*_hash_shift = log2qty;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4e5cb7c7f780..564d7497e551 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -45,6 +45,19 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmalloc(char *str)
+{
+   vmap_allow_huge = false;
+   return 0;
+}
+early_param("nohugevmalloc", set_nohugevmalloc);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */
+static const bool vmap_allow_huge = false;
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */

[PATCH v5 7/8] mm/vmalloc: add vmap_range_noflush variant

2020-08-20 Thread Nicholas Piggin
As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches the
other callers in this file.
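As a hypothetical sketch (not in this patch), a helper inside mm/vmalloc.c could
map several physical chunks with the _noflush variant and issue a single cache
flush over the whole span afterwards; this is the kind of batching the split
enables. The helper name and its use case are illustrative only.

/* Hypothetical helper, assuming it lives in mm/vmalloc.c where
 * vmap_range_noflush() is visible. */
static int vmap_phys_chunks(unsigned long addr, const phys_addr_t *chunks,
		unsigned int nr, unsigned long chunk_size,
		pgprot_t prot, unsigned int max_page_shift)
{
	unsigned long start = addr;
	unsigned int i;
	int err;

	for (i = 0; i < nr; i++) {
		/* Map each chunk without flushing yet. */
		err = vmap_range_noflush(addr, addr + chunk_size, chunks[i],
				prot, max_page_shift);
		if (err)
			return err;
		addr += chunk_size;
	}

	/* One flush covering everything that was just mapped. */
	flush_cache_vmap(start, addr);
	return 0;
}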

Signed-off-by: Nicholas Piggin 
---
 mm/vmalloc.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 129f10545bb1..4e5cb7c7f780 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -234,8 +234,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
return 0;
 }
 
-int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-   unsigned int max_page_shift)
+static int vmap_range_noflush(unsigned long addr, unsigned long end, phys_addr_t phys_addr,
+   pgprot_t prot, unsigned int max_page_shift)
 {
pgd_t *pgd;
unsigned long start;
@@ -255,14 +255,23 @@ int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgp
break;
} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
-   flush_cache_vmap(start, end);
-
if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
arch_sync_kernel_mappings(start, end);
 
return err;
 }
 
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift)
+{
+   int err;
+
+   err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+   flush_cache_vmap(addr, end);
+
+   return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 pgtbl_mod_mask *mask)
 {
-- 
2.23.0



[PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c

2020-08-20 Thread Nicholas Piggin
This is a generic kernel virtual memory mapper, not specific to ioremap.
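As an illustrative sketch (not part of the move), once vmap_range() is declared
in linux/vmalloc.h as below, an ioremap-style in-tree caller could map a
physical range into a freshly reserved kernel virtual area like this; the helper
name is hypothetical.

#include <linux/vmalloc.h>
#include <linux/pgtable.h>

/* Map a physical range into a reserved vmalloc-space area. Passing
 * PAGE_SHIFT as max_page_shift restricts the mapping to small pages. */
static void *map_phys_range(phys_addr_t phys, unsigned long size)
{
	struct vm_struct *area;
	unsigned long va;

	area = get_vm_area(size, VM_IOREMAP);
	if (!area)
		return NULL;
	va = (unsigned long)area->addr;

	if (vmap_range(va, va + size, phys,
		       pgprot_noncached(PAGE_KERNEL), PAGE_SHIFT)) {
		free_vm_area(area);
		return NULL;
	}
	return area->addr;
}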

Signed-off-by: Nicholas Piggin 
---
 include/linux/vmalloc.h |   2 +
 mm/ioremap.c| 192 
 mm/vmalloc.c| 191 +++
 3 files changed, 193 insertions(+), 192 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 787d77ad7536..e3590e93bfff 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 #ifdef CONFIG_MMU
+extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+   unsigned int max_page_shift);
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index b0032dbadaf7..cdda0e022740 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,198 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
 static const bool iomap_allow_huge = false;
 #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
-{
-   pte_t *pte;
-   u64 pfn;
-
-   pfn = phys_addr >> PAGE_SHIFT;
-   pte = pte_alloc_kernel_track(pmd, addr, mask);
-   if (!pte)
-   return -ENOMEM;
-   do {
-   BUG_ON(!pte_none(*pte));
-   set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-   pfn++;
-   } while (pte++, addr += PAGE_SIZE, addr != end);
-   *mask |= PGTBL_PTE_MODIFIED;
-   return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
-   if (max_page_shift < PMD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pmd_supported(prot))
-   return 0;
-
-   if ((end - addr) != PMD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PMD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PMD_SIZE))
-   return 0;
-
-   if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
-   return 0;
-
-   return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
-   pgtbl_mod_mask *mask)
-{
-   pmd_t *pmd;
-   unsigned long next;
-
-   pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
-   if (!pmd)
-   return -ENOMEM;
-   do {
-   next = pmd_addr_end(addr, end);
-
-   if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
-   *mask |= PGTBL_PMD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
-   return -ENOMEM;
-   } while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
-   return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
-   if (max_page_shift < PUD_SHIFT)
-   return 0;
-
-   if (!arch_vmap_pud_supported(prot))
-   return 0;
-
-   if ((end - addr) != PUD_SIZE)
-   return 0;
-
-   if (!IS_ALIGNED(addr, PUD_SIZE))
-   return 0;
-
-   if (!IS_ALIGNED(phys_addr, PUD_SIZE))
-   return 0;
-
-   if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
-   return 0;
-
-   return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-   phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
-   pgtbl_mod_mask *mask)
-{
-   pud_t *pud;
-   unsigned long next;
-
-   pud = pud_alloc_track(&init_mm, p4d, addr, mask);
-   if (!pud)
-   return -ENOMEM;
-   do {
-   next = pud_addr_end(addr, end);
-
-   if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
-   *mask |= PGTBL_PUD_MODIFIED;
-   continue;
-   }
-
-   if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
-   return -ENOMEM;
-   } while (pud++, phys_addr += (next - addr), addr = next, addr != end);