Re: [RFC PATCH] powerpc: show registers when unwinding interrupt frames

2020-12-01 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 07/11/2020 à 03:33, Nicholas Piggin a écrit :
>> It's often useful to know the register state for interrupts in
>> the stack frame. In the below example (with this patch applied),
>> the important information is the state of the page fault.
>> 
>> A blatant case like this should probably have the page fault regs
>> passed down to the warning instead, but quite often there are less
>> obvious cases where an interrupt shows up that might give some more
>> clues.
>> 
>> The downside is longer and more complex bug output.
>
> Do we want all interrupts, including system call ?

I think we do.

> I don't find the dump of the syscall interrupt so useful, do you?

Yes :)

Because it's consistent, ie. we always show the full chain back to
userspace.

I think it's also helpful for folks who are less familiar with how
things work to show all the pieces, rather than hiding syscalls or
treating them specially.

Also I'm pretty sure I've had occasions where I've been debugging and
wanted to see the values that came in from userspace.

cheers


> See below an (unexpected?) KUAP warning due to an expected NULL pointer 
> dereference in 
> copy_from_kernel_nofault() called from kthread_probe_data()
>
>
> [ 1117.202054] [ cut here ]
> [ 1117.202102] Bug: fault blocked by AP register !
> [ 1117.202261] WARNING: CPU: 0 PID: 377 at 
> arch/powerpc/include/asm/nohash/32/kup-8xx.h:66 
> do_page_fault+0x4a8/0x5ec
> [ 1117.202310] Modules linked in:
> [ 1117.202428] CPU: 0 PID: 377 Comm: sh Tainted: GW 
> 5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty #4175
> [ 1117.202499] NIP:  c0012048 LR: c0012048 CTR: 
> [ 1117.202573] REGS: cacdbb88 TRAP: 0700   Tainted: GW 
> (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
> [ 1117.202625] MSR:  00021032   CR: 2408  XER: 2000
> [ 1117.202899]
> [ 1117.202899] GPR00: c0012048 cacdbc40 c2929290 0023 c092e554 0001 
> c09865e8 c092e640
> [ 1117.202899] GPR08: 1032   00014efc 28082224 100d166a 
> 100a0920 
> [ 1117.202899] GPR16: 100cac0c 100b 1080c3fc 1080d685 100d 100d 
>  100a0900
> [ 1117.202899] GPR24: 100d c07892ec  c0921510 c21f4440 005c 
> c000 cacdbc80
> [ 1117.204362] NIP [c0012048] do_page_fault+0x4a8/0x5ec
> [ 1117.204461] LR [c0012048] do_page_fault+0x4a8/0x5ec
> [ 1117.204509] Call Trace:
> [ 1117.204609] [cacdbc40] [c0012048] do_page_fault+0x4a8/0x5ec (unreliable)
> [ 1117.204771] [cacdbc70] [c00112f0] handle_page_fault+0x8/0x34
> [ 1117.204911] --- interrupt: 301 at copy_from_kernel_nofault+0x70/0x1c0
> [ 1117.204979] NIP:  c010dbec LR: c010dbac CTR: 0001
> [ 1117.205053] REGS: cacdbc80 TRAP: 0301   Tainted: GW 
> (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
> [ 1117.205104] MSR:  9032   CR: 28082224  XER: 
> [ 1117.205416] DAR: 005c DSISR: c000
> [ 1117.205416] GPR00: c0045948 cacdbd38 c2929290 0001 0017 0017 
> 0027 000f
> [ 1117.205416] GPR08: c09926ec   3000 24082224
> [ 1117.206106] NIP [c010dbec] copy_from_kernel_nofault+0x70/0x1c0
> [ 1117.206202] LR [c010dbac] copy_from_kernel_nofault+0x30/0x1c0
> [ 1117.206258] --- interrupt: 301
> [ 1117.206372] [cacdbd38] [c004bbb0] kthread_probe_data+0x44/0x70 (unreliable)
> [ 1117.206561] [cacdbd58] [c0045948] print_worker_info+0xe0/0x194
> [ 1117.206717] [cacdbdb8] [c00548ac] sched_show_task+0x134/0x168
> [ 1117.206851] [cacdbdd8] [c005a268] show_state_filter+0x70/0x100
> [ 1117.206989] [cacdbe08] [c039baa0] sysrq_handle_showstate+0x14/0x24
> [ 1117.207122] [cacdbe18] [c039bf18] __handle_sysrq+0xac/0x1d0
> [ 1117.207257] [cacdbe48] [c039c0c0] write_sysrq_trigger+0x4c/0x74
> [ 1117.207407] [cacdbe68] [c01fba48] proc_reg_write+0xb4/0x114
> [ 1117.207550] [cacdbe88] [c0179968] vfs_write+0x12c/0x478
> [ 1117.207686] [cacdbf08] [c0179e60] ksys_write+0x78/0x128
> [ 1117.207826] [cacdbf38] [c00110d0] ret_from_syscall+0x0/0x34
> [ 1117.207938] --- interrupt: c01 at 0xfd4e784
> [ 1117.208008] NIP:  0fd4e784 LR: 0fe0f244 CTR: 10048d38
> [ 1117.208083] REGS: cacdbf48 TRAP: 0c01   Tainted: GW 
> (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
> [ 1117.208134] MSR:  d032   CR: 4400  XER: 
> [ 1117.208470]
> [ 1117.208470] GPR00: 0004 7fc34090 77bfb4e0 0001 1080fa40 0002 
> 740f fefefeff
> [ 1117.208470] GPR08: 7f7f7f7f 10048d38 1080c414 7fc343c0 
> [ 1117.209104] NIP [0fd4e784] 0xfd4e784
> [ 1117.209180] LR [0fe0f244] 0xfe0f244
> [ 1117.209236] --- interrupt: c01
> [ 1117.209274] Instruction dump:
> [ 1117.209353] 714a4000 418200f0 73ca0001 40820084 73ca0032 408200f8 73c90040 
> 4082ff60
> [ 1117.209727] 0fe0 3c60c082 386399f4 48013b65 <0fe0> 80010034 
> 386b 7c0803a6
> [ 1117.210102] ---[ end trace 1927c0323393af3e ]---
>
> Christophe
>
>
>> 
>>Bug: Write fault blocked by AMR!
>>WARNING: CPU: 0 PID: 72 at 
>> 

[powerpc:next-test] BUILD SUCCESS 72e886545963b33dd5e1d92ee9c77dadb51adc4e

2020-12-01 Thread kernel test robot
 allyesconfig
parisc   allyesconfig
i386 allyesconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a004-20201201
i386 randconfig-a005-20201201
i386 randconfig-a001-20201201
i386 randconfig-a002-20201201
i386 randconfig-a006-20201201
i386 randconfig-a003-20201201
x86_64   randconfig-a016-20201201
x86_64   randconfig-a012-20201201
x86_64   randconfig-a014-20201201
x86_64   randconfig-a013-20201201
x86_64   randconfig-a015-20201201
x86_64   randconfig-a011-20201201
i386 randconfig-a014-20201201
i386 randconfig-a013-20201201
i386 randconfig-a011-20201201
i386 randconfig-a015-20201201
i386 randconfig-a012-20201201
i386 randconfig-a016-20201201
riscvnommu_k210_defconfig
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20201201
x86_64   randconfig-a006-20201201
x86_64   randconfig-a001-20201201
x86_64   randconfig-a002-20201201
x86_64   randconfig-a005-20201201
x86_64   randconfig-a003-20201201

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:fixes-test] BUILD SUCCESS f54db39fbe40731c40aefdd3bc26e7d56d668c64

2020-12-01 Thread kernel test robot
 allnoconfig
c6x  allyesconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
parisc   allyesconfig
i386 allyesconfig
sparcallyesconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a004-20201201
i386 randconfig-a005-20201201
i386 randconfig-a001-20201201
i386 randconfig-a002-20201201
i386 randconfig-a006-20201201
i386 randconfig-a003-20201201
x86_64   randconfig-a016-20201201
x86_64   randconfig-a012-20201201
x86_64   randconfig-a014-20201201
x86_64   randconfig-a013-20201201
x86_64   randconfig-a015-20201201
x86_64   randconfig-a011-20201201
i386 randconfig-a014-20201201
i386 randconfig-a013-20201201
i386 randconfig-a011-20201201
i386 randconfig-a015-20201201
i386 randconfig-a012-20201201
i386 randconfig-a016-20201201
riscvallyesconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
riscvallmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

clang tested configs:
x86_64   randconfig-a004-20201201
x86_64   randconfig-a006-20201201
x86_64   randconfig-a001-20201201
x86_64   randconfig-a002-20201201
x86_64   randconfig-a005-20201201
x86_64   randconfig-a003-20201201

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[PATCH v2 3/4] powerpc: Reintroduce is_kvm_guest in a new avatar

2020-12-01 Thread Srikar Dronamraju
Introduce a static branch that would be set during boot if the OS
happens to be a KVM guest. Subsequent checks to see if we are on KVM
will rely on this static branch. This static branch would be used in
vcpu_is_preempted in a subsequent patch.
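
For readers less familiar with static keys, a minimal sketch of the
pattern used below (the names match the patch; running_on_kvm() is a
hypothetical stand-in for the device-tree probe):

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(kvm_guest);

static inline bool is_kvm_guest(void)
{
	/* Compiles to a nop on the common (non-KVM) path; patched into
	 * a branch at runtime only if the key was enabled during boot. */
	return static_branch_unlikely(&kvm_guest);
}

static int __init probe_hypervisor(void)
{
	if (running_on_kvm())		/* hypothetical probe */
		static_branch_enable(&kvm_guest);
	return 0;
}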

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/kvm_guest.h | 10 ++
 arch/powerpc/include/asm/kvm_para.h  |  2 +-
 arch/powerpc/kernel/firmware.c   |  2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h
index ba8291e02ba9..627ba272e781 100644
--- a/arch/powerpc/include/asm/kvm_guest.h
+++ b/arch/powerpc/include/asm/kvm_guest.h
@@ -7,8 +7,18 @@
 #define __POWERPC_KVM_GUEST_H__
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
+#include <linux/jump_label.h>
+
+DECLARE_STATIC_KEY_FALSE(kvm_guest);
+
+static inline bool is_kvm_guest(void)
+{
+   return static_branch_unlikely(&kvm_guest);
+}
+
 bool check_kvm_guest(void);
 #else
+static inline bool is_kvm_guest(void) { return false; }
 static inline bool check_kvm_guest(void) { return false; }
 #endif
 
diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h
index 6fba06b6cfdb..abe1b5e82547 100644
--- a/arch/powerpc/include/asm/kvm_para.h
+++ b/arch/powerpc/include/asm/kvm_para.h
@@ -14,7 +14,7 @@
 
 static inline int kvm_para_available(void)
 {
-   return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest();
+   return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest();
 }
 
 static inline unsigned int kvm_arch_para_features(void)
diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index 0aeb6a5b1a9e..28498fc573f2 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -22,6 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features);
 #endif
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
+DEFINE_STATIC_KEY_FALSE(kvm_guest);
 bool check_kvm_guest(void)
 {
struct device_node *hyper_node;
@@ -33,6 +34,7 @@ bool check_kvm_guest(void)
if (!of_device_is_compatible(hyper_node, "linux,kvm"))
return 0;
 
+   static_branch_enable(&kvm_guest);
return 1;
 }
 #endif
-- 
2.18.4



[PATCH v2 1/4] powerpc: Refactor is_kvm_guest declaration to new header

2020-12-01 Thread Srikar Dronamraju
Only code/declaration movement, in anticipation of doing a kvm-aware
vcpu_is_preempted. No additional changes.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
Changelog:
v1->v2:
v1: 
https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u
 - Moved a hunk to fix a no previous prototype warning reported by: 
l...@intel.com
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/

 arch/powerpc/include/asm/firmware.h  |  6 --
 arch/powerpc/include/asm/kvm_guest.h | 15 +++
 arch/powerpc/include/asm/kvm_para.h  |  2 +-
 arch/powerpc/kernel/firmware.c   |  1 +
 arch/powerpc/platforms/pseries/smp.c |  1 +
 5 files changed, 18 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_guest.h

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 0b295bdb201e..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -134,12 +134,6 @@ extern int ibm_nmi_interlock_token;
 
 extern unsigned int __start___fw_ftr_fixup, __stop___fw_ftr_fixup;
 
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
-bool is_kvm_guest(void);
-#else
-static inline bool is_kvm_guest(void) { return false; }
-#endif
-
 #ifdef CONFIG_PPC_PSERIES
 void pseries_probe_fw_features(void);
 #else
diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h
new file mode 100644
index ..c0ace884a0e8
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_guest.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020  IBM Corporation
+ */
+
+#ifndef __POWERPC_KVM_GUEST_H__
+#define __POWERPC_KVM_GUEST_H__
+
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
+bool is_kvm_guest(void);
+#else
+static inline bool is_kvm_guest(void) { return false; }
+#endif
+
+#endif /* __POWERPC_KVM_GUEST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h
index 744612054c94..abe1b5e82547 100644
--- a/arch/powerpc/include/asm/kvm_para.h
+++ b/arch/powerpc/include/asm/kvm_para.h
@@ -8,7 +8,7 @@
 #ifndef __POWERPC_KVM_PARA_H__
 #define __POWERPC_KVM_PARA_H__
 
-#include 
+#include <asm/kvm_guest.h>
 
 #include 
 
diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index fe48d319d490..5f48e5ad24cd 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -14,6 +14,7 @@
 #include 
 
 #include 
+#include <asm/kvm_guest.h>
 
 #ifdef CONFIG_PPC64
 unsigned long powerpc_firmware_features __read_mostly;
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 92922491a81c..d578732c545d 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include <asm/kvm_guest.h>
 
 #include "pseries.h"
 
-- 
2.18.4



[PATCH v2 2/4] powerpc: Rename is_kvm_guest to check_kvm_guest

2020-12-01 Thread Srikar Dronamraju
is_kvm_guest() will be reused in a subsequent patch in a new avatar. Hence
rename is_kvm_guest() to check_kvm_guest(). No additional changes.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/kvm_guest.h | 4 ++--
 arch/powerpc/include/asm/kvm_para.h  | 2 +-
 arch/powerpc/kernel/firmware.c   | 2 +-
 arch/powerpc/platforms/pseries/smp.c | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h
index c0ace884a0e8..ba8291e02ba9 100644
--- a/arch/powerpc/include/asm/kvm_guest.h
+++ b/arch/powerpc/include/asm/kvm_guest.h
@@ -7,9 +7,9 @@
 #define __POWERPC_KVM_GUEST_H__
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
-bool is_kvm_guest(void);
+bool check_kvm_guest(void);
 #else
-static inline bool is_kvm_guest(void) { return false; }
+static inline bool check_kvm_guest(void) { return false; }
 #endif
 
 #endif /* __POWERPC_KVM_GUEST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h
index abe1b5e82547..6fba06b6cfdb 100644
--- a/arch/powerpc/include/asm/kvm_para.h
+++ b/arch/powerpc/include/asm/kvm_para.h
@@ -14,7 +14,7 @@
 
 static inline int kvm_para_available(void)
 {
-   return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest();
+   return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest();
 }
 
 static inline unsigned int kvm_arch_para_features(void)
diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index 5f48e5ad24cd..0aeb6a5b1a9e 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -22,7 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features);
 #endif
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
-bool is_kvm_guest(void)
+bool check_kvm_guest(void)
 {
struct device_node *hyper_node;
 
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index d578732c545d..c70b4be9f0a5 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -211,7 +211,7 @@ static __init void pSeries_smp_probe(void)
if (!cpu_has_feature(CPU_FTR_SMT))
return;
 
-   if (is_kvm_guest()) {
+   if (check_kvm_guest()) {
/*
 * KVM emulates doorbells by disabling FSCR[MSGP] so msgsndp
 * faults to the hypervisor which then reads the instruction
-- 
2.18.4



[PATCH v2 4/4] powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted

2020-12-01 Thread Srikar Dronamraju
If it's a shared LPAR but not a KVM guest, then see if the vCPU is
related to the calling vCPU. On PowerVM, only cores can be preempted.
So if one vCPU is in a non-preempted state, we can infer that all other
vCPUs sharing the same core are also non-preempted.
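
As an illustrative example of the core check added below, assume an SMT8
core where CPUs 0-7 are threads of one core and the caller runs on CPU 5:

	cpu_first_thread_sibling(5);	/* == 0 (the caller's core) */
	cpu_first_thread_sibling(2);	/* == 0: same core as the caller,
					 * so CPU 2 cannot be preempted
					 * while we are running */
	cpu_first_thread_sibling(9);	/* == 8: different core; fall
					 * back to the yield_count check */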

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/paravirt.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
index 9362c94fe3aa..edc08f04aef7 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -10,6 +10,9 @@
 #endif
 
 #ifdef CONFIG_PPC_SPLPAR
+#include <asm/cputhreads.h>
+#include <asm/kvm_guest.h>
+
 DECLARE_STATIC_KEY_FALSE(shared_processor);
 
 static inline bool is_shared_processor(void)
@@ -74,6 +77,21 @@ static inline bool vcpu_is_preempted(int cpu)
 {
if (!is_shared_processor())
return false;
+
+#ifdef CONFIG_PPC_SPLPAR
+   if (!is_kvm_guest()) {
+   int first_cpu = cpu_first_thread_sibling(smp_processor_id());
+
+   /*
+* Preemption can only happen at core granularity. This CPU
+* is not preempted if one of the CPU of this core is not
+* preempted.
+*/
+   if (cpu_first_thread_sibling(cpu) == first_cpu)
+   return false;
+   }
+#endif
+
if (yield_count_of(cpu) & 1)
return true;
return false;
-- 
2.18.4



[PATCH v2 0/4] Powerpc: Better preemption for shared processor

2020-12-01 Thread Srikar Dronamraju
Currently, vcpu_is_preempted() relies only on the yield count for
shared processors. On a PowerVM LPAR, PHYP schedules at the SMT8 core
boundary, i.e. all CPUs belonging to a core are either group scheduled
in or group scheduled out. This can be used to better predict
non-preempted CPUs on PowerVM shared LPARs.
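
For reference, the existing test this series builds on is a parity check
on the vCPU's yield count (the hypervisor bumps the count at each
dispatch boundary, so an odd value means the vCPU is currently scheduled
out):

	/* existing check in vcpu_is_preempted(), shown for reference */
	if (yield_count_of(cpu) & 1)
		return true;	/* odd: currently preempted */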

perf stat -r 5 -a perf bench sched pipe -l 1000 (less time is better)

powerpc/next
 35,107,951.20 msec cpu-clock                 #  255.898 CPUs utilized            ( +-  0.31% )
    23,655,348      context-switches          #    0.674 K/sec                    ( +-  3.72% )
        14,465      cpu-migrations            #    0.000 K/sec                    ( +-  5.37% )
        82,463      page-faults               #    0.002 K/sec                    ( +-  8.40% )
 1,127,182,328,206  cycles                    #    0.032 GHz                      ( +-  1.60% )  (66.67%)
    78,587,300,622  stalled-cycles-frontend   #    6.97% frontend cycles idle     ( +-  0.08% )  (50.01%)
   654,124,218,432  stalled-cycles-backend    #   58.03% backend cycles idle      ( +-  1.74% )  (50.01%)
   834,013,059,242  instructions              #    0.74  insn per cycle
                                              #    0.78  stalled cycles per insn  ( +-  0.73% )  (66.67%)
   132,911,454,387  branches                  #    3.786 M/sec                    ( +-  0.59% )  (50.00%)
     2,890,882,143  branch-misses             #    2.18% of all branches          ( +-  0.46% )  (50.00%)

   137.195 +- 0.419 seconds time elapsed  ( +-  0.31% )

powerpc/next + patchset
 29,981,702.64 msec cpu-clock                 #  255.881 CPUs utilized            ( +-  1.30% )
    40,162,456      context-switches          #    0.001 M/sec                    ( +-  0.01% )
         1,110      cpu-migrations            #    0.000 K/sec                    ( +-  5.20% )
        62,616      page-faults               #    0.002 K/sec                    ( +-  3.93% )
 1,430,030,626,037  cycles                    #    0.048 GHz                      ( +-  1.41% )  (66.67%)
    83,202,707,288  stalled-cycles-frontend   #    5.82% frontend cycles idle     ( +-  0.75% )  (50.01%)
   744,556,088,520  stalled-cycles-backend    #   52.07% backend cycles idle      ( +-  1.39% )  (50.01%)
   940,138,418,674  instructions              #    0.66  insn per cycle
                                              #    0.79  stalled cycles per insn  ( +-  0.51% )  (66.67%)
   146,452,852,283  branches                  #    4.885 M/sec                    ( +-  0.80% )  (50.00%)
     3,237,743,996  branch-misses             #    2.21% of all branches          ( +-  1.18% )  (50.01%)

   117.17 +- 1.52 seconds time elapsed  ( +-  1.30% )

This is around a 14.6% improvement in performance.

Changelog:
v1->v2:
v1: 
https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u
 - Rebased to 27th Nov linuxppc/merge tree.
 - Moved a hunk to fix a no previous prototype warning reported by: 
l...@intel.com
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 

Srikar Dronamraju (4):
  powerpc: Refactor is_kvm_guest declaration to new header
  powerpc: Rename is_kvm_guest to check_kvm_guest
  powerpc: Reintroduce is_kvm_guest
  powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted

 arch/powerpc/include/asm/firmware.h  |  6 --
 arch/powerpc/include/asm/kvm_guest.h | 25 +
 arch/powerpc/include/asm/kvm_para.h  |  2 +-
 arch/powerpc/include/asm/paravirt.h  | 18 ++
 arch/powerpc/kernel/firmware.c   |  5 -
 arch/powerpc/platforms/pseries/smp.c |  3 ++-
 6 files changed, 50 insertions(+), 9 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_guest.h

-- 
2.18.4



[PATCH v7 updated 21/22 ] powerpc/book3s64/kup: Check max key supported before enabling kup

2020-12-01 Thread Aneesh Kumar K.V
Don't enable KUEP/KUAP if we support 3 or fewer keys.
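
For context, a sketch of the AMR layout the masks below assume (two AMR
bits per key, key 0 in the top bits, matching how pkeyshift() is used in
pkeys.c):

	/* assumed layout: 2 AMR bits per key, key 0 topmost */
	#define AMR_BITS_PER_PKEY	2
	#define pkeyshift(pkey)		(64 - ((pkey) + 1) * AMR_BITS_PER_PKEY)

	/* Key 3 (the KUAP key) thus occupies AMR bits 57:56, so
	 * default_amr &= ~(0x3ul << pkeyshift(3)) clears both bits,
	 * leaving the kernel read/write access through key 3. */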

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/kup.h   |  3 +++
 arch/powerpc/mm/book3s64/pkeys.c | 33 
 arch/powerpc/mm/init-common.c|  4 ++--
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 952be0414f43..f8ec679bd2de 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -44,6 +44,9 @@
 
 #else /* !__ASSEMBLY__ */
 
+extern bool disable_kuep;
+extern bool disable_kuap;
+
 #include 
 
 void setup_kup(void);
diff --git a/arch/powerpc/mm/book3s64/pkeys.c b/arch/powerpc/mm/book3s64/pkeys.c
index 4a3aeddbe0c7..2b7ded396db4 100644
--- a/arch/powerpc/mm/book3s64/pkeys.c
+++ b/arch/powerpc/mm/book3s64/pkeys.c
@@ -185,6 +185,27 @@ void __init pkey_early_init_devtree(void)
default_uamor &= ~(0x3ul << pkeyshift(execute_only_key));
}
 
+   if (unlikely(num_pkey <= 3)) {
+   /*
+* Insufficient number of keys to support
+* KUAP/KUEP feature.
+*/
+   disable_kuep = true;
+   disable_kuap = true;
+   WARN(1, "Disabling kernel user protection due to low (%d) max supported keys\n", num_pkey);
+   } else {
+   /* handle key which is used by kernel for KUAP */
+   reserved_allocation_mask |= (0x1 << 3);
+   /*
+* Mark access for kup_key in default amr so that
+* we continue to operate with that AMR in
+* copy_to/from_user().
+*/
+   default_amr   &= ~(0x3ul << pkeyshift(3));
+   default_iamr  &= ~(0x1ul << pkeyshift(3));
+   default_uamor &= ~(0x3ul << pkeyshift(3));
+   }
+
/*
 * Allow access for only key 0. And prevent any other modification.
 */
@@ -205,18 +226,6 @@ void __init pkey_early_init_devtree(void)
reserved_allocation_mask |= (0x1 << 1);
default_uamor &= ~(0x3ul << pkeyshift(1));
 
-   /*  handle key which is used by kernel for KAUP */
-   reserved_allocation_mask |= (0x1 << 3);
-   /*
-* Mark access for KUAP key in default amr so that
-* we continue to operate with that AMR in
-* copy_to/from_user().
-*/
-   default_amr   &= ~(0x3ul << pkeyshift(3));
-   default_iamr  &= ~(0x1ul << pkeyshift(3));
-   default_uamor &= ~(0x3ul << pkeyshift(3));
-
-
/*
 * Prevent the usage of OS reserved keys. Update UAMOR
 * for those keys. Also mark the rest of the bits in the
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 8e0d792ac296..afdebb95bcae 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -28,8 +28,8 @@ EXPORT_SYMBOL_GPL(kernstart_addr);
 unsigned long kernstart_virt_addr __ro_after_init = KERNELBASE;
 EXPORT_SYMBOL_GPL(kernstart_virt_addr);
 
-static bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
-static bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
+bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
+bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
 
 static int __init parse_nosmep(char *p)
 {
-- 
2.28.0



Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-01 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of December 1, 2020 4:31 am:
> other arch folk: there's some background here:
> 
> https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
> 
> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
>>
>> On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
>> >
>> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>> > >
>> > > On big systems, the mm refcount can become highly contended when doing
>> > > a lot of context switching with threaded applications (particularly
>> > > switching between the idle thread and an application thread).
>> > >
>> > > Abandoning lazy tlb slows switching down quite a bit in the important
>> > > user->idle->user cases, so instead implement a non-refcounted scheme
>> > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> > > any remaining lazy ones.
>> > >
>> > > Shootdown IPIs are some concern, but they have not been observed to be
>> > > a big problem with this scheme (the powerpc implementation generated
>> > > 314 additional interrupts on a 144 CPU system during a kernel compile).
>> > > There are a number of strategies that could be employed to reduce IPIs
>> > > if they turn out to be a problem for some workload.
>> >
>> > I'm still wondering whether we can do even better.
>> >
>>
>> Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
>> the TLB.  On x86, this will shoot down all lazies as long as even a
>> single pagetable was freed.  (Or at least it will if we don't have a
>> serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
>> sets tlb->freed_tables, which will trigger the IPI.)  So, on
>> architectures like x86, the shootdown approach should be free.  The
>> only way it ought to have any excess IPIs is if we have CPUs in
>> mm_cpumask() that don't need IPI to free pagetables, which could
>> happen on paravirt.
> 
> Indeed, on x86, we do this:
> 
> [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> [   11.561068]  exit_mmap+0xc8/0x1a0
> [   11.561932]  mmput+0x29/0xd0
> [   11.562688]  do_exit+0x316/0xa90
> [   11.563588]  do_group_exit+0x34/0xb0
> [   11.564476]  __x64_sys_exit_group+0xf/0x10
> [   11.565512]  do_syscall_64+0x34/0x50
> 
> and we have info->freed_tables set.
> 
> What are the architectures that have large systems like?
> 
> x86: we already zap lazies, so it should cost basically nothing to do

This is not zapping lazies, this is freeing the user page tables.

"lazy mm" is where a switch to a kernel thread takes on the
previous mm for its kernel mapping rather than switch to init_mm.
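
Roughly, the scheduler's lazy handoff looks like this (a simplified
sketch of the context_switch() logic, not exact kernel code):

	if (!next->mm) {			/* kernel thread */
		next->active_mm = prev->active_mm;
		mmgrab(prev->active_mm);	/* the refcount in question */
		enter_lazy_tlb(prev->active_mm, next);
	} else {
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
	}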

> a little loop at the end of __mmput() to make sure that no lazies are
> left.  If we care about paravirt performance, we could implement one
> of the optimizations I mentioned above to fix up the refcounts instead
> of sending an IPI to any remaining lazies.

It might be possible with x86's scheme to scan mm_cpumask, carefully
synchronized or something, when the last user reference gets dropped,
and free the lazy at that point. But I don't know what that would buy
you, because you're still having to maintain the mm_cpumask on
switches. powerpc's characteristics are just different here, so it
makes sense for us, whereas I don't know if it would on x86.

> 
> arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> remote flushes, so any lazy mm references will still exist after
> exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> the x86 paravirt case.  Are there large enough arm64 systems that any
> of this matters?
> 
> s390x: The code has too many acronyms for me to understand it fully,
> but I think it's more or less the same situation as arm64.  How big do
> s390x systems come?
> 
> power: Ridiculously complicated, seems to vary by system and kernel config.
> 
> So, Nick, your unconditional IPI scheme is apparently a big
> improvement for power, and it should be an improvement and have low
> cost for x86.

As said, the tradeoffs are different, so I'm not so sure. It was a big 
improvement on a very big system with the powerpc mm_cpumask switching
model on a microbenchmark designed to stress this, which is about all
I can say for it.

> On arm64 and s390x it will add more IPIs on process
> exit but reduce contention on context switching depending on how lazy
> TLB works.  I suppose we could try it for all architectures without
> any further optimizations.

It will remain opt-in but certainly try it out and see. There are some
requirements as documented in the config option text.

> Or we could try one of the perhaps
> excessively clever improvements I linked above.  arm64, s390x people,
> what do you think?
> 

I'm not against improvements to the scheme. e.g., from the patch

+   /*
+* IPI overheads have not found to be expensive, but they could
+* be reduced in a number of possible ways, for example (in
+  

Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-01 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of November 29, 2020 1:54 pm:
> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>>
>> On big systems, the mm refcount can become highly contended when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>>
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>>
>> Shootdown IPIs are some concern, but they have not been observed to be
>> a big problem with this scheme (the powerpc implementation generated
>> 314 additional interrupts on a 144 CPU system during a kernel compile).
>> There are a number of strategies that could be employed to reduce IPIs
>> if they turn out to be a problem for some workload.
> 
> I'm still wondering whether we can do even better.

We probably can, for some values of better / more complex. This came up 
the last time I posted: there was a big concern about IPIs etc., but it 
just wasn't an issue at all, even when I tried to coax them to happen a bit.

The thing is they are fairly self-limiting; it's not actually all that 
frequent that you have an mm get taken for a lazy *and* move between 
CPUs. Perhaps more often with threaded apps, but in that case you're 
eating various IPI costs anyway (e.g., when moving the task to another
CPU, on TLB shootdowns, etc.).

So from last time I did measure and I did document some possible 
improvements that could be made in comments, but I decided to keep it 
simple before adding complexity to it.

> 
> The IPIs you're doing aren't really necessary -- we don't
> fundamentally need to free the pagetables immediately when all
> non-lazy users are done with them (and current kernels don't) -- what
> we need to do is to synchronize all the bookkeeping.  So, with
> adequate locking (famous last words), a couple of alternative schemes
> ought to be possible.

It's not freeing the page tables; those are already freed by this point, 
I think (at least on powerpc they are). It's releasing the lazy mm.

> 
> a) Instead of sending an IPI, increment mm_count on behalf of the
> remote CPU and do something to make sure that the remote CPU knows we
> did this on its behalf.  Then free the mm when mm_count hits zero.
> 
> b) Treat mm_cpumask as part of the refcount.  Add one to mm_count when
> an mm is created.  Once mm_users hits zero, whoever clears the last
> bit in mm_cpumask is responsible for decrementing a single reference
> from mm_count, and whoever sets it to zero frees the mm.

Right, these were some possible avenues to explore. The thing is, it's 
more complexity and more synchronisation cost, and in the fast (context 
switch) path too. The IPI actually avoids all fast path work, atomic
or not.

> Version (b) seems fairly straightforward to implement -- add RCU
> protection and an atomic_t special_ref_cleared (initially 0) to struct
> mm_struct itself.  After anyone clears a bit in mm_cpumask (which is
> already a barrier), they read mm_users.  If it's zero, then they scan
> mm_cpumask and see if it's empty.  If it is, they atomically swap
> special_ref_cleared to 1.  If it was zero before the swap, they do
> mmdrop().  I can imagine some tweaks that could make this a bit
> faster, at least in the limit of a huge number of CPUs.
> 
> Version (a) seems a bit harder to reason about.  Maybe it could be
> done like this.  Add a percpu variable mm_with_extra_count.  This
> variable can be NULL, but it can also be an mm that has an extra
> reference on behalf of the cpu in question.
> 
> __mmput scans mm_cpumask and, for each cpu in the mask, mmgrabs the mm
> and cmpxchgs that cpu's mm_with_extra_count from NULL to mm.  If it
> succeeds, then we win.  If it fails, further thought is required, and
> maybe we have to send an IPI, although maybe some other cleverness is
> possible.  Any time a CPU switches mms, it does atomic swaps
> mm_with_extra_count to NULL and mmdrops whatever the mm was.  (Maybe
> it needs to check the mm isn't equal to the new mm, although it would
> be quite bizarre for this to happen.)  Other than these mmgrab and
> mmdrop calls, the mm switching code doesn't mmgrab or mmdrop at all.
> 
> 
> Version (a) seems like it could have excellent performance.

That said, if x86 wanted to explore something like this, the code to do 
it is a bit modular (I don't think a proliferation of lazy refcounting 
config options is a good idea of course, but 2 versions, one for powerpc
style set-and-forget mm_cpumask and one for x86 set-and-clear, would
be okay).

> *However*, I think we should consider whether we want to do something
> even bigger first.  Even with any of these changes, we still need to
> maintain mm_cpumask(), and that itself can be a scalability problem.
> I wonder if we can solve this problem too.  Perhaps the 

Re: [PATCH kernel] powerpc/perf: Stop crashing with generic_compat_pmu

2020-12-01 Thread Alexey Kardashevskiy

Hi Maddy,

I just noticed that I still have "powerpc/perf: Add checks for reserved 
values" in my pile (pushed here 
https://github.com/aik/linux/commit/61e1bc3f2e19d450e2e2d39174d422160b21957b 
). Do we still need it? The lockups I saw were fixed by 
https://github.com/aik/linux/commit/17899eaf88d689 but it is hardly a 
replacement. Thanks,



On 04/06/2020 02:34, Madhavan Srinivasan wrote:



On 6/2/20 8:26 AM, Alexey Kardashevskiy wrote:

The bhrb_filter_map ("The Branch History Rolling Buffer") callback is
only defined in raw CPUs' power_pmu structs. The "architected" CPUs use
generic_compat_pmu, which does not have this callback, and crashes occur.

This adds a NULL pointer check for bhrb_filter_map() which behaves as if
the callback returned an error.

This does not add the same check for config_bhrb() as the only caller
checks for cpuhw->bhrb_users which remains zero if bhrb_filter_map==0.


Changes looks fine.
Reviewed-by: Madhavan Srinivasan 

The commit be80e758d0c2e ('powerpc/perf: Add generic compat mode pmu 
driver')

which introduced generic_compat_pmu was merged in v5.2.  So we need to
CC stable starting from 5.2 :( .  My bad,  sorry.

Maddy


Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/perf/core-book3s.c | 19 ++-
  1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 3dcfecf858f3..36870569bf9c 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1515,9 +1515,16 @@ static int power_pmu_add(struct perf_event *event, int ef_flags)

  ret = 0;
   out:
  if (has_branch_stack(event)) {
-    power_pmu_bhrb_enable(event);
-    cpuhw->bhrb_filter = ppmu->bhrb_filter_map(
-    event->attr.branch_sample_type);
+    u64 bhrb_filter = -1;
+
+    if (ppmu->bhrb_filter_map)
+    bhrb_filter = ppmu->bhrb_filter_map(
+    event->attr.branch_sample_type);
+
+    if (bhrb_filter != -1) {
+    cpuhw->bhrb_filter = bhrb_filter;
+    power_pmu_bhrb_enable(event); /* Does bhrb_users++ */
+    }
  }

  perf_pmu_enable(event->pmu);
@@ -1839,7 +1846,6 @@ static int power_pmu_event_init(struct perf_event *event)

  int n;
  int err;
  struct cpu_hw_events *cpuhw;
-    u64 bhrb_filter;

  if (!ppmu)
  return -ENOENT;
@@ -1945,7 +1951,10 @@ static int power_pmu_event_init(struct perf_event *event)

  err = power_check_constraints(cpuhw, events, cflags, n + 1);

  if (has_branch_stack(event)) {
-    bhrb_filter = ppmu->bhrb_filter_map(
+    u64 bhrb_filter = -1;
+
+    if (ppmu->bhrb_filter_map)
+    bhrb_filter = ppmu->bhrb_filter_map(
  event->attr.branch_sample_type);

  if (bhrb_filter == -1) {




--
Alexey


Re: [PATCH 1/8] lazy tlb: introduce exit_lazy_tlb

2020-12-01 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of November 29, 2020 10:38 am:
> On Sat, Nov 28, 2020 at 8:01 AM Nicholas Piggin  wrote:
>>
>> This is called at points where a lazy mm is switched away or made not
>> lazy (by its owner switching back).
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/arm/mach-rpc/ecard.c|  1 +
>>  arch/powerpc/mm/book3s64/radix_tlb.c |  1 +
>>  fs/exec.c|  6 --
>>  include/asm-generic/mmu_context.h| 21 +
>>  kernel/kthread.c |  1 +
>>  kernel/sched/core.c  |  2 ++
>>  6 files changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
>> index 827b50f1c73e..43eb1bfba466 100644
>> --- a/arch/arm/mach-rpc/ecard.c
>> +++ b/arch/arm/mach-rpc/ecard.c
>> @@ -253,6 +253,7 @@ static int ecard_init_mm(void)
>> current->mm = mm;
>> current->active_mm = mm;
>> activate_mm(active_mm, mm);
>> +   exit_lazy_tlb(active_mm, current);
>> mmdrop(active_mm);
>> ecard_init_pgtables(mm);
>> return 0;
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index b487b489d4b6..ac3fec03926a 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -661,6 +661,7 @@ static void do_exit_flush_lazy_tlb(void *arg)
>> mmgrab(&init_mm);
>> current->active_mm = &init_mm;
>> switch_mm_irqs_off(mm, &init_mm, current);
>> +   exit_lazy_tlb(mm, current);
>> mmdrop(mm);
>> }
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 547a2390baf5..4b4dea1bb7ba 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1017,6 +1017,8 @@ static int exec_mmap(struct mm_struct *mm)
>> if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
>> local_irq_enable();
>> activate_mm(active_mm, mm);
>> +   if (!old_mm)
>> +   exit_lazy_tlb(active_mm, tsk);
>> if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
>> local_irq_enable();
>> tsk->mm->vmacache_seqnum = 0;
>> @@ -1028,9 +1030,9 @@ static int exec_mmap(struct mm_struct *mm)
>> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
>> mm_update_next_owner(old_mm);
>> mmput(old_mm);
>> -   return 0;
>> +   } else {
>> +   mmdrop(active_mm);
>> }
>> -   mmdrop(active_mm);
> 
> This looks like an unrelated change.

I thought the old style was pointless and made me look twice to make 
sure we weren't mmdrop()ing the lazy.

> 
>> return 0;
>>  }
>>
>> diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
>> index 91727065bacb..4626d0020e65 100644
>> --- a/include/asm-generic/mmu_context.h
>> +++ b/include/asm-generic/mmu_context.h
>> @@ -24,6 +24,27 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
>>  }
>>  #endif
>>
>> +/*
>> + * exit_lazy_tlb - Called after switching away from a lazy TLB mode mm.
>> + *
>> + * mm:  the lazy mm context that was switched
>> + * tsk: the task that was switched to (with a non-lazy mm)
>> + *
>> + * mm may equal tsk->mm.
>> + * mm and tsk->mm will not be NULL.
>> + *
>> + * Note this is not symmetrical to enter_lazy_tlb, this is not
>> + * called when tasks switch into the lazy mm, it's called after the
>> + * lazy mm becomes non-lazy (either switched to a different mm or the
>> + * owner of the mm returns).
>> + */
>> +#ifndef exit_lazy_tlb
>> +static inline void exit_lazy_tlb(struct mm_struct *mm,
> 
> Maybe name this parameter prev_lazy_mm?
> 

mm is better because it's the mm that we're "exiting lazy" from; the 
function name gives the context.

prev might suggest it was the previous one, but it's the current one; or
that we're switching to another mm, but we may not be at all.

Thanks,
Nick


Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-12-01 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am:
> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>>
>> And get rid of the generic sync_core_before_usermode facility. This is
>> functionally a no-op in the core scheduler code, but it also catches
>>
>> This helper is the wrong way around I think. The idea that membarrier
>> state requires a core sync before returning to user is the easy one
>> that does not need hiding behind membarrier calls. The gap in core
>> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
>> tricky detail that is better put in x86 lazy tlb code.
>>
>> Consider if an arch did not synchronize core in switch_mm either, then
>> membarrier_mm_sync_core_before_usermode would be in the wrong place
>> but arch specific mmu context functions would still be the right place.
>> There is also an exit_lazy_tlb case that is not covered by this call, which
>> could be a bug (a kthread uses the membarrier process's mm, then context
>> switches back to the process without an mm switch or a lazy mm switch).
>>
>> This makes lazy tlb code a bit more modular.
> 
> I have a couple of membarrier fixes that I want to send out today or
> tomorrow, and they might eliminate the need for this patch.  Let me
> think about this a little bit.  I'll cc you.  The existing code is way
> too subtle and the comments are far too confusing for me to be quickly
> confident about any of my conclusions :)
> 

Thanks for the heads-up. I'll have to have a better look through them, 
but I don't know that it eliminates the need for this entirely, although
it might close some gaps and make this not a bug fix. The problem here 
is that x86 code wanted something to be called when a lazy mm is unlazied,
but it missed some spots; also, the core scheduler doesn't need to 
know about those x86 details if it has this generic call that annotates
the lazy handling better.

I'll go through the wording again and look at your patches a bit better
but I think they are somewhat orthogonal.

Thanks,
Nick


Re: [PATCH 5/8] lazy tlb: allow lazy tlb mm switching to be configurable

2020-12-01 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of November 29, 2020 10:36 am:
> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>>
>> NOMMU systems could easily go without this and save a bit of code
>> and the refcount atomics, because their mm switch is a no-op. I
>> haven't flipped them over because I haven't audited all arch code to
>> convert over to using the _lazy_tlb refcounting.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/Kconfig | 11 +++
>>  include/linux/sched/mm.h | 13 ++--
>>  kernel/sched/core.c  | 68 +---
>>  kernel/sched/sched.h |  4 ++-
>>  4 files changed, 75 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 56b6ccc0e32d..596bf589d74b 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -430,6 +430,17 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>>   irqs disabled over activate_mm. Architectures that do IPI based TLB
>>   shootdowns should enable this.
>>
>> +# Should make this depend on MMU, because there is little use for lazy mm switching
>> +# with NOMMU. Must audit NOMMU architecture code for lazy mm refcounting first.
>> +config MMU_LAZY_TLB
>> +   def_bool y
>> +   help
>> + Enable "lazy TLB" mmu context switching for kernel threads.
>> +
>> +config MMU_LAZY_TLB_REFCOUNT
>> +   def_bool y
>> +   depends on MMU_LAZY_TLB
>> +
> 
> This could use some documentation as to what "no" means.

Sure I can add a bit more.

> 
>>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>> bool
>>
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 7157c0f6fef8..bd0f27402d4b 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -51,12 +51,21 @@ static inline void mmdrop(struct mm_struct *mm)
>>  /* Helpers for lazy TLB mm refcounting */
>>  static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
>>  {
>> -   mmgrab(mm);
>> +   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
>> +   mmgrab(mm);
>>  }
>>
>>  static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
>>  {
>> -   mmdrop(mm);
>> +   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
>> +   mmdrop(mm);
>> +   } else {
>> +   /*
>> +* mmdrop_lazy_tlb must provide a full memory barrier, see the
>> +* membarrier comment finish_task_switch.
> 
> "membarrier comment in finish_task_switch()", perhaps?

Sure.

Thanks,
Nick



Re: [PATCH v2 2/2] kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1

2020-12-01 Thread Masahiro Yamada
On Wed, Dec 2, 2020 at 5:56 AM Kees Cook  wrote:
>
> On Tue, Dec 01, 2020 at 10:31:37PM +0900, Masahiro Yamada wrote:
> > On Wed, Nov 25, 2020 at 7:22 AM Kees Cook  wrote:
> > >
> > > On Thu, Nov 19, 2020 at 01:13:27PM -0800, Nick Desaulniers wrote:
> > > > On Thu, Nov 19, 2020 at 12:57 PM Nathan Chancellor
> > > >  wrote:
> > > > >
> > > > > ld.lld 10.0.1 spews a bunch of various warnings about .rela sections,
> > > > > along with a few others. Newer versions of ld.lld do not have these
> > > > > warnings. As a result, do not add '--orphan-handling=warn' to
> > > > > LDFLAGS_vmlinux if ld.lld's version is not new enough.
> > > > >
> > > > > Link: https://github.com/ClangBuiltLinux/linux/issues/1187
> > > > > Link: https://github.com/ClangBuiltLinux/linux/issues/1193
> > > > > Reported-by: Arvind Sankar 
> > > > > Reported-by: kernelci.org bot 
> > > > > Reported-by: Mark Brown 
> > > > > Reviewed-by: Kees Cook 
> > > > > Signed-off-by: Nathan Chancellor 
> > > >
> > > > Thanks for the additions in v2.
> > > > Reviewed-by: Nick Desaulniers 
> > >
> > > I'm going to carry this for a few days in -next, and if no one screams,
> > > ask Linus to pull it for v5.10-rc6.
> > >
> > > Thanks!
> > >
> > > --
> > > Kees Cook
> >
> >
> > Sorry for the delay.
> > Applied to linux-kbuild.
>
> Great, thanks!
>
> > But, I already see this in linux-next.
> > Please let me know if I should drop it from my tree.
>
> My intention was to get this to Linus this week. Do you want to do that
> yourself, or Ack the patches in my tree and I'll send it?
>
> -Kees
>
> --
> Kees Cook


I will send a kbuild pull request myself this week.




-- 
Best Regards
Masahiro Yamada


[PATCH kernel] powerpc/kuap: Restore AMR after replaying soft interrupts

2020-12-01 Thread Alexey Kardashevskiy
When interrupted in raw_copy_from_user()/... after user memory access
is enabled, a nested handler may also access user memory (perf is
one example), and when it does so, it calls prevent_read_from_user(),
which blocks the upper handler from accessing user memory once it
returns.

This saves/restores the AMR when replaying soft interrupts.
get_kuap()/set_kuap() have stubs when KUAP is disabled, so no ifdefs
are needed.

Found by syzkaller.
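
The problematic nesting, roughly (a sketch of the sequence, not literal
code paths):

	raw_copy_from_user()
	  allow_read_from_user();	/* AMR opened for user access */
	  <interrupt taken, soft-masked, replayed later>
	    perf handler:
	      allow_read_from_user();
	      /* ... reads user memory ... */
	      prevent_read_from_user();	/* AMR closed again */
	  <back in raw_copy_from_user(): the AMR now blocks the access>

	/* Fix: replay_soft_interrupts() brackets the replay with
	 * get_kuap()/set_kuap() so the outer AMR state survives. */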

Signed-off-by: Alexey Kardashevskiy 
---

This is an example:

[ cut here ]
Bug: Read fault blocked by AMR!
WARNING: CPU: 0 PID: 1603 at 
/home/aik/p/kernel/arch/powerpc/include/asm/book3s/64/kup-radix.h:145 
__do_page_fau

Modules linked in:
CPU: 0 PID: 1603 Comm: amr Not tainted 5.10.0-rc6_v5.10-rc6_a+fstn1 #24
NIP:  c009ece8 LR: c009ece4 CTR: 
REGS: cdc63560 TRAP: 0700   Not tainted  (5.10.0-rc6_v5.10-rc6_a+fstn1)
MSR:  80021033   CR: 28002888  XER: 2004
CFAR: c01fa928 IRQMASK: 1
GPR00: c009ece4 cdc637f0 c2397600 001f
GPR04: c20eb318  cdc63494 0027
GPR08: c0007fe4de68 cdfe9180  0001
GPR12: 2000 c30a  
GPR16:    bfff
GPR20:  c000134a4020 c19c2218 0fe0
GPR24:   cd106200 4000
GPR28:  0300 cdc63910 c1946730
NIP [c009ece8] __do_page_fault+0xb38/0xde0
LR [c009ece4] __do_page_fault+0xb34/0xde0
Call Trace:
[cdc637f0] [c009ece4] __do_page_fault+0xb34/0xde0 (unreliable)
[cdc638a0] [c000c968] handle_page_fault+0x10/0x2c
--- interrupt: 300 at strncpy_from_user+0x290/0x440
LR = strncpy_from_user+0x284/0x440
[cdc63ba0] [c0c3dcb0] strncpy_from_user+0x2f0/0x440 (unreliable)
[cdc63c30] [c068b888] getname_flags+0x88/0x2c0
[cdc63c90] [c0662a44] do_sys_openat2+0x2d4/0x5f0
[cdc63d30] [c066560c] do_sys_open+0xcc/0x140
[cdc63dc0] [c0045e10] system_call_exception+0x160/0x240
[cdc63e20] [c000da60] system_call_common+0xf0/0x27c
Instruction dump:
409c0048 3fe2ff5b 3bfff128 fac10060 fae10068 482f7a85 6000 3c62ff5b
7fe4fb78 3863f250 4815bbd9 6000 <0fe0> 3c62ff5b 3863f2b8 4815c8b5
irq event stamp: 254
hardirqs last  enabled at (253): [] 
arch_local_irq_restore+0xa0/0x150
hardirqs last disabled at (254): [] 
data_access_common_virt+0x1b0/0x1d0
softirqs last  enabled at (0): [] copy_process+0x78c/0x2120
softirqs last disabled at (0): [<>] 0x0
---[ end trace ba98aec5151f3aeb ]---
---
 arch/powerpc/kernel/irq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 7d0f7682d01d..915123d861d0 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -221,6 +221,7 @@ void replay_soft_interrupts(void)
 */
unsigned char happened = local_paca->irq_happened;
struct pt_regs regs;
+   unsigned long kuap_state = get_kuap();
 
ppc_save_regs();
regs.softe = IRQS_ENABLED;
@@ -309,6 +310,7 @@ void replay_soft_interrupts(void)
trace_hardirqs_off();
goto again;
}
+   set_kuap(kuap_state);
 }
 
 notrace void arch_local_irq_restore(unsigned long mask)
-- 
2.17.1



[PATCH v2 03/17] ibmvfc: add Subordinate CRQ definitions

2020-12-01 Thread Tyrel Datwyler
Subordinate Command Response Queues (Sub CRQ) are used in conjunction
with the primary CRQ when more than one queue is needed by the virtual
IO adapter. Recent phyp firmware versions support Sub CRQs with ibmvfc
adapters. This feature is a prerequisite for supporting multiple
hardware backed submission queues in the vfc adapter.

The Sub CRQ command element differs from the standard CRQ element in
that it is 32 bytes long, as opposed to 16 bytes for the latter. Despite
the extra 16 bytes, the ibmvfc protocol will initially use the original
CRQ command element mapped to the first 16 bytes of the Sub CRQ element.

Add definitions for the Sub CRQ command element and queue.
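
For reference, the element layout described above (sizes in bytes):

	struct ibmvfc_sub_crq (32 bytes)
	+--------------------------+--------------------+
	| struct ibmvfc_crq (16)   | reserved[2] (16)   |
	+--------------------------+--------------------+

	/* the protocol initially uses only the first 16 bytes */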

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index e095daada70e..b3cd35cbf067 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -656,6 +656,29 @@ struct ibmvfc_crq_queue {
dma_addr_t msg_token;
 };
 
+struct ibmvfc_sub_crq {
+   struct ibmvfc_crq crq;
+   __be64 reserved[2];
+} __packed __aligned(8);
+
+struct ibmvfc_sub_queue {
+   struct ibmvfc_sub_crq *msgs;
+   dma_addr_t msg_token;
+   int size, cur;
+   struct ibmvfc_host *vhost;
+   unsigned long cookie;
+   unsigned long vios_cookie;
+   unsigned long hw_irq;
+   unsigned long irq;
+   unsigned long hwq_id;
+   char name[32];
+};
+
+struct ibmvfc_scsi_channels {
+   struct ibmvfc_sub_queue *scrqs;
+   unsigned int active_queues;
+};
+
 enum ibmvfc_ae_link_state {
IBMVFC_AE_LS_LINK_UP= 0x01,
IBMVFC_AE_LS_LINK_BOUNCED   = 0x02,
-- 
2.27.0



Re: [net-next PATCH] net: freescale: ucc_geth: remove unused SKB_ALLOC_TIMEOUT

2020-12-01 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net-next.git (refs/heads/master):

On Mon, 30 Nov 2020 13:10:10 +1300 you wrote:
> This was added in commit ce973b141dfa ("[PATCH] Freescale QE UCC gigabit
> ethernet driver") but doesn't appear to have been used. Remove it now.
> 
> Signed-off-by: Chris Packham 
> ---
>  drivers/net/ethernet/freescale/ucc_geth.h | 1 -
>  1 file changed, 1 deletion(-)

Here is the summary with links:
  - [net-next] net: freescale: ucc_geth: remove unused SKB_ALLOC_TIMEOUT
https://git.kernel.org/netdev/net-next/c/2bf7d3776b74

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




[PATCH v2 05/17] ibmvfc: add Sub-CRQ IRQ enable/disable routine

2020-12-01 Thread Tyrel Datwyler
Each Sub-CRQ has its own interrupt. A hypercall is required to toggle
the IRQ state. Provide the necessary mechanism via a helper function.
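
Callers then toggle a queue's interrupt like this (for illustration):

	ibmvfc_toggle_scrq_irq(scrq, 1);	/* enable the sub-CRQ IRQ */
	/* ... process completions ... */
	ibmvfc_toggle_scrq_irq(scrq, 0);	/* disable, e.g. while polling */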

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 4860487c6779..97f00fefa809 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -3361,6 +3361,26 @@ static void ibmvfc_tasklet(void *data)
spin_unlock_irqrestore(vhost->host->host_lock, flags);
 }
 
+static int ibmvfc_toggle_scrq_irq(struct ibmvfc_sub_queue *scrq, int enable)
+{
+   struct device *dev = scrq->vhost->dev;
+   struct vio_dev *vdev = to_vio_dev(dev);
+   unsigned long rc;
+   int irq_action = H_ENABLE_VIO_INTERRUPT;
+
+   if (!enable)
+   irq_action = H_DISABLE_VIO_INTERRUPT;
+
+   rc = plpar_hcall_norets(H_VIOCTL, vdev->unit_address, irq_action,
+   scrq->hw_irq, 0, 0);
+
+   if (rc)
+   dev_err(dev, "Couldn't %s sub-crq[%lu] irq. rc=%ld\n",
+   enable ? "enable" : "disable", scrq->hwq_id, rc);
+
+   return rc;
+}
+
 /**
  * ibmvfc_init_tgt - Set the next init job step for the target
  * @tgt:   ibmvfc target struct
-- 
2.27.0



[PATCH v2 10/17] ibmvfc: advertise client support for using hardware channels

2020-12-01 Thread Tyrel Datwyler
Previous patches have plumbed the necessary Sub-CRQ interface and
channel negotiation MADs to fully channelized hardware queues.

Advertise client support via NPIV Login capability
IBMVFC_CAN_USE_CHANNELS when the client bits have MQ enabled via
vhost->mq_enabled, or when channels were already in use during a
subsequent NPIV Login. The latter is required because channel support is
only renegotiated after a CRQ pair is broken. Simple NPIV Logout/Logins
require the client to continue to advertise the channel capability until
the CRQ pair between the client is broken.

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index bfd3340eb0b6..0e6c9e55a221 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -1282,6 +1282,10 @@ static void ibmvfc_set_login_info(struct ibmvfc_host *vhost)
 
login_info->max_cmds = cpu_to_be32(max_requests + 
IBMVFC_NUM_INTERNAL_REQ);
login_info->capabilities = cpu_to_be64(IBMVFC_CAN_MIGRATE | 
IBMVFC_CAN_SEND_VF_WWPN);
+
+   if (vhost->mq_enabled || vhost->using_channels)
+   login_info->capabilities |= cpu_to_be64(IBMVFC_CAN_USE_CHANNELS);
+
login_info->async.va = cpu_to_be64(vhost->async_crq.msg_token);
login_info->async.len = cpu_to_be32(vhost->async_crq.size * 
sizeof(*vhost->async_crq.msgs));
strncpy(login_info->partition_name, vhost->partition_name, 
IBMVFC_MAX_NAME);
-- 
2.27.0



[PATCH v2 01/17] ibmvfc: add vhost fields and defaults for MQ enablement

2020-12-01 Thread Tyrel Datwyler
Introduce several new vhost fields for managing MQ state of the adapter
as well as initial defaults for MQ enablement.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c |  9 -
 drivers/scsi/ibmvscsi/ibmvfc.h | 13 +++--
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 42e4d35e0d35..f1d677a7423d 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -5161,12 +5161,13 @@ static int ibmvfc_probe(struct vio_dev *vdev, const struct vio_device_id *id)
}
 
shost->transportt = ibmvfc_transport_template;
-   shost->can_queue = max_requests;
+   shost->can_queue = (max_requests / IBMVFC_SCSI_HW_QUEUES);
shost->max_lun = max_lun;
shost->max_id = max_targets;
shost->max_sectors = IBMVFC_MAX_SECTORS;
shost->max_cmd_len = IBMVFC_MAX_CDB_LEN;
shost->unique_id = shost->host_no;
+   shost->nr_hw_queues = IBMVFC_SCSI_HW_QUEUES;
 
vhost = shost_priv(shost);
	INIT_LIST_HEAD(&vhost->sent);
@@ -5178,6 +5179,12 @@ static int ibmvfc_probe(struct vio_dev *vdev, const struct vio_device_id *id)
vhost->partition_number = -1;
vhost->log_level = log_level;
vhost->task_set = 1;
+
+   vhost->mq_enabled = IBMVFC_MQ;
+   vhost->client_scsi_channels = IBMVFC_SCSI_CHANNELS;
+   vhost->using_channels = 0;
+   vhost->do_enquiry = 1;
+
strcpy(vhost->partition_name, "UNKNOWN");
	init_waitqueue_head(&vhost->work_wait_q);
	init_waitqueue_head(&vhost->init_wait_q);
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index 9d58cfd774d3..e095daada70e 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -41,16 +41,21 @@
 #define IBMVFC_DEFAULT_LOG_LEVEL   2
 #define IBMVFC_MAX_CDB_LEN 16
 #define IBMVFC_CLS3_ERROR  0
+#define IBMVFC_MQ  0
+#define IBMVFC_SCSI_CHANNELS   0
+#define IBMVFC_SCSI_HW_QUEUES  1
+#define IBMVFC_MIG_NO_SUB_TO_CRQ   0
+#define IBMVFC_MIG_NO_N_TO_M   0
 
 /*
  * Ensure we have resources for ERP and initialization:
- * 1 for ERP
  * 1 for initialization
  * 1 for NPIV Logout
  * 2 for BSG passthru
  * 2 for each discovery thread
+ * 1 ERP for each possible HW Queue
  */
-#define IBMVFC_NUM_INTERNAL_REQ	(1 + 1 + 1 + 2 + (disc_threads * 2))
+#define IBMVFC_NUM_INTERNAL_REQ	(1 + 1 + 2 + (disc_threads * 2) + IBMVFC_SCSI_HW_QUEUES)
 
 #define IBMVFC_MAD_SUCCESS 0x00
 #define IBMVFC_MAD_NOT_SUPPORTED   0xF1
@@ -826,6 +831,10 @@ struct ibmvfc_host {
int delay_init;
int scan_complete;
int logged_in;
+   int mq_enabled;
+   int using_channels;
+   int do_enquiry;
+   int client_scsi_channels;
int aborting_passthru;
int events_to_log;
 #define IBMVFC_AE_LINKUP   0x0001
-- 
2.27.0



[PATCH v2 02/17] ibmvfc: define hcall wrapper for registering a Sub-CRQ

2020-12-01 Thread Tyrel Datwyler
Sub-CRQs are registered with firmware via a hypercall. Abstract that
interface into a simpler helper function.

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index f1d677a7423d..64674054dbae 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -138,6 +138,20 @@ static void ibmvfc_tgt_move_login(struct ibmvfc_target *);
 
 static const char *unknown_error = "unknown error";
 
+static long h_reg_sub_crq(unsigned long unit_address, unsigned long ioba,
+ unsigned long length, unsigned long *cookie,
+ unsigned long *irq)
+{
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+   long rc;
+
+   rc = plpar_hcall(H_REG_SUB_CRQ, retbuf, unit_address, ioba, length);
+   *cookie = retbuf[0];
+   *irq = retbuf[1];
+
+   return rc;
+}
+
static int ibmvfc_check_caps(struct ibmvfc_host *vhost, unsigned long cap_flags)
 {
u64 host_caps = be64_to_cpu(vhost->login_buf->resp.capabilities);
-- 
2.27.0



[PATCH v2 15/17] ibmvfc: send Cancel MAD down each hw scsi channel

2020-12-01 Thread Tyrel Datwyler
In general the client needs to send Cancel MADs and task management
commands down the same channel as the command(s) they are intended to
cancel or abort. The client assigns cancel keys per LUN and thus must
send a Cancel down each channel on which commands were submitted for
that LUN. Further, the client then must wait for those cancel
completions prior to submitting a LUN RESET or ABORT TASK SET.

Allocate event pointers for each possible scsi channel and assign an
event for each channel that requires a cancel. Wait for the completion
of each submitted cancel.
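
After the per-queue sends, the client waits on every event it allocated
and checks each MAD status, mirroring the old single-queue error
handling; roughly (a sketch using the names from the diff below):

	for (i = 0; i < num_hwq; i++) {
		if (!evt_list[i])
			continue;

		wait_for_completion(&evt_list[i]->comp);
		status = be16_to_cpu(rsp[i].mad_common.status);

		if (status != IBMVFC_MAD_SUCCESS) {
			sdev_printk(KERN_WARNING, sdev,
				    "Cancel failed with rc=%x\n", status);
			switch (status) {
			case IBMVFC_MAD_DRIVER_FAILED:
			case IBMVFC_MAD_CRQ_ERROR:
				/* host likely resetting; report success */
				ret = 0;
				goto free_mem;
			default:
				ret = -EIO;
				goto free_mem;
			}
		}
	}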

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 106 +
 1 file changed, 68 insertions(+), 38 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 0b6284020f06..97e8eed04b01 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -2339,32 +2339,52 @@ static int ibmvfc_cancel_all(struct scsi_device *sdev, int type)
 {
struct ibmvfc_host *vhost = shost_priv(sdev->host);
struct ibmvfc_event *evt, *found_evt;
-   union ibmvfc_iu rsp;
-   int rsp_rc = -EBUSY;
+   struct ibmvfc_event **evt_list;
+   union ibmvfc_iu *rsp;
+   int rsp_rc = 0;
unsigned long flags;
u16 status;
+   int num_hwq = 1;
+   int i;
+   int ret = 0;
 
ENTER;
spin_lock_irqsave(vhost->host->host_lock, flags);
-   found_evt = NULL;
-	list_for_each_entry(evt, &vhost->sent, queue) {
-   if (evt->cmnd && evt->cmnd->device == sdev) {
-   found_evt = evt;
-   break;
+   if (vhost->using_channels && vhost->scsi_scrqs.active_queues)
+   num_hwq = vhost->scsi_scrqs.active_queues;
+
+   evt_list = kcalloc(num_hwq, sizeof(*evt_list), GFP_KERNEL);
+   rsp = kcalloc(num_hwq, sizeof(*rsp), GFP_KERNEL);
+
+   for (i = 0; i < num_hwq; i++) {
+   sdev_printk(KERN_INFO, sdev, "Cancelling outstanding commands 
on queue %d.\n", i);
+
+   found_evt = NULL;
+		list_for_each_entry(evt, &vhost->sent, queue) {
+			if (evt->cmnd && evt->cmnd->device == sdev && evt->hwq == i) {
+   found_evt = evt;
+   break;
+   }
}
-   }
 
-   if (!found_evt) {
-   if (vhost->log_level > IBMVFC_DEFAULT_LOG_LEVEL)
-   sdev_printk(KERN_INFO, sdev, "No events found to 
cancel\n");
-   spin_unlock_irqrestore(vhost->host->host_lock, flags);
-   return 0;
-   }
+   if (!found_evt) {
+   if (vhost->log_level > IBMVFC_DEFAULT_LOG_LEVEL)
+   sdev_printk(KERN_INFO, sdev, "No events found 
to cancel on queue %d\n", i);
+   continue;
+   }
 
-   if (vhost->logged_in) {
-   evt = ibmvfc_init_tmf(vhost, sdev, type);
-		evt->sync_iu = &rsp;
-   rsp_rc = ibmvfc_send_event(evt, vhost, default_timeout);
+
+   if (vhost->logged_in) {
+   evt_list[i] = ibmvfc_init_tmf(vhost, sdev, type);
+   evt_list[i]->hwq = i;
+			evt_list[i]->sync_iu = &rsp[i];
+			rsp_rc = ibmvfc_send_event(evt_list[i], vhost, default_timeout);
+   if (rsp_rc)
+   break;
+   } else {
+   rsp_rc = -EBUSY;
+   break;
+   }
}
 
spin_unlock_irqrestore(vhost->host->host_lock, flags);
@@ -2374,32 +2394,42 @@ static int ibmvfc_cancel_all(struct scsi_device *sdev, int type)
	/* If failure is received, the host adapter is most likely going
	   through reset, return success so the caller will wait for the command
	   being cancelled to get returned */
-   return 0;
+   goto free_mem;
}
 
-   sdev_printk(KERN_INFO, sdev, "Cancelling outstanding commands.\n");
-
-	wait_for_completion(&evt->comp);
-   status = be16_to_cpu(rsp.mad_common.status);
-   spin_lock_irqsave(vhost->host->host_lock, flags);
-   ibmvfc_free_event(evt);
-   spin_unlock_irqrestore(vhost->host->host_lock, flags);
+   for (i = 0; i < num_hwq; i++) {
+   if (!evt_list[i])
+   continue;
 
-   if (status != IBMVFC_MAD_SUCCESS) {
-		sdev_printk(KERN_WARNING, sdev, "Cancel failed with rc=%x\n", status);
-   switch (status) {
-   case IBMVFC_MAD_DRIVER_FAILED:
-   case IBMVFC_MAD_CRQ_ERROR:
-			/* Host adapter most likely going through reset, return success to
-			   the caller will wait for the command being cancelled to get returned */
-   return 0;
-   default:
- 

[PATCH v2 14/17] ibmvfc: add cancel mad initialization helper

2020-12-01 Thread Tyrel Datwyler
Add a helper routine for initializing a Cancel MAD. This will be useful
for a channelized client that needs to send a Cancel commands down every
channel commands were sent for a particular LUN.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 67 --
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index c1ac2acba5fd..0b6284020f06 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -2286,6 +2286,44 @@ static int ibmvfc_wait_for_ops(struct ibmvfc_host *vhost, void *device,
return SUCCESS;
 }
 
+static struct ibmvfc_event *ibmvfc_init_tmf(struct ibmvfc_host *vhost,
+struct scsi_device *sdev,
+int type)
+{
+   struct scsi_target *starget = scsi_target(sdev);
+   struct fc_rport *rport = starget_to_rport(starget);
+   struct ibmvfc_event *evt;
+   struct ibmvfc_tmf *tmf;
+
+   evt = ibmvfc_get_event(vhost);
+   ibmvfc_init_event(evt, ibmvfc_sync_completion, IBMVFC_MAD_FORMAT);
+
+	tmf = &evt->iu.tmf;
+   memset(tmf, 0, sizeof(*tmf));
+   if (ibmvfc_check_caps(vhost, IBMVFC_HANDLE_VF_WWPN)) {
+   tmf->common.version = cpu_to_be32(2);
+   tmf->target_wwpn = cpu_to_be64(rport->port_name);
+   } else {
+   tmf->common.version = cpu_to_be32(1);
+   }
+   tmf->common.opcode = cpu_to_be32(IBMVFC_TMF_MAD);
+   tmf->common.length = cpu_to_be16(sizeof(*tmf));
+   tmf->scsi_id = cpu_to_be64(rport->port_id);
+	int_to_scsilun(sdev->lun, &tmf->lun);
+   if (!ibmvfc_check_caps(vhost, IBMVFC_CAN_SUPPRESS_ABTS))
+   type &= ~IBMVFC_TMF_SUPPRESS_ABTS;
+   if (vhost->state == IBMVFC_ACTIVE)
+   tmf->flags = cpu_to_be32((type | IBMVFC_TMF_LUA_VALID));
+   else
+		tmf->flags = cpu_to_be32(((type & IBMVFC_TMF_SUPPRESS_ABTS) | IBMVFC_TMF_LUA_VALID));
+   tmf->cancel_key = cpu_to_be32((unsigned long)sdev->hostdata);
+   tmf->my_cancel_key = cpu_to_be32((unsigned long)starget->hostdata);
+
+	init_completion(&evt->comp);
+
+   return evt;
+}
+
 /**
  * ibmvfc_cancel_all - Cancel all outstanding commands to the device
  * @sdev:  scsi device to cancel commands
@@ -2300,9 +2338,6 @@ static int ibmvfc_wait_for_ops(struct ibmvfc_host *vhost, void *device,
 static int ibmvfc_cancel_all(struct scsi_device *sdev, int type)
 {
struct ibmvfc_host *vhost = shost_priv(sdev->host);
-   struct scsi_target *starget = scsi_target(sdev);
-   struct fc_rport *rport = starget_to_rport(starget);
-   struct ibmvfc_tmf *tmf;
struct ibmvfc_event *evt, *found_evt;
union ibmvfc_iu rsp;
int rsp_rc = -EBUSY;
@@ -2327,32 +2362,8 @@ static int ibmvfc_cancel_all(struct scsi_device *sdev, int type)
}
 
if (vhost->logged_in) {
-   evt = ibmvfc_get_event(vhost);
-   ibmvfc_init_event(evt, ibmvfc_sync_completion, 
IBMVFC_MAD_FORMAT);
-
-	tmf = &evt->iu.tmf;
-   memset(tmf, 0, sizeof(*tmf));
-   if (ibmvfc_check_caps(vhost, IBMVFC_HANDLE_VF_WWPN)) {
-   tmf->common.version = cpu_to_be32(2);
-   tmf->target_wwpn = cpu_to_be64(rport->port_name);
-   } else {
-   tmf->common.version = cpu_to_be32(1);
-   }
-   tmf->common.opcode = cpu_to_be32(IBMVFC_TMF_MAD);
-   tmf->common.length = cpu_to_be16(sizeof(*tmf));
-   tmf->scsi_id = cpu_to_be64(rport->port_id);
-	int_to_scsilun(sdev->lun, &tmf->lun);
-   if (!ibmvfc_check_caps(vhost, IBMVFC_CAN_SUPPRESS_ABTS))
-   type &= ~IBMVFC_TMF_SUPPRESS_ABTS;
-   if (vhost->state == IBMVFC_ACTIVE)
-   tmf->flags = cpu_to_be32((type | IBMVFC_TMF_LUA_VALID));
-   else
-		tmf->flags = cpu_to_be32(((type & IBMVFC_TMF_SUPPRESS_ABTS) | IBMVFC_TMF_LUA_VALID));
-	tmf->cancel_key = cpu_to_be32((unsigned long)sdev->hostdata);
-	tmf->my_cancel_key = cpu_to_be32((unsigned long)starget->hostdata);
-
+   evt = ibmvfc_init_tmf(vhost, sdev, type);
		evt->sync_iu = &rsp;
-   init_completion(>comp);
rsp_rc = ibmvfc_send_event(evt, vhost, default_timeout);
}
 
-- 
2.27.0



[PATCH v2 17/17] ibmvfc: provide modules parameters for MQ settings

2020-12-01 Thread Tyrel Datwyler
Add the various module parameter toggles for adjusting the MQ
characteristics at boot/load time, as well as a device attribute for
changing the client scsi channel request count.
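
For illustration only (the host number and the values are hypothetical;
the parameter and attribute names are the ones added below), usage could
look like:

	modprobe ibmvfc mq=1 scsi_host_queues=8 scsi_hw_channels=4
	echo 2 > /sys/class/scsi_host/host0/nr_scsi_channels

Note that mq, scsi_host_queues and scsi_hw_channels are S_IRUGO, so they
can only be set at load time, while the two migration toggles and the
nr_scsi_channels attribute are also writable at runtime.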

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 75 +-
 1 file changed, 65 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 97e8eed04b01..bc7c2dcd902c 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -40,6 +40,12 @@ static unsigned int disc_threads = IBMVFC_MAX_DISC_THREADS;
 static unsigned int ibmvfc_debug = IBMVFC_DEBUG;
 static unsigned int log_level = IBMVFC_DEFAULT_LOG_LEVEL;
 static unsigned int cls3_error = IBMVFC_CLS3_ERROR;
+static unsigned int mq_enabled = IBMVFC_MQ;
+static unsigned int nr_scsi_hw_queues = IBMVFC_SCSI_HW_QUEUES;
+static unsigned int nr_scsi_channels = IBMVFC_SCSI_CHANNELS;
+static unsigned int mig_channels_only = IBMVFC_MIG_NO_SUB_TO_CRQ;
+static unsigned int mig_no_less_channels = IBMVFC_MIG_NO_N_TO_M;
+
 static LIST_HEAD(ibmvfc_head);
 static DEFINE_SPINLOCK(ibmvfc_driver_lock);
 static struct scsi_transport_template *ibmvfc_transport_template;
@@ -49,6 +55,22 @@ MODULE_AUTHOR("Brian King ");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(IBMVFC_DRIVER_VERSION);
 
+module_param_named(mq, mq_enabled, uint, S_IRUGO);
+MODULE_PARM_DESC(mq, "Enable multiqueue support. "
+"[Default=" __stringify(IBMVFC_MQ) "]");
+module_param_named(scsi_host_queues, nr_scsi_hw_queues, uint, S_IRUGO);
+MODULE_PARM_DESC(scsi_host_queues, "Number of SCSI Host submission queues. "
+"[Default=" __stringify(IBMVFC_SCSI_HW_QUEUES) "]");
+module_param_named(scsi_hw_channels, nr_scsi_channels, uint, S_IRUGO);
+MODULE_PARM_DESC(scsi_hw_channels, "Number of hw scsi channels to request. "
+"[Default=" __stringify(IBMVFC_SCSI_CHANNELS) "]");
+module_param_named(mig_channels_only, mig_channels_only, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(mig_channels_only, "Prevent migration to non-channelized system. "
+"[Default=" __stringify(IBMVFC_MIG_NO_SUB_TO_CRQ) "]");
+module_param_named(mig_no_less_channels, mig_no_less_channels, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(mig_no_less_channels, "Prevent migration to system with less channels. "
+"[Default=" __stringify(IBMVFC_MIG_NO_N_TO_M) "]");
+
 module_param_named(init_timeout, init_timeout, uint, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(init_timeout, "Initialization timeout in seconds. "
 "[Default=" __stringify(IBMVFC_INIT_TIMEOUT) "]");
@@ -823,7 +845,7 @@ static int ibmvfc_reset_crq(struct ibmvfc_host *vhost)
crq->cur = 0;
 
if (vhost->scsi_scrqs.scrqs) {
-   for (i = 0; i < IBMVFC_SCSI_HW_QUEUES; i++) {
+   for (i = 0; i < nr_scsi_hw_queues; i++) {
			scrq = &vhost->scsi_scrqs.scrqs[i];
memset(scrq->msgs, 0, PAGE_SIZE);
scrq->cur = 0;
@@ -3228,6 +3250,36 @@ static ssize_t ibmvfc_store_log_level(struct device *dev,
return strlen(buf);
 }
 
+static ssize_t ibmvfc_show_scsi_channels(struct device *dev,
+					 struct device_attribute *attr, char *buf)
+{
+   struct Scsi_Host *shost = class_to_shost(dev);
+   struct ibmvfc_host *vhost = shost_priv(shost);
+   unsigned long flags = 0;
+   int len;
+
+   spin_lock_irqsave(shost->host_lock, flags);
+   len = snprintf(buf, PAGE_SIZE, "%d\n", vhost->client_scsi_channels);
+   spin_unlock_irqrestore(shost->host_lock, flags);
+   return len;
+}
+
+static ssize_t ibmvfc_store_scsi_channels(struct device *dev,
+struct device_attribute *attr,
+const char *buf, size_t count)
+{
+   struct Scsi_Host *shost = class_to_shost(dev);
+   struct ibmvfc_host *vhost = shost_priv(shost);
+   unsigned long flags = 0;
+   unsigned int channels;
+
+   spin_lock_irqsave(shost->host_lock, flags);
+   channels = simple_strtoul(buf, NULL, 10);
+   vhost->client_scsi_channels = min(channels, nr_scsi_hw_queues);
+   spin_unlock_irqrestore(shost->host_lock, flags);
+   return strlen(buf);
+}
+
static DEVICE_ATTR(partition_name, S_IRUGO, ibmvfc_show_host_partition_name, NULL);
 static DEVICE_ATTR(device_name, S_IRUGO, ibmvfc_show_host_device_name, NULL);
 static DEVICE_ATTR(port_loc_code, S_IRUGO, ibmvfc_show_host_loc_code, NULL);
@@ -3236,6 +3288,8 @@ static DEVICE_ATTR(npiv_version, S_IRUGO, ibmvfc_show_host_npiv_version, NULL);
 static DEVICE_ATTR(capabilities, S_IRUGO, ibmvfc_show_host_capabilities, NULL);
 static DEVICE_ATTR(log_level, S_IRUGO | S_IWUSR,
   ibmvfc_show_log_level, ibmvfc_store_log_level);
+static DEVICE_ATTR(nr_scsi_channels, S_IRUGO | S_IWUSR,
+		   ibmvfc_show_scsi_channels, ibmvfc_store_scsi_channels);

[PATCH v2 13/17] ibmvfc: register Sub-CRQ handles with VIOS during channel setup

2020-12-01 Thread Tyrel Datwyler
If the ibmvfc client adapter requests channels it must submit a number
of Sub-CRQ handles matching the number of channels being requested. The
VIOS in its response will overwrite the actual number of channel
resources allocated which may be less than what was requested. The
client then must store the VIOS Sub-CRQ handle for each queue. This VIOS
handle is needed as a parameter with h_send_sub_crq().
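
For context, the client-side cookie returned by H_REG_SUB_CRQ identifies
the queue locally, while this VIOS-supplied handle is what must be passed
when sending; a sketch of the eventual send-side usage (wired up later in
this series):

	rc = ibmvfc_send_sub_crq(vhost,
				 vhost->scsi_scrqs.scrqs[evt->hwq].vios_cookie,
				 be64_to_cpu(crq_as_u64[0]),
				 be64_to_cpu(crq_as_u64[1]), 0, 0);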

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 3bb20bfdaf4b..c1ac2acba5fd 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -4509,15 +4509,35 @@ static void ibmvfc_discover_targets(struct ibmvfc_host *vhost)
 static void ibmvfc_channel_setup_done(struct ibmvfc_event *evt)
 {
struct ibmvfc_host *vhost = evt->vhost;
+   struct ibmvfc_channel_setup *setup = vhost->channel_setup_buf;
+	struct ibmvfc_scsi_channels *scrqs = &vhost->scsi_scrqs;
u32 mad_status = be16_to_cpu(evt->xfer_iu->channel_setup.common.status);
int level = IBMVFC_DEFAULT_LOG_LEVEL;
+   int flags, active_queues, i;
 
ibmvfc_free_event(evt);
 
switch (mad_status) {
case IBMVFC_MAD_SUCCESS:
ibmvfc_dbg(vhost, "Channel Setup succeded\n");
+   flags = be32_to_cpu(setup->flags);
vhost->do_enquiry = 0;
+   active_queues = be32_to_cpu(setup->num_scsi_subq_channels);
+   scrqs->active_queues = active_queues;
+
+   if (flags & IBMVFC_CHANNELS_CANCELED) {
+   ibmvfc_dbg(vhost, "Channels Canceled\n");
+   vhost->using_channels = 0;
+   } else {
+   if (active_queues)
+   vhost->using_channels = 1;
+   for (i = 0; i < active_queues; i++)
+   scrqs->scrqs[i].vios_cookie =
+   be64_to_cpu(setup->channel_handles[i]);
+
+   ibmvfc_dbg(vhost, "Using %u channels\n",
+  vhost->scsi_scrqs.active_queues);
+   }
break;
case IBMVFC_MAD_FAILED:
level += ibmvfc_retry_host_init(vhost);
@@ -4541,9 +4561,19 @@ static void ibmvfc_channel_setup(struct ibmvfc_host *vhost)
struct ibmvfc_channel_setup_mad *mad;
struct ibmvfc_channel_setup *setup_buf = vhost->channel_setup_buf;
struct ibmvfc_event *evt = ibmvfc_get_event(vhost);
+	struct ibmvfc_scsi_channels *scrqs = &vhost->scsi_scrqs;
+   unsigned int num_channels =
+   min(vhost->client_scsi_channels, vhost->max_vios_scsi_channels);
+   int i;
 
memset(setup_buf, 0, sizeof(*setup_buf));
-   setup_buf->flags = cpu_to_be32(IBMVFC_CANCEL_CHANNELS);
+   if (num_channels == 0)
+   setup_buf->flags = cpu_to_be32(IBMVFC_CANCEL_CHANNELS);
+   else {
+   setup_buf->num_scsi_subq_channels = cpu_to_be32(num_channels);
+   for (i = 0; i < num_channels; i++)
+			setup_buf->channel_handles[i] = cpu_to_be64(scrqs->scrqs[i].cookie);
+   }
 
ibmvfc_init_event(evt, ibmvfc_channel_setup_done, IBMVFC_MAD_FORMAT);
	mad = &evt->iu.channel_setup;
-- 
2.27.0



[PATCH v2 16/17] ibmvfc: enable MQ and set reasonable defaults

2020-12-01 Thread Tyrel Datwyler
Turn on MQ by default and set sane values for the upper limit on hw
queues for the scsi host, and number of hw scsi channels to request from
the partner VIOS.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index e0ffb0416223..c327b9c3090e 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -41,9 +41,9 @@
 #define IBMVFC_DEFAULT_LOG_LEVEL   2
 #define IBMVFC_MAX_CDB_LEN 16
 #define IBMVFC_CLS3_ERROR  0
-#define IBMVFC_MQ  0
-#define IBMVFC_SCSI_CHANNELS   0
-#define IBMVFC_SCSI_HW_QUEUES  1
+#define IBMVFC_MQ  1
+#define IBMVFC_SCSI_CHANNELS   8
+#define IBMVFC_SCSI_HW_QUEUES  16
 #define IBMVFC_MIG_NO_SUB_TO_CRQ   0
 #define IBMVFC_MIG_NO_N_TO_M   0
 
-- 
2.27.0



[PATCH v2 08/17] ibmvfc: map/request irq and register Sub-CRQ interrupt handler

2020-12-01 Thread Tyrel Datwyler
Create an irq mapping for the hw_irq number provided from phyp firmware.
Request an irq and assign our Sub-CRQ interrupt handler to it.

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index a3e2d627c1ac..0336833a6950 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -5130,12 +5130,34 @@ static int ibmvfc_register_scsi_channel(struct ibmvfc_host *vhost,
goto reg_failed;
}
 
+   scrq->irq = irq_create_mapping(NULL, scrq->hw_irq);
+
+   if (!scrq->irq) {
+   rc = -EINVAL;
+   dev_err(dev, "Error mapping sub-crq[%d] irq\n", index);
+   goto irq_failed;
+   }
+
+   snprintf(scrq->name, sizeof(scrq->name), "ibmvfc-%x-scsi%d",
+vdev->unit_address, index);
+   rc = request_irq(scrq->irq, ibmvfc_interrupt_scsi, 0, scrq->name, scrq);
+
+   if (rc) {
+   dev_err(dev, "Couldn't register sub-crq[%d] irq\n", index);
+   irq_dispose_mapping(scrq->irq);
+   goto irq_failed;
+   }
+
scrq->hwq_id = index;
scrq->vhost = vhost;
 
LEAVE;
return 0;
 
+irq_failed:
+   do {
+		plpar_hcall_norets(H_FREE_SUB_CRQ, vdev->unit_address, scrq->cookie);
+   } while (rc == H_BUSY || H_IS_LONG_BUSY(rc));
 reg_failed:
dma_unmap_single(dev, scrq->msg_token, PAGE_SIZE, DMA_BIDIRECTIONAL);
 dma_map_failed:
-- 
2.27.0



[PATCH v2 07/17] ibmvfc: define Sub-CRQ interrupt handler routine

2020-12-01 Thread Tyrel Datwyler
Simple handler that calls Sub-CRQ drain routine directly.

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index e9da3f60c793..a3e2d627c1ac 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -3458,6 +3458,16 @@ static void ibmvfc_drain_sub_crq(struct ibmvfc_sub_queue *scrq)
}
 }
 
+static irqreturn_t ibmvfc_interrupt_scsi(int irq, void *scrq_instance)
+{
+	struct ibmvfc_sub_queue *scrq = (struct ibmvfc_sub_queue *)scrq_instance;
+
+   ibmvfc_toggle_scrq_irq(scrq, 0);
+   ibmvfc_drain_sub_crq(scrq);
+
+   return IRQ_HANDLED;
+}
+
 /**
  * ibmvfc_init_tgt - Set the next init job step for the target
  * @tgt:   ibmvfc target struct
-- 
2.27.0



[PATCH v2 11/17] ibmvfc: set and track hw queue in ibmvfc_event struct

2020-12-01 Thread Tyrel Datwyler
Extract the hwq id from a SCSI command and store it in the ibmvfc_event
structure to identify which Sub-CRQ to send the command down when
channels are being utilized.
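
For reference, blk-mq encodes the hardware queue index in the upper
16 bits of the "unique" tag, so the decode used here is simply:

	/* sketch: recover the submission hw queue for a scsi_cmnd */
	u32 tag_and_hwq = blk_mq_unique_tag(cmnd->request);
	u16 hwq = blk_mq_unique_tag_to_hwq(tag_and_hwq);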

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 5 +
 drivers/scsi/ibmvscsi/ibmvfc.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 0e6c9e55a221..4555775ea74b 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -1397,6 +1397,7 @@ static void ibmvfc_init_event(struct ibmvfc_event *evt,
evt->crq.format = format;
evt->done = done;
evt->eh_comp = NULL;
+   evt->hwq = 0;
 }
 
 /**
@@ -1748,6 +1749,8 @@ static int ibmvfc_queuecommand_lck(struct scsi_cmnd *cmnd,
struct ibmvfc_cmd *vfc_cmd;
struct ibmvfc_fcp_cmd_iu *iu;
struct ibmvfc_event *evt;
+   u32 tag_and_hwq = blk_mq_unique_tag(cmnd->request);
+   u16 hwq = blk_mq_unique_tag_to_hwq(tag_and_hwq);
int rc;
 
if (unlikely((rc = fc_remote_port_chkready(rport))) ||
@@ -1775,6 +1778,8 @@ static int ibmvfc_queuecommand_lck(struct scsi_cmnd *cmnd,
}
 
vfc_cmd->correlation = cpu_to_be64(evt);
+   if (vhost->using_channels)
+   evt->hwq = hwq % vhost->scsi_scrqs.active_queues;
 
	if (likely(!(rc = ibmvfc_map_sg_data(cmnd, evt, vfc_cmd, vhost->dev))))
return ibmvfc_send_event(evt, vhost, 0);
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index dff26dbd912c..e0ffb0416223 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -781,6 +781,7 @@ struct ibmvfc_event {
struct completion comp;
struct completion *eh_comp;
struct timer_list timer;
+   u16 hwq;
 };
 
 /* a pool of event structs for use */
-- 
2.27.0



[PATCH v2 09/17] ibmvfc: implement channel enquiry and setup commands

2020-12-01 Thread Tyrel Datwyler
New NPIV_ENQUIRY_CHANNEL and NPIV_SETUP_CHANNEL management datagrams
(MADs) were defined in a previous patchset. If the client advertises a
desire to use channels and the partner VIOS is channel capable, then the
client must proceed with channel enquiry to determine the maximum number
of channels the VIOS is capable of providing, and then register Sub-CRQs
via channel setup with the VIOS immediately following NPIV Login. This
handshaking should not be performed for subsequent NPIV Logins unless
the CRQ connection has been reset.

Implement these two new MADs and issue them following a successful NPIV
login where the VIOS has set the SUPPORT_CHANNELS capability bit in the
NPIV Login response.
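
The resulting negotiation order, sketched with the function names
introduced below:

	/*
	 * NPIV Login response (SUPPORT_CHANNELS set, do_enquiry == 1)
	 *   -> ibmvfc_channel_enquiry()
	 *        -> ibmvfc_channel_enquiry_done(): save max_vios_scsi_channels
	 *             -> ibmvfc_channel_setup()
	 *                  -> ibmvfc_channel_setup_done(): clear do_enquiry
	 */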

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 135 -
 drivers/scsi/ibmvscsi/ibmvfc.h |   3 +
 2 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 0336833a6950..bfd3340eb0b6 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -806,6 +806,8 @@ static int ibmvfc_reset_crq(struct ibmvfc_host *vhost)
spin_lock_irqsave(vhost->host->host_lock, flags);
vhost->state = IBMVFC_NO_CRQ;
vhost->logged_in = 0;
+   vhost->do_enquiry = 1;
+   vhost->using_channels = 0;
 
/* Clean out the queue */
memset(crq->msgs, 0, PAGE_SIZE);
@@ -4473,6 +4475,118 @@ static void ibmvfc_discover_targets(struct ibmvfc_host *vhost)
ibmvfc_link_down(vhost, IBMVFC_LINK_DEAD);
 }
 
+static void ibmvfc_channel_setup_done(struct ibmvfc_event *evt)
+{
+   struct ibmvfc_host *vhost = evt->vhost;
+   u32 mad_status = be16_to_cpu(evt->xfer_iu->channel_setup.common.status);
+   int level = IBMVFC_DEFAULT_LOG_LEVEL;
+
+   ibmvfc_free_event(evt);
+
+   switch (mad_status) {
+   case IBMVFC_MAD_SUCCESS:
+   ibmvfc_dbg(vhost, "Channel Setup succeded\n");
+   vhost->do_enquiry = 0;
+   break;
+   case IBMVFC_MAD_FAILED:
+   level += ibmvfc_retry_host_init(vhost);
+   ibmvfc_log(vhost, level, "Channel Setup failed\n");
+   fallthrough;
+   case IBMVFC_MAD_DRIVER_FAILED:
+   return;
+   default:
+   dev_err(vhost->dev, "Invalid Channel Setup response: 0x%x\n",
+   mad_status);
+   ibmvfc_link_down(vhost, IBMVFC_LINK_DEAD);
+   return;
+   }
+
+   ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_QUERY);
+	wake_up(&vhost->work_wait_q);
+}
+
+static void ibmvfc_channel_setup(struct ibmvfc_host *vhost)
+{
+   struct ibmvfc_channel_setup_mad *mad;
+   struct ibmvfc_channel_setup *setup_buf = vhost->channel_setup_buf;
+   struct ibmvfc_event *evt = ibmvfc_get_event(vhost);
+
+   memset(setup_buf, 0, sizeof(*setup_buf));
+   setup_buf->flags = cpu_to_be32(IBMVFC_CANCEL_CHANNELS);
+
+   ibmvfc_init_event(evt, ibmvfc_channel_setup_done, IBMVFC_MAD_FORMAT);
+	mad = &evt->iu.channel_setup;
+   memset(mad, 0, sizeof(*mad));
+   mad->common.version = cpu_to_be32(1);
+   mad->common.opcode = cpu_to_be32(IBMVFC_CHANNEL_SETUP);
+   mad->common.length = cpu_to_be16(sizeof(*mad));
+   mad->buffer.va = cpu_to_be64(vhost->channel_setup_dma);
+   mad->buffer.len = cpu_to_be32(sizeof(*vhost->channel_setup_buf));
+
+   ibmvfc_set_host_action(vhost, IBMVFC_HOST_ACTION_INIT_WAIT);
+
+   if (!ibmvfc_send_event(evt, vhost, default_timeout))
+   ibmvfc_dbg(vhost, "Sent channel setup\n");
+   else
+   ibmvfc_link_down(vhost, IBMVFC_LINK_DOWN);
+}
+
+static void ibmvfc_channel_enquiry_done(struct ibmvfc_event *evt)
+{
+   struct ibmvfc_host *vhost = evt->vhost;
+	struct ibmvfc_channel_enquiry *rsp = &evt->xfer_iu->channel_enquiry;
+   u32 mad_status = be16_to_cpu(rsp->common.status);
+   int level = IBMVFC_DEFAULT_LOG_LEVEL;
+
+   switch (mad_status) {
+   case IBMVFC_MAD_SUCCESS:
+   ibmvfc_dbg(vhost, "Channel Enquiry succeeded\n");
+		vhost->max_vios_scsi_channels = be32_to_cpu(rsp->num_scsi_subq_channels);
+   ibmvfc_free_event(evt);
+   break;
+   case IBMVFC_MAD_FAILED:
+   level += ibmvfc_retry_host_init(vhost);
+   ibmvfc_log(vhost, level, "Channel Enquiry failed\n");
+   fallthrough;
+   case IBMVFC_MAD_DRIVER_FAILED:
+   ibmvfc_free_event(evt);
+   return;
+   default:
+   dev_err(vhost->dev, "Invalid Channel Enquiry response: 0x%x\n",
+   mad_status);
+   ibmvfc_link_down(vhost, IBMVFC_LINK_DEAD);
+   ibmvfc_free_event(evt);
+   return;
+   }
+
+   ibmvfc_channel_setup(vhost);
+}
+
+static void ibmvfc_channel_enquiry(struct ibmvfc_host *vhost)
+{
+   struct ibmvfc_channel_enquiry 

[PATCH v2 12/17] ibmvfc: send commands down HW Sub-CRQ when channelized

2020-12-01 Thread Tyrel Datwyler
When the client has negotiated the use of channels, all vfcFrames are
required to go down a Sub-CRQ channel or it is a protocol violation. If
the adapter state is channelized, submit vfcFrames to the appropriate
Sub-CRQ via the h_send_sub_crq() helper.

Signed-off-by: Tyrel Datwyler 
Reviewed-by: Brian King 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 32 +++-
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 4555775ea74b..3bb20bfdaf4b 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -701,6 +701,15 @@ static int ibmvfc_send_crq(struct ibmvfc_host *vhost, u64 word1, u64 word2)
return plpar_hcall_norets(H_SEND_CRQ, vdev->unit_address, word1, word2);
 }
 
+static int ibmvfc_send_sub_crq(struct ibmvfc_host *vhost, u64 cookie, u64 word1,
+			       u64 word2, u64 word3, u64 word4)
+{
+   struct vio_dev *vdev = to_vio_dev(vhost->dev);
+
+   return plpar_hcall_norets(H_SEND_SUB_CRQ, vdev->unit_address, cookie,
+ word1, word2, word3, word4);
+}
+
 /**
  * ibmvfc_send_crq_init - Send a CRQ init message
  * @vhost: ibmvfc host struct
@@ -1513,15 +1522,19 @@ static int ibmvfc_send_event(struct ibmvfc_event *evt,
 struct ibmvfc_host *vhost, unsigned long timeout)
 {
	__be64 *crq_as_u64 = (__be64 *) &evt->crq;
+   int channel_cmd = 0;
int rc;
 
/* Copy the IU into the transfer area */
*evt->xfer_iu = evt->iu;
-   if (evt->crq.format == IBMVFC_CMD_FORMAT)
+   if (evt->crq.format == IBMVFC_CMD_FORMAT) {
evt->xfer_iu->cmd.tag = cpu_to_be64((u64)evt);
-   else if (evt->crq.format == IBMVFC_MAD_FORMAT)
+   channel_cmd = 1;
+   } else if (evt->crq.format == IBMVFC_MAD_FORMAT) {
evt->xfer_iu->mad_common.tag = cpu_to_be64((u64)evt);
-   else
+   if (evt->xfer_iu->mad_common.opcode == IBMVFC_TMF_MAD)
+   channel_cmd = 1;
+   } else
BUG();
 
	list_add_tail(&evt->queue, &vhost->sent);
@@ -1534,8 +1547,17 @@ static int ibmvfc_send_event(struct ibmvfc_event *evt,
 
mb();
 
-	if ((rc = ibmvfc_send_crq(vhost, be64_to_cpu(crq_as_u64[0]),
-				  be64_to_cpu(crq_as_u64[1])))) {
+   if (vhost->using_channels && channel_cmd)
+		rc = ibmvfc_send_sub_crq(vhost,
+					 vhost->scsi_scrqs.scrqs[evt->hwq].vios_cookie,
+					 be64_to_cpu(crq_as_u64[0]),
+					 be64_to_cpu(crq_as_u64[1]),
+					 0, 0);
+   else
+   rc = ibmvfc_send_crq(vhost, be64_to_cpu(crq_as_u64[0]),
+be64_to_cpu(crq_as_u64[1]));
+
+   if (rc) {
		list_del(&evt->queue);
		del_timer(&evt->timer);
 
-- 
2.27.0



[PATCH v2 00/17] ibmvfc: initial MQ development

2020-12-01 Thread Tyrel Datwyler
Recent updates in pHyp Firmware and VIOS releases provide new infrastructure
towards enabling Subordinate Command Response Queues (Sub-CRQs) such that each
Sub-CRQ is a channel backed by an actual hardware queue in the FC stack on the
partner VIOS. Sub-CRQs are registered with the firmware via hypercalls and then
negotiated with the VIOS via new Management Datagrams (MADs) for channel setup.

This initial implementation adds the necessary Sub-CRQ framework and implements
the new MADs for negotiating and assigning a set of Sub-CRQs to associated VIOS
HW backed channels. The event pool and locking still leverage the legacy single
queue implementation, and as such lock contention is problematic when increasing
the number of queues. However, this initial work demonstrates a 1.2x increase
in IOPs when configured with two HW queues despite the lock contention.

changes in v2:
* Patch 4: NULL'd scsi_scrq reference after deallocation [brking]
* Patch 6: Added switch case to handle XPORT event [brking]
* Patch 9: fixed ibmvfc_event leak and double free [brking]
* added support for cancel command with MQ
* added parameter toggles for MQ settings

Tyrel Datwyler (17):
  ibmvfc: add vhost fields and defaults for MQ enablement
  ibmvfc: define hcall wrapper for registering a Sub-CRQ
  ibmvfc: add Subordinate CRQ definitions
  ibmvfc: add alloc/dealloc routines for SCSI Sub-CRQ Channels
  ibmvfc: add Sub-CRQ IRQ enable/disable routine
  ibmvfc: add handlers to drain and complete Sub-CRQ responses
  ibmvfc: define Sub-CRQ interrupt handler routine
  ibmvfc: map/request irq and register Sub-CRQ interrupt handler
  ibmvfc: implement channel enquiry and setup commands
  ibmvfc: advertise client support for using hardware channels
  ibmvfc: set and track hw queue in ibmvfc_event struct
  ibmvfc: send commands down HW Sub-CRQ when channelized
  ibmvfc: register Sub-CRQ handles with VIOS during channel setup
  ibmvfc: add cancel mad initialization helper
  ibmvfc: send Cancel MAD down each hw scsi channel
  ibmvfc: enable MQ and set reasonable defaults
  ibmvfc: provide modules parameters for MQ settings

 drivers/scsi/ibmvscsi/ibmvfc.c | 706 +
 drivers/scsi/ibmvscsi/ibmvfc.h |  41 +-
 2 files changed, 675 insertions(+), 72 deletions(-)

-- 
2.27.0



[PATCH v2 04/17] ibmvfc: add alloc/dealloc routines for SCSI Sub-CRQ Channels

2020-12-01 Thread Tyrel Datwyler
Allocate a set of Sub-CRQs in advance. During channel setup the client
and VIOS negotiate the number of queues the VIOS supports and the number
that the client desires to request. It's possible that the number of channel
resources finally allocated is less than requested, but the client is still
responsible for sending handles for every queue it is hoping for.

Also, provide deallocation cleanup routines.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 128 +
 drivers/scsi/ibmvscsi/ibmvfc.h |   1 +
 2 files changed, 129 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 64674054dbae..4860487c6779 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -793,6 +793,8 @@ static int ibmvfc_reset_crq(struct ibmvfc_host *vhost)
unsigned long flags;
struct vio_dev *vdev = to_vio_dev(vhost->dev);
	struct ibmvfc_crq_queue *crq = &vhost->crq;
+   struct ibmvfc_sub_queue *scrq;
+   int i;
 
/* Close the CRQ */
do {
@@ -809,6 +811,14 @@ static int ibmvfc_reset_crq(struct ibmvfc_host *vhost)
memset(crq->msgs, 0, PAGE_SIZE);
crq->cur = 0;
 
+   if (vhost->scsi_scrqs.scrqs) {
+   for (i = 0; i < IBMVFC_SCSI_HW_QUEUES; i++) {
+			scrq = &vhost->scsi_scrqs.scrqs[i];
+   memset(scrq->msgs, 0, PAGE_SIZE);
+   scrq->cur = 0;
+   }
+   }
+
/* And re-open it again */
rc = plpar_hcall_norets(H_REG_CRQ, vdev->unit_address,
crq->msg_token, PAGE_SIZE);
@@ -4983,6 +4993,117 @@ static int ibmvfc_init_crq(struct ibmvfc_host *vhost)
return retrc;
 }
 
+static int ibmvfc_register_scsi_channel(struct ibmvfc_host *vhost,
+ int index)
+{
+   struct device *dev = vhost->dev;
+   struct vio_dev *vdev = to_vio_dev(dev);
+	struct ibmvfc_sub_queue *scrq = &vhost->scsi_scrqs.scrqs[index];
+   int rc = -ENOMEM;
+
+   ENTER;
+
+   scrq->msgs = (struct ibmvfc_sub_crq *)get_zeroed_page(GFP_KERNEL);
+   if (!scrq->msgs)
+   return rc;
+
+   scrq->size = PAGE_SIZE / sizeof(*scrq->msgs);
+   scrq->msg_token = dma_map_single(dev, scrq->msgs, PAGE_SIZE,
+DMA_BIDIRECTIONAL);
+
+   if (dma_mapping_error(dev, scrq->msg_token))
+   goto dma_map_failed;
+
+	rc = h_reg_sub_crq(vdev->unit_address, scrq->msg_token, PAGE_SIZE,
+			   &scrq->cookie, &scrq->hw_irq);
+
+   if (rc) {
+   dev_warn(dev, "Error registering sub-crq: %d\n", rc);
+   dev_warn(dev, "Firmware may not support MQ\n");
+   goto reg_failed;
+   }
+
+   scrq->hwq_id = index;
+   scrq->vhost = vhost;
+
+   LEAVE;
+   return 0;
+
+reg_failed:
+   dma_unmap_single(dev, scrq->msg_token, PAGE_SIZE, DMA_BIDIRECTIONAL);
+dma_map_failed:
+   free_page((unsigned long)scrq->msgs);
+   LEAVE;
+   return rc;
+}
+
+static void ibmvfc_deregister_scsi_channel(struct ibmvfc_host *vhost, int index)
+{
+   struct device *dev = vhost->dev;
+   struct vio_dev *vdev = to_vio_dev(dev);
+	struct ibmvfc_sub_queue *scrq = &vhost->scsi_scrqs.scrqs[index];
+   long rc;
+
+   ENTER;
+
+   do {
+   rc = plpar_hcall_norets(H_FREE_SUB_CRQ, vdev->unit_address,
+   scrq->cookie);
+   } while (rc == H_BUSY || H_IS_LONG_BUSY(rc));
+
+   if (rc)
+   dev_err(dev, "Failed to free sub-crq[%d]: rc=%ld\n", index, rc);
+
+   dma_unmap_single(dev, scrq->msg_token, PAGE_SIZE, DMA_BIDIRECTIONAL);
+   free_page((unsigned long)scrq->msgs);
+   LEAVE;
+}
+
+static int ibmvfc_init_sub_crqs(struct ibmvfc_host *vhost)
+{
+   int i, j;
+
+   ENTER;
+
+   vhost->scsi_scrqs.scrqs = kcalloc(IBMVFC_SCSI_HW_QUEUES,
+ sizeof(*vhost->scsi_scrqs.scrqs),
+ GFP_KERNEL);
+   if (!vhost->scsi_scrqs.scrqs)
+   return -1;
+
+   for (i = 0; i < IBMVFC_SCSI_HW_QUEUES; i++) {
+   if (ibmvfc_register_scsi_channel(vhost, i)) {
+   for (j = i; j > 0; j--)
+   ibmvfc_deregister_scsi_channel(vhost, j - 1);
+   kfree(vhost->scsi_scrqs.scrqs);
+   vhost->scsi_scrqs.scrqs = NULL;
+   vhost->scsi_scrqs.active_queues = 0;
+   LEAVE;
+   return -1;
+   }
+   }
+
+   LEAVE;
+   return 0;
+}
+
+static void ibmvfc_release_sub_crqs(struct ibmvfc_host *vhost)
+{
+   int i;
+
+   ENTER;
+   if (!vhost->scsi_scrqs.scrqs)
+   return;
+
+   for (i = 0; i < IBMVFC_SCSI_HW_QUEUES; i++)
+   ibmvfc_deregister_scsi_channel(vhost, i);
+
+	kfree(vhost->scsi_scrqs.scrqs);
+	vhost->scsi_scrqs.scrqs = NULL;
+	vhost->scsi_scrqs.active_queues = 0;
+	LEAVE;
+}

[PATCH v2 06/17] ibmvfc: add handlers to drain and complete Sub-CRQ responses

2020-12-01 Thread Tyrel Datwyler
The logic for iterating over the Sub-CRQ responses is similar to that
of the primary CRQ. Add the necessary handlers for processing those
responses.
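
The drain path below uses the usual disable/drain/re-enable/recheck idiom
to close the race with a message arriving just as the interrupt is
re-armed; in sketch form (the two helper names here are placeholders for
the functions in the diff):

	for (;;) {
		handle_all_pending(scrq);	 /* drain with irq disabled */
		ibmvfc_toggle_scrq_irq(scrq, 1); /* re-arm */
		if (nothing_pending(scrq))	 /* recheck to close the race */
			break;
		ibmvfc_toggle_scrq_irq(scrq, 0); /* lost the race, drain again */
	}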

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.c | 77 ++
 1 file changed, 77 insertions(+)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 97f00fefa809..e9da3f60c793 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -3381,6 +3381,83 @@ static int ibmvfc_toggle_scrq_irq(struct ibmvfc_sub_queue *scrq, int enable)
return rc;
 }
 
+static void ibmvfc_handle_scrq(struct ibmvfc_crq *crq, struct ibmvfc_host *vhost)
+{
+	struct ibmvfc_event *evt = (struct ibmvfc_event *)be64_to_cpu(crq->ioba);
+   unsigned long flags;
+
+   switch (crq->valid) {
+   case IBMVFC_CRQ_CMD_RSP:
+   break;
+   case IBMVFC_CRQ_XPORT_EVENT:
+   return;
+   default:
+   dev_err(vhost->dev, "Got and invalid message type 0x%02x\n", 
crq->valid);
+   return;
+   }
+
+   /* The only kind of payload CRQs we should get are responses to
+* things we send. Make sure this response is to something we
+* actually sent
+*/
+	if (unlikely(!ibmvfc_valid_event(&vhost->pool, evt))) {
+   dev_err(vhost->dev, "Returned correlation_token 0x%08llx is 
invalid!\n",
+   crq->ioba);
+   return;
+   }
+
+	if (unlikely(atomic_read(&evt->free))) {
+		dev_err(vhost->dev, "Received duplicate correlation_token 0x%08llx!\n",
+			crq->ioba);
+   return;
+   }
+
+   spin_lock_irqsave(vhost->host->host_lock, flags);
+	del_timer(&evt->timer);
+	list_del(&evt->queue);
+   ibmvfc_trc_end(evt);
+   spin_unlock_irqrestore(vhost->host->host_lock, flags);
+   evt->done(evt);
+}
+
+static struct ibmvfc_crq *ibmvfc_next_scrq(struct ibmvfc_sub_queue *scrq)
+{
+   struct ibmvfc_crq *crq;
+
+	crq = &scrq->msgs[scrq->cur].crq;
+   if (crq->valid & 0x80) {
+   if (++scrq->cur == scrq->size)
+   scrq->cur = 0;
+   rmb();
+   } else
+   crq = NULL;
+
+   return crq;
+}
+
+static void ibmvfc_drain_sub_crq(struct ibmvfc_sub_queue *scrq)
+{
+   struct ibmvfc_crq *crq;
+   int done = 0;
+
+   while (!done) {
+   while ((crq = ibmvfc_next_scrq(scrq)) != NULL) {
+   ibmvfc_handle_scrq(crq, scrq->vhost);
+   crq->valid = 0;
+   wmb();
+   }
+
+   ibmvfc_toggle_scrq_irq(scrq, 1);
+   if ((crq = ibmvfc_next_scrq(scrq)) != NULL) {
+   ibmvfc_toggle_scrq_irq(scrq, 0);
+   ibmvfc_handle_scrq(crq, scrq->vhost);
+   crq->valid = 0;
+   wmb();
+   } else
+   done = 1;
+   }
+}
+
 /**
  * ibmvfc_init_tgt - Set the next init job step for the target
  * @tgt:   ibmvfc target struct
-- 
2.27.0



[PATCH kernel v3] powerpc/pci: Remove LSI mappings on device teardown

2020-12-01 Thread Alexey Kardashevskiy
From: Oliver O'Halloran 

When a passthrough IO adapter is removed from a pseries machine using hash
MMU and the XIVE interrupt mode, the POWER hypervisor expects the guest OS
to clear all page table entries related to the adapter. If some are still
present, the RTAS call which isolates the PCI slot returns error 9001
"valid outstanding translations" and the removal of the IO adapter fails.
This is because when the PHBs are scanned, Linux automatically maps the
INTx interrupts into the Linux interrupt number space, but these mappings
are never removed.

This problem can be fixed by adding the corresponding unmap operation when
the device is removed. There's no pcibios_* hook for the remove case, but
the same effect can be achieved using a bus notifier.

Because INTx are shared among PHBs (and potentially across the system),
this adds tracking of virq to unmap them only when the last user is gone.

Signed-off-by: Oliver O'Halloran 
[aik: added refcounter]
Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* free @vi on error path

v2:
* added refcounter
---
 arch/powerpc/kernel/pci-common.c | 82 ++--
 1 file changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index be108616a721..2b555997b295 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -353,6 +353,55 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
return NULL;
 }
 
+struct pci_intx_virq {
+   int virq;
+   struct kref kref;
+   struct list_head list_node;
+};
+
+static LIST_HEAD(intx_list);
+static DEFINE_MUTEX(intx_mutex);
+
+static void ppc_pci_intx_release(struct kref *kref)
+{
+	struct pci_intx_virq *vi = container_of(kref, struct pci_intx_virq, kref);
+
+	list_del(&vi->list_node);
+   irq_dispose_mapping(vi->virq);
+   kfree(vi);
+}
+
+static int ppc_pci_unmap_irq_line(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct pci_dev *pdev = to_pci_dev(data);
+
+   if (action == BUS_NOTIFY_DEL_DEVICE) {
+   struct pci_intx_virq *vi;
+
+		mutex_lock(&intx_mutex);
+		list_for_each_entry(vi, &intx_list, list_node) {
+			if (vi->virq == pdev->irq) {
+				kref_put(&vi->kref, ppc_pci_intx_release);
+				break;
+			}
+		}
+		mutex_unlock(&intx_mutex);
+   }
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block ppc_pci_unmap_irq_notifier = {
+   .notifier_call = ppc_pci_unmap_irq_line,
+};
+
+static int ppc_pci_register_irq_notifier(void)
+{
+	return bus_register_notifier(&pci_bus_type, &ppc_pci_unmap_irq_notifier);
+}
+arch_initcall(ppc_pci_register_irq_notifier);
+
 /*
  * Reads the interrupt pin to determine if interrupt is use by card.
  * If the interrupt is used, then gets the interrupt line from the
@@ -361,6 +410,12 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
 static int pci_read_irq_line(struct pci_dev *pci_dev)
 {
int virq;
+   struct pci_intx_virq *vi, *vitmp;
+
+   /* Preallocate vi as rewind is complex if this fails after mapping */
+   vi = kzalloc(sizeof(struct pci_intx_virq), GFP_KERNEL);
+   if (!vi)
+   return -1;
 
pr_debug("PCI: Try to map irq for %s...\n", pci_name(pci_dev));
 
@@ -377,12 +432,12 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
 * function.
 */
	if (pci_read_config_byte(pci_dev, PCI_INTERRUPT_PIN, &pin))
-   return -1;
+   goto error_exit;
if (pin == 0)
-   return -1;
+   goto error_exit;
	if (pci_read_config_byte(pci_dev, PCI_INTERRUPT_LINE, &line) ||
line == 0xff || line == 0) {
-   return -1;
+   goto error_exit;
}
pr_debug(" No map ! Using line %d (pin %d) from PCI config\n",
 line, pin);
@@ -394,14 +449,33 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
 
if (!virq) {
pr_debug(" Failed to map !\n");
-   return -1;
+   goto error_exit;
}
 
pr_debug(" Mapped to linux irq %d\n", virq);
 
pci_dev->irq = virq;
 
+	mutex_lock(&intx_mutex);
+	list_for_each_entry(vitmp, &intx_list, list_node) {
+		if (vitmp->virq == virq) {
+			kref_get(&vitmp->kref);
+			kfree(vi);
+			vi = NULL;
+			break;
+		}
+	}
+	if (vi) {
+		vi->virq = virq;
+		kref_init(&vi->kref);
+		list_add_tail(&vi->list_node, &intx_list);
+	}
+	mutex_unlock(&intx_mutex);
+
return 0;

Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-01 Thread Will Deacon
On Tue, Dec 01, 2020 at 01:50:38PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 1, 2020 at 1:28 PM Will Deacon  wrote:
> >
> > On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> > > other arch folk: there's some background here:
> > >
> > > https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
> > >
> > > On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
> > > >
> > > > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> > > > >
> > > > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  
> > > > > wrote:
> > > > > >
> > > > > > On big systems, the mm refcount can become highly contended when 
> > > > > > doing
> > > > > > a lot of context switching with threaded applications (particularly
> > > > > > switching between the idle thread and an application thread).
> > > > > >
> > > > > > Abandoning lazy tlb slows switching down quite a bit in the 
> > > > > > important
> > > > > > user->idle->user cases, so instead implement a non-refcounted 
> > > > > > scheme
> > > > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot 
> > > > > > down
> > > > > > any remaining lazy ones.
> > > > > >
> > > > > > Shootdown IPIs are some concern, but they have not been observed to 
> > > > > > be
> > > > > > a big problem with this scheme (the powerpc implementation generated
> > > > > > 314 additional interrupts on a 144 CPU system during a kernel 
> > > > > > compile).
> > > > > > There are a number of strategies that could be employed to reduce 
> > > > > > IPIs
> > > > > > if they turn out to be a problem for some workload.
> > > > >
> > > > > I'm still wondering whether we can do even better.
> > > > >
> > > >
> > > > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> > > > the TLB.  On x86, this will shoot down all lazies as long as even a
> > > > single pagetable was freed.  (Or at least it will if we don't have a
> > > > serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> > > > sets tlb->freed_tables, which will trigger the IPI.)  So, on
> > > > architectures like x86, the shootdown approach should be free.  The
> > > > only way it ought to have any excess IPIs is if we have CPUs in
> > > > mm_cpumask() that don't need IPI to free pagetables, which could
> > > > happen on paravirt.
> > >
> > > Indeed, on x86, we do this:
> > >
> > > [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> > > [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> > > [   11.561068]  exit_mmap+0xc8/0x1a0
> > > [   11.561932]  mmput+0x29/0xd0
> > > [   11.562688]  do_exit+0x316/0xa90
> > > [   11.563588]  do_group_exit+0x34/0xb0
> > > [   11.564476]  __x64_sys_exit_group+0xf/0x10
> > > [   11.565512]  do_syscall_64+0x34/0x50
> > >
> > > and we have info->freed_tables set.
> > >
> > > What are the architectures that have large systems like?
> > >
> > > x86: we already zap lazies, so it should cost basically nothing to do
> > > a little loop at the end of __mmput() to make sure that no lazies are
> > > left.  If we care about paravirt performance, we could implement one
> > > of the optimizations I mentioned above to fix up the refcounts instead
> > > of sending an IPI to any remaining lazies.
> > >
> > > arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> > > remote flushes, so any lazy mm references will still exist after
> > > exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> > > the x86 paravirt case.  Are there large enough arm64 systems that any
> > > of this matters?
> >
> > Yes, there are large arm64 systems where performance of TLB invalidation
> > matters, but they're either niche (supercomputers) or not readily available
> > (NUMA boxes).
> >
> > But anyway, we blow away the TLB for everybody in tlb_finish_mmu() after
> > freeing the page-tables. We have an optimisation to avoid flushing if
> > we're just unmapping leaf entries when the mm is going away, but we don't
> > have a choice once we get to actually reclaiming the page-tables.
> >
> > One thing I probably should mention, though, is that we don't maintain
> > mm_cpumask() because we're not able to benefit from it and the atomic
> > update is a waste of time.
> 
> Do you do anything special for lazy TLB or do you just use the generic
> code?  (i.e. where do your user pagetables point when you go from a
> user task to idle or to a kernel thread?)

We don't do anything special (there's something funny with the PAN emulation
but you can ignore that); the page-table just points wherever it did before
for userspace. Switching explicitly to the init_mm, however, causes us to
unmap userspace entirely.

Since we have ASIDs, switch_mm() generally doesn't have to care about the
TLBs at all.

> Do you end up with all cpus set in mm_cpumask or can you have the mm
> loaded on a CPU that isn't in mm_cpumask?

I think the mask is always zero (we never set anything in there).

Will


Re: [PATCH kernel v2] powerpc/pci: Remove LSI mappings on device teardown

2020-12-01 Thread Alexey Kardashevskiy

On 01/12/2020 20:31, Cédric Le Goater wrote:

On 12/1/20 8:39 AM, Alexey Kardashevskiy wrote:

From: Oliver O'Halloran 

When a passthrough IO adapter is removed from a pseries machine using hash
MMU and the XIVE interrupt mode, the POWER hypervisor expects the guest OS
to clear all page table entries related to the adapter. If some are still
present, the RTAS call which isolates the PCI slot returns error 9001
"valid outstanding translations" and the removal of the IO adapter fails.
This is because when the PHBs are scanned, Linux maps automatically the
INTx interrupts in the Linux interrupt number space but these are never
removed.

This problem can be fixed by adding the corresponding unmap operation when
the device is removed. There's no pcibios_* hook for the remove case, but
the same effect can be achieved using a bus notifier.

Because INTx are shared among PHBs (and potentially across the system),
this adds tracking of virq to unmap them only when the last user is gone.

Signed-off-by: Oliver O'Halloran 
[aik: added refcounter]
Signed-off-by: Alexey Kardashevskiy 


Looks good to me and the system survives all the PCI hotplug tests I used
to do on my first attempts to fix this issue.

One comment below,


---


Doing this in the generic irq code is just too much for my small brain :-/


may be more cleanups are required in the PCI/MSI/IRQ PPC layers before
considering your first approach. You think too much in advance  !



---
  arch/powerpc/kernel/pci-common.c | 71 
  1 file changed, 71 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index be108616a721..0acf17f17253 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -353,6 +353,55 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
return NULL;
  }
  
+struct pci_intx_virq {

+   int virq;
+   struct kref kref;
+   struct list_head list_node;
+};
+
+static LIST_HEAD(intx_list);
+static DEFINE_MUTEX(intx_mutex);
+
+static void ppc_pci_intx_release(struct kref *kref)
+{
+	struct pci_intx_virq *vi = container_of(kref, struct pci_intx_virq, kref);
+
+	list_del(&vi->list_node);
+   irq_dispose_mapping(vi->virq);
+   kfree(vi);
+}
+
+static int ppc_pci_unmap_irq_line(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct pci_dev *pdev = to_pci_dev(data);
+
+   if (action == BUS_NOTIFY_DEL_DEVICE) {
+   struct pci_intx_virq *vi;
+
+		mutex_lock(&intx_mutex);
+		list_for_each_entry(vi, &intx_list, list_node) {
+			if (vi->virq == pdev->irq) {
+				kref_put(&vi->kref, ppc_pci_intx_release);
+				break;
+			}
+		}
+		mutex_unlock(&intx_mutex);
+   }
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block ppc_pci_unmap_irq_notifier = {
+   .notifier_call = ppc_pci_unmap_irq_line,
+};
+
+static int ppc_pci_register_irq_notifier(void)
+{
+	return bus_register_notifier(&pci_bus_type, &ppc_pci_unmap_irq_notifier);
+}
+arch_initcall(ppc_pci_register_irq_notifier);
+
  /*
   * Reads the interrupt pin to determine if interrupt is use by card.
   * If the interrupt is used, then gets the interrupt line from the
@@ -361,6 +410,12 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
  static int pci_read_irq_line(struct pci_dev *pci_dev)
  {
int virq;
+   struct pci_intx_virq *vi, *vitmp;
+
+   /* Preallocate vi as rewind is complex if this fails after mapping */


AFAICT, we only need to call irq_dispose_mapping() if allocation fails.


Today - yes but in the future (hierarchical domains or whatever other 
awesome thing we'll use from there) - not necessarily. Too much is 
hidden under irq_create_fwspec_mapping(). Thanks,

If so, it would be simpler to isolate the code in a pci_intx_register(virq)
helper and call it from pci_read_irq_line().
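
Something along these lines, hypothetically, reusing the pieces quoted
in this thread:

	/* hypothetical helper, per the suggestion above */
	static int pci_intx_register(int virq)
	{
		struct pci_intx_virq *vi, *vitmp;

		vi = kzalloc(sizeof(*vi), GFP_KERNEL);
		if (!vi)
			return -ENOMEM;

		mutex_lock(&intx_mutex);
		list_for_each_entry(vitmp, &intx_list, list_node) {
			if (vitmp->virq == virq) {
				kref_get(&vitmp->kref);
				kfree(vi);
				goto out;
			}
		}
		vi->virq = virq;
		kref_init(&vi->kref);
		list_add_tail(&vi->list_node, &intx_list);
	out:
		mutex_unlock(&intx_mutex);
		return 0;
	}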


+   vi = kzalloc(sizeof(struct pci_intx_virq), GFP_KERNEL);
+   if (!vi)
+   return -1;
  
  	pr_debug("PCI: Try to map irq for %s...\n", pci_name(pci_dev));
  
@@ -401,6 +456,22 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
  
  	pci_dev->irq = virq;
  
+	mutex_lock(&intx_mutex);
+	list_for_each_entry(vitmp, &intx_list, list_node) {
+		if (vitmp->virq == virq) {
+			kref_get(&vitmp->kref);
+			kfree(vi);
+			vi = NULL;
+			break;
+		}
+	}
+	if (vi) {
+		vi->virq = virq;
+		kref_init(&vi->kref);
+		list_add_tail(&vi->list_node, &intx_list);
+	}
+	mutex_unlock(&intx_mutex);
+
return 0;
  }
  

--
Alexey


[PATCH v2 5/5] powerpc/configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Andrey Zhizhikin
Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
unused") removed the generic_bl driver from the tree, together with the
corresponding config option.

Remove the BACKLIGHT_GENERIC config item from powernv_defconfig.

Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is 
unused")
Cc: Sam Ravnborg 
Signed-off-by: Andrey Zhizhikin 
Reviewed-by: Krzysztof Kozlowski 
Acked-by: Daniel Thompson 
Acked-by: Sam Ravnborg 
Acked-by: Michael Ellerman 
---
 arch/powerpc/configs/powernv_defconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/configs/powernv_defconfig 
b/arch/powerpc/configs/powernv_defconfig
index cf30fc24413b..60a30fffeda0 100644
--- a/arch/powerpc/configs/powernv_defconfig
+++ b/arch/powerpc/configs/powernv_defconfig
@@ -208,7 +208,6 @@ CONFIG_FB_MATROX_G=y
 CONFIG_FB_RADEON=m
 CONFIG_FB_IBM_GXT4500=m
 CONFIG_LCD_PLATFORM=m
-CONFIG_BACKLIGHT_GENERIC=m
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_LOGO=y
 CONFIG_HID_A4TECH=m
-- 
2.17.1



[PATCH v2 4/5] parisc: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Andrey Zhizhikin
Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
unused") removed the generic_bl driver from the tree, together with the
corresponding config option.

Remove the BACKLIGHT_GENERIC config item from generic-64bit_defconfig.

Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is 
unused")
Cc: Sam Ravnborg 
Signed-off-by: Andrey Zhizhikin 
Reviewed-by: Krzysztof Kozlowski 
Acked-by: Daniel Thompson 
Acked-by: Sam Ravnborg 
---
 arch/parisc/configs/generic-64bit_defconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/parisc/configs/generic-64bit_defconfig 
b/arch/parisc/configs/generic-64bit_defconfig
index 7e2d7026285e..8f81fcbf04c4 100644
--- a/arch/parisc/configs/generic-64bit_defconfig
+++ b/arch/parisc/configs/generic-64bit_defconfig
@@ -191,7 +191,6 @@ CONFIG_DRM=y
 CONFIG_DRM_RADEON=y
 CONFIG_FIRMWARE_EDID=y
 CONFIG_FB_MODE_HELPERS=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
 CONFIG_HIDRAW=y
 CONFIG_HID_PID=y
-- 
2.17.1



[PATCH v2 3/5] MIPS: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Andrey Zhizhikin
Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
unused") removed the generic_bl driver from the tree, together with the
corresponding config option.

Remove the BACKLIGHT_GENERIC config item from all MIPS configurations.

Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is 
unused")
Cc: Sam Ravnborg 
Signed-off-by: Andrey Zhizhikin 
Reviewed-by: Krzysztof Kozlowski 
Acked-by: Daniel Thompson 
Acked-by: Sam Ravnborg 
---
 arch/mips/configs/gcw0_defconfig  | 1 -
 arch/mips/configs/gpr_defconfig   | 1 -
 arch/mips/configs/lemote2f_defconfig  | 1 -
 arch/mips/configs/loongson3_defconfig | 1 -
 arch/mips/configs/mtx1_defconfig  | 1 -
 arch/mips/configs/rs90_defconfig  | 1 -
 6 files changed, 6 deletions(-)

diff --git a/arch/mips/configs/gcw0_defconfig b/arch/mips/configs/gcw0_defconfig
index 7e28a4fe9d84..460683b52285 100644
--- a/arch/mips/configs/gcw0_defconfig
+++ b/arch/mips/configs/gcw0_defconfig
@@ -73,7 +73,6 @@ CONFIG_DRM_PANEL_NOVATEK_NT39016=y
 CONFIG_DRM_INGENIC=y
 CONFIG_DRM_ETNAVIV=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_BACKLIGHT_PWM=y
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
diff --git a/arch/mips/configs/gpr_defconfig b/arch/mips/configs/gpr_defconfig
index 9085f4d6c698..87e20f3391ed 100644
--- a/arch/mips/configs/gpr_defconfig
+++ b/arch/mips/configs/gpr_defconfig
@@ -251,7 +251,6 @@ CONFIG_SSB_DRIVER_PCICORE=y
 # CONFIG_VGA_ARB is not set
 # CONFIG_LCD_CLASS_DEVICE is not set
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_USB_HID=m
 CONFIG_USB_HIDDEV=y
diff --git a/arch/mips/configs/lemote2f_defconfig b/arch/mips/configs/lemote2f_defconfig
index 3a9a453b1264..688c91918db2 100644
--- a/arch/mips/configs/lemote2f_defconfig
+++ b/arch/mips/configs/lemote2f_defconfig
@@ -145,7 +145,6 @@ CONFIG_FB_SIS_300=y
 CONFIG_FB_SIS_315=y
 # CONFIG_LCD_CLASS_DEVICE is not set
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-CONFIG_BACKLIGHT_GENERIC=m
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
diff --git a/arch/mips/configs/loongson3_defconfig b/arch/mips/configs/loongson3_defconfig
index 38a817ead8e7..9c5fadef38cb 100644
--- a/arch/mips/configs/loongson3_defconfig
+++ b/arch/mips/configs/loongson3_defconfig
@@ -286,7 +286,6 @@ CONFIG_DRM_VIRTIO_GPU=y
 CONFIG_FB_RADEON=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_LCD_PLATFORM=m
-CONFIG_BACKLIGHT_GENERIC=m
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
diff --git a/arch/mips/configs/mtx1_defconfig b/arch/mips/configs/mtx1_defconfig
index 914af125a7fa..0ef2373404e5 100644
--- a/arch/mips/configs/mtx1_defconfig
+++ b/arch/mips/configs/mtx1_defconfig
@@ -450,7 +450,6 @@ CONFIG_WDT_MTX1=y
 # CONFIG_VGA_ARB is not set
 # CONFIG_LCD_CLASS_DEVICE is not set
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_SOUND=m
 CONFIG_SND=m
diff --git a/arch/mips/configs/rs90_defconfig b/arch/mips/configs/rs90_defconfig
index dfbb9fed9a42..4f540bb94628 100644
--- a/arch/mips/configs/rs90_defconfig
+++ b/arch/mips/configs/rs90_defconfig
@@ -97,7 +97,6 @@ CONFIG_DRM_FBDEV_OVERALLOC=300
 CONFIG_DRM_PANEL_SIMPLE=y
 CONFIG_DRM_INGENIC=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_BACKLIGHT_PWM=y
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
-- 
2.17.1



[PATCH v2 2/5] arm64: defconfig: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Andrey Zhizhikin
Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
unused") removed the generic_bl driver from the tree, together with the
corresponding config option.

Remove the BACKLIGHT_GENERIC config item from the arm64 defconfig.

Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is unused")
Cc: Sam Ravnborg 
Signed-off-by: Andrey Zhizhikin 
Reviewed-by: Krzysztof Kozlowski 
Acked-by: Daniel Thompson 
Acked-by: Sam Ravnborg 
---
 arch/arm64/configs/defconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 8e3f7ae71de5..280ed7404a1d 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -681,7 +681,6 @@ CONFIG_DRM_PANFROST=m
 CONFIG_FB=y
 CONFIG_FB_MODE_HELPERS=y
 CONFIG_FB_EFI=y
-CONFIG_BACKLIGHT_GENERIC=m
 CONFIG_BACKLIGHT_PWM=m
 CONFIG_BACKLIGHT_LP855X=m
 CONFIG_LOGO=y
-- 
2.17.1



[PATCH v2 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Andrey Zhizhikin
Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
unused") removed the generic_bl driver from the tree, together with the
corresponding config option.

Remove the BACKLIGHT_GENERIC config item from all ARM configurations.

Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is unused")
Cc: Sam Ravnborg 
Signed-off-by: Andrey Zhizhikin 
Reviewed-by: Krzysztof Kozlowski 
Acked-by: Alexandre Belloni 
Acked-by: Daniel Thompson 
Acked-by: Sam Ravnborg 
---
 arch/arm/configs/at91_dt_defconfig| 1 -
 arch/arm/configs/cm_x300_defconfig| 1 -
 arch/arm/configs/colibri_pxa300_defconfig | 1 -
 arch/arm/configs/jornada720_defconfig | 1 -
 arch/arm/configs/magician_defconfig   | 1 -
 arch/arm/configs/mini2440_defconfig   | 1 -
 arch/arm/configs/omap2plus_defconfig  | 1 -
 arch/arm/configs/pxa3xx_defconfig | 1 -
 arch/arm/configs/qcom_defconfig   | 1 -
 arch/arm/configs/sama5_defconfig  | 1 -
 arch/arm/configs/sunxi_defconfig  | 1 -
 arch/arm/configs/tegra_defconfig  | 1 -
 arch/arm/configs/u8500_defconfig  | 1 -
 13 files changed, 13 deletions(-)

diff --git a/arch/arm/configs/at91_dt_defconfig b/arch/arm/configs/at91_dt_defconfig
index 4a0ba2ae1a25..6e52c9c965e6 100644
--- a/arch/arm/configs/at91_dt_defconfig
+++ b/arch/arm/configs/at91_dt_defconfig
@@ -132,7 +132,6 @@ CONFIG_DRM_ATMEL_HLCDC=y
 CONFIG_DRM_PANEL_SIMPLE=y
 CONFIG_FB_ATMEL=y
 CONFIG_BACKLIGHT_ATMEL_LCDC=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_BACKLIGHT_PWM=y
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_LOGO=y
diff --git a/arch/arm/configs/cm_x300_defconfig b/arch/arm/configs/cm_x300_defconfig
index 2f7acde2d921..502a9d870ca4 100644
--- a/arch/arm/configs/cm_x300_defconfig
+++ b/arch/arm/configs/cm_x300_defconfig
@@ -87,7 +87,6 @@ CONFIG_FB=y
 CONFIG_FB_PXA=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_LCD_TDO24M=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_BACKLIGHT_DA903X=m
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
diff --git a/arch/arm/configs/colibri_pxa300_defconfig b/arch/arm/configs/colibri_pxa300_defconfig
index 0dae3b185284..26e5a67f8e2d 100644
--- a/arch/arm/configs/colibri_pxa300_defconfig
+++ b/arch/arm/configs/colibri_pxa300_defconfig
@@ -34,7 +34,6 @@ CONFIG_FB=y
 CONFIG_FB_PXA=y
 # CONFIG_LCD_CLASS_DEVICE is not set
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_LOGO=y
diff --git a/arch/arm/configs/jornada720_defconfig b/arch/arm/configs/jornada720_defconfig
index 9f079be2b84b..069f60ffdcd8 100644
--- a/arch/arm/configs/jornada720_defconfig
+++ b/arch/arm/configs/jornada720_defconfig
@@ -48,7 +48,6 @@ CONFIG_FB=y
 CONFIG_FB_S1D13XXX=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
diff --git a/arch/arm/configs/magician_defconfig b/arch/arm/configs/magician_defconfig
index d2e684f6565a..b4670d42f378 100644
--- a/arch/arm/configs/magician_defconfig
+++ b/arch/arm/configs/magician_defconfig
@@ -95,7 +95,6 @@ CONFIG_FB_PXA_OVERLAY=y
 CONFIG_FB_W100=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_BACKLIGHT_PWM=y
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
diff --git a/arch/arm/configs/mini2440_defconfig b/arch/arm/configs/mini2440_defconfig
index 301f29a1fcc3..898490aaa39e 100644
--- a/arch/arm/configs/mini2440_defconfig
+++ b/arch/arm/configs/mini2440_defconfig
@@ -158,7 +158,6 @@ CONFIG_FB_S3C2410=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_LCD_PLATFORM=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_BACKLIGHT_PWM=y
 CONFIG_FRAMEBUFFER_CONSOLE=y
 CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
diff --git a/arch/arm/configs/omap2plus_defconfig b/arch/arm/configs/omap2plus_defconfig
index de3b7813a1ce..7eae097a75d2 100644
--- a/arch/arm/configs/omap2plus_defconfig
+++ b/arch/arm/configs/omap2plus_defconfig
@@ -388,7 +388,6 @@ CONFIG_FB_TILEBLITTING=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_LCD_PLATFORM=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-CONFIG_BACKLIGHT_GENERIC=m
 CONFIG_BACKLIGHT_PWM=m
 CONFIG_BACKLIGHT_PANDORA=m
 CONFIG_BACKLIGHT_GPIO=m
diff --git a/arch/arm/configs/pxa3xx_defconfig b/arch/arm/configs/pxa3xx_defconfig
index 06bbc7a59b60..f0c34017f2aa 100644
--- a/arch/arm/configs/pxa3xx_defconfig
+++ b/arch/arm/configs/pxa3xx_defconfig
@@ -74,7 +74,6 @@ CONFIG_FB_PXA=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_LCD_TDO24M=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
-# CONFIG_BACKLIGHT_GENERIC is not set
 CONFIG_BACKLIGHT_DA903X=y
 # CONFIG_VGA_CONSOLE is not set
 CONFIG_FRAMEBUFFER_CONSOLE=y
diff --git a/arch/arm/configs/qcom_defconfig b/arch/arm/configs/qcom_defconfig
index c882167e1496..d6733e745b80 100644
--- a/arch/arm/configs/qcom_defconfig
+++ 

[PATCH v2 0/5] drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Andrey Zhizhikin
Since the removal of the generic_bl driver from the source tree in commit
7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
unused"), the BACKLIGHT_GENERIC config option became obsolete as well and
is therefore subject to clean-up in all configuration files.

This series introduces patches to address this removal, separated by
architecture in the kernel tree.

Changes in v2:
- Collect all Acked-by: and Reviewed-by: tags
- Include ARM SOC maintainer list in the recipients

Andrey Zhizhikin (5):
  ARM: configs: drop unused BACKLIGHT_GENERIC option
  arm64: defconfig: drop unused BACKLIGHT_GENERIC option
  MIPS: configs: drop unused BACKLIGHT_GENERIC option
  parisc: configs: drop unused BACKLIGHT_GENERIC option
  powerpc/configs: drop unused BACKLIGHT_GENERIC option

 arch/arm/configs/at91_dt_defconfig  | 1 -
 arch/arm/configs/cm_x300_defconfig  | 1 -
 arch/arm/configs/colibri_pxa300_defconfig   | 1 -
 arch/arm/configs/jornada720_defconfig   | 1 -
 arch/arm/configs/magician_defconfig | 1 -
 arch/arm/configs/mini2440_defconfig | 1 -
 arch/arm/configs/omap2plus_defconfig| 1 -
 arch/arm/configs/pxa3xx_defconfig   | 1 -
 arch/arm/configs/qcom_defconfig | 1 -
 arch/arm/configs/sama5_defconfig| 1 -
 arch/arm/configs/sunxi_defconfig| 1 -
 arch/arm/configs/tegra_defconfig| 1 -
 arch/arm/configs/u8500_defconfig| 1 -
 arch/arm64/configs/defconfig| 1 -
 arch/mips/configs/gcw0_defconfig| 1 -
 arch/mips/configs/gpr_defconfig | 1 -
 arch/mips/configs/lemote2f_defconfig| 1 -
 arch/mips/configs/loongson3_defconfig   | 1 -
 arch/mips/configs/mtx1_defconfig| 1 -
 arch/mips/configs/rs90_defconfig| 1 -
 arch/parisc/configs/generic-64bit_defconfig | 1 -
 arch/powerpc/configs/powernv_defconfig  | 1 -
 22 files changed, 22 deletions(-)


base-commit: b65054597872ce3aefbc6a666385eabdf9e288da
-- 
2.17.1



Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-01 Thread Andy Lutomirski
On Tue, Dec 1, 2020 at 1:28 PM Will Deacon  wrote:
>
> On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> > other arch folk: there's some background here:
> >
> > https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
> >
> > On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
> > >
> > > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> > > >
> > > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  
> > > > wrote:
> > > > >
> > > > > On big systems, the mm refcount can become highly contended when doing
> > > > > a lot of context switching with threaded applications (particularly
> > > > > switching between the idle thread and an application thread).
> > > > >
> > > > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > > > user->idle->user cases, so instead implement a non-refcounted 
> > > > > scheme
> > > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot 
> > > > > down
> > > > > any remaining lazy ones.
> > > > >
> > > > > Shootdown IPIs are some concern, but they have not been observed to be
> > > > > a big problem with this scheme (the powerpc implementation generated
> > > > > 314 additional interrupts on a 144 CPU system during a kernel 
> > > > > compile).
> > > > > There are a number of strategies that could be employed to reduce IPIs
> > > > > if they turn out to be a problem for some workload.
> > > >
> > > > I'm still wondering whether we can do even better.
> > > >
> > >
> > > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> > > the TLB.  On x86, this will shoot down all lazies as long as even a
> > > single pagetable was freed.  (Or at least it will if we don't have a
> > > serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> > > sets tlb->freed_tables, which will trigger the IPI.)  So, on
> > > architectures like x86, the shootdown approach should be free.  The
> > > only way it ought to have any excess IPIs is if we have CPUs in
> > > mm_cpumask() that don't need IPI to free pagetables, which could
> > > happen on paravirt.
> >
> > Indeed, on x86, we do this:
> >
> > [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> > [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> > [   11.561068]  exit_mmap+0xc8/0x1a0
> > [   11.561932]  mmput+0x29/0xd0
> > [   11.562688]  do_exit+0x316/0xa90
> > [   11.563588]  do_group_exit+0x34/0xb0
> > [   11.564476]  __x64_sys_exit_group+0xf/0x10
> > [   11.565512]  do_syscall_64+0x34/0x50
> >
> > and we have info->freed_tables set.
> >
> > What are the architectures that have large systems like?
> >
> > x86: we already zap lazies, so it should cost basically nothing to do
> > a little loop at the end of __mmput() to make sure that no lazies are
> > left.  If we care about paravirt performance, we could implement one
> > of the optimizations I mentioned above to fix up the refcounts instead
> > of sending an IPI to any remaining lazies.
> >
> > arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> > remote flushes, so any lazy mm references will still exist after
> > exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> > the x86 paravirt case.  Are there large enough arm64 systems that any
> > of this matters?
>
> Yes, there are large arm64 systems where performance of TLB invalidation
> matters, but they're either niche (supercomputers) or not readily available
> (NUMA boxes).
>
> But anyway, we blow away the TLB for everybody in tlb_finish_mmu() after
> freeing the page-tables. We have an optimisation to avoid flushing if
> we're just unmapping leaf entries when the mm is going away, but we don't
> have a choice once we get to actually reclaiming the page-tables.
>
> One thing I probably should mention, though, is that we don't maintain
> mm_cpumask() because we're not able to benefit from it and the atomic
> update is a waste of time.

Do you do anything special for lazy TLB or do you just use the generic
code?  (i.e. where do your user pagetables point when you go from a
user task to idle or to a kernel thread?)

Do you end up with all cpus set in mm_cpumask or can you have the mm
loaded on a CPU that isn't in mm_cpumask?

--Andy

>
> Will
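
[To make the scheme under discussion concrete, here is a minimal sketch of what "shoot lazies" amounts to. The helper names are illustrative; this is not Nicholas's actual patch, just the shape of the idea:]

#include <linux/sched.h>
#include <linux/smp.h>
#include <asm/mmu_context.h>

/* Runs on each CPU that may still hold a lazy reference to @arg. */
static void do_shoot_lazy_tlb(void *arg)
{
	struct mm_struct *mm = arg;

	if (current->active_mm == mm) {
		WARN_ON_ONCE(current->mm);	/* must be a lazy (kernel) user */
		current->active_mm = &init_mm;
		switch_mm(mm, &init_mm, current);
	}
}

/* Called once the last real reference to @mm is gone (e.g. __mmdrop()). */
static void shoot_lazy_tlbs(struct mm_struct *mm)
{
	/* wait=1: no CPU may still be using mm when this returns */
	on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, mm, 1);
}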


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-01 Thread Will Deacon
On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> other arch folk: there's some background here:
> 
> https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
> 
> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
> >
> > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> > >
> > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
> > > >
> > > > On big systems, the mm refcount can become highly contended when doing
> > > > a lot of context switching with threaded applications (particularly
> > > > switching between the idle thread and an application thread).
> > > >
> > > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > > user->idle->user cases, so instead implement a non-refcounted scheme
> > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> > > > any remaining lazy ones.
> > > >
> > > > Shootdown IPIs are some concern, but they have not been observed to be
> > > > a big problem with this scheme (the powerpc implementation generated
> > > > 314 additional interrupts on a 144 CPU system during a kernel compile).
> > > > There are a number of strategies that could be employed to reduce IPIs
> > > > if they turn out to be a problem for some workload.
> > >
> > > I'm still wondering whether we can do even better.
> > >
> >
> > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> > the TLB.  On x86, this will shoot down all lazies as long as even a
> > single pagetable was freed.  (Or at least it will if we don't have a
> > serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> > sets tlb->freed_tables, which will trigger the IPI.)  So, on
> > architectures like x86, the shootdown approach should be free.  The
> > only way it ought to have any excess IPIs is if we have CPUs in
> > mm_cpumask() that don't need IPI to free pagetables, which could
> > happen on paravirt.
> 
> Indeed, on x86, we do this:
> 
> [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> [   11.561068]  exit_mmap+0xc8/0x1a0
> [   11.561932]  mmput+0x29/0xd0
> [   11.562688]  do_exit+0x316/0xa90
> [   11.563588]  do_group_exit+0x34/0xb0
> [   11.564476]  __x64_sys_exit_group+0xf/0x10
> [   11.565512]  do_syscall_64+0x34/0x50
> 
> and we have info->freed_tables set.
> 
> What are the architectures that have large systems like?
> 
> x86: we already zap lazies, so it should cost basically nothing to do
> a little loop at the end of __mmput() to make sure that no lazies are
> left.  If we care about paravirt performance, we could implement one
> of the optimizations I mentioned above to fix up the refcounts instead
> of sending an IPI to any remaining lazies.
> 
> arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> remote flushes, so any lazy mm references will still exist after
> exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> the x86 paravirt case.  Are there large enough arm64 systems that any
> of this matters?

Yes, there are large arm64 systems where performance of TLB invalidation
matters, but they're either niche (supercomputers) or not readily available
(NUMA boxes).

But anyway, we blow away the TLB for everybody in tlb_finish_mmu() after
freeing the page-tables. We have an optimisation to avoid flushing if
we're just unmapping leaf entries when the mm is going away, but we don't
have a choice once we get to actually reclaiming the page-tables.

One thing I probably should mention, though, is that we don't maintain
mm_cpumask() because we're not able to benefit from it and the atomic
update is a waste of time.

Will


Re: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Krzysztof Kozlowski
On Tue, Dec 01, 2020 at 04:50:22PM +0100, Arnd Bergmann wrote:
> On Tue, Dec 1, 2020 at 4:41 PM Alexandre Belloni
>  wrote:
> > On 01/12/2020 14:40:53+, Catalin Marinas wrote:
> > > On Mon, Nov 30, 2020 at 07:50:25PM +, ZHIZHIKIN Andrey wrote:
> > > > From Krzysztof Kozlowski :
> 
> > > I tried to convince them before, it didn't work. I guess they don't like
> > > to be spammed ;).
> >
> > The first rule of arm-soc is: you do not talk about arm@ and soc@
> 
> I don't mind having the addresses documented better, but it needs to
> be done in a way that avoids having any patch for arch/arm*/boot/dts
> and arch/arm/*/configs Cc:d to s...@kernel.org.
> 
> If anyone has suggestions for how to do that, let me know.

Not a perfect solution but something. How about:
https://lore.kernel.org/linux-arm-kernel/20201201211516.24921-2-k...@kernel.org/T/#u

Would not work on defconfigs, but there is a chance someone will find
your addresses this way. It should not cause too much additional traffic.

Best regards,
Krzysztof



Re: CONFIG_PPC_VAS depends on 64k pages...?

2020-12-01 Thread Carlos Eduardo de Paula
On Tue, Dec 1, 2020 at 2:54 AM Sukadev Bhattiprolu 
wrote:

>
> Christophe Leroy [christophe.le...@csgroup.eu] wrote:
> > Hi,
> >
> > Le 19/11/2020 à 11:58, Will Springer a écrit :
> > > I learned about the POWER9 gzip accelerator a few months ago when the
> > > support hit upstream Linux 5.8. However, for some reason the Kconfig
> > > dictates that VAS depends on a 64k page size, which is problematic as I
> > > run Void Linux, which uses a 4k-page kernel.
> > >
> > > Some early poking by others indicated there wasn't an obvious page size
> > > dependency in the code, and suggested I try modifying the config to switch
> > > it on. I did so, but was stopped by a minor complaint of an "unexpected DT
> > > configuration" by the VAS code. I wasn't equipped to figure out exactly what
> > > this meant, even after finding the offending condition, so after writing a
> > > very drawn-out forum post asking for help, I dropped the subject.
> > >
> > > Fast forward to today, when I was reminded of the whole thing again, and
> > > decided to debug a bit further. Apparently the VAS platform device
> > > (derived from the DT node) has 5 resources on my 4k kernel, instead of 4
> > > (which evidently works for others who have had success on 64k kernels). I
> > > have no idea what this means in practice (I don't know how to introspect
> > > it), but after making a tiny patch[1], everything came up smoothly and I
> > > was doing blazing-fast gzip (de)compression in no time.
> > >
> > > Everything seems to work fine on 4k pages. So, what's up? Are there
> > > pitfalls lurking around that I've yet to stumble over? More reasonably,
> > > I'm curious as to why the feature supposedly depends on 64k pages, or if
> > > there's anything else I should be concerned about.
>
> Will,
>
> The reason I put in that config check is because we were only able to
> test 64K pages at that point.
>
> It is interesting that it is working for you. Following code in skiboot
> https://github.com/open-power/skiboot/blob/master/hw/vas.c should restrict
> it to 64K pages. IIRC there is also a corresponding change in some NX
> registers that should also be configured to allow 4K pages.
>
>
> static int init_north_ctl(struct proc_chip *chip)
> {
> uint64_t val = 0ULL;
>
> val = SETFIELD(VAS_64K_MODE_MASK, val, true);
> val = SETFIELD(VAS_ACCEPT_PASTE_MASK, val, true);
> val = SETFIELD(VAS_ENABLE_WC_MMIO_BAR, val, true);
> val = SETFIELD(VAS_ENABLE_UWC_MMIO_BAR, val, true);
> val = SETFIELD(VAS_ENABLE_RMA_MMIO_BAR, val, true);
>
> return vas_scom_write(chip, VAS_MISC_N_CTL, val);
> }
>
> I am copying Bulent Albali and Haren Myneni who have been working with
> VAS/NX for their thoughts/experience.
>
> > >
> >
> > Maybe ask Sukadev who did the implementation and is maintaining it ?
> >
> > > I do have to say I'm quite satisfied with the results of the NX
> > > accelerator, though. Being able to shuffle data to a RaptorCS box over gigE
> > > and get compressed data back faster than most software gzip could ever
> > > hope to achieve is no small feat, let alone the instantaneous results locally.
> > > :)
> > >
> > > Cheers,
> > > Will Springer [she/her]
> > >
> > > [1]: https://github.com/Skirmisher/void-packages/blob/vas-4k-pages/srcpkgs/linux5.9/patches/ppc-vas-on-4k.patch
> > >
> >
> >
> > Christophe
>

Hi all, I'd like to report that with Will's patch, I'm using NX-Gzip
perfectly on Linux 5.9.10 built with 4K pages and no changes on firmware in
a Raptor Computing Blackbird workstation.

I'm using Debian 10 distro.

Ref. https://twitter.com/carlosedp/status/1328424799216021511

Carlos


-- 

Carlos Eduardo de Paula
m...@carlosedp.com
http://carlosedp.com
https://twitter.com/carlosedp
https://www.linkedin.com/in/carlosedp/



Re: CONFIG_PPC_VAS depends on 64k pages...?

2020-12-01 Thread Bulent Abali
I don't know anything about VAS page size requirements in the kernel.  I 
checked the user compression library and saw that we do a sysconf to get 
the page size; so the library should be immune to page size by design.
But it wouldn't surprise me if a 64KB constant is inadvertently hardcoded
somewhere else in the library. Giving a heads-up to Tulio and Raphael, who
are the owners of the github repo.

https://github.com/libnxz/power-gzip/blob/master/lib/nx_zlib.c#L922

If we got this wrong in the library, it might manifest itself as an error
message of the sort "excessive page faults". The library must touch pages
ahead of time to make them present in memory; occasional page faults are
acceptable. It will retry.
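
[As a user-space illustration of the "touch pages ahead" technique Bulent describes; this is a sketch, not the actual libnxz code:]

#include <stddef.h>
#include <unistd.h>

/*
 * Fault in every page of a buffer before handing it to the NX engine,
 * querying the page size at runtime rather than hardcoding 64KB.
 * Illustrative only; the real logic lives in libnxz's nx_zlib.c.
 */
static void touch_pages(char *buf, size_t len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t off;

	for (off = 0; off < len; off += page_size)
		*(volatile char *)(buf + off);		/* read faults the page in */
	if (len)
		*(volatile char *)(buf + len - 1);	/* catch the tail page */
}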


Bulent




From:   "Sukadev Bhattiprolu" 
To: "Christophe Leroy" 
Cc: "Will Springer" , 
linuxppc-dev@lists.ozlabs.org, dan...@octaforge.org, Bulent 
Abali/Watson/IBM@IBM, ha...@linux.ibm.com
Date:   12/01/2020 12:53 AM
Subject:Re: CONFIG_PPC_VAS depends on 64k pages...?




Christophe Leroy [christophe.le...@csgroup.eu] wrote:
> Hi,
> 
> Le 19/11/2020 à 11:58, Will Springer a écrit :
> > I learned about the POWER9 gzip accelerator a few months ago when the
> > support hit upstream Linux 5.8. However, for some reason the Kconfig
> > dictates that VAS depends on a 64k page size, which is problematic as I
> > run Void Linux, which uses a 4k-page kernel.
> > 
> > Some early poking by others indicated there wasn't an obvious page size
> > dependency in the code, and suggested I try modifying the config to switch
> > it on. I did so, but was stopped by a minor complaint of an "unexpected DT
> > configuration" by the VAS code. I wasn't equipped to figure out exactly what
> > this meant, even after finding the offending condition, so after writing a
> > very drawn-out forum post asking for help, I dropped the subject.
> > 
> > Fast forward to today, when I was reminded of the whole thing again, and
> > decided to debug a bit further. Apparently the VAS platform device
> > (derived from the DT node) has 5 resources on my 4k kernel, instead of 4
> > (which evidently works for others who have had success on 64k kernels). I
> > have no idea what this means in practice (I don't know how to introspect
> > it), but after making a tiny patch[1], everything came up smoothly and I
> > was doing blazing-fast gzip (de)compression in no time.
> > 
> > Everything seems to work fine on 4k pages. So, what's up? Are there
> > pitfalls lurking around that I've yet to stumble over? More reasonably,
> > I'm curious as to why the feature supposedly depends on 64k pages, or if
> > there's anything else I should be concerned about.

Will,

The reason I put in that config check is because we were only able to
test 64K pages at that point.

It is interesting that it is working for you. Following code in skiboot
https://github.com/open-power/skiboot/blob/master/hw/vas.c should restrict
it to 64K pages. IIRC there is also a corresponding change in some NX
registers that should also be configured to allow 4K pages.

static int init_north_ctl(struct proc_chip *chip)
{
	uint64_t val = 0ULL;

	val = SETFIELD(VAS_64K_MODE_MASK, val, true);
	val = SETFIELD(VAS_ACCEPT_PASTE_MASK, val, true);
	val = SETFIELD(VAS_ENABLE_WC_MMIO_BAR, val, true);
	val = SETFIELD(VAS_ENABLE_UWC_MMIO_BAR, val, true);
	val = SETFIELD(VAS_ENABLE_RMA_MMIO_BAR, val, true);

	return vas_scom_write(chip, VAS_MISC_N_CTL, val);
}

I am copying Bulent Albali and Haren Myneni who have been working with
VAS/NX for their thoughts/experience.

> > 
> 
> Maybe ask Sukadev who did the implementation and is maintaining it ?
> 
> > I do have to say I'm quite satisfied with the results of the NX
> > accelerator, though. Being able to shuffle data to a RaptorCS box over gigE
> > and get compressed data back faster than most software gzip could ever
> > hope to achieve is no small feat, let alone the instantaneous results locally.
> > :)
> > 
> > Cheers,
> > Will Springer [she/her]
> > 
> > [1]: https://github.com/Skirmisher/void-packages/blob/vas-4k-pages/srcpkgs/linux5.9/patches/ppc-vas-on-4k.patch
> > 
> 
> 
> Christophe






Re: [PATCH v2 2/2] kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1

2020-12-01 Thread Kees Cook
On Tue, Dec 01, 2020 at 10:31:37PM +0900, Masahiro Yamada wrote:
> On Wed, Nov 25, 2020 at 7:22 AM Kees Cook  wrote:
> >
> > On Thu, Nov 19, 2020 at 01:13:27PM -0800, Nick Desaulniers wrote:
> > > On Thu, Nov 19, 2020 at 12:57 PM Nathan Chancellor
> > >  wrote:
> > > >
> > > > ld.lld 10.0.1 spews a bunch of various warnings about .rela sections,
> > > > along with a few others. Newer versions of ld.lld do not have these
> > > > warnings. As a result, do not add '--orphan-handling=warn' to
> > > > LDFLAGS_vmlinux if ld.lld's version is not new enough.
> > > >
> > > > Link: https://github.com/ClangBuiltLinux/linux/issues/1187
> > > > Link: https://github.com/ClangBuiltLinux/linux/issues/1193
> > > > Reported-by: Arvind Sankar 
> > > > Reported-by: kernelci.org bot 
> > > > Reported-by: Mark Brown 
> > > > Reviewed-by: Kees Cook 
> > > > Signed-off-by: Nathan Chancellor 
> > >
> > > Thanks for the additions in v2.
> > > Reviewed-by: Nick Desaulniers 
> >
> > I'm going to carry this for a few days in -next, and if no one screams,
> > ask Linus to pull it for v5.10-rc6.
> >
> > Thanks!
> >
> > --
> > Kees Cook
> 
> 
> Sorry for the delay.
> Applied to linux-kbuild.

Great, thanks!

> But, I already see this in linux-next.
> Please let me know if I should drop it from my tree.

My intention was to get this to Linus this week. Do you want to do that
yourself, or Ack the patches in my tree and I'll send it?

-Kees

-- 
Kees Cook


Re: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Arnd Bergmann
On Tue, Dec 1, 2020 at 8:48 PM ZHIZHIKIN Andrey
 wrote:
> Hello Arnd,
> > > > Or rather, SoC-specific patches, even to defconfig, should go
> > > > through the specific SoC maintainers. However, there are occasional
> > > > defconfig patches which are more generic or affecting multiple SoCs.
> > > > I just ignore them as the arm64 defconfig is usually handled by the
> > > > arm-soc folk (when I need a defconfig change, I go for
> > > > arch/arm64/Kconfig directly ;)).
> > >
> > > IIRC, the plan was indeed to get defconfig changes through the
> > > platform sub-trees. It is also supposed to be how multi_v5 and
> > > multi_v7 are handled and they will take care of the merge.
> >
> > For cross-platform changes like this one, I'm definitely happy to pick up
> > the patch directly from s...@kernel.org, or from the mailing list if I
> > know about it.
>
> Should I collect all Ack's and re-send this series including the list "nobody
> talks about" :), or can the series be picked up as-is?
>
> Your advice would be really welcomed here!

Yes, please do, that makes my life easier. I would apply the patches
for arch/arm and arch/arm64 when you send them to s...@kernel.org,
the others go to the respective architecture maintainers, unless they
want me to pick up the whole series.

  Arnd


RE: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread ZHIZHIKIN Andrey
Hello Arnd,

> -Original Message-
> From: Arnd Bergmann 
> Sent: Tuesday, December 1, 2020 4:50 PM
> To: Alexandre Belloni 
> Cc: Catalin Marinas ; ZHIZHIKIN Andrey
> ; Krzysztof Kozlowski
> ; li...@armlinux.org.uk; nicolas.fe...@microchip.com;
> ludovic.desroc...@microchip.com; t...@atomide.com;
> mrip...@kernel.org; w...@csie.org; jernej.skra...@siol.net;
> thierry.red...@gmail.com; jonath...@nvidia.com; w...@kernel.org;
> tsbog...@alpha.franken.de; james.bottom...@hansenpartnership.com;
> del...@gmx.de; m...@ellerman.id.au; b...@kernel.crashing.org;
> pau...@samba.org; lee.jo...@linaro.org; s...@ravnborg.org;
> emil.l.veli...@gmail.com; daniel.thomp...@linaro.org; linux-arm-
> ker...@lists.infradead.org; linux-ker...@vger.kernel.org; linux-
> o...@vger.kernel.org; linux-te...@vger.kernel.org; linux-
> m...@vger.kernel.org; linux-par...@vger.kernel.org; linuxppc-
> d...@lists.ozlabs.org; Arnd Bergmann ; Olof Johansson
> ; arm-soc 
> Subject: Re: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC
> option
> 
> 
> On Tue, Dec 1, 2020 at 4:41 PM Alexandre Belloni
>  wrote:
> > On 01/12/2020 14:40:53+, Catalin Marinas wrote:
> > > On Mon, Nov 30, 2020 at 07:50:25PM +, ZHIZHIKIN Andrey wrote:
> > > > From Krzysztof Kozlowski :
> 
> > > I tried to convince them before, it didn't work. I guess they don't
> > > like to be spammed ;).
> >
> > The first rule of arm-soc is: you do not talk about arm@ and soc@
> 
> I don't mind having the addresses documented better, but it needs to be
> done in a way that avoids having any patch for arch/arm*/boot/dts and
> arch/arm/*/configs Cc:d to s...@kernel.org.
> 
> If anyone has suggestions for how to do that, let me know.

Just as a proposal:
Maybe those addresses should at least be included in the Documentation (the
"Select the recipients for your patch" section of "Submitting patches"), much
like stable@ is. Those who familiarize themselves with it would get an idea
of which list they need to include in Cc: for such changes.

That should IMHO partially reduce the traffic on the list, since the address
would not pop up in the output of get_maintainer.pl but would at least be
documented, so contributors can follow the process.

> 
> > > Or rather, SoC-specific patches, even to defconfig, should go
> > > through the specific SoC maintainers. However, there are occasional
> > > defconfig patches which are more generic or affecting multiple SoCs.
> > > I just ignore them as the arm64 defconfig is usually handled by the
> > > arm-soc folk (when I need a defconfig change, I go for
> > > arch/arm64/Kconfig directly ;)).
> >
> > IIRC, the plan was indeed to get defconfig changes through the
> > platform sub-trees. It is also supposed to be how multi_v5 and
> > multi_v7 are handled and they will take care of the merge.
> 
> For cross-platform changes like this one, I'm definitely happy to pick up the
> patch directly from s...@kernel.org, or from mailing list if I know about it.

Should I collect all Ack's and re-send this series including the list "nobody
talks about" :), or can the series be picked up as-is?

Your advice would be really welcomed here!

> 
> We usually do the merges for the soc tree in batches and rely on patchwork
> to keep track of what I'm missing, so if Olof and I are just on Cc to a mail, 
> we
> might have forgotten about it by the time we do the next merges.
> 
>   Arnd

Regards,
Andrey


Re: [PATCH net v3 0/2] ibmvnic: Bug fixes for queue descriptor processing

2020-12-01 Thread David Miller
From: Thomas Falcon 
Date: Tue,  1 Dec 2020 09:52:09 -0600

> This series resolves a few issues in the ibmvnic driver's
> RX buffer and TX completion processing. The first patch
> includes memory barriers to synchronize queue descriptor
> reads. The second patch fixes a memory leak that could
> occur if the device returns a TX completion with an error
> code in the descriptor, in which case the respective socket
> buffer and other relevant data structures may not be freed
> or updated properly.
> 
> v3: Correct length of Fixes tags, requested by Jakub Kicinski
> 
> v2: Provide more detailed comments explaining specifically what
> reads are being ordered, suggested by Michael Ellerman

Series applied, thanks!
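
[For readers who want the shape of the fix in patch 1, the descriptor-read ordering problem is the classic one sketched below. The structure and field names here are hypothetical, not the actual ibmvnic definitions:]

#include <linux/compiler.h>
#include <linux/types.h>
#include <asm/barrier.h>

/* Hypothetical RX descriptor: the device writes the payload fields
 * first and flips 'valid' last. */
struct rx_desc {
	u8 valid;
	u8 flags;
	__le16 len;
	__le64 addr;
};

extern void consume_buffer(u64 addr, u16 len);	/* placeholder consumer */

static bool poll_one_desc(struct rx_desc *desc)
{
	if (!READ_ONCE(desc->valid))
		return false;

	/*
	 * Order the flag read before the payload reads: without this
	 * dma_rmb(), the CPU may load len/addr speculatively before
	 * seeing 'valid', and process stale descriptor contents.
	 */
	dma_rmb();

	consume_buffer(le64_to_cpu(desc->addr), le16_to_cpu(desc->len));
	return true;
}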


Re: powerpc32: BUG: KASAN: use-after-free in test_bpf_init+0x6f8/0xde8 [test_bpf]

2020-12-01 Thread Christophe Leroy




Le 01/12/2020 à 15:03, Christophe Leroy a écrit :

I've got the following KASAN error while running the test_bpf module on a
powerpc 8xx (32-bit).

It's reproducible, and happens each time at the same test.

Can someone help me investigate and fix it?

[  209.381037] test_bpf: #298 LD_IND byte frag


Without KASAN, this test and a few others fail:

[12493.832074] test_bpf: #298 LD_IND byte frag jited:1 ret 201 != 66 FAIL (1 times)
[12493.844921] test_bpf: #299 LD_IND halfword frag jited:1 ret 51509 != 17220 FAIL (1 times)
[12493.869990] test_bpf: #301 LD_IND halfword mixed head/frag jited:1 ret 51509 != 1305 FAIL (1 times)
[12493.897298] test_bpf: #303 LD_ABS byte frag jited:1 ret 201 != 66 FAIL (1 times)
[12493.911351] test_bpf: #304 LD_ABS halfword frag jited:1 ret 51509 != 17220 FAIL (1 times)
[12493.933244] test_bpf: #306 LD_ABS halfword mixed head/frag jited:1 ret 51509 != 1305 FAIL (1 times)
[12494.471983] test_bpf: Summary: 371 PASSED, 7 FAILED, [119/366 JIT'ed]

Christophe



[  209.383041] Pass 1: shrink = 0, seen = 0x3
[  209.383284] Pass 2: shrink = 0, seen = 0x3
[  209.383562] flen=3 proglen=104 pass=3 image=8166dc91 from=modprobe pid=380
[  209.383805] JIT code: 0000: 7c 08 02 a6 90 01 00 04 91 c1 ff b8 91 e1 ff bc
[  209.384044] JIT code: 0010: 94 21 ff 70 80 e3 00 58 81 e3 00 54 7d e7 78 50
[  209.384279] JIT code: 0020: 81 c3 00 a0 38 a0 00 00 38 80 00 00 38 a0 00 40
[  209.384516] JIT code: 0030: 3c e0 c0 02 60 e7 62 14 7c e8 03 a6 38 c5 00 00
[  209.384753] JIT code: 0040: 4e 80 00 21 41 80 00 0c 60 00 00 00 7c 83 23 78
[  209.384990] JIT code: 0050: 38 21 00 90 80 01 00 04 7c 08 03 a6 81 c1 ff b8
[  209.385207] JIT code: 0060: 81 e1 ff bc 4e 80 00 20
[  209.385442] jited:1
[  209.385762] ==================================================================
[  209.386272] BUG: KASAN: use-after-free in test_bpf_init+0x6f8/0xde8 [test_bpf]
[  209.386503] Read of size 4 at addr c2de70c0 by task modprobe/380
[  209.386622]
[  209.386881] CPU: 0 PID: 380 Comm: modprobe Not tainted 5.10.0-rc5-s3k-dev-01341-g72d20eec3f8b #4178
[  209.387032] Call Trace:
[  209.387404] [cad6b878] [c020e0d4] print_address_description.constprop.0+0x70/0x4e0 (unreliable)
[  209.387920] [cad6b8f8] [c020dc98] kasan_report+0x118/0x1c0
[  209.388503] [cad6b938] [cb0e0c98] test_bpf_init+0x6f8/0xde8 [test_bpf]
[  209.388918] [cad6ba58] [c0004084] do_one_initcall+0xa4/0x33c
[  209.389377] [cad6bb28] [c00f9144] do_init_module+0x158/0x7f4
[  209.389820] [cad6bbc8] [c00fccb0] load_module+0x3394/0x38d8
[  209.390273] [cad6be38] [c00fd4e0] sys_finit_module+0x118/0x17c
[  209.390700] [cad6bf38] [c00170d0] ret_from_syscall+0x0/0x34
[  209.391020] --- interrupt: c01 at 0xfd5e7c0
[  209.395301]
[  209.395472] Allocated by task 276:
[  209.395767]  __kasan_kmalloc.constprop.0+0xe8/0x134
[  209.396029]  kmem_cache_alloc+0x150/0x290
[  209.396281]  __alloc_skb+0x58/0x28c
[  209.396563]  alloc_skb_with_frags+0x74/0x314
[  209.396872]  sock_alloc_send_pskb+0x404/0x424
[  209.397205]  unix_dgram_sendmsg+0x200/0xbf0
[  209.397473]  __sys_sendto+0x17c/0x21c
[  209.397754]  ret_from_syscall+0x0/0x34
[  209.397877]
[  209.398039] Freed by task 274:
[  209.398308]  kasan_set_track+0x34/0x6c
[  209.398608]  kasan_set_free_info+0x28/0x48
[  209.398878]  __kasan_slab_free+0x10c/0x19c
[  209.399141]  kmem_cache_free+0x68/0x390
[  209.399433]  skb_free_datagram+0x20/0x8c
[  209.399759]  unix_dgram_recvmsg+0x474/0x710
[  209.400084]  sock_read_iter+0x17c/0x228
[  209.400348]  vfs_read+0x3c8/0x4f4
[  209.400603]  ksys_read+0x17c/0x1cc
[  209.400878]  ret_from_syscall+0x0/0x34
[  209.401001]
[  209.401222] The buggy address belongs to the object at c2de70c0
[  209.401222]  which belongs to the cache skbuff_head_cache of size 176
[  209.401462] The buggy address is located 0 bytes inside of
[  209.401462]  176-byte region [c2de70c0, c2de7170)
[  209.401604] The buggy address belongs to the page:
[  209.401867] page:464e6411 refcount:1 mapcount:0 mapping: index:0x0 pfn:0xb79
[  209.402080] flags: 0x200(slab)
[  209.402477] raw: 0200 0100 0122 c2004a90  00440088  0001
[  209.402646] page dumped because: kasan: bad access detected
[  209.402765]
[  209.402897] Memory state around the buggy address:
[  209.403142]  c2de6f80: fb fb fc fc fc fc fc fc fc fc fa fb fb fb fb fb
[  209.403388]  c2de7000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  209.403639] >c2de7080: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
[  209.403798]    ^
[  209.404048]  c2de7100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
[  209.404304]  c2de7180: fc fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb
[  209.404456] ==================================================================
[  209.404591] Disabling lock debugging due to kernel taint


Thanks
Christophe


Re: [PATCH v9 6/6] powerpc: Book3S 64-bit outline-only KASAN support

2020-12-01 Thread Christophe Leroy




Le 01/12/2020 à 17:16, Daniel Axtens a écrit :

Implement a limited form of KASAN for Book3S 64-bit machines running under
the Radix MMU, supporting only outline mode.

  - Enable the compiler instrumentation to check addresses and maintain the
shadow region. (This is the guts of KASAN which we can easily reuse.)

  - Require kasan-vmalloc support to handle modules and anything else in
vmalloc space.

  - KASAN needs to be able to validate all pointer accesses, but we can't
instrument all kernel addresses - only linear map and vmalloc. On boot,
set up a single page of read-only shadow that marks all iomap and
vmemmap accesses as valid.

  - Make our stack-walking code KASAN-safe by using READ_ONCE_NOCHECK -
generic code, arm64, s390 and x86 all do this for similar sorts of
reasons: when unwinding a stack, we might touch memory that KASAN has
marked as being out-of-bounds. In our case we often get this when
checking for an exception frame because we're checking an arbitrary
offset into the stack frame.

See commit 20955746320e ("s390/kasan: avoid false positives during stack
unwind"), commit bcaf669b4bdb ("arm64: disable kasan when accessing
frame->fp in unwind_frame"), commit 91e08ab0c851 ("x86/dumpstack:
Prevent KASAN false positive warnings") and commit 6e22c8366416
("tracing, kasan: Silence Kasan warning in check_stack of stack_tracer")

  - Document KASAN in both generic and powerpc docs.

Background
--

KASAN support on Book3S is a bit tricky to get right:

  - It would be good to support inline instrumentation so as to be able to
catch stack issues that cannot be caught with outline mode.

  - Inline instrumentation requires a fixed offset.

  - Book3S runs code with translations off ("real mode") during boot,
including a lot of generic device-tree parsing code which is used to
determine MMU features.

 [ppc64 mm note: The kernel installs a linear mapping at effective
 address c000...-c008 This is a one-to-one mapping with physical
 memory from ... onward. Because of how memory accesses work on
 powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
 same memory both with translations on (accessing as an 'effective
 address'), and with translations off (accessing as a 'real
 address'). This works in both guests and the hypervisor. For more
 details, see s5.7 of Book III of version 3 of the ISA, in particular
 the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
 KASAN implementation currently only supports Radix.]

  - Some code - most notably a lot of KVM code - also runs with translations
off after boot.

  - Therefore any offset has to point to memory that is valid with
translations on or off.

One approach is just to give up on inline instrumentation. This way
boot-time checks can be delayed until after the MMU is set up, and we
can just not instrument any code that runs with translations off after
booting. Take this approach for now and require outline instrumentation.

Previous attempts allowed inline instrumentation. However, they came with
some unfortunate restrictions: only physically contiguous memory could be
used and it had to be specified at compile time. Maybe we can do better in
the future.

Cc: Balbir Singh  # ppc64 out-of-line radix version
Cc: Aneesh Kumar K.V  # ppc64 hash version
Cc: Christophe Leroy  # ppc32 version
Signed-off-by: Daniel Axtens 
---
  Documentation/dev-tools/kasan.rst|  9 +-
  Documentation/powerpc/kasan.txt  | 48 +-
  arch/powerpc/Kconfig |  4 +-
  arch/powerpc/Kconfig.debug   |  2 +-
  arch/powerpc/include/asm/book3s/64/hash.h|  4 +
  arch/powerpc/include/asm/book3s/64/pgtable.h |  7 ++
  arch/powerpc/include/asm/book3s/64/radix.h   | 13 ++-
  arch/powerpc/include/asm/kasan.h | 34 ++-
  arch/powerpc/kernel/Makefile |  5 +
  arch/powerpc/kernel/process.c| 16 ++--
  arch/powerpc/kvm/Makefile|  5 +
  arch/powerpc/mm/book3s64/Makefile|  8 ++
  arch/powerpc/mm/kasan/Makefile   |  1 +
  arch/powerpc/mm/kasan/init_book3s_64.c   | 98 
  arch/powerpc/mm/ptdump/ptdump.c  | 20 +++-
  arch/powerpc/platforms/Kconfig.cputype   |  1 +
  arch/powerpc/platforms/powernv/Makefile  |  6 ++
  arch/powerpc/platforms/pseries/Makefile  |  3 +
  18 files changed, 265 insertions(+), 19 deletions(-)
  create mode 100644 arch/powerpc/mm/kasan/init_book3s_64.c

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index eaf868094a8e..28f08959bd2e 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -19,8 +19,9 @@ out-of-bounds accesses for global variables is only supported since Clang 11.
  Tag-based KASAN is only supported in Clang.
  
  

Re: [PATCH v9 5/6] powerpc/mm/kasan: rename kasan_init_32.c to init_32.c

2020-12-01 Thread Christophe Leroy




Le 01/12/2020 à 17:16, Daniel Axtens a écrit :

kasan is already implied by the directory name, we don't need to
repeat it.

Suggested-by: Christophe Leroy 


My new address is 



Signed-off-by: Daniel Axtens 
---
  arch/powerpc/mm/kasan/Makefile   | 2 +-
  arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} | 0
  2 files changed, 1 insertion(+), 1 deletion(-)
  rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)

diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile
index bb1a5408b86b..42fb628a44fd 100644
--- a/arch/powerpc/mm/kasan/Makefile
+++ b/arch/powerpc/mm/kasan/Makefile
@@ -2,6 +2,6 @@
  
  KASAN_SANITIZE := n
  
-obj-$(CONFIG_PPC32)   += kasan_init_32.o

+obj-$(CONFIG_PPC32)   += init_32.o
  obj-$(CONFIG_PPC_8xx) += 8xx.o
  obj-$(CONFIG_PPC_BOOK3S_32)   += book3s_32.o
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/init_32.c
similarity index 100%
rename from arch/powerpc/mm/kasan/kasan_init_32.c
rename to arch/powerpc/mm/kasan/init_32.c



Re: [PATCH v9 4/6] kasan: Document support on 32-bit powerpc

2020-12-01 Thread Christophe Leroy




Le 01/12/2020 à 17:16, Daniel Axtens a écrit :

KASAN is supported on 32-bit powerpc and the docs should reflect this.

Document s390 support while we're at it.

Suggested-by: Christophe Leroy 
Reviewed-by: Christophe Leroy 


My new address is 


Signed-off-by: Daniel Axtens 
---
  Documentation/dev-tools/kasan.rst |  7 +--
  Documentation/powerpc/kasan.txt   | 12 
  2 files changed, 17 insertions(+), 2 deletions(-)
  create mode 100644 Documentation/powerpc/kasan.txt

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index 2b68addaadcd..eaf868094a8e 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -19,7 +19,8 @@ out-of-bounds accesses for global variables is only supported since Clang 11.
  Tag-based KASAN is only supported in Clang.
  
  Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and

-riscv architectures, and tag-based KASAN is supported only for arm64.
+riscv architectures. It is also supported on 32-bit powerpc kernels. Tag-based
+KASAN is supported only on arm64.
  
  Usage
  -----
@@ -255,7 +256,9 @@ CONFIG_KASAN_VMALLOC
  
  
  With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the

-cost of greater memory usage. Currently this is only supported on x86.
+cost of greater memory usage. Currently this supported on x86, s390
+and 32-bit powerpc. It is optional, except on 32-bit powerpc kernels
+with module support, where it is required.
  
  This works by hooking into vmalloc and vmap, and dynamically

  allocating real shadow memory to back the mappings.
diff --git a/Documentation/powerpc/kasan.txt b/Documentation/powerpc/kasan.txt
new file mode 100644
index ..26bb0e8bb18c
--- /dev/null
+++ b/Documentation/powerpc/kasan.txt
@@ -0,0 +1,12 @@
+KASAN is supported on powerpc on 32-bit only.
+
+32 bit support
+==
+
+KASAN is supported on both hash and nohash MMUs on 32-bit.
+
+The shadow area sits at the top of the kernel virtual memory space above the
+fixmap area and occupies one eighth of the total kernel virtual memory space.
+
+Instrumentation of the vmalloc area is optional, unless built with modules,
+in which case it is required.



Re: [PATCH v9 3/6] kasan: define and use MAX_PTRS_PER_* for early shadow tables

2020-12-01 Thread Christophe Leroy




Le 01/12/2020 à 17:16, Daniel Axtens a écrit :

powerpc has a variable number of PTRS_PER_*, set at runtime based
on the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer constants, and therefore
breaks the build.

Define default MAX_PTRS_PER_*s in the same style as MAX_PTRS_PER_P4D.
As KASAN is the only user at the moment, just define them in the kasan
header, and have them default to PTRS_PER_* unless overridden in arch
code.

Suggested-by: Christophe Leroy 


My new address is: christophe.le...@csgroup.eu


Suggested-by: Balbir Singh 
Reviewed-by: Christophe Leroy 


Same


Reviewed-by: Balbir Singh 
Signed-off-by: Daniel Axtens 
---
  include/linux/kasan.h | 18 +++---
  mm/kasan/init.c   |  6 +++---
  2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 3df66fdf6662..893d054aad6f 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -24,10 +24,22 @@ struct kunit_kasan_expectation {
  static inline bool kasan_arch_is_ready(void)  { return true; }
  #endif
  
+#ifndef MAX_PTRS_PER_PTE

+#define MAX_PTRS_PER_PTE PTRS_PER_PTE
+#endif
+
+#ifndef MAX_PTRS_PER_PMD
+#define MAX_PTRS_PER_PMD PTRS_PER_PMD
+#endif
+
+#ifndef MAX_PTRS_PER_PUD
+#define MAX_PTRS_PER_PUD PTRS_PER_PUD
+#endif
+
  extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
-extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
-extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
-extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
+extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];
+extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
+extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
  extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];
  
  int kasan_populate_early_shadow(const void *shadow_start,

diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..42bca3d27db8 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -46,7 +46,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
  }
  #endif
  #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
  static inline bool kasan_pud_table(p4d_t p4d)
  {
return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
@@ -58,7 +58,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
  }
  #endif
  #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
  static inline bool kasan_pmd_table(pud_t pud)
  {
return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
@@ -69,7 +69,7 @@ static inline bool kasan_pmd_table(pud_t pud)
return false;
  }
  #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;
  
  static inline bool kasan_pte_table(pmd_t pmd)

  {
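
[The intended usage is that an arch with runtime-selected MMUs overrides the defaults in its own headers with the worst case, roughly as below. The H_/R_ values here are made up for illustration and are not the actual powerpc definitions:]

/* arch header sketch: size the static early-shadow tables for the
 * largest PTRS_PER_PTE any supported MMU can select at runtime. */
#define H_PTRS_PER_PTE	(1 << 9)	/* hash MMU, illustrative  */
#define R_PTRS_PER_PTE	(1 << 5)	/* radix MMU, illustrative */

#define MAX_PTRS_PER_PTE \
	(H_PTRS_PER_PTE > R_PTRS_PER_PTE ? H_PTRS_PER_PTE : R_PTRS_PER_PTE)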



Re: [PATCH v9 2/6] kasan: allow architectures to provide an outline readiness check

2020-12-01 Thread Christophe Leroy




Le 01/12/2020 à 17:16, Daniel Axtens a écrit :

Allow architectures to define a kasan_arch_is_ready() hook that bails
out of any function that's about to touch the shadow unless the arch
says that it is ready for the memory to be accessed. This is fairly
non-invasive and should have a negligible performance penalty.

This will only work in outline mode, so an arch must specify
HAVE_ARCH_NO_KASAN_INLINE if it requires this.

Cc: Balbir Singh 
Cc: Aneesh Kumar K.V 
Signed-off-by: Christophe Leroy 


Did I sign that off one day? I can't remember.

Please update my email address, and maybe change it to a Suggested-by:? I think
the first Signed-off-by: has to be the author of the patch.



Signed-off-by: Daniel Axtens 

--

I discuss the justification for this later in the series. Also,
both previous RFCs for ppc64 - by 2 different people - have
needed this trick! See:
  - https://lore.kernel.org/patchwork/patch/592820/ # ppc64 hash series
  - https://patchwork.ozlabs.org/patch/795211/  # ppc radix series
---
  include/linux/kasan.h |  4 
  mm/kasan/common.c | 10 ++
  mm/kasan/generic.c|  3 +++
  3 files changed, 17 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 30d343b4a40a..3df66fdf6662 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -20,6 +20,10 @@ struct kunit_kasan_expectation {
bool report_found;
  };
  
+#ifndef kasan_arch_is_ready

+static inline bool kasan_arch_is_ready(void)   { return true; }
+#endif
+
  extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
  extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
  extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 950fd372a07e..ba7744d3e319 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -117,6 +117,9 @@ void kasan_poison_shadow(const void *address, size_t size, 
u8 value)
  {
void *shadow_start, *shadow_end;
  
+	if (!kasan_arch_is_ready())

+   return;
+
/*
 * Perform shadow offset calculation based on untagged address, as
 * some of the callers (e.g. kasan_poison_object_data) pass tagged
@@ -134,6 +137,9 @@ void kasan_unpoison_shadow(const void *address, size_t size)
  {
u8 tag = get_tag(address);
  
+	if (!kasan_arch_is_ready())

+   return;
+
/*
 * Perform shadow offset calculation based on untagged address, as
 * some of the callers (e.g. kasan_unpoison_object_data) pass tagged
@@ -406,6 +412,10 @@ static bool __kasan_slab_free(struct kmem_cache *cache, 
void *object,
if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return false;
  
+	/* We can't read the shadow byte if the arch isn't ready */

+   if (!kasan_arch_is_ready())
+   return false;
+
shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object));
if (shadow_invalid(tag, shadow_byte)) {
kasan_report_invalid_free(tagged_object, ip);
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 248264b9cb76..e87404026b2b 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -169,6 +169,9 @@ static __always_inline bool 
check_memory_region_inline(unsigned long addr,
size_t size, bool write,
unsigned long ret_ip)
  {
+   if (!kasan_arch_is_ready())
+   return true;
+
if (unlikely(size == 0))
return true;
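
[An arch opting in would then provide the hook in its own asm/kasan.h, roughly like this sketch; the flag name is illustrative, and a real arch would test whatever state it sets once the shadow region is mapped:]

/* Illustrative arch-side definition of the hook added above. */
extern bool kasan_init_done;	/* set once the shadow is mapped */

static inline bool kasan_arch_is_ready(void)
{
	return kasan_init_done;
}
#define kasan_arch_is_ready kasan_arch_is_ready	/* override the default */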
  



Re: [PATCH v9 1/6] kasan: allow an architecture to disable inline instrumentation

2020-12-01 Thread Christophe Leroy




Le 01/12/2020 à 17:16, Daniel Axtens a écrit :

For annoying architectural reasons, it's very difficult to support inline
instrumentation on powerpc64.

Add a Kconfig flag to allow an arch to disable inline. (It's a bit
annoying to be 'backwards', but I'm not aware of any way to have
an arch force a symbol to be 'n', rather than 'y'.)

Signed-off-by: Daniel Axtens 
---
  lib/Kconfig.kasan | 4 
  1 file changed, 4 insertions(+)

diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index 542a9c18398e..31a0b28f6c2b 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -9,6 +9,9 @@ config HAVE_ARCH_KASAN_SW_TAGS
  configHAVE_ARCH_KASAN_VMALLOC
bool
  
+config HAVE_ARCH_NO_KASAN_INLINE


Maybe a better name could be: ARCH_DISABLE_KASAN_INLINE


+   def_bool n
+
  config CC_HAS_KASAN_GENERIC
def_bool $(cc-option, -fsanitize=kernel-address)
  
@@ -108,6 +111,7 @@ config KASAN_OUTLINE
  
  config KASAN_INLINE

bool "Inline instrumentation"
+   depends on !HAVE_ARCH_NO_KASAN_INLINE
help
  Compiler directly inserts code checking shadow memory before
  memory accesses. This is faster than outline (in some workloads



[PATCH v9 6/6] powerpc: Book3S 64-bit outline-only KASAN support

2020-12-01 Thread Daniel Axtens
Implement a limited form of KASAN for Book3S 64-bit machines running under
the Radix MMU, supporting only outline mode.

 - Enable the compiler instrumentation to check addresses and maintain the
   shadow region. (This is the guts of KASAN which we can easily reuse.)

 - Require kasan-vmalloc support to handle modules and anything else in
   vmalloc space.

 - KASAN needs to be able to validate all pointer accesses, but we can't
   instrument all kernel addresses - only linear map and vmalloc. On boot,
   set up a single page of read-only shadow that marks all iomap and
   vmemmap accesses as valid.

 - Make our stack-walking code KASAN-safe by using READ_ONCE_NOCHECK -
   generic code, arm64, s390 and x86 all do this for similar sorts of
   reasons: when unwinding a stack, we might touch memory that KASAN has
   marked as being out-of-bounds. In our case we often get this when
   checking for an exception frame because we're checking an arbitrary
   offset into the stack frame.

   See commit 20955746320e ("s390/kasan: avoid false positives during stack
   unwind"), commit bcaf669b4bdb ("arm64: disable kasan when accessing
   frame->fp in unwind_frame"), commit 91e08ab0c851 ("x86/dumpstack:
   Prevent KASAN false positive warnings") and commit 6e22c8366416
   ("tracing, kasan: Silence Kasan warning in check_stack of stack_tracer")

 - Document KASAN in both generic and powerpc docs.

Background
--

KASAN support on Book3S is a bit tricky to get right:

 - It would be good to support inline instrumentation so as to be able to
   catch stack issues that cannot be caught with outline mode.

 - Inline instrumentation requires a fixed offset.

 - Book3S runs code with translations off ("real mode") during boot,
   including a lot of generic device-tree parsing code which is used to
   determine MMU features.

[ppc64 mm note: The kernel installs a linear mapping at effective
address c000...-c008... This is a one-to-one mapping with physical
memory from ... onward. Because of how memory accesses work on
powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
same memory both with translations on (accessing as an 'effective
address'), and with translations off (accessing as a 'real
address'). This works in both guests and the hypervisor. For more
details, see s5.7 of Book III of version 3 of the ISA, in particular
the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
KASAN implementation currently only supports Radix.]

 - Some code - most notably a lot of KVM code - also runs with translations
   off after boot.

 - Therefore any offset has to point to memory that is valid with
   translations on or off.

One approach is just to give up on inline instrumentation. This way
boot-time checks can be delayed until after the MMU is set up, and we
can just not instrument any code that runs with translations off after
booting. Take this approach for now and require outline instrumentation.

Previous attempts allowed inline instrumentation. However, they came with
some unfortunate restrictions: only physically contiguous memory could be
used and it had to be specified at compile time. Maybe we can do better in
the future.

Cc: Balbir Singh  # ppc64 out-of-line radix version
Cc: Aneesh Kumar K.V  # ppc64 hash version
Cc: Christophe Leroy  # ppc32 version
Signed-off-by: Daniel Axtens 
---
 Documentation/dev-tools/kasan.rst|  9 +-
 Documentation/powerpc/kasan.txt  | 48 +-
 arch/powerpc/Kconfig |  4 +-
 arch/powerpc/Kconfig.debug   |  2 +-
 arch/powerpc/include/asm/book3s/64/hash.h|  4 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |  7 ++
 arch/powerpc/include/asm/book3s/64/radix.h   | 13 ++-
 arch/powerpc/include/asm/kasan.h | 34 ++-
 arch/powerpc/kernel/Makefile |  5 +
 arch/powerpc/kernel/process.c| 16 ++--
 arch/powerpc/kvm/Makefile|  5 +
 arch/powerpc/mm/book3s64/Makefile|  8 ++
 arch/powerpc/mm/kasan/Makefile   |  1 +
 arch/powerpc/mm/kasan/init_book3s_64.c   | 98 
 arch/powerpc/mm/ptdump/ptdump.c  | 20 +++-
 arch/powerpc/platforms/Kconfig.cputype   |  1 +
 arch/powerpc/platforms/powernv/Makefile  |  6 ++
 arch/powerpc/platforms/pseries/Makefile  |  3 +
 18 files changed, 265 insertions(+), 19 deletions(-)
 create mode 100644 arch/powerpc/mm/kasan/init_book3s_64.c

diff --git a/Documentation/dev-tools/kasan.rst 
b/Documentation/dev-tools/kasan.rst
index eaf868094a8e..28f08959bd2e 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -19,8 +19,9 @@ out-of-bounds accesses for global variables is only supported 
since Clang 11.
 Tag-based KASAN is only supported in Clang.
 
 Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and
-riscv architectures. It is also 

[PATCH v9 5/6] powerpc/mm/kasan: rename kasan_init_32.c to init_32.c

2020-12-01 Thread Daniel Axtens
kasan is already implied by the directory name, we don't need to
repeat it.

Suggested-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 
---
 arch/powerpc/mm/kasan/Makefile   | 2 +-
 arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)

diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile
index bb1a5408b86b..42fb628a44fd 100644
--- a/arch/powerpc/mm/kasan/Makefile
+++ b/arch/powerpc/mm/kasan/Makefile
@@ -2,6 +2,6 @@
 
 KASAN_SANITIZE := n
 
-obj-$(CONFIG_PPC32)   += kasan_init_32.o
+obj-$(CONFIG_PPC32)   += init_32.o
 obj-$(CONFIG_PPC_8xx)  += 8xx.o
 obj-$(CONFIG_PPC_BOOK3S_32)+= book3s_32.o
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/init_32.c
similarity index 100%
rename from arch/powerpc/mm/kasan/kasan_init_32.c
rename to arch/powerpc/mm/kasan/init_32.c
-- 
2.25.1



[PATCH v9 4/6] kasan: Document support on 32-bit powerpc

2020-12-01 Thread Daniel Axtens
KASAN is supported on 32-bit powerpc and the docs should reflect this.

Document s390 support while we're at it.

Suggested-by: Christophe Leroy 
Reviewed-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 
---
 Documentation/dev-tools/kasan.rst |  7 +--
 Documentation/powerpc/kasan.txt   | 12 
 2 files changed, 17 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/powerpc/kasan.txt

diff --git a/Documentation/dev-tools/kasan.rst 
b/Documentation/dev-tools/kasan.rst
index 2b68addaadcd..eaf868094a8e 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -19,7 +19,8 @@ out-of-bounds accesses for global variables is only supported 
since Clang 11.
 Tag-based KASAN is only supported in Clang.
 
 Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and
-riscv architectures, and tag-based KASAN is supported only for arm64.
+riscv architectures. It is also supported on 32-bit powerpc kernels. Tag-based
+KASAN is supported only on arm64.
 
 Usage
 -
@@ -255,7 +256,9 @@ CONFIG_KASAN_VMALLOC
 
 
 With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
-cost of greater memory usage. Currently this is only supported on x86.
+cost of greater memory usage. Currently this is supported on x86, s390
+and 32-bit powerpc. It is optional, except on 32-bit powerpc kernels
+with module support, where it is required.
 
 This works by hooking into vmalloc and vmap, and dynamically
 allocating real shadow memory to back the mappings.
diff --git a/Documentation/powerpc/kasan.txt b/Documentation/powerpc/kasan.txt
new file mode 100644
index ..26bb0e8bb18c
--- /dev/null
+++ b/Documentation/powerpc/kasan.txt
@@ -0,0 +1,12 @@
+KASAN is supported on powerpc on 32-bit only.
+
+32 bit support
+==
+
+KASAN is supported on both hash and nohash MMUs on 32-bit.
+
+The shadow area sits at the top of the kernel virtual memory space above the
+fixmap area and occupies one eighth of the total kernel virtual memory space.
+
+Instrumentation of the vmalloc area is optional, unless built with modules,
+in which case it is required.
-- 
2.25.1



[PATCH v9 3/6] kasan: define and use MAX_PTRS_PER_* for early shadow tables

2020-12-01 Thread Daniel Axtens
powerpc has a variable number of PTRS_PER_*, set at runtime based
on the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer constants, and therefore
breaks the build.

Define default MAX_PTRS_PER_*s in the same style as MAX_PTRS_PER_P4D.
As KASAN is the only user at the moment, just define them in the kasan
header, and have them default to PTRS_PER_* unless overridden in arch
code.
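
As a hedged illustration of the override mechanism (the values below are
invented to show the shape; a real arch would use its worst-case table
geometry across the MMUs it supports):

  /* In an arch pgtable header, visible before <linux/kasan.h> (illustrative) */
  #define MAX_PTRS_PER_PTE	(1 << 9)
  #define MAX_PTRS_PER_PMD	(1 << 7)
  #define MAX_PTRS_PER_PUD	(1 << 7)

With those defined, the #ifndef fallbacks in the hunk below leave the arch
values untouched and only default to PTRS_PER_* where the arch says nothing.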

Suggested-by: Christophe Leroy 
Suggested-by: Balbir Singh 
Reviewed-by: Christophe Leroy 
Reviewed-by: Balbir Singh 
Signed-off-by: Daniel Axtens 
---
 include/linux/kasan.h | 18 +++---
 mm/kasan/init.c   |  6 +++---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 3df66fdf6662..893d054aad6f 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -24,10 +24,22 @@ struct kunit_kasan_expectation {
 static inline bool kasan_arch_is_ready(void)   { return true; }
 #endif
 
+#ifndef MAX_PTRS_PER_PTE
+#define MAX_PTRS_PER_PTE PTRS_PER_PTE
+#endif
+
+#ifndef MAX_PTRS_PER_PMD
+#define MAX_PTRS_PER_PMD PTRS_PER_PMD
+#endif
+
+#ifndef MAX_PTRS_PER_PUD
+#define MAX_PTRS_PER_PUD PTRS_PER_PUD
+#endif
+
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
-extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
-extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
-extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
+extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];
+extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
+extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
 extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];
 
 int kasan_populate_early_shadow(const void *shadow_start,
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index fe6be0be1f76..42bca3d27db8 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -46,7 +46,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
 static inline bool kasan_pud_table(p4d_t p4d)
 {
return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
@@ -58,7 +58,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
 static inline bool kasan_pmd_table(pud_t pud)
 {
return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
@@ -69,7 +69,7 @@ static inline bool kasan_pmd_table(pud_t pud)
return false;
 }
 #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;
 
 static inline bool kasan_pte_table(pmd_t pmd)
 {
-- 
2.25.1



[PATCH v9 2/6] kasan: allow architectures to provide an outline readiness check

2020-12-01 Thread Daniel Axtens
Allow architectures to define a kasan_arch_is_ready() hook that bails
out of any function that's about to touch the shadow unless the arch
says that it is ready for the memory to be accessed. This is fairly
uninvasive and should have a negligible performance penalty.

This will only work in outline mode, so an arch must specify
HAVE_ARCH_NO_KASAN_INLINE if it requires this.
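
As a minimal sketch of the arch side (the names here are assumptions, not
part of this patch; the patch only adds the generic fallback):

  /* e.g. in an arch's asm/kasan.h -- illustrative only */
  extern bool arch_kasan_ready;	/* set true once the shadow is mapped */

  static inline bool kasan_arch_is_ready(void)
  {
  	return READ_ONCE(arch_kasan_ready);
  }
  #define kasan_arch_is_ready kasan_arch_is_ready

The trailing #define is what makes the #ifndef fallback in <linux/kasan.h>
leave the arch version alone.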

Cc: Balbir Singh 
Cc: Aneesh Kumar K.V 
Signed-off-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 

--

I discuss the justification for this later in the series. Also,
both previous RFCs for ppc64 - by 2 different people - have
needed this trick! See:
 - https://lore.kernel.org/patchwork/patch/592820/ # ppc64 hash series
 - https://patchwork.ozlabs.org/patch/795211/  # ppc radix series
---
 include/linux/kasan.h |  4 
 mm/kasan/common.c | 10 ++
 mm/kasan/generic.c|  3 +++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 30d343b4a40a..3df66fdf6662 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -20,6 +20,10 @@ struct kunit_kasan_expectation {
bool report_found;
 };
 
+#ifndef kasan_arch_is_ready
+static inline bool kasan_arch_is_ready(void)   { return true; }
+#endif
+
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
 extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
 extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 950fd372a07e..ba7744d3e319 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -117,6 +117,9 @@ void kasan_poison_shadow(const void *address, size_t size, 
u8 value)
 {
void *shadow_start, *shadow_end;
 
+   if (!kasan_arch_is_ready())
+   return;
+
/*
 * Perform shadow offset calculation based on untagged address, as
 * some of the callers (e.g. kasan_poison_object_data) pass tagged
@@ -134,6 +137,9 @@ void kasan_unpoison_shadow(const void *address, size_t size)
 {
u8 tag = get_tag(address);
 
+   if (!kasan_arch_is_ready())
+   return;
+
/*
 * Perform shadow offset calculation based on untagged address, as
 * some of the callers (e.g. kasan_unpoison_object_data) pass tagged
@@ -406,6 +412,10 @@ static bool __kasan_slab_free(struct kmem_cache *cache, 
void *object,
if (unlikely(cache->flags & SLAB_TYPESAFE_BY_RCU))
return false;
 
+   /* We can't read the shadow byte if the arch isn't ready */
+   if (!kasan_arch_is_ready())
+   return false;
+
shadow_byte = READ_ONCE(*(s8 *)kasan_mem_to_shadow(object));
if (shadow_invalid(tag, shadow_byte)) {
kasan_report_invalid_free(tagged_object, ip);
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 248264b9cb76..e87404026b2b 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -169,6 +169,9 @@ static __always_inline bool 
check_memory_region_inline(unsigned long addr,
size_t size, bool write,
unsigned long ret_ip)
 {
+   if (!kasan_arch_is_ready())
+   return true;
+
if (unlikely(size == 0))
return true;
 
-- 
2.25.1



[PATCH v9 1/6] kasan: allow an architecture to disable inline instrumentation

2020-12-01 Thread Daniel Axtens
For annoying architectural reasons, it's very difficult to support inline
instrumentation on powerpc64.

Add a Kconfig flag to allow an arch to disable inline. (It's a bit
annoying to be 'backwards', but I'm not aware of any way to have
an arch force a symbol to be 'n', rather than 'y'.)

Signed-off-by: Daniel Axtens 
---
 lib/Kconfig.kasan | 4 
 1 file changed, 4 insertions(+)

diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index 542a9c18398e..31a0b28f6c2b 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -9,6 +9,9 @@ config HAVE_ARCH_KASAN_SW_TAGS
 config HAVE_ARCH_KASAN_VMALLOC
bool
 
+config HAVE_ARCH_NO_KASAN_INLINE
+   def_bool n
+
 config CC_HAS_KASAN_GENERIC
def_bool $(cc-option, -fsanitize=kernel-address)
 
@@ -108,6 +111,7 @@ config KASAN_OUTLINE
 
 config KASAN_INLINE
bool "Inline instrumentation"
+   depends on !HAVE_ARCH_NO_KASAN_INLINE
help
  Compiler directly inserts code checking shadow memory before
  memory accesses. This is faster than outline (in some workloads
-- 
2.25.1



[PATCH v9 0/6] KASAN for powerpc64 radix

2020-12-01 Thread Daniel Axtens
Building on the work of Christophe, Aneesh and Balbir, I've ported
KASAN to 64-bit Book3S kernels running on the Radix MMU.

This is a significant reworking of the previous versions. Instead of
the previous approach which supported inline instrumentation, this
series provides only outline instrumentation.

To get around the problem of accessing the shadow region inside code we run
with translations off (in 'real mode'), we restrict checking to when
translations are enabled. This is done via a new hook in the kasan core and
by excluding larger quantities of arch code from instrumentation. The upside
is that we no longer require that you be able to specify the amount of
physically contiguous memory on the system at compile time. Hopefully this
is a better trade-off. More details in patch 6.

kexec works. Both 64k and 4k pages work. Running as a KVM host works, but
nothing in arch/powerpc/kvm is instrumented. It's also potentially a bit
fragile - if any real mode code paths call out to instrumented code, things
will go boom.

There are 4 failing KUnit tests:

kasan_stack_oob, kasan_alloca_oob_left & kasan_alloca_oob_right - these are
due to not supporting inline instrumentation.

kasan_global_oob - gcc puts the ASAN init code in a section called
'.init_array'. Powerpc64 module loading code goes through and _renames_ any
section beginning with '.init' to begin with '_init' in order to avoid some
complexities around our 24-bit indirect jumps. This means it renames
'.init_array' to '_init_array', and the generic module loading code then
fails to recognise the section as a constructor and thus doesn't run
it. This hack dates back to 2003 and so I'm not going to try to unpick it
in this series. (I suspect this may have previously worked if the code
ended up in .ctors rather than .init_array but I don't keep my old binaries
around so I have no real way of checking.)


Daniel Axtens (6):
  kasan: allow an architecture to disable inline instrumentation
  kasan: allow architectures to provide an outline readiness check
  kasan: define and use MAX_PTRS_PER_* for early shadow tables
  kasan: Document support on 32-bit powerpc
  powerpc/mm/kasan: rename kasan_init_32.c to init_32.c
  powerpc: Book3S 64-bit outline-only KASAN support




[PATCH net v3 1/2] ibmvnic: Ensure that SCRQ entry reads are correctly ordered

2020-12-01 Thread Thomas Falcon
Ensure that received Subordinate Command-Response Queue (SCRQ)
entries are properly read in order by the driver. These queues
are used in the ibmvnic device to process RX buffer and TX completion
descriptors. dma_rmb barriers have been added after checking for a
pending descriptor to ensure the correct descriptor entry is checked
and after reading the SCRQ descriptor to ensure the entire
descriptor is read before processing.
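
As a generic sketch of the ordering pattern being enforced here (the
structure and names are illustrative, not the driver's):

  #include <linux/compiler.h>	/* READ_ONCE() */
  #include <linux/types.h>
  #include <asm/barrier.h>	/* dma_rmb() */

  struct demo_desc {
  	u32 flags;	/* bit 0 = valid; the device sets this last */
  	u64 data;	/* written by the device before flags */
  };

  static bool demo_poll(struct demo_desc *desc, u64 *out)
  {
  	if (!(READ_ONCE(desc->flags) & 1))	/* peek: descriptor pending? */
  		return false;
  	dma_rmb();	/* order the valid-bit read before the body read */
  	*out = desc->data;	/* body is at least as new as the valid bit */
  	return true;
  }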

Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol")
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 2aa40b2..5ea9f5c 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2403,6 +2403,12 @@ static int ibmvnic_poll(struct napi_struct *napi, int 
budget)
 
if (!pending_scrq(adapter, adapter->rx_scrq[scrq_num]))
break;
+   /* The queue entry at the current index is peeked at above
+* to determine that there is a valid descriptor awaiting
+* processing. We want to be sure that the current slot
+* holds a valid descriptor before reading its contents.
+*/
+   dma_rmb();
next = ibmvnic_next_scrq(adapter, adapter->rx_scrq[scrq_num]);
rx_buff =
(struct ibmvnic_rx_buff *)be64_to_cpu(next->
@@ -3098,6 +3104,13 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter 
*adapter,
unsigned int pool = scrq->pool_index;
int num_entries = 0;
 
+   /* The queue entry at the current index is peeked at above
+* to determine that there is a valid descriptor awaiting
+* processing. We want to be sure that the current slot
+* holds a valid descriptor before reading its contents.
+*/
+   dma_rmb();
+
next = ibmvnic_next_scrq(adapter, scrq);
for (i = 0; i < next->tx_comp.num_comps; i++) {
if (next->tx_comp.rcs[i]) {
@@ -3498,6 +3511,11 @@ static union sub_crq *ibmvnic_next_scrq(struct 
ibmvnic_adapter *adapter,
}
	spin_unlock_irqrestore(&scrq->lock, flags);
 
+   /* Ensure that the entire buffer descriptor has been
+* loaded before reading its contents
+*/
+   dma_rmb();
+
return entry;
 }
 
-- 
1.8.3.1



[PATCH net v3 0/2] ibmvnic: Bug fixes for queue descriptor processing

2020-12-01 Thread Thomas Falcon
This series resolves a few issues in the ibmvnic driver's
RX buffer and TX completion processing. The first patch
includes memory barriers to synchronize queue descriptor
reads. The second patch fixes a memory leak that could
occur if the device returns a TX completion with an error
code in the descriptor, in which case the respective socket
buffer and other relevant data structures may not be freed
or updated properly.

v3: Correct length of Fixes tags, requested by Jakub Kicinski

v2: Provide more detailed comments explaining specifically what
reads are being ordered, suggested by Michael Ellerman

Thomas Falcon (2):
  ibmvnic: Ensure that SCRQ entry reads are correctly ordered
  ibmvnic: Fix TX completion error handling

 drivers/net/ethernet/ibm/ibmvnic.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

-- 
1.8.3.1



[PATCH net v3 2/2] ibmvnic: Fix TX completion error handling

2020-12-01 Thread Thomas Falcon
TX completions received with an error return code are not
being processed properly. When an error code is seen, do not
proceed to the next completion before cleaning up the existing
entry's data structures.

Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol")
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 5ea9f5c..10878f8 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3113,11 +3113,9 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter 
*adapter,
 
next = ibmvnic_next_scrq(adapter, scrq);
for (i = 0; i < next->tx_comp.num_comps; i++) {
-   if (next->tx_comp.rcs[i]) {
+   if (next->tx_comp.rcs[i])
dev_err(dev, "tx error %x\n",
next->tx_comp.rcs[i]);
-   continue;
-   }
index = be32_to_cpu(next->tx_comp.correlators[i]);
if (index & IBMVNIC_TSO_POOL_MASK) {
tx_pool = >tso_pool[pool];
-- 
1.8.3.1



Re: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Arnd Bergmann
On Tue, Dec 1, 2020 at 4:41 PM Alexandre Belloni
 wrote:
> On 01/12/2020 14:40:53+, Catalin Marinas wrote:
> > On Mon, Nov 30, 2020 at 07:50:25PM +, ZHIZHIKIN Andrey wrote:
> > > From Krzysztof Kozlowski :

> > I tried to convince them before, it didn't work. I guess they don't like
> > to be spammed ;).
>
> The first rule of arm-soc is: you do not talk about arm@ and soc@

I don't mind having the addresses documented better, but it needs to
be done in a way that avoids having any patch for arch/arm*/boot/dts
and arch/arm/*/configs Cc:d to s...@kernel.org.

If anyone has suggestions for how to do that, let me know.

> > Or rather, SoC-specific patches, even to defconfig,
> > should go through the specific SoC maintainers. However, there are
> > occasional defconfig patches which are more generic or affecting
> > multiple SoCs. I just ignore them as the arm64 defconfig is usually
> > handled by the arm-soc folk (when I need a defconfig change, I go for
> > arch/arm64/Kconfig directly ;)).
>
> IIRC, the plan was indeed to get defconfig changes through the platform
> sub-trees. It is also supposed to be how multi_v5 and multi_v7 are
> handled and they will take care of the merge.

For cross-platform changes like this one, I'm definitely happy to
pick up the patch directly from s...@kernel.org, or from mailing
list if I know about it.

We usually do the merges for the soc tree in batches and rely
on patchwork to keep track of what I'm missing, so if Olof and
I are just on Cc to a mail, we might have forgotten about it
by the time we do the next merges.

  Arnd


Re: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Alexandre Belloni
On 01/12/2020 14:40:53+, Catalin Marinas wrote:
> On Mon, Nov 30, 2020 at 07:50:25PM +, ZHIZHIKIN Andrey wrote:
> > From Krzysztof Kozlowski :
> > > On Mon, Nov 30, 2020 at 03:21:33PM +, Andrey Zhizhikin wrote:
> > > > Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
> > > > unused") removed generic_bl driver from the tree, together with
> > > > corresponding config option.
> > > >
> > > > Remove BACKLIGHT_GENERIC config item from all ARM configurations.
> > > >
> > > > Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it
> > > > is unused")
> > > > Cc: Sam Ravnborg 
> > > > Signed-off-by: Andrey Zhizhikin
> > > > 
> > > > ---
> > > >  arch/arm/configs/at91_dt_defconfig| 1 -
> > > >  arch/arm/configs/cm_x300_defconfig| 1 -
> > > >  arch/arm/configs/colibri_pxa300_defconfig | 1 -
> > > >  arch/arm/configs/jornada720_defconfig | 1 -
> > > >  arch/arm/configs/magician_defconfig   | 1 -
> > > >  arch/arm/configs/mini2440_defconfig   | 1 -
> > > >  arch/arm/configs/omap2plus_defconfig  | 1 -
> > > >  arch/arm/configs/pxa3xx_defconfig | 1 -
> > > >  arch/arm/configs/qcom_defconfig   | 1 -
> > > >  arch/arm/configs/sama5_defconfig  | 1 -
> > > >  arch/arm/configs/sunxi_defconfig  | 1 -
> > > >  arch/arm/configs/tegra_defconfig  | 1 -
> > > >  arch/arm/configs/u8500_defconfig  | 1 -
> > > >  13 files changed, 13 deletions(-)
> > > 
> > > You need to send it to arm-soc maintainers, otherwise no one might feel
> > > responsible enough to pick it up.
> > 
> > Good point, thanks a lot!
> > 
> > I was not aware of the fact that there is a separate ML that should
> > receive patches targeted ARM SOCs. Can you (or anyone else) please
> > share it, so I can re-send it there as well?
> 
> It's not a mailing list as such (with archives etc.), just an alias to
> the arm-soc maintainers: a...@kernel.org.
> 
> > > Reviewed-by: Krzysztof Kozlowski 
> > > 
> > > +CC Arnd and Olof,
> > > 
> > > Dear Arnd and Olof,
> > > 
> > > Maybe it is worth to add arm-soc entry to the MAINTAINERS file?
> > > Otherwise how one could get your email address? Not mentioning the
> > > secret-soc address. :)
> 
> I tried to convince them before, it didn't work. I guess they don't like
> to be spammed ;).

The first rule of arm-soc is: you do not talk about arm@ and soc@

> Or rather, SoC-specific patches, even to defconfig,
> should go through the specific SoC maintainers. However, there are
> occasional defconfig patches which are more generic or affecting
> multiple SoCs. I just ignore them as the arm64 defconfig is usually
> handled by the arm-soc folk (when I need a defconfig change, I go for
> arch/arm64/Kconfig directly ;)).
> 

IIRC, the plan was indeed to get defconfig changes through the platform
sub-trees. It is also supposed to be how multi_v5 and multi_v7 are
handled and they will take care of the merge.

-- 
Alexandre Belloni, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com


Re: [PATCH v6 0/5] PCI: Unify ECAM constants in native PCI Express drivers

2020-12-01 Thread Lorenzo Pieralisi
On Sun, 29 Nov 2020 23:07:38 +, Krzysztof Wilczyński wrote:
> Unify ECAM-related constants into a single set of standard constants
> defining memory address shift values for the byte-level address that can
> be used when accessing the PCI Express Configuration Space, and then
> move native PCI Express controller drivers to use newly introduced
> definitions retiring any driver-specific ones.
> 
> The ECAM ("Enhanced Configuration Access Mechanism") is defined by the
> PCI Express specification (see PCI Express Base Specification, Revision
> 5.0, Version 1.0, Section 7.2.2, p. 676), thus most hardware should
> implement it the same way.
> 
> [...]
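
For reference, the address arithmetic the spec pins down is simple: each
function gets 4 KiB of configuration space, giving a 12-bit register offset,
3 bits of function, 5 bits of device and 8 bits of bus. A hedged sketch
(macro names illustrative, not necessarily the ones the series introduces):

  /* 1 MiB of ECAM space per bus: 8/5/3/12-bit split */
  #define ECAM_BUS_SHIFT		20
  #define ECAM_DEVFN_SHIFT	12

  static inline void __iomem *ecam_map(void __iomem *base, unsigned int bus,
  				     unsigned int devfn, unsigned int where)
  {
  	return base + (bus << ECAM_BUS_SHIFT) +
  	       (devfn << ECAM_DEVFN_SHIFT) + where;
  }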

Applied to pci/ecam, thanks!

[1/5] PCI: Unify ECAM constants in native PCI Express drivers
  https://git.kernel.org/lpieralisi/pci/c/f3c07cf692
[2/5] PCI: thunder-pem: Add constant for custom ".bus_shift" initialiser
  https://git.kernel.org/lpieralisi/pci/c/3c38579263
[3/5] PCI: iproc: Convert to use the new ECAM constants
  https://git.kernel.org/lpieralisi/pci/c/333ec9d3cc
[4/5] PCI: vmd: Update type of the __iomem pointers
  https://git.kernel.org/lpieralisi/pci/c/89094c12ea
[5/5] PCI: xgene: Removed unused ".bus_shift" initialisers from pci-xgene.c
  https://git.kernel.org/lpieralisi/pci/c/3dc62532a5

Thanks,
Lorenzo


[PATCH] selftests/powerpc: update .gitignore

2020-12-01 Thread Daniel Axtens
I did an in-place build of the self-tests and found that it left
the tree dirty.

Add missed test binaries to .gitignore

Signed-off-by: Daniel Axtens 
---
 tools/testing/selftests/powerpc/nx-gzip/.gitignore  | 3 +++
 tools/testing/selftests/powerpc/security/.gitignore | 1 +
 tools/testing/selftests/powerpc/signal/.gitignore   | 1 +
 tools/testing/selftests/powerpc/syscalls/.gitignore | 1 +
 4 files changed, 6 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/nx-gzip/.gitignore

diff --git a/tools/testing/selftests/powerpc/nx-gzip/.gitignore 
b/tools/testing/selftests/powerpc/nx-gzip/.gitignore
new file mode 100644
index ..886d522d52df
--- /dev/null
+++ b/tools/testing/selftests/powerpc/nx-gzip/.gitignore
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+gunz_test
+gzfht_test
diff --git a/tools/testing/selftests/powerpc/security/.gitignore 
b/tools/testing/selftests/powerpc/security/.gitignore
index 4257a1f156bb..93614b125ded 100644
--- a/tools/testing/selftests/powerpc/security/.gitignore
+++ b/tools/testing/selftests/powerpc/security/.gitignore
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 rfi_flush
 entry_flush
+spectre_v2
diff --git a/tools/testing/selftests/powerpc/signal/.gitignore 
b/tools/testing/selftests/powerpc/signal/.gitignore
index 405b5364044c..ce3375cd8e73 100644
--- a/tools/testing/selftests/powerpc/signal/.gitignore
+++ b/tools/testing/selftests/powerpc/signal/.gitignore
@@ -3,3 +3,4 @@ signal
 signal_tm
 sigfuz
 sigreturn_vdso
+sig_sc_double_restart
diff --git a/tools/testing/selftests/powerpc/syscalls/.gitignore 
b/tools/testing/selftests/powerpc/syscalls/.gitignore
index b00cab225476..a1e19ccdef84 100644
--- a/tools/testing/selftests/powerpc/syscalls/.gitignore
+++ b/tools/testing/selftests/powerpc/syscalls/.gitignore
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 ipc_unmuxed
+rtas_filter
-- 
2.25.1



[PATCH] powerpc/feature-fixups: use a semicolon rather than a comma

2020-12-01 Thread Daniel Axtens
In a bunch of our security flushes, we use a comma rather than
a semicolon to 'terminate' an assignment. Nothing breaks, but
checkpatch picks it up if you copy it into another flush.

Switch to semicolons for ending statements.
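
For anyone wondering why the comma version compiled at all: the comma
operator folds both assignments into a single expression statement, so it
is legal C that merely reads like two statements. A standalone illustration
(not kernel code):

  void example(void)
  {
  	int a, b;

  	a = 1,	/* comma operator: both sides evaluated, one statement */
  	b = 2;	/* reads like its own statement, but is the right operand */
  }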

Cc: Nick Piggin 
Cc: Russell Currey 
Signed-off-by: Daniel Axtens 
---
 arch/powerpc/lib/feature-fixups.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/feature-fixups.c 
b/arch/powerpc/lib/feature-fixups.c
index 321c12a9ef6b..47821055b94c 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -124,7 +124,7 @@ static void do_stf_entry_barrier_fixups(enum 
stf_barrier_type types)
long *start, *end;
int i;
 
-   start = PTRRELOC(&__start___stf_entry_barrier_fixup),
+   start = PTRRELOC(&__start___stf_entry_barrier_fixup);
end = PTRRELOC(&__stop___stf_entry_barrier_fixup);
 
instrs[0] = 0x6000; /* nop */
@@ -176,7 +176,7 @@ static void do_stf_exit_barrier_fixups(enum 
stf_barrier_type types)
long *start, *end;
int i;
 
-   start = PTRRELOC(&__start___stf_exit_barrier_fixup),
+   start = PTRRELOC(&__start___stf_exit_barrier_fixup);
end = PTRRELOC(&__stop___stf_exit_barrier_fixup);
 
instrs[0] = 0x6000; /* nop */
@@ -344,7 +344,7 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
long *start, *end;
int i;
 
-   start = PTRRELOC(&__start___rfi_flush_fixup),
+   start = PTRRELOC(&__start___rfi_flush_fixup);
end = PTRRELOC(&__stop___rfi_flush_fixup);
 
instrs[0] = 0x6000; /* nop */
@@ -417,7 +417,7 @@ void do_barrier_nospec_fixups(bool enable)
 {
void *start, *end;
 
-   start = PTRRELOC(&__start___barrier_nospec_fixup),
+   start = PTRRELOC(&__start___barrier_nospec_fixup);
end = PTRRELOC(&__stop___barrier_nospec_fixup);
 
do_barrier_nospec_fixups_range(enable, start, end);
-- 
2.25.1



Re: [PATCH 1/5] ARM: configs: drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Catalin Marinas
On Mon, Nov 30, 2020 at 07:50:25PM +, ZHIZHIKIN Andrey wrote:
> From Krzysztof Kozlowski :
> > On Mon, Nov 30, 2020 at 03:21:33PM +, Andrey Zhizhikin wrote:
> > > Commit 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
> > > unused") removed generic_bl driver from the tree, together with
> > > corresponding config option.
> > >
> > > Remove BACKLIGHT_GENERIC config item from all ARM configurations.
> > >
> > > Fixes: 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it
> > > is unused")
> > > Cc: Sam Ravnborg 
> > > Signed-off-by: Andrey Zhizhikin
> > > 
> > > ---
> > >  arch/arm/configs/at91_dt_defconfig| 1 -
> > >  arch/arm/configs/cm_x300_defconfig| 1 -
> > >  arch/arm/configs/colibri_pxa300_defconfig | 1 -
> > >  arch/arm/configs/jornada720_defconfig | 1 -
> > >  arch/arm/configs/magician_defconfig   | 1 -
> > >  arch/arm/configs/mini2440_defconfig   | 1 -
> > >  arch/arm/configs/omap2plus_defconfig  | 1 -
> > >  arch/arm/configs/pxa3xx_defconfig | 1 -
> > >  arch/arm/configs/qcom_defconfig   | 1 -
> > >  arch/arm/configs/sama5_defconfig  | 1 -
> > >  arch/arm/configs/sunxi_defconfig  | 1 -
> > >  arch/arm/configs/tegra_defconfig  | 1 -
> > >  arch/arm/configs/u8500_defconfig  | 1 -
> > >  13 files changed, 13 deletions(-)
> > 
> > You need to send it to arm-soc maintainers, otherwise no one might feel
> > responsible enough to pick it up.
> 
> Good point, thanks a lot!
> 
> I was not aware of the fact that there is a separate ML that should
> receive patches targeted ARM SOCs. Can you (or anyone else) please
> share it, so I can re-send it there as well?

It's not a mailing list as such (with archives etc.), just an alias to
the arm-soc maintainers: a...@kernel.org.

> > Reviewed-by: Krzysztof Kozlowski 
> > 
> > +CC Arnd and Olof,
> > 
> > Dear Arnd and Olof,
> > 
> > Maybe it is worth to add arm-soc entry to the MAINTAINERS file?
> > Otherwise how one could get your email address? Not mentioning the
> > secret-soc address. :)

I tried to convince them before, it didn't work. I guess they don't like
to be spammed ;). Or rather, SoC-specific patches, even to defconfig,
should go through the specific SoC maintainers. However, there are
occasional defconfig patches which are more generic or affecting
multiple SoCs. I just ignore them as the arm64 defconfig is usually
handled by the arm-soc folk (when I need a defconfig change, I go for
arch/arm64/Kconfig directly ;)).

Anyway, I still think that we should add a MAINTAINERS entry for
arch/arm64/configs/defconfig and arch/arm64/Kconfig.platforms.

-- 
Catalin


Re: [PATCH v8 05/12] mm: HUGE_VMAP arch support cleanup

2020-12-01 Thread Catalin Marinas
On Sun, Nov 29, 2020 at 01:25:52AM +1000, Nicholas Piggin wrote:
> This changes the awkward approach where architectures provide init
> functions to determine which levels they can provide large mappings for,
> to one where the arch is queried for each call.
> 
> This removes code and indirection, and allows constant-folding of dead
> code for unsupported levels.
> 
> This also adds a prot argument to the arch query. This is unused
> currently but could help with some architectures (e.g., some powerpc
> processors can't map uncacheable memory with large pages).
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: x...@kernel.org
> Cc: "H. Peter Anvin" 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/arm64/include/asm/vmalloc.h |  8 +++
>  arch/arm64/mm/mmu.c  | 10 +--

For arm64:

Acked-by: Catalin Marinas 


Re: [RFC PATCH 01/14] ftrace: Fix updating FTRACE_FL_TRAMP

2020-12-01 Thread Naveen N. Rao

Steven Rostedt wrote:

On Thu, 26 Nov 2020 23:38:38 +0530
"Naveen N. Rao"  wrote:


On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in
ftrace_get_addr_new() followed by the below message:
  Bad trampoline accounting at: 4222522f (wake_up_process+0xc/0x20) (f001)

The set of steps leading to this involved:
- modprobe ftrace-direct-too
- enable_probe
- modprobe ftrace-direct
- rmmod ftrace-direct <-- trigger

The problem turned out to be that we were not updating flags in the
ftrace record properly. From the above message about the trampoline
accounting being bad, it can be seen that the ftrace record still has
FTRACE_FL_TRAMP set though ftrace-direct module is going away. This
happens because we are checking if any ftrace_ops has the
FTRACE_FL_TRAMP flag set _before_ updating the filter hash.

The fix for this is to look for any _other_ ftrace_ops that also needs
FTRACE_FL_TRAMP.


I'm applying this now and sending this for -rc and stable.

The code worked on x86 because x86 has a way to make all users use
trampolines, so this was never an issue (everything has a trampoline).
I modified the kernel so that x86 would not create its own trampoline
(see the weak function arch_ftrace_update_trampoline()), and I was able
to reproduce the bug.


Good to know that you were able to reproduce this.



I'm adding:

Cc: sta...@vger.kernel.org
Fixes: a124692b698b0 ("ftrace: Enable trampoline when rec count returns back to 
one")


That looks good to me. Thanks for picking the two patches and for your 
review on the others!



- Naveen



Re: [PATCH v2 1/2] kbuild: Hoist '--orphan-handling' into Kconfig

2020-12-01 Thread Masahiro Yamada
On Sat, Nov 21, 2020 at 9:08 AM Kees Cook  wrote:
>
> On Thu, Nov 19, 2020 at 01:46:56PM -0700, Nathan Chancellor wrote:
> > Currently, '--orphan-handling=warn' is spread out across four different
> > architectures in their respective Makefiles, which makes it a little
> > unruly to deal with in case it needs to be disabled for a specific
> > linker version (in this case, ld.lld 10.0.1).
> >
> > To make it easier to control this, hoist this warning into Kconfig and
> > the main Makefile so that disabling it is simpler, as the warning will
> > only be enabled in a couple places (main Makefile and a couple of
> > compressed boot folders that blow away LDFLAGS_vmlinux) and making it
> > conditional is easier due to Kconfig syntax. One small additional
> > benefit of this is saving a call to ld-option on incremental builds
> > because we will have already evaluated it for CONFIG_LD_ORPHAN_WARN.
> >
> > To keep the list of supported architectures the same, introduce
> > CONFIG_ARCH_WANT_LD_ORPHAN_WARN, which an architecture can select to
> > gain this automatically after all of the sections are specified and size
> > asserted. A special thanks to Kees Cook for the help text on this
> > config.
> >
> > Link: https://github.com/ClangBuiltLinux/linux/issues/1187
> > Acked-by: Kees Cook 
> > Acked-by: Michael Ellerman  (powerpc)
> > Reviewed-by: Nick Desaulniers 
> > Tested-by: Nick Desaulniers 
> > Signed-off-by: Nathan Chancellor 
>
> Masahiro, do you want to take these to get them to Linus for v5.10? I
> can send them if you'd prefer.
>



Sorry for the delay.

Applied to linux-kbuild.





> -Kees
>
> --
> Kees Cook
>



-- 
Best Regards
Masahiro Yamada


Re: [PATCH v2 2/2] kbuild: Disable CONFIG_LD_ORPHAN_WARN for ld.lld 10.0.1

2020-12-01 Thread Masahiro Yamada
On Wed, Nov 25, 2020 at 7:22 AM Kees Cook  wrote:
>
> On Thu, Nov 19, 2020 at 01:13:27PM -0800, Nick Desaulniers wrote:
> > On Thu, Nov 19, 2020 at 12:57 PM Nathan Chancellor
> >  wrote:
> > >
> > > ld.lld 10.0.1 spews a bunch of various warnings about .rela sections,
> > > along with a few others. Newer versions of ld.lld do not have these
> > > warnings. As a result, do not add '--orphan-handling=warn' to
> > > LDFLAGS_vmlinux if ld.lld's version is not new enough.
> > >
> > > Link: https://github.com/ClangBuiltLinux/linux/issues/1187
> > > Link: https://github.com/ClangBuiltLinux/linux/issues/1193
> > > Reported-by: Arvind Sankar 
> > > Reported-by: kernelci.org bot 
> > > Reported-by: Mark Brown 
> > > Reviewed-by: Kees Cook 
> > > Signed-off-by: Nathan Chancellor 
> >
> > Thanks for the additions in v2.
> > Reviewed-by: Nick Desaulniers 
>
> I'm going to carry this for a few days in -next, and if no one screams,
> ask Linus to pull it for v5.10-rc6.
>
> Thanks!
>
> --
> Kees Cook


Sorry for the delay.
Applied to linux-kbuild.

But, I already see this in linux-next.

Please let me know if I should drop it from my tree.


-- 
Best Regards
Masahiro Yamada


Re: [RFC PATCH] powerpc/papr_scm: Implement scm async flush

2020-12-01 Thread Pankaj Gupta
> >> The patch implements the SCM async-flush hcall and sets the
> >> ND_REGION_ASYNC capability when the platform device tree
> >> has "ibm,async-flush-required" set.
> >
> > So, you are reusing the existing ND_REGION_ASYNC flag for the
> > hypercall based async flush with device tree discovery?
> >
> > Out of curiosity, does virtio based flush work in ppc? Was just thinking
> > if we can reuse virtio based flush present in virtio-pmem? Or anything
> > else we are trying to achieve here?
> >
>
>
> Not with PAPR based pmem driver papr_scm.ko. The devices there are
> considered platform device and we use hypercalls to configure the
> device. On similar fashion we are now using hypercall to flush the host
> based caches.

o.k. Thanks for answering.

Best regards,
Pankaj

>
> -aneesh


Re: [RFC PATCH] powerpc/papr_scm: Implement scm async flush

2020-12-01 Thread Aneesh Kumar K.V

On 12/1/20 6:17 PM, Pankaj Gupta wrote:

The patch implements the SCM async-flush hcall and sets the
ND_REGION_ASYNC capability when the platform device tree
has "ibm,async-flush-required" set.


So, you are reusing the existing ND_REGION_ASYNC flag for the
hypercall based async flush with device tree discovery?

Out of curiosity, does virtio based flush work in ppc? Was just thinking
if we can reuse virtio based flush present in virtio-pmem? Or anything
else we are trying to achieve here?




Not with PAPR based pmem driver papr_scm.ko. The devices there are 
considered platform device and we use hypercalls to configure the 
device. On similar fashion we are now using hypercall to flush the host 
based caches.


-aneesh


Re: [RFC PATCH] powerpc/papr_scm: Implement scm async flush

2020-12-01 Thread Pankaj Gupta
> The patch implements the SCM async-flush hcall and sets the
> ND_REGION_ASYNC capability when the platform device tree
> has "ibm,async-flush-required" set.

So, you are reusing the existing ND_REGION_ASYNC flag for the
hypercall based async flush with device tree discovery?

Out of curiosity, does virtio based flush work in ppc? Was just thinking
if we can reuse virtio based flush present in virtio-pmem? Or anything
else we are trying to achieve here?

Thanks,
Pankaj
>
> The below demonstration shows the map_sync behavior when
> ibm,async-flush-required is present in device tree.
> (https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)
>
> The pmem0 is from nvdimm without async-flush-required,
> and pmem1 is from nvdimm with async-flush-required, mounted as
> /dev/pmem0 on /mnt1 type xfs 
> (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
> /dev/pmem1 on /mnt2 type xfs 
> (rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
>
> #./mapsync /mnt1/newfile> Without async-flush-required
> #./mapsync /mnt2/newfile> With async-flush-required
> Failed to mmap  with Operation not supported
>
> Signed-off-by: Shivaprasad G Bhat 
> ---
> The HCALL semantics are in review, not final.

Any link of the discussion?

>
>  Documentation/powerpc/papr_hcalls.rst |   14 ++
>  arch/powerpc/include/asm/hvcall.h |3 +-
>  arch/powerpc/platforms/pseries/papr_scm.c |   39 
> +
>  3 files changed, 55 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst 
> b/Documentation/powerpc/papr_hcalls.rst
> index 48fcf1255a33..cc310814f24c 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -275,6 +275,20 @@ Health Bitmap Flags:
>  Given a DRC Index collect the performance statistics for NVDIMM and copy them
>  to the resultBuffer.
>
> +**H_SCM_ASYNC_FLUSH**
> +
> +| Input: *drcIndex*
> +| Out: *continue-token*
> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
> +
> +Given a DRC Index, flush the data to the backend NVDIMM device.
> +
> +The hcall returns H_BUSY when the flush takes longer and the hcall needs
> +to be issued multiple times in order to be completely serviced. The
> +*continue-token* from the output is to be passed in the argument list of
> +subsequent hcalls to the hypervisor until the hcall is completely serviced,
> +at which point H_SUCCESS is returned by the hypervisor.
> +
>  References
>  ==
>  .. [1] "Power Architecture Platform Reference"
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index c1fbccb04390..4a13074bc782 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -306,7 +306,8 @@
>  #define H_SCM_HEALTH0x400
>  #define H_SCM_PERFORMANCE_STATS 0x418
>  #define H_RPT_INVALIDATE   0x448
> -#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
> +#define H_SCM_ASYNC_FLUSH  0x4A0
> +#define MAX_HCALL_OPCODE   H_SCM_ASYNC_FLUSH
>
>  /* Scope args for H_SCM_UNBIND_ALL */
>  #define H_UNBIND_SCOPE_ALL (0x1)
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..1f8c5153cb3d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -93,6 +93,7 @@ struct papr_scm_priv {
> uint64_t block_size;
> int metadata_size;
> bool is_volatile;
> +   bool async_flush_required;
>
> uint64_t bound_addr;
>
> @@ -117,6 +118,38 @@ struct papr_scm_priv {
> size_t stat_buffer_len;
>  };
>
> +static int papr_scm_pmem_flush(struct nd_region *nd_region, struct bio *bio)
> +{
> +   unsigned long ret[PLPAR_HCALL_BUFSIZE];
> +   struct papr_scm_priv *p = nd_region_provider_data(nd_region);
> +   int64_t rc;
> +   uint64_t token = 0;
> +
> +   do {
> +   rc = plpar_hcall(H_SCM_ASYNC_FLUSH, ret, p->drc_index, token);
> +
> +   /* Check if we are stalled for some time */
> +   token = ret[0];
> +   if (H_IS_LONG_BUSY(rc)) {
> +   msleep(get_longbusy_msecs(rc));
> +   rc = H_BUSY;
> +   } else if (rc == H_BUSY) {
> +   cond_resched();
> +   }
> +
> +   } while (rc == H_BUSY);
> +
> +   if (rc)
> +   dev_err(&p->pdev->dev, "flush error: %lld\n", rc);
> +   else
> +   dev_dbg(&p->pdev->dev, "flush drc 0x%x complete\n",
> +   p->drc_index);
> +
> +   dev_dbg(&p->pdev->dev, "Flush call complete\n");
> +
> +   return rc;
> +}
> +
>  static LIST_HEAD(papr_nd_regions);
>  static DEFINE_MUTEX(papr_ndr_lock);
>
> @@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> ndr_desc.num_mappings = 1;
> 

[RFC PATCH] powerpc/papr_scm: Implement scm async flush

2020-12-01 Thread Shivaprasad G Bhat
The patch implements the SCM async-flush hcall and sets the
ND_REGION_ASYNC capability when the platform device tree
has "ibm,async-flush-required" set.

The below demonstration shows the map_sync behavior when
ibm,async-flush-required is present in device tree.
(https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c)

The pmem0 is from nvdimm without async-flush-required,
and pmem1 is from nvdimm with async-flush-required, mounted as
/dev/pmem0 on /mnt1 type xfs 
(rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
/dev/pmem1 on /mnt2 type xfs 
(rw,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)

#./mapsync /mnt1/newfile> Without async-flush-required
#./mapsync /mnt2/newfile> With async-flush-required
Failed to mmap  with Operation not supported
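
The probe in that test boils down to an mmap with MAP_SYNC, which the kernel
rejects on regions that require an explicit flush. A minimal standalone
sketch (the fallback macro values are the standard uapi ones, in case older
libc headers lack them):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>

  #ifndef MAP_SHARED_VALIDATE
  #define MAP_SHARED_VALIDATE 0x03
  #endif
  #ifndef MAP_SYNC
  #define MAP_SYNC 0x80000
  #endif

  int main(int argc, char **argv)
  {
  	int fd = open(argv[1], O_RDWR);
  	void *p;

  	if (fd < 0)
  		return 1;
  	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
  		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
  	if (p == MAP_FAILED)
  		perror("Failed to mmap");	/* EOPNOTSUPP on ND_REGION_ASYNC regions */
  	return p == MAP_FAILED;
  }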

Signed-off-by: Shivaprasad G Bhat 
---
The HCALL semantics are in review, not final.

 Documentation/powerpc/papr_hcalls.rst |   14 ++
 arch/powerpc/include/asm/hvcall.h |3 +-
 arch/powerpc/platforms/pseries/papr_scm.c |   39 +
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/Documentation/powerpc/papr_hcalls.rst 
b/Documentation/powerpc/papr_hcalls.rst
index 48fcf1255a33..cc310814f24c 100644
--- a/Documentation/powerpc/papr_hcalls.rst
+++ b/Documentation/powerpc/papr_hcalls.rst
@@ -275,6 +275,20 @@ Health Bitmap Flags:
 Given a DRC Index collect the performance statistics for NVDIMM and copy them
 to the resultBuffer.
 
+**H_SCM_ASYNC_FLUSH**
+
+| Input: *drcIndex*
+| Out: *continue-token*
+| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
+
+Given a DRC Index, flush the data to the backend NVDIMM device.
+
+The hcall returns H_BUSY when the flush takes longer and the hcall needs
+to be issued multiple times in order to be completely serviced. The
+*continue-token* from the output is to be passed in the argument list of
+subsequent hcalls to the hypervisor until the hcall is completely serviced,
+at which point H_SUCCESS is returned by the hypervisor.
+
 References
 ==
 .. [1] "Power Architecture Platform Reference"
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index c1fbccb04390..4a13074bc782 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -306,7 +306,8 @@
 #define H_SCM_HEALTH0x400
 #define H_SCM_PERFORMANCE_STATS 0x418
 #define H_RPT_INVALIDATE   0x448
-#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
+#define H_SCM_ASYNC_FLUSH  0x4A0
+#define MAX_HCALL_OPCODE   H_SCM_ASYNC_FLUSH
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 835163f54244..1f8c5153cb3d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -93,6 +93,7 @@ struct papr_scm_priv {
uint64_t block_size;
int metadata_size;
bool is_volatile;
+   bool async_flush_required;
 
uint64_t bound_addr;
 
@@ -117,6 +118,38 @@ struct papr_scm_priv {
size_t stat_buffer_len;
 };
 
+static int papr_scm_pmem_flush(struct nd_region *nd_region, struct bio *bio)
+{
+   unsigned long ret[PLPAR_HCALL_BUFSIZE];
+   struct papr_scm_priv *p = nd_region_provider_data(nd_region);
+   int64_t rc;
+   uint64_t token = 0;
+
+   do {
+   rc = plpar_hcall(H_SCM_ASYNC_FLUSH, ret, p->drc_index, token);
+
+   /* Check if we are stalled for some time */
+   token = ret[0];
+   if (H_IS_LONG_BUSY(rc)) {
+   msleep(get_longbusy_msecs(rc));
+   rc = H_BUSY;
+   } else if (rc == H_BUSY) {
+   cond_resched();
+   }
+
+   } while (rc == H_BUSY);
+
+   if (rc)
+   dev_err(&p->pdev->dev, "flush error: %lld\n", rc);
+   else
+   dev_dbg(&p->pdev->dev, "flush drc 0x%x complete\n",
+   p->drc_index);
+
+   dev_dbg(&p->pdev->dev, "Flush call complete\n");
+
+   return rc;
+}
+
 static LIST_HEAD(papr_nd_regions);
 static DEFINE_MUTEX(papr_ndr_lock);
 
@@ -943,6 +976,11 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
ndr_desc.num_mappings = 1;
	ndr_desc.nd_set = &p->nd_set;
 
+   if (p->async_flush_required) {
+   set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
+   ndr_desc.flush = papr_scm_pmem_flush;
+   }
+
if (p->is_volatile)
	p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
else {
@@ -1088,6 +1126,7 @@ static int papr_scm_probe(struct platform_device *pdev)
p->block_size = block_size;
p->blocks = blocks;
p->is_volatile = !of_property_read_bool(dn, "ibm,cache-flush-required");
+   p->async_flush_required = 

Re: [RFC PATCH] powerpc: show registers when unwinding interrupt frames

2020-12-01 Thread Christophe Leroy




On 07/11/2020 at 03:33, Nicholas Piggin wrote:

It's often useful to know the register state for interrupts in
the stack frame. In the below example (with this patch applied),
the important information is the state of the page fault.

A blatant case like this probably rather should have the page
fault regs passed down to the warning, but quite often there are
less obvious cases where an interrupt shows up that might give
some more clues.

The downside is longer and more complex bug output.


Do we want all interrupts, including system call ?

I don't find the dump of the syscall interrupt so useful, do you ?

See below an (unexpected?) KUAP warning due to an expected NULL pointer dereference in 
copy_from_kernel_nofault() called from kthread_probe_data()



[ 1117.202054] [ cut here ]
[ 1117.202102] Bug: fault blocked by AP register !
[ 1117.202261] WARNING: CPU: 0 PID: 377 at arch/powerpc/include/asm/nohash/32/kup-8xx.h:66 
do_page_fault+0x4a8/0x5ec

[ 1117.202310] Modules linked in:
[ 1117.202428] CPU: 0 PID: 377 Comm: sh Tainted: GW 
5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty #4175

[ 1117.202499] NIP:  c0012048 LR: c0012048 CTR: 
[ 1117.202573] REGS: cacdbb88 TRAP: 0700   Tainted: GW 
(5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)

[ 1117.202625] MSR:  00021032   CR: 2408  XER: 2000
[ 1117.202899]
[ 1117.202899] GPR00: c0012048 cacdbc40 c2929290 0023 c092e554 0001 
c09865e8 c092e640
[ 1117.202899] GPR08: 1032   00014efc 28082224 100d166a 
100a0920 
[ 1117.202899] GPR16: 100cac0c 100b 1080c3fc 1080d685 100d 100d 
 100a0900
[ 1117.202899] GPR24: 100d c07892ec  c0921510 c21f4440 005c 
c000 cacdbc80
[ 1117.204362] NIP [c0012048] do_page_fault+0x4a8/0x5ec
[ 1117.204461] LR [c0012048] do_page_fault+0x4a8/0x5ec
[ 1117.204509] Call Trace:
[ 1117.204609] [cacdbc40] [c0012048] do_page_fault+0x4a8/0x5ec (unreliable)
[ 1117.204771] [cacdbc70] [c00112f0] handle_page_fault+0x8/0x34
[ 1117.204911] --- interrupt: 301 at copy_from_kernel_nofault+0x70/0x1c0
[ 1117.204979] NIP:  c010dbec LR: c010dbac CTR: 0001
[ 1117.205053] REGS: cacdbc80 TRAP: 0301   Tainted: GW 
(5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)

[ 1117.205104] MSR:  9032   CR: 28082224  XER: 
[ 1117.205416] DAR: 005c DSISR: c000
[ 1117.205416] GPR00: c0045948 cacdbd38 c2929290 0001 0017 0017 
0027 000f
[ 1117.205416] GPR08: c09926ec   3000 24082224
[ 1117.206106] NIP [c010dbec] copy_from_kernel_nofault+0x70/0x1c0
[ 1117.206202] LR [c010dbac] copy_from_kernel_nofault+0x30/0x1c0
[ 1117.206258] --- interrupt: 301
[ 1117.206372] [cacdbd38] [c004bbb0] kthread_probe_data+0x44/0x70 (unreliable)
[ 1117.206561] [cacdbd58] [c0045948] print_worker_info+0xe0/0x194
[ 1117.206717] [cacdbdb8] [c00548ac] sched_show_task+0x134/0x168
[ 1117.206851] [cacdbdd8] [c005a268] show_state_filter+0x70/0x100
[ 1117.206989] [cacdbe08] [c039baa0] sysrq_handle_showstate+0x14/0x24
[ 1117.207122] [cacdbe18] [c039bf18] __handle_sysrq+0xac/0x1d0
[ 1117.207257] [cacdbe48] [c039c0c0] write_sysrq_trigger+0x4c/0x74
[ 1117.207407] [cacdbe68] [c01fba48] proc_reg_write+0xb4/0x114
[ 1117.207550] [cacdbe88] [c0179968] vfs_write+0x12c/0x478
[ 1117.207686] [cacdbf08] [c0179e60] ksys_write+0x78/0x128
[ 1117.207826] [cacdbf38] [c00110d0] ret_from_syscall+0x0/0x34
[ 1117.207938] --- interrupt: c01 at 0xfd4e784
[ 1117.208008] NIP:  0fd4e784 LR: 0fe0f244 CTR: 10048d38
[ 1117.208083] REGS: cacdbf48 TRAP: 0c01   Tainted: GW 
(5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)

[ 1117.208134] MSR:  d032   CR: 4400  XER: 
[ 1117.208470]
[ 1117.208470] GPR00: 0004 7fc34090 77bfb4e0 0001 1080fa40 0002 
740f fefefeff
[ 1117.208470] GPR08: 7f7f7f7f 10048d38 1080c414 7fc343c0 
[ 1117.209104] NIP [0fd4e784] 0xfd4e784
[ 1117.209180] LR [0fe0f244] 0xfe0f244
[ 1117.209236] --- interrupt: c01
[ 1117.209274] Instruction dump:
[ 1117.209353] 714a4000 418200f0 73ca0001 40820084 73ca0032 408200f8 73c90040 
4082ff60
[ 1117.209727] 0fe0 3c60c082 386399f4 48013b65 <0fe0> 80010034 386b 
7c0803a6
[ 1117.210102] ---[ end trace 1927c0323393af3e ]---

Christophe




   Bug: Write fault blocked by AMR!
   WARNING: CPU: 0 PID: 72 at 
arch/powerpc/include/asm/book3s/64/kup-radix.h:164 __do_page_fault+0x880/0xa90
   Modules linked in:
   CPU: 0 PID: 72 Comm: systemd-gpt-aut Not tainted
   NIP:  c006e2f0 LR: c006e2ec CTR: 
   REGS: ca4f3420 TRAP: 0700
   MSR:  80021033   CR: 28002840  XER: 2004
   CFAR: c0128be0 IRQMASK: 3
   GPR00: c006e2ec ca4f36c0 c14f0700 0020
   GPR04: 0001 c1290f50 0001 c1290f80
   GPR08: c1612b08   e0f7
   GPR12: 48002840 c16e c00c00021c80 

Re: [PATCH 0/5] drop unused BACKLIGHT_GENERIC option

2020-12-01 Thread Daniel Thompson
On Mon, Nov 30, 2020 at 03:21:32PM +, Andrey Zhizhikin wrote:
> Since the removal of generic_bl driver from the source tree in commit
> 7ecdea4a0226 ("backlight: generic_bl: Remove this driver as it is
> unused") BACKLIGHT_GENERIC config option became obsolete as well and
> therefore subject to clean-up from all configuration files.
> 
> This series introduces patches to address this removal, separated by
> architectures in the kernel tree.
> 
> Andrey Zhizhikin (5):
>   ARM: configs: drop unused BACKLIGHT_GENERIC option
>   arm64: defconfig: drop unused BACKLIGHT_GENERIC option
>   MIPS: configs: drop unused BACKLIGHT_GENERIC option
>   parisc: configs: drop unused BACKLIGHT_GENERIC option
>   powerpc/configs: drop unused BACKLIGHT_GENERIC option

Whole series:
Acked-by: Daniel Thompson 


Daniel.


Re: [PATCH v2] clk: renesas: r9a06g032: Drop __packed for portability

2020-12-01 Thread Stephen Rothwell
Hi Geert,

On Mon, 30 Nov 2020 09:57:43 +0100 Geert Uytterhoeven  
wrote:
>
> The R9A06G032 clock driver uses an array of packed structures to reduce
> kernel size.  However, this array contains pointers, which are no longer
> aligned naturally, and cannot be relocated on PPC64.  Hence when
> compile-testing this driver on PPC64 with CONFIG_RELOCATABLE=y (e.g.
> PowerPC allyesconfig), the following warnings are produced:
> 
> WARNING: 136 bad relocations
> c0616be3 R_PPC64_UADDR64   .rodata+0x000cf338
> c0616bfe R_PPC64_UADDR64   .rodata+0x000cf370
> ...
> 
> Fix this by dropping the __packed attribute from the r9a06g032_clkdesc
> definition, trading a small size increase for portability.
> 
> This increases the 156-entry clock table by 1 byte per entry, but due to
> the compiler generating more efficient code for unpacked accesses, the
> net size increase is only 76 bytes (gcc 9.3.0 on arm32).
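
(As an aside, here is a minimal illustration of the failure mode, using
a made-up structure rather than the driver's real r9a06g032_clkdesc:)

/* Hypothetical example: a pointer inside a __packed struct sits at an
 * odd offset, so it cannot use the naturally-aligned R_PPC64_ADDR64
 * relocation; the linker falls back to R_PPC64_UADDR64, which the
 * PPC64 relocatable-kernel build flags as a bad relocation, as in the
 * warnings above.
 */
struct packed_desc {
	unsigned char id;
	const char *name;	/* offset 1: misaligned 8-byte pointer */
} __attribute__((__packed__));

/* Each initialized pointer below would need a UADDR64 relocation when
 * the kernel image is relocated.
 */
static const struct packed_desc table[] = {
	{ 0, "clk_a" },
	{ 1, "clk_b" },
};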
> 
> Reported-by: Stephen Rothwell 
> Fixes: 4c3d88526eba2143 ("clk: renesas: Renesas R9A06G032 clock driver")
> Signed-off-by: Geert Uytterhoeven 
> ---
> v2:
>   - Fix authorship.
> ---
>  drivers/clk/renesas/r9a06g032-clocks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/clk/renesas/r9a06g032-clocks.c b/drivers/clk/renesas/r9a06g032-clocks.c
> index d900f6bf53d0b944..892e91b92f2c80f5 100644
> --- a/drivers/clk/renesas/r9a06g032-clocks.c
> +++ b/drivers/clk/renesas/r9a06g032-clocks.c
> @@ -55,7 +55,7 @@ struct r9a06g032_clkdesc {
>   u16 sel, g1, r1, g2, r2;
>   } dual;
>   };
> -} __packed;
> +};
>  
>  #define I_GATE(_clk, _rst, _rdy, _midle, _scon, _mirack, _mistat) \
>   { .gate = _clk, .reset = _rst, \
> -- 
> 2.25.1
> 

Tested-by: Stephen Rothwell  # PowerPC allyesconfig build

-- 
Cheers,
Stephen Rothwell




[PATCH] powerpc/mm: Don't see NULL pointer dereference as a KUAP fault

2020-12-01 Thread Christophe Leroy
Sometimes, NULL pointer dereferences are expected. Even when they
are accidental, they are unlikely to be an exploit attempt, because
the first page is never mapped.

The example below shows what we get when invoking the "show task"
sysrq handler, by writing 't' to /proc/sysrq-trigger:

[ 1117.202054] [ cut here ]
[ 1117.202102] Bug: fault blocked by AP register !
[ 1117.202261] WARNING: CPU: 0 PID: 377 at arch/powerpc/include/asm/nohash/32/kup-8xx.h:66 do_page_fault+0x4a8/0x5ec
[ 1117.202310] Modules linked in:
[ 1117.202428] CPU: 0 PID: 377 Comm: sh Tainted: GW 5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty #4175
[ 1117.202499] NIP:  c0012048 LR: c0012048 CTR: 
[ 1117.202573] REGS: cacdbb88 TRAP: 0700   Tainted: GW (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
[ 1117.202625] MSR:  00021032   CR: 2408  XER: 2000
[ 1117.202899]
[ 1117.202899] GPR00: c0012048 cacdbc40 c2929290 0023 c092e554 0001 c09865e8 c092e640
[ 1117.202899] GPR08: 1032   00014efc 28082224 100d166a 100a0920 
[ 1117.202899] GPR16: 100cac0c 100b 1080c3fc 1080d685 100d 100d  100a0900
[ 1117.202899] GPR24: 100d c07892ec  c0921510 c21f4440 005c c000 cacdbc80
[ 1117.204362] NIP [c0012048] do_page_fault+0x4a8/0x5ec
[ 1117.204461] LR [c0012048] do_page_fault+0x4a8/0x5ec
[ 1117.204509] Call Trace:
[ 1117.204609] [cacdbc40] [c0012048] do_page_fault+0x4a8/0x5ec (unreliable)
[ 1117.204771] [cacdbc70] [c00112f0] handle_page_fault+0x8/0x34
[ 1117.204911] --- interrupt: 301 at copy_from_kernel_nofault+0x70/0x1c0
[ 1117.204979] NIP:  c010dbec LR: c010dbac CTR: 0001
[ 1117.205053] REGS: cacdbc80 TRAP: 0301   Tainted: GW (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
[ 1117.205104] MSR:  9032   CR: 28082224  XER: 
[ 1117.205416] DAR: 005c DSISR: c000
[ 1117.205416] GPR00: c0045948 cacdbd38 c2929290 0001 0017 0017 0027 000f
[ 1117.205416] GPR08: c09926ec   3000 24082224
[ 1117.206106] NIP [c010dbec] copy_from_kernel_nofault+0x70/0x1c0
[ 1117.206202] LR [c010dbac] copy_from_kernel_nofault+0x30/0x1c0
[ 1117.206258] --- interrupt: 301
[ 1117.206372] [cacdbd38] [c004bbb0] kthread_probe_data+0x44/0x70 (unreliable)
[ 1117.206561] [cacdbd58] [c0045948] print_worker_info+0xe0/0x194
[ 1117.206717] [cacdbdb8] [c00548ac] sched_show_task+0x134/0x168
[ 1117.206851] [cacdbdd8] [c005a268] show_state_filter+0x70/0x100
[ 1117.206989] [cacdbe08] [c039baa0] sysrq_handle_showstate+0x14/0x24
[ 1117.207122] [cacdbe18] [c039bf18] __handle_sysrq+0xac/0x1d0
[ 1117.207257] [cacdbe48] [c039c0c0] write_sysrq_trigger+0x4c/0x74
[ 1117.207407] [cacdbe68] [c01fba48] proc_reg_write+0xb4/0x114
[ 1117.207550] [cacdbe88] [c0179968] vfs_write+0x12c/0x478
[ 1117.207686] [cacdbf08] [c0179e60] ksys_write+0x78/0x128
[ 1117.207826] [cacdbf38] [c00110d0] ret_from_syscall+0x0/0x34
[ 1117.207938] --- interrupt: c01 at 0xfd4e784
[ 1117.208008] NIP:  0fd4e784 LR: 0fe0f244 CTR: 10048d38
[ 1117.208083] REGS: cacdbf48 TRAP: 0c01   Tainted: GW (5.10.0-rc5-s3k-dev-01340-g83f53be2de31-dirty)
[ 1117.208134] MSR:  d032   CR: 4400  XER: 
[ 1117.208470]
[ 1117.208470] GPR00: 0004 7fc34090 77bfb4e0 0001 1080fa40 0002 740f fefefeff
[ 1117.208470] GPR08: 7f7f7f7f 10048d38 1080c414 7fc343c0 
[ 1117.209104] NIP [0fd4e784] 0xfd4e784
[ 1117.209180] LR [0fe0f244] 0xfe0f244
[ 1117.209236] --- interrupt: c01
[ 1117.209274] Instruction dump:
[ 1117.209353] 714a4000 418200f0 73ca0001 40820084 73ca0032 408200f8 73c90040 4082ff60
[ 1117.209727] 0fe0 3c60c082 386399f4 48013b65 <0fe0> 80010034 386b 7c0803a6
[ 1117.210102] ---[ end trace 1927c0323393af3e ]---

So, avoid the big KUAP warning by bailing out of bad_kernel_fault()
before calling bad_kuap_fault() when the address references the first
page.
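
For reference, this is the kind of probe involved (an illustrative
sketch, not code from the kernel tree): copy_from_kernel_nofault()
already fixes up the fault and returns -EFAULT, so the only problem
with a NULL source is the spurious KUAP warning.

#include <linux/uaccess.h>

/* Illustration only: probe a kernel pointer that may be NULL, the way
 * kthread_probe_data() does. A NULL source lands in the never-mapped
 * first page; the fault is fixed up and -EFAULT is returned, so no
 * KUAP violation is actually involved.
 */
static int probe_maybe_null(const void *src)
{
	unsigned long val;

	if (copy_from_kernel_nofault(&val, src, sizeof(val)))
		return -EFAULT;	/* expected for src == NULL */

	return 0;
}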

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/fault.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 0add963a849b..be2b4318206f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -198,6 +198,10 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
 {
int is_exec = TRAP(regs) == 0x400;
 
+   // Kernel fault on first page is likely a NULL pointer dereference
+   if (address < PAGE_SIZE)
+   return true;
+
	/* NX faults set DSISR_PROTFAULT on the 8xx, DSISR_NOEXEC_OR_G on others */
if (is_exec && (error_code & (DSISR_NOEXEC_OR_G | DSISR_KEYFAULT |
  DSISR_PROTFAULT))) {
-- 
2.25.0



Re: [PATCH kernel v2] powerpc/pci: Remove LSI mappings on device teardown

2020-12-01 Thread Frederic Barrat




On 01/12/2020 08:39, Alexey Kardashevskiy wrote:

From: Oliver O'Halloran 

When a passthrough IO adapter is removed from a pseries machine using hash
MMU and the XIVE interrupt mode, the POWER hypervisor expects the guest OS
to clear all page table entries related to the adapter. If some are still
present, the RTAS call which isolates the PCI slot returns error 9001
"valid outstanding translations" and the removal of the IO adapter fails.
This is because when the PHBs are scanned, Linux maps automatically the
INTx interrupts in the Linux interrupt number space but these are never
removed.

This problem can be fixed by adding the corresponding unmap operation when
the device is removed. There's no pcibios_* hook for the remove case, but
the same effect can be achieved using a bus notifier.

Because INTx are shared among PHBs (and potentially across the system),
this adds tracking of virq to unmap them only when the last user is gone.

Signed-off-by: Oliver O'Halloran 
[aik: added refcounter]
Signed-off-by: Alexey Kardashevskiy 
---


Doing this in the generic irq code is just too much for my small brain :-/


---
  arch/powerpc/kernel/pci-common.c | 71 
  1 file changed, 71 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index be108616a721..0acf17f17253 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -353,6 +353,55 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
 	return NULL;
  }
  
+struct pci_intx_virq {
+	int virq;
+	struct kref kref;
+	struct list_head list_node;
+};
+
+static LIST_HEAD(intx_list);
+static DEFINE_MUTEX(intx_mutex);
+
+static void ppc_pci_intx_release(struct kref *kref)
+{
+	struct pci_intx_virq *vi = container_of(kref, struct pci_intx_virq, kref);
+
+	list_del(&vi->list_node);
+	irq_dispose_mapping(vi->virq);
+	kfree(vi);
+}
+
+static int ppc_pci_unmap_irq_line(struct notifier_block *nb,
+				  unsigned long action, void *data)
+{
+	struct pci_dev *pdev = to_pci_dev(data);
+
+	if (action == BUS_NOTIFY_DEL_DEVICE) {
+		struct pci_intx_virq *vi;
+
+		mutex_lock(&intx_mutex);
+		list_for_each_entry(vi, &intx_list, list_node) {
+			if (vi->virq == pdev->irq) {
+				kref_put(&vi->kref, ppc_pci_intx_release);
+				break;
+			}
+		}
+		mutex_unlock(&intx_mutex);
+	}
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block ppc_pci_unmap_irq_notifier = {
+	.notifier_call = ppc_pci_unmap_irq_line,
+};
+
+static int ppc_pci_register_irq_notifier(void)
+{
+	return bus_register_notifier(&pci_bus_type, &ppc_pci_unmap_irq_notifier);
+}
+arch_initcall(ppc_pci_register_irq_notifier);
+
  /*
   * Reads the interrupt pin to determine if interrupt is use by card.
   * If the interrupt is used, then gets the interrupt line from the
@@ -361,6 +410,12 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
  static int pci_read_irq_line(struct pci_dev *pci_dev)
  {
int virq;
+   struct pci_intx_virq *vi, *vitmp;
+
+   /* Preallocate vi as rewind is complex if this fails after mapping */



Seems OK to me, as the failure is unexpected.
But then we need to free that memory on all the error paths below.

  Fred





+	vi = kzalloc(sizeof(struct pci_intx_virq), GFP_KERNEL);
+	if (!vi)
+		return -1;
  
  	pr_debug("PCI: Try to map irq for %s...\n", pci_name(pci_dev));
  
@@ -401,6 +456,22 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
  
  	pci_dev->irq = virq;
  
+	mutex_lock(&intx_mutex);
+	list_for_each_entry(vitmp, &intx_list, list_node) {
+		if (vitmp->virq == virq) {
+			kref_get(&vitmp->kref);
+			kfree(vi);
+			vi = NULL;
+			break;
+		}
+	}
+	if (vi) {
+		vi->virq = virq;
+		kref_init(&vi->kref);
+		list_add_tail(&vi->list_node, &intx_list);
+	}
+	mutex_unlock(&intx_mutex);
+
return 0;
  }
  



Re: [PATCH kernel v2] powerpc/pci: Remove LSI mappings on device teardown

2020-12-01 Thread Cédric Le Goater
On 12/1/20 8:39 AM, Alexey Kardashevskiy wrote:
> From: Oliver O'Halloran 
> 
> When a passthrough IO adapter is removed from a pseries machine using hash
> MMU and the XIVE interrupt mode, the POWER hypervisor expects the guest OS
> to clear all page table entries related to the adapter. If some are still
> present, the RTAS call which isolates the PCI slot returns error 9001
> "valid outstanding translations" and the removal of the IO adapter fails.
> This is because when the PHBs are scanned, Linux maps automatically the
> INTx interrupts in the Linux interrupt number space but these are never
> removed.
> 
> This problem can be fixed by adding the corresponding unmap operation when
> the device is removed. There's no pcibios_* hook for the remove case, but
> the same effect can be achieved using a bus notifier.
> 
> Because INTx are shared among PHBs (and potentially across the system),
> this adds tracking of virq to unmap them only when the last user is gone.
> 
> Signed-off-by: Oliver O'Halloran 
> [aik: added refcounter]
> Signed-off-by: Alexey Kardashevskiy 

Looks good to me and the system survives all the PCI hotplug tests I used 
to do on my first attempts to fix this issue. 

One comment below,

> ---
> 
> 
> Doing this in the generic irq code is just too much for my small brain :-/

Maybe more cleanups are required in the PCI/MSI/IRQ PPC layers before
considering your first approach. You think too much in advance!

> 
> ---
>  arch/powerpc/kernel/pci-common.c | 71 
>  1 file changed, 71 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index be108616a721..0acf17f17253 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -353,6 +353,55 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
>   return NULL;
>  }
>  
> +struct pci_intx_virq {
> +	int virq;
> +	struct kref kref;
> +	struct list_head list_node;
> +};
> +
> +static LIST_HEAD(intx_list);
> +static DEFINE_MUTEX(intx_mutex);
> +
> +static void ppc_pci_intx_release(struct kref *kref)
> +{
> +	struct pci_intx_virq *vi = container_of(kref, struct pci_intx_virq, kref);
> +
> +	list_del(&vi->list_node);
> +	irq_dispose_mapping(vi->virq);
> +	kfree(vi);
> +}
> +
> +static int ppc_pci_unmap_irq_line(struct notifier_block *nb,
> +				  unsigned long action, void *data)
> +{
> +	struct pci_dev *pdev = to_pci_dev(data);
> +
> +	if (action == BUS_NOTIFY_DEL_DEVICE) {
> +		struct pci_intx_virq *vi;
> +
> +		mutex_lock(&intx_mutex);
> +		list_for_each_entry(vi, &intx_list, list_node) {
> +			if (vi->virq == pdev->irq) {
> +				kref_put(&vi->kref, ppc_pci_intx_release);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&intx_mutex);
> +	}
> +
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block ppc_pci_unmap_irq_notifier = {
> +	.notifier_call = ppc_pci_unmap_irq_line,
> +};
> +
> +static int ppc_pci_register_irq_notifier(void)
> +{
> +	return bus_register_notifier(&pci_bus_type, &ppc_pci_unmap_irq_notifier);
> +}
> +arch_initcall(ppc_pci_register_irq_notifier);
> +
>  /*
>   * Reads the interrupt pin to determine if interrupt is use by card.
>   * If the interrupt is used, then gets the interrupt line from the
> @@ -361,6 +410,12 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
>  static int pci_read_irq_line(struct pci_dev *pci_dev)
>  {
>   int virq;
> + struct pci_intx_virq *vi, *vitmp;
> +
> + /* Preallocate vi as rewind is complex if this fails after mapping */

AFAICT, we only need to call irq_dispose_mapping() if allocation fails.
If so, it would be simpler to isolate the code in a pci_intx_register(virq) 
helper and call it from pci_read_irq_line().
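
Something like this, perhaps (an untested sketch; the helper name
pci_intx_register() is mine, not from the posted patch):

static int pci_intx_register(int virq)
{
	struct pci_intx_virq *vi, *vitmp;

	/* Preallocate outside the lock, as in the original patch */
	vi = kzalloc(sizeof(*vi), GFP_KERNEL);
	if (!vi)
		return -ENOMEM;

	mutex_lock(&intx_mutex);
	list_for_each_entry(vitmp, &intx_list, list_node) {
		if (vitmp->virq == virq) {
			kref_get(&vitmp->kref);
			kfree(vi);
			goto out;
		}
	}
	vi->virq = virq;
	kref_init(&vi->kref);
	list_add_tail(&vi->list_node, &intx_list);
out:
	mutex_unlock(&intx_mutex);
	return 0;
}

Then pci_read_irq_line() would only need to dispose of the mapping when
the helper fails, e.g. "if (pci_intx_register(virq)) {
irq_dispose_mapping(virq); return -1; }".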

> +	vi = kzalloc(sizeof(struct pci_intx_virq), GFP_KERNEL);
> +	if (!vi)
> +		return -1;
>  
>   pr_debug("PCI: Try to map irq for %s...\n", pci_name(pci_dev));
>  
> @@ -401,6 +456,22 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
>  
>   pci_dev->irq = virq;
>  
> +	mutex_lock(&intx_mutex);
> +	list_for_each_entry(vitmp, &intx_list, list_node) {
> +		if (vitmp->virq == virq) {
> +			kref_get(&vitmp->kref);
> +			kfree(vi);
> +			vi = NULL;
> +			break;
> +		}
> +	}
> +	if (vi) {
> +		vi->virq = virq;
> +		kref_init(&vi->kref);
> +		list_add_tail(&vi->list_node, &intx_list);
> +	}
> +	mutex_unlock(&intx_mutex);
> +
>   return 0;
>  }
>  
> 



[PATCH] powerpc/perf: Invoke per-CPU variable access with disabled interrupts

2020-12-01 Thread Athira Rajeev
The power_pmu_event_init() callback accesses the per-cpu variable
(cpu_hw_events) to check for event constraints and Branch Stack
(BHRB) support. The current code disables preemption around the
per-cpu access, but that does not prevent a timer callback from
interrupting event_init. Fix this by using local_irq_save/restore
to make sure the code path runs with interrupts disabled.

This change was tested in the mambo simulator to ensure that, if a timer
interrupt comes in during the per-cpu access in event_init, it is
soft-masked and replayed later. For testing purposes, a udelay() was
introduced in power_pmu_event_init() to make sure a timer interrupt
arrived while in the per-cpu variable access code between
local_irq_save/restore. As expected, the timer interrupt was replayed
later during the local_irq_restore called from power_pmu_event_init.
This was confirmed by adding a breakpoint in mambo and checking the
backtrace when timer_interrupt was hit.
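
To illustrate the difference (a schematic sketch with a stand-in type,
not the actual core-book3s.c code):

#include <linux/percpu.h>
#include <linux/irqflags.h>

/* Stand-in for the real cpu_hw_events structure, illustration only */
struct cpu_hw_events_sketch { int n_events; };
static DEFINE_PER_CPU(struct cpu_hw_events_sketch, sketch_events);

static void event_init_critical_section(void)
{
	struct cpu_hw_events_sketch *cpuhw;
	unsigned long irq_flags;

	/*
	 * Old pattern: get_cpu_var()/put_cpu_var() only disable
	 * preemption. The task stays on this CPU, but a timer interrupt
	 * can still fire and modify the per-CPU state under our feet.
	 *
	 * New pattern: mask interrupts around the whole access. On
	 * powerpc the timer interrupt is soft-masked and replayed by
	 * local_irq_restore().
	 */
	local_irq_save(irq_flags);
	cpuhw = this_cpu_ptr(&sketch_events);
	cpuhw->n_events++;	/* stands in for constraint/BHRB checks */
	local_irq_restore(irq_flags);
}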

Reported-by: Sebastian Andrzej Siewior 
Signed-off-by: Athira Rajeev 
---
 arch/powerpc/perf/core-book3s.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 3c8c6ce..e38648f0 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1909,7 +1909,7 @@ static bool is_event_blacklisted(u64 ev)
 static int power_pmu_event_init(struct perf_event *event)
 {
u64 ev;
-   unsigned long flags;
+   unsigned long flags, irq_flags;
struct perf_event *ctrs[MAX_HWEVENTS];
u64 events[MAX_HWEVENTS];
unsigned int cflags[MAX_HWEVENTS];
@@ -2017,7 +2017,9 @@ static int power_pmu_event_init(struct perf_event *event)
if (check_excludes(ctrs, cflags, n, 1))
return -EINVAL;
 
-	cpuhw = &get_cpu_var(cpu_hw_events);
+	local_irq_save(irq_flags);
+	cpuhw = this_cpu_ptr(&cpu_hw_events);
+
err = power_check_constraints(cpuhw, events, cflags, n + 1);
 
if (has_branch_stack(event)) {
@@ -2028,13 +2030,13 @@ static int power_pmu_event_init(struct perf_event *event)
event->attr.branch_sample_type);
 
if (bhrb_filter == -1) {
-   put_cpu_var(cpu_hw_events);
+   local_irq_restore(irq_flags);
return -EOPNOTSUPP;
}
cpuhw->bhrb_filter = bhrb_filter;
}
 
-   put_cpu_var(cpu_hw_events);
+   local_irq_restore(irq_flags);
if (err)
return -EINVAL;
 
-- 
1.8.3.1



[PATCH] selftests/powerpc: Fix uninitialized variable warning

2020-12-01 Thread Harish
This patch fixes an uninitialized variable warning in the bad_accesses
test which causes the selftests build to fail on older distributions:

bad_accesses.c: In function ‘bad_access’:
bad_accesses.c:52:9: error: ‘x’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
   printf("Bad - no SEGV! (%c)\n", x);
 ^
cc1: all warnings being treated as errors
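
For reference, a standalone reproducer of the warning pattern
(hypothetical, mirroring the shape of bad_access(); it is not the test
itself):

#include <stdio.h>

/* 'x' is written on the read branch only, so -Wmaybe-uninitialized
 * fires on the printf when the compiler cannot prove which branch ran.
 */
static int access_pattern(char *p, int write)
{
	char x;

	if (write)
		*p = '!';
	else
		x = *p;

	printf("Bad - no SEGV! (%c)\n", x);	/* warning points here */
	return 0;
}

int main(void)
{
	char c = 'a';

	return access_pattern(&c, 0);
}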

Signed-off-by: Harish 
---
 tools/testing/selftests/powerpc/mm/bad_accesses.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/mm/bad_accesses.c b/tools/testing/selftests/powerpc/mm/bad_accesses.c
index fd747b2ffcfc..65d2148b05dc 100644
--- a/tools/testing/selftests/powerpc/mm/bad_accesses.c
+++ b/tools/testing/selftests/powerpc/mm/bad_accesses.c
@@ -38,7 +38,7 @@ static void segv_handler(int n, siginfo_t *info, void *ctxt_v)
 
 int bad_access(char *p, bool write)
 {
-   char x;
+   char x = 0;
 
fault_code = 0;
fault_addr = 0;
-- 
2.26.2