Re: [PATCH] powerpc/fadump: invoke ibm,os-term with rtas_call_unlocked()

2023-06-08 Thread Mahesh J Salgaonkar
On 2023-06-09 10:48:46 Fri, Hari Bathini wrote:
> Invoke ibm,os-term call with rtas_call_unlocked(), without using the
> RTAS spinlock, to avoid deadlock in the unlikely event of a machine
> crash while making an RTAS call.

Thanks for the patch. Minor comment below.

> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/kernel/rtas.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index c087320f..f65b2a8cc0f1 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -1587,6 +1587,7 @@ static bool ibm_extended_os_term;
>  void rtas_os_term(char *str)
>  {
>   s32 token = rtas_function_token(RTAS_FN_IBM_OS_TERM);
> + static struct rtas_args args;
>   int status;
>  
>   /*
> @@ -1607,7 +1608,7 @@ void rtas_os_term(char *str)
>* schedules.
>*/
>   do {
> - status = rtas_call(token, 1, 1, NULL, __pa(rtas_os_term_buf));
> + status = rtas_call_unlocked(&args, token, 1, 1, NULL, 
> __pa(rtas_os_term_buf));

rtas_call_unlocked() returns void. You may want to extract the status
from args.rets[0].
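
Something along these lines (untested sketch; rets[0] is big-endian):

	do {
		rtas_call_unlocked(&args, token, 1, 1, NULL,
				   __pa(rtas_os_term_buf));
		status = be32_to_cpu(args.rets[0]);
	} while (rtas_busy_delay_time(status));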

Thanks,
-Mahesh.

>   } while (rtas_busy_delay_time(status));
>  
>   if (status != 0)
> -- 
> 2.40.1
> 

-- 
Mahesh J Salgaonkar


[PATCH] powerpc/fadump: invoke ibm,os-term with rtas_call_unlocked()

2023-06-08 Thread Hari Bathini
Invoke ibm,os-term call with rtas_call_unlocked(), without using the
RTAS spinlock, to avoid deadlock in the unlikely event of a machine
crash while making an RTAS call.

Signed-off-by: Hari Bathini 
---
 arch/powerpc/kernel/rtas.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index c087320f..f65b2a8cc0f1 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1587,6 +1587,7 @@ static bool ibm_extended_os_term;
 void rtas_os_term(char *str)
 {
s32 token = rtas_function_token(RTAS_FN_IBM_OS_TERM);
+   static struct rtas_args args;
int status;
 
/*
@@ -1607,7 +1608,7 @@ void rtas_os_term(char *str)
 * schedules.
 */
do {
-   status = rtas_call(token, 1, 1, NULL, __pa(rtas_os_term_buf));
+   status = rtas_call_unlocked(&args, token, 1, 1, NULL, 
__pa(rtas_os_term_buf));
} while (rtas_busy_delay_time(status));
 
if (status != 0)
-- 
2.40.1



[PATCH] powerpc/build: vdso linker warning for orphan sections

2023-06-08 Thread Nicholas Piggin
Add --orphan-handling for vdsos, and adjust vdso linker scripts to deal
with orphan sections.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vdso/Makefile | 2 ++
 arch/powerpc/kernel/vdso/vdso32.lds.S | 4 +++-
 arch/powerpc/kernel/vdso/vdso64.lds.S | 4 +++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/vdso/Makefile 
b/arch/powerpc/kernel/vdso/Makefile
index 4c3f34485f08..23ee96106537 100644
--- a/arch/powerpc/kernel/vdso/Makefile
+++ b/arch/powerpc/kernel/vdso/Makefile
@@ -56,6 +56,8 @@ KCSAN_SANITIZE := n
 ccflags-y := -fno-common -fno-builtin
 ldflags-y := -Wl,--hash-style=both -nostdlib -shared -z noexecstack
 ldflags-$(CONFIG_LD_IS_LLD) += $(call cc-option,--ld-path=$(LD),-fuse-ld=lld)
+ldflags-$(CONFIG_LD_ORPHAN_WARN) += 
-Wl,--orphan-handling=$(CONFIG_LD_ORPHAN_WARN_LEVEL)
+
 # Filter flags that clang will warn are unused for linking
 ldflags-y += $(filter-out $(CC_AUTO_VAR_INIT_ZERO_ENABLER) $(CC_FLAGS_FTRACE) 
-Wa$(comma)%, $(KBUILD_CFLAGS))
 
diff --git a/arch/powerpc/kernel/vdso/vdso32.lds.S 
b/arch/powerpc/kernel/vdso/vdso32.lds.S
index bc0be274a9ac..426e1ccc6971 100644
--- a/arch/powerpc/kernel/vdso/vdso32.lds.S
+++ b/arch/powerpc/kernel/vdso/vdso32.lds.S
@@ -83,9 +83,11 @@ SECTIONS
 
/DISCARD/   : {
*(.note.GNU-stack)
+   *(*.EMB.apuinfo)
+   *(.branch_lt)
*(.data .data.* .gnu.linkonce.d.* .sdata*)
*(.bss .sbss .dynbss .dynsbss)
-   *(.got1)
+   *(.got1 .glink .iplt .rela*)
}
 }
 
diff --git a/arch/powerpc/kernel/vdso/vdso64.lds.S 
b/arch/powerpc/kernel/vdso/vdso64.lds.S
index 744ae5363e6c..bda6c8cdd459 100644
--- a/arch/powerpc/kernel/vdso/vdso64.lds.S
+++ b/arch/powerpc/kernel/vdso/vdso64.lds.S
@@ -32,7 +32,7 @@ SECTIONS
. = ALIGN(16);
.text   : {
*(.text .stub .text.* .gnu.linkonce.t.* __ftr_alt_*)
-   *(.sfpr .glink)
+   *(.sfpr)
}   :text
PROVIDE(__etext = .);
PROVIDE(_etext = .);
@@ -81,10 +81,12 @@ SECTIONS
 
/DISCARD/   : {
*(.note.GNU-stack)
+   *(*.EMB.apuinfo)
*(.branch_lt)
*(.data .data.* .gnu.linkonce.d.* .sdata*)
*(.bss .sbss .dynbss .dynsbss)
*(.opd)
+   *(.glink .iplt .plt .rela*)
}
 }
 
-- 
2.40.1



[PATCH] powerpc/ftrace: Disable ftrace on ppc32 if using clang

2023-06-08 Thread Naveen N Rao
Ftrace on ppc32 expects a three instruction sequence at the beginning of
each function when specifying -pg:
mflr    r0
stw r0,4(r1)
bl  _mcount

This is the case with all supported versions of gcc. Clang however emits
a branch to _mcount after the function prologue, similar to the pre
-mprofile-kernel ABI on ppc64. This is not supported.

Disable ftrace on ppc32 if using clang for now. This can be re-enabled
later if clang picks up support for -fpatchable-function-entry on ppc32.
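
One quick way to inspect the entry sequence a given compiler emits
(illustrative invocation; the target triple is an assumption):

$ echo 'void foo(void) { }' | \
    clang --target=powerpc-linux-gnu -O2 -pg -S -x c - -o -

Running the equivalent through a supported ppc32 gcc shows the
mflr/stw/bl _mcount sequence quoted above at the start of foo.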

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index bff5820b7cda14..d85e3cf4016d90 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -236,7 +236,7 @@ config PPC
select HAVE_FUNCTION_DESCRIPTORSif PPC64_ELF_ABI_V1
select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_FUNCTION_GRAPH_TRACER
-   select HAVE_FUNCTION_TRACER
+   select HAVE_FUNCTION_TRACER if PPC64 || (PPC32 && CC_IS_GCC)
select HAVE_GCC_PLUGINS if GCC_VERSION >= 50200   # 
plugin support on gcc <= 5.1 is buggy on PPC
select HAVE_GENERIC_VDSO
select HAVE_HARDLOCKUP_DETECTOR_ARCHif PPC_BOOK3S_64 && SMP

base-commit: bd517a8442b6c6646a136421cd4c1b95bf4ce32b
-- 
2.40.1



Re: [PATCH mm-unstable v2 01/10] mm/kvm: add mmu_notifier_ops->test_clear_young()

2023-06-08 Thread Yu Zhao
On Tue, Jun 6, 2023 at 2:34 AM Tzung-Bi Shih  wrote:
>
> On Fri, May 26, 2023 at 05:44:26PM -0600, Yu Zhao wrote:
> > +/*
> > + * Architectures that implement kvm_arch_test_clear_young() should override
> > + * kvm_arch_has_test_clear_young().
> > + *
> > + * kvm_arch_has_test_clear_young() is allowed to return false positive, 
> > i.e., it
> > + * can return true if kvm_arch_test_clear_young() is supported but 
> > disabled due
> > + * to some runtime constraint. In this case, kvm_arch_test_clear_young() 
> > should
>
> Is it a typo here?  
> s/kvm_arch_test_clear_young/kvm_arch_has_test_clear_young/.

Not a typo.
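
The quoted comment describes the contract between the two; roughly
(illustrative sketch, not the patch itself; runtime_support_enabled()
is a made-up placeholder):

	static inline bool kvm_arch_has_test_clear_young(void)
	{
		/* a false positive is fine: e.g. a build-time check only */
		return IS_ENABLED(CONFIG_KVM);
	}

	bool kvm_arch_test_clear_young(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		if (!runtime_support_enabled())
			return false;	/* must say no when disabled at runtime */
		/* ... test and clear the accessed bits ... */
		return true;
	}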


kvm/x86: multichase benchmark

2023-06-08 Thread Yu Zhao
TLDR

Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after 
this patchset [1].

Hardware

HOST $ lscpu
Architecture:x86_64
  CPU op-mode(s):32-bit, 64-bit
  Address sizes: 43 bits physical, 48 bits virtual
  Byte Order:Little Endian
CPU(s):  128
  On-line CPU(s) list:   0-127
Vendor ID:   AuthenticAMD
  Model name:AMD Ryzen Threadripper PRO 3995WX 64-Cores
CPU family:  23
Model:   49
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):   1
Stepping:0
Frequency boost: disabled
CPU max MHz: 4308.3979
CPU min MHz: 2200.
BogoMIPS:5390.20
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush mmx fxsr sse sse2
 ...
Virtualization features:
  Virtualization:AMD-V
Caches (sum of all):
  L1d:   2 MiB (64 instances)
  L1i:   2 MiB (64 instances)
  L2:32 MiB (64 instances)
  L3:256 MiB (16 instances)
NUMA:
  NUMA node(s):  1
  NUMA node0 CPU(s): 0-127
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf:  Not affected
  Mds:   Not affected
  Meltdown:  Not affected
  Mmio stale data:   Not affected
  Retbleed:  Mitigation; untrained return thunk; SMT enabled with 
STIBP protection
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:Mitigation; usercopy/swapgs barriers and __user 
pointer sanitization
  Spectre v2:Mitigation; Retpolines, IBPB conditional, STIBP 
always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds: Not affected
  Tsx async abort:   Not affected

HOST $ numactl -H
available: 1 nodes (0)
node 0 cpus: 0-127
node 0 size: 257542 MB
node 0 free: 224855 MB
node distances:
node   0
  0:  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software

HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun  7 22:17:47 UTC 2023 x86_64 
x86_64 x86_64 GNU/Linux

HOST $ cat /proc/swaps
Filename        Type        Size       Used  Priority
/dev/nvme0n1p2  partition   466838356  0     -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000f

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

Procedure
=
HOST $ git clone https://github.com/google/multichase

HOST $ 
HOST $ 

HOST $ cp multichase/multichase ./initrd/bin/
HOST $ sed -i \
"/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \
./initrd/init

HOST $ 

HOST $ cat run_microvms.sh
memcgs=64

run() {
path=/sys/fs/cgroup/memcg$1

mkdir $path
echo $BASHPID >$path/cgroup.procs

qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
-nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
-append "console=ttyS0 loglevel=0"
}

for ((memcg = 0; memcg < $memcgs; memcg++)); do
run $memcg &
done

wait

Results
===
                Before [1]    After    Change
----------------------------------------------
Total samples   6824          7237     +6%

Notes
=
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


kvm/arm64: Spark benchmark

2023-06-08 Thread Yu Zhao
TLDR

Apache Spark spent 12% less time sorting four billion random integers twenty 
times (in ~4 hours) after this patchset [1].

Hardware

HOST $ lscpu
Architecture:   aarch64
  CPU op-mode(s):   32-bit, 64-bit
  Byte Order:   Little Endian
CPU(s): 128
  On-line CPU(s) list:  0-127
Vendor ID:  ARM
  Model name:   Neoverse-N1
Model:  1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s):  2
Stepping:   r3p1
Frequency boost:disabled
CPU max MHz:2800.
CPU min MHz:1000.
BogoMIPS:   50.00
Flags:  fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp 
asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d:  8 MiB (128 instances)
  L1i:  8 MiB (128 instances)
  L2:   128 MiB (128 instances)
NUMA:
  NUMA node(s): 2
  NUMA node0 CPU(s):0-63
  NUMA node1 CPU(s):64-127
Vulnerabilities:
  Itlb multihit:Not affected
  L1tf: Not affected
  Mds:  Not affected
  Meltdown: Not affected
  Mmio stale data:  Not affected
  Retbleed: Not affected
  Spec store bypass:Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:   Mitigation; __user pointer sanitization
  Spectre v2:   Mitigation; CSV2, BHB
  Srbds:Not affected
  Tsx async abort:  Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-63
node 0 size: 257730 MB
node 0 free: 1447 MB
node 1 cpus: 64-127
node 1 size: 256877 MB
node 1 free: 256093 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software

HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

HOST $ uname -a
Linux arm 6.4.0-rc4 #1 SMP Sat Jun  3 05:30:06 UTC 2023 aarch64 aarch64 aarch64 
GNU/Linux

HOST $ cat /proc/swaps
Filename        Type        Size       Used       Priority
/dev/nvme0n1p2  partition   466838356  116922112  -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x000b

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-aarch64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

GUEST $ java --version
openjdk 17.0.7 2023-04-18
OpenJDK Runtime Environment (build 17.0.7+7-Ubuntu-0ubuntu122.04.2)
OpenJDK 64-Bit Server VM (build 17.0.7+7-Ubuntu-0ubuntu122.04.2, mixed mode, 
sharing)

GUEST $ spark-shell --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
  /_/

Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 17.0.7
Branch HEAD
Compiled by user xinrong.meng on 2023-04-07T02:18:01Z
Revision 87a5442f7ed96b11051d8a9333476d080054e5a0
Url https://github.com/apache/spark
Type --help for more information.

Procedure
=
HOST $ sudo numactl -N 0 -m 0 qemu-system-aarch64 \
-M virt,accel=kvm -cpu host -smp 64 -m 300g -nographic -nic user \
-bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ cat gen.scala
import java.io._
import scala.collection.mutable.ArrayBuffer

object GenData {
def main(args: Array[String]): Unit = {
val file = new File("/dev/shm/dataset.txt")
val writer = new BufferedWriter(new FileWriter(file))
val buf = ArrayBuffer(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)
for(_ <- 0 until 4) {
for (i <- 0 until 10) {
buf.update(i, scala.util.Random.nextLong())
}
writer.write(s"${buf.mkString(",")}\n")
}
writer.close()
}
}
GenData.main(Array())

GUEST $ cat sort.scala
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.SparkSession

object SparkSort {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().getOrCreate()
val file = sc.textFile("/dev/shm/dataset.txt", 64)
val results = file.flatMap(_.split(",")).map(x => (x, 
1)).sortByKey().takeOrdered(10)
results.foreach(println)
spark.stop()
}
}
SparkSort.main(Array())

GUEST $ cat run_spark.sh
export SPARK_LOCAL_DIRS=/dev/shm/

spark-shell

Notes
=
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


kvm/powerpc: memcached benchmark

2023-06-08 Thread Yu Zhao
TLDR

Memcached achieved 10% more operations per second (in ~4 hours) after this 
patchset [1].

Hardware

HOST $ lscpu
Architecture:  ppc64le
  Byte Order:  Little Endian
CPU(s):184
  On-line CPU(s) list: 0-183
Model name:POWER9 (raw), altivec supported
  Model:   2.2 (pvr 004e 1202)
  Thread(s) per core:  4
  Core(s) per socket:  23
  Socket(s):   2
  CPU max MHz: 3000.
  CPU min MHz: 2300.
Caches (sum of all):
  L1d: 1.4 MiB (46 instances)
  L1i: 1.4 MiB (46 instances)
  L2:  12 MiB (24 instances)
  L3:  240 MiB (24 instances)
NUMA:
  NUMA node(s):2
  NUMA node0 CPU(s):   0-91
  NUMA node1 CPU(s):   92-183
Vulnerabilities:
  Itlb multihit:   Not affected
  L1tf:Mitigation; RFI Flush, L1D private per thread
  Mds: Not affected
  Meltdown:Mitigation; RFI Flush, L1D private per thread
  Mmio stale data: Not affected
  Retbleed:Not affected
  Spec store bypass:   Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:  Mitigation; __user pointer sanitization, ori31 
speculation barrier enabled
  Spectre v2:  Mitigation; Indirect branch serialisation (kernel only), 
Indirect branch cache disabled, Software link stack flush
  Srbds:   Not affected
  Tsx async abort: Not affected

HOST $ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0-91
node 0 size: 261659 MB
node 0 free: 259152 MB
node 1 cpus: 92-183
node 1 size: 261713 MB
node 1 free: 261076 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

HOST $ cat /sys/class/nvme/nvme0/model
INTEL SSDPF21Q800GB

HOST $ cat /sys/class/nvme/nvme0/numa_node
0

Software

HOST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

HOST $ uname -a
Linux ppc 6.3.0 #1 SMP Sun Jun  4 18:26:37 UTC 2023 ppc64le ppc64le ppc64le 
GNU/Linux

HOST $ cat /proc/swaps
Filename        Type        Size       Used  Priority
/dev/nvme0n1p2  partition   466838272  0     -2

HOST $ cat /sys/kernel/mm/lru_gen/enabled
0x0009

HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
always defer defer+madvise madvise [never]

HOST $ qemu-system-ppc64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.6)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

GUEST $ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"

GUEST $ cat /etc/memcached.conf
...
-t 92
-m 262144
-B binary
-s /var/run/memcached/memcached.sock
-a 0766

GUEST $ memtier_benchmark -v
memtier_benchmark 1.4.0
Copyright (C) 2011-2022 Redis Ltd.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License .
There is NO WARRANTY, to the extent permitted by law.

Procedure
=
HOST $ sudo numactl -N 0 -m 0 qemu-system-ppc64 \
-M pseries,accel=kvm,kvm-type=HV -cpu host -smp 92 -m 270g \
-nographic -nic user \
-drive if=virtio,format=raw,file=/dev/nvme0n1p1

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 1:0 \
--key-minimum=1 --key-maximum=12000 --key-pattern=P:P \
-n allkeys -d 2000

GUEST $ memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -c 1 -t 92 --pipeline 1 --ratio 0:1 \
--key-minimum=1 --key-maximum=12000 --key-pattern=R:R \
-n allkeys --randomize --distinct-client-seed

Results
===
                Before [1]    After        Change
--------------------------------------------------
Ops/sec         721586.10     800210.12    +10%
Avg. Latency    0.12546       0.11260      -10%
p50 Latency     0.08700       0.08700      N/C
p99 Latency     0.28700       0.24700      -13%

Notes
=
[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


[PATCH] powerpc/legacy_serial: check CONFIG_SERIAL_8250_CONSOLE

2023-06-08 Thread Randy Dunlap
When SERIAL_8250_CONSOLE is not set but PPC_UDBG_16550=y,
the legacy_serial code references fsl8250_handle_irq, which is
only built when SERIAL_8250_CONSOLE is set.

Be consistent in referencing the used CONFIG_SERIAL_8250*
symbols so that the build errors do not happen.

Prevents these build errors:

powerpc-linux-ld: arch/powerpc/kernel/legacy_serial.o: in function 
`serial_dev_init':
legacy_serial.c:(.init.text+0x2aa): undefined reference to `fsl8250_handle_irq'
powerpc-linux-ld: legacy_serial.c:(.init.text+0x2b2): undefined reference to 
`fsl8250_handle_irq'
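
The IS_ENABLED()/IS_REACHABLE() distinction being relied on, as a rough
sketch (CONFIG_FOO and foo_helper() are made-up names):

	if (IS_ENABLED(CONFIG_FOO)) {		/* FOO is y or m */
		if (IS_REACHABLE(CONFIG_FOO)) {	/* FOO's symbols are linkable from here */
			port->handle_irq = foo_helper;
		} else {
			/* FOO=m while this code is built-in: don't reference the symbol */
		}
	}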

Fixes: 66eff0ef528b ("powerpc/legacy_serial: Warn about 8250 devices operated 
without active FSL workarounds")
Signed-off-by: Randy Dunlap 
Cc: Uwe Kleine-König 
Cc: Greg Kroah-Hartman 
Cc: linux-ser...@vger.kernel.org
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/legacy_serial.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -- a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -508,9 +508,9 @@ static void __init fixup_port_irq(int in
 
port->irq = virq;
 
-   if (IS_ENABLED(CONFIG_SERIAL_8250) &&
+   if (IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE) &&
of_device_is_compatible(np, "fsl,ns16550")) {
-   if (IS_REACHABLE(CONFIG_SERIAL_8250)) {
+   if (IS_REACHABLE(CONFIG_SERIAL_8250_CONSOLE)) {
port->handle_irq = fsl8250_handle_irq;
port->has_sysrq = 
IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE);
} else {


Re: [PATCH v5 25/26] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-06-08 Thread kernel test robot
Hi Terry,

kernel test robot noticed the following build errors:

[auto build test ERROR on a70fc4ed20a6118837b0aecbbf789074935f473b]

url:
https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/cxl-acpi-Probe-RCRB-later-during-RCH-downstream-port-creation/20230608-062818
base:   a70fc4ed20a6118837b0aecbbf789074935f473b
patch link:
https://lore.kernel.org/r/20230607221651.2454764-26-terry.bowman%40amd.com
patch subject: [PATCH v5 25/26] PCI/AER: Forward RCH downstream port-detected 
errors to the CXL.mem dev handler
config: x86_64-randconfig-r005-20230607 
(https://download.01.org/0day-ci/archive/20230609/202306090637.9e2ezbr4-...@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 
8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build):
mkdir -p ~/bin
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout a70fc4ed20a6118837b0aecbbf789074935f473b
b4 shazam 
https://lore.kernel.org/r/20230607221651.2454764-26-terry.bow...@amd.com
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 
O=build_dir ARCH=x86_64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 
O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202306090637.9e2ezbr4-...@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: module cxl_core uses symbol pci_print_aer from namespace 
>> CXL, but does not import it.

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


[PATCH v5 3/3] block: sed-opal: keyring support for SED keys

2023-06-08 Thread gjoyce
From: Greg Joyce 

Extend the SED block driver so it can alternatively
obtain a key from a sed-opal kernel keyring. The SED
ioctls will indicate the source of the key, either
directly in the ioctl data or from the keyring.

This allows the use of SED commands in scripts such as
udev scripts so that drives may be automatically unlocked
as they become available.
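
As an example, a minimal userspace sketch of the keyring path (assumes
the key has already been loaded into the sed-opal keyring; fd is the
opened block device):

	struct opal_lock_unlock lk = { 0 };

	lk.session.who = OPAL_ADMIN1;
	lk.session.opal_key.lr = 0;			/* global locking range */
	lk.session.opal_key.key_type = OPAL_KEYRING;	/* key comes from the keyring */
	lk.l_state = OPAL_RW;

	if (ioctl(fd, IOC_OPAL_LOCK_UNLOCK, &lk) < 0)
		perror("IOC_OPAL_LOCK_UNLOCK");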

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 
---
 block/Kconfig |   2 +
 block/sed-opal.c  | 174 +-
 include/linux/sed-opal.h  |   3 +
 include/uapi/linux/sed-opal.h |   8 +-
 4 files changed, 184 insertions(+), 3 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 86122e459fe0..77f72175eb72 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -183,6 +183,8 @@ config BLK_DEBUG_FS_ZONED
 
 config BLK_SED_OPAL
bool "Logic for interfacing with Opal enabled SEDs"
+   depends on KEYS
+   select PSERIES_PLPKS if PPC_PSERIES
help
Builds Logic for interfacing with Opal enabled controllers.
Enabling this option enables users to setup/unlock/lock
diff --git a/block/sed-opal.c b/block/sed-opal.c
index e2aed7f4ebdf..6d7f25d1711b 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include "opal_proto.h"
 
@@ -29,6 +32,8 @@
 /* Number of bytes needed by cmd_finalize. */
 #define CMD_FINALIZE_BYTES_NEEDED 7
 
+static struct key *sed_opal_keyring;
+
 struct opal_step {
int (*fn)(struct opal_dev *dev, void *data);
void *data;
@@ -269,6 +274,101 @@ static void print_buffer(const u8 *ptr, u32 length)
 #endif
 }
 
+/*
+ * Allocate/update a SED Opal key and add it to the SED Opal keyring.
+ */
+static int update_sed_opal_key(const char *desc, u_char *key_data, int keylen)
+{
+   key_ref_t kr;
+
+   if (!sed_opal_keyring)
+   return -ENOKEY;
+
+   kr = key_create_or_update(make_key_ref(sed_opal_keyring, true), "user",
+ desc, (const void *)key_data, keylen,
+ KEY_USR_VIEW | KEY_USR_SEARCH | KEY_USR_WRITE,
+ KEY_ALLOC_NOT_IN_QUOTA | KEY_ALLOC_BUILT_IN |
+   KEY_ALLOC_BYPASS_RESTRICTION);
+   if (IS_ERR(kr)) {
+   pr_err("Error adding SED key (%ld)\n", PTR_ERR(kr));
+   return PTR_ERR(kr);
+   }
+
+   return 0;
+}
+
+/*
+ * Read a SED Opal key from the SED Opal keyring.
+ */
+static int read_sed_opal_key(const char *key_name, u_char *buffer, int buflen)
+{
+   int ret;
+   key_ref_t kref;
+   struct key *key;
+
+   if (!sed_opal_keyring)
+   return -ENOKEY;
+
+   kref = keyring_search(make_key_ref(sed_opal_keyring, true),
+ &key_type_user, key_name, true);
+
+   if (IS_ERR(kref))
+   ret = PTR_ERR(kref);
+
+   key = key_ref_to_ptr(kref);
+   down_read(&key->sem);
+   ret = key_validate(key);
+   if (ret == 0) {
+   if (buflen > key->datalen)
+   buflen = key->datalen;
+
+   ret = key->type->read(key, (char *)buffer, buflen);
+   }
+   up_read(&key->sem);
+
+   key_ref_put(kref);
+
+   return ret;
+}
+
+static int opal_get_key(struct opal_dev *dev, struct opal_key *key)
+{
+   int ret = 0;
+
+   switch (key->key_type) {
+   case OPAL_INCLUDED:
+   /* the key is ready to use */
+   break;
+   case OPAL_KEYRING:
+   /* the key is in the keyring */
+   ret = read_sed_opal_key(OPAL_AUTH_KEY, key->key, OPAL_KEY_MAX);
+   if (ret > 0) {
+   if (ret > U8_MAX) {
+   ret = -ENOSPC;
+   goto error;
+   }
+   key->key_len = ret;
+   key->key_type = OPAL_INCLUDED;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   break;
+   }
+   if (ret < 0)
+   goto error;
+
+   /* must have a PEK by now or it's an error */
+   if (key->key_type != OPAL_INCLUDED || key->key_len == 0) {
+   ret = -EINVAL;
+   goto error;
+   }
+   return 0;
+error:
+   pr_debug("Error getting password: %d\n", ret);
+   return ret;
+}
+
 static bool check_tper(const void *data)
 {
const struct d0_tper_features *tper = data;
@@ -2459,6 +2559,9 @@ static int opal_secure_erase_locking_range(struct 
opal_dev *dev,
};
int ret;
 
+   ret = opal_get_key(dev, &opal_session->opal_key);
+   if (ret)
+   return ret;
mutex_lock(&dev->dev_lock);
setup_opal_dev(dev);
ret = execute_steps(dev, erase_steps, ARRAY_SIZE(erase_steps));
@@ -2492,6 +2595,9 @@ static int opal_revertlsp(struct opal_dev 

[PATCH v2 23/23] xtensa: add pte_unmap() to balance pte_offset_map()

2023-06-08 Thread Hugh Dickins
To keep balance in future, remember to pte_unmap() after a successful
pte_offset_map().  And act as if get_pte_for_vaddr() really needs a map
there, to read the pteval before "unmapping", to be sure page table is
not removed.

Signed-off-by: Hugh Dickins 
---
 arch/xtensa/mm/tlb.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 27a477dae232..0a11fc5f185b 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -179,6 +179,7 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+   unsigned int pteval;
 
if (!mm)
mm = task->active_mm;
@@ -197,7 +198,9 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pte = pte_offset_map(pmd, vaddr);
if (!pte)
return 0;
-   return pte_val(*pte);
+   pteval = pte_val(*pte);
+   pte_unmap(pte);
+   return pteval;
 }
 
 enum {
-- 
2.35.3



[PATCH v2 22/23] x86: sme_populate_pgd() use pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
sme_populate_pgd() is an __init function for sme_encrypt_kernel():
it should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins 
---
 arch/x86/mm/mem_encrypt_identity.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/mem_encrypt_identity.c 
b/arch/x86/mm/mem_encrypt_identity.c
index c6efcf559d88..a1ab542bdfd6 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -188,7 +188,7 @@ static void __init sme_populate_pgd(struct 
sme_populate_pgd_data *ppd)
if (pmd_large(*pmd))
return;
 
-   pte = pte_offset_map(pmd, ppd->vaddr);
+   pte = pte_offset_kernel(pmd, ppd->vaddr);
if (pte_none(*pte))
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
 }
-- 
2.35.3



[PATCH v2 21/23] x86: Allow get_locked_pte() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
---
 arch/x86/kernel/ldt.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 525876e7b9f4..adc67f98819a 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -367,8 +367,10 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct 
ldt_struct *ldt)
 
va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
ptep = get_locked_pte(mm, va, &ptl);
-   pte_clear(mm, va, ptep);
-   pte_unmap_unlock(ptep, ptl);
+   if (!WARN_ON_ONCE(!ptep)) {
+   pte_clear(mm, va, ptep);
+   pte_unmap_unlock(ptep, ptl);
+   }
}
 
va = (unsigned long)ldt_slot_va(ldt->slot);
-- 
2.35.3



[PATCH v5 0/3] sed-opal: keyrings, discovery, revert, key store

2023-06-08 Thread gjoyce
From: Greg Joyce 

Patchset rebased to for-6.5/block

This patchset has gone through numerous rounds of review and
all comments/suggestions have been addressed. I believe that
this patchset is ready for inclusion.

TCG SED Opal is a specification from The Trusted Computing Group
that allows self encrypting storage devices (SED) to be locked at
power on and require an authentication key to unlock the drive.

The current SED Opal implementation in the block driver
requires that authentication keys be provided in an ioctl
so that they can be presented to the underlying SED
capable drive. Currently, the key is typically entered by
a user with an application like sedutil or sedcli. While
this process works, it does not lend itself to automation
like unlock by a udev rule.

The SED block driver has been extended so it can alternatively
obtain a key from a sed-opal kernel keyring. The SED ioctls
will indicate the source of the key, either directly in the
ioctl data or from the keyring.

Two new SED ioctls have also been added. These are:
  1) IOC_OPAL_REVERT_LSP to revert LSP state
  2) IOC_OPAL_DISCOVERY to discover drive capabilities/state

change log v5:
- rebase to for-6.5/block

change log v4:
- rebase to 6.3-rc7
- replaced "255" magic number with U8_MAX

change log:
- rebase to 6.x
- added latest reviews
- removed platform functions for persistent key storage
- replaced key update logic with key_create_or_update()
- minor bracing and padding changes
- add error returns
- opal_key structure is application provided but kernel
  verified
- added brief description of TCG SED Opal


Greg Joyce (3):
  block: sed-opal: Implement IOC_OPAL_DISCOVERY
  block: sed-opal: Implement IOC_OPAL_REVERT_LSP
  block: sed-opal: keyring support for SED keys

 block/Kconfig |   2 +
 block/opal_proto.h|   4 +
 block/sed-opal.c  | 252 +-
 include/linux/sed-opal.h  |   5 +
 include/uapi/linux/sed-opal.h |  25 +++-
 5 files changed, 282 insertions(+), 6 deletions(-)


base-commit: 1341c7d2ccf42ed91aea80b8579d35bc1ea381e2
-- 
gjo...@linux.vnet.ibm.com



[PATCH v2 20/23] sparc: iounit and iommu use pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
iounit_alloc() and sbus_iommu_alloc() are working from pmd_off_k(),
so should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins 
---
 arch/sparc/mm/io-unit.c | 2 +-
 arch/sparc/mm/iommu.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/io-unit.c b/arch/sparc/mm/io-unit.c
index bf3e6d2fe5d9..133dd42570d6 100644
--- a/arch/sparc/mm/io-unit.c
+++ b/arch/sparc/mm/io-unit.c
@@ -244,7 +244,7 @@ static void *iounit_alloc(struct device *dev, size_t len,
long i;
 
pmdp = pmd_off_k(addr);
-   ptep = pte_offset_map(pmdp, addr);
+   ptep = pte_offset_kernel(pmdp, addr);
 
set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
 
diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c
index 9e3f6933ca13..3a6caef68348 100644
--- a/arch/sparc/mm/iommu.c
+++ b/arch/sparc/mm/iommu.c
@@ -358,7 +358,7 @@ static void *sbus_iommu_alloc(struct device *dev, size_t 
len,
__flush_page_to_ram(page);
 
pmdp = pmd_off_k(addr);
-   ptep = pte_offset_map(pmdp, addr);
+   ptep = pte_offset_kernel(pmdp, addr);
 
set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
}
-- 
2.35.3



[PATCH v2 19/23] sparc: allow pte_offset_map() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
---
 arch/sparc/kernel/signal32.c | 2 ++
 arch/sparc/mm/fault_64.c | 3 +++
 arch/sparc/mm/tlb.c  | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..ca450c7bc53f 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -328,6 +328,8 @@ static void flush_signal_insns(unsigned long address)
goto out_irqs_on;
 
ptep = pte_offset_map(pmdp, address);
+   if (!ptep)
+   goto out_irqs_on;
pte = *ptep;
if (!pte_present(pte))
goto out_unmap;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index d91305de694c..d8a407fbe350 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -99,6 +99,7 @@ static unsigned int get_user_insn(unsigned long tpc)
local_irq_disable();
 
pmdp = pmd_offset(pudp, tpc);
+again:
if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
goto out_irq_enable;
 
@@ -115,6 +116,8 @@ static unsigned int get_user_insn(unsigned long tpc)
 #endif
{
ptep = pte_offset_map(pmdp, tpc);
+   if (!ptep)
+   goto again;
pte = *ptep;
if (pte_present(pte)) {
pa  = (pte_pfn(pte) << PAGE_SHIFT);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..7ecf8556947a 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -149,6 +149,8 @@ static void tlb_batch_pmd_scan(struct mm_struct *mm, 
unsigned long vaddr,
pte_t *pte;
 
pte = pte_offset_map(&pmd, vaddr);
+   if (!pte)
+   return;
end = vaddr + HPAGE_SIZE;
while (vaddr < end) {
if (pte_val(*pte) & _PAGE_VALID) {
-- 
2.35.3



[PATCH v2 18/23] sparc/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
---
 arch/sparc/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index d8e0e3c7038d..d7018823206c 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
if (sz >= PMD_SIZE)
return (pte_t *)pmd;
-   return pte_alloc_map(mm, pmd, addr);
+   return pte_alloc_huge(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -325,7 +325,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return NULL;
if (is_hugetlb_pmd(*pmd))
return (pte_t *)pmd;
-   return pte_offset_map(pmd, addr);
+   return pte_offset_huge(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.35.3



[PATCH v2 17/23] sh/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
---
 arch/sh/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 999ab5916e69..6cb0ad73dbb9 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
-   pte = pte_alloc_map(mm, pmd, addr);
+   pte = pte_alloc_huge(mm, pmd, addr);
}
}
}
@@ -63,7 +63,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (pud) {
pmd = pmd_offset(pud, addr);
if (pmd)
-   pte = pte_offset_map(pmd, addr);
+   pte = pte_offset_huge(pmd, addr);
}
}
}
-- 
2.35.3



[PATCH v7 3/3] powerpc/pseries: PLPKS SED Opal keystore support

2023-06-08 Thread gjoyce
From: Greg Joyce 

Define operations for SED Opal to read/write keys
from POWER LPAR Platform KeyStore(PLPKS). This allows
non-volatile storage of SED Opal keys.
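
A rough sketch of a kernel-side consumer (the key name matches
PLPKS_SED_KEY below; buffer sizing is an assumption):

	char pin[OPAL_KEY_MAX];
	u_int pinlen;

	if (sed_read_key("opal-boot-pin", pin, &pinlen) == 0) {
		/* pin[0..pinlen-1] now holds the stored credential */
	}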

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 
---
 arch/powerpc/platforms/pseries/Kconfig|   6 +
 arch/powerpc/platforms/pseries/Makefile   |   1 +
 .../powerpc/platforms/pseries/plpks_sed_ops.c | 114 ++
 block/Kconfig |   1 +
 4 files changed, 122 insertions(+)
 create mode 100644 arch/powerpc/platforms/pseries/plpks_sed_ops.c

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 4ebf2ef2845d..afc0f6a61337 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -164,6 +164,12 @@ config PSERIES_PLPKS
# This option is selected by in-kernel consumers that require
# access to the PKS.
 
+config PSERIES_PLPKS_SED
+   depends on PPC_PSERIES
+   bool
+   # This option is selected by in-kernel consumers that require
+   # access to the SED PKS keystore.
+
 config PAPR_SCM
depends on PPC_PSERIES && MEMORY_HOTPLUG && LIBNVDIMM
tristate "Support for the PAPR Storage Class Memory interface"
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 53c3b91af2f7..1476c5e4433c 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_PPC_SVM) += svm.o
 obj-$(CONFIG_FA_DUMP)  += rtas-fadump.o
 obj-$(CONFIG_PSERIES_PLPKS)+= plpks.o
 obj-$(CONFIG_PPC_SECURE_BOOT)  += plpks-secvar.o
+obj-$(CONFIG_PSERIES_PLPKS_SED)+= plpks_sed_ops.o
 obj-$(CONFIG_SUSPEND)  += suspend.o
 obj-$(CONFIG_PPC_VAS)  += vas.o vas-sysfs.o
 
diff --git a/arch/powerpc/platforms/pseries/plpks_sed_ops.c 
b/arch/powerpc/platforms/pseries/plpks_sed_ops.c
new file mode 100644
index ..c1d08075e850
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/plpks_sed_ops.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * POWER Platform specific code for non-volatile SED key access
+ * Copyright (C) 2022 IBM Corporation
+ *
+ * Define operations for SED Opal to read/write keys
+ * from POWER LPAR Platform KeyStore(PLPKS).
+ *
+ * Self Encrypting Drives(SED) key storage using PLPKS
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * structure that contains all SED data
+ */
+struct plpks_sed_object_data {
+   u_char version;
+   u_char pad1[7];
+   u_long authority;
+   u_long range;
+   u_int  key_len;
+   u_char key[32];
+};
+
+#define PLPKS_SED_OBJECT_DATA_V0   0
+#define PLPKS_SED_MANGLED_LABEL "/default/pri"
+#define PLPKS_SED_COMPONENT "sed-opal"
+#define PLPKS_SED_KEY   "opal-boot-pin"
+
+/*
+ * authority is admin1 and range is global
+ */
+#define PLPKS_SED_AUTHORITY  0x0000000900010001
+#define PLPKS_SED_RANGE  0x0000080200000001
+
+void plpks_init_var(struct plpks_var *var, char *keyname)
+{
+   var->name = keyname;
+   var->namelen = strlen(keyname);
+   if (strcmp(PLPKS_SED_KEY, keyname) == 0) {
+   var->name = PLPKS_SED_MANGLED_LABEL;
+   var->namelen = strlen(keyname);
+   }
+   var->policy = PLPKS_WORLDREADABLE;
+   var->os = PLPKS_VAR_COMMON;
+   var->data = NULL;
+   var->datalen = 0;
+   var->component = PLPKS_SED_COMPONENT;
+}
+
+/*
+ * Read the SED Opal key from PLPKS given the label
+ */
+int sed_read_key(char *keyname, char *key, u_int *keylen)
+{
+   struct plpks_var var;
+   struct plpks_sed_object_data data;
+   int ret;
+   u_int len;
+
+   plpks_init_var(&var, keyname);
+   var.data = (u8 *)&data;
+   var.datalen = sizeof(data);
+
+   ret = plpks_read_os_var(&var);
+   if (ret != 0)
+   return ret;
+
+   len = min_t(u16, be32_to_cpu(data.key_len), var.datalen);
+   memcpy(key, data.key, len);
+   key[len] = '\0';
+   *keylen = len;
+
+   return 0;
+}
+
+/*
+ * Write the SED Opal key to PLPKS given the label
+ */
+int sed_write_key(char *keyname, char *key, u_int keylen)
+{
+   struct plpks_var var;
+   struct plpks_sed_object_data data;
+   struct plpks_var_name vname;
+
+   plpks_init_var(&var, keyname);
+
+   var.datalen = sizeof(struct plpks_sed_object_data);
+   var.data = (u8 *)&data;
+
+   /* initialize SED object */
+   data.version = PLPKS_SED_OBJECT_DATA_V0;
+   data.authority = cpu_to_be64(PLPKS_SED_AUTHORITY);
+   data.range = cpu_to_be64(PLPKS_SED_RANGE);
+   memset(&data.pad1, '\0', sizeof(data.pad1));
+   data.key_len = cpu_to_be32(keylen);
+   memcpy(data.key, (char *)key, keylen);
+
+   /*
+* Key update requires remove first. The return value
+* is ignored since it's okay if the key doesn't exist.
+*/
+ 

[PATCH v7 1/3] block:sed-opal: SED Opal keystore

2023-06-08 Thread gjoyce
From: Greg Joyce 

Add read and write functions that allow SED Opal keys to be stored
in a permanent keystore.

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 
---
 block/Makefile   |  2 +-
 block/sed-opal-key.c | 24 
 include/linux/sed-opal-key.h | 15 +++
 3 files changed, 40 insertions(+), 1 deletion(-)
 create mode 100644 block/sed-opal-key.c
 create mode 100644 include/linux/sed-opal-key.h

diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..ea07d80402a6 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_BLK_DEV_ZONED)   += blk-zoned.o
 obj-$(CONFIG_BLK_WBT)  += blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
-obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o
+obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o sed-opal-key.o
 obj-$(CONFIG_BLK_PM)   += blk-pm.o
 obj-$(CONFIG_BLK_INLINE_ENCRYPTION)+= blk-crypto.o blk-crypto-profile.o \
   blk-crypto-sysfs.o
diff --git a/block/sed-opal-key.c b/block/sed-opal-key.c
new file mode 100644
index ..16f380164c44
--- /dev/null
+++ b/block/sed-opal-key.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * SED key operations.
+ *
+ * Copyright (C) 2022 IBM Corporation
+ *
+ * These are the accessor functions (read/write) for SED Opal
+ * keys. Specific keystores can provide overrides.
+ *
+ */
+
+#include 
+#include 
+#include 
+
+int __weak sed_read_key(char *keyname, char *key, u_int *keylen)
+{
+   return -EOPNOTSUPP;
+}
+
+int __weak sed_write_key(char *keyname, char *key, u_int keylen)
+{
+   return -EOPNOTSUPP;
+}
diff --git a/include/linux/sed-opal-key.h b/include/linux/sed-opal-key.h
new file mode 100644
index ..c9b1447986d8
--- /dev/null
+++ b/include/linux/sed-opal-key.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * SED key operations.
+ *
+ * Copyright (C) 2022 IBM Corporation
+ *
+ * These are the accessor functions (read/write) for SED Opal
+ * keys. Specific keystores can provide overrides.
+ *
+ */
+
+#include 
+
+int sed_read_key(char *keyname, char *key, u_int *keylen);
+int sed_write_key(char *keyname, char *key, u_int keylen);
-- 
gjo...@linux.vnet.ibm.com



[PATCH v7 0/3] generic and PowerPC SED Opal keystore

2023-06-08 Thread gjoyce
From: Greg Joyce 

Patchset rebase to for-6.5/block

This patchset has gone through numerous rounds of review and
all comments/suggestions have been addressed. I believe that
this patchset is ready for inclusion.

TCG SED Opal is a specification from The Trusted Computing Group
that allows self encrypting storage devices (SED) to be locked at
power on and require an authentication key to unlock the drive.

Generic functions have been defined for accessing SED Opal keys.
The generic functions are defined as weak so that they may be superseded
by keystore specific versions.
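
Concretely, the override is plain weak linkage; roughly (the two
definitions live in different objects, and plpks_fetch() is a made-up
helper):

	/* block/sed-opal-key.c: weak default, compiled everywhere */
	int __weak sed_read_key(char *keyname, char *key, u_int *keylen)
	{
		return -EOPNOTSUPP;
	}

	/* keystore backend: strong definition wins at link time */
	int sed_read_key(char *keyname, char *key, u_int *keylen)
	{
		return plpks_fetch(keyname, key, keylen);
	}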

PowerPC/pseries versions of these functions provide read/write access
to SED Opal keys in the PLPKS keystore.

The SED block driver has been modified to read the SED Opal
keystore to populate a key in the SED Opal keyring. Changes to the
SED Opal key will be written to the SED Opal keystore.

Patch 3 "keystore access for SED Opal keys" is dependent on:

https://lore.kernel.org/keyrings/20220818143045.680972-4-gjo...@linux.vnet.ibm.com/T/#u

Changelog
v7: - rebased to for-6.5/block

v6: - squashed two commits (suggested by Andrew Donnellan)

v5: - updated to reflect changes in PLPKS API

v4:
- scope reduced to cover just SED Opal keys
- base SED Opal keystore is now in SED block driver
- removed use of enum to indicate type
- refactored common code into common function that read and
  write use
- removed cast to void
- added use of SED Opal keystore functions to SED block driver

v3:
- No code changes, but per reviewer requests, adding additional
  mailing lists(keyring, EFI) for wider review.

v2:
- Include feedback from Gregory Joyce, Eric Richter and
  Murilo Opsfelder Araujo.
- Include suggestions from Michael Ellerman.
- Moved a dependency from generic SED code to this patchset.
  This patchset now builds on its own.



Greg Joyce (3):
  block:sed-opal: SED Opal keystore
  block: sed-opal: keystore access for SED Opal keys
  powerpc/pseries: PLPKS SED Opal keystore support

 arch/powerpc/platforms/pseries/Kconfig|   6 +
 arch/powerpc/platforms/pseries/Makefile   |   1 +
 .../powerpc/platforms/pseries/plpks_sed_ops.c | 114 ++
 block/Kconfig |   1 +
 block/Makefile|   2 +-
 block/sed-opal-key.c  |  24 
 block/sed-opal.c  |  18 ++-
 include/linux/sed-opal-key.h  |  15 +++
 8 files changed, 178 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/plpks_sed_ops.c
 create mode 100644 block/sed-opal-key.c
 create mode 100644 include/linux/sed-opal-key.h


base-commit: 1341c7d2ccf42ed91aea80b8579d35bc1ea381e2
-- 
gjo...@linux.vnet.ibm.com



[PATCH v7 2/3] block: sed-opal: keystore access for SED Opal keys

2023-06-08 Thread gjoyce
From: Greg Joyce 

Allow for permanent SED authentication keys by
reading/writing to the SED Opal non-volatile keystore.

Signed-off-by: Greg Joyce 
Reviewed-by: Jonathan Derrick 
---
 block/sed-opal.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index 6d7f25d1711b..fa23a6a60485 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3019,7 +3020,13 @@ static int opal_set_new_pw(struct opal_dev *dev, struct 
opal_new_pw *opal_pw)
if (ret)
return ret;
 
-   /* update keyring with new password */
+   /* update keyring and key store with new password */
+   ret = sed_write_key(OPAL_AUTH_KEY,
+   opal_pw->new_user_pw.opal_key.key,
+   opal_pw->new_user_pw.opal_key.key_len);
+   if (ret != -EOPNOTSUPP)
+   pr_warn("error updating SED key: %d\n", ret);
+
ret = update_sed_opal_key(OPAL_AUTH_KEY,
  opal_pw->new_user_pw.opal_key.key,
  opal_pw->new_user_pw.opal_key.key_len);
@@ -3292,6 +3299,8 @@ EXPORT_SYMBOL_GPL(sed_ioctl);
 static int __init sed_opal_init(void)
 {
struct key *kr;
+   char init_sed_key[OPAL_KEY_MAX];
+   int keylen = OPAL_KEY_MAX - 1;
 
kr = keyring_alloc(".sed_opal",
   GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, current_cred(),
@@ -3304,6 +3313,11 @@ static int __init sed_opal_init(void)
 
sed_opal_keyring = kr;
 
-   return 0;
+   if (sed_read_key(OPAL_AUTH_KEY, init_sed_key, &keylen) < 0) {
+   memset(init_sed_key, '\0', sizeof(init_sed_key));
+   keylen = OPAL_KEY_MAX - 1;
+   }
+
+   return update_sed_opal_key(OPAL_AUTH_KEY, init_sed_key, keylen);
 }
 late_initcall(sed_opal_init);
-- 
gjo...@linux.vnet.ibm.com



[PATCH v2 16/23] s390: gmap use pte_unmap_unlock() not spin_unlock()

2023-06-08 Thread Hugh Dickins
pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
and use pte_unmap_unlock() instead of direct spin_unlock() (even though
ptep ends up unused inside the macro).

Signed-off-by: Hugh Dickins 
Acked-by: Alexander Gordeev 
---
 arch/s390/mm/gmap.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3a2a31a15ea8..f4b6fc746fce 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -895,12 +895,12 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned 
long gaddr,
 
 /**
  * gmap_pte_op_end - release the page table lock
- * @ptl: pointer to the spinlock pointer
+ * @ptep: pointer to the locked pte
+ * @ptl: pointer to the page table spinlock
  */
-static void gmap_pte_op_end(spinlock_t *ptl)
+static void gmap_pte_op_end(pte_t *ptep, spinlock_t *ptl)
 {
-   if (ptl)
-   spin_unlock(ptl);
+   pte_unmap_unlock(ptep, ptl);
 }
 
 /**
@@ -1011,7 +1011,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned 
long gaddr,
 {
int rc;
pte_t *ptep;
-   spinlock_t *ptl = NULL;
+   spinlock_t *ptl;
unsigned long pbits = 0;
 
if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
@@ -1025,7 +1025,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned 
long gaddr,
pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
/* Protect and unlock. */
rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(ptep, ptl);
return rc;
 }
 
@@ -1154,7 +1154,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long 
gaddr, unsigned long *val)
/* Do *NOT* clear the _PAGE_INVALID bit! */
rc = 0;
}
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(ptep, ptl);
}
if (!rc)
break;
@@ -1248,7 +1248,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned 
long raddr,
if (!rc)
gmap_insert_rmap(sg, vmaddr, rmap);
spin_unlock(&sg->guest_table_lock);
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(ptep, ptl);
}
radix_tree_preload_end();
if (rc) {
@@ -2156,7 +2156,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long 
saddr, pte_t pte)
tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
if (!tptep) {
spin_unlock(&sg->guest_table_lock);
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(sptep, ptl);
radix_tree_preload_end();
break;
}
@@ -2167,7 +2167,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long 
saddr, pte_t pte)
rmap = NULL;
rc = 0;
}
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(sptep, ptl);
spin_unlock(&sg->guest_table_lock);
}
radix_tree_preload_end();
@@ -2495,7 +2495,7 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned 
long bitmap[4],
continue;
if (ptep_test_and_clear_uc(gmap->mm, vmaddr, ptep))
set_bit(i, bitmap);
-   spin_unlock(ptl);
+   pte_unmap_unlock(ptep, ptl);
}
}
gmap_pmd_op_end(gmap, pmdp);
-- 
2.35.3



[PATCH v2 15/23] s390: allow pte_offset_map_lock() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.

Signed-off-by: Hugh Dickins 
---
 arch/s390/kernel/uv.c  |  2 ++
 arch/s390/mm/gmap.c|  9 -
 arch/s390/mm/pgtable.c | 12 +---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index cb2ee06df286..3c62d1b218b1 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -294,6 +294,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long 
gaddr, void *uvcb)
 
rc = -ENXIO;
ptep = get_locked_pte(gmap->mm, uaddr, &ptl);
+   if (!ptep)
+   goto out;
if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && 
pte_write(*ptep)) {
page = pte_page(*ptep);
rc = -EAGAIN;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index dc90d1eb0d55..3a2a31a15ea8 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2537,7 +2537,12 @@ static inline void thp_split_mm(struct mm_struct *mm)
  * Remove all empty zero pages from the mapping for lazy refaulting
  * - This must be called after mm->context.has_pgste is set, to avoid
  *   future creation of zero pages
- * - This must be called after THP was enabled
+ * - This must be called after THP was disabled.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * racing with the loop below and so causing pte_offset_map_lock() to fail,
+ * it will never insert a page table containing empty zero pages once
+ * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
  */
 static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
   unsigned long end, struct mm_walk *walk)
@@ -2549,6 +2554,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long 
start,
spinlock_t *ptl;
 
ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+   if (!ptep)
+   break;
if (is_zero_pfn(pte_pfn(*ptep)))
ptep_xchg_direct(walk->mm, addr, ptep, 
__pte(_PAGE_INVALID));
pte_unmap_unlock(ptep, ptl);
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6effb24de6d9..3bd2ab2a9a34 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -829,7 +829,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -850,6 +850,8 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
spin_unlock(ptl);
 
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   if (!ptep)
+   goto again;
new = old = pgste_get_lock(ptep);
pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
PGSTE_ACC_BITS | PGSTE_FP_BIT);
@@ -938,7 +940,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, 
unsigned long addr)
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -955,6 +957,8 @@ int reset_guest_reference_bit(struct mm_struct *mm, 
unsigned long addr)
spin_unlock(ptl);
 
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   if (!ptep)
+   goto again;
new = old = pgste_get_lock(ptep);
/* Reset guest reference bit only */
pgste_val(new) &= ~PGSTE_GR_BIT;
@@ -1000,7 +1004,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -1017,6 +1021,8 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
spin_unlock(ptl);
 
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   if (!ptep)
+   goto again;
pgste = pgste_get_lock(ptep);
*key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
paddr = pte_val(*ptep) & PAGE_MASK;
-- 
2.35.3



[PATCH v5 1/3] block: sed-opal: Implement IOC_OPAL_DISCOVERY

2023-06-08 Thread gjoyce
From: Greg Joyce 

Add IOC_OPAL_DISCOVERY ioctl to return raw discovery data to a SED Opal
application. This allows the application to display drive capabilities
and state.
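
A minimal userspace sketch (the buffer size is an arbitrary choice;
fd is the opened block device):

	unsigned char buf[4096];
	struct opal_discovery discv = {
		.data = (__u64)(uintptr_t)buf,
		.size = sizeof(buf),
	};
	int len = ioctl(fd, IOC_OPAL_DISCOVERY, &discv);
	/* on success, the return value is the actual discovery data length */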

Signed-off-by: Greg Joyce 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jonathan Derrick 
---
 block/sed-opal.c  | 38 ---
 include/linux/sed-opal.h  |  1 +
 include/uapi/linux/sed-opal.h |  6 ++
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/block/sed-opal.c b/block/sed-opal.c
index c18339446ef3..67c6c4f2b4b0 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -463,8 +463,11 @@ static int execute_steps(struct opal_dev *dev,
return error;
 }
 
-static int opal_discovery0_end(struct opal_dev *dev)
+static int opal_discovery0_end(struct opal_dev *dev, void *data)
 {
+   struct opal_discovery *discv_out = data; /* may be NULL */
+   u8 __user *buf_out;
+   u64 len_out;
bool found_com_id = false, supported = true, single_user = false;
const struct d0_header *hdr = (struct d0_header *)dev->resp;
const u8 *epos = dev->resp, *cpos = dev->resp;
@@ -480,6 +483,15 @@ static int opal_discovery0_end(struct opal_dev *dev)
return -EFAULT;
}
 
+   if (discv_out) {
+   buf_out = (u8 __user *)(uintptr_t)discv_out->data;
+   len_out = min_t(u64, discv_out->size, hlen);
+   if (buf_out && copy_to_user(buf_out, dev->resp, len_out))
+   return -EFAULT;
+
+   discv_out->size = hlen; /* actual size of data */
+   }
+
epos += hlen; /* end of buffer */
cpos += sizeof(*hdr); /* current position on buffer */
 
@@ -565,13 +577,13 @@ static int opal_discovery0(struct opal_dev *dev, void 
*data)
if (ret)
return ret;
 
-   return opal_discovery0_end(dev);
+   return opal_discovery0_end(dev, data);
 }
 
 static int opal_discovery0_step(struct opal_dev *dev)
 {
const struct opal_step discovery0_step = {
-   opal_discovery0,
+   opal_discovery0, NULL
};
 
return execute_step(dev, &discovery0_step, 0);
@@ -2435,6 +2447,22 @@ static int opal_secure_erase_locking_range(struct 
opal_dev *dev,
return ret;
 }
 
+static int opal_get_discv(struct opal_dev *dev, struct opal_discovery *discv)
+{
+   const struct opal_step discovery0_step = {
+   opal_discovery0, discv
+   };
+   int ret = 0;
+
+   mutex_lock(&dev->dev_lock);
+   setup_opal_dev(dev);
+   ret = execute_step(dev, &discovery0_step, 0);
+   mutex_unlock(&dev->dev_lock);
+   if (ret)
+   return ret;
+   return discv->size; /* modified to actual length of data */
+}
+
 static int opal_erase_locking_range(struct opal_dev *dev,
struct opal_session_info *opal_session)
 {
@@ -3056,6 +3084,10 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, 
void __user *arg)
case IOC_OPAL_GET_GEOMETRY:
ret = opal_get_geometry(dev, arg);
break;
+   case IOC_OPAL_DISCOVERY:
+   ret = opal_get_discv(dev, p);
+   break;
+
default:
break;
}
diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h
index bbae1e52ab4f..ef65f589fbeb 100644
--- a/include/linux/sed-opal.h
+++ b/include/linux/sed-opal.h
@@ -47,6 +47,7 @@ static inline bool is_sed_ioctl(unsigned int cmd)
case IOC_OPAL_GET_STATUS:
case IOC_OPAL_GET_LR_STATUS:
case IOC_OPAL_GET_GEOMETRY:
+   case IOC_OPAL_DISCOVERY:
return true;
}
return false;
diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h
index dc2efd345133..7f5732c5bdc5 100644
--- a/include/uapi/linux/sed-opal.h
+++ b/include/uapi/linux/sed-opal.h
@@ -173,6 +173,11 @@ struct opal_geometry {
__u8  __align[3];
 };
 
+struct opal_discovery {
+   __u64 data;
+   __u64 size;
+};
+
 #define IOC_OPAL_SAVE  _IOW('p', 220, struct opal_lock_unlock)
 #define IOC_OPAL_LOCK_UNLOCK   _IOW('p', 221, struct opal_lock_unlock)
 #define IOC_OPAL_TAKE_OWNERSHIP_IOW('p', 222, struct opal_key)
@@ -192,5 +197,6 @@ struct opal_geometry {
 #define IOC_OPAL_GET_STATUS _IOR('p', 236, struct opal_status)
 #define IOC_OPAL_GET_LR_STATUS  _IOW('p', 237, struct opal_lr_status)
 #define IOC_OPAL_GET_GEOMETRY   _IOR('p', 238, struct opal_geometry)
+#define IOC_OPAL_DISCOVERY  _IOW('p', 239, struct opal_discovery)
 
 #endif /* _UAPI_SED_OPAL_H */
-- 
gjo...@linux.vnet.ibm.com



[PATCH v5 2/3] block: sed-opal: Implement IOC_OPAL_REVERT_LSP

2023-06-08 Thread gjoyce
From: Greg Joyce 

This is used in conjunction with IOC_OPAL_REVERT_TPR to return a drive to
Original Factory State without erasing the data. If IOC_OPAL_REVERT_LSP
is called with opal_revert_lsp.options bit OPAL_PRESERVE set prior
to calling IOC_OPAL_REVERT_TPR, the drive global locking range will not
be erased.
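
A rough sketch of the intended call from userspace (the file
descriptor, PIN buffer and error handling are assumptions for
illustration only):

	struct opal_revert_lsp rev = { 0 };

	rev.key.key_len = pin_len;		/* Admin1 PIN for the LSP */
	memcpy(rev.key.key, pin, pin_len);
	rev.options = OPAL_PRESERVE;		/* keep global range data */
	if (ioctl(fd, IOC_OPAL_REVERT_LSP, &rev) < 0)
		/* revert failed */;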

Signed-off-by: Greg Joyce 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jonathan Derrick 
---
 block/opal_proto.h|  4 
 block/sed-opal.c  | 40 +++
 include/linux/sed-opal.h  |  1 +
 include/uapi/linux/sed-opal.h | 11 ++
 4 files changed, 56 insertions(+)

diff --git a/block/opal_proto.h b/block/opal_proto.h
index a4e56845dd82..dec7ce3a3edb 100644
--- a/block/opal_proto.h
+++ b/block/opal_proto.h
@@ -225,6 +225,10 @@ enum opal_parameter {
OPAL_SUM_SET_LIST = 0x06,
 };
 
+enum opal_revertlsp {
+   OPAL_KEEP_GLOBAL_RANGE_KEY = 0x06,
+};
+
 /* Packets derived from:
  * TCG_Storage_Architecture_Core_Spec_v2.01_r1.00
  * Secion: 3.2.3 ComPackets, Packets & Subpackets
diff --git a/block/sed-opal.c b/block/sed-opal.c
index 67c6c4f2b4b0..e2aed7f4ebdf 100644
--- a/block/sed-opal.c
+++ b/block/sed-opal.c
@@ -1769,6 +1769,26 @@ static int internal_activate_user(struct opal_dev *dev, 
void *data)
return finalize_and_send(dev, parse_and_check_status);
 }
 
+static int revert_lsp(struct opal_dev *dev, void *data)
+{
+   struct opal_revert_lsp *rev = data;
+   int err;
+
+   err = cmd_start(dev, opaluid[OPAL_THISSP_UID],
+   opalmethod[OPAL_REVERTSP]);
+   add_token_u8(&err, dev, OPAL_STARTNAME);
+   add_token_u64(&err, dev, OPAL_KEEP_GLOBAL_RANGE_KEY);
+   add_token_u8(&err, dev, (rev->options & OPAL_PRESERVE) ?
+   OPAL_TRUE : OPAL_FALSE);
+   add_token_u8(&err, dev, OPAL_ENDNAME);
+   if (err) {
+   pr_debug("Error building REVERT SP command.\n");
+   return err;
+   }
+
+   return finalize_and_send(dev, parse_and_check_status);
+}
+
 static int erase_locking_range(struct opal_dev *dev, void *data)
 {
struct opal_session_info *session = data;
@@ -2463,6 +2483,23 @@ static int opal_get_discv(struct opal_dev *dev, struct 
opal_discovery *discv)
return discv->size; /* modified to actual length of data */
 }
 
+static int opal_revertlsp(struct opal_dev *dev, struct opal_revert_lsp *rev)
+{
+   /* controller will terminate session */
+   const struct opal_step steps[] = {
+   { start_admin1LSP_opal_session, &rev->key },
+   { revert_lsp, rev }
+   };
+   int ret;
+
+   mutex_lock(&dev->dev_lock);
+   setup_opal_dev(dev);
+   ret = execute_steps(dev, steps, ARRAY_SIZE(steps));
+   mutex_unlock(&dev->dev_lock);
+
+   return ret;
+}
+
 static int opal_erase_locking_range(struct opal_dev *dev,
struct opal_session_info *opal_session)
 {
@@ -3084,6 +3121,9 @@ int sed_ioctl(struct opal_dev *dev, unsigned int cmd, 
void __user *arg)
case IOC_OPAL_GET_GEOMETRY:
ret = opal_get_geometry(dev, arg);
break;
+   case IOC_OPAL_REVERT_LSP:
+   ret = opal_revertlsp(dev, p);
+   break;
case IOC_OPAL_DISCOVERY:
ret = opal_get_discv(dev, p);
break;
diff --git a/include/linux/sed-opal.h b/include/linux/sed-opal.h
index ef65f589fbeb..2f189546e133 100644
--- a/include/linux/sed-opal.h
+++ b/include/linux/sed-opal.h
@@ -48,6 +48,7 @@ static inline bool is_sed_ioctl(unsigned int cmd)
case IOC_OPAL_GET_LR_STATUS:
case IOC_OPAL_GET_GEOMETRY:
case IOC_OPAL_DISCOVERY:
+   case IOC_OPAL_REVERT_LSP:
return true;
}
return false;
diff --git a/include/uapi/linux/sed-opal.h b/include/uapi/linux/sed-opal.h
index 7f5732c5bdc5..4e10675751b4 100644
--- a/include/uapi/linux/sed-opal.h
+++ b/include/uapi/linux/sed-opal.h
@@ -56,6 +56,10 @@ struct opal_key {
__u8 key[OPAL_KEY_MAX];
 };
 
+enum opal_revert_lsp_opts {
+   OPAL_PRESERVE = 0x01,
+};
+
 struct opal_lr_act {
struct opal_key key;
__u32 sum;
@@ -178,6 +182,12 @@ struct opal_discovery {
__u64 size;
 };
 
+struct opal_revert_lsp {
+   struct opal_key key;
+   __u32 options;
+   __u32 __pad;
+};
+
 #define IOC_OPAL_SAVE  _IOW('p', 220, struct opal_lock_unlock)
 #define IOC_OPAL_LOCK_UNLOCK   _IOW('p', 221, struct opal_lock_unlock)
 #define IOC_OPAL_TAKE_OWNERSHIP_IOW('p', 222, struct opal_key)
@@ -198,5 +208,6 @@ struct opal_discovery {
 #define IOC_OPAL_GET_LR_STATUS  _IOW('p', 237, struct opal_lr_status)
 #define IOC_OPAL_GET_GEOMETRY   _IOR('p', 238, struct opal_geometry)
 #define IOC_OPAL_DISCOVERY  _IOW('p', 239, struct opal_discovery)
+#define IOC_OPAL_REVERT_LSP _IOW('p', 240, struct opal_revert_lsp)
 
 #endif /* _UAPI_SED_OPAL_H */
-- 

[PATCH v2 14/23] riscv/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().
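
Schematically, the imbalance being avoided looks like this (a sketch,
not a quote from the patch):

	pte = pte_alloc_map(mm, pmd, addr);	/* maps the page table */
	...					/* but hugetlb never pte_unmap()s */

whereas pte_alloc_huge() implies no mapping, so no unmap is owed.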

Signed-off-by: Hugh Dickins 
Reviewed-by: Alexandre Ghiti 
Acked-by: Palmer Dabbelt 
---
 arch/riscv/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index e0ef56dc57b9..542883b3b49b 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -67,7 +67,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
-   pte = pte_alloc_map(mm, pmd, addr & 
napot_cont_mask(order));
+   pte = pte_alloc_huge(mm, pmd, addr & 
napot_cont_mask(order));
break;
}
}
@@ -114,7 +114,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 
for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
-   pte = pte_offset_kernel(pmd, addr & 
napot_cont_mask(order));
+   pte = pte_offset_huge(pmd, addr & 
napot_cont_mask(order));
break;
}
}
-- 
2.35.3



[PATCH v2 13/23] powerpc/hugetlb: pte_alloc_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead.  huge_pte_offset() is using __find_linux_pte(), which is using
pte_offset_kernel() - don't rename that to _huge, it's more complicated.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index b900933507da..f7c683b672c1 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
 
if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-   return pte_alloc_map(mm, (pmd_t *)hpdp, addr);
+   return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);
 
BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
-- 
2.35.3



[PATCH v2 12/23] powerpc: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Balance successful pte_offset_map() with pte_unmap() where omitted.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/book3s64/hash_tlb.c | 4 
 arch/powerpc/mm/book3s64/subpage_prot.c | 2 ++
 arch/powerpc/xmon/xmon.c| 5 -
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c 
b/arch/powerpc/mm/book3s64/hash_tlb.c
index a64ea0a7ef96..21fcad97ae80 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -239,12 +239,16 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, 
pmd_t *pmd, unsigned long
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
start_pte = pte_offset_map(pmd, addr);
+   if (!start_pte)
+   goto out;
for (pte = start_pte; pte < start_pte + PTRS_PER_PTE; pte++) {
unsigned long pteval = pte_val(*pte);
if (pteval & H_PAGE_HASHPTE)
hpte_need_flush(mm, addr, pte, pteval, 0);
addr += PAGE_SIZE;
}
+   pte_unmap(start_pte);
+out:
arch_leave_lazy_mmu_mode();
local_irq_restore(flags);
 }
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c 
b/arch/powerpc/mm/book3s64/subpage_prot.c
index b75a9fb99599..0dc85556dec5 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -71,6 +71,8 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned 
long addr,
if (pmd_none(*pmd))
return;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!pte)
+   return;
arch_enter_lazy_mmu_mode();
for (; npages > 0; --npages) {
pte_update(mm, addr, pte, 0, 0, 0);
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 70c4c59a1a8f..fae747cc57d2 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -3376,12 +3376,15 @@ static void show_pte(unsigned long addr)
printf("pmdp @ 0x%px = 0x%016lx\n", pmdp, pmd_val(*pmdp));
 
ptep = pte_offset_map(pmdp, addr);
-   if (pte_none(*ptep)) {
+   if (!ptep || pte_none(*ptep)) {
+   if (ptep)
+   pte_unmap(ptep);
printf("no valid PTE\n");
return;
}
 
format_pte(ptep, pte_val(*ptep));
+   pte_unmap(ptep);
 
sync();
__delay(200);
-- 
2.35.3



[PATCH v2 11/23] powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
kvmppc_unmap_free_pmd() use pte_offset_kernel(), like everywhere else
in book3s_64_mmu_radix.c: instead of pte_offset_map(), which will come
to need a pte_unmap() to balance it.

But note that this is a more complex case than most: see those -EAGAINs
in kvmppc_create_pte(), which is coping with kvmppc races between page
table and huge entry, of the kind which we are expecting to address
in pte_offset_map() - this might want to be revisited in future.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..572707858d65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -509,7 +509,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
*pmd, bool full,
} else {
pte_t *pte;
 
-   pte = pte_offset_map(p, 0);
+   pte = pte_offset_kernel(p, 0);
kvmppc_unmap_free_pte(kvm, pte, full, lpid);
pmd_clear(p);
}
-- 
2.35.3



[PATCH v2 10/23] parisc/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
---
 arch/parisc/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d1d3990b83f6..a8a1a7c1e16e 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
-   pte = pte_alloc_map(mm, pmd, addr);
+   pte = pte_alloc_huge(mm, pmd, addr);
}
return pte;
 }
@@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (!pud_none(*pud)) {
pmd = pmd_offset(pud, addr);
if (!pmd_none(*pmd))
-   pte = pte_offset_map(pmd, addr);
+   pte = pte_offset_huge(pmd, addr);
}
}
}
-- 
2.35.3



[PATCH v2 09/23] parisc: unmap_uncached_pte() use pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
unmap_uncached_pte() is working from pgd_offset_k(vaddr), so it should
use pte_offset_kernel() instead of pte_offset_map(), to avoid the
question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins 
---
 arch/parisc/kernel/pci-dma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 71ed5391f29d..415f12d5bab3 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -164,7 +164,7 @@ static inline void unmap_uncached_pte(pmd_t * pmd, unsigned 
long vaddr,
pmd_clear(pmd);
return;
}
-   pte = pte_offset_map(pmd, vaddr);
+   pte = pte_offset_kernel(pmd, vaddr);
vaddr &= ~PMD_MASK;
end = vaddr + size;
if (end > PMD_SIZE)
-- 
2.35.3



[PATCH v2 08/23] parisc: add pte_unmap() to balance get_ptep()

2023-06-08 Thread Hugh Dickins
To keep balance in future, remember to pte_unmap() after a successful
get_ptep().  And act as if flush_cache_pages() really needs a map there,
to read the pfn before "unmapping", to be sure page table is not removed.

Signed-off-by: Hugh Dickins 
---
 arch/parisc/kernel/cache.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index ca4a302d4365..501160250bb7 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -426,10 +426,15 @@ void flush_dcache_page(struct page *page)
offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
addr = mpnt->vm_start + offset;
if (parisc_requires_coherency()) {
+   bool needs_flush = false;
pte_t *ptep;
 
ptep = get_ptep(mpnt->vm_mm, addr);
-   if (ptep && pte_needs_flush(*ptep))
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush)
flush_user_cache_page(mpnt, addr);
} else {
/*
@@ -561,14 +566,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
 static void flush_cache_page_if_present(struct vm_area_struct *vma,
unsigned long vmaddr, unsigned long pfn)
 {
-   pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
+   bool needs_flush = false;
+   pte_t *ptep;
 
/*
 * The pte check is racy and sometimes the flush will trigger
 * a non-access TLB miss. Hopefully, the page has already been
 * flushed.
 */
-   if (ptep && pte_needs_flush(*ptep))
+   ptep = get_ptep(vma->vm_mm, vmaddr);
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush)
flush_cache_page(vma, vmaddr, pfn);
 }
 
@@ -635,17 +646,22 @@ static void flush_cache_pages(struct vm_area_struct *vma, 
unsigned long start, u
pte_t *ptep;
 
for (addr = start; addr < end; addr += PAGE_SIZE) {
+   bool needs_flush = false;
/*
 * The vma can contain pages that aren't present. Although
 * the pte search is expensive, we need the pte to find the
 * page pfn and to check whether the page should be flushed.
 */
ptep = get_ptep(vma->vm_mm, addr);
-   if (ptep && pte_needs_flush(*ptep)) {
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pfn = pte_pfn(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush) {
if (parisc_requires_coherency()) {
flush_user_cache_page(vma, addr);
} else {
-   pfn = pte_pfn(*ptep);
if (WARN_ON(!pfn_valid(pfn)))
return;
__flush_cache_page(vma, addr, PFN_PHYS(pfn));
-- 
2.35.3



[PATCH v2 07/23] mips: update_mmu_cache() can replace __update_tlb()

2023-06-08 Thread Hugh Dickins
Don't make update_mmu_cache() a wrapper around __update_tlb(): call it
directly, and use the ptep (or pmdp) provided by the caller, instead of
re-calling pte_offset_map() - which would raise a question of whether a
pte_unmap() is needed to balance it.

Check whether the "ptep" provided by the caller is actually the pmdp,
instead of testing pmd_huge(): or test pmd_huge() too and warn if it
disagrees?  This is "hazardous" territory: needs review and testing.

Signed-off-by: Hugh Dickins 
---
 arch/mips/include/asm/pgtable.h | 15 +++
 arch/mips/mm/tlb-r3k.c  |  5 +++--
 arch/mips/mm/tlb-r4k.c  |  9 +++--
 3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 574fa14ac8b2..9175dfab08d5 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -565,15 +565,8 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 }
 #endif
 
-extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
-   pte_t pte);
-
-static inline void update_mmu_cache(struct vm_area_struct *vma,
-   unsigned long address, pte_t *ptep)
-{
-   pte_t pte = *ptep;
-   __update_tlb(vma, address, pte);
-}
+extern void update_mmu_cache(struct vm_area_struct *vma,
+   unsigned long address, pte_t *ptep);
 
 #define__HAVE_ARCH_UPDATE_MMU_TLB
 #define update_mmu_tlb update_mmu_cache
@@ -581,9 +574,7 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
 {
-   pte_t pte = *(pte_t *)pmdp;
-
-   __update_tlb(vma, address, pte);
+   update_mmu_cache(vma, address, (pte_t *)pmdp);
 }
 
 /*
diff --git a/arch/mips/mm/tlb-r3k.c b/arch/mips/mm/tlb-r3k.c
index 53dfa2b9316b..e5722cd8dd6d 100644
--- a/arch/mips/mm/tlb-r3k.c
+++ b/arch/mips/mm/tlb-r3k.c
@@ -176,7 +176,8 @@ void local_flush_tlb_page(struct vm_area_struct *vma, 
unsigned long page)
}
 }
 
-void __update_tlb(struct vm_area_struct *vma, unsigned long address, pte_t pte)
+void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
 {
unsigned long asid_mask = cpu_asid_mask(&current_cpu_data);
unsigned long flags;
@@ -203,7 +204,7 @@ void __update_tlb(struct vm_area_struct *vma, unsigned long 
address, pte_t pte)
BARRIER;
tlb_probe();
idx = read_c0_index();
-   write_c0_entrylo0(pte_val(pte));
+   write_c0_entrylo0(pte_val(*ptep));
write_c0_entryhi(address | pid);
if (idx < 0) {  /* BARRIER */
tlb_write_random();
diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
index 1b939abbe4ca..c96725d17cab 100644
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -290,14 +290,14 @@ void local_flush_tlb_one(unsigned long page)
  * updates the TLB with the new pte(s), and another which also checks
  * for the R4k "end of page" hardware bug and does the needy.
  */
-void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t 
pte)
+void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
 {
unsigned long flags;
pgd_t *pgdp;
p4d_t *p4dp;
pud_t *pudp;
pmd_t *pmdp;
-   pte_t *ptep;
int idx, pid;
 
/*
@@ -326,10 +326,9 @@ void __update_tlb(struct vm_area_struct * vma, unsigned 
long address, pte_t pte)
idx = read_c0_index();
 #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
/* this could be a huge page  */
-   if (pmd_huge(*pmdp)) {
+   if (ptep == (pte_t *)pmdp) {
unsigned long lo;
write_c0_pagemask(PM_HUGE_MASK);
-   ptep = (pte_t *)pmdp;
lo = pte_to_entrylo(pte_val(*ptep));
write_c0_entrylo0(lo);
write_c0_entrylo1(lo + (HPAGE_SIZE >> 7));
@@ -344,8 +343,6 @@ void __update_tlb(struct vm_area_struct * vma, unsigned 
long address, pte_t pte)
} else
 #endif
{
-   ptep = pte_offset_map(pmdp, address);
-
 #if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
 #ifdef CONFIG_XPA
write_c0_entrylo0(pte_to_entrylo(ptep->pte_high));
-- 
2.35.3



[PATCH v2 06/23] microblaze: allow pte_offset_map() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
---
 arch/microblaze/kernel/signal.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/microblaze/kernel/signal.c b/arch/microblaze/kernel/signal.c
index c3aebec71c0c..c78a0ff48066 100644
--- a/arch/microblaze/kernel/signal.c
+++ b/arch/microblaze/kernel/signal.c
@@ -194,7 +194,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t 
*set,
 
preempt_disable();
ptep = pte_offset_map(pmdp, address);
-   if (pte_present(*ptep)) {
+   if (ptep && pte_present(*ptep)) {
address = (unsigned long) page_address(pte_page(*ptep));
/* MS: I need add offset in page */
address += ((unsigned long)frame->tramp) & ~PAGE_MASK;
@@ -203,7 +203,8 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t 
*set,
invalidate_icache_range(address, address + 8);
flush_dcache_range(address, address + 8);
}
-   pte_unmap(ptep);
+   if (ptep)
+   pte_unmap(ptep);
preempt_enable();
if (err)
return -EFAULT;
-- 
2.35.3



[PATCH v2 05/23] m68k: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Restructure cf_tlb_miss() with a pte_unmap() (previously omitted)
at label out, followed by one local_irq_restore() for all.

Signed-off-by: Hugh Dickins 
---
 arch/m68k/include/asm/mmu_context.h |  6 ++--
 arch/m68k/kernel/sys_m68k.c |  2 ++
 arch/m68k/mm/mcfmmu.c   | 52 -
 3 files changed, 27 insertions(+), 33 deletions(-)

diff --git a/arch/m68k/include/asm/mmu_context.h 
b/arch/m68k/include/asm/mmu_context.h
index 8ed6ac14d99f..141bbdfad960 100644
--- a/arch/m68k/include/asm/mmu_context.h
+++ b/arch/m68k/include/asm/mmu_context.h
@@ -99,7 +99,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte;
+   pte_t *pte = NULL;
unsigned long mmuar;
 
local_irq_save(flags);
@@ -139,7 +139,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
 
pte = (mmuar >= PAGE_OFFSET) ? pte_offset_kernel(pmd, mmuar)
 : pte_offset_map(pmd, mmuar);
-   if (pte_none(*pte) || !pte_present(*pte))
+   if (!pte || pte_none(*pte) || !pte_present(*pte))
goto bug;
 
set_pte(pte, pte_mkyoung(*pte));
@@ -161,6 +161,8 @@ static inline void load_ksp_mmu(struct task_struct *task)
 bug:
pr_info("ksp load failed: mm=0x%p ksp=0x08%lx\n", mm, mmuar);
 end:
+   if (pte && mmuar < PAGE_OFFSET)
+   pte_unmap(pte);
local_irq_restore(flags);
 }
 
diff --git a/arch/m68k/kernel/sys_m68k.c b/arch/m68k/kernel/sys_m68k.c
index bd0274c7592e..c586034d2a7a 100644
--- a/arch/m68k/kernel/sys_m68k.c
+++ b/arch/m68k/kernel/sys_m68k.c
@@ -488,6 +488,8 @@ sys_atomic_cmpxchg_32(unsigned long newval, int oldval, int 
d3, int d4, int d5,
if (!pmd_present(*pmd))
goto bad_access;
pte = pte_offset_map_lock(mm, pmd, (unsigned long)mem, &ptl);
+   if (!pte)
+   goto bad_access;
if (!pte_present(*pte) || !pte_dirty(*pte)
|| !pte_write(*pte)) {
pte_unmap_unlock(pte, ptl);
diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
index 70aa0979e027..42f45abea37a 100644
--- a/arch/m68k/mm/mcfmmu.c
+++ b/arch/m68k/mm/mcfmmu.c
@@ -91,7 +91,8 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, 
int extension_word)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte;
+   pte_t *pte = NULL;
+   int ret = -1;
int asid;
 
local_irq_save(flags);
@@ -100,47 +101,33 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int 
dtlb, int extension_word)
regs->pc + (extension_word * sizeof(long));
 
mm = (!user_mode(regs) && KMAPAREA(mmuar)) ? &init_mm : current->mm;
-   if (!mm) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (!mm)
+   goto out;
 
pgd = pgd_offset(mm, mmuar);
-   if (pgd_none(*pgd))  {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (pgd_none(*pgd))
+   goto out;
 
p4d = p4d_offset(pgd, mmuar);
-   if (p4d_none(*p4d)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (p4d_none(*p4d))
+   goto out;
 
pud = pud_offset(p4d, mmuar);
-   if (pud_none(*pud)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (pud_none(*pud))
+   goto out;
 
pmd = pmd_offset(pud, mmuar);
-   if (pmd_none(*pmd)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (pmd_none(*pmd))
+   goto out;
 
pte = (KMAPAREA(mmuar)) ? pte_offset_kernel(pmd, mmuar)
: pte_offset_map(pmd, mmuar);
-   if (pte_none(*pte) || !pte_present(*pte)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (!pte || pte_none(*pte) || !pte_present(*pte))
+   goto out;
 
if (write) {
-   if (!pte_write(*pte)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (!pte_write(*pte))
+   goto out;
set_pte(pte, pte_mkdirty(*pte));
}
 
@@ -161,9 +148,12 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, 
int extension_word)
mmu_write(MMUOR, MMUOR_ACC | MMUOR_UAA);
else
mmu_write(MMUOR, MMUOR_ITLB | MMUOR_ACC | MMUOR_UAA);
-
+   ret = 0;
+out:
+   if (pte && !KMAPAREA(mmuar))
+   pte_unmap(pte);
local_irq_restore(flags);
-   return 0;
+   return ret;
 }
 
 void __init 

[PATCH v2 04/23] ia64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
---
 arch/ia64/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 78a02e026164..adc49f2d22e8 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -41,7 +41,7 @@ huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct 
*vma,
if (pud) {
pmd = pmd_alloc(mm, pud, taddr);
if (pmd)
-   pte = pte_alloc_map(mm, pmd, taddr);
+   pte = pte_alloc_huge(mm, pmd, taddr);
}
return pte;
 }
@@ -64,7 +64,7 @@ huge_pte_offset (struct mm_struct *mm, unsigned long addr, 
unsigned long sz)
if (pud_present(*pud)) {
pmd = pmd_offset(pud, taddr);
if (pmd_present(*pmd))
-   pte = pte_offset_map(pmd, taddr);
+   pte = pte_offset_huge(pmd, taddr);
}
}
}
-- 
2.35.3



[PATCH v2 03/23] arm64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
Acked-by: Catalin Marinas 
---
 arch/arm64/mm/hugetlbpage.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 95364e8bdc19..21716c940682 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -307,14 +307,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
 
WARN_ON(addr & (sz - 1));
-   /*
-* Note that if this code were ever ported to the
-* 32-bit arm platform then it will cause trouble in
-* the case where CONFIG_HIGHPTE is set, since there
-* will be no pte_unmap() to correspond with this
-* pte_alloc_map().
-*/
-   ptep = pte_alloc_map(mm, pmdp, addr);
+   ptep = pte_alloc_huge(mm, pmdp, addr);
} else if (sz == PMD_SIZE) {
if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
ptep = huge_pmd_share(mm, vma, addr, pudp);
@@ -366,7 +359,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return (pte_t *)pmdp;
 
if (sz == CONT_PTE_SIZE)
-   return pte_offset_kernel(pmdp, (addr & CONT_PTE_MASK));
+   return pte_offset_huge(pmdp, (addr & CONT_PTE_MASK));
 
return NULL;
 }
-- 
2.35.3



[PATCH v2 02/23] arm64: allow pte_offset_map() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
Acked-by: Catalin Marinas 
---
 arch/arm64/mm/fault.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index cb21ccd7940d..f3aaba853547 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -177,6 +177,9 @@ static void show_pte(unsigned long addr)
break;
 
ptep = pte_offset_map(pmdp, addr);
+   if (!ptep)
+   break;
+
pte = READ_ONCE(*ptep);
pr_cont(", pte=%016llx", pte_val(pte));
pte_unmap(ptep);
-- 
2.35.3



[PATCH v2 01/23] arm: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
---
 arch/arm/lib/uaccess_with_memcpy.c | 3 +++
 arch/arm/mm/fault-armv.c   | 5 -
 arch/arm/mm/fault.c| 3 +++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm/lib/uaccess_with_memcpy.c 
b/arch/arm/lib/uaccess_with_memcpy.c
index e4c2677cc1e9..2f6163f05e93 100644
--- a/arch/arm/lib/uaccess_with_memcpy.c
+++ b/arch/arm/lib/uaccess_with_memcpy.c
@@ -74,6 +74,9 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, 
spinlock_t **ptlp)
return 0;
 
pte = pte_offset_map_lock(current->mm, pmd, addr, &ptl);
+   if (unlikely(!pte))
+   return 0;
+
if (unlikely(!pte_present(*pte) || !pte_young(*pte) ||
!pte_write(*pte) || !pte_dirty(*pte))) {
pte_unmap_unlock(pte, ptl);
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 0e49154454a6..ca5302b0b7ee 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,8 +117,11 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned 
long address,
 * must use the nested version.  This also means we need to
 * open-code the spin-locking.
 */
-   ptl = pte_lockptr(vma->vm_mm, pmd);
pte = pte_offset_map(pmd, address);
+   if (!pte)
+   return 0;
+
+   ptl = pte_lockptr(vma->vm_mm, pmd);
do_pte_lock(ptl);
 
ret = do_adjust_pte(vma, address, pfn, pte);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2418f1efabd8..83598649a094 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -85,6 +85,9 @@ void show_pte(const char *lvl, struct mm_struct *mm, unsigned 
long addr)
break;
 
pte = pte_offset_map(pmd, addr);
+   if (!pte)
+   break;
+
pr_cont(", *pte=%08llx", (long long)pte_val(*pte));
 #ifndef CONFIG_ARM_LPAE
pr_cont(", *ppte=%08llx",
-- 
2.35.3



[PATCH v2 00/23] arch: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
Here is v2 series of patches to various architectures, based on v6.4-rc5:
preparing for v2 of changes following in mm, affecting pte_offset_map()
and pte_offset_map_lock().  There are very few differences from v1:
noted patch by patch below.

v1 was "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53b...@google.com/
series of 23 posted on 2023-05-09,
followed by "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/68a97fbe-5c1e-7ac6-72c-7b9c6290b...@google.com/
series of 31 posted on 2023-05-21,
followed by  "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/35e983f5-7ed3-b310-d949-9ae8b130c...@google.com/
series of 12 posted on 2023-05-28.

The first two series are "independent": neither depends
for build or correctness on the other, and the arch patches can either be
merged separately via arch trees, or be picked up by akpm; but both series
must be in before the third series is added to make the effective changes
(and that adds just a little more in arm, powerpc, s390 and sparc).

What is it all about?  Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.

These changes (though of course not these exact patches, and not all
of these architectures!) have been in Google's data centre kernel for
three years now: we do rely upon them.

What are the per-arch changes about?  Generally, two things.

One: the current mmap locking may not be enough to guard against that
tricky transition between pmd entry pointing to page table, and empty
pmd entry, and pmd entry pointing to huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there.  What to do about that varies: often the nearby error
handling indicates just to skip it; but in some cases a "goto again"
looks appropriate (and if that risks an infinite loop, then there
must have been an oops, or pfn 0 mistaken for page table, before).
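
Schematically, the per-site change is usually of this shape (a sketch,
not a quote from any one patch):

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return;		/* or "goto again" where retry makes sense */
	...
	pte_unmap_unlock(pte, ptl);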

Deeper study of each site might show that 90% of them here in arch
code could only fail if there's corruption e.g. a transition to THP
would be surprising on an arch without HAVE_ARCH_TRANSPARENT_HUGEPAGE.
But given the likely extension to freeing empty page tables, I have
not limited this set of changes to THP; and it has been easier, and
sets a better example, if each site is given appropriate handling.

Two: pte_offset_map() will need to do an rcu_read_lock(), with the
corresponding rcu_read_unlock() in pte_unmap().  But most architectures
never supported CONFIG_HIGHPTE, so some don't always call pte_unmap()
after pte_offset_map(), or have used userspace pte_offset_map() where
pte_offset_kernel() is more correct.  No problem in the current tree,
but a problem once an rcu_read_unlock() will be needed to keep balance.
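
That is, every successful map will have to be balanced (sketch):

	pte = pte_offset_map(pmd, addr);	/* will rcu_read_lock() */
	if (!pte)
		return;
	...
	pte_unmap(pte);				/* will rcu_read_unlock() */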

A common special case of that comes in arch/*/mm/hugetlbpage.c, if
the architecture supports hugetlb pages down at the lowest PTE level.
huge_pte_alloc() uses pte_alloc_map(), but generic hugetlb code does
no corresponding pte_unmap(); similarly for huge_pte_offset().
Thanks to Mike Kravetz and Andrew Morton, v6.4-rc1 already provides
pte_alloc_huge() and pte_offset_huge() to help fix up those cases.

This posting is based on v6.4-rc5, but good for any v6.4-rc,
current mm-everything and linux-next.

01/23 arm: allow pte_offset_map[_lock]() to fail
  v2: same as v1
02/23 arm64: allow pte_offset_map() to fail
  v2: add ack from Catalin
03/23 arm64/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: add ack from Catalin
04/23 ia64/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: same as v1
05/23 m68k: allow pte_offset_map[_lock]() to fail
  v2: same as v1
06/23 microblaze: allow pte_offset_map() to fail
  v2: same as v1
07/23 mips: update_mmu_cache() can replace __update_tlb()
  v2: same as v1
08/23 parisc: add pte_unmap() to balance get_ptep()
  v2: typo fix from Helge; stronger commit message
09/23 parisc: unmap_uncached_pte() use pte_offset_kernel()
  v2: same as v1
10/23 parisc/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: same as v1
11/23 powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()
  v2: same as v1
12/23 powerpc: allow pte_offset_map[_lock]() to fail
  v2: same as v1
13/23 powerpc/hugetlb: pte_alloc_huge()
  v2: same as v1
14/23 riscv/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: add review from 

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-08 Thread Mike Rapoport
On Tue, Jun 06, 2023 at 11:21:59AM -0700, Song Liu wrote:
> On Mon, Jun 5, 2023 at 3:09 AM Mark Rutland  wrote:
> 
> [...]
> 
> > > > > Can you give more detail on what parameters you need? If the only 
> > > > > extra
> > > > > parameter is just "does this allocation need to live close to kernel
> > > > > text", that's not that big of a deal.
> > > >
> > > > My thinking was that we at least need the start + end for each caller. 
> > > > That
> > > > might be it, tbh.
> > >
> > > Do you mean that modules will have something like
> > >
> > >   jit_text_alloc(size, MODULES_START, MODULES_END);
> > >
> > > and kprobes will have
> > >
> > >   jit_text_alloc(size, KPROBES_START, KPROBES_END);
> > > ?
> >
> > Yes.
> 
> How about we start with two APIs:
>  jit_text_alloc(size);
>  jit_text_alloc_range(size, start, end);
> 
> AFAICT, arm64 is the only arch that requires the latter API. And TBH, I am
> not quite convinced it is needed.
 
Right now arm64 and riscv override bpf and kprobes allocations to use the
entire vmalloc address space, but having the ability to allocate generated
code outside of modules area may be useful for other architectures.

Still the start + end for the callers feels backwards to me because the
callers do not define the ranges, but rather the architectures, so we still
need a way for architectures to define how they want to allocate memory for
the generated code.

> > > It sill can be achieved with a single jit_alloc_arch_params(), just by
> > > adding enum jit_type parameter to jit_text_alloc().
> >
> > That feels backwards to me; it centralizes a bunch of information about
> > distinct users to be able to shove that into a static array, when the 
> > callsites
> > can pass that information.
> 
> I think we only two type of users: module and everything else (ftrace, kprobe,
> bpf stuff). The key differences are:
> 
>   1. module uses text and data; while everything else only uses text.
>   2. module code is generated by the compiler, and thus has stronger
>   requirements in address ranges; everything else are generated via some
>   JIT or manual written assembly, so they are more flexible with address
>   ranges (in JIT, we can avoid using instructions that requires a specific
>   address range).
> 
> The next question is, can we have the two types of users share the same
> address ranges? If not, we can reserve the preferred range for modules,
> and let everything else use the other range. I don't see reasons to further
> separate users in the "everything else" group.
 
I agree that we can define only two types: modules and everything else and
let the architectures define if they need different ranges for these two
types, or want the same range for everything.

With only two types we can have two API calls for alloc, and a single
structure that defines the ranges etc from the architecture side rather
than spread all over.

Like something along these lines:

struct execmem_range {
unsigned long   start;
unsigned long   end;
unsigned long   fallback_start;
unsigned long   fallback_end;
pgprot_tpgprot;
unsigned intalignment;
};

struct execmem_modules_range {
enum execmem_module_flags flags;
struct execmem_range text;
struct execmem_range data;
};

struct execmem_jit_range {
struct execmem_range text;
};

struct execmem_params {
struct execmem_modules_rangemodules;
struct execmem_jit_rangejit;
};

struct execmem_params *execmem_arch_params(void);

void *execmem_text_alloc(size_t size);
void *execmem_data_alloc(size_t size);
void execmem_free(void *ptr);

void *jit_text_alloc(size_t size);
void jit_free(void *ptr);

Modules or anything that must live close to the kernel image can use
execmem_*_alloc() and the callers that don't generally care about relative
addressing will use jit_text_alloc(), presuming that arch will restrict jit
range if necessary, like e.g. below for arm64 jit can be anywhere in
vmalloc and for x86 and s390 it will share the modules range. 


struct execmem_params arm64_execmem = {
.modules = {
.flags = KASAN,
.text = {
.start = MODULES_VADDR,
.end = MODULES_END,
.pgprot = PAGE_KERNEL_ROX,
.fallback_start = VMALLOC_START,
.fallback_end = VMALLOC_END,
},
},
.jit = {
.text = {
.start = VMALLOC_START,
.end = VMALLOC_END,
.pgprot = 

Re: [PATCH v4 RESEND 0/3] sed-opal: keyrings, discovery, revert, key store

2023-06-08 Thread Greg Joyce
On Mon, 2023-06-05 at 15:14 -0600, Jens Axboe wrote:
> On 6/1/23 4:37PM, gjo...@linux.vnet.ibm.com wrote:
> > From: Greg Joyce 
> > 
> > This patchset has gone through numerous rounds of review and
> > all comments/suggestions have been addressed. I believe that
> > this patchset is ready for inclusion.
> > 
> > TCG SED Opal is a specification from The Trusted Computing Group
> > that allows self encrypting storage devices (SED) to be locked at
> > power on and require an authentication key to unlock the drive.
> > 
> > The current SED Opal implementation in the block driver
> > requires that authentication keys be provided in an ioctl
> > so that they can be presented to the underlying SED
> > capable drive. Currently, the key is typically entered by
> > a user with an application like sedutil or sedcli. While
> > this process works, it does not lend itself to automation
> > like unlock by a udev rule.
> > 
> > The SED block driver has been extended so it can alternatively
> > obtain a key from a sed-opal kernel keyring. The SED ioctls
> > will indicate the source of the key, either directly in the
> > ioctl data or from the keyring.
> > 
> > Two new SED ioctls have also been added. These are:
> >   1) IOC_OPAL_REVERT_LSP to revert LSP state
> >   2) IOC_OPAL_DISCOVERY to discover drive capabilities/state
> > 
> > change log v4:
> > - rebase to 6.3-rc7
> > - replaced "255" magic number with U8_MAX
> 
> None of this applies for for-6.5/block, and I'm a little puzzled
> as to why you'd rebase to an old kernel rather than a 6.4-rc at
> least?
> 
> Please resend one that is current.

Rebase to for-6.5/block coming shortly.




Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-08 Thread Gerald Schaefer
On Wed, 7 Jun 2023 20:35:05 -0700 (PDT)
Hugh Dickins  wrote:

> On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> > On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> > Hugh Dickins  wrote:  
> > > On Thu, 1 Jun 2023 15:57:51 +0200
> > > Gerald Schaefer  wrote:  
> > > > 
> > > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > > rcu_head reuse. It might be possible to use the cleverness from our
> > > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > > the case where both 2K pagetable fragments become unused, similar to
> > > > how we decide when to actually call __free_page().
> > > > 
> > > > However, it might be much worse, and page->rcu_head from a pagetable
> > > > page cannot be used at all for s390, because we also use page->lru
> > > > to keep our list of free 2K pagetable fragments. I always get confused
> > > > by struct page unions, so not completely sure, but it seems to me that
> > > > page->rcu_head would overlay with page->lru, right?
> > > 
> > > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > > I'm wrong) I think that s390 could use exactly the same technique for
> > > its list of free 2K pagetable fragments as it uses for its list of THP
> > > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > > the first two longs of the page table itself for threading the list.  
> > 
> > Nice idea, I think that could actually work, since we only need the empty
> > 2K halves on the list. So it should be possible to store the list_head
> > inside those.  
> 
> Jason quickly pointed out the flaw in my thinking there.

Yes, while I had the right concerns about "the to-be-freed pagetables would
still be accessible, but not really valid, if we added them back to the list,
with list_heads inside them", when suggesting the approach w/o passing over
the mm, I missed that we would have the very same issue already with the
existing page_table_free_rcu().

Thankfully Jason was watching out!

> 
> >   
> > > 
> > > And while it could use third and fourth longs instead, I don't see any
> > > need for that: a deposited pagetable has been allocated, so would not
> > > be on the list of free fragments.  
> > 
> > Correct, that should not interfere.
> >   
> > > 
> > > Below is one of the grossest patches I've ever posted: gross because
> > > it's a rushed attempt to see whether that is viable, while it would take
> > > me longer to understand all the s390 cleverness there (even though the
> > > PP AA commentary above page_table_alloc() is excellent).  
> > 
> > Sounds fair, this is also some of the grossest code we have, which is also
> > why Alexander added the comment. I guess we could need even more comments
> > inside the code, as it still confuses me more than it should.
> > 
> > Considering that, you did remarkably well. Your patch seems to work fine,
> > at least it survived some LTP mm tests. I will also add it to our CI runs,
> > to give it some more testing. Will report tomorrow when it broke something.
> > See also below for some patch comments.  
> 
> Many thanks for your effort on this patch.  I don't expect the testing
> of it to catch Jason's point, that I'm corrupting the page table while
> it's on its way through RCU to being freed, but he's right nonetheless.

Right, tests ran fine, but we would have introduced subtle issues with
racing gup_fast, I guess.

> 
> I'll integrate your fixes below into what I have here, but probably
> just archive it as something to refer to later in case it might play
> a part; but probably it will not - sorry for wasting your time.

No worries, looking at that s390 code can never be amiss. It seems I need
regular refresh, at least I'm sure I already understood it better in the
past.

And who knows, with Jasons recent thoughts, that "list_head inside
pagetable" idea might not be dead yet.

> 
> >   
> > > 
> > > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.  
> > 
> > cmma_init_nodat() should be disjoint, not only because it is __init,
> > but also because it explicitly skips pagetable pages, so it should
> > never touch page->lru of those.
> > 
> > Not very familiar with the gmap code, it does look disjoint, and we should
> > also use complete 4K pages for pagetables instead of 2K fragments there,
> > but Christian or Claudio should also have a look.
> >   
> > > 
> > > Gerald, s390 folk: would it be possible for you to give this
> > > a try, suggest corrections and improvements, and then I can make it
> > > a separate patch of the series; and work on avoiding concurrent use
> > > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > > the scheme already there, maybe DD bits to go along with the PP AA).  
> > 
> > It feels like it could be possible to not only avoid the double
> > rcu_head, but also avoid passing over the mm via page->pt_mm.
> > I.e. have pte_free_defer(), 

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-08 Thread Jason Gunthorpe
On Wed, Jun 07, 2023 at 08:35:05PM -0700, Hugh Dickins wrote:

> My current thinking (but may be proved wrong) is along the lines of:
> why does something on its way to being freed need to be on any list
> than the rcu_head list?  I expect the current answer is, that the
> other half is allocated, so the page won't be freed; but I hope that
> we can put it back on that list once we're through with the rcu_head.

I was having the same thought. It is pretty tricky, but if this was
made into some core helper then PPC and S390 could both use it and PPC
would get a nice upgrade to have the S390 frag re-use instead of
leaking frags.

Broadly we have three states:

 all frags free
 at least one frag free
 all frags used

'all frags free' should be returned to the allocator
'at least one frag free' should have the struct page on the mm_struct's list
'all frags used' should be on no list.

So if we go from 
  all frags used -> at least one frag free
Then we put it on the RCU, then the RCU puts it on the mm_struct list

If we go from 
   at least one frag free -> all frags free
Then we take it off the mm_struct list, put it on the RCU, and RCU
frees it.

Your trick to put the list_head for the mm_struct list into the frag
memory looks like the right direction. So 'at least one frag free' has
a single already-RCU-free'd frag holding the list head pointer. Thus we
never use the LRU and the rcu_head is always available.

The struct page itself can contain the actual free frag bitmask.

I think if we split up the memory used for pt_frag_refcount we can get
enough bits to keep track of everything. With only 2-4 frags we should
be OK.

So we track this data in the struct page:
  - Current RCU free TODO bitmask - if non-zero then a RCU is already
triggered
  - Next RCU TODO bitmask - If an RCU is already triggered then we
accumulate more free'd frags here
  - Current Free Bits - Only updated by the RCU callback

?

We'd also need to store the mm_struct pointer in the struct page for
the RCU to be able to add/remove from the mm_struct list.

I'm not sure how much of the work can be done with atomics and how
much would need to rely on spinlock inside the mm_struct.
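
Purely as a sketch of that bookkeeping (all names hypothetical, not an
existing kernel API), the per-page state might pack as:

	struct pt_frag_state {
		unsigned int rcu_todo;		/* frags in the in-flight RCU batch */
		unsigned int rcu_next_todo;	/* frags freed while RCU is in flight */
		unsigned int free_bits;		/* updated only by the RCU callback */
		struct mm_struct *mm;		/* for the mm's partial-frag list */
	};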

It feels feasible and not so bad. :)

Figure it out and test it on S390 then make power use the same common
code, and we get full RCU page table freeing using a reliable rcu_head
on both of these previously troublesome architectures :) Yay

Jason


Re: [PATCH 2/7] watchdog/hardlockup: Make the config checks more straightforward

2023-06-08 Thread Doug Anderson
Hi,

On Thu, Jun 8, 2023 at 4:02 AM Petr Mladek  wrote:
>
> > >  config HARDLOCKUP_DETECTOR
> > > bool "Detect Hard Lockups"
> > > depends on DEBUG_KERNEL && !S390
> > > -   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || 
> > > HAVE_HARDLOCKUP_DETECTOR_ARCH
> > > +   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> > > HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> > > HAVE_HARDLOCKUP_DETECTOR_ARCH
> >
> > Adding the dependency to buddy (see ablove) would simplify the above
> > to just this:
> >
> > depends on HAVE_HARDLOCKUP_DETECTOR_PERF ||
> > HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
>
> This is exactly what I do not want. It would just move the check
> somewhere else. But it would make the logic harder to understand.

Hmmm. To me, it felt easier to understand by moving this into the
"HAVE_HARDLOCKUP_DETECTOR_BUDDY". To me it was pretty easy to say "if
an architecture defined its own arch-specific watchdog then buddy
can't be enabled" and that felt like it fit cleanly within the
"HAVE_HARDLOCKUP_DETECTOR_BUDDY" definition. It got rid of _a lot_ of
other special cases / checks elsewhere and felt quite a bit cleaner to
me. I only had to think about the conflict between the "buddy" and
"nmi" watchdogs once when I understood
"HAVE_HARDLOCKUP_DETECTOR_BUDDY".


> > As per above, it's simply a responsibility of architectures not to
> > define that they have both "perf" if they have the NMI watchdog, so
> > it's just buddy to worry about.
>
> Where is this documented, please?
> Is it safe to assume this?

It's not well documented and I agree that it could be improved. Right
now, HAVE_NMI_WATCHDOG is documented to say that the architecture
"defines its own arch_touch_nmi_watchdog()". Looking before my
patches, you can see that "kernel/watchdog_hld.c" (the "perf" detector
code) unconditionally defines arch_touch_nmi_watchdog(). That would
give you a linker error.


> I would personally prefer to ensure this by the config check.
> It is even better than documentation because nobody reads
> documentation ;-)

Sure. IMO this should be documented as close as possible to the root
of the problem. Make "HAVE_NMI_WATCHDOG" depend on
"!HAVE_HARDLOCKUP_DETECTOR_PERF". That expresses that an architecture
is not allowed to declare that it has both.
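
Expressed as a Kconfig fragment, that suggestion would look roughly
like this (a sketch, not a tested patch):

	config HAVE_NMI_WATCHDOG
		depends on HAVE_NMI
		depends on !HAVE_HARDLOCKUP_DETECTOR_PERF
		bool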


Re: [PATCH 7/7] watchdog/hardlockup: Define HARDLOCKUP_DETECTOR_ARCH

2023-06-08 Thread Petr Mladek
On Wed 2023-06-07 16:37:10, Doug Anderson wrote:
> Hi,
> 
> On Wed, Jun 7, 2023 at 8:26 AM Petr Mladek  wrote:
> >
> > @@ -1102,6 +1103,14 @@ config HARDLOCKUP_DETECTOR_BUDDY
> > depends on !HAVE_HARDLOCKUP_DETECTOR_ARCH
> > select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER
> >
> > +config HARDLOCKUP_DETECTOR_ARCH
> > +   bool
> > +   depends on HARDLOCKUP_DETECTOR
> > +   depends on HAVE_HARDLOCKUP_DETECTOR_ARCH
> > +   help
> > + The arch-specific implementation of the hardlockup detector is
> > + available.
> 
> nit: "is available" makes it sound a bit too much like a "have"
> version. Maybe "The arch-specific implementation of the hardlockup
> detector will be used" or something like that?

Makes sense. Will do this change in v2.

> Otherise:
> 
> Reviewed-by: Douglas Anderson 

Thanks,
Petr


Re: [PATCH 4/7] watchdog/hardlockup: Enable HAVE_NMI_WATCHDOG only on sparc64

2023-06-08 Thread Petr Mladek
On Wed 2023-06-07 16:36:35, Doug Anderson wrote:
> Hi,
> 
> On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 13c6e596cf9e..57f15babe188 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -404,10 +404,9 @@ config HAVE_NMI_WATCHDOG
> > depends on HAVE_NMI
> > bool
> > help
> > - The arch provides its own hardlockup detector implementation 
> > instead
> > + Sparc64 provides its own hardlockup detector implementation 
> > instead
> >   of the generic perf one.
> 
> It's a little weird to document generic things with the specifics of
> the user. The exception, IMO, is when something is deprecated.
> Personally, it would sound less weird to me to say something like:

Or I could replace "The arch" by "Sparc64" in the 5th patch which
renames the variable to HAVE_HARDLOCKUP_DETECTOR_SPARC64. It will
no longer be a generic thing...

Or I could squash the two patches. I did not want to do too many
changes at the same time. But it might actually make sense to
do this in one step.

> 
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index d201f5d3876b..4b4aa0f941f9 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1050,15 +1050,13 @@ config HAVE_HARDLOCKUP_DETECTOR_BUDDY
> >  #  sparc64: has a custom implementation which is not using the common
> >  #  hardlockup command line options and sysctl interface.
> >  #
> > -# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
> > -# implementaion. It is automatically enabled also for other arch-specific
> > -# variants which set HAVE_HARDLOCKUP_DETECTOR_ARCH. It makes the check
> > -# of avaialable and supported variants quite tricky.
> > +# Note that HAVE_NMI_WATCHDOG is set when the sparc64 specific 
> > implementation
> > +# is used.
> >  #
> >  config HARDLOCKUP_DETECTOR
> > bool "Detect Hard Lockups"
> > -   depends on DEBUG_KERNEL && !S390
> > -   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> > HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> > HAVE_HARDLOCKUP_DETECTOR_ARCH
> > +   depends on DEBUG_KERNEL && !S390 && !HAVE_NMI_WATCHDOG
> > +   depends on HAVE_HARDLOCKUP_DETECTOR_PERF || 
> > HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH
> 
> If you add the "!HAVE_NMI_WATCHDOG" as a dependency to
> HAVE_HARDLOCKUP_DETECTOR_BUDDY, as discussed in a previous patch, you
> can skip adding it here.

It it related to the 2nd patch. Let's discuss it there.

> 
> > imply HARDLOCKUP_DETECTOR_PERF
> > imply HARDLOCKUP_DETECTOR_BUDDY
> > select LOCKUP_DETECTOR
> > @@ -1079,7 +1077,7 @@ config HARDLOCKUP_DETECTOR_PREFER_BUDDY
> > bool "Prefer the buddy CPU hardlockup detector"
> > depends on HARDLOCKUP_DETECTOR
> > depends on HAVE_HARDLOCKUP_DETECTOR_PERF && 
> > HAVE_HARDLOCKUP_DETECTOR_BUDDY
> > -   depends on !HAVE_NMI_WATCHDOG
> > +   depends on !HAVE_HARLOCKUP_DETECTOR_ARCH
> 
> Don't need this. Architectures never are allowed to define
> HAVE_HARDLOCKUP_DETECTOR_PERF and HAVE_HARLOCKUP_DETECTOR_ARCH

Same here...

Best Regards,
Petr


Re: [PATCH v2] security/integrity: fix pointer to ESL data and its size on pseries

2023-06-08 Thread Jarkko Sakkinen
On Thu Jun 8, 2023 at 3:04 PM EEST, Nayna Jain wrote:
> On PowerVM guest, variable data is prefixed with 8 bytes of timestamp.
> Extract ESL by stripping off the timestamp before passing to ESL parser.
>
> Fixes: 4b3e71e9a34c ("integrity/powerpc: Support loading keys from PLPKS")
> Cc: sta...@vger.kenrnel.org # v6.3
> Signed-off-by: Nayna Jain 
> ---
> Changelog:
> v2: Fixed feedback from Jarkko
>   * added CC to stable
>   * moved *data declaration to same line as *db,*dbx
> Renamed extract_data() macro to extract_esl() for clarity
>  .../integrity/platform_certs/load_powerpc.c   | 40 ---
>  1 file changed, 26 insertions(+), 14 deletions(-)
>
> diff --git a/security/integrity/platform_certs/load_powerpc.c 
> b/security/integrity/platform_certs/load_powerpc.c
> index b9de70b90826..170789dc63d2 100644
> --- a/security/integrity/platform_certs/load_powerpc.c
> +++ b/security/integrity/platform_certs/load_powerpc.c
> @@ -15,6 +15,9 @@
>  #include "keyring_handler.h"
>  #include "../integrity.h"
>  
> +#define extract_esl(db, data, size, offset)  \
> + do { db = data + offset; size = size - offset; } while (0)
> +
>  /*
>   * Get a certificate list blob from the named secure variable.
>   *
> @@ -55,8 +58,9 @@ static __init void *get_cert_list(u8 *key, unsigned long 
> keylen, u64 *size)
>   */
>  static int __init load_powerpc_certs(void)
>  {
> - void *db = NULL, *dbx = NULL;
> - u64 dbsize = 0, dbxsize = 0;
> + void *db = NULL, *dbx = NULL, *data = NULL;
> + u64 dsize = 0;
> + u64 offset = 0;
>   int rc = 0;
>   ssize_t len;
>   char buf[32];
> @@ -74,38 +78,46 @@ static int __init load_powerpc_certs(void)
>   return -ENODEV;
>   }
>  
> + if (strcmp("ibm,plpks-sb-v1", buf) == 0)
> + /* PLPKS authenticated variables ESL data is prefixed with 8 
> bytes of timestamp */
> + offset = 8;
> +
>   /*
>* Get db, and dbx. They might not exist, so it isn't an error if we
>* can't get them.
>*/
> -	db = get_cert_list("db", 3, &dbsize);
> - if (!db) {
> +	data = get_cert_list("db", 3, &dsize);
> + if (!data) {
>   pr_info("Couldn't get db list from firmware\n");
> - } else if (IS_ERR(db)) {
> - rc = PTR_ERR(db);
> + } else if (IS_ERR(data)) {
> + rc = PTR_ERR(data);
>   pr_err("Error reading db from firmware: %d\n", rc);
>   return rc;
>   } else {
> - rc = parse_efi_signature_list("powerpc:db", db, dbsize,
> + extract_esl(db, data, dsize, offset);
> +
> + rc = parse_efi_signature_list("powerpc:db", db, dsize,
> get_handler_for_db);
>   if (rc)
>   pr_err("Couldn't parse db signatures: %d\n", rc);
> - kfree(db);
> + kfree(data);
>   }
>  
> -	dbx = get_cert_list("dbx", 4, &dbxsize);
> - if (!dbx) {
> +	data = get_cert_list("dbx", 4, &dsize);
> + if (!data) {
>   pr_info("Couldn't get dbx list from firmware\n");
> - } else if (IS_ERR(dbx)) {
> - rc = PTR_ERR(dbx);
> + } else if (IS_ERR(data)) {
> + rc = PTR_ERR(data);
>   pr_err("Error reading dbx from firmware: %d\n", rc);
>   return rc;
>   } else {
> - rc = parse_efi_signature_list("powerpc:dbx", dbx, dbxsize,
> + extract_esl(dbx, data, dsize, offset);
> +
> + rc = parse_efi_signature_list("powerpc:dbx", dbx, dsize,
> get_handler_for_dbx);
>   if (rc)
>   pr_err("Couldn't parse dbx signatures: %d\n", rc);
> - kfree(dbx);
> + kfree(data);
>   }
>  
>   return rc;
> -- 
> 2.31.1

Acked-by: Jarkko Sakkinen 

BR, Jarkko


Re: [RFC PATCH] asm-generic: Unify uapi bitsperlong.h

2023-06-08 Thread Arnd Bergmann
On Thu, Jun 8, 2023, at 09:04, Tiezhu Yang wrote:
> On 05/09/2023 05:37 PM, Arnd Bergmann wrote:
>> On Tue, May 9, 2023, at 09:05, Tiezhu Yang wrote:
>>
>> I think we are completely safe on the architectures that were
>> added since the linux-3.x days (arm64, riscv, csky, openrisc,
>> loongarch, nios2, and hexagon), but for the older ones there
>> is a regression risk. Especially on targets that are not that
>> actively maintained (sparc, alpha, ia64, sh, ...) there is
>> a good chance that users are stuck on ancient toolchains.
>> It's probably also a safe assumption that anyone with an older
>> libc version won't be using the latest kernel headers, so
>> I think we can still do this across architectures if both
>> glibc and musl already require a compiler that is new enough,
>> or alternatively if we know that the kernel headers require
>> a new compiler for other reasons and nobody has complained.
>>
>> For glibc, it looks the minimum compiler version was raised
>> from gcc-5 to gcc-8 four years ago, so we should be fine.
>>
>> In musl, the documentation states that at least gcc-3.4 or
>> clang-3.2 are required, which probably predate the
>> __SIZEOF_LONG__ macro. On the other hand, musl was only
>> released in 2011, and building musl itself explicitly
>> does not require kernel uapi headers, so this may not
>> be too critical.
>>
>> There is also uClibc, but I could not find any minimum
>> supported compiler version for that. Most commonly, this
>> one is used for cross-build environments, so it's also
>> less likely to have libc/gcc/headers being wildly out of
>> sync. Not sure.
>>
>>   Arnd
>>
>> [1] https://sourceware.org/pipermail/libc-alpha/2019-January/101010.html
>>
>
> Thanks Arnd for the detailed reply.
> Any more comments? What should I do in the next step?

I think the summary is "it's probably fine", but I don't know
for sure, and the risk may not be worth the benefit.

Maybe you can prepare a v2 that only does this for the newer
architectures I mentioned above, with an explanation and a
link to my above reply in the file comments?
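
For reference, the unified definition under discussion boils down to
something like this (sketch of the proposed asm-generic header, based
on the description in the original posting):

  /* include/uapi/asm-generic/bitsperlong.h (proposed, illustrative) */
  #ifndef __BITS_PER_LONG
  #define __BITS_PER_LONG (__CHAR_BIT__ * __SIZEOF_LONG__)
  #endif

Whether a given toolchain predefines both macros can be checked with
something like:

  echo | gcc -dM -E - | grep -E '__CHAR_BIT__|__SIZEOF_LONG__'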

  Arnd


[PATCH] KVM: ppc64: Enable ring-based dirty memory tracking

2023-06-08 Thread Kautuk Consul
- Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as ppc64 is weakly
  ordered.
- Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP because the
  kvmppc_xive_native_set_attr is called in the context of an ioctl
  syscall and will call kvmppc_xive_native_eq_sync for setting the
  KVM_DEV_XIVE_EQ_SYNC attribute, which will call mark_page_dirty()
  when there isn't a running vcpu. Implemented
  kvm_arch_allow_write_without_running_vcpu to always return true
  to allow mark_page_dirty_in_slot to mark the page dirty in the
  memslot->dirty_bitmap in this case.
- Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
  offset.
- Implement the kvm_arch_mmu_enable_log_dirty_pt_masked function required
  for the generic KVM code to call.
- Add a check to kvmppc_vcpu_run_hv for checking whether the dirty
  ring is soft full.
- Implement the kvm_arch_flush_remote_tlbs_memslot function to support
  the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT config option.

On testing with live migration, it was found that there is around a
150-180 ms improvement in overall migration time with this patch.
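
For context, a VMM consumes the ring roughly as follows (illustrative
userspace sketch, not part of the patch: vm_fd, vcpu_fd, ring_bytes and
page_size are assumed to be set up elsewhere, and error handling is
omitted):

	#include <linux/kvm.h>		/* KVM_ENABLE_CAP, KVM_CAP_DIRTY_LOG_RING_ACQ_REL */
	#include <sys/ioctl.h>
	#include <sys/mman.h>

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL,
		.args[0] = ring_bytes,	/* ring size in bytes, power of 2 */
	};

	/* Must be enabled on the VM before any vCPU is created */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* Each vCPU's ring of struct kvm_dirty_gfn is then mapped from
	 * its fd at the page offset this patch defines (64): */
	void *ring = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
			  MAP_SHARED, vcpu_fd,
			  KVM_DIRTY_LOG_PAGE_OFFSET * page_size);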

Signed-off-by: Kautuk Consul 
---
 Documentation/virt/kvm/api.rst  |  2 +-
 arch/powerpc/include/uapi/asm/kvm.h |  2 ++
 arch/powerpc/kvm/Kconfig|  2 ++
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 42 +
 arch/powerpc/kvm/book3s_hv.c|  3 +++
 include/linux/kvm_dirty_ring.h  |  5 
 6 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index add067793b90..ce1ebc513bae 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8114,7 +8114,7 @@ regardless of what has actually been exposed through the 
CPUID leaf.
 8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
 --
 
-:Architectures: x86, arm64
+:Architectures: x86, arm64, ppc64
 :Parameters: args[0] - size of the dirty log ring
 
 KVM is capable of tracking dirty memory using ring buffers that are
diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 9f18fa090f1f..f722309ed7fb 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -33,6 +33,8 @@
 /* Not always available, but if it is, this is the correct offset.  */
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
+
 struct kvm_regs {
__u64 pc;
__u64 cr;
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200..c93354ec3bd5 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -26,6 +26,8 @@ config KVM
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
select INTERVAL_TREE
+   select HAVE_KVM_DIRTY_RING_ACQ_REL
+   select NEED_KVM_DIRTY_RING_WITH_BITMAP
 
 config KVM_BOOK3S_HANDLER
bool
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..c92e8022e017 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -2147,3 +2147,45 @@ void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
 
vcpu->arch.hflags |= BOOK3S_HFLAG_SLB;
 }
+
+/*
+ * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
+ * dirty pages.
+ *
+ * It write protects selected pages to enable dirty logging for them.
+ */
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn_offset,
+unsigned long mask)
+{
+   phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+   phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
+   phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+   while (start < end) {
+   pte_t *ptep;
+   unsigned int shift;
+
+	ptep = find_kvm_secondary_pte(kvm, start, &shift);
+
+   *ptep = __pte(pte_val(*ptep) & ~(_PAGE_WRITE));
+
+   start += PAGE_SIZE;
+   }
+}
+
+#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+   return true;
+}
+#endif
+
+#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
+void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot)
+{
+   kvm_flush_remote_tlbs(kvm);
+}
+#endif
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..1d1264ea72c4 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4804,6 +4804,9 @@ static int kvmppc_vcpu_run_hv(struct kvm_vcpu *vcpu)
return -EINTR;
}
 
+   if (kvm_dirty_ring_check_request(vcpu))
+   return 0;
+
 #ifdef 

[PATCH v2] security/integrity: fix pointer to ESL data and its size on pseries

2023-06-08 Thread Nayna Jain
On a PowerVM guest, variable data is prefixed with an 8-byte timestamp.
Extract the ESL by stripping off the timestamp before passing it to the
ESL parser.
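
With offset == 8, the new extract_esl() macro introduced below amounts
to the following (worked example, using the db/dsize names from the
patch):

	db    = data + 8;	/* skip the 8-byte timestamp */
	dsize = dsize - 8;	/* length of the actual ESL payload */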

Fixes: 4b3e71e9a34c ("integrity/powerpc: Support loading keys from PLPKS")
Cc: sta...@vger.kernel.org # v6.3
Signed-off-by: Nayna Jain 
---
Changelog:
v2: Fixed feedback from Jarkko
  * added CC to stable
  * moved *data declaration to same line as *db,*dbx
Renamed extract_data() macro to extract_esl() for clarity
 .../integrity/platform_certs/load_powerpc.c   | 40 ---
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/security/integrity/platform_certs/load_powerpc.c 
b/security/integrity/platform_certs/load_powerpc.c
index b9de70b90826..170789dc63d2 100644
--- a/security/integrity/platform_certs/load_powerpc.c
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -15,6 +15,9 @@
 #include "keyring_handler.h"
 #include "../integrity.h"
 
+#define extract_esl(db, data, size, offset)\
+   do { db = data + offset; size = size - offset; } while (0)
+
 /*
  * Get a certificate list blob from the named secure variable.
  *
@@ -55,8 +58,9 @@ static __init void *get_cert_list(u8 *key, unsigned long 
keylen, u64 *size)
  */
 static int __init load_powerpc_certs(void)
 {
-   void *db = NULL, *dbx = NULL;
-   u64 dbsize = 0, dbxsize = 0;
+   void *db = NULL, *dbx = NULL, *data = NULL;
+   u64 dsize = 0;
+   u64 offset = 0;
int rc = 0;
ssize_t len;
char buf[32];
@@ -74,38 +78,46 @@ static int __init load_powerpc_certs(void)
return -ENODEV;
}
 
+   if (strcmp("ibm,plpks-sb-v1", buf) == 0)
+   /* PLPKS authenticated variables ESL data is prefixed with 8 
bytes of timestamp */
+   offset = 8;
+
/*
 * Get db, and dbx. They might not exist, so it isn't an error if we
 * can't get them.
 */
-	db = get_cert_list("db", 3, &dbsize);
-   if (!db) {
+	data = get_cert_list("db", 3, &dsize);
+   if (!data) {
pr_info("Couldn't get db list from firmware\n");
-   } else if (IS_ERR(db)) {
-   rc = PTR_ERR(db);
+   } else if (IS_ERR(data)) {
+   rc = PTR_ERR(data);
pr_err("Error reading db from firmware: %d\n", rc);
return rc;
} else {
-   rc = parse_efi_signature_list("powerpc:db", db, dbsize,
+   extract_esl(db, data, dsize, offset);
+
+   rc = parse_efi_signature_list("powerpc:db", db, dsize,
  get_handler_for_db);
if (rc)
pr_err("Couldn't parse db signatures: %d\n", rc);
-   kfree(db);
+   kfree(data);
}
 
-	dbx = get_cert_list("dbx", 4, &dbxsize);
-   if (!dbx) {
+	data = get_cert_list("dbx", 4, &dsize);
+   if (!data) {
pr_info("Couldn't get dbx list from firmware\n");
-   } else if (IS_ERR(dbx)) {
-   rc = PTR_ERR(dbx);
+   } else if (IS_ERR(data)) {
+   rc = PTR_ERR(data);
pr_err("Error reading dbx from firmware: %d\n", rc);
return rc;
} else {
-   rc = parse_efi_signature_list("powerpc:dbx", dbx, dbxsize,
+   extract_esl(dbx, data, dsize, offset);
+
+   rc = parse_efi_signature_list("powerpc:dbx", dbx, dsize,
  get_handler_for_dbx);
if (rc)
pr_err("Couldn't parse dbx signatures: %d\n", rc);
-   kfree(dbx);
+   kfree(data);
}
 
return rc;
-- 
2.31.1



Re: [PATCH 3/7] watchdog/hardlockup: Declare arch_touch_nmi_watchdog() only in linux/nmi.h

2023-06-08 Thread Petr Mladek
On Wed 2023-06-07 16:35:19, Doug Anderson wrote:
> Hi,
> 
> On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
> >
> > arch_touch_nmi_watchdog() needs a different implementation for various
> > hardlockup detector implementations. And it does nothing when
> > any hardlockup detector is not build at all.
> 
> s/build/built/

Will fix in v2.

> 
> This looks right and is a nice cleanup.
> 
> Reviewed-by: Douglas Anderson 

Thanks a lot.

Best Regards,
Petr


Re: [PATCH 2/7] watchdog/hardlockup: Make the config checks more straightforward

2023-06-08 Thread Petr Mladek
On Wed 2023-06-07 16:35:09, Doug Anderson wrote:
> Hi,
> 
> On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 422f0ffa269e..13c6e596cf9e 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -404,17 +404,27 @@ config HAVE_NMI_WATCHDOG
> > depends on HAVE_NMI
> > bool
> > help
> > - The arch provides a low level NMI watchdog. It provides
> > - asm/nmi.h, and defines its own watchdog_hardlockup_probe() and
> > - arch_touch_nmi_watchdog().
> > + The arch provides its own hardlockup detector implementation 
> > instead
> > + of the generic perf one.
> 
> nit: did you mean to have different wording here compared to
> HAVE_HARDLOCKUP_DETECTOR_ARCH? Here you say "the generic perf one" and
> there you say "the generic ones", though it seems like you mean them
> to be the same.

Good point, I'll fix it in v2.

> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 3e91fa33c7a0..d201f5d3876b 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -1035,16 +1035,33 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
> >
> >   Say N if unsure.
> >
> > +config HAVE_HARDLOCKUP_DETECTOR_BUDDY
> > +   bool
> > +   depends on SMP
> > +   default y
> 
> I think you simplify your life if you also add:
> 
>   depends on !HAVE_NMI_WATCHDOG
> 
> The existing config system has always assumed that no architecture
> defines both HAVE_HARDLOCKUP_DETECTOR_PERF and HAVE_NMI_WATCHDOG
> (symbols would have clashed and you would get a link error as two
> watchdogs try to implement the same weak symbol). If you add the extra
> dependency to "buddy" as per above, then a few things below fall out.
> I'll try to point them out below.
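
For illustration, the suggested extra dependency would look something
like this (hypothetical, untested Kconfig sketch of Doug's proposal):

config HAVE_HARDLOCKUP_DETECTOR_BUDDY
	bool
	depends on SMP
	depends on !HAVE_NMI_WATCHDOG
	default y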

My aim is to have two variables with and without HAVE_* prefix
for each variant. They already existed for perf:

   + HAVE_HARDLOCKUP_DETECTOR_PERF means that it is supported by the arch.
   + HARDLOCKUP_DETECTOR_PERF means that it will really be built.

1. It will make it clear what variants are available on the given system,
   and it will make it clear what variant is used in the end.

2. It allows adding the dependency on the global switch HARDLOCKUP_DETECTOR.
   It makes it clear that the detector could be disabled by this switch.
   Also, it actually allows using both top-down logic and C-like
   definition ordering.


> 
> > +
> >  #
> > -# arch/ can define HAVE_HARDLOCKUP_DETECTOR_ARCH to provide their own hard
> > -# lockup detector rather than the perf based detector.
> > +# Global switch whether to build a hardlockup detector at all. It is 
> > available
> > +# only when the architecture supports at least one implementation. There 
> > are
> > +# two exceptions. The hardlockup detector is newer enabled on:
> 
> s/newer/never/

Great catch. Will fix in v2.

> > +#
> > +#  s390: it reported many false positives there
> > +#
> > +#  sparc64: has a custom implementation which is not using the common
> > +#  hardlockup command line options and sysctl interface.
> > +#
> > +# Note that HAVE_NMI_WATCHDOG is used to distinguish the sparc64 specific
> > +# implementaion. It is automatically enabled also for other arch-specific
> > +# variants which set HAVE_HARDLOCKUP_DETECTOR_ARCH. It makes the check
> > +# of avaialable and supported variants quite tricky.
> >  #
> >  config HARDLOCKUP_DETECTOR
> > bool "Detect Hard Lockups"
> > depends on DEBUG_KERNEL && !S390
> > -   depends on HAVE_HARDLOCKUP_DETECTOR_NON_ARCH || 
> > HAVE_HARDLOCKUP_DETECTOR_ARCH
> > +   depends on ((HAVE_HARDLOCKUP_DETECTOR_PERF || 
> > HAVE_HARDLOCKUP_DETECTOR_BUDDY) && !HAVE_NMI_WATCHDOG) || 
> > HAVE_HARDLOCKUP_DETECTOR_ARCH
> 
> Adding the dependency to buddy (see ablove) would simplify the above
> to just this:
> 
> depends on HAVE_HARDLOCKUP_DETECTOR_PERF ||
> HAVE_HARDLOCKUP_DETECTOR_BUDDY || HAVE_HARDLOCKUP_DETECTOR_ARCH

This is exactly what I do not want. It would just move the check
somewhere else and make the logic harder to understand, especially
since the related definitions cannot be found by cscope.

> As per above, it's simply a responsibility of architectures not to
> define that they have both "perf" if they have the NMI watchdog, so
> it's just buddy to worry about.

Where is this documented, please?
Is it safe to assume this?

I would personally prefer to ensure this with a config check.
It is even better than documentation because nobody reads
documentation ;-)

It took me a long time to understand all the dependencies and
possibilities. My motivation is to make sure that neither I nor
others have to go through the same adventure in the future.

> 
> > +   imply HARDLOCKUP_DETECTOR_PERF
> > +   imply HARDLOCKUP_DETECTOR_BUDDY
> > select LOCKUP_DETECTOR
> > -   select HARDLOCKUP_DETECTOR_NON_ARCH if 
> > HAVE_HARDLOCKUP_DETECTOR_NON_ARCH
> >
> > help
> >   Say Y here to enable 

Re: [PATCH 1/7] watchdog/hardlockup: Sort hardlockup detector related config values a logical way

2023-06-08 Thread Petr Mladek
On Wed 2023-06-07 16:34:20, Doug Anderson wrote:
> Hi,
> 
> On Wed, Jun 7, 2023 at 8:25 AM Petr Mladek  wrote:
> > Only one hardlockup detector can be compiled in. The selection is done
> > using quite complex dependencies between several CONFIG variables.
> > The following patches will try to make it more straightforward.
> >
> > As a first step, reorder the definitions of the various CONFIG variables.
> > The logical order is:
> >
> >1. HAVE_* variables define available variants. They are typically
> >   defined in the arch/ config files.
> >
> >2. HARDLOCKUP_DETECTOR y/n variable defines whether the hardlockup
> >   detector is enabled at all.
> >
> >3. HARDLOCKUP_DETECTOR_PREFER_BUDDY y/n variable defines whether
> >   the buddy detector should be preferred over the perf one.
> >   Note that the arch specific variants are always preferred when
> >   available.
> >
> >4. HARDLOCKUP_DETECTOR_PERF/BUDDY variables define whether the given
> >   detector is enabled in the end.
> >
> >5. HAVE_HARDLOCKUP_DETECTOR_NON_ARCH and HARDLOCKUP_DETECTOR_NON_ARCH
> >   are temporary variables that are going to be removed in
> >   a followup patch.
> >
> 
> I don't really have any strong opinions, so I'm fine with this. In
> general I think the ordering I picked tried to match the existing
> "style" which generally tried to list configs and then select them
> below. To me the existing style makes more sense when thinking about
> writing C code

I know. My motivation was the following:

1. Kconfig is not C. My view is that it is more like a menu. There is a
   top level item. If the top level is enabled then there is a submenu
   with a more detailed selection of various variants and options.

2. The current logic is quite complicated from my POV. And it was
   even before your patchset. For example,
   HAVE_HARDLOCKUP_DETECTOR_BUDDY is defined as:

config HAVE_HARDLOCKUP_DETECTOR_BUDDY
bool
depends on SMP
default y

   One would expect that it would be enabled on any SMP system.
   But the final value depends on many other variables
   which are defined using relatively complex conditions,
   especially HARDLOCKUP_DETECTOR, HAVE_HARDLOCKUP_DETECTOR_NON_ARCH,
   and HARDLOCKUP_DETECTOR_NON_ARCH.

   Understanding the logic is even more complicated because Kconfig is
   not indexed by cscope.

Important: The logic used at the end of the patchset actually
   follows the C style. It defines how the various variables
   depend on each other from top to bottom.

> 
> config SOFTLOCKUP_DETECTOR:
>   ... blah blah blah ...

This one is actually defined in the menu-like order:

config SOFTLOCKUP_DETECTOR

config BOOTPARAM_SOFTLOCKUP_PANIC
depends on SOFTLOCKUP_DETECTOR

It is because the custom option depends on the top level one.
This is exactly what I would like to achieve with HARDLOCKUP
variables in this patchset.

Best Regards,
Petr


Re: [PATCH v3 30/34] sh: Convert pte_free_tlb() to use ptdescs

2023-06-08 Thread John Paul Adrian Glaubitz
On Wed, 2023-05-31 at 14:30 -0700, Vishal Moola (Oracle) wrote:
> Part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents. Also cleans up some spacing issues.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  arch/sh/include/asm/pgalloc.h | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
> index a9e98233c4d4..5d8577ab1591 100644
> --- a/arch/sh/include/asm/pgalloc.h
> +++ b/arch/sh/include/asm/pgalloc.h
> @@ -2,6 +2,7 @@
>  #ifndef __ASM_SH_PGALLOC_H
>  #define __ASM_SH_PGALLOC_H
>  
> +#include 
>  #include 
>  
>  #define __HAVE_ARCH_PMD_ALLOC_ONE
> @@ -31,10 +32,10 @@ static inline void pmd_populate(struct mm_struct *mm, 
> pmd_t *pmd,
>   set_pmd(pmd, __pmd((unsigned long)page_address(pte)));
>  }
>  
> -#define __pte_free_tlb(tlb,pte,addr) \
> -do { \
> - pgtable_pte_page_dtor(pte); \
> - tlb_remove_page((tlb), (pte));  \
> +#define __pte_free_tlb(tlb, pte, addr)   \
> +do { \
> + pagetable_pte_dtor(page_ptdesc(pte));   \
> + tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
>  } while (0)
>  
>  #endif /* __ASM_SH_PGALLOC_H */

Acked-by: John Paul Adrian Glaubitz 

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


[PATCH][next] powerpc/powernv/sriov: perform null check on iov before dereferencing iov

2023-06-08 Thread Colin Ian King
Currently, pointer iov is dereferenced before the null check of iov,
which can lead to null pointer dereference errors. Fix this by moving
the iov null check before the dereferences.

Detected using cppcheck static analysis:
linux/arch/powerpc/platforms/powernv/pci-sriov.c:597:12: warning: Either
the condition '!iov' is redundant or there is possible null pointer
dereference: iov. [nullPointerRedundantCheck]
 num_vfs = iov->num_vfs;
   ^
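
The warning can be reproduced with a plain cppcheck invocation along
these lines (illustrative; exact output varies by cppcheck version):

  cppcheck --enable=warning arch/powerpc/platforms/powernv/pci-sriov.c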

Fixes: 052da31d45fc ("powerpc/powernv/sriov: De-indent setup and teardown")
Signed-off-by: Colin Ian King 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 7195133b26bb..42e1f045199f 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -594,11 +594,10 @@ static void pnv_pci_sriov_disable(struct pci_dev *pdev)
struct pnv_iov_data   *iov;
 
iov = pnv_iov_get(pdev);
-   num_vfs = iov->num_vfs;
-   base_pe = iov->vf_pe_arr[0].pe_number;
-
if (WARN_ON(!iov))
return;
+   num_vfs = iov->num_vfs;
+   base_pe = iov->vf_pe_arr[0].pe_number;
 
/* Release VF PEs */
pnv_ioda_release_vf_PE(pdev);
-- 
2.30.2



[PATCH] powerpc/fadump: reset dump area size variable if memblock reserve fails

2023-06-08 Thread Sourabh Jain
If the memory reservation process (memblock_reserve) fails to reserve
the memory, the reserve dump area size variable still retains the dump
area size. Consequently, the size calculated for reservation is
reported via /sys/kernel/fadump/mem_reserved even though nothing was
actually reserved.

To resolve this issue, set the reserve dump area size variable to 0
when memblock_reserve fails to reserve the memory.

Fixes: 8255da95e545 ("powerpc/fadump: release all the memory above boot memory 
size")
Signed-off-by: Sourabh Jain 
Acked-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/fadump.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index ea0a073abd96..a8f2c3b2fa1e 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -641,6 +641,7 @@ int __init fadump_reserve_mem(void)
goto error_out;
 
if (memblock_reserve(base, size)) {
+   fw_dump.reserve_dump_area_size = 0;
pr_err("Failed to reserve memory!\n");
goto error_out;
}
-- 
2.40.1



[kvm-unit-tests v4 12/12] powerpc/sprs: Test hypervisor registers on powernv machine

2023-06-08 Thread Nicholas Piggin
This enables HV privilege registers to be tested with the powernv
machine.

Acked-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 powerpc/sprs.c | 33 +
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/powerpc/sprs.c b/powerpc/sprs.c
index d5664201..07a4e759 100644
--- a/powerpc/sprs.c
+++ b/powerpc/sprs.c
@@ -199,16 +199,16 @@ static const struct spr sprs_power_common[1024] = {
 [190] = {"HFSCR",  64, HV_RW, },
 [256] = {"VRSAVE", 32, RW, },
 [259] = {"SPRG3",  64, RO, },
-[284] = {"TBL",32, HV_WO, },
-[285] = {"TBU",32, HV_WO, },
-[286] = {"TBU40",  64, HV_WO, },
+[284] = {"TBL",32, HV_WO, }, /* Things can go a bit wonky 
with */
+[285] = {"TBU",32, HV_WO, }, /* Timebase changing. Should 
save */
+[286] = {"TBU40",  64, HV_WO, }, /* and restore it. */
 [304] = {"HSPRG0", 64, HV_RW, },
 [305] = {"HSPRG1", 64, HV_RW, },
 [306] = {"HDSISR", 32, HV_RW,  SPR_INT, },
 [307] = {"HDAR",   64, HV_RW,  SPR_INT, },
 [308] = {"SPURR",  64, HV_RW | OS_RO,  SPR_ASYNC, },
 [309] = {"PURR",   64, HV_RW | OS_RO,  SPR_ASYNC, },
-[313] = {"HRMOR",  64, HV_RW, },
+[313] = {"HRMOR",  64, HV_RW,  SPR_HARNESS, }, /* Harness 
can't cope with HRMOR changing */
 [314] = {"HSRR0",  64, HV_RW,  SPR_INT, },
 [315] = {"HSRR1",  64, HV_RW,  SPR_INT, },
 [318] = {"LPCR",   64, HV_RW, },
@@ -306,7 +306,7 @@ static const struct spr sprs_power9_10[1024] = {
 [921] = {"TSCR",   32, HV_RW, },
 [922] = {"TTR",64, HV_RW, },
 [1006]= {"TRACE",  64, WO, },
-[1008]= {"HID",64, HV_RW, },
+[1008]= {"HID",64, HV_RW,  SPR_HARNESS, }, /* At 
least HILE would be unhelpful to change */
 };
 
 /* This covers POWER8 and POWER9 PMUs */
@@ -350,6 +350,22 @@ static const struct spr sprs_power10_pmu[1024] = {
 
 static struct spr sprs[1024];
 
+static bool spr_read_perms(int spr)
+{
+   if (machine_is_powernv())
+   return !!(sprs[spr].access & SPR_HV_READ);
+   else
+   return !!(sprs[spr].access & SPR_OS_READ);
+}
+
+static bool spr_write_perms(int spr)
+{
+   if (machine_is_powernv())
+   return !!(sprs[spr].access & SPR_HV_WRITE);
+   else
+   return !!(sprs[spr].access & SPR_OS_WRITE);
+}
+
 static void setup_sprs(void)
 {
uint32_t pvr = mfspr(287);  /* Processor Version Register */
@@ -466,7 +482,7 @@ static void get_sprs(uint64_t *v)
int i;
 
for (i = 0; i < 1024; i++) {
-   if (!(sprs[i].access & SPR_OS_READ))
+   if (!spr_read_perms(i))
continue;
v[i] = __mfspr(i);
}
@@ -477,8 +493,9 @@ static void set_sprs(uint64_t val)
int i;
 
for (i = 0; i < 1024; i++) {
-   if (!(sprs[i].access & SPR_OS_WRITE))
+   if (!spr_write_perms(i))
continue;
+
if (sprs[i].type & SPR_HARNESS)
continue;
if (!strcmp(sprs[i].name, "MMCR0")) {
@@ -550,7 +567,7 @@ int main(int argc, char **argv)
for (i = 0; i < 1024; i++) {
bool pass = true;
 
-   if (!(sprs[i].access & SPR_OS_READ))
+   if (!spr_read_perms(i))
continue;
 
if (sprs[i].width == 32) {
-- 
2.40.1



[kvm-unit-tests v4 11/12] powerpc: Support powernv machine with QEMU TCG

2023-06-08 Thread Nicholas Piggin
This is a basic first pass at powernv support using OPAL (skiboot)
firmware.

The ACCEL handling is a bit clunky, defaulting to kvm for the powernv
machine, which isn't right and has to be manually overridden. It also
does not yet run in the run_tests.sh batch process; more work is needed
to exclude certain tests (e.g., rtas) and adjust parameters (e.g.,
increase memory size) to allow powernv to work. For now it can run
single test cases.
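
As a rough example, a single test can be run by pointing the run script
at it and passing the machine and accelerator through to QEMU
(illustrative command line; the exact options are an assumption and
depend on the QEMU build and the final form of powerpc/run):

  ./powerpc/run powerpc/sprs.elf -machine powernv -accel tcg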

Reviewed-by: Cédric Le Goater 
Signed-off-by: Nicholas Piggin 
---
Since v3:
- Typo fix [Thomas]

 lib/powerpc/asm/ppc_asm.h   |  5 +++
 lib/powerpc/asm/processor.h | 10 +
 lib/powerpc/hcall.c |  4 +-
 lib/powerpc/io.c| 27 +-
 lib/powerpc/io.h|  6 +++
 lib/powerpc/processor.c | 10 +
 lib/powerpc/setup.c |  8 ++--
 lib/ppc64/asm/opal.h| 15 
 lib/ppc64/opal-calls.S  | 46 +++
 lib/ppc64/opal.c| 74 +
 powerpc/Makefile.ppc64  |  2 +
 powerpc/cstart64.S  |  7 
 powerpc/run | 35 --
 13 files changed, 238 insertions(+), 11 deletions(-)
 create mode 100644 lib/ppc64/asm/opal.h
 create mode 100644 lib/ppc64/opal-calls.S
 create mode 100644 lib/ppc64/opal.c

diff --git a/lib/powerpc/asm/ppc_asm.h b/lib/powerpc/asm/ppc_asm.h
index 46b4be00..63f11d13 100644
--- a/lib/powerpc/asm/ppc_asm.h
+++ b/lib/powerpc/asm/ppc_asm.h
@@ -39,7 +39,12 @@
 #define SPR_HSRR1  0x13B
 
 /* Machine State Register definitions: */
+#define MSR_LE_BIT 0
 #define MSR_EE_BIT 15  /* External Interrupts Enable */
+#define MSR_HV_BIT 60  /* Hypervisor mode */
 #define MSR_SF_BIT 63  /* 64-bit mode */
 
+#define SPR_HSRR0  0x13A
+#define SPR_HSRR1  0x13B
+
 #endif /* _ASMPOWERPC_PPC_ASM_H */
diff --git a/lib/powerpc/asm/processor.h b/lib/powerpc/asm/processor.h
index 4ad6612b..9b318c3e 100644
--- a/lib/powerpc/asm/processor.h
+++ b/lib/powerpc/asm/processor.h
@@ -3,6 +3,7 @@
 
 #include 
 #include 
+#include 
 
 #ifndef __ASSEMBLY__
 void handle_exception(int trap, void (*func)(struct pt_regs *, void *), void 
*);
@@ -43,6 +44,15 @@ static inline void mtmsr(uint64_t msr)
asm volatile ("mtmsrd %[msr]" :: [msr] "r" (msr) : "memory");
 }
 
+/*
+ * This returns true on PowerNV / OPAL machines which run in hypervisor
+ * mode. False on pseries / PAPR machines that run in guest mode.
+ */
+static inline bool machine_is_powernv(void)
+{
+   return !!(mfmsr() & (1ULL << MSR_HV_BIT));
+}
+
 static inline uint64_t get_tb(void)
 {
return mfspr(SPR_TB);
diff --git a/lib/powerpc/hcall.c b/lib/powerpc/hcall.c
index 711cb1b0..37e52f54 100644
--- a/lib/powerpc/hcall.c
+++ b/lib/powerpc/hcall.c
@@ -25,7 +25,7 @@ int hcall_have_broken_sc1(void)
return r3 == (unsigned long)H_PRIVILEGE;
 }
 
-void putchar(int c)
+void papr_putchar(int c)
 {
unsigned long vty = 0;  /* 0 == default */
unsigned long nr_chars = 1;
@@ -34,7 +34,7 @@ void putchar(int c)
hcall(H_PUT_TERM_CHAR, vty, nr_chars, chars);
 }
 
-int __getchar(void)
+int __papr_getchar(void)
 {
register unsigned long r3 asm("r3") = H_GET_TERM_CHAR;
register unsigned long r4 asm("r4") = 0; /* 0 == default vty */
diff --git a/lib/powerpc/io.c b/lib/powerpc/io.c
index a381688b..ab7bb843 100644
--- a/lib/powerpc/io.c
+++ b/lib/powerpc/io.c
@@ -9,13 +9,33 @@
 #include 
 #include 
 #include 
+#include 
 #include "io.h"
 
 static struct spinlock print_lock;
 
+void putchar(int c)
+{
+   if (machine_is_powernv())
+   opal_putchar(c);
+   else
+   papr_putchar(c);
+}
+
+int __getchar(void)
+{
+   if (machine_is_powernv())
+   return __opal_getchar();
+   else
+   return __papr_getchar();
+}
+
 void io_init(void)
 {
-   rtas_init();
+   if (machine_is_powernv())
+   assert(!opal_init());
+   else
+   rtas_init();
 }
 
 void puts(const char *s)
@@ -38,7 +58,10 @@ void exit(int code)
 // FIXME: change this print-exit/rtas-poweroff to chr_testdev_exit(),
 //maybe by plugging chr-testdev into a spapr-vty.
printf("\nEXIT: STATUS=%d\n", ((code) << 1) | 1);
-   rtas_power_off();
+   if (machine_is_powernv())
+   opal_power_off();
+   else
+   rtas_power_off();
halt(code);
__builtin_unreachable();
 }
diff --git a/lib/powerpc/io.h b/lib/powerpc/io.h
index d4f21ba1..943bf142 100644
--- a/lib/powerpc/io.h
+++ b/lib/powerpc/io.h
@@ -8,6 +8,12 @@
 #define _POWERPC_IO_H_
 
 extern void io_init(void);
+extern int opal_init(void);
+extern void opal_power_off(void);
 extern void putchar(int c);
+extern void opal_putchar(int c);
+extern void papr_putchar(int c);
+extern int __opal_getchar(void);
+extern int __papr_getchar(void);
 
 #endif
diff --git a/lib/powerpc/processor.c 

[kvm-unit-tests v4 10/12] powerpc: Discover runtime load address dynamically

2023-06-08 Thread Nicholas Piggin
The next change will load the kernels at different addresses depending
on test options, so this needs to be reverted to dynamic discovery.

Acked-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 powerpc/cstart64.S | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
index b7514100..e18ae9a2 100644
--- a/powerpc/cstart64.S
+++ b/powerpc/cstart64.S
@@ -33,9 +33,14 @@ start:
 * We were loaded at QEMU's kernel load address, but we're not
 * allowed to link there due to how QEMU deals with linker VMAs,
 * so we just linked at zero. This means the first thing to do is
-* to find our stack and toc, and then do a relocate.
+* to find our stack and toc, and then do a relocate. powernv and
+* pseries load addresses are not the same, so find the address
+* dynamically:
 */
-   LOAD_REG_IMMEDIATE(r31, SPAPR_KERNEL_LOAD_ADDR)
+   bl  0f
+0:	mflr	r31
+	subi	r31, r31, 0b - start	/* QEMU's kernel load address */
+
ld  r1, (p_stack - start)(r31)
ld  r2, (p_toc - start)(r31)
add r1, r1, r31
@@ -114,8 +119,11 @@ p_toc: .llong  tocptr
 p_dyn: .llong  dynamic_start
 
 .text
+start_text:
 .align 3
+p_toc_text:.llong  tocptr
 
+.align 3
 .globl hcall
 hcall:
sc  1
@@ -185,9 +193,10 @@ call_handler:
std r0,_CCR(r1)
 
/* restore TOC pointer */
-
-   LOAD_REG_IMMEDIATE(r31, SPAPR_KERNEL_LOAD_ADDR)
-   ld  r2, (p_toc - start)(r31)
+   bl  0f
+0:	mflr	r31
+	subi	r31, r31, 0b - start_text
+   ld  r2, (p_toc_text - start_text)(r31)
 
/* FIXME: build stack frame */
 
-- 
2.40.1



[kvm-unit-tests v4 09/12] powerpc: Add support for more interrupts including HV interrupts

2023-06-08 Thread Nicholas Piggin
Interrupt vectors were not being populated for all architected
interrupt types, which could lead to crashes rather than a message for
unhandled interrupts.

0x20-sized vectors require some reworking of the code to fit. This
also adds support for HV / HSRR type interrupts, which will be used in
a later change.

Acked-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
Since v3:
- Build fix [Joel]

 lib/powerpc/asm/ppc_asm.h |  3 ++
 powerpc/cstart64.S| 79 ---
 2 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/lib/powerpc/asm/ppc_asm.h b/lib/powerpc/asm/ppc_asm.h
index 6299ff53..46b4be00 100644
--- a/lib/powerpc/asm/ppc_asm.h
+++ b/lib/powerpc/asm/ppc_asm.h
@@ -35,6 +35,9 @@
 
 #endif /* __BYTE_ORDER__ */
 
+#define SPR_HSRR0  0x13A
+#define SPR_HSRR1  0x13B
+
 /* Machine State Register definitions: */
 #define MSR_EE_BIT 15  /* External Interrupts Enable */
 #define MSR_SF_BIT 63  /* 64-bit mode */
diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
index 34e39341..b7514100 100644
--- a/powerpc/cstart64.S
+++ b/powerpc/cstart64.S
@@ -184,14 +184,6 @@ call_handler:
	mfcr	r0
std r0,_CCR(r1)
 
-   /* nip and msr */
-
-   mfsrr0  r0
-   std r0, _NIP(r1)
-
-   mfsrr1  r0
-   std r0, _MSR(r1)
-
/* restore TOC pointer */
 
LOAD_REG_IMMEDIATE(r31, SPAPR_KERNEL_LOAD_ADDR)
@@ -238,6 +230,7 @@ call_handler:
 
 .section .text.ex
 
+/* [H]VECTOR must not be more than 8 instructions to fit in 0x20 vectors */
 .macro VECTOR vec
. = \vec
 
@@ -246,19 +239,28 @@ call_handler:
	subi	r1,r1, INT_FRAME_SIZE
 
/* save r0 and ctr to call generic handler */
-
SAVE_GPR(0,r1)
 
-   mfctr   r0
-   std r0,_CTR(r1)
+   li  r0,\vec
+   std r0,_TRAP(r1)
 
-   ld  r0, P_HANDLER(0)
-   mtctr   r0
+   b   handler_trampoline
+.endm
+
+.macro HVECTOR vec
+   . = \vec
+
+   mtsprg1 r1  /* save r1 */
+   mfsprg0 r1  /* get exception stack address */
+	subi	r1,r1, INT_FRAME_SIZE
+
+   /* save r0 and ctr to call generic handler */
+   SAVE_GPR(0,r1)
 
li  r0,\vec
std r0,_TRAP(r1)
 
-   bctr
+   b   handler_htrampoline
 .endm
 
. = 0x100
@@ -268,12 +270,61 @@ __start_interrupts:
 VECTOR(0x100)
 VECTOR(0x200)
 VECTOR(0x300)
+VECTOR(0x380)
 VECTOR(0x400)
+VECTOR(0x480)
 VECTOR(0x500)
 VECTOR(0x600)
 VECTOR(0x700)
 VECTOR(0x800)
 VECTOR(0x900)
+HVECTOR(0x980)
+VECTOR(0xa00)
+VECTOR(0xc00)
+VECTOR(0xd00)
+HVECTOR(0xe00)
+HVECTOR(0xe20)
+HVECTOR(0xe40)
+HVECTOR(0xe60)
+HVECTOR(0xe80)
+HVECTOR(0xea0)
+VECTOR(0xf00)
+VECTOR(0xf20)
+VECTOR(0xf40)
+VECTOR(0xf60)
+HVECTOR(0xf80)
+
+handler_trampoline:
+   mfctr   r0
+   std r0,_CTR(r1)
+
+   ld  r0, P_HANDLER(0)
+   mtctr   r0
+
+   /* nip and msr */
+   mfsrr0  r0
+   std r0, _NIP(r1)
+
+   mfsrr1  r0
+   std r0, _MSR(r1)
+
+   bctr
+
+handler_htrampoline:
+   mfctr   r0
+   std r0,_CTR(r1)
+
+   ld  r0, P_HANDLER(0)
+   mtctr   r0
+
+   /* nip and msr */
+   mfspr   r0, SPR_HSRR0
+   std r0, _NIP(r1)
+
+   mfspr   r0, SPR_HSRR1
+   std r0, _MSR(r1)
+
+   bctr
 
.align 7
.globl __end_interrupts
-- 
2.40.1



[kvm-unit-tests v4 08/12] powerpc: Expand exception handler vector granularity

2023-06-08 Thread Nicholas Piggin
Exception handlers are currently indexed in units of 0x100, but
powerpc can have vectors that are aligned to as little as 0x20
bytes. Increase granularity of the handler functions before
adding support for those vectors.
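
For reference, the slot arithmetic works out as follows (worked
example matching the code below):

	/* vectors occupy 0x0..0x1000 at up to 0x20-byte granularity: */
	nr_slots = 0x1000 / 0x20;	/* = 128 handler slots */
	slot = trap >> 5;		/* since 0x20 == 1 << 5 */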

Signed-off-by: Nicholas Piggin 
---
Since v3:
- Fix typo [Thomas]

 lib/powerpc/processor.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/powerpc/processor.c b/lib/powerpc/processor.c
index aaf45b68..64d7ae01 100644
--- a/lib/powerpc/processor.c
+++ b/lib/powerpc/processor.c
@@ -16,19 +16,25 @@
 static struct {
void (*func)(struct pt_regs *, void *data);
void *data;
-} handlers[16];
+} handlers[128];
 
+/*
+ * Exception handlers span from 0x100 to 0x1000 and can have a granularity
+ * of 0x20 bytes in some cases. Indexing spans 0-0x1000 with 0x20 increments
+ * resulting in 128 slots.
+ */
 void handle_exception(int trap, void (*func)(struct pt_regs *, void *),
  void * data)
 {
-   assert(!(trap & ~0xf00));
+   assert(!(trap & ~0xfe0));
 
-   trap >>= 8;
+   trap >>= 5;
 
if (func && handlers[trap].func) {
printf("exception handler installed twice %#x\n", trap);
abort();
}
+
handlers[trap].func = func;
handlers[trap].data = data;
 }
@@ -37,9 +43,9 @@ void do_handle_exception(struct pt_regs *regs)
 {
unsigned char v;
 
-   v = regs->trap >> 8;
+   v = regs->trap >> 5;
 
-   if (v < 16 && handlers[v].func) {
+   if (v < 128 && handlers[v].func) {
handlers[v].func(regs, handlers[v].data);
return;
}
-- 
2.40.1



[kvm-unit-tests v4 07/12] powerpc/spapr_vpa: Add basic VPA tests

2023-06-08 Thread Nicholas Piggin
The VPA is an optional memory structure shared between the hypervisor
and operating system, defined by PAPR. This test defines the structure
and adds registration, deregistration, and a few simple sanity tests.

[Thanks to Thomas Huth for suggesting many of the test cases.]
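
For a rough idea of what this looks like at the hcall level,
registration and deregistration boil down to something like the
following (sketch from memory of the Linux pseries implementation; the
subfunction encoding in the flags argument is an assumption and should
be checked against PAPR):

	/* flags bit field: subfunction 1 = register VPA, 5 = deregister */
	rc = hcall(H_REGISTER_VPA, 1UL << 45, proc_no, vpa_addr);
	/* ... and later, to deregister: */
	rc = hcall(H_REGISTER_VPA, 5UL << 45, proc_no, 0);

PAPR also places alignment and size requirements on the buffer, which
is what some of the sanity tests exercise.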

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 lib/powerpc/asm/hcall.h |   1 +
 lib/ppc64/asm/vpa.h |  62 +++
 powerpc/Makefile.ppc64  |   2 +-
 powerpc/spapr_vpa.c | 172 
 powerpc/unittests.cfg   |   3 +
 5 files changed, 239 insertions(+), 1 deletion(-)
 create mode 100644 lib/ppc64/asm/vpa.h
 create mode 100644 powerpc/spapr_vpa.c

diff --git a/lib/powerpc/asm/hcall.h b/lib/powerpc/asm/hcall.h
index 1173feaa..e0f5009e 100644
--- a/lib/powerpc/asm/hcall.h
+++ b/lib/powerpc/asm/hcall.h
@@ -18,6 +18,7 @@
 #define H_SET_SPRG00x24
 #define H_SET_DABR 0x28
 #define H_PAGE_INIT0x2c
+#define H_REGISTER_VPA 0xDC
 #define H_CEDE 0xE0
 #define H_GET_TERM_CHAR0x54
 #define H_PUT_TERM_CHAR0x58
diff --git a/lib/ppc64/asm/vpa.h b/lib/ppc64/asm/vpa.h
new file mode 100644
index ..11dde018
--- /dev/null
+++ b/lib/ppc64/asm/vpa.h
@@ -0,0 +1,62 @@
+#ifndef _ASMPOWERPC_VPA_H_
+#define _ASMPOWERPC_VPA_H_
+/*
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+
+#ifndef __ASSEMBLY__
+
+struct vpa {
+   uint32_tdescriptor;
+   uint16_tsize;
+   uint8_t reserved1[3];
+   uint8_t status;
+   uint8_t reserved2[14];
+   uint32_tfru_node_id;
+   uint32_tfru_proc_id;
+   uint8_t reserved3[56];
+   uint8_t vhpn_change_counters[8];
+   uint8_t reserved4[80];
+   uint8_t cede_latency;
+   uint8_t maintain_ebb;
+   uint8_t reserved5[6];
+   uint8_t dtl_enable_mask;
+   uint8_t dedicated_cpu_donate;
+   uint8_t maintain_fpr;
+   uint8_t maintain_pmc;
+   uint8_t reserved6[28];
+   uint64_tidle_estimate_purr;
+   uint8_t reserved7[28];
+   uint16_tmaintain_nr_slb;
+   uint8_t idle;
+   uint8_t maintain_vmx;
+   uint32_tvp_dispatch_count;
+   uint32_tvp_dispatch_dispersion;
+   uint64_tvp_fault_count;
+   uint64_tvp_fault_tb;
+   uint64_tpurr_exprop_idle;
+   uint64_tspurr_exprop_idle;
+   uint64_tpurr_exprop_busy;
+   uint64_tspurr_exprop_busy;
+   uint64_tpurr_donate_idle;
+   uint64_tspurr_donate_idle;
+   uint64_tpurr_donate_busy;
+   uint64_tspurr_donate_busy;
+   uint64_tvp_wait3_tb;
+   uint64_tvp_wait2_tb;
+   uint64_tvp_wait1_tb;
+   uint64_tpurr_exprop_adjunct_busy;
+   uint64_tspurr_exprop_adjunct_busy;
+   uint32_tsupervisor_pagein_count;
+   uint8_t reserved8[4];
+   uint64_tpurr_exprop_adjunct_idle;
+   uint64_tspurr_exprop_adjunct_idle;
+   uint64_tadjunct_insns_executed;
+   uint8_t reserved9[120];
+   uint64_tdtl_index;
+   uint8_t reserved10[96];
+};
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASMPOWERPC_VPA_H_ */
diff --git a/powerpc/Makefile.ppc64 b/powerpc/Makefile.ppc64
index ea684470..b0ed2b10 100644
--- a/powerpc/Makefile.ppc64
+++ b/powerpc/Makefile.ppc64
@@ -19,7 +19,7 @@ reloc.o  = $(TEST_DIR)/reloc64.o
 OBJDIRS += lib/ppc64
 
 # ppc64 specific tests
-tests =
+tests = $(TEST_DIR)/spapr_vpa.elf
 
 include $(SRCDIR)/$(TEST_DIR)/Makefile.common
 
diff --git a/powerpc/spapr_vpa.c b/powerpc/spapr_vpa.c
new file mode 100644
index ..5586eb8d
--- /dev/null
+++ b/powerpc/spapr_vpa.c
@@ -0,0 +1,172 @@
+/*
+ * Test sPAPR "Per Virtual Processor Area" and H_REGISTER_VPA hypervisor call
+ * (also known as VPA, also known as lppaca in the Linux pseries kernel).
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include  /* for endian accessors */
+
+static int verbose;
+
+static void print_vpa(struct vpa *vpa)
+{
+   printf("VPA\n");
+   printf("descriptor: 0x%08x\n", 
be32_to_cpu(vpa->descriptor));
+   printf("size:   0x%04x\n", 
be16_to_cpu(vpa->size));
+   printf("status:   0x%02x\n", vpa->status);
+   printf("fru_node_id:0x%08x\n", 
be32_to_cpu(vpa->fru_node_id));
+   printf("fru_proc_id:0x%08x\n", 
be32_to_cpu(vpa->fru_proc_id));
+   printf("vhpn_change_counters:   0x%02x %02x %02x %02x %02x %02x 
%02x 

[kvm-unit-tests v4 06/12] powerpc/sprs: Specify SPRs with data rather than code

2023-06-08 Thread Nicholas Piggin
This is a significant rework that builds an array of 'struct spr',
where each element describes an SPR. This makes various metadata about
the SPR, like name and access type, easier to carry and use.

Hypervisor-privileged registers are described for completeness despite
not being used at the moment; the code might also one day be reused
for a hypervisor-privileged test.

Acked-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
This ended up a little over-engineered perhaps, but there are lots of
SPRs, lots of access types, lots of changes between processor and ISA
versions, and lots of places they are implemented and used, so lots of
room for mistakes. There is not a good system in place to easily
verify that userspace, supervisor, etc. switches perform all the right
SPR context switching, so this is a nice test case to have. The sprs test
quickly caught a few QEMU TCG SPR bugs, which really motivated me to
improve the SPR coverage.
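
The non-constant __mfspr()/__mtspr() accessors added here work by
computing a branch into a generated table of 1024 two-instruction
entries. Conceptually, the .rept block expands to (illustrative
pseudo-assembly):

	/* entry n lives at table_base + n*8: */
	mfspr	rRET, n
	b	out

so __mfspr(n) simply branches to table_base + n*8 via CTR and falls
back out with the result.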
---
 powerpc/sprs.c | 643 ++---
 1 file changed, 450 insertions(+), 193 deletions(-)

diff --git a/powerpc/sprs.c b/powerpc/sprs.c
index 57e487ce..d5664201 100644
--- a/powerpc/sprs.c
+++ b/powerpc/sprs.c
@@ -28,231 +28,465 @@
 #include 
 #include 
 
-uint64_t before[1024], after[1024];
-
-/* Common SPRs for all PowerPC CPUs */
-static void set_sprs_common(uint64_t val)
+/* "Indirect" mfspr/mtspr which accept a non-constant spr number */
+static uint64_t __mfspr(unsigned spr)
 {
-   mtspr(9, val);  /* CTR */
-   // mtspr(273, val); /* SPRG1 */  /* Used by our exception handler */
-   mtspr(274, val);/* SPRG2 */
-   mtspr(275, val);/* SPRG3 */
+   uint64_t tmp;
+   uint64_t ret;
+
+   asm volatile(
+"  bcl 20, 31, 1f  \n"
+"1:mflr%0  \n"
+"  addi%0, %0, (2f-1b) \n"
+"  add %0, %0, %2  \n"
+"  mtctr   %0  \n"
+"  bctr\n"
+"2:\n"
+".LSPR=0   \n"
+".rept 1024\n"
+"  mfspr   %1, .LSPR   \n"
+"  b   3f  \n"
+"  .LSPR=.LSPR+1   \n"
+".endr \n"
+"3:\n"
+	: "=&r"(tmp),
+ "=r"(ret)
+   : "r"(spr*8) /* 8 bytes per 'mfspr ; b' block */
+   : "lr", "ctr");
+
+   return ret;
 }
 
-/* SPRs from PowerPC Operating Environment Architecture, Book III, Vers. 2.01 
*/
-static void set_sprs_book3s_201(uint64_t val)
+static void __mtspr(unsigned spr, uint64_t val)
 {
-   mtspr(18, val); /* DSISR */
-   mtspr(19, val); /* DAR */
-   mtspr(152, val);/* CTRL */
-   mtspr(256, val);/* VRSAVE */
-   mtspr(786, val);/* MMCRA */
-   mtspr(795, val);/* MMCR0 */
-   mtspr(798, val);/* MMCR1 */
+   uint64_t tmp;
+
+   asm volatile(
+"  bcl 20, 31, 1f  \n"
+"1:mflr%0  \n"
+"  addi%0, %0, (2f-1b) \n"
+"  add %0, %0, %2  \n"
+"  mtctr   %0  \n"
+"  bctr\n"
+"2:\n"
+".LSPR=0   \n"
+".rept 1024\n"
+"  mtspr   .LSPR, %1   \n"
+"  b   3f  \n"
+"  .LSPR=.LSPR+1   \n"
+".endr \n"
+"3:\n"
+	: "=&r"(tmp)
+   : "r"(val),
+ "r"(spr*8) /* 8 bytes per 'mfspr ; b' block */
+   : "lr", "ctr", "xer");
 }
 
+static uint64_t before[1024], after[1024];
+
+#define SPR_PR_READ0x0001
+#define SPR_PR_WRITE   0x0002
+#define SPR_OS_READ0x0010
+#define SPR_OS_WRITE   0x0020
+#define SPR_HV_READ0x0100
+#define SPR_HV_WRITE   0x0200
+
+#define RW 0x333
+#define RO 0x111
+#define WO 0x222
+#define OS_RW  0x330
+#define OS_RO  0x110
+#define OS_WO  0x220
+#define HV_RW  0x300
+#define HV_RO  0x100
+#define HV_WO  0x200
+
+#define SPR_ASYNC  0x1000  /* May be updated asynchronously */
+#define SPR_INT0x2000  /* May be updated by synchronous 
interrupt */
+#define SPR_HARNESS0x4000  /* Test harness uses the register */
+
+struct spr {
+   const char  *name;
+   uint8_t width;
+   uint16_taccess;
+   uint16_ttype;
+};
+
+/* SPRs common denominator back to PowerPC Operating Environment Architecture 
*/
+static const struct spr sprs_common[1024] = {
+  [1] = {"XER",64, RW, SPR_HARNESS, }, /* 
Compiler */
+  [8] = {"LR", 64, RW, SPR_HARNESS, }, /* 
Compiler, mfspr/mtspr */
+  [9] = 

[kvm-unit-tests v4 05/12] powerpc: Extract some common helpers and defines to headers

2023-06-08 Thread Nicholas Piggin
Move some common helpers and defines to processor.h.

Signed-off-by: Nicholas Piggin 
---
 lib/powerpc/asm/processor.h | 38 +
 powerpc/spapr_hcall.c   |  9 +
 powerpc/sprs.c  |  9 -
 3 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/lib/powerpc/asm/processor.h b/lib/powerpc/asm/processor.h
index ebfeff2b..4ad6612b 100644
--- a/lib/powerpc/asm/processor.h
+++ b/lib/powerpc/asm/processor.h
@@ -9,13 +9,43 @@ void handle_exception(int trap, void (*func)(struct pt_regs 
*, void *), void *);
 void do_handle_exception(struct pt_regs *regs);
 #endif /* __ASSEMBLY__ */
 
-static inline uint64_t get_tb(void)
+#define SPR_TB 0x10c
+#define SPR_SPRG0  0x110
+#define SPR_SPRG1  0x111
+#define SPR_SPRG2  0x112
+#define SPR_SPRG3  0x113
+
+static inline uint64_t mfspr(int nr)
 {
-   uint64_t tb;
+   uint64_t ret;
+
+   asm volatile("mfspr %0,%1" : "=r"(ret) : "i"(nr) : "memory");
+
+   return ret;
+}
 
-   asm volatile ("mfspr %[tb],268" : [tb] "=r" (tb));
+static inline void mtspr(int nr, uint64_t val)
+{
+   asm volatile("mtspr %0,%1" : : "i"(nr), "r"(val) : "memory");
+}
+
+static inline uint64_t mfmsr(void)
+{
+   uint64_t msr;
 
-   return tb;
+   asm volatile ("mfmsr %[msr]" : [msr] "=r" (msr) :: "memory");
+
+   return msr;
+}
+
+static inline void mtmsr(uint64_t msr)
+{
+   asm volatile ("mtmsrd %[msr]" :: [msr] "r" (msr) : "memory");
+}
+
+static inline uint64_t get_tb(void)
+{
+   return mfspr(SPR_TB);
 }
 
 extern void delay(uint64_t cycles);
diff --git a/powerpc/spapr_hcall.c b/powerpc/spapr_hcall.c
index 823a574a..0d0f25af 100644
--- a/powerpc/spapr_hcall.c
+++ b/powerpc/spapr_hcall.c
@@ -9,20 +9,13 @@
 #include 
 #include 
 #include 
+#include 
 
 #define PAGE_SIZE 4096
 
 #define H_ZERO_PAGE(1UL << (63-48))
 #define H_COPY_PAGE(1UL << (63-49))
 
-#define mfspr(nr) ({ \
-   uint64_t ret; \
-   asm volatile("mfspr %0,%1" : "=r"(ret) : "i"(nr)); \
-   ret; \
-})
-
-#define SPR_SPRG0  0x110
-
 /**
  * Test the H_SET_SPRG0 h-call by setting some values and checking whether
  * the SPRG0 register contains the correct values afterwards
diff --git a/powerpc/sprs.c b/powerpc/sprs.c
index 6ee6dba6..57e487ce 100644
--- a/powerpc/sprs.c
+++ b/powerpc/sprs.c
@@ -28,15 +28,6 @@
 #include 
 #include 
 
-#define mfspr(nr) ({ \
-   uint64_t ret; \
-   asm volatile("mfspr %0,%1" : "=r"(ret) : "i"(nr)); \
-   ret; \
-})
-
-#define mtspr(nr, val) \
-   asm volatile("mtspr %0,%1" : : "i"(nr), "r"(val))
-
 uint64_t before[1024], after[1024];
 
 /* Common SPRs for all PowerPC CPUs */
-- 
2.40.1



[kvm-unit-tests v4 04/12] powerpc: Add ISA v3.1 (POWER10) support to SPR test

2023-06-08 Thread Nicholas Piggin
This is a very basic detection that does not include all new SPRs.
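
For reference, the detection below keys on the PVR family in the upper
halfword of the Processor Version Register (assumption: 0x80 is the
POWER10 family value there, alongside the existing 0x4e for POWER9):

	pvr = mfspr(287);	/* Processor Version Register */
	family = pvr >> 16;	/* e.g. 0x4e = POWER9, 0x80 = POWER10 */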

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 powerpc/sprs.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/powerpc/sprs.c b/powerpc/sprs.c
index ba4ddee4..6ee6dba6 100644
--- a/powerpc/sprs.c
+++ b/powerpc/sprs.c
@@ -117,6 +117,15 @@ static void set_sprs_book3s_300(uint64_t val)
mtspr(823, val);/* PSSCR */
 }
 
+/* SPRs from Power ISA Version 3.1B */
+static void set_sprs_book3s_31(uint64_t val)
+{
+   set_sprs_book3s_207(val);
+   mtspr(48, val); /* PIDR */
+   /* 3.1 removes TIDR */
+   mtspr(823, val);/* PSSCR */
+}
+
 static void set_sprs(uint64_t val)
 {
uint32_t pvr = mfspr(287);  /* Processor Version Register */
@@ -137,6 +146,9 @@ static void set_sprs(uint64_t val)
case 0x4e:  /* POWER9 */
set_sprs_book3s_300(val);
break;
+   case 0x80:  /* POWER10 */
+   set_sprs_book3s_31(val);
+   break;
default:
puts("Warning: Unknown processor version!\n");
}
@@ -220,6 +232,13 @@ static void get_sprs_book3s_300(uint64_t *v)
v[823] = mfspr(823);/* PSSCR */
 }
 
+static void get_sprs_book3s_31(uint64_t *v)
+{
+   get_sprs_book3s_207(v);
+   v[48] = mfspr(48);  /* PIDR */
+   v[823] = mfspr(823);/* PSSCR */
+}
+
 static void get_sprs(uint64_t *v)
 {
uint32_t pvr = mfspr(287);  /* Processor Version Register */
@@ -240,6 +259,9 @@ static void get_sprs(uint64_t *v)
case 0x4e:  /* POWER9 */
get_sprs_book3s_300(v);
break;
+   case 0x80:  /* POWER10 */
+   get_sprs_book3s_31(v);
+   break;
}
 }
 
-- 
2.40.1



[kvm-unit-tests v4 03/12] powerpc: Abstract H_CEDE calls into a sleep functions

2023-06-08 Thread Nicholas Piggin
This consolidates several implementations, and it no longer leaves
MSR[EE] enabled after the decrementer interrupt is handled, but
rather disables it on return.

The handler no longer keeps the decrementer ticking continuously;
instead, DEC has to be re-armed and EE re-enabled (e.g., via the
H_CEDE hcall) each time.
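
Callers then simply use the helpers, e.g. (illustrative; tb_hz is the
timebase frequency the library already tracks):

	usleep(1000);		/* sleep 1 ms via DEC + H_CEDE */
	sleep_tb(tb_hz / 2);	/* or sleep half a second of timebase */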

Signed-off-by: Nicholas Piggin 
---
 lib/powerpc/asm/handlers.h  |  2 +-
 lib/powerpc/asm/ppc_asm.h   |  1 +
 lib/powerpc/asm/processor.h |  7 ++
 lib/powerpc/handlers.c  | 10 -
 lib/powerpc/processor.c | 43 +
 powerpc/sprs.c  |  6 +-
 powerpc/tm.c| 20 +
 7 files changed, 58 insertions(+), 31 deletions(-)

diff --git a/lib/powerpc/asm/handlers.h b/lib/powerpc/asm/handlers.h
index 64ba727a..e4a0cd45 100644
--- a/lib/powerpc/asm/handlers.h
+++ b/lib/powerpc/asm/handlers.h
@@ -3,6 +3,6 @@
 
 #include 
 
-void dec_except_handler(struct pt_regs *regs, void *data);
+void dec_handler_oneshot(struct pt_regs *regs, void *data);
 
 #endif /* _ASMPOWERPC_HANDLERS_H_ */
diff --git a/lib/powerpc/asm/ppc_asm.h b/lib/powerpc/asm/ppc_asm.h
index 1b85f6bb..6299ff53 100644
--- a/lib/powerpc/asm/ppc_asm.h
+++ b/lib/powerpc/asm/ppc_asm.h
@@ -36,6 +36,7 @@
 #endif /* __BYTE_ORDER__ */
 
 /* Machine State Register definitions: */
+#define MSR_EE_BIT 15  /* External Interrupts Enable */
 #define MSR_SF_BIT 63  /* 64-bit mode */
 
 #endif /* _ASMPOWERPC_PPC_ASM_H */
diff --git a/lib/powerpc/asm/processor.h b/lib/powerpc/asm/processor.h
index ac001e18..ebfeff2b 100644
--- a/lib/powerpc/asm/processor.h
+++ b/lib/powerpc/asm/processor.h
@@ -20,6 +20,8 @@ static inline uint64_t get_tb(void)
 
 extern void delay(uint64_t cycles);
 extern void udelay(uint64_t us);
+extern void sleep_tb(uint64_t cycles);
+extern void usleep(uint64_t us);
 
 static inline void mdelay(uint64_t ms)
 {
@@ -27,4 +29,9 @@ static inline void mdelay(uint64_t ms)
udelay(1000);
 }
 
+static inline void msleep(uint64_t ms)
+{
+   usleep(ms * 1000);
+}
+
 #endif /* _ASMPOWERPC_PROCESSOR_H_ */
diff --git a/lib/powerpc/handlers.c b/lib/powerpc/handlers.c
index c8721e0a..296f14ff 100644
--- a/lib/powerpc/handlers.c
+++ b/lib/powerpc/handlers.c
@@ -9,15 +9,13 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Generic handler for decrementer exceptions (0x900)
- * Just reset the decrementer back to the value specified when registering the
- * handler
+ * Return with MSR[EE] disabled.
  */
-void dec_except_handler(struct pt_regs *regs __unused, void *data)
+void dec_handler_oneshot(struct pt_regs *regs, void *data)
 {
-   uint64_t dec = *((uint64_t *) data);
-
-   asm volatile ("mtdec %0" : : "r" (dec));
+   regs->msr &= ~(1UL << MSR_EE_BIT);
 }
diff --git a/lib/powerpc/processor.c b/lib/powerpc/processor.c
index 0550e4fc..aaf45b68 100644
--- a/lib/powerpc/processor.c
+++ b/lib/powerpc/processor.c
@@ -10,6 +10,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 static struct {
void (*func)(struct pt_regs *, void *data);
@@ -58,3 +60,44 @@ void udelay(uint64_t us)
 {
delay((us * tb_hz) / 100);
 }
+
+void sleep_tb(uint64_t cycles)
+{
+   uint64_t start, end, now;
+
+   start = now = get_tb();
+   end = start + cycles;
+
+   while (end > now) {
+   uint64_t left = end - now;
+
+   /* TODO: Could support large decrementer */
+   if (left > 0x7fff)
+   left = 0x7fff;
+
+   /* DEC won't fire until H_CEDE is called because EE=0 */
+   asm volatile ("mtdec %0" : : "r" (left));
+	handle_exception(0x900, &dec_handler_oneshot, NULL);
+   /*
+* H_CEDE is called with MSR[EE] clear and enables it as part
+* of the hcall, returning with EE enabled. The dec interrupt
+* is then taken immediately and the handler disables EE.
+*
+* If H_CEDE returned for any other interrupt than dec
+* expiring, that is considered an unhandled interrupt and
+* the test case would be stopped.
+*/
+   if (hcall(H_CEDE) != H_SUCCESS) {
+   printf("H_CEDE failed\n");
+   abort();
+   }
+   handle_exception(0x900, NULL, NULL);
+
+   now = get_tb();
+   }
+}
+
+void usleep(uint64_t us)
+{
+   sleep_tb((us * tb_hz) / 100);
+}
diff --git a/powerpc/sprs.c b/powerpc/sprs.c
index 5cc1cd16..ba4ddee4 100644
--- a/powerpc/sprs.c
+++ b/powerpc/sprs.c
@@ -254,7 +254,6 @@ int main(int argc, char **argv)
0x1234567890ABCDEFULL, 0xFEDCBA0987654321ULL,
-1ULL,
};
-   static uint64_t decr = 0x7FFF; /* Max value */
 
for (i = 1; i < argc; i++) {
if (!strcmp(argv[i], "-w")) {
@@ -288,10 +287,7 @@ 

[kvm-unit-tests v4 02/12] powerpc: Add some checking to exception handler install

2023-06-08 Thread Nicholas Piggin
Add checking to ensure that exception handlers are not overwritten
and that invalid exception numbers are rejected.

Signed-off-by: Nicholas Piggin 
---
Since v3:
- Simplified code as suggested by Thomas.

 lib/powerpc/processor.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/lib/powerpc/processor.c b/lib/powerpc/processor.c
index 05b4b04f..0550e4fc 100644
--- a/lib/powerpc/processor.c
+++ b/lib/powerpc/processor.c
@@ -19,12 +19,16 @@ static struct {
 void handle_exception(int trap, void (*func)(struct pt_regs *, void *),
  void * data)
 {
+   assert(!(trap & ~0xf00));
+
trap >>= 8;
 
-   if (trap < 16) {
-   handlers[trap].func = func;
-   handlers[trap].data = data;
+   if (func && handlers[trap].func) {
+   printf("exception handler installed twice %#x\n", trap);
+   abort();
}
+   handlers[trap].func = func;
+   handlers[trap].data = data;
 }
 
 void do_handle_exception(struct pt_regs *regs)
-- 
2.40.1
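
The check enforces the install/uninstall discipline that the sleep patch in
this series already follows; a minimal sketch of the intended pattern:

	handle_exception(0x900, &dec_handler_oneshot, NULL);	/* install */
	/* ... take the decrementer interrupt ... */
	handle_exception(0x900, NULL, NULL);			/* uninstall */

	/*
	 * A second install without the intervening uninstall would print
	 * "exception handler installed twice" and abort(). The assert
	 * rejects any vector with bits outside 0xf00, i.e. anything above
	 * 0xf00 or not a multiple of 0x100.
	 */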



[kvm-unit-tests v4 01/12] powerpc: Report instruction address and MSR in unhandled exception error

2023-06-08 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
Since v3:
- New patch

 lib/powerpc/processor.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/powerpc/processor.c b/lib/powerpc/processor.c
index ec85b9d8..05b4b04f 100644
--- a/lib/powerpc/processor.c
+++ b/lib/powerpc/processor.c
@@ -38,7 +38,7 @@ void do_handle_exception(struct pt_regs *regs)
return;
}
 
-   printf("unhandled cpu exception %#lx\n", regs->trap);
+   printf("unhandled cpu exception %#lx at NIA:0x%016lx MSR:0x%016lx\n", regs->trap, regs->nip, regs->msr);
abort();
 }
 
-- 
2.40.1
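
With this change, the abort message pinpoints where the CPU was when the
unhandled exception was taken, e.g. (register values illustrative only):

	unhandled cpu exception 0x700 at NIA:0x0000000000400f2c MSR:0x8000000000001033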



[kvm-unit-tests v4 00/12] powerpc: updates, P10, PNV support

2023-06-08 Thread Nicholas Piggin
Posting again; a couple of patches from the previous round were merged,
and the rest have been updated to account for review comments.

Thanks,
Nick

Nicholas Piggin (12):
  powerpc: Report instruction address and MSR in unhandled exception
error
  powerpc: Add some checking to exception handler install
  powerpc: Abstract H_CEDE calls into a sleep functions
  powerpc: Add ISA v3.1 (POWER10) support to SPR test
  powerpc: Extract some common helpers and defines to headers
  powerpc/sprs: Specify SPRs with data rather than code
  powerpc/spapr_vpa: Add basic VPA tests
  powerpc: Expand exception handler vector granularity
  powerpc: Add support for more interrupts including HV interrupts
  powerpc: Discover runtime load address dynamically
  powerpc: Support powernv machine with QEMU TCG
  powerpc/sprs: Test hypervisor registers on powernv machine

 lib/powerpc/asm/handlers.h  |   2 +-
 lib/powerpc/asm/hcall.h |   1 +
 lib/powerpc/asm/ppc_asm.h   |   9 +
 lib/powerpc/asm/processor.h |  55 ++-
 lib/powerpc/handlers.c  |  10 +-
 lib/powerpc/hcall.c |   4 +-
 lib/powerpc/io.c|  27 +-
 lib/powerpc/io.h|   6 +
 lib/powerpc/processor.c |  79 -
 lib/powerpc/setup.c |   8 +-
 lib/ppc64/asm/opal.h|  15 +
 lib/ppc64/asm/vpa.h |  62 
 lib/ppc64/opal-calls.S  |  46 +++
 lib/ppc64/opal.c|  74 +
 powerpc/Makefile.ppc64  |   4 +-
 powerpc/cstart64.S  | 105 --
 powerpc/run |  35 +-
 powerpc/spapr_hcall.c   |   9 +-
 powerpc/spapr_vpa.c | 172 ++
 powerpc/sprs.c  | 645 ++--
 powerpc/tm.c|  20 +-
 powerpc/unittests.cfg   |   3 +
 22 files changed, 1133 insertions(+), 258 deletions(-)
 create mode 100644 lib/ppc64/asm/opal.h
 create mode 100644 lib/ppc64/asm/vpa.h
 create mode 100644 lib/ppc64/opal-calls.S
 create mode 100644 lib/ppc64/opal.c
 create mode 100644 powerpc/spapr_vpa.c

-- 
2.40.1



Re: [RFC PATCH] asm-generic: Unify uapi bitsperlong.h

2023-06-08 Thread Tiezhu Yang

Hi all,

On 05/09/2023 05:37 PM, Arnd Bergmann wrote:

On Tue, May 9, 2023, at 09:05, Tiezhu Yang wrote:

Now that we specify the minimum version of GCC as 5.1 and Clang/LLVM as
11.0.0 in Documentation/process/changes.rst, __CHAR_BIT__ and
__SIZEOF_LONG__ are always usable. Just define __BITS_PER_LONG as
(__CHAR_BIT__ * __SIZEOF_LONG__) in the asm-generic uapi bitsperlong.h;
it is simpler and works everywhere.

Remove all the arch specific uapi bitsperlong.h which will be generated as
arch/*/include/generated/uapi/asm/bitsperlong.h.

Suggested-by: Xi Ruoyao 
Link:
https://lore.kernel.org/all/d3e255e4746de44c9903c4433616d44ffcf18d1b.ca...@xry111.site/
Signed-off-by: Tiezhu Yang 
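
(For reference, the entire proposed asm-generic uapi header reduces to
something like the following sketch; the guard name is assumed:)

	/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
	#ifndef _UAPI__ASM_GENERIC_BITS_PER_LONG
	#define _UAPI__ASM_GENERIC_BITS_PER_LONG

	#define __BITS_PER_LONG (__CHAR_BIT__ * __SIZEOF_LONG__)

	#endif /* _UAPI__ASM_GENERIC_BITS_PER_LONG */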


I originally introduced the bitsperlong.h header, and I'd love to
see it removed if it's no longer needed. Your patch certainly
seems to do this well.

There is one minor obstacle to this, which is that the compiler
requirements for uapi headers are not the same as for kernel
internal code. In particular, the uapi headers may be included
by user space code that is built with an older compiler version,
or with a compiler that is not gcc or clang.

I think we are completely safe on the architectures that were
added since the linux-3.x days (arm64, riscv, csky, openrisc,
loongarch, nios2, and hexagon), but for the older ones there
is a regression risk. Especially on targets that are not that
actively maintained (sparc, alpha, ia64, sh, ...) there is
a good chance that users are stuck on ancient toolchains.

It's probably also a safe assumption that anyone with an older
libc version won't be using the latest kernel headers, so
I think we can still do this across architectures if both
glibc and musl already require a compiler that is new enough,
or alternatively if we know that the kernel headers require
a new compiler for other reasons and nobody has complained.
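
One way to hedge against such ancient toolchains, sketched here as an
idea rather than taken from the patch, would be a guarded fallback on
the long-standing __LP64__/_LP64 macros:

	#if defined(__CHAR_BIT__) && defined(__SIZEOF_LONG__)
	#define __BITS_PER_LONG (__CHAR_BIT__ * __SIZEOF_LONG__)
	#elif defined(__LP64__) || defined(_LP64)
	#define __BITS_PER_LONG 64
	#else
	#define __BITS_PER_LONG 32
	#endif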

For glibc, it looks like the minimum compiler version was raised
from gcc-5 to gcc-8 four years ago [1], so we should be fine.

In musl, the documentation states that at least gcc-3.4 or
clang-3.2 is required; gcc-3.4 predates the __SIZEOF_LONG__
macro (gcc gained it in 4.3). On the other hand, musl was only
released in 2011, and building musl itself explicitly
does not require kernel uapi headers, so this may not
be too critical.

There is also uClibc, but I could not find any minimum
supported compiler version for that. Most commonly, this
one is used for cross-build environments, so it's also
less likely to have libc/gcc/headers being wildly out of
sync. Not sure.

  Arnd

[1] https://sourceware.org/pipermail/libc-alpha/2019-January/101010.html



Thanks, Arnd, for the detailed reply.
Any more comments? What should I do next?

Thanks,
Tiezhu