Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

2023-07-04 Thread Yu Zhang
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool 
> write_fault,
>   * The slow path to get the pfn of the specified host virtual address,
>   * 1 indicates success, -errno is returned if error is detected.
>   */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> -bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
>  {
> - unsigned int flags = FOLL_HWPOISON;
> + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
>   struct page *page;
>   int npages;
>  
>   might_sleep();
>  
> - if (writable)
> - *writable = write_fault;
> -
> - if (write_fault)
> - flags |= FOLL_WRITE;
> - if (async)
> - flags |= FOLL_NOWAIT;
> - if (interruptible)
> - flags |= FOLL_INTERRUPTIBLE;
> -
> - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
>   if (npages != 1)
>   return npages;
>  
> + foll->writable = (foll->flags & FOLL_WRITE) && 
> foll->allow_write_mapping;
> +
>   /* map read fault as writable if possible */
> - if (unlikely(!write_fault) && writable) {
> + if (unlikely(!foll->writable) && foll->allow_write_mapping) {

I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
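
Spelled out, the check being suggested would read something like this (a
sketch against the quoted hunk only, untested):

```
	/* map read fault as writable if possible */
	if (unlikely(!(foll->flags & FOLL_WRITE)) && foll->allow_write_mapping) {
		struct page *wpage;

		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
			foll->writable = true;
			put_page(page);
			page = wpage;
		}
	}
```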

>   struct page *wpage;
>  
> - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> - *writable = true;
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> + foll->writable = true;
>   put_page(page);
>   page = wpage;
>   }
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
>   return get_page_unless_zero(page);
>  }
>  
...

> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> +bool atomic, bool interruptible, bool *async,
> +bool write_fault, bool *writable, hva_t *hva)
> +{
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = 0,
> + .atomic = atomic,
> + .allow_write_mapping = !!writable,
> + };
> +
> + if (write_fault)
> + foll.flags |= FOLL_WRITE;
> + if (async)
> + foll.flags |= FOLL_NOWAIT;
> + if (interruptible)
> + foll.flags |= FOLL_INTERRUPTIBLE;
> +
> + pfn = __kvm_follow_pfn(&foll);
> + if (pfn == KVM_PFN_ERR_NEEDS_IO) {

Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
seems unnecessary.
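
Spelled out, the simplification being suggested might look like this (a
sketch only; it assumes __kvm_follow_pfn() would then return
KVM_PFN_ERR_FAULT for the would-block case rather than a dedicated
KVM_PFN_ERR_NEEDS_IO):

```
	pfn = __kvm_follow_pfn(&foll);
	/* FOLL_NOWAIT is only set above when async is non-NULL */
	if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))
		*async = true;
```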

> + *async = true;
> + pfn = KVM_PFN_ERR_FAULT;
> + }
> + if (hva)
> + *hva = foll.hva;
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
>  }
>  EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>  

B.R.
Yu


Re: [PATCH v3 6/9] cpu/SMT: Allow enabling partial SMT states via sysfs

2023-07-04 Thread Zhang, Rui
On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote:
> @@ -2580,6 +2597,17 @@ static ssize_t control_show(struct device
> *dev,
>  {
> const char *state = smt_states[cpu_smt_control];
>  
> +#ifdef CONFIG_HOTPLUG_SMT
> +   /*
> +    * If SMT is enabled but not all threads are enabled then
> show the
> +    * number of threads. If all threads are enabled show "on".
> Otherwise
> +    * show the state name.
> +    */
> +   if (cpu_smt_control == CPU_SMT_ENABLED &&
> +   cpu_smt_num_threads != cpu_smt_max_threads)
> +   return sysfs_emit(buf, "%d\n", cpu_smt_num_threads);
> +#endif
> +

My understanding is that cpu_smt_control is always set to
CPU_SMT_NOT_IMPLEMENTED when CONFIG_HOTPLUG_SMT is not set, so this
ifdef is not necessary, right?
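
For illustration, the simplification implied here would look roughly like
the following (sketch only; it assumes cpu_smt_num_threads,
cpu_smt_max_threads and CPU_SMT_ENABLED are still visible when
CONFIG_HOTPLUG_SMT is off, that cpu_smt_control then always reads
CPU_SMT_NOT_IMPLEMENTED so the new condition can never be true, and the
closing return is assumed from the existing control_show()):

```
static ssize_t control_show(struct device *dev,
			    struct device_attribute *attr, char *buf)
{
	const char *state = smt_states[cpu_smt_control];

	/*
	 * If SMT is enabled but not all threads are enabled then show the
	 * number of threads. If all threads are enabled show "on". Otherwise
	 * show the state name.
	 */
	if (cpu_smt_control == CPU_SMT_ENABLED &&
	    cpu_smt_num_threads != cpu_smt_max_threads)
		return sysfs_emit(buf, "%d\n", cpu_smt_num_threads);

	return sysfs_emit(buf, "%s\n", state);
}
```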

thanks,
rui


Re: [PATCH v3 3/9] cpu/SMT: Store the current/max number of threads

2023-07-04 Thread Zhang, Rui
On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote:
> From: Michael Ellerman 
> 
> Some architectures allows partial SMT states at boot time,

s/allows/allow.

thanks,
rui



Re: [PATCH v3 0/9] Introduce SMT level and add PowerPC support

2023-07-04 Thread Zhang, Rui
Hi, Laurent,

I want to test this patch set, but found that it does not apply on top
of the latest upstream git because of some changes in this merge window,
so it would be better to rebase.

thanks,
rui

On Thu, 2023-06-29 at 16:31 +0200, Laurent Dufour wrote:
> I'm taking over the series Michael sent previously [1], which is smartly
> reviewing the initial series I sent [2]. This series addresses the
> comments sent by Thomas and me on Michael's one.
> 
> Here is a short introduction to the issue this series is addressing:
> 
> When a new CPU is added, the kernel activates all its threads. This
> leads to weird, but functional, results when adding a CPU on an SMT 4
> system, for instance.
> 
> Here the newly added CPU 1 has 8 threads while the other one has 4
> threads active (the system has been booted with the 'smt-enabled=4'
> kernel option):
> 
> ltcden3-lp12:~ # ppc64_cpu --info
> Core   0:    0*    1*    2*    3*    4 5 6 7
> Core   1:    8*    9*   10*   11*   12*   13*   14*   15*
> 
> This mixed SMT level may confuse end users and/or some applications.
> 
> There is no SMT level recorded in the kernel (common code), nor in user
> space, as far as I know. Such a level is helpful when adding a new CPU
> or when optimizing energy efficiency (when reactivating CPUs).
> 
> When SMP and HOTPLUG_SMT are defined, this series adds a new SMT level
> (cpu_smt_num_threads) and a few callbacks allowing the architecture
> code to fine-control this value, setting a max and an "at boot" level,
> and controlling whether a thread should be onlined or not.
> 
> v3:
>   Fix a build error in the patch 6/9
> v2:
>   As Thomas suggested,
>     Reword some commit's description
>     Remove topology_smt_supported()
>     Remove topology_smt_threads_supported()
>     Introduce CONFIG_SMT_NUM_THREADS_DYNAMIC
>     Remove switch() in __store_smt_control()
>   Update kernel-parameters.txt
> 
> [1]
> https://lore.kernel.org/linuxppc-dev/20230524155630.794584-1-...@ellerman.id.au/
> [2]
> https://lore.kernel.org/linuxppc-dev/20230331153905.31698-1-lduf...@linux.ibm.com/
> 
> Laurent Dufour (1):
>   cpu/SMT: Remove topology_smt_supported()
> 
> Michael Ellerman (8):
>   cpu/SMT: Move SMT prototypes into cpu_smt.h
>   cpu/SMT: Move smt/control simple exit cases earlier
>   cpu/SMT: Store the current/max number of threads
>   cpu/SMT: Create topology_smt_thread_allowed()
>   cpu/SMT: Allow enabling partial SMT states via sysfs
>   powerpc/pseries: Initialise CPU hotplug callbacks earlier
>   powerpc: Add HOTPLUG_SMT support
>   powerpc/pseries: Honour current SMT state when DLPAR onlining CPUs
> 
>  .../ABI/testing/sysfs-devices-system-cpu  |   1 +
>  .../admin-guide/kernel-parameters.txt |   4 +-
>  arch/Kconfig  |   3 +
>  arch/powerpc/Kconfig  |   2 +
>  arch/powerpc/include/asm/topology.h   |  15 +++
>  arch/powerpc/kernel/smp.c |   8 +-
>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |  30 +++--
>  arch/powerpc/platforms/pseries/pseries.h  |   2 +
>  arch/powerpc/platforms/pseries/setup.c    |   2 +
>  arch/x86/include/asm/topology.h   |   4 +-
>  arch/x86/kernel/cpu/bugs.c    |   3 +-
>  arch/x86/kernel/smpboot.c |   8 --
>  include/linux/cpu.h   |  25 +---
>  include/linux/cpu_smt.h   |  33 +
>  kernel/cpu.c  | 118 ++--
> --
>  15 files changed, 187 insertions(+), 71 deletions(-)
>  create mode 100644 include/linux/cpu_smt.h
> 



[PATCH] powerpc/config: Disable SLUB_DEBUG_ON in skiroot

2023-07-04 Thread Joel Stanley
In 5.10, commit 5e84dd547bce ("powerpc/configs/skiroot: Enable some more
hardening options") set SLUB_DEBUG_ON.

When 5.14 came around, commit 792702911f58 ("slub: force on
no_hash_pointers when slub_debug is enabled") started printing all the
pointers when SLUB_DEBUG_ON is set. This was fine, but in 5.12 commit
5ead723a20e0 ("lib/vsprintf: no_hash_pointers prints all addresses as
unhashed") had added the warning at boot.

Disable SLUB_DEBUG_ON as we don't want the nasty warning. We have
CONFIG_EXPERT so SLUB_DEBUG is enabled. We do lose the settings in
DEBUG_DEFAULT_FLAGS, but it's not clear that these should have been
always-on anyway.

Signed-off-by: Joel Stanley 
---
 arch/powerpc/configs/skiroot_defconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/configs/skiroot_defconfig 
b/arch/powerpc/configs/skiroot_defconfig
index 71cfb990a74f..8d3eacb50d56 100644
--- a/arch/powerpc/configs/skiroot_defconfig
+++ b/arch/powerpc/configs/skiroot_defconfig
@@ -289,7 +289,6 @@ CONFIG_LIBCRC32C=y
 # CONFIG_XZ_DEC_SPARC is not set
 CONFIG_PRINTK_TIME=y
 CONFIG_MAGIC_SYSRQ=y
-CONFIG_SLUB_DEBUG_ON=y
 CONFIG_SCHED_STACK_END_CHECK=y
 CONFIG_DEBUG_STACKOVERFLOW=y
 CONFIG_PANIC_ON_OOPS=y
-- 
2.40.1



Re: [11/12] fbdev/core: Protect edid_info with CONFIG_ARCH_HAS_EDID_INFO

2023-07-04 Thread Sui Jingfeng

Hi,


On 2023/6/29 19:45, Thomas Zimmermann wrote:

Guard usage of edid_info with CONFIG_ARCH_HAS_EDID_INFO instead
of CONFIG_X86. No functional changes.

Signed-off-by: Thomas Zimmermann 



Reviewed-by: Sui Jingfeng 



Cc: Daniel Vetter 
Cc: Helge Deller 
Cc: Randy Dunlap 
---
  drivers/video/fbdev/core/fbmon.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/video/fbdev/core/fbmon.c b/drivers/video/fbdev/core/fbmon.c
index 35be4431f649a..9ae063021e431 100644
--- a/drivers/video/fbdev/core/fbmon.c
+++ b/drivers/video/fbdev/core/fbmon.c
@@ -1480,17 +1480,19 @@ int fb_validate_mode(const struct fb_var_screeninfo 
*var, struct fb_info *info)
-EINVAL : 0;
  }
  
-#if defined(CONFIG_FIRMWARE_EDID) && defined(CONFIG_X86)

+#if defined(CONFIG_FIRMWARE_EDID)
  const unsigned char *fb_firmware_edid(struct fb_info *info)
  {
unsigned char *edid = NULL;
  
+#if defined(CONFIG_ARCH_HAS_EDID_INFO)

/*
 * We need to ensure that the EDID block is only
 * returned for the primary graphics adapter.
 */
if (fb_is_primary_device(info))
edid = edid_info.dummy;
+#endif
  
  	return edid;

  }




Re: [03/12] sysfb: Do not include <linux/screen_info.h> from sysfb header

2023-07-04 Thread Sui Jingfeng



On 2023/6/29 19:45, Thomas Zimmermann wrote:

The header file <linux/sysfb.h> does not need anything from
<linux/screen_info.h>. Declare struct screen_info and remove
the include statements.

Signed-off-by: Thomas Zimmermann 
Cc: Ard Biesheuvel 
Cc: Hans de Goede 
Cc: Javier Martinez Canillas 
Reviewed-by: Javier Martinez Canillas 



Reviewed-by: Sui Jingfeng 



---
  include/linux/sysfb.h | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sysfb.h b/include/linux/sysfb.h
index c1ef5fc60a3cb..19cb803dd5ecd 100644
--- a/include/linux/sysfb.h
+++ b/include/linux/sysfb.h
@@ -9,7 +9,8 @@
  
  #include 

  #include 
-#include 
+
+struct screen_info;
  
  enum {

M_I17,  /* 17-Inch iMac */




Re: [01/12] efi: Do not include <linux/screen_info.h> from EFI header

2023-07-04 Thread Sui Jingfeng

Hi, Thomas


I love your patch. LoongArch also has UEFI GOP support, but maybe
arch/loongarch/kernel/efi.c doesn't include '#include <linux/screen_info.h>'
explicitly:



```

diff --git a/arch/loongarch/kernel/efi.c b/arch/loongarch/kernel/efi.c
index 3d448fef3af4..04f4d217aefb 100644
--- a/arch/loongarch/kernel/efi.c
+++ b/arch/loongarch/kernel/efi.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
```


On 2023/6/29 19:45, Thomas Zimmermann wrote:

The header file <linux/efi.h> does not need anything from
<linux/screen_info.h>. Declare struct screen_info and remove
the include statements. Update a number of source files that
require struct screen_info's definition.

Signed-off-by: Thomas Zimmermann 
Cc: Ard Biesheuvel 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Reviewed-by: Javier Martinez Canillas 


With the above issue solved, please take my R-B if you would like.


Reviewed-by: Sui Jingfeng 


---
  arch/arm/kernel/efi.c | 2 ++
  arch/arm64/kernel/efi.c   | 1 +
  drivers/firmware/efi/libstub/efi-stub-entry.c | 2 ++
  drivers/firmware/efi/libstub/screen_info.c| 2 ++
  include/linux/efi.h   | 3 ++-
  5 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/efi.c b/arch/arm/kernel/efi.c
index e2b9d2618c672..e94655ef16bb3 100644
--- a/arch/arm/kernel/efi.c
+++ b/arch/arm/kernel/efi.c
@@ -5,6 +5,8 @@
  
  #include 

  #include 
+#include 
+
  #include 
  #include 
  #include 
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index baab8dd3ead3c..3afbe503b066f 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -9,6 +9,7 @@
  
  #include 

  #include 
+#include 
  
  #include 

  #include 
diff --git a/drivers/firmware/efi/libstub/efi-stub-entry.c 
b/drivers/firmware/efi/libstub/efi-stub-entry.c
index cc4dcaea67fa6..2f1902e5d4075 100644
--- a/drivers/firmware/efi/libstub/efi-stub-entry.c
+++ b/drivers/firmware/efi/libstub/efi-stub-entry.c
@@ -1,6 +1,8 @@
  // SPDX-License-Identifier: GPL-2.0-only
  
  #include 

+#include 
+
  #include 
  
  #include "efistub.h"

diff --git a/drivers/firmware/efi/libstub/screen_info.c 
b/drivers/firmware/efi/libstub/screen_info.c
index 4be1c4d1f922b..a51ec201ca3cb 100644
--- a/drivers/firmware/efi/libstub/screen_info.c
+++ b/drivers/firmware/efi/libstub/screen_info.c
@@ -1,6 +1,8 @@
  // SPDX-License-Identifier: GPL-2.0
  
  #include 

+#include 
+
  #include 
  
  #include "efistub.h"

diff --git a/include/linux/efi.h b/include/linux/efi.h
index 571d1a6e1b744..360895a5572c0 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -24,10 +24,11 @@
  #include 
  #include 
  #include 
-#include 
  
  #include 
  
+struct screen_info;

+
  #define EFI_SUCCESS   0
  #define EFI_LOAD_ERROR( 1 | (1UL << (BITS_PER_LONG-1)))
  #define EFI_INVALID_PARAMETER ( 2 | (1UL << (BITS_PER_LONG-1)))




Re: [08/12] drivers/firmware: Remove trailing whitespaces

2023-07-04 Thread Sui Jingfeng

Hi,

On 2023/6/29 19:45, Thomas Zimmermann wrote:

Fix coding style. No functional changes.

Signed-off-by: Thomas Zimmermann 



Reviewed-by: Sui Jingfeng 



---
  drivers/firmware/Kconfig | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index b59e3041fd627..0432737bbb8b4 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -99,7 +99,7 @@ config EFI_PCDP
  You must also enable the appropriate drivers (serial, VGA, etc.)
  
  	  See DIG64_HCDPv20_042804.pdf available from

- 
+ 
  
  config DMIID

  bool "Export DMI identification via sysfs to userspace"




Re: [05/12] arch: Remove trailing whitespaces

2023-07-04 Thread Sui Jingfeng



On 2023/6/29 19:45, Thomas Zimmermann wrote:

Fix coding style. No functional changes.

Signed-off-by: Thomas Zimmermann 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: Andrew Morton 
Cc: Geert Uytterhoeven 
Cc: Arnd Bergmann 
Cc: "Kirill A. Shutemov" 
Cc: Anshuman Khandual 
Cc: Niklas Schnelle 
Cc: Zi Yan 
Cc: "Mike Rapoport (IBM)" 
Cc: Peter Zijlstra 
Reviewed-by: Javier Martinez Canillas 


Reviewed-by: Sui Jingfeng 


---
  arch/ia64/Kconfig | 4 ++--
  arch/sh/Kconfig   | 6 +++---
  2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 21fa63ce5ffc0..e79f15e32a451 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -260,7 +260,7 @@ config PERMIT_BSP_REMOVE
default n
help
Say Y here if your platform SAL will support removal of BSP with 
HOTPLUG_CPU
-   support.
+   support.
  
  config FORCE_CPEI_RETARGET

bool "Force assumption that CPEI can be re-targeted"
@@ -335,7 +335,7 @@ config IA64_PALINFO
  config IA64_MC_ERR_INJECT
tristate "MC error injection support"
help
- Adds support for MC error injection. If enabled, the kernel
+ Adds support for MC error injection. If enabled, the kernel
  will provide a sysfs interface for user applications to
  call MC error injection PAL procedures to inject various errors.
  This is a useful tool for MCA testing.
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 9652d367fc377..04b9550cf0070 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -234,7 +234,7 @@ config CPU_SUBTYPE_SH7201
select CPU_SH2A
select CPU_HAS_FPU
select SYS_SUPPORTS_SH_MTU2
-
+
  config CPU_SUBTYPE_SH7203
bool "Support SH7203 processor"
select CPU_SH2A
@@ -496,7 +496,7 @@ config CPU_SUBTYPE_SH7366
  endchoice
  
  source "arch/sh/mm/Kconfig"

-
+
  source "arch/sh/Kconfig.cpu"
  
  source "arch/sh/boards/Kconfig"

@@ -647,7 +647,7 @@ config GUSA
  This is the default implementation for both UP and non-ll/sc
  CPUs, and is used by the libc, amongst others.
  
-	  For additional information, design information can be found

+ For additional information, design information can be found
  in .
  
  	  This should only be disabled for special cases where alternate




Re: [06/12] arch: Declare screen_info in

2023-07-04 Thread Sui Jingfeng

Hi, Thomas


I love your patch, yet after applying it, the Linux kernel fails
to compile on my LoongArch machine.



```

  CC  arch/loongarch/kernel/efi.o
arch/loongarch/kernel/efi.c: In function ‘init_screen_info’:
arch/loongarch/kernel/efi.c:77:54: error: invalid application of 
‘sizeof’ to incomplete type ‘struct screen_info’

   77 | si = early_memremap(screen_info_table, sizeof(*si));
  |  ^
arch/loongarch/kernel/efi.c:82:9: error: ‘screen_info’ undeclared (first 
use in this function)

   82 | screen_info = *si;
  | ^~~
arch/loongarch/kernel/efi.c:82:9: note: each undeclared identifier is 
reported only once for each function it appears in
arch/loongarch/kernel/efi.c:82:23: error: invalid use of undefined type 
‘struct screen_info’

   82 | screen_info = *si;
  |   ^
arch/loongarch/kernel/efi.c:83:29: error: invalid application of 
‘sizeof’ to incomplete type ‘struct screen_info’

   83 | memset(si, 0, sizeof(*si));
  | ^
arch/loongarch/kernel/efi.c:84:34: error: invalid application of 
‘sizeof’ to incomplete type ‘struct screen_info’

   84 | early_memunmap(si, sizeof(*si));
  |  ^
make[3]: *** [scripts/Makefile.build:252: arch/loongarch/kernel/efi.o] 
Error 1

make[3]: *** Waiting for unfinished jobs
make[2]: *** [scripts/Makefile.build:494: arch/loongarch/kernel] Error 2
make[1]: *** [scripts/Makefile.build:494: arch/loongarch] Error 2
make[1]: *** Waiting for unfinished jobs
make: *** [Makefile:2026: .] Error 2

```

On 2023/6/29 19:45, Thomas Zimmermann wrote:

The variable screen_info does not exist on all architectures. Declare
it in <asm-generic/screen_info.h>. All architectures that do declare it
will provide it via <asm/screen_info.h>.

Add the Kconfig token ARCH_HAS_SCREEN_INFO to guard against access on
architectures that don't provide screen_info.

Signed-off-by: Thomas Zimmermann 
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Guo Ren 
Cc: Brian Cain 
Cc: Huacai Chen 
Cc: WANG Xuerui 
Cc: Thomas Bogendoerfer 
Cc: Dinh Nguyen 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: "David S. Miller" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Helge Deller 
Cc: Arnd Bergmann 
Cc: Kees Cook 
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
Cc: Frederic Weisbecker 
Cc: Andrew Morton 
Cc: Ard Biesheuvel 
Cc: Sami Tolvanen 
Cc: Juerg Haefliger 
Cc: Geert Uytterhoeven 
Cc: Anshuman Khandual 
Cc: Niklas Schnelle 
Cc: "Russell King (Oracle)" 
Cc: Linus Walleij 
Cc: Sebastian Reichel 
Cc: "Mike Rapoport (IBM)" 
Cc: "Kirill A. Shutemov" 
Cc: Zi Yan 
Acked-by: WANG Xuerui  # loongarch
---
  arch/Kconfig  |  6 ++
  arch/alpha/Kconfig|  1 +
  arch/arm/Kconfig  |  1 +
  arch/arm64/Kconfig|  1 +
  arch/csky/Kconfig |  1 +
  arch/hexagon/Kconfig  |  1 +
  arch/ia64/Kconfig |  1 +
  arch/loongarch/Kconfig|  1 +
  arch/mips/Kconfig |  1 +
  arch/nios2/Kconfig|  1 +
  arch/powerpc/Kconfig  |  1 +
  arch/riscv/Kconfig|  1 +
  arch/sh/Kconfig   |  1 +
  arch/sparc/Kconfig|  1 +
  arch/x86/Kconfig  |  1 +
  arch/xtensa/Kconfig   |  1 +
  drivers/video/Kconfig |  3 +++
  include/asm-generic/Kbuild|  1 +
  include/asm-generic/screen_info.h | 12 
  include/linux/screen_info.h   |  2 +-
  20 files changed, 38 insertions(+), 1 deletion(-)
  create mode 100644 include/asm-generic/screen_info.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cada..2f58293fd7bcb 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1466,6 +1466,12 @@ config ARCH_HAS_NONLEAF_PMD_YOUNG
  address translations. Page table walkers that clear the accessed bit
  may use this capability to reduce their search space.
  
+config ARCH_HAS_SCREEN_INFO

+   bool
+   help
+ Selected by architectures that provide a global instance of
+ screen_info.
+
  source "kernel/gcov/Kconfig"
  
  source "scripts/gcc-plugins/Kconfig"

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index a5c2b1aa46b02..d749011d88b14 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -4,6 +4,7 @@ config ALPHA
default y
select ARCH_32BIT_USTAT_F_TINODE
select ARCH_HAS_CURRENT_STACK_POINTER
+   select ARCH_HAS_SCREEN_INFO
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select 

Re: [PATCH] powernv/opal-prd: Silence memcpy() run-time false positive warnings

2023-07-04 Thread Jordan Niethe




On 26/6/23 5:04 pm, Mahesh Salgaonkar wrote:

opal_prd_msg_notifier extracts the opal prd message size from the message
header and uses it for allocating opal_prd_msg_queue_item that includes
the correct message size to be copied. However, while running under
CONFIG_FORTIFY_SOURCE=y, it triggers following run-time warning:

[ 6458.234352] memcpy: detected field-spanning write (size 32) of single field
"&item->msg" at arch/powerpc/platforms/powernv/opal-prd.c:355 (size 4)
[ 6458.234390] WARNING: CPU: 9 PID: 660 at 
arch/powerpc/platforms/powernv/opal-prd.c:355 opal_prd_msg_notifier+0x174/0x188 
[opal_prd]
[...]
[ 6458.234709] NIP [c0080e0c0e6c] opal_prd_msg_notifier+0x174/0x188 
[opal_prd]
[ 6458.234723] LR [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 
[opal_prd]
[ 6458.234736] Call Trace:
[ 6458.234742] [c002acb23c10] [c0080e0c0e68] 
opal_prd_msg_notifier+0x170/0x188 [opal_prd] (unreliable)
[ 6458.234759] [c002acb23ca0] [c019ccc0] 
notifier_call_chain+0xc0/0x1b0
[ 6458.234774] [c002acb23d00] [c019ceac] 
atomic_notifier_call_chain+0x2c/0x40
[ 6458.234788] [c002acb23d20] [c00d69b4] 
opal_message_notify+0xf4/0x2c0
[...]

Add a flexible array member to avoid false positive run-time warning.

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Mahesh Salgaonkar 
---
  arch/powerpc/platforms/powernv/opal-prd.c |7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index 113bdb151f687..9e2c4775f75f5 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -30,7 +30,10 @@
   */
  struct opal_prd_msg_queue_item {
	struct list_head	list;
-   struct opal_prd_msg_header  msg;
+   union {
+   struct opal_prd_msg_header  msg;
+   DECLARE_FLEX_ARRAY(__u8, msg_flex);
+   };
  };
  
  static struct device_node *prd_node;

@@ -352,7 +355,7 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
if (!item)
return -ENOMEM;
  
-	memcpy(&item->msg, msg->params, msg_size);
+	memcpy(&item->msg_flex, msg->params, msg_size);


This does silence the warning but it seems like we might be able to go a 
little further.


What about not adding that flex array to struct opal_prd_msg_queue_item, 
but adding one to struct opal_prd_msg. That is what the data format 
actually is.


So we'd have something like:


struct opal_prd_msg {
	struct opal_prd_msg_header hdr;
	char msg[];
};


and change things to use that instead?

But that might be more trouble than it is worth, alternatively we can 
just do:


item->msg = *hdr;
memcpy((char *)&item->msg + sizeof(*hdr), (char *)hdr + sizeof(*hdr),
       msg_size - sizeof(*hdr));



  
	spin_lock_irqsave(&opal_prd_msg_queue_lock, flags);
	list_add_tail(&item->list, &opal_prd_msg_queue);






[PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307

2023-07-04 Thread Benjamin Gray
The drivers/rtc/rtc-ds1307.c driver has a direct dependency on
struct regmap_config, which is guarded behind CONFIG_REGMAP.

Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP")
exposed this by disabling the default pick unless KUNIT_ALL_TESTS is
set, causing the ppc64be allnoconfig build to fail.

Signed-off-by: Benjamin Gray 
---
 drivers/rtc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
index ffca9a8bb878..7455ebd189fe 100644
--- a/drivers/rtc/Kconfig
+++ b/drivers/rtc/Kconfig
@@ -246,6 +246,7 @@ config RTC_DRV_AS3722
 
 config RTC_DRV_DS1307
tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, 
ISL12057"
+   select REGMAP
select REGMAP_I2C
select WATCHDOG_CORE if WATCHDOG
help
-- 
2.41.0



Re: Fwd: Memory corruption in multithreaded user space program while calling fork

2023-07-04 Thread Suren Baghdasaryan
On Tue, Jul 4, 2023 at 2:28 PM Andrew Morton  wrote:
>
> On Tue, 4 Jul 2023 13:22:54 -0700 Suren Baghdasaryan  
> wrote:
>
> > On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton  
> > wrote:
> > >
> > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH  
> > > wrote:
> > >
> > > > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default 
> > > > > > > > until
> > > > > > > > the issue is fixed. I'll post a patch shortly.
> > > > > > >
> > > > > > > Posted at: 
> > > > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
> > > > > >
> > > > > > As that change fixes something in 6.4, why not cc: stable on it as 
> > > > > > well?
> > > > >
> > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > > > versions. Did I miss something?
> > > >
> > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > > > there :)
> > >
> > > I'm in wait-a-few-days-mode on this.  To see if we have a backportable
> > > fix rather than disabling the feature in -stable.
> >
> > Ok, I think we have a fix posted at [2] and it cleanly applies to the
> > 6.4.y stable branch as well. However fork() performance might slightly
> > regress, therefore disabling per-VMA locks by default for now seems to
> > be preferable even with this fix (see discussion at
> > https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/).
> > IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
> > cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
> > to stable@vger?
> >
> > [1] https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
>
> This one isn't sufficient for .configs which already have
> PER_VMA_LOCK=y.  Using `depends on BROKEN' would be better.
>
> > [2] https://lore.kernel.org/all/20230704200656.2526715-1-sur...@google.com/
> >
>
> We're still awaiting tester input on this?

Yeah, and it seems to be negative... Anyway, I'll post a dependency on BROKEN.

>
> I think a clean new fully-changelogged two-patch series would be the
> best way to handle this.  Please ensure that the [0/2] intro clearly
> explains what we're proposing here, and why.
>
> Also, "might slightly regress" is a bit weak.  These things are
> measurable, no?  Because a better solution would be to fix 6.4.x and
> mainline and leave it at that.
>


Re: Fwd: Memory corruption in multithreaded user space program while calling fork

2023-07-04 Thread Andrew Morton
On Tue, 4 Jul 2023 13:22:54 -0700 Suren Baghdasaryan  wrote:

> On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton  
> wrote:
> >
> > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH  
> > wrote:
> >
> > > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default 
> > > > > > > until
> > > > > > > the issue is fixed. I'll post a patch shortly.
> > > > > >
> > > > > > Posted at: 
> > > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
> > > > >
> > > > > As that change fixes something in 6.4, why not cc: stable on it as 
> > > > > well?
> > > >
> > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > > versions. Did I miss something?
> > >
> > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > > there :)
> >
> > I'm in wait-a-few-days-mode on this.  To see if we have a backportable
> > fix rather than disabling the feature in -stable.
> 
> Ok, I think we have a fix posted at [2] and it cleanly applies to the
> 6.4.y stable branch as well. However fork() performance might slightly
> regress, therefore disabling per-VMA locks by default for now seems to
> be preferable even with this fix (see discussion at
> https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/).
> IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
> cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
> to stable@vger?
> 
> [1] https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/

This one isn't sufficient for .configs which already have
PER_VMA_LOCK=y.  Using `depends on BROKEN' would be better.

> [2] https://lore.kernel.org/all/20230704200656.2526715-1-sur...@google.com/
> 

We're still awaiting tester input on this?

I think a clean new fully-changelogged two-patch series would be the
best way to handle this.  Please ensure that the [0/2] intro clearly
explains what we're proposing here, and why.

Also, "might slightly regress" is a bit weak.  These things are
measurable, no?  Because a better solution would be to fix 6.4.x and
mainline and leave it at that.



Re: Fwd: Memory corruption in multithreaded user space program while calling fork

2023-07-04 Thread Suren Baghdasaryan
On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton  wrote:
>
> On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH  wrote:
>
> > > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default 
> > > > > > until
> > > > > > the issue is fixed. I'll post a patch shortly.
> > > > >
> > > > > Posted at: 
> > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
> > > >
> > > > As that change fixes something in 6.4, why not cc: stable on it as well?
> > >
> > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > > patch is fixing 6.4 I didn't need to send it to stable for older
> > > versions. Did I miss something?
> >
> > 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> > there :)
>
> I'm in wait-a-few-days-mode on this.  To see if we have a backportable
> fix rather than disabling the feature in -stable.

Ok, I think we have a fix posted at [2] and it cleanly applies to the
6.4.y stable branch as well. However fork() performance might slightly
regress, therefore disabling per-VMA locks by default for now seems to
be preferable even with this fix (see discussion at
https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/).
IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply
cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately
to stable@vger?

[1] https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
[2] https://lore.kernel.org/all/20230704200656.2526715-1-sur...@google.com/

>


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Hugh Dickins
On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > 
> > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > extent that I've not even tried to verify it; but I think I do get the
> > > point now, that we need further info than just PPHHAA to know whether
> > > the page is on the list or not.  But I think that if we move where the
> > > call_rcu() is done, then the page can stay on or off the list by same
> > > rules as before (but need to check HH bits along with PP when deciding
> > > whether to allocate, and whether to list_add_tail() when freeing).  
> > 
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> > 
> > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > an easier way of dealing with those "is it on the list" questions.
> > I expect that we shall be close to reaching agreement on...
> 
> This looks really nice, almost too good and easy to be true. I did not
> find any obvious flaw, just some comments below. It also survived LTP
> without any visible havoc, so I guess this approach is the best so far.

Phew! I'm of course glad to hear this: thanks for your efforts on it.

...
> > --- a/arch/s390/mm/pgalloc.c
> > +++ b/arch/s390/mm/pgalloc.c
> > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> >   * while the PP bits are never used, nor such a page is added to or removed
> >   * from mm_context_t::pgtable_list.
> > + *
> > + * pte_free_defer() overrides those rules: it takes the page off 
> > pgtable_list,
> > + * and prevents both 2K fragments from being reused. pte_free_defer() has 
> > to
> > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> 
> Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> allow reuse before grace period elapsed. And I hope that it does so, by
> setting the PP bits, which would be noticed in page_table_alloc(), in
> case the page would be seen there.
> 
> Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> end of the list, and so they could be seen in page_table_alloc(), but they
> should not be reused before grace period elapsed and __tlb_remove_table()
> cleared the PP bits, as far as I understand.
> 
> So what exactly do you mean with "which page_table_free_rcu() does not 
> actually
> guarantee"?

I'll answer without locating and re-reading what Jason explained earlier,
perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
he may have explained it better.  And without working out again all the
MMU_GATHER #defines, and which of them do and do not apply to s390 here.

The detail that sticks in my mind is the fallback in tlb_remove_table()
in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
batch the tables for freeing by RCU, and resorts instead to an immediate 
TLB flush (I think: that again involves chasing definitions) followed by
tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
and is commented: 
/*
 * This isn't an RCU grace period and hence the page-tables cannot be
 * assumed to be actually RCU-freed.
 *
 * It is however sufficient for software page-table walkers that rely on
 * IRQ disabling.
 */

Whether that's good for your PP pages or not, I've given no thought:
I've just taken it on trust that what s390 has working today is good.

If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
then I would not have written "(which page_table_free_rcu() does not
actually guarantee)".  But it cannot use call_rcu() because it does
not have an rcu_head to work with - it's in some generic code, and
there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.

And Jason would have much preferred us to address the issue from that
angle; but not only would doing so destroy my sanity, I'd also destroy
20 architectures TLB-flushing, unbuilt and untested, in the attempt.

...
> > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned 
> > long *table)
> >  */
> > mask = atomic_xor_bits(>_refcount, 0x11U << (bit + 24));
> > mask >>= 24;
> > -   if (mask & 0x03U)
> > +   if ((mask & 0x03U) && !PageActive(page)) {
> > +   /*
> > +* Other half is allocated, and neither half has had
> > +* its free deferred: add page to head 

Re: [PATCH 05/12] arch: Remove trailing whitespaces

2023-07-04 Thread Javier Martinez Canillas
Thomas Zimmermann  writes:

> Fix coding style. No functional changes.
>
> Signed-off-by: Thomas Zimmermann 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: John Paul Adrian Glaubitz 
> Cc: Andrew Morton 
> Cc: Geert Uytterhoeven 
> Cc: Arnd Bergmann 
> Cc: "Kirill A. Shutemov" 
> Cc: Anshuman Khandual 
> Cc: Niklas Schnelle 
> Cc: Zi Yan 
> Cc: "Mike Rapoport (IBM)" 
> Cc: Peter Zijlstra 
> ---

Reviewed-by: Javier Martinez Canillas 

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



Re: [PATCH 04/12] staging/sm750fb: Do not include <linux/screen_info.h>

2023-07-04 Thread Javier Martinez Canillas
Thomas Zimmermann  writes:

> The sm750fb driver does not need anything from <linux/screen_info.h>.
> Remove the include statements.
>
> Signed-off-by: Thomas Zimmermann 
> Cc: Sudip Mukherjee 
> Cc: Teddy Wang 
> ---

Reviewed-by: Javier Martinez Canillas 

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



Re: [PATCH 03/12] sysfb: Do not include <linux/screen_info.h> from sysfb header

2023-07-04 Thread Javier Martinez Canillas
Thomas Zimmermann  writes:

> The header file <linux/sysfb.h> does not need anything from
> <linux/screen_info.h>. Declare struct screen_info and remove
> the include statements.
>
> Signed-off-by: Thomas Zimmermann 
> Cc: Ard Biesheuvel 
> Cc: Hans de Goede 
> Cc: Javier Martinez Canillas 
> ---

Reviewed-by: Javier Martinez Canillas 

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



Re: [PATCH 02/12] fbdev/sm712fb: Do not include <linux/screen_info.h>

2023-07-04 Thread Javier Martinez Canillas
Thomas Zimmermann  writes:

> Sm712fb's dependency on <linux/screen_info.h> is artificial in that
> it only uses struct screen_info for its internals. Replace the use of
> struct screen_info with a custom data structure and remove the include
> of <linux/screen_info.h>.
>
> Signed-off-by: Thomas Zimmermann 
> Cc: Sudip Mukherjee 
> Cc: Teddy Wang 
> Cc: Helge Deller 
> ---

Reviewed-by: Javier Martinez Canillas 

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



Re: [PATCH 01/12] efi: Do not include <linux/screen_info.h> from EFI header

2023-07-04 Thread Javier Martinez Canillas
Thomas Zimmermann  writes:

> The header file <linux/efi.h> does not need anything from
> <linux/screen_info.h>. Declare struct screen_info and remove
> the include statements. Update a number of source files that
> require struct screen_info's definition.
>
> Signed-off-by: Thomas Zimmermann 
> Cc: Ard Biesheuvel 
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> ---

Reviewed-by: Javier Martinez Canillas 

-- 
Best regards,

Javier Martinez Canillas
Core Platforms
Red Hat



Re: Fwd: Memory corruption in multithreaded user space program while calling fork

2023-07-04 Thread Andrew Morton
On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH  wrote:

> > > > > Thanks! I'll investigate this later today. After discussing with
> > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > > the issue is fixed. I'll post a patch shortly.
> > > >
> > > > Posted at: 
> > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
> > >
> > > As that change fixes something in 6.4, why not cc: stable on it as well?
> > 
> > Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> > patch is fixing 6.4 I didn't need to send it to stable for older
> > versions. Did I miss something?
> 
> 6.4.y is a stable kernel tree right now, so yes, it needs to be included
> there :)

I'm in wait-a-few-days-mode on this.  To see if we have a backportable
fix rather than disabling the feature in -stable.



Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Hugh Dickins
On Tue, 4 Jul 2023, Alexander Gordeev wrote:
> On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> 
> Hi Hugh,
> 
> ...
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> 
> Just to make things more clear for me: do I understand correctly that this
> was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Yes, from page_table_free() called from pte_free_defer(): I had claimed
they could be put on the list (or not) without needing to consider their
HH-ness, apart from wanting to list_add_tail() rather than list_add() them.

But then realized that this category of list_add_tail() pages would block
access to the others.

But I think I was mistaken then to say "could be solved by a list_move()
somewhere"; because "somewhere" would have had to be __tlb_remove_table()
when it removes PP-bits, which would bring us back to the issues of
getting a spinlock from an mm which might already be freed.

Hugh


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Gerald Schaefer
On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
Hugh Dickins  wrote:

> On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > 
> > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > extent that I've not even tried to verify it; but I think I do get the
> > point now, that we need further info than just PPHHAA to know whether
> > the page is on the list or not.  But I think that if we move where the
> > call_rcu() is done, then the page can stay on or off the list by same
> > rules as before (but need to check HH bits along with PP when deciding
> > whether to allocate, and whether to list_add_tail() when freeing).  
> 
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).
> 
> I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> I've dropped most of the pte_free_*() helpers, and list_del_init() is
> an easier way of dealing with those "is it on the list" questions.
> I expect that we shall be close to reaching agreement on...

This looks really nice, almost too good and easy to be true. I did not
find any obvious flaw, just some comments below. It also survived LTP
without any visible havoc, so I guess this approach is the best so far.

> 
> [PATCH v? 07/12] s390: add pte_free_defer() for pgtables sharing page
> 
> Add s390-specific pte_free_defer(), to free table page via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
> 
> Build upon the existing management, adjusted to follow a new rule: that
> a page is never on the free list if pte_free_defer() was used on either
> half (marked by PageActive).  And for simplicity, delay calling RCU until
> both halves are freed.
> 
> Not adding back unallocated fragments to the list in pte_free_defer()
> can result in wasting some amount of memory for pagetables, depending
> on how long the allocated fragment will stay in use. In practice, this
> effect is expected to be insignificant, and not justify a far more
> complex approach, which might allow to add the fragments back later
> in __tlb_remove_table(), where we might not have a stable mm any more.
> 
> Signed-off-by: Hugh Dickins 
> ---
>  arch/s390/include/asm/pgalloc.h |  4 ++
>  arch/s390/mm/pgalloc.c  | 75 +++--
>  2 files changed, 67 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 17eb618f1348..89a9d5ef94f8 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
>  #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  void vmem_map_init(void);
>  void *vmem_crst_alloc(unsigned long val);
>  pte_t *vmem_pte_alloc(void);
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 66ab68db9842..fd0c4312da16 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
>   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
>   * while the PP bits are never used, nor such a page is added to or removed
>   * from mm_context_t::pgtable_list.
> + *
> + * pte_free_defer() overrides those rules: it takes the page off 
> pgtable_list,
> + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> + * guarantee that its pgtable cannot be reused before the RCU grace period
> + * has elapsed (which page_table_free_rcu() does not actually guarantee).

Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
allow reuse before grace period elapsed. And I hope that it does so, by
setting the PP bits, which would be noticed in page_table_alloc(), in
case the page would be seen there.

Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
end of the list, and so they could be seen in page_table_alloc(), but they
should not be reused before grace period elapsed and __tlb_remove_table()
cleared the PP bits, as far 

Re: [RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-07-04 Thread Yicong Yang
On 2023/6/30 1:26, Catalin Marinas wrote:
> On Thu, Jun 29, 2023 at 05:31:36PM +0100, Catalin Marinas wrote:
>> On Thu, May 18, 2023 at 02:59:34PM +0800, Yicong Yang wrote:
>>> From: Barry Song 
>>>
>>> on x86, batched and deferred tlb shootdown has led to 90%
>>> performance increase on tlb shootdown. on arm64, HW can do
>>> tlb shootdown without software IPI. But sync tlbi is still
>>> quite expensive.
>> [...]
>>>  .../features/vm/TLB/arch-support.txt  |  2 +-
>>>  arch/arm64/Kconfig|  1 +
>>>  arch/arm64/include/asm/tlbbatch.h | 12 
>>>  arch/arm64/include/asm/tlbflush.h | 33 -
>>>  arch/arm64/mm/flush.c | 69 +++
>>>  arch/x86/include/asm/tlbflush.h   |  5 +-
>>>  include/linux/mm_types_task.h |  4 +-
>>>  mm/rmap.c | 12 ++--
>>
>> First of all, this patch needs to be split in some preparatory patches
>> introducing/renaming functions with no functional change for x86. Once
>> done, you can add the arm64-only changes.
>>

got it. will try to split this patch as suggested.

>> Now, on the implementation, I had some comments on v7 but we didn't get
>> to a conclusion and the thread eventually died:
>>
>> https://lore.kernel.org/linux-mm/y7ctoj5mwd1zb...@arm.com/
>>
>> I know I said a command line argument is better than Kconfig or some
>> random number of CPUs heuristics but it would be even better if we don't
>> bother with any, just make this always on.

ok, will make this always on.

>> Barry had some comments
>> around mprotect() being racy and that's why we have
>> flush_tlb_batched_pending() but I don't think it's needed (or, for
>> arm64, it can be a DSB since this patch issues the TLBIs but without the
>> DVM Sync). So we need to clarify this (see Barry's last email on the
>> above thread) and before attempting new versions of this patchset. With
>> flush_tlb_batched_pending() removed (or DSB), I have a suspicion such
>> implementation would be faster on any SoC irrespective of the number of
>> CPUs.
> 
> I think I got the need for flush_tlb_batched_pending(). If
> try_to_unmap() marks the pte !present and we have a pending TLBI,
> change_pte_range() will skip the TLB maintenance altogether since it did
> not change the pte. So we could be left with stale TLB entries after
> mprotect() before TTU does the batch flushing.
> 
> We can have an arch-specific flush_tlb_batched_pending() that can be a
> DSB only on arm64 and a full mm flush on x86.
> 

We need to do a flush/DSB in flush_tlb_batched_pending() only in the race
condition, so we first check whether there's a pending batched flush and,
if so, do the TLB flush. The pending check is common; the difference
among the archs is how to flush the TLB within flush_tlb_batched_pending().
On arm64 it should only be a DSB.

As we only need to maintain the TLBs already pending in the batched flush,
does it make sense to only handle those TLBs in flush_tlb_batched_pending()?
Then we can use arch_tlbbatch_flush() rather than flush_tlb_mm() in
flush_tlb_batched_pending(), and no arch-specific function is needed.
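
For concreteness, a minimal sketch of the two flavours being discussed
(purely illustrative, not the posted patch; the helper names here are
assumptions, only flush_tlb_mm() and dsb() are existing primitives):

```
/* generic/x86-style handling of a pending batched flush: flush the mm */
static inline void generic_flush_tlb_batched_pending(struct mm_struct *mm)
{
	flush_tlb_mm(mm);
}

/*
 * What an arm64 variant could reduce to: the TLBIs for the batch have
 * already been issued without the trailing sync, so only wait for them
 * to complete here.
 */
static inline void arm64_flush_tlb_batched_pending(struct mm_struct *mm)
{
	dsb(ish);
}
```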

Thanks.



Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Alexander Gordeev
On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> On Thu, 29 Jun 2023, Hugh Dickins wrote:

Hi Hugh,

...
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).

Just to make things more clear for me: do I understand correctly that this
was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Thanks!


Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-04 Thread Peter Zijlstra
On Mon, May 15, 2023 at 01:46:01PM +0200, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not 
> of the same size.
> 
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390 
> architectures.

Mostly; there's historically the cpuset case where we can create
lopsided groups like that. And today we're growing all these hybrid
things that will also tickle this, except they're looking towards
different balancer extensions to deal with the IPC difference so might
not be immediately caring about this here issue.


> Signed-off-by: Tobias Huschle 
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct 
> lb_env *env)
>* group's child domain.
>*/
>   if (sds.prefer_sibling && local->group_type == group_has_spare &&
> - busiest->sum_nr_running > local->sum_nr_running + 1)
> + busiest->sum_nr_running * local->group_weight >
> + local->sum_nr_running * busiest->group_weight + 1)

Should that not be: busiest->group_weight * (local->sum_nr_running + 1) ?

I'm not opposed to this -- it seems fairly straight forward.
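
For reference, with that grouping the scaled check would read as below (a
sketch of how the condition would look with the correction Peter is asking
about; the unscaled original was busiest->sum_nr_running >
local->sum_nr_running + 1, so the "+ 1" gets scaled by busiest->group_weight
along with the rest):

```
	if (sds.prefer_sibling && local->group_type == group_has_spare &&
	    busiest->sum_nr_running * local->group_weight >
	    busiest->group_weight * (local->sum_nr_running + 1))
		goto force_balance;
```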

>   goto force_balance;
>  
>   if (busiest->group_type != group_overloaded) {
> -- 
> 2.34.1
> 


Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-04 Thread Tobias Huschle

On 2023-05-16 18:35, Dietmar Eggemann wrote:

On 15/05/2023 13:46, Tobias Huschle wrote:
The current load balancer implementation implies that scheduler groups,
within the same scheduler domain, all host the same number of CPUs.
This appears to be valid for non-s390 architectures. Nevertheless, s390
can actually have scheduler groups of unequal size.


Arm (classical) big.LITTLE had this for years before we switched to flat
scheduling (only MC sched domain) over CPU capacity boundaries for Arm
DynamIQ.

Arm64 Juno platform in mainline:

root@juno:~# cat 
/sys/devices/system/cpu/cpu*/topology/cluster_cpus_list

0,3-5
1-2
1-2
0,3-5
0,3-5
0,3-5

root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'

domain0 39 <--
domain1 3f
domain0 06 <--
domain1 3f
domain0 06
domain1 3f
domain0 39
domain1 3f
domain0 39
domain1 3f
domain0 39
domain1 3f

root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
MC
DIE

But we don't have SMT on the mobile processors.

It looks like you are only interested to get group_weight dependency
into this 'prefer_sibling' condition of find_busiest_group()?


Sorry, looks like your reply hit some bad filter of my mail program.
Let me answer, although it's a bit late.

Yes, I would like to get the group_weight into the prefer_sibling path.
Unfortunately, we cannot go for a flat hierarchy, as the s390 hardware
allows CPUs to be pretty far apart (cache-wise), which means the load
balancer should avoid moving tasks back and forth between those CPUs if
possible.

We can't remove SD_PREFER_SIBLING either, as this would cause the load
balancer to aim for having the same number of idle CPUs in all groups,
which is a problem as well in asymmetric groups, for example:

With SD_PREFER_SIBLING, aiming for same number of non-idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
x x x x  x  x  x  x

Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs
00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
x  x  x x  x x x  x


Hence the idea to add the group_weight to the prefer_sibling path.

I was wondering if this would be the right place to address this issue
or if I should go down another route.


We in (classical) big.LITTLE (sd flag SD_ASYM_CPUCAPACITY) remove
SD_PREFER_SIBLING from sd->child so we don't run this condition.


The current scheduler behavior causes some s390 configs to use SMT
while some cores are still idle, leading to a performance degradation
under certain levels of workload.

Please refer to the patch's commit message for more details and an
example. This patch is a proposal on how to integrate the size of
scheduler groups into the decision process.

This patch is the most basic approach to address this issue and does
not claim to be perfect as-is.

Other ideas that also proved to address the problem but are more
complex but also potentially more precise:
  1. On scheduler group building, count the number of CPUs within each
     group that are first in their sibling mask. This represents the
     number of CPUs that can be used before running into SMT. This
     should be slightly more accurate than using the full group weight
     if the number of available SMT threads per core varies (a rough
     sketch follows this list).
  2. Introduce a new scheduler group classification (smt_busy) in
 between of fully_busy and has_spare. This classification would
 indicate that a group still has spare capacity, but will run
 into SMT when using that capacity. This would make the load
 balancer prefer groups with fully idle CPUs over ones that are
 about to run into SMT.
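
A rough sketch of what the counting in idea 1 could look like (illustrative
only; it just uses the existing cpumask helpers, while the surrounding code
and where this value would be stored are left open):

```
	/*
	 * CPUs in this group that lead their SMT sibling mask, i.e. the
	 * number of tasks the group can take before resorting to SMT.
	 */
	unsigned int smt_free_cpus = 0;
	int cpu;

	for_each_cpu(cpu, sched_group_span(group))
		if (cpu == cpumask_first(cpu_smt_mask(cpu)))
			smt_free_cpus++;
```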

Feedback would be greatly appreciated.

Tobias Huschle (1):
  sched/fair: Consider asymmetric scheduler groups in load balancer

 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)





Re: Fwd: Memory corruption in multithreaded user space program while calling fork

2023-07-04 Thread Greg KH
On Tue, Jul 04, 2023 at 12:45:39AM -0700, Suren Baghdasaryan wrote:
> On Mon, Jul 3, 2023 at 11:44 AM Greg KH  wrote:
> >
> > On Mon, Jul 03, 2023 at 11:27:19AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten
> > > > Leemhuis)  wrote:
> > > > >
> > > > > On 02.07.23 14:27, Bagas Sanjaya wrote:
> > > > > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > > > > >
> > > > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed 
> > > > > >> frequent but random crashes in a user space program.  After a lot 
> > > > > >> of reduction, I have come up with the following reproducer program:
> > > > > > [...]
> > > > > >> After tuning the various parameters for my computer, exit code 2, 
> > > > > >> which indicates that memory corruption was detected, occurs 
> > > > > >> approximately 99% of the time.  Exit code 1, which occurs 
> > > > > >> approximately 1% of the time, means it ran out of 
> > > > > >> statically-allocated memory before reproducing the issue, and 
> > > > > >> increasing the memory usage any more only leads to diminishing 
> > > > > >> returns.  There is also something like a 0.1% chance that it 
> > > > > >> segfaults due to memory corruption elsewhere than in the 
> > > > > >> statically-allocated buffer.
> > > > > >>
> > > > > >> With this reproducer in hand, I was able to perform the following 
> > > > > >> bisection:
> > > > > > [...]
> > > > > >
> > > > > > See Bugzilla for the full thread.
> > > > >
> > > > > Additional details from
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
> > > > >
> > > > > ```
> > > > > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> > > > > reverted no longer causes any memory corruption with either my
> > > > > reproducer or the original program.
> > > > > ```
> > > > >
> > > > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
> > > > > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already 
> > > > > CCed]]
> > > > >
> > > > > That's the same commit that causes build problems with go:
> > > > >
> > > > > https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf...@kernel.org/
> > > >
> > > > Thanks! I'll investigate this later today. After discussing with
> > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > > the issue is fixed. I'll post a patch shortly.
> > >
> > > Posted at: 
> > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
> >
> > As that change fixes something in 6.4, why not cc: stable on it as well?
> 
> Sorry, I thought since per-VMA locks were introduced in 6.4 and this
> patch is fixing 6.4 I didn't need to send it to stable for older
> versions. Did I miss something?

6.4.y is a stable kernel tree right now, so yes, it needs to be included
there :)



[PATCH v7 8/8] KVM: remove __gfn_to_pfn_memslot

2023-07-04 Thread David Stevens
From: David Stevens 

All callers have been migrated to __kvm_follow_pfn.

Signed-off-by: David Stevens 
---
 virt/kvm/kvm_main.c | 33 -
 1 file changed, 33 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f7b41f220b6..5b5afd70f239 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2712,39 +2712,6 @@ kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
 }
 EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
 
-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
-  bool atomic, bool interruptible, bool *async,
-  bool write_fault, bool *writable, hva_t *hva)
-{
-   kvm_pfn_t pfn;
-   struct kvm_follow_pfn foll = {
-   .slot = slot,
-   .gfn = gfn,
-   .flags = FOLL_GET,
-   .atomic = atomic,
-   .allow_write_mapping = !!writable,
-   };
-
-   if (write_fault)
-   foll.flags |= FOLL_WRITE;
-   if (async)
-   foll.flags |= FOLL_NOWAIT;
-   if (interruptible)
-   foll.flags |= FOLL_INTERRUPTIBLE;
-
-   pfn = __kvm_follow_pfn(&foll);
-   if (pfn == KVM_PFN_ERR_NEEDS_IO) {
-   *async = true;
-   pfn = KVM_PFN_ERR_FAULT;
-   }
-   if (hva)
-   *hva = foll.hva;
-   if (writable)
-   *writable = foll.writable;
-   return pfn;
-}
-EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
-
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
  bool *writable)
 {
-- 
2.41.0.255.g8b1d071c50-goog



[PATCH v7 7/8] KVM: PPC: Migrate to __kvm_follow_pfn

2023-07-04 Thread David Stevens
From: David Stevens 

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. As part of the
refactoring, remove the redundant calls to get_user_page_fast_only,
since the check for !async && !atomic was removed from the KVM generic
code in b9b33da2aa74. Also, remove the kvm_ro parameter because the KVM
generic code handles RO memslots.

Signed-off-by: David Stevens 
---
I have checked that this patch compiles, but I don't have the hardware
to test it myself.

 arch/powerpc/include/asm/kvm_book3s.h  |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c| 38 +---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 50 +++---
 arch/powerpc/kvm/book3s_hv_nested.c|  4 +--
 4 files changed, 38 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..bf48c511e700 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -202,7 +202,7 @@ extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool 
nested,
 extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
-   bool writing, bool kvm_ro,
+   bool writing,
pte_t *inserted_pte, unsigned int *levelp);
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
 extern void kvmppc_free_radix(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..9a4715e73937 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -523,6 +523,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
unsigned long rcbits;
long mmio_update;
pte_t pte, *ptep;
+   struct kvm_follow_pfn foll = {
+   .allow_write_mapping = true,
+   };
 
if (kvm_is_radix(kvm))
return kvmppc_book3s_radix_page_fault(vcpu, ea, dsisr);
@@ -599,29 +602,20 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
page = NULL;
writing = (dsisr & DSISR_ISSTORE) != 0;
/* If writing != 0, then the HPTE must allow writing, if we get here */
-   write_ok = writing;
-   hva = gfn_to_hva_memslot(memslot, gfn);
 
-   /*
-* Do a fast check first, since __gfn_to_pfn_memslot doesn't
-* do it with !atomic && !async, which is how we call it.
-* We always ask for write permission since the common case
-* is that the page is writable.
-*/
-   if (get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
-   write_ok = true;
-   } else {
-   /* Call KVM generic code to do the slow-path check */
-   pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-  writing, &write_ok, NULL);
-   if (is_error_noslot_pfn(pfn))
-   return -EFAULT;
-   page = NULL;
-   if (pfn_valid(pfn)) {
-   page = pfn_to_page(pfn);
-   if (PageReserved(page))
-   page = NULL;
-   }
+   foll.slot = memslot;
+   foll.gfn = gfn;
+   foll.flags = FOLL_GET | (writing ? FOLL_WRITE : 0);
+   pfn = __kvm_follow_pfn(&foll);
+   if (is_error_noslot_pfn(pfn))
+   return -EFAULT;
+   page = NULL;
+   write_ok = foll.writable;
+   hva = foll.hva;
+   if (pfn_valid(pfn)) {
+   page = pfn_to_page(pfn);
+   if (PageReserved(page))
+   page = NULL;
}
 
/*
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..339d1efcb6c9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -815,47 +815,39 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool 
nested, bool writing,
 int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
   unsigned long gpa,
   struct kvm_memory_slot *memslot,
-  bool writing, bool kvm_ro,
+  bool writing,
   pte_t *inserted_pte, unsigned int *levelp)
 {
struct kvm *kvm = vcpu->kvm;
struct page *page = NULL;
unsigned long mmu_seq;
-   unsigned long hva, gfn = gpa >> PAGE_SHIFT;
-   bool upgrade_write = false;
-   bool *upgrade_p = &upgrade_write;
+   unsigned long hva, pfn, gfn = gpa >> PAGE_SHIFT;
+   bool upgrade_write;
pte_t pte, *ptep;
unsigned int shift, level;
int ret;
bool large_enable;
+   struct kvm_follow_pfn foll = {
+   .slot = memslot,
+   .gfn = gfn,
+   .flags = 

[PATCH v7 6/8] KVM: arm64: Migrate to __kvm_follow_pfn

2023-07-04 Thread David Stevens
From: David Stevens 

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.

Signed-off-by: David Stevens 
---
 arch/arm64/kvm/mmu.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6db9ef288ec3..c706530d304d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1334,7 +1334,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
  unsigned long fault_status)
 {
int ret = 0;
-   bool write_fault, writable, force_pte = false;
+   bool write_fault = kvm_is_write_fault(vcpu);
+   bool force_pte = false;
bool exec_fault, mte_allowed;
bool device = false;
unsigned long mmu_seq;
@@ -1342,16 +1343,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
struct vm_area_struct *vma;
short vma_shift;
-   gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
+   struct kvm_follow_pfn foll = {
+   .slot = memslot,
+   .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
+   .allow_write_mapping = true,
+   };
 
fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
-   write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
VM_BUG_ON(write_fault && exec_fault);
 
@@ -1425,7 +1429,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
fault_ipa &= ~(vma_pagesize - 1);
 
-   gfn = fault_ipa >> PAGE_SHIFT;
+   foll.gfn = fault_ipa >> PAGE_SHIFT;
mte_allowed = kvm_vma_mte_allowed(vma);
 
/* Don't use the VMA after the unlock -- it may have vanished */
@@ -1433,7 +1437,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 
/*
 * Read mmu_invalidate_seq so that KVM can detect if the results of
-* vma_lookup() or __gfn_to_pfn_memslot() become stale prior to
+* vma_lookup() or __kvm_follow_pfn() become stale prior to
 * acquiring kvm->mmu_lock.
 *
 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
@@ -1442,8 +1446,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
mmu_seq = vcpu->kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);
 
-   pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-  write_fault, &writable, NULL);
+   pfn = __kvm_follow_pfn(&foll);
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
@@ -1468,7 +1471,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 * Only actually map the page as writable if this was a write
 * fault.
 */
-   writable = false;
+   foll.writable = false;
}
 
if (exec_fault && device)
@@ -1508,7 +1511,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
}
}
 
-   if (writable)
+   if (foll.writable)
prot |= KVM_PGTABLE_PROT_W;
 
if (exec_fault)
@@ -1534,9 +1537,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
 KVM_PGTABLE_WALK_SHARED);
 
/* Mark the page dirty only if the fault is handled successfully */
-   if (writable && !ret) {
+   if (foll.writable && !ret) {
kvm_set_pfn_dirty(pfn);
-   mark_page_dirty_in_slot(kvm, memslot, gfn);
+   mark_page_dirty_in_slot(kvm, memslot, foll.gfn);
}
 
 out_unlock:
-- 
2.41.0.255.g8b1d071c50-goog



[PATCH v7 4/8] KVM: x86/mmu: Migrate to __kvm_follow_pfn

2023-07-04 Thread David Stevens
From: David Stevens 

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.

Signed-off-by: David Stevens 
---
 arch/x86/kvm/mmu/mmu.c | 35 +--
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ec169f5c7dce..e44ab512c3a1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4296,7 +4296,12 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, 
struct kvm_async_pf *work)
 static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
 {
struct kvm_memory_slot *slot = fault->slot;
-   bool async;
+   struct kvm_follow_pfn foll = {
+   .slot = slot,
+   .gfn = fault->gfn,
+   .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+   .allow_write_mapping = true,
+   };
 
/*
 * Retry the page fault if the gfn hit a memslot that is being deleted
@@ -4325,12 +4330,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
return RET_PF_EMULATE;
}
 
-   async = false;
-   fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
- fault->write, &fault->map_writable,
- &fault->hva);
-   if (!async)
-   return RET_PF_CONTINUE; /* *pfn has correct page already */
+   foll.flags |= FOLL_NOWAIT;
+   fault->pfn = __kvm_follow_pfn(&foll);
+
+   if (!is_error_noslot_pfn(fault->pfn))
+   goto success;
+
+   if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
+   return RET_PF_CONTINUE;
 
if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
trace_kvm_try_async_get_page(fault->addr, fault->gfn);
@@ -4348,9 +4355,17 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
 * to wait for IO.  Note, gup always bails if it is unable to quickly
 * get a page and a fatal signal, i.e. SIGKILL, is pending.
 */
-   fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
- fault->write, &fault->map_writable,
- &fault->hva);
+   foll.flags |= FOLL_INTERRUPTIBLE;
+   foll.flags &= ~FOLL_NOWAIT;
+   fault->pfn = __kvm_follow_pfn(&foll);
+
+   if (!is_error_noslot_pfn(fault->pfn))
+   goto success;
+
+   return RET_PF_CONTINUE;
+success:
+   fault->hva = foll.hva;
+   fault->map_writable = foll.writable;
return RET_PF_CONTINUE;
 }
 
-- 
2.41.0.255.g8b1d071c50-goog



[PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

2023-07-04 Thread David Stevens
From: David Stevens 

Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
memory into the guest that is backed by un-refcounted struct pages - for
example, higher order non-compound pages allocated by the amdgpu driver
via ttm_pool_alloc_page.

The bulk of this change is tracking the is_refcounted_page flag so that
non-refcounted pages don't trigger page_count() == 0 warnings. This is
done by storing the flag in an unused bit in the sptes.

Signed-off-by: David Stevens 
---
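A rough sketch of the flag check used below (the bit number and macro
name are assumptions for illustration; the real definitions are in the
spte.h hunk of this patch):

	/* Software-available SPTE bit: pfn is backed by a refcounted page. */
	#define SPTE_MMU_PAGE_REFCOUNTED	BIT_ULL(59)

	static inline bool is_refcounted_page_pte(u64 spte)
	{
		return !!(spte & SPTE_MMU_PAGE_REFCOUNTED);
	}
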
 arch/x86/kvm/mmu/mmu.c  | 44 +
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/mmu/paging_tmpl.h  |  9 ---
 arch/x86/kvm/mmu/spte.c |  4 ++-
 arch/x86/kvm/mmu/spte.h | 12 -
 arch/x86/kvm/mmu/tdp_mmu.c  | 22 ++---
 6 files changed, 62 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e44ab512c3a1..b1607e314497 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
 
if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
flush = true;
-   kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+   if (is_refcounted_page_pte(old_spte))
+   
kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}
 
if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
flush = true;
-   kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+   if (is_refcounted_page_pte(old_spte))
+   kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
}
 
return flush;
@@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 
*sptep)
 * before they are reclaimed.  Sanity check that, if the pfn is backed
 * by a refcounted page, the refcount is elevated.
 */
-   page = kvm_pfn_to_refcounted_page(pfn);
-   WARN_ON(page && !page_count(page));
+   if (is_refcounted_page_pte(old_spte)) {
+   page = kvm_pfn_to_refcounted_page(pfn);
+   WARN_ON(!page || !page_count(page));
+   }
 
-   if (is_accessed_spte(old_spte))
-   kvm_set_pfn_accessed(pfn);
+   if (is_refcounted_page_pte(old_spte)) {
+   if (is_accessed_spte(old_spte))
+   kvm_set_page_accessed(pfn_to_page(pfn));
 
-   if (is_dirty_spte(old_spte))
-   kvm_set_pfn_dirty(pfn);
+   if (is_dirty_spte(old_spte))
+   kvm_set_page_dirty(pfn_to_page(pfn));
+   }
 
return old_spte;
 }
@@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
 * Capture the dirty status of the page, so that it doesn't get
 * lost when the SPTE is marked for access tracking.
 */
-   if (is_writable_pte(spte))
-   kvm_set_pfn_dirty(spte_to_pfn(spte));
+   if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+   kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
 
spte = mark_spte_for_access_track(spte);
mmu_spte_update_no_track(sptep, spte);
@@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
 {
bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
   (unsigned long *)sptep);
-   if (was_writable && !spte_ad_enabled(*sptep))
-   kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+   if (was_writable && !spte_ad_enabled(*sptep) && 
is_refcounted_page_pte(*sptep))
+   kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
 
return was_writable;
 }
@@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct 
kvm_memory_slot *slot,
bool host_writable = !fault || fault->map_writable;
bool prefetch = !fault || fault->prefetch;
bool write_fault = fault && fault->write;
+   bool is_refcounted = !fault || fault->is_refcounted_page;
 
pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
 *sptep, write_fault, gfn);
@@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct 
kvm_memory_slot *slot,
}
 
wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, 
prefetch,
-  true, host_writable, &spte);
+  true, host_writable, is_refcounted, &spte);
 
if (*sptep == spte) {
ret = RET_PF_SPURIOUS;
@@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = fault->gfn,
-   .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+   .flags = fault->write ? FOLL_WRITE : 0,

[PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

2023-07-04 Thread David Stevens
From: David Stevens 

Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
callers to resolve a gfn when the associated pfn has a valid struct page
that isn't being actively refcounted (e.g. tail pages of non-compound
higher order pages). For a caller to safely omit FOLL_GET, all usages of
the returned pfn must be guarded by a mmu notifier.

This also adds an is_refcounted_page out parameter to struct
kvm_follow_pfn that is set when the returned pfn has an associated
struct page with a valid refcount. Callers that don't pass FOLL_GET
should remember this value and use it to avoid places like
kvm_is_ad_tracked_page that assume a non-zero refcount.

Signed-off-by: David Stevens 
---
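A minimal usage sketch of the new mode (hypothetical caller, not taken
from this series): a caller that omits FOLL_GET must set
guarded_by_mmu_notifier and can use is_refcounted_page to decide
whether it is safe to touch the struct page:

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = write_fault ? FOLL_WRITE : 0,	/* no FOLL_GET */
		.allow_write_mapping = true,
		.guarded_by_mmu_notifier = true,	/* required without FOLL_GET */
	};
	kvm_pfn_t pfn = __kvm_follow_pfn(&foll);

	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	/* Only do refcount-based A/D tracking for refcounted pages. */
	if (foll.is_refcounted_page)
		kvm_set_page_accessed(pfn_to_page(pfn));
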
 include/linux/kvm_host.h | 10 ++
 virt/kvm/kvm_main.c  | 67 +---
 virt/kvm/pfncache.c  |  2 +-
 3 files changed, 47 insertions(+), 32 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ef2763c2b12e..a45308c7d2d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct 
kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
 struct kvm_follow_pfn {
const struct kvm_memory_slot *slot;
gfn_t gfn;
@@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
bool atomic;
/* Allow a read fault to create a writeable mapping. */
bool allow_write_mapping;
+   /*
+* Usage of the returned pfn will be guarded by a mmu notifier. Must
+* be true if FOLL_GET is not set.
+*/
+   bool guarded_by_mmu_notifier;
 
/* Outputs of __kvm_follow_pfn */
hva_t hva;
bool writable;
+   /* True if the returned pfn is for a page with a valid refcount. */
+   bool is_refcounted_page;
 };
 
 kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b13f22861d2f..0f7b41f220b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, 
kvm_pfn_t *pfn)
if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
*pfn = page_to_pfn(page[0]);
foll->writable = foll->allow_write_mapping;
+   foll->is_refcounted_page = true;
+   if (!(foll->flags & FOLL_GET))
+   put_page(page[0]);
return true;
}
 
@@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, 
kvm_pfn_t *pfn)
return npages;
 
foll->writable = (foll->flags & FOLL_WRITE) && 
foll->allow_write_mapping;
+   foll->is_refcounted_page = true;
 
/* map read fault as writable if possible */
if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, 
kvm_pfn_t *pfn)
}
}
*pfn = page_to_pfn(page);
+   if (!(foll->flags & FOLL_GET))
+   put_page(page);
return npages;
 }
 
@@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, 
bool write_fault)
return true;
 }
 
-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
-   struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
-   if (!page)
-   return 1;
-
-   return get_page_unless_zero(page);
-}
-
 static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct 
kvm_follow_pfn *foll,
   kvm_pfn_t *p_pfn)
 {
@@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma, struct kvm_follow_pfn
pte_t *ptep;
spinlock_t *ptl;
bool write_fault = foll->flags & FOLL_WRITE;
+   struct page *page;
int r;
 
r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
@@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma, struct kvm_follow_pfn
pfn = pte_pfn(*ptep);
 
/*
-* Get a reference here because callers of *hva_to_pfn* and
-* *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
-* returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
-* set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
-* simply do nothing for reserved pfns.
-*
-* Whoever called remap_pfn_range is also going to call e.g.
-* unmap_mapping_range before the underlying pages are freed,
-* causing a call to our MMU notifier.
+* Now deal with reference counting. If kvm_pfn_to_refcounted_page
+* returns NULL, then there's no refcount to worry about.
 *
-* Certain IO or PFNMAP mappings can be backed with valid
-* struct pages, but be allocated without 

[PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

2023-07-04 Thread David Stevens
From: David Stevens 

Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
__kvm_follow_pfn refactors the old API's arguments into a struct and,
where possible, combines the boolean arguments into a single flags
argument.

Signed-off-by: David Stevens 
---
 include/linux/kvm_host.h |  16 
 virt/kvm/kvm_main.c  | 171 ++-
 virt/kvm/kvm_mm.h|   3 +-
 virt/kvm/pfncache.c  |   8 +-
 4 files changed, 122 insertions(+), 76 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..ef2763c2b12e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
 #define KVM_PFN_ERR_HWPOISON   (KVM_PFN_ERR_MASK + 1)
 #define KVM_PFN_ERR_RO_FAULT   (KVM_PFN_ERR_MASK + 2)
 #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_NEEDS_IO   (KVM_PFN_ERR_MASK + 4)
 
 /*
  * error pfns indicate that the gfn is in slot but faild to
@@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct 
kvm_memory_slot *slot, gfn_t gfn,
 void kvm_release_page_clean(struct page *page);
 void kvm_release_page_dirty(struct page *page);
 
+struct kvm_follow_pfn {
+   const struct kvm_memory_slot *slot;
+   gfn_t gfn;
+   unsigned int flags;
+   bool atomic;
+   /* Allow a read fault to create a writeable mapping. */
+   bool allow_write_mapping;
+
+   /* Outputs of __kvm_follow_pfn */
+   hva_t hva;
+   bool writable;
+};
+
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
+
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
  bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 371bd783ff2b..b13f22861d2f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned 
long addr)
  * true indicates success, otherwise false is returned.  It's also the
  * only part that runs if we can in atomic context.
  */
-static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
-   bool *writable, kvm_pfn_t *pfn)
+static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 {
struct page *page[1];
+   bool write_fault = foll->flags & FOLL_WRITE;
 
/*
 * Fast pin a writable pfn only if it is a write fault request
 * or the caller allows to map a writable pfn for a read fault
 * request.
 */
-   if (!(write_fault || writable))
+   if (!(write_fault || foll->allow_write_mapping))
return false;
 
-   if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+   if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
*pfn = page_to_pfn(page[0]);
-
-   if (writable)
-   *writable = true;
+   foll->writable = foll->allow_write_mapping;
return true;
}
 
@@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool 
write_fault,
  * The slow path to get the pfn of the specified host virtual address,
  * 1 indicates success, -errno is returned if error is detected.
  */
-static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-  bool interruptible, bool *writable, kvm_pfn_t *pfn)
+static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 {
-   unsigned int flags = FOLL_HWPOISON;
+   unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
struct page *page;
int npages;
 
might_sleep();
 
-   if (writable)
-   *writable = write_fault;
-
-   if (write_fault)
-   flags |= FOLL_WRITE;
-   if (async)
-   flags |= FOLL_NOWAIT;
-   if (interruptible)
-   flags |= FOLL_INTERRUPTIBLE;
-
-   npages = get_user_pages_unlocked(addr, 1, &page, flags);
+   npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
if (npages != 1)
return npages;
 
+   foll->writable = (foll->flags & FOLL_WRITE) && 
foll->allow_write_mapping;
+
/* map read fault as writable if possible */
-   if (unlikely(!write_fault) && writable) {
+   if (unlikely(!foll->writable) && foll->allow_write_mapping) {
struct page *wpage;
 
-   if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
-   *writable = true;
+   if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
+   foll->writable = true;
put_page(page);
page = wpage;
}
@@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
return get_page_unless_zero(page);
 }
 
-static int hva_to_pfn_remapped(struct vm_area_struct *vma,
-  unsigned long addr, bool 

[PATCH v7 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty

2023-07-04 Thread David Stevens
From: Sean Christopherson 

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free.  Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()!  For
hugepages, the returned struct page may be a tailpage and thus not have
its own refcount.

Signed-off-by: Sean Christopherson 
---
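A tiny illustration of the page_count() vs. page_ref_count() point
above (hypothetical snippet, assuming 'head' points at a compound
head page):

	struct page *tail = &head[1];	/* a tail page of the hugepage */

	/*
	 * page_ref_count() reads the tail page's own _refcount, which is
	 * typically 0; page_count() resolves to the compound head and sees
	 * the reference that actually keeps the hugepage alive.
	 */
	WARN_ON_ONCE(!page_count(tail));	/* does not fire */
	WARN_ON_ONCE(!page_ref_count(tail));	/* would fire spuriously */
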
 virt/kvm/kvm_main.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b838c8f71349..371bd783ff2b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2885,6 +2885,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);
 
 static bool kvm_is_ad_tracked_page(struct page *page)
 {
+   /*
+* Assert that KVM isn't attempting to mark a freed page as Accessed or
+* Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug.  KVM
+* (typically) doesn't pin pages that are mapped in KVM's MMU, and
+* instead relies on mmu_notifiers to know when a mapping needs to be
+* zapped/invalidated.  Unmapping from KVM's MMU must happen _before_
+* KVM returns from its mmu_notifier, i.e. the page should have an
+* elevated refcount at this point even though KVM doesn't hold a
+* reference of its own.
+*/
+   if (WARN_ON_ONCE(!page_count(page)))
+   return false;
+
/*
 * Per page-flags.h, pages tagged PG_reserved "should in general not be
 * touched (e.g. set dirty) except by its owner".
-- 
2.41.0.255.g8b1d071c50-goog



[PATCH v7 0/8] KVM: allow mapping non-refcounted pages

2023-07-04 Thread David Stevens
From: David Stevens 

This patch series adds support for mapping VM_IO and VM_PFNMAP memory
that is backed by struct pages that aren't currently being refcounted
(e.g. tail pages of non-compound higher order allocations) into the
guest.

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page.

First, this series replaces the __gfn_to_pfn_memslot API with a more
extensible __kvm_follow_pfn API. The updated API rearranges
__gfn_to_pfn_memslot's args into a struct and, where possible, packs the
bool arguments into a FOLL_ flags argument. The refactoring does not
change any behavior, except as noted in the PPC change.
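
For illustration, a hypothetical caller conversion looks roughly like
this (names are placeholders; the real conversions are in patches 4-7):

	/* Before: bool-heavy call with several out parameters. */
	pfn = __gfn_to_pfn_memslot(slot, gfn, false, false, NULL,
				   write_fault, &writable, &hva);

	/* After: arguments packed into struct kvm_follow_pfn. */
	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
		.allow_write_mapping = true,
	};

	pfn = __kvm_follow_pfn(&foll);
	writable = foll.writable;
	hva = foll.hva;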

When introduced in the refactoring, __kvm_follow_pfn implies FOLL_GET
to preserve existing behavior. From there, the API is made to support
mapping non-refcounted pages by respecting the FOLL_GET flag.

This series only adds support for non-refcounted pages to the x86 MMU.
Other MMUs can likely be updated without too much difficulty, but it is
not needed at this point. Updating other parts of KVM (e.g. pfncache) is
not straightforward [2].

[1]
https://patchwork.kernel.org/project/dri-devel/cover/20200814024000.2485-1-gurchetansi...@chromium.org/
[2] https://lore.kernel.org/all/zbeeqtmtnpaeq...@google.com/

v6 -> v7:
 - Replace __gfn_to_pfn_memslot with a more flexible __kvm_follow_pfn,
   and extend that API to support non-refcounted pages.
v5 -> v6:
 - rebase on kvm next branch
 - rename gfn_to_pfn_page to gfn_to_pfn_noref
 - fix uninitialized outparam in error case of __kvm_faultin_pfn
 - add kvm_release_pfn_noref_clean for releasing pfn/page pair
v4 -> v5:
 - rebase on kvm next branch again
v3 -> v4:
 - rebase on kvm next branch again
 - Add some more context to a comment in ensure_pfn_ref
v2 -> v3:
 - rebase on kvm next branch
v1 -> v2:
 - Introduce new gfn_to_pfn_page functions instead of modifying the
   behavior of existing gfn_to_pfn functions, to make the change less
   invasive.
 - Drop changes to mmu_audit.c
 - Include Nicholas Piggin's patch to avoid corrupting refcount in the
   follow_pte case, and use it in depreciated gfn_to_pfn functions.
 - Rebase on kvm/next
David Stevens (7):
  KVM: Introduce __kvm_follow_pfn function
  KVM: Make __kvm_follow_pfn not imply FOLL_GET
  KVM: x86/mmu: Migrate to __kvm_follow_pfn
  KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
  KVM: arm64: Migrate to __kvm_follow_pfn
  KVM: PPC: Migrate to __kvm_follow_pfn
  KVM: remove __gfn_to_pfn_memslot

Sean Christopherson (1):
  KVM: Assert that a page's refcount is elevated when marking
accessed/dirty

 arch/arm64/kvm/mmu.c   |  25 +--
 arch/powerpc/include/asm/kvm_book3s.h  |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  38 ++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  50 +++---
 arch/powerpc/kvm/book3s_hv_nested.c|   4 +-
 arch/x86/kvm/mmu/mmu.c |  77 ++---
 arch/x86/kvm/mmu/mmu_internal.h|   1 +
 arch/x86/kvm/mmu/paging_tmpl.h |   9 +-
 arch/x86/kvm/mmu/spte.c|   4 +-
 arch/x86/kvm/mmu/spte.h|  12 +-
 arch/x86/kvm/mmu/tdp_mmu.c |  22 +--
 include/linux/kvm_host.h   |  26 +++
 virt/kvm/kvm_main.c| 210 +
 virt/kvm/kvm_mm.h  |   3 +-
 virt/kvm/pfncache.c|   8 +-
 15 files changed, 282 insertions(+), 209 deletions(-)

-- 
2.41.0.255.g8b1d071c50-goog



Re: Fwd: Memory corruption in multithreaded user space program while calling fork

2023-07-04 Thread Suren Baghdasaryan
On Mon, Jul 3, 2023 at 11:44 AM Greg KH  wrote:
>
> On Mon, Jul 03, 2023 at 11:27:19AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten
> > > Leemhuis)  wrote:
> > > >
> > > > On 02.07.23 14:27, Bagas Sanjaya wrote:
> > > > > I notice a regression report on Bugzilla [1]. Quoting from it:
> > > > >
> > > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed 
> > > > >> frequent but random crashes in a user space program.  After a lot of 
> > > > >> reduction, I have come up with the following reproducer program:
> > > > > [...]
> > > > >> After tuning the various parameters for my computer, exit code 2, 
> > > > >> which indicates that memory corruption was detected, occurs 
> > > > >> approximately 99% of the time.  Exit code 1, which occurs 
> > > > >> approximately 1% of the time, means it ran out of 
> > > > >> statically-allocated memory before reproducing the issue, and 
> > > > >> increasing the memory usage any more only leads to diminishing 
> > > > >> returns.  There is also something like a 0.1% chance that it 
> > > > >> segfaults due to memory corruption elsewhere than in the 
> > > > >> statically-allocated buffer.
> > > > >>
> > > > >> With this reproducer in hand, I was able to perform the following 
> > > > >> bisection:
> > > > > [...]
> > > > >
> > > > > See Bugzilla for the full thread.
> > > >
> > > > Additional details from
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 :
> > > >
> > > > ```
> > > > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829
> > > > reverted no longer causes any memory corruption with either my
> > > > reproducer or the original program.
> > > > ```
> > > >
> > > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling
> > > > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already 
> > > > CCed]]
> > > >
> > > > That's the same commit that causes build problems with go:
> > > >
> > > > https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf...@kernel.org/
> > >
> > > Thanks! I'll investigate this later today. After discussing with
> > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until
> > > the issue is fixed. I'll post a patch shortly.
> >
> > Posted at: 
> > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/
>
> As that change fixes something in 6.4, why not cc: stable on it as well?

Sorry, I thought since per-VMA locks were introduced in 6.4 and this
patch is fixing 6.4 I didn't need to send it to stable for older
versions. Did I miss something?
Thanks,
Suren.

>
> thanks,
>
> greg k-h