date:20240220

Re: [kvm-unit-tests PATCH v5 7/8] Add common/ directory for architecture-independent tests

2024-02-20 Thread Andrew Jones

On Wed, Feb 21, 2024 at 01:27:56PM +1000, Nicholas Piggin wrote:
> x86/sieve.c is used by s390x, arm, and riscv via symbolic link. Make a
> new directory common/ for architecture-independent tests and move
> sieve.c here.
> 
> Reviewed-by: Thomas Huth 
> Signed-off-by: Nicholas Piggin 
> ---
>  arm/sieve.c|  2 +-
>  common/sieve.c | 51 +
>  riscv/sieve.c  |  2 +-
>  s390x/sieve.c  |  2 +-
>  x86/sieve.c| 52 +-
>  5 files changed, 55 insertions(+), 54 deletions(-)
>  create mode 100644 common/sieve.c
>  mode change 100644 => 120000 x86/sieve.c
>

Acked-by: Andrew Jones

[powerpc:fixes] BUILD SUCCESS 20c8c4dafe93e82441583e93bd68c0d256d7bed4

2024-02-20 Thread kernel test robot

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes
branch HEAD: 20c8c4dafe93e82441583e93bd68c0d256d7bed4  KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

elapsed time: 1061m

configs tested: 123
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha       allnoconfig                        gcc
alpha       allyesconfig                       gcc
alpha       defconfig                          gcc
arc         allmodconfig                       gcc
arc         allnoconfig                        gcc
arc         allyesconfig                       gcc
arc         defconfig                          gcc
arm         allmodconfig                       gcc
arm         allnoconfig                        clang
arm         allyesconfig                       gcc
arm         defconfig                          clang
arm         h3600_defconfig                    gcc
arm         spear13xx_defconfig                gcc
arm64       allmodconfig                       clang
arm64       allnoconfig                        gcc
arm64       defconfig                          gcc
csky        allmodconfig                       gcc
csky        allnoconfig                        gcc
csky        allyesconfig                       gcc
csky        defconfig                          gcc
hexagon     allmodconfig                       clang
hexagon     allnoconfig                        clang
hexagon     allyesconfig                       clang
hexagon     defconfig                          clang
i386        allmodconfig                       gcc
i386        allnoconfig                        gcc
i386        allyesconfig                       gcc
i386        buildonly-randconfig-002-20240221  clang
i386        defconfig                          clang
i386        randconfig-002-20240221            clang
i386        randconfig-003-20240221            clang
i386        randconfig-006-20240221            clang
i386        randconfig-012-20240221            clang
i386        randconfig-014-20240221            clang
i386        randconfig-016-20240221            clang
loongarch   allmodconfig                       gcc
loongarch   allnoconfig                        gcc
loongarch   allyesconfig                       gcc
loongarch   defconfig                          gcc
m68k        allmodconfig                       gcc
m68k        allnoconfig                        gcc
m68k        allyesconfig                       gcc
m68k        defconfig                          gcc
microblaze  alldefconfig                       gcc
microblaze  allmodconfig                       gcc
microblaze  allnoconfig                        gcc
microblaze  allyesconfig                       gcc
microblaze  defconfig                          gcc
mips        allmodconfig                       gcc
mips        allnoconfig                        gcc
mips        allyesconfig                       gcc
mips        decstation_r4k_defconfig           gcc
nios2       allmodconfig                       gcc
nios2       allnoconfig                        gcc
nios2       allyesconfig                       gcc
nios2       defconfig                          gcc
openrisc    allmodconfig                       gcc
openrisc    allnoconfig                        gcc
openrisc    allyesconfig                       gcc
openrisc    defconfig                          gcc
openrisc    or1ksim_defconfig                  gcc
parisc      allmodconfig                       gcc
parisc      allnoconfig                        gcc
parisc      allyesconfig                       gcc
parisc      defconfig                          gcc
parisc64    defconfig                          gcc
powerpc     allmodconfig                       gcc
powerpc     allnoconfig                        gcc
powerpc     allyesconfig                       clang
powerpc     wii_defconfig                      gcc
riscv       allmodconfig                       clang
riscv       allnoconfig                        gcc
riscv       allyesconfig                       clang
riscv       defconfig                          clang
s390        allmodconfig                       clang
s390        allnoconfig                        clang
s390        allyesconfig                       gcc
s390        defconfig                          clang
sh          allmodconfig                       gcc
sh          allnoconfig                        gcc
sh          allyesconfig                       gcc
sh          defconfig                          gcc
sh          dreamcast_defconfig                gcc
sh          sdk7786_defconfig                  gcc
sh          sh7763rdp_defconfig                gcc
sparc

[powerpc] WARNING at arch/powerpc/mm/mmu_context.c:106 switch_mm_irqs_off+0x140/0x170

2024-02-20 Thread Geetika M

While running a DLPAR CPU remove test on an IBM Power10 server (6.8.0-rc2), the
following warning is seen:


[ 2334.165288][ T0] ------------[ cut here ]------------
[ 2334.165302][ T0] WARNING: CPU: 45 PID: 0 at 
arch/powerpc/mm/mmu_context.c:106 switch_mm_irqs_off+0x140/0x170
[ 2334.165316][ T0] Modules linked in: rpadlpar_io rpaphp bonding 
nfnetlink pseries_rng rng_core vmx_crypto gf128mul aes_gcm_p10_crypto 
crct10dif_vpmsum crct10dif_common binfmt_misc crc32c_vpmsum fuse autofs4
[ 2334.165337][ T0] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 
6.8.0-rc2 #1
[ 2334.165342][ T0] Hardware name: IBM,9105-22A POWER10 (raw) 0x800200 
0xf06 of:IBM,FW1060.00 (NL1060_024) hv:phyp pSeries
[ 2334.165345][ T0] NIP: c008c5f0 LR: c008c590 CTR: 
c00fa084
[ 2334.165347][ T0] REGS: c3e67b40 TRAP: 0700 Not tainted 
(6.8.0-rc2)
[ 2334.165351][ T0] MSR: 8282b033 
 CR: 24000222 XER: 

[ 2334.165365][ T0] CFAR: c008c5a4 IRQMASK: 1
[ 2334.165365][ T0] GPR00: c008c590 c3e67de0 
c1559700 c8c12700
[ 2334.165365][ T0] GPR04: c2986480 c3ccff00 
002d 
[ 2334.165365][ T0] GPR08: 002d  
c2986a80 
[ 2334.165365][ T0] GPR12: c00fa084 c00efff88700 
 2eede6a0
[ 2334.165365][ T0] GPR16:   
 
[ 2334.165365][ T0] GPR20:   
 0001
[ 2334.165365][ T0] GPR24: 002d dedc 
c2a62cc8 c29dcc68
[ 2334.165365][ T0] GPR28: c29e1340  
002d c8c12d00

[ 2334.165407][ T0] NIP [c008c5f0] switch_mm_irqs_off+0x140/0x170
[ 2334.165412][ T0] LR [c008c590] switch_mm_irqs_off+0xe0/0x170
[ 2334.165417][ T0] Call Trace:
[ 2334.165418][ T0] [c3e67de0] [c3e67e30] 
0xc3e67e30 (unreliable)
[ 2334.165425][ T0] [c3e67e20] [c01a7bb4] 
idle_task_exit+0x90/0xb4
[ 2334.165433][ T0] [c3e67e50] [c00fa0b8] 
pseries_cpu_offline_self+0x34/0xec
[ 2334.165440][ T0] [c3e67ec0] [c005cbb0] 
arch_cpu_idle_dead+0x48/0x98
[ 2334.165446][ T0] [c3e67ee0] [c01cacb4] 
do_idle+0x2c0/0x3c0
[ 2334.165451][ T0] [c3e67f60] [c01cb044] 
cpu_startup_entry+0x4c/0x50
[ 2334.165457][ T0] [c3e67f90] [c005c798] 
start_secondary+0x2b4/0x2c4
[ 2334.165461][ T0] [c3e67fe0] [c000e258] 
start_secondary_prolog+0x10/0x14
[ 2334.165466][ T0] Code: 4e800020 6000 6000 6000 7ca32b78 
48008eb9 6000 4bb8 0fe0 4b1c 6000 6000 
<0fe0> e8010050 ebe10038 7c0803a6


[ 2334.165481][ T0] ---[ end trace  ]---

- Geetika

Re: [PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides

2024-02-20 Thread Christoph Hellwig

On Tue, Feb 20, 2024 at 02:32:53PM -0600, Maxwell Bland wrote:
> Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
> enforcing appropriate code and data separation untenable on certain
> microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
> while the use of the vmalloc interface is non-monolithic: in particular,
> appropriate randomness in ASLR makes it such that code regions must fall
> in some region between VMALLOC_START and VMALLOC_END, but this
> necessitates that code pages are intermingled with data pages, meaning
> code-specific protections, such as arm64's PXNTable, cannot be
> performantly runtime enforced.

That's not actually true.  We have MODULE_START/END to separate them,
which is used by mips only for now.

> 
> The solution to this problem allows architectures to override the
> vmalloc wrapper functions by enforcing that the rest of the kernel does
> not reimplement __vmalloc_node by using __vmalloc_node_range with the
> same parameters as __vmalloc_node or provides a __weak tag to those
> functions using __vmalloc_node_range with parameters repeating those of
> __vmalloc_node.

I'm really not too happy about overriding the functions.  Especially
as the separation is a generally good idea and it would be good to
move everyone (or at least all modern architectures) over to a scheme
like this.

Re: [PATCH v2 00/14] Split crash out from kexec and clean up related config items

2024-02-20 Thread Hari Bathini


Hi Baoquan,

On 04/02/24 8:56 am, Baoquan He wrote:
>>> Hope Hari and Pingfan can help have a look, see if
>>> it's doable. Now, I make it either have both kexec and crash enabled, or
>>> disable both of them altogether.
>>
>> Sure. I will take a closer look...
>
> Thanks a lot. Please feel free to post patches to make that, or I can do
> it with your support or suggestion.

Tested your changes and on top of these changes, came up with the below
changes to get it working for powerpc:


https://lore.kernel.org/all/20240213113150.1148276-1-hbath...@linux.ibm.com/

Please take a look.

Thanks
Hari

[PATCH v4 1/2] powerpc: Add Power11 architected and raw mode

2024-02-20 Thread Michael Ellerman

From: Madhavan Srinivasan 

Add CPU table entries for raw and architected mode. Most fields are
copied from the Power10 table entries.

CPU, MMU and user (ELF_HWCAP) features are unchanged vs P10. However
userspace can detect P11 because the AT_PLATFORM value changes to
"power11".

The logical PVR value of 0x0F000007, passed to firmware via the
ibm_arch_vec, indicates the kernel can support a P11 compatible CPU,
which means at least ISA v3.1 compliant.

Signed-off-by: Madhavan Srinivasan 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/cputable.h   |  3 ++
 arch/powerpc/include/asm/mmu.h|  1 +
 arch/powerpc/include/asm/reg.h|  2 ++
 arch/powerpc/kernel/cpu_specs_book3s_64.h | 34 +++
 arch/powerpc/kernel/dt_cpu_ftrs.c | 10 +++
 arch/powerpc/kernel/prom_init.c   | 10 ++-
 arch/powerpc/kvm/book3s_hv.c  |  1 +
 7 files changed, 60 insertions(+), 1 deletion(-)

v4: mpe: Rename PVR_ARCH_31N to PVR_ARCH_31_P11, to clarify that it indicates
P11 compatibility. Flesh out change log with some more detail.
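
As a rough illustration of the userspace-visible change described in the
change log, the sketch below reads AT_PLATFORM through getauxval() and compares
it against "power11". The helper names current_platform() and
is_power11_platform() are made up for this note and are not part of the patch;
only the "power11" string and the AT_PLATFORM mechanism come from the change
log.

```c
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>

/* Hypothetical helper: platform string the kernel exported for this process.
 * getauxval() returns 0 when the entry is absent, which maps to NULL here. */
const char *current_platform(void)
{
    return (const char *)getauxval(AT_PLATFORM);
}

/* Hypothetical helper: true only for the exact string named in the change log. */
int is_power11_platform(const char *platform)
{
    return platform != NULL && strcmp(platform, "power11") == 0;
}
```

A caller would evaluate is_power11_platform(current_platform()). Since the
HWCAP bits are unchanged from P10, the platform string is the only distinction
this patch exposes to userspace.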

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 48471ca388dd..07a204d21034 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -454,6 +454,9 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_ARCH_300 | CPU_FTR_ARCH_31 | \
CPU_FTR_DAWR | CPU_FTR_DAWR1 | \
CPU_FTR_DEXCR_NPHIE)
+
+#define CPU_FTRS_POWER11   CPU_FTRS_POWER10
+
 #define CPU_FTRS_CELL  (CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index d8b7e246a32f..61ebe5eff2c9 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -133,6 +133,7 @@
 #define MMU_FTRS_POWER8	MMU_FTRS_POWER6
 #define MMU_FTRS_POWER9	MMU_FTRS_POWER6
 #define MMU_FTRS_POWER10	MMU_FTRS_POWER6
+#define MMU_FTRS_POWER11	MMU_FTRS_POWER6
 #define MMU_FTRS_CELL  MMU_FTRS_DEFAULT_HPTE_ARCH_V2 | \
MMU_FTR_CI_LARGE_PAGE
 #define MMU_FTRS_PA6T  MMU_FTRS_DEFAULT_HPTE_ARCH_V2 | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7fd09f25452d..58d6348e4ea0 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1364,6 +1364,7 @@
 #define PVR_HX_C2000   0x0066
 #define PVR_POWER9 0x004E
 #define PVR_POWER10	0x0080
+#define PVR_POWER11	0x0082
 #define PVR_BE 0x0070
 #define PVR_PA6T   0x0090
 
@@ -1375,6 +1376,7 @@
 #define PVR_ARCH_207	0x0f000004
 #define PVR_ARCH_300	0x0f000005
 #define PVR_ARCH_31	0x0f000006
+#define PVR_ARCH_31_P11	0x0f000007
 
 /* Macros for setting and retrieving special purpose registers */
 #ifndef __ASSEMBLY__
diff --git a/arch/powerpc/kernel/cpu_specs_book3s_64.h b/arch/powerpc/kernel/cpu_specs_book3s_64.h
index 3ff9757df4c0..98d4274a1b6b 100644
--- a/arch/powerpc/kernel/cpu_specs_book3s_64.h
+++ b/arch/powerpc/kernel/cpu_specs_book3s_64.h
@@ -60,6 +60,9 @@
 PPC_FEATURE2_ISEL | PPC_FEATURE2_TAR | \
 PPC_FEATURE2_VEC_CRYPTO)
 
+#define COMMON_USER_POWER11	COMMON_USER_POWER10
+#define COMMON_USER2_POWER11	COMMON_USER2_POWER10
+
 static struct cpu_spec cpu_specs[] __initdata = {
{   /* PPC970 */
.pvr_mask   = 0x,
@@ -281,6 +284,20 @@ static struct cpu_spec cpu_specs[] __initdata = {
.cpu_restore= __restore_cpu_power10,
.platform   = "power10",
},
+   {   /* 3.1-compliant processor, i.e. Power11 "architected" mode */
+   .pvr_mask   = 0xffffffff,
+   .pvr_value  = 0x0f000007,
+   .cpu_name   = "Power11 (architected)",
+   .cpu_features   = CPU_FTRS_POWER11,
+   .cpu_user_features  = COMMON_USER_POWER11,
+   .cpu_user_features2 = COMMON_USER2_POWER11,
+   .mmu_features   = MMU_FTRS_POWER11,
+   .icache_bsize   = 128,
+   .dcache_bsize   = 128,
+   .cpu_setup  = __setup_cpu_power10,
+   .cpu_restore= __restore_cpu_power10,
+   .platform   = "power11",
+   },
{   /* Power7 */
.pvr_mask   = 0x,
.pvr_value  = 0x003f,
@@ -451,6 +468,23 @@ static struct cpu_spec cpu_specs[] __initdata = {
.machine_check_early= __machine_check_early_realmode_p10,
.platform   = "power10",
},
+   {   /* Power11

[PATCH v4 2/2] powerpc/perf: Power11 Performance Monitoring support

2024-02-20 Thread Michael Ellerman

From: Madhavan Srinivasan 

Base enablement patch to register performance monitoring
hardware support for Power11. Most of the fields are copied
from the power10_pmu struct into the power11_pmu struct.

Signed-off-by: Madhavan Srinivasan 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/perf/core-book3s.c |  2 ++
 arch/powerpc/perf/internal.h|  1 +
 arch/powerpc/perf/power10-pmu.c | 27 +++
 3 files changed, 30 insertions(+)

v4: No change.
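
For readers following the PVR check in init_power11_pmu(), here is a minimal
standalone sketch of how the processor version is taken from the upper 16 bits
of the PVR and compared against the Power11 value. The sketch_ names are local
stand-ins invented for this note, the raw PVR values used with them are
illustrative only, and -1 stands in for the -ENODEV the patch returns.

```c
#include <stdint.h>

/* Local stand-in for PVR_POWER11; 0x0082 is the value patch 1/2 adds to reg.h. */
#define SKETCH_PVR_POWER11 0x0082u

/* The processor version occupies the upper 16 bits of the 32-bit PVR. */
uint32_t sketch_pvr_ver(uint32_t pvr)
{
    return (pvr >> 16) & 0xFFFFu;
}

/* Returns 0 when the PVR identifies Power11, -1 otherwise
 * (the patch itself returns -ENODEV in that case). */
int sketch_power11_check(uint32_t pvr)
{
    return sketch_pvr_ver(pvr) == SKETCH_PVR_POWER11 ? 0 : -1;
}
```

Because init_ppc64_pmu() walks its init functions until one returns 0, a
mismatch here simply lets the next candidate in the chain run.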

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 01d14523c938..6b5f8a94e7d8 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2593,6 +2593,8 @@ static int __init init_ppc64_pmu(void)
return 0;
else if (!init_power10_pmu())
return 0;
+   else if (!init_power11_pmu())
+   return 0;
else if (!init_ppc970_pmu())
return 0;
else
diff --git a/arch/powerpc/perf/internal.h b/arch/powerpc/perf/internal.h
index 4c18b5504326..a70ac471a5a5 100644
--- a/arch/powerpc/perf/internal.h
+++ b/arch/powerpc/perf/internal.h
@@ -10,4 +10,5 @@ int __init init_power7_pmu(void);
 int __init init_power8_pmu(void);
 int __init init_power9_pmu(void);
 int __init init_power10_pmu(void);
+int __init init_power11_pmu(void);
 int __init init_generic_compat_pmu(void);
diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
index 9b5133e361a7..62a68b6b2d4b 100644
--- a/arch/powerpc/perf/power10-pmu.c
+++ b/arch/powerpc/perf/power10-pmu.c
@@ -634,3 +634,30 @@ int __init init_power10_pmu(void)
 
return 0;
 }
+
+static struct power_pmu power11_pmu;
+
+int __init init_power11_pmu(void)
+{
+   unsigned int pvr;
+   int rc;
+
+   pvr = mfspr(SPRN_PVR);
+   if (PVR_VER(pvr) != PVR_POWER11)
+   return -ENODEV;
+
+   /* Set the PERF_REG_EXTENDED_MASK here */
+   PERF_REG_EXTENDED_MASK = PERF_REG_PMU_MASK_31;
+
+   power11_pmu = power10_pmu;
+   power11_pmu.name = "Power11";
+
+   rc = register_power_pmu(&power11_pmu);
+   if (rc)
+   return rc;
+
+   /* Tell userspace that EBB is supported */
+   cur_cpu_spec->cpu_user_features2 |= PPC_FEATURE2_EBB;
+
+   return 0;
+}
-- 
2.43.1

Re: [PATCH v12 07/15] media: v4l2: Add audio capture and output support

2024-02-20 Thread Tomasz Figa

On Thu, Jan 18, 2024 at 10:15 PM Shengjiu Wang  wrote:
>
> Audio signal processing has the requirement for memory to
> memory similar as Video.
>
> This patch is to add this support in v4l2 framework, defined
> new buffer type V4L2_BUF_TYPE_AUDIO_CAPTURE and
> V4L2_BUF_TYPE_AUDIO_OUTPUT, defined new format v4l2_audio_format
> for audio case usage.
>
> The created audio device is named "/dev/v4l-audioX".
>
> Signed-off-by: Shengjiu Wang 
> ---
>  .../userspace-api/media/v4l/buffer.rst|  6 ++
>  .../media/v4l/dev-audio-mem2mem.rst   | 71 +++
>  .../userspace-api/media/v4l/devices.rst   |  1 +
>  .../media/v4l/vidioc-enum-fmt.rst |  2 +
>  .../userspace-api/media/v4l/vidioc-g-fmt.rst  |  4 ++
>  .../media/videodev2.h.rst.exceptions  |  2 +
>  .../media/common/videobuf2/videobuf2-v4l2.c   |  4 ++
>  drivers/media/v4l2-core/v4l2-compat-ioctl32.c |  9 +++
>  drivers/media/v4l2-core/v4l2-dev.c| 17 +
>  drivers/media/v4l2-core/v4l2-ioctl.c  | 53 ++
>  include/media/v4l2-dev.h  |  2 +
>  include/media/v4l2-ioctl.h| 34 +
>  include/uapi/linux/videodev2.h| 17 +
>  13 files changed, 222 insertions(+)
>  create mode 100644 Documentation/userspace-api/media/v4l/dev-audio-mem2mem.rst

For drivers/media/common/videobuf2:

Acked-by: Tomasz Figa 

Best regards,
Tomasz

Re: [PATCH v12 07/15] media: v4l2: Add audio capture and output support

2024-02-20 Thread Tomasz Figa

On Sat, Feb 17, 2024 at 6:42 PM Mauro Carvalho Chehab
 wrote:
>
> Em Thu, 18 Jan 2024 20:32:00 +0800
> Shengjiu Wang  escreveu:
>
> > Audio signal processing has the requirement for memory to
> > memory similar as Video.
> >
> > This patch is to add this support in v4l2 framework, defined
> > new buffer type V4L2_BUF_TYPE_AUDIO_CAPTURE and
> > V4L2_BUF_TYPE_AUDIO_OUTPUT, defined new format v4l2_audio_format
> > for audio case usage.
> >
> > The created audio device is named "/dev/v4l-audioX".
> >
> > Signed-off-by: Shengjiu Wang 
> > ---
> >  .../userspace-api/media/v4l/buffer.rst|  6 ++
> >  .../media/v4l/dev-audio-mem2mem.rst   | 71 +++
> >  .../userspace-api/media/v4l/devices.rst   |  1 +
> >  .../media/v4l/vidioc-enum-fmt.rst |  2 +
> >  .../userspace-api/media/v4l/vidioc-g-fmt.rst  |  4 ++
> >  .../media/videodev2.h.rst.exceptions  |  2 +
> >  .../media/common/videobuf2/videobuf2-v4l2.c   |  4 ++
> >  drivers/media/v4l2-core/v4l2-compat-ioctl32.c |  9 +++
> >  drivers/media/v4l2-core/v4l2-dev.c| 17 +
> >  drivers/media/v4l2-core/v4l2-ioctl.c  | 53 ++
> >  include/media/v4l2-dev.h  |  2 +
> >  include/media/v4l2-ioctl.h| 34 +
> >  include/uapi/linux/videodev2.h| 17 +
> >  13 files changed, 222 insertions(+)
> >  create mode 100644 Documentation/userspace-api/media/v4l/dev-audio-mem2mem.rst
> >
> > diff --git a/Documentation/userspace-api/media/v4l/buffer.rst b/Documentation/userspace-api/media/v4l/buffer.rst
> > index 52bbee81c080..a3754ca6f0d6 100644
> > --- a/Documentation/userspace-api/media/v4l/buffer.rst
> > +++ b/Documentation/userspace-api/media/v4l/buffer.rst
> > @@ -438,6 +438,12 @@ enum v4l2_buf_type
> >  * - ``V4L2_BUF_TYPE_META_OUTPUT``
> >- 14
>
> >- Buffer for metadata output, see :ref:`metadata`.
> > +* - ``V4L2_BUF_TYPE_AUDIO_CAPTURE``
> > +  - 15
> > +  - Buffer for audio capture, see :ref:`audio`.
> > +* - ``V4L2_BUF_TYPE_AUDIO_OUTPUT``
> > +  - 16
>
> Hmm... alsa APi define input/output as:
> enum {
> SNDRV_PCM_STREAM_PLAYBACK = 0,
> SNDRV_PCM_STREAM_CAPTURE,
> SNDRV_PCM_STREAM_LAST = SNDRV_PCM_STREAM_CAPTURE,
> };
>
>
> I would use a namespace as close as possible to the
> ALSA API. Also, we're not talking about V4L2, but, instead
> audio. so, not sure if I like the prefix to start with
> V4L2_. Maybe ALSA_?
>
> So, a better namespace would be:
>
> ${prefix}_BUF_TYPE_PCM_STREAM_PLAYBACK
> and
> ${prefix}_BUF_TYPE_PCM_STREAM_CAPTURE
>

The API is still V4L2, and all the other non-video buf types also use
the V4L2_ prefix, so perhaps that's good here as well?

Whether AUDIO or PCM_STREAM makes more sense goes outside of my
expertise. Subjectively, a PCM stream sounds more specific than an
audio stream. Do those buf types also support non-PCM audio streams?

> > +  - Buffer for audio output, see :ref:`audio`.
> >
> >
> >  .. _buffer-flags:
> > diff --git a/Documentation/userspace-api/media/v4l/dev-audio-mem2mem.rst b/Documentation/userspace-api/media/v4l/dev-audio-mem2mem.rst
> > new file mode 100644
> > index 000000000000..68faecfe3a02
> > --- /dev/null
> > +++ b/Documentation/userspace-api/media/v4l/dev-audio-mem2mem.rst
> > @@ -0,0 +1,71 @@
> > +.. SPDX-License-Identifier: GFDL-1.1-no-invariants-or-later
> > +
> > +.. _audiomem2mem:
> > +
> > +
> > +Audio Memory-To-Memory Interface
> > +
> > +
> > +An audio memory-to-memory device can compress, decompress, transform, or
> > +otherwise convert audio data from one format into another format, in memory.
> > +Such memory-to-memory devices set the ``V4L2_CAP_AUDIO_M2M`` capability.
> > +Examples of memory-to-memory devices are audio codecs, audio preprocessing,
> > +audio postprocessing.
> > +
> > +A memory-to-memory audio node supports both output (sending audio frames from
> > +memory to the hardware) and capture (receiving the processed audio frames
> > +from the hardware into memory) stream I/O. An application will have to
> > +setup the stream I/O for both sides and finally call
> > +:ref:`VIDIOC_STREAMON <VIDIOC_STREAMON>` for both capture and output to
> > +start the hardware.
> > +
> > +Memory-to-memory devices function as a shared resource: you can
> > +open the audio node multiple times, each application setting up their
> > +own properties that are local to the file handle, and each can use
> > +it independently from the others. The driver will arbitrate access to
> > +the hardware and reprogram it whenever another file handler gets access.
> > +
> > +Audio memory-to-memory devices are accessed through character device
> > +special files named ``/dev/v4l-audio``
> > +
> > +Querying Capabilities
> > +=
> > +
> > +Device nodes supporting the audio

[kvm-unit-tests PATCH v5 8/8] migration: add a migration selftest

2024-02-20 Thread Nicholas Piggin

Add a selftest for migration support in the guest library and test harness
code. It performs migrations in a tight loop to irritate races and bugs
in the test harness code.

Include the test in s390, powerpc.

Acked-by: Claudio Imbrenda  (s390x)
Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 common/selftest-migration.c  | 29 +
 powerpc/Makefile.common  |  1 +
 powerpc/selftest-migration.c |  1 +
 powerpc/unittests.cfg|  4 
 s390x/Makefile   |  1 +
 s390x/selftest-migration.c   |  1 +
 s390x/unittests.cfg  |  4 
 7 files changed, 41 insertions(+)
 create mode 100644 common/selftest-migration.c
 create mode 120000 powerpc/selftest-migration.c
 create mode 120000 s390x/selftest-migration.c

diff --git a/common/selftest-migration.c b/common/selftest-migration.c
new file mode 100644
index 000000000..54b5d6b2d
--- /dev/null
+++ b/common/selftest-migration.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Machine independent migration tests
+ *
+ * This is just a very simple test that is intended to stress the migration
+ * support in the test harness. This could be expanded to test more guest
+ * library code, but architecture-specific tests should be used to test
+ * migration of tricky machine state.
+ */
+#include <libcflat.h>
+#include <migrate.h>
+
+#define NR_MIGRATIONS 30
+
+int main(int argc, char **argv)
+{
+   int i = 0;
+
+   report_prefix_push("migration");
+
+   for (i = 0; i < NR_MIGRATIONS; i++)
+   migrate_quiet();
+
+   report(true, "simple harness stress test");
+
+   report_prefix_pop();
+
+   return report_summary();
+}
diff --git a/powerpc/Makefile.common b/powerpc/Makefile.common
index eb88398d8..da4a7bbb8 100644
--- a/powerpc/Makefile.common
+++ b/powerpc/Makefile.common
@@ -6,6 +6,7 @@
 
 tests-common = \
$(TEST_DIR)/selftest.elf \
+   $(TEST_DIR)/selftest-migration.elf \
$(TEST_DIR)/spapr_hcall.elf \
$(TEST_DIR)/rtas.elf \
$(TEST_DIR)/emulator.elf \
diff --git a/powerpc/selftest-migration.c b/powerpc/selftest-migration.c
new file mode 120000
index 000000000..bd1eb266d
--- /dev/null
+++ b/powerpc/selftest-migration.c
@@ -0,0 +1 @@
+../common/selftest-migration.c
\ No newline at end of file
diff --git a/powerpc/unittests.cfg b/powerpc/unittests.cfg
index e71140aa5..7ce57de02 100644
--- a/powerpc/unittests.cfg
+++ b/powerpc/unittests.cfg
@@ -36,6 +36,10 @@ smp = 2
 extra_params = -m 256 -append 'setup smp=2 mem=256'
 groups = selftest
 
+[selftest-migration]
+file = selftest-migration.elf
+groups = selftest migration
+
 [spapr_hcall]
 file = spapr_hcall.elf
 
diff --git a/s390x/Makefile b/s390x/Makefile
index b72f7578f..344d46d68 100644
--- a/s390x/Makefile
+++ b/s390x/Makefile
@@ -1,4 +1,5 @@
 tests = $(TEST_DIR)/selftest.elf
+tests += $(TEST_DIR)/selftest-migration.elf
 tests += $(TEST_DIR)/intercept.elf
 tests += $(TEST_DIR)/emulator.elf
 tests += $(TEST_DIR)/sieve.elf
diff --git a/s390x/selftest-migration.c b/s390x/selftest-migration.c
new file mode 120000
index 000000000..bd1eb266d
--- /dev/null
+++ b/s390x/selftest-migration.c
@@ -0,0 +1 @@
+../common/selftest-migration.c
\ No newline at end of file
diff --git a/s390x/unittests.cfg b/s390x/unittests.cfg
index f5024b6ee..a7ad522ca 100644
--- a/s390x/unittests.cfg
+++ b/s390x/unittests.cfg
@@ -24,6 +24,10 @@ groups = selftest
 # please keep the kernel cmdline in sync with $(TEST_DIR)/selftest.parmfile
 extra_params = -append 'test 123'
 
+[selftest-migration]
+file = selftest-migration.elf
+groups = selftest migration
+
 [intercept]
 file = intercept.elf
 
-- 
2.42.0

[kvm-unit-tests PATCH v5 7/8] Add common/ directory for architecture-independent tests

2024-02-20 Thread Nicholas Piggin

x86/sieve.c is used by s390x, arm, and riscv via symbolic link. Make a
new directory common/ for architecture-independent tests and move
sieve.c here.

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 arm/sieve.c|  2 +-
 common/sieve.c | 51 +
 riscv/sieve.c  |  2 +-
 s390x/sieve.c  |  2 +-
 x86/sieve.c| 52 +-
 5 files changed, 55 insertions(+), 54 deletions(-)
 create mode 100644 common/sieve.c
 mode change 100644 => 120000 x86/sieve.c
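
For readers without the tree at hand, the following hosted-C sketch shows what
the moved sieve() routine computes: it marks composites in a byte array and
returns the number of primes below size. The name sketch_sieve() is invented
for this note, and the function is a reformatted copy of the logic in the
patch rather than the file itself. The caller must supply a buffer of at least
size bytes, with size of 2 or more.

```c
/* Sieve of Eratosthenes over a caller-supplied byte array.
 * Returns the count of primes in [2, size). Requires size >= 2. */
int sketch_sieve(char *data, int size)
{
    int i, j, r = 0;

    /* Start by assuming every index is prime. */
    for (i = 0; i < size; ++i)
        data[i] = 1;

    data[0] = data[1] = 0;

    for (i = 2; i < size; ++i) {
        if (data[i]) {
            ++r;
            /* Clear every multiple of the prime just found. */
            for (j = i * 2; j < size; j += i)
                data[j] = 0;
        }
    }
    return r;
}
```

Nothing in the routine depends on the architecture, which is why a single copy
under common/ can be shared through symbolic links.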

diff --git a/arm/sieve.c b/arm/sieve.c
index 8f14a5c3d..fe299f309 120000
--- a/arm/sieve.c
+++ b/arm/sieve.c
@@ -1 +1 @@
-../x86/sieve.c
\ No newline at end of file
+../common/sieve.c
\ No newline at end of file
diff --git a/common/sieve.c b/common/sieve.c
new file mode 100644
index 000000000..8150f2d98
--- /dev/null
+++ b/common/sieve.c
@@ -0,0 +1,51 @@
+#include "alloc.h"
+#include "libcflat.h"
+
+static int sieve(char* data, int size)
+{
+int i, j, r = 0;
+
+for (i = 0; i < size; ++i)
+   data[i] = 1;
+
+data[0] = data[1] = 0;
+
+for (i = 2; i < size; ++i)
+   if (data[i]) {
+   ++r;
+   for (j = i*2; j < size; j += i)
+   data[j] = 0;
+   }
+return r;
+}
+
+static void test_sieve(const char *msg, char *data, int size)
+{
+int r;
+
+printf("%s:", msg);
+r = sieve(data, size);
+printf("%d out of %d\n", r, size);
+}
+
+#define STATIC_SIZE 1000000
+#define VSIZE 100000000
+char static_data[STATIC_SIZE];
+
+int main(void)
+{
+void *v;
+int i;
+
+printf("starting sieve\n");
+test_sieve("static", static_data, STATIC_SIZE);
+setup_vm();
+test_sieve("mapped", static_data, STATIC_SIZE);
+for (i = 0; i < 3; ++i) {
+   v = malloc(VSIZE);
+   test_sieve("virtual", v, VSIZE);
+   free(v);
+}
+
+return 0;
+}
diff --git a/riscv/sieve.c b/riscv/sieve.c
index 8f14a5c3d..fe299f309 120000
--- a/riscv/sieve.c
+++ b/riscv/sieve.c
@@ -1 +1 @@
-../x86/sieve.c
\ No newline at end of file
+../common/sieve.c
\ No newline at end of file
diff --git a/s390x/sieve.c b/s390x/sieve.c
index 8f14a5c3d..fe299f309 120000
--- a/s390x/sieve.c
+++ b/s390x/sieve.c
@@ -1 +1 @@
-../x86/sieve.c
\ No newline at end of file
+../common/sieve.c
\ No newline at end of file
diff --git a/x86/sieve.c b/x86/sieve.c
deleted file mode 100644
index 8150f2d98..000000000
--- a/x86/sieve.c
+++ /dev/null
@@ -1,51 +0,0 @@
-#include "alloc.h"
-#include "libcflat.h"
-
-static int sieve(char* data, int size)
-{
-int i, j, r = 0;
-
-for (i = 0; i < size; ++i)
-   data[i] = 1;
-
-data[0] = data[1] = 0;
-
-for (i = 2; i < size; ++i)
-   if (data[i]) {
-   ++r;
-   for (j = i*2; j < size; j += i)
-   data[j] = 0;
-   }
-return r;
-}
-
-static void test_sieve(const char *msg, char *data, int size)
-{
-int r;
-
-printf("%s:", msg);
-r = sieve(data, size);
-printf("%d out of %d\n", r, size);
-}
-
-#define STATIC_SIZE 1000000
-#define VSIZE 100000000
-char static_data[STATIC_SIZE];
-
-int main(void)
-{
-void *v;
-int i;
-
-printf("starting sieve\n");
-test_sieve("static", static_data, STATIC_SIZE);
-setup_vm();
-test_sieve("mapped", static_data, STATIC_SIZE);
-for (i = 0; i < 3; ++i) {
-   v = malloc(VSIZE);
-   test_sieve("virtual", v, VSIZE);
-   free(v);
-}
-
-return 0;
-}
diff --git a/x86/sieve.c b/x86/sieve.c
new file mode 120000
index 000000000..fe299f309
--- /dev/null
+++ b/x86/sieve.c
@@ -0,0 +1 @@
+../common/sieve.c
\ No newline at end of file
-- 
2.42.0

[kvm-unit-tests PATCH v5 6/8] migration: Add quiet migration support

2024-02-20 Thread Nicholas Piggin

Console output required to support migration becomes quite noisy
when doing lots of migrations. Provide a migrate_quiet() call that
suppresses console output and doesn't log a message.

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 lib/migrate.c | 11 +++
 lib/migrate.h |  1 +
 scripts/arch-run.bash |  4 ++--
 3 files changed, 14 insertions(+), 2 deletions(-)
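
The interaction between the quiet prompt and the two grep invocations in
arch-run.bash can be sketched in plain C as below. This is only a model of the
string matching, under the assumption that the harness keeps using the
case-insensitive "Now migrate the VM" match on the captured output; the
sketch_ function names are invented for this note and do not exist in the
tree.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive substring search, standing in for `grep -i`. */
static int contains_nocase(const char *line, const char *needle)
{
    size_t n = strlen(needle);

    for (; *line; ++line) {
        size_t k = 0;

        while (k < n && line[k] &&
               tolower((unsigned char)line[k]) ==
               tolower((unsigned char)needle[k]))
            ++k;
        if (k == n)
            return 1;
    }
    return n == 0;
}

/* The harness starts a migration on any captured line holding the prompt. */
int sketch_triggers_migration(const char *line)
{
    return contains_nocase(line, "Now migrate the VM");
}

/* The console pipeline drops only lines carrying the quiet marker (grep -v). */
int sketch_shown_on_console(const char *line)
{
    return strstr(line, "Now migrate the VM (quiet)") == NULL;
}
```

The quiet prompt therefore still contains the text the harness waits for, so
migration is triggered exactly as with migrate(), while the extra "(quiet)"
suffix lets the console filter hide the line.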

diff --git a/lib/migrate.c b/lib/migrate.c
index b77216594..92d1d957d 100644
--- a/lib/migrate.c
+++ b/lib/migrate.c
@@ -18,6 +18,17 @@ void migrate(void)
report_info("Migration complete");
 }
 
+/*
+ * Like migrate() but suppress output and logs, useful for intensive
+ * migration stress testing without polluting logs. Test cases should
+ * provide relevant information about migration in failure reports.
+ */
+void migrate_quiet(void)
+{
+   puts("Now migrate the VM (quiet)\n");
+   (void)getchar();
+}
+
 /*
  * Initiate migration and wait for it to complete.
  * If this function is called more than once, it is a no-op.
diff --git a/lib/migrate.h b/lib/migrate.h
index 2af06a72d..95b9102b0 100644
--- a/lib/migrate.h
+++ b/lib/migrate.h
@@ -7,4 +7,5 @@
  */
 
 void migrate(void);
+void migrate_quiet(void);
 void migrate_once(void);
diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index c98429e8c..0a98e5127 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -152,7 +152,7 @@ run_migration ()
-chardev socket,id=mon,path=${src_qmp},server=on,wait=off \
-mon chardev=mon,mode=control > ${src_outfifo} &
live_pid=$!
-   cat ${src_outfifo} | tee ${src_out} &
+   cat ${src_outfifo} | tee ${src_out} | grep -v "Now migrate the VM (quiet)" &
 
# Start the first destination QEMU machine in advance of the test
# reaching the migration point, since we expect at least one migration.
@@ -190,7 +190,7 @@ do_migration ()
-mon chardev=mon,mode=control -incoming unix:${dst_incoming} \
< <(cat ${dst_infifo}) > ${dst_outfifo} &
incoming_pid=$!
-   cat ${dst_outfifo} | tee ${dst_out} &
+   cat ${dst_outfifo} | tee ${dst_out} | grep -v "Now migrate the VM (quiet)" &
 
# The test must prompt the user to migrate, so wait for the
# "Now migrate VM" console message.
-- 
2.42.0

[kvm-unit-tests PATCH v5 5/8] arch-run: rename migration variables

2024-02-20 Thread Nicholas Piggin

Using 1 and 2 for source and destination is confusing, particularly
now with multiple migrations that flip between them. Do a rename
pass to 'src' and 'dst' to tidy things up.

Acked-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 scripts/arch-run.bash | 111 +-
 1 file changed, 56 insertions(+), 55 deletions(-)

diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index c2002d7ae..c98429e8c 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -132,27 +132,27 @@ run_migration ()
migcmdline=$@
 
trap 'trap - TERM ; kill 0 ; exit 2' INT TERM
-   trap 'rm -f ${migout1} ${migout2} ${migout_fifo1} ${migout_fifo2} ${migsock} ${qmp1} ${qmp2} ${fifo}' RETURN EXIT
-
-   migsock=$(mktemp -u -t mig-helper-socket.XX)
-   migout1=$(mktemp -t mig-helper-stdout1.XX)
-   migout_fifo1=$(mktemp -u -t mig-helper-fifo-stdout1.XX)
-   migout2=$(mktemp -t mig-helper-stdout2.XX)
-   migout_fifo2=$(mktemp -u -t mig-helper-fifo-stdout2.XX)
-   qmp1=$(mktemp -u -t mig-helper-qmp1.XX)
-   qmp2=$(mktemp -u -t mig-helper-qmp2.XX)
-   fifo=$(mktemp -u -t mig-helper-fifo.XX)
-   qmpout1=/dev/null
-   qmpout2=/dev/null
-
-   mkfifo ${migout_fifo1}
-   mkfifo ${migout_fifo2}
+   trap 'rm -f ${src_out} ${dst_out} ${src_outfifo} ${dst_outfifo} ${dst_incoming} ${src_qmp} ${dst_qmp} ${dst_infifo}' RETURN EXIT
+
+   dst_incoming=$(mktemp -u -t mig-helper-socket-incoming.XX)
+   src_out=$(mktemp -t mig-helper-stdout1.XX)
+   src_outfifo=$(mktemp -u -t mig-helper-fifo-stdout1.XX)
+   dst_out=$(mktemp -t mig-helper-stdout2.XX)
+   dst_outfifo=$(mktemp -u -t mig-helper-fifo-stdout2.XX)
+   src_qmp=$(mktemp -u -t mig-helper-qmp1.XX)
+   dst_qmp=$(mktemp -u -t mig-helper-qmp2.XX)
+   dst_infifo=$(mktemp -u -t mig-helper-fifo-stdin.XX)
+   src_qmpout=/dev/null
+   dst_qmpout=/dev/null
+
+   mkfifo ${src_outfifo}
+   mkfifo ${dst_outfifo}
 
eval "$migcmdline" \
-   -chardev socket,id=mon1,path=${qmp1},server=on,wait=off \
-   -mon chardev=mon1,mode=control > ${migout_fifo1} &
+   -chardev socket,id=mon,path=${src_qmp},server=on,wait=off \
+   -mon chardev=mon,mode=control > ${src_outfifo} &
live_pid=$!
-   cat ${migout_fifo1} | tee ${migout1} &
+   cat ${src_outfifo} | tee ${src_out} &
 
# Start the first destination QEMU machine in advance of the test
# reaching the migration point, since we expect at least one migration.
@@ -162,7 +162,7 @@ run_migration ()
 
while ps -p ${live_pid} > /dev/null ; do
# Wait for test exit or further migration messages.
-   if ! grep -q -i "Now migrate the VM" < ${migout1} ; then
+   if ! grep -q -i "Now migrate the VM" < ${src_out} ; then
sleep 0.1
else
do_migration || return $?
@@ -184,80 +184,81 @@ do_migration ()
# We have to use cat to open the named FIFO, because named FIFO's,
# unlike pipes, will block on open() until the other end is also
# opened, and that totally breaks QEMU...
-   mkfifo ${fifo}
+   mkfifo ${dst_infifo}
eval "$migcmdline" \
-   -chardev socket,id=mon2,path=${qmp2},server=on,wait=off \
-   -mon chardev=mon2,mode=control -incoming unix:${migsock} \
-   < <(cat ${fifo}) > ${migout_fifo2} &
+   -chardev socket,id=mon,path=${dst_qmp},server=on,wait=off \
+   -mon chardev=mon,mode=control -incoming unix:${dst_incoming} \
+   < <(cat ${dst_infifo}) > ${dst_outfifo} &
incoming_pid=$!
-   cat ${migout_fifo2} | tee ${migout2} &
+   cat ${dst_outfifo} | tee ${dst_out} &
 
# The test must prompt the user to migrate, so wait for the
# "Now migrate VM" console message.
-   while ! grep -q -i "Now migrate the VM" < ${migout1} ; do
+   while ! grep -q -i "Now migrate the VM" < ${src_out} ; do
if ! ps -p ${live_pid} > /dev/null ; then
echo "ERROR: Test exit before migration point." >&2
-   echo > ${fifo}
-   qmp ${qmp1} '"quit"'> ${qmpout1} 2>/dev/null
-   qmp ${qmp2} '"quit"'> ${qmpout2} 2>/dev/null
+   echo > ${dst_infifo}
+   qmp ${src_qmp} '"quit"'> ${src_qmpout} 2>/dev/null
+   qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
return 3
fi
sleep 0.1
done
 
# Wait until the destination has created the incoming and qmp sockets
-   while ! [ -S ${migsock} ] ; do sleep 0.1 ; done
-   while ! [ -S ${qmp2} ] ; do sleep

[kvm-unit-tests PATCH v5 4/8] migration: Support multiple migrations

2024-02-20 Thread Nicholas Piggin

Support multiple migrations by flipping dest file/socket variables to
source after the migration is complete, ready to start again. A new
destination is created if the test outputs the migrate line again.
Test cases may now switch to calling migrate() one or more times.
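
The role flip described above can be sketched in isolation. This is a minimal illustration only, not the code from the patch: the names `flip_roles` and `trail` and the placeholder socket names are hypothetical, and it assumes the flip amounts to exchanging the source and destination variables after each completed migration.

```shell
#!/usr/bin/env bash
# Sketch only: placeholder strings stand in for the real socket paths.
src_qmp=qmp-a
dst_qmp=qmp-b
trail=""

# Hypothetical helper: after a migration completes, the old destination
# becomes the new source, and the old source name is free for reuse.
flip_roles () {
    local tmp=${src_qmp}
    src_qmp=${dst_qmp}
    dst_qmp=${tmp}
}

# Pretend the test printed "Now migrate the VM" three times.
for i in 1 2 3; do
    trail="${trail}${src_qmp}->${dst_qmp} "
    flip_roles
done

echo "${trail}"
```

Each pass records which endpoint acted as the source, so the output makes the alternation between the two endpoints visible.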

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 lib/migrate.c |  8 ++--
 lib/migrate.h |  1 +
 scripts/arch-run.bash | 86 ---
 3 files changed, 77 insertions(+), 18 deletions(-)

diff --git a/lib/migrate.c b/lib/migrate.c
index 527e63ae1..b77216594 100644
--- a/lib/migrate.c
+++ b/lib/migrate.c
@@ -8,8 +8,10 @@
 #include 
 #include "migrate.h"
 
-/* static for now since we only support migrating exactly once per test. */
-static void migrate(void)
+/*
+ * Initiate migration and wait for it to complete.
+ */
+void migrate(void)
 {
puts("Now migrate the VM, then press a key to continue...\n");
(void)getchar();
@@ -19,8 +21,6 @@ static void migrate(void)
 /*
  * Initiate migration and wait for it to complete.
  * If this function is called more than once, it is a no-op.
- * Since migrate_cmd can only migrate exactly once this function can
- * simplify the control flow, especially when skipping tests.
  */
 void migrate_once(void)
 {
diff --git a/lib/migrate.h b/lib/migrate.h
index 3c94e6af7..2af06a72d 100644
--- a/lib/migrate.h
+++ b/lib/migrate.h
@@ -6,4 +6,5 @@
  * Author: Nico Boehr 
  */
 
+void migrate(void);
 void migrate_once(void);
diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index 9a5aaddcc..c2002d7ae 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -129,12 +129,16 @@ run_migration ()
return 77
fi
 
+   migcmdline=$@
+
trap 'trap - TERM ; kill 0 ; exit 2' INT TERM
-   trap 'rm -f ${migout1} ${migout_fifo1} ${migsock} ${qmp1} ${qmp2} 
${fifo}' RETURN EXIT
+   trap 'rm -f ${migout1} ${migout2} ${migout_fifo1} ${migout_fifo2} 
${migsock} ${qmp1} ${qmp2} ${fifo}' RETURN EXIT
 
migsock=$(mktemp -u -t mig-helper-socket.XX)
migout1=$(mktemp -t mig-helper-stdout1.XX)
migout_fifo1=$(mktemp -u -t mig-helper-fifo-stdout1.XX)
+   migout2=$(mktemp -t mig-helper-stdout2.XX)
+   migout_fifo2=$(mktemp -u -t mig-helper-fifo-stdout2.XX)
qmp1=$(mktemp -u -t mig-helper-qmp1.XX)
qmp2=$(mktemp -u -t mig-helper-qmp2.XX)
fifo=$(mktemp -u -t mig-helper-fifo.XX)
@@ -142,20 +146,54 @@ run_migration ()
qmpout2=/dev/null
 
mkfifo ${migout_fifo1}
-   eval "$@" -chardev socket,id=mon1,path=${qmp1},server=on,wait=off \
+   mkfifo ${migout_fifo2}
+
+   eval "$migcmdline" \
+   -chardev socket,id=mon1,path=${qmp1},server=on,wait=off \
-mon chardev=mon1,mode=control > ${migout_fifo1} &
live_pid=$!
cat ${migout_fifo1} | tee ${migout1} &
 
-   # We have to use cat to open the named FIFO, because named FIFO's, 
unlike
-   # pipes, will block on open() until the other end is also opened, and 
that
-   # totally breaks QEMU...
+   # Start the first destination QEMU machine in advance of the test
+   # reaching the migration point, since we expect at least one migration.
+   # Then destination machines are started after the test outputs
+   # subsequent "Now migrate the VM" messages.
+   do_migration || return $?
+
+   while ps -p ${live_pid} > /dev/null ; do
+   # Wait for test exit or further migration messages.
+   if ! grep -q -i "Now migrate the VM" < ${migout1} ; then
+   sleep 0.1
+   else
+   do_migration || return $?
+   fi
+   done
+
+   wait ${live_pid}
+   ret=$?
+
+   while (( $(jobs -r | wc -l) > 0 )); do
+   sleep 0.1
+   done
+
+   return $ret
+}
+
+do_migration ()
+{
+   # We have to use cat to open the named FIFO, because named FIFO's,
+   # unlike pipes, will block on open() until the other end is also
+   # opened, and that totally breaks QEMU...
mkfifo ${fifo}
-   eval "$@" -chardev socket,id=mon2,path=${qmp2},server=on,wait=off \
-   -mon chardev=mon2,mode=control -incoming unix:${migsock} < 
<(cat ${fifo}) &
+   eval "$migcmdline" \
+   -chardev socket,id=mon2,path=${qmp2},server=on,wait=off \
+   -mon chardev=mon2,mode=control -incoming unix:${migsock} \
+   < <(cat ${fifo}) > ${migout_fifo2} &
incoming_pid=$!
+   cat ${migout_fifo2} | tee ${migout2} &
 
-   # The test must prompt the user to migrate, so wait for the "migrate" 
keyword
+   # The test must prompt the user to migrate, so wait for the
+   # "Now migrate VM" console message.
while ! grep -q -i "Now migrate the VM" < ${migout1} ; do
if ! ps -p ${live_pid} >

[kvm-unit-tests PATCH v5 3/8] migration: use a more robust way to wait for background job

2024-02-20 Thread Nicholas Piggin

Starting a pipeline of jobs in the background does not seem to have
a simple way to reliably find the pid of a particular process in the
pipeline (because not all processes are started when the shell
continues to execute).

The way the PID of QEMU is derived can result in a failure waiting on a
PID that is not running. This is easier to hit with subsequent
multiple-migration support. Changing this to use $! by swapping the
pipeline for a fifo is more robust.
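
A standalone sketch of the idea, for illustration: the `worker` function is a hypothetical stand-in for the QEMU command line, and the temporary file names are made up. Redirecting the worker's output into a named FIFO makes it a simple background command, so `$!` refers to that process itself rather than to some member of a pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the QEMU command; exits with a known status.
worker () { echo "Now migrate the VM"; sleep 0.2; return 7; }

tmpdir=$(mktemp -d)
out_fifo=${tmpdir}/out.fifo
out_log=${tmpdir}/out.log
mkfifo "${out_fifo}"

# The worker is a single background command, so $! is exactly its PID.
worker > "${out_fifo}" &
worker_pid=$!

# A separate background job drains the FIFO into a log file.
cat "${out_fifo}" | tee "${out_log}" > /dev/null &

# Collect the worker's exit status without tripping 'set -e' callers.
worker_ret=0
wait "${worker_pid}" || worker_ret=$?
wait    # let the reader finish

if grep -q -i "Now migrate the VM" "${out_log}"; then
    saw_prompt=yes
else
    saw_prompt=no
fi
rm -rf "${tmpdir}"
echo "worker exited with ${worker_ret}, prompt seen: ${saw_prompt}"
```

Because the PID is captured before the reader job is started, a later `$!` (which would refer to the reader pipeline) cannot be confused with the worker.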

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 scripts/arch-run.bash | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index c1dd67abe..9a5aaddcc 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -130,19 +130,22 @@ run_migration ()
fi
 
trap 'trap - TERM ; kill 0 ; exit 2' INT TERM
-   trap 'rm -f ${migout1} ${migsock} ${qmp1} ${qmp2} ${fifo}' RETURN EXIT
+   trap 'rm -f ${migout1} ${migout_fifo1} ${migsock} ${qmp1} ${qmp2} 
${fifo}' RETURN EXIT
 
migsock=$(mktemp -u -t mig-helper-socket.XX)
migout1=$(mktemp -t mig-helper-stdout1.XX)
+   migout_fifo1=$(mktemp -u -t mig-helper-fifo-stdout1.XX)
qmp1=$(mktemp -u -t mig-helper-qmp1.XX)
qmp2=$(mktemp -u -t mig-helper-qmp2.XX)
fifo=$(mktemp -u -t mig-helper-fifo.XX)
qmpout1=/dev/null
qmpout2=/dev/null
 
+   mkfifo ${migout_fifo1}
eval "$@" -chardev socket,id=mon1,path=${qmp1},server=on,wait=off \
-   -mon chardev=mon1,mode=control | tee ${migout1} &
-   live_pid=`jobs -l %+ | grep "eval" | awk '{print$2}'`
+   -mon chardev=mon1,mode=control > ${migout_fifo1} &
+   live_pid=$!
+   cat ${migout_fifo1} | tee ${migout1} &
 
# We have to use cat to open the named FIFO, because named FIFO's, 
unlike
# pipes, will block on open() until the other end is also opened, and 
that
@@ -150,7 +153,7 @@ run_migration ()
mkfifo ${fifo}
eval "$@" -chardev socket,id=mon2,path=${qmp2},server=on,wait=off \
-mon chardev=mon2,mode=control -incoming unix:${migsock} < 
<(cat ${fifo}) &
-   incoming_pid=`jobs -l %+ | awk '{print$2}'`
+   incoming_pid=$!
 
# The test must prompt the user to migrate, so wait for the "migrate" 
keyword
while ! grep -q -i "Now migrate the VM" < ${migout1} ; do
@@ -164,6 +167,10 @@ run_migration ()
sleep 1
done
 
+   # Wait until the destination has created the incoming and qmp sockets
+   while ! [ -S ${migsock} ] ; do sleep 0.1 ; done
+   while ! [ -S ${qmp2} ] ; do sleep 0.1 ; done
+
qmp ${qmp1} '"migrate", "arguments": { "uri": "unix:'${migsock}'" }' > 
${qmpout1}
 
# Wait for the migration to complete
-- 
2.42.0

[kvm-unit-tests PATCH v5 2/8] arch-run: Clean up initrd cleanup

2024-02-20 Thread Nicholas Piggin

Rather than put a big script into the trap handler, have it call
a function.

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 scripts/arch-run.bash | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index 11d47a85c..c1dd67abe 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -269,10 +269,21 @@ search_qemu_binary ()
export PATH=$save_path
 }
 
+initrd_cleanup ()
+{
+   rm -f $KVM_UNIT_TESTS_ENV
+   if [ "$KVM_UNIT_TESTS_ENV_OLD" ]; then
+   export KVM_UNIT_TESTS_ENV="$KVM_UNIT_TESTS_ENV_OLD"
+   else
+   unset KVM_UNIT_TESTS_ENV
+   fi
+   unset KVM_UNIT_TESTS_ENV_OLD
+}
+
 initrd_create ()
 {
if [ "$ENVIRON_DEFAULT" = "yes" ]; then
-   trap_exit_push 'rm -f $KVM_UNIT_TESTS_ENV; [ 
"$KVM_UNIT_TESTS_ENV_OLD" ] && export 
KVM_UNIT_TESTS_ENV="$KVM_UNIT_TESTS_ENV_OLD" || unset KVM_UNIT_TESTS_ENV; unset 
KVM_UNIT_TESTS_ENV_OLD'
+   trap_exit_push 'initrd_cleanup'
[ -f "$KVM_UNIT_TESTS_ENV" ] && export 
KVM_UNIT_TESTS_ENV_OLD="$KVM_UNIT_TESTS_ENV"
export KVM_UNIT_TESTS_ENV=$(mktemp)
env_params
-- 
2.42.0

[kvm-unit-tests PATCH v5 1/8] arch-run: Fix TRAP handler recursion to remove temporary files properly

2024-02-20 Thread Nicholas Piggin

Migration files were not being removed when the QEMU process is
interrupted (e.g., with ^C). This is because the SIGINT propagates to the
bash TRAP handler, which recursively TRAPs due to the 'kill 0' in the
handler. This eventually crashes bash.

This can be observed by interrupting a long-running test program that is
run with MIGRATION=yes, /tmp/mig-helper-* files remain afterwards.

Removing TRAP recursion solves this problem and allows the EXIT handler
to run and clean up the files.

This also moves the trap handler before temp file creation, which closes
the small race between creation and trap handler install.
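
The ordering can be tried out in a subshell. The sketch below is illustrative only: `record` and `scratch` are hypothetical names, and the INT/TERM handler is installed in the same form as in the patch but deliberately never triggered here, since `kill 0` would signal the whole process group. Only the EXIT cleanup path is exercised.

```shell
#!/usr/bin/env bash
# Where the subshell reports the temp file it created (illustrative name).
record=$(mktemp)

sub_ret=0
(
    # Handlers are installed first. Resetting TERM before 'kill 0' keeps
    # the handler from being re-entered by its own signal.
    trap 'trap - TERM ; kill 0 ; exit 2' INT TERM
    # Single quotes defer expansion until the trap actually fires.
    trap 'rm -f ${scratch}' EXIT

    # Temp file creation comes after the handlers are in place.
    scratch=$(mktemp)
    echo "${scratch}" > "${record}"
    exit 2    # simulate an early exit; the EXIT trap should clean up
) || sub_ret=$?

scratch_path=$(cat "${record}")
rm -f "${record}"
if [ -e "${scratch_path}" ]; then leaked=yes; else leaked=no; fi
echo "subshell returned ${sub_ret}, temp file leaked: ${leaked}"
```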

Reviewed-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
 scripts/arch-run.bash | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
index d0864360a..11d47a85c 100644
--- a/scripts/arch-run.bash
+++ b/scripts/arch-run.bash
@@ -129,6 +129,9 @@ run_migration ()
return 77
fi
 
+   trap 'trap - TERM ; kill 0 ; exit 2' INT TERM
+   trap 'rm -f ${migout1} ${migsock} ${qmp1} ${qmp2} ${fifo}' RETURN EXIT
+
migsock=$(mktemp -u -t mig-helper-socket.XX)
migout1=$(mktemp -t mig-helper-stdout1.XX)
qmp1=$(mktemp -u -t mig-helper-qmp1.XX)
@@ -137,9 +140,6 @@ run_migration ()
qmpout1=/dev/null
qmpout2=/dev/null
 
-   trap 'kill 0; exit 2' INT TERM
-   trap 'rm -f ${migout1} ${migsock} ${qmp1} ${qmp2} ${fifo}' RETURN EXIT
-
eval "$@" -chardev socket,id=mon1,path=${qmp1},server=on,wait=off \
-mon chardev=mon1,mode=control | tee ${migout1} &
live_pid=`jobs -l %+ | grep "eval" | awk '{print$2}'`
@@ -209,11 +209,11 @@ run_panic ()
return 77
fi
 
-   qmp=$(mktemp -u -t panic-qmp.XX)
-
-   trap 'kill 0; exit 2' INT TERM
+   trap 'trap - TERM ; kill 0 ; exit 2' INT TERM
trap 'rm -f ${qmp}' RETURN EXIT
 
+   qmp=$(mktemp -u -t panic-qmp.XX)
+
# start VM stopped so we don't miss any events
eval "$@" -chardev socket,id=mon1,path=${qmp},server=on,wait=off \
-mon chardev=mon1,mode=control -S &
-- 
2.42.0

[kvm-unit-tests PATCH v5 0/8] Multi-migration support

2024-02-20 Thread Nicholas Piggin

Now that the strange arm64 hang is found to be a QEMU bug, I'll repost.
Since arm64 requires Thomas's uart patch and it is worse affected
by the QEMU bug, I will just not build it on arm. The QEMU bug
still affects powerpc (and presumably s390x) but it's not causing
so much trouble for this test case.

I have another test case that can hit it reliably and doesn't
cause crashes but that takes some harness and common lib work so
I'll send that another time.

Since v4:
- Don't build selftest-migration on arm.
- Reduce selftest-migration iterations from 100 to 30 to make the
  test run faster (it's ~0.5s per migration).

Since v3:
- Addressed Thomas's review comments:
- Patch 2 initrd cleanup unset the old variable in the correct place.
- Patch 4 multi migration removed the extra wait for "Now migrate the
  VM" message, and updated comments around it.
- Patch 6 fix typo and whitespace in quiet migration support.
- Patch 8 fix typo and whitespace in migration selftest.

Since v2:
- Rebase on riscv port and auxvinfo fix was merged.
- Clean up initrd cleanup moves more commands into the new cleanup
  function from the trap handler commands (suggested by Thomas).
- "arch-run: Clean up temporary files properly" patch is now renamed
  to "arch-run: Fix TRAP handler..."
- Fix TRAP handler patch has redone changelog to be more precise about
  the problem and including recipe to recreate it.
- Fix TRAP handler patch reworked slightly to remove the theoretical
  race rather than just adding a comment about it.
- Patch 3 was missing a couple of fixes that leaked into patch 4,
  those are moved into patch 3.

Thanks,
Nick

Nicholas Piggin (8):
  arch-run: Fix TRAP handler recursion to remove temporary files
properly
  arch-run: Clean up initrd cleanup
  migration: use a more robust way to wait for background job
  migration: Support multiple migrations
  arch-run: rename migration variables
  migration: Add quiet migration support
  Add common/ directory for architecture-independent tests
  migration: add a migration selftest

 arm/sieve.c  |   2 +-
 common/selftest-migration.c  |  29 ++
 common/sieve.c   |  51 ++
 lib/migrate.c|  19 +++-
 lib/migrate.h|   2 +
 powerpc/Makefile.common  |   1 +
 powerpc/selftest-migration.c |   1 +
 powerpc/unittests.cfg|   4 +
 riscv/sieve.c|   2 +-
 s390x/Makefile   |   1 +
 s390x/selftest-migration.c   |   1 +
 s390x/sieve.c|   2 +-
 s390x/unittests.cfg  |   4 +
 scripts/arch-run.bash| 177 +--
 x86/sieve.c  |  52 +-
 15 files changed, 240 insertions(+), 108 deletions(-)
 create mode 100644 common/selftest-migration.c
 create mode 100644 common/sieve.c
 create mode 12 powerpc/selftest-migration.c
 create mode 12 s390x/selftest-migration.c
 mode change 100644 => 12 x86/sieve.c

-- 
2.42.0

[PATCH 4/4] arm64: dynamic enforcement of pmd-level PXNTable

2024-02-20 Thread Maxwell Bland

In an attempt to protect against write-then-execute attacks wherein an
adversary stages malicious code into a data page and then later uses a
write gadget to mark the data page executable, arm64 enforces PXNTable
when allocating pmd descriptors during the init process. However, these
protections are not maintained for dynamic memory allocations, creating
an extensive threat surface to write-then-execute attacks targeting
pages allocated through the vmalloc interface.

Straightforward modifications to the pgalloc interface allow for the
dynamic enforcement of PXNTable, restricting writable and
privileged-executable code pages to known kernel text, bpf-allocated
programs, and kprobe-allocated pages, all of which have more extensive
verification interfaces than the generic vmalloc region.

This patch adds a preprocessor define to check whether a pmd is
allocated by vmalloc and exists outside of a known code region, and if
so, marks the pmd as PXNTable, protecting over 100 last-level page
tables from manipulation in the process.

Signed-off-by: Maxwell Bland 
---
 arch/arm64/include/asm/pgalloc.h | 11 +--
 arch/arm64/include/asm/vmalloc.h |  5 +
 arch/arm64/mm/trans_pgd.c|  2 +-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 237224484d0f..5e9262241e8b 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 
+#define __HAVE_ARCH_ADDR_COND_PMD
 #define __HAVE_ARCH_PGD_FREE
 #include 
 
@@ -74,10 +75,16 @@ static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t 
ptep,
  * of the mm address space.
  */
 static inline void
-pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep)
+pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp, pte_t *ptep,
+   unsigned long address)
 {
+   pmdval_t pmd = PMD_TYPE_TABLE | PMD_TABLE_UXN;
VM_BUG_ON(mm && mm != _mm);
-   __pmd_populate(pmdp, __pa(ptep), PMD_TYPE_TABLE | PMD_TABLE_UXN);
+   if (IS_DATA_VMALLOC_ADDR(address) &&
+   IS_DATA_VMALLOC_ADDR(address + PMD_SIZE)) {
+   pmd |= PMD_TABLE_PXN;
+   }
+   __pmd_populate(pmdp, __pa(ptep), pmd);
 }
 
 static inline void
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index dbcf8ad20265..6f254ab83f4a 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -34,4 +34,9 @@ static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot)
 extern unsigned long code_region_start __ro_after_init;
 extern unsigned long code_region_end __ro_after_init;
 
+#define IS_DATA_VMALLOC_ADDR(vaddr) (((vaddr) < code_region_start || \
+ (vaddr) > code_region_end) && \
+ ((vaddr) >= VMALLOC_START && \
+  (vaddr) < VMALLOC_END))
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 7b14df3c6477..7f903c51e1eb 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -69,7 +69,7 @@ static int copy_pte(struct trans_pgd_info *info, pmd_t 
*dst_pmdp,
dst_ptep = trans_alloc(info);
if (!dst_ptep)
return -ENOMEM;
-   pmd_populate_kernel(NULL, dst_pmdp, dst_ptep);
+   pmd_populate_kernel_at(NULL, dst_pmdp, dst_ptep, addr);
dst_ptep = pte_offset_kernel(dst_pmdp, start);
 
src_ptep = pte_offset_kernel(src_pmdp, start);
-- 
2.39.2

[PATCH 1/4] mm/vmalloc: allow arch-specific vmalloc_node overrides

2024-02-20 Thread Maxwell Bland

Present non-uniform use of __vmalloc_node and __vmalloc_node_range makes
enforcing appropriate code and data separation untenable on certain
microarchitectures, as VMALLOC_START and VMALLOC_END are monolithic
while the use of the vmalloc interface is non-monolithic: in particular,
appropriate randomness in ASLR makes it such that code regions must fall
in some region between VMALLOC_START and VMALLOC_END, but this
necessitates that code pages are intermingled with data pages, meaning
code-specific protections, such as arm64's PXNTable, cannot be
performantly runtime enforced.

The solution to this problem allows architectures to override the
vmalloc wrapper functions by enforcing that the rest of the kernel does
not reimplement __vmalloc_node by using __vmalloc_node_range with the
same parameters as __vmalloc_node or provides a __weak tag to those
functions using __vmalloc_node_range with parameters repeating those of
__vmalloc_node.

Two benefits of this approach are (1) greater flexibility to each
architecture for handling of virtual memory while not compromising the
kernel's vmalloc logic and (2) more uniform use of the __vmalloc_node
interface, reserving the more specialized __vmalloc_node_range for more
specialized cases, such as kasan's shadow memory.

Signed-off-by: Maxwell Bland 
---
 arch/arm/kernel/irq.c   |  2 +-
 arch/arm64/include/asm/vmap_stack.h |  2 +-
 arch/arm64/kernel/efi.c |  2 +-
 arch/powerpc/kernel/irq.c   |  2 +-
 arch/riscv/include/asm/irq_stack.h  |  2 +-
 arch/s390/hypfs/hypfs_diag.c|  2 +-
 arch/s390/kernel/setup.c|  6 ++---
 arch/s390/kernel/sthyi.c|  2 +-
 include/linux/vmalloc.h | 15 ++-
 kernel/bpf/syscall.c|  4 +--
 kernel/fork.c   |  4 +--
 kernel/scs.c|  3 +--
 lib/objpool.c   |  2 +-
 lib/test_vmalloc.c  |  6 ++---
 mm/util.c   |  3 +--
 mm/vmalloc.c| 39 +++--
 16 files changed, 47 insertions(+), 49 deletions(-)

diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
index fe28fc1f759d..109f4f363621 100644
--- a/arch/arm/kernel/irq.c
+++ b/arch/arm/kernel/irq.c
@@ -61,7 +61,7 @@ static void __init init_irq_stacks(void)
   THREAD_SIZE_ORDER);
else
stack = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN,
-  THREADINFO_GFP, NUMA_NO_NODE,
+  THREADINFO_GFP, 0, NUMA_NO_NODE,
   __builtin_return_address(0));
 
if (WARN_ON(!stack))
diff --git a/arch/arm64/include/asm/vmap_stack.h 
b/arch/arm64/include/asm/vmap_stack.h
index 20873099c035..57a7eaa720d5 100644
--- a/arch/arm64/include/asm/vmap_stack.h
+++ b/arch/arm64/include/asm/vmap_stack.h
@@ -21,7 +21,7 @@ static inline unsigned long *arch_alloc_vmap_stack(size_t 
stack_size, int node)
 
BUILD_BUG_ON(!IS_ENABLED(CONFIG_VMAP_STACK));
 
-   p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
+   p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, 0, node,
__builtin_return_address(0));
return kasan_reset_tag(p);
 }
diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
index 0228001347be..48efa31a9161 100644
--- a/arch/arm64/kernel/efi.c
+++ b/arch/arm64/kernel/efi.c
@@ -205,7 +205,7 @@ static int __init arm64_efi_rt_init(void)
return 0;
 
p = __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, GFP_KERNEL,
-  NUMA_NO_NODE, &);
+  0, NUMA_NO_NODE, &);
 l: if (!p) {
pr_warn("Failed to allocate EFI runtime stack\n");
clear_bit(EFI_RUNTIME_SERVICES, );
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 6f7d4edaa0bc..ceb7ea07ca28 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -308,7 +308,7 @@ DEFINE_INTERRUPT_HANDLER_ASYNC(do_IRQ)
 static void *__init alloc_vm_stack(void)
 {
return __vmalloc_node(THREAD_SIZE, THREAD_ALIGN, THREADINFO_GFP,
- NUMA_NO_NODE, (void *)_RET_IP_);
+ 0, NUMA_NO_NODE, (void *)_RET_IP_);
 }
 
 static void __init vmap_irqstack_init(void)
diff --git a/arch/riscv/include/asm/irq_stack.h 
b/arch/riscv/include/asm/irq_stack.h
index 6441ded3b0cf..d2410735bde0 100644
--- a/arch/riscv/include/asm/irq_stack.h
+++ b/arch/riscv/include/asm/irq_stack.h
@@ -24,7 +24,7 @@ static inline unsigned long *arch_alloc_vmap_stack(size_t 
stack_size, int node)
 {
void *p;
 
-   p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
+   p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, 0, node,

[PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation

2024-02-20 Thread Maxwell Bland

While other descriptors (e.g. pud) allow allocations conditional on
which virtual address is allocated, pmd descriptor allocations do not.
However, adding support for this is straightforward and is beneficial to
future kernel development targeting the PMD memory granularity.

As many architectures already implement pmd_populate_kernel in an
address-generic manner, it is necessary to roll out support
incrementally. For this purpose a preprocessor flag,
__HAVE_ARCH_ADDR_COND_PMD is introduced to capture whether the
architecture supports some feature requiring PMD allocation conditional
on virtual address. Some microarchitectures (e.g. arm64) support
configurations for table descriptors, for example to enforce Privilege
eXecute Never, which benefit from knowing the virtual memory addresses
referenced by PMDs.

Thus two major arguments in favor of this change are (1) uniformity of
allocation between PMD and other table descriptor types and (2) the
capability of address-specific PMD allocation.

Signed-off-by: Maxwell Bland 
---
 include/asm-generic/pgalloc.h | 18 ++
 include/linux/mm.h|  4 ++--
 mm/hugetlb_vmemmap.c  |  4 ++--
 mm/kasan/init.c   | 22 +-
 mm/memory.c   |  4 ++--
 mm/percpu.c   |  2 +-
 mm/pgalloc-track.h|  3 ++-
 mm/sparse-vmemmap.c   |  2 +-
 8 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 879e5f8aa5e9..e5cdce77c6e4 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -142,6 +142,24 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, 
unsigned long addr)
 }
 #endif
 
+#ifdef __HAVE_ARCH_ADDR_COND_PMD
+static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
+   pte_t *ptep, unsigned long address);
+#else
+static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
+   pte_t *ptep);
+#endif
+
+static inline void pmd_populate_kernel_at(struct mm_struct *mm, pmd_t *pmdp,
+   pte_t *ptep, unsigned long address)
+{
+#ifdef __HAVE_ARCH_ADDR_COND_PMD
+   pmd_populate_kernel(mm, pmdp, ptep, address);
+#else
+   pmd_populate_kernel(mm, pmdp, ptep);
+#endif
+}
+
 #ifndef __HAVE_ARCH_PMD_FREE
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..6a9d5ded428d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2782,7 +2782,7 @@ static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 #if defined(CONFIG_MMU)
 
@@ -2977,7 +2977,7 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t 
*pmd,
 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address) \
-   ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+   ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address)) ? \
NULL: pte_offset_kernel(pmd, address))
 
 #if USE_SPLIT_PMD_PTLOCKS
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index da177e49d956..1f5664b656f1 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -58,7 +58,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, 
unsigned long start,
if (!pgtable)
return -ENOMEM;
 
-   pmd_populate_kernel(_mm, &__pmd, pgtable);
+   pmd_populate_kernel_at(_mm, &__pmd, pgtable, addr);
 
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
pte_t entry, *pte;
@@ -81,7 +81,7 @@ static int vmemmap_split_pmd(pmd_t *pmd, struct page *head, 
unsigned long start,
 
/* Make pte visible before pmd. See comment in pmd_install(). */
smp_wmb();
-   pmd_populate_kernel(_mm, pmd, pgtable);
+   pmd_populate_kernel_at(_mm, pmd, pgtable, addr);
if (!(walk->flags & VMEMMAP_SPLIT_NO_TLB_FLUSH))
flush_tlb_kernel_range(start, start + PMD_SIZE);
} else {
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index 89895f38f722..1e31d965a14e 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -116,8 +116,9 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned 
long addr,
next = pmd_addr_end(addr, end);
 
if (IS_ALIGNED(addr, PMD_SIZE) && end - addr >= PMD_SIZE) {
-   pmd_populate_kernel(_mm, pmd,
-   lm_alias(kasan_early_shadow_pte));
+   pmd_populate_kernel_at(_mm, pmd,
+   lm_alias(kasan_early_shadow_pte),
+   addr);
continue;

[PATCH 0/4] arm64: mm: support dynamic vmalloc/pmd configuration

2024-02-20 Thread Maxwell Bland

Reworks ARM's virtual memory allocation infrastructure to support
dynamic enforcement of page middle directory PXNTable restrictions
rather than only during the initial memory mapping. Runtime enforcement
of this bit prevents write-then-execute attacks, where malicious code is
staged in vmalloc'd data regions, and later the page table is changed to
make this code executable.

Previously the entire region from VMALLOC_START to VMALLOC_END was
vulnerable, but now the vulnerable region is restricted to the 2GB
reserved by module_alloc, a region which is generally read-only and more
difficult to inject staging code into, e.g., data must pass the BPF
verifier. These changes also set the stage for other systems, such as
KVM-level (EL2) changes to mark page tables immutable and code page
verification changes, forging a path toward complete mitigation of
kernel exploits on ARM.

Implementing this required minimal changes to the generic vmalloc
interface in the kernel to allow architecture overrides of some vmalloc
wrapper functions, refactoring vmalloc calls to use a standard interface
in the generic kernel, and passing the address parameter already passed
into PTE allocation to the pte_allocate child function call.

The new arm64 vmalloc wrapper functions ensure vmalloc data is not
allocated into the region reserved for module_alloc. arm64 BPF and
kprobe code also see a two-line-change ensuring their allocations abide
by the segmentation of code from data. Finally, arm64's pmd_populate
function is modified to set the PXNTable bit appropriately.

Signed-off-by: Maxwell Bland 

---

After Mark Rutland's feedback last week on my more minimal patch, see



I adopted a more sweeping and more correct overhaul of ARM's virtual
memory allocation infrastructure to support these changes. This patch
guarantees our ability to write future systems with a strong and
accessible distinction between code and data at the page allocation
layer, bolstering the guarantees of complementary contributions, i.e.
W^X and kCFI.

The current patch minimally reduces available vmalloc space, removing
the 2GB that should be reserved for code allocations regardless, and I
feel really benefits the kernel by making several memory allocation
interfaces more uniform, and providing hooks for non-ARM architectures
to follow suit.

I have done some minimal runtime testing using Torvalds' test-tlb script
on a QEMU VM, but maybe more extensive benchmarking is needed?

Size: Before Patch -> After Patch
4k: 4.09ns  4.15ns  4.41ns  4.43ns -> 3.68ns  3.73ns  3.67ns  3.73ns 
8k: 4.22ns  4.19ns  4.30ns  4.15ns -> 3.99ns  3.89ns  4.12ns  4.04ns 
16k: 3.97ns  4.31ns  4.30ns  4.28ns -> 4.03ns  3.98ns  4.06ns  4.06ns 
32k: 3.82ns  4.51ns  4.25ns  4.31ns -> 3.99ns  4.09ns  4.07ns  5.17ns 
64k: 4.50ns  5.59ns  6.13ns  6.14ns -> 4.23ns  4.26ns  5.91ns  5.93ns 
128k: 5.06ns  4.47ns  6.75ns  6.69ns -> 4.47ns  4.71ns  6.54ns  6.44ns 
256k: 4.83ns  4.43ns  6.62ns  6.21ns -> 4.39ns  4.62ns  6.71ns  6.65ns 
512k: 4.45ns  4.75ns  6.19ns  6.65ns -> 4.86ns  5.26ns  7.77ns  6.68ns 
1M: 4.72ns  4.73ns  6.74ns  6.47ns -> 4.29ns  4.45ns  6.87ns  6.59ns 
2M: 4.66ns  4.86ns  14.49ns  15.00ns -> 4.53ns  4.57ns  15.91ns  15.90ns 
4M: 4.85ns  4.95ns  15.90ns  15.98ns -> 4.48ns  4.74ns  17.27ns  17.36ns 
6M: 4.94ns  5.03ns  17.19ns  17.31ns -> 4.70ns  4.93ns  18.02ns  18.23ns 
8M: 5.05ns  5.18ns  17.49ns  17.64ns -> 4.96ns  5.07ns  18.84ns  18.72ns 
16M: 5.55ns  5.79ns  20.99ns  23.70ns -> 5.46ns  5.72ns  22.76ns  26.51ns
32M: 8.54ns  9.06ns  124.61ns 125.07ns -> 8.43ns  8.59ns  116.83ns 138.83ns
64M: 8.42ns  8.63ns  196.17ns 204.52ns -> 8.26ns  8.43ns  193.49ns 203.85ns
128M: 8.31ns  8.58ns  230.46ns 242.63ns -> 8.22ns  8.39ns  227.99ns 240.29ns
256M: 8.80ns  8.80ns  248.24ns 261.68ns -> 8.35ns  8.55ns  250.18ns 262.20ns

Note I also chose to enforce PXNTable at the PMD layer only (for now),
since the 194 descriptors which are affected by this change on my
testing setup are not sufficient to warrant enforcement at a coarser
granularity.

The architecture-independent changes (I term "generic") can be
classified only as refactoring, but I feel are also major improvements
in that they standardize most uses of the vmalloc interface across the
kernel.

Note this patch reduces the arm64 allocated region for BPF and kprobes,
but only to match with the existing allocation choices made by the
generic kernel. I will admit I do not understand why BPF JIT allocation
code was duplicated into arm64, but I also feel that this was either an
artifact or that these overrides for generic allocation should require a
specific KConfig as they trade off between security and space. That
said, I have chosen not to wrap this patch in a KConfig interface, as I
feel the changes provide significant benefit to the arm64 kernel's
baseline security, though a KConfig could certainly be added if the
maintainers see the need.

Maxwell Bland (4):
  mm/vmalloc: allow arch-specific vmalloc_node overrides
  mm: pgalloc:

[PATCH 3/4] arm64: separate code and data virtual memory allocation

2024-02-20 Thread Maxwell Bland

Current BPF and kprobe instruction allocation interfaces do not match
the base kernel and intermingle code and data pages within the same
sections. In the case of BPF, this appears to be a result of code
duplication between the kernel's JIT compiler and arm64's JIT. However,
this is no longer necessary given the possibility of overriding vmalloc
wrapper functions.

arm64's vmalloc_node routines now include a layer of indirection which
splits the vmalloc region into two segments surrounding the middle
module_alloc region determined by ASLR. To support this,
code_region_start and code_region_end are defined to match the 2GB
boundary chosen by the kernel module ASLR initialization routine.

The result is a large benefit to overall kernel security, as code pages
now remain protected by this ASLR routine and protections can be defined
linearly for code regions rather than through PTE-level tracking.
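
As an illustration only, the split can be modelled in user space as below. This is a minimal sketch, assuming the allocator tries the segment below the code region first and falls back to the segment above it; `pick_data_segment` and `struct range` are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an address range [start, end); not a kernel type. */
struct range {
    unsigned long start;
    unsigned long end;
};

/*
 * Choose a segment of [start, end) that lies outside the excluded code
 * region [excl_start, excl_end) and is large enough for 'size' bytes.
 * The low segment is tried first, then the high segment (assumed order).
 * Returns true and fills *out on success, false if neither segment fits.
 */
static bool pick_data_segment(unsigned long size,
                              unsigned long start, unsigned long end,
                              unsigned long excl_start, unsigned long excl_end,
                              struct range *out)
{
    assert(out != NULL);
    assert(start <= end && excl_start <= excl_end);

    /* Low segment: [start, min(excl_start, end)) */
    if (excl_start > start) {
        unsigned long lo_end = excl_start < end ? excl_start : end;

        if (lo_end - start >= size) {
            out->start = start;
            out->end = lo_end;
            return true;
        }
    }

    /* High segment: [max(excl_end, start), end) */
    if (excl_end < end) {
        unsigned long hi_start = excl_end > start ? excl_end : start;

        if (end - hi_start >= size) {
            out->start = hi_start;
            out->end = end;
            return true;
        }
    }

    return false;
}
```

In this model a data allocation never lands inside the excluded window, which is the property the commit message relies on.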

Signed-off-by: Maxwell Bland 
---
 arch/arm64/include/asm/vmalloc.h   |  3 ++
 arch/arm64/kernel/module.c |  7 
 arch/arm64/kernel/probes/kprobes.c |  2 +-
 arch/arm64/mm/Makefile |  3 +-
 arch/arm64/mm/vmalloc.c| 57 ++
 arch/arm64/net/bpf_jit_comp.c  |  5 +--
 6 files changed, 73 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/mm/vmalloc.c

diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 38fafffe699f..dbcf8ad20265 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -31,4 +31,7 @@ static inline pgprot_t arch_vmap_pgprot_tagged(pgprot_t prot)
return pgprot_tagged(prot);
 }
 
+extern unsigned long code_region_start __ro_after_init;
+extern unsigned long code_region_end __ro_after_init;
+
 #endif /* _ASM_ARM64_VMALLOC_H */
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index dd851297596e..c4fe753a71a9 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -29,6 +29,10 @@
 static u64 module_direct_base __ro_after_init = 0;
 static u64 module_plt_base __ro_after_init = 0;
 
+/* For pre-init vmalloc, assume the worst-case code range */
+unsigned long code_region_start __ro_after_init = (u64) (_end - SZ_2G);
+unsigned long code_region_end __ro_after_init = (u64) (_text + SZ_2G);
+
 /*
  * Choose a random page-aligned base address for a window of 'size' bytes which
  * entirely contains the interval [start, end - 1].
@@ -101,6 +105,9 @@ static int __init module_init_limits(void)
module_plt_base = random_bounding_box(SZ_2G, min, max);
}
 
+   code_region_start = module_plt_base;
+   code_region_end = module_plt_base + SZ_2G;
+
pr_info("%llu pages in range for non-PLT usage",
module_direct_base ? (SZ_128M - kernel_size) / PAGE_SIZE : 0);
pr_info("%llu pages in range for PLT usage",
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 70b91a8c6bb3..c9e109d6c8bc 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -131,7 +131,7 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
 
 void *alloc_insn_page(void)
 {
-   return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
+   return __vmalloc_node_range(PAGE_SIZE, 1, code_region_start, code_region_end,
GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
NUMA_NO_NODE, __builtin_return_address(0));
 }
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index dbd1bc95967d..730b805d8388 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -2,7 +2,8 @@
 obj-y  := dma-mapping.o extable.o fault.o init.o \
   cache.o copypage.o flush.o \
   ioremap.o mmap.o pgd.o mmu.o \
-  context.o proc.o pageattr.o fixmap.o
+  context.o proc.o pageattr.o fixmap.o \
+  vmalloc.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 obj-$(CONFIG_PTDUMP_CORE)  += ptdump.o
 obj-$(CONFIG_PTDUMP_DEBUGFS)   += ptdump_debugfs.o
diff --git a/arch/arm64/mm/vmalloc.c b/arch/arm64/mm/vmalloc.c
new file mode 100644
index ..b6d2fa841f90
--- /dev/null
+++ b/arch/arm64/mm/vmalloc.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include 
+#include 
+
+static void *__vmalloc_node_range_split(unsigned long size, unsigned long 
align,
+   unsigned long start, unsigned long end,
+   unsigned long exclusion_start, unsigned long 
exclusion_end, gfp_t gfp_mask,
+   pgprot_t prot, unsigned long vm_flags, int node,
+   const void *caller)
+{
+   void *res = NULL;
+
+   res = __vmalloc_node_range(size, align, start, exclusion_start,
+   gfp_mask, prot, vm_flags,

[RFC PATCH 08/14] powerpc/thread_info: Introduce TIF_NOTIFY_IPI flag

2024-02-20 Thread K Prateek Nayak

Add support for TIF_NOTIFY_IPI on PowerPC. With TIF_NOTIFY_IPI, a sender
sending an IPI to an idle CPU in TIF_POLLING mode will set the
TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the
CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This
avoids spurious calls to schedule_idle() in cases where an IPI does not
necessarily wake up a task on the idle CPU.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: "Naveen N. Rao" 
Cc: "Rafael J. Wysocki" 
Cc: Daniel Lezcano 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Daniel Bristot de Oliveira 
Cc: Valentin Schneider 
Cc: Andrew Donnellan 
Cc: K Prateek Nayak 
Cc: Nicholas Miehlbradt 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: K Prateek Nayak 
---
 arch/powerpc/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index bf5dde1a4114..b48db55192e0 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -103,6 +103,7 @@ void arch_setup_new_exec(void);
 #define TIF_PATCH_PENDING  6   /* pending live patching update */
 #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
 #define TIF_SINGLESTEP 8   /* singlestepping active */
+#define TIF_NOTIFY_IPI 9   /* Pending IPI on TIF_POLLING idle CPU */
 #define TIF_SECCOMP10  /* secure computing */
 #define TIF_RESTOREALL 11  /* Restore all regs (implies NOERROR) */
 #define TIF_NOERROR12  /* Force successful syscall return */
@@ -129,6 +130,7 @@ void arch_setup_new_exec(void);
 #define _TIF_PATCH_PENDING (1<

[RFC PATCH 03/14] sched/core: Use TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING mode of pending IPI

2024-02-20 Thread K Prateek Nayak

From: "Gautham R. Shenoy" 

Problem statement
=================

When measuring IPI throughput using a modified version of Anton
Blanchard's ipistorm benchmark [1], configured to measure time taken to
perform a fixed number of smp_call_function_single() (with wait set to
1), an increase in benchmark time was observed between v5.7 and the
upstream kernel (v6.7-rc6).

Bisection pointed to commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") as the reason behind this increase in
runtime. Reverting the optimization introduced by the above commit fixed
the regression in ipistorm, however benchmarks like tbench and netperf
regressed with the revert, supporting the validity of the optimization.

Following are the benchmark results on top of tip:sched/core with the
optimization reverted on a dual socket 3rd Generation AMD EPYC system
(2 x 64C/128T) running with boost enabled and C2 disabled:

(tip:sched/core at tag "sched-core-2024-01-08" for all the testing done
below)

  ==
  Test  : ipistorm (modified)
  Units : Normalized runtime
  Interpretation: Lower is better
  Statistic : AMean
  cmdline   : insmod ipistorm.ko numipi=10 single=1 offset=8 cpulist=8 wait=1
  ==
  kernel:   time [pct imp]
  tip:sched/core1.00 [0.00]
  tip:sched/core + revert   0.81 [19.36]

  ==
  Test  : tbench
  Units : Normalized throughput
  Interpretation: Higher is better
  Statistic : AMean
  ==
  Clients:tip[pct imp](CV)   revert[pct imp](CV)
  1 1.00 [  0.00]( 0.24) 0.91 [ -8.96]( 0.30)
  2 1.00 [  0.00]( 0.25) 0.92 [ -8.20]( 0.97)
  4 1.00 [  0.00]( 0.23) 0.91 [ -9.20]( 1.75)
  8 1.00 [  0.00]( 0.69) 0.91 [ -9.48]( 1.56)
 16 1.00 [  0.00]( 0.66) 0.92 [ -8.49]( 2.43)
 32 1.00 [  0.00]( 0.96) 0.89 [-11.13]( 0.96)
 64 1.00 [  0.00]( 1.06) 0.90 [ -9.72]( 2.49)
128 1.00 [  0.00]( 0.70) 0.92 [ -8.36]( 1.26)
256 1.00 [  0.00]( 0.72) 0.97 [ -3.30]( 1.10)
512 1.00 [  0.00]( 0.42) 0.98 [ -1.73]( 0.37)
   1024 1.00 [  0.00]( 0.28) 0.99 [ -1.39]( 0.43)

  ==
  Test  : netperf
  Units : Normalized Throughput
  Interpretation: Higher is better
  Statistic : AMean
  ==
  Clients: tip[pct imp](CV)   revert[pct imp](CV)
   1-clients 1.00 [  0.00]( 0.50) 0.89 [-10.51]( 0.20)
   2-clients 1.00 [  0.00]( 1.16) 0.89 [-11.10]( 0.59)
   4-clients 1.00 [  0.00]( 1.03) 0.89 [-10.68]( 0.38)
   8-clients 1.00 [  0.00]( 0.99) 0.89 [-10.54]( 0.50)
  16-clients 1.00 [  0.00]( 0.87) 0.89 [-10.92]( 0.95)
  32-clients 1.00 [  0.00]( 1.24) 0.89 [-10.85]( 0.63)
  64-clients 1.00 [  0.00]( 1.58) 0.90 [-10.11]( 1.18)
  128-clients1.00 [  0.00]( 0.87) 0.89 [-10.94]( 1.11)
  256-clients1.00 [  0.00]( 4.77) 1.00 [ -0.16]( 3.45)
  512-clients1.00 [  0.00](56.16) 1.02 [  2.10](56.05)

Since a simple revert is not a viable solution, the changes in the code
path of call_function_single_prep_ipi(), with and without the
optimization were audited to better understand the effect of the commit.

Effects of call_function_single_prep_ipi()
==========================================

To pull a TIF_POLLING thread out of idle to process an IPI, the sender
sets the TIF_NEED_RESCHED bit in the idle task's thread info in
call_function_single_prep_ipi() and avoids sending an actual IPI to the
target. As a result, the scheduler expects a task to be enqueued when
exiting the idle path. This is not the case with non-polling idle states
where the idle CPU exits the non-polling idle state to process the
interrupt, and since need_resched() returns false, soon goes back to
idle again.
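
The sender-side behaviour described above can be sketched in user space roughly as follows. This is only an illustration using C11 atomics, assuming a compare-and-exchange loop on the target's flag word; the `SK_*` constants and `sk_*` functions are hypothetical names and bit values, not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; real values are per-architecture. */
#define SK_POLLING      (1UL << 0)
#define SK_NEED_RESCHED (1UL << 1)

/*
 * If the target is polling on its flag word, set 'flag' so that the polling
 * loop notices it, and report success. If the target is not polling, leave
 * the word untouched and report failure.
 */
static bool sk_set_flag_if_polling(atomic_ulong *flags, unsigned long flag)
{
    unsigned long val;

    assert(flags != NULL);
    val = atomic_load(flags);

    do {
        if (!(val & SK_POLLING))
            return false;   /* target will not see a flag write */
        if (val & flag)
            return true;    /* flag already set by someone else */
        /* On failure 'val' is refreshed with the current word and we retry. */
    } while (!atomic_compare_exchange_weak(flags, &val, val | flag));

    return true;
}

/* A real interrupt is only needed when the flag write could not wake the target. */
static bool sk_need_real_ipi(atomic_ulong *flags)
{
    return !sk_set_flag_if_polling(flags, SK_NEED_RESCHED);
}
```

The point of the sketch is that the wakeup reuses the reschedule bit, so the woken CPU cannot tell "an IPI is pending" apart from "a task was enqueued".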

When TIF_NEED_RESCHED flag is set, do_idle() will call schedule_idle(),
a large part of which runs with local IRQ disabled. In case of ipistorm,
when measuring IPI throughput, this large IRQ disabled section delays
processing of IPIs. Further auditing revealed that in absence of any
runnable tasks, pick_next_task_fair(), which is called from the
pick_next_task() fast path, will always call newidle_balance() in this
scenario, further increasing the time spent in the IRQ disabled section.

Following is the crude visualization of the problem with relevant
functions expanded:
--
CPU0CPU1

do_idle() {

[RFC PATCH 02/14] sched: Define a need_resched_or_ipi() helper and use it treewide

2024-02-20 Thread K Prateek Nayak

From: "Gautham R. Shenoy" 

Currently TIF_NEED_RESCHED is being overloaded to wake up an idle CPU in
TIF_POLLING mode to service an IPI even if there are no new tasks being
woken up on the said CPU.

In preparation for a proper fix, introduce a new helper
"need_resched_or_ipi()" which is intended to return true if either
the TIF_NEED_RESCHED flag or the TIF_NOTIFY_IPI flag is set. Use this
helper function in place of need_resched() in idle loops where
TIF_POLLING_NRFLAG is set.

To preserve bisectability and avoid unbreakable idle loops, all the
need_resched() checks within TIF_POLLING_NRFLAG sections have been
replaced tree-wide with the need_resched_or_ipi() check.
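
For readers unfamiliar with the polling idle loops being converted, here is a small user-space model of the new exit condition. It is a sketch only: the bit positions, the `sk_` names, and the spin limit are assumptions made for illustration and do not appear in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit positions only; the real ones are defined per architecture. */
enum {
    SK_TIF_NEED_RESCHED = 1,
    SK_TIF_NOTIFY_IPI   = 9,
};

/* True if either the reschedule bit or the pending-IPI bit is set. */
static bool sk_need_resched_or_ipi(unsigned long flags)
{
    unsigned long mask = (1UL << SK_TIF_NEED_RESCHED) |
                         (1UL << SK_TIF_NOTIFY_IPI);

    return (flags & mask) != 0;
}

/*
 * Model of a polling idle loop: spin until one of the two bits shows up in
 * the flag word or the spin limit is reached. Returns the number of spins.
 */
static unsigned int sk_poll_idle(const volatile unsigned long *flags,
                                 unsigned int limit)
{
    unsigned int spins = 0;

    assert(flags != NULL);

    while (!sk_need_resched_or_ipi(*flags) && spins < limit)
        spins++;

    return spins;
}
```

A loop that still tested only the reschedule bit would keep spinning when only the IPI bit is set, which is the unbreakable-loop hazard the paragraph above refers to.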

[ prateek: Replaced some of the missed out occurrences of
  need_resched() within TIF_POLLING sections with
  need_resched_or_ipi() ]

Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Russell King 
Cc: Guo Ren 
Cc: Michal Simek 
Cc: Dinh Nguyen 
Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: "Naveen N. Rao" 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: "David S. Miller" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: "Rafael J. Wysocki" 
Cc: Daniel Lezcano 
Cc: Peter Zijlstra 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Daniel Bristot de Oliveira 
Cc: Valentin Schneider 
Cc: Al Viro 
Cc: Linus Walleij 
Cc: Ard Biesheuvel 
Cc: Andrew Donnellan 
Cc: Nicholas Miehlbradt 
Cc: Andrew Morton 
Cc: Arnd Bergmann 
Cc: Josh Poimboeuf 
Cc: "Kirill A. Shutemov" 
Cc: Rick Edgecombe 
Cc: Tony Battersby 
Cc: Brian Gerst 
Cc: Tim Chen 
Cc: David Vernet 
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-openr...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: Gautham R. Shenoy 
Co-developed-by: K Prateek Nayak 
Signed-off-by: K Prateek Nayak 
---
 arch/x86/include/asm/mwait.h  | 2 +-
 arch/x86/kernel/process.c | 2 +-
 drivers/cpuidle/cpuidle-powernv.c | 2 +-
 drivers/cpuidle/cpuidle-pseries.c | 2 +-
 drivers/cpuidle/poll_state.c  | 2 +-
 include/linux/sched.h | 5 +
 include/linux/sched/idle.h| 4 ++--
 kernel/sched/idle.c   | 7 ---
 8 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 778df05f8539..ac1370143407 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -115,7 +115,7 @@ static __always_inline void mwait_idle_with_hints(unsigned 
long eax, unsigned lo
}
 
__monitor((void *)_thread_info()->flags, 0, 0);
-   if (!need_resched())
+   if (!need_resched_or_ipi())
__mwait(eax, ecx);
}
current_clr_polling();
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b6f4e8399fca..ca6cb7e28cba 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -925,7 +925,7 @@ static __cpuidle void mwait_idle(void)
}
 
__monitor((void *)_thread_info()->flags, 0, 0);
-   if (!need_resched()) {
+   if (!need_resched_or_ipi()) {
__sti_mwait(0, 0);
raw_local_irq_disable();
}
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index 9ebedd972df0..77c3bb371f56 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -79,7 +79,7 @@ static int snooze_loop(struct cpuidle_device *dev,
dev->poll_time_limit = false;
ppc64_runlatch_off();
HMT_very_low();
-   while (!need_resched()) {
+   while (!need_resched_or_ipi()) {
if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) {
/*
 * Task has not woken up but we are exiting the polling
diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 14db9b7d985d..4f2b490f8b73 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -46,7 +46,7 @@ int snooze_loop(struct cpuidle_device *dev, struct 
cpuidle_driver *drv,
snooze_exit_time = get_tb() + snooze_timeout;
dev->poll_time_limit = false;
 
-   while (!need_resched()) {
+   while (!need_resched_or_ipi()) {
HMT_low();
HMT_very_low();
if (likely(snooze_timeout_en) && get_tb() > snooze_exit_time) {
diff --git

[RFC PATCH 01/14] thread_info: Add helpers to test and clear TIF_NOTIFY_IPI

2024-02-20 Thread K Prateek Nayak

From: "Gautham R. Shenoy" 

Introduce the notion of TIF_NOTIFY_IPI flag. When a processor in
TIF_POLLING mode needs to process an IPI, the sender sets NEED_RESCHED
bit in idle task's thread_info to pull the target out of idle and avoids
sending an interrupt to the idle CPU. When NEED_RESCHED is set, the
scheduler assumes that a new task has been queued on the idle CPU and
calls schedule_idle(). However, an IPI on an
idle CPU will not necessarily end up waking a task on the said CPU. To avoid
spurious calls to schedule_idle() assuming an IPI on an idle CPU will
always wake a task on the said CPU, TIF_NOTIFY_IPI will be used to pull
a TIF_POLLING CPU out of idle.

Since the IPI handlers are processed before the call to schedule_idle(),
schedule_idle() will be called only if one of the handlers has woken up
a new task on the CPU and has set NEED_RESCHED.

Add tif_notify_ipi() and current_clr_notify_ipi() helpers to test if
TIF_NOTIFY_IPI is set in the current task's thread_info, and to clear it
respectively. These interfaces will be used in subsequent patches as
TIF_NOTIFY_IPI notion is integrated in the scheduler and in the idle
path.
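
A user-space analogue of the test and clear helpers, written with C11 atomics, may help when reading the diff below. It is a sketch under stated assumptions: `struct sk_thread_info`, the `sk_` functions, and the bit number 9 are hypothetical, and the setter is included only so the example can be exercised.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit number for this sketch. */
#define SK_TIF_NOTIFY_IPI 9

/* Hypothetical stand-in for the per-task flag word. */
struct sk_thread_info {
    atomic_ulong flags;
};

/* Test whether the pending-IPI bit is set. */
static bool sk_tif_notify_ipi(struct sk_thread_info *ti)
{
    assert(ti != NULL);
    return (atomic_load(&ti->flags) & (1UL << SK_TIF_NOTIFY_IPI)) != 0;
}

/* Atomically clear the pending-IPI bit, leaving all other bits alone. */
static void sk_clr_notify_ipi(struct sk_thread_info *ti)
{
    assert(ti != NULL);
    atomic_fetch_and(&ti->flags, ~(1UL << SK_TIF_NOTIFY_IPI));
}

/* Atomically set the pending-IPI bit (sender side, for the example only). */
static void sk_set_notify_ipi(struct sk_thread_info *ti)
{
    assert(ti != NULL);
    atomic_fetch_or(&ti->flags, 1UL << SK_TIF_NOTIFY_IPI);
}
```

Clearing with an atomic read-modify-write matters because other bits in the same word may be updated concurrently by other CPUs.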

[ prateek: Split the changes into a separate patch, add commit log ]

Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Russell King 
Cc: Guo Ren 
Cc: Michal Simek 
Cc: Dinh Nguyen 
Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: "Naveen N. Rao" 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: "David S. Miller" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: "Rafael J. Wysocki" 
Cc: Daniel Lezcano 
Cc: Peter Zijlstra 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Daniel Bristot de Oliveira 
Cc: Valentin Schneider 
Cc: Al Viro 
Cc: Linus Walleij 
Cc: Ard Biesheuvel 
Cc: Andrew Donnellan 
Cc: Nicholas Miehlbradt 
Cc: Andrew Morton 
Cc: Arnd Bergmann 
Cc: Josh Poimboeuf 
Cc: "Kirill A. Shutemov" 
Cc: Rick Edgecombe 
Cc: Tony Battersby 
Cc: Brian Gerst 
Cc: Tim Chen 
Cc: David Vernet 
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-openr...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: Gautham R. Shenoy 
Co-developed-by: K Prateek Nayak 
Signed-off-by: K Prateek Nayak 
---
 include/linux/thread_info.h | 43 +
 1 file changed, 43 insertions(+)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f4..1e10dd8c0227 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -195,6 +195,49 @@ static __always_inline bool tif_need_resched(void)
 
 #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
 
+#ifdef TIF_NOTIFY_IPI
+
+#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
+
+static __always_inline bool tif_notify_ipi(void)
+{
+   return arch_test_bit(TIF_NOTIFY_IPI,
+(unsigned long *)(_thread_info()->flags));
+}
+
+static __always_inline void current_clr_notify_ipi(void)
+{
+   arch_clear_bit(TIF_NOTIFY_IPI,
+  (unsigned long *)(_thread_info()->flags));
+}
+
+#else
+
+static __always_inline bool tif_notify_ipi(void)
+{
+   return test_bit(TIF_NOTIFY_IPI,
+   (unsigned long *)(_thread_info()->flags));
+}
+
+static __always_inline void current_clr_notify_ipi(void)
+{
+   clear_bit(TIF_NOTIFY_IPI,
+ (unsigned long *)(_thread_info()->flags));
+}
+
+#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
+
+#else /* !TIF_NOTIFY_IPI */
+
+static __always_inline bool tif_notify_ipi(void)
+{
+   return false;
+}
+
+static __always_inline void current_clr_notify_ipi(void) { }
+
+#endif /* TIF_NOTIFY_IPI */
+
 #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
 static inline int arch_within_stack_frames(const void * const stack,
   const void * const stackend,
-- 
2.34.1

[RFC PATCH 00/14] Introducing TIF_NOTIFY_IPI flag

2024-02-20 Thread K Prateek Nayak

Hello everyone,

Before jumping into the issue, let me clarify the Cc list. Everyone has
been cc'ed on Patch 0 through Patch 3. Respective arch maintainers,
reviewers, and committers returned by scripts/get_maintainer.pl have
been cc'ed on the respective arch side changes. Scheduler and CPU Idle
maintainers and reviewers have been included for the entire series. If I
have missed anyone, please do add them. If you would like to be dropped
from the cc list, wholly or partially, for the future iterations, please
do let me know.

With that out of the way ...

Problem statement
=================

When measuring IPI throughput using a modified version of Anton
Blanchard's ipistorm benchmark [1], configured to measure time taken to
perform a fixed number of smp_call_function_single() (with wait set to
1), an increase in benchmark time was observed between v5.7 and the
current upstream release (v6.7-rc6 at the time of encounter).

Bisection pointed to commit b2a02fc43a1f ("smp: Optimize
send_call_function_single_ipi()") as the reason behind this increase in
runtime.


Experiments
===========

Since the commit cannot be cleanly reverted on top of the current
tip:sched/core, the effects of the optimizations were reverted by:

1. Removing the check for call_function_single_prep_ipi() in
   send_call_function_single_ipi(). With this change
   send_call_function_single_ipi() always calls
   arch_send_call_function_single_ipi()

2. Removing the call to flush_smp_call_function_queue() in do_idle()
   since every smp_call_function, with (1.), would unconditionally send
   an IPI to an idle CPU in TIF_POLLING mode.

Following is the diff of the above described changes which will be
henceforth referred to as the "revert":

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 31231925f1ec..735184d98c0f 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -332,11 +332,6 @@ static void do_idle(void)
 */
smp_mb__after_atomic();
 
-   /*
-* RCU relies on this call to be done outside of an RCU read-side
-* critical section.
-*/
-   flush_smp_call_function_queue();
schedule_idle();
 
if (unlikely(klp_patch_pending(current)))
diff --git a/kernel/smp.c b/kernel/smp.c
index f085ebcdf9e7..2ff100c41885 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -111,11 +111,9 @@ void __init call_function_init(void)
 static __always_inline void
 send_call_function_single_ipi(int cpu)
 {
-   if (call_function_single_prep_ipi(cpu)) {
-   trace_ipi_send_cpu(cpu, _RET_IP_,
-  generic_smp_call_function_single_interrupt);
-   arch_send_call_function_single_ipi(cpu);
-   }
+   trace_ipi_send_cpu(cpu, _RET_IP_,
+  generic_smp_call_function_single_interrupt);
+   arch_send_call_function_single_ipi(cpu);
 }
 
 static __always_inline void
--

With the revert, the time taken to complete a fixed set of IPIs using
ipistorm improves significantly. Following are the numbers from a dual
socket 3rd Generation EPYC system (2 x 64C/128T) (boost on, C2 disabled)
running ipistorm between CPU8 and CPU16:

cmdline: insmod ipistorm.ko numipi=10 single=1 offset=8 cpulist=8 wait=1

(tip:sched/core at tag "sched-core-2024-01-08" for all the testing done
below)

  ==
  Test  : ipistorm (modified)
  Units : Normalized runtime
  Interpretation: Lower is better
  Statistic : AMean
  ==
  kernel:   time [pct imp]
  tip:sched/core1.00 [0.00]
  tip:sched/core + revert   0.81 [19.36]

Although the revert improves ipistorm performance, it also regresses
tbench and netperf, supporting the validity of the optimization.
Following are netperf and tbench numbers from the same machine comparing
vanilla tip:sched/core and the revert applied on top:

  ==
  Test  : tbench
  Units : Normalized throughput
  Interpretation: Higher is better
  Statistic : AMean
  ==
  Clients:tip[pct imp](CV)   revert[pct imp](CV)
  1 1.00 [  0.00]( 0.24) 0.91 [ -8.96]( 0.30)
  2 1.00 [  0.00]( 0.25) 0.92 [ -8.20]( 0.97)
  4 1.00 [  0.00]( 0.23) 0.91 [ -9.20]( 1.75)
  8 1.00 [  0.00]( 0.69) 0.91 [ -9.48]( 1.56)
 16 1.00 [  0.00]( 0.66) 0.92 [ -8.49]( 2.43)
 32 1.00 [  0.00]( 0.96) 0.89 [-11.13]( 0.96)
 64 1.00 [  0.00]( 1.06) 0.90 [ -9.72]( 2.49)
128 1.00 [  0.00]( 0.70) 0.92 [ -8.36]( 1.26)
256 1.00 [  0.00]( 0.72) 0.97 [ -3.30]( 1.10)
512 1.00 [  0.00]( 0.42) 0.98 [ -1.73]( 0.37)
   1024 1.00 [  0.00]( 0.28) 0.99 [ -1.39]( 0.43)

Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-20 Thread Ryan Roberts

On 19/02/2024 15:18, Catalin Marinas wrote:
> On Fri, Feb 16, 2024 at 12:53:43PM +, Ryan Roberts wrote:
>> On 16/02/2024 12:25, Catalin Marinas wrote:
>>> On Thu, Feb 15, 2024 at 10:31:59AM +, Ryan Roberts wrote:
 +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 +{
 +  /*
 +   * Gather access/dirty bits, which may be populated in any of the ptes
 +   * of the contig range. We may not be holding the PTL, so any contiguous
 +   * range may be unfolded/modified/refolded under our feet. Therefore we
 +   * ensure we read a _consistent_ contpte range by checking that all ptes
 +   * in the range are valid and have CONT_PTE set, that all pfns are
 +   * contiguous and that all pgprots are the same (ignoring access/dirty).
 +   * If we find a pte that is not consistent, then we must be racing with
 +   * an update so start again. If the target pte does not have CONT_PTE
 +   * set then that is considered consistent on its own because it is not
 +   * part of a contpte range.
 +*/
> [...]
>>> After writing the comments above, I think I figured out that the whole
>>> point of this loop is to check that the ptes in the contig range are
>>> still consistent and the only variation allowed is the dirty/young
>>> state to be passed to the orig_pte returned. The original pte may have
>>> been updated by the time this loop finishes but I don't think it
>>> matters, it wouldn't be any different than reading a single pte and
>>> returning it while it is being updated.
>>
>> Correct. The pte can be updated at any time, before, after or during the
>> reads. That was always the case. But now we have to cope with a whole contpte
>> block being repainted while we are reading it. So we are just checking to make
>> sure that all the ptes that we read from the contpte block are consistent with
>> each other and therefore we can trust that the access/dirty bits we gathered
>> are consistent.
> 
> I've been thinking a bit more about this - do any of the callers of
> ptep_get_lockless() check the dirty/access bits? The only one that seems
> to care is ptdump but in that case I'd rather see the raw bits for
> debugging rather than propagating the dirty/access bits to the rest in
> the contig range.
> 
> So with some clearer documentation on the requirements, I think we don't
> need an arm64-specific ptep_get_lockless() (unless I missed something).

We've discussed similar at [1]. And I've posted an RFC series to convert all
ptep_get_lockless() to ptep_get_lockless_norecency() at [2]. The current spec
for ptep_get_lockless() is that it includes the access and dirty bits. So we
can't just read the single pte - if there is a tlb eviction followed by
re-population for the block, the access/dirty bits could move and that will
break pte_same() comparisons which are used in places.
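
To make the consistency check concrete, here is a rough user-space model of one pass over a contiguous block. It is a sketch, not the arm64 code: the bit layout, the block size of 4, and the names `sk_gather_once` and `SK_*` are all assumptions. A caller would simply retry while the function reports an inconsistent block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pte layout for this sketch. */
#define SK_VALID     (1ull << 0)
#define SK_CONT      (1ull << 1)
#define SK_YOUNG     (1ull << 2)
#define SK_DIRTY     (1ull << 3)
#define SK_PFN_SHIFT 12
#define SK_NR        4u   /* entries per contiguous block (assumed) */

/*
 * One lockless pass over the block containing entry 'idx'.
 * - If the target entry is not a valid contiguous entry, it is consistent
 *   on its own and is returned unchanged.
 * - Otherwise every entry must carry the same protection bits (ignoring
 *   young/dirty) and a pfn that continues the sequence; young/dirty bits
 *   found anywhere in the block are folded into the result.
 * Returns false if the block looked inconsistent, meaning we raced with an
 * update and the caller should start again.
 */
static bool sk_gather_once(const volatile uint64_t *block, unsigned int idx,
                           uint64_t *out)
{
    const uint64_t recency = SK_YOUNG | SK_DIRTY;
    const uint64_t low_mask = (1ull << SK_PFN_SHIFT) - 1;
    uint64_t orig, prot, base_pfn, gathered;
    unsigned int i;

    assert(idx < SK_NR);

    orig = block[idx];
    if (!(orig & SK_VALID) || !(orig & SK_CONT)) {
        *out = orig;
        return true;
    }

    prot = orig & low_mask & ~recency;
    base_pfn = (orig >> SK_PFN_SHIFT) - idx;
    gathered = orig;

    for (i = 0; i < SK_NR; i++) {
        uint64_t pte = block[i];

        /* Protection bits include VALID and CONT, so this checks both. */
        if ((pte & low_mask & ~recency) != prot)
            return false;
        if ((pte >> SK_PFN_SHIFT) != base_pfn + i)
            return false;

        gathered |= pte & recency;
    }

    *out = gathered;
    return true;
}
```

Reading only the single target entry would miss young/dirty bits recorded on a sibling entry, which is why a later pte_same() comparison could fail.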

So the previous conclusion was that we are ok to put this arm64-specific
ptep_get_lockless() in for now, but look to simplify by migrating to
ptep_get_lockless_norecency() in future. Are you ok with that approach?

[1]
https://lore.kernel.org/linux-mm/a91cfe1c-289e-4828-8cfc-be34eb69a...@redhat.com/
[2] 
https://lore.kernel.org/linux-mm/20240215121756.2734131-1-ryan.robe...@arm.com/

Thanks,
Ryan

Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-20 Thread Ryan Roberts

On 16/02/2024 19:54, John Hubbard wrote:
> On 2/16/24 08:56, Catalin Marinas wrote:
> ...
>>> The problem is that the contpte_* symbols are called from the ptep_* inline
>>> functions. So where those inlines are called from modules, we need to make
>>> sure the contpte_* symbols are available.
>>>
>>> John Hubbard originally reported this problem against v1 and I enumerated
>>> all the drivers that call into the ptep_* inlines here:
>>> https://lore.kernel.org/linux-arm-kernel/b994ff89-1a1f-26ca-9479-b08c77f94...@arm.com/#t
>>>
>>> So they definitely need to be exported. Perhaps we can tighten it to
> 
> Yes. Let's keep the in-tree modules working.
> 
>>> EXPORT_SYMBOL_GPL(), but I was being cautious as I didn't want to break
>>> anything out-of-tree. I'm not sure what the normal policy is? arm64 seems to
>>> use ~equal amounts of both.
> 
> EXPORT_SYMBOL_GPL() seems appropriate and low risk. As Catalin says below,
> these really are deeply core mm routines, and any module operating at this
> level is not going to be able to survive on EXPORT_SYMBOL alone, IMHO.
> 
> Now, if only I could find an out of tree module to test that claim on... :)
> 
> 
>> I don't think we are consistent here. For example set_pte_at() can't be
>> called from non-GPL modules because of __sync_icache_dcache. OTOH, such
>> driver is probably doing something dodgy. Same with
>> apply_to_page_range(), it's GPL-only (called from i915).
>>
>> Let's see if others have any view over the next week or so, otherwise
>> I'd go for _GPL and relax it later if someone has a good use-case (can
>> be a patch on top adding _GPL).
> 
> I think going directly to _GPL for these is fine, actually.

OK I'll send out a patch to convert these to _GPL on my return on Monday.
Hopefully Andrew will be able to squash the patch into the existing series.

> 
> 
> thanks,

Re: [PATCH v4 01/11] dt-bindings: wiiu: Document the Nintendo Wii U devicetree

2024-02-20 Thread Krzysztof Kozlowski

On 20/02/2024 17:20, Christophe Leroy wrote:
> Michael,
> 
> Le 19/11/2022 à 12:30, Ash Logan a écrit :
>> Adds schema for the various Wii U devicetree nodes used.
>>
>> Signed-off-by: Ash Logan 
> 
> There's an issue at https://github.com/linuxppc/issues/issues/410 with
> kernel v6.4 as a target for merging this, any plan?
> 
> It still applies without rebase (with git am -3).

No, it should not be merged, because it was never tested and fails in
several places.

Best regards,
Krzysztof

Re: [PATCH RFC] powerpc/pseries: exploit H_PAGE_SET_UNUSED for partition migration

2024-02-20 Thread Nathan Lynch

Michael Ellerman  writes:
> Nathan Lynch via B4 Relay 
> writes:
>> From: Nathan Lynch 
>>
>> Although the H_PAGE_INIT hcall's H_PAGE_SET_UNUSED historically has
>> been tied to the cooperative memory overcommit (CMO) platform feature,
>> the flag also is treated by the PowerVM hypervisor as a hint that the
>> page contents need not be copied to the destination during a live
>> partition migration.
>>
>> Use the "ibm,migratable-partition" root node property to determine
>> whether this partition/guest can be migrated. Mark freed pages unused
>> if so (or if CMO is in use, as before).
>>
>> Signed-off-by: Nathan Lynch 
>> ---
>> Several things yet to improve here:
>>
>> * powerpc's arch_free_page()/HAVE_ARCH_FREE_PAGE should be decoupled
>>   from CONFIG_PPC_SMLPAR.
>>
>> * powerpc's arch_free_page() could be made to use a static key if
>>   justified.
>>
>> * I have not yet measured the overhead this introduces, nor have I
>>   measured the benefit to a live migration.
>>
>> To date, I have smoke tested it by doing a live migration and
>> performing a build on a kernel with the change, to ensure it doesn't
>> introduce obvious memory corruption or anything. It hasn't blown up
>> yet :-)
>>
>> This will be a possibly significant behavior change in that we will be
>> flagging pages unused where we typically did not before. Until now,
>> having CMO enabled was the only way to do this, and I don't think that
>> feature is used all that much?
>
> Yeah AFAIK it has to be explicitly configured and enabled via the HMC,
> so doesn't get much testing or usage.
>
>> Posting this as RFC to see if there are any major concerns.
>  
> My worry is that this will add overhead for everyone in normal usage, an
> hcall per freed set of pages, whereas the benefit is only seen when a
> migration happens.
>
> But that does depend on how often arch_free_page() gets called in normal
> usage, which I don't know offhand.

Yes, and as I said in my followup yesterday:

>> for this to be safe, powerpc/pseries needs to implement
>> arch_alloc_page() to undo setting the "unused" flag.

So, perhaps more significantly, we'd also incur an hcall per
arch_alloc_page() with the most straightforward implementation that
doesn't eat data (unlike this version!).

Nevertheless I'll plan on doing that for the next iteration to see if I
can measure the overhead and benefit, with the expectation that we'll
ultimately need a more sophisticated design.
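
The cost concern above can be made concrete with a small userspace model. This is only a sketch under stated assumptions: fake_hcall_set_unused(), model_free_page(), model_alloc_page() and the counter are hypothetical stand-ins, not kernel or hypervisor interfaces. It shows that pairing the free-side hint with an alloc-side undo costs two calls per free/alloc cycle of a page.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8

/* Hypothetical model state: one "unused" hint per page, plus a call counter. */
static bool page_unused[MODEL_NR_PAGES];
static unsigned long hcall_count;

/* Stand-in for a hypervisor call that sets or clears the "unused" hint. */
static void fake_hcall_set_unused(size_t pfn, bool unused)
{
	page_unused[pfn] = unused;
	hcall_count++;
}

/* Free side: hint that the page contents need not be preserved. */
static void model_free_page(size_t pfn)
{
	fake_hcall_set_unused(pfn, true);
}

/* Alloc side: undo the hint before the page holds live data again. */
static void model_alloc_page(size_t pfn)
{
	fake_hcall_set_unused(pfn, false);
}
```

In this model, skipping model_alloc_page() would leave a live page flagged as unused, which is the data-loss case mentioned above.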

Re: [PATCH v4 01/11] dt-bindings: wiiu: Document the Nintendo Wii U devicetree

2024-02-20 Thread Christophe Leroy

Michael,

On 19/11/2022 at 12:30, Ash Logan wrote:
> Adds schema for the various Wii U devicetree nodes used.
> 
> Signed-off-by: Ash Logan 

There's an issue at https://github.com/linuxppc/issues/issues/410 that
targets kernel v6.4 for merging this. Any plan?

It still applies without rebase (with git am -3).

Christophe


> ---
> v3->v4: Rework to match expected style and conciseness.
> 
>   .../bindings/powerpc/nintendo/wiiu.yaml   | 25 +
>   .../powerpc/nintendo/wiiu/espresso-pic.yaml   | 48 
>   .../bindings/powerpc/nintendo/wiiu/gpu7.yaml  | 42 ++
>   .../powerpc/nintendo/wiiu/latte-ahci.yaml | 50 +
>   .../powerpc/nintendo/wiiu/latte-dsp.yaml  | 35 
>   .../powerpc/nintendo/wiiu/latte-pic.yaml  | 55 +++
>   .../powerpc/nintendo/wiiu/latte-sdhci.yaml| 46 
>   .../bindings/powerpc/nintendo/wiiu/latte.yaml | 31 +++
>   .../devicetree/bindings/usb/generic-ehci.yaml |  1 +
>   9 files changed, 333 insertions(+)
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu.yaml
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu/espresso-pic.yaml
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu/gpu7.yaml
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu/latte-ahci.yaml
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu/latte-dsp.yaml
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu/latte-pic.yaml
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu/latte-sdhci.yaml
>   create mode 100644 
> Documentation/devicetree/bindings/powerpc/nintendo/wiiu/latte.yaml
> 
> diff --git a/Documentation/devicetree/bindings/powerpc/nintendo/wiiu.yaml 
> b/Documentation/devicetree/bindings/powerpc/nintendo/wiiu.yaml
> new file mode 100644
> index ..23703b1052d0
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/nintendo/wiiu.yaml
> @@ -0,0 +1,25 @@
> +# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/powerpc/nintendo/wiiu.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Nintendo Wii U bindings
> +
> +maintainers:
> +  - Ash Logan 
> +  - Emmanuel Gil Peyrot 
> +
> +description: |
> +  Nintendo Wii U video game console binding.
> +
> +properties:
> +  $nodename:
> +const: "/"
> +
> +  compatible:
> +const: nintendo,wiiu
> +
> +additionalProperties: true
> +
> +...
> diff --git 
> a/Documentation/devicetree/bindings/powerpc/nintendo/wiiu/espresso-pic.yaml 
> b/Documentation/devicetree/bindings/powerpc/nintendo/wiiu/espresso-pic.yaml
> new file mode 100644
> index ..476a8ccda7a1
> --- /dev/null
> +++ 
> b/Documentation/devicetree/bindings/powerpc/nintendo/wiiu/espresso-pic.yaml
> @@ -0,0 +1,48 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/powerpc/nintendo/wiiu/espresso-pic.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Nintendo Wii U "Espresso" interrupt controller
> +
> +maintainers:
> +  - Ash Logan 
> +  - Emmanuel Gil Peyrot 
> +
> +description: |
> +  Interrupt controller found on the Nintendo Wii U for the "Espresso" 
> processor.
> +
> +allOf:
> +  - $ref: "/schemas/interrupt-controller.yaml#"
> +
> +properties:
> +  compatible:
> +const: nintendo,espresso-pic
> +
> +  '#interrupt-cells':
> +# Interrupt numbers 0-32 in one cell
> +const: 1
> +
> +  interrupt-controller: true
> +
> +  reg:
> +maxItems: 1
> +
> +required:
> +  - compatible
> +  - '#interrupt-cells'
> +  - interrupt-controller
> +  - reg
> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +interrupt-controller@c78 {
> +compatible = "nintendo,espresso-pic";
> +reg = <0x0c78 0x18>;
> +#interrupt-cells = <1>;
> +interrupt-controller;
> +};
> +...
> diff --git 
> a/Documentation/devicetree/bindings/powerpc/nintendo/wiiu/gpu7.yaml 
> b/Documentation/devicetree/bindings/powerpc/nintendo/wiiu/gpu7.yaml
> new file mode 100644
> index ..d44ebe0d866c
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/nintendo/wiiu/gpu7.yaml
> @@ -0,0 +1,42 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/powerpc/nintendo/wiiu/gpu7.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Nintendo Wii U Latte "GPU7" graphics processor
> +
> +maintainers:
> +  - Ash Logan 
> +  - Emmanuel Gil Peyrot 
> +
> +description: |
> +  GPU7 graphics processor, also known as "GX2", found in the Latte 
> multifunction chip of the
> +  Nintendo Wii U.
> +
> +properties:
> +  compatible:
> +const: nintendo,latte-gpu7
> +
> +

Re: [PATCH v3] powerpc: macio: Make remove callback of macio driver void returned

2024-02-20 Thread Christophe Leroy

Hi Michael,

ping ?

On 01/02/2023 at 15:36, Dawei Li wrote:
> Commit fc7a6209d571 ("bus: Make remove callback return void") forces
> bus_type::remove to return void, so it doesn't make much sense for any
> bus-based driver implementing a remove callback to return non-void to
> its caller.
> 
> This change is for macio bus based drivers.
> 
> Signed-off-by: Dawei Li 

This patch is Acked; any special reason for not applying it?

Note that it now conflicts with commit 1535d5962d79 ("wifi: remove 
orphaned orinoco driver") but resolution is trivial, just drop the 
changes to that file.

Christophe
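
For illustration, the shape of the conversion can be sketched in plain C outside the kernel. The types and names below (toy_dev, toy_driver, toy_remove, toy_bus_remove) are hypothetical and are not the macio API; the point is only that a bus core which ignores the result gains nothing from an int-returning remove callback.

```c
#include <stddef.h>

struct toy_dev {
	int removed;	/* set once the driver has torn the device down */
};

struct toy_driver {
	/* void return: the bus core has no way to act on a failure here */
	void (*remove)(struct toy_dev *dev);
};

static void toy_remove(struct toy_dev *dev)
{
	dev->removed = 1;	/* release resources; nothing to report back */
}

/* Bus-side teardown: calls the driver's remove hook if it has one. */
static void toy_bus_remove(const struct toy_driver *drv, struct toy_dev *dev)
{
	if (drv->remove)
		drv->remove(dev);
}
```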

> ---
> v2 -> v3
> - Rebased on latest powerpc/next.
> - cc' to relevant subsystem lists.
> 
> v1 -> v2
> - Revert unneeded changes.
> - Rebased on latest powerpc/next.
> 
> v1
> - 
> https://lore.kernel.org/all/tycp286mb2323fcdc7ecd87f8d97cb74bca...@tycp286mb2323.jpnp286.prod.outlook.com/
> ---
>   arch/powerpc/include/asm/macio.h| 2 +-
>   drivers/ata/pata_macio.c| 4 +---
>   drivers/macintosh/rack-meter.c  | 4 +---
>   drivers/net/ethernet/apple/bmac.c   | 4 +---
>   drivers/net/ethernet/apple/mace.c   | 4 +---
>   drivers/net/wireless/intersil/orinoco/airport.c | 4 +---
>   drivers/scsi/mac53c94.c | 5 +
>   drivers/scsi/mesh.c | 5 +
>   drivers/tty/serial/pmac_zilog.c | 7 ++-
>   sound/aoa/soundbus/i2sbus/core.c| 4 +---
>   10 files changed, 11 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/macio.h 
> b/arch/powerpc/include/asm/macio.h
> index ff5fd82d9ff0..cb9c386dacf8 100644
> --- a/arch/powerpc/include/asm/macio.h
> +++ b/arch/powerpc/include/asm/macio.h
> @@ -125,7 +125,7 @@ static inline struct pci_dev *macio_get_pci_dev(struct 
> macio_dev *mdev)
>   struct macio_driver
>   {
>   int (*probe)(struct macio_dev* dev, const struct of_device_id 
> *match);
> - int (*remove)(struct macio_dev* dev);
> + void(*remove)(struct macio_dev *dev);
>   
>   int (*suspend)(struct macio_dev* dev, pm_message_t state);
>   int (*resume)(struct macio_dev* dev);
> diff --git a/drivers/ata/pata_macio.c b/drivers/ata/pata_macio.c
> index 9ccaac9e2bc3..653106716a4b 100644
> --- a/drivers/ata/pata_macio.c
> +++ b/drivers/ata/pata_macio.c
> @@ -1187,7 +1187,7 @@ static int pata_macio_attach(struct macio_dev *mdev,
>   return rc;
>   }
>   
> -static int pata_macio_detach(struct macio_dev *mdev)
> +static void pata_macio_detach(struct macio_dev *mdev)
>   {
>   struct ata_host *host = macio_get_drvdata(mdev);
>   struct pata_macio_priv *priv = host->private_data;
> @@ -1202,8 +1202,6 @@ static int pata_macio_detach(struct macio_dev *mdev)
>   ata_host_detach(host);
>   
>   unlock_media_bay(priv->mdev->media_bay);
> -
> - return 0;
>   }
>   
>   #ifdef CONFIG_PM_SLEEP
> diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
> index c28893e41a8b..f2f83c4f3af5 100644
> --- a/drivers/macintosh/rack-meter.c
> +++ b/drivers/macintosh/rack-meter.c
> @@ -523,7 +523,7 @@ static int rackmeter_probe(struct macio_dev* mdev,
>   return rc;
>   }
>   
> -static int rackmeter_remove(struct macio_dev* mdev)
> +static void rackmeter_remove(struct macio_dev *mdev)
>   {
>   struct rackmeter *rm = dev_get_drvdata(&mdev->ofdev.dev);
>   
> @@ -558,8 +558,6 @@ static int rackmeter_remove(struct macio_dev* mdev)
>   
>   /* Get rid of me */
>   kfree(rm);
> -
> - return 0;
>   }
>   
>   static int rackmeter_shutdown(struct macio_dev* mdev)
> diff --git a/drivers/net/ethernet/apple/bmac.c 
> b/drivers/net/ethernet/apple/bmac.c
> index 9e653e2925f7..292b1f9cd9e7 100644
> --- a/drivers/net/ethernet/apple/bmac.c
> +++ b/drivers/net/ethernet/apple/bmac.c
> @@ -1591,7 +1591,7 @@ bmac_proc_info(char *buffer, char **start, off_t 
> offset, int length)
>   }
>   #endif
>   
> -static int bmac_remove(struct macio_dev *mdev)
> +static void bmac_remove(struct macio_dev *mdev)
>   {
>   struct net_device *dev = macio_get_drvdata(mdev);
>   struct bmac_data *bp = netdev_priv(dev);
> @@ -1609,8 +1609,6 @@ static int bmac_remove(struct macio_dev *mdev)
>   macio_release_resources(mdev);
>   
>   free_netdev(dev);
> -
> - return 0;
>   }
>   
>   static const struct of_device_id bmac_match[] =
> diff --git a/drivers/net/ethernet/apple/mace.c 
> b/drivers/net/ethernet/apple/mace.c
> index fd1b008b7208..e6350971c707 100644
> --- a/drivers/net/ethernet/apple/mace.c
> +++ b/drivers/net/ethernet/apple/mace.c
> @@ -272,7 +272,7 @@ static int mace_probe(struct macio_dev *mdev, const 
> struct of_device_id *match)
>   return rc;
>   }
>   
> -static int mace_remove(struct macio_dev *mdev)
> +static void mace_remove(struct macio_dev *mdev)
>   {
>   struct net_device *dev = macio_get_drvdata(mdev);
>   struct mace_data *mp;
> @@ -296,8 +296,6

Re: [RESEND PATCH net v4 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2024-02-20 Thread Sean Anderson

On 2/19/24 10:30, Vladimir Oltean wrote:
> Hi Sean,
>
> On Thu, Feb 15, 2024 at 11:23:26AM -0500, Sean Anderson wrote:
>> smp_call_function_single disables IRQs when executing the callback. To
>> prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
>> This is already done by qman_update_cgr and qman_delete_cgr; fix the
>> other lockers.
>>
>> Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
>> CC: sta...@vger.kernel.org
>> Signed-off-by: Sean Anderson 
>> Reviewed-by: Camelia Groza 
>> Tested-by: Vladimir Oltean 
>> ---
>> I got no response the first time I sent this, so I am resending to net.
>> This issue was introduced in a series which went through net, so I hope
>> it makes sense to take it via net.
>>
>> [1] 
>> https://lore.kernel.org/linux-arm-kernel/20240108161904.2865093-1-sean.anderson@seco.com/
>>
>> (no changes since v3)
>>
>> Changes in v3:
>> - Change blamed commit to something more appropriate
>>
>> Changes in v2:
>> - Fix one additional call to spin_unlock
>
> Leo Li (Li Yang) is no longer with NXP. Until we figure out within NXP
> how to continue with the maintainership of drivers/soc/fsl/, yes, please
> continue to submit this series to 'net'. I would also like to point
> out to Arnd that this is the case.
>
> Arnd, a large portion of drivers/soc/fsl/ is networking-related
> (dpio, qbman). Would it make sense to transfer the maintainership
> of these under the respective networking drivers, to simplify the
> procedures?
>
> Also, your patches are whitespace-damaged. They do not apply to the
> kernel, and patchwork shows this as well.
> https://patchwork.kernel.org/project/netdevbpf/patch/20240215162327.3663092-1-sean.anderson@seco.com/
>
> Please repost with this fixed.

Hm, I used the same method I have in the past (git send-email). But I
guess something is converting my tabs to spaces? Maybe it is related to
the embedded world advertisement...

Maybe the solution is to get a kernel.org email...

--Sean

[Embedded World 2024, SECO 
SpA]

Re: [PATCH linux-next 1/3] kexec/kdump: make struct crash_mem available without CONFIG_CRASH_DUMP

2024-02-20 Thread Baoquan He

On 02/13/24 at 05:01pm, Hari Bathini wrote:
> struct crash_mem defined under include/linux/crash_core.h represents
> a list of memory ranges. While it is used to represent memory ranges

From its name, it's not only representing memory ranges, it's
representing crash memory ranges. Except for this, the whole series looks
good to me. Thanks for the effort.

> for kdump kernel, it can also be used for other kind of memory ranges.
> In fact, KEXEC_FILE_LOAD syscall in powerpc uses this structure to
> represent reserved memory ranges and exclude memory ranges needed to
> find the right memory regions to load kexec kernel. So, make the
> definition of crash_mem structure available for !CONFIG_CRASH_DUMP
> case too.
> 
> Signed-off-by: Hari Bathini 
> ---
>  include/linux/crash_core.h | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index 23270b16e1db..d33352c2e386 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -8,6 +8,12 @@
>  
>  struct kimage;
>  
> +struct crash_mem {
> + unsigned int max_nr_ranges;
> + unsigned int nr_ranges;
> + struct range ranges[] __counted_by(max_nr_ranges);
> +};
> +
>  #ifdef CONFIG_CRASH_DUMP
>  
>  int crash_shrink_memory(unsigned long new_size);
> @@ -51,12 +57,6 @@ static inline unsigned int crash_get_elfcorehdr_size(void) 
> { return 0; }
>  /* Alignment required for elf header segment */
>  #define ELF_CORE_HEADER_ALIGN   4096
>  
> -struct crash_mem {
> - unsigned int max_nr_ranges;
> - unsigned int nr_ranges;
> - struct range ranges[] __counted_by(max_nr_ranges);
> -};
> -
>  extern int crash_exclude_mem_range(struct crash_mem *mem,
>  unsigned long long mstart,
>  unsigned long long mend);
> -- 
> 2.43.0
>
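
As a rough userspace illustration of why the structure is useful beyond kdump, the same layout can hold any bounded list of ranges. This is a sketch, not kernel code: __counted_by() is dropped because it is a kernel annotation, struct range is redeclared locally, and toy_alloc_ranges() and toy_add_range() are hypothetical helpers.

```c
#include <stdlib.h>

struct range {
	unsigned long long start;
	unsigned long long end;
};

/* Userspace copy of the layout from the patch, minus __counted_by(). */
struct crash_mem {
	unsigned int max_nr_ranges;
	unsigned int nr_ranges;
	struct range ranges[];
};

/* Hypothetical helper: allocate room for max_nr_ranges entries. */
static struct crash_mem *toy_alloc_ranges(unsigned int max_nr_ranges)
{
	struct crash_mem *mem;

	mem = calloc(1, sizeof(*mem) + max_nr_ranges * sizeof(mem->ranges[0]));
	if (!mem)
		return NULL;
	mem->max_nr_ranges = max_nr_ranges;
	return mem;
}

/* Hypothetical helper: append one range; returns 0 on success, -1 if full. */
static int toy_add_range(struct crash_mem *mem,
			 unsigned long long start, unsigned long long end)
{
	if (mem->nr_ranges >= mem->max_nr_ranges)
		return -1;
	mem->ranges[mem->nr_ranges].start = start;
	mem->ranges[mem->nr_ranges].end = end;
	mem->nr_ranges++;
	return 0;
}
```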

Re: [PATCH v2] uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries

2024-02-20 Thread Michael Ellerman

On Wed, 14 Feb 2024 16:34:06 -0600, Peter Bergner wrote:
> Changes from v1:
> - Add Acked-by lines.
> 
> The powerpc toolchain keeps a copy of the HWCAP bit masks in our TCB for fast
> access by the __builtin_cpu_supports built-in function.  The TCB space for
> the HWCAP entries - which are created in pairs - is an ABI extension, so
> waiting to create the space for HWCAP3 and HWCAP4 until we need them is
> problematical.  Define AT_HWCAP3 and AT_HWCAP4 in the generic uapi header
> so they can be used in glibc to reserve space in the powerpc TCB for their
> future use.
> 
> [...]

Applied to powerpc/next.

[1/1] uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector, entries
  https://git.kernel.org/powerpc/c/3281366a8e79a512956382885091565db1036b64

cheers

Re: [PATCH 0/7] macintosh: Convert to platform remove callback returning void

2024-02-20 Thread Michael Ellerman

On Wed, 10 Jan 2024 16:42:47 +0100, Uwe Kleine-König wrote:
> this series converts all drivers below drivers/macintosh to use
> .remove_new(). See commit 5c5a7680e67b ("platform: Provide a remove
> callback that returns no value") for an extended explanation and the
> eventual goal. The TL;DR; is to make it harder for driver authors to
> leak resources without noticing.
> 
> This is merge window material. All patches are pairwise independent of
> each other so they can be applied individually. There isn't a maintainer
> for drivers/macintosh, I'm still sending this as a series in the hope
> Michael feels responsible and applies it completely.
> 
> [...]

Applied to powerpc/next.

[1/7] macintosh: therm_windtunnel: Convert to platform remove callback 
returning void
  https://git.kernel.org/powerpc/c/bd6d99b70b2ffa96119826f22e96a5b77e6f90d6
[2/7] macintosh: windfarm_pm112: Convert to platform remove callback returning 
void
  https://git.kernel.org/powerpc/c/839cf59b5596abcdfbcdc4278a7bd4f8da32e1b2
[3/7] macintosh: windfarm_pm121: Convert to platform remove callback returning 
void
  https://git.kernel.org/powerpc/c/2e7e64c8427c2385bf47456a612d908f827f
[4/7] macintosh: windfarm_pm72: Convert to platform remove callback returning 
void
  https://git.kernel.org/powerpc/c/057894a40e973c829baacce0b9de6bdf6c8ec1da
[5/7] macintosh: windfarm_pm81: Convert to platform remove callback returning 
void
  https://git.kernel.org/powerpc/c/fb0217d79d77f1092929bae1137ac0f586c29fec
[6/7] macintosh: windfarm_pm91: Convert to platform remove callback returning 
void
  https://git.kernel.org/powerpc/c/7cfe99872c711ffa727db85c608a0897955a2758
[7/7] macintosh: windfarm_rm31: Convert to platform remove callback returning 
void
  https://git.kernel.org/powerpc/c/4b26558415d628ad2c0d3d4ec65156a0c99eaf02

cheers

Re: [PATCH] powerpc/hv-gpci: Fix the hcall return value checks in single_gpci_request function

2024-02-20 Thread Michael Ellerman

Kajol Jain  writes:
> Running event
> hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
> on one of the systems throws the below error:
>
>  ---Logs---
>  # perf list | grep hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles
>   hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=?/ [Kernel PMU event]
>
>
>  # perf stat -v -e hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/ sleep 2
> Using CPUID 00800200
> Control descriptor is not initialized
> Warning:
> hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/ event is not supported by the kernel.
> failed to read counter hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
>
>  Performance counter stats for 'system wide':
>
>  
> hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/
>
>2.000700771 seconds time elapsed
>
> The above error is caused by an hcall failure because the required
> permission "Enable Performance Information Collection" is not set.
> With the current code, the single_gpci_request function does not check the
> error type in case the hcall fails and returns EINVAL by default. But there
> can be other reasons for hcall failures, like H_AUTHORITY/H_PARAMETER, for
> which we need to act accordingly.
> Fix this issue by adding new checks in the single_gpci_request function.
>
> Result after fix patch changes:
>
>  # perf stat -e hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/ sleep 2
> Error:
> No permission to enable hv_gpci/dispatch_timebase_by_processor_processor_time_in_timebase_cycles,phys_processor_idx=0/ event.
>
> Fixes: 220a0c609ad1 ("powerpc/perf: Add support for the hv gpci (get 
> performance counter info) interface")
> Reported-by: Akanksha J N 
> Signed-off-by: Kajol Jain 
> ---
>  arch/powerpc/perf/hv-gpci.c | 29 +
>  1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
> index 27f18119fda1..101060facd81 100644
> --- a/arch/powerpc/perf/hv-gpci.c
> +++ b/arch/powerpc/perf/hv-gpci.c
> @@ -695,7 +695,17 @@ static unsigned long single_gpci_request(u32 req, u32 
> starting_index,
>  
>   ret = plpar_hcall_norets(H_GET_PERF_COUNTER_INFO,
>   virt_to_phys(arg), HGPCI_REQ_BUFFER_SIZE);
> - if (ret) {
> +
> + /*
> +  * ret value as 'H_PARAMETER' corresponds to 'GEN_BUF_TOO_SMALL',

Don't we expect H_PARAMETER if any parameter value is incorrect?

> +  * which means that the current buffer size cannot accommodate
> +  * all the information and a partial buffer returned.

I don't see how we can infer that H_PARAMETER means the buffer is too
small and accessing the first entry is OK?

cheers

> +  * Since in this function we are only accessing data for a given 
> starting index,
> +  * we don't need to accommodate whole data and can get required count by
> +  * accessing very first entry.
> +  * Hence hcall fails only incase the ret value is other than H_SUCCESS 
> or H_PARAMETER.
> +  */
> + if (ret && (ret != H_PARAMETER)) {
>   pr_devel("hcall failed: 0x%lx\n", ret);
>   goto out;
>   }
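
The review question above is whether distinct failure codes should be handled distinctly. A minimal sketch of per-code handling follows, under loud assumptions: the enum values are placeholders for illustration only (the real hcall constants are not reproduced here), toy_hcall_to_errno() is a hypothetical helper, and the errno choices are one plausible mapping rather than what the patch does.

```c
#include <errno.h>

/* Placeholder codes for illustration only; not the real hcall values. */
enum toy_hcall_ret {
	TOY_H_SUCCESS = 0,
	TOY_H_PARAMETER = 1,
	TOY_H_AUTHORITY = 2,
	TOY_H_OTHER = 3,
};

/* Map a hypervisor-style return code to a negative errno. */
static int toy_hcall_to_errno(enum toy_hcall_ret ret)
{
	switch (ret) {
	case TOY_H_SUCCESS:
		return 0;
	case TOY_H_AUTHORITY:
		return -EPERM;	/* caller lacks permission */
	case TOY_H_PARAMETER:
		return -EINVAL;	/* some argument was rejected */
	default:
		return -EIO;	/* anything else: generic failure */
	}
}
```

Here a parameter error stays a plain error instead of being read as "buffer too small", which is the more conservative reading of the concern raised in the review.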

Re: [PATCH v3] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely initialize pci_controller structure

2024-02-20 Thread Michael Ellerman

On Thu, 15 Feb 2024 16:18:33 -0600, Gaurav Batra wrote:
> When a PCI device is Dynamically added, LPAR OOPS with NULL pointer
> exception.
> 
> Complete stack is as below
> 
> [  211.239206] BUG: Kernel NULL pointer dereference on read at 0x0030
> [  211.239210] Faulting instruction address: 0xc06bbe5c
> [  211.239214] Oops: Kernel access of bad area, sig: 11 [#1]
> [  211.239218] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/pseries/iommu: DLPAR ADD of pci device doesn't completely 
initialize pci_controller structure
  https://git.kernel.org/powerpc/c/a5c57fd2e9bd1c8ea8613a8f94fd0be5eccbf321

cheers

Re: [PATCH v4] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-20 Thread Michael Ellerman

On Wed, 07 Feb 2024 11:15:26 +0530, Amit Machhiwal wrote:
> Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
> below error as L1 qemu sends PVR value 'arch_compat' == 0 via
> ppc_set_compat ioctl. This triggers a condition failure in
> kvmppc_set_arch_compat() resulting in an EINVAL.
> 
> qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
> argument
> 
> [...]

Applied to powerpc/fixes.

[1/1] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 
'arch_compat'
  https://git.kernel.org/powerpc/c/20c8c4dafe93e82441583e93bd68c0d256d7bed4

cheers

Re: [PATCH RFC] powerpc/pseries: exploit H_PAGE_SET_UNUSED for partition migration

2024-02-20 Thread Michael Ellerman

Nathan Lynch via B4 Relay 
writes:
> From: Nathan Lynch 
>
> Although the H_PAGE_INIT hcall's H_PAGE_SET_UNUSED historically has
> been tied to the cooperative memory overcommit (CMO) platform feature,
> the flag also is treated by the PowerVM hypervisor as a hint that the
> page contents need not be copied to the destination during a live
> partition migration.
>
> Use the "ibm,migratable-partition" root node property to determine
> whether this partition/guest can be migrated. Mark freed pages unused
> if so (or if CMO is in use, as before).
>
> Signed-off-by: Nathan Lynch 
> ---
> Several things yet to improve here:
>
> * powerpc's arch_free_page()/HAVE_ARCH_FREE_PAGE should be decoupled
>   from CONFIG_PPC_SMLPAR.
>
> * powerpc's arch_free_page() could be made to use a static key if
>   justified.
>
> * I have not yet measured the overhead this introduces, nor have I
>   measured the benefit to a live migration.
>
> To date, I have smoke tested it by doing a live migration and
> performing a build on a kernel with the change, to ensure it doesn't
> introduce obvious memory corruption or anything. It hasn't blown up
> yet :-)
>
> This will be a possibly significant behavior change in that we will be
> flagging pages unused where we typically did not before. Until now,
> having CMO enabled was the only way to do this, and I don't think that
> feature is used all that much?

Yeah AFAIK it has to be explicitly configured and enabled via the HMC,
so doesn't get much testing or usage.

> Posting this as RFC to see if there are any major concerns.
 
My worry is that this will add overhead for everyone in normal usage, an
hcall per freed set of pages, whereas the benefit is only seen when a
migration happens.

But that does depend on how often arch_free_page() gets called in normal
usage, which I don't know offhand.

cheers

Re: [PATCH 0/4] powerpc/ps3 Add ELFv2 support

2024-02-20 Thread Michael Ellerman

Christophe Leroy  writes:
> Hi,
>
> On 19/01/2024 at 11:27, Geoff Levand wrote:
>> The following changes since commit 44a1aad2fe6c10bfe0589d8047057b10a4c18a19:
>> 
>>Merge branch 'topic/ppc-kvm' into next (2023-12-29 15:30:45 +1100)
>> 
>> are available in the Git repository at:
>> 
>>git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-linux.git 
>> for-merge-elfv2
>> 
>> for you to fetch changes up to 983836405df1b6001a2262972fb32d1aee97d6f5:
>> 
>>Revert "powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2" 
>> (2024-01-19 17:53:48 +0900)
>> 
>> 
>> Geoff Levand (1):
>>Revert "powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2"
>> 
>> Nicholas Piggin (3):
>>powerpc/ps3: Fix lv1 hcall assembly for ELFv2 calling convention
>>powerpc/ps3: lv1 hcall code use symbolic constant for LR save offset
>>powerpc/ps3: Make real stack frames for LV1 hcalls
>
> Must be something wrong in the way you sent this series.
>
> First, the three patches from Nic don't appear as being from him, 
> missing the From: at the top.
>
> Second, this introductory letter appears as a standalone patch in patchwork.
>
> The three patches from Nic are waiting in patchwork; all we need is
> your 4th patch when the time comes.

I'll sort it out.

cheers

Re: [PATCHv8 2/2] watchdog/softlockup: report the most frequent interrupts

2024-02-20 Thread Bitao Hu


Hi,

On 2024/2/20 17:35, Thomas Gleixner wrote:
> On Tue, Feb 20 2024 at 00:19, Bitao Hu wrote:
>>  arch/mips/dec/setup.c|   2 +-
>>  arch/parisc/kernel/smp.c |   2 +-
>>  arch/powerpc/kvm/book3s_hv_rm_xics.c |   2 +-
>>  include/linux/irqdesc.h  |   9 ++-
>>  include/linux/kernel_stat.h  |   4 +
>>  kernel/irq/internals.h   |   2 +-
>>  kernel/irq/irqdesc.c |  34 ++--
>>  kernel/irq/proc.c|   9 +--
>
> This really wants to be split into two patches. Interrupt infrastructure
> first and then the actual usage site in the watchdog code.

Okay, I will split it into two patches.

Re: [PATCHv8 2/2] watchdog/softlockup: report the most frequent interrupts

2024-02-20 Thread Thomas Gleixner

On Tue, Feb 20 2024 at 00:19, Bitao Hu wrote:
>  arch/mips/dec/setup.c|   2 +-
>  arch/parisc/kernel/smp.c |   2 +-
>  arch/powerpc/kvm/book3s_hv_rm_xics.c |   2 +-
>  include/linux/irqdesc.h  |   9 ++-
>  include/linux/kernel_stat.h  |   4 +
>  kernel/irq/internals.h   |   2 +-
>  kernel/irq/irqdesc.c |  34 ++--
>  kernel/irq/proc.c|   9 +--

This really wants to be split into two patches. Interrupt infrastructure
first and then the actual usage site in the watchdog code.

Thanks,

tglx

Re: Boot failure with ppc64 port on iMacs G5

2024-02-20 Thread John Paul Adrian Glaubitz

Hello,

On Tue, 2024-02-20 at 04:16 +0100, tuxayo wrote:
> I tried snapshots/2024-01-31/debian-12.0.0-ppc64-NETINST-1.iso
> 
> And was able to start booting from usb with:
> boot usb0/disk@1:,\boot\grub\powerpc.elf
> (typed in Open Firmware shell)
> (usb0 is the top port)
> 
> Grub worked, and then I tried default install (the 1st option) and it 
> started loading during like 2 minutes.
> And then it got stuck with some superposition of the messages
> smp_core99_probe
> and
> the stuff before
> DO-QUIESCE finisedBooting Linux via __start() @ 0x0209 ...

There seems to be a regression in the kernel which affects PowerPC 970 machines,
i.e. PowerMac G5 CPUs. The issue needs to be bisected and reported upstream.

If you have the time, I would really appreciate if you could test the various
snapshots and let me know which kernel is the first to not work. I expect that
the breakage occurred somewhere around kernel 6.3 or so.

CC'ing Claudia Neumann who observed this bug before and can maybe share some
additional information.

> Full message of the two iMacs G5 ↓↓↓
> https://transfert.facil.services/r/Ksfq_2VM9_#tDATBcLXzB0zkAEaiqm9gfLfsaXliVJ13rQxKUHgUmA=
> https://transfert.facil.services/r/Zs1h1jEtb2#jufjxv6+1DfHnO3TSfhmYD+teOvY46sGClHyz7SiXd4=

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Re: [PATCH v1 2/2] powerpc/debug: hook to user return notifier infrastructure

2024-02-20 Thread Christophe Leroy



On 20/02/2024 at 09:51, Christophe Leroy wrote:
> 
> 
> On 19/12/2023 at 07:33, Michael Ellerman wrote:
>> Aneesh Kumar K.V  writes:
>>> Luming Yu  writes:
>>>
>>>> Before we have powerpc to use the generic entry infrastructure,
>>>> the call to fire user return notifier is made temporarily in powerpc
>>>> entry code.
>>>>
>>>
>>> It is still not clear what will be registered as user return notifier.
>>> Can you summarize that here?
>>
>> fire_user_return_notifiers() is defined in kernel/user-return-notifier.c
>>
>> That's built when CONFIG_USER_RETURN_NOTIFIER=y.
>>
>> That is not user selectable, it's only enabled by:
>>
>> arch/x86/kvm/Kconfig:select USER_RETURN_NOTIFIER
>>
>> So it looks to me like (currently) it's always a nop and does nothing.
>>
>> Which makes me wonder what the point of wiring this feature up is :)
>> Maybe it's needed for some other feature I don't know about?
>>
>> Arguably we could just enable it because we can, and it currently does
>> nothing so it's unlikely to break anything. But that also makes it
>> impossible to test the implementation is correct, and runs the risk that
>> one day in the future when it does get enabled only then do we discover
>> it doesn't work.
> 
> Opened an "issue" for the day we need it:
> https://github.com/KSPP/linux/issues/348

Correct one is https://github.com/linuxppc/issues/issues/477

Re: [PATCH v1 2/2] powerpc/debug: hook to user return notifier infrastructure

2024-02-20 Thread Christophe Leroy



On 19/12/2023 at 07:33, Michael Ellerman wrote:
> Aneesh Kumar K.V  writes:
>> Luming Yu  writes:
>>
>>> Before we have powerpc to use the generic entry infrastructure,
>>> the call to fire user return notifier is made temporarily in powerpc
>>> entry code.
>>>
>>
>> It is still not clear what will be registered as user return notifier.
>> Can you summarize that here?
> 
> fire_user_return_notifiers() is defined in kernel/user-return-notifier.c
> 
> That's built when CONFIG_USER_RETURN_NOTIFIER=y.
> 
> That is not user selectable, it's only enabled by:
> 
> arch/x86/kvm/Kconfig:select USER_RETURN_NOTIFIER
> 
> So it looks to me like (currently) it's always a nop and does nothing.
> 
> Which makes me wonder what the point of wiring this feature up is :)
> Maybe it's needed for some other feature I don't know about?
> 
> Arguably we could just enable it because we can, and it currently does
> nothing so it's unlikely to break anything. But that also makes it
> impossible to test the implementation is correct, and runs the risk that
> one day in the future when it does get enabled only then do we discover
> it doesn't work.

Opened an "issue" for the day we need it: 
https://github.com/KSPP/linux/issues/348
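
To illustrate why the hook is currently a nop on powerpc, here is a minimal userspace sketch of the compiled-out pattern described above. It is only an approximation under stated assumptions: CONFIG_USER_RETURN_NOTIFIER and fire_user_return_notifiers() are the names used in the thread, while struct urn_sketch, urn_register(), count_call() and fire_once_and_count() are hypothetical helpers, and a single slot stands in for the kernel's per-CPU notifier list.

```c
/*
 * Userspace sketch, not kernel code.  Only CONFIG_USER_RETURN_NOTIFIER and
 * the name fire_user_return_notifiers() come from the thread; every other
 * identifier here is hypothetical.
 */
#include <stddef.h>

struct urn_sketch {
	void (*on_user_return)(struct urn_sketch *urn);
};

#ifdef CONFIG_USER_RETURN_NOTIFIER
/* Option enabled: a registered callback really runs. */
static struct urn_sketch *urn_slot;

static inline void urn_register(struct urn_sketch *urn)
{
	urn_slot = urn;
}

static inline void fire_user_return_notifiers(void)
{
	if (urn_slot)
		urn_slot->on_user_return(urn_slot);
}
#else
/* Option disabled: both calls compile to empty inline stubs. */
static inline void urn_register(struct urn_sketch *urn) { (void)urn; }
static inline void fire_user_return_notifiers(void) { }
#endif

static int urn_calls;

static void count_call(struct urn_sketch *urn)
{
	(void)urn;
	urn_calls++;
}

/* Returns how many callbacks ran for one simulated return to user space. */
static int fire_once_and_count(void)
{
	static struct urn_sketch urn = { .on_user_return = count_call };

	urn_calls = 0;
	urn_register(&urn);
	fire_user_return_notifiers();
	return urn_calls;
}
```

Built without -DCONFIG_USER_RETURN_NOTIFIER, fire_once_and_count() returns 0, which is the "wired up but untestable" situation discussed above; built with it, the callback runs once.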

Re: [PATCH v1 1/1] hvc_console: Allow backends to set I/O buffer size

2024-02-20 Thread Christophe Leroy

Hi,

Le 15/01/2023 à 20:56, Geoff Levand a écrit :
> To allow HVC backends to set the I/O buffer sizes to values that are most
> efficient for the backend, change the macro definitions where the buffer sizes
> are set to be conditional on whether or not the macros are already defined.
> Also, rename the macros from N_OUTBUF to HVC_N_OUTBUF and from N_INBUF to
> HVC_N_INBUF.
> 
> Typical usage in the backend source file would be:
> 
>#define HVC_N_OUTBUF 32
>#define HVC_N_INBUF 32
>#include "hvc_console.h"
> 
> Signed-off-by: Geoff Levand 

Most patches in drivers/tty/hvc/ are merged by Greg through the serial 
tree, so you should send this to him.

Also, I don't think it is correct to send this as a pull request.

Christophe

> ---
> 
> Hi,
> 
> With this patch the buffer sizes are set by defining preprocessor macros
> before including the hvc_console.h header file.  Another way would be to have
> Kconfig options for the buffer sizes.  Since the optimal buffer size is so
> closely tied to the backend implementation I thought that using these
> preprocessor macros would be the better way.
> 
> -Geoff
> 
> The following changes since commit 5dc4c995db9eb45f6373a956eb1f69460e69e6d4:
> 
>Linux 6.2-rc4 (2023-01-15 09:22:43 -0600)
> 
> are available in the Git repository at:
> 
>git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-linux.git for-merge-hvc-v1
> 
> for you to fetch changes up to 8f3cd1e0589f134380f80a1f551c8232ed0bc1f2:
> 
>hvc_console: Allow backends to set I/O buffer size (2023-01-15 09:36:22 -0800)
> 
> 
>   drivers/tty/hvc/hvc_console.c | 19 +++
>   1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/tty/hvc/hvc_console.c b/drivers/tty/hvc/hvc_console.c
> index a683e21df19c..f7809d19e2cd 100644
> --- a/drivers/tty/hvc/hvc_console.c
> +++ b/drivers/tty/hvc/hvc_console.c
> @@ -42,12 +42,15 @@
>   #define HVC_CLOSE_WAIT (HZ/100) /* 1/10 of a second */
>   
>   /*
> - * These sizes are most efficient for vio, because they are the
> - * native transfer size. We could make them selectable in the
> - * future to better deal with backends that want other buffer sizes.
> + * These default sizes are most efficient for vio, because they are
> + * the native transfer size.
>*/
> -#define N_OUTBUF 16
> -#define N_INBUF  16
> +#if !defined(HVC_N_OUTBUF)
> +# define HVC_N_OUTBUF 16
> +#endif
> +#if !defined(HVC_N_INBUF)
> +# define HVC_N_INBUF 16
> +#endif
>   
>   #define __ALIGNED__ __attribute__((__aligned__(L1_CACHE_BYTES)))
>   
> @@ -151,7 +154,7 @@ static uint32_t vtermnos[MAX_NR_HVC_CONSOLES] =
>   static void hvc_console_print(struct console *co, const char *b,
> unsigned count)
>   {
> - char c[N_OUTBUF] __ALIGNED__;
> + char c[HVC_N_OUTBUF] __ALIGNED__;
>   unsigned i = 0, n = 0;
>   int r, donecr = 0, index = co->index;
>   
> @@ -633,7 +636,7 @@ static int __hvc_poll(struct hvc_struct *hp, bool may_sleep)
>   {
>   struct tty_struct *tty;
>   int i, n, count, poll_mask = 0;
> - char buf[N_INBUF] __ALIGNED__;
> + char buf[HVC_N_INBUF] __ALIGNED__;
>   unsigned long flags;
>   int read_total = 0;
>   int written_total = 0;
> @@ -674,7 +677,7 @@ static int __hvc_poll(struct hvc_struct *hp, bool may_sleep)
>   
>read_again:
>   /* Read data if any */
> - count = tty_buffer_request_room(&hp->port, N_INBUF);
> + count = tty_buffer_request_room(&hp->port, HVC_N_INBUF);
>   
>   /* If flip is full, just reschedule a later read */
>   if (count == 0) {
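
For reference, the override-before-default pattern the patch relies on can be sketched in a few lines of plain C. This is a hedged illustration, not the driver code: HVC_N_OUTBUF and HVC_N_INBUF are the macro names from the patch, while the value 32 for the output buffer and the helpers hvc_sketch_outbuf_size() and hvc_sketch_inbuf_size() are assumptions made up for the example.

```c
/*
 * Sketch of "define before the defaults are seen".  The macro names come
 * from the patch; the helper functions and the value 32 are hypothetical.
 */
#include <stddef.h>

/* A backend that wants a larger output buffer defines it first. */
#define HVC_N_OUTBUF 32

/* The shared code then supplies defaults only for what is still unset. */
#if !defined(HVC_N_OUTBUF)
# define HVC_N_OUTBUF 16
#endif
#if !defined(HVC_N_INBUF)
# define HVC_N_INBUF 16
#endif

/* Size of an output buffer declared the way hvc_console_print() does. */
static size_t hvc_sketch_outbuf_size(void)
{
	char c[HVC_N_OUTBUF];

	return sizeof(c);
}

/* Size of an input buffer declared the way __hvc_poll() does. */
static size_t hvc_sketch_inbuf_size(void)
{
	char buf[HVC_N_INBUF];

	return sizeof(buf);
}
```

Here the output buffer ends up 32 bytes and the input buffer keeps the default 16. Note that a preprocessor override like this only takes effect in a translation unit that sees both the #define and the #if !defined() defaults.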

Re: [PATCH 4/4] Revert "powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2"

2024-02-20 Thread Christophe Leroy



Le 19/01/2024 à 11:27, Geoff Levand a écrit :
> Patches provided by Nicholas Piggin enable PS3
> support for ELFv2.

The patches in question are at 
https://patchwork.ozlabs.org/project/linuxppc-dev/cover/20231227072405.63751-1-npig...@gmail.com/

> 
> Signed-off-by: Geoff Levand 
> ---
>   arch/powerpc/configs/ps3_defconfig | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/configs/ps3_defconfig b/arch/powerpc/configs/ps3_defconfig
> index aa8bb0208bcc..2b175ddf82f0 100644
> --- a/arch/powerpc/configs/ps3_defconfig
> +++ b/arch/powerpc/configs/ps3_defconfig
> @@ -24,7 +24,6 @@ CONFIG_PS3_VRAM=m
>   CONFIG_PS3_LPM=m
>   # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
>   CONFIG_KEXEC=y
> -# CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 is not set
>   CONFIG_PPC_4K_PAGES=y
>   CONFIG_SCHED_SMT=y
>   CONFIG_PM=y

Re: [PATCH 0/4] powerpc/ps3 Add ELFv2 support

2024-02-20 Thread Christophe Leroy

Hi,

Le 19/01/2024 à 11:27, Geoff Levand a écrit :
> The following changes since commit 44a1aad2fe6c10bfe0589d8047057b10a4c18a19:
> 
>Merge branch 'topic/ppc-kvm' into next (2023-12-29 15:30:45 +1100)
> 
> are available in the Git repository at:
> 
>git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-linux.git for-merge-elfv2
> 
> for you to fetch changes up to 983836405df1b6001a2262972fb32d1aee97d6f5:
> 
>Revert "powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2" (2024-01-19 17:53:48 +0900)
> 
> 
> Geoff Levand (1):
>Revert "powerpc/ps3_defconfig: Disable PPC64_BIG_ENDIAN_ELF_ABI_V2"
> 
> Nicholas Piggin (3):
>powerpc/ps3: Fix lv1 hcall assembly for ELFv2 calling convention
>powerpc/ps3: lv1 hcall code use symbolic constant for LR save offset
>powerpc/ps3: Make real stack frames for LV1 hcalls

There must be something wrong in the way you sent this series.

First, the three patches from Nic don't appear as being from him: the 
From: line is missing at the top.

Second, this introductory letter appears as a standalone patch in patchwork.

The three patches from Nic are waiting in patchwork; all we need is 
your 4th patch when the time comes.


> 
>   arch/powerpc/configs/ps3_defconfig  |   1 -
>   arch/powerpc/include/asm/ppc_asm.h  |   6 +-
>   arch/powerpc/platforms/ps3/hvcall.S | 298
>   3 files changed, 171 insertions(+), 134 deletions(-)
>
