Re: [PATCH 1/1] ASoC: fsl: fsl_ssi: Add dev_err_probe if PCM DMA init fails

2024-03-20 Thread Shengjiu Wang
On Thu, Mar 14, 2024 at 10:16 PM Alexander Stein
 wrote:
>
> This happens especially if this driver is built-in, but SDMA driver
> is configured as module.
>
> Signed-off-by: Alexander Stein 

Acked-by: Shengjiu Wang 

Best Regards
Shengjiu Wang
> ---
>  sound/soc/fsl/fsl_ssi.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
> index ab6ec1974807..4ca3a16f7ac0 100644
> --- a/sound/soc/fsl/fsl_ssi.c
> +++ b/sound/soc/fsl/fsl_ssi.c
> @@ -1401,8 +1401,10 @@ static int fsl_ssi_imx_probe(struct platform_device 
> *pdev,
> goto error_pcm;
> } else {
> ret = imx_pcm_dma_init(pdev);
> -   if (ret)
> +   if (ret) {
> +   dev_err_probe(dev, ret, "Failed to init PCM DMA\n");
> goto error_pcm;
> +   }
> }
>
> return 0;
> --
> 2.34.1
>


Re: [PATCH v3 0/5] ASoC: fsl: Support register and unregister rpmsg sound card through remoteproc

2024-03-20 Thread Shengjiu Wang
On Mon, Mar 11, 2024 at 7:14 PM Chancel Liu  wrote:
>
> echo /lib/firmware/fw.elf > /sys/class/remoteproc/remoteproc0/firmware
> (A) echo start > /sys/class/remoteproc/remoteproc0/state
> (B) echo stop > /sys/class/remoteproc/remoteproc0/state
>
> The rpmsg sound card is registered in (A) and unregistered in (B).
> After "start", imx-audio-rpmsg registers devices for ASoC platform driver
> and machine driver. Then sound card is registered. After "stop",
> imx-audio-rpmsg unregisters devices for ASoC platform driver and machine
> driver. Then sound card is unregistered.

Acked-by: Shengjiu Wang 

Best regards
Shengjiu Wang

>
> changes in v2
> - Fix build errors reported by kernel test robot
>
> changes in v3
> - Add a new patch for fsl_rpmsg to register CPU DAI with rpmsg channel
>   name
> - Update imx-rpmsg.c to get DT node of ASoC CPU DAI device with rpmsg
>   channel name instead of using undocumented bindings
>
> Chancel Liu (5):
>   ASoC: fsl: imx-pcm-rpmsg: Register component with rpmsg channel name
>   ASoC: fsl: imx-audio-rpmsg: Register device with rpmsg channel name
>   ASoC: fsl: Let imx-audio-rpmsg register platform device for card
>   ASoC: fsl: fsl_rpmsg: Register CPU DAI with name of rpmsg channel
>   ASoC: fsl: imx-rpmsg: Update to correct DT node
>
>  sound/soc/fsl/fsl_rpmsg.c   | 43 -
>  sound/soc/fsl/imx-audio-rpmsg.c | 21 +---
>  sound/soc/fsl/imx-pcm-rpmsg.c   | 11 ++---
>  sound/soc/fsl/imx-rpmsg.c   | 28 ++---
>  4 files changed, 71 insertions(+), 32 deletions(-)
>
> --
> 2.43.0
>


Re: [PATCH] powerpc: ps3: mark ps3_notification_device static for stack usage

2024-03-20 Thread Geoff Levand
On 3/21/24 03:03, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The device is way too large to be on the stack, causing a warning for
> an allmodconfig build with clang:
> 
> arch/powerpc/platforms/ps3/device-init.c:771:12: error: stack frame size 
> (2064) exceeds limit (2048) in 'ps3_probe_thread' 
> [-Werror,-Wframe-larger-than]
>   771 | static int ps3_probe_thread(void *data)
> 
> There is only a single thread that ever accesses this, so it's fine to
> make it static, which avoids the warning.
> 
> Fixes: b4cb2941f855 ("[POWERPC] PS3: Use the HVs storage device notification 
> mechanism properly")
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/powerpc/platforms/ps3/device-init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/ps3/device-init.c 
> b/arch/powerpc/platforms/ps3/device-init.c
> index 878bc160246e..ce99f06698a9 100644
> --- a/arch/powerpc/platforms/ps3/device-init.c
> +++ b/arch/powerpc/platforms/ps3/device-init.c
> @@ -770,7 +770,7 @@ static struct task_struct *probe_task;
>  
>  static int ps3_probe_thread(void *data)
>  {
> - struct ps3_notification_device dev;
> + static struct ps3_notification_device dev;
>   int res;
>   unsigned int irq;
>   u64 lpar;

Seems fine.

Acked-by: Geoff Levand 



Re: [PATCH] vdso: use CONFIG_PAGE_SHIFT in vdso/datapage.h

2024-03-20 Thread Kees Cook
On Wed, Mar 20, 2024 at 07:02:15PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> Both the vdso rework and the CONFIG_PAGE_SHIFT changes were merged during
> the v6.9 merge window, so it is now possible to use CONFIG_PAGE_SHIFT
> instead of including asm/page.h in the vdso.
> 
> This avoids the workaround for arm64 and addresses a build warning
> for powerpc64:
> 
> In file included from :4:
> In file included from /home/arnd/arm-soc/arm-soc/lib/vdso/gettimeofday.c:5:
> In file included from ../include/vdso/datapage.h:25:
> arch/powerpc/include/asm/page.h:230:9: error: result of comparison of 
> constant 13835058055282163712 with expression of type 'unsigned long' is 
> always true [-Werror,-Wtautological-constant-out-of-range-compare]
>   230 | return __pa(kaddr) >> PAGE_SHIFT;
>   |^~~
> arch/powerpc/include/asm/page.h:217:37: note: expanded from macro '__pa'
>   217 | VIRTUAL_WARN_ON((unsigned long)(x) < PAGE_OFFSET);
>   \
>   | ~~~^~
> arch/powerpc/include/asm/page.h:202:73: note: expanded from macro 
> 'VIRTUAL_WARN_ON'
>   202 | #define VIRTUAL_WARN_ON(x)  
> WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && (x))
>   | 
> ~^~~
> arch/powerpc/include/asm/bug.h:88:25: note: expanded from macro 'WARN_ON'
>88 | int __ret_warn_on = !!(x);  \
>   |^
> 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Andy Lutomirski 
> Cc: Thomas Gleixner 
> Cc: Vincenzo Frascino 
> Cc: Anna-Maria Behnsen 
> See-also: 8b3843ae3634 ("vdso/datapage: Quick fix - use asm/page-def.h for 
> ARM64")
> Signed-off-by: Arnd Bergmann 

Thanks for tracking this!

Reviewed-by: Kees Cook 

-- 
Kees Cook


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Peter Xu
On Wed, Mar 20, 2024 at 05:40:39PM +, Christophe Leroy wrote:
> 
> 
> Le 20/03/2024 à 17:09, Peter Xu a écrit :
> > On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
> >> At the first place that was to get a close fit between hardware
> >> pagetable topology and linux pagetable topology. But obviously we
> >> already stepped back for 512k pages, so let's go one more step aside and
> >> do similar with 8M pages.
> >>
> >> I'll give it a try and see how it goes.
> > 
> > So you're talking about 8M only for 8xx, am I right?
> 
> Yes I am.
> 
> > 
> > There seem to be other PowerPC systems use hugepd.  Is it possible that we
> > convert all hugepd into cont_pte form?
> 
> Indeed.
> 
> Seems like we have hugepd for book3s/64 and for nohash.
> 
> For book3s I don't know, may Aneesh can answer.
> 
> For nohash I think it should be possible because TLB misses are handled 
> by software. Even the e6500 which has a hardware tablewalk falls back on 
> software walk when it is a hugepage IIUC.

It'll be great if I can get some answer here, and then I know the path for
hugepd in general.  I don't want to add any new code into core mm to
something destined to fade away soon.

One option for me is I can check a macro of hugepd existance, so all new
code will only work when hugepd is not supported on such arch.  However
that'll start to make some PowerPC systems special (which I still tried
hard to avoid, if that wasn't proved in the past..), meanwhile we'll also
need to keep some generic-mm paths (that I can already remove along with
the new code) only for these hugepd systems.  But it's still okay to me,
it'll be just a matter of when to drop those codes, sooner or later.

Thanks,

-- 
Peter Xu



[PATCH] powerpc: ps3: mark ps3_notification_device static for stack usage

2024-03-20 Thread Arnd Bergmann
From: Arnd Bergmann 

The device is way too large to be on the stack, causing a warning for
an allmodconfig build with clang:

arch/powerpc/platforms/ps3/device-init.c:771:12: error: stack frame size (2064) 
exceeds limit (2048) in 'ps3_probe_thread' [-Werror,-Wframe-larger-than]
  771 | static int ps3_probe_thread(void *data)

There is only a single thread that ever accesses this, so it's fine to
make it static, which avoids the warning.

Fixes: b4cb2941f855 ("[POWERPC] PS3: Use the HVs storage device notification 
mechanism properly")
Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/platforms/ps3/device-init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/ps3/device-init.c 
b/arch/powerpc/platforms/ps3/device-init.c
index 878bc160246e..ce99f06698a9 100644
--- a/arch/powerpc/platforms/ps3/device-init.c
+++ b/arch/powerpc/platforms/ps3/device-init.c
@@ -770,7 +770,7 @@ static struct task_struct *probe_task;
 
 static int ps3_probe_thread(void *data)
 {
-   struct ps3_notification_device dev;
+   static struct ps3_notification_device dev;
int res;
unsigned int irq;
u64 lpar;
-- 
2.39.2



[PATCH] vdso: use CONFIG_PAGE_SHIFT in vdso/datapage.h

2024-03-20 Thread Arnd Bergmann
From: Arnd Bergmann 

Both the vdso rework and the CONFIG_PAGE_SHIFT changes were merged during
the v6.9 merge window, so it is now possible to use CONFIG_PAGE_SHIFT
instead of including asm/page.h in the vdso.

This avoids the workaround for arm64 and addresses a build warning
for powerpc64:

In file included from :4:
In file included from /home/arnd/arm-soc/arm-soc/lib/vdso/gettimeofday.c:5:
In file included from ../include/vdso/datapage.h:25:
arch/powerpc/include/asm/page.h:230:9: error: result of comparison of constant 
13835058055282163712 with expression of type 'unsigned long' is always true 
[-Werror,-Wtautological-constant-out-of-range-compare]
  230 | return __pa(kaddr) >> PAGE_SHIFT;
  |^~~
arch/powerpc/include/asm/page.h:217:37: note: expanded from macro '__pa'
  217 | VIRTUAL_WARN_ON((unsigned long)(x) < PAGE_OFFSET);  
\
  | ~~~^~
arch/powerpc/include/asm/page.h:202:73: note: expanded from macro 
'VIRTUAL_WARN_ON'
  202 | #define VIRTUAL_WARN_ON(x)  
WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && (x))
  | 
~^~~
arch/powerpc/include/asm/bug.h:88:25: note: expanded from macro 'WARN_ON'
   88 | int __ret_warn_on = !!(x);  \
  |^

Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andy Lutomirski 
Cc: Thomas Gleixner 
Cc: Vincenzo Frascino 
Cc: Anna-Maria Behnsen 
See-also: 8b3843ae3634 ("vdso/datapage: Quick fix - use asm/page-def.h for 
ARM64")
Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/include/asm/vdso/gettimeofday.h | 3 +--
 include/vdso/datapage.h  | 8 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
b/arch/powerpc/include/asm/vdso/gettimeofday.h
index f0a4cf01e85c..78302f6c2580 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -4,7 +4,6 @@
 
 #ifndef __ASSEMBLY__
 
-#include 
 #include 
 #include 
 #include 
@@ -95,7 +94,7 @@ const struct vdso_data *__arch_get_vdso_data(void);
 static __always_inline
 const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
-   return (void *)vd + PAGE_SIZE;
+   return (void *)vd + (1U << CONFIG_PAGE_SHIFT);
 }
 #endif
 
diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 5d5c0b8efff2..c71ddb6d4691 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,12 +19,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_ARM64
-#include 
-#else
-#include 
-#endif
-
 #ifdef CONFIG_ARCH_HAS_VDSO_DATA
 #include 
 #else
@@ -132,7 +126,7 @@ extern struct vdso_data _timens_data[CS_BASES] 
__attribute__((visibility("hidden
  */
 union vdso_data_store {
struct vdso_datadata[CS_BASES];
-   u8  page[PAGE_SIZE];
+   u8  page[1U << CONFIG_PAGE_SHIFT];
 };
 
 /*
-- 
2.39.2



Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Christophe Leroy


Le 20/03/2024 à 17:09, Peter Xu a écrit :
> On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
>> At the first place that was to get a close fit between hardware
>> pagetable topology and linux pagetable topology. But obviously we
>> already stepped back for 512k pages, so let's go one more step aside and
>> do similar with 8M pages.
>>
>> I'll give it a try and see how it goes.
> 
> So you're talking about 8M only for 8xx, am I right?

Yes I am.

> 
> There seem to be other PowerPC systems use hugepd.  Is it possible that we
> convert all hugepd into cont_pte form?

Indeed.

Seems like we have hugepd for book3s/64 and for nohash.

For book3s I don't know, may Aneesh can answer.

For nohash I think it should be possible because TLB misses are handled 
by software. Even the e6500 which has a hardware tablewalk falls back on 
software walk when it is a hugepage IIUC.

Christophe


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Peter Xu
On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
> At the first place that was to get a close fit between hardware 
> pagetable topology and linux pagetable topology. But obviously we 
> already stepped back for 512k pages, so let's go one more step aside and 
> do similar with 8M pages.
> 
> I'll give it a try and see how it goes.

So you're talking about 8M only for 8xx, am I right?

There seem to be other PowerPC systems use hugepd.  Is it possible that we
convert all hugepd into cont_pte form?

Thanks,

-- 
Peter Xu



Re: Cannot load wireguard module

2024-03-20 Thread Michal Suchánek
On Wed, Mar 20, 2024 at 11:41:32PM +1100, Michael Ellerman wrote:
> Michal Suchánek  writes:
> > On Mon, Mar 18, 2024 at 06:08:55PM +0100, Michal Suchánek wrote:
> >> On Mon, Mar 18, 2024 at 10:50:49PM +1100, Michael Ellerman wrote:
> >> > Michael Ellerman  writes:
> >> > > Michal Suchánek  writes:
> >> > >> Hello,
> >> > >>
> >> > >> I cannot load the wireguard module.
> >> > >>
> >> > >> Loading the module provides no diagnostic other than 'No such device'.
> >> > >>
> >> > >> Please provide maningful diagnostics for loading software-only driver,
> >> > >> clearly there is no particular device needed.
> >> > >
> >> > > Presumably it's just bubbling up an -ENODEV from somewhere.
> >> > >
> >> > > Can you get a trace of it?
> >> > >
> >> > > Something like:
> >> > >
> >> > >   # trace-cmd record -p function_graph -F modprobe wireguard
> >
> > Attached.
> 
> Sorry :/, you need to also trace children of modprobe, with -c.
> 
> But, I was able to reproduce the same issue here.
> 
> On a P9, a kernel with CONFIG_CRYPTO_CHACHA20_P10=n everything works:
> 
>   $ modprobe -v wireguard
>   insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko
>   insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko
>   insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko
>   insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha20poly1305.ko
>   insmod /lib/modules/6.8.0/kernel/drivers/net/wireguard/wireguard.ko
>   [   19.180564][  T692] wireguard: allowedips self-tests: pass
>   [   19.185080][  T692] wireguard: nonce counter self-tests: pass
>   [   19.310438][  T692] wireguard: ratelimiter self-tests: pass
>   [   19.310639][  T692] wireguard: WireGuard 1.0.0 loaded. See 
> www.wireguard.com for information.
>   [   19.310746][  T692] wireguard: Copyright (C) 2015-2019 Jason A. 
> Donenfeld . All Rights Reserved.
> 
> 
> If I build CONFIG_CRYPTO_CHACHA20_P10 as a module then it breaks:
> 
>   $ modprobe -v wireguard
>   insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko
>   insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko
>   insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko
>   insmod /lib/modules/6.8.0/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
>   modprobe: ERROR: could not insert 'wireguard': No such device
> 
> 
> The ENODEV is coming from module_cpu_feature_match(), which blocks the
> driver from loading on non-p10.
> 
> Looking at other arches (arm64 at least) it seems like the driver should
> instead be loading but disabling the p10 path. Which then allows
> chacha_crypt_arch() to exist, and it has a fallback to use
> chacha_crypt_generic().
> 
> I don't see how module_cpu_feature_match() can co-exist with the driver
> also providing a fallback. Hopefully someone who knows crypto better
> than me can explain it.

Maybe it doesn't. ppc64le is the only platform that needs the fallback,
on other platforms that have hardware-specific chacha implementation it
seems to be using pretty common feature so the fallback is rarely if
ever needed in practice.

Thanks

Michal

> 
> This diff fixes it for me:
> 
> diff --git a/arch/powerpc/crypto/chacha-p10-glue.c 
> b/arch/powerpc/crypto/chacha-p10-glue.c
> index 74fb86b0d209..9d2c30b0904c 100644
> --- a/arch/powerpc/crypto/chacha-p10-glue.c
> +++ b/arch/powerpc/crypto/chacha-p10-glue.c
> @@ -197,6 +197,9 @@ static struct skcipher_alg algs[] = {
>  
>  static int __init chacha_p10_init(void)
>  {
> + if (!cpu_has_feature(PPC_FEATURE2_ARCH_3_1))
> + return 0;
> +
>   static_branch_enable(_p10);
>  
>   return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
> @@ -207,7 +210,7 @@ static void __exit chacha_p10_exit(void)
>   crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
>  }
>  
> -module_cpu_feature_match(PPC_MODULE_FEATURE_P10, chacha_p10_init);
> +module_init(chacha_p10_init);
>  module_exit(chacha_p10_exit);
>  
>  MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (P10 accelerated)");
> 
> 
> Giving me:
> 
>   $ modprobe -v wireguard
>   insmod /lib/modules/6.8.0-dirty/kernel/net/ipv4/udp_tunnel.ko
>   insmod /lib/modules/6.8.0-dirty/kernel/net/ipv6/ip6_udp_tunnel.ko
>   insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha.ko
>   insmod 
> /lib/modules/6.8.0-dirty/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
>   insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha20poly1305.ko
>   insmod /lib/modules/6.8.0-dirty/kernel/drivers/net/wireguard/wireguard.ko
>   [   19.657941][  T718] wireguard: allowedips self-tests: pass
>   [   19.662501][  T718] wireguard: nonce counter self-tests: pass
>   [   19.782933][  T718] wireguard: ratelimiter self-tests: pass
>   [   19.783114][  T718] wireguard: WireGuard 1.0.0 loaded. See 
> www.wireguard.com for information.
>   [   19.783223][  T718] wireguard: Copyright (C) 2015-2019 Jason A. 
> Donenfeld . All Rights Reserved.
>   
> 
> cheers


Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables

2024-03-20 Thread Shivaprasad G Bhat

Hi Jason,

On 3/19/24 20:02, Jason Gunthorpe wrote:

On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:

The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
it_userspace") which implemented the tce indirect levels
support for PowerNV ended up removing the single level support
which existed by default(generic tce_iommu_userspace_view_alloc/free()
calls). On pSeries the TCEs are single level, and the allocation
of userspace view is lost with the removal of generic code.

:( :(

If this has been broken since 2018 and nobody cared till now can we
please go in a direction of moving this code to the new iommu APIs
instead of doubling down on more of this old stuff that apparently
almost nobody cares about ??


We have existing software stack deployments using VFIO userspace
device assignment running on Power platform. We have to enable
similar software stack on newer generation Power10 platform and
also in a pSeries lpar environment. These distros rely on VFIO enabled
in kernel and currently have IOMMUFD disabled. This patch series is
a simpler low risk enablement that functionally get the software stack
working while we continue to enable and move to IOMMUFD in phases.
We have to fix the older APIs in order to stage the functional enablement
in small increments.

We are working on iommufd support for pSeries and looking forward
to Timothy's patches.


-Thanks

Shivaprasad


Jason


Re: [PATCH] arch/powerpc/kvm: Add support for reading VPA counters for pseries guests

2024-03-20 Thread Nicholas Piggin
On Wed Mar 20, 2024 at 12:28 AM AEST, Gautam Menghani wrote:
> PAPR hypervisor has introduced three new counters in the VPA area of
> LPAR CPUs for KVM L2 guest (see [1] for terminology) observability - 2
> for context switches from host to guest and vice versa, and 1 counter
> for getting the total time spent inside the KVM guest. Add a tracepoint
> that enables reading the counters for use by ftrace/perf. Note that this
> tracepoint is only available for nestedv2 API (i.e, KVM on PowerVM).
>
> [1] Terminology:
> a. L1 refers to the VM (LPAR) booted on top of PAPR hypervisor
> b. L2 refers to the KVM guest booted on top of L1.
>
> Signed-off-by: Vaibhav Jain 
> Signed-off-by: Gautam Menghani 
> ---
>  arch/powerpc/include/asm/kvm_host.h |  5 +
>  arch/powerpc/include/asm/lppaca.h   | 11 ---
>  arch/powerpc/kvm/book3s_hv.c| 20 
>  arch/powerpc/kvm/trace_hv.h | 24 
>  4 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 8abac5321..26d7bb4b9 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -847,6 +847,11 @@ struct kvm_vcpu_arch {
>   gpa_t nested_io_gpr;
>   /* For nested APIv2 guests*/
>   struct kvmhv_nestedv2_io nestedv2_io;
> +
> + /* For VPA counters having context switch and guest run time info (in 
> ns) */
> + u64 l1_to_l2_cs;
> + u64 l2_to_l1_cs;
> + u64 l2_runtime;
>  #endif
>  
>  #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING

These aren't required here if it's just used for tracing over
a single run vcpu call are they?

> diff --git a/arch/powerpc/include/asm/lppaca.h 
> b/arch/powerpc/include/asm/lppaca.h
> index 61ec2447d..bda6b86b9 100644
> --- a/arch/powerpc/include/asm/lppaca.h
> +++ b/arch/powerpc/include/asm/lppaca.h
> @@ -62,7 +62,8 @@ struct lppaca {
>   u8  donate_dedicated_cpu;   /* Donate dedicated CPU cycles */
>   u8  fpregs_in_use;
>   u8  pmcregs_in_use;
> - u8  reserved8[28];
> + u8  l2_accumul_cntrs_enable;  /* Enable usage of counters for KVM 
> guest */
> + u8  reserved8[27];
>   __be64  wait_state_cycles;  /* Wait cycles for this proc */
>   u8  reserved9[28];
>   __be16  slb_count;  /* # of SLBs to maintain */
> @@ -92,9 +93,13 @@ struct lppaca {
>   /* cacheline 4-5 */
>  
>   __be32  page_ins;   /* CMO Hint - # page ins by OS */
> - u8  reserved12[148];
> + u8  reserved12[28];
> + volatile __be64 l1_to_l2_cs_tb;
> + volatile __be64 l2_to_l1_cs_tb;
> + volatile __be64 l2_runtime_tb;
> + u8 reserved13[96];
>   volatile __be64 dtl_idx;/* Dispatch Trace Log head index */
> - u8  reserved13[96];
> + u8  reserved14[96];
>  } cacheline_aligned;
>  
>  #define lppaca_of(cpu)   (*paca_ptrs[cpu]->lppaca_ptr)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2b04eba90..b94461b5f 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4092,6 +4092,7 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>   unsigned long msr, i;
>   int trap;
>   long rc;
> + struct lppaca *lp = get_lppaca();

Does get_lppaca() emit some inline asm that can't be optimised?
Could move it under the unlikely branches if so.

>  
>   io = >arch.nestedv2_io;
>  

KVM L0 could in theory provide this for v1 L1s too, so could this
be done at a higher level to cover both?

> @@ -4107,6 +4108,17 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>   kvmppc_gse_put_u64(io->vcpu_run_input, KVMPPC_GSID_LPCR, lpcr);
>  
>   accumulate_time(vcpu, >arch.in_guest);
> +
> + /* Reset the guest host context switch timing */
> + if (unlikely(trace_kvmppc_vcpu_exit_cs_time_enabled())) {
> + lp->l2_accumul_cntrs_enable = 1;
> + lp->l1_to_l2_cs_tb = 0;
> + lp->l2_to_l1_cs_tb = 0;
> + lp->l2_runtime_tb = 0;
> + } else {
> + lp->l2_accumul_cntrs_enable = 0;
> + }

Instead of zeroing here zero after the exit, which avoids the
else branch and possibly avoids an obscure race with the counters.
What if trace_kvmppc_vcpu_exit_cs_time_enabled() is false here...

> +
>   rc = plpar_guest_run_vcpu(0, vcpu->kvm->arch.lpid, vcpu->vcpu_id,
> , );
>  
> @@ -4133,6 +4145,14 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>  
>   timer_rearm_host_dec(*tb);
>  
> + /* Record context switch and guest_run_time data */
> + if (unlikely(trace_kvmppc_vcpu_exit_cs_time_enabled())) {
> + vcpu->arch.l1_to_l2_cs = 
> tb_to_ns(be64_to_cpu(lp->l1_to_l2_cs_tb));
> + vcpu->arch.l2_to_l1_cs = 
> 

Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-20 Thread Vincent Guittot
On Wed, 20 Mar 2024 at 08:04, Tobias Huschle  wrote:
>
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> > On Tue, 19 Mar 2024 at 10:08, Tobias Huschle  wrote:
> > >
...
> > >
> > > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > > eligibility.
> > >
> > > I'm inclined to rather propose propose a documentation change, which
> > > describes that tasks should not rely on woken up tasks being scheduled
> > > immediately.
> >
> > Where do you see such an assumption ? Even before eevdf, there were
> > nothing that ensures such behavior. When using CFS (legacy or eevdf)
> > tasks, you can't know if the newly wakeup task will run 1st or not
> >
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario

It was rather the opposite, It was ensuring that long sleeping tasks
will not get too much bonus because of vruntime too far in the past.
This is similar although not exactly the same intent as the lag. The
bonus was up to 24ms previously whereas it's not more than a slice now

> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.
>
> Dismissing the lag on wakeup also does obviously not guarantee getting
> scheduled, as other tasks might still be involved.
>
> The question would be if it should be explicitly mentioned somewhere that,
> at this point, woken up tasks are not getting any special treatment and
> noone should rely on that boost for woken up tasks.
>
> > >
> > > Changing things in the code to address for the specific scenario I'm
> > > seeing seems to mostly create unwanted side effects and/or would require
> > > the definition of some magic cut-off values.
> > >
> > >


Re: [PATCH] KVM: PPC: Book3S HV nestedv2: Cancel pending HDEC exception

2024-03-20 Thread Nicholas Piggin
On Wed Mar 13, 2024 at 5:26 PM AEST, Vaibhav Jain wrote:
> This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC:
> Book3S HV nestedv2: Do not cancel pending decrementer exception") which
> prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It
> was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC
> expiry TB' register which was higher compared to handling extra decrementer
> exceptions.
>
> This overhead of reading 'HDEC expiry TB' register has been mitigated
> recently by the L0 hypervisor(PowerVM) by putting the value of this
> register in L2 guest-state output buffer on trap to L1. From there the
> value of this register is cached, made available in kvmhv_run_single_vcpu()
> to compare it against host(L1) timebase and cancel the pending hypervisor
> decrementer exception if needed.

Ah, I figured out the problem here. Guest entry never clears the
queued dec, because it's level triggered on the DEC MSB so it
doesn't go away when it's delivered. So upstream code is indeed
buggy and I think I take the blame for suggesting this nestedv2
workaround.

I actually don't think that is necessary though, we could treat it
like other interrupts.  I think that would solve the problem without
having to test dec here.

I am wondering though, what workload slows down that this patch
was needed in the first place. We'd only get here after a cede
returns, then we'd dequeue the dec and stop having to GET_STATE
it here.

Thanks,
Nick

>
> Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending 
> decrementer exception")
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 0b921704da45..e47b954ce266 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4856,7 +4856,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
> time_limit,
>* entering a nested guest in which case the decrementer is now owned
>* by L2 and the L1 decrementer is provided in hdec_expires
>*/
> - if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
> + if (kvmppc_core_pending_dec(vcpu) &&
>   ((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
>(trap == BOOK3S_INTERRUPT_SYSCALL &&
> kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))



Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

2024-03-20 Thread Baoquan He
On 03/20/24 at 11:36am, Mike Rapoport wrote:
> On Wed, Mar 20, 2024 at 03:52:52PM +0800, Baoquan He wrote:
> > On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> > > Hi Baoquan,
> > > 
> > > On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > > > This is not needed any more.
> > > 
> > > I'd swap this and the first patch, so that the first patch would remove
> > > memblock_find_dma_reserve() and it's changelog will explain why it's not
> > > needed and then the second patch will simply drop unused set_dma_reserve()
> > 
> > Thanks, Mike.
> > 
> > My thought on the patch 1/2 splitting is:
> > patch 1 is removing all relevant codes in mm, including the usage of
> > dma_reserve in free_area_init_core() and exporting set_dma_reserve()
> > to any ARCH which want to subtract the dma_reserve from DMA zone.
> >
> > Patch 2 purely remove the code in x86 ARCH about how to get dma_reserve.
>  
> I think it's better first to remove the usage of set_dma_reserve() in x86
> and then clean up the unused code.

OK, firslty remove the only user, that sounds reasonable. Will change.
Thanks.



Re: Cannot load wireguard module

2024-03-20 Thread Michael Ellerman
Michal Suchánek  writes:
> On Mon, Mar 18, 2024 at 06:08:55PM +0100, Michal Suchánek wrote:
>> On Mon, Mar 18, 2024 at 10:50:49PM +1100, Michael Ellerman wrote:
>> > Michael Ellerman  writes:
>> > > Michal Suchánek  writes:
>> > >> Hello,
>> > >>
>> > >> I cannot load the wireguard module.
>> > >>
>> > >> Loading the module provides no diagnostic other than 'No such device'.
>> > >>
>> > >> Please provide maningful diagnostics for loading software-only driver,
>> > >> clearly there is no particular device needed.
>> > >
>> > > Presumably it's just bubbling up an -ENODEV from somewhere.
>> > >
>> > > Can you get a trace of it?
>> > >
>> > > Something like:
>> > >
>> > >   # trace-cmd record -p function_graph -F modprobe wireguard
>
> Attached.

Sorry :/, you need to also trace children of modprobe, with -c.

But, I was able to reproduce the same issue here.

On a P9, a kernel with CONFIG_CRYPTO_CHACHA20_P10=n everything works:

  $ modprobe -v wireguard
  insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko
  insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko
  insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko
  insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha20poly1305.ko
  insmod /lib/modules/6.8.0/kernel/drivers/net/wireguard/wireguard.ko
  [   19.180564][  T692] wireguard: allowedips self-tests: pass
  [   19.185080][  T692] wireguard: nonce counter self-tests: pass
  [   19.310438][  T692] wireguard: ratelimiter self-tests: pass
  [   19.310639][  T692] wireguard: WireGuard 1.0.0 loaded. See 
www.wireguard.com for information.
  [   19.310746][  T692] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld 
. All Rights Reserved.


If I build CONFIG_CRYPTO_CHACHA20_P10 as a module then it breaks:

  $ modprobe -v wireguard
  insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko
  insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko
  insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko
  insmod /lib/modules/6.8.0/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
  modprobe: ERROR: could not insert 'wireguard': No such device


The ENODEV is coming from module_cpu_feature_match(), which blocks the
driver from loading on non-p10.

Looking at other arches (arm64 at least) it seems like the driver should
instead be loading but disabling the p10 path. Which then allows
chacha_crypt_arch() to exist, and it has a fallback to use
chacha_crypt_generic().

I don't see how module_cpu_feature_match() can co-exist with the driver
also providing a fallback. Hopefully someone who knows crypto better
than me can explain it.

This diff fixes it for me:

diff --git a/arch/powerpc/crypto/chacha-p10-glue.c 
b/arch/powerpc/crypto/chacha-p10-glue.c
index 74fb86b0d209..9d2c30b0904c 100644
--- a/arch/powerpc/crypto/chacha-p10-glue.c
+++ b/arch/powerpc/crypto/chacha-p10-glue.c
@@ -197,6 +197,9 @@ static struct skcipher_alg algs[] = {
 
 static int __init chacha_p10_init(void)
 {
+   if (!cpu_has_feature(PPC_FEATURE2_ARCH_3_1))
+   return 0;
+
static_branch_enable(_p10);
 
return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
@@ -207,7 +210,7 @@ static void __exit chacha_p10_exit(void)
crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
 }
 
-module_cpu_feature_match(PPC_MODULE_FEATURE_P10, chacha_p10_init);
+module_init(chacha_p10_init);
 module_exit(chacha_p10_exit);
 
 MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (P10 accelerated)");


Giving me:

  $ modprobe -v wireguard
  insmod /lib/modules/6.8.0-dirty/kernel/net/ipv4/udp_tunnel.ko
  insmod /lib/modules/6.8.0-dirty/kernel/net/ipv6/ip6_udp_tunnel.ko
  insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha.ko
  insmod 
/lib/modules/6.8.0-dirty/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
  insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha20poly1305.ko
  insmod /lib/modules/6.8.0-dirty/kernel/drivers/net/wireguard/wireguard.ko
  [   19.657941][  T718] wireguard: allowedips self-tests: pass
  [   19.662501][  T718] wireguard: nonce counter self-tests: pass
  [   19.782933][  T718] wireguard: ratelimiter self-tests: pass
  [   19.783114][  T718] wireguard: WireGuard 1.0.0 loaded. See 
www.wireguard.com for information.
  [   19.783223][  T718] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld 
. All Rights Reserved.
  

cheers


Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

2024-03-20 Thread Mike Rapoport
On Wed, Mar 20, 2024 at 03:52:52PM +0800, Baoquan He wrote:
> On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> > Hi Baoquan,
> > 
> > On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > > This is not needed any more.
> > 
> > I'd swap this and the first patch, so that the first patch would remove
> > memblock_find_dma_reserve() and it's changelog will explain why it's not
> > needed and then the second patch will simply drop unused set_dma_reserve()
> 
> Thanks, Mike.
> 
> My thought on the patch 1/2 splitting is:
> patch 1 is removing all relevant codes in mm, including the usage of
> dma_reserve in free_area_init_core() and exporting set_dma_reserve()
> to any ARCH which want to subtract the dma_reserve from DMA zone.
>
> Patch 2 purely remove the code in x86 ARCH about how to get dma_reserve.
 
I think it's better first to remove the usage of set_dma_reserve() in x86
and then clean up the unused code.

> Your suggestion is also good to me, I can rearrange the order and
> repost.

-- 
Sincerely yours,
Mike.


Re: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

2024-03-20 Thread Baoquan He
On 03/20/24 at 04:18pm, Baoquan He wrote:
> On 03/19/24 at 06:17pm, Mike Rapoport wrote:
> > On Mon, Mar 18, 2024 at 10:21:36PM +0800, Baoquan He wrote:
> > > Currently, in free_area_init_core(), when initialize zone's field, a
> > > rough value is set to zone->managed_pages. That value is calculated by
> > > (zone->present_pages - memmap_pages).
> > > 
> > > In the meantime, add the value to nr_all_pages and nr_kernel_pages which
> > > represent all free pages of system (only low memory or including HIGHMEM
> > > memory separately). Both of them are gonna be used in
> > > alloc_large_system_hash().
> > > 
> > > However, the rough calculation and setting of zone->managed_pages is
> > > meaningless because
> > >   a) memmap pages are allocated on units of node in sparse_init() or
> > >  alloc_node_mem_map(pgdat); The simple (zone->present_pages -
> > >  memmap_pages) is too rough to make sense for zone;
> > >   b) the set zone->managed_pages will be zeroed out and reset with
> > >  acutal value in mem_init() via memblock_free_all(). Before the
> > >  resetting, no buddy allocation request is issued.
> > > 
> > > Here, remove the meaningless and complicated calculation of
> > > (zone->present_pages - memmap_pages), directly set zone->present_pages to
> > > zone->managed_pages. It will be adjusted in mem_init().
> > 
> > Do you mean "set zone->managed_pages to zone->present_pages"?
> 
> Hmm, maybe 'set zone->managed_pages as zone->present_pages'
> or 
>'assign zone->present_pages to zone->managed_pages'
> which is more precise.
> 
> Wwill update.
> 
> > 
> > I think we can just set zone->managed_pages to 0 in free_area_init_core().
> > Anyway it will be reset before the first use.

Rethink about this, it's better to set zone->managed_pages to 0 because
there isn't any page added to buddy. Will update.

> 
> Yeah, setting to 0 is also fine. I thougt of 0 ever. Considering
> zone->present_pages is closer value to actual zone->managed_pages
> than 0, and it may be needed in the future in some way before
> mem_init(). If no strong objection, I will keep the assigning
> 'zone->present_pages' to 'zone->managed_pages'.
> 
> Thanks again for careful reviewing.
> 



Re: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()

2024-03-20 Thread Baoquan He
On 03/19/24 at 06:17pm, Mike Rapoport wrote:
> On Mon, Mar 18, 2024 at 10:21:36PM +0800, Baoquan He wrote:
> > Currently, in free_area_init_core(), when initialize zone's field, a
> > rough value is set to zone->managed_pages. That value is calculated by
> > (zone->present_pages - memmap_pages).
> > 
> > In the meantime, add the value to nr_all_pages and nr_kernel_pages which
> > represent all free pages of system (only low memory or including HIGHMEM
> > memory separately). Both of them are gonna be used in
> > alloc_large_system_hash().
> > 
> > However, the rough calculation and setting of zone->managed_pages is
> > meaningless because
> >   a) memmap pages are allocated on units of node in sparse_init() or
> >  alloc_node_mem_map(pgdat); The simple (zone->present_pages -
> >  memmap_pages) is too rough to make sense for zone;
> >   b) the set zone->managed_pages will be zeroed out and reset with
> >  acutal value in mem_init() via memblock_free_all(). Before the
> >  resetting, no buddy allocation request is issued.
> > 
> > Here, remove the meaningless and complicated calculation of
> > (zone->present_pages - memmap_pages), directly set zone->present_pages to
> > zone->managed_pages. It will be adjusted in mem_init().
> 
> Do you mean "set zone->managed_pages to zone->present_pages"?

Hmm, maybe 'set zone->managed_pages as zone->present_pages'
or 
   'assign zone->present_pages to zone->managed_pages'
which is more precise.

Wwill update.

> 
> I think we can just set zone->managed_pages to 0 in free_area_init_core().
> Anyway it will be reset before the first use.

Yeah, setting to 0 is also fine. I thougt of 0 ever. Considering
zone->present_pages is closer value to actual zone->managed_pages
than 0, and it may be needed in the future in some way before
mem_init(). If no strong objection, I will keep the assigning
'zone->present_pages' to 'zone->managed_pages'.

Thanks again for careful reviewing.



Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-20 Thread Luis Machado
On 3/20/24 07:04, Tobias Huschle wrote:
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
>> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle  wrote:
>>>
>>> On 2024-03-18 15:45, Luis Machado wrote:
 On 3/14/24 13:45, Tobias Huschle wrote:
> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>
>>> Questions:
>>> 1. The kworker getting its negative lag occurs in the following
>>> scenario
>>>- kworker and a cgroup are supposed to execute on the same CPU
>>>- one task within the cgroup is executing and wakes up the
>>> kworker
>>>- kworker with 0 lag, gets picked immediately and finishes its
>>>  execution within ~5000ns
>>>- on dequeue, kworker gets assigned a negative lag
>>>Is this expected behavior? With this short execution time, I
>>> would
>>>expect the kworker to be fine.
>>
>> That strikes me as a bit odd as well. Have you been able to determine
>> how a negative lag
>> is assigned to the kworker after such a short runtime?
>>
>
> I did some more trace reading though and found something.
>
> What I observed if everything runs regularly:
> - vhost and kworker run alternating on the same CPU
> - if the kworker is done, it leaves the runqueue
> - vhost wakes up the kworker if it needs it
> --> this means:
>   - vhost starts alone on an otherwise empty runqueue
>   - it seems like it never gets dequeued
> (unless another unrelated task joins or migration hits)
>   - if vhost wakes up the kworker, the kworker gets selected
>   - vhost runtime > kworker runtime
> --> kworker gets positive lag and gets selected immediately next
> time
>
> What happens if it does go wrong:
> From what I gather, there seem to be occasions where the vhost either
> executes suprisingly quick, or the kworker surprinsingly slow. If
> these
> outliers reach critical values, it can happen, that
>vhost runtime < kworker runtime
> which now causes the kworker to get the negative lag.
>
> In this case it seems like, that the vhost is very fast in waking up
> the kworker. And coincidentally, the kworker takes, more time than
> usual
> to finish. We speak of 4-digit to low 5-digit nanoseconds.
>
> So, for these outliers, the scheduler extrapolates that the kworker
> out-consumes the vhost and should be slowed down, although in the
> majority
> of other cases this does not happen.

 Thanks for providing the above details Tobias. It does seem like EEVDF
 is strict
 about the eligibility checks and making tasks wait when their lags are
 negative, even
 if just a little bit as in the case of the kworker.

 There was a patch to disable the eligibility checks
 (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
 which would make EEVDF more like EVDF, though the deadline comparison
 would
 probably still favor the vhost task instead of the kworker with the
 negative lag.

 I'm not sure if you tried it, but I thought I'd mention it.
>>>
>>> Haven't seen that one yet. Unfortunately, it does not help to ignore the
>>> eligibility.
>>>
>>> I'm inclined to rather propose propose a documentation change, which
>>> describes that tasks should not rely on woken up tasks being scheduled
>>> immediately.
>>
>> Where do you see such an assumption ? Even before eevdf, there were
>> nothing that ensures such behavior. When using CFS (legacy or eevdf)
>> tasks, you can't know if the newly wakeup task will run 1st or not
>>
> 
> There was no guarantee of course. place_entity was reducing the vruntime of 
> woken up tasks though, giving it a slight boost, right?. For the scenario 
> that I observed, that boost was enough to make sure, that the woken up tasks 
> gets scheduled consistently. This might still not be true for all scenarios, 
> but in general EEVDF seems to be stricter with woken up tasks.

It seems that way, as EEVDF will do eligibility and deadline checks before 
scheduling a task, so
a task would have to satisfy both of those checks.

I think we have some special treatment for when a task initially joins the 
competition, in which
case we halve its slice. But I don't think there is any special treatment for 
woken tasks
anymore.

There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) to try to 
reduce the number of
wake up preemptions under some conditions, under the RUN_TO_PARITY feature.


Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()

2024-03-20 Thread Baoquan He
On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> Hi Baoquan,
> 
> On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > This is not needed any more.
> 
> I'd swap this and the first patch, so that the first patch would remove
> memblock_find_dma_reserve() and it's changelog will explain why it's not
> needed and then the second patch will simply drop unused set_dma_reserve()

Thanks, Mike.

My thought on the patch 1/2 splitting is:
patch 1 is removing all relevant codes in mm, including the usage of
dma_reserve in free_area_init_core() and exporting set_dma_reserve()
to any ARCH which want to subtract the dma_reserve from DMA zone.

Patch 2 purely remove the code in x86 ARCH about how to get dma_reserve.

Your suggestion is also good to me, I can rearrange the order and
repost.



Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-20 Thread Tobias Huschle
On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle  wrote:
> >
> > On 2024-03-18 15:45, Luis Machado wrote:
> > > On 3/14/24 13:45, Tobias Huschle wrote:
> > >> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
> > >>> On 2/28/24 16:10, Tobias Huschle wrote:
> > 
> >  Questions:
> >  1. The kworker getting its negative lag occurs in the following
> >  scenario
> > - kworker and a cgroup are supposed to execute on the same CPU
> > - one task within the cgroup is executing and wakes up the
> >  kworker
> > - kworker with 0 lag, gets picked immediately and finishes its
> >   execution within ~5000ns
> > - on dequeue, kworker gets assigned a negative lag
> > Is this expected behavior? With this short execution time, I
> >  would
> > expect the kworker to be fine.
> > >>>
> > >>> That strikes me as a bit odd as well. Have you been able to determine
> > >>> how a negative lag
> > >>> is assigned to the kworker after such a short runtime?
> > >>>
> > >>
> > >> I did some more trace reading though and found something.
> > >>
> > >> What I observed if everything runs regularly:
> > >> - vhost and kworker run alternating on the same CPU
> > >> - if the kworker is done, it leaves the runqueue
> > >> - vhost wakes up the kworker if it needs it
> > >> --> this means:
> > >>   - vhost starts alone on an otherwise empty runqueue
> > >>   - it seems like it never gets dequeued
> > >> (unless another unrelated task joins or migration hits)
> > >>   - if vhost wakes up the kworker, the kworker gets selected
> > >>   - vhost runtime > kworker runtime
> > >> --> kworker gets positive lag and gets selected immediately next
> > >> time
> > >>
> > >> What happens if it does go wrong:
> > >> From what I gather, there seem to be occasions where the vhost either
> > >> executes suprisingly quick, or the kworker surprinsingly slow. If
> > >> these
> > >> outliers reach critical values, it can happen, that
> > >>vhost runtime < kworker runtime
> > >> which now causes the kworker to get the negative lag.
> > >>
> > >> In this case it seems like, that the vhost is very fast in waking up
> > >> the kworker. And coincidentally, the kworker takes, more time than
> > >> usual
> > >> to finish. We speak of 4-digit to low 5-digit nanoseconds.
> > >>
> > >> So, for these outliers, the scheduler extrapolates that the kworker
> > >> out-consumes the vhost and should be slowed down, although in the
> > >> majority
> > >> of other cases this does not happen.
> > >
> > > Thanks for providing the above details Tobias. It does seem like EEVDF
> > > is strict
> > > about the eligibility checks and making tasks wait when their lags are
> > > negative, even
> > > if just a little bit as in the case of the kworker.
> > >
> > > There was a patch to disable the eligibility checks
> > > (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
> > > which would make EEVDF more like EVDF, though the deadline comparison
> > > would
> > > probably still favor the vhost task instead of the kworker with the
> > > negative lag.
> > >
> > > I'm not sure if you tried it, but I thought I'd mention it.
> >
> > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > eligibility.
> >
> > I'm inclined to rather propose propose a documentation change, which
> > describes that tasks should not rely on woken up tasks being scheduled
> > immediately.
> 
> Where do you see such an assumption ? Even before eevdf, there were
> nothing that ensures such behavior. When using CFS (legacy or eevdf)
> tasks, you can't know if the newly wakeup task will run 1st or not
> 

There was no guarantee of course. place_entity was reducing the vruntime of 
woken up tasks though, giving it a slight boost, right?. For the scenario 
that I observed, that boost was enough to make sure, that the woken up tasks 
gets scheduled consistently. This might still not be true for all scenarios, 
but in general EEVDF seems to be stricter with woken up tasks.

Dismissing the lag on wakeup also does obviously not guarantee getting 
scheduled, as other tasks might still be involved.

The question would be if it should be explicitly mentioned somewhere that,
at this point, woken up tasks are not getting any special treatment and
noone should rely on that boost for woken up tasks.

> >
> > Changing things in the code to address for the specific scenario I'm
> > seeing seems to mostly create unwanted side effects and/or would require
> > the definition of some magic cut-off values.
> >
> >


Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Christophe Leroy


Le 20/03/2024 à 00:26, Jason Gunthorpe a écrit :
> On Tue, Mar 19, 2024 at 11:07:08PM +, Christophe Leroy wrote:
>>
>>
>> Le 18/03/2024 à 17:15, Jason Gunthorpe a écrit :
>>> On Thu, Mar 14, 2024 at 01:11:59PM +, Christophe Leroy wrote:


 Le 14/03/2024 à 13:53, Peter Xu a écrit :
> On Thu, Mar 14, 2024 at 08:45:34AM +, Christophe Leroy wrote:
>>
>>
>> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
>>> From: Peter Xu 
>>>
>>> PowerPC book3s 4K mostly has the same definition on both, except 
>>> pXd_huge()
>>> constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out 
>>> [1],
>>> it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be 
>>> set so
>>> it will keep returning false.
>>>
>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
>>> such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc 
>>> hugetlb
>>> pgtable walker __find_linux_pte() already used pXd_leaf() to check 
>>> hugetlb
>>> mappings.
>>>
>>> The goal should be that we will have one API pXd_leaf() to detect all 
>>> kinds
>>> of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather 
>>> than
>>> pXd_huge() ones) to make sure ie. THPs on hash MMU will also return 
>>> true.
>>
>> All kinds of huge mappings ?
>>
>> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are
>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>> those huge pages.
>
> Ah yes, I should always mention this is in the context of leaf huge pages
> only.  Are the examples you provided all fall into hugepd category?  If so
> I can reword the commit message, as:

 On powerpc 8xx, only the 8M huge pages fall into the hugepd case.

 The 512k hugepages are at PTE level, they are handled more or less like
 CONT_PTE on ARM. see function set_huge_pte_at() for more context.

 You can also look at pte_leaf_size() and pgd_leaf_size().
>>>
>>> IMHO leaf should return false if the thing is pointing to a next level
>>> page table, even if that next level is fully populated with contiguous
>>> pages.
>>>
>>> This seems more aligned with the contig page direction that hugepd
>>> should be moved over to..
>>
>> Should hugepd be moved to the contig page direction, really ?
> 
> Sure? Is there any downside for the reading side to do so?

Probably not.

> 
>> Would it be acceptable that a 8M hugepage requires 2048 contig entries
>> in 2 page tables, when the hugepd allows a single entry ?
> 
> ? I thought we agreed the only difference would be that something new
> is needed to merge the two identical sibling page tables into one, ie
> you pay 2x the page table memory if that isn't fixed. That is write
> side only change and I imagine it could be done with a single PPC
> special API.
> 
> Honestly not totally sure that is a big deal, it is already really
> memory inefficient compared to every other arch's huge page by needing
> the child page table in the first place.
> 
>> Would it be acceptable performancewise ?
> 
> Isn't this particular PPC sub platform ancient? Are there current real
> users that are going to have hugetlbfs special code and care about
> this performance detail on a 6.20 era kernel?

Ancient yes but still widely in use and with the emergence of voice over 
IP in Air Trafic Control, performance becomes more and more challenge 
with those old boards that have another 10 years in front of them.

> 
> In today's world wouldn't it be performance better if these platforms
> could support THP by aligning to the contig API instead of being
> special?

Indeed, if we can promote THP that'd be even better.

> 
> Am I wrong to question why we are polluting the core code for this
> special optimization?

At the first place that was to get a close fit between hardware 
pagetable topology and linux pagetable topology. But obviously we 
already stepped back for 512k pages, so let's go one more step aside and 
do similar with 8M pages.

I'll give it a try and see how it goes.

Christophe