Re: [PATCH 1/1] ASoC: fsl: fsl_ssi: Add dev_err_probe if PCM DMA init fails
On Thu, Mar 14, 2024 at 10:16 PM Alexander Stein wrote:
>
> This happens especially if this driver is built-in, but SDMA driver
> is configured as module.
>
> Signed-off-by: Alexander Stein

Acked-by: Shengjiu Wang

Best Regards
Shengjiu Wang

> ---
>  sound/soc/fsl/fsl_ssi.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
> index ab6ec1974807..4ca3a16f7ac0 100644
> --- a/sound/soc/fsl/fsl_ssi.c
> +++ b/sound/soc/fsl/fsl_ssi.c
> @@ -1401,8 +1401,10 @@ static int fsl_ssi_imx_probe(struct platform_device *pdev,
>  			goto error_pcm;
>  	} else {
>  		ret = imx_pcm_dma_init(pdev);
> -		if (ret)
> +		if (ret) {
> +			dev_err_probe(dev, ret, "Failed to init PCM DMA\n");
>  			goto error_pcm;
> +		}
>  	}
>
>  	return 0;
> --
> 2.34.1
>
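The point of the patch above is that dev_err_probe() both logs the failure and returns the error code, while staying silent for -EPROBE_DEFER (the case this thread is about: the built-in SSI driver probing before the modular SDMA driver has loaded). The following userspace sketch mimics that behaviour; the function and device names are stand-ins, not the real kernel API, and EPROBE_DEFER is redefined locally only so the sketch compiles outside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Kernel-internal errno value, defined here only for the sketch. */
#define EPROBE_DEFER 517

/* Minimal stand-in for the kernel's dev_err_probe(): log the failure
 * unless the error is -EPROBE_DEFER (the probe will be retried later,
 * e.g. once the SDMA module is loaded), and return the error code
 * unchanged so the caller can write "return dev_err_probe(...)". */
static int dev_err_probe_sketch(const char *dev, int err, const char *msg)
{
	if (err != -EPROBE_DEFER)
		fprintf(stderr, "%s: error %d: %s", dev, err, msg);
	return err;
}

/* Hypothetical DMA init: defers until the DMA engine driver is loaded. */
static int imx_pcm_dma_init_sketch(int sdma_loaded)
{
	return sdma_loaded ? 0 : -EPROBE_DEFER;
}

/* Shape of the fixed probe path: the failure is now reported instead of
 * being silently propagated up. */
static int fsl_ssi_probe_sketch(int sdma_loaded)
{
	int ret = imx_pcm_dma_init_sketch(sdma_loaded);

	if (ret)
		return dev_err_probe_sketch("fsl-ssi", ret,
					    "Failed to init PCM DMA\n");
	return 0;
}
```

A deferred probe returns -EPROBE_DEFER quietly, matching the real helper's design: deferral is routine, so only genuine errors reach the log.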
Re: [PATCH v3 0/5] ASoC: fsl: Support register and unregister rpmsg sound card through remoteproc
On Mon, Mar 11, 2024 at 7:14 PM Chancel Liu wrote:
>
> echo /lib/firmware/fw.elf > /sys/class/remoteproc/remoteproc0/firmware
> (A) echo start > /sys/class/remoteproc/remoteproc0/state
> (B) echo stop > /sys/class/remoteproc/remoteproc0/state
>
> The rpmsg sound card is registered in (A) and unregistered in (B).
> After "start", imx-audio-rpmsg registers devices for ASoC platform driver
> and machine driver. Then sound card is registered. After "stop",
> imx-audio-rpmsg unregisters devices for ASoC platform driver and machine
> driver. Then sound card is unregistered.

Acked-by: Shengjiu Wang

Best regards
Shengjiu Wang

> changes in v2
> - Fix build errors reported by kernel test robot
>
> changes in v3
> - Add a new patch for fsl_rpmsg to register CPU DAI with rpmsg channel
>   name
> - Update imx-rpmsg.c to get DT node of ASoC CPU DAI device with rpmsg
>   channel name instead of using undocumented bindings
>
> Chancel Liu (5):
>   ASoC: fsl: imx-pcm-rpmsg: Register component with rpmsg channel name
>   ASoC: fsl: imx-audio-rpmsg: Register device with rpmsg channel name
>   ASoC: fsl: Let imx-audio-rpmsg register platform device for card
>   ASoC: fsl: fsl_rpmsg: Register CPU DAI with name of rpmsg channel
>   ASoC: fsl: imx-rpmsg: Update to correct DT node
>
>  sound/soc/fsl/fsl_rpmsg.c       | 43 -
>  sound/soc/fsl/imx-audio-rpmsg.c | 21 +---
>  sound/soc/fsl/imx-pcm-rpmsg.c   | 11 ++---
>  sound/soc/fsl/imx-rpmsg.c       | 28 ++---
>  4 files changed, 71 insertions(+), 32 deletions(-)
>
> --
> 2.43.0
>
Re: [PATCH] powerpc: ps3: mark ps3_notification_device static for stack usage
On 3/21/24 03:03, Arnd Bergmann wrote:
> From: Arnd Bergmann
>
> The device is way too large to be on the stack, causing a warning for
> an allmodconfig build with clang:
>
> arch/powerpc/platforms/ps3/device-init.c:771:12: error: stack frame size
> (2064) exceeds limit (2048) in 'ps3_probe_thread' [-Werror,-Wframe-larger-than]
>   771 | static int ps3_probe_thread(void *data)
>
> There is only a single thread that ever accesses this, so it's fine to
> make it static, which avoids the warning.
>
> Fixes: b4cb2941f855 ("[POWERPC] PS3: Use the HVs storage device notification
> mechanism properly")
> Signed-off-by: Arnd Bergmann
> ---
>  arch/powerpc/platforms/ps3/device-init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/ps3/device-init.c
> b/arch/powerpc/platforms/ps3/device-init.c
> index 878bc160246e..ce99f06698a9 100644
> --- a/arch/powerpc/platforms/ps3/device-init.c
> +++ b/arch/powerpc/platforms/ps3/device-init.c
> @@ -770,7 +770,7 @@ static struct task_struct *probe_task;
>
>  static int ps3_probe_thread(void *data)
>  {
> -	struct ps3_notification_device dev;
> +	static struct ps3_notification_device dev;
>  	int res;
>  	unsigned int irq;
>  	u64 lpar;

Seems fine.

Acked-by: Geoff Levand
Re: [PATCH] vdso: use CONFIG_PAGE_SHIFT in vdso/datapage.h
On Wed, Mar 20, 2024 at 07:02:15PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann
>
> Both the vdso rework and the CONFIG_PAGE_SHIFT changes were merged during
> the v6.9 merge window, so it is now possible to use CONFIG_PAGE_SHIFT
> instead of including asm/page.h in the vdso.
>
> This avoids the workaround for arm64 and addresses a build warning
> for powerpc64:
>
> In file included from :4:
> In file included from /home/arnd/arm-soc/arm-soc/lib/vdso/gettimeofday.c:5:
> In file included from ../include/vdso/datapage.h:25:
> arch/powerpc/include/asm/page.h:230:9: error: result of comparison of
> constant 13835058055282163712 with expression of type 'unsigned long' is
> always true [-Werror,-Wtautological-constant-out-of-range-compare]
>   230 |         return __pa(kaddr) >> PAGE_SHIFT;
>       |                ^~~
> arch/powerpc/include/asm/page.h:217:37: note: expanded from macro '__pa'
>   217 |         VIRTUAL_WARN_ON((unsigned long)(x) < PAGE_OFFSET);      \
>       |                                            ~~~^~
> arch/powerpc/include/asm/page.h:202:73: note: expanded from macro 'VIRTUAL_WARN_ON'
>   202 | #define VIRTUAL_WARN_ON(x) WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && (x))
>       |                                                                     ~^~~
> arch/powerpc/include/asm/bug.h:88:25: note: expanded from macro 'WARN_ON'
>    88 |         int __ret_warn_on = !!(x);      \
>       |                                ^
>
> Cc: Michael Ellerman
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Andy Lutomirski
> Cc: Thomas Gleixner
> Cc: Vincenzo Frascino
> Cc: Anna-Maria Behnsen
> See-also: 8b3843ae3634 ("vdso/datapage: Quick fix - use asm/page-def.h for ARM64")
> Signed-off-by: Arnd Bergmann

Thanks for tracking this!

Reviewed-by: Kees Cook

--
Kees Cook
Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()
On Wed, Mar 20, 2024 at 05:40:39PM +, Christophe Leroy wrote:
>
> Le 20/03/2024 à 17:09, Peter Xu a écrit :
> > On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
> >> At the first place that was to get a close fit between hardware
> >> pagetable topology and linux pagetable topology. But obviously we
> >> already stepped back for 512k pages, so let's go one more step aside and
> >> do similar with 8M pages.
> >>
> >> I'll give it a try and see how it goes.
> >
> > So you're talking about 8M only for 8xx, am I right?
>
> Yes I am.
>
> > There seem to be other PowerPC systems use hugepd. Is it possible that we
> > convert all hugepd into cont_pte form?
>
> Indeed.
>
> Seems like we have hugepd for book3s/64 and for nohash.
>
> For book3s I don't know, may Aneesh can answer.
>
> For nohash I think it should be possible because TLB misses are handled
> by software. Even the e6500 which has a hardware tablewalk falls back on
> software walk when it is a hugepage IIUC.

It'll be great if I can get some answer here, and then I'll know the path
for hugepd in general. I don't want to add any new code into core mm for
something destined to fade away soon.

One option for me is to check a macro for hugepd existence, so all new
code will only work when hugepd is not supported on such an arch. However
that'll start to make some PowerPC systems special (which I still tried
hard to avoid, even if that wasn't achieved in the past..), meanwhile
we'll also need to keep some generic-mm paths (that I could otherwise
remove along with the new code) only for these hugepd systems.

But it's still okay to me; it'll just be a matter of when to drop that
code, sooner or later.

Thanks,

--
Peter Xu
[PATCH] powerpc: ps3: mark ps3_notification_device static for stack usage
From: Arnd Bergmann

The device is way too large to be on the stack, causing a warning for
an allmodconfig build with clang:

arch/powerpc/platforms/ps3/device-init.c:771:12: error: stack frame size
(2064) exceeds limit (2048) in 'ps3_probe_thread' [-Werror,-Wframe-larger-than]
  771 | static int ps3_probe_thread(void *data)

There is only a single thread that ever accesses this, so it's fine to
make it static, which avoids the warning.

Fixes: b4cb2941f855 ("[POWERPC] PS3: Use the HVs storage device notification mechanism properly")
Signed-off-by: Arnd Bergmann
---
 arch/powerpc/platforms/ps3/device-init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/ps3/device-init.c b/arch/powerpc/platforms/ps3/device-init.c
index 878bc160246e..ce99f06698a9 100644
--- a/arch/powerpc/platforms/ps3/device-init.c
+++ b/arch/powerpc/platforms/ps3/device-init.c
@@ -770,7 +770,7 @@ static struct task_struct *probe_task;
 
 static int ps3_probe_thread(void *data)
 {
-	struct ps3_notification_device dev;
+	static struct ps3_notification_device dev;
 	int res;
 	unsigned int irq;
 	u64 lpar;
-- 
2.39.2
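The trade-off in this one-word patch is worth spelling out: a large local lives in the stack frame, while a `static` local lives in .bss but is shared by every caller, so the change is only safe because a single probe thread runs this function. A userspace sketch of the same shape (the struct layout and size are illustrative, not the real `ps3_notification_device`):

```c
#include <assert.h>
#include <string.h>

/* Stand-in for struct ps3_notification_device: sized around the ~2 KiB
 * that blew the 2048-byte frame limit (illustrative, not the real layout). */
struct notification_device {
	unsigned char raw[2048];
};

/* With a plain "struct notification_device dev;" this frame would exceed
 * a -Wframe-larger-than=2048 limit. "static" moves the object to .bss,
 * which is safe here only because one probe thread ever executes this
 * function; a static local is shared across all callers and is not
 * reentrant. */
static int probe_thread_sketch(void)
{
	static struct notification_device dev;

	memset(&dev, 0, sizeof(dev));
	dev.raw[0] = 1;	/* pretend a notification arrived */
	return dev.raw[0];
}
```

The alternative would be kzalloc()/kfree() around the thread body; `static` is simpler when single-threaded use is guaranteed, as the commit message argues.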
[PATCH] vdso: use CONFIG_PAGE_SHIFT in vdso/datapage.h
From: Arnd Bergmann

Both the vdso rework and the CONFIG_PAGE_SHIFT changes were merged during
the v6.9 merge window, so it is now possible to use CONFIG_PAGE_SHIFT
instead of including asm/page.h in the vdso.

This avoids the workaround for arm64 and addresses a build warning
for powerpc64:

In file included from :4:
In file included from /home/arnd/arm-soc/arm-soc/lib/vdso/gettimeofday.c:5:
In file included from ../include/vdso/datapage.h:25:
arch/powerpc/include/asm/page.h:230:9: error: result of comparison of
constant 13835058055282163712 with expression of type 'unsigned long' is
always true [-Werror,-Wtautological-constant-out-of-range-compare]
  230 |         return __pa(kaddr) >> PAGE_SHIFT;
      |                ^~~
arch/powerpc/include/asm/page.h:217:37: note: expanded from macro '__pa'
  217 |         VIRTUAL_WARN_ON((unsigned long)(x) < PAGE_OFFSET);      \
      |                                            ~~~^~
arch/powerpc/include/asm/page.h:202:73: note: expanded from macro 'VIRTUAL_WARN_ON'
  202 | #define VIRTUAL_WARN_ON(x) WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && (x))
      |                                                                     ~^~~
arch/powerpc/include/asm/bug.h:88:25: note: expanded from macro 'WARN_ON'
   88 |         int __ret_warn_on = !!(x);      \
      |                                ^

Cc: Michael Ellerman
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Andy Lutomirski
Cc: Thomas Gleixner
Cc: Vincenzo Frascino
Cc: Anna-Maria Behnsen
See-also: 8b3843ae3634 ("vdso/datapage: Quick fix - use asm/page-def.h for ARM64")
Signed-off-by: Arnd Bergmann
---
 arch/powerpc/include/asm/vdso/gettimeofday.h | 3 +--
 include/vdso/datapage.h                      | 8 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h
index f0a4cf01e85c..78302f6c2580 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -4,7 +4,6 @@
 
 #ifndef __ASSEMBLY__
 
-#include
 #include
 #include
 #include
@@ -95,7 +94,7 @@ const struct vdso_data *__arch_get_vdso_data(void);
 
 static __always_inline
 const struct vdso_data *__arch_get_timens_vdso_data(const struct vdso_data *vd)
 {
-	return (void *)vd + PAGE_SIZE;
+	return (void *)vd + (1U << CONFIG_PAGE_SHIFT);
 }
 
 #endif
diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 5d5c0b8efff2..c71ddb6d4691 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,12 +19,6 @@
 #include
 #include
 
-#ifdef CONFIG_ARM64
-#include
-#else
-#include
-#endif
-
 #ifdef CONFIG_ARCH_HAS_VDSO_DATA
 #include
 #else
@@ -132,7 +126,7 @@ extern struct vdso_data _timens_data[CS_BASES] __attribute__((visibility("hidden
  */
 union vdso_data_store {
 	struct vdso_data	data[CS_BASES];
-	u8			page[PAGE_SIZE];
+	u8			page[1U << CONFIG_PAGE_SHIFT];
 };
-- 
2.39.2
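The substitution in the patch above rests on the identity PAGE_SIZE == 1 << PAGE_SHIFT: once the shift is available as a plain Kconfig integer, the vDSO can compute the page size without pulling in asm/page.h. A minimal sketch of both touched sites; the shift value 12 (4 KiB pages) is an assumption for illustration, and the names are stand-ins:

```c
#include <assert.h>

/* Hypothetical Kconfig value; 12 gives 4 KiB pages. */
#define CONFIG_PAGE_SHIFT 12

/* PAGE_SIZE derived without asm/page.h, as in the patched headers. */
static unsigned int vdso_page_size(void)
{
	return 1U << CONFIG_PAGE_SHIFT;
}

/* Equivalent of the patched __arch_get_timens_vdso_data(): the time
 * namespace data page sits exactly one page after the vDSO data page. */
static const void *timens_data_sketch(const void *vd)
{
	return (const char *)vd + vdso_page_size();
}
```

The same identity explains the `u8 page[1U << CONFIG_PAGE_SHIFT]` array in `union vdso_data_store`: the union is padded to exactly one page.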
Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()
Le 20/03/2024 à 17:09, Peter Xu a écrit :
> On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
>> At the first place that was to get a close fit between hardware
>> pagetable topology and linux pagetable topology. But obviously we
>> already stepped back for 512k pages, so let's go one more step aside and
>> do similar with 8M pages.
>>
>> I'll give it a try and see how it goes.
>
> So you're talking about 8M only for 8xx, am I right?

Yes I am.

> There seem to be other PowerPC systems use hugepd. Is it possible that we
> convert all hugepd into cont_pte form?

Indeed.

Seems like we have hugepd for book3s/64 and for nohash.

For book3s I don't know, maybe Aneesh can answer.

For nohash I think it should be possible because TLB misses are handled
by software. Even the e6500, which has a hardware tablewalk, falls back
on a software walk when it is a hugepage IIUC.

Christophe
Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()
On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
> At the first place that was to get a close fit between hardware
> pagetable topology and linux pagetable topology. But obviously we
> already stepped back for 512k pages, so let's go one more step aside and
> do similar with 8M pages.
>
> I'll give it a try and see how it goes.

So you're talking about 8M only for 8xx, am I right?

There seem to be other PowerPC systems that use hugepd. Is it possible
that we convert all hugepd into cont_pte form?

Thanks,

--
Peter Xu
Re: Cannot load wireguard module
On Wed, Mar 20, 2024 at 11:41:32PM +1100, Michael Ellerman wrote:
> Michal Suchánek writes:
> > On Mon, Mar 18, 2024 at 06:08:55PM +0100, Michal Suchánek wrote:
> >> On Mon, Mar 18, 2024 at 10:50:49PM +1100, Michael Ellerman wrote:
> >> > Michael Ellerman writes:
> >> > > Michal Suchánek writes:
> >> > >> Hello,
> >> > >>
> >> > >> I cannot load the wireguard module.
> >> > >>
> >> > >> Loading the module provides no diagnostic other than 'No such device'.
> >> > >>
> >> > >> Please provide maningful diagnostics for loading software-only driver,
> >> > >> clearly there is no particular device needed.
> >> > >
> >> > > Presumably it's just bubbling up an -ENODEV from somewhere.
> >> > >
> >> > > Can you get a trace of it?
> >> > >
> >> > > Something like:
> >> > >
> >> > >   # trace-cmd record -p function_graph -F modprobe wireguard
> >
> > Attached.
>
> Sorry :/, you need to also trace children of modprobe, with -c.
>
> But, I was able to reproduce the same issue here.
>
> On a P9, a kernel with CONFIG_CRYPTO_CHACHA20_P10=n everything works:
>
> $ modprobe -v wireguard
> insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko
> insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko
> insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko
> insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha20poly1305.ko
> insmod /lib/modules/6.8.0/kernel/drivers/net/wireguard/wireguard.ko
> [   19.180564][  T692] wireguard: allowedips self-tests: pass
> [   19.185080][  T692] wireguard: nonce counter self-tests: pass
> [   19.310438][  T692] wireguard: ratelimiter self-tests: pass
> [   19.310639][  T692] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
> [   19.310746][  T692] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld. All Rights Reserved.
>
> If I build CONFIG_CRYPTO_CHACHA20_P10 as a module then it breaks:
>
> $ modprobe -v wireguard
> insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko
> insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko
> insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko
> insmod /lib/modules/6.8.0/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
> modprobe: ERROR: could not insert 'wireguard': No such device
>
> The ENODEV is coming from module_cpu_feature_match(), which blocks the
> driver from loading on non-p10.
>
> Looking at other arches (arm64 at least) it seems like the driver should
> instead be loading but disabling the p10 path. Which then allows
> chacha_crypt_arch() to exist, and it has a fallback to use
> chacha_crypt_generic().
>
> I don't see how module_cpu_feature_match() can co-exist with the driver
> also providing a fallback. Hopefully someone who knows crypto better
> than me can explain it.

Maybe it doesn't. ppc64le is the only platform that needs the fallback;
on other platforms that have a hardware-specific chacha implementation
it seems to rely on a pretty common CPU feature, so the fallback is
rarely if ever needed in practice.

Thanks

Michal

> This diff fixes it for me:
>
> diff --git a/arch/powerpc/crypto/chacha-p10-glue.c
> b/arch/powerpc/crypto/chacha-p10-glue.c
> index 74fb86b0d209..9d2c30b0904c 100644
> --- a/arch/powerpc/crypto/chacha-p10-glue.c
> +++ b/arch/powerpc/crypto/chacha-p10-glue.c
> @@ -197,6 +197,9 @@ static struct skcipher_alg algs[] = {
>
>  static int __init chacha_p10_init(void)
>  {
> +	if (!cpu_has_feature(PPC_FEATURE2_ARCH_3_1))
> +		return 0;
> +
>  	static_branch_enable(_p10);
>
>  	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
> @@ -207,7 +210,7 @@ static void __exit chacha_p10_exit(void)
>  	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
>  }
>
> -module_cpu_feature_match(PPC_MODULE_FEATURE_P10, chacha_p10_init);
> +module_init(chacha_p10_init);
> module_exit(chacha_p10_exit);
>
> MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (P10 accelerated)");
>
> Giving me:
>
> $ modprobe -v wireguard
> insmod /lib/modules/6.8.0-dirty/kernel/net/ipv4/udp_tunnel.ko
> insmod /lib/modules/6.8.0-dirty/kernel/net/ipv6/ip6_udp_tunnel.ko
> insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha.ko
> insmod /lib/modules/6.8.0-dirty/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
> insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha20poly1305.ko
> insmod /lib/modules/6.8.0-dirty/kernel/drivers/net/wireguard/wireguard.ko
> [   19.657941][  T718] wireguard: allowedips self-tests: pass
> [   19.662501][  T718] wireguard: nonce counter self-tests: pass
> [   19.782933][  T718] wireguard: ratelimiter self-tests: pass
> [   19.783114][  T718] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
> [   19.783223][  T718] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld. All Rights Reserved.
>
> cheers
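The shape of Michael's diff is "always load, gate only the fast path at init time": module init succeeds on every CPU, registration of the accelerated ciphers happens only when the feature bit is present, and callers otherwise fall back to the generic implementation (as chacha_crypt_arch() does via chacha_crypt_generic()). A userspace sketch of that pattern; all names here are illustrative, not the real kernel API:

```c
#include <assert.h>

enum impl { IMPL_GENERIC, IMPL_P10 };

/* What chacha_crypt_arch() effectively does after the fix: pick the
 * accelerated path only when the CPU feature is present, otherwise use
 * the generic code. */
static enum impl chacha_select_impl(int cpu_has_p10)
{
	return cpu_has_p10 ? IMPL_P10 : IMPL_GENERIC;
}

/* The init path after the fix: never fails with -ENODEV, so the module
 * loads everywhere; without the feature it simply skips registering the
 * fast ciphers and leaves the static branch disabled. */
static int chacha_init_sketch(int cpu_has_p10)
{
	if (!cpu_has_p10)
		return 0;	/* loaded, fast path left disabled */

	/* ...would enable the static branch and register the
	 * P10-accelerated skciphers here... */
	return 0;
}
```

Contrast with module_cpu_feature_match(), which turns the same feature test into a hard load failure, taking the library fallback symbols down with it.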
Re: [RFC PATCH 1/3] powerpc/pseries/iommu: Bring back userspace view for single level TCE tables
Hi Jason,

On 3/19/24 20:02, Jason Gunthorpe wrote:
> On Tue, Mar 12, 2024 at 01:14:20PM -0500, Shivaprasad G Bhat wrote:
>> The commit 090bad39b237a ("powerpc/powernv: Add indirect levels to
>> it_userspace") which implemented the tce indirect levels support for
>> PowerNV ended up removing the single level support which existed by
>> default (generic tce_iommu_userspace_view_alloc/free() calls). On
>> pSeries the TCEs are single level, and the allocation of userspace
>> view is lost with the removal of the generic code.
>
> :( :(
>
> If this has been broken since 2018 and nobody cared till now can we
> please go in a direction of moving this code to the new iommu APIs
> instead of doubling down on more of this old stuff that apparently
> almost nobody cares about ??

We have existing software stack deployments using VFIO userspace device
assignment running on the Power platform. We have to enable a similar
software stack on the newer-generation Power10 platform and also in a
pSeries LPAR environment. These distros rely on VFIO enabled in the
kernel and currently have IOMMUFD disabled.

This patch series is a simpler, low-risk enablement that functionally
gets the software stack working while we continue to enable and move to
IOMMUFD in phases. We have to fix the older APIs in order to stage the
functional enablement in small increments.

We are working on iommufd support for pSeries and looking forward to
Timothy's patches.

Thanks,
Shivaprasad

> Jason
Re: [PATCH] arch/powerpc/kvm: Add support for reading VPA counters for pseries guests
On Wed Mar 20, 2024 at 12:28 AM AEST, Gautam Menghani wrote:
> PAPR hypervisor has introduced three new counters in the VPA area of
> LPAR CPUs for KVM L2 guest (see [1] for terminology) observability - 2
> for context switches from host to guest and vice versa, and 1 counter
> for getting the total time spent inside the KVM guest. Add a tracepoint
> that enables reading the counters for use by ftrace/perf. Note that this
> tracepoint is only available for nestedv2 API (i.e, KVM on PowerVM).
>
> [1] Terminology:
> a. L1 refers to the VM (LPAR) booted on top of PAPR hypervisor
> b. L2 refers to the KVM guest booted on top of L1.
>
> Signed-off-by: Vaibhav Jain
> Signed-off-by: Gautam Menghani
> ---
>  arch/powerpc/include/asm/kvm_host.h |  5 +
>  arch/powerpc/include/asm/lppaca.h   | 11 ---
>  arch/powerpc/kvm/book3s_hv.c        | 20 
>  arch/powerpc/kvm/trace_hv.h         | 24 
>  4 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h
> b/arch/powerpc/include/asm/kvm_host.h
> index 8abac5321..26d7bb4b9 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -847,6 +847,11 @@ struct kvm_vcpu_arch {
>  	gpa_t nested_io_gpr;
>  	/* For nested APIv2 guests */
>  	struct kvmhv_nestedv2_io nestedv2_io;
> +
> +	/* For VPA counters having context switch and guest run time info (in ns) */
> +	u64 l1_to_l2_cs;
> +	u64 l2_to_l1_cs;
> +	u64 l2_runtime;
>  #endif
>
>  #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING

These aren't required here if it's just used for tracing over a single
run vcpu call, are they?

> diff --git a/arch/powerpc/include/asm/lppaca.h
> b/arch/powerpc/include/asm/lppaca.h
> index 61ec2447d..bda6b86b9 100644
> --- a/arch/powerpc/include/asm/lppaca.h
> +++ b/arch/powerpc/include/asm/lppaca.h
> @@ -62,7 +62,8 @@ struct lppaca {
>  	u8 donate_dedicated_cpu;	/* Donate dedicated CPU cycles */
>  	u8 fpregs_in_use;
>  	u8 pmcregs_in_use;
> -	u8 reserved8[28];
> +	u8 l2_accumul_cntrs_enable;	/* Enable usage of counters for KVM guest */
> +	u8 reserved8[27];
>  	__be64 wait_state_cycles;	/* Wait cycles for this proc */
>  	u8 reserved9[28];
>  	__be16 slb_count;		/* # of SLBs to maintain */
> @@ -92,9 +93,13 @@ struct lppaca {
>  	/* cacheline 4-5 */
>
>  	__be32 page_ins;		/* CMO Hint - # page ins by OS */
> -	u8 reserved12[148];
> +	u8 reserved12[28];
> +	volatile __be64 l1_to_l2_cs_tb;
> +	volatile __be64 l2_to_l1_cs_tb;
> +	volatile __be64 l2_runtime_tb;
> +	u8 reserved13[96];
>  	volatile __be64 dtl_idx;	/* Dispatch Trace Log head index */
> -	u8 reserved13[96];
> +	u8 reserved14[96];
>  } cacheline_aligned;
>
>  #define lppaca_of(cpu)	(*paca_ptrs[cpu]->lppaca_ptr)
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 2b04eba90..b94461b5f 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4092,6 +4092,7 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu *vcpu, u64 time_limit,
>  	unsigned long msr, i;
>  	int trap;
>  	long rc;
> +	struct lppaca *lp = get_lppaca();

Does get_lppaca() emit some inline asm that can't be optimised? Could
move it under the unlikely branches if so.

>
>  	io = &vcpu->arch.nestedv2_io;
>

KVM L0 could in theory provide this for v1 L1s too, so could this be
done at a higher level to cover both?

> @@ -4107,6 +4108,17 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu *vcpu, u64 time_limit,
>  	kvmppc_gse_put_u64(io->vcpu_run_input, KVMPPC_GSID_LPCR, lpcr);
>
>  	accumulate_time(vcpu, &vcpu->arch.in_guest);
> +
> +	/* Reset the guest host context switch timing */
> +	if (unlikely(trace_kvmppc_vcpu_exit_cs_time_enabled())) {
> +		lp->l2_accumul_cntrs_enable = 1;
> +		lp->l1_to_l2_cs_tb = 0;
> +		lp->l2_to_l1_cs_tb = 0;
> +		lp->l2_runtime_tb = 0;
> +	} else {
> +		lp->l2_accumul_cntrs_enable = 0;
> +	}

Instead of zeroing here, zero after the exit, which avoids the else
branch and possibly avoids an obscure race with the counters. What if
trace_kvmppc_vcpu_exit_cs_time_enabled() is false here...

> +
>  	rc = plpar_guest_run_vcpu(0, vcpu->kvm->arch.lpid, vcpu->vcpu_id,
>  				  , );
>
> @@ -4133,6 +4145,14 @@ static int kvmhv_vcpu_entry_nestedv2(struct kvm_vcpu *vcpu, u64 time_limit,
>
>  	timer_rearm_host_dec(*tb);
>
> +	/* Record context switch and guest_run_time data */
> +	if (unlikely(trace_kvmppc_vcpu_exit_cs_time_enabled())) {
> +		vcpu->arch.l1_to_l2_cs = tb_to_ns(be64_to_cpu(lp->l1_to_l2_cs_tb));
> +		vcpu->arch.l2_to_l1_cs =
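The conversion on the exit path above turns raw VPA counter values (timebase ticks) into nanoseconds via tb_to_ns(). A sketch of that scaling; the 512 MHz timebase frequency is an assumption for illustration (the kernel reads the real frequency from the device tree, and its tb_to_ns() uses a multiply-and-shift rather than this naive division):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Power timebase frequency for this sketch. */
#define TB_FREQ_HZ	512000000ULL
#define NSEC_PER_SEC	1000000000ULL

/* Naive tick-to-nanosecond scaling: ns = tb * 1e9 / tb_freq.
 * Note the multiplication can overflow u64 for very large tick counts;
 * the kernel avoids this with a precomputed mult/shift pair. */
static uint64_t tb_to_ns_sketch(uint64_t tb)
{
	return tb * NSEC_PER_SEC / TB_FREQ_HZ;
}
```

At 512 MHz one tick is just under 2 ns, so 512 ticks scale to exactly 1000 ns, which the tests below check.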
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Wed, 20 Mar 2024 at 08:04, Tobias Huschle wrote:
>
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> > On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote:
> > > ...
> > >
> > > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > > eligibility.
> > >
> > > I'm inclined to rather propose a documentation change, which
> > > describes that tasks should not rely on woken up tasks being scheduled
> > > immediately.
> >
> > Where do you see such an assumption ? Even before eevdf, there were
> > nothing that ensures such behavior. When using CFS (legacy or eevdf)
> > tasks, you can't know if the newly wakeup task will run 1st or not
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario

It was rather the opposite. It was ensuring that long-sleeping tasks will
not get too much bonus because of a vruntime too far in the past. This is
similar, although not exactly the same intent, as the lag. The bonus was
up to 24ms previously whereas it's not more than a slice now.

> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.
>
> Dismissing the lag on wakeup also does obviously not guarantee getting
> scheduled, as other tasks might still be involved.
>
> The question would be if it should be explicitly mentioned somewhere that,
> at this point, woken up tasks are not getting any special treatment and
> noone should rely on that boost for woken up tasks.
>
> > > Changing things in the code to address for the specific scenario I'm
> > > seeing seems to mostly create unwanted side effects and/or would require
> > > the definition of some magic cut-off values.
Re: [PATCH] KVM: PPC: Book3S HV nestedv2: Cancel pending HDEC exception
On Wed Mar 13, 2024 at 5:26 PM AEST, Vaibhav Jain wrote:
> This reverts commit 180c6b072bf360b686e53d893d8dcf7dbbaec6bb ("KVM: PPC:
> Book3S HV nestedv2: Do not cancel pending decrementer exception") which
> prevented cancelling a pending HDEC exception for nestedv2 KVM guests. It
> was done to avoid overhead of a H_GUEST_GET_STATE hcall to read the 'HDEC
> expiry TB' register which was higher compared to handling extra decrementer
> exceptions.
>
> This overhead of reading 'HDEC expiry TB' register has been mitigated
> recently by the L0 hypervisor(PowerVM) by putting the value of this
> register in L2 guest-state output buffer on trap to L1. From there the
> value of this register is cached, made available in kvmhv_run_single_vcpu()
> to compare it against host(L1) timebase and cancel the pending hypervisor
> decrementer exception if needed.

Ah, I figured out the problem here. Guest entry never clears the queued
dec, because it's level triggered on the DEC MSB, so it doesn't go away
when it's delivered. So the upstream code is indeed buggy, and I think I
take the blame for suggesting this nestedv2 workaround.

I actually don't think that is necessary though; we could treat it like
other interrupts. I think that would solve the problem without having to
test dec here.

I am wondering though: what workload slowed down such that this patch was
needed in the first place? We'd only get here after a cede returns, then
we'd dequeue the dec and stop having to GET_STATE it here.

Thanks,
Nick

> Fixes: 180c6b072bf3 ("KVM: PPC: Book3S HV nestedv2: Do not cancel pending
> decrementer exception")
> Signed-off-by: Vaibhav Jain
> ---
>  arch/powerpc/kvm/book3s_hv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 0b921704da45..e47b954ce266 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4856,7 +4856,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 time_limit,
>  	 * entering a nested guest in which case the decrementer is now owned
>  	 * by L2 and the L1 decrementer is provided in hdec_expires
>  	 */
> -	if (!kvmhv_is_nestedv2() && kvmppc_core_pending_dec(vcpu) &&
> +	if (kvmppc_core_pending_dec(vcpu) &&
>  	    ((tb < kvmppc_dec_expires_host_tb(vcpu)) ||
>  	     (trap == BOOK3S_INTERRUPT_SYSCALL &&
>  	      kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()
On 03/20/24 at 11:36am, Mike Rapoport wrote:
> On Wed, Mar 20, 2024 at 03:52:52PM +0800, Baoquan He wrote:
> > On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> > > Hi Baoquan,
> > >
> > > On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > > > This is not needed any more.
> > >
> > > I'd swap this and the first patch, so that the first patch would remove
> > > memblock_find_dma_reserve() and it's changelog will explain why it's not
> > > needed and then the second patch will simply drop unused set_dma_reserve()
> >
> > Thanks, Mike.
> >
> > My thought on the patch 1/2 splitting is:
> > patch 1 is removing all relevant codes in mm, including the usage of
> > dma_reserve in free_area_init_core() and exporting set_dma_reserve()
> > to any ARCH which want to subtract the dma_reserve from DMA zone.
> >
> > Patch 2 purely remove the code in x86 ARCH about how to get dma_reserve.
>
> I think it's better first to remove the usage of set_dma_reserve() in x86
> and then clean up the unused code.

OK, first removing the only user sounds reasonable. Will change. Thanks.
Re: Cannot load wireguard module
Michal Suchánek writes: > On Mon, Mar 18, 2024 at 06:08:55PM +0100, Michal Suchánek wrote: >> On Mon, Mar 18, 2024 at 10:50:49PM +1100, Michael Ellerman wrote: >> > Michael Ellerman writes: >> > > Michal Suchánek writes: >> > >> Hello, >> > >> >> > >> I cannot load the wireguard module. >> > >> >> > >> Loading the module provides no diagnostic other than 'No such device'. >> > >> >> > >> Please provide maningful diagnostics for loading software-only driver, >> > >> clearly there is no particular device needed. >> > > >> > > Presumably it's just bubbling up an -ENODEV from somewhere. >> > > >> > > Can you get a trace of it? >> > > >> > > Something like: >> > > >> > > # trace-cmd record -p function_graph -F modprobe wireguard > > Attached. Sorry :/, you need to also trace children of modprobe, with -c. But, I was able to reproduce the same issue here. On a P9, a kernel with CONFIG_CRYPTO_CHACHA20_P10=n everything works: $ modprobe -v wireguard insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha20poly1305.ko insmod /lib/modules/6.8.0/kernel/drivers/net/wireguard/wireguard.ko [ 19.180564][ T692] wireguard: allowedips self-tests: pass [ 19.185080][ T692] wireguard: nonce counter self-tests: pass [ 19.310438][ T692] wireguard: ratelimiter self-tests: pass [ 19.310639][ T692] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information. [ 19.310746][ T692] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved. 
If I build CONFIG_CRYPTO_CHACHA20_P10 as a module then it breaks:

$ modprobe -v wireguard
insmod /lib/modules/6.8.0/kernel/net/ipv4/udp_tunnel.ko
insmod /lib/modules/6.8.0/kernel/net/ipv6/ip6_udp_tunnel.ko
insmod /lib/modules/6.8.0/kernel/lib/crypto/libchacha.ko
insmod /lib/modules/6.8.0/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
modprobe: ERROR: could not insert 'wireguard': No such device

The ENODEV is coming from module_cpu_feature_match(), which blocks the
driver from loading on non-P10.

Looking at other arches (arm64 at least) it seems like the driver should
instead be loading but disabling the P10 path. Which then allows
chacha_crypt_arch() to exist, and it has a fallback to use
chacha_crypt_generic().

I don't see how module_cpu_feature_match() can co-exist with the driver
also providing a fallback. Hopefully someone who knows crypto better than
me can explain it.

This diff fixes it for me:

diff --git a/arch/powerpc/crypto/chacha-p10-glue.c b/arch/powerpc/crypto/chacha-p10-glue.c
index 74fb86b0d209..9d2c30b0904c 100644
--- a/arch/powerpc/crypto/chacha-p10-glue.c
+++ b/arch/powerpc/crypto/chacha-p10-glue.c
@@ -197,6 +197,9 @@ static struct skcipher_alg algs[] = {
 
 static int __init chacha_p10_init(void)
 {
+	if (!cpu_has_feature(PPC_FEATURE2_ARCH_3_1))
+		return 0;
+
 	static_branch_enable(&have_p10);
 
 	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
@@ -207,7 +210,7 @@ static void __exit chacha_p10_exit(void)
 	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
 }
 
-module_cpu_feature_match(PPC_MODULE_FEATURE_P10, chacha_p10_init);
+module_init(chacha_p10_init);
 module_exit(chacha_p10_exit);
 
 MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (P10 accelerated)");

Giving me:

$ modprobe -v wireguard
insmod /lib/modules/6.8.0-dirty/kernel/net/ipv4/udp_tunnel.ko
insmod /lib/modules/6.8.0-dirty/kernel/net/ipv6/ip6_udp_tunnel.ko
insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha.ko
insmod /lib/modules/6.8.0-dirty/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko
insmod /lib/modules/6.8.0-dirty/kernel/lib/crypto/libchacha20poly1305.ko
insmod /lib/modules/6.8.0-dirty/kernel/drivers/net/wireguard/wireguard.ko
[ 19.657941][ T718] wireguard: allowedips self-tests: pass
[ 19.662501][ T718] wireguard: nonce counter self-tests: pass
[ 19.782933][ T718] wireguard: ratelimiter self-tests: pass
[ 19.783114][ T718] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
[ 19.783223][ T718] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved.

cheers
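The shape of the fix above, always load the module and gate only the accelerated path on a runtime feature check, can be sketched in plain userspace C. This is a simplified model, not the kernel code: `have_p10`, `cpu_has_p10` and the string-returning crypt functions are stand-ins for the driver's static key, `cpu_has_feature()` and the real ChaCha implementations.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-ins for the kernel's static key and CPU feature test. */
static bool have_p10;     /* models DEFINE_STATIC_KEY_FALSE(have_p10) */
static bool cpu_has_p10;  /* models what cpu_has_feature() would report */

static const char *chacha_crypt_generic(void) { return "generic"; }
static const char *chacha_crypt_p10(void)     { return "p10"; }

/* Like chacha_crypt_arch(): use the fast path only if enabled at init,
 * otherwise fall back to the generic implementation. */
static const char *chacha_crypt_arch(void)
{
	return have_p10 ? chacha_crypt_p10() : chacha_crypt_generic();
}

/* Like the patched chacha_p10_init(): succeed on every CPU, and only
 * enable the accelerated path when the feature is present.  Failing
 * init with -ENODEV on older CPUs (the module_cpu_feature_match()
 * behavior) is what made modprobe refuse to load the dependency chain. */
static int chacha_p10_init(void)
{
	if (!cpu_has_p10)
		return 0; /* module still loads; generic path stays in use */
	have_p10 = true;
	return 0;
}
```

The key property is that init never fails for a missing CPU feature; callers of `chacha_crypt_arch()` always get a working implementation.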
Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()
On Wed, Mar 20, 2024 at 03:52:52PM +0800, Baoquan He wrote:
> On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> > Hi Baoquan,
> >
> > On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > > This is not needed any more.
> >
> > I'd swap this and the first patch, so that the first patch would remove
> > memblock_find_dma_reserve() and its changelog will explain why it's not
> > needed, and then the second patch will simply drop the unused
> > set_dma_reserve()
>
> Thanks, Mike.
>
> My thought on the patch 1/2 splitting is:
> patch 1 removes all relevant code in mm, including the usage of
> dma_reserve in free_area_init_core() and the exporting of
> set_dma_reserve() to any arch which wants to subtract dma_reserve
> from the DMA zone.
>
> Patch 2 purely removes the x86 arch code for how dma_reserve is obtained.

I think it's better first to remove the usage of set_dma_reserve() in x86
and then clean up the unused code.

> Your suggestion is also good to me, I can rearrange the order and repost.

--
Sincerely yours,
Mike.
Re: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()
On 03/20/24 at 04:18pm, Baoquan He wrote:
> On 03/19/24 at 06:17pm, Mike Rapoport wrote:
> > On Mon, Mar 18, 2024 at 10:21:36PM +0800, Baoquan He wrote:
> > > Currently, in free_area_init_core(), when initializing a zone's fields,
> > > a rough value is set to zone->managed_pages. That value is calculated
> > > by (zone->present_pages - memmap_pages).
> > >
> > > In the meantime, add the value to nr_all_pages and nr_kernel_pages,
> > > which represent all free pages of the system (only low memory, or
> > > including HIGHMEM memory, respectively). Both of them are going to be
> > > used in alloc_large_system_hash().
> > >
> > > However, the rough calculation and setting of zone->managed_pages is
> > > meaningless because
> > > a) memmap pages are allocated in units of node in sparse_init() or
> > >    alloc_node_mem_map(pgdat); the simple (zone->present_pages -
> > >    memmap_pages) is too rough to make sense for a zone;
> > > b) the set zone->managed_pages will be zeroed out and reset with the
> > >    actual value in mem_init() via memblock_free_all(). Before the
> > >    resetting, no buddy allocation request is issued.
> > >
> > > Here, remove the meaningless and complicated calculation of
> > > (zone->present_pages - memmap_pages), directly set zone->present_pages
> > > to zone->managed_pages. It will be adjusted in mem_init().
> >
> > Do you mean "set zone->managed_pages to zone->present_pages"?
>
> Hmm, maybe 'set zone->managed_pages as zone->present_pages'
> or 'assign zone->present_pages to zone->managed_pages'
> which is more precise.
>
> Will update.
>
> > I think we can just set zone->managed_pages to 0 in free_area_init_core().
> > Anyway it will be reset before the first use.

Rethinking about this, it's better to set zone->managed_pages to 0 because
there isn't any page added to the buddy allocator yet. Will update.

> Yeah, setting to 0 is also fine. I thought of 0 too. Considering
> zone->present_pages is a closer value to the actual zone->managed_pages
> than 0, and it may be needed in the future in some way before mem_init(),
> I will keep assigning 'zone->present_pages' to 'zone->managed_pages' if
> there is no strong objection.
>
> Thanks again for careful reviewing.
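Mike's point that any value seeded in free_area_init_core() is dead can be shown with a tiny userspace model. This is a sketch with made-up, simplified helpers (only the two fields under discussion; `pages_freed_to_buddy` is a hypothetical parameter standing in for what memblock_free_all() releases): whatever managed_pages starts as, mem_init() recomputes it from the pages actually handed to the buddy allocator.

```c
#include <assert.h>

/* Minimal stand-in for struct zone: only the two fields being debated. */
struct zone {
	unsigned long present_pages;
	unsigned long managed_pages;
};

/* Sketch of free_area_init_core(): seed managed_pages with some value
 * (0 or present_pages, the thread debates which); nothing consumes it
 * before mem_init(). */
static void free_area_init_core(struct zone *z, unsigned long present,
				unsigned long seed)
{
	z->present_pages = present;
	z->managed_pages = seed;
}

/* Sketch of mem_init()/memblock_free_all(): zero managed_pages, then
 * count only the pages actually released to the buddy allocator. */
static void mem_init(struct zone *z, unsigned long pages_freed_to_buddy)
{
	z->managed_pages = 0;
	z->managed_pages += pages_freed_to_buddy;
}
```

Running two zones through this, one seeded with 0 and one with present_pages, ends with identical managed_pages, which is exactly the "it will be reset before the first use" argument.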
Re: [PATCH 4/6] mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()
On 03/19/24 at 06:17pm, Mike Rapoport wrote:
> On Mon, Mar 18, 2024 at 10:21:36PM +0800, Baoquan He wrote:
> > Currently, in free_area_init_core(), when initializing a zone's fields,
> > a rough value is set to zone->managed_pages. That value is calculated
> > by (zone->present_pages - memmap_pages).
> >
> > In the meantime, add the value to nr_all_pages and nr_kernel_pages,
> > which represent all free pages of the system (only low memory, or
> > including HIGHMEM memory, respectively). Both of them are going to be
> > used in alloc_large_system_hash().
> >
> > However, the rough calculation and setting of zone->managed_pages is
> > meaningless because
> > a) memmap pages are allocated in units of node in sparse_init() or
> >    alloc_node_mem_map(pgdat); the simple (zone->present_pages -
> >    memmap_pages) is too rough to make sense for a zone;
> > b) the set zone->managed_pages will be zeroed out and reset with the
> >    actual value in mem_init() via memblock_free_all(). Before the
> >    resetting, no buddy allocation request is issued.
> >
> > Here, remove the meaningless and complicated calculation of
> > (zone->present_pages - memmap_pages), directly set zone->present_pages
> > to zone->managed_pages. It will be adjusted in mem_init().
>
> Do you mean "set zone->managed_pages to zone->present_pages"?

Hmm, maybe 'set zone->managed_pages as zone->present_pages'
or 'assign zone->present_pages to zone->managed_pages'
which is more precise.

Will update.

> I think we can just set zone->managed_pages to 0 in free_area_init_core().
> Anyway it will be reset before the first use.

Yeah, setting to 0 is also fine. I thought of 0 too. Considering
zone->present_pages is a closer value to the actual zone->managed_pages
than 0, and it may be needed in the future in some way before mem_init(),
I will keep assigning 'zone->present_pages' to 'zone->managed_pages' if
there is no strong objection.

Thanks again for careful reviewing.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On 3/20/24 07:04, Tobias Huschle wrote:
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
>> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote:
>>>
>>> On 2024-03-18 15:45, Luis Machado wrote:
On 3/14/24 13:45, Tobias Huschle wrote:
> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>
>>> Questions:
>>> 1. The kworker getting its negative lag occurs in the following scenario
>>>    - kworker and a cgroup are supposed to execute on the same CPU
>>>    - one task within the cgroup is executing and wakes up the kworker
>>>    - kworker with 0 lag, gets picked immediately and finishes its
>>>      execution within ~5000ns
>>>    - on dequeue, kworker gets assigned a negative lag
>>>    Is this expected behavior? With this short execution time, I would
>>>    expect the kworker to be fine.
>>
>> That strikes me as a bit odd as well. Have you been able to determine
>> how a negative lag is assigned to the kworker after such a short runtime?
>>
>
> I did some more trace reading though and found something.
>
> What I observed if everything runs regularly:
> - vhost and kworker run alternating on the same CPU
> - if the kworker is done, it leaves the runqueue
> - vhost wakes up the kworker if it needs it
> --> this means:
> - vhost starts alone on an otherwise empty runqueue
> - it seems like it never gets dequeued
>   (unless another unrelated task joins or migration hits)
> - if vhost wakes up the kworker, the kworker gets selected
> - vhost runtime > kworker runtime
> --> kworker gets positive lag and gets selected immediately next time
>
> What happens if it does go wrong:
> From what I gather, there seem to be occasions where the vhost either
> executes surprisingly quickly, or the kworker surprisingly slowly. If
> these outliers reach critical values, it can happen that
>    vhost runtime < kworker runtime
> which now causes the kworker to get the negative lag.
>
> In this case it seems like the vhost is very fast in waking up the
> kworker. And coincidentally, the kworker takes more time than usual
> to finish. We speak of 4-digit to low 5-digit nanoseconds.
>
> So, for these outliers, the scheduler extrapolates that the kworker
> out-consumes the vhost and should be slowed down, although in the
> majority of other cases this does not happen.

Thanks for providing the above details Tobias. It does seem like EEVDF is
strict about the eligibility checks and making tasks wait when their lags
are negative, even if just a little bit as in the case of the kworker.

There was a patch to disable the eligibility checks
(https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
which would make EEVDF more like EVDF, though the deadline comparison
would probably still favor the vhost task instead of the kworker with the
negative lag.

I'm not sure if you tried it, but I thought I'd mention it.

>>> Haven't seen that one yet. Unfortunately, it does not help to ignore
>>> the eligibility.
>>>
>>> I'm inclined to rather propose a documentation change, which describes
>>> that tasks should not rely on woken up tasks being scheduled
>>> immediately.
>>
>> Where do you see such an assumption? Even before eevdf, there was
>> nothing that ensures such behavior. When using CFS (legacy or eevdf)
>> tasks, you can't know if the newly woken task will run 1st or not
>>
>
> There was no guarantee of course. place_entity was reducing the vruntime
> of woken up tasks though, giving them a slight boost, right? For the
> scenario that I observed, that boost was enough to make sure that the
> woken up tasks get scheduled consistently. This might still not be true
> for all scenarios, but in general EEVDF seems to be stricter with woken
> up tasks.

It seems that way, as EEVDF will do eligibility and deadline checks before
scheduling a task, so a task would have to satisfy both of those checks.

I think we have some special treatment for when a task initially joins the
competition, in which case we halve its slice. But I don't think there is
any special treatment for woken tasks anymore.

There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) to try to
reduce the number of wake up preemptions under some conditions, under the
RUN_TO_PARITY feature.
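The strictness being discussed can be sketched with the core EEVDF eligibility idea. This is a deliberately simplified, equal-weight userspace model, not the kernel's load-weighted `avg_vruntime()`: a task's lag is the queue average vruntime minus its own, and only tasks with non-negative lag are eligible, so even a tiny negative lag from one outlier run makes a freshly woken task wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, equal-weight model of EEVDF eligibility (a sketch, not
 * kernel code: real entities are load-weighted). */
struct task {
	long vruntime;
};

/* Plain average of the runqueue's vruntimes; the kernel tracks this
 * incrementally and weighted, but the idea is the same. */
static long avg_vruntime(const struct task *ts, int n)
{
	long sum = 0;
	for (int i = 0; i < n; i++)
		sum += ts[i].vruntime;
	return sum / n;
}

/* lag = avg_vruntime - vruntime; a task is eligible iff lag >= 0,
 * i.e. its vruntime is at or before the queue average. */
static bool entity_eligible(const struct task *ts, int n, int i)
{
	return ts[i].vruntime <= avg_vruntime(ts, n);
}
```

In the vhost/kworker scenario, one outlier where the kworker's vruntime creeps slightly past the average is enough to fail this check, matching the "even if just a little bit" observation above.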
Re: [PATCH 2/6] x86: remove memblock_find_dma_reserve()
On 03/19/24 at 05:49pm, Mike Rapoport wrote:
> Hi Baoquan,
>
> On Mon, Mar 18, 2024 at 10:21:34PM +0800, Baoquan He wrote:
> > This is not needed any more.
>
> I'd swap this and the first patch, so that the first patch would remove
> memblock_find_dma_reserve() and its changelog will explain why it's not
> needed, and then the second patch will simply drop the unused
> set_dma_reserve()

Thanks, Mike.

My thought on the patch 1/2 splitting is:
patch 1 removes all relevant code in mm, including the usage of
dma_reserve in free_area_init_core() and the exporting of
set_dma_reserve() to any arch which wants to subtract dma_reserve from
the DMA zone.

Patch 2 purely removes the x86 arch code for how dma_reserve is obtained.

Your suggestion is also good to me, I can rearrange the order and repost.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote:
> >
> > On 2024-03-18 15:45, Luis Machado wrote:
> > > On 3/14/24 13:45, Tobias Huschle wrote:
> > >> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
> > >>> On 2/28/24 16:10, Tobias Huschle wrote:
> > > > Questions:
> > 1. The kworker getting its negative lag occurs in the following
> >    scenario
> >    - kworker and a cgroup are supposed to execute on the same CPU
> >    - one task within the cgroup is executing and wakes up the kworker
> >    - kworker with 0 lag, gets picked immediately and finishes its
> >      execution within ~5000ns
> >    - on dequeue, kworker gets assigned a negative lag
> >    Is this expected behavior? With this short execution time, I would
> >    expect the kworker to be fine.
> > >>>
> > >>> That strikes me as a bit odd as well. Have you been able to
> > >>> determine how a negative lag is assigned to the kworker after such
> > >>> a short runtime?
> > >>
> > >> I did some more trace reading though and found something.
> > >>
> > >> What I observed if everything runs regularly:
> > >> - vhost and kworker run alternating on the same CPU
> > >> - if the kworker is done, it leaves the runqueue
> > >> - vhost wakes up the kworker if it needs it
> > >> --> this means:
> > >> - vhost starts alone on an otherwise empty runqueue
> > >> - it seems like it never gets dequeued
> > >>   (unless another unrelated task joins or migration hits)
> > >> - if vhost wakes up the kworker, the kworker gets selected
> > >> - vhost runtime > kworker runtime
> > >> --> kworker gets positive lag and gets selected immediately next
> > >>     time
> > >>
> > >> What happens if it does go wrong:
> > >> From what I gather, there seem to be occasions where the vhost
> > >> either executes surprisingly quickly, or the kworker surprisingly
> > >> slowly. If these outliers reach critical values, it can happen that
> > >>    vhost runtime < kworker runtime
> > >> which now causes the kworker to get the negative lag.
> > >>
> > >> In this case it seems like the vhost is very fast in waking up the
> > >> kworker. And coincidentally, the kworker takes more time than usual
> > >> to finish. We speak of 4-digit to low 5-digit nanoseconds.
> > >>
> > >> So, for these outliers, the scheduler extrapolates that the kworker
> > >> out-consumes the vhost and should be slowed down, although in the
> > >> majority of other cases this does not happen.
> > >
> > > Thanks for providing the above details Tobias. It does seem like
> > > EEVDF is strict about the eligibility checks and making tasks wait
> > > when their lags are negative, even if just a little bit as in the
> > > case of the kworker.
> > >
> > > There was a patch to disable the eligibility checks
> > > (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
> > > which would make EEVDF more like EVDF, though the deadline comparison
> > > would probably still favor the vhost task instead of the kworker
> > > with the negative lag.
> > >
> > > I'm not sure if you tried it, but I thought I'd mention it.
> >
> > Haven't seen that one yet. Unfortunately, it does not help to ignore
> > the eligibility.
> >
> > I'm inclined to rather propose a documentation change, which describes
> > that tasks should not rely on woken up tasks being scheduled
> > immediately.
>
> Where do you see such an assumption? Even before eevdf, there was
> nothing that ensures such behavior. When using CFS (legacy or eevdf)
> tasks, you can't know if the newly woken task will run 1st or not
>

There was no guarantee of course. place_entity was reducing the vruntime
of woken up tasks though, giving them a slight boost, right? For the
scenario that I observed, that boost was enough to make sure that the
woken up tasks get scheduled consistently. This might still not be true
for all scenarios, but in general EEVDF seems to be stricter with woken
up tasks.

Dismissing the lag on wakeup also does obviously not guarantee getting
scheduled, as other tasks might still be involved. The question would be
if it should be explicitly mentioned somewhere that, at this point, woken
up tasks are not getting any special treatment and no one should rely on
that boost for woken up tasks.

> > Changing things in the code to address the specific scenario I'm
> > seeing seems to mostly create unwanted side effects and/or would
> > require the definition of some magic cut-off values.
Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()
On 20/03/2024 at 00:26, Jason Gunthorpe wrote:
> On Tue, Mar 19, 2024 at 11:07:08PM +, Christophe Leroy wrote:
>>
>> On 18/03/2024 at 17:15, Jason Gunthorpe wrote:
>>> On Thu, Mar 14, 2024 at 01:11:59PM +, Christophe Leroy wrote:
On 14/03/2024 at 13:53, Peter Xu wrote:
> On Thu, Mar 14, 2024 at 08:45:34AM +, Christophe Leroy wrote:
>>
>> On 13/03/2024 at 22:47, pet...@redhat.com wrote:
>>> From: Peter Xu
>>>
>>> PowerPC book3s 4K mostly has the same definition on both, except
>>> pXd_huge() constantly returns 0 for hash MMUs. As Michael Ellerman
>>> pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the
>>> bit will never be set, so it will keep returning false.
>>>
>>> As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to
>>> create such huge mappings for 4K hash MMUs. Meanwhile, the major
>>> powerpc hugetlb pgtable walker __find_linux_pte() already used
>>> pXd_leaf() to check hugetlb mappings.
>>>
>>> The goal should be that we will have one API pXd_leaf() to detect all
>>> kinds of huge mappings. AFAICT we need to use the pXd_leaf() impl
>>> (rather than pXd_huge() ones) to make sure ie. THPs on hash MMU will
>>> also return true.
>>
>> All kinds of huge mappings?
>>
>> pXd_leaf() will detect only leaf mappings (like pXd_huge()). There are
>> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages
>> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report
>> those huge pages.
>
> Ah yes, I should always mention this is in the context of leaf huge
> pages only. Do the examples you provided all fall into the hugepd
> category? If so I can reword the commit message, as:

On powerpc 8xx, only the 8M huge pages fall into the hugepd case. The
512k huge pages are at PTE level; they are handled more or less like
CONT_PTE on ARM. See function set_huge_pte_at() for more context.

You can also look at pte_leaf_size() and pgd_leaf_size().

>>> IMHO leaf should return false if the thing is pointing to a next level
>>> page table, even if that next level is fully populated with contiguous
>>> pages.
>>>
>>> This seems more aligned with the contig page direction that hugepd
>>> should be moved over to..
>>
>> Should hugepd be moved to the contig page direction, really?
>
> Sure? Is there any downside for the reading side to do so? Probably not.
>
>> Would it be acceptable that an 8M hugepage requires 2048 contig entries
>> in 2 page tables, when the hugepd allows a single entry?
>
> ? I thought we agreed the only difference would be that something new
> is needed to merge the two identical sibling page tables into one, ie
> you pay 2x the page table memory if that isn't fixed. That is a write
> side only change and I imagine it could be done with a single PPC
> special API.
>
> Honestly not totally sure that is a big deal, it is already really
> memory inefficient compared to every other arch's huge page by needing
> the child page table in the first place.
>
>> Would it be acceptable performance-wise?
>
> Isn't this particular PPC sub-platform ancient? Are there current real
> users that are going to have hugetlbfs special code and care about
> this performance detail on a 6.20 era kernel?

Ancient yes, but still widely in use, and with the emergence of voice
over IP in Air Traffic Control, performance becomes more and more of a
challenge with those old boards that have another 10 years in front of
them.

> In today's world wouldn't it be better performance-wise if these
> platforms could support THP by aligning to the contig API instead of
> being special?

Indeed, if we can promote THP that'd be even better.

> Am I wrong to question why we are polluting the core code for this
> special optimization?

In the first place that was to get a close fit between the hardware
pagetable topology and the Linux pagetable topology. But obviously we
already stepped back for 512k pages, so let's go one more step aside
and do the same with 8M pages.

I'll give it a try and see how it goes.

Christophe
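The cost Christophe is weighing can be put in numbers. A small sketch using the constants from the thread (4K base pages, PGD entries covering 4M on the 8xx, 8M huge pages); the helper names are made up for illustration: mapping one 8M page as contiguous base-page entries takes 2048 PTEs spread across 2 page tables, versus a single hugepd entry.

```c
#include <assert.h>

/* Constants from the discussion: 4K base pages, each PGD entry covers
 * 4M on the 8xx, and the huge page size under debate is 8M. */
#define PAGE_SIZE_8XX	(4UL * 1024)
#define PGD_COVERAGE	(4UL * 1024 * 1024)
#define HUGE_8M		(8UL * 1024 * 1024)

/* PTEs needed to map one huge page as contiguous base-page entries. */
static unsigned long contig_ptes(unsigned long huge_sz)
{
	return huge_sz / PAGE_SIZE_8XX;
}

/* Page tables (one per PGD slot) those PTEs are spread across; this is
 * the "2x page table memory" Jason mentions, absent a merge mechanism. */
static unsigned long page_tables_needed(unsigned long huge_sz)
{
	return huge_sz / PGD_COVERAGE;
}
```

The write side has to fill every one of those entries (or merge the two sibling tables, as proposed), which is the performance question raised above.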