RE: [PATCH] drm/amdgpu: drop the long-double-128 powerpc check/hack
[Public] > -Original Message- > From: amd-gfx On Behalf Of > Daniel Kolesa > Sent: Wednesday, February 1, 2023 1:46 PM > To: linuxppc-...@lists.ozlabs.org > Cc: d...@danny.cz; sta...@vger.kernel.org; > tpear...@raptorengineering.com; amd-gfx@lists.freedesktop.org; > alexdeuc...@gmail.com; torva...@linux-foundation.org > Subject: [PATCH] drm/amdgpu: drop the long-double-128 powerpc > check/hack > > Commit c653c591789b ("drm/amdgpu: Re-enable DCN for 64-bit powerpc") > introduced this check as a workaround for the driver not building with > toolchains that default to 64-bit long double. > > The reason things worked on 128-bit-long-double toolchains and not > otherwise was however largely accidental. The real issue was that some files > containing floating point code were compiled without -mhard-float, while > others were compiled with -mhard-float. > > The PowerPC compilers tag object files that use long doubles with a special > ABI tag in order to differentiate 64-bit long double, IBM long double and IEEE > 128-bit long double. When no long double is used in a source file, the file > does not receive the ABI tag. > Since only regular doubles are used in the AMDGPU source, there is no ABI > tag on the soft-float object files with 128-bit-ldbl compilers, and therefore > no > error. With 64-bit long double, the double and long double types are equal, > and an ABI tag is introduced. > > Of course, this resulted in the real bug, which was mixing of hard and soft > float object files, getting hidden, which makes this check technically > incorrect. > > Since then, work has been done to ensure that all float code is separately > compiled. This was also necessary in order to enable > AArch64 support in the display stack, as AArch64 does not have any soft-float > ABI, and all code that does not explicitly conatain floats is compiled with - > mgeneral-regs-only, which prevents float-using code from being compiled at > all. That means AArch64 support will from now on always safeguard against > such cases happening ever again. > > In mainline, this work is now fully done, so this check is fully redundant and > does not do anything except preventing AMDGPU DC from being built on > systems such as those using musl libc. The last piece of work to enable this > was commit c92b7fe0d92a > ("drm/amd/display: move remaining FPU code to dml folder") and this has > since been backported to 6.1 stable (in 6.1.7). > > Relevant issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2288 > > Signed-off-by: Daniel Kolesa Acked-by: Alex Deucher > --- > arch/powerpc/Kconfig| 4 > drivers/gpu/drm/amd/display/Kconfig | 2 +- > 2 files changed, 1 insertion(+), 5 deletions(-) > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index > b8c4ac56b..267805072 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -289,10 +289,6 @@ config PPC > # Please keep this list sorted alphabetically. > # > > -config PPC_LONG_DOUBLE_128 > - depends on PPC64 && ALTIVEC > - def_bool $(success,test "$(shell,echo __LONG_DOUBLE_128__ | > $(CC) -E -P -)" = 1) > - > config PPC_BARRIER_NOSPEC > bool > default y > diff --git a/drivers/gpu/drm/amd/display/Kconfig > b/drivers/gpu/drm/amd/display/Kconfig > index 2efe93f74..94645b6ef 100644 > --- a/drivers/gpu/drm/amd/display/Kconfig > +++ b/drivers/gpu/drm/amd/display/Kconfig > @@ -8,7 +8,7 @@ config DRM_AMD_DC > depends on BROKEN || !CC_IS_CLANG || X86_64 || SPARC64 || > ARM64 > select SND_HDA_COMPONENT if SND_HDA_CORE > # !CC_IS_CLANG: > https://github.com/ClangBuiltLinux/linux/issues/1752 > - select DRM_AMD_DC_DCN if (X86 || PPC_LONG_DOUBLE_128 || > (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG)) > + select DRM_AMD_DC_DCN if (X86 || PPC64 || (ARM64 && > KERNEL_MODE_NEON > +&& !CC_IS_CLANG)) > help > Choose this option if you want to use the new display engine > support for AMDGPU. This adds required support for Vega and > -- > 2.34.1
[pull] amdgpu drm-fixes-6.2
Hi Dave, Daniel, Fixes for 6.2. The following changes since commit 6d796c50f84ca79f1722bb131799e5a5710c4700: Linux 6.2-rc6 (2023-01-29 13:59:43 -0800) are available in the Git repository at: https://gitlab.freedesktop.org/agd5f/linux.git tags/amd-drm-fixes-6.2-2023-02-01 for you to fetch changes up to 6fc547a5a2ef5ce05b16924106663ab92f8f87a7: drm/amd/display: Properly handle additional cases where DCN is not supported (2023-02-01 22:45:51 -0500) amd-drm-fixes-6.2-2023-02-01: amdgpu: - GC11 fixes - DCN 3.1.4 fixes - NBIO 4.3 fix - DCN 3.2 fixes - Properly handle additional cases where DCN is not supported - SMU13 fixes Alex Deucher (1): drm/amd/display: Properly handle additional cases where DCN is not supported Daniel Miess (2): drm/amd/display: Add missing brackets in calculation drm/amd/display: Adjust downscaling limits for dcn314 Evan Quan (1): drm/amdgpu: enable HDP SD for gfx 11.0.3 George Shen (1): drm/amd/display: Unassign does_plane_fit_in_mall function from dcn3.2 Graham Sider (1): drm/amdgpu: update wave data type to 3 for gfx11 Mario Limonciello (1): drm/amd: Fix initialization for nbio 4.3.0 Nicholas Kazlauskas (1): drm/amd/display: Reset DMUB mailbox SW state after HW reset Tim Huang (1): drm/amd/pm: drop unneeded dpm features disablement for SMU 13.0.4/11 Yiqing Yao (1): drm/amdgpu: Enable vclk dclk node for gc11.0.3 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/nbio_v4_3.c | 8 +++- drivers/gpu/drm/amd/amdgpu/soc21.c | 3 ++- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 11 +++ drivers/gpu/drm/amd/display/dc/dcn314/dcn314_resource.c| 5 +++-- drivers/gpu/drm/amd/display/dc/dcn32/dcn32_init.c | 2 +- .../drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c | 2 +- drivers/gpu/drm/amd/display/dmub/src/dmub_srv.c| 12 drivers/gpu/drm/amd/pm/amdgpu_pm.c | 6 -- drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 14 ++ 10 files changed, 57 insertions(+), 10 deletions(-)
RE: [linux-next:master] BUILD REGRESSION 66eee64b235411d512bed4d672c2d00683239daf
> Hi Neal, > > On Thu, 2 Feb 2023, at 03:17, kernel test robot wrote: > > tree/branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git > > master > > branch HEAD: 66eee64b235411d512bed4d672c2d00683239daf Add > linux-next > > specific files for 20230201 > > > > > > Unverified Error/Warning (likely false positive, please contact us if > > interested): > > > > drivers/crypto/aspeed/aspeed-acry.c:295:37: sparse: sparse: incorrect > > type in assignment (different base types) > > drivers/crypto/aspeed/aspeed-acry.c:305:28: sparse: sparse: cast > > removes address space '__iomem' of expression > > drivers/crypto/aspeed/aspeed-acry.c:606:24: sparse: sparse: symbol > > 'aspeed_acry_akcipher_algs' was not declared. Should it be static? > > Can you please look into these issues with the ACRY driver? > > Cheers, > > Andrew I just send patch to fix the first 2 warnings, and the last one warning is already fixed by another patch. [PATCH-next] crypto: aspeed: fix type warnings https://patchwork.ozlabs.org/project/linux-aspeed/patch/20230119014859.1900136-1-yangyingli...@huawei.com/ Thanks -Neal
Re: [linux-next:master] BUILD REGRESSION 66eee64b235411d512bed4d672c2d00683239daf
Hi Neal, On Thu, 2 Feb 2023, at 03:17, kernel test robot wrote: > tree/branch: > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git > master > branch HEAD: 66eee64b235411d512bed4d672c2d00683239daf Add linux-next > specific files for 20230201 > > > Unverified Error/Warning (likely false positive, please contact us if > interested): > > drivers/crypto/aspeed/aspeed-acry.c:295:37: sparse: sparse: incorrect > type in assignment (different base types) > drivers/crypto/aspeed/aspeed-acry.c:305:28: sparse: sparse: cast > removes address space '__iomem' of expression > drivers/crypto/aspeed/aspeed-acry.c:606:24: sparse: sparse: symbol > 'aspeed_acry_akcipher_algs' was not declared. Should it be static? Can you please look into these issues with the ACRY driver? Cheers, Andrew > > elapsed time: 722m > > configs tested: 69 > configs skipped: 2 > > gcc tested configs: > x86_64 rhel-8.3-func > x86_64rhel-8.3-kselftests > ia64 allmodconfig > x86_64allnoconfig > um i386_defconfig > um x86_64_defconfig > arc defconfig > i386defconfig > x86_64 defconfig > alpha defconfig > arm defconfig > mips allyesconfig > x86_64 rhel-8.3-syz > x86_64 randconfig-a001-20230130 > x86_64 rhel-8.3-kunit > x86_64 randconfig-a003-20230130 > x86_64 rhel-8.3-bpf > powerpc allnoconfig > x86_64 randconfig-a004-20230130 > x86_64 rhel-8.3 > x86_64 randconfig-a002-20230130 > powerpc allmodconfig > sh allmodconfig > x86_64 randconfig-a006-20230130 > x86_64 allyesconfig > i386 randconfig-a001 > x86_64 rhel-8.3-kvm > i386 randconfig-a003 > x86_64 randconfig-a005-20230130 > arm64allyesconfig > i386 allyesconfig > i386 randconfig-a005 > arc randconfig-r043-20230129 > i386 randconfig-c001 > arm allyesconfig > s390defconfig > s390 allmodconfig > arm randconfig-r046-20230129 > arm randconfig-r046-20230130 > s390 allyesconfig > arc randconfig-r043-20230130 > m68k allyesconfig > m68k allmodconfig > alphaallyesconfig > arc allyesconfig > > clang tested configs: > x86_64 rhel-8.3-rust > x86_64 randconfig-a012-20230130 > x86_64 randconfig-a013-20230130 > x86_64 randconfig-a011-20230130 > x86_64 randconfig-a014-20230130 > i386 randconfig-a013-20230130 > i386 randconfig-a002 > i386 randconfig-a012-20230130 > i386 randconfig-a014-20230130 > i386 randconfig-a004 > x86_64 randconfig-a015-20230130 > i386 randconfig-a015-20230130 > x86_64 randconfig-a016-20230130 > i386 randconfig-a011-20230130 > hexagon randconfig-r041-20230129 > i386 randconfig-a016-20230130 > riscvrandconfig-r042-20230129 > i386 randconfig-a006 > riscvrandconfig-r042-20230130 > hexagon randconfig-r045-20230130 > hexagon randconfig-r041-20230130 > hexagon randconfig-r045-20230129 > s390 randconfig-r044-20230129 > s390 randconfig-r044-20230130 > > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests
Re: [PATCH] drm/amd: Allow s0ix without BIOS support
On Wed, Jan 25, 2023 at 1:33 PM Mario Limonciello wrote: > > We guard the suspend entry code from running unless we have proper > BIOS support for either S3 mode or s0ix mode. > > If a user's system doesn't support either of these modes the kernel > still does offer s2idle in `/sys/power/mem_sleep` so there is an > expectation from users that it works even if the power consumption > remains very high. > > Rafael Ávila de Espíndola reports that a system of his has a > non-functional graphics stack after resuming. That system doesn't > support S3 and the FADT doesn't indicate support for low power idle. > > Through some experimentation it was concluded that even without the > hardware s0i3 support provided by the amd_pmc driver the power > consumption over suspend is decreased by running amdgpu's s0ix > suspend routine. > > The numbers over suspend showed: > * No patch: 9.2W > * Skip amdgpu suspend entirely: 10.5W > * Run amdgpu s0ix routine: 7.7W > > As this does improve the power, remove some of the guard rails in > `amdgpu_acpi.c` for only running s0ix suspend routines in the right > circumstances. > > However if this turns out to cause regressions for anyone, we should > revert this change and instead opt for skipping suspend/resume routines > entirely or try to fix the underlying behavior that makes graphics fail > after resume without underlying platform support. > > Reported-by: Rafael Ávila de Espíndola > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2364 > Signed-off-by: Mario Limonciello > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 8 ++-- > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > index 57b5e11446c65..fa7375b97fd47 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > @@ -1079,20 +1079,16 @@ bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device > *adev) > * S0ix even though the system is suspending to idle, so return false > * in that case. > */ > - if (!(acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0)) { > + if (!(acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0)) > dev_warn_once(adev->dev, > "Power consumption will be higher as BIOS has > not been configured for suspend-to-idle.\n" > "To use suspend-to-idle change the sleep mode > in BIOS setup.\n"); > - return false; Thinking about this more, I think we may need to check the asic type here. Pre-Raven APUs didn't support S0ix at all so this may break them if they have any checks that use amdgpu_acpi_is_s0ix_active() in their code paths. Alex > - } > > #if !IS_ENABLED(CONFIG_AMD_PMC) > dev_warn_once(adev->dev, > "Power consumption will be higher as the kernel has not > been compiled with CONFIG_AMD_PMC.\n"); > - return false; > -#else > - return true; > #endif /* CONFIG_AMD_PMC */ > + return true; > } > > #endif /* CONFIG_SUSPEND */ > -- > 2.25.1 >
RE: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Reviewed-by: Guchun Chen Regards, Guchun -Original Message- From: Guilherme G. Piccoli Sent: Thursday, February 2, 2023 12:48 AM To: amd-gfx@lists.freedesktop.org Cc: dri-de...@lists.freedesktop.org; Deucher, Alexander ; Koenig, Christian ; Pan, Xinhui ; ker...@gpiccoli.net; kernel-...@igalia.com; Guilherme G. Piccoli ; Chen, Guchun ; Tuikov, Luben ; Limonciello, Mario Subject: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - such function is expected to be called only after the respective init function - drm_sched_init() - was executed successfully. Happens that we faced a driver probe failure in the Steam Deck recently, and the function drm_sched_fini() was called even without its counter-part had been previously called, causing the following oops: amdgpu: probe of :04:00.0 failed with error -110 BUG: kernel NULL pointer dereference, address: 0090 PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] devm_drm_dev_init_release+0x49/0x70 [...] To prevent that, check if the drm_sched was properly initialized for a given ring before calling its fini counter-part. Notice ideally we'd use sched.ready for that; such field is set as the latest thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such field - in the above oops for example, it was a GFX ring causing the crash, and the sched.ready field was set to true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using sched.ops as per Christian's suggestion [0]. [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/ Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") Suggested-by: Christian König Cc: Guchun Chen Cc: Luben Tuikov Cc: Mario Limonciello Signed-off-by: Guilherme G. Piccoli --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index 00444203220d..3b962cb680a6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) if (!ring || !ring->fence_drv.initialized) continue; - if (!ring->no_scheduler) + /* +* Notice we check for sched.ops since there's some +* override on the meaning of sched.ready by amdgpu. +* The natural check would be sched.ready, which is +* set as drm_sched_init() finishes... +*/ + if (!ring->no_scheduler && ring->sched.ops) drm_sched_fini(>sched); for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j) -- 2.39.0
Re: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Reviewed-by: Luben Tuikov Regards, Luben On 2023-02-01 11:48, Guilherme G. Piccoli wrote: > Currently amdgpu calls drm_sched_fini() from the fence driver sw fini > routine - such function is expected to be called only after the > respective init function - drm_sched_init() - was executed successfully. > > Happens that we faced a driver probe failure in the Steam Deck > recently, and the function drm_sched_fini() was called even without > its counter-part had been previously called, causing the following oops: > > amdgpu: probe of :04:00.0 failed with error -110 > BUG: kernel NULL pointer dereference, address: 0090 > PGD 0 P4D 0 > Oops: 0002 [#1] PREEMPT SMP NOPTI > CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 > Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 > RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] > [...] > Call Trace: > > amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] > amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] > amdgpu_driver_release_kms+0x16/0x30 [amdgpu] > devm_drm_dev_init_release+0x49/0x70 > [...] > > To prevent that, check if the drm_sched was properly initialized for a > given ring before calling its fini counter-part. > > Notice ideally we'd use sched.ready for that; such field is set as the latest > thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such > field - in the above oops for example, it was a GFX ring causing the crash, > and > the sched.ready field was set to true in the ring init routine, regardless of > the state of the DRM scheduler. Hence, we ended-up using sched.ops as per > Christian's suggestion [0]. > > [0] > https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/ > > Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in > s3 test (v2)") > Suggested-by: Christian König > Cc: Guchun Chen > Cc: Luben Tuikov > Cc: Mario Limonciello > Signed-off-by: Guilherme G. Piccoli > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > index 00444203220d..3b962cb680a6 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device > *adev) > if (!ring || !ring->fence_drv.initialized) > continue; > > - if (!ring->no_scheduler) > + /* > + * Notice we check for sched.ops since there's some > + * override on the meaning of sched.ready by amdgpu. > + * The natural check would be sched.ready, which is > + * set as drm_sched_init() finishes... > + */ > + if (!ring->no_scheduler && ring->sched.ops) > drm_sched_fini(>sched); > > for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
[PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - such function is expected to be called only after the respective init function - drm_sched_init() - was executed successfully. Happens that we faced a driver probe failure in the Steam Deck recently, and the function drm_sched_fini() was called even without its counter-part had been previously called, causing the following oops: amdgpu: probe of :04:00.0 failed with error -110 BUG: kernel NULL pointer dereference, address: 0090 PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] devm_drm_dev_init_release+0x49/0x70 [...] To prevent that, check if the drm_sched was properly initialized for a given ring before calling its fini counter-part. Notice ideally we'd use sched.ready for that; such field is set as the latest thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such field - in the above oops for example, it was a GFX ring causing the crash, and the sched.ready field was set to true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using sched.ops as per Christian's suggestion [0]. [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/ Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") Suggested-by: Christian König Cc: Guchun Chen Cc: Luben Tuikov Cc: Mario Limonciello Signed-off-by: Guilherme G. Piccoli --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index 00444203220d..3b962cb680a6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev) if (!ring || !ring->fence_drv.initialized) continue; - if (!ring->no_scheduler) + /* +* Notice we check for sched.ops since there's some +* override on the meaning of sched.ready by amdgpu. +* The natural check would be sched.ready, which is +* set as drm_sched_init() finishes... +*/ + if (!ring->no_scheduler && ring->sched.ops) drm_sched_fini(>sched); for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j) -- 2.39.0
Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
On 01/02/2023 13:21, Luben Tuikov wrote: > Hi Guilherme, > > Since setting sched->ready to false, seems to be taking place in, directly > amdgpu_ring_fini() > and in amdgpu_fence_driver_sw_fini() indirectly as that function calls > drm_sched_fini() > which sets it to false, we seem to have two competing policies of, > "set ready to false to show that _fini() was called, and set to false to > disable IB submissions". > > To that effect, your patch is generally correct, as it would be the case of > an early failure > and unroll from (indirectly) amdgpu_device_init_schedulers(). > > Please resubmit your patch but using .ops as Christian suggested, as .name is > sufficient, > but .ops is necessary. > > On a side-note: in the future we should probably discern between > "this ring has an initialized and working scheduler" (looking up at DRM), from > "this ring can take on IBs to send them down to the hardware" (looking down > at hardware). > Sched->ready seems to be overloaded with these disparate states, and this is > why you need > to use .ops to guard calling drm_sched_fini(). > > Regards, > Luben Thanks a lot Luben, makes perfect sense! Also, thanks for everyone that provided feedback here, very interesting discussion. Submitted V2: https://lore.kernel.org/dri-devel/20230201164814.1353383-1-gpicc...@igalia.com/ Cheers, Guilherme
Re: [PATCH] drm/amdgpu: add more fields into device info, caches sizes, etc.
Looks good to me. WIth a link the mesa MR which uses these, the patch is: Reviewed-by: Alex Deucher For TA_CNTL2, an alternative would be to add this register to the AMDGPU_INFO_READ_MMR_REG and just cache it in the KGD if you think we may need other fields from it in the future. Alex On Mon, Jan 30, 2023 at 12:57 AM Marek Olšák wrote: > > AMDGPU_IDS_FLAGS_CONFORMANT_TRUNC_COORD: important for conformance on gfx11 > Other fields are exposed from IP discovery. > enabled_rb_pipes_mask_hi is added for future chips, currently 0. > > The patch is attached. > > Thanks, > Marek
RE: [PATCH] Revert "drm/display/dp_mst: Move all payload info into the atomic state"
[AMD Official Use Only - General] > -Original Message- > From: Greg KH > Sent: Sunday, January 29, 2023 07:32 > To: Limonciello, Mario > Cc: Linux regressions mailing list ; dri- > de...@lists.freedesktop.org; sta...@vger.kernel.org; > stanislav.lisovs...@intel.com; Zuo, Jerry ; amd- > g...@lists.freedesktop.org; Lin, Wayne ; Guenter > Roeck ; bske...@redhat.com > Subject: Re: [PATCH] Revert "drm/display/dp_mst: Move all payload info into > the atomic state" > > On Fri, Jan 27, 2023 at 03:02:41PM +, Limonciello, Mario wrote: > > [Public] > > > > > > > > > -Original Message- > > > From: Linux kernel regression tracking (Thorsten Leemhuis) > > > > > > Sent: Friday, January 27, 2023 03:15 > > > To: Greg KH ; Limonciello, Mario > > > > > > Cc: dri-de...@lists.freedesktop.org; sta...@vger.kernel.org; > > > stanislav.lisovs...@intel.com; Zuo, Jerry ; amd- > > > g...@lists.freedesktop.org; Lin, Wayne ; Guenter > > > Roeck ; bske...@redhat.com > > > Subject: Re: [PATCH] Revert "drm/display/dp_mst: Move all payload info > into > > > the atomic state" > > > > > > On 27.01.23 08:39, Greg KH wrote: > > > > On Fri, Jan 20, 2023 at 11:51:04AM -0600, Limonciello, Mario wrote: > > > >> On 1/20/2023 11:46, Guenter Roeck wrote: > > > >>> On Thu, Jan 12, 2023 at 04:50:44PM +0800, Wayne Lin wrote: > > > This reverts commit 4d07b0bc403403438d9cf88450506240c5faf92f. > > > > > > [Why] > > > Changes cause regression on amdgpu mst. > > > E.g. > > > In fill_dc_mst_payload_table_from_drm(), amdgpu expects to > > > add/remove payload > > > one by one and call fill_dc_mst_payload_table_from_drm() to > update > > > the HW > > > maintained payload table. But previous change tries to go through > all > > > the > > > payloads in mst_state and update amdpug hw maintained table in > once > > > everytime > > > driver only tries to add/remove a specific payload stream only. The > > > newly > > > design idea conflicts with the implementation in amdgpu nowadays. > > > > > > [How] > > > Revert this patch first. After addressing all regression problems > caused > > > by > > > this previous patch, will add it back and adjust it. > > > >>> > > > >>> Has there been any progress on this revert, or on fixing the > underlying > > > >>> problem ? > > > >>> > > > >>> Thanks, > > > >>> Guenter > > > >> > > > >> Hi Guenter, > > > >> > > > >> Wayne is OOO for CNY, but let me update you. > > > >> > > > >> Harry has sent out this series which is a collection of proper fixes. > > > >> https://patchwork.freedesktop.org/series/113125/ > > > >> > > > >> Once that's reviewed and accepted, 4 of them are applicable for 6.1. > > > > > > > > Any hint on when those will be reviewed and accepted? patchwork > > > doesn't > > > > show any activity on them, or at least I can't figure it out... > > > > > > I didn't look closer (hence please correct me if I'm wrong), but the > > > core changes afaics are in the DRM pull airlied send a few hours ago to > > > Linus (note the "amdgpu […] DP MST fixes" line): > > > > > > > https://lore.kernel.org/all/CAPM%3D9tzuu4xnx6T5v7sKsK%2BA5HEaPOc1ie > > > myznsyqzgztj%3d...@mail.gmail.com/ > > > > That's right. There are 4 commits in that PR with the appropriate stable > > tags > > that should fix the majority of the MST issues introduced in 6.1 by > 4d07b0bc40340 > > ("drm/display/dp_mst: Move all payload info into the atomic state"): > > > > drm/amdgpu/display/mst: Fix mst_state->pbn_div and slot count > assignments > > drm/amdgpu/display/mst: limit payload to be updated one by one > > drm/amdgpu/display/mst: update mst_mgr relevant variable when long > HPD > > drm/display/dp_mst: Correct the kref of port. > > > > There will be follow ups for any remaining corner cases. > > Great, thanks for this, all are now queued up in the 6.1.y queue. > > greg k-h Greg, My apologies if this has been covered elsewhere and I missed it but I was wondering if there was a decision made for whether 6.1.y will be an LTS kernel release or not?
[linux-next:master] BUILD REGRESSION 66eee64b235411d512bed4d672c2d00683239daf
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master branch HEAD: 66eee64b235411d512bed4d672c2d00683239daf Add linux-next specific files for 20230201 Error/Warning reports: https://lore.kernel.org/oe-kbuild-all/202301300743.bp7dpazv-...@intel.com https://lore.kernel.org/oe-kbuild-all/202301301801.y5o08tqx-...@intel.com https://lore.kernel.org/oe-kbuild-all/202301302110.metnwkbd-...@intel.com https://lore.kernel.org/oe-kbuild-all/202301310939.tagcoezb-...@intel.com https://lore.kernel.org/oe-kbuild-all/202302011836.ka3bxqdy-...@intel.com Error/Warning: (recently discovered and may have been fixed) Documentation/riscv/uabi.rst:24: WARNING: Enumerated list ends without a blank line; unexpected unindent. ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/fsl-edma.ko] undefined! ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/idma64.ko] undefined! arch/arm64/kvm/arm.c:2206: warning: expecting prototype for Initialize Hyp(). Prototype was for kvm_arm_init() instead drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.h:62:10: fatal error: mod_info_packet.h: No such file or directory drivers/gpu/drm/amd/amdgpu/../display/dc/link/accessories/link_dp_trace.c:148:6: warning: no previous prototype for 'link_dp_trace_set_edp_power_timestamp' [-Wmissing-prototypes] drivers/gpu/drm/amd/amdgpu/../display/dc/link/accessories/link_dp_trace.c:158:10: warning: no previous prototype for 'link_dp_trace_get_edp_poweron_timestamp' [-Wmissing-prototypes] drivers/gpu/drm/amd/amdgpu/../display/dc/link/accessories/link_dp_trace.c:163:10: warning: no previous prototype for 'link_dp_trace_get_edp_poweroff_timestamp' [-Wmissing-prototypes] drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:1295:32: warning: variable 'result_write_min_hblank' set but not used [-Wunused-but-set-variable] drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:279:42: warning: variable 'ds_port' set but not used [-Wunused-but-set-variable] drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_training.c:1585:38: warning: variable 'result' set but not used [-Wunused-but-set-variable] ftrace-ops.c:(.init.text+0x2c3): undefined reference to `__udivdi3' idma64.c:(.text+0x1696): undefined reference to `devm_platform_ioremap_resource' Unverified Error/Warning (likely false positive, please contact us if interested): drivers/crypto/aspeed/aspeed-acry.c:295:37: sparse: sparse: incorrect type in assignment (different base types) drivers/crypto/aspeed/aspeed-acry.c:305:28: sparse: sparse: cast removes address space '__iomem' of expression drivers/crypto/aspeed/aspeed-acry.c:606:24: sparse: sparse: symbol 'aspeed_acry_akcipher_algs' was not declared. Should it be static? drivers/nvmem/imx-ocotp.c:599:21: sparse: sparse: symbol 'imx_ocotp_layout' was not declared. Should it be static? drivers/nvmem/layouts/sl28vpd.c:143:21: sparse: sparse: symbol 'sl28vpd_layout' was not declared. Should it be static? drivers/of/platform.c:570:2-23: WARNING: Function "for_each_node_by_type" should have of_node_put() before break around line 575. drivers/thermal/qcom/tsens-v0_1.c:106:40: sparse: sparse: symbol 'tsens_9607_nvmem' was not declared. Should it be static? drivers/thermal/qcom/tsens-v0_1.c:26:40: sparse: sparse: symbol 'tsens_8916_nvmem' was not declared. Should it be static? drivers/thermal/qcom/tsens-v0_1.c:42:40: sparse: sparse: symbol 'tsens_8939_nvmem' was not declared. Should it be static? drivers/thermal/qcom/tsens-v0_1.c:62:40: sparse: sparse: symbol 'tsens_8974_nvmem' was not declared. Should it be static? drivers/thermal/qcom/tsens-v0_1.c:84:40: sparse: sparse: symbol 'tsens_8974_backup_nvmem' was not declared. Should it be static? drivers/thermal/qcom/tsens-v1.c:24:40: sparse: sparse: symbol 'tsens_qcs404_nvmem' was not declared. Should it be static? drivers/thermal/qcom/tsens-v1.c:45:40: sparse: sparse: symbol 'tsens_8976_nvmem' was not declared. Should it be static? Error/Warning ids grouped by kconfigs: gcc_recent_errors |-- alpha-allyesconfig | |-- drivers-gpu-drm-amd-amdgpu-..-display-dc-link-accessories-link_dp_trace.c:warning:no-previous-prototype-for-link_dp_trace_get_edp_poweroff_timestamp | |-- drivers-gpu-drm-amd-amdgpu-..-display-dc-link-accessories-link_dp_trace.c:warning:no-previous-prototype-for-link_dp_trace_get_edp_poweron_timestamp | |-- drivers-gpu-drm-amd-amdgpu-..-display-dc-link-accessories-link_dp_trace.c:warning:no-previous-prototype-for-link_dp_trace_set_edp_power_timestamp | |-- drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:variable-ds_port-set-but-not-used | |-- drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:variable-result_write_min_hblank-set-but-not-used | `-- drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_training.c:warning:va
[PATCH] drm/amdgpu: drop the long-double-128 powerpc check/hack
Commit c653c591789b ("drm/amdgpu: Re-enable DCN for 64-bit powerpc") introduced this check as a workaround for the driver not building with toolchains that default to 64-bit long double. The reason things worked on 128-bit-long-double toolchains and not otherwise was however largely accidental. The real issue was that some files containing floating point code were compiled without -mhard-float, while others were compiled with -mhard-float. The PowerPC compilers tag object files that use long doubles with a special ABI tag in order to differentiate 64-bit long double, IBM long double and IEEE 128-bit long double. When no long double is used in a source file, the file does not receive the ABI tag. Since only regular doubles are used in the AMDGPU source, there is no ABI tag on the soft-float object files with 128-bit-ldbl compilers, and therefore no error. With 64-bit long double, the double and long double types are equal, and an ABI tag is introduced. Of course, this resulted in the real bug, which was mixing of hard and soft float object files, getting hidden, which makes this check technically incorrect. Since then, work has been done to ensure that all float code is separately compiled. This was also necessary in order to enable AArch64 support in the display stack, as AArch64 does not have any soft-float ABI, and all code that does not explicitly conatain floats is compiled with -mgeneral-regs-only, which prevents float-using code from being compiled at all. That means AArch64 support will from now on always safeguard against such cases happening ever again. In mainline, this work is now fully done, so this check is fully redundant and does not do anything except preventing AMDGPU DC from being built on systems such as those using musl libc. The last piece of work to enable this was commit c92b7fe0d92a ("drm/amd/display: move remaining FPU code to dml folder") and this has since been backported to 6.1 stable (in 6.1.7). Relevant issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2288 Signed-off-by: Daniel Kolesa --- arch/powerpc/Kconfig| 4 drivers/gpu/drm/amd/display/Kconfig | 2 +- 2 files changed, 1 insertion(+), 5 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b8c4ac56b..267805072 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -289,10 +289,6 @@ config PPC # Please keep this list sorted alphabetically. # -config PPC_LONG_DOUBLE_128 - depends on PPC64 && ALTIVEC - def_bool $(success,test "$(shell,echo __LONG_DOUBLE_128__ | $(CC) -E -P -)" = 1) - config PPC_BARRIER_NOSPEC bool default y diff --git a/drivers/gpu/drm/amd/display/Kconfig b/drivers/gpu/drm/amd/display/Kconfig index 2efe93f74..94645b6ef 100644 --- a/drivers/gpu/drm/amd/display/Kconfig +++ b/drivers/gpu/drm/amd/display/Kconfig @@ -8,7 +8,7 @@ config DRM_AMD_DC depends on BROKEN || !CC_IS_CLANG || X86_64 || SPARC64 || ARM64 select SND_HDA_COMPONENT if SND_HDA_CORE # !CC_IS_CLANG: https://github.com/ClangBuiltLinux/linux/issues/1752 - select DRM_AMD_DC_DCN if (X86 || PPC_LONG_DOUBLE_128 || (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG)) + select DRM_AMD_DC_DCN if (X86 || PPC64 || (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG)) help Choose this option if you want to use the new display engine support for AMDGPU. This adds required support for Vega and -- 2.34.1
[PATCH] drm/amd/amdgpu: enable athub cg on gc 11.0.3
enable athub cg on gc 11.0.3 Signed-off-by: Kenneth Feng --- drivers/gpu/drm/amd/amdgpu/soc21.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/soc21.c b/drivers/gpu/drm/amd/amdgpu/soc21.c index 2ea0b9142868..0615fdbf0a64 100644 --- a/drivers/gpu/drm/amd/amdgpu/soc21.c +++ b/drivers/gpu/drm/amd/amdgpu/soc21.c @@ -677,7 +677,9 @@ static int soc21_common_early_init(void *handle) AMD_CG_SUPPORT_GFX_CGCG | AMD_CG_SUPPORT_GFX_CGLS | AMD_CG_SUPPORT_REPEATER_FGCG | - AMD_CG_SUPPORT_GFX_MGCG; + AMD_CG_SUPPORT_GFX_MGCG | + AMD_CG_SUPPORT_ATHUB_MGCG | + AMD_CG_SUPPORT_ATHUB_LS; adev->pg_flags = AMD_PG_SUPPORT_VCN | AMD_PG_SUPPORT_VCN_DPG | AMD_PG_SUPPORT_JPEG; -- 2.25.1
Re: [RFC PATCH] drm/amdgpu: add force_sg_display module parameter (v2)
Am 27.01.23 um 18:51 schrieb Alex Deucher: Add a module parameter to force sg (scatter/gather) display on APUs. Normally we allow displays in both VRAM and GTT, but this option forces displays into GTT so we can explicitly test more scenarios with GTT. I'm still absolutely not keen to add module parameters for internal testing. Even with tainting the kernel this is not something we should allow end users to do. Christian. v2: fix checks Signed-off-by: Alex Deucher --- For reference since the original patch was buggy. drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 12 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 +++ 3 files changed, 17 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 872450a3a164..73d0a0807138 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -244,6 +244,8 @@ extern int amdgpu_num_kcq; #define AMDGPU_VCNFW_LOG_SIZE (32 * 1024) extern int amdgpu_vcnfw_log; +extern int amdgpu_force_sg_display; + #define AMDGPU_VM_MAX_NUM_CTX 4096 #define AMDGPU_SG_THRESHOLD (256*1024*1024) #define AMDGPU_DEFAULT_GTT_SIZE_MB3072ULL /* 3GB by default */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 4cf2028b5235..c8975c2da853 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -943,6 +943,18 @@ MODULE_PARM_DESC(smu_pptable_id, "specify pptable id to be used (-1 = auto(default) value, 0 = use pptable from vbios, > 0 = soft pptable id)"); module_param_named(smu_pptable_id, amdgpu_smu_pptable_id, int, 0444); +/** + * DOC: force_sg_display (int) + * Force display buffers into GTT (scatter/gather) memory for APUs. + * This is used to force GTT only for displays rather than displaying from + * either VRAM (carve out) or GTT. + * + * Defaults to 0, or disabled. + */ +int amdgpu_force_sg_display; +MODULE_PARM_DESC(force_sg_display, "Force S/G display (0 = off (default), 1 = force display to use GTT) "); +module_param_named(force_sg_display, amdgpu_force_sg_display, int, 0444); + /* These devices are not supported by amdgpu. * They are supported by the mach64, r128, radeon drivers */ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 981010de0a28..840190dd78e4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -1516,6 +1516,9 @@ uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev, if (adev->gmc.real_vram_size <= AMDGPU_SG_THRESHOLD) domain = AMDGPU_GEM_DOMAIN_GTT; } + if ((domain == (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT)) && + amdgpu_force_sg_display && adev->mode_info.gpu_vm_support) + domain = AMDGPU_GEM_DOMAIN_GTT; return domain; }
Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Hi Guilherme, Since setting sched->ready to false, seems to be taking place in, directly amdgpu_ring_fini() and in amdgpu_fence_driver_sw_fini() indirectly as that function calls drm_sched_fini() which sets it to false, we seem to have two competing policies of, "set ready to false to show that _fini() was called, and set to false to disable IB submissions". To that effect, your patch is generally correct, as it would be the case of an early failure and unroll from (indirectly) amdgpu_device_init_schedulers(). Please resubmit your patch but using .ops as Christian suggested, as .name is sufficient, but .ops is necessary. On a side-note: in the future we should probably discern between "this ring has an initialized and working scheduler" (looking up at DRM), from "this ring can take on IBs to send them down to the hardware" (looking down at hardware). Sched->ready seems to be overloaded with these disparate states, and this is why you need to use .ops to guard calling drm_sched_fini(). Regards, Luben On 2023-02-01 09:35, Christian König wrote: > Am 01.02.23 um 15:24 schrieb Alex Deucher: >> On Wed, Feb 1, 2023 at 2:18 AM Christian König >> wrote: >>> Hi Guchun, >>> >>> no, that doesn't make any sense at all. >>> >>> The ready flag indicates that the scheduler is fully prepared for hw >>> submissions from userspace and is unrelated to the initialization >>> status. It's set to true after IB testing was successful and only set to >>> false only when a GPU reset fails and we can't get the hardware to work >>> any more. >> That might have been the original intention, but right now sched.ready >> gets set to true when we finish setting up the ring, but before we do >> ring or IB tests. > WHAT? Please not again. > > I'm really tired of fixing this over and over again, the meaning of > ring->sched.ready is to block submissions when a GPU reset fails. AND > NOTHING ELSE! > > The problem is people seem to abuse it and I have to fix it for the > fourth or fives time now. > > I'm going to send out patches, > Christian. > >> Alex >> >>> Please use sched.ops instead as I suggested before. >>> >>> Regards, >>> Christian. >>> >>> Am 31.01.23 um 14:58 schrieb Chen, Guchun: Hi Christian, Do you think if it makes sense that we can set 'ring->sched.ready' to be true in each ring init, even if before executing/setting up drm_sched_init in amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler structure. Regards, Guchun -Original Message- From: Koenig, Christian Sent: Tuesday, January 31, 2023 6:59 PM To: Chen, Guchun ; Alex Deucher ; Guilherme G. Piccoli Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui ; dri-de...@lists.freedesktop.org; Tuikov, Luben ; Limonciello, Mario ; kernel-...@igalia.com; Deucher, Alexander Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Am 31.01.23 um 10:17 schrieb Chen, Guchun: > Hi Piccoli, > > Please ignore my request of full dmesg log. I can reproduce the issue and > get the same failure callstack by returning early with an error code > prior to amdgpu_device_init_schedulers. > > Regards, > Guchun > > -Original Message- > From: Chen, Guchun > Sent: Tuesday, January 31, 2023 2:37 PM > To: Alex Deucher ; Guilherme G. Piccoli > > Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui > ; dri-de...@lists.freedesktop.org; Tuikov, Luben > ; Limonciello, Mario > ; kernel-...@igalia.com; Deucher, Alexander > ; Koenig, Christian > > Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching > drm_sched init/fini > > Hi Piccoli, > > I agree with Alex's point, using ring->sched.name for such check is not a > good way. BTW, can you please attach a full dmesg long in bad case to > help me understand more? > > Regards, > Guchun > > -Original Message- > From: Alex Deucher > Sent: Tuesday, January 31, 2023 6:30 AM > To: Guilherme G. Piccoli > Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Chen, Guchun > ; Pan, Xinhui ; > dri-de...@lists.freedesktop.org; Tuikov, Luben ; > Limonciello, Mario ; kernel-...@igalia.com; > Deucher, Alexander ; Koenig, Christian > > Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching > drm_sched init/fini > > On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli > wrote: >> + Luben >> >> (sorry, missed that in the first submission). >> >> On 30/01/2023 18:45, Guilherme G. Piccoli wrote: >>> Currently amdgpu calls drm_sched_fini() from the fence driver sw >>> fini routine - such function is expected to be called only after the >>> respective init function - drm_sched_init() - was
Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
On Wed, Feb 1, 2023 at 2:18 AM Christian König wrote: > > Hi Guchun, > > no, that doesn't make any sense at all. > > The ready flag indicates that the scheduler is fully prepared for hw > submissions from userspace and is unrelated to the initialization > status. It's set to true after IB testing was successful and only set to > false only when a GPU reset fails and we can't get the hardware to work > any more. That might have been the original intention, but right now sched.ready gets set to true when we finish setting up the ring, but before we do ring or IB tests. Alex > > Please use sched.ops instead as I suggested before. > > Regards, > Christian. > > Am 31.01.23 um 14:58 schrieb Chen, Guchun: > > Hi Christian, > > > > Do you think if it makes sense that we can set 'ring->sched.ready' to be > > true in each ring init, even if before executing/setting up drm_sched_init > > in amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler > > structure. > > > > Regards, > > Guchun > > > > -Original Message- > > From: Koenig, Christian > > Sent: Tuesday, January 31, 2023 6:59 PM > > To: Chen, Guchun ; Alex Deucher > > ; Guilherme G. Piccoli > > Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui > > ; dri-de...@lists.freedesktop.org; Tuikov, Luben > > ; Limonciello, Mario ; > > kernel-...@igalia.com; Deucher, Alexander > > Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching > > drm_sched init/fini > > > > Am 31.01.23 um 10:17 schrieb Chen, Guchun: > >> Hi Piccoli, > >> > >> Please ignore my request of full dmesg log. I can reproduce the issue and > >> get the same failure callstack by returning early with an error code prior > >> to amdgpu_device_init_schedulers. > >> > >> Regards, > >> Guchun > >> > >> -Original Message- > >> From: Chen, Guchun > >> Sent: Tuesday, January 31, 2023 2:37 PM > >> To: Alex Deucher ; Guilherme G. Piccoli > >> > >> Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui > >> ; dri-de...@lists.freedesktop.org; Tuikov, Luben > >> ; Limonciello, Mario > >> ; kernel-...@igalia.com; Deucher, Alexander > >> ; Koenig, Christian > >> > >> Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching > >> drm_sched init/fini > >> > >> Hi Piccoli, > >> > >> I agree with Alex's point, using ring->sched.name for such check is not a > >> good way. BTW, can you please attach a full dmesg long in bad case to help > >> me understand more? > >> > >> Regards, > >> Guchun > >> > >> -Original Message- > >> From: Alex Deucher > >> Sent: Tuesday, January 31, 2023 6:30 AM > >> To: Guilherme G. Piccoli > >> Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Chen, Guchun > >> ; Pan, Xinhui ; > >> dri-de...@lists.freedesktop.org; Tuikov, Luben ; > >> Limonciello, Mario ; kernel-...@igalia.com; > >> Deucher, Alexander ; Koenig, Christian > >> > >> Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching > >> drm_sched init/fini > >> > >> On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli > >> wrote: > >>> + Luben > >>> > >>> (sorry, missed that in the first submission). > >>> > >>> On 30/01/2023 18:45, Guilherme G. Piccoli wrote: > Currently amdgpu calls drm_sched_fini() from the fence driver sw > fini routine - such function is expected to be called only after the > respective init function - drm_sched_init() - was executed successfully. > > Happens that we faced a driver probe failure in the Steam Deck > recently, and the function drm_sched_fini() was called even without > its counter-part had been previously called, causing the following oops: > > amdgpu: probe of :04:00.0 failed with error -110 > BUG: kernel NULL pointer dereference, address: 0090 PGD > 0 P4D 0 > Oops: 0002 [#1] PREEMPT SMP NOPTI > CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli > #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 > RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: > > amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] > amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] > amdgpu_driver_release_kms+0x16/0x30 [amdgpu] > devm_drm_dev_init_release+0x49/0x70 > [...] > > To prevent that, check if the drm_sched was properly initialized for > a given ring before calling its fini counter-part. > > Notice ideally we'd use sched.ready for that; such field is set as > the latest thing on drm_sched_init(). But amdgpu seems to "override" > the meaning of such field - in the above oops for example, it was a > GFX ring causing the crash, and the sched.ready field was set to > true in the ring init routine, regardless of the state of the DRM > scheduler. Hence, we ended-up using another sched field. > >> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence > >>
Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Am 01.02.23 um 15:24 schrieb Alex Deucher: On Wed, Feb 1, 2023 at 2:18 AM Christian König wrote: Hi Guchun, no, that doesn't make any sense at all. The ready flag indicates that the scheduler is fully prepared for hw submissions from userspace and is unrelated to the initialization status. It's set to true after IB testing was successful and only set to false only when a GPU reset fails and we can't get the hardware to work any more. That might have been the original intention, but right now sched.ready gets set to true when we finish setting up the ring, but before we do ring or IB tests. WHAT? Please not again. I'm really tired of fixing this over and over again, the meaning of ring->sched.ready is to block submissions when a GPU reset fails. AND NOTHING ELSE! The problem is people seem to abuse it and I have to fix it for the fourth or fives time now. I'm going to send out patches, Christian. Alex Please use sched.ops instead as I suggested before. Regards, Christian. Am 31.01.23 um 14:58 schrieb Chen, Guchun: Hi Christian, Do you think if it makes sense that we can set 'ring->sched.ready' to be true in each ring init, even if before executing/setting up drm_sched_init in amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler structure. Regards, Guchun -Original Message- From: Koenig, Christian Sent: Tuesday, January 31, 2023 6:59 PM To: Chen, Guchun ; Alex Deucher ; Guilherme G. Piccoli Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui ; dri-de...@lists.freedesktop.org; Tuikov, Luben ; Limonciello, Mario ; kernel-...@igalia.com; Deucher, Alexander Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Am 31.01.23 um 10:17 schrieb Chen, Guchun: Hi Piccoli, Please ignore my request of full dmesg log. I can reproduce the issue and get the same failure callstack by returning early with an error code prior to amdgpu_device_init_schedulers. Regards, Guchun -Original Message- From: Chen, Guchun Sent: Tuesday, January 31, 2023 2:37 PM To: Alex Deucher ; Guilherme G. Piccoli Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui ; dri-de...@lists.freedesktop.org; Tuikov, Luben ; Limonciello, Mario ; kernel-...@igalia.com; Deucher, Alexander ; Koenig, Christian Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Hi Piccoli, I agree with Alex's point, using ring->sched.name for such check is not a good way. BTW, can you please attach a full dmesg long in bad case to help me understand more? Regards, Guchun -Original Message- From: Alex Deucher Sent: Tuesday, January 31, 2023 6:30 AM To: Guilherme G. Piccoli Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Chen, Guchun ; Pan, Xinhui ; dri-de...@lists.freedesktop.org; Tuikov, Luben ; Limonciello, Mario ; kernel-...@igalia.com; Deucher, Alexander ; Koenig, Christian Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli wrote: + Luben (sorry, missed that in the first submission). On 30/01/2023 18:45, Guilherme G. Piccoli wrote: Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - such function is expected to be called only after the respective init function - drm_sched_init() - was executed successfully. Happens that we faced a driver probe failure in the Steam Deck recently, and the function drm_sched_fini() was called even without its counter-part had been previously called, causing the following oops: amdgpu: probe of :04:00.0 failed with error -110 BUG: kernel NULL pointer dereference, address: 0090 PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] devm_drm_dev_init_release+0x49/0x70 [...] To prevent that, check if the drm_sched was properly initialized for a given ring before calling its fini counter-part. Notice ideally we'd use sched.ready for that; such field is set as the latest thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such field - in the above oops for example, it was a GFX ring causing the crash, and the sched.ready field was set to true in the ring init routine, regardless of the state of the DRM scheduler. Hence, we ended-up using another sched field. Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)") Cc: Andrey Grodzovsky Cc: Guchun Chen Cc: Mario Limonciello Signed-off-by: Guilherme G. Piccoli --- Hi folks, first of all thanks in advance for reviews / comments! Notice that I've used the