RE: [PATCH] drm/amdgpu: drop the long-double-128 powerpc check/hack

2023-02-01 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: amd-gfx  On Behalf Of
> Daniel Kolesa
> Sent: Wednesday, February 1, 2023 1:46 PM
> To: linuxppc-...@lists.ozlabs.org
> Cc: d...@danny.cz; sta...@vger.kernel.org;
> tpear...@raptorengineering.com; amd-gfx@lists.freedesktop.org;
> alexdeuc...@gmail.com; torva...@linux-foundation.org
> Subject: [PATCH] drm/amdgpu: drop the long-double-128 powerpc
> check/hack
> 
> Commit c653c591789b ("drm/amdgpu: Re-enable DCN for 64-bit powerpc")
> introduced this check as a workaround for the driver not building with
> toolchains that default to 64-bit long double.
> 
> The reason things worked on 128-bit-long-double toolchains and not
> otherwise was however largely accidental. The real issue was that some files
> containing floating point code were compiled without -mhard-float, while
> others were compiled with -mhard-float.
> 
> The PowerPC compilers tag object files that use long doubles with a special
> ABI tag in order to differentiate 64-bit long double, IBM long double and IEEE
> 128-bit long double. When no long double is used in a source file, the file
> does not receive the ABI tag.
> Since only regular doubles are used in the AMDGPU source, there is no ABI
> tag on the soft-float object files with 128-bit-ldbl compilers, and therefore 
> no
> error. With 64-bit long double, the double and long double types are equal,
> and an ABI tag is introduced.
> 
> Of course, this resulted in the real bug, which was mixing of hard and soft
> float object files, getting hidden, which makes this check technically
> incorrect.
> 
> Since then, work has been done to ensure that all float code is separately
> compiled. This was also necessary in order to enable
> AArch64 support in the display stack, as AArch64 does not have any soft-float
> ABI, and all code that does not explicitly conatain floats is compiled with -
> mgeneral-regs-only, which prevents float-using code from being compiled at
> all. That means AArch64 support will from now on always safeguard against
> such cases happening ever again.
> 
> In mainline, this work is now fully done, so this check is fully redundant and
> does not do anything except preventing AMDGPU DC from being built on
> systems such as those using musl libc. The last piece of work to enable this
> was commit c92b7fe0d92a
> ("drm/amd/display: move remaining FPU code to dml folder") and this has
> since been backported to 6.1 stable (in 6.1.7).
> 
> Relevant issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2288
> 
> Signed-off-by: Daniel Kolesa 

Acked-by: Alex Deucher 

> ---
>  arch/powerpc/Kconfig| 4 
>  drivers/gpu/drm/amd/display/Kconfig | 2 +-
>  2 files changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index
> b8c4ac56b..267805072 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -289,10 +289,6 @@ config PPC
>   # Please keep this list sorted alphabetically.
>   #
> 
> -config PPC_LONG_DOUBLE_128
> - depends on PPC64 && ALTIVEC
> - def_bool $(success,test "$(shell,echo __LONG_DOUBLE_128__ |
> $(CC) -E -P -)" = 1)
> -
>  config PPC_BARRIER_NOSPEC
>   bool
>   default y
> diff --git a/drivers/gpu/drm/amd/display/Kconfig
> b/drivers/gpu/drm/amd/display/Kconfig
> index 2efe93f74..94645b6ef 100644
> --- a/drivers/gpu/drm/amd/display/Kconfig
> +++ b/drivers/gpu/drm/amd/display/Kconfig
> @@ -8,7 +8,7 @@ config DRM_AMD_DC
>   depends on BROKEN || !CC_IS_CLANG || X86_64 || SPARC64 ||
> ARM64
>   select SND_HDA_COMPONENT if SND_HDA_CORE
>   # !CC_IS_CLANG:
> https://github.com/ClangBuiltLinux/linux/issues/1752
> - select DRM_AMD_DC_DCN if (X86 || PPC_LONG_DOUBLE_128 ||
> (ARM64 && KERNEL_MODE_NEON && !CC_IS_CLANG))
> + select DRM_AMD_DC_DCN if (X86 || PPC64 || (ARM64 &&
> KERNEL_MODE_NEON
> +&& !CC_IS_CLANG))
>   help
> Choose this option if you want to use the new display engine
> support for AMDGPU. This adds required support for Vega and
> --
> 2.34.1


[pull] amdgpu drm-fixes-6.2

2023-02-01 Thread Alex Deucher
Hi Dave, Daniel,

Fixes for 6.2.

The following changes since commit 6d796c50f84ca79f1722bb131799e5a5710c4700:

  Linux 6.2-rc6 (2023-01-29 13:59:43 -0800)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-6.2-2023-02-01

for you to fetch changes up to 6fc547a5a2ef5ce05b16924106663ab92f8f87a7:

  drm/amd/display: Properly handle additional cases where DCN is not supported 
(2023-02-01 22:45:51 -0500)


amd-drm-fixes-6.2-2023-02-01:

amdgpu:
- GC11 fixes
- DCN 3.1.4 fixes
- NBIO 4.3 fix
- DCN 3.2 fixes
- Properly handle additional cases where DCN is not supported
- SMU13 fixes


Alex Deucher (1):
  drm/amd/display: Properly handle additional cases where DCN is not 
supported

Daniel Miess (2):
  drm/amd/display: Add missing brackets in calculation
  drm/amd/display: Adjust downscaling limits for dcn314

Evan Quan (1):
  drm/amdgpu: enable HDP SD for gfx 11.0.3

George Shen (1):
  drm/amd/display: Unassign does_plane_fit_in_mall function from dcn3.2

Graham Sider (1):
  drm/amdgpu: update wave data type to 3 for gfx11

Mario Limonciello (1):
  drm/amd: Fix initialization for nbio 4.3.0

Nicholas Kazlauskas (1):
  drm/amd/display: Reset DMUB mailbox SW state after HW reset

Tim Huang (1):
  drm/amd/pm: drop unneeded dpm features disablement for SMU 13.0.4/11

Yiqing Yao (1):
  drm/amdgpu: Enable vclk dclk node for gc11.0.3

 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/nbio_v4_3.c |  8 +++-
 drivers/gpu/drm/amd/amdgpu/soc21.c |  3 ++-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c  | 11 +++
 drivers/gpu/drm/amd/display/dc/dcn314/dcn314_resource.c|  5 +++--
 drivers/gpu/drm/amd/display/dc/dcn32/dcn32_init.c  |  2 +-
 .../drm/amd/display/dc/dml/dcn314/display_mode_vba_314.c   |  2 +-
 drivers/gpu/drm/amd/display/dmub/src/dmub_srv.c| 12 
 drivers/gpu/drm/amd/pm/amdgpu_pm.c |  6 --
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 14 ++
 10 files changed, 57 insertions(+), 10 deletions(-)


RE: [linux-next:master] BUILD REGRESSION 66eee64b235411d512bed4d672c2d00683239daf

2023-02-01 Thread Neal Liu
> Hi Neal,
> 
> On Thu, 2 Feb 2023, at 03:17, kernel test robot wrote:
> > tree/branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> > master
> > branch HEAD: 66eee64b235411d512bed4d672c2d00683239daf  Add
> linux-next
> > specific files for 20230201
> >
> >
> > Unverified Error/Warning (likely false positive, please contact us if
> > interested):
> >
> > drivers/crypto/aspeed/aspeed-acry.c:295:37: sparse: sparse: incorrect
> > type in assignment (different base types)
> > drivers/crypto/aspeed/aspeed-acry.c:305:28: sparse: sparse: cast
> > removes address space '__iomem' of expression
> > drivers/crypto/aspeed/aspeed-acry.c:606:24: sparse: sparse: symbol
> > 'aspeed_acry_akcipher_algs' was not declared. Should it be static?
> 
> Can you please look into these issues with the ACRY driver?
> 
> Cheers,
> 
> Andrew

I just send patch to fix the first 2 warnings, and the last one warning is 
already fixed by another patch.
[PATCH-next] crypto: aspeed: fix type warnings
https://patchwork.ozlabs.org/project/linux-aspeed/patch/20230119014859.1900136-1-yangyingli...@huawei.com/
Thanks

-Neal


Re: [linux-next:master] BUILD REGRESSION 66eee64b235411d512bed4d672c2d00683239daf

2023-02-01 Thread Andrew Jeffery
Hi Neal,

On Thu, 2 Feb 2023, at 03:17, kernel test robot wrote:
> tree/branch: 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 
> master
> branch HEAD: 66eee64b235411d512bed4d672c2d00683239daf  Add linux-next 
> specific files for 20230201
>
>
> Unverified Error/Warning (likely false positive, please contact us if 
> interested):
>
> drivers/crypto/aspeed/aspeed-acry.c:295:37: sparse: sparse: incorrect 
> type in assignment (different base types)
> drivers/crypto/aspeed/aspeed-acry.c:305:28: sparse: sparse: cast 
> removes address space '__iomem' of expression
> drivers/crypto/aspeed/aspeed-acry.c:606:24: sparse: sparse: symbol 
> 'aspeed_acry_akcipher_algs' was not declared. Should it be static?

Can you please look into these issues with the ACRY driver?

Cheers,

Andrew

>
> elapsed time: 722m
>
> configs tested: 69
> configs skipped: 2
>
> gcc tested configs:
> x86_64  rhel-8.3-func
> x86_64rhel-8.3-kselftests
> ia64 allmodconfig
> x86_64allnoconfig
> um i386_defconfig
> um   x86_64_defconfig
> arc defconfig
> i386defconfig
> x86_64  defconfig
> alpha   defconfig
> arm defconfig
> mips allyesconfig
> x86_64   rhel-8.3-syz
> x86_64   randconfig-a001-20230130
> x86_64 rhel-8.3-kunit
> x86_64   randconfig-a003-20230130
> x86_64   rhel-8.3-bpf
> powerpc   allnoconfig
> x86_64   randconfig-a004-20230130
> x86_64   rhel-8.3
> x86_64   randconfig-a002-20230130
> powerpc  allmodconfig
> sh   allmodconfig
> x86_64   randconfig-a006-20230130
> x86_64   allyesconfig
> i386  randconfig-a001
> x86_64   rhel-8.3-kvm
> i386  randconfig-a003
> x86_64   randconfig-a005-20230130
> arm64allyesconfig
> i386 allyesconfig
> i386  randconfig-a005
> arc  randconfig-r043-20230129
> i386  randconfig-c001
> arm  allyesconfig
> s390defconfig
> s390 allmodconfig
> arm  randconfig-r046-20230129
> arm  randconfig-r046-20230130
> s390 allyesconfig
> arc  randconfig-r043-20230130
> m68k allyesconfig
> m68k allmodconfig
> alphaallyesconfig
> arc  allyesconfig
>
> clang tested configs:
> x86_64  rhel-8.3-rust
> x86_64   randconfig-a012-20230130
> x86_64   randconfig-a013-20230130
> x86_64   randconfig-a011-20230130
> x86_64   randconfig-a014-20230130
> i386 randconfig-a013-20230130
> i386  randconfig-a002
> i386 randconfig-a012-20230130
> i386 randconfig-a014-20230130
> i386  randconfig-a004
> x86_64   randconfig-a015-20230130
> i386 randconfig-a015-20230130
> x86_64   randconfig-a016-20230130
> i386 randconfig-a011-20230130
> hexagon  randconfig-r041-20230129
> i386 randconfig-a016-20230130
> riscvrandconfig-r042-20230129
> i386  randconfig-a006
> riscvrandconfig-r042-20230130
> hexagon  randconfig-r045-20230130
> hexagon  randconfig-r041-20230130
> hexagon  randconfig-r045-20230129
> s390 randconfig-r044-20230129
> s390 randconfig-r044-20230130
>
> -- 
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests


Re: [PATCH] drm/amd: Allow s0ix without BIOS support

2023-02-01 Thread Alex Deucher
On Wed, Jan 25, 2023 at 1:33 PM Mario Limonciello
 wrote:
>
> We guard the suspend entry code from running unless we have proper
> BIOS support for either S3 mode or s0ix mode.
>
> If a user's system doesn't support either of these modes the kernel
> still does offer s2idle in `/sys/power/mem_sleep` so there is an
> expectation from users that it works even if the power consumption
> remains very high.
>
> Rafael Ávila de Espíndola reports that a system of his has a
> non-functional graphics stack after resuming.  That system doesn't
> support S3 and the FADT doesn't indicate support for low power idle.
>
> Through some experimentation it was concluded that even without the
> hardware s0i3 support provided by the amd_pmc driver the power
> consumption over suspend is decreased by running amdgpu's s0ix
> suspend routine.
>
> The numbers over suspend showed:
> * No patch: 9.2W
> * Skip amdgpu suspend entirely: 10.5W
> * Run amdgpu s0ix routine: 7.7W
>
> As this does improve the power, remove some of the guard rails in
> `amdgpu_acpi.c` for only running s0ix suspend routines in the right
> circumstances.
>
> However if this turns out to cause regressions for anyone, we should
> revert this change and instead opt for skipping suspend/resume routines
> entirely or try to fix the underlying behavior that makes graphics fail
> after resume without underlying platform support.
>
> Reported-by: Rafael Ávila de Espíndola 
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2364
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 8 ++--
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index 57b5e11446c65..fa7375b97fd47 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -1079,20 +1079,16 @@ bool amdgpu_acpi_is_s0ix_active(struct amdgpu_device 
> *adev)
>  * S0ix even though the system is suspending to idle, so return false
>  * in that case.
>  */
> -   if (!(acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0)) {
> +   if (!(acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0))
> dev_warn_once(adev->dev,
>   "Power consumption will be higher as BIOS has 
> not been configured for suspend-to-idle.\n"
>   "To use suspend-to-idle change the sleep mode 
> in BIOS setup.\n");
> -   return false;

Thinking about this more, I think we may need to check the asic type
here.  Pre-Raven APUs didn't support S0ix at all so this may break
them if they have any checks that use amdgpu_acpi_is_s0ix_active() in
their code paths.

Alex


> -   }
>
>  #if !IS_ENABLED(CONFIG_AMD_PMC)
> dev_warn_once(adev->dev,
>   "Power consumption will be higher as the kernel has not 
> been compiled with CONFIG_AMD_PMC.\n");
> -   return false;
> -#else
> -   return true;
>  #endif /* CONFIG_AMD_PMC */
> +   return true;
>  }
>
>  #endif /* CONFIG_SUSPEND */
> --
> 2.25.1
>


RE: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Chen, Guchun
Reviewed-by: Guchun Chen 

Regards,
Guchun

-Original Message-
From: Guilherme G. Piccoli  
Sent: Thursday, February 2, 2023 12:48 AM
To: amd-gfx@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org; Deucher, Alexander 
; Koenig, Christian ; Pan, 
Xinhui ; ker...@gpiccoli.net; kernel-...@igalia.com; 
Guilherme G. Piccoli ; Chen, Guchun ; 
Tuikov, Luben ; Limonciello, Mario 

Subject: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched 
init/fini

Currently amdgpu calls drm_sched_fini() from the fence driver sw fini routine - 
such function is expected to be called only after the respective init function 
- drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure in the Steam Deck recently, and 
the function drm_sched_fini() was called even without its counter-part had been 
previously called, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090 PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338 
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace:
 
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a given 
ring before calling its fini counter-part.

Notice ideally we'd use sched.ready for that; such field is set as the latest 
thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such 
field - in the above oops for example, it was a GFX ring causing the crash, and 
the sched.ready field was set to true in the ring init routine, regardless of 
the state of the DRM scheduler. Hence, we ended-up using sched.ops as per 
Christian's suggestion [0].

[0] 
https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)")
Suggested-by: Christian König 
Cc: Guchun Chen 
Cc: Luben Tuikov 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..3b962cb680a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device 
*adev)
if (!ring || !ring->fence_drv.initialized)
continue;
 
-   if (!ring->no_scheduler)
+   /*
+* Notice we check for sched.ops since there's some
+* override on the meaning of sched.ready by amdgpu.
+* The natural check would be sched.ready, which is
+* set as drm_sched_init() finishes...
+*/
+   if (!ring->no_scheduler && ring->sched.ops)
drm_sched_fini(>sched);
 
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
--
2.39.0



Re: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Luben Tuikov
Reviewed-by: Luben Tuikov 

Regards,
Luben

On 2023-02-01 11:48, Guilherme G. Piccoli wrote:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - such function is expected to be called only after the
> respective init function - drm_sched_init() - was executed successfully.
>
> Happens that we faced a driver probe failure in the Steam Deck
> recently, and the function drm_sched_fini() was called even without
> its counter-part had been previously called, causing the following oops:
>
> amdgpu: probe of :04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>  
>  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>  devm_drm_dev_init_release+0x49/0x70
>  [...]
>
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counter-part.
>
> Notice ideally we'd use sched.ready for that; such field is set as the latest
> thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
> field - in the above oops for example, it was a GFX ring causing the crash, 
> and
> the sched.ready field was set to true in the ring init routine, regardless of
> the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
> Christian's suggestion [0].
>
> [0] 
> https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/
>
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in 
> s3 test (v2)")
> Suggested-by: Christian König 
> Cc: Guchun Chen 
> Cc: Luben Tuikov 
> Cc: Mario Limonciello 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..3b962cb680a6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device 
> *adev)
>   if (!ring || !ring->fence_drv.initialized)
>   continue;
>  
> - if (!ring->no_scheduler)
> + /*
> +  * Notice we check for sched.ops since there's some
> +  * override on the meaning of sched.ready by amdgpu.
> +  * The natural check would be sched.ready, which is
> +  * set as drm_sched_init() finishes...
> +  */
> + if (!ring->no_scheduler && ring->sched.ops)
>   drm_sched_fini(>sched);
>  
>   for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)



[PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Guilherme G. Piccoli
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure in the Steam Deck
recently, and the function drm_sched_fini() was called even without
its counter-part had been previously called, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counter-part.

Notice ideally we'd use sched.ready for that; such field is set as the latest
thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
field - in the above oops for example, it was a GFX ring causing the crash, and
the sched.ready field was set to true in the ring init routine, regardless of
the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
Christian's suggestion [0].

[0] 
https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb...@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)")
Suggested-by: Christian König 
Cc: Guchun Chen 
Cc: Luben Tuikov 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..3b962cb680a6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device 
*adev)
if (!ring || !ring->fence_drv.initialized)
continue;
 
-   if (!ring->no_scheduler)
+   /*
+* Notice we check for sched.ops since there's some
+* override on the meaning of sched.ready by amdgpu.
+* The natural check would be sched.ready, which is
+* set as drm_sched_init() finishes...
+*/
+   if (!ring->no_scheduler && ring->sched.ops)
drm_sched_fini(>sched);
 
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
-- 
2.39.0



Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Guilherme G. Piccoli
On 01/02/2023 13:21, Luben Tuikov wrote:
> Hi Guilherme,
> 
> Since setting sched->ready to false, seems to be taking place in, directly 
> amdgpu_ring_fini()
> and in amdgpu_fence_driver_sw_fini() indirectly as that function calls 
> drm_sched_fini()
> which sets it to false, we seem to have two competing policies of,
> "set ready to false to show that _fini() was called, and set to false to 
> disable IB submissions".
> 
> To that effect, your patch is generally correct, as it would be the case of 
> an early failure
> and unroll from (indirectly) amdgpu_device_init_schedulers().
> 
> Please resubmit your patch but using .ops as Christian suggested, as .name is 
> sufficient,
> but .ops is necessary.
> 
> On a side-note: in the future we should probably discern between
> "this ring has an initialized and working scheduler" (looking up at DRM), from
> "this ring can take on IBs to send them down to the hardware" (looking down 
> at hardware).
> Sched->ready seems to be overloaded with these disparate states, and this is 
> why you need
> to use .ops to guard calling drm_sched_fini().
> 
> Regards,
> Luben

Thanks a lot Luben, makes perfect sense!

Also, thanks for everyone that provided feedback here, very interesting
discussion.

Submitted V2:
https://lore.kernel.org/dri-devel/20230201164814.1353383-1-gpicc...@igalia.com/
Cheers,


Guilherme


Re: [PATCH] drm/amdgpu: add more fields into device info, caches sizes, etc.

2023-02-01 Thread Alex Deucher
Looks good to me.  WIth a link the mesa MR which uses these, the patch is:
Reviewed-by: Alex Deucher 

For TA_CNTL2, an alternative would be to add this register to the
AMDGPU_INFO_READ_MMR_REG and just cache it in the KGD if you think we
may need other fields from it in the future.

Alex

On Mon, Jan 30, 2023 at 12:57 AM Marek Olšák  wrote:
>
> AMDGPU_IDS_FLAGS_CONFORMANT_TRUNC_COORD: important for conformance on gfx11
> Other fields are exposed from IP discovery.
> enabled_rb_pipes_mask_hi is added for future chips, currently 0.
>
> The patch is attached.
>
> Thanks,
> Marek


RE: [PATCH] Revert "drm/display/dp_mst: Move all payload info into the atomic state"

2023-02-01 Thread Limonciello, Mario
[AMD Official Use Only - General]



> -Original Message-
> From: Greg KH 
> Sent: Sunday, January 29, 2023 07:32
> To: Limonciello, Mario 
> Cc: Linux regressions mailing list ; dri-
> de...@lists.freedesktop.org; sta...@vger.kernel.org;
> stanislav.lisovs...@intel.com; Zuo, Jerry ; amd-
> g...@lists.freedesktop.org; Lin, Wayne ; Guenter
> Roeck ; bske...@redhat.com
> Subject: Re: [PATCH] Revert "drm/display/dp_mst: Move all payload info into
> the atomic state"
> 
> On Fri, Jan 27, 2023 at 03:02:41PM +, Limonciello, Mario wrote:
> > [Public]
> >
> >
> >
> > > -Original Message-
> > > From: Linux kernel regression tracking (Thorsten Leemhuis)
> > > 
> > > Sent: Friday, January 27, 2023 03:15
> > > To: Greg KH ; Limonciello, Mario
> > > 
> > > Cc: dri-de...@lists.freedesktop.org; sta...@vger.kernel.org;
> > > stanislav.lisovs...@intel.com; Zuo, Jerry ; amd-
> > > g...@lists.freedesktop.org; Lin, Wayne ; Guenter
> > > Roeck ; bske...@redhat.com
> > > Subject: Re: [PATCH] Revert "drm/display/dp_mst: Move all payload info
> into
> > > the atomic state"
> > >
> > > On 27.01.23 08:39, Greg KH wrote:
> > > > On Fri, Jan 20, 2023 at 11:51:04AM -0600, Limonciello, Mario wrote:
> > > >> On 1/20/2023 11:46, Guenter Roeck wrote:
> > > >>> On Thu, Jan 12, 2023 at 04:50:44PM +0800, Wayne Lin wrote:
> > >  This reverts commit 4d07b0bc403403438d9cf88450506240c5faf92f.
> > > 
> > >  [Why]
> > >  Changes cause regression on amdgpu mst.
> > >  E.g.
> > >  In fill_dc_mst_payload_table_from_drm(), amdgpu expects to
> > > add/remove payload
> > >  one by one and call fill_dc_mst_payload_table_from_drm() to
> update
> > > the HW
> > >  maintained payload table. But previous change tries to go through
> all
> > > the
> > >  payloads in mst_state and update amdpug hw maintained table in
> once
> > > everytime
> > >  driver only tries to add/remove a specific payload stream only. The
> > > newly
> > >  design idea conflicts with the implementation in amdgpu nowadays.
> > > 
> > >  [How]
> > >  Revert this patch first. After addressing all regression problems
> caused
> > > by
> > >  this previous patch, will add it back and adjust it.
> > > >>>
> > > >>> Has there been any progress on this revert, or on fixing the
> underlying
> > > >>> problem ?
> > > >>>
> > > >>> Thanks,
> > > >>> Guenter
> > > >>
> > > >> Hi Guenter,
> > > >>
> > > >> Wayne is OOO for CNY, but let me update you.
> > > >>
> > > >> Harry has sent out this series which is a collection of proper fixes.
> > > >> https://patchwork.freedesktop.org/series/113125/
> > > >>
> > > >> Once that's reviewed and accepted, 4 of them are applicable for 6.1.
> > > >
> > > > Any hint on when those will be reviewed and accepted?  patchwork
> > > doesn't
> > > > show any activity on them, or at least I can't figure it out...
> > >
> > > I didn't look closer (hence please correct me if I'm wrong), but the
> > > core changes afaics are in the DRM pull airlied send a few hours ago to
> > > Linus (note the "amdgpu […] DP MST fixes" line):
> > >
> > >
> https://lore.kernel.org/all/CAPM%3D9tzuu4xnx6T5v7sKsK%2BA5HEaPOc1ie
> > > myznsyqzgztj%3d...@mail.gmail.com/
> >
> > That's right.  There are 4 commits in that PR with the appropriate stable 
> > tags
> > that should fix the majority of the MST issues introduced in 6.1 by
> 4d07b0bc40340
> > ("drm/display/dp_mst: Move all payload info into the atomic state"):
> >
> >   drm/amdgpu/display/mst: Fix mst_state->pbn_div and slot count
> assignments
> >   drm/amdgpu/display/mst: limit payload to be updated one by one
> >   drm/amdgpu/display/mst: update mst_mgr relevant variable when long
> HPD
> >   drm/display/dp_mst: Correct the kref of port.
> >
> > There will be follow ups for any remaining corner cases.
> 
> Great, thanks for this, all are now queued up in the 6.1.y queue.
> 
> greg k-h

Greg,

My apologies if this has been covered elsewhere and I missed it but I was
wondering if there was a decision made for whether 6.1.y will be an LTS kernel
release or not?


[linux-next:master] BUILD REGRESSION 66eee64b235411d512bed4d672c2d00683239daf

2023-02-01 Thread kernel test robot
tree/branch: 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: 66eee64b235411d512bed4d672c2d00683239daf  Add linux-next specific 
files for 20230201

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202301300743.bp7dpazv-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202301301801.y5o08tqx-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202301302110.metnwkbd-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202301310939.tagcoezb-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202302011836.ka3bxqdy-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

Documentation/riscv/uabi.rst:24: WARNING: Enumerated list ends without a blank 
line; unexpected unindent.
ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/fsl-edma.ko] 
undefined!
ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/idma64.ko] 
undefined!
arch/arm64/kvm/arm.c:2206: warning: expecting prototype for Initialize Hyp(). 
Prototype was for kvm_arm_init() instead
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.h:62:10: fatal error: 
mod_info_packet.h: No such file or directory
drivers/gpu/drm/amd/amdgpu/../display/dc/link/accessories/link_dp_trace.c:148:6:
 warning: no previous prototype for 'link_dp_trace_set_edp_power_timestamp' 
[-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/accessories/link_dp_trace.c:158:10:
 warning: no previous prototype for 'link_dp_trace_get_edp_poweron_timestamp' 
[-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/accessories/link_dp_trace.c:163:10:
 warning: no previous prototype for 'link_dp_trace_get_edp_poweroff_timestamp' 
[-Wmissing-prototypes]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:1295:32:
 warning: variable 'result_write_min_hblank' set but not used 
[-Wunused-but-set-variable]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_capability.c:279:42:
 warning: variable 'ds_port' set but not used [-Wunused-but-set-variable]
drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_training.c:1585:38:
 warning: variable 'result' set but not used [-Wunused-but-set-variable]
ftrace-ops.c:(.init.text+0x2c3): undefined reference to `__udivdi3'
idma64.c:(.text+0x1696): undefined reference to `devm_platform_ioremap_resource'

Unverified Error/Warning (likely false positive, please contact us if 
interested):

drivers/crypto/aspeed/aspeed-acry.c:295:37: sparse: sparse: incorrect type in 
assignment (different base types)
drivers/crypto/aspeed/aspeed-acry.c:305:28: sparse: sparse: cast removes 
address space '__iomem' of expression
drivers/crypto/aspeed/aspeed-acry.c:606:24: sparse: sparse: symbol 
'aspeed_acry_akcipher_algs' was not declared. Should it be static?
drivers/nvmem/imx-ocotp.c:599:21: sparse: sparse: symbol 'imx_ocotp_layout' was 
not declared. Should it be static?
drivers/nvmem/layouts/sl28vpd.c:143:21: sparse: sparse: symbol 'sl28vpd_layout' 
was not declared. Should it be static?
drivers/of/platform.c:570:2-23: WARNING: Function "for_each_node_by_type" 
should have of_node_put() before break around line 575.
drivers/thermal/qcom/tsens-v0_1.c:106:40: sparse: sparse: symbol 
'tsens_9607_nvmem' was not declared. Should it be static?
drivers/thermal/qcom/tsens-v0_1.c:26:40: sparse: sparse: symbol 
'tsens_8916_nvmem' was not declared. Should it be static?
drivers/thermal/qcom/tsens-v0_1.c:42:40: sparse: sparse: symbol 
'tsens_8939_nvmem' was not declared. Should it be static?
drivers/thermal/qcom/tsens-v0_1.c:62:40: sparse: sparse: symbol 
'tsens_8974_nvmem' was not declared. Should it be static?
drivers/thermal/qcom/tsens-v0_1.c:84:40: sparse: sparse: symbol 
'tsens_8974_backup_nvmem' was not declared. Should it be static?
drivers/thermal/qcom/tsens-v1.c:24:40: sparse: sparse: symbol 
'tsens_qcs404_nvmem' was not declared. Should it be static?
drivers/thermal/qcom/tsens-v1.c:45:40: sparse: sparse: symbol 
'tsens_8976_nvmem' was not declared. Should it be static?

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
|-- alpha-allyesconfig
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-accessories-link_dp_trace.c:warning:no-previous-prototype-for-link_dp_trace_get_edp_poweroff_timestamp
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-accessories-link_dp_trace.c:warning:no-previous-prototype-for-link_dp_trace_get_edp_poweron_timestamp
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-accessories-link_dp_trace.c:warning:no-previous-prototype-for-link_dp_trace_set_edp_power_timestamp
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:variable-ds_port-set-but-not-used
|   |-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_capability.c:warning:variable-result_write_min_hblank-set-but-not-used
|   `-- 
drivers-gpu-drm-amd-amdgpu-..-display-dc-link-protocols-link_dp_training.c:warning:va

[PATCH] drm/amdgpu: drop the long-double-128 powerpc check/hack

2023-02-01 Thread Daniel Kolesa
Commit c653c591789b ("drm/amdgpu: Re-enable DCN for 64-bit powerpc")
introduced this check as a workaround for the driver not building
with toolchains that default to 64-bit long double.

The reason things worked on 128-bit-long-double toolchains and
not otherwise was however largely accidental. The real issue was
that some files containing floating point code were compiled
without -mhard-float, while others were compiled with -mhard-float.

The PowerPC compilers tag object files that use long doubles with
a special ABI tag in order to differentiate 64-bit long double,
IBM long double and IEEE 128-bit long double. When no long double
is used in a source file, the file does not receive the ABI tag.
Since only regular doubles are used in the AMDGPU source, there
is no ABI tag on the soft-float object files with 128-bit-ldbl
compilers, and therefore no error. With 64-bit long double,
the double and long double types are equal, and an ABI tag is
introduced.

Of course, this resulted in the real bug, which was mixing of
hard and soft float object files, getting hidden, which makes
this check technically incorrect.

Since then, work has been done to ensure that all float code is
separately compiled. This was also necessary in order to enable
AArch64 support in the display stack, as AArch64 does not have any
soft-float ABI, and all code that does not explicitly conatain
floats is compiled with -mgeneral-regs-only, which prevents
float-using code from being compiled at all. That means AArch64
support will from now on always safeguard against such cases
happening ever again.

In mainline, this work is now fully done, so this check is fully
redundant and does not do anything except preventing AMDGPU DC
from being built on systems such as those using musl libc. The
last piece of work to enable this was commit c92b7fe0d92a
("drm/amd/display: move remaining FPU code to dml folder")
and this has since been backported to 6.1 stable (in 6.1.7).

Relevant issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2288

Signed-off-by: Daniel Kolesa 
---
 arch/powerpc/Kconfig| 4 
 drivers/gpu/drm/amd/display/Kconfig | 2 +-
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b8c4ac56b..267805072 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -289,10 +289,6 @@ config PPC
# Please keep this list sorted alphabetically.
#
 
-config PPC_LONG_DOUBLE_128
-   depends on PPC64 && ALTIVEC
-   def_bool $(success,test "$(shell,echo __LONG_DOUBLE_128__ | $(CC) -E -P 
-)" = 1)
-
 config PPC_BARRIER_NOSPEC
bool
default y
diff --git a/drivers/gpu/drm/amd/display/Kconfig 
b/drivers/gpu/drm/amd/display/Kconfig
index 2efe93f74..94645b6ef 100644
--- a/drivers/gpu/drm/amd/display/Kconfig
+++ b/drivers/gpu/drm/amd/display/Kconfig
@@ -8,7 +8,7 @@ config DRM_AMD_DC
depends on BROKEN || !CC_IS_CLANG || X86_64 || SPARC64 || ARM64
select SND_HDA_COMPONENT if SND_HDA_CORE
# !CC_IS_CLANG: https://github.com/ClangBuiltLinux/linux/issues/1752
-   select DRM_AMD_DC_DCN if (X86 || PPC_LONG_DOUBLE_128 || (ARM64 && 
KERNEL_MODE_NEON && !CC_IS_CLANG))
+   select DRM_AMD_DC_DCN if (X86 || PPC64 || (ARM64 && KERNEL_MODE_NEON && 
!CC_IS_CLANG))
help
  Choose this option if you want to use the new display engine
  support for AMDGPU. This adds required support for Vega and
-- 
2.34.1


[PATCH] drm/amd/amdgpu: enable athub cg on gc 11.0.3

2023-02-01 Thread Kenneth Feng
enable athub cg on gc 11.0.3

Signed-off-by: Kenneth Feng 
---
 drivers/gpu/drm/amd/amdgpu/soc21.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/soc21.c 
b/drivers/gpu/drm/amd/amdgpu/soc21.c
index 2ea0b9142868..0615fdbf0a64 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc21.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc21.c
@@ -677,7 +677,9 @@ static int soc21_common_early_init(void *handle)
AMD_CG_SUPPORT_GFX_CGCG |
AMD_CG_SUPPORT_GFX_CGLS |
AMD_CG_SUPPORT_REPEATER_FGCG |
-   AMD_CG_SUPPORT_GFX_MGCG;
+   AMD_CG_SUPPORT_GFX_MGCG |
+   AMD_CG_SUPPORT_ATHUB_MGCG |
+   AMD_CG_SUPPORT_ATHUB_LS;
adev->pg_flags = AMD_PG_SUPPORT_VCN |
AMD_PG_SUPPORT_VCN_DPG |
AMD_PG_SUPPORT_JPEG;
-- 
2.25.1



Re: [RFC PATCH] drm/amdgpu: add force_sg_display module parameter (v2)

2023-02-01 Thread Christian König

Am 27.01.23 um 18:51 schrieb Alex Deucher:

Add a module parameter to force sg (scatter/gather) display
on APUs.  Normally we allow displays in both VRAM and GTT,
but this option forces displays into GTT so we can explicitly
test more scenarios with GTT.


I'm still absolutely not keen to add module parameters for internal testing.

Even with tainting the kernel this is not something we should allow end 
users to do.


Christian.



v2: fix checks

Signed-off-by: Alex Deucher 
---

For reference since the original patch was buggy.

  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 12 
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  3 +++
  3 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 872450a3a164..73d0a0807138 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -244,6 +244,8 @@ extern int amdgpu_num_kcq;
  #define AMDGPU_VCNFW_LOG_SIZE (32 * 1024)
  extern int amdgpu_vcnfw_log;
  
+extern int amdgpu_force_sg_display;

+
  #define AMDGPU_VM_MAX_NUM_CTX 4096
  #define AMDGPU_SG_THRESHOLD   (256*1024*1024)
  #define AMDGPU_DEFAULT_GTT_SIZE_MB3072ULL /* 3GB by default */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 4cf2028b5235..c8975c2da853 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -943,6 +943,18 @@ MODULE_PARM_DESC(smu_pptable_id,
"specify pptable id to be used (-1 = auto(default) value, 0 = use pptable from 
vbios, > 0 = soft pptable id)");
  module_param_named(smu_pptable_id, amdgpu_smu_pptable_id, int, 0444);
  
+/**

+ * DOC: force_sg_display (int)
+ * Force display buffers into GTT (scatter/gather) memory for APUs.
+ * This is used to force GTT only for displays rather than displaying from
+ * either VRAM (carve out) or GTT.
+ *
+ * Defaults to 0, or disabled.
+ */
+int amdgpu_force_sg_display;
+MODULE_PARM_DESC(force_sg_display, "Force S/G display (0 = off (default), 1 = force 
display to use GTT) ");
+module_param_named(force_sg_display, amdgpu_force_sg_display, int, 0444);
+
  /* These devices are not supported by amdgpu.
   * They are supported by the mach64, r128, radeon drivers
   */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 981010de0a28..840190dd78e4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1516,6 +1516,9 @@ uint32_t amdgpu_bo_get_preferred_domain(struct 
amdgpu_device *adev,
if (adev->gmc.real_vram_size <= AMDGPU_SG_THRESHOLD)
domain = AMDGPU_GEM_DOMAIN_GTT;
}
+   if ((domain == (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT)) &&
+   amdgpu_force_sg_display && adev->mode_info.gpu_vm_support)
+   domain = AMDGPU_GEM_DOMAIN_GTT;
return domain;
  }
  




Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Luben Tuikov
Hi Guilherme,

Since setting sched->ready to false, seems to be taking place in, directly 
amdgpu_ring_fini()
and in amdgpu_fence_driver_sw_fini() indirectly as that function calls 
drm_sched_fini()
which sets it to false, we seem to have two competing policies of,
"set ready to false to show that _fini() was called, and set to false to 
disable IB submissions".

To that effect, your patch is generally correct, as it would be the case of an 
early failure
and unroll from (indirectly) amdgpu_device_init_schedulers().

Please resubmit your patch but using .ops as Christian suggested, as .name is 
sufficient,
but .ops is necessary.

On a side-note: in the future we should probably discern between
"this ring has an initialized and working scheduler" (looking up at DRM), from
"this ring can take on IBs to send them down to the hardware" (looking down at 
hardware).
Sched->ready seems to be overloaded with these disparate states, and this is 
why you need
to use .ops to guard calling drm_sched_fini().

Regards,
Luben

On 2023-02-01 09:35, Christian König wrote:
> Am 01.02.23 um 15:24 schrieb Alex Deucher:
>> On Wed, Feb 1, 2023 at 2:18 AM Christian König
>>  wrote:
>>> Hi Guchun,
>>>
>>> no, that doesn't make any sense at all.
>>>
>>> The ready flag indicates that the scheduler is fully prepared for hw
>>> submissions from userspace and is unrelated to the initialization
>>> status. It's set to true after IB testing was successful and only set to
>>> false only when a GPU reset fails and we can't get the hardware to work
>>> any more.
>> That might have been the original intention, but right now sched.ready
>> gets set to true when we finish setting up the ring, but before we do
>> ring or IB tests.
> WHAT? Please not again.
>
> I'm really tired of fixing this over and over again, the meaning of 
> ring->sched.ready is to block submissions when a GPU reset fails. AND 
> NOTHING ELSE!
>
> The problem is people seem to abuse it and I have to fix it for the 
> fourth or fives time now.
>
> I'm going to send out patches,
> Christian.
>
>> Alex
>>
>>> Please use sched.ops instead as I suggested before.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 31.01.23 um 14:58 schrieb Chen, Guchun:
 Hi Christian,

 Do you think if it makes sense that we can set 'ring->sched.ready' to be 
 true in each ring init, even if before executing/setting up drm_sched_init 
 in amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler 
 structure.

 Regards,
 Guchun

 -Original Message-
 From: Koenig, Christian 
 Sent: Tuesday, January 31, 2023 6:59 PM
 To: Chen, Guchun ; Alex Deucher 
 ; Guilherme G. Piccoli 
 Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui 
 ; dri-de...@lists.freedesktop.org; Tuikov, Luben 
 ; Limonciello, Mario ; 
 kernel-...@igalia.com; Deucher, Alexander 
 Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching 
 drm_sched init/fini

 Am 31.01.23 um 10:17 schrieb Chen, Guchun:
> Hi Piccoli,
>
> Please ignore my request of full dmesg log. I can reproduce the issue and 
> get the same failure callstack by returning early with an error code 
> prior to amdgpu_device_init_schedulers.
>
> Regards,
> Guchun
>
> -Original Message-
> From: Chen, Guchun
> Sent: Tuesday, January 31, 2023 2:37 PM
> To: Alex Deucher ; Guilherme G. Piccoli
> 
> Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui
> ; dri-de...@lists.freedesktop.org; Tuikov, Luben
> ; Limonciello, Mario
> ; kernel-...@igalia.com; Deucher, Alexander
> ; Koenig, Christian
> 
> Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
> drm_sched init/fini
>
> Hi Piccoli,
>
> I agree with Alex's point, using ring->sched.name for such check is not a 
> good way. BTW, can you please attach a full dmesg long in bad case to 
> help me understand more?
>
> Regards,
> Guchun
>
> -Original Message-
> From: Alex Deucher 
> Sent: Tuesday, January 31, 2023 6:30 AM
> To: Guilherme G. Piccoli 
> Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Chen, Guchun
> ; Pan, Xinhui ;
> dri-de...@lists.freedesktop.org; Tuikov, Luben ;
> Limonciello, Mario ; kernel-...@igalia.com;
> Deucher, Alexander ; Koenig, Christian
> 
> Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
> drm_sched init/fini
>
> On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli 
>  wrote:
>> + Luben
>>
>> (sorry, missed that in the first submission).
>>
>> On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
>>> Currently amdgpu calls drm_sched_fini() from the fence driver sw
>>> fini routine - such function is expected to be called only after the
>>> respective init function - drm_sched_init() - was 

Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Alex Deucher
On Wed, Feb 1, 2023 at 2:18 AM Christian König
 wrote:
>
> Hi Guchun,
>
> no, that doesn't make any sense at all.
>
> The ready flag indicates that the scheduler is fully prepared for hw
> submissions from userspace and is unrelated to the initialization
> status. It's set to true after IB testing was successful and only set to
> false only when a GPU reset fails and we can't get the hardware to work
> any more.

That might have been the original intention, but right now sched.ready
gets set to true when we finish setting up the ring, but before we do
ring or IB tests.

Alex

>
> Please use sched.ops instead as I suggested before.
>
> Regards,
> Christian.
>
> Am 31.01.23 um 14:58 schrieb Chen, Guchun:
> > Hi Christian,
> >
> > Do you think if it makes sense that we can set 'ring->sched.ready' to be 
> > true in each ring init, even if before executing/setting up drm_sched_init 
> > in amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler 
> > structure.
> >
> > Regards,
> > Guchun
> >
> > -Original Message-
> > From: Koenig, Christian 
> > Sent: Tuesday, January 31, 2023 6:59 PM
> > To: Chen, Guchun ; Alex Deucher 
> > ; Guilherme G. Piccoli 
> > Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui 
> > ; dri-de...@lists.freedesktop.org; Tuikov, Luben 
> > ; Limonciello, Mario ; 
> > kernel-...@igalia.com; Deucher, Alexander 
> > Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching 
> > drm_sched init/fini
> >
> > Am 31.01.23 um 10:17 schrieb Chen, Guchun:
> >> Hi Piccoli,
> >>
> >> Please ignore my request of full dmesg log. I can reproduce the issue and 
> >> get the same failure callstack by returning early with an error code prior 
> >> to amdgpu_device_init_schedulers.
> >>
> >> Regards,
> >> Guchun
> >>
> >> -Original Message-
> >> From: Chen, Guchun
> >> Sent: Tuesday, January 31, 2023 2:37 PM
> >> To: Alex Deucher ; Guilherme G. Piccoli
> >> 
> >> Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui
> >> ; dri-de...@lists.freedesktop.org; Tuikov, Luben
> >> ; Limonciello, Mario
> >> ; kernel-...@igalia.com; Deucher, Alexander
> >> ; Koenig, Christian
> >> 
> >> Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
> >> drm_sched init/fini
> >>
> >> Hi Piccoli,
> >>
> >> I agree with Alex's point, using ring->sched.name for such check is not a 
> >> good way. BTW, can you please attach a full dmesg long in bad case to help 
> >> me understand more?
> >>
> >> Regards,
> >> Guchun
> >>
> >> -Original Message-
> >> From: Alex Deucher 
> >> Sent: Tuesday, January 31, 2023 6:30 AM
> >> To: Guilherme G. Piccoli 
> >> Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Chen, Guchun
> >> ; Pan, Xinhui ;
> >> dri-de...@lists.freedesktop.org; Tuikov, Luben ;
> >> Limonciello, Mario ; kernel-...@igalia.com;
> >> Deucher, Alexander ; Koenig, Christian
> >> 
> >> Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
> >> drm_sched init/fini
> >>
> >> On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli  
> >> wrote:
> >>> + Luben
> >>>
> >>> (sorry, missed that in the first submission).
> >>>
> >>> On 30/01/2023 18:45, Guilherme G. Piccoli wrote:
>  Currently amdgpu calls drm_sched_fini() from the fence driver sw
>  fini routine - such function is expected to be called only after the
>  respective init function - drm_sched_init() - was executed successfully.
> 
>  Happens that we faced a driver probe failure in the Steam Deck
>  recently, and the function drm_sched_fini() was called even without
>  its counter-part had been previously called, causing the following oops:
> 
>  amdgpu: probe of :04:00.0 failed with error -110
>  BUG: kernel NULL pointer dereference, address: 0090 PGD
>  0 P4D 0
>  Oops: 0002 [#1] PREEMPT SMP NOPTI
>  CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli
>  #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
>  RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace:
> 
> amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
> amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
> amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> devm_drm_dev_init_release+0x49/0x70
> [...]
> 
>  To prevent that, check if the drm_sched was properly initialized for
>  a given ring before calling its fini counter-part.
> 
>  Notice ideally we'd use sched.ready for that; such field is set as
>  the latest thing on drm_sched_init(). But amdgpu seems to "override"
>  the meaning of such field - in the above oops for example, it was a
>  GFX ring causing the crash, and the sched.ready field was set to
>  true in the ring init routine, regardless of the state of the DRM 
>  scheduler. Hence, we ended-up using another sched field.
> >> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence
> >> 

Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini

2023-02-01 Thread Christian König

Am 01.02.23 um 15:24 schrieb Alex Deucher:

On Wed, Feb 1, 2023 at 2:18 AM Christian König
 wrote:

Hi Guchun,

no, that doesn't make any sense at all.

The ready flag indicates that the scheduler is fully prepared for hw
submissions from userspace and is unrelated to the initialization
status. It's set to true after IB testing was successful and only set to
false only when a GPU reset fails and we can't get the hardware to work
any more.

That might have been the original intention, but right now sched.ready
gets set to true when we finish setting up the ring, but before we do
ring or IB tests.


WHAT? Please not again.

I'm really tired of fixing this over and over again, the meaning of 
ring->sched.ready is to block submissions when a GPU reset fails. AND 
NOTHING ELSE!


The problem is people seem to abuse it and I have to fix it for the 
fourth or fives time now.


I'm going to send out patches,
Christian.



Alex


Please use sched.ops instead as I suggested before.

Regards,
Christian.

Am 31.01.23 um 14:58 schrieb Chen, Guchun:

Hi Christian,

Do you think if it makes sense that we can set 'ring->sched.ready' to be true 
in each ring init, even if before executing/setting up drm_sched_init in 
amdgpu_device_init_schedulers? As 'ready' is a member of gpu scheduler structure.

Regards,
Guchun

-Original Message-
From: Koenig, Christian 
Sent: Tuesday, January 31, 2023 6:59 PM
To: Chen, Guchun ; Alex Deucher ; 
Guilherme G. Piccoli 
Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui ; 
dri-de...@lists.freedesktop.org; Tuikov, Luben ; Limonciello, Mario 
; kernel-...@igalia.com; Deucher, Alexander 

Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching drm_sched 
init/fini

Am 31.01.23 um 10:17 schrieb Chen, Guchun:

Hi Piccoli,

Please ignore my request of full dmesg log. I can reproduce the issue and get 
the same failure callstack by returning early with an error code prior to 
amdgpu_device_init_schedulers.

Regards,
Guchun

-Original Message-
From: Chen, Guchun
Sent: Tuesday, January 31, 2023 2:37 PM
To: Alex Deucher ; Guilherme G. Piccoli

Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui
; dri-de...@lists.freedesktop.org; Tuikov, Luben
; Limonciello, Mario
; kernel-...@igalia.com; Deucher, Alexander
; Koenig, Christian

Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
drm_sched init/fini

Hi Piccoli,

I agree with Alex's point, using ring->sched.name for such check is not a good 
way. BTW, can you please attach a full dmesg long in bad case to help me 
understand more?

Regards,
Guchun

-Original Message-
From: Alex Deucher 
Sent: Tuesday, January 31, 2023 6:30 AM
To: Guilherme G. Piccoli 
Cc: amd-gfx@lists.freedesktop.org; ker...@gpiccoli.net; Chen, Guchun
; Pan, Xinhui ;
dri-de...@lists.freedesktop.org; Tuikov, Luben ;
Limonciello, Mario ; kernel-...@igalia.com;
Deucher, Alexander ; Koenig, Christian

Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching
drm_sched init/fini

On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli  
wrote:

+ Luben

(sorry, missed that in the first submission).

On 30/01/2023 18:45, Guilherme G. Piccoli wrote:

Currently amdgpu calls drm_sched_fini() from the fence driver sw
fini routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure in the Steam Deck
recently, and the function drm_sched_fini() was called even without
its counter-part had been previously called, causing the following oops:

amdgpu: probe of :04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0090 PGD
0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli
#338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace:

amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
devm_drm_dev_init_release+0x49/0x70
[...]

To prevent that, check if the drm_sched was properly initialized for
a given ring before calling its fini counter-part.

Notice ideally we'd use sched.ready for that; such field is set as
the latest thing on drm_sched_init(). But amdgpu seems to "override"
the meaning of such field - in the above oops for example, it was a
GFX ring causing the crash, and the sched.ready field was set to
true in the ring init routine, regardless of the state of the DRM scheduler. 
Hence, we ended-up using another sched field.

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence
driver fini in s3 test (v2)")

Cc: Andrey Grodzovsky 
Cc: Guchun Chen 
Cc: Mario Limonciello 
Signed-off-by: Guilherme G. Piccoli 
---


Hi folks, first of all thanks in advance for reviews / comments!
Notice that I've used the