[PATCH v2] drm/amd/display: Implement bounds check for stream encoder creation in DCN301

2024-02-05 Thread Srinivasan Shanmugam
'stream_enc_regs' array is an array of dcn10_stream_enc_registers
structures. The array is initialized with four elements, corresponding
to the four calls to stream_enc_regs() in the array initializer. This
means that valid indices for this array are 0, 1, 2, and 3.

The error message 'stream_enc_regs' 4 <= 5 below, is indicating that
there is an attempt to access this array with an index of 5, which is
out of bounds. This could lead to undefined behavior

Here, eng_id is used as an index to access the stream_enc_regs array. If
eng_id is 5, this would result in an out-of-bounds access on the
stream_enc_regs array.

Thus fixing Buffer overflow error in dcn301_stream_encoder_create
reported by Smatch:
drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn301/dcn301_resource.c:1011 
dcn301_stream_encoder_create() error: buffer overflow 'stream_enc_regs' 4 <= 5

Fixes: 3a83e4e64bb1 ("drm/amd/display: Add dcn3.01 support to DC (v2)")
Cc: Roman Li 
Cc: Rodrigo Siqueira 
Cc: Aurabindo Pillai 
Signed-off-by: Srinivasan Shanmugam 
---
 .../drm/amd/display/dc/resource/dcn301/dcn301_resource.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c 
b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c
index 511ff6b5b985..4a475a723191 100644
--- a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c
+++ b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c
@@ -999,7 +999,7 @@ static struct stream_encoder 
*dcn301_stream_encoder_create(enum engine_id eng_id
vpg = dcn301_vpg_create(ctx, vpg_inst);
afmt = dcn301_afmt_create(ctx, afmt_inst);
 
-   if (!enc1 || !vpg || !afmt) {
+   if (!enc1 || !vpg || !afmt || eng_id >= ARRAY_SIZE(stream_enc_regs)) {
kfree(enc1);
kfree(vpg);
kfree(afmt);
@@ -1007,10 +1007,9 @@ static struct stream_encoder 
*dcn301_stream_encoder_create(enum engine_id eng_id
}
 
dcn30_dio_stream_encoder_construct(enc1, ctx, ctx->dc_bios,
-   eng_id, vpg, afmt,
-   _enc_regs[eng_id],
-   _shift, _mask);
-
+  eng_id, vpg, afmt,
+  _enc_regs[eng_id],
+  _shift, _mask);
return >base;
 }
 
-- 
2.34.1



[PATCH] drm/amd/display: Increase frame-larger-than for all display_mode_vba files

2024-02-05 Thread Nathan Chancellor
After a recent change in LLVM, allmodconfig (which has CONFIG_KCSAN=y
and CONFIG_WERROR=y enabled) has a few new instances of
-Wframe-larger-than for the mode support and system configuration
functions:

  
drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20v2.c:3393:6:
 error: stack frame size (2144) exceeds limit (2048) in 
'dml20v2_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
   3393 | void dml20v2_ModeSupportAndSystemConfigurationFull(struct 
display_mode_lib *mode_lib)
|  ^
  1 error generated.

  
drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_mode_vba_21.c:3520:6:
 error: stack frame size (2192) exceeds limit (2048) in 
'dml21_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
   3520 | void dml21_ModeSupportAndSystemConfigurationFull(struct 
display_mode_lib *mode_lib)
|  ^
  1 error generated.

  
drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:3286:6:
 error: stack frame size (2128) exceeds limit (2048) in 
'dml20_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than]
   3286 | void dml20_ModeSupportAndSystemConfigurationFull(struct 
display_mode_lib *mode_lib)
|  ^
  1 error generated.

Without the sanitizers enabled, there are no warnings.

This was the catalyst for commit 6740ec97bcdb ("drm/amd/display:
Increase frame warning limit with KASAN or KCSAN in dml2") and that same
change was made to dml in commit 5b750b22530f ("drm/amd/display:
Increase frame warning limit with KASAN or KCSAN in dml") but the
frame_warn_flag variable was not applied to all files. Do so now to
clear up the warnings and make all these files consistent.

Cc: sta...@vger.kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issue/1990
Signed-off-by: Nathan Chancellor 
---
 drivers/gpu/drm/amd/display/dc/dml/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile 
b/drivers/gpu/drm/amd/display/dc/dml/Makefile
index 6042a5a6a44f..59ade76ffb18 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile
@@ -72,11 +72,11 @@ CFLAGS_$(AMDDALPATH)/dc/dml/display_mode_lib.o := 
$(dml_ccflags)
 CFLAGS_$(AMDDALPATH)/dc/dml/display_mode_vba.o := $(dml_ccflags)
 CFLAGS_$(AMDDALPATH)/dc/dml/dcn10/dcn10_fpu.o := $(dml_ccflags)
 CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/dcn20_fpu.o := $(dml_ccflags)
-CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20.o := $(dml_ccflags)
+CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20.o := $(dml_ccflags) 
$(frame_warn_flag)
 CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20.o := $(dml_ccflags)
-CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20v2.o := $(dml_ccflags)
+CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20v2.o := $(dml_ccflags) 
$(frame_warn_flag)
 CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20v2.o := $(dml_ccflags)
-CFLAGS_$(AMDDALPATH)/dc/dml/dcn21/display_mode_vba_21.o := $(dml_ccflags)
+CFLAGS_$(AMDDALPATH)/dc/dml/dcn21/display_mode_vba_21.o := $(dml_ccflags) 
$(frame_warn_flag)
 CFLAGS_$(AMDDALPATH)/dc/dml/dcn21/display_rq_dlg_calc_21.o := $(dml_ccflags)
 CFLAGS_$(AMDDALPATH)/dc/dml/dcn30/display_mode_vba_30.o := $(dml_ccflags) 
$(frame_warn_flag)
 CFLAGS_$(AMDDALPATH)/dc/dml/dcn30/display_rq_dlg_calc_30.o := $(dml_ccflags)

---
base-commit: 6813cdca4ab94a238f8eb0cef3d3f3fcbdfb0ee0
change-id: 20240205-amdgpu-raise-flt-for-dml-vba-files-ee5b5a9c5e43

Best regards,
-- 
Nathan Chancellor 



[PATCH] drm/amd/display: Disable PSR-SU on Parade 08-01 TCON too

2024-02-05 Thread Mario Limonciello
Stuart Hayhurst has found that both at bootup and fullscreen VA-API video
is leading to black screens for around 1 second and kernel WARNING [1] traces
when calling dmub_psr_enable() with Parade 08-01 TCON.

These symptoms all go away with PSR-SU disabled for this TCON, so disable
it for now while DMUB traces [2] from the failure can be analyzed and the 
failure
state properly root caused.

Cc: sta...@vger.kernel.org
Cc: Marc Rossi 
Cc: Hamza Mahfooz 
Link: 
https://gitlab.freedesktop.org/drm/amd/uploads/a832dd515b571ee171b3e3b566e99a13/dmesg.log
 [1]
Link: 
https://gitlab.freedesktop.org/drm/amd/uploads/8f13ff3b00963c833e23e68aa8116959/output.log
 [2]
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2645
Signed-off-by: Mario Limonciello 
---
---
 drivers/gpu/drm/amd/display/modules/power/power_helpers.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/modules/power/power_helpers.c 
b/drivers/gpu/drm/amd/display/modules/power/power_helpers.c
index e304e8435fb8..477289846a0a 100644
--- a/drivers/gpu/drm/amd/display/modules/power/power_helpers.c
+++ b/drivers/gpu/drm/amd/display/modules/power/power_helpers.c
@@ -841,6 +841,8 @@ bool is_psr_su_specific_panel(struct dc_link *link)
isPSRSUSupported = false;
else if (dpcd_caps->sink_dev_id_str[1] == 0x08 && 
dpcd_caps->sink_dev_id_str[0] == 0x03)
isPSRSUSupported = false;
+   else if (dpcd_caps->sink_dev_id_str[1] == 0x08 && 
dpcd_caps->sink_dev_id_str[0] == 0x01)
+   isPSRSUSupported = false;
else if (dpcd_caps->psr_info.force_psrsu_cap == 0x1)
isPSRSUSupported = true;
}
-- 
2.34.1



RE: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)

2024-02-05 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: amd-gfx  On Behalf Of
> Shengyu Qu
> Sent: Saturday, February 3, 2024 8:05 AM
> To: Kuehling, Felix ; amd-gfx@lists.freedesktop.org
> Cc: wiagn...@outlook.com; Cornwall, Jay ;
> Koenig, Christian ; Paneer Selvam, Arunpravin
> 
> Subject: Re: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)
>
> Hi Felix,
> Sorry for my late reply. I was busy this week.
> I just did some more tests using next-20240202 branch. Testing using blender
> 4.0.2, when only one HIP render task is running, there's no problem.
> However, when two tasks run together, software always crashes, but not
> crashes the whole system. Dmesg reports gpu reset in most cases, for
> example:
>
> [  176.071823] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx_0.0.0 timeout, signaled seq=32608, emitted seq=32610 [  176.072000]
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
> information: process blender pid 4256 thread blender:cs0 pid 4297
> [  176.072143] amdgpu :03:00.0: amdgpu: GPU reset begin!
> [  176.073571] amdgpu :03:00.0: amdgpu: Guilty job already signaled,
> skipping HW reset [  176.073593] amdgpu :03:00.0: amdgpu: GPU
> reset(4) succeeded!
>
> And in some rare cases, there would be a page fault report, see dmesg.log.
> Do you have any idea? Can I make it print more detailed diagnostic
> information?

Are you only seeing the problem with this patch applied or in general?  If you 
are seeing it in general, it likely related to a firmware issue that was 
recently fixed that will be resolved with an update CP firmware image.
Driver side changes:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/0eb6c664b780dd1b4080e047ad51b100cd7840a3
https://gitlab.freedesktop.org/agd5f/linux/-/commit/40970e60070ed3d1390ec65e38e819f6d81b8f0c

Alex


>
> Best regards,
> Shengyu
>
>
> 在 2024/1/30 01:47, Felix Kuehling 写道:
> > On 2024-01-29 10:24, Shengyu Qu wrote:
> >> Hello Felix,
> >> I think you are right. This problem has existed for years(just look
> >> at the issue creation time in my link), and is thought caused by
> >> OpenGL-ROCM interop(that's why I think this patch might help). It is
> >> very easy to trigger this problem in blender(method is also mentioned
> >> in the link).
> >
> > This doesn't help you, but it's unlikely that this has been the same
> > issue for two years for everybody who chimed into this bug report.
> > Different kernel versions, GPUs, user mode ROCm and Mesa versions etc.
> >
> > Case in point, it's possible that you're seeing an issue specific to
> > RDNA3, which hasn't even been around for that long.
> >
> >
> >> Do
> >> you have any idea about this?
> >
> > Not without seeing a lot more diagnostic information. A full backtrace
> > from your kernel log would be a good start.
> >
> > Regards,
> >   Felix
> >
> >
> >> Best regards,
> >> Shengyu
> >> 在 2024/1/29 22:51, Felix Kuehling 写道:
> >>> On 2024-01-29 8:58, Shengyu Qu wrote:
>  Hi,
>  Seems rocm-opengl interop hang problem still exists[1]. Btw have
>  you discovered into this problem?
>  Best regards,
>  Shengyu
>  [1]
>  https://projects.blender.org/blender/blender/issues/100353#issuecom
>  ment-599
> >>>
> >>> Maybe you're having a different problem. Do you see this issue also
> >>> without any version of the "Relocate TBA/TMA ..." patch?
> >>>
> >>> Regards,
> >>>   Felix
> >>>
> >>>
> 
>  在 2024/1/27 03:15, Shengyu Qu 写道:
> > Hello Felix,
> > This patch seems working on my system, also it seems fixes the
> > ROCM/OpenGL interop problem.
> > Is this intended to happen or not? Maybe we need more users to
> > test it.
> > Besides,
> > Tested-by: Shengyu Qu  Best Regards,
> Shengyu
> >
> > 在 2024/1/26 06:27, Felix Kuehling 写道:
> >> The TBA and TMA, along with an unused IB allocation, reside at
> >> low addresses in the VM address space. A stray VM fault which
> >> hits these pages must be serviced by making their page table entries
> invalid.
> >> The scheduler depends upon these pages being resident and fails,
> >> preventing a debugger from inspecting the failure state.
> >>
> >> By relocating these pages above 47 bits in the VM address space
> >> they can only be reached when bits [63:48] are set to 1. This
> >> makes it much less likely for a misbehaving program to generate
> >> accesses to them.
> >> The current placement at VA (PAGE_SIZE*2) is readily hit by a
> >> NULL access with a small offset.
> >>
> >> v2:
> >> - Move it to the reserved space to avoid concflicts with Mesa
> >> - Add macros to make reserved space management easier
> >>
> >> Cc: Arunpravin Paneer Selvam 
> >> Cc: Christian Koenig 
> >> Signed-off-by: Jay Cornwall 
> >> Signed-off-by: Felix Kuehling 
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c  |  4 +--
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c|  7 ++---
> >> 

[PATCH v3] drm/amdgpu: change vm->task_info handling

2024-02-05 Thread Shashank Sharma
This patch changes the handling and lifecycle of vm->task_info object.
The major changes are:
- vm->task_info is a dynamically allocated ptr now, and its uasge is
  reference counted.
- introducing two new helper funcs for task_info lifecycle management
- amdgpu_vm_get_task_info: reference counts up task_info before
  returning this info
- amdgpu_vm_put_task_info: reference counts down task_info
- last put to task_info() frees task_info from the vm.

This patch also does logistical changes required for existing usage
of vm->task_info.

V2: Do not block all the prints when task_info not found (Felix)
V3: (Felix)
   - Fix wrong indentation
   - No debug message for -ENOMEM
   - Add NULL check for task_info
   - Do not duplicate the debug messages (ti vs no ti)
   - Get first reference of task_info in vm_init(), put last
 in vm_fini()

Cc: Christian Koenig 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Signed-off-by: Shashank Sharma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  18 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c   |  12 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  | 158 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  21 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  |  24 +--
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  |  23 +--
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c   |  20 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  23 +--
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c  |  23 +--
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c|  22 +--
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  20 +--
 13 files changed, 251 insertions(+), 124 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 0e61ebdb3f3e..f9eb12697b95 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1775,9 +1775,14 @@ static int amdgpu_debugfs_vm_info_show(struct seq_file 
*m, void *unused)
list_for_each_entry(file, >filelist, lhead) {
struct amdgpu_fpriv *fpriv = file->driver_priv;
struct amdgpu_vm *vm = >vm;
+   struct amdgpu_task_info *ti;
+
+   ti = amdgpu_vm_get_task_info_vm(vm);
+   if (ti) {
+   seq_printf(m, "pid:%d\tProcess:%s --\n", 
ti->pid, ti->process_name);
+   amdgpu_vm_put_task_info(ti);
+   }
 
-   seq_printf(m, "pid:%d\tProcess:%s --\n",
-   vm->task_info.pid, vm->task_info.process_name);
r = amdgpu_bo_reserve(vm->root.bo, true);
if (r)
break;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 1f357198533f..e6e6d56398f2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -35,7 +35,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
drm_sched_job *s_job)
 {
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
struct amdgpu_job *job = to_amdgpu_job(s_job);
-   struct amdgpu_task_info ti;
+   struct amdgpu_task_info *ti;
struct amdgpu_device *adev = ring->adev;
int idx;
int r;
@@ -48,7 +48,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
drm_sched_job *s_job)
return DRM_GPU_SCHED_STAT_ENODEV;
}
 
-   memset(, 0, sizeof(struct amdgpu_task_info));
+
adev->job_hang = true;
 
if (amdgpu_gpu_recovery &&
@@ -58,12 +58,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct 
drm_sched_job *s_job)
goto exit;
}
 
-   amdgpu_vm_get_task_info(ring->adev, job->pasid, );
DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
- job->base.sched->name, atomic_read(>fence_drv.last_seq),
- ring->fence_drv.sync_seq);
-   DRM_ERROR("Process information: process %s pid %d thread %s pid %d\n",
- ti.process_name, ti.tgid, ti.task_name, ti.pid);
+  job->base.sched->name, 
atomic_read(>fence_drv.last_seq),
+  ring->fence_drv.sync_seq);
+
+   ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
+   if (ti) {
+   DRM_ERROR("Process information: process %s pid %d thread %s pid 
%d\n",
+ ti->process_name, ti->tgid, ti->task_name, ti->pid);
+   amdgpu_vm_put_task_info(ti);
+   }
 
dma_fence_set_error(_job->s_fence->finished, -ETIME);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 4baa300121d8..a59364e9b6ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -230,8 +230,16 @@ void 

RE: [PATCH] drm/amd/display: Implement bounds check for stream encoder creation in DCN301

2024-02-05 Thread Li, Roman
[Public]

Inline.

> -Original Message-
> From: SHANMUGAM, SRINIVASAN 
> Sent: Sunday, February 4, 2024 9:35 PM
> To: Siqueira, Rodrigo ; Pillai, Aurabindo
> 
> Cc: amd-gfx@lists.freedesktop.org; SHANMUGAM, SRINIVASAN
> ; Li, Roman 
> Subject: [PATCH] drm/amd/display: Implement bounds check for stream
> encoder creation in DCN301
>
> 'stream_enc_regs' array is an array of dcn10_stream_enc_registers structures.
> The array is initialized with four elements, corresponding to the four calls 
> to
> stream_enc_regs() in the array initializer. This means that valid indices for 
> this
> array are 0, 1, 2, and 3.
>
> The error message 'stream_enc_regs' 4 <= 5 below, is indicating that there is 
> an
> attempt to access this array with an index of 5, which is out of bounds. This
> could lead to undefined behavior
>
> Here, eng_id is used as an index to access the stream_enc_regs array. If 
> eng_id
> is 5, this would result in an out-of-bounds access.
>
> Fixes the below:
> drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn301/dcn301_reso
> urce.c:1011 dcn301_stream_encoder_create() error: buffer overflow
> 'stream_enc_regs' 4 <= 5

Please mention that this is Smatch warning.
In current implementation this function is called with eng_id  limited by 
num_stream_encoder = 4 for dcn301.

>
> Fixes: 3a83e4e64bb1 ("drm/amd/display: Add dcn3.01 support to DC (v2)")
> Cc: Roman Li 
> Cc: Rodrigo Siqueira 
> Cc: Aurabindo Pillai 
> Signed-off-by: Srinivasan Shanmugam 
> ---
>  .../display/dc/resource/dcn301/dcn301_resource.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git
> a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c
> b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c
> index 511ff6b5b985..f915d7c3980e 100644
> --- a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c
> +++
> b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c
> @@ -1006,10 +1006,18 @@ static struct stream_encoder
> *dcn301_stream_encoder_create(enum engine_id eng_id
>   return NULL;
>   }

>
> - dcn30_dio_stream_encoder_construct(enc1, ctx, ctx->dc_bios,
> - eng_id, vpg, afmt,
> - _enc_regs[eng_id],
> - _shift, _mask);
> + if (eng_id < ARRAY_SIZE(stream_enc_regs)) {
> + dcn30_dio_stream_encoder_construct(enc1, ctx, ctx-
> >dc_bios,
> +eng_id, vpg, afmt,
> +_enc_regs[eng_id],
> +_shift, _mask);
> + } else {
> + DRM_ERROR("Invalid engine id: %d\n", eng_id);
> + kfree(enc1);
> + kfree(vpg);
> + kfree(afmt);
> + return NULL;
> + }

Can you just extend the existing null checks instead?
e.g.
if (!enc1 || !vpg || !afmt || (eng_id  >= ARRAY_SIZE(stream_enc_regs))

>
>   return >base;
>  }
> --
> 2.34.1



Re: [PATCH] drm/amd/display: only call sysfs_remove_group() for eDP connectors

2024-02-05 Thread Harry Wentland
On 2024-02-05 10:18, Hamza Mahfooz wrote:
> Since we only register the amdgpu sysfs group for eDP connectors, we
> should only remove it from them. Otherwise, we run into a harmless
> WARN() on device detach for all of the device's non-eDP connectors.
> 
> Fixes: f97e4303da16 ("drm/amd/display: add panel_power_savings sysfs entry to 
> eDP connectors")
> Signed-off-by: Hamza Mahfooz 

Reviewed-by: Harry Wentland 

Harry

> ---
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index adda423615a1..b3a5e730be24 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -6508,7 +6508,9 @@ static void amdgpu_dm_connector_unregister(struct 
> drm_connector *connector)
>  {
>   struct amdgpu_dm_connector *amdgpu_dm_connector = 
> to_amdgpu_dm_connector(connector);
>  
> - sysfs_remove_group(>kdev->kobj, _group);
> + if (connector->connector_type == DRM_MODE_CONNECTOR_eDP)
> + sysfs_remove_group(>kdev->kobj, _group);
> +
>   drm_dp_aux_unregister(_dm_connector->dm_dp_aux.aux);
>  }
>  



[PATCH] drm/amd/display: only call sysfs_remove_group() for eDP connectors

2024-02-05 Thread Hamza Mahfooz
Since we only register the amdgpu sysfs group for eDP connectors, we
should only remove it from them. Otherwise, we run into a harmless
WARN() on device detach for all of the device's non-eDP connectors.

Fixes: f97e4303da16 ("drm/amd/display: add panel_power_savings sysfs entry to 
eDP connectors")
Signed-off-by: Hamza Mahfooz 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index adda423615a1..b3a5e730be24 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -6508,7 +6508,9 @@ static void amdgpu_dm_connector_unregister(struct 
drm_connector *connector)
 {
struct amdgpu_dm_connector *amdgpu_dm_connector = 
to_amdgpu_dm_connector(connector);
 
-   sysfs_remove_group(>kdev->kobj, _group);
+   if (connector->connector_type == DRM_MODE_CONNECTOR_eDP)
+   sysfs_remove_group(>kdev->kobj, _group);
+
drm_dp_aux_unregister(_dm_connector->dm_dp_aux.aux);
 }
 
-- 
2.43.0



[lvc-project] [PATCH] drm/amd/pm: check return value of amdgpu_irq_add_id()

2024-02-05 Thread Igor Artemiev
amdgpu_irq_ad_id() may fail and the irq handlers will not be registered.
This patch adds error code check.

Found by Linux Verification Center (linuxtesting.org).

Signed-off-by: Igor Artemiev 
---
 .../gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c| 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c
index 79a566f3564a..9cb965479dd8 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c
@@ -647,26 +647,34 @@ int smu9_register_irq_handlers(struct pp_hwmgr *hwmgr)
 {
struct amdgpu_irq_src *source =
kzalloc(sizeof(struct amdgpu_irq_src), GFP_KERNEL);
+   int ret;
 
if (!source)
return -ENOMEM;
 
source->funcs = _irq_funcs;
 
-   amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev),
+   ret = amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev),
SOC15_IH_CLIENTID_THM,
THM_9_0__SRCID__THM_DIG_THERM_L2H,
source);
-   amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev),
+   if (ret)
+   return ret;
+
+   ret = amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev),
SOC15_IH_CLIENTID_THM,
THM_9_0__SRCID__THM_DIG_THERM_H2L,
source);
+   if (ret)
+   return ret;
 
/* Register CTF(GPIO_19) interrupt */
-   amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev),
+   ret = amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev),
SOC15_IH_CLIENTID_ROM_SMUIO,
SMUIO_9_0__SRCID__SMUIO_GPIO19,
source);
+   if (ret)
+   return ret;
 
return 0;
 }
-- 
2.39.2



Re: [lvc-project] [PATCH] drm/amd/pm: check return value of amdgpu_irq_add_id()

2024-02-05 Thread Alexey Khoroshilov
On 05.02.2024 15:25, Igor Artemiev wrote:
> amdgpu_irq_ad_id() may fail and the irq handlers will not be registered.
> This patch adds error code check.

But what is about deallocation of already allocated memory?

--
Alexey





Re: [PATCH] PCI: Add vf reset notification for pf

2024-02-05 Thread Zhi Wang
On Sun, 4 Feb 2024 14:12:57 +0800
Emily Deng  wrote:

> When a vf has been reset, the pf wants to get notification to remove
> the vf out of schedule.
> 
> Solution:
> Add the callback function in pci_driver sriov_vf_reset_notification.
> When vf reset happens, then call this callback function.
> 
> Signed-off-by: Emily Deng 
> ---
>  drivers/pci/pci.c   | 8 
>  include/linux/pci.h | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 60230da957e0..aca937b05531 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4780,6 +4780,14 @@ EXPORT_SYMBOL_GPL(pcie_flr);
>   */
>  int pcie_reset_flr(struct pci_dev *dev, bool probe)
>  {
> + struct pci_dev *pf_dev;
> +
> + if (dev->is_virtfn) {
> + pf_dev = dev->physfn;
> + if (pf_dev->driver->sriov_vf_reset_notification)
> +
> pf_dev->driver->sriov_vf_reset_notification(pf_dev, dev);
> + }
> +
>   if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
>   return -ENOTTY;
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index c69a2cc1f412..4fa31d9b0aa7 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -926,6 +926,7 @@ struct pci_driver {
>   int  (*sriov_configure)(struct pci_dev *dev, int num_vfs);
> /* On PF */ int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int
> msix_vec_count); /* On PF */ u32  (*sriov_get_vf_total_msix)(struct
> pci_dev *pf);
> + void  (*sriov_vf_reset_notification)(struct pci_dev *pf,
> struct pci_dev *vf); const struct pci_error_handlers *err_handler;
>   const struct attribute_group **groups;
>   const struct attribute_group **dev_groups;

Hi:

I would suggest you can provide a cover letter including a complete
picture that tells the background, detailed problem statement, the
solutions and plus the users. As this seems very like a generic change,
it needs a better justification to convince folks why this is the best
solution. Without a complete picture, the solution just looks like a
workaround.

Thanks,
Zhi.


Re: [RFC PATCH 2/2] drm/amd/display: switch amdgpu_dm_connector to use struct drm_edid

2024-02-05 Thread Melissa Wen
On 01/26, Alex Hung wrote:
> 
> 
> On 2024-01-26 09:28, Melissa Wen wrote:
> > Replace raw edid handling (struct edid) with the opaque EDID type
> > (struct drm_edid) on amdgpu_dm_connector for consistency. It may also
> > prevent mismatch of approaches in different parts of the driver code.
> > Working in progress. There are a couple of cast warnings and it was only
> > tested with IGT.
> > 
> > Signed-off-by: Melissa Wen 
> > ---
> >   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 63 ++-
> >   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  4 +-
> >   .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c |  9 +--
> >   .../display/amdgpu_dm/amdgpu_dm_mst_types.c   | 23 +++
> >   4 files changed, 51 insertions(+), 48 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 68e71e2ea472..741081d73bb3 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -3277,12 +3277,12 @@ void amdgpu_dm_update_connector_after_detect(
> > >dm_dp_aux.aux);
> > }
> > } else {
> > -   aconnector->edid =
> > -   (struct edid *)sink->dc_edid.raw_edid;
> > +   const struct edid *edid = (const struct edid 
> > *)sink->dc_edid.raw_edid;
> > +   aconnector->edid = drm_edid_alloc(edid, 
> > (edid->extensions + 1) * EDID_LENGTH);
> > if (aconnector->dc_link->aux_mode)
> > drm_dp_cec_set_edid(>dm_dp_aux.aux,
> > -   aconnector->edid);
> > +   
> > drm_edid_raw(aconnector->edid));
> > }
> > if (!aconnector->timing_requested) {
> > @@ -3293,13 +3293,13 @@ void amdgpu_dm_update_connector_after_detect(
> > "failed to create 
> > aconnector->requested_timing\n");
> > }
> > -   drm_connector_update_edid_property(connector, aconnector->edid);
> > +   drm_edid_connector_update(connector, aconnector->edid);
> > amdgpu_dm_update_freesync_caps(connector, aconnector->edid);
> > update_connector_ext_caps(aconnector);
> > } else {
> > drm_dp_cec_unset_edid(>dm_dp_aux.aux);
> > amdgpu_dm_update_freesync_caps(connector, NULL);
> > -   drm_connector_update_edid_property(connector, NULL);
> > +   drm_edid_connector_update(connector, NULL);
> > aconnector->num_modes = 0;
> > dc_sink_release(aconnector->dc_sink);
> > aconnector->dc_sink = NULL;
> > @@ -6564,7 +6564,6 @@ static void amdgpu_dm_connector_funcs_force(struct 
> > drm_connector *connector)
> > struct dc_link *dc_link = aconnector->dc_link;
> > struct dc_sink *dc_em_sink = aconnector->dc_em_sink;
> > const struct drm_edid *drm_edid;
> > -   const struct edid *edid;
> > /*
> >  * Note: drm_get_edid gets edid in the following order:
> > @@ -6578,11 +6577,12 @@ static void amdgpu_dm_connector_funcs_force(struct 
> > drm_connector *connector)
> > DRM_ERROR("No EDID found on connector: %s.\n", connector->name);
> > return;
> > }
> > -   edid = drm_edid_raw(drm_edid);
> > -   aconnector->edid = edid;
> > -
> > +   aconnector->edid = drm_edid;
> > +   drm_edid_connector_update(connector, drm_edid);
> > /* Update emulated (virtual) sink's EDID */
> > if (dc_em_sink && dc_link) {
> > +   const struct edid *edid = drm_edid_raw(drm_edid);
> > +
> > memset(_em_sink->edid_caps, 0, sizeof(struct dc_edid_caps));
> > memmove(dc_em_sink->dc_edid.raw_edid, (uint8_t *)edid, 
> > (edid->extensions + 1) * EDID_LENGTH);
> > dm_helpers_parse_edid_caps(
> > @@ -6633,13 +6633,13 @@ static void create_eml_sink(struct 
> > amdgpu_dm_connector *aconnector)
> > return;
> > }
> > -   edid = drm_edid_raw(drm_edid);
> > -
> > -   if (drm_detect_hdmi_monitor(edid))
> > +   if (connector->display_info.is_hdmi)
> > init_params.sink_signal = SIGNAL_TYPE_HDMI_TYPE_A;
> > -   aconnector->edid = edid;
> > +   aconnector->edid = drm_edid;
> > +   drm_edid_connector_update(connector, drm_edid);
> > +   edid = drm_edid_raw(drm_edid);
> > aconnector->dc_em_sink = dc_link_add_remote_sink(
> > aconnector->dc_link,
> > (uint8_t *)edid,
> > @@ -7322,16 +7322,16 @@ static void amdgpu_set_panel_orientation(struct 
> > drm_connector *connector)
> >   }
> >   static void amdgpu_dm_connector_ddc_get_modes(struct drm_connector 
> > *connector,
> > - struct edid *edid)
> > + const struct drm_edid *drm_edid)
> >   {
> > struct amdgpu_dm_connector 

Re: [RFC PATCH 0/2] drm/amd/display: switch amdgpu_dm_connector to

2024-02-05 Thread Melissa Wen
On 01/29, Jani Nikula wrote:
> On Fri, 26 Jan 2024, Mario Limonciello  wrote:
> > On 1/26/2024 10:28, Melissa Wen wrote:
> >> Hi,
> >> 
> >> I'm debugging a null-pointer dereference when running
> >> igt@kms_connector_force_edid and the way I found to solve the bug is to
> >> stop using raw edid handler in amdgpu_connector_funcs_force and
> >> create_eml_sink in favor of managing resouces via sruct drm_edid helpers
> >> (Patch 1). The proper solution seems to be switch amdgpu_dm_connector
> >> from struct edid to struct drm_edid and avoid the usage of different
> >> approaches in the driver (Patch 2). However, doing it implies a good
> >> amount of work and validation, therefore I decided to send this RFC
> >> first to collect opinions and check if there is any parallel work on
> >> this side. It's a working in progress.
> >> 
> >> The null-pointer error trigger by the igt@kms_connector_force_edid test
> >> was introduced by:
> >> - e54ed41620f ("drm/amd/display: Remove unwanted drm edid references")
> >> 
> >> You can check the error trace in the first patch.
> >> 
> >> This series was tested with kms_hdmi_inject and kms_force_connector. No
> >> null-pointer error, kms_hdmi_inject is successul and kms_force_connector
> >> is sucessful after the second execution - the force-edid subtest
> >> still fails in the first run (I'm still investigating).
> >> 
> >> There is also a couple of cast warnings to be addressed - I'm looking
> >> for the best replacement.
> >> 
> >> I appreciate any feedback and testing.
> >
> > So I'm actually a little bit worried by hardcoding EDID_LENGTH in this 
> > series.
> >
> > I have some other patches that I'm posting later on that let you get the 
> > EDID from _DDC BIOS method too.  My observation was that the EDID can be 
> > anywhere up to 512 bytes according to the ACPI spec.
> >
> > An earlier version of my patch was using EDID_LENGTH when fetching it 
> > and the EDID checksum failed.
> >
> > I'll CC you on the post, we probably want to get your changes and mine 
> > merged together.
> 
> One of the main points of struct drm_edid is that it tracks the
> allocation size separately.
> 
> We should simply not trust edid->extensions, because most of the time it
> originates from outside the kernel.
> 
> Using drm_edid and immediately drm_edid_raw() falls short. That function
> should only be used during migration to help. And yeah, it also means
> EDID parsing should be done in drm_edid.c, and not spread out all over
> the subsystem.

Hi Mario and Jani,

Thanks for the feedback.

I agree with you. I used the drm_edid_raw() as an intermediate step to
assess/validate this migration, but I'll work on removing this hack.

So, I understand that the transition from edid to drm_edid is the right
path, so I'll improve this work (keeping the points you raised in mind)
and send a version.

BR,

Melissa
> 
> 
> BR,
> Jani.
> 
> 
> >
> >> 
> >> Melissa
> >> 
> >> Melissa Wen (2):
> >>drm/amd/display: fix null-pointer dereference on edid reading
> >>drm/amd/display: switch amdgpu_dm_connector to use struct drm_edid
> >> 
> >>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 78 ++-
> >>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  4 +-
> >>   .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c |  9 ++-
> >>   .../display/amdgpu_dm/amdgpu_dm_mst_types.c   | 23 +++---
> >>   4 files changed, 60 insertions(+), 54 deletions(-)
> >> 
> >
> 
> -- 
> Jani Nikula, Intel


RE: [PATCH 00/21] DC Patches January 31, 2024

2024-02-05 Thread Wheeler, Daniel
[Public]

Hi all,

This week this patchset was tested on the following systems:
* Lenovo ThinkBook T13s Gen4 with AMD Ryzen 5 6600U
* MSI Gaming X Trio RX 6800
* Gigabyte Gaming OC RX 7900 XTX

These systems were tested on the following display/connection types:
* eDP, (1080p 60hz [5650U]) (1920x1200 60hz [6600U]) (2560x1600 
120hz[6600U])
* VGA and DVI (1680x1050 60hz [DP to VGA/DVI, USB-C to VGA/DVI])
* DP/HDMI/USB-C (1440p 170hz, 4k 60hz, 4k 144hz, 4k 240hz [Includes 
USB-C to DP/HDMI adapters])
* Thunderbolt (LG Ultrafine 5k)
* MST (Startech MST14DP123DP [DP to 3x DP] and 2x 4k 60Hz displays)
* DSC (with Cable Matters 101075 [DP to 3x DP] with 3x 4k60 displays, 
and HP Hook G2 with 1 4k60 display)
* USB 4 (Kensington SD5700T and 1x 4k 60Hz display)
* PCON (Club3D CAC-1085 and 1x 4k 144Hz display [at 4k 120HZ, as that 
is the max the adapter supports])

The testing is a mix of automated and manual tests. Manual testing includes 
(but is not limited to):
* Changing display configurations and settings
* Benchmark testing
* Feature testing (Freesync, etc.)

Automated testing includes (but is not limited to):
* Script testing (scripts to automate some of the manual checks)
* IGT testing

The patchset consists of the amd-staging-drm-next branch (Head commit - 
0b48b36f80b0 -> drm/amdgpu: Need to resume ras during gpu reset for gfx v9_4_3 
sriov) with new patches added on top of it.

Tested on Ubuntu 22.04.3, on Wayland and X11, using KDE Plasma and Gnome.


Tested-by: Daniel Wheeler 


Thank you,

Dan Wheeler
Sr. Technologist  |  AMD
SW Display
--
1 Commerce Valley Dr E, Thornhill, ON L3T 7X6
Facebook |  Twitter |  amd.com


-Original Message-
From: amd-gfx  On Behalf Of Hamza Mahfooz
Sent: Wednesday, January 31, 2024 3:11 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chung, ChiaHsuan (Tom) ; Li, Sun peng (Leo) 
; Siqueira, Rodrigo ; Li, Roman 
; Zuo, Jerry ; Pillai, Aurabindo 
; Wu, Hersen ; Mahfooz, Hamza 
; Lin, Wayne ; Wentland, Harry 
; Gutierrez, Agustin 
Subject: [PATCH 00/21] DC Patches January 31, 2024

This version brings along the following:
* DCN35 fixes
* DMUB fixes
* Link training fixes
* Misc code style fixes
* MST fixes
* ODM fixes
* SubVP fixes

Allen Pan (1):
  drm/amd/display: correct static screen event mask

Alvin Lee (2):
  Revert "drm/amd/display: For FPO and SubVP/DRR configs program
vmin/max sel"
  drm/amd/display: Update phantom pipe enable / disable sequence

Aric Cyr (1):
  drm/amd/display: 3.2.271

Camille Cho (1):
  drm/amd/display: correct comment in set_default_brightness_aux()

Ethan Bitnun (2):
  drm/amd/display: Add delay before logging clks from hw
  drm/amd/display: Adjust set_p_state calls to fix logging

Fangzhi Zuo (1):
  drm/amd/display: Fix MST Null Ptr for RV

George Shen (2):
  drm/amd/display: Add debug option to force 1-tap chroma subsampling
  drm/amd/display: Add left edge pixel for YCbCr422/420 + ODM pipe split

Michael Strauss (2):
  drm/amd/display: Remove Legacy FIXED_VS Transparent LT Sequence
  drm/amd/display: Don't perform rate toggle on DP2-capable FIXED_VS
retimers

Nicholas Kazlauskas (4):
  drm/amd/display: Add more checks for exiting idle in DC
  drm/amd/display: Disable timeout in more places for dc_dmub_srv
  drm/amd/display: Increase eval/entry delay for DCN35
  drm/amd/display: Disable idle reallow as part of command/gpint
execution

Rodrigo Siqueira (4):
  drm/amd/display: Drop legacy code
  drm/amd/display: Disable ODM by default for DCN35
  drm/amd/display: Trivial code style adjustment
  drm/amd/display: Drop some unnecessary guards

Wenjing Liu (1):
  drm/amd/display: set odm_combine_policy based on context in dcn32
resource

 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  12 +-
 .../amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c |   2 -
 .../dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.c   |   4 -
 .../display/dc/clk_mgr/dcn301/dcn301_smu.c|   4 -
 .../amd/display/dc/clk_mgr/dcn31/dcn31_smu.c  |   4 -
 .../display/dc/clk_mgr/dcn314/dcn314_smu.c|   6 -
 .../display/dc/clk_mgr/dcn315/dcn315_smu.c|   4 -
 .../display/dc/clk_mgr/dcn316/dcn316_smu.c|   4 -
 .../display/dc/clk_mgr/dcn32/dcn32_clk_mgr.c  |   2 +
 .../dc/clk_mgr/dcn32/dcn32_clk_mgr_smu_msg.h  |   3 +-
 drivers/gpu/drm/amd/display/dc/core/dc.c  | 115 +++---
 .../gpu/drm/amd/display/dc/core/dc_resource.c |  68 ++--
 .../gpu/drm/amd/display/dc/core/dc_stream.c   |  18 +
 .../gpu/drm/amd/display/dc/core/dc_surface.c  |   2 +
 drivers/gpu/drm/amd/display/dc/dc.h   |   9 +-
 drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c  |  17 +-
 drivers/gpu/drm/amd/display/dc/dc_hw_types.h  |   2 -
 drivers/gpu/drm/amd/display/dc/dc_stream.h|   2 +
 .../gpu/drm/amd/display/dc/dcn10/dcn10_opp.c  |   7 +
 

Re: [PATCH] drm/amd/display: Clear phantom stream count and plane count

2024-02-05 Thread Deucher, Alexander
[Public]

Acked-by: Alex Deucher 

From: amd-gfx  on behalf of Mario 
Limonciello 
Sent: Friday, February 2, 2024 7:30 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Limonciello, Mario 
Subject: [PATCH] drm/amd/display: Clear phantom stream count and plane count

When dc_state_destruct() was refactored the new phantom_stream_count
and phantom_plane_count members weren't cleared.

Fixes: 012a04b1d6af ("drm/amd/display: Refactor phantom resource allocation")
Signed-off-by: Mario Limonciello 
---
 drivers/gpu/drm/amd/display/dc/core/dc_state.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_state.c 
b/drivers/gpu/drm/amd/display/dc/core/dc_state.c
index 88c6436b28b6..180ac47868c2 100644
--- a/drivers/gpu/drm/amd/display/dc/core/dc_state.c
+++ b/drivers/gpu/drm/amd/display/dc/core/dc_state.c
@@ -291,11 +291,14 @@ void dc_state_destruct(struct dc_state *state)
 dc_stream_release(state->phantom_streams[i]);
 state->phantom_streams[i] = NULL;
 }
+   state->phantom_stream_count = 0;

 for (i = 0; i < state->phantom_plane_count; i++) {
 dc_plane_state_release(state->phantom_planes[i]);
 state->phantom_planes[i] = NULL;
 }
+   state->phantom_plane_count = 0;
+
 state->stream_mask = 0;
 memset(>res_ctx, 0, sizeof(state->res_ctx));
 memset(>pp_display_cfg, 0, sizeof(state->pp_display_cfg));
--
2.34.1



Re: [PATCH] PCI: Add vf reset notification for pf

2024-02-05 Thread Christian König

Am 04.02.24 um 07:12 schrieb Emily Deng:

When a vf has been reset, the pf wants to get notification to remove the vf
out of schedule.

Solution:
Add the callback function in pci_driver sriov_vf_reset_notification. When
vf reset happens, then call this callback function.


Well that doesn't make much sense. As other already noted as well a VF 
should be an encapsulated representation of a physical devices 
functionality.


AMD implemented that a bit different with a hypervisor to control which 
PF functionality a VF exposes, but that doesn't mean that we can leak 
this AMD specific handling into the common Linux PCI subsystem.


Additional to that a technical blocker is that when a VF is passed into 
a VM you don't have access to the PF any more to make this reset 
notification.


Regards,
Christian.



Signed-off-by: Emily Deng 
---
  drivers/pci/pci.c   | 8 
  include/linux/pci.h | 1 +
  2 files changed, 9 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 60230da957e0..aca937b05531 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4780,6 +4780,14 @@ EXPORT_SYMBOL_GPL(pcie_flr);
   */
  int pcie_reset_flr(struct pci_dev *dev, bool probe)
  {
+   struct pci_dev *pf_dev;
+
+   if (dev->is_virtfn) {
+   pf_dev = dev->physfn;
+   if (pf_dev->driver->sriov_vf_reset_notification)
+   pf_dev->driver->sriov_vf_reset_notification(pf_dev, 
dev);
+   }
+
if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET)
return -ENOTTY;
  
diff --git a/include/linux/pci.h b/include/linux/pci.h

index c69a2cc1f412..4fa31d9b0aa7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -926,6 +926,7 @@ struct pci_driver {
int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int 
msix_vec_count); /* On PF */
u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
+   void  (*sriov_vf_reset_notification)(struct pci_dev *pf, struct pci_dev 
*vf);
const struct pci_error_handlers *err_handler;
const struct attribute_group **groups;
const struct attribute_group **dev_groups;




Re: [PATCH 1/3] driver core: bus: introduce can_remove()

2024-02-05 Thread Christian König

Am 02.02.24 um 23:25 schrieb Hamza Mahfooz:

Currently, drivers have no mechanism to block requests to unbind
devices. However, this can cause resource leaks and leave the device in
an inconsistent state, such that rebinding the device may cause a hang
or otherwise prevent the device from being rebound. So, introduce
the can_remove() callback to allow drivers to indicate if it isn't
appropriate to remove a device at the given time.


Well that is nonsense. When you physically remove a device (e.g. unplug 
it) then there is nothing in software you can do to prevent that.


Regards,
Christian.



Cc: sta...@vger.kernel.org
Signed-off-by: Hamza Mahfooz 
---
  drivers/base/bus.c | 4 
  include/linux/device/bus.h | 2 ++
  2 files changed, 6 insertions(+)

diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index daee55c9b2d9..7c259b01ea99 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -239,6 +239,10 @@ static ssize_t unbind_store(struct device_driver *drv, 
const char *buf,
  
  	dev = bus_find_device_by_name(bus, NULL, buf);

if (dev && dev->driver == drv) {
+   if (dev->bus && dev->bus->can_remove &&
+   !dev->bus->can_remove(dev))
+   return -EBUSY;
+
device_driver_detach(dev);
err = count;
}
diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
index 5ef4ec1c36c3..c9d4af0ed3b8 100644
--- a/include/linux/device/bus.h
+++ b/include/linux/device/bus.h
@@ -46,6 +46,7 @@ struct fwnode_handle;
   *be called at late_initcall_sync level. If the device has
   *consumers that are never bound to a driver, this function
   *will never get called until they do.
+ * @can_remove: Called before attempting to remove a device from this bus.
   * @remove:   Called when a device removed from this bus.
   * @shutdown: Called at shut-down time to quiesce the device.
   *
@@ -85,6 +86,7 @@ struct bus_type {
int (*uevent)(const struct device *dev, struct kobj_uevent_env *env);
int (*probe)(struct device *dev);
void (*sync_state)(struct device *dev);
+   bool (*can_remove)(struct device *dev);
void (*remove)(struct device *dev);
void (*shutdown)(struct device *dev);
  




RE: [PATCH Review 1/1] drm/amdgpu: Fix shared buff copy to user

2024-02-05 Thread Zhang, Hawking
[AMD Official Use Only - General]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: amd-gfx  On Behalf Of Stanley.Yang
Sent: Monday, February 5, 2024 16:10
To: amd-gfx@lists.freedesktop.org
Cc: Yang, Stanley 
Subject: [PATCH Review 1/1] drm/amdgpu: Fix shared buff copy to user

ta if invoke node buffer
| ta type --|
|  ta id  --|
| cmd  id --|
|-- shared buf len -|
|-- shared buffer --|

ta if invoke node buffer is as above, copy shared buffer data to correct 
location

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
index 468a67b302d4..ca5c86e5f7cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
@@ -362,7 +362,7 @@ static ssize_t ta_if_invoke_debugfs_write(struct file *fp, 
const char *buf, size
}
}

-   if (copy_to_user((char *)buf, context->mem_context.shared_buf, 
shared_buf_len))
+   if (copy_to_user((char *)[copy_pos], 
context->mem_context.shared_buf, shared_buf_len))
ret = -EFAULT;

 err_free_shared_buf:
--
2.25.1



Re: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)

2024-02-05 Thread Shengyu Qu

Hi Felix,
Sorry for my late reply. I was busy this week.
I just did some more tests using next-20240202 branch. Testing using 
blender 4.0.2, when only one HIP render task is running, there's no problem.
However, when two tasks run together, software always crashes, but not 
crashes the whole system. Dmesg reports gpu reset in most cases, for 
example:


[  176.071823] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 
timeout, signaled seq=32608, emitted seq=32610
[  176.072000] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process 
information: process blender pid 4256 thread blender:cs0 pid 4297

[  176.072143] amdgpu :03:00.0: amdgpu: GPU reset begin!
[  176.073571] amdgpu :03:00.0: amdgpu: Guilty job already signaled, 
skipping HW reset

[  176.073593] amdgpu :03:00.0: amdgpu: GPU reset(4) succeeded!

And in some rare cases, there would be a page fault report, see dmesg.log.
Do you have any idea? Can I make it print more detailed diagnostic 
information?


Best regards,
Shengyu


在 2024/1/30 01:47, Felix Kuehling 写道:

On 2024-01-29 10:24, Shengyu Qu wrote:

Hello Felix,
I think you are right. This problem has existed for years(just look 
at the

issue creation time in my link), and is thought caused by OpenGL-ROCM
interop(that's why I think this patch might help). It is very easy to
trigger this problem in blender(method is also mentioned in the link).


This doesn't help you, but it's unlikely that this has been the same 
issue for two years for everybody who chimed into this bug report. 
Different kernel versions, GPUs, user mode ROCm and Mesa versions etc.


Case in point, it's possible that you're seeing an issue specific to 
RDNA3, which hasn't even been around for that long.




Do
you have any idea about this?


Not without seeing a lot more diagnostic information. A full backtrace 
from your kernel log would be a good start.


Regards,
  Felix



Best regards,
Shengyu
在 2024/1/29 22:51, Felix Kuehling 写道:

On 2024-01-29 8:58, Shengyu Qu wrote:

Hi,
Seems rocm-opengl interop hang problem still exists[1]. Btw have you
discovered into this problem?
Best regards,
Shengyu
[1] 
https://projects.blender.org/blender/blender/issues/100353#issuecomment-599


Maybe you're having a different problem. Do you see this issue also 
without any version of the "Relocate TBA/TMA ..." patch?


Regards,
  Felix




在 2024/1/27 03:15, Shengyu Qu 写道:

Hello Felix,
This patch seems working on my system, also it seems fixes the 
ROCM/OpenGL

interop problem.
Is this intended to happen or not? Maybe we need more users to 
test it.

Besides,
Tested-by: Shengyu Qu 
Best Regards,
Shengyu

在 2024/1/26 06:27, Felix Kuehling 写道:

The TBA and TMA, along with an unused IB allocation, reside at low
addresses in the VM address space. A stray VM fault which hits these
pages must be serviced by making their page table entries invalid.
The scheduler depends upon these pages being resident and fails,
preventing a debugger from inspecting the failure state.

By relocating these pages above 47 bits in the VM address space they
can only be reached when bits [63:48] are set to 1. This makes it 
much

less likely for a misbehaving program to generate accesses to them.
The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL
access with a small offset.

v2:
- Move it to the reserved space to avoid concflicts with Mesa
- Add macros to make reserved space management easier

Cc: Arunpravin Paneer Selvam 
Cc: Christian Koenig 
Signed-off-by: Jay Cornwall 
Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c  |  4 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c    |  7 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h   | 12 ++--
  drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 30 
+++-

  4 files changed, 30 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c

index 823d31f4a2a3..53d0a458d78e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -28,9 +28,9 @@
    uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
  {
-    uint64_t addr = adev->vm_manager.max_pfn << 
AMDGPU_GPU_PAGE_SHIFT;

+    uint64_t addr = AMDGPU_VA_RESERVED_CSA_START(
+    adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT);
  -    addr -= AMDGPU_VA_RESERVED_CSA_SIZE;
  addr = amdgpu_gmc_sign_extend(addr);
    return addr;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c

index 3d0d56087d41..9e769ef50f2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c
@@ -45,11 +45,8 @@
   */
  static inline u64 amdgpu_seq64_get_va_base(struct amdgpu_device 
*adev)

  {
-    u64 addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
-
-    addr -= AMDGPU_VA_RESERVED_TOP;
-
-    return addr;
+    return AMDGPU_VA_RESERVED_SEQ64_START(
+    

Re: [PATCH 2/3] PCI: introduce can_remove()

2024-02-05 Thread Greg Kroah-Hartman
On Fri, Feb 02, 2024 at 05:25:55PM -0500, Hamza Mahfooz wrote:
> Wire up the can_remove() callback, such that pci drivers can implement
> their own version of it.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Hamza Mahfooz 
> ---

Again, sorry, nope, not allowed.


Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback

2024-02-05 Thread Greg Kroah-Hartman
On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
> Removing an amdgpu device that still has user space references allocated
> to it causes undefined behaviour. So, implement amdgpu_pci_can_remove()
> and disallow devices that still have files allocated to them from being
> unbound.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Hamza Mahfooz 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index cc69005f5b46..cfa64f3c5be5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2323,6 +2323,22 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>   return ret;
>  }
>  
> +static bool amdgpu_pci_can_remove(struct pci_dev *pdev)
> +{
> + struct drm_device *dev = pci_get_drvdata(pdev);
> +
> + mutex_lock(>filelist_mutex);
> +
> + if (!list_empty(>filelist)) {
> + mutex_unlock(>filelist_mutex);
> + return false;
> + }
> +
> + mutex_unlock(>filelist_mutex);
> +
> + return true;

Also, to be pedantic, this will not work as right after you returned
"true" here, userspace could open a file, causing the same issue you are
trying to prevent to have happen, happen.

So even if we wanted to do this, which again, we do not, this isn't even
a solution for it because it will still cause you problems.

greg k-h


Re: [PATCH 1/3] driver core: bus: introduce can_remove()

2024-02-05 Thread Greg Kroah-Hartman
On Fri, Feb 02, 2024 at 05:25:54PM -0500, Hamza Mahfooz wrote:
> Currently, drivers have no mechanism to block requests to unbind
> devices.

And that is by design.

> However, this can cause resource leaks and leave the device in
> an inconsistent state, such that rebinding the device may cause a hang
> or otherwise prevent the device from being rebound.

That is a driver bug, please fix your driver.

> So, introduce the can_remove() callback to allow drivers to indicate
> if it isn't appropriate to remove a device at the given time.

Nope, sorry, the driver needs to be fixed.

What broken driver are you needing this for?

Also realize, only root can unbind drivers (and it can also unload
modules), so if the root user really wants to do this, it can, and
should be possible to.

sorry,

greg k-h


Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback

2024-02-05 Thread Greg Kroah-Hartman
On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote:
> Removing an amdgpu device that still has user space references allocated
> to it causes undefined behaviour.

Then fix that please.  There should not be anything special about your
hardware that all of the tens of thousands of other devices can't handle
today.

What happens when I yank your device out of a system with a pci hotplug
bus?  You can't prevent that either, so this should not be any different
at all.

sorry, but please, just fix your driver.

greg k-h


[PATCH Review 1/1] drm/amdgpu: Fix shared buff copy to user

2024-02-05 Thread Stanley . Yang
ta if invoke node buffer
| ta type --|
|  ta id  --|
| cmd  id --|
|-- shared buf len -|
|-- shared buffer --|

ta if invoke node buffer is as above, copy shared buffer data to correct 
location

Signed-off-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
index 468a67b302d4..ca5c86e5f7cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c
@@ -362,7 +362,7 @@ static ssize_t ta_if_invoke_debugfs_write(struct file *fp, 
const char *buf, size
}
}
 
-   if (copy_to_user((char *)buf, context->mem_context.shared_buf, 
shared_buf_len))
+   if (copy_to_user((char *)[copy_pos], 
context->mem_context.shared_buf, shared_buf_len))
ret = -EFAULT;
 
 err_free_shared_buf:
-- 
2.25.1