[PATCH v2] drm/amd/display: Implement bounds check for stream encoder creation in DCN301
'stream_enc_regs' array is an array of dcn10_stream_enc_registers structures. The array is initialized with four elements, corresponding to the four calls to stream_enc_regs() in the array initializer. This means that valid indices for this array are 0, 1, 2, and 3. The error message 'stream_enc_regs' 4 <= 5 below, is indicating that there is an attempt to access this array with an index of 5, which is out of bounds. This could lead to undefined behavior Here, eng_id is used as an index to access the stream_enc_regs array. If eng_id is 5, this would result in an out-of-bounds access on the stream_enc_regs array. Thus fixing Buffer overflow error in dcn301_stream_encoder_create reported by Smatch: drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn301/dcn301_resource.c:1011 dcn301_stream_encoder_create() error: buffer overflow 'stream_enc_regs' 4 <= 5 Fixes: 3a83e4e64bb1 ("drm/amd/display: Add dcn3.01 support to DC (v2)") Cc: Roman Li Cc: Rodrigo Siqueira Cc: Aurabindo Pillai Signed-off-by: Srinivasan Shanmugam --- .../drm/amd/display/dc/resource/dcn301/dcn301_resource.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c index 511ff6b5b985..4a475a723191 100644 --- a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c +++ b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c @@ -999,7 +999,7 @@ static struct stream_encoder *dcn301_stream_encoder_create(enum engine_id eng_id vpg = dcn301_vpg_create(ctx, vpg_inst); afmt = dcn301_afmt_create(ctx, afmt_inst); - if (!enc1 || !vpg || !afmt) { + if (!enc1 || !vpg || !afmt || eng_id >= ARRAY_SIZE(stream_enc_regs)) { kfree(enc1); kfree(vpg); kfree(afmt); @@ -1007,10 +1007,9 @@ static struct stream_encoder *dcn301_stream_encoder_create(enum engine_id eng_id } dcn30_dio_stream_encoder_construct(enc1, ctx, ctx->dc_bios, - eng_id, vpg, afmt, - _enc_regs[eng_id], - _shift, _mask); - + eng_id, vpg, afmt, + _enc_regs[eng_id], + _shift, _mask); return >base; } -- 2.34.1
[PATCH] drm/amd/display: Increase frame-larger-than for all display_mode_vba files
After a recent change in LLVM, allmodconfig (which has CONFIG_KCSAN=y and CONFIG_WERROR=y enabled) has a few new instances of -Wframe-larger-than for the mode support and system configuration functions: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20v2.c:3393:6: error: stack frame size (2144) exceeds limit (2048) in 'dml20v2_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than] 3393 | void dml20v2_ModeSupportAndSystemConfigurationFull(struct display_mode_lib *mode_lib) | ^ 1 error generated. drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn21/display_mode_vba_21.c:3520:6: error: stack frame size (2192) exceeds limit (2048) in 'dml21_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than] 3520 | void dml21_ModeSupportAndSystemConfigurationFull(struct display_mode_lib *mode_lib) | ^ 1 error generated. drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dcn20/display_mode_vba_20.c:3286:6: error: stack frame size (2128) exceeds limit (2048) in 'dml20_ModeSupportAndSystemConfigurationFull' [-Werror,-Wframe-larger-than] 3286 | void dml20_ModeSupportAndSystemConfigurationFull(struct display_mode_lib *mode_lib) | ^ 1 error generated. Without the sanitizers enabled, there are no warnings. This was the catalyst for commit 6740ec97bcdb ("drm/amd/display: Increase frame warning limit with KASAN or KCSAN in dml2") and that same change was made to dml in commit 5b750b22530f ("drm/amd/display: Increase frame warning limit with KASAN or KCSAN in dml") but the frame_warn_flag variable was not applied to all files. Do so now to clear up the warnings and make all these files consistent. Cc: sta...@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issue/1990 Signed-off-by: Nathan Chancellor --- drivers/gpu/drm/amd/display/dc/dml/Makefile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile b/drivers/gpu/drm/amd/display/dc/dml/Makefile index 6042a5a6a44f..59ade76ffb18 100644 --- a/drivers/gpu/drm/amd/display/dc/dml/Makefile +++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile @@ -72,11 +72,11 @@ CFLAGS_$(AMDDALPATH)/dc/dml/display_mode_lib.o := $(dml_ccflags) CFLAGS_$(AMDDALPATH)/dc/dml/display_mode_vba.o := $(dml_ccflags) CFLAGS_$(AMDDALPATH)/dc/dml/dcn10/dcn10_fpu.o := $(dml_ccflags) CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/dcn20_fpu.o := $(dml_ccflags) -CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20.o := $(dml_ccflags) +CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20.o := $(dml_ccflags) $(frame_warn_flag) CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20.o := $(dml_ccflags) -CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20v2.o := $(dml_ccflags) +CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_mode_vba_20v2.o := $(dml_ccflags) $(frame_warn_flag) CFLAGS_$(AMDDALPATH)/dc/dml/dcn20/display_rq_dlg_calc_20v2.o := $(dml_ccflags) -CFLAGS_$(AMDDALPATH)/dc/dml/dcn21/display_mode_vba_21.o := $(dml_ccflags) +CFLAGS_$(AMDDALPATH)/dc/dml/dcn21/display_mode_vba_21.o := $(dml_ccflags) $(frame_warn_flag) CFLAGS_$(AMDDALPATH)/dc/dml/dcn21/display_rq_dlg_calc_21.o := $(dml_ccflags) CFLAGS_$(AMDDALPATH)/dc/dml/dcn30/display_mode_vba_30.o := $(dml_ccflags) $(frame_warn_flag) CFLAGS_$(AMDDALPATH)/dc/dml/dcn30/display_rq_dlg_calc_30.o := $(dml_ccflags) --- base-commit: 6813cdca4ab94a238f8eb0cef3d3f3fcbdfb0ee0 change-id: 20240205-amdgpu-raise-flt-for-dml-vba-files-ee5b5a9c5e43 Best regards, -- Nathan Chancellor
[PATCH] drm/amd/display: Disable PSR-SU on Parade 08-01 TCON too
Stuart Hayhurst has found that both at bootup and fullscreen VA-API video is leading to black screens for around 1 second and kernel WARNING [1] traces when calling dmub_psr_enable() with Parade 08-01 TCON. These symptoms all go away with PSR-SU disabled for this TCON, so disable it for now while DMUB traces [2] from the failure can be analyzed and the failure state properly root caused. Cc: sta...@vger.kernel.org Cc: Marc Rossi Cc: Hamza Mahfooz Link: https://gitlab.freedesktop.org/drm/amd/uploads/a832dd515b571ee171b3e3b566e99a13/dmesg.log [1] Link: https://gitlab.freedesktop.org/drm/amd/uploads/8f13ff3b00963c833e23e68aa8116959/output.log [2] Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2645 Signed-off-by: Mario Limonciello --- --- drivers/gpu/drm/amd/display/modules/power/power_helpers.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/gpu/drm/amd/display/modules/power/power_helpers.c b/drivers/gpu/drm/amd/display/modules/power/power_helpers.c index e304e8435fb8..477289846a0a 100644 --- a/drivers/gpu/drm/amd/display/modules/power/power_helpers.c +++ b/drivers/gpu/drm/amd/display/modules/power/power_helpers.c @@ -841,6 +841,8 @@ bool is_psr_su_specific_panel(struct dc_link *link) isPSRSUSupported = false; else if (dpcd_caps->sink_dev_id_str[1] == 0x08 && dpcd_caps->sink_dev_id_str[0] == 0x03) isPSRSUSupported = false; + else if (dpcd_caps->sink_dev_id_str[1] == 0x08 && dpcd_caps->sink_dev_id_str[0] == 0x01) + isPSRSUSupported = false; else if (dpcd_caps->psr_info.force_psrsu_cap == 0x1) isPSRSUSupported = true; } -- 2.34.1
RE: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)
[Public] > -Original Message- > From: amd-gfx On Behalf Of > Shengyu Qu > Sent: Saturday, February 3, 2024 8:05 AM > To: Kuehling, Felix ; amd-gfx@lists.freedesktop.org > Cc: wiagn...@outlook.com; Cornwall, Jay ; > Koenig, Christian ; Paneer Selvam, Arunpravin > > Subject: Re: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2) > > Hi Felix, > Sorry for my late reply. I was busy this week. > I just did some more tests using next-20240202 branch. Testing using blender > 4.0.2, when only one HIP render task is running, there's no problem. > However, when two tasks run together, software always crashes, but not > crashes the whole system. Dmesg reports gpu reset in most cases, for > example: > > [ 176.071823] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring > gfx_0.0.0 timeout, signaled seq=32608, emitted seq=32610 [ 176.072000] > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process > information: process blender pid 4256 thread blender:cs0 pid 4297 > [ 176.072143] amdgpu :03:00.0: amdgpu: GPU reset begin! > [ 176.073571] amdgpu :03:00.0: amdgpu: Guilty job already signaled, > skipping HW reset [ 176.073593] amdgpu :03:00.0: amdgpu: GPU > reset(4) succeeded! > > And in some rare cases, there would be a page fault report, see dmesg.log. > Do you have any idea? Can I make it print more detailed diagnostic > information? Are you only seeing the problem with this patch applied or in general? If you are seeing it in general, it likely related to a firmware issue that was recently fixed that will be resolved with an update CP firmware image. Driver side changes: https://gitlab.freedesktop.org/agd5f/linux/-/commit/0eb6c664b780dd1b4080e047ad51b100cd7840a3 https://gitlab.freedesktop.org/agd5f/linux/-/commit/40970e60070ed3d1390ec65e38e819f6d81b8f0c Alex > > Best regards, > Shengyu > > > 在 2024/1/30 01:47, Felix Kuehling 写道: > > On 2024-01-29 10:24, Shengyu Qu wrote: > >> Hello Felix, > >> I think you are right. This problem has existed for years(just look > >> at the issue creation time in my link), and is thought caused by > >> OpenGL-ROCM interop(that's why I think this patch might help). It is > >> very easy to trigger this problem in blender(method is also mentioned > >> in the link). > > > > This doesn't help you, but it's unlikely that this has been the same > > issue for two years for everybody who chimed into this bug report. > > Different kernel versions, GPUs, user mode ROCm and Mesa versions etc. > > > > Case in point, it's possible that you're seeing an issue specific to > > RDNA3, which hasn't even been around for that long. > > > > > >> Do > >> you have any idea about this? > > > > Not without seeing a lot more diagnostic information. A full backtrace > > from your kernel log would be a good start. > > > > Regards, > > Felix > > > > > >> Best regards, > >> Shengyu > >> 在 2024/1/29 22:51, Felix Kuehling 写道: > >>> On 2024-01-29 8:58, Shengyu Qu wrote: > Hi, > Seems rocm-opengl interop hang problem still exists[1]. Btw have > you discovered into this problem? > Best regards, > Shengyu > [1] > https://projects.blender.org/blender/blender/issues/100353#issuecom > ment-599 > >>> > >>> Maybe you're having a different problem. Do you see this issue also > >>> without any version of the "Relocate TBA/TMA ..." patch? > >>> > >>> Regards, > >>> Felix > >>> > >>> > > 在 2024/1/27 03:15, Shengyu Qu 写道: > > Hello Felix, > > This patch seems working on my system, also it seems fixes the > > ROCM/OpenGL interop problem. > > Is this intended to happen or not? Maybe we need more users to > > test it. > > Besides, > > Tested-by: Shengyu Qu Best Regards, > Shengyu > > > > 在 2024/1/26 06:27, Felix Kuehling 写道: > >> The TBA and TMA, along with an unused IB allocation, reside at > >> low addresses in the VM address space. A stray VM fault which > >> hits these pages must be serviced by making their page table entries > invalid. > >> The scheduler depends upon these pages being resident and fails, > >> preventing a debugger from inspecting the failure state. > >> > >> By relocating these pages above 47 bits in the VM address space > >> they can only be reached when bits [63:48] are set to 1. This > >> makes it much less likely for a misbehaving program to generate > >> accesses to them. > >> The current placement at VA (PAGE_SIZE*2) is readily hit by a > >> NULL access with a small offset. > >> > >> v2: > >> - Move it to the reserved space to avoid concflicts with Mesa > >> - Add macros to make reserved space management easier > >> > >> Cc: Arunpravin Paneer Selvam > >> Cc: Christian Koenig > >> Signed-off-by: Jay Cornwall > >> Signed-off-by: Felix Kuehling > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 4 +-- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c| 7 ++--- > >>
[PATCH v3] drm/amdgpu: change vm->task_info handling
This patch changes the handling and lifecycle of vm->task_info object. The major changes are: - vm->task_info is a dynamically allocated ptr now, and its uasge is reference counted. - introducing two new helper funcs for task_info lifecycle management - amdgpu_vm_get_task_info: reference counts up task_info before returning this info - amdgpu_vm_put_task_info: reference counts down task_info - last put to task_info() frees task_info from the vm. This patch also does logistical changes required for existing usage of vm->task_info. V2: Do not block all the prints when task_info not found (Felix) V3: (Felix) - Fix wrong indentation - No debug message for -ENOMEM - Add NULL check for task_info - Do not duplicate the debug messages (ti vs no ti) - Get first reference of task_info in vm_init(), put last in vm_fini() Cc: Christian Koenig Cc: Alex Deucher Cc: Felix Kuehling Signed-off-by: Shashank Sharma --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 18 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 12 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 158 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 21 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 24 +-- drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 23 +-- drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 20 ++- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 23 +-- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 23 +-- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c| 22 +-- drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c | 20 +-- 13 files changed, 251 insertions(+), 124 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 0e61ebdb3f3e..f9eb12697b95 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -1775,9 +1775,14 @@ static int amdgpu_debugfs_vm_info_show(struct seq_file *m, void *unused) list_for_each_entry(file, >filelist, lhead) { struct amdgpu_fpriv *fpriv = file->driver_priv; struct amdgpu_vm *vm = >vm; + struct amdgpu_task_info *ti; + + ti = amdgpu_vm_get_task_info_vm(vm); + if (ti) { + seq_printf(m, "pid:%d\tProcess:%s --\n", ti->pid, ti->process_name); + amdgpu_vm_put_task_info(ti); + } - seq_printf(m, "pid:%d\tProcess:%s --\n", - vm->task_info.pid, vm->task_info.process_name); r = amdgpu_bo_reserve(vm->root.bo, true); if (r) break; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index 1f357198533f..e6e6d56398f2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c @@ -35,7 +35,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) { struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched); struct amdgpu_job *job = to_amdgpu_job(s_job); - struct amdgpu_task_info ti; + struct amdgpu_task_info *ti; struct amdgpu_device *adev = ring->adev; int idx; int r; @@ -48,7 +48,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) return DRM_GPU_SCHED_STAT_ENODEV; } - memset(, 0, sizeof(struct amdgpu_task_info)); + adev->job_hang = true; if (amdgpu_gpu_recovery && @@ -58,12 +58,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job) goto exit; } - amdgpu_vm_get_task_info(ring->adev, job->pasid, ); DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n", - job->base.sched->name, atomic_read(>fence_drv.last_seq), - ring->fence_drv.sync_seq); - DRM_ERROR("Process information: process %s pid %d thread %s pid %d\n", - ti.process_name, ti.tgid, ti.task_name, ti.pid); + job->base.sched->name, atomic_read(>fence_drv.last_seq), + ring->fence_drv.sync_seq); + + ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid); + if (ti) { + DRM_ERROR("Process information: process %s pid %d thread %s pid %d\n", + ti->process_name, ti->tgid, ti->task_name, ti->pid); + amdgpu_vm_put_task_info(ti); + } dma_fence_set_error(_job->s_fence->finished, -ETIME); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 4baa300121d8..a59364e9b6ed 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -230,8 +230,16 @@ void
RE: [PATCH] drm/amd/display: Implement bounds check for stream encoder creation in DCN301
[Public] Inline. > -Original Message- > From: SHANMUGAM, SRINIVASAN > Sent: Sunday, February 4, 2024 9:35 PM > To: Siqueira, Rodrigo ; Pillai, Aurabindo > > Cc: amd-gfx@lists.freedesktop.org; SHANMUGAM, SRINIVASAN > ; Li, Roman > Subject: [PATCH] drm/amd/display: Implement bounds check for stream > encoder creation in DCN301 > > 'stream_enc_regs' array is an array of dcn10_stream_enc_registers structures. > The array is initialized with four elements, corresponding to the four calls > to > stream_enc_regs() in the array initializer. This means that valid indices for > this > array are 0, 1, 2, and 3. > > The error message 'stream_enc_regs' 4 <= 5 below, is indicating that there is > an > attempt to access this array with an index of 5, which is out of bounds. This > could lead to undefined behavior > > Here, eng_id is used as an index to access the stream_enc_regs array. If > eng_id > is 5, this would result in an out-of-bounds access. > > Fixes the below: > drivers/gpu/drm/amd/amdgpu/../display/dc/resource/dcn301/dcn301_reso > urce.c:1011 dcn301_stream_encoder_create() error: buffer overflow > 'stream_enc_regs' 4 <= 5 Please mention that this is Smatch warning. In current implementation this function is called with eng_id limited by num_stream_encoder = 4 for dcn301. > > Fixes: 3a83e4e64bb1 ("drm/amd/display: Add dcn3.01 support to DC (v2)") > Cc: Roman Li > Cc: Rodrigo Siqueira > Cc: Aurabindo Pillai > Signed-off-by: Srinivasan Shanmugam > --- > .../display/dc/resource/dcn301/dcn301_resource.c | 16 > 1 file changed, 12 insertions(+), 4 deletions(-) > > diff --git > a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c > b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c > index 511ff6b5b985..f915d7c3980e 100644 > --- a/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c > +++ > b/drivers/gpu/drm/amd/display/dc/resource/dcn301/dcn301_resource.c > @@ -1006,10 +1006,18 @@ static struct stream_encoder > *dcn301_stream_encoder_create(enum engine_id eng_id > return NULL; > } > > - dcn30_dio_stream_encoder_construct(enc1, ctx, ctx->dc_bios, > - eng_id, vpg, afmt, > - _enc_regs[eng_id], > - _shift, _mask); > + if (eng_id < ARRAY_SIZE(stream_enc_regs)) { > + dcn30_dio_stream_encoder_construct(enc1, ctx, ctx- > >dc_bios, > +eng_id, vpg, afmt, > +_enc_regs[eng_id], > +_shift, _mask); > + } else { > + DRM_ERROR("Invalid engine id: %d\n", eng_id); > + kfree(enc1); > + kfree(vpg); > + kfree(afmt); > + return NULL; > + } Can you just extend the existing null checks instead? e.g. if (!enc1 || !vpg || !afmt || (eng_id >= ARRAY_SIZE(stream_enc_regs)) > > return >base; > } > -- > 2.34.1
Re: [PATCH] drm/amd/display: only call sysfs_remove_group() for eDP connectors
On 2024-02-05 10:18, Hamza Mahfooz wrote: > Since we only register the amdgpu sysfs group for eDP connectors, we > should only remove it from them. Otherwise, we run into a harmless > WARN() on device detach for all of the device's non-eDP connectors. > > Fixes: f97e4303da16 ("drm/amd/display: add panel_power_savings sysfs entry to > eDP connectors") > Signed-off-by: Hamza Mahfooz Reviewed-by: Harry Wentland Harry > --- > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > index adda423615a1..b3a5e730be24 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > @@ -6508,7 +6508,9 @@ static void amdgpu_dm_connector_unregister(struct > drm_connector *connector) > { > struct amdgpu_dm_connector *amdgpu_dm_connector = > to_amdgpu_dm_connector(connector); > > - sysfs_remove_group(>kdev->kobj, _group); > + if (connector->connector_type == DRM_MODE_CONNECTOR_eDP) > + sysfs_remove_group(>kdev->kobj, _group); > + > drm_dp_aux_unregister(_dm_connector->dm_dp_aux.aux); > } >
[PATCH] drm/amd/display: only call sysfs_remove_group() for eDP connectors
Since we only register the amdgpu sysfs group for eDP connectors, we should only remove it from them. Otherwise, we run into a harmless WARN() on device detach for all of the device's non-eDP connectors. Fixes: f97e4303da16 ("drm/amd/display: add panel_power_savings sysfs entry to eDP connectors") Signed-off-by: Hamza Mahfooz --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index adda423615a1..b3a5e730be24 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -6508,7 +6508,9 @@ static void amdgpu_dm_connector_unregister(struct drm_connector *connector) { struct amdgpu_dm_connector *amdgpu_dm_connector = to_amdgpu_dm_connector(connector); - sysfs_remove_group(>kdev->kobj, _group); + if (connector->connector_type == DRM_MODE_CONNECTOR_eDP) + sysfs_remove_group(>kdev->kobj, _group); + drm_dp_aux_unregister(_dm_connector->dm_dp_aux.aux); } -- 2.43.0
[lvc-project] [PATCH] drm/amd/pm: check return value of amdgpu_irq_add_id()
amdgpu_irq_ad_id() may fail and the irq handlers will not be registered. This patch adds error code check. Found by Linux Verification Center (linuxtesting.org). Signed-off-by: Igor Artemiev --- .../gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c| 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c index 79a566f3564a..9cb965479dd8 100644 --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu_helper.c @@ -647,26 +647,34 @@ int smu9_register_irq_handlers(struct pp_hwmgr *hwmgr) { struct amdgpu_irq_src *source = kzalloc(sizeof(struct amdgpu_irq_src), GFP_KERNEL); + int ret; if (!source) return -ENOMEM; source->funcs = _irq_funcs; - amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev), + ret = amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev), SOC15_IH_CLIENTID_THM, THM_9_0__SRCID__THM_DIG_THERM_L2H, source); - amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev), + if (ret) + return ret; + + ret = amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev), SOC15_IH_CLIENTID_THM, THM_9_0__SRCID__THM_DIG_THERM_H2L, source); + if (ret) + return ret; /* Register CTF(GPIO_19) interrupt */ - amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev), + ret = amdgpu_irq_add_id((struct amdgpu_device *)(hwmgr->adev), SOC15_IH_CLIENTID_ROM_SMUIO, SMUIO_9_0__SRCID__SMUIO_GPIO19, source); + if (ret) + return ret; return 0; } -- 2.39.2
Re: [lvc-project] [PATCH] drm/amd/pm: check return value of amdgpu_irq_add_id()
On 05.02.2024 15:25, Igor Artemiev wrote: > amdgpu_irq_ad_id() may fail and the irq handlers will not be registered. > This patch adds error code check. But what is about deallocation of already allocated memory? -- Alexey
Re: [PATCH] PCI: Add vf reset notification for pf
On Sun, 4 Feb 2024 14:12:57 +0800 Emily Deng wrote: > When a vf has been reset, the pf wants to get notification to remove > the vf out of schedule. > > Solution: > Add the callback function in pci_driver sriov_vf_reset_notification. > When vf reset happens, then call this callback function. > > Signed-off-by: Emily Deng > --- > drivers/pci/pci.c | 8 > include/linux/pci.h | 1 + > 2 files changed, 9 insertions(+) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 60230da957e0..aca937b05531 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -4780,6 +4780,14 @@ EXPORT_SYMBOL_GPL(pcie_flr); > */ > int pcie_reset_flr(struct pci_dev *dev, bool probe) > { > + struct pci_dev *pf_dev; > + > + if (dev->is_virtfn) { > + pf_dev = dev->physfn; > + if (pf_dev->driver->sriov_vf_reset_notification) > + > pf_dev->driver->sriov_vf_reset_notification(pf_dev, dev); > + } > + > if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET) > return -ENOTTY; > > diff --git a/include/linux/pci.h b/include/linux/pci.h > index c69a2cc1f412..4fa31d9b0aa7 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -926,6 +926,7 @@ struct pci_driver { > int (*sriov_configure)(struct pci_dev *dev, int num_vfs); > /* On PF */ int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int > msix_vec_count); /* On PF */ u32 (*sriov_get_vf_total_msix)(struct > pci_dev *pf); > + void (*sriov_vf_reset_notification)(struct pci_dev *pf, > struct pci_dev *vf); const struct pci_error_handlers *err_handler; > const struct attribute_group **groups; > const struct attribute_group **dev_groups; Hi: I would suggest you can provide a cover letter including a complete picture that tells the background, detailed problem statement, the solutions and plus the users. As this seems very like a generic change, it needs a better justification to convince folks why this is the best solution. Without a complete picture, the solution just looks like a workaround. Thanks, Zhi.
Re: [RFC PATCH 2/2] drm/amd/display: switch amdgpu_dm_connector to use struct drm_edid
On 01/26, Alex Hung wrote: > > > On 2024-01-26 09:28, Melissa Wen wrote: > > Replace raw edid handling (struct edid) with the opaque EDID type > > (struct drm_edid) on amdgpu_dm_connector for consistency. It may also > > prevent mismatch of approaches in different parts of the driver code. > > Working in progress. There are a couple of cast warnings and it was only > > tested with IGT. > > > > Signed-off-by: Melissa Wen > > --- > > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 63 ++- > > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h | 4 +- > > .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 9 +-- > > .../display/amdgpu_dm/amdgpu_dm_mst_types.c | 23 +++ > > 4 files changed, 51 insertions(+), 48 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > index 68e71e2ea472..741081d73bb3 100644 > > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > > @@ -3277,12 +3277,12 @@ void amdgpu_dm_update_connector_after_detect( > > >dm_dp_aux.aux); > > } > > } else { > > - aconnector->edid = > > - (struct edid *)sink->dc_edid.raw_edid; > > + const struct edid *edid = (const struct edid > > *)sink->dc_edid.raw_edid; > > + aconnector->edid = drm_edid_alloc(edid, > > (edid->extensions + 1) * EDID_LENGTH); > > if (aconnector->dc_link->aux_mode) > > drm_dp_cec_set_edid(>dm_dp_aux.aux, > > - aconnector->edid); > > + > > drm_edid_raw(aconnector->edid)); > > } > > if (!aconnector->timing_requested) { > > @@ -3293,13 +3293,13 @@ void amdgpu_dm_update_connector_after_detect( > > "failed to create > > aconnector->requested_timing\n"); > > } > > - drm_connector_update_edid_property(connector, aconnector->edid); > > + drm_edid_connector_update(connector, aconnector->edid); > > amdgpu_dm_update_freesync_caps(connector, aconnector->edid); > > update_connector_ext_caps(aconnector); > > } else { > > drm_dp_cec_unset_edid(>dm_dp_aux.aux); > > amdgpu_dm_update_freesync_caps(connector, NULL); > > - drm_connector_update_edid_property(connector, NULL); > > + drm_edid_connector_update(connector, NULL); > > aconnector->num_modes = 0; > > dc_sink_release(aconnector->dc_sink); > > aconnector->dc_sink = NULL; > > @@ -6564,7 +6564,6 @@ static void amdgpu_dm_connector_funcs_force(struct > > drm_connector *connector) > > struct dc_link *dc_link = aconnector->dc_link; > > struct dc_sink *dc_em_sink = aconnector->dc_em_sink; > > const struct drm_edid *drm_edid; > > - const struct edid *edid; > > /* > > * Note: drm_get_edid gets edid in the following order: > > @@ -6578,11 +6577,12 @@ static void amdgpu_dm_connector_funcs_force(struct > > drm_connector *connector) > > DRM_ERROR("No EDID found on connector: %s.\n", connector->name); > > return; > > } > > - edid = drm_edid_raw(drm_edid); > > - aconnector->edid = edid; > > - > > + aconnector->edid = drm_edid; > > + drm_edid_connector_update(connector, drm_edid); > > /* Update emulated (virtual) sink's EDID */ > > if (dc_em_sink && dc_link) { > > + const struct edid *edid = drm_edid_raw(drm_edid); > > + > > memset(_em_sink->edid_caps, 0, sizeof(struct dc_edid_caps)); > > memmove(dc_em_sink->dc_edid.raw_edid, (uint8_t *)edid, > > (edid->extensions + 1) * EDID_LENGTH); > > dm_helpers_parse_edid_caps( > > @@ -6633,13 +6633,13 @@ static void create_eml_sink(struct > > amdgpu_dm_connector *aconnector) > > return; > > } > > - edid = drm_edid_raw(drm_edid); > > - > > - if (drm_detect_hdmi_monitor(edid)) > > + if (connector->display_info.is_hdmi) > > init_params.sink_signal = SIGNAL_TYPE_HDMI_TYPE_A; > > - aconnector->edid = edid; > > + aconnector->edid = drm_edid; > > + drm_edid_connector_update(connector, drm_edid); > > + edid = drm_edid_raw(drm_edid); > > aconnector->dc_em_sink = dc_link_add_remote_sink( > > aconnector->dc_link, > > (uint8_t *)edid, > > @@ -7322,16 +7322,16 @@ static void amdgpu_set_panel_orientation(struct > > drm_connector *connector) > > } > > static void amdgpu_dm_connector_ddc_get_modes(struct drm_connector > > *connector, > > - struct edid *edid) > > + const struct drm_edid *drm_edid) > > { > > struct amdgpu_dm_connector
Re: [RFC PATCH 0/2] drm/amd/display: switch amdgpu_dm_connector to
On 01/29, Jani Nikula wrote: > On Fri, 26 Jan 2024, Mario Limonciello wrote: > > On 1/26/2024 10:28, Melissa Wen wrote: > >> Hi, > >> > >> I'm debugging a null-pointer dereference when running > >> igt@kms_connector_force_edid and the way I found to solve the bug is to > >> stop using raw edid handler in amdgpu_connector_funcs_force and > >> create_eml_sink in favor of managing resouces via sruct drm_edid helpers > >> (Patch 1). The proper solution seems to be switch amdgpu_dm_connector > >> from struct edid to struct drm_edid and avoid the usage of different > >> approaches in the driver (Patch 2). However, doing it implies a good > >> amount of work and validation, therefore I decided to send this RFC > >> first to collect opinions and check if there is any parallel work on > >> this side. It's a working in progress. > >> > >> The null-pointer error trigger by the igt@kms_connector_force_edid test > >> was introduced by: > >> - e54ed41620f ("drm/amd/display: Remove unwanted drm edid references") > >> > >> You can check the error trace in the first patch. > >> > >> This series was tested with kms_hdmi_inject and kms_force_connector. No > >> null-pointer error, kms_hdmi_inject is successul and kms_force_connector > >> is sucessful after the second execution - the force-edid subtest > >> still fails in the first run (I'm still investigating). > >> > >> There is also a couple of cast warnings to be addressed - I'm looking > >> for the best replacement. > >> > >> I appreciate any feedback and testing. > > > > So I'm actually a little bit worried by hardcoding EDID_LENGTH in this > > series. > > > > I have some other patches that I'm posting later on that let you get the > > EDID from _DDC BIOS method too. My observation was that the EDID can be > > anywhere up to 512 bytes according to the ACPI spec. > > > > An earlier version of my patch was using EDID_LENGTH when fetching it > > and the EDID checksum failed. > > > > I'll CC you on the post, we probably want to get your changes and mine > > merged together. > > One of the main points of struct drm_edid is that it tracks the > allocation size separately. > > We should simply not trust edid->extensions, because most of the time it > originates from outside the kernel. > > Using drm_edid and immediately drm_edid_raw() falls short. That function > should only be used during migration to help. And yeah, it also means > EDID parsing should be done in drm_edid.c, and not spread out all over > the subsystem. Hi Mario and Jani, Thanks for the feedback. I agree with you. I used the drm_edid_raw() as an intermediate step to assess/validate this migration, but I'll work on removing this hack. So, I understand that the transition from edid to drm_edid is the right path, so I'll improve this work (keeping the points you raised in mind) and send a version. BR, Melissa > > > BR, > Jani. > > > > > >> > >> Melissa > >> > >> Melissa Wen (2): > >>drm/amd/display: fix null-pointer dereference on edid reading > >>drm/amd/display: switch amdgpu_dm_connector to use struct drm_edid > >> > >> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 78 ++- > >> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h | 4 +- > >> .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 9 ++- > >> .../display/amdgpu_dm/amdgpu_dm_mst_types.c | 23 +++--- > >> 4 files changed, 60 insertions(+), 54 deletions(-) > >> > > > > -- > Jani Nikula, Intel
RE: [PATCH 00/21] DC Patches January 31, 2024
[Public] Hi all, This week this patchset was tested on the following systems: * Lenovo ThinkBook T13s Gen4 with AMD Ryzen 5 6600U * MSI Gaming X Trio RX 6800 * Gigabyte Gaming OC RX 7900 XTX These systems were tested on the following display/connection types: * eDP, (1080p 60hz [5650U]) (1920x1200 60hz [6600U]) (2560x1600 120hz[6600U]) * VGA and DVI (1680x1050 60hz [DP to VGA/DVI, USB-C to VGA/DVI]) * DP/HDMI/USB-C (1440p 170hz, 4k 60hz, 4k 144hz, 4k 240hz [Includes USB-C to DP/HDMI adapters]) * Thunderbolt (LG Ultrafine 5k) * MST (Startech MST14DP123DP [DP to 3x DP] and 2x 4k 60Hz displays) * DSC (with Cable Matters 101075 [DP to 3x DP] with 3x 4k60 displays, and HP Hook G2 with 1 4k60 display) * USB 4 (Kensington SD5700T and 1x 4k 60Hz display) * PCON (Club3D CAC-1085 and 1x 4k 144Hz display [at 4k 120HZ, as that is the max the adapter supports]) The testing is a mix of automated and manual tests. Manual testing includes (but is not limited to): * Changing display configurations and settings * Benchmark testing * Feature testing (Freesync, etc.) Automated testing includes (but is not limited to): * Script testing (scripts to automate some of the manual checks) * IGT testing The patchset consists of the amd-staging-drm-next branch (Head commit - 0b48b36f80b0 -> drm/amdgpu: Need to resume ras during gpu reset for gfx v9_4_3 sriov) with new patches added on top of it. Tested on Ubuntu 22.04.3, on Wayland and X11, using KDE Plasma and Gnome. Tested-by: Daniel Wheeler Thank you, Dan Wheeler Sr. Technologist | AMD SW Display -- 1 Commerce Valley Dr E, Thornhill, ON L3T 7X6 Facebook | Twitter | amd.com -Original Message- From: amd-gfx On Behalf Of Hamza Mahfooz Sent: Wednesday, January 31, 2024 3:11 PM To: amd-gfx@lists.freedesktop.org Cc: Chung, ChiaHsuan (Tom) ; Li, Sun peng (Leo) ; Siqueira, Rodrigo ; Li, Roman ; Zuo, Jerry ; Pillai, Aurabindo ; Wu, Hersen ; Mahfooz, Hamza ; Lin, Wayne ; Wentland, Harry ; Gutierrez, Agustin Subject: [PATCH 00/21] DC Patches January 31, 2024 This version brings along the following: * DCN35 fixes * DMUB fixes * Link training fixes * Misc code style fixes * MST fixes * ODM fixes * SubVP fixes Allen Pan (1): drm/amd/display: correct static screen event mask Alvin Lee (2): Revert "drm/amd/display: For FPO and SubVP/DRR configs program vmin/max sel" drm/amd/display: Update phantom pipe enable / disable sequence Aric Cyr (1): drm/amd/display: 3.2.271 Camille Cho (1): drm/amd/display: correct comment in set_default_brightness_aux() Ethan Bitnun (2): drm/amd/display: Add delay before logging clks from hw drm/amd/display: Adjust set_p_state calls to fix logging Fangzhi Zuo (1): drm/amd/display: Fix MST Null Ptr for RV George Shen (2): drm/amd/display: Add debug option to force 1-tap chroma subsampling drm/amd/display: Add left edge pixel for YCbCr422/420 + ODM pipe split Michael Strauss (2): drm/amd/display: Remove Legacy FIXED_VS Transparent LT Sequence drm/amd/display: Don't perform rate toggle on DP2-capable FIXED_VS retimers Nicholas Kazlauskas (4): drm/amd/display: Add more checks for exiting idle in DC drm/amd/display: Disable timeout in more places for dc_dmub_srv drm/amd/display: Increase eval/entry delay for DCN35 drm/amd/display: Disable idle reallow as part of command/gpint execution Rodrigo Siqueira (4): drm/amd/display: Drop legacy code drm/amd/display: Disable ODM by default for DCN35 drm/amd/display: Trivial code style adjustment drm/amd/display: Drop some unnecessary guards Wenjing Liu (1): drm/amd/display: set odm_combine_policy based on context in dcn32 resource .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 12 +- .../amd/display/dc/clk_mgr/dcn21/rn_clk_mgr.c | 2 - .../dc/clk_mgr/dcn21/rn_clk_mgr_vbios_smu.c | 4 - .../display/dc/clk_mgr/dcn301/dcn301_smu.c| 4 - .../amd/display/dc/clk_mgr/dcn31/dcn31_smu.c | 4 - .../display/dc/clk_mgr/dcn314/dcn314_smu.c| 6 - .../display/dc/clk_mgr/dcn315/dcn315_smu.c| 4 - .../display/dc/clk_mgr/dcn316/dcn316_smu.c| 4 - .../display/dc/clk_mgr/dcn32/dcn32_clk_mgr.c | 2 + .../dc/clk_mgr/dcn32/dcn32_clk_mgr_smu_msg.h | 3 +- drivers/gpu/drm/amd/display/dc/core/dc.c | 115 +++--- .../gpu/drm/amd/display/dc/core/dc_resource.c | 68 ++-- .../gpu/drm/amd/display/dc/core/dc_stream.c | 18 + .../gpu/drm/amd/display/dc/core/dc_surface.c | 2 + drivers/gpu/drm/amd/display/dc/dc.h | 9 +- drivers/gpu/drm/amd/display/dc/dc_dmub_srv.c | 17 +- drivers/gpu/drm/amd/display/dc/dc_hw_types.h | 2 - drivers/gpu/drm/amd/display/dc/dc_stream.h| 2 + .../gpu/drm/amd/display/dc/dcn10/dcn10_opp.c | 7 +
Re: [PATCH] drm/amd/display: Clear phantom stream count and plane count
[Public] Acked-by: Alex Deucher From: amd-gfx on behalf of Mario Limonciello Sent: Friday, February 2, 2024 7:30 PM To: amd-gfx@lists.freedesktop.org Cc: Limonciello, Mario Subject: [PATCH] drm/amd/display: Clear phantom stream count and plane count When dc_state_destruct() was refactored the new phantom_stream_count and phantom_plane_count members weren't cleared. Fixes: 012a04b1d6af ("drm/amd/display: Refactor phantom resource allocation") Signed-off-by: Mario Limonciello --- drivers/gpu/drm/amd/display/dc/core/dc_state.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/display/dc/core/dc_state.c b/drivers/gpu/drm/amd/display/dc/core/dc_state.c index 88c6436b28b6..180ac47868c2 100644 --- a/drivers/gpu/drm/amd/display/dc/core/dc_state.c +++ b/drivers/gpu/drm/amd/display/dc/core/dc_state.c @@ -291,11 +291,14 @@ void dc_state_destruct(struct dc_state *state) dc_stream_release(state->phantom_streams[i]); state->phantom_streams[i] = NULL; } + state->phantom_stream_count = 0; for (i = 0; i < state->phantom_plane_count; i++) { dc_plane_state_release(state->phantom_planes[i]); state->phantom_planes[i] = NULL; } + state->phantom_plane_count = 0; + state->stream_mask = 0; memset(>res_ctx, 0, sizeof(state->res_ctx)); memset(>pp_display_cfg, 0, sizeof(state->pp_display_cfg)); -- 2.34.1
Re: [PATCH] PCI: Add vf reset notification for pf
Am 04.02.24 um 07:12 schrieb Emily Deng: When a vf has been reset, the pf wants to get notification to remove the vf out of schedule. Solution: Add the callback function in pci_driver sriov_vf_reset_notification. When vf reset happens, then call this callback function. Well that doesn't make much sense. As other already noted as well a VF should be an encapsulated representation of a physical devices functionality. AMD implemented that a bit different with a hypervisor to control which PF functionality a VF exposes, but that doesn't mean that we can leak this AMD specific handling into the common Linux PCI subsystem. Additional to that a technical blocker is that when a VF is passed into a VM you don't have access to the PF any more to make this reset notification. Regards, Christian. Signed-off-by: Emily Deng --- drivers/pci/pci.c | 8 include/linux/pci.h | 1 + 2 files changed, 9 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 60230da957e0..aca937b05531 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -4780,6 +4780,14 @@ EXPORT_SYMBOL_GPL(pcie_flr); */ int pcie_reset_flr(struct pci_dev *dev, bool probe) { + struct pci_dev *pf_dev; + + if (dev->is_virtfn) { + pf_dev = dev->physfn; + if (pf_dev->driver->sriov_vf_reset_notification) + pf_dev->driver->sriov_vf_reset_notification(pf_dev, dev); + } + if (dev->dev_flags & PCI_DEV_FLAGS_NO_FLR_RESET) return -ENOTTY; diff --git a/include/linux/pci.h b/include/linux/pci.h index c69a2cc1f412..4fa31d9b0aa7 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -926,6 +926,7 @@ struct pci_driver { int (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */ int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */ u32 (*sriov_get_vf_total_msix)(struct pci_dev *pf); + void (*sriov_vf_reset_notification)(struct pci_dev *pf, struct pci_dev *vf); const struct pci_error_handlers *err_handler; const struct attribute_group **groups; const struct attribute_group **dev_groups;
Re: [PATCH 1/3] driver core: bus: introduce can_remove()
Am 02.02.24 um 23:25 schrieb Hamza Mahfooz: Currently, drivers have no mechanism to block requests to unbind devices. However, this can cause resource leaks and leave the device in an inconsistent state, such that rebinding the device may cause a hang or otherwise prevent the device from being rebound. So, introduce the can_remove() callback to allow drivers to indicate if it isn't appropriate to remove a device at the given time. Well that is nonsense. When you physically remove a device (e.g. unplug it) then there is nothing in software you can do to prevent that. Regards, Christian. Cc: sta...@vger.kernel.org Signed-off-by: Hamza Mahfooz --- drivers/base/bus.c | 4 include/linux/device/bus.h | 2 ++ 2 files changed, 6 insertions(+) diff --git a/drivers/base/bus.c b/drivers/base/bus.c index daee55c9b2d9..7c259b01ea99 100644 --- a/drivers/base/bus.c +++ b/drivers/base/bus.c @@ -239,6 +239,10 @@ static ssize_t unbind_store(struct device_driver *drv, const char *buf, dev = bus_find_device_by_name(bus, NULL, buf); if (dev && dev->driver == drv) { + if (dev->bus && dev->bus->can_remove && + !dev->bus->can_remove(dev)) + return -EBUSY; + device_driver_detach(dev); err = count; } diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h index 5ef4ec1c36c3..c9d4af0ed3b8 100644 --- a/include/linux/device/bus.h +++ b/include/linux/device/bus.h @@ -46,6 +46,7 @@ struct fwnode_handle; *be called at late_initcall_sync level. If the device has *consumers that are never bound to a driver, this function *will never get called until they do. + * @can_remove: Called before attempting to remove a device from this bus. * @remove: Called when a device removed from this bus. * @shutdown: Called at shut-down time to quiesce the device. * @@ -85,6 +86,7 @@ struct bus_type { int (*uevent)(const struct device *dev, struct kobj_uevent_env *env); int (*probe)(struct device *dev); void (*sync_state)(struct device *dev); + bool (*can_remove)(struct device *dev); void (*remove)(struct device *dev); void (*shutdown)(struct device *dev);
RE: [PATCH Review 1/1] drm/amdgpu: Fix shared buff copy to user
[AMD Official Use Only - General] Reviewed-by: Hawking Zhang Regards, Hawking -Original Message- From: amd-gfx On Behalf Of Stanley.Yang Sent: Monday, February 5, 2024 16:10 To: amd-gfx@lists.freedesktop.org Cc: Yang, Stanley Subject: [PATCH Review 1/1] drm/amdgpu: Fix shared buff copy to user ta if invoke node buffer | ta type --| | ta id --| | cmd id --| |-- shared buf len -| |-- shared buffer --| ta if invoke node buffer is as above, copy shared buffer data to correct location Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c index 468a67b302d4..ca5c86e5f7cd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c @@ -362,7 +362,7 @@ static ssize_t ta_if_invoke_debugfs_write(struct file *fp, const char *buf, size } } - if (copy_to_user((char *)buf, context->mem_context.shared_buf, shared_buf_len)) + if (copy_to_user((char *)[copy_pos], context->mem_context.shared_buf, shared_buf_len)) ret = -EFAULT; err_free_shared_buf: -- 2.25.1
Re: drm/amdkfd: Relocate TBA/TMA to opposite side of VM hole (v2)
Hi Felix, Sorry for my late reply. I was busy this week. I just did some more tests using next-20240202 branch. Testing using blender 4.0.2, when only one HIP render task is running, there's no problem. However, when two tasks run together, software always crashes, but not crashes the whole system. Dmesg reports gpu reset in most cases, for example: [ 176.071823] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=32608, emitted seq=32610 [ 176.072000] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process blender pid 4256 thread blender:cs0 pid 4297 [ 176.072143] amdgpu :03:00.0: amdgpu: GPU reset begin! [ 176.073571] amdgpu :03:00.0: amdgpu: Guilty job already signaled, skipping HW reset [ 176.073593] amdgpu :03:00.0: amdgpu: GPU reset(4) succeeded! And in some rare cases, there would be a page fault report, see dmesg.log. Do you have any idea? Can I make it print more detailed diagnostic information? Best regards, Shengyu 在 2024/1/30 01:47, Felix Kuehling 写道: On 2024-01-29 10:24, Shengyu Qu wrote: Hello Felix, I think you are right. This problem has existed for years(just look at the issue creation time in my link), and is thought caused by OpenGL-ROCM interop(that's why I think this patch might help). It is very easy to trigger this problem in blender(method is also mentioned in the link). This doesn't help you, but it's unlikely that this has been the same issue for two years for everybody who chimed into this bug report. Different kernel versions, GPUs, user mode ROCm and Mesa versions etc. Case in point, it's possible that you're seeing an issue specific to RDNA3, which hasn't even been around for that long. Do you have any idea about this? Not without seeing a lot more diagnostic information. A full backtrace from your kernel log would be a good start. Regards, Felix Best regards, Shengyu 在 2024/1/29 22:51, Felix Kuehling 写道: On 2024-01-29 8:58, Shengyu Qu wrote: Hi, Seems rocm-opengl interop hang problem still exists[1]. Btw have you discovered into this problem? Best regards, Shengyu [1] https://projects.blender.org/blender/blender/issues/100353#issuecomment-599 Maybe you're having a different problem. Do you see this issue also without any version of the "Relocate TBA/TMA ..." patch? Regards, Felix 在 2024/1/27 03:15, Shengyu Qu 写道: Hello Felix, This patch seems working on my system, also it seems fixes the ROCM/OpenGL interop problem. Is this intended to happen or not? Maybe we need more users to test it. Besides, Tested-by: Shengyu Qu Best Regards, Shengyu 在 2024/1/26 06:27, Felix Kuehling 写道: The TBA and TMA, along with an unused IB allocation, reside at low addresses in the VM address space. A stray VM fault which hits these pages must be serviced by making their page table entries invalid. The scheduler depends upon these pages being resident and fails, preventing a debugger from inspecting the failure state. By relocating these pages above 47 bits in the VM address space they can only be reached when bits [63:48] are set to 1. This makes it much less likely for a misbehaving program to generate accesses to them. The current placement at VA (PAGE_SIZE*2) is readily hit by a NULL access with a small offset. v2: - Move it to the reserved space to avoid concflicts with Mesa - Add macros to make reserved space management easier Cc: Arunpravin Paneer Selvam Cc: Christian Koenig Signed-off-by: Jay Cornwall Signed-off-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 4 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c | 7 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 12 ++-- drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c | 30 +++- 4 files changed, 30 insertions(+), 23 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c index 823d31f4a2a3..53d0a458d78e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c @@ -28,9 +28,9 @@ uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev) { - uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT; + uint64_t addr = AMDGPU_VA_RESERVED_CSA_START( + adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT); - addr -= AMDGPU_VA_RESERVED_CSA_SIZE; addr = amdgpu_gmc_sign_extend(addr); return addr; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c index 3d0d56087d41..9e769ef50f2e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_seq64.c @@ -45,11 +45,8 @@ */ static inline u64 amdgpu_seq64_get_va_base(struct amdgpu_device *adev) { - u64 addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT; - - addr -= AMDGPU_VA_RESERVED_TOP; - - return addr; + return AMDGPU_VA_RESERVED_SEQ64_START( +
Re: [PATCH 2/3] PCI: introduce can_remove()
On Fri, Feb 02, 2024 at 05:25:55PM -0500, Hamza Mahfooz wrote: > Wire up the can_remove() callback, such that pci drivers can implement > their own version of it. > > Cc: sta...@vger.kernel.org > Signed-off-by: Hamza Mahfooz > --- Again, sorry, nope, not allowed.
Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote: > Removing an amdgpu device that still has user space references allocated > to it causes undefined behaviour. So, implement amdgpu_pci_can_remove() > and disallow devices that still have files allocated to them from being > unbound. > > Cc: sta...@vger.kernel.org > Signed-off-by: Hamza Mahfooz > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 17 + > 1 file changed, 17 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > index cc69005f5b46..cfa64f3c5be5 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > @@ -2323,6 +2323,22 @@ static int amdgpu_pci_probe(struct pci_dev *pdev, > return ret; > } > > +static bool amdgpu_pci_can_remove(struct pci_dev *pdev) > +{ > + struct drm_device *dev = pci_get_drvdata(pdev); > + > + mutex_lock(>filelist_mutex); > + > + if (!list_empty(>filelist)) { > + mutex_unlock(>filelist_mutex); > + return false; > + } > + > + mutex_unlock(>filelist_mutex); > + > + return true; Also, to be pedantic, this will not work as right after you returned "true" here, userspace could open a file, causing the same issue you are trying to prevent to have happen, happen. So even if we wanted to do this, which again, we do not, this isn't even a solution for it because it will still cause you problems. greg k-h
Re: [PATCH 1/3] driver core: bus: introduce can_remove()
On Fri, Feb 02, 2024 at 05:25:54PM -0500, Hamza Mahfooz wrote: > Currently, drivers have no mechanism to block requests to unbind > devices. And that is by design. > However, this can cause resource leaks and leave the device in > an inconsistent state, such that rebinding the device may cause a hang > or otherwise prevent the device from being rebound. That is a driver bug, please fix your driver. > So, introduce the can_remove() callback to allow drivers to indicate > if it isn't appropriate to remove a device at the given time. Nope, sorry, the driver needs to be fixed. What broken driver are you needing this for? Also realize, only root can unbind drivers (and it can also unload modules), so if the root user really wants to do this, it can, and should be possible to. sorry, greg k-h
Re: [PATCH 3/3] drm/amdgpu: wire up the can_remove() callback
On Fri, Feb 02, 2024 at 05:25:56PM -0500, Hamza Mahfooz wrote: > Removing an amdgpu device that still has user space references allocated > to it causes undefined behaviour. Then fix that please. There should not be anything special about your hardware that all of the tens of thousands of other devices can't handle today. What happens when I yank your device out of a system with a pci hotplug bus? You can't prevent that either, so this should not be any different at all. sorry, but please, just fix your driver. greg k-h
[PATCH Review 1/1] drm/amdgpu: Fix shared buff copy to user
ta if invoke node buffer | ta type --| | ta id --| | cmd id --| |-- shared buf len -| |-- shared buffer --| ta if invoke node buffer is as above, copy shared buffer data to correct location Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c index 468a67b302d4..ca5c86e5f7cd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp_ta.c @@ -362,7 +362,7 @@ static ssize_t ta_if_invoke_debugfs_write(struct file *fp, const char *buf, size } } - if (copy_to_user((char *)buf, context->mem_context.shared_buf, shared_buf_len)) + if (copy_to_user((char *)[copy_pos], context->mem_context.shared_buf, shared_buf_len)) ret = -EFAULT; err_free_shared_buf: -- 2.25.1