Re: [RFC v3 06/12] drm/amdgpu: Drop hive->in_reset

2022-02-08 Thread Andrey Grodzovsky
On 2022-02-08 01:33, Lazar, Lijo wrote: On 1/26/2022 4:07 AM, Andrey Grodzovsky wrote: Since we serialize all resets no need to protect from concurrent resets. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König ---   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19

Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-02-02 Thread Andrey Grodzovsky
Just another ping, with Shyun's help I was able to do some smoke testing on XGMI SRIOV system (booting and triggering hive reset) and for now looks good. Andrey On 2022-01-28 14:36, Andrey Grodzovsky wrote: Just a gentle ping if people have more comments on this patch set ? Especially last 5

[RFC v4] drm/amdgpu: Rework reset domain to be refcounted.

2022-02-02 Thread Andrey Grodzovsky
on boot with XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if previous was of single device type meaning it's first boot. Otherwise it will take a refcount to existing reset_domain from the amdgpu device. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd
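
The selection rule described in this snippet can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the amdgpu code: every name below (`reset_domain_create`, `reset_domain_get`, `reset_domain_put`, `hive_pick_domain`, the enum values) is hypothetical, and a bare int stands in for the atomic reference count a kernel driver would normally use (for example `struct kref`).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not the amdgpu definitions. */
enum reset_domain_type { SINGLE_DEVICE, XGMI_HIVE };

struct reset_domain {
    int refcount;                 /* plain int; not thread-safe */
    enum reset_domain_type type;
};

/* Allocate a domain holding one reference for the caller. */
static struct reset_domain *reset_domain_create(enum reset_domain_type type)
{
    struct reset_domain *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    d->refcount = 1;
    d->type = type;
    return d;
}

/* Take an extra reference on an existing domain. */
static struct reset_domain *reset_domain_get(struct reset_domain *d)
{
    assert(d->refcount > 0);
    d->refcount++;
    return d;
}

/* Drop a reference; returns 1 if this call freed the domain. */
static int reset_domain_put(struct reset_domain *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0) {
        free(d);
        return 1;
    }
    return 0;
}

/*
 * Rule from the snippet: if the device still carries a single-device
 * domain, treat it as first boot and create a new hive domain;
 * otherwise take a reference on the domain the device already has.
 */
static struct reset_domain *hive_pick_domain(struct reset_domain *dev_domain)
{
    if (dev_domain->type == SINGLE_DEVICE)
        return reset_domain_create(XGMI_HIVE);
    return reset_domain_get(dev_domain);
}
```

Whoever calls `hive_pick_domain()` owns one reference on the result either way, so a matching `reset_domain_put()` is needed regardless of which branch was taken.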

Re: [PATCH] drm/amdgpu: Handle the GPU recovery failure in SRIOV environment.

2022-02-02 Thread Andrey Grodzovsky
On 2022-02-01 16:47, Surbhi Kakarya wrote: This patch handles the GPU recovery failure in sriov environment by retrying the reset if the first reset fails. To determine the condition of retry, a new function amdgpu_is_retry_sriov_reset() is added which returns true if failure is due to

Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-01-28 Thread Andrey Grodzovsky
Just a gentle ping if people have more comments on this patch set ? Especially last 5 patches as first 7 are exact same as V2 and we already went over them mostly. Andrey On 2022-01-25 17:37, Andrey Grodzovsky wrote: This patchset is based on earlier work by Boris[1] that allowed to have

Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-26 Thread Andrey Grodzovsky
14:21, Andrey Grodzovsky wrote: On 2022-01-17 2:17 p.m., Christian König wrote: Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky: Ping on the question Oh, my! That was already more than a week ago and is completely swapped out of my head again. Andrey On 2022-01-05 1:11 p.m., Andrey

Re: [RFC v3 01/12] drm/amdgpu: Introduce reset domain

2022-01-26 Thread Andrey Grodzovsky
On 2022-01-26 07:07, Christian König wrote: Am 25.01.22 um 23:37 schrieb Andrey Grodzovsky: Defined a reset_domain struct such that all the entities that go through reset together will be serialized one against another. Do it for both single device and XGMI hive cases. Signed-off-by: Andrey

[RFC v3 12/12] Revert 'drm/amdgpu: annotate a false positive recursive locking'

2022-01-25 Thread Andrey Grodzovsky
Since we have a single instance of reset semaphore which we lock only once even for XGMI hive we don't need the nested locking hint anymore. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 -- 1 file changed, 4 insertions(+), 10 deletions

[RFC v3 07/12] drm/amdgpu: Drop concurrent GPU reset protection for device

2022-01-25 Thread Andrey Grodzovsky
Since now all GPU resets are serialized there is no need for this. This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout' Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++ 1 file

[RFC v3 11/12] drm/amdgpu: Rework amdgpu_device_lock_adev

2022-01-25 Thread Andrey Grodzovsky
This function needs to be split into 2 parts where one is called only once for locking single instance of reset_domain's sem and reset flag and the other part which handles MP1 states should still be called for each device in XGMI hive. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd
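
A rough userspace sketch of the split this snippet describes: one step runs once per reset domain, the other runs for every device in the hive. All names here (`reset_domain_lock`, `gpu_dev_set_mp1_state`, `hive_begin_reset`) are made up for illustration, and a simple flag stands in for the semaphore plus reset flag mentioned in the patch description; the real driver's locking is not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the amdgpu definitions. */
struct reset_domain {
    int in_reset;     /* one flag shared by the whole domain */
    int lock_count;   /* number of successful locks, kept for illustration */
};

struct gpu_dev {
    struct reset_domain *domain;
    int mp1_state;
};

/* Part 1: taken once per domain. Returns 0 if a reset is already running. */
static int reset_domain_lock(struct reset_domain *d)
{
    if (d->in_reset)
        return 0;
    d->in_reset = 1;
    d->lock_count++;
    return 1;
}

static void reset_domain_unlock(struct reset_domain *d)
{
    assert(d->in_reset);
    d->in_reset = 0;
}

/* Part 2: per-device state, applied to each device in the hive. */
static void gpu_dev_set_mp1_state(struct gpu_dev *dev, int state)
{
    dev->mp1_state = state;
}

/* Lock the shared domain once, then update every device. */
static int hive_begin_reset(struct gpu_dev *devs, size_t n, int mp1_state)
{
    size_t i;

    if (n == 0 || !reset_domain_lock(devs[0].domain))
        return 0;
    for (i = 0; i < n; i++)
        gpu_dev_set_mp1_state(&devs[i], mp1_state);
    return 1;
}
```

A second `hive_begin_reset()` on the same domain fails until `reset_domain_unlock()` is called, which is the single-instance behaviour the snippet asks for.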

[RFC v3 10/12] drm/amdgpu: Move in_gpu_reset into reset_domain

2022-01-25 Thread Andrey Grodzovsky
We should have a single instance per entire reset domain. Signed-off-by: Andrey Grodzovsky Suggested-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 7 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1

[RFC v3 09/12] drm/amdgpu: Move reset sem into reset_domain

2022-01-25 Thread Andrey Grodzovsky
We want a single instance of reset sem across all reset clients because in case of XGMI we should stop cross-device MMIO access, since any of the devices could be in a reset at the moment. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 - drivers/gpu/drm/amd

[RFC v3 08/12] drm/amdgpu: Rework reset domain to be refcounted.

2022-01-25 Thread Andrey Grodzovsky
-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 6 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 44 +- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 36 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 10 + drivers/gpu/drm/amd/amdgpu

[RFC v3 06/12] drm/amdgpu: Drop hive->in_reset

2022-01-25 Thread Andrey Grodzovsky
Since we serialize all resets no need to protect from concurrent resets. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 - drivers/gpu/drm/amd/amdgpu

[RFC v3 05/12] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2022-01-25 Thread Andrey Grodzovsky
No need to trigger another work queue inside the work queue. v3: Problem: Extra reset caused by host side FLR notification following guest side triggered reset. Fix: Prevent queuing flr_work from mailbox irq if guest already executing a reset. Suggested-by: Liu Shaoyun Signed-off-by: Andrey

[RFC v3 04/12] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-25 Thread Andrey Grodzovsky
to queue work and wait on it to finish. v2: Rename to amdgpu_recover_work_struct Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +- drivers/gpu/drm/amd/amdgpu

[RFC v3 03/12] drm/amdgpu: Fix crash on modprobe

2022-01-25 Thread Andrey Grodzovsky
Restrict jobs resubmission to suspend case only since schedulers not initialised yet on probe. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm

[RFC v3 02/12] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-01-25 Thread Andrey Grodzovsky
Before we initialize schedulers we must know which reset domain are we in - for single device there is a single domain per device and so single wq per device. For XGMI the reset domain spans the entire XGMI hive and so the reset wq is per hive. Signed-off-by: Andrey Grodzovsky --- drivers/gpu

[RFC v3 01/12] drm/amdgpu: Introduce reset domain

2022-01-25 Thread Andrey Grodzovsky
Defined a reset_domain struct such that all the entities that go through reset together will be serialized one against another. Do it for both single device and XGMI hive cases. Signed-off-by: Andrey Grodzovsky Suggested-by: Daniel Vetter Suggested-by: Christian König Reviewed-by: Christian

[RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-01-25 Thread Andrey Grodzovsky
P.P.S Patches 8-12 are the refactor on top of the original V2 patchset. P.P.P.S I wasn't able yet to test the reworked code on XGMI SRIOV system because drm-misc-next fails to load there. Would appreciate if maybe jingwech can try it on his system like he tested V2. Andrey Grodzovsky (12): drm/

Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

2022-01-24 Thread Andrey Grodzovsky
they feel a need to dump info ? Also, how reliable is the STB infra during a reset ? Regards Shashank On 1/24/2022 5:32 PM, Andrey Grodzovsky wrote: You probably can add the STB dump we worked on a while ago to your info dump - a reminder on the feature is here https://www.spinics.net/lists/amd-gfx

Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

2022-01-24 Thread Andrey Grodzovsky
You probably can add the STB dump we worked on a while ago to your info dump - a reminder on the feature is here https://www.spinics.net/lists/amd-gfx/msg70751.html Andrey On 2022-01-21 15:34, Sharma, Shashank wrote: From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001 From:

Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-17 Thread Andrey Grodzovsky
On 2022-01-17 2:17 p.m., Christian König wrote: Am 17.01.22 um 20:14 schrieb Andrey Grodzovsky: Ping on the question Oh, my! That was already more than a week ago and is completely swapped out of my head again. Andrey On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote: Also, what about

Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-17 Thread Andrey Grodzovsky
Ping on the question Andrey On 2022-01-05 1:11 p.m., Andrey Grodzovsky wrote: Also, what about having the reset_active or in_reset flag in the reset_domain itself? Offhand that sounds like a good idea. What then about the adev->reset_sem semaphore ? Should we also m

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-07 Thread Andrey Grodzovsky
On 2022-01-07 12:46 a.m., JingWen Chen wrote: On 2022/1/7 上午11:57, JingWen Chen wrote: On 2022/1/7 上午3:13, Andrey Grodzovsky wrote: On 2022-01-06 12:18 a.m., JingWen Chen wrote: On 2022/1/6 下午12:59, JingWen Chen wrote: On 2022/1/6 上午2:24, Andrey Grodzovsky wrote: On 2022-01-05 2:59 a.m

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-06 Thread Andrey Grodzovsky
On 2022-01-06 12:18 a.m., JingWen Chen wrote: On 2022/1/6 下午12:59, JingWen Chen wrote: On 2022/1/6 上午2:24, Andrey Grodzovsky wrote: On 2022-01-05 2:59 a.m., Christian König wrote: Am 05.01.22 um 08:34 schrieb JingWen Chen: On 2022/1/5 上午12:56, Andrey Grodzovsky wrote: On 2022-01-04 6:36

Re: [PATCH v2] drm/amdgpu: Unmap MMIO mappings when device is not unplugged

2022-01-06 Thread Andrey Grodzovsky
Got it See below one small comment, with that the patch is Reviewed-by: Andrey Grodzovsky On 2022-01-05 9:24 p.m., Shi, Leslie wrote: [AMD Official Use Only] Hi Andrey, It is the following patch calls to amdgpu_device_unmap_mmio() conditioned on device unplugged. 3efb17ae7e92 "

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-05 Thread Andrey Grodzovsky
On 2022-01-05 2:59 a.m., Christian König wrote: Am 05.01.22 um 08:34 schrieb JingWen Chen: On 2022/1/5 上午12:56, Andrey Grodzovsky wrote: On 2022-01-04 6:36 a.m., Christian König wrote: Am 04.01.22 um 11:49 schrieb Liu, Monk: [AMD Official Use Only] See the FLR request from the hypervisor

Re: [PATCH v2] drm/amdgpu: Unmap MMIO mappings when device is not unplugged

2022-01-05 Thread Andrey Grodzovsky
On 2022-01-04 11:23 p.m., Leslie Shi wrote: Patch: 3efb17ae7e92 ("drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure") makes call to amdgpu_device_unmap_mmio() conditioned on device unplugged. This patch unmaps MMIO mappings even

Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-01-05 Thread Andrey Grodzovsky
On 2022-01-05 7:31 a.m., Christian König wrote: Am 05.01.22 um 10:54 schrieb Lazar, Lijo: On 12/23/2021 3:35 AM, Andrey Grodzovsky wrote: Use reset domain wq also for non TDR gpu recovery triggers such as sysfs and RAS. We must serialize all possible GPU recoveries to guarantee no concurrency

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread Andrey Grodzovsky
at we need to change SRIOV and not the driver. Christian. Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky: Sure, I guess i can drop this patch then. Andrey On 2021-12-24 4:57 a.m., JingWen Chen wrote: I do agree with shaoyun, if the host find the gpu engine hangs first, and do the flr, gues

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread Andrey Grodzovsky
work with that we need to change SRIOV and not the driver. Christian. Am 30.12.21 um 19:45 schrieb Andrey Grodzovsky: Sure, I guess i can drop this patch then. Andrey On 2021-12-24 4:57 a.m., JingWen Chen wrote: I do agree with shaoyun, if the host find the gpu engine hangs first, and do th

Re: [PATCH] drm/amdgpu: Delay unmapping MMIO VRAM to amdgpu_ttm_fini() in GPU initialization failure

2022-01-04 Thread Andrey Grodzovsky
On 2022-01-03 9:30 p.m., Leslie Shi wrote: If the driver load fails during hw_init(), delay unmapping MMIO VRAM to amdgpu_ttm_fini(). This prevents accessing an invalid memory address in vcn_v3_0_sw_fini(). Signed-off-by: Leslie Shi --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-30 Thread Andrey Grodzovsky
- de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org Cc: dan...@ffwll.ch; Liu, Monk ; Chen, Horace Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky: Since now flr work is serialized against GPU resets there is no

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-30 Thread Andrey Grodzovsky
-gfx@lists.freedesktop.org Cc: dan...@ffwll.ch; Liu, Monk ; Chen, Horace Subject: Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV Am 22.12.21 um 23:14 schrieb Andrey Grodzovsky: Since now flr work is serialized against GPU resets there is no need for this. Signed-off-by: An

[RFC v3 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2021-12-23 Thread Andrey Grodzovsky
No need to trigger another work queue inside the work queue. v3: Problem: Extra reset caused by host side FLR notification following guest side triggered reset. Fix: Prevent queuing flr_work from mailbox irq if guest already executing a reset. Suggested-by: Liu Shaoyun Signed-off-by: Andrey

Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure

2021-12-23 Thread Andrey Grodzovsky
wondering what is the impact on a system like MI200 A+A. Thanks, Lijo -Original Message- From: amd-gfx On Behalf Of Andrey Grodzovsky Sent: Friday, December 17, 2021 8:32 PM To: Koenig, Christian ; Shi, Leslie ; Pan, Xinhui ; Deucher, Alexander ; amd-gfx@lists.freedesktop.org Cc: Chen, Guchun Subj

[RFC v2 7/8] drm/amdgpu: Drop concurrent GPU reset protection for device

2021-12-22 Thread Andrey Grodzovsky
Since now all GPU resets are serialized there is no need for this. This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout' Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++ 1 file

[RFC v2 6/8] drm/amdgpu: Drop hive->in_reset

2021-12-22 Thread Andrey Grodzovsky
Since we serialize all resets no need to protect from concurrent resets. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 - drivers/gpu/drm/amd/amdgpu

[RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-22 Thread Andrey Grodzovsky
Since now flr work is serialized against GPU resets there is no need for this. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 --- drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 --- 2 files changed, 22 deletions(-) diff --git a/drivers/gpu/drm/amd

[RFC v2 5/8] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2021-12-22 Thread Andrey Grodzovsky
No need to trigger another work queue inside the work queue. Suggested-by: Liu Shaoyun Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 7 +-- drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 7 +-- drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 7 +-- 3 files

[RFC v2 3/8] drm/amdgpu: Fix crash on modprobe

2021-12-22 Thread Andrey Grodzovsky
Restrict jobs resubmission to suspend case only since schedulers not initialised yet on probe. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers

[RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2021-12-22 Thread Andrey Grodzovsky
to queue work and wait on it to finish. v2: Rename to amdgpu_recover_work_struct Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 2 +- 3 files

[RFC v2 2/8] drm/amdgpu: Move scheduler init to after XGMI is ready

2021-12-22 Thread Andrey Grodzovsky
Before we initialize schedulers we must know which reset domain are we in - for single device there is a single domain per device and so single wq per device. For XGMI the reset domain spans the entire XGMI hive and so the reset wq is per hive. Signed-off-by: Andrey Grodzovsky --- drivers/gpu

[RFC v2 0/8] Define and use reset domain for GPU recovery in amdgpu

2021-12-22 Thread Andrey Grodzovsky
. (Shaoyun) [1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezil...@collabora.com/ P.S Going through drm-misc-next and not amd-staging-drm-next as Boris work hasn't landed yet there. Andrey Grodzovsky (8): drm/amdgpu: Introduce reset domain drm/amdgpu

[RFC v2 1/8] drm/amdgpu: Introduce reset domain

2021-12-22 Thread Andrey Grodzovsky
Defined a reset_domain struct such that all the entities that go through reset together will be serialized one against another. Do it for both single device and XGMI hive cases. Signed-off-by: Andrey Grodzovsky Suggested-by: Daniel Vetter Suggested-by: Christian König Reviewed-by: Christian

Re: [RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2021-12-21 Thread Andrey Grodzovsky
On 2021-12-21 2:59 a.m., Christian König wrote: Am 20.12.21 um 23:17 schrieb Andrey Grodzovsky: On 2021-12-20 2:20 a.m., Christian König wrote: Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: Use reset domain wq also for non TDR gpu recovery triggers such as sysfs and RAS. We must serialize

Re: [RFC 3/6] drm/amdgpu: Fix crash on modprobe

2021-12-21 Thread Andrey Grodzovsky
On 2021-12-21 2:02 a.m., Christian König wrote: Am 20.12.21 um 20:22 schrieb Andrey Grodzovsky: On 2021-12-20 2:17 a.m., Christian König wrote: Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: Restrict jobs resubmission to suspend case only since schedulers not initialised yet on probe

Re: [RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2021-12-20 Thread Andrey Grodzovsky
On 2021-12-20 2:20 a.m., Christian König wrote: Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: Use reset domain wq also for non TDR gpu recovery triggers such as sysfs and RAS. We must serialize all possible GPU recoveries to guarantee no concurrency there. For TDR call the original recovery

Re: [RFC 2/6] drm/amdgpu: Move scheduler init to after XGMI is ready

2021-12-20 Thread Andrey Grodzovsky
On 2021-12-20 2:16 a.m., Christian König wrote: Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: Before we initialize schedulers we must know which reset domain are we in - for single device there is a single domain per device and so single wq per device. For XGMI the reset domain spans

Re: [RFC 3/6] drm/amdgpu: Fix crash on modprobe

2021-12-20 Thread Andrey Grodzovsky
On 2021-12-20 2:17 a.m., Christian König wrote: Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: Restrict jobs resubmission to suspend case only since schedulers not initialised yet on probe. Signed-off-by: Andrey Grodzovsky ---   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-   1 file

Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-20 Thread Andrey Grodzovsky
, Monk Subject: Re: [RFC 0/6] Define and use reset domain for GPU recovery in amdgpu Am 17.12.21 um 23:27 schrieb Andrey Grodzovsky: This patchset is based on earlier work by Boris[1] that allowed to have an ordered workqueue at the driver level that will be used by the different schedulers to

[RFC 5/6] drm/amdgpu: Drop hive->in_reset

2021-12-17 Thread Andrey Grodzovsky
Since we serialize all resets no need to protect from concurrent resets. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 - drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h | 1 - 3 files

[RFC 1/6] drm/amdgpu: Init GPU reset single threaded wq

2021-12-17 Thread Andrey Grodzovsky
Do it for both single device and XGMI hive cases. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 7 +++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 20 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 9 + drivers/gpu/drm/amd

[RFC 6/6] drm/amdgpu: Drop concurrent GPU reset protection for device

2021-12-17 Thread Andrey Grodzovsky
Since now all GPU resets are serialized there is no need for this. This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout' Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++ 1 file changed, 7 insertions(+), 82

[RFC 4/6] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2021-12-17 Thread Andrey Grodzovsky
to queue work and wait on it to finish. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c| 2 +- 3 files changed, 35 insertions(+), 2 deletions

[RFC 3/6] drm/amdgpu: Fix crash on modprobe

2021-12-17 Thread Andrey Grodzovsky
Restrict jobs resubmission to suspend case only since schedulers not initialised yet on probe. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers

[RFC 2/6] drm/amdgpu: Move scheduler init to after XGMI is ready

2021-12-17 Thread Andrey Grodzovsky
Before we initialize schedulers we must know which reset domain are we in - for single device there is a single domain per device and so single wq per device. For XGMI the reset domain spans the entire XGMI hive and so the reset wq is per hive. Signed-off-by: Andrey Grodzovsky --- drivers/gpu

[RFC 0/6] Define and use reset domain for GPU recovery in amdgpu

2021-12-17 Thread Andrey Grodzovsky
landed yet there. Andrey Grodzovsky (6): drm/amdgpu: Init GPU reset single threaded wq drm/amdgpu: Move scheduler init to after XGMI is ready drm/amdgpu: Fix crash on modprobe drm/amdgpu: Serialize non TDR gpu recovery with TDRs drm/amdgpu: Drop hive->in_reset drm/amdgpu: D

Re: [PATCH v3] drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure

2021-12-17 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2021-12-17 3:49 a.m., Christian König wrote: Am 17.12.21 um 03:26 schrieb Leslie Shi: [Why] In amdgpu_driver_load_kms, when amdgpu_device_init returns error during driver modprobe, it will start the error handle path immediately and call

Re: [PATCH v2] drm/amdgpu: Call amdgpu_device_unmap_mmio() iff device is unplugged to prevent crash in GPU initialization failure

2021-12-16 Thread Andrey Grodzovsky
Maybe we just should use drm_dev_is_unplugged() for this particular case because, there would be no race since when device is unplugged it's final. It's the other way around that requires strict drm_dev_enter/exit scope. Andrey On 2021-12-16 3:38 a.m., Christian König wrote: The

Re: [PATCH] drm/amdgpu: add drm_dev_unplug() in GPU initialization failure to prevent crash

2021-12-15 Thread Andrey Grodzovsky
I think that we should not call amdgpu_device_unmap_mmio unless device is unplugged (as in amdgpu_pci_remove) because the point of this function is to prevent accesses to MMIO range the device was occupying before removal. There is no point to prevent MMIO accesses when init failed and we want

Re: [PATCH v2] drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence

2021-12-14 Thread Andrey Grodzovsky
On 2021-12-14 12:03 p.m., Andrey Grodzovsky wrote: - -    if (job != NULL) { -    /* mark this fence has a parent job */ -    set_bit(AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT, >flags); +    if (job) +    dma_fence_init(fence, _job_fence_ops, +   >fence_dr

Re: [PATCH v2] drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence

2021-12-14 Thread Andrey Grodzovsky
On 2021-12-14 6:15 a.m., Huang Rui wrote: The job embedded fence doesn't initialize the flags at dma_fence_init(). Then we will go a wrong way in amdgpu_fence_get_timeline_name callback and trigger a null pointer panic once we enabled the trace event here. So introduce new amdgpu_fence object

Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

2021-12-10 Thread Andrey Grodzovsky
On 2021-12-09 10:47 p.m., Lang Yu wrote: On 12/09/ , Christian König wrote: Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky: On 2021-12-09 4:00 a.m., Christian König wrote: Am 09.12.21 um 09:49 schrieb Lang Yu: It is useful to maintain error context when debugging SW/FW issues. We

Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

2021-12-09 Thread Andrey Grodzovsky
. Compare to a simple hang, the system will keep stable at least for SSH access. Then it should be trivial to inspect the hardware state and see what's going on. Suggested-by: Christian Koenig Suggested-by: Andrey Grodzovsky Signed-off-by: Lang Yu ---   drivers/gpu/drm/amd/amdgpu/amdgpu.h

Re: [PATCH] drm/amdgpu: add support to SMU debug option

2021-12-01 Thread Andrey Grodzovsky
On 2021-12-01 8:11 a.m., Christian König wrote: Adding Andrey as well. Am 01.12.21 um 12:37 schrieb Yu, Lang: [SNIP] + BUG_ON(unlikely(smu->smu_debug_mode) && res); BUG_ON() really crashes the kernel and is only allowed if we prevent further data corruption with that. Most of the time

Re: [RFC 0/3] Add Smart Trace Buffers support

2021-11-19 Thread Andrey Grodzovsky
Ping - mostly just to get final ack to push it into amd-staging-drm-next Andrey On 2021-11-18 1:18 p.m., Andrey Grodzovsky wrote: The Smart Trace Buffer (STB), is a cyclic data buffer used to log information about system execution for characterization and debug purposes. If at any point should

[RFC 3/3] drm/amd/pm: Add debugfs info for STB

2021-11-18 Thread Andrey Grodzovsky
Add debugfs hook. Signed-off-by: Andrey Grodzovsky Reviewed-by: Lijo Lazar Reviewed-by: Luben Tuikov --- drivers/gpu/drm/amd/pm/amdgpu_pm.c| 2 + drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 1 + drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 86 +++ 3 files changed

[RFC 2/3] drm/amd/pm: Add STB support in sienna_cichlid

2021-11-18 Thread Andrey Grodzovsky
Add STB implementation for sienna_cichlid Signed-off-by: Andrey Grodzovsky Reviewed-by: Lijo Lazar Reviewed-by: Luben Tuikov --- .../amd/include/asic_reg/mp/mp_11_0_offset.h | 7 +++ .../amd/include/asic_reg/mp/mp_11_0_sh_mask.h | 12 .../amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 55

[RFC 1/3] drm/amd/pm: Add STB accessors interface

2021-11-18 Thread Andrey Grodzovsky
Add interface to collect STB logs. Signed-off-by: Andrey Grodzovsky Reviewed-by: Lijo Lazar Reviewed-by: Luben Tuikov --- drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 15 +++ drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 18 ++ 2 files changed, 33 insertions(+) diff

[RFC 0/3] Add Smart Trace Buffers support

2021-11-18 Thread Andrey Grodzovsky
additional instrumentation. Andrey Grodzovsky (3): drm/amd/pm: Add STB accessors interface drm/amd/pm: Add STB support in sienna_cichlid drm/amd/pm: Add debugfs info for STB .../amd/include/asic_reg/mp/mp_11_0_offset.h | 7 ++ .../amd/include/asic_reg/mp/mp_11_0_sh_mask.h | 12 ++ drivers/gpu

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-11-11 Thread Andrey Grodzovsky
On 2021-11-10 8:24 a.m., Daniel Vetter wrote: On Wed, Nov 10, 2021 at 11:09:50AM +0100, Christian König wrote: Am 10.11.21 um 10:50 schrieb Daniel Vetter: On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote: On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter wrote: On Mon, Nov 08, 2021 at

Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

2021-11-10 Thread Andrey Grodzovsky
On 2021-11-10 5:09 a.m., Christian König wrote: Am 10.11.21 um 10:50 schrieb Daniel Vetter: On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote: On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter wrote: On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote: I stumbled across this

Re: Lockdep spalt on killing a processes

2021-11-01 Thread Andrey Grodzovsky
Pushed to drm-misc-next Andrey On 2021-10-29 3:07 a.m., Christian König wrote: Attached a patch. Give it a try please, I tested it on my side and tried to generate the right conditions to trigger this code path by repeatedly submitting commands while issuing GPU reset to stop the scheduler

Re: Lockdep spalt on killing a processes

2021-10-28 Thread Andrey Grodzovsky
On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote: On 2021-10-27 10:50 a.m., Christian König wrote: Am 27.10.21 um 16:47 schrieb Andrey Grodzovsky: On 2021-10-27 10:34 a.m., Christian König wrote: Am 27.10.21 um 16:27 schrieb Andrey Grodzovsky: [SNIP] Let me please know if I am still

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-28 Thread Andrey Grodzovsky
On 2021-10-27 10:43 p.m., JingWen Chen wrote: On 2021/10/28 上午3:43, Andrey Grodzovsky wrote: On 2021-10-25 10:57 p.m., JingWen Chen wrote: On 2021/10/25 下午11:18, Andrey Grodzovsky wrote: On 2021-10-24 10:56 p.m., JingWen Chen wrote: On 2021/10/23 上午4:41, Andrey Grodzovsky wrote: What do

Re: Lockdep spalt on killing a processes

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-27 10:50 a.m., Christian König wrote: Am 27.10.21 um 16:47 schrieb Andrey Grodzovsky: On 2021-10-27 10:34 a.m., Christian König wrote: Am 27.10.21 um 16:27 schrieb Andrey Grodzovsky: [SNIP] Let me please know if I am still missing some point of yours. Well, I mean we need

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-25 10:57 p.m., JingWen Chen wrote: On 2021/10/25 下午11:18, Andrey Grodzovsky wrote: On 2021-10-24 10:56 p.m., JingWen Chen wrote: On 2021/10/23 上午4:41, Andrey Grodzovsky wrote: What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put

Re: Lockdep spalt on killing a processes

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-27 10:34 a.m., Christian König wrote: Am 27.10.21 um 16:27 schrieb Andrey Grodzovsky: [SNIP] Let me please know if I am still missing some point of yours. Well, I mean we need to be able to handle this for all drivers. For sure, but as i said above in my opinion we need

Re: Lockdep spalt on killing a processes

2021-10-27 Thread Andrey Grodzovsky
On 2021-10-26 6:54 a.m., Christian König wrote: Am 26.10.21 um 04:33 schrieb Andrey Grodzovsky: On 2021-10-25 3:56 p.m., Christian König wrote: In general I'm all there to get this fixed, but there is one major problem: Drivers don't expect the lock to be dropped. I am probably missing

Re: Lockdep spalt on killing a processes

2021-10-25 Thread Andrey Grodzovsky
ill missing some point of yours. Andrey Regards, Christian. Am 25.10.21 um 21:10 schrieb Andrey Grodzovsky: Adding back Daniel (somehow he got off the addresses list) and Chris who worked a lot in this area. On 2021-10-21 2:34 a.m., Christian König wrote: Am 20.10.21 um 21:32 schrieb And

Re: Lockdep spalt on killing a processes

2021-10-25 Thread Andrey Grodzovsky
Adding back Daniel (somehow he got off the addresses list) and Chris who worked a lot in this area. On 2021-10-21 2:34 a.m., Christian König wrote: Am 20.10.21 um 21:32 schrieb Andrey Grodzovsky: On 2021-10-04 4:14 a.m., Christian König wrote: The problem is a bit different. The callback

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-25 Thread Andrey Grodzovsky
On 2021-10-24 10:56 p.m., JingWen Chen wrote: On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote: What do you mean by underflow in this case? You mean use after free because of extra dma_fence_put()? yes Then maybe update the description because 'underflow' is very confusing On 2021-10

Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow

2021-10-22 Thread Andrey Grodzovsky
What do you mean by underflow in this case? You mean use after free because of extra dma_fence_put()? On 2021-10-22 4:14 a.m., JingWen Chen wrote: ping On 2021/10/22 11:33 AM, Jingwen Chen wrote: [Why] In advance tdr mode, the real bad job will be resubmitted twice, while in

Re: FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-22 Thread Andrey Grodzovsky
ring->adev->rings[ring->idx] = NULL; } Regards, Lang Got it, Looks good to me. Reviewed-by: Andrey Grodzovsky Andrey Fixes: 72c8c97b1522 ("drm/amdgpu: Split amdgpu_device_fini into early and late") Signed-off-by: Lang Yu --- drivers/gpu/drm/amd/amdgpu/amdgpu_device

Re: FW: [PATCH 1/3] drm/amdgpu: fix a potential memory leak in amdgpu_device_fini_sw()

2021-10-21 Thread Andrey Grodzovsky
On 2021-10-21 3:19 a.m., Yu, Lang wrote: [AMD Official Use Only] -Original Message- From: Yu, Lang Sent: Thursday, October 21, 2021 3:18 PM To: Grodzovsky, Andrey Cc: Deucher, Alexander ; Koenig, Christian ; Huang, Ray ; Yu, Lang Subject: [PATCH 1/3] drm/amdgpu: fix a potential

Re: Lockdep spalt on killing a processes

2021-10-20 Thread Andrey Grodzovsky
t be done there. Andrey On 01.10.21 at 17:10, Andrey Grodzovsky wrote: From what I see here, you should have an actual deadlock and not only a warning: sched_fence->finished is first signaled from within the hw fence done callback (drm_sched_job_done_cb) but then again from within its

Re: [PATCH 1/1] drm/amdgpu: recover gart table at resume

2021-10-19 Thread Andrey Grodzovsky
On 2021-10-19 11:54 a.m., Christian König wrote: On 19.10.21 at 17:41, Andrey Grodzovsky wrote: On 2021-10-19 9:22 a.m., Nirmoy Das wrote: Get rid of pin/unpin and evict and swap back gart page table which should make things less likely to break. +Christian Could you guys also clarify

Re: [PATCH 1/1] drm/amdgpu: recover gart table at resume

2021-10-19 Thread Andrey Grodzovsky
On 2021-10-19 9:22 a.m., Nirmoy Das wrote: Get rid of pin/unpin and evict and swap back gart page table which should make things less likely to break. +Christian Could you guys also clarify what exactly are the stability issues this fixes? Andrey Also remove 2nd call to

Re: [PATCH] drm/amdgpu: handle the case of pci_channel_io_frozen only in amdgpu_pci_resume

2021-10-04 Thread Andrey Grodzovsky
, and only continue the execution in amdgpu_pci_resume when it's pci_channel_io_frozen. Fixes: c9a6b82f45e2 ("drm/amdgpu: Implement DPC recovery") Suggested-by: Andrey Grodzovsky Signed-off-by: Guchun Chen --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgp

Re: Lockdep spalt on killing a processes

2021-10-04 Thread Andrey Grodzovsky
the scheduler fence. Daniel is right that this needs an irq_work struct to handle this properly. Christian. On 01.10.21 at 17:10, Andrey Grodzovsky wrote: From what I see here, you should have an actual deadlock and not only a warning: sched_fence->finished is first signaled from within

Re: Lockdep spalt on killing a processes

2021-10-01 Thread Andrey Grodzovsky
From what I see here, you should have an actual deadlock and not only a warning: sched_fence->finished is first signaled from within the hw fence done callback (drm_sched_job_done_cb) but then again from within its own callback (drm_sched_entity_kill_jobs_cb) and so looks like same fence object is

Re: [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal

2021-10-01 Thread Andrey Grodzovsky
No, scheduler restart and device unlock must take place in amdgpu_pci_resume (see struct pci_error_handlers for the various states of PCI recovery). So just add a flag (probably in amdgpu_device) so we can remember what pci_channel_state_t we came from (unfortunately it's not passed to us in
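
The suggestion above (remember the channel state seen at error-detection time and consult it in the resume callback) might be sketched as below. This is only an illustrative, user-space mock-up: the `mock_*` enum, struct, and function names are hypothetical stand-ins, not the real amdgpu or PCI core definitions, and the actual patch may differ.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's pci_channel_state_t values. */
enum mock_pci_channel_state {
    MOCK_PCI_CHANNEL_IO_NORMAL = 1,
    MOCK_PCI_CHANNEL_IO_FROZEN = 2,
    MOCK_PCI_CHANNEL_IO_PERM_FAILURE = 3,
};

/* Hypothetical stand-in for the device structure carrying the new flag. */
struct mock_device {
    enum mock_pci_channel_state pci_channel_state; /* state remembered from error detection */
    bool scheduler_running;
    bool locked;
};

/* Error-detected callback: record the state, and only stop the
 * schedulers and take the lock when the channel is frozen. */
static void mock_error_detected(struct mock_device *dev,
                                enum mock_pci_channel_state state)
{
    dev->pci_channel_state = state;
    if (state == MOCK_PCI_CHANNEL_IO_FROZEN) {
        dev->scheduler_running = false;
        dev->locked = true;
    }
}

/* Resume callback: the state is not passed in here, so use the
 * remembered one. Returns true if the restart/unlock path ran. */
static bool mock_resume(struct mock_device *dev)
{
    if (dev->pci_channel_state != MOCK_PCI_CHANNEL_IO_FROZEN)
        return false; /* nothing was stopped or locked, so nothing to undo */

    dev->scheduler_running = true;
    dev->locked = false;
    return true;
}
```

With this shape, a resume that follows a "normal" channel state leaves the scheduler and lock untouched, while a resume that follows a "frozen" state restarts the scheduler and drops the lock.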

Re: [PATCH] drm/amdgpu: add missed write lock for pci detected state pci_channel_io_normal

2021-09-30 Thread Andrey Grodzovsky
On 2021-09-30 10:00 p.m., Guchun Chen wrote: When a PCI error state pci_channel_io_normal is detected, it will report PCI_ERS_RESULT_CAN_RECOVER status to PCI driver, and PCI driver will continue the execution of PCI resume callback report_resume by pci_walk_bridge, and the callback will go into

Re: [PATCH] drm/amd/amdgpu: Do irq_fini_hw after ip_fini_early

2021-09-29 Thread Andrey Grodzovsky
Can you test this change with the hotunplug tests in libdrm? Since the tests are still disabled until the latest fixes propagate to drm-next upstream, you will need to comment out https://gitlab.freedesktop.org/mesa/drm/-/blob/main/tests/amdgpu/hotunplug_tests.c#L65 I recently fixed a few

Re: [PATCH v2 1/2] drm/amdkfd: handle svm migrate init error

2021-09-21 Thread Andrey Grodzovsky
Series is Acked-by: Andrey Grodzovsky Andrey On 2021-09-21 2:53 p.m., Philip Yang wrote: If svm migration init failed to create pgmap for device memory, set pgmap type to 0 to disable device SVM support capability. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c

Re: [PATCH] drm/amdgpu: move amdgpu_virt_release_full_gpu to fini_early stage

2021-09-21 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2021-09-21 9:11 a.m., Chen, Guchun wrote: [Public] Ping... Regards, Guchun -Original Message- From: Chen, Guchun Sent: Saturday, September 18, 2021 2:09 PM To: amd-gfx@lists.freedesktop.org; Koenig, Christian ; Pan, Xinhui ; Deucher

Re: [PATCH] drm/amdkfd: fix svm_migrate_fini warning

2021-09-21 Thread Andrey Grodzovsky
In any case, once you converge on a solution please include the relevant ticket in the commit description: https://gitlab.freedesktop.org/drm/amd/-/issues/1718 Andrey On 2021-09-20 10:20 p.m., Felix Kuehling wrote: On 2021-09-20 at 5:55 p.m., Philip Yang wrote: Don't use

Re: [PATCH 2/2] drm/amdgpu: Fix resume failures when device is gone

2021-09-17 Thread Andrey Grodzovsky
Ping Andrey On 2021-09-17 7:30 a.m., Andrey Grodzovsky wrote: Problem: When a device goes into suspend and is unplugged during it, all HW programming during resume fails, leading to a bad SW state during the pci remove handling which follows. Because device is first resumed and only later removed we
