[PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level.

2022-05-17 Thread Andrey Grodzovsky
Will be read by executors of async reset like debugfs. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 -- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + 3 files changed, 6 insertions(+), 2 deletions

[PATCH v2 3/7] drm/amdgpu: Serialize RAS recovery work directly into reset domain queue.

2022-05-17 Thread Andrey Grodzovsky
Save the extra useless work schedule. Also switch to delayed work. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 12 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd

[PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct.

2022-05-17 Thread Andrey Grodzovsky
We need to be able to cancel pending reset works without blocking from within GPU reset. Currently the kernel API allows this only for delayed_work and not for work_struct. Switch to delayed work and queue it with delay 0, which is equivalent to queueing a work_struct. Signed-off-by: Andrey Grodzovsky

[PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

2022-05-17 Thread Andrey Grodzovsky
as was in v1[1] to explicit stopping of each reset request from each reset source per each request submitter. [1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzov...@amd.com/ Andrey Grodzovsky (7): drm/amdgpu: Cache result of last reset at reset domain level. drm/amdgpu

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Andrey Grodzovsky
On 2022-05-16 11:08, Christian König wrote: On 16.05.22 at 16:12, Andrey Grodzovsky wrote: Ping Ah, yes sorry. Andrey On 2022-05-13 11:41, Andrey Grodzovsky wrote: Yes, exactly that's the idea. Basically the reset domain knows which amdgpu devices it needs to reset together

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Andrey Grodzovsky
Ping Andrey On 2022-05-13 11:41, Andrey Grodzovsky wrote: Yes, exactly that's the idea. Basically the reset domain knows which amdgpu devices it needs to reset together. If you then represent that so that you always have a hive even when you only have one device in it, or if you put

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-13 Thread Andrey Grodzovsky
On 2022-05-12 09:15, Christian König wrote: On 12.05.22 at 15:07, Andrey Grodzovsky wrote: On 2022-05-12 02:06, Christian König wrote: On 11.05.22 at 22:27, Andrey Grodzovsky wrote: On 2022-05-11 11:39, Christian König wrote: On 11.05.22 at 17:35, Andrey Grodzovsky wrote: On 2022-05

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky
Sure, I will investigate that. What about the ticket which Lijo raised which was basically doing 8 resets instead of one? Lijo - can this ticket wait until I come up with this new design for amdgpu reset function or do you need a quick solution now in which case we can use the already existing

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky
On 2022-05-12 02:06, Christian König wrote: On 11.05.22 at 22:27, Andrey Grodzovsky wrote: On 2022-05-11 11:39, Christian König wrote: On 11.05.22 at 17:35, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: On 11.05.22 at 15

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky
On 2022-05-12 02:03, Christian König wrote: On 11.05.22 at 17:57, Andrey Grodzovsky wrote: [SNIP] How about we do it like this then: struct amdgpu_reset_domain { union { struct { struct work_item debugfs; struct work_item ras

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:39, Christian König wrote: On 11.05.22 at 17:35, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: On 11.05.22 at 15:43, Andrey Grodzovsky wrote: On 2022-05-11 03:38, Christian König wrote: On 10.05.22 at 20

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 12:49, Felix Kuehling wrote: On 2022-05-11 at 09:49, Andrey Grodzovsky wrote: [snip] diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c index f1a225a20719..4b789bec9670 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c +++ b

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:39, Christian König wrote: On 11.05.22 at 17:35, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: On 11.05.22 at 15:43, Andrey Grodzovsky wrote: On 2022-05-11 03:38, Christian König wrote: On 10.05.22 at 20

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:46, Lazar, Lijo wrote: On 5/11/2022 9:13 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:37, Lazar, Lijo wrote: On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: On 11.05.22 at 15:43

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:37, Lazar, Lijo wrote: On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: On 11.05.22 at 15:43, Andrey Grodzovsky wrote: On 2022-05-11 03:38, Christian König wrote: On 10.05.22 at 20:53

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: On 11.05.22 at 15:43, Andrey Grodzovsky wrote: On 2022-05-11 03:38, Christian König wrote: On 10.05.22 at 20:53, Andrey Grodzovsky wrote: [SNIP] E.g. in the reset code (either before or after the reset

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 03:38, Christian König wrote: On 10.05.22 at 20:53, Andrey Grodzovsky wrote: On 2022-05-10 13:19, Christian König wrote: On 10.05.22 at 19:01, Andrey Grodzovsky wrote: On 2022-05-10 12:17, Christian König wrote: On 10.05.22 at 18:00, Andrey Grodzovsky wrote: [SNIP

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky
On 2022-05-10 13:19, Christian König wrote: On 10.05.22 at 19:01, Andrey Grodzovsky wrote: On 2022-05-10 12:17, Christian König wrote: On 10.05.22 at 18:00, Andrey Grodzovsky wrote: [SNIP] That's one of the reasons why we should have multiple work items for job based reset and other

Re: [Bug 215958] New: thunderbolt3 egpu cannot disconnect cleanly

2022-05-10 Thread Andrey Grodzovsky
On 2022-05-09 14:03, Deucher, Alexander wrote: [Public] -Original Message- From: Bjorn Helgaas Sent: Monday, May 9, 2022 12:23 PM To: Linux PCI Cc: r087...@yahoo.it; Deucher, Alexander ; Koenig, Christian ; Pan, Xinhui ; amd-gfx mailing list ; dri-devel Subject: Re: [Bug 215958]

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky
On 2022-05-10 12:17, Christian König wrote: On 10.05.22 at 18:00, Andrey Grodzovsky wrote: [SNIP] That's one of the reasons why we should have multiple work items for job based reset and other reset sources. See the whole idea is the following: 1. We have one single queued work queue

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky
On 2022-05-06 04:56, Christian König wrote: On 06.05.22 at 08:02, Lazar, Lijo wrote: On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote: On 2022-05-05 15:49, Felix Kuehling wrote: On 2022-05-05 at 14:57, Andrey Grodzovsky wrote: On 2022-05-05 11:06, Christian König wrote: On 05.05.22 at 15

Re: [PATCH] drm/amdgpu/psp: use proper vmalloc API

2022-05-10 Thread Andrey Grodzovsky
Acked-by: Andrey Grodzovsky Andrey On 2022-05-10 10:58, Alex Deucher wrote: Use kvmalloc and kvfree. Fixes: 31aad22e2b3c ("drm/amdgpu/psp: Add vbflash sysfs interface support") Reported-by: kernel test robot Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 15:49, Felix Kuehling wrote: On 2022-05-05 at 14:57, Andrey Grodzovsky wrote: On 2022-05-05 11:06, Christian König wrote: On 05.05.22 at 15:54, Andrey Grodzovsky wrote: On 2022-05-05 09:23, Christian König wrote: On 05.05.22 at 15:15, Andrey Grodzovsky wrote: On 2022

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 11:06, Christian König wrote: On 05.05.22 at 15:54, Andrey Grodzovsky wrote: On 2022-05-05 09:23, Christian König wrote: On 05.05.22 at 15:15, Andrey Grodzovsky wrote: On 2022-05-05 06:09, Christian König wrote: On 04.05.22 at 18:18, Andrey Grodzovsky wrote: Problem

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 09:23, Christian König wrote: On 05.05.22 at 15:15, Andrey Grodzovsky wrote: On 2022-05-05 06:09, Christian König wrote: On 04.05.22 at 18:18, Andrey Grodzovsky wrote: Problem: During a hive reset caused by a command timing out on a ring, extra resets are triggered

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 06:09, Christian König wrote: On 04.05.22 at 18:18, Andrey Grodzovsky wrote: Problem: During a hive reset caused by a command timing out on a ring, extra resets are triggered by KFD, which is unable to access registers on the resetting ASIC. Fix: Rework GPU reset

[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-04 Thread Andrey Grodzovsky
reset domain will cancel all those pending redundant resets. This is in line with what we already do for redundant TDRs in scheduler code. Signed-off-by: Andrey Grodzovsky Tested-by: Bai Zoy --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 11 +--- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-27 Thread Andrey Grodzovsky
ng recursive fault but reboot is needed! On Apr 21, 2022, at 2:41 AM, Andrey Grodzovsky wrote: I retested hot plug tests at the commit I mentioned below - looks ok, my ASIC is Navi 10, I also tested using Vega 10 and older Polaris ASICs (whatever I had at home at the time). I

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-20 Thread Andrey Grodzovsky
  3  3  0    1 asserts 21 21 21  0  n/a Elapsed time =    9.195 seconds Andrey On 2022-04-20 11:44, Andrey Grodzovsky wrote: The only one in Radeon 7 I see is the same sysfs crash we already fixed so you can use the same fix. The MI 200 issue i

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-20 Thread Andrey Grodzovsky
The only one in Radeon 7 I see is the same sysfs crash we already fixed so you can use the same fix. The MI 200 issue I haven't seen yet but I also haven't tested MI200 so never saw it before. Need to test when I get the time. So try that fix with Radeon 7 again to see if you pass the tests

Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky
My bad, I see u already fixed this in amd-staging-drm-next. We had an issue in an internal branch with this and just reinvented the wheel :)) Andrey On 2022-04-14 10:32, Andrey Grodzovsky wrote: Yea, i need to improve it a bit, ignore this one, will be back with V2. Andrey On 2022-04-14 03

Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky
handler On 13.04.22 at 21:31, Andrey Grodzovsky wrote: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd

Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky
On 2022-04-14 02:40, Christian König wrote: On 13.04.22 at 21:31, Andrey Grodzovsky wrote: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off

[PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-13 Thread Andrey Grodzovsky
Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-13 Thread Andrey Grodzovsky
On 2022-04-13 12:03, Shuotao Xu wrote: On Apr 11, 2022, at 11:52 PM, Andrey Grodzovsky wrote: [Some people who received this message don't often get email from andrey.grodzov...@amd.com. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.] On 2022-04-08 21:28

Re: [EXTERNAL] [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-11 Thread Andrey Grodzovsky
On 2022-04-08 21:28, Shuotao Xu wrote: On Apr 8, 2022, at 11:28 PM, Andrey Grodzovsky wrote: [Some people who received this message don't often get email from andrey.grodzov...@amd.com. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.] On 2022-04-08 04:45

Re: [PATCH 2/2] drm/amdkfd: Add PCIe Hotplug Support for AMDKFD

2022-04-08 Thread Andrey Grodzovsky
On 2022-04-08 04:45, Shuotao Xu wrote: Adding PCIe Hotplug Support for AMDKFD: the support of hot-plug of GPU devices can open doors for many advanced applications in data center in the next few years, such as for GPU resource disaggregation. Current AMDKFD does not support hotplug out b/o

Re: [PATCH] drm/amdkfd: Cleanup IO links during KFD device removal

2022-04-07 Thread Andrey Grodzovsky
I suggest adding another patch to handle unbalanced decrement of kfd_lock in kgd2kfd_suspend. This patch alone is not enough to fix all removal issues. Andrey On 2022-04-07 12:15, Mukul Joshi wrote: Currently, the IO-links to the device being removed from topology, are not cleared. As a

Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

2022-04-06 Thread Andrey Grodzovsky
48 89 e5 48 89 7d f8 48 8b 45 f8 Best regards, Shuotao *From: *Andrey Grodzovsky *Date: *Wednesday, April 6, 2022 at 10:36 PM *To: *Shuotao Xu , amd-gfx@lists.freedesktop.org *Cc: *Ziyue Yang , Lei Qu , Peng Cheng , Ran Shu *Subject: *Re: [EXTERNAL] Re: Code Review Request for AMDGPU

Re: [EXTERNAL] Re: Code Review Request for AMDGPU Hotplug Support

2022-04-06 Thread Andrey Grodzovsky
patch in this email, in case that you would want to delete that later email. Best regards, Shuotao *From: *Andrey Grodzovsky *Date: *Wednesday, April 6, 2022 at 10:13 PM *To: *Shuotao Xu , amd-gfx@lists.freedesktop.org *Cc: *Ziyue Yang , Lei Qu , Peng Cheng , Ran Shu *Subject: *[EXTERN

Re: Code Review Request for AMDGPU Hotplug Support

2022-04-06 Thread Andrey Grodzovsky
Looks like you are using 5.13 kernel for this work, FYI we added hot plug support for the graphic stack in 5.14 kernel (see https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug) I am not sure about the code part since it all touches KFD driver (KFD team can comment

Re: [PATCH 1/1] drm/amdgpu: Move reset domain init before calling RREG32

2022-03-11 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-03-11 10:15, Philip Yang wrote: amdgpu_detect_virtualization reads register, amdgpu_device_rreg access adev->reset_domain->sem if kernel defined CONFIG_LOCKDEP, below is the random boot hang backtrace on Vega10. It may get random NULL p

Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

2022-03-10 Thread Andrey Grodzovsky
On 2022-03-10 11:21, Sharma, Shashank wrote: On 3/10/2022 4:24 PM, Rob Clark wrote: On Thu, Mar 10, 2022 at 1:55 AM Christian König wrote: On 09.03.22 at 19:12, Rob Clark wrote: On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma wrote: From: Shashank Sharma This patch adds a new

Re: [PATCH 3/4] drm/amdgpu: add sdma v5_2 soft reset

2022-03-10 Thread Andrey Grodzovsky
On 2022-03-10 05:06, Christian König wrote: On 10.03.22 at 07:11, Victor Zhao wrote: enable sdma v5_2 soft reset Signed-off-by: Victor Zhao --- drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 79 +- 1 file changed, 78 insertions(+), 1 deletion(-) diff --git

Re: [PATCH] drm/amdgpu: schedule GPU reset event work function

2022-03-10 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-03-10 04:07, Somalapuram Amaranath wrote: Schedule work function with valid PID, process name and vram lost status during a GPU reset/recovery. Signed-off-by: Somalapuram Amaranath --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13

Re: [PATCH 1/2] drm/amd/pm: Add STB dump function.

2022-03-10 Thread Andrey Grodzovsky
On 2022-03-10 00:17, Lazar, Lijo wrote: On 3/10/2022 2:33 AM, Andrey Grodzovsky wrote: It will be used during GPU reset. Signed-off-by: Andrey Grodzovsky ---   drivers/gpu/drm/amd/pm/amdgpu_dpm.c   | 10 +++   drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h   |  3 +++   drivers/gpu

[PATCH 1/2] drm/amd/pm: Add STB dump function.

2022-03-09 Thread Andrey Grodzovsky
It will be used during GPU reset. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 10 +++ drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 3 +++ drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 26 +++ drivers/gpu/drm/amd/pm/swsmu/inc

[PATCH 2/2] drm/amdgpu: Dump STB during ASIC reset.

2022-03-09 Thread Andrey Grodzovsky
This should provide more debug info for the driver. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index

Re: [PATCH 2/2] drm/amdgpu: add work function for GPU reset event

2022-03-08 Thread Andrey Grodzovsky
On 2022-03-08 12:20, Somalapuram, Amaranath wrote: On 3/8/2022 10:00 PM, Sharma, Shashank wrote: Hello Andrey On 3/8/2022 5:26 PM, Andrey Grodzovsky wrote: On 2022-03-07 11:26, Shashank Sharma wrote: From: Shashank Sharma This patch adds a work function, which will get scheduled

Re: [PATCH 1/2] drm: Add GPU reset sysfs event

2022-03-08 Thread Andrey Grodzovsky
On 2022-03-08 12:04, Somalapuram, Amaranath wrote: On 3/8/2022 10:27 PM, Sharma, Shashank wrote: On 3/8/2022 5:55 PM, Andrey Grodzovsky wrote: You can read on their side here - https://www.phoronix.com/scan.php?page=news_item&px=AMD-STB-Linux-5.17 and see their patch. They don't have

Re: [PATCH 1/2] drm: Add GPU reset sysfs event

2022-03-08 Thread Andrey Grodzovsky
on behalf of Andrey Grodzovsky *Sent:* Tuesday, March 8, 2022 9:55:03 PM *To:* Shashank Sharma ; amd-gfx@lists.freedesktop.org *Cc:* Deucher, Alexander ; Somalapuram, Amaranath ; Koenig, Christian ; Sharma, Shashank *Subject:* Re: [PATCH 1/2] drm: Add GPU reset sysfs event On 2022-03-07 11

Re: [PATCH 1/2] drm: Add GPU reset sysfs event

2022-03-08 Thread Andrey Grodzovsky
of Andrey Grodzovsky *Sent:* Tuesday, March 8, 2022 9:55:03 PM *To:* Shashank Sharma ; amd-gfx@lists.freedesktop.org *Cc:* Deucher, Alexander ; Somalapuram, Amaranath ; Koenig, Christian ; Sharma, Shashank *Subject:* Re: [PATCH 1/2] drm: Add GPU reset sysfs event On 2022-03-07 11:26

Re: [PATCH 1/2] drm: Add GPU reset sysfs event

2022-03-08 Thread Andrey Grodzovsky
On 2022-03-08 11:35, Sharma, Shashank wrote: On 3/8/2022 5:25 PM, Andrey Grodzovsky wrote: On 2022-03-07 11:26, Shashank Sharma wrote: From: Shashank Sharma This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like

Re: [PATCH 2/2] drm/amdgpu: add work function for GPU reset event

2022-03-08 Thread Andrey Grodzovsky
On 2022-03-07 11:26, Shashank Sharma wrote: From: Shashank Sharma This patch adds a work function, which will get scheduled in event of a GPU reset, and will send a uevent to user with some reset context information, like a PID and some flags. Where is the actual scheduling of the work

Re: [PATCH 1/2] drm: Add GPU reset sysfs event

2022-03-08 Thread Andrey Grodzovsky
On 2022-03-07 11:26, Shashank Sharma wrote: From: Shashank Sharma This patch adds a new sysfs event, which will indicate the userland about a GPU reset, and can also provide some information like: - which PID was involved in the GPU reset - what was the GPU status (using flags) This patch

Re: [PATCH 10/10] drm/amdgpu: add gang submit frontend

2022-03-07 Thread Andrey Grodzovsky
On 2022-03-03 03:23, Christian König wrote: Allows submitting jobs as gang which needs to run on multiple engines at the same time. All members of the gang get the same implicit, explicit and VM dependencies. So no gang member will start running until everything else is ready. The last job

Re: [PATCH 09/10] drm/amdgpu: add gang submit backend

2022-03-07 Thread Andrey Grodzovsky
:) I am like - I must be crazy because no way this works but you insist that it is and I know u are usually right. Andrey On 2022-03-07 10:59, Christian König wrote: If we don't check for NULL here we would just crash. But you go into the 'if clause' if job->gang_submit is equal to

Re: [PATCH 09/10] drm/amdgpu: add gang submit backend

2022-03-07 Thread Andrey Grodzovsky
On 2022-03-05 13:40, Christian König wrote: On 04.03.22 at 18:10, Andrey Grodzovsky wrote: On 2022-03-03 03:23, Christian König wrote: Allows submitting jobs as gang which needs to run on multiple engines at the same time. Basic idea is that we have a global gang submit fence representing

Re: [PATCH 09/10] drm/amdgpu: add gang submit backend

2022-03-04 Thread Andrey Grodzovsky
On 2022-03-03 03:23, Christian König wrote: Allows submitting jobs as gang which needs to run on multiple engines at the same time. Basic idea is that we have a global gang submit fence representing when the gang leader is finally pushed to run on the hardware last. Jobs submitted as gang

Re: [PATCH 08/10] drm/amdgpu: initialize the vmid_wait with the stub fence

2022-03-03 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-03-03 03:23, Christian König wrote: This way we don't need to check for NULL any more. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 1 + 2 files changed, 2

Re: [PATCH 06/10] drm/amdgpu: properly imbed the IBs into the job

2022-03-03 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-03-03 03:23, Christian König wrote: We now have standard macros for that. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 7 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 6 -- 2 files changed, 5

Re: [PATCH 05/10] drm/amdgpu: use job and ib structures directly in CS parsers

2022-03-03 Thread Andrey Grodzovsky
Acked-by: Andrey Grodzovsky Andrey On 2022-03-03 03:23, Christian König wrote: Instead of providing the ib index provide the job and ib pointers directly to the patch and parse functions for UVD and VCE. Also move the set/get functions for IB values to the IB declarations. Signed-off

Re: [PATCH 02/10] drm/amdgpu: header cleanup

2022-03-03 Thread Andrey Grodzovsky
Acked-by: Andrey Grodzovsky Andrey On 2022-03-03 03:23, Christian König wrote: No function change, just move a bunch of definitions from amdgpu.h into separate header files. Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 95 --- drivers

Re: [PATCH 01/10] drm/amdgpu: install ctx entities with cmpxchg

2022-03-03 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-03-03 03:22, Christian König wrote: Since we removed the context lock we need to make sure that not two threads are trying to install an entity at the same time. Signed-off-by: Christian König Fixes: e68efb27647f ("drm/amdgpu: remove ctx-

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-03 Thread Andrey Grodzovsky
I pushed all the changes including your patch. Andrey On 2022-03-02 22:16, Andrey Grodzovsky wrote: OK, I will do quick smoke test tomorrow and push all of it then. Andrey On 2022-03-02 21:59, Chen, JingWen wrote: Hi Andrey, I don't have the bare metal environment, I can only test

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-02 Thread Andrey Grodzovsky
: The patch is acked-by: Andrey Grodzovsky If you also smoke tested bare metal feel free to apply all the patches, if not let me know. Andrey On 2022-03-02 04:51, JingWen Chen wrote: Hi Andrey, Most part of the patches are OK, but the code will introduce a ib test fail on the disabled vcn

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-02 Thread Andrey Grodzovsky
The patch is acked-by: Andrey Grodzovsky If you also smoke tested bare metal feel free to apply all the patches, if not let me know. Andrey On 2022-03-02 04:51, JingWen Chen wrote: Hi Andrey, Most part of the patches are OK, but the code will introduce a ib test fail on the disabled vcn

Re: [PATCH 1/2] drm/amdgpu: Fix sigsev when accessing MMIO on hot unplug.

2022-03-02 Thread Andrey Grodzovsky
Thanks, already did. Code pushed both here and in libdrm. Andrey On 2022-03-02 03:37, Christian König wrote: On 01.03.22 at 19:07, Andrey Grodzovsky wrote: Protect with drm_dev_enter/exit Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König for this one here. Regarding

[PATCH 2/2] drm/amdgpu: Bump minor version for hot plug tests enabling.

2022-03-01 Thread Andrey Grodzovsky
the tests finally - if other people during testing will encounter errors they will report and I will be able to fix. The related merge request for enabling libdrm tests suite is in https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/227 Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd

[PATCH 1/2] drm/amdgpu: Fix sigsev when accessing MMIO on hot unplug.

2022-03-01 Thread Andrey Grodzovsky
Protect with drm_dev_enter/exit Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c index f522b52725e4

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-25 Thread Andrey Grodzovsky
are depending on this patch series to fix the concurrency issue within SRIOV TDR sequence. On 2/25/22 1:26 AM, Andrey Grodzovsky wrote: No problem if so but before I do, JingWen - why you think this patch is needed as a standalone now ? It has no use without the entire feature together

Re: [PATCH] drm/amdgpu: Exclude PCI reset method for now.

2022-02-24 Thread Andrey Grodzovsky
On 2022-02-24 13:11, Alex Deucher wrote: On Thu, Feb 24, 2022 at 1:05 PM Andrey Grodzovsky wrote: According to my investigation of the state of PCI reset recently it's not working. The reason is due to the fact the kernel PCI code rejects SBR when there are more than one PF under same bridge

[PATCH] drm/amdgpu: Exclude PCI reset method for now.

2022-02-24 Thread Andrey Grodzovsky
and devices under the same bridge as you and you cannot assume they support SBR. Once we enable FLR support we can reenable this option as FLR is doable on single PF and doesn't have this restriction. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 + drivers

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-24 Thread Andrey Grodzovsky
] If it applies cleanly, feel free to drop it in.  I'll drop those patches for drm-next since they are already in drm-misc. Alex *From:* amd-gfx on behalf of Andrey Grodzovsky *Sent:* Thursday, February 24, 2022 11:24 AM

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-24 Thread Andrey Grodzovsky
Grodzovsky wrote: All comments are fixed and code pushed. Thanks to everyone who helped reviewing. Andrey On 2022-02-09 02:53, Christian König wrote: On 09.02.22 at 01:23, Andrey Grodzovsky wrote: Before we initialize schedulers we must know which reset domain we are in - for single device

Re: [PATCH v11 2/2] drm/amdgpu: add reset register dump trace on GPU

2022-02-22 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-02-22 09:37, Somalapuram Amaranath wrote: Dump the list of register values to trace event on GPU reset. Signed-off-by: Somalapuram Amaranath --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 + drivers/gpu/drm/amd/amdgpu

Re: [PATCH] drm/sched: Add device pointer to drm_gpu_scheduler

2022-02-20 Thread Andrey Grodzovsky
On 2022-02-20 22:32, Gu, JiaWei (Will) wrote: [AMD Official Use Only] Pinging. -Original Message- From: Jiawei Gu Sent: Thursday, February 17, 2022 6:44 PM To: dri-de...@lists.freedesktop.org; amd-gfx@lists.freedesktop.org; Koenig, Christian ; Grodzovsky, Andrey ; Liu, Monk ; Deng,

Re: [PATCH v4 2/2] drm/amdgpu: add reset register dump trace on GPU reset

2022-02-16 Thread Andrey Grodzovsky
On 2022-02-16 05:46, Somalapuram, Amaranath wrote: On 2/15/2022 10:09 PM, Andrey Grodzovsky wrote: On 2022-02-15 05:12, Somalapuram Amaranath wrote: Dump the list of register values to trace event on GPU reset. Signed-off-by: Somalapuram Amaranath ---   drivers/gpu/drm/amd/amdgpu

Re: [PATCH v4 2/2] drm/amdgpu: add reset register dump trace on GPU reset

2022-02-15 Thread Andrey Grodzovsky
On 2022-02-15 05:12, Somalapuram Amaranath wrote: Dump the list of register values to trace event on GPU reset. Signed-off-by: Somalapuram Amaranath --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 - drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 16 2

Re: [PATCH] drm/sched: Add device pointer to drm_gpu_scheduler

2022-02-15 Thread Andrey Grodzovsky
Acked-by: Andrey Grodzovsky Andrey On 2022-02-15 06:29, Jiawei Gu wrote: Add device pointer so scheduler's printing can use DRM_DEV_ERROR() instead, which makes life easier under multiple GPU scenario. Signed-off-by: Jiawei Gu --- drivers/gpu/drm/scheduler/sched_main.c | 9

[PATCH] drm/amdgpu: Fix htmldoc warning

2022-02-11 Thread Andrey Grodzovsky
Update function name. Signed-off-by: Andrey Grodzovsky Reported-by: kernel test robot --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

Re: [PATCH] drm/amdgpu: remove ctx->lock

2022-02-11 Thread Andrey Grodzovsky
On 2022-02-11 03:24, Ken Xue wrote: KMD reports a warning on holding a lock from drm_syncobj_find_fence, when running amdgpu_test case “syncobj timeline test”. ctx->lock was designed to prevent concurrent "amdgpu_ctx_wait_prev_fence" calls and avoid dead reservation lock from GPU reset.

Re: [Patch V2] drm/amdgpu: Handle the GPU recovery failure in SRIOV environment.

2022-02-10 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-02-03 21:45, Surbhi Kakarya wrote: This patch handles the GPU recovery failure in sriov environment by retrying the reset if the first reset fails. To determine the condition of retry, a new macro AMDGPU_RETRY_SRIOV_RESET is added which returns

Re: [PATCH] drm/amdgpu: Fix compile error.

2022-02-10 Thread Andrey Grodzovsky
On 2022-02-10 02:06, Christian König wrote: On 10.02.22 at 04:17, Andrey Grodzovsky wrote: Seems I forgot to add this to the relevant commit when submitting. Rebase/merge issue? Looks like it. It looks more like I forgot to add the header file change to the commit after updating

[PATCH] drm/amdgpu: Fix compile error.

2022-02-09 Thread Andrey Grodzovsky
Seems I forgot to add this to the relevant commit when submitting. Signed-off-by: Andrey Grodzovsky Reported-by: kernel test robot --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-09 Thread Andrey Grodzovsky
All comments are fixed and code pushed. Thanks to everyone who helped with reviewing. Andrey On 2022-02-09 02:53, Christian König wrote: On 09.02.22 at 01:23, Andrey Grodzovsky wrote: Before we initialize schedulers we must know which reset domain we are in - for a single device there is a single

Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-02-09 Thread Andrey Grodzovsky
Thanks a lot! Andrey On 2022-02-09 01:06, JingWen Chen wrote: Hi Andrey, I have been testing your patch and it seems fine so far. Best Regards, Jingwen Chen On 2022/2/3 2:57 AM, Andrey Grodzovsky wrote: Just another ping, with Shyun's help I was able to do some smoke testing on XGMI

[RFC v4 07/11] drm/amdgpu: Rework reset domain to be refcounted.

2022-02-08 Thread Andrey Grodzovsky
put and a wrapper around send to reset wq (Lijo) Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 6 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 44 +- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 40 drivers/gpu/drm/

[RFC v4 11/11] Revert 'drm/amdgpu: annotate a false positive recursive locking'

2022-02-08 Thread Andrey Grodzovsky
Since we have a single instance of the reset semaphore, which we lock only once even for an XGMI hive, we don't need the nested locking hint anymore. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 -- 1 file changed, 4 insertions(+), 10 deletions

[RFC v4 06/11] drm/amdgpu: Drop concurrent GPU reset protection for device

2022-02-08 Thread Andrey Grodzovsky
Since all GPU resets are now serialized there is no need for this. This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout' Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 89 ++ 1 file

[RFC v4 10/11] drm/amdgpu: Rework amdgpu_device_lock_adev

2022-02-08 Thread Andrey Grodzovsky
This function needs to be split into 2 parts: one is called only once, to lock the single instance of the reset_domain's sem and reset flag, and the other, which handles MP1 states, should still be called for each device in the XGMI hive. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd

[RFC v4 08/11] drm/amdgpu: Move reset sem into reset_domain

2022-02-08 Thread Andrey Grodzovsky
We want a single instance of the reset sem across all reset clients because, in the case of XGMI, we should stop cross-device MMIO access, since any of the devices could be in a reset at the moment. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 - drivers/gpu/drm/amd

[RFC v4 09/11] drm/amdgpu: Move in_gpu_reset into reset_domain

2022-02-08 Thread Andrey Grodzovsky
We should have a single instance per entire reset domain. Signed-off-by: Andrey Grodzovsky Suggested-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 7 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +++--- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1

[RFC v4 05/11] drm/amdgpu: Drop hive->in_reset

2022-02-08 Thread Andrey Grodzovsky
Since we serialize all resets, there is no need to protect against concurrent resets. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +-- drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 1 - drivers/gpu/drm/amd/amdgpu

[RFC v4 04/11] drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.

2022-02-08 Thread Andrey Grodzovsky
No need to trigger another work queue inside the work queue. v3: Problem: Extra reset caused by host side FLR notification following guest side triggered reset. Fix: Prevent queuing flr_work from mailbox irq if guest is already executing a reset. Suggested-by: Liu Shaoyun Signed-off-by: Andrey

[RFC v4 03/11] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-02-08 Thread Andrey Grodzovsky
to queue work and wait on it to finish. v2: Rename to amdgpu_recover_work_struct Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 33 +- drivers/gpu/drm/amd/amdgpu

[RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-08 Thread Andrey Grodzovsky
Before we initialize schedulers we must know which reset domain we are in - for a single device there is a single domain per device and so a single wq per device. For XGMI the reset domain spans the entire XGMI hive and so the reset wq is per hive. Signed-off-by: Andrey Grodzovsky --- drivers/gpu
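The domain selection described in this commit message can be pictured with a small userspace sketch. It is only an illustration under stated assumptions: the `hive`, `device`, and `reset_domain` structs and the `device_reset_domain()` helper are hypothetical names invented here, not the amdgpu definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not the amdgpu structures. */
struct reset_domain {
    int id;                           /* stands in for the per-domain reset wq */
};

struct hive {
    struct reset_domain domain;       /* one domain shared by the whole hive */
};

struct device {
    struct hive *hive;                /* NULL when the device is not in a hive */
    struct reset_domain own_domain;   /* used only by a standalone device */
};

/* Return the reset domain a device belongs to: the hive's domain when
 * the device is part of a hive, otherwise the device's own domain. */
static struct reset_domain *device_reset_domain(struct device *dev)
{
    return dev->hive ? &dev->hive->domain : &dev->own_domain;
}
```

In this sketch, two devices that point at the same hive resolve to the same domain, so anything queued per domain is shared between them, while a standalone device resolves to its own.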

[RFC v4 01/11] drm/amdgpu: Introduce reset domain

2022-02-08 Thread Andrey Grodzovsky
Defined a reset_domain struct such that all the entities that go through reset together will be serialized one against another. Do it for both single device and XGMI hive cases. Signed-off-by: Andrey Grodzovsky Suggested-by: Daniel Vetter Suggested-by: Christian König Reviewed-by: Christian

[RFC v4 00/11] Define and use reset domain for GPU recovery in amdgpu

2022-02-08 Thread Andrey Grodzovsky
md-gfx/msg58836.html P.S. Going through drm-misc-next and not amd-staging-drm-next, as Boris's work hasn't landed there yet. P.P.S. Patches 8-12 are the refactor on top of the original V2 patchset. Andrey Grodzovsky (11): drm/amdgpu: Introduce reset domain drm/amdgpu: Move scheduler init to after XGMI is

Re: [RFC v4] drm/amdgpu: Rework reset domain to be refcounted.

2022-02-08 Thread Andrey Grodzovsky
On 2022-02-08 06:25, Lazar, Lijo wrote: On 2/2/2022 10:56 PM, Andrey Grodzovsky wrote: The reset domain now contains the register access semaphore and so needs to be present as long as each device in a hive needs it, and so it cannot be bound to the XGMI hive life cycle. Address this by making reset
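The lifetime problem raised here is the usual case for reference counting: the object lives until its last user drops it, independent of whoever created it. Below is a minimal userspace sketch using C11 atomics, assuming nothing about the real patch. The kernel would typically use `kref` for this; the struct layout, function names, and the `freed_flag` field are hypothetical and exist only to make the release observable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace model, not the amdgpu structure. */
struct reset_domain {
    atomic_int refcount;   /* number of current holders */
    int *freed_flag;       /* set to 1 on release, for demonstration only */
};

/* Create a domain holding one reference for the caller. */
static struct reset_domain *reset_domain_create(int *freed_flag)
{
    struct reset_domain *d = malloc(sizeof(*d));

    if (!d)
        return NULL;
    atomic_init(&d->refcount, 1);
    d->freed_flag = freed_flag;
    return d;
}

/* Take an extra reference, e.g. for each device that uses the domain. */
static void reset_domain_get(struct reset_domain *d)
{
    atomic_fetch_add(&d->refcount, 1);
}

/* Drop a reference; the last holder frees the domain. */
static void reset_domain_put(struct reset_domain *d)
{
    /* atomic_fetch_sub returns the value before the decrement. */
    if (atomic_fetch_sub(&d->refcount, 1) == 1) {
        if (d->freed_flag)
            *d->freed_flag = 1;
        free(d);
    }
}
```

With this shape, the creator can drop its reference early and the domain stays valid for as long as any device still holds one.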
