Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Andrey Grodzovsky
On 2022-05-16 11:08, Christian König wrote: Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky: Ping Ah, yes sorry. Andrey On 2022-05-13 11:41, Andrey Grodzovsky wrote: Yes, exactly that's the idea. Basically the reset domain knows which amdgpu devices it needs to reset together. If

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Christian König
Am 16.05.22 um 16:12 schrieb Andrey Grodzovsky: Ping Ah, yes sorry. Andrey On 2022-05-13 11:41, Andrey Grodzovsky wrote: Yes, exactly that's the idea. Basically the reset domain knows which amdgpu devices it needs to reset together. If you then represent that so that you always have

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-16 Thread Andrey Grodzovsky
Ping Andrey On 2022-05-13 11:41, Andrey Grodzovsky wrote: Yes, exactly that's the idea. Basically the reset domain knows which amdgpu devices it needs to reset together. If you then represent that so that you always have a hive even when you only have one device in it, or if you put an

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-13 Thread Andrey Grodzovsky
On 2022-05-12 09:15, Christian König wrote: Am 12.05.22 um 15:07 schrieb Andrey Grodzovsky: On 2022-05-12 02:06, Christian König wrote: Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky: On 2022-05-11 11:39, Christian König wrote: Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky
Sure, I will investigate that. What about the ticket that Lijo raised, which was basically doing 8 resets instead of one? Lijo - can this ticket wait until I come up with this new design for the amdgpu reset function, or do you need a quick solution now, in which case we can use the already existing

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Christian König
Am 12.05.22 um 15:07 schrieb Andrey Grodzovsky: On 2022-05-12 02:06, Christian König wrote: Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky: On 2022-05-11 11:39, Christian König wrote: Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky
On 2022-05-12 02:06, Christian König wrote: Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky: On 2022-05-11 11:39, Christian König wrote: Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Andrey Grodzovsky
On 2022-05-12 02:03, Christian König wrote: Am 11.05.22 um 17:57 schrieb Andrey Grodzovsky: [SNIP] How about we do it like this then: struct amdgpu_reset_domain { union {     struct {         struct work_item debugfs;         struct work_item ras;        

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Lazar, Lijo
On 5/12/2022 11:36 AM, Christian König wrote: Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky: On 2022-05-11 11:39, Christian König wrote: Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Christian König
Am 11.05.22 um 22:27 schrieb Andrey Grodzovsky: On 2022-05-11 11:39, Christian König wrote: Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-12 Thread Christian König
Am 11.05.22 um 17:57 schrieb Andrey Grodzovsky: [SNIP] How about we do it like this then: struct amdgpu_reset_domain { union {     struct {         struct work_item debugfs;         struct work_item ras;             };     struct work_item

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:39, Christian König wrote: Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:39, Christian König wrote: Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:46, Lazar, Lijo wrote: On 5/11/2022 9:13 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:37, Lazar, Lijo wrote: On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Lazar, Lijo
On 5/11/2022 9:13 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:37, Lazar, Lijo wrote: On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:37, Lazar, Lijo wrote: On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um 20:53

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Christian König
Am 11.05.22 um 17:35 schrieb Andrey Grodzovsky: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky: [SNIP] E.g. in

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Lazar, Lijo
On 5/11/2022 9:05 PM, Andrey Grodzovsky wrote: On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky: [SNIP] E.g.

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 11:20, Lazar, Lijo wrote: On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky: [SNIP] E.g. in the reset code (either before or after the

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Lazar, Lijo
On 5/11/2022 7:28 PM, Christian König wrote: Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky: [SNIP] E.g. in the reset code (either before or after the reset, that's debatable) you do something like

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Christian König
Am 11.05.22 um 15:43 schrieb Andrey Grodzovsky: On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky: [SNIP] E.g. in the reset code (either before or after the reset, that's debatable) you do something like this: for (i = 0; i < num_ring; ++i)    

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Andrey Grodzovsky
On 2022-05-11 03:38, Christian König wrote: Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky: On 2022-05-10 13:19, Christian König wrote: Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky: On 2022-05-10 12:17, Christian König wrote: Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky: [SNIP]

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-11 Thread Christian König
Am 10.05.22 um 20:53 schrieb Andrey Grodzovsky: On 2022-05-10 13:19, Christian König wrote: Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky: On 2022-05-10 12:17, Christian König wrote: Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky: [SNIP] That's one of the reasons why we should have

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky
On 2022-05-10 13:19, Christian König wrote: Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky: On 2022-05-10 12:17, Christian König wrote: Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky: [SNIP] That's one of the reasons why we should have multiple work items for job based reset and other

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Christian König
Am 10.05.22 um 19:01 schrieb Andrey Grodzovsky: On 2022-05-10 12:17, Christian König wrote: Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky: [SNIP] That's one of the reasons why we should have multiple work items for job based reset and other reset sources. See the whole idea is the

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky
On 2022-05-10 12:17, Christian König wrote: Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky: [SNIP] That's one of the reasons why we should have multiple work items for job based reset and other reset sources. See the whole idea is the following: 1. We have one single queued work queue for

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Christian König
Am 10.05.22 um 18:00 schrieb Andrey Grodzovsky: [SNIP] That's one of the reasons why we should have multiple work items for job based reset and other reset sources. See the whole idea is the following: 1. We have one single queued work queue for each reset domain which makes sure that all

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-10 Thread Andrey Grodzovsky
On 2022-05-06 04:56, Christian König wrote: Am 06.05.22 um 08:02 schrieb Lazar, Lijo: On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote: On 2022-05-05 15:49, Felix Kuehling wrote: Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky: On 2022-05-05 11:06, Christian König wrote: Am 05.05.22 um

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-06 Thread Christian König
Am 06.05.22 um 08:02 schrieb Lazar, Lijo: On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote: On 2022-05-05 15:49, Felix Kuehling wrote: Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky: On 2022-05-05 11:06, Christian König wrote: Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky: On 2022-05-05

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-06 Thread Lazar, Lijo
On 5/6/2022 3:17 AM, Andrey Grodzovsky wrote: On 2022-05-05 15:49, Felix Kuehling wrote: Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky: On 2022-05-05 11:06, Christian König wrote: Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky: On 2022-05-05 09:23, Christian König wrote: Am

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Luben Tuikov
On 2022-05-05 17:47, Andrey Grodzovsky wrote: On 2022-05-05 15:49, Felix Kuehling wrote: Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky: On 2022-05-05 11:06, Christian König wrote: Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky: On 2022-05-05 09:23,

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 15:49, Felix Kuehling wrote: Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky: On 2022-05-05 11:06, Christian König wrote: Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky: On 2022-05-05 09:23, Christian König wrote: Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky: On

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Felix Kuehling
Am 2022-05-05 um 14:57 schrieb Andrey Grodzovsky: On 2022-05-05 11:06, Christian König wrote: Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky: On 2022-05-05 09:23, Christian König wrote: Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky: On 2022-05-05 06:09, Christian König wrote: Am

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 11:06, Christian König wrote: Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky: On 2022-05-05 09:23, Christian König wrote: Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky: On 2022-05-05 06:09, Christian König wrote: Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky: Problem:

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Christian König
Am 05.05.22 um 15:54 schrieb Andrey Grodzovsky: On 2022-05-05 09:23, Christian König wrote: Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky: On 2022-05-05 06:09, Christian König wrote: Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky: Problem: During hive reset caused by a command timing out on

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 09:23, Christian König wrote: Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky: On 2022-05-05 06:09, Christian König wrote: Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky: Problem: During hive reset caused by a command timing out on a ring, extra resets are triggered

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Christian König
Am 05.05.22 um 15:15 schrieb Andrey Grodzovsky: On 2022-05-05 06:09, Christian König wrote: Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky: Problem: During hive reset caused by a command timing out on a ring, extra resets are triggered by KFD, which is unable to access registers on

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Andrey Grodzovsky
On 2022-05-05 06:09, Christian König wrote: Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky: Problem: During hive reset caused by a command timing out on a ring, extra resets are triggered by KFD, which is unable to access registers on the resetting ASIC. Fix: Rework GPU reset to

Re: [PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-05 Thread Christian König
Am 04.05.22 um 18:18 schrieb Andrey Grodzovsky: Problem: During hive reset caused by a command timing out on a ring, extra resets are triggered by KFD, which is unable to access registers on the resetting ASIC. Fix: Rework GPU reset to use a list of pending reset jobs such that the

[PATCH] drm/amdgpu: Fix multiple GPU resets in XGMI hive.

2022-05-04 Thread Andrey Grodzovsky
Problem: During hive reset caused by a command timing out on a ring, extra resets are triggered by KFD, which is unable to access registers on the resetting ASIC. Fix: Rework GPU reset to use a list of pending reset jobs such that the first reset job that actually resets the entire
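
Illustrative note (not part of the original patch): the preview above is cut off, so the following is only a minimal user-space sketch of the "list of pending reset jobs" idea. It assumes the truncated sentence means that the first job to actually perform the reset marks every other queued job as redundant. All names (reset_domain, reset_job, reset_domain_queue, reset_domain_run) are hypothetical and are not the amdgpu API. Locking is omitted; a real driver would need to serialize access, for example through the single queued work queue per reset domain mentioned elsewhere in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reset sources, loosely following the ones named in the thread. */
enum reset_source { RESET_SRC_JOB_TIMEOUT, RESET_SRC_KFD, RESET_SRC_RAS };

/* One queued request to reset the domain (hypothetical layout). */
struct reset_job {
    enum reset_source source;
    bool pending;            /* queued and not yet handled */
    bool cancelled;          /* dropped because another job already reset */
    struct reset_job *next;  /* next job in the pending list */
};

/* A group of devices that must be reset together (assumption: one list per domain). */
struct reset_domain {
    struct reset_job *pending_head;
    unsigned int resets_done;        /* number of real resets performed */
};

static void reset_domain_init(struct reset_domain *d)
{
    d->pending_head = NULL;
    d->resets_done = 0;
}

/* Add a job to the domain's pending list. */
static void reset_domain_queue(struct reset_domain *d, struct reset_job *job,
                               enum reset_source src)
{
    job->source = src;
    job->pending = true;
    job->cancelled = false;
    job->next = d->pending_head;
    d->pending_head = job;
}

/*
 * Run one job. If it is still pending, it performs the (simulated) reset once
 * and cancels every other pending job. Returns true only if a reset happened.
 */
static bool reset_domain_run(struct reset_domain *d, struct reset_job *job)
{
    if (!job->pending)
        return false;        /* already made redundant by an earlier reset */

    d->resets_done++;        /* stand-in for the actual hardware reset */

    for (struct reset_job *it = d->pending_head; it != NULL; it = it->next) {
        if (it != job && it->pending)
            it->cancelled = true;
        it->pending = false;
    }
    d->pending_head = NULL;
    return true;
}
```

With three jobs queued against one domain, only the first call to reset_domain_run() performs a reset; the later calls return false because their jobs were already cancelled.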