Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König
[SNIP] Maybe just empirically - let's try it and see under different test scenarios what actually happens? Not a good idea in general; we take that approach way too often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Andrey Grodzovsky
On 2021-04-15 3:02 a.m., Christian König wrote: On 15.04.21 at 08:27, Andrey Grodzovsky wrote: On 2021-04-14 10:58 a.m., Christian König wrote: On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König
On 15.04.21 at 08:27, Andrey Grodzovsky wrote: On 2021-04-14 10:58 a.m., Christian König wrote: On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we first stop all the schedulers, and only then call

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Andrey Grodzovsky
On 2021-04-14 10:58 a.m., Christian König wrote: On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we first stop all the schedulers, and only then call drm_sched_entity_kill_jobs. The schedulers

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König
On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we first stop all the schedulers, and only then call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem; we already handle that

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Andrey Grodzovsky
On 2021-04-14 3:01 a.m., Christian König wrote: On 13.04.21 at 20:30, Andrey Grodzovsky wrote: On 2021-04-13 2:25 p.m., Christian König wrote: On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König
On 13.04.21 at 20:30, Andrey Grodzovsky wrote: On 2021-04-13 2:25 p.m., Christian König wrote: On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking; Daniel Vetter > Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability > > On 12.04.21 at 22:01, Andrey Grodzovsky wrote: > > > > On

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
On Tue, Apr 13, 2021 at 9:10 AM Christian König wrote: > > On 12.04.21 at 22:01, Andrey Grodzovsky wrote: > > > > On 2021-04-12 3:18 p.m., Christian König wrote: > >> On 12.04.21 at 21:12, Andrey Grodzovsky wrote: > >>> [SNIP] > > > > So what's the right approach? How do we guarantee that

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky
On 2021-04-13 2:25 p.m., Christian König wrote: On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky
On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky
On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
Christian; Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking; Daniel Vetter Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Li, Dennis
tian König Sent: Tuesday, April 13, 2021 3:10 PM To: Grodzovsky, Andrey; Koenig, Christian; Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking; Daniel Vetter Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 12.04.2

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 13.04.21 at 07:36, Andrey Grodzovsky wrote: [SNIP] emit_fence(fence); /* We can't wait forever as the HW might be gone at any point */ dma_fence_wait_timeout(old_fence, 5S); You can pretty much ignore this wait here. It is only there as a last resort so that we never overwrite

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 2:23 p.m., Christian König wrote: On 12.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not race against further fence insertion

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not race against further fence insertion into that array? Well, I would still say the

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 2:23 p.m., Christian König wrote: On 12.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and, if true, add the new HW fence to the list and

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-10 Thread Christian König
Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and, if true, add the new HW fence to the list and proceed to HW emission as normal, otherwise return
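
The list-plus-flag scheme quoted in this snippet can be illustrated with a small user-space sketch. It is not amdgpu code: `struct fence_tracker`, `struct hw_fence`, and the `tracker_*` functions are hypothetical stand-ins, a pthread mutex replaces whatever kernel lock would be used, and the quoted text is cut off before it says what is returned when emission is refused, so the `-ENODEV` below is an assumption. The force-completion side (clear the flag and signal everything on the list under the same lock) is likewise inferred from the related question in this thread about racing against new fence insertions.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a HW fence; not an amdgpu or dma_fence type. */
struct hw_fence {
    struct hw_fence *next;
    bool signaled;
};

/* Hypothetical tracker: pending list and 'emit_allowed' share one lock. */
struct fence_tracker {
    pthread_mutex_t lock;
    bool emit_allowed;
    struct hw_fence *pending;
};

void tracker_init(struct fence_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    t->emit_allowed = true;
    t->pending = NULL;
}

/* Emit path: check the flag and add to the list under the same lock. */
int tracker_emit(struct fence_tracker *t, struct hw_fence *f)
{
    pthread_mutex_lock(&t->lock);
    if (!t->emit_allowed) {
        pthread_mutex_unlock(&t->lock);
        return -ENODEV; /* assumed error code; the source is truncated here */
    }
    f->signaled = false;
    f->next = t->pending;
    t->pending = f;
    pthread_mutex_unlock(&t->lock);
    /* The actual HW emission would follow here. */
    return 0;
}

/*
 * Force-completion path: forbid further emission, then signal every
 * fence that made it onto the list. Returns the number signaled.
 */
size_t tracker_force_completion(struct fence_tracker *t)
{
    size_t n = 0;

    pthread_mutex_lock(&t->lock);
    t->emit_allowed = false;
    while (t->pending) {
        struct hw_fence *f = t->pending;

        t->pending = f->next;
        f->next = NULL;
        f->signaled = true;
        n++;
    }
    pthread_mutex_unlock(&t->lock);
    return n;
}
```

Because the flag check and the list insertion happen under one lock, a fence is either on the list before force completion runs (and gets signaled) or is rejected afterwards; it cannot slip in between.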

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky
On 2021-04-09 12:39 p.m., Christian König wrote: On 09.04.21 at 17:42, Andrey Grodzovsky wrote: On 2021-04-09 3:01 a.m., Christian König wrote: On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: [SNIP] But inserting drm_dev_enter/exit on the

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König
On 09.04.21 at 17:42, Andrey Grodzovsky wrote: On 2021-04-09 3:01 a.m., Christian König wrote: On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: [SNIP] But inserting drm_dev_enter/exit on the highest level in drm_ioctl is much less effort and

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky
On 2021-04-09 3:01 a.m., Christian König wrote: On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König
On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP]

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König
On 08.04.21 at 22:39, Andrey Grodzovsky wrote: On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP] Beyond blocking all delayed works and

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky
On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP] Beyond blocking all delayed works and scheduler threads, we also need to guarantee no IOCTL

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP] Beyond blocking all delayed works and scheduler threads, we also need to guarantee no IOCTL can access MMIO post device unplug OR in flight

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky
Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability >>> Those two steps need to be exchanged, otherwise it is possible that new delayed work items etc. are started be

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
Kuehling, Felix; Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability >>> Those two steps need to be exchanged, otherwise it is possible that n

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability >>> Those two steps need to be exchanged, otherwise it is possible that new delayed work items etc. are started bef

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Andrey Grodzovsky
Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 18.03.21 at 08:23, Dennis Li

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Christian König
Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 18.03.21 at 08:23, Dennis Li wrote: >

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Andrey Grodzovsky
Kuehling, Felix; Zhang, Hawking Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 18.03.21 at 08:23, Dennis Li wrote: > We have defined two var

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König
amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König
Alexander; Kuehling, Felix; Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability >>> Those two steps need to be ex

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-05 Thread Andrey Grodzovsky
Alexander; Kuehling, Felix; Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability >>> Those two steps need to be exchanged o

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König
Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 18.03.21 at 08:23

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
:54 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 18.03.21 at 08:23, Dennis Li wrote: > We have defined two variables, in_gpu_reset and reset_sem, in ad

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König
On 18.03.21 at 08:23, Dennis Li wrote: We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The atomic variable in_gpu_reset is used to keep the recovery thread from re-entering and to make lower-level functions return earlier once recovery starts, but it cannot block the recovery thread
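
The role the snippet describes for the atomic `in_gpu_reset` flag (no recovery re-entry, early return in lower-level paths) can be sketched in user-space C11. This is only an illustration under assumptions: `struct fake_adev` and every function below are hypothetical, C11 `<stdatomic.h>` stands in for the kernel's atomic helpers, and `-EBUSY` is an assumed error code. The sketch deliberately covers only the flag; it does not model `reset_sem`, whose description is cut off in the snippet.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device object; not the real adev. */
struct fake_adev {
    atomic_int in_gpu_reset;
};

void fake_adev_init(struct fake_adev *adev)
{
    atomic_init(&adev->in_gpu_reset, 0);
}

/*
 * Only the caller that flips the flag from 0 to 1 may run recovery;
 * a second caller sees 1 and backs off, which prevents re-entry.
 */
bool recovery_try_begin(struct fake_adev *adev)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&adev->in_gpu_reset, &expected, 1);
}

void recovery_end(struct fake_adev *adev)
{
    atomic_store(&adev->in_gpu_reset, 0);
}

/* Lower-level path: bail out early while a reset is in progress. */
int lower_level_op(struct fake_adev *adev)
{
    if (atomic_load(&adev->in_gpu_reset))
        return -EBUSY; /* assumed error code */
    /* Normal hardware access would go here. */
    return 0;
}
```

Note that the check in `lower_level_op` is only advisory: a caller that passed the check just before the flag was set keeps running, which is the kind of gap a flag on its own cannot close.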