Re: [PATCH v8] drm/sched: Add FIFO sched policy to run queue

2022-09-30 Thread Andrey Grodzovsky
Thanks for helping with the review and for the good improvement ideas. Pushed to drm-misc-next. Andrey On 2022-09-30 00:12, Luben Tuikov wrote: From: Andrey Grodzovsky When many entities are competing for the same run queue on the same scheduler, we observe unusually long wait times and some jobs

[PATCH v5] drm/sched: Add FIFO sched policy to run queue

2022-09-28 Thread Andrey Grodzovsky
: Various cosmetic fixes and minor refactoring of fifo update function. (Luben) v4: Switch drm_sched_rq_select_entity_fifo to in-order search (Luben) v5: Fix up drm_sched_rq_select_entity_fifo loop Signed-off-by: Andrey Grodzovsky Tested-by: Li Yunxiang (Teddy) --- drivers/gpu/drm/sche

Re: [PATCH v4] drm/sched: Add FIFO sched policy to run queue v3

2022-09-27 Thread Andrey Grodzovsky
Hey, I have problems with my git-send today, so I just attached v5 as a patch here. Andrey On 2022-09-27 19:56, Luben Tuikov wrote: Inlined: On 2022-09-22 12:15, Andrey Grodzovsky wrote: On 2022-09-22 11:03, Luben Tuikov wrote: The title of this patch has "v3", but "v4"

Re: [PATCH v4] drm/sched: Add FIFO sched policy to run queue v3

2022-09-23 Thread Andrey Grodzovsky
Ping Andrey On 2022-09-22 12:15, Andrey Grodzovsky wrote: On 2022-09-22 11:03, Luben Tuikov wrote: The title of this patch has "v3", but "v4" in the title prefix. If you're using "-v" to git-format-patch, please remove the "v3" from the t

Re: [PATCH v4] drm/sched: Add FIFO sched policy to run queue v3

2022-09-22 Thread Andrey Grodzovsky
On 2022-09-22 11:03, Luben Tuikov wrote: The title of this patch has "v3", but "v4" in the title prefix. If you're using "-v" to git-format-patch, please remove the "v3" from the title. Inlined: On 2022-09-21 14:28, Andrey Grodzovsky wrote: When

[PATCH v4] drm/sched: Add FIFO sched policy to run queue v3

2022-09-21 Thread Andrey Grodzovsky
op default option in module control parameter. v3: Various cosmetic fixes and minor refactoring of fifo update function. (Luben) v4: Switch drm_sched_rq_select_entity_fifo to in-order search (Luben) Signed-off-by: Andrey Grodzovsky Tested-by: Li Yunxiang (Teddy) --- drivers/gpu/drm/scheduler/

Re: [PATCH v3] drm/sched: Add FIFO sched policy to run queue v3

2022-09-20 Thread Andrey Grodzovsky
On 2022-09-19 23:11, Luben Tuikov wrote: Please run this patch through checkpatch.pl, as it shows 12 warnings with it. Use these command line options: "--strict --show-types". Inlined: On 2022-09-13 16:40, Andrey Grodzovsky wrote: Given many entities competing for same run queue o

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-19 Thread Andrey Grodzovsky
Zhao, Victor ; amd-gfx@lists.freedesktop.org Cc: Deng, Emily Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow On 2022-09-16 01:18, Christian König wrote: Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky: On 2022-09-15 15:26, Christian König wrote: Am 15.09.22 um 20:29 sc

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-16 Thread Andrey Grodzovsky
On 2022-09-16 01:18, Christian König wrote: Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky: On 2022-09-15 15:26, Christian König wrote: Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky: On 2022-09-15 06:09, Zhao, Victor wrote: [AMD Official Use Only - General] Hi Christian, The test

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Andrey Grodzovsky
On 2022-09-15 15:26, Christian König wrote: Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky: On 2022-09-15 06:09, Zhao, Victor wrote: [AMD Official Use Only - General] Hi Christian, The test sequence is executing a compute engine hang while running a lot of containers submitting gfx jobs

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Andrey Grodzovsky
Had a typo - see below On 2022-09-15 14:29, Andrey Grodzovsky wrote: On 2022-09-15 06:09, Zhao, Victor wrote: [AMD Official Use Only - General] Hi Christian, The test sequence is executing a compute engine hang while running a lot of containers submitting gfx jobs. We have advanced tdr

Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

2022-09-15 Thread Andrey Grodzovsky
On 2022-09-15 06:09, Zhao, Victor wrote: [AMD Official Use Only - General] Hi Christian, The test sequence is executing a compute engine hang while running a lot of containers submitting gfx jobs. We have advanced tdr mode and mode2 reset enabled on driver. When a compute hang job timeout h

[PATCH v3] drm/sched: Add FIFO sched policy to run queue v3

2022-09-13 Thread Andrey Grodzovsky
le control parameter. v3: Various cosmetic fixes and minor refactoring of fifo update function. Signed-off-by: Andrey Grodzovsky Tested-by: Li Yunxiang (Teddy) --- drivers/gpu/drm/scheduler/sched_entity.c | 26 - drivers/gpu/drm/scheduler/sched_main.c | 132

Re: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

2022-09-13 Thread Andrey Grodzovsky
I guess, but this is kind of an implicit assumption that is not really documented and is easily overlooked. Anyway - for this code it's not directly relevant. Andrey On 2022-09-13 03:25, Christian König wrote: Am 13.09.22 um 04:00 schrieb Andrey Grodzovsky: [SNIP] You are right for sche

Re: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

2022-09-12 Thread Andrey Grodzovsky
: Introduce gfx software ring(v3) On 2022-09-12 12:22, Christian König wrote: Am 12.09.22 um 17:34 schrieb Andrey Grodzovsky: On 2022-09-12 09:27, Christian König wrote: Am 12.09.22 um 15:22 schrieb Andrey Grodzovsky: On 2022-09-12 06:20, Christian König wrote: Am 09.09.22 um 18:45 schrieb

Re: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

2022-09-12 Thread Andrey Grodzovsky
On 2022-09-12 12:22, Christian König wrote: Am 12.09.22 um 17:34 schrieb Andrey Grodzovsky: On 2022-09-12 09:27, Christian König wrote: Am 12.09.22 um 15:22 schrieb Andrey Grodzovsky: On 2022-09-12 06:20, Christian König wrote: Am 09.09.22 um 18:45 schrieb Andrey Grodzovsky: On 2022-09

Re: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

2022-09-12 Thread Andrey Grodzovsky
On 2022-09-12 09:27, Christian König wrote: Am 12.09.22 um 15:22 schrieb Andrey Grodzovsky: On 2022-09-12 06:20, Christian König wrote: Am 09.09.22 um 18:45 schrieb Andrey Grodzovsky: On 2022-09-08 21:50, jiadong@amd.com wrote: From: "Jiadong.Zhu" The software ring is

Re: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

2022-09-12 Thread Andrey Grodzovsky
On 2022-09-12 06:20, Christian König wrote: Am 09.09.22 um 18:45 schrieb Andrey Grodzovsky: On 2022-09-08 21:50, jiadong@amd.com wrote: From: "Jiadong.Zhu" The software ring is created to support priority context while there is only one hardware queue for gfx. Every soft

Re: [PATCH 4/4] drm/amdgpu: Implement OS triggered MCBP(v2)

2022-09-09 Thread Andrey Grodzovsky
On 2022-09-08 21:50, jiadong@amd.com wrote: From: "Jiadong.Zhu" Trigger MCBP according to the priority of the software rings and the hw fence signaling condition. The muxer records some latest locations from the software ring which is used to resubmit packages in preemption scenarios. v

Re: [PATCH 3/4] drm/amdgpu: Modify unmap_queue format for gfx9(v2)

2022-09-09 Thread Andrey Grodzovsky
Really can't say too much here as I am not really familiar with queue map/unmap... Andrey On 2022-09-08 21:50, jiadong@amd.com wrote: From: "Jiadong.Zhu" 1. Modify the unmap_queue package on gfx9. Add trailing fence to track the preemption done. 2. Modify emit_ce_meta emit_de_meta fu

Re: [PATCH 2/4] drm/amdgpu: Add software ring callbacks for gfx9(v3)

2022-09-09 Thread Andrey Grodzovsky
Acked-by: Andrey Grodzovsky Andrey On 2022-09-08 21:50, jiadong@amd.com wrote: From: "Jiadong.Zhu" Set ring functions with software ring callbacks on gfx9. The software ring could be tested by debugfs_test_ib case. v2: set sw_ring 2 to enable software ring by default. v3:

Re: [PATCH 1/4] drm/amdgpu: Introduce gfx software ring(v3)

2022-09-09 Thread Andrey Grodzovsky
On 2022-09-08 21:50, jiadong@amd.com wrote: From: "Jiadong.Zhu" The software ring is created to support priority context while there is only one hardware queue for gfx. Every software ring has its fence driver and could be used as an ordinary ring for the gpu_scheduler. Multiple softwar

Re: [PATCH 1/4] drm/sched: returns struct drm_gpu_scheduler ** for drm_sched_pick_best

2022-09-08 Thread Andrey Grodzovsky
Please send everything together because otherwise it's not clear why we need this. Andrey On 2022-09-08 11:09, James Zhu wrote: Yes, it is for NPI design. I will send out patches for review soon. Thanks! James On 2022-09-08 11:05 a.m., Andrey Grodzovsky wrote: So this is the real ne

Re: [PATCH 1/4] drm/sched: returns struct drm_gpu_scheduler ** for drm_sched_pick_best

2022-09-08 Thread Andrey Grodzovsky
ched_list to track ring which is used in this ctx in amdgpu_ctx_fini_entity Best Regards! James On 2022-09-08 10:38 a.m., Andrey Grodzovsky wrote: I guess it's an option but I don't really see what the added value is? You saved a few lines in this patch but added a few lines

Re: [PATCH 1/4] drm/sched: returns struct drm_gpu_scheduler ** for drm_sched_pick_best

2022-09-08 Thread Andrey Grodzovsky
re derived from patch [3/4]: entity->sched_list = num_sched_list > 1 ? sched_list : NULL; I think no special reason to treat single and multiple schedule list here. Best Regards! James On 2022-09-08 10:08 a.m., Andrey Grodzovsky wrote: What's the reason for this entire patch set ?

Re: [PATCH 1/4] drm/sched: returns struct drm_gpu_scheduler ** for drm_sched_pick_best

2022-09-08 Thread Andrey Grodzovsky
What's the reason for this entire patch set ? Andrey On 2022-09-07 16:57, James Zhu wrote: drm_sched_pick_best returns struct drm_gpu_scheduler ** instead of struct drm_gpu_scheduler * Signed-off-by: James Zhu --- include/drm/gpu_scheduler.h | 2 +- 1 file changed, 1 insertion(+), 1 deleti

Re: [PATCH v2] drm/sced: Add FIFO sched policy to rq

2022-09-07 Thread Andrey Grodzovsky
Luben, just a ping, whenever you have time. Andrey On 2022-09-05 01:57, Christian König wrote: Am 03.09.22 um 04:48 schrieb Andrey Grodzovsky: Problem: Given many entities competing for the same rq on the same scheduler, unacceptably long wait times for some jobs waiting stuck in the rq before being

[PATCH v2] drm/sced: Add FIFO sched policy to rq

2022-09-02 Thread Andrey Grodzovsky
e structure for entities based on TS of oldest job waiting in job queue of entity. Improves next entity extraction to O(1). Entity TS update O(log(number of entities in rq)). Drop default option in module control parameter. Signed-off-by: Andrey Grodzovsky Tested-by: Li Yunxiang (Teddy) ---
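
The complexity claim above (O(1) to find the next entity, O(log n) to update an entity's timestamp) can be illustrated with a small sketch. The snippet is truncated, so the exact structure the patch uses is not visible here; the sketch below swaps in a plain binary min-heap keyed on the timestamp of each entity's oldest waiting job, which has the same bounds. All names (toy_heap, toy_heap_push, toy_heap_peek, toy_heap_update_min) are hypothetical and are not taken from the scheduler code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ENTITIES 64

/* Hypothetical stand-in for an entity: only the timestamp (TS) of its
 * oldest waiting job and an id are modeled. */
struct toy_node {
	uint64_t ts;
	int entity_id;
};

/* Binary min-heap ordered by ts. This is an assumption made for the
 * sketch, not the structure used by the patch. */
struct toy_heap {
	struct toy_node nodes[TOY_MAX_ENTITIES];
	size_t len;
};

static void toy_swap(struct toy_node *a, struct toy_node *b)
{
	struct toy_node tmp = *a;
	*a = *b;
	*b = tmp;
}

/* Insert an entity: O(log n) sift-up. Returns -1 if the heap is full. */
static int toy_heap_push(struct toy_heap *h, uint64_t ts, int entity_id)
{
	size_t i;

	if (h->len == TOY_MAX_ENTITIES)
		return -1;
	i = h->len++;
	h->nodes[i].ts = ts;
	h->nodes[i].entity_id = entity_id;
	while (i > 0) {
		size_t parent = (i - 1) / 2;

		if (h->nodes[parent].ts <= h->nodes[i].ts)
			break;
		toy_swap(&h->nodes[parent], &h->nodes[i]);
		i = parent;
	}
	return 0;
}

/* Next entity to run: O(1), it is always at the root. -1 if empty. */
static int toy_heap_peek(const struct toy_heap *h)
{
	return h->len ? h->nodes[0].entity_id : -1;
}

/* After the root entity's oldest job is taken, its TS becomes that of
 * its next job: O(log n) sift-down restores the ordering. */
static void toy_heap_update_min(struct toy_heap *h, uint64_t new_ts)
{
	size_t i = 0;

	if (!h->len)
		return;
	h->nodes[0].ts = new_ts;
	for (;;) {
		size_t left = 2 * i + 1, right = left + 1, min = i;

		if (left < h->len && h->nodes[left].ts < h->nodes[min].ts)
			min = left;
		if (right < h->len && h->nodes[right].ts < h->nodes[min].ts)
			min = right;
		if (min == i)
			break;
		toy_swap(&h->nodes[i], &h->nodes[min]);
		i = min;
	}
}
```

A caller would peek to pick the entity, run its oldest job, then call toy_heap_update_min with the timestamp of that entity's next job.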

Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-25 Thread Andrey Grodzovsky
On 2022-08-24 22:29, Luben Tuikov wrote: Inlined: On 2022-08-24 12:21, Andrey Grodzovsky wrote: On 2022-08-23 17:37, Luben Tuikov wrote: On 2022-08-23 14:57, Andrey Grodzovsky wrote: On 2022-08-23 14:30, Luben Tuikov wrote: On 2022-08-23 14:13, Andrey Grodzovsky wrote: On 2022-08-23 12

Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-25 Thread Andrey Grodzovsky
On 2022-08-24 22:29, Luben Tuikov wrote: Inlined: On 2022-08-24 12:21, Andrey Grodzovsky wrote: On 2022-08-23 17:37, Luben Tuikov wrote: On 2022-08-23 14:57, Andrey Grodzovsky wrote: On 2022-08-23 14:30, Luben Tuikov wrote: On 2022-08-23 14:13, Andrey Grodzovsky wrote: On 2022-08-23 12

Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-25 Thread Andrey Grodzovsky
On 2022-08-23 17:37, Luben Tuikov wrote: On 2022-08-23 14:57, Andrey Grodzovsky wrote: On 2022-08-23 14:30, Luben Tuikov wrote: On 2022-08-23 14:13, Andrey Grodzovsky wrote: On 2022-08-23 12:58, Luben Tuikov wrote: Inlined: On 2022-08-22 16:09, Andrey Grodzovsky wrote: Problem: Given

Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

2022-08-24 Thread Andrey Grodzovsky
On 2022-08-17 10:01, Andrey Grodzovsky wrote: On 2022-08-17 09:44, Alex Deucher wrote: On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas wrote: [AMD Official Use Only - General] Hi Alex:    When removing an amdgpu device, it may be difficult to change the order of psp_hw_fini calls. 1. The

Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-24 Thread Andrey Grodzovsky
On 2022-08-24 04:29, Michel Dänzer wrote: On 2022-08-22 22:09, Andrey Grodzovsky wrote: Problem: Given many entities competing for the same rq on the same scheduler, unacceptably long wait times for some jobs stuck in the rq before being picked up are observed (seen using GPUVis). The issue is

Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-23 Thread Andrey Grodzovsky
On 2022-08-23 14:30, Luben Tuikov wrote: On 2022-08-23 14:13, Andrey Grodzovsky wrote: On 2022-08-23 12:58, Luben Tuikov wrote: Inlined: On 2022-08-22 16:09, Andrey Grodzovsky wrote: Poblem: Given many entities competing for same rq on ^Problem same scheduler an uncceptabliy long wait

Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-23 Thread Andrey Grodzovsky
On 2022-08-23 12:58, Luben Tuikov wrote: Inlined: On 2022-08-22 16:09, Andrey Grodzovsky wrote: Poblem: Given many entities competing for same rq on ^Problem same scheduler an uncceptabliy long wait time for some ^unacceptably jobs waiting stuck in rq before being picked up are

Re: [PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-23 Thread Andrey Grodzovsky
On 2022-08-23 08:15, Christian König wrote: Am 22.08.22 um 22:09 schrieb Andrey Grodzovsky: Problem: Given many entities competing for the same rq on the same scheduler, unacceptably long wait times for some jobs stuck in the rq before being picked up are observed (seen using GPUVis). The issue

[PATCH] drm/sced: Add FIFO policy for scheduler rq

2022-08-22 Thread Andrey Grodzovsky
job in the long queue. Fix: Add FIFO selection policy to entities in RQ, choose next entity on rq in such order that if a job on one entity arrived earlier than a job on another entity, the first job will start executing earlier regardless of the length of the entity's job queue. Signed-off-by: An
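
The selection rule in the Fix paragraph above can be sketched as a minimal, self-contained example. It is not the patch's code: toy_entity and toy_select_entity_fifo are hypothetical names, and a linear scan is assumed purely for clarity. The point it shows is that the entity whose head job arrived first wins, however many jobs the other entities have queued.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entity: only the arrival time of its oldest
 * queued job and the queue length are modeled. */
struct toy_entity {
	uint64_t oldest_job_ts;	/* arrival time of the head job */
	size_t queued_jobs;	/* 0 means the entity has nothing to run */
};

/* Return the index of the entity whose head job arrived earliest, or -1
 * if no entity has a job. Queue length plays no part in the choice. */
static int toy_select_entity_fifo(const struct toy_entity *e, size_t n)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (e[i].queued_jobs == 0)
			continue;
		if (best < 0 || e[i].oldest_job_ts < e[best].oldest_job_ts)
			best = (int)i;
	}
	return best;
}
```

With entities {ts 50, 10 jobs} and {ts 20, 1 job}, the second one is selected even though its queue is much shorter.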

Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

2022-08-17 Thread Andrey Grodzovsky
amdgpu_pci_remove function, which makes the gpu device inaccessible for userspace operations. If the call to psp_hw_fini was moved before drm_dev_unplug, userspace could access the gpu device but the psp might be removing. It has unknown issues. +Andrey Grodzovsky We should fix the ordering

Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

2022-08-12 Thread Andrey Grodzovsky
On 2022-08-12 14:38, Kim, Jonathan wrote: [Public] Hi Andrey, Here's the load/unload stack trace. This is a 2 GPU xGMI system. I put dbg_xgmi_hive_get/put refcount print post kobj get/put. It's stuck at 2 on unload. If it's an 8 GPU system, it's stuck at 8. e.g. of sysfs leak after drive

Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

2022-08-11 Thread Andrey Grodzovsky
On 2022-08-11 11:34, Kim, Jonathan wrote: [Public] -Original Message- From: Kuehling, Felix Sent: August 11, 2022 11:19 AM To: amd-gfx@lists.freedesktop.org; Kim, Jonathan Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak Am 2022-08-11 um 09:42 schrieb

Re: [PATCH v3 1/6] drm/amdgpu: add mode2 reset for sienna_cichlid

2022-08-03 Thread Andrey Grodzovsky
Series is Acked-by: Andrey Grodzovsky Andrey On 2022-08-01 00:07, Victor Zhao wrote: To meet the requirement for multi container usecase which needs a quicker reset and not causing VRAM lost, adding the Mode2 reset handler for sienna_cichlid. v2: move skip mode2 flag part separately v3

Re: [PATCH v2 1/6] drm/amdgpu: add mode2 reset for sienna_cichlid

2022-07-28 Thread Andrey Grodzovsky
On 2022-07-28 06:30, Victor Zhao wrote: To meet the requirement for multi container usecase which needs a quicker reset and not causing VRAM lost, adding the Mode2 reset handler for sienna_cichlid. v2: move skip mode2 flag part separately Signed-off-by: Victor Zhao --- drivers/gpu/drm/amd/

Re: [PATCH v2 6/6] drm/amdgpu: reduce reset time

2022-07-28 Thread Andrey Grodzovsky
On 2022-07-28 06:30, Victor Zhao wrote: In multi container use case, reset time is important, so skip ring tests and cp halt wait during ip suspending for reset as they are going to fail and cost more time on reset v2: add a hang flag to indicate the reset comes from a job timeout, skip ring t

Re: [PATCH 5/5] drm/amdgpu: reduce reset time

2022-07-27 Thread Andrey Grodzovsky
On 2022-07-27 06:35, Zhao, Victor wrote: [AMD Official Use Only - General] Hi Andrey, Problem with status.hang is that it is set at amdgpu_device_ip_check_soft_reset, which is not implemented in nv or gfx10. They have to be nicely implemented first. Another option I thought is to mark status.

Re: Crash on resume from S3

2022-07-26 Thread Andrey Grodzovsky
The stack trace is an expected part of the reset procedure, so that's OK. The issue you are having is a hang in one of the GPU jobs during resume, which triggers a GPU reset attempt. You can open a ticket for this issue at https://gitlab.freedesktop.org/drm/amd/-/issues; please attach the full dmesg log.

Re: [PATCH 4/5] drm/amdgpu: revert context to stop engine before mode2 reset

2022-07-26 Thread Andrey Grodzovsky
Got it Acked-by: Andrey Grodzovsky Andrey On 2022-07-26 06:01, Zhao, Victor wrote: [AMD Official Use Only - General] Hi Andrey, For slow tests I mean the slow hang tests by quark tool. An example here: hang_vm_gfx_dispatch_slow.lua - This script runs on a graphics engine using compute

Re: [PATCH 5/5] drm/amdgpu: reduce reset time

2022-07-26 Thread Andrey Grodzovsky
On 2022-07-26 05:40, Zhao, Victor wrote: [AMD Official Use Only - General] Hi Andrey, Reply inline. Thanks, Victor -Original Message- From: Grodzovsky, Andrey Sent: Tuesday, July 26, 2022 5:18 AM To: Zhao, Victor ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Deng, Emi

Re: [PATCH 4/5] drm/amdgpu: revert context to stop engine before mode2 reset

2022-07-25 Thread Andrey Grodzovsky
On 2022-07-22 03:34, Victor Zhao wrote: For some hang caused by slow tests, engine cannot be stopped which may cause resume failure after reset. In this case, force halt engine by reverting context addresses Can you maybe explain a bit more what exactly you mean by slow test and why engine ca

Re: [PATCH 3/5] drm/amdgpu: save and restore gc hub regs

2022-07-25 Thread Andrey Grodzovsky
Acked-by: Andrey Grodzovsky Andrey On 2022-07-22 03:34, Victor Zhao wrote: Save and restore gfxhub regs as they will be reset during mode 2 Signed-off-by: Victor Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu_gfxhub.h| 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 26

Re: [PATCH 5/5] drm/amdgpu: reduce reset time

2022-07-25 Thread Andrey Grodzovsky
On 2022-07-22 03:34, Victor Zhao wrote: In multi container use case, reset time is important, so skip ring tests and cp halt wait during ip suspending for reset as they are going to fail and cost more time on reset Why are they failing in this case ? Skipping ring tests is not the best idea

Re: [PATCH 2/5] drm/amdgpu: add debugfs amdgpu_reset_level

2022-07-25 Thread Andrey Grodzovsky
On 2022-07-25 13:37, Christian König wrote: Hi Victor, Am 25.07.22 um 12:45 schrieb Zhao, Victor: [AMD Official Use Only - General] Hi @Grodzovsky, Andrey, Please help review the series, thanks a lot. Hi @Koenig, Christian, I thought a module parameter will be exposed to a common user, th

Re: [PATCH 1/5] drm/amdgpu: add mode2 reset for sienna_cichlid

2022-07-25 Thread Andrey Grodzovsky
On 2022-07-22 03:33, Victor Zhao wrote: To meet the requirement for multi container usecase which needs a quicker reset and not causing VRAM lost, adding the Mode2 reset handler for sienna_cichlid. Adding a AMDGPU_SKIP_MODE2_RESET flag so driver can fallback to default reset method when mode2 r

Re: [PATCH] drm/amdgpu: remove useless condition in amdgpu_job_stop_all_jobs_on_sched()

2022-07-25 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-07-19 06:39, Andrey Strachuk wrote: Local variable 'rq' is initialized by an address of field of drm_sched_job, so it does not make sense to compare 'rq' with NULL. Found by Linux Verification Center (linuxtesting.org) with S

Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

2022-07-16 Thread Andrey Grodzovsky
amdgpu_job v4: add tdr sequence support for this feature. Add a job_run_counter to indicate whether this job is a resubmit job. v5 add missing handling in amdgpu_fence_enable_signaling Signed-off-by: Jingwen Chen Signed-off-by: Jack Zhang Reviewed-by: Andrey Gr

Re: [PATCH 10/10] drm/amdgpu: add gang submit frontend v2

2022-07-14 Thread Andrey Grodzovsky
Acked-by: Andrey Grodzovsky Andrey On 2022-07-14 06:39, Christian König wrote: Allows submitting jobs as gang which needs to run on multiple engines at the same time. All members of the gang get the same implicit, explicit and VM dependencies. So no gang member will start running until

Re: [PATCH 07/10] drm/amdgpu: move setting the job resources

2022-07-14 Thread Andrey Grodzovsky
Reviewed-by: Andrey Grodzovsky Andrey On 2022-07-14 06:38, Christian König wrote: Move setting the job resources into amdgpu_job.c Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 21 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 17

Re: [PATCH 01/10] drm/sched: move calling drm_sched_entity_select_rq

2022-07-14 Thread Andrey Grodzovsky
Found the new use case from the 5/10 of reordering CS ioctl. Reviewed-by: Andrey Grodzovsky Andrey On 2022-07-14 12:26, Christian König wrote: We need this for limiting codecs like AV1 to the first instance for VCN3. Essentially the idea is that we first initialize the job with entity, id

Re: [PATCH 01/10] drm/sched: move calling drm_sched_entity_select_rq

2022-07-14 Thread Andrey Grodzovsky
ff-by: Christian König CC: Andrey Grodzovsky CC: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/scheduler/sched_main.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 68317d3

Re: [PATCH 09/10] drm/amdgpu: add gang submit backend

2022-07-14 Thread Andrey Grodzovsky
On 2022-07-14 06:39, Christian König wrote: Allows submitting jobs as gang which needs to run on multiple engines at the same time. Basic idea is that we have a global gang submit fence representing when the gang leader is finally pushed to run on the hardware last. Jobs submitted as gang are

Re: [PATCH] drm/amdgpu: Get rid of amdgpu_job->external_hw_fence

2022-07-13 Thread Andrey Grodzovsky
On 2022-07-13 13:33, Christian König wrote: Am 13.07.22 um 19:13 schrieb Andrey Grodzovsky: This is a follow-up cleanup to [1]. See below refcount balancing for calling amdgpu_job_submit_direct after this cleanup as far as I calculated. amdgpu_fence_emit dma_fence_init 1

[PATCH] drm/amdgpu: Get rid of amdgpu_job->external_hw_fence

2022-07-13 Thread Andrey Grodzovsky
g_test_ib dma_fence_put(fence) 0 [1] - https://patchwork.kernel.org/project/dri-devel/cover/20220624180955.485440-1-andrey.grodzov...@amd.com/ Signed-off-by: Andrey Grodzovsky Suggested-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +-- drivers/gpu/drm/a

[PATCH v2 2/4] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

2022-06-24 Thread Andrey Grodzovsky
e EOP interrupt. Fix: Before accessing fence array in GPU disable EOP interrupt and flush all pending interrupt handlers for amdgpu device's interrupt line. v2: Switch from irq_get/put to full enable/disable_irq for amdgpu Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgp

[PATCH v2 4/4] drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-24 Thread Andrey Grodzovsky
patch we resumed setting s_fence->parent to NULL in drm_sched_stop switch to directly checking if job->hw_fence is signaled to short circuit reset if already signaled. Signed-off-by: Andrey Grodzovsky Tested-by: Yiqing Yao --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 ++ drivers/

[PATCH v2 1/4] drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences

2022-06-24 Thread Andrey Grodzovsky
This function should drop the fence refcount when it extracts the fence from the fence array, just as it's done in amdgpu_fence_process. Signed-off-by: Andrey Grodzovsky Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++- 1 file changed, 3 insertions(

[PATCH v2 3/4] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'

2022-06-24 Thread Andrey Grodzovsky
ext patch). [1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4...@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3 Signed-off-by: Andrey Grodzovsky Tested-by: Yiqing Yao --- drivers/gpu/drm/scheduler/sched_main.c | 13 ++--- 1 file changed, 10 insertions(+), 3

[PATCH v2 0/4] Rework amdgpu HW fence refocunt and update scheduler parent fence refcount.

2022-06-24 Thread Andrey Grodzovsky
file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view?usp=sharing Andrey Grodzovsky (4): drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences drm/amdgpu: Prevent race between late signaled fences and GPU reset. drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer' drm/amdgpu: Follow up

Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence

2022-06-23 Thread Andrey Grodzovsky
On 2022-06-22 11:04, Christian König wrote: Am 22.06.22 um 17:01 schrieb Andrey Grodzovsky: On 2022-06-22 05:00, Christian König wrote: Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky: On 2022-06-21 03:19, Christian König wrote: Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky: Problem: In

Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-23 Thread Andrey Grodzovsky
On 2022-06-23 01:52, Christian König wrote: Am 22.06.22 um 19:19 schrieb Andrey Grodzovsky: On 2022-06-22 03:17, Christian König wrote: Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky: On 2022-06-21 03:28, Christian König wrote: Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky: Align

Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

2022-06-22 Thread Andrey Grodzovsky
Just a ping Andrey On 2022-06-21 15:45, Andrey Grodzovsky wrote: On 2022-06-21 03:25, Christian König wrote: Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky: Problem: After we start handling timed out jobs we assume their fences won't be signaled but we cannot be sure and sometimes they

Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-22 Thread Andrey Grodzovsky
On 2022-06-22 03:17, Christian König wrote: Am 21.06.22 um 22:00 schrieb Andrey Grodzovsky: On 2022-06-21 03:28, Christian König wrote: Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky: Align refcount behaviour for amdgpu_job embedded HW fence with classic pointer style HW fences by

Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence

2022-06-22 Thread Andrey Grodzovsky
On 2022-06-22 05:00, Christian König wrote: Am 21.06.22 um 21:34 schrieb Andrey Grodzovsky: On 2022-06-21 03:19, Christian König wrote: Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky: Problem: In amdgpu_job_submit_direct - The refcount should drop by 2 but it drops only by 1

Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

2022-06-21 Thread Andrey Grodzovsky
21:47, VURDIGERENATARAJ, CHANDAN wrote: Hi, Is this a preventive fix or you found errors/oops/hangs? If you had found errors/oops/hangs, can you please share the details? BR, Chandan V N On 2022-06-21 03:25, Christian König wrote: Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky: Problem: Aft

Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-21 Thread Andrey Grodzovsky
On 2022-06-21 03:28, Christian König wrote: Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky: Align refcount behaviour for amdgpu_job embedded HW fence with classic pointer style HW fences by increasing refcount each time emit is called so amdgpu code doesn't need to make workarounds

Re: [PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

2022-06-21 Thread Andrey Grodzovsky
On 2022-06-21 03:25, Christian König wrote: Am 21.06.22 um 00:03 schrieb Andrey Grodzovsky: Problem: After we start handling timed out jobs we assume their fences won't be signaled but we cannot be sure and sometimes they fire late. We need to prevent concurrent accesses to fence array

Re: [PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence

2022-06-21 Thread Andrey Grodzovsky
On 2022-06-21 03:19, Christian König wrote: Am 21.06.22 um 00:02 schrieb Andrey Grodzovsky: Problem: In amdgpu_job_submit_direct - The refcount should drop by 2 but it drops only by 1. amdgpu_ib_sched->emit -> refcount 1 from first fence init dma_fence_get -> refcount 2 dme_

[PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.

2022-06-20 Thread Andrey Grodzovsky
patch we resumed setting s_fence->parent to NULL in drm_sched_stop switch to directly checking if job->hw_fence is signaled to short circuit reset if already signaled. Signed-off-by: Andrey Grodzovsky Tested-by: Yiqing Yao --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 ++ drivers/

[PATCH 4/5] drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'

2022-06-20 Thread Andrey Grodzovsky
ext patch). [1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4...@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3 Signed-off-by: Andrey Grodzovsky Tested-by: Yiqing Yao --- drivers/gpu/drm/scheduler/sched_main.c | 16 +--- 1 file changed, 13 insertions(+), 3

[PATCH 3/5] drm/amdgpu: Prevent race between late signaled fences and GPU reset.

2022-06-20 Thread Andrey Grodzovsky
e EOP interrupt. Fix: Before accessing fence array in GPU disable EOP interrupt and flush all pending interrupt handlers for amdgpu device's interrupt line. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 drivers/gpu/drm/amd/amdgpu/amdgpu_fen

[PATCH 2/5] drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences

2022-06-20 Thread Andrey Grodzovsky
This function should drop the fence refcount when it extracts the fence from the fence array, just as it's done in amdgpu_fence_process. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/dr

[PATCH 1/5] drm/amdgpu: Fix possible refcount leak for release of external_hw_fence

2022-06-20 Thread Andrey Grodzovsky
Problem: In amdgpu_job_submit_direct - The refcount should drop by 2 but it drops only by 1. amdgpu_ib_sched->emit -> refcount 1 from first fence init dma_fence_get -> refcount 2 dma_fence_put -> refcount 1 Fix: Add put for external_hw_fence in amdgpu_job_free/free_cb Signed-of
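
The refcount walk in the Problem/Fix description above can be replayed with a toy counter. This is a sketch under stated assumptions, not driver code: toy_fence and the toy_* helpers are hypothetical and only mimic the init/get/put counting the message lists (init gives 1, get gives 2, put gives 1), with the flag extra_put_on_free standing in for the put the fix adds on the job free path.

```c
#include <assert.h>

/* Hypothetical toy fence: a bare reference counter plus a flag that is
 * set when the last reference is dropped. */
struct toy_fence {
	int refcount;
	int released;
};

static void toy_fence_init(struct toy_fence *f)
{
	f->refcount = 1;	/* "refcount 1 from first fence init" */
	f->released = 0;
}

static void toy_fence_get(struct toy_fence *f)
{
	f->refcount++;
}

static void toy_fence_put(struct toy_fence *f)
{
	assert(f->refcount > 0);	/* catch underflow in the sketch */
	if (--f->refcount == 0)
		f->released = 1;
}

/* Replay the sequence from the message. Without the extra put the
 * counter is left at 1 (the leak); with it the fence is released. */
static int toy_submit_direct(struct toy_fence *f, int extra_put_on_free)
{
	toy_fence_init(f);	/* refcount 1 */
	toy_fence_get(f);	/* refcount 2 */
	toy_fence_put(f);	/* refcount 1 */
	if (extra_put_on_free)
		toy_fence_put(f);	/* refcount 0, released */
	return f->refcount;
}
```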

[PATCH 0/5] Rework amdgpu HW fence refocunt and update scheduler parent fence refcount.

2022-06-20 Thread Andrey Grodzovsky
/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view?usp=sharing Andrey Grodzovsky (5): drm/amdgpu: Fix possible refcount leak for release of external_hw_fence drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences drm/amdgpu: Prevent race between late signaled fences and GPU reset. drm/sched: Partial revert

Re: [PATCH] drm/amdgpu: fix refcount underflow in device reset

2022-06-06 Thread Andrey Grodzovsky
On 2022-06-06 03:43, Yiqing Yao wrote: [why] A gfx job may be processed but not finished when reset begins from a compute job timeout. drm_sched_resubmit_jobs_ext in sched_main assumes the submitted job is unsignaled and always puts the parent fence. Resubmission of that job causes an underflow. This fix is done

Re: [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs

2022-05-30 Thread Andrey Grodzovsky
+ Monk On 2022-05-30 03:52, Christian König wrote: Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky: We need to have a work_struct to cancel this reset if another already in progress. Signed-off-by: Andrey Grodzovsky ---   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |  2 ++   drivers/gpu/drm

[PATCH v3 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover

2022-05-25 Thread Andrey Grodzovsky
We removed the wrapper that was using this name, which queued the recover function into the reset domain queue. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

[PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress.

2022-05-25 Thread Andrey Grodzovsky
We skip reset requests if another one is already in progress. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++ 1 file changed, 27 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu

[PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd.

2022-05-25 Thread Andrey Grodzovsky
We need to have a work_struct to cancel this reset if another is already in progress. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31

[PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs

2022-05-25 Thread Andrey Grodzovsky
We need to have a work_struct to cancel this reset if another is already in progress. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +-- 2 files changed, 19 insertions(+), 2 deletions(-) diff

[PATCH v3 2/7] drm/amdgpu: Cache result of last reset at reset domain level.

2022-05-25 Thread Andrey Grodzovsky
Will be read by executors of async reset like debugfs. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 -- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + 3 files changed, 6 insertions(+), 2 deletions

[PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue.

2022-05-25 Thread Andrey Grodzovsky
Save the extra useless work schedule. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 31207f7eec02

[PATCH v3 1/7] Revert "workqueue: remove unused cancel_work()"

2022-05-25 Thread Andrey Grodzovsky
This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f. amdgpu needs this function in order to prematurely stop pending reset work when another reset is already in progress. Signed-off-by: Andrey Grodzovsky Reviewed-by: Lai Jiangshan Reviewed-by: Christian König --- include/linux

[PATCH v3 0/7] Fix multiple GPU resets in XGMI hive.

2022-05-25 Thread Andrey Grodzovsky
was in v1[1] to explicit stopping of each reset request from each reset source per each request submitter. v3: Switch back to work_struct from delayed_work (Christian) [1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzov...@amd.com/ Andrey Grodzovsky (7): Revert "work

Re: [PATCH] Revert "workqueue: remove unused cancel_work()"

2022-05-20 Thread Andrey Grodzovsky
On 2022-05-20 03:52, Tejun Heo wrote: On Fri, May 20, 2022 at 08:22:39AM +0200, Christian König wrote: Am 20.05.22 um 02:47 schrieb Lai Jiangshan: On Thu, May 19, 2022 at 11:04 PM Andrey Grodzovsky wrote: See this patch-set https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F

Re: [PATCH] Revert "workqueue: remove unused cancel_work()"

2022-05-19 Thread Andrey Grodzovsky
this only for delayed_work and not for work_struct. Andrey On 2022-05-19 10:52, Lai Jiangshan wrote: On Thu, May 19, 2022 at 9:57 PM Andrey Grodzovsky wrote: This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f and exports the function. We need this function in amdgpu driver to fix

[PATCH] Revert "workqueue: remove unused cancel_work()"

2022-05-19 Thread Andrey Grodzovsky
This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f and exports the function. We need this function in amdgpu driver to fix a bug. Signed-off-by: Andrey Grodzovsky --- include/linux/workqueue.h | 1 + kernel/workqueue.c| 9 + 2 files changed, 10 insertions(+) diff

Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

2022-05-19 Thread Andrey Grodzovsky
On 2022-05-19 03:58, Christian König wrote: Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky: On 2022-05-18 02:07, Christian König wrote: Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky: Problem: During hive reset caused by command timing out on a ring extra resets are generated by

Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.

2022-05-18 Thread Andrey Grodzovsky
On 2022-05-18 02:07, Christian König wrote: Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky: Problem: During hive reset caused by command timing out on a ring, extra resets are triggered by KFD, which is unable to access registers on the resetting ASIC. Fix: Rework GPU reset to

[PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover

2022-05-17 Thread Andrey Grodzovsky
We removed the wrapper that was using this name, which queued the recover function into the reset domain queue. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

[PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress.

2022-05-17 Thread Andrey Grodzovsky
We skip reset requests if another one is already in progress. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++ 1 file changed, 27 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu

[PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd.

2022-05-17 Thread Andrey Grodzovsky
We need to have a delayed work to cancel this reset if another is already in progress. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31

[PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs

2022-05-17 Thread Andrey Grodzovsky
We need to have a delayed work to cancel this reset if another is already in progress. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +-- 2 files changed, 19 insertions(+), 2 deletions(-) diff
