Re: [PATCH] drm/ttm: update bulk move object of ghost BO

2022-09-01 Thread JingWen Chen
Acked-by: Jingwen Chen; still needs confirmation from Christian. On 9/1/22 5:29 PM, ZhenGuo Yin wrote: > [Why] > Ghost BO is released with non-empty bulk move object. There is a > warning trace: > WARNING: CPU: 19 PID: 1582 at ttm/ttm_bo.c:366 ttm_bo_release+0x2e1/0x2f0 > [amdtt

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-24 Thread JingWen Chen
istian >> ; dan...@ffwll.ch >> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI >> is ready >> No because all the patch-set including this patch was landed into >> drm-misc-next and will reach amd-staging-drm-next on the next upstream >> r

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-23 Thread JingWen Chen
Hi Andrey, Will you port this patch into amd-staging-drm-next? On 2/10/22 2:06 AM, Andrey Grodzovsky wrote: > All comments are fixed and code pushed. Thanks to everyone > who helped with reviewing. > > Andrey > > On 2022-02-09 02:53, Christian König wrote: >> On 09.02.22 at 01:23, Andrey

Re: [RFC v3 00/12] Define and use reset domain for GPU recovery in amdgpu

2022-02-08 Thread JingWen Chen
Hi Andrey, I have been testing your patch and it seems fine so far. Best Regards, Jingwen Chen On 2022/2/3 2:57 AM, Andrey Grodzovsky wrote: > Just another ping, with Shyun's help I was able to do some smoke testing on > XGMI SRIOV system (booting and triggering hive reset) > an

Re: [RFC v2 4/8] drm/amdgpu: Serialize non TDR gpu recovery with TDRs

2022-02-06 Thread JingWen Chen
Hi Andrey, I don't have any XGMI machines here; maybe you can reach out to Shaoyun for help. On 2022/1/29 12:57 AM, Grodzovsky, Andrey wrote: > Just a gentle ping. > > Andrey >

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-11 Thread JingWen Chen
Hi Andrey, Please go ahead and push your change. I will prepare the RFC later. On 2022/1/8 12:02 AM, Andrey Grodzovsky wrote: > > On 2022-01-07 12:46 a.m., JingWen Chen wrote: >> On 2022/1/7 11:57 AM, JingWen Chen wrote: >>> On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote: >>

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-06 Thread JingWen Chen
On 2022/1/7 11:57 AM, JingWen Chen wrote: > On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote: >> On 2022-01-06 12:18 a.m., JingWen Chen wrote: >>> On 2022/1/6 12:59 PM, JingWen Chen wrote: >>>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote: >>>>> On 2022-01-0

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-06 Thread JingWen Chen
On 2022/1/7 3:13 AM, Andrey Grodzovsky wrote: > > On 2022-01-06 12:18 a.m., JingWen Chen wrote: >> On 2022/1/6 12:59 PM, JingWen Chen wrote: >>> On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote: >>>> On 2022-01-05 2:59 a.m., Christian König wrote: >>>>

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-05 Thread JingWen Chen
On 2022/1/6 12:59 PM, JingWen Chen wrote: > On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote: >> On 2022-01-05 2:59 a.m., Christian König wrote: >>> On 05.01.22 at 08:34, JingWen Chen wrote: >>>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote: >>>>> O

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-05 Thread JingWen Chen
On 2022/1/6 2:24 AM, Andrey Grodzovsky wrote: > > On 2022-01-05 2:59 a.m., Christian König wrote: >> On 05.01.22 at 08:34, JingWen Chen wrote: >>> On 2022/1/5 12:56 AM, Andrey Grodzovsky wrote: >>>> On 2022-01-04 6:36 a.m., Christian König wrote: >>>

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread JingWen Chen
implementation in amdgpu to >>> actually match the requirements. >>> >>> Could be that the reset sequence is questionable in general, but I doubt so >>> at least for now. >>> >>> See the FLR request from the hypervisor is just another source of s

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread JingWen Chen
>> See the FLR request from the hypervisor is just another source of signaling >> the need for a reset, similar to each job timeout on each queue. Otherwise >> you have a race condition between the hypervisor and the scheduler. >> >> Properly setting in_gpu_reset

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2022-01-04 Thread JingWen Chen
e_unlock_adev in flr_work instead of try_lock, since no one will conflict with this thread once reset_domain is introduced. But we do need the reset_sem and adev->in_gpu_reset to keep the device untouched by user space. Best Regards, Jingwen Chen On 2022/1/3 6:17 PM, Christian König wrote

Re: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

2021-12-24 Thread JingWen Chen
I do agree with Shaoyun: if the host finds that the GPU engine hangs first and does the FLR, the guest-side thread may not know this and may still try to access HW (e.g. kfd uses a lot of amdgpu_in_reset and reset_sem to identify the reset status). This may lead to very bad results. On 2021/12/24

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-09-06 Thread Jingwen Chen
deleted from pending list. While if we use the ordered workqueue for timedout in the driver, there will be no bailing job. Do you have any suggestions? Best Regards, JingWen Chen On Mon Sep 06, 2021 at 02:36:52PM +0800, Liu, Monk wrote: > [AMD Official Use Only] > > > I'm fearing that ju

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Jingwen Chen
On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote: > > On 2021-09-01 12:25 a.m., Jingwen Chen wrote: > > On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: > > > I will answer everything here - > > > > > > O

Re: [diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

2021-08-31 Thread Jingwen Chen
On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote: > I will answer everything here - > > On 2021-08-31 9:58 p.m., Liu, Monk wrote: > > > [AMD Official Use Only] > > > > In the previous discussion, you guys stated that we should drop the > “kthread_should_park”

Re: [PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-20 Thread Jingwen Chen
-Original Message- > > From: Daniel Vetter > > Sent: Thursday, August 19, 2021 5:31 PM > > To: Grodzovsky, Andrey > > Cc: Daniel Vetter ; Alex Deucher ; > > Chen, JingWen ; Maling list - DRI developers > > ; amd-gfx list > > ; Liu, Monk ; Koenig, >

[PATCH v3] Revert "drm/scheduler: Avoid accessing freed bad job."

2021-08-20 Thread Jingwen Chen
revert this commit. This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90. v2: add dma_fence_get/put() around timedout_job to avoid concurrent delete during processing timedout_job v3: park sched->thread instead during timedout_job. Signed-off-by: Jingwen Chen --- drivers/gpu/drm/schedu