[PATCH] drm/amdgpu: value of amdgpu_sriov_vf cannot be set into F32_POLL_ENABLE

2019-04-24 Thread wentalou
amdgpu_sriov_vf returns 0x0 or 0x4 to indicate whether the device runs under SR-IOV, but F32_POLL_ENABLE needs 0x0 or 0x1 to determine whether polling is enabled. Setting 0x4 into F32_POLL_ENABLE would leave SDMA0_GFX_RB_WPTR_POLL_CNTL not working. Change-Id: I7d13ed35469ebd7bdf10c90341181977c6cfd38d Signed-off-by: Wentao Lou ---
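A minimal userspace sketch of the fix described above (the macro value and helper name are illustrative, not the driver's actual code): normalize the capability bit with `!!` before writing it into a one-bit register field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bit, for illustration: amdgpu_sriov_vf()
 * evaluates a flag, so it yields 0x0 or 0x4, while a one-bit field
 * such as F32_POLL_ENABLE only accepts 0 or 1.  Double negation
 * collapses any non-zero value to exactly 1. */
#define SRIOV_CAPS_IS_VF 0x4

static uint32_t poll_enable_bit(uint32_t sriov_caps)
{
    return !!(sriov_caps & SRIOV_CAPS_IS_VF); /* 0x4 -> 1, 0x0 -> 0 */
}
```

The same `!!` idiom is common throughout kernel code whenever a flag word feeds a single-bit field.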

[PATCH] drm/amdgpu: amdgpu_device_recover_vram got NULL of shadow->parent

2019-04-16 Thread wentalou
amdgpu_bo_destroy had a bug: it called amdgpu_bo_unref outside the mutex_lock. If amdgpu_device_recover_vram executed between amdgpu_bo_unref and list_del_init, it would read shadow->parent as NULL, causing a Call Trace and a failed GPU reset. Change-Id: I41d7b54605e613e87ee03c3ad89c191063c19230
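A single-threaded model of the ordering this fix enforces (structures and names hypothetical, not the driver's): the shadow entry must be unlinked from the list while the lock is held, before the parent reference is dropped, so a concurrent list walker can never see an entry whose parent is already gone.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the amdgpu structures. */
struct shadow_entry {
    struct shadow_entry *next, *prev;
    void *parent;               /* stands in for the parent BO */
};

static int shadow_list_locked;  /* stands in for the shadow_list mutex */

static void destroy_shadow(struct shadow_entry *e)
{
    shadow_list_locked = 1;     /* mutex_lock(&shadow_list_lock) */
    e->prev->next = e->next;    /* list_del_init: unlink FIRST ... */
    e->next->prev = e->prev;
    e->next = e->prev = e;
    shadow_list_locked = 0;     /* mutex_unlock(&shadow_list_lock) */
    e->parent = NULL;           /* ... THEN amdgpu_bo_unref(&parent) */
}
```

With the unref moved after the locked unlink, a recover path walking the list under the same lock either sees the entry with a valid parent or does not see it at all.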

[PATCH] drm/amdgpu: shadow in shadow_list without tbo.mem.start cause page fault in sriov TDR

2019-04-12 Thread wentalou
shadow was added into shadow_list by amdgpu_bo_create_shadow while shadow->tbo.mem was not yet fully configured; tbo.mem is only fully configured by amdgpu_vm_sdma_map_table once amdgpu_vm_clear_bo is called. If an sriov TDR occurred between amdgpu_bo_create_shadow and amdgpu_vm_sdma_map_table,

[PATCH] amdgpu_device_recover_vram always failed if only one node in shadow_list

2019-04-03 Thread wentalou
amdgpu_bo_restore_shadow assigns zero to r on success, so r remains zero when there is only one node in shadow_list; the current code then always returns failure because it treats r <= 0 as an error. Restarting the timeout for each wait was a rather problematic bug as well: the value of tmo SHOULD be changed, otherwise we
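A sketch of the corrected wait loop (function names hypothetical, modeled on `dma_fence_wait_timeout` semantics: it returns the remaining time, 0 on timeout, negative on error). It shows both fixes from the summary: a final value of 0 from a successful restore is success rather than failure, and the remaining budget `tmo` is carried across iterations instead of being restarted for every fence.

```c
#include <assert.h>
#include <errno.h>

static long wait_all_shadows(long tmo, int nfences,
                             long (*wait_one)(int idx, long tmo))
{
    for (int i = 0; i < nfences; i++) {
        tmo = wait_one(i, tmo);  /* remaining budget after this fence */
        if (tmo == 0)
            return -ETIME;       /* overall timeout exhausted */
        if (tmo < 0)
            return tmo;          /* real error: propagate */
    }
    return 0;                    /* success, even with a single node */
}

/* Stub fence wait: each wait consumes 10 units of the shared budget. */
static long consume_ten(int idx, long tmo)
{
    (void)idx;
    return tmo > 10 ? tmo - 10 : 0;
}
```

Because `tmo` flows through every iteration, the loop bounds the total wall-clock time of the whole recovery rather than granting each fence a fresh timeout.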

[PATCH] amdgpu_device_recover_vram always failed if only one node in shadow_list

2019-04-02 Thread wentalou
amdgpu_bo_restore_shadow assigns zero to r on success, so r remains zero when there is only one node in shadow_list; the current code then always returns failure because it treats r <= 0 as an error. Restarting the timeout for each wait was a rather problematic bug as well: the value of tmo SHOULD be changed, otherwise we

[PATCH] drm/amdkfd/sriov:Put the pre and post reset in exclusive mode v2

2019-03-13 Thread wentalou
add amdgpu_amdkfd_pre_reset and amdgpu_amdkfd_post_reset inside amdgpu_device_reset_sriov. Change-Id: Icf2839f0b620ce9d47d6414b6c32b9d06672f2ac Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ 1 file changed, 3 insertions(+) diff --git

[PATCH] drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in sriov

2019-01-29 Thread wentalou
sriov's gpu_recover inside xgpu_ai_mailbox_flr_work would cause a duplicate recover during TDR; TDR's gpu_recover is already triggered by amdgpu_job_timedout. Tightening the mailbox path avoids vk-cts failures caused by the unexpected recover. Change-Id: I840dfc145e4e1be9ece6eac8d9f3501da9b28ebf Signed-off-by: wentalou --- drivers

[PATCH] drm/amdgpu: sriov restrict max_pfn below AMDGPU_GMC_HOLE

2019-01-23 Thread wentalou
sriov needs to restrict max_pfn below AMDGPU_GMC_HOLE; accessing the hole results in a range fault interrupt IIRC. Change-Id: I0add197a24a54388a128a545056e9a9f0330abfb Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 +-- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 6 +-
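The clamp can be illustrated as below. The two constants match the amdgpu headers of that era (`AMDGPU_GMC_HOLE_START` is the start of the unmappable GMC address hole; GPU pages are 4 KiB), but treat the helper itself as a sketch, not the driver's function.

```c
#include <assert.h>
#include <stdint.h>

#define AMDGPU_GPU_PAGE_SHIFT 12
#define AMDGPU_GMC_HOLE_START 0x0000800000000000ULL

/* Cap the page-frame limit at the start of the GMC hole: any virtual
 * address inside the hole triggers a range fault interrupt, so under
 * SR-IOV max_pfn must never map into it. */
static uint64_t sriov_clamp_max_pfn(uint64_t max_pfn)
{
    uint64_t hole_pfn = AMDGPU_GMC_HOLE_START >> AMDGPU_GPU_PAGE_SHIFT;
    return max_pfn > hole_pfn ? hole_pfn : max_pfn;
}
```

With a 47-bit hole start and 12-bit pages, the cap works out to 2^35 page frames.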

[PATCH] drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in sriov

2019-01-23 Thread wentalou
sriov's gpu_recover inside xgpu_ai_mailbox_flr_work would cause a duplicate recover during TDR; TDR's gpu_recover is already triggered by amdgpu_job_timedout. Tightening the mailbox path avoids vk-cts failures caused by the unexpected recover. Change-Id: Ifcba4ac43a0229ae19061aad3b0ddc96957ff9c6 Signed-off-by: wentalou --- drivers

[PATCH] drm/amdgpu: sriov put csa below AMDGPU_GMC_HOLE

2019-01-22 Thread wentalou
since vm_size was enlarged to 0x4 GB, sriov needs to put the csa below AMDGPU_GMC_HOLE, or amdgpu_vm_alloc_pts would receive an saddr inside AMDGPU_GMC_HOLE and result in a range fault interrupt IIRC. Change-Id: I405a25a01d949f3130889b346f71bedad8ebcae7 Signed-off-by: Wentao Lou ---

[PATCH] drm/amdgpu: sriov should skip asic_reset in device_init

2019-01-17 Thread wentalou
sriov would meet guest driver load failure if amdgpu_asic_reset were called in amdgpu_device_init; sriov should skip asic_reset in device_init. Change-Id: I6c03b2fcdbf29200fab09459bbffd87726047908 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- 1 file changed, 1
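A sketch of the gate this one-liner amounts to (signatures and the stub are hypothetical): on an SR-IOV virtual function the host owns the full ASIC reset, so the guest driver must skip it during device init.

```c
#include <assert.h>

static int reset_calls;
static int fake_asic_reset(void) { reset_calls++; return 0; }

/* Gate the reset on bare metal only: a VF issuing a full ASIC reset
 * during init makes the guest driver fail to load. */
static int device_init_reset_step(int is_sriov_vf)
{
    if (is_sriov_vf)
        return 0;              /* VF: skip, the host performs the reset */
    return fake_asic_reset();  /* bare metal: reset as before */
}
```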

[PATCH] drm/amdgpu: csa_vaddr should not larger than AMDGPU_GMC_HOLE_START

2019-01-14 Thread wentalou
After removing unnecessary VM size calculations, vm_manager.max_pfn would reach 0x10,,; max_pfn << AMDGPU_GPU_PAGE_SHIFT exceeding AMDGPU_GMC_HOLE_START would cause a GPU reset. Change-Id: I47ad0be2b0bd9fb7490c4e1d7bb7bdacf71132cb Signed-off-by: wentalou --- drivers/gpu/drm/amd/

[PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

2018-12-21 Thread wentalou
When two rings hit a timeout at the same time, they triggered job_timedout separately. Each job_timedout called gpu_recover, but one gpu_recover was blocked by the other's mutex_lock. The bad job's callback should be removed by dma_fence_remove_callback, but that was also blocked inside the mutex_lock. So dma_fence_remove_callback

[PATCH] drm/amdgpu: psp_ring_destroy cause psp->km_ring.ring_mem NULL

2018-12-17 Thread wentalou
psp_ring_destroy inside psp_load_fw caused psp->km_ring.ring_mem to be NULL, so a Call Trace occurred at psp_cmd_submit; it should be psp_ring_stop instead. Change-Id: Ib332004b3b9edc9e002adc532b2d45cdad929b05 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +- 1 file changed, 1
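A model of the stop-vs-destroy difference (struct and helpers simplified, not the real PSP code): destroy frees the ring's backing memory, so a later command submission dereferences NULL, while stop only halts the ring and keeps the memory for the firmware reload path.

```c
#include <assert.h>
#include <stdlib.h>

struct psp_ring_model {
    unsigned int *ring_mem;  /* stands in for psp->km_ring.ring_mem */
    int running;
};

static void ring_stop(struct psp_ring_model *r)
{
    r->running = 0;          /* halt; backing memory preserved */
}

static void ring_destroy(struct psp_ring_model *r)
{
    r->running = 0;
    free(r->ring_mem);       /* backing memory gone ... */
    r->ring_mem = NULL;      /* ... the next submit would see NULL */
}
```

Inside a load/reload path such as psp_load_fw, only the stop variant leaves the ring in a state a subsequent submit can use.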

[PATCH] drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang

2018-12-09 Thread wentalou
The XGMI hive change put kfd_pre_reset into amdgpu_device_lock_adev, but outside sriov's req_full_gpu; this would make sriov hang during reset. Change-Id: I5b3e2a42c77b3b9635419df4470d021df7be34d1 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++ 1 file changed, 6
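An ordering sketch of the constraint (all functions stubbed; only the names come from the summary): under SR-IOV, kfd_pre_reset must run after req_full_gpu has granted the guest exclusive hardware access, otherwise its register traffic hangs the reset.

```c
#include <assert.h>

static int full_gpu_granted;
static int pre_reset_ran_with_access;

static void req_full_gpu(void)  { full_gpu_granted = 1; }
static void kfd_pre_reset(void) { pre_reset_ran_with_access = full_gpu_granted; }

/* The fixed ordering inside the SR-IOV reset path. */
static void device_reset_sriov_model(void)
{
    req_full_gpu();   /* acquire exclusive hardware access first ... */
    kfd_pre_reset();  /* ... only then quiesce KFD */
}
```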

[PATCH] drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang

2018-12-06 Thread wentalou
The XGMI hive change put kfd_pre_reset into amdgpu_device_lock_adev, but outside sriov's req_full_gpu; this would make sriov hang during reset. Change-Id: I5b3e2a42c77b3b9635419df4470d021df7be34d1 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++ 1 file changed, 6

[PATCH] drm/amdgpu: Skip ring soft recovery when fence was NULL

2018-12-05 Thread wentalou
amdgpu_ring_soft_recovery would hit a Call Trace when s_fence->parent was NULL inside amdgpu_job_timedout. Check the fence first, as drm_sched_hw_job_reset does. Change-Id: Ibb062e36feb4e2522a59641fe0d2d76b9773cda7 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +- 1 file
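A sketch of the guard (types heavily simplified): when the scheduler fence has no hardware fence attached yet (`parent == NULL`), soft recovery must bail out instead of dereferencing it, mirroring the check the summary attributes to drm_sched_hw_job_reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hw_fence    { int seqno; };
struct sched_fence { struct hw_fence *parent; };

/* Bail out before touching parent: a job that timed out before its HW
 * fence was attached has nothing to soft-recover. */
static bool can_soft_recover(const struct sched_fence *s_fence)
{
    if (!s_fence || !s_fence->parent)
        return false;
    return true;
}
```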

[PATCH] drm/amdgpu: Skip ring soft recovery when fence parent was NULL

2018-12-05 Thread wentalou
amdgpu_ring_soft_recovery would hit a Call Trace when s_job->s_fence->parent was NULL inside amdgpu_job_timedout. Check the parent first, as drm_sched_hw_job_reset does. Change-Id: I0b674ffd96afd44bcefe37a66fb157b1dbba61a0 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-

[PATCH] drm/amdgpu: enlarge maximum waiting time of KIQ

2018-12-02 Thread wentalou
KIQ access during a VF's init can be delayed by another VF's reset, which occasionally caused late_init to fail. Enlarging MAX_KIQ_REG_TRY from 20 to 80 fixes this issue. Change-Id: Iac680af3cbd6afe4f8e408785f0795e1b23dba83 Signed-off-by: wentalou --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +- 1 file
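A back-of-envelope view of the change (the retry macro names are real per the summary; the per-try wait interval is an assumed placeholder): quadrupling the retry count quadruples how long a KIQ register access keeps retrying, letting one VF's init outlast another VF's reset.

```c
#include <assert.h>

#define MAX_KIQ_REG_TRY_OLD 20
#define MAX_KIQ_REG_TRY_NEW 80
#define KIQ_WAIT_US 5000  /* assumed per-try wait, for illustration only */

/* Total retry budget in microseconds for a given retry count. */
static long kiq_budget_us(int tries)
{
    return (long)tries * KIQ_WAIT_US;
}
```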

[PATCH] drm/amdgpu: enlarge maximum waiting time of KIQ

2018-11-30 Thread wentalou
SWDEV-171843: KIQ access during a VF's init can be delayed by another VF's reset; late_init failed occasionally when it overlapped with another VF's reset. Enlarging MAX_KIQ_REG_TRY from 20 to 80 fixes this issue. Change-Id: I841774bdd9ebf125c5aa2046b1dcebd65e07 Signed-off-by: wentalou --- drivers/gpu/drm/amd