[PATCH] drm/amdgpu: value of amdgpu_sriov_vf cannot be set into F32_POLL_ENABLE
amdgpu_sriov_vf would return 0x0 or 0x4 to indicate if sriov. but F32_POLL_ENABLE need 0x0 or 0x1 to determine if enabled. set 0x4 into F32_POLL_ENABLE would make SDMA0_GFX_RB_WPTR_POLL_CNTL not working. Change-Id: I7d13ed35469ebd7bdf10c90341181977c6cfd38d Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c index 1ec60f5..1be85b7 100644 --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c @@ -851,7 +851,7 @@ static void sdma_v4_0_gfx_resume(struct amdgpu_device *adev, unsigned int i) wptr_poll_cntl = RREG32_SDMA(i, mmSDMA0_GFX_RB_WPTR_POLL_CNTL); wptr_poll_cntl = REG_SET_FIELD(wptr_poll_cntl, SDMA0_GFX_RB_WPTR_POLL_CNTL, - F32_POLL_ENABLE, amdgpu_sriov_vf(adev)); + F32_POLL_ENABLE, amdgpu_sriov_vf(adev)? 1 : 0); WREG32_SDMA(i, mmSDMA0_GFX_RB_WPTR_POLL_CNTL, wptr_poll_cntl); /* enable DMA RB */ @@ -942,7 +942,7 @@ static void sdma_v4_0_page_resume(struct amdgpu_device *adev, unsigned int i) wptr_poll_cntl = RREG32_SDMA(i, mmSDMA0_PAGE_RB_WPTR_POLL_CNTL); wptr_poll_cntl = REG_SET_FIELD(wptr_poll_cntl, SDMA0_PAGE_RB_WPTR_POLL_CNTL, - F32_POLL_ENABLE, amdgpu_sriov_vf(adev)); + F32_POLL_ENABLE, amdgpu_sriov_vf(adev)? 1 : 0); WREG32_SDMA(i, mmSDMA0_PAGE_RB_WPTR_POLL_CNTL, wptr_poll_cntl); /* enable DMA RB */ -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: amdgpu_device_recover_vram got NULL of shadow->parent
amdgpu_bo_destroy had a bug by calling amdgpu_bo_unref outside mutex_lock. If amdgpu_device_recover_vram executed between amdgpu_bo_unref and list_del_init, it would get NULL of shadow->parent, then caused Call Trace and GPU reset failed. Change-Id: I41d7b54605e613e87ee03c3ad89c191063c19230 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index ec9e450..93b2c5a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -88,12 +88,14 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo) if (bo->gem_base.import_attach) drm_prime_gem_destroy(>gem_base, bo->tbo.sg); drm_gem_object_release(>gem_base); - amdgpu_bo_unref(>parent); + /* in case amdgpu_device_recover_vram got NULL of bo->parent */ if (!list_empty(>shadow_list)) { mutex_lock(>shadow_list_lock); list_del_init(>shadow_list); mutex_unlock(>shadow_list_lock); } + amdgpu_bo_unref(>parent); + kfree(bo->metadata); kfree(bo); } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: amdgpu_device_recover_vram got NULL of shadow->parent
amdgpu_bo_destroy had a bug by calling amdgpu_bo_unref outside mutex_lock. If amdgpu_device_recover_vram executed between amdgpu_bo_unref and list_del_init, it would get NULL of shadow->parent, then caused Call Trace and GPU reset failed. Change-Id: I41d7b54605e613e87ee03c3ad89c191063c19230 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index ec9e450..0df8158 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -88,12 +88,16 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo) if (bo->gem_base.import_attach) drm_prime_gem_destroy(>gem_base, bo->tbo.sg); drm_gem_object_release(>gem_base); - amdgpu_bo_unref(>parent); + /* in case amdgpu_device_recover_vram got NULL of bo->parent */ if (!list_empty(>shadow_list)) { mutex_lock(>shadow_list_lock); list_del_init(>shadow_list); + amdgpu_bo_unref(>parent); mutex_unlock(>shadow_list_lock); } + else + amdgpu_bo_unref(>parent); + kfree(bo->metadata); kfree(bo); } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: shadow in shadow_list without tbo.mem.start cause page fault in sriov TDR
shadow was added into shadow_list by amdgpu_bo_create_shadow. meanwhile, shadow->tbo.mem was not fully configured. tbo.mem would be fully configured by amdgpu_vm_sdma_map_table until calling amdgpu_vm_clear_bo. If sriov TDR occurred between amdgpu_bo_create_shadow and amdgpu_vm_sdma_map_table, amdgpu_device_recover_vram would deal with shadow without tbo.mem.start. Change-Id: I1a6a69587d6c689d0a357dd495ee44833d0f0790 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 3785195..be88d06 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3184,6 +3184,7 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev) /* No need to recover an evicted BO */ if (shadow->tbo.mem.mem_type != TTM_PL_TT || + shadow->tbo.mem.start == AMDGPU_BO_INVALID_OFFSET || shadow->parent->tbo.mem.mem_type != TTM_PL_VRAM) continue; -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] amdgpu_device_recover_vram always failed if only one node in shadow_list
amdgpu_bo_restore_shadow would assign zero to r if succeeded. r would remain zero if there is only one node in shadow_list. current code would always return failure when r <= 0. restart the timeout for each wait was a rather problematic bug as well. The value of tmo SHOULD be changed, otherwise we wait tmo jiffies on each loop. Change-Id: I7e836ec7ab6cd0f069aac24f88e454e906637541 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index c4c61e9..fcb3d95 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3191,11 +3191,16 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev) break; if (fence) { - r = dma_fence_wait_timeout(fence, false, tmo); + tmo = dma_fence_wait_timeout(fence, false, tmo); dma_fence_put(fence); fence = next; - if (r <= 0) + if (tmo == 0) { + r = -ETIMEDOUT; break; + } else if (tmo < 0) { + r = tmo; + break; + } } else { fence = next; } @@ -3206,8 +3211,8 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev) tmo = dma_fence_wait_timeout(fence, false, tmo); dma_fence_put(fence); - if (r <= 0 || tmo <= 0) { - DRM_ERROR("recover vram bo from shadow failed\n"); + if (r < 0 || tmo <= 0) { + DRM_ERROR("recover vram bo from shadow failed, r is %ld, tmo is %ld\n", r, tmo); return -EIO; } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] amdgpu_device_recover_vram always failed if only one node in shadow_list
amdgpu_bo_restore_shadow would assign zero to r if succeeded. r would remain zero if there is only one node in shadow_list. current code would always return failure when r <= 0. restart the timeout for each wait was a rather problematic bug as well. The value of tmo SHOULD be changed, otherwise we wait tmo jiffies on each loop. meanwhile, fix Call Trace by NULL of shadow->parent. Change-Id: I7e836ec7ab6cd0f069aac24f88e454e906637541 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index c4c61e9..5a2dc44 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3183,7 +3183,7 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev) /* No need to recover an evicted BO */ if (shadow->tbo.mem.mem_type != TTM_PL_TT || - shadow->parent->tbo.mem.mem_type != TTM_PL_VRAM) + shadow->parent == NULL || shadow->parent->tbo.mem.mem_type != TTM_PL_VRAM) continue; r = amdgpu_bo_restore_shadow(shadow, ); @@ -3191,11 +3191,16 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev) break; if (fence) { - r = dma_fence_wait_timeout(fence, false, tmo); + tmo = dma_fence_wait_timeout(fence, false, tmo); dma_fence_put(fence); fence = next; - if (r <= 0) + if (tmo == 0) { + r = -ETIMEDOUT; break; + } else if (tmo < 0) { + r = tmo; + break; + } } else { fence = next; } @@ -3206,8 +3211,8 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev) tmo = dma_fence_wait_timeout(fence, false, tmo); dma_fence_put(fence); - if (r <= 0 || tmo <= 0) { - DRM_ERROR("recover vram bo from shadow failed\n"); + if (r < 0 || tmo <= 0) { + DRM_ERROR("recover vram bo from shadow failed, tmo is %d\n", tmo); return -EIO; } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdkfd/sriov:Put the pre and post reset in exclusive mode v2
add amdgpu_amdkfd_pre_reset and amdgpu_amdkfd_post_reset inside amdgpu_device_reset_sriov. Change-Id: Icf2839f0b620ce9d47d6414b6c32b9d06672f2ac Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 95cd3b7..b6693bd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3234,6 +3234,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, if (r) return r; + amdgpu_amdkfd_pre_reset(adev); + /* Resume IP prior to SMC */ r = amdgpu_device_ip_reinit_early_sriov(adev); if (r) @@ -3253,6 +3255,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, amdgpu_irq_gpu_reset_resume_helper(adev); r = amdgpu_ib_ring_tests(adev); + amdgpu_amdkfd_post_reset(adev); error: amdgpu_virt_init_data_exchange(adev); -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in sriov
sriov's gpu_recover inside xgpu_ai_mailbox_flr_work would cause duplicate recover in TDR. TDR's gpu_recover would be triggered by amdgpu_job_timedout, that could avoid vk-cts failure by unexpected recover. Change-Id: I840dfc145e4e1be9ece6eac8d9f3501da9b28ebf Signed-off-by: wentalou --- drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c index b11a1c17..73851eb 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c @@ -266,7 +266,8 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work) } /* Trigger recovery for world switch failure if no TDR */ - if (amdgpu_device_should_recover_gpu(adev)) + if (amdgpu_device_should_recover_gpu(adev) + && amdgpu_lockup_timeout == MAX_SCHEDULE_TIMEOUT) amdgpu_device_gpu_recover(adev, NULL); } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: sriov restrict max_pfn below AMDGPU_GMC_HOLE
sriov need to restrict max_pfn below AMDGPU_GMC_HOLE. access the hole results in a range fault interrupt IIRC. Change-Id: I0add197a24a54388a128a545056e9a9f0330abfb Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 +-- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 6 +- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c index dd3bd01..7e22be7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c @@ -26,8 +26,7 @@ uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev) { - uint64_t addr = min(adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT, - AMDGPU_GMC_HOLE_START); + uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT; addr -= AMDGPU_VA_RESERVED_SIZE; addr = amdgpu_gmc_sign_extend(addr); diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c index 9c082f9..600259b 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c @@ -965,7 +965,11 @@ static int gmc_v9_0_sw_init(void *handle) * vm size is 256TB (48bit), maximum size of Vega10, * block size 512 (9bit) */ - amdgpu_vm_adjust_size(adev, 256 * 1024, 9, 3, 48); + /* sriov restrict max_pfn below AMDGPU_GMC_HOLE */ + if (amdgpu_sriov_vf(adev)) + amdgpu_vm_adjust_size(adev, 256 * 1024, 9, 3, 47); + else + amdgpu_vm_adjust_size(adev, 256 * 1024, 9, 3, 48); break; default: break; -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in sriov
sriov's gpu_recover inside xgpu_ai_mailbox_flr_work would cause duplicate recover in TDR. TDR's gpu_recover would be triggered by amdgpu_job_timedout, that could avoid vk-cts failure by unexpected recover. Change-Id: Ifcba4ac43a0229ae19061aad3b0ddc96957ff9c6 Signed-off-by: wentalou --- drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c index b11a1c17..f227633 100644 --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c @@ -266,7 +266,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work) } /* Trigger recovery for world switch failure if no TDR */ - if (amdgpu_device_should_recover_gpu(adev)) + if (amdgpu_device_should_recover_gpu(adev) && amdgpu_lockup_timeout == 0) amdgpu_device_gpu_recover(adev, NULL); } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: sriov put csa below AMDGPU_GMC_HOLE
since vm_size enlarged to 0x4 GB, sriov need to put csa below AMDGPU_GMC_HOLE. or amdgpu_vm_alloc_pts would receive saddr among AMDGPU_GMC_HOLE, and result in a range fault interrupt IIRC. Change-Id: I405a25a01d949f3130889b346f71bedad8ebcae7 Signed-off-by: Wenta Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 6 -- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 6 -- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c index dd3bd01..7a93c36 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c @@ -26,8 +26,10 @@ uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev) { - uint64_t addr = min(adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT, - AMDGPU_GMC_HOLE_START); + uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT; + /* sriov put csa below AMDGPU_GMC_HOLE */ + if (amdgpu_sriov_vf(adev)) + addr = min(addr, AMDGPU_GMC_HOLE_START); addr -= AMDGPU_VA_RESERVED_SIZE; addr = amdgpu_gmc_sign_extend(addr); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index f87f717..cf9ec28 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -707,8 +707,10 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file vm_size = min(vm_size, 1ULL << 40); dev_info.virtual_address_offset = AMDGPU_VA_RESERVED_SIZE; - dev_info.virtual_address_max = - min(vm_size, AMDGPU_GMC_HOLE_START); + if (amdgpu_sriov_vf(adev)) + dev_info.virtual_address_max = min(vm_size, AMDGPU_GMC_HOLE_START - AMDGPU_VA_RESERVED_SIZE); + else + dev_info.virtual_address_max = min(vm_size, AMDGPU_GMC_HOLE_START); if (vm_size > AMDGPU_GMC_HOLE_START) { dev_info.high_va_offset = AMDGPU_GMC_HOLE_END; -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: sriov should skip asic_reset in device_init
sriov would meet guest driver load failure, if calling amdgpu_asic_reset in amdgpu_device_init. sriov should skip asic_reset in device_init. Change-Id: I6c03b2fcdbf29200fab09459bbffd87726047908 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 8a61764..e20dce4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2553,7 +2553,7 @@ int amdgpu_device_init(struct amdgpu_device *adev, /* check if we need to reset the asic * E.g., driver was not cleanly unloaded previously, etc. */ - if (amdgpu_asic_need_reset_on_init(adev)) { + if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) { r = amdgpu_asic_reset(adev); if (r) { dev_err(adev->dev, "asic reset on init failed\n"); -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: csa_vaddr should not larger than AMDGPU_GMC_HOLE_START
After removing unnecessary VM size calculations, vm_manager.max_pfn would reach 0x10,, max_pfn << AMDGPU_GPU_PAGE_SHIFT exceeding AMDGPU_GMC_HOLE_START would caused GPU reset. Change-Id: I47ad0be2b0bd9fb7490c4e1d7bb7bdacf71132cb Signed-off-by: wentalou --- drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c index 7e22be7..dd3bd01 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c @@ -26,7 +26,8 @@ uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev) { - uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT; + uint64_t addr = min(adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT, + AMDGPU_GMC_HOLE_START); addr -= AMDGPU_VA_RESERVED_SIZE; addr = amdgpu_gmc_sign_extend(addr); -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback
When 2 rings met timeout at same time, triggered job_timedout separately. Each job_timedout called gpu_recover, but one of gpu_recover locked by another's mutex_lock. Bad jod’s callback should be removed by dma_fence_remove_callback but locked inside mutex_lock. So dma_fence_remove_callback could not be called immediately. Then callback drm_sched_process_job triggered unexpectedly, and signaled DMA_FENCE_FLAG_SIGNALED_BIT. After another's mutex_unlock, signaled bad job went through job_run inside drm_sched_job_recovery. job_run would have WARN_ON and Call-Trace, when calling kcl_dma_fence_set_error for signaled bad job. Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index 0a17fb1..fc1d3a0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job) trace_amdgpu_sched_run_job(job); - if (job->vram_lost_counter != atomic_read(>adev->vram_lost_counter)) + if (job->vram_lost_counter != atomic_read(>adev->vram_lost_counter)) { + /* flags might be signaled by unexpected callback, clear it */ + test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, >flags); dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */ + } if (finished->error < 0) { DRM_INFO("Skip scheduling IBs!\n"); -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: psp_ring_destroy cause psp->km_ring.ring_mem NULL
psp_ring_destroy inside psp_load_fw cause psp->km_ring.ring_mem NULL. Call Trace occurred when psp_cmd_submit. should be psp_ring_stop instead. Change-Id: Ib332004b3b9edc9e002adc532b2d45cdad929b05 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index 7f5ce37..8189a90 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -547,7 +547,7 @@ static int psp_load_fw(struct amdgpu_device *adev) struct psp_context *psp = >psp; if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset) { - psp_ring_destroy(psp, PSP_RING_TYPE__KM); + psp_ring_stop(psp, PSP_RING_TYPE__KM); /* should not destroy ring, only stop */ goto skip_memalloc; } -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang
XGMI hive put kfd_pre_reset into amdgpu_device_lock_adev, but outside req_full_gpu of sriov. It would make sriov hang during reset. Change-Id: I5b3e2a42c77b3b9635419df4470d021df7be34d1 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index ef36cc5..659dd40 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3474,14 +3474,16 @@ static void amdgpu_device_lock_adev(struct amdgpu_device *adev) mutex_lock(>lock_reset); atomic_inc(>gpu_reset_counter); adev->in_gpu_reset = 1; - /* Block kfd */ - amdgpu_amdkfd_pre_reset(adev); + /* Block kfd: SRIOV would do it separately */ + if (!amdgpu_sriov_vf(adev)) +amdgpu_amdkfd_pre_reset(adev); } static void amdgpu_device_unlock_adev(struct amdgpu_device *adev) { - /*unlock kfd */ - amdgpu_amdkfd_post_reset(adev); + /*unlock kfd: SRIOV would do it separately */ + if (!amdgpu_sriov_vf(adev)) +amdgpu_amdkfd_post_reset(adev); amdgpu_vf_error_trans_all(adev); adev->in_gpu_reset = 0; mutex_unlock(>lock_reset); -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang
XGMI hive put kfd_pre_reset into amdgpu_device_lock_adev, but outside req_full_gpu of sriov. It would make sriov hang during reset. Change-Id: I5b3e2a42c77b3b9635419df4470d021df7be34d1 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index ef36cc5..659dd40 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -3474,14 +3474,16 @@ static void amdgpu_device_lock_adev(struct amdgpu_device *adev) mutex_lock(>lock_reset); atomic_inc(>gpu_reset_counter); adev->in_gpu_reset = 1; - /* Block kfd */ - amdgpu_amdkfd_pre_reset(adev); + /* Block kfd: SRIOV would do it separately */ + if (!amdgpu_sriov_vf(adev)) +amdgpu_amdkfd_pre_reset(adev); } static void amdgpu_device_unlock_adev(struct amdgpu_device *adev) { - /*unlock kfd */ - amdgpu_amdkfd_post_reset(adev); + /*unlock kfd: SRIOV would do it separately */ + if (!amdgpu_sriov_vf(adev)) +amdgpu_amdkfd_post_reset(adev); amdgpu_vf_error_trans_all(adev); adev->in_gpu_reset = 0; mutex_unlock(>lock_reset); -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: Skip ring soft recovery when fence was NULL
amdgpu_ring_soft_recovery would have Call-Trace, when s_fence->parent was NULL inside amdgpu_job_timedout. Check fence first, as drm_sched_hw_job_reset did. Change-Id: Ibb062e36feb4e2522a59641fe0d2d76b9773cda7 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c index 5b75bdc..335a0ed 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c @@ -397,7 +397,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid, { ktime_t deadline = ktime_add_us(ktime_get(), 1); - if (!ring->funcs->soft_recovery) + if (!ring->funcs->soft_recovery || !fence) return false; atomic_inc(>adev->gpu_reset_counter); -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: Skip ring soft recovery when fence parent was NULL
amdgpu_ring_soft_recovery would have Call-Trace, when s_job->s_fence->parent was NULL inside amdgpu_job_timedout. Check parent first, as drm_sched_hw_job_reset did. Change-Id: I0b674ffd96afd44bcefe37a66fb157b1dbba61a0 Signed-off-by: Wentao Lou --- drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c index e0af44f..2945615 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c @@ -33,7 +33,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched); struct amdgpu_job *job = to_amdgpu_job(s_job); - if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { + if (s_job->s_fence->parent && amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { DRM_ERROR("ring %s timeout, but soft recovered\n", s_job->sched->name); return; -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: enlarge maximum waiting time of KIQ
KIQ in VF’s init delayed by another VF’s reset, which would cause late_init failed occasionally. MAX_KIQ_REG_TRY enlarged from 20 to 80 would fix this issue. Change-Id: Iac680af3cbd6afe4f8e408785f0795e1b23dba83 Signed-off-by: wentalou --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index c8ad6bf..62018e7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -235,7 +235,7 @@ enum amdgpu_kiq_irq { #define MAX_KIQ_REG_WAIT 5000 /* in usecs, 5ms */ #define MAX_KIQ_REG_BAILOUT_INTERVAL 5 /* in msecs, 5ms */ -#define MAX_KIQ_REG_TRY 20 +#define MAX_KIQ_REG_TRY 80 /* 20 -> 80 */ int amdgpu_device_ip_set_clockgating_state(void *dev, enum amd_ip_block_type block_type, -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
[PATCH] drm/amdgpu: enlarge maximum waiting time of KIQ
SWDEV-171843: KIQ in VF’s init delayed by another VF’s reset. late_init failed occasionally if overlapped with another VF’s reset. MAX_KIQ_REG_TRY enlarged from 20 to 80 would fix this issue. Change-Id: I841774bdd9ebf125c5aa2046b1dcebd65e07 Signed-off-by: wentalou --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index c8ad6bf..26e2455 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -235,7 +235,7 @@ enum amdgpu_kiq_irq { #define MAX_KIQ_REG_WAIT 5000 /* in usecs, 5ms */ #define MAX_KIQ_REG_BAILOUT_INTERVAL 5 /* in msecs, 5ms */ -#define MAX_KIQ_REG_TRY 20 +#define MAX_KIQ_REG_TRY 80 /* SWDEV-171843: 20 -> 80 */ int amdgpu_device_ip_set_clockgating_state(void *dev, enum amd_ip_block_type block_type, -- 2.7.4 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx