[PATCH] drm/amdgpu: value of amdgpu_sriov_vf cannot be set into F32_POLL_ENABLE

2019-04-24 Thread wentalou
amdgpu_sriov_vf returns 0x0 or 0x4 to indicate whether SR-IOV is active,
but F32_POLL_ENABLE needs 0x0 or 0x1 to determine whether polling is enabled.
Writing 0x4 into F32_POLL_ENABLE leaves SDMA0_GFX_RB_WPTR_POLL_CNTL not working.

Change-Id: I7d13ed35469ebd7bdf10c90341181977c6cfd38d
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index 1ec60f5..1be85b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -851,7 +851,7 @@ static void sdma_v4_0_gfx_resume(struct amdgpu_device *adev, unsigned int i)
 	wptr_poll_cntl = RREG32_SDMA(i, mmSDMA0_GFX_RB_WPTR_POLL_CNTL);
 	wptr_poll_cntl = REG_SET_FIELD(wptr_poll_cntl,
 				       SDMA0_GFX_RB_WPTR_POLL_CNTL,
-				       F32_POLL_ENABLE, amdgpu_sriov_vf(adev));
+				       F32_POLL_ENABLE, amdgpu_sriov_vf(adev) ? 1 : 0);
 	WREG32_SDMA(i, mmSDMA0_GFX_RB_WPTR_POLL_CNTL, wptr_poll_cntl);
 
/* enable DMA RB */
@@ -942,7 +942,7 @@ static void sdma_v4_0_page_resume(struct amdgpu_device *adev, unsigned int i)
 	wptr_poll_cntl = RREG32_SDMA(i, mmSDMA0_PAGE_RB_WPTR_POLL_CNTL);
 	wptr_poll_cntl = REG_SET_FIELD(wptr_poll_cntl,
 				       SDMA0_PAGE_RB_WPTR_POLL_CNTL,
-				       F32_POLL_ENABLE, amdgpu_sriov_vf(adev));
+				       F32_POLL_ENABLE, amdgpu_sriov_vf(adev) ? 1 : 0);
 	WREG32_SDMA(i, mmSDMA0_PAGE_RB_WPTR_POLL_CNTL, wptr_poll_cntl);
 
/* enable DMA RB */
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[PATCH] drm/amdgpu: amdgpu_device_recover_vram got NULL of shadow->parent

2019-04-16 Thread wentalou
amdgpu_bo_destroy had a bug: it called amdgpu_bo_unref before taking the shadow_list lock.
If amdgpu_device_recover_vram executed between amdgpu_bo_unref and list_del_init,
it found shadow->parent NULL, which caused a Call Trace and a failed GPU reset.

Change-Id: I41d7b54605e613e87ee03c3ad89c191063c19230
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index ec9e450..93b2c5a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -88,12 +88,14 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
 	if (bo->gem_base.import_attach)
 		drm_prime_gem_destroy(&bo->gem_base, bo->tbo.sg);
 	drm_gem_object_release(&bo->gem_base);
-	amdgpu_bo_unref(&bo->parent);
+	/* in case amdgpu_device_recover_vram got NULL of bo->parent */
 	if (!list_empty(&bo->shadow_list)) {
 		mutex_lock(&adev->shadow_list_lock);
 		list_del_init(&bo->shadow_list);
 		mutex_unlock(&adev->shadow_list_lock);
 	}
+	amdgpu_bo_unref(&bo->parent);
+
 	kfree(bo->metadata);
 	kfree(bo);
 }
-- 
2.7.4


[PATCH] drm/amdgpu: amdgpu_device_recover_vram got NULL of shadow->parent

2019-04-16 Thread wentalou
amdgpu_bo_destroy had a bug: it called amdgpu_bo_unref before taking the shadow_list lock.
If amdgpu_device_recover_vram executed between amdgpu_bo_unref and list_del_init,
it found shadow->parent NULL, which caused a Call Trace and a failed GPU reset.

Change-Id: I41d7b54605e613e87ee03c3ad89c191063c19230
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index ec9e450..0df8158 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -88,12 +88,16 @@ static void amdgpu_bo_destroy(struct ttm_buffer_object *tbo)
 	if (bo->gem_base.import_attach)
 		drm_prime_gem_destroy(&bo->gem_base, bo->tbo.sg);
 	drm_gem_object_release(&bo->gem_base);
-	amdgpu_bo_unref(&bo->parent);
+	/* in case amdgpu_device_recover_vram got NULL of bo->parent */
 	if (!list_empty(&bo->shadow_list)) {
 		mutex_lock(&adev->shadow_list_lock);
 		list_del_init(&bo->shadow_list);
+		amdgpu_bo_unref(&bo->parent);
 		mutex_unlock(&adev->shadow_list_lock);
 	}
+	else
+		amdgpu_bo_unref(&bo->parent);
+
 	kfree(bo->metadata);
 	kfree(bo);
 }
-- 
2.7.4


[PATCH] drm/amdgpu: shadow in shadow_list without tbo.mem.start cause page fault in sriov TDR

2019-04-12 Thread wentalou
shadow is added to shadow_list by amdgpu_bo_create_shadow, but at that point
shadow->tbo.mem is not yet fully configured; it only becomes so when
amdgpu_vm_sdma_map_table runs from amdgpu_vm_clear_bo.
If an SR-IOV TDR occurred between amdgpu_bo_create_shadow and amdgpu_vm_sdma_map_table,
amdgpu_device_recover_vram would process a shadow whose tbo.mem.start was still
invalid, causing a page fault.

Change-Id: I1a6a69587d6c689d0a357dd495ee44833d0f0790
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3785195..be88d06 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3184,6 +3184,7 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
 
 		/* No need to recover an evicted BO */
 		if (shadow->tbo.mem.mem_type != TTM_PL_TT ||
+		    shadow->tbo.mem.start == AMDGPU_BO_INVALID_OFFSET ||
 		    shadow->parent->tbo.mem.mem_type != TTM_PL_VRAM)
 			continue;
 
 
-- 
2.7.4


[PATCH] amdgpu_device_recover_vram always failed if only one node in shadow_list

2019-04-03 Thread wentalou
amdgpu_bo_restore_shadow assigns zero to r on success.
With only one node in shadow_list, r stays zero, so the existing `r <= 0`
check always reported failure.
Restarting the timeout for each wait was a bug as well: tmo must carry the
remaining budget, otherwise we wait tmo jiffies on every loop iteration.

Change-Id: I7e836ec7ab6cd0f069aac24f88e454e906637541
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c4c61e9..fcb3d95 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3191,11 +3191,16 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
 			break;
 
 		if (fence) {
-			r = dma_fence_wait_timeout(fence, false, tmo);
+			tmo = dma_fence_wait_timeout(fence, false, tmo);
 			dma_fence_put(fence);
 			fence = next;
-			if (r <= 0)
+			if (tmo == 0) {
+				r = -ETIMEDOUT;
 				break;
+			} else if (tmo < 0) {
+				r = tmo;
+				break;
+			}
 		} else {
 			fence = next;
 		}
@@ -3206,8 +3211,8 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
 	tmo = dma_fence_wait_timeout(fence, false, tmo);
 	dma_fence_put(fence);
 
-	if (r <= 0 || tmo <= 0) {
-		DRM_ERROR("recover vram bo from shadow failed\n");
+	if (r < 0 || tmo <= 0) {
+		DRM_ERROR("recover vram bo from shadow failed, r is %ld, tmo is %ld\n", r, tmo);
 		return -EIO;
 	}
 
-- 
2.7.4


[PATCH] amdgpu_device_recover_vram always failed if only one node in shadow_list

2019-04-02 Thread wentalou
amdgpu_bo_restore_shadow assigns zero to r on success.
With only one node in shadow_list, r stays zero, so the existing `r <= 0`
check always reported failure.
Restarting the timeout for each wait was a bug as well: tmo must carry the
remaining budget, otherwise we wait tmo jiffies on every loop iteration.
Also fix a Call Trace caused by shadow->parent being NULL.

Change-Id: I7e836ec7ab6cd0f069aac24f88e454e906637541
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c4c61e9..5a2dc44 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3183,7 +3183,7 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
 
 		/* No need to recover an evicted BO */
 		if (shadow->tbo.mem.mem_type != TTM_PL_TT ||
-		    shadow->parent->tbo.mem.mem_type != TTM_PL_VRAM)
+		    shadow->parent == NULL || shadow->parent->tbo.mem.mem_type != TTM_PL_VRAM)
 			continue;
 
 		r = amdgpu_bo_restore_shadow(shadow, &next);
@@ -3191,11 +3191,16 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
 			break;
 
 		if (fence) {
-			r = dma_fence_wait_timeout(fence, false, tmo);
+			tmo = dma_fence_wait_timeout(fence, false, tmo);
 			dma_fence_put(fence);
 			fence = next;
-			if (r <= 0)
+			if (tmo == 0) {
+				r = -ETIMEDOUT;
 				break;
+			} else if (tmo < 0) {
+				r = tmo;
+				break;
+			}
 		} else {
 			fence = next;
 		}
@@ -3206,8 +3211,8 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
 	tmo = dma_fence_wait_timeout(fence, false, tmo);
 	dma_fence_put(fence);
 
-	if (r <= 0 || tmo <= 0) {
-		DRM_ERROR("recover vram bo from shadow failed\n");
+	if (r < 0 || tmo <= 0) {
+		DRM_ERROR("recover vram bo from shadow failed, tmo is %d\n", tmo);
 		return -EIO;
 	}
 
-- 
2.7.4


[PATCH] drm/amdkfd/sriov: Put the pre and post reset in exclusive mode v2

2019-03-13 Thread wentalou
Add amdgpu_amdkfd_pre_reset and amdgpu_amdkfd_post_reset calls inside
amdgpu_device_reset_sriov.

Change-Id: Icf2839f0b620ce9d47d6414b6c32b9d06672f2ac
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 95cd3b7..b6693bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3234,6 +3234,8 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	if (r)
 		return r;
 
+	amdgpu_amdkfd_pre_reset(adev);
+
 	/* Resume IP prior to SMC */
 	r = amdgpu_device_ip_reinit_early_sriov(adev);
 	if (r)
@@ -3253,6 +3255,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 
 	amdgpu_irq_gpu_reset_resume_helper(adev);
 	r = amdgpu_ib_ring_tests(adev);
+	amdgpu_amdkfd_post_reset(adev);
 
 error:
 	amdgpu_virt_init_data_exchange(adev);
-- 
2.7.4


[PATCH] drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in sriov

2019-01-29 Thread wentalou
The gpu_recover call inside xgpu_ai_mailbox_flr_work caused a duplicate
recovery under SR-IOV when TDR also triggered gpu_recover via
amdgpu_job_timedout. Skipping the duplicate avoids vk-cts failures caused
by the unexpected recovery.

Change-Id: I840dfc145e4e1be9ece6eac8d9f3501da9b28ebf
Signed-off-by: wentalou 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index b11a1c17..73851eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -266,7 +266,8 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery for world switch failure if no TDR */
-	if (amdgpu_device_should_recover_gpu(adev))
+	if (amdgpu_device_should_recover_gpu(adev)
+	    && amdgpu_lockup_timeout == MAX_SCHEDULE_TIMEOUT)
 		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 
-- 
2.7.4



[PATCH] drm/amdgpu: sriov restrict max_pfn below AMDGPU_GMC_HOLE

2019-01-23 Thread wentalou
SR-IOV needs to restrict max_pfn below AMDGPU_GMC_HOLE;
accessing the hole results in a range fault interrupt.

Change-Id: I0add197a24a54388a128a545056e9a9f0330abfb
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 +--
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   | 6 +-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index dd3bd01..7e22be7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -26,8 +26,7 @@
 
 uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
 {
-	uint64_t addr = min(adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT,
-			    AMDGPU_GMC_HOLE_START);
+	uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
 
 	addr -= AMDGPU_VA_RESERVED_SIZE;
 	addr = amdgpu_gmc_sign_extend(addr);
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 9c082f9..600259b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -965,7 +965,11 @@ static int gmc_v9_0_sw_init(void *handle)
 		 * vm size is 256TB (48bit), maximum size of Vega10,
 		 * block size 512 (9bit)
 		 */
-		amdgpu_vm_adjust_size(adev, 256 * 1024, 9, 3, 48);
+		/* sriov restrict max_pfn below AMDGPU_GMC_HOLE */
+		if (amdgpu_sriov_vf(adev))
+			amdgpu_vm_adjust_size(adev, 256 * 1024, 9, 3, 47);
+		else
+			amdgpu_vm_adjust_size(adev, 256 * 1024, 9, 3, 48);
 		break;
 	default:
 		break;
-- 
2.7.4



[PATCH] drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in sriov

2019-01-23 Thread wentalou
The gpu_recover call inside xgpu_ai_mailbox_flr_work caused a duplicate
recovery under SR-IOV when TDR also triggered gpu_recover via
amdgpu_job_timedout. Skipping the duplicate avoids vk-cts failures caused
by the unexpected recovery.

Change-Id: Ifcba4ac43a0229ae19061aad3b0ddc96957ff9c6
Signed-off-by: wentalou 
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index b11a1c17..f227633 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -266,7 +266,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery for world switch failure if no TDR */
-	if (amdgpu_device_should_recover_gpu(adev))
+	if (amdgpu_device_should_recover_gpu(adev) && amdgpu_lockup_timeout == 0)
 		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 
-- 
2.7.4



[PATCH] drm/amdgpu: sriov put csa below AMDGPU_GMC_HOLE

2019-01-22 Thread wentalou
Since vm_size was enlarged to 0x40000 GB (256 TB),
SR-IOV needs to put the CSA below AMDGPU_GMC_HOLE;
otherwise amdgpu_vm_alloc_pts would receive an saddr inside AMDGPU_GMC_HOLE
and trigger a range fault interrupt.

Change-Id: I405a25a01d949f3130889b346f71bedad8ebcae7
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 6 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 6 --
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index dd3bd01..7a93c36 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -26,8 +26,10 @@
 
 uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
 {
-	uint64_t addr = min(adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT,
-			    AMDGPU_GMC_HOLE_START);
+	uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
+	/* sriov put csa below AMDGPU_GMC_HOLE */
+	if (amdgpu_sriov_vf(adev))
+		addr = min(addr, AMDGPU_GMC_HOLE_START);
 
 	addr -= AMDGPU_VA_RESERVED_SIZE;
 	addr = amdgpu_gmc_sign_extend(addr);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index f87f717..cf9ec28 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -707,8 +707,10 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
 		vm_size = min(vm_size, 1ULL << 40);
 
 		dev_info.virtual_address_offset = AMDGPU_VA_RESERVED_SIZE;
-		dev_info.virtual_address_max =
-			min(vm_size, AMDGPU_GMC_HOLE_START);
+		if (amdgpu_sriov_vf(adev))
+			dev_info.virtual_address_max = min(vm_size, AMDGPU_GMC_HOLE_START - AMDGPU_VA_RESERVED_SIZE);
+		else
+			dev_info.virtual_address_max = min(vm_size, AMDGPU_GMC_HOLE_START);
 
 		if (vm_size > AMDGPU_GMC_HOLE_START) {
 			dev_info.high_va_offset = AMDGPU_GMC_HOLE_END;
-- 
2.7.4



[PATCH] drm/amdgpu: sriov should skip asic_reset in device_init

2019-01-17 Thread wentalou
The guest driver fails to load under SR-IOV if amdgpu_asic_reset is called
in amdgpu_device_init, so SR-IOV should skip the asic reset there.

Change-Id: I6c03b2fcdbf29200fab09459bbffd87726047908
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8a61764..e20dce4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2553,7 +2553,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
/* check if we need to reset the asic
 *  E.g., driver was not cleanly unloaded previously, etc.
 */
-   if (amdgpu_asic_need_reset_on_init(adev)) {
+   if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
r = amdgpu_asic_reset(adev);
if (r) {
dev_err(adev->dev, "asic reset on init failed\n");
-- 
2.7.4



[PATCH] drm/amdgpu: csa_vaddr should not larger than AMDGPU_GMC_HOLE_START

2019-01-14 Thread wentalou
After removing unnecessary VM size calculations,
vm_manager.max_pfn reaches 0x10,0000,0000.
max_pfn << AMDGPU_GPU_PAGE_SHIFT then exceeds AMDGPU_GMC_HOLE_START,
which caused a GPU reset.

Change-Id: I47ad0be2b0bd9fb7490c4e1d7bb7bdacf71132cb
Signed-off-by: wentalou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
index 7e22be7..dd3bd01 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
@@ -26,7 +26,8 @@
 
 uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
 {
-	uint64_t addr = adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT;
+	uint64_t addr = min(adev->vm_manager.max_pfn << AMDGPU_GPU_PAGE_SHIFT,
+			    AMDGPU_GMC_HOLE_START);
 
addr -= AMDGPU_VA_RESERVED_SIZE;
addr = amdgpu_gmc_sign_extend(addr);
-- 
2.7.4



[PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

2018-12-21 Thread wentalou
When two rings hit a timeout at the same time, each job_timedout called
gpu_recover, and one gpu_recover blocked on the other's mutex_lock.
The bad job's callback should be removed by dma_fence_remove_callback, but
that also runs inside the mutex_lock, so it could not be called immediately.
The callback drm_sched_process_job therefore fired unexpectedly and set
DMA_FENCE_FLAG_SIGNALED_BIT.
After the other reset's mutex_unlock, the signaled bad job went through
job_run inside drm_sched_job_recovery, and job_run hit a WARN_ON and Call
Trace when calling kcl_dma_fence_set_error on the already-signaled bad job.

Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 0a17fb1..fc1d3a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 
 	trace_amdgpu_sched_run_job(job);
 
-	if (job->vram_lost_counter != atomic_read(&job->adev->vram_lost_counter))
+	if (job->vram_lost_counter != atomic_read(&job->adev->vram_lost_counter)) {
+		/* flags might be signaled by unexpected callback, clear it */
+		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
 		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
+	}
 
 	if (finished->error < 0) {
 		DRM_INFO("Skip scheduling IBs!\n");
-- 
2.7.4



[PATCH] drm/amdgpu: psp_ring_destroy cause psp->km_ring.ring_mem NULL

2018-12-17 Thread wentalou
psp_ring_destroy inside psp_load_fw left psp->km_ring.ring_mem NULL,
causing a Call Trace when psp_cmd_submit ran.
It should be psp_ring_stop instead.

Change-Id: Ib332004b3b9edc9e002adc532b2d45cdad929b05
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 7f5ce37..8189a90 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -547,7 +547,7 @@ static int psp_load_fw(struct amdgpu_device *adev)
 	struct psp_context *psp = &adev->psp;
 
 	if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset) {
-		psp_ring_destroy(psp, PSP_RING_TYPE__KM);
+		psp_ring_stop(psp, PSP_RING_TYPE__KM); /* should not destroy ring, only stop */
 		goto skip_memalloc;
 	}
 
 
-- 
2.7.4



[PATCH] drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang

2018-12-09 Thread wentalou
The XGMI hive changes put kfd_pre_reset into amdgpu_device_lock_adev,
which for SR-IOV runs outside req_full_gpu.
That makes SR-IOV hang during reset.

Change-Id: I5b3e2a42c77b3b9635419df4470d021df7be34d1
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ef36cc5..659dd40 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3474,14 +3474,16 @@ static void amdgpu_device_lock_adev(struct amdgpu_device *adev)
 	mutex_lock(&adev->lock_reset);
 	atomic_inc(&adev->gpu_reset_counter);
 	adev->in_gpu_reset = 1;
-	/* Block kfd */
-	amdgpu_amdkfd_pre_reset(adev);
+	/* Block kfd: SRIOV would do it separately */
+	if (!amdgpu_sriov_vf(adev))
+		amdgpu_amdkfd_pre_reset(adev);
 }
 
 static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
 {
-	/*unlock kfd */
-	amdgpu_amdkfd_post_reset(adev);
+	/*unlock kfd: SRIOV would do it separately */
+	if (!amdgpu_sriov_vf(adev))
+		amdgpu_amdkfd_post_reset(adev);
 	amdgpu_vf_error_trans_all(adev);
 	adev->in_gpu_reset = 0;
 	mutex_unlock(&adev->lock_reset);
-- 
2.7.4



[PATCH] drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang

2018-12-06 Thread wentalou
The XGMI hive changes put kfd_pre_reset into amdgpu_device_lock_adev,
which for SR-IOV runs outside req_full_gpu.
That makes SR-IOV hang during reset.

Change-Id: I5b3e2a42c77b3b9635419df4470d021df7be34d1
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ef36cc5..659dd40 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3474,14 +3474,16 @@ static void amdgpu_device_lock_adev(struct amdgpu_device *adev)
 	mutex_lock(&adev->lock_reset);
 	atomic_inc(&adev->gpu_reset_counter);
 	adev->in_gpu_reset = 1;
-	/* Block kfd */
-	amdgpu_amdkfd_pre_reset(adev);
+	/* Block kfd: SRIOV would do it separately */
+	if (!amdgpu_sriov_vf(adev))
+		amdgpu_amdkfd_pre_reset(adev);
 }
 
 static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
 {
-	/*unlock kfd */
-	amdgpu_amdkfd_post_reset(adev);
+	/*unlock kfd: SRIOV would do it separately */
+	if (!amdgpu_sriov_vf(adev))
+		amdgpu_amdkfd_post_reset(adev);
 	amdgpu_vf_error_trans_all(adev);
 	adev->in_gpu_reset = 0;
 	mutex_unlock(&adev->lock_reset);
-- 
2.7.4



[PATCH] drm/amdgpu: Skip ring soft recovery when fence was NULL

2018-12-05 Thread wentalou
amdgpu_ring_soft_recovery hit a Call Trace
when s_fence->parent was NULL inside amdgpu_job_timedout.
Check the fence first, as drm_sched_hw_job_reset does.

Change-Id: Ibb062e36feb4e2522a59641fe0d2d76b9773cda7
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5b75bdc..335a0ed 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -397,7 +397,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
 {
 	ktime_t deadline = ktime_add_us(ktime_get(), 1);
 
-	if (!ring->funcs->soft_recovery)
+	if (!ring->funcs->soft_recovery || !fence)
 		return false;
 
 	atomic_inc(&ring->adev->gpu_reset_counter);
-- 
2.7.4



[PATCH] drm/amdgpu: Skip ring soft recovery when fence parent was NULL

2018-12-05 Thread wentalou
amdgpu_ring_soft_recovery hit a Call Trace
when s_job->s_fence->parent was NULL inside amdgpu_job_timedout.
Check the parent first, as drm_sched_hw_job_reset does.

Change-Id: I0b674ffd96afd44bcefe37a66fb157b1dbba61a0
Signed-off-by: Wentao Lou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index e0af44f..2945615 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -33,7 +33,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
 	struct amdgpu_job *job = to_amdgpu_job(s_job);
 
-	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
+	if (s_job->s_fence->parent && amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
 		DRM_ERROR("ring %s timeout, but soft recovered\n",
 			  s_job->sched->name);
 		return;
-- 
2.7.4



[PATCH] drm/amdgpu: enlarge maximum waiting time of KIQ

2018-12-02 Thread wentalou
KIQ init in a VF can be delayed by another VF's reset,
which caused late_init to fail occasionally.
Enlarging MAX_KIQ_REG_TRY from 20 to 80 fixes this issue.

Change-Id: Iac680af3cbd6afe4f8e408785f0795e1b23dba83
Signed-off-by: wentalou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c8ad6bf..62018e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -235,7 +235,7 @@ enum amdgpu_kiq_irq {
 
 #define MAX_KIQ_REG_WAIT   5000 /* in usecs, 5ms */
 #define MAX_KIQ_REG_BAILOUT_INTERVAL   5 /* in msecs, 5ms */
-#define MAX_KIQ_REG_TRY 20
+#define MAX_KIQ_REG_TRY 80 /* 20 -> 80 */
 
 int amdgpu_device_ip_set_clockgating_state(void *dev,
   enum amd_ip_block_type block_type,
-- 
2.7.4



[PATCH] drm/amdgpu: enlarge maximum waiting time of KIQ

2018-11-30 Thread wentalou
SWDEV-171843: KIQ init in a VF can be delayed by another VF's reset.
late_init failed occasionally when overlapped with another VF's reset.
Enlarging MAX_KIQ_REG_TRY from 20 to 80 fixes this issue.

Change-Id: I841774bdd9ebf125c5aa2046b1dcebd65e07
Signed-off-by: wentalou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c8ad6bf..26e2455 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -235,7 +235,7 @@ enum amdgpu_kiq_irq {
 
 #define MAX_KIQ_REG_WAIT   5000 /* in usecs, 5ms */
 #define MAX_KIQ_REG_BAILOUT_INTERVAL   5 /* in msecs, 5ms */
-#define MAX_KIQ_REG_TRY 20
+#define MAX_KIQ_REG_TRY 80 /* SWDEV-171843: 20 -> 80 */
 
 int amdgpu_device_ip_set_clockgating_state(void *dev,
   enum amd_ip_block_type block_type,
-- 
2.7.4
