Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

2018-12-27 Thread Grodzovsky, Andrey
They are still not pushed - I am fixing review comments.

Andrey


On 12/26/2018 01:38 AM, Lou, Wentao wrote:
> Hi Andrey,
>
> In amd-staging-dkms-4.18's merged list, I can't find 'drm/sched: Refactor 
> ring mirror list handling', nor 'drm/sched: Rework HW fence processing'.
> There are still many Call Traces in the new osdb, triggered in 
> dma_fence_set_error. Do you have a link to these patches?
> Thanks.
>
> BR,
> Wentao
>
>
> -Original Message-
> From: Grodzovsky, Andrey 
> Sent: Saturday, December 22, 2018 12:57 AM
> To: Lou, Wentao ; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected 
> callback
>
> I believe this issue would be resolved by my pending in-review patch set, 
> specifically 'drm/sched: Refactor ring mirror list handling', since already 
> in the first timeout handler it goes over all the rings, including the 
> second timed-out ring, and removes all callbacks, including the bad job's. 
> If the bad job signals for some reason before then, it will already have 
> been removed from the mirror list during drm_sched_process_job (take a look 
> at 'drm/sched: Rework HW fence processing') and hence will not be rerun in 
> drm_sched_job_recovery (drm_sched_resubmit_jobs under the new name).
>
> Andrey
>
>
> On 12/21/2018 03:25 AM, wentalou wrote:
>> When 2 rings hit a timeout at the same time, job_timedout was triggered 
>> separately for each.
>> Each job_timedout called gpu_recover, but one of the gpu_recover calls 
>> blocked on the other's mutex_lock.
>> The bad job's callback should be removed by dma_fence_remove_callback, but 
>> that removal happens inside the mutex_lock, so dma_fence_remove_callback 
>> could not be called immediately.
>> The callback drm_sched_process_job was then triggered unexpectedly and set 
>> DMA_FENCE_FLAG_SIGNALED_BIT.
>> After the other recovery's mutex_unlock, the already-signaled bad job went 
>> through job_run inside drm_sched_job_recovery.
>> job_run then hit a WARN_ON with a Call Trace when calling 
>> kcl_dma_fence_set_error for the signaled bad job.
>>
>> Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
>> Signed-off-by: Wentao Lou 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 -
>>1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index 0a17fb1..fc1d3a0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct
>> drm_sched_job *sched_job)
>>
>>  trace_amdgpu_sched_run_job(job);
>>
>> -	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
>> +	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
>> +		/* flags might be signaled by unexpected callback, clear it */
>> +		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
>>  		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
>> +	}
>>
>>  if (finished->error < 0) {
>>  DRM_INFO("Skip scheduling IBs!\n");

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

2018-12-25 Thread Lou, Wentao
Hi Andrey,

In amd-staging-dkms-4.18's merged list, I can't find 'drm/sched: Refactor ring 
mirror list handling', nor 'drm/sched: Rework HW fence processing'.
There are still many Call Traces in the new osdb, triggered in 
dma_fence_set_error. Do you have a link to these patches?
Thanks.

BR,
Wentao


-Original Message-
From: Grodzovsky, Andrey  
Sent: Saturday, December 22, 2018 12:57 AM
To: Lou, Wentao ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected 
callback

I believe this issue would be resolved by my pending in-review patch set, 
specifically 'drm/sched: Refactor ring mirror list handling', since already in 
the first timeout handler it goes over all the rings, including the second 
timed-out ring, and removes all callbacks, including the bad job's. If the bad 
job signals for some reason before then, it will already have been removed from 
the mirror list during drm_sched_process_job (take a look at 'drm/sched: 
Rework HW fence processing') and hence will not be rerun in 
drm_sched_job_recovery (drm_sched_resubmit_jobs under the new name).

Andrey


On 12/21/2018 03:25 AM, wentalou wrote:
> When 2 rings hit a timeout at the same time, job_timedout was triggered 
> separately for each.
> Each job_timedout called gpu_recover, but one of the gpu_recover calls 
> blocked on the other's mutex_lock.
> The bad job's callback should be removed by dma_fence_remove_callback, but 
> that removal happens inside the mutex_lock, so dma_fence_remove_callback 
> could not be called immediately.
> The callback drm_sched_process_job was then triggered unexpectedly and set 
> DMA_FENCE_FLAG_SIGNALED_BIT.
> After the other recovery's mutex_unlock, the already-signaled bad job went 
> through job_run inside drm_sched_job_recovery.
> job_run then hit a WARN_ON with a Call Trace when calling 
> kcl_dma_fence_set_error for the signaled bad job.
>
> Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
> Signed-off-by: Wentao Lou 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 -
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 0a17fb1..fc1d3a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct 
> drm_sched_job *sched_job)
>   
>   trace_amdgpu_sched_run_job(job);
>   
> -	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
> +	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
> +		/* flags might be signaled by unexpected callback, clear it */
> +		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
>  		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
> +	}
>   
>   if (finished->error < 0) {
>   DRM_INFO("Skip scheduling IBs!\n");



Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

2018-12-21 Thread Grodzovsky, Andrey
I believe this issue would be resolved by my pending in-review patch set, 
specifically 'drm/sched: Refactor ring mirror list handling', since already 
in the first timeout handler it goes over all the rings, including the 
second timed-out ring, and removes all callbacks, including the bad job's. 
If the bad job signals for some reason before then, it will already have 
been removed from the mirror list during drm_sched_process_job (take a look 
at 'drm/sched: Rework HW fence processing') and hence will not be rerun in 
drm_sched_job_recovery (drm_sched_resubmit_jobs under the new name).

Andrey


On 12/21/2018 03:25 AM, wentalou wrote:
> When 2 rings hit a timeout at the same time, job_timedout was triggered 
> separately for each.
> Each job_timedout called gpu_recover, but one of the gpu_recover calls 
> blocked on the other's mutex_lock.
> The bad job's callback should be removed by dma_fence_remove_callback, but 
> that removal happens inside the mutex_lock, so dma_fence_remove_callback 
> could not be called immediately.
> The callback drm_sched_process_job was then triggered unexpectedly and set 
> DMA_FENCE_FLAG_SIGNALED_BIT.
> After the other recovery's mutex_unlock, the already-signaled bad job went 
> through job_run inside drm_sched_job_recovery.
> job_run then hit a WARN_ON with a Call Trace when calling 
> kcl_dma_fence_set_error for the signaled bad job.
>
> Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
> Signed-off-by: Wentao Lou 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 -
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 0a17fb1..fc1d3a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct 
> drm_sched_job *sched_job)
>   
>   trace_amdgpu_sched_run_job(job);
>   
> -	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
> +	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
> +		/* flags might be signaled by unexpected callback, clear it */
> +		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
>  		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
> +	}
>   
>   if (finished->error < 0) {
>   DRM_INFO("Skip scheduling IBs!\n");
