Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
Am 24.04.2018 um 17:30 schrieb Andrey Grodzovsky:
> If the ring is hanging for some reason allow to recover the waiting
> by sending fatal signal.
>
> Originally-by: David Panariti
> Signed-off-by: Andrey Grodzovsky
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>
>  	if (other) {
>  		signed long r;
> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r < 0) {
> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> -			return r;
> +
> +		while (true) {
> +			if ((r = dma_fence_wait_timeout(other, true,
> +					MAX_SCHEDULE_TIMEOUT)) >= 0)
> +				return 0;
> +
> +			if (fatal_signal_pending(current)) {
> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> +				return r;
> +			}

Please drop the whole extra handling. The caller is perfectly capable of dealing with interrupted waits.

So all we need to do here is change "dma_fence_wait_timeout(other, false, ..." into "dma_fence_wait_timeout(other, true, ..." and suppress the error message when the IOCTL was just interrupted by a signal.

Regards,
Christian.

> 		}
> 	}
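Christian's minimal version could look roughly like the following sketch (a kernel fragment, not standalone code; the error code suppressed here is -ERESTARTSYS, which dma_fence_wait_timeout() returns when an interruptible wait is cut short by a signal):

```c
	if (other) {
		signed long r;

		/* Interruptible wait: a pending signal makes the wait
		 * return -ERESTARTSYS instead of blocking forever. */
		r = dma_fence_wait_timeout(other, true, MAX_SCHEDULE_TIMEOUT);
		if (r < 0) {
			/* An interrupted IOCTL is expected; only log
			 * real errors. */
			if (r != -ERESTARTSYS)
				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
			return r;
		}
	}
```

The caller then sees -ERESTARTSYS and restarts or aborts the IOCTL through the usual signal-handling path.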
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
On 04/25/2018 04:55 PM, Eric W. Biederman wrote:
> Andrey Grodzovsky writes:
>> On 04/24/2018 12:30 PM, Eric W. Biederman wrote:
>>> "Panariti, David" writes:
>>>> Andrey Grodzovsky writes:
>>>>> Kind of dma_fence_wait_killable, except that we don't have such API
>>>>> (maybe worth adding ?)
>>>> Depends on how many places it would be called, or think it might be called.
>>>> Can always factor on the 2nd time it's needed.
>>>> Factoring, IMO, rarely hurts. The factored function can easily be
>>>> visited using `M-.' ;->
>>>>
>>>> Also, if the wait could be very long, would a log message, something
>>>> like "xxx has run for Y seconds." help?
>>>> I personally hate hanging w/no info.
>>> Ugh. This loop appears susceptible to losing wake-ups. There are
>>> races between when a wake-up happens, when we clear the sleeping
>>> state, and when we test the state to see if we should stay awake.
>>> So yes, implementing a dma_fence_wait_killable that handles all of
>>> that correctly sounds like a very good idea.
>> I am not clear here - could you be more specific about what races will
>> happen here? More below.
>>>
>>> Eric
>>>
>>>>> If the ring is hanging for some reason allow to recover the waiting
>>>>> by sending fatal signal.
>>>>>
>>>>> Originally-by: David Panariti
>>>>> Signed-off-by: Andrey Grodzovsky
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> index eb80edf..37a36af 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>>>>>
>>>>>  	if (other) {
>>>>>  		signed long r;
>>>>> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
>>>>> -		if (r < 0) {
>>>>> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>>>>> -			return r;
>>>>> +
>>>>> +		while (true) {
>>>>> +			if ((r = dma_fence_wait_timeout(other, true,
>>>>> +						MAX_SCHEDULE_TIMEOUT)) >= 0)
>>>>> +				return 0;
>>>>> +
>> Do you mean that by the time I reach here some other thread from my
>> group already might have dequeued SIGKILL, since it's a shared signal,
>> and hence fatal_signal_pending will return false ? Or are you talking
>> about the dma_fence_wait_timeout implementation in
>> dma_fence_default_wait with schedule_timeout ?
> Given Oleg's earlier comment about the scheduler having special cases
> for signals I might be wrong. But in general there is a pattern:
>
> 	for (;;) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		if (loop_is_done())
> 			break;
> 		schedule();
> 	}
> 	set_current_state(TASK_RUNNING);
>
> If you violate that pattern by testing for a condition without having
> first set your task as TASK_UNINTERRUPTIBLE (or whatever your sleep
> state is), then it is possible to miss a wake-up that tests the
> condition.
>
> Thus I am quite concerned that there is a subtle corner case where you
> can miss a wakeup and not retest fatal_signal_pending().

I see the general problem now. In this particular case dma_fence_default_wait and the caller of wake_up_state use a lock to protect wake-up delivery and the wakeup condition, and dma_fence_default_wait also retests the wakeup condition on entry.

But obviously it's a bad practice to rely on an API's internal implementation for assumptions in client code.

> Given that there is a timeout the worst case might have you sleep
> MAX_SCHEDULE_TIMEOUT instead of indefinitely.

It actually means never wake.

> Without a comment why this is safe, or having the fatal_signal_pending
> check integrated into dma_fence_wait_timeout, I am not comfortable with
> this loop.

Agree, fatal_signal_pending should be part of the wait function.

Andrey

> Eric
>
>>>>> +			if (fatal_signal_pending(current)) {
>>>>> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>>>>> +				return r;
>>>>> +			}
>>>>> 		}
>>>>> 	}
>>>>>
>>>>> --
>>>>> 2.7.4
> Eric
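A helper along the lines both sides agree on might look roughly like this. This is a hypothetical sketch only: no dma_fence_wait_killable API exists at the time of this thread, and a real implementation would live next to dma_fence_default_wait() and register a fence callback so the signaling side actually wakes this task, which is omitted here. The point is the shape of the loop: the task state is set before the fence and signal conditions are tested, following the pattern Eric describes, so a wake-up cannot slip in between the tests and schedule_timeout().

```c
/* Hypothetical sketch of a "killable" fence wait loop, in the style of
 * dma_fence_default_wait(). Assumes the usual fence-callback plumbing
 * (omitted) so that fence signaling wakes this task. */
static signed long
fence_wait_killable_sketch(struct dma_fence *fence, signed long timeout)
{
	signed long ret = timeout;

	while (ret > 0) {
		/* Mark ourselves sleeping *before* testing the
		 * conditions, so a concurrent wake-up makes
		 * schedule_timeout() return immediately instead of
		 * being lost. */
		set_current_state(TASK_KILLABLE);

		if (dma_fence_is_signaled(fence))
			break;

		if (fatal_signal_pending(current)) {
			ret = -ERESTARTSYS;
			break;
		}

		ret = schedule_timeout(ret);
	}
	__set_current_state(TASK_RUNNING);

	return ret;
}
```

TASK_KILLABLE (TASK_UNINTERRUPTIBLE | TASK_WAKEKILL) means only fatal signals wake the sleeper, which is exactly the semantics the patch tries to emulate from the outside.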
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
Andrey Grodzovsky writes:

> On 04/24/2018 12:30 PM, Eric W. Biederman wrote:
>> "Panariti, David" writes:
>>
>>> Andrey Grodzovsky writes:
>>>> Kind of dma_fence_wait_killable, except that we don't have such API
>>>> (maybe worth adding ?)
>>> Depends on how many places it would be called, or think it might be called.
>>> Can always factor on the 2nd time it's needed.
>>> Factoring, IMO, rarely hurts. The factored function can easily be visited
>>> using `M-.' ;->
>>>
>>> Also, if the wait could be very long, would a log message, something like
>>> "xxx has run for Y seconds." help?
>>> I personally hate hanging w/no info.
>> Ugh. This loop appears susceptible to losing wake-ups. There are
>> races between when a wake-up happens, when we clear the sleeping state,
>> and when we test the state to see if we should stay awake. So yes,
>> implementing a dma_fence_wait_killable that handles all of that
>> correctly sounds like a very good idea.
>
> I am not clear here - could you be more specific about what races will
> happen here? More below.
>>
>> Eric
>>
>>>> If the ring is hanging for some reason allow to recover the waiting
>>>> by sending fatal signal.
>>>>
>>>> Originally-by: David Panariti
>>>> Signed-off-by: Andrey Grodzovsky
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index eb80edf..37a36af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>>>>
>>>>  	if (other) {
>>>>  		signed long r;
>>>> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
>>>> -		if (r < 0) {
>>>> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>>>> -			return r;
>>>> +
>>>> +		while (true) {
>>>> +			if ((r = dma_fence_wait_timeout(other, true,
>>>> +						MAX_SCHEDULE_TIMEOUT)) >= 0)
>>>> +				return 0;
>>>> +
>
> Do you mean that by the time I reach here some other thread from my
> group already might have dequeued SIGKILL, since it's a shared signal,
> and hence fatal_signal_pending will return false ? Or are you talking
> about the dma_fence_wait_timeout implementation in
> dma_fence_default_wait with schedule_timeout ?

Given Oleg's earlier comment about the scheduler having special cases for signals I might be wrong. But in general there is a pattern:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (loop_is_done())
			break;
		schedule();
	}
	set_current_state(TASK_RUNNING);

If you violate that pattern by testing for a condition without having first set your task as TASK_UNINTERRUPTIBLE (or whatever your sleep state is), then it is possible to miss a wake-up that tests the condition.

Thus I am quite concerned that there is a subtle corner case where you can miss a wakeup and not retest fatal_signal_pending().

Given that there is a timeout the worst case might have you sleep MAX_SCHEDULE_TIMEOUT instead of indefinitely.

Without a comment why this is safe, or having the fatal_signal_pending check integrated into dma_fence_wait_timeout, I am not comfortable with this loop.

Eric

>>>> +			if (fatal_signal_pending(current)) {
>>>> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>>>> +				return r;
>>>> +			}
>>>> 		}
>>>> 	}
>>>>
>>>> --
>>>> 2.7.4
>> Eric
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
On 04/24/2018 12:30 PM, Eric W. Biederman wrote:
> "Panariti, David" writes:
>
>> Andrey Grodzovsky writes:
>>> Kind of dma_fence_wait_killable, except that we don't have such API
>>> (maybe worth adding ?)
>> Depends on how many places it would be called, or think it might be called.
>> Can always factor on the 2nd time it's needed.
>> Factoring, IMO, rarely hurts. The factored function can easily be visited
>> using `M-.' ;->
>>
>> Also, if the wait could be very long, would a log message, something like
>> "xxx has run for Y seconds." help?
>> I personally hate hanging w/no info.
> Ugh. This loop appears susceptible to losing wake-ups. There are
> races between when a wake-up happens, when we clear the sleeping state,
> and when we test the state to see if we should stay awake. So yes,
> implementing a dma_fence_wait_killable that handles all of that
> correctly sounds like a very good idea.

I am not clear here - could you be more specific about what races will happen here? More below.

> Eric
>
>>> If the ring is hanging for some reason allow to recover the waiting
>>> by sending fatal signal.
>>>
>>> Originally-by: David Panariti
>>> Signed-off-by: Andrey Grodzovsky
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index eb80edf..37a36af 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>>>
>>>  	if (other) {
>>>  		signed long r;
>>> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
>>> -		if (r < 0) {
>>> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>>> -			return r;
>>> +
>>> +		while (true) {
>>> +			if ((r = dma_fence_wait_timeout(other, true,
>>> +						MAX_SCHEDULE_TIMEOUT)) >= 0)
>>> +				return 0;
>>> +

Do you mean that by the time I reach here some other thread from my group already might have dequeued SIGKILL, since it's a shared signal, and hence fatal_signal_pending will return false ?

Or are you talking about the dma_fence_wait_timeout implementation in dma_fence_default_wait with schedule_timeout ?

Andrey

>>> +			if (fatal_signal_pending(current)) {
>>> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>>> +				return r;
>>> +			}
>>> 		}
>>> 	}
>>>
>>> --
>>> 2.7.4
> Eric
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
On 04/24/2018 12:14 PM, Eric W. Biederman wrote:
> Andrey Grodzovsky writes:
>
>> If the ring is hanging for some reason allow to recover the waiting
>> by sending fatal signal.
>>
>> Originally-by: David Panariti
>> Signed-off-by: Andrey Grodzovsky
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index eb80edf..37a36af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>>
>>  	if (other) {
>>  		signed long r;
>> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
>> -		if (r < 0) {
>> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>> -			return r;
>> +
>> +		while (true) {
>> +			if ((r = dma_fence_wait_timeout(other, true,
>> +						MAX_SCHEDULE_TIMEOUT)) >= 0)
>> +				return 0;
>> +
>> +			if (fatal_signal_pending(current)) {
>> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>> +				return r;
>> +			}
> It looks like if you make this code say:
>
> 	if (fatal_signal_pending(current) ||
> 	    (current->flags & PF_EXITING)) {
> 		DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> 		return r;
> 	}
>
> Then you would not need the horrible hack want_signal to deliver
> signals to processes who have passed exit_signal() and don't expect to
> need their signal handling mechanisms anymore.

Let me clarify, the change in want_signal wasn't addressing this code but a hang in drm_sched_entity_do_release->wait_event_killable, when you try to gracefully terminate by waiting for all job completions on the GPU pipe your process is using.

Andrey

> Eric
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
"Panariti, David" writes:

> Andrey Grodzovsky writes:
>> Kind of dma_fence_wait_killable, except that we don't have such API
>> (maybe worth adding ?)
> Depends on how many places it would be called, or think it might be called.
> Can always factor on the 2nd time it's needed.
> Factoring, IMO, rarely hurts. The factored function can easily be visited
> using `M-.' ;->
>
> Also, if the wait could be very long, would a log message, something like
> "xxx has run for Y seconds." help?
> I personally hate hanging w/no info.

Ugh. This loop appears susceptible to losing wake-ups. There are races between when a wake-up happens, when we clear the sleeping state, and when we test the state to see if we should stay awake. So yes, implementing a dma_fence_wait_killable that handles all of that correctly sounds like a very good idea.

Eric

>> If the ring is hanging for some reason allow to recover the waiting by
>> sending fatal signal.
>>
>> Originally-by: David Panariti
>> Signed-off-by: Andrey Grodzovsky
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index eb80edf..37a36af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx,
>> unsigned ring_id)
>>
>>  	if (other) {
>>  		signed long r;
>> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
>> -		if (r < 0) {
>> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>> -			return r;
>> +
>> +		while (true) {
>> +			if ((r = dma_fence_wait_timeout(other, true,
>> +						MAX_SCHEDULE_TIMEOUT)) >= 0)
>> +				return 0;
>> +
>> +			if (fatal_signal_pending(current)) {
>> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>> +				return r;
>> +			}
>> 		}
>> 	}
>>
>> --
>> 2.7.4
>> Eric
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
> Kind of dma_fence_wait_killable, except that we don't have such API
> (maybe worth adding ?)

Depends on how many places it would be called, or think it might be called. Can always factor on the 2nd time it's needed. Factoring, IMO, rarely hurts. The factored function can easily be visited using `M-.' ;->

Also, if the wait could be very long, would a log message, something like "xxx has run for Y seconds." help? I personally hate hanging w/no info.

regards,
davep

From: Grodzovsky, Andrey
Sent: Tuesday, April 24, 2018 11:58:19 AM
To: Panariti, David; linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
Cc: Deucher, Alexander; Koenig, Christian; o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com
Subject: Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

On 04/24/2018 11:52 AM, Panariti, David wrote:
> Hi,
>
> It looks like there can be an infinite loop if neither of the if()'s
> become true.
> Is that an impossible condition?

That's intended; we want to wait until either the fence signals or a fatal signal is received. We don't want to terminate the wait if the fence is not signaled, even when interrupted by a non-fatal signal. Kind of dma_fence_wait_killable, except that we don't have such API (maybe worth adding ?)

Andrey

> -----Original Message-----
> From: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> Sent: Tuesday, April 24, 2018 11:31 AM
> To: linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
> Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Koenig, Christian
> <christian.koe...@amd.com>; Panariti, David <david.panar...@amd.com>;
> o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com;
> Grodzovsky, Andrey <andrey.grodzov...@amd.com>
> Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover
> from ring hang.
>
> If the ring is hanging for some reason allow to recover the waiting by
> sending fatal signal.
>
> Originally-by: David Panariti <david.panar...@amd.com>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx
> *ctx, unsigned ring_id)
>
>  	if (other) {
>  		signed long r;
> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r < 0) {
> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> -			return r;
> +
> +		while (true) {
> +			if ((r = dma_fence_wait_timeout(other, true,
> +						MAX_SCHEDULE_TIMEOUT)) >= 0)
> +				return 0;
> +
> +			if (fatal_signal_pending(current)) {
> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> +				return r;
> +			}
> 		}
> 	}
>
> --
> 2.7.4
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
Andrey Grodzovsky writes:

> If the ring is hanging for some reason allow to recover the waiting
> by sending fatal signal.
>
> Originally-by: David Panariti
> Signed-off-by: Andrey Grodzovsky
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx,
> unsigned ring_id)
>
>  	if (other) {
>  		signed long r;
> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r < 0) {
> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> -			return r;
> +
> +		while (true) {
> +			if ((r = dma_fence_wait_timeout(other, true,
> +					MAX_SCHEDULE_TIMEOUT)) >= 0)
> +				return 0;
> +
> +			if (fatal_signal_pending(current)) {
> +				DRM_ERROR("Error (%ld) waiting for fence!\n",
> +					r);
> +				return r;
> +			}

It looks like if you make this code say:

	if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) {
		DRM_ERROR("Error (%ld) waiting for fence!\n", r);
		return r;

>  		}
>  	}

Then you would not need the horrible hack want_signal to deliver signals to processes that have passed exit_signal() and don't expect to need their signal handling mechanisms anymore.

Eric
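Eric's suggested condition, placed back into the loop from the patch, would look roughly like the following. This is a sketch under the assumptions of the thread (surrounding function unchanged); the idea is that PF_EXITING marks a task that is already exiting and can no longer have signals delivered to it, so fatal_signal_pending() alone would never fire for it.

```
/*
 * Sketch of the patch's wait loop with Eric's extra termination
 * condition: besides a pending fatal signal, also stop waiting when
 * the task is already exiting (PF_EXITING) and signal delivery is no
 * longer possible.
 */
while (true) {
	r = dma_fence_wait_timeout(other, true, MAX_SCHEDULE_TIMEOUT);
	if (r >= 0)
		return 0;

	if (fatal_signal_pending(current) ||
	    (current->flags & PF_EXITING)) {
		DRM_ERROR("Error (%ld) waiting for fence!\n", r);
		return r;
	}
}
```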
Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
On 04/24/2018 11:52 AM, Panariti, David wrote:
> Hi,
>
> It looks like there can be an infinite loop if neither of the if()'s
> becomes true.
> Is that an impossible condition?

That's intended: we want to wait until either the fence signals or a fatal signal is received; we don't want to terminate the wait if the fence is not signaled, even when interrupted by a non-fatal signal.

Kind of dma_fence_wait_killable, except that we don't have such an API (maybe worth adding?)

Andrey

> -----Original Message-----
> From: Andrey Grodzovsky
> Sent: Tuesday, April 24, 2018 11:31 AM
> To: linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
> Cc: Deucher, Alexander; Koenig, Christian; Panariti, David;
> o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com;
> Grodzovsky, Andrey
> Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from
> ring hang.
>
> If the ring is hanging for some reason allow to recover the waiting by
> sending fatal signal.
>
> Originally-by: David Panariti
> Signed-off-by: Andrey Grodzovsky
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx,
> unsigned ring_id)
>
>  	if (other) {
>  		signed long r;
> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r < 0) {
> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> -			return r;
> +
> +		while (true) {
> +			if ((r = dma_fence_wait_timeout(other, true,
> +					MAX_SCHEDULE_TIMEOUT)) >= 0)
> +				return 0;
> +
> +			if (fatal_signal_pending(current)) {
> +				DRM_ERROR("Error (%ld) waiting for fence!\n",
> +					r);
> +				return r;
> +			}
>  		}
>  	}
>
> --
> 2.7.4
RE: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.
Hi,

It looks like there can be an infinite loop if neither of the if()'s becomes true.
Is that an impossible condition?

-----Original Message-----
From: Andrey Grodzovsky
Sent: Tuesday, April 24, 2018 11:31 AM
To: linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
Cc: Deucher, Alexander; Koenig, Christian; Panariti, David; o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com; Grodzovsky, Andrey
Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

If the ring is hanging for some reason allow to recover the waiting by sending fatal signal.

Originally-by: David Panariti
Signed-off-by: Andrey Grodzovsky
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)

 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+
+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}

--
2.7.4