Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-30 Thread Christian König

On 24.04.2018 17:30, Andrey Grodzovsky wrote:

If the ring is hanging for some reason, allow the wait to be recovered
by sending a fatal signal.

Originally-by: David Panariti 
Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
 
 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+
+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}


Please drop the whole extra handling. The caller is perfectly capable of 
dealing with interrupted waits.


So all we need to do here is change "dma_fence_wait_timeout(other, 
false, ..." into "dma_fence_wait_timeout(other, true, ..." and suppress 
the error message when the IOCTL was just interrupted by a signal.
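For illustration, the simplified wait would look something like this (a
sketch only, not the applied patch; it assumes an interrupted wait is
reported as -ERESTARTSYS, which the IOCTL caller can simply restart):

	if (other) {
		signed long r;

		/* Interruptible wait: a pending signal ends the wait early. */
		r = dma_fence_wait_timeout(other, true, MAX_SCHEDULE_TIMEOUT);
		if (r < 0) {
			/* Assumption: -ERESTARTSYS means "interrupted by a
			 * signal"; stay quiet and let the caller restart. */
			if (r != -ERESTARTSYS)
				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
			return r;
		}
	}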


Regards,
Christian.


  




Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-26 Thread Andrey Grodzovsky



On 04/25/2018 04:55 PM, Eric W. Biederman wrote:

Andrey Grodzovsky  writes:


On 04/24/2018 12:30 PM, Eric W. Biederman wrote:

"Panariti, David"  writes:


Andrey Grodzovsky  writes:

Kind of dma_fence_wait_killable, except that we don't have such API
(maybe worth adding ?)

Depends on how many places it would be called, or you think it might be called.
Can always factor on the 2nd time it's needed.
Factoring, IMO, rarely hurts.  The factored function can easily be visited using 
`M-.' ;->

Also, if the wait could be very long, would a log message, something like "xxx has 
run for Y seconds."  help?
I personally hate hanging w/no info.

Ugh.  This loop appears susceptible to losing wake-ups.  There are
races between when a wake-up happens, when we clear the sleeping state,
and when we test the state to see if we should stay awake.  So yes,
implementing a dma_fence_wait_killable that handles all of that
correctly sounds like a very good idea.

I am not clear here - could you be more specific about what races will happen
here? More below.

Eric



If the ring is hanging for some reason, allow the wait to be recovered
by sending a fatal signal.

Originally-by: David Panariti 
Signed-off-by: Andrey Grodzovsky 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
 
 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+

Do you mean that by the time I reach here some other thread from my group
might have already dequeued SIGKILL, since it's a shared signal, and hence
fatal_signal_pending will return false? Or are you talking about the
dma_fence_wait_timeout implementation in dma_fence_default_wait with
schedule_timeout?

Given Oleg's earlier comment about the scheduler having special cases
for signals I might be wrong.  But in general there is a pattern:

for (;;) {
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (loop_is_done())
		break;
	schedule();
}
set_current_state(TASK_RUNNING);

If you violate that pattern by testing for a condition without first
having set your task state to TASK_UNINTERRUPTIBLE (or whatever your
sleep state is), then it is possible to miss a wake-up that arrives
between the test and the sleep.

Thus I am quite concerned that there is a subtle corner case where
you can miss a wakeup and not retest fatal_signal_pending().



I see the general problem now. In this particular case dma_fence_default_wait
and the caller of wake_up_state use a lock to protect wake-up delivery and the
wake-up condition, and dma_fence_default_wait also retests the wake-up
condition on entry. But obviously it's bad practice for client code to rely on
such assumptions about an API's internal implementation.



Given that there is a timeout, the worst case might have you sleep
MAX_SCHEDULE_TIMEOUT instead of indefinitely.


It actually means never waking - MAX_SCHEDULE_TIMEOUT is effectively an indefinite sleep.



Without a comment why this is safe, or having fatal_signal_pending
check integrated into dma_fence_wait_timeout I am not comfortable
with this loop.


Agree, fatal_signal_pending should be part of the wait function.

Andrey
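For reference, a hypothetical dma_fence_wait_killable() might look roughly
like this (a sketch only - no such helper exists at this point, and the names
killable_wait_cb/killable_wait_wake are made up; it mirrors the callback
scheme dma_fence_default_wait uses, and needs <linux/dma-fence.h> plus
<linux/sched/signal.h>). The task state is set before the conditions are
tested, so a wake-up cannot be lost:

	struct killable_wait_cb {
		struct dma_fence_cb base;
		struct task_struct *task;
	};

	/* Fence callback: wake the sleeping waiter. */
	static void killable_wait_wake(struct dma_fence *fence,
				       struct dma_fence_cb *cb)
	{
		struct killable_wait_cb *wait =
			container_of(cb, struct killable_wait_cb, base);

		wake_up_process(wait->task);
	}

	static long dma_fence_wait_killable(struct dma_fence *fence)
	{
		struct killable_wait_cb cb = { .task = current };
		long ret = 0;

		/* Nonzero return means the fence has already signaled. */
		if (dma_fence_add_callback(fence, &cb.base, killable_wait_wake))
			return 0;

		for (;;) {
			/* Set the sleep state *before* testing conditions. */
			set_current_state(TASK_KILLABLE);

			if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
				break;

			if (fatal_signal_pending(current)) {
				ret = -ERESTARTSYS;
				break;
			}

			schedule();
		}

		__set_current_state(TASK_RUNNING);
		dma_fence_remove_callback(fence, &cb.base);
		return ret;
	}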



Eric



+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}

--
2.7.4


Eric




Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-25 Thread Eric W. Biederman
Andrey Grodzovsky  writes:

> On 04/24/2018 12:30 PM, Eric W. Biederman wrote:
>> "Panariti, David"  writes:
>>
>>> Andrey Grodzovsky  writes:
 Kind of dma_fence_wait_killable, except that we don't have such API
 (maybe worth adding ?)
>>> Depends on how many places it would be called, or you think it might be called.
>>>  Can always factor on the 2nd time it's needed.
>>> Factoring, IMO, rarely hurts.  The factored function can easily be visited 
>>> using `M-.' ;->
>>>
>>> Also, if the wait could be very long, would a log message, something like 
>>> "xxx has run for Y seconds."  help?
>>> I personally hate hanging w/no info.
>> Ugh.  This loop appears susceptible to losing wake-ups.  There are
>> races between when a wake-up happens, when we clear the sleeping state,
>> and when we test the state to see if we should stay awake.  So yes,
>> implementing a dma_fence_wait_killable that handles all of that
>> correctly sounds like a very good idea.
>
> I am not clear here - could you be more specific about what races will happen
> here? More below.
>>
>> Eric
>>
>>
If the ring is hanging for some reason, allow the wait to be recovered by
sending a fatal signal.

 Originally-by: David Panariti 
 Signed-off-by: Andrey Grodzovsky 
 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
 
 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+
>
> Do you mean that by the time I reach here some other thread from my group
> might have already dequeued SIGKILL, since it's a shared signal, and hence
> fatal_signal_pending will return false? Or are you talking about the
> dma_fence_wait_timeout implementation in dma_fence_default_wait with
> schedule_timeout?

Given Oleg's earlier comment about the scheduler having special cases
for signals I might be wrong.  But in general there is a pattern:

for (;;) {
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (loop_is_done())
		break;
	schedule();
}
set_current_state(TASK_RUNNING);

If you violate that pattern by testing for a condition without first
having set your task state to TASK_UNINTERRUPTIBLE (or whatever your
sleep state is), then it is possible to miss a wake-up that arrives
between the test and the sleep.

Thus I am quite concerned that there is a subtle corner case where
you can miss a wakeup and not retest fatal_signal_pending().
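To make the race concrete, here is the losing ordering, using the
loop_is_done() placeholder from the pattern above:

	for (;;) {
		if (loop_is_done())	/* tested while still TASK_RUNNING */
			break;
		/*
		 * A wake-up arriving here finds the task TASK_RUNNING and is
		 * effectively a no-op; nothing will wake us again ...
		 */
		set_current_state(TASK_UNINTERRUPTIBLE);
		/*
		 * ... so schedule() now sleeps for the full timeout even
		 * though the condition is already true.
		 */
		schedule();
	}
	set_current_state(TASK_RUNNING);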

Given that there is a timeout, the worst case might have you sleep
MAX_SCHEDULE_TIMEOUT instead of indefinitely.

Without a comment why this is safe, or having fatal_signal_pending
check integrated into dma_fence_wait_timeout I am not comfortable
with this loop.

Eric


+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}

 --
 2.7.4

>> Eric


Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-25 Thread Andrey Grodzovsky



On 04/24/2018 12:30 PM, Eric W. Biederman wrote:

"Panariti, David"  writes:


Andrey Grodzovsky  writes:

Kind of dma_fence_wait_killable, except that we don't have such API
(maybe worth adding ?)

Depends on how many places it would be called, or you think it might be called.
Can always factor on the 2nd time it's needed.
Factoring, IMO, rarely hurts.  The factored function can easily be visited using 
`M-.' ;->

Also, if the wait could be very long, would a log message, something like "xxx has 
run for Y seconds."  help?
I personally hate hanging w/no info.

Ugh.  This loop appears susceptible to losing wake-ups.  There are
races between when a wake-up happens, when we clear the sleeping state,
and when we test the state to see if we should stay awake.  So yes,
implementing a dma_fence_wait_killable that handles all of that
correctly sounds like a very good idea.


I am not clear here - could you be more specific about what races will
happen here? More below.


Eric



If the ring is hanging for some reason, allow the wait to be recovered by
sending a fatal signal.

Originally-by: David Panariti 
Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
   1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
 
 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+


Do you mean that by the time I reach here some other thread from my group
might have already dequeued SIGKILL, since it's a shared signal, and hence
fatal_signal_pending will return false? Or are you talking about the
dma_fence_wait_timeout implementation in dma_fence_default_wait with
schedule_timeout?


Andrey


+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}

--
2.7.4


Eric




Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Andrey Grodzovsky



On 04/24/2018 12:14 PM, Eric W. Biederman wrote:

Andrey Grodzovsky  writes:


If the ring is hanging for some reason, allow the wait to be recovered
by sending a fatal signal.

Originally-by: David Panariti 
Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
 
 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+
+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}

It looks like if you make this code say:

	if (fatal_signal_pending(current) ||
	    (current->flags & PF_EXITING)) {
		DRM_ERROR("Error (%ld) waiting for fence!\n", r);
		return r;
	}

Then you would not need the horrible want_signal hack to deliver signals
to processes that have passed exit_signal() and don't expect to need
their signal handling mechanisms anymore.


Let me clarify: the change in want_signal wasn't addressing this code but a
hang in drm_sched_entity_do_release->wait_event_killable, when you try to
gracefully terminate by waiting for all job completions on the GPU pipe your
process is using.

Andrey
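For context, the wait in question looks roughly like this (paraphrased from
the scheduler code of that era; take the exact names as approximate):

	/*
	 * drm_sched_entity_do_release(): only a fatal signal can interrupt
	 * this wait, so a task whose signal delivery no longer works can
	 * block here forever if the ring is hung.
	 */
	wait_event_killable(sched->job_scheduled,
			    drm_sched_entity_is_idle(entity));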



Eric





Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Eric W. Biederman
"Panariti, David"  writes:

> Andrey Grodzovsky  writes:
>> Kind of dma_fence_wait_killable, except that we don't have such API
>> (maybe worth adding ?)
> Depends on how many places it would be called, or think it might be called.  
> Can always factor on the 2nd time it's needed.
> Factoring, IMO, rarely hurts.  The factored function can easily be visited 
> using `M-.' ;->
>
> Also, if the wait could be very long, would a log message, something like 
> "xxx has run for Y seconds."  help?
> I personally hate hanging w/no info.

Ugh.  This loop appears susceptible to losing wake-ups.  There are
races between when a wake-up happens, when we clear the sleeping state,
and when we test the state to see if we should stay awake.  So yes,
implementing a dma_fence_wait_killable that handles all of that
correctly sounds like a very good idea.

Eric


>> If the ring is hanging for some reason, allow the wait to be recovered by
>> sending a fatal signal.
>>
>> Originally-by: David Panariti 
>> Signed-off-by: Andrey Grodzovsky 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index eb80edf..37a36af 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>>
>>  	if (other) {
>>  		signed long r;
>> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
>> -		if (r < 0) {
>> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>> -			return r;
>> +
>> +		while (true) {
>> +			if ((r = dma_fence_wait_timeout(other, true,
>> +					MAX_SCHEDULE_TIMEOUT)) >= 0)
>> +				return 0;
>> +
>> +			if (fatal_signal_pending(current)) {
>> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
>> +				return r;
>> +			}
>>  		}
>>  	}
>>
>> --
>> 2.7.4
>>
Eric


Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Panariti, David
> Kind of dma_fence_wait_killable, except that we don't have such API
> (maybe worth adding ?)
Depends on how many places it would be called, or you think it might be called.
Can always factor on the 2nd time it's needed.
Factoring, IMO, rarely hurts.  The factored function can easily be visited 
using `M-.' ;->

Also, if the wait could be very long, would a log message, something like "xxx 
has run for Y seconds."  help?
I personally hate hanging w/no info.

regards,
davep


From: Grodzovsky, Andrey
Sent: Tuesday, April 24, 2018 11:58:19 AM
To: Panariti, David; linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
Cc: Deucher, Alexander; Koenig, Christian; o...@redhat.com; 
a...@linux-foundation.org; ebied...@xmission.com
Subject: Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from 
ring hang.



On 04/24/2018 11:52 AM, Panariti, David wrote:
> Hi,
>
> It looks like there can be an infinite loop if neither of the if()s becomes
> true.
> Is that an impossible condition?

That's intended: we want to wait until either the fence signals or a fatal
signal is received; we don't want to terminate the wait if the fence is not
signaled, even when interrupted by a non-fatal signal.
Kind of dma_fence_wait_killable, except that we don't have such API
(maybe worth adding ?)

Andrey

>
> -----Original Message-----
> From: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> Sent: Tuesday, April 24, 2018 11:31 AM
> To: linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
> Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Koenig, Christian 
> <christian.koe...@amd.com>; Panariti, David <david.panar...@amd.com>; 
> o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com; 
> Grodzovsky, Andrey <andrey.grodzov...@amd.com>
> Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from 
> ring hang.
>
> If the ring is hanging for some reason, allow the wait to be recovered by
> sending a fatal signal.
>
> Originally-by: David Panariti <david.panar...@amd.com>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
>   1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>
>  	if (other) {
>  		signed long r;
> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r < 0) {
> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> -			return r;
> +
> +		while (true) {
> +			if ((r = dma_fence_wait_timeout(other, true,
> +					MAX_SCHEDULE_TIMEOUT)) >= 0)
> +				return 0;
> +
> +			if (fatal_signal_pending(current)) {
> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> +				return r;
> +			}
>  		}
>  	}
>
> --
> 2.7.4
>



Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Eric W. Biederman
Andrey Grodzovsky  writes:

> If the ring is hanging for some reason, allow the wait to be recovered
> by sending a fatal signal.
>
> Originally-by: David Panariti 
> Signed-off-by: Andrey Grodzovsky 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index eb80edf..37a36af 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
>
>  	if (other) {
>  		signed long r;
> -		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r < 0) {
> -			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> -			return r;
> +
> +		while (true) {
> +			if ((r = dma_fence_wait_timeout(other, true,
> +					MAX_SCHEDULE_TIMEOUT)) >= 0)
> +				return 0;
> +
> +			if (fatal_signal_pending(current)) {
> +				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
> +				return r;
> +			}

It looks like if you make this code say:

	if (fatal_signal_pending(current) ||
	    (current->flags & PF_EXITING)) {
		DRM_ERROR("Error (%ld) waiting for fence!\n", r);
		return r;
	}
>  		}
>  	}

Then you would not need the horrible want_signal hack to deliver signals
to processes that have passed exit_signal() and don't expect to need
their signal handling mechanisms anymore.

Eric



Re: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Andrey Grodzovsky



On 04/24/2018 11:52 AM, Panariti, David wrote:

Hi,

It looks like there can be an infinite loop if neither of the if()s becomes
true.
Is that an impossible condition?


That's intended: we want to wait until either the fence signals or a fatal
signal is received; we don't want to terminate the wait if the fence is not
signaled, even when interrupted by a non-fatal signal.
Kind of dma_fence_wait_killable, except that we don't have such API 
(maybe worth adding ?)


Andrey



-Original Message-
From: Andrey Grodzovsky 
Sent: Tuesday, April 24, 2018 11:31 AM
To: linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Panariti, David ; o...@redhat.com; 
a...@linux-foundation.org; ebied...@xmission.com; Grodzovsky, Andrey 
Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from 
ring hang.

If the ring is hanging for some reason, allow the wait to be recovered by
sending a fatal signal.

Originally-by: David Panariti 
Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
 
 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+
+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}
  
--

2.7.4





RE: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from ring hang.

2018-04-24 Thread Panariti, David
Hi,

It looks like there can be an infinite loop if neither of the if()s becomes
true.
Is that an impossible condition?

-----Original Message-----
From: Andrey Grodzovsky  
Sent: Tuesday, April 24, 2018 11:31 AM
To: linux-kernel@vger.kernel.org; amd-...@lists.freedesktop.org
Cc: Deucher, Alexander ; Koenig, Christian 
; Panariti, David ; 
o...@redhat.com; a...@linux-foundation.org; ebied...@xmission.com; Grodzovsky, 
Andrey 
Subject: [PATCH 3/3] drm/amdgpu: Switch to interrupted wait to recover from 
ring hang.

If the ring is hanging for some reason, allow the wait to be recovered by
sending a fatal signal.

Originally-by: David Panariti 
Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index eb80edf..37a36af 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -421,10 +421,16 @@ int amdgpu_ctx_wait_prev_fence(struct amdgpu_ctx *ctx, unsigned ring_id)
 
 	if (other) {
 		signed long r;
-		r = dma_fence_wait_timeout(other, false, MAX_SCHEDULE_TIMEOUT);
-		if (r < 0) {
-			DRM_ERROR("Error (%ld) waiting for fence!\n", r);
-			return r;
+
+		while (true) {
+			if ((r = dma_fence_wait_timeout(other, true,
+					MAX_SCHEDULE_TIMEOUT)) >= 0)
+				return 0;
+
+			if (fatal_signal_pending(current)) {
+				DRM_ERROR("Error (%ld) waiting for fence!\n", r);
+				return r;
+			}
 		}
 	}
 
--
2.7.4


