Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-05 Thread Maarten Lankhorst
op 04-08-14 19:04, Christian König schreef:
> Am 04.08.2014 um 17:09 schrieb Maarten Lankhorst:
>> op 04-08-14 17:04, Christian König schreef:
>>> Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
 op 04-08-14 16:45, Christian König schreef:
> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
>> op 04-08-14 16:37, Christian König schreef:
 It's a pain to deal with gpu reset.
>>> Yeah, well that's nothing new.
>>>
 I've now tried other solutions but that would mean reverting to the 
 old style during gpu lockup recovery, and only running the delayed 
 work when !lockup.
 But this meant that the timeout was useless to add. I think the 
 cleanest is keeping the v2 patch, because potentially any waiting code 
 can be called during lockup recovery.
>>> The lockup code itself should never call any waiting code and V2 
>>> doesn't seem to handle a couple of cases correctly either.
>>>
>>> How about moving the fence waiting out of the reset code?
>> What cases did I miss then?
>>
>> I'm curious how you want to move the fence waiting out of reset, when 
>> there are so many places that could potentially wait, like radeon_ib_get 
>> can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that 
>> can wait on radeon_fence_wait_next, etc.
> The IB test itself doesn't need to be protected by the exclusive lock. 
> Only everything between radeon_save_bios_scratch_regs and 
> radeon_ring_restore.
 I'm not sure about that, what do you want to do if the ring tests fail? Do 
 you have to retake the exclusive lock?
>>> Just set need_reset again and return -EAGAIN, that should have mostly the 
>>> same effect as what we are doing right now.
>> Yeah, except for locking the ttm delayed workqueue, but that bool should 
>> be easy to save/restore.
>> I think this could work.
>
> Actually you could activate the delayed workqueue much earlier as well.
>
> Thinking more about it that sounds like a bug in the current code, because we 
> probably want the workqueue activated before waiting for the fence.

Ok, how about this?

Because of the downgrade_write, a second gpu reset can't be started until the 
first finishes.

I'm uncertain about it; I think I either have to stop calling 
radeon_restore_bios_scratch_regs a second time, or call 
save_bios_scratch_regs the second time around.

Also it feels like drm_helper_resume_force_mode should be called before 
downgrading exclusive_lock, but it might depend on PM restore.
Tough! Maybe move both calls to before downgrade_write?
 
commit 3644aae8581a15e3a935279287c397f7eab400ff
Author: Maarten Lankhorst 
Date:   Tue Aug 5 10:29:23 2014 +0200

drm/radeon: take exclusive_lock in read mode during ring tests

Signed-off-by: Maarten Lankhorst 

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 29d9cc04c04e..de14e35da002 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -2286,7 +2286,7 @@ struct radeon_device {
 	bool				need_dma32;
 	bool				accel_working;
 	bool				fastfb_working; /* IGP feature*/
-	bool				needs_reset;
+	bool				needs_reset, in_reset;
 	struct radeon_surface_reg	surface_regs[RADEON_GEM_MAX_SURFACES];
 	const struct firmware *me_fw;	/* all family ME firmware */
 	const struct firmware *pfp_fw;	/* r6/700 PFP firmware */
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 03686fab842d..6317b8a2ef20 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1620,37 +1620,44 @@ int radeon_gpu_reset(struct radeon_device *rdev)
 	unsigned ring_sizes[RADEON_NUM_RINGS];
 	uint32_t *ring_data[RADEON_NUM_RINGS];
 
-	bool saved = false;
+	bool saved;
 
 	int i, r;
 	int resched;
 
+retry:
+	saved = false;
 	down_write(&rdev->exclusive_lock);
 
 	if (!rdev->needs_reset) {
+		WARN_ON(rdev->in_reset);
 		up_write(&rdev->exclusive_lock);
 		return 0;
 	}
 
 	rdev->needs_reset = false;
-
-	radeon_save_bios_scratch_regs(rdev);
-	/* block TTM */
 	resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
-	radeon_pm_suspend(rdev);
-	radeon_suspend(rdev);
 
-	for (i = 0; i < RADEON_NUM_RINGS; ++i) {
-		ring_sizes[i] = radeon_ring_backup(rdev, &rdev->ring[i],
-						   &ring_data[i]);
-		if (ring_sizes[i]) {
-			saved = true;
-			dev_info(rdev->dev, "Saved %d dwords of commands "
-				 "on ring %d.\n", ring_sizes[i], i);
+	if (!rdev->in_reset) {
+		rdev->in_reset = true;
+
+		radeon_save_bios_scratch_regs(rdev);
+		/* block TTM */
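
(For reference, the control flow this patch is heading towards could be sketched
roughly as below. This is an illustration only, not the posted code:
radeon_hw_reset_locked() and radeon_do_ring_tests() are placeholder names for
"everything between radeon_save_bios_scratch_regs and radeon_ring_restore" and
for the ring/IB tests respectively.)

#include <linux/rwsem.h>
#include "radeon.h"	/* struct radeon_device with exclusive_lock, needs_reset, in_reset */

static int radeon_hw_reset_locked(struct radeon_device *rdev)
{
	/* save scratch regs, block TTM, suspend, asic reset, ring restore ... */
	return 0;
}

static int radeon_do_ring_tests(struct radeon_device *rdev)
{
	/* ring tests + IB tests; return -EAGAIN if the hardware is still dead */
	return 0;
}

static int radeon_gpu_reset_sketch(struct radeon_device *rdev)
{
	int r;

retry:
	down_write(&rdev->exclusive_lock);
	if (!rdev->needs_reset) {
		up_write(&rdev->exclusive_lock);
		return 0;
	}
	rdev->needs_reset = false;
	rdev->in_reset = true;

	r = radeon_hw_reset_locked(rdev);

	/* let readers (fence waits, ring/IB tests) back in, but a second
	 * radeon_gpu_reset() cannot start until the read side is dropped */
	downgrade_write(&rdev->exclusive_lock);

	if (!r)
		r = radeon_do_ring_tests(rdev);

	if (r == -EAGAIN) {
		/* tests failed: ask for another pass and go around again,
		 * keeping in_reset set so the saved state is not clobbered */
		rdev->needs_reset = true;
		up_read(&rdev->exclusive_lock);
		goto retry;
	}

	rdev->in_reset = false;
	up_read(&rdev->exclusive_lock);
	return r;
}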

Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-05 Thread Daniel Vetter
On Mon, Aug 04, 2014 at 07:04:46PM +0200, Christian König wrote:
> Am 04.08.2014 um 17:09 schrieb Maarten Lankhorst:
> >op 04-08-14 17:04, Christian König schreef:
> >>Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
> >>>op 04-08-14 16:45, Christian König schreef:
> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
> >op 04-08-14 16:37, Christian König schreef:
> >>>It's a pain to deal with gpu reset.
> >>Yeah, well that's nothing new.
> >>
> >>>I've now tried other solutions but that would mean reverting to the 
> >>>old style during gpu lockup recovery, and only running the delayed 
> >>>work when !lockup.
> >>>But this meant that the timeout was useless to add. I think the 
> >>>cleanest is keeping the v2 patch, because potentially any waiting code 
> >>>can be called during lockup recovery.
> >>The lockup code itself should never call any waiting code and V2 
> >>doesn't seem to handle a couple of cases correctly either.
> >>
> >>How about moving the fence waiting out of the reset code?
> >What cases did I miss then?
> >
> >I'm curious how you want to move the fence waiting out of reset, when 
> >there are so many places that could potentially wait, like radeon_ib_get 
> >can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that 
> >can wait on radeon_fence_wait_next, etc.
> The IB test itself doesn't need to be protected by the exclusive lock. 
> Only everything between radeon_save_bios_scratch_regs and 
> radeon_ring_restore.
> >>>I'm not sure about that, what do you want to do if the ring tests fail? Do 
> >>>you have to retake the exclusive lock?
> >>Just set need_reset again and return -EAGAIN, that should have mostly the 
> >>same effect as what we are doing right now.
> >Yeah, except for locking the ttm delayed workqueue, but that bool should 
> >be easy to save/restore.
> >I think this could work.
> 
> Actually you could activate the delayed workqueue much earlier as well.
> 
> Thinking more about it that sounds like a bug in the current code, because
> we probably want the workqueue activated before waiting for the fence.

We've actually had a similar issue on i915 where, when userspace never
waited for rendering (some shitty userspace drivers did that way back), we
never noticed that the gpu died. So launching the hangcheck/stuck-wait
worker (we have both too) right away is what we do now.
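
(For illustration only, not i915 or radeon code: the "arm it right away"
approach boils down to something like the sketch below, with made-up names.)

#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* Sketch: re-arm the lockup-check worker on every fence emission, so a
 * dead GPU is noticed even if userspace never waits on the fence. */
static struct delayed_work hangcheck_work;

static void hangcheck_fn(struct work_struct *work)
{
	/* compare the last emitted seqno against the last signalled one;
	 * if nothing moved, kick off a reset, otherwise re-arm ourselves */
	schedule_delayed_work(&hangcheck_work, msecs_to_jiffies(500));
}

static void on_fence_emit(void)
{
	/* push the deadline out instead of waiting for the first waiter */
	mod_delayed_work(system_wq, &hangcheck_work, msecs_to_jiffies(500));
}

static void hangcheck_init(void)
{
	INIT_DELAYED_WORK(&hangcheck_work, hangcheck_fn);
}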
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 17:09 schrieb Maarten Lankhorst:

op 04-08-14 17:04, Christian König schreef:

Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:

op 04-08-14 16:45, Christian König schreef:

Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:

op 04-08-14 16:37, Christian König schreef:

It's a pain to deal with gpu reset.

Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't seem 
to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.

The IB test itself doesn't need to be protected by the exclusive lock. Only 
everything between radeon_save_bios_scratch_regs and radeon_ring_restore.

I'm not sure about that, what do you want to do if the ring tests fail? Do you 
have to retake the exclusive lock?

Just set need_reset again and return -EAGAIN, that should have mostly the same 
effect as what we are doing right now.

Yeah, except for locking the ttm delayed workqueue, but that bool should be 
easy to save/restore.
I think this could work.


Actually you could activate the delayed workqueue much earlier as well.

Thinking more about it that sounds like a bug in the current code, 
because we probably want the workqueue activated before waiting for the 
fence.


Christian.



~Maarten





Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 17:04, Christian König schreef:
> Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
>> op 04-08-14 16:45, Christian König schreef:
>>> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
 op 04-08-14 16:37, Christian König schreef:
>> It's a pain to deal with gpu reset.
> Yeah, well that's nothing new.
>
>> I've now tried other solutions but that would mean reverting to the old 
>> style during gpu lockup recovery, and only running the delayed work when 
>> !lockup.
>> But this meant that the timeout was useless to add. I think the cleanest 
>> is keeping the v2 patch, because potentially any waiting code can be 
>> called during lockup recovery.
> The lockup code itself should never call any waiting code and V2 doesn't 
> seem to handle a couple of cases correctly either.
>
> How about moving the fence waiting out of the reset code?
 What cases did I miss then?

 I'm curious how you want to move the fence waiting out of reset, when 
 there are so many places that could potentially wait, like radeon_ib_get 
 can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that 
 can wait on radeon_fence_wait_next, etc.
>>> The IB test itself doesn't need to be protected by the exclusive lock. 
>>> Only everything between radeon_save_bios_scratch_regs and 
>>> radeon_ring_restore.
>> I'm not sure about that, what do you want to do if the ring tests fail? Do 
>> you have to retake the exclusive lock?
>
> Just set need_reset again and return -EAGAIN, that should have mostly the 
> same effect as what we are doing right now.
Yeah, except for locking the ttm delayed workqueue, but that bool should be 
easy to save/restore.
I think this could work.

~Maarten



Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:

op 04-08-14 16:45, Christian König schreef:

Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:

op 04-08-14 16:37, Christian König schreef:

It's a pain to deal with gpu reset.

Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't seem 
to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.

The IB test itself doesn't need to be protected by the exclusive lock. Only 
everything between radeon_save_bios_scratch_regs and radeon_ring_restore.

I'm not sure about that, what do you want to do if the ring tests fail? Do you 
have to retake the exclusive lock?


Just set need_reset again and return -EAGAIN, that should have mostly 
the same effect as what we are doing right now.


Christian.



~Maarten





Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 16:45, Christian König schreef:
> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
>> op 04-08-14 16:37, Christian König schreef:
 It's a pain to deal with gpu reset.
>>> Yeah, well that's nothing new.
>>>
 I've now tried other solutions but that would mean reverting to the old 
 style during gpu lockup recovery, and only running the delayed work when 
 !lockup.
 But this meant that the timeout was useless to add. I think the cleanest 
 is keeping the v2 patch, because potentially any waiting code can be 
 called during lockup recovery.
>>> The lockup code itself should never call any waiting code and V2 doesn't 
>>> seem to handle a couple of cases correctly either.
>>>
>>> How about moving the fence waiting out of the reset code?
>> What cases did I miss then?
>>
>> I'm curious how you want to move the fence waiting out of reset, when there 
>> are so many places that could potentially wait, like radeon_ib_get can call 
>> radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
>> radeon_fence_wait_next, etc.
>
> The IB test itself doesn't need to be protected by the exclusive lock. Only 
> everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
I'm not sure about that, what do you want to do if the ring tests fail? Do you 
have to retake the exclusive lock?

~Maarten



Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:

op 04-08-14 16:37, Christian König schreef:

It's a pain to deal with gpu reset.

Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't seem 
to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.


The IB test itself doesn't need to be protected by the exclusive lock. 
Only everything between radeon_save_bios_scratch_regs and 
radeon_ring_restore.


Christian.


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 16:37, Christian König schreef:
>> It's a pain to deal with gpu reset.
>
> Yeah, well that's nothing new.
>
>> I've now tried other solutions but that would mean reverting to the old 
>> style during gpu lockup recovery, and only running the delayed work when 
>> !lockup.
>> But this meant that the timeout was useless to add. I think the cleanest is 
>> keeping the v2 patch, because potentially any waiting code can be called 
>> during lockup recovery.
>
> The lockup code itself should never call any waiting code and V2 doesn't seem 
> to handle a couple of cases correctly either.
>
> How about moving the fence waiting out of the reset code?
What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

It's a pain to deal with gpu reset.


Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.


The lockup code itself should never call any waiting code and V2 doesn't 
seem to handle a couple of cases correctly either.


How about moving the fence waiting out of the reset code?

Christian.

Am 04.08.2014 um 15:34 schrieb Maarten Lankhorst:

Hey,

op 04-08-14 13:57, Christian König schreef:

Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:

op 04-08-14 10:36, Christian König schreef:

Hi Maarten,

Sorry for the delay. I've got way too much to do recently.

Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:

On 01-08-14 18:35, Christian König wrote:

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:

Signed-off-by: Maarten Lankhorst 
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.

That looks like the delayed work starts running as soon as we submit a fence, 
and not when it's needed for waiting.

Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stopped.

The delayed work is not just for failing IRQs, it's also the handler that's 
used to detect lockups, which is why I trigger after processing fences, and 
reset the timer after processing.

The idea was turning the delayed work on and off when we turn the irq on and 
off as well, processing of the delayed work handler can still happen in 
radeon_fence.c


Specifically what happened was this scenario:

- lock up occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish -> deadlock

Why do you want to disable the work item from the lockup handler in the first 
place?

Just take the exclusive lock in the work item, when it concurrently runs with 
the lockup handler it will just block for the lockup handler to complete.

With the delayed work radeon_fence_wait no longer handles unreliable interrupts 
itself, so it has to run from the lockup handler.
But an alternative solution could be adding a radeon_fence_wait_timeout, ignore 
the timeout and check if fence is signaled on timeout.
This would probably be a cleaner solution.

Yeah, agree. Manually specifying a timeout in the fence wait on lockup handling 
sounds like the best alternative to me.

It's a pain to deal with gpu reset.

I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

~Maarten





Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
Hey,

op 04-08-14 13:57, Christian König schreef:
> Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:
>> op 04-08-14 10:36, Christian König schreef:
>>> Hi Maarten,
>>>
> Sorry for the delay. I've got way too much to do recently.
>>>
>>> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
 On 01-08-14 18:35, Christian König wrote:
> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>> Signed-off-by: Maarten Lankhorst 
>> ---
>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>> and kill it during lockup recovery instead.
> That looks like the delayed work starts running as soon as we submit a 
> fence, and not when it's needed for waiting.
>
> Since it's a backup for failing IRQs I would rather put it into 
> radeon_irq_kms.c and start/stop it when the IRQs are started/stopped.
 The delayed work is not just for failing IRQs, it's also the handler 
 that's used to detect lockups, which is why I trigger after processing 
 fences, and reset the timer after processing.
>>> The idea was turning the delayed work on and off when we turn the irq on 
>>> and off as well, processing of the delayed work handler can still happen in 
>>> radeon_fence.c
>>>
 Specifically what happened was this scenario:

 - lock up occurs
 - write lock taken by gpu_reset
 - delayed work runs, tries to acquire read lock, blocks
 - gpu_reset tries to cancel delayed work synchronously
 - has to wait for delayed work to finish -> deadlock
>>> Why do you want to disable the work item from the lockup handler in the 
>>> first place?
>>>
>>> Just take the exclusive lock in the work item, when it concurrently runs 
>>> with the lockup handler it will just block for the lockup handler to 
>>> complete.
>> With the delayed work radeon_fence_wait no longer handles unreliable 
>> interrupts itself, so it has to run from the lockup handler.
>> But an alternative solution could be adding a radeon_fence_wait_timeout, 
>> ignore the timeout and check if fence is signaled on timeout.
>> This would probably be a cleaner solution.
>
> Yeah, agree. Manually specifying a timeout in the fence wait on lockup 
> handling sounds like the best alternative to me.
It's a pain to deal with gpu reset.

I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

~Maarten



Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:

op 04-08-14 10:36, Christian König schreef:

Hi Maarten,

Sorry for the delay. I've got way too much to do recently.

Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:

On 01-08-14 18:35, Christian König wrote:

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:

Signed-off-by: Maarten Lankhorst 
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.

That looks like the delayed work starts running as soon as we submit a fence, 
and not when it's needed for waiting.

Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stopped.

The delayed work is not just for failing IRQs, it's also the handler that's 
used to detect lockups, which is why I trigger after processing fences, and 
reset the timer after processing.

The idea was turning the delayed work on and off when we turn the irq on and 
off as well, processing of the delayed work handler can still happen in 
radeon_fence.c


Specifically what happened was this scenario:

- lock up occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish -> deadlock

Why do you want to disable the work item from the lockup handler in the first 
place?

Just take the exclusive lock in the work item, when it concurrently runs with 
the lockup handler it will just block for the lockup handler to complete.

With the delayed work radeon_fence_wait no longer handles unreliable interrupts 
itself, so it has to run from the lockup handler.
But an alternative solution could be adding a radeon_fence_wait_timeout, ignore 
the timeout and check if fence is signaled on timeout.
This would probably be a cleaner solution.


Yeah, agree. Manually specifying a timeout in the fence wait on lockup 
handling sounds like the best alternative to me.


Christian.



~Maarten





Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 10:36, Christian König schreef:
> Hi Maarten,
>
> Sorry for the delay. I've got way too much to do recently.
>
> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
>>
>> On 01-08-14 18:35, Christian König wrote:
>>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
 Signed-off-by: Maarten Lankhorst 
 ---
 V1 had a nasty bug breaking gpu lockup recovery. The fix is not
 allowing radeon_fence_driver_check_lockup to take exclusive_lock,
 and kill it during lockup recovery instead.
>>> That looks like the delayed work starts running as soon as we submit a 
>>> fence, and not when it's needed for waiting.
>>>
>>> Since it's a backup for failing IRQs I would rather put it into 
>>> radeon_irq_kms.c and start/stop it when the IRQs are started/stopped.
>> The delayed work is not just for failing IRQs, it's also the handler that's 
>> used to detect lockups, which is why I trigger after processing fences, and 
>> reset the timer after processing.
>
> The idea was turning the delayed work on and off when we turn the irq on and 
> off as well, processing of the delayed work handler can still happen in 
> radeon_fence.c
>
>>
>> Specifically what happened was this scenario:
>>
>> - lock up occurs
>> - write lock taken by gpu_reset
>> - delayed work runs, tries to acquire read lock, blocks
>> - gpu_reset tries to cancel delayed work synchronously
>> - has to wait for delayed work to finish -> deadlock
>
> Why do you want to disable the work item from the lockup handler in the first 
> place?
>
> Just take the exclusive lock in the work item, when it concurrently runs with 
> the lockup handler it will just block for the lockup handler to complete.
With the delayed work radeon_fence_wait no longer handles unreliable interrupts 
itself, so it has to run from the lockup handler.
But an alternative solution could be adding a radeon_fence_wait_timeout, ignore 
the timeout and check if fence is signaled on timeout.
This would probably be a cleaner solution.
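
(A minimal sketch of that wait-with-timeout idea, with made-up names and a
generic wait queue rather than the driver's actual fence code:)

#include <linux/jiffies.h>
#include <linux/sched.h>
#include <linux/wait.h>

/* Sketch: bound every fence wait.  On timeout, don't give up; re-check the
 * fence and the lockup state by hand, so a lost interrupt cannot wedge the
 * waiter and the lockup handler never has to run the delayed work for us. */
static int fence_wait_polled(wait_queue_head_t *wq,
			     bool (*signaled)(void *data), void *data)
{
	while (!signaled(data)) {
		long r = wait_event_interruptible_timeout(*wq, signaled(data),
							  HZ / 2);
		if (r < 0)
			return r;	/* interrupted by a signal */
		/* r == 0: timed out; poll the hardware / lockup detection
		 * here and simply loop while the fence is still pending */
	}
	return 0;
}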

~Maarten



Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Hi Maarten,

Sorry for the delay. I've got way too much to do recently.

Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:


On 01-08-14 18:35, Christian König wrote:

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:

Signed-off-by: Maarten Lankhorst 
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.

That looks like the delayed work starts running as soon as we submit a fence, 
and not when it's needed for waiting.

Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stopped.

The delayed work is not just for failing IRQs, it's also the handler that's 
used to detect lockups, which is why I trigger after processing fences, and 
reset the timer after processing.


The idea was turning the delayed work on and off when we turn the irq on 
and off as well, processing of the delayed work handler can still happen 
in radeon_fence.c
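
(As a sketch of that suggestion, with hypothetical helper names; the check
handler itself would stay in radeon_fence.c:)

#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* Sketch: the lockup-check delayed work only runs while interrupts are
 * enabled; it is armed/disarmed from the irq code rather than from the
 * fence wait path.  fence_check_work is a made-up name. */
static struct delayed_work fence_check_work;	/* INIT_DELAYED_WORK()'d at init */

static void fence_check_start(void)	/* call where the IRQs are enabled */
{
	/* backup for lost interrupts while they are supposed to be on */
	schedule_delayed_work(&fence_check_work, HZ / 2);
}

static void fence_check_stop(void)	/* call where the IRQs are disabled */
{
	cancel_delayed_work_sync(&fence_check_work);
}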




Specifically what happened was this scenario:

- lock up occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish -> deadlock
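
(Reduced to the two actors, the deadlock described above looks like this
sketch; the names are illustrative only.)

#include <linux/rwsem.h>
#include <linux/workqueue.h>

static DECLARE_RWSEM(exclusive_lock);
static struct delayed_work check_work;	/* INIT_DELAYED_WORK(&check_work, check_lockup) at init */

static void check_lockup(struct work_struct *work)
{
	down_read(&exclusive_lock);	/* blocks: gpu_reset() holds it for write */
	/* ... check fence activity ... */
	up_read(&exclusive_lock);
}

static void gpu_reset(void)
{
	down_write(&exclusive_lock);
	/* waits for check_lockup() to finish, but check_lockup() is waiting
	 * for us to drop the write lock: deadlock */
	cancel_delayed_work_sync(&check_work);
	/* ... */
	up_write(&exclusive_lock);
}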


Why do you want to disable the work item from the lockup handler in the 
first place?


Just take the exclusive lock in the work item, when it concurrently runs 
with the lockup handler it will just block for the lockup handler to 
complete.


Regards,
Christian.


~Maarten


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Hi Maarten,

Sorry for the delay. I've got way to much todo recently.

Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:


On 01-08-14 18:35, Christian König wrote:

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:

Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.

That looks like the delayed work starts running as soon as we submit a fence, 
and not when it's needed for waiting.

Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.

The delayed work is not just for failing irq's, it's also the handler that's 
used to detect lockups, which is why I trigger after processing fences, and 
reset the timer after processing.


The idea was turning the delayed work on and off when we turn the irq on 
and off as well, processing of the delayed work handler can still happen 
in radeon_fence.c




Specifically what happened was this scenario:

- lock up occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish - deadlock


Why do you want to disable the work item from the lockup handler in the 
first place?


Just take the exclusive lock in the work item, when it concurrently runs 
with the lockup handler it will just block for the lockup handler to 
complete.


Regards,
Christian.


~Maarten


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 10:36, Christian König schreef:
 Hi Maarten,

 Sorry for the delay. I've got way to much todo recently.

 Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:

 On 01-08-14 18:35, Christian König wrote:
 Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
 Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
 ---
 V1 had a nasty bug breaking gpu lockup recovery. The fix is not
 allowing radeon_fence_driver_check_lockup to take exclusive_lock,
 and kill it during lockup recovery instead.
 That looks like the delayed work starts running as soon as we submit a 
 fence, and not when it's needed for waiting.

 Since it's a backup for failing IRQs I would rather put it into 
 radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
 The delayed work is not just for failing irq's, it's also the handler that's 
 used to detect lockups, which is why I trigger after processing fences, and 
 reset the timer after processing.

 The idea was turning the delayed work on and off when we turn the irq on and 
 off as well, processing of the delayed work handler can still happen in 
 radeon_fence.c


 Specifically what happened was this scenario:

 - lock up occurs
 - write lock taken by gpu_reset
 - delayed work runs, tries to acquire read lock, blocks
 - gpu_reset tries to cancel delayed work synchronously
 - has to wait for delayed work to finish - deadlock

 Why do you want to disable the work item from the lockup handler in the first 
 place?

 Just take the exclusive lock in the work item, when it concurrently runs with 
 the lockup handler it will just block for the lockup handler to complete.
With the delayed work radeon_fence_wait no longer handles unreliable interrupts 
itself, so it has to run from the lockup handler.
But an alternative solution could be adding a radeon_fence_wait_timeout, ignore 
the timeout and check if fence is signaled on timeout.
This would probably be a cleaner solution.

~Maarten

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:

op 04-08-14 10:36, Christian König schreef:

Hi Maarten,

Sorry for the delay. I've got way to much todo recently.

Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:

On 01-08-14 18:35, Christian König wrote:

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:

Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.

That looks like the delayed work starts running as soon as we submit a fence, 
and not when it's needed for waiting.

Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.

The delayed work is not just for failing irq's, it's also the handler that's 
used to detect lockups, which is why I trigger after processing fences, and 
reset the timer after processing.

The idea was turning the delayed work on and off when we turn the irq on and 
off as well, processing of the delayed work handler can still happen in 
radeon_fence.c


Specifically what happened was this scenario:

- lock up occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish - deadlock

Why do you want to disable the work item from the lockup handler in the first 
place?

Just take the exclusive lock in the work item, when it concurrently runs with 
the lockup handler it will just block for the lockup handler to complete.

With the delayed work radeon_fence_wait no longer handles unreliable interrupts 
itself, so it has to run from the lockup handler.
But an alternative solution could be adding a radeon_fence_wait_timeout, ignore 
the timeout and check if fence is signaled on timeout.
This would probably be a cleaner solution.


Yeah, agree. Manually specifying a timeout in the fence wait on lockup 
handling sounds like the best alternative to me.


Christian.



~Maarten



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
Hey,

op 04-08-14 13:57, Christian König schreef:
 Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:
 op 04-08-14 10:36, Christian König schreef:
 Hi Maarten,

 Sorry for the delay. I've got way to much todo recently.

 Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
 On 01-08-14 18:35, Christian König wrote:
 Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
 Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
 ---
 V1 had a nasty bug breaking gpu lockup recovery. The fix is not
 allowing radeon_fence_driver_check_lockup to take exclusive_lock,
 and kill it during lockup recovery instead.
 That looks like the delayed work starts running as soon as we submit a 
 fence, and not when it's needed for waiting.

 Since it's a backup for failing IRQs I would rather put it into 
 radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
 The delayed work is not just for failing irq's, it's also the handler 
 that's used to detect lockups, which is why I trigger after processing 
 fences, and reset the timer after processing.
 The idea was turning the delayed work on and off when we turn the irq on 
 and off as well, processing of the delayed work handler can still happen in 
 radeon_fence.c

 Specifically what happened was this scenario:

 - lock up occurs
 - write lock taken by gpu_reset
 - delayed work runs, tries to acquire read lock, blocks
 - gpu_reset tries to cancel delayed work synchronously
 - has to wait for delayed work to finish - deadlock
 Why do you want to disable the work item from the lockup handler in the 
 first place?

 Just take the exclusive lock in the work item, when it concurrently runs 
 with the lockup handler it will just block for the lockup handler to 
 complete.
 With the delayed work radeon_fence_wait no longer handles unreliable 
 interrupts itself, so it has to run from the lockup handler.
 But an alternative solution could be adding a radeon_fence_wait_timeout, 
 ignore the timeout and check if fence is signaled on timeout.
 This would probably be a cleaner solution.

 Yeah, agree. Manually specifying a timeout in the fence wait on lockup 
 handling sounds like the best alternative to me.
It'a pain to deal with gpu reset.

I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

~Maarten

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

It'a pain to deal with gpu reset.


Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.


The lockup code itself should never call any waiting code and V2 doesn't 
seem to handle a couple of cases correctly either.


How about moving the fence waiting out of the reset code?

Christian.

Am 04.08.2014 um 15:34 schrieb Maarten Lankhorst:

Hey,

op 04-08-14 13:57, Christian König schreef:

Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:

op 04-08-14 10:36, Christian König schreef:

Hi Maarten,

Sorry for the delay. I've got way to much todo recently.

Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:

On 01-08-14 18:35, Christian König wrote:

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:

Signed-off-by: Maarten Lankhorst maarten.lankho...@canonical.com
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.

That looks like the delayed work starts running as soon as we submit a fence, 
and not when it's needed for waiting.

Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.

The delayed work is not just for failing irq's, it's also the handler that's 
used to detect lockups, which is why I trigger after processing fences, and 
reset the timer after processing.

The idea was turning the delayed work on and off when we turn the irq on and 
off as well, processing of the delayed work handler can still happen in 
radeon_fence.c


Specifically what happened was this scenario:

- lock up occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish - deadlock

Why do you want to disable the work item from the lockup handler in the first 
place?

Just take the exclusive lock in the work item, when it concurrently runs with 
the lockup handler it will just block for the lockup handler to complete.

With the delayed work radeon_fence_wait no longer handles unreliable interrupts 
itself, so it has to run from the lockup handler.
But an alternative solution could be adding a radeon_fence_wait_timeout, ignore 
the timeout and check if fence is signaled on timeout.
This would probably be a cleaner solution.

Yeah, agree. Manually specifying a timeout in the fence wait on lockup handling 
sounds like the best alternative to me.

It'a pain to deal with gpu reset.

I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

~Maarten



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 16:37, Christian König schreef:
 It'a pain to deal with gpu reset.

 Yeah, well that's nothing new.

 I've now tried other solutions but that would mean reverting to the old 
 style during gpu lockup recovery, and only running the delayed work when 
 !lockup.
 But this meant that the timeout was useless to add. I think the cleanest is 
 keeping the v2 patch, because potentially any waiting code can be called 
 during lockup recovery.

 The lockup code itself should never call any waiting code and V2 doesn't seem 
 to handle a couple of cases correctly either.

 How about moving the fence waiting out of the reset code?
What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:

op 04-08-14 16:37, Christian König schreef:

It'a pain to deal with gpu reset.

Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't seem 
to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.


The IB test itself doesn't needs to be protected by the exclusive lock. 
Only everything between radeon_save_bios_scratch_regs and 
radeon_ring_restore.


Christian.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 16:45, Christian König schreef:
 Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
 op 04-08-14 16:37, Christian König schreef:
 It'a pain to deal with gpu reset.
 Yeah, well that's nothing new.

 I've now tried other solutions but that would mean reverting to the old 
 style during gpu lockup recovery, and only running the delayed work when 
 !lockup.
 But this meant that the timeout was useless to add. I think the cleanest 
 is keeping the v2 patch, because potentially any waiting code can be 
 called during lockup recovery.
 The lockup code itself should never call any waiting code and V2 doesn't 
 seem to handle a couple of cases correctly either.

 How about moving the fence waiting out of the reset code?
 What cases did I miss then?

 I'm curious how you want to move the fence waiting out of reset, when there 
 are so many places that could potentially wait, like radeon_ib_get can call 
 radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
 radeon_fence_wait_next, etc.

 The IB test itself doesn't needs to be protected by the exclusive lock. Only 
 everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
I'm not sure about that, what do you want to do if the ring tests fail? Do you 
have to retake the exclusive lock?

~Maarten

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:

op 04-08-14 16:45, Christian König schreef:

Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:

op 04-08-14 16:37, Christian König schreef:

It'a pain to deal with gpu reset.

Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't seem 
to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.

The IB test itself doesn't need to be protected by the exclusive lock. Only 
everything between radeon_save_bios_scratch_regs and radeon_ring_restore.

I'm not sure about that, what do you want to do if the ring tests fail? Do you 
have to retake the exclusive lock?


Just set need_reset again and return -EAGAIN, that should have mostly 
the same effect as what we are doing right now.


Christian.



~Maarten





Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Maarten Lankhorst
op 04-08-14 17:04, Christian König schreef:
 Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
 op 04-08-14 16:45, Christian König schreef:
 Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
 op 04-08-14 16:37, Christian König schreef:
 It's a pain to deal with gpu reset.
 Yeah, well that's nothing new.

 I've now tried other solutions but that would mean reverting to the old 
 style during gpu lockup recovery, and only running the delayed work when 
 !lockup.
 But this meant that the timeout was useless to add. I think the cleanest 
 is keeping the v2 patch, because potentially any waiting code can be 
 called during lockup recovery.
 The lockup code itself should never call any waiting code and V2 doesn't 
 seem to handle a couple of cases correctly either.

 How about moving the fence waiting out of the reset code?
 What cases did I miss then?

 I'm curious how you want to move the fence waiting out of reset, when 
 there are so many places that could potentially wait, like radeon_ib_get 
 can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that 
 can wait on radeon_fence_wait_next, etc.
 The IB test itself doesn't need to be protected by the exclusive lock. 
 Only everything between radeon_save_bios_scratch_regs and 
 radeon_ring_restore.
 I'm not sure about that, what do you want to do if the ring tests fail? Do 
 you have to retake the exclusive lock?

 Just set need_reset again and return -EAGAIN, that should have mostly the 
 same effect as what we are doing right now.
Yeah, except for locking the ttm delayed workqueue, but that bool should be 
easy to save/restore.
I think this could work.
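
Roughly, that failure path could look like the following sketch (illustrative
names only; sketch_ring_tests is an assumed helper and the ttm workqueue
"resched" bool is only hinted at in the comment):

#include <linux/errno.h>
#include <linux/rwsem.h>

struct sketch_device {
	struct rw_semaphore exclusive_lock;
	bool needs_reset;
	bool saved_resched;	/* saved result of blocking the ttm delayed workqueue */
};

static int sketch_ring_tests(struct sketch_device *sdev);	/* assumed helper */

/* Called with exclusive_lock already downgraded to read mode. */
static int post_reset_tests_sketch(struct sketch_device *sdev)
{
	int r = sketch_ring_tests(sdev);

	if (r) {
		/* tests failed: remember that another reset is needed; the caller
		 * also hands saved_resched back when unblocking the ttm delayed
		 * workqueue before the next attempt */
		sdev->needs_reset = true;
		up_read(&sdev->exclusive_lock);
		return -EAGAIN;
	}

	up_read(&sdev->exclusive_lock);
	return 0;
}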

~Maarten



Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-04 Thread Christian König

Am 04.08.2014 um 17:09 schrieb Maarten Lankhorst:

op 04-08-14 17:04, Christian König schreef:

Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:

op 04-08-14 16:45, Christian König schreef:

Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:

op 04-08-14 16:37, Christian König schreef:

It's a pain to deal with gpu reset.

Yeah, well that's nothing new.


I've now tried other solutions but that would mean reverting to the old style 
during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is 
keeping the v2 patch, because potentially any waiting code can be called during 
lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't seem 
to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are 
so many places that could potentially wait, like radeon_ib_get can call 
radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on 
radeon_fence_wait_next, etc.

The IB test itself doesn't need to be protected by the exclusive lock. Only 
everything between radeon_save_bios_scratch_regs and radeon_ring_restore.

I'm not sure about that, what do you want to do if the ring tests fail? Do you 
have to retake the exclusive lock?

Just set need_reset again and return -EAGAIN, that should have mostly the same 
effect as what we are doing right now.

Yeah, except for locking the ttm delayed workqueue, but that bool should be 
easy to save/restore.
I think this could work.


Actually you could activate the delayed workqueue much earlier as well.

Thinking more about it that sounds like a bug in the current code, 
because we probably want the workqueue activated before waiting for the 
fence.
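
A small sketch of what arming the check before waiting could look like
(fence_check_work, fence_queue and sketch_fence_signaled are illustrative
stand-ins, not the real radeon structures):

#include <linux/jiffies.h>
#include <linux/wait.h>
#include <linux/workqueue.h>

struct sketch_fence_driver {
	struct delayed_work fence_check_work;	/* per-ring hang check */
	wait_queue_head_t   fence_queue;
};

#define SKETCH_FENCE_JIFFIES_TIMEOUT	(HZ / 2)

static bool sketch_fence_signaled(void);	/* assumed: checks the fence seq */

/* Re-arm the hang check *before* sleeping, so a stuck ring is still
 * detected even if no further fence is emitted while we wait. */
static long fence_wait_sketch(struct sketch_fence_driver *drv, long timeout)
{
	mod_delayed_work(system_power_efficient_wq,
			 &drv->fence_check_work,
			 SKETCH_FENCE_JIFFIES_TIMEOUT);

	return wait_event_timeout(drv->fence_queue,
				  sketch_fence_signaled(), timeout);
}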


Christian.



~Maarten





Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-01 Thread Maarten Lankhorst


On 01-08-14 18:35, Christian König wrote:
> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>> Signed-off-by: Maarten Lankhorst 
>> ---
>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>> and kill it during lockup recovery instead.
> 
> That looks like the delayed work starts running as soon as we submit a fence, 
> and not when it's needed for waiting.
> 
> Since it's a backup for failing IRQs I would rather put it into 
> radeon_irq_kms.c and start/stop it when the IRQs are started/stopped.

The delayed work is not just for failing IRQs; it's also the handler used to 
detect lockups, which is why I trigger it after processing fences and reset 
the timer after processing.

Specifically, what happened was this scenario (sketched in code after the list):

- lockup occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish -> deadlock
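
In code, that deadlock looks roughly like this (illustrative names, not the
driver's actual functions):

#include <linux/rwsem.h>
#include <linux/workqueue.h>

static DECLARE_RWSEM(exclusive_lock);

static void hang_check(struct work_struct *work);
static DECLARE_DELAYED_WORK(hang_check_work, hang_check);

static void hang_check(struct work_struct *work)
{
	down_read(&exclusive_lock);	/* sleeps: gpu_reset holds the write lock */
	/* ... process fences / detect the lockup ... */
	up_read(&exclusive_lock);
}

static void gpu_reset_v1_sketch(void)
{
	down_write(&exclusive_lock);
	/* Deadlock: the handler above is already running and is blocked in
	 * down_read(), so waiting for it to finish can never succeed while
	 * we keep holding the lock in write mode. */
	cancel_delayed_work_sync(&hang_check_work);
	/* ... actual reset would go here ... */
	up_write(&exclusive_lock);
}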

~Maarten


Re: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

2014-08-01 Thread Christian König

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:

Signed-off-by: Maarten Lankhorst 
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.


That looks like the delayed work starts running as soon as we submit a 
fence, and not when it's needed for waiting.


Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stopped.
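
For illustration, tying the backup timer to the ring interrupts could look
something like this sketch (all sketch_* names are made up for the example,
not existing radeon functions):

#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

#define SKETCH_NUM_RINGS	8
#define SKETCH_HANGCHECK_DELAY	(HZ / 2)

struct sketch_irq_state {
	atomic_t		ring_int[SKETCH_NUM_RINGS];
	struct delayed_work	hang_check[SKETCH_NUM_RINGS];
};

/* Start the backup hang check only while someone actually uses the ring IRQ... */
static void sketch_sw_irq_get(struct sketch_irq_state *irq, int ring)
{
	if (atomic_inc_return(&irq->ring_int[ring]) == 1)
		schedule_delayed_work(&irq->hang_check[ring],
				      SKETCH_HANGCHECK_DELAY);
}

/* ...and stop it again when the last user drops the interrupt. */
static void sketch_sw_irq_put(struct sketch_irq_state *irq, int ring)
{
	if (atomic_dec_and_test(&irq->ring_int[ring]))
		cancel_delayed_work(&irq->hang_check[ring]);
}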


Christian.


---
  drivers/gpu/drm/radeon/radeon.h|3 +
  drivers/gpu/drm/radeon/radeon_device.c |5 +
  drivers/gpu/drm/radeon/radeon_fence.c  |  124 ++--
  drivers/gpu/drm/radeon/radeon_ring.c   |1
  4 files changed, 77 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 60c47f829122..b01d88fc10cb 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -347,6 +347,8 @@ struct radeon_fence_driver {
uint64_tsync_seq[RADEON_NUM_RINGS];
atomic64_t  last_seq;
boolinitialized;
+   struct delayed_work fence_check_work;
+   struct radeon_device*rdev;
  };
  
  struct radeon_fence {

@@ -360,6 +362,7 @@ struct radeon_fence {
  
  int radeon_fence_driver_start_ring(struct radeon_device *rdev, int ring);

  int radeon_fence_driver_init(struct radeon_device *rdev);
+void radeon_fence_cancel_delayed_check(struct radeon_device *rdev, int ring);
  void radeon_fence_driver_fini(struct radeon_device *rdev);
  void radeon_fence_driver_force_completion(struct radeon_device *rdev);
  int radeon_fence_emit(struct radeon_device *rdev, struct radeon_fence 
**fence, int ring);
diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
b/drivers/gpu/drm/radeon/radeon_device.c
index 697add2cd4e3..21efd32d07ee 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1637,6 +1637,11 @@ int radeon_gpu_reset(struct radeon_device *rdev)
radeon_save_bios_scratch_regs(rdev);
/* block TTM */
resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
+
+   /* kill all the hangcheck tests too */
+   for (i = 0; i < RADEON_NUM_RINGS; ++i)
+   radeon_fence_cancel_delayed_check(rdev, i);
+
radeon_pm_suspend(rdev);
radeon_suspend(rdev);
  
diff --git a/drivers/gpu/drm/radeon/radeon_fence.c b/drivers/gpu/drm/radeon/radeon_fence.c

index 913787085dfa..c055acc3a271 100644
--- a/drivers/gpu/drm/radeon/radeon_fence.c
+++ b/drivers/gpu/drm/radeon/radeon_fence.c
@@ -125,16 +125,7 @@ int radeon_fence_emit(struct radeon_device *rdev,
return 0;
  }
  
-/**

- * radeon_fence_process - process a fence
- *
- * @rdev: radeon_device pointer
- * @ring: ring index the fence is associated with
- *
- * Checks the current fence value and wakes the fence queue
- * if the sequence number has increased (all asics).
- */
-void radeon_fence_process(struct radeon_device *rdev, int ring)
+static bool __radeon_fence_process(struct radeon_device *rdev, int ring)
  {
uint64_t seq, last_seq, last_emitted;
unsigned count_loop = 0;
@@ -190,7 +181,51 @@ void radeon_fence_process(struct radeon_device *rdev, int 
ring)
}
} while (atomic64_xchg(&rdev->fence_drv[ring].last_seq, seq) > seq);
  
-	if (wake)

+   if (seq < last_emitted)
+   mod_delayed_work(system_power_efficient_wq,
+&rdev->fence_drv[ring].fence_check_work,
+RADEON_FENCE_JIFFIES_TIMEOUT);
+
+   return wake;
+}
+
+static void radeon_fence_driver_check_lockup(struct work_struct *work)
+{
+   struct radeon_fence_driver *fence_drv;
+   struct radeon_device *rdev;
+   unsigned long iring;
+
+   fence_drv = container_of(work, struct radeon_fence_driver, 
fence_check_work.work);
+   rdev = fence_drv->rdev;
+   iring = fence_drv - &rdev->fence_drv[0];
+
+   if (__radeon_fence_process(rdev, iring))
+   wake_up_all(&rdev->fence_queue);
+   else if (radeon_ring_is_lockup(rdev, iring, &rdev->ring[iring])) {
+   /* good news we believe it's a lockup */
+   dev_warn(rdev->dev, "GPU lockup (current fence id "
+"0x%016llx last fence id 0x%016llx on ring %ld)\n",
+(uint64_t)atomic64_read(&fence_drv->last_seq),
+fence_drv->sync_seq[iring], iring);
+
+   /* remember that we need a reset */
+   rdev->needs_reset = true;
+   wake_up_all(&rdev->fence_queue);
+   }
+}
+
+/**
+ * radeon_fence_process - process a fence
+ *
+ * @rdev: radeon_device pointer
+ * @ring: ring index the fence is associated with
+ *
+ * Checks the current fence value and wakes the fence queue
+ * if the sequence number has increased (all asics).
+ 
