Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery

2012-04-24 Thread Marcin Slusarz
On Mon, Apr 23, 2012 at 06:56:44PM +0200, Martin Peres wrote:
 On 23/04/2012 18:32, Marcin Slusarz wrote:
 
  Just run piglit. Even quick tests can cause ~5 lockups (it eventually
  messes up DDX channel, but this patchset can't fix this case).
  You can run fs-discard-exit-2 test first - for me it causes instant GPU
  lockup.
 
  Marcin
 Great, thanks.
 
 Did you have a look at 
 https://bugs.freedesktop.org/show_bug.cgi?id=40886 and 
 http://xorg.freedesktop.org/wiki/SummerOfCodeIdeas ?

Yeah, I've seen them some time ago.

 The Ubuntu xorg devs were looking for something like this, but they also
 wanted a lockup report. Are you also interested in working on it?

Yes, once this patchset is applied, I'm going to work on improving
error reporting.

Marcin
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery

2012-04-24 Thread Ben Skeggs
On Tue, 2012-04-24 at 21:31 +0200, Marcin Slusarz wrote:
 On Mon, Apr 23, 2012 at 06:56:44PM +0200, Martin Peres wrote:
  On 23/04/2012 18:32, Marcin Slusarz wrote:
  
   Just run piglit. Even quick tests can cause ~5 lockups (it eventually
   messes up DDX channel, but this patchset can't fix this case).
   You can run fs-discard-exit-2 test first - for me it causes instant
   GPU lockup.
  
   Marcin
  Great, thanks.
  
  Did you have a look at 
  https://bugs.freedesktop.org/show_bug.cgi?id=40886 and 
  http://xorg.freedesktop.org/wiki/SummerOfCodeIdeas ?
 
 Yeah, I've seen them some time ago.
 
  The Ubuntu xorg devs were looking for something like this, but they also
  wanted a lockup report. Are you also interested in working on it?
As I argued at XDC last year, I really question the usefulness of
something like this.  We have stupidly HUGE amounts of state that could
be relevant, and the situations where we even need something like this
are RARE.

I don't want this useless crap in our kernel module just because some
random distro thinks it's so useful, when it's not.  On the very very
rare (I can think of one situation where we've wanted these register
dumps, and they weren't useful even then) occasions we need this info,
we can ask people to install envytools and grab it..

We have a GPU with *very* good error reporting, and we log this to
dmesg.  This is good enough.  Any random errorless lockup is much
harder, and unless you dump *all* the card state, from the memory
controllers to the clocks to PFIFO to the particular engine that's
involved, it's going to be useless.  The problem could be anything.

 
 Yes, once this patchset is applied, I'm going to work on improving
 error reporting.
Assuming you're not talking about a register-dump style lockup report
like above, this could be good.  Particularly, fleshing out and
improving/completing each engine's IRQ handlers (which will probably
have the nice side-effect of surviving a few more errors without locking
up) :)

Cheers,
Ben.

 
 Marcin
 ___
 dri-devel mailing list
 dri-de...@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/dri-devel




Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery

2012-04-23 Thread Marcin Slusarz
On Mon, Apr 23, 2012 at 10:43:08AM +0200, Martin Peres wrote:
 On 23/04/2012 00:18, Marcin Slusarz wrote:
  Overall idea:
  Detect lockups by watching for timeouts (vm flush / fence), return -EIO,
  handle it at ioctl level, reset the GPU and repeat the last ioctl.
 
  GPU reset is done by doing a suspend / resume cycle with a few tweaks:
  - CPU-only bo eviction
  - ignoring vm flush / fence timeouts
  - shortening waits
 
  Signed-off-by: Marcin Slusarz <marcin.slus...@gmail.com>
  ---
  Tested only on nv92.
 Hi Marcin,
 
 I'm really busy at the moment but I'm glad to see such patches coming out.
 I'll try them out ASAP!
 
 Do you have a recommended way to test your patch set?

Just run piglit. Even quick tests can cause ~5 lockups (it eventually messes
up DDX channel, but this patchset can't fix this case).
You can run fs-discard-exit-2 test first - for me it causes instant GPU lockup.

Marcin


Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery

2012-04-23 Thread Martin Peres

On 23/04/2012 18:32, Marcin Slusarz wrote:


Just run piglit. Even quick tests can cause ~5 lockups (it eventually messes
up DDX channel, but this patchset can't fix this case).
You can run fs-discard-exit-2 test first - for me it causes instant GPU lockup.

Marcin

Great, thanks.

Did you have a look at 
https://bugs.freedesktop.org/show_bug.cgi?id=40886 and 
http://xorg.freedesktop.org/wiki/SummerOfCodeIdeas ?
The Ubuntu xorg devs were looking for something like this, but they also
wanted a lockup report. Are you also interested in working on it?


Martin


Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery

2012-04-23 Thread Marcin Slusarz
On Mon, Apr 23, 2012 at 06:46:41PM +0200, Martin Peres wrote:
 Hey,
 
 Just a minor mistake spotted while skimming through the patch.
 
 On 23/04/2012 00:18, Marcin Slusarz wrote:
  +static inline uint64_t nv_timeout(struct drm_device *dev)
  +{
  +   uint64_t tm = 20ULL;
  +   if (nouveau_gpu_reset_in_progress(dev))
  +   tm /= 40; /* 50ms */
 This will cause a problem on 32-bit kernels. You should use do_div.
 

Thanks. I'll fix this later.

Marcin


[Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery

2012-04-22 Thread Marcin Slusarz
Overall idea:
Detect lockups by watching for timeouts (vm flush / fence), return -EIO,
handle it at ioctl level, reset the GPU and repeat the last ioctl.

GPU reset is done by doing a suspend / resume cycle with a few tweaks:
- CPU-only bo eviction
- ignoring vm flush / fence timeouts
- shortening waits

Signed-off-by: Marcin Slusarz <marcin.slus...@gmail.com>
---
Tested only on nv92.
---
 drivers/gpu/drm/nouveau/Makefile   |2 +-
 drivers/gpu/drm/nouveau/nouveau_bo.c   |2 +-
 drivers/gpu/drm/nouveau/nouveau_channel.c  |5 +-
 drivers/gpu/drm/nouveau/nouveau_drv.c  |3 +-
 drivers/gpu/drm/nouveau/nouveau_drv.h  |   33 ++-
 drivers/gpu/drm/nouveau/nouveau_fence.c|7 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c  |   14 +++-
 drivers/gpu/drm/nouveau/nouveau_notifier.c |3 +
 drivers/gpu/drm/nouveau/nouveau_object.c   |6 +
 drivers/gpu/drm/nouveau/nouveau_reset.c|  144 
 drivers/gpu/drm/nouveau/nouveau_state.c|5 +
 drivers/gpu/drm/nouveau/nv50_graph.c   |   11 +-
 12 files changed, 221 insertions(+), 14 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_reset.c

diff --git a/drivers/gpu/drm/nouveau/Makefile b/drivers/gpu/drm/nouveau/Makefile
index 03860f5..77d0c33 100644
--- a/drivers/gpu/drm/nouveau/Makefile
+++ b/drivers/gpu/drm/nouveau/Makefile
@@ -9,7 +9,7 @@ nouveau-y := nouveau_drv.o nouveau_state.o nouveau_channel.o nouveau_mem.o \
  nouveau_bo.o nouveau_fence.o nouveau_gem.o nouveau_ttm.o \
  nouveau_hw.o nouveau_calc.o nouveau_bios.o nouveau_i2c.o \
  nouveau_display.o nouveau_connector.o nouveau_fbcon.o \
- nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o \
+ nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o nouveau_reset.o \
 nouveau_pm.o nouveau_volt.o nouveau_perf.o nouveau_temp.o \
 nouveau_mm.o nouveau_vm.o nouveau_mxm.o nouveau_gpio.o \
  nv04_timer.o \
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 5b0dc50..7de6cad 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -936,7 +936,7 @@ nouveau_bo_move(struct ttm_buffer_object *bo, bool evict, bool intr,
 	}
 
 	/* Software copy if the card isn't up and running yet. */
-	if (!dev_priv->channel) {
+	if (!dev_priv->channel || nouveau_gpu_reset_in_progress(dev_priv->dev)) {
 		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, new_mem);
 		goto out;
 	}
diff --git a/drivers/gpu/drm/nouveau/nouveau_channel.c b/drivers/gpu/drm/nouveau/nouveau_channel.c
index 846afb0..c0fa5a7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_channel.c
+++ b/drivers/gpu/drm/nouveau/nouveau_channel.c
@@ -420,7 +420,7 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
 				    init->fb_ctxdma_handle,
 				    init->tt_ctxdma_handle);
 	if (ret)
-		return ret;
+		goto out;
 	init->channel  = chan->id;
 
 	if (nouveau_vram_pushbuf == 0) {
@@ -450,6 +450,9 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
 	if (ret == 0)
 		atomic_inc(&chan->users); /* userspace reference */
 	nouveau_channel_put(&chan);
+out:
+	if (ret == -EIO)
+		ret = nouveau_reset_device(dev);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.c b/drivers/gpu/drm/nouveau/nouveau_drv.c
index 090fff6..22c435f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.c
@@ -237,7 +237,7 @@ nouveau_pci_suspend(struct pci_dev *pdev, pm_message_t pm_state)
 		if (!dev_priv->eng[e])
 			continue;
 
-		ret = dev_priv->eng[e]->fini(dev, e, true);
+		ret = dev_priv->eng[e]->fini(dev, e, !nouveau_gpu_reset_in_progress(dev));
 		if (ret) {
 			NV_ERROR(dev, "... engine %d failed: %d\n", e, ret);
 			goto out_abort;
@@ -483,6 +483,7 @@ static struct drm_driver driver = {
 	.disable_vblank = nouveau_vblank_disable,
 	.reclaim_buffers = drm_core_reclaim_buffers,
 	.ioctls = nouveau_ioctls,
+	.ioctls_need_rwsem = true,
 	.fops = nouveau_driver_fops,
 	.gem_init_object = nouveau_gem_object_new,
 	.gem_free_object = nouveau_gem_object_del,
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.h b/drivers/gpu/drm/nouveau/nouveau_drv.h
index d120baf..01500e1 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.h
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.h
@@ -708,6 +708,10 @@ struct drm_nouveau_private {
struct drm_device *dev;
bool noaccel;
 
+   struct mutex reset_lock;
+   atomic_t gpureset_in_progress;
+   unsigned long last_gpu_reset;
+
/* the card type, takes NV_* as values */