Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery
On Mon, Apr 23, 2012 at 06:56:44PM +0200, Martin Peres wrote:
> On 23/04/2012 18:32, Marcin Slusarz wrote:
> > Just run piglit. Even quick tests can cause ~5 lockups (it eventually
> > messes up the DDX channel, but this patchset can't fix that case).
> > You can run the fs-discard-exit-2 test first - for me it causes an
> > instant GPU lockup.
> >
> > Marcin
> Great, thanks.
>
> Did you have a look at https://bugs.freedesktop.org/show_bug.cgi?id=40886
> and http://xorg.freedesktop.org/wiki/SummerOfCodeIdeas ?

Yeah, I've seen them some time ago.

> The Ubuntu xorg devs were looking for something like this, but they also
> wanted a lockup report. Are you also interested in working on it?

Yes, once this patchset is applied, I'm going to work on improving error
reporting.

Marcin
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery
On Tue, 2012-04-24 at 21:31 +0200, Marcin Slusarz wrote:
> On Mon, Apr 23, 2012 at 06:56:44PM +0200, Martin Peres wrote:
> > On 23/04/2012 18:32, Marcin Slusarz wrote:
> > > Just run piglit. Even quick tests can cause ~5 lockups (it eventually
> > > messes up the DDX channel, but this patchset can't fix that case).
> > > You can run the fs-discard-exit-2 test first - for me it causes an
> > > instant GPU lockup.
> > >
> > > Marcin
> > Great, thanks.
> >
> > Did you have a look at https://bugs.freedesktop.org/show_bug.cgi?id=40886
> > and http://xorg.freedesktop.org/wiki/SummerOfCodeIdeas ?
>
> Yeah, I've seen them some time ago.
>
> > The Ubuntu xorg devs were looking for something like this, but they also
> > wanted a lockup report. Are you also interested in working on it?

As I argued at XDC last year, I really question the usefulness of something
like this. We have stupidly HUGE amounts of state that could be relevant,
and the situations where we even need something like this are RARE. I don't
want this useless crap in our kernel module just because some random distro
thinks it's useful, when it's not.

On the very, very rare occasions we need this info (I can think of one
situation where we've wanted these register dumps, and they weren't useful
even then), we can ask people to install envytools and grab it.

We have a GPU with *very* good error reporting, and we log this to dmesg.
This is good enough. Random errorless lockups are much harder, and unless
you dump *all* the card state - from the memory controllers, to the clocks,
to PFIFO, to the particular engine that's involved - it's going to be
useless. The problem could be anything.

> Yes, when this patchset is applied, I'm going to work on improving error
> reporting.

Assuming you're not talking about a register-dump style lockup report like
the above, this could be good. In particular, fleshing out and completing
each engine's IRQ handlers (which will probably have the nice side effect
of surviving a few more errors without locking up) :)

Cheers,
Ben.
___
dri-devel mailing list
dri-de...@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery
On Mon, Apr 23, 2012 at 10:43:08AM +0200, Martin Peres wrote:
> On 23/04/2012 00:18, Marcin Slusarz wrote:
> > Overall idea:
> > Detect lockups by watching for timeouts (vm flush / fence), return
> > -EIOs, handle them at ioctl level, reset the GPU and repeat the last
> > ioctl.
> >
> > GPU reset is done by doing a suspend / resume cycle with a few tweaks:
> > - CPU-only bo eviction
> > - ignoring vm flush / fence timeouts
> > - shortening waits
> >
> > Signed-off-by: Marcin Slusarz <marcin.slus...@gmail.com>
> > ---
> > Tested only on nv92.
>
> Hi Marcin,
>
> I'm really busy at the moment, but I'm glad to see such patches coming
> out. I'll try them out ASAP!
>
> Do you have a recommended way to test your patch set?

Just run piglit. Even quick tests can cause ~5 lockups (it eventually
messes up the DDX channel, but this patchset can't fix that case).
You can run the fs-discard-exit-2 test first - for me it causes an
instant GPU lockup.

Marcin
Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery
On 23/04/2012 18:32, Marcin Slusarz wrote:
> Just run piglit. Even quick tests can cause ~5 lockups (it eventually
> messes up the DDX channel, but this patchset can't fix that case).
> You can run the fs-discard-exit-2 test first - for me it causes an
> instant GPU lockup.
>
> Marcin

Great, thanks.

Did you have a look at https://bugs.freedesktop.org/show_bug.cgi?id=40886
and http://xorg.freedesktop.org/wiki/SummerOfCodeIdeas ?

The Ubuntu xorg devs were looking for something like this, but they also
wanted a lockup report. Are you also interested in working on it?

Martin
Re: [Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery
On Mon, Apr 23, 2012 at 06:46:41PM +0200, Martin Peres wrote:
> Hey,
>
> Just a minor mistake spotted while skimming through the patch.
>
> On 23/04/2012 00:18, Marcin Slusarz wrote:
> > +static inline uint64_t nv_timeout(struct drm_device *dev)
> > +{
> > +	uint64_t tm = 2000000000ULL;
> > +	if (nouveau_gpu_reset_in_progress(dev))
> > +		tm /= 40; /* 50ms */
>
> This will cause a problem on 32-bit kernels. You should use do_div.

Thanks. I'll fix this later.

Marcin
[Nouveau] [RFC PATCH 5/5] drm/nouveau: gpu lockup recovery
Overall idea:
Detect lockups by watching for timeouts (vm flush / fence), return -EIOs,
handle them at ioctl level, reset the GPU and repeat the last ioctl.

GPU reset is done by doing a suspend / resume cycle with a few tweaks:
- CPU-only bo eviction
- ignoring vm flush / fence timeouts
- shortening waits

Signed-off-by: Marcin Slusarz <marcin.slus...@gmail.com>
---
Tested only on nv92.
---
 drivers/gpu/drm/nouveau/Makefile           |    2 +-
 drivers/gpu/drm/nouveau/nouveau_bo.c       |    2 +-
 drivers/gpu/drm/nouveau/nouveau_channel.c  |    5 +-
 drivers/gpu/drm/nouveau/nouveau_drv.c      |    3 +-
 drivers/gpu/drm/nouveau/nouveau_drv.h      |   33 ++-
 drivers/gpu/drm/nouveau/nouveau_fence.c    |    7 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c      |   14 +++-
 drivers/gpu/drm/nouveau/nouveau_notifier.c |    3 +
 drivers/gpu/drm/nouveau/nouveau_object.c   |    6 +
 drivers/gpu/drm/nouveau/nouveau_reset.c    |  144 
 drivers/gpu/drm/nouveau/nouveau_state.c    |    5 +
 drivers/gpu/drm/nouveau/nv50_graph.c       |   11 +-
 12 files changed, 221 insertions(+), 14 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_reset.c

diff --git a/drivers/gpu/drm/nouveau/Makefile b/drivers/gpu/drm/nouveau/Makefile
index 03860f5..77d0c33 100644
--- a/drivers/gpu/drm/nouveau/Makefile
+++ b/drivers/gpu/drm/nouveau/Makefile
@@ -9,7 +9,7 @@ nouveau-y := nouveau_drv.o nouveau_state.o nouveau_channel.o nouveau_mem.o \
 	     nouveau_bo.o nouveau_fence.o nouveau_gem.o nouveau_ttm.o \
 	     nouveau_hw.o nouveau_calc.o nouveau_bios.o nouveau_i2c.o \
 	     nouveau_display.o nouveau_connector.o nouveau_fbcon.o \
-	     nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o \
+	     nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o nouveau_reset.o \
 	     nouveau_pm.o nouveau_volt.o nouveau_perf.o nouveau_temp.o \
 	     nouveau_mm.o nouveau_vm.o nouveau_mxm.o nouveau_gpio.o \
 	     nv04_timer.o \
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 5b0dc50..7de6cad 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -936,7 +936,7 @@ nouveau_bo_move(struct ttm_buffer_object *bo, bool evict, bool intr,
 	}
 
 	/* Software copy if the card isn't up and running yet. */
-	if (!dev_priv->channel) {
+	if (!dev_priv->channel || nouveau_gpu_reset_in_progress(dev_priv->dev)) {
 		ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, no_wait_gpu, new_mem);
 		goto out;
 	}
diff --git a/drivers/gpu/drm/nouveau/nouveau_channel.c b/drivers/gpu/drm/nouveau/nouveau_channel.c
index 846afb0..c0fa5a7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_channel.c
+++ b/drivers/gpu/drm/nouveau/nouveau_channel.c
@@ -420,7 +420,7 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
 				  init->fb_ctxdma_handle, init->tt_ctxdma_handle);
 	if (ret)
-		return ret;
+		goto out;
 	init->channel  = chan->id;
 
 	if (nouveau_vram_pushbuf == 0) {
@@ -450,6 +450,9 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
 	if (ret == 0)
 		atomic_inc(&chan->users); /* userspace reference */
 	nouveau_channel_put(&chan);
+out:
+	if (ret == -EIO)
+		ret = nouveau_reset_device(dev);
 	return ret;
 }
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.c b/drivers/gpu/drm/nouveau/nouveau_drv.c
index 090fff6..22c435f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.c
@@ -237,7 +237,7 @@ nouveau_pci_suspend(struct pci_dev *pdev, pm_message_t pm_state)
 		if (!dev_priv->eng[e])
 			continue;
 
-		ret = dev_priv->eng[e]->fini(dev, e, true);
+		ret = dev_priv->eng[e]->fini(dev, e, !nouveau_gpu_reset_in_progress(dev));
 		if (ret) {
 			NV_ERROR(dev, "... engine %d failed: %d\n", e, ret);
 			goto out_abort;
@@ -483,6 +483,7 @@ static struct drm_driver driver = {
 	.disable_vblank = nouveau_vblank_disable,
 	.reclaim_buffers = drm_core_reclaim_buffers,
 	.ioctls = nouveau_ioctls,
+	.ioctls_need_rwsem = true,
 	.fops = nouveau_driver_fops,
 	.gem_init_object = nouveau_gem_object_new,
 	.gem_free_object = nouveau_gem_object_del,
diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.h b/drivers/gpu/drm/nouveau/nouveau_drv.h
index d120baf..01500e1 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.h
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.h
@@ -708,6 +708,10 @@ struct drm_nouveau_private {
 	struct drm_device *dev;
 	bool noaccel;
 
+	struct mutex reset_lock;
+	atomic_t gpureset_in_progress;
+	unsigned long last_gpu_reset;
+
 	/* the card type, takes NV_* as values */