Re: [Intel-gfx] [PATCH 3/7] drm/i915: More surgically unbreak the modeset vs reset deadlock
On Thu, Aug 03, 2017 at 12:35:48PM -0700, Michel Thierry wrote: > Hi, > > First sorry about the delay... > > On 7/20/2017 10:57 AM, Daniel Vetter wrote: > > There's no reason to entirely wedge the gpu, for the minimal deadlock > > bugfix we only need to unbreak/decouple the atomic commit from the gpu > > reset. The simplest way to fix that is by replacing the > > unconditional fence wait a the top of commit_tail by a wait which > > completes either when the fences are done (normal case, or when a > > reset doesn't need to touch the display state). Or when the gpu reset > > needs to force-unblock all pending modeset states. > > > > The lesser source of deadlocks is when we try to pin a new framebuffer > > and run into a stall. There's a bunch of places this can happen, like > > eviction, changing the caching mode, acquiring a fence on older > > platforms. And we can't just break the depency loop and keep going, > > the only way would be to break out and restart. But the problem with > > that approach is that we must stall for the reset to complete before > > we grab any locks, and with the atomic infrastructure that's a bit > > tricky. The only place is the ioctl code, and we don't want to insert > > code into e.g. the BUSY ioctl. Hence for that problem just create a > > critical section, and if any code is in there, wedge the GPU. For the > > steady-state this should never be a problem. > > > > Note that in both cases TDR itself keeps working, so from a userspace > > pov this trickery isn't observable. Users themselvs might spot a short > > glitch while the rendering is catching up again, but that's still > > better than pre-TDR where we've thrown away all the rendering, > > including innocent batches. Also, this fixes the regression TDR > > introduced of making gpu resets deadlock-prone when we do need to > > touch the display. > > > > One thing I noticed is that gpu_error.flags seems to use both our own > > wait-queue in gpu_error.wait_queue, and the generic wait_on_bit > > facilities. Not entirely sure why this inconsistency exists, I just > > picked one style. > > > > A possible future avenue could be to insert the gpu reset in-between > > ongoing modeset changes, which would avoid the momentary glitch. But > > that's a lot more work to implement in the atomic commit machinery, > > and given that we only need this for pre-g4x hw, of questionable > > utility just for the sake of polishing gpu reset even more on those > > old boxes. It might be useful for other features though. > > > > v2: Rebase onto 4.13 with a s/wait_queue_t/struct wait_queue_entry/. > > > > v3: Really emabarrassing fixup, I checked the wrong bit and broke the > > unbreak/wakeup logic. > > > > v4: Also handle deadlocks in pin_to_display. > > > > Cc: Chris Wilson> > Cc: Mika Kuoppala > > Cc: Joonas Lahtinen > > Signed-off-by: Daniel Vetter > > --- > > drivers/gpu/drm/i915/i915_drv.h | 3 +++ > > drivers/gpu/drm/i915/intel_display.c | 45 > > +++- > > 2 files changed, 42 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/i915_drv.h > > b/drivers/gpu/drm/i915/i915_drv.h > > index 07e98b07c5bc..9b055deca36d 100644 > > --- a/drivers/gpu/drm/i915/i915_drv.h > > +++ b/drivers/gpu/drm/i915/i915_drv.h > > @@ -1505,6 +1505,8 @@ struct i915_gpu_error { > > /* Protected by the above dev->gpu_error.lock. */ > > struct i915_gpu_state *first_error; > > > > + atomic_t pending_fb_pin; > > + > > unsigned long missed_irq_rings; > > > > /** > > @@ -1564,6 +1566,7 @@ struct i915_gpu_error { > > unsigned long flags; > > #define I915_RESET_BACKOFF 0 > > #define I915_RESET_HANDOFF 1 > > +#define I915_RESET_MODESET 2 > > #define I915_WEDGED(BITS_PER_LONG - 1) > > #define I915_RESET_ENGINE (I915_WEDGED - I915_NUM_ENGINES) > > > > Since you still need to resend this, could you also update the > I915_RESET_ENGINE overflow check in i915_handle_error? i.e.: > > ---BUILD_BUG_ON(I915_RESET_HANDOFF >= I915_RESET_ENGINE); > +++BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE); > > Not that we would have this problem anytime soon... Yeah missed that, will fix. > > diff --git a/drivers/gpu/drm/i915/intel_display.c > > b/drivers/gpu/drm/i915/intel_display.c > > index f6bd6282d7f7..63ea8d6b2ebd 100644 > > --- a/drivers/gpu/drm/i915/intel_display.c > > +++ b/drivers/gpu/drm/i915/intel_display.c > > @@ -2162,6 +2162,8 @@ intel_pin_and_fence_fb_obj(struct drm_framebuffer > > *fb, unsigned int rotation) > > */ > > intel_runtime_pm_get(dev_priv); > > > > + atomic_inc(_priv->gpu_error.pending_fb_pin); > > + > > Do we also need this in intel_overlay_do_put_image? It is also being called > when the struct_mutex is held. Indeed I missed that, will fix too and
Re: [Intel-gfx] [PATCH 3/7] drm/i915: More surgically unbreak the modeset vs reset deadlock
Hi, First sorry about the delay... On 7/20/2017 10:57 AM, Daniel Vetter wrote: There's no reason to entirely wedge the gpu, for the minimal deadlock bugfix we only need to unbreak/decouple the atomic commit from the gpu reset. The simplest way to fix that is by replacing the unconditional fence wait a the top of commit_tail by a wait which completes either when the fences are done (normal case, or when a reset doesn't need to touch the display state). Or when the gpu reset needs to force-unblock all pending modeset states. The lesser source of deadlocks is when we try to pin a new framebuffer and run into a stall. There's a bunch of places this can happen, like eviction, changing the caching mode, acquiring a fence on older platforms. And we can't just break the depency loop and keep going, the only way would be to break out and restart. But the problem with that approach is that we must stall for the reset to complete before we grab any locks, and with the atomic infrastructure that's a bit tricky. The only place is the ioctl code, and we don't want to insert code into e.g. the BUSY ioctl. Hence for that problem just create a critical section, and if any code is in there, wedge the GPU. For the steady-state this should never be a problem. Note that in both cases TDR itself keeps working, so from a userspace pov this trickery isn't observable. Users themselvs might spot a short glitch while the rendering is catching up again, but that's still better than pre-TDR where we've thrown away all the rendering, including innocent batches. Also, this fixes the regression TDR introduced of making gpu resets deadlock-prone when we do need to touch the display. One thing I noticed is that gpu_error.flags seems to use both our own wait-queue in gpu_error.wait_queue, and the generic wait_on_bit facilities. Not entirely sure why this inconsistency exists, I just picked one style. A possible future avenue could be to insert the gpu reset in-between ongoing modeset changes, which would avoid the momentary glitch. But that's a lot more work to implement in the atomic commit machinery, and given that we only need this for pre-g4x hw, of questionable utility just for the sake of polishing gpu reset even more on those old boxes. It might be useful for other features though. v2: Rebase onto 4.13 with a s/wait_queue_t/struct wait_queue_entry/. v3: Really emabarrassing fixup, I checked the wrong bit and broke the unbreak/wakeup logic. v4: Also handle deadlocks in pin_to_display. Cc: Chris WilsonCc: Mika Kuoppala Cc: Joonas Lahtinen Signed-off-by: Daniel Vetter --- drivers/gpu/drm/i915/i915_drv.h | 3 +++ drivers/gpu/drm/i915/intel_display.c | 45 +++- 2 files changed, 42 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 07e98b07c5bc..9b055deca36d 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1505,6 +1505,8 @@ struct i915_gpu_error { /* Protected by the above dev->gpu_error.lock. */ struct i915_gpu_state *first_error; + atomic_t pending_fb_pin; + unsigned long missed_irq_rings; /** @@ -1564,6 +1566,7 @@ struct i915_gpu_error { unsigned long flags; #define I915_RESET_BACKOFF 0 #define I915_RESET_HANDOFF 1 +#define I915_RESET_MODESET 2 #define I915_WEDGED(BITS_PER_LONG - 1) #define I915_RESET_ENGINE (I915_WEDGED - I915_NUM_ENGINES) Since you still need to resend this, could you also update the I915_RESET_ENGINE overflow check in i915_handle_error? i.e.: ---BUILD_BUG_ON(I915_RESET_HANDOFF >= I915_RESET_ENGINE); +++BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE); Not that we would have this problem anytime soon... diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index f6bd6282d7f7..63ea8d6b2ebd 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -2162,6 +2162,8 @@ intel_pin_and_fence_fb_obj(struct drm_framebuffer *fb, unsigned int rotation) */ intel_runtime_pm_get(dev_priv); + atomic_inc(_priv->gpu_error.pending_fb_pin); + Do we also need this in intel_overlay_do_put_image? It is also being called when the struct_mutex is held. vma = i915_gem_object_pin_to_display_plane(obj, alignment, ); if (IS_ERR(vma)) goto err; @@ -2189,6 +2191,8 @@ intel_pin_and_fence_fb_obj(struct drm_framebuffer *fb, unsigned int rotation) i915_vma_get(vma); err: + atomic_dec(_priv->gpu_error.pending_fb_pin); + intel_runtime_pm_put(dev_priv); return vma; } @@ -3471,11 +3475,14 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) !gpu_reset_clobbers_display(dev_priv))