Re: [Intel-gfx] [PATCH 3/7] drm/i915/tdr: Add support for per engine reset recovery

2016-09-19 Thread Chris Wilson
On Mon, Sep 19, 2016 at 04:30:15PM +0100, Matthew Auld wrote:
> From: "arun.siluv...@linux.intel.com" 
> 
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
> 
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
>  - identifies the request that caused the hang and it is dropped
>  - force engine to idle: this is done by issuing a reset request
>  - reset and re-init engine
>  - restart submissions to the engine
> 
> If engine reset fails then we fall back to the heavyweight full GPU reset,
> which resets all engines and reinitializes the complete state of HW and SW.
> 
> v2
>   - rebase
> 
> Cc: Chris Wilson 
> Cc: Mika Kuoppala 
> Signed-off-by: Tomas Elf 
> Signed-off-by: Arun Siluvery 
> Signed-off-by: Matthew Auld 
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 59 +
>  drivers/gpu/drm/i915/i915_drv.h |  3 ++
>  drivers/gpu/drm/i915/i915_gem.c |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.c| 10 +++
>  drivers/gpu/drm/i915/intel_lrc.h|  1 +
>  drivers/gpu/drm/i915/intel_uncore.c | 41 +++---
>  6 files changed, 105 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 99fa690..8625207 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1812,21 +1812,68 @@ error:
>   * Returns zero on successful reset or otherwise an error code.
>   *
>   * Procedure is fairly simple:
> - *  - force engine to idle
> - *  - save current state which includes head and current request
> - *  - reset engine
> - *  - restore saved state and resubmit context
> + *- identifies the request that caused the hang and it is dropped
> + *- force engine to idle: this is done by issuing a reset request
> + *- reset engine
> + *- restart submissions to the engine
>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)
>  {
>   int ret;
>   struct drm_i915_private *dev_priv = engine->i915;
>  
> - /* FIXME: replace me with engine reset sequence */
> - ret = -ENODEV;
> + /*
> +  * We first need to idle the engine by issuing a reset request,
> +  * then perform a soft reset and re-initialize the hw state. GT
> +  * power needs to stay awake for all of this, so hold forcewake
> +  * throughout the process.
> +  */
> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +
> + /*
> +  * the request that caused the hang is stuck on elsp, identify the
> +  * active request and drop it, adjust head to skip the offending
> +  * request to resume executing remaining requests in the queue.
> +  */
> + i915_gem_reset_engine(engine);
> +
> + ret = intel_engine_reset_begin(engine);
> + if (ret) {
> + DRM_ERROR("Failed to disable %s\n", engine->name);
> + goto error;
> + }
> +
> + ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> + if (ret) {
> + DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> + intel_engine_reset_cancel(engine);
> + goto error;
> + }

Ordering is still broken.

> +
> + ret = engine->init_hw(engine);
> + if (ret)
> + goto error;
> +
> + intel_engine_reset_cancel(engine);
> + intel_execlists_restart_submission(engine);

This is broken.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


[Intel-gfx] [PATCH 3/7] drm/i915/tdr: Add support for per engine reset recovery

2016-09-19 Thread Matthew Auld
From: "arun.siluv...@linux.intel.com" 

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
 - identifies the request that caused the hang and it is dropped
 - force engine to idle: this is done by issuing a reset request
 - reset and re-init engine
 - restart submissions to the engine

If engine reset fails then we fall back to the heavyweight full GPU reset,
which resets all engines and reinitializes the complete state of HW and SW.
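
For readers following the thread, the recovery flow described above (drop the
hung request, reset the engine, restart submission, and fall back to a full
GPU reset on failure) can be sketched as a small standalone simulation. This
is only an illustration under stated assumptions: struct sim_engine,
sim_recover() and the other sim_* names are hypothetical and are not part of
the i915 driver, and the forcewake counter merely models the get/put pairing
that the patch takes around the sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an engine; NOT the driver's struct intel_engine_cs. */
struct sim_engine {
	const char *name;
	bool reset_fails;          /* simulate a per-engine reset that fails */
	int queued;                /* queued requests; the first one is the hung one */
	int forcewake_refs;        /* models the forcewake get/put pairing */
	bool full_reset_requested; /* set when we give up and ask for a full reset */
};

enum recovery { RECOVERED_ENGINE, NEEDS_FULL_RESET };

/* Step 1: drop the request that caused the hang. */
static void sim_drop_hung_request(struct sim_engine *e)
{
	if (e->queued > 0)
		e->queued--;
}

/* Steps 2 and 3: idle, reset and re-init the engine (simulated outcome). */
static int sim_reset_engine(const struct sim_engine *e)
{
	return e->reset_fails ? -1 : 0;
}

static enum recovery sim_recover(struct sim_engine *e)
{
	enum recovery result = RECOVERED_ENGINE;

	e->forcewake_refs++; /* keep the hardware awake for the whole sequence */

	sim_drop_hung_request(e);
	if (sim_reset_engine(e) != 0) {
		/* Per-engine reset failed: request the heavyweight full reset. */
		e->full_reset_requested = true;
		result = NEEDS_FULL_RESET;
	}
	/* Step 4: on success, submission to the engine would restart here. */

	e->forcewake_refs--; /* released on both the success and the error path */
	assert(e->forcewake_refs == 0);
	return result;
}
```

Calling sim_recover() on an engine with reset_fails set returns
NEEDS_FULL_RESET and sets full_reset_requested. The forcewake count is
balanced on both paths, mirroring the paired intel_uncore_forcewake_get()
and intel_uncore_forcewake_put() calls in the patch.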

v2
  - rebase

Cc: Chris Wilson 
Cc: Mika Kuoppala 
Signed-off-by: Tomas Elf 
Signed-off-by: Arun Siluvery 
Signed-off-by: Matthew Auld 
---
 drivers/gpu/drm/i915/i915_drv.c | 59 +
 drivers/gpu/drm/i915/i915_drv.h |  3 ++
 drivers/gpu/drm/i915/i915_gem.c |  2 +-
 drivers/gpu/drm/i915/intel_lrc.c| 10 +++
 drivers/gpu/drm/i915/intel_lrc.h|  1 +
 drivers/gpu/drm/i915/intel_uncore.c | 41 +++---
 6 files changed, 105 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 99fa690..8625207 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1812,21 +1812,68 @@ error:
  * Returns zero on successful reset or otherwise an error code.
  *
  * Procedure is fairly simple:
- *  - force engine to idle
- *  - save current state which includes head and current request
- *  - reset engine
- *  - restore saved state and resubmit context
+ *- identifies the request that caused the hang and it is dropped
+ *- force engine to idle: this is done by issuing a reset request
+ *- reset engine
+ *- restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
int ret;
struct drm_i915_private *dev_priv = engine->i915;
 
-   /* FIXME: replace me with engine reset sequence */
-   ret = -ENODEV;
+   /*
+* We first need to idle the engine by issuing a reset request,
+* then perform a soft reset and re-initialize the hw state. GT
+* power needs to stay awake for all of this, so hold forcewake
+* throughout the process.
+*/
+   intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+   /*
+* the request that caused the hang is stuck on elsp, identify the
+* active request and drop it, adjust head to skip the offending
+* request to resume executing remaining requests in the queue.
+*/
+   i915_gem_reset_engine(engine);
+
+   ret = intel_engine_reset_begin(engine);
+   if (ret) {
+   DRM_ERROR("Failed to disable %s\n", engine->name);
+   goto error;
+   }
+
+   ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+   if (ret) {
+   DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+   intel_engine_reset_cancel(engine);
+   goto error;
+   }
+
+   ret = engine->init_hw(engine);
+   if (ret)
+   goto error;
+
+   intel_engine_reset_cancel(engine);
+   intel_execlists_restart_submission(engine);
 
+   intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+   return 0;
+
+error:
+   /* use full gpu reset to recover on error */
set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
 
+   /* Engine reset is performed without taking struct_mutex, since it
+* failed we now fallback to full gpu reset. Wakeup any waiters
+* which should now see the reset_in_progress and release
+* struct_mutex for us to continue recovery.
+*/
+   rcu_read_lock();
+   intel_engine_wakeup(engine);
+   rcu_read_unlock();
+
+   intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
return ret;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 79de74d..3dcf3f6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2889,6 +2889,8 @@ extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
 extern bool intel_has_engine_reset(struct drm_i915_private *dev_priv);
+extern int intel_engine_reset_begin(struct intel_engine_cs *engine);
+extern int intel_engine_reset_cancel(struct intel_engine_cs *engine);
 extern int i915_reset_engine(struct intel_engine_cs