Re: [Intel-gfx] [PATCH v2 1/2] drm/i915/guc: Use correct error code for GuC initialization failure
On Wed, 21 Feb 2018 09:27:51 +0100, Chris Wilsonwrote: Quoting Michal Wajdeczko (2018-02-20 22:57:10) Since commit 6ca9a2beb54a ("drm/i915: Unwind i915_gem_init() failure") we believed that we correctly handle all errors encountered during GuC initialization, including special one that indicates request to run driver with disabled GPU submission (-EIO). Unfortunately since commit 121981fafe69 ("drm/i915/guc: Combine enable_guc_loading|submission modparams") we stopped using that error code to avoid unwanted fallback to execlist submission mode. In result any GuC initialization failure was treated as non-recoverable error leading to driver load abort, so we could not even read related GuC error log to investigate cause of the problem. Fix that by always returning -EIO on uC hardware related failure. v2: don't allow -EIO from uc_init don't call uc_fini[_misc] on -EIO mark guc fw as failed on hw init failure prepare uc_fini_hw to run after earlier -EIO Signed-off-by: Michal Wajdeczko Cc: Chris Wilson Cc: Michal Winiarski Cc: Daniele Ceraolo Spurio Cc: Sagar Arun Kamble Whilst thinking about this, do you want to disable (or rather prevent setup of) guc submission if the driver is already wedged? We setup GuC submission in intel_uc_init_hw() and just before we call it from i915_gem_init_hw() we already have a check for wedged driver. So maybe we don't want to add redundant guard ? /Michal ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH v2 1/2] drm/i915/guc: Use correct error code for GuC initialization failure
On 2/21/2018 5:50 PM, Michal Wajdeczko wrote: On Wed, 21 Feb 2018 09:08:08 +0100, Sagar Arun Kamblewrote: On 2/21/2018 4:27 AM, Michal Wajdeczko wrote: Since commit 6ca9a2beb54a ("drm/i915: Unwind i915_gem_init() failure") we believed that we correctly handle all errors encountered during GuC initialization, including special one that indicates request to run driver with disabled GPU submission (-EIO). Unfortunately since commit 121981fafe69 ("drm/i915/guc: Combine enable_guc_loading|submission modparams") we stopped using that error code to avoid unwanted fallback to execlist submission mode. In result any GuC initialization failure was treated as non-recoverable error leading to driver load abort, so we could not even read related GuC error log to investigate cause of the problem. Fix that by always returning -EIO on uC hardware related failure. v2: don't allow -EIO from uc_init don't call uc_fini[_misc] on -EIO mark guc fw as failed on hw init failure prepare uc_fini_hw to run after earlier -EIO Signed-off-by: Michal Wajdeczko Cc: Chris Wilson Cc: Michal Winiarski Cc: Daniele Ceraolo Spurio Cc: Sagar Arun Kamble --- drivers/gpu/drm/i915/i915_gem.c | 13 - drivers/gpu/drm/i915/intel_guc.h | 5 + drivers/gpu/drm/i915/intel_uc.c | 13 + drivers/gpu/drm/i915/intel_uc_fw.h | 5 + 4 files changed, 27 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 631a2db..80f23a8 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -5324,8 +5324,10 @@ int i915_gem_init(struct drm_i915_private *dev_priv) intel_init_gt_powersave(dev_priv); ret = intel_uc_init(dev_priv); - if (ret) + if (ret) { + GEM_BUG_ON(ret == -EIO); goto err_pm; + } ret = i915_gem_init_hw(dev_priv); if (ret) @@ -5372,7 +5374,8 @@ int i915_gem_init(struct drm_i915_private *dev_priv) i915_gem_contexts_lost(dev_priv); intel_uc_fini_hw(dev_priv); This uc_fini_hw should also be not called on -EIO? This one here is fine. But I need to clear guc->fw.load_status there to make sure we will not try to perform full fini_hw() again. Yes. Will need to set load_status as FIRMWARE_FAIL. err_uc_init: - intel_uc_fini(dev_priv); + if (ret != -EIO) + intel_uc_fini(dev_priv); err_pm: if (ret != -EIO) { intel_cleanup_gt_powersave(dev_priv); @@ -5386,10 +5389,10 @@ int i915_gem_init(struct drm_i915_private *dev_priv) intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); mutex_unlock(_priv->drm.struct_mutex); - intel_uc_fini_misc(dev_priv); - - if (ret != -EIO) + if (ret != -EIO) { + intel_uc_fini_misc(dev_priv); i915_gem_cleanup_userptr(dev_priv); + } if (ret == -EIO) { /* Comment here can be updated to say "Allow engines or uC initialization to fail ... " ok diff --git a/drivers/gpu/drm/i915/intel_guc.h b/drivers/gpu/drm/i915/intel_guc.h index 52856a9..512ff7b 100644 --- a/drivers/gpu/drm/i915/intel_guc.h +++ b/drivers/gpu/drm/i915/intel_guc.h @@ -100,6 +100,11 @@ static inline void intel_guc_notify(struct intel_guc *guc) guc->notify(guc); } +static inline bool intel_guc_is_loaded(struct intel_guc *guc) +{ + return intel_uc_fw_is_loaded(>fw); +} + /* * GuC does not allow any gfx GGTT address that falls into range [0, WOPCM_TOP), * which is reserved for Boot ROM, SRAM and WOPCM. Currently this top address is diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c index 9f1bac6..75d0eb9 100644 --- a/drivers/gpu/drm/i915/intel_uc.c +++ b/drivers/gpu/drm/i915/intel_uc.c @@ -421,11 +421,13 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv) * Note that there is no fallback as either user explicitly asked for * the GuC or driver default option was to run with the GuC enabled. */ - if (GEM_WARN_ON(ret == -EIO)) - ret = -EINVAL; - dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret); - return ret; + + /* Mark GuC firmware as failed to avoid redundant clean-up */ uc_fini_hw is not redundant. uc_fini[_misc] was redundant. May be we should say "to avoid clean-up on wedged" ok + guc->fw.load_status = INTEL_UC_FIRMWARE_FAIL; + + /* We want to disable GPU submission but keep KMS alive */ + return -EIO; } void intel_uc_fini_hw(struct drm_i915_private *dev_priv) @@ -437,6 +439,9 @@ void intel_uc_fini_hw(struct drm_i915_private *dev_priv) GEM_BUG_ON(!HAS_GUC(dev_priv)); + if (!intel_guc_is_loaded(guc)) + return; + Can we skip based on i915_terminally_wedged instead? I'm not sure, as we declare GPU as wedged
Re: [Intel-gfx] [PATCH v2 1/2] drm/i915/guc: Use correct error code for GuC initialization failure
diff --git a/drivers/gpu/drm/i915/intel_uc_fw.h b/drivers/gpu/drm/i915/intel_uc_fw.h index d5fd460..0e3b237 100644 --- a/drivers/gpu/drm/i915/intel_uc_fw.h +++ b/drivers/gpu/drm/i915/intel_uc_fw.h @@ -115,6 +115,11 @@ static inline bool intel_uc_fw_is_selected(struct intel_uc_fw *uc_fw) return uc_fw->path != NULL; } +static inline bool intel_uc_fw_is_loaded(struct intel_uc_fw *uc_fw) +{ + return uc_fw->load_status == INTEL_UC_FIRMWARE_SUCCESS; Since we do not immediately update uc_fw->load_status after full GPU reset we have a small window of time during re-initialization where this function would falsely return true. We don't hit the issue in this patch, but I'd personally prefer not to add this function until uc_fw->load_status is correctly updated as we might inadvertently start to use it at the wrong time. Alternatively, if you want to merge this soon we could read the status from the HW as an initial version and then switch to uc_fw->load_status after we've fixed it. Daniele +} + void intel_uc_fw_fetch(struct drm_i915_private *dev_priv, struct intel_uc_fw *uc_fw); int intel_uc_fw_upload(struct intel_uc_fw *uc_fw, ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH v2 1/2] drm/i915/guc: Use correct error code for GuC initialization failure
On Wed, 21 Feb 2018 09:08:08 +0100, Sagar Arun Kamblewrote: On 2/21/2018 4:27 AM, Michal Wajdeczko wrote: Since commit 6ca9a2beb54a ("drm/i915: Unwind i915_gem_init() failure") we believed that we correctly handle all errors encountered during GuC initialization, including special one that indicates request to run driver with disabled GPU submission (-EIO). Unfortunately since commit 121981fafe69 ("drm/i915/guc: Combine enable_guc_loading|submission modparams") we stopped using that error code to avoid unwanted fallback to execlist submission mode. In result any GuC initialization failure was treated as non-recoverable error leading to driver load abort, so we could not even read related GuC error log to investigate cause of the problem. Fix that by always returning -EIO on uC hardware related failure. v2: don't allow -EIO from uc_init don't call uc_fini[_misc] on -EIO mark guc fw as failed on hw init failure prepare uc_fini_hw to run after earlier -EIO Signed-off-by: Michal Wajdeczko Cc: Chris Wilson Cc: Michal Winiarski Cc: Daniele Ceraolo Spurio Cc: Sagar Arun Kamble --- drivers/gpu/drm/i915/i915_gem.c| 13 - drivers/gpu/drm/i915/intel_guc.h | 5 + drivers/gpu/drm/i915/intel_uc.c| 13 + drivers/gpu/drm/i915/intel_uc_fw.h | 5 + 4 files changed, 27 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 631a2db..80f23a8 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -5324,8 +5324,10 @@ int i915_gem_init(struct drm_i915_private *dev_priv) intel_init_gt_powersave(dev_priv); ret = intel_uc_init(dev_priv); - if (ret) + if (ret) { + GEM_BUG_ON(ret == -EIO); goto err_pm; + } ret = i915_gem_init_hw(dev_priv); if (ret) @@ -5372,7 +5374,8 @@ int i915_gem_init(struct drm_i915_private *dev_priv) i915_gem_contexts_lost(dev_priv); intel_uc_fini_hw(dev_priv); This uc_fini_hw should also be not called on -EIO? This one here is fine. But I need to clear guc->fw.load_status there to make sure we will not try to perform full fini_hw() again. err_uc_init: - intel_uc_fini(dev_priv); + if (ret != -EIO) + intel_uc_fini(dev_priv); err_pm: if (ret != -EIO) { intel_cleanup_gt_powersave(dev_priv); @@ -5386,10 +5389,10 @@ int i915_gem_init(struct drm_i915_private *dev_priv) intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); mutex_unlock(_priv->drm.struct_mutex); - intel_uc_fini_misc(dev_priv); - - if (ret != -EIO) + if (ret != -EIO) { + intel_uc_fini_misc(dev_priv); i915_gem_cleanup_userptr(dev_priv); + } if (ret == -EIO) { /* Comment here can be updated to say "Allow engines or uC initialization to fail ... " ok diff --git a/drivers/gpu/drm/i915/intel_guc.h b/drivers/gpu/drm/i915/intel_guc.h index 52856a9..512ff7b 100644 --- a/drivers/gpu/drm/i915/intel_guc.h +++ b/drivers/gpu/drm/i915/intel_guc.h @@ -100,6 +100,11 @@ static inline void intel_guc_notify(struct intel_guc *guc) guc->notify(guc); } +static inline bool intel_guc_is_loaded(struct intel_guc *guc) +{ + return intel_uc_fw_is_loaded(>fw); +} + /* * GuC does not allow any gfx GGTT address that falls into range [0, WOPCM_TOP), * which is reserved for Boot ROM, SRAM and WOPCM. Currently this top address is diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c index 9f1bac6..75d0eb9 100644 --- a/drivers/gpu/drm/i915/intel_uc.c +++ b/drivers/gpu/drm/i915/intel_uc.c @@ -421,11 +421,13 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv) * Note that there is no fallback as either user explicitly asked for * the GuC or driver default option was to run with the GuC enabled. */ - if (GEM_WARN_ON(ret == -EIO)) - ret = -EINVAL; - dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret); - return ret; + + /* Mark GuC firmware as failed to avoid redundant clean-up */ uc_fini_hw is not redundant. uc_fini[_misc] was redundant. May be we should say "to avoid clean-up on wedged" ok + guc->fw.load_status = INTEL_UC_FIRMWARE_FAIL; + + /* We want to disable GPU submission but keep KMS alive */ + return -EIO; } void intel_uc_fini_hw(struct drm_i915_private *dev_priv) @@ -437,6 +439,9 @@ void intel_uc_fini_hw(struct drm_i915_private *dev_priv) GEM_BUG_ON(!HAS_GUC(dev_priv)); + if (!intel_guc_is_loaded(guc)) + return; + Can we skip based on i915_terminally_wedged instead? I'm
Re: [Intel-gfx] [PATCH v2 1/2] drm/i915/guc: Use correct error code for GuC initialization failure
Quoting Michal Wajdeczko (2018-02-20 22:57:10) > Since commit 6ca9a2beb54a ("drm/i915: Unwind i915_gem_init() failure") > we believed that we correctly handle all errors encountered during > GuC initialization, including special one that indicates request to > run driver with disabled GPU submission (-EIO). > > Unfortunately since commit 121981fafe69 ("drm/i915/guc: Combine > enable_guc_loading|submission modparams") we stopped using that > error code to avoid unwanted fallback to execlist submission mode. > > In result any GuC initialization failure was treated as non-recoverable > error leading to driver load abort, so we could not even read related > GuC error log to investigate cause of the problem. > > Fix that by always returning -EIO on uC hardware related failure. > > v2: don't allow -EIO from uc_init > don't call uc_fini[_misc] on -EIO > mark guc fw as failed on hw init failure > prepare uc_fini_hw to run after earlier -EIO > > Signed-off-by: Michal Wajdeczko> Cc: Chris Wilson > Cc: Michal Winiarski > Cc: Daniele Ceraolo Spurio > Cc: Sagar Arun Kamble Whilst thinking about this, do you want to disable (or rather prevent setup of) guc submission if the driver is already wedged? -Chris ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH v2 1/2] drm/i915/guc: Use correct error code for GuC initialization failure
On 2/21/2018 4:27 AM, Michal Wajdeczko wrote: Since commit 6ca9a2beb54a ("drm/i915: Unwind i915_gem_init() failure") we believed that we correctly handle all errors encountered during GuC initialization, including special one that indicates request to run driver with disabled GPU submission (-EIO). Unfortunately since commit 121981fafe69 ("drm/i915/guc: Combine enable_guc_loading|submission modparams") we stopped using that error code to avoid unwanted fallback to execlist submission mode. In result any GuC initialization failure was treated as non-recoverable error leading to driver load abort, so we could not even read related GuC error log to investigate cause of the problem. Fix that by always returning -EIO on uC hardware related failure. v2: don't allow -EIO from uc_init don't call uc_fini[_misc] on -EIO mark guc fw as failed on hw init failure prepare uc_fini_hw to run after earlier -EIO Signed-off-by: Michal WajdeczkoCc: Chris Wilson Cc: Michal Winiarski Cc: Daniele Ceraolo Spurio Cc: Sagar Arun Kamble --- drivers/gpu/drm/i915/i915_gem.c| 13 - drivers/gpu/drm/i915/intel_guc.h | 5 + drivers/gpu/drm/i915/intel_uc.c| 13 + drivers/gpu/drm/i915/intel_uc_fw.h | 5 + 4 files changed, 27 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 631a2db..80f23a8 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -5324,8 +5324,10 @@ int i915_gem_init(struct drm_i915_private *dev_priv) intel_init_gt_powersave(dev_priv); ret = intel_uc_init(dev_priv); - if (ret) + if (ret) { + GEM_BUG_ON(ret == -EIO); goto err_pm; + } ret = i915_gem_init_hw(dev_priv); if (ret) @@ -5372,7 +5374,8 @@ int i915_gem_init(struct drm_i915_private *dev_priv) i915_gem_contexts_lost(dev_priv); intel_uc_fini_hw(dev_priv); This uc_fini_hw should also be not called on -EIO? err_uc_init: - intel_uc_fini(dev_priv); + if (ret != -EIO) + intel_uc_fini(dev_priv); err_pm: if (ret != -EIO) { intel_cleanup_gt_powersave(dev_priv); @@ -5386,10 +5389,10 @@ int i915_gem_init(struct drm_i915_private *dev_priv) intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL); mutex_unlock(_priv->drm.struct_mutex); - intel_uc_fini_misc(dev_priv); - - if (ret != -EIO) + if (ret != -EIO) { + intel_uc_fini_misc(dev_priv); i915_gem_cleanup_userptr(dev_priv); + } if (ret == -EIO) { /* Comment here can be updated to say "Allow engines or uC initialization to fail ... " diff --git a/drivers/gpu/drm/i915/intel_guc.h b/drivers/gpu/drm/i915/intel_guc.h index 52856a9..512ff7b 100644 --- a/drivers/gpu/drm/i915/intel_guc.h +++ b/drivers/gpu/drm/i915/intel_guc.h @@ -100,6 +100,11 @@ static inline void intel_guc_notify(struct intel_guc *guc) guc->notify(guc); } +static inline bool intel_guc_is_loaded(struct intel_guc *guc) +{ + return intel_uc_fw_is_loaded(>fw); +} + /* * GuC does not allow any gfx GGTT address that falls into range [0, WOPCM_TOP), * which is reserved for Boot ROM, SRAM and WOPCM. Currently this top address is diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c index 9f1bac6..75d0eb9 100644 --- a/drivers/gpu/drm/i915/intel_uc.c +++ b/drivers/gpu/drm/i915/intel_uc.c @@ -421,11 +421,13 @@ int intel_uc_init_hw(struct drm_i915_private *dev_priv) * Note that there is no fallback as either user explicitly asked for * the GuC or driver default option was to run with the GuC enabled. */ - if (GEM_WARN_ON(ret == -EIO)) - ret = -EINVAL; - dev_err(dev_priv->drm.dev, "GuC initialization failed %d\n", ret); - return ret; + + /* Mark GuC firmware as failed to avoid redundant clean-up */ uc_fini_hw is not redundant. uc_fini[_misc] was redundant. May be we should say "to avoid clean-up on wedged" + guc->fw.load_status = INTEL_UC_FIRMWARE_FAIL; + + /* We want to disable GPU submission but keep KMS alive */ + return -EIO; } void intel_uc_fini_hw(struct drm_i915_private *dev_priv) @@ -437,6 +439,9 @@ void intel_uc_fini_hw(struct drm_i915_private *dev_priv) GEM_BUG_ON(!HAS_GUC(dev_priv)); + if (!intel_guc_is_loaded(guc)) + return; + Can we skip based on i915_terminally_wedged instead? Similarly wedged check is needed for invoking other portion of i915_gem_fini like gem_cleanup_engines, gem_contexts_fini since we skipped them on -EIO during gem_init. if (USES_GUC_SUBMISSION(dev_priv)) intel_guc_submission_disable(guc); diff --git