RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired
[AMD Official Use Only - General] For mode2 reset, only call SDMA/GFX suspend to disable SDMA/GFX ecc_irq, driver just need enable SDMA/GFX ecc_irq during resume process. Think about below scenario on aqua vanjaram, user modprobe amdgpu with reset_method=3, driver will do GPU recovery if the SDMA uncorrectable error is triggered, It's difficult to distinguish whether need resume gmc ecc_irq, nbio ras_controller_irq, nbio ras_err_event_athub_irq since driver do full gpu reset. Regards, Stanley > -Original Message- > From: Zhang, Hawking > Sent: Monday, December 18, 2023 3:03 PM > To: Yang, Stanley ; amd-gfx@lists.freedesktop.org > Cc: Yang, Stanley > Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable > unpaired > > [AMD Official Use Only - General] > > Can we put the irq resume in amdgpu_ras_resume? > > Regards, > Hawking > > -Original Message- > From: amd-gfx On Behalf Of > Stanley.Yang > Sent: Saturday, December 16, 2023 00:50 > To: amd-gfx@lists.freedesktop.org > Cc: Yang, Stanley > Subject: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired > > The ecc_irq is disabled while GPU mode2 reset suspending process, but not be > enabled during GPU mode2 reset resume process. > > Signed-off-by: Stanley.Yang > --- > drivers/gpu/drm/amd/amdgpu/aldebaran.c | 6 > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37 > + > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ++ > drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 4 +++ > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++ > 6 files changed, 54 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c > b/drivers/gpu/drm/amd/amdgpu/aldebaran.c > index 02f4c6f9d4f6..ba9238a93064 100644 > --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c > +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c > @@ -358,6 +358,12 @@ aldebaran_mode2_restore_hwcontext(struct > amdgpu_reset_control *reset_ctl, > /* Resume RAS */ > amdgpu_ras_resume(tmp_adev); > > + r = amdgpu_ras_late_resume(tmp_adev); > + if (r) { > + dev_err(tmp_adev->dev, "amdgpu_ras_late_resume > failed %d\n", r); > + goto end; > + } > + > /* Update PSP FW topology after reset */ > if (reset_context->hive && > tmp_adev->gmc.xgmi.num_physical_nodes > 1) diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index 8a04fb6c7c1f..318e77c493f2 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -3164,6 +3164,43 @@ int amdgpu_ras_late_init(struct amdgpu_device > *adev) > return 0; > } > > +/* Handle mode 2 reset, resume ecc irq state */ int > +amdgpu_ras_late_resume(struct amdgpu_device *adev) { > + struct amdgpu_ras_block_list *node, *tmp; > + struct amdgpu_ras_block_object *obj; > + int r; > + > + /* Guest side doesn't need init ras feature */ > + if (amdgpu_sriov_vf(adev)) > + return 0; > + > + list_for_each_entry_safe(node, tmp, >ras_list, node) { > + if (!node->ras_obj) { > + dev_warn(adev->dev, "Warning: abnormal ras list > node.\n"); > + continue; > + } > + > + obj = node->ras_obj; > + > + if (!(obj->ras_comm.block == AMDGPU_RAS_BLOCK__SDMA || > + obj->ras_comm.block == AMDGPU_RAS_BLOCK__GFX)) > + continue; > + > + if (obj->ras_late_init) { > + r = obj->ras_late_init(adev, >ras_comm); > + if (r) { > + dev_err(adev->dev, "%s failed to execute > ras_late_init! > ret:%d\n", > + obj->ras_comm.name, r); > + return r; > + } > + } else > + amdgpu_ras_block_late_init_default(adev, > >ras_comm); > + } > + > + return 0; > +} > + > /* do some fini work before IP fini as dependence */ int > amdgpu_ras_pre_fini(struct amdgpu_device *adev) { diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > index 6a941eb8fb8f..5c1ffc5a6899 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h > @
RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired
[AMD Official Use Only - General] Yes, we can only call gfx/sdma ras late init in aldebaran_mode2_restore_ip, will update. Regards, Stanley > -Original Message- > From: Zhang, Hawking > Sent: Monday, December 18, 2023 8:37 PM > To: Yang, Stanley ; amd-gfx@lists.freedesktop.org > Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable > unpaired > > [AMD Official Use Only - General] > > In such case, can we call amdgpu_gfx_ras_late_init and > amdgpu_sdma_ras_late_init in aldebaran_mode2_restore_ip? > > Regards, > Hawking > > -Original Message- > From: Yang, Stanley > Sent: Monday, December 18, 2023 17:30 > To: Zhang, Hawking ; amd- > g...@lists.freedesktop.org > Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable > unpaired > > [AMD Official Use Only - General] > > For mode2 reset, only call SDMA/GFX suspend to disable SDMA/GFX ecc_irq, > driver just need enable SDMA/GFX ecc_irq during resume process. > Think about below scenario on aqua vanjaram, user modprobe amdgpu with > reset_method=3, driver will do GPU recovery if the SDMA uncorrectable error > is triggered, It's difficult to distinguish whether need resume gmc ecc_irq, > nbio > ras_controller_irq, nbio ras_err_event_athub_irq since driver do full gpu > reset. > > Regards, > Stanley > > -Original Message- > > From: Zhang, Hawking > > Sent: Monday, December 18, 2023 3:03 PM > > To: Yang, Stanley ; > > amd-gfx@lists.freedesktop.org > > Cc: Yang, Stanley > > Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable > > unpaired > > > > [AMD Official Use Only - General] > > > > Can we put the irq resume in amdgpu_ras_resume? > > > > Regards, > > Hawking > > > > -Original Message- > > From: amd-gfx On Behalf Of > > Stanley.Yang > > Sent: Saturday, December 16, 2023 00:50 > > To: amd-gfx@lists.freedesktop.org > > Cc: Yang, Stanley > > Subject: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable > > unpaired > > > > The ecc_irq is disabled while GPU mode2 reset suspending process, but > > not be enabled during GPU mode2 reset resume process. > > > > Signed-off-by: Stanley.Yang > > --- > > drivers/gpu/drm/amd/amdgpu/aldebaran.c | 6 > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37 > > + > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + > > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ++ > > drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 4 +++ > > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++ > > 6 files changed, 54 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c > > b/drivers/gpu/drm/amd/amdgpu/aldebaran.c > > index 02f4c6f9d4f6..ba9238a93064 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c > > +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c > > @@ -358,6 +358,12 @@ aldebaran_mode2_restore_hwcontext(struct > > amdgpu_reset_control *reset_ctl, > > /* Resume RAS */ > > amdgpu_ras_resume(tmp_adev); > > > > + r = amdgpu_ras_late_resume(tmp_adev); > > + if (r) { > > + dev_err(tmp_adev->dev, "amdgpu_ras_late_resume > > failed %d\n", r); > > + goto end; > > + } > > + > > /* Update PSP FW topology after reset */ > > if (reset_context->hive && > > tmp_adev->gmc.xgmi.num_physical_nodes > 1) diff > > --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > index 8a04fb6c7c1f..318e77c493f2 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > @@ -3164,6 +3164,43 @@ int amdgpu_ras_late_init(struct amdgpu_device > > *adev) > > return 0; > > } > > > > +/* Handle mode 2 reset, resume ecc irq state */ int > > +amdgpu_ras_late_resume(struct amdgpu_device *adev) { > > + struct amdgpu_ras_block_list *node, *tmp; > > + struct amdgpu_ras_block_object *obj; > > + int r; > > + > > + /* Guest side doesn't need init ras feature */ > > + if (amdgpu_sriov_vf(adev)) > > + return 0; > > + > > + list_for_each_entry_safe(node, tmp, >ras_list, node) { > > + if (!node->ras_obj) { > > + dev_warn(adev->dev, "Warning: abnormal ras list > >
RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired
[AMD Official Use Only - General] In such case, can we call amdgpu_gfx_ras_late_init and amdgpu_sdma_ras_late_init in aldebaran_mode2_restore_ip? Regards, Hawking -Original Message- From: Yang, Stanley Sent: Monday, December 18, 2023 17:30 To: Zhang, Hawking ; amd-gfx@lists.freedesktop.org Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired [AMD Official Use Only - General] For mode2 reset, only call SDMA/GFX suspend to disable SDMA/GFX ecc_irq, driver just need enable SDMA/GFX ecc_irq during resume process. Think about below scenario on aqua vanjaram, user modprobe amdgpu with reset_method=3, driver will do GPU recovery if the SDMA uncorrectable error is triggered, It's difficult to distinguish whether need resume gmc ecc_irq, nbio ras_controller_irq, nbio ras_err_event_athub_irq since driver do full gpu reset. Regards, Stanley > -Original Message- > From: Zhang, Hawking > Sent: Monday, December 18, 2023 3:03 PM > To: Yang, Stanley ; > amd-gfx@lists.freedesktop.org > Cc: Yang, Stanley > Subject: RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable > unpaired > > [AMD Official Use Only - General] > > Can we put the irq resume in amdgpu_ras_resume? > > Regards, > Hawking > > -Original Message- > From: amd-gfx On Behalf Of > Stanley.Yang > Sent: Saturday, December 16, 2023 00:50 > To: amd-gfx@lists.freedesktop.org > Cc: Yang, Stanley > Subject: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable > unpaired > > The ecc_irq is disabled while GPU mode2 reset suspending process, but > not be enabled during GPU mode2 reset resume process. > > Signed-off-by: Stanley.Yang > --- > drivers/gpu/drm/amd/amdgpu/aldebaran.c | 6 > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37 > + > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + > drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ++ > drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 4 +++ > drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++ > 6 files changed, 54 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c > b/drivers/gpu/drm/amd/amdgpu/aldebaran.c > index 02f4c6f9d4f6..ba9238a93064 100644 > --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c > +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c > @@ -358,6 +358,12 @@ aldebaran_mode2_restore_hwcontext(struct > amdgpu_reset_control *reset_ctl, > /* Resume RAS */ > amdgpu_ras_resume(tmp_adev); > > + r = amdgpu_ras_late_resume(tmp_adev); > + if (r) { > + dev_err(tmp_adev->dev, "amdgpu_ras_late_resume > failed %d\n", r); > + goto end; > + } > + > /* Update PSP FW topology after reset */ > if (reset_context->hive && > tmp_adev->gmc.xgmi.num_physical_nodes > 1) diff > --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > index 8a04fb6c7c1f..318e77c493f2 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > @@ -3164,6 +3164,43 @@ int amdgpu_ras_late_init(struct amdgpu_device > *adev) > return 0; > } > > +/* Handle mode 2 reset, resume ecc irq state */ int > +amdgpu_ras_late_resume(struct amdgpu_device *adev) { > + struct amdgpu_ras_block_list *node, *tmp; > + struct amdgpu_ras_block_object *obj; > + int r; > + > + /* Guest side doesn't need init ras feature */ > + if (amdgpu_sriov_vf(adev)) > + return 0; > + > + list_for_each_entry_safe(node, tmp, >ras_list, node) { > + if (!node->ras_obj) { > + dev_warn(adev->dev, "Warning: abnormal ras list > node.\n"); > + continue; > + } > + > + obj = node->ras_obj; > + > + if (!(obj->ras_comm.block == AMDGPU_RAS_BLOCK__SDMA || > + obj->ras_comm.block == AMDGPU_RAS_BLOCK__GFX)) > + continue; > + > + if (obj->ras_late_init) { > + r = obj->ras_late_init(adev, >ras_comm); > + if (r) { > + dev_err(adev->dev, "%s failed to execute > ras_late_init! > ret:%d\n", > + obj->ras_comm.name, r); > + return r; > + } > + } else > + amdgpu_ras_block_late_init_default(adev, > >ras_comm); > + } > + > + return 0; &g
RE: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired
[AMD Official Use Only - General] Can we put the irq resume in amdgpu_ras_resume? Regards, Hawking -Original Message- From: amd-gfx On Behalf Of Stanley.Yang Sent: Saturday, December 16, 2023 00:50 To: amd-gfx@lists.freedesktop.org Cc: Yang, Stanley Subject: [PATCH Review 1/1] drm/amdgpu: Fix ecc irq enable/disable unpaired The ecc_irq is disabled while GPU mode2 reset suspending process, but not be enabled during GPU mode2 reset resume process. Signed-off-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/aldebaran.c | 6 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37 + drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 + drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 3 ++ drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 4 +++ drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++ 6 files changed, 54 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/aldebaran.c b/drivers/gpu/drm/amd/amdgpu/aldebaran.c index 02f4c6f9d4f6..ba9238a93064 100644 --- a/drivers/gpu/drm/amd/amdgpu/aldebaran.c +++ b/drivers/gpu/drm/amd/amdgpu/aldebaran.c @@ -358,6 +358,12 @@ aldebaran_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl, /* Resume RAS */ amdgpu_ras_resume(tmp_adev); + r = amdgpu_ras_late_resume(tmp_adev); + if (r) { + dev_err(tmp_adev->dev, "amdgpu_ras_late_resume failed %d\n", r); + goto end; + } + /* Update PSP FW topology after reset */ if (reset_context->hive && tmp_adev->gmc.xgmi.num_physical_nodes > 1) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 8a04fb6c7c1f..318e77c493f2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -3164,6 +3164,43 @@ int amdgpu_ras_late_init(struct amdgpu_device *adev) return 0; } +/* Handle mode 2 reset, resume ecc irq state */ int +amdgpu_ras_late_resume(struct amdgpu_device *adev) { + struct amdgpu_ras_block_list *node, *tmp; + struct amdgpu_ras_block_object *obj; + int r; + + /* Guest side doesn't need init ras feature */ + if (amdgpu_sriov_vf(adev)) + return 0; + + list_for_each_entry_safe(node, tmp, >ras_list, node) { + if (!node->ras_obj) { + dev_warn(adev->dev, "Warning: abnormal ras list node.\n"); + continue; + } + + obj = node->ras_obj; + + if (!(obj->ras_comm.block == AMDGPU_RAS_BLOCK__SDMA || + obj->ras_comm.block == AMDGPU_RAS_BLOCK__GFX)) + continue; + + if (obj->ras_late_init) { + r = obj->ras_late_init(adev, >ras_comm); + if (r) { + dev_err(adev->dev, "%s failed to execute ras_late_init! ret:%d\n", + obj->ras_comm.name, r); + return r; + } + } else + amdgpu_ras_block_late_init_default(adev, >ras_comm); + } + + return 0; +} + /* do some fini work before IP fini as dependence */ int amdgpu_ras_pre_fini(struct amdgpu_device *adev) { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h index 6a941eb8fb8f..5c1ffc5a6899 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h @@ -693,6 +693,7 @@ amdgpu_ras_error_to_ta(enum amdgpu_ras_error_type error) { /* called in ip_init and ip_fini */ int amdgpu_ras_init(struct amdgpu_device *adev); int amdgpu_ras_late_init(struct amdgpu_device *adev); +int amdgpu_ras_late_resume(struct amdgpu_device *adev); int amdgpu_ras_fini(struct amdgpu_device *adev); int amdgpu_ras_pre_fini(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c index 09cbca596bb5..b93a0baeb2d3 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c @@ -1043,6 +1043,9 @@ static int gmc_v10_0_hw_fini(void *handle) amdgpu_irq_put(adev, >gmc.vm_fault, 0); + if (adev->gmc.ecc_irq.funcs) + amdgpu_irq_put(adev, >gmc.ecc_irq, 0); + return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c index 416f3e4f0438..e633e60850b3 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c @@ -941,6 +941,10 @@ static int gmc_v11_0_hw_fini(void *handle) } amdgpu_irq_put(adev, >gmc.vm_fault, 0); + + if (adev->gmc.ecc_irq.funcs) + amdgpu_irq_put(adev, >gmc.ecc_irq, 0); + gmc_v11_0_gart_disable(adev); return 0; diff --git