Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Felix Kuehling



On 2023-12-14 10:06, Alex Deucher wrote:

On Thu, Dec 14, 2023 at 9:24 AM Liu, Shaoyun  wrote:

[AMD Official Use Only - General]

The gmc flush tlb function is used on both  baremetal and sriov.   But the  
function  amdgpu_virt_kiq_reg_write_reg_wait is defined in amdgpu_virt.c with  
name  'virt'  make it appear as a SRIOV only function, this sounds confusion . 
Will it make more sense to move the function out of amdgpu_virt.c file and  
rename it as amdgpu_kig_reg_write_reg_wait ?

Another thing I'm not sure is inside amdgpu_virt_kiq_reg_write_reg_wait , has 
below logic :
 if (adev->mes.ring.sched.ready) {
 amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1,
   ref, mask);
 return;
 }
On MES enabled situation , it will always call to mes queue to do the register write 
and  wait .  Shouldn't this OP been directly send to kiq itself ?  The ring for kiq 
and  mes is different ,  driver should use kiq(adev->gfx.kiq[0].ring) for these 
register read/write or wait operation  and  mes ( adev->mes.ring) for add/remove 
queues  etc.


I understand why it is needed for SR-IOV.  Is there a reason to use
the MES or KIQ for TLB invalidation rather than the register method on
bare metal?  It looks like the register method is never used anymore.
  Seems like we should either, make the KIQ/MES method SR-IOV only, or
drop the register method and just always use KIQ/MES.


It can be useful as a fallback or workaround for FW issues. We also use 
it when running without HWS (amdgpu.sched_policy=2). This is not the 
production code path, but can be useful for triaging on GPUs without MES.


Regards,
  Felix




Alex



Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, December 14, 2023 4:22 AM
To: Alex Deucher ; Limonciello, Mario 

Cc: Huang, Tim ; amd-gfx@lists.freedesktop.org; Koenig, Christian 
; sta...@vger.kernel.org
Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to 
flush TLB

Am 13.12.23 um 20:44 schrieb Alex Deucher:

On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
 wrote:

On 12/13/2023 13:12, Mario Limonciello wrote:

On 12/13/2023 13:07, Alex Deucher wrote:

On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
 wrote:

Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
causes the first MES packet after resume to fail. This packet is
used to flush the TLB when GART is enabled.

This issue is fixed in newer firmware, but as OEMs may not roll
this out to the field, introduce a workaround that will retry the
flush when detecting running on an older firmware and decrease
relevant error messages to debug while workaround is in use.

Cc: sta...@vger.kernel.org # 6.1+
Cc: Tim Huang 
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
Signed-off-by: Mario Limonciello 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 9ddbf1494326..6ce3f6e6b6de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
amdgpu_device *adev,
   }

   r = adev->mes.funcs->misc_op(>mes, _input);
-   if (r)
-   DRM_ERROR("failed to reg_write_reg_wait\n");
+   if (r) {
+   const char *msg = "failed to
+ reg_write_reg_wait\n";
+
+   if (adev->mes.suspend_workaround)
+   DRM_DEBUG(msg);
+   else
+   DRM_ERROR(msg);
+   }

error:
   return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index a27b424ffe00..90f2bba3b12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -135,6 +135,8 @@ struct amdgpu_mes {

   /* ip specific functions */
   const struct amdgpu_mes_funcs   *funcs;
+
+   boolsuspend_workaround;
};

struct amdgpu_mes_process {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 23d7b548d13f..e810c7bb3156 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
amdgpu_device *adev)
   false : true;

   adev->mmhub.funcs->set_fault_enable_default(adev, value);
-   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+
+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   

Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Mario Limonciello

On 12/14/2023 03:21, Christian König wrote:

Am 13.12.23 um 20:44 schrieb Alex Deucher:

On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
 wrote:

On 12/13/2023 13:12, Mario Limonciello wrote:

On 12/13/2023 13:07, Alex Deucher wrote:

On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
 wrote:

Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
causes the first MES packet after resume to fail. This packet is
used to flush the TLB when GART is enabled.

This issue is fixed in newer firmware, but as OEMs may not roll this
out to the field, introduce a workaround that will retry the flush
when detecting running on an older firmware and decrease relevant
error messages to debug while workaround is in use.

Cc: sta...@vger.kernel.org # 6.1+
Cc: Tim Huang 
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
Signed-off-by: Mario Limonciello 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
   drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
   4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 9ddbf1494326..6ce3f6e6b6de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
amdgpu_device *adev,
  }

  r = adev->mes.funcs->misc_op(>mes, _input);
-   if (r)
-   DRM_ERROR("failed to reg_write_reg_wait\n");
+   if (r) {
+   const char *msg = "failed to reg_write_reg_wait\n";
+
+   if (adev->mes.suspend_workaround)
+   DRM_DEBUG(msg);
+   else
+   DRM_ERROR(msg);
+   }

   error:
  return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index a27b424ffe00..90f2bba3b12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -135,6 +135,8 @@ struct amdgpu_mes {

  /* ip specific functions */
  const struct amdgpu_mes_funcs   *funcs;
+
+   bool    suspend_workaround;
   };

   struct amdgpu_mes_process {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 23d7b548d13f..e810c7bb3156 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
amdgpu_device *adev)
  false : true;

  adev->mmhub.funcs->set_fault_enable_default(adev, value);
-   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+
+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 
0);

+   adev->mes.suspend_workaround = false;
+   } while (adev->mes.suspend_workaround);

Shouldn't this be something like:


+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 
0);

+   adev->mes.suspend_workaround = false;
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 
0);

+   } while (adev->mes.suspend_workaround);

If we actually need the flush.  Maybe a better approach would be to
check if we are in s0ix in

Ah you're right; I had shifted this around to keep less stateful
variables and push them up the stack from when I first made it and that
logic is wrong now.

I don't think the one you suggested is right either; it's going to 
apply

twice on ASICs that only need it once.

I guess pending on what Christian comments on below I'll respin to 
logic

that only calls twice on resume for these ASICs.

One more comment.  Tim and I both did an experiment for this (skipping
the flush) separately.  The problem isn't the flush itself, rather it's
the first MES packet after exiting GFXOFF.


Well that's an ugly one. Can that happen every time GFXOFF kicks in?


No; it's specific to the exit from s0i3.





So it seems that it pushes off the issue to the next thing which is a
ring buffer test:

[drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.0
(-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).

So maybe a better workaround is a "dummy" command that is only sent for
the broken firmware that we don't care about the outcome and discard 
errors.


Then the workaround doesn't need to get as entangled with correct flow.

Yeah. Something like that seems cleaner.  Just a question of where to
put it since we skip GC and MES for s0ix.  Probably somewhere in
gmc_v11_0_resume() before gmc_v11_0_gart_enable().  Maybe add a new
mes callback.


Please try to keep it completely outside of the TLB invalidation and VM 
flush handling.


OK, will continue to iterate the direction v2 went.



Regards,
Christian.



Alex


diff --git 

RE: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Liu, Shaoyun
[AMD Official Use Only - General]

I remembered we try to use kiq directly send the invalidation packet to mec FW 
with pasid  as parameter since it's FW control the  VMID and  PASID mapping , 
but during  some test , it shows performance drop compare to driver directly 
get the  vmid/pasid mapping and  do the  invalidation by itself.   For now,  
although driver still use the  kiq/mes , but only use it for the register 
read/write and wait instead  of send the  invalidate package directly . This 
actually has some potential issue like when  driver read the vmid_pasid mapping 
from hw register , FW(hw scheduler) might change the process mapping 
(vmid/pasid mapping changed)
I think it probably make more sense to directly use mmio way for bare metal, 
but I heard someone  compare with kiq , and  it doesn't make much performance 
difference.  So if we want to minimize the code difference between SRIOV and  
bare metal , use the kiq looks ok .
From my understanding , although kiq and  mes are all use mes pipe , but the  
design is different,  kiq(MES pipe 1)  is mainly used for immediate job like 
register access etc , mes (MES pipe 0 )  is a scheduler responsible for queue 
management mostly , although it has extend its support for misc-op (include 
register access), internally , it will  eventually pass these package to kig( 
pipe 1) , driver side should try to not overuse them and directly sent these 
operation to kiq if necessary.

Regards
Shaoyun.liu

-Original Message-
From: Alex Deucher 
Sent: Thursday, December 14, 2023 10:07 AM
To: Liu, Shaoyun 
Cc: Christian König ; Limonciello, Mario 
; Huang, Tim ; 
amd-gfx@lists.freedesktop.org; Koenig, Christian ; 
sta...@vger.kernel.org
Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to 
flush TLB

On Thu, Dec 14, 2023 at 9:24 AM Liu, Shaoyun  wrote:
>
> [AMD Official Use Only - General]
>
> The gmc flush tlb function is used on both  baremetal and sriov.   But the  
> function  amdgpu_virt_kiq_reg_write_reg_wait is defined in amdgpu_virt.c with 
>  name  'virt'  make it appear as a SRIOV only function, this sounds confusion 
> . Will it make more sense to move the function out of amdgpu_virt.c file and  
> rename it as amdgpu_kig_reg_write_reg_wait ?
>
> Another thing I'm not sure is inside amdgpu_virt_kiq_reg_write_reg_wait , has 
> below logic :
> if (adev->mes.ring.sched.ready) {
> amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1,
>   ref, mask);
> return;
> }
> On MES enabled situation , it will always call to mes queue to do the 
> register write and  wait .  Shouldn't this OP been directly send to kiq 
> itself ?  The ring for kiq and  mes is different ,  driver should use 
> kiq(adev->gfx.kiq[0].ring) for these register read/write or wait operation  
> and  mes ( adev->mes.ring) for add/remove queues  etc.
>

I understand why it is needed for SR-IOV.  Is there a reason to use the MES or 
KIQ for TLB invalidation rather than the register method on bare metal?  It 
looks like the register method is never used anymore.
 Seems like we should either, make the KIQ/MES method SR-IOV only, or drop the 
register method and just always use KIQ/MES.

Alex


> Regards
> Shaoyun.liu
>
> -Original Message-
> From: amd-gfx  On Behalf Of
> Christian König
> Sent: Thursday, December 14, 2023 4:22 AM
> To: Alex Deucher ; Limonciello, Mario
> 
> Cc: Huang, Tim ; amd-gfx@lists.freedesktop.org;
> Koenig, Christian ; sta...@vger.kernel.org
> Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that
> fail to flush TLB
>
> Am 13.12.23 um 20:44 schrieb Alex Deucher:
> > On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
> >  wrote:
> >> On 12/13/2023 13:12, Mario Limonciello wrote:
> >>> On 12/13/2023 13:07, Alex Deucher wrote:
>  On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
>   wrote:
> > Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> > causes the first MES packet after resume to fail. This packet is
> > used to flush the TLB when GART is enabled.
> >
> > This issue is fixed in newer firmware, but as OEMs may not roll
> > this out to the field, introduce a workaround that will retry
> > the flush when detecting running on an older firmware and
> > decrease relevant error messages to debug while workaround is in use.
> >
> > Cc: sta...@vger.kernel.org # 6.1+
> > Cc: Tim Huang 
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> > Signed-off-by: Mario Limonciello 
> > ---
> >drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
> >drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
> >drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
> >drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
> >4 files changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git 

Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Alex Deucher
On Thu, Dec 14, 2023 at 9:24 AM Liu, Shaoyun  wrote:
>
> [AMD Official Use Only - General]
>
> The gmc flush tlb function is used on both  baremetal and sriov.   But the  
> function  amdgpu_virt_kiq_reg_write_reg_wait is defined in amdgpu_virt.c with 
>  name  'virt'  make it appear as a SRIOV only function, this sounds confusion 
> . Will it make more sense to move the function out of amdgpu_virt.c file and  
> rename it as amdgpu_kig_reg_write_reg_wait ?
>
> Another thing I'm not sure is inside amdgpu_virt_kiq_reg_write_reg_wait , has 
> below logic :
> if (adev->mes.ring.sched.ready) {
> amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1,
>   ref, mask);
> return;
> }
> On MES enabled situation , it will always call to mes queue to do the 
> register write and  wait .  Shouldn't this OP been directly send to kiq 
> itself ?  The ring for kiq and  mes is different ,  driver should use 
> kiq(adev->gfx.kiq[0].ring) for these register read/write or wait operation  
> and  mes ( adev->mes.ring) for add/remove queues  etc.
>

I understand why it is needed for SR-IOV.  Is there a reason to use
the MES or KIQ for TLB invalidation rather than the register method on
bare metal?  It looks like the register method is never used anymore.
 Seems like we should either, make the KIQ/MES method SR-IOV only, or
drop the register method and just always use KIQ/MES.

Alex


> Regards
> Shaoyun.liu
>
> -Original Message-
> From: amd-gfx  On Behalf Of Christian 
> König
> Sent: Thursday, December 14, 2023 4:22 AM
> To: Alex Deucher ; Limonciello, Mario 
> 
> Cc: Huang, Tim ; amd-gfx@lists.freedesktop.org; Koenig, 
> Christian ; sta...@vger.kernel.org
> Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to 
> flush TLB
>
> Am 13.12.23 um 20:44 schrieb Alex Deucher:
> > On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
> >  wrote:
> >> On 12/13/2023 13:12, Mario Limonciello wrote:
> >>> On 12/13/2023 13:07, Alex Deucher wrote:
>  On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
>   wrote:
> > Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> > causes the first MES packet after resume to fail. This packet is
> > used to flush the TLB when GART is enabled.
> >
> > This issue is fixed in newer firmware, but as OEMs may not roll
> > this out to the field, introduce a workaround that will retry the
> > flush when detecting running on an older firmware and decrease
> > relevant error messages to debug while workaround is in use.
> >
> > Cc: sta...@vger.kernel.org # 6.1+
> > Cc: Tim Huang 
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> > Signed-off-by: Mario Limonciello 
> > ---
> >drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
> >drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
> >drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
> >drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
> >4 files changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > index 9ddbf1494326..6ce3f6e6b6de 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> > @@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
> > amdgpu_device *adev,
> >   }
> >
> >   r = adev->mes.funcs->misc_op(>mes, _input);
> > -   if (r)
> > -   DRM_ERROR("failed to reg_write_reg_wait\n");
> > +   if (r) {
> > +   const char *msg = "failed to
> > + reg_write_reg_wait\n";
> > +
> > +   if (adev->mes.suspend_workaround)
> > +   DRM_DEBUG(msg);
> > +   else
> > +   DRM_ERROR(msg);
> > +   }
> >
> >error:
> >   return r;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > index a27b424ffe00..90f2bba3b12b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> > @@ -135,6 +135,8 @@ struct amdgpu_mes {
> >
> >   /* ip specific functions */
> >   const struct amdgpu_mes_funcs   *funcs;
> > +
> > +   boolsuspend_workaround;
> >};
> >
> >struct amdgpu_mes_process {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > index 23d7b548d13f..e810c7bb3156 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > @@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
> > 

RE: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Liu, Shaoyun
[AMD Official Use Only - General]

The gmc flush tlb function is used on both  baremetal and sriov.   But the  
function  amdgpu_virt_kiq_reg_write_reg_wait is defined in amdgpu_virt.c with  
name  'virt'  make it appear as a SRIOV only function, this sounds confusion . 
Will it make more sense to move the function out of amdgpu_virt.c file and  
rename it as amdgpu_kig_reg_write_reg_wait ?

Another thing I'm not sure is inside amdgpu_virt_kiq_reg_write_reg_wait , has 
below logic :
if (adev->mes.ring.sched.ready) {
amdgpu_mes_reg_write_reg_wait(adev, reg0, reg1,
  ref, mask);
return;
}
On MES enabled situation , it will always call to mes queue to do the register 
write and  wait .  Shouldn't this OP been directly send to kiq itself ?  The 
ring for kiq and  mes is different ,  driver should use 
kiq(adev->gfx.kiq[0].ring) for these register read/write or wait operation  and 
 mes ( adev->mes.ring) for add/remove queues  etc.

Regards
Shaoyun.liu

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, December 14, 2023 4:22 AM
To: Alex Deucher ; Limonciello, Mario 

Cc: Huang, Tim ; amd-gfx@lists.freedesktop.org; Koenig, 
Christian ; sta...@vger.kernel.org
Subject: Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to 
flush TLB

Am 13.12.23 um 20:44 schrieb Alex Deucher:
> On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
>  wrote:
>> On 12/13/2023 13:12, Mario Limonciello wrote:
>>> On 12/13/2023 13:07, Alex Deucher wrote:
 On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
  wrote:
> Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> causes the first MES packet after resume to fail. This packet is
> used to flush the TLB when GART is enabled.
>
> This issue is fixed in newer firmware, but as OEMs may not roll
> this out to the field, introduce a workaround that will retry the
> flush when detecting running on an older firmware and decrease
> relevant error messages to debug while workaround is in use.
>
> Cc: sta...@vger.kernel.org # 6.1+
> Cc: Tim Huang 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> Signed-off-by: Mario Limonciello 
> ---
>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
>drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
>drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
>drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
>4 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index 9ddbf1494326..6ce3f6e6b6de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
> amdgpu_device *adev,
>   }
>
>   r = adev->mes.funcs->misc_op(>mes, _input);
> -   if (r)
> -   DRM_ERROR("failed to reg_write_reg_wait\n");
> +   if (r) {
> +   const char *msg = "failed to
> + reg_write_reg_wait\n";
> +
> +   if (adev->mes.suspend_workaround)
> +   DRM_DEBUG(msg);
> +   else
> +   DRM_ERROR(msg);
> +   }
>
>error:
>   return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index a27b424ffe00..90f2bba3b12b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -135,6 +135,8 @@ struct amdgpu_mes {
>
>   /* ip specific functions */
>   const struct amdgpu_mes_funcs   *funcs;
> +
> +   boolsuspend_workaround;
>};
>
>struct amdgpu_mes_process {
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> index 23d7b548d13f..e810c7bb3156 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> @@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
> amdgpu_device *adev)
>   false : true;
>
>   adev->mmhub.funcs->set_fault_enable_default(adev, value);
> -   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +
> +   do {
> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +   adev->mes.suspend_workaround = false;
> +   } while (adev->mes.suspend_workaround);
 Shouldn't this be something like:

> +   do {
> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +   adev->mes.suspend_workaround = 

Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-14 Thread Christian König

Am 13.12.23 um 20:44 schrieb Alex Deucher:

On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
 wrote:

On 12/13/2023 13:12, Mario Limonciello wrote:

On 12/13/2023 13:07, Alex Deucher wrote:

On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
 wrote:

Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
causes the first MES packet after resume to fail. This packet is
used to flush the TLB when GART is enabled.

This issue is fixed in newer firmware, but as OEMs may not roll this
out to the field, introduce a workaround that will retry the flush
when detecting running on an older firmware and decrease relevant
error messages to debug while workaround is in use.

Cc: sta...@vger.kernel.org # 6.1+
Cc: Tim Huang 
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
Signed-off-by: Mario Limonciello 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
   drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
   4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 9ddbf1494326..6ce3f6e6b6de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
amdgpu_device *adev,
  }

  r = adev->mes.funcs->misc_op(>mes, _input);
-   if (r)
-   DRM_ERROR("failed to reg_write_reg_wait\n");
+   if (r) {
+   const char *msg = "failed to reg_write_reg_wait\n";
+
+   if (adev->mes.suspend_workaround)
+   DRM_DEBUG(msg);
+   else
+   DRM_ERROR(msg);
+   }

   error:
  return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index a27b424ffe00..90f2bba3b12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -135,6 +135,8 @@ struct amdgpu_mes {

  /* ip specific functions */
  const struct amdgpu_mes_funcs   *funcs;
+
+   boolsuspend_workaround;
   };

   struct amdgpu_mes_process {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 23d7b548d13f..e810c7bb3156 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
amdgpu_device *adev)
  false : true;

  adev->mmhub.funcs->set_fault_enable_default(adev, value);
-   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+
+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   adev->mes.suspend_workaround = false;
+   } while (adev->mes.suspend_workaround);

Shouldn't this be something like:


+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   adev->mes.suspend_workaround = false;
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   } while (adev->mes.suspend_workaround);

If we actually need the flush.  Maybe a better approach would be to
check if we are in s0ix in

Ah you're right; I had shifted this around to keep less stateful
variables and push them up the stack from when I first made it and that
logic is wrong now.

I don't think the one you suggested is right either; it's going to apply
twice on ASICs that only need it once.

I guess pending on what Christian comments on below I'll respin to logic
that only calls twice on resume for these ASICs.

One more comment.  Tim and I both did an experiment for this (skipping
the flush) separately.  The problem isn't the flush itself, rather it's
the first MES packet after exiting GFXOFF.


Well that's an ugly one. Can that happen every time GFXOFF kicks in?



So it seems that it pushes off the issue to the next thing which is a
ring buffer test:

[drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.0
(-110).
[drm:process_one_work] *ERROR* ib ring test failed (-110).

So maybe a better workaround is a "dummy" command that is only sent for
the broken firmware that we don't care about the outcome and discard errors.

Then the workaround doesn't need to get as entangled with correct flow.

Yeah. Something like that seems cleaner.  Just a question of where to
put it since we skip GC and MES for s0ix.  Probably somewhere in
gmc_v11_0_resume() before gmc_v11_0_gart_enable().  Maybe add a new
mes callback.


Please try to keep it completely outside of the TLB invalidation and VM 
flush handling.


Regards,
Christian.



Alex


diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c in gmc_v11_0_flush_gpu_tlb():
index 23d7b548d13f..bd6d9953a80e 100644
--- 

Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-13 Thread Alex Deucher
On Wed, Dec 13, 2023 at 2:32 PM Mario Limonciello
 wrote:
>
> On 12/13/2023 13:12, Mario Limonciello wrote:
> > On 12/13/2023 13:07, Alex Deucher wrote:
> >> On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
> >>  wrote:
> >>>
> >>> Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> >>> causes the first MES packet after resume to fail. This packet is
> >>> used to flush the TLB when GART is enabled.
> >>>
> >>> This issue is fixed in newer firmware, but as OEMs may not roll this
> >>> out to the field, introduce a workaround that will retry the flush
> >>> when detecting running on an older firmware and decrease relevant
> >>> error messages to debug while workaround is in use.
> >>>
> >>> Cc: sta...@vger.kernel.org # 6.1+
> >>> Cc: Tim Huang 
> >>> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> >>> Signed-off-by: Mario Limonciello 
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
> >>>   drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
> >>>   drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
> >>>   4 files changed, 32 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> index 9ddbf1494326..6ce3f6e6b6de 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> >>> @@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct
> >>> amdgpu_device *adev,
> >>>  }
> >>>
> >>>  r = adev->mes.funcs->misc_op(>mes, _input);
> >>> -   if (r)
> >>> -   DRM_ERROR("failed to reg_write_reg_wait\n");
> >>> +   if (r) {
> >>> +   const char *msg = "failed to reg_write_reg_wait\n";
> >>> +
> >>> +   if (adev->mes.suspend_workaround)
> >>> +   DRM_DEBUG(msg);
> >>> +   else
> >>> +   DRM_ERROR(msg);
> >>> +   }
> >>>
> >>>   error:
> >>>  return r;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> index a27b424ffe00..90f2bba3b12b 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> >>> @@ -135,6 +135,8 @@ struct amdgpu_mes {
> >>>
> >>>  /* ip specific functions */
> >>>  const struct amdgpu_mes_funcs   *funcs;
> >>> +
> >>> +   boolsuspend_workaround;
> >>>   };
> >>>
> >>>   struct amdgpu_mes_process {
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> index 23d7b548d13f..e810c7bb3156 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> >>> @@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct
> >>> amdgpu_device *adev)
> >>>  false : true;
> >>>
> >>>  adev->mmhub.funcs->set_fault_enable_default(adev, value);
> >>> -   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +
> >>> +   do {
> >>> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +   adev->mes.suspend_workaround = false;
> >>> +   } while (adev->mes.suspend_workaround);
> >>
> >> Shouldn't this be something like:
> >>
> >>> +   do {
> >>> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +   adev->mes.suspend_workaround = false;
> >>> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> >>> +   } while (adev->mes.suspend_workaround);
> >>
> >> If we actually need the flush.  Maybe a better approach would be to
> >> check if we are in s0ix in
> >
> > Ah you're right; I had shifted this around to keep less stateful
> > variables and push them up the stack from when I first made it and that
> > logic is wrong now.
> >
> > I don't think the one you suggested is right either; it's going to apply
> > twice on ASICs that only need it once.
> >
> > I guess pending on what Christian comments on below I'll respin to logic
> > that only calls twice on resume for these ASICs.
>
> One more comment.  Tim and I both did an experiment for this (skipping
> the flush) separately.  The problem isn't the flush itself, rather it's
> the first MES packet after exiting GFXOFF.
>
> So it seems that it pushes off the issue to the next thing which is a
> ring buffer test:
>
> [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.0
> (-110).
> [drm:process_one_work] *ERROR* ib ring test failed (-110).
>
> So maybe a better workaround is a "dummy" command that is only sent for
> the broken firmware that we don't care about the outcome and discard errors.
>
> Then the workaround doesn't need to get as entangled with correct flow.

Yeah. Something like that seems cleaner.  Just a question of where to
put it since we skip GC and MES 

Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-13 Thread Mario Limonciello

On 12/13/2023 13:12, Mario Limonciello wrote:

On 12/13/2023 13:07, Alex Deucher wrote:

On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
 wrote:


Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
causes the first MES packet after resume to fail. This packet is
used to flush the TLB when GART is enabled.

This issue is fixed in newer firmware, but as OEMs may not roll this
out to the field, introduce a workaround that will retry the flush
when detecting running on an older firmware and decrease relevant
error messages to debug while workaround is in use.

Cc: sta...@vger.kernel.org # 6.1+
Cc: Tim Huang 
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
Signed-off-by: Mario Limonciello 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
  4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c

index 9ddbf1494326..6ce3f6e6b6de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct 
amdgpu_device *adev,

 }

 r = adev->mes.funcs->misc_op(>mes, _input);
-   if (r)
-   DRM_ERROR("failed to reg_write_reg_wait\n");
+   if (r) {
+   const char *msg = "failed to reg_write_reg_wait\n";
+
+   if (adev->mes.suspend_workaround)
+   DRM_DEBUG(msg);
+   else
+   DRM_ERROR(msg);
+   }

  error:
 return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h

index a27b424ffe00..90f2bba3b12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -135,6 +135,8 @@ struct amdgpu_mes {

 /* ip specific functions */
 const struct amdgpu_mes_funcs   *funcs;
+
+   bool    suspend_workaround;
  };

  struct amdgpu_mes_process {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c

index 23d7b548d13f..e810c7bb3156 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct 
amdgpu_device *adev)

 false : true;

 adev->mmhub.funcs->set_fault_enable_default(adev, value);
-   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+
+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   adev->mes.suspend_workaround = false;
+   } while (adev->mes.suspend_workaround);


Shouldn't this be something like:


+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   adev->mes.suspend_workaround = false;
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   } while (adev->mes.suspend_workaround);


If we actually need the flush.  Maybe a better approach would be to
check if we are in s0ix in


Ah you're right; I had shifted this around to keep less stateful 
variables and push them up the stack from when I first made it and that 
logic is wrong now.


I don't think the one you suggested is right either; it's going to apply 
twice on ASICs that only need it once.


I guess pending on what Christian comments on below I'll respin to logic 
that only calls twice on resume for these ASICs.


One more comment.  Tim and I both did an experiment for this (skipping 
the flush) separately.  The problem isn't the flush itself, rather it's 
the first MES packet after exiting GFXOFF.


So it seems that it pushes off the issue to the next thing which is a 
ring buffer test:


[drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on comp_1.0.0 
(-110).

[drm:process_one_work] *ERROR* ib ring test failed (-110).

So maybe a better workaround is a "dummy" command that is only sent for 
the broken firmware that we don't care about the outcome and discard errors.


Then the workaround doesn't need to get as entangled with correct flow.





diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c in gmc_v11_0_flush_gpu_tlb():
index 23d7b548d13f..bd6d9953a80e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -227,7 +227,8 @@ static void gmc_v11_0_flush_gpu_tlb(struct
amdgpu_device *adev, uint32_t vmid,
  * Directly use kiq to do the vm invalidation instead
  */
 if ((adev->gfx.kiq[0].ring.sched.ready || 
adev->mes.ring.sched.ready) &&

-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) ||
+   !adev->in_s0ix) {
  

Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-13 Thread Mario Limonciello

On 12/13/2023 13:07, Alex Deucher wrote:

On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
 wrote:


Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
causes the first MES packet after resume to fail. This packet is
used to flush the TLB when GART is enabled.

This issue is fixed in newer firmware, but as OEMs may not roll this
out to the field, introduce a workaround that will retry the flush
when detecting running on an older firmware and decrease relevant
error messages to debug while workaround is in use.

Cc: sta...@vger.kernel.org # 6.1+
Cc: Tim Huang 
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
Signed-off-by: Mario Limonciello 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
  4 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 9ddbf1494326..6ce3f6e6b6de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct amdgpu_device 
*adev,
 }

 r = adev->mes.funcs->misc_op(>mes, _input);
-   if (r)
-   DRM_ERROR("failed to reg_write_reg_wait\n");
+   if (r) {
+   const char *msg = "failed to reg_write_reg_wait\n";
+
+   if (adev->mes.suspend_workaround)
+   DRM_DEBUG(msg);
+   else
+   DRM_ERROR(msg);
+   }

  error:
 return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
index a27b424ffe00..90f2bba3b12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
@@ -135,6 +135,8 @@ struct amdgpu_mes {

 /* ip specific functions */
 const struct amdgpu_mes_funcs   *funcs;
+
+   boolsuspend_workaround;
  };

  struct amdgpu_mes_process {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 23d7b548d13f..e810c7bb3156 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct amdgpu_device 
*adev)
 false : true;

 adev->mmhub.funcs->set_fault_enable_default(adev, value);
-   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+
+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   adev->mes.suspend_workaround = false;
+   } while (adev->mes.suspend_workaround);


Shouldn't this be something like:


+   do {
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   adev->mes.suspend_workaround = false;
+   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
+   } while (adev->mes.suspend_workaround);


If we actually need the flush.  Maybe a better approach would be to
check if we are in s0ix in


Ah you're right; I had shifted this around to keep less stateful 
variables and push them up the stack from when I first made it and that 
logic is wrong now.


I don't think the one you suggested is right either; it's going to apply 
twice on ASICs that only need it once.


I guess pending on what Christian comments on below I'll respin to logic 
that only calls twice on resume for these ASICs.




diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c in gmc_v11_0_flush_gpu_tlb():
index 23d7b548d13f..bd6d9953a80e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -227,7 +227,8 @@ static void gmc_v11_0_flush_gpu_tlb(struct
amdgpu_device *adev, uint32_t vmid,
  * Directly use kiq to do the vm invalidation instead
  */
 if ((adev->gfx.kiq[0].ring.sched.ready || adev->mes.ring.sched.ready) 
&&
-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) ||
+   !adev->in_s0ix) {
 amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
 1 << vmid, GET_INST(GC, 0));
 return;

@Christian Koenig is this logic correct?

 /* For SRIOV run time, driver shouldn't access the register
through MMIO
  * Directly use kiq to do the vm invalidation instead
  */
 if ((adev->gfx.kiq[0].ring.sched.ready || adev->mes.ring.sched.ready) 
&&
 (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
 amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
 1 << vmid, GET_INST(GC, 0));
 return;
 }

We basically always 

Re: [PATCH] drm/amd: Add a workaround for GFX11 systems that fail to flush TLB

2023-12-13 Thread Alex Deucher
On Wed, Dec 13, 2023 at 1:00 PM Mario Limonciello
 wrote:
>
> Some systems with MP1 13.0.4 or 13.0.11 have a firmware bug that
> causes the first MES packet after resume to fail. This packet is
> used to flush the TLB when GART is enabled.
>
> This issue is fixed in newer firmware, but as OEMs may not roll this
> out to the field, introduce a workaround that will retry the flush
> when detecting running on an older firmware and decrease relevant
> error messages to debug while workaround is in use.
>
> Cc: sta...@vger.kernel.org # 6.1+
> Cc: Tim Huang 
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3045
> Signed-off-by: Mario Limonciello 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 10 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h |  2 ++
>  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c  | 17 -
>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  |  8 ++--
>  4 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> index 9ddbf1494326..6ce3f6e6b6de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
> @@ -836,8 +836,14 @@ int amdgpu_mes_reg_write_reg_wait(struct amdgpu_device 
> *adev,
> }
>
> r = adev->mes.funcs->misc_op(>mes, _input);
> -   if (r)
> -   DRM_ERROR("failed to reg_write_reg_wait\n");
> +   if (r) {
> +   const char *msg = "failed to reg_write_reg_wait\n";
> +
> +   if (adev->mes.suspend_workaround)
> +   DRM_DEBUG(msg);
> +   else
> +   DRM_ERROR(msg);
> +   }
>
>  error:
> return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> index a27b424ffe00..90f2bba3b12b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.h
> @@ -135,6 +135,8 @@ struct amdgpu_mes {
>
> /* ip specific functions */
> const struct amdgpu_mes_funcs   *funcs;
> +
> +   boolsuspend_workaround;
>  };
>
>  struct amdgpu_mes_process {
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> index 23d7b548d13f..e810c7bb3156 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> @@ -889,7 +889,11 @@ static int gmc_v11_0_gart_enable(struct amdgpu_device 
> *adev)
> false : true;
>
> adev->mmhub.funcs->set_fault_enable_default(adev, value);
> -   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +
> +   do {
> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +   adev->mes.suspend_workaround = false;
> +   } while (adev->mes.suspend_workaround);

Shouldn't this be something like:

> +   do {
> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +   adev->mes.suspend_workaround = false;
> +   gmc_v11_0_flush_gpu_tlb(adev, 0, AMDGPU_MMHUB0(0), 0);
> +   } while (adev->mes.suspend_workaround);

If we actually need the flush.  Maybe a better approach would be to
check if we are in s0ix in

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c in gmc_v11_0_flush_gpu_tlb():
index 23d7b548d13f..bd6d9953a80e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -227,7 +227,8 @@ static void gmc_v11_0_flush_gpu_tlb(struct
amdgpu_device *adev, uint32_t vmid,
 * Directly use kiq to do the vm invalidation instead
 */
if ((adev->gfx.kiq[0].ring.sched.ready || adev->mes.ring.sched.ready) &&
-   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
+   (amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev)) ||
+   !adev->in_s0ix) {
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
1 << vmid, GET_INST(GC, 0));
return;

@Christian Koenig is this logic correct?

/* For SRIOV run time, driver shouldn't access the register
through MMIO
 * Directly use kiq to do the vm invalidation instead
 */
if ((adev->gfx.kiq[0].ring.sched.ready || adev->mes.ring.sched.ready) &&
(amdgpu_sriov_runtime(adev) || !amdgpu_sriov_vf(adev))) {
amdgpu_virt_kiq_reg_write_reg_wait(adev, req, ack, inv_req,
1 << vmid, GET_INST(GC, 0));
return;
}

We basically always use the MES with that logic.  If that is the case,
we should just drop the rest of that function.  Shouldn't we only use
KIQ or MES for SR-IOV?  gmc v10 has similar logic which also seems
wrong.

Alex


>
> DRM_INFO("PCIE GART of %uM enabled (table at 0x%016llX).\n",
>  (unsigned int)(adev->gmc.gart_size >>