RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2021-01-15 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Hi Joerg,

Thanks. Hope you are doing well now.

Edgar

-----Original Message-----
From: jroe...@suse.de
Sent: Friday, 15 January 2021 09:18
To: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Cc: iommu@lists.linux-foundation.org
Subject: Re: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

Hi Edgar,

On Mon, Nov 23, 2020 at 06:41:18AM +, Merger, Edgar [AUTOSOL/MAS/AUGS] 
wrote:
> Just wanted to follow up on that topic.
> Has that quirk already been merged into the upstream kernel?

Sorry for the late reply, I had to take an extended sick leave. I will take 
care of sending this fix upstream next week.

Regards,

Joerg

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-12-08 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Alex,

I had to revise the patch; please see the attachment. Two more SSIDs are 
actually affected.

Best regards,
Edgar

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Tuesday, 8 December 2020 09:23
To: 'Deucher, Alexander' ; 'Huang, Ray' 
; 'Kuehling, Felix' 
Cc: 'Will Deacon' ; 'linux-ker...@vger.kernel.org' 
; 'linux-...@vger.kernel.org' 
; 'iommu@lists.linux-foundation.org' 
; 'Bjorn Helgaas' ; 
'Joerg Roedel' ; 'Zhu, Changfeng' 
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

I applied the patch as attached and verified that ATS had been disabled for the 
GPU device; see the attachment "dmesg_ATS.log".

That build ran successfully overnight.

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Monday, 7 December 2020 05:53
To: Deucher, Alexander ; Huang, Ray 
; Kuehling, Felix 
Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Joerg Roedel ; Zhu, Changfeng 

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

Hi Alex,

I believe that in the patch file, this
+   (pdev->subsystem_device == 0x0c19 ||
+pdev->subsystem_device == 0x0c10))

has to be changed to:
+   (pdev->subsystem_device == 0xce19 ||
+pdev->subsystem_device == 0xcc10))

because our SSIDs are "ea50:ce19" and "ea50:cc10" respectively, and another one 
would be "ea50:cc08".

I will apply that patch and report the results soon, along with the patch file 
that I actually applied.
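
To illustrate the kind of narrow match discussed here, the following is a minimal user-space sketch in C. It is not the attached patch (which is not reproduced in this thread): the struct `pci_ids`, the function `ats_quirk_applies` and the macro names are hypothetical stand-ins, and the assumption is that the quirk should fire only for the ATI device ID 0x15d8 when the subsystem vendor is 0xea50 and the subsystem device is one of the three SSIDs named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the ID fields of the kernel's struct pci_dev. */
struct pci_ids {
    uint16_t vendor;
    uint16_t device;
    uint16_t subsystem_vendor;
    uint16_t subsystem_device;
};

/* Values quoted in this thread (0x1002 is the ATI PCI vendor ID). */
#define VENDOR_ID_ATI         0x1002
#define DEVICE_ID_RAVEN_IGPU  0x15d8
#define SSVID_PLATFORM        0xea50

/* Assumption: the three subsystem device IDs mentioned in the message. */
static const uint16_t affected_ssids[] = { 0xce19, 0xcc10, 0xcc08 };

/* Returns true only for the iGPU on one of the listed platform variants. */
static bool ats_quirk_applies(const struct pci_ids *ids)
{
    assert(ids != NULL);

    if (ids->vendor != VENDOR_ID_ATI || ids->device != DEVICE_ID_RAVEN_IGPU)
        return false;
    if (ids->subsystem_vendor != SSVID_PLATFORM)
        return false;

    for (size_t i = 0; i < sizeof(affected_ssids) / sizeof(affected_ssids[0]); i++) {
        if (ids->subsystem_device == affected_ssids[i])
            return true;
    }
    return false;
}
```

With this shape, a device reporting SSID ea50:ce19 matches, while the mistyped value 0x0c19 from the first patch version does not, which is why the quirk would not have triggered before the correction.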


-----Original Message-----
From: Deucher, Alexander
Sent: Monday, 30 November 2020 19:36
To: Merger, Edgar [AUTOSOL/MAS/AUGS] ; Huang, Ray 
; Kuehling, Felix 
Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Joerg Roedel ; Zhu, Changfeng 

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

[AMD Public Use]

> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS] 
> Sent: Thursday, November 26, 2020 4:24 AM
> To: Deucher, Alexander ; Huang, Ray 
> ; Kuehling, Felix 
> Cc: Will Deacon ; linux-ker...@vger.kernel.org;
> linux- p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn 
> Helgaas ; Joerg Roedel ; Zhu, 
> Changfeng 
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> Alex,
> 
> This is pretty much the same patch as what I have received from Joerg 
> previously, except that it is tied to the particular Emerson platform 
> and its derivatives (listed with Subsystem IDs).

Right.  As per my original point, I don't want to disable ATS on all Picasso 
chips because doing so would break GPU compute on them, so I'd like to apply 
this quirk as narrowly as possible.

> 
> The patch below is what Joerg provided me and I successfully tested.
> 
> This diff to the kernel should do that:
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index f70692ac79c5..3911b0ec57ba 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5176,6 +5176,8 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats);
>  /* AMD Navi14 dGPU */
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats);
> +/* AMD Raven platform iGPU */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats);
>  #endif /* CONFIG_PCI_ATS */
>
>  /* Freescale PCIe doesn't support MSI in RC mode */
> 
> So far I have seen this issue on two instances of this chip, though I 
> must admit that I tested only two of them to this extent. So I guess 
> it is not one bad chip in particular; however, the chips we use are from 
> the same production lot, so it might be a systematic problem with that 
> lot?
> 
> UEFI-Setup shows:
> Processor Family: 17h
> Processor Model: 20h - 2Fh
> CPUID: 00820F01
> Microcode Patch Level: 8200103
> 
> Looking at the chip-die I found that this is a fully qualified IP 
> Silicon (according to Ryzen Embedded R1000 SOC Interlock).
> YE1305C9T20FG
> BI2015SUY
> 9JB6496P00123
> 2016 AMD
> DIFFUSED IN USA
> MADE IN CHINA
> 
> Currently used SBIOS is a branch from "EmbeddedPI-FP5 1.2.0.3RC3".
> 
> In the future our SBIOS might merge with EmbeddedPI-FP5_1.2.0.5RC3.
> 

I think it's more likely an sbios issue, so hopefully the new release fixes it.

Alex

> 
> 
> 
> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, 25 November 2020 17:08
> To: Merger, Edgar [AUTOSOL/MAS/AUGS] ; 
> Huang, Ray ; Kuehlin

RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-12-06 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Hi Alex,

I believe that in the patch file, this
+   (pdev->subsystem_device == 0x0c19 ||
+pdev->subsystem_device == 0x0c10))

has to be changed to:
+   (pdev->subsystem_device == 0xce19 ||
+pdev->subsystem_device == 0xcc10))

because our SSIDs are "ea50:ce19" and "ea50:cc10" respectively, and another one 
would be "ea50:cc08".

I will apply that patch and report the results soon, along with the patch file 
that I actually applied.


-----Original Message-----
From: Deucher, Alexander
Sent: Monday, 30 November 2020 19:36
To: Merger, Edgar [AUTOSOL/MAS/AUGS] ; Huang, Ray 
; Kuehling, Felix 
Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Joerg Roedel ; Zhu, Changfeng 

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

[AMD Public Use]

> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS] 
> Sent: Thursday, November 26, 2020 4:24 AM
> To: Deucher, Alexander ; Huang, Ray 
> ; Kuehling, Felix 
> Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
> linux- p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn 
> Helgaas ; Joerg Roedel ; Zhu, 
> Changfeng 
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> Alex,
> 
> This is pretty much the same patch as what I have received from Joerg 
> previously, except that it is tied to the particular Emerson platform 
> and its derivatives (listed with Subsystem IDs).

Right.  As per my original point, I don't want to disable ATS on all Picasso 
chips because doing so would break GPU compute on them, so I'd like to apply 
this quirk as narrowly as possible.

> 
> The patch below is what Joerg provided me and I successfully tested.
> 
> This diff to the kernel should do that:
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index f70692ac79c5..3911b0ec57ba 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5176,6 +5176,8 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats);
>  /* AMD Navi14 dGPU */
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats);
> +/* AMD Raven platform iGPU */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats);
>  #endif /* CONFIG_PCI_ATS */
>
>  /* Freescale PCIe doesn't support MSI in RC mode */
> 
> So far I have seen this issue on two instances of this chip, though I 
> must admit that I tested only two of them to this extent. So I guess 
> it is not one bad chip in particular; however, the chips we use are from 
> the same production lot, so it might be a systematic problem with that 
> lot?
> 
> UEFI-Setup shows:
> Processor Family: 17h
> Processor Model: 20h - 2Fh
> CPUID: 00820F01
> Microcode Patch Level: 8200103
> 
> Looking at the chip-die I found that this is a fully qualified IP 
> Silicon (according to Ryzen Embedded R1000 SOC Interlock).
> YE1305C9T20FG
> BI2015SUY
> 9JB6496P00123
> 2016 AMD
> DIFFUSED IN USA
> MADE IN CHINA
> 
> Currently used SBIOS is a branch from "EmbeddedPI-FP5 1.2.0.3RC3".
> 
> In the future our SBIOS might merge with EmbeddedPI-FP5_1.2.0.5RC3.
> 

I think it's more likely an sbios issue, so hopefully the new release fixes it.

Alex

> 
> 
> 
> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, 25 November 2020 17:08
> To: Merger, Edgar [AUTOSOL/MAS/AUGS] ; 
> Huang, Ray ; Kuehling, Felix 
> 
> Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
> linux- p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn 
> Helgaas ; Joerg Roedel ; Zhu, 
> Changfeng 
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> [AMD Public Use]
> 
> > -----Original Message-----
> > From: Merger, Edgar [AUTOSOL/MAS/AUGS]
> 
> > Sent: Wednesday, November 25, 2020 5:04 AM
> > To: Deucher, Alexander ; Huang, Ray 
> > ; Kuehling, Felix 
> > Cc: Will Deacon ; linux-ker...@vger.kernel.org;
> > linux- p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn 
> > Helgaas ; Joerg Roedel ; Zhu, 
> > Changfeng 
> > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> > broken
> >
> > I also have other problems with this unit when the IOMMU is enabled 
> > and pci=noats is not set as a kernel parameter.
> >
> > [ 2004.265906] amdgpu :0b:00.0: [drm:amdgpu_ib_ring_tests 
> > [amdgpu]]
> > *ERROR* IB test failed on gfx (-11

RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-11-26 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Alex,

This is pretty much the same patch as what I have received from Joerg 
previously, except that it is tied to the particular Emerson platform and its 
derivatives (listed with Subsystem IDs). 

The patch below is what Joerg provided me and I successfully tested.

This diff to the kernel should do that:

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index f70692ac79c5..3911b0ec57ba 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5176,6 +5176,8 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats);
 /* AMD Navi14 dGPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats);
+/* AMD Raven platform iGPU */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats);
 #endif /* CONFIG_PCI_ATS */
 
 /* Freescale PCIe doesn't support MSI in RC mode */

So far I have seen this issue on two instances of this chip, though I must 
admit that I tested only two of them to this extent. So I guess it is not one 
bad chip in particular; however, the chips we use are from the same production 
lot, so it might be a systematic problem with that lot?

UEFI-Setup shows:
Processor Family: 17h
Processor Model: 20h - 2Fh
CPUID: 00820F01
Microcode Patch Level: 8200103

Looking at the chip-die I found that this is a fully qualified IP Silicon 
(according to Ryzen Embedded R1000 SOC Interlock).
YE1305C9T20FG
BI2015SUY
9JB6496P00123
2016 AMD
DIFFUSED IN USA
MADE IN CHINA

Currently used SBIOS is a branch from "EmbeddedPI-FP5 1.2.0.3RC3".

In the future our SBIOS might merge with EmbeddedPI-FP5_1.2.0.5RC3.




-----Original Message-----
From: Deucher, Alexander
Sent: Wednesday, 25 November 2020 17:08
To: Merger, Edgar [AUTOSOL/MAS/AUGS] ; Huang, Ray 
; Kuehling, Felix 
Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Joerg Roedel ; Zhu, Changfeng 

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

[AMD Public Use]

> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS] 
> Sent: Wednesday, November 25, 2020 5:04 AM
> To: Deucher, Alexander ; Huang, Ray 
> ; Kuehling, Felix 
> Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
> linux- p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn 
> Helgaas ; Joerg Roedel ; Zhu, 
> Changfeng 
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> I also have other problems with this unit when the IOMMU is enabled 
> and pci=noats is not set as a kernel parameter.
> 
> [ 2004.265906] amdgpu :0b:00.0: [drm:amdgpu_ib_ring_tests 
> [amdgpu]]
> *ERROR* IB test failed on gfx (-110).
> [ 2004.266024] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]]
> *ERROR* ib ring test failed (-110).
> 

Is this seen on all instances of this chip or only specific silicon?  I.e., 
could this be a bad chip?  Would it be possible to test a newer sbios?  I think 
the attached patch should work if we can't get it fixed on the platform side.  
It should only enable the quirk on your particular platform.

Alex


> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS]
> Sent: Wednesday, 25 November 2020 10:16
> To: 'Deucher, Alexander' ; 'Huang, Ray'
> ; 'Kuehling, Felix' 
> Cc: 'Will Deacon' ; 'linux-ker...@vger.kernel.org' 
> ; 'linux-...@vger.kernel.org'  p...@vger.kernel.org>; 'iommu@lists.linux-foundation.org'
> ; 'Bjorn Helgaas'
> ; 'Joerg Roedel' ; 'Zhu, 
> Changfeng' 
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> Remark:
> 
> Systems with R1305G APU (which show the issue) have the following VGA-
> Controller:
> 0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. 
> [AMD/ATI] Picasso (rev cf)
> 
> Systems with V1404I APU (which do not show the issue) have the 
> following
> VGA-Controller:
> 0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. 
> [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] 
> (rev 83)
> 
> "rev cf" vs. "rev 83" is probably what you were referring to with PCI 
> Revision ID.
> 
> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS]
> Sent: Wednesday, 25 November 2020 07:05
> To: 'Deucher, Alexander' ; Huang, Ray 
> ; Kuehling, Felix 
> Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
> linux- p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn 
> Helgaas ; Joerg Roedel ; Zhu, 
> Changfeng 
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> I see that problem only on systems that use an R1305G APU.
> 
> sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
> 

RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-11-25 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
I also have other problems with this unit when the IOMMU is enabled and 
pci=noats is not set as a kernel parameter.

[ 2004.265906] amdgpu :0b:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
[ 2004.266024] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
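
As a reading aid: the (-110) in both messages is a negated errno value, and in Linux's generic errno numbering 110 is ETIMEDOUT, so the IB test appears to have timed out rather than returned wrong data. A minimal C sketch of that interpretation follows; the helper name `is_timeout_status` is hypothetical and the numeric value is an assumption that holds for Linux, not for every platform.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Kernel-style status codes are zero on success and a negated errno on
 * failure. ETIMEDOUT is 110 in Linux's generic errno numbering; other
 * operating systems may use a different number.
 */
static bool is_timeout_status(int status)
{
    assert(status <= 0);          /* only 0 or -errno is expected here */
    return status == -ETIMEDOUT;
}
```

On a Linux system, `is_timeout_status(-110)` would therefore be true for the value logged above.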

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Wednesday, 25 November 2020 10:16
To: 'Deucher, Alexander' ; 'Huang, Ray' 
; 'Kuehling, Felix' 
Cc: 'Will Deacon' ; 'linux-ker...@vger.kernel.org' 
; 'linux-...@vger.kernel.org' 
; 'iommu@lists.linux-foundation.org' 
; 'Bjorn Helgaas' ; 
'Joerg Roedel' ; 'Zhu, Changfeng' 
Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

Remark: 

Systems with R1305G APU (which show the issue) have the following 
VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
Picasso (rev cf)

Systems with V1404I APU (which do not show the issue) have the following 
VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven 
Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)

"rev cf" vs. "rev 83" is probably what you were referring to with PCI Revision 
ID.

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Wednesday, 25 November 2020 07:05
To: 'Deucher, Alexander' ; Huang, Ray 
; Kuehling, Felix 
Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Joerg Roedel ; Zhu, Changfeng 

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

I see that problem only on systems that use an R1305G APU.

sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

shows

VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 50, firmware version: 0x00a3
PFP feature version: 50, firmware version: 0x00bb
CE feature version: 50, firmware version: 0x004f
RLC feature version: 1, firmware version: 0x0049
RLC SRLC feature version: 1, firmware version: 0x0001
RLC SRLG feature version: 1, firmware version: 0x0001
RLC SRLS feature version: 1, firmware version: 0x0001
MEC feature version: 50, firmware version: 0x01b5
MEC2 feature version: 50, firmware version: 0x01b5
SOS feature version: 0, firmware version: 0x
ASD feature version: 0, firmware version: 0x2130
TA XGMI feature version: 0, firmware version: 0x
TA RAS feature version: 0, firmware version: 0x
SMC feature version: 0, firmware version: 0x2527
SDMA0 feature version: 41, firmware version: 0x00a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x0001
VBIOS version: 113-RAVEN2-117

We are also using the V1404I APU on the same boards, and I haven't seen the 
issue on those boards.

These boards give me slightly different info:
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
 
VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 47, firmware version: 0x00a2
PFP feature version: 47, firmware version: 0x00b9
CE feature version: 47, firmware version: 0x004e
RLC feature version: 1, firmware version: 0x0213
RLC SRLC feature version: 1, firmware version: 0x0001
RLC SRLG feature version: 1, firmware version: 0x0001
RLC SRLS feature version: 1, firmware version: 0x0001
MEC feature version: 47, firmware version: 0x01ab
MEC2 feature version: 47, firmware version: 0x01ab
SOS feature version: 0, firmware version: 0x
ASD feature version: 0, firmware version: 0x2113
TA XGMI feature version: 0, firmware version: 0x
TA RAS feature version: 0, firmware version: 0x
SMC feature version: 0, firmware version: 0x1e5b
SDMA0 feature version: 41, firmware version: 0x00a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x
VBIOS version: 113-RAVEN-116




00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root 
Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 
00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP 
Bridge [6:0]
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream 
(PCIE SW.US)
00:01.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP 
Bridge [6:0]
00:01.5 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream 
(PCIE SW.US)
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 
00h-1fh) PCIe Dummy

RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-11-25 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Remark: 

Systems with R1305G APU (which show the issue) have the following 
VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
Picasso (rev cf)

Systems with V1404I APU (which do not show the issue) have the following 
VGA-Controller:
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven 
Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)

"rev cf" vs. "rev 83" is probably what you were referring to with PCI Revision 
ID.

-----Original Message-----
From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Wednesday, 25 November 2020 07:05
To: 'Deucher, Alexander' ; Huang, Ray 
; Kuehling, Felix 
Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Joerg Roedel ; Zhu, Changfeng 

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

I see that problem only on systems that use an R1305G APU.

sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

shows

VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 50, firmware version: 0x00a3
PFP feature version: 50, firmware version: 0x00bb
CE feature version: 50, firmware version: 0x004f
RLC feature version: 1, firmware version: 0x0049
RLC SRLC feature version: 1, firmware version: 0x0001
RLC SRLG feature version: 1, firmware version: 0x0001
RLC SRLS feature version: 1, firmware version: 0x0001
MEC feature version: 50, firmware version: 0x01b5
MEC2 feature version: 50, firmware version: 0x01b5
SOS feature version: 0, firmware version: 0x
ASD feature version: 0, firmware version: 0x2130
TA XGMI feature version: 0, firmware version: 0x
TA RAS feature version: 0, firmware version: 0x
SMC feature version: 0, firmware version: 0x2527
SDMA0 feature version: 41, firmware version: 0x00a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x0001
VBIOS version: 113-RAVEN2-117

We are also using the V1404I APU on the same boards, and I haven't seen the 
issue on those boards.

These boards give me slightly different info:
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
 
VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 47, firmware version: 0x00a2
PFP feature version: 47, firmware version: 0x00b9
CE feature version: 47, firmware version: 0x004e
RLC feature version: 1, firmware version: 0x0213
RLC SRLC feature version: 1, firmware version: 0x0001
RLC SRLG feature version: 1, firmware version: 0x0001
RLC SRLS feature version: 1, firmware version: 0x0001
MEC feature version: 47, firmware version: 0x01ab
MEC2 feature version: 47, firmware version: 0x01ab
SOS feature version: 0, firmware version: 0x
ASD feature version: 0, firmware version: 0x2113
TA XGMI feature version: 0, firmware version: 0x
TA RAS feature version: 0, firmware version: 0x
SMC feature version: 0, firmware version: 0x1e5b
SDMA0 feature version: 41, firmware version: 0x00a9
VCN feature version: 0, firmware version: 0x0110901c
DMCU feature version: 0, firmware version: 0x
VBIOS version: 113-RAVEN-116




00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Root 
Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 
00h-1fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP 
Bridge [6:0]
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream 
(PCIE SW.US)
00:01.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 PCIe GPP 
Bridge [6:0]
00:01.5 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream 
(PCIE SW.US)
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 
00h-1fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal 
PCIe GPP Bridge 0 to Bus A
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Internal 
PCIe GPP Bridge 0 to Bus B
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: 
Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: 
Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: 
Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Raven/Raven2 Device 24: 
Function 3
00:18.4 H

RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-11-24 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
-Port/8-Lane 
Packet Switch
04:01.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane 
Packet Switch
04:02.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane 
Packet Switch
04:03.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane 
Packet Switch
04:04.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane 
Packet Switch
04:05.0 PCI bridge: Pericom Semiconductor PI7C9X2G608GP PCIe2 6-Port/8-Lane 
Packet Switch
06:00.0 Serial controller: Asix Electronics Corporation Device 9100
06:00.1 Serial controller: Asix Electronics Corporation Device 9100
07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection 
(rev 03)
0a:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection 
(rev 03)
0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
Picasso (rev cf)
0b:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] 
Raven/Raven2/Fenghuang HDMI/DP Audio Controller
0b:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h 
(Models 10h-1fh) Platform Security Processor
0b:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Raven2 USB 3.1
0b:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] 
Raven/Raven2/FireFlight/Renoir Audio Processor
0b:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. [AMD] 
Raven/Raven2/Renoir Non-Sensor Fusion Hub KMDF driver
0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller 
[AHCI mode] (rev 61)

PCI Revision ID is 06 I believe. Got that from this lspci -xx

00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Zeppelin Switch Upstream 
(PCIE SW.US)
00: 22 10 5d 14 07 04 10 00 00 00 04 06 10 00 81 00
10: 00 00 00 00 00 00 00 00 00 02 02 00 f1 01 00 00
20: e0 fc e0 fc f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 50 00 00 00 00 00 00 00 ff 00 12 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 58 03 c8 00 00 00 00 10 a0 42 01 22 80 00 00
60: 1f 29 00 00 13 38 73 03 42 00 11 30 00 00 04 00
70: 00 00 40 01 18 00 01 00 00 00 00 00 bf 01 70 00
80: 06 00 00 00 0e 00 00 00 03 00 01 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 05 c0 81 00 00 00 e0 fe 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 0d c8 00 00 22 10 34 12 08 00 03 a8 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 4c 8a 05 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
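
A note for reading the dump above: in the standard PCI configuration header, the Revision ID is the single byte at offset 0x08, while offsets 0x09 to 0x0b hold the class code (programming interface, subclass, base class). The following C sketch decodes the first 16 bytes of such a dump. The struct and function names are hypothetical and not taken from any library, and the sample row is copied from the 00:01.2 bridge dump above, so it describes that bridge rather than the GPU at 0b:00.0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Selected fields of the standard PCI configuration header. */
struct pci_hdr_summary {
    uint16_t vendor_id;    /* offsets 0x00-0x01, little endian */
    uint16_t device_id;    /* offsets 0x02-0x03, little endian */
    uint8_t  revision_id;  /* offset 0x08 */
    uint8_t  prog_if;      /* offset 0x09 */
    uint8_t  subclass;     /* offset 0x0a */
    uint8_t  base_class;   /* offset 0x0b */
};

/* Decode the first 16 bytes of a config-space dump (e.g. from lspci -xx). */
static struct pci_hdr_summary parse_pci_header(const uint8_t *cfg, size_t len)
{
    struct pci_hdr_summary s;

    assert(cfg != NULL && len >= 16);

    s.vendor_id   = (uint16_t)(cfg[0x00] | (cfg[0x01] << 8));
    s.device_id   = (uint16_t)(cfg[0x02] | (cfg[0x03] << 8));
    s.revision_id = cfg[0x08];
    s.prog_if     = cfg[0x09];
    s.subclass    = cfg[0x0a];
    s.base_class  = cfg[0x0b];
    return s;
}

/* Row "00:" of the dump for device 00:01.2, copied from the message. */
static const uint8_t sample_row_00_01_2[16] = {
    0x22, 0x10, 0x5d, 0x14, 0x07, 0x04, 0x10, 0x00,
    0x00, 0x00, 0x04, 0x06, 0x10, 0x00, 0x81, 0x00
};
```

For the sample row this yields vendor 0x1022, device 0x145d, revision 0x00 and base class 0x06 (a bridge), which suggests the "06" read above is the class-code byte rather than the revision. The revision of the GPU itself is the "rev cf" shown in the plain lspci line for 0b:00.0.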

-----Original Message-----
From: Deucher, Alexander
Sent: Tuesday, 24 November 2020 16:06
To: Merger, Edgar [AUTOSOL/MAS/AUGS] ; Huang, Ray 
; Kuehling, Felix 
Cc: Will Deacon ; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Joerg Roedel ; Zhu, Changfeng 

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

[AMD Public Use]

> -----Original Message-----
> From: Merger, Edgar [AUTOSOL/MAS/AUGS] 
> Sent: Tuesday, November 24, 2020 2:29 AM
> To: Huang, Ray ; Kuehling, Felix 
> 
> Cc: Will Deacon ; Deucher, Alexander 
> ; linux-ker...@vger.kernel.org; linux- 
> p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
> ; Joerg Roedel ; Zhu, Changfeng 
> 
> Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as 
> broken
> 
> Module Version : PiccasoCpu 10
> AGESA Version   : PiccasoPI 100A
> 
> I did not try to access the system in any way other than via the 
> desktop (for example, via ssh).

You can get this information from the amdgpu driver.  E.g., sudo cat 
/sys/kernel/debug/dri/0/amdgpu_firmware_info .  Also what is the PCI revision 
id of your chip (from lspci)?  Also are you just seeing this on specific 
versions of the sbios?

Thanks,

Alex


> 
> -----Original Message-----
> From: Huang Rui
> Sent: Tuesday, 24 November 2020 07:43
> To: Kuehling, Felix 
> Cc: Will Deacon ; Deucher, Alexander 
> ; linux-ker...@vger.kernel.org; linux- 
> p...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
> ; Merger, Edgar [AUTOSOL/MAS/AUGS] 
> ; Joerg Roedel ; Changfeng 
> Zhu 
> Subject: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> 
> On Tue, Nov 24, 2020 at 06:51:11AM +0800, Kuehling, Felix wrote:
> > On 2020-11-23 5:33 p.m., Will Deacon wrote:
> > > On Mon, Nov 23, 2020 at 09:04:14PM +, Deucher, Alexander wrote:
> > >> [AMD Public Use]
> > >>
> > >>> -----Original Message-----
> > >>> From: Will Deacon 
> > >>> Sent: Monday, November 23, 2020 8:44 AM
> > >>> To: linux-ker...@vger.kernel.org
> > >>> Cc: linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; 
> > >>> Will Deacon ; Bjorn Helgaas 
> > >>> ; Deucher, Alexander 
>

RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-11-23 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Module Version : PiccasoCpu 10 
AGESA Version   : PiccasoPI 100A

I did not try to access the system in any way other than via the desktop (for 
example, via ssh).

-----Original Message-----
From: Huang Rui
Sent: Tuesday, 24 November 2020 07:43
To: Kuehling, Felix 
Cc: Will Deacon ; Deucher, Alexander 
; linux-ker...@vger.kernel.org; 
linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; Bjorn Helgaas 
; Merger, Edgar [AUTOSOL/MAS/AUGS] 
; Joerg Roedel ; Changfeng Zhu 

Subject: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

On Tue, Nov 24, 2020 at 06:51:11AM +0800, Kuehling, Felix wrote:
> On 2020-11-23 5:33 p.m., Will Deacon wrote:
> > On Mon, Nov 23, 2020 at 09:04:14PM +, Deucher, Alexander wrote:
> >> [AMD Public Use]
> >>
> >>> -----Original Message-----
> >>> From: Will Deacon 
> >>> Sent: Monday, November 23, 2020 8:44 AM
> >>> To: linux-ker...@vger.kernel.org
> >>> Cc: linux-...@vger.kernel.org; iommu@lists.linux-foundation.org; 
> >>> Will Deacon ; Bjorn Helgaas 
> >>> ; Deucher, Alexander 
> >>> ; Edgar Merger 
> >>> ; Joerg Roedel 
> >>> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> >>>
> >>> Edgar Merger reports that the AMD Raven GPU does not work reliably 
> >>> on his system when the IOMMU is enabled:
> >>>
> >>>| [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, 
> >>> signaled seq=1, emitted seq=3
> >>>| [...]
> >>>| amdgpu :0b:00.0: GPU reset begin!
> >>>| AMD-Vi: Completion-Wait loop timed out
> >>>| iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT
> >>> device=0b:00.0 address=0x38edc0970]
> >>>
> >>> This is indicative of a hardware/platform configuration issue so, 
> >>> since disabling ATS has been shown to resolve the problem, add a 
> >>> quirk to match this particular device while Edgar follows-up with AMD for 
> >>> more information.
> >>>
> >>> Cc: Bjorn Helgaas 
> >>> Cc: Alex Deucher 
> >>> Reported-by: Edgar Merger 
> >>> Suggested-by: Joerg Roedel 
> >>> Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> >>> Signed-off-by: Will Deacon 
> >>> ---
> >>>
> >>> Hi all,
> >>>
> >>> Since Joerg is away at the moment, I'm posting this to try to make 
> >>> some progress with the thread in the Link: tag.
> >> + Felix
> >>
> >> What system is this?  Can you provide more details?  Does a sbios 
> >> update fix this?  Disabling ATS for all Ravens will break GPU 
> >> compute for a lot of people.  I'd prefer to just black list this 
> >> particular system (e.g., just SSIDs or revision) if possible.
> 
> +Ray
> 
> There are already many systems where the IOMMU is disabled in the 
> BIOS, or the CRAT table reporting the APU compute capabilities is 
> broken. Ray has been working on a fallback to make APUs behave like 
> dGPUs on such systems. That should also cover this case where ATS is 
> blacklisted. That said, it affects the programming model, because we 
> don't support the unified and coherent memory model on dGPUs like we 
> do on APUs with IOMMUv2. So it would be good to make the conditions 
> for this workaround as narrow as possible.

Yes, besides the comments from Alex and Felix, may we get your firmware version 
(SMC firmware which is from SBIOS) and device id?

> >>>| [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, 
> >>> signaled seq=1, emitted seq=3

It looks like only the gfx IB test passed and the desktop fails to launch, am I right?

We would like to see whether it is Raven, Raven kicker (new Raven), or Picasso.
On our side, per our internal test results, we didn't see a similar issue on
the Raven kicker and Picasso platforms.

Thanks,
Ray


RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

2020-11-23 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
This is a board developed by my company.
The Subsystem-ID is ea50:0c19 or ea50:cc10 (depending on which carrier board
the compute module is attached to). However, we have not yet managed to set
this Subsystem-ID on every PCI device in the system, because our UEFI firmware
lacks the means to do so. This might change if we update to the latest AGESA
version.
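
The narrowing Alex asks for (matching on SSIDs or revision instead of on every Raven) could look roughly like the plain userspace C sketch below. It is only an illustration, not the kernel quirk: `struct pci_ids` and `raven_ats_quirk_applies()` are hypothetical names, the two subsystem IDs are the ones given above, the vendor/device ID and revision are taken from the lspci output elsewhere in this thread (1002:15d8, rev cf), and including the revision in the match is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the ID fields of a PCI function. */
struct pci_ids {
	uint16_t vendor;
	uint16_t device;
	uint8_t  revision;
	uint16_t subsystem_vendor;
	uint16_t subsystem_device;
};

/* Subsystem IDs reported in this thread for the affected carrier boards. */
static const struct { uint16_t vendor, device; } affected_ssids[] = {
	{ 0xea50, 0x0c19 },
	{ 0xea50, 0xcc10 },
};

/*
 * Return true only for the Raven iGPU (1002:15d8, rev cf) on one of the
 * boards listed above, so that other Raven systems would keep ATS enabled.
 * Matching on the revision is an assumption made for this sketch.
 */
static bool raven_ats_quirk_applies(const struct pci_ids *id)
{
	size_t i;

	if (id->vendor != 0x1002 || id->device != 0x15d8 || id->revision != 0xcf)
		return false;

	for (i = 0; i < sizeof(affected_ssids) / sizeof(affected_ssids[0]); i++) {
		if (id->subsystem_vendor == affected_ssids[i].vendor &&
		    id->subsystem_device == affected_ssids[i].device)
			return true;
	}
	return false;
}
```

A Raven iGPU that still carries a different subsystem ID would not match, which is the point of keeping the condition narrow.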

-Original Message-
From: Will Deacon
Sent: Monday, 23 November 2020 23:34
To: Deucher, Alexander
Cc: linux-ker...@vger.kernel.org; linux-...@vger.kernel.org;
iommu@lists.linux-foundation.org; Bjorn Helgaas; Merger, Edgar
[AUTOSOL/MAS/AUGS]; Joerg Roedel; Kuehling, Felix
Subject: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander wrote:
> [AMD Public Use]
> 
> > -Original Message-
> > From: Will Deacon 
> > Sent: Monday, November 23, 2020 8:44 AM
> > To: linux-ker...@vger.kernel.org
> > Cc: linux-...@vger.kernel.org; iommu@lists.linux-foundation.org;
> > Will Deacon; Bjorn Helgaas; Deucher, Alexander; Edgar Merger;
> > Joerg Roedel
> > Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> > 
> > Edgar Merger reports that the AMD Raven GPU does not work reliably 
> > on his system when the IOMMU is enabled:
> > 
> >   | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
> >   | [...]
> >   | amdgpu 0000:0b:00.0: GPU reset begin!
> >   | AMD-Vi: Completion-Wait loop timed out
> >   | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x38edc0970]
> > 
> > This is indicative of a hardware/platform configuration issue so, 
> > since disabling ATS has been shown to resolve the problem, add a 
> > quirk to match this particular device while Edgar follows-up with AMD for 
> > more information.
> > 
> > Cc: Bjorn Helgaas 
> > Cc: Alex Deucher 
> > Reported-by: Edgar Merger 
> > Suggested-by: Joerg Roedel 
> > Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> > Signed-off-by: Will Deacon 
> > ---
> > 
> > Hi all,
> > 
> > Since Joerg is away at the moment, I'm posting this to try to make 
> > some progress with the thread in the Link: tag.
> 
> + Felix
> 
> What system is this?  Can you provide more details?  Does a sbios 
> update fix this?  Disabling ATS for all Ravens will break GPU compute 
> for a lot of people.  I'd prefer to just black list this particular 
> system (e.g., just SSIDs or revision) if possible.

Cheers, Alex. I'll have to defer to Edgar for the details, as my understanding 
from the original thread over at:

https://lore.kernel.org/linux-iommu/MWHPR10MB1310CDB6829DDCF5EA84A14689150@MWHPR10MB1310.namprd10.prod.outlook.com/

is that this is a board developed by his company.

Edgar -- please can you answer Alex's questions?

Will


RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-06 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Alright, so is this going to make it into an upstream-Kernel?

-Original Message-
From: jroe...@suse.de  
Sent: Friday, 6 November 2020 15:06
To: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Cc: iommu@lists.linux-foundation.org
Subject: Re: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

On Fri, Nov 06, 2020 at 01:03:22PM +0000, Merger, Edgar [AUTOSOL/MAS/AUGS]
wrote:
> Thank you. I do think that this is the GPU. Would you please elaborate 
> on what that quirk would be?

The GPU seems to have broken ATS, or require driver setup to make ATS work. 
Anyhow, ATS is unstable for Linux to use, so it must not be enabled.

This diff to the kernel should do that:

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index f70692ac79c5..3911b0ec57ba 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5176,6 +5176,8 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats);
 /* AMD Navi14 dGPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats);
+/* AMD Raven platform iGPU */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats);
 #endif /* CONFIG_PCI_ATS */
 
 /* Freescale PCIe doesn't support MSI in RC mode */


RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-06 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
> Thanks. So I guess the GPU needs a quirk to disable ATS on it. Can you please 
> send me the output of lspci -n -s "0b:00.0" (Given that 0b:00.0 is your GPU)?

Thank you. I do think that this is the GPU. Would you please elaborate on what 
that quirk would be?

-Original Message-
From: jroe...@suse.de  
Sent: Friday, 6 November 2020 13:19
To: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Cc: iommu@lists.linux-foundation.org
Subject: Re: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

On Fri, Nov 06, 2020 at 05:51:18AM +0000, Merger, Edgar [AUTOSOL/MAS/AUGS] 
wrote:
> With Kernel 5.9.3 kernel-parameter pci=noats the system is running for 
> 19hours now in reboot-test without the error to occur.

Thanks. So I guess the GPU needs a quirk to disable ATS on it. Can you please 
send me the output of lspci -n -s "0b:00.0" (Given that 0b:00.0 is your GPU)?

Thanks,

Joerg
0b:00.0 0300: 1002:15d8 (rev cf)
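
For reference, the fields of that line are bus:device.function, the class code (0300 is a VGA compatible controller), vendor:device (1002 is the AMD/ATI vendor ID) and the revision. A minimal C sketch for pulling them apart is shown below. `struct lspci_n_entry` and `parse_lspci_n()` are hypothetical names, and the sketch assumes the short form without a PCI domain prefix, exactly as printed above.

```c
#include <stdio.h>

/* Hypothetical holder for the fields of one `lspci -n` output line. */
struct lspci_n_entry {
	unsigned int bus, dev, fn;
	unsigned int class_code;
	unsigned int vendor, device;
	unsigned int revision;	/* stays 0 when the line has no "(rev xx)" */
};

/*
 * Parse a line such as "0b:00.0 0300: 1002:15d8 (rev cf)".
 * All fields are hexadecimal. Returns 0 on success and -1 if the
 * mandatory fields (everything except the revision) are missing.
 */
static int parse_lspci_n(const char *line, struct lspci_n_entry *e)
{
	int n;

	e->revision = 0;
	n = sscanf(line, "%x:%x.%x %x: %x:%x (rev %x)",
		   &e->bus, &e->dev, &e->fn, &e->class_code,
		   &e->vendor, &e->device, &e->revision);
	return n >= 6 ? 0 : -1;
}
```

The vendor/device pair and the revision obtained this way are the values a device-specific quirk would be keyed on.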

RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-05 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
With kernel 5.9.3 and the kernel parameter pci=noats, the system has now been
running the reboot test for 19 hours without the error occurring.
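
To confirm on a test machine that the option really was passed, the boot command line can be checked for it. The C sketch below is one possible way, under these assumptions: `cmdline_has_pci_noats()` is a hypothetical helper, the caller supplies the command line string (for example the contents of /proc/cmdline), and the "pci=" parameter is treated as a comma-separated option list.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the given kernel command line contains "noats" as one of
 * the comma-separated options of a "pci=" parameter, e.g. "pci=noats" or
 * "pci=nomsi,noats".
 */
static bool cmdline_has_pci_noats(const char *cmdline)
{
	const char *p = cmdline;

	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of this parameter */

		if (len > 4 && strncmp(p, "pci=", 4) == 0) {
			const char *opt = p + 4;
			const char *end = p + len;

			while (opt < end) {
				size_t optlen = strcspn(opt, ", \t\n");

				if (optlen == 5 && strncmp(opt, "noats", 5) == 0)
					return true;
				opt += optlen;
				if (opt < end)
					opt++;	/* skip the ',' */
			}
		}
		p += len;
	}
	return false;
}
```

This only shows that the option was given at boot; whether ATS is actually off for the GPU still has to be checked in dmesg or lspci -vvv.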

Best regards,
Edgar

-Original Message-
From: jroe...@suse.de  
Sent: Thursday, 5 November 2020 13:33
To: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Cc: iommu@lists.linux-foundation.org
Subject: Re: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

On Thu, Nov 05, 2020 at 11:58:30AM +0000, Merger, Edgar [AUTOSOL/MAS/AUGS]
wrote:
> One remark:
> With kernel-parameter pci=noats in dmesg there is
> 
> [   10.128463] kfd kfd: Error initializing iommuv2

That is expected. IOMMUv2 depends on ATS support.

Regards,

Joerg


RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-05 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Joerg,

One remark:
With the kernel parameter pci=noats, dmesg shows:

[   10.128463] kfd kfd: Error initializing iommuv2

Best regards,
Edgar


dmesg_pci_noats.log
Description: dmesg_pci_noats.log

RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-05 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Joerg,

I ran 5.9.3. After about 2 hours of reboot cycling, the system failed again
with the amdgpu problems.

> please try booting with "pci=noats" on the kernel command line.
This I will do next.

Best regards,
Edgar


RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-04 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Joerg,

One remark: 
> However I found out that with Kernel 5.9.3 the amdgpu kernel module is not 
> loaded/installed
That is likely my fault, because I compiled that Linux kernel on a faster
machine (a V1807B CPU, versus the R1305G CPU of the target). I have just
restarted the build on the target machine to avoid any problems.

Best regards,
Edgar


RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-04 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
> Yes, but it could be the same underlying reason.
There is no PCI setup issue that we are aware of.

> For a first try, use 5.9.3. If it reproduces there, please try booting with 
> "pci=noats" on the kernel command line.
I compiled kernel 5.9.3 and started a reboot test to see whether it fails
again. However, I found that with kernel 5.9.3 the amdgpu kernel module is not
loaded/installed, so I don't see the point of investigating further this way.
I might have done something wrong when compiling Linux kernel 5.9.3. I reused
the .config file from 5.4.0-47 to configure kernel 5.9.3, but I do not know
why amdgpu was not installed.

> Please also send me the output of 'lspci -vvv' and 'lspci -t' of the machine 
> where this happens.
For comparison I attached the logs when using 5.4.0-47 and 5.9.3. 

Best regards,
Edgar

-Original Message-
From: jroe...@suse.de  
Sent: Wednesday, 4 November 2020 11:15
To: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Cc: iommu@lists.linux-foundation.org
Subject: Re: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

On Wed, Nov 04, 2020 at 09:21:35AM +0000, Merger, Edgar [AUTOSOL/MAS/AUGS] 
wrote:
> AMD-Vi: Completion-Wait loop timed out is at [65499.964105] but amdgpu-error 
> is at [   52.772273], hence much earlier.

Yes, but it could be the same underlying reason.

> Have not tried to use an upstream kernel yet. Which one would you recommend?

For a first try, use 5.9.3. If it reproduces there, please try booting with 
"pci=noats" on the kernel command line.

Please also send me the output of 'lspci -vvv' and 'lspci -t' of the machine 
where this happens.

Regards,

Joerg


> 
> As far as inconsistencies in the PCI-setup is concerned, the only thing that 
> I know of right now is that we haven´t entered a PCI subsystem vendor and 
> device ID yet. It is still "Advanced Micro Devices". We will change that soon 
> to "General Electric" or "Emerson".
> 
> Best regards,
> Edgar
> 
> -----Original Message-
> From: jroe...@suse.de 
> Sent: Wednesday, 4 November 2020 09:53
> To: Merger, Edgar [AUTOSOL/MAS/AUGS] 
> Cc: iommu@lists.linux-foundation.org
> Subject: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled
> 
> Hi Edgar,
> 
> On Fri, Oct 30, 2020 at 02:26:23PM +0000, Merger, Edgar [AUTOSOL/MAS/AUGS]
> wrote:
> > With one board we have a boot-problem that is reproducible at every ~50 
> > boot.
> > The system is accessible via ssh and works fine except for the 
> > Graphics. The graphics is off. We don´t see a screen. Please see 
> > attached “dmesg.log”. From [52.772273] onwards the kernel reports 
> > drm/amdgpu errors. It even tries to reset the GPU but that fails too.
> > I tried to reset amdgpu also by command “sudo cat 
> > /sys/kernel/debug/dri/N/amdgpu_gpu_recover”. That did not help either.
> 
> Can you reproduce the problem with an upstream kernel too?
> 
> These messages in dmesg indicate some problem in the platform setup:
> 
>   AMD-Vi: Completion-Wait loop timed out
> 
> Might there be some inconsistencies in the PCI setup between the bridges and 
> the endpoints or something?
> 
> Regards,
> 
>   Joerg


Linux-logs.tar.gz
Description: Linux-logs.tar.gz

RE: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

2020-11-04 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Hi Jörg,

The "AMD-Vi: Completion-Wait loop timed out" message is at [65499.964105], but
the amdgpu error is at [   52.772273], hence much earlier.

I have not tried an upstream kernel yet. Which one would you recommend?

As far as inconsistencies in the PCI setup are concerned, the only thing I know
of right now is that we haven't entered a PCI subsystem vendor and device ID
yet. It is still "Advanced Micro Devices". We will change that soon to
"General Electric" or "Emerson".

Best regards,
Edgar

-Original Message-
From: jroe...@suse.de  
Sent: Wednesday, 4 November 2020 09:53
To: Merger, Edgar [AUTOSOL/MAS/AUGS] 
Cc: iommu@lists.linux-foundation.org
Subject: [EXTERNAL] Re: amdgpu error whenever IOMMU is enabled

Hi Edgar,

On Fri, Oct 30, 2020 at 02:26:23PM +0000, Merger, Edgar [AUTOSOL/MAS/AUGS] 
wrote:
> With one board we have a boot-problem that is reproducible at every ~50 boot.
> The system is accessible via ssh and works fine except for the 
> Graphics. The graphics is off. We don´t see a screen. Please see 
> attached “dmesg.log”. From [52.772273] onwards the kernel reports 
> drm/amdgpu errors. It even tries to reset the GPU but that fails too. 
> I tried to reset amdgpu also by command “sudo cat 
> /sys/kernel/debug/dri/N/amdgpu_gpu_recover”. That did not help either.

Can you reproduce the problem with an upstream kernel too?

These messages in dmesg indicate some problem in the platform setup:

AMD-Vi: Completion-Wait loop timed out

Might there be some inconsistencies in the PCI setup between the bridges and 
the endpoints or something?

Regards,

Joerg

RE: amdgpu error whenever IOMMU is enabled

2020-11-03 Thread Merger, Edgar [AUTOSOL/MAS/AUGS]
Hi Jörg,

I see that amdgpu uses the amd_iommu_v2 kernel module.
To me, this is the last missing puzzle piece that explains why that amdgpu bug
depends on the IOMMU.

Best regards
Edgar

From: Merger, Edgar [AUTOSOL/MAS/AUGS]
Sent: Friday, 30 October 2020 15:26
To: jroe...@suse.de
Cc: iommu@lists.linux-foundation.org
Subject: amdgpu error whenever IOMMU is enabled

Hello Jörg,

We have developed a board, "mCOM10L1900", that can be populated with an AMD
R1305G Ryzen CPU but also with other CPUs from AMD's R1000 and V1000 series.
Please see the attached datasheet.

On one board we have a boot problem that reproduces about once every 50 boots.
The system is accessible via ssh and works fine except for the graphics: the
display stays off and we don't see a screen. Please see the attached
"dmesg.log". From [52.772273] onwards the kernel reports drm/amdgpu errors. It
even tries to reset the GPU, but that fails too. I also tried to reset amdgpu
with the command "sudo cat /sys/kernel/debug/dri/N/amdgpu_gpu_recover". That
did not help either.
A similar error is reported at
https://bugzilla.kernel.org/show_bug.cgi?id=204241. However, the patch applied
there should already be in the Linux kernel we use, which is
"5.4.0-47-generic".

I also found that "amdgpu_info" shows different versions under "SMC firmware
version". Please see the attachment. It is still unclear what role the IOMMU
plays here. Whenever we turn it off, the error no longer shows up on the
failing board.

Best regards
Edgar
