Re: source-id verification failures

2018-10-15 Thread Jacob Pan
On Thu, 11 Oct 2018 12:09:16 -0700
Jerry Snitselaar  wrote:

> On Fri Oct 05 18, Jacob Pan wrote:
> >On Thu, 4 Oct 2018 13:57:24 -0700
> >Jerry Snitselaar  wrote:
> >  
> >> >
> >> >On Tue, Oct 02, 2018 at 10:25:29AM -0700, Jerry Snitselaar
> >> >wrote:  
> >> >> I've been trying to track down a problem where an hp dl380 gen8
> >> >> with a Cavium QLogic BR-1860 Fabric Adapter is getting source-id
> >> >> verification failures when running dhclient against that
> >> >> interface. This started showing up when I backported the iova
> >> >> deferred flushing patches. So far this has only been seen on
> >> >> this one system, but I'm trying to understand why it appears
> >> >> with the new deferred flushing code. I also see it with both
> >> >> 4.18.10, and 4.19.0-rc6 kernels.
> >> >>  
> >Hi Jerry,
> >Could you confirm that you see this failure in v4.19-rc6 kernel only
> >without "strict" mode? I don't see a connection between deferred
> >flushing and IR here, AFAIK deferred flush only affects DMA
> >remapping.
> >
> >Also, does the SID failure occur on other devices under the same
> >IOMMU?
> >
> >
> >Thanks,
> >
> >Jacob
> >  
> 
> Confirmed the system doesn't see the problem with intel_iommu=strict.
> We've only seen SID failures occur for the 2 ports on the brocade
> device.
> 
Thanks for the data. I am puzzled about how an IR fault could be related
to strict mode. We don't defer flushing for IR. The only link I can see
is that the queued invalidation (QI) interface is shared, so non-strict
mode may hold up the QI longer from time to time.
Perhaps add a printk in modify_irte() and see whether the fault happens
around IRTE updates, e.g.:
@@ -170,10 +171,15 @@ static int modify_irte(struct irq_2_iommu *irq_iommu,
 	index = irq_iommu->irte_index + irq_iommu->sub_handle;
 	irte = &iommu->ir_table->base[index];
 
+	pr_debug("irte h:%llx l:%llx, sid:%x\n",
+		 irte_modified->high, irte_modified->low,
+		 irte_modified->sid);
+

> Another data point is that there is a dl388 gen8 with the same card,
> and we don't see any problems there. I'd say it is something with
> this system, but it is odd that the problem starts showing itself
> when I add those patches.
> 
All I can say is that the following IRTE dump looks sane. You might be
able to relax the SID check from the full 16 bits to 13 bits, which
excludes the three function bits (bits 80-81, SQ=0b11). But I doubt that
is the issue.

Jacob
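
To read the IRTE dump against the SID/SQ remark, the IRTE_high word can be split into its source-id fields. The following is a minimal C sketch, not kernel code. It assumes the VT-d layout in which bits 64-79 of the IRTE hold the SID, bits 80-81 the SQ and bits 82-83 the SVT; the names irte_high_fields, decode_irte_high and format_sid are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Source-id related fields of the upper 64 bits of an IRTE.
 * Assumed layout (relative to the high word):
 *   bits 0-15  SID, bits 16-17 SQ, bits 18-19 SVT.
 */
struct irte_high_fields {
	uint16_t sid;	/* source-id: bus[15:8], device[7:3], function[2:0] */
	uint8_t sq;	/* source-id qualifier */
	uint8_t svt;	/* source validation type */
};

/* Extract SID, SQ and SVT from the IRTE_high value (hypothetical helper). */
static struct irte_high_fields decode_irte_high(uint64_t high)
{
	struct irte_high_fields f;

	f.sid = (uint16_t)(high & 0xffff);
	f.sq = (uint8_t)((high >> 16) & 0x3);
	f.svt = (uint8_t)((high >> 18) & 0x3);
	return f;
}

/* Format a source-id as bus:dev.fn, the notation used in the SrcID column. */
static void format_sid(uint16_t sid, char *buf, size_t len)
{
	snprintf(buf, len, "%02x:%02x.%x",
		 (unsigned int)(sid >> 8),
		 (unsigned int)((sid >> 3) & 0x1f),
		 (unsigned int)(sid & 0x7));
}
```

Applied to the IRTE_high value 00042400 from the dump, decode_irte_high() gives SID 0x2400 (formatted as 24:00.0), SQ 0 and SVT 1, that is, a full 16-bit source-id compare, which agrees with the SrcID column.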

> 
> # cat ir_translation_struct 
> Remapped Interrupt supported on IOMMU: dmar0
>  IR table address:42f20
>  Entry  SrcID    DstID     Vct  IRTE_high  IRTE_low
>  2      24:00.0  00020001  21   00042400   000200010021000d
>  3      24:00.0  00020200  28   00042400   00020228000d
>  4      24:00.0  00020800  28   00042400   00020828000d
>  5      24:00.0  0001      ef   00042400   000100ef000d
>  6      24:00.0  0001      ef   00042400   000100ef000d
>  7      24:00.0  0001      ef   00042400   000100ef000d
>  8      24:00.0  0001      ef   00042400   000100ef000d
>  9      24:00.0  0001      ef   00042400   000100ef000d
>  10     24:00.0  0001      ef   00042400   000100ef000d
>  11     24:00.0  0001      ef   00042400   000100ef000d
>  12     24:00.0  0001      ef   00042400   000100ef000d
>  13     24:00.0  0001      ef   00042400   000100ef000d
>  14     24:00.0  0001      ef   00042400   000100ef000d
>  15     24:00.0  0001      ef   00042400   000100ef000d
>  16     24:00.0  0001      ef   00042400   000100ef000d
>  17     24:00.0  0001      ef   00042400   000100ef000d
>  18     24:00.0  0001      ef   00042400   000100ef000d
>  19     24:00.0  0001      ef   00042400   000100ef000d
>  20     24:00.1  00020004  25   00042401   000200040025000d
>  21     24:00.1  00020001  29   00042401   000200010029000d
>  22     24:00.1  00020004  29   00042401   000200040029000d
>  23     24:00.1  0001      ef   00042401   000100ef000d
>  24     24:00.1  0001      ef   00042401   000100ef000d
>  25     24:00.1  0001      ef   00042401   000100ef000d
>  26     24:00.1  0001      ef   00042401   000100ef000d
>  27     24:00.1  0001      ef   00042401   000100ef000d
>  28     24:00.1  0001      ef   00042401   000100ef000d
>  29     24:00.1  0001      ef   00042401   000100ef000d
>  30     24:00.1  0001      ef   00042401   000100ef000d
>  31     24:00.1  0001      ef   00042401   000100ef000d
>  32     24

Re: source-id verification failures

2018-10-11 Thread Jerry Snitselaar

On Fri Oct 05 18, Jacob Pan wrote:

On Thu, 4 Oct 2018 13:57:24 -0700
Jerry Snitselaar  wrote:


>
>On Tue, Oct 02, 2018 at 10:25:29AM -0700, Jerry Snitselaar wrote:
>> I've been trying to track down a problem where an hp dl380 gen8
>> with a Cavium QLogic BR-1860 Fabric Adapter is getting source-id
>> verification failures when running dhclient against that
>> interface. This started showing up when I backported the iova
>> deferred flushing patches. So far this has only been seen on this
>> one system, but I'm trying to understand why it appears with the
>> new deferred flushing code. I also see it with both 4.18.10, and
>> 4.19.0-rc6 kernels.
>>

Hi Jerry,
Could you confirm that you see this failure in v4.19-rc6 kernel only
without "strict" mode? I don't see a connection between deferred
flushing and IR here, AFAIK deferred flush only affects DMA remapping.

Also, does the SID failure occur on other devices under the same IOMMU?


Thanks,

Jacob



Confirmed the system doesn't see the problem with intel_iommu=strict.
We've only seen SID failures occur for the 2 ports on the brocade device.

Another data point is that there is a dl388 gen8 with the same card,
and we don't see any problems there. I'd say it is something with
this system, but it is odd that the problem starts showing itself
when I add those patches.


# cat ir_translation_struct 
Remapped Interrupt supported on IOMMU: dmar0

IR table address:42f20
Entry  SrcID    DstID     Vct  IRTE_high  IRTE_low
2      24:00.0  00020001  21   00042400   000200010021000d
3      24:00.0  00020200  28   00042400   00020228000d
4      24:00.0  00020800  28   00042400   00020828000d
5      24:00.0  0001      ef   00042400   000100ef000d
6      24:00.0  0001      ef   00042400   000100ef000d
7      24:00.0  0001      ef   00042400   000100ef000d
8      24:00.0  0001      ef   00042400   000100ef000d
9      24:00.0  0001      ef   00042400   000100ef000d
10     24:00.0  0001      ef   00042400   000100ef000d
11     24:00.0  0001      ef   00042400   000100ef000d
12     24:00.0  0001      ef   00042400   000100ef000d
13     24:00.0  0001      ef   00042400   000100ef000d
14     24:00.0  0001      ef   00042400   000100ef000d
15     24:00.0  0001      ef   00042400   000100ef000d
16     24:00.0  0001      ef   00042400   000100ef000d
17     24:00.0  0001      ef   00042400   000100ef000d
18     24:00.0  0001      ef   00042400   000100ef000d
19     24:00.0  0001      ef   00042400   000100ef000d
20     24:00.1  00020004  25   00042401   000200040025000d
21     24:00.1  00020001  29   00042401   000200010029000d
22     24:00.1  00020004  29   00042401   000200040029000d
23     24:00.1  0001      ef   00042401   000100ef000d
24     24:00.1  0001      ef   00042401   000100ef000d
25     24:00.1  0001      ef   00042401   000100ef000d
26     24:00.1  0001      ef   00042401   000100ef000d
27     24:00.1  0001      ef   00042401   000100ef000d
28     24:00.1  0001      ef   00042401   000100ef000d
29     24:00.1  0001      ef   00042401   000100ef000d
30     24:00.1  0001      ef   00042401   000100ef000d
31     24:00.1  0001      ef   00042401   000100ef000d
32     24:00.1  0001      ef   00042401   000100ef000d
33     24:00.1  0001      ef   00042401   000100ef000d
34     24:00.1  0001      ef   00042401   000100ef000d
35     24:00.1  0001      ef   00042401   000100ef000d
36     24:00.1  0001      ef   00042401   000100ef000d
37     24:00.1  0001      ef   00042401   000100ef000d
39     20:04.0  00020010  28   00042020   000200100028000d
41     20:04.1  00020040  28   00042021   000200400028000d
42     20:04.2  00020100  28   00042022   00020128000d
43     20:04.3  00020400  28   00042023   00020428000d
44     20:04.4  00020002  28   00042024   000200020028000d
45     20:04.5  00020008  28   00042025   000200080028000d
46     20:04.6  00020020  28   00042026   000200200028000d
47     20:04.7  00020080  28   00042027   000200800028000d

Remapped Interrupt supported on IOMMU: dmar1
IR table address:42e80
Entry  SrcID    DstID     Vct  IRTE_high  IRTE_low
0      00:1e.1  00020020  2a   000400f1   00020020002a000d
1      00:1e.1  0001      30   000400f1   0001003d
2      00:1e.1  00020200  2a   000400f1   0002022a000d
7      00:1e.1  00020004  2b   000400f1   00020004002b000d
8      00:1e.1  00020040  2b   000400f1   00020040002b000d
11     00:1e.1  00020100  2b   000400f1   0002012b000d
17     00:1e.1  00020010  29   000

Re: source-id verification failures

2018-10-08 Thread Jerry Snitselaar

On Fri Oct 05 18, Raj, Ashok wrote:

On Thu, Oct 04, 2018 at 03:07:46PM -0700, Jacob Pan wrote:

On Thu, 4 Oct 2018 13:57:24 -0700
Jerry Snitselaar  wrote:

> On Thu Oct 04 18, Joerg Roedel wrote:
> >Hi Jerry,
> >
> >thanks for the report.
> >
> >On Tue, Oct 02, 2018 at 10:25:29AM -0700, Jerry Snitselaar wrote:
> >> I've been trying to track down a problem where an hp dl380 gen8
> >> with a Cavium QLogic BR-1860 Fabric Adapter is getting source-id
> >> verification failures when running dhclient against that
> >> interface. This started showing up when I backported the iova
> >> deferred flushing patches. So far this has only been seen on this
> >> one system, but I'm trying to understand why it appears with the
> >> new deferred flushing code. I also see it with both 4.18.10, and
> >> 4.19.0-rc6 kernels.


Weird.. IIRC, these were there to accommodate phantom functions.
I thought PCIe allowed an 8-bit tag, so if the device needs to allow
more than 256 outstanding transactions, it could use the extra
function numbers to account for them.

I assumed Linux didn't enable phantom functions. If that's the case
we also need to ensure all the DMA is aliased properly.

I'm assuming if interrupts are generated by other aliases we could
block them.

Is this device one such?

Cheers,
Ashok


> >>
> >> [35645.282021] bna 0000:24:00.1 ens5f1: link down
> >> [35645.298396] bna 0000:24:00.0 ens5f0: link down
> >> [35650.313210] DMAR: DRHD: handling fault status reg 2
> >> [35650.332477] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
> >> [35655.137667] bna 0000:24:00.0 ens5f0: link up
> >> [35657.532454] bna 0000:24:00.1 ens5f1: link up
> >> [35664.281563] bna 0000:24:00.1 ens5f1: link down
> >> [35664.298103] bna 0000:24:00.0 ens5f0: link down
> >> [35669.313568] DMAR: DRHD: handling fault status reg 102
> >> [35669.333198] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
> >> [35674.081212] bna 0000:24:00.0 ens5f0: link up
> >> [35674.981280] bna 0000:24:00.1 ens5f1: link up
> >>
> >>
> >> Any ideas?
> >
> >No, not yet. Can you please post the output of lspci -vvv?
> >
> >Jacob, can you or someone from your team please also have a look into
> >this problem report?
> >
yep.
+Ashok

Jerry,
Could you also dump the interrupt remapping table with this patchset?
https://lkml.org/lkml/2018/9/12/44

Thanks,



Sorry, I've been on dad duty the past few days. I should be back working on
this tonight or tomorrow.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: source-id verification failures

2018-10-05 Thread Raj, Ashok
On Thu, Oct 04, 2018 at 03:07:46PM -0700, Jacob Pan wrote:
> On Thu, 4 Oct 2018 13:57:24 -0700
> Jerry Snitselaar  wrote:
> 
> > On Thu Oct 04 18, Joerg Roedel wrote:
> > >Hi Jerry,
> > >
> > >thanks for the report.
> > >
> > >On Tue, Oct 02, 2018 at 10:25:29AM -0700, Jerry Snitselaar wrote:  
> > >> I've been trying to track down a problem where an hp dl380 gen8
> > >> with a Cavium QLogic BR-1860 Fabric Adapter is getting source-id
> > >> verification failures when running dhclient against that
> > >> interface. This started showing up when I backported the iova
> > >> deferred flushing patches. So far this has only been seen on this
> > >> one system, but I'm trying to understand why it appears with the
> > >> new deferred flushing code. I also see it with both 4.18.10, and
> > >> 4.19.0-rc6 kernels.

Weird.. IIRC, these were there to accommodate phantom functions.
I thought PCIe allowed an 8-bit tag, so if the device needs to allow
more than 256 outstanding transactions, it could use the extra
function numbers to account for them.

I assumed Linux didn't enable phantom functions. If that's the case
we also need to ensure all the DMA is aliased properly.

I'm assuming if interrupts are generated by other aliases we could
block them. 

Is this device one such?

Cheers,
Ashok
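
The concern about aliases can be made concrete with a small sketch of how a source-id compare behaves when a request arrives with a different function number, as it would from a phantom function. This is an illustration under assumptions, not the hardware or driver logic: the masks follow a reading of the VT-d SQ encodings (0 compares all 16 bits, 1 ignores function bit 2, 2 ignores function bits 2:1, 3 ignores all three function bits), and make_rid, sq_compare_mask and sid_verifies are hypothetical helper names. Whether the BR-1860 actually uses phantom functions is not established in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a 16-bit requester ID: bus[15:8], device[7:3], function[2:0]. */
static uint16_t make_rid(unsigned int bus, unsigned int dev, unsigned int fn)
{
	return (uint16_t)(((bus & 0xff) << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* Bits of the source-id that take part in the compare for a given SQ
 * value (assumed encoding, see the paragraph above).
 */
static uint16_t sq_compare_mask(unsigned int sq)
{
	assert(sq <= 3);
	switch (sq) {
	case 0:
		return 0xffff;	/* compare all 16 bits */
	case 1:
		return 0xfffb;	/* ignore function bit 2 */
	case 2:
		return 0xfff9;	/* ignore function bits 2:1 */
	default:
		return 0xfff8;	/* ignore all three function bits */
	}
}

/* True if the requester ID of an interrupt request matches the SID
 * programmed in the IRTE under the given SQ.
 */
static bool sid_verifies(uint16_t req_rid, uint16_t irte_sid, unsigned int sq)
{
	return ((req_rid ^ irte_sid) & sq_compare_mask(sq)) == 0;
}
```

With an IRTE programmed for 24:00.0, sid_verifies(make_rid(0x24, 0, 2), make_rid(0x24, 0, 0), 0) is false, so an interrupt arriving with function number 2 would be blocked under a full 16-bit compare, while the same call with SQ 3 returns true.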

> > >>
> > >> [35645.282021] bna 0000:24:00.1 ens5f1: link down
> > >> [35645.298396] bna 0000:24:00.0 ens5f0: link down
> > >> [35650.313210] DMAR: DRHD: handling fault status reg 2
> > >> [35650.332477] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
> > >> [35655.137667] bna 0000:24:00.0 ens5f0: link up
> > >> [35657.532454] bna 0000:24:00.1 ens5f1: link up
> > >> [35664.281563] bna 0000:24:00.1 ens5f1: link down
> > >> [35664.298103] bna 0000:24:00.0 ens5f0: link down
> > >> [35669.313568] DMAR: DRHD: handling fault status reg 102
> > >> [35669.333198] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
> > >> [35674.081212] bna 0000:24:00.0 ens5f0: link up
> > >> [35674.981280] bna 0000:24:00.1 ens5f1: link up
> > >>
> > >>
> > >> Any ideas?  
> > >
> > >No, not yet. Can you please post the output of lspci -vvv?
> > >
> > >Jacob, can you or someone from your team please also have a look into
> > >this problem report?
> > >
> yep.
> +Ashok
> 
> Jerry,
> Could you also dump the interrupt remapping table with this patchset?
> https://lkml.org/lkml/2018/9/12/44
> 
> Thanks,
> 


Re: source-id verification failures

2018-10-05 Thread Jacob Pan
On Thu, 4 Oct 2018 13:57:24 -0700
Jerry Snitselaar  wrote:

> >
> >On Tue, Oct 02, 2018 at 10:25:29AM -0700, Jerry Snitselaar wrote:  
> >> I've been trying to track down a problem where an hp dl380 gen8
> >> with a Cavium QLogic BR-1860 Fabric Adapter is getting source-id
> >> verification failures when running dhclient against that
> >> interface. This started showing up when I backported the iova
> >> deferred flushing patches. So far this has only been seen on this
> >> one system, but I'm trying to understand why it appears with the
> >> new deferred flushing code. I also see it with both 4.18.10, and
> >> 4.19.0-rc6 kernels.
> >>
Hi Jerry,
Could you confirm that you see this failure in v4.19-rc6 kernel only
without "strict" mode? I don't see a connection between deferred
flushing and IR here, AFAIK deferred flush only affects DMA remapping.

Also, does the SID failure occur on other devices under the same IOMMU?


Thanks,

Jacob



Re: source-id verification failures

2018-10-04 Thread Joerg Roedel
Hi Jerry,

thanks for the report.

On Tue, Oct 02, 2018 at 10:25:29AM -0700, Jerry Snitselaar wrote:
> I've been trying to track down a problem where an hp dl380 gen8 with a
> Cavium QLogic BR-1860 Fabric Adapter is getting source-id verification
> failures when running dhclient against that interface. This started
> showing up when I backported the iova deferred flushing patches. So far
> this has only been seen on this one system, but I'm trying to understand
> why it appears with the new deferred flushing code. I also see it with
> both 4.18.10, and 4.19.0-rc6 kernels.
> 
> [35645.282021] bna 0000:24:00.1 ens5f1: link down
> [35645.298396] bna 0000:24:00.0 ens5f0: link down
> [35650.313210] DMAR: DRHD: handling fault status reg 2
> [35650.332477] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
> [35655.137667] bna 0000:24:00.0 ens5f0: link up
> [35657.532454] bna 0000:24:00.1 ens5f1: link up
> [35664.281563] bna 0000:24:00.1 ens5f1: link down
> [35664.298103] bna 0000:24:00.0 ens5f0: link down
> [35669.313568] DMAR: DRHD: handling fault status reg 102
> [35669.333198] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
> [35674.081212] bna 0000:24:00.0 ens5f0: link up
> [35674.981280] bna 0000:24:00.1 ens5f1: link up
> 
> 
> Any ideas?

No, not yet. Can you please post the output of lspci -vvv?

Jacob, can you or someone from your team please also have a look into
this problem report?

Thanks,

Joerg


source-id verification failures

2018-10-02 Thread Jerry Snitselaar

I've been trying to track down a problem where an hp dl380 gen8 with a
Cavium QLogic BR-1860 Fabric Adapter is getting source-id verification
failures when running dhclient against that interface. This started
showing up when I backported the iova deferred flushing patches. So far
this has only been seen on this one system, but I'm trying to understand
why it appears with the new deferred flushing code. I also see it with
both 4.18.10, and 4.19.0-rc6 kernels.

[35645.282021] bna 0000:24:00.1 ens5f1: link down
[35645.298396] bna 0000:24:00.0 ens5f0: link down
[35650.313210] DMAR: DRHD: handling fault status reg 2
[35650.332477] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
[35655.137667] bna 0000:24:00.0 ens5f0: link up
[35657.532454] bna 0000:24:00.1 ens5f1: link up
[35664.281563] bna 0000:24:00.1 ens5f1: link down
[35664.298103] bna 0000:24:00.0 ens5f0: link down
[35669.313568] DMAR: DRHD: handling fault status reg 102
[35669.333198] DMAR: [INTR-REMAP] Request device [24:00.0] fault index 14 [fault reason 38] Blocked an interrupt request due to source-id verification failure
[35674.081212] bna 0000:24:00.0 ens5f0: link up
[35674.981280] bna 0000:24:00.1 ens5f1: link up


Any ideas?


Regards,
Jerry