Re: EPROTO when USB 3 GbE adapters are under load
On 19/11/18 10:30 pm, Hao Wei Tee wrote: On 25/10/18 11:37 am, Hao Wei Tee wrote: I got another RTL8153-powered adapter and, guess what, I can't seem to reproduce this anymore. Not sure if it is something that changed in 4.19 or something to do with the adapters themselves (I don't have the old adapter with me right now). It finally happened again after a week of use. I guess it still happens, but it's less common. Oh well. -- Hao Wei
Re: EPROTO when USB 3 GbE adapters are under load
On 25/10/18 11:37 am, Hao Wei Tee wrote: Hi, There are multiple reports[1][2][3] (more elsewhere on the internet) of USB 3 GbE adapters throwing EPROTO errors on USB transfer especially when the devices are under load. Both of the two common chipsets (Realtek RTL8153 (r8152[4]) and Asix AX88179 (ax88179_178a[5])) seem to exhibit this behaviour. I got another RTL8153-powered adapter and, guess what, I can't seem to reproduce this anymore. Not sure if it is something that changed in 4.19 or something to do with the adapters themselves (I don't have the old adapter with me right now). -- Hao Wei
Re: EPROTO when USB 3 GbE adapters are under load
On 25/10/18 11:04 PM, Mathias Nyman wrote: There is a patch in usb-next that might help. f8f80be xhci: Use soft retry to recover faster from transaction errors It soft resets the halted host side endpoint, clears the halt without clearing the sequence number. FWIW, although I guess you might've guessed, the patch didn't seem to change the behaviour at all. But thanks in any case. I'll see what else I can figure out. -- Hao Wei
Re: EPROTO when USB 3 GbE adapters are under load
On 25/10/18 4:45 PM, Mathias Nyman wrote: Reproducing the issue with a recent kernel with xhci traces enabled should show the reason for EPROTO error. Add xhci traces before triggering the issue with: mount -t debugfs none /sys/kernel/debug echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable after issue is triggered save and send the trace at /sys/kernel/debug/tracing/trace Note that it might be huge Thanks for the suggestion. Here[1] is (part of) the trace starting about 250 lines before the EPROTO happens. [1]: https://gist.githubusercontent.com/angelsl/fdd04d2bded3a41029122b0536c00944/raw/b8e9f7d2695ac030b7f3dd53a1a9c3f37da7b7a0/trace The first error happens at line 243 (timestamp 8144.248398) coinciding with the start of errors spewed into dmesg: [ 8144.245359] r8152 2-2:1.0 enp0s20f0u2: Rx status -71 [ 8144.248837] r8152 2-2:1.0 enp0s20f0u2: Rx status -71 [ 8144.252392] r8152 2-2:1.0 enp0s20f0u2: Rx status -71 [ 8144.255987] r8152 2-2:1.0 enp0s20f0u2: Stop submitting intr, status -71 ... It doesn't seem to point to anything in particular, but I'm not really familiar with USB. I'll do some digging in any case... Thanks! -- Hao Wei
EPROTO when USB 3 GbE adapters are under load
Hi, There are multiple reports[1][2][3] (more elsewhere on the internet) of USB 3 GbE adapters throwing EPROTO errors on USB transfer especially when the devices are under load. Both of the two common chipsets (Realtek RTL8153 (r8152[4]) and Asix AX88179 (ax88179_178a[5])) seem to exhibit this behaviour. [1]: https://bugzilla.kernel.org/show_bug.cgi?id=75381 [2]: https://bugzilla.kernel.org/show_bug.cgi?id=196747 [3]: https://bugzilla.kernel.org/show_bug.cgi?id=198931 [4]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/usb/r8152.c [5]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/usb/ax88179_178a.c I'm trying to figure out why this happens (while it doesn't seem to happen on other OSes, but I'm not sure). I think it's unlikely that both drivers are buggy, so perhaps it is something to do with the USB stack instead of the device drivers. It wouldn't be surprising if both devices actually don't adhere to the USB specs properly and other OSes are just more tolerant of that (?) but that is just conjecture on my part. Does anyone have any ideas? Thanks. -- Hao Wei
Re: VL805 xHCI DMA read faults
On 16/10/2017 20:23, Robin Murphy wrote: > On 16/10/17 12:54, Hao Wei Tee wrote: >> On 12/10/2017 21:36, Mathias Nyman wrote: >>> You could try booting with xhci_hcd.dyndbg=+p added to the kernel command >>> line. >> >> I can't find anything relevant... Hmm. > > Is your VL805 on the motherboard or an add-on card? One other possibly > important difference that comes to mind is that on my arm64 system Linux > is the only agent to ever touch the xHCI - UEFI doesn't even try to > probe it. It seems likely that a full-featured PC firmware might have > been more hands-on, especially if the controller is on-board. > > It seems noteworthy that these RMRRs are within about 10MB of the > faulting address... > > ...and that correspondingly for this to be a Linux-allocated IOVA would > mean over 540MB having been mapped for DMA already, which seems somewhat > less likely than it being some leftover physical address from firmware. > > Can you try instrumenting xhci_segment_alloc() to get an idea of what > the actual DMA addresses of the various queues are at this point? > > Robin Sorry for taking so long, I got caught up with other stuff.. Anyway, I think you may be right about the addresses coming from UEFI. With the IOMMU on, xhci_segment_alloc is never called (the DMA faults happen in very early xHCI init). With the IOMMU off, the allocated segments look something like this: xhci_segment_alloc() = (xhci_segment) { .dma = 0x0002136b9000 .bounce_dma = 0x .bounce_buf = (null) .bounce_offs = 0 .bounce_len = 0 } xhci_segment_alloc() = (xhci_segment) { .dma = 0x0002136bb000 .bounce_dma = 0x .bounce_buf = (null) .bounce_offs = 0 .bounce_len = 0 } xhci_segment_alloc() = (xhci_segment) { .dma = 0x00021377e000 .bounce_dma = 0x .bounce_buf = (null) .bounce_offs = 0 .bounce_len = 0 } xhci_segment_alloc() = (xhci_segment) { .dma = 0x000213776000 .bounce_dma = 0x .bounce_buf = (null) .bounce_offs = 0 .bounce_len = 0 } xhci_segment_alloc() = (xhci_segment) { .dma = 0x0002126ec000 .bounce_dma = 0x .bounce_buf = 9887d4d91d30 .bounce_offs = 0 .bounce_len = 0 } xhci_segment_alloc() = (xhci_segment) { .dma = 0x0002126ed000 .bounce_dma = 0x .bounce_buf = 9887d4d91140 .bounce_offs = 0 .bounce_len = 0 } .. which is nowhere near the address the DMA faults occur at, although I'm not sure if having the IOMMU on affects this (??). Thanks. -- Hao Wei -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VL805 xHCI DMA read faults
On 16/10/2017 20:23, Robin Murphy wrote: > Is your VL805 on the motherboard or an add-on card? One other possibly > important difference that comes to mind is that on my arm64 system Linux > is the only agent to ever touch the xHCI - UEFI doesn't even try to > probe it. It seems likely that a full-featured PC firmware might have > been more hands-on, especially if the controller is on-board. Ah, right, I probably should have mentioned that -- it's on a Gigabyte H61M-S2P rev 3. The H61 chipset doesn't have USB 3.0, of course, so Gigabyte tacked on this VIA VL805. > It seems noteworthy that these RMRRs are within about 10MB of the > faulting address... > > ...and that correspondingly for this to be a Linux-allocated IOVA would > mean over 540MB having been mapped for DMA already, which seems somewhat > less likely than it being some leftover physical address from firmware. > > Can you try instrumenting xhci_segment_alloc() to get an idea of what > the actual DMA addresses of the various queues are at this point? > > Robin. Will do. -- Hao Wei -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VL805 xHCI DMA read faults
On 12/10/2017 21:36, Mathias Nyman wrote: > You could try booting with xhci_hcd.dyndbg=+p added to the kernel command > line. I can't find anything relevant... Hmm. Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=... rw intel_iommu=on xhci_hcd.dyndbg=+p ACPI: DMAR 0xDE2C20E0 78 (v01 INTEL SNB 0001 INTL 0001) DMAR: IOMMU enabled DMAR: Host address width 36 DMAR: DRHD base: 0x00fed9 flags: 0x1 DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap c9008020660262 ecap f0105a DMAR: RMRR base: 0x00dec75000 end: 0x00dec83fff DMAR-IR: IOAPIC id 2 under DRHD base 0xfed9 IOMMU 0 DMAR-IR: HPET id 0 under DRHD base 0xfed9 DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping. DMAR-IR: Enabled IRQ remapping in x2apic mode DMAR: No ATSR found DMAR: dmar0: Using Queued invalidation DMAR: Setting RMRR: DMAR: Setting identity map for device :00:1a.0 [0xdec75000 - 0xdec83fff] DMAR: Setting identity map for device :00:1d.0 [0xdec75000 - 0xdec83fff] DMAR: Prepare 0-16MiB unity mapping for LPC DMAR: Setting identity map for device :00:1f.0 [0x0 - 0xff] DMAR: Intel(R) Virtualization Technology for Directed I/O ... iommu: Adding device :03:00.0 to group 11 ... xhci_hcd :03:00.0: xHCI Host Controller xhci_hcd :03:00.0: new USB bus registered, assigned bus number 2 xhci_hcd :03:00.0: xHCI capability registers at c3ed00c99000: xhci_hcd :03:00.0: CAPLENGTH AND HCIVERSION 0x120: xhci_hcd :03:00.0: CAPLENGTH: 0x20 xhci_hcd :03:00.0: HCIVERSION: 0x100 xhci_hcd :03:00.0: HCSPARAMS 1: 0x5000420 xhci_hcd :03:00.0: Max device slots: 32 xhci_hcd :03:00.0: Max interrupters: 4 xhci_hcd :03:00.0: Max ports: 5 xhci_hcd :03:00.0: HCSPARAMS 2: 0xfc31 xhci_hcd :03:00.0: Isoc scheduling threshold: 1 xhci_hcd :03:00.0: Maximum allowed segments in event ring: 3 xhci_hcd :03:00.0: HCSPARAMS 3 0xe70004: xhci_hcd :03:00.0: Worst case U1 device exit latency: 4 xhci_hcd :03:00.0: Worst case U2 device exit latency: 231 xhci_hcd :03:00.0: HCC PARAMS 0x2841eb: xhci_hcd :03:00.0: HC generates 64 bit addresses xhci_hcd :03:00.0: HC hasn't Contiguous Frame ID Capability xhci_hcd :03:00.0: HC can't generate Stopped - Short Package event xhci_hcd :03:00.0: FIXME: more HCCPARAMS debugging xhci_hcd :03:00.0: RTSOFF 0x200: xhci_hcd :03:00.0: xHCI operational registers at c3ed00c99020: xhci_hcd :03:00.0: USBCMD 0x0: xhci_hcd :03:00.0: HC is being stopped xhci_hcd :03:00.0: HC has finished hard reset xhci_hcd :03:00.0: Event Interrupts disabled xhci_hcd :03:00.0: Host System Error Interrupts disabled xhci_hcd :03:00.0: HC has finished light reset xhci_hcd :03:00.0: USBSTS 0x1: xhci_hcd :03:00.0: Event ring is empty xhci_hcd :03:00.0: No Host System Error xhci_hcd :03:00.0: HC is halted xhci_hcd :03:00.0: c3ed00c99420 port status reg = 0x4ee1 xhci_hcd :03:00.0: c3ed00c99424 port power reg = 0x0 xhci_hcd :03:00.0: c3ed00c99428 port link reg = 0x0 xhci_hcd :03:00.0: c3ed00c9942c port reserved reg = 0x0 xhci_hcd :03:00.0: c3ed00c99430 port status reg = 0x2a0 xhci_hcd :03:00.0: c3ed00c99434 port power reg = 0x0 xhci_hcd :03:00.0: c3ed00c99438 port link reg = 0x0 xhci_hcd :03:00.0: c3ed00c9943c port reserved reg = 0x0 xhci_hcd :03:00.0: c3ed00c99440 port status reg = 0x2a0 xhci_hcd :03:00.0: c3ed00c99444 port power reg = 0x0 xhci_hcd :03:00.0: c3ed00c99448 port link reg = 0x0 xhci_hcd :03:00.0: c3ed00c9944c port reserved reg = 0x0 xhci_hcd :03:00.0: c3ed00c99450 port status reg = 0x2a0 xhci_hcd :03:00.0: c3ed00c99454 port power reg = 0x0 xhci_hcd :03:00.0: c3ed00c99458 port link reg = 0x0 xhci_hcd :03:00.0: c3ed00c9945c port reserved reg = 0x0 xhci_hcd :03:00.0: c3ed00c99460 port status reg = 0x2a0 xhci_hcd :03:00.0: c3ed00c99464 port power reg = 0x0 xhci_hcd :03:00.0: c3ed00c99468 port link reg = 0x0 xhci_hcd :03:00.0: c3ed00c9946c port reserved reg = 0x0 xhci_hcd :03:00.0: QUIRK: Resetting on resume xhci_hcd :03:00.0: // Halt the HC xhci_hcd :03:00.0: Resetting HCD xhci_hcd :03:00.0: // Reset the HC DMAR: DRHD: handling fault status reg 3 DMAR: [DMA Read] Request device [03:00.0] fault addr de28a000 [fault reason 01] Present bit in root entry is clear ... xhci_hcd :03:00.0: can't setup: -110 xhci_hcd :03:00.0: USB bus 2 deregistered xhci_hcd :03:00.0: init :03:00.0 fail, -110 xhci_hcd: probe of :03:00.0 failed with error -110 -- Hao Wei -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VL805 xHCI DMA read faults
On 10/10/2017 22:13, Mathias Nyman wrote: > On 10.10.2017 12:41, David Laight wrote: >> From: Robin Murphy >>> Sent: 09 October 2017 18:39 >> ... - without the IOMMU, block sizes >=128K all settle down into a suspiciously-periodic error every 2048 sectors. >> >> That stinks of being a problem where either the link TRB is part >> way through a USB packet or where a buffer fragment crosses >> a 64k boundary. >> >> Neither is allowed. >> > > Those should be taken care of by the xhci driver already > > xhci_align_td() should make sure the link TRB is at packet boundary, and > TRB_BUFF_LEN_UP_TO_BOUNDARY(addr) in xhci_queue_bulk_tx() should prevent > crossing 64k boundary in a TRB when queuing it. > > more traces and logs of the VIA xhci controller could maybe tell something. > > with the latest kernel: > > echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb > echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable > after failure: > cat /sys/kernel/debug/tracing/trace Unfortunately since the failure on my VL805 is during xhci init tracing doesn't produce anything we don't already know.. # tracer: nop # # _-=> irqs-off # / _=> need-resched #| / _---=> hardirq/softirq #|| / _--=> preempt-depth #||| / delay # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | modprobe-964 [003] 240.271468: xhci_dbg_quirks: QUIRK: Resetting on resume modprobe-964 [003] 240.271471: xhci_dbg_init: // Halt the HC modprobe-964 [003] 240.271477: xhci_dbg_init: // Reset the HC And the associated DMA faults: [ 265.286686] DMAR: DRHD: handling fault status reg 2 [ 265.286688] DMAR: [DMA Read] Request device [03:00.0] fault addr de28a000 [fault reason 01] Present bit in root entry is clear I'll try and figure out exactly what de28a000 points at (or points after..). Thanks. -- Hao Wei -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xhci: Cope with VIA VL805 readahead
On 11/10/2017 02:09, Robin Murphy wrote: > The VIA VL805 host controller is well-known for causing problems on > systems with IOMMUs enabled, ranging from triggering endless streams of > fault messages to locking itself up completely. It appears that the root > of the problem might be an over-aggressive prefetching of TRBs, wherein > consuming commands near the end of a queue segment causes it to read off > the end of the segment, even across a page boundary. This blows up when > DMA mapping ops are backed by an IOMMU, since there is no guarantee that > addresses outside the allocated segment are accessible at all. > > Some trial-and-error investigation reveals that we can avoid such > cross-page reads by not using the last few TRBs in a segment; to that > end, factor out the implicit index of the end-of-segemnt link TRB, and > implement a quirk to move it slightly further forward when necessary. > > Signed-off-by: Robin Murphy> --- > drivers/usb/host/xhci-mem.c | 32 +++- > drivers/usb/host/xhci-pci.c | 5 + > drivers/usb/host/xhci-ring.c | 10 +- > drivers/usb/host/xhci.c | 10 +- > drivers/usb/host/xhci.h | 2 ++ > 5 files changed, 40 insertions(+), 19 deletions(-) Seems like the invalid DMA I'm getting on xhci_reset during the xhci driver's initialisation/probing process isn't the same one you're getting.. This patch doesn't appear to change that at all. Oh well. Thanks, -- Hao Wei -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VL805 xHCI DMA read faults
On 09/10/2017 17:22, Mathias Nyman wrote: > On 08.10.2017 17:03, Hao Wei Tee wrote: >> Hi, >> >> I've been having DMA read faults with my VL805 xHCI controller when the >> Intel IOMMU >> is turned on: >> >> xhci_hcd :03:00.0: xHCI Host Controller >> xhci_hcd :03:00.0: new USB bus registered, assigned bus number 2 >> DMAR: DRHD: handling fault status reg 3 >> DMAR: [DMA Read] Request device [03:00.0] fault addr de28a000 [fault >> reason 01] Present bit in root entry is clear >> >> xhci_hcd :03:00.0: can't setup: -110 >> xhci_hcd :03:00.0: USB bus 2 deregistered >> xhci_hcd :03:00.0: init :03:00.0 fail, -110 >> xhci_hcd: probe of :03:00.0 failed with error -110 >> >> The controller works fine, as far as I can tell, when the IOMMU is off. >> >> I've tracked it down to where CMD_RESET is sent to the controller in >> xhci_reset, >> [1] called from xhci_gen_setup in xhci.c. It seems that when the command >> register >> is being polled in the xhci_handshake after that, the controller tries to do >> a >> DMA read from an address that is apparently invalid (?). Eventually >> xhci_handshake >> times out. >> >> I've tried setting the XHCI_NO_64BIT_SUPPORT quirks flag as someone >> suggested in >> an earlier thread here [2] about a similar/the same(?) device, but that >> doesn't >> seem to have worked. >> >> Help, please. I have no idea how to debug this further. >> > > Could it maybe be related to a iommu/vt-d: Fix scatterlist offset handling > fix: > https://lists.linuxfoundation.org/pipermail/iommu/2017-September/024371.html > > Can you check if that patch is included? I applied that patch on top of current stable (v4.13.5) and mainline (v4.14-rc4) but it doesn't appear to have changed anything. Hmm. Further searching shows that this probably isn't an IOMMU bug but rather some quirk of the controller itself, probably (?). This bug on Red Hat's bugtracker shows AMD IOMMU faults with the VL805 too. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1409098 > The author Robin Murphy (CC) Also had some recent issues with a VIA VL805 > controller > > https://marc.info/?l=linux-usb=150730678304383=2 Yeah, that looks like the same thing.. Robin, any luck with your VL805 card? Thanks. > -Mathias -- Hao Wei -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
VL805 xHCI DMA read faults
Hi, I've been having DMA read faults with my VL805 xHCI controller when the Intel IOMMU is turned on: xhci_hcd :03:00.0: xHCI Host Controller xhci_hcd :03:00.0: new USB bus registered, assigned bus number 2 DMAR: DRHD: handling fault status reg 3 DMAR: [DMA Read] Request device [03:00.0] fault addr de28a000 [fault reason 01] Present bit in root entry is clear xhci_hcd :03:00.0: can't setup: -110 xhci_hcd :03:00.0: USB bus 2 deregistered xhci_hcd :03:00.0: init :03:00.0 fail, -110 xhci_hcd: probe of :03:00.0 failed with error -110 The controller works fine, as far as I can tell, when the IOMMU is off. I've tracked it down to where CMD_RESET is sent to the controller in xhci_reset, [1] called from xhci_gen_setup in xhci.c. It seems that when the command register is being polled in the xhci_handshake after that, the controller tries to do a DMA read from an address that is apparently invalid (?). Eventually xhci_handshake times out. I've tried setting the XHCI_NO_64BIT_SUPPORT quirks flag as someone suggested in an earlier thread here [2] about a similar/the same(?) device, but that doesn't seem to have worked. Help, please. I have no idea how to debug this further. Some information about the device in question: 03:00.0 USB controller [0c03]: VIA Technologies, Inc. VL805 USB 3.0 Host Controller [1106:3483] (rev 01) (prog-if 30 [XHCI]) Subsystem: Gigabyte Technology Co., Ltd VL805 USB 3.0 Host Controller [1458:5007] Flags: fast devsel, IRQ 17 Memory at f710 (64-bit, non-prefetchable) [size=4K] Capabilities: [80] Power Management version 3 Capabilities: [90] MSI: Enable- Count=1/4 Maskable- 64bit+ Capabilities: [c4] Express Endpoint, MSI 00 Capabilities: [100] Advanced Error Reporting Kernel modules: xhci_pci Thanks. N.B. I sent a message here a while ago mentioning changes in 4.13 that might be the cause, but that is not the case. My distro just turned on the IOMMU by default when they updated to 4.13. Oops. [1]: https://elixir.free-electrons.com/linux/v4.13.5/source/drivers/usb/host/xhci.c#L184 [2]: https://www.spinics.net/lists/linux-usb/msg146591.html -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
DMAR faults with VL805 xHCI controller since 4.13
I've been getting xhci_hcd probe failures since 4.13, seemingly caused by DMAR faults: xhci_hcd :03:00.0: xHCI Host Controller xhci_hcd :03:00.0: new USB bus registered, assigned bus number 2 DMAR: DRHD: handling fault status reg 3 DMAR: [DMA Read] Request device [03:00.0] fault addr de483000 [fault reason 01] Present bit in root entry is clear dmar_fault: 1638936 callbacks suppressed ... dmar_fault: 2334570 callbacks suppressed ... dmar_fault: 2322965 callbacks suppressed ... dmar_fault: 2322989 callbacks suppressed ... xhci_hcd :03:00.0: can't setup: -110 xhci_hcd :03:00.0: USB bus 2 deregistered xhci_hcd :03:00.0: init :03:00.0 fail, -110 xhci_hcd: probe of :03:00.0 failed with error -110 Presumably this has to do with the new dma-mapping things in 4.13. If VT-d is disabled, no more errors occur. I have no idea how to investigate this. I'm not sure if the fault lies in the USB driver or elsewhere. Please forgive me if this ends up being the wrong list. -- Hao Wei -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html