Re: [vbox-dev] [Intel-wired-lan] igb Detected Tx Unit Hang after upgrade to 4.18-rc6 [was Re: igb Detected Tx Unit Hang after upgrade to 4.17]

Marco Berizzi Mon, 03 Sep 2018 07:01:24 -0700

Hi Folks,

I'm forwarding this email message from Alexander Duyck
from Intel Corporation.


Please cc also [email protected] if/when you
are going to reply.

Thanks

Marco Berizzi


Il 1 settembre 2018 alle 4.01 Alexander Duyck <[email protected] 
scritto:

On Fri, Aug 31, 2018 at 2:00 AM Marco Berizzi <[email protected]:
>

Il 28 agosto 2018 alle 19.01 Alexander Duyck <[email protected] 
scritto:

Okay, so this looks like a Tx hang of some sort where writeback is not
being triggered.

What we may want do is turn up the level of information we are getting
out of the error. You can do that by running:
ethtool -s eth0 msglvl hw on tx_done on pktdata on

That should cause us to dump the Tx descriptor rings and packet data
when we encounter one of these hangs. It is possible we are looking at
some sort of error in the way the data is being formatted on the
descriptor ring that is resulting int he device stopping on whatever
descriptor is at offset .

Hello Alexander, everyone

Here is the output:

[2712507.323017] igb 0000:08:00.0: Detected Tx Unit Hang
 Tx Queue 
 TDH 
 TDT 
 next_to_use 
 next_to_clean 
 buffer_info[next_to_clean]
 time_stamp 
 next_to_watch 
 jiffies 
 desc.status 



[2712515.514954] igb 0000:08:00.0: TX Rings Summary
[2712515.514954] igb: Queue [NTU] [NTC] [bi(ntc)->dma ] leng ntw timestamp
[2712515.514957] igb: 0 E7 E7 00000000FF71E7C0 0000 (null) 00000001A1A8B476
[2712515.514959] igb: 1 7D 7C 00000000FF6E6C02 002A 00000000c658a9f5 
00000001A1A98A00
[2712515.514961] igb: 2 3C 3C 00000000FF375D40 0000 (null) 00000001A1A951B0
[2712515.514962] igb: 3 15 15 00000000FF709040 0000 (null) 00000001A1A6FB5E
[2712515.514963] igb 0000:08:00.0: TX Rings Dump



[2712515.515457] igb: ------------------------------------
[2712515.515458] igb: TX QUEUE INDEX = 1
[2712515.515458] igb: ------------------------------------
[2712515.515459] igb: T [desc] [address 63:0 ] [PlPOCIStDDM Ln] [bi->dma ] leng 
ntw timestamp bi->skb



[2712515.515677] igb: T [0x078] 00000000FEFCB000 0010C0192B300043 
00000000FEFCB000 0000 (null) 00000001A1A98667 00000000c5ed1096
[2712515.515679] igb: T [0x079] 00000000FD7DBE40 001000192B300040 
00000000FD7DBE40 0000 (null) 00000001A1A98671 000000002e13a21e
[2712515.515681] igb: T [0x07A] 00000000FF930C80 000F00192B30003C 
00000000FF930C80 0000 (null) 00000001A1A98672 000000009976f72e
[2712515.515682] igb: T [0x07B] 00000000FF71CAC0 000F00192B30003C 
00000000FF71CAC0 0000 (null) 00000001A1A98675 000000000339ac31
[2712515.515684] igb: T [0x07C] 00000000FF6E6C02 000A80102B30002A 
00000000FF6E6C02 002A 00000000c658a9f5 00000001A1A98A00 00000000b1f60b47 NTC
[2712515.515687] 00000000d7c2311d: b4 96 91 0e fd 1f 00 19 99 ab 0b 38 08 06 00 
01 ...........8....
[2712515.515687] 00000000922018be: 08 00 06 04 00 01 00 19 99 ab 0b 38 0a 51 6e 
0f ...........8.Qn.
[2712515.515688] 00000000a423f9da: 00 00 00 00 00 00 0a 51 6e fe .......Qn.
[2712515.515690] igb: T [0x07D] 0000000000001C14 0000001020200800 
00000000FEFAB46A 0000 (null) 00000001A1A8883A 0000000047f33a64 NTU
[2712515.515692] igb: T [0x07E] 00000000FEFAB46A 00AA821122300036 
00000000FF542D00 0000 (null) 00000001A1A83F2A 00000000b9702733
[2712515.515694] igb: T [0x07F] 00000000FF1EC508 000000092B300274 
00000000FF1EC508 0000 (null) 00000001A1A83F2E 00000000bd02a543

Okay so looking over the data everything here is correct except for
the fact that NTU and the tail register values don't match. The only
way that should be able to happen is if the xmit_more hint is set on
the packet, which unfortunately we don't dump.

However with that being said it does occur to me that you mentioned
using VirtualBox. I dug back through the dmesg output and found the
line:
[ 20.944982] VBoxNetFlt: attached to 'eth0' / 00:19:99:ab:0b:38

I think there may be an issue with the filter stripping out packets at
the end of a queue and as a result we are receiving packets with
xmit_more set and not updating tail. At least that is my understanding
of things for now as I am not that familiar with the VBoxNetFlt
driver.

There are two ways to verify if this is the case. The first and more
correct solution might be to look at updating
vboxNetFltLinuxStartXmitFilter so that it will clear the xmit_more bit
before handing it off to the lower device. If you don't have access to
that an alternative would be to just modify the igb driver you have so
that you replace "skb->xmit_more" with "true" and just force the tail
to always be written out. It will cause a small performance hit but it
will work around the issue in the driver that overwrote our network
device ops.

Hope that helps.

*   Alex
_______________________________________________
vbox-dev mailing list
[email protected]
https://www.virtualbox.org/mailman/listinfo/vbox-dev

Re: [vbox-dev] [Intel-wired-lan] igb Detected Tx Unit Hang after upgrade to 4.18-rc6 [was Re: igb Detected Tx Unit Hang after upgrade to 4.17]

Reply via email to