** Description changed:

+ [Impact]
+ 
+ The bnxt_en_bpo driver experienced tx timeouts causing the system to
+ experience network stalls and fail to send data and heartbeat packets.
+ 
  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):
  
  * The bnxt_en_po driver froze on a "TX timed out" error
-   and triggered the Netdev Watchdog timer under load.
+   and triggered the Netdev Watchdog timer under load.
  
  * From kernel log:
-   "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
-   See attached kern.log excerpt file for full excerpt of error log.
+   "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
+   See attached kern.log excerpt file for full excerpt of error log.
  
- * Release = Xenial
-   Kernel = 4.4.0-141-generic #167
-   eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
- 
+ * Release = Xenial
+   Kernel = 4.4.0-141-generic #167
+   eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
+ 
  * This caused the driver to reset in order to recover:
- 
-   "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"
- 
-   driver: bnxt_en_bpo
-   version: 1.8.1
-   source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
+ 
+   "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset
+   task!"
+ 
+   driver: bnxt_en_bpo
+   version: 1.8.1
+   source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
  
  * The loss of connectivity and softirq stall caused other failures
-   on the system.
+   on the system.
  
  * The bnxt_en_po driver is the imported Broadcom driver
-   pulled in to support newer Broadcom HW (specific boards)
-   while the bnx_en module continues to support the older
-   HW. The current Linux upstream driver does not compile
-   easily with the 4.4 kernel (too many changes).
+   pulled in to support newer Broadcom HW (specific boards)
+   while the bnx_en module continues to support the older
+   HW. The current Linux upstream driver does not compile
+   easily with the 4.4 kernel (too many changes).
  
  * This upstream and bnxt_en driver fix is a likely solution:
-   "bnxt_en: Fix TX timeout during netpoll"
-   commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
- 
-   This fix has not been applied to the bnxt_en_po driver
-   version, but review of the code indicates that it is
-   susceptible to the bug, and the fix would be reasonable.
+   "bnxt_en: Fix TX timeout during netpoll"
+   commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
  
- * No easy way to reproduce this
+   This fix has not been applied to the bnxt_en_po driver
+   version, but review of the code indicates that it is
+   susceptible to the bug, and the fix would be reasonable.
+ 
+ [Test Case]
+ 
+ * Unfortunately, this is not easy to reproduce. Also, it is only seen on
+ 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
+ driver.
+ 
+ [Regression Potential]
+ 
+ * The patch is restricted to the bpo driver, with very constrained scope
+ - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
+ opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
+ driver).
+ 
+ * The patch is very small and backport is fairly minimal and simple.
+ 
+ * The fix has been running on the in-tree driver in upstream mainline as
+ well as the Ubuntu Linux in-tree driver, although the Broadcom driver
+ has a lot of lower level code that is different, this piece is still the
+ same.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
