[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
This bug was fixed in the package linux - 4.4.0-159.187

---
linux (4.4.0-159.187) xenial; urgency=medium

  * CVE-2019-1125
    - x86/cpufeatures: Carve out CQM features retrieval
    - x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
    - x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
    - x86/speculation: Enable Spectre v1 swapgs mitigations
    - x86/entry/64: Use JMP instead of JMPQ
    - x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS

linux (4.4.0-158.186) xenial; urgency=medium

  * xenial/linux: 4.4.0-158.186 -proposed tracker (LP: #1837609)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] update helper scripts

  * ixgbe{vf} - Physical Function gets IRQ when VF checks link state (LP: #1836760)
    - ixgbevf: Use cached link state instead of re-reading the value for ethtool

  * CVE-2018-5383
    - crypto: kpp - Key-agreement Protocol Primitives API (KPP)
    - crypto: dh - Add DH software implementation
    - crypto: ecdh - Add ECDH software support
    - crypto: ecdh - make ecdh_shared_secret unique
    - crypto: doc - add KPP documentation
    - crypto: kpp, (ec)dh - fix typos
    - crypto: ecc - remove unused function arguments
    - crypto: ecc - remove unnecessary casts
    - crypto: ecc - rename ecdh_make_pub_key()
    - crypto: ecdh - add privkey generation support
    - crypto: ecc - Fix NULL pointer deref. on no default_rng
    - [Config] CRYPTO_ECDH=m
    - Bluetooth: convert smp and selftest to crypto kpp API
    - crypto: ecdh - add public key verification test

  * Xenial update: 4.4.185 upstream stable release (LP: #1836668)
    - fs/binfmt_flat.c: make load_flat_shared_library() work
    - scsi: vmw_pscsi: Fix use-after-free in pvscsi_queue_lck()
    - tracing: Silence GCC 9 array bounds warning
    - gcc-9: silence 'address-of-packed-member' warning
    - usb: chipidea: udc: workaround for endpoint conflict issue
    - Input: uinput - add compat ioctl number translation for UI_*_FF_UPLOAD
    - apparmor: enforce nullbyte at end of tag string
    - parport: Fix mem leak in parport_register_dev_model
    - parisc: Fix compiler warnings in float emulation code
    - IB/hfi1: Insure freeze_work work_struct is canceled on shutdown
    - MIPS: uprobes: remove set but not used variable 'epc'
    - net: hns: Fix loopback test failed at copper ports
    - sparc: perf: fix updated event period in response to PERF_EVENT_IOC_PERIOD
    - scripts/checkstack.pl: Fix arm64 wrong or unknown architecture
    - scsi: ufs: Check that space was properly alloced in copy_query_response
    - s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
    - hwmon: (pmbus/core) Treat parameters as paged if on multiple pages
    - Btrfs: fix race between readahead and device replace/removal
    - btrfs: start readahead also in seed devices
    - can: flexcan: fix timeout when set small bitrate
    - can: purge socket error queue on sock destruct
    - ARM: imx: cpuidle-imx6sx: Restrict the SW2ISO increase to i.MX6SX
    - Bluetooth: Align minimum encryption key size for LE and BR/EDR connections
    - Bluetooth: Fix regression with minimum encryption key size alignment
    - SMB3: retry on STATUS_INSUFFICIENT_RESOURCES instead of failing write
    - cfg80211: fix memory leak of wiphy device name
    - mac80211: drop robust management frames from unknown TA
    - perf ui helpline: Use strlcpy() as a shorter form of strncpy() + explicit set nul
    - perf help: Remove needless use of strncpy()
    - 9p/rdma: do not disconnect on down_interruptible EAGAIN
    - 9p: acl: fix uninitialized iattr access
    - 9p/rdma: remove useless check in cm_event_handler
    - 9p: p9dirent_read: check network-provided name length
    - net/9p: include trans_common.h to fix missing prototype warning.
    - ovl: modify ovl_permission() to do checks on two inodes
    - x86/speculation: Allow guests to use SSBD even if host does not
    - cpu/speculation: Warn on unsupported mitigations= parameter
    - sctp: change to hold sk after auth shkey is created successfully
    - tipc: change to use register_pernet_device
    - tipc: check msg->req data len in tipc_nl_compat_bearer_disable
    - team: Always enable vlan tx offload
    - ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop
    - bonding: Always enable vlan tx offload
    - net: check before dereferencing netdev_ops during busy poll
    - Bluetooth: Fix faulty expression for minimum encryption key size check
    - um: Compile with modern headers
    - ASoC : cs4265 : readable register too low
    - spi: bitbang: Fix NULL pointer dereference in spi_unregister_master
    - ASoC: max98090: remove 24-bit format support if RJ is 0
    - usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i]
    - usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC
    - scsi: hpsa: correct ioaccel2 chaining
    - [...]
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
This bug was fixed in the package linux - 5.0.0-25.26

---
linux (5.0.0-25.26) disco; urgency=medium

  * CVE-2019-1125
    - x86/cpufeatures: Carve out CQM features retrieval
    - x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
    - x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
    - x86/speculation: Enable Spectre v1 swapgs mitigations
    - x86/entry/64: Use JMP instead of JMPQ
    - x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS

 -- Kleber Sacilotto de Souza  Thu, 01 Aug 2019 12:04:35 +0200

** Changed in: linux (Ubuntu Disco)
       Status: Fix Committed => Fix Released

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2019-1125

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1832082

Title:
  bnx2x driver causes 100% CPU load

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Released
Status in linux source package in Eoan:
  Fix Released
Status in linux source package in FF-Series:
  Fix Released

Bug description:
  [Impact]

  * The PTP feature in the bnx2x driver is implemented in such a way that
  if the NIC firmware takes some time to perform the timestamping (which
  is observed as a bad register read in bnx2x_ptp_task()), the PTP worker
  function reschedules itself indefinitely until the value read from the
  register is meaningful. With that behavior, if a userspace tool requests
  a badly configured RX filter from bnx2x (or if the NIC firmware has any
  other timestamping issue), bnx2x_ptp_task() is rescheduled forever and
  causes unbounded resource consumption. This manifests as a kworker
  thread consuming 100% of CPU.

  * The dmesg log shows the following message, which reports that other
  packets are being skipped by the timestamp routine because one packet
  is stuck in the timestamping "pipeline":

  "bnx2x: [bnx2x_start_xmit:3862(eno4)]The device supports only a single
  outstanding packet to timestamp, this packet will not be timestamped"

  Also, with ftrace one can see that bnx2x_ptp_task() is being called
  very frequently, and after enabling the bnx2x PTP debugging log
  (ethtool -s msglvl 16777216) the following message can be observed
  flooding the kernel log:

  "bnx2x: [bnx2x_ptp_task:15242(eno4)]There is no valid Tx timestamp
  yet"

  * The patch proposed in this SRU request has been accepted upstream and
  is currently (2019-07-03) available in David Miller's linux-net tree:
  git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=3c91f25c2f72

  Besides fixing the issue, it also adds an ethtool statistic that counts
  PTP errors and reduces message flooding in case of errors.

  [Test case]

  Reproducing the problem is not difficult; we've used chrony in Bionic
  to trigger it. The steps are:

  a) Install chrony on Bionic on a system with a working NIC managed by
  bnx2x;

  b) Edit the chrony configuration and add "hwtimestamp *" to the top of
  its conf file;

  c) Restart the chrony service.

  Check dmesg for the "[...]single outstanding packet" message, and check
  the overall CPU load with a tool like "top" to observe a kthread
  consuming 100% of CPU.

  [Regression potential]

  The patch scope is restricted to the bnx2x PTP handler, and the patch
  was validated by the driver maintainer. If there is any possibility of
  regression, we believe the worst case would be an issue affecting
  packet timestamping, without touching the driver's regular xmit path.
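The failure mode in [Impact] comes down to a poll loop with no retry budget. The following is only a minimal Python model of the general technique of bounding retries; it is not the upstream patch and not driver code. The names `poll_tx_timestamp` and `make_register`, and the bound of 10 attempts, are assumptions made up for illustration.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 10  # illustrative bound only; not taken from the actual patch


def poll_tx_timestamp(read_register: Callable[[], Optional[int]],
                      max_retries: int = MAX_RETRIES) -> Tuple[Optional[int], int]:
    """Poll until a valid timestamp shows up or the retry budget is spent.

    Returns (timestamp, attempts). A timestamp of None means the budget ran
    out; a caller would then record an error and stop, rather than
    rescheduling itself forever.
    """
    attempts = 0
    while attempts < max_retries:
        attempts += 1
        value = read_register()
        if value is not None:
            return value, attempts
    return None, attempts


def make_register(valid_after: Optional[int], value: int = 1234):
    """Fake register (hypothetical): None until `valid_after` reads, then `value`.

    With valid_after=None the register never becomes valid, which models
    firmware that never delivers a timestamp.
    """
    reads = 0

    def read() -> Optional[int]:
        nonlocal reads
        reads += 1
        if valid_after is not None and reads >= valid_after:
            return value
        return None

    return read
```

With a register that never becomes valid, the unbounded version of this loop would spin indefinitely, which is the 100% CPU kworker described above; the bounded version gives up after `max_retries` reads.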
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1832082/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
Hi, has the fix for Bionic been backported to stable?

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Released
Status in linux source package in FF-Series:
  Fix Released
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
This bug was fixed in the package linux - 5.2.0-10.11

---
linux (5.2.0-10.11) eoan; urgency=medium

  * eoan/linux: 5.2.0-10.11 -proposed tracker (LP: #1838113)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log

  * Eoan update: v5.2.4 upstream stable release (LP: #1838428)
    - bnx2x: Prevent load reordering in tx completion processing
    - caif-hsi: fix possible deadlock in cfhsi_exit_module()
    - hv_netvsc: Fix extra rcu_read_unlock in netvsc_recv_callback()
    - igmp: fix memory leak in igmpv3_del_delrec()
    - ipv4: don't set IPv6 only flags to IPv4 addresses
    - ipv6: rt6_check should return NULL if 'from' is NULL
    - ipv6: Unlink sibling route in case of failure
    - net: bcmgenet: use promisc for unsupported filters
    - net: dsa: mv88e6xxx: wait after reset deactivation
    - net: make skb_dst_force return true when dst is refcounted
    - net: neigh: fix multiple neigh timer scheduling
    - net: openvswitch: fix csum updates for MPLS actions
    - net: phy: sfp: hwmon: Fix scaling of RX power
    - net_sched: unset TCQ_F_CAN_BYPASS when adding filters
    - net: stmmac: Re-work the queue selection for TSO packets
    - net/tls: make sure offload also gets the keys wiped
    - nfc: fix potential illegal memory access
    - r8169: fix issue with confused RX unit after PHY power-down on RTL8411b
    - rxrpc: Fix send on a connected, but unbound socket
    - sctp: fix error handling on stream scheduler initialization
    - sctp: not bind the socket in sctp_connect
    - sky2: Disable MSI on ASUS P6T
    - tcp: be more careful in tcp_fragment()
    - tcp: fix tcp_set_congestion_control() use from bpf hook
    - tcp: Reset bytes_acked and bytes_received when disconnecting
    - vrf: make sure skb->data contains ip header to make routing
    - net/mlx5e: IPoIB, Add error path in mlx5_rdma_setup_rn
    - net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling
    - net: bridge: mcast: fix stale ipv6 hdr pointer when handling v6 query
    - net: bridge: don't cache ether dest pointer on input
    - net: bridge: stp: don't cache eth dest pointer before skb pull
    - macsec: fix use-after-free of skb during RX
    - macsec: fix checksumming after decryption
    - netrom: fix a memory leak in nr_rx_frame()
    - netrom: hold sock when setting skb->destructor
    - selftests: txring_overwrite: fix incorrect test of mmap() return value
    - net/tls: fix poll ignoring partially copied records
    - net/tls: reject offload of TLS 1.3
    - net/mlx5e: Fix port tunnel GRE entropy control
    - net/mlx5e: Rx, Fix checksum calculation for new hardware
    - net/mlx5e: Fix return value from timeout recover function
    - net/mlx5e: Fix error flow in tx reporter diagnose
    - bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips.
    - mlxsw: spectrum_dcb: Configure DSCP map as the last rule is removed
    - net/mlx5: E-Switch, Fix default encap mode
    - mlxsw: spectrum: Do not process learned records with a dummy FID
    - dma-buf: balance refcount inbalance
    - dma-buf: Discard old fence_excl on retrying get_fences_rcu for realloc
    - Revert "gpio/spi: Fix spi-gpio regression on active high CS"
    - gpiolib: of: fix a memory leak in of_gpio_flags_quirks()
    - gpio: davinci: silence error prints in case of EPROBE_DEFER
    - MIPS: lb60: Fix pin mappings
    - perf script: Assume native_arch for pipe mode
    - perf/core: Fix exclusive events' grouping
    - perf/core: Fix race between close() and fork()
    - ext4: don't allow any modifications to an immutable file
    - ext4: enforce the immutable flag on open files
    - mm: add filemap_fdatawait_range_keep_errors()
    - jbd2: introduce jbd2_inode dirty range scoping
    - ext4: use jbd2_inode dirty range scoping
    - ext4: allow directory holes
    - KVM: nVMX: do not use dangling shadow VMCS after guest reset
    - KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested
    - Revert "kvm: x86: Use task structs fpu field for user"
    - sd_zbc: Fix report zones buffer allocation
    - block: Limit zone array allocation size
    - net: sched: verify that q!=NULL before setting q->flags
    - Linux 5.2.4

  * linux hwe i386 kernel 5.0.0-21.22~18.04.1 crashes on Lenovo x220 (LP: #1838115)
    - x86/mm: Check for pfn instead of page in vmalloc_sync_one()
    - x86/mm: Sync also unmappings in vmalloc_sync_all()
    - mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()

  * br_netfilter: namespace sysctl operations (LP: #1836910)
    - netfilter: bridge: port sysctls to use brnf_net
    - netfilter: bridge: namespace bridge netfilter sysctls
    - netfilter: bridge: prevent UAF in brnf_exit_net()

  * Eoan update: v5.2.3 upstream stable release (LP: #1838089)
    - ath10k: Check tx_stats before use it
    - ath10k: htt: don't use txdone_fifo with SDIO
    - ath10k: fix incorrect multicast/broadcast rate setting
    - ath9k: Don't trust TX status TID number
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
I've validated the -proposed kernels for Xenial (4.4.0-158), Bionic
(4.15.0-56) and Disco (5.0.0-22), using the test case mentioned in the
description. All working fine, the issue is gone.

Also, the patch was released upstream in the 5.3.x series, so I'll mark
ff-series as Released.

Cheers,

Guilherme

** Changed in: linux (Ubuntu Ff-series)
       Status: Fix Committed => Fix Released

** Tags removed: verification-needed-bionic verification-needed-disco verification-needed-xenial
** Tags added: verification-done-bionic verification-done-disco verification-done-xenial

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in FF-Series:
  Fix Released
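The test case in the bug description looks for two log signatures in dmesg. A small helper for checking a saved kernel log for them might look like the sketch below. The function name `count_ptp_signatures` and the dictionary keys are hypothetical; only the two message fragments come from the bug description.

```python
import re
from collections import Counter

# Message fragments quoted in the bug description; the key names are made up.
SIGNATURES = {
    "single_outstanding": re.compile(
        r"supports only a single outstanding packet to timestamp"),
    "no_valid_tx_timestamp": re.compile(
        r"There is no valid Tx timestamp yet"),
}


def count_ptp_signatures(log_text: str) -> Counter:
    """Count how many bnx2x log lines match each known PTP signature."""
    counts = Counter({name: 0 for name in SIGNATURES})
    for line in log_text.splitlines():
        if "bnx2x" not in line:
            continue  # skip lines from other drivers
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

On a fixed kernel one would expect both counts to stay at or near zero after restarting chrony, although the exact behavior depends on the NIC firmware.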
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-xenial' to 'verification-done-xenial'. If the
problem still exists, change the tag 'verification-needed-xenial' to
'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!

** Tags added: verification-needed-xenial

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in FF-Series:
  Fix Committed
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-bionic' to 'verification-done-bionic'. If the
problem still exists, change the tag 'verification-needed-bionic' to
'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!

** Tags added: verification-needed-bionic

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in FF-Series:
  Fix Committed
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag
'verification-needed-disco' to 'verification-done-disco'. If the
problem still exists, change the tag 'verification-needed-disco' to
'verification-failed-disco'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!

** Tags added: verification-needed-disco

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Won't Fix
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in FF-Series:
  Fix Committed
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
** Tags added: cscc -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1832082 Title: bnx2x driver causes 100% CPU load Status in linux package in Ubuntu: Fix Committed Status in linux source package in Xenial: Fix Committed Status in linux source package in Bionic: Fix Committed Status in linux source package in Cosmic: Won't Fix Status in linux source package in Disco: Fix Committed Status in linux source package in Eoan: Fix Committed Status in linux source package in FF-Series: Fix Committed
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
** Changed in: linux (Ubuntu Bionic) Status: In Progress => Fix Committed ** Changed in: linux (Ubuntu Disco) Status: In Progress => Fix Committed ** Changed in: linux (Ubuntu Xenial) Status: In Progress => Fix Committed -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1832082 Title: bnx2x driver causes 100% CPU load Status in linux package in Ubuntu: Fix Committed Status in linux source package in Xenial: Fix Committed Status in linux source package in Bionic: Fix Committed Status in linux source package in Cosmic: Won't Fix Status in linux source package in Disco: Fix Committed Status in linux source package in Eoan: Fix Committed Status in linux source package in FF-Series: Fix Committed
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
** Changed in: linux (Ubuntu Cosmic) Status: In Progress => Won't Fix -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1832082 Title: bnx2x driver causes 100% CPU load Status in linux package in Ubuntu: Fix Committed Status in linux source package in Xenial: In Progress Status in linux source package in Bionic: In Progress Status in linux source package in Cosmic: Won't Fix Status in linux source package in Disco: In Progress Status in linux source package in Eoan: Fix Committed Status in linux source package in FF-Series: Fix Committed
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
SRU submitted to the kernel mailing list:
https://lists.ubuntu.com/archives/kernel-team/2019-July/101925.html

I've marked Xenial->Disco as "In Progress", because we need acceptance from the kernel team. On the other hand, the devel series (Eoan/Ff) will get the fix via the regular rebase on Linus' tree, hence I've put "Fix Committed".

Thanks,

Guilherme

** Changed in: linux (Ubuntu Ff-series) Status: Confirmed => Fix Committed
** Changed in: linux (Ubuntu Eoan) Status: Confirmed => Fix Committed
** Changed in: linux (Ubuntu Disco) Status: Confirmed => Fix Committed
** Changed in: linux (Ubuntu Cosmic) Status: Confirmed => In Progress
** Changed in: linux (Ubuntu Eoan) Status: Fix Committed => In Progress
** Changed in: linux (Ubuntu Eoan) Status: In Progress => Fix Committed
** Changed in: linux (Ubuntu Disco) Status: Fix Committed => In Progress
** Changed in: linux (Ubuntu Bionic) Status: Confirmed => In Progress
** Changed in: linux (Ubuntu Xenial) Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1832082

Title:
  bnx2x driver causes 100% CPU load

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in FF-Series:
  Fix Committed
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
** Description changed:

- For the customer OpenStack deployment we deploy infra nodes on Dell R630
- servers. The servers have onboard Broadcom's NetXtreme II BCM57800 NIC
- (quad port: 2x1G ports, 2x10G ports). For each port in UP state, we
- observe 100% CPU load. So in total, we observe 4 CPUs with 100% load.
+ [Impact]

- perf report shows function bnx2x_ptp_task taking up much of the CPUs
- time: https://pastebin.canonical.com/p/kfrpd6Pwh5/
+ * The PTP feature in bnx2x driver is implemented in a way that if the
+ NIC firmware takes some time to perform the timestamping - which is
+ observed as a bad register read in bnx2x_ptp_task() - then the ptp
+ worker function will reschedule itself indefinitely until the value read
+ from the register is meaningful. With that behavior, if an userspace
+ tool request a bad configured RX filter to bnx2x (or if NIC firmware has
+ any other issue in timestamping), the function bnx2x_ptp_task() will be
+ rescheduled forever and cause a unbound resource consumption. This
+ manifests as a kworker thread consuming 100% of CPU.

- Also, /var/log/syslog contains the following outputs every few seconds:
- [1738143.581721] bnx2x: [bnx2x_start_xmit:3855(eno4)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
- [1738176.727642] bnx2x: [bnx2x_start_xmit:3855(eno1)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
- [1738207.988310] bnx2x: [bnx2x_start_xmit:3855(eno3)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
- [1738240.227333] bnx2x: [bnx2x_start_xmit:3855(eno2)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
+ * The dmesg log will show the following message regarding other packets being skipped on timestamp routine due to a packet getting stuck in the timestamping "pipeline":

- So, the problem seems to be in a "timestampped" TX packet; the driver
- for some reason (to be yet understood) get an unexpected value from a
- register and then, it that same function, reschedule itself to try again
- this register read, read gets a bad value again, and so on infinitely.
+ "bnx2x: [bnx2x_start_xmit:3862(eno4)]The device supports only a single
+ outstanding packet to timestamp, this packet will not be timestamped"

- This is showing in the system as the 100% CPU usage kthreads; the
- message "The device supports only a single outstanding packet to
- timestamp, this packet will not be timestamped" happens because the
- driver can only timestamp a single TX packet at a time, and given it's
- stuck trying, it cannot accept another packet in this "queue".
+ Also, by using ftrace user can notice that function bnx2x_ptp_task() is
+ being called a lot, and by enabling bnx2x PTP debugging log (ethtool -s
+ msglvl 16777216) it's possible to observe the following message
+ flooding the kernel log:

- The infinite loop appears to be:
+ "bnx2x: [bnx2x_ptp_task:15242(eno4)]There is no valid Tx timestamp yet"

- static void bnx2x_ptp_task(struct work_struct *work)
- {
-     struct bnx2x *bp = container_of(work, struct bnx2x, ptp_task);
-     int port = BP_PORT(bp);
-     u32 val_seq;
-     u64 timestamp, ns;
-     struct skb_shared_hwtstamps shhwtstamps;
-     /* Read Tx timestamp registers */
-     val_seq = REG_RD(bp, port ? NIG_REG_P1_TLLH_PTP_BUF_SEQID :
-                      NIG_REG_P0_TLLH_PTP_BUF_SEQID);
-     if (val_seq & 0x1) {
-         [...]
-     } else {
-         DP(BNX2X_MSG_PTP, "There is no valid Tx timestamp yet\n");
-         /* Reschedule to keep checking for a valid timestamp value */
-         schedule_work(&bp->ptp_task);
-     }
+ * The patch proposed in this SRU request is accepted upstream and is available currently (2019-07-03) in David Miller's linux-net tree:
+ git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=3c91f25c2f72
+ Besides fixing the issue, it also adds an ethtool statistics for accounting the ptp errors and reduces message flooding in case of errors.

- It appears that val_seq & 0x1 is never true, so the task constantly
- reschedules itself immediately. Instrumenting the function shows that it
- is being called in excess of 100,000 times per second. The REG_RD call
- does appear to be expensive (as it's a register read from the device)
- and shows high in the perf report, but that by itself doesn't appear to
- be the root cause (i.e., it's not hanging forever in the REG_RD).
- The cause appears to be that the driver is not prepared to deal with the
- PTP request never being completed by the hardware. It's unclear why it
- isn't completing, but regardless, the driver should not loop forever
- here.
+ [Test case]
+
+ Reproducing the problem is not difficult; we've used chrony in Bionic to
+ trigger the problem. The steps are:
+
+ a) Install chrony on Bionic in a system with working NIC managed by
+ bnx2x;
+
+ b) Edit chrony configuration and add: "hwtimestamp *" to the top
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
** Attachment added: "system_details.txt"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1832082/+attachment/5272083/+files/system_details.txt

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1832082

Title:
  bnx2x driver causes 100% CPU load

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Confirmed
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Cosmic:
  Confirmed
Status in linux source package in Disco:
  Confirmed
Status in linux source package in Eoan:
  Confirmed
Status in linux source package in FF-Series:
  Confirmed

Bug description:
  For the customer OpenStack deployment we deploy infra nodes on Dell R630 servers. The servers have an onboard Broadcom NetXtreme II BCM57800 NIC (quad port: 2x1G ports, 2x10G ports). For each port in UP state, we observe 100% CPU load. So in total, we observe 4 CPUs with 100% load.

  perf report shows the function bnx2x_ptp_task taking up much of the CPU time: https://pastebin.canonical.com/p/kfrpd6Pwh5/

  Also, /var/log/syslog contains the following output every few seconds:

  [1738143.581721] bnx2x: [bnx2x_start_xmit:3855(eno4)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
  [1738176.727642] bnx2x: [bnx2x_start_xmit:3855(eno1)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
  [1738207.988310] bnx2x: [bnx2x_start_xmit:3855(eno3)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
  [1738240.227333] bnx2x: [bnx2x_start_xmit:3855(eno2)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped

  So, the problem seems to be in a timestamped TX packet: for some reason (yet to be understood) the driver gets an unexpected value from a register and then, in that same function, reschedules itself to retry the register read; the read returns a bad value again, and so on indefinitely.

  This shows up in the system as kthreads with 100% CPU usage. The message "The device supports only a single outstanding packet to timestamp, this packet will not be timestamped" appears because the driver can only timestamp a single TX packet at a time, and given that it's stuck trying, it cannot accept another packet in this "queue".

  The infinite loop appears to be:

    static void bnx2x_ptp_task(struct work_struct *work)
    {
            struct bnx2x *bp = container_of(work, struct bnx2x, ptp_task);
            int port = BP_PORT(bp);
            u32 val_seq;
            u64 timestamp, ns;
            struct skb_shared_hwtstamps shhwtstamps;

            /* Read Tx timestamp registers */
            val_seq = REG_RD(bp, port ? NIG_REG_P1_TLLH_PTP_BUF_SEQID :
                             NIG_REG_P0_TLLH_PTP_BUF_SEQID);
            if (val_seq & 0x1) {
                    [...]
            } else {
                    DP(BNX2X_MSG_PTP, "There is no valid Tx timestamp yet\n");
                    /* Reschedule to keep checking for a valid timestamp value */
                    schedule_work(&bp->ptp_task);
            }

  It appears that val_seq & 0x1 is never true, so the task constantly reschedules itself immediately. Instrumenting the function shows that it is being called in excess of 100,000 times per second. The REG_RD call does appear to be expensive (as it's a register read from the device) and shows high in the perf report, but that by itself doesn't appear to be the root cause (i.e., it's not hanging forever in the REG_RD).

  The cause appears to be that the driver is not prepared to deal with the PTP request never being completed by the hardware. It's unclear why it isn't completing, but regardless, the driver should not loop forever here.
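To make the failure mode concrete, the self-rescheduling pattern described above can be modelled in plain userspace C. This is only a toy model under stated assumptions: struct work, schedule_work_sim, run_workqueue_sim, ptp_dev and ptp_task_sim are hypothetical names invented for the sketch, and the single-item queue is a stand-in for the kernel workqueue, not its real implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy work item: a callback plus a "queued" flag (hypothetical model). */
struct work {
	void (*fn)(struct work *w);
	bool pending;
};

/* Hypothetical stand-in for schedule_work(): mark the item as queued. */
static void schedule_work_sim(struct work *w)
{
	w->pending = true;
}

/*
 * Run the work item until it stops re-queuing itself or until max_runs
 * executions were performed. Returns the number of executions.
 */
static unsigned long run_workqueue_sim(struct work *w, unsigned long max_runs)
{
	unsigned long runs = 0;

	assert(w != NULL && w->fn != NULL);
	while (w->pending && runs < max_runs) {
		w->pending = false;
		w->fn(w);
		runs++;
	}
	return runs;
}

/* Toy device: timestamp_valid plays the role of (val_seq & 0x1). */
struct ptp_dev {
	struct work ptp_task;
	bool timestamp_valid;
};

/* Same shape as the excerpt: no valid timestamp means "queue myself again". */
static void ptp_task_sim(struct work *w)
{
	struct ptp_dev *dev =
		(struct ptp_dev *)((char *)w - offsetof(struct ptp_dev, ptp_task));

	if (!dev->timestamp_valid)
		schedule_work_sim(&dev->ptp_task);
}
```

In this model a device whose timestamp_valid never becomes true uses up the whole max_runs budget and is still pending afterwards, which corresponds to the kworker that never goes idle; a device with a valid timestamp runs the task exactly once.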
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
Thanks Przemyslaw, good explanation on bug's description! I'm dealing with this one, will update status here with news.

Cheers,

Guilherme

** Also affects: linux (Ubuntu Cosmic) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Disco) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Eoan) Importance: Undecided Status: Incomplete
** Also affects: linux (Ubuntu Bionic) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Xenial) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Ff-series) Importance: Undecided Status: New

** Changed in: linux (Ubuntu Eoan) Status: Incomplete => Confirmed
** Changed in: linux (Ubuntu Ff-series) Status: New => Confirmed
** Changed in: linux (Ubuntu Disco) Status: New => Confirmed
** Changed in: linux (Ubuntu Cosmic) Status: New => Confirmed
** Changed in: linux (Ubuntu Bionic) Status: New => Confirmed
** Changed in: linux (Ubuntu Xenial) Status: New => Confirmed

** Changed in: linux (Ubuntu Xenial) Importance: Undecided => High
** Changed in: linux (Ubuntu Bionic) Importance: Undecided => High
** Changed in: linux (Ubuntu Cosmic) Importance: Undecided => High
** Changed in: linux (Ubuntu Disco) Importance: Undecided => High
** Changed in: linux (Ubuntu Eoan) Importance: Undecided => High
** Changed in: linux (Ubuntu Ff-series) Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial) Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)
** Changed in: linux (Ubuntu Bionic) Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)
** Changed in: linux (Ubuntu Cosmic) Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)
** Changed in: linux (Ubuntu Ff-series) Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)
** Changed in: linux (Ubuntu Eoan) Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)
** Changed in: linux (Ubuntu Disco) Assignee: (unassigned) => Guilherme G. Piccoli (gpiccoli)

** Tags removed: bionic
** Tags added: bnx2x sts

** Description changed:

  For the customer OpenStack deployment we deploy infra nodes on Dell R630 servers. The servers have onboard Broadcom's NetXtreme II BCM57800 NIC (quad port: 2x1G ports, 2x10G ports). For each port in UP state, we observe 100% CPU load. So in total, we observe 4 CPUs with 100% load.

  perf report shows function bnx2x_ptp_task taking up much of the CPUs time: https://pastebin.canonical.com/p/kfrpd6Pwh5/

  Also, /var/log/syslog contains the following outputs every few seconds:

- [1738143.581721] bnx2x: [bnx2x_start_xmit:3855(eno4)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
- [1738176.727642] bnx2x: [bnx2x_start_xmit:3855(eno1)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
- [1738207.988310] bnx2x: [bnx2x_start_xmit:3855(eno3)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
- [1738240.227333] bnx2x: [bnx2x_start_xmit:3855(eno2)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
+ [1738143.581721] bnx2x: [bnx2x_start_xmit:3855(eno4)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
+ [1738176.727642] bnx2x: [bnx2x_start_xmit:3855(eno1)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
+ [1738207.988310] bnx2x: [bnx2x_start_xmit:3855(eno3)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped
+ [1738240.227333] bnx2x: [bnx2x_start_xmit:3855(eno2)]The device supports only a single outstanding packet to timestamp, this packet will not be timestamped

  So, the problem seems to be in a "timestampped" TX packet; the driver for some reason (to be yet understood) get an unexpected value from a register and then, it that same function, reschedule itself to try again this register read, read gets a bad value again, and so on infinitely.

  This is showing in the system as the 100% CPU usage kthreads; the message "The device supports only a single outstanding packet to timestamp, this packet will not be timestamped" happens because the driver can only timestamp a single TX packet at a time, and given it's stuck trying, it cannot accept another packet in this "queue".

  The infinite loop appears to be:

- static void bnx2x_ptp_task(struct work_struct *work)
- {
- struct bnx2x *bp = container_of(work, struct bnx2x, ptp_task);
- int port = BP_PORT(bp);
- u32 val_seq;
- u64 timestamp, ns;
- struct skb_shared_hwtstamps shhwtstamps;
+ static void bnx2x_ptp_task(struct work_struct *work)
+ {
+ struct bnx2x *bp = container_of(work, struct bnx2x, ptp_task
[Kernel-packages] [Bug 1832082] Re: bnx2x driver causes 100% CPU load
** Information type changed from Private to Public

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1832082

Title:
  bnx2x driver causes 100% CPU load

Status in linux package in Ubuntu:
  New

Bug description:
  [...]

  Additional info:

  ubuntu@infra-1:~$ uname -a
  Linux infra-1 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Lin

  ubuntu@infra-1:~$ lspci | grep Broadcom
  01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM57800 1/10 Gigabit Ethernet (rev 10)
  01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM57800 1/10 Gigabit Ethernet (rev 10)
  01:00.2 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM57800 1/10 Gigabit Ethernet (rev 10)
  01:00.3 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM57800 1/10 Gigabit Ethernet (rev 10)

  ubuntu@infra-1:~$ lspci -n | grep 01:00
  01:00.0 0200: 14e4:168a (rev 10)
  01:00.1 0200: 14e4:168a (rev 10)
  01:00.2 0200: 14e4:168a (rev 10)
  01:00.3 0200: 14e4:168a (rev 10)

  ubuntu@infra-1:~/deploy$ sudo lshw -c network
    *-network:0
         description: Ethernet interface
         product: NetXtreme II BCM57800 1/10 Gigabit Ethernet
         vendor: Broadcom Inc. and subsidiaries
         physical id: 0
         bus info: pci@:01:00.0
         logical name: eno1
         version: 10
         serial: 42:39:92:e0:66:b6
         size: 10Gbit/s
         capacity: 10Gbit/s
         width: 64 bits
         clock: 33MHz
         capabilities: pm vpd msi msix pciexpress bus_master cap_list rom ethernet physical tp 100bt 100bt-fd 1000bt-fd 1bt-fd autonegotiation
         configuration: autonegotiation=on broadcast=yes driver=bnx2x driverversion=1.712.30-0 duplex=full firmware=FFV14.10.07 bc 7.14.11 phy 1.45 latency=0 link=yes multicast=yes port=twisted pair slave=yes speed=10