** Description changed:
- This bug is similar to #1832082 (bnx2x driver causes 100% CPU load) but
- applies for qede driver instead of bnx2x. The symptoms are the same:
+ [Impact]
- With chrony installed, and configured with "hwtimestamp *", I observe
- 100% CPU load on 2 CPU cores.
+ * The PTP feature in the qede driver is implemented in such a way that,
+ if the NIC firmware takes some time to perform the timestamping, the PTP
+ worker function reschedules itself indefinitely until the value read
+ from a device register is meaningful. With that behavior, if a userspace
+ tool requests a badly configured TX/RX filter (or if the NIC firmware has
+ any other timestamping issue), the function qede_ptp_task() will
+ reschedule itself forever and cause unbounded resource consumption. This
+ manifests as a kworker thread consuming 100% of a CPU.
- Running perf report shows that kernel is busy executing qede_ptp_task
- function in qede driver.
+ * The dmesg log will show a message like this:
+ "qede_ptp_tx_ts:533(eno3)]Timestamping in progress"
- A workaround is to disable "hwtimestamp *" in chrony configuration.
-
- ---
-
- $ modinfo qede
- filename: /lib/modules/4.15.0-72-generic/kernel/drivers/net/ethernet/qlogic/qede/qede.ko
- version: 8.10.10.21
- license: GPL
- description: QLogic FastLinQ 4xxxx Ethernet Driver
- srcversion: D5EC89D815FC81B973EE9F0
- alias: pci:v00001077d00008090sv*sd*bc*sc*i*
- alias: pci:v00001077d00008070sv*sd*bc*sc*i*
- alias: pci:v00001077d00001664sv*sd*bc*sc*i*
- alias: pci:v00001077d00001656sv*sd*bc*sc*i*
- alias: pci:v00001077d00001654sv*sd*bc*sc*i*
- alias: pci:v00001077d00001644sv*sd*bc*sc*i*
- alias: pci:v00001077d00001636sv*sd*bc*sc*i*
- alias: pci:v00001077d00001666sv*sd*bc*sc*i*
- alias: pci:v00001077d00001634sv*sd*bc*sc*i*
- depends: ptp,qed
- retpoline: Y
- intree: Y
- name: qede
- vermagic: 4.15.0-72-generic SMP mod_unload
- signat: PKCS#7
- signer:
- sig_key:
- sig_hashalgo: md4
- parm: debug: Default debug msglevel (uint)
-
-
- $ uname -a
- Linux dcn1-clm-inf-1 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
-
-
- $ lspci | grep -i ether
- 19:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
- 19:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
- 19:00.2 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
- 19:00.3 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
-
-
- # perf report snippet:
-
- Children Self Command Shared Object
- - 44.76% 0.00% kworker/16:5 [kernel.kallsyms]
+ Also, with perf the user can observe a stack like the following:
+ - 44.76% 0.00% kworker/16:5 [kernel.kallsyms]
ret_from_fork
- kthread
- 44.74% worker_thread
- 44.57% process_one_work
- 42.67% qede_ptp_task
- 38.86% qed_ptp_hw_read_tx_ts
qed_rd
- 3.03% queue_work_on
- 2.06% __queue_work
- 0.68% get_work_pool
- 0.61% radix_tree_lookup
__radix_tree_lookup
0.50% set_work_pool_and_clear_pending
+
+ * The patch proposed in this SRU request refactors the PTP worker in
+ qede by adding a time limit, after which the task no longer reschedules
+ itself and the timestamp procedure fails: 9adebac37e7d ("qede:
+ Handle infinite driver spinning for Tx timestamp.")
+ http://git.kernel.org/linus/9adebac37e7d
+
+ Besides fixing the issue, it also adds an ethtool statistic that
+ accounts for the PTP errors.
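
The time-limit idea described above can be illustrated with a small userspace sketch. This is not the driver code or the upstream patch: every name here (sketch_ptp, sketch_ptp_task, SKETCH_TX_TIMEOUT_TICKS, the simulated tick counter) is hypothetical, and the timeout value is an arbitrary assumption chosen for illustration. It only shows how a self-rescheduling worker stops once a deadline has passed and counts the abandoned timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit in simulated ticks; NOT the value used by the driver. */
#define SKETCH_TX_TIMEOUT_TICKS 5

struct sketch_ptp {
    unsigned long tx_start; /* tick at which the Tx timestamp was requested */
    unsigned long skipped;  /* timestamps abandoned after the time limit */
    bool done;              /* set once a valid timestamp has been read */
};

/* Stand-in for the device register read: reports a valid timestamp only
 * once the simulated firmware is ready. */
static bool sketch_read_tx_ts(unsigned long now, unsigned long ready_at)
{
    return now >= ready_at;
}

/* One invocation of the worker. Returns true if it asks to be rescheduled. */
static bool sketch_ptp_task(struct sketch_ptp *ptp, unsigned long now,
                            unsigned long ready_at)
{
    if (sketch_read_tx_ts(now, ready_at)) {
        ptp->done = true;
        return false;       /* timestamp obtained, nothing left to do */
    }

    if (now - ptp->tx_start >= SKETCH_TX_TIMEOUT_TICKS) {
        ptp->skipped++;     /* account the failure, as a statistic would */
        return false;       /* give up instead of rescheduling forever */
    }

    return true;            /* still within the limit: try again later */
}

/* Drives the worker one tick at a time and returns how often it ran. */
static unsigned long sketch_run(struct sketch_ptp *ptp, unsigned long ready_at)
{
    assert(ptp != NULL);

    unsigned long now = ptp->tx_start;
    unsigned long runs = 1;

    while (sketch_ptp_task(ptp, now, ready_at)) {
        now++;
        runs++;
    }
    return runs;
}
```

Calling sketch_run() with a ready_at value that is never reached ends after a bounded number of invocations and increments the skipped counter. Without the deadline check in sketch_ptp_task(), the same call would loop forever, which is the behavior described in [Impact].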
+
+ [Test case]
+
+ By using chrony in Bionic, the following steps will reproduce the issue:
+
+ a) Install chrony on Bionic in a system with a working NIC managed by qede;
+ b) Edit the chrony configuration and add "hwtimestamp *" to the top of its conf file;
+ c) Restart the chrony service.
+
+ Check dmesg for the "[...]Timestamping in progress" message, and check
+ the overall CPU load with a tool like "top" to observe a kthread
+ consuming 100% of a CPU.
+
+ [Regression potential]
+
+ The patch scope is restricted to the qede PTP handler, and the patch has
+ been upstream for more than 7 months. If there is any possibility of
+ regression, the worst case would be an issue affecting packet
+ timestamping, without touching the regular xmit path of the driver.
--
https://bugs.launchpad.net/bugs/1855409
Title:
qede driver causes 100% CPU load