** Description changed:

- This bug is similar to #1832082 (bnx2x driver causes 100% CPU load) but
- applies for qede driver instead of bnx2x. The symptoms are the same:
+ [Impact]
  
- With chrony installed, and configured with "hwtimestamp *", I observe
- 100% CPU load on 2 CPU cores.
+ * The PTP feature in the qede driver is implemented so that, if the NIC
+ firmware takes some time to perform the timestamping, the PTP worker
+ function reschedules itself until the value read from a device register
+ is meaningful. With that behavior, if a userspace tool requests a
+ misconfigured TX/RX filter (or if the NIC firmware has any other
+ timestamping issue), the function qede_ptp_task() reschedules itself
+ forever and causes unbounded resource consumption. This manifests as a
+ kworker thread consuming 100% of CPU.
  
- Running perf report shows that kernel is busy executing qede_ptp_task
- function in qede driver.
+ * The dmesg log will show a message like this:
+ "qede_ptp_tx_ts:533(eno3)]Timestamping in progress"
  
- A workaround is to disable "hwtimestamp *" in chrony configuration.
- 
- ---
- 
- $ modinfo qede
- filename:       /lib/modules/4.15.0-72-generic/kernel/drivers/net/ethernet/qlogic/qede/qede.ko
- version:        8.10.10.21
- license:        GPL
- description:    QLogic FastLinQ 4xxxx Ethernet Driver
- srcversion:     D5EC89D815FC81B973EE9F0
- alias:          pci:v00001077d00008090sv*sd*bc*sc*i*
- alias:          pci:v00001077d00008070sv*sd*bc*sc*i*
- alias:          pci:v00001077d00001664sv*sd*bc*sc*i*
- alias:          pci:v00001077d00001656sv*sd*bc*sc*i*
- alias:          pci:v00001077d00001654sv*sd*bc*sc*i*
- alias:          pci:v00001077d00001644sv*sd*bc*sc*i*
- alias:          pci:v00001077d00001636sv*sd*bc*sc*i*
- alias:          pci:v00001077d00001666sv*sd*bc*sc*i*
- alias:          pci:v00001077d00001634sv*sd*bc*sc*i*
- depends:        ptp,qed
- retpoline:      Y
- intree:         Y
- name:           qede
- vermagic:       4.15.0-72-generic SMP mod_unload 
- signat:         PKCS#7
- signer:         
- sig_key:        
- sig_hashalgo:   md4
- parm:           debug: Default debug msglevel (uint)
- 
- 
- $ uname -a
- Linux dcn1-clm-inf-1 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- 
- 
- $ lspci | grep -i ether
- 19:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
- 19:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
- 19:00.2 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
- 19:00.3 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
- 
- 
- # perf report snippet:
- 
-   Children      Self  Command          Shared Object
- -   44.76%     0.00%  kworker/16:5     [kernel.kallsyms]
+ Also, using perf, the user can observe a stack like the following:
+ - 44.76% 0.00% kworker/16:5 [kernel.kallsyms]
       ret_from_fork
     - kthread
        - 44.74% worker_thread
           - 44.57% process_one_work
              - 42.67% qede_ptp_task
                 - 38.86% qed_ptp_hw_read_tx_ts
                      qed_rd
                 - 3.03% queue_work_on
                    - 2.06% __queue_work
                       - 0.68% get_work_pool
                          - 0.61% radix_tree_lookup
                               __radix_tree_lookup
                0.50% set_work_pool_and_clear_pending
+ 
+ * The patch proposed in this SRU request refactors the PTP worker in
+ qede by adding a time limit, after which the task no longer reschedules
+ itself and the timestamp procedure fails: 9adebac37e7d ("qede:
+ Handle infinite driver spinning for Tx timestamp.")
+ http://git.kernel.org/linus/9adebac37e7d
+ 
+ Besides fixing the issue, it also adds an ethtool statistic for
+ accounting the PTP errors.
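
The bounded-rescheduling idea can be modelled outside the kernel. The following is a minimal Python sketch, not the driver code: the class name PtpWorker, the run() helper, the "skipped" counter (standing in for the new ethtool statistic) and the 2-second limit are all assumptions made for illustration.

```python
import time

TX_TIMEOUT_S = 2.0  # assumed limit, chosen only for this illustration


class PtpWorker:
    """Toy model of a worker that polls for a Tx timestamp with a deadline."""

    def __init__(self, read_tx_ts, timeout_s=TX_TIMEOUT_S, clock=time.monotonic):
        self.read_tx_ts = read_tx_ts  # callable: returns a timestamp or None
        self.timeout_s = timeout_s
        self.clock = clock
        self.skipped = 0              # hypothetical error counter
        self.start = None

    def start_tx(self):
        # Record when the timestamp request was issued.
        self.start = self.clock()

    def poll_once(self):
        """Return ('done', ts), ('retry', None) or ('failed', None)."""
        ts = self.read_tx_ts()
        if ts is not None:
            return ("done", ts)
        if self.clock() - self.start > self.timeout_s:
            # Deadline passed: give up instead of rescheduling forever.
            self.skipped += 1
            return ("failed", None)
        return ("retry", None)


def run(worker, max_polls=1_000_000):
    """Drive the worker until it stops asking to be rescheduled."""
    worker.start_tx()
    for _ in range(max_polls):
        state, ts = worker.poll_once()
        if state != "retry":
            return state, ts
    return ("retry", None)
```

Without the deadline check, a read_tx_ts that never returns a value would keep the loop in the "retry" state indefinitely, which corresponds to the 100% CPU symptom described above.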
+ 
+ [Test case]
+ 
+ Using chrony on Bionic, the following steps reproduce the issue:
+ 
+ a) Install chrony on Bionic on a system with a working NIC managed by qede;
+ b) Edit the chrony configuration and add "hwtimestamp *" to the top of its conf file;
+ c) Restart the chrony service.
+ 
+ Check dmesg for the "[...]Timestamping in progress" message, and check the
+ overall CPU load using a tool like "top" to observe a kthread
+ consuming 100% of CPU.
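
The two checks can also be scripted. The sketch below is a hedged Python example: the function names and the 90% threshold are assumptions, and it parses text already captured from "dmesg" and from "ps -eo pid,pcpu,comm" rather than running those tools itself. Note that the pcpu column of ps is an average over the process lifetime, so "top" remains the better tool for an instantaneous reading.

```python
import re

# Matches the message quoted in this report, e.g.
# "qede_ptp_tx_ts:533(eno3)]Timestamping in progress"
TS_PATTERN = re.compile(
    r"qede_ptp_tx_ts:\d+\((?P<iface>[^)]+)\)\]Timestamping in progress"
)


def timestamping_ifaces(dmesg_text):
    """Return the set of interface names that logged the message."""
    return {m.group("iface") for m in TS_PATTERN.finditer(dmesg_text)}


def busy_kworkers(ps_text, threshold=90.0):
    """Parse `ps -eo pid,pcpu,comm` output and list kworkers above threshold.

    The threshold is an arbitrary value picked for this example.
    """
    busy = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue
        pid, pcpu, comm = parts
        if comm.startswith("kworker/") and float(pcpu) >= threshold:
            busy.append((int(pid), float(pcpu), comm))
    return busy
```

The text can be captured with, for example, subprocess.run(["ps", "-eo", "pid,pcpu,comm"], capture_output=True, text=True).stdout, and similarly for dmesg (which may require root).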
+ 
+ [Regression potential]
+ 
+ The patch scope is restricted to the qede PTP handler, and the patch has
+ been upstream for more than 7 months. If any regression occurs, the worst
+ case would be an issue affecting packet timestamping, not the regular
+ xmit path of the driver.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1855409

Title:
  qede driver causes 100% CPU load
