[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
Verified on Xenial

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Xenial: Fix Committed
Status in linux source package in Bionic: Fix Released
Status in linux source package in Cosmic: Fix Released
Status in linux source package in Disco: Fix Released

Bug description:

  SRU Justification

  Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically.

  Fix: Fixed by upstream commit in v5.0:
    Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
    "geneve: correctly handle ipv6.disable module parameter"
  Hence available in Disco and later; required in X, B, C.

  Testcase:
  1. Boot with "ipv6.disable=1"
  2. Then try to create a geneve tunnel using:
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z  // ip of the other host

  Regression Potential: Low; this affects only geneve tunnels when ipv6
  is dynamically disabled, and in that case they currently don't work
  at all.

  Other Info:
  * The mainline commit message references a fix for non-metadata
    tunnels (that infrastructure is not yet in our tree prior to
    Disco), so that part is not being included at this time under this
    case. At this time, all geneve tunnels created as above are
    metadata-enabled.

  ---

  [Impact]

  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with Open vSwitch, where ipv6 has been disabled,
  the create fails with the error:

  "ovs-vsctl: Error detected while setting up 'geneve0': could not add
  network device geneve0 to ofproto (Address family not supported by
  protocol)."

  [Fix]

  There is an upstream commit for this in v5.0 mainline (and in Disco
  and later Ubuntu kernels):

  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

  This fix is needed on all our series prior to Disco and the v5.0
  kernel: X, C, B. It is identical to the fix we implemented and
  tested internally, but had not yet pushed upstream.

  [Test Case]

  (Best to do this on a kvm guest VM so as not to interfere with your
  system's networking)

  1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown
     with the 4.15.0-23-generic kernel (which differs slightly from
     4.4.x in symptoms):
     - Edit /etc/default/grub to add the line:
       GRUB_CMDLINE_LINUX="ipv6.disable=1"
     - # update-grub
     - Reboot

  2. Install OVS
     # apt install openvswitch-switch

  3. Create a Geneve tunnel
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z
     (where remote_ip is the IP of the other host)

  You will see the following error message:

  "ovs-vsctl: Error detected while setting up 'geneve1'. See
  ovs-vswitchd log for details."

  From /var/log/openvswitch/ovs-vswitchd.log you will see:

  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed
  to add geneve1 as port: Address family not supported by protocol"

  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.

  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl
  add-port' command completes successfully. You can see that it is
  working properly by adding an IP to br1 and pinging each host.

  On kernel 4.4 (4.4.0-128-generic), the error message doesn't appear
  when using the 'ovs-vsctl add-port' command and no warning is shown
  in ovs-vswitchd.log, but the device genev_sys_6081 is likewise not
  created and the ping test won't work.

  With the fixed test kernel, the interfaces and the tunnel are
  created successfully.

  [Regression Potential]

  * Low -- this affects the geneve driver only, and only when ipv6 is
    disabled; since geneve doesn't work at all in that case, the fix
    simply gets the tunnel up and running for the common case.

  [Other Info]

  * Analysis

  Geneve tunnels should work in either IPv4 or IPv6 environments as a
  design and support principle. The current implementation, however,
  requires ipv6 support for metadata-based tunnels (which geneve
  tunnels are), rather than supporting:

    a) ipv4 + metadata         // whether ipv6 is compiled out or
                               // dynamically disabled
    b) ipv4 + metadata + ipv6

  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open():

    bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
    bool metadata = geneve->collect_md;
    ...
    #if IS_ENABLED(CONFIG_IPV6)
    geneve->sock6 = NULL;
    if (ipv6 || metadata)
            ret = geneve_sock_add(geneve, true);
    #endif
    if (!ret && (!ipv6 || metadata))
            ret = geneve_sock_add(geneve, false);
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
** Changed in: linux (Ubuntu)
   Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed

Bug description:

  [Impact]

  Transmit packet steering (xps) settings don't work when the number
  of queues (cpus) is higher than 64. This is currently still an issue
  on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in
  Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux
  (i.e. Cosmic and Disco have the fix).

  [Fix]

  The following commit fixes this issue (as identified by Lihong Yang
  in discussion with the Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  It requires the following commit as well:

  "i40e: Do not allow use more TC queue pairs than MSI-X vectors exist"
  Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58

  [Test Case]

  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
     i40e driver version: 2.1.14-k
     Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:

     echo > /sys/class/net/eth2/queues/tx-63/xps_cpus
     echo $?
     0
     cat /sys/class/net/eth2/queues/tx-63/xps_cpus
     00,,

     But for any queue number > 63, we see this error:

     echo > /sys/class/net/eth2/queues/tx-64/xps_cpus
     echo: write error: Invalid argument
     cat /sys/class/net/eth2/queues/tx-64/xps_cpus
     cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
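For reference, the xps_cpus files in the test case hold kernel cpumask bitmaps, printed as comma-separated 32-bit hex words with the most significant word first, so systems with more than 64 CPUs need masks wider than two words. A minimal userspace sketch of that format (the helper name cpu_to_xps_mask is ours, for illustration; the real kernel output also trims leading zeros on the top word):

```c
#include <stdio.h>
#include <string.h>

/* Build the comma-separated 32-bit hex-word bitmap string that sysfs
 * cpumask files such as xps_cpus use, with the single bit `cpu` set.
 * Illustrative only: the kernel additionally elides leading zeros on
 * the most significant word. */
void cpu_to_xps_mask(unsigned cpu, unsigned ncpus, char *buf, size_t len)
{
    unsigned nwords = (ncpus + 31) / 32;   /* one hex word per 32 CPUs */
    size_t off = 0;
    for (unsigned i = 0; i < nwords; i++) {
        /* sysfs prints the most significant 32-bit word first */
        unsigned word = nwords - 1 - i;
        unsigned bits = (cpu / 32 == word) ? (1u << (cpu % 32)) : 0u;
        off += snprintf(buf + off, len - off, "%s%08x", i ? "," : "", bits);
    }
}
```

With 72 CPUs, for example, steering to CPU 64 lands in a third 32-bit word ("00000001,00000000,00000000" in this sketch's fully padded form), matching the tx-64 queue range where the driver bug rejects the write.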
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
I have installed and booted this kernel, and ensured no new regression
is introduced, although I cannot reproduce the issue.

** Tags removed: 4.15.0-24-generic cosmic kernel verification-needed-bionic verification-needed-cosmic
** Tags added: verification-done-bionic verification-done-cosmic

** Description changed:

  [Impact]

  The i40e driver can get stalled on tx timeouts. This can happen when
  DCB is enabled on the connected switch.

  This can also trigger a second situation, when a tx timeout occurs
  before the recovery of a previous timeout has completed due to CPU
  load, which is not handled correctly. This leads to networking
  delays, drops, and application timeouts and hangs. Note that the
  first tx timeout cause is just one of the ways to end up in the
  second situation.

  This issue was seen on a heavily loaded Kafka broker node running
  the 4.15.0-38-generic kernel on Xenial.

  Symptoms include messages in the kernel log of the form:
  ---
  [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
  [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6

  With the test kernel provided in this LP bug, which had these two
  commits compiled in, the problem has not been seen again, and the
  node has been running successfully for several months:

  "i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled"
  Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

  "i40e: prevent overlapping tx_timeout recover"
  Commit: d5585b7b6846a6d0f9517afe57be3843150719da

  * The first commit is already in Disco, Cosmic
  * The second commit is already in Disco
  * Bionic needs both patches and Cosmic needs the second

  [Test Case]

  * We are considering the case of both issues above occurring.
  * Seen by the reporter on a Kafka broker node with heavy traffic.
  * Not easy to reproduce, as it requires something like the following
    example environment and heavy load:

    Kernel: 4.15.0-38-generic
    Network driver: i40e
      version: 2.1.14-k
      firmware-version: 6.00 0x800034e6 18.3.6
    NIC: Intel 40Gb XL710
    DCB enabled

  [Regression Potential]

  Low, as the first commit only impacts i40e DCB environments, and the
  fix has been running successfully for several months in
  production-load testing.

  ---
  Original Description

  Today the Ubuntu 16.04 LTS Enablement Stack has moved from kernel
  4.13 to kernel 4.15.0-24-generic. On a "Dell PowerEdge R330" server
  with an "Intel Ethernet Converged Network Adapter X710-DA2" network
  adapter (driver i40e), the network card no longer works and
  permanently displays these three lines:

  [ 98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [ 98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [ 98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Cosmic: Fix Committed
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
We had a patch, discussed above and tested internally with success --
although our testing is limited (opening a geneve tunnel between two
kvm guests). Jiri has now pushed an identical patch upstream, which is
available in the v5.0 kernel and later:

"geneve: correctly handle ipv6.disable module parameter"
Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

Although I do not have testing validation from the original poster,
since it has been committed upstream I'm going to go ahead and get the
SRU request started.

** Changed in: linux (Ubuntu)
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu)
   Importance: Medium => High

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Disco)
   Importance: High
   Status: In Progress

** Changed in: linux (Ubuntu Cosmic)
   Status: New => In Progress

** Changed in: linux (Ubuntu Disco)
   Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Cosmic)
   Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Xenial)
   Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Xenial)
   Status: New => In Progress

** Changed in: linux (Ubuntu Cosmic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

** Description changed:

  [Impact]

  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with Open vSwitch, where ipv6 has been disabled,
  the create fails with the error:

  "ovs-vsctl: Error detected while setting up 'geneve0': could not add
  network device geneve0 to ofproto (Address family not supported by
  protocol)."

  [Fix]

  There is an upstream commit for this in v5.0 mainline:

  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

  This fix is needed on all our series: X, C, B, D

  [Test Case]

  (Best to do this on a kvm guest VM so as not to interfere with your
  system's networking)

  1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown
     with the 4.15.0-23-generic kernel (which differs slightly from
     4.4.x in symptoms):
     - Edit /etc/default/grub to add the line:
       GRUB_CMDLINE_LINUX="ipv6.disable=1"
     - # update-grub
     - Reboot

  2. Install OVS
     # apt install openvswitch-switch

  3. Create a Geneve tunnel
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z
     (where remote_ip is the IP of the other host)

  You will see the following error message:

  "ovs-vsctl: Error detected while setting up 'geneve1'. See
  ovs-vswitchd log for details."

  From /var/log/openvswitch/ovs-vswitchd.log you will see:

  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed
  to add geneve1 as port: Address family not supported by protocol"

  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.

  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl
  add-port' command completes successfully. You can see that it is
  working properly by adding an IP to br1 and pinging each host.

  On kernel 4.4 (4.4.0-128-generic), the error message doesn't appear
  when using the 'ovs-vsctl add-port' command and no warning is shown
  in ovs-vswitchd.log, but the device genev_sys_6081 is also not
  created and the ping test won't work.
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Changed in: linux (Ubuntu Disco)
   Status: In Progress => Fix Released

** Description changed:

  [Impact]

  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with Open vSwitch, where ipv6 has been disabled,
  the create fails with the error:

  "ovs-vsctl: Error detected while setting up 'geneve0': could not add
  network device geneve0 to ofproto (Address family not supported by
  protocol)."

  [Fix]

  There is an upstream commit for this in v5.0 mainline:

  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

  This fix is needed on all our series prior to Disco and the v5.0
  kernel: X, C, B. It is identical to the fix we implemented and
  tested internally, but had not yet pushed upstream.

  [Test Case]

  (Best to do this on a kvm guest VM so as not to interfere with your
  system's networking)

  1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown
     with the 4.15.0-23-generic kernel (which differs slightly from
     4.4.x in symptoms):
     - Edit /etc/default/grub to add the line:
       GRUB_CMDLINE_LINUX="ipv6.disable=1"
     - # update-grub
     - Reboot

  2. Install OVS
     # apt install openvswitch-switch

  3. Create a Geneve tunnel
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z
     (where remote_ip is the IP of the other host)

  You will see the following error message:

  "ovs-vsctl: Error detected while setting up 'geneve1'. See
  ovs-vswitchd log for details."

  From /var/log/openvswitch/ovs-vswitchd.log you will see:

  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed
  to add geneve1 as port: Address family not supported by protocol"

  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.

  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl
  add-port' command completes successfully. You can see that it is
  working properly by adding an IP to br1 and pinging each host.

  On kernel 4.4 (4.4.0-128-generic), the error message doesn't appear
  when using the 'ovs-vsctl add-port' command and no warning is shown
  in ovs-vswitchd.log, but the device genev_sys_6081 is also not
  created and the ping test won't work.

  With the fixed test kernel, the interfaces and the tunnel are
  created successfully.

  [Other Info]

  * Analysis

  Geneve tunnels should work in either IPv4 or IPv6 environments as a
  design and support principle. The current implementation, however,
  requires ipv6 support for metadata-based tunnels (which geneve
  tunnels are), rather than supporting:

    a) ipv4 + metadata         // whether ipv6 is compiled out or
                               // dynamically disabled
    b) ipv4 + metadata + ipv6

  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open():

    bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
    bool metadata = geneve->collect_md;
    ...
    #if IS_ENABLED(CONFIG_IPV6)
    geneve->sock6 = NULL;
    if (ipv6 || metadata)
            ret = geneve_sock_add(geneve, true);
    #endif
    if (!ret && (!ipv6 || metadata))
            ret = geneve_sock_add(geneve, false);

  CONFIG_IPV6 is enabled and IPv6 is disabled at boot, but even though
  ipv6 is false, metadata is always true for a geneve open, as it is
  set unconditionally in ovs. In /lib/dpif_netlink_rtnl.c:

    case OVS_VPORT_TYPE_GENEVE:
        nl_msg_put_flag(&request, IFLA_GENEVE_COLLECT_METADATA);

  The second argument of geneve_sock_add is a boolean indicating
  whether the socket is an ipv6 address-family socket or not, and we
  thus incorrectly pass true rather than false. The current
  "|| metadata" check is unnecessary and incorrectly sends the tunnel
  creation code down the ipv6 path, which subsequently fails when the
  code expects an ipv6 family socket.

  * This issue exists in all versions of the kernel up to the present
    mainline and net-next trees.

  * Testing with a trivial patch to remove that check and make changes
    similar to those made for vxlan (which had the same issue) has
    been successful. Patches for the various versions will be attached
    here soon.

  * Example versions (the bug exists in all versions of Ubuntu and
    mainline):

    $ uname -r
    4.4.0-135-generic
    $ lsb_release -rd
    Description: Ubuntu 16.04.5 LTS
    Release: 16.04
    $ dpkg -l | grep openvswitch-switch
    ii openvswitch-switch 2.5.4-0ubuntu0.16.04.1
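The geneve_open() excerpt in the analysis above boils down to a pair of boolean decisions, and the upstream fix amounts to gating the ipv6 branch on whether ipv6 is actually usable at runtime. A userspace model of that decision (not the literal kernel patch: `ipv6_enabled` stands in for a runtime check in the style of ipv6_mod_enabled(), and `fixed` switches between the old and new behaviour):

```c
#include <stdbool.h>

/* Sketch of the socket-family choice in geneve_open() as described in
 * the analysis: which UDP socket families the tunnel tries to open. */
bool opens_v6_socket(bool remote_is_v6, bool metadata,
                     bool ipv6_enabled, bool fixed)
{
    if (!fixed)
        /* pre-fix: a metadata tunnel always takes the ipv6 path,
         * even when ipv6 is disabled at boot -> EAFNOSUPPORT */
        return remote_is_v6 || metadata;
    /* post-fix: only take the ipv6 path when ipv6 is actually usable */
    return (remote_is_v6 || metadata) && ipv6_enabled;
}

bool opens_v4_socket(bool remote_is_v6, bool metadata)
{
    /* unchanged: an ipv4 socket for ipv4 remotes or metadata tunnels */
    return !remote_is_v6 || metadata;
}
```

An OVS geneve port is metadata-based with an ipv4 remote, so with ipv6.disable=1 the pre-fix logic still demands an ipv6 socket and the open fails, while the fixed logic falls back to the ipv4 socket alone.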
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
*** This bug is a duplicate of bug 1837664 ***
    https://bugs.launchpad.net/bugs/1837664

I'm not sure this bug should be DUP'd to the stable-release bug. Might
confuse the verification and handling triggers, perhaps? Will need to
make sure the fix is tested once the fix is uploaded.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840046

Title:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed

Bug description:

  [impact]
  This message is printed repeatedly in the logs:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

  [test case]
  boot the 4.15.0-58 kernel on s390x

  [regression potential]
  this affects task pud accounting; regressions may be around cleaning
  up task memory.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840046/+subscriptions
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
*** This bug is a duplicate of bug 1837664 ***
    https://bugs.launchpad.net/bugs/1837664

I'll unDUP it unless the kernel team says otherwise in IRC.
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
** This bug is no longer a duplicate of bug 1837664
   Bionic update: upstream stable patchset 2019-07-23
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
I unduped it for test process clarity. Trying to get the relevant
people to test the fix.
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
I'll update here once the kernel is uploaded.
[Kernel-packages] [Bug 1840704] Re: ZFS kernel modules lack debug symbols
** Tags added: sts
** Tags added: linux

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840704

Title:
  ZFS kernel modules lack debug symbols

Status in linux package in Ubuntu: In Progress

Bug description:

  The ZFS kernel modules aren't built with debug symbols, which causes
  problems for debugging and support. Patches are required in:

  1) linux kernel packaging, to add infrastructure to
     enable/build/strip/package debug symbols on DKMS builds. (This is
     sufficient with the zfs-linux now in Eoan.)

  2) zfs-linux and spl-linux, for the stable releases, which need a
     few patches to enable debug symbols (add the option
     './configure --enable-debuginfo' and
     '(ZFS|SPL)_DKMS_ENABLE_DEBUGINFO' to dkms.conf.)

  Initially submitting the kernel patchset for Unstable, for
  review/feedback. It backports nicely into B/D/E, should it be
  accepted; for X (which doesn't use DKMS builds) a simpler patch
  works for the moment (until it does).

  The zfs/spl-linux patches are ready, to be submitted once the
  approach used by the kernel package settles.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840704/+subscriptions
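The dkms.conf hook described in (2) could look roughly like the following sketch. The option and variable names come from the bug text itself; everything else (defaults, structure) is hypothetical until the zfs-linux/spl-linux patches actually land:

```shell
# Hypothetical dkms.conf fragment (zfs-linux); names per the bug text,
# exact accepted form is whatever the packaging patches settle on.
# Debug symbols are opt-in via ZFS_DKMS_ENABLE_DEBUGINFO.
ZFS_DKMS_ENABLE_DEBUGINFO="${ZFS_DKMS_ENABLE_DEBUGINFO:-no}"
CONFIGURE_FLAGS=""
if [ "${ZFS_DKMS_ENABLE_DEBUGINFO}" = "yes" ]; then
    # passed through to the module build so objects keep debug info
    CONFIGURE_FLAGS="--enable-debuginfo"
fi
MAKE[0]="./configure ${CONFIGURE_FLAGS} && make"
```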
[Kernel-packages] [Bug 1840789] Re: bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled
** Tags added: sts

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial)
   Importance: High => Critical

** Changed in: linux (Ubuntu Bionic)
   Importance: High => Critical

** Changed in: linux (Ubuntu Disco)
   Importance: Undecided => Critical

** Changed in: linux (Ubuntu Eoan)
   Importance: Undecided => Critical

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840789

Title:
  bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Xenial: In Progress
Status in linux source package in Bionic: In Progress
Status in linux source package in Disco: In Progress
Status in linux source package in Eoan: Fix Released

Bug description:

  [Impact]

  * The bnx2x driver may cause hardware faults (leading to
    panic/reboot) and other behaviors such as transmit timeouts, after
    commit 3968d38917eb ("bnx2x: Fix Multi-Cos.") was introduced.

  * This issue has been observed by a user shortly after starting
    docker & kubelet, with these adapters:
    - Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
    - Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]

  * If options to ignore hardware faults are used (erst_disable=1
    hest_disable=1 ghes.disable=1), the system doesn't panic/reboot
    and continues on to time out on adapter stats, then on transmit,
    spewing some adapter firmware dumps, but the network interface is
    non-functional.

  * The issue only happened when LLDP is enabled on the network
    switches, and a crashdump shows the bnx2x driver stuck waiting for
    the firmware to complete the stop-traffic command in LLDP
    handling. The workaround used is to disable LLDP on the network
    switches/ports.

  * Analysis of the driver and firmware dumps didn't help
    significantly towards finding the root cause.

  * Upstream/mainline recently just reverted the patch, due to similar
    problem reports, while looking for the root cause/proper fix.

  [Test Case]

  * No reproducible test case was found outside the user's
    systems/cluster, where it is enough to start docker & kubelet and
    wait.

  * The user verified test kernels for Xenial and Bionic -- the
    problem does not happen; build-tested on Disco.

  [Regression Potential]

  * Users who significantly use/apply non-default traffic classes (tc)
    / classes of service (cos) might possibly see performance changes
    (if any at all) in such applications, but that is unclear now.

  * This is a recent revert upstream (v5.3-rc'ish), so there's a
    chance things might change in this area.

  * Nonetheless, the patch was authored by the driver vendor, and made
    its way into stable kernels (e.g., v5.2.8, which made it into
    Eoan/19.10 recently).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840789/+subscriptions
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
A 4.4 test kernel with the fix backported is available at: https://people.canonical.com/~nivedita/geneve-xenial-test/ if anyone wishes to validate the 4.4 X solution. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1794232 Title: Geneve tunnels don't work when ipv6 is disabled Status in linux package in Ubuntu: Fix Released Status in linux source package in Xenial: In Progress Status in linux source package in Bionic: Fix Committed Status in linux source package in Cosmic: Fix Committed Status in linux source package in Disco: Fix Released Bug description: SRU Justification Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically. Fix: Fixed by upstream commit in v5.0: Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 "geneve: correctly handle ipv6.disable module parameter" Hence available in Disco and later; required in X,B,C. Testcase: 1. Boot with "ipv6.disable=1" 2. Then try and create a geneve tunnel using: # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z // ip of the other host Regression Potential: Low, only geneve tunnels when ipv6 dynamically disabled, current status is it doesn't work at all. Other Info: * Mainline commit msg includes reference to a fix for non-metadata tunnels (infrastructure is not yet in our tree prior to Disco), hence not being included at this time under this case. At this time, all geneve tunnels created as above are metadata-enabled. --- [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." 
[Fix] There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels): "geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested with internally, but had not yet pushed upstream. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." In /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't appear when using the 'ovs-vsctl add-port' command and no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is likewise not created and the ping test won't work. With the fixed test kernel, the interfaces and the tunnel are created successfully.
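As a convenience when scripting the test case above, the "booted with ipv6.disable=1" precondition can be checked programmatically. This is a minimal sketch, not part of the bug report; the helper name and the use of /proc/cmdline are our own:

```python
def ipv6_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the kernel command line contains ipv6.disable=1.

    `cmdline` is the space-separated parameter string found in
    /proc/cmdline on a running system.
    """
    return "ipv6.disable=1" in cmdline.split()

# On a live Linux system:
#   ipv6_disabled_on_cmdline(open("/proc/cmdline").read())
```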
[Regression Potential] * Low -- this affects the geneve driver only, and only when ipv6 is disabled; since geneve doesn't work at all in that case, this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work in either IPv4 or IPv6 environments, as a design and support principle. Currently, however, the implementation requires IPv6 support for metadata-based tunnels (which these geneve tunnels are), rather than allowing: a) ipv4 + metadata // whether ipv6 is compiled out or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open():

  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...
  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
          ret = geneve_sock_add(geneve, true);
  #endif
  if (!ret && (!ipv6 || metadata))
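To make the geneve_open() logic above easier to follow, here is a rough Python model of the socket-selection behavior before and after the fix. This is our own sketch, not kernel code; the `fixed` flag encodes our reading of commit cf1c9ccba730 (the IPv6 socket failure is tolerated for metadata tunnels unless the tunnel is explicitly IPv6):

```python
AF_INET, AF_INET6 = 2, 10  # socket address family constants (Linux values)

def geneve_open_model(remote_family, collect_md, ipv6_available, fixed):
    """Return True if the tunnel device would come up.

    Mirrors the structure quoted above: an IPv6 socket is attempted
    whenever the tunnel is IPv6 or metadata-based; an IPv4 socket is
    attempted whenever the tunnel is IPv4 or metadata-based.
    """
    ipv6 = remote_family == AF_INET6
    metadata = collect_md
    if ipv6 or metadata:
        if not ipv6_available:       # geneve_sock_add(geneve, true) fails
            if not fixed or ipv6:    # unfixed: EAFNOSUPPORT aborts the open
                return False
    if (not ipv6) or metadata:
        return True                  # IPv4 socket succeeds
    return True                      # explicit IPv6 tunnel, sock6 was OK

# The reported failure: a metadata tunnel with IPv6 dynamically disabled.
# Unfixed kernels abort; fixed kernels fall back to IPv4.
```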
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
Late update, but the original reporter did test the proposed kernel on systems able to reproduce the problem, and the tests were successful. We do not yet have a way of reproducing this on Xenial (i.e., any 4.4 kernel). I'm still leaving this open as an issue; we will keep trying to reproduce it, and once we can confirm/test, we will update and push an SRU for Xenial as well. -- https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e., Cosmic and Disco have the fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with the Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f It requires the following commit as well: "i40e: Do not allow use more TC queue pairs than MSI-X vectors exist" Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58 [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $?
0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions
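For reference when working with xps_cpus as in the test case above, the sysfs value is a CPU bitmap rendered as comma-separated 32-bit hexadecimal groups, most significant group first. Below is a small helper to build such a mask; the function name is ours, and this is a sketch rather than Intel or kernel code:

```python
def cpus_to_sysfs_mask(cpus, ncpus):
    """Render a set of CPU numbers in the sysfs bitmap format that
    /sys/class/net/<dev>/queues/tx-<n>/xps_cpus expects:
    comma-separated 32-bit hex groups, most significant group first."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < ncpus:
            raise ValueError(f"cpu {cpu} out of range 0..{ncpus - 1}")
        mask |= 1 << cpu
    ngroups = max(1, (ncpus + 31) // 32)  # one group per 32 CPUs
    return ",".join(
        format((mask >> (32 * g)) & 0xFFFFFFFF, "08x")
        for g in range(ngroups - 1, -1, -1)
    )

# e.g. steering a queue to CPU 64 on a 96-CPU system:
#   cpus_to_sysfs_mask([64], 96) -> "00000001,00000000,00000000"
```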
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
Bionic, Cosmic kernels successfully tested. I've updated the tags. ** Tags removed: verification-needed-bionic verification-needed-cosmic ** Tags added: verification-done-bionic verification-done-cosmic -- https://bugs.launchpad.net/bugs/1794232 Title: Geneve tunnels don't work when ipv6 is disabled Status in linux package in Ubuntu: Fix Released Status in linux source package in Xenial: In Progress Status in linux source package in Bionic: Fix Committed Status in linux source package in Cosmic: Fix Committed Status in linux source package in Disco: Fix Released
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Tags added: sts -- https://bugs.launchpad.net/bugs/1794232 Title: Geneve tunnels don't work when ipv6 is disabled Status in linux package in Ubuntu: Fix Released Status in linux source package in Xenial: In Progress Status in linux source package in Bionic: Fix Committed Status in linux source package in Cosmic: Fix Committed Status in linux source package in Disco: Fix Released
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
** Tags added: sts ** Tags removed: verification-needed-bionic ** Tags added: verification-done-bionic verification-done-cosmic ** Tags removed: verification-done-cosmic -- https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
** Tags added: sts -- https://bugs.launchpad.net/bugs/1779756 Title: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04) Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released Status in linux source package in Cosmic: Fix Released Bug description: [Impact] The i40e driver can get stalled on tx timeouts. This can happen when DCB is enabled on the connected switch. It can also trigger a second situation, when a tx timeout occurs before the recovery of a previous timeout has completed due to CPU load, which is not handled correctly. This leads to networking delays, drops, and application timeouts and hangs. Note that the first tx timeout cause is just one of the ways to end up in the second situation. This issue was seen on a heavily loaded Kafka broker node running the 4.15.0-38-generic kernel on Xenial. Symptoms include messages in the kernel log of the form: --- [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0 [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6 With the test kernel provided in this LP bug, which had the following two commits compiled in, the problem has not been seen again; the kernel has been running successfully for several months: "i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled" Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee "i40e: prevent overlapping tx_timeout recover" Commit: d5585b7b6846a6d0f9517afe57be3843150719da * The first commit is already in Disco, Cosmic * The second commit is already in Disco * Bionic needs both patches and Cosmic needs the second [Test Case] * We are considering the case of both issues above occurring. * Seen by the reporter on a Kafka broker node with heavy traffic. * Not easy to reproduce, as it requires something like the following example environment and heavy load: Kernel: 4.15.0-38-generic Network driver: i40e version: 2.1.14-k firmware-version: 6.00 0x800034e6 18.3.6 NIC: Intel 40Gb XL710 DCB enabled [Regression Potential] Low, as the first patch only impacts i40e DCB environments, and both have been running for several months in production-load testing successfully. --- Original Description Today the Ubuntu 16.04 LTS Enablement Stack has moved from kernel 4.13 to kernel 4.15.0-24-generic. On a "Dell PowerEdge R330" server with an "Intel Ethernet Converged Network Adapter X710-DA2" network adapter (driver i40e), the network card no longer works and permanently displays these three lines: [ 98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1 [ 98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8 [ 98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions
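When triaging logs for reports like the two i40e bugs above, the tx_timeout lines can be parsed mechanically. This is a hedged sketch of our own; the regex is written against the messages quoted in this bug, not against the driver source:

```python
import re

# Example input (PCI address normally appears as e.g. "0000:18:00.1"):
#   i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, ...
TX_TIMEOUT_RE = re.compile(
    r"i40e\s+\S*\s+(?P<iface>\S+): tx_timeout: VSI_seid: (?P<seid>\d+), "
    r"Q (?P<queue>\d+)"
)

def parse_tx_timeout(line):
    """Return (interface, vsi_seid, queue) for an i40e tx_timeout log
    line, or None if the line does not match."""
    m = TX_TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("iface"), int(m.group("seid")), int(m.group("queue"))
```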
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Tags added: sts -- https://bugs.launchpad.net/bugs/1814095 Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer Status in linux package in Ubuntu: Confirmed Status in linux source package in Xenial: Fix Released Bug description: [Impact] The bnxt_en_bpo driver experienced tx timeouts, causing the system to experience network stalls and fail to send data and heartbeat packets. The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network traffic (just once): * The bnxt_en_bpo driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load. * From the kernel log: "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out" See the attached kern.log excerpt file for the full error log. * Release = Xenial Kernel = 4.4.0-141-generic #167 eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet * This caused the driver to reset in order to recover: "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!" driver: bnxt_en_bpo version: 1.8.1 source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout() * The loss of connectivity and softirq stall caused other failures on the system. * The bnxt_en_bpo driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards), while the bnxt_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes). * This upstream bnxt_en driver fix is a likely solution: "bnxt_en: Fix TX timeout during netpoll" commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906 This fix has not been applied to the bnxt_en_bpo driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable. [Test Case] * Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver. [Regression Potential] * The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which have the fixed in-tree driver). * The patch is very small, and the backport is fairly minimal and simple. * The fix has been running in the in-tree driver in upstream mainline as well as in the Ubuntu in-tree driver; although the Broadcom driver has a lot of lower-level code that is different, this piece is still the same. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
** Description changed: - Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13 - to the Kernel 4.15.0-24-generic. + [Impact] + The i40e driver can get stalled on tx timeouts. This can happen when + DCB is enabled on the connected switch. This can also trigger a + second situation when a tx timeout occurs before the recovery of + a previous timeout has completed due to CPU load, which is not + handled correctly. This leads to networking delays, drops and + application timeouts and hangs. Note that the first tx timeout + cause is just one of the ways to end up in the second situation. + + This issue was seen on a heavily loaded Kafka broker node running + the 4.15.0-38-generic kernel on Xenial. + + Symptoms include messages in the kernel log of the form: + + --- + [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0 + [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6 + + + With the test kernel provided in this LP bug which had these + two commits compiled in, the problem has not been seen again, + and has been running successfully for several months: + + "i40e: Fix for Tx timeouts when interface is brought up if + DCB is enabled" + Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee + + "i40e: prevent overlapping tx_timeout recover" + Commit: d5585b7b6846a6d0f9517afe57be3843150719da + + * The first commit is already in Disco, Cosmic + * The second commit is already in Disco + * Bionic needs both patches and Cosmic needs the second + + [Test Case] + * We are considering the case of both issues above occurring. + * Seen by reporter on a Kafka broker node with heavy traffic. 
+ * Not easy to reproduce as it requires something like the + following example environment and heavy load: + + Kernel: 4.15.0-38-generic + Network driver: i40e + version: 2.1.14-k + firmware-version: 6.00 0x800034e6 18.3.6 + NIC: Intel 40Gb XL710 + DCB enabled + + + [Regression Potential] + Low, as the first only impacts i40e DCB environment, and has + been running for several months in production-load testing + successfully. + + + --- Original Description + Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13 to the Kernel 4.15.0-24-generic. On a "Dell PowerEdge R330" server with a network adapter "Intel Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network card no longer works and permanently displays these three lines : - [ 98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1 [ 98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8 [ 98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1779756 Title: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04) Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: In Progress Status in linux source package in Cosmic: In Progress Bug description: [Impact] The i40e driver can get stalled on tx timeouts. This can happen when DCB is enabled on the connected switch. This can also trigger a second situation when a tx timeout occurs before the recovery of a previous timeout has completed due to CPU load, which is not handled correctly. This leads to networking delays, drops and application timeouts and hangs. Note that the first tx timeout cause is just one of the ways to end up in the second situation. 
This issue was seen on a heavily loaded Kafka broker node running the 4.15.0-38-generic kernel on Xenial. Symptoms include messages in the kernel log of the form: --- [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0 [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6 With the test kernel provided in this LP bug which had these two commits compiled in, the problem has not been seen again, and has been running successfully for several months: "i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled" Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee "i40e: prevent overlapping tx_timeout recover" Commit: d5585b7b6846a6d0f9517afe57be3843150719da * The first commit is already in Disco, Cosmic * The second commit is already in Disco * Bionic needs both patches and Cosmic needs the second [Test Case] * We are considering the case of both issues above occurring. * Seen by reporter on a Kafka broker node with heavy traffic. * Not easy to reproduce as it requires something like the following example environment and heavy load: Kernel:
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
** Tags added: bionic cosmic -- https://bugs.launchpad.net/bugs/1779756 Title: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04) Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: In Progress Status in linux source package in Cosmic: In Progress
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
Submitted SRU request -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1779756 Title: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04) Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: In Progress Status in linux source package in Cosmic: In Progress Bug description: [Impact] The i40e driver can get stalled on tx timeouts. This can happen when DCB is enabled on the connected switch. This can also trigger a second situation when a tx timeout occurs before the recovery of a previous timeout has completed due to CPU load, which is not handled correctly. This leads to networking delays, drops and application timeouts and hangs. Note that the first tx timeout cause is just one of the ways to end up in the second situation. This issue was seen on a heavily loaded Kafka broker node running the 4.15.0-38-generic kernel on Xenial. Symptoms include messages in the kernel log of the form: --- [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0 [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6 With the test kernel provided in this LP bug which had these two commits compiled in, the problem has not been seen again, and has been running successfully for several months: "i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled" Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee "i40e: prevent overlapping tx_timeout recover" Commit: d5585b7b6846a6d0f9517afe57be3843150719da * The first commit is already in Disco, Cosmic * The second commit is already in Disco * Bionic needs both patches and Cosmic needs the second [Test Case] * We are considering the case of both issues above occurring. * Seen by reporter on a Kafka broker node with heavy traffic. 
* Not easy to reproduce as it requires something like the following example environment and heavy load: Kernel: 4.15.0-38-generic Network driver: i40e version: 2.1.14-k firmware-version: 6.00 0x800034e6 18.3.6 NIC: Intel 40Gb XL710 DCB enabled [Regression Potential] Low, as the first only impacts i40e DCB environment, and has been running for several months in production-load testing successfully. --- Original Description Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13 to the Kernel 4.15.0-24-generic. On a "Dell PowerEdge R330" server with a network adapter "Intel Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network card no longer works and permanently displays these three lines : [ 98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1 [ 98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8 [ 98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
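The tx_timeout log lines quoted above carry the queue number and ring pointers (NTC/HWB/NTU/TAIL) in a fixed format. As an illustrative aid (not part of the SRU), a small shell helper can pull those fields out so repeated occurrences can be compared; the helper name and field parsing are assumptions based only on the log format shown above.

```shell
#!/bin/sh
# Illustrative helper (not from the bug report): extract the queue number and
# ring pointers from an i40e tx_timeout log line, using the driver's own
# "Q <n>, NTC: <hex>, HWB: <hex>, ..." format quoted above.
parse_tx_timeout() {
    line=$1
    q=$(printf '%s\n' "$line"   | sed -n 's/.*Q \([0-9]*\),.*/\1/p')
    ntc=$(printf '%s\n' "$line" | sed -n 's/.*NTC: \(0x[0-9a-fA-F]*\),.*/\1/p')
    hwb=$(printf '%s\n' "$line" | sed -n 's/.*HWB: \(0x[0-9a-fA-F]*\),.*/\1/p')
    printf 'queue=%s ntc=%s hwb=%s\n' "$q" "$ntc" "$hwb"
}

parse_tx_timeout 'i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0'
# prints: queue=6 ntc=0x1a0 hwb=0x66
```

Comparing these pointer values across successive timeouts (e.g. whether NTC keeps lagging the head-write-back value) is one way to see whether the queue stayed hung through the recovery attempts described above.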
[Kernel-packages] [Bug 1820948] [NEW] i40e xps management broken when > 64 queues/cpus
Public bug reported: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument ** Affects: linux (Ubuntu) Importance: High Status: Confirmed ** Affects: linux (Ubuntu Bionic) Importance: High Assignee: Nivedita Singhvi (niveditasinghvi) Status: Confirmed ** Tags: bionic ** Also affects: linux (Ubuntu Bionic) Importance: Undecided Status: New ** Changed in: linux (Ubuntu) Status: New => Confirmed ** Changed in: linux (Ubuntu Bionic) Status: New => Confirmed ** Changed in: linux (Ubuntu) Importance: Undecided => High ** Changed in: linux (Ubuntu Bionic) Importance: Undecided => High ** Changed in: linux (Ubuntu Bionic) Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. 
https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: Confirmed Status in linux source package in Bionic: Confirmed Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
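The xps_cpus files written in the test case above take a hexadecimal CPU bitmask, formatted as comma-separated 32-bit words with the most significant word first; systems with more than 64 CPUs therefore need three or more words. A minimal sketch of how such a mask is built (the helper name is an assumption, not from the bug report):

```shell
#!/bin/sh
# Illustrative: build the hex CPU bitmask that a write to
# /sys/class/net/<dev>/queues/tx-<N>/xps_cpus expects, pinning a single CPU.
# Masks are comma-separated 32-bit words, most significant word first.
cpu_to_xps_mask() {
    cpu=$1
    word=$((cpu / 32))              # which 32-bit word holds this CPU's bit
    bit=$((cpu % 32))
    mask=$(printf '%x' $((1 << bit)))
    while [ "$word" -gt 0 ]; do     # pad the lower words with zeroes
        mask="$mask,00000000"
        word=$((word - 1))
    done
    printf '%s\n' "$mask"
}

cpu_to_xps_mask 6    # prints: 40
cpu_to_xps_mask 65   # prints: 2,00000000,00000000
```

On an unfixed 4.15 kernel, writing such a mask to tx-64 or higher (e.g. `echo "$(cpu_to_xps_mask 65)" > /sys/class/net/eth2/queues/tx-64/xps_cpus`) fails with "Invalid argument" regardless of the mask's contents, which is the symptom shown above.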
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
It's been reported by an external reporter and reproduced internally. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: Confirmed Status in linux source package in Bionic: Confirmed Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Just briefly wanted to say that this is one we've discussed at length -- we may not be able to get someone who has the right NIC to test with it in time. I'm sanity checking the kernel, but that is not exercising the key change here. If we could assume verification-done for our purposes here, that might be needed. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1814095 Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer Status in linux package in Ubuntu: Confirmed Status in linux source package in Xenial: Fix Committed Bug description: [Impact] The bnxt_en_bpo driver experienced tx timeouts causing the system to experience network stalls and fail to send data and heartbeat packets. The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network traffic (just once): * The bnxt_en_po driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load. * From kernel log: "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out" See attached kern.log excerpt file for full excerpt of error log. * Release = Xenial Kernel = 4.4.0-141-generic #167 eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet * This caused the driver to reset in order to recover: "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!" driver: bnxt_en_bpo version: 1.8.1 source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout() * The loss of connectivity and softirq stall caused other failures on the system. * The bnxt_en_po driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards) while the bnx_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes). 
* This upstream and bnxt_en driver fix is a likely solution: "bnxt_en: Fix TX timeout during netpoll" commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906 This fix has not been applied to the bnxt_en_po driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable. [Test Case] * Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver. [Regression Potential] * The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed driver). * The patch is very small and backport is fairly minimal and simple. * The fix has been running on the in-tree driver in upstream mainline as well as the Ubuntu Linux in-tree driver, although the Broadcom driver has a lot of lower level code that is different, this piece is still the same. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
Will be submitting SRU request early next week; trying to get it into this next kernel release cycle. ** Changed in: linux (Ubuntu) Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi) ** Changed in: linux (Ubuntu Bionic) Status: Confirmed => In Progress ** Changed in: linux (Ubuntu) Status: Confirmed => In Progress -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: In Progress Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 
0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
I am not sure we could deterministically provoke the issue. At the very least, to ensure no other regression was introduced, I would run it under heavy network load. The environment in question which saw the issue had network load, contention for cpus and several other issues occur. The basic environment is:

1. For any 25Gb NIC/chipset that requires the 4.4 bnxt_en_bpo driver, set its 2 ports/interfaces up in bonding mode as follows:

   bond-lacp-rate fast
   bond-master bond0
   bond-miimon 100
   bond-mode 802.3ad
   bond-xmit-hash-policy layer3+4
   mtu 9000

2. Run any heavy TCP network load test over the systems (e.g. iperf, netperf, file transfer, etc.)

3. Theoretically, it would appear that if the number of tx ring descriptors were lower, then that would be more likely to hit this (not successfully proven by testing here), but you can lower it and see if that helps:

   # ethtool -G eno49 tx 128   // for example

I am not sure if that helps, Scott. I'll try to work up more specific steps but I cannot guarantee you will see the issue. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1814095 Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer Status in linux package in Ubuntu: Confirmed Status in linux source package in Xenial: Fix Committed Bug description: [Impact] The bnxt_en_bpo driver experienced tx timeouts causing the system to experience network stalls and fail to send data and heartbeat packets. The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network traffic (just once): * The bnxt_en_po driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load. * From kernel log: "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out" See attached kern.log excerpt file for full excerpt of error log. 
* Release = Xenial Kernel = 4.4.0-141-generic #167 eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet * This caused the driver to reset in order to recover: "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!" driver: bnxt_en_bpo version: 1.8.1 source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout() * The loss of connectivity and softirq stall caused other failures on the system. * The bnxt_en_po driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards) while the bnx_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes). * This upstream and bnxt_en driver fix is a likely solution: "bnxt_en: Fix TX timeout during netpoll" commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906 This fix has not been applied to the bnxt_en_po driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable. [Test Case] * Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver. [Regression Potential] * The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed driver). * The patch is very small and backport is fairly minimal and simple. * The fix has been running on the in-tree driver in upstream mainline as well as the Ubuntu Linux in-tree driver, although the Broadcom driver has a lot of lower level code that is different, this piece is still the same. 
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
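For concreteness, the bonding options listed in the reproduction steps above could be written as an /etc/network/interfaces fragment like the following; the interface names (eno49/eno50), bond0, and the address are placeholders assumed for illustration, not taken from the report.

```
# Hypothetical /etc/network/interfaces fragment reconstructing the bonding
# setup from step 1 of the reproduction environment above.
auto eno49
iface eno49 inet manual
    bond-master bond0

auto eno50
iface eno50 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    address 192.168.10.2
    netmask 255.255.255.0
    bond-slaves eno49 eno50
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4
    mtu 9000
```

With both slave interfaces enslaved in 802.3ad mode, any heavy TCP load test (iperf, netperf, bulk file transfer) over bond0 matches step 2 of the environment described above.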
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
Submitted patches for SRU. ** Description changed: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe - and Bionic kernels). + and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f + It requires the following commit as well: + + i40e: Do not allow use more TC queue pairs than MSI-X vectors exist + Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58 + [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel -i40e driver version: 2.1.14-k -Any system with > 64 CPUs + i40e driver version: 2.1.14-k + Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: - + echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus - 00,, + 00,, - But for any queue number > 63, we see this error: + But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: In Progress Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). 
It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f It requires the following commit as well: i40e: Do not allow use more TC queue pairs than MSI-X vectors exist Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58 [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
I'm still trying to confirm this for Xenial. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: Fix Committed Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f It requires the following commit as well: i40e: Do not allow use more TC queue pairs than MSI-X vectors exist Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58 [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Description changed: [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error: “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." [Fix] There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels). "geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. 
You can see that it is working properly by adding an IP to br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and the ping test won't work. With the fixed test kernel, the interfaces and tunnel are created successfully. [Regression Potential] - * Low -- affects the geneve driver only, and when ipv6 is - disabled, and since it doesn't work in that case at all, - this fix gets the tunnel up and running for the common case. - + * Low -- affects the geneve driver only, and when ipv6 is + disabled, and since it doesn't work in that case at all, + this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. Currently, however, the implementation requires ipv6 support for metadata-based tunnels (which geneve tunnels are); that is, it supports only case (b) rather than also case (a):

a) ipv4 + metadata // whether ipv6 is compiled out or dynamically disabled
b) ipv4 + metadata + ipv6

What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open():

	bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
	bool metadata = geneve->collect_md;
	...
#if IS_ENABLED(CONFIG_IPV6)
	geneve->sock6 = NULL;
	if (ipv6 || metadata)
		ret = geneve_sock_add(geneve, true);
#endif
	if (!ret && (!ipv6 || metadata))
		ret = geneve_sock_add(geneve, false);

CONFIG_IPV6 is enabled and IPv6 is disabled at boot, but even though ipv6 is false, metadata is always true for a geneve open, as it is set unconditionally in ovs, in /lib/dpif_netlink_rtnl.c:

	case OVS_VPORT_TYPE_GENEVE:
		nl_msg_put_flag(&request, IFLA_GENEVE_COLLECT_METADATA);

The second argument of geneve_sock_add is a boolean value indicating whether it's an ipv6 address family socket or not, and we thus incorrectly pass a true value rather than false. The current "|| metadata" check is unnecessary and incorrectly sends the tunnel creation code down the ipv6 path, which subsequently fails when the code expects an ipv6 family socket. * This issue exists in all versions of the kernel up to the present mainline and net-next trees. * Testing with a trivial patch to remove that check and make changes similar to those made for vxlan (which had the same issue) has been successful. Patches for various versions to be attached here soon. * Example Versions (bug exists in all versions of Ubuntu - and mainline): + and mainline) + + Update: This has been patched upstream after original description filed + here, fix available in v5.0 mainline and
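The broken condition described in the analysis above can be modelled outside the kernel. The following is a toy sketch in plain shell (not kernel code, and not the upstream patch): it only mimics the boolean logic of the quoted geneve_open() snippet, to show why an IPv4 metadata tunnel fails when IPv6 is dynamically disabled.

```shell
#!/bin/sh
# Toy model of the geneve_open() socket choice quoted above: the
# "|| metadata" term forces an AF_INET6 socket even for an IPv4 remote,
# which fails with EAFNOSUPPORT when booted with ipv6.disable=1.
geneve_open_model() {
    remote_is_ipv6=$1 metadata=$2 ipv6_enabled=$3
    if [ "$remote_is_ipv6" = 1 ] || [ "$metadata" = 1 ]; then
        # models geneve_sock_add(geneve, true): socket(AF_INET6, ...)
        if [ "$ipv6_enabled" != 1 ]; then
            echo 'EAFNOSUPPORT'   # Address family not supported by protocol
            return 1
        fi
    fi
    echo 'ok'
}

# IPv4 remote, metadata set (OVS always sets it), IPv6 enabled:
geneve_open_model 0 1 1          # prints: ok
# Same tunnel after booting with ipv6.disable=1:
geneve_open_model 0 1 0 || true  # prints: EAFNOSUPPORT
```

The second call corresponds to the failing 'ovs-vsctl add-port' case in the test instructions: ipv6 is false, but metadata pushes the code down the IPv6 socket path anyway.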
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Description changed: + SRU Justification + + Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically. + + Fix: + Fixed by upstream commit in v5.0: + Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 + "geneve: correctly handle ipv6.disable module parameter" + + Hence available in Disco and later; required in X,B,C + Cherry picked and tested successfully for X, B, C. + + Testcase: + 1. Boot with "ipv6.disable=1" + 2. Then try and create a geneve tunnel using: +# ovs-vsctl add-br br1 +# ovs-vsctl add-port br1 geneve1 -- set interface geneve1 + type=geneve options:remote_ip=192.168.x.z // ip of the other host + + Regression Potential: Low, only geneve tunnels when ipv6 dynamically + disabled, current status is it doesn't work at all. + + Other Info: + * Mainline commit msg includes reference to a fix for + non-metadata tunnels (infrastructure is not yet in + our tree prior to Disco), hence not being included + at this time under this case. + + At this time, all geneve tunnels created as above + are metadata-enabled. + + + --- [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." [Fix] There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels). "geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. 
This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and the ping test won't work. With the fixed test kernel, the interfaces and tunnel are created successfully. [Regression Potential] * Low -- affects the geneve driver only, and when ipv6 is disabled, and since it doesn't work in that case at all, this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. 
Currently, however, the implementation requires ipv6 support for metadata-based tunnels (which geneve tunnels are); that is, it supports only case (b) rather than also case (a):

a) ipv4 + metadata // whether ipv6 is compiled out or dynamically disabled
b) ipv4 + metadata + ipv6

What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open():

	bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
	bool metadata = geneve->collect_md;
	...
#if IS_ENABLED(CONFIG_IPV6)
	geneve->sock6 = NULL;
	if (ipv6 || metadata)
		ret = geneve_sock_add(geneve, true);
#endif
	if (!ret && (!ipv6 || metadata))
		ret = geneve_sock_add(geneve, false);

CONFIG_IPV6 is enabled and IPv6 is disabled at boot, but even though ipv6 is false, metadata is always true for a geneve open, as it is set unconditionally in ovs, in /lib/dpif_netlink_rtnl.c:

	case OVS_VPORT_TYPE_GENEVE:
		nl_msg_put_flag(&request, IFLA_GENEVE_COLLECT_METADATA);

The second argument of geneve_sock_add is a
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Description changed: SRU Justification Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically. Fix: Fixed by upstream commit in v5.0: Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 "geneve: correctly handle ipv6.disable module parameter" - Hence available in Disco and later; required in X,B,C - Cherry picked and tested successfully for X, B, C. + Hence available in Disco and later; required in X,B,C. Testcase: 1. Boot with "ipv6.disable=1" 2. Then try and create a geneve tunnel using: -# ovs-vsctl add-br br1 -# ovs-vsctl add-port br1 geneve1 -- set interface geneve1 - type=geneve options:remote_ip=192.168.x.z // ip of the other host + # ovs-vsctl add-br br1 + # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 + type=geneve options:remote_ip=192.168.x.z // ip of the other host Regression Potential: Low, only geneve tunnels when ipv6 dynamically disabled, current status is it doesn't work at all. Other Info: * Mainline commit msg includes reference to a fix for - non-metadata tunnels (infrastructure is not yet in - our tree prior to Disco), hence not being included - at this time under this case. + non-metadata tunnels (infrastructure is not yet in + our tree prior to Disco), hence not being included + at this time under this case. - At this time, all geneve tunnels created as above - are metadata-enabled. - + At this time, all geneve tunnels created as above + are metadata-enabled. --- [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." [Fix] There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels). 
"geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example - is shown with the4.15.0-23-generic kernel (which differs + is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. With the fixed test kernel, the interfaces and tunnel is created successfully. 
[Regression Potential] * Low -- affects the geneve driver only, and when ipv6 is disabled, and since it doesn't work in that case at all, this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. Currently, however, what's in the implementation requires support for ipv6 for metadata-based tunnels which geneve is: rather than: a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open() : bool ipv6 = geneve->remote.sa.sa_family == AF_INET6; bool metadata = geneve->collect_md; ... #if IS_ENABLED(CONFIG_IPV6) geneve->sock6 = NULL; if (ipv6 || metadata)
** Tags added: cosmic xenial
"geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. With the fixed test kernel, the interfaces and tunnel is created successfully. 
[Regression Potential]

* Low -- this affects the geneve driver only, and only when ipv6 is disabled; since geneve tunnels don't work at all in that case today, the fix gets the tunnel up and running for the common case.

[Other Info]

* Analysis

Geneve tunnels should work in either IPv4 or IPv6 environments as a design and support principle. The geneve_open() analysis earlier in this bug covers how the current code enforces an ipv6 dependency for metadata-based tunnels.
Submitted SRU request for Bionic, Cosmic. Huge thanks for the testing, Matthew!
"geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. With the fixed test kernel, the interfaces and tunnel is created successfully. 
[Regression Potential] * Low -- affects the geneve driver only, and when ipv6 is disabled, and since it doesn't work in that case at all, this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. Currently, however, what's in the implementation requires support for ipv6 for metadata-based tunnels which geneve is: rather than: a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open() : bool ipv6 = geneve->remote.sa.sa_family == AF_INET6; bool metadata = geneve->collect_md; ... #if IS_ENABLED(CONFIG_IPV6) geneve->sock6 = NULL; if (ipv6 || metadata) ret = geneve_sock_add(geneve, true); #endif
Resubmitted SRU for B,C for this kernel cycle.
"geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. With the fixed test kernel, the interfaces and tunnel is created successfully. 
[Regression Potential] * Low -- affects the geneve driver only, and when ipv6 is disabled, and since it doesn't work in that case at all, this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. Currently, however, what's in the implementation requires support for ipv6 for metadata-based tunnels which geneve is: rather than: a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open() : bool ipv6 = geneve->remote.sa.sa_family == AF_INET6; bool metadata = geneve->collect_md; ... #if IS_ENABLED(CONFIG_IPV6) geneve->sock6 = NULL; if (ipv6 || metadata) ret = geneve_sock_add(geneve, true); #endif if (!ret && (!ipv6 || meta
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
The second port on the NIC definitely works as the active interface in an active-backup bonding configuration on the other NICs. At the moment, it's only this particular NIC that is seeing this problem, as far as we know.

https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The issue appears to be that the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device is dropping data.

  Basically, we are dropping data, as you can see from the benchmark tool as follows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
    Device: X-Series Device
    Mboard 0: X310
    RX Channel: 0, RX DSP: 0, RX Dboard: A, RX Subdev: SBX-120 RX
    RX Channel: 1, RX DSP: 0, RX Dboard: B, RX Subdev: SBX-120 RX
    TX Channel: 0, TX DSP: 0, TX Dboard: A, TX Subdev: SBX-120 TX
    TX Channel: 1, TX DSP: 0, TX Dboard: B, TX Subdev: SBX-120 TX
  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.

  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!
  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

  In this particular case description, the nodes are USRP X310s. However, we have the same issue with N210 nodes dropping samples when connected to the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no problem with the USRPs themselves, as we have tested them with normal 1G network cards and have no dropped samples. Personally I think it's something to do with the 10G network card, possibly an Ubuntu driver issue. Note, Dell have said there is no hardware problem with the 10G interfaces. I have followed the troubleshooting information on thi
We have narrowed it down to a flaw in a specific configuration setting on this NIC, so we're comparing the good and bad configurations now.

Primary port:   enp94s0f0
Secondary port: enp94s0f1d1

A] Good config for fault-tolerance (active-backup) bonding mode:
   -- Primary port = active interface; Secondary port = backup

B] Bad config for fault-tolerance (active-backup) bonding mode:
   -- Primary port = backup interface; Secondary port = active

We are consistently able to reproduce a drop rate difference with UDP pkts, for the above good/bad cases:

Good Case UDP MTR Test Result
-----------------------------
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T10:14:01+
HOST: hostname          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- nn.nn.nnn.nnn    0.0%    60    0.3   0.2   0.2   0.3   0.0

Bad Case UDP MTR Test Result
----------------------------
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T14:10:52+
HOST: hostname          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- nn.nn.nnn.nnn    8.3%    60    0.3   0.3   0.2   0.4   0.0
"Bad" Configuration for active-backup mode: $ cat /proc/net/bonding/bond0 Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011) Bonding Mode: fault-tolerance (active-backup) Primary Slave: None Currently Active Slave: enp94s0f1d1 MII Status: up MII Polling Interval (ms): 100 Up Delay (ms): 0 Down Delay (ms): 0 Peer Notification Delay (ms): 0 Slave Interface: enp94s0f1d1 MII Status: up Speed: 1 Mbps Duplex: full Link Failure Count: 2 Permanent HW addr: 4c:d9:8f:48:08:da Slave queue ID: 0 Slave Interface: enp94s0f0 MII Status: up Speed: 1 Mbps Duplex: full Link Failure Count: 2 Permanent HW addr: 4c:d9:8f:48:08:d9 Slave queue ID: 0 --- $ cat uname-rv 5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020 --- Scrubbed /etc/netplan/50-cloud-init.yaml: network: bonds: bond0: addresses: - 0.0.235.177/25 gateway4: 0.0.235.129 interfaces: - enp94s0f0 - enp94s0f1d1 macaddress: 00:00:00:48:08:00 mtu: 9000 nameservers: addresses: - 0.0.235.171 - 0.0.235.172 search: - maas parameters: down-delay: 0 gratuitious-arp: 1 mii-monitor-interval: 100 mode: active-backup transmit-hash-policy: layer2 up-delay: 0 ethernets: eno1: match: macaddress: 00:00:00:76:6e:ca mtu: 1500 set-name: eno1 eno2: match: macaddress: 00:00:00:76:6e:cb mtu: 1500 set-name: eno2 enp94s0f0: match: macaddress: 00:00:00:48:08:00 mtu: 9000 set-name: enp94s0f0 enp94s0f1d1: match: macaddress: 00:00:00:48:08:da mtu: 9000 set-name: enp94s0f1d1 version: 2 -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. 
Good System/Good NIC (all configurations work)

Comparison NIC: NetXtreme II BCM57000 10 Gigabit Ethernet QLogic 57000
System: Dell
Kernel: 5.0.0-25-generic #26~18.04.1-Ubuntu

/proc/net/bonding/bond0
---
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp5s0f1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp5s0f1
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e2
Slave queue ID: 0

Slave Interface: enp5s0f0
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e0
Slave queue ID: 0

/etc/netplan/50-cloud-init.yaml

network:
  bonds:
    bond0:
      addresses:
      - 00.00.235.182/25
      gateway4: 00.00.235.129
      interfaces:
      - enp5s0f0
      - enp5s0f1
      macaddress: 00:00:00:00:73:e0
      mtu: 9000
      nameservers:
        addresses:
        - 00.00.235.172
        - 00.00.235.171
        search:
        - maas
      parameters:
        down-delay: 0
        gratuitious-arp: 1
        mii-monitor-interval: 100
        mode: active-backup
        transmit-hash-policy: layer2
        up-delay: 0
  ethernets:
    ...(snip)..
    enp5s0f0:
      match:
        macaddress: 00:00:00:00:73:e0
      mtu: 9000
      set-name: enp5s0f0
    enp5s0f1:
      match:
        macaddress: 00:00:00:00:73:e2
      mtu: 9000
      set-name: enp5s0f1
  version: 2
"Bad" System/NIC:

NIC: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
System: Dell
Kernel: 5.3.0-28-generic #30~18.04.1-Ubuntu
(Note: this issue has been seen on prior kernels as well; upgraded to the latest to see if the various problems were resolved.)

Attaching stats/config files from the NICs on this system (seeing the issue).
** Attachment added: "ethtool -S for inactive interface enp94s0f0"
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1853638/+attachment/5327556/+files/ethtool-S-enp94s0f0
ethtool-enp94s0f0
--
Settings for enp94s0f0:
    Supported ports: [ FIBRE ]
    Supported link modes: 1baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes: Not reported
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Advertised FEC modes: Not reported
    Speed: 1Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: off
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x (0)
    Link detected: yes

ethtool-i-enp94s0f0
--
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: :5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

ethtool-c-enp94s0f0
--
Coalesce parameters for enp94s0f0:
Adaptive RX: off  TX: off
stats-block-usecs: 100
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 10
rx-frames: 15
rx-usecs-irq: 1
rx-frames-irq: 1
tx-usecs: 28
tx-frames: 30
tx-usecs-irq: 2
tx-frames-irq: 2
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

ethtool-g-enp94s0f0
--
Ring parameters for enp94s0f0:
Pre-set maximums:
RX: 2047
RX Mini: 0
RX Jumbo: 8191
TX: 2047
Current hardware settings:
RX: 511
RX Mini: 0
RX Jumbo: 2044
TX: 511

ethtool-k-enp94s0f0
--
Features for enp94s0f0:
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: on
tls-hw-record: off [fixed]
Edwin, let me know if you can get in touch with me via the contact email on my Launchpad page. Thanks for all the help!
Additional observations. MAAS is being used to deploy the system and to configure the bond interface and its settings.

MAAS allows you to specify which interface is the primary, with the other being the backup, for the active-backup bonding mode. However, this does not appear to be working - it is not passing along a primary directive in the netplan yaml, or otherwise ensuring the setting is honored (still need to confirm).

MAAS also allows you to enter a MAC address for the bond interface; if one is not supplied, by default it will use the MAC address of the configured "primary" interface. MAAS then populates /etc/netplan/50-cloud-init.yaml, including a macaddress line with that default, and netplan passes it along to systemd-networkd.

The bonding driver, however, will use as the active interface whichever interface is attached to the bond first (i.e., whichever completes getting enslaved first) in the absence of a primary directive, while still applying the supplied MAC address as an override. So say the active interface was configured in MAAS to be f0, and its MAC is used as the MAC address of the bond, but f1 (the second port of the NIC) actually gets attached to the bond first and is used as the active interface. We then have a situation where f0 = backup, f1 = active, and bond0 is using the MAC of f0. While this should work, there is potential for problems depending on the circumstances. It is likely this has nothing to do with our current issue, but it is noted here for completeness. Will see if we can test/confirm.
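For reference, netplan's bond parameters mapping does accept a primary key for active-backup mode, which is the directive whose absence is suspected in the MAAS-generated yaml. A minimal sketch (interface names reused from the 50-cloud-init.yaml shown earlier; whether MAAS actually emits this key is exactly what remains to be confirmed):

```yaml
# Hypothetical hand-written fragment, not MAAS output: pin the active
# slave deterministically instead of relying on enslavement order.
network:
  version: 2
  bonds:
    bond0:
      interfaces: [enp5s0f0, enp5s0f1]
      parameters:
        mode: active-backup
        primary: enp5s0f0          # force f0 to be the active interface
        mii-monitor-interval: 100
```

With primary set, the bonding driver no longer picks whichever slave finishes enslaving first, so the bond MAC and the active interface stay consistent.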
Edwin, do you happen to notice any IPv6, LLDP, or other link-local traffic on the interfaces (including the backup interface)?

The MTR loss % is purely a capture of the packets transmitted and the responses received, so for that UDP MTR test this is saying that UDP packets were lost somewhere. The NIC does not show any drops via the ethtool -S stats, but I'm still hunting down the right pair of before/after captures. Other than the tpa_abort counts, there were no errors that I saw. I can't tell what a tpa_abort means for the frame - is it purely a failure to coalesce, or does it end up dropping packets at some point in that functionality? I'm assuming not; whatever the reason, those would be counted as drops, I hope, and printed in the interface stats.

I'll attach all the stats here once I get them sorted out. I thought I had a clean diff of before and after from the tester, but after looking through, I don't think the file I have is from before/after the mtr test, as there was negligible UDP traffic. I'll try to get clarification from the reporter.

Note that when primary= is used to configure which interface is primary, and the primary port is therefore the active interface for the bond, no problems are seen (and that works deterministically to set the correct active interface).
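To pin down the "right pair of before/afters" mentioned in the comments, one way is to snapshot `ethtool -S` before and after a test run and print only the counters that moved. This is a hypothetical helper, not part of the bug report; the interface name enp94s0f0 is taken from the attachments:

```shell
# stats_delta: diff two `ethtool -S` snapshots, printing every counter
# whose value changed between the "before" and "after" captures.
stats_delta() {
    # $1 = snapshot taken before the test, $2 = snapshot taken after
    awk -F': *' '
        NR == FNR { gsub(/^ +/, "", $1); before[$1] = $2; next }
        {
            gsub(/^ +/, "", $1)
            if (($1 in before) && before[$1] != $2)
                printf "%s: %s -> %s\n", $1, before[$1], $2
        }' "$1" "$2"
}

# Capture pattern on the affected host:
#   ethtool -S enp94s0f0 > before.txt
#   ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
#   ethtool -S enp94s0f0 > after.txt
#   stats_delta before.txt after.txt
```

Counters that increment only during the benchmark (rx drops, tpa_aborts, and so on) then stand out immediately instead of being buried in the full stats dump.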
[Kernel-packages] [Bug 1811963] Re: Sporadic problems with X710 (i40e) and bonding where one interface is shown as "state DOWN" and without LOWER_UP
Hi Malte, Was this issue resolved for you? There are several other possibilities that it could be - and if it's still a problem with current mainline, please let us know. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1811963 Title: Sporadic problems with X710 (i40e) and bonding where one interface is shown as "state DOWN" and without LOWER_UP Status in linux package in Ubuntu: Confirmed Bug description: After rebooting the physical server there is a 50/50 chance of all connected interfaces coming up. This affects Dell EMC R740's and R440's equipped with the X710 network cards. As far as I noticed (~20 reboots on different machines), this happens only when using bonding (in this case active-backup or mode 1, did not test different modes yet). The networking-hardware on the other side shows the ports "connected". tcpdump shows frames being received, even if the interface is in "state DOWN". Tried with: Ubuntu 16.04, kernel 4.4.0-141, driver 2.7.26 (from the Intel-website), firmware 18.8.9 Ubuntu 16.04, kernel 4.4.0-141, driver 1.4.25-k, firmware 18.8.9 Ubuntu 16.04, kernel 4.15.0-43 (hwe), driver 2.1.14-k, firmware 18.8.9 The following excerpts are made using Intels driver in version 2.7.26, therefore tainting the kernel, but the same happens using the original kernel's version or the hardware enablement kernel's version. Sporadic failure case: [6.319226] i40e: loading out-of-tree module taints kernel. [6.319227] i40e: loading out-of-tree module taints kernel. [6.319422] i40e: module verification failed: signature and/or required key missing - tainting kernel [6.410837] i40e: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver - version 2.7.26 [6.410838] i40e: Copyright(c) 2013 - 2018 Intel Corporation. 
[6.423542] i40e :3b:00.0: fw 6.81.49447 api 1.7 nvm 6.80 0x80003d72 18.8.9 [6.658526] i40e :3b:00.0: MAC address: ff:ff:ff:ff:ff:ff [6.710391] i40e :3b:00.0: PCI-Express: Speed 8.0GT/s Width x8 [6.725692] i40e :3b:00.0: Features: PF-id[0] VFs: 64 VSIs: 2 QP: 40 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA [6.750239] i40e :3b:00.1: fw 6.81.49447 api 1.7 nvm 6.80 0x80003d72 18.8.9 [6.987874] i40e :3b:00.1: MAC address: ff:ff:ff:ff:ff:f1 [7.005397] i40e :3b:00.1 eth0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None [7.024993] i40e :3b:00.1: PCI-Express: Speed 8.0GT/s Width x8 [7.040298] i40e :3b:00.1: Features: PF-id[1] VFs: 64 VSIs: 2 QP: 40 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA [7.054384] i40e :3b:00.1 enp59s0f1: renamed from eth0 [7.079613] i40e :3b:00.0 enp59s0f0: renamed from eth1 [9.788893] i40e :3b:00.0 enp59s0f0: already using mac address ff:ff:ff:ff:ff:ff [9.819480] i40e :3b:00.1 enp59s0f1: set new mac address ff:ff:ff:ff:ff:ff [9.728194] bond0: Setting MII monitoring interval to 100 [9.788690] bond0: Adding slave enp59s0f0 [9.805195] bond0: Enslaving enp59s0f0 as a backup interface with a down link [9.819470] bond0: Adding slave enp59s0f1 [9.836360] bond0: making interface enp59s0f1 the new active one [9.836614] bond0: Enslaving enp59s0f1 as an active interface with an up link Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011) Bonding Mode: fault-tolerance (active-backup) Primary Slave: None Currently Active Slave: enp59s0f1 MII Status: up MII Polling Interval (ms): 100 Up Delay (ms): 0 Down Delay (ms): 0 Slave Interface: enp59s0f0 MII Status: down Speed: Unknown Duplex: Unknown Link Failure Count: 0 Permanent HW addr: ff:ff:ff:ff:ff:ff Slave queue ID: 0 Slave Interface: enp59s0f1 MII Status: up Speed: Unknown Duplex: Unknown Link Failure Count: 0 Permanent HW addr: ff:ff:ff:ff:ff:f1 Slave queue ID: 0 4: enp59s0f0: mtu 1500 qdisc mq master bond0 portid state DOWN group default qlen 1000 link/ether 
ff:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff 5: enp59s0f1: mtu 1500 qdisc mq master bond0 portid fff1 state UP group default qlen 1000 link/ether ff:ff:ff:ff:ff:f1 brd ff:ff:ff:ff:ff:ff 6: bond0: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ff:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff inet 123.123.123.123/24 brd 123.123.123.255 scope global bond0 valid_lft forever preferred_lft forever inet6 :::::/64 scope link valid_lft forever preferred_lft forever bond0 Link encap:Ethernet HWaddr ff:ff:ff:ff:ff:ff inet addr:123.123.123.123 Bcast:123.123.123.255 Mask:255.255.255.0 inet6 addr: :::::/64 Scope:Link UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
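When chasing this kind of "state DOWN but frames still arrive" report, the bonding driver's own per-slave view in /proc/net/bonding/bond0 (quoted above) is the quickest thing to compare against `ip link`. A minimal sketch of pulling out just the per-slave MII status; the `parse_bond` helper is hypothetical, and the sample input is taken from the report:

```shell
#!/bin/sh
# Print "<slave>: <mii status>" for each slave from `cat /proc/net/bonding/bond0`
# style output. parse_bond is a hypothetical helper, not part of any tool.
parse_bond() {
    awk '/^Slave Interface:/ { iface = $3 }
         /^MII Status:/      { if (iface != "") { print iface ": " $3; iface = "" } }'
}

# Sample taken from the failure case above; on a real host you would pipe in
# `cat /proc/net/bonding/bond0` instead.
parse_bond <<'EOF'
MII Status: up
Slave Interface: enp59s0f0
MII Status: down
Slave Interface: enp59s0f1
MII Status: up
EOF
```

This prints `enp59s0f0: down` and `enp59s0f1: up` for the sample, matching the failure case where one slave enslaves with a down link. The bond-level "MII Status:" line (before any "Slave Interface:") is deliberately skipped.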
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
We are closing this LP bug for now as we aren't able to reproduce in-house, and we cannot get access to a live testing repro env at this time. Here is what we know: - There seems to be different performance for some tests when the NIC is configured with active-backup bonding mode, between the case when the active interface is the primary port, and when the active interface is the secondary port. i.e.: Primary port: enp94s0f0 // when this is the active, works fine Secondary port: enp94s0f1d1 // when this is the active, more drops - Switch info: 2 x Fortigate 1024D switches, each machine is connected to both - NIC info: root@u072:~# lspci | grep BCM57416 01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01) # ethtool -i enp1s0f0np0 driver: bnxt_en version: 1.10.0 firmware-version: 214.0.253.1/pkg 21.40.25.31 - Our attempt at a reproducer (initially reported in production env via graphical monitoring): mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST good system = ~ 0% drops bad systems = ~ 8% drops We are not getting NIC stats drops, nor UDP kernel drops, so it's not clear where the packet is being dropped, whether it's being dropped silently somewhere (?), or if that's a red herring and a mtr test issue, and what's seen in production is something else. If someone can reproduce this, or something similar, or if we manage to, we will re-open this bug or file a new one. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. 
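Since neither NIC stats nor UDP kernel drops moved during the mtr runs, one way to keep the counter checks honest is to snapshot the drop counters before and after a run and compare. A rough sketch; the `sum_drops` helper is hypothetical and the counter names are illustrative (real bnxt_en counter names vary by firmware):

```shell
#!/bin/sh
# Sum every counter whose name contains "drop" from `ethtool -S <iface>`
# style output. sum_drops is a hypothetical helper, not part of ethtool.
sum_drops() {
    awk -F': *' 'tolower($1) ~ /drop/ { total += $2 } END { print total + 0 }'
}

# Illustrative counters; on a real host: ethtool -S enp94s0f1d1 | sum_drops
sum_drops <<'EOF'
     rx_ucast_packets: 123456
     rx_drops: 3
     tx_drops: 0
EOF
```

Running this before and after the mtr test gives a single number to diff; if the delta stays at zero while mtr reports ~8% loss, the drop is happening somewhere the NIC does not count, which is consistent with the "silent drop or red herring" question above.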
https://bugs.launchpad.net/bugs/1853638 Title: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data Status in linux package in Ubuntu: Confirmed Status in network-manager package in Ubuntu: Confirmed Bug description: The issue appears to be with the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data Basically, we are dropping data, as you can see from the benchmark tool as follows: tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300 [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986 [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam [00:00:00.07] Creating the usrp device with: ... [INFO] [X300] X300 initialization sequence... [INFO] [X300] Maximum frame size: 1472 bytes. 
[INFO] [X300] Radio 1x clock: 200 MHz [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000) [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s) [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s) [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001) [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001) [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0) [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0) [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0) [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0) Using Device: Single USRP: Device: X-Series Device Mboard 0: X310 RX Channel: 0 RX DSP: 0 RX Dboard: A RX Subdev: SBX-120 RX RX Channel: 1 RX DSP: 0 RX Dboard: B RX Subdev: SBX-120 RX TX Channel: 0 TX DSP: 0 TX Dboard: A TX Subdev: SBX-120 TX TX Channel: 1 TX DSP: 0 TX Dboard: B TX Subdev: SBX-120 TX [00:00:04.305374] Setting device timestamp to 0... [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels [00:00:06.693119] Detected Rx sequence error. D[00:00:09.402843] Detected Rx sequence error. DD[00:00:40.927978] Detected Rx sequence error. D[00:01:44.982243] Detected Rx sequence error. D[00:02:11.400692] Detected Rx sequence error. D[00:02:14.805292] Detected Rx sequence error. D[00:02:41.875596] Detected Rx sequence error. D[00:03:06.927743] Detected Rx sequence error. 
D[00:03:47.967891] Detected Rx sequence error. D[00:03:58.233659] Detected Rx sequence error. D[00:03:58.876588] Detected Rx sequence error. D[00:04:03.139770] Detected Rx sequence error. D[00:04:45.287465] Detected Rx sequence error. D[00:04:56.425845] Detected Rx sequence error. D[00:04:57.929209] Detected Rx sequence error. [00:05:04.529548] Benchmark complete. Benchmark rate summary: Num received samples: 2995435936 Num dropped samples: 4622800 Num overruns detected:0 Num transmitted samples: 3008276544 Num sequence errors (Tx): 0 Num sequence errors (Rx): 15 Num underruns detected: 0 Num late commands:0 Num timeouts (Tx):0 Num timeouts (Rx):0 Done! tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ In this particular case description, the nodes are USRP x310s. However, we have the same issue with N210 nodes dropping samples connected to the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no problem with the USRPs themselves, as we have tested them with normal 1G network cards and have no dropped samples. Personally I think its something to do with the 10G network card, possibly on a ubuntu driver??? Note, Dell have said there is no hardware problem with the 10G interfaces I have followed the troubleshooting information on this link to try determine the problem: https://files.ettus.com/manual/page_usrp_x3x0_config.html - There is no firew
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Hello diarmuid, regarding the original issue report: were you able to resolve your issue? Please let us know. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1853638 Title: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data Status in linux package in Ubuntu: Confirmed Status in network-manager package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
The issue we have reported is easily avoided by specifying the primary port to be the active interface of the bond. On netplan-using systems: add the directive "primary: $interface" (e.g. "primary: enp94s0f0") to the "parameters:" section of the netplan config file. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1853638 Title: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data Status in linux package in Ubuntu: Confirmed Status in network-manager package in Ubuntu: Confirmed
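A sketch of that workaround in netplan syntax; the interface names are taken from the earlier comment in this bug, while the file path and surrounding layout are assumptions:

```yaml
# /etc/netplan/01-bond.yaml (illustrative path)
network:
  version: 2
  bonds:
    bond0:
      interfaces: [enp94s0f0, enp94s0f1d1]
      parameters:
        mode: active-backup
        # Pin the primary port as the active interface (the workaround above)
        primary: enp94s0f0
```

After editing, `netplan apply` (or a reboot) makes the bond prefer the primary port whenever its link is up.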
[Kernel-packages] [Bug 1882039] Re: The thread level parallelism would be a bottleneck when searching for the shared pmd by using hugetlbfs
** Changed in: linux (Ubuntu Bionic) Importance: Medium => High ** Changed in: linux (Ubuntu Bionic) Status: Triaged => In Progress ** Changed in: linux (Ubuntu Eoan) Status: Triaged => In Progress ** Changed in: linux (Ubuntu Bionic) Assignee: (unassigned) => Gavin Guo (mimi0213kimo) ** Changed in: linux (Ubuntu Focal) Status: Triaged => In Progress ** Changed in: linux (Ubuntu Focal) Importance: Medium => High ** Changed in: linux (Ubuntu Eoan) Importance: Medium => High ** Changed in: linux (Ubuntu) Importance: Medium => High ** Changed in: linux (Ubuntu Eoan) Assignee: (unassigned) => Gavin Guo (mimi0213kimo) ** Changed in: linux (Ubuntu Focal) Assignee: (unassigned) => Gavin Guo (mimi0213kimo) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1882039 Title: The thread level parallelism would be a bottleneck when searching for the shared pmd by using hugetlbfs Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: In Progress Status in linux source package in Eoan: In Progress Status in linux source package in Focal: In Progress Bug description: [Impact] There is a performance overhead observed when many threads use hugetlbfs in a database environment. [Fix] bdfbd98bc018 hugetlbfs: take read_lock on i_mmap for PMD sharing The patch improves the locking by taking the read lock instead of the write lock on i_mmap, which allows multiple threads to search for a suitable shared VMA in parallel; since the search itself modifies nothing, this increases parallelism and decreases the waiting time of the other threads. [Test] The customer stood up a database with seed data. They then ran a load "driver" which makes a bunch of connections that look like user workflows from the database perspective.
Finally, a response-time improvement can be measured for these "users", as well as various other metrics at the database level. [Regression Potential] The modification only replaces the write lock with a read one, and nothing is modified inside the search loop, so the regression probability is low. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1882039/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Tested. ** Tags removed: verification-needed-bionic ** Tags added: verification-done-bionic -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1879658 Title: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels Status in linux package in Ubuntu: Invalid Status in linux source package in Bionic: Fix Committed Bug description: [IMPACT] Setting an MTU larger than the default 1500 results in an error on the recent (4.15.0-92+) Bionic/Xenial -hwe kernels when attempting to create ipvlan interfaces: # ip link add test0 mtu 9000 link eno1 type ipvlan mode l2 RTNETLINK answers: Invalid argument This breaks Docker and other applications which use a Jumbo MTU (9000) when using ipvlans. The bug is caused by the following recent commit to Bionic & Xenial-hwe; which is pulled in via the stable patchset below, which enforces a strict min/max MTU when MTUs are being set up via rtnetlink for ipvlans: Breaking commit: --- Ubuntu-hwe-4.15.0-92.93~16.04.1 * Bionic update: upstream stable patchset 2020-02-21 (LP: #1864261) * net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link() The above patch applies checks of dev->min_mtu and dev->max_mtu to avoid a malicious user from crashing the kernel with a bad value. It was patching the original patchset to centralize min/max MTU checking from various different subsystems of the networking kernel. However, in that patchset, the max_mtu had not been set to the largest phys (64K) or jumbo (9000 bytes), and defaults to 1500. The recent commit above which enforces strict bounds checking for MTU size exposes the bug of the max mtu not being set correctly for the ipvlan driver (this has been previously fixed in bonding, teaming drivers). Fix: --- This was fixed in the upstream kernel as of v4.18-rc2 for ipvlans, but was not backported to Bionic along with other patches. 
The missing commit in the Bionic backport: ipvlan: use ETH_MAX_MTU as max mtu commit 548feb33c598dfaf9f8e066b842441ac49b84a8a [Test Case] 1. Install any kernel earlier than 4.15.0-92 (Bionic/Xenial-hwe) 2. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2 (where test1 eno1 is the physical interface you are adding the ipvlan on) 3. # ip link ... 14: test1@eno1: mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000 ... // check that your test1 ipvlan is created with mtu 9000 4. Install 4.15.0-92 kernel or later 5. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2 RTNETLINK answers: Invalid argument 6. With the above fix commit backported to the xenial-hwe/Bionic, the jumbo mtu ipvlan creation works again, identical to before 92. [Regression Potential] This commit is in upstream mainline as of v4.18-rc2, and hence is already in Cosmic and later, i.e. all post Bionic releases currently. Hence there's low regression potential here. It only impacts ipvlan functionality, and not other networking systems, so core systems should not be affected by this. And affects on setup so it either works or doesn't. Patch is trivial. It only impacts Bionic/Xenial-hwe 4.15.0-92 onwards versions (where the latent bug got exposed). To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
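A quick way to confirm whether a running kernel carries the ETH_MAX_MTU fix is to read the device's advertised MTU bounds from `ip -d link show`. A sketch under stated assumptions: the sample line below is illustrative, not captured output, and 65535 is ETH_MAX_MTU, which the fixed ipvlan driver advertises:

```shell
#!/bin/sh
# Extract the "minmtu N maxmtu N" bounds that `ip -d link show <dev>` prints
# on kernels new enough to expose them. The echoed line is illustrative; on a
# real host you would run: ip -d link show test1 | grep -o 'minmtu [0-9]* maxmtu [0-9]*'
echo '14: test1@eno1: <BROADCAST,MULTICAST> mtu 9000 ... minmtu 68 maxmtu 65535' |
    grep -o 'minmtu [0-9]* maxmtu [0-9]*'
```

On an unfixed kernel the same check would show a maxmtu of 1500, which is exactly why the `mtu 9000` request above bounces with "RTNETLINK answers: Invalid argument".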
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Packages tested linux-gcp (4.15.0-1078.88~16.04.1) xenial; linux-hwe (4.15.0-107.108~16.04.1) xenial; linux-gcp-4.15 (4.15.0-1078.88) bionic; linux (4.15.0-107.108) bionic; -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1879658 Title: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels Status in linux package in Ubuntu: Invalid Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1879658] [NEW] Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Public bug reported: [IMPACT] Setting an MTU larger than the default 1500 results in an error on the recent (4.15.0-92+) Bionic/Xenial -hwe kernels when attempting to create ipvlan interfaces: # ip link add test0 mtu 9000 link eno1 type ipvlan mode l2 RTNETLINK answers: Invalid argument This breaks Docker and other applications which use a Jumbo MTU (9000) when using ipvlans. The bug is caused by the following recent commit to Bionic & Xenial-hwe; which is pulled in via the stable patchset below, which enforces a strict min/max MTU when MTUs are being set up via rtnetlink for ipvlans: Breaking commit: --- Ubuntu-hwe-4.15.0-92.93~16.04.1 * Bionic update: upstream stable patchset 2020-02-21 (LP: #1864261) * net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link() The above patch applies checks of dev->min_mtu and dev->max_mtu to avoid a malicious user from crashing the kernel with a bad value. It was patching the original patchset to centralize min/max MTU checking from various different subsystems of the networking kernel. However, in that patchset, the max_mtu had not been set to the largest phys (64K) or jumbo (9000 bytes), and defaults to 1500. The recent commit above which enforces strict bounds checking for MTU size exposes the bug of the max mtu not being set correctly for the ipvlan driver (this has been previously fixed in bonding, teaming drivers). Fix: --- This was fixed in the upstream kernel as of v4.18-rc2 for ipvlans, but was not backported to Bionic along with other patches. The missing commit in the Bionic backport: ipvlan: use ETH_MAX_MTU as max mtu commit 548feb33c598dfaf9f8e066b842441ac49b84a8a [Test Case] 1. Install any kernel earlier than 4.15.0-92 (Bionic/Xenial-hwe) 2. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2 (where test1 eno1 is the physical interface you are adding the ipvlan on) 3. # ip link ... 14: test1@eno1: mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000 ... 
// check that your test1 ipvlan is created with mtu 9000 4. Install 4.15.0-92 kernel or later 5. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2 RTNETLINK answers: Invalid argument 6. With the above fix commit backported to the xenial-hwe/Bionic, the jumbo mtu ipvlan creation works again, identical to before 92. [Regression Potential] This commit is in upstream mainline as of v4.18-rc2, and hence is already in Cosmic and later, i.e. all post Bionic releases currently. Hence there's low regression potential here. It only impacts ipvlan functionality, and not other networking systems, so core systems should not be affected by this. And affects on setup so it either works or doesn't. Patch is trivial. It only impacts Bionic/Xenial-hwe 4.15.0-92 onwards versions (where the latent bug got exposed). ** Affects: linux (Ubuntu) Importance: Critical Status: Incomplete ** Affects: linux (Ubuntu Bionic) Importance: Critical Status: Incomplete ** Tags: bionic sts ** Changed in: linux (Ubuntu) Importance: Undecided => Critical ** Also affects: linux (Ubuntu Bionic) Importance: Undecided Status: New ** Changed in: linux (Ubuntu Bionic) Importance: Undecided => Critical
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
SRU request has been submitted. If anyone would like to test, there are test images up on: https://people.canonical.com/~nivedita/ipvlan-test-fix-278887/ You can 'wget' the files and then 'dpkg -i' the modules, linux-image, modules-extra debs in that order, and reboot. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1879658 Title: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels Status in linux package in Ubuntu: Incomplete Status in linux source package in Bionic: In Progress
Fix: --- This was fixed in the upstream kernel as of v4.18-rc2 for ipvlans, but was not backported to Bionic along with other patches. The missing commit in the Bionic backport: ipvlan: use ETH_MAX_MTU as max mtu commit 548feb33c598dfaf9f8e066b842441ac49b84a8a [Test Case] 1. Install any kernel earlier than 4.15.0-92 (Bionic/Xenial-hwe) 2. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2 (where test1 eno1 is the physical interface you are adding the ipvlan on) 3. # ip link ... 14: test1@eno1: mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000 ... // check that your test1 ipvlan is created with mtu 9000 4. Install 4.15.0-92 kernel or later 5. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2 RTNETLINK answers: Invalid argument 6. With the above fix commit backported to the xenial-hwe/Bionic, the jumbo mtu ipvlan creation works again, identical to before 92. [Regression Potential] This commit is in upstream mainline as of v4.18-rc2, and hence is already in Cosmic and later, i.e. all post Bionic releases currently. Hence there's low regression potential here. It only impacts ipvlan functionality, and not other networking systems, so core systems should not be affected by this. And affects on setup so it either works or doesn't. Patch is trivial. It only impacts Bionic/Xenial-hwe 4.15.0-92 onwards versions (where the latent bug got exposed). To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
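To make the failure mode concrete, here is a minimal sketch (illustrative Python, not kernel code; the function and constant names are assumptions for this example) of the range check that rtnl_create_link() now applies against dev->min_mtu/dev->max_mtu, showing why a 9000-byte MTU is rejected while ipvlan's max_mtu is left at the 1500 default, and accepted once the fix raises it to ETH_MAX_MTU:

```python
# Sketch of the dev->min_mtu/dev->max_mtu bounds check enforced by the
# "validate IFLA_MTU attribute" patch. Out-of-range values make the
# kernel return -EINVAL ("RTNETLINK answers: Invalid argument").
ETH_DATA_LEN = 1500   # pre-fix ipvlan max_mtu default
ETH_MAX_MTU = 0xFFFF  # 65535; what commit 548feb33 sets ipvlan max_mtu to

def mtu_valid(mtu: int, min_mtu: int = 68, max_mtu: int = ETH_DATA_LEN) -> bool:
    """Accept an MTU only if it falls inside the driver's declared range."""
    return min_mtu <= mtu <= max_mtu

# Broken Bionic backport: ipvlan max_mtu stuck at 1500, jumbo MTU rejected.
assert not mtu_valid(9000)

# With "ipvlan: use ETH_MAX_MTU as max mtu" applied, jumbo MTUs pass.
assert mtu_valid(9000, max_mtu=ETH_MAX_MTU)
```

The point the sketch makes is that the rtnetlink check itself is correct; the bug is purely that ipvlan never declared a large enough max_mtu.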
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
** Changed in: linux (Ubuntu) Status: Incomplete => Confirmed
** Changed in: linux (Ubuntu) Status: Confirmed => In Progress
** Changed in: linux (Ubuntu) Status: In Progress => Invalid

Status in linux package in Ubuntu: Invalid
Status in linux source package in Bionic: In Progress
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Test kernel has been tested successfully so far by the original reporter and has fixed the Docker breakage.
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
Could anyone hitting this bug confirm it is a DUP of LP Bug 1852077 and that the latest releases fix this issue? The state changes/updates on this bug got borked because it was not simply marked as a DUP and closed. I will close this next week otherwise.

** Changed in: linux (Ubuntu Focal) Status: In Progress => Fix Released
** Changed in: linux (Ubuntu Bionic) Status: Fix Committed => Fix Released
** Changed in: linux (Ubuntu Disco) Status: Fix Committed => Fix Released
** Changed in: linux (Ubuntu Eoan) Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Released
Status in linux source package in Disco: Fix Released
Status in linux source package in Eoan: Fix Released
Status in linux source package in Focal: Fix Released

Bug description:

We are losing port-channel aggregation on reboot. After the reboot, /var/log/syslog contains the entries:

[  250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
[  282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports

Aggregator IDs of the slave interfaces are different:

ubuntu@node-6:~$ cat /proc/net/bonding/bond2
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable

Slave Interface: enp24s0f1np1
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:48:9f:51
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0

Slave Interface: enp24s0f0np0
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:48:9f:50
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1

The mismatch in "Aggregator ID" between the ports is a symptom of the issue. If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up', the port with the mismatched ID appears to renegotiate with the port-channel and becomes aggregated. The other way to work around this issue is to take the bond ports down and bring up port enp24s0f0np0 first and port enp24s0f1np1 second. When I change the order of bringing the ports up (first enp24s0f1np1, second enp24s0f0np0), the issue is still there. When the issue occurs, the switch port corresponding to interface enp24s0f0np0 is in Suspended state. After applying the workaround, the port is no longer in Suspended state and the Aggregator IDs in /proc/net/bonding/bond2 are equal. I installed a 5.0.0 kernel; the issue is still there.

Operating System: Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)
ubuntu@node-6:~$ uname -a
Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@node-6:~$ sudo lspci -vnvn
https://pastebin.ubuntu.com/p/Dy2CKDbySC/

Hardware: Dell PowerEdge R740xd
BIOS version: 2.1.7
sosreport: https://drive.google.com/open?id=1-eN7cZJIeu-AQBEU7Gw8a_AJTuq0AOZO

ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G
https://pastebin.ubuntu.com/p/sqCx79vZWM/
ubuntu@node-6:~$ lspci -n | grep 18:00
18:00.0 0200: 14e4:16d8 (rev 01)
18:00.1 0200: 14e4:16d8 (rev 01)
ubuntu@node-6:~$ modinfo bnx2x
https://pastebin.ubuntu.com/p/pkmzsFjK8M/
ubuntu@node-6:~$ ip -o l
https://pastebin.ubuntu.com/p/QpW7TjnT2v/
ubuntu@node-6:~$ ip -o a
https://pastebin.ubuntu.com/p/MczKtrnmDR/
ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml
https://pastebin.ubuntu.com/p/9cZpPc7C6P/
ubuntu@node-6:~$ sudo lshw -c network
https://pastebin.ubuntu.com/p/gmfgZptzDT/

---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw 1 root audio 116, 1 Jun 26 10:21 seq
 crw-rw 1 root audio 116, 33 Jun 26 10:21 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/
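The mismatched-Aggregator-ID symptom described above can be detected programmatically. Below is a minimal sketch (an assumption for illustration, not part of the bug report or the kernel) that parses /proc/net/bonding-style text and flags a bond whose slaves landed in different aggregators:

```python
import re

def aggregator_ids(bonding_text: str) -> dict:
    """Map each 'Slave Interface' in /proc/net/bonding/<bond> text to its
    'Aggregator ID'. A healthy 802.3ad bond has one ID shared by all slaves."""
    ids, slave = {}, None
    for line in bonding_text.splitlines():
        m = re.match(r"\s*Slave Interface:\s*(\S+)", line)
        if m:
            slave = m.group(1)
            continue
        m = re.match(r"\s*Aggregator ID:\s*(\d+)", line)
        if m and slave is not None:
            ids[slave] = int(m.group(1))
    return ids

# Sample mirroring the broken state reported above (IDs 1 vs 2).
sample = """\
Slave Interface: enp24s0f1np1
Aggregator ID: 1
Slave Interface: enp24s0f0np0
Aggregator ID: 2
"""

ids = aggregator_ids(sample)
# More than one distinct ID means the slaves are not aggregated together.
assert len(set(ids.values())) > 1
```

On a live system you would feed it `open('/proc/net/bonding/bond2').read()`; a mismatch is the cue to apply the 'ip link set dev bond2 down' / 'up' workaround described above.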
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
Note that the fixes for all of the above series have already been released, i.e. from Ubuntu-4.15.0-73.82 onwards.
[Kernel-packages] [Bug 1852077] Re: Backport: bonding: fix state transition issue in link monitoring
Still waiting on these patches being committed to all the Ubuntu trees. Any ETA? Is this waiting on being picked up via -stable?

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1852077

Title: Backport: bonding: fix state transition issue in link monitoring

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: In Progress
Status in linux source package in Disco: In Progress
Status in linux source package in Eoan: In Progress
Status in linux source package in Focal: In Progress

Bug description:

== Justification ==

From the well-explained commit message:

Since de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring"), the bonding driver has utilized two separate variables to indicate the next link state a particular slave should transition to. Each is used to communicate to a different portion of the link state change commit logic: one to the bond_miimon_commit function itself, and another to the state transition logic. Unfortunately, the two variables can become unsynchronized, resulting in incorrect link state transitions within bonding. This can cause slaves to become stuck in an incorrect link state until a subsequent carrier state transition.

The issue occurs when a special case in bond_slave_netdev_event sets slave->link directly to BOND_LINK_FAIL. On the next pass through bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL case will set the proposed next state (link_new_state) to BOND_LINK_UP, but new_link to BOND_LINK_DOWN. The setting of the final link state from new_link comes after that from link_new_state, and so the slave will end up incorrectly in _DOWN state. Resolve this by combining the two variables into one.

== Fixes ==

* 1899bb32 (bonding: fix state transition issue in link monitoring)

This patch can be cherry-picked into E/F. For older releases like B/D, it needs to be backported, as they are missing the slave_err() printk macro added in 5237ff79 (bonding: add slave_foo printk macros), as well as the commit replacing netdev_err() with slave_err(), e2a7420d (bonding/main: convert to using slave printk macros). For Xenial, the commit that causes this issue, de77ecd4, does not exist.

== Test ==

Test kernels can be found here: https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/
The X-hwe and Disco kernels were tested by the bug reporter, Aleksei; the patched kernels work as expected.

== Regression Potential ==

Low. This patch just unifies the variables used in the link state change commit logic to prevent an incorrect state from being committed, and the changes are limited to the bonding driver itself. (Although include/net/bonding.h is used by other drivers, the change to that file only affects the bond_main.c driver.)

== Original Bug Report ==

There's an issue with the bonding driver in the current Ubuntu kernels: sometimes one link gets stuck in a weird state. It was fixed upstream with the patch https://www.spinics.net/lists/netdev/msg609506.html, commit 1899bb325149e481de31a4f32b59ea6f24e176ea. We see this bug with Linux 4.15 (Ubuntu Xenial, hwe kernel), but it should be reproducible with other current kernel versions.
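The desync described in the commit message can be illustrated with a toy model (plain Python, not kernel code; the function names and the reduction to two assignments are illustrative assumptions). The buggy path writes the proposed state into one variable but leaves a stale value in the second, and the second write wins because it is applied last; the fix collapses both into a single variable:

```python
# Toy model of the bond_miimon_inspect/commit desync fixed by 1899bb32.
BOND_LINK_UP, BOND_LINK_DOWN, BOND_LINK_FAIL = "up", "down", "fail"

def commit_buggy(link: str) -> str:
    """Pre-fix behavior: the BOND_LINK_FAIL case proposes UP via
    link_new_state, but new_link is left at DOWN. The final state is
    taken from new_link *after* link_new_state, so DOWN clobbers UP."""
    link_new_state = new_link = link
    if link == BOND_LINK_FAIL:        # carrier came back up
        link_new_state = BOND_LINK_UP # proposed next state
        new_link = BOND_LINK_DOWN     # stale second variable
    link = link_new_state
    link = new_link                   # later write wins: stuck DOWN
    return link

def commit_fixed(link: str) -> str:
    """Post-fix behavior: one variable, so the proposed state is the
    state that actually gets committed."""
    if link == BOND_LINK_FAIL:
        link = BOND_LINK_UP
    return link

assert commit_buggy(BOND_LINK_FAIL) == BOND_LINK_DOWN  # slave stuck down
assert commit_fixed(BOND_LINK_FAIL) == BOND_LINK_UP
```

This is why the symptom is a slave stuck in _DOWN until the next carrier transition, and why the one-line-of-intent fix (merge the two variables) is low risk.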
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
This is being handled as a DUP of LP Bug 1852077: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852077

** Changed in: linux (Ubuntu) Status: Expired => In Progress
** Tags added: sts
** Also affects: linux (Ubuntu Disco) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Bionic) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Focal) Importance: Undecided Status: In Progress
** Also affects: linux (Ubuntu Eoan) Importance: Undecided Status: New
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
There is a test kernel (from that LP bug) at: https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/
[Kernel-packages] [Bug 1852077] Re: Backport: bonding: fix state transition issue in link monitoring
FWIW, the fix has been committed to -stable: "bonding: fix state transition issue in link monitoring"
Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea

** Tags added: sts

https://bugs.launchpad.net/bugs/1852077

Title: Backport: bonding: fix state transition issue in link monitoring

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: In Progress
Status in linux source package in Disco: In Progress
Status in linux source package in Eoan: In Progress
Status in linux source package in Focal: In Progress

Bug description:

== Justification ==

From the well-explained commit message:

Since de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring"), the bonding driver has utilized two separate variables to indicate the next link state a particular slave should transition to. Each is used to communicate to a different portion of the link state change commit logic; one to the bond_miimon_commit function itself, and another to the state transition logic. Unfortunately, the two variables can become unsynchronized, resulting in incorrect link state transitions within bonding. This can cause slaves to become stuck in an incorrect link state until a subsequent carrier state transition.

The issue occurs when a special case in bond_slave_netdev_event sets slave->link directly to BOND_LINK_FAIL. On the next pass through bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL case will set the proposed next state (link_new_state) to BOND_LINK_UP, but the new_link to BOND_LINK_DOWN. The setting of the final link state from new_link comes after that from link_new_state, and so the slave will end up incorrectly in _DOWN state.
Resolve this by combining the two variables into one.

== Fixes ==

* 1899bb32 (bonding: fix state transition issue in link monitoring)

This patch can be cherry-picked into E/F.

For older releases like B/D, it will need to be backported, as they are missing the slave_err() printk macro added in 5237ff79 (bonding: add slave_foo printk macros) as well as the commit to replace netdev_err() with slave_err() in e2a7420d (bonding/main: convert to using slave printk macros).

For Xenial, the commit that causes this issue, de77ecd4, does not exist.

== Test ==

Test kernels can be found here:
https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/

The X-hwe and Disco kernels were tested by the bug reporter, Aleksei; the patched kernels work as expected.

== Regression Potential ==

Low. This patch just unifies the variable used in the link state change commit logic to prevent the occurrence of an incorrect state, and the changes are limited to the bonding driver itself. (Although include/net/bonding.h is used by other drivers, the changes to that file only affect bond_main.c.)

== Original Bug Report ==

There's an issue with the bonding driver in the current Ubuntu kernels. Sometimes one link gets stuck in a weird state. It was fixed upstream with the patch https://www.spinics.net/lists/netdev/msg609506.html, commit 1899bb325149e481de31a4f32b59ea6f24e176ea. We see this bug with Linux 4.15 (Ubuntu Xenial, HWE kernel), but it should be reproducible with other current kernel versions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852077/+subscriptions
--
Mailing list: https://launchpad.net/~kernel-packages
Post to: kernel-packages@lists.launchpad.net
Unsubscribe: https://launchpad.net/~kernel-packages
More help: https://help.launchpad.net/ListHelp
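The failure mode in the commit message can be sketched as a toy model. This is purely illustrative Python, not the kernel code; the variable names mirror the ones in the commit message but the control flow is drastically simplified. The point is only that two separately-written "next state" variables can disagree, and the one written last wins:

```python
# Toy model of the pre-fix bond_miimon_inspect/commit flow (illustrative only).
BOND_LINK_UP, BOND_LINK_DOWN, BOND_LINK_FAIL = "up", "down", "fail"

def commit_pre_fix(slave_link, carrier_up):
    # Two variables each propose a next state; new_link is applied last,
    # so it silently overrides link_new_state when they disagree.
    link_new_state = new_link = None
    if slave_link == BOND_LINK_FAIL:
        if carrier_up:
            link_new_state = BOND_LINK_UP  # state-transition logic proposes "up"
        new_link = BOND_LINK_DOWN          # commit logic still proposes "down"
    return new_link if new_link is not None else link_new_state

def commit_post_fix(slave_link, carrier_up):
    # After the fix there is a single unified variable, so no disagreement.
    if slave_link == BOND_LINK_FAIL and carrier_up:
        return BOND_LINK_UP
    return BOND_LINK_DOWN

# A slave forced into BOND_LINK_FAIL that then regains carrier:
print(commit_pre_fix(BOND_LINK_FAIL, True))   # "down" -- the stuck state
print(commit_post_fix(BOND_LINK_FAIL, True))  # "up"
```

This mirrors why the slave stays stuck until a later carrier transition: nothing ever re-runs the inspect logic to correct the overridden state.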
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
FWIW, the fix has been committed to -stable: "bonding: fix state transition issue in link monitoring"
Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea

https://bugs.launchpad.net/bugs/1834322

Title: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: New
Status in linux source package in Disco: New
Status in linux source package in Eoan: New
Status in linux source package in Focal: In Progress
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
Fix has been committed to B, D, E. I've manually updated this bug for now (it was not formally DUP'd to LP Bug 1852077).

** Changed in: linux (Ubuntu Focal) Importance: Undecided => High
** Changed in: linux (Ubuntu Eoan) Importance: Undecided => High
** Changed in: linux (Ubuntu Disco) Importance: Undecided => High
** Changed in: linux (Ubuntu Bionic) Importance: Undecided => High
** Changed in: linux (Ubuntu Bionic) Status: New => Fix Committed
** Changed in: linux (Ubuntu Disco) Status: New => Fix Committed
** Changed in: linux (Ubuntu Eoan) Status: New => Fix Committed

https://bugs.launchpad.net/bugs/1834322

Title: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Committed
Status in linux source package in Disco: Fix Committed
Status in linux source package in Eoan: Fix Committed
Status in linux source package in Focal: In Progress
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Hello, Edwin,

We have two separate users/customers filing reports, and I can answer for one of them. I'll ask the original poster separately as well to reply.

With respect to one of these situations, this is the system:
Dell PowerEdge R440/0XP8V5, BIOS 2.2.11 06/14/2019

Note that a similar system does not have any issues:
Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.3.4 11/08/2016

So the NIC in the "bad" environment is:
BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet

The NIC in the "good" environment is:
Broadcom Inc. and subsidiaries NetXtreme II BCM57810 10 Gigabit Ethernet [14e4:1006]
Product Name: QLogic 57810 10 Gigabit Ethernet

I'll have to scrub some files and see what I can attach; apologies, I'll have it here by tomorrow.

Unfortunately, we don't have an easy reproducer. A single iperf and netperf test (both UDP and TCP) showed identical results from both "good" and "bad" environments. What we have is an identical kernel, network configuration and stack, with the "bad" system showing double or triple the latency to the systems from a remote server. I'll have more information for you shortly here regarding the exact k8 cmd.

https://bugs.launchpad.net/bugs/1853638

Title: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

Status in linux package in Ubuntu: Confirmed
Status in network-manager package in Ubuntu: Confirmed

Bug description:

The issue appears to be that the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device is dropping data.

Basically, we are dropping data, as you can see from the benchmark tool as follows:

tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
[INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
[00:00:00.07] Creating the usrp device with: ...
[INFO] [X300] X300 initialization sequence...
[INFO] [X300] Maximum frame size: 1472 bytes.
[INFO] [X300] Radio 1x clock: 200 MHz
[INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
[INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
[INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
[INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
[INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
[INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
[INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
[INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
[INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
[INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
Using Device: Single USRP:
 Device: X-Series Device
 Mboard 0: X310
 RX Channel: 0
  RX DSP: 0
  RX Dboard: A
  RX Subdev: SBX-120 RX
 RX Channel: 1
  RX DSP: 0
  RX Dboard: B
  RX Subdev: SBX-120 RX
 TX Channel: 0
  TX DSP: 0
  TX Dboard: A
  TX Subdev: SBX-120 TX
 TX Channel: 1
  TX DSP: 0
  TX Dboard: B
  TX Subdev: SBX-120 TX
[00:00:04.305374] Setting device timestamp to 0...
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
[00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
[00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
[00:00:06.693119] Detected Rx sequence error.
D[00:00:09.402843] Detected Rx sequence error.
DD[00:00:40.927978] Detected Rx sequence error.
D[00:01:44.982243] Detected Rx sequence error.
D[00:02:11.400692] Detected Rx sequence error.
D[00:02:14.805292] Detected Rx sequence error.
D[00:02:41.875596] Detected Rx sequence error.
D[00:03:06.927743] Detected Rx sequence error.
D[00:03:47.967891] Detected Rx sequence error.
D[00:03:58.233659] Detected Rx sequence error.
D[00:03:58.876588] Detected Rx sequence error.
D[00:04:03.139770] Detected Rx sequence error.
D[00:04:45.287465] Detected Rx sequence error.
D[00:04:56.425845] Detected Rx sequence error.
D[00:04:57.929209] Detected Rx sequence error.
[00:05:04.529548] Benchmark complete.

Benchmark rate summary:
 Num received samples: 2995435936
 Num dropped samples: 4622800
 Num overruns detected: 0
 Num transmitted samples: 3008276544
 Num sequence errors (Tx): 0
 Num sequence errors (Rx): 15
 Num underruns detected: 0
 Num late commands: 0
 Num timeouts (Tx): 0
 Num timeouts (Rx): 0
Done!
tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

In this particular case description, the nodes are USRP X310s. However, we have the same issue with N210 nodes dropping samples when connected to the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no problem with the USRPs themselves, as we have tested them with normal 1G network cards and have no dropped samples. Personally I think it's something to do with the 10G network card, possibly an Ubuntu driver issue. Note, Dell have said there is no hardware problem with the 10G interfaces. I have followed the troubleshooting information on this link to try to determine the problem: https://fi
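The benchmark summary quoted in this bug reports 4,622,800 dropped samples out of 2,995,435,936 received over the 300-second run at 10 Msps. A quick arithmetic check (illustrative Python, numbers taken from the report) puts that in perspective:

```python
# Sanity-check the drop rate from the reporter's benchmark_rate summary.
received = 2_995_435_936   # Num received samples
dropped = 4_622_800        # Num dropped samples
duration_s = 300           # --duration 300

drop_pct = 100 * dropped / received
print(f"drop rate: {drop_pct:.3f}%")                 # ~0.154%
print(f"~{dropped / duration_s:,.0f} samples lost per second on average")
```

A loss rate on the order of 0.15% is far too small to show up in a coarse iperf throughput test, which is consistent with the later comment that iperf looked identical while mtr and netperf exposed the difference.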
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
** Attachment added: "active interface ethtool-S"
https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1853638/+attachment/5324070/+files/ethtool-S-enp94s0f0
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
** Attachment added: "backup interface ethtool-S"
https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1853638/+attachment/5324071/+files/ethtool-S-enp94s0f1d1
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Note that iperf was identical whereas netperf and mtr showed up differences (so it's possibly sporadic as well, not continuous).

1. iperf tcp test
   GoodSystem ... 9.84 Gbits/sec
   BadSystem1 ... 8.37 Gbits/sec
   BadSystem2 ... 9.85 Gbits/sec

2. iperf udp test
   GoodSystem ... 1.05 Mbits/sec
   BadSystem2 ... 1.05 Mbits/sec

3. mtr ping test
   GoodSystem ... 0.0% Loss; 0.2 Avg; 0.1 Best, 0.9 Worst, 0.1 StdDev
   BadSystem2 ... 11.7% Loss; 0.1 Avg; 0.1 Best, 0.2 Worst, 0.0 StdDev

4. netperf tcp_rr 1/1 bytes
   GoodSystem ... 17921.83 t/sec
   BadSystem1 ... 13912.45 t/sec
   BadSystem2

5. netperf tcp_rr 64/64 bytes
   GoodSystem ... 16987.48 t/sec
   BadSystem1 ... 13355.93 t/sec
   BadSystem2

6. netperf tcp_rr 128/8192 bytes
   GoodSystem ... 2396.45 t/sec
   BadSystem1 ... 1678.54 t/sec
   BadSystem2
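The netperf tcp_rr figures in this comment can be read as mean round-trip latency, since each tcp_rr transaction is one request/response exchange. An illustrative conversion (the function is mine, the rates are the 1-byte/1-byte results quoted above):

```python
# Convert netperf tcp_rr transaction rates (transactions/sec) into the
# implied mean microseconds per request/response round trip.
def usec_per_transaction(rate_tps):
    return 1_000_000 / rate_tps

good = usec_per_transaction(17921.83)  # GoodSystem, tcp_rr 1/1 bytes
bad = usec_per_transaction(13912.45)   # BadSystem1, tcp_rr 1/1 bytes
print(f"good: {good:.1f} us, bad: {bad:.1f} us")   # ~55.8 us vs ~71.9 us
print(f"added latency: {bad - good:.1f} us per round trip")
```

This shows the "bad" system adding roughly 16 microseconds per round trip even for minimal payloads, matching the reporter's observation of clearly higher latency despite identical bulk-throughput numbers.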
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
> The mtr packet loss is an interesting result. What mtr options did
> you use? Is this a UDP or ICMP test?

The mtr command was:

  mtr --no-dns --report --report-cycles 60 $IP_ADDR

so ICMP was going out.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu: Confirmed
Status in network-manager package in Ubuntu: Confirmed

Bug description:
  The issue appears to be that the BCM57416 NetXtreme-E Dual-Media 10G
  RDMA Ethernet device is dropping data, as you can see from the
  benchmark tool output:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
    Device: X-Series Device
    Mboard 0: X310
    RX Channel: 0  RX DSP: 0  RX Dboard: A  RX Subdev: SBX-120 RX
    RX Channel: 1  RX DSP: 0  RX Dboard: B  RX Subdev: SBX-120 RX
    TX Channel: 0  TX DSP: 0  TX Dboard: A  TX Subdev: SBX-120 TX
    TX Channel: 1  TX DSP: 0  TX Dboard: B  TX Subdev: SBX-120 TX
  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.

  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!
  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

  In this particular case the nodes are USRP X310s. However, we have the
  same issue with N210 nodes dropping samples when connected to the
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no
  problem with the USRPs themselves, as we have tested them with normal
  1G network cards and had no dropped samples. Personally I think it's
  something to do with the 10G network card, possibly an Ubuntu driver
  issue. Note, Dell have said there is no hardware problem with the 10G
  interfaces. I have followed the troubleshooting information on this
  link to try de
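As an aside, the per-hop loss column of that mtr report can be pulled out with a small filter. This is a sketch only, assuming mtr's default --report layout (hop lines contain "|--" with loss in the third column); it is demonstrated on synthetic sample output rather than a live run:

```shell
# Sketch: pull per-hop loss out of an mtr report. Assumes mtr's default
# --report layout, where hop lines look like:
#   "  1.|-- <host>   <Loss%>   <Snt> ..."
# On the live system the pipeline would be (as used in this thread):
#   mtr --no-dns --report --report-cycles 60 "$IP_ADDR" | loss_summary
loss_summary() {
    awk 'index($1, "|--") { print $2, $3 }'
}

# Synthetic sample report, for illustration only:
sample='HOST: x310a             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.0.1             0.0%    60    0.3   0.4   0.2   1.1   0.1
  2.|-- 192.168.1.20         6.7%    60    0.5   0.6   0.3   2.4   0.3'

printf '%s\n' "$sample" | loss_summary
```

Loss above 0.0% on the final hop of an otherwise idle link would corroborate the NIC drop counters.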
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Thanks very much for helping on this, Edwin! Please let me know if there's anything specific you need. I'm asking them to disable any IPv6 and LLDP traffic in their environment, then retest and collect information again.

Also, I'd like to disable TPA; would this be at all useful:

  modprobe bnx disable_tpa=1

??
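On the TPA question: aggregation on this driver is typically also visible through the ethtool offload features, so it may be worth checking (and toggling) those before reloading the module. A sketch only; the feature names and interface name are assumptions to verify against the real host, and the demo below runs on canned ethtool -k output rather than real hardware:

```shell
# Sketch: list aggregation-related offload features from "ethtool -k"
# output. The grep pattern and feature names are assumptions to check
# against the host. On the live host one might run:
#   ethtool -k enp94s0f1d1 | aggregation_features
#   ethtool -K enp94s0f1d1 lro off   # assumption: LRO maps to TPA here
aggregation_features() {
    grep -E 'receive-offload|gro-hw'
}

# Canned "ethtool -k" output, for illustration only:
sample='Features for enp94s0f1d1:
rx-checksumming: on
tx-checksumming: on
generic-receive-offload: on
large-receive-offload: off
rx-gro-hw: on'

printf '%s\n' "$sample" | aggregation_features
```

Comparing these flags between the "good" and "bad" systems would show whether an offload difference is in play before any module reload.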
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
> There are more than one variable at play here.
> Does the problem follow the NIC if you swap the
> NICs between systems? Are OS / kernel and driver
> versions the same on both systems?

Unfortunately, I've not been able to get them to try permutations or swaps yet, as this is still a production system/environment. I'll try to obtain more information about it.
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
> NICs between systems? Are OS / kernel and driver
> versions the same on both systems?

Yes: identical distro release, kernel, and most of the software stack (I have not obtained and examined the full software stack). The networking configuration is also the same.
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Hey Edwin, sorry, I didn't see your last question. I'll try to confirm. I've seen loss in both directions, but it's not yet clear whether it's significant; e.g., TCP traffic is retransmitted, so it could be segments lost outgoing or ACKs lost incoming:

  4407 retransmitted TCP segments
  130 TCP timeouts

in stats collected about 5 minutes apart, which isn't a sufficient sample size. We're trying to get a new collection of stats and logs using the netperf TCP_RR test.

In our case, note, we're more concerned about (and have more solid data on) latency issues than dropped packets (which I expect some of with heavy network testing). For example, netperf TCP_RR latency is about 70-78% of the older systems for 1,1 request/response byte sizes, as well as for 64/64, 100/200, and 128/8192 sizes.

I'll update here as soon as we have more data from the production environment.
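For repeatability, the TCP_RR comparisons can be scripted. This sketch only generates the netperf invocations for the request/response sizes mentioned, rather than running them, so it is safe to run anywhere; the host name is a placeholder and "-l 30" is an arbitrary run length:

```shell
# Sketch: emit the netperf TCP_RR commands for the request/response byte
# sizes discussed above. NETSERVER_HOST is a placeholder; netserver must
# be running on the peer before any of these are actually executed.
gen_rr() {
    printf 'netperf -H %s -t TCP_RR -l 30 -- -r %s,%s\n' \
        "${NETSERVER_HOST:-192.168.0.2}" "$1" "$2"
}

# Request/response size pairs from the comparison above:
for pair in 1,1 64,64 100,200 128,8192; do
    gen_rr "${pair%,*}" "${pair#*,}"
done
```

Piping the generated lines through "sh" on both the old and new systems would give like-for-like transaction-rate numbers.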
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Hello Edwin,

Here is more information on the issue we are seeing with regard to dropped packets and other connectivity issues with this NIC. The problem is *only* seen when the second port on the NIC is chosen as the active interface of an active-backup configuration. So on the "bad" system with the interfaces:

  enp94s0f0   -> when chosen as active, all OK
  enp94s0f1d1 -> when chosen as active, not OK

I'll see if the reporters can confirm that on the "good" systems there was no problem when the second interface is active.
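The active-port dependence can be checked and flipped from userspace via the kernel's bonding interfaces (/proc/net/bonding and sysfs). A sketch, with "bond0" as a placeholder bond name; the parsing is demonstrated on canned /proc output so the block runs without the hardware:

```shell
# Sketch: report which slave is currently active in an active-backup
# bond. "bond0" is a placeholder. On the live host:
#   active_slave < /proc/net/bonding/bond0
#   echo enp94s0f0 > /sys/class/net/bond0/bonding/active_slave  # force port 0
active_slave() {
    awk -F': ' '/Currently Active Slave/ { print $2 }'
}

# Canned /proc/net/bonding output, for illustration only:
sample='Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: enp94s0f1d1
MII Status: up'

printf '%s\n' "$sample" | active_slave
```

Forcing each port active in turn while re-running the benchmark would confirm the per-port correlation described above.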
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
I have reports of the same device appearing to drop packets and incur a greater number of retransmissions under certain circumstances, which we're still trying to nail down. I'm using this bug for now until it's proven to be a different problem. This is causing issues in a production environment.

** Changed in: network-manager (Ubuntu)
   Status: New => Confirmed

** Changed in: network-manager (Ubuntu)
   Importance: Undecided => Critical

** Tags added: sts

** Also affects: linux (Ubuntu)
   Importance: Undecided
   Status: New
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
We suspect this is a device (hw/fw) issue, not NetworkManager or the kernel (bnxt_en driver). I've added the kernel task for the driver impact (just in case, for now); this is really to eliminate all other causes and confirm whether the device is the root cause.

NIC Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet
5e:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

NIC Driver/FW ---
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: :5e:00.1
supports-statistics: yes

Kernel - 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 (appears to be an issue on all kernel versions)

Environment Configuration - active-backup bonding mode (having the active backup up *might* potentially be the problem, but it might just be the device itself). The exact same distro, kernel, applications and configuration works fine with a different NIC (Broadcom 10g bnx2x).

There were quite a few total tpa_abort stat counts (1118473) during a 2 minute iperf test. Hoping to get more information from other users seeing the same issue.
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
(active interface)
> cat ethtool-S-enp94s0f1d1 | grep abort
[0]: tpa_aborts: 19775497
[1]: tpa_aborts: 26758635
[2]: tpa_aborts: 12008147
[3]: tpa_aborts: 15829167
[4]: tpa_aborts: 25099500
[5]: tpa_aborts: 3292554
[6]: tpa_aborts: 2863692
[7]: tpa_aborts: 20224692

(backup interface)
> cat ethtool-S-enp94s0f0 | grep abort
[0]: tpa_aborts: 3158584
[1]: tpa_aborts: 1670319
[2]: tpa_aborts: 1749371
[3]: tpa_aborts: 1454301
[4]: tpa_aborts: 123020
[5]: tpa_aborts: 1403509
[6]: tpa_aborts: 1298383
[7]: tpa_aborts: 1858753

Netted out from the previous capture, there were:
*f0 = 2014 tpa_aborts
*d1 = 1118473 tpa_aborts

** Changed in: linux (Ubuntu) Status: Incomplete => Confirmed
** Changed in: linux (Ubuntu) Importance: Undecided => Critical
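The per-queue tpa_aborts counters above can be totalled with a small helper so a before/after delta can be computed around a test run. A sketch, assuming `ethtool -S <iface>` output in the `[N]: tpa_aborts: <count>` form shown above; the interface names in the usage comment are the ones from this report.

```shell
#!/bin/sh
# Sum all per-queue tpa_aborts counters in a saved `ethtool -S` dump.
# Typical use: snapshot before and after an iperf run, then diff the totals:
#   ethtool -S enp94s0f1d1 > before.txt
#   (run the 2 minute iperf test)
#   ethtool -S enp94s0f1d1 > after.txt
sum_tpa_aborts() {
    # $1: file containing `ethtool -S` output
    awk '/tpa_aborts:/ { total += $NF } END { print total + 0 }' "$1"
}
```

The delta between the two totals is the number of TPA (LRO/GRO aggregation) aborts incurred during the run, matching the "netted out" numbers quoted later in this bug.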
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
As the test kernel with the backported Xenial fix has been up for almost 2 months now, I'm submitting the SRU for Xenial, although I have not received feedback from the original reporter or others. The backported patch for Xenial varies slightly from the cherry-picked patch for B, C. My testing has been successful (see original testing information in the description).

https://bugs.launchpad.net/bugs/1794232
Title: Geneve tunnels don't work when ipv6 is disabled
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Xenial: In Progress
Status in linux source package in Bionic: Fix Released
Status in linux source package in Cosmic: Fix Released
Status in linux source package in Disco: Fix Released

Bug description:
SRU Justification

Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically.

Fix: Fixed by upstream commit in v5.0:
Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
"geneve: correctly handle ipv6.disable module parameter"
Hence available in Disco and later; required in X, B, C.

Testcase:
1. Boot with "ipv6.disable=1"
2. Then try to create a geneve tunnel using:
# ovs-vsctl add-br br1
# ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z  // ip of the other host

Regression Potential: Low; this affects only geneve tunnels when ipv6 is dynamically disabled, and the current status is that they don't work at all.

Other Info:
* The mainline commit message includes a reference to a fix for non-metadata tunnels (that infrastructure is not yet in our tree prior to Disco), hence it is not being included at this time under this case. At this time, all geneve tunnels created as above are metadata-enabled.
---
[Impact]
When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with Open vSwitch, where ipv6 has been disabled, the create fails with the error:
"ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)."

[Fix]
There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels):
"geneve: correctly handle ipv6.disable module parameter"
Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally, but had not yet pushed upstream.

[Test Case]
(Best done on a KVM guest VM so as not to interfere with your system's networking.)
1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms):
   - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1"
   - # update-grub
   - Reboot
2. Install OVS:
   # apt install openvswitch-switch
3. Create a geneve tunnel:
   # ovs-vsctl add-br br1
   # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z
   (where remote_ip is the IP of the other host)

You will see the following error message:
"ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details."

From /var/log/openvswitch/ovs-vswitchd.log you will see:
"2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol"

You will notice from the "ifconfig" output that the device genev_sys_6081 is not created.

If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can verify that it is working properly by adding an IP to br1 and pinging each host.
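The pass/fail check in the test case above can be scripted: look for the EAFNOSUPPORT warning in ovs-vswitchd.log and for the genev_sys_6081 device. A sketch, assuming the log line format quoted above; the default log path is the one from this report.

```shell
#!/bin/sh
# Return 0 if the geneve tunnel appears healthy, 1 if the known failure
# signature from this bug is present or the tunnel device is missing.
check_geneve() {
    # $1 (optional): path to ovs-vswitchd.log
    log="${1:-/var/log/openvswitch/ovs-vswitchd.log}"
    if grep -q "failed to add geneve1 as port: Address family not supported by protocol" "$log"; then
        echo "FAIL: geneve port rejected (EAFNOSUPPORT)"
        return 1
    fi
    if ip link show genev_sys_6081 >/dev/null 2>&1; then
        echo "PASS: genev_sys_6081 exists"
        return 0
    fi
    echo "FAIL: genev_sys_6081 not created"
    return 1
}
```

Run it after step 3 of the test case; with the fix applied it should report the genev_sys_6081 device as present.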
On kernel 4.4 (4.4.0-128-generic), the error message does not appear when using the 'ovs-vsctl add-port' command and no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is likewise not created and the ping test won't work. With the fixed test kernel, the interfaces and the tunnel are created successfully.

[Regression Potential]
* Low -- this affects the geneve driver only, and only when ipv6 is disabled; since geneve doesn't work at all in that case today, this fix gets the tunnel up and running for the common case.

[Other Info]
* Analysis
Geneve tunnels should work in either IPv4 or IPv6 environments as a design and support principle. Currently, however, the implementation requires ipv6 support for metadata-based tunnels (which geneve's are); that is, it supports only:
b) ipv4 + metadata + ipv6
rather than also:
a) ipv4 + metadata  // whether ipv6 is compiled out or dynamically disabled
What enforces this in the current 4.4.0-x code when opening a geneve tunnel is the following in geneve_open():
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
This issue has been tested and successfully verified: Verification successful!

"...test appliance built with 4.15.0-58 was unusable ... hundreds of "BUG: non-zero pgtables_bytes on freeing mm: -16384" in syslog, RestAPI interface timeouts, failed to produce FFDC data using sosreport. Build with 4.15.0-60.67 displays none of these behaviors ... smoke test completed successfully."

** Tags added: verification-done-bionic
** Changed in: linux (Ubuntu Bionic) Status: Fix Committed => Fix Released

https://bugs.launchpad.net/bugs/1840046
Title: BUG: non-zero pgtables_bytes on freeing mm: -16384
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Released

Bug description:
[impact]
This message is printed repeatedly in the logs:
BUG: non-zero pgtables_bytes on freeing mm: -16384

[test case]
Boot the 4.15.0-58 kernel on s390x.

[regression potential]
This affects task pud accounting; regressions may be around cleaning up task memory.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840046/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to: kernel-packages@lists.launchpad.net
Unsubscribe: https://launchpad.net/~kernel-packages
More help: https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1753662] Re: [i40e] LACP bonding start up race conditions
Hi Joseph,

We're continuing the investigation into this issue, and I was wondering if you and Nabuto could provide the last point you had reached, and/or the next step you were going to take.

From what I can summarize (please confirm/correct):
* Artful (4.13.*) kernels (with any Artful config) are good
* Artful (4.13.*) kernels (with any Xenial config) are also bad
* 4.12-rc4 - relatively good (1.x%) but still not 0% (<5%)
* 4.12-rc3 - also bad (~27%)
* Xenial (4.4.*) kernels (with any Xenial config) are bad
* Xenial (4.4.*) kernels (with any Artful config) are still bad
[data point: 4.12-rc4 with Artful configs is good. 4.12-rc4 with Xenial configs is bad.]

So a kernel change + config change results in masked/fixed behavior, I guess? Is the remaining bisect window basically 4.12-rc4 -> 4.13?

https://bugs.launchpad.net/bugs/1753662
Title: [i40e] LACP bonding start up race conditions
Status in linux package in Ubuntu: Triaged
Status in linux source package in Xenial: Triaged

Bug description:
When provisioning Ubuntu servers with MAAS at once, some bonding pairs will have an unexpected LACP status such as "Expired". It happens randomly at each provisioning with the default Xenial kernel (4.4), but is not reproducible with the HWE kernel (4.13). I'm using Intel X710 cards (Dell-branded). Using the HWE kernel works as a short-term workaround, but it's not ideal since 4.13 is not covered by the Canonical Livepatch service.

How to reproduce:
1. configure LACP bonding with MAAS
2. provision machines
3. check the bonding status in /proc/net/bonding/bond*

Frequency of occurrence: about 5 bond pairs in 22 pairs at each provisioning.
[reproducible combination]
$ uname -a
Linux comp006 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ sudo ethtool -i eno1
driver: i40e
version: 1.4.25-k
firmware-version: 6.00 0x800034e6 18.3.6
expansion-rom-version:
bus-info: :01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[non-reproducible combination]
$ uname -a
Linux comp006 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ sudo ethtool -i eno1
driver: i40e
version: 2.1.14-k
firmware-version: 6.00 0x800034e6 18.3.6
expansion-rom-version:
bus-info: :01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-116-generic 4.4.0-116.140
ProcVersionSignature: Ubuntu 4.4.0-116.140-generic 4.4.98
Uname: Linux 4.4.0-116-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 6 06:37 seq
 crw-rw---- 1 root audio 116, 33 Mar 6 06:37 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.1-0ubuntu2.15
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Tue Mar 6 06:46:32 2018
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 002: ID 8087:8002 Intel Corp.
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 413c:a001 Dell Computer Corp. Hub
 Bus 001 Device 002: ID 8087:800a Intel Corp.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. PowerEdge R730
PciMultimedia:
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-116-generic.efi.signed root=UUID=0528f88e-cf1a-43e2-813a-e7261b88d460 ro console=tty0 console=ttyS0,115200n8
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-116-generic N/A
 linux-backports-modules-4.4.0-116-generic N/A
 linux-firmware 1.157.17
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 08/16/2017
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.5.5
dmi.board.name: 072T6D
dmi.board.vendor: Dell Inc.
dmi.board.version: A08
dmi.chassis.asset.tag: 0018880
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.5.5:bd08/16/2017:svnDellInc.:pnPowerEdgeR730:pvr:rvnDellInc.:rn072T6D:rvrA08:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R730
dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1753662/+subscriptions
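Step 3 of the reproduction above (checking /proc/net/bonding/bond*) can be automated across many machines. A sketch that flags bonds whose status file mentions an unhealthy state such as "Expired", the state named in this report; the exact line format in /proc/net/bonding varies by kernel version, so the pattern here is illustrative. The function takes a directory argument so it also works on copied-out status files.

```shell
#!/bin/sh
# List bonding status files that mention an unhealthy LACP state.
# $1: directory of bonding status files (e.g. /proc/net/bonding)
find_bad_bonds() {
    grep -l "Expired" "$1"/bond* 2>/dev/null
}
```

After provisioning, `find_bad_bonds /proc/net/bonding` printing nothing means all bonds look healthy by this check.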
[Kernel-packages] [Bug 1753662] Re: [i40e] LACP bonding start up race conditions
I would have thought this would be the relevant patch:

bonding: speed/duplex update at NETDEV_UP event
Mahesh Bandewar authored and davem330 committed on Sep 28, 2017
commit 4d2c0cda07448ea6980f00102dc3964eb25e241c

However, it was first available in v4.15-rc1. At least as far as bonding kernel changes go, there does not seem to be another obvious candidate that might have fixed this problem between 4.12 and 4.13 (first skim).

At least in one scenario I looked at, we got a bad speed/duplex setting, which eventually ended up with the bond interface aggregating on a separate port, and/or ending up in LACP DISABLED state, which it never got out of. We only check for correct/latest device speed/duplex settings via the NETDEV_CHANGE path, where we call __ethtool_get_settings(). If we don't receive a change event again to correct the speed/duplex, we never recover.

There are some other patches which help address this at different points, but they are either earlier or later than the window (see above). I'll take a look at code outside the bonding directory which might impact this.

Joseph, could you provide the raw config files you used as well? It was not super clear from the png image whether those were the only diffs. They did not seem very relevant diffs either.
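The bad speed/duplex theory above can be spot-checked from sysfs without ethtool and compared against the "Speed"/"Duplex" lines /proc/net/bonding reports for each slave. A sketch, assuming the standard /sys/class/net/<iface>/{speed,duplex} attributes; the interface name in the usage comment is hypothetical, and the SYSFS_NET override exists only so the helper can be exercised against a copied-out tree.

```shell
#!/bin/sh
# Print speed/duplex for each named interface, as the kernel currently sees it.
# Compare against the per-slave Speed/Duplex lines in /proc/net/bonding/bondX.
show_link_settings() {
    base="${SYSFS_NET:-/sys/class/net}"   # overridable for testing
    for ifc in "$@"; do
        speed=$(cat "$base/$ifc/speed" 2>/dev/null || echo "?")
        duplex=$(cat "$base/$ifc/duplex" 2>/dev/null || echo "?")
        echo "$ifc: speed=${speed}Mb/s duplex=$duplex"
    done
}
# Usage: show_link_settings eno1 eno2
```

A mismatch between this output and the bond's recorded per-slave values would be consistent with the missed NETDEV_CHANGE update described above.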
[Kernel-packages] [Bug 1753662] Re: [i40e] LACP bonding start up race conditions
Jeff,

Please do provide your logs and whatever other information you can share from your error case; any piece of info will help here. I do not yet have a repro environment myself.

I suspect that most of the changes which seem to help or fix the issue are simply changing the timing enough to affect the race window, making it less likely to occur, so they are masking the problem rather than fixing the root cause.
[Kernel-packages] [Bug 1771480] Re: WARNING: CPU: 28 PID: 34085 at /build/linux-90Gc2C/linux-3.13.0/net/core/dev.c:1433 dev_disable_lro+0x87/0x90()
The warning message: "failed to disable LRO!" is coming from the function dev_disable_lro(): /** * dev_disable_lro - disable Large Receive Offload on a device * @dev: device * * Disable Large Receive Offload (LRO) on a net device. Must be * called under RTNL. This is needed if received packets may be * forwarded to another interface. */ dev_disable_lro() ... if (unlikely(dev->features & NETIF_F_LRO)) netdev_WARN(dev, "failed to disable LRO!\n"); ... Likely relevant callers here: bond_enslave() if (!(bond_dev->features & NETIF_F_LRO)) dev_disable_lro(slave_dev); br_add_if() dev_disable_lro(dev); ... Looking like the second, from the trace. I'd say if you can repro then turn on debug and also dynamic debug on the files br_if.c and dev.c. Possibly another issue with the device name? Is bond1.2001 a vlan interface? -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1771480 Title: WARNING: CPU: 28 PID: 34085 at /build/linux- 90Gc2C/linux-3.13.0/net/core/dev.c:1433 dev_disable_lro+0x87/0x90() Status in linux package in Ubuntu: Incomplete Bug description: I have multiple instances of this dev_disable_lro error in kern.log. Also seeing this: systemd-udevd[1452]: timeout: killing 'bridge-network-interface' [2765] <4>May 1 22:56:42 xxx kernel: [ 404.520990] bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond <4>May 1 22:56:44 xxx kernel: [ 406.926429] bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond <4>May 1 22:56:45 xxx kernel: [ 407.569020] [ cut here ] <4>May 1 22:56:45 xxx kernel: [ 407.569029] WARNING: CPU: 28 PID: 34085 at /build/linux-90Gc2C/linux-3.13.0/net/core/dev.c:1433 dev_disable_lro+0x87/0x90() <4>May 1 22:56:45 xxx kernel: [ 407.569032] netdevice: bond0.2004 <4>May 1 22:56:45 xxx kernel: [ 407.569032] failed to disable LRO! 
<4>May 1 22:56:45 xxx kernel: [ 407.569035] Modules linked in: 8021q garp mrp bridge stp llc bonding iptable_filter ip_tables x_tables nf_conntrack_proto_gre nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ipmi_devintf mxm_wmi dcdbas x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me mei ipmi_si shpchp wmi acpi_power_meter mac_hid xfs libcrc32c raid10 usb_storage raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 igb ixgbe i2c_algo_bit multipath ahci dca ptp libahci pps_core linear megaraid_sas mdio dm_multipath scsi_dh
<4>May 1 22:56:45 xxx kernel: [ 407.569112] CPU: 28 PID: 34085 Comm: brctl Not tainted 3.13.0-142-generic #191-Ubuntu
<4>May 1 22:56:45 xxx kernel: [ 407.569115] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.7.1 001/22/2018
<4>May 1 22:56:45 xxx kernel: [ 407.569118] 881fcc753c70 8172e7fc 881fcc753cb8
<4>May 1 22:56:45 xxx kernel: [ 407.569129] 0009 881fcc753ca8 8106afad 883fcc6f8000
<4>May 1 22:56:45 xxx kernel: [ 407.569139] 883fcc696880 883fcc6f8000 881fce82dd40
<4>May 1 22:56:45 xxx kernel: [ 407.569150] Call Trace:
<4>May 1 22:56:45 xxx kernel: [ 407.569160] [] dump_stack+0x64/0x82
<4>May 1 22:56:45 xxx kernel: [ 407.569168] [] warn_slowpath_common+0x7d/0xa0
<4>May 1 22:56:45 xxx kernel: [ 407.569175] [] warn_slowpath_fmt+0x4c/0x50
<4>May 1 22:56:45 xxx kernel: [ 407.569183] [] dev_disable_lro+0x87/0x90
<4>May 1 22:56:45 xxx kernel: [ 407.569195] [] br_add_if+0x1f3/0x430 [bridge]
<4>May 1 22:56:45 xxx kernel: [ 407.569205] [] add_del_if+0x5d/0x90 [bridge]
<4>May 1 22:56:45 xxx kernel: [ 407.569215] [] br_dev_ioctl+0x5b/0x90 [bridge]
<4>May 1 22:56:45 xxx kernel: [ 407.569223] [] dev_ifsioc+0x313/0x360
<4>May 1 22:56:45 xxx kernel: [ 407.569230] [] ? dev_get_by_name_rcu+0x69/0x90
<4>May 1 22:56:45 xxx kernel: [ 407.569237] [] dev_ioctl+0xe9/0x590
<4>May 1 22:56:45 xxx kernel: [ 407.569245] [] sock_do_ioctl+0x45/0x50
<4>May 1 22:56:45 xxx kernel: [ 407.569252] [] sock_ioctl+0x1f0/0x2c0
<4>May 1 22:56:45 xxx kernel: [ 407.569260] [] do_vfs_ioctl+0x2e0/0x4c0
<4>May 1 22:56:45 xxx kernel: [ 407.569267] [] ? fput+0xe/0x10
<4>May 1 22:56:45 xxx kernel: [ 407.569273] [] SyS_ioctl+0x81/0xa0
<4>May 1 22:56:45 xxx kernel: [ 407.569283] [] system_call_fastpath+0x2f/0x34
<4>May 1 22:56:45 xxx kernel: [ 407.569287] ---[ end trace df5aa31d75a7e2b1 ]---
<4>May 1 22:56:54 xxx kernel: [ 416.320138] bonding: bond1: Warning: the permanent HWaddr of enp131s0f0 - a0:36:9f:c1:25:d0 - is still in use by bond1. Se
[Kernel-packages] [Bug 1794232] [NEW] Geneve tunnels don't work when ipv6 is disabled
Public bug reported:

[Impact]

When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with Open vSwitch where ipv6 has been disabled, the create fails with the error:

"ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)."

[Test Case]

(Best to do this on a kvm guest VM so as not to interfere with your system's networking)

1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms):
   - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1"
   - # update-grub
   - Reboot

2. Install OVS
   # apt install openvswitch-switch

3. Create a Geneve tunnel
   # ovs-vsctl add-br br1
   # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z
   (where remote_ip is the IP of the other host)

You will see the following error message:
"ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details."

From /var/log/openvswitch/ovs-vswitchd.log you will see:
"2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol"

You will notice from the "ifconfig" output that the device genev_sys_6081 is not created.

If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to br1 and pinging each host.

On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen for the 'ovs-vsctl add-port' command and no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and the ping test won't work.

[Other Info]

* Analysis

Geneve tunnels should work in either IPv4 or IPv6 environments as a design and support principle.
Currently, however, the implementation requires ipv6 support for metadata-based tunnels (which geneve tunnels are). It should support:

a) ipv4 + metadata  // whether ipv6 is compiled out or dynamically disabled

rather than only:

b) ipv4 + metadata + ipv6

What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open():

bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
bool metadata = geneve->collect_md;
...
#if IS_ENABLED(CONFIG_IPV6)
	geneve->sock6 = NULL;
	if (ipv6 || metadata)
		ret = geneve_sock_add(geneve, true);
#endif
	if (!ret && (!ipv6 || metadata))
		ret = geneve_sock_add(geneve, false);

CONFIG_IPV6 is enabled and IPv6 is disabled at boot, so ipv6 is false; but metadata is always true for a geneve open, as it is set unconditionally in ovs, in /lib/dpif_netlink_rtnl.c:

case OVS_VPORT_TYPE_GENEVE:
	nl_msg_put_flag(&request, IFLA_GENEVE_COLLECT_METADATA);

The second argument of geneve_sock_add() is a boolean indicating whether the socket is of the ipv6 address family, so we incorrectly pass true rather than false. The current "|| metadata" check is unnecessary and incorrectly sends the tunnel creation code down the ipv6 path, which subsequently fails when the code expects an ipv6 family socket.

* This issue exists in all versions of the kernel up to the present mainline and net-next trees.

* Testing with a trivial patch to remove that check and make changes similar to those made for vxlan (which had the same issue) has been successful. Patches for various versions to be attached here soon.

* We are in the process of sending a patch for this upstream once it has completed adequate testing.
* Example Versions (bug exists in all versions of Ubuntu and mainline):

$ uname -r
4.4.0-135-generic

$ lsb_release -rd
Description: Ubuntu 16.04.5 LTS
Release: 16.04

$ dpkg -l | grep openvswitch-switch
ii openvswitch-switch 2.5.4-0ubuntu0.16.04.1

** Affects: linux (Ubuntu)
   Importance: Undecided
   Status: New

** Tags: geneve kernel-bug
** Tags added: geneve kernel-bug

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title: Geneve tunnels don't work when ipv6 is disabled
Status in linux package in Ubuntu: New
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
Logs not necessary at this time; will attach patches and other information as needed.

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title: Geneve tunnels don't work when ipv6 is disabled
Status in linux package in Ubuntu: Confirmed

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1794232/+subscriptions
-- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.
[Kernel-packages] [Bug 1781413] [NEW] Cannot set MTU higher than 1500 in Xen instance
Public bug reported:

[Impact]

The latest Xenial update has broken MTU functionality in Xen: specifically, setting MTUs larger than 1500 fails. This prevents Jumbo Frames and other features which require larger-than-1500-byte MTUs from being used. This can lead to a failure to sync/connect to other components in the cluster/cloud which expect higher MTUs, and can result in unavailable services.

This can be worked around by manually using ethtool to enable SCATTER/GATHER functionality:
$ sudo ethtool -K $interface_name sg on

The issue is caused by the following commit to the xen-netfront driver:
"xen-netfront: Fix race between device setup and open"
commit f599c64fdf7d9c108e8717fb04bc41c680120da4
Introduced: v4.16-rc1

Reverting the above fix has confirmed that the problem goes away. The following commits fix this issue in the mainline kernel:
"xen-netfront: Fix mismatched rtnl_unlock"
commit cb257783c2927b73614b20f915a91ff78aa6f3e8
Introduced: v4.18-rc3
"xen-netfront: Update features after registering netdev"
commit 45c8184c1bed1ca8a7f02918552063a00b909bf5
Introduced: v4.18-rc3

[Test Case]
1. Launch a Xen instance using the latest kernel version (e.g. 4.4.0-130, or 4.4.0-1062-aws)
2. Change the MTU to 9000 or another value > 1500.

[Regression Potential]
A regression in the patch could leave the MTU unable to be set.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Status: New

** Affects: linux (Ubuntu Xenial)
   Importance: Undecided
   Status: New

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781413

Title: Cannot set MTU higher than 1500 in Xen instance
Status in linux package in Ubuntu: New
Status in linux source package in Xenial: New

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781413/+subscriptions
[Kernel-packages] [Bug 1814095] [NEW] bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Public bug reported:

The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host seeing moderate-to-heavy network traffic (just once):

* The bnxt_en_bpo driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load.

* From the kernel log:
"NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
See the attached kern.log excerpt file for the full error log.

* Release = Xenial
Kernel = 4.4.0-141-generic #167
eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

* This caused the driver to reset in order to recover:
"bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"
driver: bnxt_en_bpo
version: 1.8.1
source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

* The loss of connectivity and softirq stall caused other failures on the system.

* The bnxt_en_bpo driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards), while the bnxt_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes).

* This upstream bnxt_en driver fix is a likely solution:
"bnxt_en: Fix TX timeout during netpoll"
commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
This fix has not been applied to the bnxt_en_bpo driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable.

* No easy way to reproduce this

** Affects: linux (Ubuntu)
   Importance: High
   Status: New

** Tags: xenial

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Status in linux package in Ubuntu: New

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Attachment added: "kern.log.excerpt-netdev-watchdog-timeout.txt"
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+attachment/5234643/+files/kern.log.excerpt-netdev-watchdog-timeout.txt

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Status in linux package in Ubuntu: New
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Due to earlier NIC flapping observed on systems with the 25Gb Broadcom NIC, originally with the following config, the firmware was upgraded to avoid a known FW bug:

$ cat ethtool_-i_enp59s0f1d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
expansion-rom-version:
bus-info: :3b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

The FW was upgraded on affected systems to:

$ cat ethtool_-i_eno2d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
expansion-rom-version:
bus-info: :19:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

Unfortunately, it's not quite clear which FW version the current bug happened on (I believe the newer, but can't confirm; it happened in the midst of several reboots).

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Status in linux package in Ubuntu: Incomplete
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
Any update on a Bionic fix?

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Confirmed

Bug description:
Today the Ubuntu 16.04 LTS Enablement Stack has moved from kernel 4.13 to kernel 4.15.0-24-generic. On a Dell PowerEdge R330 server with an "Intel Ethernet Converged Network Adapter X710-DA2" network adapter (driver i40e), the network card no longer works and permanently displays these three lines:

[ 98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
[ 98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
[ 98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions
[Kernel-packages] [Bug 1809168] [NEW] FIPS and Ubuntu standard kernels prior to 4.11.0 won't boot; root device not found
Public bug reported:

[IMPACT]

Booting of the Xenial-based FIPS kernel packages failed with disk-not-found errors on amd64. This was also observed on standard Ubuntu kernels prior to 4.11.0.

FIPS
1. linux-image-4.4.0-1002-fips <-- FAIL
2. linux-image-4.4.0-1006-fips <-- FAIL

UBUNTU
1. Bionic kernels all WORK
2. Artful kernels:
   Ubuntu-4.11.0-1.6 <-- WORKS
   Ubuntu-4.10.0-26.30 <-- FAILS
3. Xenial kernels:
   Ubuntu-hwe-4.11.0-12.17_16.04.1 <--- WORKS
   Ubuntu-hwe-4.10.0-43.47_16.04.1 <--- FAILS
   Ubuntu-lts-* <--- ALL FAIL
   Ubuntu-4.4.0-* <--- ALL FAIL

We have narrowed down the window to be:
4.11.0-1.6 (custom build) <--- WORKS
4.10.0-43.47~16.04.1 <-- FAILS

Also works:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11/linux-image-4.11.0-041100-generic_4.11.0-041100.201705041534_amd64.deb

Symptoms
The system cannot find the root disk and drops into an initramfs shell:

mdadm script local-block "CREATE group disk not found"
"Gave up waiting for root device. Common problems:
- Boot args (cat /proc/cmdline)
  - Check rootdelay=...
  - Check root= ...
- Missing modules (cat /proc/modules; ls /dev)
ALERT! UUID=... does not exist. Dropping to a shell!
...
(initramfs)_

There does not appear to be any workaround so far. The disks are encrypted SSDs.

Attaching the commit list between the last known failing Artful kernel and the earliest known working kernel (adjacent tags) and other info.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Status: New

** Tags: artful fips xenial

** Package changed: libgcrypt20 (Ubuntu) => linux (Ubuntu)

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1809168

Title:
  FIPS and Ubuntu standard kernels prior to 4.11.0 won't boot; root device not found

Status in linux package in Ubuntu:
  New
[Kernel-packages] [Bug 1809168] Re: FIPS and Ubuntu standard kernels prior to 4.11.0 won't boot; root device not found
** Attachment added: "Commit list for the artful window where fix went in"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1809168/+attachment/5223616/+files/Ubuntu-4.10.0-26.30---Ubuntu-4.11.0-0.5---commitlist
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1809168/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
** Changed in: linux (Ubuntu)
   Importance: Undecided => High
** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed
Thanks, Joe. I'll update this bug as soon as I get the results from the reporter.
The disk in question is a PERC_H740P_Adp.
v4.10 Final <-- FAILS
v4.11-rc1   <-- WORKS
There are some similar issues out there:

https://feeding.cloud.geek.nz/posts/recovering-from-unbootable-ubuntu-encrypted-lvm-root-partition/
https://bugs.launchpad.net/ubuntu/+source/xubuntu-meta/+bug/1801629
One check to see if the above is the issue:

1. dpkg -l | grep crypt
2. dpkg -l | grep lvm

If lvm2 is not installed, for instance, it should be possible to do the following to fix the problem:

1. # apt install lvm2
2. # update-initramfs -c -k all
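The check-and-fix sequence above can be sketched as a small script. This is illustrative only: it runs the package test against an embedded sample of "dpkg -l" output (an assumption, so the result is deterministic) and only prints the suggested repair commands rather than executing them. On a real system you would pipe the actual "dpkg -l" output and run the printed commands as root:

```shell
#!/bin/sh
# Sketch of the check described above: confirm the cryptsetup and lvm2
# userspace tools are installed before rebuilding the initramfs.
# The sample listing below deliberately omits lvm2.

sample_dpkg_list='ii  cryptsetup  2:1.6.6  amd64  disk encryption support
ii  coreutils   8.25    amd64  GNU core utilities'

missing=""
for pkg in cryptsetup lvm2; do
    # "ii" = package is installed; absent line means the tool is missing.
    if ! printf '%s\n' "$sample_dpkg_list" | grep -q "^ii  $pkg"; then
        missing="$missing $pkg"
    fi
done

if [ -n "$missing" ]; then
    echo "missing:$missing"
    echo "fix: apt install$missing && update-initramfs -c -k all"
fi
```

Against the sample listing this reports lvm2 as missing, matching the failure mode described in the linked reports: without lvm2 (or cryptsetup) in the package set, the generated initramfs cannot assemble an encrypted/LVM root and the boot drops to the initramfs shell.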