[Kernel-packages] [Bug 1711407] Re: unregister_netdevice: waiting for lo to become free
As several different forums are discussing this issue, I'm using this LP bug to continue investigation into the current manifestation of this bug (after the 4.15 kernel). I suspect it's in one of the other places not fixed, as my colleague Dan stated a while ago.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1711407

Title:
  unregister_netdevice: waiting for lo to become free

Status in linux package in Ubuntu: In Progress
Status in linux source package in Trusty: In Progress
Status in linux source package in Xenial: In Progress
Status in linux source package in Zesty: Won't Fix
Status in linux source package in Artful: Won't Fix
Status in linux source package in Bionic: In Progress

Bug description:
  This is a "continuation" of bug 1403152, as that bug has been marked "fix released" and recent reports of failure may (or may not) be a new bug. Any further reports of the problem should please be reported here instead of that bug.

  [Impact]
  When shutting down and starting containers, the container network namespace may experience a dst reference counting leak, which results in this message repeated in the logs:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

  This can cause issues when trying to create a new network namespace and thus block a user from creating new containers.

  [Test Case]
  See comment 16; reproducer provided at https://github.com/fho/docker-samba-loop

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1711407] Re: unregister_netdevice: waiting for lo to become free
We are definitely seeing a problem on kernels after 4.15.0-159-generic, which is the last known good kernel. 5.3* kernels are affected, but I do not have data on the most recent upstream.
[Kernel-packages] [Bug 1403152] Re: unregister_netdevice: waiting for lo to become free. Usage count
** Tags added: sts

https://bugs.launchpad.net/bugs/1403152

Title:
  unregister_netdevice: waiting for lo to become free. Usage count

Status in Linux: Incomplete
Status in linux package in Ubuntu: Fix Released
Status in linux-lts-utopic package in Ubuntu: Won't Fix
Status in linux-lts-xenial package in Ubuntu: Won't Fix
Status in linux source package in Trusty: Fix Released
Status in linux-lts-utopic source package in Trusty: Fix Released
Status in linux-lts-xenial source package in Trusty: Won't Fix
Status in linux source package in Vivid: Fix Released

Bug description:
  SRU Justification:

  [Impact]
  Users of kernels that utilize NFS may see the following messages when shutting down and starting containers:

    unregister_netdevice: waiting for lo to become free. Usage count = 1

  This can cause issues when trying to create a new network namespace and thus block a user from creating new containers.

  [Test Case]
  Set up multiple containers in parallel to mount an NFS share, create some traffic, and shut down. Eventually you will see the kernel message. Dave's script here:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152/comments/24

  [Fix]
  commit de84d89030fa4efa44c02c96c8b4a8176042c4ff upstream

  --

  I am currently running trusty with the latest patches, and I get this on the following hardware and software:

  Ubuntu 3.13.0-43.72-generic 3.13.11.11

  processor : 7
  vendor_id : GenuineIntel
  cpu family : 6
  model : 77
  model name : Intel(R) Atom(TM) CPU C2758 @ 2.40GHz
  stepping : 8
  microcode : 0x11d
  cpu MHz : 2400.000
  cache size : 1024 KB
  physical id : 0
  siblings : 8
  core id : 7
  cpu cores : 8
  apicid : 14
  initial apicid : 14
  fpu : yes
  fpu_exception : yes
  cpuid level : 11
  wp : yes
  flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch arat epb dtherm tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms
  bogomips : 4799.48
  clflush size : 64
  cache_alignment : 64
  address sizes : 36 bits physical, 48 bits virtual
  power management:

  The subjected error is somehow reproducible. LXC still works but is no longer manageable until a reboot ("manageable" meaning every command hangs). I saw there are a lot of bugs, but they seem to relate to older versions and are closed, so I decided to file a new one. I run a lot of machines with trusty and LXC containers, but only this kind of machine produces these errors; all the others don't show this odd behavior.

  Thanks in advance, meno
[Kernel-packages] [Bug 1711407] Re: unregister_netdevice: waiting for lo to become free
Is anyone still seeing a similar issue on current mainline?

** Tags added: sts
[Kernel-packages] [Bug 1900438] Re: Bcache bypasse writeback on caching device with fragmentation
** Description changed:

  SRU Justification:

  [Impact]
  This bug in bcache affects I/O performance on all versions of the kernel [correct versions affected]. It is particularly negative on ceph if used with bcache.

  Write I/O latency would suddenly go from around 10 ms to around 1 second when hitting this issue, and would easily be stuck there for hours or even days, which is especially bad for a ceph-on-bcache architecture. This would make ceph extremely slow and the entire cloud almost unusable.

  The root cause is that the dirty buckets had reached the 70 percent threshold, causing all writes to go directly to the backing HDD device. That might be fine if the cache actually held a lot of dirty data, but this happens when dirty data has not even reached 10 percent, due to high memory fragmentation. What makes it worse is that the writeback rate might still be at the minimum value (8) because the writeback percent has not been reached, so it takes ages for bcache to reclaim enough dirty buckets to get itself out of this situation.

  [Fix]
  * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 "bcache: consider the fragmentation when update the writeback rate"

  The current way to calculate the writeback rate only considers the dirty sectors. This usually works fine when memory fragmentation is not high, but it gives an unreasonably low writeback rate when a few dirty sectors have consumed a lot of dirty buckets. In some cases the dirty buckets reached CUTOFF_WRITEBACK_SYNC (i.e., writeback stopped) while the dirty data (sectors) had not even reached the writeback_percent threshold (i.e., writeback started). In that situation the writeback rate will still be the minimum value (8*512 = 4KB/s), so all writes will be stuck in a non-writeback mode because of the slow writeback.

  We accelerate the rate in 3 stages with different aggressiveness:
  - the first stage starts when the dirty bucket percentage goes above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50),
  - the second is BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57),
  - the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).

  By default the first stage tries to write back the amount of dirty data in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds, the second stage in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, and the third stage in (1 / (dirty_buckets_percent - 64)) milliseconds.

  The initial rate at each stage can be controlled by 3 configurable parameters:
    writeback_rate_fp_term_{low|mid|high}
  They default to 1, 10, and 1000, chosen based on testing and production data, detailed below.

  A. When it comes to the low stage, we are still far from the 70% threshold, so we only want to give it a little push by setting the term to 1. This means the initial rate will be 170 if the fragment is 6; it is calculated by bucket_size/fragment. This rate is very small, but still much more reasonable than the minimum 8. For a production bcache with a non-heavy workload, if the cache device is bigger than 1 TB, it may take hours to consume 1% of the buckets, so it is very possible to reclaim enough dirty buckets in this stage and thus avoid entering the next stage.

  B. If the dirty bucket ratio didn't turn around during the first stage, it comes to the mid stage, and it is then necessary for the mid stage to be more aggressive than the low stage. So the initial rate is chosen to be 10 times that of the low stage, which means 1700 as the initial rate if the fragment is 6. This is a normal rate we usually see for a normal workload when writeback happens because of writeback_percent.
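The staged boost described above can be sketched in a few lines (a rough Python model of the commit description, not the kernel code; the 1024-sector bucket size is an assumption chosen to match the 170/1700 examples in the text):

```python
# Thresholds and default fp terms from the commit description,
# most aggressive stage first.
STAGES = [
    (64, 1000),  # BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH, writeback_rate_fp_term_high
    (57, 10),    # BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID,  writeback_rate_fp_term_mid
    (50, 1),     # BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW,  writeback_rate_fp_term_low
]

MIN_RATE = 8        # sectors/s: the old floor the bug left writeback stuck at
BUCKET_SIZE = 1024  # sectors per bucket (assumed; matches the 170/1700 examples)

def writeback_rate(dirty_buckets_percent, fragment, bucket_size=BUCKET_SIZE):
    """Model of the fragmentation-aware rate: above each threshold the rate
    scales with bucket_size/fragment (average dirty sectors per bucket),
    the stage's fp term, and how far past the threshold we are."""
    for threshold, term in STAGES:  # pick the highest stage exceeded
        if dirty_buckets_percent > threshold:
            return (bucket_size // fragment) * term * (dirty_buckets_percent - threshold)
    return MIN_RATE                 # below 50% dirty buckets: old behavior

# Entering the low stage at 51% with fragment 6 gives the 170 from the text;
# entering the mid stage at 58% gives 1700.
```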
[Kernel-packages] [Bug 1906476] Re: PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
** Tags added: seg

https://bugs.launchpad.net/bugs/1906476

Title:
  PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed

Status in zfs-linux package in Ubuntu: Confirmed

Bug description:
  Since today, while running Ubuntu 21.04 Hirsute, I started getting a ZFS panic in the kernel log which was also hanging disk I/O for all Chrome/Electron apps.

  I have narrowed down a few important notes:
  - It does not happen with module version 0.8.4-1ubuntu11 built and included with 5.8.0-29-generic.
  - It was happening when using zfs-dkms 0.8.4-1ubuntu16 built with DKMS on the same kernel, and also on 5.8.18-acso (a custom kernel).
  - For whatever reason, multiple Chrome/Electron apps were affected, specifically Discord, Chrome and Mattermost. In all cases (I was unable to strace the processes, so it was a bit hard to confirm 100%, but by deduction from /proc/PID/fd and the hanging ls) they seem hung trying to open files in their 'Cache' directory, e.g. ~/.cache/google-chrome/Default/Cache and ~/.config/Mattermost/Cache. While the issue was going on I could not list those directories either; "ls" would just hang.
  - Once I removed zfs-dkms to revert to the kernel built-in version, it immediately worked without changing anything, removing files, etc.
  - It happened over multiple reboots and kernels every time; all my Chrome apps weren't working, but for whatever reason nothing else seemed affected.
  - It would log a series of spl_panic dumps into kern.log that look like this:

    Dec 2 12:36:42 optane kernel: [ 72.857033] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
    Dec 2 12:36:42 optane kernel: [ 72.857036] PANIC at zfs_znode.c:335:zfs_znode_sa_init()

  I could only find one other Google reference to this issue, with 2 other users reporting the same error, but on 20.04, here:
  https://github.com/openzfs/zfs/issues/10971

  I was not experiencing the issue on 0.8.4-1ubuntu14, and am fairly sure it was working on 0.8.4-1ubuntu15 but broken after the upgrade to 0.8.4-1ubuntu16. I will reinstall those zfs-dkms versions to verify that.

  There were a few originating call stacks, but the first one I hit was:

    Call Trace:
     dump_stack+0x74/0x95
     spl_dumpstack+0x29/0x2b [spl]
     spl_panic+0xd4/0xfc [spl]
     ? sa_cache_constructor+0x27/0x50 [zfs]
     ? _cond_resched+0x19/0x40
     ? mutex_lock+0x12/0x40
     ? dmu_buf_set_user_ie+0x54/0x80 [zfs]
     zfs_znode_sa_init+0xe0/0xf0 [zfs]
     zfs_znode_alloc+0x101/0x700 [zfs]
     ? arc_buf_fill+0x270/0xd30 [zfs]
     ? __cv_init+0x42/0x60 [spl]
     ? dnode_cons+0x28f/0x2a0 [zfs]
     ? _cond_resched+0x19/0x40
     ? _cond_resched+0x19/0x40
     ? mutex_lock+0x12/0x40
     ? aggsum_add+0x153/0x170 [zfs]
     ? spl_kmem_alloc_impl+0xd8/0x110 [spl]
     ? arc_space_consume+0x54/0xe0 [zfs]
     ? dbuf_read+0x4a0/0xb50 [zfs]
     ? _cond_resched+0x19/0x40
     ? mutex_lock+0x12/0x40
     ? dnode_rele_and_unlock+0x5a/0xc0 [zfs]
     ? _cond_resched+0x19/0x40
     ? mutex_lock+0x12/0x40
     ? dmu_object_info_from_dnode+0x84/0xb0 [zfs]
     zfs_zget+0x1c3/0x270 [zfs]
     ? dmu_buf_rele+0x3a/0x40 [zfs]
     zfs_dirent_lock+0x349/0x680 [zfs]
     zfs_dirlook+0x90/0x2a0 [zfs]
     ? zfs_zaccess+0x10c/0x480 [zfs]
     zfs_lookup+0x202/0x3b0 [zfs]
     zpl_lookup+0xca/0x1e0 [zfs]
     path_openat+0x6a2/0xfe0
     do_filp_open+0x9b/0x110
     ? __check_object_size+0xdb/0x1b0
     ? __alloc_fd+0x46/0x170
     do_sys_openat2+0x217/0x2d0
     ? do_sys_openat2+0x217/0x2d0
     do_sys_open+0x59/0x80
     __x64_sys_openat+0x20/0x30
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
** Tags added: sts

https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed

Bug description:
  Seems to be closely related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578

  After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the fstrim command triggered by fstrim.timer causes a severe number of mismatches between the two RAID10 component devices.

  This bug affects several machines in our company with different hardware configurations (all using ECC RAM). Both NVMe and SATA SSDs are affected.

  How to reproduce:
  - Create a RAID10 LVM and filesystem on two SSDs:
      mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
      pvcreate -ff -y /dev/md0
      vgcreate -f -y VolGroup /dev/md0
      lvcreate -n root -L 100G -ay -y VolGroup
      mkfs.ext4 /dev/VolGroup/root
      mount /dev/VolGroup/root /mnt
  - Write some data, sync, and delete it:
      dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
      sync
      rm /mnt/data.raw
  - Check the RAID device:
      echo check >/sys/block/md0/md/sync_action
  - After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
      cat /sys/block/md0/md/mismatch_cnt
  - Trigger the bug:
      fstrim /mnt
  - Re-check the RAID device:
      echo check >/sys/block/md0/md/sync_action
  - After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*1):
      cat /sys/block/md0/md/mismatch_cnt

  After investigating this issue on several machines, it *seems* that the first drive does the trim correctly while the second one goes wild. At least the number and severity of errors found by fsck.ext4 from a USB-stick live session suggests this.

  To perform the single-drive evaluation, the RAID10 was started with one drive at a time:
      mdadm --assemble /dev/md127 /dev/nvme0n1p2
      mdadm --run /dev/md127
      fsck.ext4 -n -f /dev/VolGroup/root
      vgchange -a n /dev/VolGroup
      mdadm --stop /dev/md127

      mdadm --assemble /dev/md127 /dev/nvme1n1p2
      mdadm --run /dev/md127
      fsck.ext4 -n -f /dev/VolGroup/root

  When running these fscks without -n, the directory structure on the first device seems OK, while on the second device only the lost+found folder is left.

  Side note: another machine using HWE kernel 5.4.0-56 (after using -53 before) seems to have a quite similar issue.

  Unfortunately, the risk/regression assessment in the aforementioned bug is not complete: the workaround only mitigates the issues during FS creation. This bug, on the other hand, is triggered by a weekly service (fstrim), causing severe file system corruption.
[Kernel-packages] [Bug 1521173] Re: AER: Corrected error received: id=00e0
Seen this as well -- although I don't believe it's causing any problems that we know of; right now it looks like it's only noise in the logs.

https://bugs.launchpad.net/bugs/1521173

Title:
  AER: Corrected error received: id=00e0

Status in Linux: Unknown
Status in linux package in Ubuntu: Triaged
Status in linux source package in Xenial: Triaged

Bug description:
  WORKAROUND: add pci=noaer to your kernel command line:
  1) Edit /etc/default/grub and add pci=noaer to the line starting with GRUB_CMDLINE_LINUX_DEFAULT. It will look like this:
     GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
  2) Run "sudo update-grub"
  3) Reboot

  My dmesg gets completely spammed with the following messages appearing over and over again. It stops after one S3 cycle; it only happens after reboot.

  [ 5315.986588] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
  [ 5315.987249] pcieport 0000:00:1c.0: can't find device of ID00e0
  [ 5315.995632] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
  [ 5315.995664] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
  [ 5315.995674] pcieport 0000:00:1c.0: device [8086:9d14] error status/mask=00000001/00002000
  [ 5315.995683] pcieport 0000:00:1c.0: [ 0] Receiver Error
  [ 5316.002772] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
  [ 5316.002811] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
  [ 5316.002826] pcieport 0000:00:1c.0: device [8086:9d14] error status/mask=00000001/00002000
  [ 5316.002838] pcieport 0000:00:1c.0: [ 0] Receiver Error
  [ 5316.009926] pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
  [ 5316.009964] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
  [ 5316.009979] pcieport 0000:00:1c.0: device [8086:9d14] error status/mask=00000001/00002000
  [ 5316.009991] pcieport 0000:00:1c.0: [ 0] Receiver Error

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.2.0-19-generic 4.2.0-19.23 [modified: boot/vmlinuz-4.2.0-19-generic]
  ProcVersionSignature: Ubuntu 4.2.0-19.23-generic 4.2.6
  Uname: Linux 4.2.0-19-generic x86_64
  ApportVersion: 2.19.2-0ubuntu8
  Architecture: amd64
  AudioDevicesInUse:
   USER PID ACCESS COMMAND
   /dev/snd/pcmC0D0c: david 1502 F...m pulseaudio
   /dev/snd/controlC0: david 1502 F pulseaudio
  CurrentDesktop: Unity
  Date: Mon Nov 30 13:19:00 2015
  EcryptfsInUse: Yes
  HibernationDevice: RESUME=UUID=fe528b90-b4eb-4a20-82bd-6a03b79cfb14
  InstallationDate: Installed on 2015-11-28 (2 days ago)
  InstallationMedia: Ubuntu 16.04 LTS "Xenial Xerus" - Alpha amd64 (20151127)
  MachineType: Dell Inc. Inspiron 13-7359
  ProcFB: 0 inteldrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-19-generic.efi.signed root=UUID=94d54f88-5d18-4e2b-960a-8717d6e618bb ro noprompt persistent quiet splash vt.handoff=7
  RelatedPackageVersions:
   linux-restricted-modules-4.2.0-19-generic N/A
   linux-backports-modules-4.2.0-19-generic N/A
   linux-firmware 1.153
  SourcePackage: linux
  UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 08/07/2015
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 01.00.00
  dmi.board.name: 0NT3WX
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A00
  dmi.chassis.type: 9
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr01.00.00:bd08/07/2015:svnDellInc.:pnInspiron13-7359:pvr:rvnDellInc.:rn0NT3WX:rvrA00:cvnDellInc.:ct9:cvr:
  dmi.product.name: Inspiron 13-7359
  dmi.sys.vendor: Dell Inc.
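The three workaround steps can be scripted. This is a sketch, not part of the bug report; it assumes the stock GRUB_CMDLINE_LINUX_DEFAULT="..." double-quoted form, and you should back up /etc/default/grub (or rehearse on a copy) before running it:

```shell
# Append pci=noaer inside the existing GRUB_CMDLINE_LINUX_DEFAULT quotes.
# Takes the file path as an argument so it can be rehearsed on a copy first.
add_pci_noaer() {
    sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 pci=noaer"/' "$1"
}

# Usage (as root), mirroring steps 1-3 of the workaround:
#   add_pci_noaer /etc/default/grub && update-grub && reboot
```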
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Some of the 4.15 kernels fixed:
  Bionic linux kernel: 4.15.0-109.110
  Bionic linux-aws kernel: 4.15.0-1077.81
  Xenial linux-hwe kernel: 4.15.0-107.108~16.04.1
  Xenial linux-gcp kernel: 4.15.0-1078.88~16.04.1

https://bugs.launchpad.net/bugs/1879658

Title:
  Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

Status in linux package in Ubuntu: Invalid
Status in linux source package in Bionic: Fix Released

Bug description:
  [IMPACT]
  Setting an MTU larger than the default 1500 results in an error on recent (4.15.0-92+) Bionic/Xenial-hwe kernels when attempting to create ipvlan interfaces:

    # ip link add test0 mtu 9000 link eno1 type ipvlan mode l2
    RTNETLINK answers: Invalid argument

  This breaks Docker and other applications which use a jumbo MTU (9000) with ipvlans.

  The bug is caused by the following recent commit to Bionic & Xenial-hwe, pulled in via the stable patchset below, which enforces a strict min/max MTU when MTUs are set up via rtnetlink for ipvlans:

  Breaking commit:
  ---
  Ubuntu-hwe-4.15.0-92.93~16.04.1
  * Bionic update: upstream stable patchset 2020-02-21 (LP: #1864261)
    * net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link()

  The above patch applies checks of dev->min_mtu and dev->max_mtu to prevent a malicious user from crashing the kernel with a bad value. It was patching the original patchset that centralized min/max MTU checking from various subsystems of the networking kernel. However, in that patchset the max_mtu had not been set to the largest phys (64K) or jumbo (9000 bytes) value and defaults to 1500. The recent commit above, which enforces strict bounds checking for MTU size, exposes the bug of max_mtu not being set correctly for the ipvlan driver (this had previously been fixed in the bonding and teaming drivers).

  Fix:
  ---
  This was fixed in the upstream kernel as of v4.18-rc2 for ipvlans, but was not backported to Bionic along with other patches. The missing commit in the Bionic backport:

    ipvlan: use ETH_MAX_MTU as max mtu
    commit 548feb33c598dfaf9f8e066b842441ac49b84a8a

  [Test Case]
  1. Install any kernel earlier than 4.15.0-92 (Bionic/Xenial-hwe).
  2. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
     (where eno1 is the physical interface you are adding the ipvlan on)
  3. # ip link
     ...
     14: test1@eno1: mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000
     ...
     // check that your test1 ipvlan is created with mtu 9000
  4. Install the 4.15.0-92 kernel or later.
  5. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
     RTNETLINK answers: Invalid argument
  6. With the above fix commit backported to Xenial-hwe/Bionic, the jumbo MTU ipvlan creation works again, identical to before -92.

  [Regression Potential]
  This commit is in upstream mainline as of v4.18-rc2, and hence is already in Cosmic and later, i.e. all post-Bionic releases. Hence there is low regression potential here. It only impacts ipvlan functionality, not other networking systems, so core systems should not be affected. It takes effect at setup, so it either works or it doesn't, and the patch is trivial. It only impacts Bionic/Xenial-hwe versions from 4.15.0-92 onwards (where the latent bug got exposed).
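The failure mode can be illustrated with a small model of the strict bounds check that rtnl_create_link() now applies (a sketch, not kernel code; the constants mirror the ETH_* values in the kernel's uapi headers):

```python
ETH_MIN_MTU = 68        # minimum MTU the core networking accepts
ETH_DATA_LEN = 1500     # ipvlan's effective max_mtu before the fix commit
ETH_MAX_MTU = 0xFFFF    # what commit 548feb33 sets ipvlan's max_mtu to

def rtnl_mtu_check(mtu, min_mtu=ETH_MIN_MTU, max_mtu=ETH_DATA_LEN):
    """Model of the IFLA_MTU validation: out-of-range values are rejected
    with EINVAL, which ip(8) reports as 'RTNETLINK answers: Invalid argument'."""
    if mtu < min_mtu or mtu > max_mtu:
        raise ValueError("Invalid argument")  # EINVAL
    return mtu

# Pre-fix driver (max_mtu left at 1500): mtu 9000 is rejected.
# Post-fix driver (max_mtu = ETH_MAX_MTU): mtu 9000 is accepted.
```

The one-line driver fix simply raises max_mtu to ETH_MAX_MTU so the centralized check stops rejecting jumbo frames.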
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Packages tested:
  linux-gcp (4.15.0-1078.88~16.04.1) xenial
  linux-hwe (4.15.0-107.108~16.04.1) xenial
  linux-gcp-4.15 (4.15.0-1078.88) bionic
  linux (4.15.0-107.108) bionic
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Tested.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

https://bugs.launchpad.net/bugs/1879658
Status in linux package in Ubuntu: Invalid
Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1882039] Re: The thread level parallelism would be a bottleneck when searching for the shared pmd by using hugetlbfs
** Changed in: linux (Ubuntu Bionic) Importance: Medium => High
** Changed in: linux (Ubuntu Bionic) Status: Triaged => In Progress
** Changed in: linux (Ubuntu Eoan) Status: Triaged => In Progress
** Changed in: linux (Ubuntu Bionic) Assignee: (unassigned) => Gavin Guo (mimi0213kimo)
** Changed in: linux (Ubuntu Focal) Status: Triaged => In Progress
** Changed in: linux (Ubuntu Focal) Importance: Medium => High
** Changed in: linux (Ubuntu Eoan) Importance: Medium => High
** Changed in: linux (Ubuntu) Importance: Medium => High
** Changed in: linux (Ubuntu Eoan) Assignee: (unassigned) => Gavin Guo (mimi0213kimo)
** Changed in: linux (Ubuntu Focal) Assignee: (unassigned) => Gavin Guo (mimi0213kimo)

https://bugs.launchpad.net/bugs/1882039

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: In Progress
Status in linux source package in Eoan: In Progress
Status in linux source package in Focal: In Progress

Bug description:

[Impact]
Performance overhead is observed when many threads use hugetlbfs in a database environment.

[Fix]
bdfbd98bc018 hugetlbfs: take read_lock on i_mmap for PMD sharing

The patch improves the locking by taking the read lock instead of the write lock, allowing multiple threads to search for a suitable shared VMA concurrently. Since the search makes no modifications, this increases parallelism and decreases the waiting time of the other threads.

[Test]
The customer stood up a database with seed data, then ran a load "driver" that makes a bunch of connections resembling user workflows from the database's perspective. Response-time improvements for these "users" can then be measured, along with various other metrics at the database level.

[Regression Potential]
The modification only replaces the write lock with a read one; nothing inside the search loop changes. The regression probability is low.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1882039/+subscriptions
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
Note that the fix for all the above series is already released, i.e. from Ubuntu-4.15.0-73.82.

https://bugs.launchpad.net/bugs/1834322

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Released
Status in linux source package in Disco: Fix Released
Status in linux source package in Eoan: Fix Released
Status in linux source package in Focal: Fix Released

Bug description:

We are losing port-channel aggregation on reboot. After the reboot, /var/log/syslog contains the entries:

[ 250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
[ 282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports

Aggregator IDs of the slave interfaces are different:

ubuntu@node-6:~$ cat /proc/net/bonding/bond2
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable

Slave Interface: enp24s0f1np1
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:48:9f:51
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0

Slave Interface: enp24s0f0np0
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b0:26:28:48:9f:50
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1

The mismatch in "Aggregator ID" on the port is a symptom of the issue. If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up', the port with the mismatched ID appears to renegotiate with the port-channel and becomes aggregated. The other way to work around this issue is to take the bond ports down and bring up port enp24s0f0np0 first and port enp24s0f1np1 second. When I change the order of bringing the ports up (first enp24s0f1np1, second enp24s0f0np0), the issue is still there.

When the issue occurs, the port on the switch corresponding to interface enp24s0f0np0 is in Suspended state. After applying the workaround, the port is no longer Suspended and the Aggregator IDs in /proc/net/bonding/bond2 are equal. I installed the 5.0.0 kernel; the issue is still there.

Operating System: Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)

ubuntu@node-6:~$ uname -a
Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@node-6:~$ sudo lspci -vnvn
https://pastebin.ubuntu.com/p/Dy2CKDbySC/

Hardware: Dell PowerEdge R740xd
BIOS version: 2.1.7
sosreport: https://drive.google.com/open?id=1-eN7cZJIeu-AQBEU7Gw8a_AJTuq0AOZO

ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G
https://pastebin.ubuntu.com/p/sqCx79vZWM/
ubuntu@node-6:~$ lspci -n | grep 18:00
18:00.0 0200: 14e4:16d8 (rev 01)
18:00.1 0200: 14e4:16d8 (rev 01)
ubuntu@node-6:~$ modinfo bnx2x
https://pastebin.ubuntu.com/p/pkmzsFjK8M/
ubuntu@node-6:~$ ip -o l
https://pastebin.ubuntu.com/p/QpW7TjnT2v/
ubuntu@node-6:~$ ip -o a
https://pastebin.ubuntu.com/p/MczKtrnmDR/
ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml
https://pastebin.ubuntu.com/p/9cZpPc7C6P/
ubuntu@node-6:~$ sudo lshw -c network
https://pastebin.ubuntu.com/p/gmfgZptzDT/
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw 1 root audio 116, 1 Jun 26 10:21 seq
 crw-rw 1 root audio 116, 33 Jun 26 10:21 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 18.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 1604:10c0 Tascam
 Bus 001 Device 003: ID 1604:10c0 Tascam
 Bus 001 Device 002: ID 1604:10c0 Tascam
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. PowerEdge R740xd
Package: linux (not installed)
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
Could anyone hitting this bug confirm it is a DUP of LP Bug 1852077 and that the latest releases fix this issue? The handling of the state changes/updates got borked here by not just marking it as a DUP and closing this one. I will close this next week otherwise.

** Changed in: linux (Ubuntu Focal) Status: In Progress => Fix Released
** Changed in: linux (Ubuntu Bionic) Status: Fix Committed => Fix Released
** Changed in: linux (Ubuntu Disco) Status: Fix Committed => Fix Released
** Changed in: linux (Ubuntu Eoan) Status: Fix Committed => Fix Released

https://bugs.launchpad.net/bugs/1834322
Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: Fix Released
Status in linux source package in Disco: Fix Released
Status in linux source package in Eoan: Fix Released
Status in linux source package in Focal: Fix Released
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Test kernel has been tested successfully so far by the original reporter and has fixed the Docker breakage and so on.

https://bugs.launchpad.net/bugs/1879658
Status in linux package in Ubuntu: Invalid
Status in linux source package in Bionic: In Progress
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
** Changed in: linux (Ubuntu) Status: Incomplete => Confirmed
** Changed in: linux (Ubuntu) Status: Confirmed => In Progress
** Changed in: linux (Ubuntu) Status: In Progress => Invalid

https://bugs.launchpad.net/bugs/1879658
Status in linux package in Ubuntu: Invalid
Status in linux source package in Bionic: In Progress
[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
SRU request has been submitted. If anyone would like to test, there are test images up at:
https://people.canonical.com/~nivedita/ipvlan-test-fix-278887/

You can 'wget' the files and then 'dpkg -i' the modules, linux-image, and modules-extra debs, in that order, and reboot.

https://bugs.launchpad.net/bugs/1879658
Status in linux package in Ubuntu: Incomplete
Status in linux source package in Bionic: In Progress
[Kernel-packages] [Bug 1879658] [NEW] Cannot create ipvlans with > 1500 MTU on recent Bionic kernels
Public bug reported:

** Affects: linux (Ubuntu)
   Importance: Critical
   Status: Incomplete

** Affects: linux (Ubuntu Bionic)
   Importance: Critical
   Status: Incomplete

** Tags: bionic sts

** Changed in: linux (Ubuntu)
   Importance: Undecided => Critical

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => Critical
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
The issue we have reported is easily avoided by specifying the primary port as the active interface of the bond. On netplan-using systems: add the directive "primary: $interface" (e.g. "primary: p94s0f0") to the "parameters:" section of the netplan config file.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The issue appears to be that the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device is dropping data.

  Basically, we are dropping data, as you can see from the benchmark tool as follows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
    Device: X-Series Device
    Mboard 0: X310
    RX Channel: 0
      RX DSP: 0
      RX Dboard: A
      RX Subdev: SBX-120 RX
    RX Channel: 1
      RX DSP: 0
      RX Dboard: B
      RX Subdev: SBX-120 RX
    TX Channel: 0
      TX DSP: 0
      TX Dboard: A
      TX Subdev: SBX-120 TX
    TX Channel: 1
      TX DSP: 0
      TX Dboard: B
      TX Subdev: SBX-120 TX
  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.

  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!
  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

  In this particular case description, the nodes are USRP X310s. However, we have the same issue with N210 nodes dropping samples when connected to the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no problem with the USRPs themselves, as we have tested them with normal 1G network cards and have no dropped samples. Personally I think it's something to do with the 10G network card, possibly the Ubuntu driver. Note, Dell have said there is no hardware problem with the 10G interfaces. I have followed the troubleshooting information on this link to try to determine the problem: https://files.ettus.com/manual/page_usrp_x3x0_config.html
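The netplan workaround described at the top of this message (pinning the bond's primary) looks roughly like this as a minimal sketch. Interface names are the ones from this report; the filename is an arbitrary example, written to the current directory here, while on a real system the file belongs under /etc/netplan/ and is applied with "sudo netplan apply":

```shell
# Example netplan fragment for an active-backup bond with an explicit
# primary interface (sketch; adjust names and path for a real system).
cat > 51-bond-primary.yaml <<'EOF'
network:
  version: 2
  ethernets:
    enp94s0f0: {}
    enp94s0f1d1: {}
  bonds:
    bond0:
      interfaces: [enp94s0f0, enp94s0f1d1]
      parameters:
        mode: active-backup
        primary: enp94s0f0
        mii-monitor-interval: 100
EOF
```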
Hello, diarmuid. Re: the original issue report, were you able to resolve your issue? Please let us know.
We are closing this LP bug for now, as we aren't able to reproduce it in-house and cannot get access to a live repro environment at this time. Here is what we know:

- There seems to be a performance difference in some tests when the NIC is configured in active-backup bonding mode, between the case where the active interface is the primary port and the case where it is the secondary port, i.e.:

  Primary port:   enp94s0f0   // when this is the active one, works fine
  Secondary port: enp94s0f1d1 // when this is the active one, more drops

- Switch info: 2 x Fortigate 1024D switches; each machine is connected to both.

- NIC info:

  root@u072:~# lspci | grep BCM57416
  01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

  # ethtool -i enp1s0f0np0
  driver: bnxt_en
  version: 1.10.0
  firmware-version: 214.0.253.1/pkg 21.40.25.31

- Our attempted reproducer (the issue was initially reported in the production env via graphical monitoring):

  mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST

  good system = ~0% drops
  bad systems = ~8% drops

We are not seeing NIC stats drops, nor UDP kernel drops, so it's not clear where the packets are being dropped, whether they are being dropped silently somewhere, or whether that's a red herring and an mtr test issue, and what's seen in production is something else. If someone can reproduce this, or something similar, or if we manage to, we will re-open this bug or file a new one.
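Regarding "nor UDP kernel drops": the socket-layer counters live in /proc/net/snmp, which stores each protocol as a header line ("Udp: InDatagrams ...") followed by a value line. A small parser sketch (field names like InErrors/RcvbufErrors are standard, but verify against your kernel's header line); it is pure text processing, so it can be exercised on captured output as well as the live file:

```shell
# Sketch: pull one UDP counter out of /proc/net/snmp-style text.
udp_stat() {  # usage: udp_stat FieldName [snmp-text]
  snmp="${2:-$(cat /proc/net/snmp)}"
  echo "$snmp" | awk -v f="$1" '
    $1 == "Udp:" && hdr == "" { hdr = $0; next }
    $1 == "Udp:" { n = split(hdr, h); split($0, v)
                   for (i = 2; i <= n; i++) if (h[i] == f) print v[i] }'
}

# Real usage: sample before and after a test run, e.g.
#   udp_stat InErrors
#   udp_stat RcvbufErrors
```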
[Kernel-packages] [Bug 1811963] Re: Sporadic problems with X710 (i40e) and bonding where one interface is shown as "state DOWN" and without LOWER_UP
Hi Malte, Was this issue resolved for you? There are several other possibilities that it could be - and if it's still a problem with current mainline, please let us know. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1811963 Title: Sporadic problems with X710 (i40e) and bonding where one interface is shown as "state DOWN" and without LOWER_UP Status in linux package in Ubuntu: Confirmed Bug description: After rebooting the physical server there is a 50/50 chance of all connected interfaces coming up. This affects Dell EMC R740's and R440's equipped with the X710 network cards. As far as I noticed (~20 reboots on different machines), this happens only when using bonding (in this case active-backup or mode 1, did not test different modes yet). The networking-hardware on the other side shows the ports "connected". tcpdump shows frames being received, even if the interface is in "state DOWN". Tried with: Ubuntu 16.04, kernel 4.4.0-141, driver 2.7.26 (from the Intel-website), firmware 18.8.9 Ubuntu 16.04, kernel 4.4.0-141, driver 1.4.25-k, firmware 18.8.9 Ubuntu 16.04, kernel 4.15.0-43 (hwe), driver 2.1.14-k, firmware 18.8.9 The following excerpts are made using Intels driver in version 2.7.26, therefore tainting the kernel, but the same happens using the original kernel's version or the hardware enablement kernel's version. Sporadic failure case: [6.319226] i40e: loading out-of-tree module taints kernel. [6.319227] i40e: loading out-of-tree module taints kernel. [6.319422] i40e: module verification failed: signature and/or required key missing - tainting kernel [6.410837] i40e: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver - version 2.7.26 [6.410838] i40e: Copyright(c) 2013 - 2018 Intel Corporation. 
[6.423542] i40e 0000:3b:00.0: fw 6.81.49447 api 1.7 nvm 6.80 0x80003d72 18.8.9
[6.658526] i40e 0000:3b:00.0: MAC address: ff:ff:ff:ff:ff:ff
[6.710391] i40e 0000:3b:00.0: PCI-Express: Speed 8.0GT/s Width x8
[6.725692] i40e 0000:3b:00.0: Features: PF-id[0] VFs: 64 VSIs: 2 QP: 40 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
[6.750239] i40e 0000:3b:00.1: fw 6.81.49447 api 1.7 nvm 6.80 0x80003d72 18.8.9
[6.987874] i40e 0000:3b:00.1: MAC address: ff:ff:ff:ff:ff:f1
[7.005397] i40e 0000:3b:00.1 eth0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[7.024993] i40e 0000:3b:00.1: PCI-Express: Speed 8.0GT/s Width x8
[7.040298] i40e 0000:3b:00.1: Features: PF-id[1] VFs: 64 VSIs: 2 QP: 40 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
[7.054384] i40e 0000:3b:00.1 enp59s0f1: renamed from eth0
[7.079613] i40e 0000:3b:00.0 enp59s0f0: renamed from eth1
[9.788893] i40e 0000:3b:00.0 enp59s0f0: already using mac address ff:ff:ff:ff:ff:ff
[9.819480] i40e 0000:3b:00.1 enp59s0f1: set new mac address ff:ff:ff:ff:ff:ff
[9.728194] bond0: Setting MII monitoring interval to 100
[9.788690] bond0: Adding slave enp59s0f0
[9.805195] bond0: Enslaving enp59s0f0 as a backup interface with a down link
[9.819470] bond0: Adding slave enp59s0f1
[9.836360] bond0: making interface enp59s0f1 the new active one
[9.836614] bond0: Enslaving enp59s0f1 as an active interface with an up link

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp59s0f1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp59s0f0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: ff:ff:ff:ff:ff:ff
Slave queue ID: 0

Slave Interface: enp59s0f1
MII Status: up
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: ff:ff:ff:ff:ff:f1
Slave queue ID: 0

4: enp59s0f0: mtu 1500 qdisc mq master bond0 portid state DOWN group default qlen 1000
    link/ether ff:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
5: enp59s0f1: mtu 1500 qdisc mq master bond0 portid fff1 state UP group default qlen 1000
    link/ether ff:ff:ff:ff:ff:f1 brd ff:ff:ff:ff:ff:ff
6: bond0: mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ff:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
    inet 123.123.123.123/24 brd 123.123.123.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 :::::/64 scope link
       valid_lft forever preferred_lft forever

bond0 Link encap:Ethernet HWaddr ff:ff:ff:ff:ff:ff
      inet addr:123.123.123.123 Bcast:123.123.123.255 Mask:255.255.255.0
      inet6 addr: :::::/64 Scope:Link
      UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
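A quick way to spot the failure mode above (a slave enslaved with a down link, i.e. "state DOWN" and no LOWER_UP) is to filter the one-line "ip -o link" output for bond members without LOWER_UP. A sketch, written as a pure text filter so it can be checked against captured output:

```shell
# Sketch: print bond slaves whose "ip -o link" one-line entry lacks
# LOWER_UP (no carrier seen by the kernel).
down_slaves() {  # usage: ip -o link | down_slaves bond0
  awk -v b="$1" '$0 ~ ("master " b " ") && $0 !~ /LOWER_UP/ {
    sub(/:$/, "", $2); print $2 }'
}

# Real usage:
#   ip -o link | down_slaves bond0
```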
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Edwin, do you happen to notice any IPv6 or LLDP or other link-local traffic on the interfaces (including the backup interface)?

The MTR loss % is purely a capture of its packets transmitted and responses received, so for that UDP MTR test, this is saying that UDP packets were lost somewhere. The NIC does not show any drops via ethtool -S stats, but I'm still hunting down the right pair of before/after captures. Other than the tpa_abort counts, there were no errors that I saw. I can't tell what tpa_abort means for the frame - is it purely a failure to coalesce, or does it end up dropping packets at some point in that path? I'm assuming not; whatever the reason, those would (I hope) be counted as drops and printed in the interface stats.

I'll attach all the stats here once I get them sorted out. I thought I had a clean diff of before and after from the tester, but after looking through it, I don't think the file I have is from before/after the mtr test, as there was negligible UDP traffic. I'll try to get clarification from the reporter.

Note that when primary= is used to configure which interface is primary, and the primary port is thus used as the active interface for the bond, no problems are seen (and that works deterministically to set the correct active interface).
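For the before/after ethtool -S comparison mentioned above, here is a sketch of a counter diff. It assumes "name: value" lines, which is what ethtool -S prints once leading whitespace is stripped, and join(1) needs both captures sorted on the counter name:

```shell
# Sketch: show which "name: value" counters changed between two captures.
diff_stats() {  # usage: diff_stats before.txt after.txt (both sorted)
  join -t: "$1" "$2" | awk -F: '$2 + 0 != $3 + 0 {
    printf "%s: %d -> %d\n", $1, $2, $3 }'
}

# Real usage (hypothetical interface name):
#   ethtool -S enp94s0f0 | sed 's/^ *//' | sort > before.txt
#   mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
#   ethtool -S enp94s0f0 | sed 's/^ *//' | sort > after.txt
#   diff_stats before.txt after.txt
```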
Additional observations: MAAS is being used to deploy the system and to configure the bond interface and settings. MAAS allows you to specify which interface is the primary, with the other being the backup, for the active-backup bonding mode. However, this does not appear to be working - it's not passing along a primary directive, for instance in the netplan yaml, or otherwise resulting in the setting being honored (still to be confirmed).

MAAS also allows you to enter a mac address for the bond interface; if one is not supplied, by default it uses the mac address of the configured "primary" interface. MAAS then populates /etc/netplan/50-cloud-init.yaml, including a macaddr= line with that default, and netplan passes it along to systemd-networkd.

The bonding driver, however, in the absence of a primary= directive, will use as the active interface whichever interface is attached to the bond first (i.e., whichever completes getting attached to the bond interface first). The bonding driver will, though, use the supplied mac address as an override.

So suppose the active interface was configured in MAAS to be f0, and its mac is used as the mac address of the bond, but f1 (the second port of the NIC) actually gets attached to the bond first and is used as the active interface. We then have a situation where f0 = backup, f1 = active, and bond0 is using the mac of f0. While this should work, there is potential for problems depending on the circumstances. It's likely this has nothing to do with our current issue, but it's noted here for completeness. We will see if we can test/confirm.
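To confirm which interface the bonding driver actually chose as active versus what MAAS intended, the live state in /proc/net/bonding/bond0 can be parsed (sysfs alternatives like /sys/class/net/bond0/bonding/active_slave exist as well). A sketch as a pure text filter, so it can be checked against captured output:

```shell
# Sketch: extract one field (e.g. "Currently Active Slave" or
# "Primary Slave") from /proc/net/bonding/<bond>-style text.
bond_field() {  # usage: bond_field "Field Name" < /proc/net/bonding/bond0
  awk -F': ' -v f="$1" '$1 == f { print $2 }'
}

# Real usage:
#   bond_field "Currently Active Slave" < /proc/net/bonding/bond0
#   bond_field "Primary Slave"          < /proc/net/bonding/bond0
```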
Edwin, let me know if you can get in touch with me via the contact email on my Launchpad page. Thanks for all the help!
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
ethtool-enp94s0f0:
Settings for enp94s0f0:
        Supported ports: [ FIBRE ]
        Supported link modes: 1baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes: Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 1Mb/s
        Duplex: Full
        Port: FIBRE
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: g
        Wake-on: d
        Current message level: 0x (0)
        Link detected: yes

ethtool-i-enp94s0f0:
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: :5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

ethtool-c-enp94s0f0:
Coalesce parameters for enp94s0f0:
Adaptive RX: off  TX: off
stats-block-usecs: 100
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 10
rx-frames: 15
rx-usecs-irq: 1
rx-frames-irq: 1
tx-usecs: 28
tx-frames: 30
tx-usecs-irq: 2
tx-frames-irq: 2
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

ethtool-g-enp94s0f0:
Ring parameters for enp94s0f0:
Pre-set maximums:
RX:             2047
RX Mini:        0
RX Jumbo:       8191
TX:             2047
Current hardware settings:
RX:             511
RX Mini:        0
RX Jumbo:       2044
TX:             511

ethtool-k-enp94s0f0:
Features for enp94s0f0:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off [fixed]
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: on
tls-hw-record: off [fixed]
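Alongside the settings above, the drop/discard counters from `ethtool -S` are the quickest way to see whether the NIC itself is discarding frames. A minimal sketch, assuming the interface name enp94s0f0 from this report; the canned fallback sample (hypothetical numbers) only lets the filtering run on a machine without this NIC:

```shell
#!/bin/sh
# Filter drop/discard/error counters out of `ethtool -S` output.
# IF is the interface under test (name taken from the report above).
IF=enp94s0f0
# Fall back to a canned, hypothetical sample when ethtool or the NIC
# is unavailable, so the pipeline itself can still be demonstrated.
stats="$(ethtool -S "$IF" 2>/dev/null)" || stats='     rx_drops: 17
     rx_discards: 0
     tx_drops: 0
     rx_buf_errors: 3
     rx_ucast_packets: 123456'
printf '%s\n' "$stats" | grep -Ei 'drop|discard|err'
```

Re-running this before and after a benchmark run shows whether the missing samples are being dropped at the NIC or further up the stack.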
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
** Attachment added: "ethtool -S for inactive interface enp94s0f0"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1853638/+attachment/5327556/+files/ethtool-S-enp94s0f0
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
"Bad" System/NIC: NIC: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller System: Dell Kernel: 5.3.0-28-generic #30~18.04.1-Ubuntu (Note, this issue has been seen on prior kernels as well, upgraded to latest to see if various problems were resolved) Attaching stats/config files from nics from this system (seeing issue). -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1853638 Title: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data Status in linux package in Ubuntu: Confirmed Status in network-manager package in Ubuntu: Confirmed Bug description: The issue appears to be with the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data Basically, we are dropping data, as you can see from the benchmark tool as follows: tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300 [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986 [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam [00:00:00.07] Creating the usrp device with: ... [INFO] [X300] X300 initialization sequence... [INFO] [X300] Maximum frame size: 1472 bytes. 
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Good System/Good NIC (all configurations work):
Comparison NIC: NetXtreme II BCM57000 10 Gigabit Ethernet QLogic 57000
System: Dell
Kernel: 5.0.0-25-generic #26~18.04.1-Ubuntu

/proc/net/bonding/bond0:
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp5s0f1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp5s0f1
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e2
Slave queue ID: 0

Slave Interface: enp5s0f0
MII Status: up
Speed: 1 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e0
Slave queue ID: 0

/etc/netplan/50-cloud-init.yaml:
network:
  bonds:
    bond0:
      addresses:
        - 00.00.235.182/25
      gateway4: 00.00.235.129
      interfaces:
        - enp5s0f0
        - enp5s0f1
      macaddress: 00:00:00:00:73:e0
      mtu: 9000
      nameservers:
        addresses:
          - 00.00.235.172
          - 00.00.235.171
        search:
          - maas
      parameters:
        down-delay: 0
        gratuitious-arp: 1
        mii-monitor-interval: 100
        mode: active-backup
        transmit-hash-policy: layer2
        up-delay: 0
  ethernets:
    ...(snip)..
    enp5s0f0:
      match:
        macaddress: 00:00:00:00:73:e0
      mtu: 9000
      set-name: enp5s0f0
    enp5s0f1:
      match:
        macaddress: 00:00:00:00:73:e2
      mtu: 9000
      set-name: enp5s0f1
  version: 2
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
"Bad" Configuration for active-backup mode: $ cat /proc/net/bonding/bond0 Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011) Bonding Mode: fault-tolerance (active-backup) Primary Slave: None Currently Active Slave: enp94s0f1d1 MII Status: up MII Polling Interval (ms): 100 Up Delay (ms): 0 Down Delay (ms): 0 Peer Notification Delay (ms): 0 Slave Interface: enp94s0f1d1 MII Status: up Speed: 1 Mbps Duplex: full Link Failure Count: 2 Permanent HW addr: 4c:d9:8f:48:08:da Slave queue ID: 0 Slave Interface: enp94s0f0 MII Status: up Speed: 1 Mbps Duplex: full Link Failure Count: 2 Permanent HW addr: 4c:d9:8f:48:08:d9 Slave queue ID: 0 --- $ cat uname-rv 5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020 --- Scrubbed /etc/netplan/50-cloud-init.yaml: network: bonds: bond0: addresses: - 0.0.235.177/25 gateway4: 0.0.235.129 interfaces: - enp94s0f0 - enp94s0f1d1 macaddress: 00:00:00:48:08:00 mtu: 9000 nameservers: addresses: - 0.0.235.171 - 0.0.235.172 search: - maas parameters: down-delay: 0 gratuitious-arp: 1 mii-monitor-interval: 100 mode: active-backup transmit-hash-policy: layer2 up-delay: 0 ethernets: eno1: match: macaddress: 00:00:00:76:6e:ca mtu: 1500 set-name: eno1 eno2: match: macaddress: 00:00:00:76:6e:cb mtu: 1500 set-name: eno2 enp94s0f0: match: macaddress: 00:00:00:48:08:00 mtu: 9000 set-name: enp94s0f0 enp94s0f1d1: match: macaddress: 00:00:00:48:08:da mtu: 9000 set-name: enp94s0f1d1 version: 2 -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. 
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
We have narrowed it down to a flaw in a specific configuration setting on this NIC, so we're comparing the good and bad configurations now.

Primary port: enp94s0f0
Secondary port: enp94s0f1d1

A] Good config for fault-tolerance (active-backup) bonding mode:
   Primary port = active interface; Secondary port = backup

B] Bad config for fault-tolerance (active-backup) bonding mode:
   Primary port = backup interface; Secondary port = active

We are consistently able to reproduce a drop-rate difference with UDP packets for the above good/bad cases:

Good Case UDP MTR Test Result:
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T10:14:01+
HOST: hostname       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- nn.nn.nnn.nnn 0.0%    60    0.3   0.2   0.2   0.3   0.0

Bad Case UDP MTR Test Result:
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T14:10:52+
HOST: hostname       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- nn.nn.nnn.nnn 8.3%    60    0.3   0.3   0.2   0.4   0.0
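The two mtr reports can be compared automatically rather than by eye. A sketch that extracts the Loss% column from a report, assuming the report layout above; the sample text is canned (placeholder address) rather than a live run:

```shell
#!/bin/sh
# Extract the Loss% figure from an `mtr --report` hop line and flag
# non-zero loss. On a live system the report would come from:
#   mtr --no-dns --report --report-cycles 60 --udp -s 1428 "$DEST"
# The sample below mirrors the bad case above with a placeholder address.
report='HOST: hostname    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.0.1          8.3%    60    0.3   0.3   0.2   0.4   0.0'
loss="$(printf '%s\n' "$report" | awk '/\|--/ { sub(/%/, "", $3); print $3 }')"
echo "loss: ${loss}%"
if awk -v l="$loss" 'BEGIN { exit !(l > 0) }'; then
    echo "packet loss detected"
fi
```

Running this against both configurations makes the 0.0% vs 8.3% difference a one-line diff.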
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
The second port on the NIC definitely works as the active interface in an active-backup bonding configuration on the other NICs. At the moment, as far as we know, only this particular NIC is seeing this problem.
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Hello Edwin,

Here is more information on the issue we are seeing with respect to dropped packets and other connectivity issues with this NIC. The problem is *only* seen when the second port on the NIC is chosen as the active interface of an active-backup configuration. So, on the "bad" system with the interfaces:

enp94s0f0   -> when chosen as active, all OK
enp94s0f1d1 -> when chosen as active, not OK

I'll see if the reporters can confirm that on the "good" systems there was no problem when the second interface is active.
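If the second port being active really is the trigger, one possible mitigation (a sketch, not a fix; it assumes the bond0 and interface names above and requires root) is to pin the first port as the bonding primary so it is preferred as the active slave whenever its link is up:

```shell
# Pin enp94s0f0 as the preferred active slave of bond0.
echo enp94s0f0 > /sys/class/net/bond0/bonding/primary
# Equivalent via iproute2:
ip link set dev bond0 type bond primary enp94s0f0
# Verify:
grep -E 'Primary Slave|Currently Active Slave' /proc/net/bonding/bond0
```

In netplan this should correspond to adding `primary: enp94s0f0` under the bond's `parameters:` block, though that only works around the problem rather than explaining it.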
[INFO] [X300] Radio 1x clock: 200 MHz [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000) [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s) [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s) [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001) [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001) [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0) [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0) [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0) [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0) Using Device: Single USRP: Device: X-Series Device Mboard 0: X310 RX Channel: 0 RX DSP: 0 RX Dboard: A RX Subdev: SBX-120 RX RX Channel: 1 RX DSP: 0 RX Dboard: B RX Subdev: SBX-120 RX TX Channel: 0 TX DSP: 0 TX Dboard: A TX Subdev: SBX-120 TX TX Channel: 1 TX DSP: 0 TX Dboard: B TX Subdev: SBX-120 TX [00:00:04.305374] Setting device timestamp to 0... [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected. Please see the general application notes in the manual for instructions. EnvironmentError: OSError: error in pthread_setschedparam [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels [00:00:06.693119] Detected Rx sequence error. D[00:00:09.402843] Detected Rx sequence error. DD[00:00:40.927978] Detected Rx sequence error. D[00:01:44.982243] Detected Rx sequence error. D[00:02:11.400692] Detected Rx sequence error. D[00:02:14.805292] Detected Rx sequence error. D[00:02:41.875596] Detected Rx sequence error. D[00:03:06.927743] Detected Rx sequence error. 
D[00:03:47.967891] Detected Rx sequence error. D[00:03:58.233659] Detected Rx sequence error. D[00:03:58.876588] Detected Rx sequence error. D[00:04:03.139770] Detected Rx sequence error. D[00:04:45.287465] Detected Rx sequence error. D[00:04:56.425845] Detected Rx sequence error. D[00:04:57.929209] Detected Rx sequence error. [00:05:04.529548] Benchmark complete.

  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0

  Done!
  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

In this particular case the nodes are USRP X310s. However, we have the same issue with N210 nodes dropping samples when connected to the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no problem with the USRPs themselves, as we have tested them with normal 1G network cards and have no dropped samples. Personally I think it's something to do with the 10G network card, possibly the Ubuntu driver. Note, Dell have said there is no hardware problem with the 10G interfaces. I have followed the troubleshooting information on this link to try to determine the problem:
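Since the fault above tracks whichever port is currently active in the active-backup bond, it helps to record the active slave alongside each test run. A minimal sketch, assuming the bond status file is `/proc/net/bonding/bond0` (the bond name is a guess; the interface names come from this report). It parses a canned sample of the kernel's bonding status text so the snippet is self-contained:

```shell
#!/bin/sh
# On a live system you would read /proc/net/bonding/bond0 directly;
# here a canned sample stands in so the parsing runs anywhere.
sample='Ethernet Channel Bonding Driver: v3.7.1
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: enp94s0f1d1
MII Status: up
Slave Interface: enp94s0f0
Slave Interface: enp94s0f1d1'

# Extract the currently active slave (the port under suspicion).
active=$(printf '%s\n' "$sample" | awk -F': ' '/Currently Active Slave/ {print $2}')
echo "active slave: $active"   # -> active slave: enp94s0f1d1
```

Logging this value before each benchmark run would make it easy to correlate the drops with which port was active.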
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Hey Edwin, sorry, I didn't see your last question. I'll try to confirm, but I've seen loss in both directions and it's not yet clear whether it's significant. Since TCP traffic is retransmitted, it could be segments lost outgoing or ACKs lost incoming:

  4407 retransmitted TCP segments
  130 TCP timeouts

in stats collected about 5 minutes apart, which isn't a sufficient sample size; we're trying to get a new collection of stats and logs using the netperf TCP_RR test. Note that in our case we're more concerned about (and have more solid data on) latency issues than dropped packets (which I expect some of with heavy network testing). For example, netperf TCP_RR latency is about 70-78% of the older systems for 1,1 request/response byte sizes, as well as for 64/64, 100/200, and 128/8192 sizes. I'll update here as soon as we have more data from the production environment.
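For the retransmit numbers above, taking before/after snapshots of the kernel's TCP counters around each netperf run makes the deltas reproducible. A sketch, assuming `netstat -s`-style counter lines (the sample text stands in for live readings, e.g. taken around a `netperf -t TCP_RR -H <peer> -- -r 1,1` run; the peer is a placeholder):

```shell
#!/bin/sh
# Two snapshots of `netstat -s` TCP lines, ~5 minutes apart (sample values).
before='    4000 segments retransmitted
    120 TCP timeouts'
after='    8407 segments retransmitted
    250 TCP timeouts'

# Pull the leading count off the line matching a counter name.
count() { printf '%s\n' "$1" | awk -v k="$2" '$0 ~ k {print $1}'; }

retrans=$(( $(count "$after" "segments retransmitted") - $(count "$before" "segments retransmitted") ))
timeouts=$(( $(count "$after" "TCP timeouts") - $(count "$before" "TCP timeouts") ))
echo "retransmits: $retrans, timeouts: $timeouts"   # -> retransmits: 4407, timeouts: 130
```

Deltas taken this way, per fixed interval and per active interface, would give the "sufficient sample size" the comment is after.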
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
> NICs between systems? Are OS / kernel and driver
> versions the same on both systems?

Yes, identical distro release, kernel, and most of the software stack (I have not obtained and examined the full software stack). The networking configuration is also the same.
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Thanks very much for helping on this, Edwin! Please let me know if there's anything specific you need. I'm asking them to disable any IPv6 and LLDP traffic in their environment, then retest and collect information again. Also, I'd like to disable TPA; would this be at all useful:

  modprobe bnx disable_tpa=1

??
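Before trying `disable_tpa=1` it is worth confirming the driver actually takes that parameter. To my knowledge, `disable_tpa` is a module parameter of the older bnx2x driver, while the BCM57416 is normally driven by bnxt_en, which has no such parameter and exposes TPA through the runtime offload settings instead. A hedged sketch of that decision, over illustrative `modinfo -p` listings rather than live output:

```shell
#!/bin/sh
# Decide how to disable TPA given a driver's `modinfo -p` parameter listing.
# (The listings below are illustrative samples, not live output.)
suggest() {  # $1 = driver name, $2 = modinfo -p output
    if printf '%s\n' "$2" | grep -q '^disable_tpa'; then
        echo "modprobe $1 disable_tpa=1"
    else
        # No module parameter: toggle the aggregation offloads at runtime
        # (interface name taken from this report).
        echo "ethtool -K enp94s0f0 lro off gro off"
    fi
}

# bnx2x exposes disable_tpa; bnxt_en (BCM57416) does not.
suggest bnx2x 'disable_tpa: Disable the TPA (LRO) feature (uint)'
suggest bnxt_en ''
```

On the affected host, `readlink /sys/class/net/enp94s0f0/device/driver` would show which driver is actually bound before choosing either route.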
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
> There are more than one variable at play here.
> Does the problem follow the NIC if you swap the
> NICs between systems? Are OS / kernel and driver
> versions the same on both systems?

Unfortunately, I've not yet been able to get them to try permutations or swaps, as this is still a production system/environment. I'll try to obtain more information about it.
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
> The mtr packet loss is an interesting result. What mtr options did you use? Is this a UDP or ICMP test?

The mtr command was:

  mtr --no-dns --report --report-cycles 60 $IP_ADDR

so ICMP was going out.
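Since routers and NICs often rate-limit or de-prioritise ICMP, repeating the same report over UDP would help separate genuine loss from ICMP handling. A small sketch building the two matching invocations ($IP_ADDR here is a placeholder value, as in the report):

```shell
#!/bin/sh
# Matching ICMP and UDP mtr runs, same cycle count as the original report.
IP_ADDR=10.0.0.2   # placeholder target
icmp_cmd="mtr --no-dns --report --report-cycles 60 $IP_ADDR"
# mtr's --udp switch probes with UDP datagrams instead of ICMP echo.
udp_cmd="mtr --udp --no-dns --report --report-cycles 60 $IP_ADDR"
echo "$icmp_cmd"
echo "$udp_cmd"
```

If loss shows up only in the ICMP run, it is more likely rate-limiting than real drops on the path.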
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
** Attachment added: "active interface ethtool-S"
   https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1853638/+attachment/5324070/+files/ethtool-S-enp94s0f0
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
** Attachment added: "backup interface ethtool-S"
   https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1853638/+attachment/5324071/+files/ethtool-S-enp94s0f1d1

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device appears
  to be dropping data, as you can see from the benchmark tool output:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)

  Using Device: Single USRP:
    Device: X-Series Device
    Mboard 0: X310
    RX Channel: 0, RX DSP: 0, RX Dboard: A, RX Subdev: SBX-120 RX
    RX Channel: 1, RX DSP: 0, RX Dboard: B, RX Subdev: SBX-120 RX
    TX Channel: 0, TX DSP: 0, TX Dboard: A, TX Subdev: SBX-120 TX
    TX Channel: 1, TX DSP: 0, TX Dboard: B, TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.

  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!
  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

  In this particular case, the nodes are USRP X310s. However, we have
  the same issue with N210 nodes dropping samples when connected to the
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device. There is no
  problem with the USRPs themselves, as we have tested them with normal
  1G network cards and see no dropped samples. Personally I think it is
  something to do with the 10G network card, possibly an Ubuntu driver
  issue. Note that Dell have said there is no hardware problem with the
  10G interfaces. I have followed the troubleshooting information on
  this link to try to determine the problem:
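For dropped-sample reports like the one above, the usual UHD host-tuning advice is to rule out undersized kernel socket buffers before blaming the NIC. This is a hedged sketch, not a fix confirmed for this bug; the buffer values are common suggestions, not taken from this report:

```shell
# Raise the maximum socket buffer sizes so UHD can allocate large
# receive/send buffers for the 10G stream (run as root; values are
# illustrative and can be tuned).
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_max=33554432

# Then re-run the same benchmark and compare the dropped-sample count:
./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
```

If the drop counts are unchanged with generous buffers, that strengthens the case that the problem is in the NIC or its driver rather than the host configuration.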
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Note that iperf was identical, whereas netperf and mtr showed up
differences (so it is possibly sporadic as well, not continuous):

1. iperf TCP test:
   GoodSystem: 9.84 Gbits/sec
   BadSystem1: 8.37 Gbits/sec
   BadSystem2: 9.85 Gbits/sec

2. iperf UDP test:
   GoodSystem: 1.05 Mbits/sec
   BadSystem2: 1.05 Mbits/sec

3. mtr ping test:
   GoodSystem: 0.0% loss; 0.2 avg; 0.1 best; 0.9 worst; 0.1 stdev
   BadSystem2: 11.7% loss; 0.1 avg; 0.1 best; 0.2 worst; 0.0 stdev

4. netperf TCP_RR, 1/1 bytes:
   GoodSystem: 17921.83 t/sec
   BadSystem1: 13912.45 t/sec
   BadSystem2:

5. netperf TCP_RR, 64/64 bytes:
   GoodSystem: 16987.48 t/sec
   BadSystem1: 13355.93 t/sec
   BadSystem2:

6. netperf TCP_RR, 128/8192 bytes:
   GoodSystem: 2396.45 t/sec
   BadSystem1: 1678.54 t/sec
   BadSystem2:
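The comparison above can be re-run on another pair of systems with commands along these lines. This is a sketch: 'badhost' is a placeholder for the system under test, and the durations/rates are illustrative rather than the exact ones used here:

```shell
# 1./2. iperf throughput, TCP then UDP at a fixed offered rate
iperf -c badhost -t 30
iperf -c badhost -u -b 1M -t 30

# 3. mtr packet-loss/latency report (typically needs root)
mtr --report --report-cycles 100 badhost

# 4.-6. netperf request/response at the same message sizes as above
netperf -H badhost -t TCP_RR -- -r 1,1
netperf -H badhost -t TCP_RR -- -r 64,64
netperf -H badhost -t TCP_RR -- -r 128,8192
```

Running the identical set against the "good" system gives the baseline the numbers above are compared to.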
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
Hello, Edwin,

We have two separate users/customers filing reports, and I can answer
for one of them; I'll ask the original poster separately to reply as
well.

With respect to one of these situations, this is the system:
  Dell PowerEdge R440/0XP8V5, BIOS 2.2.11 06/14/2019

Note that a similar system does not have any issues:
  Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.3.4 11/08/2016

So the NIC in the "bad" environment is:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
  Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet

The NIC in the "good" environment is:
  Broadcom Inc. and subsidiaries NetXtreme II BCM57810 10 Gigabit Ethernet [14e4:1006]
  Product Name: QLogic 57810 10 Gigabit Ethernet

I'll have to scrub some files and see what I can attach; apologies,
I'll have them here by tomorrow.

Unfortunately, we don't have an easy reproducer. A single iperf and
netperf test (both UDP and TCP) showed identical results from both the
"good" and "bad" environments. What we have is an identical kernel,
network configuration and stack, with the "bad" system showing double
or triple the latency to the systems from a remote server. I'll have
more information for you shortly regarding the exact k8 command.
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
(active interface)
> grep abort ethtool-S-enp94s0f1d1
[0]: tpa_aborts: 19775497
[1]: tpa_aborts: 26758635
[2]: tpa_aborts: 12008147
[3]: tpa_aborts: 15829167
[4]: tpa_aborts: 25099500
[5]: tpa_aborts: 3292554
[6]: tpa_aborts: 2863692
[7]: tpa_aborts: 20224692

(backup interface)
> grep abort ethtool-S-enp94s0f0
[0]: tpa_aborts: 3158584
[1]: tpa_aborts: 1670319
[2]: tpa_aborts: 1749371
[3]: tpa_aborts: 1454301
[4]: tpa_aborts: 123020
[5]: tpa_aborts: 1403509
[6]: tpa_aborts: 1298383
[7]: tpa_aborts: 1858753

Netted out from the previous capture, there were:
  *f0 = 2014 tpa_aborts
  *d1 = 1118473 tpa_aborts

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed
** Changed in: linux (Ubuntu)
   Importance: Undecided => Critical
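The per-ring counters in these captures are easier to compare as totals. A small helper (the function name is ours, not from the bug) that sums the tpa_aborts counters from a saved `ethtool -S` capture on stdin:

```shell
# Sum the per-ring tpa_aborts counters from `ethtool -S <iface>` output
# supplied on stdin (line format as in the attached captures, e.g.
# "[0]: tpa_aborts: 19775497"). Prints 0 when no counters are present.
sum_tpa_aborts() {
  awk '/tpa_aborts/ { s += $NF } END { print s + 0 }'
}

# usage against the attachment named above:
#   sum_tpa_aborts < ethtool-S-enp94s0f1d1
```

Running it before and after a test window (and subtracting) gives the per-interval deltas quoted in this comment.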
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
We suspect this is a device (hw/fw) issue, not NetworkManager or the
kernel (driver bnxt_en). I've added the kernel task for the driver
impact (just in case, for now); this is really to eliminate all other
causes and confirm whether the device is the root cause.

NIC Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet
5e:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

NIC Driver/FW:
  driver: bnxt_en
  version: 1.10.0
  firmware-version: 214.0.253.1/pkg 21.40.25.31
  expansion-rom-version:
  bus-info: :5e:00.1
  supports-statistics: yes

Kernel: 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019
(appears to be an issue on all kernel versions)

Environment configuration: active-backup bonding mode (having the
active-backup bond up *might* potentially be the problem, but it might
just be the device itself). The exact same distro, kernel, applications
and configuration work fine with a different NIC (Broadcom 10G bnx2x).

There were quite a few total tpa_abort stat counts (1118473) during a
2-minute iperf test. Hoping to get more information from other users
seeing the same issue.
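Since tpa_aborts points at the TPA (hardware receive aggregation) path in bnxt_en, one low-risk diagnostic is to turn the aggregation offloads off and see whether the latency and drop numbers change. This is offered as a suggestion, not a confirmed workaround for this bug:

```shell
IFACE=enp94s0f1d1   # the active interface from the ethtool captures

# Show the current aggregation offload state, then disable LRO and GRO
# so received traffic bypasses the hardware aggregation path.
ethtool -k "$IFACE" | grep -E 'large-receive-offload|generic-receive-offload'
ethtool -K "$IFACE" lro off gro off
```

If the tpa_aborts counters stop climbing and the mtr/netperf gap closes with offloads disabled, that narrows the problem to the device's aggregation firmware rather than the host stack.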
[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data
I have reports of the same device appearing to drop packets and incur a
greater number of retransmissions under certain circumstances, which
we're still trying to nail down. I'm using this bug for now, until it's
proven to be a different problem. This is causing issues in a
production environment.

** Changed in: network-manager (Ubuntu)
   Status: New => Confirmed
** Changed in: network-manager (Ubuntu)
   Importance: Undecided => Critical
** Tags added: sts
** Also affects: linux (Ubuntu)
   Importance: Undecided
   Status: New
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
Fix has been committed to B, D, E. I've manually updated this bug for
now (it was not formally DUP'd to LP Bug 1852077).

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => High
** Changed in: linux (Ubuntu Eoan)
   Importance: Undecided => High
** Changed in: linux (Ubuntu Disco)
   Importance: Undecided => High
** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High
** Changed in: linux (Ubuntu Bionic)
   Status: New => Fix Committed
** Changed in: linux (Ubuntu Disco)
   Status: New => Fix Committed
** Changed in: linux (Ubuntu Eoan)
   Status: New => Fix Committed

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title:
  Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in Focal:
  In Progress

Bug description:
  We are losing port channel aggregation on reboot.

  After the reboot, /var/log/syslog contains the entries:

  [ 250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
  [ 282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports

  Aggregator IDs of the slave interfaces are different:

  ubuntu@node-6:~$ cat /proc/net/bonding/bond2
  Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

  Bonding Mode: IEEE 802.3ad Dynamic link aggregation
  Transmit Hash Policy: layer3+4 (1)
  MII Status: up
  MII Polling Interval (ms): 100
  Up Delay (ms): 0
  Down Delay (ms): 0

  802.3ad info
  LACP rate: fast
  Min links: 0
  Aggregator selection policy (ad_select): stable

  Slave Interface: enp24s0f1np1
  MII Status: up
  Speed: 1 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: b0:26:28:48:9f:51
  Slave queue ID: 0
  Aggregator ID: 1
  Actor Churn State: none
  Partner Churn State: none
  Actor Churned Count: 0
  Partner Churned Count: 0

  Slave Interface: enp24s0f0np0
  MII Status: up
  Speed: 1 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: b0:26:28:48:9f:50
  Slave queue ID: 0
  Aggregator ID: 2
  Actor Churn State: churned
  Partner Churn State: churned
  Actor Churned Count: 1
  Partner Churned Count: 1

  The mismatch in "Aggregator ID" on the port is a symptom of the issue.
  If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up',
  the port with the mismatched ID appears to renegotiate with the
  port-channel and becomes aggregated. The other way to work around this
  issue is to put the bond ports down and bring up port enp24s0f0np0
  first and port enp24s0f1np1 second. When I change the order of
  bringing the ports up (first enp24s0f1np1, and second enp24s0f0np0),
  the issue is still there.

  When the issue occurs, the port on the switch corresponding to
  interface enp24s0f0np0 is in Suspended state. After applying the
  workaround the port is no longer in Suspended state and the Aggregator
  IDs in /proc/net/bonding/bond2 are equal.

  I installed the 5.0.0 kernel; the issue is still there.

  Operating System: Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)

  ubuntu@node-6:~$ uname -a
  Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  ubuntu@node-6:~$ sudo lspci -vnvn
  https://pastebin.ubuntu.com/p/Dy2CKDbySC/

  Hardware: Dell PowerEdge R740xd
  BIOS version: 2.1.7
  sosreport: https://drive.google.com/open?id=1-eN7cZJIeu-AQBEU7Gw8a_AJTuq0AOZO

  ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G
  https://pastebin.ubuntu.com/p/sqCx79vZWM/

  ubuntu@node-6:~$ lspci -n | grep 18:00
  18:00.0 0200: 14e4:16d8 (rev 01)
  18:00.1 0200: 14e4:16d8 (rev 01)

  ubuntu@node-6:~$ modinfo bnx2x
  https://pastebin.ubuntu.com/p/pkmzsFjK8M/

  ubuntu@node-6:~$ ip -o l
  https://pastebin.ubuntu.com/p/QpW7TjnT2v/

  ubuntu@node-6:~$ ip -o a
  https://pastebin.ubuntu.com/p/MczKtrnmDR/

  ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml
  https://pastebin.ubuntu.com/p/9cZpPc7C6P/

  ubuntu@node-6:~$ sudo lshw -c network
  https://pastebin.ubuntu.com/p/gmfgZptzDT/

  ---
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Jun 26 10:21 seq
   crw-rw 1 root audio 116, 33 Jun 26 10:21 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.6
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
  AudioDevicesInUse: Error: command ['fuser',
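The Aggregator ID mismatch described above can be checked for mechanically. A small sketch (the helper name is ours) that counts the distinct Aggregator IDs in /proc/net/bonding output; on a healthy 802.3ad bond every slave reports the same ID, so anything other than 1 indicates the mismatch:

```shell
# Count distinct "Aggregator ID" values in /proc/net/bonding/<bond>
# text read from stdin. Healthy bonds print 1; a split aggregate
# (as in this bug) prints 2 or more.
count_aggregator_ids() {
  awk -F': *' '/Aggregator ID/ { ids[$NF] = 1 }
               END { n = 0; for (i in ids) n++; print n }'
}

# usage on a live system:
#   count_aggregator_ids < /proc/net/bonding/bond2
```

This makes the 'ip link set dev bond2 down && ip link set dev bond2 up' workaround scriptable: apply it only when the count is greater than 1.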
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
FWIW, the fix has been committed to -stable:

  "bonding: fix state transition issue in link monitoring"
  Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea
  https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea
Link Failure Count: 0 Permanent HW addr: b0:26:28:48:9f:50 Slave queue ID: 0 Aggregator ID: 2 Actor Churn State: churned Partner Churn State: churned Actor Churned Count: 1 Partner Churned Count: 1 The mismatch in "Aggregator ID" on the port is a symptom of the issue. If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up', the port with the mismatched ID appears to renegotiate with the port- channel and becomes aggregated. The other way to workaround this issue is to put bond ports down and bring up port enp24s0f0np0 first and port enp24s0f1np1 second. When I change the order of bringing the ports up (first enp24s0f1np1, and second enp24s0f0np0), the issue is still there. When the issue occurs, a port on the switch, corresponding to interface enp24s0f0np0 is in Suspended state. After applying the workaround the port is no longer in Suspended state and Aggregator IDs in /proc/net/bonding/bond2 are equal. I installed 5.0.0 kernel, the issue is still there. Operating System: Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64) ubuntu@node-6:~$ uname -a Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux ubuntu@node-6:~$ sudo lspci -vnvn https://pastebin.ubuntu.com/p/Dy2CKDbySC/ Hardware: Dell PowerEdge R740xd BIOS version: 2.1.7 sosreport: https://drive.google.com/open?id=1-eN7cZJIeu- AQBEU7Gw8a_AJTuq0AOZO ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G https://pastebin.ubuntu.com/p/sqCx79vZWM/ ubuntu@node-6:~$ lspci -n | grep 18:00 18:00.0 0200: 14e4:16d8 (rev 01) 18:00.1 0200: 14e4:16d8 (rev 01) ubuntu@node-6:~$ modinfo bnx2x https://pastebin.ubuntu.com/p/pkmzsFjK8M/ ubuntu@node-6:~$ ip -o l https://pastebin.ubuntu.com/p/QpW7TjnT2v/ ubuntu@node-6:~$ ip -o a https://pastebin.ubuntu.com/p/MczKtrnmDR/ ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml https://pastebin.ubuntu.com/p/9cZpPc7C6P/ ubuntu@node-6:~$ sudo lshw -c network https://pastebin.ubuntu.com/p/gmfgZptzDT/ --- ProblemType: Bug 
AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Jun 26 10:21 seq crw-rw 1 root audio 116, 33 Jun 26 10:21 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay' ApportVersion: 2.20.9-0ubuntu7.6 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: DistroRelease: Ubuntu 18.04 IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig' Lsusb: Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub Bus 001 Device 004: ID 1604:10c0 Tascam Bus 001 Device 003: ID 1604:10c0 Tascam Bus 001 Device 002:
[Kernel-packages] [Bug 1852077] Re: Backport: bonding: fix state transition issue in link monitoring
FWIW, the fix has been committed to -stable: "bonding: fix state transition issue in link monitoring" Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea ** Tags added: sts -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1852077 Title: Backport: bonding: fix state transition issue in link monitoring Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: In Progress Status in linux source package in Disco: In Progress Status in linux source package in Eoan: In Progress Status in linux source package in Focal: In Progress Bug description: == Justification == From the well explained commit message: Since de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring"), the bonding driver has utilized two separate variables to indicate the next link state a particular slave should transition to. Each is used to communicate to a different portion of the link state change commit logic; one to the bond_miimon_commit function itself, and another to the state transition logic. Unfortunately, the two variables can become unsynchronized, resulting in incorrect link state transitions within bonding. This can cause slaves to become stuck in an incorrect link state until a subsequent carrier state transition. The issue occurs when a special case in bond_slave_netdev_event sets slave->link directly to BOND_LINK_FAIL. On the next pass through bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL case will set the proposed next state (link_new_state) to BOND_LINK_UP, but the new_link to BOND_LINK_DOWN. The setting of the final link state from new_link comes after that from link_new_state, and so the slave will end up incorrectly in _DOWN state. 
Resolve this by combining the two variables into one.

== Fixes ==

* 1899bb32 (bonding: fix state transition issue in link monitoring)

This patch can be cherry-picked into E/F. For older releases like B/D, it will need to be backported, as they are missing the slave_err() printk macro added in 5237ff79 (bonding: add slave_foo printk macros) as well as the commit replacing netdev_err() with slave_err() in e2a7420d (bonding/main: convert to using slave printk macros). For Xenial, the commit that causes this issue, de77ecd4, does not exist.

== Test ==

Test kernels can be found here: https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/ The X-hwe and Disco kernels were tested by the bug reporter, Aleksei; the patched kernels work as expected.

== Regression Potential ==

Low. This patch just unifies the variables used in the link state change commit logic to prevent an incorrect state from occurring, and the changes are limited to the bonding driver itself. (Although include/net/bonding.h is used by other drivers, the changes to that file only affect the bond_main.c driver.)

== Original Bug Report ==

There's an issue with the bonding driver in the current Ubuntu kernels. Sometimes one link gets stuck in a weird state. It was fixed upstream by the patch https://www.spinics.net/lists/netdev/msg609506.html, commit 1899bb325149e481de31a4f32b59ea6f24e176ea. We see this bug with linux 4.15 (Ubuntu Xenial, HWE kernel), but it should be reproducible with other current kernel versions.

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852077/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
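To make the failure mode concrete, here is a toy shell sketch of the pattern the commit message describes (an illustration only, not the kernel code): two copies of the proposed link state are updated in different places, and the stale copy wins because it is applied last.

```shell
#!/bin/sh
# Before the fix: two variables carry the "next" link state.
link="FAIL"            # slave->link was forced to BOND_LINK_FAIL by the event
link_new_state="UP"    # the BOND_LINK_FAIL case proposes UP on carrier-up
new_link="DOWN"        # the second variable is left stale at DOWN

link=$link_new_state   # commit logic applies the proposed state first...
link=$new_link         # ...but the stale variable is applied last and wins
buggy=$link
echo "buggy result: $buggy"    # slave ends up DOWN despite carrier-up

# After the fix: a single variable, so nothing can disagree.
link="FAIL"
link_new_state="UP"
link=$link_new_state
fixed=$link
echo "fixed result: $fixed"
```

The actual fix in commit 1899bb32 removes new_link and drives the commit logic from link_new_state alone, which is exactly what the single-variable branch above models.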
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/ There is a test kernel above (from that LP bug).

-- https://bugs.launchpad.net/bugs/1834322
Title: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
This is being handled as a DUP of LP Bug 1852077 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852077

** Changed in: linux (Ubuntu) Status: Expired => In Progress
** Tags added: sts
** Also affects: linux (Ubuntu Disco) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Bionic) Importance: Undecided Status: New
** Also affects: linux (Ubuntu Focal) Importance: Undecided Status: In Progress
** Also affects: linux (Ubuntu Eoan) Importance: Undecided Status: New

-- https://bugs.launchpad.net/bugs/1834322
Title: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
[Kernel-packages] [Bug 1852077] Re: Backport: bonding: fix state transition issue in link monitoring
Still waiting on these patches being committed to all the Ubuntu trees. Any ETA? Is this waiting on being picked up via -stable?

-- https://bugs.launchpad.net/bugs/1852077
Title: Backport: bonding: fix state transition issue in link monitoring
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
This issue has been tested and successfully verified: Verification successful ! "...test appliance built with 4.15.0-58 was unusable ... hundreds of "BUG: non-zero pgtables_bytes on freeing mm: -16384" in syslog, RestAPI interface timeouts, failed to produce FFDC data using sosreport. Build with 4.15.0-60.67 displays none of these behaviors ... smoke test completed successfully."

** Tags added: verification-done-bionic
** Changed in: linux (Ubuntu Bionic) Status: Fix Committed => Fix Released

-- https://bugs.launchpad.net/bugs/1840046
Title: BUG: non-zero pgtables_bytes on freeing mm: -16384
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Released

Bug description:
[impact] This message is printed repeatedly in the logs: BUG: non-zero pgtables_bytes on freeing mm: -16384
[test case] boot the 4.15.0-58 kernel on s390x
[regression potential] this affects task pud accounting; regressions may be around cleaning up task memory.
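Confirming that a machine is hitting this amounts to grepping the kernel log for the message. A small self-contained sketch (the BUG line is taken from the bug description; the sample log is embedded so the check is reproducible, whereas on a live system you would grep 'journalctl -k' or /var/log/syslog):

```shell
#!/bin/sh
# Sample log content; on a live system, substitute: log=$(journalctl -k)
log='Aug 13 10:02:11 host kernel: BUG: non-zero pgtables_bytes on freeing mm: -16384
Aug 13 10:02:12 host systemd[1]: Reached target Basic System.'

# Count occurrences of the pgtables accounting BUG message.
hits=$(printf '%s\n' "$log" | grep -c 'non-zero pgtables_bytes on freeing mm')
echo "occurrences: $hits"
```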
[Kernel-packages] [Bug 1840789] Re: bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled
** Tags added: sts
** Changed in: linux (Ubuntu Xenial) Importance: Undecided => High
** Changed in: linux (Ubuntu Bionic) Importance: Undecided => High
** Changed in: linux (Ubuntu Xenial) Importance: High => Critical
** Changed in: linux (Ubuntu Bionic) Importance: High => Critical
** Changed in: linux (Ubuntu Disco) Importance: Undecided => Critical
** Changed in: linux (Ubuntu Eoan) Importance: Undecided => Critical

-- https://bugs.launchpad.net/bugs/1840789
Title: bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Xenial: In Progress
Status in linux source package in Bionic: In Progress
Status in linux source package in Disco: In Progress
Status in linux source package in Eoan: Fix Released

Bug description:
[Impact]
* The bnx2x driver may cause hardware faults (leading to panic/reboot) and other behaviors, such as transmit timeouts, after commit 3968d38917eb ("bnx2x: Fix Multi-Cos.") was introduced.
* This issue has been observed by a user shortly after starting docker & kubelet, with adapters:
  - Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
  - Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]
* If options to ignore hardware faults are used (erst_disable=1 hest_disable=1 ghes.disable=1), the system doesn't panic/reboot and continues on to time out on adapter stats, then transmit timeouts, spewing some adapter firmware dumps, but the network interface is non-functional.
* The issue only happens when LLDP is enabled on the network switches, and the crashdump shows the bnx2x driver is stuck waiting for firmware to complete the stop-traffic command in LLDP handling. The workaround used is to disable LLDP on the network switches/ports.
* Analysis of the driver and firmware dumps didn't help significantly towards finding the root cause.
* Upstream/mainline recently reverted the patch, due to similar problem reports, while looking for the root cause/proper fix.

[Test Case]
* No reproducible test case was found outside the user's systems/cluster, where it is enough to start docker & kubelet and wait.
* The user verified test kernels for Xenial and Bionic - the problem does not happen; build-tested on Disco.

[Regression Potential]
* Users who significantly use/apply the non-default traffic class (tc) / class of service (cos) might possibly see performance changes (if any at all) in such applications; however, that's unclear now.
* This is a recent revert upstream (v5.3-rc'ish), so there's a chance things might change in this area.
* Nonetheless, the patch is authored by the driver vendor, and made its way into stable kernels (e.g., v5.2.8, which made Eoan/19.10 recently).
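The two affected adapters are pinned down by their PCI IDs in the description, so spotting them on a host can be scripted. A sketch filtering sample 'lspci -nn' output (only the 14e4:168a and 14e4:16a1 IDs come from the report; the sample lines themselves are illustrative, and on a real host you would pipe 'lspci -nn' directly):

```shell
#!/bin/sh
# Illustrative lspci -nn output; on a real host use: lspci -nn | grep -E ...
sample='18:00.0 Ethernet controller [0200]: Broadcom BCM57800 [14e4:168a] (rev 10)
3b:00.0 Ethernet controller [0200]: Intel I350 [8086:1521] (rev 01)'

# Count adapters matching the PCI IDs named in the bug report.
affected=$(printf '%s\n' "$sample" | grep -Ec '14e4:(168a|16a1)')
echo "potentially affected bnx2x adapters: $affected"
```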
[Kernel-packages] [Bug 1840704] Re: ZFS kernel modules lack debug symbols
** Tags added: sts
** Tags added: linux
** Changed in: linux (Ubuntu) Importance: Undecided => High

-- https://bugs.launchpad.net/bugs/1840704
Title: ZFS kernel modules lack debug symbols
Status in linux package in Ubuntu: In Progress

Bug description:
The ZFS kernel modules aren't built with debug symbols, which introduces problems/issues for debugging/support. Patches are required in:
1) linux kernel packaging, to add infrastructure to enable/build/strip/package debug symbols on DKMS. (this is sufficient with zfs-linux now in Eoan.)
2) zfs-linux and spl-linux, for the stable releases, which need a few patches to enable debug symbols (add option './configure --enable-debuginfo' and '(ZFS|SPL)_DKMS_ENABLE_DEBUGINFO' to dkms.conf.)

Initially submitting the kernel patchset for Unstable, for review/feedback. It backports nicely into B/D/E, should it be accepted; for X (doesn't use DKMS builds) a simpler patch for the moment (until it does) works. The zfs/spl-linux patches are ready, to be submitted once the approach used by the kernel package settles.
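For context, the dkms.conf change described in item 2 might look roughly like the fragment below. This is a sketch only: the ZFS_DKMS_ENABLE_DEBUGINFO toggle and the --enable-debuginfo configure option are named in the bug, while everything else (variable names, structure) is illustrative.

```shell
# Hypothetical dkms.conf fragment: gate debug-symbol builds behind a toggle.
# dkms.conf is sourced as shell by dkms, so a conditional like this is legal.
if [ "${ZFS_DKMS_ENABLE_DEBUGINFO:-n}" = "y" ]; then
    CONFIGURE_OPTS="--enable-debuginfo"
else
    CONFIGURE_OPTS=""
fi
MAKE[0]="./configure ${CONFIGURE_OPTS} && make"
```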
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
I'll update here once kernel is uploaded.

-- https://bugs.launchpad.net/bugs/1840046
Title: BUG: non-zero pgtables_bytes on freeing mm: -16384
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
I unduped it for test process clarity. Trying to get the relevant people to test the fix.

-- https://bugs.launchpad.net/bugs/1840046
Title: BUG: non-zero pgtables_bytes on freeing mm: -16384
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
** This bug is no longer a duplicate of bug 1837664 Bionic update: upstream stable patchset 2019-07-23

-- https://bugs.launchpad.net/bugs/1840046
Title: BUG: non-zero pgtables_bytes on freeing mm: -16384
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
*** This bug is a duplicate of bug 1837664 *** https://bugs.launchpad.net/bugs/1837664

I'll unDUP it unless the kernel team says otherwise in IRC.

-- https://bugs.launchpad.net/bugs/1840046
Title: BUG: non-zero pgtables_bytes on freeing mm: -16384
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384
*** This bug is a duplicate of bug 1837664 *** https://bugs.launchpad.net/bugs/1837664

I'm not sure this bug should be DUP'd to the stable-release bug. Might confuse the verification and handling triggers, perhaps? Will need to make sure the fix is tested once the fix is uploaded.

-- https://bugs.launchpad.net/bugs/1840046
Title: BUG: non-zero pgtables_bytes on freeing mm: -16384
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
Verified on Xenial.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released
Status in linux source package in Disco:
  Fix Released

Bug description:
  SRU Justification

  Impact: Geneve tunnels cannot be created if ipv6 is dynamically
  disabled.

  Fix: Fixed by upstream commit in v5.0:
    Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
    "geneve: correctly handle ipv6.disable module parameter"
  Hence available in Disco and later; required in X, B, C.

  Testcase:
  1. Boot with "ipv6.disable=1"
  2. Then try to create a geneve tunnel using:
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 \
         type=geneve options:remote_ip=192.168.x.z  // ip of the other host

  Regression Potential: Low; this only affects geneve tunnels when ipv6
  is dynamically disabled, and in that case they currently do not work
  at all.

  Other Info:
  * The mainline commit message references a fix for non-metadata
    tunnels (that infrastructure is not yet in our tree prior to
    Disco), hence that part is not being included under this case. All
    geneve tunnels created as above are metadata-enabled.

  ---

  [Impact]

  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial with
  Open vSwitch in an environment where ipv6 has been disabled, creation
  fails with the error:

    "ovs-vsctl: Error detected while setting up 'geneve0': could not
    add network device geneve0 to ofproto (Address family not supported
    by protocol)."

  [Fix]

  There is an upstream commit for this in v5.0 mainline (and in Disco
  and later Ubuntu kernels):

    "geneve: correctly handle ipv6.disable module parameter"
    Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

  This fix is needed on all our series prior to Disco and the v5.0
  kernel: X, B, C. It is identical to the fix we implemented and tested
  internally, but had not yet pushed upstream.

  [Test Case]

  (Best done in a kvm guest VM so as not to interfere with your
  system's networking.)

  1. On any Ubuntu Xenial kernel, disable ipv6. This example uses the
     4.15.0-23-generic kernel (which differs slightly from 4.4.x in
     symptoms):
     - Edit /etc/default/grub to add the line:
       GRUB_CMDLINE_LINUX="ipv6.disable=1"
     - # update-grub
     - Reboot

  2. Install OVS:
     # apt install openvswitch-switch

  3. Create a Geneve tunnel:
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 \
         type=geneve options:remote_ip=192.168.x.z
     (where remote_ip is the IP of the other host)

  You will see the following error message:
    "ovs-vsctl: Error detected while setting up 'geneve1'. See
    ovs-vswitchd log for details."

  From /var/log/openvswitch/ovs-vswitchd.log you will see:
    "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed
    to add geneve1 as port: Address family not supported by protocol"

  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.

  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl
  add-port' command completes successfully. You can verify that it is
  working by adding an IP to br1 and pinging each host.

  On kernel 4.4 (4.4.0-128-generic), the 'ovs-vsctl add-port' command
  reports no error and no warning appears in ovs-vswitchd.log, but the
  device genev_sys_6081 is likewise not created and the ping test will
  not work.

  With the fixed test kernel, the interfaces and tunnel are created
  successfully.

  [Regression Potential]

  * Low -- the fix affects only the geneve driver, and only when ipv6
    is disabled; since tunnels do not work at all in that case today,
    the fix gets them up and running for the common case.

  [Other Info]

  * Analysis

  Geneve tunnels should work in either IPv4 or IPv6 environments as a
  design and support principle. The current implementation, however,
  requires working ipv6 support for metadata-based tunnels (which
  geneve tunnels are). That is, it supports only:

    b) ipv4 + metadata + ipv6

  rather than also:

    a) ipv4 + metadata  // whether ipv6 is compiled out or dynamically
                           disabled

  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open():

    bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
    bool metadata = geneve->collect_md;
    ...
    #if IS_ENABLED(CONFIG_IPV6)
        geneve->sock6 = NULL;
        if (ipv6 || metadata)
            ret = geneve_sock_add(geneve, true);
    #endif
        if (!ret && (!ipv6 || metadata))
            ret = geneve_sock_add(geneve, false);

  To manage notifications about this bug go to:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1794232/+subscriptions
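The control flow above can be sketched as a small user-space model. This is an illustration only: the function and variable names are simplified stand-ins, and the "fixed" variant approximates the effect of commit cf1c9ccba730 rather than reproducing the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the geneve_open() flow quoted above. sock_add(true)
 * stands in for geneve_sock_add(geneve, true); it fails when IPv6 is
 * dynamically disabled (ipv6.disable=1 on the kernel command line). */
static bool ipv6_disabled = true;      /* model: booted with ipv6.disable=1 */
static bool have_sock4, have_sock6;

static int sock_add(bool ipv6)
{
    if (ipv6 && ipv6_disabled)
        return -1;                     /* EAFNOSUPPORT in the real driver */
    if (ipv6)
        have_sock6 = true;
    else
        have_sock4 = true;
    return 0;
}

/* Unfixed flow: a metadata tunnel always tries the IPv6 socket first,
 * and a failure there aborts the open before the IPv4 socket exists. */
static int open_unfixed(bool remote_v6, bool metadata)
{
    int ret = 0;
    have_sock4 = have_sock6 = false;
    if (remote_v6 || metadata)
        ret = sock_add(true);
    if (!ret && (!remote_v6 || metadata))
        ret = sock_add(false);
    return ret;
}

/* Fixed flow (approximated): an IPv6 failure on a metadata tunnel with
 * an IPv4 remote is tolerated, so the IPv4 socket still comes up. */
static int open_fixed(bool remote_v6, bool metadata)
{
    int ret = 0;
    have_sock4 = have_sock6 = false;
    if (remote_v6 || metadata)
        ret = sock_add(true);
    if (ret && metadata && !remote_v6)
        ret = 0;                       /* fall back to IPv4 only */
    if (!ret && (!remote_v6 || metadata))
        ret = sock_add(false);
    return ret;
}
```

With ipv6 disabled, open_unfixed(false, true) fails before the IPv4 socket is created -- the reported symptom -- while open_fixed(false, true) succeeds with only the IPv4 socket.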
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
As the test kernel with the backported Xenial fix has been up for almost
2 months now, I'm submitting the SRU for Xenial, although I have not
received feedback from the original reporter or others. The backported
patch for Xenial varies slightly from the cherry-picked patch for B, C.
My testing has been successful (see original testing information in the
description).

https://bugs.launchpad.net/bugs/1794232

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released
Status in linux source package in Disco:
  Fix Released
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
** Tags added: sts

https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released

Bug description:
  [Impact]

  The i40e driver can stall on tx timeouts. This can happen when DCB is
  enabled on the connected switch. It can also trigger a second
  situation, where a tx timeout occurs under CPU load before recovery
  from a previous timeout has completed; that case is not handled
  correctly. This leads to networking delays, drops, and application
  timeouts and hangs. Note that the DCB-related tx timeout is just one
  of the ways to end up in the second situation.

  This issue was seen on a heavily loaded Kafka broker node running the
  4.15.0-38-generic kernel on Xenial. Symptoms include messages in the
  kernel log of the form:

    [4733544.982116] i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390,
    Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
    [4733544.982119] i40e 0000:18:00.1 eno2: tx_timeout recovery level 1,
    hung_queue 6

  With the test kernel provided in this LP bug, which had these two
  commits compiled in, the problem has not been seen again over several
  months of successful running:

    "i40e: Fix for Tx timeouts when interface is brought up if DCB is
    enabled"
    Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

    "i40e: prevent overlapping tx_timeout recover"
    Commit: d5585b7b6846a6d0f9517afe57be3843150719da

  * The first commit is already in Disco and Cosmic
  * The second commit is already in Disco
  * Bionic needs both patches; Cosmic needs the second

  [Test Case]

  * We are considering the case of both issues above occurring.
  * Seen by the reporter on a Kafka broker node with heavy traffic.
  * Not easy to reproduce, as it requires something like the following
    example environment and heavy load:
      Kernel: 4.15.0-38-generic
      Network driver: i40e
        version: 2.1.14-k
        firmware-version: 6.00 0x800034e6 18.3.6
      NIC: Intel 40Gb XL710
      DCB enabled

  [Regression Potential]

  Low: the first fix only impacts i40e DCB environments, and both have
  been running successfully for several months in production-load
  testing.

  ---

  Original Description

  Today the Ubuntu 16.04 LTS enablement stack moved from kernel 4.13 to
  kernel 4.15.0-24-generic. On a "Dell PowerEdge R330" server with an
  "Intel Ethernet Converged Network Adapter X710-DA2" network adapter
  (driver i40e), the network card no longer works and permanently
  displays these three lines:

    [ 98.012098] i40e 0000:01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388,
    Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
    [ 98.012119] i40e 0000:01:00.0 enp1s0f0: tx_timeout recovery level 11,
    hung_queue 8
    [ 98.012125] i40e 0000:01:00.0 enp1s0f0: tx_timeout recovery
    unsuccessful

  To manage notifications about this bug go to:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions
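The overlapping-recovery problem addressed by the second commit follows a common guard pattern; the toy model below illustrates it. The names and structure here are illustrative only, not the driver's actual code: a pending flag turns a second timeout that fires while recovery is still running into a no-op instead of a re-entrant reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an overlap guard for tx_timeout recovery. */
static bool recovery_pending;   /* set while a recovery is in flight   */
static int resets_started;      /* how many resets were actually begun */

static bool tx_timeout(void)
{
    if (recovery_pending)       /* recovery already running: ignore    */
        return false;
    recovery_pending = true;
    resets_started++;           /* schedule the reset/recovery work    */
    return true;
}

static void recovery_complete(void)
{
    recovery_pending = false;   /* the next timeout may recover again  */
}
```

Without the guard, the second timeout would start a new reset while the first is mid-flight, which is the mishandled case described above.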
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Tags added: sts

https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts, causing the system to
  experience network stalls and fail to send data and heartbeat
  packets.

  The following 25Gb Broadcom NIC error was seen (just once) on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host under
  moderate-to-heavy network traffic:

  * The bnxt_en_bpo driver froze on a "TX timed out" error and
    triggered the Netdev Watchdog timer under load. From the kernel
    log:
      "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See the attached kern.log excerpt file for the full error log.
      Release = Xenial
      Kernel = 4.4.0-141-generic #167
      eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:
      "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting
      reset task!"
      driver: bnxt_en_bpo
      version: 1.8.1
      source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures on
    the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver pulled in to
    support newer Broadcom HW (specific boards), while the bnxt_en
    module continues to support the older HW. The current Linux
    upstream driver does not compile easily with the 4.4 kernel (too
    many changes).

  * This upstream and bnxt_en driver fix is the likely solution:
      "bnxt_en: Fix TX timeout during netpoll"
      Commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
    This fix has not been applied to the bnxt_en_bpo driver version,
    but review of the code indicates that it is susceptible to the bug
    and that the fix would be reasonable.

  [Test Case]

  * Unfortunately, this is not easy to reproduce. It is also only seen
    on 4.4 kernels with the newer Broadcom NICs supported by the
    bnxt_en_bpo driver.

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
    scope: just the newest Broadcom NICs used with the Xenial 4.4
    kernel (as opposed to the hwe 4.15 etc. kernels, which have the
    fixed in-tree driver).
  * The patch is very small, and the backport is minimal and simple.
  * The fix has been running in the in-tree driver in upstream mainline
    as well as in the Ubuntu in-tree driver; although the Broadcom
    driver has a lot of different lower-level code, this piece is the
    same.

  To manage notifications about this bug go to:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
** Tags added: sts
** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic verification-done-cosmic
** Tags removed: verification-done-cosmic

https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  [Impact]

  Transmit packet steering (xps) settings don't work when the number of
  queues (cpus) is higher than 64. This is currently still an issue on
  the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in
  Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux
  (i.e. Cosmic and Disco have the fix).

  Fix: the following commit fixes this issue (as identified by Lihong
  Yang in discussion with the Intel i40e team):

    "i40e: Fix the number of queues available to be mapped for use"
    Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  It requires the following commit as well:

    "i40e: Do not allow use more TC queue pairs than MSI-X vectors
    exist"
    Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58

  [Test Case]

  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
     i40e driver version: 2.1.14-k
     Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:

       echo > /sys/class/net/eth2/queues/tx-63/xps_cpus
       echo $?
       0
       cat /sys/class/net/eth2/queues/tx-63/xps_cpus
       00,,

     But for any queue number > 63, we see this error:

       echo > /sys/class/net/eth2/queues/tx-64/xps_cpus
       echo: write error: Invalid argument
       cat /sys/class/net/eth2/queues/tx-64/xps_cpus
       cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

  To manage notifications about this bug go to:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions
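The 64-queue cliff shown in the test case can be modeled in a few lines. This is a toy of the user-visible symptom only (a per-queue map that simply runs out of bits at 64); it does not claim to be the i40e driver's actual bookkeeping.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy model: if per-queue xps state is tracked in one 64-bit word,
 * queues 0-63 configure fine, but any queue id >= 64 has no bit to set
 * and the write is rejected, mirroring the "Invalid argument" seen in
 * the sysfs test above. */
static uint64_t xps_map;               /* one bit per queue: caps at 64 */

static int set_queue_xps(unsigned int queue)
{
    if (queue >= 64)                   /* no room in a 64-bit map */
        return -EINVAL;
    xps_map |= UINT64_C(1) << queue;
    return 0;
}
```

set_queue_xps(63) succeeds while set_queue_xps(64) fails, matching the tx-63 vs tx-64 behaviour in the test case.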
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Tags added: sts

https://bugs.launchpad.net/bugs/1794232

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed
Status in linux source package in Disco:
  Fix Released
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
Bionic and Cosmic kernels successfully tested. I've updated the tags.

** Tags removed: verification-needed-bionic verification-needed-cosmic
** Tags added: verification-done-bionic verification-done-cosmic

https://bugs.launchpad.net/bugs/1794232

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed
Status in linux source package in Disco:
  Fix Released
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
Late update: the original reporter did test the proposed kernel on
systems able to reproduce the problem, and the tests were successful.

We do not yet have a way of reproducing this on Xenial (i.e. any 4.4
kernel). I'm leaving this as an open issue; once we can reproduce and
test there, I will update and push an SRU for Xenial as well.

https://bugs.launchpad.net/bugs/1820948

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
A 4.4 test kernel with the fix backported is available at:
https://people.canonical.com/~nivedita/geneve-xenial-test/
if anyone wishes to validate the 4.4 Xenial solution.

https://bugs.launchpad.net/bugs/1794232

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed
Status in linux source package in Disco:
  Fix Released
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
Resubmitted SRU for B,C for this kernel cycle.
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
Submitted SRU request for Bionic, Cosmic. Huge thanks for the testing,
Matthew!
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Tags added: cosmic xenial
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Description changed: SRU Justification Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically. Fix: Fixed by upstream commit in v5.0: Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 "geneve: correctly handle ipv6.disable module parameter" - Hence available in Disco and later; required in X,B,C - Cherry picked and tested successfully for X, B, C. + Hence available in Disco and later; required in X,B,C. Testcase: 1. Boot with "ipv6.disable=1" 2. Then try and create a geneve tunnel using: -# ovs-vsctl add-br br1 -# ovs-vsctl add-port br1 geneve1 -- set interface geneve1 - type=geneve options:remote_ip=192.168.x.z // ip of the other host + # ovs-vsctl add-br br1 + # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 + type=geneve options:remote_ip=192.168.x.z // ip of the other host Regression Potential: Low, only geneve tunnels when ipv6 dynamically disabled, current status is it doesn't work at all. Other Info: * Mainline commit msg includes reference to a fix for - non-metadata tunnels (infrastructure is not yet in - our tree prior to Disco), hence not being included - at this time under this case. + non-metadata tunnels (infrastructure is not yet in + our tree prior to Disco), hence not being included + at this time under this case. - At this time, all geneve tunnels created as above - are metadata-enabled. - + At this time, all geneve tunnels created as above + are metadata-enabled. --- [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." [Fix] There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels). 
"geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example - is shown with the4.15.0-23-generic kernel (which differs + is shown with the 4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. With the fixed test kernel, the interfaces and tunnel is created successfully. 
[Regression Potential] * Low -- affects the geneve driver only, and when ipv6 is disabled, and since it doesn't work in that case at all, this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. Currently, however, what's in the implementation requires support for ipv6 for metadata-based tunnels which geneve is: rather than: a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open() : bool ipv6 = geneve->remote.sa.sa_family == AF_INET6; bool metadata = geneve->collect_md; ... #if IS_ENABLED(CONFIG_IPV6) geneve->sock6 = NULL; if (ipv6 || metadata)
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Description changed: + SRU Justification + + Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically. + + Fix: + Fixed by upstream commit in v5.0: + Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 + "geneve: correctly handle ipv6.disable module parameter" + + Hence available in Disco and later; required in X,B,C + Cherry picked and tested successfully for X, B, C. + + Testcase: + 1. Boot with "ipv6.disable=1" + 2. Then try and create a geneve tunnel using: +# ovs-vsctl add-br br1 +# ovs-vsctl add-port br1 geneve1 -- set interface geneve1 + type=geneve options:remote_ip=192.168.x.z // ip of the other host + + Regression Potential: Low, only geneve tunnels when ipv6 dynamically + disabled, current status is it doesn't work at all. + + Other Info: + * Mainline commit msg includes reference to a fix for + non-metadata tunnels (infrastructure is not yet in + our tree prior to Disco), hence not being included + at this time under this case. + + At this time, all geneve tunnels created as above + are metadata-enabled. + + + --- [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." [Fix] There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels). "geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. 
This example is shown with the4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. With the fixed test kernel, the interfaces and tunnel is created successfully. [Regression Potential] * Low -- affects the geneve driver only, and when ipv6 is disabled, and since it doesn't work in that case at all, this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. 
Currently, however, what's in the implementation requires support for ipv6 for metadata-based tunnels which geneve is: rather than: a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open() : bool ipv6 = geneve->remote.sa.sa_family == AF_INET6; bool metadata = geneve->collect_md; ... #if IS_ENABLED(CONFIG_IPV6) geneve->sock6 = NULL; if (ipv6 || metadata) ret = geneve_sock_add(geneve, true); #endif if (!ret && (!ipv6 || metadata)) ret = geneve_sock_add(geneve, false); CONFIG_IPV6 is enabled, IPv6 is disabled at boot, but even though ipv6 is false, metadata is always true for a geneve open as it is set unconditionally in ovs: In /lib/dpif_netlink_rtnl.c : case OVS_VPORT_TYPE_GENEVE: nl_msg_put_flag(, IFLA_GENEVE_COLLECT_METADATA); The second argument of geneve_sock_add is a boolean
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Description changed: [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." [Fix] There is an upstream commit for this in v5.0 mainline (and in Disco and later Ubuntu kernels). "geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. 
You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. With the fixed test kernel, the interfaces and tunnel is created successfully. [Regression Potential] - * Low -- affects the geneve driver only, and when ipv6 is - disabled, and since it doesn't work in that case at all, - this fix gets the tunnel up and running for the common case. - + * Low -- affects the geneve driver only, and when ipv6 is + disabled, and since it doesn't work in that case at all, + this fix gets the tunnel up and running for the common case. [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. Currently, however, what's in the implementation requires support for ipv6 for metadata-based tunnels which geneve is: rather than: a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open() : bool ipv6 = geneve->remote.sa.sa_family == AF_INET6; bool metadata = geneve->collect_md; ... #if IS_ENABLED(CONFIG_IPV6) geneve->sock6 = NULL; if (ipv6 || metadata) ret = geneve_sock_add(geneve, true); #endif if (!ret && (!ipv6 || metadata)) ret = geneve_sock_add(geneve, false); CONFIG_IPV6 is enabled, IPv6 is disabled at boot, but even though ipv6 is false, metadata is always true for a geneve open as it is set unconditionally in ovs: In /lib/dpif_netlink_rtnl.c : case OVS_VPORT_TYPE_GENEVE: nl_msg_put_flag(, IFLA_GENEVE_COLLECT_METADATA); The second argument of geneve_sock_add is a boolean value indicating whether it's an ipv6 address family socket or not, and we thus incorrectly pass a true value rather than false. 
The current "|| metadata" check is unnecessary and incorrectly sends the tunnel creation code down the ipv6 path, which fails subsequently when the code expects an ipv6 family socket. * This issue exists in all versions of the kernel upto present mainline and net-next trees. * Testing with a trivial patch to remove that and make similar changes to those made for vxlan (which had the same issue) has been successful. Patches for various versions to be attached here soon. * Example Versions (bug exists in all versions of Ubuntu - and mainline): + and mainline) + + Update: This has been patched upstream after original description filed + here, fix available in v5.0 mainline and Disco
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
** Changed in: linux (Ubuntu Disco) Status: In Progress => Fix Released ** Description changed: [Impact] When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : “ovs-vsctl: Error detected while setting up 'geneve0': could not add network device geneve0 to ofproto (Address family not supported by protocol)." [Fix] - There is an upstream commit for this in v5.0 mainline. + There is an upstream commit for this in v5.0 mainline. "geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 - This fix is needed on all our series: X, C, B, D. It is identical + This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical to the fix we implemented and tested internally with, but - had not pushed upstream yet. - + had not pushed upstream yet. [Test Case] (Best to do this on a kvm guest VM so as not to interfere with your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example is shown with the4.15.0-23-generic kernel (which differs slightly from 4.4.x in symptoms): - Edit /etc/default/grub to add the line: GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: failed to add geneve1 as port: Address family not supported by protocol" You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. 
If you do not disable IPv6 (remove ipv6.disable=1 from /etc/default/grub + update-grub + reboot), the same 'ovs-vsctl add-port' command completes successfully. You can see that it is working properly by adding an IP to the br1 and pinging each host. On kernel 4.4 (4.4.0-128-generic), the error message doesn't happen using the 'ovs-vsctl add-port' command, no warning is shown in ovs-vswitchd.log, but the device genev_sys_6081 is also not created and ping test won't work. - With the fixed test kernel, the interfaces and tunnel + With the fixed test kernel, the interfaces and tunnel is created successfully. - [Other Info] * Analysis Geneve tunnels should work with either IPv4 or IPv6 environments as a design and support principle. Currently, however, what's in the implementation requires support for ipv6 for metadata-based tunnels which geneve is: rather than: a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled b) ipv4 + metadata + ipv6 What enforces this in the current 4.4.0-x code when opening a Geneve tunnel is the following in geneve_open() : bool ipv6 = geneve->remote.sa.sa_family == AF_INET6; bool metadata = geneve->collect_md; ... #if IS_ENABLED(CONFIG_IPV6) geneve->sock6 = NULL; if (ipv6 || metadata) ret = geneve_sock_add(geneve, true); #endif if (!ret && (!ipv6 || metadata)) ret = geneve_sock_add(geneve, false); CONFIG_IPV6 is enabled, IPv6 is disabled at boot, but even though ipv6 is false, metadata is always true for a geneve open as it is set unconditionally in ovs: In /lib/dpif_netlink_rtnl.c : case OVS_VPORT_TYPE_GENEVE: nl_msg_put_flag(, IFLA_GENEVE_COLLECT_METADATA); The second argument of geneve_sock_add is a boolean value indicating whether it's an ipv6 address family socket or not, and we thus incorrectly pass a true value rather than false. The current "|| metadata" check is unnecessary and incorrectly sends the tunnel creation code down the ipv6 path, which fails subsequently when the code expects an ipv6 family socket. 
* This issue exists in all versions of the kernel up to the present mainline and net-next trees. * Testing with a trivial patch to remove that check and make similar changes to those made for vxlan (which had the same issue) has been successful. Patches for various versions to be attached here soon. * Example Versions (bug exists in all versions of Ubuntu and mainline): $ uname -r 4.4.0-135-generic $ lsb_release -rd Description: Ubuntu 16.04.5 LTS Release: 16.04 $ dpkg -l | grep openvswitch-switch ii openvswitch-switch 2.5.4-0ubuntu0.16.04.1
[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled
We had tested a patch discussed above and tested internally, with success - although we have limited testing (opening up a geneve tunnel between 2 kvm guests). Jiri has now pushed an identical patch upstream which is available in the v5.0 kernel and later. "geneve: correctly handle ipv6.disable module parameter" Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 Although I do not have testing validation from original poster, since it has been committed upstream, I'm going to go ahead and get the SRU request started. ** Changed in: linux (Ubuntu) Status: Triaged => In Progress ** Changed in: linux (Ubuntu) Importance: Medium => High ** Also affects: linux (Ubuntu Xenial) Importance: Undecided Status: New ** Also affects: linux (Ubuntu Cosmic) Importance: Undecided Status: New ** Also affects: linux (Ubuntu Disco) Importance: High Status: In Progress ** Changed in: linux (Ubuntu Cosmic) Status: New => In Progress ** Changed in: linux (Ubuntu Disco) Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi) ** Changed in: linux (Ubuntu Cosmic) Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi) ** Changed in: linux (Ubuntu Xenial) Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi) ** Changed in: linux (Ubuntu Xenial) Status: New => In Progress ** Changed in: linux (Ubuntu Cosmic) Importance: Undecided => High ** Changed in: linux (Ubuntu Xenial) Importance: Undecided => High ** Description changed: [Impact] - When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in - an OS environment with open vswitch, where ipv6 has been disabled, + When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in + an OS environment with open vswitch, where ipv6 has been disabled, the create fails with the error : - “ovs-vsctl: Error detected while setting up 'geneve0': could not - add network device geneve0 to ofproto (Address family not supported + “ovs-vsctl: Error detected while setting up 'geneve0': could not + add network device geneve0 to 
ofproto (Address family not supported by protocol)." - + [Fix] + There is an upstream commit for this in v5.0 mainline. + + "geneve: correctly handle ipv6.disable module parameter" + Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7 + + This fix is needed on all our series: X, C, B, D + + [Test Case] - (Best to do this on a kvm guest VM so as not to interfere with - your system's networking) + (Best to do this on a kvm guest VM so as not to interfere with + your system's networking) 1. On any Ubuntu Xenial kernel, disable ipv6. This example -is shown with the4.15.0-23-generic kernel (which differs -slightly from 4.4.x in symptoms): - + is shown with the4.15.0-23-generic kernel (which differs + slightly from 4.4.x in symptoms): + - Edit /etc/default/grub to add the line: - GRUB_CMDLINE_LINUX="ipv6.disable=1" + GRUB_CMDLINE_LINUX="ipv6.disable=1" - # update-grub - Reboot - 2. Install OVS # apt install openvswitch-switch 3. Create a Geneve tunnel # ovs-vsctl add-br br1 - # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 + # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 type=geneve options:remote_ip=192.168.x.z (where remote_ip is the IP of the other host) You will see the following error message: - "ovs-vsctl: Error detected while setting up 'geneve1'. + "ovs-vsctl: Error detected while setting up 'geneve1'. See ovs-vswitchd log for details." From /var/log/openvswitch/ovs-vswitchd.log you will see: - "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: - failed to add geneve1 as port: Address family not supported + "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: + failed to add geneve1 as port: Address family not supported by protocol" - You will notice from the "ifconfig" output that the device + You will notice from the "ifconfig" output that the device genev_sys_6081 is not created. 
- If you do not disable IPv6 (remove ipv6.disable=1 from - /etc/default/grub + update-grub + reboot), the same - 'ovs-vsctl add-port' command completes successfully. - You can see that it is working properly by adding an - IP to the br1 and pinging each host. + If you do not disable IPv6 (remove ipv6.disable=1 from + /etc/default/grub + update-grub + reboot), the same + 'ovs-vsctl add-port' command completes successfully. + You can see that it is working properly by adding an + IP to the br1 and pinging each host. - On kernel 4.4 (4.4.0-128-generic), the error message doesn't - happen using the 'ovs-vsctl add-port' command, no warning is - shown in ovs-vswitchd.log, but the device genev_sys_6081 is + On kernel 4.4 (4.4.0-128-generic), the error message doesn't + happen
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
I have installed and booted to this kernel, and ensured no new regression introduced, although I cannot repro the issue. ** Tags removed: 4.15.0-24-generic cosmic kernel verification-needed-bionic verification-needed-cosmic ** Tags added: verification-done-bionic verification-done-cosmic ** Description changed: [Impact] The i40e driver can get stalled on tx timeouts. This can happen when DCB is enabled on the connected switch. This can also trigger a second situation when a tx timeout occurs before the recovery of a previous timeout has completed due to CPU load, which is not handled correctly. This leads to networking delays, drops and application timeouts and hangs. Note that the first tx timeout cause is just one of the ways to end up in the second situation. This issue was seen on a heavily loaded Kafka broker node running - the 4.15.0-38-generic kernel on Xenial. + the 4.15.0-38-generic kernel on Xenial. Symptoms include messages in the kernel log of the form: --- [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0 [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6 With the test kernel provided in this LP bug which had these two commits compiled in, the problem has not been seen again, and has been running successfully for several months: - "i40e: Fix for Tx timeouts when interface is brought up if - DCB is enabled" + "i40e: Fix for Tx timeouts when interface is brought up if + DCB is enabled" Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee "i40e: prevent overlapping tx_timeout recover" Commit: d5585b7b6846a6d0f9517afe57be3843150719da * The first commit is already in Disco, Cosmic * The second commit is already in Disco * Bionic needs both patches and Cosmic needs the second [Test Case] * We are considering the case of both issues above occurring. * Seen by reporter on a Kafka broker node with heavy traffic. 
- * Not easy to reproduce as it requires something like the - following example environment and heavy load: + * Not easy to reproduce as it requires something like the + following example environment and heavy load: - Kernel: 4.15.0-38-generic - Network driver: i40e - version: 2.1.14-k - firmware-version: 6.00 0x800034e6 18.3.6 - NIC: Intel 40Gb XL710 - DCB enabled - + Kernel: 4.15.0-38-generic + Network driver: i40e + version: 2.1.14-k + firmware-version: 6.00 0x800034e6 18.3.6 + NIC: Intel 40Gb XL710 + DCB enabled [Regression Potential] Low, as the first only impacts i40e DCB environment, and has - been running for several months in production-load testing + been running for several months in production-load testing successfully. - --- Original Description Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13 to the Kernel 4.15.0-24-generic. On a "Dell PowerEdge R330" server with a network adapter "Intel Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network card no longer works and permanently displays these three lines : [ 98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1 [ 98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8 [ 98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1779756 Title: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04) Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Committed Status in linux source package in Cosmic: Fix Committed Bug description: [Impact] The i40e driver can get stalled on tx timeouts. This can happen when DCB is enabled on the connected switch. 
This can also trigger a second situation when a tx timeout occurs before the recovery of a previous timeout has completed due to CPU load, which is not handled correctly. This leads to networking delays, drops and application timeouts and hangs. Note that the first tx timeout cause is just one of the ways to end up in the second situation. This issue was seen on a heavily loaded Kafka broker node running the 4.15.0-38-generic kernel on Xenial. Symptoms include messages in the kernel log of the form: --- [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0 [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6 With the test kernel provided in this LP bug which had these two commits compiled in, the problem has not been seen again, and has been running successfully for several
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
** Changed in: linux (Ubuntu) Status: In Progress => Fix Released -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Committed Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f It requires the following commit as well: i40e: Do not allow use more TC queue pairs than MSI-X vectors exist Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58 [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
I'm still trying to confirm this for Xenial. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: Fix Committed To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
Submitted patches for SRU. ** Description changed: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe - and Bionic kernels). + and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f + It requires the following commit as well: + + i40e: Do not allow use more TC queue pairs than MSI-X vectors exist + Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58 + [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel -i40e driver version: 2.1.14-k -Any system with > 64 CPUs + i40e driver version: 2.1.14-k + Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: - + echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus - 00,, + 00,, - But for any queue number > 63, we see this error: + But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: In Progress Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). 
It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f It requires the following commit as well: i40e: Do not allow use more TC queue pairs than MSI-X vectors exist Commit: 1563f2d2e01242f05dd523ffd56fe104bc1afd58 [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
I am not sure we could deterministically provoke the issue. At the very least to ensure no other regression was introduced, I would run it under heavy network load. The environment in question which saw the issue had network load, contention for cpus and several other issues occur. The basic environment is: 1. For any 25Gb NIC/chipset that requires the 4.4 bnxt_en_bpo driver, set its 2 ports/interfaces up in bonding mode as follows: bond-lacp-rate fast bond-master bond0 bond-miimon 100 bond-mode 802.3ad bond-xmit-hash-policy layer3+4 mtu 9000 2. Run any heavy TCP network load test over the systems (e.g. iperf, netperf, file transfer, etc.) 3. Theoretically, it would appear that if the number of tx ring descriptors were lower, than that would be more likely to hit this (not successfully proven by testing here), but can lower it and see if that helps: # ethtool -G eno49 tx 128 // for example I am not sure if that helps, Scott. I'll try and smoke up more specific steps but I cannot guarantee you will see the issue. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1814095 Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer Status in linux package in Ubuntu: Confirmed Status in linux source package in Xenial: Fix Committed Bug description: [Impact] The bnxt_en_bpo driver experienced tx timeouts causing the system to experience network stalls and fail to send data and heartbeat packets. The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network traffic (just once): * The bnxt_en_po driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load. * From kernel log: "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out" See attached kern.log excerpt file for full excerpt of error log. 
* Release = Xenial Kernel = 4.4.0-141-generic #167 eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet * This caused the driver to reset in order to recover: "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!" driver: bnxt_en_bpo version: 1.8.1 source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout() * The loss of connectivity and softirq stall caused other failures on the system. * The bnxt_en_po driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards) while the bnx_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes). * This upstream and bnxt_en driver fix is a likely solution: "bnxt_en: Fix TX timeout during netpoll" commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906 This fix has not been applied to the bnxt_en_po driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable. [Test Case] * Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver. [Regression Potential] * The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed driver). * The patch is very small and backport is fairly minimal and simple. * The fix has been running on the in-tree driver in upstream mainline as well as the Ubuntu Linux in-tree driver, although the Broadcom driver has a lot of lower level code that is different, this piece is still the same. 
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
Will be submitting SRU request early next week; trying to get it into this next kernel release cycle. ** Changed in: linux (Ubuntu) Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi) ** Changed in: linux (Ubuntu Bionic) Status: Confirmed => In Progress ** Changed in: linux (Ubuntu) Status: Confirmed => In Progress -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: In Progress Status in linux source package in Bionic: In Progress Bug description: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 
0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Just briefly wanted to say that this is one we've discussed at length -- we may not be able to get someone who has the right NIC to test with it in time. I'm sanity checking the kernel, but that is not exercising the key change here. If we could assume verification-done for our purposes here, that might be needed. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1814095 Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer Status in linux package in Ubuntu: Confirmed Status in linux source package in Xenial: Fix Committed Bug description: [Impact] The bnxt_en_bpo driver experienced tx timeouts causing the system to experience network stalls and fail to send data and heartbeat packets. The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network traffic (just once): * The bnxt_en_po driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load. * From kernel log: "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out" See attached kern.log excerpt file for full excerpt of error log. * Release = Xenial Kernel = 4.4.0-141-generic #167 eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet * This caused the driver to reset in order to recover: "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!" driver: bnxt_en_bpo version: 1.8.1 source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout() * The loss of connectivity and softirq stall caused other failures on the system. * The bnxt_en_po driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards) while the bnx_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes). 
* This upstream and bnxt_en driver fix is a likely solution: "bnxt_en: Fix TX timeout during netpoll" commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906 This fix has not been applied to the bnxt_en_po driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable. [Test Case] * Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver. [Regression Potential] * The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed driver). * The patch is very small and backport is fairly minimal and simple. * The fix has been running on the in-tree driver in upstream mainline as well as the Ubuntu Linux in-tree driver, although the Broadcom driver has a lot of lower level code that is different, this piece is still the same. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus
It's been reported by an external reporter and reproduced internally. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: Confirmed Status in linux source package in Bionic: Confirmed To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions
[Kernel-packages] [Bug 1820948] [NEW] i40e xps management broken when > 64 queues/cpus
Public bug reported: [Impact] Transmit packet steering (xps) settings don't work when the number of queues (cpus) is higher than 64. This is currently still an issue on the 4.15 kernel (Xenial -hwe and Bionic kernels). It was fixed in Intel's i40e driver version 2.7.11 and in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix). Fix - The following commit fixes this issue (as identified by Lihong Yang in discussion with Intel i40e team): "i40e: Fix the number of queues available to be mapped for use" Commit: bc6d33c8d93f520e97a8c6330b8910053d4f [Test Case] 1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel i40e driver version: 2.1.14-k Any system with > 64 CPUs 2. For any queue 0 - 63, you can read/set tx xps: echo > /sys/class/net/eth2/queues/tx-63/xps_cpus echo $? 0 cat /sys/class/net/eth2/queues/tx-63/xps_cpus 00,, But for any queue number > 63, we see this error: echo > /sys/class/net/eth2/queues/tx-64/xps_cpus echo: write error: Invalid argument cat /sys/class/net/eth2/queues/tx-64/xps_cpus cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument ** Affects: linux (Ubuntu) Importance: High Status: Confirmed ** Affects: linux (Ubuntu Bionic) Importance: High Assignee: Nivedita Singhvi (niveditasinghvi) Status: Confirmed ** Tags: bionic ** Also affects: linux (Ubuntu Bionic) Importance: Undecided Status: New ** Changed in: linux (Ubuntu) Status: New => Confirmed ** Changed in: linux (Ubuntu Bionic) Status: New => Confirmed ** Changed in: linux (Ubuntu) Importance: Undecided => High ** Changed in: linux (Ubuntu Bionic) Importance: Undecided => High ** Changed in: linux (Ubuntu Bionic) Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. 
https://bugs.launchpad.net/bugs/1820948 Title: i40e xps management broken when > 64 queues/cpus Status in linux package in Ubuntu: Confirmed Status in linux source package in Bionic: Confirmed To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
Submitted SRU request

https://bugs.launchpad.net/bugs/1779756

Title: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: In Progress
Status in linux source package in Cosmic: In Progress

Bug description:

[Impact]
The i40e driver can stall on tx timeouts. This can happen when DCB is enabled on the connected switch. It can also trigger a second problem, which is not handled correctly: under CPU load, a tx timeout can occur before recovery of a previous timeout has completed. This leads to networking delays, drops, and application timeouts and hangs. Note that the first tx timeout cause is just one of the ways to end up in the second situation.

This issue was seen on a heavily loaded Kafka broker node running the 4.15.0-38-generic kernel on Xenial.

Symptoms include messages in the kernel log of the form:

[4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
[4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6

With the test kernel provided in this LP bug, which had these two commits compiled in, the problem has not been seen again, and the kernel has been running successfully for several months:

"i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled"
Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

"i40e: prevent overlapping tx_timeout recover"
Commit: d5585b7b6846a6d0f9517afe57be3843150719da

* The first commit is already in Disco and Cosmic
* The second commit is already in Disco
* Bionic needs both patches; Cosmic needs the second

[Test Case]
* We are considering the case of both issues above occurring.
* Seen by the reporter on a Kafka broker node with heavy traffic.
* Not easy to reproduce, as it requires something like the following example environment and heavy load:

Kernel: 4.15.0-38-generic
Network driver: i40e
  version: 2.1.14-k
  firmware-version: 6.00 0x800034e6 18.3.6
NIC: Intel 40Gb XL710
DCB enabled

[Regression Potential]
Low, as the first commit only impacts i40e DCB environments, and the pair has been running successfully for several months in production-load testing.

--- Original Description

Today the Ubuntu 16.04 LTS Enablement Stack has moved from kernel 4.13 to kernel 4.15.0-24-generic. On a "Dell PowerEdge R330" server with an "Intel Ethernet Converged Network Adapter X710-DA2" network adapter (driver i40e), the network card no longer works and permanently displays these three lines:

[ 98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
[ 98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
[ 98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful
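Since the test case above describes an environment rather than a deterministic reproducer, a quick check of whether a host matches that environment can be sketched as follows. This is an illustrative helper, not from the bug report; "eno2" is an assumed interface name, and `ethtool -i` simply reports the driver, version, and firmware fields quoted above.

```shell
#!/bin/sh
# Check whether a host matches the affected i40e environment:
# a 4.15-series kernel with the in-tree i40e driver (2.1.14-k).
# "eno2" is an assumption -- pass the NIC's interface name as $1.
IFACE="${1:-eno2}"

kernel=$(uname -r)
echo "kernel: $kernel"

# ethtool -i reports driver name, driver version and NIC firmware.
ethtool -i "$IFACE" 2>/dev/null | \
    grep -E '^(driver|version|firmware-version):'

# Predicate: is this kernel in the affected 4.15 series?
is_415() { case "$1" in 4.15.*) return 0 ;; *) return 1 ;; esac; }

if is_415 "$kernel"; then
    echo "4.15-series kernel: verify the two i40e fix commits are applied"
fi
```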
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
** Tags added: bionic cosmic

https://bugs.launchpad.net/bugs/1779756
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: In Progress
Status in linux source package in Cosmic: In Progress
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
** Description changed:

https://bugs.launchpad.net/bugs/1779756
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: In Progress
Status in linux source package in Cosmic: In Progress
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
** Changed in: linux (Ubuntu Bionic)
   Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Cosmic)
   Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Bionic)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Cosmic)
   Status: Confirmed => In Progress

https://bugs.launchpad.net/bugs/1779756
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
We have a user who has been running successfully under load with the test kernel provided here, which was patched with the following two commits:

"i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled"
Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

"i40e: prevent overlapping tx_timeout recover"
Commit: d5585b7b6846a6d0f9517afe57be3843150719da

The issue was hit while running 4.15.0-38-generic #41~16.04.1-Ubuntu on Xenial (the hwe kernel). Symptoms include messages in the kernel log of the form:

[4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
[4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
[4733572.116270] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 2, NTC: 0x49, HWB: 0x123, NTU: 0x123, TAIL: 0x123, INT: 0x0
[4733572.116272] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 2

These led to Kafka server issues, among other problems. We are fairly confident this is the same issue the original reporter hit, and we'd like to use this bug to proceed with the stable release update process.

https://bugs.launchpad.net/bugs/1779756
Status in linux package in Ubuntu: Fix Released
Status in linux source package in Bionic: Confirmed
Status in linux source package in Cosmic: Confirmed
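Kernel-log excerpts like the ones quoted above can be summarized mechanically when triaging a node. The following is a hedged sketch (not from the bug): it counts i40e tx_timeouts per hung queue and reports the highest recovery level seen, since repeated or rising levels suggest recovery is not succeeding. The log path is an assumption.

```shell
#!/bin/sh
# Summarize i40e tx_timeout events from a kernel log.
# The default log path is an assumption -- pass yours as $1.
LOG="${1:-/var/log/kern.log}"

# Count timeouts per queue ("Q <n>,") and track the highest
# "tx_timeout recovery level" observed.
summarize() {
    awk '
        /i40e .*tx_timeout: VSI_seid/ {
            for (i = 1; i <= NF; i++)
                if ($i == "Q") q[$(i+1)+0]++
        }
        /tx_timeout recovery level/ {
            for (i = 1; i <= NF; i++)
                if ($i == "level") { lvl = $(i+1)+0; if (lvl > max) max = lvl }
        }
        END {
            for (n in q) printf "queue %s: %d timeouts\n", n, q[n]
            printf "max recovery level: %d\n", max
        }' "$@"
}

if [ -r "$LOG" ]; then summarize "$LOG"; fi
```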
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Terry,

We've had a lot of discussion over this bug. It does not have a reliable reproducer, and I have not yet received any acks on testing of the above. Our thinking was that it was still better to patch it, since the issue has been seen with the mainline driver as well and we'd like to avoid a recurrence of the situation. The fix needs to be available in the official Xenial bits (rather than via a temporary test kernel in our PPA, for instance).

FWIW, here are the boards in question:

enum board_idx {
    BCM57301,
    BCM57417_NPAR,
    BCM58700,
    BCM57311,
    BCM57312,
    BCM57402,
    BCM57402_NPAR,
    BCM57407,
    BCM57412,
    BCM57414,
    BCM57416,
    BCM57417,
    BCM57412_NPAR,
    BCM57314,
    BCM57417_SFP,
    BCM57416_SFP,
    BCM57404_NPAR,
    BCM57406_NPAR,
    BCM57407_SFP,
    BCM57407_NPAR,
    BCM57414_NPAR,
    BCM57416_NPAR,
    BCM57452,
    BCM57454,
    NETXTREME_E_VF,
    NETXTREME_C_VF,
};

Per conversation with Brad and Jay, it was agreed that patching only the bnxt_en_bpo driver with this fix was the way to go, despite the lack of a reproducer, rather than pulling in an entire new driver from Broadcom, as was also considered.

The FW version the issue was hit on:
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03

But it might be best to test with the latest available firmware (214.0.166/1.9.2 pkg 21.40.16.6 or later). Not sure if that helps? Let me know if I can address anything else.

https://bugs.launchpad.net/bugs/1814095

Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Status in linux package in Ubuntu: Confirmed
Status in linux source package in Xenial: In Progress

Bug description:

[Impact]
The bnxt_en_bpo driver experienced tx timeouts, causing the system to experience network stalls and fail to send data and heartbeat packets.

The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host under moderate-to-heavy network traffic (just once):

* The bnxt_en_bpo driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load.

* From the kernel log:
  "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
  See the attached kern.log excerpt file for the full error log.

* Release = Xenial
  Kernel = 4.4.0-141-generic #167
  eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

* This caused the driver to reset in order to recover:
  "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"
  driver: bnxt_en_bpo
  version: 1.8.1
  source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

* The loss of connectivity and softirq stall caused other failures on the system.

* The bnxt_en_bpo driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards), while the bnxt_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes).

* This upstream and bnxt_en driver fix is a likely solution:
  "bnxt_en: Fix TX timeout during netpoll"
  commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
  This fix has not been applied to the bnxt_en_bpo driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable.

[Test Case]
* Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with the newer Broadcom NICs supported by the bnxt_en_bpo driver.

[Regression Potential]
* The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which have the fixed in-tree driver).

* The patch is very small, and the backport is fairly minimal and simple.

* The fix has been running in the in-tree driver in upstream mainline as well as in the Ubuntu in-tree driver; although the Broadcom driver has a lot of lower-level code that differs, this piece is the same.
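Since the bug only affects hosts where the imported bnxt_en_bpo driver (rather than in-tree bnxt_en) is bound to the NIC, a quick triage check can be sketched as below. This is illustrative, not from the report; "eno2d1" is the interface name from the log excerpt and is an assumption for other hosts.

```shell
#!/bin/sh
# Determine which Broadcom NIC driver an interface is using and
# whether it is the bnxt_en_bpo driver that lacks the netpoll fix
# on 4.4 kernels. "eno2d1" is an assumed interface name.
IFACE="${1:-eno2d1}"

drv=$(ethtool -i "$IFACE" 2>/dev/null | awk -F': ' '/^driver:/ {print $2}')
ver=$(ethtool -i "$IFACE" 2>/dev/null | awk -F': ' '/^version:/ {print $2}')
echo "driver=$drv version=$ver"

# Predicate: is this the backported driver discussed in the bug?
needs_fix() { [ "$1" = "bnxt_en_bpo" ]; }

if needs_fix "$drv"; then
    echo "bnxt_en_bpo in use: consider the patched 4.4 test kernel"
fi
```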
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Changed in: linux (Ubuntu Xenial)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Xenial)
   Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

https://bugs.launchpad.net/bugs/1814095
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Changed in: linux (Ubuntu Xenial)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

https://bugs.launchpad.net/bugs/1814095
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Description changed:

https://bugs.launchpad.net/bugs/1814095
Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
If anyone is interested and willing to test a 4.4 kernel patched with the fix "bnxt_en: Fix TX timeout during netpoll" backported to the bnxt_en_bpo driver, please find the packages here: http://people.canonical.com/~nivedita/bpo/ -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1814095 Title: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer Status in linux package in Ubuntu: Confirmed Bug description: The following 25Gb Broadcom NIC error was seen on Xenial running the 4.4.0-141-generic kernel on an amd64 host seeing moderate-heavy network traffic (just once): * The bnxt_en_po driver froze on a "TX timed out" error and triggered the Netdev Watchdog timer under load. * From kernel log: "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out" See attached kern.log excerpt file for full excerpt of error log. * Release = Xenial Kernel = 4.4.0-141-generic #167 eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet * This caused the driver to reset in order to recover: "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!" driver: bnxt_en_bpo version: 1.8.1 source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout() * The loss of connectivity and softirq stall caused other failures on the system. * The bnxt_en_po driver is the imported Broadcom driver pulled in to support newer Broadcom HW (specific boards) while the bnx_en module continues to support the older HW. The current Linux upstream driver does not compile easily with the 4.4 kernel (too many changes). * This upstream and bnxt_en driver fix is a likely solution: "bnxt_en: Fix TX timeout during netpoll" commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906 This fix has not been applied to the bnxt_en_po driver version, but review of the code indicates that it is susceptible to the bug, and the fix would be reasonable. 
  * No easy way to reproduce this.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
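For anyone triaging similar reports, the two log signatures quoted above can be pulled out of a kern.log with a simple grep. The sketch below uses a hypothetical two-line excerpt written inline for illustration, not the attached log file:

```shell
# Hypothetical kern.log excerpt reproducing the two messages quoted in the bug.
cat > /tmp/kern.log.sample <<'EOF'
NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out
bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!
EOF

# Count lines matching either the watchdog trip or the driver's reset message.
grep -cE 'NETDEV WATCHDOG|TX timeout detected' /tmp/kern.log.sample
# prints: 2
```

On an affected host you would point the same grep at /var/log/kern.log instead of the sample file.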
[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)
Any update on a Bionic fix?

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Confirmed

Bug description:
  Today the Ubuntu 16.04 LTS Hardware Enablement Stack moved from
  kernel 4.13 to kernel 4.15.0-24-generic. On a Dell PowerEdge R330
  server with an "Intel Ethernet Converged Network Adapter X710-DA2"
  network adapter (driver i40e), the network card no longer works and
  permanently displays these three lines:

  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Due to earlier NIC flapping observed on systems with the 25Gb Broadcom NIC, originally running the following config, the firmware was upgraded to avoid a known FW bug:

$ cat ethtool_-i_enp59s0f1d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
expansion-rom-version:
bus-info: :3b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

The FW was upgraded on affected systems to:

$ cat ethtool_-i_eno2d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
expansion-rom-version:
bus-info: :19:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

Unfortunately, it is not quite clear which FW version the current bug occurred on (I believe the newer, but can't confirm -- it happened in the midst of several reboots).

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Status in linux package in Ubuntu:
  Incomplete
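The firmware comparison above comes from saved `ethtool -i` captures. A quick way to diff just the relevant field across hosts is to filter those captures; the sketch below recreates one capture inline for illustration (the real files are attached to the bug):

```shell
# Recreate the saved ethtool capture inline (abbreviated; the real file is
# attached to the bug report).
cat > /tmp/ethtool_-i_eno2d1 <<'EOF'
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
bus-info: :19:00.1
EOF

# Extract only the firmware-version field for comparison across hosts.
awk -F': ' '/^firmware-version/ {print $2}' /tmp/ethtool_-i_eno2d1
# prints: 214.0.166/1.9.2 pkg 21.40.16.6
```

On a live host the capture step would simply be `ethtool -i eno2d1 > ethtool_-i_eno2d1` (interface name is host-specific).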
[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
** Attachment added: "kern.log.excerpt-netdev-watchdog-timeout.txt"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+attachment/5234643/+files/kern.log.excerpt-netdev-watchdog-timeout.txt

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095
[Kernel-packages] [Bug 1814095] [NEW] bnxt_en_po: TX timed out triggering Netdev Watchdog Timer
Public bug reported:

** Affects: linux (Ubuntu)
   Importance: High
       Status: New

** Tags: xenial

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions