[Kernel-packages] [Bug 1711407] Re: unregister_netdevice: waiting for lo to become free

2021-12-22 Thread Nivedita Singhvi
As several different forums are discussing this issue,
I'm using this LP bug to continue investigating the
current manifestation of this bug (after the 4.15 kernel).

I suspect the cause is in one of the other places that were
not fixed, as my colleague Dan stated a while ago.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1711407

Title:
  unregister_netdevice: waiting for lo to become free

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Trusty:
  In Progress
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  Won't Fix
Status in linux source package in Artful:
  Won't Fix
Status in linux source package in Bionic:
  In Progress

Bug description:
  This is a "continuation" of bug 1403152, as that bug has been marked
  "fix released" and recent reports of failure may (or may not) be a new
  bug.  Please report any further occurrences of the problem here
  instead of on that bug.

  --

  [Impact]

  When shutting down and starting containers, the container network
  namespace may experience a dst reference counting leak, which results
  in this message being repeated in the logs:

  unregister_netdevice: waiting for lo to become free. Usage count = 1

  This can cause issues when trying to create a new network namespace and
  thus block a user from creating new containers.
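
  A quick way to see how often the message above appears is to scan the
  kernel log for it. The following is a minimal sketch, not part of the
  bug report; the function name count_leak_messages is hypothetical, and
  the pattern assumes the message text exactly as quoted above.

```python
import re
from collections import Counter

# Pattern for the kernel message quoted in the bug description.
LEAK_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\. "
    r"Usage count = (?P<count>\d+)"
)


def count_leak_messages(lines):
    """Return a Counter mapping (device, usage_count) to occurrences."""
    hits = Counter()
    for line in lines:
        match = LEAK_RE.search(line)
        if match:
            hits[(match.group("dev"), int(match.group("count")))] += 1
    return hits
```

  It could be fed the lines of a saved dmesg or kern.log file, for
  example count_leak_messages(open("kern.log")), where kern.log is a
  placeholder path.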

  [Test Case]

  See comment 16, reproducer provided at
  https://github.com/fho/docker-samba-loop

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1711407] Re: unregister_netdevice: waiting for lo to become free

2021-12-22 Thread Nivedita Singhvi
We are definitely seeing a problem on kernels after 4.15.0-159-generic,
which is the last known good kernel. 5.3* kernels are affected, but I
do not have data on the most recent upstream.



[Kernel-packages] [Bug 1403152] Re: unregister_netdevice: waiting for lo to become free. Usage count

2021-10-06 Thread Nivedita Singhvi
** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-lts-utopic in Ubuntu.
https://bugs.launchpad.net/bugs/1403152

Title:
  unregister_netdevice: waiting for lo to become free. Usage count

Status in Linux:
  Incomplete
Status in linux package in Ubuntu:
  Fix Released
Status in linux-lts-utopic package in Ubuntu:
  Won't Fix
Status in linux-lts-xenial package in Ubuntu:
  Won't Fix
Status in linux source package in Trusty:
  Fix Released
Status in linux-lts-utopic source package in Trusty:
  Fix Released
Status in linux-lts-xenial source package in Trusty:
  Won't Fix
Status in linux source package in Vivid:
  Fix Released

Bug description:
  SRU Justification:

  [Impact]

  Users of kernels that utilize NFS may see the following message when
  shutting down and starting containers:

  unregister_netdevice: waiting for lo to become free. Usage count = 1

  This can cause issues when trying to create a new network namespace and
  thus block a user from creating new containers.

  [Test Case]

  Set up multiple containers in parallel to mount an NFS share, create
  some traffic and shut down. Eventually you will see the kernel message.

  Dave's script here:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1403152/comments/24

  [Fix]
  commit de84d89030fa4efa44c02c96c8b4a8176042c4ff upstream

  --

  I am currently running trusty with the latest patches, and I get this
  on the following hardware and software:

  Ubuntu 3.13.0-43.72-generic 3.13.11.11

  processor : 7
  vendor_id : GenuineIntel
  cpu family: 6
  model : 77
  model name: Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz
  stepping  : 8
  microcode : 0x11d
  cpu MHz   : 2400.000
  cache size: 1024 KB
  physical id   : 0
  siblings  : 8
  core id   : 7
  cpu cores : 8
  apicid: 14
  initial apicid: 14
  fpu   : yes
  fpu_exception : yes
  cpuid level   : 11
  wp: yes
  flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
    pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm
    constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc
    aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm
    sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch
    arat epb dtherm tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms
  bogomips  : 4799.48
  clflush size  : 64
  cache_alignment   : 64
  address sizes : 36 bits physical, 48 bits virtual
  power management:

  The error in the subject is somewhat reproducible; lxc still works but
  is no longer manageable until a reboot.

  "Not manageable" means every command hangs.

  I saw there are a lot of bugs, but they seem to relate to older versions
  and are closed, so I decided to file a new one.

  I run a lot of machines with trusty and lxc containers, but only this
  kind of machine produces these errors; all the others don't show this
  odd behavior.

  thx in advance

  meno

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/1403152/+subscriptions




[Kernel-packages] [Bug 1711407] Re: unregister_netdevice: waiting for lo to become free

2021-10-06 Thread Nivedita Singhvi
Is anyone still seeing a similar issue on current mainline?


** Tags added: sts



[Kernel-packages] [Bug 1900438] Re: Bcache bypasse writeback on caching device with fragmentation

2021-03-26 Thread Nivedita Singhvi
** Description changed:

  SRU Justification:
  
  [Impact]
- This bug in bcache [insert correct area] affects I/O performance on all versions of the kernel [correct versions affected]. It is particularly negative on ceph if used with bcache.
+ This bug in bcache affects I/O performance on all versions of the kernel [correct versions affected]. It is particularly negative on ceph if used with bcache.
  
  Write I/O latency would suddenly go to around 1 second from around 10 ms
  when hitting this issue and would easily be stuck there for hours or
  even days, especially bad for ceph on bcache architecture. This would
  make ceph extremely slow and make the entire cloud almost unusable.
  
  The root cause is that the dirty bucket had reached the 70 percent
  threshold, thus causing all writes to go direct to the backing HDD
  device. It might be fine if it actually had a lot of dirty data, but
  this happens when dirty data has not even reached over 10 percent, due
  to having high memory fragmentation. What makes it worse is that the
  writeback rate might be still at minimum value (8) due to the writeback
  percent not reached, so it takes ages for bcache to really reclaim
  enough dirty buckets to get itself out of this situation.
  
  [Fix]
  
  * 71dda2a5625f31bc3410cb69c3d31376a2b66f28 “bcache: consider the
  fragmentation when update the writeback rate”
  
- The current way to calculate the writeback rate only considered the dirty sectors. 
+ The current way to calculate the writeback rate only considered the dirty sectors.
  This usually works fine when memory fragmentation is not high, but it will
  give us an unreasonably low writeback rate when we are in the situation that a
  few dirty sectors have consumed a lot of dirty buckets. In some cases, the
  dirty buckets reached CUTOFF_WRITEBACK_SYNC (i.e., stopped writeback) while
  the dirty data (sectors) had not even reached the writeback_percent threshold
  (i.e., started writeback). In that situation, the writeback rate will still be
  the minimum value (8*512 = 4KB/s), thus it will cause all the writes to be stuck
  in a non-writeback mode because of the slow writeback.
  
  We accelerate the rate in 3 stages with different aggressiveness:
- the first stage starts when dirty buckets percent reach above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), 
+ the first stage starts when dirty buckets percent reach above BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50),
  the second is BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57),
- the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). 
+ the third is BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64).
  
  By default the first stage tries to writeback the amount of dirty data
  in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) seconds,
  the second stage tries to writeback the amount of dirty data in one bucket
  in (1 / (dirty_buckets_percent - 57)) * 100 milliseconds, the third
  stage tries to writeback the amount of dirty data in one bucket in
  (1 / (dirty_buckets_percent - 64)) milliseconds.
  
  The initial rate at each stage can be controlled by 3 configurable
- parameters: 
+ parameters:
  
  writeback_rate_fp_term_{low|mid|high}
  
  They are by default 1, 10, 1000, chosen based on testing and production
  data, detailed below.
  
  A. When it comes to the low stage, it is still far from the 70%
-threshold, so we only want to give it a little bit push by setting the
-term to 1, it means the initial rate will be 170 if the fragment is 6,
-it is calculated by bucket_size/fragment, this rate is very small,
-but still much more reasonable than the minimum 8.
-For a production bcache with non-heavy workload, if the cache device
-is bigger than 1 TB, it may take hours to consume 1% buckets,
-so it is very possible to reclaim enough dirty buckets in this stage,
-thus to avoid entering the next stage.
+    threshold, so we only want to give it a little bit push by setting the
+    term to 1, it means the initial rate will be 170 if the fragment is 6,
+    it is calculated by bucket_size/fragment, this rate is very small,
+    but still much more reasonable than the minimum 8.
+    For a production bcache with non-heavy workload, if the cache device
+    is bigger than 1 TB, it may take hours to consume 1% buckets,
+    so it is very possible to reclaim enough dirty buckets in this stage,
+    thus to avoid entering the next stage.
  
  B. If the dirty buckets ratio didn’t turn around during the first stage,
-it comes to the mid stage, then it is necessary for mid stage
-to be more aggressive than low stage, so the initial rate is chosen
-to be 10 times more than the low stage, which means 1700 as the initial
-rate if the fragment is 6. This is a normal rate
-we usually see for a normal workload when writeback happens
-because of writeback_percent.
+    it comes to the mid stage, then it is necessary for mid stage
+    to be more aggressive than 
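
The three-stage acceleration described in the [Fix] text above can be
illustrated with a small model. This is a sketch built only from the
numbers quoted in the description (thresholds 50/57/64, default terms
1/10/1000, rate derived from bucket_size/fragment); it is not the kernel
code. The function name fragment_rate and the bucket size of 1024
sectors are assumptions, the latter chosen because it reproduces the
quoted initial rates of 170 and 1700 at a fragment of 6.

```python
# Thresholds and default terms as quoted in the bug description.
THRESHOLD_LOW, THRESHOLD_MID, THRESHOLD_HIGH = 50, 57, 64
FP_TERM_LOW, FP_TERM_MID, FP_TERM_HIGH = 1, 10, 1000


def fragment_rate(dirty_buckets_percent, bucket_size, fragment):
    """Estimate the fragmentation-driven writeback rate (sectors/second).

    bucket_size is in sectors (assumed unit); fragment is the ratio of
    bucket capacity to the dirty data held per bucket. Returns 0 when the
    dirty bucket percentage is not above the low threshold.
    """
    if dirty_buckets_percent <= THRESHOLD_LOW:
        return 0
    if dirty_buckets_percent <= THRESHOLD_MID:
        term = FP_TERM_LOW * (dirty_buckets_percent - THRESHOLD_LOW)
    elif dirty_buckets_percent <= THRESHOLD_HIGH:
        term = FP_TERM_MID * (dirty_buckets_percent - THRESHOLD_MID)
    else:
        term = FP_TERM_HIGH * (dirty_buckets_percent - THRESHOLD_HIGH)
    # Average dirty sectors per bucket, scaled by the stage term.
    return (bucket_size // fragment) * term


print(fragment_rate(51, 1024, 6))  # low stage, first percent above 50 -> 170
print(fragment_rate(58, 1024, 6))  # mid stage, first percent above 57 -> 1700
```

Under these assumptions each stage starts ten (mid) or a thousand (high)
times steeper than the low stage, and the rate grows linearly with every
percent of dirty buckets above the stage threshold.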

[Kernel-packages] [Bug 1906476] Re: PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed

2021-01-19 Thread Nivedita Singhvi
** Tags added: seg

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to zfs-linux in Ubuntu.
https://bugs.launchpad.net/bugs/1906476

Title:
  PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 ==
  sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED,
  &zp->z_sa_hdl)) failed

Status in zfs-linux package in Ubuntu:
  Confirmed

Bug description:
  Since today while running Ubuntu 21.04 Hirsute I started getting a ZFS
  panic in the kernel log which was also hanging Disk I/O for all
  Chrome/Electron Apps.

  I have narrowed down a few important notes:
  - It does not happen with module version 0.8.4-1ubuntu11 built and included
  with 5.8.0-29-generic

  - It was happening when using zfs-dkms 0.8.4-1ubuntu16 built with DKMS
  on the same kernel and also on 5.8.18-acso (a custom kernel).

  - For whatever reason multiple Chrome/Electron apps were affected,
  specifically Discord, Chrome and Mattermost. In all cases they seem
  to hang trying to open files in their 'Cache' directory, e.g.
  ~/.cache/google-chrome/Default/Cache and ~/.config/Mattermost/Cache
  (I was unable to strace the processes, so this is hard to confirm
  100%; it is a deduction from /proc/PID/fd and the hanging ls).
  While the issue was going on I could not list that directory either;
  "ls" would just hang.

  - Once I removed zfs-dkms only to revert to the kernel built-in
  version it immediately worked without changing anything, removing
  files, etc.

  - It happened over multiple reboots and kernels every time, all my
  Chrome apps weren't working but for whatever reason nothing else
  seemed affected.

  - It would log a series of spl_panic dumps into kern.log that look like this:
  Dec  2 12:36:42 optane kernel: [   72.857033] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
  Dec  2 12:36:42 optane kernel: [   72.857036] PANIC at zfs_znode.c:335:zfs_znode_sa_init()

  I could only find one other Google reference to this issue, with 2 other
  users reporting the same error but on 20.04 here:
  https://github.com/openzfs/zfs/issues/10971

  - I was not experiencing the issue on 0.8.4-1ubuntu14 and fairly sure
  it was working on 0.8.4-1ubuntu15 but broken after upgrade to
  0.8.4-1ubuntu16. I will reinstall those zfs-dkms versions to verify
  that.

  There were a few originating call stacks but the first one I hit was

  Call Trace:
   dump_stack+0x74/0x95
   spl_dumpstack+0x29/0x2b [spl]
   spl_panic+0xd4/0xfc [spl]
   ? sa_cache_constructor+0x27/0x50 [zfs]
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? dmu_buf_set_user_ie+0x54/0x80 [zfs]
   zfs_znode_sa_init+0xe0/0xf0 [zfs]
   zfs_znode_alloc+0x101/0x700 [zfs]
   ? arc_buf_fill+0x270/0xd30 [zfs]
   ? __cv_init+0x42/0x60 [spl]
   ? dnode_cons+0x28f/0x2a0 [zfs]
   ? _cond_resched+0x19/0x40
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? aggsum_add+0x153/0x170 [zfs]
   ? spl_kmem_alloc_impl+0xd8/0x110 [spl]
   ? arc_space_consume+0x54/0xe0 [zfs]
   ? dbuf_read+0x4a0/0xb50 [zfs]
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? dnode_rele_and_unlock+0x5a/0xc0 [zfs]
   ? _cond_resched+0x19/0x40
   ? mutex_lock+0x12/0x40
   ? dmu_object_info_from_dnode+0x84/0xb0 [zfs]
   zfs_zget+0x1c3/0x270 [zfs]
   ? dmu_buf_rele+0x3a/0x40 [zfs]
   zfs_dirent_lock+0x349/0x680 [zfs]
   zfs_dirlook+0x90/0x2a0 [zfs]
   ? zfs_zaccess+0x10c/0x480 [zfs]
   zfs_lookup+0x202/0x3b0 [zfs]
   zpl_lookup+0xca/0x1e0 [zfs]
   path_openat+0x6a2/0xfe0
   do_filp_open+0x9b/0x110
   ? __check_object_size+0xdb/0x1b0
   ? __alloc_fd+0x46/0x170
   do_sys_openat2+0x217/0x2d0
   ? do_sys_openat2+0x217/0x2d0
   do_sys_open+0x59/0x80
   __x64_sys_openat+0x20/0x30
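
  When reading a trace like the one above, it can help to separate the
  frames the unwinder confirmed from the ones prefixed with "?", which
  the kernel prints for addresses found on the stack that it could not
  verify as part of the call chain. The following is a minimal sketch;
  the function name reliable_frames is hypothetical and the pattern only
  assumes the "symbol+0xoff/0xsize [module]" layout seen in this trace.

```python
import re

# A frame looks like "symbol+0xOFF/0xSIZE [module]"; a leading "?" marks
# an entry the unwinder could not confirm.
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def reliable_frames(trace_lines):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_lines:
        match = FRAME_RE.match(line)
        if match and not match.group("unreliable"):
            # Frames without a [module] suffix belong to the core kernel.
            frames.append((match.group("symbol"), match.group("module") or "kernel"))
    return frames
```

  Applied to the trace above, this would leave the chain from
  __x64_sys_openat through zpl_lookup and zfs_zget down to spl_panic.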

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476/+subscriptions



[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system

2020-12-08 Thread Nivedita Singhvi
** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Seems to be closely related to
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578

  After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126
  the fstrim command triggered by fstrim.timer causes a severe number of
  mismatches between two RAID10 component devices.

  This bug affects several machines in our company with different HW
  configurations (All using ECC RAM). Both, NVMe and SATA SSDs are
  affected.

  How to reproduce:
   - Create a RAID10 LVM and filesystem on two SSDs
       mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
       pvcreate -ff -y /dev/md0
       vgcreate -f -y VolGroup /dev/md0
       lvcreate -n root -L 100G -ay -y VolGroup
       mkfs.ext4 /dev/VolGroup/root
       mount /dev/VolGroup/root /mnt
   - Write some data, sync and delete it
       dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
       sync
       rm /mnt/data.raw
   - Check the RAID device
       echo check >/sys/block/md0/md/sync_action
   - After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
       cat /sys/block/md0/md/mismatch_cnt
   - Trigger the bug
       fstrim /mnt
   - Re-check the RAID device
       echo check >/sys/block/md0/md/sync_action
   - After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*1):
       cat /sys/block/md0/md/mismatch_cnt
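
  The check-and-read steps in the list above can also be scripted. The
  following is a minimal sketch, assuming the md sysfs directory is
  passed in (e.g. /sys/block/md0/md) and that sync_action reads "idle"
  once the check has finished; the function names are hypothetical and
  running it against real sysfs requires root.

```python
import time
from pathlib import Path


def start_check(md_dir):
    """Request a consistency check, like `echo check > sync_action`."""
    (Path(md_dir) / "sync_action").write_text("check\n")


def wait_until_idle(md_dir, poll_seconds=5.0, timeout=None):
    """Poll sync_action until it reads 'idle'; return False on timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while (Path(md_dir) / "sync_action").read_text().strip() != "idle":
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


def read_mismatch_cnt(md_dir):
    """Return the mismatch count reported after the last check."""
    return int((Path(md_dir) / "mismatch_cnt").read_text())
```

  Calling start_check, wait_until_idle and read_mismatch_cnt before and
  after fstrim gives the two values the reproducer compares.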

  After investigating this issue on several machines it *seems* that the
  first drive does the trim correctly while the second one goes wild. At
  least the number and severity of errors found by fsck.ext4 in a USB stick
  live session suggest this.

  To perform the single-drive evaluation, the RAID10 was started using one
  drive at a time:
    mdadm --assemble /dev/md127 /dev/nvme0n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

    vgchange -a n /dev/VolGroup
    mdadm --stop /dev/md127

    mdadm --assemble /dev/md127 /dev/nvme1n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

  When starting these fscks without -n, on the first device it seems the
  directory structure is OK while on the second device there is only the
  lost+found folder left.

  Side-note: Another machine using HWE kernel 5.4.0-56 (after using -53
  before) seems to have a quite similar issue.

  Unfortunately the risk/regression assessment in the aforementioned bug
  is not complete: the workaround only mitigates the issues during FS
  creation. This bug on the other hand is triggered by a weekly service
  (fstrim) causing severe file system corruption.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/+subscriptions



[Kernel-packages] [Bug 1521173] Re: AER: Corrected error received: id=00e0

2020-12-02 Thread Nivedita Singhvi
Seen this as well -- although I don't believe it's causing any
problems that we know of -- sure does look right now like it's 
only noise in the logs.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1521173

Title:
  AER: Corrected error received: id=00e0

Status in Linux:
  Unknown
Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  Triaged

Bug description:
  WORKAROUND: add pci=noaer to your kernel command line:

  1) edit /etc/default/grub and add pci=noaer to the line starting with
  GRUB_CMDLINE_LINUX_DEFAULT. It will look like this:
  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
  2) run "sudo update-grub"
  3) reboot
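
  Step 1 of the workaround above can be automated when many machines are
  affected. The following is a minimal sketch that only transforms the
  text of /etc/default/grub; the function name add_kernel_param is
  hypothetical, it assumes the variable is written with double quotes as
  in the example line, and steps 2 and 3 (update-grub and reboot) are
  still needed afterwards.

```python
import re


def add_kernel_param(grub_text, param="pci=noaer"):
    """Return grub_text with param added to GRUB_CMDLINE_LINUX_DEFAULT."""

    def patch(match):
        args = match.group(2).split()
        # Leave the line unchanged if the parameter is already present.
        if param not in args:
            args.append(param)
        return match.group(1) + '"' + " ".join(args) + '"'

    return re.sub(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"',
        patch,
        grub_text,
        flags=re.MULTILINE,
    )
```

  The function is idempotent, so running it twice on the same file
  content does not duplicate the parameter.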

  

  My dmesg gets completely spammed with the following messages appearing
  over and over again. It stops after one s3 cycle; it only happens
  after reboot.

  [ 5315.986588] pcieport :00:1c.0: AER: Corrected error received: id=00e0
  [ 5315.987249] pcieport :00:1c.0: can't find device of ID00e0
  [ 5315.995632] pcieport :00:1c.0: AER: Corrected error received: id=00e0
  [ 5315.995664] pcieport :00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
  [ 5315.995674] pcieport :00:1c.0:   device [8086:9d14] error status/mask=0001/2000
  [ 5315.995683] pcieport :00:1c.0:[ 0] Receiver Error
  [ 5316.002772] pcieport :00:1c.0: AER: Corrected error received: id=00e0
  [ 5316.002811] pcieport :00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
  [ 5316.002826] pcieport :00:1c.0:   device [8086:9d14] error status/mask=0001/2000
  [ 5316.002838] pcieport :00:1c.0:[ 0] Receiver Error
  [ 5316.009926] pcieport :00:1c.0: AER: Corrected error received: id=00e0
  [ 5316.009964] pcieport :00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
  [ 5316.009979] pcieport :00:1c.0:   device [8086:9d14] error status/mask=0001/2000
  [ 5316.009991] pcieport :00:1c.0:[ 0] Receiver Error

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.2.0-19-generic 4.2.0-19.23 [modified: boot/vmlinuz-4.2.0-19-generic]
  ProcVersionSignature: Ubuntu 4.2.0-19.23-generic 4.2.6
  Uname: Linux 4.2.0-19-generic x86_64
  ApportVersion: 2.19.2-0ubuntu8
  Architecture: amd64
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/pcmC0D0c:   david  1502 F...m pulseaudio
   /dev/snd/controlC0:  david  1502 F pulseaudio
  CurrentDesktop: Unity
  Date: Mon Nov 30 13:19:00 2015
  EcryptfsInUse: Yes
  HibernationDevice: RESUME=UUID=fe528b90-b4eb-4a20-82bd-6a03b79cfb14
  InstallationDate: Installed on 2015-11-28 (2 days ago)
  InstallationMedia: Ubuntu 16.04 LTS "Xenial Xerus" - Alpha amd64 (20151127)
  MachineType: Dell Inc. Inspiron 13-7359
  ProcFB: 0 inteldrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.2.0-19-generic.efi.signed root=UUID=94d54f88-5d18-4e2b-960a-8717d6e618bb ro noprompt persistent quiet splash vt.handoff=7
  RelatedPackageVersions:
   linux-restricted-modules-4.2.0-19-generic N/A
   linux-backports-modules-4.2.0-19-generic  N/A
   linux-firmware                            1.153
  SourcePackage: linux
  UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 08/07/2015
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 01.00.00
  dmi.board.name: 0NT3WX
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A00
  dmi.chassis.type: 9
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr01.00.00:bd08/07/2015:svnDellInc.:pnInspiron13-7359:pvr:rvnDellInc.:rn0NT3WX:rvrA00:cvnDellInc.:ct9:cvr:
  dmi.product.name: Inspiron 13-7359
  dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/1521173/+subscriptions



[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

2020-07-02 Thread Nivedita Singhvi
Some of the 4.15 kernels fixed:

Bionic linux kernel: 4.15.0-109.110
Bionic linux-aws kernel: 4.15.0-1077.81
Xenial linux-hwe kernel: 4.15.0-107.108~16.04.1 
Xenial linux-gcp kernel: 4.15.0-1078.88~16.04.1

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1879658

Title:
  Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  Fix Released

Bug description:
  [IMPACT]

  Setting an MTU larger than the default 1500 results in an
  error on the recent (4.15.0-92+) Bionic/Xenial -hwe kernels
  when attempting to create ipvlan interfaces:

  # ip link add test0 mtu 9000 link eno1 type ipvlan mode l2
  RTNETLINK answers: Invalid argument

  This breaks Docker and other applications which use a Jumbo
  MTU (9000) when using ipvlans.

  The bug is caused by the following recent commit to Bionic
  & Xenial-hwe, pulled in via the stable patchset below, which
  enforces a strict min/max MTU when MTUs are being set up
  via rtnetlink for ipvlans:

  Breaking commit:
  ---
  Ubuntu-hwe-4.15.0-92.93~16.04.1
  * Bionic update: upstream stable patchset 2020-02-21 (LP: #1864261)
    * net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link()

  The above patch applies checks of dev->min_mtu and dev->max_mtu
  to prevent a malicious user from crashing the kernel with a bad
  value. It was a fix to the original patchset that centralized min/max
  MTU checking from various different subsystems of the networking
  kernel. However, in that patchset, the max_mtu for ipvlan had not
  been set to the largest phys (64K) or jumbo (9000 bytes) value, and
  defaults to 1500. The recent commit above, which enforces strict
  bounds checking for MTU size, exposes the bug of the max mtu not
  being set correctly for the ipvlan driver (this has been previously
  fixed in the bonding and teaming drivers).
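
  The interaction described above can be modelled in a few lines. This
  is a sketch of the bounds check only, not the kernel implementation;
  the function name mtu_allowed is hypothetical, and the constants are
  the usual Ethernet values (minimum 68, default payload 1500, maximum
  65535) assumed for illustration.

```python
ETH_MIN_MTU = 68        # smallest MTU accepted for an Ethernet-like device
ETH_DATA_LEN = 1500     # default maximum when a driver does not raise it
ETH_MAX_MTU = 0xFFFF    # 65535, the value the ipvlan fix switches to


def mtu_allowed(mtu, min_mtu=ETH_MIN_MTU, max_mtu=ETH_DATA_LEN):
    """Model of a strict min/max MTU check at link creation time."""
    if mtu < min_mtu:
        return False
    # A max_mtu of 0 is treated here as "no upper bound configured".
    if max_mtu > 0 and mtu > max_mtu:
        return False
    return True


# Before the ipvlan fix: max_mtu is left at the 1500 default.
print(mtu_allowed(9000))                       # False, "Invalid argument"
# After the fix: the driver advertises ETH_MAX_MTU.
print(mtu_allowed(9000, max_mtu=ETH_MAX_MTU))  # True
```

  In this model the strict check is correct in both cases; what changes
  is the upper bound the driver advertises, which is why the one-line
  ipvlan change is enough to restore jumbo MTUs.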

  Fix:
  ---
  This was fixed in the upstream kernel as of v4.18-rc2 for ipvlans,
  but was not backported to Bionic along with other patches. The missing
  commit in the Bionic backport:

  ipvlan: use ETH_MAX_MTU as max mtu
  commit 548feb33c598dfaf9f8e066b842441ac49b84a8a

  [Test Case]

  1. Install any kernel earlier than 4.15.0-92 (Bionic/Xenial-hwe)

  2. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
     (where eno1 is the physical interface you are adding
     the ipvlan on)

  3. # ip link
  ...
  14: test1@eno1:  mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000
  ...
    // check that your test1 ipvlan is created with mtu 9000

  4. Install 4.15.0-92 kernel or later

  5. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
  RTNETLINK answers: Invalid argument

  6. With the above fix commit backported to the xenial-hwe/Bionic,
  the jumbo mtu ipvlan creation works again, identical to before 92.

  [Regression Potential]

  This commit is in upstream mainline as of v4.18-rc2, and hence
  is already in Cosmic and later, i.e. all post Bionic releases
  currently. Hence there's low regression potential here.

  It only impacts ipvlan functionality, and not other networking
  systems, so core systems should not be affected by this. It takes
  effect at setup time, so ipvlan creation either works or doesn't.
  The patch is trivial.

  It only impacts Bionic/Xenial-hwe 4.15.0-92 onwards versions
  (where the latent bug got exposed).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions



[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

2020-06-15 Thread Nivedita Singhvi
Packages tested

linux-gcp (4.15.0-1078.88~16.04.1) xenial;
linux-hwe (4.15.0-107.108~16.04.1) xenial;
linux-gcp-4.15 (4.15.0-1078.88) bionic;
linux (4.15.0-107.108) bionic;

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1879658

Title:
  Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  [IMPACT]

  Setting an MTU larger than the default 1500 results in an
  error on the recent (4.15.0-92+) Bionic/Xenial -hwe kernels
  when attempting to create ipvlan interfaces:

  # ip link add test0 mtu 9000 link eno1 type ipvlan mode l2
  RTNETLINK answers: Invalid argument

  This breaks Docker and other applications which use a Jumbo
  MTU (9000) when using ipvlans.

  The bug is caused by the following recent commit to Bionic
  & Xenial-hwe; which is pulled in via the stable patchset below,
  which enforces a strict min/max MTU when MTUs are being set up
  via rtnetlink for ipvlans:

  Breaking commit:
  ---
  Ubuntu-hwe-4.15.0-92.93~16.04.1
  * Bionic update: upstream stable patchset 2020-02-21 (LP: #1864261)
    * net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link()

  The above patch applies checks of dev->min_mtu and dev->max_mtu
  to prevent a malicious user from crashing the kernel with a bad
  value. It patched the original patchset that centralized min/max
  MTU checking from various subsystems of the networking
  kernel. However, in that patchset, the max_mtu had not been set
  to the largest physical (64K) or jumbo (9000 bytes) size, and
  defaults to 1500. The recent commit above, which enforces strict
  bounds checking for MTU size, exposes the bug of max_mtu not being
  set correctly for the ipvlan driver (this was previously fixed in
  the bonding and teaming drivers).
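
The interaction described above can be modelled in a few lines of shell. This is only an illustrative sketch, not the kernel code: the function name validate_mtu is hypothetical, and the bounds used (68 as the minimum, 1500 as the default maximum, 65535 for ETH_MAX_MTU) are assumptions about the kernel constants rather than values quoted in this report.

```shell
#!/bin/sh
# Illustrative model of a strict min/max MTU check (not the kernel source).
# Usage: validate_mtu MTU MIN_MTU MAX_MTU
# Returns 0 if the MTU is accepted, 1 if it would be rejected
# (the case reported to the user as "Invalid argument").
validate_mtu() {
    mtu=$1; min_mtu=$2; max_mtu=$3
    if [ "$mtu" -lt "$min_mtu" ]; then
        return 1
    fi
    # In this sketch a max_mtu of 0 is treated as "no upper bound".
    if [ "$max_mtu" -gt 0 ] && [ "$mtu" -gt "$max_mtu" ]; then
        return 1
    fi
    return 0
}

# Assumed values: 68 = minimum, 1500 = default maximum, 65535 = ETH_MAX_MTU.
# With the default maximum of 1500, a jumbo MTU request is refused:
if validate_mtu 9000 68 1500; then echo "9000 accepted"; else echo "9000 rejected"; fi
# With the maximum raised to ETH_MAX_MTU, the same request passes:
if validate_mtu 9000 68 65535; then echo "9000 accepted"; else echo "9000 rejected"; fi
```

The point of the sketch is that the check itself is correct; the failure comes from the upper bound being left at its default for the ipvlan device.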

  Fix:
  ---
  This was fixed in the upstream kernel as of v4.18-rc2 for ipvlans,
  but was not backported to Bionic along with other patches. The
  missing commit in the Bionic backport:

  ipvlan: use ETH_MAX_MTU as max mtu
  commit 548feb33c598dfaf9f8e066b842441ac49b84a8a

  [Test Case]

  1. Install any kernel earlier than 4.15.0-92 (Bionic/Xenial-hwe)

  2. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
     (where test1 is the new ipvlan and eno1 is the physical
     interface you are adding the ipvlan on)

  3. # ip link
  ...
  14: test1@eno1:  mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000
  ...
    // check that your test1 ipvlan is created with mtu 9000

  4. Install 4.15.0-92 kernel or later

  5. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
  RTNETLINK answers: Invalid argument

  6. With the above fix commit backported to Xenial-hwe/Bionic,
  jumbo MTU ipvlan creation works again, as it did before 4.15.0-92.
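
The steps above can be scripted roughly as follows. This is a minimal sketch under stated assumptions: the function names mtu_from_link_line and check_ipvlan_mtu are hypothetical, the script must be run as root to create interfaces, and the interface names are passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Sketch of the test case above. Function names are hypothetical.

# Print the MTU value found in one line of `ip link` output, or nothing
# if the line carries no "mtu <number>" field.
mtu_from_link_line() {
    printf '%s\n' "$1" | sed -n 's/.* mtu \([0-9][0-9]*\).*/\1/p'
}

# Try to create an ipvlan with the requested MTU and report the result.
# Must be run as root. Arguments: PARENT_IFACE IPVLAN_NAME MTU
check_ipvlan_mtu() {
    parent=$1; name=$2; want=$3
    if ! ip link add "$name" mtu "$want" link "$parent" type ipvlan mode l2; then
        echo "FAIL: could not create $name with mtu $want on $parent"
        return 1
    fi
    got=$(mtu_from_link_line "$(ip -o link show dev "$name")")
    ip link del "$name"    # clean up the temporary test interface
    if [ "$got" = "$want" ]; then
        echo "PASS: $name created with mtu $got"
    else
        echo "FAIL: $name has mtu ${got:-unknown}, wanted $want"
        return 1
    fi
}
```

For the case in this report the call would be `check_ipvlan_mtu eno1 test1 9000`, run once on each kernel being compared.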

  [Regression Potential]

  This commit is in upstream mainline as of v4.18-rc2, and hence
  is already in Cosmic and later, i.e. all post Bionic releases
  currently. Hence there's low regression potential here.

  It only impacts ipvlan functionality, and not other networking
  systems, so core systems should not be affected by this. It
  takes effect at setup, so creation either works or doesn't. The
  patch is trivial.

  It only impacts Bionic/Xenial-hwe versions 4.15.0-92 onwards
  (where the latent bug got exposed).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions



[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

2020-06-14 Thread Nivedita Singhvi
Tested.


** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1879658

Title:
  Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  Fix Committed


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions



[Kernel-packages] [Bug 1882039] Re: The thread level parallelism would be a bottleneck when searching for the shared pmd by using hugetlbfs

2020-06-10 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu Bionic)
   Importance: Medium => High

** Changed in: linux (Ubuntu Bionic)
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Eoan)
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Bionic)
 Assignee: (unassigned) => Gavin Guo (mimi0213kimo)

** Changed in: linux (Ubuntu Focal)
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu Focal)
   Importance: Medium => High

** Changed in: linux (Ubuntu Eoan)
   Importance: Medium => High

** Changed in: linux (Ubuntu)
   Importance: Medium => High

** Changed in: linux (Ubuntu Eoan)
 Assignee: (unassigned) => Gavin Guo (mimi0213kimo)

** Changed in: linux (Ubuntu Focal)
 Assignee: (unassigned) => Gavin Guo (mimi0213kimo)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1882039

Title:
  The thread level parallelism would be a bottleneck when searching for
  the shared pmd by using hugetlbfs

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Eoan:
  In Progress
Status in linux source package in Focal:
  In Progress

Bug description:
  [Impact]

  Performance overhead is observed when many threads
  use hugetlbfs in a database environment.

  [Fix]

  bdfbd98bc018 hugetlbfs: take read_lock on i_mmap for PMD sharing

  The patch improves the locking by taking the read lock instead of the
  write lock. Because the search makes no modifications, this allows
  multiple threads to search for a suitable shared VMA concurrently. The
  change increases parallelism and decreases the waiting time of
  the other threads.

  [Test]

  The customer stands up a database with seed data. They then run a
  load "driver" which makes a number of connections that look like user
  workflows from the database's perspective. Finally, the improvement in
  measured response times can be observed for these "users", as well as
  in various other metrics at the database level.

  [Regression Potential]

  The modification only replaces the write lock with a read lock, and
  nothing inside the loop is modified. The regression probability is
  low.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1882039/+subscriptions



[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

2020-05-28 Thread Nivedita Singhvi
Note that the fixes for all the above series are already released,
i.e., from Ubuntu-4.15.0-73.82.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title:
  Losing port aggregate with 802.3ad port-channel/bonding aggregation on
  reboot

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Disco:
  Fix Released
Status in linux source package in Eoan:
  Fix Released
Status in linux source package in Focal:
  Fix Released

Bug description:
  We are losing port channel aggregation on reboot.

  After the reboot, /var/log/syslog contains the entries:
  [  250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
  Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
  [  282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
  Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports

  Aggregator IDs of the slave interfaces are different:
  ubuntu@node-6:~$ cat /proc/net/bonding/bond2 
  Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

  Bonding Mode: IEEE 802.3ad Dynamic link aggregation
  Transmit Hash Policy: layer3+4 (1)
  MII Status: up
  MII Polling Interval (ms): 100
  Up Delay (ms): 0
  Down Delay (ms): 0

  802.3ad info
  LACP rate: fast
  Min links: 0
  Aggregator selection policy (ad_select): stable

  Slave Interface: enp24s0f1np1
  MII Status: up
  Speed: 1 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: b0:26:28:48:9f:51
  Slave queue ID: 0
  Aggregator ID: 1
  Actor Churn State: none
  Partner Churn State: none
  Actor Churned Count: 0
  Partner Churned Count: 0

  Slave Interface: enp24s0f0np0
  MII Status: up
  Speed: 1 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: b0:26:28:48:9f:50
  Slave queue ID: 0
  Aggregator ID: 2
  Actor Churn State: churned
  Partner Churn State: churned
  Actor Churned Count: 1
  Partner Churned Count: 1

  The mismatch in "Aggregator ID" on the port is a symptom of the issue.
  If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up',
  the port with the mismatched ID appears to renegotiate with the port-
  channel and becomes aggregated.
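
The symptom described above can be checked mechanically. The following is a minimal sketch, assuming the file follows the layout shown in this report (a "Slave Interface:" line followed later by an "Aggregator ID:" line for that slave); the function name aggregator_ids_match is hypothetical.

```shell
#!/bin/sh
# Sketch: report whether all slaves of a bond share one Aggregator ID.
# Reads text in the /proc/net/bonding/<bond> layout on stdin and prints
# "match" or "mismatch". Only Aggregator ID lines that follow a
# "Slave Interface:" line are counted.
aggregator_ids_match() {
    awk '
        /^[ \t]*Slave Interface:/ { in_slave = 1 }
        in_slave && /^[ \t]*Aggregator ID:/ {
            id = $NF
            if (!(id in seen)) { seen[id] = 1; distinct++ }
            in_slave = 0
        }
        END { if (distinct > 1) print "mismatch"; else print "match" }
    '
}
```

On the affected host this would be run as `aggregator_ids_match < /proc/net/bonding/bond2`; with the output quoted above (IDs 1 and 2) it reports a mismatch.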

  Another way to work around this issue is to bring the bond ports down
  and then bring up port enp24s0f0np0 first and port enp24s0f1np1 second.

  When I change the order of bringing the ports up (first enp24s0f1np1,
  and second enp24s0f0np0), the issue is still there.

  When the issue occurs, a port on the switch, corresponding to
  interface enp24s0f0np0 is in Suspended state. After applying the
  workaround the port is no longer in Suspended state and Aggregator IDs
  in /proc/net/bonding/bond2 are equal.

  I installed the 5.0.0 kernel; the issue is still there.

  Operating System: 
  Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)

  ubuntu@node-6:~$ uname -a
  Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  ubuntu@node-6:~$ sudo lspci -vnvn
  https://pastebin.ubuntu.com/p/Dy2CKDbySC/

  Hardware: Dell PowerEdge R740xd
  BIOS version: 2.1.7

  sosreport: https://drive.google.com/open?id=1-eN7cZJIeu-AQBEU7Gw8a_AJTuq0AOZO

  ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G
  https://pastebin.ubuntu.com/p/sqCx79vZWM/

  ubuntu@node-6:~$ lspci -n | grep 18:00
  18:00.0 0200: 14e4:16d8 (rev 01)
  18:00.1 0200: 14e4:16d8 (rev 01)

  ubuntu@node-6:~$ modinfo bnx2x
  https://pastebin.ubuntu.com/p/pkmzsFjK8M/

  ubuntu@node-6:~$ ip -o l 
  https://pastebin.ubuntu.com/p/QpW7TjnT2v/

  ubuntu@node-6:~$ ip -o a
  https://pastebin.ubuntu.com/p/MczKtrnmDR/

  ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml
  https://pastebin.ubuntu.com/p/9cZpPc7C6P/

  ubuntu@node-6:~$ sudo lshw -c network
  https://pastebin.ubuntu.com/p/gmfgZptzDT/
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Jun 26 10:21 seq
   crw-rw 1 root audio 116, 33 Jun 26 10:21 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.6
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  DistroRelease: Ubuntu 18.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  Lsusb:
   Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
   Bus 001 Device 004: ID 1604:10c0 Tascam 
   Bus 001 Device 003: ID 1604:10c0 Tascam 
   Bus 001 Device 002: ID 1604:10c0 Tascam 
   Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
  MachineType: Dell Inc. PowerEdge R740xd
  Package: linux (not installed)
  

[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

2020-05-28 Thread Nivedita Singhvi
Could anyone hitting this bug confirm that it is a duplicate
of LP Bug 1852077 and that the latest releases fix this
issue?

The handling of the state changes/updates got muddled here
because this bug was not simply marked as a duplicate and closed.

Otherwise, I will close this next week.

** Changed in: linux (Ubuntu Focal)
   Status: In Progress => Fix Released

** Changed in: linux (Ubuntu Bionic)
   Status: Fix Committed => Fix Released

** Changed in: linux (Ubuntu Disco)
   Status: Fix Committed => Fix Released

** Changed in: linux (Ubuntu Eoan)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title:
  Losing port aggregate with 802.3ad port-channel/bonding aggregation on
  reboot

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Disco:
  Fix Released
Status in linux source package in Eoan:
  Fix Released
Status in linux source package in Focal:
  Fix Released


[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

2020-05-22 Thread Nivedita Singhvi
The test kernel has so far been tested successfully by the
original reporter and has fixed the Docker breakage
and related problems.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1879658

Title:
  Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions



[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

2020-05-21 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

** Changed in: linux (Ubuntu)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu)
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1879658

Title:
  Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions



[Kernel-packages] [Bug 1879658] Re: Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

2020-05-21 Thread Nivedita Singhvi
SRU request has been submitted.

If anyone would like to test, there are test images up on:
https://people.canonical.com/~nivedita/ipvlan-test-fix-278887/

You can 'wget' the files and then 'dpkg -i' the modules, 
linux-image, modules-extra debs in that order, and reboot.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1879658

Title:
  Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Bionic:
  In Progress


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1879658/+subscriptions



[Kernel-packages] [Bug 1879658] [NEW] Cannot create ipvlans with > 1500 MTU on recent Bionic kernels

2020-05-20 Thread Nivedita Singhvi
Public bug reported:

[IMPACT]

Setting an MTU larger than the default 1500 results in an
error on the recent (4.15.0-92+) Bionic/Xenial -hwe kernels
when attempting to create ipvlan interfaces:

# ip link add test0 mtu 9000 link eno1 type ipvlan mode l2
RTNETLINK answers: Invalid argument

This breaks Docker and other applications which use a Jumbo
MTU (9000) when using ipvlans.

The bug is caused by the following recent commit to Bionic
and Xenial-hwe, pulled in via the stable patchset below,
which enforces a strict min/max MTU when MTUs are set up
via rtnetlink for ipvlans:

Breaking commit:
---
Ubuntu-hwe-4.15.0-92.93~16.04.1
* Bionic update: upstream stable patchset 2020-02-21 (LP: #1864261)
  * net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link()

The above patch applies checks of dev->min_mtu and dev->max_mtu
to prevent a malicious user from crashing the kernel with a bad
value. It patched the original patchset that centralized min/max
MTU checking from various subsystems of the networking
kernel. However, in that patchset, the max_mtu had not been set
to the largest physical (64K) or jumbo (9000 bytes) size, and
defaults to 1500. The recent commit above, which enforces strict
bounds checking for MTU size, exposes the bug of max_mtu not being
set correctly for the ipvlan driver (this was previously fixed in
the bonding and teaming drivers).

Fix:
---
This was fixed in the upstream kernel as of v4.18-rc2 for ipvlans,
but was not backported to Bionic along with the other patches. The
missing commit in the Bionic backport:

ipvlan: use ETH_MAX_MTU as max mtu
commit 548feb33c598dfaf9f8e066b842441ac49b84a8a

[Test Case]

1. Install any kernel earlier than 4.15.0-92 (Bionic/Xenial-hwe)

2. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
   (where eno1 is the physical interface you are adding
   the ipvlan on)

3. # ip link
...
14: test1@eno1:  mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000
...
   // check that your test1 ipvlan is created with mtu 9000

4. Install 4.15.0-92 kernel or later

5. # ip link add test1 mtu 9000 link eno1 type ipvlan mode l2
RTNETLINK answers: Invalid argument

6. With the above fix commit backported to Xenial-hwe/Bionic,
jumbo-MTU ipvlan creation works again, exactly as it did before
4.15.0-92.
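
For repeated testing across kernels, the manual steps above could be
wrapped in a small helper along these lines. This is only a sketch: the
function names are made up, it assumes the iproute2 "ip" binary is on
the PATH, and try_create_ipvlan must be run as root on the machine
under test.

```python
import subprocess


def ipvlan_add_cmd(name, parent, mtu, mode="l2"):
    """Build the same 'ip link add' command used in the test case."""
    return ["ip", "link", "add", name, "mtu", str(mtu),
            "link", parent, "type", "ipvlan", "mode", mode]


def try_create_ipvlan(name, parent, mtu):
    """Try to create the ipvlan; return (succeeded, stderr text).

    On success the test interface is deleted again so the check can be
    repeated after installing a different kernel.
    """
    result = subprocess.run(ipvlan_add_cmd(name, parent, mtu),
                            capture_output=True, text=True)
    if result.returncode == 0:
        subprocess.run(["ip", "link", "del", name], check=False)
    return result.returncode == 0, result.stderr.strip()
```

Calling try_create_ipvlan("test1", "eno1", 9000) on an affected kernel
should report failure with the RTNETLINK error text shown in step 5.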

[Regression Potential]

This commit has been in upstream mainline since v4.18-rc2, and hence
is already in Cosmic and later, i.e. all current post-Bionic
releases. Hence the regression potential here is low.

It only impacts ipvlan functionality, and not other networking
subsystems, so core systems should not be affected by this. It also
takes effect only at setup time, so it either works or it doesn't.
The patch is trivial.

It only impacts Bionic/Xenial-hwe versions from 4.15.0-92 onwards
(where the latent bug got exposed).

** Affects: linux (Ubuntu)
 Importance: Critical
 Status: Incomplete

** Affects: linux (Ubuntu Bionic)
 Importance: Critical
 Status: Incomplete


** Tags: bionic sts

** Changed in: linux (Ubuntu)
   Importance: Undecided => Critical

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => Critical


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-05-11 Thread Nivedita Singhvi
The issue we have reported is easily avoided by specifying
the primary port to be the active interface of the bond. 

On netplan-using systems:

Add the directive "primary: $interface" (e.g. "primary: p94s0f0")
to the "parameters:" section of the netplan config file.
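
As an illustration of where the directive goes, the following sketch
renders a minimal netplan fragment with the primary set. The function
name and the layout are assumptions for illustration only, and the
interface names passed in are examples; a real config will carry
addresses and other settings as well.

```python
def render_bond_netplan(bond, interfaces, primary):
    """Render a minimal active-backup bond with an explicit primary port."""
    if primary not in interfaces:
        raise ValueError("primary must be one of the bond interfaces")
    lines = ["network:", "  version: 2", "  ethernets:"]
    # Member interfaces are declared so the bond can refer to them.
    lines += [f"    {name}: {{}}" for name in interfaces]
    lines += [
        "  bonds:",
        f"    {bond}:",
        f"      interfaces: [{', '.join(interfaces)}]",
        "      parameters:",
        "        mode: active-backup",
        f"        primary: {primary}",
    ]
    return "\n".join(lines) + "\n"


fragment = render_bond_netplan("bond0", ["enp94s0f0", "enp94s0f1d1"],
                               "enp94s0f0")
```

The resulting text would still need to be merged into the existing
netplan file and applied before the bond picks up the primary.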

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device
  appears to be dropping data.

  Basically, we are dropping data, as you can see from the benchmark
  tool as follows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam

  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
Device: X-Series Device
Mboard 0: X310
RX Channel: 0
  RX DSP: 0
  RX Dboard: A
  RX Subdev: SBX-120 RX
RX Channel: 1
  RX DSP: 0
  RX Dboard: B
  RX Subdev: SBX-120 RX
TX Channel: 0
  TX DSP: 0
  TX Dboard: A
  TX Subdev: SBX-120 TX
TX Channel: 1
  TX DSP: 0
  TX Dboard: B
  TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.
  Benchmark rate summary:
  Num received samples:     2995435936
  Num dropped samples:      4622800
  Num overruns detected:    0
  Num transmitted samples:  3008276544
  Num sequence errors (Tx): 0
  Num sequence errors (Rx): 15
  Num underruns detected:   0
  Num late commands:        0
  Num timeouts (Tx):        0
  Num timeouts (Rx):        0
  Done!

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$

  
  In this particular case description, the nodes are USRP X310s. However, we
  have the same issue with N210 nodes dropping samples when connected to the
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device.

  There is no problem with the USRPs themselves, as we have tested them
  with normal 1G network cards and have no dropped samples.

  Personally I think it's something to do with the 10G network card,
  possibly an Ubuntu driver issue?

  Note, Dell have said there is no hardware problem with the 10G
  interfaces

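  I have followed the troubleshooting information on this link to try to
  determine the problem: https://files.ettus.com/manual/page_usrp_x3x0_config.html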

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-05-11 Thread Nivedita Singhvi
Hello, diarmuid,

Re: original issue report, were you able to resolve your issue?

Please let us know.


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-05-11 Thread Nivedita Singhvi
We are closing this LP bug for now as we aren't able to reproduce 
in-house, and we cannot get access to a live testing repro env 
at this time.  

Here is what we know:

- There seems to be different performance for some tests when
  the NIC is configured with active-backup bonding mode, between
  the case when the active interface is the primary port, and 
  when the active interface is the secondary port. 
  i.e.:

Primary port: enp94s0f0 // when this is the active, works fine
Secondary port: enp94s0f1d1 // when this is the active, more drops

- Switch info: 2 x Fortigate 1024D switches, each machine is connected 
  to both

- NIC info:

root@u072:~# lspci | grep BCM57416
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

# ethtool -i enp1s0f0np0
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31

- Our attempt at a reproducer (initially reported in production env
  via graphical monitoring):

mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
good system = ~ 0% drops
bad systems = ~ 8% drops

We are not seeing drops in the NIC stats, nor UDP kernel drops, so it's
not clear where the packets are being dropped: whether they are being
dropped silently somewhere (?), or whether that's a red herring and
an mtr test issue, and what's seen in production is something else.

If someone can reproduce this, or something similar, or if we manage
to, we will re-open this bug or file a new one.


[Kernel-packages] [Bug 1811963] Re: Sporadic problems with X710 (i40e) and bonding where one interface is shown as "state DOWN" and without LOWER_UP

2020-04-16 Thread Nivedita Singhvi
Hi Malte,

Was this issue resolved for you?

There are several other possible causes, so if it's still a
problem with current mainline, please let us know.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1811963

Title:
  Sporadic problems with X710 (i40e) and bonding where one interface is
  shown as "state DOWN" and without LOWER_UP

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  After rebooting the physical server there is a 50/50 chance of all
  connected interfaces coming up. This affects Dell EMC R740s and R440s
  equipped with the X710 network cards.
  As far as I noticed (~20 reboots on different machines), this happens
  only when using bonding (in this case active-backup or mode 1, did not
  test different modes yet). The networking hardware on the other side
  shows the ports "connected". tcpdump shows frames being received, even
  if the interface is in "state DOWN".

  Tried with:

  Ubuntu 16.04, kernel 4.4.0-141, driver 2.7.26 (from the Intel website), firmware 18.8.9
  Ubuntu 16.04, kernel 4.4.0-141, driver 1.4.25-k, firmware 18.8.9
  Ubuntu 16.04, kernel 4.15.0-43 (hwe), driver 2.1.14-k, firmware 18.8.9

  The following excerpts were made using Intel's driver in version 2.7.26,
  therefore tainting the kernel, but the same happens using the original
  kernel's version or the hardware enablement kernel's version.

  Sporadic failure case:

  [6.319226] i40e: loading out-of-tree module taints kernel.
  [6.319227] i40e: loading out-of-tree module taints kernel.
  [6.319422] i40e: module verification failed: signature and/or required key missing - tainting kernel
  [6.410837] i40e: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver - version 2.7.26
  [6.410838] i40e: Copyright(c) 2013 - 2018 Intel Corporation.
  [6.423542] i40e :3b:00.0: fw 6.81.49447 api 1.7 nvm 6.80 0x80003d72 18.8.9
  [6.658526] i40e :3b:00.0: MAC address: ff:ff:ff:ff:ff:ff
  [6.710391] i40e :3b:00.0: PCI-Express: Speed 8.0GT/s Width x8
  [6.725692] i40e :3b:00.0: Features: PF-id[0] VFs: 64 VSIs: 2 QP: 40 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
  [6.750239] i40e :3b:00.1: fw 6.81.49447 api 1.7 nvm 6.80 0x80003d72 18.8.9
  [6.987874] i40e :3b:00.1: MAC address: ff:ff:ff:ff:ff:f1
  [7.005397] i40e :3b:00.1 eth0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
  [7.024993] i40e :3b:00.1: PCI-Express: Speed 8.0GT/s Width x8
  [7.040298] i40e :3b:00.1: Features: PF-id[1] VFs: 64 VSIs: 2 QP: 40 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
  [7.054384] i40e :3b:00.1 enp59s0f1: renamed from eth0
  [7.079613] i40e :3b:00.0 enp59s0f0: renamed from eth1
  [9.788893] i40e :3b:00.0 enp59s0f0: already using mac address ff:ff:ff:ff:ff:ff
  [9.819480] i40e :3b:00.1 enp59s0f1: set new mac address ff:ff:ff:ff:ff:ff

  [9.728194] bond0: Setting MII monitoring interval to 100
  [9.788690] bond0: Adding slave enp59s0f0
  [9.805195] bond0: Enslaving enp59s0f0 as a backup interface with a down link
  [9.819470] bond0: Adding slave enp59s0f1
  [9.836360] bond0: making interface enp59s0f1 the new active one
  [9.836614] bond0: Enslaving enp59s0f1 as an active interface with an up link

  Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

  Bonding Mode: fault-tolerance (active-backup)
  Primary Slave: None
  Currently Active Slave: enp59s0f1
  MII Status: up
  MII Polling Interval (ms): 100
  Up Delay (ms): 0
  Down Delay (ms): 0

  Slave Interface: enp59s0f0
  MII Status: down
  Speed: Unknown
  Duplex: Unknown
  Link Failure Count: 0
  Permanent HW addr: ff:ff:ff:ff:ff:ff
  Slave queue ID: 0

  Slave Interface: enp59s0f1
  MII Status: up
  Speed: Unknown
  Duplex: Unknown
  Link Failure Count: 0
  Permanent HW addr: ff:ff:ff:ff:ff:f1
  Slave queue ID: 0

  4: enp59s0f0:  mtu 1500 qdisc mq master bond0 portid  state DOWN group default qlen 1000
  link/ether ff:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
  5: enp59s0f1:  mtu 1500 qdisc mq master bond0 portid fff1 state UP group default qlen 1000
  link/ether ff:ff:ff:ff:ff:f1 brd ff:ff:ff:ff:ff:ff
  6: bond0:  mtu 1500 qdisc noqueue state UP group default qlen 1000
  link/ether ff:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
  inet 123.123.123.123/24 brd 123.123.123.255 scope global bond0
 valid_lft forever preferred_lft forever
  inet6 :::::/64 scope link 
 valid_lft forever preferred_lft forever

  bond0 Link encap:Ethernet  HWaddr ff:ff:ff:ff:ff:ff  
inet addr:123.123.123.123  Bcast:123.123.123.255  Mask:255.255.255.0
inet6 addr: :::::/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
   

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-14 Thread Nivedita Singhvi
Edwin,

Do you happen to notice any IPv6 or LLDP or other link-local traffic
on the interfaces? (including backup interface). 

The MTR loss % simply reflects packets transmitted versus
responses received, so for that UDP MTR test, this is saying
that UDP packets were lost somewhere.

The NIC does not show any drops in the ethtool -S stats, but I'm
still hunting down which files are the right before/after pair.

Other than the tpa_abort counts, there were no errors that I saw.
I can't tell what the tpa_abort means for the frame - is it purely 
a failure only to coalesce, or does it end up dropping packets at
some point in that functionality? I'm assuming not, as whatever the
reason, those would be counted as drops, I hope, and printed in 
the interface stats. 

I'll attach all the stats here once I get them sorted out, I thought
I had a clean diff of before and after from the tester, but after
looking through, I don't think the file I have is from before/after
the mtr test, as there was negligible UDP traffic. I'll try and
get clarification from the reporter. 

Note that when the primary= directive is used to configure
which interface is primary, and the primary port is used
as the active interface for the bond, no problems are seen (and
that works deterministically to set the correct active interface).
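
For sorting out the before/after stats, a small helper along the
following lines may be handy. It is a sketch which assumes the usual
"name: value" layout of ethtool -S output; the function names are made
up, and which counters matter depends on the driver.

```python
def parse_ethtool_stats(text):
    """Parse 'ethtool -S <iface>' output into a {counter: value} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so per-ring names such as
        # "[0]: some_counter" stay intact as the key.
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def changed_counters(before_text, after_text):
    """Return {counter: delta} for every counter whose value changed."""
    before = parse_ethtool_stats(before_text)
    after = parse_ethtool_stats(after_text)
    return {name: value - before.get(name, 0)
            for name, value in after.items()
            if value != before.get(name, 0)}
```

Feeding it one capture taken just before the mtr run and one taken just
after should leave only the counters that moved during the test.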


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-12 Thread Nivedita Singhvi
Additional observations.

MAAS is being used to deploy the system and configure
the bond interface and settings.

MAAS allows you to specify which is the primary interface, with
the other being the backup, for the active-backup bonding mode.
However, this does not appear to be working: MAAS is not passing
along a primary directive (in the netplan YAML, for instance), nor
otherwise ensuring that the setting is honored (still need to confirm).

MAAS allows you to enter a mac address for the bond interface,
but if not supplied, by default it will use the mac address of
the "primary" interface, as configured.

MAAS then populates the /etc/netplan/50-cloud-init.yaml, including
a macaddr= line with the default.

netplan then passes that along to systemd-networkd.

The kernel bonding driver, however, will use as the active interface
whichever interface is first attached to the bond (i.e., whichever
completes attaching to the bond interface first) in the
absence of a primary= directive.

The bonding driver will, however, use the supplied MAC address
as an override.

So let's say the active interface was configured in MAAS to be
f0, and its MAC is used as the MAC address of the bond,
but f1 (the second port of the NIC) actually gets attached
to the bond first and is used as the active interface by the
bond.

We have a situation where f0 = backup, f1 = active, and bond0
is using the mac of f0. While this should work, there is a
potential for problems depending on the circumstances.

It's likely this has nothing to do with our current issue, but
here for completeness. Will see if we can test/confirm.
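
To help test/confirm this on a deployed system, something like the
following could be used to check whether the bond's MAC belongs to the
backup port. It is a sketch only: the function names are made up, and
it assumes the text layout of /proc/net/bonding/<bond> (lines such as
"Primary Slave:", "Currently Active Slave:", "Slave Interface:" and
"Permanent HW addr:").

```python
def bond_summary(text):
    """Extract primary, active port and per-port permanent MACs."""
    summary = {"primary": None, "active": None, "slaves": {}}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Primary Slave":
            # The value may carry extra text after the name; keep the name.
            name = value.split()[0] if value else "None"
            summary["primary"] = None if name == "None" else name
        elif key == "Currently Active Slave":
            summary["active"] = None if value == "None" else value
        elif key == "Slave Interface":
            current = value
            summary["slaves"][current] = None
        elif key == "Permanent HW addr" and current is not None:
            summary["slaves"][current] = value
    return summary


def bond_mac_belongs_to_backup(summary, bond_mac):
    """True if bond_mac is the permanent MAC of a port that is not active."""
    for name, hwaddr in summary["slaves"].items():
        if hwaddr and hwaddr.lower() == bond_mac.lower():
            return name != summary["active"]
    return False
```

The bond's own MAC is not part of that file, so it has to be supplied
separately (for example from the ip link output for the bond).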


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-11 Thread Nivedita Singhvi
Edwin, let me know if you can get in touch with me via the contact email 
on my Launchpad page. Thanks for all the help!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device appears
  to be dropping data.

  Basically, we are dropping data, as you can see from the benchmark
  tool as follows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 
10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; 
UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam

  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
Device: X-Series Device
Mboard 0: X310
RX Channel: 0
  RX DSP: 0
  RX Dboard: A
  RX Subdev: SBX-120 RX
RX Channel: 1
  RX DSP: 0
  RX Dboard: B
  RX Subdev: SBX-120 RX
TX Channel: 0
  TX DSP: 0
  TX Dboard: A
  TX Subdev: SBX-120 TX
TX Channel: 1
  TX DSP: 0
  TX Dboard: B
  TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.
  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$
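
  For scale, the drop fraction implied by the summary above can be
  worked out as below. This is only a minimal sketch: the function name
  drop_fraction is made up for illustration, and it assumes the number of
  samples the host should have received is "received + dropped" as
  reported by benchmark_rate.

```python
# Values copied from the benchmark_rate summary in this report.
RECEIVED = 2995435936
DROPPED = 4622800


def drop_fraction(received, dropped):
    """Return dropped / (received + dropped) as a float in [0, 1].

    Assumption: the expected sample count is the sum of both counters.
    """
    total = received + dropped
    if total == 0:
        raise ValueError("no samples were counted")
    return dropped / total


if __name__ == "__main__":
    # Print the fraction as a percentage with four decimal places.
    print(f"{drop_fraction(RECEIVED, DROPPED):.4%}")
```

  The sum of the two counters comes to roughly the 3e9 samples expected
  from 10 Msps over 300 seconds, which suggests the counters are
  consistent with each other.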

  
  In this particular case description, the nodes are USRP x310s. However, we 
have the same issue with N210 nodes dropping samples connected to the BCM57416 
NetXtreme-E Dual-Media 10G RDMA Ethernet device.

  There is no problem with the USRPs themselves, as we have tested them
  with normal 1G network cards and have no dropped samples.

  Personally, I think it's something to do with the 10G network card,
  possibly the Ubuntu driver.

  Note: Dell have said there is no hardware problem with the 10G
  interfaces.

  I have followed the troubleshooting information at this link to try to
  determine the problem: https://files.ettus.com/manual/page_usrp_x3x0_config.html
  

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-11 Thread Nivedita Singhvi
ethtool-enp94s0f0
--
Settings for enp94s0f0:
Supported ports: [ FIBRE ]
Supported link modes:   10000baseT/Full 
Supported pause frame use: Symmetric Receive-only
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 1
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: d
Current message level: 0x00000000 (0)
   
Link detected: yes

ethtool-i-enp94s0f0
--
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version: 
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no


ethtool-c-enp94s0f0
-
Coalesce parameters for enp94s0f0:
Adaptive RX: off  TX: off
stats-block-usecs: 100
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 10
rx-frames: 15
rx-usecs-irq: 1
rx-frames-irq: 1

tx-usecs: 28
tx-frames: 30
tx-usecs-irq: 2
tx-frames-irq: 2

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

ethtool-g-enp94s0f0

Ring parameters for enp94s0f0:
Pre-set maximums:
RX:         2047
RX Mini:    0
RX Jumbo:   8191
TX:         2047
Current hardware settings:
RX:         511
RX Mini:    0
RX Jumbo:   2044
TX:         511

ethtool-k-enp94s0f0
-
Features for enp94s0f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: on
tls-hw-record: off [fixed]

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-11 Thread Nivedita Singhvi
** Attachment added: "ethtool -S for inactive interface enp94s0f0"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1853638/+attachment/5327556/+files/ethtool-S-enp94s0f0

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-11 Thread Nivedita Singhvi
"Bad" System/NIC:

NIC: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
System: Dell
Kernel: 5.3.0-28-generic #30~18.04.1-Ubuntu

(Note: this issue has been seen on prior kernels as well; we upgraded
 to the latest kernel to see whether the various problems were resolved.)


Attaching stats/config files for the NICs on this system (the one seeing the issue).

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-11 Thread Nivedita Singhvi
Good System/Good NIC (all configurations work) Comparison

NIC: NetXtreme II BCM57000 10 Gigabit Ethernet QLogic 57000
System: Dell
Kernel: 5.0.0-25-generic #26~18.04.1-Ubuntu


/proc/net/bonding/bond0
---
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp5s0f1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp5s0f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e2
Slave queue ID: 0

Slave Interface: enp5s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e0
Slave queue ID: 0


/etc/netplan/50-cloud-init.yaml

network:
    bonds:
        bond0:
            addresses:
            - 00.00.235.182/25
            gateway4: 00.00.235.129
            interfaces:
            - enp5s0f0
            - enp5s0f1
            macaddress: 00:00:00:00:73:e0
            mtu: 9000
            nameservers:
                addresses:
                - 00.00.235.172
                - 00.00.235.171
                search:
                - maas
            parameters:
                down-delay: 0
                gratuitious-arp: 1
                mii-monitor-interval: 100
                mode: active-backup
                transmit-hash-policy: layer2
                up-delay: 0
    ethernets:
        ...(snip)..
        enp5s0f0:
            match:
                macaddress: 00:00:00:00:73:e0
            mtu: 9000
            set-name: enp5s0f0
        enp5s0f1:
            match:
                macaddress: 00:00:00:00:73:e2
            mtu: 9000
            set-name: enp5s0f1
    version: 2

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-11 Thread Nivedita Singhvi
"Bad" Configuration for active-backup mode:



$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
  
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp94s0f1d1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp94s0f1d1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 4c:d9:8f:48:08:da
Slave queue ID: 0

Slave Interface: enp94s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 4c:d9:8f:48:08:d9
Slave queue ID: 0

---
$ cat uname-rv 
5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020

---
Scrubbed /etc/netplan/50-cloud-init.yaml:
network:
    bonds:
        bond0:
            addresses:
            - 0.0.235.177/25
            gateway4: 0.0.235.129
            interfaces:
            - enp94s0f0
            - enp94s0f1d1
            macaddress: 00:00:00:48:08:00
            mtu: 9000
            nameservers:
                addresses:
                - 0.0.235.171
                - 0.0.235.172
                search:
                - maas
            parameters:
                down-delay: 0
                gratuitious-arp: 1
                mii-monitor-interval: 100
                mode: active-backup
                transmit-hash-policy: layer2
                up-delay: 0
    ethernets:
        eno1:
            match:
                macaddress: 00:00:00:76:6e:ca
            mtu: 1500
            set-name: eno1
        eno2:
            match:
                macaddress: 00:00:00:76:6e:cb
            mtu: 1500
            set-name: eno2
        enp94s0f0:
            match:
                macaddress: 00:00:00:48:08:00
            mtu: 9000
            set-name: enp94s0f0
        enp94s0f1d1:
            match:
                macaddress: 00:00:00:48:08:da
            mtu: 9000
            set-name: enp94s0f1d1
    version: 2

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-10 Thread Nivedita Singhvi
We have narrowed it down to a flaw in a specific configuration setting
on this NIC, so we're comparing the good and bad configurations now.

Primary port: enp94s0f0
Secondary port: enp94s0f1d1


A] Good config for fault-tolerance (active-backup) bonding mode:
--
Primary port = active interface; Secondary port = backup

B] Bad config for fault-tolerance (active-backup) bonding mode:
--
Primary port = backup interface; Secondary port = active
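
To tell which of the two cases a host is currently in, one could read
the "Currently Active Slave" line from /proc/net/bonding/bond0 and
compare it with the primary port. The following is a rough sketch only:
the helper names active_slave and classify_bond are made up for
illustration, and the sample text is abbreviated from the bonding status
posted elsewhere in this bug.

```python
# Abbreviated /proc/net/bonding/bond0 content from the affected system.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp94s0f1d1
MII Status: up
"""


def active_slave(bonding_text):
    """Return the 'Currently Active Slave' value, or None if the line is absent."""
    for line in bonding_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Currently Active Slave":
            return value.strip()
    return None


def classify_bond(bonding_text, primary_port):
    """Map the bond state onto case A (good) or case B (bad) described above."""
    active = active_slave(bonding_text)
    if active is None:
        return "unknown"
    return "A (good)" if active == primary_port else "B (bad)"


if __name__ == "__main__":
    # On a live system the text would come from open("/proc/net/bonding/bond0").read()
    print(classify_bond(SAMPLE, "enp94s0f0"))
```

With enp94s0f0 as the primary port, the sample text falls into case B,
since the secondary port enp94s0f1d1 is the active interface.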


We are consistently able to reproduce a difference in drop rate
with UDP packets between the good and bad cases above:


Good Case UDP MTR Test Result
-
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T10:14:01+0000
HOST: hostname          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- nn.nn.nnn.nnn    0.0%    60    0.3   0.2   0.2   0.3   0.0


Bad Case UDP MTR Test Result
---
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T14:10:52+0000
HOST: hostname          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- nn.nn.nnn.nnn    8.3%    60    0.3   0.3   0.2   0.4   0.0
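
To compare the two reports programmatically, the hop rows can be parsed
and the loss percentage converted back into a packet count. This is a
minimal sketch, not part of the test setup used above: the names
parse_mtr_report and lost_packets are made up for illustration, and the
sample rows assume the Snt column reads 60, matching --report-cycles 60.

```python
import re

# Matches a hop row of `mtr --report`, e.g. "  1.|-- host  8.3%    60  ..."
ROW = re.compile(r"^\s*\d+\.\|--\s+(\S+)\s+([\d.]+)%\s+(\d+)\s+")

GOOD = "  1.|-- nn.nn.nnn.nnn    0.0%    60    0.3   0.2   0.2   0.3   0.0"
BAD = "  1.|-- nn.nn.nnn.nnn    8.3%    60    0.3   0.3   0.2   0.4   0.0"


def parse_mtr_report(text):
    """Return a list of (host, loss_percent, sent) tuples, one per hop row."""
    hops = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            hops.append((match.group(1), float(match.group(2)), int(match.group(3))))
    return hops


def lost_packets(loss_percent, sent):
    """Approximate the number of lost probes from the rounded Loss% column."""
    return round(loss_percent / 100.0 * sent)


if __name__ == "__main__":
    for label, report in (("good", GOOD), ("bad", BAD)):
        for host, loss, sent in parse_mtr_report(report):
            print(label, host, lost_packets(loss, sent), "of", sent, "lost")
```

Because mtr rounds Loss% to one decimal place, the recovered packet
count is an approximation.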

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The issue appears to be with the BCM57416 NetXtreme-E Dual-Media 10G
  RDMA Ethernet device seems to be dropping data

  Basically, we are dropping data, as you can see from the benchmark
  tool as follows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 
10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; 
UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam

  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
Device: X-Series Device
Mboard 0: X310
RX Channel: 0
  RX DSP: 0
  RX Dboard: A
  RX Subdev: SBX-120 RX
RX Channel: 1
  RX DSP: 0
  RX Dboard: B
  RX Subdev: SBX-120 RX
TX Channel: 0
  TX DSP: 0
  TX Dboard: A
  TX Subdev: SBX-120 TX
TX Channel: 1
  TX DSP: 0
  TX Dboard: B
  TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] 

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-10 Thread Nivedita Singhvi
The second port on the NIC definitely works as the active
interface in an active-backup bonding configuration on the
other NICs.

As far as we know, only this particular NIC is seeing
this problem at the moment.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The issue appears to be with the BCM57416 NetXtreme-E Dual-Media 10G
  RDMA Ethernet device seems to be dropping data

  Basically, we are dropping data, as you can see from the benchmark
  tool as follows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 
10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; 
UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam

  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
Device: X-Series Device
Mboard 0: X310
RX Channel: 0
  RX DSP: 0
  RX Dboard: A
  RX Subdev: SBX-120 RX
RX Channel: 1
  RX DSP: 0
  RX Dboard: B
  RX Subdev: SBX-120 RX
TX Channel: 0
  TX DSP: 0
  TX Dboard: A
  TX Subdev: SBX-120 TX
TX Channel: 1
  TX DSP: 0
  TX Dboard: B
  TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be 
negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.
  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$
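
  To put the summary figures in proportion, the dropped fraction can be worked out from the two sample counts. This is a rough sketch only: it assumes the dropped samples are not included in the received count, which the tool output does not state, and the function name is made up for illustration.

```python
# Figures copied from the benchmark summary above.
RECEIVED = 2_995_435_936
DROPPED = 4_622_800

# Requested rate and duration from the benchmark_rate command line.
RATE_SPS = 10e6
DURATION_S = 300


def drop_fraction(received, dropped):
    """Fraction of samples lost, assuming 'dropped' is not part of 'received'."""
    return dropped / (received + dropped)


if __name__ == "__main__":
    frac = drop_fraction(RECEIVED, DROPPED)
    nominal = RATE_SPS * DURATION_S
    print(f"dropped fraction: {frac:.4%}")
    print(f"received + dropped: {RECEIVED + DROPPED} (nominal {nominal:.0f})")
```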

  
  In this particular case description, the nodes are USRP x310s. However, we 
have the same issue with N210 nodes dropping samples connected to the BCM57416 
NetXtreme-E Dual-Media 10G RDMA Ethernet device.

  There is no problem with the USRPs themselves, as we have tested them
  with normal 1G network cards and have no dropped samples.

  Personally I think it's something to do with the 10G network card,
  possibly in an Ubuntu driver?

  Note, Dell have said there is no hardware problem with the 10G
  interfaces

  I have followed the troubleshooting information on 

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-09 Thread Nivedita Singhvi
Hello Edwin,

Here is more information on the issue we are seeing wrt dropped 
packets and other connectivity issues with this NIC.

The problem is *only* seen when the second port on the NIC is
chosen as the active interface of an active-backup configuration.

So on the "bad" system with the interfaces:

enp94s0f0 -> when chosen as active, all OK
enp94s0f1d1 -> when chosen as active, not OK

I'll see if the reporters can confirm that on the "good" systems,
there was no problem when the second interface is active.
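
When collecting data from the good and bad systems, it helps to record which port is active at the time of each test. The Linux bonding driver reports this in /proc/net/bonding/<bond> on a "Currently Active Slave:" line. A minimal sketch, assuming the bond is named bond0 (the actual bond name on these systems is not given here) and using a made-up sample of the file:

```python
def active_slave(bonding_text):
    """Return the active interface named in /proc/net/bonding/<bond> text."""
    for line in bonding_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Currently Active Slave":
            return value.strip()
    return None


def read_active_slave(bond="bond0"):
    """Read the active interface for a bond; 'bond0' is an assumed name."""
    with open(f"/proc/net/bonding/{bond}") as f:
        return active_slave(f.read())


# Abbreviated, made-up sample of the file contents, for illustration only.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp94s0f1d1
MII Status: up
"""

if __name__ == "__main__":
    print(active_slave(SAMPLE))
```

Logging this value next to each mtr or netperf result would make it easier to tie a bad run to the port that was active.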


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-02-04 Thread Nivedita Singhvi
Hey Edwin, sorry, I didn't see your last question.

I'll try to confirm, but I've seen loss in both
directions; it's not yet clear whether that's
significant.

e.g., TCP traffic is retransmitted, so it could be segments
lost while outgoing or acks lost incoming. 

4407 retransmitted TCP segments 
130 TCP timeouts

in stats collected about 5 minutes apart. That isn't a
sufficient sample size, so we're trying to get a new
collection of stats and logs using the netperf TCP_RR test.
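
For the next collection, the retransmit count over an interval can be taken as the difference between two snapshots of the Tcp counters in /proc/net/snmp. The sketch below is only an illustration: the snapshot text is abbreviated and the OutSegs values are made up, with only the 4407 difference echoing the figure above.

```python
def tcp_counters(snmp_text):
    """Parse the 'Tcp:' header/value line pair from /proc/net/snmp text."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = rows[0][1:], rows[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retrans_delta(before_text, after_text):
    """Return (retransmitted segments, segments sent, ratio) between snapshots."""
    before, after = tcp_counters(before_text), tcp_counters(after_text)
    retrans = after["RetransSegs"] - before["RetransSegs"]
    sent = after["OutSegs"] - before["OutSegs"]
    return retrans, sent, (retrans / sent if sent else 0.0)


# Abbreviated, made-up snapshots; the real file has more columns and sections.
BEFORE = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1000000 2000000 100\n"
AFTER = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1900000 3000000 4507\n"

if __name__ == "__main__":
    retrans, sent, ratio = retrans_delta(BEFORE, AFTER)
    print(f"{retrans} retransmits out of {sent} segments ({ratio:.3%})")
```

Relating retransmits to segments sent in the same interval gives a rate that can be compared across runs of different length.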

In our case, note that we're more concerned about latency issues
(and have more solid data on them) than about dropped packets (some
of which I expect with heavy network testing).

For example, netperf TCP_RR latency is about 70-78% of the older
systems for 1,1 request/response byte sizes as well as 64/64, 
100/200, 128/8192 sizes.
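
As a reading aid for TCP_RR numbers: with a single transaction outstanding at a time, the mean round-trip time is the inverse of the reported transaction rate. The sketch below assumes that serial mode, and the 20000 transactions/s rate is hypothetical. It also shows what a 70-78% figure would imply if it referred to the transaction rate, which is an assumption and not something stated above.

```python
def mean_rtt_us(transactions_per_sec):
    """Mean request/response round trip in microseconds for a serial RR test."""
    if transactions_per_sec <= 0:
        raise ValueError("transaction rate must be positive")
    return 1e6 / transactions_per_sec


def latency_ratio(rate_ratio):
    """Latency multiplier implied by a transaction-rate ratio (new / old)."""
    return 1.0 / rate_ratio


if __name__ == "__main__":
    # Hypothetical rate, for illustration only.
    print(f"20000 trans/s -> {mean_rtt_us(20000.0):.1f} us per transaction")
    for ratio in (0.70, 0.78):
        print(f"rate x{ratio:.2f} -> latency x{latency_ratio(ratio):.2f}")
```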

I'll update here as soon as we have more data from the production
environment.


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-31 Thread Nivedita Singhvi
> NICs between systems? Are OS / kernel and driver
> versions the same on both systems? 

Yes, identical distro release, kernel, and most of the software
stack (I have not obtained and examined the full sw stack). 

Configuration of networking settings is also the same.


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-31 Thread Nivedita Singhvi
Thanks very much for helping on this, Edwin! Please let me
know if there's anything specific you need. 

I'm asking them to disable any IPv6, LLDP traffic in their environment,
and retest and collect information again.

Also, I'd like to disable tpa, would this be at all useful:

modprobe bnx disable_tpa=1

??
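
An alternative that may be worth checking, in case the module parameter is not available for this driver: TPA is commonly exposed through the LRO and hardware-GRO offload features, which ethtool can switch off per interface. The sketch below rests on that assumption and it has not been verified on this NIC. The helper names are made up, and by default it only prints the commands.

```python
import subprocess


def tpa_off_commands(iface):
    """ethtool invocations that turn off LRO and hardware GRO on one interface."""
    return [
        ["ethtool", "-K", iface, "lro", "off"],
        ["ethtool", "-K", iface, "rx-gro-hw", "off"],
    ]


def disable_tpa(iface, dry_run=True):
    """Print the commands (dry run) or execute them; running needs root."""
    for cmd in tpa_off_commands(iface):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Interface name taken from the bad port mentioned earlier in this bug.
    disable_tpa("enp94s0f1d1")
```

Running `ethtool -k <iface>` before and after would confirm whether the features actually changed.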


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-31 Thread Nivedita Singhvi
> There are more than one variable at play here. 
> Does the problem follow the NIC if you swap the 
> NICs between systems? Are OS / kernel and driver 
> versions the same on both systems?

Unfortunately, I've not been able to get them to try
permutations or switches, as yet, as this is still a
production system/environment. 

I'll try and obtain more information about it.


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-31 Thread Nivedita Singhvi
> The mtr packet loss is an interesting result. What mtr options did you
> use? Is this a UDP or ICMP test?


The mtr command was:

mtr --no-dns --report --report-cycles 60 $IP_ADDR

so the probes going out were ICMP (mtr's default).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  Confirmed
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device appears
  to be dropping data.

  Basically, we are dropping data, as you can see from the benchmark
  tool as follows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam

  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
    Device: X-Series Device
    Mboard 0: X310
    RX Channel: 0
      RX DSP: 0
      RX Dboard: A
      RX Subdev: SBX-120 RX
    RX Channel: 1
      RX DSP: 0
      RX Dboard: B
      RX Subdev: SBX-120 RX
    TX Channel: 0
      TX DSP: 0
      TX Dboard: A
      TX Subdev: SBX-120 TX
    TX Channel: 1
      TX DSP: 0
      TX Dboard: B
      TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.
  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$
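
  To put the summary figures in proportion, here is a minimal Python
  sketch that computes the dropped fraction from the numbers printed
  above. It assumes (this is not confirmed by the output itself) that
  dropped samples are counted separately from received samples, so
  that the two add up to what the radio produced. The variable names
  are illustrative only.

```python
# Figures copied from the "Benchmark rate summary" in the run above.
received = 2_995_435_936      # Num received samples
dropped = 4_622_800           # Num dropped samples

# Command-line parameters of the run: --rx_rate 10e6 --duration 300
rx_rate_sps = 10e6
duration_s = 300

# Nominal number of samples at the requested rate and duration.
expected = rx_rate_sps * duration_s

# ASSUMPTION: dropped samples are not included in the received count.
accounted = received + dropped
drop_fraction = dropped / accounted

print(f"expected at nominal rate: {expected:.0f}")
print(f"received + dropped:       {accounted}")
print(f"dropped fraction:         {drop_fraction:.4%}")
```

  Under that assumption the loss is a small fraction of a percent,
  spread over the 15 Rx sequence errors reported in the log.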

  
  In this particular case, the nodes are USRP X310s. However, we have
  the same issue with N210 nodes dropping samples when connected to the
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device.

  There is no problem with the USRPs themselves, as we have tested them
  with normal 1G network cards and have no dropped samples.

  Personally I think it's something to do with the 10G network card,
  possibly an Ubuntu driver issue.

  Note: Dell have said there is no hardware problem with the 10G
  interfaces.

  I have followed the troubleshooting information on this link to try
  to determine the problem:

[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-29 Thread Nivedita Singhvi
** Attachment added: "active interface ethtool-S"
   https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1853638/+attachment/5324070/+files/ethtool-S-enp94s0f0


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-29 Thread Nivedita Singhvi
** Attachment added: "backup interface ethtool-S"
   https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1853638/+attachment/5324071/+files/ethtool-S-enp94s0f1d1


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-29 Thread Nivedita Singhvi
Note that the iperf results were identical, whereas netperf and mtr
showed differences (so the problem is possibly sporadic, not continuous).


1. iperf tcp test
-----------------
GoodSystem ... 9.84 Gbits/sec
BadSystem1 ... 8.37 Gbits/sec
BadSystem2 ... 9.85 Gbits/sec


2. iperf udp test
-----------------
GoodSystem ... 1.05 Mbits/sec
BadSystem2 ... 1.05 Mbits/sec


3. mtr ping test
----------------
GoodSystem ... 0.0% Loss; 0.2 Avg; 0.1 Best, 0.9 Worst, 0.1 StdDev
BadSystem2 ... 11.7% Loss; 0.1 Avg; 0.1 Best, 0.2 Worst, 0.0 StdDev


4. netperf tcp_rr 1/1 bytes
---------------------------
GoodSystem ... 17921.83 t/sec
BadSystem1 ... 13912.45 t/sec
BadSystem2


5. netperf tcp_rr 64/64 bytes
-----------------------------
GoodSystem ... 16987.48 t/sec
BadSystem1 ... 13355.93 t/sec
BadSystem2


6. netperf tcp_rr 128/8192 bytes
--------------------------------
GoodSystem ... 2396.45 t/sec
BadSystem1 ... 1678.54 t/sec
BadSystem2
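
For a sense of scale, the netperf tcp_rr shortfall can be expressed as a
percentage. The following is a rough Python sketch using only the
GoodSystem and BadSystem1 figures listed above; the dictionary layout
and the helper name percent_lower are illustrative, not part of any
tool used in the tests.

```python
# (GoodSystem, BadSystem1) transaction rates in t/sec, copied from the
# netperf tcp_rr results above. BadSystem2 has no figures listed.
results = {
    "tcp_rr 1/1": (17921.83, 13912.45),
    "tcp_rr 64/64": (16987.48, 13355.93),
    "tcp_rr 128/8192": (2396.45, 1678.54),
}


def percent_lower(good: float, bad: float) -> float:
    """Return how far `bad` falls below `good`, as a percentage of `good`."""
    return (good - bad) / good * 100.0


shortfall = {name: percent_lower(g, b) for name, (g, b) in results.items()}

for name, pct in shortfall.items():
    print(f"{name}: BadSystem1 is {pct:.1f}% below GoodSystem")
```

By this measure BadSystem1 completes roughly a fifth to a third fewer
request/response transactions per second than GoodSystem.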


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-29 Thread Nivedita Singhvi
Hello, Edwin,

We have two separate users/customers filing reports, and I can answer for
one of them. I'll also ask the original poster separately to reply.

For one of these situations, the affected system is:

Dell PowerEdge R440/0XP8V5, BIOS 2.2.11 06/14/2019

Note that a similar system does not have any issues:

Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.3.4 11/08/2016

So the NIC in the "bad" environment is:

BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet

The NIC in the "good" environment is:

Broadcom Inc. and subsidiaries NetXtreme II BCM57810
10 Gigabit Ethernet [14e4:1006]
Product Name: QLogic 57810 10 Gigabit Ethernet

I'll have to scrub some files and see what I can attach;
apologies, I'll have them here by tomorrow.

Unfortunately, we don't have an easy reproducer.
A single iperf and netperf test (both UDP and TCP) showed identical
results from both "good" and "bad" environments.

What we have is an identical kernel, network configuration and
stack, with the "bad" system showing double or triple the latency
to the systems from a remote server. I'll have more information
for you here shortly regarding the exact k8 command.


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-17 Thread Nivedita Singhvi
(active interface)

> cat ethtool-S-enp94s0f1d1 | grep abort
 [0]: tpa_aborts: 19775497
 [1]: tpa_aborts: 26758635
 [2]: tpa_aborts: 12008147
 [3]: tpa_aborts: 15829167
 [4]: tpa_aborts: 25099500
 [5]: tpa_aborts: 3292554
 [6]: tpa_aborts: 2863692
 [7]: tpa_aborts: 20224692


(backup interface)
> cat ethtool-S-enp94s0f0 | grep abort
 [0]: tpa_aborts: 3158584
 [1]: tpa_aborts: 1670319
 [2]: tpa_aborts: 1749371
 [3]: tpa_aborts: 1454301
 [4]: tpa_aborts: 123020
 [5]: tpa_aborts: 1403509
 [6]: tpa_aborts: 1298383
 [7]: tpa_aborts: 1858753

Net of the previous capture (that is, the increase since then), there were:

*f0 = 2014 tpa_aborts
*d1 = 1118473 tpa_aborts
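
As an illustration of how such per-ring counters can be totalled, here
is a minimal Python sketch that sums the tpa_aborts lines shown above
for each capture. The constant and function names are hypothetical, and
the regular expression assumes the "[N]: tpa_aborts: COUNT" layout seen
in these two captures. The earlier capture used for the netted-out
figures is not included in this message, so the sketch only reports the
cumulative totals.

```python
import re

# Per-ring lines copied from the ethtool-S-enp94s0f1d1 capture above.
CAPTURE_F1D1 = """\
 [0]: tpa_aborts: 19775497
 [1]: tpa_aborts: 26758635
 [2]: tpa_aborts: 12008147
 [3]: tpa_aborts: 15829167
 [4]: tpa_aborts: 25099500
 [5]: tpa_aborts: 3292554
 [6]: tpa_aborts: 2863692
 [7]: tpa_aborts: 20224692
"""

# Per-ring lines copied from the ethtool-S-enp94s0f0 capture above.
CAPTURE_F0 = """\
 [0]: tpa_aborts: 3158584
 [1]: tpa_aborts: 1670319
 [2]: tpa_aborts: 1749371
 [3]: tpa_aborts: 1454301
 [4]: tpa_aborts: 123020
 [5]: tpa_aborts: 1403509
 [6]: tpa_aborts: 1298383
 [7]: tpa_aborts: 1858753
"""

# Matches lines such as " [3]: tpa_aborts: 15829167".
TPA_LINE = re.compile(r"\[(\d+)\]:\s*tpa_aborts:\s*(\d+)")


def total_tpa_aborts(text: str) -> int:
    """Sum the tpa_aborts counter over every ring found in `text`."""
    return sum(int(count) for _ring, count in TPA_LINE.findall(text))


total_f1d1 = total_tpa_aborts(CAPTURE_F1D1)
total_f0 = total_tpa_aborts(CAPTURE_F0)

print(f"enp94s0f1d1 cumulative tpa_aborts: {total_f1d1}")
print(f"enp94s0f0   cumulative tpa_aborts: {total_f0}")
```

Subtracting the same total computed from an earlier capture gives the
increase over the interval between the two captures.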


** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

** Changed in: linux (Ubuntu)
   Importance: Undecided => Critical


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-17 Thread Nivedita Singhvi
We suspect this is a device (hw/fw) issue rather than a NetworkManager
or kernel (bnxt_en driver) issue. I've added the kernel to this bug in
case the driver is involved (just in case, for now). This is really to
eliminate all other causes and confirm whether the device is the root cause.

NIC
---
Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet
5e:00.1 Ethernet controller: Broadcom Inc. and subsidiaries 
BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

NIC Driver/FW
-------------
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: :5e:00.1
supports-statistics: yes

Kernel
------
5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019

(appears to be an issue on all kernel versions)

Environment Configuration
-------------------------
active-backup bonding mode

(having the backup interface up *might* be the problem,
 but it might just be the device itself).


The exact same distro, kernel, applications and configuration
work fine with a different NIC (Broadcom 10g bnx2x).

The tpa_aborts counters rose by a total of 1118473
over the course of a 2-minute iperf test.

Hoping to get more information from other users seeing the
same issue.


[Kernel-packages] [Bug 1853638] Re: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

2020-01-17 Thread Nivedita Singhvi
I have reports of the same device appearing to drop packets and incur
a greater number of retransmissions under certain circumstances, which
we're still trying to nail down.

I'm using this bug for now until proven to be a different problem.

This is causing issues in a production environment.


** Changed in: network-manager (Ubuntu)
   Status: New => Confirmed

** Changed in: network-manager (Ubuntu)
   Importance: Undecided => Critical

** Tags added: sts

** Also affects: linux (Ubuntu)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853638

Title:
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be
  dropping data

Status in linux package in Ubuntu:
  New
Status in network-manager package in Ubuntu:
  Confirmed

Bug description:
  The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device appears
  to be dropping data.

  Basically, we are dropping data, as the output of the benchmark tool
  shows:

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
  [INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam

  [00:00:00.07] Creating the usrp device with: ...
  [INFO] [X300] X300 initialization sequence...
  [INFO] [X300] Maximum frame size: 1472 bytes.
  [INFO] [X300] Radio 1x clock: 200 MHz
  [INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
  [INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D000)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
  [INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
  [INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD1001)
  [INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0)
  [INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0)
  [INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0)
  Using Device: Single USRP:
    Device: X-Series Device
    Mboard 0: X310
    RX Channel: 0
      RX DSP: 0
      RX Dboard: A
      RX Subdev: SBX-120 RX
    RX Channel: 1
      RX DSP: 0
      RX Dboard: B
      RX Subdev: SBX-120 RX
    TX Channel: 0
      TX DSP: 0
      TX Dboard: A
      TX Subdev: SBX-120 TX
    TX Channel: 1
      TX DSP: 0
      TX Dboard: B
      TX Subdev: SBX-120 TX

  [00:00:04.305374] Setting device timestamp to 0...
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.310990] Testing receive rate 10.00 Msps on 1 channels
  [WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
  Please see the general application notes in the manual for instructions.
  EnvironmentError: OSError: error in pthread_setschedparam
  [00:00:04.318356] Testing transmit rate 10.00 Msps on 1 channels
  [00:00:06.693119] Detected Rx sequence error.
  D[00:00:09.402843] Detected Rx sequence error.
  DD[00:00:40.927978] Detected Rx sequence error.
  D[00:01:44.982243] Detected Rx sequence error.
  D[00:02:11.400692] Detected Rx sequence error.
  D[00:02:14.805292] Detected Rx sequence error.
  D[00:02:41.875596] Detected Rx sequence error.
  D[00:03:06.927743] Detected Rx sequence error.
  D[00:03:47.967891] Detected Rx sequence error.
  D[00:03:58.233659] Detected Rx sequence error.
  D[00:03:58.876588] Detected Rx sequence error.
  D[00:04:03.139770] Detected Rx sequence error.
  D[00:04:45.287465] Detected Rx sequence error.
  D[00:04:56.425845] Detected Rx sequence error.
  D[00:04:57.929209] Detected Rx sequence error.
  [00:05:04.529548] Benchmark complete.
  Benchmark rate summary:
    Num received samples:     2995435936
    Num dropped samples:      4622800
    Num overruns detected:    0
    Num transmitted samples:  3008276544
    Num sequence errors (Tx): 0
    Num sequence errors (Rx): 15
    Num underruns detected:   0
    Num late commands:        0
    Num timeouts (Tx):        0
    Num timeouts (Rx):        0
  Done!

  tcdforge@x310a:/usr/local/lib/lib/uhd/examples$
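
For scale, the summary figures can be turned into a loss fraction and an equivalent amount of lost signal time. This is only a sketch: it assumes the total offered to the host is "received + dropped", which is an assumption about how the tool counts, and the helper names are made up for illustration.

```python
# Figures copied from the benchmark_rate summary above.
RECEIVED = 2_995_435_936
DROPPED = 4_622_800
RX_RATE_SPS = 10e6  # from --rx_rate 10e6


def drop_fraction(received: int, dropped: int) -> float:
    """Share of samples lost, assuming total = received + dropped."""
    return dropped / (received + dropped)


def dropped_time_ms(dropped: int, rate_sps: float) -> float:
    """Signal time, in milliseconds, represented by the dropped samples."""
    return dropped / rate_sps * 1000.0


print(f"drop fraction: {drop_fraction(RECEIVED, DROPPED):.4%}")
print(f"lost signal:   {dropped_time_ms(DROPPED, RX_RATE_SPS):.2f} ms over the run")
```

The fraction is small, but for a streaming radio application each of the 15 Rx sequence errors is a gap in the sample stream, so the absolute count matters more than the percentage.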

  
  In this particular case, the nodes are USRP X310s. However, we see the
  same issue with N210 nodes dropping samples when connected to the
  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device.

  There is no problem with the USRPs themselves, as we 

[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

2019-12-30 Thread Nivedita Singhvi
The fix has been committed to B, D, E. I've manually updated this
bug for now (it was not formally DUP'd to LP Bug 1852077).


** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Eoan)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Disco)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
   Status: New => Fix Committed

** Changed in: linux (Ubuntu Disco)
   Status: New => Fix Committed

** Changed in: linux (Ubuntu Eoan)
   Status: New => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title:
  Losing port aggregate with 802.3ad port-channel/bonding aggregation on
  reboot

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in Focal:
  In Progress

[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

2019-11-21 Thread Nivedita Singhvi
FWIW, the fix has been committed to -stable:

"bonding: fix state transition issue in link monitoring"
Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title:
  Losing port aggregate with 802.3ad port-channel/bonding aggregation on
  reboot

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  New
Status in linux source package in Disco:
  New
Status in linux source package in Eoan:
  New
Status in linux source package in Focal:
  In Progress

[Kernel-packages] [Bug 1852077] Re: Backport: bonding: fix state transition issue in link monitoring

2019-11-21 Thread Nivedita Singhvi
FWIW, the fix has been committed to -stable:

"bonding: fix state transition issue in link monitoring"
Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea


** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1852077

Title:
  Backport: bonding: fix state transition issue in link monitoring

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  In Progress
Status in linux source package in Focal:
  In Progress

Bug description:
  == Justification ==
  From the well explained commit message:

  Since de77ecd4ef02 ("bonding: improve link-status update in
  mii-monitoring"), the bonding driver has utilized two separate variables
  to indicate the next link state a particular slave should transition to.
  Each is used to communicate to a different portion of the link state
  change commit logic; one to the bond_miimon_commit function itself, and
  another to the state transition logic.

   Unfortunately, the two variables can become unsynchronized,
  resulting in incorrect link state transitions within bonding.  This can
  cause slaves to become stuck in an incorrect link state until a
  subsequent carrier state transition.

   The issue occurs when a special case in bond_slave_netdev_event
  sets slave->link directly to BOND_LINK_FAIL.  On the next pass through
  bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
  case will set the proposed next state (link_new_state) to BOND_LINK_UP,
  but the new_link to BOND_LINK_DOWN.  The setting of the final link state
  from new_link comes after that from link_new_state, and so the slave
  will end up incorrectly in _DOWN state.

   Resolve this by combining the two variables into one.
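
The ordering problem described in the commit message can be illustrated with a toy model. This is not kernel code: `ToySlave` and the two commit helpers are hypothetical, and the state names are plain strings borrowed from the constants quoted above. It only encodes the stated behaviour, namely that the value taken from new_link is applied after the one from link_new_state.

```python
from dataclasses import dataclass
from typing import Optional

# Plain strings named after the kernel constants; toy model only.
UP, FAIL, DOWN = "BOND_LINK_UP", "BOND_LINK_FAIL", "BOND_LINK_DOWN"


@dataclass
class ToySlave:
    link: str = FAIL
    link_new_state: Optional[str] = None  # proposed next state
    new_link: Optional[str] = None        # second variable read at commit time


def commit_two_variables(slave: ToySlave) -> str:
    """Apply link_new_state first and new_link last (hypothetical helper)."""
    if slave.link_new_state is not None:
        slave.link = slave.link_new_state
    if slave.new_link is not None:
        slave.link = slave.new_link  # the later write wins
    return slave.link


def commit_one_variable(slave: ToySlave) -> str:
    """With a single variable there is nothing to fall out of sync."""
    if slave.link_new_state is not None:
        slave.link = slave.link_new_state
    return slave.link


# Carrier came back while the slave was in FAIL: the two variables disagree.
print(commit_two_variables(ToySlave(FAIL, link_new_state=UP, new_link=DOWN)))
print(commit_one_variable(ToySlave(FAIL, link_new_state=UP)))
```

In the two-variable case the slave ends up in the _DOWN state even though the proposed state was _UP, which matches the symptom described; with one variable the proposed state is the final state.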

  == Fixes ==
  * 1899bb32 (bonding: fix state transition issue in link monitoring)

  This patch can be cherry-picked into E/F.

  For older releases like B/D, it will need to be backported, as they are
  missing the slave_err() printk macro added in 5237ff79 (bonding: add
  slave_foo printk macros) as well as the commit that replaces netdev_err()
  with slave_err(), e2a7420d (bonding/main: convert to using slave
  printk macros).

  For Xenial, the commit that causes this issue, de77ecd4, does not
  exist.

  == Test ==
  Test kernels can be found here:
  https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/

  The X-hwe and Disco kernels were tested by the bug reporter, Aleksei;
  the patched kernels work as expected.

  == Regression Potential ==
  Low.
  This patch just unifies the variables used in the link state change
  commit logic to prevent the occurrence of an incorrect state. The
  changes are limited to the bonding driver itself.

  (Although include/net/bonding.h is used by other drivers, the changes
  to that file only affect the bond_main.c driver.)

  == Original Bug Report ==
  There's an issue with the bonding driver in the current Ubuntu kernels.
  Sometimes one link gets stuck in a weird state.
  It was fixed upstream with patch
  https://www.spinics.net/lists/netdev/msg609506.html
  (commit 1899bb325149e481de31a4f32b59ea6f24e176ea).

  We see this bug with Linux 4.15 (Ubuntu Xenial, HWE kernel), but it
  should be reproducible with other current kernel versions.



[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

2019-11-21 Thread Nivedita Singhvi
https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/

There is a test kernel above (from that LP bug).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title:
  Losing port aggregate with 802.3ad port-channel/bonding aggregation on
  reboot

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  New
Status in linux source package in Disco:
  New
Status in linux source package in Eoan:
  New
Status in linux source package in Focal:
  In Progress

Bug description:
  We are losing port channel aggregation on reboot.

  After the reboot, /var/log/syslog contains the entries:
  [  250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
  Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports
  [  282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1)
  Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports

  Aggregator IDs of the slave interfaces are different:
  ubuntu@node-6:~$ cat /proc/net/bonding/bond2 
  Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

  Bonding Mode: IEEE 802.3ad Dynamic link aggregation
  Transmit Hash Policy: layer3+4 (1)
  MII Status: up
  MII Polling Interval (ms): 100
  Up Delay (ms): 0
  Down Delay (ms): 0

  802.3ad info
  LACP rate: fast
  Min links: 0
  Aggregator selection policy (ad_select): stable

  Slave Interface: enp24s0f1np1
  MII Status: up
  Speed: 1 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: b0:26:28:48:9f:51
  Slave queue ID: 0
  Aggregator ID: 1
  Actor Churn State: none
  Partner Churn State: none
  Actor Churned Count: 0
  Partner Churned Count: 0

  Slave Interface: enp24s0f0np0
  MII Status: up
  Speed: 1 Mbps
  Duplex: full
  Link Failure Count: 0
  Permanent HW addr: b0:26:28:48:9f:50
  Slave queue ID: 0
  Aggregator ID: 2
  Actor Churn State: churned
  Partner Churn State: churned
  Actor Churned Count: 1
  Partner Churned Count: 1

  The mismatch in "Aggregator ID" on the port is a symptom of the issue.
  If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up',
  the port with the mismatched ID appears to renegotiate with the port-
  channel and becomes aggregated.

  Another way to work around this issue is to take the bond ports down
  and bring up port enp24s0f0np0 first and port enp24s0f1np1 second.

  When I change the order of bringing the ports up (first enp24s0f1np1,
  and second enp24s0f0np0), the issue is still there.

  When the issue occurs, the port on the switch corresponding to
  interface enp24s0f0np0 is in Suspended state. After applying the
  workaround the port is no longer in Suspended state and Aggregator IDs
  in /proc/net/bonding/bond2 are equal.
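
The Aggregator ID check described above can be scripted. The following is a minimal sketch that parses text in the /proc/net/bonding/bond2 format shown earlier; `aggregator_ids` and `is_aggregated` are hypothetical helper names, and the sample is a trimmed copy of the output in this report.

```python
def aggregator_ids(bonding_text: str) -> dict:
    """Map each slave interface name to its Aggregator ID."""
    ids = {}
    current = None
    for raw in bonding_text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Aggregator ID:") and current is not None:
            # Only IDs that follow a "Slave Interface" line are recorded.
            ids[current] = int(line.split(":", 1)[1])
    return ids


def is_aggregated(ids: dict) -> bool:
    """True when every slave reports the same Aggregator ID."""
    return len(set(ids.values())) <= 1


# Trimmed from the /proc/net/bonding/bond2 output in this report.
SAMPLE = """\
Slave Interface: enp24s0f1np1
MII Status: up
Aggregator ID: 1

Slave Interface: enp24s0f0np0
MII Status: up
Aggregator ID: 2
"""

ids = aggregator_ids(SAMPLE)
print(ids, "aggregated" if is_aggregated(ids) else "MISMATCH")
```

On a live system the same functions could be fed the contents of /proc/net/bonding/bond2 after a reboot, and again after the down/up workaround, to confirm that the IDs have converged.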

  I installed the 5.0.0 kernel; the issue is still there.

  Operating System: 
  Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64)

  ubuntu@node-6:~$ uname -a
  Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  ubuntu@node-6:~$ sudo lspci -vnvn
  https://pastebin.ubuntu.com/p/Dy2CKDbySC/

  Hardware: Dell PowerEdge R740xd
  BIOS version: 2.1.7

  sosreport: https://drive.google.com/open?id=1-eN7cZJIeu-AQBEU7Gw8a_AJTuq0AOZO

  ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G
  https://pastebin.ubuntu.com/p/sqCx79vZWM/

  ubuntu@node-6:~$ lspci -n | grep 18:00
  18:00.0 0200: 14e4:16d8 (rev 01)
  18:00.1 0200: 14e4:16d8 (rev 01)

  ubuntu@node-6:~$ modinfo bnx2x
  https://pastebin.ubuntu.com/p/pkmzsFjK8M/

  ubuntu@node-6:~$ ip -o l 
  https://pastebin.ubuntu.com/p/QpW7TjnT2v/

  ubuntu@node-6:~$ ip -o a
  https://pastebin.ubuntu.com/p/MczKtrnmDR/

  ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml
  https://pastebin.ubuntu.com/p/9cZpPc7C6P/

  ubuntu@node-6:~$ sudo lshw -c network
  https://pastebin.ubuntu.com/p/gmfgZptzDT/
  --- 
  ProblemType: Bug
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Jun 26 10:21 seq
   crw-rw 1 root audio 116, 33 Jun 26 10:21 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
  ApportVersion: 2.20.9-0ubuntu7.6
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  DistroRelease: Ubuntu 18.04
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
  Lsusb:
   Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
   Bus 001 Device 004: ID 1604:10c0 Tascam 
   Bus 001 Device 003: ID 1604:10c0 Tascam 
   Bus 001 Device 002: ID 1604:10c0 Tascam 
   Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
  MachineType: Dell Inc. PowerEdge R740xd
  Package: linux (not installed)
  

[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot

2019-11-21 Thread Nivedita Singhvi
This is being handled as a DUP of LP Bug 1852077

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852077

** Changed in: linux (Ubuntu)
   Status: Expired => In Progress

** Tags added: sts

** Also affects: linux (Ubuntu Disco)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
   Status: In Progress

** Also affects: linux (Ubuntu Eoan)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1834322

Title:
  Losing port aggregate with 802.3ad port-channel/bonding aggregation on
  reboot

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  New
Status in linux source package in Disco:
  New
Status in linux source package in Eoan:
  New
Status in linux source package in Focal:
  In Progress

[Kernel-packages] [Bug 1852077] Re: Backport: bonding: fix state transition issue in link monitoring

2019-11-21 Thread Nivedita Singhvi
Still waiting on these patches being committed to all the Ubuntu trees. 
Any ETA? Is this waiting on being picked up via -stable?

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1852077

Title:
  Backport: bonding: fix state transition issue in link monitoring

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  In Progress
Status in linux source package in Focal:
  In Progress

Bug description:
  == Justification ==
  From the well-explained commit message:

  Since de77ecd4ef02 ("bonding: improve link-status update in
  mii-monitoring"), the bonding driver has utilized two separate variables
  to indicate the next link state a particular slave should transition to.
  Each is used to communicate to a different portion of the link state
  change commit logic; one to the bond_miimon_commit function itself, and
  another to the state transition logic.

  Unfortunately, the two variables can become unsynchronized,
  resulting in incorrect link state transitions within bonding.  This can
  cause slaves to become stuck in an incorrect link state until a
  subsequent carrier state transition.

  The issue occurs when a special case in bond_slave_netdev_event
  sets slave->link directly to BOND_LINK_FAIL.  On the next pass through
  bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
  case will set the proposed next state (link_new_state) to BOND_LINK_UP,
  but the new_link to BOND_LINK_DOWN.  The setting of the final link state
  from new_link comes after that from link_new_state, and so the slave
  will end up incorrectly in _DOWN state.

  Resolve this by combining the two variables into one.
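
  The ordering problem described in the commit message can be sketched
  as a toy model. This is plain Python, not kernel code: the function
  names and the NOCHANGE marker are hypothetical, and only the state
  names and the "link_new_state first, new_link second" ordering are
  taken from the text above.

```python
# Toy model of the bonding link-state commit; not kernel code.
BOND_LINK_UP = "BOND_LINK_UP"
BOND_LINK_FAIL = "BOND_LINK_FAIL"
BOND_LINK_DOWN = "BOND_LINK_DOWN"
NOCHANGE = None  # hypothetical marker meaning "no state proposed"


def commit_two_variables(current, link_new_state, new_link):
    """Apply link_new_state first and new_link second, as described above."""
    state = current
    if link_new_state is not NOCHANGE:
        state = link_new_state
    if new_link is not NOCHANGE:
        state = new_link  # applied last, so a stale value here wins
    return state


def commit_single_variable(current, link_new_state):
    """With one variable there is nothing left to fall out of sync."""
    state = current
    if link_new_state is not NOCHANGE:
        state = link_new_state
    return state


# Slave was forced to BOND_LINK_FAIL, carrier then comes back up:
# link_new_state proposes UP while new_link still says DOWN.
buggy = commit_two_variables(BOND_LINK_FAIL, BOND_LINK_UP, BOND_LINK_DOWN)
fixed = commit_single_variable(BOND_LINK_FAIL, BOND_LINK_UP)
print(buggy, fixed)
```

  In the two-variable model the slave ends up down although the
  proposed state was up, which is the stuck-link symptom in the
  original report.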

  == Fixes ==
  * 1899bb32 (bonding: fix state transition issue in link monitoring)

  This patch can be cherry-picked into E/F

  For older releases like B/D, it will need to be backported, as they are
  missing the slave_err() printk macro added in 5237ff79 (bonding: add
  slave_foo printk macros) as well as the commit that replaces netdev_err()
  with slave_err(), e2a7420d (bonding/main: convert to using slave
  printk macros).

  For Xenial, the commit that causes this issue, de77ecd4, does not
  exist.

  == Test ==
  Test kernels can be found here:
  https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/

  The X-hwe and Disco kernels were tested by the bug reporter, Aleksei;
  the patched kernels work as expected.

  == Regression Potential ==
  Low.
  This patch just unifies the variables used in the link state change
  commit logic to prevent the occurrence of an incorrect state. The
  changes are limited to the bonding driver itself.

  (Although include/net/bonding.h is used by other drivers, the changes
  to that file only affect the bond_main.c driver.)

  == Original Bug Report ==
  There's an issue with the bonding driver in the current Ubuntu kernels.
  Sometimes one link gets stuck in a weird state.
  It was fixed upstream by the patch at
  https://www.spinics.net/lists/netdev/msg609506.html
  (commit 1899bb325149e481de31a4f32b59ea6f24e176ea).

  We see this bug with linux 4.15 (ubuntu xenial, hwe kernel), but it
  should be reproducible with other current kernel versions.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852077/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384

2019-09-05 Thread Nivedita Singhvi
This issue has been tested and successfully verified:

Verification successful!

"...test appliance built with 4.15.0-58 was unusable ... hundreds of
'BUG: non-zero pgtables_bytes on freeing mm: -16384' in syslog, RestAPI
interface timeouts, failed to produce FFDC data using sosreport.

Build with 4.15.0-60.67 displays none of these behaviors ... smoke test
completed successfully."


** Tags added: verification-done-bionic

** Changed in: linux (Ubuntu Bionic)
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840046

Title:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  [impact]

  This message is printed repeatedly in the logs:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

  [test case]

  boot the 4.15.0-58 kernel on s390x

  [regression potential]

  this affects task pud accounting; regressions may be around cleaning
  up task memory.
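
  For verification, the message from [impact] can be counted in a log
  with a short script. This is a minimal sketch: the function name and
  the "kernel: " prefix in the sample lines are made up for
  illustration, and only the message text comes from this bug.

```python
import re
from collections import Counter

# Message text as quoted in [impact]; the byte count is captured as a group.
PGTABLES_BUG = re.compile(
    r"BUG: non-zero pgtables_bytes on freeing mm: (-?\d+)"
)


def count_pgtables_bugs(lines):
    """Return a Counter mapping each reported byte count to its occurrences."""
    counts = Counter()
    for line in lines:
        match = PGTABLES_BUG.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical log excerpt; real syslog lines carry timestamps and host names.
sample = [
    "kernel: BUG: non-zero pgtables_bytes on freeing mm: -16384",
    "kernel: some unrelated message",
    "kernel: BUG: non-zero pgtables_bytes on freeing mm: -16384",
]
print(count_pgtables_bugs(sample))
```

  On a fixed kernel the counter would be expected to stay empty after
  boot; to scan a real log, pass an open file object instead of the
  sample list.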

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840046/+subscriptions



[Kernel-packages] [Bug 1840789] Re: bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled

2019-08-29 Thread Nivedita Singhvi
** Tags added: sts

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial)
   Importance: High => Critical

** Changed in: linux (Ubuntu Bionic)
   Importance: High => Critical

** Changed in: linux (Ubuntu Disco)
   Importance: Undecided => Critical

** Changed in: linux (Ubuntu Eoan)
   Importance: Undecided => Critical

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840789

Title:
  bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released

Bug description:
  [Impact]

   * The bnx2x driver may cause hardware faults (leading to
     panic/reboot) and other behaviors such as transmit timeouts
     after commit 3968d38917eb ("bnx2x: Fix Multi-Cos.") was
     introduced.

   * This issue has been observed by a user shortly
     after starting docker & kubelet, with adapters:
     - Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
     - Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]

   * If options to ignore hardware faults are used
     (erst_disable=1 hest_disable=1 ghes.disable=1)
     the system doesn't panic/reboot and continues
     on to timeout on adapter stats, then transmit
     timeouts, spewing some adapter firmware dumps,
     but the network interface is non-functional.

   * The issue only happened when LLDP is enabled
     on the network switches, and crashdump shows
     the bnx2x driver is stuck/waits for firmware
     to complete the stop traffic command in LLDP
     handling. Workaround used is to disable LLDP
     in the network switches/ports.

   * Analysis of the driver and firmware dumps
     didn't help significantly towards finding
     the root cause.

   * Upstream/mainline recently reverted the patch
     due to similar problem reports, while still
     looking for the root cause/proper fix.

  [Test Case]

   * No reproducible test case found outside
     the user's systems/cluster, where it is
     enough to start docker & kubelet & wait.

   * The user verified test kernels for Xenial
     and Bionic - the problem does not happen;
     build-tested on Disco.

  [Regression Potential]

   * Users who significantly use/apply the non-default
     traffic class (tc) / class of service (cos) might
     possibly see performance changes (if any at all)
     in such applications, however that's unclear now.

   * This is a recent revert upstream (v5.3-rc'ish),
     so there's a chance things might change in this area.

   * Nonetheless, the patch is authored by the driver
     vendor, and made its way into stable kernels
     (e.g., v5.2.8 which made Eoan/19.10 recently).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840789/+subscriptions



[Kernel-packages] [Bug 1840704] Re: ZFS kernel modules lack debug symbols

2019-08-29 Thread Nivedita Singhvi
** Tags added: sts

** Tags added: linux

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840704

Title:
  ZFS kernel modules lack debug symbols

Status in linux package in Ubuntu:
  In Progress

Bug description:
  The ZFS kernel modules aren't built with debug symbols,
  which introduces problems/issues for debugging/support.

  Patches are required in:

  1) linux kernel packaging, to add infrastructure to
     enable/build/strip/package debug symbols on DKMS.
     (this is sufficient with zfs-linux now in Eoan.)

  2) zfs-linux and spl-linux, for the stable releases,
     which need a few patches to enable debug symbols
     (add option './configure --enable-debuginfo' and
     '(ZFS|SPL)_DKMS_ENABLE_DEBUGINFO' to dkms.conf.)

  Initially submitting the kernel patchset for Unstable,
  for review/feedback.  It backports nicely into B/D/E,
  should it be accepted; for X (doesn't use DKMS builds)
  a simpler patch for the moment (until it does) works.

  The zfs/spl-linux patches are ready, to be submitted
  once the approach used by the kernel package settles.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840704/+subscriptions



[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384

2019-08-26 Thread Nivedita Singhvi
I'll update here once kernel is uploaded.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840046

Title:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed



[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384

2019-08-26 Thread Nivedita Singhvi
I unduped it for test process clarity.

Trying to get the relevant people to test the fix.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840046

Title:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed



[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384

2019-08-26 Thread Nivedita Singhvi
** This bug is no longer a duplicate of bug 1837664
   Bionic update: upstream stable patchset 2019-07-23

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840046

Title:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed



[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384

2019-08-23 Thread Nivedita Singhvi
*** This bug is a duplicate of bug 1837664 ***
https://bugs.launchpad.net/bugs/1837664

I'll unDUP it unless the kernel team says otherwise in IRC.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840046

Title:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed



[Kernel-packages] [Bug 1840046] Re: BUG: non-zero pgtables_bytes on freeing mm: -16384

2019-08-23 Thread Nivedita Singhvi
*** This bug is a duplicate of bug 1837664 ***
https://bugs.launchpad.net/bugs/1837664

I'm not sure this bug should be DUP'd to the stable-release
bug. Might confuse the verification and handling triggers, 
perhaps?

Will need to make sure the fix is tested once the fix is
uploaded.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840046

Title:
  BUG: non-zero pgtables_bytes on freeing mm: -16384

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed



[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-07-22 Thread Nivedita Singhvi
Verified on Xenial

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released
Status in linux source package in Disco:
  Fix Released

Bug description:
  SRU Justification

  Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically.

  Fix:
  Fixed by upstream commit in v5.0:
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  "geneve: correctly handle ipv6.disable module parameter"

  Hence available in Disco and later; required in X,B,C.

  Testcase:
  1. Boot with "ipv6.disable=1"
  2. Then try and create a geneve tunnel using:
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 \
         type=geneve options:remote_ip=192.168.x.z
     (where remote_ip is the IP of the other host)

  Regression Potential: Low. Only geneve tunnels are affected, and only
  when ipv6 is dynamically disabled; currently that case doesn't work at all.

  Other Info:
  * Mainline commit msg includes reference to a fix for
    non-metadata tunnels (infrastructure is not yet in
    our tree prior to Disco), hence not being included
    at this time under this case.

    At this time, all geneve tunnels created as above
    are metadata-enabled.

  ---
  [Impact]

  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with Open vSwitch, where ipv6 has been disabled,
  the creation fails with the error:

  "ovs-vsctl: Error detected while setting up 'geneve0': could not
  add network device geneve0 to ofproto (Address family not supported
  by protocol)."

  [Fix]
  There is an upstream commit for this in v5.0 mainline (and in Disco and
  later Ubuntu kernels).

  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

  This fix is needed on all our series prior to Disco
  and the v5.0 kernel: X, B, C. It is identical to the
  fix we implemented and tested internally but had not
  yet pushed upstream.

  [Test Case]
  (Best to do this on a kvm guest VM so as not to interfere with
   your system's networking)

  1. On any Ubuntu Xenial kernel, disable ipv6. This example
     is shown with the 4.15.0-23-generic kernel (which differs
     slightly from 4.4.x in symptoms):

  - Edit /etc/default/grub to add the line:
  GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot

  2. Install OVS
  # apt install openvswitch-switch

  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
  # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 \
      type=geneve options:remote_ip=192.168.x.z

  (where remote_ip is the IP of the other host)

  You will see the following error message:

  "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."

  From /var/log/openvswitch/ovs-vswitchd.log you will see:

  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
  failed to add geneve1 as port: Address family not supported
  by protocol"

  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.

  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same
  'ovs-vsctl add-port' command completes successfully.
  You can see that it is working properly by adding an
  IP to the br1 and pinging each host.

  On kernel 4.4 (4.4.0-128-generic), the error message doesn't
  happen using the 'ovs-vsctl add-port' command, no warning is
  shown in ovs-vswitchd.log, but the device genev_sys_6081 is
  also not created and ping test won't work.

  With the fixed test kernel, the interfaces and tunnel
  are created successfully.
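
  The two things the test case looks at (whether ipv6 was disabled at
  boot, and whether genev_sys_6081 exists) can also be read from sysfs
  instead of "ifconfig". This is a minimal sketch: the function name and
  the root parameter are made up for illustration, and it assumes the
  usual /sys/module/ipv6/parameters/disable and /sys/class/net layout.

```python
from pathlib import Path


def geneve_test_state(root="/"):
    """Report the ipv6.disable setting and whether genev_sys_6081 exists.

    `root` is a hypothetical parameter so the check can be pointed at a
    copied or fake sysfs tree; on a live system leave it at "/".
    """
    root = Path(root)
    disable_param = root / "sys/module/ipv6/parameters/disable"
    device = root / "sys/class/net/genev_sys_6081"

    ipv6_disabled = None  # None means the parameter file was not found
    if disable_param.is_file():
        ipv6_disabled = disable_param.read_text().strip() == "1"

    return {
        "ipv6_disabled": ipv6_disabled,
        "genev_sys_6081_present": device.exists(),
    }


if __name__ == "__main__":
    print(geneve_test_state())
```

  On an affected kernel booted with ipv6.disable=1, the expectation from
  the steps above is ipv6_disabled True and genev_sys_6081_present
  False after the add-port command; with the fixed kernel the device
  should be present.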

  [Regression Potential]
  * Low -- affects the geneve driver only, and when ipv6 is
    disabled, and since it doesn't work in that case at all,
    this fix gets the tunnel up and running for the common case.

  [Other Info]

  * Analysis

  Geneve tunnels should work with either IPv4 or IPv6 environments
  as a design and support principle.

  Currently, however, what's in the implementation requires support
  for ipv6 for metadata-based tunnels which geneve is:

  rather than:

  a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled
  b) ipv4 + metadata + ipv6
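
  The difference between the current requirement and cases a) and b)
  can be written down as a toy model. This is plain Python, not the
  geneve driver and not the upstream patch; both function names and the
  string socket labels are hypothetical, and the only behaviour taken
  from this report is that a metadata tunnel currently fails with
  "Address family not supported by protocol" when ipv6 is disabled.

```python
import errno
import os


def open_sockets_described(metadata, remote_is_ipv6, ipv6_available):
    """Model of the behaviour in this report: metadata implies an IPv6 socket."""
    socks = []
    if remote_is_ipv6 or metadata:
        if not ipv6_available:
            raise OSError(errno.EAFNOSUPPORT, os.strerror(errno.EAFNOSUPPORT))
        socks.append("ipv6")
    if (not remote_is_ipv6) or metadata:
        socks.append("ipv4")
    return socks


def open_sockets_desired(metadata, remote_is_ipv6, ipv6_available):
    """Model of cases a) and b): metadata tunnels still come up on IPv4 alone."""
    socks = []
    if remote_is_ipv6 or metadata:
        if ipv6_available:
            socks.append("ipv6")
        elif remote_is_ipv6:
            # An IPv6 remote genuinely cannot work without IPv6.
            raise OSError(errno.EAFNOSUPPORT, os.strerror(errno.EAFNOSUPPORT))
    if (not remote_is_ipv6) or metadata:
        socks.append("ipv4")
    return socks


# Case a): ipv4 + metadata with ipv6 disabled.
print(open_sockets_desired(metadata=True, remote_is_ipv6=False, ipv6_available=False))
```

  Calling open_sockets_described() with the same arguments raises
  OSError with EAFNOSUPPORT, which mirrors the ovs-vswitchd warning
  quoted in the test case.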

  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open() :

  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...

  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
  ret = 

[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-06-26 Thread Nivedita Singhvi
As the test kernel with the backported Xenial fix
has been up for almost 2 months now, I'm submitting
the SRU for Xenial, although I have not received
feedback from the original reporter or others.

Backported patch for Xenial varies slightly from the
cherry-picked patch for B, C. 

My testing has been successful (see original testing
information in description).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released
Status in linux source package in Disco:
  Fix Released



[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-05-22 Thread Nivedita Singhvi
** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released
Status in linux source package in Cosmic:
  Fix Released

Bug description:
  [Impact]
  The i40e driver can get stalled on tx timeouts. This can happen when
  DCB is enabled on the connected switch. This can also trigger a
  second situation when a tx timeout occurs before the recovery of
  a previous timeout has completed due to CPU load, which is not
  handled correctly. This leads to networking delays, drops and
  application timeouts and hangs. Note that the first tx timeout
  cause is just one of the ways to end up in the second situation.

  This issue was seen on a heavily loaded Kafka broker node running
  the 4.15.0-38-generic kernel on Xenial.

  Symptoms include messages in the kernel log of the form:

  ---
  [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
  [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
  

  With the test kernel provided in this LP bug which had these
  two commits compiled in, the problem has not been seen again,
  and has been running successfully for several months:

  "i40e: Fix for Tx timeouts when interface is brought up if
   DCB is enabled"
  Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

  "i40e: prevent overlapping tx_timeout recover"
  Commit: d5585b7b6846a6d0f9517afe57be3843150719da

  * The first commit is already in Disco, Cosmic
  * The second commit is already in Disco
  * Bionic needs both patches and Cosmic needs the second

  [Test Case]
  * We are considering the case of both issues above occurring.
  * Seen by reporter on a Kafka broker node with heavy traffic.
  * Not easy to reproduce as it requires something like the
    following example environment and heavy load:

    Kernel: 4.15.0-38-generic
    Network driver: i40e
      version: 2.1.14-k
      firmware-version: 6.00 0x800034e6 18.3.6
    NIC: Intel 40Gb XL710
    DCB enabled

  [Regression Potential]
  Low, as the first commit only impacts i40e DCB environments, and has
  been running successfully in production-load testing for several
  months.

  --- Original Description
  Today the Ubuntu 16.04 LTS Enablement Stack moved from kernel 4.13 to
  kernel 4.15.0-24-generic.

  On a "Dell PowerEdge R330" server with a network adapter "Intel
  Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network
  card no longer works and permanently displays these three lines :

  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful
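
  To compare tx_timeout reports across hosts, the fields in these
  messages can be pulled out with a regular expression. This is a
  minimal sketch: the function name is made up, the field names are
  copied from the log text without interpreting them, and the sample
  line is the first message above on a single line.

```python
import re

HEX = r"0x[0-9a-fA-F]+"
TX_TIMEOUT = re.compile(
    rf"(?P<ifname>\S+): tx_timeout: VSI_seid: (?P<vsi_seid>\d+), "
    rf"Q (?P<queue>\d+), NTC: (?P<ntc>{HEX}), HWB: (?P<hwb>{HEX}), "
    rf"NTU: (?P<ntu>{HEX}), TAIL: (?P<tail>{HEX}), INT: (?P<intr>{HEX})"
)


def parse_tx_timeout(line):
    """Return the tx_timeout fields as a dict, or None if the line differs."""
    match = TX_TIMEOUT.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"ifname": fields.pop("ifname")}
    for name, text in fields.items():
        # int(text, 0) accepts both decimal and 0x-prefixed values.
        result[name] = int(text, 0)
    return result


sample = (
    "[   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, "
    "Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1"
)
print(parse_tx_timeout(sample))
```

  Lines that do not match, such as the "recovery level" messages,
  return None and can be handled separately.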

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-05-22 Thread Nivedita Singhvi
** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.
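
  Since the error was seen only once, it may help to tally watchdog
  timeouts per interface and driver across collected logs. This is a
  minimal sketch: the function name is made up, and the pattern is
  built only from the single kernel log line quoted above.

```python
import re
from collections import Counter

WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (?P<ifname>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def tally_watchdog_timeouts(lines):
    """Count watchdog timeouts keyed by (interface, driver, queue)."""
    tally = Counter()
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            key = (match["ifname"], match["driver"], int(match["queue"]))
            tally[key] += 1
    return tally


sample = [
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out",
    "an unrelated kernel message",
]
print(tally_watchdog_timeouts(sample))
```

  A non-empty tally for the bnxt_en_bpo driver on a 4.4 kernel would
  point at the same failure described in this report.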

  [Test Case]

  * Unfortunately, this is not easy to reproduce. Also, it is only seen
  on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
  driver.
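
  Since the problem is tied to the bnxt_en_bpo driver, it can help to
  confirm which driver an interface is bound to before trying to
  reproduce. The sketch below parses "key: value" text such as the
  "driver:" and "version:" lines quoted earlier; that those lines came
  from "ethtool -i" is an assumption, and the helper names are
  hypothetical.

```python
def parse_driver_info(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def uses_bpo_driver(text):
    """True if the parsed driver name is the backported bnxt_en_bpo module."""
    return parse_driver_info(text).get("driver") == "bnxt_en_bpo"


if __name__ == "__main__":
    sample = "driver: bnxt_en_bpo\nversion: 1.8.1\n"
    print(parse_driver_info(sample), uses_bpo_driver(sample))
```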

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
  scope - just the newest Broadcom NICs being used by the Xenial 4.4
  kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
  in-tree fixed driver).

  * The patch is very small and backport is fairly minimal and simple.

  * The fix has been running in the in-tree driver in upstream mainline
  as well as in the Ubuntu Linux in-tree driver. Although the Broadcom
  driver has a lot of lower-level code that is different, this piece is
  still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus

2019-05-22 Thread Nivedita Singhvi
** Tags added: sts

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic verification-done-cosmic

** Tags removed: verification-done-cosmic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
  and Bionic kernels).

  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  It requires the following commit as well:

  i40e: Do not allow use more TC queue pairs than MSI-X vectors exist
  Commit:   1563f2d2e01242f05dd523ffd56fe104bc1afd58

  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
     i40e driver version: 2.1.14-k
     Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:

  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
  00,,

    But for any queue number > 63, we see this error:

  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument

  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument
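
  The test above can be scripted. The sketch below is an illustration
  under assumptions, not part of the original report: it builds a CPU
  mask in the comma-separated 32-bit hex group format that xps_cpus
  uses, and lists the tx queues whose xps_cpus file cannot be read.
  Both function names and the sysfs_root parameter are hypothetical.

```python
import os


def xps_cpu_mask(cpus):
    """Build a hex CPU bitmap as comma-separated 32-bit groups, high word first."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    groups = max(1, (value.bit_length() + 31) // 32)
    hexstr = f"{value:0{groups * 8}x}"
    return ",".join(hexstr[i:i + 8] for i in range(0, len(hexstr), 8))


def unreadable_xps_queues(dev, sysfs_root="/sys/class/net"):
    """Return the tx queue names under dev whose xps_cpus read fails."""
    qdir = os.path.join(sysfs_root, dev, "queues")
    failed = []
    for name in sorted(os.listdir(qdir)):
        if not name.startswith("tx-"):
            continue
        try:
            with open(os.path.join(qdir, name, "xps_cpus")) as f:
                f.read()
        except OSError:
            # On an affected kernel the read fails with "Invalid argument".
            failed.append(name)
    return failed


if __name__ == "__main__":
    print(xps_cpu_mask([64]))
```

  On an affected 4.15 kernel, unreadable_xps_queues("eth2") would be
  expected to report tx-64 and above; with the fix applied it should
  return an empty list.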

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions



[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-05-22 Thread Nivedita Singhvi
** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed
Status in linux source package in Disco:
  Fix Released

Bug description:
  SRU Justification

  Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically.

  Fix:
  Fixed by upstream commit in v5.0:
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  "geneve: correctly handle ipv6.disable module parameter"

  Hence available in Disco and later; required in X,B,C.

  Testcase:
  1. Boot with "ipv6.disable=1"
  2. Then try and create a geneve tunnel using:
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z // ip of the other host
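
  Before step 2 it can be worth confirming that the boot parameter
  actually took effect. A minimal sketch, assuming the kernel command
  line is read from /proc/cmdline; the helper names are hypothetical.

```python
def ipv6_disabled_on_cmdline(cmdline):
    """True if the kernel command line carries the exact token ipv6.disable=1."""
    return "ipv6.disable=1" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(ipv6_disabled_on_cmdline(read_cmdline()))
```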

  Regression Potential: Low. Only geneve tunnels are affected, and only
  when ipv6 is dynamically disabled; currently that case doesn't work at all.

  Other Info:
  * Mainline commit msg includes reference to a fix for
    non-metadata tunnels (infrastructure is not yet in
    our tree prior to Disco), hence not being included
    at this time under this case.

    At this time, all geneve tunnels created as above
    are metadata-enabled.

  ---
  [Impact]

  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with open vswitch, where ipv6 has been disabled,
  the create fails with the error :

  "ovs-vsctl: Error detected while setting up 'geneve0': could not
  add network device geneve0 to ofproto (Address family not supported
  by protocol)."

  [Fix]
  There is an upstream commit for this in v5.0 mainline (and in Disco and
  later Ubuntu kernels).

  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

  This fix is needed on all our series prior to Disco
  and the v5.0 kernel: X, B, C. It is identical to the
  fix we implemented and tested internally but had
  not yet pushed upstream.

  [Test Case]
  (Best to do this on a kvm guest VM so as not to interfere with
   your system's networking)

  1. On any Ubuntu Xenial kernel, disable ipv6. This example
     is shown with the 4.15.0-23-generic kernel (which differs
     slightly from 4.4.x in symptoms):

  - Edit /etc/default/grub to add the line:
  GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot

  2. Install OVS
  # apt install openvswitch-switch

  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
  # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z

  (where remote_ip is the IP of the other host)

  You will see the following error message:

  "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."

  From /var/log/openvswitch/ovs-vswitchd.log you will see:

  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
  failed to add geneve1 as port: Address family not supported
  by protocol"
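
  The warning above can be picked out of ovs-vswitchd.log
  programmatically. This sketch assumes the pipe-separated layout seen
  in the quoted sample (timestamp, sequence number, module, level,
  message) and that the entry is a single line in the real file; the
  function names are hypothetical.

```python
def parse_vswitchd_line(line):
    """Split one log line into its five pipe-separated fields, or return None."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, seq, module, level, message = parts
    return {
        "timestamp": timestamp,
        "seq": seq,
        "module": module,
        "level": level,
        "message": message,
    }


def afnosupport_warnings(log_text):
    """Return parsed WARN records that mention the address-family error."""
    hits = []
    for line in log_text.splitlines():
        rec = parse_vswitchd_line(line)
        if rec and rec["level"] == "WARN" and \
                "Address family not supported" in rec["message"]:
            hits.append(rec)
    return hits


if __name__ == "__main__":
    sample = ("2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: "
              "failed to add geneve1 as port: Address family not supported "
              "by protocol")
    print(afnosupport_warnings(sample))
```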

  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.

  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same
  'ovs-vsctl add-port' command completes successfully.
  You can see that it is working properly by adding an
  IP to the br1 and pinging each host.

  On kernel 4.4 (4.4.0-128-generic), the error message doesn't
  happen using the 'ovs-vsctl add-port' command, no warning is
  shown in ovs-vswitchd.log, but the device genev_sys_6081 is
  also not created and ping test won't work.

  With the fixed test kernel, the interfaces and tunnel
  are created successfully.
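
  Instead of reading "ifconfig" output, the presence of the tunnel
  device can be checked through sysfs, where every network device has
  an entry under /sys/class/net. A minimal sketch; the function name
  and the sysfs_root parameter are hypothetical.

```python
import os


def iface_exists(name, sysfs_root="/sys/class/net"):
    """True if a network device with this name is listed under sysfs_root."""
    return os.path.exists(os.path.join(sysfs_root, name))


if __name__ == "__main__":
    # genev_sys_6081 is the device name given in this report.
    print(iface_exists("genev_sys_6081"))
```

  On an affected kernel booted with ipv6.disable=1 this is expected to
  print False after the add-port command; with the fixed test kernel it
  should print True.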

  [Regression Potential]
  * Low -- affects the geneve driver only, and when ipv6 is
    disabled, and since it doesn't work in that case at all,
    this fix gets the tunnel up and running for the common case.

  [Other Info]

  * Analysis

  Geneve tunnels should work with either IPv4 or IPv6 environments
  as a design and support principle.

  Currently, however, the implementation requires IPv6 support
  for metadata-based tunnels, which geneve tunnels are:

  rather than:

  a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled
  b) ipv4 + metadata + ipv6

  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open() :

  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...

  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
      ret = geneve_sock_add(geneve, true);
  #endif
  if (!ret && (!ipv6 || metadata))
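
  The effect of this logic can be modelled outside the kernel. The
  sketch below is a simplified Python model, not kernel code. It
  assumes that the statement cut off above goes on to add the IPv4
  socket, and that adding an IPv6 socket fails with EAFNOSUPPORT when
  IPv6 is disabled. open_tolerant() shows one way a fix could ignore
  that specific error; it is an illustration, not the text of the
  upstream commit. All names are hypothetical.

```python
import errno


def sock_add(want_ipv6, ipv6_available):
    """Stand-in for geneve_sock_add(): only an IPv6 socket can fail here."""
    if want_ipv6 and not ipv6_available:
        return -errno.EAFNOSUPPORT
    return 0


def open_current(ipv6, metadata, ipv6_available):
    """Model of the quoted logic: an IPv6 failure blocks the IPv4 socket."""
    ret = 0
    if ipv6 or metadata:
        ret = sock_add(True, ipv6_available)
    if ret == 0 and (not ipv6 or metadata):
        ret = sock_add(False, ipv6_available)  # assumed continuation
    return ret


def open_tolerant(ipv6, metadata, ipv6_available):
    """Variant that ignores EAFNOSUPPORT from the IPv6 socket."""
    ret = 0
    if ipv6 or metadata:
        ret = sock_add(True, ipv6_available)
        if ret < 0 and ret != -errno.EAFNOSUPPORT:
            return ret
    if not ipv6 or metadata:
        ret = sock_add(False, ipv6_available)
    return ret


if __name__ == "__main__":
    # Metadata tunnel to an IPv4 remote, booted with ipv6.disable=1.
    print(open_current(False, True, False), open_tolerant(False, True, False))
```

  In this model a metadata-enabled tunnel to an IPv4 remote fails under
  open_current() when IPv6 is unavailable, while open_tolerant() still
  brings up the IPv4 socket.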
     

[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-05-21 Thread Nivedita Singhvi
Bionic, Cosmic kernels successfully tested. 
I've updated the tags.


** Tags removed: verification-needed-bionic verification-needed-cosmic
** Tags added: verification-done-bionic verification-done-cosmic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed
Status in linux source package in Disco:
  Fix Released


[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus

2019-05-14 Thread Nivedita Singhvi
Late update, but the original reporter did test the proposed
kernel on systems able to reproduce the problem, and it
tested successfully.

We do not yet have a way of reproducing this on Xenial (i.e.,
any 4.4 kernel). I'm leaving this issue open and will keep
trying to reproduce it; once we can confirm and test, I will
update and push an SRU for Xenial as well.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions



[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-05-09 Thread Nivedita Singhvi
A 4.4 test kernel with the fix backported is available at:

https://people.canonical.com/~nivedita/geneve-xenial-test/

if anyone wishes to validate the 4.4 X solution.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed
Status in linux source package in Disco:
  Fix Released


[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-05-07 Thread Nivedita Singhvi
Resubmitted SRU for B,C for this kernel cycle.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress
Status in linux source package in Disco:
  Fix Released


[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-04-17 Thread Nivedita Singhvi
Submitted SRU request for Bionic, Cosmic.

Huge thanks for the testing, Matthew!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress
Status in linux source package in Disco:
  Fix Released


[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-04-17 Thread Nivedita Singhvi
** Tags added: cosmic xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1794232

Title:
  Geneve tunnels don't work when ipv6 is disabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress
Status in linux source package in Disco:
  Fix Released

Bug description:
  SRU Justification

  Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically.

  Fix:
  Fixed by upstream commit in v5.0:
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  "geneve: correctly handle ipv6.disable module parameter"

  Hence available in Disco and later; required in X,B,C.

  Testcase:
  1. Boot with "ipv6.disable=1"
  2. Then try and create a geneve tunnel using:
     # ovs-vsctl add-br br1
     # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 \
         type=geneve options:remote_ip=192.168.x.z
     (where 192.168.x.z is the IP of the other host)

  Regression Potential: Low; only affects geneve tunnels when ipv6 is
  dynamically disabled, a case that currently doesn't work at all.

  Other Info:
  * Mainline commit msg includes reference to a fix for
    non-metadata tunnels (infrastructure is not yet in
    our tree prior to Disco), hence not being included
    at this time under this case.

    At this time, all geneve tunnels created as above
    are metadata-enabled.

  ---
  [Impact]

  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with open vswitch, where ipv6 has been disabled,
  the create fails with the error :

  "ovs-vsctl: Error detected while setting up 'geneve0': could not
  add network device geneve0 to ofproto (Address family not supported
  by protocol)."

  [Fix]
  There is an upstream commit for this in v5.0 mainline (and in Disco and
  later Ubuntu kernels).

  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7

  This fix is needed on all our series prior to Disco
  and the v5.0 kernel: X, C, B. It is identical to the
  fix we implemented and tested internally with, but had
  not pushed upstream yet.

  [Test Case]
  (Best to do this on a kvm guest VM so as not to interfere with
   your system's networking)

  1. On any Ubuntu Xenial kernel, disable ipv6. This example
     is shown with the 4.15.0-23-generic kernel (which differs
     slightly from 4.4.x in symptoms):

  - Edit /etc/default/grub to add the line:
  GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot

  2. Install OVS
  # apt install openvswitch-switch

  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
  # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z

  (where remote_ip is the IP of the other host)

  You will see the following error message:

  "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."

  From /var/log/openvswitch/ovs-vswitchd.log you will see:

  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
  failed to add geneve1 as port: Address family not supported
  by protocol"

  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.

  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same
  'ovs-vsctl add-port' command completes successfully.
  You can see that it is working properly by adding an
  IP to the br1 and pinging each host.

  On kernel 4.4 (4.4.0-128-generic), the error message doesn't
  happen using the 'ovs-vsctl add-port' command, no warning is
  shown in ovs-vswitchd.log, but the device genev_sys_6081 is
  also not created and ping test won't work.

  With the fixed test kernel, the interfaces and tunnel
  are created successfully.

  [Regression Potential]
  * Low -- affects the geneve driver only, and when ipv6 is
    disabled, and since it doesn't work in that case at all,
    this fix gets the tunnel up and running for the common case.

  [Other Info]

  * Analysis

  Geneve tunnels should work with either IPv4 or IPv6 environments
  as a design and support principle.

  Currently, however, the implementation requires IPv6 support
  for metadata-based tunnels, which geneve tunnels are:

  rather than:

  a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled
  b) ipv4 + metadata + ipv6

  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open() :

  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...

  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
  ret = geneve_sock_add(geneve, true);
  #endif
  if (!ret && (!ipv6 || metadata))
   

[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-04-17 Thread Nivedita Singhvi
** Description changed:

  SRU Justification
  
  Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically.
  
  Fix:
  Fixed by upstream commit in v5.0:
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  "geneve: correctly handle ipv6.disable module parameter"
  
- Hence available in Disco and later; required in X,B,C
- Cherry picked and tested successfully for X, B, C.
+ Hence available in Disco and later; required in X,B,C.
  
  Testcase:
  1. Boot with "ipv6.disable=1"
  2. Then try and create a geneve tunnel using:
-# ovs-vsctl add-br br1
-# ovs-vsctl add-port br1 geneve1 -- set interface geneve1
- type=geneve options:remote_ip=192.168.x.z // ip of the other host
+    # ovs-vsctl add-br br1
+    # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
+ type=geneve options:remote_ip=192.168.x.z // ip of the other host
  
  Regression Potential: Low; only affects geneve tunnels when ipv6 is
  dynamically disabled, a case that currently doesn't work at all.
  
  Other Info:
  * Mainline commit msg includes reference to a fix for
-   non-metadata tunnels (infrastructure is not yet in
-   our tree prior to Disco), hence not being included
-   at this time under this case.
+   non-metadata tunnels (infrastructure is not yet in
+   our tree prior to Disco), hence not being included
+   at this time under this case.
  
-   At this time, all geneve tunnels created as above
-   are metadata-enabled.
- 
+   At this time, all geneve tunnels created as above
+   are metadata-enabled.
  
  ---
  [Impact]
  
  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with open vswitch, where ipv6 has been disabled,
  the create fails with the error :
  
  "ovs-vsctl: Error detected while setting up 'geneve0': could not
  add network device geneve0 to ofproto (Address family not supported
  by protocol)."
  
  [Fix]
  There is an upstream commit for this in v5.0 mainline (and in Disco and
  later Ubuntu kernels).
  
  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  
  This fix is needed on all our series prior to Disco
  and the v5.0 kernel: X, C, B. It is identical to the
  fix we implemented and tested internally with, but had
  not pushed upstream yet.
  
  [Test Case]
  (Best to do this on a kvm guest VM so as not to interfere with
   your system's networking)
  
  1. On any Ubuntu Xenial kernel, disable ipv6. This example
-    is shown with the4.15.0-23-generic kernel (which differs
+    is shown with the 4.15.0-23-generic kernel (which differs
     slightly from 4.4.x in symptoms):
  
  - Edit /etc/default/grub to add the line:
  GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot
  
  2. Install OVS
  # apt install openvswitch-switch
  
  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
  # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z
  
  (where remote_ip is the IP of the other host)
  
  You will see the following error message:
  
  "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."
  
  From /var/log/openvswitch/ovs-vswitchd.log you will see:
  
  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
  failed to add geneve1 as port: Address family not supported
  by protocol"
  
  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.
  
  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same
  'ovs-vsctl add-port' command completes successfully.
  You can see that it is working properly by adding an
  IP to the br1 and pinging each host.
  
  On kernel 4.4 (4.4.0-128-generic), the error message doesn't
  happen using the 'ovs-vsctl add-port' command, no warning is
  shown in ovs-vswitchd.log, but the device genev_sys_6081 is
  also not created and ping test won't work.
  
  With the fixed test kernel, the interfaces and tunnel
  are created successfully.
  
  [Regression Potential]
  * Low -- affects the geneve driver only, and when ipv6 is
    disabled, and since it doesn't work in that case at all,
    this fix gets the tunnel up and running for the common case.
  
  [Other Info]
  
  * Analysis
  
  Geneve tunnels should work with either IPv4 or IPv6 environments
  as a design and support principle.
  
  Currently, however, the implementation requires IPv6 support
  for metadata-based tunnels, which geneve tunnels are:
  
  rather than:
  
  a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled
  b) ipv4 + metadata + ipv6
  
  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open() :
  
  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...
  
  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
   

[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-04-17 Thread Nivedita Singhvi
** Description changed:

+ SRU Justification
+ 
+ Impact: Cannot create geneve tunnels if ipv6 is disabled dynamically.
+ 
+ Fix:
+ Fixed by upstream commit in v5.0:
+ Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
+ "geneve: correctly handle ipv6.disable module parameter"
+ 
+ Hence available in Disco and later; required in X,B,C
+ Cherry picked and tested successfully for X, B, C.
+ 
+ Testcase:
+ 1. Boot with "ipv6.disable=1"
+ 2. Then try and create a geneve tunnel using:
+# ovs-vsctl add-br br1
+# ovs-vsctl add-port br1 geneve1 -- set interface geneve1
+ type=geneve options:remote_ip=192.168.x.z // ip of the other host
+ 
+ Regression Potential: Low, only geneve tunnels when ipv6 dynamically
+ disabled, current status is it doesn't work at all.
+ 
+ Other Info:
+ * Mainline commit msg includes reference to a fix for
+   non-metadata tunnels (infrastructure is not yet in
+   our tree prior to Disco), hence not being included
+   at this time under this case.
+ 
+   At this time, all geneve tunnels created as above
+   are metadata-enabled.
+ 
+ 
+ ---
  [Impact]
  
  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with open vswitch, where ipv6 has been disabled,
  the create fails with the error :
  
  "ovs-vsctl: Error detected while setting up 'geneve0': could not
  add network device geneve0 to ofproto (Address family not supported
  by protocol)."
  
  [Fix]
  There is an upstream commit for this in v5.0 mainline (and in Disco and
  later Ubuntu kernels).
  
  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  
  This fix is needed on all our series prior to Disco
  and the v5.0 kernel: X, C, B. It is identical to the
  fix we implemented and tested internally with, but had
  not pushed upstream yet.
  
  [Test Case]
  (Best to do this on a kvm guest VM so as not to interfere with
   your system's networking)
  
  1. On any Ubuntu Xenial kernel, disable ipv6. This example
     is shown with the 4.15.0-23-generic kernel (which differs
     slightly from 4.4.x in symptoms):
  
  - Edit /etc/default/grub to add the line:
  GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot
  
  2. Install OVS
  # apt install openvswitch-switch
  
  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
  # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z
  
  (where remote_ip is the IP of the other host)
  
  You will see the following error message:
  
  "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."
  
  From /var/log/openvswitch/ovs-vswitchd.log you will see:
  
  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
  failed to add geneve1 as port: Address family not supported
  by protocol"
  
  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.
  
  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same
  'ovs-vsctl add-port' command completes successfully.
  You can see that it is working properly by adding an
  IP to the br1 and pinging each host.
  
  On kernel 4.4 (4.4.0-128-generic), the error message doesn't
  happen using the 'ovs-vsctl add-port' command, no warning is
  shown in ovs-vswitchd.log, but the device genev_sys_6081 is
  also not created and ping test won't work.
  
  With the fixed test kernel, the interfaces and tunnel
  are created successfully.
  
  [Regression Potential]
  * Low -- affects the geneve driver only, and when ipv6 is
    disabled, and since it doesn't work in that case at all,
    this fix gets the tunnel up and running for the common case.
  
  [Other Info]
  
  * Analysis
  
  Geneve tunnels should work with either IPv4 or IPv6 environments
  as a design and support principle.
  
  Currently, however, the implementation requires IPv6 support
  for metadata-based tunnels, which geneve tunnels are:
  
  rather than:
  
  a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled
  b) ipv4 + metadata + ipv6
  
  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open() :
  
  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...
  
  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
  ret = geneve_sock_add(geneve, true);
  #endif
  if (!ret && (!ipv6 || metadata))
  ret = geneve_sock_add(geneve, false);
  
  CONFIG_IPV6 is enabled, IPv6 is disabled at boot, but
  even though ipv6 is false, metadata is always true
  for a geneve open as it is set unconditionally in
  ovs:
  
  In /lib/dpif_netlink_rtnl.c :
  
  case OVS_VPORT_TYPE_GENEVE:
  nl_msg_put_flag(, IFLA_GENEVE_COLLECT_METADATA);
  
  The second argument of geneve_sock_add is a boolean

[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-04-17 Thread Nivedita Singhvi
** Description changed:

  [Impact]
  
  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with open vswitch, where ipv6 has been disabled,
  the create fails with the error :
  
  "ovs-vsctl: Error detected while setting up 'geneve0': could not
  add network device geneve0 to ofproto (Address family not supported
  by protocol)."
  
  [Fix]
  There is an upstream commit for this in v5.0 mainline (and in Disco and
  later Ubuntu kernels).
  
  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  
  This fix is needed on all our series prior to Disco
  and the v5.0 kernel: X, C, B. It is identical to the
  fix we implemented and tested internally with, but had
  not pushed upstream yet.
  
  [Test Case]
  (Best to do this on a kvm guest VM so as not to interfere with
   your system's networking)
  
  1. On any Ubuntu Xenial kernel, disable ipv6. This example
     is shown with the 4.15.0-23-generic kernel (which differs
     slightly from 4.4.x in symptoms):
  
  - Edit /etc/default/grub to add the line:
  GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot
  
  2. Install OVS
  # apt install openvswitch-switch
  
  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
  # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z
  
  (where remote_ip is the IP of the other host)
  
  You will see the following error message:
  
  "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."
  
  From /var/log/openvswitch/ovs-vswitchd.log you will see:
  
  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
  failed to add geneve1 as port: Address family not supported
  by protocol"
  
  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.
  
  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same
  'ovs-vsctl add-port' command completes successfully.
  You can see that it is working properly by adding an
  IP to the br1 and pinging each host.
  
  On kernel 4.4 (4.4.0-128-generic), the error message doesn't
  happen using the 'ovs-vsctl add-port' command, no warning is
  shown in ovs-vswitchd.log, but the device genev_sys_6081 is
  also not created and ping test won't work.
  
  With the fixed test kernel, the interfaces and tunnel
  are created successfully.
  
  [Regression Potential]
- * Low -- affects the geneve driver only, and when ipv6 is 
-   disabled, and since it doesn't work in that case at all,
-   this fix gets the tunnel up and running for the common case.
- 
+ * Low -- affects the geneve driver only, and when ipv6 is
+   disabled, and since it doesn't work in that case at all,
+   this fix gets the tunnel up and running for the common case.
  
  [Other Info]
  
  * Analysis
  
  Geneve tunnels should work with either IPv4 or IPv6 environments
  as a design and support principle.
  
  Currently, however, the implementation requires IPv6 support
  for metadata-based tunnels, which geneve tunnels are:
  
  rather than:
  
  a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled
  b) ipv4 + metadata + ipv6
  
  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open() :
  
  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...
  
  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
  ret = geneve_sock_add(geneve, true);
  #endif
  if (!ret && (!ipv6 || metadata))
  ret = geneve_sock_add(geneve, false);
  
  CONFIG_IPV6 is enabled, IPv6 is disabled at boot, but
  even though ipv6 is false, metadata is always true
  for a geneve open as it is set unconditionally in
  ovs:
  
  In /lib/dpif_netlink_rtnl.c :
  
  case OVS_VPORT_TYPE_GENEVE:
  nl_msg_put_flag(, IFLA_GENEVE_COLLECT_METADATA);
  
  The second argument of geneve_sock_add is a boolean
  value indicating whether it's an ipv6 address family
  socket or not, and we thus incorrectly pass a true
  value rather than false.
  
  The current "|| metadata" check is unnecessary and incorrectly
  sends the tunnel creation code down the ipv6 path, which
  fails subsequently when the code expects an ipv6 family socket.
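
  For illustration only, here is a small user-space C model of the
  control flow quoted above from geneve_open(). It is a sketch under
  stated assumptions, not kernel code: struct tunnel_model,
  model_sock_add(), model_open_quoted() and
  model_open_without_md_check() are hypothetical names, and the stub
  simply assumes that asking for an IPv6 socket fails with
  -EAFNOSUPPORT when IPv6 is disabled at boot. The second open
  function models only the "remove the || metadata check" idea
  described in this analysis; the actual patch may differ.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct tunnel_model {
    bool remote_is_ipv6;   /* models remote.sa.sa_family == AF_INET6 */
    bool collect_md;       /* models geneve->collect_md (set by OVS) */
    bool ipv6_runtime_ok;  /* false when booted with ipv6.disable=1 */
    bool sock4_open;       /* set when the IPv4 socket was added */
    bool sock6_open;       /* set when the IPv6 socket was added */
};

/* Stand-in for geneve_sock_add(): the second argument selects IPv6.
 * Assumption: an IPv6 socket cannot be created when IPv6 is disabled. */
static int model_sock_add(struct tunnel_model *t, bool ipv6)
{
    if (ipv6) {
        if (!t->ipv6_runtime_ok)
            return -EAFNOSUPPORT;
        t->sock6_open = true;
    } else {
        t->sock4_open = true;
    }
    return 0;
}

/* Same branch structure as the snippet quoted in the analysis. */
static int model_open_quoted(struct tunnel_model *t)
{
    bool ipv6 = t->remote_is_ipv6;
    bool metadata = t->collect_md;
    int ret = 0;

    if (ipv6 || metadata)
        ret = model_sock_add(t, true);
    if (!ret && (!ipv6 || metadata))
        ret = model_sock_add(t, false);
    return ret;
}

/* Variant with "|| metadata" dropped from the IPv6 branch only. */
static int model_open_without_md_check(struct tunnel_model *t)
{
    bool ipv6 = t->remote_is_ipv6;
    bool metadata = t->collect_md;
    int ret = 0;

    if (ipv6)
        ret = model_sock_add(t, true);
    if (!ret && (!ipv6 || metadata))
        ret = model_sock_add(t, false);
    return ret;
}
```

  With collect_md set, an IPv4 remote and ipv6_runtime_ok false, the
  quoted flow returns -EAFNOSUPPORT before the IPv4 socket is ever
  tried, while the variant opens the IPv4 socket and returns 0.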
  
  * This issue exists in all versions of the kernel up to the present
    mainline and net-next trees.
  
  * Testing with a trivial patch to remove that and make
    similar changes to those made for vxlan (which had the
    same issue) has been successful. Patches for various
    versions to be attached here soon.
  
  * Example Versions (bug exists in all versions of Ubuntu
-   and mainline):
+   and mainline)
+ 
+ Update: This has been patched upstream after original description filed
+ here, fix available in v5.0 mainline and Disco 

[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-04-16 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu Disco)
   Status: In Progress => Fix Released

** Description changed:

  [Impact]
  
  When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
  an OS environment with open vswitch, where ipv6 has been disabled,
  the create fails with the error :
  
  "ovs-vsctl: Error detected while setting up 'geneve0': could not
  add network device geneve0 to ofproto (Address family not supported
  by protocol)."
  
  [Fix]
- There is an upstream commit for this in v5.0 mainline. 
+ There is an upstream commit for this in v5.0 mainline.
  
  "geneve: correctly handle ipv6.disable module parameter"
  Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
  
- This fix is needed on all our series: X, C, B, D. It is identical
+ This fix is needed on all our series prior to Disco and the v5.0 kernel: X, C, B. It is identical
  to the fix we implemented and tested internally with, but
- had not pushed upstream yet. 
- 
+ had not pushed upstream yet.
  
  [Test Case]
  (Best to do this on a kvm guest VM so as not to interfere with
   your system's networking)
  
  1. On any Ubuntu Xenial kernel, disable ipv6. This example
     is shown with the 4.15.0-23-generic kernel (which differs
     slightly from 4.4.x in symptoms):
  
  - Edit /etc/default/grub to add the line:
  GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot
  
  2. Install OVS
  # apt install openvswitch-switch
  
  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
  # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z
  
  (where remote_ip is the IP of the other host)
  
  You will see the following error message:
  
  "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."
  
  From /var/log/openvswitch/ovs-vswitchd.log you will see:
  
  "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
  failed to add geneve1 as port: Address family not supported
  by protocol"
  
  You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.
  
  If you do not disable IPv6 (remove ipv6.disable=1 from
  /etc/default/grub + update-grub + reboot), the same
  'ovs-vsctl add-port' command completes successfully.
  You can see that it is working properly by adding an
  IP to the br1 and pinging each host.
  
  On kernel 4.4 (4.4.0-128-generic), the error message doesn't
  happen using the 'ovs-vsctl add-port' command, no warning is
  shown in ovs-vswitchd.log, but the device genev_sys_6081 is
  also not created and ping test won't work.
  
- With the fixed test kernel, the interfaces and tunnel 
+ With the fixed test kernel, the interfaces and tunnel
  are created successfully.
- 
  
  [Other Info]
  
  * Analysis
  
  Geneve tunnels should work with either IPv4 or IPv6 environments
  as a design and support principle.
  
  Currently, however, the implementation requires IPv6 support
  for metadata-based tunnels, which geneve tunnels are:
  
  rather than:
  
  a) ipv4 + metadata // whether ipv6 compiled or dynamically disabled
  b) ipv4 + metadata + ipv6
  
  What enforces this in the current 4.4.0-x code when opening a Geneve
  tunnel is the following in geneve_open() :
  
  bool ipv6 = geneve->remote.sa.sa_family == AF_INET6;
  bool metadata = geneve->collect_md;
  ...
  
  #if IS_ENABLED(CONFIG_IPV6)
  geneve->sock6 = NULL;
  if (ipv6 || metadata)
  ret = geneve_sock_add(geneve, true);
  #endif
  if (!ret && (!ipv6 || metadata))
  ret = geneve_sock_add(geneve, false);
  
  CONFIG_IPV6 is enabled, IPv6 is disabled at boot, but
  even though ipv6 is false, metadata is always true
  for a geneve open as it is set unconditionally in
  ovs:
  
  In /lib/dpif_netlink_rtnl.c :
  
  case OVS_VPORT_TYPE_GENEVE:
  nl_msg_put_flag(, IFLA_GENEVE_COLLECT_METADATA);
  
  The second argument of geneve_sock_add is a boolean
  value indicating whether it's an ipv6 address family
  socket or not, and we thus incorrectly pass a true
  value rather than false.
  
  The current "|| metadata" check is unnecessary and incorrectly
  sends the tunnel creation code down the ipv6 path, which
  fails subsequently when the code expects an ipv6 family socket.
  
  * This issue exists in all versions of the kernel up to the present
    mainline and net-next trees.
  
  * Testing with a trivial patch to remove that and make
    similar changes to those made for vxlan (which had the
    same issue) has been successful. Patches for various
    versions to be attached here soon.
  
  * Example Versions (bug exists in all versions of Ubuntu
    and mainline):
  
  $ uname -r
  4.4.0-135-generic
  
  $ lsb_release -rd
  Description:  Ubuntu 16.04.5 LTS
  Release:  16.04
  
  $ dpkg -l | grep openvswitch-switch
  ii  openvswitch-switch   2.5.4-0ubuntu0.16.04.1

** Description changed:

  [Impact]
  
  When attempting to 

[Kernel-packages] [Bug 1794232] Re: Geneve tunnels don't work when ipv6 is disabled

2019-04-16 Thread Nivedita Singhvi
We tested the patch discussed above internally, with success,
although our testing was limited (opening a geneve tunnel
between 2 kvm guests).

Jiri has now pushed an identical patch upstream which is 
available in the v5.0 kernel and later.

"geneve: correctly handle ipv6.disable module parameter"  
Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
 

Although I do not have testing validation from the original
poster, since it has been committed upstream, I'm going
to go ahead and get the SRU request started.


** Changed in: linux (Ubuntu)
   Status: Triaged => In Progress

** Changed in: linux (Ubuntu)
   Importance: Medium => High

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
   Status: New

** Also affects: linux (Ubuntu Disco)
   Importance: High
   Status: In Progress

** Changed in: linux (Ubuntu Cosmic)
   Status: New => In Progress

** Changed in: linux (Ubuntu Disco)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Cosmic)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Xenial)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Xenial)
   Status: New => In Progress

** Changed in: linux (Ubuntu Cosmic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

** Description changed:

  [Impact]
  
- When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in 
- an OS environment with open vswitch, where ipv6 has been disabled, 
+ When attempting to create a geneve tunnel on Ubuntu 16.04 Xenial, in
+ an OS environment with open vswitch, where ipv6 has been disabled,
  the create fails with the error :
  
- “ovs-vsctl: Error detected while setting up 'geneve0': could not 
- add network device geneve0 to ofproto (Address family not supported 
+ “ovs-vsctl: Error detected while setting up 'geneve0': could not
+ add network device geneve0 to ofproto (Address family not supported
  by protocol)."
  
-  
+ [Fix]
+ There is an upstream commit for this in v5.0 mainline.
+ 
+ "geneve: correctly handle ipv6.disable module parameter"
+ Commit: cf1c9ccba7308e48a68fa77f476287d9d614e4c7
+ 
+ This fix is needed on all our series: X, C, B, D
+ 
+ 
  [Test Case]
- (Best to do this on a kvm guest VM so as not to interfere with  
-  your system's networking)
+ (Best to do this on a kvm guest VM so as not to interfere with
+  your system's networking)
  
  1. On any Ubuntu Xenial kernel, disable ipv6. This example
-is shown with the4.15.0-23-generic kernel (which differs
-slightly from 4.4.x in symptoms):   
-   
+    is shown with the4.15.0-23-generic kernel (which differs
+    slightly from 4.4.x in symptoms):
+ 
  - Edit /etc/default/grub to add the line:
- GRUB_CMDLINE_LINUX="ipv6.disable=1"
+ GRUB_CMDLINE_LINUX="ipv6.disable=1"
  - # update-grub
  - Reboot
- 
  
  2. Install OVS
  # apt install openvswitch-switch
  
  3. Create a Geneve tunnel
  # ovs-vsctl add-br br1
- # ovs-vsctl add-port br1 geneve1 -- set interface geneve1 
+ # ovs-vsctl add-port br1 geneve1 -- set interface geneve1
  type=geneve options:remote_ip=192.168.x.z
  
  (where remote_ip is the IP of the other host)
  
  You will see the following error message:
  
- "ovs-vsctl: Error detected while setting up 'geneve1'. 
+ "ovs-vsctl: Error detected while setting up 'geneve1'.
  See ovs-vswitchd log for details."
  
  From /var/log/openvswitch/ovs-vswitchd.log you will see:
  
- "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system: 
- failed to add geneve1 as port: Address family not supported 
+ "2018-07-02T16:48:13.295Z|00026|dpif|WARN|system@ovs-system:
+ failed to add geneve1 as port: Address family not supported
  by protocol"
  
- You will notice from the "ifconfig" output that the device 
+ You will notice from the "ifconfig" output that the device
  genev_sys_6081 is not created.
  
- If you do not disable IPv6 (remove ipv6.disable=1 from 
- /etc/default/grub + update-grub + reboot), the same 
- 'ovs-vsctl add-port' command completes successfully. 
- You can see that it is working properly by adding an 
- IP to the br1 and pinging each host. 
+ If you do not disable IPv6 (remove ipv6.disable=1 from
+ /etc/default/grub + update-grub + reboot), the same
+ 'ovs-vsctl add-port' command completes successfully.
+ You can see that it is working properly by adding an
+ IP to the br1 and pinging each host.
  
- On kernel 4.4 (4.4.0-128-generic), the error message doesn't 
- happen using the 'ovs-vsctl add-port' command, no warning is 
- shown in ovs-vswitchd.log, but the device genev_sys_6081 is 
+ On kernel 4.4 (4.4.0-128-generic), the error message doesn't
+ happen 

[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-04-07 Thread Nivedita Singhvi
I have installed and booted this kernel and confirmed that no
new regression was introduced, although I cannot reproduce the issue.


** Tags removed: 4.15.0-24-generic cosmic kernel verification-needed-bionic verification-needed-cosmic
** Tags added: verification-done-bionic verification-done-cosmic

** Description changed:

  [Impact]
  The i40e driver can get stalled on tx timeouts. This can happen when
  DCB is enabled on the connected switch. This can also trigger a
  second situation when a tx timeout occurs before the recovery of
  a previous timeout has completed due to CPU load, which is not
  handled correctly. This leads to networking delays, drops and
  application timeouts and hangs. Note that the first tx timeout
  cause is just one of the ways to end up in the second situation.
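
  The second situation, a tx timeout arriving while recovery from an
  earlier one is still in progress, is the kind of overlap that a
  test-and-set guard can rule out. A minimal user-space sketch with
  C11 atomics follows. The struct and function names are hypothetical,
  and this is not the i40e driver code or the content of the commits
  referenced in this bug; it only illustrates the general technique.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device state for the illustration. */
struct nic_model {
    atomic_flag recovery_pending;  /* set while a recovery is running */
    int recoveries_started;        /* counts recoveries actually begun */
};

/* Called on each tx timeout. Returns true only for the caller that
 * wins the flag; a timeout that arrives while a recovery is still
 * pending is ignored instead of starting a second, overlapping one. */
static bool model_try_begin_recovery(struct nic_model *n)
{
    if (atomic_flag_test_and_set(&n->recovery_pending))
        return false;  /* a recovery is already in flight */
    n->recoveries_started++;
    return true;
}

/* Called when the recovery work has completed. */
static void model_finish_recovery(struct nic_model *n)
{
    atomic_flag_clear(&n->recovery_pending);
}
```

  A caller would initialise the flag with ATOMIC_FLAG_INIT, invoke
  model_try_begin_recovery() from the timeout path, and call
  model_finish_recovery() once the reset work is done.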
  
  This issue was seen on a heavily loaded Kafka broker node running
- the 4.15.0-38-generic kernel on Xenial. 
+ the 4.15.0-38-generic kernel on Xenial.
  
  Symptoms include messages in the kernel log of the form:
  
  ---
  [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
  [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
  
  
  With the test kernel provided in this LP bug, which had these
  two commits compiled in, the problem has not been seen again,
  and the kernel has been running successfully for several months:
  
- "i40e: Fix for Tx timeouts when interface is brought up if 
-  DCB is enabled"
+ "i40e: Fix for Tx timeouts when interface is brought up if
+  DCB is enabled"
  Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee
  
  "i40e: prevent overlapping tx_timeout recover"
  Commit: d5585b7b6846a6d0f9517afe57be3843150719da
  
  * The first commit is already in Disco, Cosmic
  * The second commit is already in Disco
  * Bionic needs both patches and Cosmic needs the second
  
  [Test Case]
  * We are considering the case of both issues above occurring.
  * Seen by reporter on a Kafka broker node with heavy traffic.
- * Not easy to reproduce as it requires something like the 
-   following example environment and heavy load:
+ * Not easy to reproduce as it requires something like the
+   following example environment and heavy load:
  
-   Kernel: 4.15.0-38-generic
-   Network driver: i40e
- version: 2.1.14-k
- firmware-version: 6.00 0x800034e6 18.3.6
-   NIC: Intel 40Gb XL710 
-   DCB enabled
- 
+   Kernel: 4.15.0-38-generic
+   Network driver: i40e
+ version: 2.1.14-k
+ firmware-version: 6.00 0x800034e6 18.3.6
+   NIC: Intel 40Gb XL710
+   DCB enabled
  
  [Regression Potential]
  Low, as the first only impacts i40e DCB environment, and has
- been running for several months in production-load testing 
+ been running for several months in production-load testing
  successfully.
- 
  
  --- Original Description
  Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13
  to the Kernel 4.15.0-24-generic.
  
  On a "Dell PowerEdge R330" server with a network adapter "Intel Ethernet
  Converged Network Adapter X710-DA2" (driver i40e) the network card no
  longer works and permanently displays these three lines:
  
  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  [Impact]
  The i40e driver can get stalled on tx timeouts. This can happen when
  DCB is enabled on the connected switch. This can also trigger a
  second situation when a tx timeout occurs before the recovery of
  a previous timeout has completed due to CPU load, which is not
  handled correctly. This leads to networking delays, drops and
  application timeouts and hangs. Note that the first tx timeout
  cause is just one of the ways to end up in the second situation.

  This issue was seen on a heavily loaded Kafka broker node running
  the 4.15.0-38-generic kernel on Xenial.

  Symptoms include messages in the kernel log of the form:

  ---
  [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
  [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
  

  With the test kernel provided in this LP bug which had these
  two commits compiled in, the problem has not been seen again,
  and has been running successfully for several 

[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus

2019-04-03 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu)
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
  and Bionic kernels).

  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  It requires the following commit as well:

  i40e: Do not allow use more TC queue pairs than MSI-X vectors exist
  Commit:   1563f2d2e01242f05dd523ffd56fe104bc1afd58

  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
     i40e driver version: 2.1.14-k
     Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:

  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
  00,,

    But for any queue number > 63, we see this error:

  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument

  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus

2019-03-31 Thread Nivedita Singhvi
I'm still trying to confirm this for Xenial.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
  and Bionic kernels).

  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  It requires the following commit as well:

  i40e: Do not allow use more TC queue pairs than MSI-X vectors exist
  Commit:   1563f2d2e01242f05dd523ffd56fe104bc1afd58

  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
     i40e driver version: 2.1.14-k
     Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:

  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
  00,,

    But for any queue number > 63, we see this error:

  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument

  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions



[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus

2019-03-26 Thread Nivedita Singhvi
Submitted patches for SRU.

** Description changed:

  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
- and Bionic kernels). 
+ and Bionic kernels).
  
  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).
  
  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):
  
  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f
  
+ It requires the following commit as well:
+ 
+ i40e: Do not allow use more TC queue pairs than MSI-X vectors exist
+ Commit:   1563f2d2e01242f05dd523ffd56fe104bc1afd58
+ 
  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
-i40e driver version: 2.1.14-k 
-Any system with > 64 CPUs
+    i40e driver version: 2.1.14-k
+    Any system with > 64 CPUs
  
  2. For any queue 0 - 63, you can read/set tx xps:
-
+ 
  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
- 00,, 
+ 00,,
  
-   But for any queue number > 63, we see this error:
+   But for any queue number > 63, we see this error:
  
  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument
  
  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  In Progress

Bug description:
  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
  and Bionic kernels).

  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  It requires the following commit as well:

  i40e: Do not allow use more TC queue pairs than MSI-X vectors exist
  Commit:   1563f2d2e01242f05dd523ffd56fe104bc1afd58

  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
     i40e driver version: 2.1.14-k
     Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:

  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
  00,,

    But for any queue number > 63, we see this error:

  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument

  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-03-25 Thread Nivedita Singhvi
I am not sure we can deterministically provoke the
issue. At the very least, to ensure no other regression
was introduced, I would run it under heavy network load.

The environment that saw the issue had network load,
contention for CPUs, and several other issues occurring.

The basic environment is:

1. For any 25Gb NIC/chipset that requires the 4.4 bnxt_en_bpo
   driver, set its 2 ports/interfaces up in bonding mode 
   as follows:

bond-lacp-rate fast
bond-master bond0
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer3+4
mtu 9000 

2. Run any heavy TCP network load test over the systems
   (e.g. iperf, netperf, file transfer, etc.)

3. In theory, a lower number of tx ring descriptors should
   make this more likely to hit (not successfully proven by
   testing here), so you can lower it and see if that helps:

   # ethtool -G eno49 tx 128  // for example
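
Steps 2 and 3 above can be sketched as a small script. This is only an
illustration under stated assumptions: the interface name is the eno49
from the example, the server address is a hypothetical placeholder, and
iperf3 stands in for "any heavy TCP load test". By default it only
prints the commands it would run.

```shell
#!/bin/sh
# Sketch of steps 2-3. IFACE and SERVER are placeholders (assumptions),
# not values taken from the affected environment.
IFACE="${IFACE:-eno49}"
SERVER="${SERVER:-192.0.2.10}"   # hypothetical iperf3 server address
DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands only, 0 = execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repro_steps() {
    run ethtool -g "$IFACE"               # show current ring sizes first
    run ethtool -G "$IFACE" tx 128        # lower the tx ring (step 3)
    run iperf3 -c "$SERVER" -P 8 -t 600   # 8 parallel TCP streams, 10 minutes
}
```

Calling repro_steps with DRY_RUN=0 (as root) would actually change the
ring size, so note the original value from "ethtool -g" in order to
restore it afterwards.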


I am not sure if that helps, Scott. I'll try to work
out more specific steps, but I cannot guarantee you will
see the issue.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting
  reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.

  [Test Case]

  * Unfortunately, this is not easy to reproduce. Also, it is only seen
  on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
  driver.

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
  scope - just the newest Broadcom NICs being used by the Xenial 4.4
  kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
  in-tree fixed driver).

  * The patch is very small and backport is fairly minimal and simple.

  * The fix has been running on the in-tree driver in upstream mainline
  as well as the Ubuntu Linux in-tree driver. Although the Broadcom
  driver has a lot of lower-level code that is different, this piece is
  still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus

2019-03-22 Thread Nivedita Singhvi
Will be submitting SRU request early next week; trying to get
it into this next kernel release cycle. 

** Changed in: linux (Ubuntu)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Bionic)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu)
   Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  In Progress

Bug description:
  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
  and Bionic kernels). 

  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
 i40e driver version: 2.1.14-k 
 Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:
 
  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
  00,, 

But for any queue number > 63, we see this error:

  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument

  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-03-22 Thread Nivedita Singhvi
Just briefly wanted to say that this is one we've discussed at 
length -- we may not be able to get someone who has the right
NIC to test with it in time. 

I'm sanity checking the kernel, but that is not exercising the 
key change here. 

If we could assume verification-done for our purposes here, 
that might be needed.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting
  reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.

  [Test Case]

  * Unfortunately, this is not easy to reproduce. Also, it is only seen
  on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
  driver.

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
  scope - just the newest Broadcom NICs being used by the Xenial 4.4
  kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
  in-tree fixed driver).

  * The patch is very small and backport is fairly minimal and simple.

  * The fix has been running on the in-tree driver in upstream mainline
  as well as the Ubuntu Linux in-tree driver. Although the Broadcom
  driver has a lot of lower-level code that is different, this piece is
  still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1820948] Re: i40e xps management broken when > 64 queues/cpus

2019-03-19 Thread Nivedita Singhvi
It's been reported by an external reporter and reproduced 
internally.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Confirmed

Bug description:
  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
  and Bionic kernels). 

  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
 i40e driver version: 2.1.14-k 
 Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:
 
  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
  00,, 

But for any queue number > 63, we see this error:

  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument

  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions



[Kernel-packages] [Bug 1820948] [NEW] i40e xps management broken when > 64 queues/cpus

2019-03-19 Thread Nivedita Singhvi
Public bug reported:

[Impact]
Transmit packet steering (xps) settings don't work when
the number of queues (cpus) is higher than 64. This is
currently still an issue on the 4.15 kernel (Xenial -hwe
and Bionic kernels). 

It was fixed in Intel's i40e driver version 2.7.11 and
in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

Fix
-
The following commit fixes this issue (as identified
by Lihong Yang in discussion with Intel i40e team):

"i40e: Fix the number of queues available to be mapped for use"
Commit: bc6d33c8d93f520e97a8c6330b8910053d4f


[Test Case]
1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
   i40e driver version: 2.1.14-k 
   Any system with > 64 CPUs

2. For any queue 0 - 63, you can read/set tx xps:
   
echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
echo $?
0
cat /sys/class/net/eth2/queues/tx-63/xps_cpus
00,, 

  But for any queue number > 63, we see this error:

echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
echo: write error: Invalid argument

cat /sys/class/net/eth2/queues/tx-64/xps_cpus
cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument
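
The read check in step 2 can be run over every tx queue at once. The
following is a minimal sketch, not part of the original reproduction:
the function name check_xps and the SYSFS_NET override are made up for
illustration, and only the /sys/class/net/<iface>/queues/tx-N/xps_cpus
path comes from the test case above.

```shell
#!/bin/sh
# Sketch: report which tx queues accept an xps_cpus read.
# SYSFS_NET is a hypothetical override so the function can be pointed
# at a test directory instead of the real sysfs tree.
check_xps() {
    iface="${1:?usage: check_xps IFACE}"
    base="${SYSFS_NET:-/sys/class/net}/$iface/queues"
    # Note: the glob expands in lexical order (tx-10 before tx-2).
    for q in "$base"/tx-*; do
        [ -d "$q" ] || continue
        if mask=$(cat "$q/xps_cpus" 2>/dev/null); then
            echo "${q##*/}: ok ($mask)"
        else
            echo "${q##*/}: read failed"
        fi
    done
}
```

On a system showing the problem described here, one would expect
"check_xps eth2" to print "read failed" for tx-64 and above, matching
the "Invalid argument" errors from cat.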

** Affects: linux (Ubuntu)
 Importance: High
 Status: Confirmed

** Affects: linux (Ubuntu Bionic)
 Importance: High
     Assignee: Nivedita Singhvi (niveditasinghvi)
 Status: Confirmed


** Tags: bionic

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
   Status: New

** Changed in: linux (Ubuntu)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Bionic)
   Status: New => Confirmed

** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1820948

Title:
  i40e xps management broken when > 64 queues/cpus

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Confirmed

Bug description:
  [Impact]
  Transmit packet steering (xps) settings don't work when
  the number of queues (cpus) is higher than 64. This is
  currently still an issue on the 4.15 kernel (Xenial -hwe
  and Bionic kernels). 

  It was fixed in Intel's i40e driver version 2.7.11 and
  in 4.16-rc1 mainline Linux (i.e. Cosmic, Disco have fix).

  Fix
  -
  The following commit fixes this issue (as identified
  by Lihong Yang in discussion with Intel i40e team):

  "i40e: Fix the number of queues available to be mapped for use"
  Commit: bc6d33c8d93f520e97a8c6330b8910053d4f

  
  [Test Case]
  1. Kernel version: Bionic/Xenial -hwe: any 4.15 kernel
 i40e driver version: 2.1.14-k 
 Any system with > 64 CPUs

  2. For any queue 0 - 63, you can read/set tx xps:
 
  echo  > /sys/class/net/eth2/queues/tx-63/xps_cpus
  echo $?
  0
  cat /sys/class/net/eth2/queues/tx-63/xps_cpus
  00,, 

But for any queue number > 63, we see this error:

  echo  > /sys/class/net/eth2/queues/tx-64/xps_cpus
  echo: write error: Invalid argument

  cat /sys/class/net/eth2/queues/tx-64/xps_cpus
  cat: /sys/class/net/eth2/queues/tx-64/xps_cpus: Invalid argument

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1820948/+subscriptions



[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-03-19 Thread Nivedita Singhvi
Submitted SRU request

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress

Bug description:
  [Impact]
  The i40e driver can get stalled on tx timeouts. This can happen when
  DCB is enabled on the connected switch. This can also trigger a
  second situation when a tx timeout occurs before the recovery of
  a previous timeout has completed due to CPU load, which is not
  handled correctly. This leads to networking delays, drops and
  application timeouts and hangs. Note that the first tx timeout
  cause is just one of the ways to end up in the second situation.

  This issue was seen on a heavily loaded Kafka broker node running
  the 4.15.0-38-generic kernel on Xenial. 

  Symptoms include messages in the kernel log of the form:

  ---
  [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
  [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
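
To see how often these recovery messages occur per interface, the
kernel log can be filtered with a short script. This is a sketch under
the assumption that the messages have the form shown above; the
function name count_tx_timeouts is made up for illustration.

```shell
#!/bin/sh
# Sketch: count i40e "tx_timeout recovery level" messages per interface.
# Reads kernel log text on stdin, prints "<iface> <count>" lines.
count_tx_timeouts() {
    awk '/i40e/ && /tx_timeout recovery level/ {
             for (i = 2; i <= NF; i++)
                 if ($i == "tx_timeout") {
                     name = $(i - 1)       # field before it is "<iface>:"
                     sub(/:$/, "", name)   # strip the trailing colon
                     n[name]++
                 }
         }
         END { for (k in n) print k, n[k] }' | sort
}
```

Typical usage would be "dmesg | count_tx_timeouts" or
"journalctl -k | count_tx_timeouts". The detail line with VSI_seid and
the "recovery unsuccessful" line are deliberately not counted.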
  

  With the test kernel provided in this LP bug which had these
  two commits compiled in, the problem has not been seen again,
  and has been running successfully for several months:

  "i40e: Fix for Tx timeouts when interface is brought up if 
   DCB is enabled"
  Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

  "i40e: prevent overlapping tx_timeout recover"
  Commit: d5585b7b6846a6d0f9517afe57be3843150719da

  * The first commit is already in Disco, Cosmic
  * The second commit is already in Disco
  * Bionic needs both patches and Cosmic needs the second

  [Test Case]
  * We are considering the case of both issues above occurring.
  * Seen by reporter on a Kafka broker node with heavy traffic.
  * Not easy to reproduce as it requires something like the 
following example environment and heavy load:

Kernel: 4.15.0-38-generic
Network driver: i40e
  version: 2.1.14-k
  firmware-version: 6.00 0x800034e6 18.3.6
NIC: Intel 40Gb XL710 
DCB enabled
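
The driver and version fields listed above have the same form as the
output of "ethtool -i". As a rough sketch, a host can be compared with
this example environment as follows. The function name
matches_example_env is hypothetical, and a match only means the driver
and version are the same as in the example, not that the host will hit
the bug.

```shell
#!/bin/sh
# Sketch: succeed if "ethtool -i" output on stdin reports the driver and
# version from the example environment (i40e, 2.1.14-k).
matches_example_env() {
    awk -F': ' '
        $1 == "driver"  { d = $2 }
        $1 == "version" { v = $2 }   # "firmware-version" does not match
        END { exit !(d == "i40e" && v == "2.1.14-k") }'
}
```

For example: "ethtool -i eno2 | matches_example_env && echo match".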

  
  [Regression Potential]
  Low, as the first only impacts i40e DCB environment, and has
  been running for several months in production-load testing 
  successfully.

  
  --- Original Description
  Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13
  to the Kernel 4.15.0-24-generic.

  On a "Dell PowerEdge R330" server with a network adapter "Intel
  Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network
  card no longer works and permanently displays these three lines:

  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions



[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-03-19 Thread Nivedita Singhvi
** Tags added: bionic cosmic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress

Bug description:
  [Impact]
  The i40e driver can get stalled on tx timeouts. This can happen when
  DCB is enabled on the connected switch. This can also trigger a
  second situation when a tx timeout occurs before the recovery of
  a previous timeout has completed due to CPU load, which is not
  handled correctly. This leads to networking delays, drops and
  application timeouts and hangs. Note that the first tx timeout
  cause is just one of the ways to end up in the second situation.

  This issue was seen on a heavily loaded Kafka broker node running
  the 4.15.0-38-generic kernel on Xenial. 

  Symptoms include messages in the kernel log of the form:

  ---
  [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
  [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
  

  With the test kernel provided in this LP bug which had these
  two commits compiled in, the problem has not been seen again,
  and has been running successfully for several months:

  "i40e: Fix for Tx timeouts when interface is brought up if 
   DCB is enabled"
  Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

  "i40e: prevent overlapping tx_timeout recover"
  Commit: d5585b7b6846a6d0f9517afe57be3843150719da

  * The first commit is already in Disco, Cosmic
  * The second commit is already in Disco
  * Bionic needs both patches and Cosmic needs the second

  [Test Case]
  * We are considering the case of both issues above occurring.
  * Seen by reporter on a Kafka broker node with heavy traffic.
  * Not easy to reproduce as it requires something like the 
following example environment and heavy load:

Kernel: 4.15.0-38-generic
Network driver: i40e
  version: 2.1.14-k
  firmware-version: 6.00 0x800034e6 18.3.6
NIC: Intel 40Gb XL710 
DCB enabled

  
  [Regression Potential]
  Low, as the first commit only impacts i40e DCB environments, and the
  patched test kernel has been running successfully for several months
  in production-load testing.

  
  --- Original Description
  Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel
  4.13 to the Kernel 4.15.0-24-generic.

  On a "Dell PowerEdge R330" server with a network adapter "Intel
  Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network
  card no longer works and permanently displays these three lines :

  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions



[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-03-19 Thread Nivedita Singhvi
** Description changed:

- Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13
- to the Kernel 4.15.0-24-generic.
+ [Impact]
+ The i40e driver can stall on tx timeouts. One cause is DCB being
+ enabled on the connected switch. A second, separate problem arises
+ when a new tx timeout occurs before recovery from a previous one
+ has completed (for example, under CPU load); this overlap is not
+ handled correctly. The result is networking delays, packet drops,
+ and application timeouts and hangs. Note that the DCB-related tx
+ timeout is just one of the ways to end up in the second situation.
+ 
+ This issue was seen on a heavily loaded Kafka broker node running
+ the 4.15.0-38-generic kernel on Xenial. 
+ 
+ Symptoms include messages in the kernel log of the form:
+ 
+ ---
+ [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
+ [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
+ 
+ 
+ With the test kernel provided in this LP bug which had these
+ two commits compiled in, the problem has not been seen again,
+ and has been running successfully for several months:
+ 
+ "i40e: Fix for Tx timeouts when interface is brought up if 
+  DCB is enabled"
+ Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee
+ 
+ "i40e: prevent overlapping tx_timeout recover"
+ Commit: d5585b7b6846a6d0f9517afe57be3843150719da
+ 
+ * The first commit is already in Disco, Cosmic
+ * The second commit is already in Disco
+ * Bionic needs both patches and Cosmic needs the second
+ 
+ [Test Case]
+ * We are considering the case of both issues above occurring.
+ * Seen by reporter on a Kafka broker node with heavy traffic.
+ * Not easy to reproduce as it requires something like the 
+   following example environment and heavy load:
+ 
+   Kernel: 4.15.0-38-generic
+   Network driver: i40e
+     version: 2.1.14-k
+     firmware-version: 6.00 0x800034e6 18.3.6
+   NIC: Intel 40Gb XL710 
+   DCB enabled
+ 
+ 
+ [Regression Potential]
+ Low, as the first commit only impacts i40e DCB environments, and the
+ patched test kernel has been running successfully for several months
+ in production-load testing.
+ 
+ 
+ --- Original Description
+ Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel 4.13
+ to the Kernel 4.15.0-24-generic.
  
  On a "Dell PowerEdge R330" server with a network adapter "Intel Ethernet
  Converged Network Adapter X710-DA2" (driver i40e) the network card no
  longer works and permanently displays these three lines :
  
- 
  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress

Bug description:
  [Impact]
  The i40e driver can stall on tx timeouts. One cause is DCB being
  enabled on the connected switch. A second, separate problem arises
  when a new tx timeout occurs before recovery from a previous one
  has completed (for example, under CPU load); this overlap is not
  handled correctly. The result is networking delays, packet drops,
  and application timeouts and hangs. Note that the DCB-related tx
  timeout is just one of the ways to end up in the second situation.

  This issue was seen on a heavily loaded Kafka broker node running
  the 4.15.0-38-generic kernel on Xenial. 

  Symptoms include messages in the kernel log of the form:

  ---
  [4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
  [4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
  

  With the test kernel provided in this LP bug which had these
  two commits compiled in, the problem has not been seen again,
  and has been running successfully for several months:

  "i40e: Fix for Tx timeouts when interface is brought up if 
   DCB is enabled"
  Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

  "i40e: prevent overlapping tx_timeout recover"
  Commit: d5585b7b6846a6d0f9517afe57be3843150719da

  * The first commit is already in Disco, Cosmic
  * The second commit is already in Disco
  * Bionic needs both patches and Cosmic needs the second

  [Test Case]
  * We are considering the case of both issues above occurring.
  * Seen by reporter on a Kafka broker node with heavy traffic.
  * Not easy to reproduce as it requires something like the
    following example environment and heavy load:

Kernel: 

[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-03-04 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu Bionic)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Cosmic)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

** Changed in: linux (Ubuntu Bionic)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Cosmic)
   Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress

Bug description:
  Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel
  4.13 to the Kernel 4.15.0-24-generic.

  On a "Dell PowerEdge R330" server with a network adapter "Intel
  Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network
  card no longer works and permanently displays these three lines :

  
  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions



[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-03-04 Thread Nivedita Singhvi
We have a user who has been successfully running under load
with the test kernel provided here which was patched with 
the following two commits:

"i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled"
Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee

"i40e: prevent overlapping tx_timeout recover"
Commit: d5585b7b6846a6d0f9517afe57be3843150719da

The issue was hit while running on 4.15.0-38-generic #41~16.04.1-Ubuntu 
on Xenial (the hwe kernel). 

Symptoms include messages in the kernel log of the form:

[4733544.982116] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
[4733544.982119] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
[4733572.116270] i40e :18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 2, NTC: 0x49, HWB: 0x123, NTU: 0x123, TAIL: 0x123, INT: 0x0
[4733572.116272] i40e :18:00.1 eno2: tx_timeout recovery level 1, hung_queue 2

Leading to Kafka server issues, etc.
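
For anyone checking their own logs for the same symptom, the tx_timeout entries quoted above can be tallied per interface and queue. This is only a sketch: it assumes each kernel log entry sits on a single line (as in kern.log or dmesg output), the regular expression is modelled on the messages quoted in this bug rather than taken from the driver source, and the names TX_TIMEOUT_RE and count_tx_timeouts are made up for illustration.

```python
import re
from collections import Counter

# Modelled on the "tx_timeout: VSI_seid: ..." lines quoted in this bug.
# The PCI address is matched loosely as any non-blank token.
TX_TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i40e\s+(?P<pci>\S+)\s+(?P<ifname>\S+):\s+"
    r"tx_timeout:\s+VSI_seid:\s+(?P<seid>\d+),\s+Q\s+(?P<queue>\d+),"
)


def count_tx_timeouts(lines):
    """Count i40e tx_timeout events per (interface, queue) pair."""
    counts = Counter()
    for line in lines:
        match = TX_TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("ifname"), int(match.group("queue")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python3 count_tx_timeouts.py < /var/log/kern.log
    for (ifname, queue), n in sorted(count_tx_timeouts(sys.stdin).items()):
        print(f"{ifname} queue {queue}: {n} tx_timeout event(s)")
```

The "recovery level" lines are deliberately not counted, since each one accompanies a tx_timeout line already tallied.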

We are fairly confident this is the same issue the original reporter
hit, and we'd like to use this bug to proceed with the stable release
update process.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Confirmed
Status in linux source package in Cosmic:
  Confirmed

Bug description:
  Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel
  4.13 to the Kernel 4.15.0-24-generic.

  On a "Dell PowerEdge R330" server with a network adapter "Intel
  Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network
  card no longer works and permanently displays these three lines :

  
  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-03-03 Thread Nivedita Singhvi
Terry,

We've had a lot of discussion about this bug. It does not have
a reliable reproducer, and I have not yet received any acks
from testing of the above.

Our thinking was that it is still better to patch it, since the
issue has been seen with the mainline driver as well and we'd like
to avoid a recurrence.

The need is to have the fix available in the official Xenial
packages, for sure (rather than providing a temporary test kernel
via our PPA, for instance).

FWIW, here are the boards in question:
enum board_idx {
	BCM57301,
	BCM57417_NPAR,
	BCM58700,
	BCM57311,
	BCM57312,
	BCM57402,
	BCM57402_NPAR,
	BCM57407,
	BCM57412,
	BCM57414,
	BCM57416,
	BCM57417,
	BCM57412_NPAR,
	BCM57314,
	BCM57417_SFP,
	BCM57416_SFP,
	BCM57404_NPAR,
	BCM57406_NPAR,
	BCM57407_SFP,
	BCM57407_NPAR,
	BCM57414_NPAR,
	BCM57416_NPAR,
	BCM57452,
	BCM57454,
	NETXTREME_E_VF,
	NETXTREME_C_VF,
};

Per conversation with Brad and Jay, it was agreed that applying
just this fix to the bnxt_en_bpo driver was the way to go,
despite the lack of a reproducer, rather than pulling in an
entirely new driver from Broadcom, which was also considered.


The FW version the issue was hit on: 
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03

But it might be best to test with latest available
firmware (214.0.166/1.9.2 pkg 21.40.16.6 or later).
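
A quick way to check whether a given host is already at or beyond that firmware is to compare the "pkg" part of the firmware-version line in `ethtool -i` output. The sketch below is an illustration only: it assumes the pkg version can be compared numerically field by field (an assumption, not something Broadcom documents here), and the names MIN_PKG, pkg_version and firmware_is_current are made up.

```python
import re

# "pkg 21.40.16.6", the newer firmware package version named above.
MIN_PKG = (21, 40, 16, 6)


def pkg_version(ethtool_i_output):
    """Extract the 'pkg' part of the firmware-version line as a tuple of ints.

    Returns None if no firmware-version line with a pkg version is found.
    """
    for line in ethtool_i_output.splitlines():
        if line.startswith("firmware-version:"):
            match = re.search(r"pkg\s+(\d+(?:\.\d+)*)", line)
            if match:
                return tuple(int(part) for part in match.group(1).split("."))
    return None


def firmware_is_current(ethtool_i_output, minimum=MIN_PKG):
    """True if the pkg version is at least `minimum` (field-by-field compare)."""
    version = pkg_version(ethtool_i_output)
    return version is not None and version >= minimum
```

Feeding it the saved output of `ethtool -i <interface>` is enough; it does not call ethtool itself.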

Not sure if that helps? Let me know if I can address anything
else.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  In Progress

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.
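
To check whether a host has hit the same watchdog timeout, the kernel log can be scanned for the message quoted above. A minimal sketch, assuming one log entry per line; the pattern is modelled on the quoted "NETDEV WATCHDOG" line, and the names WATCHDOG_RE and find_watchdog_timeouts are hypothetical.

```python
import re

# Modelled on the kernel log line quoted above:
#   NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<ifname>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(lines, driver="bnxt_en_bpo"):
    """Return (interface, queue) for each watchdog timeout logged by `driver`."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("ifname"), int(match.group("queue"))))
    return hits
```

Since the issue was seen only once, an empty result says little; a hit on a bnxt_en_bpo interface is the useful signal.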

  [Test Case]

  * Unfortunately, this is not easy to reproduce. Also, it is only seen
  on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
  driver.

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
  scope - just the newest Broadcom NICs being used by the Xenial 4.4
  kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
  in-tree fixed driver).

  * The patch is very small and backport is fairly minimal and simple.

  * The fix has been running in the in-tree driver in upstream mainline
  as well as in the Ubuntu Linux in-tree driver. Although the Broadcom
  driver has a lot of lower-level code that differs, this piece is
  still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-03-03 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu Xenial)
   Status: Confirmed => In Progress

** Changed in: linux (Ubuntu Xenial)
 Assignee: (unassigned) => Nivedita Singhvi (niveditasinghvi)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  In Progress

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.

  [Test Case]

  * Unfortunately, this is not easy to reproduce. Also, it is only seen
  on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
  driver.

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
  scope - just the newest Broadcom NICs being used by the Xenial 4.4
  kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
  in-tree fixed driver).

  * The patch is very small and backport is fairly minimal and simple.

  * The fix has been running in the in-tree driver in upstream mainline
  as well as in the Ubuntu Linux in-tree driver. Although the Broadcom
  driver has a lot of lower-level code that differs, this piece is
  still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-02-22 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu Xenial)
   Status: New => Confirmed

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Confirmed

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.

  [Test Case]

  * Unfortunately, this is not easy to reproduce. Also, it is only seen
  on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
  driver.

  [Regression Potential]

  * The patch is restricted to the bpo driver, with very constrained
  scope - just the newest Broadcom NICs being used by the Xenial 4.4
  kernel (as opposed to the hwe 4.15 etc. kernels, which would have the
  in-tree fixed driver).

  * The patch is very small and backport is fairly minimal and simple.

  * The fix has been running in the in-tree driver in upstream mainline
  as well as in the Ubuntu Linux in-tree driver. Although the Broadcom
  driver has a lot of lower-level code that differs, this piece is
  still the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-02-21 Thread Nivedita Singhvi
** Description changed:

+ [Impact]
+ 
+ The bnxt_en_bpo driver experienced tx timeouts causing the system to
+ experience network stalls and fail to send data and heartbeat packets.
+ 
  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):
  
  * The bnxt_en_bpo driver froze on a "TX timed out" error
-   and triggered the Netdev Watchdog timer under load. 
+   and triggered the Netdev Watchdog timer under load.
  
  * From kernel log:
-   "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
-   See attached kern.log excerpt file for full excerpt of error log.
+   "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
+   See attached kern.log excerpt file for full excerpt of error log.
  
- * Release = Xenial 
-   Kernel = 4.4.0-141-generic #167
-   eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
-   
+ * Release = Xenial
+   Kernel = 4.4.0-141-generic #167
+   eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
+ 
  * This caused the driver to reset in order to recover:
-   
-   "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"
-  
-   driver: bnxt_en_bpo
-   version: 1.8.1
-   source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
+ 
+   "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset
+ task!"
+ 
+   driver: bnxt_en_bpo
+   version: 1.8.1
+   source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
  
  * The loss of connectivity and softirq stall caused other failures
-   on the system. 
+   on the system.
  
  * The bnxt_en_bpo driver is the imported Broadcom driver
-   pulled in to support newer Broadcom HW (specific boards)
-   while the bnx_en module continues to support the older
-   HW. The current Linux upstream driver does not compile
-   easily with the 4.4 kernel (too many changes). 
+   pulled in to support newer Broadcom HW (specific boards)
+   while the bnxt_en module continues to support the older
+   HW. The current Linux upstream driver does not compile
+   easily with the 4.4 kernel (too many changes).
  
  * This upstream and bnxt_en driver fix is a likely solution:
-"bnxt_en: Fix TX timeout during netpoll"
-commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
-   
-   This fix has not been applied to the bnxt_en_po driver
-   version, but review of the code indicates that it is 
-   susceptible to the bug, and the fix would be reasonable. 
+    "bnxt_en: Fix TX timeout during netpoll"
+    commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906
  
- * No easy way to reproduce this
+   This fix has not been applied to the bnxt_en_bpo driver
+   version, but review of the code indicates that it is
+   susceptible to the bug, and the fix would be reasonable.
+ 
+ [Test Case]
+ 
+ * Unfortunately, this is not easy to reproduce. Also, it is only seen on
+ 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo
+ driver.
+ 
+ [Regression Potential]
+ 
+ * The patch is restricted to the bpo driver, with very constrained scope
+ - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
+ opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
+ driver).
+ 
+ * The patch is very small and backport is fairly minimal and simple.
+ 
+ * The fix has been running in the in-tree driver in upstream mainline as
+ well as in the Ubuntu Linux in-tree driver. Although the Broadcom driver
+ has a lot of lower-level code that differs, this piece is still the
+ same.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [Impact]

  The bnxt_en_bpo driver experienced tx timeouts causing the system to
  experience network stalls and fail to send data and heartbeat packets.

  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW 

[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-02-21 Thread Nivedita Singhvi
If anyone is interested and willing to test a 4.4 kernel 
patched with the fix "bnxt_en: Fix TX timeout during netpoll"
backported to the bnxt_en_bpo driver, please find the packages
here:

http://people.canonical.com/~nivedita/bpo/

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.

  * No easy way to reproduce this

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1779756] Re: Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu 18.04)

2019-02-11 Thread Nivedita Singhvi
Any update on a Bionic fix?

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1779756

Title:
  Intel XL710 - i40e driver does not work with kernel 4.15 (Ubuntu
  18.04)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Confirmed

Bug description:
  Today Ubuntu 16.04 LTS Enablement Stacks has moved from the Kernel
  4.13 to the Kernel 4.15.0-24-generic.

  On a "Dell PowerEdge R330" server with a network adapter "Intel
  Ethernet Converged Network Adapter X710-DA2" (driver i40e) the network
  card no longer works and permanently displays these three lines :

  
  [   98.012098] i40e :01:00.0 enp1s0f0: tx_timeout: VSI_seid: 388, Q 8, NTC: 0x0, HWB: 0x0, NTU: 0x1, TAIL: 0x1, INT: 0x1
  [   98.012119] i40e :01:00.0 enp1s0f0: tx_timeout recovery level 11, hung_queue 8
  [   98.012125] i40e :01:00.0 enp1s0f0: tx_timeout recovery unsuccessful

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-01-31 Thread Nivedita Singhvi
** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  The following 25Gb Broadcom NIC error was seen on Xenial
  running the 4.4.0-141-generic kernel on an amd64 host
  seeing moderate-heavy network traffic (just once):

  * The bnxt_en_bpo driver froze on a "TX timed out" error
    and triggered the Netdev Watchdog timer under load.

  * From kernel log:
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
    See attached kern.log excerpt file for full excerpt of error log.

  * Release = Xenial
    Kernel = 4.4.0-141-generic #167
    eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

  * This caused the driver to reset in order to recover:

    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

    driver: bnxt_en_bpo
    version: 1.8.1
    source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

  * The loss of connectivity and softirq stall caused other failures
    on the system.

  * The bnxt_en_bpo driver is the imported Broadcom driver
    pulled in to support newer Broadcom HW (specific boards)
    while the bnxt_en module continues to support the older
    HW. The current Linux upstream driver does not compile
    easily with the 4.4 kernel (too many changes).

  * This upstream and bnxt_en driver fix is a likely solution:
     "bnxt_en: Fix TX timeout during netpoll"
     commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

    This fix has not been applied to the bnxt_en_bpo driver
    version, but review of the code indicates that it is
    susceptible to the bug, and the fix would be reasonable.

  * No easy way to reproduce this

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+subscriptions



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-01-31 Thread Nivedita Singhvi
Because NIC flapping was observed earlier on systems with the
25Gb Broadcom NIC, the firmware was upgraded to avoid a known
FW bug. The original configuration was:

$ cat ethtool_-i_enp59s0f1d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
expansion-rom-version:
bus-info: :3b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no 

The FW was upgraded on affected systems to:

$ cat ethtool_-i_eno2d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
expansion-rom-version: 
bus-info: :19:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

Unfortunately, it is not clear which FW version the current
bug occurred on. I believe it was the newer one but can't
confirm, since it happened in the midst of several reboots.
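
When several saved `ethtool -i` dumps like the two above are on hand, a field-by-field comparison makes the firmware change easy to spot. This is a minimal Python sketch; the helper names `parse_ethtool_info` and `diff_info` are hypothetical, and the inputs are abbreviated copies of the dumps quoted in this message. Note that the two dumps come from different interfaces, so `bus-info` differs as well.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` style 'key: value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as ":3b:00.1" survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def diff_info(before, after):
    """Return {key: (old, new)} for every field whose value changed."""
    keys = sorted(set(before) | set(after))
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


old = parse_ethtool_info("""\
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
bus-info: :3b:00.1
""")
new = parse_ethtool_info("""\
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
bus-info: :19:00.1
""")
changes = diff_info(old, new)
for key, (a, b) in changes.items():
    print(f"{key}: {a} -> {b}")
```

Recording such a dump at each boot would help tie a future occurrence to a specific FW version.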

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  Incomplete



[Kernel-packages] [Bug 1814095] Re: bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-01-31 Thread Nivedita Singhvi
** Attachment added: "kern.log.excerpt-netdev-watchdog-timeout.txt"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1814095/+attachment/5234643/+files/kern.log.excerpt-netdev-watchdog-timeout.txt

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  New



[Kernel-packages] [Bug 1814095] [NEW] bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

2019-01-31 Thread Nivedita Singhvi
Public bug reported:

The following 25Gb Broadcom NIC error was seen once on Xenial
running the 4.4.0-141-generic kernel on an amd64 host
under moderate-to-heavy network traffic:

* The bnxt_en_bpo driver froze on a "TX timed out" error
  and triggered the Netdev Watchdog timer under load.

* From kernel log:
  "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
  See the attached kern.log excerpt for the full error log.

* Release = Xenial
  Kernel = 4.4.0-141-generic #167
  eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

* This caused the driver to reset in order to recover:

  "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!"

  driver: bnxt_en_bpo
  version: 1.8.1
  source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

* The loss of connectivity and softirq stall caused other failures
  on the system.

* The bnxt_en_bpo driver is the imported Broadcom driver
  pulled in to support newer Broadcom HW (specific boards),
  while the bnxt_en module continues to support the older
  HW. The current Linux upstream driver does not compile
  easily with the 4.4 kernel (too many changes).

* This upstream bnxt_en driver fix is a likely solution:
   "bnxt_en: Fix TX timeout during netpoll"
   commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

  This fix has not been applied to the bnxt_en_bpo driver
  version, but review of the code indicates that it is
  susceptible to the bug, and the fix would be reasonable.

* No easy way to reproduce this
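
Since the problem cannot be reproduced on demand, it may help to scan collected kernel logs for further occurrences. The sketch below is a minimal Python example that counts watchdog timeouts per interface, driver, and queue. The function name `count_watchdog_timeouts` is hypothetical, and the pattern is derived only from the single message quoted in this report.

```python
import re
from collections import Counter

# Pattern derived from the watchdog message quoted in this report.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<ifname>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_watchdog_timeouts(lines):
    """Count watchdog timeouts keyed by (interface, driver, queue)."""
    counts = Counter()
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m:
            key = (m.group("ifname"), m.group("driver"), int(m.group("queue")))
            counts[key] += 1
    return counts


sample = [
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out",
    "bnxt_en_bpo :19:00.1 eno2d1: TX timeout detected, starting reset task!",
]
counts = count_watchdog_timeouts(sample)
for (ifname, driver, queue), n in counts.items():
    print(f"{ifname} ({driver}) queue {queue}: {n} timeout(s)")
```

The function accepts any iterable of lines, so an open log file can be passed directly; the path to that file depends on the system and is not specified here.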

** Affects: linux (Ubuntu)
 Importance: High
 Status: New


** Tags: xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1814095

Title:
  bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Status in linux package in Ubuntu:
  New


