[Kernel-packages] [Bug 1919154] Re: Enable CONFIG_NO_HZ_FULL on supported architectures
Gerald, Using gettimeofday for testing the effects of NO_HZ_FULL on context switch duration may not be measuring anything that changes with regards to NO_HZ_FULL. gettimeofday is implemented via VDSO, and is not an actual system call that requires a context switch. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1919154 Title: Enable CONFIG_NO_HZ_FULL on supported architectures Status in linux package in Ubuntu: In Progress Status in linux source package in Focal: In Progress Status in linux source package in Groovy: Won't Fix Status in linux source package in Hirsute: In Progress Status in linux source package in Jammy: In Progress Status in linux source package in Lunar: In Progress Status in linux source package in Mantic: In Progress Bug description: [Impact] The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid sending scheduling-clock interrupts to CPUs with a single runnable task, and such CPUs are said to be "adaptive-ticks CPUs". This is important for applications with aggressive real-time response constraints because it allows them to improve their worst-case response times by the maximum duration of a scheduling-clock interrupt. It is also important for computationally intensive short-iteration workloads: If any CPU is delayed during a given iteration, all the other CPUs will be forced to wait idle while the delayed CPU finishes. Thus, the delay is multiplied by one less than the number of CPUs. In these situations, there is again strong motivation to avoid sending scheduling-clock interrupts. 
[Test Plan] In order to verify the change will not cause performance issues in context switch we should compare the results for: ./stress-ng --seq 0 --metrics-brief -t 15 Running on a dedicated machine and with the following services disabled: smartd.service, iscsid.service, apport.service, cron.service, anacron.timer, apt-daily.timer, apt-daily-upgrade.timer, fstrim.timer, logrotate.timer, motd-news.timer, man-db.timer. The results didn't show any performance regression: https://kernel.ubuntu.com/~mhcerri/lp1919154/ [Where problems could occur] Performance degradation might happen for workloads with intensive context switching. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1919154/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
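Before comparing stress-ng results, it may help to confirm that the kernel under test was actually built with the option. The following is a minimal sketch, not part of the original test plan; the helper name kconfig_state is hypothetical, and it assumes the kernel configuration is available as a plain-text file (on Ubuntu this is normally /boot/config-$(uname -r)).

```shell
# Hypothetical helper (not from the bug report): print the state of a
# Kconfig symbol as recorded in a kernel config file.
# Prints the value after "=" (for example "y" or "m"), "not set" when the
# config carries the "# SYMBOL is not set" marker, or "unknown" when the
# file is unreadable or the symbol does not appear at all.
kconfig_state() (
    symbol=$1
    config=$2
    [ -r "$config" ] || { echo unknown; exit 0; }
    if grep -q "^${symbol}=" "$config"; then
        # Strip the "SYMBOL=" prefix and print only the value.
        sed -n "s/^${symbol}=//p" "$config" | head -n 1
    elif grep -q "^# ${symbol} is not set" "$config"; then
        echo "not set"
    else
        echo unknown
    fi
)
```

A typical call would be: kconfig_state CONFIG_NO_HZ_FULL "/boot/config-$(uname -r)". Note that a "y" result only shows the option was compiled in; adaptive-ticks behaviour additionally depends on which CPUs are listed in the nohz_full= boot parameter.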
[Kernel-packages] [Bug 2036675] [NEW] 5.15.0-85 live migration regression
Public bug reported: Fixes added for LP 2032164 [0] to resolve an issue in live migration have unfortunately introduced a regression, causing a previously working live migration pattern to fail when tested with the 5.15.0-85 kernel from -proposed. Specifically, live migration from a PKRU-enabled host running a kernel older than 5.15.0-85 to a host running the 5.15.0-85 kernel will fail. The destination can be either with or without PKRU; both cases fail, although in different ways (one hangs, the other fails due to a PCID flag issue). The commits in question are commit fa9225d64f215e8109de10f6b6c7a08f033d0ec0 Author: Dr. David Alan Gilbert Date: Mon Aug 21 14:47:28 2023 +0800 KVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES commit 27a189b881278c8ad9c16b0ee05668d724352733 Author: Leonardo Bras Date: Mon Aug 21 14:47:27 2023 +0800 x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 [0] https://bugs.launchpad.net/bugs/2032164 ** Affects: linux (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/2036675 Title: 5.15.0-85 live migration regression Status in linux package in Ubuntu: New
[Kernel-packages] [Bug 2004262] Re: Intel E810 NICs driver in causing hangs when booting and bonds configured
https://lore.kernel.org/netdev/20230131213703.1347761-2-anthony.l.ngu...@intel.com/T/#u A possible fix for this problem. The patch was posted on intel-wired-lan a couple of weeks ago and just hit netdev today. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/2004262 Title: Intel E810 NICs driver in causing hangs when booting and bonds configured Status in linux package in Ubuntu: Incomplete Bug description: jammy 22.04.1 linux-image-generic 5.15.0-58-generic Intel E810-XXV Dual Port NICs in Dell PowerEdge R650 After bonding is enabled on the switch and server side, the system hangs while initializing Ubuntu. The kernel loads, but around the time the network services start the system can hang, sometimes for 5 minutes and in other cases indefinitely. The messages "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" and "systemd-resolve blocked for more than 120 seconds" appear, and eventually the network services just attempt to start and never do. This happens with or without DHCP enabled. Tried this same setup with the hwe-22.04, hwe-20.04, hwe-22.04-edge and linux-oem kernels and all exhibit the same failure. As a workaround, installing version 1.10.1.2.2 of the Intel 'ice' driver works. The system doesn't even remotely hang at startup and all networking functions remain working (ping, DNS, general accessibility). 
The driver can be found at https://downloadmirror.intel.com/763930/ice-1.10.1.2.2.tar.gz --- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Jan 31 13:08 seq crw-rw 1 root audio 116, 33 Jan 31 13:08 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.11-0ubuntu82.3 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: CRDA: N/A CasperMD5json: { "result": "skip" } DistroRelease: Ubuntu 22.04 InstallationDate: Installed on 2023-01-27 (3 days ago) InstallationMedia: Ubuntu-Server 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809) IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' MachineType: Dell Inc. PowerEdge R650 Package: linux (not installed) PciMultimedia: ProcFB: 0 mgag200drmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-58-generic root=UUID=668aab7c-abe9-434b-a810-acc6eab76cbc ro fsck.mode=skip ProcVersionSignature: Ubuntu 5.15.0-58.64-generic 5.15.74 RelatedPackageVersions: linux-restricted-modules-5.15.0-58-generic N/A linux-backports-modules-5.15.0-58-generic N/A linux-firmware 20220329.git681281e4-0ubuntu3.9 RfKill: Error: [Errno 2] No such file or directory: 'rfkill' Tags: jammy uec-images Uname: Linux 5.15.0-58-generic x86_64 UpgradeStatus: No upgrade log present (probably fresh install) UserGroups: N/A _MarkForUpload: True dmi.bios.date: 09/14/2022 dmi.bios.release: 1.8 dmi.bios.vendor: Dell Inc. dmi.bios.version: 1.8.2 dmi.board.name: 0PJ7YJ dmi.board.vendor: Dell Inc. dmi.board.version: A01 dmi.chassis.type: 23 dmi.chassis.vendor: Dell Inc. 
dmi.modalias: dmi:bvnDellInc.:bvr1.8.2:bd09/14/2022:br1.8:svnDellInc.:pnPowerEdgeR650:pvr:rvnDellInc.:rn0PJ7YJ:rvrA01:cvnDellInc.:ct23:cvr:skuSKU=0912;ModelName=PowerEdgeR650: dmi.product.family: PowerEdge dmi.product.name: PowerEdge R650 dmi.product.sku: SKU=0912;ModelName=PowerEdge R650 dmi.sys.vendor: Dell Inc.
[Kernel-packages] [Bug 1959702] Re: Regression: ip6 ndp broken, host bridge doesn't add vlan guest entry to mdb
Harry, I'm still working to reproduce this, without success. I have set the .autoconf sysctl to 0 (which controls creation of local addresses in response to received Router Advertisements), as well as setting .addr_gen_mode to 1 (to disable SLAAC (fe80::) addresses). In any event, .autoconf=0 and .addr_gen_mode=1 still fails to reproduce the issue on my test system. I find that if I disable mcast_flood on the relevant bridge ports (i.e., bridge link set dev vnet1 mcast_flood off) I do see the behavior you describe, but in that case no variant that I've tried (no vid, and all vids in use) of "bridge mdb add ... grp ff02::1:ff00:2" appears to permit ND traffic to pass to the VM destination. Can you provide more specifics of how exactly the bridge and ports are configured? Ideally, both the method to set it up, as well as the configuration details when failing (i.e., "ip -s -d link show" for the bridge and relevant bridge ports, "bridge vlan show", "bridge mdb show", "bridge fdb show br [bridgename]") Also, to answer a question from your original report, the default setting in the kernel for multicast_snooping (enabled, i.e., 1) hasn't changed recently (and quite possibly ever). -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1959702 Title: Regression: ip6 ndp broken, host bridge doesn't add vlan guest entry to mdb Status in linux package in Ubuntu: Confirmed Bug description: Starting at the end: I believe as the bug presently requires each of the host's bridge ports to be ipv6 addressable to enable ipv6 to function in the guest, and most admins won't think to add special entries into their host's nftables.conf to allow for it 'because who knew?' it represents what you might call a 'passive security vulnerability'. A recent kernel upgrade has broken ipv6/ip6 ndp in a host/kvm setup using a bridge on the host and vlans for some guests. 
I've tracked the problem to a failure of the mcast code to add entries to the host's mdb table. Manually adding the entries to the mdb on the bridge corrects the problem. It's very easy to demonstrate the bug in an all ubuntu setup.

1. On an ubuntu host, create two vms, I used libvirt, as set up below.

2. On the host, create a bridge and vlan with two ports, each with the chosen vlan as PVID and egress untagged. Assign those ports one each to the guests as the interface, use e1000. Be sure to NOT autoconfigure the host side of the bridge ports with any ip4 or ip6 address (including fe80::), it's just an avoidable security risk. We don't want to allow the host any sort of ip access / exposure to the vlan. In other words, treat the host's bridge ports as if a 'real off-host switch' without expectation of making each bridge's port being ip6 addressable on the bridge itself. (FWIW: Worth checking if the vlan is left tagged and not pvid, and the vlan is decoded in the guest as a separate interface, does the problem go away? It imposes the burden of vlan management awareness to the guest and so is not acceptable as a solution.)

3. On the host, assign a physical NIC to the bridge and the vlan to the nic. The egress is tagged for the chosen vlan and not PVID. Optionally set up an off-host gateway for the vlan, but it isn't necessary to show the bug.

4. On each guest, manually assign a unique ip4 and ip6 address on the same subnet (you'll see though dhcp4 could work if there was an off-host router providing related services, the bug prevents dhcp6 from working).

5. On one vm, ping the other. Notice ip4 pings work, ip6 pings do not.

6. Manually add the fe02::ffxx: entries for each vm to the vlan to the host bridge's multicast table. Use 'temp' if you're quick enough, otherwise perm.

7. Notice pings between the guests now work on ip6 and ipv4.
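Step 6 can be sketched as follows. This is an illustration under stated assumptions, not the reporter's exact commands: the report writes the group as "fe02::ffxx:", and the sketch assumes this means the IPv6 solicited-node multicast group (ff02::1:ffXX:XXXX, built from the low 24 bits of the guest address), which is the form used in the "bridge mdb add ... grp ff02::1:ff00:2" example elsewhere in this thread. The function names, the bridge name br0, the port vnet1, and VLAN 100 are all hypothetical, and the address must be given fully expanded (eight groups, no "::"). The second function only prints the bridge(8) command; it does not run it.

```shell
# Hypothetical helper: derive the solicited-node multicast group
# (ff02::1:ffXX:XXXX) from a fully expanded IPv6 address, i.e. eight
# colon-separated groups with no "::" compression.
solicited_node_group() (
    IFS=:
    set -- $1                      # split the address into its eight groups
    [ "$#" -eq 8 ] || { echo "need 8 groups, got $#" >&2; exit 1; }
    low7=$(( 0x$7 & 0xff ))        # low 8 bits of the seventh group
    low8=$(( 0x$8 ))               # all 16 bits of the eighth group
    printf 'ff02::1:ff%02x:%x\n' "$low7" "$low8"
)

# Hypothetical helper: print (do not execute) the bridge(8) command that
# would add that group as a permanent mdb entry on one bridge port.
# Bridge, port, and VLAN ID are placeholders supplied by the caller.
mdb_add_cmd() (
    bridge=$1 port=$2 vid=$3 addr=$4
    grp=$(solicited_node_group "$addr") || exit 1
    echo "bridge mdb add dev $bridge port $port grp $grp permanent vid $vid"
)
```

For example, mdb_add_cmd br0 vnet1 100 2001:db8:0:0:0:0:0:2 prints a command for group ff02::1:ff00:2; swap "permanent" for "temp" by hand if a non-persistent entry is wanted, as the step suggests.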
Using tcpdump and watching icmp6 traffic, you'll notice the packets making it across the various bridge ports the moment you manually add the appropriate fe02::ff... multicast address to the mdb table. Beware a false sense of security: Once the ndp completes and the link addresses are in the fdb, it can 'seem like' everything is fine until the fdb times out and the required mdb entry again must be used to allow ndp to refresh the address. Setting mcast_querier doesn't help. Perhaps previous kernels turned off the multicast snooping by default and just flooded all the bridge ports with all multicast traffic so this bug was avoided. It's my hunch the reason there hasn't been more complaint about this is it takes an extra step to not autoconfigure the vm ports with fe80:: link local addresses on the host. I believe the existence of the fe80 address on the host ports engages ndp code on the host to load the mdb as if preparing for the host's side of the
[Kernel-packages] [Bug 1907262] Re: raid10: discard leads to corrupted file system
Thimo, Thanks for the update; just to clarify, for your "procedure to recover," are you saying that that procedure will always resolve the damage, or that even after that procedure, there may be corruption? -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1907262 Title: raid10: discard leads to corrupted file system Status in linux package in Ubuntu: Confirmed Status in linux source package in Bionic: Fix Committed Status in linux source package in Focal: Fix Committed Status in linux source package in Groovy: Fix Committed Bug description: Seems to be closely related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578 After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the fstrim command triggered by fstrim.timer causes a large number of mismatches between two RAID10 component devices. This bug affects several machines in our company with different HW configurations (all using ECC RAM). Both NVMe and SATA SSDs are affected. 
How to reproduce:

- Create a RAID10 LVM and filesystem on two SSDs
    mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
    pvcreate -ff -y /dev/md0
    vgcreate -f -y VolGroup /dev/md0
    lvcreate -n root -L 100G -ay -y VolGroup
    mkfs.ext4 /dev/VolGroup/root
    mount /dev/VolGroup/root /mnt
- Write some data, sync and delete it
    dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
    sync
    rm /mnt/data.raw
- Check the RAID device
    echo check >/sys/block/md0/md/sync_action
- After finishing (see /proc/mdstat), check the mismatch_cnt (should be 0):
    cat /sys/block/md0/md/mismatch_cnt
- Trigger the bug
    fstrim /mnt
- Re-Check the RAID device
    echo check >/sys/block/md0/md/sync_action
- After finishing (see /proc/mdstat), check the mismatch_cnt (probably in the range of N*1):
    cat /sys/block/md0/md/mismatch_cnt

After investigating this issue on several machines it *seems* that the first drive does the trim correctly while the second one goes wild. At least the number and severity of errors found by a USB stick live session fsck.ext4 suggests this. To perform the single drive evaluation the RAID10 was started using a single drive at once:

    mdadm --assemble /dev/md127 /dev/nvme0n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root
    vgchange -a n /dev/VolGroup
    mdadm --stop /dev/md127
    mdadm --assemble /dev/md127 /dev/nvme1n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

When starting these fscks without -n, on the first device it seems the directory structure is OK while on the second device there is only the lost+found folder left.

Side-note: Another machine using HWE kernel 5.4.0-56 (after using -53 before) seems to have a quite similar issue.

Unfortunately the risk/regression assessment in the aforementioned bug is not complete: the workaround only mitigates the issues during FS creation. This bug on the other hand is triggered by a weekly service (fstrim) causing severe file system corruption.
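The "after finishing (see /proc/mdstat)" steps in the reproducer can be automated rather than watched by hand. The following is a minimal sketch under assumptions: the function name md_mismatch_when_idle is hypothetical, and it relies on the md sysfs attribute sync_action reading "idle" once a check has completed. It only reads sysfs; starting the check (echo check > .../sync_action) is left to the caller.

```shell
# Hypothetical helper: wait until an md array's sync_action reads "idle",
# then print its mismatch_cnt.
#   $1  the array's md sysfs directory, e.g. /sys/block/md0/md
#   $2  polling interval in seconds (default 5)
md_mismatch_when_idle() (
    md_dir=$1
    interval=${2:-5}
    # Bail out early instead of polling forever on a wrong path.
    [ -r "$md_dir/sync_action" ] || { echo "no sync_action in $md_dir" >&2; exit 1; }
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$interval"
    done
    cat "$md_dir/mismatch_cnt"
)
```

With this, each "check, wait, read" pair in the reproducer becomes: echo check >/sys/block/md0/md/sync_action, followed by md_mismatch_when_idle /sys/block/md0/md. Comparing the value printed before and after fstrim /mnt gives the same signal as the manual procedure.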
[Kernel-packages] [Bug 1894780] Re: Oops and hang when starting LVM snapshots on 5.4.0-47
wgrant, you said: That :a-152 is meant to be /sys/kernel/slab/:a-152. Even a working kernel shows some trouble there: $ uname -a Linux 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux $ ls -l /sys/kernel/slab | grep a-152 lrwxrwxrwx 1 root root 0 Sep 8 03:20 dm_bufio_buffer -> :a-152 Are you saying that the symlink is "some trouble" here? Because that part isn't an error, that's the effect of slab merge (that the kernel normally treats all slabs of the same size as one big slab with multiple references, more or less). Slab merge can be disabled via "slab_nomerge" on the command line. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1894780 Title: Oops and hang when starting LVM snapshots on 5.4.0-47 Status in linux package in Ubuntu: Confirmed Status in linux source package in Focal: New Bug description: One of my bionic servers with HWE 5.4.0 hangs on boot (apparently while starting LVM snapshots) after upgrading from Linux 5.4.0-42 to 5.4.0-47, with the following trace: [ 29.126292] kobject_add_internal failed for :a-152 with -EEXIST, don't try to register things with the same name in the same directory. 
[ 29.138854] BUG: kernel NULL pointer dereference, address: 0020 [ 29.145977] #PF: supervisor read access in kernel mode [ 29.145979] #PF: error_code(0x) - not-present page [ 29.145981] PGD 0 P4D 0 [ 29.158800] Oops: [#1] SMP NOPTI [ 29.162468] CPU: 6 PID: 2532 Comm: lvm Not tainted 5.4.0-46-generic #50~18.04.1-Ubuntu [ 29.170378] Hardware name: Supermicro AS -2023US-TR4/H11DSU-iN, BIOS 1.3 07/15/2019 [ 29.178038] RIP: 0010:free_percpu+0x120/0x1f0 [ 29.183786] Code: 43 64 48 01 d0 49 39 c4 0f 83 71 ff ff ff 65 8b 05 a5 4e bc 58 48 8b 15 0e 4e 20 01 48 98 48 8b 3c c2 4c 01 e7 e8 f0 97 02 00 <48> 8b 58 20 48 8b 53 38 e9 48 ff ff ff f3 c3 48 8b 43 38 48 89 45 [ 29.202530] RSP: 0018:a2f69c3d38e8 EFLAGS: 00010046 [ 29.209204] RAX: RBX: 92202ff397c0 RCX: a880a000 [ 29.216336] RDX: cf35c0f24f2cc3c0 RSI: 43817c451b92afcb RDI: [ 29.223469] RBP: a2f69c3d3918 R08: R09: a74a5300 [ 29.230609] R10: a2f69c3d3820 R11: R12: cf35c0f24f14c3c0 [ 29.237745] R13: cf362fb2a054c3c0 R14: 0287 R15: 0008 [ 29.244878] FS: 7f93a04b0900() GS:913faed8() knlGS: [ 29.252961] CS: 0010 DS: ES: CR0: 80050033 [ 29.258707] CR2: 0020 CR3: 003fa9d9 CR4: 003406e0 [ 29.265883] Call Trace: [ 29.268346] __kmem_cache_release+0x1a/0x30 [ 29.273913] __kmem_cache_create+0x4f9/0x550 [ 29.278192] ? __kmalloc_node+0x1eb/0x320 [ 29.282205] ? kvmalloc_node+0x31/0x80 [ 29.285962] create_cache+0x120/0x1f0 [ 29.291003] kmem_cache_create_usercopy+0x17d/0x270 [ 29.295882] kmem_cache_create+0x16/0x20 [ 29.300152] dm_bufio_client_create+0x1af/0x3f0 [dm_bufio] [ 29.305644] ? snapshot_map+0x5e0/0x5e0 [dm_snapshot] [ 29.310693] persistent_read_metadata+0x1ed/0x500 [dm_snapshot] [ 29.316627] ? _cond_resched+0x19/0x40 [ 29.320384] snapshot_ctr+0x79e/0x910 [dm_snapshot] [ 29.325276] dm_table_add_target+0x18d/0x370 [ 29.329552] table_load+0x12a/0x370 [ 29.333045] ctl_ioctl+0x1e2/0x590 [ 29.336450] ? retrieve_status+0x1c0/0x1c0 [ 29.340551] dm_ctl_ioctl+0xe/0x20 [ 29.343958] do_vfs_ioctl+0xa9/0x640 [ 29.347547] ? 
ksys_semctl.constprop.19+0xf7/0x190 [ 29.352337] ksys_ioctl+0x75/0x80 [ 29.355663] __x64_sys_ioctl+0x1a/0x20 [ 29.359421] do_syscall_64+0x57/0x190 [ 29.363094] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 29.368144] RIP: 0033:0x7f939f0286d7 [ 29.371732] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48 [ 29.390478] RSP: 002b:7ffe918df168 EFLAGS: 0202 ORIG_RAX: 0010 [ 29.398045] RAX: ffda RBX: 561c107f672c RCX: 7f939f0286d7 [ 29.405175] RDX: 561c1107c610 RSI: c138fd09 RDI: 0009 [ 29.412309] RBP: 7ffe918df220 R08: 7f939f59d120 R09: 7ffe918defd0 [ 29.419442] R10: 561c1107c6c0 R11: 0202 R12: 7f939f59c4e6 [ 29.426623] R13: 7f939f59c4e6 R14: 7f939f59c4e6 R15: 7f939f59c4e6 [ 29.433778] Modules linked in: dm_snapshot dm_bufio dm_zero nls_iso8859_1 ipmi_ssif input_leds amd64_edac_mod edac_mce_amd joydev kvm_amd kvm ccp k10temp ipmi_si
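The slab merge behaviour discussed in the comment above (dm_bufio_buffer appearing as a symlink to :a-152) can be inspected directly in sysfs. The following is a small sketch, with the function name list_slab_aliases being hypothetical; it assumes that merged caches appear in /sys/kernel/slab as symlinks pointing at the shared cache directory, as the ls output in the comment shows.

```shell
# Hypothetical helper: list slab caches that are aliases (symlinks) of a
# merged cache. $1 defaults to /sys/kernel/slab.
# Output: one "alias -> target" line per symlink found.
list_slab_aliases() (
    dir=${1:-/sys/kernel/slab}
    for entry in "$dir"/*; do
        [ -L "$entry" ] || continue      # skip real (unmerged) cache directories
        printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
    done
)
```

Piping the output through grep ':a-152' would show every cache name that shares that merged slab. On a kernel booted with slab_nomerge the list would be expected to be empty or much shorter.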
[Kernel-packages] [Bug 1834322] Re: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot
** Changed in: linux (Ubuntu) Status: In Progress => Fix Released -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1834322 Title: Losing port aggregate with 802.3ad port-channel/bonding aggregation on reboot Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Released Status in linux source package in Disco: Fix Released Status in linux source package in Eoan: Fix Released Status in linux source package in Focal: Fix Released Bug description: We are losing port channel aggregation on reboot. After the reboot, /var/log/syslog contains the entries: [ 250.790758] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports [ 282.029426] bond2: An illegal loopback occurred on adapter (enp24s0f1np1) Check the configuration to verify that all adapters are connected to 802.3ad compliant switch ports Aggregator IDs of the slave interfaces are different: ubuntu@node-6:~$ cat /proc/net/bonding/bond2 Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011) Bonding Mode: IEEE 802.3ad Dynamic link aggregation Transmit Hash Policy: layer3+4 (1) MII Status: up MII Polling Interval (ms): 100 Up Delay (ms): 0 Down Delay (ms): 0 802.3ad info LACP rate: fast Min links: 0 Aggregator selection policy (ad_select): stable Slave Interface: enp24s0f1np1 MII Status: up Speed: 1 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: b0:26:28:48:9f:51 Slave queue ID: 0 Aggregator ID: 1 Actor Churn State: none Partner Churn State: none Actor Churned Count: 0 Partner Churned Count: 0 Slave Interface: enp24s0f0np0 MII Status: up Speed: 1 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: b0:26:28:48:9f:50 Slave queue ID: 0 Aggregator ID: 2 Actor Churn State: churned Partner Churn State: churned Actor Churned Count: 1 Partner Churned Count: 
1 The mismatch in "Aggregator ID" on the port is a symptom of the issue. If we do 'ip link set dev bond2 down' and 'ip link set dev bond2 up', the port with the mismatched ID appears to renegotiate with the port- channel and becomes aggregated. The other way to workaround this issue is to put bond ports down and bring up port enp24s0f0np0 first and port enp24s0f1np1 second. When I change the order of bringing the ports up (first enp24s0f1np1, and second enp24s0f0np0), the issue is still there. When the issue occurs, a port on the switch, corresponding to interface enp24s0f0np0 is in Suspended state. After applying the workaround the port is no longer in Suspended state and Aggregator IDs in /proc/net/bonding/bond2 are equal. I installed 5.0.0 kernel, the issue is still there. Operating System: Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-52-generic x86_64) ubuntu@node-6:~$ uname -a Linux node-6 4.15.0-52-generic #56-Ubuntu SMP Tue Jun 4 22:49:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux ubuntu@node-6:~$ sudo lspci -vnvn https://pastebin.ubuntu.com/p/Dy2CKDbySC/ Hardware: Dell PowerEdge R740xd BIOS version: 2.1.7 sosreport: https://drive.google.com/open?id=1-eN7cZJIeu- AQBEU7Gw8a_AJTuq0AOZO ubuntu@node-6:~$ lspci | grep Ethernet | grep 10G https://pastebin.ubuntu.com/p/sqCx79vZWM/ ubuntu@node-6:~$ lspci -n | grep 18:00 18:00.0 0200: 14e4:16d8 (rev 01) 18:00.1 0200: 14e4:16d8 (rev 01) ubuntu@node-6:~$ modinfo bnx2x https://pastebin.ubuntu.com/p/pkmzsFjK8M/ ubuntu@node-6:~$ ip -o l https://pastebin.ubuntu.com/p/QpW7TjnT2v/ ubuntu@node-6:~$ ip -o a https://pastebin.ubuntu.com/p/MczKtrnmDR/ ubuntu@node-6:~$ cat /etc/netplan/98-juju.yaml https://pastebin.ubuntu.com/p/9cZpPc7C6P/ ubuntu@node-6:~$ sudo lshw -c network https://pastebin.ubuntu.com/p/gmfgZptzDT/ --- ProblemType: Bug AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Jun 26 10:21 seq crw-rw 1 root audio 116, 33 Jun 26 10:21 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay' ApportVersion: 
2.20.9-0ubuntu7.6 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: DistroRelease: Ubuntu 18.04 IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig' Lsusb: Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub Bus 001 Device 004: ID 1604:10c0 Tascam Bus 001 Device 003: ID 1604:10c0 Tascam Bus 001 Device 002: ID 1604:10c0 Tascam Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub MachineType: Dell Inc. PowerEdge R740xd Package: linux (not installed) PciMultimedia:
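The symptom described in the report above, slaves of one 802.3ad bond reporting different Aggregator IDs, can be checked mechanically. This is a rough sketch, not something from the report: the function names are hypothetical, and the parsing assumes the "Slave Interface:" and "Aggregator ID:" labels exactly as they appear in the quoted /proc/net/bonding/bond2 output.

```shell
# Hypothetical helper: print "<slave> <aggregator-id>" for each slave in a
# bonding status file such as /proc/net/bonding/bond2. Only an
# "Aggregator ID:" line that follows a "Slave Interface:" line is counted,
# so an aggregator ID listed before the first slave is ignored.
bond_aggregator_ids() {
    awk '
        $1 == "Slave" && $2 == "Interface:" { slave = $3 }
        $1 == "Aggregator" && $2 == "ID:" && slave != "" {
            print slave, $3
            slave = ""
        }
    ' "$1"
}

# Hypothetical helper: exit 0 when the slaves report more than one
# distinct aggregator ID (the symptom in this bug), exit 1 otherwise.
bond_has_split_aggregators() {
    bond_aggregator_ids "$1" | awk '
        !($2 in seen) { seen[$2] = 1; distinct++ }
        END { exit (distinct > 1 ? 0 : 1) }
    '
}
```

Usage after a reboot might look like: bond_has_split_aggregators /proc/net/bonding/bond2 && echo "bond2: slaves are in different aggregators".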
[Kernel-packages] [Bug 1873537] [NEW] PCIe AER device recovery failed due to logic flaw
Public bug reported: SRU Justification Impact: During PCI Express Downstream Port Containment (DPC) recovery, certain types of failures do not recover due to a logic flaw in pcie_do_recovery(). The upstream git commit log explains the change: PCI/ERR: Update error status after reset_link() Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses reset_link() to recover from fatal errors. But during fatal error recovery, if the initial value of error status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then even after successful recovery (using reset_link()) pcie_do_recovery() will report the recovery result as failure. Update the status of error after reset_link(). You can reproduce this issue by triggering a SW DPC using "DPC Software Trigger" bit in "DPC Control Register". You should see recovery failed dmesg log as below: pcieport :00:16.0: DPC: containment event, status:0x1f27 source:0x pcieport :00:16.0: DPC: software trigger detected pci :04:00.0: AER: can't recover (no error_detected callback) pcieport :00:16.0: AER: device recovery failed Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") Link: https://lore.kernel.org/r/a255fcb3a3fdebcd90f84e08b555f1786eb8eba2.158584.git.sathyanarayanan.kuppusw...@linux.intel.com [bhelgaas: split pci_channel_io_frozen simplification to separate patch] Signed-off-by: Kuppuswamy Sathyanarayanan Signed-off-by: Bjorn Helgaas Acked-by: Keith Busch Cc: Ashok Raj Note that a second prerequisite patch is necessary as well. This patch, commit b5dfbeacf74865a8d62a4f70f501cdc61510f8e0 Author: Kuppuswamy Sathyanarayanan Date: Fri Mar 27 17:33:24 2020 -0500 PCI/ERR: Combine pci_channel_io_frozen cases is a code readability change, and makes no functional changes. Testcase: On a system with DPC enabled, setpci may be used to set the DPC Software Trigger bit (bit 6, value 0x40) in the DPC Control register of a suitable PCIe device (a PCIe bridge, for example). 
On a system lacking the fix, the output will be as shown above (i.e., culminating in the "device recovery failed" message). With the fix applied, the device successfully recovers, resulting in a message of the form pcieport :d9:01.0: AER: Device recovery successful Regression Potential: The risk of regression is low, as (a) the path in question currently does not work, and (b) the changes are minimal, comprising only a housekeeping change and the logically correct updating of a status variable that did not previously occur. ** Affects: linux (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1873537 Title: PCIe AER device recovery failed due to logic flaw Status in linux package in Ubuntu: New
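The Testcase in this report can be sketched as below. This is an illustration under assumptions, not the reporter's exact procedure: the function names are hypothetical; the DPC Control register is taken to sit at offset 6 of the DPC extended capability (per the PCIe specification, not stated in the report); and the ECAP_DPC register name is assumed to be known to the installed setpci (setpci --dumpregs lists the names a given build supports). The first function only prints the command, since running it triggers a containment event on real hardware.

```shell
# Hypothetical helper: print (do not execute) a setpci command that sets the
# DPC Software Trigger bit on the given device, e.g. "00:16.0".
# Assumptions: "ECAP_DPC" is a register name known to this setpci build, and
# the DPC Control register is at offset 6 within that capability.
dpc_trigger_cmd() (
    dev=$1
    bit=6                                    # DPC Software Trigger, per the report
    mask=$(printf '%x' $(( 1 << bit )))      # hex 40, i.e. value 0x40
    echo "setpci -s $dev ECAP_DPC+6.w=$mask:$mask"
)

# Hypothetical helper: classify kernel log text read from stdin using the
# two messages quoted in the report. Prints "successful", "failed", or
# "none"; if both messages occur, the last one seen wins.
aer_recovery_result() {
    awk '
        /AER: Device recovery successful/ { r = "successful" }
        /AER: device recovery failed/     { r = "failed" }
        END { print (r == "" ? "none" : r) }
    '
}
```

After triggering, something like dmesg | aer_recovery_result would report "failed" on a kernel lacking the fix and "successful" with it applied, matching the messages quoted in the Testcase.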
[Kernel-packages] [Bug 1869423] [NEW] Restore kernel control of PCIe DPC via option
Public bug reported: SRU Justification: Impact: Since upstream commit eed85ff4c0da7 (4.16), control of PCIe DPC (Downstream Port Containment) is coupled with control of AER (Advanced Error Reporting), eliminating the option for the kernel to separately manage DPC (which was previously the default behavior). Fix: The upstream commit log explains the change: commit 35a0b2378c199d4f26e458b2ca38ea56aaf2d9b8 Author: Olof Johansson Date: Wed Oct 23 12:22:05 2019 -0700 PCI/DPC: Add "pcie_ports=dpc-native" to allow DPC without AER control Prior to eed85ff4c0da7 ("PCI/DPC: Enable DPC only if AER is available"), Linux handled DPC events regardless of whether firmware had granted it ownership of AER or DPC, e.g., via _OSC. PCIe r5.0, sec 6.2.10, recommends that the OS link control of DPC to control of AER, so after eed85ff4c0da7, Linux handles DPC events only if it has control of AER. On platforms that do not grant OS control of AER via _OSC, Linux DPC handling worked before eed85ff4c0da7 but not after. To make Linux DPC handling work on those platforms the same way they did before, add a "pcie_ports=dpc-native" kernel parameter that makes Linux handle DPC events regardless of whether it has control of AER. [bhelgaas: commit log, move pcie_ports_dpc_native to drivers/pci/] Link: https://lore.kernel.org/r/20191023192205.97024-1-o...@lixom.net Signed-off-by: Olof Johansson Signed-off-by: Bjorn Helgaas Testcase: Control of DPC can be determined from kernel boot messages when pciehp probes a capable slot; when the kernel controls DPC, messages of the format: pcieport :2d:00.0: pciehp: Slot #0 pcieport :2d:00.0: DPC: error containment capabilities: will appear; if the kernel does not control DPC, the DPC line will not be present (only the "pciehp: Slot" message). Additionally, devices bound to the kernel DPC PCIe port service driver will be found in the /sys/bus/pci_express/drivers/dpc/ sysfs directory; this will be empty of devices if the kernel does not control DPC. 
Regression Potential: The risk of regression is low as (a) by default, the patch has no effect (the default setting is to not enable the option), and (b) when enabled, the patch restores functionality that previously worked, and was, in fact, the default behavior.

** Affects: linux (Ubuntu) Importance: Undecided Status: New

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1869423

Title: Restore kernel control of PCIe DPC via option
Status in linux package in Ubuntu: New
[Kernel-packages] [Bug 1805693] [NEW] User reports a hang on 18.04 LTS(4.15.18) under a heavy I/O load
Public bug reported:

User reports a hang under heavy I/O:

The I/O hang on our cloud is caused by I/O hanging in wbt_wait() in the block layer's writeback throttling code (blk-wbt). The fix is commit 2887e41b910bb14fd847cf01ab7a5993db989d88, which addresses queue lock contention and a thundering herd issue in wbt_wait(). We can recreate the problem easily by running concurrent sequential-write I/O from multiple VMs, and we can provide an fio workload for reproduction as needed.

** Affects: linux (Ubuntu) Importance: Undecided Status: New

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1805693

Title: User reports a hang on 18.04 LTS(4.15.18) under a heavy I/O load
Status in linux package in Ubuntu: New

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1805693/+subscriptions

-- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1800254] Re: packet socket panic in Trusty 3.13.0-157 and later
** Tags removed: verification-needed-trusty

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1800254

Title: packet socket panic in Trusty 3.13.0-157 and later
Status in linux package in Ubuntu: Invalid
Status in linux source package in Trusty: Fix Committed

Bug description:

SRU Justification: Due to changes added as part of c108ac876c02 ("packet: hold bind lock when rebinding to fanout hook"), it is possible for fanout_add to add a packet_type handler via dev_add_pack and then kfree the memory backing the packet_type. This corrupts the ptype_all list, causing the system to panic when network packet processing next traverses ptype_all. The erroneous path is taken when a PACKET_FANOUT setsockopt is performed on a packet socket that is bound to an interface that is administratively down. This is not due to any flaw in c108ac876c02, but rather to the fact that the packet socket code base differs subtly in 3.13 as compared to 4.4. This affects only the Trusty 3.13 kernel series, starting with 3.13.0-157.

Fix: The remedy for this is to backport additional changes in the management of the dev_add_pack calls from 4.4. This moves the dev_add_pack and dev_remove_pack calls from fanout_add and fanout_release into __fanout_link and __fanout_unlink.

Testcase: The issue can be reproduced reliably by (a) creating an AF_PACKET socket, (b) binding it to an interface that is administratively down, and then (c) attempting to set the PACKET_FANOUT sockopt. The setsockopt call will fail, but will corrupt ptype_all in the kernel. Subsequent network traffic will induce a panic when evaluating the corrupted ptype_all entry. A test program is attached.
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1800254/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1800254] Re: packet socket panic in Trusty 3.13.0-157 and later
Reproducer for ptype_all corruption. Pass ifindex of an administratively down interface on the command line.

** Attachment added: "packet-fry.c" https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1800254/+attachment/5206100/+files/packet-fry.c

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1800254

Title: packet socket panic in Trusty 3.13.0-157 and later
Status in linux package in Ubuntu: New
[Kernel-packages] [Bug 1800254] [NEW] packet socket panic in Trusty 3.13.0-157 and later
Public bug reported:

SRU Justification: Due to changes added as part of c108ac876c02 ("packet: hold bind lock when rebinding to fanout hook"), it is possible for fanout_add to add a packet_type handler via dev_add_pack and then kfree the memory backing the packet_type. This corrupts the ptype_all list, causing the system to panic when network packet processing next traverses ptype_all. The erroneous path is taken when a PACKET_FANOUT setsockopt is performed on a packet socket that is bound to an interface that is administratively down. This is not due to any flaw in c108ac876c02, but rather to the fact that the packet socket code base differs subtly in 3.13 as compared to 4.4. This affects only the Trusty 3.13 kernel series, starting with 3.13.0-157.

Fix: The remedy for this is to backport additional changes in the management of the dev_add_pack calls from 4.4. This moves the dev_add_pack and dev_remove_pack calls from fanout_add and fanout_release into __fanout_link and __fanout_unlink.

Testcase: The issue can be reproduced reliably by (a) creating an AF_PACKET socket, (b) binding it to an interface that is administratively down, and then (c) attempting to set the PACKET_FANOUT sockopt. The setsockopt call will fail, but will corrupt ptype_all in the kernel. Subsequent network traffic will induce a panic when evaluating the corrupted ptype_all entry. A test program is attached.

** Affects: linux (Ubuntu) Importance: Undecided Status: New

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1800254

Title: packet socket panic in Trusty 3.13.0-157 and later
Status in linux package in Ubuntu: New
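For readers without access to the attachment, steps (a) to (c) of the testcase can be sketched in Python as below. This is not the attached packet-fry.c, only an illustration under assumptions: the function names are hypothetical, the numeric constants are copied from the Linux UAPI headers rather than taken from Python's socket module, and binding with ETH_P_ALL and using the hash fanout mode are guesses at what a reproducer would use. It needs CAP_NET_RAW, and on an affected Trusty kernel it is expected to leave ptype_all corrupted, so it should only ever be run on a disposable test machine.

```python
import socket

# Values from <linux/socket.h> and <linux/if_packet.h>, defined locally
# instead of relying on the socket module to export them.
SOL_PACKET = 263
PACKET_FANOUT = 18
PACKET_FANOUT_HASH = 0
ETH_P_ALL = 0x0003


def fanout_arg(group_id, fanout_type):
    """Pack a fanout group id (low 16 bits) and fanout type (high 16 bits)."""
    return (group_id & 0xFFFF) | ((fanout_type & 0xFFFF) << 16)


def try_fanout_on_interface(ifindex, group_id=1):
    """Perform steps (a)-(c); return the OSError from setsockopt, or None."""
    ifname = socket.if_indextoname(ifindex)
    # (a) create an AF_PACKET socket
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW,
                       socket.htons(ETH_P_ALL)) as sock:
        # (b) bind it to the (administratively down) interface
        sock.bind((ifname, ETH_P_ALL))
        try:
            # (c) attempt to join a fanout group
            sock.setsockopt(SOL_PACKET, PACKET_FANOUT,
                            fanout_arg(group_id, PACKET_FANOUT_HASH))
        except OSError as err:
            return err
    return None
```

Calling try_fanout_on_interface() with the ifindex of a downed interface should return an error on any kernel, since the report states that the setsockopt call fails; only the affected 3.13 kernels are expected to panic on subsequent network traffic.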
[Kernel-packages] [Bug 1771480] Re: WARNING: CPU: 28 PID: 34085 at /build/linux-90Gc2C/linux-3.13.0/net/core/dev.c:1433 dev_disable_lro+0x87/0x90()
The dev_disable_lro warning is happening due to some logic issues in the features code. The LRO on the VLAN (bond0.2004, e.g.) that's being warned about does end up being disabled by a NETDEV_FEAT_CHANGE callback when the underlying bond0's features are updated, so the warning is spurious.

Tracing the dev_disable_lro -> netdev_update_features for the bond0.2004 VLAN, I see:

name="bond0" feat=219db89 hw_feat=20219cbe9 want_feat=20219cbe9 vlan_feat=198069

NETIF_F_LRO = 0x8000

dev_disable_lro
    wanted_features &= ~NETIF_F_LRO
    bond0.2004 wanted_features = 0x200194869    # no LRO

__netdev_update_features
    features = netdev_get_wanted_features
        return (dev->features & ~dev->hw_features) | dev->wanted_features;
        (0x19d809 & ~0x23839487b) | 0x200194869    # features has LRO; hw_features and wanted_features do not
        0x9000 | 0x200194869
        $2 = 0x20019d869    # has LRO

    vlan_dev_fix_features(dev, 0x20019d869)    # has LRO
        struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
        netdev_features_t old_features = features;
        features &= real_dev->vlan_features;    # 0x198069 has LRO
        features |= NETIF_F_RXCSUM;    # 0x100198069 has LRO
        features &= real_dev->features;    # 0x198009 has LRO
        features |= old_features & NETIF_F_SOFT_FEATURES;    # save GSO / GRO
        features |= NETIF_F_LLTX;
        return features;    # will have LRO

So, basically, LRO is set in the underlying bond0's features, so it ends up being kept in the VLAN device's features even though it wasn't in wanted_features. Later, dev_disable_lro will call dev_disable_lro on all the lower devices (the bond0 in this case), and the update of features for the bond0 will issue a NETDEV_FEAT_CHANGE callback to the bond0.2004 VLAN, which will then set the features correctly.

The Ubuntu 3.13 __netdev_update_features (called by dev_disable_lro via netdev_update_features) lacks additional logic found in later kernels to sync the features to lower devices.
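The mask arithmetic in the trace above can be replayed outside the kernel. The short sketch below only re-does the first two steps with the values printed in the trace; the variable and function names are illustrative and are not kernel symbols.

```python
NETIF_F_LRO = 0x8000

# Values for bond0.2004 as printed in the trace.
features = 0x19d809          # current dev->features, LRO still set
hw_features = 0x23839487b    # dev->hw_features, no LRO bit
wanted = 0x200194869         # dev->wanted_features after LRO was cleared
real_dev_vlan_features = 0x198069  # bond0's vlan_features, LRO set


def get_wanted_features(features, hw_features, wanted):
    """Same expression as the return statement quoted in the trace."""
    return (features & ~hw_features) | wanted


assert wanted & NETIF_F_LRO == 0       # LRO really was removed from wanted

step1 = get_wanted_features(features, hw_features, wanted)
print(hex(step1))                      # 0x20019d869
print(bool(step1 & NETIF_F_LRO))       # True: LRO survives from dev->features

# First line of vlan_dev_fix_features: mask with the real device's vlan_features.
step2 = step1 & real_dev_vlan_features
print(hex(step2))                      # 0x198069
print(bool(step2 & NETIF_F_LRO))       # True: LRO is kept because bond0 still has it
```

The point of the exercise is that LRO is not in hw_features for the VLAN device, so the bit in dev->features passes straight through the (features & ~hw_features) term regardless of what wanted_features says.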
The sync logic in later kernels presumably triggers the NETDEV_FEAT_CHANGE within the call to __netdev_update_features so that the bond0.2004 VLAN is updated before we return back to dev_disable_lro (but I haven't verified this).

I suspect the fix to eliminate the warning is to apply the "sync_lower:" block from a later kernel __netdev_update_features to 3.13, along with the netdev_sync_lower_features function it uses.

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1771480

Title: WARNING: CPU: 28 PID: 34085 at /build/linux-90Gc2C/linux-3.13.0/net/core/dev.c:1433 dev_disable_lro+0x87/0x90()
Status in linux package in Ubuntu: Incomplete

Bug description: I have multiple instances of this dev_disable_lro error in kern.log. Also seeing this:

systemd-udevd[1452]: timeout: killing 'bridge-network-interface' [2765]
<4>May 1 22:56:42 xxx kernel: [ 404.520990] bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
<4>May 1 22:56:44 xxx kernel: [ 406.926429] bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
<4>May 1 22:56:45 xxx kernel: [ 407.569020] ------------[ cut here ]------------
<4>May 1 22:56:45 xxx kernel: [ 407.569029] WARNING: CPU: 28 PID: 34085 at /build/linux-90Gc2C/linux-3.13.0/net/core/dev.c:1433 dev_disable_lro+0x87/0x90()
<4>May 1 22:56:45 xxx kernel: [ 407.569032] netdevice: bond0.2004
<4>May 1 22:56:45 xxx kernel: [ 407.569032] failed to disable LRO!
<4>May 1 22:56:45 xxx kernel: [ 407.569035] Modules linked in: 8021q garp mrp bridge stp llc bonding iptable_filter ip_tables x_tables nf_conntrack_proto_gre nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ipmi_devintf mxm_wmi dcdbas x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me mei ipmi_si shpchp wmi acpi_power_meter mac_hid xfs libcrc32c raid10 usb_storage raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 igb ixgbe i2c_algo_bit multipath ahci dca ptp libahci pps_core linear megaraid_sas mdio dm_multipath scsi_dh
<4>May 1 22:56:45 xxx kernel: [ 407.569112] CPU: 28 PID: 34085 Comm: brctl Not tainted 3.13.0-142-generic #191-Ubuntu
<4>May 1 22:56:45 xxx kernel: [ 407.569115] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.7.1 001/22/2018
<4>May 1 22:56:45 xxx kernel: [ 407.569118] 881fcc753c70 8172e7fc 881fcc753cb8
<4>May 1 22:56:45 xxx kernel: [ 407.569129] 0009 881fcc753ca8 8106afad 883fcc6f8000
<4>May 1 22:56:45 xxx kernel: [ 407.569139] 883fcc696880 883fcc6f8000 881fce82dd40
<4>May 1 22:56:45 xxx
[Kernel-packages] [Bug 1765241] Re: virtio_scsi race can corrupt memory, panic kernel
** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1765241

Title: virtio_scsi race can corrupt memory, panic kernel
Status in linux package in Ubuntu: Confirmed
Status in linux source package in Xenial: Fix Committed

Bug description:

There's a race in the virtio_scsi driver (for all kernels). That race is inadvertently avoided on most kernels due to a synchronize_rcu call coincidentally placed in one of the racing code paths. By happenstance, the set of patches backported to the Ubuntu 4.4 kernel ended up without a synchronize_rcu in the relevant place. The issue first manifests with

commit be2a20802abbde04ae09846406d7b670731d97d2
Author: Jan Kara
Date: Wed Feb 8 08:05:56 2017 +0100

    block: Move bdi_unregister() to del_gendisk()

    BugLink: http://bugs.launchpad.net/bugs/1659111

The race can cause a kernel panic due to corruption of a freelist pointer in a slab cache. The panics occur in arbitrary places as the failure occurs at an allocation after the corruption of the pointer. However, the most common failure observed has been within virtio_scsi itself during probe processing, e.g.:

[3.111628] [] kfree_const+0x22/0x30
[3.112340] [] kobject_release+0x94/0x190
[3.113126] [] kobject_put+0x27/0x50
[3.113838] [] put_device+0x17/0x20
[3.114568] [] __scsi_remove_device+0x92/0xe0
[3.115401] [] scsi_probe_and_add_lun+0x95b/0xe80
[3.116287] [] ? kmem_cache_alloc_trace+0x183/0x1f0
[3.117227] [] ? __pm_runtime_resume+0x5b/0x80
[3.118048] [] __scsi_scan_target+0x10a/0x690
[3.118879] [] scsi_scan_channel+0x7e/0xa0
[3.119653] [] scsi_scan_host_selected+0xf3/0x160
[3.120506] [] do_scsi_scan_host+0x8d/0x90
[3.121295] [] do_scan_async+0x1c/0x190
[3.122048] [] async_run_entry_fn+0x48/0x150
[3.122846] [] process_one_work+0x165/0x480
[3.123732] [] worker_thread+0x4b/0x4d0
[3.124508] [] ? process_one_work+0x480/0x480

Details on the race:

CPU A:

    virtscsi_probe
    [...]
    __scsi_scan_target
    scsi_probe_and_add_lun [on return calls __scsi_remove_device, below]
    scsi_probe_lun
    [...]
    blk_execute_rq

blk_execute_rq waits for the completion event, and then on wakeup returns up to scsi_probe_and_add_lun, which calls __scsi_remove_device. In order for the race to occur, the wakeup must occur on a CPU other than CPU B.

After being woken up by the completion of the request, the call returns up the stack to scsi_probe_and_add_lun, which calls __scsi_remove_device:

    __scsi_remove_device
    blk_cleanup_queue [ no longer calls bdi_unregister ]
    scsi_target_reap(scsi_target(sdev))
    scsi_target_reap_ref_put
    kref_put
    kref_sub
    scsi_target_reap_ref_release
    scsi_target_destroy
    ->target_destroy() = virtscsi_target_destroy
    kfree(tgt) <=== FREE TGT

Note that the removal of the call to bdi_unregister in commit

    xenial be2a20802abbde block: Move bdi_unregister() to del_gendisk()

and annotated above is the change that gates whether the issue manifests or not. The other code change from be2a20802abbde has no effect on the race.

CPU B:

    vring_interrupt
    virtscsi_complete_cmd
    scsi_mq_done (via ->scsi_done())
    scsi_mq_done
    blk_mq_complete_request
    __blk_mq_complete_request
    [...]
    blk_end_sync_rq
    complete [ wake up the task from CPU A ]

After waking the CPU A task, execution returns up the stack, and then calls atomic_dec(&tgt->reqs) in virtscsi_complete_cmd immediately after returning from the call to ->scsi_done.

If the activity on CPU A after it is woken up (starting at __scsi_remove_device) finishes before CPU B can call atomic_dec() in virtscsi_complete_cmd, then the atomic_dec() will modify a free list pointer in the freed slab object that contained tgt. This causes the system to panic on a subsequent allocation from the per-cpu slab cache.

The call path on CPU B is significantly shorter than that on CPU A after wakeup, so it is likely that an external event delays CPU B. This could be either an interrupt within the VM or scheduling delays for the virtual cpu process on the hypervisor. Whatever the delay is, it is not the root cause but merely the triggering event.

The virtscsi race window described above exists in all kernels that have been checked (4.4 upstream LTS, Ubuntu 4.13 and 4.15, and current mainline as of this writing). However, none of those kernels exhibit the panic in testing, only the Ubuntu 4.4 kernel after commit be2a20802abbde.

The reason none of those kernels panic is they all have one
[Kernel-packages] [Bug 1753662] Re: [i40e] LACP bonding start up race conditions
We've seen a similar-sounding issue in the past, but couldn't get it tracked down to the root cause. Is it possible to enable some instrumentation in /etc/network/interfaces and obtain some data on a failing occurrence?

What we've used in the past is adding something like

    pre-up echo 'file bond_3ad.c +p' > /sys/kernel/debug/dynamic_debug/control
    pre-up echo 'file bond_main.c +p' > /sys/kernel/debug/dynamic_debug/control

to the /e/n/i section for the bond itself, and

    post-up tcpdump -U -p -w /tmp/eth4.td -i eth4 ether proto 0x8809 &

to the sections for each slave in the bond (adjusting the "eth4" above to the actual interface name).

The bond debug will appear in the kernel log, and the tcpdump data will have to be copied from the output file specified on the tcpdump command line (and the tcpdump process terminated if need be).

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1753662

Title: [i40e] LACP bonding start up race conditions
Status in linux package in Ubuntu: Triaged
Status in linux source package in Xenial: Triaged

Bug description: When provisioning Ubuntu servers with MAAS at once, some bonding pairs will have unexpected LACP status such as "Expired". It randomly happens at each provisioning with the default xenial kernel (4.4), but is not reproducible with the HWE kernel (4.13). I'm using Intel X710 cards (Dell-branded). Using the HWE kernel works as a short-term workaround, but it's not ideal since 4.13 is not covered by the Canonical Livepatch service.

How to reproduce:
1. configure LACP bonding with MAAS
2. provision machines
3. check the bonding status in /proc/net/bonding/bond*

Frequency of occurrence: About 5 bond pairs in 22 pairs at each provisioning.
[reproducible combination] $ uname -a Linux comp006 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux $ sudo ethtool -i eno1 driver: i40e version: 1.4.25-k firmware-version: 6.00 0x800034e6 18.3.6 expansion-rom-version: bus-info: :01:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes [non-reproducible combination] $ uname -a Linux comp006 4.13.0-36-generic #40~16.04.1-Ubuntu SMP Fri Feb 16 23:25:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux $ sudo ethtool -i eno1 driver: i40e version: 2.1.14-k firmware-version: 6.00 0x800034e6 18.3.6 expansion-rom-version: bus-info: :01:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes ProblemType: Bug DistroRelease: Ubuntu 16.04 Package: linux-image-4.4.0-116-generic 4.4.0-116.140 ProcVersionSignature: Ubuntu 4.4.0-116.140-generic 4.4.98 Uname: Linux 4.4.0-116-generic x86_64 AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Mar 6 06:37 seq crw-rw 1 root audio 116, 33 Mar 6 06:37 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.1-0ubuntu2.15 Architecture: amd64 ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Date: Tue Mar 6 06:46:32 2018 IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' Lsusb: Bus 002 Device 002: ID 8087:8002 Intel Corp. Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 001 Device 003: ID 413c:a001 Dell Computer Corp. Hub Bus 001 Device 002: ID 8087:800a Intel Corp. Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub MachineType: Dell Inc. 
PowerEdge R730 PciMultimedia: ProcEnviron: TERM=screen PATH=(custom, no user) LANG=en_US.UTF-8 SHELL=/bin/bash ProcFB: 0 EFI VGA ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-116-generic.efi.signed root=UUID=0528f88e-cf1a-43e2-813a-e7261b88d460 ro console=tty0 console=ttyS0,115200n8 RelatedPackageVersions: linux-restricted-modules-4.4.0-116-generic N/A linux-backports-modules-4.4.0-116-generic N/A linux-firmware 1.157.17 RfKill: Error: [Errno 2] No such file or directory: 'rfkill' SourcePackage: linux UpgradeStatus: No upgrade log present (probably fresh install) dmi.bios.date: 08/16/2017 dmi.bios.vendor: Dell Inc. dmi.bios.version: 2.5.5 dmi.board.name: 072T6D dmi.board.vendor: Dell Inc. dmi.board.version: A08 dmi.chassis.asset.tag: 0018880 dmi.chassis.type: 23 dmi.chassis.vendor: Dell Inc. dmi.modalias: dmi:bvnDellInc.:bvr2.5.5:bd08/16/2017:svnDellInc.:pnPowerEdgeR730:pvr:rvnDellInc.:rn072T6D:rvrA08:cvnDellInc.:ct23:cvr: dmi.product.name: PowerEdge R730 dmi.sys.vendor: Dell Inc. To manage notifications about this bug go to:
[Kernel-packages] [Bug 1753662] Re: [i40e] LACP bonding start up race conditions
I would suggest testing

commit de77ecd4ef02ca783f7762e04e92b3d0964be66b
Author: Mahesh Bandewar
Date: Mon Mar 27 11:37:33 2017 -0700

    bonding: improve link-status update in mii-monitoring

and

commit d94708a553022bf012fa95af10532a134eeb5a52
Author: WANG Cong
Date: Tue Jul 25 09:44:25 2017 -0700

    bonding: commit link status change after propose

backported to 4.4.0-120 (in the order above; the second is a fix to the first). The first patch initially appears in 4.12-rc1, the second in 4.13.

-- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1753662

Title: [i40e] LACP bonding start up race conditions
Status in linux package in Ubuntu: Triaged
Status in linux source package in Xenial: Triaged

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1753662/+subscriptions
[Kernel-packages] [Bug 1765241] Re: virtio_scsi race can corrupt memory, panic kernel
SRU Justification:

Impact: This issue can cause system panics of systems using the virtio_scsi driver with the affected Ubuntu kernels. The issue manifests irregularly, as it is timing dependent.

Fix: The issue is resolved by adding synchronization between the two code paths that race with one another. The most straightforward fix is to have the code wait for any outstanding requests to drain prior to freeing the target structure, e.g.,

--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -762,6 +762,10 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
 	struct virtio_scsi_target_state *tgt = starget->hostdata;
+
+	/* we can race with concurrent virtscsi_complete_cmd */
+	while (atomic_read(&tgt->reqs))
+		cpu_relax();
 	kfree(tgt);
 }

An alternative fix that was considered is to use a synchronize_rcu_expedited call, as that is the functionality that blocks the race in unaffected kernels. However, some call paths into virtscsi_target_destroy may hold mutexes that are not held by the upstream RCU sync calls (which enter via the block layer). For this reason the more confined fix described above was chosen.

Testcase: This reproduces on Google Cloud, using the current, unmodified ubuntu-1404-lts image (with the Ubuntu 4.4 kernel). Using the two attached scripts, run e.g.

    ./create_shutdown_instance.sh 100

to create 100 instances. If an instance runs its startup script successfully, it'll shut itself down right away, so instances that are still running after a few minutes likely demonstrate this problem. The issue reproduces easily with n1-standard-4.
create_shutdown_instance.sh:

#!/bin/bash -e
# Create $1 instances in parallel; each runs immediate_shutdown.sh at boot.
ZONE=us-central1-a
for i in $(seq -w "$1"); do
  gcloud compute instances create shutdown-experiment-$i \
    --zone="${ZONE}" \
    --image-family=ubuntu-1404-lts \
    --image-project=ubuntu-os-cloud \
    --machine-type=n1-standard-4 \
    --scopes compute-rw \
    --metadata-from-file startup-script=immediate_shutdown.sh &
done
wait

immediate_shutdown.sh:

#!/bin/bash -x
# Look up a value from the instance metadata server.
function get_metadata_value() {
  curl -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/$1"
}
readonly ZONE="$(get_metadata_value zone | awk -F'/' '{print $NF}')"
# Delete this instance as soon as the startup script runs.
gcloud compute instances delete "$(hostname)" --zone="${ZONE}" --quiet

--
https://bugs.launchpad.net/bugs/1765241

Title: virtio_scsi race can corrupt memory, panic kernel

Status in linux package in Ubuntu: Confirmed

Bug description:

There's a race in the virtio_scsi driver (for all kernels). That race is inadvertently avoided on most kernels due to a synchronize_rcu call coincidentally placed in one of the racing code paths. By happenstance, the set of patches backported to the Ubuntu 4.4 kernel ended up without a synchronize_rcu in the relevant place. The issue first manifests with

commit be2a20802abbde04ae09846406d7b670731d97d2
Author: Jan Kara <j...@suse.cz>
Date: Wed Feb 8 08:05:56 2017 +0100

    block: Move bdi_unregister() to del_gendisk()

    BugLink: http://bugs.launchpad.net/bugs/1659111

The race can cause a kernel panic due to corruption of a freelist pointer in a slab cache. The panics occur in arbitrary places, as the failure occurs at an allocation after the corruption of the pointer.
However, the most common failure observed has been within virtio_scsi itself during probe processing, e.g.:

[3.111628] [] kfree_const+0x22/0x30
[3.112340] [] kobject_release+0x94/0x190
[3.113126] [] kobject_put+0x27/0x50
[3.113838] [] put_device+0x17/0x20
[3.114568] [] __scsi_remove_device+0x92/0xe0
[3.115401] [] scsi_probe_and_add_lun+0x95b/0xe80
[3.116287] [] ? kmem_cache_alloc_trace+0x183/0x1f0
[3.117227] [] ? __pm_runtime_resume+0x5b/0x80
[3.118048] [] __scsi_scan_target+0x10a/0x690
[3.118879] [] scsi_scan_channel+0x7e/0xa0
[3.119653] [] scsi_scan_host_selected+0xf3/0x160
[3.120506] [] do_scsi_scan_host+0x8d/0x90
[3.121295] [] do_scan_async+0x1c/0x190
[3.122048] [] async_run_entry_fn+0x48/0x150
[3.122846] [] process_one_work+0x165/0x480
[3.123732] [] worker_thread+0x4b/0x4d0
[3.124508] [] ? process_one_work+0x480/0x480

Details on the race:

CPU A:

  virtscsi_probe
  [...]
  __scsi_scan_target
  scsi_probe_and_add_lun [on return calls __scsi_remove_device, below]
  scsi_probe_lun
  [...]
  blk_execute_rq

blk_execute_rq waits for the completion event, and then on wakeup returns up to scsi_probe_and_add_lun, which calls __scsi_remove_device. In order for the race to occur, the wakeup must occur on a CPU other than CPU B.

After being woken up by the completion of the
[Kernel-packages] [Bug 1765241] Re: virtio_scsi race can corrupt memory, panic kernel
SRU Justification:

Impact: This issue can cause kernel panics on systems using the virtio_scsi driver with the affected Ubuntu kernels. The issue manifests irregularly, as it is timing dependent.

Fix: The issue is resolved by adding synchronization between the two code paths that race with one another. The fix with the lowest regression risk is to use a synchronize_rcu_expedited call, as that is the functionality that blocks the race in unaffected kernels.

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 03a2aad..c122e68 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -762,6 +762,9 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
 	struct virtio_scsi_target_state *tgt = starget->hostdata;
+
+	/* we can race with concurrent virtscsi_complete_cmd */
+	synchronize_rcu_expedited();
 	kfree(tgt);
 }

It is also possible to have the code wait for any outstanding requests to drain prior to freeing the target structure, e.g.,

--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -762,6 +762,10 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
 	struct virtio_scsi_target_state *tgt = starget->hostdata;
+
+	/* we can race with concurrent virtscsi_complete_cmd */
+	while (atomic_read(&tgt->reqs))
+		cpu_relax();
 	kfree(tgt);
 }

This completes a bit faster in the usual case, but SCSI target destroy is not a fast path, and the loop above runs the risk of never terminating.

Testcase: This reproduces on Google Cloud, using the current, unmodified ubuntu-1404-lts image (with the Ubuntu 4.4 kernel). Using the two attached scripts, run e.g.

  ./create_shutdown_instance.sh 100

to create 100 instances. If an instance runs its startup script successfully, it shuts itself down right away, so instances that are still running after a few minutes likely demonstrate this problem.
The issue reproduces easily with n1-standard-4.
[Kernel-packages] [Bug 1765241] [NEW] virtio_scsi race can corrupt memory, panic kernel
s the race window on the Ubuntu 4.4 kernel.

Resolving the issue can be accomplished by adding an RCU sync to virtscsi_target_destroy prior to freeing the target. It is also possible to use a loop of the format:

+	while (atomic_read(&tgt->reqs))
+		cpu_relax();

but this is higher risk, as the loop never terminates if some other failure keeps the request count from reaching zero.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Jay Vosburgh (jvosburgh)
     Status: Confirmed

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Jay Vosburgh (jvosburgh)

** Changed in: linux (Ubuntu)
     Status: New => Confirmed

--
https://bugs.launchpad.net/bugs/1765241

Title: virtio_scsi race can corrupt memory, panic kernel

Status in linux package in Ubuntu: Confirmed

Bug description:

There's a race in the virtio_scsi driver (for all kernels). That race is inadvertently avoided on most kernels due to a synchronize_rcu call coincidentally placed in one of the racing code paths. By happenstance, the set of patches backported to the Ubuntu 4.4 kernel ended up without a synchronize_rcu in the relevant place. The issue first manifests with

commit be2a20802abbde04ae09846406d7b670731d97d2
Author: Jan Kara <j...@suse.cz>
Date: Wed Feb 8 08:05:56 2017 +0100

    block: Move bdi_unregister() to del_gendisk()

    BugLink: http://bugs.launchpad.net/bugs/1659111

The race can cause a kernel panic due to corruption of a freelist pointer in a slab cache. The panics occur in arbitrary places, as the failure occurs at an allocation after the corruption of the pointer. However, the most common failure observed has been within virtio_scsi itself during probe processing, e.g.:

[3.111628] [] kfree_const+0x22/0x30
[3.112340] [] kobject_release+0x94/0x190
[3.113126] [] kobject_put+0x27/0x50
[3.113838] [] put_device+0x17/0x20
[3.114568] [] __scsi_remove_device+0x92/0xe0
[3.115401] [] scsi_probe_and_add_lun+0x95b/0xe80
[3.116287] [] ? kmem_cache_alloc_trace+0x183/0x1f0
[3.117227] [] ? __pm_runtime_resume+0x5b/0x80
[3.118048] [] __scsi_scan_target+0x10a/0x690
[3.118879] [] scsi_scan_channel+0x7e/0xa0
[3.119653] [] scsi_scan_host_selected+0xf3/0x160
[3.120506] [] do_scsi_scan_host+0x8d/0x90
[3.121295] [] do_scan_async+0x1c/0x190
[3.122048] [] async_run_entry_fn+0x48/0x150
[3.122846] [] process_one_work+0x165/0x480
[3.123732] [] worker_thread+0x4b/0x4d0
[3.124508] [] ? process_one_work+0x480/0x480

Details on the race:

CPU A:

  virtscsi_probe
  [...]
  __scsi_scan_target
  scsi_probe_and_add_lun [on return calls __scsi_remove_device, below]
  scsi_probe_lun
  [...]
  blk_execute_rq

blk_execute_rq waits for the completion event, and then on wakeup returns up to scsi_probe_and_add_lun, which calls __scsi_remove_device. In order for the race to occur, the wakeup must occur on a CPU other than CPU B.

After being woken up by the completion of the request, the call returns up the stack to scsi_probe_and_add_lun, which calls __scsi_remove_device:

  __scsi_remove_device
  blk_cleanup_queue [ no longer calls bdi_unregister ]
  scsi_target_reap(scsi_target(sdev))
  scsi_target_reap_ref_put
  kref_put
  kref_sub
  scsi_target_reap_ref_release
  scsi_target_destroy
  ->target_destroy() = virtscsi_target_destroy
  kfree(tgt) <=== FREE TGT

Note that the removal of the call to bdi_unregister in commit

  xenial be2a20802abbde block: Move bdi_unregister() to del_gendisk()

and annotated above is the change that gates whether the issue manifests or not. The other code change from be2a20802abbde has no effect on the race.

CPU B:

  vring_interrupt
  virtscsi_complete_cmd
  scsi_mq_done (via ->scsi_done())
  scsi_mq_done
  blk_mq_complete_request
  __blk_mq_complete_request
  [...]
  blk_end_sync_rq
  complete [ wake up the task from CPU A ]

After waking the CPU A task, execution returns up the stack, and then calls atomic_dec(&tgt->reqs) in virtscsi_complete_cmd immediately after returning from the call to ->scsi_done.
If the activity on CPU A after it is woken up (starting at __scsi_remove_device) finishes before CPU B can call atomic_dec() in virtscsi_complete_cmd, then the atomic_dec() will modify a free list pointer in the freed slab object that contained tgt. This causes the system to panic on a subsequent allocation from the per-cpu slab cache. The call path on CPU B is significantly shorter than that on CPU A after wakeup, so it is likely that an external event delays CPU B. This could be either an interrupt wit
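The ordering problem described in this report can be replayed in a small user-space model. This is only an illustrative sketch, not kernel code: the `Target` class, the step names, and the `run` function are hypothetical stand-ins for the target structure and the two code paths, and the scheduling is a fixed list of steps rather than real concurrency.

```python
class Target:
    """Stand-in for the target state: a request counter plus a freed flag."""

    def __init__(self):
        self.reqs = 1        # one request in flight during the probe
        self.freed = False   # becomes True once the model "frees" the target


def run(schedule, drain_before_free=False):
    """Replay a fixed interleaving of CPU A and CPU B steps.

    Returns True if the completion path decrements the counter after the
    target was freed (the use-after-free from the report), else False.
    Raises RuntimeError if the drain loop could never finish.
    """
    tgt = Target()
    woken = False
    use_after_free = False
    pending = list(schedule)
    while pending:
        step = pending.pop(0)
        if step == "B:complete":          # CPU B wakes the task on CPU A
            woken = True
        elif step == "B:atomic_dec":      # CPU B drops the request count
            if tgt.freed:
                use_after_free = True     # write into freed memory
            tgt.reqs -= 1
        elif step == "A:target_destroy":  # CPU A reaches the free
            if not woken:
                raise ValueError("CPU A cannot run before it is woken")
            if drain_before_free and tgt.reqs > 0:
                if not pending:
                    raise RuntimeError("drain loop would never terminate")
                pending.append(step)      # spin: let the other CPU run first
                continue
            tgt.freed = True
        else:
            raise ValueError("unknown step: " + step)
    return use_after_free


RACY = ["B:complete", "A:target_destroy", "B:atomic_dec"]
BENIGN = ["B:complete", "B:atomic_dec", "A:target_destroy"]

if __name__ == "__main__":
    print("racy order, no drain:", run(RACY))
    print("racy order, with drain:", run(RACY, drain_before_free=True))
    print("benign order:", run(BENIGN))
```

In the model, waiting for the counter to reach zero removes the bad ordering, and the `RuntimeError` branch mirrors the stated concern with the drain loop: if the decrement never arrives, the wait has no way to end.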
[Kernel-packages] [Bug 1716747] Re: High system load and mouse delays - pipe A vblank wait timed out
Joe,

I didn't try anything in between; I went from 4.13.0-16 to -36, and -36 started wigging out again, so I backed down to -16. I can try some interim kernels next week when I don't need to do work on the laptop in question.

--
https://bugs.launchpad.net/bugs/1716747

Title: High system load and mouse delays - pipe A vblank wait timed out

Status in linux package in Ubuntu: In Progress
Status in linux source package in Artful: In Progress
[Kernel-packages] [Bug 1716747] Re: linux 4.12 - high system load and mouse delays - pipe A vblank wait timed out
Joe,

The issue has returned on my X220 tablet, running 4.13.0-36-generic and the fully updated 17.10 user space. Every time it happens, the laptop display freezes for about 10 or 15 seconds. A concurrent ssh session is unaffected.

[94261.464884] pipe A vblank wait timed out
[94261.464948] ------------[ cut here ]------------
[94261.465044] WARNING: CPU: 2 PID: 16697 at /build/linux-r9581B/linux-4.13.0/drivers/gpu/drm/i915/intel_display.c:12848 intel_atomic_commit_tail+0xfa7/0xfb0 [i915]
[94261.465046] Modules linked in: ccm rfcomm xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bnep binfmt_misc zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel arc4 aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf snd_seq_midi snd_seq_midi_event snd_hda_codec_hdmi snd_rawmidi iwldvm mac80211 snd_hda_codec_conexant snd_hda_codec_generic uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2
[94261.465098] input_leds thinkpad_acpi snd_seq snd_hda_intel serio_raw wmi_bmof videobuf2_core btusb btrtl btbcm iwlwifi videodev btintel joydev bluetooth nvram media snd_hda_codec snd_seq_device snd_hda_core ecdh_generic cfg80211 lpc_ich snd_hwdep snd_pcm shpchp snd_timer mei_me mei snd soundcore mac_hid nfsd parport_pc ppdev auth_rpcgss nfs_acl lp lockd parport grace sunrpc ip_tables x_tables autofs4 i915 i2c_algo_bit drm_kms_helper syscopyarea e1000e sysfillrect wacom sysimgblt ptp sdhci_pci fb_sys_fops ahci psmouse usbhid sdhci hid drm libahci pps_core wmi video
[94261.465153] CPU: 2 PID: 16697 Comm: Xorg Tainted: P O 4.13.0-36-generic #40-Ubuntu
[94261.465155] Hardware name: LENOVO 42992UU/42992UU, BIOS 8DET69WW (1.39 ) 07/18/2013
[94261.465157] task: 955d1d3845c0 task.stack: af29821bc000
[94261.465217] RIP: 0010:intel_atomic_commit_tail+0xfa7/0xfb0 [i915]
[94261.465219] RSP: 0018:af29821bf8a8 EFLAGS: 00010286
[94261.465221] RAX: 001c RBX: RCX:
[94261.465223] RDX: RSI: 0002 RDI: 0246
[94261.465225] RBP: af29821bf960 R08: 001c R09: 6177206b6e616c62
[94261.465226] R10: af29821bf8a8 R11: 74756f2064656d69 R12: 003c6b37
[94261.465228] R13: 955d3fa08000 R14: 955d3fbb9000 R15: 0001
[94261.465231] FS: 7fa6fdfd0500() GS:955d5e28() knlGS:0000
[94261.465233] CS: 0010 DS: ES: CR0: 80050033
[94261.465235] CR2: 55ccba6e9ba8 CR3: 000402762004 CR4: 000606e0
[94261.465237] Call Trace:
[94261.465250] ? wait_woken+0x80/0x80
[94261.465303] intel_atomic_commit+0x3d5/0x490 [i915]
[94261.465331] ? drm_atomic_check_only+0x37e/0x540 [drm]
[94261.465352] drm_atomic_commit+0x51/0x60 [drm]
[94261.465367] restore_fbdev_mode+0x15e/0x270 [drm_kms_helper]
[94261.465379] drm_fb_helper_restore_fbdev_mode_unlocked+0x2e/0x80 [drm_kms_helper]
[94261.465389] drm_fb_helper_set_par+0x2d/0x60 [drm_kms_helper]
[94261.465447] intel_fbdev_set_par+0x1a/0x70 [i915]
[94261.465451] fb_set_var+0x19f/0x440
[94261.465456] ? __find_get_block+0xb6/0x2b0
[94261.465460] ? ext4_dirty_inode+0x48/0x70
[94261.465465] ? __ext4_handle_dirty_metadata+0x87/0x1c0
[94261.465472] fbcon_blank+0x2b7/0x3a0
[94261.465476] ? find_get_entry+0x1e/0xd0
[94261.465483] do_unblank_screen+0xba/0x1b0
[94261.465488] vt_ioctl+0x4e1/0x11a0
[94261.465493] ? __slab_free+0x14c/0x2d0
[94261.465497] ? __slab_free+0x14c/0x2d0
[94261.465502] tty_ioctl+0xf6/0x8b0
[94261.465507] ? vga_arb_release+0xd6/0x130
[94261.465511] ? security_file_free+0x44/0x60
[94261.465515] ? dput.part.23+0xba/0x1e0
[94261.465521] do_vfs_ioctl+0xa8/0x630
[94261.465527] ? entry_SYSCALL_64_after_hwframe+0xe9/0x139
[94261.465530] ? entry_SYSCALL_64_after_hwframe+0xe2/0x139
[94261.465534] ? entry_SYSCALL_64_after_hwframe+0xdb/0x139
[94261.465537] ? entry_SYSCALL_64_after_hwframe+0xd4/0x139
[94261.465541] ? entry_SYSCALL_64_after_hwframe+0xcd/0x139
[94261.465545] ? entry_SYSCALL_64_after_hwframe+0xc6/0x139
[94261.465548] ? entry_SYSCALL_64_after_hwframe+0xbf/0x139
[94261.465552] ? entry_SYSCALL_64_after_hwframe+0xb8/0x139
[94261.46] ? entry_SYSCALL_64_after_hwframe+0xb1/0x139
[94261.465560] SyS_ioctl+0x79/0x90
[94261.465563] ? entry_SYSCALL_64_after_hwframe+0x72/0x139
[94261.465567] entry_SYSCALL_64_fastpath+0x24/0xab
[94261.465570] RIP: 0033:0x7fa6fb442ef7
[94261.465572] RSP: 002b:7ffcc51286d8 EFLAGS: 3246 ORIG_RAX: 0010
[94261.465575] RAX: ffda RBX: 000e RCX: 7fa6fb442ef7
[94261.465576] RDX:
[Kernel-packages] [Bug 1716747] Re: linux 4.12 - high system load and mouse delays - pipe A vblank wait timed out
Joe,

No, I'm not seeing the issue now; I have been running 4.13.0-16 for the last 10 days or so.

--
https://bugs.launchpad.net/bugs/1716747

Title: linux 4.12 - high system load and mouse delays - pipe A vblank wait timed out

Status in linux package in Ubuntu: In Progress
Status in linux source package in Artful: In Progress

Bug description:

This issue has been observed on a Lenovo X220i laptop:

1. Booting the laptop with artful's 4.12 kernel results in very sluggish system performance (mouse pointer delays) and a high system load.
2. /var/log/kern.log indicates a problem with the display driver (see below).
3. The system works without any issues if the zesty kernel (4.10) is used instead.

Ubuntu release: Artful Aardvark (development branch)
Kernel package version: 4.12.0.13.14

Sep 12 20:22:45 trinity kernel: [ 155.491074] pipe A vblank wait timed out
Sep 12 20:22:45 trinity kernel: [ 155.491117] ------------[ cut here ]------------
Sep 12 20:22:45 trinity kernel: [ 155.491171] WARNING: CPU: 0 PID: 203 at /build/linux-cK2WUa/linux-4.12.0/drivers/gpu/drm/i915/intel_display.c:12636 intel_atomic_commit_tail+0x1010/0x1020 [i915]
Sep 12 20:22:45 trinity kernel: [ 155.491172] Modules linked in: nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter aufs xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter binfmt_misc snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_codec_generic snd_hda_intel intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec snd_hda_core snd_hwdep snd_pcm kvm_intel snd_seq_midi kvm snd_seq_midi_event irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd intel_cstate snd_rawmidi intel_rapl_perf arc4 uvcvideo videobuf2_vmalloc iwldvm videobuf2_memops videobuf2_v4l2 mac80211
Sep 12 20:22:45 trinity kernel: [ 155.491215] videobuf2_core videodev joydev input_leds media iwlwifi serio_raw cfg80211 snd_seq snd_seq_device snd_timer thinkpad_acpi nvram snd mac_hid mei_me mei soundcore lpc_ich shpchp parport_pc ppdev lp parport ip_tables x_tables autofs4 xfs libcrc32c mmc_block i915 sdhci_pci uas usb_storage sdhci i2c_algo_bit drm_kms_helper psmouse syscopyarea ahci sysfillrect libahci sysimgblt e1000e fb_sys_fops drm ptp pps_core wmi video
Sep 12 20:22:45 trinity kernel: [ 155.491251] CPU: 0 PID: 203 Comm: kworker/u16:5 Not tainted 4.12.0-13-generic #14-Ubuntu
Sep 12 20:22:45 trinity kernel: [ 155.491253] Hardware name: LENOVO 4290W1A/4290W1A, BIOS 8DET69WW (1.39 ) 07/18/2013
Sep 12 20:22:45 trinity kernel: [ 155.491292] Workqueue: events_unbound intel_atomic_commit_work [i915]
Sep 12 20:22:45 trinity kernel: [ 155.491295] task: 8842c14a8000 task.stack: ae8f0217c000
Sep 12 20:22:45 trinity kernel: [ 155.491330] RIP: 0010:intel_atomic_commit_tail+0x1010/0x1020 [i915]
Sep 12 20:22:45 trinity kernel: [ 155.491331] RSP: 0018:ae8f0217fd88 EFLAGS: 00010282
Sep 12 20:22:45 trinity kernel: [ 155.491333] RAX: 001c RBX: RCX:
Sep 12 20:22:45 trinity kernel: [ 155.491334] RDX: RSI: 8842de20dcc8 RDI: 8842de20dcc8
Sep 12 20:22:45 trinity kernel: [ 155.491336] RBP: ae8f0217fe40 R08: 0001 R09: 039a
Sep 12 20:22:45 trinity kernel: [ 155.491337] R10: ae8f0217fd88 R11: R12: 2359
Sep 12 20:22:45 trinity kernel: [ 155.491338] R13: 8842c150 R14: 8842c15e6000 R15: 0001
Sep 12 20:22:45 trinity kernel: [ 155.491340] FS: () GS:8842de20() knlGS:
Sep 12 20:22:45 trinity kernel: [ 155.491342] CS: 0010 DS: ES: CR0: 80050033
Sep 12 20:22:45 trinity kernel: [ 155.491343] CR2: 00c420921000 CR3: 0003a0409000 CR4: 000406f0
Sep 12 20:22:45 trinity kernel: [ 155.491345] Call Trace:
Sep 12 20:22:45 trinity kernel: [ 155.491352] ? wake_bit_function+0x60/0x60
Sep 12 20:22:45 trinity kernel: [ 155.491386] intel_atomic_commit_work+0x12/0x20 [i915]
Sep 12 20:22:45 trinity kernel: [ 155.491390] process_one_work+0x1e7/0x410
Sep 12 20:22:45 trinity kernel: [ 155.491393] worker_thread+0x4a/0x410
Sep 12 20:22:45 trinity kernel: [ 155.491396] kthread+0x125/0x140
Sep 12 20:22:45 trinity kernel: [ 155.491398] ? process_one_work+0x410/0x410
Sep 12 20:22:45 trinity kernel: [ 155.491400] ? kthread_create_on_node+0x70/0x70
Sep 12 20:22:45 trinity kernel: [ 155.491403] ret_from_fork+0x25/0x30
Sep 12 20:22:45 trinity kernel: [ 155.491405] Code: ff ff ff 48 83 c7 08 e8 af
[Kernel-packages] [Bug 1716747] Re: linux 4.12 - high system load and mouse delays - pipe A vblank wait timed out
Albert,

This is the lspci from my X220 T:

-[:00]-+-00.0  Intel Corporation 2nd Generation Core Processor Family DRAM Controller
       +-02.0  Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller
       +-16.0  Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1
       +-19.0  Intel Corporation 82579LM Gigabit Network Connection
       +-1a.0  Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2
       +-1b.0  Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller
       +-1c.0-[01]--
       +-1c.1-[02]----00.0  Intel Corporation Centrino Advanced-N 6205 [Taylor Peak]
       +-1c.3-[03]--
       +-1c.4-[04]----00.0  Ricoh Co Ltd PCIe SDXC/MMC Host Controller
       +-1d.0  Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1
       +-1f.0  Intel Corporation QM67 Express Chipset Family LPC Controller
       +-1f.2  Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller
       \-1f.3  Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller

--
https://bugs.launchpad.net/bugs/1716747

Title: linux 4.12 - high system load and mouse delays - pipe A vblank wait timed out

Status in linux package in Ubuntu: In Progress
Status in linux source package in Artful: In Progress
[Kernel-packages] [Bug 1716747] Re: linux 4.12 - high system load and mouse delays - pipe A vblank wait timed out
Just a comment that I have observed this bug as well, on an X220 T. The test kernel from comment #11 also appears to resolve the problem (so far). I do not have any external USB controllers attached, though, so I'm not sure what the failure path was.

--
https://bugs.launchpad.net/bugs/1716747

Title: linux 4.12 - high system load and mouse delays - pipe A vblank wait timed out

Status in linux package in Ubuntu: In Progress
Status in linux source package in Artful: In Progress
trinity kernel: [ 155.491345] Call Trace: Sep 12 20:22:45 trinity kernel: [ 155.491352] ? wake_bit_function+0x60/0x60 Sep 12 20:22:45 trinity kernel: [ 155.491386] intel_atomic_commit_work+0x12/0x20 [i915] Sep 12 20:22:45 trinity kernel: [ 155.491390] process_one_work+0x1e7/0x410 Sep 12 20:22:45 trinity kernel: [ 155.491393] worker_thread+0x4a/0x410 Sep 12 20:22:45 trinity kernel: [ 155.491396] kthread+0x125/0x140 Sep 12 20:22:45 trinity kernel: [ 155.491398] ? process_one_work+0x410/0x410 Sep 12 20:22:45 trinity kernel: [ 155.491400] ?
[Kernel-packages] [Bug 1700834] Re: Intel i40e PF reset under load
** Tags removed: verification-needed-xenial ** Tags added: verification-done-xenial -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1700834 Title: Intel i40e PF reset under load Status in linux package in Ubuntu: Confirmed Status in linux source package in Xenial: Fix Released Bug description: SRU Justification: Impact: Using an Intel i40e network device, under heavy traffic load with TSO enabled, the device will spontaneously reset itself and issue errors similar to the following: Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e :05:00.1: TX driver issue detected, PF reset issued This causes a full reset of the PF, which causes an interruption in traffic flow. In this case, these errors arise from a bug in the i40e device driver introduced by commit: commit 584a837e26408c66e87df87a022faa6a54c2b020 Author: Alexander Duyck Date: Wed Feb 17 11:02:50 2016 -0800 i40e/i40evf: Rewrite logic for 8 descriptor per packet check This patch was added to the Xenial kernel beginning with version 4.4.0-8.23. This bug does not manifest on any other Ubuntu kernel series. Fix: This error is resolved upstream by: commit 3f3f7cb875c0f621485644d4fd7453b0d37f00e4 Author: Alexander Duyck Date: Wed Mar 30 16:15:37 2016 -0700 i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet This fix was never backported into the Xenial 4.4 kernel series. Testcase: In this case, the issue occurs at a customer site using i40e based Intel network cards with SR-IOV enabled. Under heavy load, the card will reset itself as described. The customer has tested the 3f3f7cb875c patch in their environment and confirmed that it resolves the issue.
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1700834/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
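For anyone checking whether a host is hitting this, the reset messages quoted in the bug description can be counted per device with a small script. This is only a minimal sketch: the function name `count_pf_resets` and the regular expression are illustrative assumptions, and the pattern is derived solely from the three log lines quoted above, so other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern based only on the log lines quoted in the bug description.
# The device address is captured loosely because the quoted lines show
# it in a shortened form (":05:00.1").
PF_RESET = re.compile(
    r"i40e\s+(?P<device>\S+): TX driver issue detected, PF reset issued"
)


def count_pf_resets(lines):
    """Return a Counter mapping device address to number of PF reset messages.

    `lines` is any iterable of kernel log lines (hypothetical helper,
    not part of any i40e tooling).
    """
    counts = Counter()
    for line in lines:
        match = PF_RESET.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Example: python3 count_resets.py < /var/log/kern.log
    for device, count in count_pf_resets(sys.stdin).items():
        print(f"{device}: {count} PF reset(s)")
```

A count that keeps growing under load on an affected Xenial kernel would be consistent with the behaviour described in the report; it does not by itself prove that the descriptor-count bug is the cause.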
[Kernel-packages] [Bug 1709032] Re: Creating conntrack entry failure with kernel 4.4.0-89
The panic appears to be fixed upstream via: commit 9c3f3794926a997b1cab6c42480ff300efa2d162 Author: Liping Zhang Date: Sat Mar 25 16:35:29 2017 +0800 netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister If one cpu is doing nf_ct_extend_unregister while another cpu is doing __nf_ct_ext_add_length, then we may hit BUG_ON(t == NULL). Moreover, there's no synchronize_rcu invocation after set nf_ct_ext_types[id] to NULL, so it's possible that we may access invalid pointer. [...] -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1709032 Title: Creating conntrack entry failure with kernel 4.4.0-89 Status in Linux: Confirmed Status in linux package in Ubuntu: Confirmed Bug description: The functional job failure rate is at 100%. Every time some test gets stuck and job is killed after timeout. logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%5C%22gate-neutron-dsvm-functional-ubuntu-xenial%5C%22%20AND%20tags%3Aconsole%20AND%20message%3A%5C%22Killed%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20timeout%20-s%209%5C%22 2017-08-05 12:36:50.127672 | /home/jenkins/workspace/gate-neutron-dsvm-functional-ubuntu-xenial/devstack-gate/functions.sh: line 1129: 15261 Killed timeout -s 9 ${REMAINING_TIME}m bash -c "source $WORKSPACE/devstack-gate/functions.sh && $cmd" There are a few test executors left, which means there are more tests stuck: stack15468 15445 15468 0.0 0.0 328 796 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpDTLPoX stack15469 15468 15469 1.5 1.8 139332 150008 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpDTLPoX stack15470
15445 15470 0.0 0.0 328 700 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpICNqRQ stack15471 15470 15471 1.6 2.0 152056 164812 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpICNqRQ stack15474 15445 15474 0.0 0.0 328 792 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpe646Tl stack15475 15474 15475 1.6 1.9 149972 162516 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpe646Tl stack15476 15445 15476 0.0 0.0 328 804 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpv2ovhz stack15477 15476 15477 1.2 1.8 136760 149160 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpv2ovhz stack15478 15445 15478 0.0 0.0 328 712 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpDqXE8S stack15479 15478 15479 1.5 1.9 148784 161004 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpDqXE8S stack15480 15445 15480 0.0 0.0 328 804 /bin/sh -c OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \ OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \ 
OS_LOG_CAPTURE=${OS_LOG_CAPTURE:-1} \ OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \ ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./neutron/tests/unit} --load-list /tmp/tmpTmmShS stack15482 15480 15482 1.6 1.9 148856 161516 python -m subunit.run discover -t ./ ./neutron/tests/functional --load-list /tmp/tmpTmmShS To manage notifications about this bug go to: https://bugs.launchpad.net/linux/+bug/1709032/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1697053] Re: Missing IOTLB flush causes DMAR errors with SR-IOV
proposed kernel tested by customer ** Tags removed: verification-needed-trusty ** Tags added: verification-done-trusty -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1697053 Title: Missing IOTLB flush causes DMAR errors with SR-IOV Status in linux package in Ubuntu: Confirmed Status in linux source package in Trusty: Fix Committed Bug description: SRU Justification: Impact: Using SR-IOV with Intel IOMMUs can observe DMAR errors of the following type: [606483.223009] DMAR:[fault reason 05] PTE Write access is not set [606484.071974] dmar: DRHD: handling fault status reg 402 [606484.077121] dmar: DMAR:[DMA Write] Request device [d8:0a.1] fault addr 35c6e000 The DMAR error causes, at a minimum, loss of network traffic because the request being serviced is lost. Network cards were also observed to experience transmit timeouts after a DMAR fault. In this case, these errors arise from a race condition in the IOTLB management; this race is described (and fixed) in upstream commit: commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d Author: David Woodhouse Date: Wed Mar 5 17:09:32 2014 + iommu/vt-d: Clean up and fix page table clear/free behaviour This commit first appeared in mainline 3.15. This issue affects only the Ubuntu 3.13 kernel series. Fix: The race avoidance portion of the above was backported to 3.14-stable, but was never incorporated into the Ubuntu 3.13 kernel series. commit 51d20e1096a711f8cfa9d98a3ac2dd2c7c0fc20c Author: David Woodhouse Date: Mon Jun 9 14:09:53 2014 +0100 iommu/vt-d: Fix missing IOTLB flush in intel_iommu_unmap() Based on commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d upstream This 3.14-stable patch was tested by the customer and observed to resolve the issue in their environment. Testcase: In this case, the issue occurs on very recent Intel based servers using two different SR-IOV network cards (i40e and bnxt) at a customer site.
The customer has tested the patch in their environment and confirmed that it resolves the issue. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1697053/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
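To see which devices and addresses are involved when these faults appear, the DMAR fault lines quoted in the bug description can be parsed as sketched below. The helper name `parse_dmar_fault` is hypothetical, and the pattern is an assumption built from the single "DMA Write" line quoted above; the "Read" alternative is included on the assumption that read faults use the same layout.

```python
import re

# Pattern derived from the quoted line:
#   dmar: DMAR:[DMA Write] Request device [d8:0a.1] fault addr 35c6e000
# Accepting "Read" as well is an assumption; only "Write" appears in the report.
DMAR_FAULT = re.compile(
    r"DMAR:\[DMA (?P<op>Read|Write)\] "
    r"Request device \[(?P<device>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+)"
)


def parse_dmar_fault(line):
    """Return a dict with the operation, device and fault address, or None.

    The fault address is converted from the hexadecimal text in the log
    line to an integer.
    """
    match = DMAR_FAULT.search(line)
    if match is None:
        return None
    return {
        "op": match.group("op"),
        "device": match.group("device"),
        "addr": int(match.group("addr"), 16),
    }
```

Grouping the parsed results by device would show whether the faults are confined to the SR-IOV virtual functions, as the report suggests, though that grouping is left out of this sketch.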
[Kernel-packages] [Bug 1700834] Re: Intel i40e PF reset under load
** Changed in: linux (Ubuntu) Status: Incomplete => Confirmed -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1700834 Title: Intel i40e PF reset under load Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification: Impact: Using an Intel i40e network device, under heavy traffic load with TSO enabled, the device will spontaneously reset itself and issue errors similar to the following: Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e :05:00.1: TX driver issue detected, PF reset issued This causes a full reset of the PF, which causes an interruption in traffic flow. In this case, these errors arise from a bug in the i40e device driver introduced by commit: commit 584a837e26408c66e87df87a022faa6a54c2b020 Author: Alexander Duyck Date: Wed Feb 17 11:02:50 2016 -0800 i40e/i40evf: Rewrite logic for 8 descriptor per packet check This patch was added to the Xenial kernel beginning with version 4.4.0-8.23. This bug does not manifest on any other Ubuntu kernel series. Fix: This error is resolved upstream by: commit 3f3f7cb875c0f621485644d4fd7453b0d37f00e4 Author: Alexander Duyck Date: Wed Mar 30 16:15:37 2016 -0700 i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet This fix was never backported into the Xenial 4.4 kernel series. Testcase: In this case, the issue occurs at a customer site using i40e based Intel network cards with SR-IOV enabled. Under heavy load, the card will reset itself as described. The customer has tested the 3f3f7cb875c patch in their environment and confirmed that it resolves the issue.
[Kernel-packages] [Bug 1700834] [NEW] Intel i40e PF reset under load
Public bug reported: SRU Justification: Impact: Using an Intel i40e network device, under heavy traffic load with TSO enabled, the device will spontaneously reset itself and issue errors similar to the following: Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e :05:00.1: TX driver issue detected, PF reset issued This causes a full reset of the PF, which causes an interruption in traffic flow. In this case, these errors arise from a bug in the i40e device driver introduced by commit: commit 584a837e26408c66e87df87a022faa6a54c2b020 Author: Alexander Duyck <adu...@mirantis.com> Date: Wed Feb 17 11:02:50 2016 -0800 i40e/i40evf: Rewrite logic for 8 descriptor per packet check This patch was added to the Xenial kernel beginning with version 4.4.0-8.23. This bug does not manifest on any other Ubuntu kernel series. Fix: This error is resolved upstream by: commit 3f3f7cb875c0f621485644d4fd7453b0d37f00e4 Author: Alexander Duyck <adu...@mirantis.com> Date: Wed Mar 30 16:15:37 2016 -0700 i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet This fix was never backported into the Xenial 4.4 kernel series. Testcase: In this case, the issue occurs at a customer site using i40e based Intel network cards with SR-IOV enabled. Under heavy load, the card will reset itself as described. The customer has tested the 3f3f7cb875c patch in their environment and confirmed that it resolves the issue. ** Affects: linux (Ubuntu) Importance: Undecided Assignee: Jay Vosburgh (jvosburgh) Status: New ** Changed in: linux (Ubuntu) Assignee: (unassigned) => Jay Vosburgh (jvosburgh) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. 
https://bugs.launchpad.net/bugs/1700834 Title: Intel i40e PF reset under load Status in linux package in Ubuntu: New Bug description: SRU Justification: Impact: Using an Intel i40e network device, under heavy traffic load with TSO enabled, the device will spontaneously reset itself and issue errors similar to the following: Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e :05:00.1: TX driver issue detected, PF reset issued Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e :05:00.1: TX driver issue detected, PF reset issued This causes a full reset of the PF, which causes an interruption in traffic flow. In this case, these errors arise from a bug in the i40e device driver introduced by commit: commit 584a837e26408c66e87df87a022faa6a54c2b020 Author: Alexander Duyck <adu...@mirantis.com> Date: Wed Feb 17 11:02:50 2016 -0800 i40e/i40evf: Rewrite logic for 8 descriptor per packet check This patch was added to the Xenial kernel beginning with version 4.4.0-8.23. This bug does not manifest on any other Ubuntu kernel series. Fix: This error is resolved upstream by: commit 3f3f7cb875c0f621485644d4fd7453b0d37f00e4 Author: Alexander Duyck <adu...@mirantis.com> Date: Wed Mar 30 16:15:37 2016 -0700 i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet This fix was never backported into the Xenial 4.4 kernel series. Testcase: In this case, the issue occurs at a customer site using i40e based Intel network cards with SR-IOV enabled. Under heavy load, the card will reset itself as described. The customer has tested the 3f3f7cb875c patch in their environment and confirmed that it resolves the issue. 
[Kernel-packages] [Bug 1697053] Re: Missing IOTLB flush causes DMAR errors with SR-IOV
** Changed in: linux (Ubuntu) Status: In Progress => Confirmed -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1697053 Title: Missing IOTLB flush causes DMAR errors with SR-IOV Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification: Impact: Using SR-IOV with Intel IOMMUs can observe DMAR errors of the following type: [606483.223009] DMAR:[fault reason 05] PTE Write access is not set [606484.071974] dmar: DRHD: handling fault status reg 402 [606484.077121] dmar: DMAR:[DMA Write] Request device [d8:0a.1] fault addr 35c6e000 The DMAR error causes, at a minimum, loss of network traffic because the request being serviced is lost. Network cards were also observed to experience transmit timeouts after a DMAR fault. In this case, these errors arise from a race condition in the IOTLB management; this race is described (and fixed) in upstream commit: commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d Author: David Woodhouse Date: Wed Mar 5 17:09:32 2014 + iommu/vt-d: Clean up and fix page table clear/free behaviour This commit first appeared in mainline 3.15. This issue affects only the Ubuntu 3.13 kernel series. Fix: The race avoidance portion of the above was backported to 3.14-stable, but was never incorporated into the Ubuntu 3.13 kernel series. commit 51d20e1096a711f8cfa9d98a3ac2dd2c7c0fc20c Author: David Woodhouse Date: Mon Jun 9 14:09:53 2014 +0100 iommu/vt-d: Fix missing IOTLB flush in intel_iommu_unmap() Based on commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d upstream This 3.14-stable patch was tested by the customer and observed to resolve the issue in their environment. Testcase: In this case, the issue occurs on very recent Intel based servers using two different SR-IOV network cards (i40e and bnxt) at a customer site. The customer has tested the patch in their environment and confirmed that it resolves the issue.
[Kernel-packages] [Bug 1697053] [NEW] Missing IOTLB flush causes DMAR errors with SR-IOV
Public bug reported: SRU Justification: Impact: Using SR-IOV with Intel IOMMUs can observe DMAR errors of the following type: [606483.223009] DMAR:[fault reason 05] PTE Write access is not set [606484.071974] dmar: DRHD: handling fault status reg 402 [606484.077121] dmar: DMAR:[DMA Write] Request device [d8:0a.1] fault addr 35c6e000 The DMAR error causes, at a minimum, loss of network traffic because the request being serviced is lost. Network cards were also observed to experience transmit timeouts after a DMAR fault. In this case, these errors arise from a race condition in the IOTLB management; this race is described (and fixed) in upstream commit: commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d Author: David Woodhouse <david.woodho...@intel.com> Date: Wed Mar 5 17:09:32 2014 + iommu/vt-d: Clean up and fix page table clear/free behaviour This commit first appeared in mainline 3.15. This issue affects only the Ubuntu 3.13 kernel series. Fix: The race avoidance portion of the above was backported to 3.14-stable, but was never incorporated into the Ubuntu 3.13 kernel series. commit 51d20e1096a711f8cfa9d98a3ac2dd2c7c0fc20c Author: David Woodhouse <dw...@infradead.org> Date: Mon Jun 9 14:09:53 2014 +0100 iommu/vt-d: Fix missing IOTLB flush in intel_iommu_unmap() Based on commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d upstream This 3.14-stable patch was tested by the customer and observed to resolve the issue in their environment. Testcase: In this case, the issue occurs on very recent Intel based servers using two different SR-IOV network cards (i40e and bnxt) at a customer site. The customer has tested the patch in their environment and confirmed that it resolves the issue. 
** Affects: linux (Ubuntu) Importance: Undecided Assignee: Jay Vosburgh (jvosburgh) Status: New ** Changed in: linux (Ubuntu) Assignee: (unassigned) => Jay Vosburgh (jvosburgh) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1697053 Title: Missing IOTLB flush causes DMAR errors with SR-IOV Status in linux package in Ubuntu: New Bug description: SRU Justification: Impact: Using SR-IOV with Intel IOMMUs can observe DMAR errors of the following type: [606483.223009] DMAR:[fault reason 05] PTE Write access is not set [606484.071974] dmar: DRHD: handling fault status reg 402 [606484.077121] dmar: DMAR:[DMA Write] Request device [d8:0a.1] fault addr 35c6e000 The DMAR error causes, at a minimum, loss of network traffic because the request being serviced is lost. Network cards were also observed to experience transmit timeouts after a DMAR fault. In this case, these errors arise from a race condition in the IOTLB management; this race is described (and fixed) in upstream commit: commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d Author: David Woodhouse <david.woodho...@intel.com> Date: Wed Mar 5 17:09:32 2014 + iommu/vt-d: Clean up and fix page table clear/free behaviour This commit first appeared in mainline 3.15. This issue affects only the Ubuntu 3.13 kernel series. Fix: The race avoidance portion of the above was backported to 3.14-stable, but was never incorporated into the Ubuntu 3.13 kernel series. commit 51d20e1096a711f8cfa9d98a3ac2dd2c7c0fc20c Author: David Woodhouse <dw...@infradead.org> Date: Mon Jun 9 14:09:53 2014 +0100 iommu/vt-d: Fix missing IOTLB flush in intel_iommu_unmap() Based on commit ea8ea460c9ace60bbb5ac6e5521d637d5c15293d upstream This 3.14-stable patch was tested by the customer and observed to resolve the issue in their environment. 
Testcase: In this case, the issue occurs on very recent Intel based servers using two different SR-IOV network cards (i40e and bnxt) at a customer site. The customer has tested the patch in their environment and confirmed that it resolves the issue. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1697053/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1687512] Re: Kernel panics on Xenial when using cgroups and strict CFS limits
Customer has verified that 4.4.0-79-generic resolves the issue in their environment that would previously panic. ** Tags removed: verification-needed-xenial ** Tags added: verification-done-xenial -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1687512 Title: Kernel panics on Xenial when using cgroups and strict CFS limits Status in linux package in Ubuntu: Triaged Status in linux source package in Xenial: Fix Committed Bug description: SRU Justification - [Impact] Apache Mesos and Kubernetes workloads on Xenial cause a panic (NULL pointer dereference) in the completely fair scheduler. These panics are in pick_next_entity and include pick_next_task_fair in the call stack. [Fix] Cherry-picking both 754bd598be9bbc953bc709a9e8ed7f3188bfb9d7 (http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz) and 094f469172e00d6ab0a3130b0e01c83b3cf3a98d (http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz) fix the crash. They appear to be intended as a series - they were posted to LKML at the same time. [Testcase] The fix has been validated by the user who reported the bug Bug description --- We see a number of kernel panics on servers running Apache Mesos using cgroups with small (0.1-0.2) cpu limits. 
These all appear as NULL pointer dereferences in and around pick_next_entity and pick_next_task_fair, for example: [24334.493331] BUG: unable to handle kernel NULL pointer dereference at 0050 [24334.501611] IP: [] pick_next_entity+0x7f/0x160 [24334.507868] PGD 3eacfa067 PUD 3eacfb067 PMD 0 [24334.512806] Oops: [#1] SMP [24334.516420] Modules linked in: ipvlan xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs tcp_diag inet_diag nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache dm_crypt ppdev input_leds mac_hid i2c_piix4 8250_fintek parport_pc pvpanic parport serio_raw crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse virtio_scsi [24334.576359] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.4.0-66-generic #87~14.04.1-Ubuntu [24334.584748] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [24334.594188] task: 8803ee671c00 ti: 8803ee67c000 task.ti: 8803ee67c000 [24334.601799] RIP: 0010:[] [] pick_next_entity+0x7f/0x160 [24334.610490] RSP: 0018:8803ee67fdd8 EFLAGS: 00010086 [24334.615924] RAX: 8803ebed4c00 RBX: 880036529800 RCX: [24334.623190] RDX: 0225341f RSI: RDI: [24334.630479] RBP: 8803ee67fe00 R08: 0004 R09: [24334.637758] R10: 8803e7ed7600 R11: 0001 R12: [24334.645153] R13: R14: 0009067729c4 R15: 8803ee672178 [24334.652512] FS: () GS:8803ffd0() knlGS: [24334.660721] CS: 0010 DS: ES: CR0: 80050033 [24334.666587] CR2: 0050 CR3: 0003eacf9000 CR4: 001406e0 [24334.673851] Stack: [24334.675980] 8803ffd16e00 8803ffd16e00 8803e855a200 880036529800 [24334.683995] 0002 8803ee67fe68 810b98a6 8803ffd16e70 [24334.692024] 00016e00 8803e7ed7600 8803ee671c00 [24334.700172] Call Trace: [24334.702750] [] pick_next_task_fair+0x66/0x4b0 [24334.708886] [] 
__schedule+0x7f4/0x980 [24334.714349] [] schedule+0x35/0x80 [24334.719445] [] schedule_preempt_disabled+0xe/0x10 [24334.725962] [] cpu_startup_entry+0x18a/0x350 [24334.732012] [] start_secondary+0x149/0x170 [24334.737895] Code: 8b 70 50 4d 2b 74 24 50 4d 85 f6 7e 59 4c 89 e7 e8 67 ff ff ff 49 39 c6 7f 04 4c 8b 6b 48 48 8b 43 40 48 85 c0 74 1f 4c 8b 70 50 <4d> 2b 74 24 50 4d 85 f6 7e 2c 4c 89 e7 e8 3f ff ff ff 49 39 c6 [24334.765124] RIP [] pick_next_entity+0x7f/0x160 [24334.771473] RSP [24334.775077] CR2: 0050 [24334.779121] ---[ end trace 05d941efb97b7bae ]--- and [155852.028575] BUG: unable to handle kernel NULL pointer dereference at 0050 [155852.036931] IP: [] pick_next_entity+0x7f/0x160 [155852.043491] PGD 3ebae8067 PUD 3ebae9067 PMD 0 [155852.048550] Oops: [#1] SMP [155852.052437] Modules linked in: ipvlan veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge
[Kernel-packages] [Bug 1683947] Re: ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost
Jason, I work for Canonical; the issue came up with one of our customers. FWIW, I debugged the issue by first using kprobes and ftrace on the kernel of a running instance to trace the packet path through the kernel. Once it seemed that the affected packets were not being dropped somewhere on the instance and that MASQUERADE appeared to be operating correctly, I did a git bisect of the kernel to isolate the actual commit that resolved the problem (as the 4.11 kernel did not suffer from the issue). -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1683947 Title: ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost Status in linux package in Ubuntu: Confirmed Status in linux source package in Yakkety: New Bug description: SRU Justification: Impact: Configuring the 4.8 kernel with iptables MASQUERADE over virtio_net causes packets to be dropped by the hypervisor (host) due to improper flags being set based on the IP checksum state of the packet. The host performing MASQUERADE is affected by the bug. Issue was introduced by commit fd2a0437dc33b6425cabf74cc7fc7fdba6d5903b Author: Mike Rapoport Date: Wed Jun 8 16:09:18 2016 +0300 virtio_net: introduce virtio_net_hdr_{from,to}_skb which first appears in v4.8-rc1 Fix: Fixed upstream by 3e9e40e74753 virtio_net: Simplify call sites for virtio_net_hdr_{from, to}_skb(). 501db511397f virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit 6391a4481ba0 virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving 3e9e40e74753 first appears in v4.9-rc5 (and is a prerequisite only), the others in v4.10-rc4. Testcase: Reproduction to date has been on GCE, although in principle it should manifest on any suitable topology using virtio_net. There is a dependency on the forwarded packets having skb->ip_summed == CHECKSUM_UNNECESSARY; not all incoming devices will have this property.
On GCE, the following steps will induce the issue on an affected kernel: Setup a network: % gcloud compute networks create nat-network --mode legacy --range 10.240.0.0/16 % gcloud compute firewall-rules create nat-network-allow-ssh --allow tcp:22 --network nat-network % gcloud compute firewall-rules create nat-network-allow-internal --allow tcp:1-65535,udp:1-65535,icmp --source-ranges 10.240.0.0/16 --network nat-network Setup an Ubuntu 16.04 NAT VM: % gcloud compute instances create nat-gateway-16 --zone us-central1-a --network nat-network --can-ip-forward --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud --tags nat --metadata startup- script='sysctl -w net.ipv4.ip_forward=1 ; iptables -t nat -A POSTROUTING -o ens4 -j MASQUERADE' Setup a route to use the 16.04 NAT: % gcloud compute routes create no-ip-internet-route --network nat- network --destination-range 0.0.0.0/0 --next-hop-instance nat- gateway-16 --next-hop-instance-zone us-central1-a --tags no-ip --priority 800 Setup a simple test VM without any external network: % gcloud compute instances create nat-client --zone us-central1-a --network nat-network --no-address --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud --tags no-ip --metadata startup- script='wget --timeout=5 https://github.com/GoogleCloudPlatform /compute-image-packages/archive/20170327.tar.gz' Wait for it to boot... maybe 30 seconds or so. Look for serial port output: % gcloud compute instances get-serial-port-output nat-client --zone us-central1-a | grep startup-script You will see that the connection to github never succeeds - it just gets stuck on "Resolving github.com (github.com)... 192.30.253.112, 192.30.253.113" and will timeout. (ignore the previous attempt from the successful 14.04 based NAT). 
  Repeat the test by resetting the test client instance and watch for serial output:

  % gcloud compute instances reset nat-client --zone us-central1-a

  Wait a minute or so for the new boot, then check the serial-port-output as above.
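The flag handling named in the Impact and Fix sections of this bug can be illustrated with a small model. This is a simplified Python sketch, not the kernel code: the two functions and the has_data_valid parameter are hypothetical names chosen to mirror the commit titles ("don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit", "restore VIRTIO_HDR_F_DATA_VALID on receiving"), and the assumption is that the affected kernel derived the DATA_VALID flag from skb->ip_summed on every path, including transmit. Only the numeric constants are taken from the Linux headers.

```python
# Values as defined in include/linux/skbuff.h and include/uapi/linux/virtio_net.h.
CHECKSUM_NONE = 0
CHECKSUM_UNNECESSARY = 1
CHECKSUM_COMPLETE = 2
CHECKSUM_PARTIAL = 3

VIRTIO_NET_HDR_F_NEEDS_CSUM = 1
VIRTIO_NET_HDR_F_DATA_VALID = 2


def hdr_flags_affected(ip_summed: int) -> int:
    """Assumed behaviour of the affected kernel: the flag depends only on
    ip_summed, so a forwarded CHECKSUM_UNNECESSARY packet is marked
    DATA_VALID even when it is being transmitted."""
    if ip_summed == CHECKSUM_PARTIAL:
        return VIRTIO_NET_HDR_F_NEEDS_CSUM
    if ip_summed == CHECKSUM_UNNECESSARY:
        return VIRTIO_NET_HDR_F_DATA_VALID
    return 0


def hdr_flags_fixed(ip_summed: int, has_data_valid: bool) -> int:
    """Assumed behaviour after the fix: the caller states whether DATA_VALID
    is meaningful (receive path) or not (transmit path)."""
    if ip_summed == CHECKSUM_PARTIAL:
        return VIRTIO_NET_HDR_F_NEEDS_CSUM
    if has_data_valid and ip_summed == CHECKSUM_UNNECESSARY:
        return VIRTIO_NET_HDR_F_DATA_VALID
    return 0


# A forwarded packet with ip_summed == CHECKSUM_UNNECESSARY on transmit:
print(hdr_flags_affected(CHECKSUM_UNNECESSARY))      # 2 (DATA_VALID set)
print(hdr_flags_fixed(CHECKSUM_UNNECESSARY, False))  # 0 (flag not set on xmit)
```

This would also be consistent with the note in the Testcase that only forwarded packets arriving with CHECKSUM_UNNECESSARY trigger the problem.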
[Kernel-packages] [Bug 1683947] Re: ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost
** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

--
https://bugs.launchpad.net/bugs/1683947

Title:
  ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost

Status in linux package in Ubuntu:
  Confirmed
[Kernel-packages] [Bug 1683947] [NEW] ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost
Public bug reported:

SRU Justification: the report text (Impact, Fix, Testcase) is the "Bug description" carried in the notifications for this bug.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Jay Vosburgh (jvosburgh)
         Status: New

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Jay Vosburgh (jvosburgh)

--
https://bugs.launchpad.net/bugs/1683947

Title:
  ubuntu 4.8 kernel, virtio_net error causes NAT packets to be lost

Status in linux package in Ubuntu:
  New
[Kernel-packages] [Bug 1658491] Re: VLAN SR-IOV regression for IXGBE driver
This issue may be fixed by this upstream commit:

commit f60439bc21e3337429838e477903214f5bd8277f
Author: Alexander Duyck
Date:   Thu Aug 11 14:51:56 2016 -0700

    ixgbe: Force VLNCTRL.VFE to be set in all VMDq paths

    When I was adding the code for enabling VLAN promiscuous mode with SR-IOV enabled I had inadvertently left the VLNCTRL.VFE bit unchanged as I has assumed there was code in another path that was setting it when we enabled SR-IOV. This wasn't the case and as a result we were just disabling VLAN filtering for all the VFs apparently.

    Also the previous patches were always clearing CFIEN which was always set to 0 by the hardware anyway so I am dropping the redundant bit clearing.

    Fixes: 16369564915a ("ixgbe: Add support for VLAN promiscuous with SR-IOV")
    Signed-off-by: Alexander Duyck
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

--
https://bugs.launchpad.net/bugs/1658491

Title:
  VLAN SR-IOV regression for IXGBE driver

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  New
Status in linux source package in Yakkety:
  New
Status in linux source package in Zesty:
  In Progress

Bug description:
  IXGBE driver, for SR-IOV setups, is misbehaving with VLANs.
  Description from affected user:

  - Create 2 networks (SR-IOV, VLANs 100 and 102):

  # neutron net-create --provider:physical_network=PHY0 --provider:network_type=vlan --provider:segmentation_id=100 PHY0_vlan_100
  # neutron net-create --provider:physical_network=PHY0 --provider:network_type=vlan --provider:segmentation_id=102 PHY0_vlan_102

  - Create the subnets:

  # neutron subnet-create PHY0_vlan_100 192.168.50.0/24
  # neutron subnet-create PHY0_vlan_102 192.168.60.0/24

  - Create the neutron ports:

  # neutron port-create e450757f-fec6-466e-bb21-a42a2019fe6b --name vlan_100_port1 --vnic-type direct
  # neutron port-create 32c468ed-7e1e-4267-bbbf-ec72d33e4454 --name vlan_102_port1 --vnic-type direct

  - Boot 2 VMs on 2 different hosts (add only 1 port to each of them + ovs dhcp network):

  # nova boot --flavor 789 --image ubuntu --nic net-id=1cf2a512-8963-413d-a745-99e758789c2b --nic port-id=92cf2867-cc0a-4e0d-aa87-14a345cdd708 102_port1_compute6 --key-name mkey --config-drive true --availability-zone nova:compute-0-6.domain.tld --poll
  # nova boot --flavor 789 --image ubuntu --nic net-id=1cf2a512-8963-413d-a745-99e758789c2b --nic port-id=baec6fd6-933d-4c58-94b6-44c50405d409 100_port1_compute5 --key-name mkey --config-drive true --availability-zone nova:compute-0-5.domain.tld --poll

  - After the VMs have booted, configure the VFs:

  root@102-port1-compute6:~# ifconfig eth1 192.168.34.6 up
  root@100-port1-compute5:~# ifconfig eth1 192.168.34.5 up

  - The VMs can ping each other, but this should not work, because the VMs' interfaces (host VFs) are in different VLANs:
  root@compute-0-5:~# ip link show eth6
  8: eth6: mtu 2140 qdisc mq state UP mode DEFAULT group default qlen 1000
      link/ether a0:36:9f:3f:1a:64 brd ff:ff:ff:ff:ff:ff
      vf 5 MAC fa:16:3e:f0:2c:e2, vlan 100, spoof checking on, link-state auto

  root@compute-0-6:~# ip link show eth5
  8: eth5: mtu 2140 qdisc mq state UP mode DEFAULT group default qlen 1000
      link/ether a0:36:9f:3f:20:88 brd ff:ff:ff:ff:ff:ff
      vf 7 MAC fa:16:3e:ce:69:41, vlan 101, spoof checking on, link-state auto

  But the user can ping between both VMs.
[Kernel-packages] [Bug 1652348] Re: initrd dhcp fails / ignores valid response
I have instrumented ipconfig, and determined that the ultimate source of the problem is that, for the case of multiple interfaces, ipconfig has a dependency on the kernel's probe order of the network interfaces.

For whatever reason, the -31 kernel probes the network devices in one order (e.g., ens3 then ens4), and the -57 kernel in the other order (ens4 first then ens3). The probe order of network devices (and PCI devices in general) is explicitly not defined, and so this is not a bug in the kernel itself; ipconfig is failing due to its dependency on a specific enumeration order.

The issue in ipconfig is that it is using a single packet socket to attempt to multiplex packet traffic on multiple interfaces. Presuming that ens3 will answer DHCP and ens4 will not, for the case that works, the order ends up being something like:

  send DHCP request on ens3
  send DHCP request on ens4
  [ system gets DHCP response via ens3 ]
  try to receive DHCP reply sent by peer for ens3; this matches, and all is happy

For the case that it fails, the sequence is roughly:

  send DHCP request on ens4
  send DHCP request on ens3
  [ system gets DHCP response via ens3 ]
  try to receive DHCP reply sent by peer for ens4; the reply is actually for ens3, so ipconfig throws it away (as the XID, et al, don't match what is expected for the ens4 DHCP request).

This repeats until ipconfig gives up.

As I said above, the issue is that ipconfig is trying to multiplex traffic for two interfaces on one packet socket. This is fine for sending, but for receiving on an unbound packet socket, there is no way to receive a packet sent to a specific interface. Packets are delivered to recvfrom/recvmsg in the order received.

I note that ipconfig sets sll.sll_ifindex on the msghdr provided to recvfrom and recvmsg system calls; perhaps the author believed that this limits received packets to only packets received on that ifindex. This is not the case, and the sll_ifindex passed to recvfrom/recvmsg is ignored.
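The two sequences above can be modelled without real sockets. The following is a minimal Python sketch under stated assumptions, not ipconfig's code: the function names, the XIDs, and the 192.0.2.10 address are all hypothetical. It contrasts one shared receive queue polled in probe order with one receive queue per interface, the latter standing in for one packet socket per interface.

```python
from collections import deque


def poll_shared_queue(probe_order, xid_by_if, replies):
    """One receive queue for all interfaces, polled in probe order.

    Each poll consumes the next queued reply and throws it away when its
    XID does not match the interface currently being polled."""
    queue = deque(replies)
    configured = {}
    for ifname in probe_order:
        if not queue:
            break
        reply = queue.popleft()
        if reply["xid"] == xid_by_if[ifname]:
            configured[ifname] = reply["addr"]
        # Otherwise the reply is lost, even if another interface wanted it.
    return configured


def poll_per_interface_queues(probe_order, xid_by_if, replies):
    """One receive queue per interface: a reply is only ever seen by the
    interface it arrived on, so probe order no longer matters."""
    queues = {ifname: deque() for ifname in probe_order}
    for reply in replies:
        queues[reply["arrived_on"]].append(reply)
    configured = {}
    for ifname in probe_order:
        for reply in queues[ifname]:
            if reply["xid"] == xid_by_if[ifname]:
                configured[ifname] = reply["addr"]
    return configured


xids = {"ens3": 0x1111, "ens4": 0x2222}
# Only the request sent on ens3 is answered.
replies = [{"arrived_on": "ens3", "xid": 0x1111, "addr": "192.0.2.10"}]

print(poll_shared_queue(["ens3", "ens4"], xids, replies))          # {'ens3': '192.0.2.10'}
print(poll_shared_queue(["ens4", "ens3"], xids, replies))          # {}
print(poll_per_interface_queues(["ens4", "ens3"], xids, replies))  # {'ens3': '192.0.2.10'}
```

In the shared-queue model the ens4-first order loses the only reply, which matches the failing sequence; with per-interface queues either order configures ens3.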
I'm looking into whether or not there is a simple fix for this that will let ipconfig function without major rework to utilize one packet socket per interface.

** Tags removed: kernel-key

** Package changed: linux (Ubuntu) => klibc (Ubuntu)

** Changed in: klibc (Ubuntu)
       Status: Triaged => Confirmed

** Changed in: klibc (Ubuntu)
     Assignee: (unassigned) => Jay Vosburgh (jvosburgh)

--
https://bugs.launchpad.net/bugs/1652348

Title:
  initrd dhcp fails / ignores valid response

Status in klibc package in Ubuntu:
  Confirmed
Status in klibc source package in Xenial:
  Triaged
[Kernel-packages] [Bug 1652348] Re: initrd dhcp fails / ignores valid response
I have reproduced the described issue locally using the instructions from comment 35; will start looking into the cause.

--
https://bugs.launchpad.net/bugs/1652348

Title:
  initrd dhcp fails / ignores valid response

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  Triaged

Bug description:
  Between kernel versions 4.4.0-53 and 4.4.0-57 a bug has been (re?)introduced that is breaking dhcp booting in the initrd environment. This is stopping instances that use iscsi storage from being able to connect.

  Over serial console it outputs:

  IP-Config: no response after 2 secs - giving up
  IP-Config: ens2f0 hardware address 90:e2:ba:d1:36:38 mtu 1500 DHCP RARP
  IP-Config: ens2f1 hardware address 90:e2:ba:d1:36:39 mtu 1500 DHCP RARP
  IP-Config: no response after 3 secs - giving up

  with increasing delays until it fails. At which point a simple ipconfig -t dhcp -d "ens2f0" works. The console output is slightly garbled but should give you an idea:

  (initramfs) ipconfig -t dhcp -[ 728.379793] ixgbe :13:00.0 ens2f0: changing MTU from 1500 to 9000 d "ens2f0"
  IP-Config: ens2f0 hardware address 90:e2:ba:d1:36:38 mtu 1500 DHCP RARP
  IP-Config: ens2f0 guessed broadcast address 10.0.1.255
  IP-Config: ens2f0 complete (dhcp from 169.254.169.254): addres[ 728.980448] ixgbe :13:00.0 ens2f0: detected SFP+: 3 s: 10.0.1.56broadcast: 10.0.1.255 netmask: 255.255.255.0 gateway: 10.0.1.1 [ 729.148410] ixgbe :13:00.0 ens2f0: NIC Link is Up 10 Gbps, Flow Control: RX/TX dns0 : 169.254.169.254 dns1 : 0.0.0.0 rootserver: 169.254.169.254 rootpath: filename : /ipxe.efi

  tcpdumps show that dhcp requests are being received from the host, and responses sent, but not accepted by the host. When the ipconfig command is issued manually, an identical dhcp request and response happens, only this time it is accepted. It doesn't appear to be that the messages are being sent and received incorrectly, just silently ignored by ipconfig.

  I was seeing this behaviour earlier this year, which I was able to fix by specifying "ip=dhcp" as a kernel parameter. About a month ago that was identified as causing us other problems (long story) and we dropped it, at which point we discovered the original bug was no longer an issue. Putting "ip=dhcp" back on with this kernel no longer fixes the problem.

  I've compared the two initrds and effectively the only thing that has changed between the two is the kernel components.

  Ubuntu kernel bisect offending commit:

  # first bad commit: [fd4b5fa6e3487d15ede746f92601af008b2abbc0] mnt: Add a per mount namespace limit on the number of mounts

  Ubuntu kernel bisect offending commit submission:

  https://lkml.org/lkml/2016/10/5/308
[Kernel-packages] [Bug 1652348] Re: initrd dhcp fails / ignores valid response
Just a note that I'm setting up to try the reproduction instructions from comment #35.

--
https://bugs.launchpad.net/bugs/1652348

Title:
  initrd dhcp fails / ignores valid response

Status in linux package in Ubuntu:
  Incomplete
[Kernel-packages] [Bug 1584092] Re: Docker misconfigured when using non-default overlay/underlay netmask size
I haven't tested this patch, but fanctl had the same issue, and I believe the fix is that the subnet math has to be "overlay_width + ( 32 - underlay_width )", not "overlay_width + underlay_width". Patch attached.

** Patch removed: "fanatic patch"
   https://bugs.launchpad.net/ubuntu/+source/ubuntu-fan/+bug/1584092/+attachment/4667027/+files/fanatic.patch

--
https://bugs.launchpad.net/bugs/1584092

Title:
  Docker misconfigured when using non-default overlay/underlay netmask size

Status in ubuntu-fan package in Ubuntu:
  New

Bug description:
  Fan allows for variable sized subnet map sizes. For example, if I want to map a /24 to a /16 instead of the default /16 to /8, Fan supports this. However, when configuring this via fanatic, I see that docker configuration fails. In /etc/default/docker, the --fixed-cidr flag is defined incorrectly.

  $ sudo fanatic
  Welcome to the fanatic fan networking wizard. This will help you set up an example fan network and optionally configure docker and/or LXD to use this network. See fanatic(1) for more details.
  Configure fan underlay (hit return to accept, or specify alternative) [192.168.0.0/16]: 192.168.1.0/24
  Configure fan overlay (hit return to accept, or specify alternative) [250.0.0.0/8]: 250.99.0.0/16
  Create LXD networking for underlay:192.168.1.0/24 overlay:250.99.0.0/16 [Yn]: Y
  Profile fan-250-99 created
  Create docker networking for underlay:192.168.1.0/24 overlay:250.99.0.0/16 [Yn]: Y
  Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
  Test LXD networking for underlay:192.168.1.10/24 overlay:250.99.0.0/16 (NOTE: potentially triggers large image downloads) [Yn]: n
  Test docker networking for underlay:192.168.1.10/24 overlay:250.99.0.0/16 (NOTE: potentially triggers large image downloads) [Yn]: n
  This host IP address: 192.168.1.10
  Remote test host IP address (none to skip):
  /usr/sbin/fanatic: Testing skipped

  $ grep "DOCKER_OPTS" /etc/default/docker
  # Use DOCKER_OPTS to modify the daemon startup options.
  #DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4"
  DOCKER_OPTS=" -b fan-250-99 --mtu=1450 --iptables=false --fixed-cidr=250.99.10.0/40"

  May 20 05:15:30 macbook docker[27364]: time="2016-05-20T05:15:30.411933688-07:00" level=fatal msg="Error starting daemon: Error initializing network controller: invalid CIDR address: 250.99.10.0/40"
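The subnet math from the comment on this bug can be checked against the values in the transcript. The following Python sketch is only an illustration: the function names are hypothetical, and fan_slice assumes that the underlay host bits are placed directly after the overlay prefix, which is consistent with the 250.99.10.0 address in the log but is not taken from the fanatic source.

```python
import ipaddress


def buggy_prefix(overlay_width: int, underlay_width: int) -> int:
    # The math described as wrong in the comment.
    return overlay_width + underlay_width


def fixed_prefix(overlay_width: int, underlay_width: int) -> int:
    # The math proposed in the comment: overlay bits plus underlay host bits.
    return overlay_width + (32 - underlay_width)


def fan_slice(overlay: str, underlay: str, host_ip: str) -> ipaddress.IPv4Network:
    """Per-host slice of the overlay, under the assumption stated above."""
    ov = ipaddress.ip_network(overlay)
    un = ipaddress.ip_network(underlay)
    host_bits = int(ipaddress.ip_address(host_ip)) & int(un.hostmask)
    prefix = fixed_prefix(ov.prefixlen, un.prefixlen)
    if prefix > 32:
        raise ValueError("overlay and underlay sizes do not fit in 32 bits")
    base = int(ov.network_address) | (host_bits << (32 - prefix))
    return ipaddress.IPv4Network((base, prefix))


# Values from the transcript: overlay 250.99.0.0/16, underlay 192.168.1.0/24.
print(buggy_prefix(16, 24))   # 40, the invalid /40 seen in DOCKER_OPTS
print(fixed_prefix(16, 24))   # 24
print(fan_slice("250.99.0.0/16", "192.168.1.0/24", "192.168.1.10"))  # 250.99.10.0/24

# Default sizes: overlay /8, underlay /16. Both formulas give 24.
print(buggy_prefix(8, 16), fixed_prefix(8, 16))  # 24 24
```

With the default /8 overlay and /16 underlay both formulas happen to give 24, which would be consistent with the problem only showing up for non-default netmask sizes.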
[Kernel-packages] [Bug 1584092] Re: Docker misconfigured when using non-default overlay/underlay netmask size
I haven't tested this patch, but fanctl had the same issue, and I believe the fix is that the subnet math has to be "overlay_width + ( 32 - underlay_width )", not "overlay_width + underlay_width". Patch attached.

** Patch added: "fanatic.patch"
   https://bugs.launchpad.net/ubuntu/+source/ubuntu-fan/+bug/1584092/+attachment/4667033/+files/fanatic.patch

--
https://bugs.launchpad.net/bugs/1584092

Title:
  Docker misconfigured when using non-default overlay/underlay netmask size

Status in ubuntu-fan package in Ubuntu:
  New
[Kernel-packages] [Bug 1584092] Re: Docker misconfigured when using non-default overlay/underlay netmask size
I haven't tested this patch, but fanctl had the same issue, and I believe the fix is that the subnet math has to be "overlay_width + ( 32 - underlay_width )", not "overlay_width + underlay_width". Patch attached. ** Patch added: "fanatic patch" https://bugs.launchpad.net/ubuntu/+source/ubuntu-fan/+bug/1584092/+attachment/4667027/+files/fanatic.patch -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to ubuntu-fan in Ubuntu. https://bugs.launchpad.net/bugs/1584092 Title: Docker misconfigured when using non-default overlay/underlay netmask size Status in ubuntu-fan package in Ubuntu: New Bug description: Fan allows for variable sized subnet map sizes. For example, if I want to map a /24 to a /16 instead of the default /16 to /8, Fan supports this. However, when configuring this via fanatic, I see that docker configuration fails. In /etc/default/docker, the --fixed-cidr flag is defined incorrectly. $ sudo fanatic Welcome to the fanatic fan networking wizard. This will help you set up an example fan network and optionally configure docker and/or LXD touse this network. See fanatic(1) for more details. Configure fan underlay (hit return to accept, or specify alternative) [192.168.0.0/16]: 192.168.1.0/24 Configure fan overlay (hit return to accept, or specify alternative) [250.0.0.0/8]: 250.99.0.0/16 Create LXD networking for underlay:192.168.1.0/24 overlay:250.99.0.0/16 [Yn]: Y Profile fan-250-99 created Create docker networking for underlay:192.168.1.0/24 overlay:250.99.0.0/16 [Yn]: Y Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details. 
Test LXD networking for underlay:192.168.1.10/24 overlay:250.99.0.0/16 (NOTE: potentially triggers large image downloads) [Yn]: n Test docker networking for underlay:192.168.1.10/24 overlay:250.99.0.0/16 (NOTE: potentially triggers large image downloads) [Yn]: n This host IP address: 192.168.1.10 Remote test host IP address (none to skip): /usr/sbin/fanatic: Testing skipped $ grep "DOCKER_OPTS" /etc/default/docker # Use DOCKER_OPTS to modify the daemon startup options. #DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4" DOCKER_OPTS=" -b fan-250-99 --mtu=1450 --iptables=false --fixed-cidr=250.99.10.0/40" May 20 05:15:30 macbook docker[27364]: time="2016-05-20T05:15:30.411933688-07:00" level=fatal msg="Error starting daemon: Error initializing network controller: invalid CIDR address: 250.99.10.0/40" To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/ubuntu-fan/+bug/1584092/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
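For illustration only, the corrected subnet math quoted in the comment ("overlay_width + ( 32 - underlay_width )") can be sketched in Python. This is a minimal sketch, not the fanatic or fanctl code: the function name fan_host_subnet and its arguments are hypothetical, and the way the host bits are placed into the overlay address is an assumption inferred from the 250.99.10.0 value shown in the report.

```python
import ipaddress


def fan_host_subnet(overlay: str, underlay: str, host_ip: str) -> ipaddress.IPv4Network:
    """Return the per-host overlay subnet (hypothetical helper, not fanatic's code)."""
    ov = ipaddress.ip_network(overlay)
    un = ipaddress.ip_network(underlay)

    # Host bits are the bits of the underlay address NOT covered by its netmask.
    host_bits = 32 - un.prefixlen

    # Corrected math from the comment: overlay_width + (32 - underlay_width).
    prefix = ov.prefixlen + host_bits
    if prefix > 32:
        raise ValueError(f"overlay /{ov.prefixlen} is too small for underlay /{un.prefixlen}")

    # Assumption: the host part of the underlay address is placed directly
    # after the overlay prefix bits.
    host_part = int(ipaddress.ip_address(host_ip)) & ((1 << host_bits) - 1)
    base = int(ov.network_address) | (host_part << (32 - prefix))
    return ipaddress.ip_network((base, prefix))


# Values taken from the bug report: underlay /24, overlay /16, host 192.168.1.10.
print(fan_host_subnet("250.99.0.0/16", "192.168.1.0/24", "192.168.1.10"))
# The buggy sum overlay_width + underlay_width would give 16 + 24 = 40,
# which matches the invalid "/40" in the docker error message.
```

With the default sizes (a /16 underlay mapped to a /8 overlay) both formulas give 24, since 8 + 16 equals 8 + (32 - 16), which is consistent with the failure appearing only for non-default netmask sizes.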
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
** Tags removed: verification-needed-trusty verification-needed-vivid ** Tags added: verification-done-trusty verification-done-vivid -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed Status in linux source package in Trusty: Fix Committed Status in linux source package in Vivid: Fix Committed Bug description: Fragmented IPv6 packets are REJECTED by ip6tables on compute nodes. The traffic is going through an intra-VM network and the packet loss is hurting the system. There is a patch for this issue: http://patchwork.ozlabs.org/patch/434957/ I would like to know whether there is any bug report or official release date for this issue. This is pretty critical for my deployment. Thanks in advance, BR, Gyula To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1463911/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
The Wily kernel (4.2) already contains the fixes for this bug. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed Status in linux source package in Trusty: Fix Committed Status in linux source package in Vivid: Fix Committed
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
Yes, the patch has been committed for the next Ubuntu kernel releases. I have no information on a Centos patch; you would need to file a bug against Centos or RHEL. No patch to Neutron is required. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed Status in linux source package in Trusty: Fix Committed Status in linux source package in Vivid: Fix Committed
[Kernel-packages] [Bug 1508706] Re: Networking hangs on azure using hv_netvsc; bisected
SRU Justification: Impact: Bug causes easily reproducible freeze of networking on affected systems when under moderate to high network load. Ordinary benchmark tools such as iperf induce the problem without difficulty. Affected systems are virtual machine instances running on Azure, utilizing the hv_netvsc network device driver. Fix: Fix is to apply patch provided by Microsoft: http://marc.info/?l=linux-kernel=144787522532687=2 Testcase: Tested as described in Bug Description. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1508706 Title: Networking hangs on azure using hv_netvsc; bisected Status in linux package in Ubuntu: Triaged Bug description: Running Ubuntu instances on azure, testing basic networking between two instances. This involves configuring VXLAN between the two instances and running iperf and rsync of the kernel tree between the instances, e.g., ip link add vxlan0 type vxlan id 999 local 10.88.0.12 remote 10.88.0.11 dev eth0 ip l set vxlan0 up ip addr add 242.0.0.12/8 dev vxlan0 After some time (sometimes instantly, sometimes up to 30 minutes of activity), the networking will hang. This hang takes two forms: a complete loss of connectivity (all network, even the ssh session used to log in), or just a loss of connectivity between instances (the ssh session remains active). Sometimes for the latter case, the ssh session will then later hang. This first appeared when testing with the Ubuntu 3.19 kernel, and I subsequently bisected this to: commit effa2012d207f78cbc5a8360e62d420a8860b7e9 Author: KY Srinivasan Date: Mon May 11 15:39:46 2015 -0700 hv_netvsc: Use the xmit_more skb flag to optimize signaling the host BugLink: http://bugs.launchpad.net/bugs/1454892 Based on the information given to this driver (via the xmit_more skb flag), we can defer signaling the host if more packets are on the way. 
This will help make the host more efficient since it can potentially process a larger batch of packets. Implement this optimization. Signed-off-by: K. Y. Srinivasan Signed-off-by: David S. Miller Acked-by: Tim Gardner Acked-by: Brad Figg Signed-off-by: Brad Figg I also tested the mainline kernel (net-next); it fails with the equivalent commit: commit 82fa3c776e5abba7ed6e4b4f4983d14731c37d6a Author: KY Srinivasan Date: Mon May 11 15:39:46 2015 -0700 hv_netvsc: Use the xmit_more skb flag to optimize signaling the host For both kernel trees, I also tested the prior commit and it did not exhibit the failure after many hours. For ubuntu, this was commit a4aeb290bd75af5e16a6144a418291476ac6140c Author: K. Y. Srinivasan Date: Wed Mar 18 12:29:29 2015 -0700 Drivers: hv: vmbus: Export the vmbus_sendpacket_pagebuffer_ctl() and for mainline it was commit 9eea92226407e7a117ef1ceef45380ebd000a0e2 Author: Alexei Starovoitov Date: Mon May 11 15:19:48 2015 -0700 pktgen: fix packet generation To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1508706/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1508706] Re: Networking hangs on azure using hv_netvsc; bisected
We are testing this patch immediately (overnight US time) and will report our results as soon as they are available -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1508706 Title: Networking hangs on azure using hv_netvsc; bisected Status in linux package in Ubuntu: Triaged
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
The equivalent testing to comment #20 was also performed on the 3.13 and 3.16 kernels; additionally, a customer separately validated the 3.13 and 3.16 patches in their environment. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
Test methodology performed on 3.19 kernel with patch applied: Host A: fd01:::1/64 direct connect to host C ip addr add fd01:::1/64 dev eth0 Host B: fd01:::2/64 direct connect to host C ip addr add fd01:::2/64 dev eth0 host C: direct connect interfaces for Hosts A & B bridged together: brctl addbr testbr0 brctl addif testbr0 eth1 brctl addif testbr0 eth5 ip link set dev eth1 up ip link set dev eth5 up ip link set dev testbr0 up ip addr add fd01:::99/64 dev testbr0 host A: continuous ping6 to host C's address beyond the bridge, using size large enough to generate fragmented IPv6 datagrams for mtu setting of 1500: ping6 -s 4000 fd01:::2 host C: load ip6tables_nat: ip6tables -t nat -Ln Observe on host A that ping continues uninterrupted Inspect eth1 and eth5 interfaces on host C with tcpdump to confirm traffic passes through the bridge -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed
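As a rough illustration of why "ping6 -s 4000" produces fragmented IPv6 datagrams at an MTU of 1500, the fragment sizes can be estimated in Python. This is a sketch under stated assumptions, not part of the test procedure: it assumes a 40-byte IPv6 header, an 8-byte fragment header, an 8-byte ICMPv6 echo header and no other extension headers, and the function name ipv6_fragment_sizes is hypothetical.

```python
IPV6_HEADER = 40     # fixed IPv6 header, bytes
FRAG_HEADER = 8      # IPv6 fragment extension header, bytes
ICMPV6_HEADER = 8    # ICMPv6 echo request header, bytes


def ipv6_fragment_sizes(ping_data_bytes: int, mtu: int = 1500) -> list[int]:
    """Estimate the fragmentable payload carried by each fragment (hypothetical helper)."""
    # ping6 -s sets the number of data bytes; the ICMPv6 header is added on top.
    fragmentable = ping_data_bytes + ICMPV6_HEADER

    # No fragmentation is needed if the whole datagram fits in one packet.
    if fragmentable + IPV6_HEADER <= mtu:
        return [fragmentable]

    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IPV6_HEADER - FRAG_HEADER) // 8 * 8

    sizes = []
    remaining = fragmentable
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


print(ipv6_fragment_sizes(4000))  # [1448, 1448, 1112]
print(ipv6_fragment_sizes(2000))  # [1448, 560]
```

Under these assumptions, both the 4000-byte size used here and the 2000-byte size mentioned in the SRU justification exceed a single 1500-byte packet, so either one exercises the fragment path through the bridge.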
[Kernel-packages] [Bug 1508706] Re: Networking hangs on azure using hv_netvsc; bisected
I have tested the patch referenced in comment #5 and it appears to resolve the network hang. I first built and tested the Ubuntu LTS 3.19.0-31.36~14.04.1 kernel and reproduced the issue using the methodology described in the original bug description. This is commit 15e42c329445b4e0f0aecefc39e205c44755c2ba Author: Luis Henriques Date: Thu Oct 8 10:26:57 2015 +0100 UBUNTU: Ubuntu-lts-3.19.0-31.36~14.04.1 in the lts-backport-vivid branch of git://kernel.ubuntu.com/ubuntu/ubuntu-trusty.git I then applied the referenced patch and tested again and was unable to reproduce the issue after roughly an hour of testing. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1508706 Title: Networking hangs on azure using hv_netvsc; bisected Status in linux package in Ubuntu: Triaged
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
** Patch added: "Backport patch for trusty 3.13" https://bugs.launchpad.net/nova/+bug/1463911/+attachment/4520982/+files/ubuntu-trusty-3.13-sru.patch -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
** Patch added: "Backport patch for trusty 3.16" https://bugs.launchpad.net/nova/+bug/1463911/+attachment/4520983/+files/ubuntu-trusty-3.16-sru.patch -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
SRU Justification: Impact: This bug causes issues when ip6tables modules are loaded with IPv6 fragmented packets traversing a bridge. The extant conntrack processing will reassemble the IPv6 fragments for netfilter processing, but is incapable of re-fragmenting these datagrams for subsequent forwarding. This causes the fragmented IPv6 datagrams to be dropped. Fix: This is resolved by backporting functionality from mainline that re-fragments the IPv6 datagrams upon bridge egress. Testcase: The patch commit log includes a test case; to summarize: A bridge is configured with two ports and interfaces are attached to these ports. A traffic source beyond one port generates fragmented IPv6 datagrams, e.g., ping6 -s 2000, destined for a host beyond the bridge. With ip6tables modules unloaded, the IPv6 fragments will traverse the bridge. Loading ip6tables, e.g., "ip6tables -t nat -L", will cause IPv6 fragmented datagrams to be dropped on the unpatched kernel. These datagrams are correctly forwarded with the patch applied. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
** Patch added: "Backport patch for vivid 3.19" https://bugs.launchpad.net/nova/+bug/1463911/+attachment/4520984/+files/ubuntu-vivid-sru.patch -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1508706] Re: Networking hangs on azure using hv_netvsc; bisected
Yes, it did, although it seemed to be easier to reproduce with vxlan configured. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1508706 Title: Networking hangs on azure using hv_netvsc; bisected Status in linux package in Ubuntu: Triaged
[Kernel-packages] [Bug 1508706] [NEW] Networking hangs on azure using hv_netvsc; bisected
Public bug reported: Running Ubuntu instances on azure, testing basic networking between two instances. This involves configuring VXLAN between the two instances and running iperf and rsync of the kernel tree between the instances, e.g., ip link add vxlan0 type vxlan id 999 local 10.88.0.12 remote 10.88.0.11 dev eth0 ip l set vxlan0 up ip addr add 242.0.0.12/8 dev vxlan0 After some time (sometimes instantly, sometimes up to 30 minutes of activity), the networking will hang. This hang takes two forms: a complete loss of connectivity (all network, even the ssh session used to log in), or just a loss of connectivity between instances (the ssh session remains active). Sometimes for the latter case, the ssh session will then later hang. This first appeared when testing with the Ubuntu 3.19 kernel, and I subsequently bisected this to: commit effa2012d207f78cbc5a8360e62d420a8860b7e9 Author: KY Srinivasan Date: Mon May 11 15:39:46 2015 -0700 hv_netvsc: Use the xmit_more skb flag to optimize signaling the host BugLink: http://bugs.launchpad.net/bugs/1454892 Based on the information given to this driver (via the xmit_more skb flag), we can defer signaling the host if more packets are on the way. This will help make the host more efficient since it can potentially process a larger batch of packets. Implement this optimization. Signed-off-by: K. Y. Srinivasan Signed-off-by: David S. Miller Acked-by: Tim Gardner Acked-by: Brad Figg Signed-off-by: Brad Figg I also tested the mainline kernel (net-next); it fails with the equivalent commit: commit 82fa3c776e5abba7ed6e4b4f4983d14731c37d6a Author: KY Srinivasan Date: Mon May 11 15:39:46 2015 -0700 hv_netvsc: Use the xmit_more skb flag to optimize signaling the host For both kernel trees, I also tested the prior commit and it did not exhibit the failure after many hours. For ubuntu, this was commit a4aeb290bd75af5e16a6144a418291476ac6140c Author: K. Y. 
Srinivasan Date: Wed Mar 18 12:29:29 2015 -0700 Drivers: hv: vmbus: Export the vmbus_sendpacket_pagebuffer_ctl() and for mainline it was commit 9eea92226407e7a117ef1ceef45380ebd000a0e2 Author: Alexei Starovoitov Date: Mon May 11 15:19:48 2015 -0700 pktgen: fix packet generation ** Affects: linux (Ubuntu) Importance: Undecided Status: New -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1508706 Title: Networking hangs on azure using hv_netvsc; bisected Status in linux package in Ubuntu: New
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
The original patch had an error in it; I believe I've found it, and once I verify that and clean it up a bit I'll attach it to the bug. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1463911 Title: IPV6 fragmentation and mtu issue Status in neutron: Confirmed Status in OpenStack Compute (nova): Confirmed Status in linux package in Ubuntu: Confirmed
[Kernel-packages] [Bug 1502238] Re: bridge does not forward neighbor solicitation packets
I set up a similar configuration locally, and I see the bridge correctly forwarding the IPv6 NS packets. The ping functions as expected. I have different network cards, and used IPv6 ULA addresses (fc00:1234::/64), but I'm not sure how that would affect the bridge forwarding decision.

I'm also not sure what exactly is meant by your statement "Adding a host route for the 2001:: IP via the link IP"; I don't see any other reference to a 2001:: address. Could you clarify what this refers to?

Also, for completeness, can you ensure that there are no bridge table rules installed? This would be in the output of

  ebtables -t filter -L
  ebtables -t nat -L
  ebtables -t broute -L

I would also suggest disabling the bridge callouts to arptables, ip6tables and iptables to see if that affects the behavior. This would be done via

  sysctl -w net.bridge.bridge-nf-call-arptables=0
  sysctl -w net.bridge.bridge-nf-call-ip6tables=0
  sysctl -w net.bridge.bridge-nf-call-iptables=0

(all of the above sysctl and ebtables commands need to be run as root)

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1502238

Title:
  bridge does not forward neighbor solicitation packets

Status in linux package in Ubuntu:
  Triaged

Bug description:
  3 hosts involved here: kailan is connected to a cisco switch, which is also connected to kurrat (eth3), which is running a bridge with tigernut connected to eth1.
  kurrat's controllers are 06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection, using the e1000e driver (3.13.0-65-generic kernel)

  (while kailan is doing a ping6 2601:282:8100:3500:82ee:73ff:fe99:368d):

  +kurrat 324 : sudo tcpdump -eni eth3 ip6 and not tcp and not udp
  tcpdump: WARNING: eth3: no IPv4 address assigned
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on eth3, link-type EN10MB (Ethernet), capture size 65535 bytes
  10:39:16.080888 00:1c:c0:83:32:40 > 33:33:ff:99:36:8d, ethertype IPv6 (0x86dd), length 86: 2601:282:8100:3500::1 > ff02::1:ff99:368d: ICMP6, neighbor solicitation, who has 2601:282:8100:3500:82ee:73ff:fe99:368d, length 32
  10:39:16.431484 00:1c:c0:83:32:40 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 110: fe80::21c:c0ff:fe83:3240 > ff02::1: ICMP6, router advertisement, length 56
  10:39:17.077446 00:1c:c0:83:32:40 > 33:33:ff:99:36:8d, ethertype IPv6 (0x86dd), length 86: 2601:282:8100:3500::1 > ff02::1:ff99:368d: ICMP6, neighbor solicitation, who has 2601:282:8100:3500:82ee:73ff:fe99:368d, length 32
  10:39:18.077457 00:1c:c0:83:32:40 > 33:33:ff:99:36:8d, ethertype IPv6 (0x86dd), length 86: 2601:282:8100:3500::1 > ff02::1:ff99:368d: ICMP6, neighbor solicitation, who has 2601:282:8100:3500:82ee:73ff:fe99:368d, length 32
  10:39:19.095034 00:1c:c0:83:32:40 > 33:33:ff:99:36:8d, ethertype IPv6 (0x86dd), length 86: 2601:282:8100:3500::1 > ff02::1:ff99:368d: ICMP6, neighbor solicitation, who has 2601:282:8100:3500:82ee:73ff:fe99:368d, length 32
  10:39:20.093436 00:1c:c0:83:32:40 > 33:33:ff:99:36:8d, ethertype IPv6 (0x86dd), length 86: 2601:282:8100:3500::1 > ff02::1:ff99:368d: ICMP6, neighbor solicitation, who has 2601:282:8100:3500:82ee:73ff:fe99:368d, length 32
  10:39:21.093425 00:1c:c0:83:32:40 > 33:33:ff:99:36:8d, ethertype IPv6 (0x86dd), length 86: 2601:282:8100:3500::1 > ff02::1:ff99:368d: ICMP6, neighbor solicitation, who has 2601:282:8100:3500:82ee:73ff:fe99:368d, length 32
  10:39:21.43 00:1c:c0:83:32:40 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 110: fe80::21c:c0ff:fe83:3240 > ff02::1: ICMP6, router advertisement, length 56
  10:39:22.111042 00:1c:c0:83:32:40 > 33:33:ff:99:36:8d, ethertype IPv6 (0x86dd), length 86: 2601:282:8100:3500::1 > ff02::1:ff99:368d: ICMP6, neighbor solicitation, who has 2601:282:8100:3500:82ee:73ff:fe99:368d, length 32
  ^C
  10 packets captured
  11 packets received by filter
  0 packets dropped by kernel

  +kurrat 325 : sudo tcpdump -eni eth1 ip6 and not tcp and not udp
  tcpdump: WARNING: eth1: no IPv4 address assigned
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
  10:39:28.201110 00:1c:c0:83:32:40 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 110: fe80::21c:c0ff:fe83:3240 > ff02::1: ICMP6, router advertisement, length 56
  10:39:31.552677 00:1c:c0:83:32:40 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 110: fe80::21c:c0ff:fe83:3240 > ff02::1: ICMP6, router advertisement, length 56
  10:39:38.103919 08:10:78:fc:b3:d2 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 90: fe80::a10:78ff:fefc:b3d2 > ff02::1: HBH ICMP6, multicast listener query v2 [gaddr ::], length 28
  10:39:39.663357 00:1c:c0:83:32:40 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 110: fe80::21c:c0ff:fe83:3240 > ff02::1: ICMP6, router advertisement,
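
The destination addresses in the captures above follow mechanically from the ping6 target. As a minimal sketch (plain Python using the standard `ipaddress` module; the function names are illustrative and do not come from the bug report), the solicited-node multicast group and the Ethernet multicast MAC used by the neighbor solicitations can be derived like this:

```python
import ipaddress


def solicited_node_multicast(target: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group for an IPv6 unicast address.

    The group is ff02::1:ff00:0/104 with the low 24 bits of the target
    address appended.
    """
    addr = int(ipaddress.IPv6Address(target))
    prefix = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(prefix | (addr & 0xFFFFFF))


def ipv6_multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Map an IPv6 multicast group to its Ethernet MAC (33:33 + low 32 bits)."""
    low32 = int(group) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)
```

For the target 2601:282:8100:3500:82ee:73ff:fe99:368d this gives the group ff02::1:ff99:368d and the MAC 33:33:ff:99:36:8d, which are the destinations tcpdump shows for the neighbor solicitations on eth3. The router advertisements go to ff02::1, which maps to 33:33:00:00:00:01, so the two kinds of frame use different multicast MAC addresses.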
[Kernel-packages] [Bug 1497812] Re: i40e bug: non physical MAC outbound frames appear as copied back inbound (mirrored)
Just looking at the log, it might be this:

commit fa11cb3d16a9b9b296a2b811a49faf1356240348
Author: Anjali Singhai Jain
Date: Wed May 27 12:06:14 2015 -0400

    i40e: Make sure to be in VEB mode if SRIOV is enabled at probe

    If SRIOV is enabled we need to be in VEB mode not VEPA mode at probe. This fixes an NPAR bug when SRIOV is enabled in the BIOS.

    Change-ID: Ibf006abafd9a0ca3698ec24848cd771cf345cbbc
    Signed-off-by: Anjali Singhai Jain
    Tested-by: Jim Young
    Signed-off-by: Jeff Kirsher

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1497812

Title:
  i40e bug: non physical MAC outbound frames appear as copied back inbound (mirrored)

Status in linux package in Ubuntu:
  Triaged
Status in linux-lts-vivid package in Ubuntu:
  Confirmed

Bug description:
  Using 3.19.0-28-generic #30~14.04.1-Ubuntu with stock i40e driver version 2.2.2-k makes every 'non physical' MAC output frame appear as copied back at input, as if the switch was doing frame 'mirroring' (and/or hair-pinning).

  FYI, the same setup with i40e upgraded to 1.2.48 from http://downloadmirror.intel.com/25282/eng/i40e-1.2.48.tar.gz behaves OK. We also did a port mirroring setup at the switch, directed to a different physical port for debugging, and didn't observe these frames to be physically present.
  See tcpdump -P in/out and more details at http://paste.ubuntu.com/12511680/

  ProblemType: Bug
  DistroRelease: Ubuntu 14.04
  Package: linux-image-3.19.0-28-generic 3.19.0-28.30~14.04.1
  ProcVersionSignature: Ubuntu 3.19.0-28.30~14.04.1-generic 3.19.8-ckt5
  Uname: Linux 3.19.0-28-generic x86_64
  ApportVersion: 2.14.1-0ubuntu3.13
  Architecture: amd64
  Date: Mon Sep 21 02:05:28 2015
  ProcEnviron:
   TERM=screen
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  SourcePackage: linux-lts-vivid
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1497812/+subscriptions
[Kernel-packages] [Bug 1463911] Re: IPV6 fragmentation and mtu issue
I have done a backport of

commit efb6de9b4ba0092b2c55f6a52d16294a8a698edd
Author: Bernhard Thaler
Date: Sat May 30 15:30:16 2015 +0200

    netfilter: bridge: forward IPv6 fragmented packets

to the trusty 3.13 kernel. This necessitated pulling in some bits from other patches as well. I am currently testing for regressions and will submit it for SRU if all goes well.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1463911

Title:
  IPV6 fragmentation and mtu issue

Status in neutron:
  Confirmed
Status in OpenStack Compute (nova):
  Confirmed
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Fragmented IPv6 packets are REJECTED by ip6tables on compute nodes. The traffic is going through an intra-VM network and the packet loss is hurting the system.

  There is a patch for this issue: http://patchwork.ozlabs.org/patch/434957/

  I would like to know whether there is any bug report or official release date for this issue. This is pretty critical for my deployment.

  Thanks in advance,
  BR,
  Gyula

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1463911/+subscriptions
[Kernel-packages] [Bug 1442828] [NEW] change for LP 1425376 breaks systemd After=network-online.target
Public bug reported:

The change to ifup@.service done as part of LP 1425376 appears to break the ordering of units marked as After=network-online.target.

In my specific case, a new service script with After=network-online.target is erroneously run concurrently with dhclient. As the new script depends on networking configuration being complete, it fails as the IP addresses and routes from DHCP are not configured.

This functioned correctly on vivid daily images from a few days ago, and appears to break starting with the vivid daily from approximately 0409.

Infinity suggested this change as a likely suspect:

diff -Nru systemd-219/debian/extra/units/ifup@.service systemd-219/debian/extra/units/ifup@.service
--- systemd-219/debian/extra/units/ifup@.service 2015-04-02 08:08:56.0 +
+++ systemd-219/debian/extra/units/ifup@.service 2015-04-07 14:38:38.0 +
@@ -6,10 +6,8 @@
 DefaultDependencies=no

 [Service]
-Type=oneshot
-ExecStart=/sbin/ifup --allow=hotplug %I
-ExecStartPost=/sbin/ifup --allow=auto %I
 # only fail if ifupdown knows about the iface AND it's not up
-ExecStartPost=/bin/sh -c 'if ifquery %I >/dev/null; then ifquery --state %I >/dev/null; fi'
+ExecStart=/bin/sh -ec 'ifup --allow=hotplug %I; ifup --allow=auto %I; \
+if ifquery %I >/dev/null; then ifquery --state %I >/dev/null; fi'
 ExecStop=/sbin/ifdown %I
 RemainAfterExit=true

and, indeed, reverting this (copying ifup@.service from a few-days old vivid image to a current image) resolves the problem.

The affected version is ubuntu-vivid-daily-amd64-server-20150409.2 (installed via AWS).

** Affects: systemd (Ubuntu)
     Importance: Undecided
         Status: New

** Package changed: linux (Ubuntu) => systemd (Ubuntu)

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1442828

Title:
  change for LP 1425376 breaks systemd After=network-online.target

Status in systemd package in Ubuntu:
  New
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1442828/+subscriptions
[Kernel-packages] [Bug 1409123] [NEW] hw csum failure in encapsulated network topolgies
Public bug reported:

Virtualized network topologies that utilize encapsulation (e.g., VXLAN) and bridging may experience kernel errors of the format:

[ 4297.761899] eth0: hw csum failure
[ 4297.765210] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 3.18.0-rc4 -nn+ #22
[ 4297.765212] Hardware name: LENOVO 0829F3U/To be filled by O.E.M., BIOS 90KT15 AUS 07/21/2010
[ 4297.765216] 88013fc03ba8 8172f026 0 001
[ 4297.765219] 88013870e000 88013fc03bc8 8162ba52 8161c 1a0
[ 4297.765221] 8800afdf1000 88013fc03c08 8162325c 88013870e 000
[ 4297.765223] Call Trace:
[ 4297.765224] IRQ [8172f026] dump_stack+0x46/0x58
[ 4297.765235] [8162ba52] netdev_rx_csum_fault+0x42/0x50
[ 4297.765238] [8161c1a0] ? skb_push+0x40/0x40
[ 4297.765240] [8162325c] __skb_checksum_complete+0xbc/0xd0
[ 4297.765243] [8168c602] tcp_v4_rcv+0x2e2/0x950
[ 4297.765246] [81666ca0] ? ip_rcv_finish+0x360/0x360
[ 4297.765248] [81660224] ? nf_hook_slow+0x74/0x130
[ 4297.765250] [81666ca0] ? ip_rcv_finish+0x360/0x360
[ 4297.765253] [81666d4c] ip_local_deliver_finish+0xac/0x220
[ 4297.765255] [81667058] ip_local_deliver+0x48/0x80
[ 4297.765257] [816669c1] ip_rcv_finish+0x81/0x360
[ 4297.765259] [81667332] ip_rcv+0x2a2/0x3f0
[ 4297.765261] [8162e932] __netif_receive_skb_core+0x562/0x7a0
[ 4297.765263] [8162eb88] __netif_receive_skb+0x18/0x60
[ 4297.765265] [8162f8f6] process_backlog+0xa6/0x150

The backtrace may vary; stacks descending into conntrack have also been observed:

Call Trace:
 IRQ [8171a324] dump_stack+0x45/0x56
 [8161bfba] netdev_rx_csum_fault+0x3a/0x40
 [81614782] __skb_checksum_complete_head+0x62/0x70
 [816147a1] __skb_checksum_complete+0x11/0x20
 [816a3eac] nf_ip_checksum+0xcc/0x100
 [a04df33b] udp_error+0xdb/0x1f0 [nf_conntrack]
 [a04d926e] nf_conntrack_in+0xee/0xb40 [nf_conntrack]
 [a0307653] ? do_execute_actions+0x2e3/0xab0 [openvswitch]
 [a0307e4b] ? ovs_execute_actions+0x2b/0x30 [openvswitch]
 [81654540] ? inet_del_offload+0x40/0x40
 [a03b52e2] ipv4_conntrack_in+0x22/0x30 [nf_conntrack_ipv4]
 [8164e0aa] nf_iterate+0x9a/0xb0
 [81654540] ? inet_del_offload+0x40/0x40
 [8164e134] nf_hook_slow+0x74/0x130
 [81654540] ? inet_del_offload+0x40/0x40
 [81654f68] ip_rcv+0x2f8/0x3d0

The root cause of this is twofold:

First, the kernel handling of forwarded packets that have been encapsulated (e.g., from VXLAN) for devices that support CHECKSUM_COMPLETE checksum offload fails to update the running checksum when decapsulating the packet.

Second, for the enic device itself, the hardware is not correctly computing the checksum for some cases.

Both of these issues are patched in mainline:

commit 17e96834fd35997ca7cdfbf15413bcd5a36ad448
Author: Govindarajulu Varadarajan _gov...@gmx.com
Date: Thu Dec 18 15:58:42 2014 +0530

    enic: fix rx skb checksum

commit 2c26d34bbcc0b3f30385d5587aa232289e2eed8e
Author: Jay Vosburgh jay.vosbu...@canonical.com
Date: Fri Dec 19 15:32:00 2014 -0800

    net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Jay Vosburgh (jvosburgh)
         Status: New

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Jay Vosburgh (jvosburgh)

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1409123

Title:
  hw csum failure in encapsulated network topolgies

Status in linux package in Ubuntu:
  New
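
The first root cause in this report is that the running checksum is not updated when the packet is decapsulated. As a rough illustration only (plain Python, not kernel code; the helper names are hypothetical, and the stripped header is assumed to have an even length so no byte swapping is needed), a ones' complement sum computed over the whole frame stops matching the inner packet once the outer header is removed, unless that header's contribution is subtracted:

```python
def _fold(total: int) -> int:
    """Fold carries back in until the value fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_partial(data: bytes) -> int:
    """16-bit ones' complement sum of data (RFC 1071 style), not inverted."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    return _fold(total)


def csum_sub(total: int, part: int) -> int:
    """Remove `part` from a ones' complement sum by adding its complement."""
    return _fold(total + (~part & 0xFFFF))


def decapsulate(frame: bytes, full_sum: int, outer_len: int):
    """Strip an even-length outer header and adjust the running sum to match.

    Returns the inner packet together with a sum that covers only the
    inner packet, derived from the sum over the whole frame.
    """
    outer, inner = frame[:outer_len], frame[outer_len:]
    return inner, csum_sub(full_sum, csum_partial(outer))
```

Calling `decapsulate(frame, csum_partial(frame), outer_len)` returns a sum equal to `csum_partial(inner)`; passing the unadjusted full-frame sum along with the inner packet instead is the kind of mismatch that a receive-side verification would flag.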
[Kernel-packages] [Bug 1233175] Re: Kernel panic : mempolicy potential use-after-free on server running mongodb
** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Jay Vosburgh (jvosburgh)

** Changed in: linux (Ubuntu Precise)
     Assignee: (unassigned) => Jay Vosburgh (jvosburgh)

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1233175

Title:
  Kernel panic : mempolicy potential use-after-free on server running mongodb

Status in “linux” package in Ubuntu:
  In Progress
Status in “linux” source package in Precise:
  In Progress

Bug description:
  PID: 21767 TASK: 8800874bdc00 CPU: 12 COMMAND: mongod
   #0 [880657cc3820] machine_kexec at 810393da
   #1 [880657cc3890] crash_kexec at 810b53f8
   #2 [880657cc3960] oops_end at 8165e528
   #3 [880657cc3990] die at 810178d8
   #4 [880657cc39c0] do_trap at 8165de94
   #5 [880657cc3a20] do_invalid_op at 81014f65
   #6 [880657cc3ac0] invalid_op at 8166796b
      [exception RIP: slab_node+46]
      RIP: 8115a66e RSP: 880657cc3b70 RFLAGS: 00010097
      RAX: RBX: 880657802c00 RCX: e62f6aef
      RDX: RSI: 0020 RDI: 880abf18a288
      RBP: 880657cc3b80 R8: 0001 R9: 000100100010
      R10: R11: 0022 R12: 0002
      R13: R14: R15: 0020
      ORIG_RAX: CS: 0010 SS: 0018
   #7 [880657cc3b88] get_any_partial at 816496a0
   #8 [880657cc3c18] __slab_alloc at 816498cf
   #9 [880657cc3cc8] __kmalloc_node_track_caller at 81166f07
  #10 [880657cc3d38] __alloc_skb at 815364c8
  #11 [880657cc3d88] __netdev_alloc_skb at 81536b14
  #12 [880657cc3da8] enic_rq_alloc_buf at a005484c [enic]
  #13 [880657cc3e08] enic_poll_msix at a00559ff [enic]
  #14 [880657cc3e58] net_rx_action at 81545274
  #15 [880657cc3ec8] __do_softirq at 8106f5f8
  #16 [880657cc3f38] call_softirq at 81667bec
  #17 [880657cc3f50] do_softirq at 81016305
  #18 [880657cc3f70] irq_exit at 8106f9de
  #19 [880657cc3f80] do_IRQ at 816684a3
  --- IRQ stack ---
  #20 [880544d8bd48] ret_from_intr at 8165d82e
      [exception RIP: __slab_free+737]
      RIP: 81649467 RSP: 880544d8bdf8 RFLAGS: 0202
      RAX: 0001 RBX: ff0a0210 RCX: 000180aa00a9
      RDX: 000180aa00aa RSI: ea002afc6201 RDI: 880657806200
      RBP: 880544d8bea8 R8: 0001 R9:
      R10: 8800874be020 R11: 8800874be030 R12: 880544d8be33
      R13: 000d R14: 81191895 R15: 880544d8bdb8
      ORIG_RAX: ff54 CS: 0010 SS: 0018
  #21 [880544d8be30] __change_pid at 81087dca
  #22 [880544d8beb0] kmem_cache_free at 81163634
  #23 [880544d8bef0] __mpol_put at 81159937
  #24 [880544d8bf00] do_exit at 8106c75c
  #25 [880544d8bf70] sys_exit at 8106caf7
  #26 [880544d8bf80] system_call_fastpath at 81665982
      RIP: 7f6f476b8f37 RSP: 7f68cbcfdbb0 RFLAGS: 0202
      RAX: 003c RBX: 81665982 RCX:
      RDX: 7f68cbcfe700 RSI: 7f6f478c9250 RDI:
      RBP: R8: 7f68cbcfe700 R9: 7f68e82a0370
      R10: 7fff R11: 0246 R12: 8106caf7
      R13: 880544d8bf78 R14: 0003 R15: 7f68f8744a10
      ORIG_RAX: ...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1233175/+subscriptions
[Kernel-packages] [Bug 1344323] [NEW] Trusty kernel network performance regression
Public bug reported:

SRU Justification:

Impact:

Reduced TCP/IP receive performance for network devices that do not split packet headers into skb linear area (e.g., mlx4).

The trusty kernel has incorporated

commit eff44f9cc9a02aad53d568d3ae5020b6792ae4f6
Author: Jerry Chu hk...@google.com
Date: Wed Dec 11 20:53:45 2013 -0800

    net-gro: Prepare GRO stack for the upcoming tunneling support

which modifies the GRO frag0 optimization, but unfortunately for some cases results in calls to __pskb_pull_tail for every packet being received via the GRO path. This causes a reduction in TCP receive performance (or, more accurately, an increase in CPU load for TCP receive processing, which will cause throughput reduction for CPU limited workloads).

Fix:

This has already been fixed in mainline in

commit a50e233c50dbc881abaa0e4070789064e8d12d70
Author: Eric Dumazet eduma...@google.com
Date: Sat Mar 29 21:28:21 2014 -0700

    net-gro: restore frag0 optimization

The fix has been backported to and verified on the trusty kernel using mlx4 devices and iperf; an increase from 7.5 to 8.5 Gb/sec was observed when adding the patch, and the relevant portion of perf captures show changes in the call paths from:

     7.17%  iperf  [kernel.kallsyms]  [k] __pskb_pull_tail
            |
            --- __pskb_pull_tail
               |
               |--48.03%-- tcp_gro_receive
               |          tcp4_gro_receive
               |          inet_gro_receive
               |          dev_gro_receive
               |          napi_gro_frags
               |          mlx4_en_process_rx_cq
               |          mlx4_en_poll_rx_cq
               |          net_rx_action
               |          __do_softirq
[...]
               |--28.53%-- napi_gro_frags
               |          mlx4_en_process_rx_cq
               |          mlx4_en_poll_rx_cq
               |          net_rx_action
               |          __do_softirq
[...]
               |--13.11%-- inet_gro_receive
               |          dev_gro_receive
               |          napi_gro_frags
               |          mlx4_en_process_rx_cq
               |          mlx4_en_poll_rx_cq
               |          net_rx_action
               |          __do_softirq

to:

     4.87%  iperf  [kernel.kallsyms]  [k] skb_gro_receive
            |
            --- skb_gro_receive
               |
               |--98.13%-- tcp_gro_receive
               |          tcp4_gro_receive
               |          inet_gro_receive
               |          dev_gro_receive
               |          napi_gro_frags
               |          mlx4_en_process_rx_cq
               |          mlx4_en_poll_rx_cq
               |          net_rx_action
               |          __do_softirq

Testcase:

The fix was tested using mlx4 10Gb/sec network devices between two arm64 systems using iperf -s on one end and iperf -c on the other. The unmodified kernel reported approximately 7.5 Gb/sec throughput, the fixed kernel approximately 8.5 Gb/sec.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1344323

Title:
  Trusty kernel network performance regression

Status in “linux” package in Ubuntu:
  New

Bug description:
  SRU Justification:

  Impact:

  Reduced TCP/IP receive performance for network devices that do not split packet headers into skb linear area (e.g., mlx4).

  The trusty kernel has incorporated

  commit eff44f9cc9a02aad53d568d3ae5020b6792ae4f6
  Author: Jerry Chu hk...@google.com
  Date: Wed Dec 11 20:53:45 2013 -0800

      net-gro: Prepare GRO stack for the upcoming tunneling support

  which modifies the GRO frag0 optimization, but unfortunately for some cases results in calls to __pskb_pull_tail for every packet being received via the GRO path. This causes a reduction in TCP receive performance (or, more accurately, an increase in CPU load for TCP receive processing, which will cause throughput reduction for CPU limited workloads).

  Fix:

  This has already been fixed in mainline in

  commit a50e233c50dbc881abaa0e4070789064e8d12d70
  Author: Eric Dumazet eduma...@google.com
  Date: Sat Mar 29 21:28:21 2014 -0700

      net-gro: restore frag0 optimization

  The fix has been backported to and verified on the trusty kernel using mlx4 devices and iperf; an increase from 7.5 to 8.5 Gb/sec was observed when adding the patch, and the relevant portion