[Kernel-packages] [Bug 1927076] Re: IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])
I've made some good progress here. I found that older versions like 4.19 work, so I ran git bisect. I'm still doing the final check, but it looks like the series that causes the issue is the one containing these:

  d53d2f78cead bpf: Use vmalloc special flag
  1a7b7d922081 modules: Use vmalloc special flag
  868b104d7379 mm/vmalloc: Add flag for freeing of special permsissions

In particular:

commit 868b104d7379e28013e9d48bdd2db25e0bdcf751 (HEAD)
Author: Rick Edgecombe
Date:   Thu Apr 25 17:11:36 2019 -0700

    mm/vmalloc: Add flag for freeing of special permsissions

    Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to
    immediately clear executable TLB entries before freeing pages, and handle
    resetting permissions on the directmap. This flag is useful for any kind
    of memory with elevated permissions, or where there can be related
    permissions changes on the directmap. Today this is RO+X and RO memory.

    Although this enables directly vfreeing non-writeable memory now,
    non-writable memory cannot be freed in an interrupt because the allocation
    itself is used as a node on deferred free list. So when RO memory needs to
    be freed in an interrupt the code doing the vfree needs to have its own
    work queue, as was the case before the deferred vfree list was added to
    vmalloc.

    For architectures with set_direct_map_ implementations this whole operation
    can be done with one TLB flush when centralized like this. For others with
    directmap permissions, currently only arm64, a backup method using
    set_memory functions is used to reset the directmap. When arm64 adds
    set_direct_map_ functions, this backup can be removed.

    When the TLB is flushed to both remove TLB entries for the vmalloc range
    mapping and the direct map permissions, the lazy purge operation could be
    done to try to save a TLB flush later. However today vm_unmap_aliases
    could flush a TLB range that does not include the directmap. So a helper
    is added with extra parameters that can allow both the vmalloc address and
    the direct mapping to be flushed during this operation. The behavior of the
    normal vm_unmap_aliases function is unchanged.

and

commit d53d2f78ceadba081fc7785570798c3c8d50a718
Author: Rick Edgecombe
Date:   Thu Apr 25 17:11:38 2019 -0700

    bpf: Use vmalloc special flag

    Use new flag VM_FLUSH_RESET_PERMS for handling freeing of special
    permissioned memory in vmalloc and remove places where memory was set RW
    before freeing which is no longer needed. Don't track if the memory is RO
    anymore because it is now tracked in vmalloc.

This is _extremely_ much in "subtly breaks under the hash MMU" territory. Hopefully this is enough to get some Power MMU experts to weigh in. I will keep working on it.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1927076

Title:
  IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])

Status in ubuntu-kernel-tests:
  New
Status in The Ubuntu-power-systems project:
  Confirmed
Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Focal:
  Confirmed
Status in linux source package in Hirsute:
  Confirmed

Bug description:
  It looks like our P8 node "entei" tends to fail with the IPv6 TCP test from reuseport_bpf_cpu in ubuntu_kernel_selftests/net on 5.8 kernels:

    # send cpu 119, receive socket 119
    # send cpu 121, receive socket 121
    # send cpu 123, receive socket 123
    # send cpu 125, receive socket 125
    # send cpu 127, receive socket 127
    # IPv6 TCP
    publish-job-status: using request.json

  It failed silently here. This can be reproduced 100% of the time with Groovy 5.8 and Focal 5.8.

  This causes ubuntu_kernel_selftests to be interrupted, so the results for other tests cannot be processed to our result page.

  Please find attached the complete "net" test result on this node with Groovy 5.8.0-52.59.

  Add the kqa-blocker tag as this might need to be manually verified.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1927076] Re: IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])
I can repro on upstream, all the way back to 5.4.0. It might have existed before that - I haven't tested any earlier yet.

Was the test methodology changed just before this was found? I'm just wondering why it suddenly appeared ~a year after Focal was released. I thought it might have been a patch picked up for a SRU, but it's looking like the problem predates Focal by some way...
[Kernel-packages] [Bug 1927076] Re: IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])
I can repro this with the latest Focal kernel on:

  description: PowerNV
  product: 8247-22L (IBM Power System S822L)

Trying to see if I can repro it upstream.

FWIW my opening hypothesis is that something in a percpu data structure isn't getting updated over hotplug.
[Kernel-packages] [Bug 1904906] Re: 5.10 kernel fails to boot with secure boot disabled
I cannot yet explain this, but after bisecting the config, I can repro this with pseries_le_defconfig + CONFIG_RCU_SCALE_TEST=m

That's weird to me, and I'll continue to investigate.

--
https://bugs.launchpad.net/bugs/1904906

Title:
  5.10 kernel fails to boot with secure boot disabled

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  Canonical requested testing of secure boot for the 5.10 kernel, but the kernel fails to boot with secure boot disabled.

  The 5.10 kernel can be found in:
  https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/bootstrap

  The kernels can be installed by installing the linux-generic-wip package with this PPA enabled. As usual, they are only signed using a key specific to that PPA. This key can be retrieved from the signing tarballs for the kernels, e.g.:
  http://ppa.launchpad.net/canonical-kernel-team/bootstrap/ubuntu/dists/hirsute/main/signed/linux-5.10-ppc64el/5.10.0-2.3/signed.tar.gz

  Our tester installed the 5.10 kernel via aptitude.

  If booting directly from the boot menu, it gets stuck at:
  "kexec_core: Starting new kernel"

  If booting the recovery kernel for 5.10.0, it proceeds further and, after kexec_core, fails at:
  "
  [0.029830] LSM: Security Framework initializing
  [0.029916] Yama: b
  "

  We made two attempts with a different scenario, running the 5.8 kernel and booting 5.10 from the command line:

    kexec -l /boot/vmlinux-5.10.0-0-generic --initrd=/boot/initrd.img-5.10.0-0-generic --append="root=UUID=49d000cb-dba2-4d70-809e-38f2b31d0f09 ro quiet splash"
    kexec -e

  Both attempts also failed while rebooting: once with the same error as when booting from the boot menu; the other failure occurred a lot earlier.

  What new CONFIGs and/or features does the 5.10 kernel have?

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1904906/+subscriptions
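The reproducer configuration named in the comment above (pseries_le_defconfig plus CONFIG_RCU_SCALE_TEST=m) could be set up roughly as follows. This is a sketch under assumptions, not the commands actually used: it assumes it is run from the top of a kernel source tree, and the `ARCH`/`CROSS_COMPILE` values are guesses for a cross build from another architecture. By default it only prints the commands; set `RUN=1` to execute them.

```shell
#!/bin/sh
# Sketch: configure a kernel tree as pseries_le_defconfig with
# CONFIG_RCU_SCALE_TEST built as a module. Dry-run unless RUN=1.
set -eu

run() {
    # Print the command in dry-run mode, execute it when RUN=1.
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

configure_repro() {
    # ARCH and CROSS_COMPILE are assumptions; drop them on a native ppc64le host.
    run make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- pseries_le_defconfig
    # scripts/config ships with the kernel tree; --module sets the option to =m.
    run ./scripts/config --module RCU_SCALE_TEST
    # Resolve any dependencies the manual edit may have left unset.
    run make ARCH=powerpc CROSS_COMPILE=powerpc64le-linux-gnu- olddefconfig
}

plan=$(configure_repro)
echo "$plan"
```

Keeping the change to a single `scripts/config` call on top of a stock defconfig makes it easy to confirm that the one option is what separates a booting kernel from a failing one.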
[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter
The 5.3.0 HWE kernel also works, which means we now have a good workaround while we debug things.

--
https://bugs.launchpad.net/bugs/1863044

Title:
  qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  We have an IBM BladeCentre Hx5 with a number of blades running Ubuntu 18.04.3. Storage is attached over Fiber Channel.

  They all boot fine with 4.15.0-72 - the qla2xxx detects all the LUNs.

  On 4.15.0-74 and 4.15.0-76, the qla2xxx driver loads but the LUNs are not detected. This breaks the boot. rescan-scsi-bus.sh is also unable to find the LUNs. Reverting to the 4.15.0-72 kernel works.

  lspci reports:

  06:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
          Subsystem: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA
          Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0
          I/O ports at 2c00 [size=256]
          Memory at 903fc000 (64-bit, non-prefetchable) [size=16K]
          Expansion ROM at 9030 [disabled] [size=256K]
          Capabilities: [44] Power Management version 3
          Capabilities: [4c] Express Endpoint, MSI 00
          Capabilities: [88] MSI: Enable- Count=1/32 Maskable- 64bit+
          Capabilities: [98] Vital Product Data
          Capabilities: [a0] MSI-X: Enable+ Count=2 Masked-
          Capabilities: [100] Advanced Error Reporting
          Capabilities: [138] Power Budgeting
          Kernel driver in use: qla2xxx
          Kernel modules: qla2xxx

  Let me know if you need any more details. I attach a version.log and lspci-vnvn.log from a working -72 boot.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+subscriptions
[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter
Hi Mauricio,

5.4.0-14 works for me, dmesg attached. I'll see if an HWE kernel supplied in the bionic repositories also works. Maybe we can use that in the meantime so we don't fall any further behind on kernel updates while we debug this.

Regards,
Daniel

** Attachment added: "dmesg-5.4.0-14-generic"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329404/+files/dmesg-5.4.0-14-generic
[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter
Ah, I was just about to tell you that I have just tried master-next at a59858e18bc8996f8c96d307a33e504b079dc541 ! I think that is the same sha that ended up being tagged as -89, so I think it provides us with the same information.

Sadly -89 also doesn't seem to work; dmesg attached.

I don't know anything about the qla2xxx driver, so I was planning to confirm that the problem was introduced by the set of qla2xxx changes that went into -73 and then bisect them. But, if you have any insight or specific knowledge that would suggest a better path, I'm very happy to give that a go.

Regards,
Daniel

** Attachment added: "dmesg-89"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329273/+files/dmesg-89
[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter
** Attachment added: "dmesg from -72"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329133/+files/dmesg-72
[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter
Hi Mauricio,

Thanks for the prompt answer!

After a lot of messing around to get a remote console, I can finally test. It looks like -88 doesn't work. I'm attaching a dmesg from -88 and -72.

I will build and test master-next next.

Regards,
Daniel

** Attachment added: "dmesg from -88"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329132/+files/dmesg-88
[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter
** Attachment added: "lspci-vnvn.log"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5327820/+files/lspci-vnvn.log

Status in linux package in Ubuntu:
  New
[Kernel-packages] [Bug 1863044] [NEW] qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter
Public bug reported:

We have an IBM BladeCentre Hx5 with a number of blades running Ubuntu 18.04.3. Storage is attached over Fiber Channel.

They all boot fine with 4.15.0-72 - the qla2xxx detects all the LUNs.

On 4.15.0-74 and 4.15.0-76, the qla2xxx driver loads but the LUNs are not detected. This breaks the boot. rescan-scsi-bus.sh is also unable to find the LUNs. Reverting to the 4.15.0-72 kernel works.

lspci reports:

06:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
        Subsystem: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA
        Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0
        I/O ports at 2c00 [size=256]
        Memory at 903fc000 (64-bit, non-prefetchable) [size=16K]
        Expansion ROM at 9030 [disabled] [size=256K]
        Capabilities: [44] Power Management version 3
        Capabilities: [4c] Express Endpoint, MSI 00
        Capabilities: [88] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [98] Vital Product Data
        Capabilities: [a0] MSI-X: Enable+ Count=2 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [138] Power Budgeting
        Kernel driver in use: qla2xxx
        Kernel modules: qla2xxx

Let me know if you need any more details. I attach a version.log and lspci-vnvn.log from a working -72 boot.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Attachment added: "version.log"
   https://bugs.launchpad.net/bugs/1863044/+attachment/5327819/+files/version.log
[Kernel-packages] [Bug 1853142] Re: CVE-2019-18660: patches for Ubuntu
My colleague has verified all 4 versions. In all cases, on supported hardware, the test now operates as expected: the secret does not leak unless the mitigation is manually turned off.

I notice the SRU verification is happening a bit sooner than I expected - when do you expect these kernels to be released?

** Tags removed: verification-needed-bionic verification-needed-disco verification-needed-eoan verification-needed-xenial
** Tags added: verification-done-bionic verification-done-disco verification-done-eoan verification-done-xenial

--
https://bugs.launchpad.net/bugs/1853142

Title:
  CVE-2019-18660: patches for Ubuntu

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in Focal:
  Triaged

Bug description:
  Hi,

  Recently you would have been notified about CVE-2019-18660 via email to the linux-distros private mailing list. In short, it is a bug in the Spectre v2 class affecting powerpc.

  We have developed some backports for supported Ubuntu kernels, and tested them in our lab. I will attach the patches shortly. Most of them should end up being identical to the versions in linux-stable, but the ones for Bionic are slightly different due to it using a 4.15 kernel.

  Please get in touch with me or Michael Ellerman (powerpc maintainer) if you have any questions or if we can be of any assistance.

  Kind regards,
  Daniel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1853142/+subscriptions
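As a complement to the verification described above, the kernel's own report of the Spectre v2 mitigation state can be read from sysfs. This is only a sketch and is not the leak test the verifier ran: the `spectre_v2_status` helper name is made up here, and the exact text the file contains varies by kernel version and hardware.

```shell
#!/bin/sh
# Sketch: print the kernel's reported Spectre v2 mitigation status.
# spectre_v2_status is a hypothetical helper; the optional argument lets
# the path be overridden (useful on systems without this sysfs entry).
set -eu

spectre_v2_status() {
    file=${1:-/sys/devices/system/cpu/vulnerabilities/spectre_v2}
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unknown (no $file)"
    fi
}

spectre_v2_status
```

A status line reporting a mitigation only says what the kernel believes it has enabled; an actual leak test like the one mentioned in the comment is still needed to confirm the behaviour on real hardware.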
[Kernel-packages] [Bug 1853142] Re: CVE-2019-18660: patches for Ubuntu
The embargo has expired so I'm making this public now.

** Description changed:

  Hi,

  Recently you would have been notified about CVE-2019-18660 via email to
  the linux-distros private mailing list. In short, it is a bug in the
  Spectre v2 class affecting powerpc.

  We have developed some backports for supported Ubuntu kernels, and
  tested them in our lab. I will attach the patches shortly. Most of them
  should end up being identical to the versions in linux-stable, but the
  ones for Bionic are slightly different due to it using a 4.15 kernel.

  Please get in touch with me or Michael Ellerman (powerpc maintainer) if
  you have any questions or if we can be of any assistance.

- If I understand the SRU cycles correctly, we've missed the current one
- due for release on 2 December, so the earliest these patches could land
- in is the kernel nominally slated to be released ~23 December. Are you
- planning to still release a kernel then, or are your cycles going to
- change over the end of year period?
-
- (If it helps, we've got some automation set up so we're able to do the
- regression testing of -proposed kernels with these patches quickly.)

  Kind regards,
  Daniel

** Information type changed from Private Security to Public Security

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  Triaged
Status in linux source package in Bionic:
  Triaged
Status in linux source package in Disco:
  Triaged
Status in linux source package in Eoan:
  Triaged
Status in linux source package in Focal:
  Triaged
[Kernel-packages] [Bug 1822870] Re: Backport support for software count cache flush Spectre v2 mitigation. (CVE) (required for POWER9 DD2.3)
Hi Michael R,

I tried to apply your patches to test them and support the effort to get them included in the Bionic kernel, but I'm having some trouble applying them:

ubuntu@dja-bionic:~/bionic$ git am ../patches/01-powerpc-64s-add-support-for-ori-barrier_nospec.patch
Patch format detection failed.
ubuntu@dja-bionic:~/bionic$ git am ../patches/01-powerpc-64s-add-support-for-ori-barrier_nospec.patch --patch-format mbox
Applying: commit 2eea7f067f495e33b8b116b35b5988ab2b8aec55
fatal: empty ident name (for <>) not allowed

How are you generating them? They don't look like they've been generated with git format-patch...?

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1822870

Title:
  Backport support for software count cache flush Spectre v2 mitigation. (CVE) (required for POWER9 DD2.3)

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  In Progress

Bug description:
  For the different kernels:

  The HWE a563fd9c62f0 UBUNTU: Ubuntu-hwe-4.18.0-17.18~18.04.1 appears to have all patches.

  Disco appears to be missing only this patch:
  92edf8df0ff2ae86cc632eeca0e651fd8431d40d powerpc/security: Fix spectre_v2 reporting

  Cosmic (which is supported until July) is missing a number of patches:
  cf175dc315f90185128fb061dc05b6fbb211aa2f powerpc/64: Disable the speculation barrier from the command line
  6453b532f2c8856a80381e6b9a1f5ea2f12294df powerpc/64: Make stf barrier PPC_BOOK3S_64 specific.
  179ab1cbf883575c3a585bcfc0f2160f1d22a149 powerpc/64: Add CONFIG_PPC_BARRIER_NOSPEC
  af375eefbfb27cbb5b831984e66d724a40d26b5c powerpc/64: Call setup_barrier_nospec() from setup_arch()
  406d2b6ae3420f5bb2b3db6986dc6f0b6dbb637b powerpc/64: Make meltdown reporting Book3S 64 specific
  06d0bbc6d0f56dacac3a79900e9a9a0d5972d818 powerpc/asm: Add a patch_site macro & helpers for patching instructions
  dc8c6cce9a26a51fc19961accb978217a3ba8c75 powerpc/64s: Add new security feature flags for count cache flush
  ee13cb249fabdff8b90aaff61add347749280087 powerpc/64s: Add support for software count cache flush
  ba72dc171954b782a79d25e0f4b3ed91090c3b1e powerpc/pseries: Query hypervisor for count cache flush settings
  99d54754d3d5f896a8f616b0b6520662bc99d66b powerpc/powernv: Query firmware for count cache flush settings
  7d8bad99ba5a22892f0cad6881289fdc3875a930 powerpc/fsl: Fix spectre_v2 mitigations reporting
  92edf8df0ff2ae86cc632eeca0e651fd8431d40d powerpc/security: Fix spectre_v2 reporting

  This appears to already be in -next.

  For the bionic 18.04.1 (4.15) kernel only this patch is already part of master-next:
  a6b3964ad71a61bb7c61d80a60bea7d42187b2eb powerpc/64s: Add barrier_nospec

  The others are ported, there were only 3 that were not clean. Those are:

  2eea7f067f495e33b8b116b35b5988ab2b8aec55 powerpc/64s: Add support for ori barrier_nospec patching
  This failed because commit a048a07d7f4535baa4cbad6bc024f175317ab938 is missing, but it does not look like that is required here.

  cb3d6759a93c6d0aea1c10deb6d00e111c29c19c powerpc/64s: Enable barrier_nospec based on firmware settings
  This failed because debugfs was already included, I can see that previously added, I didn't see where it was previously removed.

  06d0bbc6d0f56dacac3a79900e9a9a0d5972d818 powerpc/asm: Add a patch_site macro & helpers for patching instructions
  This failed because 8183d99f4a22c is not included - but doesn't seem necessary.

  All other patches applied with, at most, some fuzz.
Has had a little testing - boots, check debugfs, etc. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu-power-systems/+bug/1822870/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
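A note on the git am failure quoted in the comment above: "Patch format detection failed" followed by "empty ident name (for <>) not allowed" is roughly what one would expect when a file lacks the mbox-style header that git format-patch writes (a "From <sha> <date>" separator line, then From:, Date: and Subject: headers). The "Applying: commit 2eea7f067f495e33b8b116b35b5988ab2b8aec55" line suggests the file starts with "commit <sha>", as git show or git log -p output does, but that is a guess from the transcript and not something the thread confirms. Below is a minimal Python sketch of a pre-flight check. The function name format_patch_problems and the heuristics are illustrative assumptions, not part of git.

```python
import re

# First line written by `git format-patch`:
# "From <40-hex sha> <weekday> <month> <day> <time> <year>"
MBOX_SEPARATOR = re.compile(
    r"^From [0-9a-f]{40} \w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$"
)
# Author header with a non-empty name and an address,
# e.g. "A Developer <dev@example.com>"
AUTHOR = re.compile(r"\S.*<[^<>@\s]+@[^<>\s]+>")


def format_patch_problems(text):
    """Return reasons why `text` does not look like `git format-patch` output.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    lines = text.splitlines()
    problems = []
    if not lines or not MBOX_SEPARATOR.match(lines[0]):
        problems.append("first line is not an mbox 'From <sha> <date>' separator")

    # Collect the header block, which ends at the first blank line.
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break
        name, sep, value = line.partition(": ")
        if sep:
            headers[name.lower()] = value.strip()

    if not AUTHOR.search(headers.get("from", "")):
        problems.append("no 'From: Name <email>' author header")
    if "subject" not in headers:
        problems.append("no 'Subject:' header")
    return problems
```

If the check reports problems, regenerating the file with git format-patch -1 <sha> from a tree that contains the commit would normally give git am the headers it needs.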
[Kernel-packages] [Bug 1793901] Re: kernel oops in bcache module
** Description changed:

+ SRU Justification
+ =
+
+ [Impact]
+
+ Some users see panics like the following when performing fstrim on a
+ bcached volume:
+
+ [ 529.803060] BUG: unable to handle kernel NULL pointer dereference at 0008
+ [ 530.183928] #PF error: [normal kernel read fault]
+ [ 530.412392] PGD 801f42163067 P4D 801f42163067 PUD 1f42168067 PMD 0
+ [ 530.750887] Oops: [#1] SMP PTI
+ [ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
+ [ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
+ [ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620
+ [ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
+ 8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
+ [ 532.838634] RSP: 0018:b9b708df39b0 EFLAGS: 00010246
+ [ 533.093571] RAX: RBX: 00046000 RCX:
+ [ 533.441865] RDX: 0200 RSI: RDI:
+ [ 533.789922] RBP: b9b708df3a48 R08: 940d3b3fdd20 R09:
+ [ 534.137512] R10: b9b708df3958 R11: R12:
+ [ 534.485329] R13: R14: R15: 940d39212020
+ [ 534.833319] FS: 7efec26e3840() GS:940d1f48() knlGS:
+ [ 535.224098] CS: 0010 DS: ES: CR0: 80050033
+ [ 535.504318] CR2: 0008 CR3: 001f4e256004 CR4: 001606e0
+ [ 535.851759] Call Trace:
+ [ 535.970308] ? mempool_alloc_slab+0x15/0x20
+ [ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache]
+ [ 536.403399] blk_mq_make_request+0x97/0x4f0
+ [ 536.607036] generic_make_request+0x1e2/0x410
+ [ 536.819164] submit_bio+0x73/0x150
+ [ 536.980168] ? submit_bio+0x73/0x150
+ [ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60
+ [ 537.391595] ? _cond_resched+0x1a/0x50
+ [ 537.573774] submit_bio_wait+0x59/0x90
+ [ 537.756105] blkdev_issue_discard+0x80/0xd0
+ [ 537.959590] ext4_trim_fs+0x4a9/0x9e0
+ [ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0
+ [ 538.324087] ext4_ioctl+0xea4/0x1530
+ [ 538.497712] ? _copy_to_user+0x2a/0x40
+ [ 538.679632] do_vfs_ioctl+0xa6/0x600
+ [ 538.853127] ? __do_sys_newfstat+0x44/0x70
+ [ 539.051951] ksys_ioctl+0x6d/0x80
+ [ 539.212785] __x64_sys_ioctl+0x1a/0x20
+ [ 539.394918] do_syscall_64+0x5a/0x110
+ [ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9
+
+ [Fix]
+
+ Under certain conditions, the test for whether an operation should be
+ written back to the underlying device was incorrect. Specifically, in
+ should_writeback(), we were hitting a case where an optimisation for
+ partial stripe conditions was returning true and so should_writeback()
+ was returning true early. This caused the code to go down an incorrect
+ path and create bios that contained NULL pointers.
+
+ To fix this issue, make sure that should_writeback() on a discard op
+ never returns true.
+
+
+ [Test Case]
+
+ We have observed it on some systems where both:
+ 1) LVM/devmapper is involved (bcache backing device is LVM volume) and
+ 2) writeback cache is involved (bcache cache_mode is writeback)
+
+ Not every machine exhibits the bug. On one machine that does exhibit the
+ bug, we can reliably reproduce it with:
+
+ # echo writeback > /sys/block/bcache0/bcache/cache_mode
+ # mount /dev/bcache0 /test
+ # for i in {0..10}; do file="$(mktemp /test/zero.XXX)"; dd if=/dev/zero of="$file" bs=1M count=256; sync; rm $file; done; fstrim -v /test
+
+
+ [Regression Potential]
+
+ This could affect any device where bcache is used.
+
+ In mitigation, however: the patch is simple, is limited to considering
+ discard operations. The patch has been accepted upstream [1] and the
+ maintainer will be including it in SuSE kernels [2]. A Gentoo user
+ validated the upstream patch independently [3].
+
+
+ [1] https://www.spinics.net/lists/linux-bcache/msg06997.html
+ [2] https://www.spinics.net/lists/linux-bcache/msg06998.html
+ [3] https://bugzilla.kernel.org/show_bug.cgi?id=196103#c3
+
+
+ [Original Description]
+
This was on an 18.04.1 install running the 4.15-34 generic kernel image, running from a normal ext4 root device.
I had just a short while before created a new bcache device that was mounted but to which no data had been written yet. Then without any apparent particular reason, an apport error popped up to inform of a bcache kernel oops. Crash log was uploaded but no idea how to link it, so I attach it as well. Mostly I would like to know how concerned I should be as after a previous, successful test I wanted to move the whole install to bcache. Ideally, if this is a bug or similar, it would be nice if it could get fixed. ProblemType: Bug DistroRelease: Ubuntu 18.04 Package: linux-image-4.15.0-34-generic
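The [Fix] section of the SRU justification above says that an optimisation for partial stripe conditions made should_writeback() return true early for a discard, and that the fix is to make should_writeback() never return true for a discard op. The ordering of those checks can be modelled as below. This is a simplified Python model and not the kernel C code: the Request fields, the fixed switch and the reduced set of conditions are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # All fields are hypothetical stand-ins for properties of a kernel bio.
    op: str                # "write", "discard", ...
    stripe_dirty: bool     # request touches a stripe that is already dirty
    sync: bool = False     # stand-in for the remaining writeback conditions


def should_writeback(req, cache_mode="writeback",
                     partial_stripes_expensive=True, fixed=True):
    """Model of the decision described in [Fix]; fixed=False mimics the old order."""
    if cache_mode != "writeback":
        return False

    # The fix: reject discards before any shortcut can return True.
    if fixed and req.op == "discard":
        return False

    # Partial-stripe shortcut. Without the check above, a discard that
    # touches a dirty stripe leaves through this early "True".
    if partial_stripes_expensive and req.stripe_dirty:
        return True

    # Simplified: the real function applies further conditions here.
    return req.sync
```

With fixed=False, a discard on a dirty stripe takes the early return, which matches the path the report says went on to create bios containing NULL pointers. With the discard check placed first, that early return cannot be reached by a discard.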
[Kernel-packages] [Bug 1802421] Re: Xenial: data corruption when using i40e with iommu
The user has verified that the -proposed kernel resolves their issue.

Regards,
Daniel

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1802421

Title:
  Xenial: data corruption when using i40e with iommu

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  A user reports that using an i40e with intel_iommu=on with the Xenial GA kernel causes data corruption. Using the Xenial HWE kernel or an out-of-tree driver more recent than the version shipped with Xenial solves the issue.

  [Impact]
  Corrupted data is returned from the network card intermittently. This is often noticeable when using apt, as the checksums are verified. It often leads to failure of apt operations. When there are no checksums done, this could lead to silent data corruption.

  [Fix]
  This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. Picking this patch alone is sufficient to fix the issue. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.

  [Test]
  A user tested a Xenial 4.4 kernel with this patch applied and it fixed their issue - no data corruption was observed. (The test repeatedly deletes the apt cache and then does apt update.)

  [Regression Potential]
  It's a messy change inside i40e, so the risk is that i40e will be broken in some subtle way we haven't noticed, or have performance issues. None of these have been observed so far.
[Kernel-packages] [Bug 1805245] Re: powerpc/powernv/pci: Work around races in PCI bridge enabling
The OpenPower partner reports that their system is fixed with this kernel. ** Tags removed: verification-needed-bionic ** Tags added: verification-done-bionic -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1805245 Title: powerpc/powernv/pci: Work around races in PCI bridge enabling Status in linux package in Ubuntu: Fix Released Status in linux source package in Bionic: Fix Committed Bug description: SRU Justification = [Impact] An IBM OpenPower partner reports their system with a bunch of NVMe drives fails the NVMe init due to some drives taking PCIe EEH errors. [Fix] Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream. [Testing] IBM reports that this patch fixes the user's issue. [Regression Potential] The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y). It only affects PowerPC. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1805245/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1793901] Re: kernel oops in bcache module
Hi, I have a patch which I believe fixes your issue: https://www.spinics.net/lists/linux-bcache/msg06997.html It looks like it will go in to the 5.1 kernel, and I will propose it for backporting to earlier Ubuntu kernels. Regards, Daniel -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1793901 Title: kernel oops in bcache module Status in linux package in Ubuntu: Confirmed Bug description: This was on an 18.04.1 install running the 4.15-34 generic kernel image, running from a normal ext4 root device. I had just a short while before created a new bcache device that was mounted but to which no data had been written yet. Then without any apparent particular reason, an apport error popped up to inform of a bcache kernel oops. Crash log was uploaded but no idea how to link it, so I attach it as well. Mostly I would like to know how concerned I should be as after a previous, successful test I wanted to move the whole install to bcache. Ideally, if this is a bug or similar, it would be nice if it could get fixed. ProblemType: Bug DistroRelease: Ubuntu 18.04 Package: linux-image-4.15.0-34-generic 4.15.0-34.37 ProcVersionSignature: Ubuntu 4.15.0-34.37-generic 4.15.18 Uname: Linux 4.15.0-34-generic x86_64 NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair nvidia_modeset nvidia ApportVersion: 2.20.9-0ubuntu7.3 Architecture: amd64 CurrentDesktop: ubuntu:GNOME Date: Sat Sep 22 18:20:22 2018 HibernationDevice: RESUME=UUID=6bcbe7fa-85b7-4baf-9b69-0558a668bcdd InstallationDate: Installed on 2014-07-29 (1515 days ago) InstallationMedia: It IwConfig: zthnhe3w6d no wireless extensions. eth1 no wireless extensions. lono wireless extensions. 
MachineType: System manufacturer System Product Name ProcEnviron: TERM=xterm-256color PATH=(custom, no user) XDG_RUNTIME_DIR= LANG=de_DE.UTF-8 SHELL=/bin/bash ProcFB: 0 EFI VGA ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.15.0-34-generic root=UUID=ebbab625-f14e-44ba-84d5-025ed92a5b2a ro quiet splash RelatedPackageVersions: linux-restricted-modules-4.15.0-34-generic N/A linux-backports-modules-4.15.0-34-generic N/A linux-firmware 1.173.1 RfKill: 0: hci0: Bluetooth Soft blocked: yes Hard blocked: no SourcePackage: linux UpgradeStatus: Upgraded to bionic on 2018-09-07 (15 days ago) dmi.bios.date: 10/22/2015 dmi.bios.vendor: American Megatrends Inc. dmi.bios.version: 0604 dmi.board.asset.tag: Default string dmi.board.name: H170I-PLUS D3 dmi.board.vendor: ASUSTeK COMPUTER INC. dmi.board.version: Rev X.0x dmi.chassis.asset.tag: Default string dmi.chassis.type: 3 dmi.chassis.vendor: Default string dmi.chassis.version: Default string dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0604:bd10/22/2015:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnH170I-PLUSD3:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring: dmi.product.family: Default string dmi.product.name: System Product Name dmi.product.version: System Version dmi.sys.vendor: System manufacturer To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793901/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1793901] Re: kernel oops in bcache module
I think I have discovered the cause:
https://lore.kernel.org/linux-block/87h8e9ii2l@linkitivity.dja.id.au/

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1793901

Title:
  kernel oops in bcache module

Status in linux package in Ubuntu:
  Confirmed
[Kernel-packages] [Bug 1801305] Re: Restore request-based mode to xen-blkfront for AWS kernels
I've checked that the proposed Xenial AWS kernel works - it boots successfully and uses the deadline scheduler by default on a t2.micro instance. Regards, Daniel -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1801305 Title: Restore request-based mode to xen-blkfront for AWS kernels Status in linux package in Ubuntu: Triaged Status in linux source package in Trusty: Fix Committed Status in linux source package in Xenial: Fix Committed Status in linux source package in Bionic: Fix Committed Status in linux source package in Cosmic: Fix Committed Status in linux source package in Disco: Triaged Bug description: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. For X this needs a small patch from upstream for error handling. For B/C this patchset is bigger as it includes the suspend/resume patches already in X, and a new fixup. These are desirable as the request mode patch assumes their presence. [Regression Potential] Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. [Tests] Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did an apt update/upgrade and everything worked (no hash-sum mismatches). 
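The verification comment above reports that the proposed Xenial AWS kernel "uses the deadline scheduler by default". One way to check this is to read the block device's queue/scheduler file in sysfs, where the active scheduler is shown in square brackets (for example "noop [deadline] cfq"). A small Python sketch follows, assuming the disk is named xvda. That device name and both function names are assumptions for illustration; adjust them for the instance being tested.

```python
import re
from pathlib import Path


def active_scheduler(sysfs_text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def read_active_scheduler(device="xvda"):
    """Read /sys/block/<device>/queue/scheduler ('xvda' is an assumed name)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())
```

On an instance running the request-based mode described in the bug, read_active_scheduler() would be expected to return "deadline". A result such as "mq-deadline", or None, would point at blk-mq still being in use.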
[Kernel-packages] [Bug 1805245] [NEW] powerpc/powernv/pci: Work around races in PCI bridge enabling
Public bug reported: SRU Justification = [Impact] An IBM OpenPower partner reports their system with a bunch of NVMe drives fails the NVMe init due to some drives taking PCIe EEH errors. [Fix] Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream. [Testing] IBM reports that this patch fixes the user's issue. [Regression Potential] The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y). It only affects PowerPC. ** Affects: linux (Ubuntu) Importance: Undecided Status: Confirmed ** Description changed: SRU Justification = [Impact] An IBM OpenPower partner reports their system with a bunch of NVMe drives fails the NVMe init due to some drives taking PCIe EEH errors. [Fix] Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream. [Testing] IBM reports that this patch fixes the user's issue. [Regression Potential] - The patch is already in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y). + The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y). It only affects PowerPC. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1805245 Title: powerpc/powernv/pci: Work around races in PCI bridge enabling Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification = [Impact] An IBM OpenPower partner reports their system with a bunch of NVMe drives fails the NVMe init due to some drives taking PCIe EEH errors. [Fix] Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream. [Testing] IBM reports that this patch fixes the user's issue. [Regression Potential] The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y). It only affects PowerPC. 
[Kernel-packages] [Bug 1802421] Re: Xenial: data corruption when using i40e with iommu
** Description changed:

  A user reports that using an i40e with intel_iommu=on with the Xenial GA kernel causes data corruption. Using the Xenial HWE kernel or an out-of-tree driver more recent than the version shipped with Xenial solves the issue.

  [Impact]
  Corrupted data is returned from the network card intermittently. This is often noticeable when using apt, as the checksums are verified. It often leads to failure of apt operations. When there are no checksums done, this could lead to silent data corruption.

  [Fix]
- This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.
+ This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. Picking this patch alone is sufficient to fix the issue. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.

  [Test]
  A user tested a Xenial 4.4 kernel with this patch applied and it fixed their issue - no data corruption was observed. (The test repeatedly deletes the apt cache and then does apt update.)

  [Regression Potential]
  It's a messy change inside i40e, so the risk is that i40e will be broken in some subtle way we haven't noticed, or have performance issues. None of these have been observed so far.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1802421

Title:
  Xenial: data corruption when using i40e with iommu

Status in linux package in Ubuntu:
  Confirmed
[Kernel-packages] [Bug 1802421] [NEW] Xenial: data corruption when using i40e with iommu
Public bug reported:

A user reports that using an i40e with intel_iommu=on with the Xenial GA kernel causes data corruption. Using the Xenial HWE kernel or an out-of-tree driver more recent than the version shipped with Xenial solves the issue.

[Impact]
Corrupted data is returned from the network card intermittently. This is often noticeable when using apt, as the checksums are verified. It often leads to failure of apt operations. When there are no checksums done, this could lead to silent data corruption.

[Fix]
This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. Picking this patch alone is sufficient to fix the issue. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.

[Test]
A user tested a Xenial 4.4 kernel with this patch applied and it fixed their issue - no data corruption was observed. (The test repeatedly deletes the apt cache and then does apt update.)

[Regression Potential]
It's a messy change inside i40e, so the risk is that i40e will be broken in some subtle way we haven't noticed, or have performance issues. None of these have been observed so far.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Status: Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1802421

Title:
  Xenial: data corruption when using i40e with iommu

Status in linux package in Ubuntu:
  Confirmed
[Kernel-packages] [Bug 1801305] Re: Restore request-based mode to xen-blkfront for AWS kernels
** Description changed: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. + For B/C this patchset is bigger as it includes the suspend/resume + patches already in X, and a new fixup. These are desirable as the + request mode patch assumes their presence. + [Regression Potential] - Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. + Could potentially break xen based disks on AWS. + + For B/C, the patches also add some code to the xen core around suspend + and resume, this code is much smaller and also mirrors code already in + Xenial. [Tests] Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did an apt update/upgrade and everything worked (no hash-sum mismatches). ** Description changed: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. + For X this needs a small patch from upstream for error handling. + For B/C this patchset is bigger as it includes the suspend/resume patches already in X, and a new fixup. These are desirable as the request mode patch assumes their presence. 
[Regression Potential] - Could potentially break xen based disks on AWS. + Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. [Tests] Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did an apt update/upgrade and everything worked (no hash-sum mismatches). -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1801305 Title: Restore request-based mode to xen-blkfront for AWS kernels Status in linux package in Ubuntu: Confirmed Bug description: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. For X this needs a small patch from upstream for error handling. For B/C this patchset is bigger as it includes the suspend/resume patches already in X, and a new fixup. These are desirable as the request mode patch assumes their presence. [Regression Potential] Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. [Tests] Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. 
I did an apt update/upgrade and everything worked (no hash-sum mismatches). To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1801305/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1801305] [NEW] Restore request-based mode to xen-blkfront for AWS kernels
Public bug reported: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. For X this needs a small patch from upstream for error handling. For B/C this patchset is bigger as it includes the suspend/resume patches already in X, and a new fixup. These are desirable as the request mode patch assumes their presence. [Regression Potential] Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. [Tests] Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did an apt update/upgrade and everything worked (no hash-sum mismatches). ** Affects: linux (Ubuntu) Importance: Undecided Status: Confirmed ** Description changed: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by - default. + default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. [Regression Potential] Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. [Tests] Tested by AWS for Xenial, and their kernel engineers vetted the patches. 
I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. ** Description changed: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. [Regression Potential] Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. [Tests] - Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. + Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did an apt update/upgrade and everything worked (no hash-sum mismatches). -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1801305 Title: Restore request-based mode to xen-blkfront for AWS kernels Status in linux package in Ubuntu: Confirmed Bug description: In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by default and cannot use the old I/O scheduler. [Impact] blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks. [Fix] Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter. 
For X this needs a small patch from upstream for error handling. For B/C this patchset is bigger as it includes the suspend/resume patches already in X, and a new fixup. These are desirable as the request mode patch assumes their presence. [Regression Potential] Could potentially break xen based disks on AWS. For B/C, the patches also add some code to the xen core around suspend and resume, this code is much smaller and also mirrors code already in Xenial. [Tests] Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did an apt update/upgrade and everything worked (no hash-sum mismatches). To manage notifications about this bug go to:
[Kernel-packages] [Bug 1798706] Re: Incomplete linking with boost_regex
** Tags added: sts -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1798706 Title: Incomplete linking with boost_regex Status in linux package in Ubuntu: In Progress Bug description: SRU Justification = [Impact] oslogin fails on Xenial and Trusty. In auth.log we see: Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: cannot open shared object file: No such file or directory Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_login.so Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: cannot open shared object file: No such file or directory Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_admin.so The error message is a bit deceptive - PAM tries to load the module from the correct location, fails, and then tries the other location where it is missing. It then reports the missing error rather than the real error. 
symlink the module into both paths leads to a much more useful error message: Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: undefined symbol: _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE Oct 18 06:45:12 dja-202158 sshd[16554]: PAM adding faulty module: pam_oslogin_login.so Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: undefined symbol: _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE [Test case] - set up GCE VM - turn on oslogin - attempt to log in [Fix] debian/patches/0002-Set-LDFLAGS-at-the-end-of-the-c-command-line-right-b.patch re-orders the link flags to link boost_regex for oslogin. However, this didn't change the flags for PAM module linking. So fix that too. [Regression Potential] - fixes a regression - limited to oslogin, and how it is linked. 
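The misleading auth.log message described under [Impact] comes from a general pattern: when a loader tries several paths and only reports the last failure, a later "No such file" hides the earlier, real cause. The following is a minimal sketch of keeping every attempt's error instead. It is not PAM's code: load_fn, fake_load, load_first and both pam_demo.so paths are hypothetical names made up for this illustration, and the fake loader only imitates the two failure messages quoted above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical loader callback: returns 0 on success, otherwise -1
 * and a human-readable reason in err. */
typedef int (*load_fn)(const char *path, char *err, size_t errlen);

/* Hypothetical stand-in for a real loader: the first (made-up) path
 * exists but has an unresolved symbol, any other path is missing. */
static int fake_load(const char *path, char *err, size_t errlen)
{
	if (strcmp(path, "/lib/x86_64-linux-gnu/security/pam_demo.so") == 0)
		snprintf(err, errlen, "undefined symbol: demo_symbol");
	else
		snprintf(err, errlen, "cannot open shared object file");
	return -1;
}

/* Try each path in order. On total failure, report holds the error of
 * EVERY attempt, so the first (real) cause is not overwritten. */
static int load_first(load_fn load, const char *const *paths, size_t n,
		      char *report, size_t reportlen)
{
	size_t used = 0;

	assert(reportlen > 0);
	report[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		char err[256] = "";
		int w;

		if (load(paths[i], err, sizeof err) == 0)
			return 0;
		w = snprintf(report + used, reportlen - used, "%s%s: %s",
			     used ? "; " : "", paths[i], err);
		if (w < 0 || (size_t)w >= reportlen - used)
			break;	/* report is full; stop appending */
		used += (size_t)w;
	}
	return -1;
}
```

Called with the two hypothetical paths, load_first() fails, but the report contains both the "undefined symbol" reason from the first path and the "cannot open shared object file" reason from the second, which is the information the symlink workaround had to uncover by hand.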
[Other Notes] We still see a scary list of warnings when building, but they don't seem to have an impact on the common path: dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail13put_mem_blockEPv used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail14verify_optionsEjNS_15regex_constants12_match_flagsE used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZNK5boost9re_detail31cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_ used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISB_12maybe_assignERKSF_ used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost11basic_regexIcNS_12regex_traitsIcNS_16cpp_regex_traitsIcE9do_assignEPKcS7_j used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail19raise_runtime_errorERKSt13runtime_error used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: 
warning: symbol _ZNK5boost9re_detail31cpp_regex_traits_implementationIcE9transformEPKcS4_ used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol
[Kernel-packages] [Bug 1798706] [NEW] Incomplete linking with boost_regex
: linux (Ubuntu) Importance: Critical Assignee: Daniel Axtens (daxtens) Status: In Progress ** Patch added: "set-LDFLAGS-for-PAM.patch" https://bugs.launchpad.net/bugs/1798706/+attachment/5202754/+files/set-LDFLAGS-for-PAM.patch ** Changed in: linux (Ubuntu) Status: Confirmed => In Progress -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1798706 Title: Incomplete linking with boost_regex Status in linux package in Ubuntu: In Progress Bug description: SRU Justification = [Impact] oslogin fails on Xenial and Trusty. In auth.log we see: Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: cannot open shared object file: No such file or directory Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_login.so Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: cannot open shared object file: No such file or directory Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_admin.so The error message is a bit deceptive - PAM tries to load the module from the correct location, fails, and then tries the other location where it is missing. It then reports the missing error rather than the real error. 
symlink the module into both paths leads to a much more useful error message: Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: undefined symbol: _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE Oct 18 06:45:12 dja-202158 sshd[16554]: PAM adding faulty module: pam_oslogin_login.so Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: undefined symbol: _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE [Test case] - set up GCE VM - turn on oslogin - attempt to log in [Fix] debian/patches/0002-Set-LDFLAGS-at-the-end-of-the-c-command-line-right-b.patch re-orders the link flags to link boost_regex for oslogin. However, this didn't change the flags for PAM module linking. So fix that too. [Regression Potential] - fixes a regression - limited to oslogin, and how it is linked. 
[Other Notes] We still see a scary list of warnings when building, but they don't seem to have an impact on the common path: dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail13put_mem_blockEPv used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail14verify_optionsEjNS_15regex_constants12_match_flagsE used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZNK5boost9re_detail31cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_ used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISB_12maybe_assignERKSF_ used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost11basic_regexIcNS_12regex_traitsIcNS_16cpp_regex_traitsIcE9do_assignEPKcS7_j used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail19raise_runtime_errorERKSt13runtime_error used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: 
wa
[Kernel-packages] [Bug 1798705] [NEW] Incomplete linking with boost_regex
: linux (Ubuntu) Importance: Critical Assignee: Daniel Axtens (daxtens) Status: Confirmed ** Patch added: "set-LDFLAGS-for-PAM.patch" https://bugs.launchpad.net/bugs/1798705/+attachment/5202753/+files/set-LDFLAGS-for-PAM.patch -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1798705 Title: Incomplete linking with boost_regex Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification = [Impact] oslogin fails on Xenial and Trusty. In auth.log we see: Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: cannot open shared object file: No such file or directory Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_login.so Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: cannot open shared object file: No such file or directory Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_admin.so The error message is a bit deceptive - PAM tries to load the module from the correct location, fails, and then tries the other location where it is missing. It then reports the missing error rather than the real error. 
symlink the module into both paths leads to a much more useful error message: Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: undefined symbol: _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE Oct 18 06:45:12 dja-202158 sshd[16554]: PAM adding faulty module: pam_oslogin_login.so Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: undefined symbol: _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE [Test case] - set up GCE VM - turn on oslogin - attempt to log in [Fix] debian/patches/0002-Set-LDFLAGS-at-the-end-of-the-c-command-line-right-b.patch re-orders the link flags to link boost_regex for oslogin. However, this didn't change the flags for PAM module linking. So fix that too. [Regression Potential] - fixes a regression - limited to oslogin, and how it is linked. 
[Other Notes] We still see a scary list of warnings when building, but they don't seem to have an impact on the common path: dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail13put_mem_blockEPv used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail14verify_optionsEjNS_15regex_constants12_match_flagsE used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZNK5boost9re_detail31cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_ used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISB_12maybe_assignERKSF_ used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost11basic_regexIcNS_12regex_traitsIcNS_16cpp_regex_traitsIcE9do_assignEPKcS7_j used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail19raise_runtime_errorERKSt13runtime_error used by debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so found in none of the libraries dpkg-shlibdeps: 
warning: symbol _ZNK5boost9re_detail31cpp_regex_traits_implementationIcE9transf
[Kernel-packages] [Bug 1797314] Re: fscache: bad refcounting in fscache_op_complete leads to OOPS
** Description changed: SRU Justification - [Impact] A kernel BUG is sometimes observed when using fscache: [4740718.880898] FS-Cache: [4740718.880920] FS-Cache: Assertion failed [4740718.880934] FS-Cache: 0 > 0 is false [4740718.881001] [ cut here ] [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449! [4740718.881040] invalid opcode: [#1] SMP - + [4740718.892659] Call Trace: [4740718.893506] [] cachefiles_read_copier+0x3a9/0x410 [cachefiles] [4740718.894374] [] fscache_op_work_func+0x22/0x50 [fscache] [4740718.895180] [] process_one_work+0x150/0x3f0 [4740718.895966] [] worker_thread+0x11a/0x470 [4740718.896753] [] ? __schedule+0x359/0x980 [4740718.897783] [] ? rescuer_thread+0x310/0x310 [4740718.898581] [] kthread+0xd6/0xf0 [4740718.899469] [] ? kthread_park+0x60/0x60 [4740718.900477] [] ret_from_fork+0x3f/0x70 [4740718.901514] [] ? kthread_park+0x60/0x60 [Problem] - In include/fscache-cache.h, fscache_retrieval_complete reads, in part: + In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in + part: atomic_sub(n_pages, &op->n_pages); if (atomic_read(&op->n_pages) <= 0) fscache_op_complete(&op->op, true); The code is using atomic_sub followed by an atomic_read. This causes two threads doing a decrement of pages to race with each other seeing the op->refcount <= 0 at same time, and end up calling fscache_op_complete in both the threads leading to the OOPS. [Fix] The fix is trivial to use atomic_sub_return instead of two calls. [Testcase] - The user has tested the patch successfully on their fscache/cachefiles setup. + I believe the user has tested the patch successfully on their fscache/cachefiles setup. [Regression Potential] Limited to fscache. Small, comprehensible change. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1797314

Title:
  fscache: bad refcounting in fscache_op_complete leads to OOPS

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  SRU Justification
  -----------------

  [Impact]
  A kernel BUG is sometimes observed when using fscache:
  [4740718.880898] FS-Cache:
  [4740718.880920] FS-Cache: Assertion failed
  [4740718.880934] FS-Cache: 0 > 0 is false
  [4740718.881001] [ cut here ]
  [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
  [4740718.881040] invalid opcode: [#1] SMP

  [4740718.892659] Call Trace:
  [4740718.893506] [] cachefiles_read_copier+0x3a9/0x410 [cachefiles]
  [4740718.894374] [] fscache_op_work_func+0x22/0x50 [fscache]
  [4740718.895180] [] process_one_work+0x150/0x3f0
  [4740718.895966] [] worker_thread+0x11a/0x470
  [4740718.896753] [] ? __schedule+0x359/0x980
  [4740718.897783] [] ? rescuer_thread+0x310/0x310
  [4740718.898581] [] kthread+0xd6/0xf0
  [4740718.899469] [] ? kthread_park+0x60/0x60
  [4740718.900477] [] ret_from_fork+0x3f/0x70
  [4740718.901514] [] ? kthread_park+0x60/0x60

  [Problem]
  In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in
  part:

      atomic_sub(n_pages, &op->n_pages);
      if (atomic_read(&op->n_pages) <= 0)
          fscache_op_complete(&op->op, true);

  The code uses atomic_sub followed by a separate atomic_read. Two
  threads that each decrement the page count can therefore race: both
  subtract first, both then read op->n_pages <= 0, and both call
  fscache_op_complete, leading to the OOPS.

  [Fix]
  The fix is trivial: use atomic_sub_return instead of two separate calls.

  [Testcase]
  I believe the user has tested the patch successfully on their
  fscache/cachefiles setup.

  [Regression Potential]
  Limited to fscache. Small, comprehensible change.
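The race in [Problem] and the effect of the [Fix] can be sketched in userspace. This is only an illustration, assuming C11 <stdatomic.h> in place of the kernel's atomic_t API; struct demo_op, demo_complete, sub_return and the two *_two_callers functions are hypothetical names, not kernel code. To keep the result deterministic, the racy variant does not start real threads: it replays one bad interleaving of two callers (both subtract before either reads) in a fixed order.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the retrieval operation. */
struct demo_op {
	atomic_int n_pages;	/* pages still outstanding */
	int completions;	/* how often "complete" was called */
};

static void demo_complete(struct demo_op *op)
{
	op->completions++;
}

/* Split subtract + read, replayed in the bad order: A and B both
 * subtract, then A and B both read the already-zero counter. */
static int racy_two_callers(int start)
{
	struct demo_op op = { .completions = 0 };

	assert(start > 0);
	atomic_init(&op.n_pages, start);

	atomic_fetch_sub(&op.n_pages, 1);	/* caller A: subtract */
	atomic_fetch_sub(&op.n_pages, 1);	/* caller B: subtract */
	if (atomic_load(&op.n_pages) <= 0)	/* caller A: read */
		demo_complete(&op);
	if (atomic_load(&op.n_pages) <= 0)	/* caller B: read */
		demo_complete(&op);
	return op.completions;
}

/* Userspace equivalent of a subtract-and-return-new-value operation:
 * atomic_fetch_sub returns the OLD value, so old - n is the new one. */
static int sub_return(atomic_int *v, int n)
{
	return atomic_fetch_sub(v, n) - n;
}

/* Subtract and test in one atomic step: each caller tests the value
 * produced by its own subtraction, so only one of them sees <= 0. */
static int fixed_two_callers(int start)
{
	struct demo_op op = { .completions = 0 };

	assert(start > 0);
	atomic_init(&op.n_pages, start);

	if (sub_return(&op.n_pages, 1) <= 0)	/* caller A */
		demo_complete(&op);
	if (sub_return(&op.n_pages, 1) <= 0)	/* caller B */
		demo_complete(&op);
	return op.completions;
}
```

With two outstanding pages, racy_two_callers(2) reports two completions, which corresponds to the double fscache_op_complete call behind the assertion failure, while fixed_two_callers(2) reports exactly one.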
[Kernel-packages] [Bug 1797314] [NEW] fscache: bad refcounting in fscache_op_complete leads to OOPS
Public bug reported: SRU Justification - [Impact] A kernel BUG is sometimes observed when using fscache: [4740718.880898] FS-Cache: [4740718.880920] FS-Cache: Assertion failed [4740718.880934] FS-Cache: 0 > 0 is false [4740718.881001] [ cut here ] [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449! [4740718.881040] invalid opcode: [#1] SMP [4740718.892659] Call Trace: [4740718.893506] [] cachefiles_read_copier+0x3a9/0x410 [cachefiles] [4740718.894374] [] fscache_op_work_func+0x22/0x50 [fscache] [4740718.895180] [] process_one_work+0x150/0x3f0 [4740718.895966] [] worker_thread+0x11a/0x470 [4740718.896753] [] ? __schedule+0x359/0x980 [4740718.897783] [] ? rescuer_thread+0x310/0x310 [4740718.898581] [] kthread+0xd6/0xf0 [4740718.899469] [] ? kthread_park+0x60/0x60 [4740718.900477] [] ret_from_fork+0x3f/0x70 [4740718.901514] [] ? kthread_park+0x60/0x60 [Problem] In include/fscache-cache.h, fscache_retrieval_complete reads, in part: atomic_sub(n_pages, &op->n_pages); if (atomic_read(&op->n_pages) <= 0) fscache_op_complete(&op->op, true); The code is using atomic_sub followed by an atomic_read. This causes two threads doing a decrement of pages to race with each other seeing the op->refcount <= 0 at same time, and end up calling fscache_op_complete in both the threads leading to the OOPS. [Fix] The fix is trivial to use atomic_sub_return instead of two calls. [Testcase] The user has tested the patch successfully on their fscache/cachefiles setup. [Regression Potential] Limited to fscache. Small, comprehensible change. ** Affects: linux (Ubuntu) Importance: Undecided Status: Incomplete ** Description changed: SRU Justification - [Impact] A kernel BUG is sometimes observed when using fscache: + [4740718.880898] FS-Cache: + [4740718.880920] FS-Cache: Assertion failed + [4740718.880934] FS-Cache: 0 > 0 is false + [4740718.881001] [ cut here ] + [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
+ [4740718.881040] invalid opcode: [#1] SMP + + [4740718.892659] Call Trace: + [4740718.893506] [] cachefiles_read_copier+0x3a9/0x410 [cachefiles] + [4740718.894374] [] fscache_op_work_func+0x22/0x50 [fscache] + [4740718.895180] [] process_one_work+0x150/0x3f0 + [4740718.895966] [] worker_thread+0x11a/0x470 + [4740718.896753] [] ? __schedule+0x359/0x980 + [4740718.897783] [] ? rescuer_thread+0x310/0x310 + [4740718.898581] [] kthread+0xd6/0xf0 + [4740718.899469] [] ? kthread_park+0x60/0x60 + [4740718.900477] [] ret_from_fork+0x3f/0x70 + [4740718.901514] [] ? kthread_park+0x60/0x60 - Jun 25 11:32:08 kernel: [4740718.880898] FS-Cache: - Jun 25 11:32:08 kernel: [4740718.880920] FS-Cache: Assertion failed - Jun 25 11:32:08 kernel: [4740718.880934] FS-Cache: 0 > 0 is false - Jun 25 11:32:08 kernel: [4740718.881001] [ cut here ] - Jun 25 11:32:08 kernel: [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449! - Jun 25 11:32:08 kernel: [4740718.881040] invalid opcode: [#1] SMP - ... - Jun 25 11:32:08 kernel: [4740718.892659] Call Trace: - Jun 25 11:32:08 kernel: [4740718.893506] [] cachefiles_read_copier+0x3a9/0x410 [cachefiles] - Jun 25 11:32:08 kernel: [4740718.894374] [] fscache_op_work_func+0x22/0x50 [fscache] - Jun 25 11:32:08 kernel: [4740718.895180] [] process_one_work+0x150/0x3f0 - Jun 25 11:32:08 kernel: [4740718.895966] [] worker_thread+0x11a/0x470 - Jun 25 11:32:08 kernel: [4740718.896753] [] ? __schedule+0x359/0x980 - Jun 25 11:32:08 kernel: [4740718.897783] [] ? rescuer_thread+0x310/0x310 - Jun 25 11:32:08 kernel: [4740718.898581] [] kthread+0xd6/0xf0 - Jun 25 11:32:08 kernel: [4740718.899469] [] ? kthread_park+0x60/0x60 - Jun 25 11:32:08 kernel: [4740718.900477] [] ret_from_fork+0x3f/0x70 - Jun 25 11:32:08 kernel: [4740718.901514] [] ? 
kthread_park+0x60/0x60 - [Problem] In include/fscache-cache.h, fscache_retrieval_complete reads, in part: - atomic_sub(n_pages, &op->n_pages); - if (atomic_read(&op->n_pages) <= 0) - fscache_op_complete(&op->op, true); - - The code is using atomic_sub followed by an atomic_read. This causes two threads doing a decrement of pages to race with each other seeing the op->refcount <= 0 at same time, - and end up calling fscache_op_complete in both the threads leading to the OOPS. - + atomic_sub(n_pages, &op->n_pages); + if (atomic_read(&op->n_pages) <= 0) + fscache_op_complete(&op->op, true); + + The code is using atomic_sub followed by an atomic_read.
[Kernel-packages] [Bug 1742658] Re: linux-generic-hwe-16.04 OOPS in nouveau after security update
Hi, I haven't found the time to do this yet, sorry. Is it still an issue on the current Xenial kernel? Regards, Daniel -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1742658 Title: linux-generic-hwe-16.04 OOPS in nouveau after security update Status in linux package in Ubuntu: Confirmed Status in linux-hwe package in Ubuntu: New Status in linux-hwe-edge package in Ubuntu: New Status in linux-meta-hwe package in Ubuntu: New Status in linux-meta-hwe-edge package in Ubuntu: New Bug description: Description: Ubuntu 16.04.3 LTS Release: 16.04 After upgrading linux-generic-hwe-16.04 to 4.13.0.26.46 I get a black screen with nouveau. Previously I was running 4.10.0-42-generic, and that kernel still works fine. Here is the OOPS: Jan 11 09:39:18 edvin-tower kernel: [3.079986] [drm] Initialized nouveau 1.3.1 20120801 for :02:00.0 on minor 0 Jan 11 09:39:18 edvin-tower kernel: [3.100591] BUG: unable to handle kernel NULL pointer dereference at (null) Jan 11 09:39:18 edvin-tower kernel: [3.100606] IP: (null) Jan 11 09:39:18 edvin-tower kernel: [3.100610] PGD 0 Jan 11 09:39:18 edvin-tower kernel: [3.100611] P4D 0 Jan 11 09:39:18 edvin-tower kernel: [3.100615] Jan 11 09:39:18 edvin-tower kernel: [3.100620] Oops: 0010 [#1] SMP PTI Jan 11 09:39:18 edvin-tower kernel: [3.100623] Modules linked in: hid_generic usbhid hid nouveau mxm_wmi video i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect e1000e sysimgblt fb_sys_fops drm ptp ahci pps_core pata_acpi libahci wmi Jan 11 09:39:18 edvin-tower kernel: [3.100643] CPU: 4 PID: 238 Comm: kworker/u16:7 Not tainted 4.13.0-26-generic #29~16.04.2-Ubuntu Jan 11 09:39:18 edvin-tower kernel: [3.100649] Hardware name: Dell Inc.
Precision Tower 5810/0K240Y, BIOS A05 12/16/2014 Jan 11 09:39:18 edvin-tower kernel: [3.100688] Workqueue: nvkm-disp gf119_disp_super [nouveau] Jan 11 09:39:18 edvin-tower kernel: [3.100694] task: 9d8982d25d00 task.stack: ac9ec2134000 Jan 11 09:39:18 edvin-tower kernel: [3.100698] RIP: 0010: (null) Jan 11 09:39:18 edvin-tower kernel: [3.100701] RSP: 0018:ac9ec2137bd8 EFLAGS: 00010206 Jan 11 09:39:18 edvin-tower kernel: [3.100706] RAX: c0416f20 RBX: RCX: 0016 Jan 11 09:39:18 edvin-tower kernel: [3.100710] RDX: RSI: RDI: 9d898140d180 Jan 11 09:39:18 edvin-tower kernel: [3.100715] RBP: ac9ec2137c70 R08: R09: Jan 11 09:39:18 edvin-tower kernel: [3.100719] R10: 1000 R11: R12: Jan 11 09:39:18 edvin-tower kernel: [3.100724] R13: R14: ac9ec2137d00 R15: 9d898c542600 Jan 11 09:39:18 edvin-tower kernel: [3.100728] FS: () GS:9d899fd0() knlGS: Jan 11 09:39:18 edvin-tower kernel: [3.100733] CS: 0010 DS: ES: CR0: 80050033 Jan 11 09:39:18 edvin-tower kernel: [3.100737] CR2: CR3: 00029ac0a006 CR4: 001606e0 Jan 11 09:39:18 edvin-tower kernel: [3.100742] Call Trace: Jan 11 09:39:18 edvin-tower kernel: [3.100771] ? nvkm_dp_train_drive+0x214/0x300 [nouveau] Jan 11 09:39:18 edvin-tower kernel: [3.100798] nvkm_dp_train+0x582/0x970 [nouveau] Jan 11 09:39:18 edvin-tower kernel: [3.100824] nvkm_dp_acquire+0xd4/0x390 [nouveau] Jan 11 09:39:18 edvin-tower kernel: [3.100850] nv50_disp_super_2_2+0x6d/0x430 [nouveau] Jan 11 09:39:18 edvin-tower kernel: [3.100872] ? nvkm_devinit_pll_set+0xf/0x20 [nouveau] Jan 11 09:39:18 edvin-tower kernel: [3.100897] gf119_disp_super+0x1b7/0x300 [nouveau] Jan 11 09:39:18 edvin-tower kernel: [3.100904] ? __schedule+0x3ca/0x890 Jan 11 09:39:18 edvin-tower kernel: [3.100911] process_one_work+0x156/0x410 Jan 11 09:39:18 edvin-tower kernel: [3.100915] worker_thread+0x4b/0x460 Jan 11 09:39:18 edvin-tower kernel: [3.100920] kthread+0x109/0x140 Jan 11 09:39:18 edvin-tower kernel: [3.100924] ? process_one_work+0x410/0x410 Jan 11 09:39:18 edvin-tower kernel: [3.100928] ? 
kthread_create_on_node+0x70/0x70 Jan 11 09:39:18 edvin-tower kernel: [3.100934] ret_from_fork+0x1f/0x30 Jan 11 09:39:18 edvin-tower kernel: [3.100938] Code: Bad RIP value. Jan 11 09:39:18 edvin-tower kernel: [3.100944] RIP: (null) RSP: ac9ec2137bd8 Jan 11 09:39:18 edvin-tower kernel: [3.100948] CR2: Jan 11 09:39:18 edvin-tower kernel: [3.100952] ---[ end trace 93a79dae0d3ec749 ]--- ProblemType: Bug DistroRelease: Ubuntu 16.04 Package: linux-generic-hwe-16.04 4.13.0.26.46 ProcVersionSignature: Ubuntu
[Kernel-packages] [Bug 1793430] [NEW] Page leaking in cachefiles_read_backing_file while vmscan is active
Public bug reported:

SRU Justification

[Description]
In a heavily loaded system where the system pagecache is nearing memory limits and fscache is enabled, pages can be leaked by fscache while trying to read pages from the cachefiles backend. This can happen because two applications can be reading the same page from a single mount, so two threads can be trying to read the backing page at the same time. One of the threads then finds that a page for the backing file or netfs file is already in the radix tree. During the error handling, cachefiles does not clean up the reference on the backing page, leading to a page leak.

[Fix]
The fix is straightforward: decrement the reference when the error is encountered.

[Testing]
A user has tested the fix using the following method for 12+ hrs.

1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc :/export /mnt/nfs
2) create 1 files of 2.8MB in a NFS mount.
3) start a thread to simulate heavy VM pressure
   (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
4) start multiple parallel readers for the data set at the same time
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   ..
   ..
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
5) finally check using cat /proc/fs/fscache/stats | grep -i pages ; free -h , cat /proc/meminfo and page-types -r -b lru to ensure all pages are freed.

[Regression Potential]
Limited to cachefiles.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Assignee: Daniel Axtens (daxtens)
   Status: Confirmed
https://bugs.launchpad.net/bugs/1793430

Title: Page leaking in cachefiles_read_backing_file while vmscan is active

Status in linux package in Ubuntu: Confirmed
To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
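The page leak described for Bug 1793430 above comes down to an error path that forgets to drop a reference. A minimal sketch of that pattern follows, written in Python rather than kernel C. Everything in it is an assumption made for illustration: the Page class, the dict standing in for the radix tree, and the function names are hypothetical and do not correspond to cachefiles code.

```python
# Toy model: 'refs' imitates a page reference count and a dict imitates the
# page-cache radix tree. All names here are hypothetical.

class Page:
    def __init__(self):
        self.refs = 1             # the allocating reader holds one reference


def insert_page(tree, index, page):
    """Insert a page; fail (like -EEXIST) if the slot is already occupied."""
    if index in tree:
        return False
    tree[index] = page
    page.refs += 1                # the tree now holds its own reference
    return True


def read_backing_page(tree, index, drop_ref_on_error):
    """Return how many references end up with no owner (i.e. leaked)."""
    page = Page()
    if insert_page(tree, index, page):
        page.refs -= 1            # reader is done; the tree keeps its reference
        return 0
    # Lost the race: another reader already inserted a page at this index.
    if drop_ref_on_error:
        page.refs -= 1            # error path releases the reader's reference
    return page.refs


if __name__ == "__main__":
    tree = {}
    print(read_backing_page(tree, 7, drop_ref_on_error=False))  # first reader
    print(read_backing_page(tree, 7, drop_ref_on_error=False))  # loser, no cleanup
    print(read_backing_page(tree, 7, drop_ref_on_error=True))   # loser, with cleanup
```

The second call models the losing thread without cleanup and reports one orphaned reference; the third models the same collision with the reference dropped on the error path.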
[Kernel-packages] [Bug 1783246] Re: Cephfs + fscache: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: jbd2__journal_start+0x22/0x1f0
** Changed in: linux (Ubuntu)
   Assignee: Daniel Axtens (daxtens) => (unassigned)

https://bugs.launchpad.net/bugs/1783246

Title: Cephfs + fscache: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: jbd2__journal_start+0x22/0x1f0

Status in linux package in Ubuntu: Confirmed
Status in linux source package in Bionic: Fix Committed
[Kernel-packages] [Bug 1774336] Re: FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false
** Description changed:

  == SRU Justification ==

  [Impact]
  Oops during heavy NFS + FSCache use:

  [81738.886634] FS-Cache:
  [81738.888281] FS-Cache: Assertion failed
  [81738.889461] FS-Cache: 6 == 5 is false
  [81738.890625] [ cut here ]
  [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

  6 == 5 represents an operation being DEAD when it was not expected to be.

  [Cause]
  There is a race in fscache and cachefiles.

  One thread is in cachefiles_read_waiter:
   1) object->work_lock is taken.
   2) the operation is added to the to_do list.
   3) the work lock is dropped.
   4) fscache_enqueue_retrieval is called, which takes a reference.

  Another thread is in cachefiles_read_copier:
   1) object->work_lock is taken.
   2) an item is popped off the to_do list.
   3) object->work_lock is dropped.
   4) some processing is done on the item, and fscache_put_retrieval() is called, dropping a reference.

  Now if this process in cachefiles_read_copier takes place *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be dropped before it is taken, which leads to the object's reference count hitting zero, which leads to lifecycle events for the object happening too soon, leading to the assertion failure later on.

  (This is simplified and clarified from the original upstream analysis for this patch at https://www.redhat.com/archives/linux-cachefs/2018-February/msg1.html and from a similar patch with a different approach to fixing the bug at https://www.redhat.com/archives/linux-cachefs/2017-June/msg2.html)

  [Fix]
- Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter. This means that the object cannot be popped off the to_do list until it is in a fully consistent state with the reference taken.
+ (Old sauce patch being reverted) Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter. This means that the object cannot be popped off the to_do list until it is in a fully consistent state with the reference taken.
+
+ (New upstream patch) Explicitly take a reference to the object while it is being enqueued. Adjust another part of the code to deal with the greater range of object states this exposes.

  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.

  [Regression Potential]
   - Limited to fscache/cachefiles.
-  - The change makes things more conservative (doing more under lock) so that's reassuring.
+  - The change makes things more conservative (taking more references) so that's reassuring.
   - There may be performance impacts but none have been observed so far.
https://bugs.launchpad.net/bugs/1774336

Title: FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Trusty: Fix Released
Status in linux source package in Xenial: Fix Released
Status in linux source package in Artful: Fix Released
Status in linux source package in Bionic: Fix Released
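The ordering problem described in the [Cause] section of Bug 1774336 can be replayed step by step. The sketch below is a Python toy, not the fscache code: the step names are invented labels for the numbered steps in the description, and the starting reference count of 1 is an assumption chosen so that a put arriving before its matching get drives the count to zero.

```python
# Toy model of the ordering problem; step names are hypothetical labels and
# the integer 'refs' only imitates the operation's reference count.

def replay(steps, initial_refs=1):
    """Replay the steps in order; return True if refs ever reaches zero."""
    refs = initial_refs
    to_do = []
    hit_zero = False
    for step in steps:
        if step == "waiter_add":      # waiter: add op to to_do (under lock)
            to_do.append("op")
        elif step == "waiter_get":    # waiter: enqueue, taking a reference
            refs += 1
        elif step == "copier_pop":    # copier: pop op off to_do (under lock)
            to_do.pop(0)
        elif step == "copier_put":    # copier: finish and drop a reference
            refs -= 1
        else:
            raise ValueError(step)
        if refs <= 0:
            hit_zero = True
    return hit_zero


# Copier runs between the waiter's unlock and its enqueue: put before get.
RACY = ["waiter_add", "copier_pop", "copier_put", "waiter_get"]
# Enqueue (and its reference) happens before the op can be popped.
GET_UNDER_LOCK = ["waiter_add", "waiter_get", "copier_pop", "copier_put"]
# A reference is taken explicitly before the op becomes visible at all.
GET_BEFORE_ADD = ["waiter_get", "waiter_add", "copier_pop", "copier_put"]

if __name__ == "__main__":
    for name, schedule in [("racy", RACY),
                           ("get under lock", GET_UNDER_LOCK),
                           ("get before add", GET_BEFORE_ADD)]:
        print(name, replay(schedule))
```

Only the first schedule lets the count reach zero early. The other two correspond loosely to the two fixes mentioned in the description: doing the enqueue under the lock, or taking a reference explicitly before the operation can be picked up.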
[Kernel-packages] [Bug 1784864] [NEW] Various fscache/cachefiles bugs
Public bug reported:

SRU Justification

A few bugs while using fscache/cachefiles on an NFS share have been reported by a user. All are intermittent/race conditions.

[Impact]
Various BUGs/OOPSes:
 - BUG on "Unexpected object collision"
 - CacheFiles: Error: Overlong wait for old active object to go away / CacheFiles: Error: Object already active / kernel BUG at fs/cachefiles/namei.c:163!
 - Unmounting an NFS share sometimes leads to an oops

[Fix]
Grab the following patches from the fscache-fixes branch of Dave Howells' linux-fs tree (https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes) - they're various small fixes within fscache/cachefiles.

4856fccd559f cachefiles: Wait rather than BUG'ing on "Unexpected object collision"
28d64cf8990c cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
aedc4ca703bc fscache: Fix reference overput in fscache_attach_object() error handling

[Testcase]
The user has run ~100 hours of NFS stress tests and has not seen these bugs recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Assignee: Daniel Axtens (daxtens)
   Status: Invalid

https://bugs.launchpad.net/bugs/1784864

Title: Various fscache/cachefiles bugs

Status in linux package in Ubuntu: Invalid
[Kernel-packages] [Bug 1784864] Re: Various fscache/cachefiles bugs
Oops, my mistake, there are already LP bugs covering these issues.

Regards,
Daniel

** Changed in: linux (Ubuntu)
   Status: Confirmed => Invalid

https://bugs.launchpad.net/bugs/1784864

Title: Various fscache/cachefiles bugs

Status in linux package in Ubuntu: Invalid
[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)
Yes, we have closed the support case on our end at their request. Apparently increasing the reservation ratio has helped. Paulus - Hi! Thanks for the info and clearing up some of my misunderstandings. Great to hear from you and I hope things are going well at OzLabs :) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1781038 Title: KVM guest hash page table failed to allocate contiguous memory (CMA) Status in The Ubuntu-power-systems project: Opinion Status in linux package in Ubuntu: New Bug description: Per an email forwarded within IBM, we wish to use this Launchpad bug to work on the technical discussion with the Canonical development folks and the IBM KVM and kernel team surrounding the analysis made by Daniel Axtens of Canonical for the customer issue raised in Case #00177825. The only statement at the moment by the KVM team was that there were various issues associated with CMA fragmentation causing issues with KVM guests. However, as mentioned, this bug is to allow the dialog amongst all the developers to see what can be done to help alleviate the situation or understand the root cause further. Please also note that we should not be attaching customer data to this bug. If that is necessary then we expect Canonical to help provide a controlled environment for reviewing that data so we avoid any privacy issues (e.g. for GDPR compliance). Here is the email from Daniel: I have looked at the sosreport you uploaded. Here is my analysis so far. Virtualisation on powerpc has some special requirements. To start a guest on a powerpc host, you need to allocate a contiguous area of memory to hold the guest's hash page table (HPT, or HTAB, depending on which document you look at). The HPT is required to track and manage guest memory. Your error reports show qemu asking the kernel to allocate an HTAB, and the kernel reporting that it had insufficient memory to do so. 
The required memory for the HPT scales with the guest memory size - it should be about 1/128th of guest memory, so for a 16GB guest, that's 128MB. However, the HPT has to be allocated as a single contiguous memory region. (This is in contrast to regular guest memory, which is not required to be contiguous from the host point of view.)

The kernel keeps a special contiguous memory area (CMA) for these purposes, and keeps track of the total amounts in use and still available. These are shown in /proc/meminfo. From the system that ran the sosreport, we see:

CmaTotal: 26853376 kB
CmaFree: 4024448 kB

So there is a total of about 25GB of CMA, of which about 3.8GB remain. This is obviously more than 128MB:

 - It's very possible that between the error and the sosreport, more contiguous memory became available. This would match the intermittent nature of the issue.

 - It also might be that the failure was due to fragmentation of memory in the CMA pool. That is, there might be more than 128MB, but it might all be in chunks that are smaller than 128MB, or which don't have the required alignment for a HPT. Given that the system's uptime was 112 days when the sosreport was generated, it would be unsurprising if fragmentation had occurred! (Relatedly - you're running 4.4.0-109, which does not have the Spectre and Meltdown fixes.)

This issue has come up before - both in a public Canonical-IBM synchronised bug report[1], and with Red Hat[2]. It appears that there is some work within IBM to address this, but it seems to have stalled. I will get in touch with the IBM powerpc kernel team on their public mailing list and ask about the status. I will keep you updated.

In the mean time, I have a potential solution/workaround. By default, 5% of memory is reserved for CMA (kernel source: arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can increase this with a boot parameter, so for example to reserve 10%, you could boot with kvm_cma_resv_ratio=10. This can be set in petitboot. This should significantly reduce the incidence of this issue - perhaps eliminating it entirely - at the cost of locking away more of the system's memory. You would need to experiment to determine the optimal value. Perhaps given that you are seeing the problem only intermittently, a ratio of 7% would be sufficient - that would give you ~35GB of CMA.

Please let me know if testing this setting would be an option for you. Please also let me know if you require further information on setting boot parameters with Petitboot.

Regards,
Daniel

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

Before we go any further, let's get the basic info here. Apparently there was a sosreport somewhere else, and a link would be good, but, here's what we need
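The figures in Daniel's analysis above (1/128th of guest memory for the HPT, the CmaTotal and CmaFree values, and the effect of raising kvm_cma_resv_ratio from 5 to 7) can be checked with a few lines of arithmetic. This is a sketch in Python with made-up helper names; it assumes the reported CmaTotal corresponds exactly to the default 5% ratio and that the reservation scales linearly with the ratio.

```python
# Arithmetic check only; helper names are hypothetical.

KIB_PER_GIB = 1024 * 1024


def hpt_size_mb(guest_mem_gb):
    """HPT size in MB, taking the HPT as 1/128th of guest memory."""
    return guest_mem_gb * 1024 / 128


def kb_to_gib(kb):
    """Convert a /proc/meminfo value in kB to GiB."""
    return kb / KIB_PER_GIB


def cma_kb_at_ratio(cma_total_kb, current_percent, new_percent):
    """Scale CmaTotal linearly, assuming it equals current_percent of RAM."""
    return cma_total_kb * new_percent / current_percent


if __name__ == "__main__":
    cma_total_kb = 26853376   # CmaTotal from the sosreport
    cma_free_kb = 4024448     # CmaFree from the sosreport

    print(hpt_size_mb(16))                           # 128.0 MB for a 16GB guest
    print(round(kb_to_gib(cma_total_kb), 2))         # 25.61 GiB total
    print(round(kb_to_gib(cma_free_kb), 2))          # 3.84 GiB free
    print(round(kb_to_gib(cma_kb_at_ratio(cma_total_kb, 5, 7)), 1))   # 35.9 GiB
    print(round(kb_to_gib(cma_kb_at_ratio(cma_total_kb, 5, 10)), 1))  # 51.2 GiB
```

Under those assumptions the numbers agree with the email: roughly 25GB of CMA at the default ratio, about 3.8GB free, and a little over 35GB at a 7% ratio.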
[Kernel-packages] [Bug 1783246] [NEW] Cephfs + fscache: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: jbd2__journal_start+0x22/0x1f0
Public bug reported:

SRU Justification

[Impact]
Certain sequences of file system operations on a cephfs volume backed by fscache with an ext4 store can cause a kernel BUG:

[ 5818.932770] BUG: unable to handle kernel NULL pointer dereference at
[ 5818.934354] IP: jbd2__journal_start+0x33/0x1e0
...
[ 5818.962490] Call Trace:
[ 5818.963055] ? ext4_writepages+0x5d5/0xf40
[ 5818.963884] __ext4_journal_start_sb+0x6d/0x120
[ 5818.964994] ext4_writepages+0x5d5/0xf40
[ 5818.965991] ? __enqueue_entity+0x5c/0x60
[ 5818.966791] ? check_preempt_wakeup+0x130/0x240
[ 5818.967679] do_writepages+0x4b/0xe0
[ 5818.968625] ? ext4_mark_inode_dirty+0x1d0/0x1d0
[ 5818.969526] ? do_writepages+0x4b/0xe0
[ 5818.970493] ? ext4_statfs+0x114/0x260
[ 5818.971267] __filemap_fdatawrite_range+0xc1/0x100
[ 5818.972425] ? __filemap_fdatawrite_range+0xc1/0x100
[ 5818.973385] filemap_write_and_wait+0x31/0x90
[ 5818.974461] ext4_bmap+0x8c/0xe0
[ 5818.975150] cachefiles_read_or_alloc_pages+0x1bf/0xd90 [cachefiles]
[ 5818.976718] ? _cond_resched+0x19/0x40
[ 5818.977482] ? wake_up_bit+0x42/0x50
[ 5818.978227] ? fscache_run_op.isra.8+0x4c/0x80 [fscache]
[ 5818.979249] __fscache_read_or_alloc_pages+0x1d3/0x2e0 [fscache]
[ 5818.980397] ceph_readpages_from_fscache+0x6c/0xe0 [ceph]
[ 5818.981630] ceph_readpages+0x49/0x100 [ceph]
[ 5818.982691] __do_page_cache_readahead+0x1c9/0x2c0
[ 5818.983628] ? __cap_is_valid+0x21/0xb0 [ceph]
[ 5818.984526] ondemand_readahead+0x11a/0x2a0
[ 5818.985374] ? ondemand_readahead+0x11a/0x2a0
[ 5818.986825] page_cache_async_readahead+0x71/0x80
[ 5818.987751] generic_file_read_iter+0x784/0xbf0
[ 5818.988663] ? ceph_put_cap_refs+0x1c4/0x330 [ceph]
[ 5818.989620] ? page_cache_tree_insert+0xe0/0xe0
[ 5818.990519] ceph_read_iter+0x106/0x820 [ceph]
[ 5818.991818] new_sync_read+0xe4/0x130
[ 5818.992588] __vfs_read+0x29/0x40
[ 5818.993504] vfs_read+0x8e/0x130
[ 5818.994192] SyS_read+0x55/0xc0
[ 5818.994870] do_syscall_64+0x73/0x130
[ 5818.995632] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

[Fix]
Cherry-pick 5d988308283ecf062fa88f20ae05c52cce0bcdca from upstream. This patch stops cephfs from reusing current->journal for its own internal use, which means that it's valid when ext4 uses it via fscache.

[Testcase]
A user has been using the following test case:

( cat /proc/fs/fscache/stats > ~/test.log; i=0; while true; do touch small; echo 3 > /proc/sys/vm/drop_caches & md5sum small; let "i++"; if ! (( $i % 1000 )); then echo "Test iteration $i done" >> ~/test.log; cat /proc/fs/fscache/stats >> ~/test.log; fi; done ) > ~/nohup.out 2>&1

(It boils down to "touch file; drop caches; read file")

Without the patch, this fails very quickly - usually the first time, always within a few iterations.

With the patch, the user ran this loop for over 60 hours without incident.

[Regression potential]
The change is not trivial, but is limited to cephfs, and has been in mainline since v4.16. So the risk of regression is well contained.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Assignee: Daniel Axtens (daxtens)
   Status: Confirmed

https://bugs.launchpad.net/bugs/1783246

Title: Cephfs + fscache: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: jbd2__journal_start+0x22/0x1f0

Status in linux package in Ubuntu: Confirmed
Re: [Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla
Hi,

I am told that this is the same machine but not while it was currently showing symptoms - due to the intermittent nature of the problem it was taken some time later. This matches what I see in the logs so I have no reason to doubt it.

Regards,
Daniel

https://bugs.launchpad.net/bugs/1781038

Title: KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project: Triaged
Status in linux package in Ubuntu: New
[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)
** Attachment added: "var_log_libvirt_qemu.tar.bz2"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164739/+files/var_log_libvirt_qemu.tar.bz2

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

Bug description:

Per an email forwarded within IBM, we wish to use this Launchpad bug to work on the technical discussion with the Canonical development folks and the IBM KVM and kernel team surrounding the analysis made by Daniel Axtens of Canonical for the customer issue raised in Case #00177825.

The only statement at the moment by the KVM team was that there were various issues associated with CMA fragmentation causing issues with KVM guests. However, as mentioned, this bug is to allow the dialog amongst all the developers to see what can be done to help alleviate the situation or understand the root cause further.

Please also note that we should not be attaching customer data to this bug. If that is necessary then we expect Canonical to help provide a controlled environment for reviewing that data so we avoid any privacy issues (e.g. for GDPR compliance).

Here is the email from Daniel:

I have looked at the sosreport you uploaded. Here is my analysis so far.

Virtualisation on powerpc has some special requirements. To start a guest on a powerpc host, you need to allocate a contiguous area of memory to hold the guest's hash page table (HPT, or HTAB, depending on which document you look at). The HPT is required to track and manage guest memory. Your error reports show qemu asking the kernel to allocate an HTAB, and the kernel reporting that it had insufficient memory to do so.

The required memory for the HPT scales with the guest memory size - it should be about 1/128th of guest memory, so for a 16GB guest, that's 128MB. However, the HPT has to be allocated as a single contiguous memory region. (This is in contrast to regular guest memory, which is not required to be contiguous from the host point of view.)

The kernel keeps a special contiguous memory area (CMA) for these purposes, and keeps track of the total amounts in use and still available. These are shown in /proc/meminfo. From the system that ran the sosreport, we see:

CmaTotal: 26853376 kB
CmaFree: 4024448 kB

So there is a total of about 25GB of CMA, of which about 3.8GB remain. This is obviously more than 128MB:

- It's very possible that between the error and the sosreport, more contiguous memory became available. This would match the intermittent nature of the issue.

- It also might be that the failure was due to fragmentation of memory in the CMA pool. That is, there might be more than 128MB, but it might all be in chunks that are smaller than 128MB, or which don't have the required alignment for an HPT. Given that the system's uptime was 112 days when the sosreport was generated, it would be unsurprising if fragmentation had occurred! (Relatedly - you're running 4.4.0-109, which does not have the Spectre and Meltdown fixes.)

This issue has come up before - both in a public Canonical-IBM synchronised bug report[1], and with Red Hat[2]. It appears that there is some work within IBM to address this, but it seems to have stalled. I will get in touch with the IBM powerpc kernel team on their public mailing list and ask about the status. I will keep you updated.

In the meantime, I have a potential solution/workaround. By default, 5% of memory is reserved for CMA (kernel source: arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can increase this with a boot parameter, so for example to reserve 10%, you could boot with kvm_cma_resv_ratio=10. This can be set in petitboot.

This should significantly reduce the incidence of this issue - perhaps eliminating it entirely - at the cost of locking away more of the system's memory. You would need to experiment to determine the optimal value. Perhaps given that you are seeing the problem only intermittently, a ratio of 7% would be sufficient - that would give you ~35GB of CMA.

Please let me know if testing this setting would be an option for you. Please also let me know if you require further information on setting boot parameters with Petitboot.

Regards,
Daniel

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

Before we go any further, let's get the basic info here. Apparently there was a sosreport somewhere else, and a link would be good, but, here's what we need here -- at least -- to get started:

1. What is the server model and at least basic config info (I/O cards, firmware level)?
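As a rough check of the figures in Daniel's email, the following is a minimal Python sketch and not part of the original thread. It re-derives the HPT size from the 1/128 rule, converts the quoted CmaTotal/CmaFree values, and estimates the CMA size at a 7% ratio. The function names are hypothetical, and the host memory size is inferred on the assumption that the default 5% ratio was in effect when the sosreport was taken.

```python
KIB_PER_GIB = 1024 * 1024


def hpt_size_mib(guest_gib):
    """HPT size in MiB, using the ~1/128-of-guest-memory rule from the email."""
    return guest_gib * 1024 / 128


def parse_cma(meminfo_text):
    """Pull CmaTotal and CmaFree (in kB) out of /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("CmaTotal", "CmaFree"):
            values[key] = int(rest.split()[0])
    return values


# Values quoted in the email; on a live host, read /proc/meminfo instead.
sample = "CmaTotal: 26853376 kB\nCmaFree: 4024448 kB\n"
cma = parse_cma(sample)

total_gib = cma["CmaTotal"] / KIB_PER_GIB
free_gib = cma["CmaFree"] / KIB_PER_GIB

# Assumption: the pool was sized with the default kvm_cma_resv_ratio of 5%.
host_gib = total_gib / 0.05
cma_at_7_gib = host_gib * 0.07

print(f"HPT for a 16GB guest: {hpt_size_mib(16):.0f} MiB")
print(f"CMA total: {total_gib:.1f} GiB, free: {free_gib:.1f} GiB")
print(f"Estimated CMA at 7%: {cma_at_7_gib:.1f} GiB")
```

The free figure says nothing about contiguity, so a pool with several GiB free can still fail a 128MB HPT request if the free space is scattered.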
[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)
** Attachment added: "syslog"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164740/+files/syslog
[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)
** Attachment added: "meminfo"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164738/+files/meminfo
[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)
Based on the most recent information we have available to us (2018-05-09):

1. What is the server model and at least basic config info (I/O cards, firmware level)? Use /proc/meminfo, etc. Attach the syslog and the /var/log/libvirt/qemu logs.

I am struggling a bit to determine the server model, but I'm uploading the relevant logs.

2. What is running on the host (at least uname -a). Sounds from the comment above like it's an older fix level, so let's get it updated to the current level (and ensure the problem still exists) before proceeding: There is zero point in trying to figure out whether fixes that are known to exist in 16.04 are in this *particular* build level.

Linux apsoscmp-as-a4p 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 20:00:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I don't have any answers for (3); the user has been asked.

** Attachment added: "lspci"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164737/+files/lspci
[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)
Hi,

This came up in the context of a customer issue. I have asked them if we can share anonymised data here, and I will pass on any response.

From my analysis of the code while working the case, it would seem that you could reproduce this by spinning up and tearing down VMs of varying memory sizes in order to fragment the CMA. It looks like PCI pass-through would exacerbate the issue, although I don't believe this was a factor in this instance.

I wonder if this is fully 'solvable' per se - with memory overcommit it should be easy to simply run out of CMA space - but it should be possible to at least print much more helpful information either from the kernel or from qemu.

Regards,
Daniel

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New
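To illustrate the fragmentation scenario Daniel describes (guests of varying sizes being created and torn down), here is a toy Python model. It is only a sketch: the pool is a flat list of equal-sized units, the helper names are hypothetical, and it makes no attempt to mimic the kernel's real CMA allocator or HPT alignment rules.

```python
def largest_free_run(pool):
    """Length of the longest run of free (False) units in the pool."""
    best = run = 0
    for used in pool:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


def can_allocate(pool, size):
    """A contiguous request succeeds only if some free run is long enough."""
    return largest_free_run(pool) >= size


# 16 units, each handed out to a small guest needing one unit.
pool = [True] * 16

# Every other guest is torn down, leaving isolated single-unit holes.
for i in range(0, 16, 2):
    pool[i] = False

free_units = pool.count(False)
print("free units:", free_units)
print("largest contiguous run:", largest_free_run(pool))
print("can place a 2-unit HPT:", can_allocate(pool, 2))
```

Half the pool is free in this model, yet a request for two contiguous units cannot be satisfied, which is the same shape of failure as a 128MB HPT allocation failing with 3.8GB of CMA reported free.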
[Kernel-packages] [Bug 1777029] Re: fscache: Fix hanging wait on page discarded by writeback
** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Trusty)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Artful)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1777029

Title:
  fscache: Fix hanging wait on page discarded by writeback

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed

Bug description:

== SRU Justification ==

[Impact]
Under heavy NFS + FSCache load, a user sometimes observes a hang in __fscache_wait_on_page_write+0x5f/0xa0.

Example traces:

[] __fscache_wait_on_page_write+0x5f/0xa0 [fscache]
[] __fscache_uncache_all_inode_pages+0xba/0x120 [fscache]
[] nfs_fscache_open_file+0x4e/0xc0 [nfs]

[] __fscache_wait_on_page_write+0x5f/0xa0 [fscache]
[] __nfs_fscache_invalidate_page+0x2c/0x80 [nfs]
[] nfs_invalidate_page+0x63/0x90 [nfs]
[] truncate_inode_page+0x80/0x90

[Fix]
Cherry-pick 2c98425720233ae3e135add0c7e869b32913502f from upstream, which is a patch from the FSCache maintainer.

[Testcase]
The user has run a NFS stress-test with a similar home-grown patch, and will run a stress test on the proposed kernel.

[Regression Potential]
Patch is limited to FSCache, so regression potential is limited.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777029/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1774336] [NEW] FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false
Public bug reported:

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache use:

[81738.886634] FS-Cache:
[81738.888281] FS-Cache: Assertion failed
[81738.889461] FS-Cache: 6 == 5 is false
[81738.890625] [ cut here ]
[81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

6 == 5 represents an operation being DEAD when it was not expected to be.

[Cause]
There is a race in fscache and cachefiles.

One thread is in cachefiles_read_waiter:
 1) object->work_lock is taken.
 2) the operation is added to the to_do list.
 3) the work lock is dropped.
 4) fscache_enqueue_retrieval is called, which takes a reference.

Another thread is in cachefiles_read_copier:
 1) object->work_lock is taken.
 2) an item is popped off the to_do list.
 3) object->work_lock is dropped.
 4) some processing is done on the item, and fscache_put_retrieval() is called, dropping a reference.

Now if this process in cachefiles_read_copier takes place *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be dropped before it is taken, which leads to the object's reference count hitting zero, which leads to lifecycle events for the object happening too soon, leading to the assertion failure later on.

(This is simplified and clarified from the original upstream analysis for this patch at https://www.redhat.com/archives/linux-cachefs/2018-February/msg1.html and from a similar patch with a different approach to fixing the bug at https://www.redhat.com/archives/linux-cachefs/2017-June/msg2.html)

[Fix]
Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter. This means that the object cannot be popped off the to_do list until it is in a fully consistent state with the reference taken.

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.
 - The change makes things more conservative (doing more under lock) so that's reassuring.
 - There may be performance impacts but none have been observed so far.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Assignee: Daniel Axtens (daxtens)
   Status: Confirmed

** Changed in: linux (Ubuntu)
   Status: New => Confirmed

** Changed in: linux (Ubuntu)
   Assignee: (unassigned) => Daniel Axtens (daxtens)

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1774336

Title:
  FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1774336/+subscriptions
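The ordering problem in the [Cause] section can be shown with a small deterministic Python model. This is a toy sketch, not the kernel code: the Operation class, the starting reference count of 1, and the run() helper are all assumptions made for illustration, and the interleaving is written out explicitly rather than using real threads or locks.

```python
from collections import deque


class Operation:
    """Toy stand-in for a retrieval operation with a reference count."""

    def __init__(self):
        self.refs = 1       # assumption: one reference is already held
        self.min_refs = 1   # lowest count seen during the run

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        self.min_refs = min(self.min_refs, self.refs)


def run(enqueue_under_lock):
    """Replay the interleaving; return the lowest refcount observed."""
    op = Operation()
    to_do = deque()

    # Waiter, steps 1-3: take work_lock, queue the item, drop work_lock.
    to_do.append(op)
    if enqueue_under_lock:
        op.get()            # fixed ordering: reference taken before unlock

    # Copier runs in the window: pops the item, processes it, drops a ref.
    item = to_do.popleft()
    item.put()

    # Waiter, step 4 in the original ordering: reference taken too late.
    if not enqueue_under_lock:
        op.get()

    return op.min_refs


print("original ordering, lowest refcount:", run(enqueue_under_lock=False))
print("fixed ordering, lowest refcount:", run(enqueue_under_lock=True))
```

In this model the original ordering lets the count touch zero before the late get(), which corresponds to the operation being treated as DEAD too early. With the enqueue moved under the lock, the count never drops below its starting value.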
[Kernel-packages] [Bug 1742658] Re: linux-generic-hwe-16.04 OOPS in nouveau after security update
Hi,

I have a report of this from another user. I will submit it to the kernel team.

Regards,
Daniel

** Changed in: linux (Ubuntu)
Status: Incomplete => Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1742658

Title: linux-generic-hwe-16.04 OOPS in nouveau after security update

Status in linux package in Ubuntu: Confirmed
Status in linux-hwe package in Ubuntu: New
Status in linux-hwe-edge package in Ubuntu: New
Status in linux-meta-hwe package in Ubuntu: New
Status in linux-meta-hwe-edge package in Ubuntu: New

Bug description:

Description: Ubuntu 16.04.3 LTS
Release: 16.04

After upgrading linux-generic-hwe-16.04 to 4.13.0.26.46 I get a black screen with nouveau. Previously I was running 4.10.0-42-generic, and that kernel still works fine.

Here is the OOPS:

Jan 11 09:39:18 edvin-tower kernel: [3.079986] [drm] Initialized nouveau 1.3.1 20120801 for :02:00.0 on minor 0
Jan 11 09:39:18 edvin-tower kernel: [3.100591] BUG: unable to handle kernel NULL pointer dereference at (null)
Jan 11 09:39:18 edvin-tower kernel: [3.100606] IP: (null)
Jan 11 09:39:18 edvin-tower kernel: [3.100610] PGD 0
Jan 11 09:39:18 edvin-tower kernel: [3.100611] P4D 0
Jan 11 09:39:18 edvin-tower kernel: [3.100615]
Jan 11 09:39:18 edvin-tower kernel: [3.100620] Oops: 0010 [#1] SMP PTI
Jan 11 09:39:18 edvin-tower kernel: [3.100623] Modules linked in: hid_generic usbhid hid nouveau mxm_wmi video i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect e1000e sysimgblt fb_sys_fops drm ptp ahci pps_core pata_acpi libahci wmi
Jan 11 09:39:18 edvin-tower kernel: [3.100643] CPU: 4 PID: 238 Comm: kworker/u16:7 Not tainted 4.13.0-26-generic #29~16.04.2-Ubuntu
Jan 11 09:39:18 edvin-tower kernel: [3.100649] Hardware name: Dell Inc. Precision Tower 5810/0K240Y, BIOS A05 12/16/2014
Jan 11 09:39:18 edvin-tower kernel: [3.100688] Workqueue: nvkm-disp gf119_disp_super [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100694] task: 9d8982d25d00 task.stack: ac9ec2134000
Jan 11 09:39:18 edvin-tower kernel: [3.100698] RIP: 0010: (null)
Jan 11 09:39:18 edvin-tower kernel: [3.100701] RSP: 0018:ac9ec2137bd8 EFLAGS: 00010206
Jan 11 09:39:18 edvin-tower kernel: [3.100706] RAX: c0416f20 RBX: RCX: 0016
Jan 11 09:39:18 edvin-tower kernel: [3.100710] RDX: RSI: RDI: 9d898140d180
Jan 11 09:39:18 edvin-tower kernel: [3.100715] RBP: ac9ec2137c70 R08: R09:
Jan 11 09:39:18 edvin-tower kernel: [3.100719] R10: 1000 R11: R12:
Jan 11 09:39:18 edvin-tower kernel: [3.100724] R13: R14: ac9ec2137d00 R15: 9d898c542600
Jan 11 09:39:18 edvin-tower kernel: [3.100728] FS: () GS:9d899fd0() knlGS:
Jan 11 09:39:18 edvin-tower kernel: [3.100733] CS: 0010 DS: ES: CR0: 80050033
Jan 11 09:39:18 edvin-tower kernel: [3.100737] CR2: CR3: 00029ac0a006 CR4: 001606e0
Jan 11 09:39:18 edvin-tower kernel: [3.100742] Call Trace:
Jan 11 09:39:18 edvin-tower kernel: [3.100771] ? nvkm_dp_train_drive+0x214/0x300 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100798] nvkm_dp_train+0x582/0x970 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100824] nvkm_dp_acquire+0xd4/0x390 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100850] nv50_disp_super_2_2+0x6d/0x430 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100872] ? nvkm_devinit_pll_set+0xf/0x20 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100897] gf119_disp_super+0x1b7/0x300 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100904] ? __schedule+0x3ca/0x890
Jan 11 09:39:18 edvin-tower kernel: [3.100911] process_one_work+0x156/0x410
Jan 11 09:39:18 edvin-tower kernel: [3.100915] worker_thread+0x4b/0x460
Jan 11 09:39:18 edvin-tower kernel: [3.100920] kthread+0x109/0x140
Jan 11 09:39:18 edvin-tower kernel: [3.100924] ? process_one_work+0x410/0x410
Jan 11 09:39:18 edvin-tower kernel: [3.100928] ? kthread_create_on_node+0x70/0x70
Jan 11 09:39:18 edvin-tower kernel: [3.100934] ret_from_fork+0x1f/0x30
Jan 11 09:39:18 edvin-tower kernel: [3.100938] Code: Bad RIP value.
Jan 11 09:39:18 edvin-tower kernel: [3.100944] RIP: (null) RSP: ac9ec2137bd8
Jan 11 09:39:18 edvin-tower kernel: [3.100948] CR2:
Jan 11 09:39:18 edvin-tower kernel: [3.100952] ---[ end trace 93a79dae0d3ec749 ]---

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-generic-hwe-16.04
[Kernel-packages] [Bug 1742658] Re: linux-generic-hwe-16.04 OOPS in nouveau after security update
** Also affects: linux (Ubuntu)
Importance: Undecided
Status: New

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1742658

Title: linux-generic-hwe-16.04 OOPS in nouveau after security update

Status in linux package in Ubuntu: Confirmed
Status in linux-hwe package in Ubuntu: New
Status in linux-hwe-edge package in Ubuntu: New
Status in linux-meta-hwe package in Ubuntu: New
Status in linux-meta-hwe-edge package in Ubuntu: New

Bug description:

Description: Ubuntu 16.04.3 LTS
Release: 16.04

After upgrading linux-generic-hwe-16.04 to 4.13.0.26.46 I get a black screen with nouveau. Previously I was running 4.10.0-42-generic, and that kernel still works fine.

Here is the OOPS:

Jan 11 09:39:18 edvin-tower kernel: [3.079986] [drm] Initialized nouveau 1.3.1 20120801 for :02:00.0 on minor 0
Jan 11 09:39:18 edvin-tower kernel: [3.100591] BUG: unable to handle kernel NULL pointer dereference at (null)
Jan 11 09:39:18 edvin-tower kernel: [3.100606] IP: (null)
Jan 11 09:39:18 edvin-tower kernel: [3.100610] PGD 0
Jan 11 09:39:18 edvin-tower kernel: [3.100611] P4D 0
Jan 11 09:39:18 edvin-tower kernel: [3.100615]
Jan 11 09:39:18 edvin-tower kernel: [3.100620] Oops: 0010 [#1] SMP PTI
Jan 11 09:39:18 edvin-tower kernel: [3.100623] Modules linked in: hid_generic usbhid hid nouveau mxm_wmi video i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect e1000e sysimgblt fb_sys_fops drm ptp ahci pps_core pata_acpi libahci wmi
Jan 11 09:39:18 edvin-tower kernel: [3.100643] CPU: 4 PID: 238 Comm: kworker/u16:7 Not tainted 4.13.0-26-generic #29~16.04.2-Ubuntu
Jan 11 09:39:18 edvin-tower kernel: [3.100649] Hardware name: Dell Inc. Precision Tower 5810/0K240Y, BIOS A05 12/16/2014
Jan 11 09:39:18 edvin-tower kernel: [3.100688] Workqueue: nvkm-disp gf119_disp_super [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100694] task: 9d8982d25d00 task.stack: ac9ec2134000
Jan 11 09:39:18 edvin-tower kernel: [3.100698] RIP: 0010: (null)
Jan 11 09:39:18 edvin-tower kernel: [3.100701] RSP: 0018:ac9ec2137bd8 EFLAGS: 00010206
Jan 11 09:39:18 edvin-tower kernel: [3.100706] RAX: c0416f20 RBX: RCX: 0016
Jan 11 09:39:18 edvin-tower kernel: [3.100710] RDX: RSI: RDI: 9d898140d180
Jan 11 09:39:18 edvin-tower kernel: [3.100715] RBP: ac9ec2137c70 R08: R09:
Jan 11 09:39:18 edvin-tower kernel: [3.100719] R10: 1000 R11: R12:
Jan 11 09:39:18 edvin-tower kernel: [3.100724] R13: R14: ac9ec2137d00 R15: 9d898c542600
Jan 11 09:39:18 edvin-tower kernel: [3.100728] FS: () GS:9d899fd0() knlGS:
Jan 11 09:39:18 edvin-tower kernel: [3.100733] CS: 0010 DS: ES: CR0: 80050033
Jan 11 09:39:18 edvin-tower kernel: [3.100737] CR2: CR3: 00029ac0a006 CR4: 001606e0
Jan 11 09:39:18 edvin-tower kernel: [3.100742] Call Trace:
Jan 11 09:39:18 edvin-tower kernel: [3.100771] ? nvkm_dp_train_drive+0x214/0x300 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100798] nvkm_dp_train+0x582/0x970 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100824] nvkm_dp_acquire+0xd4/0x390 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100850] nv50_disp_super_2_2+0x6d/0x430 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100872] ? nvkm_devinit_pll_set+0xf/0x20 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100897] gf119_disp_super+0x1b7/0x300 [nouveau]
Jan 11 09:39:18 edvin-tower kernel: [3.100904] ? __schedule+0x3ca/0x890
Jan 11 09:39:18 edvin-tower kernel: [3.100911] process_one_work+0x156/0x410
Jan 11 09:39:18 edvin-tower kernel: [3.100915] worker_thread+0x4b/0x460
Jan 11 09:39:18 edvin-tower kernel: [3.100920] kthread+0x109/0x140
Jan 11 09:39:18 edvin-tower kernel: [3.100924] ? process_one_work+0x410/0x410
Jan 11 09:39:18 edvin-tower kernel: [3.100928] ? kthread_create_on_node+0x70/0x70
Jan 11 09:39:18 edvin-tower kernel: [3.100934] ret_from_fork+0x1f/0x30
Jan 11 09:39:18 edvin-tower kernel: [3.100938] Code: Bad RIP value.
Jan 11 09:39:18 edvin-tower kernel: [3.100944] RIP: (null) RSP: ac9ec2137bd8
Jan 11 09:39:18 edvin-tower kernel: [3.100948] CR2:
Jan 11 09:39:18 edvin-tower kernel: [3.100952] ---[ end trace 93a79dae0d3ec749 ]---

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-generic-hwe-16.04 4.13.0.26.46
ProcVersionSignature: Ubuntu 4.10.0-42.46~16.04.1-generic 4.10.17
Uname: Linux
[Kernel-packages] [Bug 1750038] Re: user space process hung in 'D' state waiting for disk io to complete
** Description changed:

+ == SRU Justification ==
+
+ [Impact]
+ Occasionally an application gets stuck in "D" state on NFS reads/sync and close system calls. All the subsequent operations on the NFS mounts are stuck and reboot is required to rectify the situation.
+
+ [Fix]
+ Use GFP_NOIO in some allocations in writeback to avoid a deadlock. This is upstream in:
+ ae97aa524ef4 ("NFS: Use GFP_NOIO for two allocations in writeback")
+
+ [Testcase]
+ See Test scenario in previous description.
+
+ A test kernel with this patch was tested heavily (>100hrs of test suite)
+ without issue.
+
+ [Regression Potential]
+ This changes memory allocation in NFS to use a different policy. This could potentially affect NFS.
+
+ However, the patch is already in Artful and Bionic without issue.
+
+ The patch does not apply to Trusty.
+
+ == Previous Description ==
+
  Using Ubuntu Xenial, a user reports processes hanging in D state waiting for disk io.

  Occasionally one of the applications gets into "D" state on NFS reads/sync and close system calls. Based on the kernel backtraces, it seems to be stuck in a kmalloc allocation during cleanup of dirty NFS pages.

  All the subsequent operations on the NFS mounts are stuck and reboot is required to rectify the situation.

  [Test scenario]

- 1) Applications running in Docker environment
- 2) Application have cgroup limits --cpu-shares --memory -shm-limit
- 3) python and C++ based applications (torch and caffe)
- 4) Applications read big lmdb files and write results to NFS shares
- 5) use NFS v3 , hard and fscache is enabled
- 6) now swap space is configured
+ 1) Applications running in Docker environment
+ 2) Application have cgroup limits --cpu-shares --memory -shm-limit
+ 3) python and C++ based applications (torch and caffe)
+ 4) Applications read big lmdb files and write results to NFS shares
+ 5) use NFS v3 , hard and fscache is enabled
+ 6) now swap space is configured

  This causes all other I/O activity on that mount to hang.

  We are running into this issue more frequently and have identified a few applications causing this problem.

  As updated in the description, the problem seems to be happening when exercising the stack

  try_to_free_mem_cgroup_pages+0xba/0x1a0

  We see this with docker containers with the cgroup option --memory.

  Whenever there is a deadlock, we see that the process that is hung has reached the maximum cgroup limit multiple times, and typically cleans up dirty data and caches to bring the usage under the limit.

  This reclaim path happens many times, and finally we probably hit a race and get into a deadlock.

** Changed in: linux (Ubuntu)
Assignee: Dragan S. (dragan-s) => Daniel Axtens (daxtens)

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1750038

Title: user space process hung in 'D' state waiting for disk io to complete

Status in linux package in Ubuntu: Incomplete

Bug description:

== SRU Justification ==

[Impact]
Occasionally an application gets stuck in "D" state on NFS reads/sync and close system calls. All the subsequent operations on the NFS mounts are stuck and reboot is required to rectify the situation.

[Fix]
Use GFP_NOIO in some allocations in writeback to avoid a deadlock. This is upstream in:
ae97aa524ef4 ("NFS: Use GFP_NOIO for two allocations in writeback")

[Testcase]
See Test scenario in previous description.

A test kernel with this patch was tested heavily (>100hrs of test suite) without issue.

[Regression Potential]
This changes memory allocation in NFS to use a different policy. This could potentially affect NFS.

However, the patch is already in Artful and Bionic without issue.

The patch does not apply to Trusty.

== Previous Description ==

Using Ubuntu Xenial, a user reports processes hanging in D state waiting for disk io.

Occasionally one of the applications gets into "D" state on NFS reads/sync and close system calls. Based on the kernel backtraces, it seems to be stuck in a kmalloc allocation during cleanup of dirty NFS pages.

All the subsequent operations on the NFS mounts are stuck and reboot is required to rectify the situation.

[Test scenario]

1) Applications running in Docker environment
2) Application have cgroup limits --cpu-shares --memory -shm-limit
3) python and C++ based applications (torch and caffe)
4) Applications read big lmdb files and write results to NFS shares
5) use NFS v3 , hard and fscache is enabled
6) now swap space is configured

This causes all other I/O activity on that mount to hang.

We are running into this issue more frequently and have identified a few applications causing this problem.

As updated in the description, the problem seems to be happening when exercising the stack
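To illustrate why the [Fix] above switches these writeback allocations to GFP_NOIO, here is a toy Python model of how the common GFP masks are composed. The bit values are placeholders chosen for readability (the kernel's real values differ between versions), and the function writeback_alloc_can_self_block is a hypothetical simplification of the deadlock condition, not something taken from the kernel or from the patch.

```python
# Illustrative bit values only; the kernel's real values differ between versions.
RECLAIM = 0x1  # the allocation may enter memory reclaim
IO = 0x2       # reclaim may start physical I/O
FS = 0x4       # reclaim may call down into filesystem code

# Composition of the common masks (this relationship matches the kernel's definitions).
GFP_KERNEL = RECLAIM | IO | FS
GFP_NOFS = RECLAIM | IO
GFP_NOIO = RECLAIM


def reclaim_may_do_io(gfp):
    """True if an allocation with these flags can end up starting I/O from reclaim."""
    return bool(gfp & RECLAIM) and bool(gfp & IO)


def writeback_alloc_can_self_block(gfp, cgroup_at_limit):
    """Hypothetical rule of thumb: an allocation made while writing back dirty
    pages is at risk when the cgroup is at its limit (so reclaim runs) and
    reclaim is allowed to do I/O, because that I/O may depend on the very
    writeback that is waiting for the allocation."""
    return cgroup_at_limit and reclaim_may_do_io(gfp)


for name, flags in (("GFP_KERNEL", GFP_KERNEL), ("GFP_NOFS", GFP_NOFS), ("GFP_NOIO", GFP_NOIO)):
    print(name, "at risk when at limit:", writeback_alloc_can_self_block(flags, True))
```

The model only captures the flag logic. Whether a given allocation actually deadlocks depends on what reclaim chooses to write back, which is why the reporters saw the hang only after many trips through the reclaim path.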
[Kernel-packages] [Bug 1764246] [NEW] kdump kernel panics on Bionic
Public bug reported:

The kdump/crashdump kernel is panicking during boot on Bionic.

1) Install the daily Bionic server or desktop ISO
2) apt install linux-crashdump, say yes to kdump being enabled
3) Reboot so as to boot with the correct kernel parameter
4) Run:
root@bionic-server:~# echo 1 > /proc/sys/kernel/sysrq
root@bionic-server:~# echo c > /proc/sysrq-trigger
5) Observe that the crashdump kernel panics before booting with an out-of-memory error. Log below.

If I replace the bionic image with the artful cloud image, and repeat steps 2-4, the crashdump kernel boots and successfully stores the vmcore.

The full log:

[ 54.424512] sysrq: SysRq : Trigger a crash
[ 54.427899] BUG: unable to handle kernel NULL pointer dereference at
[ 54.433915] IP: sysrq_handle_crash+0x16/0x20
[ 54.437157] PGD 0 P4D 0
[ 54.439292] Oops: 0002 [#1] SMP PTI
[ 54.444571] Modules linked in: snd_hda_codec_generic crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec snd_hda_core snd_hwdep input_leds joydev snd_pcm serio_raw snd_timer snd soundcore qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid qxl aesni_intel ttm drm_kms_helper aes_x86_64 crypto_simd cryptd glue_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse virtio_blk virtio_net drm i2c_piix4 pata_acpi floppy
[ 54.468925] CPU: 0 PID: 1075 Comm: bash Not tainted 4.15.0-15-generic #16-Ubuntu
[ 54.470377] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 54.472016] RIP: 0010:sysrq_handle_crash+0x16/0x20
[ 54.472891] RSP: 0018:a3a000643e30 EFLAGS: 00010286
[ 54.473826] RAX: 917e0950 RBX: 92787200 RCX:
[ 54.475092] RDX: RSI: 90dfbfc16498 RDI: 0063
[ 54.476182] RBP: a3a000643e30 R08: R09: 022b
[ 54.477272] R10: 0001 R11: 92b5280d R12: 0004
[ 54.478361] R13: 0063 R14: 0002 R15: 90dfbab0ef00
[ 54.479456] FS: 7ff6248af740() GS:90dfbfc0() knlGS:
[ 54.480602] CS: 0010 DS: ES: CR0: 80050033
[ 54.481379] CR2: CR3: 7bcb2006 CR4: 003606f0
[ 54.482341] Call Trace:
[ 54.482690] __handle_sysrq+0x9f/0x170
[ 54.483207] write_sysrq_trigger+0x34/0x40
[ 54.483775] proc_reg_write+0x45/0x70
[ 54.484281] __vfs_write+0x1b/0x40
[ 54.484751] vfs_write+0xb1/0x1a0
[ 54.485208] SyS_write+0x55/0xc0
[ 54.485669] do_syscall_64+0x73/0x130
[ 54.486156] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 54.486784] RIP: 0033:0x7ff623f84154
[ 54.487328] RSP: 002b:7ffe5f399678 EFLAGS: 0246 ORIG_RAX: 0001
[ 54.488272] RAX: ffda RBX: 0002 RCX: 7ff623f84154
[ 54.489179] RDX: 0002 RSI: 55a5a49151c0 RDI: 0001
[ 54.490313] RBP: 55a5a49151c0 R08: 000a R09: 0001
[ 54.491179] R10: 000a R11: 0246 R12: 7ff624260760
[ 54.492311] R13: 0002 R14: 7ff62425c2a0 R15: 7ff62425b760
[ 54.493124] Code: e7 e8 9f fb ff ff e9 c0 fe ff ff 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 c7 05 a8 a7 36 01 01 00 00 00 48 89 e5 0f ae f8 04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 c7 05 40 1f e8
[ 54.496461] RIP: sysrq_handle_crash+0x16/0x20 RSP: a3a000643e30
[ 54.497393] CR2:
[0.00] Linux version 4.15.0-15-generic (buildd@lgw01-amd64-050) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:58:14 UTC 2018 (Ubuntu 4.15.0-15.16-generic 4.15.15)
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-15-generic root=UUID=3e45b7ec-412a-11e8-a844-5254003896a5 ro maybe-ubiquity console=ttyS0 nr_cpus=1 systemd.unit=kdump-tools.service irqpoll nousb ata_piix.prefer_ms_hyperv=0 elfcorehdr=802164K
[0.00] KERNEL supported cpus:
[0.00] Intel GenuineIntel
[0.00] AMD AuthenticAMD
[0.00] Centaur CentaurHauls
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x1000-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x2900-0x30f5cfff]
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
Hi,

We do ship an iso for ppc64le for Trusty - I'm not sure whether it does bare metal/PowerNV or just as an LPAR under PowerVM, but it's probably a bit moot at this point.

The good news is that as you can see, the artful kernel was released with the fix. The Xenial kernel also contains the fix; I'm not sure why that wasn't auto-added to this bug, but the release notes contain this fix: https://launchpad.net/ubuntu/+source/linux/4.4.0-119.143

I am not sure what the final status of the patch in Trusty is, I will let you know when I find out.

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title: bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Trusty: Fix Committed
Status in linux source package in Xenial: Fix Committed
Status in linux source package in Artful: Fix Released

Bug description:

== SRU Justification ==

A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs.

We see the following crash sometimes when running netperf:

May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
... (dump of registers follows) ...

Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out.

This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack.

In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO. These segments can be very large which presents as a very large MSS (maximum segment size) or gso_size.

Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK.

However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to around 9700 bytes. Normally this is more than adequate. However, if a large packet with very large (>9700 byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the hardware will panic.

[Impact]
bnx2x card panics, requiring power cycle to restore functionality.

The workaround is turning off TSO, which prevents the crash as the kernel resegments *all* packets in software, not just ones that are too big. This has a performance cost.

[Fix]
Test packet size in bnx2x feature check path and disable GSO if it is too large. To do this we move a function from one file to another and add another in the networking core.

[Regression Potential]
A/B/X: The changes to the network core are easily reviewed. The changes to behaviour are limited to the bnx2x card driver. The most likely failure case is a false-positive on the size check, which would lead to a performance regression only.

T: This also involves a different change to the networking core to add the old-style GSO checking, which is more invasive. However the changes are simple and easily reviewed.

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715519/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
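The shape of the check described under [Fix] can be sketched in a few lines. This is a Python model under stated assumptions, not the driver code: the feature bit values, the names features_check and segmentation_path, and the exact comparison are hypothetical. The only figures taken from the bug description are the approximate 9700-byte hardware limit and the idea of clearing GSO for oversized segments so that the kernel segments them in software.

```python
HW_MAX_GSO_SIZE = 9700  # approximate hardware limit quoted in the bug description

# Hypothetical feature bits, for illustration only.
FEATURE_GSO = 0x1
FEATURE_CSUM = 0x2


def features_check(gso_size, features):
    """Return the feature set to use for one packet.

    If the packet's segment size exceeds what the hardware can handle,
    drop the GSO feature for this packet only; other features are kept.
    """
    if gso_size > HW_MAX_GSO_SIZE:
        features &= ~FEATURE_GSO
    return features


def segmentation_path(gso_size, features):
    """Say who would segment the packet: nobody, the hardware, or software."""
    if gso_size == 0:
        return "none"  # not a GSO packet
    if features_check(gso_size, features) & FEATURE_GSO:
        return "hardware"
    return "software"


device_features = FEATURE_GSO | FEATURE_CSUM
for size in (0, 1448, 9000, 60000):
    print(size, segmentation_path(size, device_features))
```

Unlike the workaround of turning TSO off globally, a per-packet check of this kind leaves normally sized segments on the hardware path, which is why the expected worst case is a performance regression from a false positive rather than a functional one.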
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
Fantastic! If I understand correctly, that is sufficient for verification-done-artful, so I am changing that over for you.

The one remaining kernel is Trusty 3.13. I am guessing your module doesn't compile for that? If it doesn't, there probably isn't much point in booting with just a virtual ethernet adaptor as the change is specifically to the bnx2x code.

Thanks again for your prompt testing efforts!

Regards,
Daniel

** Tags removed: verification-needed-artful
** Tags added: verification-done-artful

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title: bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Trusty: Fix Committed
Status in linux source package in Xenial: Fix Committed
Status in linux source package in Artful: Fix Committed
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
Hi,

Thanks for the Xenial test!

The kernel team process is that patches will always be committed from the most recent kernel first and then back to older kernels, so that no-one ends up with a regression if they upgrade to a more recent kernel. So if it is applied to Xenial it will be applied to Artful :) (FYI, it's already in the kernel that will be in Bionic next month.)

I don't know what you're compiling with DKMS; are you able to do any test at all without it? Just testing that the machine boots and that you can ping someone would make me more comfortable.

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title: bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu: Fix Released
Status in linux source package in Trusty: Fix Committed
Status in linux source package in Xenial: Fix Committed
Status in linux source package in Artful: Fix Committed
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
Hi, As well as Po-Hsu's comment above, I also have this internal update from the kernel team: As this is also a security fix, don't stress too much. If things could be verified for at least one of the kernels by next week, that is better than nothing. We are rather unlikely to rip out fixes if they have a CVE (and do not appear to regress other things) (FYI, this issue is covered by CVE-2018-126.) So, just to confirm, in order of priority: 1) First confirm there are no new regressions (just a quick 'smoke test') on the 3 kernels. 2) Second, do a full test of 1 of the kernels. 3) Test the remaining 2 kernels. Please keep this bug updated as you go through each step. Hopefully this helps reduce the pressure for you! Regards, Daniel -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1715519 Title: bnx2x_attn_int_deasserted3:4323 MC assert! Status in linux package in Ubuntu: Fix Released Status in linux source package in Trusty: Fix Committed Status in linux source package in Xenial: Fix Committed Status in linux source package in Artful: Fix Committed Bug description: SRU Justification = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. We see the following crash sometimes when running netperf: May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! 
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump - ... (dump of registers follows) ... Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out. This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack. In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO. These segments can be very large which presents as a very large MSS (maximum segment size) or gso_size. Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK. However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to around 9700 bytes. Normally this is more than adequate. 
However, if a large packet with very large (>9700 byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the hardware will panic. [Impact] bnx2x card panics, requiring power cycle to restore functionality. The workaround is turning off TSO, which prevents the crash as the kernel resegments *all* packets in software, not just ones that are too big. This has a performance cost. [Fix] Test packet size in bnx2x feature check path and disable GSO if it is too large. To do this we move a function from one file to another and add another in the networking core. [Regression Potential] A/B/X: The changes to the network core are easily reviewed. The changes to behaviour are limited to the bnx2x card driver. The most likely failure case is a false-positive on the size check, which would lead to a performance regression only. T: This also involves a different change to the networking core to add the old-style GSO checking, which is more invasive. However the changes are simple and easily reviewed. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715519/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
Hi, I am the support engineer on the Canonical side who has been working on this with IBM Support on your behalf. Apologies for the confusion. I will contact our kernel team now and get this clarified for you as soon as I can. Now, I can't speak for the kernel team or make any commitments on their behalf. However, I know the full test process is time-consuming, so in the mean time, if you are able to boot with each of the kernels and just quickly verify that there are no obvious regressions - just that boot succeeds and that the network card can still send and receive data - I think that would be a very good first step. Regards, Daniel
[Kernel-packages] [Bug 1745364] Re: x86/net/bpf: return statement missing value
I have tested this with the kernel bpf self-test, and it passes.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1745364

Title: x86/net/bpf: return statement missing value

Status in linux package in Ubuntu: Triaged
Status in linux source package in Xenial: Fix Committed

Bug description:

SRU Justification
=================

Coverity reports:

*** CID 1464330: Uninitialized variables (MISSING_RETURN)
/arch/x86/net/bpf_jit_comp.c: 1088 in bpf_int_jit_compile()
1082        int i;
1083
1084        if (!bpf_jit_enable)
1085                return prog;
1086
1087        if (!prog || !prog->len)
>>> CID 1464330: Uninitialized variables (MISSING_RETURN)
>>> Arriving at the end of a function without returning a value.
1088                return;
1089
1090        addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);
1091        if (!addrs)
1092                return prog;
1093

This is a result of 3098d8eae421 ("bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis"), which is a cherry-pick of d1c55ab5e41f upstream. In that patch, the return type of bpf_int_jit_compile was changed from void to struct bpf_prog*. That patch changed some of the return statements. It did not, however, change the return statement of the (!prog || !prog->len) check, as in upstream the (!prog || !prog->len) check was dropped in 93a73d442d37 ("bpf, x86/arm64: remove useless checks on prog"):

"""
There is never such a situation, where bpf_int_jit_compile() is called with either prog as NULL or len as 0, so the tests are unnecessary and confusing as people would just copy them.
"""

However, we haven't picked up 93a73d442d37, so when we cherry-picked d1c55ab5e41f, that branch remained unmodified, hence the static analysis warning.

Impact
======
If the branch is not dead and someone can hit it, an undefined value can be returned, which could cause issues.

Fix
===
For consistency and in case the branch is not actually dead on Xenial, we should do a fixup to 'return prog;'

Regression Potential
====================
Limited to the BPF jit which is off by default. Limited to a branch that should be dead code anyway. Limited to an error handling path.

To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1745364/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")
** Description changed:

  [SRU Justification]

  [Impact]
- On Artful kernels, X fails to start and a kernel splat is printed.
+ On Artful and Bionic kernels, X fails to start and a kernel splat is printed.

  This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so the kernel tries to execute code at NULL.

  [Fix]
- There is a discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html The fix hasn't landed yet and it looks like they're going to re-engineer the entire section instead.
- Rather than wait for that and deal with the massive regression
- potential, the fix I have picked to submit is very very minimal and
- touches only hibmc.
+ Bionic: There is a generic fix in 4.16 at
+ c67fa6edc8b11afe22c88a23963170bf5f151acf. It is part of a series that
+ applies this generic fix and does a bunch of cleanups; we can safely
+ just pick up the generic fix.
+
+ Artful: Rather than a generic fix, I have submitted a very very minimal
+ fix that only touches hibmc.

  [Regression Potential]
- Minimal - fix only touches hibmc driver. Tested on D05 board.
+ Artful: Minimal - fix only touches hibmc driver. Tested on D05 board.
+ Bionic: fix is to generic drm code, but is small and easily reviewable.

  [Testcase]
  Install patched kernel, try to start X. If it succeeds, the fix works. If there's a kernel splat, the fix does not work.

  [Notes]
- HiSilicon would really like this fix in Artful in such time so that when the next 16.04 point release ships in February, the HWE kernel will work with Xorg.
+ Artful: HiSilicon would really like this fix in Artful in such time so that when the next 16.04 point release ships, the HWE kernel will work with Xorg.
+
+ Bionic: no extra notes.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

Status in Linux: New
Status in linux package in Ubuntu: Confirmed
Status in linux source package in Artful: Fix Released

Bug description:

[SRU Justification]

[Impact]
On Artful and Bionic kernels, X fails to start and a kernel splat is printed.

This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so the kernel tries to execute code at NULL.

[Fix]
Bionic: There is a generic fix in 4.16 at c67fa6edc8b11afe22c88a23963170bf5f151acf. It is part of a series that applies this generic fix and does a bunch of cleanups; we can safely just pick up the generic fix.

Artful: Rather than a generic fix, I have submitted a very very minimal fix that only touches hibmc.

[Regression Potential]
Artful: Minimal - fix only touches hibmc driver. Tested on D05 board.
Bionic: fix is to generic drm code, but is small and easily reviewable.

[Testcase]
Install patched kernel, try to start X. If it succeeds, the fix works. If there's a kernel splat, the fix does not work.

[Notes]
Artful: HiSilicon would really like this fix in Artful in such time so that when the next 16.04 point release ships, the HWE kernel will work with Xorg.

Bionic: no extra notes.

To manage notifications about this bug go to: https://bugs.launchpad.net/linux/+bug/1738334/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")
Hi Fred, Thanks for the update. I have tried to nominate the bug for Bionic; I think the kernel team normally does this so we will see if that has worked. More importantly, I will test and send a patch for Bionic shortly. Regards, Daniel -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1738334 Title: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback") Status in Linux: New Status in linux package in Ubuntu: Confirmed Status in linux source package in Artful: Fix Released Bug description: [SRU Justification] [Impact] On Artful kernels, X fails to start and a kernel splat is printed. This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so the kernel tries to execute code at NULL. [Fix] There is a discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html The fix hasn't landed yet and it looks like they're going to re-engineer the entire section instead. Rather than wait for that and deal with the massive regression potential, the fix I have picked to submit is very very minimal and touches only hibmc. [Regression Potential] Minimal - fix only touches hibmc driver. Tested on D05 board. [Testcase] Install patched kernel, try to start X. If it succeeds, the fix works. If there's a kernel splat, the fix does not work. [Notes] HiSilicon would really like this fix in Artful in such time so that when the next 16.04 point release ships in February, the HWE kernel will work with Xorg. To manage notifications about this bug go to: https://bugs.launchpad.net/linux/+bug/1738334/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1644056] Re: kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/huge_memory.c:1931!
Hi all,

I think there are two issues at play here, one is the bad pmd one, and one is the original "huge_memory: mapcount 0 page_mapcount 1". Perhaps we could break the bad pmd issue out into a different LP bug?

People with the original bug - was anyone able to verify if this happened on a more recent kernel? My understanding of mm/huge_memory.c is that it was significantly refactored after 4.4, so I would be interested to hear if that makes the issue go away.

I think disabling transparent huge pages on boot should also make the issue go away if anyone is able to try that?

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1644056

Title: kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/huge_memory.c:1931!

Status in linux package in Ubuntu: Confirmed

Bug description:

Hi,

While running IO on the following kernel/Ubuntu version:

$ lsb_release -rd
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
$ uname -a
Linux 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 14.04.5 LTS \n \l

[1133672.985186] /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/pgtable-generic.c:33: bad pmd 881fd6790240(80004b8008e7)
[1135572.440941] huge_memory: mapcount 0 page_mapcount 1
[1135572.441607] [ cut here ]
[1135572.442059] kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/huge_memory.c:1931!
[1135572.442571] invalid opcode: [#1] SMP [1135572.443028] Modules linked in: intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm joydev input_leds sb_edac irqbypass crct10dif_pclmul crc32_pclmul aesni_intel edac_core aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd dm_multipath lpc_ich ipmi_ssif ipmi_devintf shpchp 8250_fintek mac_hid acpi_power_meter ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ses enclosure raid1 ast ttm ixgbe hid_generic igb vxlan drm_kms_helper syscopyarea ip6_udp_tunnel usbhid dca sysfillrect udp_tunnel sysimgblt hid mxm_wmi fb_sys_fops mpt3sas ptp drm ahci raid_class pps_core libahci i2c_algo_bit mdio scsi_transport_sas fjes wmi [1135572.448909] CPU: 15 PID: 2018 Comm: sh Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu [1135572.450082] Hardware name: Quanta Computer Inc. X-100.Column.01/S2PC-MB(Dual 1G LOM), BIOS S2P_3B04.HGT02 09/21/2016 [1135572.451346] task: 8814923ae040 ti: 882eba658000 task.ti: 882eba658000 [1135572.452494] RIP: 0010:[] [] __split_huge_page+0x691/0x6d0 [1135572.453580] RSP: 0018:882eba65b7e0 EFLAGS: 00010292 [1135572.454589] RAX: 0027 RBX: ea00012e RCX: [1135572.455742] RDX: 0001 RSI: 883fff3cdc78 RDI: 883fff3cdc78 [1135572.457040] RBP: 882eba65b860 R08: R09: 881fe93eaf00 [1135572.458271] R10: 03ff R11: 0ac1 R12: [1135572.459539] R13: 882eba65ba10 R14: ea00012e R15: ea00012e [1135572.460746] FS: 7fcf6b305740() GS:883fff3c() knlGS: [1135572.461972] CS: 0010 DS: ES: CR0: 80050033 [1135572.463237] CR2: 558db9da0bb8 CR3: 0021667eb000 CR4: 003406e0 [1135572.464515] DR0: DR1: DR2: [1135572.465831] DR3: DR6: fffe0ff0 DR7: 0400 [1135572.467177] Stack: [1135572.468502] 882eba65b840 811bdfe9 883fed2e51d0 [1135572.469769] 811c67c8 883fee9bb760 0007f43c9000 882eba65ba10 [1135572.471040] 7b6d 81c72fa0 883ff17340f4 ea00012e [1135572.472277] Call Trace: [1135572.473574] [] ? 
rmap_walk+0x239/0x2d0 [1135572.475079] [] split_huge_page_to_list+0x67/0xd0 [1135572.476473] [] add_to_swap+0x57/0x70 [1135572.477852] [] shrink_page_list+0x62c/0x770 [1135572.479246] [] shrink_inactive_list+0x1e9/0x500 [1135572.480688] [] shrink_lruvec+0x58e/0x730 [1135572.482077] [] ? __queue_work+0x130/0x350 [1135572.483615] [] ? __queue_work+0x130/0x350 [1135572.485078] [] shrink_zone+0xdc/0x2c0 [1135572.486661] [] do_try_to_free_pages+0x164/0x440 [1135572.488354] [] ? throttle_direct_reclaim+0x8d/0x230 [1135572.490068] [] try_to_free_pages+0xb5/0x170 [1135572.491535] [] __alloc_pages_nodemask+0x597/0xac0 [1135572.493287] [] alloc_kmem_pages_node+0x4d/0xd0 [1135572.495083] [] copy_process+0x185/0x1c70 [1135572.496792] [] ? from_kgid_munged+0x12/0x20 [1135572.498404] [] ? cp_new_stat+0x13d/0x160 [1135572.500116] []
[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")
Hi, I installed 4.13.0-35-generic from artful-proposed. The kernel boots and X starts fine, so this has passed verification. Regards, Daniel ** Tags removed: verification-needed-artful ** Tags added: verification-done-artful -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1738334 Title: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback") Status in linux package in Ubuntu: Confirmed Status in linux source package in Artful: Fix Committed
[Kernel-packages] [Bug 1748342] Re: cgroup: remove cgroup directory leading kernel crash in kill_css
Hi,

I'm happy to submit this patch to the kernel team, but I wanted to talk about the kernel process and ask a question first.

The way this process usually works is:

- patch submitted to kernel team
- kernel team checks patch and if they are happy with it, applies it to the kernel
- this is built into a "proposed" kernel.
- the bug is updated with the proposed kernel.
- someone - usually the bug reporter - must verify that the proposed kernel fixes the bug. There is usually a 5 working day window to do this.
- if the verification is done, the new kernel contains the fix. If verification is not done, the patch is not included in the released kernel.

I am not able to do the verification. If the kernel team provides a proposed kernel, are you or your customer able to verify it?

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1748342

Title: cgroup: remove cgroup directory leading kernel crash in kill_css

Status in linux package in Ubuntu: Incomplete

Bug description:

We got feedback from customer that cvm(cloud virtual machine) crashed when using kubelet updating container-service in ubuntu xenial. Logs show as follow. We find a patch (commit 33c35aa4817864e056fd772230b0c6b552e36ea2) in linux mainline, which can indeed fix this bug. But ubuntu-xenial.git has not merged it yet. Do you guys have a plan for merging?
--panic log- [2018-02-02 10:21:48][4397731.721563] BUG: unable to handle kernel paging request at 0001005c [2018-02-02 10:40:50][4397731.722666] IP: css_clear_dir+0x5/0x70 [2018-02-02 10:40:50][4397731.723261] PGD a12b067 [2018-02-02 10:40:50][4397731.723261] PUD 0 [2018-02-02 10:40:50][4397731.723628] [2018-02-02 10:40:50][4397731.724004] Oops: [#1] SMP [2018-02-02 10:40:50][4397731.724004] Modules linked in: xt_statistic nf_conntrack_netlink ebt_ip ebtable_filter ebtables veth xt_set ip_set_hash_net ip_set nfnetlink xt_nat xt_recent xt_mark ipt_REJ[2018-02-02 10:40:50]ECT nf_reject_ipv4 xt_tcpudp xt_comment ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_fil[2018-02-02 10:40:50]ter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs ppdev sb_edac edac_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev input_le[2018-02-02 10:40:50]ds serio_raw parport_pc parport i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 a[2018-02-02 10:40:50]sync_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath [2018-02-02 10:40:50][4397731.724004] linear cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt aesni_intel fb_sys_fops aes_x86_64 crypto_simd cryptd glue_helper psmouse virtio_blk virtio_n[2018-02-02 10:40:50]et drm pata_acpi floppy [2018-02-02 10:40:50][4397731.724004] CPU: 0 PID: 23347 Comm: kubelet Not tainted 4.10.0-32-generic #36~16.04.1-Ubuntu [2018-02-02 10:40:50][4397731.724004] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [2018-02-02 10:40:50][4397731.724004] task: 92abde59 task.stack: baa94165c000 [2018-02-02 10:40:50][4397731.724004] RIP: 0010:css_clear_dir+0x5/0x70 [2018-02-02 10:40:50][4397731.724004] RSP: 0018:baa94165fe10 EFLAGS: 00010206 [2018-02-02 10:40:50][4397731.724004] RAX: 
47fd40005d7b RBX: ffe8 RCX: 92abffc0fcec [2018-02-02 10:40:50][4397731.724004] RDX: 9b070800 RSI: 0206 RDI: ffe8 [2018-02-02 10:40:50][4397731.724004] RBP: baa94165fe20 R08: c8b18701 R09: 000180220017 [2018-02-02 10:40:50][4397731.724004] R10: 92abc8b187f8 R11: 92abf7751d00 R12: 92abd5601000 [2018-02-02 10:40:50][4397731.724004] R13: R14: 92abd5601150 R15: [2018-02-02 10:40:50][4397731.724004] FS: 7f6f92ffd700() GS:92abffc0() knlGS: [2018-02-02 10:40:50][4397731.724004] CS: 0010 DS: ES: CR0: 80050033 [2018-02-02 10:40:50][4397731.724004] CR2: 0001005c CR3: 280cb000 CR4: 000406f0 [2018-02-02 10:40:50][4397731.724004] Call Trace: [2018-02-02 10:40:50][4397731.724004] ? kill_css+0x12/0x60 [2018-02-02 10:40:50][4397731.724004] cgroup_destroy_locked+0xa5/0xf0 [2018-02-02 10:40:50][4397731.724004] cgroup_rmdir+0x2c/0x90 [2018-02-02 10:40:50][4397731.724004] kernfs_iop_rmdir+0x4d/0x80 [2018-02-02 10:40:50][4397731.724004] vfs_rmdir+0xb4/0x130 [2018-02-02 10:40:50][4397731.724004] do_rmdir+0x1c7/0x1e0 [2018-02-02 10:40:50][4397731.724004] SyS_unlinkat+0x22/0x30 [2018-02-02 10:40:50][4397731.724004]
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
** Description changed: SRU Justification = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. We see the following crash sometimes when running netperf: May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump - ... (dump of registers follows) ... Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out. This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack. In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO. These segments can be very large which presents as a very large MSS (maximum segment size) or gso_size. 
Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK. However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to around 9700 bytes. Normally this is more than adequate. However, if a large packet with very large (>9700 byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the hardware will panic. [Impact] bnx2x card panics, requiring power cycle to restore functionality. The workaround is turning off TSO, which prevents the crash as the kernel resegments *all* packets in software, not just ones that are too big. This has a performance cost. [Fix] Test packet size in bnx2x feature check path and disable GSO if it is - too large. + too large. To do this we move a function from one file to another and + add another in the networking core. [Regression Potential] - Limited to bnx2x card driver. + A/B/X: The changes to the network core are easily reviewed. The changes to behaviour are limited to the bnx2x card driver. The most likely failure case is a false-positive on the size check, which would lead to a performance regression only. + + T: This also involves a different change to the networking core to add + the old-style GSO checking, which is more invasive. However the changes + are simple and easily reviewed. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1715519 Title: bnx2x_attn_int_deasserted3:4323 MC assert! Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. 
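The [Fix] described above (test the packet size in the bnx2x feature check path and disable GSO if it is too large) can be sketched roughly as follows. This is only an illustration under stated assumptions: `sketch_pkt`, `SKETCH_F_GSO` and `sketch_features_check` are hypothetical stand-ins and are not the real sk_buff or netdev structures, and the 9700-byte limit is the approximate figure quoted in the description, not a value taken from the driver source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel types; not the real sk_buff/netdev API. */
#define SKETCH_F_GSO        (1u << 0)  /* device is allowed to segment this packet */
#define SKETCH_MAX_GSO_SIZE 9700u      /* "around 9700 bytes" per the description */

struct sketch_pkt {
    uint32_t len;       /* total length of the (possibly coalesced) packet */
    uint16_t gso_size;  /* per-segment size (MSS); 0 means not a GSO packet */
};

/*
 * Return the feature set the device may use for this one packet.
 * If the segment size exceeds what the hardware can handle, clear the
 * GSO bit so that the stack segments this packet in software instead.
 * Packets with an acceptable segment size keep hardware offload.
 */
static uint32_t sketch_features_check(const struct sketch_pkt *pkt,
                                      uint32_t features)
{
    assert(pkt != NULL);

    if (pkt->gso_size > SKETCH_MAX_GSO_SIZE)
        features &= ~SKETCH_F_GSO;

    return features;
}
```

The point of doing the check per packet is that only oversized packets lose offload, unlike the workaround of turning TSO off, which makes the kernel resegment all packets in software.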
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
This has been assigned CVE-2018-126.

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2018-126

** Description changed:

  SRU Justification
  =

  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs.

  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...

  Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out.

  This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack.

  In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO.
These segments can be very large which presents as a very large MSS (maximum segment size) or gso_size. Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK. However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to around 9700 bytes. Normally this is more than adequate. However, if a large packet with very large (>9700 byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the hardware will panic. - Impact - -- + [Impact] bnx2x card panics, requiring power cycle to restore functionality. The workaround is turning off TSO, which prevents the crash as the kernel resegments *all* packets in software, not just ones that are too big. This has a performance cost. - Fix - --- + [Fix] Test packet size in bnx2x feature check path and disable GSO if it is too large. - Regression Potential - + [Regression Potential] Limited to bnx2x card driver. The most likely failure case is a false-positive on the size check, which would lead to a performance regression only. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1715519 Title: bnx2x_attn_int_deasserted3:4323 MC assert! Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. We see the following crash sometimes when running netperf: May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! 
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
** Description changed: SRU Justification = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. We see the following crash sometimes when running netperf: May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump - ... (dump of registers follows) ... Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out. This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack. In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO. These segments can be very large which presents as a very large MSS (maximum segment size) or gso_size. 
Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK. However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to around 9700 bytes. Normally this is more than adequate. However, if a large packet with very large (>9700 byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the hardware will panic. Impact -- bnx2x card panics, requiring power cycle to restore functionality. The workaround is turning off TSO, which prevents the crash as the kernel resegments *all* packets in software, not just ones that are too big. This has a performance cost. - Fix --- - Test packet size in bnx2x feature check path. + Test packet size in bnx2x feature check path and disable GSO if it is + too large. Regression Potential Limited to bnx2x card driver. The most likely failure case is a false-positive on the size check, which would lead to a performance regression only. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1715519 Title: bnx2x_attn_int_deasserted3:4323 MC assert! Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. We see the following crash sometimes when running netperf: May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! 
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
** Description changed: - (This bug provides a place to track the progress of this issue upstream - and then in to Ubuntu.) + SRU Justification + = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. We see the following crash sometimes when running netperf: - May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! - May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2 - May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052 - May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1 - May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert - May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump - + May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! + May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2 + May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052 + May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1 + May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert + May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump - ... (dump of registers follows) ... Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. 
The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out. This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack. In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO. These segments can be very large which presents as a very large MSS (maximum segment size) or gso_size. Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK. However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to - around 9700 bytes. Normally this is more than adequate as jumbo frames - are limited to 9000 bytes. However, if a large packet with large (>9700 - byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the - hardware will panic. + around 9700 bytes. Normally this is more than adequate. However, if a + large packet with very large (>9700 byte) TCP segments arrives through + ibmveth, and is passed to bnx2x, the hardware will panic. - Turning off TSO prevents the crash as the kernel resegments the data and - assembles the packets in software. This has a performance cost. + Impact + -- - Clearly at the very least, bnx2x should not crash in this case. + bnx2x card panics, requiring power cycle to restore functionality. 
- One patch to do this was sent upstream: - https://www.spinics.net/lists/netdev/msg452932.html + The workaround is turning off TSO, which prevents the crash as the + kernel resegments *all* packets in software, not just ones that are too + big. This has a performance cost. + + + Fix + --- + + Test packet size in bnx2x feature check path. + + Regression Potential + + + Limited to bnx2x card driver. + The most likely failure case is a false-positive on the size check, which would lead to a performance regression only. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1715519 Title: bnx2x_attn_int_deasserted3:4323 MC assert! Status in linux package in Ubuntu: Confirmed Bug description: SRU Justification = A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch
[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!
A set of 2 patches to fix this was accepted upstream: https://github.com/torvalds/linux/commit/2b16f048729bf35e6c28a40cbfad07239f9dcd90 https://github.com/torvalds/linux/commit/8914a595110a6eca69a5e275b323f5d09e18f4f9 I will send an SRU shortly. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1715519 Title: bnx2x_attn_int_deasserted3:4323 MC assert! Status in linux package in Ubuntu: Confirmed Bug description: (This bug provides a place to track the progress of this issue upstream and then in to Ubuntu.) A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs. We see the following crash sometimes when running netperf: May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1 May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump - ... (dump of registers follows) ... Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. 
Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out. This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack. In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO. These segments can be very large which presents as a very large MSS (maximum segment size) or gso_size. Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK. However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to around 9700 bytes. Normally this is more than adequate as jumbo frames are limited to 9000 bytes. However, if a large packet with large (>9700 byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the hardware will panic. Turning off TSO prevents the crash as the kernel resegments the data and assembles the packets in software. This has a performance cost. Clearly at the very least, bnx2x should not crash in this case. One patch to do this was sent upstream: https://www.spinics.net/lists/netdev/msg452932.html To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715519/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1728489] Re: tar -x sometimes fails on overlayfs
** Changed in: linux (Ubuntu)
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1728489

Title:
  tar -x sometimes fails on overlayfs

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Zesty:
  Fix Released

Bug description:
  [SRU Justification]

  [Impact]
  A user is seeing failures from extracting tar archives on overlay filesystems on the 4.4 kernel in constrained environments. The error presents as:

    `tar: ./deps/0/bin: Directory renamed before its status could be extracted`

  Following this thread (http://www.spinics.net/lists/linux-unionfs/msg00856.html), it appears that this occurs when entries in the kernel's inode cache are reclaimed, and subsequent lookups return new inode numbers.

  Further testing showed that when setting `/proc/sys/vm/vfs_cache_pressure` to 0 (don't allow the kernel to reclaim inode cache entries due to memory pressure) the error does not recur, supporting the hypothesis that cache entries are being evicted. However, this setting may lead to a kernel OOM, so it is not a reasonable workaround even temporarily.

  The error cannot be reproduced on a 4.13 kernel, due to the series at https://www.spinics.net/lists/linux-fsdevel/msg110235.html. The particular relevant commit is b7a807dc2010334e62e0afd89d6f7a8913eb14ff, which needs a couple of dependencies.

  [Fix]
  For Zesty, backport the entire series.
  For Xenial, where a full backport is not feasible, backport the key commit and the short list of dependencies.

  [Testcase]
  # Testing this bug
  The testcase for this particular bug is simple - create an overlay filesystem with all layers on the same underlying file system, and then see if the inode of a directory is constant across dropping the caches:

    mkdir -p /upper/upper /upper/work /lower
    mount -t overlay none /mnt -o lowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work
    cd /mnt
    mkdir a
    stat a  # observe inode number
    echo 2 > /proc/sys/vm/drop_caches
    stat a  # compare inode number

  If the inode number is the same, the fix is successful.

  # Regression testing
  I have run the unionmount test suite from http://git.infradead.org/users/dhowells/unionmount-testsuite.git in overlay mode (./run --ov), and verified that it still passes.

  (The series cover letter mentions a fork of the test suite at https://github.com/amir73il/unionmount-testsuite/commits/overlayfs-devel. I have *not* attempted to get this running: it assumes a range of changes that are not present in our kernels.)

  [Regression Potential]
  As this changes overlayfs, there is potential for regression in the form of unexpected breakages to overlayfs behaviour. I think this is adequately addressed by the regression testing.

  One option to reduce the regression potential on Zesty is to reduce the set of patches applied - rather than including the whole series we could include just the patches to solve this bug, which are much easier to inspect for correctness.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1728489/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
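The inode comparison in the testcase above could also be done programmatically. The following is a minimal sketch, assuming a POSIX system; `inode_of` and `inode_unchanged` are hypothetical helper names introduced here for illustration. Dropping the caches between the two observations still has to be done separately as root (for example with the `echo 2 > /proc/sys/vm/drop_caches` step from the testcase).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Store the inode number of 'path' in *ino. Returns 0 on success, -1 on error. */
static int inode_of(const char *path, ino_t *ino)
{
    struct stat st;

    assert(path != NULL && ino != NULL);

    if (stat(path, &st) != 0)
        return -1;

    *ino = st.st_ino;
    return 0;
}

/*
 * Compare the current inode number of 'path' with one observed earlier.
 * Returns 1 if it is unchanged, 0 if it differs, -1 if stat() failed.
 */
static int inode_unchanged(const char *path, ino_t before)
{
    ino_t now;

    if (inode_of(path, &now) != 0)
        return -1;

    return now == before ? 1 : 0;
}
```

A caller would record the inode of `/mnt/a` with `inode_of`, drop the caches, and then expect `inode_unchanged` to return 1 on a fixed kernel.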
[Kernel-packages] [Bug 1745364] [NEW] x86/net/bpf: return statement missing value
Public bug reported:

SRU Justification
=================

Coverity reports:

*** CID 1464330:  Uninitialized variables  (MISSING_RETURN)
/arch/x86/net/bpf_jit_comp.c: 1088 in bpf_int_jit_compile()
1082            int i;
1083
1084            if (!bpf_jit_enable)
1085                    return prog;
1086
1087            if (!prog || !prog->len)
>>>     CID 1464330:  Uninitialized variables  (MISSING_RETURN)
>>>     Arriving at the end of a function without returning a value.
1088                    return;
1089
1090            addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);
1091            if (!addrs)
1092                    return prog;
1093

This is a result of 3098d8eae421 ("bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis"), which is a cherry-pick of d1c55ab5e41f upstream. In that patch, the return type of bpf_int_jit_compile was changed from void to struct bpf_prog*. That patch changed some of the return statements. It did not, however, change the return statement of the (!prog || !prog->len) check, as in upstream the (!prog || !prog->len) check was dropped in 93a73d442d37 ("bpf, x86/arm64: remove useless checks on prog"):

"""
There is never such a situation, where bpf_int_jit_compile() is called with either prog as NULL or len as 0, so the tests are unnecessary and confusing as people would just copy them.
"""

However, we haven't picked up 93a73d442d37, so when we cherry-picked d1c55ab5e41f, that branch remained unmodified, hence the static analysis warning.

Impact
======
If the branch is not dead and someone can hit it, an undefined value can be returned, which could cause issues.

Fix
===
For consistency and in case the branch is not actually dead on Xenial, we should do a fixup to 'return prog;'

Regression Potential
====================
Limited to the BPF jit which is off by default.
Limited to a branch that should be dead code anyway.
Limited to an error handling path.
** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Confirmed

** Description changed:

+ SRU Justification
+ =================
+

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1745364

Title:
  x86/net/bpf: return statement missing value

Status in linux package in Ubuntu:
  Confirmed
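The proposed fixup (turning the bare `return;` into `return prog;`) can be illustrated outside the kernel with a small sketch. Everything here is a simplified assumption: `sketch_prog`, `sketch_jit_enable` and `sketch_jit_compile` are hypothetical names, and the body only mimics the shape of the early-exit checks quoted in the Coverity report, not the real JIT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct bpf_prog. */
struct sketch_prog {
    unsigned int len;   /* number of instructions */
    int jited;          /* set once "compilation" has happened */
};

static int sketch_jit_enable = 1;

/*
 * Once the return type is a pointer rather than void, every exit path
 * must return a value. Returning the unmodified program on the early
 * exits means the caller always receives a well-defined pointer.
 */
static struct sketch_prog *sketch_jit_compile(struct sketch_prog *prog)
{
    if (!sketch_jit_enable)
        return prog;

    if (!prog || !prog->len)
        return prog;    /* the fixup: previously a bare 'return;' */

    prog->jited = 1;    /* placeholder for the real work */
    return prog;
}
```

With the fixup, a NULL or empty program is handed back unchanged instead of leaving the return value undefined.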
[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")
I have talked to the kernel team about this and updated Fred off-line.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [SRU Justification]

  [Impact]
  On Artful kernels, X fails to start and a kernel splat is printed.

  This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so the kernel tries to execute code at NULL.

  [Fix]
  There is a discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
  The fix hasn't landed yet and it looks like they're going to re-engineer the entire section instead.

  Rather than wait for that and deal with the massive regression potential, the fix I have picked to submit is very minimal and touches only hibmc.

  [Regression Potential]
  Minimal - fix only touches hibmc driver. Tested on D05 board.

  [Testcase]
  Install patched kernel, try to start X. If it succeeds, the fix works. If there's a kernel splat, the fix does not work.

  [Notes]
  HiSilicon would really like this fix to land in Artful in time for the next 16.04 point release in February, so that the HWE kernel works with Xorg.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
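The failure mode described in [Impact] (core code calling a driver callback that one driver never set) can be sketched in isolation. This is an assumption-laden illustration and not the actual patch: `sketch_bo`, `sketch_driver_ops`, `sketch_default_io_mem_pfn` and `sketch_core_lookup` are hypothetical names, and the pfn arithmetic is a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer object: only what the sketch needs. */
struct sketch_bo {
    unsigned long base_pfn;
};

/* Hypothetical per-driver operations table with one callback. */
struct sketch_driver_ops {
    unsigned long (*io_mem_pfn)(const struct sketch_bo *bo,
                                unsigned long page_offset);
};

/* A generic implementation a driver can point its callback at. */
static unsigned long sketch_default_io_mem_pfn(const struct sketch_bo *bo,
                                               unsigned long page_offset)
{
    return bo->base_pfn + page_offset;
}

/*
 * Core code calls through the pointer. If a driver leaves the callback
 * unset, this is a call to address NULL; the assert only makes that
 * visible in the sketch.
 */
static unsigned long sketch_core_lookup(const struct sketch_driver_ops *ops,
                                        const struct sketch_bo *bo,
                                        unsigned long page_offset)
{
    assert(ops != NULL && ops->io_mem_pfn != NULL);
    return ops->io_mem_pfn(bo, page_offset);
}

/* The minimal, driver-local style of fix: populate the callback. */
static const struct sketch_driver_ops sketch_fixed_ops = {
    .io_mem_pfn = sketch_default_io_mem_pfn,
};
```

A fix of this shape touches only the one driver's table, which is consistent with the regression potential being described as limited to hibmc.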
Re: [Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA
Hi Frank, Yes, that is how I see it - these changes can go through, but we need good docs to point people to as there is an incredibly high likelihood of misconfiguration at various points. Regards, Daniel On Tue, Dec 19, 2017 at 3:46 AM, Frank Heimes <1692...@bugs.launchpad.net> wrote: > Siva and Daniel, may I just ask where we are on this? > Well it looks to me that Siva/IBM sees this more as a miss-configuration, so > that the changes in comment #18 are _not_ needed. Daniel, do you see it now > the same way? > But in this case this needs to be documented somewhere, so that we can point > customers, too it - right? > > -- > You received this bug notification because you are subscribed to the bug > report. > https://bugs.launchpad.net/bugs/1692538 > > Title: > Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA > > Status in The Ubuntu-power-systems project: > In Progress > Status in linux package in Ubuntu: > Fix Released > Status in linux source package in Xenial: > In Progress > Status in linux source package in Zesty: > Fix Released > Status in linux source package in Artful: > Fix Released > > Bug description: > > == SRU Justification == > Commit 66aa0678ef is request to fix four issues with the ibmveth driver. > The issues are as follows: > - Issue 1: ibmveth doesn't support largesend and checksum offload features > when configured as "Trunk". > - Issue 2: SYN packet drops seen at destination VM. When the packet > originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO > server's inbound Trunk ibmveth, on validating "checksum good" bits in > ibmveth > receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY > flag. > - Issue 3: First packet of a TCP connection will be dropped, if there is > no OVS flow cached in datapath. > - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list. > > The details for the fixes to these issues are described in the commits > git log. 
> > > > == Comment: #0 - BRYANT G. LY- 2017-05-22 08:40:16 == > ---Problem Description--- > >- Issue 1: ibmveth doesn't support largesend and checksum offload features > when configured as "Trunk". Driver has explicit checks to prevent > enabling these offloads. > >- Issue 2: SYN packet drops seen at destination VM. When the packet > originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to > IO server's inbound Trunk ibmveth, on validating "checksum good" bits > in ibmveth receive routine, SKB's ip_summed field is set with > CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux > Bridge) and delivered to outbound Trunk ibmveth. At this point the > outbound ibmveth transmit routine will not set "no checksum" and > "checksum good" bits in transmit buffer descriptor, as it does so only > when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets > delivered to destination VM, TCP layer receives the packet with checksum > value of 0 and with no checksum related flags in ip_summed field. This > leads to packet drops. So, TCP connections never goes through fine. > >- Issue 3: First packet of a TCP connection will be dropped, if there is > no OVS flow cached in datapath. OVS while trying to identify the flow, > computes the checksum. The computed checksum will be invalid at the > receiving end, as ibmveth transmit routine zeroes out the pseudo > checksum value in the packet. This leads to packet drop. > >- Issue 4: ibmveth driver doesn't have support for SKB's with frag_list. > When Physical NIC has GRO enabled and when OVS bridges these packets, > OVS vport send code will end up calling dev_queue_xmit, which in turn > calls validate_xmit_skb. > In validate_xmit_skb routine, the larger packets will get segmented into > MSS sized segments, if SKB has a frag_list and if the driver to which > they are delivered to doesn't support NETIF_F_FRAGLIST feature. > > Contact Information = Bryant G. 
Ly/b...@us.ibm.com > > ---uname output--- > 4.8.0-51.54 > > Machine Type = p8 > > ---Debugger--- > A debugger is not configured > > ---Steps to Reproduce--- >Increases performance greatly > > The patch has been accepted upstream: > https://patchwork.ozlabs.org/patch/764533/ > > To manage notifications about this bug go to: > https://bugs.launchpad.net/ubuntu-power-systems/+bug/1692538/+subscriptions -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1692538 Title: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA Status in The Ubuntu-power-systems project: In Progress Status in linux package in Ubuntu: Fix Released Status in linux source package in Xenial: In Progress Status in linux source package in Zesty:
[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")
** Description changed:

- ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the
- hisilicon hibmc driver does not contain the callback and so X does not
- start.
+ [SRU Justification]

- Discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
+ [Impact]
+ On Artful kernels, X fails to start and a kernel splat is printed.

- This affects Artful, upstream has not landed on a solution yet as far as
- I can tell, so lets backport the first proposed small fix.
+ This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is
+ incomplete: the hisilicon hibmc driver does not contain the callback and
+ so the kernel tries to execute code at NULL.
+
+ [Fix]
+ There is a discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html The fix hasn't landed yet and it looks like they're going to re-engineer the entire section instead.
+
+ Rather than wait for that and deal with the massive regression
+ potential, the fix I have picked to submit is very very minimal and
+ touches only hibmc.
+
+ [Regression Potential]
+ Minimal - fix only touches hibmc driver. Tested on D05 board.
+
+ [Testcase]
+ Install patched kernel, try to start X. If it succeeds, the fix works. If there's a kernel splat, the fix does not work.
+
+ [Notes]
+ HiSilicon would really like this fix in Artful in such time so that when the next 16.04 point release ships in February, the HWE kernel will work with Xorg.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [SRU Justification]

  [Impact]
  On Artful kernels, X fails to start and a kernel splat is printed.
This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so the kernel tries to execute code at NULL.

[Fix]
There is a discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
The fix hasn't landed yet and it looks like they're going to re-engineer the entire section instead.

Rather than wait for that and deal with the massive regression potential, the fix I have picked to submit is minimal and touches only hibmc.

[Regression Potential]
Minimal - fix only touches hibmc driver. Tested on D05 board.

[Testcase]
Install patched kernel, try to start X. If it succeeds, the fix works. If there's a kernel splat, the fix does not work.

[Notes]
HiSilicon would really like this fix in Artful in time for the next 16.04 point release, which ships in February, so that the HWE kernel will work with Xorg.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
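The failure mode described under [Impact] (a driver leaving an optional callback unset, and the core then jumping to NULL) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct bo_driver`, `default_io_mem_pfn` and `lookup_pfn` are hypothetical names invented for illustration, not the TTM API, and falling back to a default is one possible guard, not necessarily what the submitted hibmc patch does.

```c
#include <stddef.h>

/* Hypothetical stand-in for a driver ops table; not the real TTM struct. */
struct bo_driver {
    /* Optional callback: maps a page offset to a page frame number. */
    unsigned long (*io_mem_pfn)(unsigned long base_pfn,
                                unsigned long page_offset);
};

/* Hypothetical default used when a driver leaves the callback unset. */
static unsigned long default_io_mem_pfn(unsigned long base_pfn,
                                        unsigned long page_offset)
{
    return base_pfn + page_offset;
}

/*
 * Calling drv->io_mem_pfn directly while it is NULL would jump to
 * address 0, which is the kind of splat the report describes.
 * Checking the pointer first avoids that.
 */
static unsigned long lookup_pfn(const struct bo_driver *drv,
                                unsigned long base_pfn,
                                unsigned long page_offset)
{
    if (drv->io_mem_pfn != NULL)
        return drv->io_mem_pfn(base_pfn, page_offset);
    return default_io_mem_pfn(base_pfn, page_offset);
}
```

In this model a driver that never fills in `io_mem_pfn` still gets a usable result from `lookup_pfn()`, whereas an unguarded indirect call would fault.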
[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")
Confirmed - the symptom is a kernel splat about "Attempting to execute userspace memory" triggered by Xorg with LR in ttm_bo_vm_fault - see attached screenshot (sorry!)

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

** Attachment added: "splat.png"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+attachment/5022893/+files/splat.png

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so X does not start.

  Discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html

  This affects Artful, upstream has not landed on a solution yet as far as I can tell, so let's backport the first proposed small fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
Re: [Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID
Hi Fred, The artful repository is git://kernel.ubuntu.com/ubuntu/ubuntu- artful.git It contains 4417ec7a7c8d ("UBUNTU: SAUCE: PCI: Support hibmc VGA cards behind a misbehaving HiSilicon bridge") This was an earlier version of those patches and should allow xorg autoconfiguration to work. Regards, Daniel On Fri, Dec 15, 2017 at 6:38 PM, Fred Kimmywrote: > hi daniel: > > whether this following mainline patchset have merge into this artful branch > or not? > If do not merge this patchset, this xwindow function will fail it. > > Can you confirm it and provide this artful branch in order to test it > for me > > 505a1b5 vgaarb: Factor out EFI and fallback default device selection > a37c0f4 vgaarb: Select a default VGA device even if there's no legacy VGA > > -- > You received this bug notification because you are subscribed to the bug > report. > https://bugs.launchpad.net/bugs/1698700 > > Title: > hibmc driver does not include "pci:" prefix in bus ID > > Status in linux package in Ubuntu: > Incomplete > Status in linux source package in Zesty: > Fix Released > Status in linux source package in Artful: > Fix Released > > Bug description: > SRU Justification > > [Impact] > On the HiSilicon D05 (arm64) board, X crashes when started. [0] > > [Fix] > The crash is attributable to the bus ID that the hibmc driver reports for > the hibmc graphics card on the board. In particular, the bus id is missing > the "pci:" prefix that most other cards provide: [1] > - The busid reported on the arm64 system is "0007:a1:00.0" > - The busid reported on a amd64 system is "pci::00:02.0" > > X tests for this prefix. A missing prefix for PCI cards leads to an > Xorg crash. > > Fix this by using the set_pci_busid function from the DRM core. > > [Testcase] > Successfully tested on a D05 board. [2] > > [Regression Potential] > Changes are limited to the hibmc driver, so any regression should also be > limited to that driver. > > [Notes] > I submitted the patch upstream. 
However, upstream is refactoring the drm > core, and set_busid is going away. That does fix this issue but the > regression potential of the refactor is enormous, so this seems like the > wiser approach. [3] > > [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 > [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16 > [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29 > [3]: https://www.spinics.net/lists/dri-devel/msg143831.html > > To manage notifications about this bug go to: > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1698700 Title: hibmc driver does not include "pci:" prefix in bus ID Status in linux package in Ubuntu: Incomplete Status in linux source package in Zesty: Fix Released Status in linux source package in Artful: Fix Released Bug description: SRU Justification [Impact] On the HiSilicon D05 (arm64) board, X crashes when started. [0] [Fix] The crash is attributable to the bus ID that the hibmc driver reports for the hibmc graphics card on the board. In particular, the bus id is missing the "pci:" prefix that most other cards provide: [1] - The busid reported on the arm64 system is "0007:a1:00.0" - The busid reported on a amd64 system is "pci::00:02.0" X tests for this prefix. A missing prefix for PCI cards leads to an Xorg crash. Fix this by using the set_pci_busid function from the DRM core. [Testcase] Successfully tested on a D05 board. [2] [Regression Potential] Changes are limited to the hibmc driver, so any regression should also be limited to that driver. [Notes] I submitted the patch upstream. However, upstream is refactoring the drm core, and set_busid is going away. That does fix this issue but the regression potential of the refactor is enormous, so this seems like the wiser approach. 
[3] [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16 [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29 [3]: https://www.spinics.net/lists/dri-devel/msg143831.html To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
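The prefix test that the bug description attributes to X can be modelled in a few lines. This is a minimal sketch, assuming a plain string comparison; `busid_has_pci_prefix` is a hypothetical helper written for illustration and is not taken from the Xorg or DRM sources.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true when a bus ID string starts with the "pci:" prefix. */
static bool busid_has_pci_prefix(const char *busid)
{
    return busid != NULL && strncmp(busid, "pci:", 4) == 0;
}
```

With this helper, the arm64 bus ID quoted in the report, "0007:a1:00.0", fails the check, while any ID beginning with "pci:" passes it.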
[Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID
There is another bug causing an artful regression - opening a new LP for that: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334 ** Changed in: linux (Ubuntu Artful) Status: Incomplete => Fix Released -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1698700 Title: hibmc driver does not include "pci:" prefix in bus ID Status in linux package in Ubuntu: Incomplete Status in linux source package in Zesty: Fix Released Status in linux source package in Artful: Fix Released Bug description: SRU Justification [Impact] On the HiSilicon D05 (arm64) board, X crashes when started. [0] [Fix] The crash is attributable to the bus ID that the hibmc driver reports for the hibmc graphics card on the board. In particular, the bus id is missing the "pci:" prefix that most other cards provide: [1] - The busid reported on the arm64 system is "0007:a1:00.0" - The busid reported on a amd64 system is "pci::00:02.0" X tests for this prefix. A missing prefix for PCI cards leads to an Xorg crash. Fix this by using the set_pci_busid function from the DRM core. [Testcase] Successfully tested on a D05 board. [2] [Regression Potential] Changes are limited to the hibmc driver, so any regression should also be limited to that driver. [Notes] I submitted the patch upstream. However, upstream is refactoring the drm core, and set_busid is going away. That does fix this issue but the regression potential of the refactor is enormous, so this seems like the wiser approach. 
[3] [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16 [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29 [3]: https://www.spinics.net/lists/dri-devel/msg143831.html To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1738334] [NEW] hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")
Public bug reported:

ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so X does not start.

Discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html

This affects Artful, upstream has not landed on a solution yet as far as I can tell, so let's backport the first proposed small fix.

** Affects: linux (Ubuntu)
   Importance: Undecided
   Assignee: Daniel Axtens (daxtens)
   Status: New

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

Status in linux package in Ubuntu:
  New

Bug description:
  ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the hisilicon hibmc driver does not contain the callback and so X does not start.

  Discussion and potential fix at https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html

  This affects Artful, upstream has not landed on a solution yet as far as I can tell, so let's backport the first proposed small fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
Re: [Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID
Hi Fred, I will have a look soon and update you. Regards, Daniel On Mon, Dec 11, 2017 at 6:00 PM, Fred Kimmywrote: > this patch will solve commit #10 bug, please merge this patch. > > thank you > > -- > You received this bug notification because you are subscribed to the bug > report. > https://bugs.launchpad.net/bugs/1698700 > > Title: > hibmc driver does not include "pci:" prefix in bus ID > > Status in linux package in Ubuntu: > Incomplete > Status in linux source package in Zesty: > Fix Released > Status in linux source package in Artful: > Incomplete > > Bug description: > SRU Justification > > [Impact] > On the HiSilicon D05 (arm64) board, X crashes when started. [0] > > [Fix] > The crash is attributable to the bus ID that the hibmc driver reports for > the hibmc graphics card on the board. In particular, the bus id is missing > the "pci:" prefix that most other cards provide: [1] > - The busid reported on the arm64 system is "0007:a1:00.0" > - The busid reported on a amd64 system is "pci::00:02.0" > > X tests for this prefix. A missing prefix for PCI cards leads to an > Xorg crash. > > Fix this by using the set_pci_busid function from the DRM core. > > [Testcase] > Successfully tested on a D05 board. [2] > > [Regression Potential] > Changes are limited to the hibmc driver, so any regression should also be > limited to that driver. > > [Notes] > I submitted the patch upstream. However, upstream is refactoring the drm > core, and set_busid is going away. That does fix this issue but the > regression potential of the refactor is enormous, so this seems like the > wiser approach. 
[3] > > [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 > [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16 > [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29 > [3]: https://www.spinics.net/lists/dri-devel/msg143831.html > > To manage notifications about this bug go to: > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1698700 Title: hibmc driver does not include "pci:" prefix in bus ID Status in linux package in Ubuntu: Incomplete Status in linux source package in Zesty: Fix Released Status in linux source package in Artful: Incomplete Bug description: SRU Justification [Impact] On the HiSilicon D05 (arm64) board, X crashes when started. [0] [Fix] The crash is attributable to the bus ID that the hibmc driver reports for the hibmc graphics card on the board. In particular, the bus id is missing the "pci:" prefix that most other cards provide: [1] - The busid reported on the arm64 system is "0007:a1:00.0" - The busid reported on a amd64 system is "pci::00:02.0" X tests for this prefix. A missing prefix for PCI cards leads to an Xorg crash. Fix this by using the set_pci_busid function from the DRM core. [Testcase] Successfully tested on a D05 board. [2] [Regression Potential] Changes are limited to the hibmc driver, so any regression should also be limited to that driver. [Notes] I submitted the patch upstream. However, upstream is refactoring the drm core, and set_busid is going away. That does fix this issue but the regression potential of the refactor is enormous, so this seems like the wiser approach. 
[3] [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16 [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29 [3]: https://www.spinics.net/lists/dri-devel/msg143831.html To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID
The patch does seem to be in Artful, following up with the user. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1698700 Title: hibmc driver does not include "pci:" prefix in bus ID Status in linux package in Ubuntu: Incomplete Status in linux source package in Zesty: Fix Released Status in linux source package in Artful: Incomplete Bug description: SRU Justification [Impact] On the HiSilicon D05 (arm64) board, X crashes when started. [0] [Fix] The crash is attributable to the bus ID that the hibmc driver reports for the hibmc graphics card on the board. In particular, the bus id is missing the "pci:" prefix that most other cards provide: [1] - The busid reported on the arm64 system is "0007:a1:00.0" - The busid reported on a amd64 system is "pci::00:02.0" X tests for this prefix. A missing prefix for PCI cards leads to an Xorg crash. Fix this by using the set_pci_busid function from the DRM core. [Testcase] Successfully tested on a D05 board. [2] [Regression Potential] Changes are limited to the hibmc driver, so any regression should also be limited to that driver. [Notes] I submitted the patch upstream. However, upstream is refactoring the drm core, and set_busid is going away. That does fix this issue but the regression potential of the refactor is enormous, so this seems like the wiser approach. 
[3] [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16 [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29 [3]: https://www.spinics.net/lists/dri-devel/msg143831.html To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1729119] Re: NVMe timeout is too short
** Description changed:

  [SRU Justification]

  [Impact]
- Some NVMe operations time out too quickly. The module parameters allow the timeouts to be extended, but only up to 255s, as the counters are bytes.
+ Some NVMe operations time out too quickly. The module parameters allow the timeouts to be extended, but only up to 255s, as the counters are bytes.

  [Fix]
  The underlying parameters are unsigned ints, so make the module parameters unsigned ints too, by picking patch http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html

+ (Trusty specific) This also requires picking the patch that converts
+ the constant into a parameter, which is a clean cherry-pick.
+
  [Regression Potential]
- Very limited: only types of module parameters are changing, the patch is easily reviewable.
+ (X/Z/A) Very limited: only types of module parameters are changing, the patch is easily reviewable.
+
+ (Trusty specific) Limited: a module parameter is added and its type is
+ changed. The patches are easily reviewable.
+
+ [Testing]
+ (Trusty only) Boot tested on a c5.large instance on AWS which uses
+ NVMe to boot. Verified that the system still boots with the patches,
+ and that a timeout of 123456s is permitted.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1729119

Title:
  NVMe timeout is too short

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux-aws source package in Xenial:
  Fix Released
Status in linux source package in Zesty:
  Fix Committed
Status in linux-aws source package in Zesty:
  Invalid
Status in linux source package in Artful:
  Fix Committed
Status in linux-aws source package in Artful:
  Invalid

Bug description:
  [SRU Justification]

  [Impact]
  Some NVMe operations time out too quickly.
The module parameters allow the timeouts to be extended, but only up to 255s, as the counters are bytes. [Fix] The underlying parameters are unsigned ints, so make the module parameters unsigned ints too, by picking patch http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html (Trusty specific) This also requires picking the patch that converts the constant into a parameter, which is a clean cherry-pick. [Regression Potential] (X/Z/A) Very limited: only types of module parameters are changing, the patch is easily reviewable. (Trusty specific) Limited: a module parameter is added and its type is changed. The patches are easily reviewable. [Testing] (Trusty only) Boot tested on a c5.large instance on AWS which uses NVMe to boot. Verified that the system still boots with the patches, and that a timeout of 123456s is permitted. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
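The 255s ceiling mentioned in [Impact] follows directly from the width of a byte. The sketch below only models that range limit in userspace; `fits_in_byte` and `fits_in_uint` are hypothetical helpers for illustration and are not part of the NVMe driver or of the referenced patch.

```c
#include <limits.h>
#include <stdbool.h>

/* Would a timeout in seconds fit in a byte-sized parameter? */
static bool fits_in_byte(unsigned long seconds)
{
    return seconds <= UCHAR_MAX;
}

/* Would the same timeout fit in an unsigned int parameter? */
static bool fits_in_uint(unsigned long seconds)
{
    return seconds <= UINT_MAX;
}
```

On platforms with 8-bit bytes, the 123456s value used in [Testing] is rejected by `fits_in_byte()` but accepted by `fits_in_uint()`, which matches the behaviour change the description reports.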
[Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA
Hi Siva,

Thank you for your quick and thoughtful response.

I will ask about the default MTU for the veth interface to see if the user increased it themselves.

I'm not sure I completely understand what you mean about largesend offload being disabled after retransmits. I'm also not completely sure if it's largesend offload or just large packets that are causing issues. If I have understood correctly (e.g. https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/tcp_large_send_offload.htm) large-send offload is what Linux would call TCP Segmentation Offload (TSO) - does that match your understanding?

Here's my concern. The code I'm looking at (let's look at Zesty, so v4.10) is in ibmveth.c, ibmveth_poll(). There we see:

    if (length > netdev->mtu + ETH_HLEN) {
        ibmveth_rx_mss_helper(skb, mss, lrg_pkt);
        adapter->rx_large_packets++;
    }

Then ibmveth_rx_mss_helper() has the following - setting GSO on regardless of the large_pkt bit:

    /* if mss is not set through Large Packet bit/mss in rx buffer,
     * expect that the mss will be written to the tcp header checksum.
     */
    tcph = (struct tcphdr *)(skb->data + offset);
    if (lrg_pkt) {
        skb_shinfo(skb)->gso_size = mss;
    } else if (offset) {
        skb_shinfo(skb)->gso_size = ntohs(tcph->check);
        tcph->check = 0;
    }

It looks to me that Linux will interpret a packet from the veth adaptor as a GSO/GRO packet based only on whether or not the size of the received packet is greater than the linux-side MTU plus the header size - not based on whether AIX thinks it is transmitting a LSO packet.

To put it another way - if I have understood correctly - there are two ways we could end up with a GSO/GRO packet coming out of a veth adaptor.
The ibmveth_rx_mss_helper path is taken when the size of the packet is greater than MTU+ETH_HLEN, which can happen when:

1) The AIX end has turned on LSO, so the large_packet bit is set
2) Large-send is off in AIX but there is a mis-matched MTU between AIX and Linux

In the first case, you say that AIX will turn off largesend, which will fix the issue. But in the second case, if I have understood correctly, AIX will not be able to do anything. Unless you are saying that AIX will dynamically reduce the MTU for a connection in the presence of a number of re-transmits? This isn't necessarily wrong behaviour from AIX - Linux can't do anything in this situation either; a 'hop' that can participate in Path MTU Discovery would be needed.

If I understand it, then, the optimal configuration would be for the AIX LPAR to set an MTU of 1500/9000 and turn on LSO for veth on the AIX side - does that sound right?

Thanks again!

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1692538

Title:
  Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Fix Released

Bug description:
== SRU Justification ==
Commit 66aa0678ef is request to fix four issues with the ibmveth driver. The issues are as follows:
- Issue 1: ibmveth doesn't support largesend and checksum offload features when configured as "Trunk".
- Issue 2: SYN packet drops seen at destination VM. When the packet originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO server's inbound Trunk ibmveth, on validating "checksum good" bits in ibmveth receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY flag.
- Issue 3: First packet of a TCP connection will be dropped, if there is no OVS flow cached in datapath. - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list. The details for the fixes to these issues are described in the commits git log. == Comment: #0 - BRYANT G. LY- 2017-05-22 08:40:16 == ---Problem Description--- - Issue 1: ibmveth doesn't support largesend and checksum offload features when configured as "Trunk". Driver has explicit checks to prevent enabling these offloads. - Issue 2: SYN packet drops seen at destination VM. When the packet originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO server's inbound Trunk ibmveth, on validating "checksum good" bits in ibmveth receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux Bridge) and delivered to outbound Trunk ibmveth. At this point the outbound ibmveth transmit routine will not set "no checksum" and "checksum good" bits in transmit buffer descriptor, as it does so only when the ip_summed field is CHECKSUM_PARTIAL. When this
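
The receive-side concern raised in the message above (ibmveth_poll() comparing the frame length against the Linux-side MTU plus ETH_HLEN) can be modelled in isolation. This is a sketch under stated assumptions, not driver code: `takes_mss_helper_path` is a hypothetical helper, and ETH_HLEN is defined locally with the value 14 that the kernel headers use for an Ethernet header.

```c
#include <stdbool.h>

/* Ethernet header length in bytes (value used by the kernel headers). */
#define ETH_HLEN 14

/*
 * Models only the length test quoted from ibmveth_poll(): a frame is
 * handed to the MSS helper whenever it is longer than the Linux-side
 * MTU plus the Ethernet header. The sender's large-send setting plays
 * no part in this decision.
 */
static bool takes_mss_helper_path(unsigned int length, unsigned int mtu)
{
    return length > mtu + ETH_HLEN;
}
```

For example, a 9014-byte frame arriving on an interface with an MTU of 1500 takes the helper path even if the sender never requested large send, whereas the same frame on an interface with an MTU of 9000 does not.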
[Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA
Hi Bryant,

So, to be crystal clear, IBM's position is that customers using this setup should set the MTU in their AIX partitions to 1500 (or 9000 if using jumbo frames)? Is this documented anywhere on your website that we can point users to?

I ask because I have asked one of your customers/our users to do this in a support context and they were unhappy about the performance impact. So if this is the official line, can we have some official documentation of it?

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1692538

Title:
  Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Fix Released

Bug description:
  == SRU Justification ==
  Commit 66aa0678ef is requested to fix four issues with the ibmveth driver. The issues are as follows:
  - Issue 1: ibmveth doesn't support largesend and checksum offload features when configured as "Trunk".
  - Issue 2: SYN packet drops seen at destination VM. When the packet originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO server's inbound Trunk ibmveth, on validating "checksum good" bits in ibmveth receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY flag.
  - Issue 3: First packet of a TCP connection will be dropped, if there is no OVS flow cached in datapath.
  - Issue 4: ibmveth driver doesn't have support for SKBs with frag_list.

  The details for the fixes to these issues are described in the commit's git log.

  == Comment: #0 - BRYANT G. LY - 2017-05-22 08:40:16 ==
  ---Problem Description---
  - Issue 1: ibmveth doesn't support largesend and checksum offload features when configured as "Trunk". Driver has explicit checks to prevent enabling these offloads.

  - Issue 2: SYN packet drops seen at destination VM. When the packet originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO server's inbound Trunk ibmveth, on validating "checksum good" bits in ibmveth receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux Bridge) and delivered to outbound Trunk ibmveth. At this point the outbound ibmveth transmit routine will not set "no checksum" and "checksum good" bits in transmit buffer descriptor, as it does so only when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets delivered to destination VM, TCP layer receives the packet with checksum value of 0 and with no checksum related flags in ip_summed field. This leads to packet drops, so TCP connections never go through.

  - Issue 3: First packet of a TCP connection will be dropped, if there is no OVS flow cached in datapath. OVS, while trying to identify the flow, computes the checksum. The computed checksum will be invalid at the receiving end, as ibmveth transmit routine zeroes out the pseudo checksum value in the packet. This leads to packet drop.

  - Issue 4: ibmveth driver doesn't have support for SKBs with frag_list. When Physical NIC has GRO enabled and when OVS bridges these packets, OVS vport send code will end up calling dev_queue_xmit, which in turn calls validate_xmit_skb. In validate_xmit_skb routine, the larger packets will get segmented into MSS sized segments, if SKB has a frag_list and if the driver to which they are delivered doesn't support NETIF_F_FRAGLIST feature.

  Contact Information = Bryant G. Ly/b...@us.ibm.com

  ---uname output---
  4.8.0-51.54

  Machine Type = p8

  ---Debugger---
  A debugger is not configured

  ---Steps to Reproduce---
  Increases performance greatly

  The patch has been accepted upstream:
  https://patchwork.ozlabs.org/patch/764533/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1692538/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA
Just as an update: I am working with Jay V on a set of patches to drop the oversized packets at the openvswitch/bridge level to prevent the crash I mentioned. But that is not sufficient to solve the underlying problem: there will still be packet loss when there's an MTU mismatch here. A device in AIX with a 64k MTU being bridged (via openvswitch or a native bridge) to a device with a 1500 or 9000 byte MTU is never going to work reliably and efficiently, and IBM will need to figure out how they want to solve this.

Regards,
Daniel
[Kernel-packages] [Bug 1729119] Re: NVMe timeout is too short
** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1729119

Title:
  NVMe timeout is too short

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  New
Status in linux-aws source package in Xenial:
  Fix Committed

Bug description:
  [SRU Justification]

  [Impact]
  Some NVMe operations time out too quickly. The module parameters allow the timeouts to be extended, but only up to 255s, as the counters are bytes.

  [Fix]
  The underlying parameters are unsigned ints, so make the module parameters unsigned ints too, by picking patch http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html

  [Regression Potential]
  Very limited: only types of module parameters are changing, the patch is easily reviewable.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119/+subscriptions
[Kernel-packages] [Bug 1715812] Re: Neighbour confirmation broken, breaks ARP cache aging
** Changed in: linux (Ubuntu)
       Status: Confirmed => Fix Released

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715812

Title:
  Neighbour confirmation broken, breaks ARP cache aging

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Zesty:
  Fix Released

Bug description:
  [SRU Justification]

  [Impact]
  A host can lose access to another host whose MAC address changes if they have active connections to other hosts that share a route. The ARP cache does not time out as expected - instead the old MAC address is continuously reconfirmed.

  [Fix]
  Apply series [1], which changes the algorithm for neighbour confirmation. That is, from upstream:

  51ce8bd4d17a net: pending_confirm is not used anymore
  0dec879f636f net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP
  63fca65d0863 net: add confirm_neigh method to dst_ops
  c3a2e8370534 tcp: replace dst_confirm with sk_dst_confirm
  c86a773c7802 sctp: add dst_pending_confirm flag
  4ff0620354f2 net: add dst_pending_confirm flag to skbuff
  9b8805a32559 sock: add sk_dst_pending_confirm flag

  [Test case]
  Create 3 real or virtual systems, all hooked up to a switch. One system needs an active-backup bond with fail_over_mac=1 num_grat_arp=0. Put all the systems in the same subnet, e.g. 192.168.200.0/24. Call the system with the bond A, and the other two systems B and C.

  On B, run in 3 shells:
  - netperf -t TCP_RR to C
  - ping -f A
  - watch 'ip -s neigh show 192.168.200.0/24'

  On A, cause the bond to fail over.

  Observe that:
  - without the patches, B intermittently fails to notice the change in A's MAC address. This presents as the ping failing and not recovering, and the arp table showing the old mac address never timing out and never being replaced with a new mac address.
  - with the patches, the arp cache times out and B sends another mac probe and detects A's new address.

  It helps to use taskset to put ping and netperf on the same CPU, or use single-CPU vms.

  See [2] for more details.

  [References]
  [2] Original report: https://www.mail-archive.com/netdev@vger.kernel.org/msg138762.html
  [1]: https://www.spinics.net/lists/linux-rdma/msg45907.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715812/+subscriptions
[Kernel-packages] [Bug 1729119] [NEW] NVMe timeout is too short
Public bug reported:

[SRU Justification]

[Impact]
Some NVMe operations time out too quickly. The module parameters allow the timeouts to be extended, but only up to 255s, as the counters are bytes.

[Fix]
The underlying parameters are unsigned ints, so make the module parameters unsigned ints too, by picking patch http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html

[Regression Potential]
Very limited: only types of module parameters are changing, the patch is easily reviewable.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Daniel Axtens (daxtens)
         Status: Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1729119

Title:
  NVMe timeout is too short

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119/+subscriptions
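The 255s ceiling in the report follows directly from the width of the parameter type. The sketch below is only an illustration of that arithmetic: `max_timeout_seconds` and `parse_timeout` are hypothetical helpers, not kernel functions, and the 32-bit width for an unsigned int is an assumption that holds on common Linux ABIs. Whether the kernel rejects or wraps an out-of-range value is not stated in this thread; the sketch assumes rejection.

```python
def max_timeout_seconds(bits: int) -> int:
    """Largest value an unsigned integer of the given bit width can hold."""
    return (1 << bits) - 1


def parse_timeout(text: str, bits: int) -> int:
    """Parse a timeout in seconds, rejecting values that do not fit.

    Hypothetical helper for illustration only; it is not the kernel's
    module-parameter parser.
    """
    value = int(text, 0)
    if not 0 <= value <= max_timeout_seconds(bits):
        raise ValueError(f"{value} does not fit in {bits} unsigned bits")
    return value


if __name__ == "__main__":
    print(max_timeout_seconds(8))    # ceiling of a byte-wide parameter
    print(max_timeout_seconds(32))   # ceiling assuming a 32-bit unsigned int
    print(parse_timeout("600", 32))  # a 10 minute timeout fits once widened
```

With a byte-wide parameter, a request such as 600 seconds cannot be represented at all, which is why the fix only needs to change the parameter type.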
[Kernel-packages] [Bug 1728489] [NEW] tar -x sometimes fails on overlayfs
Public bug reported:

[SRU Justification]

[Impact]
A user is seeing failures from extracting tar archives on overlay filesystems on the 4.4 kernel in constrained environments. The error presents as:

`tar: ./deps/0/bin: Directory renamed before its status could be extracted`

Following this thread (http://www.spinics.net/lists/linux-unionfs/msg00856.html), it appears that this occurs when entries in the kernel's inode cache are reclaimed, and subsequent lookups return new inode numbers.

Further testing showed that when setting `/proc/sys/vm/vfs_cache_pressure` to 0 (don't allow the kernel to reclaim inode cache entries due to memory pressure) the error does not recur, supporting the hypothesis that cache entries are being evicted. However, this setting may lead to a kernel OOM so is not a reasonable workaround even temporarily.

The error cannot be reproduced on a 4.13 kernel, due to the series at https://www.spinics.net/lists/linux-fsdevel/msg110235.html. The particular relevant commit is b7a807dc2010334e62e0afd89d6f7a8913eb14ff, which needs a couple of dependencies.

[Fix]
For Zesty, backport the entire series. For Xenial, where a full backport is not feasible, backport the key commit and the short list of dependencies.

[Testcase]
# Testing this bug
The testcase for this particular bug is simple - create an overlay filesystem with all layers on the same underlying file system, and then see if the inode of a directory is constant across dropping the caches:

  mkdir -p /upper/upper /upper/work /lower
  mount -t overlay none /mnt -o lowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work
  cd /mnt
  mkdir a
  stat a # observe inode number
  echo 2 > /proc/sys/vm/drop_caches
  stat a # compare inode number

If the inode number is the same, the fix is successful.

# Regression testing
I have run the unionmount test suite from http://git.infradead.org/users/dhowells/unionmount-testsuite.git in overlay mode (./run --ov), and verified that it still passes.

(The series cover letter mentions a fork of the test suite at https://github.com/amir73il/unionmount-testsuite/commits/overlayfs-devel. I have *not* attempted to get this running: it assumes a range of changes that are not present in our kernels.)

[Regression Potential]
As this changes overlayfs, there is potential for regression in the form of unexpected breakages to overlayfs behaviour. I think this is adequately addressed by the regression testing.

One option to reduce the regression potential on Zesty is to reduce the set of patches applied - rather than including the whole series we could include just the patches to solve this bug, which are much easier to inspect for correctness.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Daniel Axtens (daxtens)
         Status: Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1728489

Title:
  tar -x sometimes fails on overlayfs

Status in linux package in Ubuntu:
  Confirmed
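The inode comparison from the test case above could also be scripted rather than read off `stat` output by eye. This is a minimal sketch, not part of the original report: `inode_survives` and `drop_dentry_and_inode_caches` are names invented here, and the cache-dropping helper assumes a Linux host and root privileges.

```python
import os
from typing import Callable


def inode_survives(path: str, disturb: Callable[[], None]) -> bool:
    """Return True if path reports the same inode number before and after disturb()."""
    before = os.stat(path).st_ino
    disturb()
    after = os.stat(path).st_ino
    return before == after


def drop_dentry_and_inode_caches() -> None:
    """Write 2 to /proc/sys/vm/drop_caches, as the test case does (needs root)."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("2\n")
```

On a machine set up as in the test case, `inode_survives("/mnt/a", drop_dentry_and_inode_caches)` returning True would correspond to the "fix is successful" outcome described there.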
[Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module
Hi,

It turns out that support for this driver would require a very large backport with several series of patches, involving significant refactoring, code movement and other code change. This makes it very hard for us to be sure that our backport is correct, and that it's not going to fail unexpectedly on this new model or on any of the many older models supported by this driver.

Therefore, we have decided that the complexity and risk of regression is unacceptably high, and we will not be providing a backport to the 4.4 kernel series.

This means that you will need to use the HWE kernel for this chassis. The Artful kernel, which is out next week, has full support.

** Changed in: linux (Ubuntu)
       Status: Confirmed => Won't Fix

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1683587

Title:
  LSI Harpoon support in megaraid_sas module

Status in linux package in Ubuntu:
  Won't Fix

Bug description:
  The Dell PERC H740 series RAID controllers, codename "Harpoon", are not supported in standard Ubuntu kernels.

  There is a series of kernel patches required to support these:
  http://www.mail-archive.com/linux- ker...@vger.kernel.org/msg1307314.html

  There is also a relevant follow-up series:
  https://www.spinics.net/lists/linux-scsi/msg104667.html
  especially patches 1 and 12.

  The relevant PCI IDs from the PCI database (http://pciids.sourceforge.net/v2.2/pci.ids) are:

  0016 MegaRAID Tri-Mode SAS3508
      1028 1fc9 PERC H840 Adapter
      1028 1fcb PERC H740P Adapter
      1028 1fcd PERC H740P Mini
      1028 1fcf PERC H740P Mini

  They should be supported from Xenial onwards. The upstream commit is going in for 4.11, so this will need to be backported to v4.4 and v4.10.

  I am working on SRU patches for this.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions
[Kernel-packages] [Bug 1715812] Re: Neighbour confirmation broken, breaks ARP cache aging
Verified on Xenial and Zesty.

** Tags removed: verification-needed-zesty
** Tags added: verification-done-zesty

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715812

Title:
  Neighbour confirmation broken, breaks ARP cache aging

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Zesty:
  Fix Committed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715812/+subscriptions
[Kernel-packages] [Bug 1715812] [NEW] Neighbour confirmation broken, breaks ARP cache aging
Public bug reported:

[SRU Justification]

[Impact]
A host can lose access to another host whose MAC address changes if they have active connections to other hosts that share a route. The ARP cache does not time out as expected - instead the old MAC address is continuously reconfirmed.

[Fix]
Apply series [1], which changes the algorithm for neighbour confirmation. That is, from upstream:

51ce8bd4d17a net: pending_confirm is not used anymore
0dec879f636f net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP
63fca65d0863 net: add confirm_neigh method to dst_ops
c3a2e8370534 tcp: replace dst_confirm with sk_dst_confirm
c86a773c7802 sctp: add dst_pending_confirm flag
4ff0620354f2 net: add dst_pending_confirm flag to skbuff
9b8805a32559 sock: add sk_dst_pending_confirm flag

[Test case]
Create 3 real or virtual systems, all hooked up to a switch. One system needs an active-backup bond with fail_over_mac=1 num_grat_arp=0. Put all the systems in the same subnet, e.g. 192.168.200.0/24. Call the system with the bond A, and the other two systems B and C.

On B, run in 3 shells:
- netperf -t TCP_RR to C
- ping -f A
- watch 'ip -s neigh show 192.168.200.0/24'

On A, cause the bond to fail over.

Observe that:
- without the patches, B intermittently fails to notice the change in A's MAC address. This presents as the ping failing and not recovering, and the arp table showing the old mac address never timing out and never being replaced with a new mac address.
- with the patches, the arp cache times out and B sends another mac probe and detects A's new address.

It helps to use taskset to put ping and netperf on the same CPU, or use single-CPU vms.

See [2] for more details.

[References]
[2] Original report: https://www.mail-archive.com/netdev@vger.kernel.org/msg138762.html
[1]: https://www.spinics.net/lists/linux-rdma/msg45907.html

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Daniel Axtens (daxtens)
         Status: Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715812

Title:
  Neighbour confirmation broken, breaks ARP cache aging

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715812/+subscriptions
[Kernel-packages] [Bug 1715519] [NEW] bnx2x_attn_int_deasserted3:4323 MC assert!
Public bug reported:

(This bug provides a place to track the progress of this issue upstream and then in to Ubuntu.)

A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x card attached, and uses openvswitch to bridge an ibmveth interface for traffic from other LPARs.

We see the following crash sometimes when running netperf:

May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
... (dump of registers follows) ...

Subsequent debugging reveals that the packets causing the issue come through the ibmveth interface - from the AIX LPAR. The veth protocol is 'special' - communication between LPARs on the same chassis can use very large (64k) frames to reduce overhead. Normal networks cannot handle such large packets, so traditionally, the VIOS partition would signal to the AIX partitions that it was 'special', and AIX would send regular, ethernet-sized packets to VIOS, which VIOS would then send out.

This signalling between VIOS and AIX is done in a way that is not standards-compliant, and so was never made part of Linux. Instead, the Linux driver has always understood large frames and passed them up the network stack.

In some cases (e.g. with TCP), multiple TCP segments are coalesced into one large packet. In Linux, this goes through the generic receive offload code, using a similar mechanism to GSO.
These segments can be very large, which presents as a very large MSS (maximum segment size) or gso_size.

Normally, the large packet is simply passed to whatever network application on Linux is going to consume it, and everything is OK.

However, in this case, the packets go through Open vSwitch, and are then passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and GSO, but with a restriction: the maximum segment size is limited to around 9700 bytes. Normally this is more than adequate as jumbo frames are limited to 9000 bytes. However, if a large packet with large (>9700 byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the hardware will panic.

Turning off TSO prevents the crash as the kernel resegments the data and assembles the packets in software. This has a performance cost.

Clearly at the very least, bnx2x should not crash in this case. One patch to do this was sent upstream: https://www.spinics.net/lists/netdev/msg452932.html

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Daniel Axtens (daxtens)
         Status: Confirmed

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Confirmed
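The failure condition in this report comes down to comparing a packet's gso_size with what the hardware can segment. The toy model below only illustrates that comparison; it is not the upstream patch and does not use any kernel API. `APPROX_HW_GSO_LIMIT` is the "around 9700 bytes" figure from the report rather than a value read from the driver, and `segment_lengths` and `choose_path` are hypothetical names.

```python
from typing import List

# Approximate limit quoted in the report; the exact driver constant is not given there.
APPROX_HW_GSO_LIMIT = 9700


def segment_lengths(payload_len: int, mss: int) -> List[int]:
    """Split a payload into segments of at most mss bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


def choose_path(gso_size: int, hw_limit: int = APPROX_HW_GSO_LIMIT) -> str:
    """Decide whether a packet is safe to hand to hardware segmentation."""
    return "hardware" if gso_size <= hw_limit else "software"


if __name__ == "__main__":
    # A typical Ethernet-sized MSS is within the limit.
    print(choose_path(1448))
    # A large MSS of the kind a 64k veth frame can carry is not.
    print(choose_path(16000), segment_lengths(65000, 16000))
```

The point of the sketch is that the check has to happen before the packet reaches the hardware; what the software path then does with an oversized segment (resegment or drop) is a separate policy question that the thread discusses.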
[Kernel-packages] [Bug 1714420] Re: kernel oops - kvm guest started at boot time
** Description changed: + [SRU Justification] + + [Impact] + System OOPSes shortly after boot when KVM guests are started. + + [Fix] + Cherry-pick patch e47057151422a67ce08747176fa21cb3b526a2c9 + + [Testcase] + Tested at IBM - boot a machine with a KVM guest configured to start at boot. Without this patch, observe OOPS, with this patch, observe no OOPS. + + [Regression Potential] + Patch is contained in arch/powerpc; so regression potential limited to that arch. Patch accepted to kernel stable trees, suggesting others also believe it to be of low risk. + + [Original Report] + [0.00] Linux version 4.4.0-93-generic (buildd@bos01-ppc64el-025) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #116-Ubuntu SMP Fri Aug 11 16:30:16 UTC 2017 (Ubuntu 4.4.0-93.116-generic 4.4.79) ... [ 380.184554] KVM guest htab at c0799900 (order 29), LPID 2 [ 380.527576] Facility 'TM' unavailable, exception at 0xd0003aad7f10, MSR=90009033 [ 380.527717] Oops: Unexpected facility unavailable exception, sig: 6 [#2] [ 380.527775] SMP NR_CPUS=2048 NUMA PowerNV [ 380.527823] Modules linked in: vhost_net vhost macvtap macvlan xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter overlay binfmt_misc bridge stp llc kvm_hv uio_pdrv_genirq uio leds_powernv ipmi_powernv ibmpowernv vmx_crypto powernv_rng ipmi_msghandler kvm_pr kvm autofs4 xfs btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 raid10 ses enclosure mlx4_en be2net lpfc vxlan mlx4_core scsi_transport_fc ip6_udp_tunnel udp_tunnel ipr [ 380.528781] CPU: 24 PID: 4277 Comm: qemu-system-ppc Tainted: G D 4.4.0-93-generic #116-Ubuntu [ 380.528861] task: c3c389b0 ti: c01fb2428000 task.ti: 
c01fb2428000 [ 380.528929] NIP: d0003aad7f10 LR: d00037d52a14 CTR: d0003aad7e40 [ 380.528997] REGS: c01fb242b7b0 TRAP: 0f60 Tainted: G D (4.4.0-93-generic) [ 380.529076] MSR: 90009033CR: 22024848 XER: [ 380.529247] CFAR: d0003aad7ea4 SOFTE: 1 -GPR00: d00037d52a14 c01fb242ba30 d0003aaec018 c01fdbf6 -GPR04: c01f8580 c01fb242bbc0 -GPR08: 0001 c3c389b0 0001 d00037d578f8 -GPR12: d0003aad7e40 cfb4e400 001f -GPR16: 3fff7206 0080 3fff892c4390 3fff7285f200 -GPR20: 010009988430 0100099affd0 3fff7285eb60 100c1ff0 -GPR24: 3bcf4e10 3fff72040028 c01fdbf6 -GPR28: c01f8580 c01fdbf6 c01f8580 + GPR00: d00037d52a14 c01fb242ba30 d0003aaec018 c01fdbf6 + GPR04: c01f8580 c01fb242bbc0 + GPR08: 0001 c3c389b0 0001 d00037d578f8 + GPR12: d0003aad7e40 cfb4e400 001f + GPR16: 3fff7206 0080 3fff892c4390 3fff7285f200 + GPR20: 010009988430 0100099affd0 3fff7285eb60 100c1ff0 + GPR24: 3bcf4e10 3fff72040028 c01fdbf6 + GPR28: c01f8580 c01fdbf6 c01f8580 [ 380.530119] NIP [d0003aad7f10] kvmppc_vcpu_run_hv+0xd0/0xff0 [kvm_hv] [ 380.530188] LR [d00037d52a14] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 380.530245] Call Trace: [ 380.530270] [c01fb242ba30] [c01fb242bab0] 0xc01fb242bab0 (unreliable) [ 380.530353] [c01fb242bb70] [d00037d52a14] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 380.530436] [c01fb242bba0] [d00037d4f674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm] [ 380.530519] [c01fb242bbe0] [d00037d43918] kvm_vcpu_ioctl+0x528/0x7b0 [kvm] [ 380.530602] [c01fb242bd40] [c02fff60] do_vfs_ioctl+0x480/0x7d0 [ 380.530671] [c01fb242bde0] [c0300384] SyS_ioctl+0xd4/0xf0 [ 380.530742] [c01fb242be30] [c0009204] system_call+0x38/0xb4 [ 380.530837] Instruction dump: [ 380.530904] e92d02a0 e9290a50 e9290108 792a07e3 41820058 e92d02a0 e9290a50 e9290108 [ 380.531126] 7927e8a4 78e71f87 40820ed8 e92d02a0 <7d4022a6> f9490ee8 e92d02a0 7d4122a6 [ 380.531350] ---[ end trace 8f9b3b82f9a07d76 ]--- - - Needs kernel
[Kernel-packages] [Bug 1714420] Re: kernel oops - kvm guest started at boot time
** Changed in: linux (Ubuntu) Assignee: (unassigned) => Daniel Axtens (daxtens) -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1714420 Title: kernel oops - kvm guest started at boot time Status in linux package in Ubuntu: Confirmed Bug description: [0.00] Linux version 4.4.0-93-generic (buildd@bos01-ppc64el-025) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #116-Ubuntu SMP Fri Aug 11 16:30:16 UTC 2017 (Ubuntu 4.4.0-93.116-generic 4.4.79) ... [ 380.184554] KVM guest htab at c0799900 (order 29), LPID 2 [ 380.527576] Facility 'TM' unavailable, exception at 0xd0003aad7f10, MSR=90009033 [ 380.527717] Oops: Unexpected facility unavailable exception, sig: 6 [#2] [ 380.527775] SMP NR_CPUS=2048 NUMA PowerNV [ 380.527823] Modules linked in: vhost_net vhost macvtap macvlan xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter overlay binfmt_misc bridge stp llc kvm_hv uio_pdrv_genirq uio leds_powernv ipmi_powernv ibmpowernv vmx_crypto powernv_rng ipmi_msghandler kvm_pr kvm autofs4 xfs btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 raid10 ses enclosure mlx4_en be2net lpfc vxlan mlx4_core scsi_transport_fc ip6_udp_tunnel udp_tunnel ipr [ 380.528781] CPU: 24 PID: 4277 Comm: qemu-system-ppc Tainted: G D 4.4.0-93-generic #116-Ubuntu [ 380.528861] task: c3c389b0 ti: c01fb2428000 task.ti: c01fb2428000 [ 380.528929] NIP: d0003aad7f10 LR: d00037d52a14 CTR: d0003aad7e40 [ 380.528997] REGS: c01fb242b7b0 TRAP: 0f60 Tainted: G D (4.4.0-93-generic) [ 380.529076] MSR: 90009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22024848 XER: [ 
380.529247] CFAR: d0003aad7ea4 SOFTE: 1 GPR00: d00037d52a14 c01fb242ba30 d0003aaec018 c01fdbf6 GPR04: c01f8580 c01fb242bbc0 GPR08: 0001 c3c389b0 0001 d00037d578f8 GPR12: d0003aad7e40 cfb4e400 001f GPR16: 3fff7206 0080 3fff892c4390 3fff7285f200 GPR20: 010009988430 0100099affd0 3fff7285eb60 100c1ff0 GPR24: 3bcf4e10 3fff72040028 c01fdbf6 GPR28: c01f8580 c01fdbf6 c01f8580 [ 380.530119] NIP [d0003aad7f10] kvmppc_vcpu_run_hv+0xd0/0xff0 [kvm_hv] [ 380.530188] LR [d00037d52a14] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 380.530245] Call Trace: [ 380.530270] [c01fb242ba30] [c01fb242bab0] 0xc01fb242bab0 (unreliable) [ 380.530353] [c01fb242bb70] [d00037d52a14] kvmppc_vcpu_run+0x44/0x60 [kvm] [ 380.530436] [c01fb242bba0] [d00037d4f674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm] [ 380.530519] [c01fb242bbe0] [d00037d43918] kvm_vcpu_ioctl+0x528/0x7b0 [kvm] [ 380.530602] [c01fb242bd40] [c02fff60] do_vfs_ioctl+0x480/0x7d0 [ 380.530671] [c01fb242bde0] [c0300384] SyS_ioctl+0xd4/0xf0 [ 380.530742] [c01fb242be30] [c0009204] system_call+0x38/0xb4 [ 380.530837] Instruction dump: [ 380.530904] e92d02a0 e9290a50 e9290108 792a07e3 41820058 e92d02a0 e9290a50 e9290108 [ 380.531126] 7927e8a4 78e71f87 40820ed8 e92d02a0 <7d4022a6> f9490ee8 e92d02a0 7d4122a6 [ 380.531350] ---[ end trace 8f9b3b82f9a07d76 ]--- Needs kernel patch e47057151422a67ce08747176fa21cb3b526a2c9 according to Cyril ProblemType: Bug DistroRelease: Ubuntu 16.04 Package: linux-image-4.4.0-93-generic 4.4.0-93.116 ProcVersionSignature: Ubuntu 4.4.0-93.116-generic 4.4.79 Uname: Linux 4.4.0-93-generic ppc64le AlsaDevices: total 0 crw-rw 1 root audio 116, 1 Sep 1 15:03 seq crw-rw 1 root audio 116, 33 Sep 1 15:03 timer AplayDevices: Error: [Errno 2] No such file or directory: 'aplay' ApportVersion: 2.20.1-0ubuntu2.10 Architecture: ppc64el ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord' AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Date: Fri Sep 1 15:34:14 2017 
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig' JournalErrors: Error: command ['journalctl', '-b', '--priority=warning', '--lines=1000'] failed with exit code 1: Hint: You are curre
[Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module
** Description changed: The Dell PERC H740 series RAID controllers, codename "Harpoon", are not supported in standard Ubuntu kernels. - The kernel patch to support these new devices is: + There is a series of kernel patches required to support these: + http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1307314.html - https://github.com/torvalds/linux/commit/45f4f2eb3da3cbff02c3d77c784c81320c733056 + There is also a relevant follow-up series: https://www.spinics.net/lists + /linux-scsi/msg104667.html especially patches 1 and 12. The relevant PCI IDs from the PCI database (http://pciids.sourceforge.net/v2.2/pci.ids) are: - 0016 MegaRAID Tri-Mode SAS3508 - 1028 1fc9 PERC H840 Adapter - 1028 1fcb PERC H740P Adapter - 1028 1fcd PERC H740P Mini - 1028 1fcf PERC H740P Mini + 0016 MegaRAID Tri-Mode SAS3508 + 1028 1fc9 PERC H840 Adapter + 1028 1fcb PERC H740P Adapter + 1028 1fcd PERC H740P Mini + 1028 1fcf PERC H740P Mini - They should be supported from Trusty onwards. The upstream commit is - going in for 4.11, so this will need to be backported to - v3.13/v4.4/v4.8/v4.10. + They should be supported from Xenial onwards. The upstream commit is + going in for 4.11, so this will need to be backported to v4.4 and v4.10. I am working on SRU patches for this. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1683587 Title: LSI Harpoon support in megaraid_sas module Status in linux package in Ubuntu: Confirmed Bug description: The Dell PERC H740 series RAID controllers, codename "Harpoon", are not supported in standard Ubuntu kernels. There is a series of kernel patches required to support these: http://www.mail-archive.com/linux- ker...@vger.kernel.org/msg1307314.html There is also a relevant follow-up series: https://www.spinics.net/lists/linux-scsi/msg104667.html especially patches 1 and 12. 
The relevant PCI IDs from the PCI database (http://pciids.sourceforge.net/v2.2/pci.ids) are: 0016 MegaRAID Tri-Mode SAS3508 1028 1fc9 PERC H840 Adapter 1028 1fcb PERC H740P Adapter 1028 1fcd PERC H740P Mini 1028 1fcf PERC H740P Mini They should be supported from Xenial onwards. The upstream commit is going in for 4.11, so this will need to be backported to v4.4 and v4.10. I am working on SRU patches for this. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp
[Kernel-packages] [Bug 1687512] Re: Kernel panics on Xenial when using cgroups and strict CFS limits
** Changed in: linux (Ubuntu) Status: Triaged => Fix Released -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1687512 Title: Kernel panics on Xenial when using cgroups and strict CFS limits Status in linux package in Ubuntu: Fix Released Status in linux source package in Xenial: Fix Released Bug description: SRU Justification - [Impact] Apache Mesos and Kubernetes workloads on Xenial cause a panic (NULL pointer dereference) in the completely fair scheduler. These panics are in pick_next_entity and include pick_next_task_fair in the call stack. [Fix] Cherry-picking both 754bd598be9bbc953bc709a9e8ed7f3188bfb9d7 (http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz) and 094f469172e00d6ab0a3130b0e01c83b3cf3a98d (http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz) fix the crash. They appear to be intended as a series - they were posted to LKML at the same time. [Testcase] The fix has been validated by the user who reported the bug Bug description --- We see a number of kernel panics on servers running Apache Mesos using cgroups with small (0.1-0.2) cpu limits. 
These all appear as NULL pointer dereferences in and around pick_next_entity and pick_next_task_fair, for example: [24334.493331] BUG: unable to handle kernel NULL pointer dereference at 0050 [24334.501611] IP: [] pick_next_entity+0x7f/0x160 [24334.507868] PGD 3eacfa067 PUD 3eacfb067 PMD 0 [24334.512806] Oops: [#1] SMP [24334.516420] Modules linked in: ipvlan xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs tcp_diag inet_diag nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache dm_crypt ppdev input_leds mac_hid i2c_piix4 8250_fintek parport_pc pvpanic parport serio_raw crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse virtio_scsi [24334.576359] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.4.0-66-generic #87~14.04.1-Ubuntu [24334.584748] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [24334.594188] task: 8803ee671c00 ti: 8803ee67c000 task.ti: 8803ee67c000 [24334.601799] RIP: 0010:[] [] pick_next_entity+0x7f/0x160 [24334.610490] RSP: 0018:8803ee67fdd8 EFLAGS: 00010086 [24334.615924] RAX: 8803ebed4c00 RBX: 880036529800 RCX: [24334.623190] RDX: 0225341f RSI: RDI: [24334.630479] RBP: 8803ee67fe00 R08: 0004 R09: [24334.637758] R10: 8803e7ed7600 R11: 0001 R12: [24334.645153] R13: R14: 0009067729c4 R15: 8803ee672178 [24334.652512] FS: () GS:8803ffd0() knlGS: [24334.660721] CS: 0010 DS: ES: CR0: 80050033 [24334.666587] CR2: 0050 CR3: 0003eacf9000 CR4: 001406e0 [24334.673851] Stack: [24334.675980] 8803ffd16e00 8803ffd16e00 8803e855a200 880036529800 [24334.683995] 0002 8803ee67fe68 810b98a6 8803ffd16e70 [24334.692024] 00016e00 8803e7ed7600 8803ee671c00 [24334.700172] Call Trace: [24334.702750] [] pick_next_task_fair+0x66/0x4b0 [24334.708886] [] 
__schedule+0x7f4/0x980 [24334.714349] [] schedule+0x35/0x80 [24334.719445] [] schedule_preempt_disabled+0xe/0x10 [24334.725962] [] cpu_startup_entry+0x18a/0x350 [24334.732012] [] start_secondary+0x149/0x170 [24334.737895] Code: 8b 70 50 4d 2b 74 24 50 4d 85 f6 7e 59 4c 89 e7 e8 67 ff ff ff 49 39 c6 7f 04 4c 8b 6b 48 48 8b 43 40 48 85 c0 74 1f 4c 8b 70 50 <4d> 2b 74 24 50 4d 85 f6 7e 2c 4c 89 e7 e8 3f ff ff ff 49 39 c6 [24334.765124] RIP [] pick_next_entity+0x7f/0x160 [24334.771473] RSP [24334.775077] CR2: 0050 [24334.779121] ---[ end trace 05d941efb97b7bae ]--- and [155852.028575] BUG: unable to handle kernel NULL pointer dereference at 0050 [155852.036931] IP: [] pick_next_entity+0x7f/0x160 [155852.043491] PGD 3ebae8067 PUD 3ebae9067 PMD 0 [155852.048550] Oops: [#1] SMP [155852.052437] Modules linked in: ipvlan veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache dm_crypt ppdev input_leds mac_hid i2c_piix4 parport_pc
[Kernel-packages] [Bug 1699627] Re: XDP eBPF programs fail to verify on Zesty ppc64el
** Changed in: linux (Ubuntu)
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1699627

Title:
  XDP eBPF programs fail to verify on Zesty ppc64el

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Zesty:
  Fix Released

Bug description:
  SRU Justification

  [Impact]
  Some XDP examples such as https://github.com/netoptimizer/prototype-kernel
  fail on ppc64el at the eBPF verification stage.

  [Fix]
  This is because CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set on
  ppc64el. It is not set because the kernel is being compiled for
  CPU_POWER7 instead of CPU_POWER8, and we don't have efficient unaligned
  access on POWER7. Swap to building for POWER8. As a bonus, this should
  make everything a little bit faster.

  [Regression Potential]
  - IBM never released any officially supported Power7 LE systems - LE
    was only ever supported on Power8. Therefore this should not break
    any systems.
  - Regression potential is also limited to one arch.
  - Artful-next already has this fix and nothing bad has happened there.

  [Test]
  Create a P8 VM with a virtio network card and 2 vcpus. The VM needs to
  have some network features turned off, and enough queues. The following
  virsh snippet in the section should suffice:

  Then:
  - apt install clang llvm
  - get the prototype-kernel repo
  - go to the kernel/samples/bpf directory
  - make
  - sudo mount -t bpf bpf /sys/fs/bpf/
  - sudo ./xdp_ddos01_blacklist --dev enp0s1

  Observe that without this patch, we get a long debug splat ending with:

  32: (61) r1 = *(u32 *)(r8 +12)
  misaligned packet access off 0+18+12 size 4
  load_bpf_file: Permission denied

  With this patch we don't get that error and the program successfully
  verifies and loads. (It still doesn't run - there is other breakage I'm
  chasing down - but it definitely gets further.)
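For readers puzzling over the "misaligned packet access off 0+18+12 size 4" line in the report, the numbers can be read as a plain divisibility check. The following is a minimal Python sketch of that arithmetic, not the kernel's code. It assumes the three summed values are an architecture alignment constant, the packet pointer's fixed offset, and the load instruction's offset, and that the load is rejected when their sum is not a multiple of the access size. The function name is hypothetical.

```python
def is_aligned_packet_access(ip_align: int, reg_off: int, insn_off: int, size: int) -> bool:
    """Return True if a `size`-byte packet load would pass a strict alignment check.

    Assumption: the check is (ip_align + reg_off + insn_off) % size == 0,
    which is one reading of the "off A+B+C size N" message in the report.
    """
    return (ip_align + reg_off + insn_off) % size == 0


# Values taken from the error message quoted in the bug: "off 0+18+12 size 4".
# 0 + 18 + 12 = 30, and 30 is not a multiple of 4, so the check fails.
print(is_aligned_packet_access(0, 18, 12, 4))  # prints False
```

Under this reading, building with efficient unaligned access enabled would simply mean the strict check is not applied, which matches the report that the program verifies once the kernel is built for POWER8.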
Re: [Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module
Hi Edward,

I am glad to hear the modified ISO works.

I have backported the patches and am in discussions with the kernel team
about including them in the default kernel. One of our issues is that the
patch set is quite large so we're worried about regressions - do you have
any older H7** raid controllers? Are you in a position to help with
regression testing?

Regards,
Daniel

On Mon, Aug 21, 2017 at 8:26 AM, Edward P <1683...@bugs.launchpad.net> wrote:
> What I did to get it working for now is creating a modified ISO with
> kernel version v4.11.12 that has support for this RAID controller. Works
> fine, so hope this patch is applied to the default kernel that is
> shipped with Ubuntu soon.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1683587
>
> Title:
>   LSI Harpoon support in megaraid_sas module
>
> Status in linux package in Ubuntu:
>   Confirmed
>
> Bug description:
>   The Dell PERC H740 series RAID controllers, codename "Harpoon", are
>   not supported in standard Ubuntu kernels.
>
>   The kernel patch to support these new devices is:
>
>   https://github.com/torvalds/linux/commit/45f4f2eb3da3cbff02c3d77c784c81320c733056
>
>   The relevant PCI IDs from the PCI database
>   (http://pciids.sourceforge.net/v2.2/pci.ids) are:
>
>   0016 MegaRAID Tri-Mode SAS3508
>       1028 1fc9 PERC H840 Adapter
>       1028 1fcb PERC H740P Adapter
>       1028 1fcd PERC H740P Mini
>       1028 1fcf PERC H740P Mini
>
>   They should be supported from Trusty onwards. The upstream commit is
>   going in for 4.11, so this will need to be backported to
>   v3.13/v4.4/v4.8/v4.10.
>
>   I am working on SRU patches for this.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1683587

Title:
  LSI Harpoon support in megaraid_sas module

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  The Dell PERC H740 series RAID controllers, codename "Harpoon", are
  not supported in standard Ubuntu kernels.

  The kernel patch to support these new devices is:

  https://github.com/torvalds/linux/commit/45f4f2eb3da3cbff02c3d77c784c81320c733056

  The relevant PCI IDs from the PCI database
  (http://pciids.sourceforge.net/v2.2/pci.ids) are:

  0016 MegaRAID Tri-Mode SAS3508
      1028 1fc9 PERC H840 Adapter
      1028 1fcb PERC H740P Adapter
      1028 1fcd PERC H740P Mini
      1028 1fcf PERC H740P Mini

  They should be supported from Trusty onwards. The upstream commit is
  going in for 4.11, so this will need to be backported to
  v3.13/v4.4/v4.8/v4.10.

  I am working on SRU patches for this.
[Kernel-packages] [Bug 1701297] Re: NTP reload failure (unable to read library) on overlayfs
Hi Marzog,

What commit has been committed to Linux? I cannot find it.

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1701297

Title:
  NTP reload failure (unable to read library) on overlayfs

Status in cloud-init:
  Won't Fix
Status in apparmor package in Ubuntu:
  Confirmed
Status in cloud-init package in Ubuntu:
  Incomplete
Status in linux package in Ubuntu:
  Fix Committed

Bug description:
  After update [1] of cloud-init in Ubuntu (which landed in
  xenial-updates on 2017-06-27), it is causing NTP reload failures.

  https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.04.1

  In MAAS scenarios, this is causing the machine to fail to deploy.

  Related bugs:
   * bug 1645644: cloud-init ntp not using expected servers
[Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module
Hi,

We currently have a user testing the patches for Xenial onwards.

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1683587

Title:
  LSI Harpoon support in megaraid_sas module

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  The Dell PERC H740 series RAID controllers, codename "Harpoon", are
  not supported in standard Ubuntu kernels.

  The kernel patch to support these new devices is:

  https://github.com/torvalds/linux/commit/45f4f2eb3da3cbff02c3d77c784c81320c733056

  The relevant PCI IDs from the PCI database
  (http://pciids.sourceforge.net/v2.2/pci.ids) are:

  0016 MegaRAID Tri-Mode SAS3508
      1028 1fc9 PERC H840 Adapter
      1028 1fcb PERC H740P Adapter
      1028 1fcd PERC H740P Mini
      1028 1fcf PERC H740P Mini

  They should be supported from Trusty onwards. The upstream commit is
  going in for 4.11, so this will need to be backported to
  v3.13/v4.4/v4.8/v4.10.

  I am working on SRU patches for this.
[Kernel-packages] [Bug 1698706] Re: Quirk for non-compliant PCI bridge on HiSilicon D05 board
** Tags added: kernel-da-key

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1698706

Title:
  Quirk for non-compliant PCI bridge on HiSilicon D05 board

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Zesty:
  Fix Committed

Bug description:
  SRU Justification

  [Impact]
  Xorg autodetection does not work on HiSilicon D05 boards.

  [Fix]
  The HiSilicon D05 board has some PCI bridges (PCI ID 19e5:1610) that
  are not spec-compliant: they do not set the VGA Enable bit when a VGA
  card is behind the bridge. This stops vgaarb setting the device as a
  boot vga device, breaking Xorg auto-detection. [0]

  Despite this, the hibmc VGA card (PCI ID 19e5:1711) is known to work
  when behind these bridges. Provide a quirk so that this combination of
  bridge and card works.

  [Testcase]
  On an affected board, run:

  # find /sys/devices -name boot_vga -exec cat \{\} \;

  This should print 0 without this patch and 1 with this patch.

  [Regression Potential]
  There is a risk with overriding the VGA arbiter that adding additional
  VGA cards to the board may go wrong somehow. The fixup specifically
  tests for the bridge and card on the board, so regressions should be
  limited to that combination of bridge and card.

  [Notes]
  HiSilicon is hoping to have 16.04.3 HWE kernel support their board,
  hence the submission of this patch before it has been accepted
  upstream. The patch has been submitted upstream and I will continue to
  work with upstream to land it.[1]

  [0] https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 - this
  bug tracked debugging of a segfault and then this issue. Comments 25
  (https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/25)
  and 31 onwards detail this issue.

  [1] https://patchwork.ozlabs.org/patch/778054/
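The [Testcase] find command for bug 1698706 can also be scripted when the result needs to be checked across several boards. This is only a sketch of an equivalent check in Python: the function name is made up for illustration, and it assumes the same sysfs layout that the find command relies on (a boot_vga attribute somewhere under /sys/devices).

```python
import os


def read_boot_vga(root="/sys/devices"):
    """Yield (path, value) for every boot_vga attribute found below `root`."""
    # os.walk does not follow symlinks by default, matching plain `find`.
    for dirpath, _dirnames, filenames in os.walk(root):
        if "boot_vga" in filenames:
            path = os.path.join(dirpath, "boot_vga")
            try:
                with open(path) as f:
                    yield path, f.read().strip()
            except OSError:
                # Attribute disappeared or is unreadable; skip it.
                continue


if __name__ == "__main__":
    for path, value in read_boot_vga():
        print(path, value)
```

Per the testcase, an affected board would be expected to report 0 without the patch and 1 with it; printing the path alongside the value helps when more than one VGA device is present.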
[Kernel-packages] [Bug 1699627] Re: XDP eBPF programs fail to verify on Zesty ppc64el
Also verified by an IBMer on a real P8.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1699627

Title:
  XDP eBPF programs fail to verify on Zesty ppc64el

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Zesty:
  Fix Committed

Bug description:
  SRU Justification

  [Impact]
  Some XDP examples such as https://github.com/netoptimizer/prototype-kernel
  fail on ppc64el at the eBPF verification stage.

  [Fix]
  This is because CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set on
  ppc64el. It is not set because the kernel is being compiled for
  CPU_POWER7 instead of CPU_POWER8, and we don't have efficient unaligned
  access on POWER7. Swap to building for POWER8. As a bonus, this should
  make everything a little bit faster.

  [Regression Potential]
  - IBM never released any officially supported Power7 LE systems - LE
    was only ever supported on Power8. Therefore this should not break
    any systems.
  - Regression potential is also limited to one arch.
  - Artful-next already has this fix and nothing bad has happened there.

  [Test]
  Create a P8 VM with a virtio network card and 2 vcpus. The VM needs to
  have some network features turned off, and enough queues. The following
  virsh snippet in the section should suffice:

  Then:
  - apt install clang llvm
  - get the prototype-kernel repo
  - go to the kernel/samples/bpf directory
  - make
  - sudo mount -t bpf bpf /sys/fs/bpf/
  - sudo ./xdp_ddos01_blacklist --dev enp0s1

  Observe that without this patch, we get a long debug splat ending with:

  32: (61) r1 = *(u32 *)(r8 +12)
  misaligned packet access off 0+18+12 size 4
  load_bpf_file: Permission denied

  With this patch we don't get that error and the program successfully
  verifies and loads. (It still doesn't run - there is other breakage I'm
  chasing down - but it definitely gets further.)
[Kernel-packages] [Bug 1701297] Re: NTP reload failure (unable to read library) on overlayfs
Tyler - thanks for that.

John - this is coming up in some internal support team escalations so I'm
going to have a look at the kernel changes myself and will let you know if
I find anything. I'd be keen to sync up if you have any leads.

Regards,
Daniel

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1701297

Title:
  NTP reload failure (unable to read library) on overlayfs

Status in cloud-init:
  Incomplete
Status in apparmor package in Ubuntu:
  Confirmed
Status in cloud-init package in Ubuntu:
  Incomplete
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  After update [1] of cloud-init in Ubuntu (which landed in
  xenial-updates on 2017-06-27), it is causing NTP reload failures.

  https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.04.1

  In MAAS scenarios, this is causing the machine to fail to deploy.