[Kernel-packages] [Bug 1927076] Re: IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])

2021-11-16 Thread Daniel Axtens
I've made some good progress here.

I found that older versions like 4.19 work, so I ran git bisect. I'm
still doing the final check, but it looks like the series that causes
the issue is the one containing these:

d53d2f78cead bpf: Use vmalloc special flag
1a7b7d922081 modules: Use vmalloc special flag
868b104d7379 mm/vmalloc: Add flag for freeing of special permsissions

In particular:

commit 868b104d7379e28013e9d48bdd2db25e0bdcf751 (HEAD)
Author: Rick Edgecombe 
Date:   Thu Apr 25 17:11:36 2019 -0700

mm/vmalloc: Add flag for freeing of special permsissions

Add a new flag VM_FLUSH_RESET_PERMS, for enabling vfree operations to
immediately clear executable TLB entries before freeing pages, and handle
resetting permissions on the directmap. This flag is useful for any kind
of memory with elevated permissions, or where there can be related
permissions changes on the directmap. Today this is RO+X and RO memory.

Although this enables directly vfreeing non-writeable memory now,
non-writable memory cannot be freed in an interrupt because the allocation
itself is used as a node on deferred free list. So when RO memory needs to
be freed in an interrupt the code doing the vfree needs to have its own
work queue, as was the case before the deferred vfree list was added to
vmalloc.

For architectures with set_direct_map_ implementations this whole operation
can be done with one TLB flush when centralized like this. For others with
directmap permissions, currently only arm64, a backup method using
set_memory functions is used to reset the directmap. When arm64 adds
set_direct_map_ functions, this backup can be removed.

When the TLB is flushed to both remove TLB entries for the vmalloc range
mapping and the direct map permissions, the lazy purge operation could be
done to try to save a TLB flush later. However today vm_unmap_aliases
could flush a TLB range that does not include the directmap. So a helper
is added with extra parameters that can allow both the vmalloc address and
the direct mapping to be flushed during this operation. The behavior of the
normal vm_unmap_aliases function is unchanged.

and

commit d53d2f78ceadba081fc7785570798c3c8d50a718
Author: Rick Edgecombe 
Date:   Thu Apr 25 17:11:38 2019 -0700

bpf: Use vmalloc special flag

Use new flag VM_FLUSH_RESET_PERMS for handling freeing of special
permissioned memory in vmalloc and remove places where memory was set RW
before freeing which is no longer needed. Don't track if the memory is RO
anymore because it is now tracked in vmalloc.


This is _extremely_ in "could subtly break under the hash MMU" territory.

Hopefully this is enough to get some Power MMU experts to weigh in. I
will keep working on it.
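
For anyone wanting to repeat the bisection, its rough shape is sketched
below. The good and bad endpoints (v4.19 and v5.4) come from this thread.
The console.log file name and the bisect_verdict helper are hypothetical,
and how each test kernel gets built and booted on the P8 machine is left
out entirely.

```shell
#!/bin/sh
# Sketch only. bisect_verdict and console.log are illustrative names,
# not something taken from the original investigation.

# Turn a captured console log into a git-bisect style exit status:
#   0 = good (no oops), 1 = bad (oops seen), 125 = cannot tell (skip).
bisect_verdict() {
    log="$1"
    [ -r "$log" ] || return 125
    if grep -q 'Oops: Exception in kernel mode' "$log"; then
        return 1
    fi
    return 0
}

# Manual flow inside a mainline kernel tree:
#   git bisect start
#   git bisect bad v5.4       # crash reproduces here
#   git bisect good v4.19     # reported as working
#   # Build and boot the commit git suggests, run reuseport_bpf_cpu,
#   # save the console output to console.log, then run:
#   #   bisect_verdict console.log
#   # and answer with "git bisect good", "git bisect bad" or
#   # "git bisect skip" according to the status it returned.
```

Returning 125 for an unreadable log mirrors the convention git bisect
uses for "skip this commit", so the same helper could later be wrapped
in a script for git bisect run.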

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1927076

Title:
  IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash
  P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])

Status in ubuntu-kernel-tests:
  New
Status in The Ubuntu-power-systems project:
  Confirmed
Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Focal:
  Confirmed
Status in linux source package in Hirsute:
  Confirmed

Bug description:
  It looks like our P8 node "entei" tends to fail with the IPv6 TCP test
  from reuseport_bpf_cpu in ubuntu_kernel_selftests/net on 5.8 kernels:

   # send cpu 119, receive socket 119
   # send cpu 121, receive socket 121
   # send cpu 123, receive socket 123
   # send cpu 125, receive socket 125
   # send cpu 127, receive socket 127
   #  IPv6 TCP 
  publish-job-status: using request.json

  It failed silently here. This can be reproduced 100% of the time with
  Groovy 5.8 and Focal 5.8.

  This causes ubuntu_kernel_selftests to be interrupted, so the test
  results for the other tests cannot be processed to our result page.

  Please find attached the complete "net" test result on this node
  with Groovy 5.8.0-52.59.

  Adding the kqa-blocker tag, as this might need to be manually verified.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1927076/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1927076] Re: IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])

2021-11-12 Thread Daniel Axtens
I can repro on upstream, all the way back to 5.4.0. It might have
existed before that - I haven't tested any earlier yet.

Was the test methodology changed just before this was found? I'm just
wondering why it suddenly appeared ~a year after Focal was released. I
thought it might have been a patch picked up for a SRU, but it's looking
like the problem predates Focal by some way...



[Kernel-packages] [Bug 1927076] Re: IPv6 TCP in reuseport_bpf_cpu from ubuntu_kernel_selftests/net crash P8 node entei (Oops: Exception in kernel mode, sig: 4 [#1])

2021-11-11 Thread Daniel Axtens
I can repro this with the latest Focal kernel on:

description: PowerNV
product: 8247-22L (IBM Power System S822L)

Trying to see if I can repro it upstream.

FWIW my opening hypothesis is that something in a percpu data structure
isn't getting updated over hotplug.



[Kernel-packages] [Bug 1904906] Re: 5.10 kernel fails to boot with secure boot disabled

2020-11-26 Thread Daniel Axtens
I cannot yet explain this, but after bisecting the config, I can repro
this with pseries_le_defconfig + CONFIG_RCU_SCALE_TEST=m

That's weird to me, and I'll continue to investigate.
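
A sketch of how such a configuration could be produced and double-checked
in a kernel source tree. The make targets and scripts/config are standard
kernel build tooling; the config_state helper is a hypothetical name added
here for illustration, and any ARCH or cross-compiler settings are omitted.

```shell
#!/bin/sh
# Sketch only: config_state is an illustrative helper.
#
# Producing the configuration (run in a kernel source tree):
#   make pseries_le_defconfig
#   ./scripts/config --module RCU_SCALE_TEST
#   make olddefconfig

# Print the value of CONFIG_<name> from a .config file,
# or "n" when the option is unset or absent.
config_state() {
    name="$1"
    cfg="$2"
    value=$(sed -n "s/^CONFIG_${name}=//p" "$cfg")
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf 'n\n'
    fi
}

# Example: config_state RCU_SCALE_TEST .config
```

Checking the final .config matters because make olddefconfig can silently
drop an option whose dependencies are not met.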

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1904906

Title:
  5.10 kernel fails to boot with secure boot disabled

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  Canonical requested testing of secure boot with the 5.10 kernel, but
  the kernel fails to boot with secure boot disabled.

  The 5.10 kernel can be found in:
  https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/bootstrap

  They can be installed by installing the linux-generic-wip package with
  this PPA enabled. As usual, they are only signed using a key specific to
  that PPA. This key can be retrieved from the signing tarballs for the
  kernels, e.g.:

  http://ppa.launchpad.net/canonical-kernel-team/bootstrap/ubuntu/dists/hirsute/main/signed/linux-5.10-ppc64el/5.10.0-2.3/signed.tar.gz

  Our tester installed the 5.10 kernel via aptitude.
  When booting directly from the boot menu, it gets stuck at:
  "kexec_core: Starting new kernel"

  When booting the recovery kernel for 5.10.0, it proceeds further and,
  after kexec_core, fails at:
  "
  [0.029830] LSM: Security Framework initializing
  [0.029916] Yama: b
  "

  Two further attempts used a different scenario: running the 5.8 kernel
  and booting 5.10 from the command line:
  kexec -l /boot/vmlinux-5.10.0-0-generic \
      --initrd=/boot/initrd.img-5.10.0-0-generic \
      --append="root=UUID=49d000cb-dba2-4d70-809e-38f2b31d0f09 ro quiet splash"
  kexec -e

  Both attempts also failed while rebooting, once with the same error as
  the error from booting with bootmenu; the other failure occurred a lot
  earlier.

  Which new CONFIG options and/or features were introduced for the 5.10
  kernel?

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1904906/+subscriptions



[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

2020-02-18 Thread Daniel Axtens
The 5.3.0 HWE kernel also works, which means we now have a good
workaround while we debug things.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1863044

Title:
  qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  We have an IBM BladeCenter Hx5 with a number of blades running Ubuntu
  18.04.3. Storage is attached over Fibre Channel.

  They all boot fine with 4.15.0-72 - the qla2xxx driver detects all the LUNs.
  On 4.15.0-74 and 4.15.0-76, the qla2xxx driver loads but the LUNs are
  not detected. This breaks the boot. rescan-scsi-bus.sh is also unable
  to find the LUNs. Reverting to the 4.15.0-72 kernel works.

  lspci reports:

  06:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
          Subsystem: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA
          Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0
          I/O ports at 2c00 [size=256]
          Memory at 903fc000 (64-bit, non-prefetchable) [size=16K]
          Expansion ROM at 9030 [disabled] [size=256K]
          Capabilities: [44] Power Management version 3
          Capabilities: [4c] Express Endpoint, MSI 00
          Capabilities: [88] MSI: Enable- Count=1/32 Maskable- 64bit+
          Capabilities: [98] Vital Product Data
          Capabilities: [a0] MSI-X: Enable+ Count=2 Masked-
          Capabilities: [100] Advanced Error Reporting
          Capabilities: [138] Power Budgeting
          Kernel driver in use: qla2xxx
          Kernel modules: qla2xxx

  Let me know if you need any more details. I attach a version.log and
  lspci-vnvn.log from a working -72 boot.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+subscriptions



[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

2020-02-18 Thread Daniel Axtens
Hi Mauricio,

5.4.0-14 works for me, dmesg attached.

I'll see if an HWE kernel supplied in the bionic repositories also
works; maybe we can use that in the meantime so we don't fall any
further behind on kernel updates while we debug this.

Regards,
Daniel

** Attachment added: "dmesg-5.4.0-14-generic"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329404/+files/dmesg-5.4.0-14-generic



[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

2020-02-18 Thread Daniel Axtens
Ah, I was just about to tell you that I have just tried master-next at
a59858e18bc8996f8c96d307a33e504b079dc541! I think that is the same sha
that ended up being tagged as -89, so I think it provides us with the
same information.

Sadly -89 also doesn't seem to work; dmesg attached.

I don't know anything about the qla2xxx driver, so I was planning to
confirm that the problem was introduced by the set of qla2xxx changes
that went into -73 and then bisect them. But, if you have any insight or
specific knowledge that would suggest a better path, I'm very happy to
give that a go.

Regards,
Daniel

** Attachment added: "dmesg-89"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329273/+files/dmesg-89



[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

2020-02-17 Thread Daniel Axtens
** Attachment added: "dmesg from -72"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329133/+files/dmesg-72



[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

2020-02-17 Thread Daniel Axtens
Hi Mauricio,

Thanks for the prompt answer! After a lot of messing around to get a
remote console, I can finally test. It looks like -88 doesn't work. I'm
attaching a dmesg from -88 and -72. I will build and test master-next
next.
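
One way to compare the two attached logs is to pull out only the driver's
lines and drop the timestamps so that the two boots can be diffed. A
minimal sketch, assuming the attachments were saved locally as dmesg-72
and dmesg-88; the helper names are made up for this example.

```shell
#!/bin/sh
# Sketch only: driver_lines and compare_driver_lines are illustrative.

# Print the lines of a dmesg log that mention a driver, with the
# leading "[  12.345678] " timestamp removed.
driver_lines() {
    grep -i "$1" "$2" | sed 's/^\[[ 0-9.]*\] *//'
}

# Diff the driver's messages between a good and a bad boot.
# Returns 0 when they are identical, non-zero otherwise.
compare_driver_lines() {
    drv="$1"
    good=$(mktemp)
    bad=$(mktemp)
    driver_lines "$drv" "$2" > "$good"
    driver_lines "$drv" "$3" > "$bad"
    rc=0
    diff -u "$good" "$bad" || rc=$?
    rm -f "$good" "$bad"
    return "$rc"
}

# Example: compare_driver_lines qla2xxx dmesg-72 dmesg-88
```

Stripping timestamps first keeps the diff limited to messages that really
differ between kernels, not ones that merely appeared at a different time.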

Regards,
Daniel

** Attachment added: "dmesg from -88"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5329132/+files/dmesg-88



[Kernel-packages] [Bug 1863044] Re: qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

2020-02-12 Thread Daniel Axtens
** Attachment added: "lspci-vnvn.log"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044/+attachment/5327820/+files/lspci-vnvn.log



[Kernel-packages] [Bug 1863044] [NEW] qla2xxx no longer detects LUNs with 4.15.0-74+ on IBM BladeCenter

2020-02-12 Thread Daniel Axtens
Public bug reported:

We have an IBM BladeCenter Hx5 with a number of blades running Ubuntu
18.04.3. Storage is attached over Fibre Channel.

They all boot fine with 4.15.0-72 - the qla2xxx driver detects all the LUNs. On
4.15.0-74 and 4.15.0-76, the qla2xxx driver loads but the LUNs are not
detected. This breaks the boot. rescan-scsi-bus.sh is also unable to
find the LUNs. Reverting to the 4.15.0-72 kernel works.
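
For reference, a manual rescan through sysfs looks roughly like the sketch
below (it needs root). Writing "- - -" to a host's scan file asks the
kernel to rescan every channel, target and LUN on that host. The
rescan_scsi_hosts name and its optional sysfs-root argument are
assumptions added for illustration; it is a simpler stand-in for
rescan-scsi-bus.sh, not a replacement.

```shell
#!/bin/sh
# Sketch only: rescan_scsi_hosts is an illustrative helper. The optional
# argument exists so the function can be pointed at a scratch directory.

rescan_scsi_hosts() {
    sysfs="${1:-/sys}"
    for host in "$sysfs"/class/scsi_host/host*; do
        # Skip the unexpanded glob when no SCSI hosts are present.
        [ -e "$host/scan" ] || continue
        # "- - -" is a wildcard for channel, target and LUN.
        printf '%s\n' '- - -' > "$host/scan"
        echo "rescanned ${host##*/}"
    done
}

# Example (as root): rescan_scsi_hosts
```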

lspci reports:

06:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
        Subsystem: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA
        Flags: bus master, fast devsel, latency 0, IRQ 24, NUMA node 0
        I/O ports at 2c00 [size=256]
        Memory at 903fc000 (64-bit, non-prefetchable) [size=16K]
        Expansion ROM at 9030 [disabled] [size=256K]
        Capabilities: [44] Power Management version 3
        Capabilities: [4c] Express Endpoint, MSI 00
        Capabilities: [88] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [98] Vital Product Data
        Capabilities: [a0] MSI-X: Enable+ Count=2 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [138] Power Budgeting
        Kernel driver in use: qla2xxx
        Kernel modules: qla2xxx

Let me know if you need any more details. I attach a version.log and
lspci-vnvn.log from a working -72 boot.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: New

** Attachment added: "version.log"
   
https://bugs.launchpad.net/bugs/1863044/+attachment/5327819/+files/version.log



[Kernel-packages] [Bug 1853142] Re: CVE-2019-18660: patches for Ubuntu

2019-12-06 Thread Daniel Axtens
My colleague has verified all 4 versions. In all cases, on supported
hardware, the test now operates as expected: the secret does not leak
unless the mitigation is manually turned off.
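
As a quick cross-check on a test machine, the kernel's own report of the
Spectre v2 mitigation state can be read from sysfs. A minimal sketch: the
path is the standard vulnerabilities directory, while the spectre_v2_state
helper and its one-word summary are assumptions made for this example.
How much of this particular fix shows up in that file depends on the
kernel version, so it does not replace the leak test described above.

```shell
#!/bin/sh
# Sketch only: spectre_v2_state is an illustrative helper. The optional
# argument lets it read from a different sysfs root.

spectre_v2_state() {
    file="${1:-/sys}/devices/system/cpu/vulnerabilities/spectre_v2"
    [ -r "$file" ] || { echo unknown; return 0; }
    # Classify the first line of the kernel's report.
    case "$(head -n 1 "$file")" in
        Mitigation*)     echo mitigated ;;
        Vulnerable*)     echo vulnerable ;;
        "Not affected"*) echo not-affected ;;
        *)               echo unknown ;;
    esac
}

# Example: spectre_v2_state
```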

I notice the SRU verification is happening a bit sooner than I expected
- when do you expect these kernels to be released?

** Tags removed: verification-needed-bionic verification-needed-disco 
verification-needed-eoan verification-needed-xenial
** Tags added: verification-done-bionic verification-done-disco 
verification-done-eoan verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853142

Title:
  CVE-2019-18660: patches for Ubuntu

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Disco:
  Fix Committed
Status in linux source package in Eoan:
  Fix Committed
Status in linux source package in Focal:
  Triaged

Bug description:
  Hi,

  Recently you would have been notified about CVE-2019-18660 via email
  to the linux-distros private mailing list. In short, it is a bug in
  the Spectre v2 class affecting powerpc.

  We have developed some backports for supported Ubuntu kernels, and
  tested them in our lab. I will attach the patches shortly. Most of
  them should end up being identical to the versions in linux-stable,
  but the ones for Bionic are slightly different due to it using a 4.15
  kernel.

  Please get in touch with me or Michael Ellerman (powerpc maintainer)
  if you have any questions or if we can be of any assistance.

  
  Kind regards,
  Daniel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1853142/+subscriptions



[Kernel-packages] [Bug 1853142] Re: CVE-2019-18660: patches for Ubuntu

2019-11-28 Thread Daniel Axtens
The embargo has expired so I'm making this public now.


** Description changed:

  Hi,
  
  Recently you would have been notified about CVE-2019-18660 via email to
  the linux-distros private mailing list. In short, it is a bug in the
  Spectre v2 class affecting powerpc.
  
  We have developed some backports for supported Ubuntu kernels, and
  tested them in our lab. I will attach the patches shortly. Most of them
  should end up being identical to the versions in linux-stable, but the
  ones for Bionic are slightly different due to it using a 4.15 kernel.
  
  Please get in touch with me or Michael Ellerman (powerpc maintainer) if
  you have any questions or if we can be of any assistance.
  
- If I understand the SRU cycles correctly, we've missed the current one
- due for release on 2 December, so the earliest these patches could land
- in is the kernel nominally slated to be released ~23 December. Are you
- planning to still release a kernel then, or are your cycles going to
- change over the end of year period?
- 
- (If it helps, we've got some automation set up so we're able to do the
- regression testing of -proposed kernels with these patches quickly.)
  
  Kind regards,
  Daniel

** Information type changed from Private Security to Public Security

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1853142

Title:
  CVE-2019-18660: patches for Ubuntu

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  Triaged
Status in linux source package in Bionic:
  Triaged
Status in linux source package in Disco:
  Triaged
Status in linux source package in Eoan:
  Triaged
Status in linux source package in Focal:
  Triaged

Bug description:
  Hi,

  Recently you would have been notified about CVE-2019-18660 via email
  to the linux-distros private mailing list. In short, it is a bug in
  the Spectre v2 class affecting powerpc.

  We have developed some backports for supported Ubuntu kernels, and
  tested them in our lab. I will attach the patches shortly. Most of
  them should end up being identical to the versions in linux-stable,
  but the ones for Bionic are slightly different due to it using a 4.15
  kernel.

  Please get in touch with me or Michael Ellerman (powerpc maintainer)
  if you have any questions or if we can be of any assistance.

  
  Kind regards,
  Daniel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1853142/+subscriptions



[Kernel-packages] [Bug 1822870] Re: Backport support for software count cache flush Spectre v2 mitigation. (CVE) (required for POWER9 DD2.3)

2019-04-08 Thread Daniel Axtens
Hi Michael R,

I tried to apply your patches to test them and support the effort to get
them included in the Bionic kernel, but I'm having some trouble applying
them:

ubuntu@dja-bionic:~/bionic$ git am ../patches/01-powerpc-64s-add-support-for-ori-barrier_nospec.patch
Patch format detection failed.
ubuntu@dja-bionic:~/bionic$ git am ../patches/01-powerpc-64s-add-support-for-ori-barrier_nospec.patch --patch-format mbox
Applying: commit 2eea7f067f495e33b8b116b35b5988ab2b8aec55
fatal: empty ident name (for <>) not allowed

How are you generating them? They don't look like they've been generated
with git format-patch...?
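
For reference, a rough sketch of what seems to be missing. `git am` expects mbox-style input of the kind `git format-patch` writes: a `From <sha> <date>` separator line followed by `From:`, `Date:` and `Subject:` headers. The Python helper below is purely illustrative (the function name is made up, and this is not how git itself parses patches); it only reports the missing pieces that would be consistent with both "Patch format detection failed" and the empty ident error.

```python
import email
from email.utils import parseaddr


def mbox_patch_problems(text):
    """List reasons why `text` does not look like `git format-patch` output.

    Hypothetical helper for illustration only; git's own checks differ.
    """
    problems = []
    lines = text.splitlines()
    first = lines[0] if lines else ""

    # format-patch output starts with an mbox separator: "From <sha> <date>"
    has_separator = first.startswith("From ")
    if not has_separator:
        problems.append("missing mbox 'From <sha> <date>' separator line")

    # Parse the remaining text as mail headers plus body
    rest = "\n".join(lines[1:]) if has_separator else text
    msg = email.message_from_string(rest)

    # git am takes the author identity from the From: header
    name, addr = parseaddr(msg.get("From", ""))
    if not name or not addr:
        problems.append("no author name/address in a From: header")

    # git am takes the commit subject from the Subject: header
    if not msg.get("Subject"):
        problems.append("no Subject: header")

    return problems
```

A patch written by `git format-patch -1 <commit>` should come back with an empty list, whereas raw `git show` output (which starts with `commit <sha>` and an `Author:` line) fails all three checks.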

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1822870

Title:
  Backport support for software count cache flush Spectre v2 mitigation.
  (CVE) (required for POWER9 DD2.3)

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  In Progress

Bug description:
  For the different kernels:

  The HWE a563fd9c62f0 UBUNTU: Ubuntu-hwe-4.18.0-17.18~18.04.1 appears
  to have all patches.

  Disco appears to be missing only this patch:
  92edf8df0ff2ae86cc632eeca0e651fd8431d40d powerpc/security: Fix spectre_v2 reporting

  Cosmic (which is supported until July) is missing a number of patches:
  cf175dc315f90185128fb061dc05b6fbb211aa2f powerpc/64: Disable the speculation barrier from the command line
  6453b532f2c8856a80381e6b9a1f5ea2f12294df powerpc/64: Make stf barrier PPC_BOOK3S_64 specific.
  179ab1cbf883575c3a585bcfc0f2160f1d22a149 powerpc/64: Add CONFIG_PPC_BARRIER_NOSPEC
  af375eefbfb27cbb5b831984e66d724a40d26b5c powerpc/64: Call setup_barrier_nospec() from setup_arch()
  406d2b6ae3420f5bb2b3db6986dc6f0b6dbb637b powerpc/64: Make meltdown reporting Book3S 64 specific
  06d0bbc6d0f56dacac3a79900e9a9a0d5972d818 powerpc/asm: Add a patch_site macro & helpers for patching instructions
  dc8c6cce9a26a51fc19961accb978217a3ba8c75 powerpc/64s: Add new security feature flags for count cache flush
  ee13cb249fabdff8b90aaff61add347749280087 powerpc/64s: Add support for software count cache flush
  ba72dc171954b782a79d25e0f4b3ed91090c3b1e powerpc/pseries: Query hypervisor for count cache flush settings
  99d54754d3d5f896a8f616b0b6520662bc99d66b powerpc/powernv: Query firmware for count cache flush settings
  7d8bad99ba5a22892f0cad6881289fdc3875a930 powerpc/fsl: Fix spectre_v2 mitigations reporting
  92edf8df0ff2ae86cc632eeca0e651fd8431d40d powerpc/security: Fix spectre_v2 reporting
  This appears to already be in -next.

  For the bionic 18.04.1 (4.15) kernel only this patch is already part of master-next:
  a6b3964ad71a61bb7c61d80a60bea7d42187b2eb powerpc/64s: Add barrier_nospec

  The others are ported, there were only 3 that were not clean.  Those are:
  2eea7f067f495e33b8b116b35b5988ab2b8aec55 powerpc/64s: Add support for ori barrier_nospec patching
  This failed because commit a048a07d7f4535baa4cbad6bc024f175317ab938 is missing, but it does not look like that is required here.

  cb3d6759a93c6d0aea1c10deb6d00e111c29c19c powerpc/64s: Enable barrier_nospec based on firmware settings
  This failed because debugfs was already included, I can see that previously added, I didn't see where it was previously removed.

  06d0bbc6d0f56dacac3a79900e9a9a0d5972d818 powerpc/asm: Add a patch_site macro & helpers for patching instructions
  This failed because 8183d99f4a22c is not included - but doesn't seem necessary.

  All other patches applied with, at most, some fuzz.

  Has had a little testing - boots, check debugfs, etc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1822870/+subscriptions



[Kernel-packages] [Bug 1793901] Re: kernel oops in bcache module

2019-01-24 Thread Daniel Axtens
** Description changed:

+ SRU Justification
+ =================
+ 
+ [Impact]
+ 
+ Some users see panics like the following when performing fstrim on a
+ bcached volume:
+ 
+ [  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0008
+ [  530.183928] #PF error: [normal kernel read fault]
+ [  530.412392] PGD 801f42163067 P4D 801f42163067 PUD 1f42168067 PMD 0
+ [  530.750887] Oops:  [#1] SMP PTI
+ [  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
+ [  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
+ [  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
+ [  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48 8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
+ [  532.838634] RSP: 0018:b9b708df39b0 EFLAGS: 00010246
+ [  533.093571] RAX:  RBX: 00046000 RCX:
+ [  533.441865] RDX: 0200 RSI:  RDI:
+ [  533.789922] RBP: b9b708df3a48 R08: 940d3b3fdd20 R09:
+ [  534.137512] R10: b9b708df3958 R11:  R12:
+ [  534.485329] R13:  R14:  R15: 940d39212020
+ [  534.833319] FS:  7efec26e3840() GS:940d1f48() knlGS:
+ [  535.224098] CS:  0010 DS:  ES:  CR0: 80050033
+ [  535.504318] CR2: 0008 CR3: 001f4e256004 CR4: 001606e0
+ [  535.851759] Call Trace:
+ [  535.970308]  ? mempool_alloc_slab+0x15/0x20
+ [  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
+ [  536.403399]  blk_mq_make_request+0x97/0x4f0
+ [  536.607036]  generic_make_request+0x1e2/0x410
+ [  536.819164]  submit_bio+0x73/0x150
+ [  536.980168]  ? submit_bio+0x73/0x150
+ [  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
+ [  537.391595]  ? _cond_resched+0x1a/0x50
+ [  537.573774]  submit_bio_wait+0x59/0x90
+ [  537.756105]  blkdev_issue_discard+0x80/0xd0
+ [  537.959590]  ext4_trim_fs+0x4a9/0x9e0
+ [  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
+ [  538.324087]  ext4_ioctl+0xea4/0x1530
+ [  538.497712]  ? _copy_to_user+0x2a/0x40
+ [  538.679632]  do_vfs_ioctl+0xa6/0x600
+ [  538.853127]  ? __do_sys_newfstat+0x44/0x70
+ [  539.051951]  ksys_ioctl+0x6d/0x80
+ [  539.212785]  __x64_sys_ioctl+0x1a/0x20
+ [  539.394918]  do_syscall_64+0x5a/0x110
+ [  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
+ 
+ [Fix]
+ 
+ Under certain conditions, the test for whether an operation should be
+ written back to the underlying device was incorrect. Specifically, in
+ should_writeback(), we were hitting a case where an optimisation for
+ partial stripe conditions was returning true and so should_writeback()
+ was returning true early. This caused the code to go down an incorrect
+ path and create bios that contained NULL pointers.
+ 
+ To fix this issue, make sure that should_writeback() on a discard op
+ never returns true.
+ 
+ 
+ [Test Case]
+ 
+ We have observed it on some systems where both:
+ 1) LVM/devmapper is involved (bcache backing device is LVM volume) and
+ 2) writeback cache is involved (bcache cache_mode is writeback)
+ 
+ Not every machine exhibits the bug. On one machine that does exhibit the
+ bug, we can reliably reproduce it with:
+ 
+  # echo writeback > /sys/block/bcache0/bcache/cache_mode
+  # mount /dev/bcache0 /test
+  # for i in {0..10}; do file="$(mktemp /test/zero.XXX)"; dd if=/dev/zero of="$file" bs=1M count=256; sync; rm $file; done; fstrim -v /test
+ 
+ 
+ [Regression Potential]
+ 
+ This could affect any device where bcache is used.
+ 
+ In mitigation, however, the patch is simple and only changes how
+ discard operations are handled. The patch has been accepted upstream [1] and the
+ maintainer will be including it in SuSE kernels [2]. A Gentoo user
+ validated the upstream patch independently [3].
+ 
+ 
+ [1] https://www.spinics.net/lists/linux-bcache/msg06997.html
+ [2] https://www.spinics.net/lists/linux-bcache/msg06998.html
+ [3] https://bugzilla.kernel.org/show_bug.cgi?id=196103#c3
+ 
+ 
+ [Original Description]
+ 
  This was on an 18.04.1 install running the 4.15-34 generic kernel image, running from a normal ext4 root device.
  I had just a short while before created a new bcache device that was mounted but to which no data had been written yet. Then without any apparent particular reason, an apport error popped up to inform of a bcache kernel oops. Crash log was uploaded but no idea how to link it, so I attach it as well.
  Mostly I would like to know how concerned I should be as after a previous, successful test I wanted to move the whole install to bcache. Ideally, if this is a bug or similar, it would be nice if it could get fixed.
  
  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-34-generic 
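
To illustrate the [Fix] section of the SRU justification above, here is a small Python model of the decision before and after the change. It is a sketch only, not the kernel's C code: the parameter names are hypothetical, and the final `return True` stands in for the remaining heuristics, which this thread does not describe. The point it shows is that the partial-stripe shortcut used to fire for a discard, and that checking for a discard first prevents that.

```python
def should_writeback_before(is_discard, partial_stripes_expensive,
                            stripe_dirty, would_skip):
    """Simplified pre-fix decision (illustrative model, not kernel code)."""
    # Partial-stripe optimisation: returns early for any operation,
    # including a discard, which is the problematic case.
    if partial_stripes_expensive and stripe_dirty:
        return True
    if would_skip:
        return False
    # Placeholder for the remaining heuristics (assumption).
    return True


def should_writeback_after(is_discard, partial_stripes_expensive,
                           stripe_dirty, would_skip):
    """Simplified post-fix decision: a discard is never written back."""
    if is_discard:
        return False
    return should_writeback_before(is_discard, partial_stripes_expensive,
                                   stripe_dirty, would_skip)
```

With a discard hitting a dirty stripe, the first function returns True and the second returns False; for any non-discard operation the two agree.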

[Kernel-packages] [Bug 1802421] Re: Xenial: data corruption when using i40e with iommu

2019-01-22 Thread Daniel Axtens
The user has verified that the -proposed kernel resolves their issue.

Regards,
Daniel

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1802421

Title:
  Xenial: data corruption when using i40e with iommu

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  A user reports that using an i40e with intel_iommu=on with the Xenial
  GA kernel causes data corruption. Using the Xenial HWE kernel or an
  out-of-tree driver more recent than the version shipped with Xenial
  solves the issue.

  [Impact]
  Corrupted data is returned from the network card intermittently. This is often noticeable when using apt, as the checksums are verified. It often leads to failure of apt operations. When there are no checksums done, this could lead to silent data corruption.

  [Fix]
  This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. Picking this patch alone is sufficient to fix the issue. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.

  [Test]
  A user tested a Xenial 4.4 kernel with this patch applied and it fixed their issue - no data corruption was observed. (The test repeatedly deletes the apt cache and then does apt update.)

  [Regression Potential]
  It's a messy change inside i40e, so the risk is that i40e will be broken in some subtle way we haven't noticed, or have performance issues. None of these have been observed so far.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1802421/+subscriptions



[Kernel-packages] [Bug 1805245] Re: powerpc/powernv/pci: Work around races in PCI bridge enabling

2019-01-21 Thread Daniel Axtens
The OpenPower partner reports that their system is fixed with this
kernel.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1805245

Title:
  powerpc/powernv/pci: Work around races in PCI bridge enabling

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  SRU Justification
  =================

  [Impact]

  An IBM OpenPower partner reports their system with a bunch of NVMe
  drives fails the NVMe init due to some drives taking PCIe EEH errors.

  [Fix]

  Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream.

  [Testing]

  IBM reports that this patch fixes the user's issue.

  [Regression Potential]

  The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y).
  It only affects PowerPC.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1805245/+subscriptions



[Kernel-packages] [Bug 1793901] Re: kernel oops in bcache module

2019-01-20 Thread Daniel Axtens
Hi,

I have a patch which I believe fixes your issue:
https://www.spinics.net/lists/linux-bcache/msg06997.html

It looks like it will go into the 5.1 kernel, and I will propose it for
backporting to earlier Ubuntu kernels.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1793901

Title:
  kernel oops in bcache module

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  This was on an 18.04.1 install running the 4.15-34 generic kernel image, running from a normal ext4 root device.
  I had just a short while before created a new bcache device that was mounted but to which no data had been written yet. Then without any apparent particular reason, an apport error popped up to inform of a bcache kernel oops. Crash log was uploaded but no idea how to link it, so I attach it as well.
  Mostly I would like to know how concerned I should be as after a previous, successful test I wanted to move the whole install to bcache. Ideally, if this is a bug or similar, it would be nice if it could get fixed.

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-34-generic 4.15.0-34.37
  ProcVersionSignature: Ubuntu 4.15.0-34.37-generic 4.15.18
  Uname: Linux 4.15.0-34-generic x86_64
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair nvidia_modeset nvidia
  ApportVersion: 2.20.9-0ubuntu7.3
  Architecture: amd64
  CurrentDesktop: ubuntu:GNOME
  Date: Sat Sep 22 18:20:22 2018
  HibernationDevice: RESUME=UUID=6bcbe7fa-85b7-4baf-9b69-0558a668bcdd
  InstallationDate: Installed on 2014-07-29 (1515 days ago)
  InstallationMedia: It
  IwConfig:
   zthnhe3w6d  no wireless extensions.
   
   eth1  no wireless extensions.
   
   lo  no wireless extensions.
  MachineType: System manufacturer System Product Name
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=
   LANG=de_DE.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 EFI VGA
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.15.0-34-generic root=UUID=ebbab625-f14e-44ba-84d5-025ed92a5b2a ro quiet splash
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-34-generic N/A
   linux-backports-modules-4.15.0-34-generic  N/A
   linux-firmware 1.173.1
  RfKill:
   0: hci0: Bluetooth
    Soft blocked: yes
    Hard blocked: no
  SourcePackage: linux
  UpgradeStatus: Upgraded to bionic on 2018-09-07 (15 days ago)
  dmi.bios.date: 10/22/2015
  dmi.bios.vendor: American Megatrends Inc.
  dmi.bios.version: 0604
  dmi.board.asset.tag: Default string
  dmi.board.name: H170I-PLUS D3
  dmi.board.vendor: ASUSTeK COMPUTER INC.
  dmi.board.version: Rev X.0x
  dmi.chassis.asset.tag: Default string
  dmi.chassis.type: 3
  dmi.chassis.vendor: Default string
  dmi.chassis.version: Default string
  dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0604:bd10/22/2015:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnH170I-PLUSD3:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:
  dmi.product.family: Default string
  dmi.product.name: System Product Name
  dmi.product.version: System Version
  dmi.sys.vendor: System manufacturer

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793901/+subscriptions



[Kernel-packages] [Bug 1793901] Re: kernel oops in bcache module

2019-01-15 Thread Daniel Axtens
I think I have discovered the cause:
https://lore.kernel.org/linux-block/87h8e9ii2l@linkitivity.dja.id.au/

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1793901

Title:
  kernel oops in bcache module

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  This was on an 18.04.1 install running the 4.15-34 generic kernel image, running from a normal ext4 root device.
  I had just a short while before created a new bcache device that was mounted but to which no data had been written yet. Then without any apparent particular reason, an apport error popped up to inform of a bcache kernel oops. Crash log was uploaded but no idea how to link it, so I attach it as well.
  Mostly I would like to know how concerned I should be as after a previous, successful test I wanted to move the whole install to bcache. Ideally, if this is a bug or similar, it would be nice if it could get fixed.

  ProblemType: Bug
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-34-generic 4.15.0-34.37
  ProcVersionSignature: Ubuntu 4.15.0-34.37-generic 4.15.18
  Uname: Linux 4.15.0-34-generic x86_64
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair nvidia_modeset nvidia
  ApportVersion: 2.20.9-0ubuntu7.3
  Architecture: amd64
  CurrentDesktop: ubuntu:GNOME
  Date: Sat Sep 22 18:20:22 2018
  HibernationDevice: RESUME=UUID=6bcbe7fa-85b7-4baf-9b69-0558a668bcdd
  InstallationDate: Installed on 2014-07-29 (1515 days ago)
  InstallationMedia: It
  IwConfig:
   zthnhe3w6d  no wireless extensions.
   
   eth1  no wireless extensions.
   
   lo  no wireless extensions.
  MachineType: System manufacturer System Product Name
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=
   LANG=de_DE.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 EFI VGA
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.15.0-34-generic root=UUID=ebbab625-f14e-44ba-84d5-025ed92a5b2a ro quiet splash
  RelatedPackageVersions:
   linux-restricted-modules-4.15.0-34-generic N/A
   linux-backports-modules-4.15.0-34-generic  N/A
   linux-firmware 1.173.1
  RfKill:
   0: hci0: Bluetooth
    Soft blocked: yes
    Hard blocked: no
  SourcePackage: linux
  UpgradeStatus: Upgraded to bionic on 2018-09-07 (15 days ago)
  dmi.bios.date: 10/22/2015
  dmi.bios.vendor: American Megatrends Inc.
  dmi.bios.version: 0604
  dmi.board.asset.tag: Default string
  dmi.board.name: H170I-PLUS D3
  dmi.board.vendor: ASUSTeK COMPUTER INC.
  dmi.board.version: Rev X.0x
  dmi.chassis.asset.tag: Default string
  dmi.chassis.type: 3
  dmi.chassis.vendor: Default string
  dmi.chassis.version: Default string
  dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0604:bd10/22/2015:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnH170I-PLUSD3:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:
  dmi.product.family: Default string
  dmi.product.name: System Product Name
  dmi.product.version: System Version
  dmi.sys.vendor: System manufacturer

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793901/+subscriptions



[Kernel-packages] [Bug 1801305] Re: Restore request-based mode to xen-blkfront for AWS kernels

2018-11-27 Thread Daniel Axtens
I've checked that the proposed Xenial AWS kernel works - it boots
successfully and uses the deadline scheduler by default on a t2.micro
instance.
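
One way to confirm the active scheduler is to read the device's `queue/scheduler` file in sysfs, where the kernel marks the scheduler in use with square brackets (for example `noop [deadline] cfq`). A minimal sketch in Python follows; the helper names and the `xvda` device name in the usage note are assumptions for illustration, not details taken from the test run above.

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed entry from a queue/scheduler listing, or None."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def device_scheduler(device, sysfs_root="/sys/block"):
    """Read and parse <sysfs_root>/<device>/queue/scheduler."""
    path = Path(sysfs_root, device, "queue", "scheduler")
    return active_scheduler(path.read_text())
```

On the instance this would be called as `device_scheduler("xvda")`, assuming that is the name of the block device; a result of `deadline` matches the behaviour reported above.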

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1801305

Title:
  Restore request-based mode to xen-blkfront for AWS kernels

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed
Status in linux source package in Disco:
  Triaged

Bug description:
  In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
  default and cannot use the old I/O scheduler.

  [Impact]
  blk-mq is not as fast as the old request-based scheduler for some workloads on HDD disks.

  [Fix]
  Amazon Linux has a commit which reintroduces the request-based mode. It disables blk-mq by default but allows it to be switched back on with a kernel parameter.

  For X this needs a small patch from upstream for error handling.

  For B/C this patchset is bigger as it includes the suspend/resume
  patches already in X, and a new fixup. These are desirable as the
  request mode patch assumes their presence.

  [Regression Potential]
  Could potentially break xen based disks on AWS.

  For B/C, the patches also add some code to the xen core around suspend
  and resume, this code is much smaller and also mirrors code already in
  Xenial.

  [Tests]
  Tested by AWS for Xenial, and their kernel engineers vetted the patches. I tested the Bionic and Cosmic patchsets with fio, the system appears stable and the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did an apt update/upgrade and everything worked (no hash-sum mismatches).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1801305/+subscriptions



[Kernel-packages] [Bug 1805245] [NEW] powerpc/powernv/pci: Work around races in PCI bridge enabling

2018-11-26 Thread Daniel Axtens
Public bug reported:

SRU Justification
=================

[Impact]

An IBM OpenPower partner reports their system with a bunch of NVMe
drives fails the NVMe init due to some drives taking PCIe EEH errors.

[Fix]

Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream.

[Testing]

IBM reports that this patch fixes the user's issue.

[Regression Potential]

The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y).
It only affects PowerPC.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: Confirmed

** Description changed:

  SRU Justification
  =================
  
  [Impact]
  
  An IBM OpenPower partner reports their system with a bunch of NVMe
  drives fails the NVMe init due to some drives taking PCIe EEH errors.
  
  [Fix]
  
  Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream.
  
  [Testing]
  
  IBM reports that this patch fixes the user's issue.
  
  [Regression Potential]
  
- The patch is already in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y).
+ The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y).
  It only affects PowerPC.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1805245

Title:
  powerpc/powernv/pci: Work around races in PCI bridge enabling

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  SRU Justification
  =================

  [Impact]

  An IBM OpenPower partner reports their system with a bunch of NVMe
  drives fails the NVMe init due to some drives taking PCIe EEH errors.

  [Fix]

  Pick patch db2173198b9513f7add8009f225afa1f1c79bcc6 upstream.

  [Testing]

  IBM reports that this patch fixes the user's issue.

  [Regression Potential]

  The patch is already in Cosmic (db33bbe77b9594133fecf0dc290322437170627f) and in some stable trees (1eb08e7b192d2c412175f607cf51449c916abd57 in 4.14.y).
  It only affects PowerPC.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1805245/+subscriptions



[Kernel-packages] [Bug 1802421] Re: Xenial: data corruption when using i40e with iommu

2018-11-08 Thread Daniel Axtens
** Description changed:

  A user reports that using an i40e with intel_iommu=on with the Xenial GA
  kernel causes data corruption. Using the Xenial HWE kernel or an
  out-of-tree driver more recent than the version shipped with Xenial
  solves the issue.
  
  [Impact]
  Corrupted data is returned from the network card intermittently. This is often noticeable when using apt, as the checksums are verified. It often leads to failure of apt operations. When there are no checksums done, this could lead to silent data corruption.
  
  [Fix]
- This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.
+ This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. Picking this patch alone is sufficient to fix the issue. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.
  
  [Test]
  A user tested a Xenial 4.4 kernel with this patch applied and it fixed their issue - no data corruption was observed. (The test repeatedly deletes the apt cache and then does apt update.)
  
  [Regression Potential]
  It's a messy change inside i40e, so the risk is that i40e will be broken in some subtle way we haven't noticed, or have performance issues. None of these have been observed so far.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1802421

Title:
  Xenial: data corruption when using i40e with iommu

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  A user reports that using an i40e with intel_iommu=on with the Xenial
  GA kernel causes data corruption. Using the Xenial HWE kernel or an
  out-of-tree driver more recent than the version shipped with Xenial
  solves the issue.

  [Impact]
  Corrupted data is returned from the network card intermittently. This is often noticeable when using apt, as the checksums are verified. It often leads to failure of apt operations. When there are no checksums done, this could lead to silent data corruption.

  [Fix]
  This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. Picking this patch alone is sufficient to fix the issue. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.

  [Test]
  A user tested a Xenial 4.4 kernel with this patch applied and it fixed their issue - no data corruption was observed. (The test repeatedly deletes the apt cache and then does apt update.)

  [Regression Potential]
  It's a messy change inside i40e, so the risk is that i40e will be broken in some subtle way we haven't noticed, or have performance issues. None of these have been observed so far.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1802421/+subscriptions



[Kernel-packages] [Bug 1802421] [NEW] Xenial: data corruption when using i40e with iommu

2018-11-08 Thread Daniel Axtens
Public bug reported:

A user reports that using an i40e with intel_iommu=on with the Xenial GA
kernel causes data corruption. Using the Xenial HWE kernel or an
out-of-tree driver more recent than the version shipped with Xenial
solves the issue.

[Impact]
Corrupted data is returned from the network card intermittently. This is often noticeable when using apt, as the checksums are verified. It often leads to failure of apt operations. When there are no checksums done, this could lead to silent data corruption.

[Fix]
This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: Drop packet split receive routine") which is part of a broader refactor. Picking this patch alone is sufficient to fix the issue. My theory is that iommu exposes an issue in the packet split receive routine and so removing it is sufficient to prevent the problem from occurring.

[Test]
A user tested a Xenial 4.4 kernel with this patch applied and it fixed their issue - no data corruption was observed. (The test repeatedly deletes the apt cache and then does apt update.)

[Regression Potential]
It's a messy change inside i40e, so the risk is that i40e will be broken in some subtle way we haven't noticed, or have performance issues. None of these have been observed so far.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1802421

Title:
  Xenial: data corruption when using i40e with iommu

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  A user reports that using an i40e with intel_iommu=on with the Xenial
  GA kernel causes data corruption. Using the Xenial HWE kernel or an
  out-of-tree driver more recent than the version shipped with Xenial
  solves the issue.

  [Impact]
  Corrupted data is returned from the network card intermittently. This is 
often noticeable when using apt, as the checksums are verified. It often leads 
to failure of apt operations. When there are no checksums done, this could lead 
to silent data corruption.

  [Fix]
  This was fixed somewhere post-4.4. Testing identified b32bfa17246d ("i40e: 
Drop packet split receive routine") which is part of a broader refactor. 
Picking this patch alone is sufficient to fix the issue. My theory is that 
iommu exposes an issue in the packet split receive routine and so removing it 
is sufficient to prevent the problem from occurring.

  [Test]
  A user tested a Xenial 4.4 kernel with this patch applied and it fixed their 
issue - no data corruption was observed. (The test repeatedly deletes the apt 
cache and then does apt update.)

  [Regression Potential]
  It's a messy change inside i40e, so the risk is that i40e will be broken in 
some subtle way we haven't noticed, or have performance issues. None of these 
have been observed so far.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1802421/+subscriptions



[Kernel-packages] [Bug 1801305] Re: Restore request-based mode to xen-blkfront for AWS kernels

2018-11-02 Thread Daniel Axtens
** Description changed:

  In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
  default and cannot use the old I/O scheduler.
  
  [Impact]
  blk-mq is not as fast as the old request-based scheduler for some workloads 
on HDD disks.
  
  [Fix]
  Amazon Linux has a commit which reintroduces the request-based mode. It 
disables blk-mq by default but allows it to be switched back on with a kernel 
parameter.
  
+ For B/C this patchset is bigger as it includes the suspend/resume
+ patches already in X, and a new fixup. These are desirable as the
+ request mode patch assumes their presence.
+ 
  [Regression Potential]
- Could potentially break xen based disks on AWS. For B/C, the patches also add 
some code to the xen core around suspend and resume, this code is much smaller 
and also mirrors code already in Xenial.
+ Could potentially break xen based disks on AWS. 
+ 
+ For B/C, the patches also add some code to the xen core around suspend
+ and resume, this code is much smaller and also mirrors code already in
+ Xenial.
  
  [Tests]
  Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did 
an apt update/upgrade and everything worked (no hash-sum mismatches).

** Description changed:

  In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
  default and cannot use the old I/O scheduler.
  
  [Impact]
  blk-mq is not as fast as the old request-based scheduler for some workloads 
on HDD disks.
  
  [Fix]
  Amazon Linux has a commit which reintroduces the request-based mode. It 
disables blk-mq by default but allows it to be switched back on with a kernel 
parameter.
  
+ For X this needs a small patch from upstream for error handling.
+ 
  For B/C this patchset is bigger as it includes the suspend/resume
  patches already in X, and a new fixup. These are desirable as the
  request mode patch assumes their presence.
  
  [Regression Potential]
- Could potentially break xen based disks on AWS. 
+ Could potentially break xen based disks on AWS.
  
  For B/C, the patches also add some code to the xen core around suspend
  and resume, this code is much smaller and also mirrors code already in
  Xenial.
  
  [Tests]
  Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did 
an apt update/upgrade and everything worked (no hash-sum mismatches).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1801305

Title:
  Restore request-based mode to xen-blkfront for AWS kernels

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
  default and cannot use the old I/O scheduler.

  [Impact]
  blk-mq is not as fast as the old request-based scheduler for some workloads 
on HDD disks.

  [Fix]
  Amazon Linux has a commit which reintroduces the request-based mode. It 
disables blk-mq by default but allows it to be switched back on with a kernel 
parameter.

  For X this needs a small patch from upstream for error handling.

  For B/C this patchset is bigger as it includes the suspend/resume
  patches already in X, and a new fixup. These are desirable as the
  request mode patch assumes their presence.

  [Regression Potential]
  Could potentially break xen based disks on AWS.

  For B/C, the patches also add some code to the xen core around suspend
  and resume; this code is much smaller and also mirrors code already in
  Xenial.

  [Tests]
  Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did 
an apt update/upgrade and everything worked (no hash-sum mismatches).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1801305/+subscriptions



[Kernel-packages] [Bug 1801305] [NEW] Restore request-based mode to xen-blkfront for AWS kernels

2018-11-02 Thread Daniel Axtens
Public bug reported:

In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
default and cannot use the old I/O scheduler.

[Impact]
blk-mq is not as fast as the old request-based scheduler for some workloads on 
HDD disks.

[Fix]
Amazon Linux has a commit which reintroduces the request-based mode. It 
disables blk-mq by default but allows it to be switched back on with a kernel 
parameter.

For X this needs a small patch from upstream for error handling.

For B/C this patchset is bigger as it includes the suspend/resume
patches already in X, and a new fixup. These are desirable as the
request mode patch assumes their presence.

[Regression Potential]
Could potentially break xen based disks on AWS.

For B/C, the patches also add some code to the xen core around suspend
and resume; this code is much smaller and also mirrors code already in
Xenial.

[Tests]
Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did 
an apt update/upgrade and everything worked (no hash-sum mismatches).

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: Confirmed

** Description changed:

  In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
- default.
+ default and cannot use the old I/O scheduler.
  
  [Impact]
  blk-mq is not as fast as the old request-based scheduler for some workloads 
on HDD disks.
  
  [Fix]
  Amazon Linux has a commit which reintroduces the request-based mode. It 
disables blk-mq by default but allows it to be switched back on with a kernel 
parameter.
  
  [Regression Potential]
  Could potentially break xen based disks on AWS. For B/C, the patches also add 
some code to the xen core around suspend and resume, this code is much smaller 
and also mirrors code already in Xenial.
  
  [Tests]
  Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing.

** Description changed:

  In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
  default and cannot use the old I/O scheduler.
  
  [Impact]
  blk-mq is not as fast as the old request-based scheduler for some workloads 
on HDD disks.
  
  [Fix]
  Amazon Linux has a commit which reintroduces the request-based mode. It 
disables blk-mq by default but allows it to be switched back on with a kernel 
parameter.
  
  [Regression Potential]
  Could potentially break xen based disks on AWS. For B/C, the patches also add 
some code to the xen core around suspend and resume, this code is much smaller 
and also mirrors code already in Xenial.
  
  [Tests]
- Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing.
+ Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did 
an apt update/upgrade and everything worked (no hash-sum mismatches).

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1801305

Title:
  Restore request-based mode to xen-blkfront for AWS kernels

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  In current Ubuntu kernels, PV blkfront drivers have blk-mq enabled by
  default and cannot use the old I/O scheduler.

  [Impact]
  blk-mq is not as fast as the old request-based scheduler for some workloads 
on HDD disks.

  [Fix]
  Amazon Linux has a commit which reintroduces the request-based mode. It 
disables blk-mq by default but allows it to be switched back on with a kernel 
parameter.

  For X this needs a small patch from upstream for error handling.

  For B/C this patchset is bigger as it includes the suspend/resume
  patches already in X, and a new fixup. These are desirable as the
  request mode patch assumes their presence.

  [Regression Potential]
  Could potentially break xen based disks on AWS.

  For B/C, the patches also add some code to the xen core around suspend
  and resume; this code is much smaller and also mirrors code already in
  Xenial.

  [Tests]
  Tested by AWS for Xenial, and their kernel engineers vetted the patches. I 
tested the Bionic and Cosmic patchsets with fio, the system appears stable and 
the IOPS promised for EBS Provisioned IOPS disks were met in my testing. I did 
an apt update/upgrade and everything worked (no hash-sum mismatches).

To manage notifications about this bug go to:

[Kernel-packages] [Bug 1798706] Re: Incomplete linking with boost_regex

2018-10-18 Thread Daniel Axtens
** Tags added: sts

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1798706

Title:
  Incomplete linking with boost_regex

Status in linux package in Ubuntu:
  In Progress

Bug description:
  SRU Justification
  =

  [Impact]
  oslogin fails on Xenial and Trusty.

  In auth.log we see:

  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: cannot open shared object file: No such file or directory
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_login.so
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: cannot open shared object file: No such file or directory
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_admin.so

  The error message is a bit deceptive - PAM tries to load the module
  from the correct location, fails, and then tries the other location
  where it is missing. It then reports the missing error rather than the
  real error.

  Symlinking the module into both paths leads to a much more useful
  error message:

  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to 
dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: undefined 
symbol: 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE
  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM adding faulty module: 
pam_oslogin_login.so
  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to 
dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: undefined 
symbol: 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE
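
  One way to see the loader's real error directly, without depending on
  which path PAM happens to try last, is a small standalone check. The
  following is only a minimal sketch and is not part of oslogin or PAM:
  try_load is a hypothetical helper name, and the path given to it is
  whatever module location you want to probe. On older glibc versions it
  may need to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: try to load a shared object from an explicit path
 * and copy the dynamic loader's own error text into errbuf.
 * Returns 0 on success, -1 on failure.
 *
 * RTLD_NOW asks the loader to resolve all symbols at load time, so an
 * "undefined symbol" problem is reported here rather than at first use.
 */
static int try_load(const char *path, char *errbuf, size_t errlen)
{
    void *handle = dlopen(path, RTLD_NOW);

    if (handle == NULL) {
        const char *msg = dlerror();
        snprintf(errbuf, errlen, "%s", msg ? msg : "unknown dlopen error");
        return -1;
    }

    /* Loaded cleanly: clear the buffer and release the handle. */
    if (errlen > 0)
        errbuf[0] = '\0';
    dlclose(handle);
    return 0;
}
```

  Calling try_load on /lib/security/pam_oslogin_login.so and printing
  errbuf on failure should show the dlopen error for that exact file,
  whether that is a missing file or an undefined symbol.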

  [Test case]
   - set up GCE VM
   - turn on oslogin
   - attempt to log in

  [Fix]
  
debian/patches/0002-Set-LDFLAGS-at-the-end-of-the-c-command-line-right-b.patch 
re-orders the link flags to link boost_regex for oslogin. However, this didn't 
change the flags for PAM module linking. So fix that too.

  [Regression Potential]
  - fixes a regression
  - limited to oslogin, and how it is linked.

  [Other Notes]
  We still see a scary list of warnings when building, but they don't seem to 
have an impact on the common path:
  dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail13put_mem_blockEPv used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail14verify_optionsEjNS_15regex_constants12_match_flagsE used 
by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZNK5boost9re_detail31cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISB_12maybe_assignERKSF_
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost11basic_regexIcNS_12regex_traitsIcNS_16cpp_regex_traitsIcE9do_assignEPKcS7_j
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail19raise_runtime_errorERKSt13runtime_error used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZNK5boost9re_detail31cpp_regex_traits_implementationIcE9transformEPKcS4_ used 
by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 

[Kernel-packages] [Bug 1798706] [NEW] Incomplete linking with boost_regex

2018-10-18 Thread Daniel Axtens
: linux (Ubuntu)
 Importance: Critical
 Assignee: Daniel Axtens (daxtens)
 Status: In Progress

** Patch added: "set-LDFLAGS-for-PAM.patch"
   
https://bugs.launchpad.net/bugs/1798706/+attachment/5202754/+files/set-LDFLAGS-for-PAM.patch

** Changed in: linux (Ubuntu)
   Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1798706

Title:
  Incomplete linking with boost_regex

Status in linux package in Ubuntu:
  In Progress

Bug description:
  SRU Justification
  =

  [Impact]
  oslogin fails on Xenial and Trusty.

  In auth.log we see:

  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: cannot open shared object file: No such file or directory
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_login.so
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: cannot open shared object file: No such file or directory
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_admin.so

  The error message is a bit deceptive - PAM tries to load the module
  from the correct location, fails, and then tries the other location
  where it is missing. It then reports the missing error rather than the
  real error.

  Symlinking the module into both paths leads to a much more useful
  error message:

  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to 
dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: undefined 
symbol: 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE
  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM adding faulty module: 
pam_oslogin_login.so
  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to 
dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: undefined 
symbol: 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE

  [Test case]
   - set up GCE VM
   - turn on oslogin
   - attempt to log in

  [Fix]
  
debian/patches/0002-Set-LDFLAGS-at-the-end-of-the-c-command-line-right-b.patch 
re-orders the link flags to link boost_regex for oslogin. However, this didn't 
change the flags for PAM module linking. So fix that too.

  [Regression Potential]
  - fixes a regression
  - limited to oslogin, and how it is linked.

  [Other Notes]
  We still see a scary list of warnings when building, but they don't seem to 
have an impact on the common path:
  dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail13put_mem_blockEPv used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail14verify_optionsEjNS_15regex_constants12_match_flagsE used 
by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZNK5boost9re_detail31cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISB_12maybe_assignERKSF_
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost11basic_regexIcNS_12regex_traitsIcNS_16cpp_regex_traitsIcE9do_assignEPKcS7_j
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail19raise_runtime_errorERKSt13runtime_error used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: wa

[Kernel-packages] [Bug 1798705] [NEW] Incomplete linking with boost_regex

2018-10-18 Thread Daniel Axtens
: linux (Ubuntu)
 Importance: Critical
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

** Patch added: "set-LDFLAGS-for-PAM.patch"
   
https://bugs.launchpad.net/bugs/1798705/+attachment/5202753/+files/set-LDFLAGS-for-PAM.patch

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1798705

Title:
  Incomplete linking with boost_regex

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  SRU Justification
  =

  [Impact]
  oslogin fails on Xenial and Trusty.

  In auth.log we see:

  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: cannot open shared object file: No such file or directory
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_login.so
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM unable to dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: cannot open shared object file: No such file or directory
  Oct 17 16:35:59 davecore-oslogin sshd[10073]: PAM adding faulty module: pam_oslogin_admin.so

  The error message is a bit deceptive - PAM tries to load the module
  from the correct location, fails, and then tries the other location
  where it is missing. It then reports the missing error rather than the
  real error.

  Symlinking the module into both paths leads to a much more useful
  error message:

  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to 
dlopen(pam_oslogin_login.so): /lib/security/pam_oslogin_login.so: undefined 
symbol: 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE
  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM adding faulty module: 
pam_oslogin_login.so
  Oct 18 06:45:12 dja-202158 sshd[16554]: PAM unable to 
dlopen(pam_oslogin_admin.so): /lib/security/pam_oslogin_admin.so: undefined 
symbol: 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE

  [Test case]
   - set up GCE VM
   - turn on oslogin
   - attempt to log in

  [Fix]
  
debian/patches/0002-Set-LDFLAGS-at-the-end-of-the-c-command-line-right-b.patch 
re-orders the link flags to link boost_regex for oslogin. However, this didn't 
change the flags for PAM module linking. So fix that too.

  [Regression Potential]
  - fixes a regression
  - limited to oslogin, and how it is linked.

  [Other Notes]
  We still see a scary list of warnings when building, but they don't seem to 
have an impact on the common path:
  dpkg-shlibdeps: warning: symbol _ZN5boost9re_detail13put_mem_blockEPv used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail14verify_optionsEjNS_15regex_constants12_match_flagsE used 
by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZNK5boost9re_detail31cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost13match_resultsIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISB_12maybe_assignERKSF_
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail12perl_matcherIN9__gnu_cxx17__normal_iteratorIPKcNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcESaINS_9sub_matchISC_EEENS_12regex_traitsIcNS_16cpp_regex_traitsIcE14construct_initERKNS_11basic_regexIcSJ_EENS_15regex_constants12_match_flagsE
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost11basic_regexIcNS_12regex_traitsIcNS_16cpp_regex_traitsIcE9do_assignEPKcS7_j
 used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZN5boost9re_detail19raise_runtime_errorERKSt13runtime_error used by 
debian/google-compute-engine-oslogin/lib/libnss_google-compute-engine-oslogin-1.3.1.so
 found in none of the libraries
  dpkg-shlibdeps: warning: symbol 
_ZNK5boost9re_detail31cpp_regex_traits_implementationIcE9transf

[Kernel-packages] [Bug 1797314] Re: fscache: bad refcounting in fscache_op_complete leads to OOPS

2018-10-11 Thread Daniel Axtens
** Description changed:

  SRU Justification
  -
  
  [Impact]
  
  A kernel BUG is sometimes observed when using fscache:
  [4740718.880898] FS-Cache:
  [4740718.880920] FS-Cache: Assertion failed
  [4740718.880934] FS-Cache: 0 > 0 is false
  [4740718.881001] [ cut here ]
  [4740718.881017] kernel BUG at 
/usr/src/linux-4.4.0/fs/fscache/operation.c:449!
  [4740718.881040] invalid opcode:  [#1] SMP
- 
+ 
  [4740718.892659] Call Trace:
  [4740718.893506]  [] cachefiles_read_copier+0x3a9/0x410 
[cachefiles]
  [4740718.894374]  [] fscache_op_work_func+0x22/0x50 
[fscache]
  [4740718.895180]  [] process_one_work+0x150/0x3f0
  [4740718.895966]  [] worker_thread+0x11a/0x470
  [4740718.896753]  [] ? __schedule+0x359/0x980
  [4740718.897783]  [] ? rescuer_thread+0x310/0x310
  [4740718.898581]  [] kthread+0xd6/0xf0
  [4740718.899469]  [] ? kthread_park+0x60/0x60
  [4740718.900477]  [] ret_from_fork+0x3f/0x70
  [4740718.901514]  [] ? kthread_park+0x60/0x60
  
  [Problem]
  
- In include/fscache-cache.h, fscache_retrieval_complete reads, in part:
+ In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in
+ part:
  
  atomic_sub(n_pages, &op->n_pages);
  if (atomic_read(&op->n_pages) <= 0)
          fscache_op_complete(&op->op, true);
  
  The code uses atomic_sub followed by a separate atomic_read. Two threads
  decrementing the page count can therefore race: both can see
  op->n_pages <= 0 at the same time, and both end up calling
  fscache_op_complete, leading to the OOPS.
  
  [Fix]
  The fix is trivial: use atomic_sub_return instead of two separate calls.
  
  [Testcase]
- The user has tested the patch successfully on their fscache/cachefiles setup.
+ I believe the user has tested the patch successfully on their 
fscache/cachefiles setup.
  
  [Regression Potential]
  Limited to fscache. Small, comprehensible change.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1797314

Title:
  fscache: bad refcounting in fscache_op_complete leads to OOPS

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  SRU Justification
  -

  [Impact]

  A kernel BUG is sometimes observed when using fscache:
  [4740718.880898] FS-Cache:
  [4740718.880920] FS-Cache: Assertion failed
  [4740718.880934] FS-Cache: 0 > 0 is false
  [4740718.881001] [ cut here ]
  [4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
  [4740718.881040] invalid opcode:  [#1] SMP

  [4740718.892659] Call Trace:
  [4740718.893506]  [] cachefiles_read_copier+0x3a9/0x410 [cachefiles]
  [4740718.894374]  [] fscache_op_work_func+0x22/0x50 [fscache]
  [4740718.895180]  [] process_one_work+0x150/0x3f0
  [4740718.895966]  [] worker_thread+0x11a/0x470
  [4740718.896753]  [] ? __schedule+0x359/0x980
  [4740718.897783]  [] ? rescuer_thread+0x310/0x310
  [4740718.898581]  [] kthread+0xd6/0xf0
  [4740718.899469]  [] ? kthread_park+0x60/0x60
  [4740718.900477]  [] ret_from_fork+0x3f/0x70
  [4740718.901514]  [] ? kthread_park+0x60/0x60

  [Problem]

  In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in
  part:

  atomic_sub(n_pages, &op->n_pages);
  if (atomic_read(&op->n_pages) <= 0)
          fscache_op_complete(&op->op, true);

  The code uses atomic_sub followed by a separate atomic_read. Two
  threads decrementing the page count can therefore race: both can see
  op->n_pages <= 0 at the same time, and both end up calling
  fscache_op_complete, leading to the OOPS.

  [Fix]
  The fix is trivial: use atomic_sub_return instead of two separate calls.

  [Testcase]
  I believe the user has tested the patch successfully on their 
fscache/cachefiles setup.

  [Regression Potential]
  Limited to fscache. Small, comprehensible change.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797314/+subscriptions



[Kernel-packages] [Bug 1797314] [NEW] fscache: bad refcounting in fscache_op_complete leads to OOPS

2018-10-11 Thread Daniel Axtens
Public bug reported:

SRU Justification
-

[Impact]

A kernel BUG is sometimes observed when using fscache:
[4740718.880898] FS-Cache:
[4740718.880920] FS-Cache: Assertion failed
[4740718.880934] FS-Cache: 0 > 0 is false
[4740718.881001] [ cut here ]
[4740718.881017] kernel BUG at /usr/src/linux-4.4.0/fs/fscache/operation.c:449!
[4740718.881040] invalid opcode:  [#1] SMP

[4740718.892659] Call Trace:
[4740718.893506]  [] cachefiles_read_copier+0x3a9/0x410 [cachefiles]
[4740718.894374]  [] fscache_op_work_func+0x22/0x50 [fscache]
[4740718.895180]  [] process_one_work+0x150/0x3f0
[4740718.895966]  [] worker_thread+0x11a/0x470
[4740718.896753]  [] ? __schedule+0x359/0x980
[4740718.897783]  [] ? rescuer_thread+0x310/0x310
[4740718.898581]  [] kthread+0xd6/0xf0
[4740718.899469]  [] ? kthread_park+0x60/0x60
[4740718.900477]  [] ret_from_fork+0x3f/0x70
[4740718.901514]  [] ? kthread_park+0x60/0x60

[Problem]

In include/linux/fscache-cache.h, fscache_retrieval_complete reads, in part:

atomic_sub(n_pages, &op->n_pages);
if (atomic_read(&op->n_pages) <= 0)
    fscache_op_complete(&op->op, true);

The code uses atomic_sub followed by a separate atomic_read. Two threads
decrementing the page count can therefore race: both can see
op->n_pages <= 0 at the same time, and both end up calling
fscache_op_complete, leading to the OOPS.

[Fix]
The fix is simple: use atomic_sub_return instead of the two separate calls.
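
As a rough illustration of the race, here is a deterministic Python model of
the two orderings. This is not kernel code: the Counter class and both function
names are invented for this sketch, and the worst-case interleaving is written
out by hand rather than produced by real threads.

```python
class Counter:
    """Stand-in for an atomic_t; each method models one atomic operation."""

    def __init__(self, value):
        self.value = value

    def sub(self, n):
        # Models atomic_sub(): subtracts, returns nothing.
        self.value -= n

    def read(self):
        # Models atomic_read().
        return self.value

    def sub_return(self, n):
        # Models atomic_sub_return(): subtract and fetch the result in one step.
        self.value -= n
        return self.value


def split_sub_then_read(pages_a, pages_b, outstanding):
    """Both subtractions land before either thread reads the counter."""
    counter = Counter(outstanding)
    counter.sub(pages_a)            # thread A: atomic_sub
    counter.sub(pages_b)            # thread B: atomic_sub
    completions = 0
    if counter.read() <= 0:         # thread A: atomic_read
        completions += 1
    if counter.read() <= 0:         # thread B: atomic_read
        completions += 1
    return completions


def combined_sub_return(pages_a, pages_b, outstanding):
    """Each thread tests the value its own subtraction produced."""
    counter = Counter(outstanding)
    completions = 0
    if counter.sub_return(pages_a) <= 0:    # thread A
        completions += 1
    if counter.sub_return(pages_b) <= 0:    # thread B
        completions += 1
    return completions


print(split_sub_then_read(1, 1, 2))
print(combined_sub_return(1, 1, 2))
```

With two outstanding pages and each thread completing one, the split version
lets both threads reach the completion call, while the combined version lets
only the thread that actually brought the count to zero do so.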

[Testcase]
The user has tested the patch successfully on their fscache/cachefiles setup.

[Regression Potential]
Limited to fscache. Small, comprehensible change.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: Incomplete

** Description changed:

  SRU Justification
  -
  
  [Impact]
  
  A kernel BUG is sometimes observed when using fscache:
+ [4740718.880898] FS-Cache:
+ [4740718.880920] FS-Cache: Assertion failed
+ [4740718.880934] FS-Cache: 0 > 0 is false
+ [4740718.881001] [ cut here ]
+ [4740718.881017] kernel BUG at 
/usr/src/linux-4.4.0/fs/fscache/operation.c:449!
+ [4740718.881040] invalid opcode:  [#1] SMP
+ 
+ [4740718.892659] Call Trace:
+ [4740718.893506]  [] cachefiles_read_copier+0x3a9/0x410 
[cachefiles]
+ [4740718.894374]  [] fscache_op_work_func+0x22/0x50 
[fscache]
+ [4740718.895180]  [] process_one_work+0x150/0x3f0
+ [4740718.895966]  [] worker_thread+0x11a/0x470
+ [4740718.896753]  [] ? __schedule+0x359/0x980
+ [4740718.897783]  [] ? rescuer_thread+0x310/0x310
+ [4740718.898581]  [] kthread+0xd6/0xf0
+ [4740718.899469]  [] ? kthread_park+0x60/0x60
+ [4740718.900477]  [] ret_from_fork+0x3f/0x70
+ [4740718.901514]  [] ? kthread_park+0x60/0x60
  
- Jun 25 11:32:08  kernel: [4740718.880898] FS-Cache:
- Jun 25 11:32:08  kernel: [4740718.880920] FS-Cache: Assertion failed
- Jun 25 11:32:08  kernel: [4740718.880934] FS-Cache: 0 > 0 is false
- Jun 25 11:32:08  kernel: [4740718.881001] [ cut here 
]
- Jun 25 11:32:08  kernel: [4740718.881017] kernel BUG at 
/usr/src/linux-4.4.0/fs/fscache/operation.c:449!
- Jun 25 11:32:08  kernel: [4740718.881040] invalid opcode:  [#1] SMP
- ...
- Jun 25 11:32:08  kernel: [4740718.892659] Call Trace:
- Jun 25 11:32:08  kernel: [4740718.893506]  [] 
cachefiles_read_copier+0x3a9/0x410 [cachefiles]
- Jun 25 11:32:08  kernel: [4740718.894374]  [] 
fscache_op_work_func+0x22/0x50 [fscache]
- Jun 25 11:32:08  kernel: [4740718.895180]  [] 
process_one_work+0x150/0x3f0
- Jun 25 11:32:08  kernel: [4740718.895966]  [] 
worker_thread+0x11a/0x470
- Jun 25 11:32:08  kernel: [4740718.896753]  [] ? 
__schedule+0x359/0x980
- Jun 25 11:32:08  kernel: [4740718.897783]  [] ? 
rescuer_thread+0x310/0x310
- Jun 25 11:32:08  kernel: [4740718.898581]  [] 
kthread+0xd6/0xf0
- Jun 25 11:32:08  kernel: [4740718.899469]  [] ? 
kthread_park+0x60/0x60
- Jun 25 11:32:08  kernel: [4740718.900477]  [] 
ret_from_fork+0x3f/0x70
- Jun 25 11:32:08  kernel: [4740718.901514]  [] ? 
kthread_park+0x60/0x60
- 
  [Problem]
  
  In include/fscache-cache.h, fscache_retrieval_complete reads, in part:
  
- atomic_sub(n_pages, &op->n_pages);
- if (atomic_read(&op->n_pages) <= 0)
- fscache_op_complete(&op->op, true);
- 
- The code is using atomic_sub followed by an atomic_read. This causes two 
threads doing a decrement of pages to race with each other seeing the 
op->refcount <= 0 at same time,
- and end up calling fscache_op_complete in both the threads leading to the 
OOPS.
- 
+ atomic_sub(n_pages, &op->n_pages);
+ if (atomic_read(&op->n_pages) <= 0)
+ fscache_op_complete(&op->op, true);
+ 
+ The code is using atomic_sub followed by an atomic_read. 

[Kernel-packages] [Bug 1742658] Re: linux-generic-hwe-16.04 OOPS in nouveau after security update

2018-10-02 Thread Daniel Axtens
Hi,

I haven't found the time to do this yet, sorry. Is it still an issue on
the current Xenial kernel?

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1742658

Title:
  linux-generic-hwe-16.04 OOPS in nouveau after security update

Status in linux package in Ubuntu:
  Confirmed
Status in linux-hwe package in Ubuntu:
  New
Status in linux-hwe-edge package in Ubuntu:
  New
Status in linux-meta-hwe package in Ubuntu:
  New
Status in linux-meta-hwe-edge package in Ubuntu:
  New

Bug description:
  Description:  Ubuntu 16.04.3 LTS
  Release:  16.04

  After upgrading linux-generic-hwe-16.04 to 4.13.0.26.46 I get a black screen 
with nouveau.
  Previously I was running 4.10.0-42-generic, and that kernel still works fine.

  Here is the OOPS:

  Jan 11 09:39:18 edvin-tower kernel: [3.079986] [drm] Initialized nouveau
1.3.1 20120801 for :02:00.0 on minor 0
  Jan 11 09:39:18 edvin-tower kernel: [3.100591] BUG: unable to handle 
kernel NULL pointer dereference at   (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100606] IP:   (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100610] PGD 0 
  Jan 11 09:39:18 edvin-tower kernel: [3.100611] P4D 0 
  Jan 11 09:39:18 edvin-tower kernel: [3.100615] 
  Jan 11 09:39:18 edvin-tower kernel: [3.100620] Oops: 0010 [#1] SMP PTI
  Jan 11 09:39:18 edvin-tower kernel: [3.100623] Modules linked in: 
hid_generic usbhid hid nouveau mxm_wmi video i2c_algo_bit ttm drm_kms_helper 
syscopyarea sysfillrect e1000e sysimgblt fb_sys_fops drm ptp ahci pps_core 
pata_acpi libahci wmi
  Jan 11 09:39:18 edvin-tower kernel: [3.100643] CPU: 4 PID: 238 Comm: 
kworker/u16:7 Not tainted 4.13.0-26-generic #29~16.04.2-Ubuntu
  Jan 11 09:39:18 edvin-tower kernel: [3.100649] Hardware name: Dell Inc. 
Precision Tower 5810/0K240Y, BIOS A05 12/16/2014
  Jan 11 09:39:18 edvin-tower kernel: [3.100688] Workqueue: nvkm-disp 
gf119_disp_super [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100694] task: 9d8982d25d00 
task.stack: ac9ec2134000
  Jan 11 09:39:18 edvin-tower kernel: [3.100698] RIP: 0010:  (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100701] RSP: 0018:ac9ec2137bd8 
EFLAGS: 00010206
  Jan 11 09:39:18 edvin-tower kernel: [3.100706] RAX: c0416f20 RBX: 
 RCX: 0016
  Jan 11 09:39:18 edvin-tower kernel: [3.100710] RDX:  RSI: 
 RDI: 9d898140d180
  Jan 11 09:39:18 edvin-tower kernel: [3.100715] RBP: ac9ec2137c70 R08: 
 R09: 
  Jan 11 09:39:18 edvin-tower kernel: [3.100719] R10: 1000 R11: 
 R12: 
  Jan 11 09:39:18 edvin-tower kernel: [3.100724] R13:  R14: 
ac9ec2137d00 R15: 9d898c542600
  Jan 11 09:39:18 edvin-tower kernel: [3.100728] FS:  
() GS:9d899fd0() knlGS:
  Jan 11 09:39:18 edvin-tower kernel: [3.100733] CS:  0010 DS:  ES: 
 CR0: 80050033
  Jan 11 09:39:18 edvin-tower kernel: [3.100737] CR2:  CR3: 
00029ac0a006 CR4: 001606e0
  Jan 11 09:39:18 edvin-tower kernel: [3.100742] Call Trace:
  Jan 11 09:39:18 edvin-tower kernel: [3.100771]  ? 
nvkm_dp_train_drive+0x214/0x300 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100798]  nvkm_dp_train+0x582/0x970 
[nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100824]  
nvkm_dp_acquire+0xd4/0x390 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100850]  
nv50_disp_super_2_2+0x6d/0x430 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100872]  ? 
nvkm_devinit_pll_set+0xf/0x20 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100897]  
gf119_disp_super+0x1b7/0x300 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100904]  ? __schedule+0x3ca/0x890
  Jan 11 09:39:18 edvin-tower kernel: [3.100911]  
process_one_work+0x156/0x410
  Jan 11 09:39:18 edvin-tower kernel: [3.100915]  worker_thread+0x4b/0x460
  Jan 11 09:39:18 edvin-tower kernel: [3.100920]  kthread+0x109/0x140
  Jan 11 09:39:18 edvin-tower kernel: [3.100924]  ? 
process_one_work+0x410/0x410
  Jan 11 09:39:18 edvin-tower kernel: [3.100928]  ? 
kthread_create_on_node+0x70/0x70
  Jan 11 09:39:18 edvin-tower kernel: [3.100934]  ret_from_fork+0x1f/0x30
  Jan 11 09:39:18 edvin-tower kernel: [3.100938] Code:  Bad RIP value.
  Jan 11 09:39:18 edvin-tower kernel: [3.100944] RIP:   (null) RSP: 
ac9ec2137bd8
  Jan 11 09:39:18 edvin-tower kernel: [3.100948] CR2: 
  Jan 11 09:39:18 edvin-tower kernel: [3.100952] ---[ end trace 
93a79dae0d3ec749 ]---

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-generic-hwe-16.04 4.13.0.26.46
  ProcVersionSignature: Ubuntu 

[Kernel-packages] [Bug 1793430] [NEW] Page leaking in cachefiles_read_backing_file while vmscan is active

2018-09-19 Thread Daniel Axtens
Public bug reported:

SRU Justification
-

[Description]
In a heavily loaded system where the system pagecache is nearing memory limits
and fscache is enabled, pages can be leaked by fscache while trying to read
pages from the cachefiles backend. Two applications can be reading the same
page from a single mount, so two threads can be trying to read the backing
page at the same time. As a result, one of the threads finds that a page for
the backing file or netfs file is already in the radix tree. During the error
handling, cachefiles does not clean up the reference on the backing page,
leading to a page leak.

[Fix]
The fix is straightforward: decrement the reference when the error is encountered.
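
To make the leak concrete, here is a toy Python model of the error path. It is
only a sketch: the Page class, read_one_page and the put_backing_on_error
switch are hypothetical names, and the real cachefiles code is considerably
more involved.

```python
class Page:
    """Toy page with a bare reference count."""

    def __init__(self):
        self.refcount = 0

    def get(self):
        self.refcount += 1

    def put(self):
        self.refcount -= 1


def read_one_page(backing_page, netfs_cache, index, put_backing_on_error):
    """Model one reader; returns "ok" or "EEXIST"."""
    backing_page.get()                  # lookup takes a reference on the backing page
    if index in netfs_cache:            # another reader already inserted this page
        if put_backing_on_error:
            backing_page.put()          # the fix: drop the reference on the error path
        return "EEXIST"
    netfs_cache[index] = "netfs page"
    backing_page.put()                  # normal path drops the reference when done
    return "ok"


def leaked_refs(put_backing_on_error):
    """Run two readers of the same page and report references left behind."""
    backing_page, netfs_cache = Page(), {}
    read_one_page(backing_page, netfs_cache, 0, put_backing_on_error)
    read_one_page(backing_page, netfs_cache, 0, put_backing_on_error)
    return backing_page.refcount


print(leaked_refs(put_backing_on_error=False))
print(leaked_refs(put_backing_on_error=True))
```

In this model the second reader hits the "already present" case; without the
extra put, one reference on the backing page is never released.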

[Testing]
A user has tested the fix using the following method for 12+ hours.

1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc :/export /mnt/nfs
2) create 1 files of 2.8MB in an NFS mount.
3) start a thread to simulate heavy VM pressure
   (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
4) start multiple parallel readers for the data set at the same time
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   ..
   ..
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
   find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
   free -h , cat /proc/meminfo and page-types -r -b lru
   to ensure all pages are freed.

[Regression Potential]
Limited to cachefiles.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1793430

Title:
  Page leaking in cachefiles_read_backing_file while vmscan is active

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1793430/+subscriptions



[Kernel-packages] [Bug 1783246] Re: Cephfs + fscache: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: jbd2__journal_start+0x22/0x1f0

2018-08-15 Thread Daniel Axtens
** Changed in: linux (Ubuntu)
 Assignee: Daniel Axtens (daxtens) => (unassigned)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1783246

Title:
  Cephfs + fscache: unable to handle kernel NULL pointer dereference at
  0000000000000000 IP: jbd2__journal_start+0x22/0x1f0

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  SRU Justification
  -

  [Impact]
  Certain sequences of file system operations on a cephfs volume backed by
  fscache with an ext4 store can cause a kernel BUG:

  
  [ 5818.932770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  [ 5818.934354] IP: jbd2__journal_start+0x33/0x1e0
  ...
  [ 5818.962490] Call Trace:
  [ 5818.963055] ? ext4_writepages+0x5d5/0xf40
  [ 5818.963884] __ext4_journal_start_sb+0x6d/0x120
  [ 5818.964994] ext4_writepages+0x5d5/0xf40
  [ 5818.965991] ? __enqueue_entity+0x5c/0x60
  [ 5818.966791] ? check_preempt_wakeup+0x130/0x240
  [ 5818.967679] do_writepages+0x4b/0xe0
  [ 5818.968625] ? ext4_mark_inode_dirty+0x1d0/0x1d0
  [ 5818.969526] ? do_writepages+0x4b/0xe0
  [ 5818.970493] ? ext4_statfs+0x114/0x260
  [ 5818.971267] __filemap_fdatawrite_range+0xc1/0x100
  [ 5818.972425] ? __filemap_fdatawrite_range+0xc1/0x100
  [ 5818.973385] filemap_write_and_wait+0x31/0x90
  [ 5818.974461] ext4_bmap+0x8c/0xe0
  [ 5818.975150] cachefiles_read_or_alloc_pages+0x1bf/0xd90 [cachefiles]
  [ 5818.976718] ? _cond_resched+0x19/0x40
  [ 5818.977482] ? wake_up_bit+0x42/0x50
  [ 5818.978227] ? fscache_run_op.isra.8+0x4c/0x80 [fscache]
  [ 5818.979249] __fscache_read_or_alloc_pages+0x1d3/0x2e0 [fscache]
  [ 5818.980397] ceph_readpages_from_fscache+0x6c/0xe0 [ceph]
  [ 5818.981630] ceph_readpages+0x49/0x100 [ceph]
  [ 5818.982691] __do_page_cache_readahead+0x1c9/0x2c0
  [ 5818.983628] ? __cap_is_valid+0x21/0xb0 [ceph]
  [ 5818.984526] ondemand_readahead+0x11a/0x2a0
  [ 5818.985374] ? ondemand_readahead+0x11a/0x2a0
  [ 5818.986825] page_cache_async_readahead+0x71/0x80
  [ 5818.987751] generic_file_read_iter+0x784/0xbf0
  [ 5818.988663] ? ceph_put_cap_refs+0x1c4/0x330 [ceph]
  [ 5818.989620] ? page_cache_tree_insert+0xe0/0xe0
  [ 5818.990519] ceph_read_iter+0x106/0x820 [ceph]
  [ 5818.991818] new_sync_read+0xe4/0x130
  [ 5818.992588] __vfs_read+0x29/0x40
  [ 5818.993504] vfs_read+0x8e/0x130
  [ 5818.994192] SyS_read+0x55/0xc0
  [ 5818.994870] do_syscall_64+0x73/0x130
  [ 5818.995632] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

  [Fix]
  Cherry-pick 5d988308283ecf062fa88f20ae05c52cce0bcdca from upstream.

  This patch stops cephfs from reusing current->journal for its own
  internal use, which means that it's valid when ext4 uses it via
  fscache.

  [Testcase]
  A user has been using the following test case:
  ( cat /proc/fs/fscache/stats > ~/test.log; i=0; while true; do
      touch small; echo 3 > /proc/sys/vm/drop_caches & md5sum small; let "i++";
      if ! (( $i % 1000 )); then
          echo "Test iteration $i done" >> ~/test.log;
          cat /proc/fs/fscache/stats >> ~/test.log;
      fi;
  done ) > ~/nohup.out 2>&1

  (It boils down to "touch file; drop caches; read file")
  Without the patch, this fails very quickly - usually the first time,
  always within a few iterations. With the patch, the user ran this loop
  for over 60 hours without incident.

  [Regression potential]
  The change is not trivial, but is limited to cephfs, and has been in
  mainline since v4.16. So the risk of regression is well contained.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1783246/+subscriptions



[Kernel-packages] [Bug 1774336] Re: FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

2018-08-01 Thread Daniel Axtens
** Description changed:

  == SRU Justification ==
  
  [Impact]
  Oops during heavy NFS + FSCache use:
  
- [81738.886634] FS-Cache: 
+ [81738.886634] FS-Cache:
  [81738.888281] FS-Cache: Assertion failed
  [81738.889461] FS-Cache: 6 == 5 is false
  [81738.890625] [ cut here ]
  [81738.891706] kernel BUG at 
/build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!
  
  6 == 5 represents an operation being DEAD when it was not expected to
  be.
  
  [Cause]
- There is a race in fscache and cachefiles. 
+ There is a race in fscache and cachefiles.
  
  One thread is in cachefiles_read_waiter:
-  1) object->work_lock is taken.
-  2) the operation is added to the to_do list.
-  3) the work lock is dropped.
-  4) fscache_enqueue_retrieval is called, which takes a reference.
+  1) object->work_lock is taken.
+  2) the operation is added to the to_do list.
+  3) the work lock is dropped.
+  4) fscache_enqueue_retrieval is called, which takes a reference.
  
  Another thread is in cachefiles_read_copier:
-  1) object->work_lock is taken
-  2) an item is popped off the to_do list.
-  3) object->work_lock is dropped.
-  4) some processing is done on the item, and fscache_put_retrieval() is 
called, dropping a reference.
+  1) object->work_lock is taken
+  2) an item is popped off the to_do list.
+  3) object->work_lock is dropped.
+  4) some processing is done on the item, and fscache_put_retrieval() is 
called, dropping a reference.
  
  Now if this process in cachefiles_read_copier takes place *between*
  steps 3 and 4 in cachefiles_read_waiter, a reference will be dropped
  before it is taken, which leads to the object's reference count hitting
  zero, which leads to lifecycle events for the object happening too soon,
  leading to the assertion failure later on.
  
  (This is simplified and clarified from the original upstream analysis
  for this patch at
  https://www.redhat.com/archives/linux-cachefs/2018-February/msg1.html
  and from a similar patch with a different approach to fixing the bug at
  https://www.redhat.com/archives/linux-cachefs/2017-June/msg2.html)
  
  [Fix]
- Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter. This 
means that the object cannot be popped off the to_do list until it is in a 
fully consistent state with the reference taken.
+ 
+ 
+  (Old sauce patch being reverted) Move fscache_enqueue_retrieval under the 
lock in cachefiles_read_waiter. This means that the object cannot be popped off 
the to_do list until it is in a fully consistent state with the reference taken.
+ 
+  (New upstream patch) Explicitly take a reference to the object while it
+ is being enqueued. Adjust another part of the code to deal with the
+ greater range of object states this exposes.
  
  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.
  
  [Regression Potential]
-  - Limited to fscache/cachefiles. 
-  - The change makes things more conservative (doing more under lock) so 
that's reassuring. 
-  - There may be performance impacts but none have been observed so far.
+  - Limited to fscache/cachefiles.
+  - The change makes things more conservative (taking more references) so 
that's reassuring.
+  - There may be performance impacts but none have been observed so far.
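
The ordering problem in the [Cause] section above can be replayed with a small
Python model. This is a sketch under stated assumptions: the Operation class,
the step names and the baseline reference count of 1 are all made up for
illustration and do not mirror the real fscache structures.

```python
class Operation:
    """Toy refcounted operation that records use after its count hit zero."""

    def __init__(self):
        self.refs = 1                    # assumed baseline reference held elsewhere
        self.dead = False
        self.touched_after_death = False

    def get(self):
        if self.dead:
            self.touched_after_death = True
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.dead = True             # lifecycle events would start here


def replay(steps):
    """Apply waiter/copier steps in the given order and return the operation."""
    op = Operation()
    to_do = []
    for step in steps:
        if step == "waiter_add":         # waiter steps 1-3: add to to_do under the lock
            to_do.append(op)
        elif step == "waiter_get":       # waiter step 4: enqueue takes a reference
            op.get()
        elif step == "copier_run":       # copier steps 1-4: pop, process, drop a reference
            to_do.pop().put()
        else:
            raise ValueError(step)
    return op


# Copier runs between waiter steps 3 and 4: the put precedes the get.
RACY_ORDER = ["waiter_add", "copier_run", "waiter_get"]
# Reference is taken before the copier can see the item.
SAFE_ORDER = ["waiter_get", "waiter_add", "copier_run"]

print(replay(RACY_ORDER).touched_after_death)
print(replay(SAFE_ORDER).touched_after_death)
```

Both fixes described above have the same effect in this model: they make sure
the reference exists before the item can be popped off the to_do list.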

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1774336

Title:
  FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Released
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Artful:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  == SRU Justification ==

  [Impact]
  Oops during heavy NFS + FSCache use:

  [81738.886634] FS-Cache:
  [81738.888281] FS-Cache: Assertion failed
  [81738.889461] FS-Cache: 6 == 5 is false
  [81738.890625] [ cut here ]
  [81738.891706] kernel BUG at 
/build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

  6 == 5 represents an operation being DEAD when it was not expected to
  be.

  [Cause]
  There is a race in fscache and cachefiles.

  One thread is in cachefiles_read_waiter:
   1) object->work_lock is taken.
   2) the operation is added to the to_do list.
   3) the work lock is dropped.
   4) fscache_enqueue_retrieval is called, which takes a reference.

  Another thread is in cachefiles_read_copier:
   1) object->work_lock is taken
   2) an item is popped off the to_do list.
   3) object->work_lock is dropped.
   4) some processing is done on the item, and fscache_put_retrieval() is 
called, dropping a reference.

  Now if this process in cachefiles_read_copier takes place
  *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
  dropped before it is taken, which leads 

[Kernel-packages] [Bug 1784864] [NEW] Various fscache/cachefiles bugs

2018-08-01 Thread Daniel Axtens
Public bug reported:

SRU Justification
-

A few bugs while using fscache/cachefiles on a NFS share have been
reported by a user. All are intermittent/race conditions.

[Impact]
Various BUGs/OOPSes:

 - BUG on "Unexpected object collision"

 - CacheFiles: Error: Overlong wait for old active object to go away /
   CacheFiles: Error: Object already active /
   kernel BUG at fs/cachefiles/namei.c:163!

 - Unmounting an NFS share sometimes leads to an oops

[Fix]
Grab the following patches from the fscache-fixes branch of Dave Howells'
linux-fs tree
(https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes);
they are various small fixes within fscache/cachefiles.

4856fccd559f cachefiles: Wait rather than BUG'ing on "Unexpected object collision"
28d64cf8990c cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
aedc4ca703bc fscache: Fix reference overput in fscache_attach_object() error handling

[Testcase]
The user has run ~100 hours of NFS stress tests and has not seen these bugs
recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

** Affects: linux (Ubuntu)
 Importance: Undecided
     Assignee: Daniel Axtens (daxtens)
 Status: Invalid

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1784864

Title:
  Various fscache/cachefiles bugs

Status in linux package in Ubuntu:
  Invalid

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1784864/+subscriptions



[Kernel-packages] [Bug 1784864] Re: Various fscache/cachefiles bugs

2018-08-01 Thread Daniel Axtens
Oops, my mistake, there are already LP bugs covering these issues.

Regards,
Daniel

** Changed in: linux (Ubuntu)
   Status: Confirmed => Invalid

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1784864

Title:
  Various fscache/cachefiles bugs

Status in linux package in Ubuntu:
  Invalid

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1784864/+subscriptions



[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)

2018-07-31 Thread Daniel Axtens
Yes, we have closed the support case on our end at their request.
Apparently increasing the reservation ratio has helped.

Paulus - Hi! Thanks for the info and clearing up some of my
misunderstandings. Great to hear from you and I hope things are going
well at OzLabs :)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Opinion
Status in linux package in Ubuntu:
  New

Bug description:
  Per an email forwarded within IBM, we wish to use this Launchpad bug
  to work on the technical discussion with the Canonical development
  folks and the IBM KVM and kernel team surrounding the analysis made by
  Daniel Axtens of Canonical for the customer issue raised in Case
  #00177825.

  The only statement at the moment by the KVM team was that there were
  various issues associated with CMA fragmentation causing issues with
  KVM guests. However, as mentioned, this bug is to allow the dialog
  amongst all the developers to see what can be done to help alleviate
  the situation or understand the root cause further.

  Please also note that we should not be attaching customer data to this
  bug. If that is necessary then we expect Canonical to help provide a
  controlled environment for reviewing that data so we avoid any privacy
  issues (e.g. for GDPR compliance).

  Here is the email from Daniel:

  I have looked at the sosreport you uploaded. Here is my analysis so
  far.

  Virtualisation on powerpc has some special requirements. To start a
  guest on a powerpc host, you need to allocate a contiguous area of
  memory to hold the guest's hash page table (HPT, or HTAB, depending on
  which document you look at). The HPT is required to track and manage
  guest memory.

  Your error reports show qemu asking the kernel to allocate an HTAB,
  and the kernel reporting that it had insufficient memory to do so. The
  required memory for the HPT scales with the guest memory size - it
  should be about 1/128th of guest memory, so for a 16GB guest, that's
  128MB. However, the HPT has to be allocated as a single contiguous
  memory region. (This is in contrast to regular guest memory, which is
  not required to be contiguous from the host point of view.)
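
  As a quick check of the sizing rule, the arithmetic can be written out in
  Python. The 1/128 ratio is the approximate figure quoted above; the function
  name and the sample guest sizes are only for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def hpt_size_bytes(guest_memory_bytes, divisor=128):
    """Approximate HPT size: about 1/128th of guest memory."""
    return guest_memory_bytes // divisor


for guest_gib in (4, 16, 64):
    size_mib = hpt_size_bytes(guest_gib * GIB) // MIB
    print(f"{guest_gib} GiB guest -> {size_mib} MiB HPT")
```

  For the 16GB guest discussed here this gives 128MB, all of which has to come
  from one contiguous region.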

  The kernel keeps a special contiguous memory area (CMA) for these
  purposes, and keeps track of the total amounts in use and still
  available. These are shown in /proc/meminfo. From the system that ran
  the sosreport, we see:

  CmaTotal: 26853376 kB
  CmaFree: 4024448 kB

  So there is a total of about 25GB of CMA, of which about 3.8GB remain.
  This is obviously more than the 128MB needed, so there are two likely
  explanations for the failure:

  - It's very possible that between the error and the sosreport, more
  contiguous memory became available. This would match the intermittent
  nature of the issue.

  - It also might be that the failure was due to fragmentation of memory
  in the CMA pool. That is, there might be more than 128MB, but it might
  all be in chunks that are smaller than 128MB, or which don't have the
  required alignment for a HPT.
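
  The fragmentation case can be sketched in Python. The model below is
  hypothetical: free CMA space is a list of (start, length) chunks in MiB, and
  the assumption that the HPT must be aligned to 128 MiB is for illustration
  only.

```python
def can_allocate(free_chunks, size, align):
    """Return True if any free chunk holds an aligned block of `size`.

    free_chunks is a list of (start, length) pairs; all values share one unit.
    """
    for start, length in free_chunks:
        aligned_start = -(-start // align) * align   # round start up to alignment
        if aligned_start + size <= start + length:
            return True
    return False


# 300 MiB free in total, but no single chunk is large enough.
scattered = [(0, 100), (200, 100), (400, 100)]
# One 160 MiB chunk, but the aligned block would run past its end.
misaligned = [(64, 160)]
# Exactly one aligned 128 MiB chunk.
usable = [(128, 128)]

for name, chunks in (("scattered", scattered),
                     ("misaligned", misaligned),
                     ("usable", usable)):
    total = sum(length for _, length in chunks)
    print(name, total, can_allocate(chunks, size=128, align=128))
```

  In the first two cases the total free space exceeds 128 MiB, yet the
  allocation still fails, which is the situation described in the second
  bullet above.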

  Given that the system's uptime was 112 days when the sosreport was
  generated, it would be unsurprising if fragmentation had occurred!
  (Relatedly - you're running 4.4.0-109, which does not have the Spectre
  and Meltdown fixes.)

  This issue has come up before - both in a public Canonical-IBM
  synchronised bug report[1], and with Red Hat[2]. It appears that there
  is some work within IBM to address this, but it seems to have stalled.
  I will get in touch with the IBM powerpc kernel team on their public
  mailing list and ask about the status. I will keep you updated.

  In the meantime, I have a potential solution/workaround. By default,
  5% of memory is reserved for CMA (kernel source:
  arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can
  increase this with a boot parameter, so for example to reserve 10%,
  you could boot with kvm_cma_resv_ratio=10. This can be set in
  petitboot. This should significantly reduce the incidence of this
  issue - perhaps eliminating it entirely - at the cost of locking away
  more of the system's memory. You would need to experiment to determine
  the optimal value. Perhaps given that you are seeing the problem only
  intermittently, a ratio of 7% would be sufficient - that would give
  you ~35GB of CMA.
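
  The figures above can be reproduced with a few lines of Python. This is a
  back-of-the-envelope sketch: it assumes the current CmaTotal corresponds to
  the default 5% ratio and that the reservation scales linearly with
  kvm_cma_resv_ratio; the helper names are made up.

```python
KB_PER_GIB = 1024 * 1024

CMA_TOTAL_KB = 26853376   # CmaTotal from /proc/meminfo, as quoted above
CMA_FREE_KB = 4024448     # CmaFree from /proc/meminfo, as quoted above


def kb_to_gib(kb):
    """Convert a /proc/meminfo kB value to GiB."""
    return kb / KB_PER_GIB


def cma_at_ratio(cma_total_kb, current_pct, new_pct):
    """Scale the current CMA size to a different reservation percentage."""
    return cma_total_kb * new_pct / current_pct


print(f"CMA total: {kb_to_gib(CMA_TOTAL_KB):.1f} GiB")
print(f"CMA free:  {kb_to_gib(CMA_FREE_KB):.1f} GiB")
for pct in (7, 10):
    scaled_kb = cma_at_ratio(CMA_TOTAL_KB, current_pct=5, new_pct=pct)
    print(f"ratio {pct}%: about {kb_to_gib(scaled_kb):.1f} GiB of CMA")
```

  Under these assumptions a ratio of 7% lands just under 36 GiB, in line with
  the rough ~35GB estimate above.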

  Please let me know if testing this setting would be an option for you.
  Please also let me know if you require further information on setting
  boot parameters with Petitboot.

  Regards,
  Daniel

  [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

  Before we go any further, let's get the basic info here. Apparently
  there was a sosreport somewhere else, and a link would be good, but,
  here's what we need

[Kernel-packages] [Bug 1783246] [NEW] Cephfs + fscache: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: jbd2__journal_start+0x22/0x1f0

2018-07-23 Thread Daniel Axtens
Public bug reported:

SRU Justification
-

[Impact]
Certain sequences of file system operations on a cephfs volume backed by 
fscache with an ext4 store can cause a kernel BUG:


[ 5818.932770] BUG: unable to handle kernel NULL pointer dereference at 

[ 5818.934354] IP: jbd2__journal_start+0x33/0x1e0
...
[ 5818.962490] Call Trace:
[ 5818.963055] ? ext4_writepages+0x5d5/0xf40
[ 5818.963884] __ext4_journal_start_sb+0x6d/0x120
[ 5818.964994] ext4_writepages+0x5d5/0xf40
[ 5818.965991] ? __enqueue_entity+0x5c/0x60
[ 5818.966791] ? check_preempt_wakeup+0x130/0x240
[ 5818.967679] do_writepages+0x4b/0xe0
[ 5818.968625] ? ext4_mark_inode_dirty+0x1d0/0x1d0
[ 5818.969526] ? do_writepages+0x4b/0xe0
[ 5818.970493] ? ext4_statfs+0x114/0x260
[ 5818.971267] __filemap_fdatawrite_range+0xc1/0x100
[ 5818.972425] ? __filemap_fdatawrite_range+0xc1/0x100
[ 5818.973385] filemap_write_and_wait+0x31/0x90
[ 5818.974461] ext4_bmap+0x8c/0xe0
[ 5818.975150] cachefiles_read_or_alloc_pages+0x1bf/0xd90 [cachefiles]
[ 5818.976718] ? _cond_resched+0x19/0x40
[ 5818.977482] ? wake_up_bit+0x42/0x50
[ 5818.978227] ? fscache_run_op.isra.8+0x4c/0x80 [fscache]
[ 5818.979249] __fscache_read_or_alloc_pages+0x1d3/0x2e0 [fscache]
[ 5818.980397] ceph_readpages_from_fscache+0x6c/0xe0 [ceph]
[ 5818.981630] ceph_readpages+0x49/0x100 [ceph]
[ 5818.982691] __do_page_cache_readahead+0x1c9/0x2c0
[ 5818.983628] ? __cap_is_valid+0x21/0xb0 [ceph]
[ 5818.984526] ondemand_readahead+0x11a/0x2a0
[ 5818.985374] ? ondemand_readahead+0x11a/0x2a0
[ 5818.986825] page_cache_async_readahead+0x71/0x80
[ 5818.987751] generic_file_read_iter+0x784/0xbf0
[ 5818.988663] ? ceph_put_cap_refs+0x1c4/0x330 [ceph]
[ 5818.989620] ? page_cache_tree_insert+0xe0/0xe0
[ 5818.990519] ceph_read_iter+0x106/0x820 [ceph]
[ 5818.991818] new_sync_read+0xe4/0x130
[ 5818.992588] __vfs_read+0x29/0x40
[ 5818.993504] vfs_read+0x8e/0x130
[ 5818.994192] SyS_read+0x55/0xc0
[ 5818.994870] do_syscall_64+0x73/0x130
[ 5818.995632] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

[Fix]
Cherry-pick 5d988308283ecf062fa88f20ae05c52cce0bcdca from upstream.

This patch stops cephfs from reusing current->journal_info for its own
internal use, so ext4 no longer finds a cephfs value in that field when
it is reached via fscache.

[Testcase]
A user has been using the following test case:
( cat /proc/fs/fscache/stats > ~/test.log; i=0; while true; do
    touch small; echo 3 > /proc/sys/vm/drop_caches & md5sum small; let "i++";
    if ! (( $i % 1000 )); then
        echo "Test iteration $i done" >> ~/test.log;
        cat /proc/fs/fscache/stats >> ~/test.log;
    fi;
done ) > ~/nohup.out 2>&1

(It boils down to "touch file; drop caches; read file")
Without the patch, this fails very quickly - usually the first time, always 
within a few iterations. With the patch, the user ran this loop for over 60 
hours without incident.

[Regression potential]
The change is not trivial, but is limited to cephfs, and has been in mainline 
since v4.16. So the risk of regression is well contained.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1783246

Title:
  Cephfs + fscache: unable to handle kernel NULL pointer dereference at
   IP: jbd2__journal_start+0x22/0x1f0

Status in linux package in Ubuntu:
  Confirmed


Re: [Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-18 Thread Daniel Axtens
Hi,

I am told that this data comes from the same machine, but that it was
not captured while the machine was showing symptoms: because the
problem is intermittent, it was collected some time later. This matches
what I see in the logs, so I have no reason to doubt it.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

Bug description:
  Per an email forwarded within IBM, we wish to use this Launchpad bug
  to work on the technical discussion with the Canonical development
  folks and the IBM KVM and kernel team surrounding the analysis made by
  Daniel Axtens of Canonical for the customer issue raised in Case
  #00177825.

  The only statement at the moment by the KVM team was that there were
  various issues associated with CMA fragmentation causing issues with
  KVM guests. However, as mentioned, this bug is to allow the dialog
  amongst all the developers to see what can be done to help alleviate
  the situation or understand the root cause further.

  Please also note that we should not be attaching customer data to this
  bug. If that is necessary then we expect Canonical to help provide a
  controlled environment for reviewing that data so we avoid any privacy
  issues (e.g. for GDPR compliance).

  Here is the email from Daniel:

  I have looked at the sosreport you uploaded. Here is my analysis so
  far.

  Virtualisation on powerpc has some special requirements. To start a
  guest on a powerpc host, you need to allocate a contiguous area of
  memory to hold the guest's hash page table (HPT, or HTAB, depending on
  which document you look at). The HPT is required to track and manage
  guest memory.

  Your error reports show qemu asking the kernel to allocate an HTAB,
  and the kernel reporting that it had insufficient memory to do so. The
  required memory for the HPT scales with the guest memory size - it
  should be about 1/128th of guest memory, so for a 16GB guest, that's
  128MB. However, the HPT has to be allocated as a single contiguous
  memory region. (This is in contrast to regular guest memory, which is
  not required to be contiguous from the host point of view.)
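
  The sizing rule above can be sketched as follows. This is only the
  1/128 rule of thumb from this analysis, not the exact calculation done
  by qemu or the kernel, and the helper name is hypothetical.

```python
# Illustrative only: estimate the HPT size from the 1/128 rule of thumb.
# estimated_hpt_bytes is a hypothetical helper, not a kernel or qemu API.
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimated_hpt_bytes(guest_mem_bytes: int, ratio: int = 128) -> int:
    """Return the approximate HPT size for a guest of the given memory size."""
    return guest_mem_bytes // ratio


for guest_gib in (4, 16, 64):
    hpt_mib = estimated_hpt_bytes(guest_gib * GIB) // MIB
    print(f"{guest_gib} GiB guest -> ~{hpt_mib} MiB contiguous HPT")
```

  The point of the estimate is that the whole amount must come from one
  contiguous region, so larger guests are harder to place.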

  The kernel keeps a special contiguous memory area (CMA) for these
  purposes, and keeps track of the total amounts in use and still
  available. These are shown in /proc/meminfo. From the system that ran
  the sosreport, we see:

  CmaTotal: 26853376 kB
  CmaFree: 4024448 kB

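  So there is a total of about 25GB of CMA, of which about 3.8GB remain.
  This is obviously more than 128MB, so a simple shortage at the time of
  the sosreport does not explain the failure. There are two possible
  explanations: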

  - It's very possible that between the error and the sosreport, more
  contiguous memory became available. This would match the intermittent
  nature of the issue.

  - It also might be that the failure was due to fragmentation of memory
  in the CMA pool. That is, there might be more than 128MB, but it might
  all be in chunks that are smaller than 128MB, or which don't have the
  required alignment for a HPT.
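
  A minimal sketch of the check described above, using the two values
  quoted from the sosreport as sample input. The parser and the 128MB
  threshold are assumptions for illustration. Note that /proc/meminfo
  reports only totals, so passing this check says nothing about whether
  the free CMA is contiguous.

```python
# Illustrative only: parse CmaTotal/CmaFree and compare them with the
# estimated HPT size. parse_meminfo is a hypothetical helper.
KB_PER_GIB = 1024 * 1024


def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Sample input: the two lines quoted from the sosreport.
SAMPLE = """\
CmaTotal: 26853376 kB
CmaFree: 4024448 kB
"""

info = parse_meminfo(SAMPLE)
hpt_kb = 128 * 1024  # assumed HPT size for a 16GB guest, in kB

print(f"CmaTotal: {info['CmaTotal'] / KB_PER_GIB:.1f} GiB")
print(f"CmaFree:  {info['CmaFree'] / KB_PER_GIB:.1f} GiB")
print("free total covers one HPT:", info["CmaFree"] >= hpt_kb)
```

  On a live host the same function could be fed the contents of
  /proc/meminfo instead of the sample string.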

  Given that the system's uptime was 112 days when the sosreport was
  generated, it would be unsurprising if fragmentation had occurred!
  (Relatedly - you're running 4.4.0-109, which does not have the Spectre
  and Meltdown fixes.)

  This issue has come up before - both in a public Canonical-IBM
  synchronised bug report[1], and with Red Hat[2]. It appears that there
  is some work within IBM to address this, but it seems to have stalled.
  I will get in touch with the IBM powerpc kernel team on their public
  mailing list and ask about the status. I will keep you updated.

  In the meantime, I have a potential solution/workaround. By default,
  5% of memory is reserved for CMA (kernel source:
  arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can
  increase this with a boot parameter, so for example to reserve 10%,
  you could boot with kvm_cma_resv_ratio=10. This can be set in
  petitboot. This should significantly reduce the incidence of this
  issue - perhaps eliminating it entirely - at the cost of locking away
  more of the system's memory. You would need to experiment to determine
  the optimal value. Perhaps given that you are seeing the problem only
  intermittently, a ratio of 7% would be sufficient - that would give
  you ~35GB of CMA.
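
  The ~35GB figure can be reproduced with the arithmetic below. It
  assumes the reported CmaTotal corresponds to the default 5% ratio and
  that the reservation scales linearly with the ratio; any rounding or
  alignment applied by the kernel is ignored, and the helper name is
  hypothetical.

```python
# Illustrative only: scale the reported CmaTotal to a different
# kvm_cma_resv_ratio, assuming the reservation is linear in the ratio.
KB_PER_GIB = 1024 * 1024


def cma_for_ratio(cma_total_kb: int, current_pct: int, new_pct: int) -> float:
    """Estimate the CMA size in kB if the ratio changed from current_pct to new_pct."""
    return cma_total_kb * new_pct / current_pct


CMA_TOTAL_KB = 26853376  # CmaTotal from the sosreport

for pct in (5, 7, 10):
    gib = cma_for_ratio(CMA_TOTAL_KB, 5, pct) / KB_PER_GIB
    print(f"kvm_cma_resv_ratio={pct}: ~{gib:.1f} GiB of CMA")
```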

  Please let me know if testing this setting would be an option for you.
  Please also let me know if you require further information on setting
  boot parameters with Petitboot.

  Regards,
  Daniel

  [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

  Before we go any further, let's get the basic info here. Apparently
  there was a sosreport somewhere else, and a link would be good, but,
  here's what we need here

[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)

2018-07-18 Thread Daniel Axtens
** Attachment added: "var_log_libvirt_qemu.tar.bz2"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164739/+files/var_log_libvirt_qemu.tar.bz2

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New


[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)

2018-07-18 Thread Daniel Axtens
** Attachment added: "syslog"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164740/+files/syslog

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New


[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)

2018-07-18 Thread Daniel Axtens
** Attachment added: "meminfo"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164738/+files/meminfo

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New


[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)

2018-07-18 Thread Daniel Axtens
Based on the most recent information we have available to us
(2018-05-09):

1. What is the server model and at least basic config info (I/O cards,
firmware level)? Use /proc/meminfo, etc. Attach the syslog and the
/var/log/libvirt/qemu logs.

I am struggling a bit to determine the server model, but I'm uploading
the relevant logs.

2. What is running on the host (at least uname -a). It sounds from the
comment above like it's an older fix level, so let's get it updated to
the current level (and ensure the problem still exists) before
proceeding: There is zero point in trying to figure out whether fixes
that are known to exist in 16.04 are in this *particular* build level.

Linux apsoscmp-as-a4p 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9
20:00:40 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I don't have any answers for (3); the user has been asked.


** Attachment added: "lspci"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781038/+attachment/5164737/+files/lspci

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New


[Kernel-packages] [Bug 1781038] Re: KVM guest hash page table failed to allocate contiguous memory (CMA)

2018-07-11 Thread Daniel Axtens
Hi,

This came up in the context of a customer issue. I have asked them if we
can share anonymised data here, and I will pass on any response.

From my analysis of the code while working the case, it would seem that
you could reproduce this by spinning up and tearing down VMs of varying
memory sizes in order to fragment the CMA. It looks like PCI passthrough
would exacerbate the issue, although I don't believe this was a factor
in this instance.
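
To illustrate why mixed guest sizes can fragment a contiguous pool, here
is a toy model. It is not the kernel's CMA allocator: the pool size, the
first-fit policy and all names are assumptions chosen for illustration.
Small and large "guests" are allocated alternately until the pool is
full, then only the small ones are torn down.

```python
# Toy model only: a first-fit allocator over a list of fixed-size units.
# This does not reflect how the kernel's CMA allocator actually works.

def free_runs(pool):
    """Return the length of each maximal run of free (None) units."""
    runs, current = [], 0
    for owner in pool:
        if owner is None:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def allocate(pool, owner, size):
    """Claim the first run of `size` free units; return its start or None."""
    run_start, run_len = 0, 0
    for i, slot in enumerate(pool):
        if slot is None:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == size:
                pool[run_start:run_start + size] = [owner] * size
                return run_start
        else:
            run_len = 0
    return None


def release(pool, owner):
    """Free every unit held by `owner`."""
    for i, slot in enumerate(pool):
        if slot == owner:
            pool[i] = None


pool = [None] * 64  # 64 units of contiguous memory
guests = []
n = 0
while True:
    size = 2 if n % 2 == 0 else 6  # alternate small and large guests
    if allocate(pool, n, size) is None:
        break
    guests.append((n, size))
    n += 1

# Tear down only the small guests, leaving holes between the large ones.
for owner, size in guests:
    if size == 2:
        release(pool, owner)

runs = free_runs(pool)
print("total free units:", sum(runs))
print("largest contiguous run:", max(runs))
print("4-unit request satisfied:", allocate(pool, "new", 4) is not None)
```

The total free space is well above the request size, yet no single run
is large enough, which is the failure mode suspected here.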

I wonder if this is fully 'solvable' per se - with memory overcommit it
should be easy to simply run out of CMA space - but it should be
possible to at least print much more helpful information either from the
kernel or from qemu.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New

Bug description:
  Per an email forwarded within IBM, we wish to use this Launchpad bug
  to work on the technical discussion with the Canonical development
  folks and the IBM KVM and kernel team surrounding the analysis made by
  Daniel Axtens of Canonical for the customer issue raised in Case
  #00177825.

  The only statement at the moment by the KVM team was that there were
  various issues associated with CMA fragmentation causing issues with
  KVM guests. However, as mentioned, this bug is to allow the dialog
  amongst all the developers to see what can be done to help alleviate
  the situation or understand the root cause further.

  Please also note that we should not be attaching customer data to this
  bug. If that is necessary then we expect Canonical to help provide a
  controlled environment for reviewing that data so we avoid any privacy
  issues (e.g. for GDPR compliance).

  Here is the email from Daniel:

  I have looked at the sosreport you uploaded. Here is my analysis so
  far.

  Virtualisation on powerpc has some special requirements. To start a
  guest on a powerpc host, you need to allocate a contiguous area of
  memory to hold the guest's hash page table (HPT, or HTAB, depending on
  which document you look at). The HPT is required to track and manage
  guest memory.

  Your error reports show qemu asking the kernel to allocate an HTAB,
  and the kernel reporting that it had insufficient memory to do so. The
  required memory for the HPT scales with the guest memory size - it
  should be about 1/128th of guest memory, so for a 16GB guest, that's
  128MB. However, the HPT has to be allocated as a single contiguous
  memory region. (This is in contrast to regular guest memory, which is
  not required to be contiguous from the host point of view.)
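
  As a rough illustration of the sizing rule above, here is a minimal
  Python sketch. The function name hpt_size_bytes is made up for this
  example, and the 1/128 ratio is only the approximation quoted above,
  not the kernel's exact calculation.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def hpt_size_bytes(guest_mem_bytes: int) -> int:
    """Approximate HPT size using the ~1/128th rule of thumb (an approximation)."""
    return guest_mem_bytes // 128


if __name__ == "__main__":
    for guest_gib in (4, 16, 64):
        size_mib = hpt_size_bytes(guest_gib * GIB) // MIB
        print(f"{guest_gib} GiB guest -> ~{size_mib} MiB HPT")
```

  The point of the sketch is only that the HPT grows linearly with guest
  memory, so larger guests need larger contiguous regions.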

  The kernel keeps a special contiguous memory area (CMA) for these
  purposes, and keeps track of the total amounts in use and still
  available. These are shown in /proc/meminfo. From the system that ran
  the sosreport, we see:

  CmaTotal: 26853376 kB
  CmaFree: 4024448 kB

  So there is a total of about 25GB of CMA, of which about 3.8GB remain.
  This is obviously more than 128MB, which leaves two possible explanations:

  - It's very possible that between the error and the sosreport, more
  contiguous memory became available. This would match the intermittent
  nature of the issue.

  - It also might be that the failure was due to fragmentation of memory
  in the CMA pool. That is, there might be more than 128MB, but it might
  all be in chunks that are smaller than 128MB, or which don't have the
  required alignment for a HPT.
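
  To make the fragmentation case concrete, here is a small Python sketch.
  The free-extent list is entirely hypothetical (it is not taken from the
  sosreport), and it assumes the HPT must be aligned to its own size, which
  is an assumption made for illustration only.

```python
MIB = 1024 ** 2


def can_allocate(free_extents, size, align):
    """Return True if some free extent can hold `size` bytes at an `align`-aligned start."""
    for start, length in free_extents:
        aligned_start = -(-start // align) * align  # round start up to the alignment
        if aligned_start + size <= start + length:
            return True
    return False


# Hypothetical fragmented pool: 40 free chunks of 100 MiB, each preceded by
# 28 MiB that is still in use.
free_extents = [(i * 128 * MIB + 28 * MIB, 100 * MIB) for i in range(40)]
total_free_mib = sum(length for _, length in free_extents) // MIB

if __name__ == "__main__":
    print(f"total free: {total_free_mib} MiB")
    print("128 MiB HPT fits:", can_allocate(free_extents, 128 * MIB, 128 * MIB))
    print("64 MiB HPT fits:", can_allocate(free_extents, 64 * MIB, 64 * MIB))
```

  In this made-up pool there is far more than 128MB free in total, yet no
  single chunk can hold an aligned 128MB allocation, while a smaller HPT
  would still fit.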

  Given that the system's uptime was 112 days when the sosreport was
  generated, it would be unsurprising if fragmentation had occurred!
  (Relatedly - you're running 4.4.0-109, which does not have the Spectre
  and Meltdown fixes.)

  This issue has come up before - both in a public Canonical-IBM
  synchronised bug report[1], and with Red Hat[2]. It appears that there
  is some work within IBM to address this, but it seems to have stalled.
  I will get in touch with the IBM powerpc kernel team on their public
  mailing list and ask about the status. I will keep you updated.

  In the meantime, I have a potential solution/workaround. By default,
  5% of memory is reserved for CMA (kernel source:
  arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can
  increase this with a boot parameter, so for example to reserve 10%,
  you could boot with kvm_cma_resv_ratio=10. This can be set in
  petitboot. This should significantly reduce the incidence of this
  issue - perhaps eliminating it entirely - at the cost of locking away
  more of the system's memory. You would need to experiment to determine
  the optimal value. Perhaps given that you are seeing the problem only
  intermittently, a ratio of 7% would be sufficient - that would give
  you ~35GB of CMA.
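
  The arithmetic behind the ~35GB figure can be sketched in Python as
  follows. This assumes the reservation is a simple percentage of host
  memory and works backwards from the CmaTotal value above; the kernel's
  real calculation may round or align differently, and the function name
  is invented for this example.

```python
KB_PER_GIB = 1024 * 1024

CMA_TOTAL_KB = 26853376   # CmaTotal from the sosreport quoted above
DEFAULT_RATIO = 5         # default kvm_cma_resv_ratio, per the text


def estimate_cma_kb(mem_total_kb: int, ratio_percent: int) -> int:
    """Estimate the CMA reservation as a plain percentage of host memory."""
    return mem_total_kb * ratio_percent // 100


# Assumption: infer host memory from CmaTotal and the default 5% ratio.
est_mem_total_kb = CMA_TOTAL_KB * 100 // DEFAULT_RATIO

if __name__ == "__main__":
    for ratio in (5, 7, 10):
        gib = estimate_cma_kb(est_mem_total_kb, ratio) / KB_PER_GIB
        print(f"kvm_cma_resv_ratio={ratio}: ~{gib:.1f} GiB of CMA")
```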

  Please let me know if testing this setting would be an opt

[Kernel-packages] [Bug 1777029] Re: fscache: Fix hanging wait on page discarded by writeback

2018-06-19 Thread Daniel Axtens
** Changed in: linux (Ubuntu)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Trusty)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Xenial)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Artful)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1777029

Title:
  fscache: Fix hanging wait on page discarded by writeback

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Committed
Status in linux source package in Bionic:
  Fix Committed

Bug description:
  == SRU Justification ==

  [Impact]
  Under heavy NFS + FSCache load, a user sometimes observes a hang in
  __fscache_wait_on_page_write+0x5f/0xa0.

  Example traces:
  [] __fscache_wait_on_page_write+0x5f/0xa0 [fscache]
  [] __fscache_uncache_all_inode_pages+0xba/0x120 [fscache]
  [] nfs_fscache_open_file+0x4e/0xc0 [nfs]

  [] __fscache_wait_on_page_write+0x5f/0xa0 [fscache]
  [] __nfs_fscache_invalidate_page+0x2c/0x80 [nfs]
  [] nfs_invalidate_page+0x63/0x90 [nfs]
  [] truncate_inode_page+0x80/0x90

  [Fix]
  Cherry-pick 2c98425720233ae3e135add0c7e869b32913502f from upstream, which
  is a patch from the FSCache maintainer.

  [Testcase]
  The user has run an NFS stress test with a similar home-grown patch, and
  will run a stress test on the proposed kernel.

  [Regression Potential]
  Patch is limited to FSCache, so regression potential is limited.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777029/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1774336] [NEW] FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

2018-05-31 Thread Daniel Axtens
Public bug reported:

== SRU Justification ==

[Impact]
Oops during heavy NFS + FSCache use:

[81738.886634] FS-Cache: 
[81738.888281] FS-Cache: Assertion failed
[81738.889461] FS-Cache: 6 == 5 is false
[81738.890625] [ cut here ]
[81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

6 == 5 represents an operation being DEAD when it was not expected to
be.

[Cause]
There is a race in fscache and cachefiles. 

One thread is in cachefiles_read_waiter:
 1) object->work_lock is taken.
 2) the operation is added to the to_do list.
 3) the work lock is dropped.
 4) fscache_enqueue_retrieval is called, which takes a reference.

Another thread is in cachefiles_read_copier:
 1) object->work_lock is taken.
 2) an item is popped off the to_do list.
 3) object->work_lock is dropped.
 4) some processing is done on the item, and fscache_put_retrieval() is
    called, dropping a reference.

Now if this sequence in cachefiles_read_copier takes place *between*
steps 3 and 4 in cachefiles_read_waiter, a reference will be dropped
before it is taken. The object's reference count then hits zero, so
lifecycle events for the object happen too soon, leading to the
assertion failure later on.

(This is simplified and clarified from the original upstream analysis
for this patch at
https://www.redhat.com/archives/linux-cachefs/2018-February/msg1.html
and from a similar patch with a different approach to fixing the bug at
https://www.redhat.com/archives/linux-cachefs/2017-June/msg2.html)

[Fix]
Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter. This 
means that the object cannot be popped off the to_do list until it is in a 
fully consistent state with the reference taken.
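
For anyone who wants to see the ordering problem in isolation, here is a
toy Python model of it. It is not the kernel code: the Operation class and
dies_early() are invented for this sketch, it replays one fixed interleaving
rather than using real threads, and it assumes exactly one reference is
already held when the waiter runs.

```python
class Operation:
    """Toy stand-in for a retrieval operation with a reference count."""

    def __init__(self):
        self.refs = 1        # assumption: one reference is held before the waiter runs
        self.dead = False

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.dead = True  # lifecycle teardown would begin at this point


def dies_early(enqueue_under_lock: bool) -> bool:
    """Replay the interleaving and report whether the count hit zero in the copier."""
    op = Operation()
    to_do = []

    # Waiter, steps 1-3: the operation goes onto the to_do list under the lock.
    to_do.append(op)
    if enqueue_under_lock:
        op.get()              # fixed ordering: reference taken before the lock is dropped

    # Copier runs in the window: pop the item, process it, drop a reference.
    item = to_do.pop(0)
    item.put()
    died = item.dead

    if not enqueue_under_lock:
        op.get()              # buggy ordering: step 4 only happens after the put
    return died


if __name__ == "__main__":
    print("enqueue after unlock, dies early:", dies_early(False))
    print("enqueue under lock, dies early:", dies_early(True))
```

With the reference taken before the item becomes visible to the copier, the
count never touches zero in this model.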

[Testcase]
A user has run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles. 
 - The change makes things more conservative (doing more under lock) so
   that's reassuring.
 - There may be performance impacts but none have been observed so far.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

** Changed in: linux (Ubuntu)
   Status: New => Confirmed

** Changed in: linux (Ubuntu)
 Assignee: (unassigned) => Daniel Axtens (daxtens)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1774336

Title:
  FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  == SRU Justification ==

  [Impact]
  Oops during heavy NFS + FSCache use:

  [81738.886634] FS-Cache: 
  [81738.888281] FS-Cache: Assertion failed
  [81738.889461] FS-Cache: 6 == 5 is false
  [81738.890625] [ cut here ]
  [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

  6 == 5 represents an operation being DEAD when it was not expected to
  be.

  [Cause]
  There is a race in fscache and cachefiles. 

  One thread is in cachefiles_read_waiter:
   1) object->work_lock is taken.
   2) the operation is added to the to_do list.
   3) the work lock is dropped.
   4) fscache_enqueue_retrieval is called, which takes a reference.

  Another thread is in cachefiles_read_copier:
   1) object->work_lock is taken.
   2) an item is popped off the to_do list.
   3) object->work_lock is dropped.
   4) some processing is done on the item, and fscache_put_retrieval() is
      called, dropping a reference.

  Now if this sequence in cachefiles_read_copier takes place *between*
  steps 3 and 4 in cachefiles_read_waiter, a reference will be dropped
  before it is taken. The object's reference count then hits zero, so
  lifecycle events for the object happen too soon, leading to the
  assertion failure later on.

  (This is simplified and clarified from the original upstream analysis
  for this patch at
  https://www.redhat.com/archives/linux-cachefs/2018-February/msg1.html
  and from a similar patch with a different approach to fixing the bug at
  https://www.redhat.com/archives/linux-cachefs/2017-June/msg2.html)

  [Fix]
  Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter.
  This means that the object cannot be popped off the to_do list until it is
  in a fully consistent state with the reference taken.

  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.

  [Regression Potential]
   - Limited to fscache/cachefiles. 
   - The change makes things more conservative (doing more under lock) so
     that's reassuring.
   - There may be performance impacts but none have been observed so far.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1774336/+subscriptions

-- 
Mailing lis

[Kernel-packages] [Bug 1742658] Re: linux-generic-hwe-16.04 OOPS in nouveau after security update

2018-05-22 Thread Daniel Axtens
Hi,

I have a report from another user reporting this. I will submit it to
the kernel team.

Regards,
Daniel

** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1742658

Title:
  linux-generic-hwe-16.04 OOPS in nouveau after security update

Status in linux package in Ubuntu:
  Confirmed
Status in linux-hwe package in Ubuntu:
  New
Status in linux-hwe-edge package in Ubuntu:
  New
Status in linux-meta-hwe package in Ubuntu:
  New
Status in linux-meta-hwe-edge package in Ubuntu:
  New

Bug description:
  Description:  Ubuntu 16.04.3 LTS
  Release:  16.04

  After upgrading linux-generic-hwe-16.04 to 4.13.0.26.46 I get a black
  screen with nouveau.
  Previously I was running 4.10.0-42-generic, and that kernel still works fine.

  Here is the OOPS:

  Jan 11 09:39:18 edvin-tower kernel: [3.079986] [drm] Initialized nouveau 1.3.1 20120801 for :02:00.0 on minor 0
  Jan 11 09:39:18 edvin-tower kernel: [3.100591] BUG: unable to handle kernel NULL pointer dereference at   (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100606] IP:   (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100610] PGD 0
  Jan 11 09:39:18 edvin-tower kernel: [3.100611] P4D 0
  Jan 11 09:39:18 edvin-tower kernel: [3.100615]
  Jan 11 09:39:18 edvin-tower kernel: [3.100620] Oops: 0010 [#1] SMP PTI
  Jan 11 09:39:18 edvin-tower kernel: [3.100623] Modules linked in: hid_generic usbhid hid nouveau mxm_wmi video i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect e1000e sysimgblt fb_sys_fops drm ptp ahci pps_core pata_acpi libahci wmi
  Jan 11 09:39:18 edvin-tower kernel: [3.100643] CPU: 4 PID: 238 Comm: kworker/u16:7 Not tainted 4.13.0-26-generic #29~16.04.2-Ubuntu
  Jan 11 09:39:18 edvin-tower kernel: [3.100649] Hardware name: Dell Inc. Precision Tower 5810/0K240Y, BIOS A05 12/16/2014
  Jan 11 09:39:18 edvin-tower kernel: [3.100688] Workqueue: nvkm-disp gf119_disp_super [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100694] task: 9d8982d25d00 task.stack: ac9ec2134000
  Jan 11 09:39:18 edvin-tower kernel: [3.100698] RIP: 0010:  (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100701] RSP: 0018:ac9ec2137bd8 EFLAGS: 00010206
  Jan 11 09:39:18 edvin-tower kernel: [3.100706] RAX: c0416f20 RBX:  RCX: 0016
  Jan 11 09:39:18 edvin-tower kernel: [3.100710] RDX:  RSI:  RDI: 9d898140d180
  Jan 11 09:39:18 edvin-tower kernel: [3.100715] RBP: ac9ec2137c70 R08:  R09:
  Jan 11 09:39:18 edvin-tower kernel: [3.100719] R10: 1000 R11:  R12:
  Jan 11 09:39:18 edvin-tower kernel: [3.100724] R13:  R14: ac9ec2137d00 R15: 9d898c542600
  Jan 11 09:39:18 edvin-tower kernel: [3.100728] FS:  () GS:9d899fd0() knlGS:
  Jan 11 09:39:18 edvin-tower kernel: [3.100733] CS:  0010 DS:  ES:  CR0: 80050033
  Jan 11 09:39:18 edvin-tower kernel: [3.100737] CR2:  CR3: 00029ac0a006 CR4: 001606e0
  Jan 11 09:39:18 edvin-tower kernel: [3.100742] Call Trace:
  Jan 11 09:39:18 edvin-tower kernel: [3.100771]  ? nvkm_dp_train_drive+0x214/0x300 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100798]  nvkm_dp_train+0x582/0x970 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100824]  nvkm_dp_acquire+0xd4/0x390 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100850]  nv50_disp_super_2_2+0x6d/0x430 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100872]  ? nvkm_devinit_pll_set+0xf/0x20 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100897]  gf119_disp_super+0x1b7/0x300 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100904]  ? __schedule+0x3ca/0x890
  Jan 11 09:39:18 edvin-tower kernel: [3.100911]  process_one_work+0x156/0x410
  Jan 11 09:39:18 edvin-tower kernel: [3.100915]  worker_thread+0x4b/0x460
  Jan 11 09:39:18 edvin-tower kernel: [3.100920]  kthread+0x109/0x140
  Jan 11 09:39:18 edvin-tower kernel: [3.100924]  ? process_one_work+0x410/0x410
  Jan 11 09:39:18 edvin-tower kernel: [3.100928]  ? kthread_create_on_node+0x70/0x70
  Jan 11 09:39:18 edvin-tower kernel: [3.100934]  ret_from_fork+0x1f/0x30
  Jan 11 09:39:18 edvin-tower kernel: [3.100938] Code:  Bad RIP value.
  Jan 11 09:39:18 edvin-tower kernel: [3.100944] RIP:   (null) RSP: ac9ec2137bd8
  Jan 11 09:39:18 edvin-tower kernel: [3.100948] CR2:
  Jan 11 09:39:18 edvin-tower kernel: [3.100952] ---[ end trace 93a79dae0d3ec749 ]---

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-generic-hwe-16.04 

[Kernel-packages] [Bug 1742658] Re: linux-generic-hwe-16.04 OOPS in nouveau after security update

2018-05-22 Thread Daniel Axtens
** Also affects: linux (Ubuntu)
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1742658

Title:
  linux-generic-hwe-16.04 OOPS in nouveau after security update

Status in linux package in Ubuntu:
  Confirmed
Status in linux-hwe package in Ubuntu:
  New
Status in linux-hwe-edge package in Ubuntu:
  New
Status in linux-meta-hwe package in Ubuntu:
  New
Status in linux-meta-hwe-edge package in Ubuntu:
  New

Bug description:
  Description:  Ubuntu 16.04.3 LTS
  Release:  16.04

  After upgrading linux-generic-hwe-16.04 to 4.13.0.26.46 I get a black
  screen with nouveau.
  Previously I was running 4.10.0-42-generic, and that kernel still works fine.

  Here is the OOPS:

  Jan 11 09:39:18 edvin-tower kernel: [3.079986] [drm] Initialized nouveau 1.3.1 20120801 for :02:00.0 on minor 0
  Jan 11 09:39:18 edvin-tower kernel: [3.100591] BUG: unable to handle kernel NULL pointer dereference at   (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100606] IP:   (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100610] PGD 0
  Jan 11 09:39:18 edvin-tower kernel: [3.100611] P4D 0
  Jan 11 09:39:18 edvin-tower kernel: [3.100615]
  Jan 11 09:39:18 edvin-tower kernel: [3.100620] Oops: 0010 [#1] SMP PTI
  Jan 11 09:39:18 edvin-tower kernel: [3.100623] Modules linked in: hid_generic usbhid hid nouveau mxm_wmi video i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect e1000e sysimgblt fb_sys_fops drm ptp ahci pps_core pata_acpi libahci wmi
  Jan 11 09:39:18 edvin-tower kernel: [3.100643] CPU: 4 PID: 238 Comm: kworker/u16:7 Not tainted 4.13.0-26-generic #29~16.04.2-Ubuntu
  Jan 11 09:39:18 edvin-tower kernel: [3.100649] Hardware name: Dell Inc. Precision Tower 5810/0K240Y, BIOS A05 12/16/2014
  Jan 11 09:39:18 edvin-tower kernel: [3.100688] Workqueue: nvkm-disp gf119_disp_super [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100694] task: 9d8982d25d00 task.stack: ac9ec2134000
  Jan 11 09:39:18 edvin-tower kernel: [3.100698] RIP: 0010:  (null)
  Jan 11 09:39:18 edvin-tower kernel: [3.100701] RSP: 0018:ac9ec2137bd8 EFLAGS: 00010206
  Jan 11 09:39:18 edvin-tower kernel: [3.100706] RAX: c0416f20 RBX:  RCX: 0016
  Jan 11 09:39:18 edvin-tower kernel: [3.100710] RDX:  RSI:  RDI: 9d898140d180
  Jan 11 09:39:18 edvin-tower kernel: [3.100715] RBP: ac9ec2137c70 R08:  R09:
  Jan 11 09:39:18 edvin-tower kernel: [3.100719] R10: 1000 R11:  R12:
  Jan 11 09:39:18 edvin-tower kernel: [3.100724] R13:  R14: ac9ec2137d00 R15: 9d898c542600
  Jan 11 09:39:18 edvin-tower kernel: [3.100728] FS:  () GS:9d899fd0() knlGS:
  Jan 11 09:39:18 edvin-tower kernel: [3.100733] CS:  0010 DS:  ES:  CR0: 80050033
  Jan 11 09:39:18 edvin-tower kernel: [3.100737] CR2:  CR3: 00029ac0a006 CR4: 001606e0
  Jan 11 09:39:18 edvin-tower kernel: [3.100742] Call Trace:
  Jan 11 09:39:18 edvin-tower kernel: [3.100771]  ? nvkm_dp_train_drive+0x214/0x300 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100798]  nvkm_dp_train+0x582/0x970 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100824]  nvkm_dp_acquire+0xd4/0x390 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100850]  nv50_disp_super_2_2+0x6d/0x430 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100872]  ? nvkm_devinit_pll_set+0xf/0x20 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100897]  gf119_disp_super+0x1b7/0x300 [nouveau]
  Jan 11 09:39:18 edvin-tower kernel: [3.100904]  ? __schedule+0x3ca/0x890
  Jan 11 09:39:18 edvin-tower kernel: [3.100911]  process_one_work+0x156/0x410
  Jan 11 09:39:18 edvin-tower kernel: [3.100915]  worker_thread+0x4b/0x460
  Jan 11 09:39:18 edvin-tower kernel: [3.100920]  kthread+0x109/0x140
  Jan 11 09:39:18 edvin-tower kernel: [3.100924]  ? process_one_work+0x410/0x410
  Jan 11 09:39:18 edvin-tower kernel: [3.100928]  ? kthread_create_on_node+0x70/0x70
  Jan 11 09:39:18 edvin-tower kernel: [3.100934]  ret_from_fork+0x1f/0x30
  Jan 11 09:39:18 edvin-tower kernel: [3.100938] Code:  Bad RIP value.
  Jan 11 09:39:18 edvin-tower kernel: [3.100944] RIP:   (null) RSP: ac9ec2137bd8
  Jan 11 09:39:18 edvin-tower kernel: [3.100948] CR2:
  Jan 11 09:39:18 edvin-tower kernel: [3.100952] ---[ end trace 93a79dae0d3ec749 ]---

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-generic-hwe-16.04 4.13.0.26.46
  ProcVersionSignature: Ubuntu 4.10.0-42.46~16.04.1-generic 4.10.17
  Uname: Linux 

[Kernel-packages] [Bug 1750038] Re: user space process hung in 'D' state waiting for disk io to complete

2018-05-07 Thread Daniel Axtens
** Description changed:

+ == SRU Justification ==
+ 
+ [Impact]
+ Occasionally an application gets stuck in "D" state on NFS reads/sync and
+ close system calls. All the subsequent operations on the NFS mounts are stuck
+ and reboot is required to rectify the situation.
+ 
+ [Fix]
+ Use GFP_NOIO in some allocations in writeback to avoid a deadlock. This is
+ upstream in:
+ ae97aa524ef4 ("NFS: Use GFP_NOIO for two allocations in writeback")
+ 
+ [Testcase]
+ See Test scenario in previous description.
+ 
+ A test kernel with this patch was tested heavily (>100hrs of test suite)
+ without issue.
+ 
+ [Regression Potential]
+ This changes memory allocation in NFS to use a different policy. This could
+ potentially affect NFS.
+ 
+ However, the patch is already in Artful and Bionic without issue.
+ 
+ The patch does not apply to Trusty.
+ 
+ == Previous Description ==
+ 
  Using Ubuntu Xenial user reports processes hang in D state waiting for
  disk io.
  
  Occasionally one of the applications gets into "D" state on NFS
  reads/sync and close system calls. Based on the kernel backtraces, it seems
  to be stuck in kmalloc allocation during cleanup of dirty NFS pages.
  
  All the subsequent operations on the NFS mounts are stuck and reboot is
  required to rectify the situation.
  
  [Test scenario]
  
- 1) Applications running in Docker environment 
- 2) Application have cgroup limits --cpu-shares --memory -shm-limit 
- 3) python and C++ based applications (torch and caffe) 
- 4) Applications read big lmdb files and write results to NFS shares 
- 5) use NFS v3 , hard and fscache is enabled 
- 6) now swap space is configured 
+ 1) Applications running in Docker environment
+ 2) Application have cgroup limits --cpu-shares --memory -shm-limit
+ 3) python and C++ based applications (torch and caffe)
+ 4) Applications read big lmdb files and write results to NFS shares
+ 5) use NFS v3 , hard and fscache is enabled
+ 6) now swap space is configured
  
  This prevents all other I/O activity on that mount to hang.
  
  we are running into this issue more frequently and identified few
  applications causing this problem.
  
  As updated in the description, the problem seems to be happening when
  exercising the stack
  
  try_to_free_mem_cgroup_pages+0xba/0x1a0
  
  we see this with docker containers with cgroup option --memory
  .
  
  whenever there is a deadlock, we see that the process that is hung has
  reached the maximum cgroup limit, multiple times and typically cleans up
  dirty data and caches to bring the usage under the limit.
  
  This reclaim path happens many times and finally we hit probably a race
  get into deadlock

** Changed in: linux (Ubuntu)
 Assignee: Dragan S. (dragan-s) => Daniel Axtens (daxtens)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1750038

Title:
  user space process hung in 'D' state waiting for disk io to complete

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  == SRU Justification ==

  [Impact]
  Occasionally an application gets stuck in "D" state on NFS reads/sync and
  close system calls. All the subsequent operations on the NFS mounts are
  stuck and reboot is required to rectify the situation.

  [Fix]
  Use GFP_NOIO in some allocations in writeback to avoid a deadlock. This is
  upstream in:
  ae97aa524ef4 ("NFS: Use GFP_NOIO for two allocations in writeback")

  [Testcase]
  See Test scenario in previous description.

  A test kernel with this patch was tested heavily (>100hrs of test
  suite) without issue.

  [Regression Potential]
  This changes memory allocation in NFS to use a different policy. This
  could potentially affect NFS.

  However, the patch is already in Artful and Bionic without issue.

  The patch does not apply to Trusty.

  == Previous Description ==

  Using Ubuntu Xenial user reports processes hang in D state waiting for
  disk io.

  Occasionally one of the applications gets into "D" state on NFS
  reads/sync and close system calls. Based on the kernel backtraces, it
  seems to be stuck in kmalloc allocation during cleanup of dirty NFS
  pages.

  All the subsequent operations on the NFS mounts are stuck and reboot
  is required to rectify the situation.

  [Test scenario]

  1) Applications running in Docker environment
  2) Application have cgroup limits --cpu-shares --memory -shm-limit
  3) python and C++ based applications (torch and caffe)
  4) Applications read big lmdb files and write results to NFS shares
  5) use NFS v3 , hard and fscache is enabled
  6) now swap space is configured

  This prevents all other I/O activity on that mount to hang.

  we are running into this issue more frequently and identified few
  applications causing this problem.

  As updated in the description, the problem seems to be happening when
  exercising the stack

[Kernel-packages] [Bug 1764246] [NEW] kdump kernel panics on Bionic

2018-04-15 Thread Daniel Axtens
Public bug reported:

The kdump/crashdump kernel is panicking during boot on Bionic.

1) Install the daily Bionic server or desktop ISO
2) apt install linux-crashdump, say yes to kdump being enabled
3) Reboot so as to boot with the correct kernel parameter
4) Run:
root@bionic-server:~# echo 1 > /proc/sys/kernel/sysrq 
root@bionic-server:~# echo c > /proc/sysrq-trigger 
5) Observe that the crashdump kernel panics before booting with an
   out-of-memory error. Log below.

If I replace the bionic image with the artful cloud image, and repeat
steps 2-4, the crashdump kernel boots and successfully stores the
vmcore.

The full log:

[   54.424512] sysrq: SysRq : Trigger a crash
[   54.427899] BUG: unable to handle kernel NULL pointer dereference at
[   54.433915] IP: sysrq_handle_crash+0x16/0x20
[   54.437157] PGD 0 P4D 0 
[   54.439292] Oops: 0002 [#1] SMP PTI
[   54.444571] Modules linked in: snd_hda_codec_generic crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec snd_hda_core snd_hwdep input_leds joydev snd_pcm serio_raw snd_timer snd soundcore qemu_fw_cfg mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid qxl aesni_intel ttm drm_kms_helper aes_x86_64 crypto_simd cryptd glue_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse virtio_blk virtio_net drm i2c_piix4 pata_acpi floppy
[   54.468925] CPU: 0 PID: 1075 Comm: bash Not tainted 4.15.0-15-generic #16-Ubuntu
[   54.470377] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   54.472016] RIP: 0010:sysrq_handle_crash+0x16/0x20
[   54.472891] RSP: 0018:a3a000643e30 EFLAGS: 00010286
[   54.473826] RAX: 917e0950 RBX: 92787200 RCX: 
[   54.475092] RDX:  RSI: 90dfbfc16498 RDI: 0063
[   54.476182] RBP: a3a000643e30 R08:  R09: 022b
[   54.477272] R10: 0001 R11: 92b5280d R12: 0004
[   54.478361] R13: 0063 R14: 0002 R15: 90dfbab0ef00
[   54.479456] FS:  7ff6248af740() GS:90dfbfc0() knlGS:
[   54.480602] CS:  0010 DS:  ES:  CR0: 80050033
[   54.481379] CR2:  CR3: 7bcb2006 CR4: 003606f0
[   54.482341] Call Trace:
[   54.482690]  __handle_sysrq+0x9f/0x170
[   54.483207]  write_sysrq_trigger+0x34/0x40
[   54.483775]  proc_reg_write+0x45/0x70
[   54.484281]  __vfs_write+0x1b/0x40
[   54.484751]  vfs_write+0xb1/0x1a0
[   54.485208]  SyS_write+0x55/0xc0
[   54.485669]  do_syscall_64+0x73/0x130
[   54.486156]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   54.486784] RIP: 0033:0x7ff623f84154
[   54.487328] RSP: 002b:7ffe5f399678 EFLAGS: 0246 ORIG_RAX: 0001
[   54.488272] RAX: ffda RBX: 0002 RCX: 7ff623f84154
[   54.489179] RDX: 0002 RSI: 55a5a49151c0 RDI: 0001
[   54.490313] RBP: 55a5a49151c0 R08: 000a R09: 0001
[   54.491179] R10: 000a R11: 0246 R12: 7ff624260760
[   54.492311] R13: 0002 R14: 7ff62425c2a0 R15: 7ff62425b760
[   54.493124] Code: e7 e8 9f fb ff ff e9 c0 fe ff ff 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 c7 05 a8 a7 36 01 01 00 00 00 48 89 e5 0f ae f8  04 25 00 00 00 00 01 5d c3 0f 1f 44 00 00 55 c7 05 40 1f e8
[   54.496461] RIP: sysrq_handle_crash+0x16/0x20 RSP: a3a000643e30
[   54.497393] CR2: 
[0.00] Linux version 4.15.0-15-generic (buildd@lgw01-amd64-050) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:58:14 UTC 2018 (Ubuntu 4.15.0-15.16-generic 4.15.15)
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-15-generic root=UUID=3e45b7ec-412a-11e8-a844-5254003896a5 ro maybe-ubiquity console=ttyS0 nr_cpus=1 systemd.unit=kdump-tools.service irqpoll nousb ata_piix.prefer_ms_hyperv=0 elfcorehdr=802164K
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x1000-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x2900-0x30f5cfff] 

[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-04-03 Thread Daniel Axtens
Hi,

We do ship an iso for ppc64le for Trusty - I'm not sure whether it does
bare metal/PowerNV or just as an LPAR under PowerVM, but it's probably a
bit moot at this point.

The good news is that, as you can see, the Artful kernel was released with
the fix. The Xenial kernel also contains the fix; I'm not sure why that
wasn't auto-added to this bug, but the release notes mention it:
https://launchpad.net/ubuntu/+source/linux/4.4.0-119.143

I am not sure what the final status of the patch in Trusty is; I will
let you know when I find out.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Released

Bug description:
  SRU Justification
  =================

  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.

  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...

  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol
  is 'special' - communication between LPARs on the same chassis can use
  very large (64k) frames to reduce overhead. Normal networks cannot
  handle such large packets, so traditionally, the VIOS partition would
  signal to the AIX partitions that it was 'special', and AIX would send
  regular, ethernet-sized packets to VIOS, which VIOS would then send
  out.

  This signalling between VIOS and AIX is done in a way that is not
  standards-compliant, and so was never made part of Linux. Instead, the
  Linux driver has always understood large frames and passed them up the
  network stack.

  In some cases (e.g. with TCP), multiple TCP segments are coalesced
  into one large packet. In Linux, this goes through the generic receive
  offload (GRO) code, using a similar mechanism to GSO. These segments
  can be very large, which presents as a very large MSS (maximum segment
  size) or gso_size.

  Normally, the large packet is simply passed to whatever network
  application on Linux is going to consume it, and everything is OK.

  However, in this case, the packets go through Open vSwitch, and are
  then passed to the bnx2x driver. The bnx2x driver/hardware supports
  TSO and GSO, but with a restriction: the maximum segment size is
  limited to around 9700 bytes. Normally this is more than adequate.
  However, if a large packet with very large (>9700 byte) TCP segments
  arrives through ibmveth, and is passed to bnx2x, the hardware will
  panic.

  [Impact]

  bnx2x card panics, requiring power cycle to restore functionality.

  The workaround is turning off TSO, which prevents the crash as the
  kernel resegments *all* packets in software, not just ones that are
  too big. This has a performance cost.

  [Fix]

  Test packet size in bnx2x feature check path and disable GSO if it is
  too large. To do this we move a function from one file to another and
  add another in the networking core.
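
  The per-packet check described above can be sketched roughly as
  follows. This is a user-space illustration under stated assumptions,
  not the actual patch: struct pkt, FEATURE_CSUM, FEATURE_GSO,
  HW_MAX_GSO_BYTES and features_check() are hypothetical stand-ins for
  the kernel's packet buffer, device feature flags and the driver's
  feature-check hook, and 9700 is only the approximate limit quoted
  above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
#define FEATURE_CSUM     (1u << 0)  /* device computes checksums */
#define FEATURE_GSO      (1u << 1)  /* device segments large packets */
#define HW_MAX_GSO_BYTES 9700u      /* approximate limit quoted in this bug */

struct pkt {
    bool     is_gso;    /* packet carries several coalesced segments */
    uint32_t gso_size;  /* payload bytes per segment (the MSS) */
    uint32_t hdr_len;   /* header bytes prepended to each segment */
};

/*
 * Return the features the device may use for this one packet. If a
 * segment plus its headers would exceed the hardware limit, the GSO bit
 * is cleared so that the stack segments this packet in software.
 */
uint32_t features_check(const struct pkt *p, uint32_t features)
{
    assert(p != NULL);

    /* Widen before adding so that huge values cannot wrap around. */
    if (p->is_gso &&
        (uint64_t)p->gso_size + p->hdr_len > HW_MAX_GSO_BYTES)
        features &= ~FEATURE_GSO;

    return features;
}
```

  With this shape, only oversized packets lose hardware segmentation;
  ordinary packets keep it, which is the difference from the TSO-off
  workaround described under [Impact].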

  [Regression Potential]

  A/B/X: The changes to the network core are easily reviewed. The changes to
  behaviour are limited to the bnx2x card driver.
  The most likely failure case is a false-positive on the size check, which
  would lead to a performance regression only.

  T: This also involves a different change to the networking core to add
  the old-style GSO checking, which is more invasive. However the
  changes are simple and easily reviewed.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715519/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-03-28 Thread Daniel Axtens
Fantastic!

If I understand correctly, that is sufficient for
verification-done-artful, so I am changing that over for you.

The one remaining kernel is Trusty 3.13. I am guessing your module
doesn't compile for that? If it doesn't, there probably isn't much point
in booting with just a virtual ethernet adaptor, as the change is
specifically to the bnx2x code.

Thanks again for your prompt testing efforts!

Regards,
Daniel

** Tags removed: verification-needed-artful
** Tags added: verification-done-artful

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Committed



[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-03-28 Thread Daniel Axtens
Hi,

Thanks for the Xenial test!

The kernel team process is that patches are always committed to the
most recent kernel first and then to older kernels, so that no one ends
up with a regression if they upgrade to a more recent kernel. So if it
is applied to Xenial it will be applied to Artful :) (FYI, it's already
in the kernel that will be in Bionic next month.)

I don't know what you're compiling with DKMS; are you able to do any
test at all without it? Just testing that the machine boots and that you
can ping someone would make me more comfortable.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Committed



[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-03-23 Thread Daniel Axtens
Hi,

As well as Po-Hsu's comment above, I also have this internal update from
the kernel team:

As this is also a security fix, don't stress too much. If things
could be verified for at least one of the kernels by next week,
that is better than nothing. We are rather unlikely to rip out fixes if
they have a CVE (and do not appear to regress other things).

(FYI, this issue is covered by CVE-2018-126.)

So, just to confirm, in order of priority:

 1) First confirm there are no new regressions (just a quick 'smoke
test') on the 3 kernels.

 2) Second, do a full test of 1 of the kernels.

 3) Test the remaining 2 kernels.

Please keep this bug updated as you go through each step.

Hopefully this helps reduce the pressure for you!

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Committed


[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-03-22 Thread Daniel Axtens
Hi,

I am the support engineer on the Canonical side who has been working on
this with IBM Support on your behalf. Apologies for the confusion. I
will contact our kernel team now and get this clarified for you as soon
as I can.

Now, I can't speak for the kernel team or make any commitments on their
behalf. However, I know the full test process is time-consuming, so in
the meantime, if you are able to boot each of the kernels and quickly
verify that there are no obvious regressions (just that boot succeeds
and that the network card can still send and receive data), I think
that would be a very good first step.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Artful:
  Fix Committed



[Kernel-packages] [Bug 1745364] Re: x86/net/bpf: return statement missing value

2018-03-21 Thread Daniel Axtens
I have tested this with the kernel bpf self-test, and it passes.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1745364

Title:
  x86/net/bpf: return statement missing value

Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Xenial:
  Fix Committed

Bug description:
  SRU Justification
  =================

  Coverity reports:

  *** CID 1464330:  Uninitialized variables  (MISSING_RETURN)
  /arch/x86/net/bpf_jit_comp.c: 1088 in bpf_int_jit_compile()
  1082        int i;
  1083
  1084        if (!bpf_jit_enable)
  1085                return prog;
  1086
  1087        if (!prog || !prog->len)
  >>> CID 1464330:  Uninitialized variables  (MISSING_RETURN)
  >>> Arriving at the end of a function without returning a value.
  1088                return;
  1089
  1090        addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);
  1091        if (!addrs)
  1092                return prog;
  1093

  This is a result of 3098d8eae421 ("bpf: prepare
  bpf_int_jit_compile/bpf_prog_select_runtime apis"), which is a
  cherry-pick of d1c55ab5e41f upstream. In that patch, the return type of
  bpf_int_jit_compile was changed from void to struct bpf_prog*. That
  patch changed some of the return statements.

  It did not, however, change the return statement of the (!prog ||
  !prog->len) check, as in upstream the (!prog || !prog->len) check was
  dropped in 93a73d442d37 ("bpf, x86/arm64: remove useless checks on
  prog"):

  """
  There is never such a situation, where bpf_int_jit_compile() is
  called with either prog as NULL or len as 0, so the tests are
  unnecessary and confusing as people would just copy them.
  """

  However, we haven't picked up 93a73d442d37, so when we cherry-picked
  d1c55ab5e41f, that branch remained unmodified, hence the static
  analysis warning.

  Impact
  ======

  If the branch is not dead and someone can hit it, an undefined value
  can be returned, which could cause issues.

  Fix
  ===

  For consistency and in case the branch is not actually dead on Xenial,
  we should do a fixup to 'return prog;'
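
  A reduced illustration of the fixup, assuming a simplified function
  shape: struct prog, jit_enable and jit_compile() are hypothetical
  stand-ins for struct bpf_prog, bpf_jit_enable and
  bpf_int_jit_compile(), and the body is a placeholder rather than the
  real JIT. The only point is that the early exit now returns the
  pointer it was given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_prog; only what the sketch needs. */
struct prog {
    unsigned int len;    /* number of instructions */
    int          jited;  /* set once the program has been compiled */
};

int jit_enable = 1;  /* stand-in for the bpf_jit_enable switch */

/* Since the return type became a pointer, every exit must return one. */
struct prog *jit_compile(struct prog *prog)
{
    if (!jit_enable)
        return prog;

    if (!prog || !prog->len)
        return prog;  /* was a bare `return;` from the void version */

    prog->jited = 1;  /* placeholder for the real compilation work */
    return prog;
}

/* Usage: degenerate inputs come back unchanged rather than undefined. */
void example(void)
{
    struct prog empty = { 0, 0 };

    assert(jit_compile(NULL) == NULL);
    assert(jit_compile(&empty) == &empty && empty.jited == 0);
}
```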

  Regression Potential
  ====================

  Limited to the BPF jit which is off by default.
  Limited to a branch that should be dead code anyway.
  Limited to an error handling path.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1745364/+subscriptions



[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

2018-02-28 Thread Daniel Axtens
** Description changed:

  [SRU Justification]
  
  [Impact]
- On Artful kernels, X fails to start and a kernel splat is printed.
+ On Artful and Bionic kernels, X fails to start and a kernel splat is printed.
  
  This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is
  incomplete: the hisilicon hibmc driver does not contain the callback and
  so the kernel tries to execute code at NULL.
  
  [Fix]
- There is a discussion and potential fix at
- https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
- The fix hasn't landed yet and it looks like they're going to re-engineer
- the entire section instead.
  
- Rather than wait for that and deal with the massive regression
- potential, the fix I have picked to submit is very very minimal and
- touches only hibmc.
+ Bionic: There is a generic fix in 4.16 at
+ c67fa6edc8b11afe22c88a23963170bf5f151acf. It is part of a series that
+ applies this generic fix and does a bunch of cleanups; we can safely
+ just pick up the generic fix.
+ 
+ Artful: Rather than a generic fix, I have submitted a very very minimal
+ fix that only touches hibmc.
  
  [Regression Potential]
- Minimal - fix only touches hibmc driver. Tested on D05 board.
+ Artful: Minimal - fix only touches hibmc driver. Tested on D05 board.
+ Bionic: fix is to generic drm code, but is small and easily reviewable.
  
  [Testcase]
  Install patched kernel, try to start X. If it succeeds, the fix works. If
  there's a kernel splat, the fix does not work.
  
  [Notes]
- HiSilicon would really like this fix in Artful in such time so that when the
- next 16.04 point release ships in February, the HWE kernel will work with Xorg.
+ Artful: HiSilicon would really like this fix in Artful in such time so that
+ when the next 16.04 point release ships, the HWE kernel will work with Xorg.
+ 
+ Bionic: no extra notes.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add
  io_mem_pfn callback")

Status in Linux:
  New
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Artful:
  Fix Released

Bug description:
  [SRU Justification]

  [Impact]
  On Artful and Bionic kernels, X fails to start and a kernel splat is printed.

  This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is
  incomplete: the hisilicon hibmc driver does not contain the callback
  and so the kernel tries to execute code at NULL.

  [Fix]

  Bionic: There is a generic fix in 4.16 at
  c67fa6edc8b11afe22c88a23963170bf5f151acf. It is part of a series that
  applies this generic fix and does a bunch of cleanups; we can safely
  just pick up the generic fix.

  Artful: Rather than a generic fix, I have submitted a very very
  minimal fix that only touches hibmc.
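
  The two styles of fix can be contrasted with a small sketch. All names
  here (struct mem_obj, struct driver_ops, default_io_mem_pfn,
  core_io_mem_pfn, hibmc_like_ops) are hypothetical and do not match the
  real TTM API; the sketch only assumes that the callback is a function
  pointer in a per-driver table and that a sensible default exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer object: just enough state to compute a page frame. */
struct mem_obj {
    unsigned long base_pfn;
};

/* Hypothetical per-driver table; a driver may leave io_mem_pfn unset. */
struct driver_ops {
    unsigned long (*io_mem_pfn)(const struct mem_obj *bo,
                                unsigned long page_offset);
};

/* Default behaviour a driver gets when it has nothing special to do. */
unsigned long default_io_mem_pfn(const struct mem_obj *bo,
                                 unsigned long page_offset)
{
    return bo->base_pfn + page_offset;
}

/*
 * Generic style of fix: the core checks the pointer and falls back to
 * the default, so a driver that never set the callback cannot cause a
 * jump to NULL.
 */
unsigned long core_io_mem_pfn(const struct driver_ops *ops,
                              const struct mem_obj *bo,
                              unsigned long page_offset)
{
    assert(ops != NULL && bo != NULL);

    if (ops->io_mem_pfn)
        return ops->io_mem_pfn(bo, page_offset);
    return default_io_mem_pfn(bo, page_offset);
}

/*
 * Driver-only style of fix: the one affected driver fills in the
 * callback itself and the core is left untouched.
 */
const struct driver_ops hibmc_like_ops = {
    .io_mem_pfn = default_io_mem_pfn,
};
```

  The driver-only form changes behaviour for a single driver, which is
  why its regression potential is narrower; the generic form protects
  every driver that omits the callback.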

  [Regression Potential]
  Artful: Minimal - fix only touches hibmc driver. Tested on D05 board.
  Bionic: fix is to generic drm code, but is small and easily reviewable.

  [Testcase]
  Install patched kernel, try to start X. If it succeeds, the fix works. If
  there's a kernel splat, the fix does not work.

  [Notes]
  Artful: HiSilicon would really like this fix in Artful in time for the
  next 16.04 point release, so that the HWE kernel will work with Xorg.

  Bionic: no extra notes.

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/1738334/+subscriptions



[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

2018-02-28 Thread Daniel Axtens
Hi Fred,

Thanks for the update. I have tried to nominate the bug for Bionic; I
think the kernel team normally does this so we will see if that has
worked.

More importantly, I will test and send a patch for Bionic shortly.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add
  io_mem_pfn callback")

Status in Linux:
  New
Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Artful:
  Fix Released

Bug description:
  [SRU Justification]

  [Impact]
  On Artful kernels, X fails to start and a kernel splat is printed.

  This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is
  incomplete: the hisilicon hibmc driver does not contain the callback
  and so the kernel tries to execute code at NULL.

  [Fix]
  There is a discussion and potential fix at
  https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
  The fix hasn't landed yet and it looks like they're going to re-engineer
  the entire section instead.

  Rather than wait for that and deal with the massive regression
  potential, the fix I have picked to submit is very very minimal and
  touches only hibmc.

  [Regression Potential]
  Minimal - fix only touches hibmc driver. Tested on D05 board.

  [Testcase]
  Install patched kernel, try to start X. If it succeeds, the fix works. If
  there's a kernel splat, the fix does not work.

  [Notes]
  HiSilicon would really like this fix in Artful in time for the next 16.04
  point release in February, so that the HWE kernel will work with Xorg.

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/1738334/+subscriptions



[Kernel-packages] [Bug 1644056] Re: kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/huge_memory.c:1931!

2018-02-21 Thread Daniel Axtens
Hi all,

I think there are two issues at play here: one is the bad pmd issue, and
the other is the original "huge_memory: mapcount 0 page_mapcount 1".

Perhaps we could break the bad pmd issue out into a different LP bug?

People with the original bug - was anyone able to verify if this
happened on a more recent kernel? My understanding of mm/huge_memory.c
is that it was significantly refactored after 4.4, so I would be
interested to hear if that makes the issue go away.

I think disabling transparent huge pages on boot should also make the
issue go away if anyone is able to try that?

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1644056

Title:
  kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-
  xenial-4.4.0/mm/huge_memory.c:1931!

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Hi,

  While running IO on the following kernel/Ubuntu version:

  $ lsb_release -rd
  Description:    Ubuntu 14.04.5 LTS
  Release:        14.04

  $ uname -a
  Linux 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

  $ cat /etc/issue
  Ubuntu 14.04.5 LTS \n \l

  [1133672.985186] /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/pgtable-generic.c:33: bad pmd 881fd6790240(80004b8008e7)
  [1135572.440941] huge_memory: mapcount 0 page_mapcount 1
  [1135572.441607] [ cut here ]
  [1135572.442059] kernel BUG at /build/linux-lts-xenial-gUF4JR/linux-lts-xenial-4.4.0/mm/huge_memory.c:1931!
  [1135572.442571] invalid opcode:  [#1] SMP
  [1135572.443028] Modules linked in: intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm joydev input_leds sb_edac irqbypass 
crct10dif_pclmul crc32_pclmul aesni_intel edac_core aes_x86_64 lrw gf128mul 
glue_helper ablk_helper cryptd dm_multipath lpc_ich ipmi_ssif ipmi_devintf 
shpchp 8250_fintek mac_hid acpi_power_meter ipmi_si ipmi_msghandler iTCO_wdt 
iTCO_vendor_support raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ses enclosure 
raid1 ast ttm ixgbe hid_generic igb vxlan drm_kms_helper syscopyarea 
ip6_udp_tunnel usbhid dca sysfillrect udp_tunnel sysimgblt hid mxm_wmi 
fb_sys_fops mpt3sas ptp drm ahci raid_class pps_core libahci i2c_algo_bit mdio 
scsi_transport_sas fjes wmi
  [1135572.448909] CPU: 15 PID: 2018 Comm: sh Not tainted 4.4.0-31-generic 
#50~14.04.1-Ubuntu
  [1135572.450082] Hardware name: Quanta Computer Inc. 
X-100.Column.01/S2PC-MB(Dual 1G LOM), BIOS S2P_3B04.HGT02 09/21/2016
  [1135572.451346] task: 8814923ae040 ti: 882eba658000 task.ti: 
882eba658000
  [1135572.452494] RIP: 0010:[]  [] 
__split_huge_page+0x691/0x6d0
  [1135572.453580] RSP: 0018:882eba65b7e0  EFLAGS: 00010292
  [1135572.454589] RAX: 0027 RBX: ea00012e RCX: 

  [1135572.455742] RDX: 0001 RSI: 883fff3cdc78 RDI: 
883fff3cdc78
  [1135572.457040] RBP: 882eba65b860 R08:  R09: 
881fe93eaf00
  [1135572.458271] R10: 03ff R11: 0ac1 R12: 

  [1135572.459539] R13: 882eba65ba10 R14: ea00012e R15: 
ea00012e
  [1135572.460746] FS:  7fcf6b305740() GS:883fff3c() 
knlGS:
  [1135572.461972] CS:  0010 DS:  ES:  CR0: 80050033
  [1135572.463237] CR2: 558db9da0bb8 CR3: 0021667eb000 CR4: 
003406e0
  [1135572.464515] DR0:  DR1:  DR2: 

  [1135572.465831] DR3:  DR6: fffe0ff0 DR7: 
0400
  [1135572.467177] Stack:
  [1135572.468502]   882eba65b840 811bdfe9 
883fed2e51d0
  [1135572.469769]  811c67c8 883fee9bb760 0007f43c9000 
882eba65ba10
  [1135572.471040]  7b6d 81c72fa0 883ff17340f4 
ea00012e
  [1135572.472277] Call Trace:
  [1135572.473574]  [] ? rmap_walk+0x239/0x2d0
  [1135572.475079]  [] split_huge_page_to_list+0x67/0xd0
  [1135572.476473]  [] add_to_swap+0x57/0x70
  [1135572.477852]  [] shrink_page_list+0x62c/0x770
  [1135572.479246]  [] shrink_inactive_list+0x1e9/0x500
  [1135572.480688]  [] shrink_lruvec+0x58e/0x730
  [1135572.482077]  [] ? __queue_work+0x130/0x350
  [1135572.483615]  [] ? __queue_work+0x130/0x350
  [1135572.485078]  [] shrink_zone+0xdc/0x2c0
  [1135572.486661]  [] do_try_to_free_pages+0x164/0x440
  [1135572.488354]  [] ? throttle_direct_reclaim+0x8d/0x230
  [1135572.490068]  [] try_to_free_pages+0xb5/0x170
  [1135572.491535]  [] __alloc_pages_nodemask+0x597/0xac0
  [1135572.493287]  [] alloc_kmem_pages_node+0x4d/0xd0
  [1135572.495083]  [] copy_process+0x185/0x1c70
  [1135572.496792]  [] ? from_kgid_munged+0x12/0x20
  [1135572.498404]  [] ? cp_new_stat+0x13d/0x160
  [1135572.500116]  [] 

[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

2018-02-14 Thread Daniel Axtens
Hi,

I installed 4.13.0-35-generic from artful-proposed. The kernel boots and
X starts fine, so this has passed verification.

Regards,
Daniel

** Tags removed: verification-needed-artful
** Tags added: verification-done-artful

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add
  io_mem_pfn callback")

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Artful:
  Fix Committed

Bug description:
  [SRU Justification]

  [Impact]
  On Artful kernels, X fails to start and a kernel splat is printed.

  This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is
  incomplete: the hisilicon hibmc driver does not contain the callback
  and so the kernel tries to execute code at NULL.

  [Fix]
  There is a discussion and a potential fix at
  https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
  The fix hasn't landed yet, and it looks like they're going to
  re-engineer the entire section instead.

  Rather than wait for that and deal with the massive regression
  potential, the fix I have picked to submit is minimal and touches
  only hibmc.
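
  To illustrate the failure mode, here is a toy model in Python rather than
  kernel C. It assumes the core calls the driver's callback unconditionally,
  and that a driver-only fix amounts to filling in the missing callback with
  a default implementation. All names here (DriverOps, default_io_mem_pfn,
  core_lookup) are hypothetical and are not taken from the actual patch.

```python
class DriverOps:
    """Hypothetical stand-in for a driver's callback table."""

    def __init__(self, io_mem_pfn=None):
        # None plays the role of a NULL function pointer.
        self.io_mem_pfn = io_mem_pfn


def default_io_mem_pfn(base_pfn, page_offset):
    """Hypothetical default: the pfn is the base plus the page offset."""
    return base_pfn + page_offset


def core_lookup(ops, base_pfn, page_offset):
    # The core calls the callback without checking it first, which is
    # what goes wrong when a driver never filled it in.
    return ops.io_mem_pfn(base_pfn, page_offset)


broken = DriverOps()                              # callback left unset
fixed = DriverOps(io_mem_pfn=default_io_mem_pfn)  # driver-only fix
```

  Calling core_lookup(broken, ...) raises a TypeError, which is the Python
  analogue of jumping to NULL. core_lookup(fixed, ...) returns a value
  without any change to the core.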

  [Regression Potential]
  Minimal - fix only touches hibmc driver. Tested on D05 board.

  [Testcase]
  Install the patched kernel and try to start X. If it succeeds, the fix
  works. If there's a kernel splat, the fix does not work.

  [Notes]
  HiSilicon would really like this fix in Artful in time for the next
  16.04 point release in February, so that the HWE kernel works with
  Xorg.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions



[Kernel-packages] [Bug 1748342] Re: cgroup: remove cgroup directory leading kernel crash in kill_css

2018-02-11 Thread Daniel Axtens
Hi,

I'm happy to submit this patch to the kernel team, but I wanted to talk
about the kernel process and ask a question first.

The way this process usually works is:
 - a patch is submitted to the kernel team
 - the kernel team checks the patch and, if they are happy with it,
   applies it to the kernel
 - this is built into a "proposed" kernel
 - the bug is updated with the proposed kernel
 - someone, usually the bug reporter, must verify that the proposed
   kernel fixes the bug. There is usually a window of 5 working days
   to do this.
 - if the verification is done, the fix is included in the released
   kernel. If verification is not done, the patch is not included in
   the released kernel.

I am not able to do the verification. If the kernel team provides a
proposed kernel, are you or your customer able to verify it?

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1748342

Title:
  cgroup: remove cgroup directory leading kernel crash in kill_css

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  We got feedback from a customer that a CVM (cloud virtual machine)
  crashed when using kubelet to update container-service on Ubuntu
  Xenial. The logs are shown below.
  We found a patch (commit 33c35aa4817864e056fd772230b0c6b552e36ea2) in
  Linux mainline that does fix this bug, but ubuntu-xenial.git has not
  merged it yet.

  Do you guys have a plan for merging?

  --panic log-
  [2018-02-02 10:21:48][4397731.721563] BUG: unable to handle kernel paging 
request at 0001005c
  [2018-02-02 10:40:50][4397731.722666] IP: css_clear_dir+0x5/0x70
  [2018-02-02 10:40:50][4397731.723261] PGD a12b067 
  [2018-02-02 10:40:50][4397731.723261] PUD 0 
  [2018-02-02 10:40:50][4397731.723628] 
  [2018-02-02 10:40:50][4397731.724004] Oops:  [#1] SMP
  [2018-02-02 10:40:50][4397731.724004] Modules linked in: xt_statistic 
nf_conntrack_netlink ebt_ip ebtable_filter ebtables veth xt_set ip_set_hash_net 
ip_set nfnetlink xt_nat xt_recent xt_mark ipt_REJ[2018-02-02 10:40:50]ECT 
nf_reject_ipv4 xt_tcpudp xt_comment ipt_MASQUERADE nf_nat_masquerade_ipv4 
xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
xt_addrtype iptable_fil[2018-02-02 10:40:50]ter ip_tables xt_conntrack x_tables 
nf_nat nf_conntrack br_netfilter bridge stp llc aufs ppdev sb_edac edac_core 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev input_le[2018-02-02 
10:40:50]ds serio_raw parport_pc parport i2c_piix4 mac_hid ib_iser rdma_cm 
iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi autofs4 btrfs raid10 raid456 a[2018-02-02 
10:40:50]sync_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c raid1 raid0 multipath
  [2018-02-02 10:40:50][4397731.724004]  linear cirrus ttm drm_kms_helper 
syscopyarea sysfillrect sysimgblt aesni_intel fb_sys_fops aes_x86_64 
crypto_simd cryptd glue_helper psmouse virtio_blk virtio_n[2018-02-02 
10:40:50]et drm pata_acpi floppy
  [2018-02-02 10:40:50][4397731.724004] CPU: 0 PID: 23347 Comm: kubelet Not 
tainted 4.10.0-32-generic #36~16.04.1-Ubuntu
  [2018-02-02 10:40:50][4397731.724004] Hardware name: Bochs Bochs, BIOS Bochs 
01/01/2011
  [2018-02-02 10:40:50][4397731.724004] task: 92abde59 task.stack: 
baa94165c000
  [2018-02-02 10:40:50][4397731.724004] RIP: 0010:css_clear_dir+0x5/0x70
  [2018-02-02 10:40:50][4397731.724004] RSP: 0018:baa94165fe10 EFLAGS: 
00010206
  [2018-02-02 10:40:50][4397731.724004] RAX: 47fd40005d7b RBX: 
ffe8 RCX: 92abffc0fcec
  [2018-02-02 10:40:50][4397731.724004] RDX: 9b070800 RSI: 
0206 RDI: ffe8
  [2018-02-02 10:40:50][4397731.724004] RBP: baa94165fe20 R08: 
c8b18701 R09: 000180220017
  [2018-02-02 10:40:50][4397731.724004] R10: 92abc8b187f8 R11: 
92abf7751d00 R12: 92abd5601000
  [2018-02-02 10:40:50][4397731.724004] R13:  R14: 
92abd5601150 R15: 
  [2018-02-02 10:40:50][4397731.724004] FS:  7f6f92ffd700() 
GS:92abffc0() knlGS:
  [2018-02-02 10:40:50][4397731.724004] CS:  0010 DS:  ES:  CR0: 
80050033
  [2018-02-02 10:40:50][4397731.724004] CR2: 0001005c CR3: 
280cb000 CR4: 000406f0
  [2018-02-02 10:40:50][4397731.724004] Call Trace:
  [2018-02-02 10:40:50][4397731.724004]  ? kill_css+0x12/0x60
  [2018-02-02 10:40:50][4397731.724004]  cgroup_destroy_locked+0xa5/0xf0
  [2018-02-02 10:40:50][4397731.724004]  cgroup_rmdir+0x2c/0x90
  [2018-02-02 10:40:50][4397731.724004]  kernfs_iop_rmdir+0x4d/0x80
  [2018-02-02 10:40:50][4397731.724004]  vfs_rmdir+0xb4/0x130
  [2018-02-02 10:40:50][4397731.724004]  do_rmdir+0x1c7/0x1e0
  [2018-02-02 10:40:50][4397731.724004]  SyS_unlinkat+0x22/0x30
  [2018-02-02 10:40:50][4397731.724004]  

[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-02-08 Thread Daniel Axtens
** Description changed:

  SRU Justification
  =
  
  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.
  
  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...
  
  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol is
  'special' - communication between LPARs on the same chassis can use very
  large (64k) frames to reduce overhead. Normal networks cannot handle
  such large packets, so traditionally, the VIOS partition would signal to
  the AIX partitions that it was 'special', and AIX would send regular,
  ethernet-sized packets to VIOS, which VIOS would then send out.
  
  This signalling between VIOS and AIX is done in a way that is not
  standards-compliant, and so was never made part of Linux. Instead, the
  Linux driver has always understood large frames and passed them up the
  network stack.
  
  In some cases (e.g. with TCP), multiple TCP segments are coalesced into
  one large packet. In Linux, this goes through the generic receive
  offload code, using a similar mechanism to GSO. These segments can be
  very large which presents as a very large MSS (maximum segment size) or
  gso_size.
  
  Normally, the large packet is simply passed to whatever network
  application on Linux is going to consume it, and everything is OK.
  
  However, in this case, the packets go through Open vSwitch, and are then
  passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and
  GSO, but with a restriction: the maximum segment size is limited to
  around 9700 bytes. Normally this is more than adequate. However, if a
  large packet with very large (>9700 byte) TCP segments arrives through
  ibmveth, and is passed to bnx2x, the hardware will panic.
  
  [Impact]
  
  bnx2x card panics, requiring power cycle to restore functionality.
  
  The workaround is turning off TSO, which prevents the crash as the
  kernel resegments *all* packets in software, not just ones that are too
  big. This has a performance cost.
  
  [Fix]
  
  Test packet size in bnx2x feature check path and disable GSO if it is
- too large.
+ too large. To do this we move a function from one file to another and
+ add another in the networking core.
  
  [Regression Potential]
  
- Limited to bnx2x card driver.
+ A/B/X: The changes to the network core are easily reviewed. The changes
+ to behaviour are limited to the bnx2x card driver.
  The most likely failure case is a false-positive on the size check,
  which would lead to a performance regression only.
+ 
+ T: This also involves a different change to the networking core to add
+ the old-style GSO checking, which is more invasive. However the changes
+ are simple and easily reviewed.
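
The check described under [Fix] can be modelled roughly as follows. This is
an illustrative Python model, not the driver code: the flag values, the
function name and the exact threshold comparison are assumptions. Only the
"around 9700 bytes" limit comes from the description above.

```python
# Illustrative model only. These bit values are made up for the example
# and do not match the kernel's NETIF_F_* constants.
FEATURE_GSO = 1 << 0     # stand-in for the GSO/TSO feature bits
FEATURE_OTHER = 1 << 1   # stand-in for any unrelated feature bit

# Approximate hardware limit on segment size, taken from the description.
MAX_HW_SEGMENT = 9700


def features_check(gso_size, features):
    """Clear the GSO bit for packets whose segments the card cannot handle.

    With the bit cleared, the stack segments the packet in software
    instead of handing the oversized segments to the hardware.
    """
    if gso_size > MAX_HW_SEGMENT:
        return features & ~FEATURE_GSO
    return features
```

Only oversized packets lose offload here, whereas the workaround of turning
off TSO penalises every packet. A false positive in the comparison would
cost performance, not correctness, which matches the regression analysis
above.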

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  SRU Justification
  =

  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.

  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...

  Subsequent debugging reveals that the packets causing the issue 

[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-02-08 Thread Daniel Axtens
This has been assigned CVE-2018-1000026.

** CVE added: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2018-1000026

** Description changed:

  SRU Justification
  =
  
  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.
  
  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...
  
  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol is
  'special' - communication between LPARs on the same chassis can use very
  large (64k) frames to reduce overhead. Normal networks cannot handle
  such large packets, so traditionally, the VIOS partition would signal to
  the AIX partitions that it was 'special', and AIX would send regular,
  ethernet-sized packets to VIOS, which VIOS would then send out.
  
  This signalling between VIOS and AIX is done in a way that is not
  standards-compliant, and so was never made part of Linux. Instead, the
  Linux driver has always understood large frames and passed them up the
  network stack.
  
  In some cases (e.g. with TCP), multiple TCP segments are coalesced into
  one large packet. In Linux, this goes through the generic receive
  offload code, using a similar mechanism to GSO. These segments can be
  very large which presents as a very large MSS (maximum segment size) or
  gso_size.
  
  Normally, the large packet is simply passed to whatever network
  application on Linux is going to consume it, and everything is OK.
  
  However, in this case, the packets go through Open vSwitch, and are then
  passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and
  GSO, but with a restriction: the maximum segment size is limited to
  around 9700 bytes. Normally this is more than adequate. However, if a
  large packet with very large (>9700 byte) TCP segments arrives through
  ibmveth, and is passed to bnx2x, the hardware will panic.
  
- Impact
- --
+ [Impact]
  
  bnx2x card panics, requiring power cycle to restore functionality.
  
  The workaround is turning off TSO, which prevents the crash as the
  kernel resegments *all* packets in software, not just ones that are too
  big. This has a performance cost.
  
- Fix
- ---
+ [Fix]
  
  Test packet size in bnx2x feature check path and disable GSO if it is
  too large.
  
- Regression Potential
- 
+ [Regression Potential]
  
  Limited to bnx2x card driver.
  The most likely failure case is a false-positive on the size check,
  which would lead to a performance regression only.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  SRU Justification
  =

  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.

  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...

  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol
  is 'special' - communication between LPARs on the same chassis can use
  very large (64k) frames to reduce overhead. Normal networks cannot
  handle 

[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-02-05 Thread Daniel Axtens
** Description changed:

  SRU Justification
  =
  
  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.
  
  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...
  
  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol is
  'special' - communication between LPARs on the same chassis can use very
  large (64k) frames to reduce overhead. Normal networks cannot handle
  such large packets, so traditionally, the VIOS partition would signal to
  the AIX partitions that it was 'special', and AIX would send regular,
  ethernet-sized packets to VIOS, which VIOS would then send out.
  
  This signalling between VIOS and AIX is done in a way that is not
  standards-compliant, and so was never made part of Linux. Instead, the
  Linux driver has always understood large frames and passed them up the
  network stack.
  
  In some cases (e.g. with TCP), multiple TCP segments are coalesced into
  one large packet. In Linux, this goes through the generic receive
  offload code, using a similar mechanism to GSO. These segments can be
  very large which presents as a very large MSS (maximum segment size) or
  gso_size.
  
  Normally, the large packet is simply passed to whatever network
  application on Linux is going to consume it, and everything is OK.
  
  However, in this case, the packets go through Open vSwitch, and are then
  passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and
  GSO, but with a restriction: the maximum segment size is limited to
  around 9700 bytes. Normally this is more than adequate. However, if a
  large packet with very large (>9700 byte) TCP segments arrives through
  ibmveth, and is passed to bnx2x, the hardware will panic.
  
  Impact
  --
  
  bnx2x card panics, requiring power cycle to restore functionality.
  
  The workaround is turning off TSO, which prevents the crash as the
  kernel resegments *all* packets in software, not just ones that are too
  big. This has a performance cost.
  
- 
  Fix
  ---
  
- Test packet size in bnx2x feature check path.
+ Test packet size in bnx2x feature check path and disable GSO if it is
+ too large.
  
  Regression Potential
  
  
  Limited to bnx2x card driver.
  The most likely failure case is a false-positive on the size check,
  which would lead to a performance regression only.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  SRU Justification
  =

  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.

  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...

  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol
  is 'special' - communication between LPARs on the same chassis can use
  very large (64k) frames to reduce overhead. Normal networks cannot
  handle such large packets, so traditionally, the VIOS partition would
  signal to the AIX partitions that it was 

[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-02-05 Thread Daniel Axtens
** Description changed:

- (This bug provides a place to track the progress of this issue upstream
- and then in to Ubuntu.)
+ SRU Justification
+ =
  
  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.
  
  We see the following crash sometimes when running netperf:
- May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert! 
- May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2 
- May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052 
- May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1 
- May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert 
- May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump - 
+ May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
+ May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
+ May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 
0x25e42a7e 0x00462a38 0x00010052
+ May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
+ May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
+ May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: 
[bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...
  
  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol is
  'special' - communication between LPARs on the same chassis can use very
  large (64k) frames to reduce overhead. Normal networks cannot handle
  such large packets, so traditionally, the VIOS partition would signal to
  the AIX partitions that it was 'special', and AIX would send regular,
  ethernet-sized packets to VIOS, which VIOS would then send out.
  
  This signalling between VIOS and AIX is done in a way that is not
  standards-compliant, and so was never made part of Linux. Instead, the
  Linux driver has always understood large frames and passed them up the
  network stack.
  
  In some cases (e.g. with TCP), multiple TCP segments are coalesced into
  one large packet. In Linux, this goes through the generic receive
  offload code, using a similar mechanism to GSO. These segments can be
  very large which presents as a very large MSS (maximum segment size) or
  gso_size.
  
  Normally, the large packet is simply passed to whatever network
  application on Linux is going to consume it, and everything is OK.
  
  However, in this case, the packets go through Open vSwitch, and are then
  passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and
  GSO, but with a restriction: the maximum segment size is limited to
- around 9700 bytes. Normally this is more than adequate as jumbo frames
- are limited to 9000 bytes. However, if a large packet with large (>9700
- byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the
- hardware will panic.
+ around 9700 bytes. Normally this is more than adequate. However, if a
+ large packet with very large (>9700 byte) TCP segments arrives through
+ ibmveth, and is passed to bnx2x, the hardware will panic.
  
- Turning off TSO prevents the crash as the kernel resegments the data and
- assembles the packets in software. This has a performance cost.
+ Impact
+ --
  
- Clearly at the very least, bnx2x should not crash in this case.
+ bnx2x card panics, requiring power cycle to restore functionality.
  
- One patch to do this was sent upstream:
- https://www.spinics.net/lists/netdev/msg452932.html
+ The workaround is turning off TSO, which prevents the crash as the
+ kernel resegments *all* packets in software, not just ones that are too
+ big. This has a performance cost.
+ 
+ 
+ Fix
+ ---
+ 
+ Test packet size in bnx2x feature check path.
+ 
+ Regression Potential
+ 
+ 
+ Limited to bnx2x card driver.
+ The most likely failure case is a false-positive on the size check,
+ which would lead to a performance regression only.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  SRU Justification
  =

  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch 

[Kernel-packages] [Bug 1715519] Re: bnx2x_attn_int_deasserted3:4323 MC assert!

2018-02-05 Thread Daniel Axtens
A set of 2 patches to fix this was accepted upstream:

https://github.com/torvalds/linux/commit/2b16f048729bf35e6c28a40cbfad07239f9dcd90
https://github.com/torvalds/linux/commit/8914a595110a6eca69a5e275b323f5d09e18f4f9

I will send an SRU shortly.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  (This bug provides a place to track the progress of this issue
  upstream and then in to Ubuntu.)

  A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
  card attached, and uses openvswitch to bridge an ibmveth interface for
  traffic from other LPARs.

  We see the following crash sometimes when running netperf:
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
  May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
  ... (dump of registers follows) ...

  Subsequent debugging reveals that the packets causing the issue come
  through the ibmveth interface - from the AIX LPAR. The veth protocol
  is 'special' - communication between LPARs on the same chassis can use
  very large (64k) frames to reduce overhead. Normal networks cannot
  handle such large packets, so traditionally, the VIOS partition would
  signal to the AIX partitions that it was 'special', and AIX would send
  regular, ethernet-sized packets to VIOS, which VIOS would then send
  out.

  This signalling between VIOS and AIX is done in a way that is not
  standards-compliant, and so was never made part of Linux. Instead, the
  Linux driver has always understood large frames and passed them up the
  network stack.

  In some cases (e.g. with TCP), multiple TCP segments are coalesced
  into one large packet. In Linux, this goes through the generic receive
  offload code, using a similar mechanism to GSO. These segments can be
  very large which presents as a very large MSS (maximum segment size)
  or gso_size.

  Normally, the large packet is simply passed to whatever network
  application on Linux is going to consume it, and everything is OK.

  However, in this case, the packets go through Open vSwitch, and are
  then passed to the bnx2x driver. The bnx2x driver/hardware supports
  TSO and GSO, but with a restriction: the maximum segment size is
  limited to around 9700 bytes. Normally this is more than adequate as
  jumbo frames are limited to 9000 bytes. However, if a large packet
  with large (>9700 byte) TCP segments arrives through ibmveth, and is
  passed to bnx2x, the hardware will panic.

  Turning off TSO prevents the crash as the kernel resegments the data
  and assembles the packets in software. This has a performance cost.

  Clearly at the very least, bnx2x should not crash in this case.
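
  As a rough illustration of that idea, here is a minimal user-space
  sketch in C of a size check in a feature-check path. It is not the
  upstream patch: the struct, the FEAT_* bits and the exact 9700-byte
  limit are assumptions made for the example, loosely modelled on the
  description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, standing in for the kernel's NETIF_F_* flags. */
#define FEAT_TSO      (1u << 0)
#define FEAT_GSO_MASK (FEAT_TSO)

/* Assumed hardware limit, taken from the "around 9700 bytes" figure above. */
#define HW_MAX_SEG_SIZE 9700u

/* Hypothetical, simplified view of a packet queued for transmit. */
struct pkt {
    bool     is_gso;    /* packet carries several coalesced segments */
    uint32_t gso_size;  /* size of each segment (the MSS) */
};

/*
 * Return the features the device may use for this one packet. If the
 * segments are larger than the hardware can handle, mask out hardware
 * segmentation so that the stack handles this packet in software.
 */
uint32_t features_check(const struct pkt *p, uint32_t features)
{
    if (p->is_gso && p->gso_size > HW_MAX_SEG_SIZE)
        features &= ~FEAT_GSO_MASK;
    return features;
}
```

  With a check like this, only packets whose segments exceed the limit
  lose hardware segmentation; everything else keeps using TSO.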

  One patch to do this was sent upstream:
  https://www.spinics.net/lists/netdev/msg452932.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715519/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1728489] Re: tar -x sometimes fails on overlayfs

2018-01-30 Thread Daniel Axtens
** Changed in: linux (Ubuntu)
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1728489

Title:
  tar -x sometimes fails on overlayfs

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Zesty:
  Fix Released

Bug description:
  [SRU Justification]

  [Impact]
  A user is seeing failures from extracting tar archives on overlay
  filesystems on the 4.4 kernel in constrained environments. The error
  presents as:

  `tar: ./deps/0/bin: Directory renamed before its status could be
  extracted`

  Following this thread
  (http://www.spinics.net/lists/linux-unionfs/msg00856.html), it appears
  that this occurs when entries in the kernel's inode cache are
  reclaimed, and subsequent lookups return new inode numbers.

  Further testing showed that when setting
  `/proc/sys/vm/vfs_cache_pressure` to 0 (don't allow the kernel to
  reclaim inode cache entries due to memory pressure) the error does not
  recur, supporting the hypothesis that cache entries are being evicted.
  However, this setting may lead to a kernel OOM so is not a reasonable
  workaround even temporarily.

  The error cannot be reproduced on a 4.13 kernel, due to the series at
  https://www.spinics.net/lists/linux-fsdevel/msg110235.html. The
  particular relevant commit is
  b7a807dc2010334e62e0afd89d6f7a8913eb14ff, which needs a couple of
  dependencies.

  [Fix]
  For Zesty, backport the entire series.
  For Xenial, where a full backport is not feasible, backport the key
  commit and the short list of dependencies.

  [Testcase]

  # Testing this bug

  The testcase for this particular bug is simple - create an overlay
  filesystem with all layers on the same underlying file system, and
  then see if the inode of a directory is constant across dropping the
  caches:

  mkdir -p /upper/upper /upper/work /lower
  mount -t overlay none /mnt -o lowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work
  cd /mnt
  mkdir a
  stat a # observe inode number
  echo 2 > /proc/sys/vm/drop_caches
  stat a # compare inode number

  If the inode number is the same, the fix is successful.
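
  The comparison in the steps above could also be done programmatically.
  The following is a small sketch in C using stat(2); the helper names
  are made up for this example, and using 0 as a failure marker is an
  assumption of the sketch. Dropping the caches between the two reads is
  still done separately, as in the shell steps.

```c
#include <sys/stat.h>

/*
 * Return the inode number of `path`, or 0 if stat() fails.
 * (Treating 0 as "failed" is a simplification made for this sketch.)
 */
unsigned long long inode_number(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;
    return (unsigned long long)st.st_ino;
}

/* Return 1 if `path` still has the inode number recorded earlier, else 0. */
int inode_is_stable(const char *path, unsigned long long before)
{
    unsigned long long now = inode_number(path);

    return now != 0 && now == before;
}
```

  A caller would record inode_number("/mnt/a"), drop the caches, and then
  check inode_is_stable("/mnt/a", recorded_value).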

  # Regression testing

  I have run the unionmount test suite from
  http://git.infradead.org/users/dhowells/unionmount-testsuite.git in
  overlay mode (./run --ov), and verified that it still passes.

  (The series cover letter mentions a fork of the test suite at
  https://github.com/amir73il/unionmount-testsuite/commits/overlayfs-devel.
  I have *not* attempted to get this running: it assumes a range of
  changes that are not present in our kernels.)

  [Regression Potential]
  As this changes overlayfs, there is potential for regression in the
  form of unexpected breakages to overlayfs behaviour.

  I think this is adequately addressed by the regression testing.

  One option to reduce the regression potential on Zesty is to reduce
  the set of patches applied - rather than including the whole series we
  could include just the patches to solve this bug, which are much
  easier to inspect for correctness.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1728489/+subscriptions



[Kernel-packages] [Bug 1745364] [NEW] x86/net/bpf: return statement missing value

2018-01-25 Thread Daniel Axtens
Public bug reported:

SRU Justification
=================

Coverity reports:

*** CID 1464330:  Uninitialized variables  (MISSING_RETURN)
/arch/x86/net/bpf_jit_comp.c: 1088 in bpf_int_jit_compile()
1082        int i;
1083
1084        if (!bpf_jit_enable)
1085                return prog;
1086
1087        if (!prog || !prog->len)
>>> CID 1464330:  Uninitialized variables  (MISSING_RETURN)
>>> Arriving at the end of a function without returning a value.
1088                return;
1089
1090        addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);
1091        if (!addrs)
1092                return prog;
1093

This is a result of 3098d8eae421 ("bpf: prepare
bpf_int_jit_compile/bpf_prog_select_runtime apis"), which is a cherry-
pick of d1c55ab5e41f upstream. In that patch, the return type of
bpf_int_jit_compile was changed from void to struct bpf_prog*. That
patch changed some of the return statements.

It did not, however, change the return statement of the (!prog ||
!prog->len) check, as in upstream the (!prog || !prog->len) check was
dropped in 93a73d442d37 ("bpf, x86/arm64: remove useless checks on
prog"):

"""
There is never such a situation, where bpf_int_jit_compile() is
called with either prog as NULL or len as 0, so the tests are
unnecessary and confusing as people would just copy them.
"""

However, we haven't picked up 93a73d442d37, so when we cherry-picked
d1c55ab5e41f, that branch remained unmodified, hence the static analysis
warning.

Impact
======

If the branch is not dead and someone can hit it, an undefined value can
be returned, which could cause issues.

Fix
===

For consistency and in case the branch is not actually dead on Xenial,
we should do a fixup to 'return prog;'
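
To show the shape of that fixup outside the kernel, here is a small
user-space sketch in C. The struct and function names are stand-ins
invented for this example (they are not the kernel's struct bpf_prog or
bpf_int_jit_compile), and the "compilation" step is only a placeholder.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct bpf_prog. */
struct prog {
    unsigned int len;   /* number of instructions */
    int jited;          /* set once the program has been "compiled" */
};

/*
 * The function returns a pointer, so every return path has to return
 * one. The early exit below hands back the program unchanged.
 */
struct prog *jit_compile(struct prog *prog)
{
    if (!prog || !prog->len)
        return prog;    /* a bare "return;" here is what the report flags */

    prog->jited = 1;    /* placeholder for the real compilation work */
    return prog;
}
```

Returning the unmodified program on the early exit matches what the
other early-return paths in the quoted listing already do.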

Regression Potential
====================

Limited to the BPF jit which is off by default.
Limited to a branch that should be dead code anyway.
Limited to an error handling path.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Status: Confirmed

** Description changed:

+ SRU Justification
+ =================
+ 
  Coverity reports:
  
  *** CID 1464330:  Uninitialized variables  (MISSING_RETURN)
  /arch/x86/net/bpf_jit_comp.c: 1088 in bpf_int_jit_compile()
  1082        int i;
  1083
  1084        if (!bpf_jit_enable)
  1085                return prog;
  1086
  1087        if (!prog || !prog->len)
  >>> CID 1464330:  Uninitialized variables  (MISSING_RETURN)
  >>> Arriving at the end of a function without returning a value.
  1088                return;
  1089
  1090        addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);
  1091        if (!addrs)
  1092                return prog;
  1093
  
  This is a result of 3098d8eae421 ("bpf: prepare
  bpf_int_jit_compile/bpf_prog_select_runtime apis"), which is a cherry-
  pick of d1c55ab5e41f upstream. In that patch, the return type of
  bpf_int_jit_compile was changed from void to struct bpf_prog*. That
  patch changed some of the return statements.
  
  It did not, however, change the return statement of the (!prog ||
  !prog->len) check, as in upstream the (!prog || !prog->len) check was
  dropped in 93a73d442d37 ("bpf, x86/arm64: remove useless checks on
  prog"):
  
  """
  There is never such a situation, where bpf_int_jit_compile() is
  called with either prog as NULL or len as 0, so the tests are
  unnecessary and confusing as people would just copy them.
  """
  
  However, we haven't picked up 93a73d442d37, so when we cherry-picked
  d1c55ab5e41f, that branch remained unmodified, hence the static analysis
  warning.
  
  Impact
  ======
  
  If the branch is not dead and someone can hit it, an undefined value can
  be returned, which could cause issues.
  
  Fix
  ===
  
  For consistency and in case the branch is not actually dead on Xenial,
  we should do a fixup to 'return prog;'
  
  Regression Potential
  ====================
  
  Limited to the BPF jit which is off by default.
  Limited to a branch that should be dead code anyway.
  Limited to an error handling path.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1745364

Title:
  x86/net/bpf: return statement missing value

Status in linux package in Ubuntu:
  Confirmed


[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

2018-01-22 Thread Daniel Axtens
I have talked to the kernel team about this and updated Fred off-line.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add
  io_mem_pfn callback")

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [SRU Justification]

  [Impact]
  On Artful kernels, X fails to start and a kernel splat is printed.

  This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is
  incomplete: the hisilicon hibmc driver does not contain the callback
  and so the kernel tries to execute code at NULL.
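
  The failure mode can be modelled in user space with a table of
  function pointers. The sketch below is only an illustration: the
  struct, the function names and the default implementation are
  invented for the example and are not the TTM or hibmc code. It shows
  a driver table that leaves the callback unset, one that fills it in,
  and a caller that checks before calling so the sketch is safe to run.

```c
#include <stddef.h>

/* Hypothetical driver operations table with one optional-looking callback. */
struct driver_ops {
    unsigned long (*io_mem_pfn)(unsigned long page_offset);
};

/* Assumed generic implementation that a driver can point its table at. */
unsigned long default_io_mem_pfn(unsigned long page_offset)
{
    return page_offset;
}

/* Return 1 if the driver filled in the callback, else 0. */
int has_io_mem_pfn(const struct driver_ops *ops)
{
    return ops->io_mem_pfn != NULL;
}

/*
 * Caller that guards the call. Calling ops->io_mem_pfn without this
 * check when the field is NULL is the "execute code at NULL" case.
 */
unsigned long lookup_pfn(const struct driver_ops *ops, unsigned long page_offset)
{
    if (!has_io_mem_pfn(ops))
        return default_io_mem_pfn(page_offset);
    return ops->io_mem_pfn(page_offset);
}
```

  A driver-only fix corresponds to filling in the table entry in the
  driver; the guard in lookup_pfn() is the alternative of changing the
  common code.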

  [Fix]
  There is a discussion and potential fix at
  https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
  The fix hasn't landed yet and it looks like they're going to
  re-engineer the entire section instead.

  Rather than wait for that and deal with the massive regression
  potential, the fix I have picked to submit is very very minimal and
  touches only hibmc.

  [Regression Potential]
  Minimal - fix only touches hibmc driver. Tested on D05 board.

  [Testcase]
  Install patched kernel, try to start X. If it succeeds, the fix works.
  If there's a kernel splat, the fix does not work.

  [Notes]
  HiSilicon would really like this fix in Artful in time for the next 16.04
  point release in February, so that the HWE kernel will work with Xorg.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions



Re: [Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

2018-01-10 Thread Daniel Axtens
Hi Frank,

Yes, that is how I see it - these changes can go through, but we need
good docs to point people to as there is an incredibly high likelihood
of misconfiguration at various points.

Regards,
Daniel

On Tue, Dec 19, 2017 at 3:46 AM, Frank Heimes
<1692...@bugs.launchpad.net> wrote:
> Siva and Daniel, may I just ask where we are on this?
> Well it looks to me that Siva/IBM sees this more as a misconfiguration, so
> that the changes in comment #18 are _not_ needed. Daniel, do you see it now
> the same way?
> But in this case this needs to be documented somewhere, so that we can point
> customers to it - right?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1692538
>
> Title:
>   Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA
>
> Status in The Ubuntu-power-systems project:
>   In Progress
> Status in linux package in Ubuntu:
>   Fix Released
> Status in linux source package in Xenial:
>   In Progress
> Status in linux source package in Zesty:
>   Fix Released
> Status in linux source package in Artful:
>   Fix Released
>
> Bug description:
>
>   == SRU Justification ==
>   Commit 66aa0678ef is requested to fix four issues with the ibmveth driver.
>   The issues are as follows:
>   - Issue 1: ibmveth doesn't support largesend and checksum offload features
>   when configured as "Trunk".
>   - Issue 2: SYN packet drops seen at destination VM. When the packet
>   originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO
>   server's inbound Trunk ibmveth, on validating "checksum good" bits in ibmveth
>   receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY flag.
>   - Issue 3: First packet of a TCP connection will be dropped, if there is
>   no OVS flow cached in datapath.
>   - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list.
>
>   The details for the fixes to these issues are described in the commit's
>   git log.
>
>
>
>   == Comment: #0 - BRYANT G. LY  - 2017-05-22 08:40:16 ==
>   ---Problem Description---
>
>- Issue 1: ibmveth doesn't support largesend and checksum offload features
>  when configured as "Trunk". Driver has explicit checks to prevent
>  enabling these offloads.
>
>- Issue 2: SYN packet drops seen at destination VM. When the packet
>  originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to
>  IO server's inbound Trunk ibmveth, on validating "checksum good" bits
>  in ibmveth receive routine, SKB's ip_summed field is set with
>  CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
>  Bridge) and delivered to outbound Trunk ibmveth. At this point the
>  outbound ibmveth transmit routine will not set "no checksum" and
>  "checksum good" bits in transmit buffer descriptor, as it does so only
>  when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
>  delivered to destination VM, TCP layer receives the packet with checksum
>  value of 0 and with no checksum related flags in ip_summed field. This
>  leads to packet drops, so TCP connections never complete.
>
>- Issue 3: First packet of a TCP connection will be dropped, if there is
>  no OVS flow cached in datapath. OVS while trying to identify the flow,
>  computes the checksum. The computed checksum will be invalid at the
>  receiving end, as ibmveth transmit routine zeroes out the pseudo
>  checksum value in the packet. This leads to packet drop.
>
>- Issue 4: ibmveth driver doesn't have support for SKB's with frag_list.
>  When Physical NIC has GRO enabled and when OVS bridges these packets,
>  OVS vport send code will end up calling dev_queue_xmit, which in turn
>  calls validate_xmit_skb.
>  In validate_xmit_skb routine, the larger packets will get segmented into
>  MSS sized segments, if SKB has a frag_list and if the driver to which
>  they are delivered to doesn't support NETIF_F_FRAGLIST feature.
>
>   Contact Information = Bryant G. Ly/b...@us.ibm.com
>
>   ---uname output---
>   4.8.0-51.54
>
>   Machine Type = p8
>
>   ---Debugger---
>   A debugger is not configured
>
>   ---Steps to Reproduce---
>Increases performance greatly
>
>   The patch has been accepted upstream:
>   https://patchwork.ozlabs.org/patch/764533/
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu-power-systems/+bug/1692538/+subscriptions

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1692538

Title:
  Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:

[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

2017-12-15 Thread Daniel Axtens
** Description changed:

- ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the
- hisilicon hibmc driver does not contain the callback and so X does not
- start.
+ [SRU Justification]
  
- Discussion and potential fix at
- https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
+ [Impact]
+ On Artful kernels, X fails to start and a kernel splat is printed.
  
- This affects Artful, upstream has not landed on a solution yet as far as
- I can tell, so lets backport the first proposed small fix.
+ This is because ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is
+ incomplete: the hisilicon hibmc driver does not contain the callback and
+ so the kernel tries to execute code at NULL.
+ 
+ [Fix]
+ There is a discussion and potential fix at
+ https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html
+ The fix hasn't landed yet and it looks like they're going to
+ re-engineer the entire section instead.
+ 
+ Rather than wait for that and deal with the massive regression
+ potential, the fix I have picked to submit is very very minimal and
+ touches only hibmc.
+ 
+ [Regression Potential]
+ Minimal - fix only touches hibmc driver. Tested on D05 board.
+ 
+ [Testcase]
+ Install patched kernel, try to start X. If it succeeds, the fix works.
+ If there's a kernel splat, the fix does not work.
+ 
+ [Notes]
+ HiSilicon would really like this fix in Artful in time for the next 16.04
+ point release in February, so that the HWE kernel will work with Xorg.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add
  io_mem_pfn callback")

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions



[Kernel-packages] [Bug 1738334] Re: hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

2017-12-15 Thread Daniel Axtens
Confirmed - the symptom is a kernel splat about "Attempting to execute
userspace memory" triggered by Xorg with LR in ttm_bo_vm_fault - see
attached screenshot (sorry!)


** Changed in: linux (Ubuntu)
   Status: Incomplete => Confirmed

** Attachment added: "splat.png"
   
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+attachment/5022893/+files/splat.png

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add
  io_mem_pfn callback")

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the
  hisilicon hibmc driver does not contain the callback and so X does not
  start.

  Discussion and potential fix at
  https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html

  This affects Artful. Upstream has not landed on a solution yet as far
  as I can tell, so let's backport the first proposed small fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions



Re: [Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID

2017-12-15 Thread Daniel Axtens
Hi Fred,

The artful repository is
git://kernel.ubuntu.com/ubuntu/ubuntu-artful.git

It contains 4417ec7a7c8d ("UBUNTU: SAUCE: PCI: Support hibmc VGA cards
behind a misbehaving HiSilicon bridge")

This was an earlier version of those patches and should allow xorg
autoconfiguration to work.

Regards,
Daniel

On Fri, Dec 15, 2017 at 6:38 PM, Fred Kimmy
 wrote:
> hi daniel:
>
> Has the following mainline patchset been merged into the artful branch
> or not? Without this patchset, X will fail to start.
>
> Can you confirm this, and point me to the artful branch so that I can
> test it?
>
> 505a1b5 vgaarb: Factor out EFI and fallback default device selection
> a37c0f4 vgaarb: Select a default VGA device even if there's no legacy VGA
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1698700
>
> Title:
>   hibmc driver does not include "pci:" prefix in bus ID
>
> Status in linux package in Ubuntu:
>   Incomplete
> Status in linux source package in Zesty:
>   Fix Released
> Status in linux source package in Artful:
>   Fix Released
>
> Bug description:
>   SRU Justification
>
>   [Impact]
>   On the HiSilicon D05 (arm64) board, X crashes when started. [0]
>
>   [Fix]
>   The crash is attributable to the bus ID that the hibmc driver reports for
>   the hibmc graphics card on the board. In particular, the bus id is missing
>   the "pci:" prefix that most other cards provide: [1]
>   - The busid reported on the arm64 system is "0007:a1:00.0"
>   - The busid reported on an amd64 system is "pci::00:02.0"
>
>   X tests for this prefix. A missing prefix for PCI cards leads to an
>   Xorg crash.
>
>   Fix this by using the set_pci_busid function from the DRM core.
>
>   [Testcase]
>   Successfully tested on a D05 board. [2]
>
>   [Regression Potential]
>   Changes are limited to the hibmc driver, so any regression should also be 
> limited to that driver.
>
>   [Notes]
>   I submitted the patch upstream. However, upstream is refactoring the drm 
> core, and set_busid is going away. That does fix this issue but the 
> regression potential of the refactor is enormous, so this seems like the 
> wiser approach. [3]
>
>   [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991
>   [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16
>   [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29
>   [3]: https://www.spinics.net/lists/dri-devel/msg143831.html
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1698700

Title:
  hibmc driver does not include "pci:" prefix in bus ID

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Fix Released

Bug description:
  SRU Justification

  [Impact]
  On the HiSilicon D05 (arm64) board, X crashes when started. [0]

  [Fix]
  The crash is attributable to the bus ID that the hibmc driver reports
  for the hibmc graphics card on the board. In particular, the bus id is
  missing the "pci:" prefix that most other cards provide: [1]
  - The busid reported on the arm64 system is "0007:a1:00.0"
  - The busid reported on an amd64 system is "pci::00:02.0"

  X tests for this prefix. A missing prefix for PCI cards leads to an
  Xorg crash.

  Fix this by using the set_pci_busid function from the DRM core.
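
  To make the prefix issue concrete, here is a small user-space sketch
  in C. The helper names are invented for this example, and the
  "pci:domain:bus:slot.function" layout is an assumption based on the
  bus IDs quoted above; this is neither the Xorg check nor the DRM core
  code.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the bus ID starts with the "pci:" prefix, else 0. */
int has_pci_prefix(const char *busid)
{
    return strncmp(busid, "pci:", 4) == 0;
}

/*
 * Write a prefixed bus ID of the assumed form "pci:DDDD:BB:SS.F" into
 * buf. Returns what snprintf() returns.
 */
int format_pci_busid(char *buf, size_t len, unsigned int domain,
                     unsigned int bus, unsigned int slot, unsigned int func)
{
    return snprintf(buf, len, "pci:%04x:%02x:%02x.%u",
                    domain, bus, slot, func);
}
```

  In this model the arm64 string "0007:a1:00.0" fails has_pci_prefix(),
  while the same device formatted through format_pci_busid() passes.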

  [Testcase]
  Successfully tested on a D05 board. [2]

  [Regression Potential]
  Changes are limited to the hibmc driver, so any regression should also
  be limited to that driver.

  [Notes]
  I submitted the patch upstream. However, upstream is refactoring the
  drm core, and set_busid is going away. That does fix this issue but the
  regression potential of the refactor is enormous, so this seems like
  the wiser approach. [3]

  [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991
  [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16
  [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29
  [3]: https://www.spinics.net/lists/dri-devel/msg143831.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions



[Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID

2017-12-14 Thread Daniel Axtens
There is another bug causing an artful regression - opening a new LP for
that: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334

** Changed in: linux (Ubuntu Artful)
   Status: Incomplete => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1698700

Title:
  hibmc driver does not include "pci:" prefix in bus ID

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions



[Kernel-packages] [Bug 1738334] [NEW] hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add io_mem_pfn callback")

2017-12-14 Thread Daniel Axtens
Public bug reported:

ea642c3216cb ("drm/ttm: add io_mem_pfn callback") is incomplete: the
hisilicon hibmc driver does not contain the callback and so X does not
start.

Discussion and potential fix at
https://lists.freedesktop.org/archives/dri-devel/2017-November/159002.html

This affects Artful. Upstream has not landed on a solution yet as far as
I can tell, so let's backport the first proposed small fix.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1738334

Title:
  hisilicon hibmc regression due to ea642c3216cb ("drm/ttm: add
  io_mem_pfn callback")

Status in linux package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1738334/+subscriptions



Re: [Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID

2017-12-10 Thread Daniel Axtens
Hi Fred,

I will have a look soon and update you.

Regards,
Daniel

On Mon, Dec 11, 2017 at 6:00 PM, Fred Kimmy wrote:
> this  patch will solve commit #10 bug, please merge this patch.
>
> thank you
>

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1698700

Title:
  hibmc driver does not include "pci:" prefix in bus ID

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Incomplete

Bug description:
  SRU Justification

  [Impact]
  On the HiSilicon D05 (arm64) board, X crashes when started. [0]

  [Fix]
  The crash is attributable to the bus ID that the hibmc driver reports for
  the hibmc graphics card on the board. In particular, the bus ID is missing
  the "pci:" prefix that most other cards provide: [1]
  - The busid reported on the arm64 system is "0007:a1:00.0"
  - The busid reported on an amd64 system is "pci::00:02.0"

  X tests for this prefix. A missing prefix for PCI cards leads to an
  Xorg crash.

  Fix this by using the set_pci_busid function from the DRM core.
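
As a rough illustration of the prefix test described above, the following C
sketch checks whether a bus ID string starts with "pci:". The function name
has_pci_prefix is hypothetical, and this is not the actual Xorg or DRM core
code; it only models the kind of check that the unprefixed hibmc bus ID fails.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not taken from Xorg or the DRM core: returns true
 * if the bus ID string starts with the "pci:" prefix. */
static bool has_pci_prefix(const char *busid)
{
	static const char prefix[] = "pci:";

	return busid != NULL && strncmp(busid, prefix, strlen(prefix)) == 0;
}
```

With this helper, the arm64 value "0007:a1:00.0" quoted above is rejected,
while a prefixed string such as "pci:0007:a1:00.0" (an assumed example of
the fixed output) is accepted.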

  [Testcase]
  Successfully tested on a D05 board. [2]

  [Regression Potential]
  Changes are limited to the hibmc driver, so any regression should also be
  limited to that driver.

  [Notes]
  I submitted the patch upstream. However, upstream is refactoring the drm
  core, and set_busid is going away. That does fix this issue, but the
  regression potential of the refactor is enormous, so backporting this
  small patch seems like the wiser approach. [3]

  [0]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991
  [1]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/16
  [2]: https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/29
  [3]: https://www.spinics.net/lists/dri-devel/msg143831.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions



[Kernel-packages] [Bug 1698700] Re: hibmc driver does not include "pci:" prefix in bus ID

2017-12-03 Thread Daniel Axtens
The patch does seem to be in Artful; I am following up with the user.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1698700

Title:
  hibmc driver does not include "pci:" prefix in bus ID

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Incomplete

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698700/+subscriptions



[Kernel-packages] [Bug 1729119] Re: NVMe timeout is too short

2017-11-21 Thread Daniel Axtens
** Description changed:

  [SRU Justification]
  
  [Impact]
- Some NVMe operations time out too quickly. The module parameters allow the
- timeouts to be extended, but only up to 255s, as the counters are bytes.
+ Some NVMe operations time out too quickly. The module parameters allow the
+ timeouts to be extended, but only up to 255s, as the counters are bytes.
  
  [Fix]
  The underlying parameters are unsigned ints, so make the module parameters
  unsigned ints too, by picking patch
  http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html
  
+ (Trusty specific) This also requires picking the patch that converts
+ the constant into a parameter, which is a clean cherry-pick.
+ 
  [Regression Potential]
- Very limited: only types of module parameters are changing, the patch is
- easily reviewable.
+ (X/Z/A) Very limited: only types of module parameters are changing, the
+ patch is easily reviewable.
+ 
+ (Trusty specific) Limited: a module parameter is added and its type is
+ changed. The patches are easily reviewable.
+ 
+ [Testing]
+ (Trusty only) Boot tested on a c5.large instance on AWS which uses
+ NVMe to boot. Verified that the system still boots with the patches,
+ and that a timeout of 123456s is permitted.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1729119

Title:
  NVMe timeout is too short

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Committed
Status in linux-aws source package in Xenial:
  Fix Released
Status in linux source package in Zesty:
  Fix Committed
Status in linux-aws source package in Zesty:
  Invalid
Status in linux source package in Artful:
  Fix Committed
Status in linux-aws source package in Artful:
  Invalid

Bug description:
  [SRU Justification]

  [Impact]
  Some NVMe operations time out too quickly. The module parameters allow the
  timeouts to be extended, but only up to 255s, as the counters are bytes.

  [Fix]
  The underlying parameters are unsigned ints, so make the module parameters
  unsigned ints too, by picking patch
  http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html

  (Trusty specific) This also requires picking the patch that converts
  the constant into a parameter, which is a clean cherry-pick.

  [Regression Potential]
  (X/Z/A) Very limited: only types of module parameters are changing, the
  patch is easily reviewable.

  (Trusty specific) Limited: a module parameter is added and its type is
  changed. The patches are easily reviewable.

  [Testing]
  (Trusty only) Boot tested on a c5.large instance on AWS which uses
  NVMe to boot. Verified that the system still boots with the patches,
  and that a timeout of 123456s is permitted.
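
To illustrate why byte-sized counters cap the timeout at 255s, here is a
small standalone C sketch. The function names are hypothetical and this is
not the nvme driver code; it only compares a requested timeout against the
range of an unsigned char and of an unsigned int.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helpers for illustration only. */

/* True if a timeout in seconds can be represented by a byte-sized
 * (unsigned char) module parameter. */
static bool fits_in_byte_param(unsigned long seconds)
{
	return seconds <= UCHAR_MAX;
}

/* True if a timeout in seconds can be represented by an unsigned int
 * module parameter. */
static bool fits_in_uint_param(unsigned long seconds)
{
	return seconds <= UINT_MAX;
}
```

On a system with 8-bit chars, 255 fits in the byte-sized parameter but 256
does not, whereas the 123456s value used in the test above fits once the
parameter is an unsigned int.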

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119/+subscriptions



[Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

2017-11-20 Thread Daniel Axtens
Hi Siva,

Thank you for your quick and thoughtful response.

I will ask about the default MTU for the veth interface to see if the
user increased it themselves.

I'm not sure I completely understand what you mean about largesend
offload being disabled after retransmits. I'm also not completely sure
if it's largesend offload or just large packets that are causing issues.
If I have understood correctly (e.g.
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/tcp_large_send_offload.htm)
large-send offload is what Linux would call TCP Segmentation Offload
(TSO) - does that match your understanding?

Here's my concern. The code I'm looking at (let's look at Zesty, so
v4.10) is in ibmveth.c, ibmveth_poll(). There we see:

    if (length > netdev->mtu + ETH_HLEN) {
        ibmveth_rx_mss_helper(skb, mss, lrg_pkt);
        adapter->rx_large_packets++;
    }

Then ibmveth_rx_mss_helper() has the following - setting GSO on
regardless of the large_pkt bit:

    /* if mss is not set through Large Packet bit/mss in rx buffer,
     * expect that the mss will be written to the tcp header checksum.
     */
    tcph = (struct tcphdr *)(skb->data + offset);
    if (lrg_pkt) {
        skb_shinfo(skb)->gso_size = mss;
    } else if (offset) {
        skb_shinfo(skb)->gso_size = ntohs(tcph->check);
        tcph->check = 0;
    }

It looks to me that Linux will interpret a packet from the veth adaptor
as a GSO/GRO packet based only on whether or not the size of the
received packet is greater than the linux-side MTU plus the header size
- not based on whether AIX thinks it is transmitting a LSO packet.
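
The decision described above can be modelled in a few lines of userspace C.
This is a simplified sketch, not the driver code: the function name and the
plain integer arguments are assumptions for illustration, and only the
comparison against the MTU plus ETH_HLEN (14 bytes) comes from the snippet
above.

```c
#include <stdbool.h>

#define ETH_HLEN 14 /* Ethernet header length in bytes */

/* Hypothetical model of the check in ibmveth_poll(): a received frame is
 * handed to the MSS helper purely on the basis of its length. */
static bool rx_takes_mss_helper_path(unsigned int length, unsigned int mtu)
{
	return length > mtu + ETH_HLEN;
}
```

With a Linux-side MTU of 1500, a 1514-byte frame is handled normally, while
a 9014-byte frame from a peer using a 9000-byte MTU takes the helper path
whether or not the sender used large-send offload.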

To put it another way - if I have understood correctly - there are two
ways we could end up with a GSO/GRO packet coming out of a veth adaptor.
The ibmveth_rx_mss_helper path is taken when the size of the packet is
greater than MTU+ETH_HLEN, which can happen when:

 1) The AIX end has turned on LSO, so the large_packet bit is set
 2) Large-send is off in AIX but there is a mismatched MTU between AIX and
    Linux

In the first case, you say that AIX will turn off largesend, which
will fix the issue. But in the second case, if I have understood
correctly, AIX will not be able to do anything. Unless you are saying
that AIX will dynamically reduce the MTU for a connection in the
presence of a number of re-transmits?

This isn't necessarily wrong behaviour from AIX - Linux can't do
anything in this situation either; a 'hop' that can participate in Path
MTU Discovery would be needed.

If I understand it, then, the optimal configuration would be for the AIX
LPAR to set an MTU of 1500/9000 and turn on LSO for veth on the AIX side
- does that sound right?

Thanks again!
Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1692538

Title:
  Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Fix Released


[Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

2017-11-19 Thread Daniel Axtens
Hi Bryant,

So, to be crystal clear, IBM's position is that customers using this
setup should set the MTU in their AIX partitions to 1500 (or 9000 if
using jumbo frames)?

Is this documented anywhere on your website that we can point users to?

I ask because I have asked one of your customers/our users to do this in
a support context and they were unhappy about the performance impact. So
if this is the official line, can we have some official documentation of
it?

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1692538

Title:
  Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Fix Released

Bug description:
  
  == SRU Justification ==
  Commit 66aa0678ef is requested to fix four issues with the ibmveth driver.
  The issues are as follows:
  - Issue 1: ibmveth doesn't support largesend and checksum offload features
  when configured as "Trunk".
  - Issue 2: SYN packet drops seen at destination VM. When the packet
  originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to IO
  server's inbound Trunk ibmveth, on validating "checksum good" bits in ibmveth
  receive routine, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY flag.
  - Issue 3: First packet of a TCP connection will be dropped, if there is
  no OVS flow cached in datapath.
  - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list.

  The details for the fixes to these issues are described in the commit's
  git log.



  == Comment: #0 - BRYANT G. LY  - 2017-05-22 08:40:16 ==
  ---Problem Description---

   - Issue 1: ibmveth doesn't support largesend and checksum offload features
     when configured as "Trunk". Driver has explicit checks to prevent
     enabling these offloads.

   - Issue 2: SYN packet drops seen at destination VM. When the packet
     originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to
     IO server's inbound Trunk ibmveth, on validating "checksum good" bits
     in ibmveth receive routine, SKB's ip_summed field is set with
     CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
     Bridge) and delivered to outbound Trunk ibmveth. At this point the
     outbound ibmveth transmit routine will not set "no checksum" and
     "checksum good" bits in transmit buffer descriptor, as it does so only
     when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
     delivered to destination VM, TCP layer receives the packet with checksum
     value of 0 and with no checksum related flags in ip_summed field. This
     leads to packet drops, so TCP connections never complete successfully.

   - Issue 3: First packet of a TCP connection will be dropped, if there is
     no OVS flow cached in datapath. OVS while trying to identify the flow,
     computes the checksum. The computed checksum will be invalid at the
     receiving end, as ibmveth transmit routine zeroes out the pseudo
     checksum value in the packet. This leads to packet drop.

   - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list.
     When Physical NIC has GRO enabled and when OVS bridges these packets,
     OVS vport send code will end up calling dev_queue_xmit, which in turn
     calls validate_xmit_skb.
     In validate_xmit_skb routine, the larger packets will get segmented into
     MSS-sized segments if the SKB has a frag_list and the driver to which
     they are delivered doesn't support the NETIF_F_FRAGLIST feature.

  Contact Information = Bryant G. Ly/b...@us.ibm.com

  ---uname output---
  4.8.0-51.54

  Machine Type = p8

  ---Debugger---
  A debugger is not configured

  ---Steps to Reproduce---
   Increases performance greatly

  The patch has been accepted upstream:
  https://patchwork.ozlabs.org/patch/764533/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1692538/+subscriptions



[Kernel-packages] [Bug 1692538] Re: Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

2017-11-17 Thread Daniel Axtens
Just as an update: I am working with Jay V on a set of patches to drop
the oversized packets at the openvswitch/bridge level to prevent the
crash I mentioned.

But that is not sufficient to solve the underlying problem: there will
still be packet loss when there's an MTU mismatch here. A device in AIX
with a 64k MTU being bridged (via openvswitch or a native bridge) to a
device with a 1500 or 9000 byte MTU is never going to work reliably and
efficiently, and IBM will need to figure out how they want to solve
this.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1692538

Title:
  Ubuntu 16.04.02: ibmveth: Support to enable LSO/CSO for Trunk VEA

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  Fix Released
Status in linux source package in Artful:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1692538/+subscriptions



[Kernel-packages] [Bug 1729119] Re: NVMe timeout is too short

2017-11-14 Thread Daniel Axtens
** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1729119

Title:
  NVMe timeout is too short

Status in linux package in Ubuntu:
  Confirmed
Status in linux-aws package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  New
Status in linux-aws source package in Xenial:
  Fix Committed

Bug description:
  [SRU Justification]

  [Impact]
  Some NVMe operations time out too quickly. The module parameters allow the
  timeouts to be extended, but only up to 255s, as the counters are bytes.

  [Fix]
  The underlying parameters are unsigned ints, so make the module parameters
  unsigned ints too, by picking patch
  http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html

  [Regression Potential]
  Very limited: only types of module parameters are changing, the patch is
  easily reviewable.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119/+subscriptions



[Kernel-packages] [Bug 1715812] Re: Neighbour confirmation broken, breaks ARP cache aging

2017-11-02 Thread Daniel Axtens
** Changed in: linux (Ubuntu)
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715812

Title:
  Neighbour confirmation broken, breaks ARP cache aging

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Zesty:
  Fix Released

Bug description:
  [SRU Justification]

  [Impact]
  A host can lose access to another host whose MAC address changes if it has
  active connections to other hosts that share a route. The ARP cache does not
  time out as expected - instead the old MAC address is continuously reconfirmed.

  [Fix]
  Apply series [1], which changes the algorithm for neighbour confirmation.
  That is, from upstream:
  51ce8bd4d17a net: pending_confirm is not used anymore 
  0dec879f636f net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP 
  63fca65d0863 net: add confirm_neigh method to dst_ops 
  c3a2e8370534 tcp: replace dst_confirm with sk_dst_confirm 
  c86a773c7802 sctp: add dst_pending_confirm flag 
  4ff0620354f2 net: add dst_pending_confirm flag to skbuff 
  9b8805a32559 sock: add sk_dst_pending_confirm flag 

  [Test case]
  Create 3 real or virtual systems, all hooked up to a switch.
  One system needs an active-backup bond with fail_over_mac=1 num_grat_arp=0.

  Put all the systems in the same subnet, e.g. 192.168.200.0/24

  Call the system with the bond A, and the other two systems B and C.

  On B, run in 3 shells:
   - netperf -t TCP_RR to C
   - ping -f A
   - watch 'ip -s neigh show 192.168.200.0/24'

  On A, cause the bond to fail over.

  Observe that:

   - without the patches, B intermittently fails to notice the change in
  A's MAC address. This presents as the ping failing and not recovering,
  and the arp table showing the old mac address never timing out and
  never being replaced with a new mac address.

   - with the patches, the arp cache times out and B sends another mac
  probe and detects A's new address.

  It helps to use taskset to put ping and netperf on the same CPU, or
  use single-CPU vms.

  See [2] for more details.

  [References]
  [1]: https://www.spinics.net/lists/linux-rdma/msg45907.html
  [2]: Original report:
  https://www.mail-archive.com/netdev@vger.kernel.org/msg138762.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715812/+subscriptions



[Kernel-packages] [Bug 1729119] [NEW] NVMe timeout is too short

2017-10-31 Thread Daniel Axtens
Public bug reported:

[SRU Justification]

[Impact]
Some NVMe operations time out too quickly. The module parameters allow the
timeouts to be extended, but only up to 255s, as the counters are bytes.

[Fix]
The underlying parameters are unsigned ints, so make the module parameters
unsigned ints too, by picking patch
http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html

[Regression Potential]
Very limited: only types of module parameters are changing, the patch is
easily reviewable.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1729119

Title:
  NVMe timeout is too short

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [SRU Justification]

  [Impact]
  Some NVMe operations time out too quickly. The module parameters allow the
  timeouts to be extended, but only up to 255s, as the counters are bytes.

  [Fix]
  The underlying parameters are unsigned ints, so make the module parameters
  unsigned ints too, by picking patch
  http://lists.infradead.org/pipermail/linux-nvme/2017-September/012701.html

  [Regression Potential]
  Very limited: only types of module parameters are changing, the patch is
  easily reviewable.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729119/+subscriptions



[Kernel-packages] [Bug 1728489] [NEW] tar -x sometimes fails on overlayfs

2017-10-29 Thread Daniel Axtens
Public bug reported:

[SRU Justification]

[Impact]
A user is seeing failures from extracting tar archives on overlay filesystems
on the 4.4 kernel in constrained environments. The error presents as:

`tar: ./deps/0/bin: Directory renamed before its status could be extracted`

Following this thread
(http://www.spinics.net/lists/linux-unionfs/msg00856.html), it appears that
this occurs when entries in the kernel's inode cache are reclaimed, and
subsequent lookups return new inode numbers.

Further testing showed that when setting
`/proc/sys/vm/vfs_cache_pressure` to 0 (don't allow the kernel to
reclaim inode cache entries due to memory pressure) the error does not
recur, supporting the hypothesis that cache entries are being evicted.
However, this setting may lead to a kernel OOM so is not a reasonable
workaround even temporarily.

The error cannot be reproduced on a 4.13 kernel, due to the series at
https://www.spinics.net/lists/linux-fsdevel/msg110235.html. The
particular relevant commit is b7a807dc2010334e62e0afd89d6f7a8913eb14ff,
which needs a couple of dependencies.

[Fix]
For Zesty, backport the entire series.
For Xenial, where a full backport is not feasible, backport the key commit and
the short list of dependencies.

[Testcase]

# Testing this bug

The testcase for this particular bug is simple - create an overlay
filesystem with all layers on the same underlying file system, and then
see if the inode of a directory is constant across dropping the caches:

mkdir -p /upper/upper /upper/work /lower
mount -t overlay none /mnt -o lowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work
cd /mnt
mkdir a
stat a # observe inode number
echo 2 > /proc/sys/vm/drop_caches
stat a # compare inode number

If the inode number is the same, the fix is successful.
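
The manual check above can also be scripted. The following is a minimal sketch in Python, assuming it is run as root with the overlay already mounted as described. The helper names `drop_kernel_caches` and `inode_is_stable` are made up for illustration and are not part of any test suite.

```python
import os
from typing import Callable


def drop_kernel_caches() -> None:
    """Ask the kernel to free reclaimable dentries and inodes (needs root)."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("2\n")


def inode_is_stable(path: str,
                    drop: Callable[[], None] = drop_kernel_caches) -> bool:
    """Return True if `path` reports the same inode number before and
    after the caches are dropped."""
    before = os.stat(path).st_ino   # observe inode number
    drop()                          # equivalent of: echo 2 > drop_caches
    after = os.stat(path).st_ino    # compare inode number
    return before == after
```

With the overlay mounted on /mnt and the directory `a` created, `inode_is_stable("/mnt/a")` should return True on a fixed kernel. The `drop` parameter exists only so the comparison logic can be exercised without root.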

# Regression testing

I have run the unionmount test suite from
http://git.infradead.org/users/dhowells/unionmount-testsuite.git in
overlay mode (./run --ov), and verified that it still passes.

(The series cover letter mentions a fork of the test suite at
https://github.com/amir73il/unionmount-testsuite/commits/overlayfs-devel.
I have *not* attempted to get this running: it assumes a range of
changes that are not present in our kernels.)

[Regression Potential]
As this changes overlayfs, there is potential for regression in the form of 
unexpected breakages to overlayfs behaviour.

I think this is adequately addressed by the regression testing.

One option to reduce the regression potential on Zesty is to reduce the
set of patches applied - rather than including the whole series we could
include just the patches to solve this bug, which are much easier to
inspect for correctness.

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1728489

Title:
  tar -x sometimes fails on overlayfs

Status in linux package in Ubuntu:
  Confirmed


[Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module

2017-10-16 Thread Daniel Axtens
Hi,

It turns out that support for this driver would require a very large
backport with several series of patches, involving significant
refactoring, code movement and other code changes. This makes it very
hard for us to be sure that our backport is correct, and that it's not
going to fail unexpectedly on this new model or on any of the many
older models supported by this driver.

Therefore, we have decided that the complexity and risk of regression
is unacceptably high, and we will not be providing a backport to the
4.4 kernel series.

This means that you will need to use the HWE kernel for this
chassis. The Artful kernel, which is out next week, has full support.

** Changed in: linux (Ubuntu)
   Status: Confirmed => Won't Fix

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1683587

Title:
  LSI Harpoon support in megaraid_sas module

Status in linux package in Ubuntu:
  Won't Fix

Bug description:
  The Dell PERC H740 series RAID controllers, codename "Harpoon", are
  not supported in standard Ubuntu kernels.

  There is a series of kernel patches required to support these:
  http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1307314.html

  There is also a relevant follow-up series:
  https://www.spinics.net/lists/linux-scsi/msg104667.html especially
  patches 1 and 12.

  The relevant PCI IDs from the PCI database
  (http://pciids.sourceforge.net/v2.2/pci.ids) are:

  0016 MegaRAID Tri-Mode SAS3508
  1028 1fc9 PERC H840 Adapter
  1028 1fcb PERC H740P Adapter
  1028 1fcd PERC H740P Mini
  1028 1fcf PERC H740P Mini

  They should be supported from Xenial onwards. The upstream commit is
  going in for 4.11, so this will need to be backported to v4.4 and
  v4.10.

  I am working on SRU patches for this.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions



[Kernel-packages] [Bug 1715812] Re: Neighbour confirmation broken, breaks ARP cache aging

2017-09-27 Thread Daniel Axtens
Verified on Xenial and Zesty.

** Tags removed: verification-needed-zesty
** Tags added: verification-done-zesty

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715812

Title:
  Neighbour confirmation broken, breaks ARP cache aging

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Xenial:
  Fix Committed
Status in linux source package in Zesty:
  Fix Committed

Bug description:
  [SRU Justification]

  [Impact]
  A host can lose access to another host whose MAC address changes if they have 
active connections to other hosts that share a route. The ARP cache does not 
time out as expected - instead the old MAC address is continuously reconfirmed.

  [Fix]
  Apply series [1], which changes the algorithm for neighbour confirmation.
  That is, from upstream:
  51ce8bd4d17a net: pending_confirm is not used anymore 
  0dec879f636f net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP 
  63fca65d0863 net: add confirm_neigh method to dst_ops 
  c3a2e8370534 tcp: replace dst_confirm with sk_dst_confirm 
  c86a773c7802 sctp: add dst_pending_confirm flag 
  4ff0620354f2 net: add dst_pending_confirm flag to skbuff 
  9b8805a32559 sock: add sk_dst_pending_confirm flag 

  [Test case]
  Create 3 real or virtual systems, all hooked up to a switch.
  One system needs an active-backup bond with fail_over_mac=1 num_grat_arp=0.

  Put all the systems in the same subnet, e.g. 192.168.200.0/24

  Call the system with the bond A, and the other two systems B and C.

  On B, run in 3 shells:
   - netperf -t TCP_RR to C
   - ping -f A
   - watch 'ip -s neigh show 192.168.200.0/24'

  On A, cause the bond to fail over.

  Observe that:

   - without the patches, B intermittently fails to notice the change in
  A's MAC address. This presents as the ping failing and not recovering,
  and the arp table showing the old mac address never timing out and
  never being replaced with a new mac address.

   - with the patches, the arp cache times out and B sends another mac
  probe and detects A's new address.

  It helps to use taskset to put ping and netperf on the same CPU, or
  use single-CPU vms.

  See [2] for more details.

  [References]
  [1] https://www.spinics.net/lists/linux-rdma/msg45907.html
  [2] Original report: https://www.mail-archive.com/netdev@vger.kernel.org/msg138762.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715812/+subscriptions



[Kernel-packages] [Bug 1715812] [NEW] Neighbour confirmation broken, breaks ARP cache aging

2017-09-08 Thread Daniel Axtens
Public bug reported:

[SRU Justification]

[Impact]
A host can lose access to another host whose MAC address changes if they have 
active connections to other hosts that share a route. The ARP cache does not 
time out as expected - instead the old MAC address is continuously reconfirmed.

[Fix]
Apply series [1], which changes the algorithm for neighbour confirmation.
That is, from upstream:
51ce8bd4d17a net: pending_confirm is not used anymore 
0dec879f636f net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP 
63fca65d0863 net: add confirm_neigh method to dst_ops 
c3a2e8370534 tcp: replace dst_confirm with sk_dst_confirm 
c86a773c7802 sctp: add dst_pending_confirm flag 
4ff0620354f2 net: add dst_pending_confirm flag to skbuff 
9b8805a32559 sock: add sk_dst_pending_confirm flag 

[Test case]
Create 3 real or virtual systems, all hooked up to a switch.
One system needs an active-backup bond with fail_over_mac=1 num_grat_arp=0.

Put all the systems in the same subnet, e.g. 192.168.200.0/24

Call the system with the bond A, and the other two systems B and C.

On B, run in 3 shells:
 - netperf -t TCP_RR to C
 - ping -f A
 - watch 'ip -s neigh show 192.168.200.0/24'

On A, cause the bond to fail over.

Observe that:

 - without the patches, B intermittently fails to notice the change in
A's MAC address. This presents as the ping failing and not recovering,
and the arp table showing the old mac address never timing out and never
being replaced with a new mac address.

 - with the patches, the arp cache times out and B sends another mac
probe and detects A's new address.

It helps to use taskset to put ping and netperf on the same CPU, or use
single-CPU vms.

See [2] for more details.
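
When watching the neighbour table on B, it can help to compare snapshots mechanically rather than by eye. The following is a rough sketch in Python that parses text in the shape printed by `ip -s neigh show` and reports whether the link-layer address for a given IP has changed. The sample lines, addresses and the helper names `parse_neigh` and `mac_changed` are illustrative assumptions, not output captured from the systems in this bug.

```python
from typing import Dict, Tuple

# Illustrative snapshots only; the addresses and MACs are made up.
SAMPLE_BEFORE = (
    "192.168.200.1 dev eth0 lladdr 52:54:00:aa:bb:01 ref 1 used 3/3/1 probes 1 REACHABLE\n"
    "192.168.200.3 dev eth0  FAILED\n"
)
SAMPLE_AFTER = (
    "192.168.200.1 dev eth0 lladdr 52:54:00:aa:bb:02 ref 1 used 0/0/0 probes 1 REACHABLE\n"
)


def parse_neigh(output: str) -> Dict[str, Tuple[str, str]]:
    """Map each neighbour IP to (lladdr, state)."""
    table: Dict[str, Tuple[str, str]] = {}
    for line in output.splitlines():
        tokens = line.split()
        if "lladdr" not in tokens:
            continue  # entries without a resolved MAC are skipped
        idx = tokens.index("lladdr")
        if idx + 1 >= len(tokens):
            continue  # malformed line, no address after the keyword
        table[tokens[0]] = (tokens[idx + 1], tokens[-1])
    return table


def mac_changed(before: Dict[str, Tuple[str, str]],
                after: Dict[str, Tuple[str, str]],
                ip: str) -> bool:
    """True if `ip` appears in both snapshots with different MACs."""
    return ip in before and ip in after and before[ip][0] != after[ip][0]
```

In the failing case described above, snapshots taken before and after the bond fails over would keep showing the old address for A, so `mac_changed` would stay False.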

[References]
[1] https://www.spinics.net/lists/linux-rdma/msg45907.html
[2] Original report: https://www.mail-archive.com/netdev@vger.kernel.org/msg138762.html

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715812

Title:
  Neighbour confirmation broken, breaks ARP cache aging

Status in linux package in Ubuntu:
  Confirmed


To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1715812/+subscriptions



[Kernel-packages] [Bug 1715519] [NEW] bnx2x_attn_int_deasserted3:4323 MC assert!

2017-09-06 Thread Daniel Axtens
Public bug reported:

(This bug provides a place to track the progress of this issue upstream
and then into Ubuntu.)

A ppc64le system runs as a guest under PowerVM. This guest has a bnx2x
card attached, and uses openvswitch to bridge an ibmveth interface for
traffic from other LPARs.

We see the following crash sometimes when running netperf:
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f2)]MC assert!
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:720(enP24p1s0f2)]XSTORM_ASSERT_LIST_INDEX 0x2
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:736(enP24p1s0f2)]XSTORM_ASSERT_INDEX 0x0 = 0x 0x25e42a7e 0x00462a38 0x00010052
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_mc_assert:750(enP24p1s0f2)]Chip Revision: everest3, FW Version: 7_13_1
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_attn_int_deasserted3:4329(enP24p1s0f2)]driver assert
May 10 17:16:32 tuk6r1phn2 kernel: bnx2x: [bnx2x_panic_dump:923(enP24p1s0f2)]begin crash dump -
... (dump of registers follows) ...

Subsequent debugging reveals that the packets causing the issue come
through the ibmveth interface - from the AIX LPAR. The veth protocol is
'special' - communication between LPARs on the same chassis can use very
large (64k) frames to reduce overhead. Normal networks cannot handle
such large packets, so traditionally, the VIOS partition would signal to
the AIX partitions that it was 'special', and AIX would send regular,
ethernet-sized packets to VIOS, which VIOS would then send out.

This signalling between VIOS and AIX is done in a way that is not
standards-compliant, and so was never made part of Linux. Instead, the
Linux driver has always understood large frames and passed them up the
network stack.

In some cases (e.g. with TCP), multiple TCP segments are coalesced into
one large packet. In Linux, this goes through the generic receive
offload code, using a similar mechanism to GSO. These segments can be
very large, which presents as a very large MSS (maximum segment size) or
gso_size.

Normally, the large packet is simply passed to whatever network
application on Linux is going to consume it, and everything is OK.

However, in this case, the packets go through Open vSwitch, and are then
passed to the bnx2x driver. The bnx2x driver/hardware supports TSO and
GSO, but with a restriction: the maximum segment size is limited to
around 9700 bytes. Normally this is more than adequate as jumbo frames
are limited to 9000 bytes. However, if a large packet with large (>9700
byte) TCP segments arrives through ibmveth, and is passed to bnx2x, the
hardware will panic.

Turning off TSO prevents the crash as the kernel resegments the data and
assembles the packets in software. This has a performance cost.
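
To make the size mismatch concrete, here is a small sketch in Python of the decision described above: compare a packet's segment size with the hardware limit, and fall back to splitting the payload in software when it is too large. The constant 9700 is only the approximate figure quoted in this report, and the names `HW_MAX_SEGMENT`, `hardware_can_segment` and `software_segments` are hypothetical. This illustrates the arithmetic only; it is not the logic of the upstream patch.

```python
from typing import List

# Approximate limit quoted in this report ("around 9700 bytes").
HW_MAX_SEGMENT = 9700


def hardware_can_segment(gso_size: int, limit: int = HW_MAX_SEGMENT) -> bool:
    """True if the segment size is within what the NIC is said to accept."""
    return gso_size <= limit


def software_segments(payload_len: int, mss: int) -> List[int]:
    """Split a payload of `payload_len` bytes into segment sizes of at
    most `mss` bytes, as a software resegmentation step would."""
    if mss <= 0 or payload_len < 0:
        raise ValueError("mss must be positive and payload_len non-negative")
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])
```

For example, a 65000-byte payload with a very large gso_size fails `hardware_can_segment`, and `software_segments(65000, 1448)` yields 44 full segments plus one 1288-byte remainder.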

Clearly at the very least, bnx2x should not crash in this case.

One patch to do this was sent upstream:
https://www.spinics.net/lists/netdev/msg452932.html

** Affects: linux (Ubuntu)
 Importance: Undecided
 Assignee: Daniel Axtens (daxtens)
 Status: Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1715519

Title:
  bnx2x_attn_int_deasserted3:4323 MC assert!

Status in linux package in Ubuntu:
  Confirmed


[Kernel-packages] [Bug 1714420] Re: kernel oops - kvm guest started at boot time

2017-09-04 Thread Daniel Axtens
** Description changed:

+ [SRU Justification]
+ 
+ [Impact]
+ System OOPSes shortly after boot when KVM guests are started.
+ 
+ [Fix]
+ Cherry-pick patch e47057151422a67ce08747176fa21cb3b526a2c9
+ 
+ [Testcase]
+ Tested at IBM - boot a machine with a KVM guest configured to start at boot.
+ Without this patch, observe OOPS; with this patch, observe no OOPS.
+ 
+ [Regression Potential]
+ Patch is contained in arch/powerpc, so regression potential is limited to that
+ arch. Patch accepted to kernel stable trees, suggesting others also believe it
+ to be of low risk.
+ 
+ [Original Report]
+ 
  [0.00] Linux version 4.4.0-93-generic (buildd@bos01-ppc64el-025)
  (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) )
  #116-Ubuntu SMP Fri Aug 11 16:30:16 UTC 2017 (Ubuntu
  4.4.0-93.116-generic 4.4.79)
  
  ...
  [  380.184554] KVM guest htab at c0799900 (order 29), LPID 2
  [  380.527576] Facility 'TM' unavailable, exception at 0xd0003aad7f10, 
MSR=90009033
  [  380.527717] Oops: Unexpected facility unavailable exception, sig: 6 [#2]
  [  380.527775] SMP NR_CPUS=2048 NUMA PowerNV
  [  380.527823] Modules linked in: vhost_net vhost macvtap macvlan xt_CHECKSUM 
iptable_mangle ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables 
ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user 
xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter 
overlay binfmt_misc bridge stp llc kvm_hv uio_pdrv_genirq uio leds_powernv 
ipmi_powernv ibmpowernv vmx_crypto powernv_rng ipmi_msghandler kvm_pr kvm 
autofs4 xfs btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 raid10 ses 
enclosure mlx4_en be2net lpfc vxlan mlx4_core scsi_transport_fc ip6_udp_tunnel 
udp_tunnel ipr
  [  380.528781] CPU: 24 PID: 4277 Comm: qemu-system-ppc Tainted: G  D  
   4.4.0-93-generic #116-Ubuntu
  [  380.528861] task: c3c389b0 ti: c01fb2428000 task.ti: 
c01fb2428000
  [  380.528929] NIP: d0003aad7f10 LR: d00037d52a14 CTR: 
d0003aad7e40
  [  380.528997] REGS: c01fb242b7b0 TRAP: 0f60   Tainted: G  D  
(4.4.0-93-generic)
  [  380.529076] MSR: 90009033   CR: 22024848  
XER: 
  [  380.529247] CFAR: d0003aad7ea4 SOFTE: 1
-GPR00: d00037d52a14 c01fb242ba30 d0003aaec018 
c01fdbf6
-GPR04: c01f8580 c01fb242bbc0  

-GPR08: 0001 c3c389b0 0001 
d00037d578f8
-GPR12: d0003aad7e40 cfb4e400  
001f
-GPR16: 3fff7206 0080 3fff892c4390 
3fff7285f200
-GPR20: 010009988430 0100099affd0 3fff7285eb60 
100c1ff0
-GPR24: 3bcf4e10 3fff72040028  
c01fdbf6
-GPR28:  c01f8580 c01fdbf6 
c01f8580
+    GPR00: d00037d52a14 c01fb242ba30 d0003aaec018 
c01fdbf6
+    GPR04: c01f8580 c01fb242bbc0  

+    GPR08: 0001 c3c389b0 0001 
d00037d578f8
+    GPR12: d0003aad7e40 cfb4e400  
001f
+    GPR16: 3fff7206 0080 3fff892c4390 
3fff7285f200
+    GPR20: 010009988430 0100099affd0 3fff7285eb60 
100c1ff0
+    GPR24: 3bcf4e10 3fff72040028  
c01fdbf6
+    GPR28:  c01f8580 c01fdbf6 
c01f8580
  [  380.530119] NIP [d0003aad7f10] kvmppc_vcpu_run_hv+0xd0/0xff0 [kvm_hv]
  [  380.530188] LR [d00037d52a14] kvmppc_vcpu_run+0x44/0x60 [kvm]
  [  380.530245] Call Trace:
  [  380.530270] [c01fb242ba30] [c01fb242bab0] 0xc01fb242bab0 
(unreliable)
  [  380.530353] [c01fb242bb70] [d00037d52a14] 
kvmppc_vcpu_run+0x44/0x60 [kvm]
  [  380.530436] [c01fb242bba0] [d00037d4f674] 
kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
  [  380.530519] [c01fb242bbe0] [d00037d43918] 
kvm_vcpu_ioctl+0x528/0x7b0 [kvm]
  [  380.530602] [c01fb242bd40] [c02fff60] do_vfs_ioctl+0x480/0x7d0
  [  380.530671] [c01fb242bde0] [c0300384] SyS_ioctl+0xd4/0xf0
  [  380.530742] [c01fb242be30] [c0009204] system_call+0x38/0xb4
  [  380.530837] Instruction dump:
  [  380.530904] e92d02a0 e9290a50 e9290108 792a07e3 41820058 e92d02a0 e9290a50 
e9290108
  [  380.531126] 7927e8a4 78e71f87 40820ed8 e92d02a0 <7d4022a6> f9490ee8 
e92d02a0 7d4122a6
  [  380.531350] ---[ end trace 8f9b3b82f9a07d76 ]---
  
- 
- Needs kernel 

[Kernel-packages] [Bug 1714420] Re: kernel oops - kvm guest started at boot time

2017-09-01 Thread Daniel Axtens
** Changed in: linux (Ubuntu)
 Assignee: (unassigned) => Daniel Axtens (daxtens)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1714420

Title:
  kernel oops -  kvm guest started at boot time

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  [0.00] Linux version 4.4.0-93-generic
  (buildd@bos01-ppc64el-025) (gcc version 5.4.0 20160609 (Ubuntu/IBM
  5.4.0-6ubuntu1~16.04.4) ) #116-Ubuntu SMP Fri Aug 11 16:30:16 UTC 2017
  (Ubuntu 4.4.0-93.116-generic 4.4.79)

  ...
  [  380.184554] KVM guest htab at c0799900 (order 29), LPID 2
  [  380.527576] Facility 'TM' unavailable, exception at 0xd0003aad7f10, 
MSR=90009033
  [  380.527717] Oops: Unexpected facility unavailable exception, sig: 6 [#2]
  [  380.527775] SMP NR_CPUS=2048 NUMA PowerNV
  [  380.527823] Modules linked in: vhost_net vhost macvtap macvlan xt_CHECKSUM 
iptable_mangle ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables 
ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user 
xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype 
iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter 
overlay binfmt_misc bridge stp llc kvm_hv uio_pdrv_genirq uio leds_powernv 
ipmi_powernv ibmpowernv vmx_crypto powernv_rng ipmi_msghandler kvm_pr kvm 
autofs4 xfs btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor 
async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 raid10 ses 
enclosure mlx4_en be2net lpfc vxlan mlx4_core scsi_transport_fc ip6_udp_tunnel 
udp_tunnel ipr
  [  380.528781] CPU: 24 PID: 4277 Comm: qemu-system-ppc Tainted: G  D  
   4.4.0-93-generic #116-Ubuntu
  [  380.528861] task: c3c389b0 ti: c01fb2428000 task.ti: 
c01fb2428000
  [  380.528929] NIP: d0003aad7f10 LR: d00037d52a14 CTR: 
d0003aad7e40
  [  380.528997] REGS: c01fb242b7b0 TRAP: 0f60   Tainted: G  D  
(4.4.0-93-generic)
  [  380.529076] MSR: 90009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 22024848  
XER: 
  [  380.529247] CFAR: d0003aad7ea4 SOFTE: 1
 GPR00: d00037d52a14 c01fb242ba30 d0003aaec018 
c01fdbf6
 GPR04: c01f8580 c01fb242bbc0  

 GPR08: 0001 c3c389b0 0001 
d00037d578f8
 GPR12: d0003aad7e40 cfb4e400  
001f
 GPR16: 3fff7206 0080 3fff892c4390 
3fff7285f200
 GPR20: 010009988430 0100099affd0 3fff7285eb60 
100c1ff0
 GPR24: 3bcf4e10 3fff72040028  
c01fdbf6
 GPR28:  c01f8580 c01fdbf6 
c01f8580
  [  380.530119] NIP [d0003aad7f10] kvmppc_vcpu_run_hv+0xd0/0xff0 [kvm_hv]
  [  380.530188] LR [d00037d52a14] kvmppc_vcpu_run+0x44/0x60 [kvm]
  [  380.530245] Call Trace:
  [  380.530270] [c01fb242ba30] [c01fb242bab0] 0xc01fb242bab0 
(unreliable)
  [  380.530353] [c01fb242bb70] [d00037d52a14] 
kvmppc_vcpu_run+0x44/0x60 [kvm]
  [  380.530436] [c01fb242bba0] [d00037d4f674] 
kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
  [  380.530519] [c01fb242bbe0] [d00037d43918] 
kvm_vcpu_ioctl+0x528/0x7b0 [kvm]
  [  380.530602] [c01fb242bd40] [c02fff60] do_vfs_ioctl+0x480/0x7d0
  [  380.530671] [c01fb242bde0] [c0300384] SyS_ioctl+0xd4/0xf0
  [  380.530742] [c01fb242be30] [c0009204] system_call+0x38/0xb4
  [  380.530837] Instruction dump:
  [  380.530904] e92d02a0 e9290a50 e9290108 792a07e3 41820058 e92d02a0 e9290a50 
e9290108
  [  380.531126] 7927e8a4 78e71f87 40820ed8 e92d02a0 <7d4022a6> f9490ee8 
e92d02a0 7d4122a6
  [  380.531350] ---[ end trace 8f9b3b82f9a07d76 ]---

  
  Needs kernel patch e47057151422a67ce08747176fa21cb3b526a2c9 according to Cyril

  ProblemType: Bug
  DistroRelease: Ubuntu 16.04
  Package: linux-image-4.4.0-93-generic 4.4.0-93.116
  ProcVersionSignature: Ubuntu 4.4.0-93.116-generic 4.4.79
  Uname: Linux 4.4.0-93-generic ppc64le
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 Sep  1 15:03 seq
   crw-rw 1 root audio 116, 33 Sep  1 15:03 timer
  AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
  ApportVersion: 2.20.1-0ubuntu2.10
  Architecture: ppc64el
  ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', 
'/dev/snd/timer'] failed with exit code 1:
  Date: Fri Sep  1 15:34:14 2017
  IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
  JournalErrors:
   Error: command ['journalctl', '-b', '--priority=warning', '--lines=1000'] 
failed with exit code 1: Hint: You are curre

[Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module

2017-08-20 Thread Daniel Axtens
** Description changed:

  The Dell PERC H740 series RAID controllers, codename "Harpoon", are not
  supported in standard Ubuntu kernels.
  
- The kernel patch to support these new devices is:
+ There is a series of kernel patches required to support these:
+ http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1307314.html
  
- https://github.com/torvalds/linux/commit/45f4f2eb3da3cbff02c3d77c784c81320c733056
+ There is also a relevant follow-up series:
+ https://www.spinics.net/lists/linux-scsi/msg104667.html especially patches 1 and 12.
  
  The relevant PCI IDs from the PCI database
  (http://pciids.sourceforge.net/v2.2/pci.ids) are:
  
- 0016 MegaRAID Tri-Mode SAS3508 
- 1028 1fc9 PERC H840 Adapter 
- 1028 1fcb PERC H740P Adapter 
- 1028 1fcd PERC H740P Mini 
- 1028 1fcf PERC H740P Mini 
+ 0016 MegaRAID Tri-Mode SAS3508
+ 1028 1fc9 PERC H840 Adapter
+ 1028 1fcb PERC H740P Adapter
+ 1028 1fcd PERC H740P Mini
+ 1028 1fcf PERC H740P Mini
  
- They should be supported from Trusty onwards. The upstream commit is
- going in for 4.11, so this will need to be backported to
- v3.13/v4.4/v4.8/v4.10.
+ They should be supported from Xenial onwards. The upstream commit is
+ going in for 4.11, so this will need to be backported to v4.4 and v4.10.
  
  I am working on SRU patches for this.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1683587

Title:
  LSI Harpoon support in megaraid_sas module

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  The Dell PERC H740 series RAID controllers, codename "Harpoon", are
  not supported in standard Ubuntu kernels.

  There is a series of kernel patches required to support these:
  http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1307314.html

  There is also a relevant follow-up series:
  https://www.spinics.net/lists/linux-scsi/msg104667.html especially
  patches 1 and 12.

  The relevant PCI IDs from the PCI database
  (http://pciids.sourceforge.net/v2.2/pci.ids) are:

  0016 MegaRAID Tri-Mode SAS3508
  1028 1fc9 PERC H840 Adapter
  1028 1fcb PERC H740P Adapter
  1028 1fcd PERC H740P Mini
  1028 1fcf PERC H740P Mini

  They should be supported from Xenial onwards. The upstream commit is
  going in for 4.11, so this will need to be backported to v4.4 and
  v4.10.

  I am working on SRU patches for this.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions



[Kernel-packages] [Bug 1687512] Re: Kernel panics on Xenial when using cgroups and strict CFS limits

2017-08-20 Thread Daniel Axtens
** Changed in: linux (Ubuntu)
   Status: Triaged => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1687512

Title:
  Kernel panics on Xenial when using cgroups and strict CFS limits

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  Fix Released

Bug description:
  SRU Justification
  -----------------

  [Impact]
  Apache Mesos and Kubernetes workloads on Xenial cause a panic
  (NULL pointer dereference) in the completely fair scheduler.

  These panics are in pick_next_entity and include pick_next_task_fair
  in the call stack.

  [Fix]
  Cherry-picking both
  754bd598be9bbc953bc709a9e8ed7f3188bfb9d7
  (http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz)
  and
  094f469172e00d6ab0a3130b0e01c83b3cf3a98d
  (http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz)
  fix the crash.
  They appear to be intended as a series - they were posted to LKML at
  the same time.

  [Testcase]
  The fix has been validated by the user who reported the bug.

  Bug description
  ---------------

  We see a number of kernel panics on servers running Apache Mesos using
  cgroups with small (0.1-0.2) cpu limits.
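
For context on what a 0.1-0.2 CPU limit means to the scheduler: CFS bandwidth control expresses such a limit as a runtime quota per period. The sketch below, in Python, assumes a 100 ms period (the kernel default for `cpu.cfs_period_us`); the period actually used on the affected hosts is not stated in this report, and the function name `cfs_quota_us` is made up for illustration.

```python
def cfs_quota_us(cpu_limit: float, period_us: int = 100_000) -> int:
    """Convert a fractional CPU limit into a CFS quota in microseconds
    for the given period (assumed 100 ms here)."""
    if cpu_limit <= 0 or period_us <= 0:
        raise ValueError("cpu_limit and period_us must be positive")
    return round(cpu_limit * period_us)
```

Under that assumption, a 0.1 CPU limit allows 10000 us of runtime in every 100000 us period, after which the group is throttled until the next period begins.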

  These all appear as NULL pointer dereferences in and around
  pick_next_entity and pick_next_task_fair, for example:

  [24334.493331] BUG: unable to handle kernel NULL pointer dereference at 0050
  [24334.501611] IP: [] pick_next_entity+0x7f/0x160
  [24334.507868] PGD 3eacfa067 PUD 3eacfb067 PMD 0
  [24334.512806] Oops:  [#1] SMP
  [24334.516420] Modules linked in: ipvlan xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs tcp_diag inet_diag nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache dm_crypt ppdev input_leds mac_hid i2c_piix4 8250_fintek parport_pc pvpanic parport serio_raw crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse virtio_scsi
  [24334.576359] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.4.0-66-generic #87~14.04.1-Ubuntu
  [24334.584748] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  [24334.594188] task: 8803ee671c00 ti: 8803ee67c000 task.ti: 8803ee67c000
  [24334.601799] RIP: 0010:[] [] pick_next_entity+0x7f/0x160
  [24334.610490] RSP: 0018:8803ee67fdd8 EFLAGS: 00010086
  [24334.615924] RAX: 8803ebed4c00 RBX: 880036529800 RCX:
  [24334.623190] RDX: 0225341f RSI:  RDI:
  [24334.630479] RBP: 8803ee67fe00 R08: 0004 R09:
  [24334.637758] R10: 8803e7ed7600 R11: 0001 R12:
  [24334.645153] R13:  R14: 0009067729c4 R15: 8803ee672178
  [24334.652512] FS: () GS:8803ffd0() knlGS:
  [24334.660721] CS: 0010 DS:  ES:  CR0: 80050033
  [24334.666587] CR2: 0050 CR3: 0003eacf9000 CR4: 001406e0
  [24334.673851] Stack:
  [24334.675980] 8803ffd16e00 8803ffd16e00 8803e855a200 880036529800
  [24334.683995] 0002 8803ee67fe68 810b98a6 8803ffd16e70
  [24334.692024] 00016e00 8803e7ed7600 8803ee671c00
  [24334.700172] Call Trace:
  [24334.702750] [] pick_next_task_fair+0x66/0x4b0
  [24334.708886] [] __schedule+0x7f4/0x980
  [24334.714349] [] schedule+0x35/0x80
  [24334.719445] [] schedule_preempt_disabled+0xe/0x10
  [24334.725962] [] cpu_startup_entry+0x18a/0x350
  [24334.732012] [] start_secondary+0x149/0x170
  [24334.737895] Code: 8b 70 50 4d 2b 74 24 50 4d 85 f6 7e 59 4c 89 e7 e8 67 ff ff ff 49 39 c6 7f 04 4c 8b 6b 48 48 8b 43 40 48 85 c0 74 1f 4c 8b 70 50 <4d> 2b 74 24 50 4d 85 f6 7e 2c 4c 89 e7 e8 3f ff ff ff 49 39 c6
  [24334.765124] RIP [] pick_next_entity+0x7f/0x160
  [24334.771473] RSP 
  [24334.775077] CR2: 0050
  [24334.779121] ---[ end trace 05d941efb97b7bae ]---

  and

  [155852.028575] BUG: unable to handle kernel NULL pointer dereference at 0050
  [155852.036931] IP: [] pick_next_entity+0x7f/0x160
  [155852.043491] PGD 3ebae8067 PUD 3ebae9067 PMD 0
  [155852.048550] Oops:  [#1] SMP
  [155852.052437] Modules linked in: ipvlan veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache dm_crypt ppdev input_leds mac_hid i2c_piix4 parport_pc 

[Kernel-packages] [Bug 1699627] Re: XDP eBPF programs fail to verify on Zesty ppc64el

2017-08-20 Thread Daniel Axtens
** Changed in: linux (Ubuntu)
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1699627

Title:
  XDP eBPF programs fail to verify on Zesty ppc64el

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Zesty:
  Fix Released

Bug description:
  SRU Justification

  [Impact]
  Some XDP examples such as https://github.com/netoptimizer/prototype-kernel
  fail on ppc64el at the eBPF verification stage.

  [Fix]
  Verification fails because CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set
  on ppc64el. It is not set because the kernel is being compiled for CPU_POWER7
  instead of CPU_POWER8, and we don't have efficient unaligned access on POWER7.

  Swap to building for POWER8.

  As a bonus, this should make everything a little bit faster.

  [Regression Potential]

   - IBM never released any officially supported Power7 LE systems - LE
  was only ever supported on Power8. Therefore this should not break any
  systems.

   - Regression potential is also limited to one arch.

   - Artful-next already has this fix and nothing bad has happened
  there.

  [Test]
  Create a P8 VM with a virtio network card and 2 vcpus.

  The VM needs to have some network features turned off, and enough
  queues. The following virsh snippet in the  section should
  suffice:

 
   
   
 

  Then:
  - apt install clang llvm
  - get the prototype-kernel repo
  - go to the kernel/samples/bpf directory
  - make
  - sudo mount -t bpf bpf /sys/fs/bpf/
  - sudo ./xdp_ddos01_blacklist --dev enp0s1

  Observe that without this patch, we get a long debug splat ending
  with:

  32: (61) r1 = *(u32 *)(r8 +12)
  misaligned packet access off 0+18+12 size 4
  load_bpf_file: Permission denied

  With this patch we don't get that error and the program
  successfully verifies and loads. (It still doesn't run - there is
  other breakage I'm chasing down - but it definitely gets further.)
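
  The rejected load can be checked by hand. A rough sketch, assuming
  the three numbers in the "misaligned packet access" message are
  offsets that the verifier adds together before comparing against the
  access size. This is a reading of the log line, not a copy of the
  kernel code, and the function and parameter names are illustrative
  only:

```python
def packet_access_is_aligned(ip_align: int, reg_off: int, insn_off: int, size: int) -> bool:
    """Return True if an access of `size` bytes at the summed offset is naturally aligned."""
    return (ip_align + reg_off + insn_off) % size == 0


# Values taken from the message "off 0+18+12 size 4".
total = 0 + 18 + 12
print(total, total % 4)                         # 30 is not a multiple of 4
print(packet_access_is_aligned(0, 18, 12, 4))   # the 4-byte load is misaligned
```

  A 4-byte load at offset 30 is only 2-byte aligned, which is why the
  program is rejected when the kernel is built without efficient
  unaligned access.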

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1699627/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


Re: [Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module

2017-08-20 Thread Daniel Axtens
Hi Edward,

I am glad to hear the modified ISO works. I have backported the
patches and am in discussions with the kernel team about including
them in the default kernel.

One of our issues is that the patch set is quite large, so we're
worried about regressions. Do you have any older H7** RAID
controllers? Are you in a position to help with regression testing?

Regards,
Daniel

On Mon, Aug 21, 2017 at 8:26 AM, Edward P <1683...@bugs.launchpad.net> wrote:
> What I did to get it working for now is creating a modified ISO with
> kernel version v4.11.12 that has support for this RAID controller. Works
> fine, so hope this patch is applied to the default kernel that is
> shipped with Ubuntu soon.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1683587
>
> Title:
>   LSI Harpoon support in megaraid_sas module
>
> Status in linux package in Ubuntu:
>   Confirmed
>
> Bug description:
>   The Dell PERC H740 series RAID controllers, codename "Harpoon", are
>   not supported in standard Ubuntu kernels.
>
>   The kernel patch to support these new devices is:
>
>   https://github.com/torvalds/linux/commit/45f4f2eb3da3cbff02c3d77c784c81320c733056
>
>   The relevant PCI IDs from the PCI database
>   (http://pciids.sourceforge.net/v2.2/pci.ids) are:
>
>   0016 MegaRAID Tri-Mode SAS3508
>   1028 1fc9 PERC H840 Adapter
>   1028 1fcb PERC H740P Adapter
>   1028 1fcd PERC H740P Mini
>   1028 1fcf PERC H740P Mini
>
>   They should be supported from Trusty onwards. The upstream commit is
>   going in for 4.11, so this will need to be backported to
>   v3.13/v4.4/v4.8/v4.10.
>
>   I am working on SRU patches for this.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1683587

Title:
  LSI Harpoon support in megaraid_sas module

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  The Dell PERC H740 series RAID controllers, codename "Harpoon", are
  not supported in standard Ubuntu kernels.

  The kernel patch to support these new devices is:

  https://github.com/torvalds/linux/commit/45f4f2eb3da3cbff02c3d77c784c81320c733056

  The relevant PCI IDs from the PCI database
  (http://pciids.sourceforge.net/v2.2/pci.ids) are:

  0016 MegaRAID Tri-Mode SAS3508 
  1028 1fc9 PERC H840 Adapter 
  1028 1fcb PERC H740P Adapter 
  1028 1fcd PERC H740P Mini 
  1028 1fcf PERC H740P Mini 
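
  The listing above can be read mechanically. A small sketch, assuming
  the usual pci.ids layout in which the first line is a device entry
  ("device-id name") and the following lines are subsystem entries
  ("subvendor-id subdevice-id name"). The parser and its names are
  hypothetical and only handle a fragment shaped like this one:

```python
def parse_subsystems(lines):
    """Parse a pci.ids-style fragment: one device line, then subsystem lines."""
    device = None
    subsystems = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if device is None:
            # First non-empty line: "<device id> <device name>"
            device = (fields[0], " ".join(fields[1:]))
        else:
            # Remaining lines: "<subvendor id> <subdevice id> <subsystem name>"
            subsystems.append((fields[0], fields[1], " ".join(fields[2:])))
    return device, subsystems


LISTING = """\
0016 MegaRAID Tri-Mode SAS3508
1028 1fc9 PERC H840 Adapter
1028 1fcb PERC H740P Adapter
1028 1fcd PERC H740P Mini
1028 1fcf PERC H740P Mini
"""

device, subsystems = parse_subsystems(LISTING.splitlines())
for subvendor, subdevice, name in subsystems:
    print(f"device {device[0]} subsystem {subvendor}:{subdevice} -> {name}")
```

  Read this way, all four PERC entries are subsystem IDs under the
  same device ID, 0016.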

  They should be supported from Trusty onwards. The upstream commit is
  going in for 4.11, so this will need to be backported to
  v3.13/v4.4/v4.8/v4.10.

  I am working on SRU patches for this.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions



[Kernel-packages] [Bug 1701297] Re: NTP reload failure (unable to read library) on overlayfs

2017-08-07 Thread Daniel Axtens
Hi Marzog,

What commit has been committed to Linux? I cannot find it.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1701297

Title:
  NTP reload failure (unable to read library) on overlayfs

Status in cloud-init:
  Won't Fix
Status in apparmor package in Ubuntu:
  Confirmed
Status in cloud-init package in Ubuntu:
  Incomplete
Status in linux package in Ubuntu:
  Fix Committed

Bug description:
  After update [1] of cloud-init in Ubuntu (which landed in
  xenial-updates on 2017-06-27), NTP reloads fail.

  [1] https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.04.1

  In MAAS scenarios, this is causing the machine to fail to deploy.

  Related bugs:
   * bug 1645644: cloud-init ntp not using expected servers

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1701297/+subscriptions



[Kernel-packages] [Bug 1683587] Re: LSI Harpoon support in megaraid_sas module

2017-08-03 Thread Daniel Axtens
Hi,

We currently have a user testing the patches for Xenial onwards.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1683587

Title:
  LSI Harpoon support in megaraid_sas module

Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1683587/+subscriptions



[Kernel-packages] [Bug 1698706] Re: Quirk for non-compliant PCI bridge on HiSilicon D05 board

2017-07-21 Thread Daniel Axtens
** Tags added: kernel-da-key

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1698706

Title:
  Quirk for non-compliant PCI bridge on HiSilicon D05 board

Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Zesty:
  Fix Committed

Bug description:
  SRU Justification

  [Impact]
  Xorg autodetection does not work on HiSilicon D05 boards.

  [Fix]
  The HiSilicon D05 board has some PCI bridges (PCI ID 19e5:1610) that are
  not spec-compliant: they do not set the VGA Enable bit when a VGA card is
  behind the bridge. This stops vgaarb setting the device as a boot vga
  device, breaking Xorg auto-detection. [0]

  Despite this, the hibmc VGA card (PCI ID 19e5:1711) is known to work
  when behind these bridges.

  Provide a quirk so that this combination of bridge and card works.

  [Testcase]
  On an affected board, run:
  # find /sys/devices -name boot_vga -exec cat \{\} \;

  This should print 0 without this patch and 1 with this patch.
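
  The same check can be scripted when the result needs to be compared
  across boards. A minimal sketch that mirrors the find command above;
  the function name is illustrative, and the default root of
  /sys/devices is taken from that command:

```python
import os


def read_boot_vga(root="/sys/devices"):
    """Map each boot_vga file found under `root` to its integer contents."""
    results = {}
    # os.walk does not follow symlinks by default, like find without -L.
    for dirpath, _dirnames, filenames in os.walk(root):
        if "boot_vga" in filenames:
            path = os.path.join(dirpath, "boot_vga")
            try:
                with open(path) as f:
                    results[path] = int(f.read().strip())
            except (OSError, ValueError):
                # Unreadable or unexpected contents: skip rather than guess.
                continue
    return results
```

  Calling read_boot_vga() with no arguments scans /sys/devices. On an
  affected board the value for the hibmc card should be 0 without the
  patch and 1 with it.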

  [Regression Potential]
  Overriding the VGA arbiter carries a risk that adding further VGA cards
  to the board may misbehave. The fixup specifically tests for the bridge
  and card on the board, so regressions should be limited to that
  combination of bridge and card.

  [Notes]
  HiSilicon is hoping to have the 16.04.3 HWE kernel support their board,
  hence the submission of this patch before it has been accepted upstream.
  The patch has been submitted upstream and I will continue to work with
  upstream to land it. [1]

  [0] https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991 - this
  bug tracked debugging of a segfault and then this issue. Comments 25
  (https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1691991/comments/25)
  and 31 onwards detail this issue.

  [1] https://patchwork.ozlabs.org/patch/778054/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1698706/+subscriptions



[Kernel-packages] [Bug 1699627] Re: XDP eBPF programs fail to verify on Zesty ppc64el

2017-07-14 Thread Daniel Axtens
Also verified by an IBMer on a real P8.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1699627

Title:
  XDP eBPF programs fail to verify on Zesty ppc64el

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Zesty:
  Fix Committed

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1699627/+subscriptions



[Kernel-packages] [Bug 1701297] Re: NTP reload failure (unable to read library) on overlayfs

2017-07-06 Thread Daniel Axtens
Tyler - thanks for that.

John - this is coming up in some internal support team escalations so
I'm going to have a look at the kernel changes myself and will let you
know if I find anything. I'd be keen to sync up if you have any leads.

Regards,
Daniel

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1701297

Title:
  NTP reload failure (unable to read library) on overlayfs

Status in cloud-init:
  Incomplete
Status in apparmor package in Ubuntu:
  Confirmed
Status in cloud-init package in Ubuntu:
  Incomplete
Status in linux package in Ubuntu:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/1701297/+subscriptions


