[Kernel-packages] [Bug 1837833] Re: EBS Volumes get stuck detaching from AWS instances
Hi test0830,

This particular issue was fixed by AWS on their backend; it wasn't an issue with Ubuntu at all. If you are having similar issues, you should open a new bug or a support ticket with AWS.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1837833

Title:
  EBS Volumes get stuck detaching from AWS instances

Status in linux package in Ubuntu:
  Fix Released

Bug description:
  On AWS, it is possible to get an EBS volume stuck in the detaching state, where it is not present in lsblk or lspci, but the cloud console says the volume is detaching, and pci / nvme errors are output to dmesg every 180 seconds.

  To reproduce reliably:

  1) Start an AWS instance. Can be any type, nitro or regular. I have successfully reproduced on m5.large and t3.small. Add an extra volume during creation, of any size.

  2) Connect to the instance. lsblk and lspci show the nvme device there. Detach the volume from the web console. It will detach successfully. You can attach and then detach the volume again, and it will also work successfully.

  3) With the volume detached, reboot the instance. I do "sudo reboot".

  4) When the instance comes back up and you have logged in, attach the volume. dmesg will have (normal):

  kernel: [ 67] pci :00:1f.0: [1d0f:8061] type 00 class 0x010802
  kernel: [ 67] pci :00:1f.0: reg 0x10: [mem 0x-0x3fff]
  kernel: [ 67] pci :00:1f.0: BAR 0: assigned [mem 0x8000-0x80003fff]
  kernel: [ 67] nvme nvme1: pci function :00:1f.0
  kernel: [ 67] nvme :00:1f.0: enabling device ( -> 0002)
  kernel: [ 67] ACPI: PCI Interrupt Link [LNKC] enabled at IRQ 10

  5) Detach the volume from the web console. If you keep refreshing the volume view, you will see the volume in the detaching state and the volume is still in use. The device will be missing from lsblk and lspci.

  dmesg will print these messages every 180 seconds:

  4.4 -> 4.15:

  kernel: [ 603] pci :00:1f.0: [1d0f:8061] type 00 class 0x010802
  kernel: [ 603] pci :00:1f.0: reg 0x10: [mem 0x8000-0x80003fff]
  kernel: [ 603] pci :00:1f.0: BAR 0: assigned [mem 0x8000-0x80003fff]
  kernel: [ 603] nvme nvme1: pci function :00:1f.0
  kernel: [ 603] nvme nvme1: failed to mark controller live
  kernel: [ 603] nvme nvme1: Removing after probe failure status: 0

  Latest mainline kernel:

  kernel: [ 243] pci :00:1f.0: [1d0f:8061] type 00 class 0x010802
  kernel: [ 243] pci :00:1f.0: reg 0x10: [mem 0xc000-0xc0003fff]
  kernel: [ 243] pci :00:1f.0: BAR 0: assigned [mem 0xc000-0xc0003fff]
  kernel: [ 243] nvme nvme1: pci function :00:1f.0
  kernel: [ 244] nvme nvme1: failed to mark controller CONNECTING
  kernel: [ 244] nvme nvme1: Removing after probe failure status: 0

  The volume is now stuck detaching and the above will continue until you forcefully detach the volume.

  This seems to affect all distributions and all kernels. I have tested xenial, bionic, bionic + hwe, bionic + eoan, bionic + mainline 5.2.2, rhel8 and Amazon Linux 2. All are affected with the same symptoms.

  I tried to trace NVMe on 5.2, but without much success in finding any problem:
  https://paste.ubuntu.com/p/c6rmDpvHJk/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1837833/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
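The bug description identifies a stuck detach by the "Removing after probe failure" line that recurs in dmesg. Scanning a kernel log for that line could be sketched as below. This is only an illustration: the function name find_probe_failures and the regular expression are assumptions built from the log excerpts quoted in this thread, not a tool used by the reporter.

```python
import re

# Pattern derived from the dmesg excerpts quoted in the bug description.
# The controller name (e.g. nvme1) and the status code are captured.
PROBE_FAILURE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): Removing after probe failure status: (?P<status>-?\d+)"
)


def find_probe_failures(lines):
    """Return a (controller, status) tuple for every probe-failure line."""
    hits = []
    for line in lines:
        match = PROBE_FAILURE.search(line)
        if match:
            hits.append((match.group("ctrl"), int(match.group("status"))))
    return hits
```

Feeding it the lines of the kernel log (for example the output of dmesg split on newlines) and seeing the same controller reported repeatedly would match the symptom described above.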
[Kernel-packages] [Bug 1837833] Re: EBS Volumes get stuck detaching from AWS instances
Hello @mruffell,

How can I find the patch for this problem?

Thanks
[Kernel-packages] [Bug 1837833] Re: EBS Volumes get stuck detaching from AWS instances
I have received confirmation from AWS that the fix for this issue has been deployed to all commercial regions.

I have tested multiple VMs over the last few days, and detaching EBS volumes is working without problems.

This issue is now resolved.

** Changed in: linux (Ubuntu)
       Status: Incomplete => Fix Released
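A check like the one described in this message (detach a volume and confirm it really leaves the detaching state) could be scripted roughly as follows. This is a sketch under assumptions: ec2 is expected to behave like a boto3 EC2 client (detach_volume and describe_volumes are boto3 calls), while detach_and_wait, the timeout values and the injected sleep function are hypothetical names chosen for illustration.

```python
import time


def detach_and_wait(ec2, volume_id, timeout=300.0, poll=5.0,
                    force=False, sleep=time.sleep):
    """Detach a volume and poll until EC2 reports no attachments.

    `ec2` is assumed to behave like a boto3 EC2 client.
    Returns True if the volume detached within `timeout` seconds,
    False if it still shows an attachment (e.g. stuck "detaching").
    """
    ec2.detach_volume(VolumeId=volume_id, Force=force)
    waited = 0.0
    while waited <= timeout:
        volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
        # A fully detached volume has an empty Attachments list.
        if not volume.get("Attachments"):
            return True
        sleep(poll)
        waited += poll
    return False
```

A False result corresponds to the stuck state in the bug description; a caller could then retry with force=True, which matches the forced detach mentioned there.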
[Kernel-packages] [Bug 1837833] Re: EBS Volumes get stuck detaching from AWS instances
Hello @johannes-wuerbach,

Thanks for the suggestion. I had a look at the patch you suggested:

commit 2181e455612a8db2761eabbf126640552a451e96
Author: Anton Eidelman
Date: Thu Jun 20 08:48:10 2019 +0200
subject: nvme: fix possible io failures when removing multipathed ns

The patch was backported to 5.2.3, 5.1.20 and 4.19.61, so I went and tested it.

I did a quick test by booting mainline 5.2.2, reproduced the issue, and then installed 5.2.3. I was able to reproduce the issue again, suggesting that this commit does not fix the issue. I then tried the latest 5.2.8, and also reproduced the issue.

Just to be sure, I backported the commit to 4.15.0-58-generic and tested again. You can find the test kernel here:

https://launchpad.net/~mruffell/+archive/ubuntu/sf234512-test

I was also able to reproduce the issue on that kernel, and I am confident that this commit does nothing for this particular problem.

Thanks for the suggestion! But in this particular case that commit is not the answer.

I have opened a ticket with AWS, and they have confirmed that they know about the problem and have developed a fix for their cloud platform. It is currently being deployed, and deployment will take a few weeks to complete.

You might have been lucky and received a patched cloud machine at the same time 4.19.61 came out, or it fixed another unrelated issue that had the same symptoms.

Status in linux package in Ubuntu:
  Incomplete
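The message above lists the stable releases that received the backport (5.2.3, 5.1.20 and 4.19.61). Deciding whether a given upstream kernel version should already carry it could be sketched as below. The table and the function name carries_backport are assumptions for illustration; the check only knows the three series named in the message and returns None for anything else, including distribution kernels such as 4.15.0-58-generic, whose version numbers do not map onto upstream stable releases.

```python
import re

# First stable release per series that carries the backport,
# as stated in the message above.
BACKPORTED_IN = {(5, 2): 3, (5, 1): 20, (4, 19): 61}


def carries_backport(release):
    """Return True/False for a listed stable series, None if unknown."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    first_fixed = BACKPORTED_IN.get((major, minor))
    if first_fixed is None:
        # Series not mentioned in the message: make no claim.
        return None
    return patch >= first_fixed
```

A version check like this only narrows down which kernels to test; as the message shows, the issue still reproduced on kernels that carry the commit.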
[Kernel-packages] [Bug 1837833] Re: EBS Volumes get stuck detaching from AWS instances
@mruffell we saw the same issue on our AWS machines, but a kernel version including
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.19.y&id=f0c83dd15ee1e89f73523cb82da9205d204cf440
(e.g. 4.19.61) fixed it for us.

Status in linux package in Ubuntu:
  Incomplete
[Kernel-packages] [Bug 1837833] Re: EBS Volumes get stuck detaching from AWS instances
** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

Status in linux package in Ubuntu:
  Incomplete