Public bug reported:
[Impact]
With the latest focal/linux (5.4.0-101.114) and impish/linux (5.13.0-31.34)
kernels built for SRU cycle 2022.02.21, some AWS instances fail to boot. The
instance types mostly affected are c4.large, c3.xlarge and x1e.xlarge, although
not every instance deployed on those types fails. c4.large is hit hardest,
failing in about 80-90% of deployments.
The failure was traced to the network interface not coming up. The following
console log snippets from 5.4.0-101-generic on a c4.large give some hints about
what is going on:
[...]
[ 3.990368] unchecked MSR access error: RDMSR from 0xc90 at rIP: 0xffffffff8ea733c8 (native_read_msr+0x8/0x40)
[ 3.998463] Call Trace:
[ 4.001164] ? set_rdt_options+0x91/0x91
[ 4.004864] resctrl_late_init+0x592/0x63c
[ 4.008711] ? set_rdt_options+0x91/0x91
[ 4.012452] do_one_initcall+0x4a/0x200
[ 4.016115] kernel_init_freeable+0x1c0/0x263
[ 4.020402] ? rest_init+0xb0/0xb0
[ 4.024889] kernel_init+0xe/0x110
[ 4.029245] ret_from_fork+0x35/0x40
[...]
[ 7.718268] ena: The ena device sent a completion but the driver didn't receive a MSI-X interrupt (cmd 8), autopolling mode is OFF
[ 7.727036] ena: Failed to submit get_feature command 12 error: -62
[ 7.731691] ena 0000:00:03.0: Cannot init indirect table
[ 7.735636] ena 0000:00:03.0: Cannot init RSS rc: -62
[ 7.740700] ena: probe of 0000:00:03.0 failed with error -62
[...]
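The ena lines above all report the same numeric error. As a rough aid for reading such logs, the following Python sketch collects the error codes printed on ena lines. The function name, the regular expression and the small lookup table are illustrative assumptions, not part of any existing tool; the table lists only the Linux errno value seen in this log (62, ETIME, "Timer expired"), which is consistent with the driver timing out while waiting for the MSI-X interrupt.

```python
import re

# Assumption: Linux errno numbering. Only the value seen in this bug's
# console log is listed; extend as needed.
LINUX_ERRNO_NAMES = {62: "ETIME (Timer expired)"}

# Matches "error: -62", "error -62" and "rc: -62" on lines mentioning ena.
ERROR_RE = re.compile(r"\bena\b.*?(?:error:?|rc:)\s*-(\d+)")


def ena_error_codes(log_text):
    """Return the set of (positive) error numbers reported on ena lines."""
    codes = set()
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            codes.add(int(match.group(1)))
    return codes


if __name__ == "__main__":
    sample = (
        "[ 7.727036] ena: Failed to submit get_feature command 12 error: -62\n"
        "[ 7.735636] ena 0000:00:03.0: Cannot init RSS rc: -62\n"
    )
    for code in sorted(ena_error_codes(sample)):
        print(code, LINUX_ERRNO_NAMES.get(code, "unknown"))
```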
[Fix]
Reverting the following upstream stable commit fixes the issue:
83dbf898a2d4 PCI/MSI: Mask MSI-X vectors only on success
[Test Case]
Boot an affected AWS instance type with the focal/linux (5.4.0-101.114) and
impish/linux (5.13.0-31.34) kernels, first with the mentioned commit reverted
and then with the original kernels. The instance should boot successfully with
the revert applied and fail with the original kernels.
[Regression Potential]
The commit description mentions fixing an MSI-X issue with a Marvell NVMe
device, which does not seem to follow the PCIe specification. Reverting this
commit will leave the issue unfixed on systems with that particular NVMe
device.
As of now there is no follow-up fix for this commit upstream. We might need to
keep an eye on any changes and re-apply it if a fix is found.
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Affects: linux (Ubuntu Focal)
Importance: Undecided
Status: Confirmed
** Affects: linux (Ubuntu Impish)
Importance: Undecided
Status: Confirmed
** Also affects: linux (Ubuntu Impish)
Importance: Undecided
Status: New
** Also affects: linux (Ubuntu Focal)
Importance: Undecided
Status: New
** Changed in: linux (Ubuntu Focal)
Status: New => Confirmed
** Changed in: linux (Ubuntu Impish)
Status: New => Confirmed
--
https://bugs.launchpad.net/bugs/1961968
Title:
Broken network on some AWS instances with focal/impish kernels