** Description changed:

  BugLink: https://bugs.launchpad.net/bugs/2155222
  
  [Impact]
  
  Jammy VMs running on "Gen2" v6 instance types on Azure fail to collect a kdump
  with both the 5.15 and 6.8 HWE kernel, yet kdump succeeds for 6.8 onward on
  noble onward. Even stranger, it succeeds on jammy with secureboot enabled, and
  fails with secureboot disabled.
  
  The difference between jammy and noble onward can be explained with userspace
  tools, as kdump-tools uses -c (--kexec-syscall) by default, and changes to
  -s (--kexec-file-syscall) when secureboot is enabled. Noble onward works due to
  using -a (--kexec-syscall-auto) by default, which defaults to -s. Noble will
  fail when using -c instead.
  
  From man kexec:
  
  -s (--kexec-file-syscall)
        Specify that the new KEXEC_FILE_LOAD syscall should be used exclusively.
  
  -c (--kexec-syscall)
        Specify that the old KEXEC_LOAD syscall should be used exclusively (the
        default).
  
  -a (--kexec-syscall-auto)
        Try the new KEXEC_FILE_LOAD syscall first and when it is not supported
        or the kernel does not understand the supplied image fall back to the
        old KEXEC_LOAD interface.
  
        There is no one single interface that always works.
  
        KEXEC_FILE_LOAD is required on systems that use locked-down secure boot
        to verify the kernel signature.  KEXEC_LOAD may be also disabled in the
        kernel configuration.
  
        KEXEC_LOAD is required for some kernel image formats and on
  architectures that do not implement KEXEC_FILE_LOAD.
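  
  The default flag selection described above can be summarised in a small
  shell sketch. The function name default_kexec_flag and its arguments are
  hypothetical and only restate what this report says; any series other than
  jammy is assumed to behave like "noble onward". This is not taken from the
  kdump-tools source.

```shell
# Hypothetical helper: restates the defaults described in this report.
# $1 = series ("jammy", or anything else for "noble onward")
# $2 = secure boot state ("on" or "off")
default_kexec_flag() {
    series=$1
    secureboot=$2
    case "$series" in
        jammy)
            # jammy: -c by default, switching to -s when secureboot is enabled
            if [ "$secureboot" = "on" ]; then
                echo "-s"
            else
                echo "-c"
            fi
            ;;
        *)
            # Assumption: every other series is treated as noble onward (-a)
            echo "-a"
            ;;
    esac
}
```

  For example, "default_kexec_flag jammy off" prints -c, which is the failing
  combination on v6 instance types.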
  
  Regardless, the issue is actually a hyperv subsystem issue in the
  kernel.
  
  When the kexec / kdump kernel boots, vmbus_reserve_fb() fails to reserve the
  framebuffer MMIO range due to a Gen2 VM's screen.lfb_base being zero. This
  causes an MMIO conflict between hyperv-drm and pci-hyperv: when pci-hyperv's
  hv_allocate_config_window() calls vmbus_allocate_mmio() to get an MMIO range,
  it usually gets a 32-bit MMIO range that overlaps with the framebuffer MMIO
  range, and later hv_pci_enter_d0() fails with an error message
  "PCI Pass-through VSP failed D0 Entry with status" since the host thinks that
  PCI devices must not use MMIO space that the host has assigned to the
  framebuffer.
  
  This is especially an issue if pci-hyperv is built-in and hyperv-drm is built as
  a module. Consequently, the kdump/kexec kernel fails to detect PCI devices via
  pci-hyperv, and may fail to mount the root file system, which may reside on an
  NVMe disk.
  
  The end result is that capturing a kdump fails when -c (--kexec-syscall) is
  used, which is the default on jammy.
  
  [Fix]
  
- This is currently queued up in the hyperv maintainer tree, in the hyperv-fixes
- branch:
+ The fix landed in 7.2-rc1:
  
- commit 016a25e4b0df4d77e7c258edee4aaf982e4ee809 hyperv
+ commit 016a25e4b0df4d77e7c258edee4aaf982e4ee809
  From: Dexuan Cui <[email protected]>
  Date: Thu, 7 May 2026 14:28:38 -0700
  Subject: Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
- Link: https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/commit/?h=hyperv-fixes&id=016a25e4b0df4d77e7c258edee4aaf982e4ee809
- 
- This is expected to make the 7.2 merge window.
+ Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=016a25e4b0df4d77e7c258edee4aaf982e4ee809
  
  This fix is required for hyperv users, and is mostly relevant for -azure users
  only, but I am still requesting this for -generic to ensure that anyone using
  -generic on Azure can still kexec, and to make it easier to bisect -generic on
  Azure in the future.
  
  [Testcase]
  
  This needs to be tested on Azure on both v5 and v6 instance types. The issue
  occurs with v6 instance types, but we need to ensure we do not cause a
  regression with v5 instance types.
  
  For each series you are testing, create a VM with the following instance types:
  - Standard_D4ads_v5
  - Standard_D4ads_v6
  
  For the image type, you need to select "Gen2" images:
  - "Ubuntu Server 22.04 LTS - x64 Gen2"
  - "Ubuntu Server 24.04 LTS - x64 Gen2"
  - "Ubuntu Server 25.10 - x64 Gen 2"
  - "Ubuntu Server 26.04 LTS - x64 Gen 2"
  
  If you are going to test with -c (--kexec-syscall), secureboot needs to be
  disabled, and you can do this with:
  - Under Security type, select "Configure security features"
  - uncheck "Enable Secure Boot". Save.
  
  Create the VM.
  
  Log in, and install kdump-tools:
  
  $ sudo apt update
  $ sudo apt install kdump-tools
  
  Say yes to each prompt.
  
  $ sudo vim /etc/default/grub.d/kdump-tools.cfg
  Change crashkernel=512M-:192M from 192M to 1G, save, exit.
  
  $ sudo vim /etc/kernel/postinst.d/kdump-tools
  Change dep to most, save, exit.
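  
  The two edits above can also be made without an editor. The following is a
  minimal sketch: the function names are hypothetical, and the second function
  assumes the setting appears in the script as the literal text MODULES=dep,
  which should be checked against the file before running it.

```shell
# Hypothetical helpers for the two edits above. The file path is passed as an
# argument so that each function can be tried on a copy of the file first.

# Raise the crashkernel reservation from 192M to 1G.
set_crashkernel_1g() {
    sed -i 's/crashkernel=512M-:192M/crashkernel=512M-:1G/' "$1"
}

# Switch the initramfs module policy from "dep" to "most".
# Assumption: the setting appears as the literal text MODULES=dep.
set_modules_most() {
    sed -i 's/MODULES=dep/MODULES=most/' "$1"
}
```

  In a root shell, these would be called as
  "set_crashkernel_1g /etc/default/grub.d/kdump-tools.cfg" and
  "set_modules_most /etc/kernel/postinst.d/kdump-tools".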
  
  $ sudo update-grub
  $ sudo reboot
  
  Verify that the cmdline has crashkernel set to 1G memory:
  $ cat /proc/cmdline
  $ kdump-config show
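  
  To avoid reading the whole command line by eye, the crashkernel value can be
  extracted with a small helper. This is a sketch only; the function name
  crashkernel_value is hypothetical.

```shell
# Hypothetical helper: print the crashkernel= value from a kernel command line.
# $1 = command line text, for example the contents of /proc/cmdline.
crashkernel_value() {
    # Intentionally unquoted so the command line is split into its arguments.
    for arg in $1; do
        case "$arg" in
            crashkernel=*) echo "${arg#crashkernel=}" ;;
        esac
    done
}
```

  Running crashkernel_value "$(cat /proc/cmdline)" should print a value that
  ends in 1G if the grub change took effect.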
  
  On the Azure Web Interface, select "Serial Console" for the VM, and watch the
  serial console.
  
  $ sudo sysctl -w kernel.sysrq=1
  $ sudo su
  $ echo c > /proc/sysrq-trigger
  
  Watch the kernel panic and reboot into the crash kernel.
  
  On failure:
  
  The kexec kernel gets stuck, and writes these messages to dmesg.
  
  [    1.157729] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: PCI VMBus probing: Using version 0x10004
  [    1.167427] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: Retrying D0 Entry
  [    1.173231] hv_pci 7ad35d50-c05b-47ab-b3a0-56a9a845852b: PCI Pass-through VSP failed D0 Entry with status c000000d
  [    1.181091] hv_vmbus: probe failed for device 7ad35d50-c05b-47ab-b3a0-56a9a845852b (-71)
  [    1.186890] hv_pci: probe of 7ad35d50-c05b-47ab-b3a0-56a9a845852b failed with error -71
  [    1.194422] hv_pci 00000001-7870-47b5-b203-907d12ca697e: PCI VMBus probing: Using version 0x10004
  [    1.202172] hv_pci 00000001-7870-47b5-b203-907d12ca697e: Retrying D0 Entry
  [    1.207877] hv_pci 00000001-7870-47b5-b203-907d12ca697e: PCI Pass-through VSP failed D0 Entry with status c000000d
  
  The kexec kernel gives up, and reboots. No kdump is generated. /var/crash will
  be empty.
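  
  If the serial console output has been saved to a file, the failure signature
  quoted above can be searched for directly. This is a sketch; the function
  name and the idea of saving the console output to a file are assumptions,
  not part of the original test plan.

```shell
# Hypothetical helper: succeed if a saved console log contains the D0 Entry
# failure message quoted above. $1 = path to the saved log file.
has_d0_entry_failure() {
    grep -q 'PCI Pass-through VSP failed D0 Entry' "$1"
}
```

  For example, "has_d0_entry_failure console.log && echo affected" prints
  "affected" only when the message is present in console.log.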
  
  On success:
  
  The kdump is collected, and saved to /var/crash, and will be present on next
  boot.
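  
  After the next boot, the presence of a dump can be checked from a shell. The
  sketch below assumes the dump file name starts with "dump." somewhere under
  the crash directory; the function name has_kdump is hypothetical, and the
  file naming should be confirmed on the system under test.

```shell
# Hypothetical helper: succeed if a directory tree contains a dump file.
# Assumption: saved dumps have file names starting with "dump.".
# $1 = directory to search, for example /var/crash.
has_kdump() {
    [ -n "$(find "$1" -type f -name 'dump.*' 2>/dev/null | head -n 1)" ]
}
```

  For example, "has_kdump /var/crash && echo collected" prints "collected"
  only when such a file exists.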
  
  There are test kernels available in the following ppa:
  
  https://launchpad.net/~mruffell/+archive/ubuntu/sf425760-test
  
  If you install the test kernel and reboot, kdump will work correctly on v6
  instance types.
  
  [Where problems could occur]
  
  This changes how vmbus_reserve_fb() reserves MMIO space for the framebuffer,
  and if a regression were to occur, it could prevent the pci-hyperv and
  hyperv-drm drivers from claiming the correct MMIO ranges.
  
  This could show as instances failing to start or failing to kexec / collect a
  kdump with the crashkernel.
  
  This fix works on both amd64 and arm64 instance types, as well as with 32-bit
  and 64-bit PCI buses.
  
  [Other info]
  
  Upstream mailing list threads:
  
  Abandoned Patch:
  V1: https://lore.kernel.org/linux-hyperv/[email protected]/
  V2: https://lore.kernel.org/linux-hyperv/[email protected]/
  
  Current Patch:
  V1: https://lore.kernel.org/linux-hyperv/[email protected]/
  V2: https://lore.kernel.org/linux-hyperv/[email protected]/
  V3: https://lore.kernel.org/linux-hyperv/[email protected]/

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2155222

Title:
  [hyperv] Ensure MMIO Mapping is Correct for Kexec / kdump kernel on
  Azure v6 Instance Types

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2155222/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
