I have a virtual machine used to test the NVMe emulation in bhyve. All
of the tests in the VM pass running under FreeBSD 13.1-R, but the same
VM running under -current causes bhyve(8) to dump core because of a
segmentation fault.
git bisect identified the last "good" commit on main as:
    cb2ae6163174 sysvsem: Fix a typo
After this commit, there are a half-dozen commits related to merging
the llvm project release/14.x branch.
The core dump is repeatable and consistent. Back traces under lldb
look similar to this:
* thread #22, name = 'vcpu 2', stop reason = signal SIGSEGV: invalid
  address (fault address: 0xb8)
  * frame #0: 0x383eb9fc916b
        bhyve`pci_nvme_read(ctx=0x38483ad2d700, vcpu=0,
        pi=0x, baridx=-188391150, offset=0, size=0)
        at pci_nvme.c:3035:34
    frame #1: 0x384834616280
    frame #2: 0x383eb9fc1f7a
        bhyve`pci_emul_mem_handler(ctx=, vcpu=, dir=, addr=, size=,
        val=, arg1=0x3846e5b71600, arg2=0)
        at pci_emul.c:498:4
In frame 0, pi being NULL causes the core dump, but most of the other
arguments are garbage as well. Looking at the calling frames, the
vcpu value should be 2, the ctx pointer doesn't match, and the value
passed for pi isn't NULL.
Poking around in frame 2, I can see that the "direction" is a memory
write (dir == MEM_F_WRITE) and the statement being executed is this:
(*pe->pe_barwrite)(ctx, vcpu, pdi, bidx, offset, size, *val);
Confusingly, the function pointer pe_barwrite points to
pci_nvme_write(), not pci_nvme_read(), where the crash occurs. I've
confirmed the fault is in pci_nvme_read() by adding an assert for
pi != NULL. This is especially odd because pci_emul_mem_handler()
calls pci_nvme_read() and pci_nvme_write() directly through function
pointers such as pe_barwrite, with no intermediate function. So why
does frame 1 exist at all?
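
For anyone not familiar with this code path, here is a minimal,
standalone sketch of the dispatch pattern described above. It is not the
bhyve source: the struct layouts are simplified, the ctx and vcpu
arguments are dropped, and every demo_* name and the pi_d, reg, reads
and writes fields are made up for illustration. It only shows why a
write access (dir == MEM_F_WRITE) would be expected to reach the write
callback and never the read callback, and where an assert like the one
I added sits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the bhyve types. */
enum mem_dir { MEM_F_READ, MEM_F_WRITE };

struct pci_devinst;

/* Per-device-model callbacks, reached only through function pointers. */
struct pci_devemu {
	void     (*pe_barwrite)(struct pci_devinst *pi, int baridx,
		    uint64_t offset, int size, uint64_t value);
	uint64_t (*pe_barread)(struct pci_devinst *pi, int baridx,
		    uint64_t offset, int size);
};

struct pci_devinst {
	const struct pci_devemu *pi_d;	/* callbacks for this device */
	uint64_t reg;			/* one fake register */
	int reads;			/* number of read callbacks seen */
	int writes;			/* number of write callbacks seen */
};

static uint64_t
demo_read(struct pci_devinst *pi, int baridx, uint64_t offset, int size)
{
	/* Same kind of check as the assert mentioned above. */
	assert(pi != NULL);
	(void)baridx; (void)offset; (void)size;
	pi->reads++;
	return (pi->reg);
}

static void
demo_write(struct pci_devinst *pi, int baridx, uint64_t offset, int size,
    uint64_t value)
{
	assert(pi != NULL);
	(void)baridx; (void)offset; (void)size;
	pi->writes++;
	pi->reg = value;
}

const struct pci_devemu demo_emu = {
	.pe_barwrite = demo_write,
	.pe_barread = demo_read,
};

/* Dispatch on direction; nothing sits between this and the callback. */
void
demo_mem_handler(enum mem_dir dir, struct pci_devinst *pdi, int bidx,
    uint64_t offset, int size, uint64_t *val)
{
	const struct pci_devemu *pe = pdi->pi_d;

	if (dir == MEM_F_WRITE)
		(*pe->pe_barwrite)(pdi, bidx, offset, size, *val);
	else
		*val = (*pe->pe_barread)(pdi, bidx, offset, size);
}
```

In this sketch a MEM_F_WRITE access only ever increments the writes
counter, which is the behavior I would expect from the real handler and
the opposite of what the backtrace shows.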
Using gdb, the back traces either don't decode at all or look similar to this:
(gdb) bt
#0  pci_nvme_read (ctx=0x944c1168700, vcpu=0, pi=0x0,
        baridx=-1835053270, offset=0, size=0)
        at /poudriere/jails/14-current-amd64/usr/src/usr.sbin/bhyve/pci_nvme.c:3035
#1  0x09436891d8e8 in _CurrentRuneLocale () from /lib/libc.so.7
#2  0x09436a73ca28 in ?? ()
#3  0x09436a73e1c0 in ?? ()
...
#34 0x09436a747600 in ?? ()
#35 0x093b3e76b088 in pci_de_lpc ()
#36 0x09436a716500 in ?? ()
#37 0x0944c3196d10 in ?? ()
#38 0x093b3e74501a in pci_emul_mem_handler (ctx=0x9436a7bd670, vcpu=0,
        dir=, addr=, size=0,
        val=0x646165725f657469, arg1=0x1,
        arg2=10185153275136)
        at /poudriere/jails/14-current-amd64/usr/src/usr.sbin/bhyve/pci_emul.c:498
Other random tidbits:
- disabling compiler optimization (i.e. -O0) for the two files in
question (pci_nvme.c and pci_emul.c) makes the core dump go away.
- using the default optimization level but generously sprinkling
debug printf() calls everywhere also makes the core dump go away.
I'm not sure where to go from here and could use some help.
--chuck