Re: bhyve core dump related to llvm 14

2022-08-09 Thread Michael Dexter

On 7/21/22 8:31 AM, Chuck Tuffli wrote:

I have a virtual machine used to test the NVMe emulation in bhyve. All
of the tests in the VM pass running under FreeBSD 13.1-R, but the same
VM running under -current causes bhyve(8) to dump core because of a
segmentation fault.

git bisect identified the last "good" commit on main as
 cb2ae6163174 sysvsem: Fix a typo
After this commit, there are a half-dozen commits related to merging
the LLVM project's release/14.x branch.



Chuck and I put our heads together to find a way to reproduce this issue 
and came up with this:


Attach a 1 GB disk image with emulation type "nvme" to a VM running any
recent FreeBSD version, then run this command inside the VM:


nvmecontrol io-passthru -o 0x2 -l 4096 -4 0x20 -r nvme0ns1

This fails gracefully on 13.0R and 13.1R hosts, but crashes the bhyve
process on a 14-CURRENT host built after the LLVM 14 import.


I have detailed reproduction steps and the debug output in this bug report:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=265749

Michael



bhyve core dump related to llvm 14

2022-07-21 Thread Chuck Tuffli
I have a virtual machine used to test the NVMe emulation in bhyve. All
of the tests in the VM pass running under FreeBSD 13.1-R, but the same
VM running under -current causes bhyve(8) to dump core because of a
segmentation fault.

git bisect identified the last "good" commit on main as
cb2ae6163174 sysvsem: Fix a typo
After this commit, there are a half-dozen commits related to merging
the LLVM project's release/14.x branch.

The core dump is repeatable and consistent. Back traces under lldb
look similar to this:

* thread #22, name = 'vcpu 2', stop reason = signal SIGSEGV: invalid
  address (fault address: 0xb8)
  * frame #0: 0x383eb9fc916b
      bhyve`pci_nvme_read(ctx=0x38483ad2d700, vcpu=0, pi=0x,
      baridx=-188391150, offset=0, size=0) at pci_nvme.c:3035:34
    frame #1: 0x384834616280
    frame #2: 0x383eb9fc1f7a
      bhyve`pci_emul_mem_handler(ctx=, vcpu=, dir=, addr=, size=,
      val=, arg1=0x3846e5b71600, arg2=0) at pci_emul.c:498:4

In frame 0, pi being NULL causes the core dump, but most of the
arguments are invalid / garbage. Looking earlier in the stack, the
vcpu value should be 2, the ctx pointer doesn't match, and the value
passed to pi isn't NULL.

Poking around in frame 2, I can see that the "direction" is a memory
write (dir == MEM_F_WRITE) and the statement being executed is this:

    (*pe->pe_barwrite)(ctx, vcpu, pdi, bidx, offset, size, *val);

Confusingly, the function pointer pe_barwrite points to
pci_nvme_write(), not to pci_nvme_read(), where the crash occurs. I've
confirmed the fault is in pci_nvme_read() by adding an assert for
pi != NULL. This is especially odd because pci_emul_mem_handler() calls
pci_nvme_read() and pci_nvme_write() directly, with no function in
between. So why does frame 1 exist at all?
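
The dispatch described above can be modeled in a few lines of C. This
is only a sketch under stated assumptions: the struct layouts, the
demo_* names, the shortened argument lists, and the MEM_F_* values are
made up for illustration and are not the bhyve source. It shows that a
write access is expected to reach only the write callback, and where an
assert on pi would sit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for bhyve's types; only the shape matters. */
struct pci_devinst;

struct pci_devemu {
	void     (*pe_barwrite)(struct pci_devinst *pi, int baridx,
		    uint64_t offset, int size, uint64_t value);
	uint64_t (*pe_barread)(struct pci_devinst *pi, int baridx,
		    uint64_t offset, int size);
};

struct pci_devinst {
	struct pci_devemu *pi_d;   /* device emulation callbacks */
	uint64_t           reg;    /* hypothetical single register */
	int                reads;  /* how often the read callback ran */
	int                writes; /* how often the write callback ran */
};

enum { MEM_F_READ = 1, MEM_F_WRITE = 2 }; /* values are assumptions */

static void
demo_write(struct pci_devinst *pi, int baridx, uint64_t offset, int size,
    uint64_t value)
{
	assert(pi != NULL); /* same kind of check as described above */
	(void)baridx; (void)offset; (void)size;
	pi->reg = value;
	pi->writes++;
}

static uint64_t
demo_read(struct pci_devinst *pi, int baridx, uint64_t offset, int size)
{
	assert(pi != NULL);
	(void)baridx; (void)offset; (void)size;
	pi->reads++;
	return (pi->reg);
}

/* Picks the callback from the access direction, as in frame 2. */
static int
demo_mem_handler(struct pci_devinst *pdi, int dir, uint64_t offset,
    int size, uint64_t *val)
{
	struct pci_devemu *pe = pdi->pi_d;

	if (dir == MEM_F_WRITE)
		(*pe->pe_barwrite)(pdi, 0, offset, size, *val);
	else
		*val = (*pe->pe_barread)(pdi, 0, offset, size);
	return (0);
}

/* One write followed by one read; returns the value read back. */
uint64_t
demo_roundtrip(uint64_t value, int *reads, int *writes)
{
	struct pci_devemu pe = {
		.pe_barwrite = demo_write,
		.pe_barread = demo_read,
	};
	struct pci_devinst pdi = { .pi_d = &pe };
	uint64_t val = value;

	demo_mem_handler(&pdi, MEM_F_WRITE, 0, 4, &val);
	val = 0;
	demo_mem_handler(&pdi, MEM_F_READ, 0, 4, &val);
	*reads = pdi.reads;
	*writes = pdi.writes;
	return (val);
}
```

Calling demo_roundtrip(0x1234, &reads, &writes) performs one write and
one read through the pointers. In a correctly built program each
callback runs exactly once, which is the behavior the back trace above
appears to violate.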

Using gdb, the back traces either don't decode at all or look similar to this:
(gdb) bt
#0  pci_nvme_read (ctx=0x944c1168700, vcpu=0, pi=0x0,
baridx=-1835053270, offset=0, size=0)
at /poudriere/jails/14-current-amd64/usr/src/usr.sbin/bhyve/pci_nvme.c:3035
#1  0x09436891d8e8 in _CurrentRuneLocale () from /lib/libc.so.7
#2  0x09436a73ca28 in ?? ()
#3  0x09436a73e1c0 in ?? ()
...
#34 0x09436a747600 in ?? ()
#35 0x093b3e76b088 in pci_de_lpc ()
#36 0x09436a716500 in ?? ()
#37 0x0944c3196d10 in ?? ()
#38 0x093b3e74501a in pci_emul_mem_handler (ctx=0x9436a7bd670, vcpu=0,
dir=, addr=, size=0,
val=0x646165725f657469, arg1=0x1,
arg2=10185153275136)
at /poudriere/jails/14-current-amd64/usr/src/usr.sbin/bhyve/pci_emul.c:498

Other random tidbits:
 - Disabling compiler optimization (i.e. -O0) for the two files in
   question (pci_nvme.c and pci_emul.c) makes the core dump go away.
 - Using the default optimization level but generously sprinkling
   debug printf calls everywhere also makes the core dump go away.
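
A possible way to narrow the first observation further, offered as a
suggestion and not something tried in this report, is to disable
optimization per function instead of per file. clang accepts
"#pragma clang optimize off" and "#pragma clang optimize on" for this.
The function below is a hypothetical placeholder, not code from
pci_nvme.c or pci_emul.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * clang-specific: functions defined between "off" and "on" are built
 * without optimization. Other compilers may ignore the pragma,
 * possibly with a warning.
 */
#pragma clang optimize off
uint32_t
suspect_sum(const uint8_t *buf, uint32_t len)
{
	uint32_t sum = 0;

	assert(buf != NULL);
	for (uint32_t i = 0; i < len; i++)
		sum += buf[i];
	return (sum);
}
#pragma clang optimize on

/* Built at the normal optimization level; calls the unoptimized one. */
uint32_t
demo_sum(void)
{
	static const uint8_t data[] = { 1, 2, 3, 4 };

	return (suspect_sum(data, (uint32_t)sizeof(data)));
}
```

Moving the pragma pair from one function to the next, and rebuilding
each time, is one way to find which function's optimized code is
involved while leaving the rest of the file untouched.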

I'm not sure where to go from here and could use some help.

--chuck