Mischa <open...@mlst.nl> writes:
> On 2022-12-12 16:02, Dave Voutila wrote:
>> Mischa <open...@mlst.nl> writes:
>>
>>> Hi Dave,
>>> Great stuff!!
>>> Everything is patched, built and booted.
>>> What is the best way to test this?
>>
>> Start guests as usual. I'd say the only thing definitively to manually
>> check is that they see the same amount of physical memory as before the
>> patch.
>
> That is indeed different. Before the patch allocating 1G would be
> displayed as 1G.
> Now a 1G allocation is 1.3G, 2G is 2.3G, 8G is 8.3G, etc.
>

So I can reproduce, how are you measuring it?

>>> Start a bunch of VMs with bsd.rd? Does this still need to be a
>>> decompressed bsd.rd?
>>
>> Booting compressed bsd.rd's has been working for awhile now, so no need
>> to decompress them, btw. If you boot a bsd.rd, it's exercising the
>> changes to loadfile_elf.c, which is important to test.
>
> I missed that completely, good to know.
>
> Starting/stopping a bunch of VMs with bsd.rd (only) at the moment.
> Not sure if this is helpful, but got this in the logs:
>
> Dec 12 16:21:47 current /bsd: vmm_handle_cpuid: function 0x06 (thermal/power mgt) not supported
> Dec 12 16:21:47 current /bsd: vcpu_run_vmx: unimplemented exit type 32 (WRMSR instruction)

Ah, this is a different issue! Just noise for now, but I need to add
VMX_EXIT_WRMSR to the list of those we ignore on re-entry. (We're
tripping the default condition in a switch statement checking on the
last exit type...a default that's only enabled with VMM_DEBUG.)
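For anyone following along, the pattern described above can be sketched roughly as below. This is not the vmm.c code. The enum names, the helper check_last_exit(), its ignore_wrmsr flag, and the DEBUG_SKETCH macro (standing in for VMM_DEBUG) are all made up for illustration; the only value taken from the thread is exit type 32 for WRMSR.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical exit types; only 32 (WRMSR) comes from the log above. */
enum sketch_exit {
	SKETCH_EXIT_IO = 30,
	SKETCH_EXIT_RDMSR = 31,
	SKETCH_EXIT_WRMSR = 32
};

/* Stand-in for VMM_DEBUG: remove this define to silence the default case. */
#define DEBUG_SKETCH 1

/*
 * Returns 1 if the last exit type is one we knowingly ignore on
 * re-entry, 0 if it fell through to the default case. ignore_wrmsr
 * models "before" (0) and "after" (1) adding WRMSR to the ignore list.
 */
static int
check_last_exit(int last_exit, int ignore_wrmsr)
{
	assert(last_exit >= 0);

	switch (last_exit) {
	case SKETCH_EXIT_IO:
	case SKETCH_EXIT_RDMSR:
		return 1;
	case SKETCH_EXIT_WRMSR:
		if (ignore_wrmsr)
			return 1;
		/* FALLTHROUGH */
	default:
#ifdef DEBUG_SKETCH
		/* Only a debug build complains about unknown exit types. */
		printf("unimplemented exit type %d\n", last_exit);
#endif
		return 0;
	}
}
```

Calling check_last_exit(SKETCH_EXIT_WRMSR, 0) takes the default branch and, in a debug build, prints a message; with the flag set to 1 the same exit type is ignored quietly. That would be consistent with the message only showing up on kernels built with VMM_DEBUG.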
> Dec 12 16:21:47 current /bsd: vcpu @ 0xffff800022e3a000 in long mode
> Dec 12 16:21:47 current /bsd: CPL=0
> Dec 12 16:21:47 current /bsd: rax=0x0000000000000081 rbx=0xffffffff818f8008 rcx=0x00000000000001a0
> Dec 12 16:21:47 current /bsd: rdx=0x0000000000000000 rbp=0xffffffff81a06da0 rdi=0x00000000000001a0
> Dec 12 16:21:47 current /bsd: rsi=0xffffffff81a06d88 r8=0x0000000000000028 r9=0x00000000a9144070
> Dec 12 16:21:47 current /bsd: r10=0x0000000000000000 r11=0xffffffff81a06ce0 r12=0xffffffff81a06e28
> Dec 12 16:21:47 current /bsd: r13=0xffffffff8147da20 r14=0xffffffff818f6ff0 r15=0x0000000000000006
> Dec 12 16:21:47 current /bsd: rip=0xffffffff81221d50 rsp=0xffffffff81a06d80
> Dec 12 16:21:47 current /bsd: rflags=0x0000000000000246 (cf PF af ZF sf tf IF df of nt rf vm ac vif vip id IOPL=0)
> Dec 12 16:21:47 current /bsd: cr0=0x0000000080010031 (PG cd nw am WP NE ET ts em mp PE)
> Dec 12 16:21:47 current /bsd: cr2=0x0000000000000000
> Dec 12 16:21:47 current /bsd: cr3=0x000000007f7d8000 (pwt pcd)
> Dec 12 16:21:47 current /bsd: cr4=0x00000000000026b0 (pke smap smep osxsave pcide fsgsbase smxe VMXE OSXMMEXCPT OSFXSR pce PGE mce PAE PSE de tsd pvi vme)
> Dec 12 16:21:47 current /bsd: --Guest Segment Info--
> Dec 12 16:21:47 current /bsd: cs=0x0008 rpl=0 base=0x0000000000000000 limit=0x00000000ffffffff a/r=0xa09b
> Dec 12 16:21:47 current /bsd: granularity=1 dib=0 l(64 bit)=1 present=1 sys=1 type=code, r/x, accessed
> Dec 12 16:21:47 current /bsd: ds=0x0010 rpl=0 base=0x0000000000000000 limit=0x00000000ffffffff a/r=vmm_handle_cpuid: unsupported rax=0x40000100
> Dec 12 16:21:47 current /bsd: 0xa093
> Dec 12 16:21:47 current /bsd: granularity=1 dib=0 l(64 bit)=1 present=1 sys=1 type=data, r/w, accessed
> Dec 12 16:21:47 current /bsd: es=0x0010 rpl=0 base=0x0000000000000000 limit=0x00000000ffffffff a/r=0xa093
> Dec 12 16:21:47 current /bsd: granularity=1 dib=0 l(64 bit)=1 present=1 sys=1 type=data, r/w, accessed
> Dec 12 16:21:47 current /bsd: fs=0x0000 rpl=0 base=0x0000000000000000 limit=0x00000000ffffffff a/r=0x1c000
> Dec 12 16:21:47 current /bsd: (unusable)
> Dec 12 16:21:47 current /bsd: gs=0x0000 rpl=0 base=0xffffffff818f6ff0 limit=0x00000000ffffffff a/r=0x1c000
> Dec 12 16:21:47 current /bsd: (unusable)
> Dec 12 16:21:47 current /bsd: ss=0x0010 rpl=0 base=0x0000000000000000 limit=0x00000000ffffffff a/r=0xa093
> Dec 12 16:21:47 current /bsd: granularity=1 dib=0 l(64 bit)=1 present=1 sys=1 type=data, r/w, accessed
> Dec 12 16:21:47 current /bsd: tr=0x0030 base=0xffffffff818f5000 limit=0x0000000000000067 a/r=0x008b
> Dec 12 16:21:47 current /bsd: granularity=0 dib=0 l(64 bit)=0 present=1 sys=0 type=tss (busy)
> Dec 12 16:21:47 current /bsd: gdtr base=0xffffffff818f5068 limit=0x000000000000003f
> Dec 12 16:21:47 current /bsd: idtr base=0xffff800000020000 limit=0x0000000000000fff
> Dec 12 16:21:47 current /bsd: ldtr=0x0000 base=0x0000000000000000 limit=0x00000000ffffffff a/r=0x1c000
> Dec 12 16:21:47 current /bsd: (unusable)
> Dec 12 16:21:47 current /bsd: vmm_handle_cpuid: function 0x06 (thermal/power mgt) not supported
> Dec 12 16:21:47 current /bsd: --Guest MSRs @ 0xfffffddea339d000 (paddr: 0x0000005ea339d000)--
> Dec 12 16:21:47 current /bsd: MSR 0 @ 0xfffffddea339d000 : 0xc0000080 (EFER
> Dec 12 16:21:47 current /bsd: ), value=0x0000000000000d01 (SCE LME LMA NXE)
> Dec 12 16:21:47 current /bsd: MSR 1 @ 0xfffffddea339d010 : 0xc0000081 (STAR), value=0x001b000800000000
> Dec 12 16:21:47 current /bsd: MSR 2 @ 0xfffffddea339d020 : 0xc0000082 (LSTAR), value=0xffffffff813b8000
> Dec 12 16:21:47 current /bsd: MSR 3 @ 0xfffffddea339d030 : 0xc0000083 (CSTAR), value=0xffffffff813ba000
> Dec 12 16:21:47 current /bsd: MSR 4 @
> Dec 12 16:21:47 current /bsd: 0xfffffddea339d040 : 0xc0000084 (SFMASK), val
> Dec 12 16:21:47 current /bsd: ue=0x0000000
> Dec 12 16:21:47 current /bsd: 000044701
>
> Mischa
>
>>> On 2022-12-10 23:51, Dave Voutila wrote:
>>>> tech@,
>>>>
>>>> The below diff tweaks how vmd and vmm define memory ranges (adding a
>>>> "type" attribute) so we can properly build an e820 memory map to hand
>>>> to things like SeaBIOS or the OpenBSD ramdisk kernel (when direct
>>>> booting bsd.rd).
>>>>
>>>> Why do it? We've been carrying a few patches to SeaBIOS in the ports
>>>> tree to hack around how vmd articulates some memory range details. By
>>>> finally implementing a proper bios memory map table we can drop some
>>>> of those patches. (Diff to ports@ coming shortly.)
>>>>
>>>> Bonus is it cleans up how we were hacking a bios memory map for
>>>> direct booting ramdisk kernels.
>>>>
>>>> Note: the below diff *will* work with the current SeaBIOS
>>>> (vmm-firmware), so you do *not* need to build the port.
>>>>
>>>> You will, however, need to:
>>>> - build, install, & reboot into a new kernel
>>>> - make sure you update /usr/include/amd64/vmmvar.h with a copy or
>>>>   symlink to sys/arch/amd64/include/vmmvar.h
>>>> - rebuild & install vmctl
>>>> - rebuild & install vmd
>>>>
>>>> This should *not* result in any behavioral changes of current vmd
>>>> guests. If you notice any, especially guests failing to start, please
>>>> rebuild a kernel with VMM_DEBUG to help diagnose the regression.
>>>>
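To illustrate the "type" attribute described above, here is a minimal, self-contained sketch of turning typed memory ranges into e820-style entries and summing only the usable ones. It is not the vmd code: every sketch_* name and SKETCH_* constant is hypothetical, and the structs are simplified stand-ins rather than struct vm_mem_range or bios_memmap_t. The only outside fact relied on is that e820 uses type 1 for usable RAM and type 2 for reserved regions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the typed range and the e820 entry. */
enum sketch_mem_type { SKETCH_MEM_RAM = 0, SKETCH_MEM_RESERVED = 1 };

struct sketch_range {
	uint64_t gpa;	/* guest physical start address */
	uint64_t size;	/* length in bytes */
	int type;	/* one of enum sketch_mem_type */
};

/* e820 type codes: 1 = usable RAM, 2 = reserved. */
#define SKETCH_E820_RAM		1
#define SKETCH_E820_RESERVED	2

struct sketch_e820 {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/*
 * Translate n typed ranges into e820 entries. Returns the number of
 * entries written, or -1 if a range carries an unknown type.
 */
static int
sketch_build_e820(const struct sketch_range *r, size_t n,
    struct sketch_e820 *out)
{
	size_t i;

	assert(r != NULL && out != NULL);
	for (i = 0; i < n; i++) {
		out[i].addr = r[i].gpa;
		out[i].size = r[i].size;
		if (r[i].type == SKETCH_MEM_RAM)
			out[i].type = SKETCH_E820_RAM;
		else if (r[i].type == SKETCH_MEM_RESERVED)
			out[i].type = SKETCH_E820_RESERVED;
		else
			return -1;
	}
	return (int)n;
}

/* Sum only the entries an e820 consumer would treat as usable RAM. */
static uint64_t
sketch_usable_bytes(const struct sketch_e820 *e, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (e[i].type == SKETCH_E820_RAM)
			total += e[i].size;
	return total;
}
```

Summing only the usable entries, as sketch_usable_bytes() does, is one way to cross-check how much physical memory a guest should report when comparing against an unpatched vmd.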
>>>> -dv
>>>>
>>>> diff refs/heads/master refs/heads/vmd-e820
>>>> commit - a96642fb40af450c6576e205fab247cdbce0b5ed
>>>> commit + f3cb01998127d200e95ff9984a7503eb16c2a8d8
>>>> blob - 3f7e0ce405ae3c6b0b4a787de341839886f97436
>>>> blob + f2a464217838d3f0a50e4131b5b074b315e490fb
>>>> --- sys/arch/amd64/amd64/vmm.c
>>>> +++ sys/arch/amd64/amd64/vmm.c
>>>> @@ -1643,21 +1643,27 @@ vm_create_check_mem_ranges(struct vm_create_params *vc
>>>>  	const paddr_t maxgpa = VMM_MAX_VM_MEM_SIZE;
>>>>
>>>>  	if (vcp->vcp_nmemranges == 0 ||
>>>> -	    vcp->vcp_nmemranges > VMM_MAX_MEM_RANGES)
>>>> +	    vcp->vcp_nmemranges > VMM_MAX_MEM_RANGES) {
>>>> +		DPRINTF("invalid number of guest memory ranges\n");
>>>>  		return (0);
>>>> +	}
>>>>
>>>>  	for (i = 0; i < vcp->vcp_nmemranges; i++) {
>>>>  		vmr = &vcp->vcp_memranges[i];
>>>>
>>>>  		/* Only page-aligned addresses and sizes are permitted */
>>>>  		if ((vmr->vmr_gpa & PAGE_MASK) || (vmr->vmr_va & PAGE_MASK) ||
>>>> -		    (vmr->vmr_size & PAGE_MASK) || vmr->vmr_size == 0)
>>>> +		    (vmr->vmr_size & PAGE_MASK) || vmr->vmr_size == 0) {
>>>> +			DPRINTF("memory range %zu is not page aligned\n", i);
>>>>  			return (0);
>>>> +		}
>>>>
>>>>  		/* Make sure that VMM_MAX_VM_MEM_SIZE is not exceeded */
>>>>  		if (vmr->vmr_gpa >= maxgpa ||
>>>> -		    vmr->vmr_size > maxgpa - vmr->vmr_gpa)
>>>> +		    vmr->vmr_size > maxgpa - vmr->vmr_gpa) {
>>>> +			DPRINTF("exceeded max memory size\n");
>>>>  			return (0);
>>>> +		}
>>>>
>>>>  		/*
>>>>  		 * Make sure that all virtual addresses are within the address
>>>> @@ -1667,39 +1673,55 @@ vm_create_check_mem_ranges(struct vm_create_params *vc
>>>>  		 */
>>>>  		if (vmr->vmr_va < VM_MIN_ADDRESS ||
>>>>  		    vmr->vmr_va >= VM_MAXUSER_ADDRESS ||
>>>> -		    vmr->vmr_size >= VM_MAXUSER_ADDRESS - vmr->vmr_va)
>>>> +		    vmr->vmr_size >= VM_MAXUSER_ADDRESS - vmr->vmr_va) {
>>>> +			DPRINTF("guest va not within range or wraps\n");
>>>>  			return (0);
>>>> +		}
>>>>
>>>>  		/*
>>>>  		 * Specifying ranges within the PCI MMIO space is forbidden.
>>>>  		 * Disallow ranges that start inside the MMIO space:
>>>>  		 * [VMM_PCI_MMIO_BAR_BASE .. VMM_PCI_MMIO_BAR_END]
>>>>  		 */
>>>> -		if (vmr->vmr_gpa >= VMM_PCI_MMIO_BAR_BASE &&
>>>> -		    vmr->vmr_gpa <= VMM_PCI_MMIO_BAR_END)
>>>> +		if (vmr->vmr_type == VM_MEM_RAM &&
>>>> +		    vmr->vmr_gpa >= VMM_PCI_MMIO_BAR_BASE &&
>>>> +		    vmr->vmr_gpa <= VMM_PCI_MMIO_BAR_END) {
>>>> +			DPRINTF("guest RAM range %zu cannot being in mmio range"
>>>> +			    " (gpa=0x%lx)\n", i, vmr->vmr_gpa);
>>>>  			return (0);
>>>> +		}
>>>>
>>>>  		/*
>>>>  		 * ... and disallow ranges that end inside the MMIO space:
>>>>  		 * (VMM_PCI_MMIO_BAR_BASE .. VMM_PCI_MMIO_BAR_END]
>>>>  		 */
>>>> -		if (vmr->vmr_gpa + vmr->vmr_size > VMM_PCI_MMIO_BAR_BASE &&
>>>> -		    vmr->vmr_gpa + vmr->vmr_size <= VMM_PCI_MMIO_BAR_END)
>>>> +		if (vmr->vmr_type == VM_MEM_RAM &&
>>>> +		    vmr->vmr_gpa + vmr->vmr_size > VMM_PCI_MMIO_BAR_BASE &&
>>>> +		    vmr->vmr_gpa + vmr->vmr_size <= VMM_PCI_MMIO_BAR_END) {
>>>> +			DPRINTF("guest RAM range %zu cannot end in mmio range"
>>>> +			    " (gpa=0x%lx, sz=0x%lx)\n", i, vmr->vmr_gpa,
>>>> +			    vmr->vmr_size);
>>>>  			return (0);
>>>> +		}
>>>>
>>>>  		/*
>>>>  		 * Make sure that guest physical memory ranges do not overlap
>>>>  		 * and that they are ascending.
>>>>  		 */
>>>> -		if (i > 0 && pvmr->vmr_gpa + pvmr->vmr_size > vmr->vmr_gpa)
>>>> +		if (i > 0 && pvmr->vmr_gpa + pvmr->vmr_size > vmr->vmr_gpa) {
>>>> +			DPRINTF("guest range %zu overlaps or !ascending\n", i);
>>>>  			return (0);
>>>> +		}
>>>>
>>>>  		memsize += vmr->vmr_size;
>>>>  		pvmr = vmr;
>>>>  	}
>>>>
>>>> -	if (memsize % (1024 * 1024) != 0)
>>>> +	if (memsize % (1024 * 1024) != 0) {
>>>> +		DPRINTF("memory size not a multiple of 1MB\n");
>>>>  		return (0);
>>>> +	}
>>>> +
>>>>  	return (memsize);
>>>>  }
>>>>
>>>> blob - 94feca154717c1e3016990ad260036cd79e29b65
>>>> blob + 2c57f10b9340e8a779f50bee18d235a299721571
>>>> --- sys/arch/amd64/include/vmmvar.h
>>>> +++ sys/arch/amd64/include/vmmvar.h
>>>> @@ -451,6 +451,9 @@ struct vm_mem_range {
>>>>  	paddr_t	vmr_gpa;
>>>>  	vaddr_t	vmr_va;
>>>>  	size_t	vmr_size;
>>>> +	int	vmr_type;
>>>> +#define VM_MEM_RAM		0
>>>> +#define VM_MEM_RESERVED	1
>>>>  };
>>>>
>>>>  /*
>>>> blob - 4ec036912cafa154f4eb24ce757f0cb6e4c6bf4a
>>>> blob + eb0bea236ed0d6c4d68f6699eb6720ef8fca296c
>>>> --- usr.sbin/vmd/fw_cfg.c
>>>> +++ usr.sbin/vmd/fw_cfg.c
>>>> @@ -16,6 +16,7 @@
>>>>   */
>>>>  #include <sys/types.h>
>>>>  #include <sys/uio.h>
>>>> +#include <machine/biosvar.h>	/* bios_memmap_t */
>>>>  #include <machine/vmmvar.h>
>>>>
>>>>  #include <stdlib.h>
>>>> @@ -63,6 +64,8 @@ static int fw_cfg_select_file(uint16_t);
>>>>
>>>>  static uint64_t fw_cfg_dma_addr;
>>>>
>>>> +static bios_memmap_t e820[VMM_MAX_MEM_RANGES];
>>>> +
>>>>  static int	fw_cfg_select_file(uint16_t);
>>>>  static void	fw_cfg_file_dir(void);
>>>>
>>>> @@ -71,7 +74,27 @@ fw_cfg_init(struct vmop_create_params *vmc)
>>>>  {
>>>>  	const char *bootorder = NULL;
>>>>  	unsigned int sd = 0;
>>>> +	size_t i, e820_len = 0;
>>>>
>>>> +	/* Define e820 memory ranges. */
>>>> +	memset(&e820, 0, sizeof(e820));
>>>> +	for (i = 0; i < vmc->vmc_params.vcp_nmemranges; i++) {
>>>> +		struct vm_mem_range *range = &vmc->vmc_params.vcp_memranges[i];
>>>> +		bios_memmap_t *entry = &e820[i];
>>>> +
>>>> +		entry->addr = range->vmr_gpa;
>>>> +		entry->size = range->vmr_size;
>>>> +		if (range->vmr_type == VM_MEM_RAM)
>>>> +			entry->type = BIOS_MAP_FREE;
>>>> +		else if (range->vmr_type == VM_MEM_RESERVED)
>>>> +			entry->type = BIOS_MAP_RES;
>>>> +		else
>>>> +			fatalx("undefined memory type %d", entry->type);
>>>> +
>>>> +		e820_len += sizeof(bios_memmap_t);
>>>> +	}
>>>> +	fw_cfg_add_file("etc/e820", &e820, e820_len);
>>>> +
>>>>  	/* do not double print chars on serial port */
>>>>  	fw_cfg_add_file("etc/screen-and-debug", &sd, sizeof(sd));
>>>>
>>>> blob - 651719542d28ce44bccb0487867ece7e72686606
>>>> blob + b7f79eb9e140073f75563a6dcb5fdad3cb2b2d22
>>>> --- usr.sbin/vmd/loadfile_elf.c
>>>> +++ usr.sbin/vmd/loadfile_elf.c
>>>> @@ -334,38 +334,26 @@ create_bios_memmap(struct vm_create_params *vcp, bios_
>>>>  static size_t
>>>>  create_bios_memmap(struct vm_create_params *vcp, bios_memmap_t *memmap)
>>>>  {
>>>> -	size_t i, n = 0, sz;
>>>> -	paddr_t gpa;
>>>> +	size_t i, n = 0;
>>>>  	struct vm_mem_range *vmr;
>>>>
>>>> -	for (i = 0; i < vcp->vcp_nmemranges; i++) {
>>>> +	for (i = 0; i < vcp->vcp_nmemranges; i++, n++) {
>>>>  		vmr = &vcp->vcp_memranges[i];
>>>> -		gpa = vmr->vmr_gpa;
>>>> -		sz = vmr->vmr_size;
>>>> -
>>>> -		/*
>>>> -		 * Make sure that we do not mark the ROM/video RAM area in the
>>>> -		 * low memory as physcal memory available to the kernel.
>>>> -		 */
>>>> -		if (gpa < 0x100000 && gpa + sz > LOWMEM_KB * 1024) {
>>>> -			if (gpa >= LOWMEM_KB * 1024)
>>>> -				sz = 0;
>>>> -			else
>>>> -				sz = LOWMEM_KB * 1024 - gpa;
>>>> -		}
>>>> -
>>>> -		if (sz != 0) {
>>>> -			memmap[n].addr = gpa;
>>>> -			memmap[n].size = sz;
>>>> -			memmap[n].type = 0x1;	/* Type 1 : Normal memory */
>>>> -			n++;
>>>> -		}
>>>> +		memmap[n].addr = vmr->vmr_gpa;
>>>> +		memmap[n].size = vmr->vmr_size;
>>>> +		if (vmr->vmr_type == VM_MEM_RAM)
>>>> +			memmap[n].type = BIOS_MAP_FREE;
>>>> +		else if (vmr->vmr_type == VM_MEM_RESERVED)
>>>> +			memmap[n].type = BIOS_MAP_RES;
>>>> +		else
>>>> +			fatalx("%s: invalid vm memory range type %d\n",
>>>> +			    __func__, vmr->vmr_type);
>>>>  	}
>>>>
>>>>  	/* Null mem map entry to denote the end of the ranges */
>>>>  	memmap[n].addr = 0x0;
>>>>  	memmap[n].size = 0x0;
>>>> -	memmap[n].type = 0x0;
>>>> +	memmap[n].type = BIOS_MAP_END;
>>>>  	n++;
>>>>
>>>>  	return (n);
>>>> blob - f1d9b97741c11f8cc4faa3f79658cd87135d2b29
>>>> blob + 7a1b3bb39cfd4651b076bf5c5e74012bdd11754e
>>>> --- usr.sbin/vmd/vm.c
>>>> +++ usr.sbin/vmd/vm.c
>>>> @@ -899,6 +899,7 @@ create_memory_map(struct vm_create_params *vcp)
>>>>  	len = LOWMEM_KB * 1024;
>>>>  	vcp->vcp_memranges[0].vmr_gpa = 0x0;
>>>>  	vcp->vcp_memranges[0].vmr_size = len;
>>>> +	vcp->vcp_memranges[0].vmr_type = VM_MEM_RAM;
>>>>  	mem_bytes -= len;
>>>>
>>>>  	/*
>>>> @@ -913,12 +914,14 @@ create_memory_map(struct vm_create_params *vcp)
>>>>  	len = MB(1) - (LOWMEM_KB * 1024);
>>>>  	vcp->vcp_memranges[1].vmr_gpa = LOWMEM_KB * 1024;
>>>>  	vcp->vcp_memranges[1].vmr_size = len;
>>>> +	vcp->vcp_memranges[1].vmr_type = VM_MEM_RESERVED;
>>>>  	mem_bytes -= len;
>>>>
>>>>  	/* If we have less than 2MB remaining, still create a 2nd BIOS area. */
>>>>  	if (mem_bytes <= MB(2)) {
>>>>  		vcp->vcp_memranges[2].vmr_gpa = VMM_PCI_MMIO_BAR_END;
>>>>  		vcp->vcp_memranges[2].vmr_size = MB(2);
>>>> +		vcp->vcp_memranges[2].vmr_type = VM_MEM_RESERVED;
>>>>  		vcp->vcp_nmemranges = 3;
>>>>  		return;
>>>>  	}
>>>> @@ -939,18 +942,27 @@ create_memory_map(struct vm_create_params *vcp)
>>>>  	/* Third memory region: area above 1MB to MMIO region */
>>>>  	vcp->vcp_memranges[2].vmr_gpa = MB(1);
>>>>  	vcp->vcp_memranges[2].vmr_size = above_1m;
>>>> +	vcp->vcp_memranges[2].vmr_type = VM_MEM_RAM;
>>>>
>>>> -	/* Fourth region: 2nd copy of BIOS above MMIO ending at 4GB */
>>>> -	vcp->vcp_memranges[3].vmr_gpa = VMM_PCI_MMIO_BAR_END + 1;
>>>> -	vcp->vcp_memranges[3].vmr_size = MB(2);
>>>> +	/* Fourth region: PCI MMIO range */
>>>> +	vcp->vcp_memranges[3].vmr_gpa = VMM_PCI_MMIO_BAR_BASE;
>>>> +	vcp->vcp_memranges[3].vmr_size = VMM_PCI_MMIO_BAR_END -
>>>> +	    VMM_PCI_MMIO_BAR_BASE + 1;
>>>> +	vcp->vcp_memranges[3].vmr_type = VM_MEM_RESERVED;
>>>>
>>>> -	/* Fifth region: any remainder above 4GB */
>>>> +	/* Fifth region: 2nd copy of BIOS above MMIO ending at 4GB */
>>>> +	vcp->vcp_memranges[4].vmr_gpa = VMM_PCI_MMIO_BAR_END + 1;
>>>> +	vcp->vcp_memranges[4].vmr_size = MB(2);
>>>> +	vcp->vcp_memranges[4].vmr_type = VM_MEM_RESERVED;
>>>> +
>>>> +	/* Sixth region: any remainder above 4GB */
>>>>  	if (above_4g > 0) {
>>>> -		vcp->vcp_memranges[4].vmr_gpa = GB(4);
>>>> -		vcp->vcp_memranges[4].vmr_size = above_4g;
>>>> +		vcp->vcp_memranges[5].vmr_gpa = GB(4);
>>>> +		vcp->vcp_memranges[5].vmr_size = above_4g;
>>>> +		vcp->vcp_memranges[5].vmr_type = VM_MEM_RAM;
>>>> +		vcp->vcp_nmemranges = 6;
>>>> +	} else
>>>>  		vcp->vcp_nmemranges = 5;
>>>> -	} else
>>>> -		vcp->vcp_nmemranges = 4;
>>>>  }
>>>>
>>>>  /*