patch vmm(4) - vmm(4): wire faulted in pages might have gone too far.

2021-04-05 Thread Adam Steen
Hi

I believe the change [1] vmm(4): wire faulted in pages, might have been
a bit heavy handed a broken the use of VMM_IOC_MPROTECT_EPT ioctl.

[1] https://marc.info/?l=openbsd-cvs&m=161144130825752

please see the patch below which restores this functionality

Cheers
Adam


diff 917cc7a95e2615798ffadc2455c3a44858a11e95 /home/adams/devl/openbsd/src
blob - d2b4c387464638441fc98898e77dd01cb9bc3250
file + sys/arch/amd64/amd64/vmm.c
--- sys/arch/amd64/amd64/vmm.c
+++ sys/arch/amd64/amd64/vmm.c
@@ -5509,8 +5509,19 @@ svm_handle_np_fault(struct vcpu *vcpu)
 int
 vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
 {
-   int ret;
+   int fault_type, ret;
 
+   fault_type = vmx_get_guest_faulttype();
+   if (fault_type == -1) {
+   printf("%s: invalid fault type\n", __func__);
+   return (EINVAL);
+   }
+
+   if (fault_type == VM_FAULT_PROTECT) {
+   vcpu->vc_exit.vee.vee_fault_type = VEE_FAULT_PROTECT;
+   return (EAGAIN);
+   }
+
ret = uvm_fault(vcpu->vc_parent->vm_map, gpa, VM_FAULT_WIRE,
PROT_READ | PROT_WRITE | PROT_EXEC);
 



Re: vmm.4: document supported ioctls

2021-04-01 Thread Adam Steen
Thanks Dave,

I like the description and listing the ioctl, and referencing vmmvar.h is a 
good idea

Cheers
Adam

On Thu, Apr 1, 2021 at 18:49, Dave Voutila  wrote:

> This diff documents the ioctl(2) values supported by vmm(4). Besides
> vmd(8) there's at least one application I've seen (Solo5) directly using
> vmm(4).
>
> For now I've tried to have short summary-level details, enough to
> provide a general idea of purpose. The data structures aren't included
> in the doc, but the mention of  at least points to
> where they're defined for now.
>
> Feedback on my mdoc macro usage or verbiage? OK?
>
> -dv
>
> Index: share/man/man4/man4.amd64/vmm.4
> ===
> RCS file: /cvs/src/share/man/man4/man4.amd64/vmm.4,v
> retrieving revision 1.5
> diff -u -p -r1.5 vmm.4
> --- share/man/man4/man4.amd64/vmm.4 6 Dec 2015 19:06:17 - 1.5
> +++ share/man/man4/man4.amd64/vmm.4 1 Apr 2021 10:36:50 -
> @@ -22,6 +22,8 @@
> .Nd virtual machine monitor
> .Sh SYNOPSIS
> .Cd "vmm0 at mainbus0"
> +.Pp
> +.In machine/vmmvar.h
> .Sh DESCRIPTION
> The
> .Nm
> @@ -67,6 +69,45 @@ driver requires at least one CPU with ha
> capabilities and nested or extended paging capabilities to be
> present on the host.
> For more information, consult the CPU vendor's documentation.
> +.Pp
> +The following
> +.Xr ioctl 2
> +commands are provided for managing
> +.Nm
> +guests:
> +.Pp
> +.Bl -tag -width Ds -offset indent
> +.It Dv VMM_IOC_CREATE Fa "struct vm_create_params *"
> +Create a VM, initializing
> +.Xr vmm 4
> +if not yet started. (Does not start the VCPU.)
> +.It Dv VMM_IOC_RUN Fa "struct vm_run_params *"
> +Run a VCPU for a defined VM. Returns on VM-exit, the VCPU stopped, or
> +an error occurred
> +.It Dv VMM_IOC_INFO Fa "struct vm_info_params *"
> +Get information about the VMs currently hosted by
> +.Xr vmm 4
> +.It Dv VMM_IOC_TERM Fa "struct vm_terminate_params *"
> +Terminate a given VM
> +.It Dv VMM_IOC_RESETCPU Fa "struct vm_resetcpu_params *"
> +Reset a VCPU to power-on-init state using the provided register state
> +.It Dv VMM_IOC_INTR Fa "struct vm_intr_params *"
> +Signal a pending interrupt for a VCPU
> +.It Dv VMM_IOC_READREGS Fa "struct vm_rwregs_params *"
> +Read registers of a VCPU
> +.It Dv VMM_IOC_WRITEREGS Fa "struct vm_rwregs_params *"
> +Write registers values of a VCPU
> +.It Dv VMM_IOC_READVMPARAMS Fa "struct vm_revmparams_params *"
> +Read paravirtualized hardware parameters (such as
> +.Xr pvclock 4
> +version) for a VM
> +.It Dv VMM_IOC_WRITEVMPARAMS Fa "struct vm_rwvmparams_params *"
> +Write paravirtualized hardware parameters (such as
> +.Xr pvclock 4
> +guest physical addresss) for a VM
> +.It Dv VMM_IOC_MPROTECT_EPT Fa "struct vm_mprotect_ept_params *"
> +Set access protections on guest page table entries
> +(only supported on hosts providing EPT or RVI)
> .Sh SEE ALSO
> .Xr cpu 4 ,
> .Xr intro 4 ,

patch: vamm(4) IA32_EPT_VPID_CAP_XO_TRANSLATIONS specified incorrectly.

2021-03-19 Thread Adam Steen
Hi

IA32_EPT_VPID_CAP_XO_TRANSLATIONS is specified incorrectly, see the
patch below.

Cheers
Adam

On Fri, Feb 26, 2021 at 01:08:17PM +0800, Adam Steen wrote:
> Hi
> 
> IA32_EPT_VPID_CAP_XO_TRANSLATIONS is specified as 0x0 and not (1ULL << 0)
> ie 0 and not bit 0 as on.
> 
> Please see the attach diff to correct this and rename
> IA32_EPT_VPID_CAP_XO_TRANSLATIONS to IA32_EPT_VPID_CAP_XO to reduce
> wordyness.
> 
> Cheers
> Adam
> 
> diff 0e7183d43c8ed36e5d169be05df61472565710eb /home/adams/devl/openbsd/src
> blob - 72104b78c733f4ce8ee91910b0b4a6f0f51c15aa
> file + sys/arch/amd64/amd64/vmm.c
> --- sys/arch/amd64/amd64/vmm.c
> +++ sys/arch/amd64/amd64/vmm.c
> @@ -916,7 +916,7 @@ vm_mprotect_ept(struct vm_mprotect_ept_params *vmep)
>   if (vmm_softc->mode == VMM_MODE_EPT) {
>   msr = rdmsr(IA32_VMX_EPT_VPID_CAP);
>   if (prot == PROT_EXEC &&
> - (msr & IA32_EPT_VPID_CAP_XO_TRANSLATIONS) == 0) {
> + (msr & IA32_EPT_VPID_CAP_XO) == 0) {
>   DPRINTF("%s: Execute only permissions unsupported,"
>  " adding read permission\n", __func__);
>  
> blob - e0232887ff69a882c00e3b4059baa2b12f047c2d
> file + sys/arch/amd64/include/specialreg.h
> --- sys/arch/amd64/include/specialreg.h
> +++ sys/arch/amd64/include/specialreg.h
> @@ -957,7 +957,7 @@
>  #define IA32_VMX_TRUE_ENTRY_CTLS 0x490
>  #define IA32_VMX_VMFUNC  0x491
>  
> -#define IA32_EPT_VPID_CAP_XO_TRANSLATIONS0x0
> +#define IA32_EPT_VPID_CAP_XO (1ULL << 0)
>  #define IA32_EPT_VPID_CAP_PAGE_WALK_4(1ULL << 6)
>  #define IA32_EPT_VPID_CAP_WB (1ULL << 14)
>  #define IA32_EPT_VPID_CAP_AD_BITS(1ULL << 21)




amd64 specialreg.h IA32_EPT_VPID_CAP_XO_TRANSLATIONS specified incorrectly.

2021-02-25 Thread Adam Steen
Hi

IA32_EPT_VPID_CAP_XO_TRANSLATIONS is specified as 0x0 and not (1ULL << 0)
ie 0 and not bit 0 as on.

Please see the attach diff to correct this and rename
IA32_EPT_VPID_CAP_XO_TRANSLATIONS to IA32_EPT_VPID_CAP_XO to reduce
wordyness.

Cheers
Adam

diff 0e7183d43c8ed36e5d169be05df61472565710eb /home/adams/devl/openbsd/src
blob - 72104b78c733f4ce8ee91910b0b4a6f0f51c15aa
file + sys/arch/amd64/amd64/vmm.c
--- sys/arch/amd64/amd64/vmm.c
+++ sys/arch/amd64/amd64/vmm.c
@@ -916,7 +916,7 @@ vm_mprotect_ept(struct vm_mprotect_ept_params *vmep)
if (vmm_softc->mode == VMM_MODE_EPT) {
msr = rdmsr(IA32_VMX_EPT_VPID_CAP);
if (prot == PROT_EXEC &&
-   (msr & IA32_EPT_VPID_CAP_XO_TRANSLATIONS) == 0) {
+   (msr & IA32_EPT_VPID_CAP_XO) == 0) {
DPRINTF("%s: Execute only permissions unsupported,"
   " adding read permission\n", __func__);
 
blob - e0232887ff69a882c00e3b4059baa2b12f047c2d
file + sys/arch/amd64/include/specialreg.h
--- sys/arch/amd64/include/specialreg.h
+++ sys/arch/amd64/include/specialreg.h
@@ -957,7 +957,7 @@
 #define IA32_VMX_TRUE_ENTRY_CTLS   0x490
 #define IA32_VMX_VMFUNC0x491
 
-#define IA32_EPT_VPID_CAP_XO_TRANSLATIONS  0x0
+#define IA32_EPT_VPID_CAP_XO   (1ULL << 0)
 #define IA32_EPT_VPID_CAP_PAGE_WALK_4  (1ULL << 6)
 #define IA32_EPT_VPID_CAP_WB   (1ULL << 14)
 #define IA32_EPT_VPID_CAP_AD_BITS  (1ULL << 21)



FW: Add mprotect_ept ioctl to vmm(4)

2020-04-07 Thread Adam Steen
> On Fri, Feb 07, 2020 at 01:25:38PM -0800, Mike Larkin wrote:
> > On Fri, Feb 07, 2020 at 04:20:16AM +0000, Adam Steen wrote:
> > > Hi
> > >
> > > Please see the attached patch to add an 'IOCTL handler to sets the access
> > > protections of the ept'
> > >
> > > vmd(8) does not make use of this change, but solo5, which uses vmm(4) as
> > > a backend hypervisor. The code calling 'VMM_IOC_MPROTECT_EPT' is
> > > available here 
> > > https://github.com/Solo5/solo5/compare/master...adamsteen:wnox
> > >
> > > there are changes to vmd too, but this is just to ensure completeness,
> > > if mprotect ept is called in the future, we would want the vm to be
> > > stopped if we get a protection fault.
> > >
> > > I was unsure what todo if called with execute only permissions on a cpu 
> > > that
> > > does not support it. I went with add read permissions and logging the
> > > fact, instead of returning EINVAL.
> > >
> > > Cheers
> > > Adam
> > >
> >
> > I have been giving Adam feedback on this diff for a while. There are a few
> > minor comments below, but I think this is ok if someone wants to commit it 
> > after
> > the fixes below are incorporated.
> >
> > -ml
> >
>
> See updated comment below.
>
> -ml



> > > + /* No W^X permissions */
> > > + if ((prot & PROT_MASK) != prot &&
> > > + (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC)) {
> > > + DPRINTF("%s: No W^X permissions\n", __func__);
> > > + return (EINVAL);
> > > + }
> >
> > I would probably reword this to "W+X permission requested".
> >



> > > +/*
> > > + * vmx_mprotect_ept
> > > + *
> > > + * apply the ept protections to the requested pages, faulting the page if
> >
> > "faulting in"
> >
> > > + * required.
> > > + */
> > > +int
> > > +vmx_mprotect_ept(vm_map_t vm_map, paddr_t sgpa, paddr_t egpa, int prot)
> > > +{
> > > + struct vmx_invept_descriptor vid;
> > > + pmap_t pmap;
> > > + pt_entry_t *pte;
> > > + paddr_t addr;
> > > + int ret = 0;
> > > > +
> > > + pmap = vm_map->pmap;
> > > +
> > > + for (addr = sgpa; addr < egpa; addr += PAGE_SIZE) {
> > > + pte = vmx_pmap_find_pte_ept(pmap, addr);
> > > + if (pte == NULL) {
> >
> > if (pte & PG_V) == 0
> >
> 
> After reading a reply from Adam, I think the original suggested way is fine.
> My idea of checking PG_V is fine for RWX EPT entries, but as soon as anyone
> uses XO entries, this check wouldn't work.
> 
> -ml

Hi

I have incorporated the fixes ml requested above and a fixed a few nits
from pd@, but with the crazyness of life these days, i haven't been able to it
commited, so i attaching it again below.

Cheers
Adam

diff refs/heads/master refs/heads/mprotect_ept
blob - b0a08291108adc4b21680e4265c520873c13ebff
blob + ce8e21406dde9a5ad4a8505b5cacf10acf4a7379
--- sys/arch/amd64/amd64/vmm.c
+++ sys/arch/amd64/amd64/vmm.c
@@ -124,6 +124,7 @@ int vm_get_info(struct vm_info_params *);
 int vm_resetcpu(struct vm_resetcpu_params *);
 int vm_intr_pending(struct vm_intr_params *);
 int vm_rwregs(struct vm_rwregs_params *, int);
+int vm_mprotect_ept(struct vm_mprotect_ept_params *);
 int vm_rwvmparams(struct vm_rwvmparams_params *, int);
 int vm_find(uint32_t, struct vm **);
 int vcpu_readregs_vmx(struct vcpu *, uint64_t, struct vcpu_reg_state *);
@@ -186,6 +187,8 @@ int svm_fault_page(struct vcpu *, paddr_t);
 int vmx_fault_page(struct vcpu *, paddr_t);
 int vmx_handle_np_fault(struct vcpu *);
 int svm_handle_np_fault(struct vcpu *);
+int vmx_mprotect_ept(vm_map_t, paddr_t, paddr_t, int);
+pt_entry_t *vmx_pmap_find_pte_ept(pmap_t, paddr_t);
 int vmm_alloc_vpid(uint16_t *);
 void vmm_free_vpid(uint16_t);
 const char *vcpu_state_decode(u_int);
@@ -494,6 +497,9 @@ vmmioctl(dev_t dev, u_long cmd, caddr_t data, int flag
case VMM_IOC_WRITEREGS:
ret = vm_rwregs((struct vm_rwregs_params *)data, 1);
break;
+   case VMM_IOC_MPROTECT_EPT:
+   ret = vm_mprotect_ept((struct vm_mprotect_ept_params *)data);
+   break;
case VMM_IOC_READVMPARAMS:
ret = vm_rwvmparams((struct vm_rwvmparams_params *)data, 0);
break;
@@ -532,6 +538,7 @@ pledge_ioctl_vmm(struct proc *p, long com)
case VMM_IOC_INTR:
case VMM_IOC_READREGS:
case VMM_IOC_WRITEREGS:
+   case VMM_IOC_MPROTECT_EPT:
case VMM_IOC

vmm(4): handle invalid writes to cr0 - patch

2020-02-17 Thread Adam Steen
Hi

Please see the patch below to handle invalid writes to cr0 as per the
Intel SDM Volume 3.

The 3 cases i am handling with this change are
1. CR0.PG: Setting the PG flag when the PE flag is clear causes a
   general-protection exception (#GP). (Intel SDM Volume 3abcd page 76, Section
   2.5 Control Registers)

2. CR0.CD and CR0.NW, CD: 0 and NW: 1 Invalid setting. Generates a
   general-protection exception (#GP) with an error code of 0.
   (Intel SDM Volume 3abcd page 438, Table 11-5. Cache Operating Modes,
   via Intel SDM Volume 3abcd page 76, See also comment from CR0.CD

3. Bits 63:32 of CR0 and CR4 are reserved and must be written with zeros.
   Writing a nonzero value to any of the upper 32 bits results in a
   general-protection exception, #GP(0). (Intel SDM Volume 3abcd page
   75, 2.5 Control Registers, Paragraph 2, bullet point 2.

Cheers
Adam

? div
Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.262
diff -u -p -u -p -r1.262 vmm.c
--- sys/arch/amd64/amd64/vmm.c  17 Feb 2020 18:16:10 -  1.262
+++ sys/arch/amd64/amd64/vmm.c  18 Feb 2020 02:59:53 -
@@ -5685,6 +5685,33 @@ vmx_handle_cr0_write(struct vcpu *vcpu, 
return (EINVAL);
}
 
+   /* 2.5 CONTROL REGISTERS CR0.PG */
+   if ((r & CR0_PG) && (r & CR0_PE) == 0) {
+   DPRINTF("%s: PG flag set when the PE flag is clear,"
+   " inject #GP, cr0=0x%llx\n", __func__, r);
+   vmm_inject_gp(vcpu);
+   return (0);
+   }
+
+   /* 
+* 11.5.1 Cache Control Registers and Bits
+* Table 11-5. Cache Operating Modes
+*/
+   if ((r & CR0_NW) && (r & CR0_CD) == 0) {
+   DPRINTF("%s: NW flag set when the CD flag is clear,"
+   " inject #GP, cr0=0x%llx\n", __func__, r);
+   vmm_inject_gp(vcpu);
+   return (0);
+   }
+
+   /* 2.5 CONTROL REGISTERS */
+   if (r & 0xULL) {
+   DPRINTF("%s: setting bits 63:32 of %%cr0 is invalid,"
+   " inject #GP, cr0=0x%llx\n", __func__, r);
+   vmm_inject_gp(vcpu);
+   return (0);
+   }
+
/* CR0 must always have NE set */
r |= CR0_NE;
 



Re: Add mprotect_ept ioctl to vmm(4)

2020-02-09 Thread Adam Steen
On Fri, Feb 07, 2020 at 04:38:16PM -0800, Mike Larkin wrote:
> 
> On Fri, Feb 07, 2020 at 01:25:38PM -0800, Mike Larkin wrote:
> > On Fri, Feb 07, 2020 at 04:20:16AM +, Adam Steen wrote:
> > > Hi
> > >
> > > Please see the attached patch to add an 'IOCTL handler to sets the access
> > > protections of the ept'
> > >
> > > vmd(8) does not make use of this change, but solo5, which uses vmm(4) as
> > > a backend hypervisor. The code calling 'VMM_IOC_MPROTECT_EPT' is
> > > available here 
> > > https://github.com/Solo5/solo5/compare/master...adamsteen:wnox
> > >
> > > there are changes to vmd too, but this is just to ensure completeness,
> > > if mprotect ept is called in the future, we would want the vm to be
> > > stopped if we get a protection fault.
> > >
> > > I was unsure what todo if called with execute only permissions on a cpu 
> > > that
> > > does not support it. I went with add read permissions and logging the
> > > fact, instead of returning EINVAL.
> > >
> > > Cheers
> > > Adam
> > >
> >
> > I have been giving Adam feedback on this diff for a while. There are a few
> > minor comments below, but I think this is ok if someone wants to commit it 
> > after
> > the fixes below are incorporated.
> >
> > -ml
> >
> 
> See updated comment below.
> 
> -ml
>

Please see the updated patch below and further comments.




> > > + /* No W^X permissions */
> > > + if ((prot & PROT_MASK) != prot &&
> > > + (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC)) {
> > > + DPRINTF("%s: No W^X permissions\n", __func__);
> > > + return (EINVAL);
> > > + }
> >
> > I would probably reword this to "W+X permission requested".
> >

updated.



> > > + /* No execute only on EPT CPUs that don't have that capability */
> > > + if (vmm_softc->mode == VMM_MODE_EPT) {
> > > + msr = rdmsr(IA32_VMX_EPT_VPID_CAP);
> > > + if (prot == PROT_EXEC &&
> > > + (msr & IA32_EPT_VPID_CAP_XO_TRANSLATIONS)) {
> > > + printf("%s: Execute only permissions unsupported,"
> > > +" adding read permission\n", __func__);
> > > + /* XXX should this return (EINVAL) */
> > > +
> > > + prot |= PROT_READ;
> > > + }
> > > + }

the (msr & IA32_EPT_VPID_CAP_XO_TRANSLATIONS) check should be
(msr & IA32_EPT_VPID_CAP_XO_TRANSLATIONS) == 0, ie only add read
permissions if the cpu does NOT support execute only.

changed the printf to DPRINTF



> > > +/*
> > > + * vmx_mprotect_ept
> > > + *
> > > + * apply the ept protections to the requested pages, faulting the page if
> >
> > "faulting in"
> >
> > > + * required.
> > > + */
> > > +int
> > > +vmx_mprotect_ept(vm_map_t vm_map, paddr_t sgpa, paddr_t egpa, int prot)

updated.



> > > + /*
> > > +  * l3idx should always be < MAXDSIZ/1GB because we don't support more
> > > +  * than MAXDSIZ guest phys mem.
> > > +  */
> > > + if (l3idx >= MAXDSIZ / ((paddr_t)1024*1024*1024))
> >
> > Spaces around *s
> >
> > > + return NULL;
> > > +

updated.



? div
Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.258
diff -u -p -u -p -r1.258 vmm.c
--- sys/arch/amd64/amd64/vmm.c  31 Jan 2020 01:51:27 -  1.258
+++ sys/arch/amd64/amd64/vmm.c  10 Feb 2020 00:45:20 -
@@ -124,6 +124,7 @@ int vm_get_info(struct vm_info_params *)
 int vm_resetcpu(struct vm_resetcpu_params *);
 int vm_intr_pending(struct vm_intr_params *);
 int vm_rwregs(struct vm_rwregs_params *, int);
+int vm_mprotect_ept(struct vm_mprotect_ept_params *);
 int vm_rwvmparams(struct vm_rwvmparams_params *, int);
 int vm_find(uint32_t, struct vm **);
 int vcpu_readregs_vmx(struct vcpu *, uint64_t, struct vcpu_reg_state *);
@@ -186,6 +187,8 @@ int svm_fault_page(struct vcpu *, paddr_
 int vmx_fault_page(struct vcpu *, paddr_t);
 int vmx_handle_np_fault(struct vcpu *);
 int svm_handle_np_fault(struct vcpu *);
+int vmx_mprotect_ept(vm_map_t, paddr_t, paddr_t, int);
+pt_entry_t *vmx_pmap_find_pte_ept(pmap_t, paddr_t);
 int vmm_alloc_vpid(uint16_t *);
 void vmm_free_vpid(uint16_t);
 const char *vcpu_state_decode(u_int);
@@ -493,6 +496,9 @@ vmmioctl(dev_t dev, u_long 

Add mprotect_ept ioctl to vmm(4)

2020-02-06 Thread Adam Steen
Hi

Please see the attached patch to add an 'IOCTL handler to sets the access
protections of the ept'

vmd(8) does not make use of this change, but solo5, which uses vmm(4) as
a backend hypervisor. The code calling 'VMM_IOC_MPROTECT_EPT' is
available here https://github.com/Solo5/solo5/compare/master...adamsteen:wnox

there are changes to vmd too, but this is just to ensure completeness,
if mprotect ept is called in the future, we would want the vm to be
stopped if we get a protection fault.

I was unsure what todo if called with execute only permissions on a cpu that
does not support it. I went with add read permissions and logging the
fact, instead of returning EINVAL.

Cheers
Adam

? div
Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.258
diff -u -p -u -p -r1.258 vmm.c
--- sys/arch/amd64/amd64/vmm.c  31 Jan 2020 01:51:27 -  1.258
+++ sys/arch/amd64/amd64/vmm.c  7 Feb 2020 03:15:16 -
@@ -124,6 +124,7 @@ int vm_get_info(struct vm_info_params *)
 int vm_resetcpu(struct vm_resetcpu_params *);
 int vm_intr_pending(struct vm_intr_params *);
 int vm_rwregs(struct vm_rwregs_params *, int);
+int vm_mprotect_ept(struct vm_mprotect_ept_params *);
 int vm_rwvmparams(struct vm_rwvmparams_params *, int);
 int vm_find(uint32_t, struct vm **);
 int vcpu_readregs_vmx(struct vcpu *, uint64_t, struct vcpu_reg_state *);
@@ -186,6 +187,8 @@ int svm_fault_page(struct vcpu *, paddr_
 int vmx_fault_page(struct vcpu *, paddr_t);
 int vmx_handle_np_fault(struct vcpu *);
 int svm_handle_np_fault(struct vcpu *);
+int vmx_mprotect_ept(vm_map_t, paddr_t, paddr_t, int);
+pt_entry_t *vmx_pmap_find_pte_ept(pmap_t, paddr_t);
 int vmm_alloc_vpid(uint16_t *);
 void vmm_free_vpid(uint16_t);
 const char *vcpu_state_decode(u_int);
@@ -493,6 +496,9 @@ vmmioctl(dev_t dev, u_long cmd, caddr_t 
case VMM_IOC_WRITEREGS:
ret = vm_rwregs((struct vm_rwregs_params *)data, 1);
break;
+   case VMM_IOC_MPROTECT_EPT:
+   ret = vm_mprotect_ept((struct vm_mprotect_ept_params *)data);
+   break;
case VMM_IOC_READVMPARAMS:
ret = vm_rwvmparams((struct vm_rwvmparams_params *)data, 0);
break;
@@ -531,6 +537,7 @@ pledge_ioctl_vmm(struct proc *p, long co
case VMM_IOC_INTR:
case VMM_IOC_READREGS:
case VMM_IOC_WRITEREGS:
+   case VMM_IOC_MPROTECT_EPT:
case VMM_IOC_READVMPARAMS:
case VMM_IOC_WRITEVMPARAMS:
return (0);
@@ -806,6 +813,288 @@ vm_rwregs(struct vm_rwregs_params *vrwp,
 }
 
 /*
+ * vm_mprotect_ept
+ *
+ * IOCTL handler to sets the access protections of the ept
+ *
+ * Parameters:
+ *   vmep: decribes the memory for which the protect will be applied..
+ *
+ * Return values:
+ *  0: if successful
+ *  ENOENT: if the VM defined by 'vmep' cannot be found
+ *  EINVAL: if the sgpa or size is not page aligned, the prot is invalid,
+ *  size is too large (512GB), there is wraparound
+ *  (like start = 512GB-1 and end = 512GB-2),
+ *  the address specified is not within the vm's mem range
+ *  or the address lies inside reserved (MMIO) memory
+ */
+int
+vm_mprotect_ept(struct vm_mprotect_ept_params *vmep)
+{
+   struct vm *vm;
+   struct vcpu *vcpu;
+   vaddr_t sgpa;
+   size_t size;
+   vm_prot_t prot;
+   uint64_t msr;
+   int ret, memtype;
+
+   /* If not EPT or RVI, nothing to do here */
+   if (!(vmm_softc->mode == VMM_MODE_EPT
+   || vmm_softc->mode == VMM_MODE_RVI))
+   return (0);
+
+   /* Find the desired VM */
+   rw_enter_read(&vmm_softc->vm_lock);
+   ret = vm_find(vmep->vmep_vm_id, &vm);
+   rw_exit_read(&vmm_softc->vm_lock);
+
+   /* Not found? exit. */
+   if (ret != 0) {
+   DPRINTF("%s: vm id %u not found\n", __func__,
+   vmep->vmep_vm_id);
+   return (ret);
+   }
+
+   rw_enter_read(&vm->vm_vcpu_lock);
+   SLIST_FOREACH(vcpu, &vm->vm_vcpu_list, vc_vcpu_link) {
+   if (vcpu->vc_id == vmep->vmep_vcpu_id)
+   break;
+   }
+   rw_exit_read(&vm->vm_vcpu_lock);
+
+   if (vcpu == NULL) {
+   DPRINTF("%s: vcpu id %u of vm %u not found\n", __func__,
+   vmep->vmep_vcpu_id, vmep->vmep_vm_id);
+   return (ENOENT);
+   }
+
+   if (vcpu->vc_state != VCPU_STATE_STOPPED) {
+   DPRINTF("%s: mprotect_ept %u on vm %u attempted "
+   "while vcpu was in state %u (%s)\n", __func__,
+   vmep->vmep_vcpu_id, vmep->vmep_vm_id, vcpu->vc_state,
+   vcpu_state_decode(vcpu->vc_state));
+
+   return (EBUSY);
+   }
+
+   /* Only proceed if the pmap is in the correct mode */
+   KASSERT((vmm_softc->mode == VMM_MODE_EPT &&
+   vm->vm_map->

vmm(4) patch - iniatialise eptp to zero for vmx like svm

2020-02-05 Thread Adam Steen
Hi

Again while working on a larger patch i noticed that the eptp for vmx
was not getting initialised to zero like the svm code path, as part of
a VMM_IOC_RESETCPU ioctl call.

please see the attach patch to initialise eptp to zero

cheers
Adam

? div
Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.258
diff -u -p -u -p -r1.258 vmm.c
--- sys/arch/amd64/amd64/vmm.c  31 Jan 2020 01:51:27 -  1.258
+++ sys/arch/amd64/amd64/vmm.c  6 Feb 2020 02:18:30 -
@@ -2895,6 +2895,8 @@ vcpu_reset_regs_vmx(struct vcpu *vcpu, s
/* xcr0 power on default sets bit 0 (x87 state) */
vcpu->vc_gueststate.vg_xcr0 = XCR0_X87 & xsave_mask;
 
+   vcpu->vc_parent->vm_map->pmap->eptp = 0;
+
 exit:
/* Flush the VMCS */
if (vmclear(&vcpu->vc_control_pa)) {




Remove unused code from vmm

2020-01-30 Thread Adam Steen
 Hi

 While working on a patch, i noticed that vmm_get_guest_faulttype was
 incorrect for amd (VMM_MODE_RVI) cpus, apon further inspection realised
 it was unused. Please see the patch below to remove it.

cheers
Adam

? div
Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.257
diff -u -p -u -p -r1.257 vmm.c
--- sys/arch/amd64/amd64/vmm.c  13 Dec 2019 03:38:15 -  1.257
+++ sys/arch/amd64/amd64/vmm.c  30 Jan 2020 06:47:41 -
@@ -177,7 +177,6 @@ void vmx_handle_intr(struct vcpu *);
 void vmx_handle_intwin(struct vcpu *);
 void vmx_handle_misc_enable_msr(struct vcpu *);
 int vmm_get_guest_memtype(struct vm *, paddr_t);
-int vmm_get_guest_faulttype(void);
 int vmx_get_guest_faulttype(void);
 int svm_get_guest_faulttype(struct vmcb *);
 int vmx_get_exit_qualification(uint64_t *);
@@ -5073,23 +5072,6 @@ vmm_get_guest_memtype(struct vm *vm, pad
 
DPRINTF("guest memtype @ 0x%llx unknown\n", (uint64_t)gpa);
return (VMM_MEM_TYPE_UNKNOWN);
-}
-
-/*
- * vmm_get_guest_faulttype
- *
- * Determines the type (R/W/X) of the last fault on the VCPU last run on
- * this PCPU. Calls the appropriate architecture-specific subroutine.
- */
-int
-vmm_get_guest_faulttype(void)
-{
-   if (vmm_softc->mode == VMM_MODE_EPT)
-   return vmx_get_guest_faulttype();
-   else if (vmm_softc->mode == VMM_MODE_RVI)
-   return vmx_get_guest_faulttype();
-   else
-   panic("%s: unknown vmm mode: %d", __func__, vmm_softc->mode);
 }
 
 /*



[WIP] mprotect_ept for vmm(4)

2020-01-05 Thread Adam Steen
Hi

I have been working on this diff on and off for a while now, mlarkin was
able to give me lots of tips, but now i am stuck, so i thought i would
ask for nits, tips or even your doing it wrong.

The code below causes the vm to triple fault without the
((*pte & EPT_WB) == EPT_WB) check in vmx_mprotect_ept, some calls pass
this check and i am able to modify the ept, other fail, and are skipped.

I am just not sure i am looking up the ept entry correctly, any help
here would be grealy appreciated.

The kernel code is called from solo5[1] as follows.

if(mprotect(vaddr_start, size, prot) == -1)
return -1;

ret = 0;

struct hvt_b *hvb = hvt->b;
struct vm_mprotect_ept_params *vmep;

vmep = calloc(1, sizeof (struct vm_mprotect_ept_params));
if (vmep == NULL) {
warn("calloc");
return -1;
}

vmep->vmep_vm_id = hvb->vcp_id;
vmep->vmep_vcpu_id = hvb->vcpu_id;

// sgpa = vmr->vmr_gpa(0x0) + (addr - vmr->vmr_va)
// vmep->vmep_sgpa = (vaddr_t)0x0 + (vaddr_t)(vaddr_start - hvt->mem);
vmep->vmep_sgpa = addr_start;
vmep->vmep_size = size;
vmep->vmep_prot = prot;

if (ioctl(hvb->vmd_fd, VMM_IOC_MPROTECT_EPT, vmep) < 0) {
warn("mprotect ept vmm ioctl failed - exiting");
ret = -1;
}

Cheers
Adam

[1]
https://github.com/adamsteen/solo5/blob/wnox/tenders/hvt/hvt_openbsd.c#L169-L225


? div
Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.257
diff -u -p -u -p -r1.257 vmm.c
--- sys/arch/amd64/amd64/vmm.c  13 Dec 2019 03:38:15 -  1.257
+++ sys/arch/amd64/amd64/vmm.c  6 Jan 2020 03:30:39 -
@@ -41,7 +41,7 @@
 #include 
 #include 
 
-/* #define VMM_DEBUG */
+#define VMM_DEBUG
 
 void *l1tf_flush_region;
 
@@ -124,6 +124,7 @@ int vm_get_info(struct vm_info_params *)
 int vm_resetcpu(struct vm_resetcpu_params *);
 int vm_intr_pending(struct vm_intr_params *);
 int vm_rwregs(struct vm_rwregs_params *, int);
+int vm_mprotect_ept(struct vm_mprotect_ept_params *);
 int vm_rwvmparams(struct vm_rwvmparams_params *, int);
 int vm_find(uint32_t, struct vm **);
 int vcpu_readregs_vmx(struct vcpu *, uint64_t, struct vcpu_reg_state *);
@@ -201,6 +202,8 @@ void vmx_setmsrbw(struct vcpu *, uint32_
 void vmx_setmsrbrw(struct vcpu *, uint32_t);
 void svm_set_clean(struct vcpu *, uint32_t);
 void svm_set_dirty(struct vcpu *, uint32_t);
+int vmx_mprotect_ept(vm_map_t, paddr_t, paddr_t, int);
+pt_entry_t * vmx_pmap_find_pte_ept(pmap_t, paddr_t);
 
 void vmm_init_pvclock(struct vcpu *, paddr_t);
 int vmm_update_pvclock(struct vcpu *);
@@ -225,6 +228,9 @@ void vmm_decode_efer_value(uint64_t);
 void vmm_decode_rflags(uint64_t);
 void vmm_decode_misc_enable_value(uint64_t);
 const char *vmm_decode_cpu_mode(struct vcpu *);
+void dump_requested_vmx_mprotect_ept(int prot);
+void vmx_dump_pte_prot(pt_entry_t *pte);
+void vmx_dump_pte_after(pt_entry_t *pte);
 
 extern int mtrr2mrt(int);
 
@@ -494,6 +500,9 @@ vmmioctl(dev_t dev, u_long cmd, caddr_t 
case VMM_IOC_WRITEREGS:
ret = vm_rwregs((struct vm_rwregs_params *)data, 1);
break;
+   case VMM_IOC_MPROTECT_EPT:
+   ret = vm_mprotect_ept((struct vm_mprotect_ept_params *)data);
+   break;
case VMM_IOC_READVMPARAMS:
ret = vm_rwvmparams((struct vm_rwvmparams_params *)data, 0);
break;
@@ -532,6 +541,7 @@ pledge_ioctl_vmm(struct proc *p, long co
case VMM_IOC_INTR:
case VMM_IOC_READREGS:
case VMM_IOC_WRITEREGS:
+   case VMM_IOC_MPROTECT_EPT:
case VMM_IOC_READVMPARAMS:
case VMM_IOC_WRITEVMPARAMS:
return (0);
@@ -807,6 +817,261 @@ vm_rwregs(struct vm_rwregs_params *vrwp,
 }
 
 /*
+ * vm_mprotect_ept
+ *
+ * IOCTL handler to sets the access protections of the ept
+ *
+ * Parameters:
+ *   vmep: decribes the memory for which the protect will be applied..
+ *
+ * Return values:
+ *  0: if successful
+ *  ENOENT: if the VM defined by 'vmep' cannot be found
+ *  EINVAL: if the sgpa or size is not page aligned, the prot is WX or RWX,
+ *  size is too large (512GB), there is wraparound
+ *  (like start = 512GB-1 and end = 512GB-2),
+ *  the address specified is not within the vm's mem range
+ *  or the address lies inside reserved (MMIO) memory
+ */
+int
+vm_mprotect_ept(struct vm_mprotect_ept_params *vmep)
+{
+   struct vm *vm;
+   struct vcpu *vcpu;
+   struct vm_mem_range *vmr;
+   vaddr_t sgpa;
+   size_t size;
+   vm_prot_t prot;
+   int i, ret, mem_type;
+
+   /* Find the desired VM */
+   rw_enter_read(&vmm_softc->vm_lock);
+   ret = vm_find(vmep->vmep_vm_id, &vm);
+   rw_exit_read(&vmm_softc->vm_lock);
+
+   /* Not found? exit. */
+   if (ret != 0) {
+   DPRINTF("%s: vm id %u not found\n", __func__,
+   vmep-

Packet loss / ENOBUFs with kqueue(2) and tap(4)

2019-08-28 Thread Adam Steen
Hi

I am experiencing packet loss and ENOBUFs, I have a program with
with two tap(4) interfaces and i am using kqueue(2) to determine
when and which tap interface to process the packet. I
email bugs@ over a month ago with no reply
https://marc.info/?l=openbsd-bugs&m=156229879107900&w=2
so i thought i would try here, paraphrased below.

Back Story: Over the last few years i have ported Solo5, used by
MirageOS and others to run on OpenBSD's vmm(4).
In Solo5 we have been working towards supporting multiple network
interfaces and implemented this using kqueue(2) and tap(4).

The Problem is demonstrated as follows.
setting up two Tap interfaces,
starting up the Unikernel(Solo5) on vmm.
In another session flood pinging the first Tap interface,
Solo5 handles this with no packets dropped.
In another session ping the second Tap interface,
then for every ping to the second interface a packet is dropped
on the first.
If you switch to a flood ping on the second tab interface,
you will observe massive packet loss on both interfaces, and
ping complaining about No buffer space available (ENOBUFS).

see https://github.com/Solo5/solo5/issues/374 for more information.

I have been able to reproduce this in a hacked up exampled program,
available here https://github.com/adamsteen/test_net_2if. Please note
this is hacked, generally butchered program, which demonstrates the
problem. (if required i can try and clean up this test case)

01. git clone https://github.com/adamsteen/test_net_2if
02. cd test_net_2if
03. make
04. doas setup.sh (Setup up the Tap interfaces)
05. doas ./test_net_2if
06. in another seesion start a flood ping
doas ping -f 10.0.0.2
07. Observe that the flood ping is functioning correctly,
with no packets dropped.
08. In another session, start a normal ping
ping 10.1.0.2
09. Observe that, for each ping sent to service1, a packet is dropped.
10. Kill the normal ping
11. start a flood ping
doas ping -f 10.1.0.2
12. Observe massive packet loss on both interfaces, and ping
complaining about No buffer space available (ENOBUFS).

Cheers
Adam

ps Sorry for the poor english.

dmesg:
OpenBSD 6.5-current (GENERIC.MP) #123: Sat Jun 29 19:39:46 AWST 2019
ast...@x220.adamsteen.com.au:/sys/arch/amd64/compile/GENERIC.MP
real mem = 17041059840 (16251MB)
avail mem = 16514461696 (15749MB)
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 2.6 @ 0xdae9c000 (64 entries)
bios0: vendor LENOVO version "8DET69WW (1.39 )" date 07/18/2013
bios0: LENOVO 4291N58
acpi0 at bios0: ACPI 4.0
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SLIC SSDT SSDT SSDT HPET APIC MCFG ECDT ASF! TCPA SSDT 
SSDT \
DMAR UEFI UEFI UEFI
acpi0: wakeup devices LID_(S3) SLPB(S3) IGBE(S4) EXP4(S4) EXP7(S4) EHC1(S3) 
EHC2(S3) \
HDEF(S4) acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 14318179 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz, 2492.31 MHz, 06-2a-07
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,D
 \
S,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2
 \
,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,NXE,RDTS
 \
CP,LONG,LAHF,PERF,ITSC,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN

cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.1.2, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz, 2491.91 MHz, 06-2a-07
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,D
 \
S,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2
 \
,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,NXE,RDTS
 \
CP,LONG,LAHF,PERF,ITSC,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN

cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
ioapic0 at mainbus0: apid 2 pa 0xfec0, version 20, 24 pins
acpimcfg0 at acpi0
acpimcfg0: addr 0xf800, bus 0-63
acpiec0 at acpi0
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus -1 (PEG_)
acpiprt2 at acpi0: bus 2 (EXP1)
acpiprt3 at acpi0: bus 3 (EXP2)
acpiprt4 at acpi0: bus 5 (EXP4)
acpiprt5 at acpi0: bus 13 (EXP5)
acpiprt6 at acpi0: bus -1 (EXP7)
acpicpu0 at acpi0: C3(350@104 io@0x415), C1(1000@1 halt), PSS
acpicpu1 at acpi0: C3(350@104 io@0x415), C1(1000@1 halt), PSS
acpipwrres0 at acpi0: PUBS, resource for EHC1, EHC2
acpitz0 at acpi0: critical temperature is 99 degC
acpibtn0 at acpi0: LID_
acpibtn1 at acpi0: SLPB
acpipci0 at acpi0 PCI0: 0x 0x0011 0x0001
acpicmos0 at acpi0
acpibat0 at acpi0: BAT0 model "42T4861" serial 16466 type LION oem "SANYO"
acpiac0 at acpi0: AC unit onli

Allow vmd to specify which ports it can handle

2018-10-23 Thread Adam Steen
Hi tech

The following diff allows vmd to specify which ports it can handle
or fix "XXX something better than a hardcoded list here, maybe configure via
vmd via the device list in vm create params?"

There are currently two implementation of bsearch in the kernel and this patch
would add a third, I think these should be consolidated, but i didn't know
where to put the new function.
the implementations are in:
  - ieee80211_regdomain.c
  - sys/kern/kern_pledge.c

Cheers
Adam

Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.221
diff -u -p -u -p -r1.221 vmm.c
--- sys/arch/amd64/amd64/vmm.c  7 Oct 2018 22:43:06 -   1.221
+++ sys/arch/amd64/amd64/vmm.c  23 Oct 2018 11:37:56 -
@@ -114,6 +114,7 @@ int vmmioctl(dev_t, u_long, caddr_t, int
 int vmmclose(dev_t, int, int, struct proc *);
 int vmm_start(void);
 int vmm_stop(void);
+int vmm_compare_vei_port(const void *a, const void *b);
 size_t vm_create_check_mem_ranges(struct vm_create_params *);
 int vm_create(struct vm_create_params *, struct proc *);
 int vm_run(struct vm_run_params *);
@@ -1116,6 +1117,10 @@ vm_create(struct vm_create_params *vcp,
return (ret);
}
rw_enter_write(&vm->vm_vcpu_lock);
+
+   vcpu->vc_nportranges = vcp->vcp_nportranges;
+   memcpy(vcpu->vc_portranges, vcp->vcp_portranges,
+   vcpu->vc_nportranges * sizeof(vcp->vcp_portranges[0]));
vcpu->vc_id = vm->vm_vcpu_ct;
vm->vm_vcpu_ct++;
SLIST_INSERT_HEAD(&vm->vm_vcpu_list, vcpu, vc_vcpu_link);
@@ -5059,6 +5064,38 @@ vmm_get_guest_cpu_mode(struct vcpu *vcpu
 }

 /*
+ * XXX this should be consolidated in the kernel
+ * see
+ *  - ieee80211_regdomain.c
+ *  - sys/kern/kern_pledge.c
+ */
+#ifndef bsearch
+const void *vmm_bsearch(const void *, const void *, size_t, size_t,
+int (*)(const void *, const void *));
+
+const void *
+vmm_bsearch(const void *key, const void *base0, size_t nmemb, size_t size,
+int (*compar)(const void *, const void *))
+{
+   const char *base = base0;
+   int lim, cmp;
+   const void *p;
+
+   for (lim = nmemb; lim != 0; lim >>= 1) {
+   p = base + (lim >> 1) * size;
+   cmp = (*compar)(key, p);
+   if (cmp == 0)
+   return ((const void *)p);
+   if (cmp > 0) {  /* key > p: move right */
+   base = (const char *)p + size;
+   lim--;
+   } /* else move left */
+   }
+   return (NULL);
+}
+#endif
+
+/*
  * svm_handle_inout
  *
  * Exit handler for IN/OUT instructions.
@@ -5113,28 +5150,15 @@ svm_handle_inout(struct vcpu *vcpu)
vcpu->vc_gueststate.vg_rip += insn_length;

/*
-* The following ports usually belong to devices owned by vmd.
+* The ports specified by vc_portranges belong to devices owned by vmd.
 * Return EAGAIN to signal help needed from userspace (vmd).
 * Return 0 to indicate we don't care about this port.
-*
-* XXX something better than a hardcoded list here, maybe
-* configure via vmd via the device list in vm create params?
 */
-   switch (vcpu->vc_exit.vei.vei_port) {
-   case IO_ICU1 ... IO_ICU1 + 1:
-   case 0x40 ... 0x43:
-   case PCKBC_AUX:
-   case IO_RTC ... IO_RTC + 1:
-   case IO_ICU2 ... IO_ICU2 + 1:
-   case 0x3f8 ... 0x3ff:
-   case ELCR0 ... ELCR1:
-   case 0x500 ... 0x50f:
-   case 0xcf8:
-   case 0xcfc ... 0xcff:
-   case VMM_PCI_IO_BAR_BASE ... VMM_PCI_IO_BAR_END:
+   if (vmm_bsearch(&vcpu->vc_exit.vei.vei_port, vcpu->vc_portranges,
+   vcpu->vc_nportranges, sizeof(struct vm_port_range),
+   vmm_compare_vei_port)) {
ret = EAGAIN;
-   break;
-   default:
+   } else {
/* Read from unsupported ports returns FFs */
if (vcpu->vc_exit.vei.vei_dir == 1) {
switch(vcpu->vc_exit.vei.vei_size) {
@@ -5206,28 +5230,15 @@ vmx_handle_inout(struct vcpu *vcpu)
vcpu->vc_gueststate.vg_rip += insn_length;

/*
-* The following ports usually belong to devices owned by vmd.
+* The ports specified by vc_portranges belong to devices owned by vmd.
 * Return EAGAIN to signal help needed from userspace (vmd).
 * Return 0 to indicate we don't care about this port.
-*
-* XXX something better than a hardcoded list here, maybe
-* configure via vmd via the device list in vm create params?
 */
-   switch (vcpu->vc_exit.vei.vei_port) {
-   case IO_ICU1 ... IO_ICU1 + 1:
-   case 0x40 ... 0x43:
-   case PCKBC_AUX:
-   case IO_RTC ... IO_RTC + 1:
-   case IO_ICU2 ... IO_ICU2 + 1:
-   case 0x3f8 ... 0x3ff:
-   ca

Re: TSC timecounters

2017-10-07 Thread Adam Steen
On Sun, Oct 8, 2017 at 6:03 AM, Theo de Raadt  wrote:

> > Adam will correct me if I'm wrong, but his idea was to provide clock
> > emulation to the operating system running in userland (solo5/unikernel).
> > Perhaps vmd can make use of this interface too.
>
> But why does it matter if it knows the frequency?
>

Hi Mike, Theo

yes, Mike you are correct.

To provide more background, quoting the solo5 readme[https://github.com/
Solo5/solo5], "Solo5 is most useful as a "base layer" to run MirageOS
unikernels, either on various existing hypervisors (KVM/QEMU, bhyve) or on
a specialized "unikernel monitor" called ukvm."

With the Help of Mike B and Mile L, i have been successfully running solo5
test kernels on ukvm (userland) and vmm (OpenBSD kernel) locally.

ukvm needs to know the tsc frequency, so it can pass this information to a
Solo5 unikernel which then uses rdtsc() to implement a clock.

the following links to source code provide some more information
solo5 unikernel clock implementation
https://github.com/Solo5/solo5/blob/f8a277f83807333685742228ffef0d
87270207cf/kernel/ukvm/tscclock.c#L60-L106

solo5 unikernel clock initialisation
https://github.com/Solo5/solo5/blob/9bad601c79ba7a549c87d22aaa970d
695e81e188/kernel/ukvm/time.c#L25
ukvm on OpenBSD setting up the tsc frequency
https://github.com/adamsteen/solo5/blob/master/ukvm/ukvm_
hv_openbsd_x86_64.c#L137
and

https://github.com/adamsteen/solo5/blob/200d7c4d4feb7aca62a3bae61f2ab1ec4bdd5bc9/ukvm/ukvm_hv_openbsd_x86_64.c#L52-L73





Cheers
Adam


Re: TSC timecounters

2017-10-07 Thread Adam Steen
On Sat, Oct 07, 2017 at 06:27:53PM +0800, Adam Steen wrote:
> On Sat, Oct 7, 2017 at 5:52 PM, Adam Steen  wrote:
> 
> > On Fri, Oct 06, 2017 at 03:58:18PM +0200, Mike Belopuhov wrote:
> > > Hi,
> > >
> > > An experimental change to use TSC as a timecounter source on a variety
> > > of modern Intel and AMD CPUs has been just committed and enabled on
> > > OpenBSD/amd64 thanks to the work done by Adam Steen.
> > >
> > > The rationale is, quoting the commit message:
> > >
> > >   If frequency of an invariant (non-stop) time stamp counter is measured
> > >   using an independent working timecounter that has a known frequency, we
> > >   can assume that the measured TSC frequency is as good as the resolution
> > >   of the timecounter that we use to perform the measurement. This lets us
> > >   switch from this high quality but expensive source to the cheaper TSC
> > >   without sacrificing precision on a wide range of modern CPUs.
> > >
> > > You can query and change the current timecounter source in the runtime
> > > via sysctl:
> > >
> > >   % sysctl kern.timecounter.{choice,hardware}
> > >   kern.timecounter.choice=i8254(0) tsc(2000) acpihpet0(1000)
> > acpitimer0(1000) dummy(-100)
> > >   kern.timecounter.hardware=tsc
> > >
> > > Please make sure your NTP drift (/var/db/ntpd.drift) stays within
> > -20..+20
> > > or at least is not worse than it is right now.
> > >
> > > And finally, please make sure to run a "make config" when building the
> > > kernel to update offset tables because of the cpu_info structure changes.
> > >
> > > Regards,
> > > Mike
> > >
> >
> > Hi,
> >
> > Now that we have an accurate tsc frequency, I would like to expose this
> > information to userland via a sysctl.
> >
> > The diff below exposes the tsc frequency and if it is invariant.
> >
> > Cheers,
> > Adam
> >
> >
> Please ignore that diff, looks like i had some dregs from a older diff i
> had been testing

Hi,

Take two, please see the diff below making tsc frequency and tsc is
invariant available to userland via a machdep sysctl.

I need to tsc frequency for my port of Solo5(ukvm) to OpenBSD vmm, the time
counter used by a solo5 is based on the tsc frequency.

Cheers
Adam

Index: sys/arch/amd64/amd64/machdep.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/machdep.c,v
retrieving revision 1.231
diff -u -p -u -p -r1.231 machdep.c
--- sys/arch/amd64/amd64/machdep.c  12 Jul 2017 06:26:32 -  1.231
+++ sys/arch/amd64/amd64/machdep.c  7 Oct 2017 10:50:31 -
@@ -425,6 +425,8 @@ int
 cpu_sysctl(int *name, u_int namelen, void *oldp, size_t *oldlenp, void *newp,
 size_t newlen, struct proc *p)
 {
+   extern uint64_t tsc_frequency;
+   extern int  tsc_is_invariant;
extern int amd64_has_xcrypt;
dev_t consdev;
dev_t dev;
@@ -496,6 +498,12 @@ cpu_sysctl(int *name, u_int namelen, voi
pckbc_release_console();
return (error);
 #endif
+   case CPU_TSCFREQ:
+   return (sysctl_rdquad(oldp, oldlenp, newp,
+   tsc_frequency));
+   case CPU_INVARIANTTSC:
+   return (sysctl_rdint(oldp, oldlenp, newp,
+   tsc_is_invariant));
default:
return (EOPNOTSUPP);
}
Index: sys/arch/amd64/include/cpu.h
===
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.115
diff -u -p -u -p -r1.115 cpu.h
--- sys/arch/amd64/include/cpu.h6 Oct 2017 13:33:53 -   1.115
+++ sys/arch/amd64/include/cpu.h7 Oct 2017 10:50:33 -
@@ -426,7 +426,9 @@ void mp_setperf_init(void);
 #define CPU_XCRYPT 12  /* supports VIA xcrypt in userland */
 #define CPU_LIDACTION  14  /* action caused by lid close */
 #define CPU_FORCEUKBD  15  /* Force ukbd(4) as console keyboard */
-#define CPU_MAXID  16  /* number of valid machdep ids */
+#define CPU_TSCFREQ16  /* tsc frequency */
+#define CPU_INVARIANTTSC   17  /* tsc is invariant */
+#define CPU_MAXID  18  /* number of valid machdep ids */
 
 #defineCTL_MACHDEP_NAMES { \
{ 0, 0 }, \
@@ -445,6 +447,8 @@ void mp_setperf_init(void);
{ 0, 0 }, \
{ "lidaction", CTLTYPE_INT }, \
{ "forceukbd", CTLTYPE_INT }, \
+   { "tscfreq", CTLTYPE_QUAD }, \
+   { "invarianttsc", CTLTYPE_INT }, \
 }
 
 /*



Re: TSC timecounters

2017-10-07 Thread Adam Steen
On Sat, Oct 7, 2017 at 5:52 PM, Adam Steen  wrote:

> On Fri, Oct 06, 2017 at 03:58:18PM +0200, Mike Belopuhov wrote:
> > Hi,
> >
> > An experimental change to use TSC as a timecounter source on a variety
> > of modern Intel and AMD CPUs has been just committed and enabled on
> > OpenBSD/amd64 thanks to the work done by Adam Steen.
> >
> > The rationale is, quoting the commit message:
> >
> >   If frequency of an invariant (non-stop) time stamp counter is measured
> >   using an independent working timecounter that has a known frequency, we
> >   can assume that the measured TSC frequency is as good as the resolution
> >   of the timecounter that we use to perform the measurement. This lets us
> >   switch from this high quality but expensive source to the cheaper TSC
> >   without sacrificing precision on a wide range of modern CPUs.
> >
> > You can query and change the current timecounter source in the runtime
> > via sysctl:
> >
> >   % sysctl kern.timecounter.{choice,hardware}
> >   kern.timecounter.choice=i8254(0) tsc(2000) acpihpet0(1000)
> acpitimer0(1000) dummy(-100)
> >   kern.timecounter.hardware=tsc
> >
> > Please make sure your NTP drift (/var/db/ntpd.drift) stays within
> -20..+20
> > or at least is not worse than it is right now.
> >
> > And finally, please make sure to run a "make config" when building the
> > kernel to update offset tables because of the cpu_info structure changes.
> >
> > Regards,
> > Mike
> >
>
> Hi,
>
> Now that we have an accurate tsc frequency, I would like to expose this
> information to userland via a sysctl.
>
> The diff below exposes the tsc frequency and if it is invariant.
>
> Cheers,
> Adam
>
>
Please ignore that diff, looks like i had some dregs from a older diff i
had been testing


Re: TSC timecounters

2017-10-07 Thread Adam Steen
On Fri, Oct 06, 2017 at 03:58:18PM +0200, Mike Belopuhov wrote:
> Hi,
> 
> An experimental change to use TSC as a timecounter source on a variety
> of modern Intel and AMD CPUs has been just committed and enabled on
> OpenBSD/amd64 thanks to the work done by Adam Steen.
> 
> The rationale is, quoting the commit message:
> 
>   If frequency of an invariant (non-stop) time stamp counter is measured
>   using an independent working timecounter that has a known frequency, we
>   can assume that the measured TSC frequency is as good as the resolution
>   of the timecounter that we use to perform the measurement. This lets us
>   switch from this high quality but expensive source to the cheaper TSC
>   without sacrificing precision on a wide range of modern CPUs.
> 
> You can query and change the current timecounter source in the runtime
> via sysctl:
> 
>   % sysctl kern.timecounter.{choice,hardware}
>   kern.timecounter.choice=i8254(0) tsc(2000) acpihpet0(1000) acpitimer0(1000) 
> dummy(-100)
>   kern.timecounter.hardware=tsc
> 
> Please make sure your NTP drift (/var/db/ntpd.drift) stays within -20..+20
> or at least is not worse than it is right now.
> 
> And finally, please make sure to run a "make config" when building the
> kernel to update offset tables because of the cpu_info structure changes.
> 
> Regards,
> Mike
> 

Hi,

Now that we have an accurate tsc frequency, I would like to expose this
information to userland via a sysctl.

The diff below exposes the tsc frequency and if it is invariant.

Cheers,
Adam

Index: sys/arch/amd64/amd64/machdep.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/machdep.c,v
retrieving revision 1.231
diff -u -p -u -p -r1.231 machdep.c
--- sys/arch/amd64/amd64/machdep.c  12 Jul 2017 06:26:32 -  1.231
+++ sys/arch/amd64/amd64/machdep.c  7 Oct 2017 09:25:24 -
@@ -425,6 +425,8 @@ int
 cpu_sysctl(int *name, u_int namelen, void *oldp, size_t *oldlenp, void *newp,
 size_t newlen, struct proc *p)
 {
+   extern uint64_t amd64_tsc_frequency;
+   extern int amd64_has_invariant_tsc;
extern int amd64_has_xcrypt;
dev_t consdev;
dev_t dev;
@@ -496,6 +498,12 @@ cpu_sysctl(int *name, u_int namelen, voi
pckbc_release_console();
return (error);
 #endif
+   case CPU_TSCFREQ:
+   return (sysctl_rdquad(oldp, oldlenp, newp,
+   amd64_tsc_frequency));
+   case CPU_INVARIANTTSC:
+   return (sysctl_rdint(oldp, oldlenp, newp,
+   amd64_has_invariant_tsc));
default:
return (EOPNOTSUPP);
}
Index: sys/arch/amd64/include/cpu.h
===
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.115
diff -u -p -u -p -r1.115 cpu.h
--- sys/arch/amd64/include/cpu.h6 Oct 2017 13:33:53 -   1.115
+++ sys/arch/amd64/include/cpu.h7 Oct 2017 09:25:26 -
@@ -426,7 +426,9 @@ void mp_setperf_init(void);
 #define CPU_XCRYPT 12  /* supports VIA xcrypt in userland */
 #define CPU_LIDACTION  14  /* action caused by lid close */
 #define CPU_FORCEUKBD  15  /* Force ukbd(4) as console keyboard */
-#define CPU_MAXID  16  /* number of valid machdep ids */
+#define CPU_TSCFREQ16  /* tsc frequency */
+#define CPU_INVARIANTTSC   17  /* has invariant tsc */
+#define CPU_MAXID  18  /* number of valid machdep ids */
 
 #defineCTL_MACHDEP_NAMES { \
{ 0, 0 }, \
@@ -445,6 +447,8 @@ void mp_setperf_init(void);
{ 0, 0 }, \
{ "lidaction", CTLTYPE_INT }, \
{ "forceukbd", CTLTYPE_INT }, \
+   { "tscfreq", CTLTYPE_QUAD }, \
+   { "invarianttsc", CTLTYPE_INT }, \
 }
 
 /*



Re: Improve the accuracy of the TSC frequency calibration - Updated Patch

2017-08-27 Thread Adam Steen
On Fri, Aug 25, 2017 at 12:43:44PM +0200, Mike Belopuhov wrote:
> On Fri, Aug 25, 2017 at 00:40 -0700, Mike Larkin wrote:
> > On Thu, Aug 24, 2017 at 12:39:33PM +0800, Adam Steen wrote:
> > > On Thu, Aug 24, 2017 at 2:35 AM, Mike Larkin  wrote:
> > > > On Wed, Aug 23, 2017 at 09:29:15PM +0800, Adam Steen wrote:
> > > >>
> > > >> Thank you Mike on the feedback on the last patch, please see the diff
> > > >> below, update with your input and style(9)
> > > >>
> > > >> I have continued to use tsc as my timecounter and /var/db/ntpd.driff
> > > >> has stayed under 10.
> > > >>
> > > >> cat /var/db/ntpd.drift
> > > >> 6.665
> > > >>
> > > >> ntpctl -s all
> > > >> 4/4 peers valid, constraint offset -1s, clock synced, stratum 3
> > > >>
> > > >> peer
> > > >>wt tl st  next  poll  offset   delay  jitter
> > > >> 144.48.166.166 from pool pool.ntp.org
> > > >> 1 10  24s   32s-3.159ms87.723ms10.389ms
> > > >> 13.55.50.68 from pool pool.ntp.org
> > > >> 1 10  3   11s   32s-3.433ms86.053ms18.095ms
> > > >> 14.202.204.182 from pool pool.ntp.org
> > > >> 1 10  1   14s   32s 1.486ms86.545ms16.483ms
> > > >> 27.124.125.250 from pool pool.ntp.org
> > > >>  *  1 10  2   12s   30s   -10.275ms54.156ms70.389ms
> > > >>
> > > >> Cheers
> > > >> Adam
> > > >
> > > > IIRC you have an x220, right?
> > > >
> > > > If so, could you try letting the clock run for a bit (while using tsc
> > > > timecounter selection) after apm -L to drop the speed? (make sure
> > > > apm shows that it dropped).
> > > >
> > > > Even though my x230 supposedly has a constant/invar TSC (according to
> > > > cpuid), the TSC drops from 2.5GHz to 1.2GHz when apm -L runs, which
> > > > causes time to run too slowly when tsc is selected there.
> > > >
> > > > -ml
> > > >
> > >
> > > Yes, x220
> > > (bios: LENOVO version "8DET69WW (1.39 )" date 07/18/2013)
> > > (cpu: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz, 2491.91 MHz)
> > >
> > > I took some measurements to before starting the test.
> > >
> > > note: the laptop has been up for a few days with apm -A set via 
> > > rc.conf.local
> > > and sysctl kern.timecounter.hardware as tsc via sysctl.conf and mostly
> > > idle.
> > >
> > > cat /var/db/ntpd.drift
> > > 6.459
> > >
> > > apm -v
> > > Battery state: high, 100% remaining, unknown life estimate
> > > A/C adapter state: connected
> > > Performance adjustment mode: auto (800 MHz)
> > >
> > > 6 hours ago i ran apm -L, verified it was running slowly (800 MHz),
> > > and got the following results
> > >
> > > The clock appears correct (comparing to other computers)
> > >
> > > apm -v
> > > Battery state: high, 100% remaining, unknown life estimate
> > > A/C adapter state: connected
> > > Performance adjustment mode: manual (800 MHz)
> > >
> > > cat /var/db/ntpd.drift
> > > 6.385
> > >
> > > ntpctl -s all
> > > 4/4 peers valid, constraint offset 0s, clock synced, stratum 4
> > >
> > > peer
> > >wt tl st  next  poll  offset   delay  jitter
> > > 203.23.237.200 from pool pool.ntp.org
> > > 1 10  2  153s 1505s   -25.546ms73.450ms 2.644ms
> > > 203.114.73.24 from pool pool.ntp.org
> > > 1 10  2  253s 1560s-1.042ms75.133ms 0.752ms
> > > 192.189.54.33 from pool pool.ntp.org
> > >  *  1 10  2  204s 1558s31.644ms70.910ms 3.388ms
> > > 54.252.165.245 from pool pool.ntp.org
> > > 1 10  2  238s 1518s 0.146ms73.005ms 2.025ms
> > >
> > > I will leave the laptop in lower power mode over the weekend and see
> > > what happens.
> > >
> >
> > No need, I think you've convinced me that it works :)
> 
> But does it actually work on x230 as well?  I'm surprised to learn
> that you've observed TSC frequency change on Ivy Bridge.  I was
> under impression that everything since at least Sandy Bridge (x220)
> has constant and invariant TSC as advert

Re: Improve the accuracy of the TSC frequency calibration - Updated Patch

2017-08-23 Thread Adam Steen
On Thu, Aug 24, 2017 at 2:35 AM, Mike Larkin  wrote:
> On Wed, Aug 23, 2017 at 09:29:15PM +0800, Adam Steen wrote:
>> On Thu, Aug 17, 2017 at 12:19:28PM +0800, Adam Steen wrote:
>> > On Tue, Aug 8, 2017 at 10:12 PM, Mike Belopuhov  wrote:
>> > > On Tue, Aug 08, 2017 at 08:18 +0800, Adam Steen wrote:
>> > >> On Mon, Jul 31, 2017 at 3:58 PM, Mike Belopuhov  
>> > >> wrote:
>> > >> > On Mon, Jul 31, 2017 at 09:48 +0800, Adam Steen wrote:
>> > >> >> Ted Unangst  wrote:
>> > >> >> > we don't currently export this info, but we could add some 
>> > >> >> > sysctls. there's
>> > >> >> > some cpufeatures stuff there, but generally stuff isn't exported 
>> > >> >> > until
>> > >> >> > somebody finds a use for it... it shouldn't be too hard to add 
>> > >> >> > something to
>> > >> >> > amd64/machdep.c sysctl if you're interested.
>> > >> >>
>> > >> >> I am interested, as i need the info, i will look into it and 
>> > >> >> hopefully
>> > >> >> come back with a patch.
>> > >> >
>> > >> > This is a bad idea because TSC as the time source is only usable
>> > >> > by OpenBSD on Skylake and Kaby Lake CPUs since they encode the TSC
>> > >> > frequency in the CPUID. All older CPUs have their TSCs measured
>> > >> > against the PIT. Currently the measurement done by the kernel isn't
>> > >> > very precise and if TSC is selected as a timecounter, the machine
>> > >> > would be gaining time on a pace that cannot be corrected by our NTP
>> > >> > daemon. (IIRC, about an hour a day on my Haswell running with NTP).
>> > >> >
>> > >> > To be able to use TSC as a timecounter source on OpenBSD or Solo5
>> > >> > you'd have to improve the in-kernel measurement of the TSC frequency
>> > >> > first. I've tried to perform 10 measurements and take an average and
>> > >> > it does improve accuracy, however I believe we need to poach another
>> > >> > bit from Linux and re-calibrate TSC via HPET:
>> > >> >
>> > >> >  
>> > >> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L409
>> > >> >
>> > >> > I think this is the most sane thing we can do. Here's a complete
>> > >> > procedure that Linux kernel undertakes:
>> > >> >
>> > >> >  
>> > >> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L751
>> > >> >
>> > >> > Regards,
>> > >> > Mike
>> > >>
>> > >> Hi Mike/All
>> > >>
>> > >> I would like to improve the accuracy of TSC frequency calibration as
>> > >> Mike B. describes above.
>> > >>
>> > >> I initially thought the calibration would take place at line 470 of
>> > >> amd64/identcpu.c
>> > >> (https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/identcpu.c?annotate=1.87)
>> > >>
>> > >
>> > > Indeed, it cannot happen there simply because you don't know at
>> > > that point whether or not HPET actually exists.
>> > >
>> > >> But I looked into using the acpihpet directly but it is never exposed
>> > >> outside of acpihpet.c.
>> > >>
>> > >
>> > > And it shouldn't be.
>> > >
>> > >> Could someone point me to were if would be appropriate to complete
>> > >> this calibration and how to use the acpihpet?
>> > >
>> > > The way I envision this is a multi-step approach:
>> > >
>> > > 1) TSC frequency is approximated with the PIT (possibly performing
>> > > multiple measurements and averaging them out; also keep in mind that
>> > > doing it 8 times means you can shift the sum right by 3 instead of
>> > > using actual integer division).  This is what should happen around
>> > > the line 470 of identcpu.c
>> > >
>> > > 2) A function can be provided by identcpu.c to further adjust the
>> > > TSC frequency once acpitimer(4) (this is a PM timer) and acpihpet(4)
>> > > (or any 

Re: Improve the accuracy of the TSC frequency calibration - Updated Patch

2017-08-23 Thread Adam Steen
On Thu, Aug 17, 2017 at 12:19:28PM +0800, Adam Steen wrote:
> On Tue, Aug 8, 2017 at 10:12 PM, Mike Belopuhov  wrote:
> > On Tue, Aug 08, 2017 at 08:18 +0800, Adam Steen wrote:
> >> On Mon, Jul 31, 2017 at 3:58 PM, Mike Belopuhov  wrote:
> >> > On Mon, Jul 31, 2017 at 09:48 +0800, Adam Steen wrote:
> >> >> Ted Unangst  wrote:
> >> >> > we don't currently export this info, but we could add some sysctls. 
> >> >> > there's
> >> >> > some cpufeatures stuff there, but generally stuff isn't exported until
> >> >> > somebody finds a use for it... it shouldn't be too hard to add 
> >> >> > something to
> >> >> > amd64/machdep.c sysctl if you're interested.
> >> >>
> >> >> I am interested, as i need the info, i will look into it and hopefully
> >> >> come back with a patch.
> >> >
> >> > This is a bad idea because TSC as the time source is only usable
> >> > by OpenBSD on Skylake and Kaby Lake CPUs since they encode the TSC
> >> > frequency in the CPUID. All older CPUs have their TSCs measured
> >> > against the PIT. Currently the measurement done by the kernel isn't
> >> > very precise and if TSC is selected as a timecounter, the machine
> >> > would be gaining time on a pace that cannot be corrected by our NTP
> >> > daemon. (IIRC, about an hour a day on my Haswell running with NTP).
> >> >
> >> > To be able to use TSC as a timecounter source on OpenBSD or Solo5
> >> > you'd have to improve the in-kernel measurement of the TSC frequency
> >> > first. I've tried to perform 10 measurements and take an average and
> >> > it does improve accuracy, however I believe we need to poach another
> >> > bit from Linux and re-calibrate TSC via HPET:
> >> >
> >> >  
> >> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L409
> >> >
> >> > I think this is the most sane thing we can do. Here's a complete
> >> > procedure that Linux kernel undertakes:
> >> >
> >> >  
> >> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L751
> >> >
> >> > Regards,
> >> > Mike
> >>
> >> Hi Mike/All
> >>
> >> I would like to improve the accuracy of TSC frequency calibration as
> >> Mike B. describes above.
> >>
> >> I initially thought the calibration would take place at line 470 of
> >> amd64/identcpu.c
> >> (https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/identcpu.c?annotate=1.87)
> >>
> >
> > Indeed, it cannot happen there simply because you don't know at
> > that point whether or not HPET actually exists.
> >
> >> But I looked into using the acpihpet directly but it is never exposed
> >> outside of acpihpet.c.
> >>
> >
> > And it shouldn't be.
> >
> >> Could someone point me to were if would be appropriate to complete
> >> this calibration and how to use the acpihpet?
> >
> > The way I envision this is a multi-step approach:
> >
> > 1) TSC frequency is approximated with the PIT (possibly performing
> > multiple measurements and averaging them out; also keep in mind that
> > doing it 8 times means you can shift the sum right by 3 instead of
> > using actual integer division).  This is what should happen around
> > the line 470 of identcpu.c
> >
> > 2) A function can be provided by identcpu.c to further adjust the
> > TSC frequency once acpitimer(4) (this is a PM timer) and acpihpet(4)
> > (or any other timer for that matter) are attached.
> >
> > 3) Once acpitimer(4) or acpihpet(4) or any other timecounter source
> > are attached and are verified to be operating correctly, they can
> > perform TSC re-calibration and update the TSC frequency with their
> > measurements.  The idea here is that the function (or functions) that
> > facilitate this must abstract enough logic so that you don't have to
> > duplicate it in the acpitimer or acpihpet themselves.
> >
> >> (Will it need to be
> >> exposed like i8254_delay/delay_func/delay in machdep.c and cpu.h)
> >>
> >
> > No it won't.
> >
> >> Lastly should the calibration be done using both delay(i8254 pit) and
> >> hpet timers similar to Linux described above or just using the hpet?
> >>
> >

Re: Improve the accuracy of the TSC frequency calibration (Was: Calculate the frequency of the tsc timecounter)

2017-08-16 Thread Adam Steen
On Tue, Aug 8, 2017 at 10:12 PM, Mike Belopuhov  wrote:
> On Tue, Aug 08, 2017 at 08:18 +0800, Adam Steen wrote:
>> On Mon, Jul 31, 2017 at 3:58 PM, Mike Belopuhov  wrote:
>> > On Mon, Jul 31, 2017 at 09:48 +0800, Adam Steen wrote:
>> >> Ted Unangst  wrote:
>> >> > we don't currently export this info, but we could add some sysctls. 
>> >> > there's
>> >> > some cpufeatures stuff there, but generally stuff isn't exported until
>> >> > somebody finds a use for it... it shouldn't be too hard to add 
>> >> > something to
>> >> > amd64/machdep.c sysctl if you're interested.
>> >>
>> >> I am interested, as i need the info, i will look into it and hopefully
>> >> come back with a patch.
>> >
>> > This is a bad idea because TSC as the time source is only usable
>> > by OpenBSD on Skylake and Kaby Lake CPUs since they encode the TSC
>> > frequency in the CPUID. All older CPUs have their TSCs measured
>> > against the PIT. Currently the measurement done by the kernel isn't
>> > very precise and if TSC is selected as a timecounter, the machine
>> > would be gaining time on a pace that cannot be corrected by our NTP
>> > daemon. (IIRC, about an hour a day on my Haswell running with NTP).
>> >
>> > To be able to use TSC as a timecounter source on OpenBSD or Solo5
>> > you'd have to improve the in-kernel measurement of the TSC frequency
>> > first. I've tried to perform 10 measurements and take an average and
>> > it does improve accuracy, however I believe we need to poach another
>> > bit from Linux and re-calibrate TSC via HPET:
>> >
>> >  
>> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L409
>> >
>> > I think this is the most sane thing we can do. Here's a complete
>> > procedure that Linux kernel undertakes:
>> >
>> >  
>> > http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L751
>> >
>> > Regards,
>> > Mike
>>
>> Hi Mike/All
>>
>> I would like to improve the accuracy of TSC frequency calibration as
>> Mike B. describes above.
>>
>> I initially thought the calibration would take place at line 470 of
>> amd64/identcpu.c
>> (https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/identcpu.c?annotate=1.87)
>>
>
> Indeed, it cannot happen there simply because you don't know at
> that point whether or not HPET actually exists.
>
>> But I looked into using the acpihpet directly but it is never exposed
>> outside of acpihpet.c.
>>
>
> And it shouldn't be.
>
>> Could someone point me to were if would be appropriate to complete
>> this calibration and how to use the acpihpet?
>
> The way I envision this is a multi-step approach:
>
> 1) TSC frequency is approximated with the PIT (possibly performing
> multiple measurements and averaging them out; also keep in mind that
> doing it 8 times means you can shift the sum right by 3 instead of
> using actual integer division).  This is what should happen around
> the line 470 of identcpu.c
>
> 2) A function can be provided by identcpu.c to further adjust the
> TSC frequency once acpitimer(4) (this is a PM timer) and acpihpet(4)
> (or any other timer for that matter) are attached.
>
> 3) Once acpitimer(4) or acpihpet(4) or any other timecounter source
> are attached and are verified to be operating correctly, they can
> perform TSC re-calibration and update the TSC frequency with their
> measurements.  The idea here is that the function (or functions) that
> facilitate this must abstract enough logic so that you don't have to
> duplicate it in the acpitimer or acpihpet themselves.
>
>> (Will it need to be
>> exposed like i8254_delay/delay_func/delay in machdep.c and cpu.h)
>>
>
> No it won't.
>
>> Lastly should the calibration be done using both delay(i8254 pit) and
>> hpet timers similar to Linux described above or just using the hpet?
>>
>
> Well, that's what I was arguing for.  As I said in my initial mail
> on misc (not quoted here), the TSC must be calibrated using separate
> known clocks sources.

Hi Mike

Please see the below diff to improve the accuracy of the TSC
frequency. It is model after the linux calibration you linked to
earlier. https://marc.info/?l=openbsd-misc&m=150148792804747&w=2

I feel like i don't know enough about the kernel internals, the
consistency of the results across reboots are not as close as i 

Improve the accuracy of the TSC frequency calibration (Was: Calculate the frequency of the tsc timecounter)

2017-08-07 Thread Adam Steen
On Mon, Jul 31, 2017 at 3:58 PM, Mike Belopuhov  wrote:
> On Mon, Jul 31, 2017 at 09:48 +0800, Adam Steen wrote:
>> Ted Unangst  wrote:
>> > we don't currently export this info, but we could add some sysctls. there's
>> > some cpufeatures stuff there, but generally stuff isn't exported until
>> > somebody finds a use for it... it shouldn't be too hard to add something to
>> > amd64/machdep.c sysctl if you're interested.
>>
>> I am interested, as i need the info, i will look into it and hopefully
>> come back with a patch.
>
> This is a bad idea because TSC as the time source is only usable
> by OpenBSD on Skylake and Kaby Lake CPUs since they encode the TSC
> frequency in the CPUID. All older CPUs have their TSCs measured
> against the PIT. Currently the measurement done by the kernel isn't
> very precise and if TSC is selected as a timecounter, the machine
> would be gaining time on a pace that cannot be corrected by our NTP
> daemon. (IIRC, about an hour a day on my Haswell running with NTP).
>
> To be able to use TSC as a timecounter source on OpenBSD or Solo5
> you'd have to improve the in-kernel measurement of the TSC frequency
> first. I've tried to perform 10 measurements and take an average and
> it does improve accuracy, however I believe we need to poach another
> bit from Linux and re-calibrate TSC via HPET:
>
>  
> http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L409
>
> I think this is the most sane thing we can do. Here's a complete
> procedure that Linux kernel undertakes:
>
>  
> http://elixir.free-electrons.com/linux/v4.12.4/source/arch/x86/kernel/tsc.c#L751
>
> Regards,
> Mike

Hi Mike/All

I would like to improve the accuracy of TSC frequency calibration as
Mike B. describes above.

I initially thought the calibration would take place at line 470 of
amd64/identcpu.c
(https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/arch/amd64/amd64/identcpu.c?annotate=1.87)

But I looked into using the acpihpet directly but it is never exposed
outside of acpihpet.c.

Could someone point me to were if would be appropriate to complete
this calibration and how to use the acpihpet? (Will it need to be
exposed like i8254_delay/delay_func/delay in machdep.c and cpu.h)

Lastly should the calibration be done using both delay(i8254 pit) and
hpet timers similar to Linux described above or just using the hpet?

Cheers
Adam

ps i have a diff in my tree exporting tsc_freq and invariant_tsc via
sysctl which is what Ted is describing above too.