Re: vmm(4): cpuid leaf 0x15 fixes clock speed problem in Linux guest [PATCH]

2020-12-05 Thread Pratik Vyas

* Jozef Hatala  [2020-11-29 00:32:17 -0800]:


On Sun, Nov 29, 2020 at 12:04:04AM -0800, I wrote:

On Sun, Nov 29, 2020 at 06:36:17AM +, Mike wrote:
> And what are you going to return for the other leaf nodes now that
> you are claiming a CPUID level of 0x15, on CPUs that are less than
> that?
>
> Eg, if your host supports level 0xb (11), what happens in your diff
> if the guest VM requests CPUID leaf node 0x0c (12)?
>
> It seems you are only handling the single 0x15 case.

That is true.

We need to note though that the current situation is not correct either.
[...]
What do you suggest?


Two ideas spring to mind for how to resolve this:

(a) Figure out from Intel's documentation for each leaf (and sub-leaf)
what response is correct and does not require us to emulate too much.

(b) Add a sysctl knob that controls whether we even announce a higher
cpu level to the guest than what the host supports, so that we do not
expose people to the incorrectness by default.

What do you think?

jh



Hi Jozef,

I don't see Mike's replies, but can you try Dave's linux kernel module [0]
and see if it performs well when linux chooses to use it?  Disclaimer:
I haven't tried it myself, but my attempt to attach kvm-clock itself worked [1].

This cpuid emulation bit was added back when using the tsc was the only
way to get a precise clock, before pvclock was added [2].  It also
doesn't work on AMD machines (at least on mine).  We could get rid of
this cpuid emulation.


0: https://github.com/voutilad/vmm_clock
1: http://ix.io/2lzK
2: https://github.com/openbsd/src/commit/fa1855cdc021

--
Pratik



Re: vmd(8) and thread safety: a quick proof of concept using libevent 2.1 from ports

2020-05-25 Thread Pratik Vyas
* Claudio Jeker  [2020-05-25 10:14:11 +0200]:

> The problem is that the vm exit is handled by a different thread than the
> event loop. So some event_add/evtimer_add calls are done from a different
> thread than the main event loop (mainly interrupt processing).
> 
> One approach to fix this is to ensure that the event functions are
> exclusively used in the main event thread but as usual it is easy to
> introduce bugs again. 

Hi Claudio,

I think it is possible to do this without too much complexity while we
have only one cpu.  Basically, we have two threads right now that do
event_add/evtimer_add:
1) the vmm_pipe / imsg comm channel between the vmm and vm processes
2) the vcpu0 loop

I think it's possible to make vmm_pipe not run in its own thread and
just be an event-based implementation.

Dave actually discovered that if you disable vmm_pipe, this libevent
state corruption does not occur.

I will take a stab at this.


--
Pratik



Re: vmm timer for linux guests

2020-05-22 Thread Pratik Vyas

* Renato Aguiar  [2020-05-21 12:55:45 -0700]:


Hi Sivaram,

I'm the author of the e-mail thread that you mentioned. After feedback 
I got from OpenBSD community, I created a patch for Linux to enable 
kvm-clock when booting on VMM. It managed to keep clock in sync, but I 
experienced random crashes in vmd after some time running the VM.


Multiple fixes have been merged to vmm/vmd since then, so I'm planning 
to give it another try with OpenBSD 6.7.


If you are interested, I can share the patch with you.

Regards,

--
Renato Aguiar



Hi Sivaram and Renato,

I have been looking at this issue this week with Dave.  (Funnily enough,
Dave and I had each independently decided to write a linux patch to
attach pvclock and to debug these crashes.)

If you have your linux vm with kvm-clock around, can you try this rather
crude patch and see if it still crashes?  This might also fix the '100%
cpu when using alpine via console' problem.

For ref, this is my linux side patch to attach kvm-clock http://ix.io/2lzK

--
Pratik

diff --git a/usr.sbin/vmd/ns8250.c b/usr.sbin/vmd/ns8250.c
index 497e6fad550..33f1a371c16 100644
--- a/usr.sbin/vmd/ns8250.c
+++ b/usr.sbin/vmd/ns8250.c
@@ -36,6 +36,8 @@
 
 extern char *__progname;

 struct ns8250_dev com1_dev;
+struct event read_delay_ev;
+struct timeval read_delay_tv;
 
 static void com_rcv_event(int, short, void *);

 static void com_rcv(struct ns8250_dev *, uint32_t, uint32_t);
@@ -61,6 +63,11 @@ ratelimit(int fd, short type, void *arg)
vcpu_deassert_pic_irq(com1_dev.vmid, 0, com1_dev.irq);
 }
 
+static void
+arm_read(int fd, short type, void *arg) {
+   event_add(&com1_dev.event, NULL);
+}
+
 void
 ns8250_init(int fd, uint32_t vmid)
 {
@@ -96,7 +103,7 @@ ns8250_init(int fd, uint32_t vmid)
 */
com1_dev.pause_ct = (com1_dev.baudrate / 8) / 1000 * 10;
 
-	event_set(&com1_dev.event, com1_dev.fd, EV_READ | EV_PERSIST,
+   event_set(&com1_dev.event, com1_dev.fd, EV_READ,
	    com_rcv_event, (void *)(intptr_t)vmid);
 
 	/*

@@ -112,6 +119,9 @@ ns8250_init(int fd, uint32_t vmid)
	timerclear(&com1_dev.rate_tv);
	com1_dev.rate_tv.tv_usec = 1;
	evtimer_set(&com1_dev.rate, ratelimit, NULL);
+   read_delay_tv.tv_usec = 1;
+   evtimer_set(&read_delay_ev, arm_read,
+   (void *)(intptr_t)vmid);
 }
 
 static void

@@ -131,6 +141,7 @@ com_rcv_event(int fd, short kind, void *arg)
 */
if (com1_dev.rcv_pending) {
		mutex_unlock(&com1_dev.mutex);
+   evtimer_add(&read_delay_ev, &read_delay_tv);
return;
}
 
@@ -146,6 +157,7 @@ com_rcv_event(int fd, short kind, void *arg)

vcpu_deassert_pic_irq((uintptr_t)arg, 0, com1_dev.irq);
}
}
+   event_add(&com1_dev.event, NULL);
 
	mutex_unlock(&com1_dev.mutex);

 }



Re: FW: Add mprotect_ept ioctl to vmm(4)

2020-04-07 Thread Pratik Vyas

* Adam Steen  [2020-04-07 08:18:19 +]:


On Fri, Feb 07, 2020 at 01:25:38PM -0800, Mike Larkin wrote:
> On Fri, Feb 07, 2020 at 04:20:16AM +, Adam Steen wrote:
> > Hi
> >
> > Please see the attached patch to add an 'IOCTL handler to sets the access
> > protections of the ept'
> >
> > vmd(8) does not make use of this change, but solo5, which uses vmm(4) as
> > a backend hypervisor. The code calling 'VMM_IOC_MPROTECT_EPT' is
> > available here 
https://github.com/Solo5/solo5/compare/master...adamsteen:wnox
> >
> > there are changes to vmd too, but this is just to ensure completeness,
> > if mprotect ept is called in the future, we would want the vm to be
> > stopped if we get a protection fault.
> >
> > I was unsure what todo if called with execute only permissions on a cpu that
> > does not support it. I went with add read permissions and logging the
> > fact, instead of returning EINVAL.
> >
> > Cheers
> > Adam
> >
>
> I have been giving Adam feedback on this diff for a while. There are a few
> minor comments below, but I think this is ok if someone wants to commit it 
after
> the fixes below are incorporated.
>
> -ml
>

See updated comment below.

-ml





> > + /* No W^X permissions */
> > + if ((prot & PROT_MASK) != prot &&
> > + (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC)) {
> > + DPRINTF("%s: No W^X permissions\n", __func__);
> > + return (EINVAL);
> > + }
>
> I would probably reword this to "W+X permission requested".
>





> > +/*
> > + * vmx_mprotect_ept
> > + *
> > + * apply the ept protections to the requested pages, faulting the page if
>
> "faulting in"
>
> > + * required.
> > + */
> > +int
> > +vmx_mprotect_ept(vm_map_t vm_map, paddr_t sgpa, paddr_t egpa, int prot)
> > +{
> > + struct vmx_invept_descriptor vid;
> > + pmap_t pmap;
> > + pt_entry_t *pte;
> > + paddr_t addr;
> > + int ret = 0;
> > > +
> > + pmap = vm_map->pmap;
> > +
> > + for (addr = sgpa; addr < egpa; addr += PAGE_SIZE) {
> > + pte = vmx_pmap_find_pte_ept(pmap, addr);
> > + if (pte == NULL) {
>
> if (pte & PG_V) == 0
>

After reading a reply from Adam, I think the originally suggested way is fine.
My idea of checking PG_V is fine for RWX EPT entries, but as soon as anyone
uses XO entries, this check wouldn't work.

-ml


Hi

I have incorporated the fixes ml requested above and fixed a few nits
from pd@, but with the craziness of life these days, I haven't been able
to get it committed, so I am attaching it again below.


It tested ok last week.  I can commit it (this and the cr0 one) tonight
and watch for any fallout.

Thanks and apologies for delay!

--
Pratik



Re: some vulns

2020-02-25 Thread Pratik Vyas

* Maxime Villard  [2020-02-22 12:16:35 +0100]:


CVSROOT:/cvs
Module name:src
Changes by: morti...@cvs.openbsd.org2020/02/15 15:59:55

Modified files:
sys/arch/amd64/amd64: vmm.c

Log message:
Add bounds check on addresses passed from guests in pvclock.

Fixes an issue where a guest can write to host memory by passing bogus 
addresses.


I'm a bit confused here. It is not because the GPAs are contiguous that the
HPAs are too. If the structure crosses a page, the guest still can write to
host memory.



Yup, you're right.  Thanks Maxime!

--
Pratik



Re: vmm(4): handle invalid writes to cr0 - patch

2020-02-18 Thread Pratik Vyas

* Adam Steen  [2020-02-18 04:56:57 +]:


Hi

Please see the patch below to handle invalid writes to cr0 as per the
Intel SDM Volume 3.

The 3 cases I am handling with this change are:
1. CR0.PG: Setting the PG flag when the PE flag is clear causes a
  general-protection exception (#GP). (Intel SDM Volume 3abcd page 76, Section
  2.5 Control Registers)

2. CR0.CD and CR0.NW, CD: 0 and NW: 1: Invalid setting. Generates a
  general-protection exception (#GP) with an error code of 0.
  (Intel SDM Volume 3abcd page 438, Table 11-5. Cache Operating Modes,
  via Intel SDM Volume 3abcd page 76; see also the comment for CR0.CD.)

3. Bits 63:32 of CR0 and CR4 are reserved and must be written with zeros.
  Writing a nonzero value to any of the upper 32 bits results in a
  general-protection exception, #GP(0). (Intel SDM Volume 3abcd page
  75, 2.5 Control Registers, Paragraph 2, bullet point 2.)

Cheers
Adam

? div
Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.262
diff -u -p -u -p -r1.262 vmm.c
--- sys/arch/amd64/amd64/vmm.c  17 Feb 2020 18:16:10 -  1.262
+++ sys/arch/amd64/amd64/vmm.c  18 Feb 2020 02:59:53 -
@@ -5685,6 +5685,33 @@ vmx_handle_cr0_write(struct vcpu *vcpu,
return (EINVAL);
}

+   /* 2.5 CONTROL REGISTERS CR0.PG */
+   if ((r & CR0_PG) && (r & CR0_PE) == 0) {
+   DPRINTF("%s: PG flag set when the PE flag is clear,"
+   " inject #GP, cr0=0x%llx\n", __func__, r);
+   vmm_inject_gp(vcpu);
+   return (0);
+   }
+
+   /*
+* 11.5.1 Cache Control Registers and Bits
+* Table 11-5. Cache Operating Modes
+*/
+   if ((r & CR0_NW) && (r & CR0_CD) == 0) {
+   DPRINTF("%s: NW flag set when the CD flag is clear,"
+   " inject #GP, cr0=0x%llx\n", __func__, r);
+   vmm_inject_gp(vcpu);
+   return (0);
+   }
+
+   /* 2.5 CONTROL REGISTERS */
+   if (r & 0xFFFFFFFF00000000ULL) {
+   DPRINTF("%s: setting bits 63:32 of %%cr0 is invalid,"
+   " inject #GP, cr0=0x%llx\n", __func__, r);
+   vmm_inject_gp(vcpu);
+   return (0);
+   }
+
/* CR0 must always have NE set */
r |= CR0_NE;





Looks good!

ok pd@



Re: vmctl(8): uninitialized value

2020-01-02 Thread Pratik Vyas

* Benjamin Baier  [2020-01-02 22:01:14 +0100]:


On Thu, 2 Jan 2020 18:56:14 +0100
Klemens Nanni  wrote:


On Thu, Jan 02, 2020 at 04:37:17PM +0100, Benjamin Baier wrote:
> "case CMD_SEND:" sets done=1 so ret will never be written to and
> the uninitialized value of ret is used to determine the return
> value of the function vmmaction.
Good catch:

$ doas vmctl start -b ~/vm/bsd.rd -m 128M test ; echo $?
vmctl: starting without disks
vmctl: starting without network interfaces
vmctl: started vm 4 successfully, tty /dev/ttyp2
0
$ doas vmctl send test >/dev/null ; echo $?
vmctl: sent vm test successfully
1

With your diff it exits zero.

I also just noticed that above example reproducibly causes vmd(8) to
exit:

Jan  2 18:53:57 eru vmd[55128]: startup
Jan  2 18:54:18 eru vmd[55128]: test: started vm 4 successfully, tty /dev/ttyp2
Jan  2 18:54:28 eru vmd[49983]: priv exiting, pid 49983
Jan  2 18:54:28 eru vmd[29885]: control exiting, pid 29885


I don't get vmd to exit, but got this in /var/log/messages with the above 
example:
Jan  2 21:38:05 x220 vmd[86810]: control_dispatch_vmd: lost control connection: 
fd 7



Hi Benjamin,

kn@ is running with vm.malloc_conf=SU.  The crash is due to vm_remove
calling free on vm here:
https://github.com/openbsd/src/blob/926f477f07d3bba063ff6ee1ea9e0b7369ed8930/usr.sbin/vmd/vmm.c#L542

I will work on a fix.

--
Pratik



Re: vmctl: parse_size(): Use local variable instead of function parameter

2019-12-16 Thread Pratik Vyas

* Klemens Nanni  [2019-12-16 23:42:04 +0100]:


On Fri, Dec 06, 2019 at 06:49:52PM +0100, Klemens Nanni wrote:

The parse_size() wrapper around scan_scaled(3) writes its intermediate
result to the function argument which is always passed as literal zero.

This seems odd, the function parameter has no meaning but merely serves
as storage, so let's use a proper function scoped variable instead.

OK?

Ping.



That's better.  ok!
--
Pratik



[PATCH] staggered start of vms in vm.conf

2019-12-08 Thread Pratik Vyas

Hi!

This is an attempt to address the 'thundering herd' problem when a lot
of vms are configured in vm.conf.  A lot of vms booting in parallel can
overload the host and also mess up tsc calibration in openbsd guests,
as they use the PIT, which doesn't fire reliably if the host is
overloaded.

This diff makes vmd start vms in a staggered fashion, with a default
parallelism of the number of cpus on the host and a delay of 30s.  The
default can be overridden with a line like the following in vm.conf:

staggered start parallel 4 delay 30


Every non-disabled vm starts in the waiting state.  If you are eager to
start a vm that is further down the list, you can vmctl start it.

Discussed the idea with ori@, mlarkin@ and phessler@.

Comments / ok? 


--
Pratik

Index: usr.sbin/vmctl/vmctl.c
===
RCS file: /home/cvs/src/usr.sbin/vmctl/vmctl.c,v
retrieving revision 1.71
diff -u -p -a -u -r1.71 vmctl.c
--- usr.sbin/vmctl/vmctl.c  7 Sep 2019 09:11:14 -   1.71
+++ usr.sbin/vmctl/vmctl.c  8 Dec 2019 09:29:39 -
@@ -716,6 +716,8 @@ vm_state(unsigned int mask)
{
if (mask & VM_STATE_PAUSED)
return "paused";
+   else if (mask & VM_STATE_WAITING)
+   return "waiting";
else if (mask & VM_STATE_RUNNING)
return "running";
else if (mask & VM_STATE_SHUTDOWN)
Index: usr.sbin/vmd/parse.y
===
RCS file: /home/cvs/src/usr.sbin/vmd/parse.y,v
retrieving revision 1.52
diff -u -p -a -u -r1.52 parse.y
--- usr.sbin/vmd/parse.y14 May 2019 06:05:45 -  1.52
+++ usr.sbin/vmd/parse.y8 Dec 2019 09:29:39 -
@@ -122,7 +122,8 @@ typedef struct {
%token  INCLUDE ERROR
%token  ADD ALLOW BOOT CDROM DEVICE DISABLE DISK DOWN ENABLE FORMAT GROUP
%token  INET6 INSTANCE INTERFACE LLADDR LOCAL LOCKED MEMORY NET NIFS OWNER
-%token PATH PREFIX RDOMAIN SIZE SOCKET SWITCH UP VM VMID
+%token PATH PREFIX RDOMAIN SIZE SOCKET SWITCH UP VM VMID STAGGERED START
+%token  PARALLEL DELAY
%token  <v.number> NUMBER
%token  <v.string> STRING
%type   <v.lladdr> lladdr
@@ -217,6 +218,11 @@ main   : LOCAL INET6 {
env->vmd_ps.ps_csock.cs_uid = $3.uid;
env->vmd_ps.ps_csock.cs_gid = $3.gid == -1 ? 0 : $3.gid;
}
+   | STAGGERED START PARALLEL NUMBER DELAY NUMBER {
+   env->vmd_cfg.cfg_flags |= VMD_CFG_STAGGERED_START;
+   env->vmd_cfg.delay.tv_sec = $6;
+   env->vmd_cfg.parallelism = $4;
+   }
;

switch  : SWITCH string {
@@ -368,6 +374,8 @@ vm  : VM string vm_instance {
} else {
if (vcp_disable)
vm->vm_state |= VM_STATE_DISABLED;
+   else
+   vm->vm_state |= VM_STATE_WAITING;
log_debug("%s:%d: vm \"%s\" "
"registered (%s)",
file->name, yylval.lineno,
@@ -766,6 +774,7 @@ lookup(char *s)
{ "allow",ALLOW },
{ "boot", BOOT },
{ "cdrom",CDROM },
+   { "delay",DELAY },
{ "device",   DEVICE },
{ "disable",  DISABLE },
{ "disk", DISK },
@@ -785,10 +794,13 @@ lookup(char *s)
{ "memory",   MEMORY },
{ "net",  NET },
{ "owner",OWNER },
+   { "parallel", PARALLEL },
{ "prefix",   PREFIX },
{ "rdomain",  RDOMAIN },
{ "size", SIZE },
{ "socket",   SOCKET },
+   { "staggered",STAGGERED },
+   { "start",START  },
{ "switch",   SWITCH },
{ "up",   UP },
{ "vm",   VM }
Index: usr.sbin/vmd/vm.conf.5
===
RCS file: /home/cvs/src/usr.sbin/vmd/vm.conf.5,v
retrieving revision 1.44
diff -u -p -a -u -r1.44 vm.conf.5
--- usr.sbin/vmd/vm.conf.5  14 May 2019 12:47:17 -  1.44
+++ usr.sbin/vmd/vm.conf.5  8 Dec 2019 09:29:39 -
@@ -91,6 +91,16 @@ vm "vm1.example.com" {
.Sh GLOBAL CONFIGURATION
The following setting can be configured globally:
.Bl -tag -width Ds
+.It Ic staggered start parallel Ar parallelism Ic delay Ar seconds
+Start all configured vms in staggered fashion with
+.Ar parallelism
+instances in parallel every
+.Ar delay
+seconds.  Defaults to
+.Ar 

Re: [PATCH] attach pvclock with lower priority if tsc is unstable

2019-12-06 Thread Pratik Vyas

* Pratik Vyas  [2019-11-24 23:07:26 -0800]:


Hello tech@,

This diff attaches pvclock with lower priority (500) in case of unstable
tsc (PVCLOCK_FLAG_TSC_STABLE) instead of not attaching at all.

For reference current priorities,
tsc (variant)  : -2000
i8254  : 0
acpitimer  : 1000
acpihpet0  : 1000
pvclock (stable tsc)   : 1500
tsc (invariant, stable): 2000

--
Pratik



Does this look ok? (or any comments?)

--
Pratik



[PATCH] fix vmm pvclock accuracy

2019-11-25 Thread Pratik Vyas

Hi tech@,

This patch fixes vmm pvclock accuracy issues.  Shift math error
discovered by George Koehler.  This diff also fixes the error in tsc
multiplier which was correct only if the host timecounter is tsc.

--
Pratik


Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /home/cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.254
diff -u -p -a -u -r1.254 vmm.c
--- sys/arch/amd64/amd64/vmm.c  22 Sep 2019 08:47:54 -  1.254
+++ sys/arch/amd64/amd64/vmm.c  26 Nov 2019 00:08:10 -
@@ -28,7 +28,6 @@
#include 
#include 
#include 
-#include 

#include 

@@ -6879,8 +6878,11 @@ void
vmm_init_pvclock(struct vcpu *vcpu, paddr_t gpa)
{
vcpu->vc_pvclock_system_gpa = gpa;
-   vcpu->vc_pvclock_system_tsc_mul =
-   (int) ((1000000000L << 20) / tc_getfrequency());
+   if (tsc_frequency > 0)
+   vcpu->vc_pvclock_system_tsc_mul =
+   (int) ((1000000000L << 20) / tsc_frequency);
+   else
+   vcpu->vc_pvclock_system_tsc_mul = 0;
vmm_update_pvclock(vcpu);
}

@@ -6906,7 +6908,7 @@ vmm_update_pvclock(struct vcpu *vcpu)
	nanotime(&tv);
	pvclock_ti->ti_system_time =
	    tv.tv_sec * 1000000000L + tv.tv_nsec;
-   pvclock_ti->ti_tsc_shift = -20;
+   pvclock_ti->ti_tsc_shift = 12;
pvclock_ti->ti_tsc_to_system_mul =
vcpu->vc_pvclock_system_tsc_mul;
pvclock_ti->ti_flags = PVCLOCK_FLAG_TSC_STABLE;



[PATCH] attach pvclock with lower priority if tsc is unstable

2019-11-24 Thread Pratik Vyas

Hello tech@,

This diff attaches pvclock with lower priority (500) in case of unstable
tsc (PVCLOCK_FLAG_TSC_STABLE) instead of not attaching at all.

For reference current priorities,
tsc (variant)  : -2000
i8254  : 0
acpitimer  : 1000
acpihpet0  : 1000
pvclock (stable tsc)   : 1500
tsc (invariant, stable): 2000

--
Pratik

Index: sys/dev/pv/pvclock.c
===
RCS file: /home/cvs/src/sys/dev/pv/pvclock.c,v
retrieving revision 1.4
diff -u -p -a -u -r1.4 pvclock.c
--- sys/dev/pv/pvclock.c13 May 2019 15:40:34 -  1.4
+++ sys/dev/pv/pvclock.c25 Nov 2019 06:49:09 -
@@ -29,11 +29,14 @@
#include 

#include 
+#include 
#include 

#include 
#include 

+uint pvclock_lastcount;
+
struct pvclock_softc {
struct devicesc_dev;
void*sc_time;
@@ -141,21 +144,22 @@ pvclock_attach(struct device *parent, st
flags = ti->ti_flags;
} while (!pvclock_read_done(ti, version));

-   if ((flags & PVCLOCK_FLAG_TSC_STABLE) == 0) {
-   wrmsr(KVM_MSR_SYSTEM_TIME, pa & ~PVCLOCK_SYSTEM_TIME_ENABLE);
-   km_free(sc->sc_time, PAGE_SIZE, &kv_any, &kp_zero);
-   printf(": unstable clock\n");
-   return;
-   }
-
	sc->sc_tc = &pvclock_timecounter;
sc->sc_tc->tc_name = DEVNAME(sc);
	sc->sc_tc->tc_frequency = 1000000000ULL;
sc->sc_tc->tc_priv = sc;

+   pvclock_lastcount = 0;
+
/* Better than HPET but below TSC */
sc->sc_tc->tc_quality = 1500;

+   if ((flags & PVCLOCK_FLAG_TSC_STABLE) == 0) {
+   /* if tsc is not stable, set a lower priority */
+   /* Better than i8254 but below HPET */
+   sc->sc_tc->tc_quality = 500;
+   }
+
tc_init(sc->sc_tc);

printf("\n");
@@ -216,10 +220,6 @@ pvclock_get_timecount(struct timecounter
flags = ti->ti_flags;
} while (!pvclock_read_done(ti, version));

-   /* This bit must be set as we attached based on the stable flag */
-   if ((flags & PVCLOCK_FLAG_TSC_STABLE) == 0)
-   panic("%s: unstable result on stable clock", DEVNAME(sc));
-
/*
 * The algorithm is described in
 * linux/Documentation/virtual/kvm/msr.txt
@@ -230,6 +230,14 @@ pvclock_get_timecount(struct timecounter
else
delta <<= shift;
ctr = ((delta * mul_frac) >> 32) + system_time;
+
+   if ((flags & PVCLOCK_FLAG_TSC_STABLE) != 0)
+   return (ctr);
+
+   if (ctr < pvclock_lastcount)
+   return (pvclock_lastcount);
+
+   atomic_swap_uint(&pvclock_lastcount, ctr);

return (ctr);
}



Re: Attach kvm-clock to Linux guests on VMM

2019-05-27 Thread Pratik Vyas

* Renato Aguiar  [2019-05-27 03:53:11 -0700]:


Hi,

The following patch makes Linux guests use kvm-clock by setting KVM's 
CPUID signature on VMM:




I think the right thing is to make linux attach pvclock when it's
running on OpenBSD vmm.  Do you want to send them a patch?

Otherwise, does vmm pvclock keep good time on linux in your experiments?

--
Pratik



Re: vmd: servicing virtio devices from separate processes

2018-10-20 Thread Pratik Vyas

* David Gwynne  [2018-10-20 12:19:56 +1000]:



Would sending and receiving a VM still work if I/O is run in different 
processes?

dlg



Hi dlg,

It will have to be reworked completely but can be done, I think.

--
Pratik



Re: vmd: rate-limit to avoid reboot loops

2018-10-08 Thread Pratik Vyas

* Reyk Floeter  [2018-10-05 23:32:44 +0200]:


Hi,

it sometimes happens that a VM is stuck in a reboot loop.  This isn't
very pleasant for vmd, so this diff attempts to introduce a hard
rate-limit: if the VM rebooted after less than VM_START_RATE_SEC (6)
seconds, increment a counter.  If this happens VM_START_RATE_LIMIT (3)
times in a row, stop the VM.

The idea is that it might be desirable in some cases to reboot quickly
(you're either really fast on the boot prompt, or you use something
like grub that can automatically reboot into a previous kernel).  But
if this happens too often (more than 3 times), something is wrong and
cannot be intended, not even in the worst Linux/grub/unikernel/...
situation.

These limits are a guessed default.

Test case: I dd'ed random bytes to a kernel after some initial bytes,
keeping the original size of the kernel.  The boot loader loads the
header, the complete kernel, tries to boot it and *boom*, reset ;)

Comments?  Concerns?  Better ideas?  OKs?



We could rate limit based on the number of triple faults.  I have often
break;'d on VMX_EXIT_TRIPLE_FAULT in vcpu_exit (vm.c) to not get
annoyed by this.

--
Pratik



Re: vmd: division by zero in vcpu_process_com_data

2018-10-03 Thread Pratik Vyas

* Mike Larkin  [2018-10-03 12:19:09 -0700]:


How about this? pd, thoughts?

This code is just the rate limiter code.

Today the code says "have I reached the number of characters output based on
my baud rate that indicates I need to pause a bit?". And pausing after 0
characters has been output makes no sense, so only engage the limiter if
we are pausing between each "1 character or more".

-ml


I like it

ok pd@



Re: vmd: division by zero in vcpu_process_com_data

2018-10-03 Thread Pratik Vyas

* Greg Steuck  [2018-10-03 11:40:22 -0700]:


$ egdb /syzkaller/src/usr.sbin/vmd/obj/vmd /var/crash/vmd/38082.core
 Core was generated by `vmd'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0  0x0c07a64174a0 in vcpu_process_com_data (vei=,
vm_id=, vcpu_id=)
   at /syzkaller/src/usr.sbin/vmd/ns8250.c:240
240 if (com1_dev.byte_out % com1_dev.pause_ct
== 0) {
[Current thread is 1 (process 259192)]
(gdb) p com1_dev.pause_ct
$1 = 0
$1 = {mutex = 0xc0a4b1242c0, regs = {lcr = 5 '\005', fcr = 6 '\006', iir =
3 '\003', ier = 15 '\017', divlo = 64 '@', divhi = 56 '8',
   msr = 0 '\000', lsr = 0 '\000', mcr = 11 '\v', scr = 0 '\000', data = 0
'\000'}, event = {ev_next = {tqe_next = 0xc0a4b120808,
 tqe_prev = 0xc09e1428848}, ev_active_next = {tqe_next = 0x0, tqe_prev
= 0x0}, ev_signal_next = {tqe_next = 0x0, tqe_prev = 0x0},
   min_heap_idx = 4294967295, ev_base = 0xc0a52c3bc00, ev_fd = 9,
ev_events = 18, ev_ncalls = 0, ev_pncalls = 0x0, ev_timeout = {tv_sec = 0,
 tv_usec = 0}, ev_pri = 0, ev_callback = 0xc07a6417340
, ev_arg = 0x6, ev_res = 0, ev_flags = 4226}, rate =
{ev_next = {
 tqe_next = 0x0, tqe_prev = 0x0}, ev_active_next = {tqe_next =
0xc07a66b65c8 , tqe_prev = 0xc0a4b121f40}, ev_signal_next = {
 tqe_next = 0x0, tqe_prev = 0x0}, min_heap_idx = 4294967295, ev_base =
0xc0a52c3bc00, ev_fd = -1, ev_events = 0, ev_ncalls = 0,
   ev_pncalls = 0xc0a4b8f3f68, ev_timeout = {tv_sec = 2745, tv_usec =
969355}, ev_pri = 0, ev_callback = 0xc07a64173c0 , ev_arg = 0x0,
   ev_res = 1, ev_flags = 128}, rate_tv = {tv_sec = 0, tv_usec = 1},
fd = 9, irq = 4, rcv_pending = 0, vmid = 6, byte_out = 56924,
 baudrate = 8, pause_ct = 0}



Nice :)

Easy to repro: boot the cd, do 'stty com0 4800' at the boot> prompt,
and continue.

Crude diff attached.


--
Pratik

Index: usr.sbin/vmd/ns8250.c
===
RCS file: /home/pdvyas/cvs/src/usr.sbin/vmd/ns8250.c,v
retrieving revision 1.17
diff -u -p -a -u -r1.17 ns8250.c
--- usr.sbin/vmd/ns8250.c   12 Jul 2018 10:15:44 -  1.17
+++ usr.sbin/vmd/ns8250.c   3 Oct 2018 19:03:08 -
@@ -312,13 +312,13 @@ vcpu_process_com_lcr(struct vm_exit *vei
if (vei->vei.vei_dir == VEI_DIR_OUT) {
if (com1_dev.regs.lcr & LCR_DLAB) {
if (!(data & LCR_DLAB)) {
-   if (com1_dev.regs.divlo == 0 &&
-   com1_dev.regs.divhi == 0) {
+   divisor = com1_dev.regs.divlo |
+com1_dev.regs.divhi << 8;
+   /* can't set baud < 9600  */
+   if (divisor == 0 || (divisor > (115200/9600))) {
log_warnx("%s: ignoring invalid "
"baudrate", __func__);
} else {
-   divisor = com1_dev.regs.divlo |
-com1_dev.regs.divhi << 8;
com1_dev.baudrate = 115200 / divisor;
com1_dev.pause_ct =
(com1_dev.baudrate / 8) / 1000 * 10;



Re: vmd cores

2018-10-03 Thread Pratik Vyas

* Greg Steuck  [2018-10-03 10:56:28 -0700]:


Hi Mike,

I'm getting core files from vmds. Here's the most recent one. Should I
start collecting more stack traces and sending them to you?

ci-openbsd$ doas /usr/local/bin/egdb /syzkaller/src/usr.sbin/vmd/obj/vmd
/var/crash/vmd/89501.core
Reading symbols from /syzkaller/src/usr.sbin/vmd/obj/vmd...done.
[New process 178128]
[New process 294426]
[New process 350865]
Core was generated by `vmd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0c07a64148bd in virtio_shutdown (vm=0xc09e1418000) at
/syzkaller/src/usr.sbin/vmd/virtio.c:2018
2018		vioscsi->file.close(vioscsi->file.p, 0);


Hi Greg,

This is interesting.  Are you using the cdrom?  I guess not.  There
seems to be no if condition around that statement.

This diff should prevent that segfault.

--
Pratik


Index: usr.sbin/vmd/virtio.c
===
RCS file: /home/pdvyas/cvs/src/usr.sbin/vmd/virtio.c,v
retrieving revision 1.70
diff -u -p -a -u -r1.70 virtio.c
--- usr.sbin/vmd/virtio.c   28 Sep 2018 12:35:32 -  1.70
+++ usr.sbin/vmd/virtio.c   3 Oct 2018 18:35:40 -
@@ -2015,7 +2015,8 @@ virtio_shutdown(struct vmd_vm *vm)
int i;

/* ensure that our disks are synced */
-   vioscsi->file.close(vioscsi->file.p, 0);
+   if (vioscsi != NULL)
+   vioscsi->file.close(vioscsi->file.p, 0);
for (i = 0; i < nr_vioblk; i++)
vioblk[i].file.close(vioblk[i].file.p, 0);
}



Re: Newer unveil diffs

2018-07-31 Thread Pratik Vyas

* Theo de Raadt  [2018-07-30 12:52:46 -0600]:


unveil(2) is now enabled in -current.

For those who want to play along at home, here are some diffs which use
this in a variety of programs.  Not all these diffs are correct or
complete yet.  This is a learning experience.  Based upon what we learn,
we may still change unveil(2) semantics slightly (similar to how pledge
semantics were reached).

These diffs are in snapshots.



Thanks Theo for pushing this in!

vmctl start needs a bit more unveiling.

Index: usr.sbin/vmctl/main.c
===
RCS file: /home/pdvyas/cvs/src/usr.sbin/vmctl/main.c,v
retrieving revision 1.39
diff -u -p -a -u -r1.39 main.c
--- usr.sbin/vmctl/main.c   12 Jul 2018 14:53:37 -  1.39
+++ usr.sbin/vmctl/main.c   31 Jul 2018 21:15:42 -
@@ -45,7 +45,15 @@ static inttty_autoconnect = 0;
__dead void  usage(void);
__dead void  ctl_usage(struct ctl_command *);

+int vmmaction(struct parse_result *);
int  vmm_action(struct parse_result *);
+int parse_ifs(struct parse_result *, char *, int);
+int parse_network(struct parse_result *, char *);
+int parse_size(struct parse_result *, char *, long long);
+int parse_disk(struct parse_result *, char *);
+int parse_vmid(struct parse_result *, char *, int);
+voidparse_free(struct parse_result *);
+int parse(int, char *[]);

int  ctl_console(struct parse_result *, int, char *[]);
int  ctl_create(struct parse_result *, int, char *[]);
@@ -158,9 +166,14 @@ parse(int argc, char *argv[])
res.action = ctl->action;
res.ctl = ctl;

+   if (unveil(SOCKET_NAME, "r") == -1)
+   err(1, "unveil");
+
if (!ctl->has_pledge) {
/* pledge(2) default if command doesn't have its own pledge */
-   if (pledge("stdio rpath exec unix getpw", NULL) == -1)
+   if (unveil(VMCTL_CU, "x") == -1)
+   err(1, "unveil");
+   if (pledge("stdio rpath exec unix getpw unveil", NULL) == -1)
err(1, "pledge");
}
	if (ctl->main(&res, argc, argv) != 0)
@@ -477,6 +490,10 @@ ctl_create(struct parse_result *res, int

paths[0] = argv[1];
paths[1] = NULL;
+
+   if (unveil(paths[0], "rwc") == -1)
+   err(1, "unveil");
+
if (pledge("stdio rpath wpath cpath", NULL) == -1)
err(1, "pledge");
argc--;
@@ -597,6 +614,8 @@ ctl_start(struct parse_result *res, int 
		case 'b':

if (res->path)
errx(1, "boot image specified multiple times");
+   if (unveil(optarg, "r") == -1)
+   err(1, "unveil");
if (realpath(optarg, path) == NULL)
err(1, "invalid boot image path");
if ((res->path = strdup(path)) == NULL)
@@ -628,6 +647,8 @@ ctl_start(struct parse_result *res, int 
errx(1, "invalid network: %s", optarg);

break;
case 'd':
+   if (unveil(optarg, "r") == -1)
+   err(1, "unveil");
if (realpath(optarg, path) == NULL)
err(1, "invalid disk path");
if (parse_disk(res, path) != 0)



Re: vmd: enable pause/unpause for vm owners

2018-04-17 Thread Pratik Vyas

* Mohamed Aslan  [2018-04-16 00:54:43 -0400]:


Hello tech@,

I noticed that vmd(8) only allows VM owners to start/stop their
VMs, but does not let them pause/unpause those VMs.

I was just wondering if there are reasons behind that. If not, the
patch below enables pause/unpause commands for VM owners.

Regards,
Aslan


Hi Aslan,

No reason behind not letting owners pause / unpause.  vmctl send
/ receive is also missing this and it's on my list.

Thanks for the patch!  It looks ok and I will commit it.

--
Pratik



vmm: Call for testing: Expose TSC to guest

2017-07-22 Thread Pratik Vyas

Hello tech@,

The following diff should expose the TSC to guest vms, and OpenBSD
guests should be able to choose tsc as the preferred timecounter if the
host machine is >= skylake. 


This should improve the guest clock drift situation significantly.

I am aware that this breaks received vms and am working on handling
that.

Thanks,
Pratik

Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /home/pdvyas/cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.157
diff -u -p -a -u -r1.157 vmm.c
--- sys/arch/amd64/amd64/vmm.c  2 Jul 2017 19:49:31 -   1.157
+++ sys/arch/amd64/amd64/vmm.c  23 Jul 2017 03:49:30 -
@@ -5200,7 +5200,7 @@ vmm_handle_cpuid(struct vcpu *vcpu)

switch (*rax) {
case 0x00:  /* Max level and vendor ID */
-   *rax = 0x0d; /* cpuid_level */
+   *rax = cpuid_level;
	*rbx = *((uint32_t *)&cpu_vendor);
	*rdx = *((uint32_t *)&cpu_vendor + 1);
	*rcx = *((uint32_t *)&cpu_vendor + 2);
@@ -5226,7 +5226,6 @@ vmm_handle_cpuid(struct vcpu *vcpu)
 *  direct cache access (CPUIDECX_DCA)
 *  x2APIC (CPUIDECX_X2APIC)
 *  apic deadline (CPUIDECX_DEADLINE)
-*  timestamp (CPUID_TSC)
 *  apic (CPUID_APIC)
 *  psn (CPUID_PSN)
 *  self snoop (CPUID_SS)
@@ -5243,10 +5242,9 @@ vmm_handle_cpuid(struct vcpu *vcpu)
CPUIDECX_SDBG | CPUIDECX_XTPR | CPUIDECX_PCID |
CPUIDECX_DCA | CPUIDECX_X2APIC | CPUIDECX_DEADLINE);
*rdx = curcpu()->ci_feature_flags &
-   ~(CPUID_ACPI | CPUID_TM | CPUID_TSC |
- CPUID_HTT | CPUID_DS | CPUID_APIC |
- CPUID_PSN | CPUID_SS | CPUID_PBE |
- CPUID_MTRR);
+   ~(CPUID_ACPI | CPUID_TM | CPUID_HTT |
+ CPUID_DS | CPUID_APIC | CPUID_PSN |
+ CPUID_SS | CPUID_PBE | CPUID_MTRR);
break;
case 0x02:  /* Cache and TLB information */
*rax = eax;
@@ -5399,13 +5397,8 @@ vmm_handle_cpuid(struct vcpu *vcpu)
*rcx = 0;
*rdx = 0;
break;
-   case 0x15:  /* TSC / Core Crystal Clock info (not supported) */
-   DPRINTF("%s: function 0x15 (TSC / CCC info) not supported\n",
-   __func__);
-   *rax = 0;
-   *rbx = 0;
-   *rcx = 0;
-   *rdx = 0;
+   case 0x15:
+   CPUID(0x15, *rax, *rbx, *rcx, *rdx);
break;
case 0x16:  /* Processor frequency info (not supported) */
DPRINTF("%s: function 0x16 (frequency info) not supported\n",
@@ -5432,7 +5425,6 @@ vmm_handle_cpuid(struct vcpu *vcpu)
*rbx = 0;   /* Reserved */
*rcx = curcpu()->ci_efeature_ecx;
*rdx = curcpu()->ci_feature_eflags;
-   *rdx &= ~CPUID_RDTSCP;
break;
case 0x8002:/* Brand string */
*rax = curcpu()->ci_brand[0];
@@ -5465,10 +5457,7 @@ vmm_handle_cpuid(struct vcpu *vcpu)
*rdx = curcpu()->ci_extcacheinfo[3];
break;
case 0x8007:/* apmi */
-   *rax = 0;   /* Reserved */
-   *rbx = 0;   /* Reserved */
-   *rcx = 0;   /* Reserved */
-   *rdx = 0;   /* unsupported ITSC */
+   CPUID(0x8007, *rax, *rbx, *rcx, *rdx);
break;
case 0x8008:/* Phys bits info and topology (AMD) */
DPRINTF("%s: function 0x8008 (phys bits info) not "



Re: [PATCH] vmd: write and read device state to and from fd

2017-05-06 Thread Pratik Vyas

* Reyk Floeter  [2017-05-05 18:11:22 +0200]:


Once again and for the record: nice work!


Thank you!


See comments below, otherwise OK.


Addressed and attached a patch inline.


(As mentioned before, we should try to merge this part of init and
restore later to make it easier to keep them in sync.)

The virtio_init() code also has the following line:

vionet[i].vm_vmid = vm->vm_vmid;

vm_vmid is the "vmd userspace id", vm_id is the "kernel id".  We use
the vm_vmid in the integrated bootp/dhcp server to calculate the guest
and host IP addresses for -L mode that are communicated to the VM.

The restored VM will most probably have a new vm_vmid, so you have to
update it as well.  We cannot keep the old one as it a) might not
match what is configured on the host side tap(4) interface via priv.c
and b) it might conflict with a different VM that is running on the
target host.

Now you have to pass vm->vm_vmid to the restore functions, but maybe
it is just easier to pass struct vmd_vm *vm instead of struct
vm_create_params *vcp to all restore_emulated_hw() functions, because
*vm includes the vcp and all other information.



Did this for vionet to be able to set vionet[i].vm_vmid for now.  I do
think breaking out init and reusing parts of its code for restore is
a good idea, and standardising vmd_vm can be part of the same exercise.
Will send a patch in the future.

--
Pratik


Index: usr.sbin/vmd/Makefile
===
RCS file: /home/pdvyas/cvs/src/usr.sbin/vmd/Makefile,v
retrieving revision 1.14
diff -u -p -a -u -r1.14 Makefile
--- usr.sbin/vmd/Makefile   19 Apr 2017 15:38:32 -  1.14
+++ usr.sbin/vmd/Makefile   6 May 2017 17:56:55 -
@@ -6,7 +6,7 @@ PROG=   vmd
SRCS=   vmd.c control.c log.c priv.c proc.c config.c vmm.c
SRCS+=  vm.c loadfile_elf.c pci.c virtio.c i8259.c mc146818.c
SRCS+=  ns8250.c i8253.c vmboot.c ufs.c disklabel.c dhcp.c packet.c
-SRCS+= parse.y
+SRCS+= parse.y atomicio.c

CFLAGS+=-Wall -I${.CURDIR}
CFLAGS+=-Wstrict-prototypes -Wmissing-prototypes
Index: usr.sbin/vmd/atomicio.c
===
RCS file: usr.sbin/vmd/atomicio.c
diff -N usr.sbin/vmd/atomicio.c
--- /dev/null   1 Jan 1970 00:00:00 -
+++ usr.sbin/vmd/atomicio.c 6 May 2017 17:56:55 -
@@ -0,0 +1,153 @@
+/* $OpenBSD: atomicio.c,v 1.28 2016/07/27 23:18:12 djm Exp $ */
+/*
+ * Copyright (c) 2006 Damien Miller. All rights reserved.
+ * Copyright (c) 2005 Anil Madhavapeddy. All rights reserved.
+ * Copyright (c) 1995,1999 Theo de Raadt.  All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "atomicio.h"
+
+/*
+ * ensure all of data on socket comes through. f==read || f==vwrite
+ */
+size_t
+atomicio6(ssize_t (*f) (int, void *, size_t), int fd, void *_s, size_t n,
+int (*cb)(void *, size_t), void *cb_arg)
+{
+   char *s = _s;
+   size_t pos = 0;
+   ssize_t res;
+   struct pollfd pfd;
+
+   pfd.fd = fd;
+   pfd.events = f == read ? POLLIN : POLLOUT;
+   while (n > pos) {
+   res = (f) (fd, s + pos, n - pos);
+   switch (res) {
+   case -1:
+   if (errno == EINTR)
+   continue;
+   if (errno == EAGAIN) {
+   (void)poll(&pfd, 1, -1);
+   continue;
+   }
+   return 0;
+   case 0:
+   errno = EPIPE;
+   return 

[PATCH] vmm: Add XCR0 to readregs / writeregs

2017-05-05 Thread Pratik Vyas


Hello tech@,


This is a patch that extends the readregs and writeregs vmm(4) ioctl to
read and write XCR0. This is required to send and receive FPU state
correctly for vmctl send and vmctl receive. vmctl send / receive are two
new options that will support snapshotting VMs and migrating VMs from
one host to another. This project was undertaken at San Jose State
University along with my three teammates, Ashwin, Harshada and Siri with
mlarkin@ as our advisor.


Thanks,
Pratik
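The snapshot path needs XCR0 because XSAVE-managed FPU state is only meaningful together with the XCR0 value that was live when it was saved.  The hardware also constrains which XCR0 values are legal (XSETBV faults otherwise): bit 0 (x87) must always be set, and AVX state requires SSE state.  A hedged sketch of the validation a hypervisor would want before writing a restored value back (the helper is illustrative; the bit definitions follow the Intel SDM):

```c
#include <assert.h>
#include <stdint.h>

/* Architectural XCR0 bits (Intel SDM, XSAVE state components). */
#define XCR0_X87	0x1ULL	/* x87 state; must always be set */
#define XCR0_SSE	0x2ULL	/* SSE (XMM) state */
#define XCR0_AVX	0x4ULL	/* AVX (YMM) state; depends on SSE */

/*
 * Return 0 if 'xcr0' is a value XSETBV would accept, -1 otherwise.
 * A hypervisor restoring guest state makes the same checks before
 * handing the value back to hardware.
 */
static int
xcr0_valid(uint64_t xcr0)
{
	if (!(xcr0 & XCR0_X87))
		return (-1);	/* bit 0 is mandatory */
	if ((xcr0 & XCR0_AVX) && !(xcr0 & XCR0_SSE))
		return (-1);	/* AVX state requires SSE state */
	return (0);
}
```

This also explains why the vcpu reset templates in the diff initialize VCPU_REGS_XCR0 to XCR0_X87 rather than zero.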



Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /home/pdvyas/cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.138
diff -u -p -a -u -r1.138 vmm.c
--- sys/arch/amd64/amd64/vmm.c  2 May 2017 02:57:46 -   1.138
+++ sys/arch/amd64/amd64/vmm.c  5 May 2017 07:07:56 -
@@ -1396,6 +1396,7 @@ vcpu_readregs_vmx(struct vcpu *vcpu, uin
}
if (regmask & VM_RWREGS_CRS) {
crs[VCPU_REGS_CR2] = vcpu->vc_gueststate.vg_cr2;
+   crs[VCPU_REGS_XCR0] = vcpu->vc_gueststate.vg_xcr0;
if (vmread(VMCS_GUEST_IA32_CR0, &crs[VCPU_REGS_CR0]))
goto errout;
if (vmread(VMCS_GUEST_IA32_CR3, &crs[VCPU_REGS_CR3]))
@@ -1522,6 +1523,7 @@ vcpu_writeregs_vmx(struct vcpu *vcpu, ui
goto errout;
}
if (regmask & VM_RWREGS_CRS) {
+   vcpu->vc_gueststate.vg_xcr0 = crs[VCPU_REGS_XCR0];
if (vmwrite(VMCS_GUEST_IA32_CR0, crs[VCPU_REGS_CR0]))
goto errout;
if (vmwrite(VMCS_GUEST_IA32_CR3, crs[VCPU_REGS_CR3]))
Index: sys/arch/amd64/include/vmmvar.h
===
RCS file: /home/pdvyas/cvs/src/sys/arch/amd64/include/vmmvar.h,v
retrieving revision 1.36
diff -u -p -a -u -r1.36 vmmvar.h
--- sys/arch/amd64/include/vmmvar.h 2 May 2017 02:57:46 -   1.36
+++ sys/arch/amd64/include/vmmvar.h 5 May 2017 07:07:56 -
@@ -328,7 +328,8 @@ struct vcpu_segment_info {
#define VCPU_REGS_CR3   2
#define VCPU_REGS_CR4   3
#define VCPU_REGS_CR8   4
-#define VCPU_REGS_NCRS (VCPU_REGS_CR8 + 1)
+#define VCPU_REGS_XCR0 5
+#define VCPU_REGS_NCRS (VCPU_REGS_XCR0 + 1)

#define VCPU_REGS_CS0
#define VCPU_REGS_DS1
Index: usr.sbin/vmd/vm.c
===
RCS file: /home/pdvyas/cvs/src/usr.sbin/vmd/vm.c,v
retrieving revision 1.15
diff -u -p -a -u -r1.15 vm.c
--- usr.sbin/vmd/vm.c   2 May 2017 07:19:53 -   1.15
+++ usr.sbin/vmd/vm.c   5 May 2017 07:07:56 -
@@ -139,7 +139,8 @@ static const struct vcpu_reg_state vcpu_
.vrs_msrs[VCPU_REGS_LSTAR] = 0ULL,
.vrs_msrs[VCPU_REGS_CSTAR] = 0ULL,
.vrs_msrs[VCPU_REGS_SFMASK] = 0ULL,
-   .vrs_msrs[VCPU_REGS_KGSBASE] = 0ULL
+   .vrs_msrs[VCPU_REGS_KGSBASE] = 0ULL,
+   .vrs_crs[VCPU_REGS_XCR0] = XCR0_X87
#endif
};

@@ -175,7 +176,8 @@ static const struct vcpu_reg_state vcpu_
.vrs_msrs[VCPU_REGS_LSTAR] = 0ULL,
.vrs_msrs[VCPU_REGS_CSTAR] = 0ULL,
.vrs_msrs[VCPU_REGS_SFMASK] = 0ULL,
-   .vrs_msrs[VCPU_REGS_KGSBASE] = 0ULL
+   .vrs_msrs[VCPU_REGS_KGSBASE] = 0ULL,
+   .vrs_crs[VCPU_REGS_XCR0] = XCR0_X87
#endif
};




[PATCH] vmd: write and read device state to and from fd

2017-05-04 Thread Pratik Vyas

Hello tech@,

This patch adds functions to read and write state of devices in vmd. The
atomicio parts are copied from usr.bin/ssh.

Context: This is required for implementing vmctl send and vmctl receive.
vmctl send / receive are two new options that will support snapshotting
VMs and migrating VMs from one host to another. This project was
undertaken at San Jose State University along with my three teammates,
Ashwin, Harshada and Siri with mlarkin@ as our advisor.

We are working with reyk@ on cleaning up rest of the vmd changes.

Thanks,
Pratik
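The point of the atomicio pattern in the diff below is that a single read(2) or write(2) on a socket or pipe may move fewer than n bytes, so device-state serialization loops until everything has transferred.  A simplified, self-contained version of that loop (without the poll/EAGAIN handling of the real usr.bin/ssh code):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Simplified atomicio: retry f() until all n bytes have moved.
 * Returns the number of bytes actually transferred; on a short
 * result, errno says why (EPIPE for EOF).
 */
static size_t
atomicio(ssize_t (*f)(int, void *, size_t), int fd, void *_s, size_t n)
{
	char *s = _s;
	size_t pos = 0;
	ssize_t res;

	while (n > pos) {
		res = f(fd, s + pos, n - pos);
		if (res == -1) {
			if (errno == EINTR)
				continue;	/* retry on signal */
			return (pos);
		}
		if (res == 0) {
			errno = EPIPE;		/* peer closed early */
			return (pos);
		}
		pos += (size_t)res;
	}
	return (pos);
}
```

A device-state struct is then written with one call, e.g. atomicio over the snapshot fd, and read back the same way on receive; a partial return is a hard error.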


Index: usr.sbin/vmd/Makefile
===
RCS file: /home/pdvyas/cvs/src/usr.sbin/vmd/Makefile,v
retrieving revision 1.14
diff -u -p -a -u -r1.14 Makefile
--- usr.sbin/vmd/Makefile   19 Apr 2017 15:38:32 -  1.14
+++ usr.sbin/vmd/Makefile   5 May 2017 03:43:41 -
@@ -6,7 +6,7 @@ PROG=   vmd
SRCS=   vmd.c control.c log.c priv.c proc.c config.c vmm.c
SRCS+=  vm.c loadfile_elf.c pci.c virtio.c i8259.c mc146818.c
SRCS+=  ns8250.c i8253.c vmboot.c ufs.c disklabel.c dhcp.c packet.c
-SRCS+= parse.y
+SRCS+= parse.y atomicio.c

CFLAGS+=-Wall -I${.CURDIR}
CFLAGS+=-Wstrict-prototypes -Wmissing-prototypes
Index: usr.sbin/vmd/atomicio.c
===
RCS file: usr.sbin/vmd/atomicio.c
diff -N usr.sbin/vmd/atomicio.c
--- /dev/null   1 Jan 1970 00:00:00 -
+++ usr.sbin/vmd/atomicio.c 5 May 2017 03:43:41 -
@@ -0,0 +1,306 @@
+/* $OpenBSD: atomicio.c,v 1.28 2016/07/27 23:18:12 djm Exp $ */
+/*
+ * Copyright (c) 2006 Damien Miller. All rights reserved.
+ * Copyright (c) 2005 Anil Madhavapeddy. All rights reserved.
+ * Copyright (c) 1995,1999 Theo de Raadt.  All rights reserved.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "atomicio.h"
+
+/*
+ * ensure all of data on socket comes through. f==read || f==vwrite
+ */
+size_t
+atomicio6(ssize_t (*f) (int, void *, size_t), int fd, void *_s, size_t n,
+int (*cb)(void *, size_t), void *cb_arg)
+{
+   char *s = _s;
+   size_t pos = 0;
+   ssize_t res;
+   struct pollfd pfd;
+
+   pfd.fd = fd;
+   pfd.events = f == read ? POLLIN : POLLOUT;
+   while (n > pos) {
+   res = (f) (fd, s + pos, n - pos);
+   switch (res) {
+   case -1:
+   if (errno == EINTR)
+   continue;
+   if (errno == EAGAIN) {
+   (void)poll(&pfd, 1, -1);
+   continue;
+   }
+   return 0;
+   case 0:
+   errno = EPIPE;
+   return pos;
+   default:
+   pos += (size_t)res;
+   if (cb != NULL && cb(cb_arg, (size_t)res) == -1) {
+   errno = EINTR;
+   return pos;
+   }
+   }
+   }
+   return pos;
+}
+
+size_t
+atomicio(ssize_t (*f) (int, void *, size_t), int fd, void *_s, size_t n)
+{
+   return atomicio6(f, fd, _s, n, NULL, NULL);
+}
+
+/*
+ * ensure all of data on socket comes through. f==readv || f==writev
+ */
+size_t
+atomiciov6(ssize_t (*f) (int, const struct iovec *, int), int fd,
+const struct iovec *_iov, int iovcnt,
+int (*cb)(void *, size_t), void *cb_arg)
+{
+   size_t pos = 0, rem;
+   ssize_t res;
+   struct iovec iov_array[IOV_MAX], *iov = iov_array;
+   struct pollfd pfd;
+
+   if (iovcnt < 0 || iovcnt 

[PATCH] vmm: Add MSRs to readregs / writeregs

2017-04-29 Thread Pratik Vyas

Hello tech@,

This is a patch that extends the readregs and writeregs vmm(4) ioctl to
read and write MSRs as well.

It also sets the IA32_VMX_IA32E_MODE_GUEST entry control in
vcpu_reset_regs based on the value of EFER_LMA.

There are changes to vmmvar.h, so a `make includes` step is required to
build vmd(8).  This should also not alter any behaviour of vmd(8).

Context: These are the kernel changes required to implement vmctl send
and vmctl receive. vmctl send / receive are two new options that will
support snapshotting VMs and migrating VMs from one host to another.
This project was undertaken at San Jose State University along with my
three teammates CCed with mlarkin@ as our advisor.

Patches to vmd(8) that implement vmctl send and vmctl receive are
forthcoming.


Thanks,
Pratik
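The EFER_LMA wrinkle in the diff below: the VM-entry controls have an "IA-32e mode guest" bit (bit 9) that must match the guest's EFER.LMA at entry, so a restored long-mode guest needs the control set even when unrestricted-guest mode would otherwise allow clearing it.  A hedged sketch of that decision (the helper is illustrative; EFER.LME is bit 8 and EFER.LMA bit 10 per the Intel SDM):

```c
#include <assert.h>
#include <stdint.h>

/* EFER bits (Intel SDM). */
#define EFER_LME	(1ULL << 8)	/* long mode enable */
#define EFER_LMA	(1ULL << 10)	/* long mode active */

/* VM-entry control: "IA-32e mode guest" (bit 9). */
#define IA32_VMX_IA32E_MODE_GUEST	(1ULL << 9)

/*
 * Decide whether the IA-32e-mode-guest entry control is needed:
 * always for a restricted guest (which we boot in long mode), and
 * for an unrestricted guest whenever its saved EFER has LMA set.
 */
static uint64_t
entry_ctls_for_efer(uint64_t efer, int unrestricted_guest)
{
	if (!unrestricted_guest || (efer & EFER_LMA))
		return (IA32_VMX_IA32E_MODE_GUEST);
	return (0);
}
```

This mirrors the `if (ug == 1 && !(vrs->vrs_msrs[VCPU_REGS_EFER] & EFER_LMA))` hunk: a fresh unrestricted guest starts in real mode (control clear), while a received long-mode guest must enter with the control set.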



Index: sys/arch/amd64/amd64/vmm.c
===
RCS file: /home/pdvyas/cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.137
diff -u -p -a -u -r1.137 vmm.c
--- sys/arch/amd64/amd64/vmm.c  28 Apr 2017 10:09:37 -  1.137
+++ sys/arch/amd64/amd64/vmm.c  30 Apr 2017 00:01:21 -
@@ -1334,7 +1334,9 @@ vcpu_readregs_vmx(struct vcpu *vcpu, uin
uint64_t sel, limit, ar;
uint64_t *gprs = vrs->vrs_gprs;
uint64_t *crs = vrs->vrs_crs;
+   uint64_t *msrs = vrs->vrs_msrs;
struct vcpu_segment_info *sregs = vrs->vrs_sregs;
+   struct vmx_msr_store *msr_store;

if (vcpu_reload_vmcs_vmx(&vcpu->vc_control_pa))
return (EINVAL);
@@ -1402,6 +1404,14 @@ vcpu_readregs_vmx(struct vcpu *vcpu, uin
goto errout;
}

+   msr_store = (struct vmx_msr_store *)vcpu->vc_vmx_msr_exit_save_va;
+
+   if (regmask & VM_RWREGS_MSRS) {
+   for (i = 0; i < VCPU_REGS_NMSRS; i++) {
+   msrs[i] = msr_store[i].vms_data;
+   }
+   }
+
goto out;

errout:
@@ -1448,7 +1458,9 @@ vcpu_writeregs_vmx(struct vcpu *vcpu, ui
uint64_t limit, ar;
uint64_t *gprs = vrs->vrs_gprs;
uint64_t *crs = vrs->vrs_crs;
+   uint64_t *msrs = vrs->vrs_msrs;
struct vcpu_segment_info *sregs = vrs->vrs_sregs;
+   struct vmx_msr_store *msr_store;

if (loadvmcs) {
if (vcpu_reload_vmcs_vmx(&vcpu->vc_control_pa))
@@ -1518,6 +1530,14 @@ vcpu_writeregs_vmx(struct vcpu *vcpu, ui
goto errout;
}

+   msr_store = (struct vmx_msr_store *)vcpu->vc_vmx_msr_exit_save_va;
+
+   if (regmask & VM_RWREGS_MSRS) {
+   for (i = 0; i < VCPU_REGS_NMSRS; i++) {
+   msr_store[i].vms_data = msrs[i];
+   }
+   }
+
goto out;

errout:
@@ -2128,7 +2148,7 @@ vcpu_reset_regs_vmx(struct vcpu *vcpu, s
 * IA32_VMX_LOAD_DEBUG_CONTROLS
 * IA32_VMX_LOAD_IA32_PERF_GLOBAL_CTRL_ON_ENTRY
 */
-   if (ug == 1)
+   if (ug == 1 && !(vrs->vrs_msrs[VCPU_REGS_EFER] & EFER_LMA))
want1 = 0;
else
want1 = IA32_VMX_IA32E_MODE_GUEST;
@@ -2277,26 +2297,12 @@ vcpu_reset_regs_vmx(struct vcpu *vcpu, s
 */
msr_store = (struct vmx_msr_store *)vcpu->vc_vmx_msr_exit_save_va;

-   /*
-* Make sure LME is enabled in EFER if restricted guest mode is
-* needed.
-*/
-   msr_store[0].vms_index = MSR_EFER;
-   if (ug == 1)
-   msr_store[0].vms_data = 0ULL;   /* Initial value */
-   else
-   msr_store[0].vms_data = EFER_LME;
-
-   msr_store[1].vms_index = MSR_STAR;
-   msr_store[1].vms_data = 0ULL;   /* Initial value */
-   msr_store[2].vms_index = MSR_LSTAR;
-   msr_store[2].vms_data = 0ULL;   /* Initial value */
-   msr_store[3].vms_index = MSR_CSTAR;
-   msr_store[3].vms_data = 0ULL;   /* Initial value */
-   msr_store[4].vms_index = MSR_SFMASK;
-   msr_store[4].vms_data = 0ULL;   /* Initial value */
-   msr_store[5].vms_index = MSR_KERNELGSBASE;
-   msr_store[5].vms_data = 0ULL;   /* Initial value */
+   msr_store[VCPU_REGS_EFER].vms_index = MSR_EFER;
+   msr_store[VCPU_REGS_STAR].vms_index = MSR_STAR;
+   msr_store[VCPU_REGS_LSTAR].vms_index = MSR_LSTAR;
+   msr_store[VCPU_REGS_CSTAR].vms_index = MSR_CSTAR;
+   msr_store[VCPU_REGS_SFMASK].vms_index = MSR_SFMASK;
+   msr_store[VCPU_REGS_KGSBASE].vms_index = MSR_KERNELGSBASE;

/*
 * Currently we have the same count of entry/exit MSRs loads/stores
@@ -2359,6 +2365,13 @@ vcpu_reset_regs_vmx(struct vcpu *vcpu, s
ret = vcpu_writeregs_vmx(vcpu, VM_RWREGS_ALL, 0, vrs);

/*
+* Make sure LME is enabled in EFER if restricted guest mode is
+* needed.
+*/
+   if (ug == 0)
+   msr_store[VCPU_REGS_EFER].vms_data |= EFER_LME;
+
+   /*
 * Set up the MSR bitmap
 */
memset((uint8_t *)vcpu->vc_msr_bitmap_va, 0xFF,