Hi,

the attached diff is another attempt at implementing a pvclock(4)
guest driver.  This improves the clock on KVM and replaces the need
for using the VM-expensive acpihpet(4).

While pvclock(4) is available on KVM, Xen, Hyper-V (in a modified
form), it currently only attaches under KVM:

pvbus0 at mainbus0: KVM
pvclock0 at pvbus0

A few notes:

- The "invtsc" workaround to get a real TSC is not always possible, as
it breaks migration and a few other things.  If you're running in a
cloud, it is very unlikely that you'll get it enabled.

- pvclock suffered from the same problem as early TSC implementations:
it wasn't synced across CPUs, so the OS had to do it manually.  In
OpenBSD, we never supported that and we decided to only support the
newer "invariant" TSCs that started to show up around SkyLake.  And I
think that's why the previous pvlock implementations for OpenBSD
didn't take off.

- So my diff does the same for pvclock what we did for TSC: only
support the "stable" pvclock that guarantees to provide a synchronized
clock by either doing the synchronization on the host side or by
providing it based on an invariant TSC on the host itself.  So this
diff might not work on older KVM versions or CPUs; I'd be happy to see
some test results.

- I did not copy+paste from the Linux and/or FreeBSD (they have
suprising similarities but different licenses), and I wasn't convinced
that it is worth copying the asm code that they use instead of the
following multiplication:
        ctr = ((delta * mul_frac) >> 32) + system_time;
Let me know if I'm wrong and if this code provides significant benefits:
https://github.com/freebsd/freebsd/blob/1d6e4247415d264485ee94b59fdbc12e0c566fd0/sys/x86/x86/pvclock.c#L90-L124

- This driver only implements the "system time" clock, and not the
static "wall clock".  I didn't see a point in supporting the wall
clock as its timestamp is only ever updated when you cold-start the
VM.  The wall clock struct is still defined in the driver for
reference but could probably be removed.

- I assigned a timecounter priority of 1500 to be in the middle
between acpihpet(1000) and tsc(2000).  If TSC is available, we should
probably use it instead; even if pvclock is theoretically better as it
is TSC-based and optimized for VMs.

- This is only a guest driver, not a driver for vmm(4)/vmd(8).  But it
could solve some issues if somebody would implement it for vmm(4)...

-  I'm running it on a busy build machine on Exoscale.ch for at least
a week now.  This machine runs stuff that utilizes all CPUs.  With
HPET, we've sometimes experienced livelocks but the pvclock(4) driver
made it much snappier.

$ time sleep 2
    0m02.00s real     0m00.00s user     0m00.01s system
$ cat /var/db/ntpd.drift                                                  
-4.679
$ sysctl kern.timecounter                                                 
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=1
kern.timecounter.hardware=pvclock0
kern.timecounter.choice=i8254(0) pvclock0(1500) acpihpet0(1000) 
acpitimer0(1000) dummy(-1000000)
$ sysctl hw.model hw.ncpuonline hw.vendor hw.product hw.version
hw.model=Intel Xeon Processor (Skylake)
hw.ncpuonline=8
hw.vendor=Apache Software Foundation
hw.product=CloudStack KVM Hypervisor
hw.version=pc-i440fx-2.11

Feedback? Tests? OKs?

Reyk

Index: share/man/man4/pvclock.4
===================================================================
RCS file: share/man/man4/pvclock.4
diff -N share/man/man4/pvclock.4
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ share/man/man4/pvclock.4    19 Nov 2018 11:48:33 -0000
@@ -0,0 +1,45 @@
+.\"    $OpenBSD$
+.\"
+.\" Copyright (c) 2018 Reyk Floeter <r...@openbsd.org>
+.\"
+.\" Permission to use, copy, modify, and distribute this software for any
+.\" purpose with or without fee is hereby granted, provided that the above
+.\" copyright notice and this permission notice appear in all copies.
+.\"
+.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+.\"
+.Dd $Mdocdate$
+.Dt PVCLOCK 4
+.Os
+.Sh NAME
+.Nm pvclock
+.Nd paravirtual clock driver
+.Sh SYNOPSIS
+.Cd "pvclock* at pvbus?
+.Sh DESCRIPTION
+The
+.Nm
+driver supports the paravirtual clock that is available in KVM and
+other hypervisors.
+.Nm
+uses a shared page between the host and the hypervisor to synchronize
+the TSC clock in an efficient way.
+.Sh SEE ALSO
+.Xr pvbus 4
+.Sh HISTORY
+The
+.Nm
+driver first appeared in
+.Ox 6.5 .
+.Sh AUTHORS
+.An -nosplit
+The
+.Nm
+driver was written by
+.An Reyk Floeter Aq Mt r...@openbsd.org .
Index: sys/arch/amd64/conf/GENERIC
===================================================================
RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v
retrieving revision 1.464
diff -u -p -u -p -r1.464 GENERIC
--- sys/arch/amd64/conf/GENERIC 26 Oct 2018 20:26:19 -0000      1.464
+++ sys/arch/amd64/conf/GENERIC 19 Nov 2018 11:48:33 -0000
@@ -79,6 +79,8 @@ ipmi0 at mainbus? disable     # IPMI
 
 vmt0   at pvbus?               # VMware Tools
 
+pvclock0 at pvbus?             # KVM pvclock
+
 xen0   at pvbus?               # Xen HVM domU
 xnf*   at xen?                 # Xen Netfront
 xbf*   at xen?                 # Xen Blkfront
Index: sys/dev/pv/files.pv
===================================================================
RCS file: /cvs/src/sys/dev/pv/files.pv,v
retrieving revision 1.14
diff -u -p -u -p -r1.14 files.pv
--- sys/dev/pv/files.pv 24 Aug 2018 16:07:01 -0000      1.14
+++ sys/dev/pv/files.pv 19 Nov 2018 11:48:33 -0000
@@ -8,6 +8,11 @@ device pvbus
 attach pvbus at mainbus
 file   dev/pv/pvbus.c                  pvbus   needs-flag
 
+# KVM clock
+device pvclock
+attach pvclock at pvbus
+file   dev/pv/pvclock.c                pvclock needs-flag
+
 # VMware Tools
 device vmt
 attach vmt at pvbus
Index: sys/dev/pv/pvclock.c
===================================================================
RCS file: sys/dev/pv/pvclock.c
diff -N sys/dev/pv/pvclock.c
--- /dev/null   1 Jan 1970 00:00:00 -0000
+++ sys/dev/pv/pvclock.c        19 Nov 2018 11:48:33 -0000
@@ -0,0 +1,229 @@
+/*     $OpenBSD$       */
+
+/*
+ * Copyright (c) 2018 Reyk Floeter <r...@openbsd.org>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#if !defined(__i386__) && !defined(__amd64__)
+#error pvclock(4) is only supported on i386 and amd64
+#endif
+
+#include <sys/param.h>
+#include <sys/systm.h>
+#include <sys/kernel.h>
+#include <sys/timetc.h>
+#include <sys/timeout.h>
+#include <sys/malloc.h>
+#include <sys/atomic.h>
+
+#include <machine/cpu.h>
+#include <uvm/uvm_extern.h>
+
+#include <dev/pv/pvvar.h>
+#include <dev/pv/pvreg.h>
+
+struct pvclock_softc {
+       struct device            sc_dev;
+       void                    *sc_time;
+       paddr_t                  sc_paddr;
+       struct timecounter      *sc_tc;
+};
+
+struct pvclock_wall_clock {
+       uint32_t                 wc_version;
+       uint32_t                 wc_sec;
+       uint32_t                 wc_nsec;
+} __packed;
+
+struct pvclock_time_info {
+       uint32_t                 ti_version;
+       uint32_t                 ti_pad0;
+       uint64_t                 ti_tsc_timestamp;
+       uint64_t                 ti_system_time;
+       uint32_t                 ti_tsc_to_system_mul;
+       int8_t                   ti_tsc_shift;
+       uint8_t                  ti_flags;
+       uint8_t                  ti_pad[2];
+} __packed;
+
+#define PVCLOCK_FLAG_TSC_STABLE                0x01
+#define PVCLOCK_SYSTEM_TIME_ENABLE     0x01
+#define DEVNAME(_s)                    ((_s)->sc_dev.dv_xname)
+
+int     pvclock_match(struct device *, void *, void *);
+void    pvclock_attach(struct device *, struct device *, void *);
+int     pvclock_activate(struct device *, int);
+
+uint    pvclock_get_timecount(struct timecounter *);
+void    pvclock_read_time_info(struct pvclock_softc *,
+           struct pvclock_time_info *);
+
+struct cfattach pvclock_ca = {
+       sizeof(struct pvclock_softc),
+       pvclock_match,
+       pvclock_attach,
+       NULL,
+       pvclock_activate
+};
+
+struct cfdriver pvclock_cd = {
+       NULL,
+       "pvclock",
+       DV_DULL
+};
+
+struct timecounter pvclock_timecounter = {
+       pvclock_get_timecount, NULL, ~0u, 0, NULL, -2000, NULL
+};
+
+int
+pvclock_match(struct device *parent, void *match, void *aux)
+{
+       struct pv_attach_args   *pva = aux;
+       struct pvbus_hv         *hv;
+
+       /*
+        * pvclock is provided by different hypervisors, we currently
+        * only support the "kvmclock".
+        */
+       hv = &pva->pva_hv[PVBUS_KVM];
+       if (hv->hv_base != 0) {
+               /*
+                * We only implement support for the 2nd version of pvclock.
+                * The first version is basically the same but with different
+                * non-standard MSRs and it is deprecated.
+                */
+               if ((hv->hv_features & (1 << KVM_FEATURE_CLOCKSOURCE2)) == 0)
+                       return (0);
+
+               /*
+                * Only the "stable" clock with a sync'ed TSC is supported.
+                * In this case the host guarantees that the TSC is constant
+                * and invariant, either by the underlying TSC or by passing
+                * on a synchronized value.
+                */
+               if ((hv->hv_features &
+                   (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) == 0)
+                       return (0);
+       }
+
+       return (1);
+}
+
+void
+pvclock_attach(struct device *parent, struct device *self, void *aux)
+{
+       struct pvclock_softc    *sc = (struct pvclock_softc *)self;
+       paddr_t                  pa;
+
+       if ((sc->sc_time = km_alloc(PAGE_SIZE,
+           &kv_any, &kp_zero, &kd_nowait)) == NULL) {
+               printf(": time page allocation failed\n");
+               return;
+       }
+       if (!pmap_extract(pmap_kernel(), (vaddr_t)sc->sc_time, &pa)) {
+               printf(": time page PA extraction failed\n");
+               km_free(sc->sc_time, PAGE_SIZE, &kv_any, &kp_zero);
+               return;
+       }
+
+       wrmsr(KVM_MSR_SYSTEM_TIME, pa | PVCLOCK_SYSTEM_TIME_ENABLE);
+       sc->sc_paddr = pa;
+
+       sc->sc_tc = &pvclock_timecounter;
+       sc->sc_tc->tc_name = DEVNAME(sc);
+       sc->sc_tc->tc_frequency = 1000000000ULL;
+       sc->sc_tc->tc_priv = sc;
+
+       /* Better than HPET but below TSC */
+       sc->sc_tc->tc_quality = 1500;
+
+       tc_init(sc->sc_tc);
+
+       printf("\n");
+}
+
+int
+pvclock_activate(struct device *self, int act)
+{
+       struct pvclock_softc    *sc = (struct pvclock_softc *)self;
+       int                      rv = 0;
+       paddr_t                  pa = sc->sc_paddr;
+
+       switch (act) {
+       case DVACT_POWERDOWN:
+               wrmsr(KVM_MSR_SYSTEM_TIME, pa & ~PVCLOCK_SYSTEM_TIME_ENABLE);
+               break;
+       case DVACT_RESUME:
+               wrmsr(KVM_MSR_SYSTEM_TIME, pa | PVCLOCK_SYSTEM_TIME_ENABLE);
+               break;
+       }
+
+       return (rv);
+}
+
+static inline uint32_t
+pvclock_read_begin(const struct pvclock_time_info *ti)
+{
+       uint32_t version = ti->ti_version & ~0x1;
+       virtio_membar_sync();
+       return (version);
+}
+
+static inline int
+pvclock_read_done(const struct pvclock_time_info *ti,
+    uint32_t version)
+{
+       virtio_membar_sync();
+       return (ti->ti_version == version);
+}
+
+uint
+pvclock_get_timecount(struct timecounter *tc)
+{
+       struct pvclock_softc            *sc = tc->tc_priv;
+       struct pvclock_time_info        *ti;
+       uint64_t                         tsc_timestamp, system_time, delta, ctr;
+       uint32_t                         version, mul_frac;
+       int8_t                           shift;
+       uint8_t                          flags;
+
+       ti = sc->sc_time;
+       do {
+               version = pvclock_read_begin(ti);
+               system_time = ti->ti_system_time;
+               tsc_timestamp = ti->ti_tsc_timestamp;
+               mul_frac = ti->ti_tsc_to_system_mul;
+               shift = ti->ti_tsc_shift;
+               flags = ti->ti_flags;
+       } while (!pvclock_read_done(ti, version));
+
+       /* This bit must be set as we attached based on the stable flag */
+       if ((flags & PVCLOCK_FLAG_TSC_STABLE) == 0)
+               panic("%s: unstable result on stable clock", DEVNAME(sc));
+
+       /*
+        * The algorithm is described in
+        * linux/Documentation/virtual/kvm/msr.txt
+        */
+       delta = rdtsc() - tsc_timestamp;
+       if (shift < 0)
+               delta >>= -shift;
+       else
+               delta <<= shift;
+       ctr = ((delta * mul_frac) >> 32) + system_time;
+
+       return (ctr);
+}
Index: sys/dev/pv/pvreg.h
===================================================================
RCS file: /cvs/src/sys/dev/pv/pvreg.h,v
retrieving revision 1.4
diff -u -p -u -p -r1.4 pvreg.h
--- sys/dev/pv/pvreg.h  12 Dec 2015 12:33:49 -0000      1.4
+++ sys/dev/pv/pvreg.h  19 Nov 2018 11:48:33 -0000
@@ -43,6 +43,9 @@
 #define        KVM_MSR_EOI_EN                          0x4b564d04
 #define KVM_PV_EOI_BIT                         0
 
+#define KVM_MSR_WALL_CLOCK                     0x4b564d00
+#define KVM_MSR_SYSTEM_TIME                    0x4b564d01
+
 /*
  * Hyper-V
  */

Reply via email to