Re: uvm_fault: Comments & style cleanup

2021-02-15 Thread Mike Larkin
On Mon, Feb 15, 2021 at 01:15:33PM +0100, Martin Pieuchot wrote:
> On 15/02/21(Mon) 11:47, Martin Pieuchot wrote:
> > Diff below includes non-functional changes:
> >
> > - Sync comments with NetBSD including locking details.
> > - Remove superfluous parentheses and spaces.
> > - Add brackets, even if questionable, to reduce diff with NetBSD
> > - Use for (;;) instead of while(1)
> > - Rename a variable from 'result' into 'error'.
> > - Move uvm_fault() and uvm_fault_upper_lookup()
> > - Add a locking assert in uvm_fault_upper_lookup()
>
> Updated diff on top of recent fix, still ok?
>

I reviewed the diff and agree it introduces no functional changes. If you are
still looking for oks and it helps you with the locking work, ok mlarkin.

-ml

> Index: uvm/uvm_fault.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
> retrieving revision 1.114
> diff -u -p -r1.114 uvm_fault.c
> --- uvm/uvm_fault.c   15 Feb 2021 12:12:54 -  1.114
> +++ uvm/uvm_fault.c   15 Feb 2021 12:14:08 -
> @@ -55,11 +55,11 @@
>   *read/write1 write>1  read/write   +-cow_write/zero
>   * | | ||
>   *  +--|--+   +--|--+ +-+   +  |  + | +-+
> - * amap |  V  |   |  --->new|  || |  ^  |
> + * amap |  V  |   |  -> new |  || |  ^  |
>   *  +-+   +-+ +-+   +  |  + | +--|--+
>   * |||
>   *  +-+   +-+   +--|--+ | +--|--+
> - * uobj | d/c |   | d/c |   |  V  | +|  |
> + * uobj | d/c |   | d/c |   |  V  | ++  |
>   *  +-+   +-+   +-+   +-+
>   *
>   * d/c = don't care
> @@ -69,7 +69,7 @@
>   *
>   *   case [1]: upper layer fault [anon active]
>   * 1A: [read] or [write with anon->an_ref == 1]
> - *   I/O takes place in top level anon and uobj is not touched.
> + *   I/O takes place in upper level anon and uobj is not touched.
>   * 1B: [write with anon->an_ref > 1]
>   *   new anon is alloc'd and data is copied off ["COW"]
>   *
> @@ -89,7 +89,7 @@
>   * the code is structured as follows:
>   *
>   * - init the "IN" params in the ufi structure
> - *   ReFault:
> + *   ReFault: (ERESTART returned to the loop in uvm_fault)
>   * - do lookups [locks maps], check protection, handle needs_copy
>   * - check for case 0 fault (error)
>   * - establish "range" of fault
> @@ -136,8 +136,8 @@
>   *by multiple map entries, and figuring out what should wait could be
>   *complex as well...).
>   *
> - * we use alternative 2 currently.   maybe alternative 3 would be useful
> - * in the future.XXX keep in mind for future consideration//rechecking.
> + * we use alternative 2.  given that we are multi-threaded now we may want
> + * to reconsider the choice.
>   */
>
>  /*
> @@ -177,7 +177,7 @@ uvmfault_anonflush(struct vm_anon **anon
>   int lcv;
>   struct vm_page *pg;
>
> - for (lcv = 0 ; lcv < n ; lcv++) {
> + for (lcv = 0; lcv < n; lcv++) {
>   if (anons[lcv] == NULL)
>   continue;
>   KASSERT(rw_lock_held(anons[lcv]->an_lock));
> @@ -222,14 +222,14 @@ uvmfault_init(void)
>  /*
>   * uvmfault_amapcopy: clear "needs_copy" in a map.
>   *
> + * => called with VM data structures unlocked (usually, see below)
> + * => we get a write lock on the maps and clear needs_copy for a VA
>   * => if we are out of RAM we sleep (waiting for more)
>   */
>  static void
>  uvmfault_amapcopy(struct uvm_faultinfo *ufi)
>  {
> -
> - /* while we haven't done the job */
> - while (1) {
> + for (;;) {
>   /* no mapping?  give up. */
>   if (uvmfault_lookup(ufi, TRUE) == FALSE)
>   return;
> @@ -258,36 +258,46 @@ uvmfault_amapcopy(struct uvm_faultinfo *
>   * uvmfault_anonget: get data in an anon into a non-busy, non-released
>   * page in that anon.
>   *
> - * => we don't move the page on the queues [gets moved later]
> - * => if we allocate a new page [we_own], it gets put on the queues.
> - *either way, the result is that the page is on the queues at return time
> + * => Map, amap and thus anon should be locked by caller.
> + * => If we fail, we unlock everything and error is returned.
> + * => If we are successful, return with everything still locked.
> + * => We do not move the page on the queues [gets moved later].  If we
> + *allocate a new page [we_own], it gets put on the queues.  Either way,
> + *the result is that the page is on the queues at return time
>   */
>  int
>  uvmfault_anonget(struct uvm_faultinfo *ufi, struct vm_amap *amap,
>  struct vm_anon *anon)
>  {
> - boolean_t we_own;   /* we own anon's page? */
> -   

Re: Increase timeout length for VMs trying to fully shutdown

2021-01-05 Thread Mike Larkin
On Tue, Jan 05, 2021 at 12:49:29PM -0700, Tracey Emery wrote:
> Hello tech@,
>
> Some of us have been having shutdown issues with our VMs on OpenBSDAms.
> I tracked down the problem to too short of a timeout for the shutdown
> event.
>
> If there are an additional 1 or 2 package daemons running on the instance,
> the timeout triggers before the VM has shutdown the package daemons and
> properly synced the disks, resulting in a dirty startup.
>
> I've increased the timeout to 2 minutes instead of 30 seconds. My test
> VM on my laptop with 7 additional package daemons succeeded in 60
> seconds, but that might not be fast enough for slower disks.
>
> Am I being conservative enough with this number? Should it be another
> minute or two?
>
> Thoughts? Ok?
>
> --
>
> Tracey Emery
>
> diff 7a6bb14936050379800deb10d4a137c4d2d4a3c4 /usr/src
> blob - 9a64973ab998accb810d56c386c1bb92c204ab20
> file + usr.sbin/vmd/virtio.h
> --- usr.sbin/vmd/virtio.h
> +++ usr.sbin/vmd/virtio.h
> @@ -38,7 +38,7 @@
>
>  /* VMM Control Interface shutdown timeout (in seconds) */
>  #define VMMCI_TIMEOUT            3
> -#define VMMCI_SHUTDOWN_TIMEOUT   30
> +#define VMMCI_SHUTDOWN_TIMEOUT   120
>
>  /* All the devices we support have either 1, 2 or 3 queues */
>  /* viornd - 1 queue
>

I took a look through the code. I'd say this bump is fine; there is no
side effect aside from just waiting for the VM to shut down, *except*
possibly when waiting for the host to shut down (/etc/rc in the shutdown
path). That might take longer if some VMs get stuck in their shutdown
code. But if you got impatient, you could always ^C at that point...
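
To make the trade-off concrete, here is a minimal sketch of a per-second
wait loop, using the numbers from the report (a guest that needs about 60
seconds, with a 30 or 120 second timeout). The function and enum names are
hypothetical and are not taken from vmd; the real code is event driven
rather than a loop.

```c
#include <assert.h>

enum shutdown_result { SHUTDOWN_CLEAN, SHUTDOWN_FORCED };

/*
 * Hypothetical model: the guest finishes its shutdown after
 * guest_secs_needed seconds; the host gives up after timeout_secs.
 */
enum shutdown_result
wait_for_guest_shutdown(int guest_secs_needed, int timeout_secs)
{
	int elapsed;

	assert(guest_secs_needed >= 0 && timeout_secs >= 0);

	for (elapsed = 0; elapsed <= timeout_secs; elapsed++) {
		/* The guest acknowledged before the timer fired. */
		if (elapsed >= guest_secs_needed)
			return SHUTDOWN_CLEAN;
	}
	/* Timer fired first: the VM is stopped with dirty disks. */
	return SHUTDOWN_FORCED;
}
```

With these inputs, a 60 second guest is forced off under a 30 second
timeout and shuts down cleanly under a 120 second one. The cost of the
larger value is only paid when a guest actually hangs.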

ok mlarkin

-ml



Re: PATCH: Fix PCI Config Space union size on VMM

2020-09-09 Thread Mike Larkin
On Mon, Sep 07, 2020 at 06:03:00PM -0500, Jordan Hargrave wrote:
> This code fixes the pci device union for accessing PCI config space >= 0x40
>
> Running pcidump -xxx in a virtual machine would return garbage data due to
> union overlap.
>

Thanks, looks good from my perspective.
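
For readers who want to see the overlap, the following is a reduced sketch
of the two layouts. It is not the vmd structure: the header fields are
collapsed into a 0x40-byte array, only one bookkeeping field is kept, and
the 256-byte config space size is an assumption. All names are
illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CFG_SPACE_SIZE	256	/* assumed config space size in bytes */
#define HDR_SIZE	0x40	/* standard header: offsets 0x00..0x3f */

/* Old shape: bookkeeping sits in the struct that aliases config space. */
union old_dev {
	uint32_t cfg_space[CFG_SPACE_SIZE / 4];
	struct {
		uint8_t hdr[HDR_SIZE];	/* stand-in for the header fields */
		uint8_t bar_ct;		/* bookkeeping, not a register */
	} s;
};

/* New shape: only the header aliases config space. */
struct new_dev {
	union {
		uint32_t cfg_space[CFG_SPACE_SIZE / 4];
		uint8_t hdr[HDR_SIZE];
	} u;
	uint8_t bar_ct;
};

size_t
old_bookkeeping_offset(void)
{
	return offsetof(union old_dev, s.bar_ct);
}

size_t
new_bookkeeping_offset(void)
{
	return offsetof(struct new_dev, bar_ct);
}

/* Store a bookkeeping value, then read config space byte 0x40. */
uint8_t
old_cfg_byte_at_0x40(uint8_t bar_ct)
{
	union old_dev d;

	assert(sizeof(d.cfg_space) == CFG_SPACE_SIZE);
	memset(&d, 0, sizeof(d));
	d.s.bar_ct = bar_ct;
	return ((const uint8_t *)d.cfg_space)[HDR_SIZE];
}
```

In the old shape the bookkeeping field lands at offset 0x40, so a guest
reading config space at or above 0x40 sees host bookkeeping instead of
register contents. In the new shape it sits past the end of the config
space array.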

-ml

> On Mon, Sep 07, 2020 at 05:52:55PM -0500, Jordan Hargrave wrote:
> > Index: pci.h
> > ===
> > RCS file: /cvs/src/usr.sbin/vmd/pci.h,v
> > retrieving revision 1.7
> > diff -u -p -u -r1.7 pci.h
> > --- pci.h   17 Sep 2017 23:07:56 -  1.7
> > +++ pci.h   7 Sep 2020 22:48:09 -
> > @@ -32,43 +32,44 @@ typedef int (*pci_iobar_fn_t)(int dir, u
> >  void *, uint8_t);
> >  typedef int (*pci_mmiobar_fn_t)(int dir, uint32_t ofs, uint32_t *data);
> >
> > -union pci_dev {
> > -   uint32_t pd_cfg_space[PCI_CONFIG_SPACE_SIZE / 4];
> >
> > -   struct {
> > -   uint16_t pd_vid;
> > -   uint16_t pd_did;
> > -   uint16_t pd_cmd;
> > -   uint16_t pd_status;
> > -   uint8_t pd_rev;
> > -   uint8_t pd_prog_if;
> > -   uint8_t pd_subclass;
> > -   uint8_t pd_class;
> > -   uint8_t pd_cache_size;
> > -   uint8_t pd_lat_timer;
> > -   uint8_t pd_header_type;
> > -   uint8_t pd_bist;
> > -   uint32_t pd_bar[PCI_MAX_BARS];
> > -   uint32_t pd_cardbus_cis;
> > -   uint16_t pd_subsys_vid;
> > -   uint16_t pd_subsys_id;
> > -   uint32_t pd_exp_rom_addr;
> > -   uint8_t pd_cap;
> > -   uint32_t pd_reserved0 : 24;
> > -   uint32_t pd_reserved1;
> > -   uint8_t pd_irq;
> > -   uint8_t pd_int;
> > -   uint8_t pd_min_grant;
> > -   uint8_t pd_max_grant;
> > +struct pci_dev {
> > +   union {
> > +   uint32_t pd_cfg_space[PCI_CONFIG_SPACE_SIZE / 4];
> > +   struct {
> > +   uint16_t pd_vid;
> > +   uint16_t pd_did;
> > +   uint16_t pd_cmd;
> > +   uint16_t pd_status;
> > +   uint8_t pd_rev;
> > +   uint8_t pd_prog_if;
> > +   uint8_t pd_subclass;
> > +   uint8_t pd_class;
> > +   uint8_t pd_cache_size;
> > +   uint8_t pd_lat_timer;
> > +   uint8_t pd_header_type;
> > +   uint8_t pd_bist;
> > +   uint32_t pd_bar[PCI_MAX_BARS];
> > +   uint32_t pd_cardbus_cis;
> > +   uint16_t pd_subsys_vid;
> > +   uint16_t pd_subsys_id;
> > +   uint32_t pd_exp_rom_addr;
> > +   uint8_t pd_cap;
> > +   uint32_t pd_reserved0 : 24;
> > +   uint32_t pd_reserved1;
> > +   uint8_t pd_irq;
> > +   uint8_t pd_int;
> > +   uint8_t pd_min_grant;
> > +   uint8_t pd_max_grant;
> > +   } __packed;
> > +   };
> > +   uint8_t pd_bar_ct;
> > +   pci_cs_fn_t pd_csfunc;
> >
> > -   uint8_t pd_bar_ct;
> > -   pci_cs_fn_t pd_csfunc;
> > -
> > -   uint8_t pd_bartype[PCI_MAX_BARS];
> > -   uint32_t pd_barsize[PCI_MAX_BARS];
> > -   void *pd_barfunc[PCI_MAX_BARS];
> > -   void *pd_bar_cookie[PCI_MAX_BARS];
> > -   } __packed;
> > +   uint8_t pd_bartype[PCI_MAX_BARS];
> > +   uint32_t pd_barsize[PCI_MAX_BARS];
> > +   void *pd_barfunc[PCI_MAX_BARS];
> > +   void *pd_bar_cookie[PCI_MAX_BARS];
> >  };
> >
> >  struct pci {
> > @@ -79,7 +80,7 @@ struct pci {
> > uint32_t pci_addr_reg;
> > uint32_t pci_data_reg;
> >
> > -   union pci_dev pci_devices[PCI_CONFIG_MAX_DEV];
> > +   struct pci_dev pci_devices[PCI_CONFIG_MAX_DEV];
> >  };
> >
> >  void pci_handle_address_reg(struct vm_run_params *);
> >
>



Re: amd64: add tsc_delay(), a TSC-based delay(9) implementation

2020-08-25 Thread Mike Larkin
On Tue, Aug 25, 2020 at 12:12:36PM -0700, Mike Larkin wrote:
> On Mon, Aug 24, 2020 at 01:55:45AM +0200, Mark Kettenis wrote:
> > > Date: Sun, 23 Aug 2020 18:11:12 -0500
> > > From: Scott Cheloha 
> > >
> > > Hi,
> > >
> > > Other BSDs use the TSC to implement delay(9) if the TSC is constant
> > > and invariant.  Here's a patch to add something similar to our kernel.
> >
> > If the TSC is fine as a timecounter it should be absolutely fine for
> > > use as delay().  And we could even use it if the TSC isn't synchronized
> > between CPUs.
> >
> > >
> > > This patch (or something equivalent) is a prerequisite to running the
> > > lapic timer in oneshot or TSC deadline mode.  Using the lapic timer to
> > > implement delay(9) when it isn't running in periodic mode is too
> > > complicated.  However, using the i8254 for delay(9) is too slow.  We
> > > need an alternative.
> >
> > Hmm, but what are we going to use on machines where the TSC isn't
> > constant/invariant?
> >
> > In what respect is the i8254 too slow?  Does it take more than a
> > microsecond to read it?
> >
>
> It's 3 outb/inb pairs to ensure you get the reading correct. So that could
> be quite a long time (as cheloha@ points out). Also, that's 6 VM exits if
> running virtually (I realize that's not the main use case here but just
> saying...)
>
> IIRC the 3 in/out pairs are the latch command followed by reading the LSB/MSB
> of the counter. It's not MMIO like the HPET or ACPI timer.
>
> And as cheloha@ also points out, it is highly likely that none of us have a
> real i8254 anymore, much of this is probably implemented in some EC somewhere
> and it's unlikely the developer of said EC put a lot of effort into optimizing
> the implementation of a legacy device like this.
>
> On the topic of virtualization:
>
> while (rdtsc() - start < want)
>  rdtsc();
>

I just realized the original diff didn't do two rdtscs. It did a pause inside
the loop. So the effect is not *as* bad as I described but it's still
*somewhat* bad.

PS - pause loop exiting can be enabled to improve performance in this situation.

> ..produces two VM exits (generally, on most hypervisors) since the TSC is
> usually time corrected. That's a lot of exits, and it gets worse on faster
> machines. I don't have a better idea, however. There may be a PV clock option
> that is more optimized in some scenarios.
>
> -ml
>
>
> > We could use the HPET I suppose, which may be a bit better.
> >
> > > As for the patch, it works for me here, though I'd appreciate a few
> > > tests.  I admit that comparing function pointers is ugly, but I think
> > > this is as simple as it can be without implementing some sort of
> > > framework for "registering" delay(9) implementations and comparing
> > > them and selecting the "best" implementation.
> >
> > What about:
> >
> > if (delay_func == NULL)
> > delay_func = lapic_delay;
> >
> > > I'm not sure I put the prototypes in the right headers.  We don't have
> > > a tsc.h but cpuvar.h looks sorta-correct for tsc_delay().
> >
> > I think cpuvar.h is fine since it has other TSC-related stuff.
> > However, with my suggestion above you can drop that.
> >
> > > FreeBSD's x86/delay.c may be of note:
> > >
> > > https://github.com/freebsd/freebsd/blob/ed96335a07b688c39e16db8856232e5840bc22ac/sys/x86/x86/delay.c
> > >
> > > Thoughts?
> > >
> > > Index: amd64/tsc.c
> > > ===
> > > RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
> > > retrieving revision 1.20
> > > diff -u -p -r1.20 tsc.c
> > > --- amd64/tsc.c   23 Aug 2020 21:38:47 -  1.20
> > > +++ amd64/tsc.c   23 Aug 2020 22:59:25 -
> > > @@ -26,6 +26,7 @@
> > >
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #define RECALIBRATE_MAX_RETRIES  5
> > >  #define RECALIBRATE_SMI_THRESHOLD  5
> > > @@ -252,7 +253,8 @@ tsc_timecounter_init(struct cpu_info *ci
> > >   tsc_timecounter.tc_quality = -1000;
> > >   tsc_timecounter.tc_user = 0;
> > >   tsc_is_invariant = 0;
> > > - }
> > > + } else
> > > + delay_func = tsc_delay;
> > >
> > >   tc_init(&tsc_timecounter);
> > >  }
> > > @@ -342,4 +344,15 @@ tsc_sync_ap(struct cpu_info *ci)

Re: amd64: add tsc_delay(), a TSC-based delay(9) implementation

2020-08-25 Thread Mike Larkin
On Mon, Aug 24, 2020 at 12:29:15AM -0500, Scott Cheloha wrote:
> On Sun, Aug 23, 2020 at 11:45:22PM -0500, Scott Cheloha wrote:
> >
> > [...]
> >
> > > > This patch (or something equivalent) is a prerequisite to running the
> > > > lapic timer in oneshot or TSC deadline mode.  Using the lapic timer to
> > > > implement delay(9) when it isn't running in periodic mode is too
> > > > complicated.  However, using the i8254 for delay(9) is too slow.  We
> > > > need an alternative.
> > >
> > > Hmm, but what are we going to use on machines where the TSC isn't
> > > constant/invariant?
> >
> > Probably fall back on the i8254?  Unless someone wants to add yet
> > another delay(9) implementation to amd64...
> >
> > > In what respect is the i8254 too slow?  Does it take more than a
> > > microsecond to read it?
> >
> > On my machine, the portion of gettick() *within* the mutex runs in ~19
> > microseconds.
> >
> > That's before any overhead from mtx_enter(9).  I think having multiple
> > threads in delay(9) should be relatively rare, but you have to keep
> > that in mind.
> >
> > No idea what the overhead would look like on real hardware.  I'm
> > pretty sure my i8254 is emulated.
> >
> > > We could use the HPET I suppose, which may be a bit better.
> >
> > It's better.  No mutex.  On my machine it takes ~11 microseconds.
> > It's a start.
>
> Hmmm, now I'm worried I have screwed something up or misconfigured
> something.
>
> It doesn't seem right that it would take 20K cycles to read the HPET
> on this machine.
>
> Am I way off?  Or is 20K actually a reasonable number?
>

There have been reports of the HPET being really slow on some machines.
IIRC this is why we ended up getting a tsc timecounter a number of years
ago. Someone (reyk@?) found his Skylake had a super slow HPET and that
ended up being part of the impetus to do a tsc timecounter.

Also, 20k cycles is totally expected if you are on a VM (not sure if
this is the case).
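
As a rough check on the "20K cycles" figure, the sketch below converts the
reported read latencies to cycles. The 1.9 GHz rate is an assumption taken
from the nominal clock in the hw.model line quoted further down, not a
measured TSC frequency, and the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a duration in microseconds to cycles at a given clock rate. */
uint64_t
usecs_to_cycles(uint64_t usecs, uint64_t hz)
{
	/* Guard against overflow of the intermediate product. */
	assert(hz != 0 && usecs <= UINT64_MAX / hz);

	return usecs * hz / 1000000;
}
```

Under that assumption, the ~11 microsecond HPET read works out to
usecs_to_cycles(11, 1900000000) = 20900 cycles, and the ~19 microsecond
i8254 read to 36100 cycles.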


> For comparison, lapic_gettick() completes in... 80 nanoseconds (?) on
> the same machine.  Relevant sysctls:
>

LAPIC memory page accesses go to the CPU. It's not always the case that
the HPET does the same (they may be accessed via PCI). Also, in a VM,
on new CPUs, LAPIC virtualization can be enabled which means no exits
for LAPIC accesses. So, yeah, these numbers you are seeing aren't surprising.

> $ sysctl hw.{model,setperf,perfpolicy} machdep.{tscfreq,invarianttsc}
> hw.model=Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
> hw.setperf=100
> hw.perfpolicy=high
> machdep.tscfreq=211200
> machdep.invarianttsc=1
>
> ... if it really takes that long, then "high precision" is a bit of a
> misnomer.
>



Re: amd64: add tsc_delay(), a TSC-based delay(9) implementation

2020-08-25 Thread Mike Larkin
On Mon, Aug 24, 2020 at 01:55:45AM +0200, Mark Kettenis wrote:
> > Date: Sun, 23 Aug 2020 18:11:12 -0500
> > From: Scott Cheloha 
> >
> > Hi,
> >
> > Other BSDs use the TSC to implement delay(9) if the TSC is constant
> > and invariant.  Here's a patch to add something similar to our kernel.
>
> If the TSC is fine as a timecounter it should be absolutely fine for
> use as delay().  And we could even use it if the TSC isn't synchronized
> between CPUs.
>
> >
> > This patch (or something equivalent) is a prerequisite to running the
> > lapic timer in oneshot or TSC deadline mode.  Using the lapic timer to
> > implement delay(9) when it isn't running in periodic mode is too
> > complicated.  However, using the i8254 for delay(9) is too slow.  We
> > need an alternative.
>
> Hmm, but what are we going to use on machines where the TSC isn't
> constant/invariant?
>
> In what respect is the i8254 too slow?  Does it take more than a
> microsecond to read it?
>

It's 3 outb/inb pairs to ensure you get the reading correct. So that could
be quite a long time (as cheloha@ points out). Also, that's 6 VM exits if
running virtually (I realize that's not the main use case here but just
saying...)

IIRC the 3 in/out pairs are the latch command followed by reading the LSB/MSB
of the counter. It's not MMIO like the HPET or ACPI timer.
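
The latch-then-read sequence can be sketched against a mock device. The
port numbers (0x40 for counter 0, 0x43 for the mode/command register) and
the latch command value follow the usual PC convention; the mock itself and
all function names are hypothetical, and real code would use outb()/inb().
Latching freezes a snapshot of the 16-bit count so that the two byte reads
are consistent even though the live counter keeps running.

```c
#include <assert.h>
#include <stdint.h>

#define PIT_CNTR0	0x40	/* counter 0 data port */
#define PIT_MODE	0x43	/* mode/command port */
#define PIT_LATCH_CNTR0	0x00	/* latch command for counter 0 */

struct mock_pit {
	uint16_t counter;	/* live, free-running count */
	uint16_t latched;	/* snapshot taken by the latch command */
	int read_msb_next;	/* 0: next read returns LSB, 1: MSB */
	unsigned io_accesses;	/* port accesses seen so far */
};

void
mock_outb(struct mock_pit *p, uint16_t port, uint8_t val)
{
	p->io_accesses++;
	if (port == PIT_MODE && val == PIT_LATCH_CNTR0) {
		p->latched = p->counter;
		p->read_msb_next = 0;
	}
}

uint8_t
mock_inb(struct mock_pit *p, uint16_t port)
{
	uint8_t v;

	assert(port == PIT_CNTR0);
	p->io_accesses++;
	if (p->read_msb_next)
		v = (uint8_t)(p->latched >> 8);
	else
		v = (uint8_t)(p->latched & 0xff);
	p->read_msb_next = !p->read_msb_next;
	p->counter--;		/* the live counter moves on meanwhile */
	return v;
}

/* Latch counter 0, then read the low and high bytes of the snapshot. */
uint16_t
pit_read_counter(struct mock_pit *p)
{
	uint8_t lo, hi;

	mock_outb(p, PIT_MODE, PIT_LATCH_CNTR0);
	lo = mock_inb(p, PIT_CNTR0);
	hi = mock_inb(p, PIT_CNTR0);
	return (uint16_t)((hi << 8) | lo);
}
```

Every access in this sketch is port I/O rather than a memory load, which is
the part that makes a single reading expensive.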

And as cheloha@ also points out, it is highly likely that none of us have a
real i8254 anymore, much of this is probably implemented in some EC somewhere
and it's unlikely the developer of said EC put a lot of effort into optimizing
the implementation of a legacy device like this.

On the topic of virtualization:

while (rdtsc() - start < want)
 rdtsc();

..produces two VM exits (generally, on most hypervisors) since the TSC is
usually time corrected. That's a lot of exits, and it gets worse on faster
machines. I don't have a better idea, however. There may be a PV clock option
that is more optimized in some scenarios.
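
A busy-wait of this kind can be sketched with the counter read passed in
as a function pointer, which makes the number of reads visible. Everything
here is illustrative: counter_delay(), the fake counter and its step size
are assumptions for the sketch, and a kernel version would read the TSC
directly.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*counter_fn)(void *);

/* Fake counter: advances by a fixed step on every read. */
struct fake_tsc {
	uint64_t now;
	uint64_t step;
	unsigned reads;
};

uint64_t
fake_tsc_read(void *arg)
{
	struct fake_tsc *t = arg;

	t->reads++;
	t->now += t->step;
	return t->now;
}

/*
 * Spin until at least usecs microseconds of counter time have passed.
 * Returns the number of cycles actually waited.  Unsigned subtraction
 * keeps the comparison correct if the counter wraps.
 */
uint64_t
counter_delay(counter_fn read_counter, void *arg, uint64_t hz,
    unsigned usecs)
{
	uint64_t interval, start, now;

	assert(read_counter != 0 && hz != 0);

	interval = (uint64_t)usecs * hz / 1000000;
	start = read_counter(arg);
	do {
		now = read_counter(arg);
	} while (now - start < interval);
	return now - start;
}
```

With a 1 GHz fake counter stepping 100 cycles per read, a 1 microsecond
delay takes 11 reads. If each read traps to a hypervisor, the exit count
grows with the length of the delay and with how quickly the guest can
issue reads.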

-ml


> We could use the HPET I suppose, which may be a bit better.
>
> > As for the patch, it works for me here, though I'd appreciate a few
> > tests.  I admit that comparing function pointers is ugly, but I think
> > this is as simple as it can be without implementing some sort of
> > framework for "registering" delay(9) implementations and comparing
> > them and selecting the "best" implementation.
>
> What about:
>
>   if (delay_func == NULL)
>   delay_func = lapic_delay;
>
> > I'm not sure I put the prototypes in the right headers.  We don't have
> > a tsc.h but cpuvar.h looks sorta-correct for tsc_delay().
>
> I think cpuvar.h is fine since it has other TSC-related stuff.
> However, with my suggestion above you can drop that.
>
> > FreeBSD's x86/delay.c may be of note:
> >
> > https://github.com/freebsd/freebsd/blob/ed96335a07b688c39e16db8856232e5840bc22ac/sys/x86/x86/delay.c
> >
> > Thoughts?
> >
> > Index: amd64/tsc.c
> > ===
> > RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
> > retrieving revision 1.20
> > diff -u -p -r1.20 tsc.c
> > --- amd64/tsc.c 23 Aug 2020 21:38:47 -  1.20
> > +++ amd64/tsc.c 23 Aug 2020 22:59:25 -
> > @@ -26,6 +26,7 @@
> >
> >  #include 
> >  #include 
> > +#include 
> >
> >  #define RECALIBRATE_MAX_RETRIES5
> >  #define RECALIBRATE_SMI_THRESHOLD  5
> > @@ -252,7 +253,8 @@ tsc_timecounter_init(struct cpu_info *ci
> > tsc_timecounter.tc_quality = -1000;
> > tsc_timecounter.tc_user = 0;
> > tsc_is_invariant = 0;
> > -   }
> > +   } else
> > +   delay_func = tsc_delay;
> >
> > > tc_init(&tsc_timecounter);
> >  }
> > @@ -342,4 +344,15 @@ tsc_sync_ap(struct cpu_info *ci)
> >  {
> > tsc_post_ap(ci);
> > tsc_post_ap(ci);
> > +}
> > +
> > +void
> > +tsc_delay(int usecs)
> > +{
> > +   uint64_t interval, start;
> > +
> > > +   interval = (uint64_t)usecs * tsc_frequency / 1000000;
> > +   start = rdtsc_lfence();
> > +   while (rdtsc_lfence() - start < interval)
> > +   CPU_BUSY_CYCLE();
> >  }
> > Index: amd64/lapic.c
> > ===
> > RCS file: /cvs/src/sys/arch/amd64/amd64/lapic.c,v
> > retrieving revision 1.55
> > diff -u -p -r1.55 lapic.c
> > --- amd64/lapic.c   3 Aug 2019 14:57:51 -   1.55
> > +++ amd64/lapic.c   23 Aug 2020 22:59:25 -
> > @@ -41,6 +41,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -569,7 +570,8 @@ skip_calibration:
> >  * Now that the timer's calibrated, use the apic timer routines
> >  * for all our timing needs..
> >  */
> > -   delay_func = lapic_delay;
> > +   if (delay_func != tsc_delay)
> > +   delay_func = lapic_delay;
> > initclock_func = lapic_initclocks;
> > }
> >  }
> > Index: include/cpuvar.h
> > 

Re: kernel crash in setrunqueue

2020-07-29 Thread Mike Larkin
On Wed, Jul 29, 2020 at 10:14:11PM +0200, Mark Kettenis wrote:
> > Date: Wed, 29 Jul 2020 13:03:43 -0700
> > From: Mike Larkin 
> >
> > Hi,
> >
> >  I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> > on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> > one. It does not happen on GENERIC kernels.
> >
> >  The crash will happen fairly quickly after the kernel starts executing
> > processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> > or two. It rarely makes it to the login prompt. The problem is 100%
> > reproducible on two different VMs I have, running on two different
> > hypervisors (Hyper-V and ESXi6.7U2).
> >
> >  I first started noticing the problem on the 24th July snap, but TBH these
> > machines were not frequently updated, so the previous snap I had installed
> > might have been a couple months old. Whatever older snap was on them before
> > worked fine.
> >
> >  Since this is happening on two different machines with two different VMs,
> > I'm gonna rule out hardware issues.
> >
> >  Crash:
> >
> > > kernel: protection fault trap, code=0
> > > Stopped at  setrunqueue+0xa2:   addl    $0x1,0x288(%r13)
> >
> >  Trace:
> > ddb{2}> trace
> > setrunqueue(27b3d6c24c3fab80, 800015e874e0,32) at setrunqueue+0xa2
> > sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> > taskq_thread(82121548) at taskq_thread+0x8d
> > end trace frame: 0x0, count: -3
> >
> >  Registers:
> > ddb{2}> sh r
> > > rdi     0x821ee728  sched_lock
> > > rsi     0x800014cc6ff0
> > > rbp     0x800015ea0e40
> > > rbx     0
> > > rdx     0x23ca94  acpi_pdirpa_0x2288fc
> > > rcx     0xc
> > > rax     0xc
> > > r8      0x202
> > > r9      0x2
> > > r10     0
> > > r11     0x57f79bf6968709d8
> > > r12     0x800015e874e0
> > > r13     0x27b3d6c24c3fab80
> > > r14     0x32
> > > r15     0x27b3d6c24c3fab80
> > > rip     0x81b9df22  setrunqueue+0xa2
> > > cs      0x8
> > > rflags  0x10207  __ALIGN_SIZE+0xf207
> > > rsp     0x800015ea0df0
> > > ss      0x10
> >
> >
> > The offending instruction is in kern_sched.c:260:
> >
> > spc->spc_nrun++;
> >
> > ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> > tests, %r13 always is this same trash value. That comes from 'ci', which is
> > either passed in or chosen by sched_choosecpu. Neither of these functions
> > > have changed recently, so I'm guessing this corruption is coming from
> > > something else.
> >
> > >  Anyone have ideas where to start looking? I suppose I could start bisecting,
> > but does anyone know of any changes that would affect this area?
> >
> >  I can send dmesgs if needed, but these are pretty standard VMs,
> > nothing fancy configured in them. 4 CPUs, 8GB RAM, etc.
>
> They're VMs and it turns out that many of the "PV" drivers are/were
> using the intr_barrier() interface the wrong way.
>
> For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
> 17 snapshot" thread on bugs@ from earlier today.
>
> Cheers,
>
> Mark
>

Thanks. I don't subscribe to bugs@ anymore, so that's why I likely missed it.

-ml



Re: kernel crash in setrunqueue

2020-07-29 Thread Mike Larkin
On Wed, Jul 29, 2020 at 01:03:43PM -0700, Mike Larkin wrote:
> Hi,
>
>  I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> one. It does not happen on GENERIC kernels.
>
>  The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> or two. It rarely makes it to the login prompt. The problem is 100%
> reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi6.7U2).
>
>  I first started noticing the problem on the 24th July snap, but TBH these
> machines were not frequently updated, so the previous snap I had installed
> might have been a couple months old. Whatever older snap was on them before
> worked fine.
>
>  Since this is happening on two different machines with two different VMs,
> I'm gonna rule out hardware issues.
>
>  Crash:
>
> > kernel: protection fault trap, code=0
> > Stopped at  setrunqueue+0xa2:   addl    $0x1,0x288(%r13)
>
>  Trace:
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80, 800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
>
>  Registers:
> ddb{2}> sh r
> > rdi     0x821ee728  sched_lock
> > rsi     0x800014cc6ff0
> > rbp     0x800015ea0e40
> > rbx     0
> > rdx     0x23ca94  acpi_pdirpa_0x2288fc
> > rcx     0xc
> > rax     0xc
> > r8      0x202
> > r9      0x2
> > r10     0
> > r11     0x57f79bf6968709d8
> > r12     0x800015e874e0
> > r13     0x27b3d6c24c3fab80
> > r14     0x32
> > r15     0x27b3d6c24c3fab80
> > rip     0x81b9df22  setrunqueue+0xa2
> > cs      0x8
> > rflags  0x10207  __ALIGN_SIZE+0xf207
> > rsp     0x800015ea0df0
> > ss      0x10
>
>
> The offending instruction is in kern_sched.c:260:
>
>   spc->spc_nrun++;
>
> ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> tests, %r13 always is this same trash value. That comes from 'ci', which is
> either passed in or chosen by sched_choosecpu. Neither of these functions
> > have changed recently, so I'm guessing this corruption is coming from
> > something else.
>
>  Anyone have ideas where to start looking? I suppose I could start bisecting,
> but does anyone know of any changes that would affect this area?
>
>  I can send dmesgs if needed, but these are pretty standard VMs, nothing fancy
> configured in them. 4 CPUs, 8GB RAM, etc.
>
> -ml
>

Also I should note that the problem happens with snaps as well as kernels built
from source (-current), so this isn't likely something that is in snaps but not
yet in tree.

-ml



kernel crash in setrunqueue

2020-07-29 Thread Mike Larkin
Hi,

 I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
on GENERIC.MP regardless of whether or not the VM has one cpu or more than
one. It does not happen on GENERIC kernels.

 The crash will happen fairly quickly after the kernel starts executing
processes. Sometimes it crashes instantly, sometimes it lasts for a minute
or two. It rarely makes it to the login prompt. The problem is 100%
reproducible on two different VMs I have, running on two different
hypervisors (Hyper-V and ESXi6.7U2).

 I first started noticing the problem on the 24th July snap, but TBH these
machines were not frequently updated, so the previous snap I had installed
might have been a couple months old. Whatever older snap was on them before
worked fine.

 Since this is happening on two different machines with two different VMs,
I'm gonna rule out hardware issues.

 Crash:

kernel: protection fault trap, code=0
Stopped at  setrunqueue+0xa2:   addl    $0x1,0x288(%r13)

 Trace:
ddb{2}> trace
setrunqueue(27b3d6c24c3fab80, 800015e874e0,32) at setrunqueue+0xa2
sched_barrier_task(800015f1a168) at sched_barrier_task+0x6c
taskq_thread(82121548) at taskq_thread+0x8d
end trace frame: 0x0, count: -3

 Registers:
ddb{2}> sh r
rdi     0x821ee728  sched_lock
rsi     0x800014cc6ff0
rbp     0x800015ea0e40
rbx     0
rdx     0x23ca94  acpi_pdirpa_0x2288fc
rcx     0xc
rax     0xc
r8      0x202
r9      0x2
r10     0
r11     0x57f79bf6968709d8
r12     0x800015e874e0
r13     0x27b3d6c24c3fab80
r14     0x32
r15     0x27b3d6c24c3fab80
rip     0x81b9df22  setrunqueue+0xa2
cs      0x8
rflags  0x10207  __ALIGN_SIZE+0xf207
rsp     0x800015ea0df0
ss      0x10


The offending instruction is in kern_sched.c:260:

spc->spc_nrun++;

... which indicates 'spc' is trash (and it is, based on %r13 above). In my
tests, %r13 always is this same trash value. That comes from 'ci', which is
either passed in or chosen by sched_choosecpu. Neither of these functions
have changed recently, so I'm guessing this corruption is coming from something
else.
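
One way to sanity-check the "trash pointer" reading: the faulting
instruction increments the word at %r13 + 0x288, and the %r13 value above
does not look like a valid amd64 address at all. The sketch below tests
whether an address is canonical, assuming 48-bit virtual addressing (an
assumption about the guest's paging mode). On x86-64 an access through a
non-canonical address raises a general protection fault instead of a page
fault, which would be consistent with the trap reported here. The function
names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of a base-plus-displacement operand like 0x288(%r13). */
uint64_t
effective_address(uint64_t base, uint64_t disp)
{
	return base + disp;
}

/*
 * An address is canonical when bits 63..(va_bits - 1) are all equal,
 * i.e. it is the sign extension of a va_bits-wide value.
 */
int
is_canonical(uint64_t va, unsigned va_bits)
{
	uint64_t top, ones;

	assert(va_bits == 48 || va_bits == 57);

	top = va >> (va_bits - 1);
	ones = (1ULL << (65 - va_bits)) - 1;
	return top == 0 || top == ones;
}
```

For the values in the dump, is_canonical(effective_address(
0x27b3d6c24c3fab80, 0x288), 48) is false, so the access cannot succeed no
matter what is mapped.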

 Anyone have ideas where to start looking? I suppose I could start bisecting,
but does anyone know of any changes that would affect this area?

 I can send dmesgs if needed, but these are pretty standard VMs, nothing fancy
configured in them. 4 CPUs, 8GB RAM, etc.

-ml



Re: Edgerouter 4 available for any OpenBSD dev that needs an octeon

2020-07-29 Thread Mike Larkin
On Tue, Jul 28, 2020 at 06:16:01PM -0700, Mike Larkin wrote:
> Someone (can't recall who) gave me an ER4. I found it while cleaning
> out my closet. Since I'm not active anymore, if any OpenBSD developer
> wants it, reach out to me privately and I'll see about sending it
> to you.
>
> Thanks.
>
> -ml
>

Thanks everyone, this is heading to an OpenBSD developer.

-ml



Edgerouter 4 available for any OpenBSD dev that needs an octeon

2020-07-28 Thread Mike Larkin
Someone (can't recall who) gave me an ER4. I found it while cleaning
out my closet. Since I'm not active anymore, if any OpenBSD developer
wants it, reach out to me privately and I'll see about sending it
to you.

Thanks.

-ml



Re: amd64: lapic: refactor lapic timer programming

2020-07-06 Thread Mike Larkin
On Fri, Jul 03, 2020 at 07:41:45PM -0500, Scott Cheloha wrote:
> Hi,
>
> I want to run the lapic timer in one-shot mode on amd64 as we do with
> other interrupt clocks on other platforms.  I aim to make the clock
> interrupt code MD where possible.
>
> However, nobody is going to test my MD clock interrupt work unless
> amd64 is ready to use it.  amd64 doesn't run in oneshot mode so there
> is preliminary work to do first.
>
> --
>
> Before we can run the lapic timer in one-shot mode we need to simplify
> the process of actually programming it.
>
> This patch refactors all lapic timer programming into a single
> routine.  We don't use any divisor other than 1 so I don't see a need
> to make it a parameter to lapic_timer_arm().  We can add TSC deadline
> support later if someone wants it.
>
> The way we program the timer differs from how e.g. Darwin and FreeBSD
> and Linux do it.  They write:
>
>  - lvtt (mode + vector + (maybe) mask)
>  - dcr
>  - icr
>
> while we do:
>
>  - lvtt (mode + mask)
>  - dcr
>  - icr
>  - (maybe) lvtt (mode + vector)
>
> I don't see a reason to arm the timer with four writes instead of
> three, so in this patch I use the three-write ordering.
>
> Am I missing something?  Do I need to disable interrupts before I
> reprogram the timer?
>

This reads ok to me. I am not aware of any requirements to disable
interrupts while reprogramming the timer.
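
To compare the two orderings side by side, here is a small sketch that
replays both against a mock register file and counts the writes. The
register indices, the vector value and the function names are
illustrative and are not the kernel's definitions; the mask and periodic
bits and the divide-by-1 encoding follow the usual local APIC layout.

```c
#include <assert.h>
#include <stdint.h>

enum { REG_LVTT, REG_DCR, REG_ICR, REG_COUNT };

#define LVTT_PERIODIC	0x00020000u	/* timer mode: periodic */
#define LVTT_MASKED	0x00010000u	/* interrupt masked */
#define TIMER_VECTOR	0xc0u		/* illustrative vector number */
#define DCR_DIV1	0x0bu		/* divide configuration: by 1 */

struct mock_lapic {
	uint32_t reg[REG_COUNT];
	unsigned writes;
};

static void
mock_write(struct mock_lapic *l, int reg, uint32_t val)
{
	assert(reg >= 0 && reg < REG_COUNT);
	l->reg[reg] = val;
	l->writes++;
}

/* Existing ordering: mask, program, then unmask with the vector. */
void
arm_four_writes(struct mock_lapic *l, uint32_t cycles)
{
	mock_write(l, REG_LVTT, LVTT_PERIODIC | LVTT_MASKED);
	mock_write(l, REG_DCR, DCR_DIV1);
	mock_write(l, REG_ICR, cycles);
	mock_write(l, REG_LVTT, LVTT_PERIODIC | TIMER_VECTOR);
}

/* Proposed ordering: mode, vector and optional mask in one LVTT write. */
void
arm_three_writes(struct mock_lapic *l, uint32_t mode, int masked,
    uint32_t cycles)
{
	uint32_t lvtt;

	lvtt = mode | TIMER_VECTOR;
	lvtt |= masked ? LVTT_MASKED : 0;

	mock_write(l, REG_LVTT, lvtt);
	mock_write(l, REG_DCR, DCR_DIV1);
	mock_write(l, REG_ICR, cycles);
}
```

Both sequences leave the three registers in the same final state for an
unmasked periodic timer. The difference is the intermediate window: the
four-write version keeps the interrupt masked until the count is loaded.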

-ml

> -Scott
>
> Index: lapic.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/lapic.c,v
> retrieving revision 1.55
> diff -u -p -r1.55 lapic.c
> --- lapic.c   3 Aug 2019 14:57:51 -   1.55
> +++ lapic.c   4 Jul 2020 00:40:26 -
> @@ -413,6 +413,42 @@ u_int32_t lapic_frac_usec_per_cycle;
>  u_int64_t lapic_frac_cycle_per_usec;
>  u_int32_t lapic_delaytab[26];
>
> +void lapic_timer_arm(uint32_t, int, uint32_t);
> +void lapic_timer_arm_once(int, uint32_t);
> +void lapic_timer_arm_period(int, uint32_t);
> +
> +/*
> + * Start the local apic countdown timer.
> + *
> + * First set the mode, vector, and (maybe) the mask.
> + * then set the divisor,
> + * and finally set the cycle count.
> + */
> +void
> +lapic_timer_arm(uint32_t mode, int masked, uint32_t cycles)
> +{
> + uint32_t lvtt;
> +
> + lvtt = mode | LAPIC_TIMER_VECTOR;
> + lvtt |= (masked) ? LAPIC_LVTT_M : 0;
> +
> + lapic_writereg(LAPIC_LVTT, lvtt);
> + lapic_writereg(LAPIC_DCR_TIMER, LAPIC_DCRT_DIV1);
> + lapic_writereg(LAPIC_ICR_TIMER, cycles);
> +}
> +
> +void
> +lapic_timer_arm_once(int masked, uint32_t cycles)
> +{
> + lapic_timer_arm(LAPIC_LVTT_TM_ONESHOT, masked, cycles);
> +}
> +
> +void
> +lapic_timer_arm_period(int masked, uint32_t cycles)
> +{
> + lapic_timer_arm(LAPIC_LVTT_TM_PERIODIC, masked, cycles);
> +}
> +
>  void
>  lapic_clockintr(void *arg, struct intrframe frame)
>  {
> @@ -430,17 +466,7 @@ lapic_clockintr(void *arg, struct intrfr
>  void
>  lapic_startclock(void)
>  {
> - /*
> -  * Start local apic countdown timer running, in repeated mode.
> -  *
> -  * Mask the clock interrupt and set mode,
> -  * then set divisor,
> -  * then unmask and set the vector.
> -  */
> - lapic_writereg(LAPIC_LVTT, LAPIC_LVTT_TM|LAPIC_LVTT_M);
> - lapic_writereg(LAPIC_DCR_TIMER, LAPIC_DCRT_DIV1);
> - lapic_writereg(LAPIC_ICR_TIMER, lapic_tval);
> - lapic_writereg(LAPIC_LVTT, LAPIC_LVTT_TM|LAPIC_TIMER_VECTOR);
> + lapic_timer_arm_period(0, lapic_tval);
>  }
>
>  void
> @@ -498,9 +524,7 @@ lapic_calibrate_timer(struct cpu_info *c
>* Configure timer to one-shot, interrupt masked,
>* large positive number.
>*/
> - lapic_writereg(LAPIC_LVTT, LAPIC_LVTT_M);
> - lapic_writereg(LAPIC_DCR_TIMER, LAPIC_DCRT_DIV1);
> - lapic_writereg(LAPIC_ICR_TIMER, 0x80000000);
> + lapic_timer_arm_once(1, 0x80000000);
>
>   s = intr_disable();
>
> @@ -540,10 +564,7 @@ skip_calibration:
>   lapic_tval = (lapic_per_second * 2) / hz;
>   lapic_tval = (lapic_tval / 2) + (lapic_tval & 0x1);
>
> - lapic_writereg(LAPIC_LVTT, LAPIC_LVTT_TM | LAPIC_LVTT_M |
> - LAPIC_TIMER_VECTOR);
> - lapic_writereg(LAPIC_DCR_TIMER, LAPIC_DCRT_DIV1);
> - lapic_writereg(LAPIC_ICR_TIMER, lapic_tval);
> + lapic_timer_arm_period(0, lapic_tval);
>
>   /*
>* Compute fixed-point ratios between cycles and
>



Re: 11n Tx aggregation for iwm(4)

2020-06-26 Thread Mike Larkin
On Fri, Jun 26, 2020 at 09:01:03PM -0700, Mike Larkin wrote:
> On Fri, Jun 26, 2020 at 02:45:53PM +0200, Stefan Sperling wrote:
> > This patch adds support for 11n Tx aggregation to iwm(4).
> >
> > Please help with testing if you can by running the patch and using wifi
> > as usual. Nothing should change, except that Tx speed may potentially
> > improve. If you have time to run before/after performance measurements with
> > tcpbench or such, that would be nice. But it's not required for testing.
> >
> > If Tx aggregation is active then netstat will show a non-zero output block ack
> > agreement counter:
> >
> > $ netstat -W iwm0 | grep 'output block'
> > 3 new output block ack agreements
> > 0 output block ack agreements timed out
> >
> > It would be great to get at least one test for all the chipsets the driver
> > supports: 7260, 7265, 3160, 3165, 3168, 8260, 8265, 9260, 9560
> > The behaviour of the access point also matters a great deal. It won't
> > hurt to test the same chipset against several different access points.
> >
> > I have tested this version on 8265 only so far. I've run older revisions
> > of this patch on 7265 so I'm confident that this chip will work, too.
> > So far, the APs I have tested against are athn(4) in 11a mode and in 11n
> > mode with the 'nomimo' nwflag, and a Sagemcom 11ac AP. All on 5Ghz channels.
>
> I tested this on my T490 Thinkpad:
>
> iwm0 at pci0 dev 20 function 3 "Intel Dual Band Wireless AC 9560" rev 0x30, msix
> iwm0: hw rev 0x310, fw ver 34.3125811985.0
>
> It ended up having a heck of a time connecting to anything, most/all
> connections ended up timing out or just taking a really long time to complete.
>
> I looked in dmesg, and found a stream of fatal firmware errors and other
> errors (see end of this email).
>
> My iwm-firmware was updated before I tried the new kernel:
>
> -innsmouth- ~> pkg_info iwm-firmware
> Information for inst:iwm-firmware-20191022p1
>
> Comment:
> firmware binary images for iwm(4) driver
>
> Description:
> Firmware binary images for use with the iwm(4) driver.
>
> Maintainer: The OpenBSD ports mailing-list 
>
> WWW: https://wireless.wiki.kernel.org/en/users/Drivers/iwlwifi
>

PS, I did see 5 new output block ack agreements when I was running the diff,
so apparently at least it is doing ... something?

-ml

>
>
> I still have the kernel around if you want me to test something else. There
> is nothing in this tree except this Txagg diff. LMK if you need any more
> info.
>
> OpenBSD 6.7-current (GENERIC.MP) #1: Fri Jun 26 14:01:06 PDT 2020
> 
> mlar...@innsmouth.int.azathoth.net:/u/bin/src/OpenBSD/openbsd/sys/arch/amd64/compile/GENERIC.MP
> real mem = 51260506112 (48885MB)
> avail mem = 49691906048 (47389MB)
> random: good seed from bootblocks
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x604f5000 (67 entries)
> bios0: vendor LENOVO version "N2IET61W (1.39 )" date 05/16/2019
> bios0: LENOVO 20N20046US
> acpi0 at bios0: ACPI 6.1
> acpi0: sleep states S0 S3 S4 S5
> acpi0: tables DSDT FACP SSDT SSDT SSDT SSDT UEFI SSDT HPET APIC MCFG ECDT 
> SSDT SSDT BOOT SLIC SSDT LPIT WSMT SSDT DBGP DBG2 MSDM BATB DMAR NHLT ASF! 
> FPDT UEFI
> acpi0: wakeup devices GLAN(S4) XHC_(S3) XDCI(S4) HDAS(S4) RP01(S4) PXSX(S4) 
> RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) RP04(S4) PXSX(S4) RP05(S4) PXSX(S4) 
> RP06(S4) PXSX(S4) [...]
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpihpet0 at acpi0: 2399 Hz
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, 1586.72 MHz, 06-8e-0c
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
> cpu0: 256KB 64b/line 8-way L2 cache
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
> cpu0: apic clock running at 24MHz
> cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4.1.1.1, IBE
> cpu1 at mainbus0: apid 2 (application processor)
> cpu1: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, 1333.05 MHz, 06-8e-0c
> cpu1: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,

Re: 11n Tx aggregation for iwm(4)

2020-06-26 Thread Mike Larkin
On Fri, Jun 26, 2020 at 02:45:53PM +0200, Stefan Sperling wrote:
> This patch adds support for 11n Tx aggregation to iwm(4).
>
> Please help with testing if you can by running the patch and using wifi
> as usual. Nothing should change, except that Tx speed may potentially
> improve. If you have time to run before/after performance measurements with
> tcpbench or such, that would be nice. But it's not required for testing.
>
> If Tx aggregation is active then netstat will show a non-zero output block ack
> agreement counter:
>
> $ netstat -W iwm0 | grep 'output block'
> 3 new output block ack agreements
>   0 output block ack agreements timed out
>
> It would be great to get at least one test for all the chipsets the driver
> supports: 7260, 7265, 3160, 3165, 3168, 8260, 8265, 9260, 9560
> The behaviour of the access point also matters a great deal. It won't
> hurt to test the same chipset against several different access points.
>
> I have tested this version on 8265 only so far. I've run older revisions
> of this patch on 7265 so I'm confident that this chip will work, too.
> So far, the APs I have tested against are athn(4) in 11a mode and in 11n
> mode with the 'nomimo' nwflag, and a Sagemcom 11ac AP. All on 5Ghz channels.

I tested this on my T490 Thinkpad:

iwm0 at pci0 dev 20 function 3 "Intel Dual Band Wireless AC 9560" rev 0x30, msix
iwm0: hw rev 0x310, fw ver 34.3125811985.0

It ended up having a heck of a time connecting to anything, most/all
connections ended up timing out or just taking a really long time to complete.

I looked in dmesg, and found a stream of fatal firmware errors and other
errors (see end of this email).

My iwm-firmware was updated before I tried the new kernel:

-innsmouth- ~> pkg_info iwm-firmware
Information for inst:iwm-firmware-20191022p1

Comment:
firmware binary images for iwm(4) driver

Description:
Firmware binary images for use with the iwm(4) driver.

Maintainer: The OpenBSD ports mailing-list 

WWW: https://wireless.wiki.kernel.org/en/users/Drivers/iwlwifi



I still have the kernel around if you want me to test something else. There
is nothing in this tree except this Txagg diff. LMK if you need any more
info.

OpenBSD 6.7-current (GENERIC.MP) #1: Fri Jun 26 14:01:06 PDT 2020

mlar...@innsmouth.int.azathoth.net:/u/bin/src/OpenBSD/openbsd/sys/arch/amd64/compile/GENERIC.MP
real mem = 51260506112 (48885MB)
avail mem = 49691906048 (47389MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x604f5000 (67 entries)
bios0: vendor LENOVO version "N2IET61W (1.39 )" date 05/16/2019
bios0: LENOVO 20N20046US
acpi0 at bios0: ACPI 6.1
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SSDT SSDT SSDT SSDT UEFI SSDT HPET APIC MCFG ECDT SSDT 
SSDT BOOT SLIC SSDT LPIT WSMT SSDT DBGP DBG2 MSDM BATB DMAR NHLT ASF! FPDT UEFI
acpi0: wakeup devices GLAN(S4) XHC_(S3) XDCI(S4) HDAS(S4) RP01(S4) PXSX(S4) 
RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) RP04(S4) PXSX(S4) RP05(S4) PXSX(S4) 
RP06(S4) PXSX(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, 1586.72 MHz, 06-8e-0c
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4.1.1.1, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, 1333.05 MHz, 06-8e-0c
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, 1125.81 MHz, 06-8e-0c
cpu2: 

Re: vmm(4): unterminated vm_name after strncpy

2020-03-15 Thread Mike Larkin
On Thu, Mar 12, 2020 at 10:31:13PM +0100, Tobias Heider wrote:
> vmm uses 'strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN)' to copy
> to buffers of size VMM_MAX_NAME_LEN, which can leave the resulting string
> unterminated.
> From strncpy(3):
>   strncpy() only NUL terminates the destination string when the length of
>   the source string is less than the length parameter.
> 
> I propose replacing it with 'strlcpy' which does the right thing and
> only copies up to dstsize - 1 characters.
> 
> ok?
> 

good find. Thanks!
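
For anyone following along, the difference is easy to reproduce in
userland. The sketch below is an illustration only: NAME_LEN is a
hypothetical stand-in for VMM_MAX_NAME_LEN, and bounded_copy() is a small
portable helper that mimics the strlcpy semantics described above (it is
not the libc strlcpy itself).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 8	/* hypothetical stand-in for VMM_MAX_NAME_LEN */

/* Returns 1 if buf[0..len-1] contains a NUL byte, 0 otherwise. */
static int
is_terminated(const char *buf, size_t len)
{
	return memchr(buf, '\0', len) != NULL;
}

/*
 * Copy at most dstsize - 1 bytes of src into dst and always NUL
 * terminate when dstsize is non-zero. Returns strlen(src) so the
 * caller can detect truncation (return value >= dstsize).
 */
static size_t
bounded_copy(char *dst, const char *src, size_t dstsize)
{
	size_t srclen, n;

	assert(dst != NULL && src != NULL);
	srclen = strlen(src);
	if (dstsize != 0) {
		n = srclen < dstsize - 1 ? srclen : dstsize - 1;
		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return srclen;
}
```

With an 8-byte source name and an 8-byte buffer, strncpy(3) fills all eight
bytes and leaves no room for the terminator, while the bounded copy stores
seven bytes plus a NUL and reports the truncation through its return value.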

> CID 1453255
> 
> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /mount/openbsd/cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.266
> diff -u -p -r1.266 vmm.c
> --- sys/arch/amd64/amd64/vmm.c11 Mar 2020 16:38:42 -  1.266
> +++ sys/arch/amd64/amd64/vmm.c12 Mar 2020 21:15:01 -
> @@ -1167,7 +1167,7 @@ vm_create(struct vm_create_params *vcp, 
>   memcpy(vm->vm_memranges, vcp->vcp_memranges,
>   vm->vm_nmemranges * sizeof(vm->vm_memranges[0]));
>   vm->vm_memory_size = memsize;
> - strncpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN);
> + strlcpy(vm->vm_name, vcp->vcp_name, VMM_MAX_NAME_LEN);
>  
>   rw_enter_write(&vmm_softc->vm_lock);
>  
> @@ -3718,7 +3718,7 @@ vm_get_info(struct vm_info_params *vip)
>   out[i].vir_ncpus = vm->vm_vcpu_ct;
>   out[i].vir_id = vm->vm_id;
>   out[i].vir_creator_pid = vm->vm_creator_pid;
> - strncpy(out[i].vir_name, vm->vm_name, VMM_MAX_NAME_LEN);
> + strlcpy(out[i].vir_name, vm->vm_name, VMM_MAX_NAME_LEN);
>   rw_enter_read(&vm->vm_vcpu_lock);
>   for (j = 0; j < vm->vm_vcpu_ct; j++) {
>   out[i].vir_vcpu_state[j] = VCPU_STATE_UNKNOWN;
> 



Re: [PATCH] Fixing an uninitialized variable that can lead to #GP.

2020-02-09 Thread Mike Larkin
On Sun, Feb 09, 2020 at 06:17:47PM -0800, Anthony Steinhauser wrote:
> In the current implementation of the TAA mitigation if the cpuid_level
> is 6 and it's an Intel CPU, the sefflags_edx variable is used without
> being initialized. If the SEFF0EDX_ARCH_CAP bit is accidentally flipped
> in it, the rdmsr on the unimplemented MSR_ARCH_CAPABILITIES index leads
> to a #GP fault.
> 
> This change initializes the sefflags_edx variable to 0 which is
> consistent with the MSR_ARCH_CAPABILITIES being unavailable.
> ---
>  sys/arch/amd64/amd64/cpu.c | 2 +-
>  sys/arch/i386/i386/cpu.c   | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/sys/arch/amd64/amd64/cpu.c b/sys/arch/amd64/amd64/cpu.c
> index 48ab6b5e7f3..f9beff0d5e3 100644
> --- a/sys/arch/amd64/amd64/cpu.c
> +++ b/sys/arch/amd64/amd64/cpu.c
> @@ -1164,7 +1164,7 @@ void
>  cpu_tsx_disable(struct cpu_info *ci)
>  {
>   uint64_t msr;
> - uint32_t dummy, sefflags_edx;
> + uint32_t dummy, sefflags_edx = 0;
>  
>   /* this runs before identifycpu() populates ci_feature_sefflags_edx */
>   if (cpuid_level >= 0x07)
> diff --git a/sys/arch/i386/i386/cpu.c b/sys/arch/i386/i386/cpu.c
> index b31a431c594..76f1b65bede 100644
> --- a/sys/arch/i386/i386/cpu.c
> +++ b/sys/arch/i386/i386/cpu.c
> @@ -473,7 +473,7 @@ void
>  cpu_tsx_disable(struct cpu_info *ci)
>  {
>   uint64_t msr;
> - uint32_t dummy, sefflags_edx;
> + uint32_t dummy, sefflags_edx = 0;
>  
>   /* this runs before identifycpu() populates ci_feature_sefflags_edx */
>   if (cpuid_level >= 0x07)
> -- 
> 2.25.0.341.g760bfbb309-goog
> 

Probably safer to use rdmsr_safe for this sort of thing also.

-ml



Re: Add mprotect_ept ioctl to vmm(4)

2020-02-07 Thread Mike Larkin
On Fri, Feb 07, 2020 at 01:25:38PM -0800, Mike Larkin wrote:
> On Fri, Feb 07, 2020 at 04:20:16AM +, Adam Steen wrote:
> > Hi
> > 
> > Please see the attached patch to add an 'IOCTL handler to sets the access
> > protections of the ept'
> > 
> > vmd(8) does not make use of this change, but solo5, which uses vmm(4) as
> > a backend hypervisor, does. The code calling 'VMM_IOC_MPROTECT_EPT' is
> > available here:
> > https://github.com/Solo5/solo5/compare/master...adamsteen:wnox
> > 
> > There are changes to vmd too, but these are just for completeness:
> > if mprotect ept is called in the future, we would want the vm to be
> > stopped if we get a protection fault.
> > 
> > I was unsure what to do if called with execute-only permissions on a CPU that
> > does not support it. I went with adding read permissions and logging the
> > fact, instead of returning EINVAL.
> > 
> > Cheers
> > Adam
> > 
> 
> I have been giving Adam feedback on this diff for a while. There are a few
> minor comments below, but I think this is ok if someone wants to commit it after
> the fixes below are incorporated.
> 
> -ml
> 

See updated comment below.

-ml

> > ? div
> > Index: sys/arch/amd64/amd64/vmm.c
> > ===
> > RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> > retrieving revision 1.258
> > diff -u -p -u -p -r1.258 vmm.c
> > --- sys/arch/amd64/amd64/vmm.c  31 Jan 2020 01:51:27 -  1.258
> > +++ sys/arch/amd64/amd64/vmm.c  7 Feb 2020 03:15:16 -
> > @@ -124,6 +124,7 @@ int vm_get_info(struct vm_info_params *)
> >  int vm_resetcpu(struct vm_resetcpu_params *);
> >  int vm_intr_pending(struct vm_intr_params *);
> >  int vm_rwregs(struct vm_rwregs_params *, int);
> > +int vm_mprotect_ept(struct vm_mprotect_ept_params *);
> >  int vm_rwvmparams(struct vm_rwvmparams_params *, int);
> >  int vm_find(uint32_t, struct vm **);
> >  int vcpu_readregs_vmx(struct vcpu *, uint64_t, struct vcpu_reg_state *);
> > @@ -186,6 +187,8 @@ int svm_fault_page(struct vcpu *, paddr_
> >  int vmx_fault_page(struct vcpu *, paddr_t);
> >  int vmx_handle_np_fault(struct vcpu *);
> >  int svm_handle_np_fault(struct vcpu *);
> > +int vmx_mprotect_ept(vm_map_t, paddr_t, paddr_t, int);
> > +pt_entry_t *vmx_pmap_find_pte_ept(pmap_t, paddr_t);
> >  int vmm_alloc_vpid(uint16_t *);
> >  void vmm_free_vpid(uint16_t);
> >  const char *vcpu_state_decode(u_int);
> > @@ -493,6 +496,9 @@ vmmioctl(dev_t dev, u_long cmd, caddr_t 
> > case VMM_IOC_WRITEREGS:
> > ret = vm_rwregs((struct vm_rwregs_params *)data, 1);
> > break;
> > +   case VMM_IOC_MPROTECT_EPT:
> > +   ret = vm_mprotect_ept((struct vm_mprotect_ept_params *)data);
> > +   break;
> > case VMM_IOC_READVMPARAMS:
> > ret = vm_rwvmparams((struct vm_rwvmparams_params *)data, 0);
> > break;
> > @@ -531,6 +537,7 @@ pledge_ioctl_vmm(struct proc *p, long co
> > case VMM_IOC_INTR:
> > case VMM_IOC_READREGS:
> > case VMM_IOC_WRITEREGS:
> > +   case VMM_IOC_MPROTECT_EPT:
> > case VMM_IOC_READVMPARAMS:
> > case VMM_IOC_WRITEVMPARAMS:
> > return (0);
> > @@ -806,6 +813,288 @@ vm_rwregs(struct vm_rwregs_params *vrwp,
> >  }
> >  
> >  /*
> > + * vm_mprotect_ept
> > + *
> > + * IOCTL handler to sets the access protections of the ept
> > + *
> > + * Parameters:
> > + *   vmep: decribes the memory for which the protect will be applied..
> > + *
> > + * Return values:
> > + *  0: if successful
> > + *  ENOENT: if the VM defined by 'vmep' cannot be found
> > + *  EINVAL: if the sgpa or size is not page aligned, the prot is invalid,
> > + *  size is too large (512GB), there is wraparound
> > + *  (like start = 512GB-1 and end = 512GB-2),
> > + *  the address specified is not within the vm's mem range
> > + *  or the address lies inside reserved (MMIO) memory
> > + */
> > +int
> > +vm_mprotect_ept(struct vm_mprotect_ept_params *vmep)
> > +{
> > +   struct vm *vm;
> > +   struct vcpu *vcpu;
> > +   vaddr_t sgpa;
> > +   size_t size;
> > +   vm_prot_t prot;
> > +   uint64_t msr;
> > +   int ret, memtype;
> > +
> > +   /* If not EPT or RVI, nothing to do here */
> > +   if (!(vmm_softc->mode == VMM_MODE_EPT
> > +   || vmm_softc->mode == VMM_MODE_RVI))
> > +   return (0);
>

vmm(4): wrong comment

2020-02-07 Thread Mike Larkin
Free commit for someone. Noticed last night by my student team that is working
on vmm(4) virtio memory ballooning support as we were adding the viomb(4)
stats queue.

-ml


Index: vmm.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
retrieving revision 1.257
diff -u -p -a -u -r1.257 vmm.c
--- vmm.c   13 Dec 2019 03:38:15 -  1.257
+++ vmm.c   7 Feb 2020 21:27:46 -
@@ -3666,7 +3666,6 @@ vcpu_vmx_compute_ctrl(uint64_t ctrlval, 
 /*
  * vm_get_info
  *
- * Returns information about the VM indicated by 'vip'.
  * Returns information about the VM indicated by 'vip'. The 'vip_size' field
  * in the 'vip' parameter is used to indicate the size of the caller's buffer.
  * If insufficient space exists in that buffer, the required size needed is



Re: Add mprotect_ept ioctl to vmm(4)

2020-02-07 Thread Mike Larkin
On Fri, Feb 07, 2020 at 04:20:16AM +, Adam Steen wrote:
> Hi
> 
> Please see the attached patch to add an 'IOCTL handler to sets the access
> protections of the ept'
> 
> vmd(8) does not make use of this change, but solo5, which uses vmm(4) as
> a backend hypervisor, does. The code calling 'VMM_IOC_MPROTECT_EPT' is
> available here https://github.com/Solo5/solo5/compare/master...adamsteen:wnox
> 
> There are changes to vmd too, but these are just for completeness:
> if mprotect ept is called in the future, we would want the vm to be
> stopped if we get a protection fault.
> 
> I was unsure what to do if called with execute-only permissions on a CPU that
> does not support it. I went with adding read permissions and logging the
> fact, instead of returning EINVAL.
> 
> Cheers
> Adam
> 

I have been giving Adam feedback on this diff for a while. There are a few
minor comments below, but I think this is ok if someone wants to commit it after
the fixes below are incorporated.
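
As a reading aid for the EINVAL cases listed in the comment block of the
quoted diff (page alignment, the 512GB size cap, wraparound), here is a
minimal userland sketch of that kind of range check. It is not the code from
the diff: check_range(), the SIM_* constants and the exact comparison used
for the size cap are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SIM_PAGE_SIZE	4096ULL
#define SIM_PAGE_MASK	(SIM_PAGE_SIZE - 1)
#define SIM_MAX_SIZE	(512ULL << 30)	/* 512GB, assumed exclusive limit */

/* Returns 0 if [sgpa, sgpa + size) passes the basic checks, else EINVAL. */
static int
check_range(uint64_t sgpa, uint64_t size)
{
	/* Start address and size must both be page aligned. */
	if ((sgpa & SIM_PAGE_MASK) != 0 || (size & SIM_PAGE_MASK) != 0)
		return EINVAL;

	/* Reject ranges at or above the size cap. */
	if (size >= SIM_MAX_SIZE)
		return EINVAL;

	/* Unsigned overflow means the end wrapped around below the start. */
	if (sgpa + size < sgpa)
		return EINVAL;

	return 0;
}
```

The checks against the VM's memory ranges and the reserved (MMIO) region
need the vm structure and are left out here.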

-ml

> ? div
> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.258
> diff -u -p -u -p -r1.258 vmm.c
> --- sys/arch/amd64/amd64/vmm.c31 Jan 2020 01:51:27 -  1.258
> +++ sys/arch/amd64/amd64/vmm.c7 Feb 2020 03:15:16 -
> @@ -124,6 +124,7 @@ int vm_get_info(struct vm_info_params *)
>  int vm_resetcpu(struct vm_resetcpu_params *);
>  int vm_intr_pending(struct vm_intr_params *);
>  int vm_rwregs(struct vm_rwregs_params *, int);
> +int vm_mprotect_ept(struct vm_mprotect_ept_params *);
>  int vm_rwvmparams(struct vm_rwvmparams_params *, int);
>  int vm_find(uint32_t, struct vm **);
>  int vcpu_readregs_vmx(struct vcpu *, uint64_t, struct vcpu_reg_state *);
> @@ -186,6 +187,8 @@ int svm_fault_page(struct vcpu *, paddr_
>  int vmx_fault_page(struct vcpu *, paddr_t);
>  int vmx_handle_np_fault(struct vcpu *);
>  int svm_handle_np_fault(struct vcpu *);
> +int vmx_mprotect_ept(vm_map_t, paddr_t, paddr_t, int);
> +pt_entry_t *vmx_pmap_find_pte_ept(pmap_t, paddr_t);
>  int vmm_alloc_vpid(uint16_t *);
>  void vmm_free_vpid(uint16_t);
>  const char *vcpu_state_decode(u_int);
> @@ -493,6 +496,9 @@ vmmioctl(dev_t dev, u_long cmd, caddr_t 
>   case VMM_IOC_WRITEREGS:
>   ret = vm_rwregs((struct vm_rwregs_params *)data, 1);
>   break;
> + case VMM_IOC_MPROTECT_EPT:
> + ret = vm_mprotect_ept((struct vm_mprotect_ept_params *)data);
> + break;
>   case VMM_IOC_READVMPARAMS:
>   ret = vm_rwvmparams((struct vm_rwvmparams_params *)data, 0);
>   break;
> @@ -531,6 +537,7 @@ pledge_ioctl_vmm(struct proc *p, long co
>   case VMM_IOC_INTR:
>   case VMM_IOC_READREGS:
>   case VMM_IOC_WRITEREGS:
> + case VMM_IOC_MPROTECT_EPT:
>   case VMM_IOC_READVMPARAMS:
>   case VMM_IOC_WRITEVMPARAMS:
>   return (0);
> @@ -806,6 +813,288 @@ vm_rwregs(struct vm_rwregs_params *vrwp,
>  }
>  
>  /*
> + * vm_mprotect_ept
> + *
> + * IOCTL handler to sets the access protections of the ept
> + *
> + * Parameters:
> + *   vmep: decribes the memory for which the protect will be applied..
> + *
> + * Return values:
> + *  0: if successful
> + *  ENOENT: if the VM defined by 'vmep' cannot be found
> + *  EINVAL: if the sgpa or size is not page aligned, the prot is invalid,
> + *  size is too large (512GB), there is wraparound
> + *  (like start = 512GB-1 and end = 512GB-2),
> + *  the address specified is not within the vm's mem range
> + *  or the address lies inside reserved (MMIO) memory
> + */
> +int
> +vm_mprotect_ept(struct vm_mprotect_ept_params *vmep)
> +{
> + struct vm *vm;
> + struct vcpu *vcpu;
> + vaddr_t sgpa;
> + size_t size;
> + vm_prot_t prot;
> + uint64_t msr;
> + int ret, memtype;
> +
> + /* If not EPT or RVI, nothing to do here */
> + if (!(vmm_softc->mode == VMM_MODE_EPT
> + || vmm_softc->mode == VMM_MODE_RVI))
> + return (0);
> +
> + /* Find the desired VM */
> + rw_enter_read(&vmm_softc->vm_lock);
> + ret = vm_find(vmep->vmep_vm_id, &vm);
> + rw_exit_read(&vmm_softc->vm_lock);
> +
> + /* Not found? exit. */
> + if (ret != 0) {
> + DPRINTF("%s: vm id %u not found\n", __func__,
> + vmep->vmep_vm_id);
> + return (ret);
> + }
> +
> + rw_enter_read(&vm->vm_vcpu_lock);
> + SLIST_FOREACH(vcpu, &vm->vm_vcpu_list, vc_vcpu_link) {
> + if (vcpu->vc_id == vmep->vmep_vcpu_id)
> + break;
> + }
> + rw_exit_read(&vm->vm_vcpu_lock);
> +
> + if (vcpu == NULL) {
> + DPRINTF("%s: vcpu id %u of vm %u not found\n", __func__,
> + vmep->vmep_vcpu_id, vmep->vmep_vm_id);
> + return (ENOENT);
> + }
> +
> + if (vcpu->vc_state != VCPU_STATE_STOPPED) {
> + DPRINTF("%s: mprotect_ept %u on 

Re: vmm(4) patch - iniatialise eptp to zero for vmx like svm

2020-02-06 Thread Mike Larkin
On Thu, Feb 06, 2020 at 01:05:01AM -0800, Mike Larkin wrote:
> On Thu, Feb 06, 2020 at 02:34:47AM +, Adam Steen wrote:
> > Hi
> > 
> > Again while working on a larger patch I noticed that the eptp for vmx
> > was not getting initialised to zero like the svm code path, as part of
> > a VMM_IOC_RESETCPU ioctl call.
> > 
> > Please see the attached patch to initialise eptp to zero.
> > 
> > cheers
> > Adam
> > 
> > ? div
> > Index: sys/arch/amd64/amd64/vmm.c
> > ===
> > RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> > retrieving revision 1.258
> > diff -u -p -u -p -r1.258 vmm.c
> > --- sys/arch/amd64/amd64/vmm.c  31 Jan 2020 01:51:27 -  1.258
> > +++ sys/arch/amd64/amd64/vmm.c  6 Feb 2020 02:18:30 -
> > @@ -2895,6 +2895,8 @@ vcpu_reset_regs_vmx(struct vcpu *vcpu, s
> > /* xcr0 power on default sets bit 0 (x87 state) */
> > vcpu->vc_gueststate.vg_xcr0 = XCR0_X87 & xsave_mask;
> >  
> > +   vcpu->vc_parent->vm_map->pmap->eptp = 0;
> > +
> >  exit:
> > /* Flush the VMCS */
> > if (vmclear(&vcpu->vc_control_pa)) {
> > 
> > 
> 
> I do not believe this is what you want to do.
> 
> The SVM path *should* reset the eptp to 0, since there is no EPTP in RVI based
> scenarios.
> 
> If you reset the eptp to 0 in an EPT environment, you'll lose the VPID and
> the PA of the EPT itself (which, if you look earlier in that function, is 
> properly initialized; you're going to be whacking it back to 0 here):
> 
> Around line 2620:
> 
> ...
> DPRINTF("Guest EPTP = 0x%llx\n", eptp);
> if (vmwrite(VMCS_GUEST_IA32_EPTP, eptp)) {
> DPRINTF("%s: error setting guest EPTP\n", __func__);
> ret = EINVAL;
> goto exit;
> }
> 
> vcpu->vc_parent->vm_map->pmap->eptp = eptp;
> ...
> 
> 
> -ml
> 

PS -

Note that although the EPTP is written to the VMCS in the previous code, we do
use the cached eptp value in the VM's pmap to do EPT TLB flushes. If you clear
it like this, you won't get the proper behaviour (note, the behaviour today
isn't actually 100% correct to begin with, I have a diff for that but it is
growing out of control it seems...)
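
To make the point about the cached value concrete, here is a small sketch of
how an EPTP packs the physical address of the top-level EPT table together
with its control bits, following the layout in the Intel SDM as I understand
it. The SIM_* constants and helper names are hypothetical and are not the
ones used in vmm.c.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_EPTP_MEMTYPE_WB	6ULL		/* bits 2:0, write-back */
#define SIM_EPTP_WALK_4		(3ULL << 3)	/* bits 5:3, walk length - 1 */
#define SIM_EPTP_PA_MASK	0x000ffffffffff000ULL

/* Combine the page-aligned PA of the top-level table with control bits. */
static uint64_t
sim_make_eptp(uint64_t pml4_pa)
{
	assert((pml4_pa & 0xfffULL) == 0);
	return pml4_pa | SIM_EPTP_WALK_4 | SIM_EPTP_MEMTYPE_WB;
}

/* Recover the table address, e.g. for a code path that flushes by EPTP. */
static uint64_t
sim_eptp_to_pa(uint64_t eptp)
{
	return eptp & SIM_EPTP_PA_MASK;
}
```

Once the cached copy is zeroed, anything that later consumes it sees a table
address of 0 instead of the address that was written to the VMCS.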



Re: vmm(4) patch - iniatialise eptp to zero for vmx like svm

2020-02-06 Thread Mike Larkin
On Thu, Feb 06, 2020 at 02:34:47AM +, Adam Steen wrote:
> Hi
> 
> Again while working on a larger patch I noticed that the eptp for vmx
> was not getting initialised to zero like the svm code path, as part of
> a VMM_IOC_RESETCPU ioctl call.
> 
> Please see the attached patch to initialise eptp to zero.
> 
> cheers
> Adam
> 
> ? div
> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.258
> diff -u -p -u -p -r1.258 vmm.c
> --- sys/arch/amd64/amd64/vmm.c31 Jan 2020 01:51:27 -  1.258
> +++ sys/arch/amd64/amd64/vmm.c6 Feb 2020 02:18:30 -
> @@ -2895,6 +2895,8 @@ vcpu_reset_regs_vmx(struct vcpu *vcpu, s
>   /* xcr0 power on default sets bit 0 (x87 state) */
>   vcpu->vc_gueststate.vg_xcr0 = XCR0_X87 & xsave_mask;
>  
> + vcpu->vc_parent->vm_map->pmap->eptp = 0;
> +
>  exit:
>   /* Flush the VMCS */
>   if (vmclear(&vcpu->vc_control_pa)) {
> 
> 

I do not believe this is what you want to do.

The SVM path *should* reset the eptp to 0, since there is no EPTP in RVI based
scenarios.

If you reset the eptp to 0 in an EPT environment, you'll lose the VPID and
the PA of the EPT itself (which, if you look earlier in that function, is 
properly initialized; you're going to be whacking it back to 0 here):

Around line 2620:

...
DPRINTF("Guest EPTP = 0x%llx\n", eptp);
if (vmwrite(VMCS_GUEST_IA32_EPTP, eptp)) {
DPRINTF("%s: error setting guest EPTP\n", __func__);
ret = EINVAL;
goto exit;
}

vcpu->vc_parent->vm_map->pmap->eptp = eptp;
...


-ml



Re: em(4) diff to test

2020-01-31 Thread Mike Larkin
On Thu, Jan 30, 2020 at 09:15:35AM +0100, Martin Pieuchot wrote:
> On 21/01/20(Tue) 12:31, Martin Pieuchot wrote:
> > On 20/01/20(Mon) 16:42, Martin Pieuchot wrote:
> > > Diff below is a refactoring of the actual em(4) code and defines that
> > > will allow me to present a shorter diff to interrupt multiple CPUs and
> > > make use of multiple queues.
> > > 
> > > It contains the following items:
> > > 
> > >   - Abstract the allocation/freeing of TX/RX ring into em_dma_malloc().
> > > This will ease the introduction of multiple rings.
> > > 
> > >   - Split the 82576 variant out of 82575.  The distinction is necessary
> > > when it comes to setting multiple queues.
> > > 

I see in the diff where this is being done, but is the multi-queue part not
in this diff? Just looks like you separated things, and left it at that?

> > >   - Change multiple TX/RX related macro to take an index argument
> > > corresponding to a ring.  Currently only the index 0 and 1 are used.
> > > 
> > >   - Gather and print more stats counters
> > > 
> > >   - Switch to using a function, like FreeBSD, to translate 82542
> > > registers and get rid of a set of defines.
> > > 
> > > It has been tested on the models below, I'd like to be sure there isn't
> > > any fallout with this part before continuing the effort.
> > 
> > New diff that works with 82576, previous breakage reported by Hrvoje
> > Popovski.  So far the following models have been tested, I'm looking for
> > more tests :o)
> 
> So far this has been tested on the following, I'm confident enough and
> would like to move forward, ok?
> 
>   em2 at pci0 dev 4 function 0 "Intel 82540EM" rev 0x02: apic 9 int 14
>   em1 at pci3 dev 1 function 0 "Intel 82545GM" rev 0x04: apic 4 int 0
>   em0 at pci3 dev 1 function 0 "Intel 82546GB" rev 0x03: apic 3 int 0
>   em3 at pci2 dev 0 function 0 "Intel 82571EB" rev 0x06: apic 0 int 16
>   em0 at pci1 dev 0 function 0 "Intel 82572EI" rev 0x06: apic 0 int 16
>   em2 at pci5 dev 0 function 0 "Intel 82573E" rev 0x03: msi
>   em3 at pci6 dev 0 function 0 "Intel 82573L" rev 0x00: msi
>   em0 at pci3 dev 0 function 0 "Intel 82574L" rev 0x00: msi
>   em0 at pci0 dev 2 function 0 "Intel 82575GB" rev 0x02: msi
>   em0 at pci1 dev 0 function 0 "Intel 82576" rev 0x01: msi
>   em0 at pci0 dev 25 function 0 "Intel 82577LM" rev 0x06: msi
>   em0 at pci0 dev 25 function 0 "Intel 82579LM" rev 0x04: msi
>   em0 at pci1 dev 0 function 0 "Intel 82583V" rev 0x00: msi
>   em0 at pci1 dev 0 function 0 "Intel I210" rev 0x03: msi
>   em0 at pci1 dev 0 function 0 "Intel I211" rev 0x03: msi
>   em0 at pci0 dev 25 function 0 "Intel I217-LM" rev 0x04: msi
>   em0 at pci0 dev 25 function 0 "Intel I218-V" rev 0x03: msi
>   em0 at pci0 dev 25 function 0 "Intel I218-LM" rev 0x04: msi

You can add this to the "works as intended" list (t490):

em0 at pci0 dev 31 function 6 "Intel I219-LM" rev 0x30: msi

Diff reads ok to me. So ok mlarkin@ if you are still looking for oks and didn't
commit it already.

-ml

>   em0 at pci0 dev 31 function 6 "Intel I219-V" rev 0x21: msi
>   em0 at pci7 dev 0 function 0 "Intel I350" rev 0x01: msi
> 
> 
> Index: pci/if_em.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> retrieving revision 1.343
> diff -u -p -r1.343 if_em.c
> --- pci/if_em.c   20 Jan 2020 23:45:02 -  1.343
> +++ pci/if_em.c   21 Jan 2020 11:27:14 -
> @@ -260,6 +260,7 @@ void em_disable_aspm(struct em_softc *);
>  void em_txeof(struct em_softc *);
>  int  em_allocate_receive_structures(struct em_softc *);
>  int  em_allocate_transmit_structures(struct em_softc *);
> +int  em_allocate_desc_rings(struct em_softc *);
>  int  em_rxfill(struct em_softc *);
>  void em_rxrefill(void *);
>  int  em_rxeof(struct em_softc *);
> @@ -344,11 +345,6 @@ em_defer_attach(struct device *self)
>  
>   em_free_pci_resources(sc);
>  
> - sc->sc_rx_desc_ring = NULL;
> - em_dma_free(sc, &sc->sc_rx_dma);
> - sc->sc_tx_desc_ring = NULL;
> - em_dma_free(sc, &sc->sc_tx_dma);
> -
>   return;
>   }
>   
> @@ -464,6 +460,7 @@ em_attach(struct device *parent, struct 
>   case em_82572:
>   case em_82574:
>   case em_82575:
> + case em_82576:
>   case em_82580:
>   case em_i210:
>   case em_i350:
> @@ -494,23 +491,11 @@ em_attach(struct device *parent, struct 
>   sc->hw.min_frame_size = 
>   ETHER_MIN_LEN + ETHER_CRC_LEN;
>  
> - /* Allocate Transmit Descriptor ring */
> - if (em_dma_malloc(sc, sc->sc_tx_slots * sizeof(struct em_tx_desc),
> - &sc->sc_tx_dma) != 0) {
> - printf("%s: Unable to allocate tx_desc memory\n", 
> -DEVNAME(sc));
> - goto err_tx_desc;
> - }
> - sc->sc_tx_desc_ring = (struct 

Re: Remove unused code from vmm

2020-01-30 Thread Mike Larkin
On Fri, Jan 31, 2020 at 01:40:14AM +, Adam Steen wrote:
>  Hi
> 
>  While working on a patch, I noticed that vmm_get_guest_faulttype was
>  incorrect for AMD (VMM_MODE_RVI) CPUs, and upon further inspection realised
>  it was unused. Please see the patch below to remove it.
> 
> cheers
> Adam
> 

Thanks, will remove.

-ml
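
For readers skimming the archive, a minimal stand-alone sketch of the problem the report describes: the removed dispatcher called the VMX helper for both modes. The enum, the stub functions and their return values are assumptions made for illustration only; they are not the vmm(4) definitions, and the real svm_get_guest_faulttype() takes a struct vmcb *.

```c
/* Hypothetical stand-ins for the vmm(4) mode constants. */
enum toy_mode { TOY_MODE_EPT, TOY_MODE_RVI, TOY_MODE_UNKNOWN };

/* Stubs that only identify which backend was called. */
static int toy_vmx_faulttype(void) { return 1; }
static int toy_svm_faulttype(void) { return 2; }

/* Mirrors the removed function: both branches call the VMX helper. */
static int
faulttype_as_removed(enum toy_mode mode)
{
	if (mode == TOY_MODE_EPT)
		return toy_vmx_faulttype();
	else if (mode == TOY_MODE_RVI)
		return toy_vmx_faulttype();	/* wrong backend for AMD */
	return -1;
}

/* What a per-mode dispatcher would have had to do instead. */
static int
faulttype_per_mode(enum toy_mode mode)
{
	if (mode == TOY_MODE_EPT)
		return toy_vmx_faulttype();
	else if (mode == TOY_MODE_RVI)
		return toy_svm_faulttype();
	return -1;
}
```

Since nothing called the dispatcher, removing it (as the patch does) avoids having to fix the second branch at all.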

> ? div
> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.257
> diff -u -p -u -p -r1.257 vmm.c
> --- sys/arch/amd64/amd64/vmm.c 13 Dec 2019 03:38:15 -  1.257
> +++ sys/arch/amd64/amd64/vmm.c 30 Jan 2020 06:47:41 -
> @@ -177,7 +177,6 @@ void vmx_handle_intr(struct vcpu *);
>  void vmx_handle_intwin(struct vcpu *);
>  void vmx_handle_misc_enable_msr(struct vcpu *);
>  int vmm_get_guest_memtype(struct vm *, paddr_t);
> -int vmm_get_guest_faulttype(void);
>  int vmx_get_guest_faulttype(void);
>  int svm_get_guest_faulttype(struct vmcb *);
>  int vmx_get_exit_qualification(uint64_t *);
> @@ -5073,23 +5072,6 @@ vmm_get_guest_memtype(struct vm *vm, pad
>  
>   DPRINTF("guest memtype @ 0x%llx unknown\n", (uint64_t)gpa);
>   return (VMM_MEM_TYPE_UNKNOWN);
> -}
> -
> -/*
> - * vmm_get_guest_faulttype
> - *
> - * Determines the type (R/W/X) of the last fault on the VCPU last run on
> - * this PCPU. Calls the appropriate architecture-specific subroutine.
> - */
> -int
> -vmm_get_guest_faulttype(void)
> -{
> - if (vmm_softc->mode == VMM_MODE_EPT)
> - return vmx_get_guest_faulttype();
> - else if (vmm_softc->mode == VMM_MODE_RVI)
> - return vmx_get_guest_faulttype();
> - else
> - panic("%s: unknown vmm mode: %d", __func__, vmm_softc->mode);
>  }
>  
>  /*
> 



Re: ldomctl: download: select new configuration

2020-01-06 Thread Mike Larkin
On Fri, Jan 03, 2020 at 08:27:21PM +0100, Mark Kettenis wrote:
> > Date: Mon, 30 Dec 2019 21:07:59 +0100
> > From: Klemens Nanni 
> > 
> > The example in the manual implies that the download command also selects
> > it:
> > 
> > # ldomctl init-system ldom.conf
> > # cd ..
> > # ldomctl delete openbsd
> > # ldomctl download openbsd
> > # ldomctl list
> > factory-default [current]
> > openbsd [next]
> > 
> > But `ldomctl select openbsd' is required between downloading and listing
> > to get this result - at least on my T4 machine `download' never selected
> > any configuration, however I vaguely remember that this was the case
> > with older machines.
> > 
> > kettenis: Has this really been missing all the time or could there be
> > differences in the mdstore protocol and/or firmware that cause this?
> > 
> > Diff below explicitly selects configuration in code.
> > 
> > Feedback? OK?
> 
> I can't remember.  Maybe someone with a t1k/t2k or t5120/t5140 can
> test this?
> 

Do you still need testing here? I could reinstall my t5240 if nobody has
tested yet...

-ml

> It makes sense for the download command not to immediately select a
> configuration.  So maybe just the documentation needs changing?
> 
> > Index: ldomctl.c
> > ===
> > RCS file: /cvs/src/usr.sbin/ldomctl/ldomctl.c,v
> > retrieving revision 1.31
> > diff -u -p -r1.31 ldomctl.c
> > --- ldomctl.c   28 Dec 2019 18:36:02 -  1.31
> > +++ ldomctl.c   30 Dec 2019 19:51:59 -
> > @@ -415,6 +415,7 @@ download(int argc, char **argv)
> > ds_conn_handle(dc);
> >  
> > mdstore_download(dc, argv[1]);
> > +   mdstore_select(dc, argv[1]);
> >  }
> >  
> >  void
> > 
> > 
> 



Re: vmctl: print root user in status owner field

2019-12-15 Thread Mike Larkin
On Sat, Dec 14, 2019 at 02:16:20AM +0100, Klemens Nanni wrote:
> With "owner root:wheel" (any group) the `vmctl status' output
> will omit the "root" part in the OWNER column:
> 
>   vm "generic" {
>   owner "root:vms"
>   ...
>   }
> 
>   $ vmctl status
>    ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
>     1     -     1    512M       -       -         :vms  stopped generic
> 
> It only omits it if the user is root, presumably to say "only the group
> matters".
> 
> I find this special case confusing as it looks incomplete, instead just
> print whatever is configured: 
> 
>   $ ./obj/vmctl status
>    ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
>     1     -     1    512M       -       -     root:vms  stopped generic
> 
> Feedback? OK?
> 
> 
> Index: vmctl.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/vmctl.c,v
> retrieving revision 1.72
> diff -u -p -r1.72 vmctl.c
> --- vmctl.c   12 Dec 2019 03:53:38 -  1.72
> +++ vmctl.c   14 Dec 2019 00:54:23 -
> @@ -768,8 +768,6 @@ print_vm_info(struct vmop_info_result *l
>   (void)strlcpy(user, name, sizeof(user));
>   /* get group name */
>   if (vmi->vir_gid != -1) {
> - if (vmi->vir_uid == 0)
> - *user = '\0';
>   name = group_from_gid(vmi->vir_gid, 1);
>   if (name == NULL)
>   (void)snprintf(group, sizeof(group),
> 

sure
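
A minimal sketch of the resulting behaviour, assuming the user and group names have already been resolved to strings. format_owner() is a hypothetical helper written for this note; it is not a function in vmctl.

```c
#include <stdio.h>

/*
 * Build the OWNER column text from already-resolved names.
 * 'user' or 'group' may be NULL when the corresponding id is unset.
 * No special case for root: whatever is configured is printed.
 */
static int
format_owner(char *buf, size_t len, const char *user, const char *group)
{
	if (group != NULL)
		return snprintf(buf, len, "%s:%s", user ? user : "", group);
	return snprintf(buf, len, "%s", user ? user : "");
}
```

With user "root" and group "vms" this yields "root:vms", matching the second status listing in the quoted mail.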



Re: vmm(4) question: unneeded vmclear() in vcpu_readregs_vmx()?

2019-12-12 Thread Mike Larkin
On Tue, Oct 22, 2019 at 06:57:32PM +0900, Iori YONEJI wrote:
> On Tue, Oct 22, 2019 at 11:17 AM Mike Larkin  wrote:
> >
> > On Mon, Oct 21, 2019 at 03:52:52AM +0900, Iori YONEJI wrote:
> > > Hello tech@,
> > >
> > > I have a question (or maybe a suggestion) about vmm(4).
> > >
> > > I'm writing a small additional feature to sys/arch/amd64/amd64/vmm.c
> > > and found a seemingly unneeded vmclear() at the end of
> > > vcpu_readregs_vmx(). This function didn't seem to affect VM state at
> > > first glance because, obviously, its name is _read_. Against this
> > > intuition, this function clears VM state, which makes the vmexit
> > > handling stop (eg. "advance rip" failure) and the VM die if it is used
> > > in there. Possibly this was added as a counterpart of
> > > vcpu_reload_vmcs_vmx() at the very beginning of the read function, but
> > > I don't think it does undo the reload.
> > >
> > > This vmclear can be removed in my understanding and also I confirmed
> > > that Alpine Linux and OpenBSD guest VMs can run on the vmm(4) without
> > > this vmclear call.
> > >
> > > Or I am just wrong and it may be actually needed in some corner cases.
> > > If so, I will restore (vmptrld) vcpu->vc_control_pa every time the VM
> > > context continues after calling readregs function or add another flag
> > > to indicate whether vmclear is needed here if all of the corner cases
> > > are known.
> > >
> > > Would you mind telling me whether I am correct, or how this should be
> > > dealt with?
> > >
> > >
> > > Index: sys/arch/amd64/amd64/vmm.c
> > > ===
> > > RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> > > retrieving revision 1.254
> > > diff -u -p -r1.254 vmm.c
> > > --- sys/arch/amd64/amd64/vmm.c 22 Sep 2019 08:47:54 -1.254
> > > +++ sys/arch/amd64/amd64/vmm.c 20 Oct 2019 18:04:17 -
> > > @@ -1602,8 +1602,10 @@ vcpu_readregs_vmx(struct vcpu *vcpu, uin
> > >  errout:
> > >  ret = EINVAL;
> > >  out:
> > > +/* XXX Why do we need vmclear here?
> > >  if (vmclear(&vcpu->vc_control_pa))
> > >  ret = EINVAL;
> > > +*/
> > >  return (ret);
> > >  }
> > >
> >
> > The assumption in vcpu_run_vmx is that we will always come in with a cleared
> > VMCS state, unless we are staying in the run loop (the "resume" flag in that
> > function tracks this). So any ioctl (including things like read/write regs)
> > that might possibly manipulate the VMCS, need to clear the VMCS before they
> > return to user space. That was the original reason for that clear at the end
> > of the function.
> >
> > If you are adding new functionality that uses read/write regs internally,
> > then you're right; on return, you'll have a cleared VMCS and subsequent
> > vmreads and vmwrites will fail.
> >
> > It looks like there are a couple of other users of readregs/writeregs
> > internally - the presently unused GVA translator function as well as the
> > in/out exit handler structure preparation code. Both of those look like they
> > got lucky in how they were handling things; they aren't touching the VMCS
> > after read/write regs or we would have seen them explode like you're saying.
> >
> > It looks like the start of vcpu_run_vmx *almost* properly handles coming
> > in with an arbitrary VMCS. The "almost" part is the bit under #ifdef
> > VMM_DEBUG that handles triple faults. It seems to assume that the proper VMCS
> > is currently loaded, which is certainly not true all the time. I'll fix that
> > part.
> >
> > Based on the fact that the other code paths look clean, I think we can
> > remove the vmclear from the end of read/write regs, as you suggest. I'll do
> > that as well.
> >
> > Might I ask what you're working on? Just in case I know of someone else
> > working on the same thing? (There are about a dozen vmm side-projects going
> > on that I'm aware of and it would be a waste for two people to be working on
> > the same thing without knowing of the other). Feel free to mail me off-list
> > if you don't want to share in public.
> >
> > -ml
> >
> Thank you. I didn't notice that the VMCS must be cleared before
> returning to user space.
> 
> Ta
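
The failure mode discussed in this thread can be modelled with a toy state machine. This is only a sketch under stated assumptions: a single flag stands in for "a VMCS is currently loaded", and every toy_* name is hypothetical, not a vmm(4) function.

```c
#include <stdbool.h>

/* Toy model: one flag tracks whether a VMCS is currently loaded. */
struct toy_vcpu {
	bool vmcs_loaded;
	unsigned long rip;
};

static void toy_vmptrld(struct toy_vcpu *v) { v->vmcs_loaded = true; }
static void toy_vmclear(struct toy_vcpu *v) { v->vmcs_loaded = false; }

/* A vmread only succeeds while the VMCS is loaded. */
static int
toy_vmread_rip(struct toy_vcpu *v, unsigned long *out)
{
	if (!v->vmcs_loaded)
		return -1;
	*out = v->rip;
	return 0;
}

/* readregs that reloads the VMCS and, optionally, clears it on exit. */
static int
toy_readregs(struct toy_vcpu *v, unsigned long *rip, bool clear_on_exit)
{
	int ret;

	toy_vmptrld(v);
	ret = toy_vmread_rip(v, rip);
	if (clear_on_exit)
		toy_vmclear(v);
	return ret;
}
```

Calling toy_readregs() with clear_on_exit set and then toy_vmread_rip() fails, which is the "advance rip" breakage reported for exit handlers; without the trailing clear the follow-up read succeeds.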

Re: [PATCH] staggered start of vms in vm.conf

2019-12-08 Thread Mike Larkin
On Sun, Dec 08, 2019 at 02:07:46AM -0800, Pratik Vyas wrote:
> Hi!
> 
> This is an attempt to address the 'thundering herd' problem when a lot of
> vms are configured in vm.conf.  A lot of vms booting in parallel can
> overload the host and also mess up tsc calibration in openbsd guests as
> it uses PIT which doesn't fire reliably if the host is overloaded.
> 
> 
> This diff makes vmd start vms in a staggered fashion with a default
> parallelism of the number of cpus on the host and a delay of 30s.  The
> default can be overridden with a line like the following in vm.conf
> 
> staggered start parallel 4 delay 30
> 
> 
> Every non-disabled vm starts in waiting state.  If you are eager to
> start a vm that is way further in the list, you can vmctl start it.
> 
> Discussed the idea with ori@, mlarkin@ and phessler@.
> 
> Comments / ok?
> 
> --
> Pratik
> 

See below. Other than the nits below, ok mlarkin when you are ready.

-ml
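
One way to picture the schedule, as a rough sketch only: it assumes VMs are released in batches of "parallel" with "delay" seconds between batches, which is an interpretation of the description above rather than the logic in the diff. stagger_offset() is a hypothetical helper.

```c
/*
 * Start offset in seconds for the VM at position 'index' (0-based),
 * assuming VMs are released in batches of 'parallel' every 'delay'
 * seconds.  Returns -1 for invalid arguments.
 */
static long
stagger_offset(int index, int parallel, long delay)
{
	if (index < 0 || parallel <= 0 || delay < 0)
		return -1;
	return (long)(index / parallel) * delay;
}
```

Under that assumption, "staggered start parallel 4 delay 30" would start VMs 0-3 immediately, VMs 4-7 after 30 seconds, and VMs 8-11 after 60 seconds.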

> Index: usr.sbin/vmctl/vmctl.c
> ===
> RCS file: /home/cvs/src/usr.sbin/vmctl/vmctl.c,v
> retrieving revision 1.71
> diff -u -p -a -u -r1.71 vmctl.c
> --- usr.sbin/vmctl/vmctl.c 7 Sep 2019 09:11:14 -   1.71
> +++ usr.sbin/vmctl/vmctl.c 8 Dec 2019 09:29:39 -
> @@ -716,6 +716,8 @@ vm_state(unsigned int mask)
> {
>   if (mask & VM_STATE_PAUSED)
>   return "paused";
> + else if (mask & VM_STATE_WAITING)
> + return "waiting";
>   else if (mask & VM_STATE_RUNNING)
>   return "running";
>   else if (mask & VM_STATE_SHUTDOWN)
> Index: usr.sbin/vmd/parse.y
> ===
> RCS file: /home/cvs/src/usr.sbin/vmd/parse.y,v
> retrieving revision 1.52
> diff -u -p -a -u -r1.52 parse.y
> --- usr.sbin/vmd/parse.y  14 May 2019 06:05:45 -  1.52
> +++ usr.sbin/vmd/parse.y  8 Dec 2019 09:29:39 -
> @@ -122,7 +122,8 @@ typedef struct {
> %token   INCLUDE ERROR
> %token   ADD ALLOW BOOT CDROM DEVICE DISABLE DISK DOWN ENABLE FORMAT GROUP
> %token   INET6 INSTANCE INTERFACE LLADDR LOCAL LOCKED MEMORY NET NIFS OWNER
> -%token   PATH PREFIX RDOMAIN SIZE SOCKET SWITCH UP VM VMID
> +%token   PATH PREFIX RDOMAIN SIZE SOCKET SWITCH UP VM VMID STAGGERED START
> +%token  PARALLEL DELAY
> %token  NUMBER
> %token  STRING
> %type   lladdr
> @@ -217,6 +218,11 @@ main : LOCAL INET6 {
>   env->vmd_ps.ps_csock.cs_uid = $3.uid;
>   env->vmd_ps.ps_csock.cs_gid = $3.gid == -1 ? 0 : $3.gid;
>   }
> + | STAGGERED START PARALLEL NUMBER DELAY NUMBER {
> + env->vmd_cfg.cfg_flags |= VMD_CFG_STAGGERED_START;
> + env->vmd_cfg.delay.tv_sec = $6;
> + env->vmd_cfg.parallelism = $4;
> + }
>   ;
> 
> switch: SWITCH string {
> @@ -368,6 +374,8 @@ vm: VM string vm_instance {
>   } else {
>   if (vcp_disable)
>   vm->vm_state |= VM_STATE_DISABLED;
> + else
> + vm->vm_state |= VM_STATE_WAITING;
>   log_debug("%s:%d: vm \"%s\" "
>   "registered (%s)",
>   file->name, yylval.lineno,
> @@ -766,6 +774,7 @@ lookup(char *s)
>   { "allow",  ALLOW },
>   { "boot",   BOOT },
>   { "cdrom",  CDROM },
> + { "delay",  DELAY },
>   { "device", DEVICE },
>   { "disable",DISABLE },
>   { "disk",   DISK },
> @@ -785,10 +794,13 @@ lookup(char *s)
>   { "memory", MEMORY },
>   { "net",NET },
>   { "owner",  OWNER },
> + { "parallel",   PARALLEL },
>   { "prefix", PREFIX },
>   { "rdomain",RDOMAIN },
>   { "size",   SIZE },
>   { "socket", SOCKET },
> + { "staggered",  STAGGERED },
> + { "start",  START  },
>   { "switch", SWITCH },
>   { "up", UP },
>   { "vm", VM }
> Index: usr.sbin/vmd/vm.conf.5
> ===
> RCS file: /home/cvs/src/usr.sbin/vmd/vm.conf.5,v
> retrieving revision 1.44
> diff -u -p -a -u -r1.44 vm.conf.5
> --- usr.sbin/vmd/vm.conf.5 14 May 2019 12:47:17 -  1.44
> +++ usr.sbin/vmd/vm.conf.5 8 Dec 2019 09:29:39 -
> @@ -91,6 +91,16 

Re: [PATCH] attach pvclock with lower priority if tsc is unstable

2019-12-06 Thread Mike Larkin
On Fri, Dec 06, 2019 at 02:16:43PM -0800, Pratik Vyas wrote:
> * Pratik Vyas  [2019-11-24 23:07:26 -0800]:
> 
> > Hello tech@,
> > 
> > This diff attaches pvclock with lower priority (500) in case of unstable
> > tsc (PVCLOCK_FLAG_TSC_STABLE) instead of not attaching at all.
> > 
> > For reference current priorities,
> > tsc (variant)  : -2000
> > i8254  : 0
> > acpitimer  : 1000
> > acpihpet0  : 1000
> > pvclock (stable tsc)   : 1500
> > tsc (invariant, stable): 2000
> > 
> > --
> > Pratik
> > 
> 
> Does this look ok? (or any comments?)
> 
> --
> Pratik
> 

That seems like a reasonable compromise. People who still want to force
set their timecounter can do so.

-ml
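
To make the effect of the numbers above concrete, here is a small sketch that picks the candidate with the highest value. The selection rule (highest wins, first entry wins ties) and the tc_candidate/tc_pick names are assumptions for illustration; this is not the kernel's timecounter code.

```c
#include <stddef.h>

struct tc_candidate {
	const char *name;
	int quality;
};

/* Return the candidate with the highest quality; first one wins ties. */
static const struct tc_candidate *
tc_pick(const struct tc_candidate *tc, size_t n)
{
	const struct tc_candidate *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (best == NULL || tc[i].quality > best->quality)
			best = &tc[i];
	}
	return best;
}
```

With the listed values and pvclock at 500, a guest with an unstable tsc would end up on an ACPI timer (1000) by default, while pvclock stays available for anyone who selects it by hand; with a stable tsc, pvclock at 1500 outranks both ACPI timers.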



Re: uvm/uvm_map.h cleanup

2019-12-06 Thread Mike Larkin
On Thu, Dec 05, 2019 at 07:25:51PM +0100, Martin Pieuchot wrote:
> Following cleanup diff:
> 
> - reduces gratuitous differences with NetBSD,
> - merges multiple '#ifdef _KERNEL' blocks,
> - kills unused 'struct vm_map_intrsafe'
> - turns 'union vm_map_object' into an anonymous union (following NetBSD)
> - moves questionable vm_map_modflags() into uvm/uvm_map.c
> - removes guards around MAX_KMAPENT, it is defined only once
> - documents lock differences
> - fixes tab vs space
> 
> Ok?
> 

ok mlarkin
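
As a side note on the vm_map_modflags() macro that this diff moves: its flag update is a single expression, sketched here as a plain function with the mutex left out. modflags() is a hypothetical name used only for this illustration.

```c
/* Same expression as the macro body in the diff, without the locking. */
static int
modflags(int flags, int set, int clear)
{
	return (flags | set) & ~clear;
}
```

Because the clear mask is applied last, a bit that appears in both "set" and "clear" ends up cleared.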

> Index: uvm/uvm_extern.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
> retrieving revision 1.151
> diff -u -p -r1.151 uvm_extern.h
> --- uvm/uvm_extern.h  29 Nov 2019 06:34:45 -  1.151
> +++ uvm/uvm_extern.h  5 Dec 2019 16:06:33 -
> @@ -65,9 +65,6 @@ typedef int vm_fault_t;
>  typedef int vm_inherit_t;/* XXX: inheritance codes */
>  typedef off_t voff_t;/* XXX: offset within a uvm_object */
>  
> -union vm_map_object;
> -typedef union vm_map_object vm_map_object_t;
> -
>  struct vm_map_entry;
>  typedef struct vm_map_entry *vm_map_entry_t;
>  
> Index: uvm/uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.256
> diff -u -p -r1.256 uvm_map.c
> --- uvm/uvm_map.c 4 Dec 2019 08:28:29 -   1.256
> +++ uvm/uvm_map.c 5 Dec 2019 16:27:22 -
> @@ -230,7 +230,6 @@ void   vmspace_validate(struct vm_map*)
>  #define PMAP_PREFER(addr, off)   (addr)
>  #endif
>  
> -
>  /*
>   * The kernel map will initially be VM_MAP_KSIZE_INIT bytes.
>   * Every time that gets cramped, we grow by at least VM_MAP_KSIZE_DELTA bytes.
> @@ -334,6 +333,14 @@ vaddr_t uvm_maxkaddr;
>   MUTEX_ASSERT_LOCKED(&(_map)->mtx);  \
>   }   \
>   } while (0)
> +
> +#define  vm_map_modflags(map, set, clear)   \
> + do {\
> + mtx_enter(&(map)->flags_lock);  \
> + (map)->flags = ((map)->flags | (set)) & ~(clear);   \
> + mtx_leave(&(map)->flags_lock);  \
> + } while (0)
> +
>  
>  /*
>   * Tree describing entries by address.
> Index: uvm/uvm_map.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.h,v
> retrieving revision 1.65
> diff -u -p -r1.65 uvm_map.h
> --- uvm/uvm_map.h 29 Nov 2019 06:34:46 -  1.65
> +++ uvm/uvm_map.h 5 Dec 2019 16:26:09 -
> @@ -86,16 +86,6 @@
>  #ifdef _KERNEL
>  
>  /*
> - * Internal functions.
> - *
> - * Required by clipping macros.
> - */
> -void  uvm_map_clip_end(struct vm_map*, struct vm_map_entry*,
> - vaddr_t);
> -void  uvm_map_clip_start(struct vm_map*,
> - struct vm_map_entry*, vaddr_t);
> -
> -/*
>   * UVM_MAP_CLIP_START: ensure that the entry begins at or after
>   * the starting address, if it doesn't we split the entry.
>   * 
> @@ -133,26 +123,6 @@ void  uvm_map_clip_start(struct vm_map
>  #include 
>  
>  /*
> - * types defined:
> - *
> - *   vm_map_t          the high-level address map data structure.
> - *   vm_map_entry_t    an entry in an address map.
> - *   vm_map_version_t  a timestamp of a map, for use with vm_map_lookup
> - */
> -
> -/*
> - * Objects which live in maps may be either VM objects, or another map
> - * (called a "sharing map") which denotes read-write sharing with other maps.
> - *
> - * XXXCDC: private pager data goes here now
> - */
> -
> -union vm_map_object {
> - struct uvm_object   *uvm_obj;   /* UVM OBJECT */
> - struct vm_map   *sub_map;   /* belongs to another map */
> -};
> -
> -/*
>   * Address map entries consist of start and end addresses,
>   * a VM object (or sharing map) and offset into that object,
>   * and user-exported inheritance and protection information.
> @@ -177,23 +147,23 @@ struct vm_map_entry {
>   vsize_t guard;  /* bytes in guard */
>   vsize_t fspace; /* free space */
>  
> - union vm_map_object object; /* object I point to */
> + union {
> + struct uvm_object *uvm_obj; /* uvm object */
> + struct vm_map   *sub_map;   /* belongs to another map */
> + } object;   /* object I point to */
>   voff_t  offset; /* offset into object */
>   struct vm_aref  aref;   /* anonymous overlay */
> -
>   int etype;  /* entry type */
> -
>   vm_prot_t   protection; /* protection code */
>   

Re: Kill uvm/uvm_stat.c

2019-12-04 Thread Mike Larkin
On Wed, Dec 04, 2019 at 03:19:41PM +0100, Martin Pieuchot wrote:
> Less is more.  The fewer files there are to look at, the simpler it becomes
> to understand UVM.  uvm/uvm_stat.c contains just a ddb(4) function.  Let's
> move it to uvm/uvm_meter.c which also deals with counters. ok?
> 

Also reads ok to me.

-ml

> Index: conf/files
> ===
> RCS file: /cvs/src/sys/conf/files,v
> retrieving revision 1.677
> diff -u -p -r1.677 files
> --- conf/files 5 Nov 2019 08:18:47 -   1.677
> +++ conf/files 4 Dec 2019 14:15:03 -
> @@ -964,7 +964,6 @@ file uvm/uvm_page.c
>  file uvm/uvm_pager.c
>  file uvm/uvm_pdaemon.c
>  file uvm/uvm_pmemrange.c
> -file uvm/uvm_stat.c
>  file uvm/uvm_swap.c
>  file uvm/uvm_swap_encrypt.c  uvm_swap_encrypt
>  file uvm/uvm_unix.c
> Index: uvm/uvm_meter.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_meter.c,v
> retrieving revision 1.38
> diff -u -p -r1.38 uvm_meter.c
> --- uvm/uvm_meter.c   6 Nov 2018 07:49:38 -   1.38
> +++ uvm/uvm_meter.c   4 Dec 2019 14:16:01 -
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef UVM_SWAP_ENCRYPT
>  #include 
> @@ -312,3 +313,62 @@ uvm_total(struct vmtotal *totalp)
>   totalp->t_rmshr = 0;/* XXX */
>   totalp->t_armshr = 0;   /* XXX */
>  }
> +
> +#ifdef DDB
> +
> +/*
> + * uvmexp_print: ddb hook to print interesting uvm counters
> + */
> +void
> +uvmexp_print(int (*pr)(const char *, ...))
> +{
> +
> + (*pr)("Current UVM status:\n");
> + (*pr)("  pagesize=%d (0x%x), pagemask=0x%x, pageshift=%d\n",
> + uvmexp.pagesize, uvmexp.pagesize, uvmexp.pagemask,
> + uvmexp.pageshift);
> + (*pr)("  %d VM pages: %d active, %d inactive, %d wired, %d free (%d zero)\n",
> + uvmexp.npages, uvmexp.active, uvmexp.inactive, uvmexp.wired,
> + uvmexp.free, uvmexp.zeropages);
> + (*pr)("  min  %d%% (%d) anon, %d%% (%d) vnode, %d%% (%d) vtext\n",
> + uvmexp.anonminpct, uvmexp.anonmin, uvmexp.vnodeminpct,
> + uvmexp.vnodemin, uvmexp.vtextminpct, uvmexp.vtextmin);
> + (*pr)("  freemin=%d, free-target=%d, inactive-target=%d, "
> + "wired-max=%d\n", uvmexp.freemin, uvmexp.freetarg, uvmexp.inactarg,
> + uvmexp.wiredmax);
> + (*pr)("  faults=%d, traps=%d, intrs=%d, ctxswitch=%d fpuswitch=%d\n",
> + uvmexp.faults, uvmexp.traps, uvmexp.intrs, uvmexp.swtch,
> + uvmexp.fpswtch);
> + (*pr)("  softint=%d, syscalls=%d, kmapent=%d\n",
> + uvmexp.softs, uvmexp.syscalls, uvmexp.kmapent);
> +
> + (*pr)("  fault counts:\n");
> + (*pr)("noram=%d, noanon=%d, noamap=%d, pgwait=%d, pgrele=%d\n",
> + uvmexp.fltnoram, uvmexp.fltnoanon, uvmexp.fltnoamap,
> + uvmexp.fltpgwait, uvmexp.fltpgrele);
> + (*pr)("ok relocks(total)=%d(%d), anget(retries)=%d(%d), "
> + "amapcopy=%d\n", uvmexp.fltrelckok, uvmexp.fltrelck,
> + uvmexp.fltanget, uvmexp.fltanretry, uvmexp.fltamcopy);
> + (*pr)("neighbor anon/obj pg=%d/%d, gets(lock/unlock)=%d/%d\n",
> + uvmexp.fltnamap, uvmexp.fltnomap, uvmexp.fltlget, uvmexp.fltget);
> + (*pr)("cases: anon=%d, anoncow=%d, obj=%d, prcopy=%d, przero=%d\n",
> + uvmexp.flt_anon, uvmexp.flt_acow, uvmexp.flt_obj, uvmexp.flt_prcopy,
> + uvmexp.flt_przero);
> +
> + (*pr)("  daemon and swap counts:\n");
> + (*pr)("woke=%d, revs=%d, scans=%d, obscans=%d, anscans=%d\n",
> + uvmexp.pdwoke, uvmexp.pdrevs, uvmexp.pdscans, uvmexp.pdobscan,
> + uvmexp.pdanscan);
> + (*pr)("busy=%d, freed=%d, reactivate=%d, deactivate=%d\n",
> + uvmexp.pdbusy, uvmexp.pdfreed, uvmexp.pdreact, uvmexp.pddeact);
> + (*pr)("pageouts=%d, pending=%d, nswget=%d\n", uvmexp.pdpageouts,
> + uvmexp.pdpending, uvmexp.nswget);
> + (*pr)("nswapdev=%d\n",
> + uvmexp.nswapdev);
> + (*pr)("swpages=%d, swpginuse=%d, swpgonly=%d paging=%d\n",
> + uvmexp.swpages, uvmexp.swpginuse, uvmexp.swpgonly, uvmexp.paging);
> +
> + (*pr)("  kernel pointers:\n");
> + (*pr)("objs(kern)=%p\n", uvm.kernel_object);
> +}
> +#endif
> Index: uvm/uvm_stat.c
> ===
> RCS file: uvm/uvm_stat.c
> diff -N uvm/uvm_stat.c
> --- uvm/uvm_stat.c 19 Jun 2018 22:35:07 -  1.30
> +++ /dev/null 1 Jan 1970 00:00:00 -
> @@ -1,98 +0,0 @@
> -/*   $OpenBSD: uvm_stat.c,v 1.30 2018/06/19 22:35:07 krw Exp $*/
> -/*   $NetBSD: uvm_stat.c,v 1.18 2001/03/09 01:02:13 chs Exp $ */
> -
> -/*
> - * Copyright (c) 1997 Charles D. Cranor and Washington University.
> - * All rights reserved.
> - *
> - * Redistribution and use in source and binary forms, with or without
> - * modification, are permitted provided that the following conditions
> - * are met:
> - * 1. 

Re: un-boolean_t amd64's pmap

2019-12-04 Thread Mike Larkin
On Wed, Dec 04, 2019 at 03:31:07PM +0100, Martin Pieuchot wrote:
> Similar to recent ddb(4) changes, replace boolean_t/TRUE/FALSE by
> int/1/0.
> 
> ok?
> 

No objection here, unsure if anyone else has commented either way.

-ml

> Index: arch/amd64/amd64/pmap.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
> retrieving revision 1.136
> diff -u -p -r1.136 pmap.c
> --- arch/amd64/amd64/pmap.c   3 Nov 2019 09:44:23 -   1.136
> +++ arch/amd64/amd64/pmap.c   4 Dec 2019 14:26:15 -
> @@ -254,7 +254,7 @@ paddr_t cr3_pcid_proc_intel;
>   */
>  
>  pt_entry_t protection_codes[8]; /* maps MI prot to i386 prot code */
> -boolean_t pmap_initialized = FALSE; /* pmap_init done yet? */
> +int pmap_initialized = 0;/* pmap_init done yet? */
>  
>  /*
>   * pv management structures.
> @@ -312,7 +312,7 @@ void pmap_free_ptp(struct pmap *, struct
>  vaddr_t, struct pg_to_free *);
>  void pmap_freepage(struct pmap *, struct vm_page *, int, struct pg_to_free *);
>  #ifdef MULTIPROCESSOR
> -static boolean_t pmap_is_active(struct pmap *, int);
> +static int pmap_is_active(struct pmap *, int);
>  #endif
>  paddr_t pmap_map_ptes(struct pmap *);
>  struct pv_entry *pmap_remove_pv(struct vm_page *, struct pmap *, vaddr_t);
> @@ -320,7 +320,7 @@ void pmap_do_remove(struct pmap *, vaddr
>  void pmap_remove_ept(struct pmap *, vaddr_t, vaddr_t);
>  void pmap_do_remove_ept(struct pmap *, vaddr_t);
>  int pmap_enter_ept(struct pmap *, vaddr_t, paddr_t, vm_prot_t);
> -boolean_t pmap_remove_pte(struct pmap *, struct vm_page *, pt_entry_t *,
> +int pmap_remove_pte(struct pmap *, struct vm_page *, pt_entry_t *,
>  vaddr_t, int, struct pv_entry **);
>  void pmap_remove_ptes(struct pmap *, struct vm_page *, vaddr_t,
>  vaddr_t, vaddr_t, int, struct pv_entry **);
> @@ -328,8 +328,8 @@ void pmap_remove_ptes(struct pmap *, str
>  #define PMAP_REMOVE_SKIPWIRED1   /* skip wired mappings */
>  
>  void pmap_unmap_ptes(struct pmap *, paddr_t);
> -boolean_t pmap_get_physpage(vaddr_t, int, paddr_t *);
> -boolean_t pmap_pdes_valid(vaddr_t, pd_entry_t *);
> +int pmap_get_physpage(vaddr_t, int, paddr_t *);
> +int pmap_pdes_valid(vaddr_t, pd_entry_t *);
>  void pmap_alloc_level(vaddr_t, int, long *);
>  
>  static inline
> @@ -353,7 +353,7 @@ void pmap_tlb_shootwait(void);
>   *   of course the kernel is always loaded
>   */
>  
> -static __inline boolean_t
> +static __inline int
>  pmap_is_curpmap(struct pmap *pmap)
>  {
>   return((pmap == pmap_kernel()) ||
> @@ -365,7 +365,7 @@ pmap_is_curpmap(struct pmap *pmap)
>   */
>  
>  #ifdef MULTIPROCESSOR
> -static __inline boolean_t
> +static __inline int
>  pmap_is_active(struct pmap *pmap, int cpu_id)
>  {
>   return (pmap == pmap_kernel() ||
> @@ -1402,7 +1402,7 @@ pmap_deactivate(struct proc *p)
>   * some misc. functions
>   */
>  
> -boolean_t
> +int
>  pmap_pdes_valid(vaddr_t va, pd_entry_t *lastpde)
>  {
>   int i;
> @@ -1424,7 +1424,7 @@ pmap_pdes_valid(vaddr_t va, pd_entry_t *
>   * pmap_extract: extract a PA for the given VA
>   */
>  
> -boolean_t
> +int
>  pmap_extract(struct pmap *pmap, vaddr_t va, paddr_t *pap)
>  {
>   pt_entry_t *ptes;
> @@ -1433,7 +1433,7 @@ pmap_extract(struct pmap *pmap, vaddr_t 
>   if (pmap == pmap_kernel() && va >= PMAP_DIRECT_BASE &&
>   va < PMAP_DIRECT_END) {
>   *pap = va - PMAP_DIRECT_BASE;
> - return (TRUE);
> + return 1;
>   }
>  
>   level = pmap_find_pte_direct(pmap, va, &ptes, &offs);
> @@ -1441,12 +1441,12 @@ pmap_extract(struct pmap *pmap, vaddr_t 
>   if (__predict_true(level == 0 && pmap_valid_entry(ptes[offs]))) {
>   if (pap != NULL)
>   *pap = (ptes[offs] & PG_FRAME) | (va & PAGE_MASK);
> - return (TRUE);
> + return 1;
>   }
>   if (level == 1 && (ptes[offs] & (PG_PS|PG_V)) == (PG_PS|PG_V)) {
>   if (pap != NULL)
>   *pap = (ptes[offs] & PG_LGFRAME) | (va & PAGE_MASK_L2);
> - return (TRUE);
> + return 1;
>   }
>  
>   return FALSE;
> @@ -1588,7 +1588,7 @@ pmap_remove_ptes(struct pmap *pmap, stru
>   * => returns true if we removed a mapping
>   */
>  
> -boolean_t
> +int
>  pmap_remove_pte(struct pmap *pmap, struct vm_page *ptp, pt_entry_t *pte,
>  vaddr_t va, int flags, struct pv_entry **free_pvs)
>  {
> @@ -1597,9 +1597,9 @@ pmap_remove_pte(struct pmap *pmap, struc
>   pt_entry_t opte;
>  
>   if (!pmap_valid_entry(*pte))
> - return(FALSE);  /* VA not mapped */
> + return 0;   /* VA not mapped */
>   if ((flags & PMAP_REMOVE_SKIPWIRED) && (*pte & PG_W)) {
> - return(FALSE);
> + return 0;
>   }
>  
>   /* atomically save the old PTE and zap! it */
> @@ -1623,7 +1623,7 @@ pmap_remove_pte(struct pmap *pmap, struc
>   

Re: [PATCH] fix vmm pvclock accuracy

2019-11-25 Thread Mike Larkin
On Mon, Nov 25, 2019 at 07:06:19PM -0800, Pratik Vyas wrote:
> Hi tech@,
> 
> This patch fixes vmm pvclock accuracy issues.  Shift math error
> discovered by George Koehler.  This diff also fixes the error in tsc
> multiplier which was correct only if the host timecounter is tsc.
> 
> --
> Pratik
> 

Provided there is no reported fallout for just this piece, ok mlarkin@.

-ml
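
A worked check of the shift arithmetic, as a sketch under stated assumptions: the guest is assumed to follow the usual pvclock convention (shift the TSC delta, multiply, keep the bits above 32), and the multiplier is taken to be one billion (nanoseconds per second) shifted left by 20 and divided by the TSC frequency. pv_mul() and pv_delta_to_ns() are hypothetical helpers, and the plain 64-bit multiply only holds for small deltas.

```c
#include <stdint.h>

/* Multiplier: (10^9 << 20) / tsc frequency in Hz; 0 if unknown. */
static uint32_t
pv_mul(uint64_t tsc_hz)
{
	if (tsc_hz == 0)
		return 0;
	return (uint32_t)((1000000000ULL << 20) / tsc_hz);
}

/*
 * Guest-side conversion of a TSC delta to nanoseconds: shift the delta,
 * multiply, and keep the bits above bit 32.  A real implementation would
 * use a wider intermediate; this sketch assumes small deltas.
 */
static uint64_t
pv_delta_to_ns(uint64_t delta, int8_t shift, uint32_t mul)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (delta * mul) >> 32;
}
```

With the multiplier carrying 2^20 and the final step dividing by 2^32, a shift of 12 makes the powers of two cancel (20 + 12 - 32 = 0), so the result is delta * 10^9 / frequency. For a 2 GHz TSC, 2,000,000 cycles convert to 1,000,000 ns with shift 12, whereas shift -20 collapses the same delta to 0.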

> 
> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /home/cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.254
> diff -u -p -a -u -r1.254 vmm.c
> --- sys/arch/amd64/amd64/vmm.c 22 Sep 2019 08:47:54 -  1.254
> +++ sys/arch/amd64/amd64/vmm.c 26 Nov 2019 00:08:10 -
> @@ -28,7 +28,6 @@
> #include 
> #include 
> #include 
> -#include 
> 
> #include 
> 
> @@ -6879,8 +6878,11 @@ void
> vmm_init_pvclock(struct vcpu *vcpu, paddr_t gpa)
> {
>   vcpu->vc_pvclock_system_gpa = gpa;
> - vcpu->vc_pvclock_system_tsc_mul =
> - (int) ((1000000000L << 20) / tc_getfrequency());
> + if (tsc_frequency > 0)
> + vcpu->vc_pvclock_system_tsc_mul =
> + (int) ((1000000000L << 20) / tsc_frequency);
> + else
> + vcpu->vc_pvclock_system_tsc_mul = 0;
>   vmm_update_pvclock(vcpu);
> }
> 
> @@ -6906,7 +6908,7 @@ vmm_update_pvclock(struct vcpu *vcpu)
>   nanotime(&tv);
>   pvclock_ti->ti_system_time =
>   tv.tv_sec * 1000000000L + tv.tv_nsec;
> - pvclock_ti->ti_tsc_shift = -20;
> + pvclock_ti->ti_tsc_shift = 12;
>   pvclock_ti->ti_tsc_to_system_mul =
>   vcpu->vc_pvclock_system_tsc_mul;
>   pvclock_ti->ti_flags = PVCLOCK_FLAG_TSC_STABLE;
> 



Re: sdhc(4): no 0V one some Intel

2019-11-19 Thread Mike Larkin
On Tue, Nov 19, 2019 at 10:44:54AM +0100, Patrick Wildt wrote:
> Hi,
> 
> On a GPD Pocket that mlarkin@ has, the eMMC doesn't come up.  One issue
> is that we shouldn't go to 0V on some/most(?) Intel controllers.  This
> only adds it for his machine, but I know that the Apollo Lake versions
> might also work if added to this check.  But unless verified I'll not
> add it.  This makes mlarkin@'s machine work, even though it's utterly
> slow.  That would be the next thing to fix... probably low clocks or
> doesn't use 8-bit.
> 
> ok?
> 

Sure. And you get the handoff to fix the rest :)

-ml

> Patrick
> 
> diff --git a/sys/dev/pci/sdhc_pci.c b/sys/dev/pci/sdhc_pci.c
> index d1b6688f573..dd6bc79c29c 100644
> --- a/sys/dev/pci/sdhc_pci.c
> +++ b/sys/dev/pci/sdhc_pci.c
> @@ -127,6 +127,11 @@ sdhc_pci_attach(struct device *parent, struct device *self, void *aux)
>   PCI_PRODUCT(pa->pa_id) == PCI_PRODUCT_ENE_SDCARD)
>   sc->sc.sc_flags |= SDHC_F_NOPWR0;
>  
> + /* Some Intel controllers break if set to 0V bus power. */
> + if (PCI_VENDOR(pa->pa_id) == PCI_VENDOR_INTEL &&
> + PCI_PRODUCT(pa->pa_id) == PCI_PRODUCT_INTEL_100SERIES_LP_EMMC)
> + sc->sc.sc_flags |= SDHC_F_NOPWR0;
> +
>   /* Some RICOH controllers need to be bumped into the right mode. */
>   if (PCI_VENDOR(pa->pa_id) == PCI_VENDOR_RICOH &&
>   (PCI_PRODUCT(pa->pa_id) == PCI_PRODUCT_RICOH_R5U822 ||
> 



Re: iwm: support 9260 devices

2019-11-16 Thread Mike Larkin
On Sat, Nov 16, 2019 at 05:09:40PM +0100, Stefan Sperling wrote:
> On Sat, Nov 16, 2019 at 04:51:44PM +0100, Stefan Sperling wrote:
> > This diff adds support for iwm(4) 9260 devices and hopefully 9560
> > devices as well but I have not yet had time to test those.
> > 
> > Joint work with patrick@. Some parts were lifted from FreeBSD.
> > 
> > If you have the following device in pcidump it should at least get
> > an IP address from DHCP and be able to ping:
> >  4:0:0: Intel Dual Band Wireless-AC 9260
> >  0x0000: Vendor ID: 8086, Product ID: 2526
> > 
> > The firmware is not in fw_update yet.
> > In the meantime firmware can be fetched from here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/
> > 
> > Copy these files to /etc/firmware as indicated:
> > for 9260: iwlwifi-9260-th-b0-jf-b0-34.ucode -> /etc/firmware/iwm-9260-34
> > for 9560: iwlwifi-9000-pu-b0-jf-b0-34.ucode -> /etc/firmware/iwm-9000-34
> > 
> > Checks for regressions on already supported devices are also welcome,
> > in which case the firmware isn't needed.
> 
> Better diff which fixes an Rx throughput issue which was present in
> the previous diff.
> 

Cool. Seems like it works here, I'm using it now, but it seems to have had at
least one firmware error in the past few minutes:

iwm0 at pci0 dev 20 function 3 "Intel Dual Band Wireless AC 9560" rev 0x30, msi
iwm0: hw rev 0x310, fw ver 34.3125811985.0, address xxx
iwm0: fatal firmware error
iwm0: could not remove MAC context (error 35)

I didn't notice any problem though. Other than that it seems ok!

-ml



Re: [PATCH: 1/3] MMIO handler in vmm(4)

2019-11-13 Thread Mike Larkin
On Sat, Nov 02, 2019 at 06:40:52AM +0900, Iori YONEJI wrote:
> On Tue, Oct 29, 2019 at 02:17:28AM -0700, Mike Larkin wrote:
> > On Thu, Oct 24, 2019 at 08:54:58AM +0900, Iori YONEJI wrote:
> > > Hello tech@,
> > > 
> > > Here is the patch discussed in the previous email. This part mainly
> > > covers changes in the declaration part and fault handlers.
> > > 
> > 
> > Hello,
> > 
> >  I read through the three diffs and have some feedback.
> > 
> > First, please reformat all 3 diffs using style(9) guidelines. There are many
> > spaces vs tabs issues in the diffs.
> Thank you for your review. I'll fix the issues on this weekend.
> 
> First of all, the spaces vs. tabs issues were caused by mangling by
> the mailer. I have fixed this now. I'm not aware of other types of
> style issues, but I will go through the patches to look for other
> style mismatches once I start working on them.
> 
> > Second, there seems to be quite a bit of important code missing here:
> Most of them convinced me, but let me leave a few comments on them.
> 

Iori Yoneji and I are discussing this in another thread, but for the benefit
of searchability/posterity, see below for a summary.

> 1. Segment base and limit check
> > 1. There appears to be no handling of segment base and limits for 32 bit
> >instructions. These need to be read from the gdt and strictly adhered
> >to. For 64 bit mode, it's not as important, unless the guest has enabled
> >LMSLE (long mode segment limit enable) on AMD CPUs, in which case the
> >limits need to be checked also. If I recall correctly, there are also
> >some different rules that need to be followed for 32 bit segment use
> >relating to permission checks and segment types.
> I wasn't aware of the limit check, but I think the vcpu would trigger
> a #GP instead of an EPT violation if the access was outside the
> segment, wouldn't it?
>

The SDM seems to be a bit vague here about the rules for checking segment limits
in VMX non-root mode (i.e., inside a VM) and whether a limit violation causes an
exit or not under various configurations. This is likely an issue only for
32 bit guests (we can block LMSLE).
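
To make the 32 bit case concrete, here is a minimal sketch of the kind of limit
check being discussed. It is not taken from vmm(4) or from the diffs: struct
emul_seg and emul_seg_check() are hypothetical names, and the sketch assumes an
expand-up data segment whose limit has already been scaled to bytes. Segment
type and permission checks are left out.

```c
#include <stdint.h>

/* Hypothetical, simplified view of a cached segment descriptor. */
struct emul_seg {
	uint32_t base;	/* segment base address */
	uint32_t limit;	/* highest valid offset in bytes (expand-up) */
};

/*
 * Check that an access of 'len' bytes at offset 'off' stays inside the
 * segment.  On success store the linear address in *la and return 0.
 * Return -1 if the access would cross the limit; a real emulator would
 * inject #GP (or #SS for stack segments) at that point.
 */
int
emul_seg_check(const struct emul_seg *seg, uint32_t off, uint32_t len,
    uint32_t *la)
{
	if (len == 0)
		return (-1);
	/* The last byte touched is off + len - 1; avoid 32 bit overflow. */
	if (off > seg->limit || len - 1 > seg->limit - off)
		return (-1);
	*la = seg->base + off;
	return (0);
}
```

The comparison is written against the last byte of the access so that an access
ending exactly at the limit is accepted and one that straddles it is rejected.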

> 2. Privilege level check
> > 2. For that matter, there appears to be no handling of any permission or
> >privilege checks in the instruction emulator. This means any privilege
> >level can read or write any memory in the VM.
> Sorry, that is a big mistake on my part. I will check the CPU mode in
> the next post.
> 

In the generic EPT case (not a segment limit issue), this is likely handled
properly already. We are discussing if it makes sense to put the permission
checks in anyway, in case this code ends up being used in other places.

> 3-4. 'A'ccessed and 'D'irty bit management
> > 3. There appears to be no handling of the updating of the 'A'ccessed or
> >'D'irty bits on successful page table walk and writing to a page
> >computed by that translation. Granted, this probably needs to go into
> >the existing translate_gva function for the 'A' bit, but the 'D' bit
> >needs to also be handled here.
> > 
> > 4. There appears to be no handling of the updating of the 'A' bit in the
> >GDT for the segment descriptor in use.
> Yes, but I'm not sure what the expected use of A/D bit management
> would be on an MMIO region, because it wouldn't be subject to swap
> in/out. I think I will understand the expected behavior after working
> on it, but I don't yet see how leaving it out compromises the virtual
> machine's features.
> 

There are certain degenerate cases that might require this, and we are unclear
under these circumstances if the CPU does this for us or if the emulator
needs to. As pointed out above, the need for A/D bits in this particular case
is likely minimal, but in the same theme as above, we may want to make the
code work generically instead of assuming the emulator will only ever be
used for accesses in the MMIO region.
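
For the generic case, the A/D update an emulator would have to perform can be
sketched as below. This is only an illustration under stated assumptions:
emul_pte_touch() is a hypothetical helper that operates on a local copy of the
final-level PTE, the bit positions are the standard x86 ones, and CR0.WP,
user/supervisor checks and the atomic write-back to guest memory are all left
out.

```c
#include <stdint.h>

#define PTE_P	0x001ULL	/* present */
#define PTE_W	0x002ULL	/* writable */
#define PTE_A	0x020ULL	/* accessed */
#define PTE_D	0x040ULL	/* dirty */

/*
 * Update the A/D bits of a final-level PTE after an emulated access.
 * Returns 0 on success, or -1 if the page is not present or a write
 * targets a read-only page (the caller would inject #PF instead).
 */
int
emul_pte_touch(uint64_t *pte, int is_write)
{
	if ((*pte & PTE_P) == 0)
		return (-1);
	if (is_write && (*pte & PTE_W) == 0)
		return (-1);
	*pte |= PTE_A;		/* every successful access sets 'A' */
	if (is_write)
		*pte |= PTE_D;	/* only writes set 'D' */
	return (0);
}
```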

-ml


> 5-7. Prefixes
> > 5. The code seems to ignore any segment prefix override bytes.
> > 
> > 6. The code seems to ignore forbidden instruction prefixes (like rex.w
> >or lock prefixes being placed before instructions where they don't
> >make sense or are prohibited by the SDM)
> > 
> > 7. For that matter, operand size encoding prefixes don't seem to be
> >honoured at all
> Yes, there are some prefixes I missed in the patches. I will enhance
> them to handle those prefixes.
> One thing to confirm: Is it OK not to calculate in the segment registers
> (whether overridden or not) but just count the bytes, because we know
> the address the instruction intended to access?
> 
> 8. %rflags
> 

Re: vmm disk unavailable after forceful vm termination

2019-11-01 Thread Mike Larkin
On Fri, Nov 01, 2019 at 09:20:58AM -0400, Johan Huldtgren wrote:
> hello,
> 
> I have vmd running on -current, in it I have an Ubuntu vm (18.04.3 LTS),
> every now and then the Ubuntu vm will hang hard, console is dead, only
> option is to restart it. Now at that point a graceful restart won't
> work, 'vmctl stop n' will run and return but nothing will happen. So
> the only option is 'vmctl stop -f n', this will kill the vm. However
> after that the vm will disappear from the list of vms.
> 
> Before killing:
> 
> $ vmctl status
>    ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
>     2 71924     1    2.0G    1.1G   ttyp2        johan  running plex.vm
>     1 18308     1    8.0G    6.5G   ttyp0        johan  running monitor.vm
>     3     -     1    1.0G       -       -        johan  stopped amd64-ports.vm
>     4     -     1    512M       -       -        johan  stopped i386-ports.vm
> 
> After killing:
> 
> $ vmctl stop -f 2
> stopping vm: forced to terminate vm 2
> 
> $ vmctl status
>    ID   PID VCPUS  MAXMEM  CURMEM     TTY        OWNER    STATE NAME
>     1 18308     1    8.0G    6.5G   ttyp0        johan  running monitor.vm
>     3     -     1    1.0G       -       -        johan  stopped amd64-ports.vm
>     4     -     1    512M       -       -        johan  stopped i386-ports.vm
> 
> The vm is now gone, I can't just start it with 'vmctl start n',
> but it's defined in vm.conf so let's try starting it by name:
> 
> $ vmctl start plex.vm
> vmctl: start vm command failed: Operation already in progress
> 
> ok.. so let's restart vmd.
> 
> $ doas rcctl stop vmd
> vmd(ok)
> $ doas rcctl start vmd
> vmd(ok)
> 
> but it's not actually running.
> 
> $ vmctl status
> vmctl: connect: /var/run/vmd.sock: Connection refused
> 
> Let's try starting manually with debug
> 
> $ doas vmd -d
> startup
> warning: macro 'sets' not used
> can't open disk /ftp/vm/plex.img: Resource temporarily unavailable
> failed to start vm plex.vm
> parent: configuration failed
> priv exiting, pid 9509
> control exiting, pid 63731
> vmm exiting, pid 28106
> 
> So it seems the disk image is now in some state which won't let it be read again?
> 
> $ ls -al /ftp/vm/plex.img
> -rw-------  1 root  wheel  32212254720 Nov  1 08:07 /ftp/vm/plex.img
> 
> $ doas file /ftp/vm/plex.img
> /ftp/vm/plex.img: x86 boot sector; partition 1: ID=0x83, active, starthead 32, startsector 2048, 62910464 sectors
> 
> At this point the only solution is rebooting the host running vmd.
> After that everything will work just fine again until the next time
> this happens. I don't know if this is known or a bug, googling and
> scanning through marc.info I couldn't find any reports. I don't know
> if this is relevant, but when I restarted this vm yesterday (along with
> hanging hard, it sometimes just dies and is reported as stopped) I saw
> this in /var/log/daemon:
> 
> Oct 31 07:58:25 absu vmd[17613]: plex.vm: started vm 2 successfully, tty /dev/ttyp2
> Oct 31 07:58:29 absu vmd[71924]: vcpu_process_com_data: guest reading com1 when not ready
> Oct 31 07:58:35 absu vmd[71924]: vioblk_notifyq: unsupported command 0x8
> Oct 31 07:58:52 absu vmd[71924]: vcpu_process_com_data: guest reading com1 when not ready
> 
> Full dmesg of the vmd host below, let me know if I can provide any further
> details.
> 
> thanks,
> 
> .jh
> 

Sometimes vmd gets really stuck like this, and even vmctl stop -f won't stop
it and rcctl stop vmd also fails. You probably have a vmd process spinning at
100% CPU; kill that one manually and it should free things up.

I know about the problem, but have not had a chance to fix it yet.

-ml

> ---
> 
> OpenBSD 6.6-current (GENERIC.MP) #407: Mon Oct 28 00:42:58 MDT 2019
>  dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 137318658048 (130957MB)
> avail mem = 133144268800 (126976MB)
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.8 @ 0x7db4c000 (134 entries)
> bios0: vendor American Megatrends Inc. version "3703" date 04/24/2018
> bios0: ASUSTeK COMPUTER INC. Z10PE-D16 Series
> acpi0 at bios0: ACPI 5.0
> acpi0: sleep states S0 S4 S5
> acpi0: tables DSDT FACP APIC FPDT FIDT MCFG EINJ UEFI HPET MSCT SLIT SRAT WDDT SSDT SPMI SSDT SSDT PRAD DMAR HEST BERT ERST
> acpi0: wakeup devices IP2P(S3) EHC1(S4) BR1A(S4) BR1B(S4) BR2A(S4) BR2B(S4) BR2C(S4) BR2D(S4) BR3A(S4) BR3B(S4) BR3C(S4) BR3D(S4) RP01(S4) RP02(S4) RP03(S4) RP04(S4) [...]
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 2394.84 MHz, 06-3f-02
> cpu0: 
> 

Re: [PATCH: 1/3] MMIO handler in vmm(4)

2019-10-29 Thread Mike Larkin
On Thu, Oct 24, 2019 at 08:54:58AM +0900, Iori YONEJI wrote:
> Hello tech@,
> 
> Here is the patch discussed in the previous email. This part mainly
> covers changes in the declaration part and fault handlers.
> 

Hello,

 I read through the three diffs and have some feedback.

First, please reformat all 3 diffs using style(9) guidelines. There are many
spaces vs tabs issues in the diffs.

Second, there seems to be quite a bit of important code missing here:

1. There appears to be no handling of segment base and limits for 32 bit
   instructions. These need to be read from the gdt and strictly adhered
   to. For 64 bit mode, it's not as important, unless the guest has enabled
   LMSLE (long mode segment limit enable) on AMD CPUs, in which case the
   limits need to be checked also. If I recall correctly, there are also
   some different rules that need to be followed for 32 bit segment use
   relating to permission checks and segment types.

2. For that matter, there appears to be no handling of any permission or
   privilege checks in the instruction emulator. This means any privilege
   level can read or write any memory in the VM.

3. There appears to be no handling of the updating of the 'A'ccessed or
   'D'irty bits on successful page table walk and writing to a page
   computed by that translation. Granted, this probably needs to go into
   the existing translate_gva function for the 'A' bit, but the 'D' bit
   needs to also be handled here.

4. There appears to be no handling of the updating of the 'A' bit in the
   GDT for the segment descriptor in use.

5. The code seems to ignore any segment prefix override bytes.

6. The code seems to ignore forbidden instruction prefixes (like rex.w
   or lock prefixes being placed before instructions where they don't
   make sense or are prohibited by the SDM).

7. For that matter, operand size encoding prefixes don't seem to be
   honoured at all.

8. %rflags is not being updated in any scenario. It looks like there were
   a few #defines for things like VMM_EMUL_ADD/SUB, which would require
   setting various flags in %rflags, but I can't seem to find the code that
   actually does these instructions, so maybe these are just old #defines?
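
As an illustration of what items 5 through 7 are asking for, here is a minimal
sketch of a prefix scanner. It is not code from the diffs: struct emul_prefix
and emul_scan_prefixes() are hypothetical names. The sketch only records which
prefixes were seen; rejecting forbidden combinations (item 6) and applying the
segment and operand size overrides would be done by later stages using this
record.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of the prefixes found before the opcode. */
struct emul_prefix {
	uint8_t seg;	/* segment override byte, 0 if none */
	uint8_t opsize;	/* 1 if 0x66 (operand size) was seen */
	uint8_t adsize;	/* 1 if 0x67 (address size) was seen */
	uint8_t lock;	/* 1 if 0xf0 (lock) was seen */
	uint8_t rep;	/* 0xf2 or 0xf3, 0 if none */
	uint8_t rex;	/* REX byte (64 bit mode only), 0 if none */
};

/*
 * Scan legacy prefixes and, in 64 bit mode, a trailing REX prefix.
 * Returns the number of prefix bytes, so the opcode starts at
 * insn[ret], or -1 if the buffer ends before an opcode byte is found.
 */
int
emul_scan_prefixes(const uint8_t *insn, size_t len, int long_mode,
    struct emul_prefix *p)
{
	size_t i;

	memset(p, 0, sizeof(*p));
	for (i = 0; i < len; i++) {
		switch (insn[i]) {
		case 0x26: case 0x2e: case 0x36: case 0x3e:
		case 0x64: case 0x65:
			p->seg = insn[i];	/* es, cs, ss, ds, fs, gs */
			break;
		case 0x66:
			p->opsize = 1;
			break;
		case 0x67:
			p->adsize = 1;
			break;
		case 0xf0:
			p->lock = 1;
			break;
		case 0xf2: case 0xf3:
			p->rep = insn[i];
			break;
		default:
			/* REX (0x40-0x4f) is only a prefix in 64 bit mode. */
			if (long_mode && (insn[i] & 0xf0) == 0x40) {
				if (i + 1 >= len)
					return (-1);
				p->rex = insn[i];
				return ((int)(i + 1));
			}
			return ((int)i);
		}
	}
	return (-1);
}
```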

While this appears to be a step in the right direction, I'm afraid there
is quite a bit that needs to get done before we can commit this. The
items above are not nit-pick items; they are critical for security and
without implementing these we will end up opening huge security holes
inside the VM.
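
To illustrate item 8, an emulated 32 bit ADD would need to compute the
arithmetic flags roughly as in the sketch below. This is a hedged example, not
code from vmm(4) or the diffs: emul_add32() and the FL_* names are
hypothetical, while the bit positions are the architectural %rflags ones.

```c
#include <stdint.h>

#define FL_CF	0x0001ULL	/* carry */
#define FL_PF	0x0004ULL	/* parity of the low result byte */
#define FL_AF	0x0010ULL	/* auxiliary carry (bit 3 to bit 4) */
#define FL_ZF	0x0040ULL	/* zero */
#define FL_SF	0x0080ULL	/* sign */
#define FL_OF	0x0800ULL	/* signed overflow */
#define FL_ARITH (FL_CF | FL_PF | FL_AF | FL_ZF | FL_SF | FL_OF)

/*
 * Emulate a 32 bit ADD: return a + b and replace the six arithmetic
 * flags in *rflags, leaving all other bits untouched.
 */
uint32_t
emul_add32(uint32_t a, uint32_t b, uint64_t *rflags)
{
	uint32_t r = a + b;
	uint64_t f = 0;
	uint8_t lo = r & 0xff;

	if (r < a)
		f |= FL_CF;		/* unsigned wrap-around */
	/* Fold the low byte down to one bit; PF means an even bit count. */
	lo ^= lo >> 4;
	lo ^= lo >> 2;
	lo ^= lo >> 1;
	if ((lo & 1) == 0)
		f |= FL_PF;
	if ((a ^ b ^ r) & 0x10)
		f |= FL_AF;
	if (r == 0)
		f |= FL_ZF;
	if (r & 0x80000000U)
		f |= FL_SF;
	/* Overflow: operands share a sign and the result's sign differs. */
	if (~(a ^ b) & (a ^ r) & 0x80000000U)
		f |= FL_OF;

	*rflags = (*rflags & ~FL_ARITH) | f;
	return (r);
}
```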

Please let me know if any of these are actually implemented somewhere
in the three diffs; it's certainly possible I missed something.

How would you like to proceed? If you would like to clean up the style
and start implementing the items above, I'm happy to take another look
once you make some progress. Let me know.

Thanks again for trying to make progress here. I hope we can get things
moving forward.

-ml

> 
> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.254
> diff -u -p -r1.254 vmm.c
> --- sys/arch/amd64/amd64/vmm.c22 Sep 2019 08:47:54 -1.254
> +++ sys/arch/amd64/amd64/vmm.c23 Oct 2019 23:04:30 -
> @@ -107,6 +107,27 @@ struct vmm_softc {
>  	uint8_t vpids[512];	/* bitmap of used VPID/ASIDs */
>  };
> 
> +/* Instruction type collection for MMIO emulation */
> +#define VMM_EMUL_MOV 1
> +#define VMM_EMUL_ADD 2
> +#define VMM_EMUL_SUB 3
> +#define VMM_EMUL_XCHG 4
> +
> +/* vmm_emul_data_desc indicates a location used for MMIO emulation */
> +struct vmm_emul_data_desc {
> +	vaddr_t	gpa;
> +	int	reg_id;	/* -1: use gpa as the data location */
> +};
> +
> +struct vmm_emul_instruction {
> +	struct vmm_emul_data_desc src;
> +	struct vmm_emul_data_desc dest;
> +	int	len;
> +	int	insn_type;
> +	int	width;		/* effective width of data to be written */
> +	int	zero_sign;	/* zero or sign extend? */
> +};
> +
>  void vmx_dump_vmcs_field(uint16_t, const char *);
>  int vmm_enabled(void);
>  int vmm_probe(struct device *, void *, void *);
> @@ -167,7 +188,9 @@ int vmx_handle_cr0_write(struct vcpu *,
>  int vmx_handle_cr4_write(struct vcpu *, uint64_t);
>  int vmx_handle_cr(struct vcpu *);
>  int svm_handle_inout(struct vcpu *);
> +int svm_handle_mmio(struct vcpu *, paddr_t);
>  int vmx_handle_inout(struct vcpu *);
> +int vmx_handle_mmio(struct vcpu *, paddr_t);
>  int svm_handle_hlt(struct vcpu *);
>  int vmx_handle_hlt(struct vcpu *);
>  int vmm_inject_ud(struct vcpu *);
> @@ -183,6 +206,9 @@ int svm_get_guest_faulttype(struct vmcb
>  int vmx_get_exit_qualification(uint64_t *);
>  int vmm_get_guest_cpu_cpl(struct vcpu *);
>  int vmm_get_guest_cpu_mode(struct vcpu *);
> +int parse_instruction(struct vcpu *vcpu, uint8_t insn[], paddr_t iomem,
> +struct vmm_emul_instruction *info);
> +int emul_mmio(struct vcpu *, 

Re: vmctl: start: Require one interface at minimum with -i

2019-10-27 Thread Mike Larkin
On Sat, Oct 26, 2019 at 12:57:56AM +0200, Klemens Nanni wrote:
> It makes no sense to allow zero interfaces;  either a positive count is
> given or -i is omitted entirely.  vm.conf(5) does not allow interface
> configuration that results in zero interfaces either.
> 
>   $ doas vmctl start -Li0 -b ~/bsd.rd foo  
>   vmctl: starting without disks
>   vmctl: started vm 6 successfully, tty /dev/ttyph
> 
> Diff below raises the minimum to one and tells more about invalid counts
> with the usual strtonum(3) idiom:
> 
>   $ doas vmctl start -Li-1 -b ~/bsd.rd foo
>   vmctl: invalid count "-1": too small
>   vmctl: invalid interface count: -1
>   $ doas ./obj/vmctl start -Li0 -b ~/bsd.rd foo 
>   vmctl: count is too small: 0
>   vmctl: invalid interface count: 0
> 
> Those duplicate errors aren't easily merged without breaking consistency
> with how all the command line flags are passed, so I did nothing about
> this.
> 
> OK?
> 
> 
> Index: main.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/main.c,v
> retrieving revision 1.58
> diff -u -p -r1.58 main.c
> --- main.c23 Aug 2019 07:55:20 -  1.58
> +++ main.c25 Oct 2019 22:48:49 -
> @@ -373,9 +373,9 @@ parse_ifs(struct parse_result *res, char
>   const char  *error;
>  
>   if (word != NULL) {
> - val = strtonum(word, 0, INT_MAX, &error);
> + val = strtonum(word, 1, INT_MAX, &error);
>   if (error != NULL)  {
> - warnx("invalid count \"%s\": %s", word, error);
> + warnx("count is %s: %s", error, word);
>   return (-1);
>   }
>   }
> 

ok mlarkin@ if you are still looking to do this.
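
For readers outside OpenBSD, the range check the diff relies on can be
sketched portably as below. strtonum(3) itself is the OpenBSD libc function
used in the diff; parse_count() is a hypothetical stand-in built on strtoll(3)
so that the example compiles elsewhere. It is assumed to report the same
"invalid", "too small" and "too large" strings.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a decimal number in [minval, maxval].  On failure return 0 and
 * point *errstr at a short reason; on success *errstr is NULL.
 */
long long
parse_count(const char *s, long long minval, long long maxval,
    const char **errstr)
{
	char *ep;
	long long v;

	*errstr = NULL;
	errno = 0;
	v = strtoll(s, &ep, 10);
	if (s == ep || *ep != '\0')
		*errstr = "invalid";
	else if ((v == LLONG_MIN && errno == ERANGE) || v < minval)
		*errstr = "too small";
	else if ((v == LLONG_MAX && errno == ERANGE) || v > maxval)
		*errstr = "too large";
	return (*errstr == NULL ? v : 0);
}
```

With a minimum of 1, as in the diff, both "0" and "-1" come back as
"too small", which is what the "count is %s: %s" message then prints.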



Re: [PATCH: 0/3] MMIO handler in vmm(4)

2019-10-25 Thread Mike Larkin
On Thu, Oct 24, 2019 at 08:54:16AM +0900, Iori YONEJI wrote:
> Hello tech@,
> 
> I am introducing an MMIO handler in vmm(4) as an attempt to build
> scaffolding for a full-featured MMIO handler, which requires more effort.
> I will send a few emails with patches after this one.  First of all, here
> is the overview.
> 
> As I mentioned in my earlier emails asking about vmclear() in
> vmm(4), I have a problem booting NetBSD on vmd(8). In short, the first
> of the reasons was that NetBSD tries to probe the TPM module using MMIO,
> which is not implemented, so I decided to work on adding an MMIO handler
> to vmm(4).
> 
> I also hope to discuss how the "Emulation" and "Device registration"
> parts should be done in the future, because I have currently implemented
> only the minimum needed to make it work. I am hesitant to post the patch
> in such an immature state, as described in the discussion notes, but
> otherwise the patch might become enormous. If it would have been better
> to post after finishing more of it, please let me know what must be done.
> 
> 

Thanks for this set of diffs. I'll try to take a look this weekend.

-ml

> 
> 1. Current and patch status
> What happens if an MMIO access is issued now?
> The current vmm(4) handles the fault by allocating a new uvm page if the
> memory region has been reserved by vmd(8) and then replaying the same
> instruction, but vmm(4) prohibits reserving the MMIO region. This is a
> sane validation because, with MMIO, every access must fault so that it
> can be handled by the hypervisor. The problem is that the current
> implementation handles all faults as "usual" memory accesses
> (VMM_MEM_TYPE_REGULAR). I added a new type, VMM_MEM_TYPE_MMIO; for
> this type, it emulates the access as if the instruction had
> succeeded and then restarts the guest from the next instruction.
> 
> 2. Extract the instruction
> To emulate the instruction, the hypervisor must read the instruction
> bytes from guest memory. Thanks to past work, there is already a
> function to retrieve the guest physical address, so they can be read. On
> AMD CPUs, the VMCB provides as many of the fetched instruction bytes as
> possible, so the work is much easier (and faster; it doesn't have to walk
> the page tables). "As many as possible" means it may fail. However, it
> returns fewer bytes "only if a fault occurs while fetching", which is
> not our case.
> 
> 3. Emulation
> On AMD CPUs, we don't know the instruction length, so instruction
> prefixes must be parsed. On Intel CPUs, the VMCS tells us the instruction
> length, but the desired operation is still unknown. I wrote a very
> minimal x86 instruction parser for this purpose, assuming that the
> machine is in 32- or 64-bit mode and that the memory address involved is
> known. If we eventually need a mature instruction parser, it may be worth
> using the bhyve instruction parser (sys/amd64/vmm/vmm_instruction_emul.c
> in FreeBSD).
> Discussion: I have only a very poor implementation of this part now.
> What should be done next? Which instructions must be handled? Or is there
> a good instruction parser available in our code? Mike Larkin told me in
> the previous discussion that I have to implement string instructions,
> but I believe there are more (thank you for the advice, but I still
> failed to reproduce the fault with a string instruction, sorry).
> 
> 4. Device registration
> I have implemented nothing yet. We don't offer a TPM to guest VMs,
> so I just check the capabilities address and write zero to the
> specified register (zero means "not available").
> Discussion: What devices will be available in the future? If we offer
> VGA or virtio-1.0 devices, they should be handled by vmd(8), and I would
> have to implement a path to tell which devices are available.
> 
> 5. Does it work?
> Partially. My patch fixes the TPM probing problem and the NetBSD kernel
> can boot now, but it cannot determine the boot device. I suspect this
> issue is related to another problem because I don't observe any other
> MMIO accesses.
> Below are the truncated guest log and the related dmesg (VMM_DEBUG=1).
> 
> % doas vmctl -v start -c -L -i 1 -r boot-com.iso netbsd
> (snip)
> vioscsi0 at virtio2: Features: 0x0
> vioscsi0: cmd_per_lun 1 qsize 128 seg_max 17 max_target 1 max_lun 1
> virtio2: interrupting at irq 6
> scsibus0 at vioscsi0: 1 target, 1 lun per target
> vendor 0b5d product 0777 (miscellaneous communications) at pci0 dev 4
> function 0 not configured
> isa0 at mainbus0
> com0 at isa0 port 0x3f8-0x3ff irq 4: ns8250 or ns16450, no fifo
> com0: console
> attimer0 at isa0 port 0x40-0x43
> pad0: outputs: 44100Hz, 16-bit, stereo
> audio0 at pad0: half duplex, playback, capture, mmap
> pad0: Virtual format con

Re: vmd: static address for local interfaces, fix static tapX names

2019-10-25 Thread Mike Larkin
On Fri, Oct 25, 2019 at 07:47:35PM +, Reyk Floeter wrote:
> On Fri, Oct 25, 2019 at 12:27:25PM -0700, Mike Larkin wrote:
> > On Fri, Oct 25, 2019 at 06:15:59PM +, Reyk Floeter wrote:
> > > Hi,
> > > 
> > > the attached diff is rather large and implements two things for vmd:
> > > 
> > > > 1) Allow configuring static IP address/gateway pairs for local interfaces.
> > > 2) Skip statically configured interface names (eg. tap0) when
> > >   allocating dynamic interfaces.
> > > 
> > > Example:
> > > ---snip---
> > > > vm "foo" {
> > > > 	disable
> > > > 	local interface "tap0" {
> > > > 		address 192.168.0.10/24 192.168.0.1
> > > > 	}
> > > > 	local interface "tap1"
> > > > 	disk "/home/vm/foo.qcow2"
> > > > }
> > > > 
> > > > vm "bar" {
> > > > 	local interface
> > > > 	disk "/home/vm/bar.qcow2"
> > > > }
> > > ---snap---
> > > 
> > > 
> > > 1) The VM "foo" has two interfaces: The first interface has a fixed
> > > IPv4 address with 192.168.0.1/24 on the gateway and 192.168.0.10/24 on
> > > the VM.  192.168.0.10/24 is assigned to the VM's first NIC via the
> > > built-in DHCP server.  The second VM gets a default 100.64.x.x/31 IP.
> > 
> > I'm not sure the above description matches what I'm seeing in the vm.conf
> > snippet above.
> > 
> > What's "the gateway" here? Is this the host machine, or the actual
> > gateway, perhaps on some other machine? Does this just allow me to specify
> > > the host-side tap(4) IP address for a corresponding given VM vio(4) interface?
> > 
> 
> Ah, OK.  I used the terms without explaining them:
> 
> With local interfaces, vmd(8) uses two IPs per interface: one for the
> tap(4) on the host, one for the vio(4) on the VM.  It configures the
> first one on the host and provides the second one via DHCP.  The IP on
> the host is the default "gateway" router for the VM.
> 

Ah, I missed the fact that these are not "-i" style interfaces but rather
*local* interfaces (eg, "-L" style). 
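
To spell out the host/VM split for such a local interface, here is a minimal
sketch of how a /31 pair divides into the two roles described in the quoted
text: the first address is the host-side tap(4) "gateway", the second is the
one handed to the VM via DHCP. struct local_pair and local_pair_from_base() are
hypothetical names, addresses are in host byte order, and how vmd actually
derives the 100.64.x.x base for a given VM and interface is not shown.

```c
#include <stdint.h>

/* Hypothetical pair of addresses for one local interface. */
struct local_pair {
	uint32_t gw;	/* host-side tap(4) address, the VM's gateway */
	uint32_t vm;	/* address offered to the VM's vio(4) via DHCP */
};

/*
 * Split a /31 into its two addresses.  'base' must be the even
 * (first) address of the /31; returns -1 if it is not aligned.
 */
int
local_pair_from_base(uint32_t base, struct local_pair *lp)
{
	if (base & 1)
		return (-1);
	lp->gw = base;		/* first IP in the subnet: gateway */
	lp->vm = base | 1;	/* second IP in the subnet: VM */
	return (0);
}
```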

> The address syntax is currently reversed:
>   address "address/prefix" "gateway"
> Maybe I should change it to
>   address "gateway" "address/prefix"
> or
>   address "address/prefix" gateway "gateway"

I like the last one, but I probably won't be a heavy user of this ...

-ml

> 
> I also wonder if we could technically use a non-local IP address for
> the gateway.  I currently enforce that the prefix matches, but I don't
> enforce that both addresses are in the same subnet.
> 
> When using the default auto-generated 100.64.0.0/31 method, it uses
> the first IP in the subnet as the gateway and the second IP for the
> VM.
> 
> > And did you mean "The second interface" there instead of the "The second VM"?
> > (Although I think the description fits for "The second VM" also...)
> > 
> 
> Yes, both, the second interface is correct as well.
> 
> > I think the idea is sound. As long as we don't end up adding extra command
> > line args to vmctl to manually configure this, which it doesn't appear we are
> > doing here. :)
> > 
> 
> I don't want to add it to vmctl either.
> 
> > I didn't read the diff in great detail, I'll wait until you say you have a
> > final version.
> > 
> 
> OK, thanks.
> 
> Reyk
> 
> > -ml
> > 
> > >   This idea came up when I talked with Mischa at EuroBSDCon about
> > > OpenBSDAms: instead of using L2 and external static dhcpd for all VMs,
> > > it could be a solution to use L3 and to avoid bridge(4) and dhcpd(8).
> > > But it would need a way to serve static IPs via the internal dhcp
> > > server.  Using L3 with vmd is better for performance, routing, PF,
> > > etc., but has the drawback that it wastes a subnet and gateway IP per
> > > VM (maybe rdomains or other tricks could help here, but this is a
> > > problem for later).
> > > 
> > > 2) The VM "foo" uses two static interface names, tap0 and tap1, and
> > > the VM "bar" uses a dynamic interface name (tapX).  Without this diff,
> > > vmd would most certainly use tap0 for bar's interface because foo is
> > > disabled and not started before bar.  With the diff, the first
> > > interface of bar will be tap2 or 

Re: vmd: static address for local interfaces, fix static tapX names

2019-10-25 Thread Mike Larkin
On Fri, Oct 25, 2019 at 06:15:59PM +, Reyk Floeter wrote:
> Hi,
> 
> the attached diff is rather large and implements two things for vmd:
> 
> 1) Allow configuring static IP address/gateway pairs for local interfaces.
> 2) Skip statically configured interface names (eg. tap0) when
>   allocating dynamic interfaces.
> 
> Example:
> ---snip---
> vm "foo" {
> 	disable
> 	local interface "tap0" {
> 		address 192.168.0.10/24 192.168.0.1
> 	}
> 	local interface "tap1"
> 	disk "/home/vm/foo.qcow2"
> }
> 
> vm "bar" {
> 	local interface
> 	disk "/home/vm/bar.qcow2"
> }
> ---snap---
> 
> 
> 1) The VM "foo" has two interfaces: The first interface has a fixed
> IPv4 address with 192.168.0.1/24 on the gateway and 192.168.0.10/24 on
> the VM.  192.168.0.10/24 is assigned to the VM's first NIC via the
> built-in DHCP server.  The second VM gets a default 100.64.x.x/31 IP.

I'm not sure the above description matches what I'm seeing in the vm.conf
snippet above.

What's "the gateway" here? Is this the host machine, or the actual
gateway, perhaps on some other machine? Does this just allow me to specify
the host-side tap(4) IP address for a corresponding given VM vio(4) interface?

And did you mean "The second interface" there instead of the "The second VM"?
(Although I think the description fits for "The second VM" also...)

I think the idea is sound. As long as we don't end up adding extra command
line args to vmctl to manually configure this, which it doesn't appear we are
doing here. :)

I didn't read the diff in great detail, I'll wait until you say you have a
final version.

-ml

>   This idea came up when I talked with Mischa at EuroBSDCon about
> OpenBSDAms: instead of using L2 and external static dhcpd for all VMs,
> it could be a solution to use L3 and to avoid bridge(4) and dhcpd(8).
> But it would need a way to serve static IPs via the internal dhcp
> server.  Using L3 with vmd is better for performance, routing, PF,
> etc., but has the drawback that it wastes a subnet and gateway IP per
> VM (maybe rdomains or other tricks could help here, but this is a
> problem for later).
> 
> 2) The VM "foo" uses two static interface names, tap0 and tap1, and
> the VM "bar" uses a dynamic interface name (tapX).  Without this diff,
> vmd would most certainly use tap0 for bar's interface because foo is
> disabled and not started before bar.  With the diff, the first
> interface of bar will be tap2 or higher.
>   The problem was just reported by kn@.  I mixed both things into
> one diff because I was working on 1) when kn@ reported it.  There are
> other ways to implement 2) but solving both issues in a similar way
> made more sense.
> 
> This is not the final diff.  I still have to clean it up, get
> feedback, think a little bit about it, and split it into smaller parts
> for review.  I wanted to share the big picture.
> 
> As a side note, I implemented the lookup with sorted tables because it
> is the most efficient way to do it, but maybe a simple linear lookup
> (iterating over all the VMs and all the interfaces all the time) would
> be good enough.  But the current approach has benefits - if I did it
> right ;)
> 
> Thoughts?
> 
> Reyk
> 
> Index: usr.sbin/vmd/dhcp.c
> ===
> RCS file: /cvs/src/usr.sbin/vmd/dhcp.c,v
> retrieving revision 1.8
> diff -u -p -u -p -r1.8 dhcp.c
> --- usr.sbin/vmd/dhcp.c   27 Dec 2018 19:51:30 -  1.8
> +++ usr.sbin/vmd/dhcp.c   25 Oct 2019 18:11:05 -
> @@ -119,8 +119,7 @@ dhcp_request(struct vionet_dev *dev, cha
>   }
>  
>   if ((client_addr.s_addr =
> - vm_priv_addr(>vmd_cfg,
> - dev->vm_vmid, dev->idx, 1)) == 0)
> + vm_priv_addr(>vmd_cfg, dev->vm_vmid, dev->idx, 1, )) == 0)
>   return (-1);
>   memcpy(, _addr,
>   sizeof(client_addr));
> @@ -129,7 +128,7 @@ dhcp_request(struct vionet_dev *dev, cha
>   ss2sin(_dst)->sin_port = htons(CLIENT_PORT);
>  
>   if ((server_addr.s_addr = vm_priv_addr(>vmd_cfg, dev->vm_vmid,
> - dev->idx, 0)) == 0)
> + dev->idx, 0, )) == 0)
>   return (-1);
>   memcpy(, _addr, sizeof(server_addr));
>   memcpy((_src)->sin_addr, _addr,
> Index: usr.sbin/vmd/parse.y
> ===
> RCS file: /cvs/src/usr.sbin/vmd/parse.y,v
> retrieving revision 1.52
> diff -u -p -u -p -r1.52 parse.y
> --- usr.sbin/vmd/parse.y  14 May 2019 06:05:45 -  1.52
> +++ usr.sbin/vmd/parse.y  25 Oct 2019 18:11:05 -
> @@ -120,9 +120,9 @@ typedef struct {
>  
>  
>  %token   INCLUDE ERROR
> -%token   ADD ALLOW BOOT CDROM DEVICE DISABLE DISK DOWN ENABLE FORMAT GROUP
> -%token   INET6 INSTANCE INTERFACE LLADDR LOCAL LOCKED MEMORY NET NIFS OWNER
> -%token   PATH PREFIX RDOMAIN SIZE SOCKET SWITCH UP VM VMID
> +%token   ADD ADDRESS ALLOW BOOT CDROM 

Re: vmm(4) question: unneeded vmclear() in vcpu_readregs_vmx()?

2019-10-22 Thread Mike Larkin
On Tue, Oct 22, 2019 at 06:57:32PM +0900, Iori YONEJI wrote:
> On Tue, Oct 22, 2019 at 11:17 AM Mike Larkin  wrote:
> >
> > On Mon, Oct 21, 2019 at 03:52:52AM +0900, Iori YONEJI wrote:
> > > Hello tech@,
> > >
> > > I have a question (or maybe a suggestion) about vmm(4).
> > >
> > > I'm writing a small additional feature to sys/arch/amd64/amd64/vmm.c
> > > and found a seemingly unneeded vmclear() at the end of
> > > vcpu_readregs_vmx(). This function didn't seem to affect VM state at
> > > first glance because, obviously, its name is _read_. Against this
> > > intuition, this function clears VM state, which makes the vmexit
> > > handling stop (eg. "advance rip" failure) and the VM die if it is used
> > > in there. Possibly this was added as a counterpart of
> > > vcpu_reload_vmcs_vmx() at the very beginning of the read function, but
> > > > I don't think it undoes the reload.
> > >
> > > This vmclear can be removed in my understanding and also I confirmed
> > > that Alpine Linux and OpenBSD guest VMs can run on the vmm(4) without
> > > this vmclear call.
> > >
> > > Or I am just wrong and it may be actually needed in some corner cases.
> > > If so, I will restore (vmptrld) vcpu->vc_control_pa every time the VM
> > > context continues after calling readregs function or add another flag
> > > to indicate whether vmclear is needed here if all of the corner cases
> > > are known.
> > >
> > > > Would you mind telling me whether I am correct, or how this should
> > > > be dealt with?
> > >
> > >
> > > Index: sys/arch/amd64/amd64/vmm.c
> > > ===
> > > RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> > > retrieving revision 1.254
> > > diff -u -p -r1.254 vmm.c
> > > --- sys/arch/amd64/amd64/vmm.c22 Sep 2019 08:47:54 -1.254
> > > +++ sys/arch/amd64/amd64/vmm.c20 Oct 2019 18:04:17 -
> > > @@ -1602,8 +1602,10 @@ vcpu_readregs_vmx(struct vcpu *vcpu, uin
> > >  errout:
> > >  ret = EINVAL;
> > >  out:
> > > +/* XXX Why do we need vmclear here?
> > > >  if (vmclear(&vcpu->vc_control_pa))
> > >  ret = EINVAL;
> > > +*/
> > >  return (ret);
> > >  }
> > >
> >
> > The assumption in vcpu_run_vmx is that we will always come in with a cleared
> > VMCS state, unless we are staying in the run loop (the "resume" flag in that
> > function tracks this). So any ioctl (including things like read/write regs)
> > that might possibly manipulate the VMCS needs to clear the VMCS before it
> > returns to user space. That was the original reason for that clear at the end
> > of the function.
> >
> > If you are adding new functionality that uses read/write regs internally, 
> > then
> > you're right; on return, you'll have a cleared VMCS and subsequent vmreads 
> > and
> > vmwrites will fail.
> >
> > It looks like there are a couple of other users of readregs/writeregs
> > internally - the presently unused GVA translator function as well as the 
> > in/out
> > exit handler structure preparation code. Both of those look like they got 
> > lucky
> > in how they were handling things; they aren't touching the VMCS after 
> > read/write
> > regs or we would have seen them explode like you're saying.
> >
> > It looks like the start of vcpu_run_vmx *almost* properly handles coming
> > in with an arbitrary  VMCS. The "almost" part is the bit under #ifdef 
> > VMM_DEBUG
> > that handles triple faults. It seems to assume that the proper VMCS is 
> > currently
> > loaded, which is certainly not true all the time. I'll fix that part.
> >
> > Based on the fact that the other code paths look clean, I think we can 
> > remove
> > the vmclear from the end of read/write regs, as you suggest. I'll do that as
> > well.
> >
> > Might I ask what you're working on? Just in case I know of someone else 
> > working
> > on the same thing? (There are about a dozen vmm side-projects going on that 
> > I'm
> > aware of and it would be a waste for two people to be working on the same 
> > thing
> > without knowing of the other). Feel free to mail me off-list if you don't 
> > want
> > to share in public.
> >
> > -ml
> >
> Thank you. I didn't notice that the VMCS must be cleared before
> returning to user space.
> 
> Ta

Re: vmm(4) question: unneeded vmclear() in vcpu_readregs_vmx()?

2019-10-21 Thread Mike Larkin
On Mon, Oct 21, 2019 at 03:52:52AM +0900, Iori YONEJI wrote:
> Hello tech@,
> 
> I have a question (or maybe a suggestion) about vmm(4).
> 
> I'm writing a small additional feature to sys/arch/amd64/amd64/vmm.c
> and found a seemingly unneeded vmclear() at the end of
> vcpu_readregs_vmx(). This function didn't seem to affect VM state at
> first glance because, obviously, its name is _read_. Against this
> intuition, this function clears VM state, which makes the vmexit
> handling stop (eg. "advance rip" failure) and the VM die if it is used
> in there. Possibly this was added as a counterpart of
> vcpu_reload_vmcs_vmx() at the very beginning of the read function, but
> I don't think it does undo the reload.
> 
> This vmclear can be removed in my understanding and also I confirmed
> that Alpine Linux and OpenBSD guest VMs can run on the vmm(4) without
> this vmclear call.
> 
> Or I am just wrong and it may be actually needed in some corner cases.
> If so, I will restore (vmptrld) vcpu->vc_control_pa every time the VM
> context continues after calling readregs function or add another flag
> to indicate whether vmclear is needed here if all of the corner cases
> are known.
> 
> Would you mind telling me whether I am correct, or how this should be
> dealt with?
> 
> 
> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.254
> diff -u -p -r1.254 vmm.c
> --- sys/arch/amd64/amd64/vmm.c  22 Sep 2019 08:47:54 -0000  1.254
> +++ sys/arch/amd64/amd64/vmm.c  20 Oct 2019 18:04:17 -0000
> @@ -1602,8 +1602,10 @@ vcpu_readregs_vmx(struct vcpu *vcpu, uin
>  errout:
>  ret = EINVAL;
>  out:
> +/* XXX Why do we need vmclear here?
>  if (vmclear(&vcpu->vc_control_pa))
>  ret = EINVAL;
> +*/
>  return (ret);
>  }
> 

The assumption in vcpu_run_vmx is that we will always come in with a cleared
VMCS state, unless we are staying in the run loop (the "resume" flag in that
function tracks this). So any ioctl (including things like read/write regs)
that might possibly manipulate the VMCS needs to clear the VMCS before it
returns to user space. That was the original reason for that clear at the end
of the function.

If you are adding new functionality that uses read/write regs internally, then
you're right; on return, you'll have a cleared VMCS and subsequent vmreads and
vmwrites will fail.
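
The failure mode described above can be modelled with a small toy in C. This
is only a sketch under stated assumptions: the toy_* names, the struct and
the single "loaded" flag are hypothetical and are not the vmm(4) code; they
only mimic the rule that a read fails unless a VMCS is currently loaded.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a vcpu; none of these names come from vmm(4). */
struct toy_vcpu {
	bool	 vmcs_loaded;	/* true between toy_vmptrld() and toy_vmclear() */
	uint64_t guest_rip;	/* one pretend VMCS field */
};

static void
toy_vmptrld(struct toy_vcpu *v)
{
	v->vmcs_loaded = true;
}

static void
toy_vmclear(struct toy_vcpu *v)
{
	v->vmcs_loaded = false;
}

/* Like a vmread: fails unless a VMCS is currently loaded. */
static int
toy_vmread(const struct toy_vcpu *v, uint64_t *val)
{
	if (!v->vmcs_loaded)
		return (EINVAL);
	*val = v->guest_rip;
	return (0);
}

/*
 * ioctl-style register read: load, read, and optionally clear before
 * returning. clear_on_exit == true models the old behaviour.
 */
static int
toy_readregs(struct toy_vcpu *v, uint64_t *rip, bool clear_on_exit)
{
	int ret;

	toy_vmptrld(v);
	ret = toy_vmread(v, rip);
	if (clear_on_exit)
		toy_vmclear(v);
	return (ret);
}

/* Exit-handler-style caller: read regs, then touch the VMCS again. */
static int
toy_exit_handler(struct toy_vcpu *v, bool clear_on_exit)
{
	uint64_t rip;

	if (toy_readregs(v, &rip, clear_on_exit))
		return (EINVAL);
	/* This second read fails if toy_readregs() cleared the VMCS. */
	return (toy_vmread(v, &rip));
}
```

With clear_on_exit set, toy_exit_handler() returns EINVAL, which corresponds
to the "advance rip" style failure reported in the question; without it the
second read succeeds.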

It looks like there are a couple of other users of readregs/writeregs
internally - the presently unused GVA translator function as well as the in/out
exit handler structure preparation code. Both of those look like they got lucky
in how they were handling things; they aren't touching the VMCS after read/write
regs or we would have seen them explode like you're saying.

It looks like the start of vcpu_run_vmx *almost* properly handles coming
in with an arbitrary VMCS. The "almost" part is the bit under #ifdef VMM_DEBUG
that handles triple faults. It seems to assume that the proper VMCS is currently
loaded, which is certainly not true all the time. I'll fix that part.

Based on the fact that the other code paths look clean, I think we can remove
the vmclear from the end of read/write regs, as you suggest. I'll do that as
well.

Might I ask what you're working on? Just in case I know of someone else working
on the same thing? (There are about a dozen vmm side-projects going on that I'm
aware of and it would be a waste for two people to be working on the same thing
without knowing of the other). Feel free to mail me off-list if you don't want
to share in public.

-ml



Re: net80211: fix discarded input control frame count

2019-10-06 Thread Mike Larkin
On Sun, Oct 06, 2019 at 12:04:49PM +0200, Stefan Sperling wrote:
> The net80211 stack currently displays every received control frame
> as "discarded input control packet" in netstat(1).
>
> We do in fact process "power saving poll" and "block ack request" frames.
> Such frames should not be counted as discarded.
>
> ok?
>

Reads ok to me, ok mlarkin if you still are looking for oks.

-ml

> (diff uses 5 context lines for easier review)
>
> diff refs/heads/master refs/heads/rxctl
> blob - 6e192cfb3abd2224f7c1ae81c4fa43dbab8e9cb7
> blob + 3f38859dde5810fd0a4bef6781d0a6c4b2e60b78
> --- sys/net80211/ieee80211_input.c
> +++ sys/net80211/ieee80211_input.c
> @@ -533,21 +533,21 @@ ieee80211_inputm(struct ifnet *ifp, struct mbuf *m, st
>   (*ic->ic_recv_mgmt)(ic, m, ni, rxi, subtype);
>   m_freem(m);
>   return;
>
>   case IEEE80211_FC0_TYPE_CTL:
> - ic->ic_stats.is_rx_ctl++;
>   switch (subtype) {
>  #ifndef IEEE80211_STA_ONLY
>   case IEEE80211_FC0_SUBTYPE_PS_POLL:
>   ieee80211_recv_pspoll(ic, m, ni);
>   break;
>  #endif
>   case IEEE80211_FC0_SUBTYPE_BAR:
>   ieee80211_recv_bar(ic, m, ni);
>   break;
>   default:
> + ic->ic_stats.is_rx_ctl++;
>   break;
>   }
>   goto out;
>
>   default:
>



Re: pretty borders for slitherins

2019-09-25 Thread Mike Larkin
On Wed, Sep 25, 2019 at 10:53:21AM -0600, Theo de Raadt wrote:
> Ted Unangst wrote:
>
> > Scott Cheloha wrote:
> > > On Mon, Sep 23, 2019 at 06:23:32PM -0400, Ted Unangst wrote:
> > > > snake and worm draw boxes, but they can be prettier by using the default
> > > > style, which will use line drawing instead of ugly -*| characters.
> > > >
> > > > should do the right thing on a vt100, but only tested in xterm.
> > >
> > > It looks much prettier in an xterm but in the system console I'm
> > > getting question marks.  Is that a limitation of the terminal or
> > > am I missing the glyphs?  Dunno if this is the expected behavior
> > > but it is uglier than what we had before.
> >
> > ah, that's disappointing. my understanding is it should revert to ascii, but
> > there seems to be a flaw, either in curses or my understanding. curses!
>
> If I recall, curses needs a little bit more initialization if you want
> it to do the right thing
>
> this trap was set because we don't have a curses maintainer, but look
> now we do, since it is mostly the games who need this, maybe mlarkin
> can help you
>

How did I get pulled into this? :)



Re: vmd(8): fix memory leak in virtio network TX path

2019-09-24 Thread Mike Larkin
On Mon, Sep 23, 2019 at 08:44:01AM +0200, Theo Buehler wrote:
> On Sun, Sep 22, 2019 at 02:42:28AM -0700, Mike Larkin wrote:
> > We allocate a 'pkt' for each network packet in the queue, but only were
> > freeing the last one. This has always been a bug, but it looks like recent
> > changes elsewhere in the network stack may have made the problem more apparent
> > since more packets seem to be deposited in the queue for each TX operation
> > now.
> >
> > This was reported back in July by Paul de Weerd, who told me this could be
> > reproduced by doing iperf tests between a VM and the host. He and I have
> > tested that this diff seems to resolve the problem.
> >
> > There is a similar leak with the built-in vmd(8) DHCP service's packets.
> > Both are fixed below. Each pointer is reset to NULL so that the final
> > free at the end of the function doesn't double free the last buffer(s).
>
> That doesn't look right as dhcppkt will be used after the while loop:
>
> 1542 if (dhcpsz > 0) {
> 1543 if (vionet_enq_rx(dev, dhcppkt, dhcpsz, ))
> 1544 ret = 1;
> 1545 }
>
> I've tried this diff with a local interface and indeed it broke dhcp.
>
> Note that dhcppkt is only allocated once as afterwards dhcpsz > 0 will
> prevent further calls to dhcp_request(), so I'm not sure we should touch
> dhcppkt at all.
>
> An alternative to the diff below would be to free and null out pkt and
> dhcppkt at the beginning of the while loop and setting pktsz and dhcpsz
> to 0. Both approaches work for me, but I don't know which is
> better/correct.
>

Good catch, I'll revert that part.

Thanks.

-ml

> Index: virtio.c
> ===
> RCS file: /var/cvs/src/usr.sbin/vmd/virtio.c,v
> retrieving revision 1.77
> diff -u -p -r1.77 virtio.c
> --- virtio.c  22 Jan 2019 10:12:36 -0000  1.77
> +++ virtio.c  23 Sep 2019 06:28:43 -0000
> @@ -1533,6 +1533,9 @@ vionet_notify_tx(struct vionet_dev *dev)
>   num_enq++;
>
>   idx = dev->vq[TXQ].last_avail & VIONET_QUEUE_MASK;
> +
> + free(pkt);
> + pkt = NULL;
>   }
>
>   if (write_mem(q_gpa, vr, vr_sz)) {
>



vmd(8): fix memory leak in virtio network TX path

2019-09-22 Thread Mike Larkin
We allocate a 'pkt' for each network packet in the queue, but only were
freeing the last one. This has always been a bug, but it looks like recent
changes elsewhere in the network stack may have made the problem more apparent
since more packets seem to be deposited in the queue for each TX operation
now.

This was reported back in July by Paul de Weerd, who told me this could be
reproduced by doing iperf tests between a VM and the host. He and I have
tested that this diff seems to resolve the problem.

There is a similar leak with the built-in vmd(8) DHCP service's packets.
Both are fixed below. Each pointer is reset to NULL so that the final
free at the end of the function doesn't double free the last buffer(s).
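
The per-iteration free pattern described above, sketched in isolation. This
is a minimal illustration, not the vmd(8) code: the toy_* helpers, the
counters and the fail_at parameter are hypothetical and exist only so that
the allocation balance can be checked. Only the per-packet buffer is shown.

```c
#include <stdlib.h>

/* Hypothetical counters so the sketch can be checked; not part of vmd(8). */
static int toy_allocs, toy_frees;

static char *
toy_alloc_pkt(size_t sz)
{
	char *p = calloc(1, sz);

	if (p != NULL)
		toy_allocs++;
	return (p);
}

static void
toy_free_pkt(char *p)
{
	if (p != NULL)
		toy_frees++;
	free(p);	/* free(NULL) is a no-op */
}

/*
 * Process npkts queued packets. fail_at simulates an error while handling
 * packet number fail_at (use -1 for no error). Returns 0 on success, 1 on
 * failure.
 */
static int
toy_notify_tx(int npkts, int fail_at)
{
	char *pkt = NULL;
	int i, ret = 0;

	for (i = 0; i < npkts; i++) {
		pkt = toy_alloc_pkt(64);
		if (pkt == NULL || i == fail_at) {
			ret = 1;
			goto out;
		}

		/* ... hand the packet to the backend here ... */

		/* Free each buffer once consumed and forget the pointer. */
		toy_free_pkt(pkt);
		pkt = NULL;
	}
out:
	/* pkt is NULL here unless an iteration bailed out early. */
	toy_free_pkt(pkt);
	return (ret);
}
```

Resetting the pointer to NULL after each free is what keeps the final free
at the "out" label from releasing the same buffer twice.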

ok?

-ml

Index: virtio.c
===
RCS file: /cvs/src/usr.sbin/vmd/virtio.c,v
retrieving revision 1.77
diff -u -p -a -u -r1.77 virtio.c
--- virtio.c  22 Jan 2019 10:12:36 -0000  1.77
+++ virtio.c  22 Sep 2019 07:45:08 -0000
@@ -1533,6 +1533,12 @@ vionet_notify_tx(struct vionet_dev *dev)
num_enq++;

idx = dev->vq[TXQ].last_avail & VIONET_QUEUE_MASK;
+
+   free(pkt);
+   pkt = NULL;
+
+   free(dhcppkt);
+   dhcppkt = NULL;
}

if (write_mem(q_gpa, vr, vr_sz)) {



Re: EFI frame buffer > 4GB

2019-09-20 Thread Mike Larkin
On Fri, Sep 20, 2019 at 03:35:00PM +0200, Mark Kettenis wrote:
> > Date: Fri, 20 Sep 2019 06:06:40 -0700
> > From: Mike Larkin 
> >
> > On Fri, Sep 20, 2019 at 02:22:13PM +0200, Mark Kettenis wrote:
> > > > Date: Fri, 20 Sep 2019 02:55:27 -0700
> > > > From: Mike Larkin 
> > > >
> > > > On Fri, Sep 20, 2019 at 01:09:56AM +0900, YASUOKA Masahiko wrote:
> > > > > Hi,
> > > > >
> > > > > I recently got a VAIO Pro PK.  The diff below is required to boot.
> > > > > Without the diff, it freezes during boot.
> > > > >
> > > >
> > > > > Its EFI framebuffer is located at 0x4000000000 (9 zeros).  This is > 4GB
> > > > > and higher than the highest available memory of the machine.  These
> > > > > configurations seem to cause the problem.
> > > > >
> > > > > * * *
> > > > >
> > > > > Call cninit() after pmap_bootstrap() is called.  Since the EFI
> > > > > framebuffer may be located > 4GB which is not initialized by locore,
> > > > > but by pmap_bootstrap().  Also make the address parameter passed to
> > > > > pmap_bootstrap() cover the framebuffer.  Actually VAIO pro PK's
> > > > > framebuffer is located higher than the highest available memory
> > > > > region.
> > > > >
> > > > > ok? comments?
> > > > >
> > > >
> > > > Hi,
> > > >
> > > >  I have a few questions...
> > > >
> > > > 1. There seems to be no limit on the max PA that we extend to here.
> > > >This means, for example, if EFI places the framebuffer past 2TB
> > > >PA, we won't have enough direct map to cover the mapping. Plus
> > > >I think this will end up extending the direct map to cover any hole
> > > >between "end of phys mem" and "efi fb addr". At a minimum, I think
> > > >we need some sort of max PA clamp here. I don't know what Sony's
> > > > >placement algorithm is, but 0x4000000000 is 256GB PA.
> > >
> > > A dmesg and pcidump output would be useful.
> > >
> > > I suspect that this is a discrete graphics card where the EFI frame
> > > buffer resides in VRAM.  Using the direct map in this case is probably
> > > not the right thing to do.
> > >
> > > > 2. What does delaying cninit do for machines that have errors or
> > > >printfs before this? Would those even print anymore? This would
> > > >affect all machines, even those without efifb, correct?
> > >
> > > Yes and no.  It doesn't affect the classic VGA glass console, but it
> > > does mean serial output might disappear.  That isn't acceptable I'd
> > > say.
> > >
> > > > 3. I am not a big fan of placing device-specific quirks in
> > > >init_x86_64. Could this not be done in the efifb specific console
> > > >init code? You could pmap_enter whatever you wanted there, based on
> > > >the PA EFI sent you. Or does efifb go through the direct map for
> > > >all video output? If so, we may be stuck creating that big direct map
> > > >range. If that's the case though, we should probably try to restrict
> > > >the permissions in the unused holes.
> > >
> > > The direct map is only used early on in the boot process.  The frame
> > > buffer is remapped in mainbus_attach() such that we can use
> > > write-combining.  But that is done after we print copyright.  I think
> > > the remapping could be done a bit earlier, but not before uvm gets
> > > initialized, which happens after we print the copyright message.
> > >
> > > We don't have to use the direct map during early boot.  If you gave us
> > > some other way to map the framebuffer before pmap_bootstrap() has been
> > > called we could stick that into efifb_cnattach_common().  We'd need
> > > your help with that though.  Note that the framebuffer can be fairly
> > > large though (but we can probably come up with a reasonable upper
> > > limit).
> > >
> >
> > What sort of function do you need? Map this PA range at X, but before
> > pmap_bootstrap?
>
> Map this PA range and hand me back the VA where you mapped it, indeed
> before pmap_bootstrap().
>
> I reckon you'd need to reserve slots in the early page tables for
> this.  The mapping needs to be uncached (UC or WC).  I suspect we'll
> continue to remap the frame buffer later, UC is fine as we're not
> going to produce massive amounts of output to the console before doing
> so.

Likely doable, but it will be some time before I can get to it.

-ml



Re: EFI frame buffer > 4GB

2019-09-20 Thread Mike Larkin
On Fri, Sep 20, 2019 at 02:22:13PM +0200, Mark Kettenis wrote:
> > Date: Fri, 20 Sep 2019 02:55:27 -0700
> > From: Mike Larkin 
> >
> > On Fri, Sep 20, 2019 at 01:09:56AM +0900, YASUOKA Masahiko wrote:
> > > Hi,
> > >
> > > I recently got a VAIO Pro PK.  The diff below is required to boot.
> > > Without the diff, it freezes during boot.
> > >
> >
> > > Its EFI framebuffer is located at 0x4000000000 (9 zeros).  This is > 4GB
> > > and higher than the highest available memory of the machine.  These
> > > configurations seem to cause the problem.
> > >
> > > * * *
> > >
> > > Call cninit() after pmap_bootstrap() is called.  Since the EFI
> > > framebuffer may be located > 4GB which is not initialized by locore,
> > > but by pmap_bootstrap().  Also make the address parameter passed to
> > > pmap_bootstrap() cover the framebuffer.  Actually VAIO pro PK's
> > > framebuffer is located higher than the highest available memory
> > > region.
> > >
> > > ok? comments?
> > >
> >
> > Hi,
> >
> >  I have a few questions...
> >
> > 1. There seems to be no limit on the max PA that we extend to here.
> >This means, for example, if EFI places the framebuffer past 2TB
> >PA, we won't have enough direct map to cover the mapping. Plus
> >I think this will end up extending the direct map to cover any hole
> >between "end of phys mem" and "efi fb addr". At a minimum, I think
> >we need some sort of max PA clamp here. I don't know what Sony's
> > >placement algorithm is, but 0x4000000000 is 256GB PA.
>
> A dmesg and pcidump output would be useful.
>
> I suspect that this is a discrete graphics card where the EFI frame
> buffer resides in VRAM.  Using the direct map in this case is probably
> not the right thing to do.
>
> > 2. What does delaying cninit do for machines that have errors or
> >printfs before this? Would those even print anymore? This would
> >affect all machines, even those without efifb, correct?
>
> Yes and no.  It doesn't affect the classic VGA glass console, but it
> does mean serial output might disappear.  That isn't acceptable I'd
> say.
>
> > 3. I am not a big fan of placing device-specific quirks in
> >init_x86_64. Could this not be done in the efifb specific console
> >init code? You could pmap_enter whatever you wanted there, based on
> >the PA EFI sent you. Or does efifb go through the direct map for
> >all video output? If so, we may be stuck creating that big direct map
> >range. If that's the case though, we should probably try to restrict
> >the permissions in the unused holes.
>
> The direct map is only used early on in the boot process.  The frame
> buffer is remapped in mainbus_attach() such that we can use
> write-combining.  But that is done after we print copyright.  I think
> the remapping could be done a bit earlier, but not before uvm gets
> initialized, which happens after we print the copyright message.
>
> We don't have to use the direct map during early boot.  If you gave us
> some other way to map the framebuffer before pmap_bootstrap() has been
> called we could stick that into efifb_cnattach_common().  We'd need
> your help with that though.  Note that the framebuffer can be fairly
> large though (but we can probably come up with a reasonable upper
> limit).
>

What sort of function do you need? Map this PA range at X, but before
pmap_bootstrap?

> Cheers,
>
> Mark
>
> > > Index: sys/arch/amd64/amd64/machdep.c
> > > ===
> > > RCS file: /cvs/src/sys/arch/amd64/amd64/machdep.c,v
> > > retrieving revision 1.259
> > > diff -u -p -r1.259 machdep.c
> > > --- sys/arch/amd64/amd64/machdep.c  7 Sep 2019 19:05:44 -0000  1.259
> > > +++ sys/arch/amd64/amd64/machdep.c  19 Sep 2019 15:55:18 -0000
> > > @@ -193,6 +193,8 @@ int lid_action = 1;
> > >  int pwr_action = 1;
> > >  int forceukbd;
> > >
> > > +int docninit;
> > > +
> > >  /*
> > >   * safepri is a safe priority for sleep to set for a spin-wait
> > >   * during autoconfiguration or after a panic.
> > > @@ -1371,6 +1373,7 @@ init_x86_64(paddr_t first_avail)
> > >   bios_memmap_t *bmp;
> > >   int x, ist;
> > >   uint64_t max_dm_size = ((uint64_t)512 * NUM_L4_SLOT_DIRECT) << 30;
> > > + paddr_t max_pa;
> > >
> > >   cpu_init_msrs(_info_prima

Re: EFI frame buffer > 4GB

2019-09-20 Thread Mike Larkin
On Fri, Sep 20, 2019 at 01:09:56AM +0900, YASUOKA Masahiko wrote:
> Hi,
>
> I recently got a VAIO Pro PK.  The diff below is required to boot.
> Without the diff, it freezes during boot.
>

> Its EFI framebuffer is located at 0x4000000000 (9 zeros).  This is > 4GB
> and higher than the highest available memory of the machine.  These
> configurations seem to cause the problem.
>
> * * *
>
> Call cninit() after pmap_bootstrap() is called.  Since the EFI
> framebuffer may be located > 4GB which is not initialized by locore,
> but by pmap_bootstrap().  Also make the address parameter passed to
> pmap_bootstrap() cover the framebuffer.  Actually VAIO pro PK's
> framebuffer is located higher than the highest available memory
> region.
>
> ok? comments?
>

Hi,

 I have a few questions...

1. There seems to be no limit on the max PA that we extend to here.
   This means, for example, if EFI places the framebuffer past 2TB
   PA, we won't have enough direct map to cover the mapping. Plus
   I think this will end up extending the direct map to cover any hole
   between "end of phys mem" and "efi fb addr". At a minimum, I think
   we need some sort of max PA clamp here. I don't know what Sony's
   placement algorithm is, but 0x4000000000 is 256GB PA.

2. What does delaying cninit do for machines that have errors or
   printfs before this? Would those even print anymore? This would
   affect all machines, even those without efifb, correct?

3. I am not a big fan of placing device-specific quirks in
   init_x86_64. Could this not be done in the efifb specific console
   init code? You could pmap_enter whatever you wanted there, based on
   the PA EFI sent you. Or does efifb go through the direct map for
   all video output? If so, we may be stuck creating that big direct map
   range. If that's the case though, we should probably try to restrict
   the permissions in the unused holes.
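
To make the arithmetic behind question 1 concrete, here is a rough sketch of
what a max PA clamp could look like. Everything prefixed with toy_ is
hypothetical. The value 4 for the number of direct map L4 slots is an
assumption chosen to match the 2TB figure above; the 512GB-per-slot
expression mirrors the max_dm_size line in the quoted diff.

```c
#include <stdint.h>

#define TOY_NUM_L4_SLOT_DIRECT	4	/* assumed; gives the 2TB figure */

/* Same arithmetic as max_dm_size in the quoted diff: 512GB per L4 slot. */
static uint64_t
toy_max_dm_size(void)
{
	return (((uint64_t)512 * TOY_NUM_L4_SLOT_DIRECT) << 30);
}

/*
 * Return the highest PA the direct map should cover: extend it to the end
 * of the framebuffer when there is one, but never past the direct map
 * limit. Address overflow of fb_addr + fb_size is not handled here.
 */
static uint64_t
toy_clamp_max_pa(uint64_t avail_end, uint64_t fb_addr, uint64_t fb_size)
{
	uint64_t max_pa = avail_end;
	uint64_t limit = toy_max_dm_size();

	if (fb_addr != 0 && max_pa < fb_addr + fb_size)
		max_pa = fb_addr + fb_size;
	if (max_pa > limit)
		max_pa = limit;
	return (max_pa);
}
```

A framebuffer at the 256GB mark fits under this limit, while one placed past
2TB would be clamped, in which case the caller would still need another way
to map it. The hole between the end of physical memory and the framebuffer
is covered either way, which is the concern raised in question 3.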

Thanks.

-ml

> Index: sys/arch/amd64/amd64/machdep.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/machdep.c,v
> retrieving revision 1.259
> diff -u -p -r1.259 machdep.c
> --- sys/arch/amd64/amd64/machdep.c  7 Sep 2019 19:05:44 -0000  1.259
> +++ sys/arch/amd64/amd64/machdep.c  19 Sep 2019 15:55:18 -0000
> @@ -193,6 +193,8 @@ int lid_action = 1;
>  int pwr_action = 1;
>  int forceukbd;
>
> +int docninit;
> +
>  /*
>   * safepri is a safe priority for sleep to set for a spin-wait
>   * during autoconfiguration or after a panic.
> @@ -1371,6 +1373,7 @@ init_x86_64(paddr_t first_avail)
>   bios_memmap_t *bmp;
>   int x, ist;
>   uint64_t max_dm_size = ((uint64_t)512 * NUM_L4_SLOT_DIRECT) << 30;
> + paddr_t max_pa;
>
>   cpu_init_msrs(&cpu_info_primary);
>
> @@ -1541,7 +1544,16 @@ init_x86_64(paddr_t first_avail)
>* Call pmap initialization to make new kernel address space.
>* We must do this before loading pages into the VM system.
>*/
> - first_avail = pmap_bootstrap(first_avail, trunc_page(avail_end));
> + max_pa = avail_end;
> + /* Make sure max_pa covers the EFI frame buffer */
> + if (bios_efiinfo->fb_addr != 0 &&
> + max_pa < bios_efiinfo->fb_addr + bios_efiinfo->fb_size)
> + max_pa = bios_efiinfo->fb_addr + bios_efiinfo->fb_size;
> + first_avail = pmap_bootstrap(first_avail, trunc_page(max_pa));
> +
> + /* Call cninit after entire physical memory is available */
> + if (docninit > 0)
> + cninit();
>
>   /* Allocate these out of the 640KB base memory */
>   if (avail_start != PAGE_SIZE)
> @@ -1914,7 +1926,6 @@ getbootinfo(char *bootinfo, int bootinfo
>   bios_ddb_t *bios_ddb;
>   bios_bootduid_t *bios_bootduid;
>   bios_bootsr_t *bios_bootsr;
> - int docninit = 0;
>
>  #undef BOOTINFO_DEBUG
>  #ifdef BOOTINFO_DEBUG
> @@ -2026,8 +2037,6 @@ getbootinfo(char *bootinfo, int bootinfo
>   break;
>   }
>   }
> - if (docninit > 0)
> - cninit();
>  #ifdef BOOTINFO_DEBUG
>   printf("\n");
>  #endif
>



Re: New driver for AMD CPU temperature sensor over SMN

2019-09-19 Thread Mike Larkin
On Tue, Sep 17, 2019 at 12:41:11PM -0400, Bryan Steele wrote:
> Unlike with previous generations of AMD processors, the on-die
> temperature sensor is only available by reading from the SMU
> co-processor over an internal network (SMN). This includes all
> Ryzen CPUs and some later Family 15h models (not supported here).
>
> This adds a new driver based on km(4) and info gleaned from Linux
> and FreeBSD.
>
> High-TDP Zen/Zen+ parts have undocumented offsets from the control temp
> and need to be adjusted; low-TDP (mobile) and Zen2 parts have no offsets.
>
> I've tested this on a Ryzen 2700X desktop and a Ryzen 5 2500U (mobile)
> laptop (no offset). I would appreciate feedback on other models, and
> confirmation about Zen2.
>
> Thanks to @nte@bsd.network for providing helpful information and
> much needed prodding.
>
> -Bryan.
>

This works on my Ryzen 2700U, thanks!

kmsmn0 at pci0 dev 0 function 0 "AMD AMD64 17h/1xh Root Complex" rev 0x00

kmsmn0.temp0 60.88 degC

-ml
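
For readers following the register layout in the quoted diff, the decode
step might look roughly like the sketch below. It is not the driver code,
which is cut off in this archive. Two things are assumptions on my part:
that the 11-bit field counts in 0.125 degC steps, and that the range
adjustment is 49 degC (my reading of the -49C scale comment). The toy_ names
are hypothetical.

```c
#include <stdint.h>

#define TOY_GET_CURTMP(r)	(((r) >> 21) & 0x7ff)	/* bits [31:21] */
#define TOY_RANGE_SEL		(1U << 19)	/* bit 19: -49C based scale */
#define TOY_RANGE_ADJUST_MC	49000		/* assumed: 49 degC in mdegC */

/*
 * Convert a raw thermal register value to millidegrees Celsius.
 * tctl_offset_mc is the per-model offset (0 for parts without one).
 */
static int32_t
toy_curtmp_to_mc(uint32_t reg, int32_t tctl_offset_mc)
{
	/* Assumed resolution: one LSB is 0.125 degC, i.e. 125 mdegC. */
	int32_t mc = (int32_t)TOY_GET_CURTMP(reg) * 125;

	if (reg & TOY_RANGE_SEL)
		mc -= TOY_RANGE_ADJUST_MC;
	return (mc - tctl_offset_mc);
}
```

Under these assumptions a field value of 487 would decode to 60.875 degC,
which is the kind of value that displays as 60.88.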

> Index: arch/amd64/conf/GENERIC
> ===
> RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v
> retrieving revision 1.478
> diff -u -p -u -r1.478 GENERIC
> --- sys/arch/amd64/conf/GENERIC   7 Sep 2019 13:46:19 -0000   1.478
> +++ sys/arch/amd64/conf/GENERIC   17 Sep 2019 16:24:03 -0000
> @@ -103,6 +103,7 @@ amdpcib* at pci?  # AMD 8111 LPC bridge
>  tcpcib*  at pci? # Intel Atom E600 LPC bridge
>  kate*at pci? # AMD K8 temperature sensor
>  km*  at pci? # AMD K10 temperature sensor
> +kmsmn*   at pci? # AMD F17 temperature sensor
>  amas*at pci? disable # AMD memory configuration
>  pchtemp* at pci? # Intel C610 termperature sensor
>  ccp* at pci? # AMD Cryptographic Co-processor
> Index: dev/pci/files.pci
> ===
> RCS file: /cvs/src/sys/dev/pci/files.pci,v
> retrieving revision 1.338
> diff -u -p -u -r1.338 files.pci
> --- sys/dev/pci/files.pci 5 Aug 2019 08:33:38 -0000   1.338
> +++ sys/dev/pci/files.pci 17 Sep 2019 16:24:03 -0000
> @@ -774,6 +774,11 @@ device   km
>  attach   km at pci
>  file dev/pci/km.ckm
>
> +# AMD Family 15h/17h Temperature sensor over SMN
> +device   kmsmn
> +attach   kmsmn at pci
> +file dev/pci/kmsmn.c kmsmn
> +
>  # Intel SOC GCU
>  device   gcu
>  attach   gcu at pci
> Index: dev/pci/kmsmn.c
> ===
> RCS file: dev/pci/kmsmn.c
> diff -N dev/pci/kmsmn.c
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ sys/dev/pci/kmsmn.c   17 Sep 2019 16:24:03 -0000
> @@ -0,0 +1,170 @@
> +/*   $OpenBSD$   */
> +
> +/*
> + * Copyright (c) 2019 Bryan Steele 
> + *
> + * Permission to use, copy, modify, and distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#include 
> +#include 
> +
> +/*
> + * AMD temperature sensors on Family 17h (and some 15h) must be
> + * read from the System Management Unit (SMU) co-processor over
> + * the System Management Network (SMN).
> + */
> +
> +#define SMN_17H_ADDR_R 0x60
> +#define SMN_17H_DATA_R 0x64
> +
> +/*
> +  * AMD Family 17h SMU Thermal Registers (THM)
> +  *
> +  * 4.2.1, OSRR (Open-Source Register Reference) Guide for Family 17h
> +  * [31:21]  Current reported temperature.
> +  */
> +#define SMU_17H_THM  0x59800
> +#define KM_GET_CURTMP(r) (((r) >> 21) & 0x7ff)
> +
> +/*
> + * Bit 19 set: "Report on -49C to 206C scale range."
> + *  clear: "Report on 0C to 225C (255C?) scale range."
> + */
> +#define CURTMP_17H_RANGE_SEL (1 << 19)
> +#define CURTMP_17H_RANGE_ADJUST  490
> +
> +/*
> + * Undocumented tCTL offsets gleamed from Linux k10temp driver.
> + */
> +struct curtmp_offset {
> + const char *const cpu_model; /* partial match */
> + int tctl_offset;
> +} cpu_model_offsets[] = {
> + { "AMD Ryzen 5 1600X", 200 },
> + { "AMD Ryzen 7 1700X", 200 },
> + { "AMD Ryzen 7 1800X", 200 },
> + { "AMD Ryzen 7 2700X", 100 },
> + { "AMD Ryzen Threadripper 19", 270 }, /* many models */
> + 

Re: [patch] vmd: fix possible small memleak in vm_claimid() error path

2019-09-04 Thread Mike Larkin
On Thu, Aug 29, 2019 at 07:43:31PM +0200, Hiltjo Posthuma wrote:
> Hi,
> 
> This fixes a small possible memory leak in an error handling path in vmd.c
> vm_claimid().
> 
> 
> diff --git usr.sbin/vmd/vmd.c usr.sbin/vmd/vmd.c
> index 654af5974d3..81be6b356d6 100644
> --- usr.sbin/vmd/vmd.c
> +++ usr.sbin/vmd/vmd.c
> @@ -1197,6 +1197,7 @@ vm_claimid(const char *name, int uid, uint32_t *id)
>   n2i->uid = uid;
>   if (strlcpy(n2i->name, name, sizeof(n2i->name)) >= sizeof(n2i->name)) {
>   log_warnx("vm name too long");
> + free(n2i);
>   return -1;
>   }
>   TAILQ_INSERT_TAIL(env->vmd_known, n2i, entry);
> 
> -- 
> Kind regards,
> Hiltjo
> 

Thanks, committed.

-ml



Re: vmd(8): remove unused error code

2019-09-02 Thread Mike Larkin
On Mon, Sep 02, 2019 at 12:43:18AM +0200, Tobias Heider wrote:
> The VMD_DISK_INVALID error code is no longer used since
> https://marc.info/?l=openbsd-cvs&m=153147762830175 and
> can be removed.
> 
> Ok?
> 
> Index: vmctl/vmctl.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/vmctl.c,v
> retrieving revision 1.69
> diff -u -p -u -r1.69 vmctl.c
> --- vmctl/vmctl.c 22 May 2019 16:19:21 -0000  1.69
> +++ vmctl/vmctl.c 1 Sep 2019 22:06:32 -0000
> @@ -235,11 +235,6 @@ vm_start_complete(struct imsg *imsg, int
>   warnx("could not open disk image(s)");
>   *ret = ENOENT;
>   break;
> - case VMD_DISK_INVALID:
> - warnx("specified disk image(s) are "
> - "not regular files");
> - *ret = ENOENT;
> - break;
>   case VMD_CDROM_MISSING:
>   warnx("could not find specified iso image");
>   *ret = ENOENT;
> Index: vmd/vmd.h
> ===
> RCS file: /cvs/src/usr.sbin/vmd/vmd.h,v
> retrieving revision 1.95
> diff -u -p -u -r1.95 vmd.h
> --- vmd/vmd.h 17 Jul 2019 05:51:07 -0000  1.95
> +++ vmd/vmd.h 1 Sep 2019 22:06:32 -0000
> @@ -68,7 +68,6 @@
>  /* vmd -> vmctl error codes */
>  #define VMD_BIOS_MISSING 1001
>  #define VMD_DISK_MISSING 1002
> -#define VMD_DISK_INVALID 1003
>  #define VMD_VM_STOP_INVALID  1004
>  #define VMD_CDROM_MISSING1005
>  #define VMD_CDROM_INVALID1006
> 

ok mlarkin

You might want to comment that 1003 was VMD_DISK_INVALID in case someone
later is wondering why we skipped 1002 to 1004. Like we do with syscalls.

-ml
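
The suggested comment could look something like the fragment below. The
defines are the ones visible in the quoted diff; the wording of the comment
is only an illustration, not what was committed.

```c
/* vmd -> vmctl error codes */
#define VMD_BIOS_MISSING	1001
#define VMD_DISK_MISSING	1002
/* 1003 was VMD_DISK_INVALID (removed, no longer used) */
#define VMD_VM_STOP_INVALID	1004
#define VMD_CDROM_MISSING	1005
#define VMD_CDROM_INVALID	1006
```

Leaving the gap documented keeps the remaining values stable for anything
that already depends on them.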



Re: Another dwiic(4) fix

2019-08-17 Thread Mike Larkin
On Sat, Aug 17, 2019 at 06:42:01PM +0200, Mark Kettenis wrote:
> The timeout when waiting for data to be received for polled mode is
> too small for talking to the BMC on the Ampere/Lenovo arm64 server.
> This bumps it to 50 ms, which is still lower than what it is for
> non-polled mode.
> 
> I also found that things worked much better if we checked for the
> actual STOP detection bit to be set instead of looking at the activity
> bit in the status register.
> 
> Also tested on my Asus x205ta, which has an i2c keyboard.
> 
> ok?
> 
> 

This causes no regressions on the 2 machines I tried.

Probably the right time to get it in, ok mlarkin
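
The polled wait being changed follows a common bounded-retry shape, sketched
here against a simulated register. This is not the dwiic(4) code: the toy_
names, the simulated device and the bit value are hypothetical, and
toy_delay() is a stub standing in for a real delay.

```c
#include <stdint.h>

#define TOY_STOP_DET	0x200	/* hypothetical bit position */

/* Simulated device: the STOP bit shows up after a number of reads. */
struct toy_dev {
	int reads_until_stop;
};

static uint32_t
toy_read_raw_intr(struct toy_dev *d)
{
	if (d->reads_until_stop > 0) {
		d->reads_until_stop--;
		return (0);
	}
	return (TOY_STOP_DET);
}

/* Stub; a real driver would busy-wait for usec microseconds here. */
static void
toy_delay(int usec)
{
	(void)usec;
}

/* Poll until STOP is detected. Returns 0 on success, -1 on timeout. */
static int
toy_wait_stop(struct toy_dev *d, int retries)
{
	uint32_t st = 0;

	for (; retries > 0; retries--) {
		st = toy_read_raw_intr(d);
		if (st & TOY_STOP_DET)
			break;
		toy_delay(1000);
	}
	/* Check the last value read, not the loop counter. */
	return ((st & TOY_STOP_DET) ? 0 : -1);
}
```

The timeout is the retry count multiplied by the delay per iteration, so
raising either one extends how long a slow device is given to respond.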

> Index: dwiic.c
> ===
> RCS file: /cvs/src/sys/dev/ic/dwiic.c,v
> retrieving revision 1.6
> diff -u -p -r1.6 dwiic.c
> --- dwiic.c   6 Aug 2019 06:56:29 -0000   1.6
> +++ dwiic.c   17 Aug 2019 16:12:39 -0000
> @@ -350,7 +350,7 @@ dwiic_i2c_exec(void *cookie, i2c_op_t op
>   sc->sc_dev.dv_xname, __func__, tx_limit, x));
>  
>   if (flags & I2C_F_POLL) {
> - for (retries = 100; retries > 0; retries--) {
> + for (retries = 1000; retries > 0; retries--) {
>   rx_avail = dwiic_read(sc, DW_IC_RXFLR);
>   if (rx_avail > 0)
>   break;
> @@ -407,14 +407,13 @@ dwiic_i2c_exec(void *cookie, i2c_op_t op
>  
>   if (I2C_OP_STOP_P(op) && I2C_OP_WRITE_P(op)) {
>   if (flags & I2C_F_POLL) {
> - /* wait for bus to be idle */
>   for (retries = 100; retries > 0; retries--) {
> - st = dwiic_read(sc, DW_IC_STATUS);
> - if (!(st & DW_IC_STATUS_ACTIVITY))
> + st = dwiic_read(sc, DW_IC_RAW_INTR_STAT);
> + if (st & DW_IC_INTR_STOP_DET)
>   break;
>   DELAY(1000);
>   }
> - if (st & DW_IC_STATUS_ACTIVITY)
> + if (!(st & DW_IC_INTR_STOP_DET))
>   printf("%s: timed out waiting for bus idle\n",
>   sc->sc_dev.dv_xname);
>   } else {
> 
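
For readers skimming the hunk above, here is a rough userland sketch of the
polled wait it switches to: keep reading the raw interrupt status until the
STOP-detected bit shows up, and report a timeout otherwise. The fake_dev
struct, fake_read_raw_intr_stat() and the STOP_DET_BIT value are stand-ins
made up for illustration; the real driver uses dwiic_read() and DELAY().

```c
#include <stdbool.h>
#include <stdint.h>

#define STOP_DET_BIT	(1u << 9)	/* assumed bit position, illustration only */

/* Simulated controller: reports STOP after a given number of reads. */
struct fake_dev {
	int reads_until_stop;
};

/* Stand-in for reading the raw interrupt status register. */
static uint32_t
fake_read_raw_intr_stat(struct fake_dev *d)
{
	if (d->reads_until_stop > 0) {
		d->reads_until_stop--;
		return 0;
	}
	return STOP_DET_BIT;
}

/*
 * Poll for the STOP condition at most `retries' times.
 * Returns true if STOP was seen, false on timeout.
 * A real driver would delay between reads; omitted here.
 */
static bool
wait_for_stop(struct fake_dev *d, int retries)
{
	uint32_t st = 0;

	for (; retries > 0; retries--) {
		st = fake_read_raw_intr_stat(d);
		if (st & STOP_DET_BIT)
			break;
	}
	return (st & STOP_DET_BIT) != 0;
}
```

The point of the change is the condition being tested: the loop waits for a
positive "transfer finished" event instead of inferring it from the
activity bit going quiet.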



Re: drm acpi diff

2019-08-16 Thread Mike Larkin
On Fri, Aug 16, 2019 at 10:21:33PM +0200, Mark Kettenis wrote:
> The diff below provides a minimal implementation of some of the Linux
> ACPI interfaces.  Enough to allow us to compile the ACPI code for
> radeon(4) and amdgpu(4).  With this diff the brightness keys on my HP
> laptop with:
> 
> cpu0: AMD A4-4355M APU with Radeon(tm) HD Graphics, 1897.56 MHz, 15-10-01
> ...
> radeondrm0 at pci0 dev 1 function 0 "ATI Radeon HD 7400G" rev 0x00
> 
> now work.  I'd like to see some more tests, especially on laptops with
> amdgpu(4).  Diff has some debug printing enabled.  Feel free to share
> the dmesg output with me.
> 

On my HP 735G5, this does not enable the brightness controls, but it does not
seem to regress anything, either. dmesg contains the following:

initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x103C:0x83DA 0xD0).
[drm] ATCS version 1
[drm] Found ATIF handle \_SB_.PCI0.BUSA.GFX0.ATIF
[drm] ATIF version 1
[drm] SYSTEM_PARAMS: mask = 0x106, flags = 0x107
[drm] Notification enabled, command code = 0xd0
amdgpu0: 1920x1080, 32bpp
wsdisplay0 at amdgpu0 mux 1: console (std, vt100 emulation), using wskbd0
wsdisplay0: screen 1-5 added (std, vt100 emulation)

xbacklight still reports:

No outputs have backlight property

-ml

> 
> Index: dev/pci/drm/drm_drv.c
> ===
> RCS file: /cvs/src/sys/dev/pci/drm/drm_drv.c,v
> retrieving revision 1.164
> diff -u -p -r1.164 drm_drv.c
> --- dev/pci/drm/drm_drv.c 30 Jul 2019 05:50:20 -  1.164
> +++ dev/pci/drm/drm_drv.c 16 Aug 2019 20:05:55 -
> @@ -55,6 +55,14 @@
>  #include 
>  #include 
>  
> +#include 
> +
> +#ifdef __HAVE_ACPI
> +#include 
> +#include 
> +#include 
> +#endif
> +
>  #include 
>  #include 
>  #include 
> @@ -76,7 +84,7 @@ struct drm_softc {
>  #ifdef DRMDEBUG
>  unsigned int drm_debug = DRM_UT_DRIVER | DRM_UT_KMS;
>  #else
> -unsigned int drm_debug = 0;
> +unsigned int drm_debug = DRM_UT_DRIVER;
>  #endif
>  
>  int   drm_firstopen(struct drm_device *);
> @@ -256,6 +264,12 @@ drm_attach(struct device *parent, struct
>   dev->bridgetag = da->bridgetag;
>   dev->pdev->tag = da->tag;
>   dev->pdev->pci = (struct pci_softc *)parent->dv_parent;
> +
> +#ifdef CONFIG_ACPI
> + dev->pdev->dev.node = acpi_find_pci(da->pc, da->tag);
> + aml_register_notify(dev->pdev->dev.node, NULL,
> + drm_linux_acpi_notify, NULL, ACPIDEV_NOPOLL);
> +#endif
>  
>   rw_init(>struct_mutex, "drmdevlk");
>   mtx_init(>event_lock, IPL_TTY);
> Index: dev/pci/drm/drm_linux.c
> ===
> RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v
> retrieving revision 1.47
> diff -u -p -r1.47 drm_linux.c
> --- dev/pci/drm/drm_linux.c   5 Aug 2019 08:35:59 -   1.47
> +++ dev/pci/drm/drm_linux.c   16 Aug 2019 20:05:55 -
> @@ -987,6 +987,8 @@ vga_put(struct pci_dev *pdev, int rsrc)
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  acpi_status
>  acpi_get_table(const char *sig, int instance,
> @@ -1008,6 +1010,131 @@ acpi_get_table(const char *sig, int inst
>   }
>  
>   return AE_NOT_FOUND;
> +}
> +
> +acpi_status
> +acpi_get_handle(acpi_handle node, const char *name, acpi_handle *rnode)
> +{
> + node = aml_searchname(node, name);
> + if (node == NULL)
> + return AE_NOT_FOUND;
> +
> + *rnode = node;
> + return 0;
> +}
> +
> +acpi_status
> +acpi_get_name(acpi_handle node, int type,  struct acpi_buffer *buffer)
> +{
> + KASSERT(buffer->length != ACPI_ALLOCATE_BUFFER);
> + KASSERT(type == ACPI_FULL_PATHNAME);
> + strlcpy(buffer->pointer, aml_nodename(node), buffer->length);
> + return 0;
> +}
> +
> +acpi_status
> +acpi_evaluate_object(acpi_handle node, const char *name,
> +struct acpi_object_list *params, struct acpi_buffer *result)
> +{
> + struct aml_value args[4], res;
> + union acpi_object *obj;
> + uint8_t *data;
> + int i;
> +
> + KASSERT(params->count <= nitems(args));
> +
> + for (i = 0; i < params->count; i++) {
> + args[i].type = params->pointer[i].type;
> + switch (args[i].type) {
> + case AML_OBJTYPE_INTEGER:
> + args[i].v_integer = params->pointer[i].integer.value;
> + break;
> + case AML_OBJTYPE_BUFFER:
> + args[i].length = params->pointer[i].buffer.length;
> + args[i].v_buffer = params->pointer[i].buffer.pointer;
> + break;
> + default:
> + printf("%s: arg type 0x%02x", __func__, args[i].type);
> + return AE_BAD_PARAMETER;
> + }
> + }
> +
> + if (name) {
> + node = aml_searchname(node, name);
> + if (node == NULL)
> + return AE_NOT_FOUND;
> + }
> + if (aml_evalnode(acpi_softc, node, params->count, args, )) {
> + 

Re: vmctl: invalid parent template error

2019-08-12 Thread Mike Larkin
On Mon, Aug 12, 2019 at 06:20:23PM +0200, Anton Lindqvist wrote:
> On Mon, Aug 12, 2019 at 02:52:46PM +0200, Klemens Nanni wrote:
> > On Mon, Aug 12, 2019 at 02:14:42PM +0200, Anton Lindqvist wrote:
> > > Hi,
> > > I recently fat fingered the vm template passed to vmctl and was greeted
> > > with the following error:
> > > 
> > >   vmctl: start vm command failed: Operation not permitted
> > > 
> > > I think we can be more specific in order to improve usability:
> > I agree, but your wording implies the template might be there but vmd
> > failed to find it.  Also, "parent" and "template" seems too much;  we
> > have no other templates, so may I suggest one of the simpler
> > 
> > invalid template
> > template does not exist
> > 
> 
> Updated diff changing the error message to invalid template.
> 

I'm ok with this.

-ml

> Index: vmctl/vmctl.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/vmctl.c,v
> retrieving revision 1.69
> diff -u -p -r1.69 vmctl.c
> --- vmctl/vmctl.c 22 May 2019 16:19:21 -  1.69
> +++ vmctl/vmctl.c 12 Aug 2019 16:18:22 -
> @@ -249,6 +249,10 @@ vm_start_complete(struct imsg *imsg, int
>   "file");
>   *ret = ENOENT;
>   break;
> + case VMD_PARENT_INVALID:
> + warnx("invalid template");
> + *ret = EINVAL;
> + break;
>   default:
>   errno = res;
>   warn("start vm command failed");
> Index: vmd/vmd.c
> ===
> RCS file: /cvs/src/usr.sbin/vmd/vmd.c,v
> retrieving revision 1.114
> diff -u -p -r1.114 vmd.c
> --- vmd/vmd.c 28 Jun 2019 13:32:51 -  1.114
> +++ vmd/vmd.c 12 Aug 2019 16:18:22 -
> @@ -1373,8 +1373,13 @@ vm_instance(struct privsep *ps, struct v
>  
>   /* return without error if the parent is NULL (nothing to inherit) */
>   if ((vmc->vmc_flags & VMOP_CREATE_INSTANCE) == 0 ||
> - (*vm_parent = vm_getbyname(vmc->vmc_instance)) == NULL)
> + vmc->vmc_instance[0] == '\0')
>   return (0);
> +
> + if ((*vm_parent = vm_getbyname(vmc->vmc_instance)) == NULL) {
> + errno = VMD_PARENT_INVALID;
> + return (-1);
> + }
>  
>   errno = 0;
>   vmcp = &(*vm_parent)->vm_params;
> Index: vmd/vmd.h
> ===
> RCS file: /cvs/src/usr.sbin/vmd/vmd.h,v
> retrieving revision 1.95
> diff -u -p -r1.95 vmd.h
> --- vmd/vmd.h 17 Jul 2019 05:51:07 -  1.95
> +++ vmd/vmd.h 12 Aug 2019 16:18:22 -
> @@ -72,6 +72,7 @@
>  #define VMD_VM_STOP_INVALID  1004
>  #define VMD_CDROM_MISSING    1005
>  #define VMD_CDROM_INVALID    1006
> +#define VMD_PARENT_INVALID   1007
>  
>  /* Image file signatures */
>  #define VM_MAGIC_QCOW        "QFI\xfb"
> 



Re: TSC synchronization on MP machines

2019-08-05 Thread Mike Larkin
On Tue, Aug 06, 2019 at 12:38:51AM +0200, Mark Kettenis wrote:
> > Date: Mon, 5 Aug 2019 16:58:27 +0300
> > From: Paul Irofti 
> > 
> > On Fri, Aug 02, 2019 at 01:29:37PM +0300, Paul Irofti wrote:
> > > On Mon, Jul 01, 2019 at 10:32:51AM +0200, Mark Kettenis wrote:
> > > > > Date: Thu, 27 Jun 2019 15:08:00 +0300
> > > > > From: Paul Irofti 
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > Here is an initial diff, adapted from NetBSD, that synchronizes TSC
> > > > > clocks across cores.
> > > > > 
> > > > > > CPU0 is the reference clock and all others are skewed. During CPU
> > > > > > initialization the clocks synchronize by keeping a registry of each
> > > > > > CPU's clock skew and adapting the TSC read routine accordingly.
> > > > > > 
> > > > > > I chose this implementation over what FreeBSD is doing (which is just
> > > > > copying Linux really), because it is clean and elegant.
> > > > > 
> > > > > I would love to hear reports from machines that were broken by this.
> > > > > > Mine, which never exhibited the problem in the first place, run just
> > > > > > fine with the following diff. In fact I am writing this message on
> > > > > > one such machine.
> > > > > > 
> > > > > > Also constructive comments are more than welcome!
> > > > > 
> > > > > Notes:
> > > > > 
> > > > > - cpu_counter_serializing() could probably have a better name
> > > > > >   (tsc_read for example)
> > > > > - the PAUSE instruction is probably not needed
> > > > > - acpi(4) suspend and resume bits are left out on purpose, but should
> > > > >   be trivial to add once the current diff settles
> > > > > 
> > > > > Paul Irofti
> > > > 
> > > > I don't think we want to introduce a  header file.
> > > > 
> > > > The code suffers from some NetBSD-isms, so that'll need to be fixed.
> > > > I pointed some of them out below.
> > > > 
> > > > Also, how accurate is your skew detection?  What skew is detected on a
> > > > machine that (supposedly) has the TSCs in sync?  The result will be
> > > > that you actually slightly desync the counters on different CPUs.
> > > > 
> > > > I think Linux uses the TSC_ADJUST MSR and compares its value across
> > > > cores.  If the skew is small and the TSC_ADJUST values are the same
> > > > across cores it skips the TSC adjustments.
> > > 
> > > Hi,
> > > 
> > > Here is an updated diff with a few bugs eliminated from the previous and
> > > with most of the concerns I got in private and from Mark fixed.
> > > 
> > > I will do the TSC_ADJUST_MSR dance in another iteration if the current
> > > incarnation turns out to be correct for machines suffering from TSCs not
> > > in sync.
> > > 
> > > The thing I am mostly worried about now is in the following sum
> > > 
> > >  uint
> > >  tsc_get_timecount(struct timecounter *tc)
> > >  {
> > >   return rdtsc() + curcpu()->cpu_cc_skew;
> > >  }
> > >  
> > > can one term be executed on one CPU and the other on another? Is there a
> > > way to protect this from happening other than locking?
> > > 
> > > I see NetBSD is checking for a change in the number of context switches 
> > > of the current process.
> > > 
> > > My plan is to have a fix in the tree before 6.6 is released, so I would
> > > love to hear your thoughts and reports on this.
> > > 
> > > Thanks,
> > > Paul
> > 
> > Hi,
> > 
> > Here is a third version of the TSC diff that also takes into
> > consideration the suspend-resume path, which was ignored by the previous
> > version, thus rendering resume broken.
> 
> Hmm, what is this diff supposed to do upon suspend-resume?  I'm fairly
> certain that you'll need to recalibrate the skew in the resume path.
> But it doesn't seem to do that.  Or at least it doesn't print any
> messages.

I agree with kettenis. You definitely will need to re-calculate and apply
the skews on resume.

-ml
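
To make the resume concern concrete, here is a toy model, not the kernel
code: each CPU carries a raw counter and a skew relative to the reference
CPU, and the timecount is the sum of the two, as in the
tsc_get_timecount() snippet quoted above. struct fake_cpu,
calibrate_skew() and read_timecount() are invented names, and the sign
convention (skew = reference minus local) is an assumption.

```c
#include <stdint.h>

/* Simulated per-CPU state; in the kernel this would live in cpu_info. */
struct fake_cpu {
	uint64_t tsc;	/* simulated raw counter value */
	int64_t  skew;	/* offset that maps this CPU onto the reference */
};

/* Record how far this CPU's counter is from the reference CPU's. */
static void
calibrate_skew(const struct fake_cpu *ref, struct fake_cpu *cpu)
{
	cpu->skew = (int64_t)(ref->tsc - cpu->tsc);
}

/* Skew-corrected read: raw counter plus the stored per-CPU offset. */
static uint64_t
read_timecount(const struct fake_cpu *cpu)
{
	return cpu->tsc + (uint64_t)cpu->skew;
}
```

If the counters restart from unrelated values after a suspend, a skew
measured at boot no longer maps the CPU onto the reference, so the
calibration step has to run again in the resume path before the corrected
value is trusted.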

> 
> Further comments/questions/requests below...
> 
> 
> Anyway, here is what gets reported on my x1c3:
> 
> cpu0: Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz, 2494.66 MHz, 06-3d-04
> 
> tsc_timecounter_init: TSC skew=0 observed drift=0
> tsc_timecounter_init: TSC skew=-344 observed drift=0
> tsc_timecounter_init: TSC skew=-8 observed drift=0
> tsc_timecounter_init: TSC skew=-26 observed drift=0
> 
> > Index: arch/amd64/amd64/acpi_machdep.c
> > ===
> > RCS file: /cvs/src/sys/arch/amd64/amd64/acpi_machdep.c,v
> > retrieving revision 1.86
> > diff -u -p -u -p -r1.86 acpi_machdep.c
> > --- arch/amd64/amd64/acpi_machdep.c 23 Oct 2018 17:51:32 -  1.86
> > +++ arch/amd64/amd64/acpi_machdep.c 5 Aug 2019 13:54:33 -
> > @@ -60,6 +60,8 @@ extern paddr_t tramp_pdirpa;
> >  
> >  extern int acpi_savecpu(void) __returns_twice;
> >  
> > +extern int64_t tsc_drift_observed;
> > +
> >  #define ACPI_BIOS_RSDP_WINDOW_BASE 0xe
> >  #define ACPI_BIOS_RSDP_WINDOW_SIZE 0x2
> >  
> > @@ -481,6 +483,8 @@ acpi_resume_cpu(struct acpi_softc *sc)
> >  {
> > fpuinit(_info_primary);
> >  
> > +   

Re: ihidev: always register interrupt, stop polling if it fires

2019-07-19 Thread Mike Larkin
On Fri, Jul 19, 2019 at 02:16:18PM -0500, joshua stein wrote:
> The fast polling of ihidev may cause a problem during suspend/resume 
> because dwiic may be in an unknown state, so add a DVACT_QUIESCE 
> handler to properly shut it down.
> 
> Even if polling is requested, register the interrupt handler too.  
> If it ever fires, stop polling and just use the interrupt.  This is 
> what led me to discover that ihidev's interrupt starts firing after 
> a suspend/resume cycle (but I still have no idea why it doesn't work 
> before that).
> 
> 

Seems to be a sensible workaround, but you might want to remove that
printf in the interrupt handler unless you're absolutely sure it's safe
there...

-ml
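
A rough model of the "poll until a real interrupt shows up" idea, with the
message deferred instead of printed from the handler. This is only a sketch
under assumptions: struct fake_softc and its fields loosely mirror the
sc_poll/sc_frompoll names visible in the quoted diff, while note_pending
and the fake_* functions are hypothetical.

```c
#include <stdbool.h>

/* Simulated driver state. */
struct fake_softc {
	bool poll;		/* polling timer is active */
	bool frompoll;		/* set while the poll timer calls the handler */
	bool note_pending;	/* message to emit later, outside the handler */
	int  events;		/* number of times the handler ran */
};

/* Handler shared by the real interrupt and the polling timer. */
static int
fake_intr(struct fake_softc *sc)
{
	if (sc->poll && !sc->frompoll) {
		/* A real interrupt arrived: stop polling from now on. */
		sc->poll = false;
		/* Defer the log message; no printf in interrupt context. */
		sc->note_pending = true;
	}
	sc->events++;
	return 1;
}

/* Polling timer callback: mark the call so it is not mistaken for an IRQ. */
static void
fake_poll(struct fake_softc *sc)
{
	sc->frompoll = true;
	fake_intr(sc);
	sc->frompoll = false;
}
```

Whatever later drains note_pending (a task or the next process-context
entry point) can print the "switching to interrupts" notice safely.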

> Index: dev/acpi/dwiic_acpi.c
> ===
> RCS file: /cvs/src/sys/dev/acpi/dwiic_acpi.c,v
> retrieving revision 1.9
> diff -u -p -u -p -r1.9 dwiic_acpi.c
> --- dev/acpi/dwiic_acpi.c 16 Jul 2019 19:12:32 -  1.9
> +++ dev/acpi/dwiic_acpi.c 19 Jul 2019 19:11:21 -
> @@ -457,8 +457,9 @@ dwiic_acpi_found_ihidev(struct dwiic_sof
>  
>   aml_freevalue();
>  
> - if (!sc->sc_poll_ihidev &&
> - !(crs.irq_int == 0 && crs.gpio_int_node == NULL))
> + if (sc->sc_poll_ihidev)
> + ia.ia_poll = 1;
> + if (!(crs.irq_int == 0 && crs.gpio_int_node == NULL))
>   ia.ia_intr = 
>  
>   if (config_found(sc->sc_iic, , dwiic_i2c_print)) {
> Index: dev/i2c/i2cvar.h
> ===
> RCS file: /cvs/src/sys/dev/i2c/i2cvar.h,v
> retrieving revision 1.16
> diff -u -p -u -p -r1.16 i2cvar.h
> --- dev/i2c/i2cvar.h  23 Apr 2016 09:40:28 -  1.16
> +++ dev/i2c/i2cvar.h  19 Jul 2019 19:11:21 -
> @@ -113,6 +113,7 @@ struct i2c_attach_args {
>   char *ia_name;    /* chip name */
>   void *ia_cookie;  /* pass extra info from bus to dev */
>   void *ia_intr;    /* interrupt info */
> + int  ia_poll;     /* to force polling */
>  };
>  
>  /*
> Index: dev/i2c/ihidev.c
> ===
> RCS file: /cvs/src/sys/dev/i2c/ihidev.c,v
> retrieving revision 1.19
> diff -u -p -u -p -r1.19 ihidev.c
> --- dev/i2c/ihidev.c  8 Apr 2019 17:50:45 -   1.19
> +++ dev/i2c/ihidev.c  19 Jul 2019 19:11:21 -
> @@ -63,6 +63,7 @@ static int I2C_HID_POWER_OFF= 0x1;
>  int  ihidev_match(struct device *, void *, void *);
>  void ihidev_attach(struct device *, struct device *, void *);
>  int  ihidev_detach(struct device *, int);
> +int  ihidev_activate(struct device *, int);
>  
>  int  ihidev_hid_command(struct ihidev_softc *, int, void *);
>  int  ihidev_intr(void *);
> @@ -80,7 +81,7 @@ struct cfattach ihidev_ca = {
>   ihidev_match,
>   ihidev_attach,
>   ihidev_detach,
> - NULL
> + ihidev_activate,
>  };
>  
>  struct cfdriver ihidev_cd = {
> @@ -128,7 +129,7 @@ ihidev_attach(struct device *parent, str
>   printf(", can't establish interrupt");
>   }
>  
> - if (sc->sc_ih == NULL) {
> + if (ia->ia_poll) {
>   printf(" (polling)");
>   sc->sc_poll = 1;
>   sc->sc_fastpoll = 1;
> @@ -227,6 +228,39 @@ ihidev_detach(struct device *self, int f
>   return (0);
>  }
>  
> +int
> +ihidev_activate(struct device *self, int act)
> +{
> + struct ihidev_softc *sc = (struct ihidev_softc *)self;
> +
> + DPRINTF(("%s(%d)\n", __func__, act));
> +
> + switch (act) {
> + case DVACT_QUIESCE:
> + sc->sc_dying = 1;
> + if (sc->sc_poll && timeout_initialized(>sc_timer)) {
> + DPRINTF(("%s: canceling polling\n",
> + sc->sc_dev.dv_xname));
> + timeout_del_barrier(>sc_timer);
> + }
> + if (ihidev_hid_command(sc, I2C_HID_CMD_SET_POWER,
> + _HID_POWER_OFF))
> + printf("%s: failed to power down\n",
> + sc->sc_dev.dv_xname);
> + break;
> + case DVACT_WAKEUP:
> + ihidev_reset(sc);
> + sc->sc_dying = 0;
> + if (sc->sc_poll && timeout_initialized(>sc_timer))
> + timeout_add(>sc_timer, 2000);
> + break;
> + }
> +
> + config_activate_children(self, act);
> +
> + return 0;
> +}
> +
>  void
>  ihidev_sleep(struct ihidev_softc *sc, int ms)
>  {
> @@ -580,6 +614,16 @@ ihidev_hid_desc_parse(struct ihidev_soft
>   return (0);
>  }
>  
> +void
> +ihidev_poll(void *arg)
> +{
> + struct ihidev_softc *sc = arg;
> +
> + sc->sc_frompoll = 1;
> + ihidev_intr(sc);
> + sc->sc_frompoll = 0;
> +}
> +
>  int
>  ihidev_intr(void *arg)
>  {
> @@ -589,6 +633,16 @@ ihidev_intr(void *arg)
>   u_char *p;
>   u_int rep = 0;
>  
> + if (sc->sc_dying)
> + return 1;
> +
> + if 

Re: use SMBIOS for inteldrm panel orientation quirks

2019-07-12 Thread Mike Larkin
On Fri, Jul 12, 2019 at 02:27:15PM +1000, Jonathan Gray wrote:
> Use SMBIOS data for panel orientation.  Uses BIOS dates when other
> strings are generic.
> 
> There are orientation quirks in drm_panel_orientation_quirks.c for:
>   Acer One 10 (S1003)
>   Asus T100HA
>   GPD MicroPC (generic strings, also match on bios date)
>   GPD Pocket 2 (generic strings, also match on bios date)
>   GPD Win (same note on DMI match as GPD Pocket)
>   I.T.Works TW891
>   Lenovo Ideapad Miix 320
>   VIOS LTH17
> 
> This codepath is also called from
> 
> i915/vlv_dsi.c with the call to
> drm_connector_init_panel_orientation_property().
> 

ok mlarkin if you're looking for OKs.

-ml
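
For context on why the BIOS date is needed at all: when the vendor and
product strings are generic, a quirk entry can additionally be restricted
to a list of known BIOS dates. Below is a small sketch of that check. The
function name and the NULL-terminated list layout are assumptions for
illustration, not the drm code itself.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: decide whether a quirk applies given the BIOS date
 * reported by SMBIOS.  `allowed' is a NULL-terminated list of dates, or
 * NULL when the quirk is not restricted by date.
 */
static int
bios_date_matches(const char *bios_date, const char *const *allowed)
{
	size_t i;

	if (allowed == NULL)
		return 1;	/* no date restriction on this quirk */
	if (bios_date == NULL)
		return 0;	/* restriction present but date unknown */

	for (i = 0; allowed[i] != NULL; i++) {
		if (strcmp(allowed[i], bios_date) == 0)
			return 1;
	}
	return 0;
}
```

This is also why dmi_get_system_info() returning NULL without CONFIG_DMI
is harmless here: date-restricted quirks simply do not match.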

> Index: arch/amd64/amd64/bios.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/bios.c,v
> retrieving revision 1.37
> diff -u -p -r1.37 bios.c
> --- arch/amd64/amd64/bios.c   23 Oct 2018 17:51:32 -  1.37
> +++ arch/amd64/amd64/bios.c   12 Jul 2019 01:57:08 -
> @@ -67,6 +67,8 @@ const char *smbios_uninfo[] = {
>   "SYS-"
>  };
>  
> +char smbios_bios_date[64];
> +
>  int
>  bios_match(struct device *parent, void *match , void *aux)
>  {
> @@ -141,8 +143,11 @@ bios_attach(struct device *parent, struc
>   printf(" version \"%s\"",
>   fixstring(scratch));
>   if ((smbios_get_string(, sb->release,
> - scratch, sizeof(scratch))) != NULL)
> + scratch, sizeof(scratch))) != NULL) {
> + strlcpy(smbios_bios_date, fixstring(scratch),
> + sizeof(smbios_bios_date));
>   printf(" date %s", fixstring(scratch));
> + }
>   }
>  
>   smbios_info(sc->sc_dev.dv_xname);
> Index: arch/i386/i386/bios.c
> ===
> RCS file: /cvs/src/sys/arch/i386/i386/bios.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 bios.c
> --- arch/i386/i386/bios.c 23 Oct 2018 17:51:32 -  1.120
> +++ arch/i386/i386/bios.c 12 Jul 2019 03:45:03 -
> @@ -140,6 +140,8 @@ const char *smbios_uninfo[] = {
>  };
>  
>  
> +char smbios_bios_date[64];
> +
>  int
>  biosprobe(struct device *parent, void *match, void *aux)
>  {
> @@ -305,8 +307,12 @@ biosattach(struct device *parent, struct
>   printf(" version \"%s\"",
>   fixstring(scratch));
>   if ((smbios_get_string(, sb->release,
> - scratch, sizeof(scratch))) != NULL)
> + scratch, sizeof(scratch))) != NULL) {
> + strlcpy(smbios_bios_date,
> + fixstring(scratch),
> + sizeof(smbios_bios_date));
>   printf(" date %s", fixstring(scratch));
> + }
>   }
>   smbios_info(sc->sc_dev.dv_xname);
>  
> Index: dev/pci/drm/drm_linux.c
> ===
> RCS file: /cvs/src/sys/dev/pci/drm/drm_linux.c,v
> retrieving revision 1.43
> diff -u -p -r1.43 drm_linux.c
> --- dev/pci/drm/drm_linux.c   10 Jul 2019 16:43:19 -  1.43
> +++ dev/pci/drm/drm_linux.c   12 Jul 2019 03:46:54 -
> @@ -394,6 +394,34 @@ dmi_found(const struct dmi_system_id *ds
>   return true;
>  }
>  
> +const struct dmi_system_id *
> +dmi_first_match(const struct dmi_system_id *sysid)
> +{
> + const struct dmi_system_id *dsi;
> +
> + for (dsi = sysid; dsi->matches[0].slot != 0 ; dsi++) {
> + if (dmi_found(dsi))
> + return dsi;
> + }
> +
> + return NULL;
> +}
> +
> +#ifdef CONFIG_DMI
> +extern char smbios_bios_date[];
> +#endif
> +
> +const char *
> +dmi_get_system_info(int slot)
> +{
> + WARN_ON(slot != DMI_BIOS_DATE);
> +#ifdef CONFIG_DMI
> + if (slot == DMI_BIOS_DATE)
> + return smbios_bios_date;
> +#endif
> + return NULL;
> +}
> +
>  int
>  dmi_check_system(const struct dmi_system_id *sysid)
>  {
> Index: dev/pci/drm/i915/i915_drv.c
> ===
> RCS file: /cvs/src/sys/dev/pci/drm/i915/i915_drv.c,v
> retrieving revision 1.118
> diff -u -p -r1.118 i915_drv.c
> --- dev/pci/drm/i915/i915_drv.c   8 May 2019 15:55:56 -   1.118
> +++ dev/pci/drm/i915/i915_drv.c   12 Jul 2019 03:50:53 -
> @@ -45,6 +45,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "i915_drv.h"
>  #include "i915_trace.h"
> @@ -3598,6 +3599,7 @@ inteldrm_attachhook(struct device *self)
>   struct wsemuldisplaydev_attach_args aa;
>   const struct drm_pcidev *id = 

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-21 Thread Mike Larkin
On Fri, Jun 21, 2019 at 05:11:26PM -0300, Martin Pieuchot wrote:
> On 06/06/19(Thu) 15:16, Martin Pieuchot wrote:
> > On 02/06/19(Sun) 16:41, Martin Pieuchot wrote:
> > > On 01/06/19(Sat) 18:55, Martin Pieuchot wrote:
> > > > Diff below exists mainly for documentation and test purposes.  If
> > > > > you're not interested in how to break the scheduler internals in
> > > > pieces, don't read further and go straight to testing!
> > > > 
> > > > - First change is to stop calling tsleep(9) at PUSER.  That makes
> > > >   it clear that all "sleeping priorities" are smaller than PUSER.
> > > >   That's important to understand for the diff below.  `p_priority'
> > > >   is currently a placeholder for the "sleeping priority" and the
> > > >   "runnqueue priority".  Both fields are separated by this diff.
> > > > 
> > > > - When a thread goes to sleep, the priority argument of tsleep(9) is
> > > >   now recorded in `p_slpprio'.  This argument can be considered as part
> > > >   of the sleep queue.  Its purpose is to place the thread into a higher
> > > >   runqueue when awoken.
> > > > 
> > > > > - Currently, for stopped threads, `p_priority' corresponds to `p_usrpri'.
> > > > >   So setrunnable() has been untangled to place SSTOP and SSLEEP threads
> > > > >   in the preferred queue without having to use `p_priority'.  Note that
> > > > >   `p_usrpri' is still recalculated *after* having called setrunqueue().
> > > > >   This is currently fine because setrunnable() is called with SCHED_LOCK()
> > > > >   but it will be racy when we'll split it.
> > > > 
> > > > - A new field, `p_runprio' has been introduced.  It should be considered
> > > >   as part of the per-CPU runqueues.  It indicates where a current thread
> > > >   is placed.
> > > > 
> > > > - `spc_curpriority' is now updated at every context-switch.  That means
> > > >need_resched() won't be called after comparing an out-of-date value.
> > > >At the same time, `p_usrpri' is initialized to the highest possible
> > > >value for idle threads.
> > > > 
> > > > - resched_proc() was calling need_resched() in the following conditions:
> > > > >- If the SONPROC thread has a higher priority than the current
> > > >  running thread (itself).
> > > >- Twice in setrunnable() when we know that p_priority <= p_usrpri.
> > > >- If schedcpu() considered that a thread, after updating its prio,
> > > >  should preempt the one running on the CPU pointed by `p_cpu'. 
> > > > 
> > > > >   The diff below simplifies all of that by calling need_resched() when:
> > > >- A thread is inserted in a CPU runqueue at a higher priority than
> > > >  the one SONPROC.
> > > >- schedcpu() decides that a thread in SRUN state should preempt the
> > > >  one SONPROC.
> > > > 
> > > > > - `p_estcpu', `p_usrpri' and `p_slptime', which represent the "priority"
> > > > >   of a thread, are now updated while holding a per-thread mutex.  As a
> > > > >   result schedclock() and donice() no longer take the SCHED_LOCK(),
> > > > >   and schedcpu() almost never takes it.
> > > > 
> > > > > - With this diff top(1) and ps(1) will report the "real" `p_usrpri' value
> > > >   when displaying priorities.  This is helpful to understand what's
> > > >   happening:
> > > > 
> > > > > load averages:  0.99,  0.56,  0.25   two.lab.grenadille.net 23:42:10
> > > > > 70 threads: 68 idle, 2 on processor                          up  0:09
> > > > > CPU0:  0.0% user,  0.0% nice, 51.0% sys,  2.0% spin,  0.0% intr, 47.1% idle
> > > > > CPU1:  2.0% user,  0.0% nice, 51.0% sys,  3.9% spin,  0.0% intr, 43.1% idle
> > > > > Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M
> > > > > 
> > > > >   PID    TID PRI NICE  SIZE   RES STATE     WAIT      TIME    CPU COMMAND
> > > > > 81000 145101  72    0    0K 1664K sleep/1   bored     1:15 36.96% softnet
> > > > > 47133 244097  73    0 2984K 4408K sleep/1   netio     1:06 35.06% cvs
> > > > > 64749 522184  66    0  176K  148K onproc/1  -         0:55 28.81% nfsd
> > > > > 21615 602473 127    0    0K 1664K sleep/0   -         7:22  0.00% idle0
> > > > > 12413 606242 127    0    0K 1664K sleep/1   -         7:08  0.00% idle1
> > > > > 85778 338258  50    0 4936K 7308K idle      select    0:10  0.00% ssh
> > > > > 22771 575513  50    0  176K  148K sleep/0   nfsd      0:02  0.00% nfsd
> > > > 
> > > > 
> > > > - The removal of `p_priority' and the change that makes mi_switch()
> > > >   always update `spc_curpriority' might introduce some changes in
> > > >   behavior, especially with kernel threads that were not going through
> > > >   tsleep(9).  We currently have some situations where the priority of
> > > >   the running thread isn't correctly reflected.  This diff changes that
> > > >   which means we should be able to better understand where the problems
> > > >   are.
> > > > 
> > > > I'd be interested 

Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-06 Thread Mike Larkin
On Thu, Jun 06, 2019 at 02:55:35PM +0100, Stuart Henderson wrote:
> I'm testing the "pump my sched" and read/write unlock diffs and ran into
> the panic below. Seems more likely that it would be connected with the
> sched diff rather than anything else. I'll build a WITNESS kernel and
> see if I can get more details if it happens again.
> 

Funny thing, I ran into the same panic last night after running this diff
without issue for several days. Mine didn't include a stack trace though.

> ddb{1}> show panic
> kernel diagnostic assertion "__mp_lock_held(_lock, curcpu()) == 0" failed: file "/sys/kern/kern_lock.c", line 63
> ddb{1}> tr
> db_enter() at db_enter+0x10
> panic() at panic+0x128
> __assert(81aebffd,81ac7a49,3f,81b0e315) at __assert+0x2e
> _kernel_lock() at _kernel_lock+0xf6
> pageflttrap() at pageflttrap+0x75
> kerntrap(800033ee0ca0) at kerntrap+0x91
> alltraps_kern(6,380,fffb,2,80002201aff0,e0) at alltraps_kern+0x7b
> setrunqueue(0,8000342b2ce8,e0) at setrunqueue+0xbc
> setrunnable(8000342b2ce8,e0) at setrunnable+0xa5
> ptsignal(8000342b2ce8,13,0) at ptsignal+0x3c4
> sys_kill(800033c2fcc0,800033ee0ec0,800033ee0f20) at sys_kill+0x1c5
> syscall(800033ee0f90) at syscall+0x389
> Xsyscall(6,7a,0,7a,13,7f7e3e18) at Xsyscall+0x128
> end of kernel
> end trace frame: 0x7f7e3cb0, count: -13
> ddb{1}> ps
>    PID     TID   PPID    UID  S       FLAGS  WAIT     COMMAND
>  89165  411452   1971    667  3    0x1000b2  poll     ping
>   1971   11232  86869    667  3        0x82  piperd   check_ping
>  66051  380473  92153    667  3    0x1000b2  poll     ping
>  92153  480496  86869    667  3        0x82  piperd   check_ping
>   7608  110605   8332    667  3    0x1000b2  poll     ping
>   8332  284560  86869    667  3        0x82  piperd   check_ping
>  40097  336365  35317    667  3    0x1000b2  poll     ping
>  35317  385677  86869    667  3        0x82  piperd   check_ping
>  33714  102547  21689    667  3    0x1000b2  poll     ping
>  21689  210571  86869    667  3        0x82  piperd   check_ping
>  72142  347023  62694    667  3    0x1000b2  poll     ping
>  62694  439660  86869    667  3        0x82  piperd   check_ping
>  87347  447624  76730   1000  4   0x8000430           xscreensaver
>  37569  319522  76730   1000  2    0x800402           piecewise
>  11841  481452  18848    667  3    0x1000b2  poll     ping
>  18848  111230  86869    667  3        0x82  piperd   check_ping
>  69292  189646  44902    667  3    0x1000b2  poll     ping
>  44902   21064  86869    667  3        0x82  piperd   check_ping
>  67905  363250   1761    507  3        0x92  kqread   pickup
>  86709  146945  91951   1000  3        0x92  kqread   imap
>  36596  227858  91951    666  3        0x92  kqread   imap-login
>  90723  180281  91951   1000  3        0x92  kqread   imap
>  42867  129499  91951   1000  3        0x92  kqread   imap
>  52963  203919  91951   1000  3        0x92  kqread   imap
>  61599  167915  91951    666  3        0x92  kqread   imap-login
>  54450  154426  91951    666  3        0x92  kqread   imap-login
>  54061  103553  91951    666  3        0x92  kqread   imap-login
>  39422  362611  85872    715  3        0x90  netcon   perl
>  49809  260623  85872    715  3        0x90  lockf    perl
>   3824   92473  85872    715  3        0x90  lockf    perl
>  30990  398152  85872    715  3        0x90  lockf    perl
>   3327  264795  85872    715  3        0x90  lockf    perl
>  24406  217143  85872    715  3        0x90  lockf    perl
>  52871  404585  85872    715  3        0x90  lockf    perl
>  58220  118881  85872    715  3        0x90  lockf    perl
>  32898  307331  85872    715  3        0x90  lockf    perl
>  50203  396987  85872    715  3        0x90  lockf    perl
>  93024   24785      1   1000  3    0x100080  select   ssh
>  22261  482889  52840   1000  3    0x100083  poll     ssh
>  52840  404967  61470   1000  3    0x10008b  pause    ksh
>  35796  132597  61470   1000  3        0xb0  netio    urxvt
>  61470  485028  72107   1000  3        0xb2  select   urxvt
>  72107  196192      1   1000  3    0x10008a  pause    sh
>  70238  396933      1   1000  3    0x100080  select   ssh
>  41840  302658  10362   1000  3    0x100083  poll     ssh
>  89638  411799  91951   1000  3        0x92  kqread   imap
>  49250  507579  81821   1000  3    0x100083  ttyin    ksh
>  37608  523403  91951    666  3        0x92  kqread   imap-login
>  75552   67606  33096   1000  3    0x100083  poll     mutt
>  33096  492099  81821   1000  3    0x10008b  pause    ksh
>  81821  270355      1   1000  3    0x100080  kqread   tmux
>  24522  341281  91861   1000  3    0x100083  kqread   tmux
>  

Re: vmd(8) i8042 device implementation questions

2019-06-05 Thread Mike Larkin
On Sat, Jun 01, 2019 at 06:12:16PM -0500, Katherine Rohl wrote:
> Couple questions:
> 
> > This means no interrupt will be injected. I'm not sure if that's what you want.
> > See vm.c: vcpu_exit_inout(..). It looks like you may have manually asserted the
> > IRQ in this file, which is a bit different than what we do in other devices. That
> > may be okay, though.
> 
> The device can assert zero, one, or two IRQs depending on the state of the 
> input ports. Are we capable of asserting two IRQs at once through 
> vcpu_exit_i8042?
> 

Yes, just assert/deassert what you want and it should drive the PIC state
machine accordingly. LMK if that doesn't work. Assert followed by immediate
deassert for edge triggered, and assert followed by deassert after ack for
level triggered. I think all these are edge triggered though.

> > For this IRQ, if it's edge triggered, please assert then deassert the line.
> > The i8259 code should handle that properly. What you have here is a level
> > triggered interrupt (eg, the line will stay asserted until someone
> > does a 1 -> 0 transition below). Same goes for the next few cases.
> 
> Would asserting the IRQs through the exit function handle this for me if 
> that’s possible?

The return-to-vmm path can assert one single interrupt via returning something
other than 0xFF from the exit handler in vmd. But if you need two, you'll need
to either rework that logic (and fix all the other devices) or just
assert/deassert one in the exit handler function and have the return-to-vmm
path handle the other. Your call.

Note - it is probably a good idea to rework the logic at some point because the
current model can only handle "inject this edge triggered interrupt", and not
other interesting scenarios we may need later like "assert this level triggered
interrupt" or being able to flexibly configure a number of different interrupts
(edge or level triggered). There is also no provision to handle deasserting
something in the return-to-vmm path presently (that has to be handled in each
device's I/O routine in vmd - yuck).
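
To illustrate the edge versus level distinction discussed above, here is a
toy interrupt-controller model. It is not vmd's i8259 code: struct toy_pic
and the toy_* functions are invented for this sketch, and IRQ 1 / IRQ 12
are used only because they are the usual keyboard and mouse lines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy controller state, one bit per IRQ input. */
struct toy_pic {
	uint16_t line;		/* current level of each input */
	uint16_t irr;		/* requests latched by a rising edge */
	uint16_t level_mask;	/* bit set = input is level triggered */
};

/* Drive an input high; a rising edge latches edge-triggered inputs. */
static void
toy_assert(struct toy_pic *p, int irq)
{
	uint16_t bit = (uint16_t)(1u << irq);

	if (!(p->line & bit) && !(p->level_mask & bit))
		p->irr |= bit;
	p->line |= bit;
}

/* Drive an input low again. */
static void
toy_deassert(struct toy_pic *p, int irq)
{
	p->line &= (uint16_t)~(1u << irq);
}

/* Guest acknowledged the interrupt: clear the latched request. */
static void
toy_ack(struct toy_pic *p, int irq)
{
	p->irr &= (uint16_t)~(1u << irq);
}

/* Level inputs are pending while high; edge inputs while latched. */
static bool
toy_pending(const struct toy_pic *p, int irq)
{
	uint16_t bit = (uint16_t)(1u << irq);

	if (p->level_mask & bit)
		return (p->line & bit) != 0;
	return (p->irr & bit) != 0;
}

/* Edge-triggered delivery: assert followed by immediate deassert. */
static void
toy_pulse(struct toy_pic *p, int irq)
{
	toy_assert(p, irq);
	toy_deassert(p, irq);
}
```

With an edge-triggered input the pulse is enough because the request is
latched; with a level-triggered input the device has to keep the line high
until the guest has acked and only then drop it.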

> 
> > Also, please bump the revision in the vcpu struct for send/receive
> > as we will be sending a new struct layout now.
> 
> Where exactly? The file revision?



Re: vmd(8) i8042 device implementation questions

2019-06-05 Thread Mike Larkin
On Sun, Jun 02, 2019 at 03:21:34PM +0200, Jasper Lievisse Adriaanse wrote:
> On Sat, Jun 01, 2019 at 06:12:16PM -0500, Katherine Rohl wrote:
> > Couple questions:
> > 
> > > > This means no interrupt will be injected. I'm not sure if that's what you want.
> > > > See vm.c: vcpu_exit_inout(..). It looks like you may have manually asserted the
> > > > IRQ in this file, which is a bit different than what we do in other devices. That
> > > > may be okay, though.
> > 
> > The device can assert zero, one, or two IRQs depending on the state of the 
> > input ports. Are we capable of asserting two IRQs at once through 
> > vcpu_exit_i8042?
> > 
> > > For this IRQ, if it's edge triggered, please assert then deassert the line.
> > > The i8259 code should handle that properly. What you have here is a level
> > > triggered interrupt (eg, the line will stay asserted until someone
> > > does a 1 -> 0 transition below). Same goes for the next few cases.
> > 
> > Would asserting the IRQs through the exit function handle this for me if
> > that’s possible?
> > 
> > > Also, please bump the revision in the vcpu struct for send/receive
> > > as we will be sending a new struct layout now.
> > 
> > Where exactly? The file revision?
> That would be VM_DUMP_VERSION in vmd.h I reckon.
> 

Yes, that's what I meant.

> -- 
> jasper
> 



Re: Pump my sched: fewer SCHED_LOCK() & kill p_priority

2019-06-04 Thread Mike Larkin
On Mon, Jun 03, 2019 at 11:50:14AM +0200, Solene Rapenne wrote:
> On Sat, Jun 01, 2019 at 06:55:20PM -0300, Martin Pieuchot wrote:
> > Diff below exists mainly for documentation and test purposes.  If
> > you're not interested about how to break the scheduler internals in
> > pieces, don't read further and go straight to testing!
> 
> I've been running it for a few hours.
> 
> - games/gzdoom feels smoother with this patch (stuttering was certainly
>   related to audio)
> - mpd playback doesn't seem interrupted under heavy load as it
>   occasionally did
> 
> this may be coincidences or placebo effect.
> 

On one of my machines, I'm running this diff and the unlock more syscalls diff
(the one from mpi@ that claudio@ recently reposted). It does indeed seem more
responsive (unclear which diff is doing this, though). The combination is also
stable without anything out of the ordinary.

-ml



Re: vmd(8) i8042 device implementation questions

2019-05-31 Thread Mike Larkin
On Thu, May 30, 2019 at 11:06:57PM -0500, Katherine Rohl wrote:
> Apologies.
> 
> ---
> 

Hi Katherine,

 Thanks for the diff. I think we are getting close! A few comments below. 

-ml

> diff --git a/sys/arch/amd64/amd64/vmm.c b/sys/arch/amd64/amd64/vmm.c
> index 4ffb2ff899f..7de38facc78 100644
> --- a/sys/arch/amd64/amd64/vmm.c
> +++ b/sys/arch/amd64/amd64/vmm.c
> @@ -5359,6 +5359,8 @@ svm_handle_inout(struct vcpu *vcpu)
>   case IO_ICU1 ... IO_ICU1 + 1:
>   case 0x40 ... 0x43:
>   case PCKBC_AUX:
> + case 0x60:
> + case 0x64:
>   case IO_RTC ... IO_RTC + 1:
>   case IO_ICU2 ... IO_ICU2 + 1:
>   case 0x3f8 ... 0x3ff:
> diff --git a/usr.sbin/vmd/Makefile b/usr.sbin/vmd/Makefile
> index 8645df7aecf..f39d85b1b14 100644
> --- a/usr.sbin/vmd/Makefile
> +++ b/usr.sbin/vmd/Makefile
> @@ -4,7 +4,7 @@
>  
>  PROG=vmd
>  SRCS=vmd.c control.c log.c priv.c proc.c config.c vmm.c
> -SRCS+=   vm.c loadfile_elf.c pci.c virtio.c i8259.c mc146818.c
> +SRCS+=   vm.c loadfile_elf.c pci.c virtio.c i8259.c mc146818.c i8042.c
>  SRCS+=   ns8250.c i8253.c vmboot.c ufs.c disklabel.c dhcp.c packet.c
>  SRCS+=   parse.y atomicio.c vioscsi.c vioraw.c vioqcow2.c fw_cfg.c
>  
> diff --git a/usr.sbin/vmd/i8042.c b/usr.sbin/vmd/i8042.c
> new file mode 100644
> index 000..ced3b43da1c
> --- /dev/null
> +++ b/usr.sbin/vmd/i8042.c
> @@ -0,0 +1,423 @@
> +/* $OpenBSD$ */
> +/*
> + * Copyright (c) 2019 Katherine Rohl 
> + *
> + * Permission to use, copy, modify, and distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#include 
> +
> +#include 
> +
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "i8042.h"
> +#include "proc.h"
> +#include "vmm.h"
> +#include "atomicio.h"
> +#include "vmd.h"
> +
> +struct i8042 {
> + uint8_t data;
> + uint8_t command;
> + uint8_t status;
> + uint8_t internal_ram[16];
> +
> + uint8_t input_port;
> + uint8_t output_port;
> + 
> + /* Port 0 is keyboard, Port 1 is mouse */
> + uint8_t ps2_enabled[2];
> + 
> + uint32_t vm_id;
> +
> + /* Is the 8042 awaiting a command argument byte? */
> + uint8_t awaiting_argbyte;
> +};
> +
> +struct i8042 kbc;
> +pthread_mutex_t kbc_mtx;
> +
> +/*
> + * i8042_init
> + *
> + * Initialize the emulated i8042 KBC.
> + *
> + * Parameters:
> + *  vm_id: vmm(4) assigned ID of the VM
> + */
> +void
> +i8042_init(uint32_t vm_id)
> +{
> + memset(&kbc, 0, sizeof(kbc));
> +
> + if(pthread_mutex_init(&kbc_mtx, NULL) != 0)
> + fatalx("unable to create kbc mutex");
> +
> + kbc.vm_id = vm_id;
> +
> + /* Always initialize the reset bit to 1 for proper operation. */
> + kbc.output_port = KBC_DEFOUTPORT;
> + kbc.input_port = KBC_DEFINPORT;
> +}
> +
> +/*
> + * i8042_set_output_port
> + *
> + * Set the output port of the KBC.
> + * Many bits have side effects, so handle their on/off
> + *  transition effects as required.
> + */
> +static void
> +i8042_set_output_port(uint8_t byte)
> +{
> + /* 0x01: reset line
> +  * 0x02: A20 gate (not implemented)
> +  * 0x04: aux port clock
> +  * 0x08: aux port data
> +  * 0x10: data in output buffer is from kb port (IRQ1)
> +  * 0x20: data in output buffer is from aux port (IRQ12)
> +  * 0x40: kb port clock
> +  * 0x80: kb port data
> +  */
> +
> + uint8_t old_output_port = kbc.output_port;
> + kbc.output_port = byte;
> + 
> + /* 0 -> 1 transition */
> + if (((old_output_port & 0x10) == 0) &&
> + ((kbc.output_port & 0x10) == 1))
> + vcpu_assert_pic_irq(kbc.vm_id, 0, KBC_KBDIRQ);

For this IRQ, if it's edge triggered, please assert then deassert the line.
The i8259 code should handle that properly. What you have here is a level
triggered interrupt (eg, the line will stay asserted until someone
does a 1 -> 0 transition below). Same goes for the next few cases.
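
Something along these lines is what I mean, as a standalone sketch. The irq_line struct and the helper names are invented for the example (the real code would go through the PIC assert/deassert calls in vmd); only the 0x10/0x20 output-port bits come from your diff. Note that a masked value like (x & 0x10) is either 0 or 0x10, so the sketch compares against 0 rather than 1.

```c
#include <stdbool.h>
#include <stdint.h>

#define OUT_KBD_FULL	0x10	/* output buffer holds kb port data (IRQ1) */
#define OUT_AUX_FULL	0x20	/* output buffer holds aux port data (IRQ12) */

/* Toy interrupt line: current level plus a count of rising edges seen. */
struct irq_line {
	bool		asserted;
	unsigned int	edges;
};

static void
line_assert(struct irq_line *l)
{
	if (!l->asserted) {
		l->asserted = true;
		l->edges++;
	}
}

static void
line_deassert(struct irq_line *l)
{
	l->asserted = false;
}

/* Edge triggered: assert followed by an immediate deassert. */
static void
line_pulse(struct irq_line *l)
{
	line_assert(l);
	line_deassert(l);
}

/* True when 'mask' was clear in old_val and is set in new_val (0 -> 1). */
static bool
bit_rose(uint8_t old_val, uint8_t new_val, uint8_t mask)
{
	return (old_val & mask) == 0 && (new_val & mask) != 0;
}

/* Update the output port and pulse a line on each 0 -> 1 transition. */
static void
set_output_port(uint8_t *port, uint8_t byte, struct irq_line *kbd,
    struct irq_line *aux)
{
	uint8_t old = *port;

	*port = byte;
	if (bit_rose(old, byte, OUT_KBD_FULL))
		line_pulse(kbd);
	if (bit_rose(old, byte, OUT_AUX_FULL))
		line_pulse(aux);
}
```

With this shape there is nothing to do on the 1 -> 0 transition, since the line never stays asserted.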

> + if (((old_output_port & 0x20) == 0) &&
> + ((kbc.output_port & 0x20) == 1))
> + vcpu_assert_pic_irq(kbc.vm_id, 0, KBC_AUXIRQ);
> +
> + /* 1 -> 0 transition */
> + if (((old_output_port & 0x10) == 1) &&
> + ((kbc.output_port & 0x10) == 0))
> + 

Re: MSI-X fix

2019-05-30 Thread Mike Larkin
On Thu, May 30, 2019 at 08:25:39PM +0200, Mark Kettenis wrote:
> I started implementing MSI-X support for arm64 and tried to use re(4)
> for testing.  This failed miserably and when I tried to use MSI-X with
> re(4) on amd64 I noticed it didn't work either.
> 
> Turns out the Realtek hardware doesn't support 64-bit access to MSI-X
> table.  Reads return 0x and writes get ignored.
> Looking at the relevant documentation I noticed that the message
> address is actually split into two 32-bit "DWORD" entries.  Which in
> turn suggests that the Realtek folks have the standard on their side...
> 
> Diff below fixes the issue.  I did already add a "MAU32" define for
> this, even though it wasn't quite right ;).  I also took the
> opportunity to add the "function mask" bit that we'll probably need in
> the future if we start being clever with using multiple vectors and
> have a need to guarantee atomicity of making changes to the MSI-X
> config.
> 
> ok?
> 

This reads ok to me. I believe the APIC address can be changed by
writing to the APICBASE MSR, and from what I can tell there's no requirement
that the base be < 4GB. I know we don't do that today, but should we write
the high bits of 'addr' instead of writing '0' to make things future proof?

Or is this just busywork?
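
To illustrate what I mean, a standalone sketch of splitting a 64-bit message address into the two 32-bit halves. addr_lo32()/addr_hi32() and the MSIX_* names are made up for the example; the entry offsets mirror the macros in the diff below, and 0xfee00000 is just the usual default APIC base used as sample input.

```c
#include <stdint.h>

/* Byte offsets inside a 16-byte MSI-X table entry. */
#define MSIX_MA(i)	((i) * 16 + 0)	/* message address, low 32 bits */
#define MSIX_MAU32(i)	((i) * 16 + 4)	/* message address, upper 32 bits */
#define MSIX_MD(i)	((i) * 16 + 8)	/* message data */
#define MSIX_VC(i)	((i) * 16 + 12)	/* vector control */

/* Low dword of a 64-bit message address. */
static uint32_t
addr_lo32(uint64_t addr)
{
	return (uint32_t)(addr & 0xffffffffULL);
}

/* High dword; 0 for any address below 4GB. */
static uint32_t
addr_hi32(uint64_t addr)
{
	return (uint32_t)(addr >> 32);
}
```

For today's sub-4GB base the high half is 0, so writing addr_hi32(addr) instead of a literal 0 would not change behaviour; it only matters if the base ever moves above 4GB.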

Either way ok mlarkin.

-ml

> 
> Index: dev/pci/pcireg.h
> ===
> RCS file: /cvs/src/sys/dev/pci/pcireg.h,v
> retrieving revision 1.56
> diff -u -p -r1.56 pcireg.h
> --- dev/pci/pcireg.h  3 Aug 2018 22:18:13 -   1.56
> +++ dev/pci/pcireg.h  30 May 2019 18:06:39 -
> @@ -617,6 +617,7 @@ typedef u_int8_t pci_revision_t;
>   * Extended Message Signaled Interrups; access via capability pointer.
>   */
>  #define PCI_MSIX_MC_MSIXE0x8000
> +#define PCI_MSIX_MC_FM   0x4000
>  #define PCI_MSIX_MC_TBLSZ_MASK   0x07ff
>  #define PCI_MSIX_MC_TBLSZ_SHIFT  16
>  #define PCI_MSIX_MC_TBLSZ(reg)   \
> @@ -626,7 +627,7 @@ typedef u_int8_t pci_revision_t;
>  #define  PCI_MSIX_TABLE_OFF  ~(PCI_MSIX_TABLE_BIR)
>  
>  #define PCI_MSIX_MA(i)   ((i) * 16 + 0)
> -#define PCI_MSIX_MAU32(i)((i) * 16 + 0)
> +#define PCI_MSIX_MAU32(i)((i) * 16 + 4)
>  #define PCI_MSIX_MD(i)   ((i) * 16 + 8)
>  #define PCI_MSIX_VC(i)   ((i) * 16 + 12)
>  #define  PCI_MSIX_VC_MASK0x0001
> Index: arch/amd64/pci/pci_machdep.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/pci/pci_machdep.c,v
> retrieving revision 1.69
> diff -u -p -r1.69 pci_machdep.c
> --- arch/amd64/pci/pci_machdep.c  19 Aug 2018 08:23:47 -  1.69
> +++ arch/amd64/pci/pci_machdep.c  30 May 2019 18:06:39 -
> @@ -475,7 +475,8 @@ msix_addroute(struct pic *pic, struct cp
>   _bus_space_map(memt, base + offset, tblsz * 16, 0, &memh))
>   panic("%s: cannot map registers", __func__);
>  
> - bus_space_write_8(memt, memh, PCI_MSIX_MA(entry), addr);
> + bus_space_write_4(memt, memh, PCI_MSIX_MA(entry), addr);
> + bus_space_write_4(memt, memh, PCI_MSIX_MAU32(entry), 0);
>   bus_space_write_4(memt, memh, PCI_MSIX_MD(entry), vec);
>   bus_space_barrier(memt, memh, PCI_MSIX_MA(entry), 16,
>   BUS_SPACE_BARRIER_WRITE);
> 



Re: vmd(8) i8042 device implementation questions

2019-05-29 Thread Mike Larkin
On Tue, May 28, 2019 at 09:57:02PM -0500, Katherine Rohl wrote:
> I have my i8042 device for vmd(8) mostly implemented. It’s only missing a few 
> commands, but since there are no PS/2 input devices yet, there isn’t very 
> much in the way of testing I can do beyond ensuring that commands act as they 
> should. 
> 
> I have a couple questions about expected behavior, mostly related to the 8042 
> output port.
> 
> * The 8042 controls the CPU reset line, but I’m not sure what the behavior 
> should be in terms of vmm/vmd. Send a reset command if the reset line gets 
> pulled low?

The reset line being asserted (deasserted?) can be shunted to the same code in
vmd that processes triple faults, thus resetting the CPU.

> * The 8042 controls Gate A20, which doesn’t seem to exist in the VM. I just 
> have the output line not hooked up to anything. That OK?

Ignore for now.

> * There is nothing hooked up to the PS/2 ports since adding input devices is 
> out of scope for my current diff.
> * Commands F0-F3 pulse 8042 output port bits low for 6uS. How should such 
> timing be implemented in a vmd device?
> 

We don't have very precise timing, so I'm not sure how you're going to get a
6uS pulse. What does an attached device do with such a pulse in a normal
scenario?

> Once I get these cleared up, I should have a diff ready by the weekend. The 
> 8042 already gets detected in the OpenBSD boot process!
> 
> Katherine
> 



Re: Attach kvm-clock to Linux guests on VMM

2019-05-27 Thread Mike Larkin
On Mon, May 27, 2019 at 03:53:11AM -0700, Renato Aguiar wrote:
> Hi,
> 
> The following patch makes Linux guests use kvm-clock by setting KVM's CPUID
> signature on VMM:
> 

By saying the hypervisor is KVM to all guests, does this cause the guests
to make other assumptions we don't want?

> Index: sys/arch/amd64/amd64/vmm.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.245
> diff -u -p -r1.245 vmm.c
> --- sys/arch/amd64/amd64/vmm.c  17 May 2019 19:07:15 -  1.245
> +++ sys/arch/amd64/amd64/vmm.c  27 May 2019 09:44:07 -
> @@ -238,6 +238,7 @@ extern uint64_t tsc_frequency;
> extern int tsc_is_invariant;
> 
> const char *vmm_hv_signature = VMM_HV_SIGNATURE;
> +const char *kvm_hv_signature = KVM_HV_SIGNATURE;
> 
> const struct kmem_pa_mode vmm_kp_contig = {
>   .kp_constraint = _constraint,
> @@ -6433,7 +6434,14 @@ vmm_handle_cpuid(struct vcpu *vcpu)
>   *rcx = *((uint32_t *)&vmm_hv_signature[4]);
>   *rdx = *((uint32_t *)&vmm_hv_signature[8]);
>   break;
> + case 0x4100:/* KVM CPUID signature */
> + *rax = 0;
> + *rbx = *((uint32_t *)&kvm_hv_signature[0]);
> + *rcx = *((uint32_t *)&kvm_hv_signature[4]);
> + *rdx = *((uint32_t *)&kvm_hv_signature[8]);
> + break;
>   case 0x4001:/* KVM hypervisor features */
> + case 0x4101:
>   *rax = (1 << KVM_FEATURE_CLOCKSOURCE2) |
>   (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>   *rbx = 0;
> Index: sys/arch/amd64/include/vmmvar.h
> ===
> RCS file: /cvs/src/sys/arch/amd64/include/vmmvar.h,v
> retrieving revision 1.66
> diff -u -p -r1.66 vmmvar.h
> --- sys/arch/amd64/include/vmmvar.h   17 May 2019 19:07:16 -  1.66
> +++ sys/arch/amd64/include/vmmvar.h   27 May 2019 09:44:07 -
> @@ -22,6 +22,7 @@
> #define _MACHINE_VMMVAR_H_
> 
> #define VMM_HV_SIGNATURE  "OpenBSDVMM58"
> +#define KVM_HV_SIGNATURE "KVMKVMKVM\0\0\0"
> 
> #define VMM_MAX_MEM_RANGES16
> #define VMM_MAX_DISKS_PER_VM  4
> Index: sys/arch/i386/include/vmmvar.h
> ===
> RCS file: /cvs/src/sys/arch/i386/include/vmmvar.h,v
> retrieving revision 1.19
> diff -u -p -r1.19 vmmvar.h
> --- sys/arch/i386/include/vmmvar.h  18 Jan 2019 01:34:50 -  1.19
> +++ sys/arch/i386/include/vmmvar.h  27 May 2019 09:44:07 -
> @@ -15,3 +15,4 @@
>  */
> 
> #define VMM_HV_SIGNATURE  "OpenBSDVMM58"
> +#define KVM_HV_SIGNATURE "KVMKVMKVM\0\0\0"
> Index: sys/dev/pv/pvbus.c
> ===
> RCS file: /cvs/src/sys/dev/pv/pvbus.c,v
> retrieving revision 1.19
> diff -u -p -r1.19 pvbus.c
> --- sys/dev/pv/pvbus.c  13 May 2019 15:40:34 -  1.19
> +++ sys/dev/pv/pvbus.c  27 May 2019 09:44:08 -
> @@ -85,7 +85,7 @@ struct pvbus_type {
>   void(*init)(struct pvbus_hv *);
>   void(*print)(struct pvbus_hv *);
> } pvbus_types[PVBUS_MAX] = {
> - { "KVMKVMKVM\0\0\0","KVM",  pvbus_kvm },
> + { KVM_HV_SIGNATURE, "KVM",  pvbus_kvm },
>   { "Microsoft Hv",   "Hyper-V", pvbus_hyperv, pvbus_hyperv_print },
>   { "VMwareVMware",   "VMware" },
>   { "XenVMMXenVMM",   "Xen",  pvbus_xen, pvbus_xen_print },
> 
> 
> -- 
> Renato Aguiar
> 



Re: vmd(8): slight NS8250 fix

2019-05-27 Thread Mike Larkin
On Sun, May 26, 2019 at 11:38:37PM -0700, Mike Larkin wrote:
> On Wed, May 22, 2019 at 08:05:50PM -0500, Katherine Rohl wrote:
> > Hi,
> > 
> > Adjusted NS8250 behavior in vmd(8) so it gets detected as an 8250 and not a 
> > 16450 by OpenBSD’s boot process. Also generalized some of the COM1-specific 
> > I/O address definitions to support adding COM2 (and COM3, and COM4…) in the 
> > future.
> > 
> > Tested by logging into my VM with the virtual serial console and 
> > reinstalling 6.5, everything was fine. The boot process detected an NS8250.
> > 
> 
> It occurred to me that by always returning 0xFF, a guest might write 0xFF to
> scratch, then read it back, and we'd always return 0xFF. They might then
> think that the scratch register worked. That's pretty bizarre, but I've seen
> a lot of bizarre behaviour in various guests.
> 

Theo pointed out that this is probably the right way (always return 0xFF), so
I'll commit the original version.

Thanks.

-ml

> The slightly revised version below just inverts whatever was written, ensuring
> that whatever is read is not what was written. This should make anyone doing
> such calculation think the scratch doesn't work.
> 
> Tested on Linux and OpenBSD guests.
> 
> Thoughts?
> 
> -ml
> 
> Index: ns8250.c
> ===
> RCS file: /cvs/src/usr.sbin/vmd/ns8250.c,v
> retrieving revision 1.20
> diff -u -p -a -u -r1.20 ns8250.c
> --- ns8250.c  11 Mar 2019 17:08:52 -  1.20
> +++ ns8250.c  27 May 2019 06:34:38 -
> @@ -74,6 +74,7 @@ ns8250_init(int fd, uint32_t vmid)
>   }
>   com1_dev.fd = fd;
>   com1_dev.irq = 4;
> + com1_dev.portid = NS8250_COM1;
>   com1_dev.rcv_pending = 0;
>   com1_dev.vmid = vmid;
>   com1_dev.byte_out = 0;
> @@ -509,7 +510,8 @@ vcpu_process_com_scr(struct vm_exit *vei
>   /*
>* vei_dir == VEI_DIR_OUT : out instruction
>*
> -  * Write to SCR
> +  * The 8250 does not have a scratch register. Make sure that whatever
> +  * was written is not what gets read back.
>*/
>   if (vei->vei.vei_dir == VEI_DIR_OUT) {
>   com1_dev.regs.scr = vei->vei.vei_data;
> @@ -517,9 +519,11 @@ vcpu_process_com_scr(struct vm_exit *vei
>   /*
>* vei_dir == VEI_DIR_IN : in instruction
>*
> -  * Read from SCR
> +  * Read from SCR - invert whatever was previously written,
> +  * to ensure the guest doesn't think the scratch register
> +  * works.
>*/
> - set_return_data(vei, com1_dev.regs.scr);
> + set_return_data(vei, ~com1_dev.regs.scr);
>   }
>  }
>  
> @@ -647,6 +651,7 @@ ns8250_restore(int fd, int con_fd, uint3
>   }
>   com1_dev.fd = con_fd;
>   com1_dev.irq = 4;
> + com1_dev.portid = NS8250_COM1;
>   com1_dev.rcv_pending = 0;
>   com1_dev.vmid = vmid;
>   com1_dev.byte_out = 0;
> Index: ns8250.h
> ===
> RCS file: /cvs/src/usr.sbin/vmd/ns8250.h,v
> retrieving revision 1.7
> diff -u -p -a -u -r1.7 ns8250.h
> --- ns8250.h  11 Mar 2019 17:08:52 -  1.7
> +++ ns8250.h  27 May 2019 06:31:13 -
> @@ -18,14 +18,30 @@
>  /*
>   * Emulated 8250 UART
>   */
> -#define COM1_DATA  0x3f8
> -#define COM1_IER   0x3f9
> -#define COM1_IIR   0x3fa
> -#define COM1_LCR   0x3fb
> -#define COM1_MCR   0x3fc
> -#define COM1_LSR   0x3fd
> -#define COM1_MSR   0x3fe
> -#define COM1_SCR   0x3ff
> +#define COM1_BASE   0x3f8
> +#define COM1_DATA  COM1_BASE+COM_OFFSET_DATA
> +#define COM1_IER   COM1_BASE+COM_OFFSET_IER
> +#define COM1_IIR   COM1_BASE+COM_OFFSET_IIR
> +#define COM1_LCR   COM1_BASE+COM_OFFSET_LCR
> +#define COM1_MCR   COM1_BASE+COM_OFFSET_MCR
> +#define COM1_LSR   COM1_BASE+COM_OFFSET_LSR
> +#define COM1_MSR   COM1_BASE+COM_OFFSET_MSR
> +#define COM1_SCR   COM1_BASE+COM_OFFSET_SCR
> +
> +#define COM_OFFSET_DATA 0
> +#define COM_OFFSET_IER  1
> +#define COM_OFFSET_IIR  2
> +#define COM_OFFSET_LCR  3
> +#define COM_OFFSET_MCR  4
> +#define COM_OFFSET_LSR  5
> +#define COM_OFFSET_MSR  6
> +#define COM_OFFSET_SCR  7
> +
> +/* ns8250 port identifier */
> +enum ns8250_portid {
> + NS8250_COM1,
> + NS8250_COM2,
> +};
>  
>  /* ns8250 UART registers */
>  struct ns8250_regs {
> @@ -50,6 +66,7 @@ struct ns8250_dev {
>   struct event rate;
>   struct event wake;
>   struct timeval rate_tv;
> + enum ns8250_portid portid;
>   int fd;
>   int irq;
>   int rcv_pending;



Re: vmd(8): slight NS8250 fix

2019-05-27 Thread Mike Larkin
On Wed, May 22, 2019 at 08:05:50PM -0500, Katherine Rohl wrote:
> Hi,
> 
> Adjusted NS8250 behavior in vmd(8) so it gets detected as an 8250 and not a 
> 16450 by OpenBSD’s boot process. Also generalized some of the COM1-specific 
> I/O address definitions to support adding COM2 (and COM3, and COM4…) in the 
> future.
> 
> Tested by logging into my VM with the virtual serial console and reinstalling 
> 6.5, everything was fine. The boot process detected an NS8250.
> 

It occurred to me that by always returning 0xFF, a guest might write 0xFF to
scratch, then read it back, and we'd always return 0xFF. They might then
think that the scratch register worked. That's pretty bizarre, but I've seen
a lot of bizarre behaviour in various guests.

The slightly revised version below just inverts whatever was written, ensuring
that whatever is read is not what was written. This should make anyone doing
such calculation think the scratch doesn't work.
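
A standalone sketch of the probe I have in mind, and why inverting defeats it for every value. None of this is vmd or guest code; the function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* An emulated SCR read, given the last byte the guest wrote. */
typedef uint8_t (*scr_read_fn)(uint8_t last_written);

/* Emulation A: reads always return 0xFF. */
static uint8_t
scr_always_ff(uint8_t last_written)
{
	(void)last_written;
	return 0xFF;
}

/* Emulation B: reads return the bitwise inverse of the last write. */
static uint8_t
scr_inverted(uint8_t last_written)
{
	return (uint8_t)~last_written;
}

/*
 * Guest-style probe: write a pattern, read it back, and conclude a
 * scratch register exists if the two match.
 */
static bool
probe_finds_scratch(scr_read_fn read_scr, uint8_t pattern)
{
	return read_scr(pattern) == pattern;
}
```

With emulation A the probe is fooled only by the pattern 0xFF; with emulation B no 8-bit pattern can ever read back equal to itself.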

Tested on Linux and OpenBSD guests.

Thoughts?

-ml

Index: ns8250.c
===
RCS file: /cvs/src/usr.sbin/vmd/ns8250.c,v
retrieving revision 1.20
diff -u -p -a -u -r1.20 ns8250.c
--- ns8250.c  11 Mar 2019 17:08:52 -  1.20
+++ ns8250.c  27 May 2019 06:34:38 -
@@ -74,6 +74,7 @@ ns8250_init(int fd, uint32_t vmid)
}
com1_dev.fd = fd;
com1_dev.irq = 4;
+   com1_dev.portid = NS8250_COM1;
com1_dev.rcv_pending = 0;
com1_dev.vmid = vmid;
com1_dev.byte_out = 0;
@@ -509,7 +510,8 @@ vcpu_process_com_scr(struct vm_exit *vei
/*
 * vei_dir == VEI_DIR_OUT : out instruction
 *
-* Write to SCR
+* The 8250 does not have a scratch register. Make sure that whatever
+* was written is not what gets read back.
 */
if (vei->vei.vei_dir == VEI_DIR_OUT) {
com1_dev.regs.scr = vei->vei.vei_data;
@@ -517,9 +519,11 @@ vcpu_process_com_scr(struct vm_exit *vei
/*
 * vei_dir == VEI_DIR_IN : in instruction
 *
-* Read from SCR
+* Read from SCR - invert whatever was previously written,
+* to ensure the guest doesn't think the scratch register
+* works.
 */
-   set_return_data(vei, com1_dev.regs.scr);
+   set_return_data(vei, ~com1_dev.regs.scr);
}
 }
 
@@ -647,6 +651,7 @@ ns8250_restore(int fd, int con_fd, uint3
}
com1_dev.fd = con_fd;
com1_dev.irq = 4;
+   com1_dev.portid = NS8250_COM1;
com1_dev.rcv_pending = 0;
com1_dev.vmid = vmid;
com1_dev.byte_out = 0;
Index: ns8250.h
===
RCS file: /cvs/src/usr.sbin/vmd/ns8250.h,v
retrieving revision 1.7
diff -u -p -a -u -r1.7 ns8250.h
--- ns8250.h  11 Mar 2019 17:08:52 -  1.7
+++ ns8250.h  27 May 2019 06:31:13 -
@@ -18,14 +18,30 @@
 /*
  * Emulated 8250 UART
  */
-#define COM1_DATA  0x3f8
-#define COM1_IER   0x3f9
-#define COM1_IIR   0x3fa
-#define COM1_LCR   0x3fb
-#define COM1_MCR   0x3fc
-#define COM1_LSR   0x3fd
-#define COM1_MSR   0x3fe
-#define COM1_SCR   0x3ff
+#define COM1_BASE   0x3f8
+#define COM1_DATA  COM1_BASE+COM_OFFSET_DATA
+#define COM1_IER   COM1_BASE+COM_OFFSET_IER
+#define COM1_IIR   COM1_BASE+COM_OFFSET_IIR
+#define COM1_LCR   COM1_BASE+COM_OFFSET_LCR
+#define COM1_MCR   COM1_BASE+COM_OFFSET_MCR
+#define COM1_LSR   COM1_BASE+COM_OFFSET_LSR
+#define COM1_MSR   COM1_BASE+COM_OFFSET_MSR
+#define COM1_SCR   COM1_BASE+COM_OFFSET_SCR
+
+#define COM_OFFSET_DATA 0
+#define COM_OFFSET_IER  1
+#define COM_OFFSET_IIR  2
+#define COM_OFFSET_LCR  3
+#define COM_OFFSET_MCR  4
+#define COM_OFFSET_LSR  5
+#define COM_OFFSET_MSR  6
+#define COM_OFFSET_SCR  7
+
+/* ns8250 port identifier */
+enum ns8250_portid {
+   NS8250_COM1,
+   NS8250_COM2,
+};
 
 /* ns8250 UART registers */
 struct ns8250_regs {
@@ -50,6 +66,7 @@ struct ns8250_dev {
struct event rate;
struct event wake;
struct timeval rate_tv;
+   enum ns8250_portid portid;
int fd;
int irq;
int rcv_pending;



Re: vmd: tweak mc146818 periodic interrupt updating

2019-05-26 Thread Mike Larkin
On Sun, May 26, 2019 at 08:14:43PM +0200, Jasper Lievisse Adriaanse wrote:
> Hi,
> 
> Whilst looking at the mc146818 code in vmd I noticed something that
> initially struck me as a pasto as the same code is present in the
> rtc_update_regb() function. However it led me to look at how other
> emulators handle the updating of registers. Based on that here's a diff
> to only reschedule the periodic interrupt after updating register A if
> something changed in register A.
> 
> When updating register A we were checking in register B if the
> PIE bit was set in order to decide if rtc_reschedule_per needed
> to be called. If that bit was changed then the timer rate would
> already have been adjusted by rtc_update_regb so the call from
> rtc_update_rega is not needed. This now matches what qemu and
> other emulators are doing too.
> 
> This has been running fine for a few days in a number of VMs.
> 
> OK?
> 

sure, ok mlarkin
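
For reference, a standalone sketch of the "only reschedule if register A's rate bits changed" idea. rega_write() and REGA_RATE_MASK are names made up for the example, and it assumes the comparison is meant to be against the previous register contents, so the old value is checked before the register is overwritten.

```c
#include <stdbool.h>
#include <stdint.h>

#define REGA_RATE_MASK	0x0f	/* low four bits of register A: rate select */

/*
 * Store a new register A value and report whether the rate-select
 * bits differ from what was there before. The comparison has to use
 * the old contents, so it runs before the store.
 */
static bool
rega_write(uint8_t *rega, uint8_t data)
{
	bool rate_changed = ((*rega ^ data) & REGA_RATE_MASK) != 0;

	*rega = data;
	return rate_changed;
}
```

A caller would reschedule the periodic timer only when rega_write() returns true.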

> Index: mc146818.c
> ===
> RCS file: /cvs/src/usr.sbin/vmd/mc146818.c,v
> retrieving revision 1.18
> diff -u -p -r1.18 mc146818.c
> --- mc146818.c  12 Jul 2018 10:15:44 -  1.18
> +++ mc146818.c  26 May 2019 12:13:32 -
> @@ -216,7 +216,7 @@ rtc_update_rega(uint32_t data)
>   __func__);
>  
>   rtc.regs[MC_REGA] = data;
> - if (rtc.regs[MC_REGB] & MC_REGB_PIE)
> + if ((rtc.regs[MC_REGA] ^ data) & 0x0f)
>   rtc_reschedule_per();
>  }
>  
> 
> -- 
> jasper
> 



Re: vmd(8): slight NS8250 fix

2019-05-25 Thread Mike Larkin
On Sat, May 25, 2019 at 01:43:43AM -0500, Katherine Rohl wrote:
> Hi,
> 
> I was able to install Centos and boot
> Alpine and Ubuntu without problems using the
> serial interface, using vmd(8) built with my diff.
> 
> (Side note, what console email program is
> the best for submitting these patches anyway?)
> 
> Katherine
> 
> > On May 23, 2019, at 12:22 AM, Mike Larkin  wrote:
> > 
> > 
> 

I will do a quick test and try to commit this weekend. Thanks!

-ml



Re: vmd(8): slight NS8250 fix

2019-05-22 Thread Mike Larkin
On Wed, May 22, 2019 at 08:05:50PM -0500, Katherine Rohl wrote:
> Hi,
> 
> Adjusted NS8250 behavior in vmd(8) so it gets detected as an 8250 and not a 
> 16450 by OpenBSD’s boot process. Also generalized some of the COM1-specific 
> I/O address definitions to support adding COM2 (and COM3, and COM4…) in the 
> future.
> 
> Tested by logging into my VM with the virtual serial console and reinstalling 
> 6.5, everything was fine. The boot process detected an NS8250.
> 

Thanks for the vmd diff.

Please test with various Linux guests also. I added the scratch reg to make
this like an 8250A, IIRC, to support Linux guests. And you'll need to test
with a few, Alpine, Ubuntu, Centos, etc, as they all had different
behaviour. If those all work then we can get this in.

-ml

> Index: ns8250.c
> ===
> RCS file: /cvs/src/usr.sbin/vmd/ns8250.c,v
> retrieving revision 1.20
> diff -u -p -u -r1.20 ns8250.c
> --- ns8250.c  11 Mar 2019 17:08:52 -  1.20
> +++ ns8250.c  23 May 2019 00:52:15 -
> @@ -74,6 +74,7 @@ ns8250_init(int fd, uint32_t vmid)
>   }
>   com1_dev.fd = fd;
>   com1_dev.irq = 4;
> + com1_dev.portid = NS8250_COM1;
>   com1_dev.rcv_pending = 0;
>   com1_dev.vmid = vmid;
>   com1_dev.byte_out = 0;
> @@ -509,10 +510,10 @@ vcpu_process_com_scr(struct vm_exit *vei
>   /*
>* vei_dir == VEI_DIR_OUT : out instruction
>*
> -  * Write to SCR
> +  * The 8250 does not have a scratch register.
>*/
>   if (vei->vei.vei_dir == VEI_DIR_OUT) {
> - com1_dev.regs.scr = vei->vei.vei_data;
> + com1_dev.regs.scr = 0xFF;
>   } else {
>   /*
>* vei_dir == VEI_DIR_IN : in instruction
> @@ -647,6 +648,7 @@ ns8250_restore(int fd, int con_fd, uint3
>   }
>   com1_dev.fd = con_fd;
>   com1_dev.irq = 4;
> + com1_dev.portid = NS8250_COM1;
>   com1_dev.rcv_pending = 0;
>   com1_dev.vmid = vmid;
>   com1_dev.byte_out = 0;
> Index: ns8250.h
> ===
> RCS file: /cvs/src/usr.sbin/vmd/ns8250.h,v
> retrieving revision 1.7
> diff -u -p -u -r1.7 ns8250.h
> --- ns8250.h  11 Mar 2019 17:08:52 -  1.7
> +++ ns8250.h  23 May 2019 00:52:15 -
> @@ -18,14 +18,30 @@
>  /*
>   * Emulated 8250 UART
>   */
> -#define COM1_DATA  0x3f8
> -#define COM1_IER   0x3f9
> -#define COM1_IIR   0x3fa
> -#define COM1_LCR   0x3fb
> -#define COM1_MCR   0x3fc
> -#define COM1_LSR   0x3fd
> -#define COM1_MSR   0x3fe
> -#define COM1_SCR   0x3ff
> +#define COM1_BASE   0x3f8
> +#define COM1_DATA  COM1_BASE+COM_OFFSET_DATA
> +#define COM1_IER   COM1_BASE+COM_OFFSET_IER
> +#define COM1_IIR   COM1_BASE+COM_OFFSET_IIR
> +#define COM1_LCR   COM1_BASE+COM_OFFSET_LCR
> +#define COM1_MCR   COM1_BASE+COM_OFFSET_MCR
> +#define COM1_LSR   COM1_BASE+COM_OFFSET_LSR
> +#define COM1_MSR   COM1_BASE+COM_OFFSET_MSR
> +#define COM1_SCR   COM1_BASE+COM_OFFSET_SCR
> +
> +#define COM_OFFSET_DATA 0
> +#define COM_OFFSET_IER  1
> +#define COM_OFFSET_IIR  2
> +#define COM_OFFSET_LCR  3
> +#define COM_OFFSET_MCR  4
> +#define COM_OFFSET_LSR  5
> +#define COM_OFFSET_MSR  6
> +#define COM_OFFSET_SCR  7
> +
> +/* ns8250 port identifier */
> +enum ns8250_portid {
> + NS8250_COM1,
> + NS8250_COM2,
> +};
>  
>  /* ns8250 UART registers */
>  struct ns8250_regs {
> @@ -50,6 +66,7 @@ struct ns8250_dev {
>   struct event rate;
>   struct event wake;
>   struct timeval rate_tv;
> + enum ns8250_portid portid;
>   int fd;
>   int irq;
>   int rcv_pending;
> 



Re: efiboot: allow bigger ucodes

2019-05-22 Thread Mike Larkin
On Tue, May 21, 2019 at 12:33:24AM +0200, Mark Kettenis wrote:
> > Date: Sat, 18 May 2019 05:58:39 +0200 (CEST)
> > From: Mark Kettenis 
> > 
> > > Date: Fri, 17 May 2019 17:56:52 -0400
> > > From: Patrick Wildt 
> > > 
> > > Hi,
> > > 
> > > claudio@ has a Kaby Lake that exceeds the 128 kB limit, being as big as
> > > 190 kB.  So, for him the ucode isn't being loaded.
> > > 
> > > On EFI systems we can simply ask EFI to allocate some memory for us.
> > > I'm not sure why this file is duplicated 4 times, but it looks like
> > > for efiboot this one is currently the file to patch.
> > > 
> > > This diff works for claudio@'s EFI booted system.  Since I don't know
> > > how much space is available on legacy systems I will not attempt to
> > > fix this for the legacy bootloader.  Someone that knows more than me
> > > about x86 can attempt that.
> > > 
> > > Feedback? ok?
> > 
> > In principle, yes, that is the right approach.
> > 
> > However, until Mike gets us to the point where we no longer care at
> > which physical address the kernel was loaded, efiboot will move the
> > kernel to a specific physical address after loading it and potentially
> > thrash all memory allocated through the EFI interfaces.  So there is a
> > chance that we will overwrite the microcode at that point.
> > 
> > So it might actually be safer to just change the limit to 256k if
> > there is a good chance the memory at the 1MB mark is unlikely to
> > be used by anything else.
> 
> Actually, since we always move the kernel to 16MB, we can use
> AllocateMaxAddress to allocate a large enough block of memory below
> that.  So the diff below should work (and does on my Skylake system
> that uses a 100k firmware).
> 
> ok?
> 

While I am frightened about gigantisch microcodes and EFI trying to fit things
into lowmem below the ISA hole, I suppose it knows what it's doing.

ok mlarkin, not sure how you want to integrate this into the MDS stuff
that is being worked on though. I'll take care of merging this into efi32/efi64
when I'm ready.
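
For the sizes mentioned in this thread, the page-count arithmetic behind the allocation works out as in the sketch below. size_to_pages() is a made-up helper that assumes 4 kB pages and rounds up; it is only meant to show the numbers, not to stand in for the real EFI macro.

```c
#include <stdint.h>

#define PAGE_SHIFT_4K	12		/* 4 kB pages (assumption) */
#define PAGE_MASK_4K	0xfffULL

/* Number of 4 kB pages needed to hold 'bytes', rounding up. */
static uint64_t
size_to_pages(uint64_t bytes)
{
	return (bytes >> PAGE_SHIFT_4K) + ((bytes & PAGE_MASK_4K) ? 1 : 0);
}
```

A 190 kB ucode file needs 48 pages and the old 128 kB limit corresponds to 32, both tiny compared to the 16MB ceiling passed to the allocator.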

> 
> Index: arch/amd64/stand/efiboot/exec_i386.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/stand/efiboot/exec_i386.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 exec_i386.c
> --- arch/amd64/stand/efiboot/exec_i386.c  10 May 2019 21:20:42 -  1.1
> +++ arch/amd64/stand/efiboot/exec_i386.c  20 May 2019 22:25:38 -
> @@ -46,8 +46,13 @@
>  #include "softraid_amd64.h"
>  #endif
>  
> +#include 
> +#include 
> +#include "eficall.h"
>  #include "efiboot.h"
>  
> +extern EFI_BOOT_SERVICES *BS;
> +
>  typedef void (*startfuncp)(int, int, int, int, int, int, int, int)
>  __attribute__ ((noreturn));
>  
> @@ -165,6 +170,7 @@ run_loadfile(uint64_t *marks, int howto)
>  void
>  ucode_load(void)
>  {
> + EFI_PHYSICAL_ADDRESS addr;
>   uint32_t model, family, stepping;
>   uint32_t dummy, signature;
>   uint32_t vendor[4];
> @@ -200,12 +206,13 @@ ucode_load(void)
>   return;
>  
>   buflen = sb.st_size;
> - if (buflen > 128*1024) {
> - printf("ucode too large\n");
> + addr = 16 * 1024 * 1024;
> + if (EFI_CALL(BS->AllocatePages, AllocateMaxAddress, EfiLoaderData,
> + EFI_SIZE_TO_PAGES(buflen), &addr) != EFI_SUCCESS) {
> + printf("cannot allocate memory for ucode\n");
>   return;
>   }
> -
> - buf = (char *)(1*1024*1024);
> + buf = (char *)((paddr_t)addr);
>  
>   if (read(fd, buf, buflen) != buflen) {
>   close(fd);
> 
> 



Re: amd64: i8254_delay(): simpler microsecond->ticks conversion

2019-05-18 Thread Mike Larkin
On Sun, May 19, 2019 at 12:47:11AM -0500, Scott Cheloha wrote:
> This code is really fidgety and I think we can do better.
> 
> If we use a 64-bit value here for microseconds the compiler arranges
> for all the math to be done with 64-bit quantities.  I'm pretty sure
> this is standard C numeric type upconversion.  As noted in the comment
> for the assembly, use of the 64-bit intermediate avoids overflow for
> most plausible inputs.
> 
> So with one line of code we get the same result as we do with the
> __GNUC__ assembly and the nigh incomprehensible default computation.
> 
> We could add error checking, but it wasn't here before, and you won't
> overflow an int of 8254 ticks until you try to delay for more than
> 1799795545 microseconds.  Maybe a KASSERT...
> 
> My toy program suggests that this simpler code is actually faster when
> compiled with both base gcc and clang.  But that's just a userspace
> benchmark, not measurement during practical execution in the kernel.
> Is the setup for delay(9) considered a hot path?
> 
> Near-identical simplification could be done in arch/i386/isa/clock.c.
> I don't have an i386 though so I don't know whether this code might be
> faster on such hardware.  It'd definitely be smaller though.
> 
> thoughts?
> 

What bug or problem is this fixing?

-ml

> Index: arch/amd64/isa/clock.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/isa/clock.c,v
> retrieving revision 1.28
> diff -u -p -r1.28 clock.c
> --- arch/amd64/isa/clock.c27 Jul 2018 21:11:31 -  1.28
> +++ arch/amd64/isa/clock.c19 May 2019 05:31:04 -
> @@ -226,6 +226,7 @@ gettick(void)
>  void
>  i8254_delay(int n)
>  {
> + int64_t microseconds;
>   int limit, tick, otick;
>   static const int delaytab[26] = {
>0,  2,  3,  4,  5,  6,  7,  9, 10, 11,
> @@ -242,31 +243,9 @@ i8254_delay(int n)
>   if (n <= 25)
>   n = delaytab[n];
>   else {
> -#ifdef __GNUC__
> - /*
> -  * Calculate ((n * TIMER_FREQ) / 1e6) using explicit assembler
> -  * code so we can take advantage of the intermediate 64-bit
> -  * quantity to prevent loss of significance.
> -  */
> - int m;
> - __asm volatile("mul %3"
> -  : "=a" (n), "=d" (m)
> -  : "0" (n), "r" (TIMER_FREQ));
> - __asm volatile("div %4"
> -  : "=a" (n), "=d" (m)
> -  : "0" (n), "1" (m), "r" (1000000));
> -#else
> - /*
> -  * Calculate ((n * TIMER_FREQ) / 1e6) without using floating
> -  * point and without any avoidable overflows.
> -  */
> - int sec = n / 1000000,
> - usec = n % 1000000;
> - n = sec * TIMER_FREQ +
> - usec * (TIMER_FREQ / 1000000) +
> - usec * ((TIMER_FREQ % 1000000) / 1000) / 1000 +
> - usec * (TIMER_FREQ % 1000) / 1000000;
> -#endif
> + /* Using a 64-bit intermediate here avoids most overflow. */
> + microseconds = n;
> + n = microseconds * TIMER_FREQ / 1000000;
>   }
>  
>   limit = TIMER_FREQ / hz;
> 
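
For illustration, a minimal stand-alone sketch of the 64-bit conversion
proposed above. The helper name usec_to_ticks() is hypothetical, and the
value 1193182 for TIMER_FREQ is an assumption made here (the diff does not
show the definition). With that value, plain 32-bit int math would already
overflow the n * TIMER_FREQ product once n exceeds 1799 microseconds, which
is why the intermediate is widened.

```c
#include <stdint.h>

#define TIMER_FREQ	1193182	/* assumed i8254 input clock, in Hz */

/*
 * Convert a delay in microseconds to i8254 ticks.  Assigning n to a
 * 64-bit variable first makes the multiplication and division happen
 * in 64 bits, so the product cannot overflow for any int input.
 */
int
usec_to_ticks(int n)
{
	int64_t microseconds = n;

	return (int)(microseconds * TIMER_FREQ / 1000000);
}
```

Under the same assumption, one second (1000000 microseconds) maps to
exactly TIMER_FREQ ticks, and the result stays within an int up to the
1799795545 microsecond bound mentioned in the mail.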



Re: efiboot: allow bigger ucodes

2019-05-18 Thread Mike Larkin
On Sat, May 18, 2019 at 05:58:39AM +0200, Mark Kettenis wrote:
> > Date: Fri, 17 May 2019 17:56:52 -0400
> > From: Patrick Wildt 
> > 
> > Hi,
> > 
> > claudio@ has a Kaby Lake that exceeds the 128 kB limit, being as big as
> > 190 kB.  So, for him the ucode isn't being loaded.
> > 
> > On EFI systems we can simply ask EFI to allocate some memory for us.
> > I'm not sure why this file is duplicated 4 times, but it looks like
> > for efiboot this one is currently the file to patch.
> > 
> > This diff works for claudio@'s EFI booted system.  Since I don't know
> > how much space is available on legacy systems I will not attempt to
> > fix this for the legacy bootloader.  Someone that knows more than me
> > about x86 can attempt that.
> > 
> > Feedback? ok?
> 
> In principle, yes, that is the right approach.
> 
> However, until Mike gets us to the point where we no longer care at
> which physical address the kernel was loaded, efiboot will move the
> kernel to a specific physical address after loading it and potentially
> thrash all memory allocated through the EFI interfaces.  So there is a
> chance that we will overwrite the microcode at that point.
> 
> So it might actually be safer to just change the limit to 256k if
> there is a good chance that the memory at the 1MB mark is unlikely to
> be used by anything else.
> 

Thanks for pointing this out. Will keep this in mind when we move to
random-PA kernel load.

-ml

> > diff --git a/sys/arch/amd64/stand/efiboot/exec_i386.c b/sys/arch/amd64/stand/efiboot/exec_i386.c
> > index e4b39d6cd0c..ea1d301f1d8 100644
> > --- a/sys/arch/amd64/stand/efiboot/exec_i386.c
> > +++ b/sys/arch/amd64/stand/efiboot/exec_i386.c
> > @@ -46,8 +46,13 @@
> >  #include "softraid_amd64.h"
> >  #endif
> >  
> > +#include 
> > +#include 
> > +#include "eficall.h"
> >  #include "efiboot.h"
> >  
> > +extern EFI_BOOT_SERVICES   *BS;
> > +
> >  typedef void (*startfuncp)(int, int, int, int, int, int, int, int)
> >  __attribute__ ((noreturn));
> >  
> > @@ -165,6 +170,7 @@ run_loadfile(uint64_t *marks, int howto)
> >  void
> >  ucode_load(void)
> >  {
> > +   EFI_PHYSICAL_ADDRESS addr;
> > uint32_t model, family, stepping;
> > uint32_t dummy, signature;
> > uint32_t vendor[4];
> > @@ -200,12 +206,12 @@ ucode_load(void)
> > return;
> >  
> > buflen = sb.st_size;
> > -   if (buflen > 128*1024) {
> > -   printf("ucode too large\n");
> > +   if (EFI_CALL(BS->AllocatePages, AllocateAnyPages, EfiLoaderData,
> > +   EFI_SIZE_TO_PAGES(buflen), ) != EFI_SUCCESS) {
> > +   printf("cannot allocate memory for ucode\n");
> > return;
> > }
> > -
> > -   buf = (char *)(1*1024*1024);
> > +   buf = (char *)((paddr_t)addr);
> >  
> > if (read(fd, buf, buflen) != buflen) {
> > close(fd);
> > 
> > 
> 
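
As a small aside on the allocation in the diff: AllocatePages takes a page
count rather than a byte count, and UEFI pages are 4 KiB. The sketch below
shows the round-up that a macro such as EFI_SIZE_TO_PAGES has to perform.
The helper size_to_pages() and the two constants are hypothetical stand-ins
written for this example, not the definitions from the EFI headers.

```c
#include <stddef.h>

#define PAGE_SHIFT_4K	12				/* assumed 4 KiB pages */
#define PAGE_SIZE_4K	((size_t)1 << PAGE_SHIFT_4K)

/* Number of 4 KiB pages needed to hold len bytes, rounding up. */
size_t
size_to_pages(size_t len)
{
	return (len + PAGE_SIZE_4K - 1) >> PAGE_SHIFT_4K;
}
```

With these assumptions the old 128 kB limit corresponds to 32 pages, while
a 190 kB microcode file like the one described above needs 48.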



Re: efiboot: allow bigger ucodes

2019-05-17 Thread Mike Larkin
On Fri, May 17, 2019 at 05:56:52PM -0400, Patrick Wildt wrote:
> Hi,
> 
> claudio@ has a Kaby Lake that exceeds the 128 kB limit, being as big as
> 190 kB.  So, for him the ucode isn't being loaded.
> 
> On EFI systems we can simply ask EFI to allocate some memory for us.
> I'm not sure why this file is duplicated 4 times, but it looks like
> for efiboot this one is currently the file to patch.
> 
> This diff works for claudio@'s EFI booted system.  Since I don't know
> how much space is available on legacy systems I will not attempt to
> fix this for the legacy bootloader.  Someone that knows more than me
> about x86 can attempt that.
> 
> Feedback? ok?
> 
> Patrick
> 
> diff --git a/sys/arch/amd64/stand/efiboot/exec_i386.c b/sys/arch/amd64/stand/efiboot/exec_i386.c
> index e4b39d6cd0c..ea1d301f1d8 100644
> --- a/sys/arch/amd64/stand/efiboot/exec_i386.c
> +++ b/sys/arch/amd64/stand/efiboot/exec_i386.c
> @@ -46,8 +46,13 @@
>  #include "softraid_amd64.h"
>  #endif
>  
> +#include 
> +#include 
> +#include "eficall.h"
>  #include "efiboot.h"
>  
> +extern EFI_BOOT_SERVICES *BS;
> +
>  typedef void (*startfuncp)(int, int, int, int, int, int, int, int)
>  __attribute__ ((noreturn));
>  
> @@ -165,6 +170,7 @@ run_loadfile(uint64_t *marks, int howto)
>  void
>  ucode_load(void)
>  {
> + EFI_PHYSICAL_ADDRESS addr;
>   uint32_t model, family, stepping;
>   uint32_t dummy, signature;
>   uint32_t vendor[4];
> @@ -200,12 +206,12 @@ ucode_load(void)
>   return;
>  
>   buflen = sb.st_size;
> - if (buflen > 128*1024) {
> - printf("ucode too large\n");
> + if (EFI_CALL(BS->AllocatePages, AllocateAnyPages, EfiLoaderData,
> + EFI_SIZE_TO_PAGES(buflen), ) != EFI_SUCCESS) {
> + printf("cannot allocate memory for ucode\n");
>   return;
>   }
> -
> - buf = (char *)(1*1024*1024);
> + buf = (char *)((paddr_t)addr);
>  
>   if (read(fd, buf, buflen) != buflen) {
>   close(fd);
> 

I will hold on to this diff for later when I remove efiboot and replace with
efi64/32. I'll recombine this diff at that time.

-ml



Re: Reduce the scope of SCHED_LOCK()

2019-05-15 Thread Mike Larkin
On Sun, May 12, 2019 at 06:17:04PM -0400, Martin Pieuchot wrote:
> People started complaining that the SCHED_LOCK() is contended.  Here's a
> first round at reducing its scope.
> 
> Diff below introduces a per-process mutex to protect time accounting
> fields accessed in tuagg().  tuagg() is principally called in mi_switch()
> where the SCHED_LOCK() is currently held.  Moving these fields out of
> its scope allows us to drop some SCHED_LOCK/UNLOCK dances in accounting
> path.
> 
> Note that hardclock(9) still increments p_{u,s,i}ticks without holding a
> lock.  I doubt it's worth doing anything so this diff doesn't change
> anything in that regard.
> 
> Comments, oks?
> 

I've been running this for 2 days on my main machine and it seems
solid.

Assuming you're confident you have mutexes in all the right places, ok
mlarkin when you're ready.

-ml


> Index: kern/init_main.c
> ===
> RCS file: /cvs/src/sys/kern/init_main.c,v
> retrieving revision 1.284
> diff -u -p -r1.284 init_main.c
> --- kern/init_main.c  26 Feb 2019 14:24:21 -  1.284
> +++ kern/init_main.c  12 May 2019 21:56:23 -
> @@ -518,7 +518,9 @@ main(void *framep)
>   getnanotime(>ps_start);
>   TAILQ_FOREACH(p, >ps_threads, p_thr_link) {
>   nanouptime(>p_cpu->ci_schedstate.spc_runtime);
> + mtx_enter(>ps_mtx);
>   timespecclear(>p_rtime);
> + mtx_leave(>ps_mtx);
>   }
>   }
>  
> Index: kern/kern_acct.c
> ===
> RCS file: /cvs/src/sys/kern/kern_acct.c,v
> retrieving revision 1.36
> diff -u -p -r1.36 kern_acct.c
> --- kern/kern_acct.c  28 Apr 2018 03:13:04 -  1.36
> +++ kern/kern_acct.c  12 May 2019 21:51:09 -
> @@ -179,7 +179,9 @@ acct_process(struct proc *p)
>   memcpy(acct.ac_comm, pr->ps_comm, sizeof acct.ac_comm);
>  
>   /* (2) The amount of user and system time that was used */
> + mtx_enter(>ps_mtx);
>   calctsru(>ps_tu, , , NULL);
> + mtx_leave(>ps_mtx);
>   acct.ac_utime = encode_comp_t(ut.tv_sec, ut.tv_nsec);
>   acct.ac_stime = encode_comp_t(st.tv_sec, st.tv_nsec);
>  
> Index: kern/kern_exec.c
> ===
> RCS file: /cvs/src/sys/kern/kern_exec.c,v
> retrieving revision 1.203
> diff -u -p -r1.203 kern_exec.c
> --- kern/kern_exec.c  8 Feb 2019 12:51:57 -   1.203
> +++ kern/kern_exec.c  12 May 2019 21:53:56 -
> @@ -661,8 +661,10 @@ sys_execve(struct proc *p, void *v, regi
>   }
>  
>   /* reset CPU time usage for the thread, but not the process */
> + mtx_enter(>ps_mtx);
>   timespecclear(>p_tu.tu_runtime);
>   p->p_tu.tu_uticks = p->p_tu.tu_sticks = p->p_tu.tu_iticks = 0;
> + mtx_leave(>ps_mtx);
>  
>   km_free(argp, NCARGS, _exec, _pageable);
>  
> Index: kern/kern_exit.c
> ===
> RCS file: /cvs/src/sys/kern/kern_exit.c,v
> retrieving revision 1.173
> diff -u -p -r1.173 kern_exit.c
> --- kern/kern_exit.c  23 Jan 2019 22:39:47 -  1.173
> +++ kern/kern_exit.c  12 May 2019 21:51:22 -
> @@ -282,7 +282,7 @@ exit1(struct proc *p, int rv, int flags)
>  
>   /* add thread's accumulated rusage into the process's total */
>   ruadd(rup, >p_ru);
> - tuagg(pr, p);
> + tuagg(p, NULL);
>  
>   /*
>* clear %cpu usage during swap
> @@ -294,7 +294,9 @@ exit1(struct proc *p, int rv, int flags)
>* Final thread has died, so add on our children's rusage
>* and calculate the total times
>*/
> + mtx_enter(>ps_mtx);
>   calcru(>ps_tu, >ru_utime, >ru_stime, NULL);
> + mtx_leave(>ps_mtx);
>   ruadd(rup, >ps_cru);
>  
>   /* notify interested parties of our demise and clean up */
> Index: kern/kern_fork.c
> ===
> RCS file: /cvs/src/sys/kern/kern_fork.c,v
> retrieving revision 1.209
> diff -u -p -r1.209 kern_fork.c
> --- kern/kern_fork.c  6 Jan 2019 12:59:45 -   1.209
> +++ kern/kern_fork.c  12 May 2019 21:55:57 -
> @@ -55,6 +55,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -209,6 +210,8 @@ process_initialize(struct process *pr, s
>   LIST_INIT(>ps_ftlist);
>   LIST_INIT(>ps_kqlist);
>   LIST_INIT(>ps_sigiolst);
> +
> + mtx_init(>ps_mtx, IPL_MPFLOOR);
>  
>   timeout_set(>ps_realit_to, realitexpire, pr);
>   timeout_set(>ps_rucheck_to, rucheck, pr);
> Index: kern/kern_resource.c
> ===
> RCS file: /cvs/src/sys/kern/kern_resource.c,v
> retrieving revision 1.59
> diff -u -p -r1.59 kern_resource.c
> --- kern/kern_resource.c  6 Jan 2019 12:59:45 -  
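
The locking pattern described in this diff (a dedicated mutex around the
time accounting fields, so that updates and reads no longer need the
scheduler lock) can be sketched in userland roughly as follows. This is
only an analogy under stated assumptions: it uses POSIX mutexes instead of
the kernel's mtx_enter()/mtx_leave(), and struct proc_acct and the acct_*
functions are hypothetical names invented for the example.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-process accounting state. */
struct proc_acct {
	pthread_mutex_t	mtx;	/* plays the role of ps_mtx */
	uint64_t	uticks;	/* accumulated user ticks */
	uint64_t	sticks;	/* accumulated system ticks */
};

void
acct_init(struct proc_acct *pa)
{
	pthread_mutex_init(&pa->mtx, NULL);
	pa->uticks = pa->sticks = 0;
}

/* Fold one thread's ticks into the process totals under the mutex. */
void
acct_aggregate(struct proc_acct *pa, uint64_t uticks, uint64_t sticks)
{
	pthread_mutex_lock(&pa->mtx);
	pa->uticks += uticks;
	pa->sticks += sticks;
	pthread_mutex_unlock(&pa->mtx);
}

/* Read both counters as one consistent snapshot. */
void
acct_snapshot(struct proc_acct *pa, uint64_t *uticks, uint64_t *sticks)
{
	pthread_mutex_lock(&pa->mtx);
	*uticks = pa->uticks;
	*sticks = pa->sticks;
	pthread_mutex_unlock(&pa->mtx);
}
```

The point of the narrow mutex is that readers and writers of these few
fields only contend with each other, not with every other user of a
global lock.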

Re: Unlock uvm a tiny bit more

2019-05-15 Thread Mike Larkin
On Tue, May 14, 2019 at 12:13:52AM +0200, Mark Kettenis wrote:
> This changes uvm_unmap_detach() to get rid of the "easy" entries first
> before grabbing the kernel lock.  Probably doesn't help much with the
> lock contention, but it avoids a locking problem that happens with
> pools that use kernel_map() to allocate the kva for their pages.
> 
> ok?
> 

Reads ok to me, ok mlarkin

> 
> Index: uvm/uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.243
> diff -u -p -r1.243 uvm_map.c
> --- uvm/uvm_map.c 23 Apr 2019 13:35:12 -  1.243
> +++ uvm/uvm_map.c 13 May 2019 22:09:26 -
> @@ -1538,8 +1538,18 @@ uvm_mapent_tryjoin(struct vm_map *map, s
>  void
>  uvm_unmap_detach(struct uvm_map_deadq *deadq, int flags)
>  {
> - struct vm_map_entry *entry;
> + struct vm_map_entry *entry, *tmp;
>   int waitok = flags & UVM_PLA_WAITOK;
> +
> + TAILQ_FOREACH_SAFE(entry, deadq, dfree.deadq, tmp) {
> + /* Skip entries for which we have to grab the kernel lock. */
> + if (entry->aref.ar_amap || UVM_ET_ISSUBMAP(entry) ||
> + UVM_ET_ISOBJ(entry))
> + continue;
> +
> + TAILQ_REMOVE(deadq, entry, dfree.deadq);
> + uvm_mapent_free(entry);
> + }
>  
>   if (TAILQ_EMPTY(deadq))
>   return;
> 
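
The first pass added to uvm_unmap_detach() is an instance of removing
elements from a tail queue while iterating over it, which needs the _SAFE
iterator. A self-contained userland sketch of that idiom is shown below.
The struct entry, its needs_lock flag and detach_easy() are hypothetical
names for this example, and the fallback macro is only there because some
C libraries ship a sys/queue.h without TAILQ_FOREACH_SAFE.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Fallback for systems whose <sys/queue.h> lacks the _SAFE variant. */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)			\
	for ((var) = TAILQ_FIRST(head);					\
	    (var) != NULL && ((tvar) = TAILQ_NEXT(var, field), 1);	\
	    (var) = (tvar))
#endif

struct entry {
	int			needs_lock;	/* nonzero: handle in the slow pass */
	TAILQ_ENTRY(entry)	link;
};
TAILQ_HEAD(deadq, entry);

/*
 * Unlink every entry that can be handled without the big lock and
 * return how many were removed.  Entries that need the lock stay
 * queued for a later pass.
 */
int
detach_easy(struct deadq *dq)
{
	struct entry *e, *tmp;
	int removed = 0;

	TAILQ_FOREACH_SAFE(e, dq, link, tmp) {
		if (e->needs_lock)
			continue;
		TAILQ_REMOVE(dq, e, link);
		removed++;
	}
	return removed;
}
```

If the queue is empty after this pass, the caller can return early without
ever taking the lock, which is what the TAILQ_EMPTY() check in the diff
does.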



Re: vm.conf: boot-device

2019-05-12 Thread Mike Larkin
On Sun, May 12, 2019 at 08:50:51PM +0200, Anton Lindqvist wrote:
> On Wed, May 08, 2019 at 01:02:10PM -0700, Mike Larkin wrote:
> > On Wed, May 08, 2019 at 09:41:42PM +0200, Anton Lindqvist wrote:
> > > On Wed, May 08, 2019 at 07:19:45PM +0200, Reyk Floeter wrote:
> > > > On Wed, May 08, 2019 at 06:47:53PM +0200, Anton Lindqvist wrote:
> > > > > Hi,
> > > > > A first stab at adding support for option `-B device' to vm.conf(5).
> > > > > With the diff below, I'm able to add a dedicated VM to be used with
> > > > > autoinstall(5):
> > > > > 
> > > > >   vm "amd64-install" {
> > > > >   disable
> > > > >   boot $snapshots "bsd.rd"
> > > > >   disk $home "amd64.qcow2"
> > > > >   boot-device net
> > > > >   local interface
> > > > >   }
> > > > > 
> > > > > The name `boot-device' is of course up for debate. Also, should allow
> > > > > support also be considered?
> > > > > 
> > > > > Comments? OK?
> > > > > 
> > > > 
> > > > Nice, thanks for adding this.  When I did vm.conf vs vmctl, I always
> > > > liked to have vm.conf feature-complete or even to have it as a
> > > > superset of vmctl - it is just nicer to add complex options with a
> > > > grammar than command line flags (yay getsubopt(3)!).
> > > > 
> > > > Could you adjust the manpage based on vmctl(8)?  It is much more
> > > > verbose and why should vm.conf(5) lack the details?
> > > > 
> > > > For the grammar, I think it should be done without the "-"...
> > > > - boot "bsd.foo"
> > > > - boot device net
> > > > 
> > > > ...as the parser should be able to handle this.  We should have made
> > > > it more flexible but I guess we don't want to change the grammar now?
> > > > - boot image "bsd.foo"
> > > > - boot device net
> > > 
> > > Thanks for the feedback, new diff:
> > > 
> > > * Grammar changed to `boot device' according to feedback
> > > 
> > > * Manual tweaks, thanks jmc@ and sthen@
> > > 
> > > * I also took a stab at incorporating more of the details from the vmctl
> > >   manual. This part could probably be improved...
> > > 
> > 
> > I like the diff, but there is a larger issue lurking ... :)
> > 
> > When I shuffled the man page bits last week, I just moved wording around, and
> > actually didn't notice we call out PXE in the documentation. The problem here
> > is that we don't actually support PXE in -current. I have a diff in my tree
> > that fixes this but it's not committed yet. It breaks Linux guests, and every
> > time I think I've nailed it, something else breaks. I had intended to commit
> > it after AsiaBSDCon last month but then I ran into that bug.
> > 
> > When claudio committed the "net boot" changes last year, he was pretty clear
> > what it was intended to do:
> > 
> > 
> > revision 1.7
> > date: 2018/12/06 09:20:06;  author: claudio;  state: Exp;  lines: +7 -6;  commitid: 3sAzaMHdn0pTPnpR;
> > Make it possible to define the bootdevice in vmd. This information is used
> > currently only when booting a OpenBSD kernel. If VMBOOTDEV_NET is used the
> > internal dhcp server will pass "auto_install" as boot file to the client and
> > the boot loader passes the MAC of the first interface to the kernel to indicate
> > PXE booting. Adding boot order support to SeaBIOS is not yet implemented.
> > Ok ccardenas@
> > 
> > So what this does is set up the bootarg page to tell the kernel that it
> > had been booted from PXE (even though it wasn't). That's used in conjunction
> > with the vmd-dhcp server to provide auto_install options automatically to
> > the VM. The fact that the term "PXE" was included in the commit may have
> > misled some people to think this actually means PXE is supported now, which
> > it is not.
> > 
> > So, while the diff is good, we should probably clean up the documentation
> > to reflect what we really can and cannot do. Once I clean up iPXE and
> > (really) get it working, we can go back and rev the documentation again.
> > 
> > Claudio, please correct me if I am wrong in the interpreta

Re: vm.conf: boot-device

2019-05-08 Thread Mike Larkin
On Wed, May 08, 2019 at 09:41:42PM +0200, Anton Lindqvist wrote:
> On Wed, May 08, 2019 at 07:19:45PM +0200, Reyk Floeter wrote:
> > On Wed, May 08, 2019 at 06:47:53PM +0200, Anton Lindqvist wrote:
> > > Hi,
> > > A first stab at adding support for option `-B device' to vm.conf(5).
> > > With the diff below, I'm able to add a dedicated VM to be used with
> > > autoinstall(5):
> > > 
> > >   vm "amd64-install" {
> > >   disable
> > >   boot $snapshots "bsd.rd"
> > >   disk $home "amd64.qcow2"
> > >   boot-device net
> > >   local interface
> > >   }
> > > 
> > > The name `boot-device' is of course up for debate. Also, should allow
> > > support also be considered?
> > > 
> > > Comments? OK?
> > > 
> > 
> > Nice, thanks for adding this.  When I did vm.conf vs vmctl, I always
> > liked to have vm.conf feature-complete or even to have it as a
> > superset of vmctl - it is just nicer to add complex options with a
> > grammar than command line flags (yay getsubopt(3)!).
> > 
> > Could you adjust the manpage based on vmctl(8)?  It is much more
> > verbose and why should vm.conf(5) lack the details?
> > 
> > For the grammar, I think it should be done without the "-"...
> > - boot "bsd.foo"
> > - boot device net
> > 
> > ...as the parser should be able to handle this.  We should have made
> > it more flexible but I guess we don't want to change the grammar now?
> > - boot image "bsd.foo"
> > - boot device net
> 
> Thanks for the feedback, new diff:
> 
> * Grammar changed to `boot device' according to feedback
> 
> * Manual tweaks, thanks jmc@ and sthen@
> 
> * I also took a stab at incorporating more of the details from the vmctl
>   manual. This part could probably be improved...
> 

I like the diff, but there is a larger issue lurking ... :)

When I shuffled the man page bits last week, I just moved wording around, and
actually didn't notice we call out PXE in the documentation. The problem here
is that we don't actually support PXE in -current. I have a diff in my tree
that fixes this but it's not committed yet. It breaks Linux guests, and every
time I think I've nailed it, something else breaks. I had intended to commit
it after AsiaBSDCon last month but then I ran into that bug.

When claudio committed the "net boot" changes last year, he was pretty clear
what it was intended to do:


revision 1.7
date: 2018/12/06 09:20:06;  author: claudio;  state: Exp;  lines: +7 -6;  commitid: 3sAzaMHdn0pTPnpR;
Make it possible to define the bootdevice in vmd. This information is used
currently only when booting a OpenBSD kernel. If VMBOOTDEV_NET is used the
internal dhcp server will pass "auto_install" as boot file to the client and
the boot loader passes the MAC of the first interface to the kernel to indicate
PXE booting. Adding boot order support to SeaBIOS is not yet implemented.
Ok ccardenas@

So what this does is set up the bootarg page to tell the kernel that it
had been booted from PXE (even though it wasn't). That's used in conjunction
with the vmd-dhcp server to provide auto_install options automatically to
the VM. The fact that the term "PXE" was included in the commit may have
misled some people to think this actually means PXE is supported now, which
it is not.

So, while the diff is good, we should probably clean up the documentation
to reflect what we really can and cannot do. Once I clean up iPXE and
(really) get it working, we can go back and rev the documentation again.

Claudio, please correct me if I am wrong in the interpretation.

-ml

> Index: parse.y
> ===
> RCS file: /cvs/src/usr.sbin/vmd/parse.y,v
> retrieving revision 1.50
> diff -u -p -r1.50 parse.y
> --- parse.y   13 Feb 2019 22:57:08 -  1.50
> +++ parse.y   8 May 2019 19:36:04 -
> @@ -120,12 +120,13 @@ typedef struct {
>  
>  
>  %token   INCLUDE ERROR
> -%token   ADD ALLOW BOOT CDROM DISABLE DISK DOWN ENABLE FORMAT GROUP INET6
> -%token   INSTANCE INTERFACE LLADDR LOCAL LOCKED MEMORY NIFS OWNER PATH PREFIX
> -%token   RDOMAIN SIZE SOCKET SWITCH UP VM VMID
> +%token   ADD ALLOW BOOT CDROM DEVICE DISABLE DISK DOWN ENABLE FORMAT GROUP
> +%token   INET6 INSTANCE INTERFACE LLADDR LOCAL LOCKED MEMORY NET NIFS OWNER
> +%token   PATH PREFIX RDOMAIN SIZE SOCKET SWITCH UP VM VMID
>  %token NUMBER
>  %token STRING
>  %type  lladdr
> +%type  bootdevice
>  %type  disable
>  %type  image_format
>  %type  local
> @@ -456,6 +457,9 @@ vm_opts   : disable   {
>   }
>   vmc.vmc_flags |= VMOP_CREATE_KERNEL;
>   }
> + | BOOT DEVICE bootdevice{
> + vmc.vmc_bootdevice = $3;
> + }
>   | CDROM string  {
>   if 

Re: trailing whitespace policy

2019-05-05 Thread Mike Larkin
On Sun, Apr 14, 2019 at 11:55:01AM +0900, Jerome Pinot wrote:
> Hi,
> 
> I saw a recent commit in armv7 which fixed some KNF but added some
> trailing white spaces.
> I already saw some commits that removed trailing white spaces and thought
> about sending a patch (actually joined here).
> I usually don't send this kind of patch as it's not usually accepted:
> - it doesn't do anything useful except saving a few bytes
> - it modifies code history so it makes it harder to track functional
>   changes
> 
> I, myself, clean the useless white spaces in my code.
> 
> So here is a patch against sys/arch/armv7 that removes trailing
> white spaces. When cleaning, I found an extraneous ";".
> 
> If you think it's useful, I'll submit more, by subsystems.
> 
> Here is the part of the patch which doesn't remove only spaces
> (git diff -b):
> 

Thanks, committed.

-ml

> diff --git a/sys/arch/armv7/omap/gptimer.c b/sys/arch/armv7/omap/gptimer.c
> index e87db41106e..584a1d527ba 100644
> --- a/sys/arch/armv7/omap/gptimer.c
> +++ b/sys/arch/armv7/omap/gptimer.c
> @@ -291,7 +291,7 @@ gptimer_cpu_initclocks()
>  
>   ticks_per_intr = ticks_per_second / hz;
>   ticks_err_cnt = ticks_per_second % hz;
> - ticks_err_sum = 0;; 
> + ticks_err_sum = 0;
>  
>   prcm_setclock(1, PRCM_CLK_SPEED_32);
>   prcm_setclock(2, PRCM_CLK_SPEED_32);
> 
> Here is the full patch:
> 
> diff --git a/sys/arch/armv7/armv7/armv7_machdep.c b/sys/arch/armv7/armv7/armv7_machdep.c
> index 11db647cfc2..5e85b0e31e8 100644
> --- a/sys/arch/armv7/armv7/armv7_machdep.c
> +++ b/sys/arch/armv7/armv7/armv7_machdep.c
> @@ -13,7 +13,7 @@
>   * 2. Redistributions in binary form must reproduce the above copyright
>   *notice, this list of conditions and the following disclaimer in the
>   *documentation and/or other materials provided with the distribution.
> - * 3. The name of Genetec Corporation may not be used to endorse or 
> + * 3. The name of Genetec Corporation may not be used to endorse or
>   *promote products derived from this software without specific prior
>   *written permission.
>   *
> @@ -29,7 +29,7 @@
>   * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>   * POSSIBILITY OF SUCH DAMAGE.
>   *
> - * Machine dependant functions for kernel setup for 
> + * Machine dependant functions for kernel setup for
>   * Intel DBPXA250 evaluation board (a.k.a. Lubbock).
>   * Based on iq80310_machhdep.c
>   */
> diff --git a/sys/arch/armv7/armv7/armv7_start.S b/sys/arch/armv7/armv7/armv7_start.S
> index 47160b5c0ac..a1dacda477a 100644
> --- a/sys/arch/armv7/armv7/armv7_start.S
> +++ b/sys/arch/armv7/armv7/armv7_start.S
> @@ -13,7 +13,7 @@
>   * 2. Redistributions in binary form must reproduce the above copyright
>   *notice, this list of conditions and the following disclaimer in the
>   *documentation and/or other materials provided with the distribution.
> - * 3. The name of Genetec Corporation may not be used to endorse or 
> + * 3. The name of Genetec Corporation may not be used to endorse or
>   *promote products derived from this software without specific prior
>   *written permission.
>   *
> @@ -92,9 +92,9 @@ ENTRY(hvc_call)
>   *   omap4_smc_call - issues a secure monitor API call
>   *   @r0: contains the monitor API number
>   *   @r1: contains the value to set
> - * 
> + *
>   *   This function will send a secure monitor call to the internal rom code in
> - *   the trust mode. The rom code expects that r0 will contain the value and 
> + *   the trust mode. The rom code expects that r0 will contain the value and
>   *   r12 will contain the API number, so internally the function swaps the
>   *   register settings around.  The trust mode code may also alter all the cpu
>   *   registers so everything (including the lr register) is saved on the stack
> diff --git a/sys/arch/armv7/include/bootconfig.h b/sys/arch/armv7/include/bootconfig.h
> index af5bea555b5..c3cde41ff44 100644
> --- a/sys/arch/armv7/include/bootconfig.h
> +++ b/sys/arch/armv7/include/bootconfig.h
> @@ -62,7 +62,7 @@ typedef struct _BootConfig {
>  extern BootConfig bootconfig;
>  
>  #endif   /* _KERNEL || _STANDALONE */
> -#if defined(_KERNEL) 
> +#if defined(_KERNEL)
>  extern char *boot_args;
>  extern char *boot_file;
>  #endif   /* _KERNEL */
> diff --git a/sys/arch/armv7/include/intr.h b/sys/arch/armv7/include/intr.h
> index a166a17493a..865e8de102d 100644
> --- a/sys/arch/armv7/include/intr.h
> +++ b/sys/arch/armv7/include/intr.h
> @@ -135,7 +135,7 @@ void arm_setsoftintr(int si);
>  #define _setsoftintr arm_setsoftintr
>  
>  #include 
> -
> +
>  void *arm_intr_establish(int irqno, int level, int (*func)(void *),
>  void *cookie, char *name);
>  void arm_intr_disestablish(void *cookie);
> diff --git a/sys/arch/armv7/include/machine_reg.h b/sys/arch/armv7/include/machine_reg.h
> index 645c045091b..114a7529a2a 100644
> --- a/sys/arch/armv7/include/machine_reg.h
> 

Re: More randomness for efiboot on amd64

2019-05-05 Thread Mike Larkin
On Sat, May 04, 2019 at 03:03:50PM -0600, Theo de Raadt wrote:
> Nice.
> 
> Mike -- does this collide with your current work, or does it
> not impact it?
> 

This won't collide. Thanks for checking though.

-ml

> Mark Kettenis  wrote:
> 
> > The UEFI standard defines an (optional) protocol for getting random
> > numbers.  A while back I wrote support for this for the arm64
> > bootloader.  But we should use this on amd64 as well.  I tested that
> > this actually works on a machine with a recent AMD Ryzen APU.  Not
> > sure if the firmware implements this using RDRAND or if it uses the
> > random number generator of the integrated crypto unit.  Either way,
> > this can't make matters worse since the "entropy" is XORed into the
> > random data buffer.
> > 
> > ok?
> > 
> > 
> > Index: arch/amd64/stand/efiboot/Makefile.common
> > ===
> > RCS file: /cvs/src/sys/arch/amd64/stand/efiboot/Makefile.common,v
> > retrieving revision 1.14
> > diff -u -p -r1.14 Makefile.common
> > --- arch/amd64/stand/efiboot/Makefile.common20 Apr 2019 22:59:03 -  1.14
> > +++ arch/amd64/stand/efiboot/Makefile.common4 May 2019 20:59:48 -
> > @@ -11,7 +11,7 @@ EFI_HEAP_LIMIT=   0xc0
> >  
> >  LDFLAGS+=  -nostdlib -T${.CURDIR}/../${LDSCRIPT} -Bsymbolic -shared
> >  
> > -COPTS+=-DEFIBOOT -DNEEDS_HEAP_H -I${.CURDIR}/..
> > +COPTS+=-DEFIBOOT -DFWRANDOM -DNEEDS_HEAP_H -I${.CURDIR}/..
> >  COPTS+=-I${EFIDIR}/include -I${S}/stand/boot
> >  COPTS+=-ffreestanding -std=gnu99
> >  COPTS+=-fshort-wchar -fPIC -mno-red-zone
> > @@ -24,7 +24,7 @@ AFLAGS+=  -pipe -fPIC
> >  
> >  .PATH: ${.CURDIR}/..
> >  SRCS+= self_reloc.c
> > -SRCS+= efiboot.c efidev.c efipxe.c
> > +SRCS+= efiboot.c efidev.c efipxe.c efirng.c
> >  SRCS+= conf.c
> >  
> >  .PATH: ${S}/stand/boot
> > Index: arch/amd64/stand/efiboot/efirng.c
> > ===
> > RCS file: arch/amd64/stand/efiboot/efirng.c
> > diff -N arch/amd64/stand/efiboot/efirng.c
> > --- /dev/null   1 Jan 1970 00:00:00 -
> > +++ arch/amd64/stand/efiboot/efirng.c   4 May 2019 20:59:48 -
> > @@ -0,0 +1,87 @@
> > +/* $OpenBSD: efirng.c,v 1.1 2018/04/08 13:27:22 kettenis Exp $ */
> > +
> > +/*
> > + * Copyright (c) 2018 Mark Kettenis 
> > + *
> > + * Permission to use, copy, modify, and distribute this software for any
> > + * purpose with or without fee is hereby granted, provided that the above
> > + * copyright notice and this permission notice appear in all copies.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> > + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> > + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> > + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> > + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> > + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> > + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> > + */
> > +
> > +#include 
> > +
> > +#include 
> > +#include 
> > +
> > +#include "eficall.h"
> > +#include "libsa.h"
> > +
> > +extern EFI_BOOT_SERVICES   *BS;
> > +
> > +/* Random Number Generator Protocol */
> > +
> > +#define EFI_RNG_PROTOCOL_GUID \
> > +{ 0x3152bca5, 0xeade, 0x433d, {0x86, 0x2e, 0xc0, 0x1c, 0xdc, 0x29, 0x1f, 0x44} }
> > +
> > +INTERFACE_DECL(_EFI_RNG_PROTOCOL);
> > +
> > +typedef EFI_GUID EFI_RNG_ALGORITHM;
> > +
> > +typedef
> > +EFI_STATUS
> > +(EFIAPI *EFI_RNG_GET_INFO) (
> > +IN struct _EFI_RNG_PROTOCOL*This,
> > +IN  OUT UINTN  *RNGAlgorithmListSize,
> > +OUT EFI_RNG_ALGORITHM  *RNGAlgorithmList
> > +);
> > +
> > +typedef
> > +EFI_STATUS
> > +(EFIAPI *EFI_RNG_GET_RNG) (
> > +IN struct _EFI_RNG_PROTOCOL*This,
> > +IN EFI_RNG_ALGORITHM   *RNGAlgorithm, OPTIONAL
> > +IN UINTN   RNGValueLength,
> > +OUT UINT8  *RNGValue
> > +);
> > +
> > +typedef struct _EFI_RNG_PROTOCOL {
> > +   EFI_RNG_GET_INFOGetInfo;
> > +   EFI_RNG_GET_RNG GetRNG;
> > +} EFI_RNG_PROTOCOL;
> > +
> > +static EFI_GUIDrng_guid = EFI_RNG_PROTOCOL_GUID;
> > +
> > +void
> > +fwrandom(char *buf, size_t buflen)
> > +{
> > +   EFI_STATUS   status;
> > +   EFI_RNG_PROTOCOL*rng = NULL;
> > +   UINT8   *random;
> > +   size_t   i;
> > +
> > +   status = EFI_CALL(BS->LocateProtocol, _guid, NULL, (void **));
> > +   if (rng == NULL || EFI_ERROR(status))
> > +   return;
> > +
> > +   random = alloc(buflen);
> > +
> > +   status = EFI_CALL(rng->GetRNG, rng, NULL, buflen, random);
> > +   if (EFI_ERROR(status)) {
> > +   
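
The quoted diff is cut off before the mixing step, so the following is
only a sketch of the idea stated in the mail ("the entropy is XORed into
the random data buffer"), not the code from efirng.c. The function name
xor_mix() is hypothetical. XORing firmware-supplied bytes into an existing
buffer is a reversible per-byte operation, so it does not discard what the
buffer already held, even if the firmware data turns out to be poor.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR len bytes of extra data into buf, byte by byte. */
void
xor_mix(uint8_t *buf, const uint8_t *extra, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		buf[i] ^= extra[i];
}
```

In the degenerate case where the extra data is all zeroes, the buffer is
left exactly as it was.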

Re: Small puc(4) fix

2019-05-05 Thread Mike Larkin
On Sun, May 05, 2019 at 12:57:47PM +0200, Mark Kettenis wrote:
> I noticed that Intel AMT Serial-over-LAN prints the following:
> 
> puc0 at pci0 dev 22 function 3 "Intel 9 Series KT" rev 0x03: ports: 16 com
> 
> Now there certainly is only 1 com port on the device instead of 16 and
> this is caused by a buglet in the code where it inadvertently considers
> "empty" entries in the pucdata table as com ports.  The diff below
> fixes this.
> 
> ok?
> 

ok

> 
> Index: dev/pci/pucvar.h
> ===
> RCS file: /cvs/src/sys/dev/pci/pucvar.h,v
> retrieving revision 1.16
> diff -u -p -r1.16 pucvar.h
> --- dev/pci/pucvar.h  2 May 2018 19:11:01 -   1.16
> +++ dev/pci/pucvar.h  5 May 2019 10:54:07 -
> @@ -75,7 +75,7 @@ static const struct puc_port_type puc_po
>  };
>  
>  #define PUC_IS_LPT(type) ((type) == PUC_PORT_LPT)
> -#define PUC_IS_COM(type) ((type) != PUC_PORT_LPT)
> +#define PUC_IS_COM(type) ((type) && (type) != PUC_PORT_LPT)
>  
>  #define PUC_PORT_BAR_INDEX(bar)  (((bar) - PCI_MAPREG_START) / 4)
>  
> 
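
The effect of the one-line macro change can be reproduced outside the
driver with a small sketch. The numeric values chosen for PUC_PORT_LPT and
PUC_PORT_COM, the PUC_IS_COM_OLD name and the count_com() helper are all
assumptions made for this example; only the two macro bodies are taken
from the diff. The table models a device description with one real com
port followed by empty (zero) entries.

```c
#define PUC_PORT_LPT	1	/* assumed value */
#define PUC_PORT_COM	2	/* assumed value, stands for any com type */

/* Macro before the fix: everything that is not LPT counts as com. */
#define PUC_IS_COM_OLD(type)	((type) != PUC_PORT_LPT)
/* Macro after the fix: empty (zero) entries no longer count. */
#define PUC_IS_COM(type)	((type) && (type) != PUC_PORT_LPT)

/* Count com ports in a table of n port types, with or without the fix. */
int
count_com(const int *types, int n, int use_fix)
{
	int i, count = 0;

	for (i = 0; i < n; i++) {
		if (use_fix ? PUC_IS_COM(types[i]) : PUC_IS_COM_OLD(types[i]))
			count++;
	}
	return count;
}
```

For a 16-entry table holding a single com port, the old test counts all 16
slots, matching the "ports: 16 com" line in the dmesg output above, while
the fixed test counts 1.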



Re: ccp(4) support for AMD Ryzen 3 PRO 2200GE

2019-05-01 Thread Mike Larkin
On Wed, May 01, 2019 at 06:26:49PM +0200, Mark Kettenis wrote:
> Linux doesn't have this yet, so this is pure guesswork, and I pulled
> the name out of my ass.  But the numbers I get look random enough.
> 
> ok?
> 

Sure, ok mlarkin if you didn't commit already

> P.S. I think it would make sense to rename the device IDs and the
>  strings to be similar to their companions on the chip.  I'll send
>  out a separate diff for that.
> 
> 
> Index: dev/pci/ccp_pci.c
> ===
> RCS file: /cvs/src/sys/dev/pci/ccp_pci.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 ccp_pci.c
> --- dev/pci/ccp_pci.c 20 Apr 2018 04:37:21 -  1.1
> +++ dev/pci/ccp_pci.c 1 May 2019 16:18:47 -
> @@ -46,6 +46,7 @@ static const struct pci_matchid ccp_pci_
>   { PCI_VENDOR_AMD,   PCI_PRODUCT_AMD_CCPV3 },
>   { PCI_VENDOR_AMD,   PCI_PRODUCT_AMD_CCPV5A },
>   { PCI_VENDOR_AMD,   PCI_PRODUCT_AMD_CCPV5B },
> + { PCI_VENDOR_AMD,   PCI_PRODUCT_AMD_CCPV5C },
>  };
>  
>  int
> Index: dev/pci/pcidevs
> ===
> RCS file: /cvs/src/sys/dev/pci/pcidevs,v
> retrieving revision 1.1885
> diff -u -p -r1.1885 pcidevs
> --- dev/pci/pcidevs   24 Apr 2019 03:44:50 -  1.1885
> +++ dev/pci/pcidevs   1 May 2019 16:18:47 -
> @@ -790,6 +790,7 @@ product AMD AMD64_17_1X_IOMMU 0x15d1  AMD
>  product AMD AMD64_17_1X_PCIE_1   0x15d3  AMD64 17h/1xh PCIE
>  product AMD AMD64_17_1X_PCIE_2   0x15db  AMD64 17h/1xh PCIE
>  product AMD AMD64_17_1X_PCIE_3   0x15dc  AMD64 17h/1xh PCIE
> +product AMD CCPV5C   0x15df  Cryptographic Co-processor v5c
>  product AMD AMD64_17_1X_XHCI_1   0x15e0  AMD64 17h/1xh xHCI
>  product AMD AMD64_17_1X_XHCI_2   0x15e1  AMD64 17h/1xh xHCI
>  product AMD RAVENRIDGE_HDA   0x15e3  Raven Ridge HD Audio
> Index: dev/pci/pcidevs.h
> ===
> RCS file: /cvs/src/sys/dev/pci/pcidevs.h,v
> retrieving revision 1.1878
> diff -u -p -r1.1878 pcidevs.h
> --- dev/pci/pcidevs.h 24 Apr 2019 03:45:35 -  1.1878
> +++ dev/pci/pcidevs.h 1 May 2019 16:18:47 -
> @@ -795,6 +795,7 @@
>  #define  PCI_PRODUCT_AMD_AMD64_17_1X_PCIE_1  0x15d3  /* AMD64 17h/1xh PCIE */
>  #define  PCI_PRODUCT_AMD_AMD64_17_1X_PCIE_2  0x15db  /* AMD64 17h/1xh PCIE */
>  #define  PCI_PRODUCT_AMD_AMD64_17_1X_PCIE_3  0x15dc  /* AMD64 17h/1xh PCIE */
> +#define  PCI_PRODUCT_AMD_CCPV5C  0x15df  /* Cryptographic Co-processor v5c */
>  #define  PCI_PRODUCT_AMD_AMD64_17_1X_XHCI_1  0x15e0  /* AMD64 17h/1xh xHCI */
>  #define  PCI_PRODUCT_AMD_AMD64_17_1X_XHCI_2  0x15e1  /* AMD64 17h/1xh xHCI */
>  #define  PCI_PRODUCT_AMD_RAVENRIDGE_HDA  0x15e3  /* Raven Ridge HD Audio */
> Index: dev/pci/pcidevs_data.h
> ===
> RCS file: /cvs/src/sys/dev/pci/pcidevs_data.h,v
> retrieving revision 1.1873
> diff -u -p -r1.1873 pcidevs_data.h
> --- dev/pci/pcidevs_data.h24 Apr 2019 03:45:35 -  1.1873
> +++ dev/pci/pcidevs_data.h1 May 2019 16:18:47 -
> @@ -1512,6 +1512,10 @@ static const struct pci_known_product pc
>   "AMD64 17h/1xh PCIE",
>   },
>   {
> + PCI_VENDOR_AMD, PCI_PRODUCT_AMD_CCPV5C,
> + "Cryptographic Co-processor v5c",
> + },
> + {
>   PCI_VENDOR_AMD, PCI_PRODUCT_AMD_AMD64_17_1X_XHCI_1,
>   "AMD64 17h/1xh xHCI",
>   },
> 



Re: dwxe: resetting interface on watchdog timeout

2019-04-17 Thread Mike Larkin
On Wed, Apr 17, 2019 at 09:44:43AM +0200, Sebastien Marie wrote:
> Hi,
> 
> With a pine64, I am regularly experiencing dwxe watchdog
> timeouts. Usually it is a sign that something doesn't work in the driver
> itself.
> 
> The problem I am facing currently is when watchdog timeout occurs,
> the interface is unusable. And so I need another system connected
> permanently to serial in order to login and reboot the board to get it
> working.
> 
> The following diff is still a workaround for the underlying driver
> problem. It tries to reset the interface when watchdog timeout
> occurs. But at least, the board could come back in a more accessible
> state.
> 
> When a watchdog timeout occurs, it will try to:
> - down the interface (if it is up)
> - reset it
> - up the interface (if it called down previously)
> 
> With it, I have a "stable" connection to the board via network.
> 
> Comments or OK ?
> -- 
> Sebastien Marie
> 
> 

Just to add here, in my TESTS for 6.5, all of my 20 or so PINE64s have
had a really tough time with dwxe(4). I have had to put all of them into
10baseT mode. Previously, they all had "media 100baseTX" in their
/etc/hostname.dwxe0 (and these are supposedly 1Gb devices), so even in
the past it has been really flaky. If this helps improve things, I'm all
for it, but you should probably get oks from someone who knows the
driver better.

-ml

> Index: if_dwxe.c
> ===
> RCS file: /cvs/src/sys/dev/fdt/if_dwxe.c,v
> retrieving revision 1.11
> diff -u -p -r1.11 if_dwxe.c
> --- if_dwxe.c 3 Jan 2019 00:59:58 -   1.11
> +++ if_dwxe.c 15 Apr 2019 10:21:39 -
> @@ -687,7 +687,21 @@ dwxe_ioctl(struct ifnet *ifp, u_long cmd
>  void
>  dwxe_watchdog(struct ifnet *ifp)
>  {
> - printf("%s\n", __func__);
> + struct dwxe_softc *sc = ifp->if_softc;
> + int down_up = 0;
> +
> + printf("%s: watchdog timeout\n", sc->sc_dev.dv_xname);
> + ifp->if_oerrors++;
> +
> + if (ifp->if_flags & IFF_RUNNING) {
> + down_up = 1;
> + dwxe_down(sc);
> + }
> +
> + dwxe_reset(sc);
> +
> + if (down_up == 1)
> + dwxe_up(sc);
>  }
>  
>  int
> 
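
The down/reset/up sequence from the diff can be modelled outside the kernel. This is a minimal sketch with stand-in types: struct fake_if, FAKE_IFF_RUNNING and the fake_* helpers are hypothetical and only record the order of operations; they are not the dwxe(4) functions.

```c
#include <assert.h>
#include <string.h>

#define FAKE_IFF_RUNNING	0x40	/* stand-in for IFF_RUNNING */

/* Hypothetical stand-in for the interface state the watchdog touches. */
struct fake_if {
	int	flags;
	int	oerrors;
	char	log[16];	/* order of operations: 'd', 'r', 'u' */
};

/* Append one operation to the log. */
static void
record(struct fake_if *ifp, char op)
{
	size_t len = strlen(ifp->log);

	assert(len + 1 < sizeof(ifp->log));
	ifp->log[len] = op;
	ifp->log[len + 1] = '\0';
}

static void
fake_down(struct fake_if *ifp)
{
	ifp->flags &= ~FAKE_IFF_RUNNING;
	record(ifp, 'd');
}

static void
fake_reset(struct fake_if *ifp)
{
	record(ifp, 'r');
}

static void
fake_up(struct fake_if *ifp)
{
	ifp->flags |= FAKE_IFF_RUNNING;
	record(ifp, 'u');
}

/* Same control flow as the patched dwxe_watchdog(). */
static void
fake_watchdog(struct fake_if *ifp)
{
	int down_up = 0;

	ifp->oerrors++;

	/* Only bring the interface back up if it was running before. */
	if (ifp->flags & FAKE_IFF_RUNNING) {
		down_up = 1;
		fake_down(ifp);
	}

	fake_reset(ifp);

	if (down_up == 1)
		fake_up(ifp);
}
```

The down_up flag is what keeps an administratively down interface down: the reset still happens, but the interface is only brought up again if the watchdog itself took it down.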



Re: azalia: dolby atmos suspend/resume fix

2019-04-08 Thread Mike Larkin
On Mon, Apr 08, 2019 at 10:27:06AM +0200, Stefan Sperling wrote:
> The dolby atmos hack is currently only run during boot, but it
> becomes ineffective after a suspend/resume cycle. So sound plays
> fine after a reboot but is only playing from the left speaker
> after suspend/resume.
> 
> This diff adds the hack in the resume path as well and fixes
> this issue on my matebook x.
> 
> ok?
> 
> diff bb051d67df64d185fb90f087d4ed42818e376163 /usr/src
> blob - a42ad9fc63873a7caed9af0fd2fbba508165863f
> file + sys/dev/pci/azalia.c
> --- sys/dev/pci/azalia.c
> +++ sys/dev/pci/azalia.c
> @@ -1407,6 +1407,9 @@ azalia_resume_codec(codec_t *this)
>   }
>   DELAY(100);
>  
> + if (this->qrks & AZ_QRK_WID_DOLBY_ATMOS)
> + azalia_codec_init_dolby_atmos(this);
> +
>   FOR_EACH_WIDGET(this, i) {
>   w = &this->w[i];
>   if (w->widgetcap & COP_AWCAP_POWER) {
> 

if this didn't already go in, ok mlarkin



Re: vmctl: report reliable VM state

2019-04-01 Thread Mike Larkin
On Mon, Apr 01, 2019 at 06:05:38PM +0200, Klemens Nanni wrote:
> As of now, `vmctl status test' will tell you whether the VM is running
> or not; except that "STATE" actually denotes whether the VCPU is
> currently running or halted, not whether the VM is started/running or
> stopped.
> 
> I tripped over this when trying to use
> 
>   vmctl status test | fgrep 'STATE: RUNNING'
> 
> as reliable check, where it therefore only worked when the VM was busy
> CPU wise.  According to mlarkin, this was only ever intended as internal
> debugging aid anyway.
> 
> The following diff makes vmctl(8) switch its state indicator from the
> actual VCPU state to whether the VM has a valid process ID, that is
> simply whether the respective host process runs.
> 
> Since there's only ever one VCPU per VM for now, drop the constant
> "VCPU:  0" altogether and stick with "STATE: RUNNING" or
> "STATE: STOPPED" on its own line.  That makes it easily scriptable.
> 
> 
> Given that the current behaviour is only vaguely documented in vmctl(8),
> I do not expect users to already depend on it; if so: now would be the
> time to change that into something consistent.  I have not come up with
> a nice update to the manual yet, suggestions welcome.
> 
> 
> Feedback? OK?
> 

I am ok with this. Go for it unless someone privately objected.

-ml

> Index: usr.sbin/vmctl//vmctl.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/vmctl.c,v
> retrieving revision 1.65
> diff -u -p -r1.65 vmctl.c
> --- usr.sbin/vmctl//vmctl.c   6 Dec 2018 09:23:15 -   1.65
> +++ usr.sbin/vmctl//vmctl.c   1 Apr 2019 15:01:34 -
> @@ -716,12 +716,13 @@ print_vm_info(struct vmop_info_result *l
>  {
>   struct vm_info_result *vir;
>   struct vmop_info_result *vmi;
> - size_t i, j;
> - char *vcpu_state, *tty;
> + size_t i;
> + char *tty;
>   char curmem[FMT_SCALED_STRSIZE];
>   char maxmem[FMT_SCALED_STRSIZE];
>   char user[16], group[16];
>   const char *name;
> + int running;
>  
>   printf("%5s %5s %5s %7s %7s %7s %12s %s\n", "ID", "PID", "VCPUS",
>   "MAXMEM", "CURMEM", "TTY", "OWNER", "NAME");
> @@ -729,6 +730,7 @@ print_vm_info(struct vmop_info_result *l
>   for (i = 0; i < ct; i++) {
>   vmi = &list[i];
>   vir = &vmi->vir_info;
> + running = (vir->vir_creator_pid != 0 && vir->vir_id != 0);
>   if (check_info_id(vir->vir_name, vir->vir_id)) {
>   /* get user name */
>   name = user_from_uid(vmi->vir_uid, 1);
> @@ -757,7 +759,7 @@ print_vm_info(struct vmop_info_result *l
>   (void)fmt_scaled(vir->vir_memory_size * 1024 * 1024,
>   maxmem);
>  
> - if (vir->vir_creator_pid != 0 && vir->vir_id != 0) {
> + if (running) {
>   if (*vmi->vir_ttyname == '\0')
>   tty = "-";
>   /* get tty - skip /dev/ path */
> @@ -781,19 +783,8 @@ print_vm_info(struct vmop_info_result *l
>   }
>   }
>   if (check_info_id(vir->vir_name, vir->vir_id) > 0) {
> - for (j = 0; j < vir->vir_ncpus; j++) {
> - if (vir->vir_vcpu_state[j] ==
> - VCPU_STATE_STOPPED)
> - vcpu_state = "STOPPED";
> - else if (vir->vir_vcpu_state[j] ==
> - VCPU_STATE_RUNNING)
> - vcpu_state = "RUNNING";
> - else
> - vcpu_state = "UNKNOWN";
> -
> - printf(" VCPU: %2zd STATE: %s\n",
> - j, vcpu_state);
> - }
> + printf("STATE: %s\n",
> + running ? "RUNNING" : "STOPPED");
>   }
>   }
>  }
> ===
> Stats: --- 16 lines 430 chars
> Stats: +++ 7 lines 184 chars
> Stats: -9 lines
> Stats: -246 chars
> 



Re: hibernate_io function

2019-03-29 Thread Mike Larkin
On Fri, Mar 29, 2019 at 12:58:18PM +, Shivaprashanth H wrote:
> Another section I am having difficulty understanding is the one called
> TRAMPOLINE.
> 
> I am not finding many details on this, so what's the concept here?
> 

The image unpack code places the machine into a mode compatible with our
existing "resume from S3" trampoline. This is code that resumes the machine
from real mode back into the kernel.

The code is in sys/arch/[i386,amd64]/[i386,amd64]/acpi_wakecode.s

-ml

> 
> From: Theo de Raadt 
> Sent: Friday, March 29, 2019 7:05:54 AM
> To: Mike Larkin
> Cc: Shivaprashanth H; tech@openbsd.org
> Subject: Re: hibernate_io function
> 
> Mike Larkin  wrote:
> 
> > On Thu, Mar 28, 2019 at 07:02:11AM +, Shivaprashanth H wrote:
> > > hi Larkin,
> > >
> > > yes. I am looking to port the hibernate feature from openbsd to freebsd.
> > >
> > > so in freebsd, i see dev/ada
> > >
> >
> > I don't know what physical device that corresponds to. According to the
> > FreeBSD man page, that could really be anything that communicates using
> > the ATA command set.
> >
> > Once you find out what physical device it is, you'll need to implement
> > a side-effect-free I/O routine. The ones we've built so far
> > (wd/ahci/nvme/sdmmc) all reimplement a write function that uses private
> > memory carved out of an area reserved in memory called the piglet. The
> > machine is quiesced by that point (no interrupts, etc) and the write
> > routine is required to only touch memory in the page it has been assigned
> > in the piglet. The one that is likely closest to your needs is
> > ahci_hibernate_io, found in /sys/dev/ic/ahci.c. The layout of the private
> > area in the piglet is described by the struct at the head of that function.
> >
> > Note that in the case softraid(4) is being used, the private page is shared
> > between both the side effect free softraid I/O functions and whatever
> > underlying device-specific side effect free I/O function. IIRC when I
> > wrote that code, I put one struct at one end of the private page and
> > the other struct at the other end. I am not sure if FreeBSD has a similar
> > concept.
> >
> > Good luck.
> 
> And then, you'll need the underlying VM functionality that handles this
> concept called "piglet", a reserved zone for storage so that the internal
> state can be copied side-effect-free to/from disk..
> 
> And then you'll need the actual soft-state vs hard-state storage, and
> all timeouts deactivated, in all drivers, the way that suspend/resume
> handle it.
> 
> hibernate (and suspend/resume) took about 5 years to incrementally
> develop in OpenBSD.
> 
> There are lots of pieces, which we built in relationship to our
> own ACPI stack, and you may not have the pleasure of making the
> same decisions we made.
> 
> the side-effect-free drivers are only a small piece of the total
> framework. You may need more than good luck...
> 
> 
> Disclaimer: "This message is intended only for the designated recipient(s). 
> It may contain confidential or proprietary information and may be subject to 
> other confidentiality protections. If you are not a designated recipient, you 
> may not review, copy or distribute this message. Please notify the sender by 
> e-mail and delete this message. GlobalEdge does not accept any liability for 
> virus infected mails."



Re: hibernate_io function

2019-03-28 Thread Mike Larkin
On Thu, Mar 28, 2019 at 07:02:11AM +, Shivaprashanth H wrote:
> hi Larkin,
> 
> yes. I am looking to port the hibernate feature from openbsd to freebsd.
> 
> so in freebsd, i see dev/ada
> 

I don't know what physical device that corresponds to. According to the
FreeBSD man page, that could really be anything that communicates using
the ATA command set.

Once you find out what physical device it is, you'll need to implement
a side-effect-free I/O routine. The ones we've built so far
(wd/ahci/nvme/sdmmc) all reimplement a write function that uses private
memory carved out of an area reserved in memory called the piglet. The
machine is quiesced by that point (no interrupts, etc) and the write
routine is required to only touch memory in the page it has been assigned
in the piglet. The one that is likely closest to your needs is 
ahci_hibernate_io, found in /sys/dev/ic/ahci.c. The layout of the private
area in the piglet is described by the struct at the head of that function.

Note that in the case softraid(4) is being used, the private page is shared
between both the side effect free softraid I/O functions and whatever
underlying device-specific side effect free I/O function. IIRC when I 
wrote that code, I put one struct at one end of the private page and
the other struct at the other end. I am not sure if FreeBSD has a similar
concept.
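
The "one struct at each end of the private page" layout can be sketched as follows. This is only an illustration under stated assumptions: a 4096-byte page, and two made-up structures (dev_priv and raid_priv) standing in for the device-specific and softraid private state. The real layouts are the structs at the head of the respective hibernate I/O functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRIV_PAGE_SIZE	4096	/* assumed page size */

/* Hypothetical private state of the device-specific I/O routine. */
struct dev_priv {
	uint64_t	cmd_slot;
	uint32_t	port;
};

/* Hypothetical private state of the softraid I/O routine. */
struct raid_priv {
	uint64_t	blk_offset;
	uint32_t	chunk;
};

/* Both structures must fit in the page without overlapping. */
_Static_assert(sizeof(struct dev_priv) + sizeof(struct raid_priv) <=
    PRIV_PAGE_SIZE, "private structs do not fit in one page");

/* The device-specific state sits at the start of the page. */
static struct dev_priv *
dev_priv_at(void *page)
{
	assert(page != NULL);
	return (struct dev_priv *)page;
}

/* The softraid state sits at the very end of the same page. */
static struct raid_priv *
raid_priv_at(void *page)
{
	assert(page != NULL);
	return (struct raid_priv *)((char *)page + PRIV_PAGE_SIZE -
	    sizeof(struct raid_priv));
}
```

Each routine only needs the page address to find its own state, and neither needs to know the size of the other's structure as long as the two together fit in the page.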

Good luck.

-ml


> ____
> From: Mike Larkin 
> Sent: Thursday, March 28, 2019 12:25:41 PM
> To: Shivaprashanth H
> Cc: tech@openbsd.org
> Subject: Re: hibernate_io function
> 
> On Thu, Mar 28, 2019 at 06:03:25AM +, Shivaprashanth H wrote:
> > the get_hibernate_io_function() in sys/arch/amd64/amd64/hibernate_machdep.c
> >
> > support for 'wd' and 'sd' (ahci, nvme, softraid, sdmmc) are present
> >
> > in my system i see /dev/ada0
> >
> > so which of the above are compatible with /dev/ada0 drive?
> >
> 
> Are you sure you're running OpenBSD? /dev/ada0 looks like something FreeBSD
> would use.
> 
> -ml



Re: hibernate_io function

2019-03-28 Thread Mike Larkin
On Thu, Mar 28, 2019 at 06:03:25AM +, Shivaprashanth H wrote:
> the get_hibernate_io_function() in sys/arch/amd64/amd64/hibernate_machdep.c
> 
> support for 'wd' and 'sd' (ahci, nvme, softraid, sdmmc) are present
> 
> in my system i see /dev/ada0
> 
> so which of the above are compatible with /dev/ada0 drive?
> 

Are you sure you're running OpenBSD? /dev/ada0 looks like something FreeBSD
would use.

-ml



Re: Ext4 and suspend

2019-03-18 Thread Mike Larkin
On Mon, Mar 18, 2019 at 10:42:50PM +0200, Ville Valkonen wrote:
> Hello,
> 
> just a hint for those who are having ext4 partitions mounted: that
> will prohibit suspend to work.
> 
> Pardon for my unscientific claim, but If I'm not correctly mistaken
> this was related to changes that made syncing disks faster in halting.
> For a long time I thought that also was affecting ext2 partitions too
> but a few weeks ago I was happy to find out that it only affects ext4
> partitions, even in readonly mode.
> 
> Wanted to share my findings since there might be someone else
> pondering why the suspend doesn't work. Unmount the ext4 for good and
> be happy.
> 
> --
> Kind regards,
> Ville
> 

What does 'prohibit suspend to work' even mean?

No dmesg, no logs, nothing of use in this email to point us in the right
direction.

-ml



Re: Thinkpad X1 5th gen TPM chip

2019-03-11 Thread Mike Larkin
On Tue, Mar 12, 2019 at 12:35:12AM +, Edd Barrett wrote:
> 
> Hi Mike,
> 
> We've actually spoken about this issue before on this thread:
>  
> http://openbsd-archive.7691.n7.nabble.com/Thinkpad-X1-Carbon-5th-Gen-Flaky-Suspend-td334772.html
> 
> The symptoms remain the same and a colleague who also runs OpenBSD on an 
> identical machine has the same issue.
> 
> Some time ago, I tried your strategy of moving the LED indicator flashing 
> around to see if we could see where it gets stuck.
> 
> I got as far as deducing that the problem lies in a device driver (as opposed 
> to stuff in the leadup to waking up devices), before a failed wake caused my 
> disk to corrupt to the point where it wouldn't boot.
> 
> I've not had the time or the guts to try again, but Google searches suggest 
> that people have had similar issues with Linux and even Windows on this 
> model, so I'm not even certain it's a software issue. Then again, two 
> (different) colleagues run Arch Linux on their X1 5gs and have no problem 
> suspending.
> 
> Cheers!
> 
> Edd Barrett
> 

I see. Good luck in your diagnosis!

-ml



> Mon Mar 11 23:47:54 GMT 2019 Mike Larkin :
> 
> > On Mon, Mar 11, 2019 at 11:24:42AM +, Edd Barrett wrote:
> > > Hi,
> > >
> > > I was looking at the manual page for tpm(4) and noticed that it has
> > > something to do with suspend/resume:
> > >
> > > > Functionality is limited to instructing the device to save its
> > > > state before a system suspend.
> > >
> > > The X1 5th generation has a MSFT0101 TPM chip which we don't yet attach
> > > a driver to. tpm(4) seems happy to attach to with the following small
> > > change. Then in dmesg I have:
> > >
> > > tpm0 at acpi0: TPM_ addr 0xfed4/0x5000: device 0x001b15d1 rev 0x10
> > >
> > > There are no errors relating to tpm in the dmesg.
> > >
> > > Sadly having tpm(4) attached doesn't fix suspend/resume on this machine,
> > > but I wonder if the tpm driver should be used on this machine anyway? I
> > > know nothing about tpm (really, I'm just stabbing in the dark here), so
> > > I'm hoping someone else will have more of an idea.
> > >
> > > Thanks
> > >
> >
> > What is broken with suspend/resume on this machine?
> >
> > >
> > > Index: tpm.c
> > > ===
> > > RCS file: /cvs/src/sys/dev/acpi/tpm.c,v
> > > retrieving revision 1.3
> > > diff -u -p -r1.3 tpm.c
> > > --- tpm.c 1 Jul 2018 19:40:49 - 1.3
> > > +++ tpm.c 11 Mar 2019 10:57:38 -
> > > @@ -194,6 +194,7 @@ const char *tpm_hids[] = {
> > > "BCM0102",
> > > "NSC1200",
> > > "ICO0102",
> > > + "MSFT0101",
> > > NULL
> > > };
> > >
> > >
> > > --
> > > Best Regards
> > > Edd Barrett
> > >
> > > http://www.theunixzoo.co.uk
> > >
> >
> 



Re: Thinkpad X1 5th gen TPM chip

2019-03-11 Thread Mike Larkin
On Mon, Mar 11, 2019 at 11:24:42AM +, Edd Barrett wrote:
> Hi,
> 
> I was looking at the manual page for tpm(4) and noticed that it has
> something to do with suspend/resume:
> 
> >  Functionality is limited to instructing the device to save its
> >  state before a system suspend.
> 
> The X1 5th generation has a MSFT0101 TPM chip which we don't yet attach
> a driver to. tpm(4) seems happy to attach to with the following small
> change. Then in dmesg I have:
> 
> tpm0 at acpi0: TPM_ addr 0xfed4/0x5000: device 0x001b15d1 rev 0x10
> 
> There are no errors relating to tpm in the dmesg.
> 
> Sadly having tpm(4) attached doesn't fix suspend/resume on this machine,
> but I wonder if the tpm driver should be used on this machine anyway? I
> know nothing about tpm (really, I'm just stabbing in the dark here), so
> I'm hoping someone else will have more of an idea.
> 
> Thanks
> 

What is broken with suspend/resume on this machine?

> 
> Index: tpm.c
> ===
> RCS file: /cvs/src/sys/dev/acpi/tpm.c,v
> retrieving revision 1.3
> diff -u -p -r1.3 tpm.c
> --- tpm.c 1 Jul 2018 19:40:49 -   1.3
> +++ tpm.c 11 Mar 2019 10:57:38 -
> @@ -194,6 +194,7 @@ const char *tpm_hids[] = {
>   "BCM0102",
>   "NSC1200",
>   "ICO0102",
> + "MSFT0101",
>   NULL
>  };
>  
> 
> -- 
> Best Regards
> Edd Barrett
> 
> http://www.theunixzoo.co.uk
> 
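
The matching that the one-line diff relies on is a walk over a NULL-terminated string table. This is a minimal sketch: the table below holds only the entries visible in the hunk plus the new one, and hid_matches() is a hypothetical helper, not the ACPI match routine tpm(4) actually calls.

```c
#include <assert.h>
#include <string.h>

/* Entries visible in the hunk, plus the one the diff adds. */
static const char *const tpm_hids_example[] = {
	"BCM0102",
	"NSC1200",
	"ICO0102",
	"MSFT0101",
	NULL
};

/* Return 1 if hid appears in the NULL-terminated table, else 0. */
static int
hid_matches(const char *hid, const char *const *table)
{
	assert(hid != NULL && table != NULL);
	for (; *table != NULL; table++) {
		if (strcmp(hid, *table) == 0)
			return 1;
	}
	return 0;
}
```

Adding a new identifier before the terminating NULL is all that is needed for the driver to attach, which is why the change is so small.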



Re: vmd/vmctl: improve VM name checks/error handling

2019-03-07 Thread Mike Larkin
On Thu, Mar 07, 2019 at 08:52:45PM +0100, Klemens Nanni wrote:
> vmd(8) does not support numerical names with `start' and `receive'.
> It never worked, the manuals are now clearer about this, but error
> handling can still be improved:
> 
>   $ vmctl start 60 -t test -d 60.qcow2
>   vmctl: start vm command failed: No such file or directory
> 
> That's from vm_start_complete() in vmctl.c:254 where vmd(8) unexpectedly
> returns EPERM after it failed to create the VM.
> 
>   $ vmctl receive 60
>   vmctl: invalid vm name
>   vmctl: invalid id: 60
> 
> Here, first parse_vmid(), then its caller ctl_receive() complains.
> 
> parse_vmid()'s last argument is `needname', indicating that the id to be
> parsed must not be numerical.
> 
> The following diff makes the code check for a letter in the beginning,
> adjusts `start' and `receive' to set this argument, and prints only one error.

This is not right. Each started (or defined if in /etc/vm.conf) VM is assigned
an ID. That ID can be used in place of a name. That behaviour needs to continue
to work.

My advice is to just clean up the man page and error messages, nothing more.
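
The "ID or name" behaviour can be sketched as a classifier over the command-line word. This is a minimal sketch under assumptions: vmctl itself uses strtonum(3), while this portable version uses strtoull(3), and the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

enum vmid_kind { VMID_INVALID, VMID_NUMERIC, VMID_NAME };

/*
 * Classify a command-line word as a numeric VM id or a VM name.
 * A word counts as an id only if it is entirely decimal digits and
 * fits in 32 bits; anything else non-empty is treated as a name.
 */
static enum vmid_kind
classify_vmid(const char *word, uint32_t *id)
{
	char *end;
	unsigned long long v;

	assert(id != NULL);
	if (word == NULL || *word == '\0')
		return VMID_INVALID;

	/* Reject leading whitespace and signs, which strtoull() accepts. */
	if (word[0] < '0' || word[0] > '9')
		return VMID_NAME;

	errno = 0;
	v = strtoull(word, &end, 10);
	if (errno == 0 && *end == '\0' && v <= UINT32_MAX) {
		*id = (uint32_t)v;
		return VMID_NUMERIC;
	}
	return VMID_NAME;
}
```

A caller that requires a name can then reject VMID_NUMERIC with one specific message, while commands that accept either form keep working with ids.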

-ml

> 
>   $ ./obj/vmctl start 60 -t test -d 60.qcow2
>   vmctl: invalid VM name
>   $ ./obj/vmctl receive 60
>   vmctl: invalid VM name
> 
> "vm" -> "VM" for consistency with all other "invalid VM name" occurrences
> throughout the sources.
> 
> Feedback? OK?
> 
> Index: usr.sbin/vmctl/main.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/main.c,v
> retrieving revision 1.54
> diff -u -p -r1.54 main.c
> --- usr.sbin/vmctl/main.c 1 Mar 2019 12:47:36 -   1.54
> +++ usr.sbin/vmctl/main.c 7 Mar 2019 19:14:15 -
> @@ -524,7 +524,7 @@ parse_vmid(struct parse_result *res, cha
>   id = strtonum(word, 0, UINT32_MAX, &error);
>   if (error == NULL) {
>   if (needname) {
> - warnx("invalid vm name");
> + warnx("invalid VM name");
>   return (-1);
>   } else {
>   res->id = id;
> @@ -842,8 +842,8 @@ ctl_start(struct parse_result *res, int 
>   if (argc < 2)
>   ctl_usage(res->ctl);
>  
> - if (parse_vmid(res, argv[1], 0) == -1)
> - errx(1, "invalid id: %s", argv[1]);
> + if (parse_vmid(res, argv[1], 1) == -1)
> + exit(1);
>  
>   argc--;
>   argv++;
> @@ -1046,7 +1046,7 @@ ctl_receive(struct parse_result *res, in
>   err(1, "pledge");
>   if (argc == 2) {
>   if (parse_vmid(res, argv[1], 1) == -1)
> - errx(1, "invalid id: %s", argv[1]);
> + exit(1);
>   } else if (argc != 2)
>   ctl_usage(res->ctl);
>  
> Index: usr.sbin/vmctl/vmctl.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/vmctl.c,v
> retrieving revision 1.65
> diff -u -p -r1.65 vmctl.c
> --- usr.sbin/vmctl/vmctl.c6 Dec 2018 09:23:15 -   1.65
> +++ usr.sbin/vmctl/vmctl.c7 Mar 2019 19:14:15 -
> @@ -154,12 +154,7 @@ vm_start(uint32_t start_id, const char *
>   }
>   }
>   if (name != NULL) {
> - /*
> -  * Allow VMs names with alphanumeric characters, dot, hyphen
> -  * and underscore. But disallow dot, hyphen and underscore at
> -  * the start.
> -  */
> - if (*name == '-' || *name == '.' || *name == '_')
> + if (!isalpha(*name))
>   errx(1, "invalid VM name");
>  
>   for (s = name; *s != '\0'; ++s) {
> Index: usr.sbin/vmd/vmd.c
> ===
> RCS file: /cvs/src/usr.sbin/vmd/vmd.c,v
> retrieving revision 1.108
> diff -u -p -r1.108 vmd.c
> --- usr.sbin/vmd/vmd.c9 Dec 2018 12:26:38 -   1.108
> +++ usr.sbin/vmd/vmd.c7 Mar 2019 19:14:15 -
> @@ -1257,11 +1257,7 @@ vm_register(struct privsep *ps, struct v
>   vcp->vcp_ndisks == 0 && strlen(vcp->vcp_cdrom) == 0) {
>   log_warnx("no kernel or disk/cdrom specified");
>   goto fail;
> - } else if (strlen(vcp->vcp_name) == 0) {
> - log_warnx("invalid VM name");
> - goto fail;
> - } else if (*vcp->vcp_name == '-' || *vcp->vcp_name == '.' ||
> - *vcp->vcp_name == '_') {
> + } else if (strlen(vcp->vcp_name) == 0 || !isalpha(*vcp->vcp_name)) {
>   log_warnx("invalid VM name");
>   goto fail;
>   } else {
> ===
> Stats: --- 15 lines 531 chars
> Stats: +++ 6 lines 185 chars
> Stats: -9 lines
> Stats: -346 chars
> 



Re: vmctl: usage on extra arguments

2019-03-01 Thread Mike Larkin
On Fri, Mar 01, 2019 at 01:37:58PM +0100, Antoine Jacoutot wrote:
> On Fri, Mar 01, 2019 at 01:28:58PM +0100, Klemens Nanni wrote:
> > I blatantly missed the argc/argv adjustments after getopt(3), resulting
> > in valid commands like `vmctl create a -s 1G' to fail.
> > 
> > Noticed by ajacoutot the hard way.
> > 
> > OK?
> 
> Works for me (tm).
> ok :-)
> 
> 

Thanks for the quick fix.

> > 
> > Index: usr.sbin/vmctl/main.c
> > ===
> > RCS file: /cvs/src/usr.sbin/vmctl/main.c,v
> > retrieving revision 1.53
> > diff -u -p -r1.53 main.c
> > --- usr.sbin/vmctl/main.c   1 Mar 2019 10:34:14 -   1.53
> > +++ usr.sbin/vmctl/main.c   1 Mar 2019 12:25:23 -
> > @@ -598,6 +598,8 @@ ctl_create(struct parse_result *res, int
> > /* NOTREACHED */
> > }
> > }
> > +   argc -= optind;
> > +   argv += optind;
> >  
> > if (argc > 0)
> > ctl_usage(res->ctl);
> > @@ -915,6 +917,8 @@ ctl_start(struct parse_result *res, int 
> > /* NOTREACHED */
> > }
> > }
> > +   argc -= optind;
> > +   argv += optind;
> >  
> > if (argc > 0)
> > ctl_usage(res->ctl);
> > @@ -959,6 +963,8 @@ ctl_stop(struct parse_result *res, int a
> > /* NOTREACHED */
> > }
> > }
> > +   argc -= optind;
> > +   argv += optind;
> >  
> > if (argc > 0)
> > ctl_usage(res->ctl);
> > 
> 
> -- 
> Antoine
> 



Re: vmctl: usage on extra arguments

2019-02-28 Thread Mike Larkin
On Fri, Mar 01, 2019 at 01:33:48AM +0100, Klemens Nanni wrote:
> tedu's apm(8) diff reminded me that certain vmctl(8) commands are too
> relaxed:
> 
>   $ vmctl start a b
>   vmctl: start vm command failed: Operation not permitted
>   $ vmctl stop a b
>   stopping vm a: vm not found
>   $ vmctl create a b
>   could not create a: missing size argument
>   usage:  vmctl [-v] create disk [-b base | -i disk] [-s size]
>   $ vmctl create a b -s 1G
>   could not create a: missing size argument
>   usage:  vmctl [-v] create disk [-b base | -i disk] [-s size]
>   $ vmctl create a -s 1G b
>   vmctl: raw imagefile created
> 
> `start', `stop' and `create' are the only commands with a variable
> number of options; other commands have strict checks already.
> 
> With diff below, all examples above print usage directly without doing
> anything else.
> 
> OK?
> 

sure

-ml


> Index: usr.sbin/vmctl/main.c
> ===
> RCS file: /cvs/src/usr.sbin/vmctl/main.c,v
> retrieving revision 1.52
> diff -u -p -r1.52 main.c
> --- usr.sbin/vmctl/main.c 14 Dec 2018 07:56:17 -  1.52
> +++ usr.sbin/vmctl/main.c 1 Mar 2019 00:29:12 -
> @@ -599,6 +599,9 @@ ctl_create(struct parse_result *res, int
>   }
>   }
>  
> + if (argc > 0)
> + ctl_usage(res->ctl);
> +
>   if (input) {
>   if (base && input)
>   errx(1, "conflicting -b and -i arguments");
> @@ -913,6 +916,9 @@ ctl_start(struct parse_result *res, int 
>   }
>   }
>  
> + if (argc > 0)
> + ctl_usage(res->ctl);
> +
>   for (i = res->nnets; i < res->nifs; i++) {
>   /* Add interface that is not attached to a switch */
>   if (parse_network(res, "") == -1)
> @@ -953,6 +959,9 @@ ctl_stop(struct parse_result *res, int a
>   /* NOTREACHED */
>   }
>   }
> +
> + if (argc > 0)
> + ctl_usage(res->ctl);
>  
>   /* VM id is only expected without the -a flag */
>   if ((res->action != CMD_STOPALL && ret == -1) ||
> 



Re: amd64: update PTDpaddr with new PA of PML4 for libkvm

2019-02-13 Thread Mike Larkin
On Wed, Feb 13, 2019 at 05:40:45PM +0900, Naoki Fukaumi wrote:
> Hi Mike Larkin,
> 
> since pmap_kernel is randomized, savecore(libkvm) cannot save core
> dump from dump device. (savecore: magic number mismatch)
> 
> updating PTDpaddr fixes this issue.
> 
> by the way, is there any problem to use proc0.p_addr->u_pcb.pcb_cr3
> instead of PTDpaddr in cpu_dump()?
> 

Thanks for noticing this!

Does using the proc0.p_addr->u_pcb.pcb_cr3 expansion also work?
If so, we may be able to remove PTDpaddr entirely, if we remove the
other usage in cpu_dump also.

-ml

> --
> FUKAUMI Naoki
> 
> Index: sys/arch/amd64/amd64/pmap.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
> retrieving revision 1.128
> diff -u -p -u -p -r1.128 pmap.c
> --- sys/arch/amd64/amd64/pmap.c   1 Feb 2019 21:48:48 -   1.128
> +++ sys/arch/amd64/amd64/pmap.c   13 Feb 2019 07:43:27 -
> @@ -835,6 +835,9 @@ pmap_randomize(void)
>   pmap_kernel()->pm_pdir = pml4va;
>   proc0.p_addr->u_pcb.pcb_cr3 = pml4pa;
>  
> + /* Fixup PTDpaddr for libkvm */
> + PTDpaddr = pml4pa;
> +
>   /* Fixup recursive PTE PML4E slot. We are only changing the PA */   
>   pml4va[PDIR_SLOT_PTE] = pml4pa | (pml4va[PDIR_SLOT_PTE] & ~PG_FRAME);
>  
> 
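
The last context line of the hunk swaps the physical address in a page table entry while keeping its flag bits. This is a minimal sketch of that bit manipulation: the mask value is an assumption about the amd64 PG_FRAME definition, and pte_set_pa() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed amd64 frame mask: physical address bits 12 through 51. */
#define PG_FRAME_MASK	0x000ffffffffff000ULL

/*
 * Replace the physical address stored in a PTE and keep every bit
 * outside the frame mask (present, writable, NX and so on) unchanged.
 */
static uint64_t
pte_set_pa(uint64_t pte, uint64_t new_pa)
{
	/* The new address must be page aligned and within the mask. */
	assert((new_pa & ~PG_FRAME_MASK) == 0);
	return new_pa | (pte & ~PG_FRAME_MASK);
}
```

The same expression appears in the diff context as pml4pa | (pml4va[PDIR_SLOT_PTE] & ~PG_FRAME), applied to the recursive PML4 slot.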



vmd: change virtio network interface numbering

2019-01-22 Thread Mike Larkin
I just committed a change to move the virtio network interfaces before disks
in the PCI device ordering in vmd(8). This is to fix some strangeness in how
Linux assigns device numbers (based on the PCI slot number).

On some Linux guests, you'll get a virtio network interface named enp0s3
(for example), indicating the interface is in PCI slot 3. But that slot number
was subject to the number of disks on the machine, since those were added
first.

If you used a disk image to install your guest, you'd have two disk devices
present, and your network interface would be enp0s3. During install, that name
would be written to network autoconfiguration files/scripts.

When you detach the install media, that means you have one less disk, and thus
the network interface shifts up by one, becoming enp0s2, likely breaking your
network autoconfiguration.
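
The slot shift can be modelled with simple arithmetic. This is a rough sketch, not vmd(8) code: the first virtio slot number is inferred from the enp0s3/enp0s2 example above, other devices that vmd may place on the bus are not modelled, and both function names are hypothetical.

```c
#include <assert.h>

/* Assumed from the example: two disks put the first NIC in slot 3. */
#define FIRST_VIRTIO_SLOT	1

/* Old ordering: disks are added first, network interfaces after them. */
static int
nic_slot_disks_first(int ndisks, int nic_index)
{
	assert(ndisks >= 0 && nic_index >= 0);
	return FIRST_VIRTIO_SLOT + ndisks + nic_index;
}

/* New ordering: network interfaces are added before any disks. */
static int
nic_slot_nics_first(int ndisks, int nic_index)
{
	assert(ndisks >= 0 && nic_index >= 0);
	(void)ndisks;	/* the disk count no longer matters */
	return FIRST_VIRTIO_SLOT + nic_index;
}
```

In this model the old ordering yields slot 3 with two disks and slot 2 with one, while the new ordering yields the same slot for both, which is the property the reordering is after.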

People familiar with this behaviour easily fixed it. But it's bizarre and
unexpected, and rather than reply to continual bug reports, I found it easier
to just reorder things.

This means if you have Linux guests in vmm/vmd, you'll need to make a one time
change to fixup the interface name, as it likely changed.

Note, not all Linux distributions do this. If your interface name is just
"eth0", you're likely fine.

-ml



Re: [PATCH 6/7] Rework virtio_negotiate_features()

2019-01-19 Thread Mike Larkin
On Sat, Jan 19, 2019 at 05:37:34PM +0100, Stefan Fritsch wrote:
> Add a sc_driver_features field that is automatically used by
> virtio_negotiate_features() and during reinit.
> 
> Make virtio_negotiate_features() return an error code.  Virtio 1.0 has a
> special status bit for feature negotiation that means that negotiation
> can fail. Make virtio_negotiate_features() return an error code instead
> of the features.
> 
> Make virtio_reinit_start() automatically call
> virtio_negotiate_features().
> 
> Add a convenience function virtio_has_feature() to make checking bits
> easier.
> 
> Add an error check in viomb for virtio_negotiate_features because it has
> some feature bits that may cause negotiation to fail. More error
> checking in the child drivers is still missing.

ok mlarkin
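
For anyone skimming the commit message, the flow it describes could be
modelled roughly as below. This is a standalone toy, not the code from the
patch: struct toy_virtio_softc, struct toy_device, toy_negotiate_features()
and toy_has_feature() are hypothetical names, and the "required" mask is
only an assumed stand-in for whatever makes a real device reject the
negotiated set. Only the sc_driver_features and sc_active_features field
names are taken from the diff.

```c
#include <errno.h>
#include <stdint.h>

/* Toy stand-in for the softc; field names mirror the ones in the diff. */
struct toy_virtio_softc {
	uint64_t sc_driver_features;	/* what the driver is willing to use */
	uint64_t sc_active_features;	/* result of the last negotiation */
};

/* Hypothetical device model. */
struct toy_device {
	uint64_t offered;	/* feature bits the device offers */
	uint64_t required;	/* bits the device insists on (assumption) */
};

/*
 * Return 0 on success or an errno value on failure, instead of returning
 * the negotiated feature bits.
 */
static int
toy_negotiate_features(struct toy_virtio_softc *sc,
    const struct toy_device *dev)
{
	uint64_t neg;

	sc->sc_active_features = 0;
	neg = dev->offered & sc->sc_driver_features;
	if ((neg & dev->required) != dev->required)
		return ENOTSUP;		/* device rejected the subset */
	sc->sc_active_features = neg;
	return 0;
}

/* Convenience check on the negotiated bits. */
static int
toy_has_feature(const struct toy_virtio_softc *sc, uint64_t fbit)
{
	return (sc->sc_active_features & fbit) != 0;
}
```

Because the driver's wishes live in the softc rather than in an argument,
a reinit path can rerun the same negotiation without the child driver
passing its feature mask again.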

> ---
>  sys/dev/fdt/virtio_mmio.c | 38 ++
>  sys/dev/pci/virtio_pci.c  | 39 +++
>  sys/dev/pv/if_vio.c   | 38 +-
>  sys/dev/pv/vioblk.c   | 25 +
>  sys/dev/pv/viocon.c   |  6 +++---
>  sys/dev/pv/viomb.c|  9 +
>  sys/dev/pv/viornd.c   |  2 +-
>  sys/dev/pv/vioscsi.c  |  2 +-
>  sys/dev/pv/virtio.c   | 13 +++--
>  sys/dev/pv/virtiovar.h| 15 ---
>  sys/dev/pv/vmmci.c| 13 ++---
>  11 files changed, 110 insertions(+), 90 deletions(-)
> 
> diff --git a/sys/dev/fdt/virtio_mmio.c b/sys/dev/fdt/virtio_mmio.c
> index 2d7a131125d..6e09fb0e76f 100644
> --- a/sys/dev/fdt/virtio_mmio.c
> +++ b/sys/dev/fdt/virtio_mmio.c
> @@ -91,8 +91,8 @@ void virtio_mmio_write_device_config_8(struct virtio_softc *, int, uint64_t);
>  uint16_t virtio_mmio_read_queue_size(struct virtio_softc *, uint16_t);
>  void virtio_mmio_setup_queue(struct virtio_softc *, struct virtqueue *, uint64_t);
>  void virtio_mmio_set_status(struct virtio_softc *, int);
> -uint64_t virtio_mmio_negotiate_features(struct virtio_softc *, uint64_t,
> -   const struct virtio_feature_name *);
> +int  virtio_mmio_negotiate_features(struct virtio_softc *,
> +const struct virtio_feature_name *);
>  int  virtio_mmio_intr(void *);
>  
>  struct virtio_mmio_softc {
> @@ -300,28 +300,42 @@ virtio_mmio_detach(struct device *self, int flags)
>   * Prints available / negotiated features if guest_feature_names != NULL and
>   * VIRTIO_DEBUG is 1
>   */
> -uint64_t
> -virtio_mmio_negotiate_features(struct virtio_softc *vsc, uint64_t guest_features,
> -   const struct virtio_feature_name *guest_feature_names)
> +int
> +virtio_mmio_negotiate_features(struct virtio_softc *vsc,
> +const struct virtio_feature_name *guest_feature_names)
>  {
>   struct virtio_mmio_softc *sc = (struct virtio_mmio_softc *)vsc;
>   uint64_t host, neg;
>  
> + vsc->sc_active_features = 0;
> +
>   /*
> -  * indirect descriptors can be switched off by setting bit 1 in the
> -  * driver flags, see config(8)
> +  * We enable indirect descriptors by default. They can be switched
> +  * off by setting bit 1 in the driver flags, see config(8).
>*/
>   if (!(vsc->sc_dev.dv_cfdata->cf_flags & VIRTIO_CF_NO_INDIRECT) &&
>   !(vsc->sc_child->dv_cfdata->cf_flags & VIRTIO_CF_NO_INDIRECT)) {
> - guest_features |= VIRTIO_F_RING_INDIRECT_DESC;
> - } else {
> + vsc->sc_driver_features |= VIRTIO_F_RING_INDIRECT_DESC;
> + } else if (guest_feature_names != NULL) {
>   printf("RingIndirectDesc disabled by UKC\n");
>   }
> + /*
> +  * The driver must add VIRTIO_F_RING_EVENT_IDX if it supports it.
> +  * If it did, check if it is disabled by bit 2 in the driver flags.
> +  */
> + if ((vsc->sc_driver_features & VIRTIO_F_RING_EVENT_IDX) &&
> + ((vsc->sc_dev.dv_cfdata->cf_flags & VIRTIO_CF_NO_EVENT_IDX) ||
> + (vsc->sc_child->dv_cfdata->cf_flags & VIRTIO_CF_NO_EVENT_IDX))) {
> + if (guest_feature_names != NULL)
> + printf(" RingEventIdx disabled by UKC");
> + vsc->sc_driver_features &= ~(VIRTIO_F_RING_EVENT_IDX);
> + }
> +
>   bus_space_write_4(sc->sc_iot, sc->sc_ioh,
>   VIRTIO_MMIO_HOST_FEATURES_SEL, 0);
>   host = bus_space_read_4(sc->sc_iot, sc->sc_ioh,
>   VIRTIO_MMIO_HOST_FEATURES);
> - neg = host & guest_features;
> + neg = host & vsc->sc_driver_features;
>  #if VIRTIO_DEBUG
>   if (guest_feature_names)
>   virtio_log_features(host, neg, guest_feature_names);
> @@ -330,13 +344,13 @@ virtio_mmio_negotiate_features(struct virtio_softc *vsc, uint64_t guest_features
>   VIRTIO_MMIO_GUEST_FEATURES_SEL, 0);
>   bus_space_write_4(sc->sc_iot, sc->sc_ioh,
> VIRTIO_MMIO_GUEST_FEATURES, neg);
> - vsc->sc_features = neg;
> 
