Re: bhyve and vfs.zfs.arc_max, and zfs tuning for a hypervisor

2019-03-20 Thread Mike Gerdts
On Tue, Mar 19, 2019 at 3:07 AM Patrick M. Hausen  wrote:

> Hi!
>
> > On 19.03.2019 at 03:46, Victor Sudakov wrote:
> > 1. Does ARC actually cache zfs volumes (not files/datasets)?
>
> Yes it does.
>
> > 2. If ARC does cache volumes, does this cache make sense on a hypervisor,
> > because guest OSes will probably have their own disk cache anyway.
>
> IMHO not much, because the guest OS relies on the fact that when
> it writes its own cached data out to "disk", it will be committed to
> stable storage.
>

I'd recommend caching at least metadata (primarycache=metadata).  The guest
will not cache ZFS metadata, and not having metadata in the cache can lead
to a big performance hit.  The metadata in question here includes things
like block pointers that track where the data is stored - ZFS can't find
the data without its metadata.
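
For example - assuming the ZVOLs live under a dataset named zfs/vm, the
same hypothetical name used later in this thread - the property can be set
at that level and is inherited by the child ZVOLs:

  zfs set primarycache=metadata zfs/vm
  zfs get -r primarycache zfs/vm

The second command just confirms that the value has propagated.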

I think the key decision between primarycache=metadata and
primarycache=all comes down to whether you are after predictable
performance or optimal performance.  You will likely get worse performance
with primarycache=metadata (and especially with primarycache=none),
presuming the host has RAM to spare.  As you pack the system with more VMs
or allocate more disk to existing VMs, you will probably find that
primarycache=metadata leads to steadier performance regardless of how much
storage is in use or how active the other VMs are.


> > 3. Would it make sense to limit vfs.zfs.arc_max to 1/8 or even less of
> > total RAM, so that most RAM is available to guest machines?
>
> Yes, if you build your own solution on plain FreeBSD. No, if you are running
> FreeNAS, which already tries to autotune the ARC size according to the
> memory committed to VMs.
>
> > 4. What other zfs tuning measures can you suggest for a bhyve
> > hypervisor?
>
> e.g.
> zfs set sync=always zfs/vm
>
> if zfs/vm is the dataset under which you create the ZVOLs for your emulated
> disks.
>

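To put a rough number on question 3: on plain FreeBSD the ARC cap is a
loader tunable.  A minimal sketch, assuming you want to leave most of the
RAM to the guests and cap the ARC at 8 GiB, is a line in /boot/loader.conf:

  vfs.zfs.arc_max="8G"

(or the equivalent value in bytes if your release does not accept the
suffix form), followed by a reboot.
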
I'm not sure what the state of this is in FreeBSD, but in SmartOS we allow
the guests to benefit from write caching if they negotiate FLUSH.  Guests
that do negotiate flush are expected to use proper barriers to flush the
cache at critical times.  When a FLUSH arrives, SmartOS bhyve issues an
fsync().  To be clear - SmartOS bhyve is not actually caching writes in
memory; it is just delaying transaction group commits.  This avoids
significant write inflation and associated latency.  Support for FLUSH
negotiation has greatly improved I/O performance - to the point that some
tests show parity with running directly on the host pool.  If not already
in FreeBSD, this would probably be something of relatively high value to
pull in.

If you do go the route of sync=always and a restricted primarycache
setting, be sure your guest block size and host volblocksize match.  ZFS
(on platforms I'm more familiar with, at least) defaults to
volblocksize=8k.  Most guest file systems these days seem to default to a
block size of 4 KiB.  If the guest file system issues a 4 KiB aligned
write, that will turn into a read-modify-write cycle to stitch that 4 KiB
guest block into the host's 8 KiB block.  If the adjacent guest block in
the same 8 KiB host block is written in the next write, it too will turn
into a read-modify-write cycle.
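
If the guests mostly use a 4 KiB block size, one way to avoid the mismatch
is to create the ZVOLs with a matching volblocksize up front (it cannot be
changed after creation).  A sketch, with a hypothetical ZVOL name under the
zfs/vm dataset from earlier:

  zfs create -V 40G -o volblocksize=4k zfs/vm/guest0-disk0
  zfs get volblocksize zfs/vm/guest0-disk0

The trade-off is more metadata overhead and typically worse compression
with the smaller block size.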

If you are using ZFS in the guest, this can be particularly problematic
because the guest ZFS will align writes with the guest pool's ashift, not
with a guest dataset's recordsize or volblocksize.  I discovered this
during extended benchmarking of zfs-on-zfs - primarily with
primarycache=metadata and sync=always.  The write inflation was quite
significant: 3x was common.  I tracked some of it down to alignment issues;
the rest was due to sync writes causing the data to be written twice.
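
If you suspect this, the guest pool's ashift is easy to check; a sketch,
assuming a guest pool named tank (the exact zdb invocation varies a bit by
platform):

  zdb -C tank | grep ashift

An ashift of 12 means the guest aligns its writes to 4 KiB (2^12)
boundaries, so a host volblocksize of 4k lines up with what the guest
actually writes.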

George Wilson has a great talk where he describes the same issues I hit.

https://www.youtube.com/watch?v=_-QAnKtIbGc

I've mentioned write inflation related to sync writes a few times.  One
point that I think is poorly understood is that when ZFS is rushed into
committing a write with fsync or similar, the immediate write is of ZIL
blocks to the intent log.  The intent log can be on a separate log device
(slog) or it can be on the disks that hold the pool's data.  When the
intent log is on the data disks, the data is written to the same disks
twice: once as ZIL blocks and again as data blocks.  Between these writes
there will be full-disk head movement as the uberblocks are updated at the
beginning and end of the disk.
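
The usual mitigation for that double write is a dedicated log device, so
that the ZIL copy lands on a separate (ideally low-latency) disk.  A
sketch, with a hypothetical device name and the zfs pool from the earlier
example:

  zpool add zfs log gpt/slog0
  zpool status zfs

The slog only absorbs the synchronous ZIL copy; the normal transaction
group writes still go to the data disks.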

What I say above is based on experience with kernel zones on Solaris and
bhyve on SmartOS.  There are enough similarities that I expect bhyve on
FreeBSD will be the same, but FreeBSD may have some strange-to-me zfs
caching changes.

Regards,
Mike


Re: Centos7 uefi boot problem with bhyve after update

2018-05-04 Thread Mike Gerdts
On Fri, May 4, 2018 at 7:26 AM, Michael Reifenberger wrote:

> The kernel version shouldn't matter that much because the error occurs
> before the kernel gets loaded.
> It happens after /EFI/BOOT/BOOTX64.EFI got loaded and before
> /EFI/centos/grubx64.efi gets loaded,
> because the latter is not being found...
>

On SmartOS I've found that fresh installations of CentOS 7 put all of the
files required to actually boot the OS in the /EFI/centos directory,
leaving the /EFI/BOOT directory mostly empty.  The image at
https://ibb.co/jFwDES matches this symptom.  This can be fixed (hacked
around) via the EFI shell.

The following is pieced together from scrollback (that was kinda wonky due
to size issues) from a serial console.

Shell> FS0:
FS0:\> cd efi
FS0:\efi\> cd BOOT
FS0:\efi\BOOT\> dir
Directory of: FS0:\efi\BOOT\
05/03/2018  19:29  4,096  .
05/03/2018  19:29  4,096  ..
08/31/2017  21:30   1,296,176  BOOTX64.EFI
08/31/2017  21:30  79,048  fbx64.efi
  2 File(s)   1,375,224 bytes
  2 Dir(s)
FS0:\efi\BOOT\> cd ..\centos
FS0:\efi\centos\> dir
Directory of: FS0:\efi\centos\
05/03/2018  19:27  4,096  .
05/03/2018  19:27  4,096  ..
05/03/2018  19:29  4,096  fonts
08/31/2017  21:30   1,297,120  shimx64-centos.efi
08/17/2017  18:00   1,052,032  grubx64.efi
08/31/2017  21:30 134  BOOT.CSV
08/31/2017  21:30 134  BOOTX64.CSV
08/31/2017  21:30   1,262,816  mmx64.efi
08/31/2017  21:30   1,296,176  shim.efi
05/03/2018  19:33   1,024  grubenv
08/31/2017  21:30   1,296,176  shimx64.efi
05/03/2018  19:33   4,231  grub.cfg
  9 File(s)   6,209,843 bytes
  3 Dir(s)
FS0:\efi\centos\> cp -r * ..\boot
Copying FS0:\EFI\centos\fonts -> FS0:\EFI\BOOT\fonts
...
Copying FS0:\EFI\centos\grub.cfg -> FS0:\EFI\BOOT\grub.cfg
- [ok]
FS0:\efi\centos\> reset


I've not curled up with the EFI spec for a while and forget how it is
supposed to choose which directory to read the various files from.  That
is, the fault here could be that the bootrom is not reading the files it
should, or that the guest OS is not putting the right thing in \EFI\BOOT to
get it to look in \EFI\centos.
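
If you can get into the guest some other way (rescue media, or after the
EFI shell workaround above), the same hack can presumably be done from
inside CentOS, which normally mounts the ESP at /boot/efi; a sketch:

  cp -a /boot/efi/EFI/centos/. /boot/efi/EFI/BOOT/

That just mirrors the EFI shell copy above, so the firmware's fallback
path \EFI\BOOT has everything grub needs.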

Mike