Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-21 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Jim Klimov
 
 As for ZIL - even if it is used with the in-pool variant, I don't
 think your setup needs any extra steps to disable it (as Edward likes
 to suggest), and most other setups don't need to disable it either.

No, no - I know I often suggest disabling the ZIL, because so many people 
rule it out on principle (the Evil Tuning Guide says "disable the ZIL (don't!)").

But in this case, I was suggesting precisely the opposite of disabling it.  I 
was suggesting making it more aggressive.

But now that you mention it - if he's looking for maximum performance, perhaps 
disabling the ZIL would be best for him.   ;-)  

Nathan, it will do you some good to understand when it is and isn't OK to 
disable the ZIL (zfs set sync=disabled).  If this is a guest VM on your 
laptop or something like that, then it's definitely safe.  If the guest VM is a 
database server with a bunch of external clients (on the LAN or elsewhere on 
the network), then it's definitely *not* safe.

Basically, if anything external to the VM is monitoring or depending on the 
VM's state, then it's not OK.  But if the VM were to crash and go back in 
time by a few seconds, and there are no clients that would care about that, 
then it's safe to disable the ZIL.  And that is the highest-performance thing 
you can possibly do.
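
For the record, it's a per-dataset property, so you can confine it to just 
the zvol backing the guest.  A rough sketch, with made-up pool/zvol names:

  # check what the zvol is set to now (names are hypothetical)
  zfs get sync tank/guest1

  # disable the ZIL for that dataset only
  zfs set sync=disabled tank/guest1

  # put it back to the default when you're done experimenting
  zfs set sync=standard tank/guest1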


 It also shouldn't add much to your writes - the in-pool ZIL blocks
 are then referenced as userdata when the TXG commit happens (I think).

I would like to get some confirmation of that, because it's the opposite of 
what I thought.  
I thought the ZIL was used like a circular buffer - the same blocks get 
overwritten repeatedly - but that a sync write over a certain size skips the 
ZIL and goes straight to main zpool storage, so it doesn't have to be written 
twice.


 I also think that with a VM in a raw partition you don't get any
 snapshots - neither ZFS snapshots of the underlying storage ('cause it
 isn't ZFS), nor hypervisor snaps of the VM. So while faster, this is
 also something of a trade-off :)

Oh - but not faster than a zvol.  I am currently a fan of wrapping a zvol inside 
a vmdk, so I get maximum performance and snapshots as well.



Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread Peter Tripp
Hi Nathan,

You've misunderstood how the ZIL works and why it reduces write latency for 
synchronous writes. 

Since you've partitioned a single SSD into two slices, one as pool storage and 
one as a ZIL for that pool, all sync writes will be 2X amplified. There's no way 
around it. ZFS writes to the ZIL and then, simultaneously or with up to a 
couple of seconds' delay, writes to the slice you're using to persistently store 
pool data.  This doesn't happen when you expose the raw partition to the VM, 
because those writes don't go through the ZIL... hence no write amplification.

Since you've put the ZIL physically on the same device as the pool storage, 
the ZIL serves no purpose other than to slow things down.  The purpose of a ZIL 
is to acknowledge sync writes as fast as possible even if they haven't hit the 
actual pool storage (usually slow HDs) yet; it acknowledges the write once it 
has hit the ZIL, and then ZFS has a moment (up to 30 sec, IIRC) to bundle 
multiple IOs before committing them to persistent pool storage.  

Remove the log (ZIL) slice from the pool where you've carved out this zvol, and 
test again. Your writes will be faster. They likely won't be as fast as your 
async writes (150MB/sec), but they will certainly be faster than the 15MB/sec 
you're getting now, when you're unintentionally doing synchronous writes to the 
ZIL slice and async writes to the pool storage slice simultaneously.  I'd bet 
the zvol solution will approach the speed of using a raw partition. 
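
Something along these lines - the pool and device names here are only 
examples, so check zpool status for yours first:

  # see whether a log (slog) slice is actually attached
  zpool status tank

  # if so, detach it; the in-pool ZIL takes over automatically
  zpool remove tank c2d0p3

  # then re-run the guest write test and watch the host-side disks
  iostat -xn 1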

-Pete 

P.S. Be careful using the term "write amplification" when talking about 
SSDs... people usually use it to refer to what happens within the SSD itself: 
before a write (especially a small write) can land, other nearby data must be 
read so that an entire block can be rewritten.
http://en.wikipedia.org/wiki/Write_amplification 

On Nov 20, 2012, at 8:29 AM, Nathan Kroenert wrote:

 Hi folks,  (Long time no post...)
 
 Only starting to get into this one, so apologies if I'm light on detail, 
 but...
 
 I have a shiny SSD I'm using to help make some VirtualBox stuff I'm doing go 
 fast.
 
 I have a 240GB Intel 520 series jobbie. Nice.
 
 I chopped it into a few slices - p0 (partition table), p1 128GB, p2 60GB.
 
 As part of my work, I have used it as a RAW device (cxtxdxp1), wrapping 
 partition 1 with a VirtualBox-created VMDK linkage, and it works like 
 a champ. :) Very happy with that.
 
 I then tried creating a new zpool using partition 2 of the disk (zpool create 
 c2d0p2) and then carved a zvol out of that (30GB), and wrapped *that* in a 
 vmdk.
 
 Still works OK and speed is good(ish) - but there are a couple of things in 
 particular that disturb me:
 - Sync writes are pretty slow - only about 1/10th of what I thought I might 
 get (about 15MB/s). Async writes are fast - up to 150MB/s or more.
 - More worryingly, it seems that writes are amplified by 2X: if I write 
 100MB at the guest level, the underlying bare-metal ZFS writes 200MB, as 
 observed by iostat. This doesn't happen on the VMs that are using RAW slices.
 
 Anyone have any thoughts on what might be happening here?
 
 I can appreciate that if everything comes through as a sync write, it goes to 
 the ZIL first and then to its final resting place - but it seems a little over 
 the top that it really is double.
 
 I have also had a play with sync=, primarycache and a few other settings, but 
 it doesn't seem to change the behaviour.
 
 Again - I'm looking for thoughts here, as I have only really just started 
 looking into this. Should I happen across anything interesting, I'll follow up 
 on this post.
 
 Cheers,
 
 Nathan. :)



Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Nathan Kroenert
 
 I chopped it into a few slices - p0 (partition table), p1 128GB, p2 60GB.
 
 As part of my work, I have used it as a RAW device (cxtxdxp1), wrapping
 partition 1 with a VirtualBox-created VMDK linkage, and it works
 like a champ. :) Very happy with that.
 
 I then tried creating a new zpool using partition 2 of the disk (zpool
 create c2d0p2) and then carved a zvol out of that (30GB), and wrapped
 *that* in a vmdk.

Why are you partitioning, then creating a zpool, and then creating a zvol?
I think you should make the whole disk a zpool unto itself, and then carve out 
the 128G zvol and the 60G zvol.  For that matter, why are you carving out 
multiple zvols?  Does your guest VM really want multiple virtual disks for some 
reason?

Side note:  Assuming you *really* just want a single guest to occupy the whole 
disk and run as fast as possible...  If you want to snapshot your guest, you 
should make the whole disk one zpool, and then carve out a zvol that is 
significantly smaller than 50% - perhaps 40% or 45% will do the trick.  The 
zvol will immediately reserve all the space it needs, and if you don't have 
enough space left over to completely replicate the zvol, you won't be able to 
create the snapshot.  If your pool ever gets over 90% used, your performance 
will degrade, so a 40% zvol is what I would recommend.
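
As a rough sketch - the device and pool names are just examples, and the sizes 
assume the 240GB drive:

  # whole disk as one pool
  zpool create ssdpool c2d0

  # one zvol at roughly 40% of the pool; the refreservation is taken
  # up front, leaving room for snapshots and keeping usage under 90%
  zfs create -V 96G ssdpool/guest1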

Back to the topic:

Given that you're on the SSD, there is no faster nonvolatile storage you can 
use for a ZIL log device, so you should leave the default ZIL inside the pool... 
Don't try adding a separate slice or anything else as a log device...  But as 
you said, sync writes will then hit the disk twice.  My guess is that it would 
be a good idea for you to tune ZFS so that sync writes go straight to their 
final location.  I forget exactly how this is done - there's some tunable that 
says any sync write over a certain size should be flushed immediately...
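
If memory serves, the knob is zfs_immediate_write_sz (sync writes bigger than 
that are written in place and only referenced from the ZIL), and the 
per-dataset logbias property does something similar - but verify both on your 
build before relying on them.  Something like:

  # print the current threshold in bytes (illumos/Solaris, needs root)
  echo zfs_immediate_write_sz/D | mdb -k

  # per-dataset alternative: bias sync writes straight to the pool
  zfs set logbias=throughput tank/guest1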




Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread Fajar A. Nugraha
On Wed, Nov 21, 2012 at 12:07 AM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris)
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 Why are you partitioning, then creating a zpool,

In the common case, it's often because they use the disk for something
else as well (e.g. the OS), not only for ZFS.

 and then creating a zvol?

Because it makes other stuff easier and faster (e.g. copying files
from the host) compared to using plain disk image files
(vmdk/vdi/vhd/whatever).

 I think you should make the whole disk a zpool unto itself, and then carve 
 out the 128G zvol and 60G zvol.  For that matter, why are you carving out 
 multiple zvol's?  Does your Guest VM really want multiple virtual disks for 
 some reason?

 Side note:  Assuming you *really* just want a single guest to occupy the 
 whole disk and run as fast as possible...  If you want to snapshot your 
 guest, you should make the whole disk one zpool, and then carve out a zvol 
 which is significantly smaller than 50%, say perhaps 40% or 45% might do the 
 trick.

... or use sparse zvols, e.g. zfs create -V 10G -s tank/vol1

Of course, that's assuming you KNOW that you'll never max out storage use
on that zvol. If you don't have control over that, then using a smaller
zvol size is indeed preferable.
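
A quick way to see the difference (dataset names are made up):

  zfs create -V 10G tank/thick       # normal zvol: refreservation == volsize
  zfs create -s -V 10G tank/thin     # sparse zvol: no refreservation
  zfs get volsize,refreservation tank/thick tank/thin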

-- 
Fajar


Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread nathan

Hi folks,

some extra thoughts:

1. Don't question why. :) I'm playing and observing so that I ultimately 
know and understand the best way to do things! heh.
2. In fairness, asking why is entirely valid. ;) I'm not doing things to 
best practice just yet - I wanted the best performance for my VMs, 
which are all testing/training/playing VMs. I got *great* performance 
from the first RAW PARTITION I gave to VirtualBox. I wanted to do the 
same again, but because of the way VirtualBox wraps partitions, Solaris 
complains that there is more than one Solaris2 partition on the disk when 
I try to install a second instance - so I thought I'd give zvols a go.
3. The device I wrap as a VMDK is the zvol's RAW device (the wrapping step 
is sketched below). Sigh. Of course all writes will go through the ZIL, and 
of course we'll end up writing twice as much. I should have seen that 
straight away, but I was lacking sleep.
4. Note: I don't have a separate ZIL. The first partition I made was 
given directly to VirtualBox. The second was used to create the zpool.
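
For reference, the wrapping step is just the raw-disk VMDK trick pointed at 
the zvol's raw device node - something like the following, with example paths 
and names:

  VBoxManage internalcommands createrawvmdk \
      -filename /vbox/guest1-zvol.vmdk \
      -rawdisk /dev/zvol/rdsk/tank/guest1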


I'm going to have a play with using LVM md devices instead and see how 
that goes as well.


Overall, the pain of the doubling of bandwidth requirements seems like a 
big downer for *my* configuration, as I have just the one SSD, but I'll 
persist and see what I can get out of it.


Thanks for the thoughts thus far!

Cheers,

Nathan.



Re: [zfs-discuss] zvol wrapped in a vmdk by Virtual Box and double writes?

2012-11-20 Thread Jim Klimov

On 2012-11-21 03:21, nathan wrote:

Overall, the pain of the doubling of bandwidth requirements seems like a
big downer for *my* configuration, as I have just the one SSD, but I'll
persist and see what I can get out of it.


I might also speculate that for each rewritten block of userdata in
the VM image, you get a series of metadata block updates in ZFS.
If you keep the zvol blocks relatively small, that alone might produce
an effective doubling of writes on top of the userdata updates.
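
It might be worth checking what block size the zvol ended up with (names 
below are examples; volblocksize can only be set at creation time):

  zfs get volblocksize tank/guest1

  # if re-creating the zvol, a larger block size cuts per-block
  # metadata overhead (at the cost of more read-modify-write for
  # small guest I/O)
  zfs create -V 30G -o volblocksize=64K tank/guest1_big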

As for ZIL - even if it is used with the in-pool variant, I don't
think your setup needs any extra steps to disable it (as Edward likes
to suggest), and most other setups don't need to disable it either.
It also shouldn't add much to your writes - the in-pool ZIL blocks
are then referenced as userdata when the TXG commit happens (I think).

I also think that with a VM in a raw partition you don't get any
snapshots - neither ZFS snapshots of the underlying storage ('cause it
isn't ZFS), nor hypervisor snaps of the VM. So while faster, this is
also something of a trade-off :)
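
Whereas with the zvol route, a host-side snapshot or rollback of the guest 
disk is a one-liner (example names):

  zfs snapshot tank/guest1@pre-upgrade
  zfs rollback tank/guest1@pre-upgrade    # guest should be powered off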

//Jim
