Re: [zfs-discuss] Interesting question about L2ARC

2012-09-26 Thread Richard Elling

On Sep 26, 2012, at 4:28 AM, Sašo Kiselkov  wrote:

> On 09/26/2012 01:14 PM, Edward Ned Harvey
> (opensolarisisdeadlongliveopensolaris) wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>>> 
>>> Got me wondering: how many reads of a block from spinning rust
>>> suffice for it to ultimately get into L2ARC? Just one so it
>>> gets into a recent-read list of the ARC and then expires into
>>> L2ARC when ARC RAM is more needed for something else, 
>> 
>> Correct, but not always sufficient.  I forget the name of the parameter, but 
>> there's some rate limiting thing that limits how fast you can fill the 
>> L2ARC.  This means sometimes, things will expire from ARC, and simply get 
>> discarded.
> 
> The parameters are:
> 
> *) l2arc_write_max (default 8MB): max number of bytes written per
>fill cycle

It should be noted that this level was perhaps appropriate 6 years
ago, when L2ARC was integrated, given the SSDs available at the
time, but it is well below reasonable settings for high-speed systems
or modern SSDs. It is probably not a bad idea to change the default to
reflect more modern systems, thus avoiding surprises.
 -- richard

> *) l2arc_headroom (default 2x): multiplies the above parameter and
>determines how far into the ARC lists we will search for buffers
>eligible for writing to L2ARC.
> *) l2arc_feed_secs (default 1s): regular interval between fill cycles
> *) l2arc_feed_min_ms (default 200ms): minimum interval between fill
>cycles
> 
> Cheers,
> --
> Saso
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
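
Picking up on the point above about raising the default: a minimal sketch of
how one of these tunables could be bumped on an illumos/Solaris system -- the
64 MB value is purely an illustrative assumption, not a recommendation from
this thread:

  # persistent: add to /etc/system and reboot
  set zfs:l2arc_write_max = 0x4000000          # 64 MB per fill cycle

  # or live, on the running kernel (writes an 8-byte value, in hex)
  echo "l2arc_write_max/Z 0x4000000" | mdb -kw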

--
illumos Day & ZFS Day, Oct 1-2, 2012 San Francisco 
www.zfsday.com
richard.ell...@richardelling.com
+1-760-896-4422








___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-09-26 Thread Richard Elling
On Sep 26, 2012, at 10:54 AM, Edward Ned Harvey 
(opensolarisisdeadlongliveopensolaris) 
 wrote:

> Here's another one.
>  
> Two identical servers are sitting side by side.  They could be connected to 
> each other via anything (presently using crossover ethernet cable.)  And 
> obviously they both connect to the regular LAN.  You want to serve VM's from 
> at least one of them, and even if the VM's aren't fault tolerant, you want at 
> least the storage to be live synced.  The first obvious thing to do is simply 
> cron a zfs send | zfs receive at a very frequent interval.  But there are a 
> lot of downsides to that - besides the fact that you have to settle for some 
> granularity, you also have a script on one system that will clobber the other 
> system. So in the event of a failure, you might promote the backup into 
> production, and you have to be careful not to let it get clobbered when the 
> main server comes up again.
>  
> I like much better the idea of using a zfs mirror between the two systems.  
> Even if it comes with a performance penalty, as a result of bottlenecking the 
> storage onto Ethernet.  But there are several ways to possibly do that, and 
> I'm wondering which will be best.
>  
> Option 1:  Each system creates a big zpool of the local storage.  Then, 
> create a zvol within the zpool, and export it iscsi to the other system.  Now 
> both systems can see a local zvol, and a remote zvol, which it can use to 
> create a zpool mirror.  The reasons I don't like this idea are because it's a 
> zpool within a zpool, including the double-checksumming and everything.  But 
> the double-checksumming isn't such a concern to me - I'm mostly afraid some 
> horrible performance or reliability problem might be resultant.  Naturally, 
> you would only zpool import the nested zpool on one system.  The other system 
> would basically just ignore it.  But in the event of a primary failure, you 
> could force import the nested zpool on the secondary system.

This was described by Torsten a few years ago.
http://www.osdevcon.org/2009/slides/high_availability_with_minimal_cluster_torsten_frueauf.pdf

IMHO, the issues are operational: troubleshooting could be very challenging.

>  
> Option 2:  At present, both systems are using local mirroring, 3 mirror pairs 
> of 6 disks.  I could break these mirrors, and export one side over to the 
> other system...  And vice versa.  So neither server will be doing local 
> mirroring; they will both be mirroring across iscsi to targets on the other 
> host.  Once again, each zpool will only be imported on one host, but in the 
> event of a failure, you could force import it on the other host.
>  
> Can anybody think of a reason why Option 2 would be stupid, or can you think 
> of a better solution?

If they are close enough for "crossover cable" where the cable is UTP, then
they are close enough for SAS.
 -- richard

--
illumos Day & ZFS Day, Oct 1-2, 2012 San Francisco 
www.zfsday.com
richard.ell...@richardelling.com
+1-760-896-4422








___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-09-26 Thread Tim Cook
On Wed, Sep 26, 2012 at 12:54 PM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

>  Here's another one.
>
> Two identical servers are sitting side by side.  They could be connected
> to each other via anything (presently using crossover ethernet cable.)  And
> obviously they both connect to the regular LAN.  You want to serve VM's
> from at least one of them, and even if the VM's aren't fault tolerant, you
> want at least the storage to be live synced.  The first obvious thing to
> do is simply cron a zfs send | zfs receive at a very frequent interval.  But
> there are a lot of downsides to that - besides the fact that you have to
> settle for some granularity, you also have a script on one system that will
> clobber the other system.  So in the event of a failure, you might
> promote the backup into production, and you have to be careful not to let
> it get clobbered when the main server comes up again.
>
> I like much better the idea of using a zfs mirror between the two
> systems.  Even if it comes with a performance penalty, as a result of
> bottlenecking the storage onto Ethernet.  But there are several ways to
> possibly do that, and I'm wondering which will be best.
>
> Option 1:  Each system creates a big zpool of the local storage.  Then,
> create a zvol within the zpool, and export it iscsi to the other system.  Now
> both systems can see a local zvol, and a remote zvol, which it can use to
> create a zpool mirror.  The reasons I don't like this idea are because
> it's a zpool within a zpool, including the double-checksumming and
> everything.  But the double-checksumming isn't such a concern to me - I'm
> mostly afraid some horrible performance or reliability problem might be
> resultant.  Naturally, you would only zpool import the nested zpool on
> one system.  The other system would basically just ignore it.  But in the
> event of a primary failure, you could force import the nested zpool on
> the secondary system.
>
> Option 2:  At present, both systems are using local mirroring, 3 mirror
> pairs of 6 disks.  I could break these mirrors, and export one side over
> to the other system...  And vice versa.  So neither server will be doing
> local mirroring; they will both be mirroring across iscsi to targets on
> the other host.  Once again, each zpool will only be imported on one
> host, but in the event of a failure, you could force import it on the other
> host.
>
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?
>
>
>

I would suggest that if you're doing a crossover between systems, you use
InfiniBand rather than Ethernet.  You can eBay a 40Gb IB card for under
$300.  Quite frankly, the performance issues should become almost a
non-factor at that point.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zvol refreservation size

2012-09-26 Thread Matthew Ahrens
On Wed, Sep 26, 2012 at 10:28 AM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

>  When I create a 50G zvol, it gets "volsize" 50G, and it gets "used" and "
> refreservation" 51.6G
>
> I have some filesystems already in use, hosting VM's, and I'd like to
> mimic the refreservation setting on the filesystem, as if I were smart
> enough from the beginning to have used the zvol.  So my question is ...
>
> What's the extra 1.6G for?
>

It is for metadata -- the indirect blocks required to reference the 50G of
data.



> 
>
> And
>
> If I have a filesystem holding a single VM with a single 2T disk, how
> large should the refreservation be?
>
> If it's a linear scale, it should be 2.064T refreservation.
>

For a filesystem, we can't exactly predict how much metadata will be needed
because it depends on how it is used (many small files vs few large files).
 For zvols, we can predict it exactly because we know it's just one big
object.  See zvol_volsize_to_reservation() for details.

Your case of a single large file can be treated like a zvol.  If your
filesystem has the same recordsize[*] (default is 128k) as the zvol's
volblocksize (default is 8k), then you can linearly scale that 3% with the
file size.  If you are using a different recordsize, you can linearly scale
the amount of metadata (larger recordsize -> less metadata).

--matt

[*] Note that the big file's recordsize is set when it is created, so what
matters is what the recordsize was when the file was created.  Changing the
recordsize property after it's created won't change the metadata layout of
the file.
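
A rough back-of-the-envelope based on the above, assuming the ~1.6G overhead
seen on a 50G zvol (8k volblocksize) and a single large file written with the
default 128k recordsize; this is only an estimate:

  1.6G / 50G            ~= 3.2% metadata overhead at 8k blocks
  3.2% / (128k / 8k)    ~= 0.2% overhead at 128k records
  2T x 1.002            ~= 2.004T refreservation for the 2T file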
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-09-26 Thread matthew patton
"head units" crash or do weird things, but disks persist. There are a couple of 
HA head-unit solutions out there but most of them have their own separate 
storage and they effectively just send transaction groups to each other.

The other way is to connect 2 nodes to an external SAS/FC chassis, create the
desired zpools, and assign some subset of pools to node A and the rest to node
B. When a failure occurs, the surviving node imports the failed node's pools
and exports them as NFS/iSCSI/whatever.

You'll obviously have to have a clustering/quorum and resource-migration
subsystem. Or, if you want simple active/passive, some means to make sure both
heads don't try to import the same pools.
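
In the simplest active/passive form (no cluster framework at all), the manual
takeover on the surviving head is just a forced import plus re-sharing; the
pool and dataset names below are placeholders:

  # on the surviving node, after the dead head has been fenced/powered off
  zpool import -f tank
  zfs set sharenfs=on tank/vms    # or re-enable the iSCSI views, as appropriate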
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vm server storage mirror

2012-09-26 Thread Freddie Cash
If you're willing to try FreeBSD, there's HAST (aka high availability
storage) for this very purpose.

You use hast to create mirror pairs using 1 disk from each box, thus
creating /dev/hast/* nodes. Then you use those to create the zpool on the
'primary' box.

All writes to the pool on the primary box are mirrored over the network to
the secondary box.

When the primary box goes down, the secondary imports the pool and carries
on. When the primary box comes online, it syncs the data back from the
secondary, and then either takes over as primary or becomes the new
secondary.
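
A minimal sketch of that layout, assuming two hosts named hosta and hostb, one
disk per HAST resource, and placeholder device names and addresses (see
hast.conf(5) and hastctl(8) for the real details):

  # /etc/hast.conf, identical on both boxes
  resource disk0 {
          on hosta {
                  local /dev/ada1
                  remote 192.168.0.2
          }
          on hostb {
                  local /dev/ada1
                  remote 192.168.0.1
          }
  }

  # on each box
  hastctl create disk0
  service hastd onestart

  # on the primary only
  hastctl role primary disk0
  zpool create tank /dev/hast/disk0   # add further hast resources/mirrors as needed
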
 On Sep 26, 2012 10:54 AM, "Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris)" <
opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

>  Here's another one.
>
> Two identical servers are sitting side by side.  They could be connected
> to each other via anything (presently using crossover ethernet cable.)  And
> obviously they both connect to the regular LAN.  You want to serve VM's
> from at least one of them, and even if the VM's aren't fault tolerant, you
> want at least the storage to be live synced.  The first obvious thing to
> do is simply cron a zfs send | zfs receive at a very frequent interval.  But
> there are a lot of downsides to that - besides the fact that you have to
> settle for some granularity, you also have a script on one system that will
> clobber the other system.  So in the event of a failure, you might
> promote the backup into production, and you have to be careful not to let
> it get clobbered when the main server comes up again.
>
> I like much better the idea of using a zfs mirror between the two
> systems.  Even if it comes with a performance penalty, as a result of
> bottlenecking the storage onto Ethernet.  But there are several ways to
> possibly do that, and I'm wondering which will be best.
>
> Option 1:  Each system creates a big zpool of the local storage.  Then,
> create a zvol within the zpool, and export it iscsi to the other system.  Now
> both systems can see a local zvol, and a remote zvol, which it can use to
> create a zpool mirror.  The reasons I don't like this idea are because
> it's a zpool within a zpool, including the double-checksumming and
> everything.  But the double-checksumming isn't such a concern to me - I'm
> mostly afraid some horrible performance or reliability problem might be
> resultant.  Naturally, you would only zpool import the nested zpool on
> one system.  The other system would basically just ignore it.  But in the
> event of a primary failure, you could force import the nested zpool on
> the secondary system.
>
> Option 2:  At present, both systems are using local mirroring, 3 mirror
> pairs of 6 disks.  I could break these mirrors, and export one side over
> to the other system...  And vice versa.  So neither server will be doing
> local mirroring; they will both be mirroring across iscsi to targets on
> the other host.  Once again, each zpool will only be imported on one
> host, but in the event of a failure, you could force import it on the other
> host.
>
> Can anybody think of a reason why Option 2 would be stupid, or can you
> think of a better solution?
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] vm server storage mirror

2012-09-26 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Here's another one.

Two identical servers are sitting side by side.  They could be connected to 
each other via anything (presently using crossover ethernet cable.)  And 
obviously they both connect to the regular LAN.  You want to serve VM's from at 
least one of them, and even if the VM's aren't fault tolerant, you want at 
least the storage to be live synced.  The first obvious thing to do is simply 
cron a zfs send | zfs receive at a very frequent interval.  But there are a lot 
of downsides to that - besides the fact that you have to settle for some 
granularity, you also have a script on one system that will clobber the other 
system.  So in the event of a failure, you might promote the backup into 
production, and you have to be careful not to let it get clobbered when the 
main server comes up again.
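
For reference, the cron'd replication amounts to something like the sketch
below (dataset, snapshot and host names are placeholders; a real script also
needs snapshot rotation and error handling):

  # on the primary, run from cron at the chosen interval
  zfs snapshot -r tank/vms@repl-new
  zfs send -R -i @repl-prev tank/vms@repl-new | ssh backuphost zfs receive -F tank/vms
  # then rename repl-new to repl-prev on both sides for the next run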

I like much better the idea of using a zfs mirror between the two systems.  
Even if it comes with a performance penalty, as a result of bottlenecking the 
storage onto Ethernet.  But there are several ways to possibly do that, and I'm 
wondering which will be best.

Option 1:  Each system creates a big zpool of the local storage.  Then, create 
a zvol within the zpool, and export it iscsi to the other system.  Now both 
systems can see a local zvol, and a remote zvol, which it can use to create a 
zpool mirror.  The reasons I don't like this idea are because it's a zpool 
within a zpool, including the double-checksumming and everything.  But the 
double-checksumming isn't such a concern to me - I'm mostly afraid some 
horrible performance or reliability problem might be resultant.  Naturally, you 
would only zpool import the nested zpool on one system.  The other system would 
basically just ignore it.  But in the event of a primary failure, you could 
force import the nested zpool on the secondary system.

Option 2:  At present, both systems are using local mirroring, 3 mirror pairs 
of 6 disks.  I could break these mirrors, and export one side over to the other 
system...  And vice versa.  So neither server will be doing local mirroring; 
they will both be mirroring across iscsi to targets on the other host.  Once 
again, each zpool will only be imported on one host, but in the event of a 
failure, you could force import it on the other host.

Can anybody think of a reason why Option 2 would be stupid, or can you think of 
a better solution?
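
For what it's worth, Option 2 on illumos boils down to roughly the following
COMSTAR/iSCSI steps, mirrored on both hosts; the device names, LU GUID and
addresses below are placeholders, not tested values:

  # server A: free up one side of a local mirror and export that disk
  zpool detach tank c0t2d0
  svcadm enable stmf iscsi/target
  stmfadm create-lu /dev/rdsk/c0t2d0
  stmfadm add-view <lu-guid>
  itadm create-target

  # server A: attach server B's exported LUN as the new mirror half
  iscsiadm add discovery-address 192.168.100.2
  iscsiadm modify discovery -t enable     # enable sendtargets discovery
  devfsadm -i iscsi
  zpool attach tank c0t1d0 c0tXXXXXXXXXXXXXXXXd0   # the iSCSI LUN from server B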
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zvol refreservation size

2012-09-26 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
When I create a 50G zvol, it gets "volsize" 50G, and it gets "used" and 
"refreservation" 51.6G

I have some filesystems already in use, hosting VM's, and I'd like to mimic the 
refreservation setting on the filesystem, as if I were smart enough from the 
beginning to have used the zvol.  So my question is ...

What's the extra 1.6G for?
And
If I have a filesystem holding a single VM with a single 2T disk, how large 
should the refreservation be?

If it's a linear scale, it should be 2.064T refreservation.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Different size / manufacturer L2ARC

2012-09-26 Thread Matt Van Mater
Excellent, thanks to you both.  I knew of both those methods and wanted
to make sure I wasn't missing something!

On Wed, Sep 26, 2012 at 11:21 AM, Dan Swartzendruber wrote:

> On 9/26/2012 11:18 AM, Matt Van Mater wrote:
>
>  If the added device is slower, you will experience a slight drop in
>> per-op performance, however, if your working set needs another SSD,
>> overall it might improve your throughput (as the cache hit ratio will
>> increase).
>>
>
>  Thanks for your fast reply!  I think I know the answer to this question,
> but what is the best way to determine how large my pool's l2arc working set
> is (i.e. how much l2arc is in use)?
>
>
> Easiest way:
>
> zpool iostat -v
>
>
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Different size / manufacturer L2ARC

2012-09-26 Thread Dan Swartzendruber

On 9/26/2012 11:18 AM, Matt Van Mater wrote:


If the added device is slower, you will experience a slight drop in
per-op performance, however, if your working set needs another SSD,
overall it might improve your throughput (as the cache hit ratio will
increase).


Thanks for your fast reply!  I think I know the answer to this 
question, but what is the best way to determine how large my pool's 
l2arc working set is (i.e. how much l2arc is in use)?




Easiest way:

zpool iostat -v


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Different size / manufacturer L2ARC

2012-09-26 Thread Sašo Kiselkov
On 09/26/2012 05:18 PM, Matt Van Mater wrote:
>>
>> If the added device is slower, you will experience a slight drop in
>> per-op performance, however, if your working set needs another SSD,
>> overall it might improve your throughput (as the cache hit ratio will
>> increase).
>>
> 
> Thanks for your fast reply!  I think I know the answer to this question,
> but what is the best way to determine how large my pool's l2arc working set
> is (i.e. how much l2arc is in use)?

Go grab arcstat.pl from
http://blog.harschsystems.com/2010/09/08/arcstat-pl-updated-for-l2arc-statistics/
- that's the tool you're looking for.
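
If you just want the raw counters without a script, the same information is
also visible in the arcstats kstat on illumos/Solaris, e.g.:

  kstat -p zfs:0:arcstats | grep ':l2_'
  # l2_size / l2_asize show how much data the L2ARC currently holds;
  # l2_hits / l2_misses give a feel for how useful it is being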

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Different size / manufacturer L2ARC

2012-09-26 Thread Matt Van Mater
>
> If the added device is slower, you will experience a slight drop in
> per-op performance, however, if your working set needs another SSD,
> overall it might improve your throughput (as the cache hit ratio will
> increase).
>

Thanks for your fast reply!  I think I know the answer to this question,
but what is the best way to determine how large my pool's l2arc working set
is (i.e. how much l2arc is in use)?

Matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Different size / manufacturer L2ARC

2012-09-26 Thread Sašo Kiselkov
On 09/26/2012 05:08 PM, Matt Van Mater wrote:
> I've looked on the mailing list (the evil tuning wikis are down) and
> haven't seen a reference to this seemingly simple question...
> 
> I have two OCZ Vertex 4 SSDs acting as L2ARC.  I have a spare Crucial SSD
> (about 1.5 years old) that isn't getting much use and I'm curious about
> adding it to the pool as a third L2ARC device.
> 
> Is there any reason why I technically can't use different capacity and/or
> manufacturer SSDs as a single ZFS pool's L2ARC?

No, there isn't. You can do it without problems.

> Even if it will work technically, will this configuration negatively impact
> performance (e.g. slow down the entire cache to the slowest drive's
> performance)?

If the added device is slower, you will experience a slight drop in
per-op performance, however, if your working set needs another SSD,
overall it might improve your throughput (as the cache hit ratio will
increase).

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Different size / manufacturer L2ARC

2012-09-26 Thread Matt Van Mater
I've looked on the mailing list (the evil tuning wikis are down) and
haven't seen a reference to this seemingly simple question...

I have two OCZ Vertex 4 SSDs acting as L2ARC.  I have a spare Crucial SSD
(about 1.5 years old) that isn't getting much use and I'm curious about
adding it to the pool as a third L2ARC device.

Is there any reason why I technically can't use different capacity and/or
manufacturer SSDs as a single ZFS pool's L2ARC?
Even if it will work technically, will this configuration negatively impact
performance (e.g. slow down the entire cache to the slowest drive's
performance)?

Thanks!
Matt
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interesting question about L2ARC

2012-09-26 Thread Sašo Kiselkov
On 09/26/2012 01:14 PM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Jim Klimov
>>
>> Got me wondering: how many reads of a block from spinning rust
>> suffice for it to ultimately get into L2ARC? Just one so it
>> gets into a recent-read list of the ARC and then expires into
>> L2ARC when ARC RAM is more needed for something else, 
> 
> Correct, but not always sufficient.  I forget the name of the parameter, but 
> there's some rate limiting thing that limits how fast you can fill the L2ARC. 
>  This means sometimes, things will expire from ARC, and simply get discarded.

The parameters are:

 *) l2arc_write_max (default 8MB): max number of bytes written per
fill cycle
 *) l2arc_headroom (default 2x): multiplies the above parameter and
determines how far into the ARC lists we will search for buffers
eligible for writing to L2ARC.
 *) l2arc_feed_secs (default 1s): regular interval between fill cycles
 *) l2arc_feed_min_ms (default 200ms): minimum interval between fill
cycles

Cheers,
--
Saso
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Interesting question about L2ARC

2012-09-26 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
> 
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else, 

Correct, but not always sufficient.  I forget the name of the parameter, but 
there's some rate limiting thing that limits how fast you can fill the L2ARC.  
This means sometimes, things will expire from ARC, and simply get discarded.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss