Re: [zfs-discuss] Interesting question about L2ARC
On Sep 26, 2012, at 4:28 AM, Sašo Kiselkov wrote:

> On 09/26/2012 01:14 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>>>
>>> Got me wondering: how many reads of a block from spinning rust
>>> suffice for it to ultimately get into L2ARC? Just one so it
>>> gets into a recent-read list of the ARC and then expires into
>>> L2ARC when ARC RAM is more needed for something else,
>>
>> Correct, but not always sufficient. I forget the name of the parameter, but
>> there's some rate-limiting thing that limits how fast you can fill the
>> L2ARC. This means sometimes, things will expire from ARC, and simply get
>> discarded.
>
> The parameters are:
>
> *) l2arc_write_max (default 8MB): max number of bytes written per
>    fill cycle

It should be noted that this level was perhaps appropriate 6 years ago, when
L2ARC was integrated, given the SSDs available at the time, but it is well
below reasonable settings for high-speed systems or modern SSDs. It is
probably not a bad idea to change the default to reflect more modern systems,
thus avoiding surprises.
 -- richard

> *) l2arc_headroom (default 2x): multiplies the above parameter and
>    determines how far into the ARC lists we will search for buffers
>    eligible for writing to L2ARC.
> *) l2arc_feed_secs (default 1s): regular interval between fill cycles
> *) l2arc_feed_min_ms (default 200ms): minimum interval between fill
>    cycles
>
> Cheers,
> --
> Saso

--
illumos Day & ZFS Day, Oct 1-2, 2012, San Francisco -- www.zfsday.com
richard.ell...@richardelling.com  +1-760-896-4422
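For anyone hitting the default cap with a fast cache device, a minimal sketch
of raising l2arc_write_max on illumos; the 64 MiB figure is purely an assumed
example, not a recommendation -- size it to what your SSD can actually sustain:

  # Persistent, takes effect on next boot (add to /etc/system):
  #   set zfs:l2arc_write_max = 0x4000000    (64 MiB per fill cycle)
  # Live change on a running kernel:
  echo "l2arc_write_max/Z 0x4000000" | mdb -kw
  # Verify (prints the value as an 8-byte unsigned decimal):
  echo "l2arc_write_max/E" | mdb -k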
Re: [zfs-discuss] vm server storage mirror
On Sep 26, 2012, at 10:54 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

> Here's another one.
>
> Two identical servers are sitting side by side. They could be connected to
> each other via anything (presently using a crossover ethernet cable.) And
> obviously they both connect to the regular LAN. You want to serve VM's from
> at least one of them, and even if the VM's aren't fault tolerant, you want
> at least the storage to be live synced. The first obvious thing to do is
> simply cron a zfs send | zfs receive at a very frequent interval. But there
> are a lot of downsides to that - besides the fact that you have to settle
> for some granularity, you also have a script on one system that will clobber
> the other system. So in the event of a failure, you might promote the backup
> into production, and you have to be careful not to let it get clobbered when
> the main server comes up again.
>
> I like much better the idea of using a zfs mirror between the two systems,
> even if it comes with a performance penalty as a result of bottlenecking the
> storage onto Ethernet. But there are several ways to possibly do that, and
> I'm wondering which will be best.
>
> Option 1: Each system creates a big zpool of the local storage. Then, create
> a zvol within the zpool, and export it over iscsi to the other system. Now
> both systems can see a local zvol and a remote zvol, which they can use to
> create a zpool mirror. The reason I don't like this idea is that it's a
> zpool within a zpool, including the double-checksumming and everything. But
> the double-checksumming isn't such a concern to me - I'm mostly afraid some
> horrible performance or reliability problem might result. Naturally, you
> would only zpool import the nested zpool on one system. The other system
> would basically just ignore it. But in the event of a primary failure, you
> could force import the nested zpool on the secondary system.

This was described by Thorsten a few years ago.
http://www.osdevcon.org/2009/slides/high_availability_with_minimal_cluster_torsten_frueauf.pdf
IMHO, the issues are operational: troubleshooting could be very challenging.

> Option 2: At present, both systems are using local mirroring, 3 mirror pairs
> of 6 disks. I could break these mirrors, and export one side over to the
> other system... and vice versa. So neither server will be doing local
> mirroring; they will both be mirroring across iscsi to targets on the other
> host. Once again, each zpool will only be imported on one host, but in the
> event of a failure, you could force import it on the other host.
>
> Can anybody think of a reason why Option 2 would be stupid, or can you think
> of a better solution?

If they are close enough for a "crossover cable" where the cable is UTP, then
they are close enough for SAS.
 -- richard

--
illumos Day & ZFS Day, Oct 1-2, 2012, San Francisco -- www.zfsday.com
richard.ell...@richardelling.com  +1-760-896-4422
Re: [zfs-discuss] vm server storage mirror
On Wed, Sep 26, 2012 at 12:54 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> Two identical servers are sitting side by side. They could be connected to
> each other via anything (presently using a crossover ethernet cable.)
> [snip]
> Can anybody think of a reason why Option 2 would be stupid, or can you think
> of a better solution?

I would suggest that if you're doing a crossover between systems, you use
InfiniBand rather than Ethernet. You can eBay a 40Gb IB card for under $300.
Quite frankly, the performance issues should become almost a non-factor at
that point.

--Tim
Re: [zfs-discuss] zvol refreservation size
On Wed, Sep 26, 2012 at 10:28 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> When I create a 50G zvol, it gets "volsize" 50G, and it gets "used" and
> "refreservation" 51.6G.
>
> I have some filesystems already in use, hosting VM's, and I'd like to mimic
> the refreservation setting on the filesystem, as if I were smart enough from
> the beginning to have used the zvol. So my question is ...
>
> What's the extra 1.6G for?

It is for metadata -- the indirect blocks required to reference the 50G of
data.

> And
>
> If I have a filesystem holding a single VM with a single 2T disk, how large
> should the refreservation be?
>
> If it's a linear scale, it should be 2.064T refreservation.

For a filesystem, we can't exactly predict how much metadata will be needed,
because it depends on how it is used (many small files vs. few large files).
For zvols, we can predict it exactly because we know it's just one big object.
See zvol_volsize_to_reservation() for details.

Your case of a single large file can be treated like a zvol. If your
filesystem has the same recordsize[*] (default is 128k) as the zvol's
volblocksize (default is 8k), then you can linearly scale that 3% with the
file size. If you are using a different recordsize, you can linearly scale
the amount of metadata (larger recordsize -> less metadata).

--matt

[*] Note that the big file's recordsize is set when it is created, so what
matters is what the recordsize was when the file was created. Changing the
recordsize property after it's created won't change the metadata layout of
the file.
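Following Matt's scaling rule, a rough back-of-the-envelope sketch for the 2T
case. This is illustrative only; the authoritative number comes from
zvol_volsize_to_reservation(), which accounts for details this estimate ignores:

  # Assumes the ~3.2% overhead seen on a 50G zvol at 8k volblocksize, scaled
  # down for a 128k recordsize (16x larger blocks -> ~16x less indirect-block
  # metadata for the same amount of data).
  awk 'BEGIN {
      volsize_tib = 2.0            # size of the virtual disk file
      overhead_8k = 51.6/50 - 1    # ~3.2% metadata overhead at 8k blocks
      scale       = 8.0/128        # 128k recordsize vs. 8k volblocksize
      printf "approx refreservation: %.3f TiB\n", volsize_tib * (1 + overhead_8k*scale)
  }'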
Re: [zfs-discuss] vm server storage mirror
"head units" crash or do weird things, but disks persist. There are a couple of HA head-unit solutions out there but most of them have their own separate storage and they effectively just send transaction groups to each other. The other way is to connect 2 nodes to an external SAS/FC chassis. create desired ZPools. Assign some subset of pools to node A, the rest to node B. When failure occurs the other node imports the other's pools and exports as NFS/iSCSI/whatever. You'll have to have a clustering/quorum and resource migration subsystem obviously. Or if you want simple act/passive, a means to make sure both heads don't try to import the same pools. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] vm server storage mirror
If you're willing to try FreeBSD, there's HAST (aka Highly Available STorage)
for this very purpose. You use hast to create mirror pairs using one disk from
each box, thus creating /dev/hast/* nodes. Then you use those to create the
zpool on the 'primary' box. All writes to the pool on the primary box are
mirrored over the network to the secondary box.

When the primary box goes down, the secondary imports the pool and carries on.
When the primary box comes back online, it syncs the data back from the
secondary, and then either takes over as primary or becomes the new secondary.

On Sep 26, 2012 10:54 AM, "Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)" <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:

> Here's another one.
>
> Two identical servers are sitting side by side. They could be connected to
> each other via anything (presently using a crossover ethernet cable.)
> [snip]
> Can anybody think of a reason why Option 2 would be stupid, or can you think
> of a better solution?
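A rough sketch of that HAST-plus-ZFS plumbing on FreeBSD; the resource and
pool names are hypothetical, and /etc/hast.conf is assumed to already define
matching resources on both boxes:

  # On each box, start hastd and initialize the resources it owns:
  service hastd onestart
  hastctl create disk0            # writes HAST metadata on this node's disk
  hastctl create disk1
  # On the box that should serve the pool, take the primary role so the
  # /dev/hast/* providers appear, then build the pool from them:
  hastctl role primary disk0
  hastctl role primary disk1
  zpool create tank /dev/hast/disk0 /dev/hast/disk1   # each provider is already a network mirror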
[zfs-discuss] vm server storage mirror
Here's another one.

Two identical servers are sitting side by side. They could be connected to
each other via anything (presently using a crossover ethernet cable.) And
obviously they both connect to the regular LAN. You want to serve VM's from
at least one of them, and even if the VM's aren't fault tolerant, you want at
least the storage to be live synced. The first obvious thing to do is simply
cron a zfs send | zfs receive at a very frequent interval. But there are a
lot of downsides to that - besides the fact that you have to settle for some
granularity, you also have a script on one system that will clobber the other
system. So in the event of a failure, you might promote the backup into
production, and you have to be careful not to let it get clobbered when the
main server comes up again.

I like much better the idea of using a zfs mirror between the two systems,
even if it comes with a performance penalty as a result of bottlenecking the
storage onto Ethernet. But there are several ways to possibly do that, and
I'm wondering which will be best.

Option 1: Each system creates a big zpool of the local storage. Then, create
a zvol within the zpool, and export it over iscsi to the other system. Now
both systems can see a local zvol and a remote zvol, which they can use to
create a zpool mirror. The reason I don't like this idea is that it's a zpool
within a zpool, including the double-checksumming and everything. But the
double-checksumming isn't such a concern to me - I'm mostly afraid some
horrible performance or reliability problem might result. Naturally, you
would only zpool import the nested zpool on one system. The other system
would basically just ignore it. But in the event of a primary failure, you
could force import the nested zpool on the secondary system.

Option 2: At present, both systems are using local mirroring, 3 mirror pairs
of 6 disks. I could break these mirrors, and export one side over to the
other system... and vice versa. So neither server will be doing local
mirroring; they will both be mirroring across iscsi to targets on the other
host. Once again, each zpool will only be imported on one host, but in the
event of a failure, you could force import it on the other host.

Can anybody think of a reason why Option 2 would be stupid, or can you think
of a better solution?
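For what it's worth, a rough sketch of the Option 2 plumbing on illumos with
COMSTAR. Every device name, GUID, and address below is hypothetical, and the
same steps would be repeated in the other direction for host B's pool:

  # On host B: export one local disk as an iSCSI logical unit.
  svcadm enable -r svc:/system/stmf:default svc:/network/iscsi/target:default
  stmfadm create-lu /dev/rdsk/c0t2d0s2       # prints the new LU GUID
  stmfadm add-view <lu-guid-from-create-lu>  # make the LU visible to initiators
  itadm create-target

  # On host A: discover the remote LU and mirror it with a local disk.
  iscsiadm add discovery-address 192.168.1.2
  iscsiadm modify discovery --sendtargets enable
  devfsadm -i iscsi                          # make the new disk node appear
  zpool create vmpool mirror c0t1d0 c0t<remote-lu-guid>d0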
[zfs-discuss] zvol refreservation size
When I create a 50G zvol, it gets "volsize" 50G, and it gets "used" and
"refreservation" 51.6G.

I have some filesystems already in use, hosting VM's, and I'd like to mimic
the refreservation setting on the filesystem, as if I were smart enough from
the beginning to have used the zvol. So my question is ...

What's the extra 1.6G for?

And

If I have a filesystem holding a single VM with a single 2T disk, how large
should the refreservation be?

If it's a linear scale, it should be 2.064T refreservation.
Re: [zfs-discuss] Different size / manufacturer L2ARC
Excellent, thanks to you both. I knew of both those methods and wanted to
make sure I wasn't missing something!

On Wed, Sep 26, 2012 at 11:21 AM, Dan Swartzendruber wrote:

> On 9/26/2012 11:18 AM, Matt Van Mater wrote:
>> [snip]
>> Thanks for your fast reply! I think I know the answer to this question,
>> but what is the best way to determine how large my pool's l2arc working
>> set is (i.e. how much l2arc is in use)?
>
> Easiest way:
>
> zpool iostat -v
Re: [zfs-discuss] Different size / manufacturer L2ARC
On 9/26/2012 11:18 AM, Matt Van Mater wrote:

>> If the added device is slower, you will experience a slight drop in
>> per-op performance, however, if your working set needs another SSD,
>> overall it might improve your throughput (as the cache hit ratio will
>> increase).
>
> Thanks for your fast reply! I think I know the answer to this question,
> but what is the best way to determine how large my pool's l2arc working
> set is (i.e. how much l2arc is in use)?

Easiest way:

zpool iostat -v
Re: [zfs-discuss] Different size / manufacturer L2ARC
On 09/26/2012 05:18 PM, Matt Van Mater wrote:

>> If the added device is slower, you will experience a slight drop in
>> per-op performance, however, if your working set needs another SSD,
>> overall it might improve your throughput (as the cache hit ratio will
>> increase).
>
> Thanks for your fast reply! I think I know the answer to this question,
> but what is the best way to determine how large my pool's l2arc working
> set is (i.e. how much l2arc is in use)?

Go grab arcstat.pl from
http://blog.harschsystems.com/2010/09/08/arcstat-pl-updated-for-l2arc-statistics/
- that's the tool you're looking for.

Cheers,
--
Saso
Re: [zfs-discuss] Different size / manufacturer L2ARC
> If the added device is slower, you will experience a slight drop in
> per-op performance, however, if your working set needs another SSD,
> overall it might improve your throughput (as the cache hit ratio will
> increase).

Thanks for your fast reply! I think I know the answer to this question, but
what is the best way to determine how large my pool's l2arc working set is
(i.e. how much l2arc is in use)?

Matt
Re: [zfs-discuss] Different size / manufacturer L2ARC
On 09/26/2012 05:08 PM, Matt Van Mater wrote:

> I've looked on the mailing list (the evil tuning wikis are down) and
> haven't seen a reference to this seemingly simple question...
>
> I have two OCZ Vertex 4 SSDs acting as L2ARC. I have a spare Crucial SSD
> (about 1.5 years old) that isn't getting much use and I'm curious about
> adding it to the pool as a third L2ARC device.
>
> Is there any reason why I technically can't use different capacity and/or
> manufacturer SSDs as a single ZFS pool's L2ARC?

No, there isn't. You can do it without problems.

> Even if it will work technically, will this configuration negatively
> impact performance (e.g. slow down the entire cache to the slowest drive's
> performance)?

If the added device is slower, you will experience a slight drop in per-op
performance; however, if your working set needs another SSD, overall it might
improve your throughput (as the cache hit ratio will increase).

Cheers,
--
Saso
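Adding the third SSD is a single, non-destructive step; a quick sketch with
hypothetical pool and device names (cache devices can also be removed again
at any time with zpool remove):

  zpool add tank cache c4t2d0    # attach the spare SSD as another L2ARC device
  zpool iostat -v tank           # the new device shows up under the "cache" section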
[zfs-discuss] Different size / manufacturer L2ARC
I've looked on the mailing list (the evil tuning wikis are down) and haven't
seen a reference to this seemingly simple question...

I have two OCZ Vertex 4 SSDs acting as L2ARC. I have a spare Crucial SSD
(about 1.5 years old) that isn't getting much use and I'm curious about
adding it to the pool as a third L2ARC device.

Is there any reason why I technically can't use different capacity and/or
manufacturer SSDs as a single ZFS pool's L2ARC? Even if it will work
technically, will this configuration negatively impact performance (e.g.
slow down the entire cache to the slowest drive's performance)?

Thanks!
Matt
Re: [zfs-discuss] Interesting question about L2ARC
On 09/26/2012 01:14 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>>
>> Got me wondering: how many reads of a block from spinning rust
>> suffice for it to ultimately get into L2ARC? Just one so it
>> gets into a recent-read list of the ARC and then expires into
>> L2ARC when ARC RAM is more needed for something else,
>
> Correct, but not always sufficient. I forget the name of the parameter,
> but there's some rate-limiting thing that limits how fast you can fill
> the L2ARC. This means sometimes, things will expire from ARC, and simply
> get discarded.

The parameters are:

*) l2arc_write_max (default 8MB): max number of bytes written per
   fill cycle
*) l2arc_headroom (default 2x): multiplies the above parameter and
   determines how far into the ARC lists we will search for buffers
   eligible for writing to L2ARC.
*) l2arc_feed_secs (default 1s): regular interval between fill cycles
*) l2arc_feed_min_ms (default 200ms): minimum interval between fill
   cycles

Cheers,
--
Saso
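For reference, a quick way to check the current fill limit and watch the feed
behavior on illumos; the relevant kstats are the l2-prefixed arcstats, and
output will of course vary per system:

  echo "l2arc_write_max/E" | mdb -k        # current fill limit, bytes
  kstat -p zfs:0:arcstats | grep l2_       # l2_size, l2_hits, l2_misses, l2_write_bytes, ...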
Re: [zfs-discuss] Interesting question about L2ARC
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> Got me wondering: how many reads of a block from spinning rust
> suffice for it to ultimately get into L2ARC? Just one so it
> gets into a recent-read list of the ARC and then expires into
> L2ARC when ARC RAM is more needed for something else,

Correct, but not always sufficient. I forget the name of the parameter, but
there's some rate-limiting thing that limits how fast you can fill the L2ARC.
This means sometimes, things will expire from ARC, and simply get discarded.