Re: [zfs-discuss] Thinking about splitting a zpool in "system" and "data"
Hello, Jesus,

I have transitioned a number of systems by roughly the same procedure as you've outlined. Sadly, my notes are not in English, so they wouldn't be of much help directly; but I can report that I had success with similar "in-place" manual transitions from mirrored SVM (pre-Solaris 10u4) to new ZFS root pools, as well as various transitions of ZFS root pools from one layout to another, on systems with limited numbers of disk drives (2-4 overall). As I've recently reported on the list, I've also done such a "migration" for my faulty single-disk rpool at home via the data pool and back, changing the "copies" setting en route.

Overall, your plan seems okay and has more failsafes than ours did - because longer downtimes were affordable ;) However, when doing such low-level stuff, you should make sure that you have remote access to your systems (ILOM, KVM, etc.; remotely controlled PDUs for externally enforced poweroff-poweron are welcome), and that you can boot the systems over ILOM/rKVM with an image of a LiveUSB/LiveCD/etc. in case of bigger trouble.

In steps 6-7, where you reboot the system to test that the new rpool works, you might want to keep the zones down, e.g. by disabling the zones service in the old BE just before reboot and zfs-sending this update to the new small rpool. Also, it is likely that in the new BE (small rpool) your old "data" from the big rpool won't get imported by itself, and zones (or their services) wouldn't start correctly anyway before steps 7-8.

---

Below I'll outline our experience from my notes, as it successfully applied to an even more complicated situation than yours:

On many Sol10/SXCE systems with ZFS roots we've also created a hierarchical layout (separate /var, /usr, /opt with compression enabled), but this procedure HAS FAILED for newer OpenIndiana systems. So for OI we have to use the default single-root layout and only separate some of the /var/* subdirs (adm, log, mail, crash, cores, ...)
in order to set quotas and higher compression on them. Such datasets are also kept separate from OS upgrades and are used in all boot environments without cloning.

To simplify things, most of the transitions were done in off-hours, so it was okay to shut down all the zones and other services. In some cases for Sol10/SXCE the procedure involved booting in "Failsafe Boot" mode; for all systems this can be done with the BootCD.

For usual Solaris 10 and OpenSolaris SXCE maintenance we did use LiveUpgrade, but at that time its ZFS support was immature, so we circumvented LU and transitioned manually. In those cases we used LU to update systems to the base level supporting ZFS roots (Sol10u4+) while running from SVM mirrors (one mirror for the main root, another mirror for the LU root holding the new/old OS image). After the transition to the ZFS rpool, we cleared out the LU settings (/etc/lu/, /etc/lutab) by using defaults from the most recent SUNWlu* packages, and when booted from ZFS we created the "current" LU BE based on the current ZFS rpool.

When the OS was capable of booting from ZFS (Sol10u4+, snv_100 approx.), we broke the SVM mirrors, repartitioned the second disk to our liking (about 4-10Gb for rpool, the rest for data), created the new rpool and the dataset hierarchy we needed, and had it mounted under "/zfsroot". Note that in our case we used a "minimized" install of Solaris which fit under 1-2Gb per BE; we did not use a separate dump device, and the swap volume was located in the ZFS data pool (mirror or raidz for 4-disk systems). Zone roots were also separate from the system rpool and were stored in the data pool. This DID yield problems for LiveUpgrade, so zones were detached before LU and reattached-with-upgrade after the OS upgrade and disk migrations.
Then we copied the root FS data like this:

# cd /zfsroot && ( ufsdump 0f - / | ufsrestore -rf - )

If the source (SVM) paths like /var, /usr or /boot are separate UFS filesystems, repeat likewise, changing the current path in the command above. For non-UFS systems, such as migration from VxFS or even ZFS (if you need a different layout, compression, etc., so ZFS send/recv is not applicable), you can use Sun cpio (it should carry over extended attributes and ACLs). For example, if you're booted from the LiveCD and the old UFS root is mounted in "/ufsroot" and the new ZFS rpool hierarchy is in "/zfsroot", you'd do this:

# cd /ufsroot && ( find . -xdev -depth -print | cpio -pvdm /zfsroot )

The example above also copies only the data from the current FS, so you need to repeat it for each UFS sub-FS like /var, etc. Another problem we've encountered while cpio'ing live systems (when not running from failsafe/LiveCD) is that "find" skips mountpoints of sub-FSes. While your new ZFS hierarchy would provide usr, var, opt under /zfsroot, you might need to manually create some others - see the list in your current "df" output. Example:

# cd /zfsroot
# mkdir -p tmp proc devices var/run system/contract syste
[zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
Hello all,

For smaller systems such as laptops or low-end servers, which can house 1-2 disks, would it make sense to dedicate a 2-4Gb slice to the ZIL for the data pool, separate from rpool? Example layout (single-disk or mirrored):

s0 - 16Gb - rpool
s1 -  4Gb - data-zil
s3 -  *Gb - data pool

The idea would be to decrease fragmentation (committed writes to the data pool would be more coalesced) and to keep the ZIL at the faster tracks of the HDD. I'm actually more interested in the former: would the dedicated ZIL decrease fragmentation of the pool?

Likewise, for larger pools (such as my 6-disk raidz2), can fragmentation and/or performance benefit from some dedicated ZIL slices (i.e. s0 = 1-2Gb ZIL per 2Tb disk, with 3 mirrored ZIL sets overall)? Can several ZIL (mirrors) be concatenated for a single data pool, or can only one dedicated ZIL vdev be used?

Thanks,
//Jim Klimov

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
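For reference, the slice layout above would translate into zpool commands roughly like this (device names are invented and this is an untested sketch, not a recipe). To the best of my knowledge a pool may also carry several log vdevs, including several mirrored pairs, with log writes spread across them - which answers the "can several ZIL mirrors be concatenated" question:

```shell
# Hypothetical single-disk layout: s0 = rpool, s1 = dedicated ZIL, s3 = data.
zpool create rpool c0t0d0s0        # 16Gb slice for the root pool
zpool create data  c0t0d0s3        # bulk of the disk for the data pool
zpool add   data  log c0t0d0s1     # 4Gb slice as a separate ZIL device
# Larger pools can take multiple log vdevs, e.g. mirrored pairs:
# zpool add tank log mirror c1t0d0s0 c2t0d0s0 mirror c3t0d0s0 c4t0d0s0
```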
Re: [zfs-discuss] ZFS Upgrade
2012-01-06 17:49, Edward Ned Harvey wrote:
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ivan Rodriguez
>
>> Dear list,
>> I'm about to upgrade a zpool from version 10 to version 29. I suppose
>> that this upgrade will fix several performance issues that are present
>> on 10; however, inside that pool we have several zfs filesystems, all
>> of them version 1. My first question is: is there a problem with
>> performance, or any other problem, if you operate a version-29 zpool
>> with version-1 zfs filesystems? Is it better to upgrade zfs to the
>> latest version? Can we jump from zfs version 1 to 5? Are there any
>> implications for zfs send/receive between filesystems and pools of
>> different versions?
>
> You can, and definitely should, upgrade all your zpools and zfs
> filesystems. The only exception to think about is rpool. You definitely
> DON'T want to upgrade rpool higher than what's supported on the boot CD.
> So I suggest you create a test system, boot from the boot CD, create
> some filesystems, and check which zpool and zfs versions they are.
> Then upgrade rpool only to that level (just in case you ever need to
> boot from CD to perform a rescue), and upgrade all your other
> filesystems to the latest.

I believe in this case it might make sense to boot the target system from this BootCD and use "zpool upgrade" from that OS image. This way you can be more sure that your recovery software (Solaris BootCD) would be helpful :) But this is only applicable if you can afford the downtime...

//Jim
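The upgrade steps being discussed boil down to a few commands (the pool name "tank" is a placeholder; a sketch, not a transcript):

```shell
zpool upgrade -v          # list the pool versions this OS (or BootCD) supports
zpool upgrade -V 29 tank  # raise "tank" only to an explicitly chosen version
zfs upgrade -r tank       # upgrade all filesystems in the pool, in place
```

`zfs upgrade` likewise accepts `-V` if you want to stop at a filesystem version your rescue media still understands.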
Re: [zfs-discuss] Stress test zfs
Hi Grant,

On 01/06/2012 04:50 PM, Richard Elling wrote:
> Hi Grant,
>
> On Jan 4, 2012, at 2:59 PM, grant lowe wrote:
>
>> Hi all,
>>
>> I've got solaris 10 running 9/10 on a T3. It's an Oracle box with
>> 128GB memory. I've been trying to load test the box with bonnie++.
>> I can seem to get 80 to 90 K writes, but can't seem to get more than
>> a couple K for reads. Any suggestions? Or should I take this to a
>> bonnie++ mailing list? Any help is appreciated. I'm kinda new to
>> load testing.
>
> I was hoping Roch (from Oracle) would respond, but perhaps he's not
> hanging out on zfs-discuss anymore?
>
> Bonnie++ sux as a benchmark. The best analysis of this was done by Roch
> and published online in the seminal blog post:
> http://137.254.16.27/roch/entry/decoding_bonnie
>
> I suggest you find a benchmark that more closely resembles your expected
> workload and do not rely on benchmarks that provide a summary metric.
> -- richard

I had good experience with "filebench". It resembles your workload as well as you are able to describe it, but it takes some time to get things set up if you cannot find your workload in one of the many provided examples.

Thomas
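As a hedged illustration of Thomas's suggestion (assuming the filebench package is installed, a pool mounted at /testpool exists, and the "oltp" personality ships with your filebench build - an Oracle-style workload arguably closer to this box's use than bonnie++'s streaming tests):

```shell
# Interactive filebench session: load a canned workload, point it at a
# test directory, run for 60 seconds. Names/paths are placeholders.
filebench <<'EOF'
load oltp
set $dir=/testpool/fbtest
run 60
EOF
```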
[zfs-discuss] zfs defragmentation via resilvering?
Hello all,

I understand that relatively high fragmentation is inherent to ZFS due to its COW and possible intermixing of metadata and data blocks (of which metadata path blocks are likely to expire and get freed relatively quickly).

I believe it was sometimes implied on this list that such fragmentation for "static" data can currently be combatted only by zfs send-ing existing pools' data to other pools on some reserved hardware, and then clearing the original pools and sending the data back. This is time-consuming, disruptive and requires lots of extra storage idling for this task (or at best used for backup purposes).

I wonder how resilvering works, namely: does it write blocks "as they were" or in an optimized (defragmented) fashion, in two usecases:

1) Resilvering from a healthy array (vdev) onto a spare drive in order to replace one of the healthy drives in the vdev;

2) Resilvering a degraded array from existing drives onto a new drive in order to repair the array and make it redundant again.

Also, are these two modes different at all? I.e., if I were to ask ZFS to replace a working drive with a spare as in case (1), can I do it at all, and would its data simply be copied over, or reconstructed from the other drives, or some mix of these two operations?

Finally, what would the gurus say: does fragmentation pose a heavy problem on nearly-filled-up pools made of spinning HDDs (I believe so, at least judging from those performance-degradation problems writing to 80+%-filled pools), and can fragmentation be effectively combatted on ZFS at all (with or without BP rewrite)? For example, can (does?) metadata live "separately" from data in some "dedicated" disk areas, while data blocks are written as contiguously as they can be? Many Windows defrag programs group files into several "zones" on the disk based on their last-modified times, so that old WORM files remain defragmented for a long time.
There are thus some empty areas reserved for new writes, as well as for moving newly discovered WORM files to the WORM zones (free space permitting)... I wonder if this is viable with ZFS (COW and snapshots involved) when BP rewrites are implemented? Perhaps such zoned defragmentation can be done based on block creation date (TXG number) and the knowledge that some blocks in a certain order comprise at least one single file (maybe more, due to clones and dedup) ;)

What do you think?

Thanks,
//Jim Klimov
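The evacuate-and-restore cycle described above can be sketched as commands (names are invented; "scratch" must be a separate pool with enough space, and the original pool is destroyed and recreated with the desired layout in between - so this is a sketch of the idea, not a safe recipe):

```shell
zfs snapshot -r tank@evac
zfs send -R tank@evac | zfs receive -du scratch      # copy everything out
# ... destroy and recreate "tank" with the new layout, then stream it back:
zfs send -R scratch/tank@evac | zfs receive -du tank
```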
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> For smaller systems such as laptops or low-end servers,
> which can house 1-2 disks, would it make sense to dedicate
> a 2-4Gb slice to the ZIL for the data pool, separate from
> rpool? Example layout (single-disk or mirrored):
>
> The idea would be to decrease fragmentation (committed
> writes to data pool would be more coalesced) and to keep
> the ZIL at faster tracks of the HDD drive.

I'm not authoritative; I'm speaking from memory of former discussions on this list and various sources of documentation.

No, it won't help you.

First of all, all your writes to the storage pool are aggregated, so you're already minimizing fragmentation of writes in your main pool. However, over time, as snapshots are created & destroyed, small changes are made to files, and file contents are overwritten incrementally and internally... the only fragmentation you get creeps in as a result of COW. This fragmentation only impacts sequential reads of files which were previously written in random order. This type of fragmentation has no relation to ZIL or writes.

If you don't split out your ZIL separate from the storage pool, zfs already chooses disk blocks that it believes to be optimized for minimal access time. In fact, I believe, zfs will dedicate a few sectors at the low end, a few at the high end, and various other locations scattered throughout the pool, so whatever the current head position, it tries to go to the closest "landing zone" that's available for ZIL writes. If anything, splitting out your ZIL to a different partition might actually hurt your performance.

Also, the concept of "faster tracks of the HDD" is incorrect. Yes, there was a time when HDD speeds were limited by rotational speed and magnetic density, so the outer tracks of the disk could serve up more data because more magnetic material passed over the head in each rotation.
But nowadays, hard drive sequential speed is limited by the head speed, which is invariably right around 1Gbps. So the inner and outer sectors of the HDD are equally fast - the outer sectors are actually less magnetically dense because the head can't handle it. And random IO speed is limited by head seek + rotational latency, where seek is typically several times longer than latency.

So basically, the only thing that matters to optimize the performance of any modern typical HDD is to minimize head travel. You want to be seeking sectors which are on tracks nearby to the present head position.

Of course, if you want to test & benchmark the performance of splitting apart the ZIL to a different partition, I encourage that. I'm only speaking my beliefs based on my understanding of the architectures and limitations involved. This is my best prediction. And I've certainly been wrong before. ;-) Sometimes, being wrong is my favorite thing, because you learn so much from it. ;-)
Re: [zfs-discuss] zfs defragmentation via resilvering?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> I understand that relatively high fragmentation is inherent
> to ZFS due to its COW and possible intermixing of metadata
> and data blocks (of which metadata path blocks are likely
> to expire and get freed relatively quickly).
>
> I believe it was sometimes implied on this list that such
> fragmentation for "static" data can be currently combatted
> only by zfs send-ing existing pools data to other pools at
> some reserved hardware, and then clearing the original pools
> and sending the data back. This is time-consuming, disruptive
> and requires lots of extra storage idling for this task (or
> at best - for backup purposes).

It can be combated by sending & receiving, but that's not the only way. You can defrag (or apply/remove dedup and/or compression, or any of the other stuff that's dependent on BP rewrite) by doing any technique which sequentially reads the existing data and writes it back to disk again. For example, if you "cp -p file1 file2 && mv file2 file1", then you have effectively defragged file1 (or added/removed dedup or compression). But of course it's requisite that file1 is sufficiently "not being used" right now.

> I wonder how resilvering works, namely - does it write
> blocks "as they were" or in an optimized (defragmented)
> fashion, in two usecases:

Resilver goes according to temporal order. While this might sometimes yield a slightly better organization (if a whole bunch of small writes were previously spread out over a large period of time on a largely idle system, they will now be write-aggregated to sequential blocks), usually resilvering recreates fragmentation similar to the pre-existing fragmentation. In fact, even if you zfs send | zfs receive while preserving snapshots, you're still recreating the data in something like temporal order.
Because it will do all the blocks of the oldest snapshot, and then all the blocks of the second-oldest snapshot, etc. So by preserving the old snapshots, you might sometimes be recreating a significant amount of fragmentation anyway.

> 1) Resilvering from a healthy array (vdev) onto a spare drive
> in order to replace one of the healthy drives in the vdev;
> 2) Resilvering a degraded array from existing drives onto a
> new drive in order to repair the array and make it redundant
> again.

Same behavior either way. Unless... if your old disks are small and very full, and your new disks are bigger, then sometimes in the past you may have suffered fragmentation due to lack of available sequential unused blocks. So resilvering onto new *larger* disks might make a difference.

> Finally, what would the gurus say - does fragmentation
> pose a heavy problem on nearly-filled-up pools made of
> spinning HDDs

Yes. But that's not unique to ZFS or COW. No matter what your system, if your disk is nearly full, you will suffer from fragmentation.

> and can fragmentation be effectively combatted
> on ZFS at all (with or without BP rewrite)?

With BP rewrite, yes, you can effectively combat fragmentation. Unfortunately it doesn't exist. :-/

Without BP rewrite... define "effectively." ;-) I have successfully defragged, compressed, and enabled/disabled dedup on pools before, by using zfs send | zfs receive... or by asking users, "Ok, we're all in agreement: this weekend, nobody will be using the "a" directory. Right?" So then I sudo rm -rf a, and restore from the latest snapshot. Or something along those lines. Next weekend, we'll do the "b" directory...
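The "cp && mv" rewrite trick quoted above can be demonstrated end-to-end. The demo uses a temp file rather than a real pool path; on ZFS, the sequential copy is what re-allocates the file's blocks (defragmenting it, or picking up newly enabled compression/dedup), and -p preserves mode and timestamps:

```shell
# Rewrite a file in place: sequential copy, then rename over the original.
f=$(mktemp)
printf 'cold archived data\n' > "$f"
cp -p "$f" "$f.new" && mv "$f.new" "$f"
cat "$f"    # content is unchanged; only the underlying blocks moved
```

As the text notes, this is only safe while nothing else has the file open - the rename silently discards any writes that land on the old inode meanwhile.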
Re: [zfs-discuss] zfs defragmentation via resilvering?
It seems that S11 shadow migration can help :-)

On 1/7/2012 9:50 AM, Jim Klimov wrote:
> Hello all,
>
> I understand that relatively high fragmentation is inherent to ZFS
> due to its COW and possible intermixing of metadata and data blocks
> (of which metadata path blocks are likely to expire and get freed
> relatively quickly).
> ...

--
Hung-Sheng Tsao Ph.D.
Founder & Principal
HopBit GridComputing LLC
cell: 9734950840
http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/
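For reference, Solaris 11 shadow migration is enabled with a dataset property; because the new dataset's blocks are written afresh as data migrates over, the migration doubles as a rewrite/defrag pass. Paths and names below are placeholders:

```shell
# Create a new dataset that "shadows" an existing directory tree; data
# is pulled over (and rewritten) on access and in the background.
zfs create -o shadow=file:///oldpool/export/data newpool/export/data
shadowstat      # observe migration progress (Solaris 11)
```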
[zfs-discuss] zfs read-ahead and L2ARC
I wonder if it is possible (currently, or in the future as an RFE) to tell ZFS to automatically read-ahead some files and cache them in RAM and/or L2ARC?

One use-case would be Home-NAS setups where multimedia (video files or catalogs of images/music) are viewed from a ZFS box. For example, if a user wants to watch a film, or listen to a playlist of MP3s, or push photos to a wall display (photo frame, etc.), the storage box "should" read ahead all the required data from the HDDs and save it in ARC/L2ARC. Then the HDDs can spin down for hours while the pre-fetched gigabytes of data are served to consumers from the cache. End-users get peace, quiet and less electricity used while they enjoy their multimedia entertainment ;)

Is it possible? If not, how hard would it be to implement? In terms of scripting, would it suffice to detect reads (i.e. with DTrace) and read the files to /dev/null to get them cached along with all the required metadata (so that mechanical HDDs are not required for reads afterwards)?

Thanks,
//Jim Klimov
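Short of DTrace-driven detection, the simple "read it once" idea from the last paragraph can be scripted directly. The media path is a temp-dir placeholder here (a real NAS would use e.g. a playlist directory), and this only helps for as long as the ARC/L2ARC is large enough to retain the data:

```shell
# Pre-warm the cache for an album/playlist by reading every file fully.
MEDIA=$(mktemp -d)                                # stands in for the real media dir
dd if=/dev/zero of="$MEDIA/track01.mp3" bs=1k count=8 2>/dev/null
find "$MEDIA" -type f -exec cat {} + > /dev/null  # data (and metadata) now cached
```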
Re: [zfs-discuss] ZFS Upgrade
On Sat, 7 Jan 2012, Jim Klimov wrote:
> I believe in this case it might make sense to boot the target system
> from this BootCD and use "zpool upgrade" from this OS image. This way
> you can be more sure that your recovery software (Solaris BootCD)
> would be helpful :)

Also keep in mind that it would be a grievous error if the zpool version supported by the BootCD were newer than what the installed GRUB and OS can support.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
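Bob's check can be made concrete: compare version numbers on the rescue media and the installed system before committing to any upgrade (a sketch; output format varies by release):

```shell
zpool upgrade -v          # on the BootCD: highest pool version it supports
zpool get version rpool   # on the installed system: rpool's current version
# upgrade rpool at most to the lower of the two, e.g.:
# zpool upgrade -V <bootcd_version> rpool
```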
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sat, 7 Jan 2012, Jim Klimov wrote:
> Several RAID systems have implemented "spread" spare drives in the
> sense that there is not an idling disk waiting to receive a burst of
> resilver data filling it up, but the capacity of the spare disk is
> spread among all drives in the array. As a result, the healthy array
> gets one more spindle and works a little faster, and rebuild times
> are often decreased since more spindles can participate in repairs
> at the same time.

I think that I would also be interested in a system which uses the so-called spare disks for more protective redundancy, but then reduces that protective redundancy in order to use that disk to replace a failed disk or to automatically enlarge the pool. For example, a pool could start out with four-way mirroring when there is little data in the pool. When the pool becomes more full, mirror devices are automatically removed (from existing vdevs) and used to add more vdevs. Eventually a limit would be hit so that no more mirrors are allowed to be removed. Obviously this approach works with simple mirrors but not for raidz.

Bob
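Today's commands can already do the manual version of this idea for plain mirrors (device names are invented; a sketch of the administrative steps, not the automatic mechanism being proposed):

```shell
# Start with extra redundancy, trade it for capacity later:
zpool detach tank c0t3d0               # 4-way mirror becomes 3-way
zpool add tank mirror c0t3d0 c0t4d0    # freed disk helps form a new vdev
```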
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Hi Jim,

On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
> Hello all,
>
> I have a new idea up for discussion.
>
> Several RAID systems have implemented "spread" spare drives in the
> sense that there is not an idling disk waiting to receive a burst of
> resilver data filling it up, but the capacity of the spare disk is
> spread among all drives in the array. As a result, the healthy array
> gets one more spindle and works a little faster, and rebuild times
> are often decreased since more spindles can participate in repairs
> at the same time.

Xiotech has a distributed, relocatable model, but the FRU is the whole ISE. There have been other implementations of more distributed RAIDness in the past (RAID-1E, etc). The big question is whether they are worth the effort. Spares solve a serviceability problem and only impact availability in an indirect manner. For single-parity solutions, spares can make a big difference in MTTDL, but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).

> I don't think I've seen such an idea proposed for ZFS, and I do wonder
> if it is at all possible with variable-width stripes? Although if the
> disk is sliced in 200 metaslabs or so, implementing a spread-spare is
> a no-brainer as well.

Put some thoughts down on paper and work through the math. If it all works out, let's implement it!
 -- richard

> To be honest, I've seen this a long time ago in (Falcon?) RAID
> controllers, and recently - in a USENIX presentation of IBM GPFS on
> YouTube. In the latter the speaker goes into greater depth describing
> their "declustered RAID" approach (as they call it: all blocks -
> spare, redundancy and data - are intermixed evenly on all drives and
> not in a single "group" or a mid-level VDEV as would be for ZFS).
> http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
>
> GPFS with declustered RAID not only decreases rebuild times and/or the
> impact of rebuilds on end-user operations, but it also happens to
> increase reliability - there is a smaller time window in case of
> multiple-disk failure in a large RAID-6 or RAID-7 array (in the example
> they use 47-disk sets) during which the data is left in a "critical
> state" due to lack of redundancy, and there is less data overall in
> such a state - so the system goes from critical to simply degraded
> (with some redundancy) in a few minutes.
>
> Another thing they have in GPFS is temporary offlining of disks so
> that they can catch up when reattached - only newer writes (bigger TXG
> numbers in ZFS terms) are added to reinserted disks. I am not sure
> this exists in ZFS today, either. This might simplify physical systems
> maintenance (as it does for IBM boxes - see the presentation if
> interested) and quick recovery from temporarily unavailable disks,
> such as when a disk gets a bus reset and is unavailable for writes for
> a few seconds (or more) while the array keeps on writing.
>
> I find these ideas cool. I do believe that IBM might get angry if ZFS
> development copy-pasted them "as is", but it might nonetheless get us
> inventing a similar wheel that would be a bit different ;) There are
> already several vendors doing this in some way, so perhaps there is no
> (patent) monopoly in place already...
>
> And I think all the magic of spread spares and/or "declustered RAID"
> would go into just making another write-block allocator in the same
> league "raidz" or "mirror" are nowadays... BTW, are such allocators
> pluggable (as software modules)?
>
> What do you think - can and should such ideas find their way into ZFS?
> Or why not? Perhaps from theoretical or real-life experience with such
> storage approaches?
> //Jim Klimov

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On Jan 7, 2012, at 7:12 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
>>
>> For smaller systems such as laptops or low-end servers, which can
>> house 1-2 disks, would it make sense to dedicate a 2-4Gb slice to
>> the ZIL for the data pool, separate from rpool?
>
> I'm not authoritative, I'm speaking from memory of former discussions
> on this list and various sources of documentation.
>
> No, it won't help you.

Correct :-)

> Also, the concept of "faster tracks of the HDD" is also incorrect.
> Yes, there was a time when HDD speeds were limited by rotational speed
> and magnetic density, so the outer tracks of the disk could serve up
> more data because more magnetic material passed over the head in each
> rotation. But nowadays, the hard drive sequential speed is limited by
> the head speed, which is invariably right around 1Gbps. So the inner
> and outer sectors of the HDD are equally fast.

Disagree. My data, and the vendor specs, continue to show different sequential media bandwidth speeds for inner vs. outer cylinders.

> Of course, if you want to test & benchmark the performance of
> splitting apart the ZIL to a different partition, I encourage that.

Good idea. I think you will see a tradeoff on the read side of the mixed read/write workload. Sync writes have higher priority than reads, so the order of I/O sent to the disk will appear to be very random and not significantly coalesced. This is the pathological worst-case workload for a HDD. OTOH, you're not trying to get high performance from an HDD, are you? That game is over.
-- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
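[Editorial note: the inner-vs-outer disagreement above can be made concrete with a toy model of zoned bit recording. The sketch below assumes constant spindle speed and roughly constant linear bit density; all numbers (RPM, density, radii) are made-up illustration values, not the specs of any real drive, and it deliberately ignores the head/channel electronics limit that Ned describes.]

```python
import math

# Back-of-the-envelope model of zoned bit recording (ZBR): at constant
# spindle speed and roughly constant linear bit density, an outer track
# holds more bits per revolution, so the sequential *media* rate scales
# with track radius. All constants below are hypothetical illustration
# values, not taken from any real drive's datasheet.

RPM = 7200
BITS_PER_INCH = 1_500_000     # assumed linear recording density
INNER_RADIUS = 0.75           # assumed innermost usable radius, inches
OUTER_RADIUS = 1.80           # assumed outermost usable radius, inches

def media_rate_mbps(radius):
    """Sequential media rate at a given radius, in megabits per second:
    bits per revolution times revolutions per second."""
    bits_per_rev = 2 * math.pi * radius * BITS_PER_INCH
    return bits_per_rev * (RPM / 60) / 1e6

inner = media_rate_mbps(INNER_RADIUS)
outer = media_rate_mbps(OUTER_RADIUS)
print(f"inner: {inner:.0f} Mb/s  outer: {outer:.0f} Mb/s  "
      f"ratio: {outer / inner:.1f}x")
```

With these assumptions the outermost cylinder moves about 2.4x as many bits under the head per second as the innermost one, which is roughly the shape of the inner/outer sequential figures vendors publish. Whether the drive's electronics then cap that rate, as Ned argues, is a separate question the model does not settle.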
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
2012-01-08 5:37, Richard Elling wrote:

> The big question is whether they are worth the effort. Spares solve a
> serviceability problem and only impact availability in an indirect manner.
> For single-parity solutions, spares can make a big difference in MTTDL,
> but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).

Well, regarding this part: in the presentation linked in my OP, the IBM
presenter suggests that for a 6-disk raid10 (3 mirrors) with one spare
drive - a 7-disk set overall - these are the outcomes of a "critical"
hit to data redundancy when one of the drives dies:

1) Traditional RAID - one full disk is a mirror of another full disk;
100% of a disk's size is "critical" and has to be replicated onto the
spare drive ASAP;

2) Declustered RAID - all 7 disks hold a mix of the 2 unique data blocks
from the "original" setup plus one spare block (I am not sure I described
it well in words; his diagram shows it better); if a single disk dies,
only 1/7 worth of disk size is critical (not redundant) and can be
repaired faster.

For their typical 47-disk sets of RAID-7-like redundancy, under 1% of the
data becomes critical when 3 disks die at once, which is (deemed) unlikely
as it is. Apparently, in the GPFS layout, MTTDL is much higher than in
raid10+spare with all other stats being similar.

I am not sure I'm ready (or qualified) to sit down and present the math
right now - I just heard some ideas that I considered worth sharing and
discussing ;)

Thanks for the input,
//Jim
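[Editorial note: the presenter's arithmetic can be sketched out. The figures below (2 TB disks, 100 MB/s of per-disk rebuild throughput) are assumptions for illustration only, not from the talk; the point is the ratio between the two layouts, not the absolute times.]

```python
# Toy arithmetic for the 7-disk example above: 6-disk raid10 plus one
# dedicated spare vs. the same drives declustered. Disk size and rebuild
# throughput are assumed values, chosen only to illustrate the ratio.

DISK_MB = 2_000_000          # 2 TB disk, in megabytes (assumed)
PER_DISK_MBPS = 100          # assumed per-disk rebuild throughput

def rebuild_hours(critical_fraction, parallel_disks):
    """Time to re-replicate the non-redundant ("critical") data when
    `parallel_disks` drives can absorb rebuild writes concurrently."""
    critical_mb = critical_fraction * DISK_MB
    return critical_mb / (PER_DISK_MBPS * parallel_disks) / 3600

# Traditional raid10 + dedicated spare: the whole dead disk is critical,
# and the single spare is the only rebuild target.
trad_h = rebuild_hours(critical_fraction=1.0, parallel_disks=1)

# Declustered over 7 disks: only ~1/7 of a disk's worth of blocks loses
# redundancy, and all 6 surviving disks absorb rebuild writes.
decl_h = rebuild_hours(critical_fraction=1 / 7, parallel_disks=6)

print(f"traditional: {trad_h:.1f} h to restore redundancy")
print(f"declustered: {decl_h * 60:.0f} min to restore redundancy")
```

Under these assumptions the declustered layout restores redundancy about 42x faster (7x less critical data times 6 parallel rebuild targets): hours shrink to minutes, which is consistent with the talk's claim of going from critical to merely degraded in a few minutes.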
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling wrote:

> Hi Jim,
>
> On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
>
> > Hello all,
> >
> > I have a new idea up for discussion.
> >
> > Several RAID systems have implemented "spread" spare drives
> > in the sense that there is not an idling disk waiting to
> > receive a burst of resilver data filling it up, but the
> > capacity of the spare disk is spread among all drives in
> > the array. As a result, the healthy array gets one more
> > spindle and works a little faster, and rebuild times are
> > often decreased since more spindles can participate in
> > repairs at the same time.
>
> Xiotech has a distributed, relocatable model, but the FRU is the whole ISE.
> There have been other implementations of more distributed RAIDness in the
> past (RAID-1E, etc).
>
> The big question is whether they are worth the effort. Spares solve a
> serviceability problem and only impact availability in an indirect manner.
> For single-parity solutions, spares can make a big difference in MTTDL,
> but have almost no impact on MTTDL for double-parity solutions (eg. raidz2).

I disagree. Dedicated spares impact far more than availability. During a
rebuild, performance is, in general, abysmal. ZIL and L2ARC will obviously
help (L2ARC more than ZIL), but at the end of the day, if we've got a
12-hour rebuild (fairly conservative in the days of 2TB SATA drives), the
performance degradation is going to be very real for end-users. With
distributed parity and spares, you should in theory be able to cut this
down by an order of magnitude. I feel as though you're brushing this off
as not a big deal when it's an EXTREMELY big deal (in my mind). In my
opinion you can't approach this from an MTTDL perspective alone; you also
need to take into account user experience. Just because I haven't lost
data doesn't mean the system isn't (essentially) unavailable (sorry for
the double negative and the repeated parentheses). If I can't use the
system due to performance being a fraction of what it is during normal
production, it might as well be an outage.

> > I don't think I've seen such an idea proposed for ZFS, and
> > I do wonder if it is at all possible with variable-width
> > stripes? Although if the disk is sliced into 200 metaslabs
> > or so, implementing a spread-spare is a no-brainer as well.
>
> Put some thoughts down on paper and work through the math. If it all works
> out, let's implement it!
> -- richard

I realize it's not intentional, Richard, but that response is more than a
bit condescending. If he could just put it down on paper and code
something up, I strongly doubt he would be posting his thoughts here. He
would be posting results. The intention of his post, as far as I can tell,
is to perhaps inspire someone who CAN just write down the math and write
up the code to do so. Or at least to have them review his thoughts and
give him a dev's perspective on how viable bringing something like this
to ZFS is. I fear responses like "the code is there, figure it out" make
the *aris community no better than the linux one.

> > To be honest, I've seen this a long time ago in (Falcon?)
> > RAID controllers, and recently - in a USEnix presentation
> > of IBM GPFS on YouTube. In the latter the speaker goes
> > into greater depth describing their "declustered RAID"
> > approach (as they call it): all blocks - spare, redundancy
> > and data - are intermixed evenly on all drives, and not in
> > a single "group" or a mid-level VDEV as they would be for ZFS.
> > http://www.youtube.com/watch?v=2g5rx4gP6yU&feature=related
> >
> > GPFS with declustered RAID not only decreases rebuild
> > times and/or the impact of rebuilds on end-user operations,
> > but it also happens to increase reliability - in case of a
> > multiple-disk failure in a large RAID-6 or RAID-7 array (in
> > the example they use 47-disk sets), there is a smaller time
> > window during which the data is left in a "critical state"
> > due to lack of redundancy, and there is less data overall in
> > such a state - so the system goes from critical to simply
> > degraded (with some redundancy) in a few minutes.
> >
> > Another thing they have in GPFS is temporary offlining
> > of disks so that they can catch up when reattached - only
> > newer writes (bigger TXG numbers in ZFS terms) are added to
> > reinserted disks. I am not sure this exists in ZFS today,
> > either. This might simplify physical systems maintenance
> > (as it does for IBM boxes - see the presentation if interested)
> > and quick recovery from temporarily unavailable disks, such
> > as when a disk gets a bus reset and is unavailable for writes
> > for a few seconds (or more) while the array keeps on writing.
> >
> > I find these ideas cool. I do believe that IBM might get
> > angry if ZFS development copy-pasted them "as is", but it
> > might nonetheless get us inventing a similar wheel that
> > would be a bit different ;)
>
> There are already several vendors doin