Re: [zfs-discuss] Convert pool from ashift=12 to ashift=9
2012-03-18 23:47, Richard Elling wrote:
> ... Yes, it is wrong to think that.

Ok, thanks, we won't try that :)

> copy out, copy in. Whether this is easy or not depends on how well you plan your storage use ...

Home users and personal budgets do tend to have a problem with planning. Any mistake is paid for personally, and many are left as-is. It is hard enough already to justify to an average wife that a storage box with large X-TB disks needs raidz3 or mirroring, and thus becomes larger and noisier, not to mention almost a thousand bucks more expensive just for the redundancy disks, while it will all become two times cheaper in a year anyway. Yup, it is not very easy to find another 10+ TB of backup storage (with ZFS reliability) in a typical home I know of. Planning is not easy... But that's a rant. Hoping that in-place BP rewrite arrives someday and magically solves many problems =)

> > Questions are:
> > 1) How bad would the performance hit be with 512b blocks used on a 4KB drive with such efficient emulation?
> Depends almost exclusively on the workload and hardware. In my experience, most folks who bite the 4KB bullet have low-cost HDDs where one cannot reasonably expect high performance.
> > Is it possible to model/emulate the situation somehow in advance to see if it's worth that change at all?
> It will be far more cost effective to just make the change.

Meaning altogether? That with a consumer disk, which will suck from a performance standpoint anyway, it was not a good idea to use ashift=12, and it would have been more cost effective to remain at ashift=9 to begin with? What about real people's tests, which seemed to show substantial performance hits with misaligned large-block writes (spanning several 4K sectors at the wrong boundaries)?

I had an RFE posted sometime last year about an optimisation for both worlds: use a formal ashift=9 and allow writing of small blocks, but align larger blocks at set boundaries (i.e. offset divisible by 4096 for blocks sized 4096+).
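That alignment rule could be sketched roughly like this (a hypothetical allocator policy illustrating the RFE, not actual ZFS code; the function name and constants are illustrative only):

```python
# Hypothetical sketch of the RFE's allocation rule: keep the formal
# ashift=9 (512-byte) granularity for small blocks, but round the start
# offset of any block of 4 KiB or more up to a 4 KiB boundary.
SMALL_ALIGN = 512    # formal ashift=9 granularity
LARGE_ALIGN = 4096   # physical-sector alignment for large blocks

def aligned_offset(offset, blocksize):
    align = LARGE_ALIGN if blocksize >= LARGE_ALIGN else SMALL_ALIGN
    return (offset + align - 1) // align * align

print(aligned_offset(1536, 512))   # small block: stays at 1536
print(aligned_offset(1536, 8192))  # large block: rounded up to 4096
```

Under this rule, only sub-4KiB writes would ever straddle a physical 4KiB sector boundary, while all larger blocks would land on native sector boundaries of a 4KiB drive.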
Perhaps writing 512b blocks near each other should be reserved for metadata, which is dittoed anyway, so that a whole-sector (4KB) corruption won't be irreversible for some data. In effect, the minimum block size for userdata would be enforced (by config) at the same 4KB in that case. This is a zfs-write-only change (plus some custom pool or dataset attributes), so the on-disk format and compatibility should not suffer with this solution. But I had little feedback on whether the idea was at all reasonable.

> > 2) Is it possible to easily estimate the amount of wasted disk space in the slack areas of the currently active ZFS allocation (unused portions of 4KB blocks that might become available if the disks were reused with ashift=9)?
> Detailed space use is available from the zfs_blkstats mdb macro, as previously described in such threads.

> > 3) How many parts of a ZFS pool are actually affected by the ashift setting?
> Everything is impacted. But that isn't a useful answer.

From what I gather, it is applied at the top-level vdev level (I read that one can mix ashift=9 and ashift=12 TLVDEVs in one pool spanning several TLVDEVs). Is that a correct impression?

> Yes

If yes, how does the ashift size influence the number of slots in the uberblock ring (128 vs. 32 entries), which is applied at the leaf vdev level (right?) but should be consistent across the pool?

> It should be consistent across the top-level vdev. There is 128KB of space available for the uberblock list. The minimum size of an uberblock entry is 1KB. Obviously, a 4KB disk can't write only 1KB, so for 4KB sectors there are 32 entries in the uberblock list.

So if I have ashift=12 and ashift=9 top-level vdevs mixed in the pool, is it okay that some of them would remember 4x more of the pool's TXG history than others?
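The uberblock arithmetic above (128KB of label space, 1KB minimum entry, entries rounded up to the sector size) works out as a quick calculation:

```python
# Uberblock ring slots as a function of ashift, per the numbers above:
# 128 KiB of label space, 1 KiB minimum entry size, and each entry
# occupies at least one physical sector (2**ashift bytes).
UB_RING_BYTES = 128 * 1024
UB_MIN_BYTES = 1024

def uberblock_slots(ashift):
    entry_bytes = max(UB_MIN_BYTES, 1 << ashift)
    return UB_RING_BYTES // entry_bytes

print(uberblock_slots(9))   # 512-byte sectors: 128 slots
print(uberblock_slots(12))  # 4 KiB sectors: 32 slots
```

This is why an ashift=12 vdev keeps only a quarter of the TXG history that an ashift=9 vdev does: each ring entry costs a full 4KiB sector instead of 1KiB.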
As far as I can see in the ZFS on-disk format, all sizes and offsets are in either bytes or 512b blocks, and the ashift'ed block size is not actually used anywhere except to set the minimal block size and its implicit alignment during writes.

> The on-disk format doc is somewhat dated and unclear here. UTSL.

Are there any updates, or is the 2006 PDF the latest available? For example, is there an effort in illumos/nexenta/openindiana to publish their version of the current on-disk format? ;)

Thanks for all the answers,
//Jim

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Convert pool from ashift=12 to ashift=9
Jim Klimov wrote:
> It is hard enough already to justify to an average wife that... <snip>

That made my night. Thanks, Jim. :)
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
I read the ZFS_Best_Practices_Guide and ZFS_Evil_Tuning_Guide, and have some questions:

1. Cache device for L2ARC. Say we get a decent SSD, ~500MB/s read/write. If we have a 20-HDD zpool, shouldn't we already be reading at least in the 500MB/s range? Why would we want a ~500MB/s cache?

2. ZFS dynamically stripes across the top-level vdevs, and performance for one vdev is equivalent to the performance of one drive in that group. Am I correct in thinking this means, for example, that if I have a zpool with a single 14-disk raidz2 vdev, where each disk does ~100MB/s, this zpool would theoretically read/write at ~100MB/s max (and what about the real-world average)? If this were RAID6, I think it would theoretically do ~1.4GB/s, and in real life maybe ~1GB/s (i.e. 10x-14x faster than ZFS, with both providing the same amount of redundancy)? Is my thinking off in the RAID6 or raidz2 numbers? Why doesn't ZFS try to dynamically stripe inside vdevs (and if it does, is there an easy-to-understand explanation of why a vdev doesn't read from multiple drives at once when requesting data, or why a zpool wouldn't make N requests to a vdev, with N being the number of disks in that vdev)? Since performance for one vdev is equivalent to the performance of one drive in that group, it seems the higher raidzN levels are not very useful. If you're using raidzN, you're probably looking for lower-than-mirroring parity (i.e. 10%-33%), but if you try to use raidz3 with 15% parity, you're putting 20 HDDs in one vdev, which is terrible (almost unimaginable) if you're running at 1/20th of the ideal performance.

Main Question:

3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB drives and am buying new 3TB drives to fill up the rest of a 20-disk enclosure (Norco RPC-4220); there is also 1 spare, plus the boot drive, so 22 total. I want around 20%-25% parity.
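The arithmetic behind question 2 can be laid out explicitly. This is only a sketch of the simplified model stated above (one raidz2 vdev performs like one disk; RAID6 streams from all spindles), with ~100MB/s per disk as the assumed figure, not a measured result:

```python
# Simplified throughput model from question 2 (assumptions: 14 disks,
# ~100 MB/s per disk, 2 parity disks; real-world behavior differs).
DISKS = 14
MB_PER_DISK = 100

# Model A: a single raidz2 vdev performs like one drive in the group.
raidz2_read = MB_PER_DISK

# Model B: RAID6 reads stream from all 14 spindles in parallel,
# giving the ~1.4 GB/s theoretical figure quoted above.
raid6_read = DISKS * MB_PER_DISK

print(raidz2_read)  # 100 MB/s
print(raid6_read)   # 1400 MB/s
```

Whether model A is the right mental model for raidz bandwidth (as opposed to IOPS) is exactly what the question is asking, so treat these numbers as the question's premise, not as a property of ZFS.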
My system is like so:

Main Application: Home NAS
* Like to optimize for max space with 20% (ideal) or 25% parity - would like 'decent' reading performance, 'decent' being the max of 10GigE Ethernet; right now it is only 1-gigabit Ethernet, but I hope to leave room to upgrade in the future if 10GigE becomes cheaper. My RAID5 runs at ~500MB/s, so I was hoping to get at least above that with the 20-disk raid.
* 16GB RAM
* Open to using ZIL/L2ARC, but left out for now: writing doesn't occur much (~7GB a week, maybe a big burst every couple of months), and I don't really read the same data multiple times.

What would be the best setup? I'm thinking one of the following:
a. 1 vdev of 8 1.5TB disks (raidz2) + 1 vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability)
b. 1 vdev of 8 1.5TB disks (raidz2) + 3 vdevs of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs)
c. 2 vdevs of 4 1.5TB disks (raidz) + 3 vdevs of 4 3TB disks (raidz)? (~500MB/s reading, maximizes vdevs for performance)

I am leaning towards a., since I am thinking raidz3+raidz2 should provide a little more reliability than five raidz1s, but I'm worried that the real-world read/write performance will be low (theoretical is ~200MB/s, and since the 2nd vdev is 3x the size of the 1st, I am probably looking at more like 133MB/s?). The 12-disk vdev is also above the 9-disk group maximum recommended in the Best Practices guide, so I'm not sure if this affects read performance (if it is just resilver time, I am not as worried, as long as it isn't like 3x longer). I guess I'm hoping a. really isn't ~200MB/s, hehe; if it is, I'm leaning towards b., but if so, all three are downgrades from my initial setup read-performance-wise -_-. Is someone able to correct my understanding if some of my numbers are off, or would someone have a better raidzN configuration I should consider? Thanks for any help.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
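For what it's worth, the three layouts above can be compared on raw capacity and parity fraction with a quick calculation (a sketch only: sizes in TB as marketed, ignoring ZFS metadata overhead and the base-2 vs. base-10 difference):

```python
# Raw vs. usable capacity for the three proposed layouts.
# Each vdev is described as (number_of_disks, disk_size_TB, parity_disks).
layouts = {
    "a": [(8, 1.5, 2), (12, 3.0, 3)],
    "b": [(8, 1.5, 2), (4, 3.0, 1), (4, 3.0, 1), (4, 3.0, 1)],
    "c": [(4, 1.5, 1), (4, 1.5, 1), (4, 3.0, 1), (4, 3.0, 1), (4, 3.0, 1)],
}

for name, vdevs in layouts.items():
    raw = sum(n * size for n, size, p in vdevs)
    usable = sum((n - p) * size for n, size, p in vdevs)
    parity_frac = 1 - usable / raw
    print(f"{name}: {usable:.1f} TB usable of {raw:.1f} TB raw "
          f"({parity_frac:.0%} parity)")
```

Interestingly, all three options come out at 36TB usable of 48TB raw (25% parity), so the choice really does reduce to the reliability-vs-performance trade-off discussed above.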