Re: ditto blocks on ZFS
Russell Coker posted on Fri, 23 May 2014 13:54:46 +1000 as excerpted:

Is anyone doing research on how much free disk space is required on BTRFS for good performance? If a rumor (whether correct or incorrect) goes around that you need 20% free space on a BTRFS filesystem for performance then that will vastly outweigh the space used for metadata.

Well, on btrfs there's free space, and then there's free space. The chunk allocator and data/metadata fragmentation both make a difference. That said, *IF* you're looking at the right numbers, btrfs doesn't actually require that much free space, and should run just as efficiently right down to a few GiB free on pretty much any btrfs over a few GiB in size. So at least in the significant-fractions-of-a-TiB-on-up range, it doesn't require much free space /as/ /a/ /percentage/ at all. **BUT BE SURE YOU'RE LOOKING AT THE RIGHT NUMBERS**, as explained below.

Chunks:

On btrfs, both data and metadata are allocated in chunks: 1 GiB chunks for data, 256 MiB chunks for metadata. The catch is that while both chunks and space within chunks are allocated on demand, deleting files only frees space within chunks -- the chunks themselves remain allocated to data or metadata, whichever they were, and cannot be reallocated to the other. To deallocate unused chunks, and to rewrite partially used chunks so that their contents are consolidated onto fewer chunks and the rest are freed, btrfs admins must currently run a btrfs balance manually (or via script).

btrfs filesystem show:

In the btrfs filesystem show output, the individual devid lines show total filesystem space on the device vs. used space, where "used" means allocated to chunks.[1] Ideally (assuming equal-sized devices) you should keep at least 2.5-3.0 GiB unallocated per device, since that allows allocation of two chunks each for data (1 GiB each) and metadata (a quarter GiB each, but on single-device filesystems they are allocated in dup pairs by default, so half a GiB at a time; see below). Since the balance process itself needs to allocate a new chunk to write into in order to rewrite and consolidate existing chunks, you don't want to use up the last one available, and since the filesystem could decide it needs to allocate another chunk for normal usage as well, you always want to keep at least two chunks' worth of each free: thus 2.5 GiB (3.0 GiB for single-device filesystems, see below) unallocated -- one chunk each of data and metadata for the filesystem if it needs them, and another of each so that balance can allocate at least one chunk to do its rewrite.

As I said, data chunks are 1 GiB, while metadata chunks are 256 MiB, a quarter GiB. However, on a single-device btrfs, metadata will normally default to dup (duplicate, two copies for safety) mode, and will thus allocate two chunks, half a GiB, at a time. This is why you want 3 GiB minimum unallocated on a single-device btrfs: space for two single-mode data chunk allocations (1 GiB * 2 = 2 GiB), plus two dup-mode metadata chunk allocations (256 MiB * 2 * 2 = 1 GiB). On a multi-device btrfs, only a single copy is stored per device, so the metadata minimum reserve is only half a GiB per device (256 MiB * 2 = 512 MiB).

That's the minimum unallocated space you need free. More than that is nice and lets you go longer between having to worry about rebalances, but it really won't help btrfs efficiency that much, since btrfs uses already-allocated chunk space where it can.
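For illustration only, here is the reserve arithmetic above as a tiny Python sketch. The chunk sizes are the defaults quoted in this post; the function name and the helper itself are mine, not anything in btrfs-progs:

# Minimum unallocated reserve per the reasoning above: room for two
# data chunk allocations plus two metadata chunk allocations, where a
# single-device filesystem allocates metadata in dup pairs.
DATA_CHUNK_MIB = 1024       # 1 GiB data chunks (default)
META_CHUNK_MIB = 256        # 256 MiB metadata chunks (default)

def min_unallocated_mib(single_device: bool) -> int:
    meta_alloc = META_CHUNK_MIB * (2 if single_device else 1)  # dup pairs
    return 2 * DATA_CHUNK_MIB + 2 * meta_alloc

print(min_unallocated_mib(single_device=False))  # 2560 MiB = 2.5 GiB
print(min_unallocated_mib(single_device=True))   # 3072 MiB = 3.0 GiB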
btrfs filesystem df:

Then there's the already chunk-allocated space. btrfs filesystem df reports on this. In the df output, "total" means allocated while "used" means used out of that allocation, so the spread between them is allocated-but-unused space. Since btrfs allocates new chunks on demand from the unallocated space pool, but cannot reallocate chunks between data and metadata on its own, and because the used blocks within existing chunks will get fragmented over time, it's best to keep the spread between total and used, as reported by btrfs filesystem df, to a minimum.

Of course, as I said above, data chunks are 1 GiB each, so a data allocation spread of under a GiB won't be recoverable in any case, and a spread of 1-5 GiB isn't a big deal. But if for instance btrfs filesystem df reports data 1.25 TiB total (that is, allocated) but only 250 GiB used, that's a spread of roughly a TiB, and running a btrfs balance in order to return most of that spread to unallocated is a good idea.

Similarly with metadata, except that it's allocated in 256 MiB chunks, two at a time by default on a single-device filesystem, so 512 MiB at a time in that case. But again, if btrfs filesystem df is reporting say 10.5 GiB total metadata but only perhaps 1.75 GiB used, the spread is several chunks' worth, and particularly if your unallocated reserve (as reported by btrfs filesystem show in the individual device lines) is getting low, it's time to consider rebalancing to return the unused metadata space to the unallocated pool.
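As a rough illustration of checking that spread, here is a minimal Python sketch. It assumes df output of the form shown later in this thread ("Data, RAID1: total=2.51TB, used=2.50TB"); the exact wording varies between btrfs-progs versions, so treat the parsing as illustrative rather than robust, and the "/" mount point as a placeholder:

# Compute the allocated-vs-used spread from "btrfs filesystem df".
# The old-style KB/MB/GB/TB suffixes are treated as binary units here;
# newer btrfs-progs print different labels, so adjust as needed.
import re
import subprocess

UNITS = {"KB": 1, "MB": 1024, "GB": 1024**2, "TB": 1024**3}  # in KiB

def to_kib(s):
    value, unit = re.match(r"([\d.]+)(\w+)", s).groups()
    return float(value) * UNITS[unit]

def spreads(mount_point):
    out = subprocess.run(["btrfs", "filesystem", "df", mount_point],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        m = re.match(r"(\w+),.*total=([\d.]+\w+), used=([\d.]+\w+)", line)
        if m:
            kind, total, used = m.groups()
            spread_gib = (to_kib(total) - to_kib(used)) / 1024**2
            print(f"{kind}: {spread_gib:.1f} GiB allocated but unused")

spreads("/")   # hypothetical mount point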
Re: ditto blocks on ZFS
On 2014-05-21 19:05, Martin wrote:

Very good comment from Ashford. Sorry, but I see no advantage in Russell's replies other than a feel-good factor or a dangerous false sense of security. At best, there is a weak justification that for metadata, again, going from 2% to 4% isn't going to be a great problem (storage is cheap and fast). I thought an important idea behind btrfs was that we avoid, by design and in the first place, the very long and vulnerable RAID rebuild scenarios suffered by block-level RAID...

On 21/05/14 03:51, Russell Coker wrote: Absolutely. Hopefully this discussion will inspire the developers to consider this an interesting technical challenge and a feature that is needed to beat ZFS.

Sorry, but I think that is completely the wrong reasoning. ...Unless, that is, you are some proprietary sales droid hyping features and big numbers! :-P Personally I'm not convinced we gain anything beyond what btrfs will eventually offer in any case with the n-way RAID or the RAID-n Cauchy stuff. Also note that usually you want data to be 100% reliable and retrievable, or, if that fails, you go to your backups instead. Gambling on proportions and importance rather than *ensuring* fault/error tolerance is a very human thing... ;-) Sorry: interesting idea, but I'm not convinced there's any advantage for disk/SSD storage.

Regards, Martin

Another nice option in this case might be adding logic to make sure that there is some (considerable) offset between copies of metadata using the dup profile. All of the filesystems whose low-level on-disk structures I have actually looked at have had both copies of the System chunks right next to each other, right at the beginning of the disk, which of course limits the usefulness of storing two copies of them on disk. Adding an offset in those allocations would provide better protection against some of the more common 'idiot' failure modes (e.g. trying to use dd to write a disk image to a USB flash drive and accidentally overwriting the first n GB of your first HDD instead). Ideally, once we have n-way replication, System chunks should default to one copy per device for multi-device filesystems.
Re: ditto blocks on ZFS
I thought an important idea behind btrfs was that we avoid by design in the first place the very long and vulnerable RAID rebuild scenarios suffered by block-level RAID...

This may be true for SSDs - for ordinary disks it's not entirely the case. For most RAID rebuilds, software RAID-1 still seems far faster: one drive is read at (almost) full speed and the other is written at (almost) full speed (assuming no other IO load). With btrfs RAID-1, the way the balance is done after a disk replace, it takes lots of disk head movements, resulting in an overall low rebuild speed, especially with lots of snapshots and the related fragmentation. And the balance is still not smart: it reads from one device and writes to *both* devices (an extra, unnecessary write to the healthy device - it should read from the healthy device and write to the replacement device only). Of course, other factors such as the amount of data or disk IO load during the rebuild apply.

-- Tomasz Chmielewski http://wpkg.org
Re: ditto blocks on ZFS
Russell,

Overall, there are still a lot of unknowns WRT the stability and ROI (Return On Investment) of implementing ditto blocks for BTRFS. The good news is that there's a lot of time before the underlying structure needed to support them is in place, so there's time to figure this out a bit better.

On Tue, 20 May 2014 07:56:41 ashf...@whisperpc.com wrote: 1. There will be more disk space used by the metadata. I've been aware of space allocation issues in BTRFS for more than three years. If the use of ditto blocks will make this issue worse, then it's probably not a good idea to implement it.

The actual increase in metadata space is probably small in most circumstances.

Data, RAID1: total=2.51TB, used=2.50TB
System, RAID1: total=32.00MB, used=376.00KB
Metadata, RAID1: total=28.25GB, used=26.63GB

The above is my home RAID-1 array. It includes multiple backup copies of a medium-size Maildir format mail spool, which probably accounts for a significant portion of the used space; the Maildir spool has an average file size of about 70K and lots of hard links between different versions of the backup. Even so, the metadata is only 1% of the total used space. Going from 1% to 2% to improve reliability really isn't a problem.

Data, RAID1: total=140.00GB, used=139.60GB
System, RAID1: total=32.00MB, used=28.00KB
Metadata, RAID1: total=4.00GB, used=2.97GB

Above is a small Xen server which uses snapshots to back up the files for Xen block devices (the system is lightly loaded so I don't use nocow) and for data files that include a small Maildir spool. It's still only 2% of disk space used for metadata; again, going from 2% to 4% isn't going to be a great problem.

You've addressed half of the issue. It appears that the metadata is normally a bit over 1% using the current methods, but two samples do not make a statistical universe. The good news is that these two samples are from opposite extremes of usage, so I expect they're close to where the overall average would end up. I'd like to see a few more samples, from other usage scenarios, just to be sure. If the above numbers are normal, adding ditto blocks could increase the size of the metadata from 1% to 2% or even 3%. This isn't a problem. What we still don't know, and probably won't until after it's implemented, is whether or not the addition of ditto blocks will make the space allocation worse.

2. Use of ditto blocks will increase write bandwidth to the disk. This is a direct and unavoidable result of having more copies of the metadata. The actual impact of this would depend on the file-system usage pattern, but would probably be unnoticeable in most circumstances. Does anyone have a worst-case scenario for testing?

The ZFS design involves ditto blocks being spaced apart, due to the fact that corruption tends to have some spatial locality. So you are adding an extra seek. The worst case would be when you have lots of small synchronous writes; probably the default configuration of Maildir delivery would be a good case.

Is there a performance test for this? That would be helpful in determining the worst-case performance impact of implementing ditto blocks, and probably some other enhancements as well.

3. Certain kinds of disk errors would be easier to recover from. Some people here claim that those specific errors are rare. I have no opinion on how often they happen, but I believe that if the overall disk space cost is low, it will have a reasonable return.
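A quick back-of-the-envelope check of the 1% figure from the first df output quoted above (the numbers are taken directly from that output; the "double the metadata" projection is only an illustrative assumption about ditto-block overhead, not a measurement):

# Metadata share of used space for the home RAID-1 array quoted above.
data_used_tb = 2.50          # TB, Data used
metadata_used_gb = 26.63     # GB, Metadata used

total_used_gb = data_used_tb * 1000 + metadata_used_gb
print(f"metadata share: {metadata_used_gb / total_used_gb:.1%}")   # ~1.1%

# If ditto blocks roughly doubled metadata (assumption, for illustration):
doubled = 2 * metadata_used_gb / (total_used_gb + metadata_used_gb)
print(f"with doubled metadata: {doubled:.1%}")                      # ~2.1%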
There would be virtually no reliability gains on an SSD-based file-system, as the ditto blocks would be written at the same time, and the SSD would be likely to map the logical blocks into the same page of flash memory.

That claim is unproven AFAIK.

That claim is a direct result of how SSDs function.

4. If the BIO layer of BTRFS and the device driver are smart enough, ditto blocks could reduce I/O wait time. This is a direct result of having more instances of the data on the disk, so it's likely that there will be a ditto block closer to where the disk head currently is. The actual benefit for disk-based file-systems is likely to be under 1ms per metadata seek. It's possible that a short-term backlog on one disk could cause BTRFS to use a ditto block on another disk, which could deliver a 20ms benefit. There would be no performance benefit for SSD-based file-systems.

That is likely with RAID-5 and RAID-10.

It's likely with all disk layouts. The reason just looks different on different RAID structures.

My experience is that once your disks are larger than about 500-750GB, RAID-6 becomes a much better choice, due to the increased chance of having an uncorrectable read error during a reconstruct. My opinion is that anyone storing critical information in RAID-5, or even 2-disk RAID-1, with disks of this capacity, should either reconsider their storage topology, or verify that they have a good backup/restore mechanism in place for that data.
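The uncorrectable-read-error argument above can be made concrete with a minimal sketch. It assumes a manufacturer-quoted URE rate of 1 in 10^14 bits (a common consumer-drive spec) and independent errors; both are simplifying assumptions for illustration only:

# Probability of hitting at least one uncorrectable read error (URE)
# while reading an entire surviving disk during a RAID-5 rebuild.
p_ure_per_bit = 1.0 / 1e14   # assumed spec-sheet rate

def p_rebuild_error(disk_bytes):
    bits = disk_bytes * 8
    return 1.0 - (1.0 - p_ure_per_bit) ** bits

for size_gb in (500, 750, 2000, 4000):
    p = p_rebuild_error(size_gb * 1e9)
    print(f"{size_gb:>5} GB disk: ~{p:.1%} chance of a URE during rebuild")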
Re: ditto blocks on ZFS
On Thu, 22 May 2014 15:09:40 ashf...@whisperpc.com wrote: You've addressed half of the issue. It appears that the metadata is normally a bit over 1% using the current methods, but two samples do not make a statistical universe. The good news is that these two samples are from opposite extremes of usage, so I expect they're close to where the overall average would end up. I'd like to see a few more samples, from other usage scenarios, just to be sure. If the above numbers are normal, adding ditto blocks could increase the size of the metadata from 1% to 2% or even 3%. This isn't a problem. What we still don't know, and probably won't until after it's implemented, is whether or not the addition of ditto blocks will make the space allocation worse.

I've been involved in many discussions about filesystem choice. None of them have included anyone raising an issue about ZFS metadata space usage; probably most ZFS users don't even know about ditto blocks. The relevant issue regarding disk space is the fact that filesystems tend to perform better if there is a reasonable amount of free space. The amount of free space needed for good performance will depend on the filesystem, the usage pattern, and whatever you might define as good performance. The first two Google hits when searching for recommended free space on ZFS recommended using no more than 80% and 85% of disk space. Obviously if good performance requires 15% free disk space then your capacity problem isn't going to be solved by not duplicating metadata. Note that I am not aware of the accuracy of such claims about ZFS performance.

Is anyone doing research on how much free disk space is required on BTRFS for good performance? If a rumor (whether correct or incorrect) goes around that you need 20% free space on a BTRFS filesystem for performance then that will vastly outweigh the space used for metadata.

The ZFS design involves ditto blocks being spaced apart due to the fact that corruption tends to have some spatial locality. So you are adding an extra seek. The worst case would be when you have lots of small synchronous writes; probably the default configuration of Maildir delivery would be a good case.

Is there a performance test for this? That would be helpful in determining the worst-case performance impact of implementing ditto blocks, and probably some other enhancements as well.

http://doc.coker.com.au/projects/postal/

My Postal mail server benchmark is one option. There are more than a few benchmarks of synchronous writes of small files, but Postal uses real-world programs that need such performance. Delivering a single message via a typical Unix MTA requires synchronous writes of two queue files and then the destination file in the mail store.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Re: ditto blocks on ZFS
Very good comment from Ashford. Sorry, but I see no advantage in Russell's replies other than a feel-good factor or a dangerous false sense of security. At best, there is a weak justification that for metadata, again, going from 2% to 4% isn't going to be a great problem (storage is cheap and fast). I thought an important idea behind btrfs was that we avoid, by design and in the first place, the very long and vulnerable RAID rebuild scenarios suffered by block-level RAID...

On 21/05/14 03:51, Russell Coker wrote: Absolutely. Hopefully this discussion will inspire the developers to consider this an interesting technical challenge and a feature that is needed to beat ZFS.

Sorry, but I think that is completely the wrong reasoning. ...Unless, that is, you are some proprietary sales droid hyping features and big numbers! :-P Personally I'm not convinced we gain anything beyond what btrfs will eventually offer in any case with the n-way RAID or the RAID-n Cauchy stuff. Also note that usually you want data to be 100% reliable and retrievable, or, if that fails, you go to your backups instead. Gambling on proportions and importance rather than *ensuring* fault/error tolerance is a very human thing... ;-) Sorry: interesting idea, but I'm not convinced there's any advantage for disk/SSD storage.

Regards, Martin
Re: ditto blocks on ZFS
On 20/5/2014 5:07 AM, Russell Coker wrote: On Mon, 19 May 2014 23:47:37 Brendan Hide wrote: This is extremely difficult to measure objectively. Subjectively ... see below. [snip] *What other failure modes* should we guard against? I know I'd sleep a /little/ better at night knowing that a double disk failure on a raid5/1/10 configuration might ruin a ton of data along with an obscure set of metadata in some long tree paths - but not the entire filesystem.

My experience is that most disk failures that don't involve extreme physical damage (EG dropping a drive on concrete) don't involve totally losing the disk. Much of the discussion about RAID failures concerns entirely failed disks, but I believe that is due to RAID implementations such as Linux software RAID that will entirely remove a disk when it gives errors. I have a disk which had ~14,000 errors, of which ~2000 errors were corrected by duplicate metadata. If two disks with that problem were in a RAID-1 array then duplicate metadata would be a significant benefit.

The other use-case/failure mode - where you are somehow unlucky enough to have sets of bad sectors/bitrot on multiple disks that simultaneously affect the only copies of the tree roots - is an extremely unlikely scenario. As unlikely as it may be, the scenario is a very painful consequence in spite of VERY little corruption. That is where the peace-of-mind/bragging rights come in.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research on latent errors on drives is worth reading. On page 12 they report latent sector errors on 9.5% of SATA disks per year. So if you lose one disk entirely the risk of having errors on a second disk is higher than you would want for RAID-5. While losing the root of the tree is unlikely, losing a directory in the middle that has lots of subdirectories is a risk.

Seeing the results of that paper, I think erasure coding is a better solution. Instead of having many copies of metadata or data, we could do erasure coding using something like zfec[1], which is used by Tahoe-LAFS, increasing their size by let's say 5-10%, and be quite safe even from multiple contiguous bad sectors.

[1] https://pypi.python.org/pypi/zfec

I can understand why people wouldn't want ditto blocks to be mandatory. But why are people arguing against them as an option? As an aside, I'd really like to be able to set RAID levels by subtree. I'd like to use RAID-1 with ditto blocks for my important data and RAID-0 for unimportant data.
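To illustrate the erasure-coding idea in its very simplest form, here is a toy Python sketch: k data blocks plus a single XOR parity block, so any one lost block can be rebuilt. This is emphatically not zfec's API; real codes such as zfec or Reed-Solomon tolerate more lost blocks for a comparable size overhead:

# Toy erasure-coding illustration, NOT zfec.
from functools import reduce

def xor_blocks(blocks):
    # XOR equal-length blocks together byte by byte
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def recover(blocks, parity):
    missing = [i for i, b in enumerate(blocks) if b is None]
    assert len(missing) <= 1, "XOR parity can only rebuild one lost block"
    if missing:
        present = [b for b in blocks if b is not None] + [parity]
        blocks[missing[0]] = xor_blocks(present)
    return blocks

data = [b"meta", b"tree", b"node", b"leaf"]   # four equal-sized blocks
parity = xor_blocks(data)                      # 25% overhead for k=4

damaged = [data[0], None, data[2], data[3]]    # one block unreadable
assert recover(damaged, parity) == data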
Re: ditto blocks on ZFS
On 2014-05-19 22:07, Russell Coker wrote: On Mon, 19 May 2014 23:47:37 Brendan Hide wrote: This is extremely difficult to measure objectively. Subjectively ... see below. [snip] *What other failure modes* should we guard against? I know I'd sleep a /little/ better at night knowing that a double disk failure on a raid5/1/10 configuration might ruin a ton of data along with an obscure set of metadata in some long tree paths - but not the entire filesystem.

My experience is that most disk failures that don't involve extreme physical damage (EG dropping a drive on concrete) don't involve totally losing the disk. Much of the discussion about RAID failures concerns entirely failed disks, but I believe that is due to RAID implementations such as Linux software RAID that will entirely remove a disk when it gives errors. I have a disk which had ~14,000 errors, of which ~2000 errors were corrected by duplicate metadata. If two disks with that problem were in a RAID-1 array then duplicate metadata would be a significant benefit.

The other use-case/failure mode - where you are somehow unlucky enough to have sets of bad sectors/bitrot on multiple disks that simultaneously affect the only copies of the tree roots - is an extremely unlikely scenario. As unlikely as it may be, the scenario is a very painful consequence in spite of VERY little corruption. That is where the peace-of-mind/bragging rights come in.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research on latent errors on drives is worth reading. On page 12 they report latent sector errors on 9.5% of SATA disks per year. So if you lose one disk entirely the risk of having errors on a second disk is higher than you would want for RAID-5. While losing the root of the tree is unlikely, losing a directory in the middle that has lots of subdirectories is a risk.

I can understand why people wouldn't want ditto blocks to be mandatory. But why are people arguing against them as an option? As an aside, I'd really like to be able to set RAID levels by subtree. I'd like to use RAID-1 with ditto blocks for my important data and RAID-0 for unimportant data.

But the proposed changes for n-way replication would already handle this. They would just need the option of having more than one copy per device (which theoretically shouldn't be too hard once you have n-way replication). Also, BTRFS already has the option of replicating the root tree across multiple devices (it is included in the System Data subset), and in fact does so by default when using multiple devices. Also, there are plans to have per-subvolume or per-file RAID level selection, but IIRC that is planned for after n-way replication (and of course RAID 5/6, as n-way replication isn't going to be implemented until after RAID 5/6).
Re: ditto blocks on ZFS
On 2014/05/20 04:07 PM, Austin S Hemmelgarn wrote: On 2014-05-19 22:07, Russell Coker wrote: [snip] As an aside, I'd really like to be able to set RAID levels by subtree. I'd like to use RAID-1 with ditto blocks for my important data and RAID-0 for unimportant data.

But the proposed changes for n-way replication would already handle this. [snip]

Russell's specific request above is probably best handled by being able to change replication levels per subvolume - this won't be handled by N-way replication. Extra replication on leaf nodes will make relatively little difference in the scenarios laid out in this thread - but on trunk nodes (folders or subvolumes closer to the filesystem root) it makes a significant difference. Plain N-way replication doesn't flexibly treat these two nodes differently.

As an example, Russell might have a server with two disks - yet he wants 6 copies of all metadata for subvolumes and their immediate subfolders. At three folders deep he only wants to have 4 copies. At six folders deep, only 2. Ditto blocks add an attractive safety net without unnecessarily doubling or tripling the size of *all* metadata.

It is a good idea. The next question to me is whether or not it is something that can be implemented elegantly, and whether or not a talented *dev* thinks it is a good idea.

--
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
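Purely as a hypothetical sketch of the depth-dependent policy Brendan describes (the function name and the depth thresholds are invented to mirror the 6/4/2-copy example above; nothing like this exists in the btrfs code):

# Hypothetical policy: more metadata copies near the filesystem root,
# fewer further down the tree.  Illustrative only, not a proposal.
def ditto_copies(depth, max_copies=6, min_copies=2):
    if depth < 3:
        return max_copies   # subvolumes and their immediate subfolders
    if depth < 6:
        return 4            # mid-depth directories
    return min_copies       # deep leaves

for d in range(8):
    print(f"depth {d}: {ditto_copies(d)} copies")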
Re: ditto blocks on ZFS
On Tue, 20 May 2014 07:56:41 ashf...@whisperpc.com wrote: 1. There will be more disk space used by the metadata. I've been aware of space allocation issues in BTRFS for more than three years. If the use of ditto blocks will make this issue worse, then it's probably not a good idea to implement it.

The actual increase in metadata space is probably small in most circumstances.

Data, RAID1: total=2.51TB, used=2.50TB
System, RAID1: total=32.00MB, used=376.00KB
Metadata, RAID1: total=28.25GB, used=26.63GB

The above is my home RAID-1 array. It includes multiple backup copies of a medium-size Maildir format mail spool, which probably accounts for a significant portion of the used space; the Maildir spool has an average file size of about 70K and lots of hard links between different versions of the backup. Even so, the metadata is only 1% of the total used space. Going from 1% to 2% to improve reliability really isn't a problem.

Data, RAID1: total=140.00GB, used=139.60GB
System, RAID1: total=32.00MB, used=28.00KB
Metadata, RAID1: total=4.00GB, used=2.97GB

Above is a small Xen server which uses snapshots to back up the files for Xen block devices (the system is lightly loaded so I don't use nocow) and for data files that include a small Maildir spool. It's still only 2% of disk space used for metadata; again, going from 2% to 4% isn't going to be a great problem.

2. Use of ditto blocks will increase write bandwidth to the disk. This is a direct and unavoidable result of having more copies of the metadata. The actual impact of this would depend on the file-system usage pattern, but would probably be unnoticeable in most circumstances. Does anyone have a worst-case scenario for testing?

The ZFS design involves ditto blocks being spaced apart, due to the fact that corruption tends to have some spatial locality. So you are adding an extra seek. The worst case would be when you have lots of small synchronous writes; probably the default configuration of Maildir delivery would be a good case.

As an aside, I've been thinking of patching a mail server to do a sleep() before fsync() on mail delivery to see if that improves aggregate performance. My theory is that if you have dozens of concurrent delivery attempts and they all sleep() before fsync(), then the filesystem could write out metadata for multiple files in one pass in the most efficient manner (see the sketch below).

3. Certain kinds of disk errors would be easier to recover from. Some people here claim that those specific errors are rare.

All errors are rare. :-# Seriously, you can run Ext4 on a single disk for years and probably not lose data. It's just a matter of how many disks and how much reliability you want.

I have no opinion on how often they happen, but I believe that if the overall disk space cost is low, it will have a reasonable return. There would be virtually no reliability gains on an SSD-based file-system, as the ditto blocks would be written at the same time, and the SSD would be likely to map the logical blocks into the same page of flash memory.

That claim is unproven AFAIK. On SSD the performance cost of such things is negligible (no seek cost) and losing 1% of disk space isn't a problem for most systems (admittedly the early SSDs were small).

4. If the BIO layer of BTRFS and the device driver are smart enough, ditto blocks could reduce I/O wait time. This is a direct result of having more instances of the data on the disk, so it's likely that there will be a ditto block closer to where the disk head currently is.
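Returning briefly to the sleep()-before-fsync() idea above, here is a minimal Python sketch of what such a delivery path might look like. The 50ms delay and the Maildir-style path are arbitrary assumptions; the point is only that many concurrent writers pausing briefly before fsync() give the filesystem a chance to commit their metadata in one batch:

# "Sleep before fsync" mail delivery sketch (illustrative only).
import os
import time

def deliver(spool_dir, name, message, delay=0.05):
    path = os.path.join(spool_dir, "tmp", name)   # assumed Maildir layout
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, message)
        time.sleep(delay)   # let other concurrent deliveries catch up
        os.fsync(fd)        # one (hopefully shared) metadata flush
    finally:
        os.close(fd)
    # A real Maildir delivery would now rename() tmp/name into new/.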
The actual benefit for disk-based file-systems is likely to be under 1ms per metadata seek. It's possible that a short-term backlog on one disk could cause BTRFS to use a ditto block on another disk, which could deliver a 20ms benefit. There would be no performance benefit for SSD-based file-systems.

That is likely with RAID-5 and RAID-10.

My experience is that once your disks are larger than about 500-750GB, RAID-6 becomes a much better choice, due to the increased chance of having an uncorrectable read error during a reconstruct. My opinion is that anyone storing critical information in RAID-5, or even 2-disk RAID-1, with disks of this capacity, should either reconsider their storage topology, or verify that they have a good backup/restore mechanism in place for that data.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research shows that the incidence of silent corruption is a lot greater than you would expect. RAID-6 doesn't save you from this. You need BTRFS or ZFS RAID-6.

On Tue, 20 May 2014 22:11:16 Brendan Hide wrote: Extra replication on leaf nodes will make relatively little difference in the scenarios laid out in this thread - but on trunk nodes (folders or subvolumes closer to the filesystem root) it makes a significant difference. Plain N-way replication doesn't flexibly treat these two nodes differently.
Re: ditto blocks on ZFS
On 18/05/14 17:09, Russell Coker wrote: On Sat, 17 May 2014 13:50:52 Martin wrote: [...] Do you see or measure any real advantage?

Imagine that you have a RAID-1 array where both disks get ~14,000 read errors. This could happen due to a design defect common to drives of a particular model or some shared environmental problem. Most errors would be corrected by RAID-1, but there would be a risk of some data being lost due to both copies being corrupt. Another possibility is that one disk could entirely die (although total disk death seems rare nowadays) and the other could have corruption. If metadata was duplicated in addition to being on both disks then the probability of data loss would be reduced. Another issue is the case where all drive slots are filled with active drives (a very common configuration). To replace a disk you have to physically remove the old disk before adding the new one. If the array is a RAID-1 or RAID-5 then ANY error during reconstruction loses data. Using dup for metadata on top of the RAID protections (IE the ZFS ditto idea) means that case doesn't lose you data.

Your example there is for the case where in effect there is no RAID. How is that case any better than what is already done for btrfs duplicating metadata?

So... What real-world failure modes do the ditto blocks usefully protect against? And how does that compare, for failure rates, against what is already done? For example, we have RAID1 and RAID5 to protect against any one RAID chunk being corrupted or the total loss of any one device. There is a second part to that, in that another failure cannot be tolerated until the RAID is remade. Hence, we have RAID6, which protects against any two failures for a chunk or device; with just one failure, you can tolerate a second failure whilst rebuilding the RAID. And then we supposedly have safety-by-design, where the filesystem itself is using a journal and barriers/sync to ensure that the filesystem is always kept in a consistent state, even after an interruption to any writes.

*What other failure modes* should we guard against? There has been mention of fixing metadata keys after single bit flips... Should Hamming codes be used instead of a CRC, so that we can have multiple-bit error detection and single-bit error correction for all data, both in RAM and on disk, for those systems that do not use ECC RAM? Would that be useful?...

Regards, Martin
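For readers unfamiliar with the distinction Martin raises, here is a minimal Hamming(7,4) sketch: 4 data bits plus 3 parity bits, enough to locate and correct any single flipped bit, where a CRC can only detect it. This is illustrative only and not what btrfs actually does:

# Minimal Hamming(7,4) encode/correct demonstration.
def encode(d):                      # d = [d1, d2, d3, d4] bits
    # codeword positions 1..7; parity at 1, 2, 4; data at 3, 5, 6, 7
    c = [0, 0, 0, d[0], 0, d[1], d[2], d[3]]   # index 0 unused
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    return c[1:]

def correct(code):                  # code = 7 bits, at most one flipped
    c = [0] + list(code)
    s = ((c[1] ^ c[3] ^ c[5] ^ c[7])
         | (c[2] ^ c[3] ^ c[6] ^ c[7]) << 1
         | (c[4] ^ c[5] ^ c[6] ^ c[7]) << 2)
    if s:                           # syndrome = position of the bad bit
        c[s] ^= 1
    return [c[3], c[5], c[6], c[7]]

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                        # flip one bit in storage
assert correct(word) == data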
Re: ditto blocks on ZFS
On 2014/05/19 10:36 PM, Martin wrote: On 18/05/14 17:09, Russell Coker wrote: On Sat, 17 May 2014 13:50:52 Martin wrote: [...] Do you see or measure any real advantage? [snip]

This is extremely difficult to measure objectively. Subjectively ... see below.

[snip] *What other failure modes* should we guard against?

I know I'd sleep a /little/ better at night knowing that a double disk failure on a raid5/1/10 configuration might ruin a ton of data along with an obscure set of metadata in some long tree paths - but not the entire filesystem.

The other use-case/failure mode - where you are somehow unlucky enough to have sets of bad sectors/bitrot on multiple disks that simultaneously affect the only copies of the tree roots - is an extremely unlikely scenario. As unlikely as it may be, the scenario is a very painful consequence in spite of VERY little corruption. That is where the peace-of-mind/bragging rights come in.

--
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
Re: ditto blocks on ZFS
On Mon, 19 May 2014 23:47:37 Brendan Hide wrote: This is extremely difficult to measure objectively. Subjectively ... see below. [snip] *What other failure modes* should we guard against? I know I'd sleep a /little/ better at night knowing that a double disk failure on a raid5/1/10 configuration might ruin a ton of data along with an obscure set of metadata in some long tree paths - but not the entire filesystem.

My experience is that most disk failures that don't involve extreme physical damage (EG dropping a drive on concrete) don't involve totally losing the disk. Much of the discussion about RAID failures concerns entirely failed disks, but I believe that is due to RAID implementations such as Linux software RAID that will entirely remove a disk when it gives errors. I have a disk which had ~14,000 errors, of which ~2000 errors were corrected by duplicate metadata. If two disks with that problem were in a RAID-1 array then duplicate metadata would be a significant benefit.

The other use-case/failure mode - where you are somehow unlucky enough to have sets of bad sectors/bitrot on multiple disks that simultaneously affect the only copies of the tree roots - is an extremely unlikely scenario. As unlikely as it may be, the scenario is a very painful consequence in spite of VERY little corruption. That is where the peace-of-mind/bragging rights come in.

http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html

The NetApp research on latent errors on drives is worth reading. On page 12 they report latent sector errors on 9.5% of SATA disks per year. So if you lose one disk entirely the risk of having errors on a second disk is higher than you would want for RAID-5. While losing the root of the tree is unlikely, losing a directory in the middle that has lots of subdirectories is a risk.

I can understand why people wouldn't want ditto blocks to be mandatory. But why are people arguing against them as an option? As an aside, I'd really like to be able to set RAID levels by subtree. I'd like to use RAID-1 with ditto blocks for my important data and RAID-0 for unimportant data.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Re: ditto blocks on ZFS
On Sat, 17 May 2014 13:50:52 Martin wrote: On 16/05/14 04:07, Russell Coker wrote: https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape Probably most of you already know about this, but for those of you who haven't, the above describes ZFS ditto blocks, which is a good feature we need on BTRFS. The briefest summary is that on top of the RAID redundancy there... [... are additional copies of metadata ...]

Is that idea not already implemented, in effect, in btrfs with the way that the superblocks are replicated multiple times, ever more times for ever more huge storage devices?

No. If the metadata for the root directory is corrupted then everything is lost even if the superblock is OK. At every level in the directory tree a corruption will lose all levels below it; a corruption of /home would be very significant, as would a corruption of /home/importantuser/major-project.

The one exception is for SSDs, whereby there is the excuse that you cannot know whether your data is usefully replicated across different erase blocks on a single device, and SSDs are not 'that big' anyhow.

I am not convinced by that argument. While you can't know that it's usefully replicated, you also can't say for sure that replication will never save you. There will surely be some random factors involved. If dup on SSD will save you from 50% of corruption problems, is it worth doing? What if it's 80% or 20%? I have BTRFS running as the root filesystem on Intel SSDs on four machines (one of which is a file server with a pair of large disks in a BTRFS RAID-1). On all of those systems I have dup for metadata; it doesn't take up space I need for something else and it might save me.

So... Your idea of replicating metadata multiple times in proportion to assumed 'importance' or 'extent of impact if lost' is an interesting approach. However, is that appropriate and useful considering the real-world failure mechanisms that are to be guarded against?

Firstly, it's not my idea, it's the idea of the ZFS developers. Secondly, I started reading about this after doing some experiments with a failing SATA disk. In spite of having ~14,000 read errors (which sounds like a lot but is a small fraction of a 2TB disk) the vast majority of the data was readable, largely due to ~2000 errors corrected by dup metadata.

Do you see or measure any real advantage?

Imagine that you have a RAID-1 array where both disks get ~14,000 read errors. This could happen due to a design defect common to drives of a particular model or some shared environmental problem. Most errors would be corrected by RAID-1, but there would be a risk of some data being lost due to both copies being corrupt. Another possibility is that one disk could entirely die (although total disk death seems rare nowadays) and the other could have corruption. If metadata was duplicated in addition to being on both disks then the probability of data loss would be reduced.

Another issue is the case where all drive slots are filled with active drives (a very common configuration). To replace a disk you have to physically remove the old disk before adding the new one. If the array is a RAID-1 or RAID-5 then ANY error during reconstruction loses data. Using dup for metadata on top of the RAID protections (IE the ZFS ditto idea) means that case doesn't lose you data.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Re: ditto blocks on ZFS
On 16/05/14 04:07, Russell Coker wrote: https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape Probably most of you already know about this, but for those of you who haven't, the above describes ZFS ditto blocks, which is a good feature we need on BTRFS. The briefest summary is that on top of the RAID redundancy there... [... are additional copies of metadata ...]

Is that idea not already implemented, in effect, in btrfs with the way that the superblocks are replicated multiple times, ever more times for ever more huge storage devices?

The one exception is for SSDs, whereby there is the excuse that you cannot know whether your data is usefully replicated across different erase blocks on a single device, and SSDs are not 'that big' anyhow.

So... Your idea of replicating metadata multiple times in proportion to assumed 'importance' or 'extent of impact if lost' is an interesting approach. However, is that appropriate and useful considering the real-world failure mechanisms that are to be guarded against? Do you see or measure any real advantage?

Regards, Martin
Re: ditto blocks on ZFS
On Sat, May 17, 2014 at 01:50:52PM +0100, Martin wrote: On 16/05/14 04:07, Russell Coker wrote: https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape Probably most of you already know about this, but for those of you who haven't the above describes ZFS ditto blocks which is a good feature we need on BTRFS. The briefest summary is that on top of the RAID redundancy there... [... are additional copies of metadata ...] Is that idea not already implemented in effect in btrfs with the way that the superblocks are replicated multiple times, ever more times, for ever more huge storage devices?

Superblocks are the smallest part of the metadata. There's a whole load of metadata that's not in the superblocks and isn't replicated in this way.

The one exception is for SSDs whereby there is the excuse that you cannot know whether your data is usefully replicated across different erase blocks on a single device, and SSDs are not 'that big' anyhow. So... Your idea of replicating metadata multiple times in proportion to assumed 'importance' or 'extent of impact if lost' is an interesting approach. However, is that appropriate and useful considering the real world failure mechanisms that are to be guarded against? Do you see or measure any real advantage?

This. How many copies do you actually need? Are there concrete statistics to show the marginal utility of each additional copy?

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- IMPROVE YOUR ORGANISMS!! -- Subject line of spam email ---
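One way to make Hugo's marginal-utility question concrete is a minimal sketch that assumes each copy is lost independently with the same probability. Both the 1% per-copy loss figure and the independence assumption are purely illustrative; real corruption is correlated, which is exactly why ZFS spreads ditto blocks apart:

# Diminishing returns from extra copies under an independence assumption.
p = 0.01    # assumed per-copy loss probability, illustration only

for n in range(1, 5):
    # A block is only unrecoverable if every one of its n copies is lost.
    print(f"n={n}: P(all copies lost) = {p**n:.2e}")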