Re: List of known BTRFS Raid 5/6 Bugs?
Stefan K posted on Tue, 11 Sep 2018 13:29:38 +0200 as excerpted:

> wow, holy shit, thanks for this extended answer!
>
>> The first thing to point out here again is that it's not
>> btrfs-specific.
>
> So does that mean that every RAID implementation (with parity) has
> such a bug? Looking around a bit, it appears that ZFS doesn't have a
> write hole.

Every parity-raid implementation that doesn't contain specific
write-hole workarounds, yes, but some already have workarounds built
in, as btrfs will after the planned code is
written/tested/merged/tested-more-broadly.

https://www.google.com/search?q=parity-raid+write-hole [1]

As an example, back some years ago when I was doing raid6 on mdraid, it
had the write-hole problem and I remember reading about it at the time.
However, right on the first page of hits for the above search...

LWN: A journal for MD/RAID5: https://lwn.net/Articles/665299/

It seems md/raid5's write hole was closed in kernel 4.4 with an
optional journal device... preferably a fast ssd or nvram, to avoid
performance issues, and mirrored, to avoid the journal itself being a
single point of failure.

For me zfs is strictly an arm's-length thing: if Oracle wanted to, they
could easily resolve the licensing problem, as they own the code, but
they haven't, which at this point can only be deliberate, and as a
result I simply don't touch it. That isn't to say I don't recommend it
for those comfortable with, or simply willing to overlook, the
licensing issues, because zfs remains the most mature Linux option
offering many of the same features btrfs has at a lower maturity level.
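As a concrete illustration of the journal approach described in that
LWN article, here is a hedged sketch of creating an md raid5 array with
a write-ahead journal. The device names are placeholders, and it
assumes kernel >= 4.4 and mdadm >= 3.4 (the versions that introduced
the journal feature):

```shell
# Sketch only: create a 4-device md raid5 array whose write hole is
# closed by journaling full stripes to a fast SSD partition first.
# All device paths below are hypothetical examples.
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
      --write-journal=/dev/nvme0n1p1 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# Confirm the journal device is attached:
mdadm --detail /dev/md0 | grep -i journal
```

Note the caveat from the discussion above: the journal device itself
becomes a single point of failure unless it is mirrored.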
But while I keep zfs at personal arm's length, from what I've picked
up, I /believe/ zfs gets around the write hole by doing strict
copy-on-write combined with variable-width stripes -- unlike current
btrfs, a stripe isn't always written as widely as possible, so for
instance in a 20-device raid5-alike it can do a 3-device or possibly
even 2-device "stripe", which, being entirely copy-on-write, avoids the
read-modify-write cycle on modified existing data that, unless
mitigated, creates the parity-raid write hole.

Variable-width stripes are actually one of the possible longer-term
solutions already discussed for btrfs as well, but the
logging/journaling solution seems to be what they've decided to
implement first, and there are other tradeoffs to it (as discussed
elsewhere). Of course, because as I've already explained I'm interested
in the 3/4-way-mirroring option that would be used for the journal but
would also be available to expand the current 2-way raid1 mode to
additional mirrors, this is absolutely fine with me! =:^)

> And it _only_ happens when the server has an ungraceful shutdown,
> caused by a power outage? So that means if I run btrfs raid5/6 and
> have no power outages, I have no problems?

Sort of, yes? Keep in mind that a power outage isn't the /only/ way to
have an ungraceful shutdown, just one of the most common.
Should the kernel crash or lock up for some reason -- common examples
include video and occasionally network driver bugs, due to the direct
access to hardware and memory those drivers get -- that can trigger an
"ungraceful shutdown" as well, although with care (basically always
trying to ssh in for a remote shutdown if possible, and/or using
alt-sysrq-reisub sequences on apparent lockups) it's often possible to
prevent those being /entirely/ ungraceful at the hardware level. So
it's not /quite/ as bad as an abrupt power outage, or perhaps even
worse, a brownout that doesn't kill writes entirely but can at least
theoretically trigger garbage scribbling in random device blocks.

So yes, sort of, but it's not just power outages.

>> it's possible to specify data as raid5/6 and metadata as raid1
>
> does someone have this in production?

I'm sure people do. (As I said, I'm a raid1 guy here, and would even
use 3-way mirroring for some things were it possible, so no parity-raid
at all for me personally.)

On btrfs it is in fact the multi-device default, and thus quite common,
to have data and metadata as different profiles. The multi-device
default, if not specified, is raid1 metadata with single-profile data.
So if you just specify raid5/6 data and don't specify metadata at all,
you'll get exactly what was mentioned: raid5/6 data as specified, raid1
metadata as the unspecified multi-device default.

So were I to guess, I'd guess that a lot of people who weren't paying
attention when setting up, but say they have raid5/6, actually only
have it for data, having not specified anything for metadata, so they
got raid1 for it.

> ZFS btw keeps 2 copies of metadata by default, maybe it would also be
> an option for btrfs?

It actually sounds like they do hybrid raid then, not just pure
parity-raid, but mirroring the metadata as well. That would be in
accord with a couple things I'd read about zfs but hadn't quite pursued
to the logical conclusion, and woul
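The data/metadata profile split discussed above can also be requested
explicitly at mkfs time. A hedged sketch, with hypothetical device
names and a hypothetical mount point:

```shell
# Sketch only: five-device filesystem with raid5 data but raid1
# metadata, limiting the write-hole exposure to data stripes.
# Device names are placeholders; adjust for your hardware.
mkfs.btrfs -d raid5 -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde

# After mounting, verify the profiles actually in use:
mount /dev/sda /mnt/array
btrfs filesystem df /mnt/array
# Expect lines like "Data, RAID5: ..." and "Metadata, RAID1: ..."
```

Omitting `-m` entirely would give the raid1-metadata multi-device
default described above, so the explicit flag mainly serves as
documentation of intent.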
Re: List of known BTRFS Raid 5/6 Bugs?
wow, holy shit, thanks for this extended answer!

> The first thing to point out here again is that it's not
> btrfs-specific.

So does that mean that every RAID implementation (with parity) has such
a bug? Looking around a bit, it appears that ZFS doesn't have a write
hole.

And it _only_ happens when the server has an ungraceful shutdown,
caused by a power outage? So that means if I run btrfs raid5/6 and have
no power outages, I have no problems?

> it's possible to specify data as raid5/6 and metadata as raid1

Does someone have this in production? ZFS btw keeps 2 copies of
metadata by default, maybe it would also be an option for btrfs? In
this case, do you think 'btrfs fi balance start -mconvert=raid1
-dconvert=raid5 /path' is safe at the moment?

> That means small files and modifications to existing files, the ends
> of large files, and much of the metadata, will be written twice,
> first to the log, then to the final location.

That sounds like the performance will go down? So far as I can see,
btrfs can't beat ext4 or zfs as it is, and then they will make it even
slower?

thanks in advance!
best regards
Stefan

On Saturday, September 8, 2018 8:40:50 AM CEST Duncan wrote:
> Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:
>
> > sorry for disturbing this discussion,
> >
> > are there any plans/dates to fix the raid5/6 issue? Is somebody
> > working on this issue? Because this is for me one of the most
> > important things for a fileserver, and with a raid1 config I lose
> > too much disk space.
>
> There's a more technically complete discussion of this in at least
> two earlier threads you can find in the list archive, if you're
> interested, but here are the basics (well, extended basics...) from a
> btrfs-using-sysadmin perspective.
>
> "The raid5/6 issue" can refer to at least three conceptually separate
> issues, with different states of solution maturity:
>
> 1) Now generally historic bugs in btrfs scrub, etc, that are fixed
> (thus the historic) in current kernels and tools.
> Unfortunately these will still affect, for some time, many users of
> longer-term stale^H^Hble distros who don't update from other sources,
> because the raid56 feature wasn't yet stable at the lock-in time of
> whatever versions they stabilized on, so they're not likely to get
> the fixes, as those are considered new-feature material.
>
> If you're using a current kernel and tools, however, this issue is
> fixed. You can look on the wiki for the specific versions, but with
> the 4.18 kernel the current latest stable, it or 4.17 -- and similar
> tools versions, since the version numbers are synced -- are the two
> latest release series, with the two latest release series being best
> supported and considered "current" on this list.
>
> Also see...
>
> 2) General feature maturity: While raid56 mode should be /reasonably/
> stable now, it remains one of the newer features and simply hasn't
> yet had the testing of time that tends to flush out the smaller and
> corner-case bugs, that more mature features such as raid1 have now
> had the benefit of.
>
> There's nothing to do for this but test, report any bugs you find,
> and wait for the maturity that time brings.
>
> Of course this is one of several reasons we so strongly emphasize and
> recommend "current" on this list, because even for reasonably stable
> and mature features such as raid1, btrfs itself remains new enough
> that they still occasionally get latent bugs found and fixed, and
> while /some/ of those fixes get backported to LTS kernels (with even
> less chance of distros backporting tools fixes), not all of them do,
> and even when they do, current still gets the fixes first.
>
> 3) The remaining issue is the infamous parity-raid write-hole that
> affects all parity-raid implementations (not just btrfs) unless they
> take specific steps to work around the issue.
>
> The first thing to point out here again is that it's not
> btrfs-specific.
> Between that and the fact that it *ONLY* affects parity-raid
> operating in degraded mode *WITH* an ungraceful-shutdown recovery
> situation, it could be argued not to be a btrfs issue at all, but
> rather one inherent to parity-raid mode, and considered an acceptable
> risk by those choosing parity-raid because it's only a factor when
> operating degraded, if an ungraceful shutdown does occur.
>
> But btrfs' COW nature, along with a couple technical implementation
> factors (the read-modify-write cycle for incomplete stripe widths and
> how that risks existing metadata when new metadata is written), does
> amplify the risk somewhat compared to that seen with the same
> write-hole issue in various other parity-raid implementations that
> haven't taken write-hole avoidance countermeasures.
>
> So what can be done right now?
>
> As it happens there is a mitigation the admin can currently take --
> btrfs allows specifying data and metadata modes separately, and even
> where
Re: List of known BTRFS Raid 5/6 Bugs?
Stefan K posted on Fri, 07 Sep 2018 15:58:36 +0200 as excerpted:

> sorry for disturbing this discussion,
>
> are there any plans/dates to fix the raid5/6 issue? Is somebody
> working on this issue? Because this is for me one of the most
> important things for a fileserver, and with a raid1 config I lose too
> much disk space.

There's a more technically complete discussion of this in at least two
earlier threads you can find in the list archive, if you're interested,
but here are the basics (well, extended basics...) from a
btrfs-using-sysadmin perspective.

"The raid5/6 issue" can refer to at least three conceptually separate
issues, with different states of solution maturity:

1) Now generally historic bugs in btrfs scrub, etc, that are fixed
(thus the historic) in current kernels and tools. Unfortunately these
will still affect, for some time, many users of longer-term
stale^H^Hble distros who don't update from other sources, because the
raid56 feature wasn't yet stable at the lock-in time of whatever
versions they stabilized on, so they're not likely to get the fixes, as
those are considered new-feature material.

If you're using a current kernel and tools, however, this issue is
fixed. You can look on the wiki for the specific versions, but with the
4.18 kernel the current latest stable, it or 4.17 -- and similar tools
versions, since the version numbers are synced -- are the two latest
release series, with the two latest release series being best supported
and considered "current" on this list.

Also see...

2) General feature maturity: While raid56 mode should be /reasonably/
stable now, it remains one of the newer features and simply hasn't yet
had the testing of time that tends to flush out the smaller and
corner-case bugs, that more mature features such as raid1 have now had
the benefit of.

There's nothing to do for this but test, report any bugs you find, and
wait for the maturity that time brings.
Of course this is one of several reasons we so strongly emphasize and
recommend "current" on this list, because even for reasonably stable
and mature features such as raid1, btrfs itself remains new enough that
they still occasionally get latent bugs found and fixed, and while
/some/ of those fixes get backported to LTS kernels (with even less
chance of distros backporting tools fixes), not all of them do, and
even when they do, current still gets the fixes first.

3) The remaining issue is the infamous parity-raid write-hole that
affects all parity-raid implementations (not just btrfs) unless they
take specific steps to work around the issue.

The first thing to point out here again is that it's not
btrfs-specific. Between that and the fact that it *ONLY* affects
parity-raid operating in degraded mode *WITH* an ungraceful-shutdown
recovery situation, it could be argued not to be a btrfs issue at all,
but rather one inherent to parity-raid mode, and considered an
acceptable risk by those choosing parity-raid because it's only a
factor when operating degraded, if an ungraceful shutdown does occur.

But btrfs' COW nature, along with a couple technical implementation
factors (the read-modify-write cycle for incomplete stripe widths and
how that risks existing metadata when new metadata is written), does
amplify the risk somewhat compared to that seen with the same
write-hole issue in various other parity-raid implementations that
haven't taken write-hole avoidance countermeasures.

So what can be done right now?

As it happens there is a mitigation the admin can currently take --
btrfs allows specifying data and metadata modes separately, and even
where raid1 loses too much space to be used for both, it's possible to
specify data as raid5/6 and metadata as raid1.
While btrfs raid1 only covers loss of a single device, it doesn't have
the parity-raid write-hole, as it's not parity-raid. For most use-cases
at least, specifying raid1 for metadata only, with raid5 for data,
should strictly limit both the write-hole risk (it's limited to data,
which in most cases will be full-stripe writes and thus not subject to
the problem) and the size-doubling of raid1 (it's limited to metadata).

Meanwhile, arguably, for a sysadmin properly following the sysadmin's
first rule of backups -- that the true value of data isn't defined by
arbitrary claims, but by the number of backups it is considered worth
the time/trouble/resources to have of that data -- it's a known
parity-raid risk specifically limited to the corner-case of having an
ungraceful shutdown *WHILE* already operating degraded, and as such, it
can be managed along with all the other known risks to the data,
including admin fat-fingering, the risk that more devices will go out
than the array can tolerate, the risk of general bugs affecting the
filesystem or other storage-related code, etc. IOW, in the context of
the admin's first rule of backups, no matter the issue, rai
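The raid1-metadata mitigation discussed in this thread can be applied
to an already-created filesystem with a balance convert, as Stefan's
question alludes to. A hedged sketch, with a hypothetical mount point:

```shell
# Sketch only: convert metadata of an existing multi-device filesystem
# to raid1 while keeping (or converting) data at raid5. The mount
# point is a placeholder. A full balance rewrites all block groups,
# so expect this to take a long time on a large array.
btrfs balance start -mconvert=raid1 -dconvert=raid5 /mnt/array

# Confirm the profiles after the balance completes:
btrfs filesystem df /mnt/array
```

Converting only metadata (dropping `-dconvert`) touches far less data
and finishes much faster, which may be preferable if the data profile
is already what you want.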
Re: List of known BTRFS Raid 5/6 Bugs?
sorry for disturbing this discussion,

are there any plans/dates to fix the raid5/6 issue? Is somebody working
on this issue? Because this is for me one of the most important things
for a fileserver, and with a raid1 config I lose too much disk space.

best regards
Stefan
Re: List of known BTRFS Raid 5/6 Bugs?
Ok, but I cannot guarantee that I won't need to cancel scrub during the
process. As said, this is domestic storage, and when scrub is running
the performance hit is big enough to prevent smooth streaming of HD and
4k movies.

On Thu, Aug 16, 2018 at 21:38, wrote:
>
> Could you show scrub status -d, then start a new scrub (all drives)
> and show scrub status -d again? This may help us diagnose the
> problem.
>
> On 15-Aug-2018 09:27:40 +0200, men...@gmail.com wrote:
> > I needed to resume scrub two times after an unclean shutdown (I was
> > cooking and using too much electricity) and two times after a
> > manual cancel, because I wanted to watch a 4k movie and the array
> > performance was not enough with scrub active.
> > Each time I resumed it, I also checked the status, and the total
> > amount of data scrubbed kept counting (it never started from zero).
> > On Wed, Aug 15, 2018 at 05:33, Zygo Blaxell wrote:
> > >
> > > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > > Hi
> > > > Well, I think it is worth giving more details on the array.
> > > > The array is built with 5x8TB HDDs in an external
> > > > USB3.0-to-SATAIII enclosure.
> > > > The enclosure is cheap JMicron-based Chinese stuff (from
> > > > Orico). There is one USB3.0 link for all 5 HDDs, with a SATAIII
> > > > 3.0Gb multiplexer behind it. So you cannot expect peak
> > > > performance, which is not the goal of this array (domestic data
> > > > storage).
> > > > Also the USB-to-SATA firmware is buggy, so UAS operation is not
> > > > stable; it runs in BOT mode.
> > > > Having said so, the scrub has been started (and resumed) on the
> > > > array mount point:
> > > >
> > > > sudo btrfs scrub start(resume) /media/storage/das1
> > >
> > > So is 2.59TB the amount scrubbed _since resume_? If you run a
> > > complete scrub end to end without cancelling or rebooting in
> > > between, what is the size on all disks (btrfs scrub status -d)?
> > > > > > > even if reading the documentation I understand that it is the same > > > > invoking it on mountpoint or one of the HDD in the array. > > > > In the end, especially for a RAID5 array, does it really make sense to > > > > scrub only one disk in the array??? > > > > > > You would set up a shell for-loop and scrub each disk of the array > > > in turn. Each scrub would correct errors on a single device. > > > > > > There was a bug in btrfs scrub where scrubbing the filesystem would > > > create one thread for each disk, and the threads would issue commands > > > to all disks and compete with each other for IO, resulting in terrible > > > performance on most non-SSD hardware. By scrubbing disks one at a time, > > > there are no competing threads, so the scrub runs many times faster. > > > With this bug the total time to scrub all disks individually is usually > > > less than the time to scrub the entire filesystem at once, especially > > > on HDD (and even if it's not faster, one-at-a-time disk scrubs are > > > much kinder to any other process trying to use the filesystem at the > > > same time). > > > > > > It appears this bug is not fixed, based on some timing results I am > > > getting from a test array. iostat shows 10x more reads than writes on > > > all disks even when all blocks on one disk are corrupted and the scrub > > > is given only a single disk to process (that should result in roughly > > > equal reads on all disks slightly above the number of writes on the > > > corrupted disk). > > > > > > This is where my earlier caveat about performance comes from. Many parts > > > of btrfs raid5 are somewhere between slower and *much* slower than > > > comparable software raid5 implementations. 
Some of that is by design: > > > btrfs must be at least 1% slower than mdadm because btrfs needs to read > > > metadata to verify data block csums in scrub, and the difference would > > > be much larger in practice due to HDD seek times, but 500%-900% overhead > > > still seems high especially when compared to btrfs raid1 that has the > > > same metadata csum reading issue without the huge performance gap. > > > > > > It seems like btrfs raid5 could still use a thorough profiling to figure > > > out where it's spending all its IO. > > > > > > > Regarding the data usage, here you have the current figures: > > > > > > > > menion@Menionubuntu:~$ sudo btrfs fi show > > > > [sudo] password for menion: > > > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f > > > > Total devices 1 FS bytes used 11.44GiB > > > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3 > > > > > > > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > > > > Total devices 5 FS bytes used 6.57TiB > > > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda > > > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb > > > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc > > > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd > > > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde > > > > > >
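The one-disk-at-a-time scrub that Zygo describes above can be sketched
as a simple shell loop. Device names and the use of `-B` here are
illustrative assumptions, not a prescription for this particular array:

```shell
# Sketch only: scrub each member device in turn instead of scrubbing
# the whole filesystem at once, avoiding the competing per-disk scrub
# threads described above. Device names are placeholders.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    # -B runs the scrub in the foreground so the loop waits for each
    # device to finish before starting the next one.
    btrfs scrub start -B "$dev"
done
```

Each pass corrects errors only on that one device, so the loop must
cover every member of the array for a complete scrub.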
Re: List of known BTRFS Raid 5/6 Bugs?
Could you show scrub status -d, then start a new scrub (all drives) and
show scrub status -d again? This may help us diagnose the problem.

On 15-Aug-2018 09:27:40 +0200, men...@gmail.com wrote:
> I needed to resume scrub two times after an unclean shutdown (I was
> cooking and using too much electricity) and two times after a manual
> cancel, because I wanted to watch a 4k movie and the array
> performance was not enough with scrub active.
> Each time I resumed it, I also checked the status, and the total
> amount of data scrubbed kept counting (it never started from zero).
> On Wed, Aug 15, 2018 at 05:33, Zygo Blaxell wrote:
> >
> > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > Hi
> > > Well, I think it is worth giving more details on the array.
> > > The array is built with 5x8TB HDDs in an external
> > > USB3.0-to-SATAIII enclosure.
> > > The enclosure is cheap JMicron-based Chinese stuff (from Orico).
> > > There is one USB3.0 link for all 5 HDDs, with a SATAIII 3.0Gb
> > > multiplexer behind it. So you cannot expect peak performance,
> > > which is not the goal of this array (domestic data storage).
> > > Also the USB-to-SATA firmware is buggy, so UAS operation is not
> > > stable; it runs in BOT mode.
> > > Having said so, the scrub has been started (and resumed) on the
> > > array mount point:
> > >
> > > sudo btrfs scrub start(resume) /media/storage/das1
> >
> > So is 2.59TB the amount scrubbed _since resume_? If you run a
> > complete scrub end to end without cancelling or rebooting in
> > between, what is the size on all disks (btrfs scrub status -d)?
> >
> > > even if reading the documentation I understand that it is the
> > > same invoking it on the mountpoint or one of the HDDs in the
> > > array.
> > > In the end, especially for a RAID5 array, does it really make
> > > sense to scrub only one disk in the array???
> >
> > You would set up a shell for-loop and scrub each disk of the array
> > in turn.
Each scrub would correct errors on a single device. > > > > There was a bug in btrfs scrub where scrubbing the filesystem would > > create one thread for each disk, and the threads would issue commands > > to all disks and compete with each other for IO, resulting in terrible > > performance on most non-SSD hardware. By scrubbing disks one at a time, > > there are no competing threads, so the scrub runs many times faster. > > With this bug the total time to scrub all disks individually is usually > > less than the time to scrub the entire filesystem at once, especially > > on HDD (and even if it's not faster, one-at-a-time disk scrubs are > > much kinder to any other process trying to use the filesystem at the > > same time). > > > > It appears this bug is not fixed, based on some timing results I am > > getting from a test array. iostat shows 10x more reads than writes on > > all disks even when all blocks on one disk are corrupted and the scrub > > is given only a single disk to process (that should result in roughly > > equal reads on all disks slightly above the number of writes on the > > corrupted disk). > > > > This is where my earlier caveat about performance comes from. Many parts > > of btrfs raid5 are somewhere between slower and *much* slower than > > comparable software raid5 implementations. Some of that is by design: > > btrfs must be at least 1% slower than mdadm because btrfs needs to read > > metadata to verify data block csums in scrub, and the difference would > > be much larger in practice due to HDD seek times, but 500%-900% overhead > > still seems high especially when compared to btrfs raid1 that has the > > same metadata csum reading issue without the huge performance gap. > > > > It seems like btrfs raid5 could still use a thorough profiling to figure > > out where it's spending all its IO. 
> > > > > Regarding the data usage, here you have the current figures: > > > > > > menion@Menionubuntu:~$ sudo btrfs fi show > > > [sudo] password for menion: > > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f > > > Total devices 1 FS bytes used 11.44GiB > > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3 > > > > > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > > > Total devices 5 FS bytes used 6.57TiB > > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda > > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb > > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc > > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd > > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde > > > > > > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1 > > > Data, RAID5: total=6.57TiB, used=6.56TiB > > > System, RAID5: total=12.75MiB, used=416.00KiB > > > Metadata, RAID5: total=9.00GiB, used=8.16GiB > > > GlobalReserve, single: total=512.00MiB, used=0.00B > > > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1 > > > WARNING: RAID56 detected, not implemented > > > WARNING: RAID56 detected, not implemented > > > WARNING: RAI
Re: List of known BTRFS Raid 5/6 Bugs?
I needed to resume scrub two times after an unclean shutdown (I was
cooking and using too much electricity) and two times after a manual
cancel, because I wanted to watch a 4k movie and the array performance
was not enough with scrub active.
Each time I resumed it, I also checked the status, and the total amount
of data scrubbed kept counting (it never started from zero).

On Wed, Aug 15, 2018 at 05:33, Zygo Blaxell wrote:
>
> On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > Hi
> > Well, I think it is worth giving more details on the array.
> > The array is built with 5x8TB HDDs in an external USB3.0-to-SATAIII
> > enclosure.
> > The enclosure is cheap JMicron-based Chinese stuff (from Orico).
> > There is one USB3.0 link for all 5 HDDs, with a SATAIII 3.0Gb
> > multiplexer behind it. So you cannot expect peak performance, which
> > is not the goal of this array (domestic data storage).
> > Also the USB-to-SATA firmware is buggy, so UAS operation is not
> > stable; it runs in BOT mode.
> > Having said so, the scrub has been started (and resumed) on the
> > array mount point:
> >
> > sudo btrfs scrub start(resume) /media/storage/das1
>
> So is 2.59TB the amount scrubbed _since resume_? If you run a
> complete scrub end to end without cancelling or rebooting in between,
> what is the size on all disks (btrfs scrub status -d)?
>
> > even if reading the documentation I understand that it is the same
> > invoking it on the mountpoint or one of the HDDs in the array.
> > In the end, especially for a RAID5 array, does it really make sense
> > to scrub only one disk in the array???
>
> You would set up a shell for-loop and scrub each disk of the array in
> turn. Each scrub would correct errors on a single device.
> > There was a bug in btrfs scrub where scrubbing the filesystem would > create one thread for each disk, and the threads would issue commands > to all disks and compete with each other for IO, resulting in terrible > performance on most non-SSD hardware. By scrubbing disks one at a time, > there are no competing threads, so the scrub runs many times faster. > With this bug the total time to scrub all disks individually is usually > less than the time to scrub the entire filesystem at once, especially > on HDD (and even if it's not faster, one-at-a-time disk scrubs are > much kinder to any other process trying to use the filesystem at the > same time). > > It appears this bug is not fixed, based on some timing results I am > getting from a test array. iostat shows 10x more reads than writes on > all disks even when all blocks on one disk are corrupted and the scrub > is given only a single disk to process (that should result in roughly > equal reads on all disks slightly above the number of writes on the > corrupted disk). > > This is where my earlier caveat about performance comes from. Many parts > of btrfs raid5 are somewhere between slower and *much* slower than > comparable software raid5 implementations. Some of that is by design: > btrfs must be at least 1% slower than mdadm because btrfs needs to read > metadata to verify data block csums in scrub, and the difference would > be much larger in practice due to HDD seek times, but 500%-900% overhead > still seems high especially when compared to btrfs raid1 that has the > same metadata csum reading issue without the huge performance gap. > > It seems like btrfs raid5 could still use a thorough profiling to figure > out where it's spending all its IO. 
> > > Regarding the data usage, here you have the current figures: > > > > menion@Menionubuntu:~$ sudo btrfs fi show > > [sudo] password for menion: > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f > > Total devices 1 FS bytes used 11.44GiB > > devid1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3 > > > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > > Total devices 5 FS bytes used 6.57TiB > > devid1 size 7.28TiB used 1.64TiB path /dev/sda > > devid2 size 7.28TiB used 1.64TiB path /dev/sdb > > devid3 size 7.28TiB used 1.64TiB path /dev/sdc > > devid4 size 7.28TiB used 1.64TiB path /dev/sdd > > devid5 size 7.28TiB used 1.64TiB path /dev/sde > > > > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1 > > Data, RAID5: total=6.57TiB, used=6.56TiB > > System, RAID5: total=12.75MiB, used=416.00KiB > > Metadata, RAID5: total=9.00GiB, used=8.16GiB > > GlobalReserve, single: total=512.00MiB, used=0.00B > > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1 > > WARNING: RAID56 detected, not implemented > > WARNING: RAID56 detected, not implemented > > WARNING: RAID56 detected, not implemented > > Overall: > > Device size: 36.39TiB > > Device allocated: 0.00B > > Device unallocated: 36.39TiB > > Device missing: 0.00B > > Used: 0.00B > > Free (estimated): 0.00B (min: 8.00EiB) > > Data ratio: 0.00 > > Metadata ratio: 0.00 > > Global reserve: 512.00Mi
Re: List of known BTRFS Raid 5/6 Bugs?
On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> Hi
> Well, I think it is worth giving more details on the array.
> The array is built with 5x8TB HDDs in an external USB3.0-to-SATAIII
> enclosure.
> The enclosure is cheap JMicron-based Chinese stuff (from Orico).
> There is one USB3.0 link for all 5 HDDs, with a SATAIII 3.0Gb
> multiplexer behind it. So you cannot expect peak performance, which
> is not the goal of this array (domestic data storage).
> Also the USB-to-SATA firmware is buggy, so UAS operation is not
> stable; it runs in BOT mode.
> Having said so, the scrub has been started (and resumed) on the array
> mount point:
>
> sudo btrfs scrub start(resume) /media/storage/das1

So is 2.59TB the amount scrubbed _since resume_? If you run a complete
scrub end to end without cancelling or rebooting in between, what is
the size on all disks (btrfs scrub status -d)?

> even if reading the documentation I understand that it is the same
> invoking it on the mountpoint or one of the HDDs in the array.
> In the end, especially for a RAID5 array, does it really make sense
> to scrub only one disk in the array???

You would set up a shell for-loop and scrub each disk of the array in
turn. Each scrub would correct errors on a single device.

There was a bug in btrfs scrub where scrubbing the filesystem would
create one thread for each disk, and the threads would issue commands
to all disks and compete with each other for IO, resulting in terrible
performance on most non-SSD hardware. By scrubbing disks one at a time,
there are no competing threads, so the scrub runs many times faster.
With this bug, the total time to scrub all disks individually is
usually less than the time to scrub the entire filesystem at once,
especially on HDD (and even if it's not faster, one-at-a-time disk
scrubs are much kinder to any other process trying to use the
filesystem at the same time).

It appears this bug is not fixed, based on some timing results I am
getting from a test array.
iostat shows 10x more reads than writes on all disks even when all blocks on one disk are corrupted and the scrub is given only a single disk to process (that should result in roughly equal reads on all disks slightly above the number of writes on the corrupted disk). This is where my earlier caveat about performance comes from. Many parts of btrfs raid5 are somewhere between slower and *much* slower than comparable software raid5 implementations. Some of that is by design: btrfs must be at least 1% slower than mdadm because btrfs needs to read metadata to verify data block csums in scrub, and the difference would be much larger in practice due to HDD seek times, but 500%-900% overhead still seems high especially when compared to btrfs raid1 that has the same metadata csum reading issue without the huge performance gap. It seems like btrfs raid5 could still use a thorough profiling to figure out where it's spending all its IO. > Regarding the data usage, here you have the current figures: > > menion@Menionubuntu:~$ sudo btrfs fi show > [sudo] password for menion: > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f > Total devices 1 FS bytes used 11.44GiB > devid1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3 > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > Total devices 5 FS bytes used 6.57TiB > devid1 size 7.28TiB used 1.64TiB path /dev/sda > devid2 size 7.28TiB used 1.64TiB path /dev/sdb > devid3 size 7.28TiB used 1.64TiB path /dev/sdc > devid4 size 7.28TiB used 1.64TiB path /dev/sdd > devid5 size 7.28TiB used 1.64TiB path /dev/sde > > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1 > Data, RAID5: total=6.57TiB, used=6.56TiB > System, RAID5: total=12.75MiB, used=416.00KiB > Metadata, RAID5: total=9.00GiB, used=8.16GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1 > WARNING: RAID56 detected, not implemented > WARNING: RAID56 detected, not implemented > WARNING: 
RAID56 detected, not implemented
> Overall:
>     Device size:           36.39TiB
>     Device allocated:      0.00B
>     Device unallocated:    36.39TiB
>     Device missing:        0.00B
>     Used:                  0.00B
>     Free (estimated):      0.00B (min: 8.00EiB)
>     Data ratio:            0.00
>     Metadata ratio:        0.00
>     Global reserve:        512.00MiB (used: 32.00KiB)
>
> Data,RAID5: Size:6.57TiB, Used:6.56TiB
>     /dev/sda    1.64TiB
>     /dev/sdb    1.64TiB
>     /dev/sdc    1.64TiB
>     /dev/sdd    1.64TiB
>     /dev/sde    1.64TiB
>
> Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
>     /dev/sda    2.25GiB
>     /dev/sdb    2.25GiB
>     /dev/sdc    2.25GiB
>     /dev/sdd    2.25GiB
>     /dev/sde    2.25GiB
>
> System,RAID5: Size:12.75MiB, Used:416.00KiB
>     /dev/sda    3.19MiB
>     /dev/sdb    3.19MiB
>     /dev/sdc    3.19MiB
>     /dev/sdd    3.19MiB
>     /dev/sde    3.19MiB
>
> Unallocated:
>     /dev/sda    5.63TiB
>     /dev/sdb    5.63TiB
>     /dev/sdc
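The one-disk-at-a-time scrub suggested in the reply above ("set up a shell for-loop and scrub each disk of the array in turn") can be sketched as follows. This is not from the thread: Python is used for consistency with the other sketches here, the device list is the one from this array, and `-B` keeps each scrub in the foreground so the next one only starts when the previous finishes.

```python
import subprocess

# Devices of the array described in the thread (an assumption for illustration).
DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]

def scrub_commands(devices):
    """Build one foreground scrub invocation per device; running them
    sequentially avoids the competing per-disk scrub threads described above."""
    return [["btrfs", "scrub", "start", "-B", dev] for dev in devices]

if __name__ == "__main__":
    for cmd in scrub_commands(DEVICES):
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment to run on a real array
```

Each per-device scrub still corrects errors only on that device, so all devices must be visited to cover the whole filesystem.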
Re: List of known BTRFS Raid 5/6 Bugs?
Hi
Well, I think it is worth giving more details on the array. The array is built with 5x8TB HDD in an external USB3.0 to SATAIII enclosure. The enclosure is cheap JMicron-based Chinese stuff (from Orico). There is one USB3.0 link for all the 5 HDD with a SATAIII 3.0Gb multiplexer behind it. So you cannot expect peak performance, which is not the goal of this array (domestic data storage). Also the USB to SATA firmware is buggy, so UAS operations are not stable; it runs in BOT mode. Having said so, the scrub has been started (and resumed) on the array mount point:

sudo btrfs scrub start(resume) /media/storage/das1

even if, reading the documentation, I understand that it is the same invoking it on the mountpoint or on one of the HDD in the array. In the end, especially for a RAID5 array, does it really make sense to scrub only one disk in the array???

Regarding the data usage, here you have the current figures:

menion@Menionubuntu:~$ sudo btrfs fi show
[sudo] password for menion:
Label: none  uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
    Total devices 1 FS bytes used 11.44GiB
    devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3

Label: none  uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
    Total devices 5 FS bytes used 6.57TiB
    devid 1 size 7.28TiB used 1.64TiB path /dev/sda
    devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
    devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
    devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
    devid 5 size 7.28TiB used 1.64TiB path /dev/sde

menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
Data, RAID5: total=6.57TiB, used=6.56TiB
System, RAID5: total=12.75MiB, used=416.00KiB
Metadata, RAID5: total=9.00GiB, used=8.16GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:           36.39TiB
    Device allocated:      0.00B
    Device unallocated:    36.39TiB
    Device missing:
0.00B
    Used:                  0.00B
    Free (estimated):      0.00B (min: 8.00EiB)
    Data ratio:            0.00
    Metadata ratio:        0.00
    Global reserve:        512.00MiB (used: 32.00KiB)

Data,RAID5: Size:6.57TiB, Used:6.56TiB
    /dev/sda    1.64TiB
    /dev/sdb    1.64TiB
    /dev/sdc    1.64TiB
    /dev/sdd    1.64TiB
    /dev/sde    1.64TiB

Metadata,RAID5: Size:9.00GiB, Used:8.16GiB
    /dev/sda    2.25GiB
    /dev/sdb    2.25GiB
    /dev/sdc    2.25GiB
    /dev/sdd    2.25GiB
    /dev/sde    2.25GiB

System,RAID5: Size:12.75MiB, Used:416.00KiB
    /dev/sda    3.19MiB
    /dev/sdb    3.19MiB
    /dev/sdc    3.19MiB
    /dev/sdd    3.19MiB
    /dev/sde    3.19MiB

Unallocated:
    /dev/sda    5.63TiB
    /dev/sdb    5.63TiB
    /dev/sdc    5.63TiB
    /dev/sdd    5.63TiB
    /dev/sde    5.63TiB

menion@Menionubuntu:~$
menion@Menionubuntu:~$ sf -h
The program 'sf' is currently not installed. You can install it by typing: sudo apt install ruby-sprite-factory
menion@Menionubuntu:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            934M     0  934M   0% /dev
tmpfs           193M   22M  171M  12% /run
/dev/mmcblk0p3   28G   12G   15G  44% /
tmpfs           962M     0  962M   0% /dev/shm
tmpfs           5,0M     0  5,0M   0% /run/lock
tmpfs           962M     0  962M   0% /sys/fs/cgroup
/dev/mmcblk0p1  188M  3,4M  184M   2% /boot/efi
/dev/mmcblk0p3   28G   12G   15G  44% /home
/dev/sda         37T  6,6T   29T  19% /media/storage/das1
tmpfs           193M     0  193M   0% /run/user/1000
menion@Menionubuntu:~$ btrfs --version
btrfs-progs v4.17

So I don't fully understand where the scrub data size comes from.

On Mon, 13 Aug 2018 at 23:56, wrote: > > Running time of 55:06:35 indicates that the counter is right, it is not > enough time to scrub the entire array using hdd. > > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub start > /dev/sdx1" only scrubs the selected partition, > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array. > > Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics and > post the output. > For live statistics, use "sudo watch -n 1". > > By the way: > 0 errors despite multiple unclean shutdowns? I assumed that the write hole > would corrupt parity the first time around, was i wrong?
> > Am 13-Aug-2018 09:20:36 +0200 schrieb men...@gmail.com: > > Hi > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :), > > there are contradicting opinions by the, well, "several" ways to check > > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of > > data. > > This array is running on kernel 4.17.3 and it definitely experienced > > power loss while data was being written. > > I can say that it wen through at least a dozen of unclear shutdown > > So following this thread I started my first scrub on the array. and > > this is the outcome (after having resumed it 4 times, two after a >
Re: List of known BTRFS Raid 5/6 Bugs?
On Mon, Aug 13, 2018 at 11:56:05PM +0200, erentheti...@mail.de wrote: > Running time of 55:06:35 indicates that the counter is right, it is > not enough time to scrub the entire array using hdd. > > 2TiB might be right if you only scrubbed one disc, "sudo btrfs scrub > start /dev/sdx1" only scrubs the selected partition, > whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array. > > Use "sudo btrfs scrub status -d " to view per disc scrubbing statistics > and post the output. > For live statistics, use "sudo watch -n 1". > > By the way: > 0 errors despite multiple unclean shutdowns? I assumed that the write > hole would corrupt parity the first time around, was i wrong? You won't see the write hole from just a power failure. You need a power failure *and* a disk failure, and writes need to be happening at the moment power fails. Write hole breaks parity. Scrub silently(!) fixes parity. Scrub reads the parity block and compares it to the computed parity, and if it's wrong, scrub writes the computed parity back. Normal RAID5 reads with all disks online read only the data blocks, so they won't read the parity block and won't detect wrong parity. I did a couple of order-of-magnitude estimations of how likely a power failure is to trash a btrfs RAID system and got a probability between 3% and 30% per power failure if there were writes active at the time, and a disk failed to join the array after boot. That was based on 5 disks having 31 writes queued with one of the disks being significantly slower than the others (as failing disks often are) with continuous write load. If you have a power failure on an array that isn't writing anything at the time, nothing happens. 
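The mechanics described above -- a partial-stripe write interrupted by power failure leaves parity stale, which stays invisible until a reconstruction actually uses that parity -- can be sketched with XOR parity in a toy model. This is plain Python, not btrfs code; checksums and CoW block placement are deliberately left out.

```python
from functools import reduce

def parity(blocks):
    """XOR parity over equal-sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(blocks, lost, p):
    """Rebuild the block at index `lost` from the surviving blocks and parity."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return parity(survivors + [p])

# A 4-data-block stripe with consistent parity.
stripe = [bytes([i] * 4) for i in range(4)]
p = parity(stripe)

# Healthy degraded read: reconstruction matches the lost block.
assert reconstruct(stripe, 2, p) == stripe[2]

# Write hole: block 0 is rewritten, but power fails before parity is updated.
stripe[0] = b"\xff" * 4          # data on disk 0 updated
# p is now stale -- nothing notices while all disks are online,
# because ordinary reads never touch the parity block.

# Now disk 2 fails: reconstruction mixes new data with old parity.
assert reconstruct(stripe, 2, p) != stripe[2]   # silently wrong block
```

This matches the point above: with all disks online the stale parity is harmless (and a scrub would quietly recompute it), but a disk failure before the scrub turns it into corrupted data.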
> > Am 13-Aug-2018 09:20:36 +0200 schrieb men...@gmail.com: > > Hi > > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :), > > there are contradicting opinions by the, well, "several" ways to check > > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of > > data. > > This array is running on kernel 4.17.3 and it definitely experienced > > power loss while data was being written. > > I can say that it wen through at least a dozen of unclear shutdown > > So following this thread I started my first scrub on the array. and > > this is the outcome (after having resumed it 4 times, two after a > > power loss...): > > > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/ > > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35 > > total bytes scrubbed: 2.59TiB with 0 errors > > > > So, there are 0 errors, but I don't understand why it says 2.59TiB of > > scrubbed data. Is it possible that also this values is crap, as the > > non zero counters for RAID5 array? > > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell > > ha scritto: > > > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote: > > > > I guess that covers most topics, two last questions: > > > > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ? > > > > > > Not really. It changes the probability distribution (you get an extra > > > chance to recover using a parity block in some cases), but there are > > > still cases where data gets lost that didn't need to be. > > > > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? > > > > > > There may be benefits of raid5 metadata, but they are small compared to > > > the risks. > > > > > > In some configurations it may not be possible to allocate the last > > > gigabyte of space. 
raid1 will allocate 1GB chunks from 2 disks at a > > > time while raid5 will allocate 1GB chunks from N disks at a time, and if > > > N is an odd number there could be one chunk left over in the array that > > > is unusable. Most users will find this irrelevant because a large disk > > > array that is filled to the last GB will become quite slow due to long > > > free space search and seek times--you really want to keep usage below 95%, > > > maybe 98% at most, and that means the last GB will never be needed. > > > > > > Reading raid5 metadata could theoretically be faster than raid1, but that > > > depends on a lot of variables, so you can't assume it as a rule of thumb. > > > > > > Raid6 metadata is more interesting because it's the only currently > > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately > > > that benefit is rather limited due to the write hole bug. > > > > > > There are patches floating around that implement multi-disk raid1 (i.e. 3 > > > or 4 mirror copies instead of just 2). This would be much better for > > > metadata than raid6--more flexible, more robust, and my guess is that > > > it will be faster as well (no need for RMW updates or journal seeks). > > > > > > > - > > > > FreeMail powered by mail.de - MEHR
Re: List of known BTRFS Raid 5/6 Bugs?
On Mon, Aug 13, 2018 at 09:20:22AM +0200, Menion wrote: > Hi > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :), > there are contradicting opinions by the, well, "several" ways to check > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of > data. > This array is running on kernel 4.17.3 and it definitely experienced > power loss while data was being written. > I can say that it wen through at least a dozen of unclear shutdown > So following this thread I started my first scrub on the array. and > this is the outcome (after having resumed it 4 times, two after a > power loss...): > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/ > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35 > total bytes scrubbed: 2.59TiB with 0 errors > > So, there are 0 errors, but I don't understand why it says 2.59TiB of > scrubbed data. Is it possible that also this values is crap, as the > non zero counters for RAID5 array? I just tested a quick scrub with injected errors on 4.18.0 and it looks like the garbage values are finally fixed (yay!). I never saw invalid values for 'total bytes' from raid5; however, scrub has (had?) trouble resuming, especially if the system was rebooted between cancel and resume, but sometimes just if the scrub had just been suspended too long (maybe if there are changes to the chunk tree...?). 55 hours for 2600 GB is just under 50GB per hour, which doesn't sound too unreasonable for btrfs, though it is known to be a bit slow compared to other raid5 implementations. > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell > ha scritto: > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote: > > > I guess that covers most topics, two last questions: > > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ? > > > > Not really. 
It changes the probability distribution (you get an extra > > chance to recover using a parity block in some cases), but there are > > still cases where data gets lost that didn't need to be. > > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? > > > > There may be benefits of raid5 metadata, but they are small compared to > > the risks. > > > > In some configurations it may not be possible to allocate the last > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a > > time while raid5 will allocate 1GB chunks from N disks at a time, and if > > N is an odd number there could be one chunk left over in the array that > > is unusable. Most users will find this irrelevant because a large disk > > array that is filled to the last GB will become quite slow due to long > > free space search and seek times--you really want to keep usage below 95%, > > maybe 98% at most, and that means the last GB will never be needed. > > > > Reading raid5 metadata could theoretically be faster than raid1, but that > > depends on a lot of variables, so you can't assume it as a rule of thumb. > > > > Raid6 metadata is more interesting because it's the only currently > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately > > that benefit is rather limited due to the write hole bug. > > > > There are patches floating around that implement multi-disk raid1 (i.e. 3 > > or 4 mirror copies instead of just 2). This would be much better for > > metadata than raid6--more flexible, more robust, and my guess is that > > it will be faster as well (no need for RMW updates or journal seeks). > > > > > - > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
Re: List of known BTRFS Raid 5/6 Bugs?
Running time of 55:06:35 indicates that the counter is right; that is not enough time to scrub the entire array using HDDs. 2TiB might be right if you only scrubbed one disc: "sudo btrfs scrub start /dev/sdx1" only scrubs the selected partition, whereas "sudo btrfs scrub start /media/storage/das1" scrubs the actual array. Use "sudo btrfs scrub status -d " to view per-disc scrubbing statistics and post the output. For live statistics, use "sudo watch -n 1". By the way: 0 errors despite multiple unclean shutdowns? I assumed that the write hole would corrupt parity the first time around, was I wrong? On 13-Aug-2018 09:20:36 +0200, men...@gmail.com wrote: > Hi > I have a BTRFS RAID5 array built on 5x8TB HDD filled with, well :), > there are contradicting opinions by the, well, "several" ways to check > the used space on a BTRFS RAID5 array, but I should be aroud 8TB of > data. > This array is running on kernel 4.17.3 and it definitely experienced > power loss while data was being written. > I can say that it wen through at least a dozen of unclear shutdown > So following this thread I started my first scrub on the array. and > this is the outcome (after having resumed it 4 times, two after a > power loss...): > > menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/ > scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc > scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35 > total bytes scrubbed: 2.59TiB with 0 errors > > So, there are 0 errors, but I don't understand why it says 2.59TiB of > scrubbed data. Is it possible that also this values is crap, as the > non zero counters for RAID5 array? > Il giorno sab 11 ago 2018 alle ore 17:29 Zygo Blaxell > ha scritto: > > > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote: > > > I guess that covers most topics, two last questions: > > > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ? > > > > Not really. 
It changes the probability distribution (you get an extra > > chance to recover using a parity block in some cases), but there are > > still cases where data gets lost that didn't need to be. > > > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? > > > > There may be benefits of raid5 metadata, but they are small compared to > > the risks. > > > > In some configurations it may not be possible to allocate the last > > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a > > time while raid5 will allocate 1GB chunks from N disks at a time, and if > > N is an odd number there could be one chunk left over in the array that > > is unusable. Most users will find this irrelevant because a large disk > > array that is filled to the last GB will become quite slow due to long > > free space search and seek times--you really want to keep usage below 95%, > > maybe 98% at most, and that means the last GB will never be needed. > > > > Reading raid5 metadata could theoretically be faster than raid1, but that > > depends on a lot of variables, so you can't assume it as a rule of thumb. > > > > Raid6 metadata is more interesting because it's the only currently > > supported way to get 2-disk failure tolerance in btrfs. Unfortunately > > that benefit is rather limited due to the write hole bug. > > > > There are patches floating around that implement multi-disk raid1 (i.e. 3 > > or 4 mirror copies instead of just 2). This would be much better for > > metadata than raid6--more flexible, more robust, and my guess is that > > it will be faster as well (no need for RMW updates or journal seeks). > > > > > - > > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT > > > - FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
Re: List of known BTRFS Raid 5/6 Bugs?
Hi
I have a BTRFS RAID5 array built on 5x8TB HDD, filled with, well :), there are contradicting opinions from the, well, "several" ways to check the used space on a BTRFS RAID5 array, but it should be around 8TB of data. This array is running on kernel 4.17.3 and it definitely experienced power loss while data was being written. I can say that it went through at least a dozen unclean shutdowns. So following this thread I started my first scrub on the array, and this is the outcome (after having resumed it 4 times, two after a power loss...):

menion@Menionubuntu:~$ sudo btrfs scrub status /media/storage/das1/
scrub status for 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
    scrub resumed at Sun Aug 12 18:43:31 2018 and finished after 55:06:35
    total bytes scrubbed: 2.59TiB with 0 errors

So, there are 0 errors, but I don't understand why it says 2.59TiB of scrubbed data. Is it possible that this value is also crap, like the non-zero counters for RAID5 arrays?

On Sat, 11 Aug 2018 at 17:29, Zygo Blaxell wrote: > > On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote: > > I guess that covers most topics, two last questions: > > > > Will the write hole behave differently on Raid 6 compared to Raid 5 ? > > Not really. It changes the probability distribution (you get an extra > chance to recover using a parity block in some cases), but there are > still cases where data gets lost that didn't need to be. > > > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? > > There may be benefits of raid5 metadata, but they are small compared to > the risks. > > In some configurations it may not be possible to allocate the last > gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a > time while raid5 will allocate 1GB chunks from N disks at a time, and if > N is an odd number there could be one chunk left over in the array that > is unusable. 
Most users will find this irrelevant because a large disk > array that is filled to the last GB will become quite slow due to long > free space search and seek times--you really want to keep usage below 95%, > maybe 98% at most, and that means the last GB will never be needed. > > Reading raid5 metadata could theoretically be faster than raid1, but that > depends on a lot of variables, so you can't assume it as a rule of thumb. > > Raid6 metadata is more interesting because it's the only currently > supported way to get 2-disk failure tolerance in btrfs. Unfortunately > that benefit is rather limited due to the write hole bug. > > There are patches floating around that implement multi-disk raid1 (i.e. 3 > or 4 mirror copies instead of just 2). This would be much better for > metadata than raid6--more flexible, more robust, and my guess is that > it will be faster as well (no need for RMW updates or journal seeks). > > > - > > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT > >
Re: List of known BTRFS Raid 5/6 Bugs?
On Sat, Aug 11, 2018 at 08:27:04AM +0200, erentheti...@mail.de wrote: > I guess that covers most topics, two last questions: > > Will the write hole behave differently on Raid 6 compared to Raid 5 ? Not really. It changes the probability distribution (you get an extra chance to recover using a parity block in some cases), but there are still cases where data gets lost that didn't need to be. > Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? There may be benefits of raid5 metadata, but they are small compared to the risks. In some configurations it may not be possible to allocate the last gigabyte of space. raid1 will allocate 1GB chunks from 2 disks at a time while raid5 will allocate 1GB chunks from N disks at a time, and if N is an odd number there could be one chunk left over in the array that is unusable. Most users will find this irrelevant because a large disk array that is filled to the last GB will become quite slow due to long free space search and seek times--you really want to keep usage below 95%, maybe 98% at most, and that means the last GB will never be needed. Reading raid5 metadata could theoretically be faster than raid1, but that depends on a lot of variables, so you can't assume it as a rule of thumb. Raid6 metadata is more interesting because it's the only currently supported way to get 2-disk failure tolerance in btrfs. Unfortunately that benefit is rather limited due to the write hole bug. There are patches floating around that implement multi-disk raid1 (i.e. 3 or 4 mirror copies instead of just 2). This would be much better for metadata than raid6--more flexible, more robust, and my guess is that it will be faster as well (no need for RMW updates or journal seeks). > - > FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
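The stranded-gigabyte effect described above can be illustrated with a toy greedy allocator. This is a sketch only: real btrfs chunk sizes vary and the actual allocator policy is more involved, but the arithmetic of "raid1 consumes raw space two gigabytes at a time" is the same.

```python
def raid1_chunks(free_gb):
    """Toy model of btrfs raid1 chunk allocation: each 1GB chunk takes
    1GB from each of the two disks with the most free space.
    Returns (chunks allocated, raw GB stranded)."""
    free = list(free_gb)
    chunks = 0
    while True:
        free.sort(reverse=True)
        if free[1] < 1:            # fewer than two disks with space left
            return chunks, sum(free)
        free[0] -= 1               # take 1GB from each of the two
        free[1] -= 1               # disks with the most free space
        chunks += 1

# Three equal 3GB disks: 9GB raw, but raid1 can only place 4 chunks,
# stranding the last gigabyte on one disk.
assert raid1_chunks([3, 3, 3]) == (4, 1)

# An even raw total divides cleanly, nothing is stranded.
assert raid1_chunks([4, 4, 4]) == (6, 0)
```

As the message notes, on a large array this last gigabyte is academic: performance degrades long before the array is filled to that point.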
Re: List of known BTRFS Raid 5/6 Bugs?
I guess that covers most topics, two last questions: Will the write hole behave differently on Raid 6 compared to Raid 5 ? Is there any benefit of running Raid 5 Metadata compared to Raid 1 ? - FreeMail powered by mail.de - MEHR SICHERHEIT, SERIOSITÄT UND KOMFORT
Re: List of known BTRFS Raid 5/6 Bugs?
On Sat, Aug 11, 2018 at 04:18:35AM +0200, erentheti...@mail.de wrote: > Write hole: > > > > The data will be readable until one of the data blocks becomes > > inaccessible (bad sector or failed disk). This is because it is only the > > parity block that is corrupted (old data blocks are still not modified > > due to btrfs CoW), and the parity block is only required when recovering > > from a disk failure. > > I am unsure about your meaning. > Assuming you perform an unclean shutdown (eg. crash), and after restart > perform a scrub, with no additional error (bad sector, bit-rot) before > or after the crash: > will you loose data? No, the parity blocks will be ignored and RAID5 will act like slow RAID0 if no other errors occur. > Will you be able to mount the filesystem like normal? Yes. > Additionaly, will the crash create additional errors like bad > sectors and or bit-rot aside from the parity-block corruption? No, only parity-block corruptions should occur. > Its actually part of my first mail, where the btrfs Raid5/6 page > assumes no data damage while the spinics comment implies the opposite. The above assumes no drive failures or data corruption; however, if this were the case, you could use RAID0 instead of RAID5. The only reason to use RAID5 is to handle cases where at least one block (or an entire disk) fails, so the behavior of RAID5 when all disks are working is almost irrelevant. A drive failure could occur at any time, so even if you mount successfully, if a disk fails immediately after, any stripes affected by write hole will be unrecoverably corrupted. > The write hole does not seem as dangerous if you could simply scrub > to repair damage (On smaller discs that is, where scrub doesnt take > enough time for additional errors to occur) Scrub can repair parity damage on normal data and metadata--it recomputes parity from data if the data passes a CRC check. 
No repair is possible for data in nodatasum files--the parity can be recomputed, but there is no way to determine if the result is correct. Metadata is always checksummed and transid verified; alas, there isn't an easy way to get btrfs to perform an urgent scrub on metadata only. > > Put another way: if all disks are online then RAID5/6 behaves like a slow > > RAID0, and RAID0 does not have the partial stripe update problem because > > all of the data blocks in RAID0 are independent. It is only when a disk > > fails in RAID5/6 that the parity block is combined with data blocks, so > > it is only in this case that the write hole bug can result in lost data. > > So data will not be lost if no drive has failed? Correct, but the array will have reduced failure tolerance, and RAID5 only matters when a drive has failed. It is effectively operating in degraded mode on parts of the array affected by write hole, and no single disk failure can be tolerated there. It is possible to recover the parity by performing an immediate scrub after reboot, but this cannot be as effective as a proper RAID5 update journal which avoids making the parity bad in the first place. > > > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable > > > > to the write hole, but data is. In this configuration you can determine > > > > with high confidence which files you need to restore from backup, and > > > > the filesystem will remain writable to replace the restored data, > > > > because > > > > raid1 does not have the write hole bug. > > In regards to my earlier questions, what would change if i do -draid5 -mraid1? Metadata would be using RAID1 which is not subject to the RAID5 write hole issue. It is much more tolerant of unclean shutdowns especially in degraded mode. Data in RAID5 may be damaged when the array is in degraded mode and a write hole occurs (in either order as long as both occur). 
Due to RAID1 metadata, the filesystem will continue to operate properly, allowing the damaged data to be overwritten or deleted. > Lost Writes: > > > Hotplugging causes an effect (lost writes) which can behave similarly > > to the write hole bug in some instances. The similarity ends there. > > Are we speaking about the same problem that is causing transid mismatch? Transid mismatch is usually caused by lost writes, by any mechanism that prevents a write from being completed after the disk reports that it was completed. Drives may report that data is "in stable storage", i.e. the drive believes it can complete the write in the future even if power is lost now because the drive or controller has capacitors or NVRAM or similar. If the drive is reset by the SATA host because of a cable disconnect event, the drive may forget that it has promised to do writes in the future. Drives may simply lie, and claim that data has been written to disk when the data is actually in volatile RAM and will disappear in a power failure. btrfs uses a transaction mechanism and CoW metadata to handle lost writes within an interrupted transaction.
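The scrub behaviour described in this message -- recompute parity only from data blocks that pass their checksums, with nodatasum data being unverifiable -- can be sketched as a toy model. Note the stand-ins: btrfs actually checksums 4KiB blocks with crc32c, while `zlib.crc32` is used here for convenience; none of this is btrfs code.

```python
import zlib
from functools import reduce

def xor_parity(blocks):
    """XOR parity across equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def scrub_parity(data_blocks, csums, parity_on_disk):
    """Toy model of scrub on one raid5 stripe.

    A csum entry of None marks nodatasum data: parity can still be
    recomputed, but there is no way to know the data it was computed
    from is correct.  Returns the parity scrub would leave on disk, or
    None when a data block fails its checksum (that case needs
    reconstruction from parity instead, not shown here)."""
    if any(c is not None and zlib.crc32(b) != c
           for b, c in zip(data_blocks, csums)):
        return None
    # Recomputed parity silently replaces parity_on_disk if it differs.
    return xor_parity(data_blocks)

data = [b"abcd", b"efgh", b"ijkl"]
csums = [zlib.crc32(b) for b in data]
stale = b"\x00" * 4                      # write-hole-damaged parity block
assert scrub_parity(data, csums, stale) == xor_parity(data)  # repaired
```

The "silently" in the comment mirrors the thread: scrub fixes bad parity without reporting it as an error, which is why a clean scrub after a crash is expected.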
Re: List of known BTRFS Raid 5/6 Bugs?
Write hole: > The data will be readable until one of the data blocks becomes > inaccessible (bad sector or failed disk). This is because it is only the > parity block that is corrupted (old data blocks are still not modified > due to btrfs CoW), and the parity block is only required when recovering > from a disk failure. I am unsure about your meaning. Assuming you perform an unclean shutdown (e.g. crash), and after restart perform a scrub, with no additional error (bad sector, bit-rot) before or after the crash: will you lose data? Will you be able to mount the filesystem like normal? Additionally, will the crash create additional errors like bad sectors and/or bit-rot aside from the parity-block corruption? It's actually part of my first mail, where the btrfs Raid5/6 page assumes no data damage while the spinics comment implies the opposite. The write hole does not seem as dangerous if you could simply scrub to repair damage (on smaller discs, that is, where scrub doesn't take enough time for additional errors to occur). > Put another way: if all disks are online then RAID5/6 behaves like a slow > RAID0, and RAID0 does not have the partial stripe update problem because > all of the data blocks in RAID0 are independent. It is only when a disk > fails in RAID5/6 that the parity block is combined with data blocks, so > it is only in this case that the write hole bug can result in lost data. So data will not be lost if no drive has failed? > > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable > > > to the write hole, but data is. In this configuration you can determine > > > with high confidence which files you need to restore from backup, and > > > the filesystem will remain writable to replace the restored data, because > > > raid1 does not have the write hole bug. In regards to my earlier questions, what would change if I do -draid5 -mraid1? 
Lost Writes: > Hotplugging causes an effect (lost writes) which can behave similarly > to the write hole bug in some instances. The similarity ends there. Are we speaking about the same problem that is causing transid mismatch? > They are really two distinct categories of problem. Temporary connection > loss can do bad things to all RAID profiles on btrfs (not just RAID5/6) > and the btrfs requirements for handling connection loss and write holes > are very different. What kind of bad things? Will scrub (1/10, 5/6) detect and repair it? > > > Hot-unplugging a device can cause many lost write events at once, and > > > each lost write event is very bad. > Transid mismatch is btrfs detecting data > that was previously silently corrupted by some component outside of btrfs. > > btrfs can't prevent disks from silently corrupting data. It can only > try to detect and repair the damage after the damage has occurred. Aside from the chance that all copies of data are corrupted, is there any way scrubbing could fail? > Normally RAID1/5/6/10 or DUP profiles are used for btrfs metadata, so any > transid mismatches can be recovered by reading up-to-date data from the > other mirror copy of the metadata, or by reconstructing the data with > parity blocks in the RAID 5/6 case. It is only after this recovery > mechanism fails (i.e. too many disks have a failure or corruption at > the same time on the same sectors) that the filesystem is ended. Does this mean that transid mismatch is harmless unless both copies are hit at once (and in case of Raid 6 all three)? Old hardware: > > > It's fun and/or scary to put known good and bad hardware in the same > > > RAID1 array and watch btrfs autocorrecting the bad data after every > > > other power failure; however, the bad hardware is clearly not sufficient > > > to implement any sort of reliable data persistence, and arrays with bad > > > hardware in them will eventually fail. I am having a hard time wrapping my head around this statement. 
If Btrfs can repair corrupted data and Raid 6 allows two disc failures at once without data loss, is using old discs, even with a high average error count, not still pretty much safe? You would simply have to repeat the scrubbing process more often to make sure that not enough data is corrupted to break redundancy.

> > > I have one test case where I write millions of errors into a raid5/6 and
> > > the filesystem recovers every single one transparently while verifying
> > > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > > just...beautiful.

Once again, if Btrfs is THIS good at repairing data, then are old hardware, hotplugging and maybe even (depending on whether I understood your point) the write hole really dangerous? Are there bugs that could destroy the data or filesystem without corrupting all copies of the data (or all copies at once)? Assuming Raid 6, corrupted data would not break redundancy and repeated scrubbing would fix any upcoming issue.

--
Re: List of known BTRFS Raid 5/6 Bugs?
On Fri, Aug 10, 2018 at 06:55:58PM +0200, erentheti...@mail.de wrote:
> Did I get you right?
> Please correct me if I am wrong:
>
> Scrubbing seems to have been fixed, you only have to run it once.

Yes. There is one minor bug remaining here: when scrub detects an error on any disk in a raid5/6 array, the error counts are garbage (random numbers on all the disks). You will need to inspect btrfs dev stats or the kernel log messages to learn which disks are injecting errors. This does not impair the scrubbing function, only the detailed statistics report (scrub status -d). If there are no errors, scrub correctly reports 0 for all error counts. Only raid5/6 is affected this way--other RAID profiles produce correct scrub statistics.

> Hotplugging (temporary connection loss) is affected by the write hole
> bug, and will create undetectable errors every 16 TB (crc32 limitation).

Hotplugging causes an effect (lost writes) which can behave similarly to the write hole bug in some instances. The similarity ends there. They are really two distinct categories of problem. Temporary connection loss can do bad things to all RAID profiles on btrfs (not just RAID5/6) and the btrfs requirements for handling connection loss and write holes are very different.

> The write Hole Bug can affect both old and new data.

Normally, only old data can be affected by the write hole bug. The "new" data is not committed before the power failure (otherwise we would call it "old" data), so any corrupted new data will be inaccessible as a result of the power failure. The filesystem will roll back to the last complete committed data tree (discarding all new and modified data blocks), then replay the fsync log (which repeats and completes some writes that occurred since the last commit). This process eliminates new data from the filesystem whether the new data was corrupted by the write hole or not.
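The advice above to inspect btrfs dev stats instead of scrub's per-device counters is easy to script. A sketch, assuming the line format "[/dev/sdX].counter  value" used by btrfs-progs of this era (verify against your local version):

```python
# Sketch: find which devices are injecting errors by parsing
# `btrfs dev stats` output, since scrub's own per-device counters
# are unreliable on raid5/6.  The assumed line format is
# "[/dev/sdX].counter  value" -- check your btrfs-progs version.
import re

STATS_LINE = re.compile(r'^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$')

def parse_dev_stats(text):
    """Return {device: {counter: value}} from `btrfs dev stats` output."""
    stats = {}
    for line in text.splitlines():
        m = STATS_LINE.match(line.strip())
        if m:
            stats.setdefault(m.group('dev'), {})[m.group('counter')] = int(m.group('value'))
    return stats

def suspect_devices(stats):
    """Devices with any nonzero error counter, sorted by name."""
    return sorted(dev for dev, counters in stats.items()
                  if any(v > 0 for v in counters.values()))

# Feed it the captured output of `btrfs dev stats <mountpoint>`:
sample = """\
[/dev/sda].write_io_errs    0
[/dev/sda].corruption_errs  0
[/dev/sdb].write_io_errs    12
[/dev/sdb].corruption_errs  3
"""
print(suspect_devices(parse_dev_stats(sample)))  # ['/dev/sdb']
```

A nonzero counter only tells you which device to investigate; the kernel log still has the details of each error event.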
Only corruptions that affect old data will remain, because old data is not overwritten by data saved in the fsync log, and old data is not part of the incomplete data tree that is rolled back after power failure. Exception: new data in nodatasum files can also be corrupted, but since nodatasum disables all data integrity and recovery features it's hard to define what "corrupted" means for a nodatasum file.

> Reason: BTRFS saves data in fixed size stripes, if the write operation
> fails midway, the stripe is lost.
> This does not matter much for Raid 1/10, data always uses a full stripe,
> and stripes are copied on write. Only new data could be lost.

This is incorrect. Btrfs saves data in variable-sized extents (between 1 and 32768 4K data blocks) and btrfs has no concept of stripes outside of its raid layer. Stripes are never copied. In RAID 1/10/DUP all data blocks are fully independent of each other, i.e. writing to any block on these RAID profiles does not corrupt data in any other block. As a result these RAID profiles do not allow old data to be corrupted by partially completed writes of new data. There is striping in some profiles, but it is only used for performance in these cases, and has no effect on data recovery.

> However, for some reason Raid 5/6 works with partial stripes, meaning
> that data is stored in stripes not completely filled by prior data,

In RAID 5/6 each data block is related to all other data blocks in the same stripe through the parity block(s). If any individual data block in the stripe is updated, the parity block(s) must also be updated atomically, or the wrong data will be reconstructed during RAID5/6 recovery. Because btrfs does nothing to prevent it, some writes will occur to RAID5/6 stripes that are already partially occupied by old data. btrfs also does nothing to ensure that parity block updates are atomic, so btrfs has the write hole bug as a result.

> and stripes are removed on write.

Stripes are never removed...?
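The non-atomic parity update described above is the classic read-modify-write (RMW) problem. A small illustration (not btrfs code) of why changing one data block in a raid5 stripe forces a matching parity write:

```python
# Illustration (not btrfs code): a partial-stripe update on raid5 is
# a read-modify-write.  Changing one data block forces a parity update,
# new_parity = old_parity ^ old_data ^ new_data, and both writes must
# complete together or the stripe becomes inconsistent.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# 3-disk raid5 stripe: two data blocks and one XOR parity block.
d0 = b'\xaa\xaa\xaa\xaa'
d1_old = b'\x55\x55\x55\x55'
parity_old = xor(d0, d1_old)

# Overwrite d1 in place (the RMW update):
d1_new = b'\x0f\x0f\x0f\x0f'
parity_new = xor(xor(parity_old, d1_old), d1_new)

# The parity invariant holds only if BOTH writes land:
assert parity_new == xor(d0, d1_new)
```

Nothing in this sequence makes the two writes atomic across two disks; that is exactly the gap a stripe journal would close.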
A stripe is just a group of disk blocks divided on 64K boundaries, same as mdadm and many hardware RAID5/6 implementations.

> Result: If the operation fails midway, the stripe is lost as is all
> data previously stored in it.

You can only lose as many data blocks in each stripe as there are parity disks (i.e. raid5 can lose 0 or 1 block, while raid6 can lose 0, 1, or 2 blocks); however, multiple writes can be lost affecting multiple stripes in a single power loss event. Losing even 1 block is often too much. ;)

The data will be readable until one of the data blocks becomes inaccessible (bad sector or failed disk). This is because it is only the parity block that is corrupted (old data blocks are still not modified due to btrfs CoW), and the parity block is only required when recovering from a disk failure.

Put another way: if all disks are online then RAID5/6 behaves like a slow RAID0, and RAID0 does not have the partial stripe update problem because all of the data blocks in RAID0 are independent. It is only when a disk fails in RAID5/6 that the parity block is combined with data blocks, so it is only in this case that the write hole bug can result in lost data.
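The point that parity is only consulted when a block goes missing can be shown with plain XOR parity (an illustration, not btrfs code):

```python
# Illustration: XOR parity is only consulted when a block is missing,
# so raid5 reads behave like raid0 until a disk fails -- and one
# parity block can rebuild exactly one missing block per stripe.

def xor(*blocks):
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

d0, d1, d2 = b'AAAA', b'BBBB', b'CCCC'
parity = xor(d0, d1, d2)

# All disks online: data blocks are read directly, parity is ignored.
# One disk fails (say the one holding d1): rebuild from the rest.
rebuilt = xor(d0, d2, parity)
assert rebuilt == d1

# Two missing blocks (d1 and d2) cannot both be recovered from this
# single parity block -- that is what raid6's second parity adds.
```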
Re: List of known BTRFS Raid 5/6 Bugs?
Did I get you right? Please correct me if I am wrong:

Scrubbing seems to have been fixed, you only have to run it once.

Hotplugging (temporary connection loss) is affected by the write hole bug, and will create undetectable errors every 16 TB (crc32 limitation).

The write Hole Bug can affect both old and new data.
Reason: BTRFS saves data in fixed size stripes; if the write operation fails midway, the stripe is lost.
This does not matter much for Raid 1/10: data always uses a full stripe, and stripes are copied on write. Only new data could be lost.
However, for some reason Raid 5/6 works with partial stripes, meaning that data is stored in stripes not completely filled by prior data, and stripes are removed on write.
Result: If the operation fails midway, the stripe is lost, as is all data previously stored in it.

Transid Mismatch can silently corrupt data.
Reason: It is a separate metadata failure that is triggered by lost or incomplete writes, writes that are lost somewhere during transmission. It can happen to all BTRFS configurations and is not triggered by the write hole. It could happen due to brown-out (temporary undersupply of voltage), faulty cables, faulty RAM, faulty disc cache, faulty discs in general.

Both bugs could damage metadata and trigger the following: Data will be lost (0 to 100% unreadable), the filesystem will be readonly.
Reason: BTRFS saves metadata as a tree structure. The closer the error is to the root, the more data cannot be read.

Transid Mismatch can happen up to once every 3 months per device, depending on the drive hardware!

Question: Does this not make transid mismatch way more dangerous than the write hole? What would happen to other filesystems, like ext4?

On 10-Aug-2018 09:12:21 +0200, ce3g8...@umail.furryterror.org wrote:
> > On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> > > I am searching for more information regarding possible bugs related to
> > > BTRFS Raid 5/6.
> > > All sites I could find are incomplete and information
> > > contradicts itself:
> > >
> > > The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> > > warns of the write hole bug, stating that your data remains safe
> > > (except data written during power loss, obviously) upon unclean shutdown
> > > unless your data gets corrupted by further issues like bit-rot, drive
> > > failure etc.
> >
> > The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
> > no mitigations to prevent or avoid it in mainline kernels.
> >
> > The write hole results from allowing a mixture of old (committed) and
> > new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
> > blocks consisting of one related data or parity block from each disk
> > in the array, such that writes to any of the data blocks affect the
> > correctness of the parity block and vice versa). If the writes were
> > not completed and one or more of the data blocks are not online, the
> > data blocks reconstructed by the raid5/6 algorithm will be corrupt.
> >
> > If all disks are online, the write hole does not immediately
> > damage user-visible data as the old data blocks can still be read
> > directly; however, should a drive failure occur later, old data may
> > not be recoverable because the parity block will not be correct for
> > reconstructing the missing data block. A scrub can fix write hole
> > errors if all disks are online, and a scrub should be performed after
> > any unclean shutdown to recompute parity data.
> >
> > The write hole always puts both old and new data at risk of damage;
> > however, due to btrfs's copy-on-write behavior, only the old damaged
> > data can be observed after power loss. The damaged new data will have
> > no references to it written to the disk due to the power failure, so
> > there is no way to observe the new damaged data using the filesystem.
> > Not every interrupted write causes damage to old data, but some will.
> > Two possible mitigations for the write hole are:
> >
> > - modify the btrfs allocator to prevent writes to partially filled
> > raid5/6 stripes (similar to what the ssd mount option does, except
> > with the correct parameters to match RAID5/6 stripe boundaries),
> > and advise users to run btrfs balance much more often to reclaim
> > free space in partially occupied raid stripes
> >
> > - add a stripe write journal to the raid5/6 layer (either in
> > btrfs itself, or in a lower RAID5 layer).
> >
> > There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
> > to btrfs or dramatically increase the btrfs block size) that also solve
> > the write hole problem but are somewhat more invasive and less practical
> > for btrfs.
> >
> > Note that the write hole also affects btrfs on top of other similar
> > raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
> > The btrfs CoW layer does not understand how to allocate data to avoid RMW
> > raid5 stripe updates without corrupting existing committed data, and this
> > limitation applies to every combination of unjournalled raid5/6 and btrfs.
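The first mitigation quoted above (never write into a partially filled stripe) amounts to an allocator rule: round every allocation up to a whole number of stripes. A toy model with hypothetical geometry constants, not the actual btrfs allocator:

```python
# Toy model of the first mitigation: an allocator that only hands out
# whole raid5/6 stripes, so no write ever read-modify-writes a stripe
# holding committed data.  The geometry constants are hypothetical.

STRIPE_DATA_BLOCKS = 4   # e.g. 5-disk raid5: 4 data blocks + 1 parity
BLOCK_SIZE = 4096

def stripe_aligned_blocks(nbytes):
    """Data blocks to allocate so the write covers whole stripes only."""
    blocks = -(-nbytes // BLOCK_SIZE)            # ceil: bytes -> blocks
    stripes = -(-blocks // STRIPE_DATA_BLOCKS)   # ceil: blocks -> stripes
    return stripes * STRIPE_DATA_BLOCKS

# A 5 KiB write needs 2 blocks but claims a full 4-block stripe; the
# slack blocks stay unusable until a balance reclaims them, which is
# why this approach needs much more frequent btrfs balance runs.
assert stripe_aligned_blocks(5 * 1024) == 4
assert stripe_aligned_blocks(17 * 4096) == 20
```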
Re: List of known BTRFS Raid 5/6 Bugs?
On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> I am searching for more information regarding possible bugs related to
> BTRFS Raid 5/6. All sites I could find are incomplete and information
> contradicts itself:
>
> The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> warns of the write hole bug, stating that your data remains safe
> (except data written during power loss, obviously) upon unclean shutdown
> unless your data gets corrupted by further issues like bit-rot, drive
> failure etc.

The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are no mitigations to prevent or avoid it in mainline kernels.

The write hole results from allowing a mixture of old (committed) and new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of blocks consisting of one related data or parity block from each disk in the array, such that writes to any of the data blocks affect the correctness of the parity block and vice versa). If the writes were not completed and one or more of the data blocks are not online, the data blocks reconstructed by the raid5/6 algorithm will be corrupt.

If all disks are online, the write hole does not immediately damage user-visible data as the old data blocks can still be read directly; however, should a drive failure occur later, old data may not be recoverable because the parity block will not be correct for reconstructing the missing data block. A scrub can fix write hole errors if all disks are online, and a scrub should be performed after any unclean shutdown to recompute parity data.

The write hole always puts both old and new data at risk of damage; however, due to btrfs's copy-on-write behavior, only the old damaged data can be observed after power loss. The damaged new data will have no references to it written to the disk due to the power failure, so there is no way to observe the new damaged data using the filesystem.
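The failure sequence described above can be condensed into a toy timeline (pure illustration with XOR parity, not btrfs code): the interrupted write leaves stale parity, all reads still succeed, and the damage only surfaces when a disk fails later.

```python
# Toy timeline of the write hole (illustration only).  A power loss
# lands the new data block but not the matching parity update.  Every
# block still reads back fine -- until a disk fails and reconstruction
# has to trust the stale parity.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Committed stripe on a 3-disk raid5: old data plus matching parity.
d0_old, d1 = b'OLD.', b'KEEP'
disk0, disk1, disk_p = d0_old, d1, xor(d0_old, d1)

# Uncommitted RMW update of d0, interrupted by power loss: the data
# write reached disk0, but the parity write never happened.
disk0 = b'NEW.'

# All disks online: the committed old data d1 still reads correctly.
assert disk1 == d1

# Later, the disk holding d1 fails.  Reconstruction from disk0 and the
# stale parity returns garbage: committed old data is now lost.
assert xor(disk0, disk_p) != d1
```

A scrub run while all disks were still online would have recomputed disk_p from disk0 and disk1, closing the window before the failure.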
Not every interrupted write causes damage to old data, but some will.

Two possible mitigations for the write hole are:

- modify the btrfs allocator to prevent writes to partially filled raid5/6 stripes (similar to what the ssd mount option does, except with the correct parameters to match RAID5/6 stripe boundaries), and advise users to run btrfs balance much more often to reclaim free space in partially occupied raid stripes

- add a stripe write journal to the raid5/6 layer (either in btrfs itself, or in a lower RAID5 layer).

There are assorted other ideas (e.g. copy the RAID-Z approach from zfs to btrfs or dramatically increase the btrfs block size) that also solve the write hole problem but are somewhat more invasive and less practical for btrfs.

Note that the write hole also affects btrfs on top of other similar raid5/6 implementations (e.g. mdadm raid5 without stripe journal). The btrfs CoW layer does not understand how to allocate data to avoid RMW raid5 stripe updates without corrupting existing committed data, and this limitation applies to every combination of unjournalled raid5/6 and btrfs.

> The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> warns of possible incorrigible "transid" mismatch, not stating which
> versions are affected or what transid mismatch means for your data. It
> does not mention the write hole at all.

Neither raid5 nor the write hole is required to produce a transid mismatch failure. A transid mismatch usually occurs due to a lost write. Write hole is a specific case of lost write, but write hole does not usually produce transid failures (it produces header or csum failures instead). During real disk failure events, multiple distinct failure modes can occur concurrently, i.e. both transid failure and write hole can occur at different places in the same filesystem as a result of attempting to use a failing disk over a long period of time.

A transid verify failure is metadata damage.
It will make the filesystem readonly and make some data inaccessible as described below.

> This Mail Archive
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> but may corrupt your Metadata while trying to do so - meaning you have
> to scrub twice in a row to ensure data integrity.

Simple corruption (without write hole errors) is fixed by scrubbing as of the last...at least six months? Kernel v4.14.xx and later can definitely do it these days. Both data and metadata.

If the metadata is damaged in any way (corruption, write hole, or transid verify failure) on btrfs and btrfs cannot use the raid profile for metadata to recover the damaged data, the filesystem is usually forever readonly, and anywhere from 0 to 100% of the filesystem may be readable depending on where in the metadata tree structure the error occurs (the closer to the root of the tree, the more data becomes inaccessible).